Title: | A toolkit for performing KNN-based statistics for flow and mass cytometry data |
---|---|
Description: | This package performs k-nearest neighbor based statistics and visualizations on flow and mass cytometry data. This gives tSNE maps "fold change" functionality and provides a data quality metric by assessing manifold overlap between fcs files expected to be the same. Other applications of this package include imputation, marker redundancy, and testing the relative information loss of lower dimension embeddings compared to the original manifold. |
Authors: | Tyler J Burns |
Maintainer: | Tyler J Burns <[email protected]> |
License: | Artistic-2.0 |
Version: | 1.27.0 |
Built: | 2024-10-31 04:48:38 UTC |
Source: | https://github.com/bioc/Sconify |
This function gives the user the option to add t-SNE to the final output, using the same input features used in KNN (e.g., surface markers) as input for t-SNE.
AddTsne(dat, input)
dat |
matrix of cells by features that contains all features needed for tSNE analysis |
input |
the features to be used as input for tSNE, usually the same as those used for knn generation |
result: dat, with tSNE1 and tSNE2 attached
The post-SCONE output from the Bodenmiller-Zunder dataset pair of fcs files, one untreated and one treated with GM-CSF. We ran this on 10,000 cells and subsampled to 1000 for this vignette.
bz.gmcsf.final
A tibble of 1000 cells by 69 features. This includes all the original parameters, the KNN-generated comparisons, differential abundance ("fraction.cond.2"), and two t-SNE coordinates.
The post-SCONE output from the per-marker quantile-normalized and z-scored Bodenmiller-Zunder dataset pair of fcs files, one untreated and one treated with GM-CSF. We ran this on 10,000 cells and subsampled to 1000 for this package.
bz.gmcsf.final.norm.scale
A tibble of 1000 cells by 69 features. This includes all the original parameters, the KNN-generated comparisons, differential abundance ("fraction.cond.2"), and two t-SNE coordinates.
This function is a quick way to take the exprs content of an fcs file, apply an asinh transform, and create a tibble that can be further manipulated. The default transform is asinh; setting transform to anything else returns the raw data. This function is used in the main function ProcessMultipleFiles.
FcsToTibble(file, transform = "asinh")
file |
the fcs file containing cell information |
transform |
if set to "asinh", applies an asinh transform with scale argument 5 |
tibble of info contained within the fcs file
file <- system.file("extdata", "Bendall_et_al_Cell_Sample_C_basal.fcs", package = "Sconify")
FcsToTibble(file)
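The transform step can be illustrated in base R. This is a minimal sketch, assuming that "scale arg 5" means each value is divided by 5 before asinh is applied; `AsinhTransform` and the toy matrix `raw` are hypothetical and not part of the package.

```r
# Hypothetical helper: asinh transform with a cofactor (assumed to be 5).
AsinhTransform <- function(x, cofactor = 5) {
  asinh(x / cofactor)
}

# Toy matrix standing in for the exprs content of an fcs file.
raw <- matrix(c(0, 5, 50, 500), nrow = 2,
              dimnames = list(NULL, c("CD3", "CD19")))
transformed <- AsinhTransform(raw)
transformed
```

Values near zero stay close to linear, while large values are compressed roughly logarithmically.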
This function is a wrapper around FNN package functionality to speed up the KNN process. It uses KD trees by default, with k set to 100. The best choice of k will vary with the dataset. See k.selection.R.
Fnn(cell.df, input.markers, k = 100)
cell.df |
the cell data frame used as input |
input.markers |
markers to be used as input for knn |
k |
the number of nearest neighbors to identify |
nn: a list of 2. nn.index: index of the knn (columns) for each cell (rows). nn.dist: Euclidean distance of each k-nearest neighbor.
Fnn(wand.combined, input.markers)
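To make the shape of the returned list concrete, here is a brute-force base R sketch that produces the same two elements. It is only an illustration: the package relies on FNN's KD trees, and `BruteForceKnn` and the toy `cells` matrix are hypothetical.

```r
# Hypothetical brute-force KNN; far slower than a KD tree on real data.
BruteForceKnn <- function(mat, k) {
  d <- as.matrix(dist(mat))   # pairwise Euclidean distances
  diag(d) <- Inf              # a cell is not its own neighbor
  n <- nrow(d)
  idx <- lapply(seq_len(n), function(i) order(d[i, ])[seq_len(k)])
  nn.index <- matrix(unlist(idx), ncol = k, byrow = TRUE)
  nn.dist <- matrix(unlist(lapply(seq_len(n), function(i) d[i, idx[[i]]])),
                    ncol = k, byrow = TRUE)
  list(nn.index = nn.index, nn.dist = nn.dist)
}

set.seed(1)
cells <- matrix(rnorm(40), nrow = 20)   # 20 toy cells, 2 markers
nn <- BruteForceKnn(cells, k = 3)
dim(nn$nn.index)                        # cells (rows) by k (columns)
```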
These are the markers that will be used in the KNN comparisons, as opposed to the KNN generation.
funct.markers
A vector of strings.
Obtains a density estimate derived from the original manifold, avoiding the lossiness of lower-dimensional embeddings.
GetKnnDe(nn.matrix)
nn.matrix |
A list of 2, where the first is a matrix of nn indices, and the second is a matrix of nn distances |
a vector where each element is the KNN-DE for that given cell, ordered by row number, in the original input matrix of cells x features
ex.knn <- Fnn(wand.combined, input.markers, k = 30)
GetKnnDe(ex.knn)
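The exact formula behind the KNN-DE is not stated here. As a rough sketch of the idea, one simple estimate (an assumption, not necessarily what GetKnnDe computes) is the inverse of the mean distance to a cell's k nearest neighbors, so that cells in crowded regions score higher. `KnnDensitySketch` and `toy.nn` are hypothetical.

```r
# Hypothetical KNN density estimate: inverse of the mean neighbor distance.
# nn is a list of 2: a matrix of nn indices and a matrix of nn distances.
KnnDensitySketch <- function(nn) {
  1 / rowMeans(nn[[2]])
}

# Toy neighbor structure for 3 cells with k = 2.
toy.nn <- list(nn.index = matrix(c(2, 1, 1, 3, 3, 2), nrow = 3),
               nn.dist  = matrix(c(1, 1, 4, 3, 3, 4), nrow = 3))
de <- KnnDensitySketch(toy.nn)
de
```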
This is a quick way to get a list of preferred marker names. This outputs a csv file containing all markers in the dataset in the name format that will be recognized by downstream functions. You manually alter this list to remove and/or categorize the markers. The file can then be read in (stringsAsFactors = FALSE) to give you the marker list of interest. In particular, name the top of the column "markers" if you're just altering the list. If you're going to divide it into static and functional markers, produce two columns and name them accordingly.
GetMarkerNames(file)
file |
the fcs file of interest |
the list of markers of interest. This is to be written as a csv
file <- system.file("extdata", "Bendall_et_al_Cell_Sample_C_basal.fcs", package = "Sconify")
GetMarkerNames(file)
This function takes as input the markers to be imputed from a pre-existing KNN computation.
Impute(cells, input.markers, nn)
cells |
the input matrix of cells |
input.markers |
the markers the user wants to impute |
nn |
the matrix of k-nearest neighbors (derived perhaps NOT from the "input markers" above) |
a data frame of imputed cells for the "input markers" of interest
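The idea of KNN-based imputation can be sketched in base R. This sketch assumes that a cell's imputed value is the median of that marker across its nearest neighbors; the summary statistic used by Impute is not stated here. `ImputeSketch` and the toy data are hypothetical.

```r
# Hypothetical imputation: each cell's marker values are replaced by the
# median of those markers across its k nearest neighbors (an assumption).
ImputeSketch <- function(cells, markers, nn.index) {
  rows <- lapply(seq_len(nrow(nn.index)), function(i) {
    neighbors <- as.matrix(cells[nn.index[i, ], markers, drop = FALSE])
    apply(neighbors, 2, median)
  })
  as.data.frame(do.call(rbind, rows))
}

# Toy data: 4 cells, 2 markers, k = 2 neighbor indices per cell.
cells <- data.frame(pSTAT5 = c(1, 2, 3, 10), pS6 = c(0, 0, 4, 4))
nn.index <- matrix(c(2, 1, 2, 3,
                     3, 3, 4, 2), nrow = 4)
imputed <- ImputeSketch(cells, c("pSTAT5", "pS6"), nn.index)
imputed
```

Because the neighbors can come from a KNN computed on other markers, the imputed values show how well one marker set predicts another.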
Tests the Euclidean distance error for imputation using knn and markers of interest.
ImputeTesting(k.titration, cells, input.markers, test.markers)
k.titration |
a vector of integer values of k to be tested |
cells |
a matrix of cells by features used as original input |
input.markers |
markers to be used for the knn calculation |
test.markers |
the markers to be tested for imputation (either surface or scone) |
the median imputation error for each value k tested
ImputeTesting(k.titration = c(10, 20), cells = wand.combined, input.markers = input.markers, test.markers = funct.markers)
These are the markers that KNN generation will be done on for the Wanderlust dataset. These are mostly surface markers. These are the same markers one would use as input for clustering or t-SNE generation, for example, as they are not expected to change over the course of the brief IL7 stimulation.
input.markers
A vector of strings corresponding to the markers.
Takes all p values from the data and does a log10 transform for easier visualization.
LogTransformQ(dat, negative)
dat |
tibble containing cells x features, with original expression, p values, and raw change |
negative |
boolean value to determine whether to multiply transformed p values by -1 |
result: tibble of cells x features with all p values log10 transformed
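A minimal base R sketch of this transform follows. It assumes the relevant columns can be recognized by a "qvalue" suffix, which is an illustrative naming scheme; `LogTransformSketch` and the toy data frame are hypothetical.

```r
# Hypothetical log10 transform of all columns whose names end in "qvalue".
LogTransformSketch <- function(dat, negative = TRUE) {
  q.cols <- grep("qvalue$", names(dat))
  multiplier <- if (negative) -1 else 1
  dat[q.cols] <- lapply(dat[q.cols], function(q) multiplier * log10(q))
  dat
}

toy <- data.frame(pSTAT5.change = c(0.4, -0.1),
                  pSTAT5.qvalue = c(0.001, 1))
result <- LogTransformSketch(toy, negative = TRUE)
result
```

With negative = TRUE, smaller q values map to larger positive numbers, which is usually easier to read on a color scale.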
Makes a histogram of the data that is inputted
MakeHist(dat, k, column.label, x.label)
dat |
tibble consisting both of original markers and the appended values from scone |
k |
the binwidth, set to 1/k |
column.label |
the label in the tibble's columns the function will search for |
x.label |
the label that the x axis will be labeled as |
a histogram of said vector in ggplot2 form
MakeHist(wand.final, 100, "IL7.fraction.cond.2", "fraction IL7")
Takes the KNN function output and the cell data, and makes a list where each element is a matrix of cells in the KNN and features.
MakeKnnList(cell.data, nn.matrix)
cell.data |
tibble of cells by features |
nn.matrix |
list of 2. First element is cells x 100 nearest neighbor indices. Second element is cells x 100 nearest neighbor distances |
a list where each element is the cell number from the original cell.data tibble and a matrix of cells x features for its KNN
ex.knn <- Fnn(wand.combined, input.markers, k = 30)
knn.list <- MakeKnnList(wand.combined, ex.knn)
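The structure of such a list can be sketched in base R: one element per cell, each holding the rows of that cell's neighbors. `KnnListSketch` and the toy inputs are hypothetical, and the package's actual element layout may differ.

```r
# Hypothetical sketch: one element per cell, holding its neighbors' rows.
KnnListSketch <- function(cell.data, nn.index) {
  lapply(seq_len(nrow(nn.index)), function(i) {
    cell.data[nn.index[i, ], , drop = FALSE]
  })
}

cell.data <- data.frame(CD3 = c(1, 2, 3),
                        condition = c("basal", "IL7", "IL7"))
nn.index <- matrix(c(2, 1, 1, 3, 3, 2), nrow = 3)   # k = 2
knn.list <- KnnListSketch(cell.data, nn.index)
knn.list[[1]]   # neighbors of cell 1 are cells 2 and 3
```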
Both the surface and functional markers for the Wanderlust dataset
markers
a tibble with two columns, "surface" and "functional."
Just a random musing
MeaningOfLife()
A string containing a random musing
MeaningOfLife()
This function is for the case in which multiple donors are being compared against each other within the k-nearest neighborhood of interest. The mean value of the markers of interest is calculated across the donors, such that each data point for the subsequent t-test represents a marker for a donor.
MultipleDonorStatistics(basal, stim, stim.name, donors)
basal |
tibble that contains unstim for a knn including donor identity |
stim |
tibble that contains stim for a knn including donor identity |
stim.name |
string of the name of the current stim being tested |
donors |
vector of strings corresponding to the designated names of the donors |
result: a named vector of p values (soon to be q values) from the t test done on each marker
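The per-donor averaging described above can be sketched in base R for a single marker. This sketch assumes an unpaired Welch t-test on the donor means; whether the package pairs donors is not stated here. `DonorTestSketch`, the `donor` column name, and the toy data are hypothetical.

```r
# Hypothetical sketch: average a marker per donor, then t-test the means.
DonorTestSketch <- function(basal, stim, marker) {
  basal.means <- tapply(basal[[marker]], basal$donor, mean)
  stim.means <- tapply(stim[[marker]], stim$donor, mean)
  donors <- intersect(names(basal.means), names(stim.means))
  # Each data point entering the test is one donor's mean for this marker.
  t.test(as.vector(basal.means[donors]),
         as.vector(stim.means[donors]))$p.value
}

donor <- rep(c("d1", "d2", "d3"), each = 2)
basal <- data.frame(donor = donor, pSTAT5 = c(0.1, 0.3, 0.2, 0.4, 0.0, 0.2))
stim <- data.frame(donor = donor, pSTAT5 = c(1.1, 1.3, 1.2, 1.4, 1.0, 1.2))
p <- DonorTestSketch(basal, stim, "pSTAT5")
p
```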
This occurs after the user has modified the markers.csv file to determine which markers are to be used as input for KNN and which markers are to be used for within-knn comparisons
ParseMarkers(marker.file)
marker.file |
modified markers.csv file, now containing two columns: the left column contains the KNN input markers, and the right column contains the KNN comparison markers |
a list of 2 vectors of strings. The first element, labeled "input", is a vector of KNN input markers. The second element, labeled "functional", contains the markers to be used in the KNN-based comparisons
file <- system.file("extdata", "markers.csv", package = "Sconify")
ParseMarkers(file)
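For orientation, a two-column marker file can be parsed in base R roughly as follows. This is a sketch under assumptions: the column headers "input" and "functional" and the marker names are illustrative, and the shorter column is assumed to be padded with empty cells.

```r
# Hypothetical contents of a modified markers.csv (headers are assumed).
csv.text <- "input,functional
CD3,pSTAT5
CD19,pS6
CD45,"

marker.table <- read.csv(text = csv.text, stringsAsFactors = FALSE)

# Drop the empty padding cells so each element holds only real markers.
parsed <- lapply(marker.table, function(col) col[!is.na(col) & col != ""])
parsed$input
parsed$functional
```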
Performs final processing and transformations on the scone data
PostProcessing(scone.output, cell.data, input, tsne = TRUE, log.transform.qvalue = TRUE)
scone.output |
tibble of the output of the given scone analysis |
cell.data |
the tibble used as input for the SconeValues function |
input |
the input markers used for the knn calculation (to be used for tsne here) |
tsne |
boolean value to indicate whether tSNE is to be done |
log.transform.qvalue |
boolean to indicate whether log transformation of all q values is to be done |
result: the concatenated original input data with the scone derived data, with the option of the q values being inverse log10 transformed, and two additional tSNE columns being added to the data (from the Rtsne package)
PostProcessing(wand.scone, wand.combined, input.markers, tsne = FALSE)
This is tailored to a very specific file format for unstim/stim. Files need the following name convention: "xxxx_stim.fcs". Files where you want to name the patients need the following convention: "xxxx__patientID_stim.fcs".
ProcessMultipleFiles(files, transform = "asinh", numcells = 10000, norm = FALSE, scale = FALSE, input, name.multiple.donors = FALSE)
files |
a vector of file names (name = "anything_condition.fcs") |
transform |
set to asinh if you want to do an asinh transform of all markers in the dataset |
numcells |
desired number of cells in the matrix, set at 10k |
norm |
boolean that quantile normalizes the data if true |
scale |
boolean that converts all values to z scores if true |
input |
the static markers that will be used downstream in knn computation. These are included here to include the option of per-marker quantile normalization, in the event norm is set to TRUE |
name.multiple.donors |
boolean indicating whether multiple donors will be distinguished (as a separate "patient" column) |
result: a combined file set
file1 <- system.file("extdata", "Bendall_et_al_Cell_Sample_C_basal.fcs", package = "Sconify")
file2 <- system.file("extdata", "Bendall_et_al_Cell_Sample_C_IL7.fcs", package = "Sconify")
ProcessMultipleFiles(c(file1, file2), input = input.markers)
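The file name convention can be illustrated with a small base R parser. This is a sketch of how a condition and an optional patient ID could be read from names of the form "xxxx_stim.fcs" and "xxxx__patientID_stim.fcs"; `ParseFileName` and the second file name are hypothetical, and the package's own parsing may differ.

```r
# Hypothetical parser for the documented file name convention.
ParseFileName <- function(path) {
  stem <- sub("\\.fcs$", "", basename(path))
  condition <- sub("^.*_", "", stem)        # text after the last underscore
  patient <- if (grepl("__", stem, fixed = TRUE)) {
    sub("^.*__(.*)_[^_]*$", "\\1", stem)    # text between "__" and last "_"
  } else {
    NA_character_
  }
  c(condition = condition, patient = patient)
}

single <- ParseFileName("Bendall_et_al_Cell_Sample_C_basal.fcs")
multi <- ParseFileName("run1__donor2_IL7.fcs")
single
multi
```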
We make far more comparisons across k-nearest neighborhoods than we would with disjoint subsetting, which increases the likelihood that some statistically significant differences occur by chance. This step corrects for that.
QCorrectionThresholding(cells, threshold)
cells |
tibble of change values, p values, and fraction condition 2 |
threshold |
a q value below which the change values will be reported for that cell for that param. If no thresholding is desired, this is set to 1. |
inputted p values, adjusted and therefore described as "q values"
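The correction and thresholding can be sketched in base R for one marker. Two assumptions are made here: the adjustment method is Benjamini-Hochberg (the method used by the package is not stated in this section), and change values that fail the threshold are set to 0. `QCorrectSketch` is hypothetical.

```r
# Hypothetical sketch: adjust p values, then zero out changes above threshold.
QCorrectSketch <- function(change, p, threshold = 0.05) {
  q <- p.adjust(p, method = "BH")   # assumed adjustment method
  change[q > threshold] <- 0        # keep only changes that pass
  data.frame(change = change, qvalue = q)
}

res <- QCorrectSketch(change = c(1.2, 0.3, -0.8),
                      p = c(0.001, 0.04, 0.5))
res
```

Setting threshold to 1 leaves every change value in place, since no adjusted value can exceed 1.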
Credit goes to: http://davetang.org/muse/2014/07/07/quantile-normalisation-in-r/ for this function
QuantNormalize(df)
df |
a data frame with rows as cells and columns as features |
a data frame where the columns have been quantile normalized
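Quantile normalization itself can be sketched in a few lines of base R: rank each column, average the sorted columns row by row, and map every rank back to the corresponding average. This follows the general rank-and-average approach and is not necessarily identical to the package's code; `QuantNormSketch` and the toy data are hypothetical.

```r
# Hypothetical sketch of quantile normalization across columns.
QuantNormSketch <- function(df) {
  ranks <- apply(df, 2, rank, ties.method = "min")
  sorted <- apply(df, 2, sort)
  row.means <- rowMeans(sorted)       # mean of each quantile across columns
  normalized <- apply(ranks, 2, function(r) row.means[r])
  as.data.frame(normalized)
}

toy <- data.frame(A = c(5, 2, 3, 4),
                  B = c(4, 1, 4, 2),
                  C = c(3, 4, 6, 8))
qn <- QuantNormSketch(toy)
qn
```

After normalization, columns without ties share exactly the same set of values, which is what forces the distributions to match.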
This function performs per-marker quantile normalization on multiple data tibbles. The normalization occurs marker by marker. The user assumes that the markers are distributed equally across tibbles, as quantile normalization forces these marker distributions to be the same per file.
QuantNormalizeElements(dat.list)
dat.list |
a list of tibbles |
the per-column quantile normalized list
basal <- wand.combined[wand.combined$condition == "basal",][,1:10]
il7 <- wand.combined[wand.combined$condition == "IL7",][,1:10]
QuantNormalizeElements(list(basal, il7))
This function performs the statistics across the nearest neighborhoods, and is one of the main workhorses within the SconeValues function.
RunStatistics(basal, stim, fold = "median", stat.test = "mwu", stim.name)
basal |
tibble of cells corresponding to the unstimulated condition |
stim |
a tibble of cells corresponding to the stimulated condition |
fold |
a string that specifies the use of "median" or "mean" when calculating fold change |
stat.test |
a string that specifies Mann-Whitney U test (mwu) or T test (t) for q value calculation |
stim.name |
a string corresponding to the name of the stim being tested compared to basal |
result: a named vector corresponding to the results of the "fold change" and mann-whitney u test
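For a single marker within one neighborhood, the comparison can be sketched in base R. This sketch assumes the "fold change" is the difference between the stim and basal medians (or means) on the transformed scale, and it uses the normal approximation of the Mann-Whitney U test; the output names are illustrative. `StatsSketch` is hypothetical.

```r
# Hypothetical sketch of the per-neighborhood comparison for one marker.
StatsSketch <- function(basal, stim, fold = "median", stim.name = "stim") {
  center <- if (fold == "median") median else mean
  change <- center(stim) - center(basal)   # assumed definition of change
  p <- wilcox.test(basal, stim, exact = FALSE)$p.value
  setNames(c(change, p), paste(stim.name, c("change", "pvalue"), sep = "."))
}

basal <- c(0.1, 0.2, 0.3, 0.4, 0.5)
stim <- c(1.1, 1.2, 1.3, 1.4, 1.5)
res <- StatsSketch(basal, stim, stim.name = "IL7")
res
```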
This function is run following the KNN computation and respective cell grouping. The function also contains a progress ticker that allows one to determine how much time is left in this function.
SconeValues(nn.matrix, cell.data, scone.markers, unstim, threshold = 0.05, fold = "median", stat.test = "mwu", multiple.donor.compare = FALSE)
nn.matrix |
a matrix of cell index by nearest neighbor index, with values being cell index of the nth nearest neighbor |
cell.data |
tibble of cells by features |
scone.markers |
vector of all markers to be interrogated via statistical testing |
unstim |
an object (used so far: string, number) specifying the "basal" condition |
threshold |
a number indicating the p value the raw change should be thresholded by. |
fold |
a string that specifies the use of "median" or "mean" when calculating fold change |
stat.test |
string denoting Mann Whitney U test ("mwu") or T test ("t") |
multiple.donor.compare |
a boolean that indicates whether t test across multiple donors should be done |
result: tibble of raw changes and p values for each feature of interest, and fraction of cells with condition 2
ex.nn <- Fnn(wand.combined, input.markers)
SconeValues(ex.nn, wand.combined, funct.markers, "basal")
This is meant to serve as a control for the basic "unstim" and "stim" pipeline that is generally used within this package. If phosphoproteins are being compared across conditions, for example, then there should be no difference in the case that you split the same file and compare the two halves.
SplitFile(file, transform = "asinh", numcells = 10000, norm = FALSE, scale = FALSE, input.markers)
file |
the file we're going to split |
transform |
if set to asinh, performs asinh transformation on all markers of the dataset |
numcells |
the number of total cells to be subsampled to, set at 10k for default |
norm |
boolean of whether data should be quantile normalized |
scale |
boolean of whether data should be z-scored |
input.markers |
vector of strings indicating the markers to be used as input |
tibble containing original markers and all values calculated by SCONE
file <- system.file("extdata", "Bendall_et_al_Cell_Sample_C_basal.fcs", package = "Sconify")
SplitFile(file, input.markers = input.markers)
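The control idea, splitting one file into two pseudo-conditions that should show no difference, can be sketched in base R. This sketch assumes cells are assigned to the two halves at random; `SplitSketch` and the labels "half1" and "half2" are hypothetical.

```r
# Hypothetical sketch: randomly label the cells of one file as two halves.
SplitSketch <- function(cells) {
  labels <- rep(c("half1", "half2"), length.out = nrow(cells))
  cells$condition <- sample(labels)   # random assignment to a half
  cells
}

set.seed(42)
toy <- data.frame(pSTAT5 = rnorm(100))
split.cells <- SplitSketch(toy)
table(split.cells$condition)
```

Running the usual unstim/stim comparison on the two halves gives a baseline for how much apparent change arises by chance alone.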
Takes a vector of strings and outputs simple numbers. This takes care of the case where conditions are listed as strings (basal, IL7), in which case they are converted to numbers (1, 2)
StringToNumbers(strings)
strings |
vector of strings |
strings: same vector with each unique element converted to a number
ex.string <- c("unstim", "unstim", "stim", "stim", "stim")
StringToNumbers(ex.string)
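The conversion can be sketched in base R with `match`. This sketch assumes numbers are assigned in order of first appearance, so the first condition seen becomes 1; `StringToNumbersSketch` is hypothetical.

```r
# Hypothetical sketch: number each unique string by order of first appearance.
StringToNumbersSketch <- function(strings) {
  match(strings, unique(strings))
}

ex.string <- c("unstim", "unstim", "stim", "stim", "stim")
numbers <- StringToNumbersSketch(ex.string)
numbers
```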
A wrapper for Rtsne that takes final SCONE output and runs tSNE on it after subsampling. This is specifically for SCONE runs that contain large numbers of cells, for which tSNE would be either too time-consuming or too messy. Regarding the latter, tSNE typically appears less clean in the range of 10^5 cells.
SubsampleAndTsne(dat, input, numcells)
dat |
tibble of original input data, and scone-based additions. |
input |
the markers used in the original knn computation, which are typically surface markers |
numcells |
the number of cells to be downsampled to |
a subsampled tibble that contains tSNE values
SubsampleAndTsne(wand.combined, input.markers, 500)
Wrapper for ggplot2 based plotting of a tSNE map to color by markers from the post-processed file if tSNE was set to TRUE in the post-processing function.
TsneVis(final, marker, label = marker)
final |
The tibble of cells by features output by the PostProcessing function. These features encompass both regular markers from the original data and the KNN statistics processed markers |
marker |
String that matches the marker name in the final data object exactly. |
label |
a string that indicates the name of the color label in the ensuing plot. Set to the marker string as default. |
A plot of bh-SNE1 x bh-SNE2 colored by the specified marker.
TsneVis(wand.final, "pSTAT5(Nd150)Di.IL7.change", "pSTAT5 change")
A single patient pair of basal and IL7 treated cells from bone marrow gated for B cell precursors.
wand.combined
A tibble of 1000 cells by 51 features, including all the input markers, Wanderlust values, and the condition. The first 500 rows are untreated cells and the last 500 rows are IL7 treated.
"combined" data taken through KNN generation and comparisons, along with t-SNE map generation.
wand.final
A tibble of 1000 cells and 87 features, including the input features, the SCONE-generated comparisons, differential abundance, and two t-SNE dimensions.
This is the output of the ImputeTesting function used on the Wanderlust dataset, which finds the average imputation error of all signal markers imputed from KNN of surface markers.
wand.ideal.k
A named vector, where the elements are average imputation error and the names are the values of k, from a 10,000 cell dataset.
The IL7 treated cells from a single patient in the Wanderlust dataset
wand.il7
A tibble of 1000 cells by 51 features. All markers in the dataset, along with pre-calculated Wanderlust value and condition, which is a string that denotes that this is the "IL7" condition for each row. Important when this is concatenated with additional conditions
The scone output for the Wanderlust dataset
wand.scone
A tibble of 1000 cells by 34 features. These features include the KNN comparisons, KNN density estimation, and differential abundance. Note that this tibble gets concatenated with the original tibble, as well as two t-SNE dimensions, in the PostProcessing() step of the pipeline.