Title: | Differential neighbourhood abundance testing on a graph |
---|---|
Description: | Milo performs single-cell differential abundance testing. Cell states are modelled as representative neighbourhoods on a nearest neighbour graph. Hypothesis testing is performed using either a negative bionomial generalized linear model or negative binomial generalized linear mixed model. |
Authors: | Mike Morgan [aut, cre] , Emma Dann [aut, ctb] |
Maintainer: | Mike Morgan <[email protected]> |
License: | GPL-3 + file LICENSE |
Version: | 2.3.0 |
Built: | 2024-10-30 09:04:15 UTC |
Source: | https://github.com/bioc/miloR |
The miloR package provides modular functions to perform differential
abundance testing on replicated single-cell experiments. For details please
see the vignettes vignette("milo_demo", package="miloR")
and
vignette("milo_gastrulation", package="miloR")
.
The miloR package
Mike Morgan & Emma Dann
This function assigns a categorical label to neighbourhoods in the differential abundance results
data.frame (output of testNhoods
), based on the most frequent label among cells in each
neighbourhood. This can be useful to stratify DA testing results by cell types or samples.
Also the fraction of cells carrying that label is stored.
annotateNhoods(x, da.res, coldata_col, subset.nhoods = NULL)
annotateNhoods(x, da.res, coldata_col, subset.nhoods = NULL)
x |
A |
da.res |
A |
coldata_col |
A character scalar determining which column of |
subset.nhoods |
A character, numeric or logical vector that will subset the annotation to the specific nhoods. If
a character vector these should correspond to row names of |
For each neighbourhood, this calculates the most frequent value of colData(x)[coldata_col]
among cells in the neighbourhood and assigns that value as annotation for the neighbourhood, adding a column in the
da.res
data.frame. In addition, a coldata_col_fraction
column will be added, storing the fraction of cells
carrying the assigned label. While in practice neighbourhoods are often homogeneous, one might choose to remove an
annotation label when the fraction of cells with the label is too low (e.g. below 0.6).
A data.frame
of model results (as da.res
input) with two new columns: (1) coldata_col
storing
the assigned label for each neighbourhood; (2) coldata_col_fraction
storing the fraction of cells in the neighbourhood with
the assigned label.
Emma Dann
NULL
NULL
Construct a kNN-graph from an input adjacency matrix - either binary or distances between NNs.
x |
An n X n |
k |
(optional) Scalar value that represents the number of nearest neighbours in the original graph. This
can also be inferred directly from the adjacency matrix |
is.binary |
Logical scalar indicating if the input matrix is binary or not. |
This function will take a matrix as input and construct the kNN graph that it describes. If the matrix is not symmetric then the graph is assumed to be directed, whereas if the matrix is not binary, i.e. all 0's and 1's then the input values are taken to be distances between graph vertices; 0 values are assumed to represent a lack of edge between vertices.
A Milo
with the graph slot populated.
Mike Morgan
r <- 1000 c <- 1000 k <- 35 m <- floor(matrix(runif(r*c), r, c)) for(i in seq_along(1:r)){ m[i, sample(1:c, size=k)] <- 1 } milo <- buildFromAdjacency(m)
r <- 1000 c <- 1000 k <- 35 m <- floor(matrix(runif(r*c), r, c)) for(i in seq_along(1:r)){ m[i, sample(1:c, size=k)] <- 1 } milo <- buildFromAdjacency(m)
This function is borrowed from the old buildKNNGraph function in scran. Instead of returning an igraph object it populates the graph and distance slots in a Milo object. If the input is a SingleCellExperiment object or a matrix then it will return a de novo Milo object with the same slots filled.
buildGraph( x, k = 10, d = 50, transposed = FALSE, get.distance = FALSE, reduced.dim = "PCA", BNPARAM = KmknnParam(), BSPARAM = bsparam(), BPPARAM = SerialParam() )
buildGraph( x, k = 10, d = 50, transposed = FALSE, get.distance = FALSE, reduced.dim = "PCA", BNPARAM = KmknnParam(), BSPARAM = bsparam(), BPPARAM = SerialParam() )
x |
A matrix, |
k |
An integer scalar that specifies the number of nearest-neighbours to consider for the graph building. |
d |
The number of dimensions to use if the input is a matrix of cells X reduced dimensions. If this is provided, transposed should also be set=TRUE. |
transposed |
Logical if the input x is transposed with rows as cells. |
get.distance |
A logical scalar whether to compute distances during graph construction. |
reduced.dim |
A character scalar that refers to a specific entry in
the |
BNPARAM |
refer to |
BSPARAM |
refer to |
BPPARAM |
refer to |
This function computes a k-nearest neighbour graph. Each graph vertex is a
single-cell connected by the edges between its neighbours. Whilst a
kNN-graph is strictly directed, we remove directionality by forcing all
edge weights to 1; this behaviour can be overriden by providing
directed=TRUE
.
If you wish to use an
alternative graph structure, such as a shared-NN graph I recommend you
construct this separately and add to the relevant slot in the
Milo
object.
A Milo
object with the graph and distance slots populated.
Mike Morgan, with KNN code written by Aaron Lun & Jonathan Griffiths.
library(SingleCellExperiment) ux <- matrix(rpois(12000, 5), ncol=200) vx <- log2(ux + 1) pca <- prcomp(t(vx)) sce <- SingleCellExperiment(assays=list(counts=ux, logcounts=vx), reducedDims=SimpleList(PCA=pca$x)) milo <- Milo(sce) milo <- buildGraph(milo, d=30, transposed=TRUE) milo
library(SingleCellExperiment) ux <- matrix(rpois(12000, 5), ncol=200) vx <- log2(ux + 1) pca <- prcomp(t(vx)) sce <- SingleCellExperiment(assays=list(counts=ux, logcounts=vx), reducedDims=SimpleList(PCA=pca$x)) milo <- Milo(sce) milo <- buildGraph(milo, d=30, transposed=TRUE) milo
Build an abstracted graph of neighbourhoods for visualization
buildNhoodGraph(x, overlap = 1)
buildNhoodGraph(x, overlap = 1)
x |
A |
overlap |
A numeric scalar that thresholds graph edges based on the number of overlapping cells between neighbourhoods. |
This constructs a weighted graph where nodes represent neighbourhoods and edges represent the number of overlapping cells between two neighbourhoods.
A Milo
object containg an igraph
graph in the nhoodGraph
slot.
Emma Dann
NULL
NULL
This function will calculate Euclidean distances between single-cells in a
neighbourhood using the same dimensionality as was used to construct the graph.
This step follows the makeNhoods
call to limit the number of distance
calculations required.
calcNhoodDistance(x, d, reduced.dim = NULL, use.assay = "logcounts")
calcNhoodDistance(x, d, reduced.dim = NULL, use.assay = "logcounts")
x |
A |
d |
The number of dimensions to use for computing within-neighbourhood
distances. This should be the same value used construct the |
reduced.dim |
If x is an |
use.assay |
A character scalar defining which |
A Milo
object with the distance slots populated.
Mike Morgan, Emma Dann
library(SingleCellExperiment) ux <- matrix(rpois(12000, 5), ncol=200) vx <- log2(ux + 1) pca <- prcomp(t(vx)) sce <- SingleCellExperiment(assays=list(counts=ux, logcounts=vx), reducedDims=SimpleList(PCA=pca$x)) milo <- Milo(sce) milo <- buildGraph(milo, d=30, transposed=TRUE) milo <- makeNhoods(milo) milo <- calcNhoodDistance(milo, d=30) milo
library(SingleCellExperiment) ux <- matrix(rpois(12000, 5), ncol=200) vx <- log2(ux + 1) pca <- prcomp(t(vx)) sce <- SingleCellExperiment(assays=list(counts=ux, logcounts=vx), reducedDims=SimpleList(PCA=pca$x)) milo <- Milo(sce) milo <- buildGraph(milo, d=30, transposed=TRUE) milo <- makeNhoods(milo) milo <- calcNhoodDistance(milo, d=30) milo
This function calculates the mean expression of each feature in the
Milo object stored in the assays slot. Neighbourhood expression data are
stored in a new slot nhoodExpression
.
calcNhoodExpression(x, assay = "logcounts", subset.row = NULL, exprs = NULL)
calcNhoodExpression(x, assay = "logcounts", subset.row = NULL, exprs = NULL)
x |
A |
assay |
A character scalar that describes the assay slot to use for calculating neighbourhood expression. |
subset.row |
A logical, integer or character vector indicating the rows
of |
exprs |
If |
This function computes the mean expression of each gene, subset by subset.rows
where present, across the cells contained within each neighbourhood.
A Milo
object with the nhoodExpression
slot populated.
Mike Morgan
require(SingleCellExperiment) m <- matrix(rnorm(100000), ncol=100) milo <- Milo(SingleCellExperiment(assays=list(logcounts=m))) milo <- buildGraph(m, k=20, d=30) milo <- makeNhoods(milo) milo <- calcNhoodExpression(milo) dim(nhoodExpression(milo))
require(SingleCellExperiment) m <- matrix(rnorm(100000), ncol=100) milo <- Milo(SingleCellExperiment(assays=list(logcounts=m))) milo <- buildGraph(m, k=20, d=30) milo <- makeNhoods(milo) milo <- calcNhoodExpression(milo) dim(nhoodExpression(milo))
Check the count distributions for each nhood according to a test variable of interest. This is important for checking if there is separation in the GLMM to inform either nhood subsetting or re-computation of the NN-graph and refined nhoods.
x |
|
design.df |
A |
condition |
A character scalar of the test variable contained in |
min.val |
A numeric scalar that sets the minimum number of counts across condition level samples, below which separation is defined. |
factor.check |
A logical scalar that sets the factor variable level checking. See details for more information. |
This function checks across nhoods for separation based on the separate levels
of an input factor variable. It checks if condition is a factor variable,
and if not it will cast it to a factor. Note that the function first checks for the
number of unique values - if this exceeds > 50
error is generated. Users can override this behaviour with factor.check=FALSE
.
A logical vector of the same length as ncol(nhoodCounts(x))
where TRUE
values represent nhoods where separation is detected. The output of this function
can be used to subset nhood-based analyses
e.g. testNhoods(..., subset.nhoods=checkSepartion(x, ...))
.
Mike Morgan
library(SingleCellExperiment) ux.1 <- matrix(rpois(12000, 5), ncol=400) ux.2 <- matrix(rpois(12000, 4), ncol=400) ux <- rbind(ux.1, ux.2) vx <- log2(ux + 1) pca <- prcomp(t(vx)) sce <- SingleCellExperiment(assays=list(counts=ux, logcounts=vx), reducedDims=SimpleList(PCA=pca$x)) milo <- Milo(sce) milo <- buildGraph(milo, k=20, d=10, transposed=TRUE) milo <- makeNhoods(milo, k=20, d=10, prop=0.3) milo <- calcNhoodDistance(milo, d=10) cond <- rep("A", ncol(milo)) cond.a <- sample(1:ncol(milo), size=floor(ncol(milo)*0.25)) cond.b <- setdiff(1:ncol(milo), cond.a) cond[cond.b] <- "B" meta.df <- data.frame(Condition=cond, Replicate=c(rep("R1", 132), rep("R2", 132), rep("R3", 136))) meta.df$SampID <- paste(meta.df$Condition, meta.df$Replicate, sep="_") milo <- countCells(milo, meta.data=meta.df, samples="SampID") test.meta <- data.frame("Condition"=c(rep("A", 3), rep("B", 3)), "Replicate"=rep(c("R1", "R2", "R3"), 2)) test.meta$Sample <- paste(test.meta$Condition, test.meta$Replicate, sep="_") rownames(test.meta) <- test.meta$Sample check.sep <- checkSeparation(milo, design.df=test.meta, condition='Condition') sum(check.sep)
library(SingleCellExperiment) ux.1 <- matrix(rpois(12000, 5), ncol=400) ux.2 <- matrix(rpois(12000, 4), ncol=400) ux <- rbind(ux.1, ux.2) vx <- log2(ux + 1) pca <- prcomp(t(vx)) sce <- SingleCellExperiment(assays=list(counts=ux, logcounts=vx), reducedDims=SimpleList(PCA=pca$x)) milo <- Milo(sce) milo <- buildGraph(milo, k=20, d=10, transposed=TRUE) milo <- makeNhoods(milo, k=20, d=10, prop=0.3) milo <- calcNhoodDistance(milo, d=10) cond <- rep("A", ncol(milo)) cond.a <- sample(1:ncol(milo), size=floor(ncol(milo)*0.25)) cond.b <- setdiff(1:ncol(milo), cond.a) cond[cond.b] <- "B" meta.df <- data.frame(Condition=cond, Replicate=c(rep("R1", 132), rep("R2", 132), rep("R3", 136))) meta.df$SampID <- paste(meta.df$Condition, meta.df$Replicate, sep="_") milo <- countCells(milo, meta.data=meta.df, samples="SampID") test.meta <- data.frame("Condition"=c(rep("A", 3), rep("B", 3)), "Replicate"=rep(c("R1", "R2", "R3"), 2)) test.meta$Sample <- paste(test.meta$Condition, test.meta$Replicate, sep="_") rownames(test.meta) <- test.meta$Sample check.sep <- checkSeparation(milo, design.df=test.meta, condition='Condition') sum(check.sep)
Based on the asymptotic t-distribution, comptue the 2-tailed p-value that estimate != 0. This function is not intended to be used directly, but is included for reference or if an alternative estimate of the degrees of freedom is available.
computePvalue(Zscore, df)
computePvalue(Zscore, df)
Zscore |
A numeric vector containing the Z scores for each fixed effect parameter |
df |
A numeric vector containing the estimated degrees of freedom for each fixed effect parameter |
Based on sampling from a 2-tailed t-distribution with df
degrees of freedom,
compute the probability that the calculated Zscore
is greater than or equal to what would be
expected from random chance.
Numeric vector of p-values, 1 per fixed effect parameter
Mike Morgan & Alice Kluzer
NULL
NULL
This function quantifies the number of cells in each neighbourhood according to an input experimental design. This forms the basis for the differential neighbourhood abundance testing.
countCells(x, samples, meta.data = NULL)
countCells(x, samples, meta.data = NULL)
x |
A |
samples |
Either a string specifying which column of |
meta.data |
A cell X variable |
This function generates a counts matrix of nhoods
X samples,
and populates the nhoodCounts
slot of the input
Milo
object. This matrix is used down-stream for
differential abundance testing.
A Milo
object containing a counts matrix in the
nhoodCounts
slot.
Mike Morgan, Emma Dann
library(igraph) m <- matrix(rnorm(100000), ncol=100) milo <- buildGraph(t(m), k=20, d=10) milo <- makeNhoods(milo, k=20, d=10, prop=0.3) cond <- rep("A", nrow(m)) cond.a <- sample(seq_len(nrow(m)), size=floor(nrow(m)*0.25)) cond.b <- setdiff(seq_len(nrow(m)), cond.a) cond[cond.b] <- "B" meta.df <- data.frame(Condition=cond, Replicate=c(rep("R1", 330), rep("R2", 330), rep("R3", 340))) meta.df$SampID <- paste(meta.df$Condition, meta.df$Replicate, sep="_") milo <- countCells(milo, meta.data=meta.df, samples="SampID") milo
library(igraph) m <- matrix(rnorm(100000), ncol=100) milo <- buildGraph(t(m), k=20, d=10) milo <- makeNhoods(milo, k=20, d=10, prop=0.3) cond <- rep("A", nrow(m)) cond.a <- sample(seq_len(nrow(m)), size=floor(nrow(m)*0.25)) cond.b <- setdiff(seq_len(nrow(m)), cond.a) cond[cond.b] <- "B" meta.df <- data.frame(Condition=cond, Replicate=c(rep("R1", 330), rep("R2", 330), rep("R3", 340))) meta.df$SampID <- paste(meta.df$Condition, meta.df$Replicate, sep="_") milo <- countCells(milo, meta.data=meta.df, samples="SampID") milo
This function will perform differential gene expression analysis on
groups of neighbourhoods. Adjacent and concordantly DA neighbourhoods can be defined using
groupNhoods
or by the user. Cells between these
aggregated groups are compared. For differential gene experession based on an input design
within DA neighbourhoods see testDiffExp
.
findNhoodGroupMarkers( x, da.res, assay = "logcounts", aggregate.samples = FALSE, sample_col = NULL, subset.row = NULL, gene.offset = TRUE, subset.nhoods = NULL, subset.groups = NULL, na.function = "na.pass" )
findNhoodGroupMarkers( x, da.res, assay = "logcounts", aggregate.samples = FALSE, sample_col = NULL, subset.row = NULL, gene.offset = TRUE, subset.nhoods = NULL, subset.groups = NULL, na.function = "na.pass" )
x |
A |
da.res |
A |
assay |
A character scalar determining which |
aggregate.samples |
logical indicating wheather the expression values for cells in the same sample
and neighbourhood group should be merged for DGE testing. This allows to perform testing exploiting the replication structure
in the experimental design, rather than treating single-cells as independent replicates. The function used for aggregation depends on the
selected gene expression assay: if |
sample_col |
a character scalar indicating the column in the colData storing sample information
(only relevant if |
subset.row |
A logical, integer or character vector indicating the rows
of |
gene.offset |
A logical scalar the determines whether a per-cell offset is provided in the DGE GLM to adjust for the number of detected genes with expression > 0. |
subset.nhoods |
A logical, integer or character vector indicating which neighbourhoods to subset before aggregation and DGE testing (default: NULL). |
subset.groups |
A character vector indicating which groups to test for markers (default: NULL) |
na.function |
A valid NA action function to apply, should be one of
|
Using a one vs. all approach, each aggregated group of cells is compared to all others
using the single-cell log normalized gene expression with a GLM
(for details see limma-package
), or the single-cell counts using a
negative binomial GLM (for details see edgeR-package
). When using
the latter it is recommended to set gene.offset=TRUE
as this behaviour adjusts
the model offsets by the number of detected genes in each cell.
A data.frame
of DGE results containing a log fold change and adjusted
p-value for each aggregated group of neighbourhoods. If return.groups
then
the return value is a list with the slots groups
and dge
containing the
aggregated neighbourhood groups per single-cell and marker gene results, respectively.
Warning: If all neighbourhoods are grouped together, then it is impossible to
run findNhoodMarkers
. In this (hopefully rare) instance, this function will return
a warning and return NULL
.
NULL
NULL
This function will perform differential gene expression analysis on
differentially abundant neighbourhoods, by first aggregating adjacent and
concordantly DA neighbourhoods, then comparing cells between these
aggregated groups. For differential gene experession based on an input design
within DA neighbourhoods see testDiffExp
.
x |
A |
da.res |
A |
da.fdr |
A numeric scalar that determines at what FDR neighbourhoods are declared DA for the purposes of aggregating across concorantly DA neighbourhoods. |
assay |
A character scalar determining which |
aggregate.samples |
logical indicating wheather the expression values for cells in the same sample
and neighbourhood group should be merged for DGE testing. This allows to perform testing exploiting the replication structure
in the experimental design, rather than treating single-cells as independent replicates. The function used for aggregation depends on the
selected gene expression assay: if |
sample_col |
a character scalar indicating the column in the colData storing sample information
(only relevant if |
overlap |
A scalar integer that determines the number of cells that must overlap between adjacent neighbourhoods for merging. |
lfc.threshold |
A scalar that determines the absolute log fold change above which neighbourhoods should be considerd 'DA' for merging. Default=NULL |
merge.discord |
A logical scalar that overrides the default behaviour and allows adjacent neighbourhoods to be merged if they have discordant log fold change signs. Using this argument is generally discouraged, but may be useful for constructing an empirical null group of cells, regardless of DA sign. |
subset.row |
A logical, integer or character vector indicating the rows
of |
gene.offset |
A logical scalar the determines whether a per-cell offset is provided in the DGE GLM to adjust for the number of detected genes with expression > 0. |
return.groups |
A logical scalar that returns a |
subset.nhoods |
A logical, integer or character vector indicating which neighbourhoods to subset before aggregation and DGE testing. |
na.function |
A valid NA action function to apply, should be one of
|
compute.new |
A logical scalar indicating whether to force computing a new neighbourhood adjacency matrix if already present. |
Louvain clustering is applied to the neighbourhood graph. This graph is first modified
based on two criteria: 1) neighbourhoods share at least overlap
number of cells,
and 2) the DA log fold change sign is concordant.
This behaviour can be modulated by setting overlap
to be more or less stringent.
Additionally, a threshold on the log fold-changes can be set, such that lfc.threshold
is required to retain edges between adjacent neighbourhoods. Note: adjacent neighbourhoods will
never be merged with opposite signs.
Using a one vs. all approach, each aggregated group of cells is compared to all others
using the single-cell log normalized gene expression with a GLM
(for details see limma-package
), or the single-cell counts using a
negative binomial GLM (for details see edgeR-package
). When using
the latter it is recommended to set gene.offset=TRUE
as this behaviour adjusts
the model offsets by the number of detected genes in each cell.
A data.frame
of DGE results containing a log fold change and adjusted
p-value for each aggregated group of neighbourhoods. If return.groups
then
the return value is a list with the slots groups
and dge
containing the
aggregated neighbourhood groups per single-cell and marker gene results, respectively.
Warning: If all neighbourhoods are grouped together, then it is impossible to
run findNhoodMarkers
. In this (hopefully rare) instance, this function will return
a warning and return NULL
.
Mike Morgan & Emma Dann
library(SingleCellExperiment) ux.1 <- matrix(rpois(12000, 5), ncol=400) ux.2 <- matrix(rpois(12000, 4), ncol=400) ux <- rbind(ux.1, ux.2) vx <- log2(ux + 1) pca <- prcomp(t(vx)) sce <- SingleCellExperiment(assays=list(counts=ux, logcounts=vx), reducedDims=SimpleList(PCA=pca$x)) colnames(sce) <- paste0("Cell", seq_len(ncol(sce))) milo <- Milo(sce) milo <- buildGraph(milo, k=20, d=10, transposed=TRUE) milo <- makeNhoods(milo, k=20, d=10, prop=0.3) milo <- calcNhoodDistance(milo, d=10) cond <- rep("A", ncol(milo)) cond.a <- sample(seq_len(ncol(milo)), size=floor(ncol(milo)*0.25)) cond.b <- setdiff(seq_len(ncol(milo)), cond.a) cond[cond.b] <- "B" meta.df <- data.frame(Condition=cond, Replicate=c(rep("R1", 132), rep("R2", 132), rep("R3", 136))) meta.df$SampID <- paste(meta.df$Condition, meta.df$Replicate, sep="_") milo <- countCells(milo, meta.data=meta.df, samples="SampID") test.meta <- data.frame("Condition"=c(rep("A", 3), rep("B", 3)), "Replicate"=rep(c("R1", "R2", "R3"), 2)) test.meta$Sample <- paste(test.meta$Condition, test.meta$Replicate, sep="_") rownames(test.meta) <- test.meta$Sample da.res <- testNhoods(milo, design=~0 + Condition, design.df=test.meta[colnames(nhoodCounts(milo)), ]) nhood.dge <- findNhoodMarkers(milo, da.res, overlap=1, compute.new=TRUE) nhood.dge
library(SingleCellExperiment) ux.1 <- matrix(rpois(12000, 5), ncol=400) ux.2 <- matrix(rpois(12000, 4), ncol=400) ux <- rbind(ux.1, ux.2) vx <- log2(ux + 1) pca <- prcomp(t(vx)) sce <- SingleCellExperiment(assays=list(counts=ux, logcounts=vx), reducedDims=SimpleList(PCA=pca$x)) colnames(sce) <- paste0("Cell", seq_len(ncol(sce))) milo <- Milo(sce) milo <- buildGraph(milo, k=20, d=10, transposed=TRUE) milo <- makeNhoods(milo, k=20, d=10, prop=0.3) milo <- calcNhoodDistance(milo, d=10) cond <- rep("A", ncol(milo)) cond.a <- sample(seq_len(ncol(milo)), size=floor(ncol(milo)*0.25)) cond.b <- setdiff(seq_len(ncol(milo)), cond.a) cond[cond.b] <- "B" meta.df <- data.frame(Condition=cond, Replicate=c(rep("R1", 132), rep("R2", 132), rep("R3", 136))) meta.df$SampID <- paste(meta.df$Condition, meta.df$Replicate, sep="_") milo <- countCells(milo, meta.data=meta.df, samples="SampID") test.meta <- data.frame("Condition"=c(rep("A", 3), rep("B", 3)), "Replicate"=rep(c("R1", "R2", "R3"), 2)) test.meta$Sample <- paste(test.meta$Condition, test.meta$Replicate, sep="_") rownames(test.meta) <- test.meta$Sample da.res <- testNhoods(milo, design=~0 + Condition, design.df=test.meta[colnames(nhoodCounts(milo)), ]) nhood.dge <- findNhoodMarkers(milo, da.res, overlap=1, compute.new=TRUE) nhood.dge
Iteratively estimate GLMM fixed and random effect parameters, and variance component parameters using Fisher scoring based on the Pseudo-likelihood approximation to a Normal loglihood. This function incorporates a user-defined covariance matrix, e.g. a kinship matrix for genetic analyses.
fitGeneticPLGlmm( Z, X, K, muvec, offsets, curr_beta, curr_theta, curr_u, curr_sigma, curr_G, y, u_indices, theta_conv, rlevels, curr_disp, REML, maxit, solver, vardist )
fitGeneticPLGlmm( Z, X, K, muvec, offsets, curr_beta, curr_theta, curr_u, curr_sigma, curr_G, y, u_indices, theta_conv, rlevels, curr_disp, REML, maxit, solver, vardist )
Z |
mat - sparse matrix that maps random effect variable levels to observations |
X |
mat - sparse matrix that maps fixed effect variables to observations |
K |
mat - sparse matrix that defines the known covariance patterns between individual observations. For example, a kinship matrix will then adjust for the known/estimated genetic relationships between observations. |
muvec |
vec vector of estimated phenotype means |
offsets |
vec vector of model offsets |
curr_beta |
vec vector of initial beta estimates |
curr_theta |
vec vector of initial parameter estimates |
curr_u |
vec of initial u estimates |
curr_sigma |
vec of initial sigma estimates |
curr_G |
mat c X c matrix of variance components |
y |
vec of observed counts |
u_indices |
List a List, each element contains the indices of Z relevant to each RE and all its levels |
theta_conv |
double Convergence tolerance for paramter estimates |
rlevels |
List containing mapping of RE variables to individual levels |
curr_disp |
double Dispersion parameter estimate |
REML |
bool - use REML for variance component estimation |
maxit |
int maximum number of iterations if theta_conv is FALSE |
solver |
string which solver to use - either HE (Haseman-Elston regression) or Fisher scoring |
vardist |
string which variance form to use NB = negative binomial, P=Poisson [not yet implemented]/ |
Fit a NB-GLMM to the counts provided in y. The model uses an iterative approach that
switches between the joint fixed and random effect parameter inference, and the variance component
estimation. A pseudo-likelihood approach is adopted to minimise the log-likelihood of the model
given the parameter estimates. The fixed and random effect parameters are estimated using
Hendersons mixed model equations, and the variance component parameters are then estimated with
the specified solver, i.e. Fisher scoring, Haseman-Elston or constrained Haseman-Elston regression. As
the domain of the variance components is [0, +Inf
], any negative variance component estimates will
trigger the switch to the HE-NNLS solver until the model converges.
A list
containing the following elements (note: return types are dictated by Rcpp, so the R
types are described here):
FE
:numeric
vector of fixed effect parameter estimates.
RE
:list
of the same length as the number of random effect variables. Each slot contains the best
linear unbiased predictors (BLUPs) for the levels of the corresponding RE variable.
Sigma:
numeric
vector of variance component estimates, 1 per random effect variable. For this model the
last variance component corresponds to the input K matrix.
converged:
logical
scalar of whether the model has reached the convergence tolerance or not.
Iters:
numeric
scalar with the number of iterations that the model ran for. Is strictly <= max.iter
.
Dispersion:
numeric
scalar of the dispersion estimate computed off-line
Hessian:
matrix
of 2nd derivative elements from the fixed and random effect parameter inference.
SE:
matrix
of standard error estimates, derived from the hessian, i.e. the square roots of the diagonal elements.
t:
numeric
vector containing the compute t-score for each fixed effect variable.
COEFF:
matrix
containing the coefficient matrix from the mixed model equations.
P:
matrix
containing the elements of the REML projection matrix.
Vpartial:
list
containing the partial derivatives of the (pseudo)variance matrix with respect to each variance
component.
Ginv:
matrix
of the inverse variance components broadcast to the full Z matrix.
Vsinv:
matrix
of the inverse pseudovariance.
Winv:
matrix
of the inverse elements of W = D^-1 V D^-1
VCOV:
matrix
of the variance-covariance for all model fixed and random effect variable parameter estimates.
This is required to compute the degrees of freedom for the fixed effect parameter inference.
CONVLIST:
list
of list
containing the parameter estimates and differences between current and previous
iteration estimates at each model iteration. These are included for each fixed effect, random effect and variance component parameter.
The list elements for each iteration are: ThetaDiff, SigmaDiff, beta, u, sigma.
Mike Morgan
NULL
NULL
This function will perform DA testing per-nhood using a negative binomial generalised linear mixed model
fitGLMM( X, Z, y, offsets, init.theta = NULL, Kin = NULL, random.levels = NULL, REML = FALSE, glmm.control = list(theta.tol = 1e-06, max.iter = 100, init.sigma = NULL, init.beta = NULL, init.u = NULL, solver = NULL), dispersion = 1, geno.only = FALSE, intercept.type = "fixed", solver = NULL )
fitGLMM( X, Z, y, offsets, init.theta = NULL, Kin = NULL, random.levels = NULL, REML = FALSE, glmm.control = list(theta.tol = 1e-06, max.iter = 100, init.sigma = NULL, init.beta = NULL, init.u = NULL, solver = NULL), dispersion = 1, geno.only = FALSE, intercept.type = "fixed", solver = NULL )
X |
A matrix containing the fixed effects of the model. |
Z |
A matrix containing the random effects of the model. |
y |
A matrix containing the observed phenotype over each neighborhood. |
offsets |
A vector containing the (log) offsets to apply normalisation for different numbers of cells across samples. |
init.theta |
A column vector (m X 1 matrix) of initial estimates of fixed and random effect coefficients |
Kin |
A n x n covariance matrix to explicitly model variation between observations |
random.levels |
A list describing the random effects of the model, and for each, the different unique levels. |
REML |
A logical value denoting whether REML (Restricted Maximum Likelihood) should be run. Default is TRUE. |
glmm.control |
A list containing parameter values specifying the theta tolerance of the model, the maximum number of iterations to be run, initial parameter values for the fixed (init.beta) and random effects (init.u), and glmm solver (see details). |
dispersion |
A scalar value for the initial dispersion of the negative binomial. |
geno.only |
A logical value that flags the model to use either just the |
intercept.type |
A character scalar, either fixed or random that sets the type of the global
intercept variable in the model. This only applies to the GLMM case where additional random effects variables are
already included. Setting |
solver |
a character value that determines which optimisation algorithm is used for the variance components. Must be either HE (Haseman-Elston regression) or Fisher (Fisher scoring). |
This function runs a negative binomial generalised linear mixed effects model. If mixed effects are detected in testNhoods,
this function is run to solve the model. The solver defaults to the Fisher optimiser, and in the case of negative variance estimates
it will switch to the non-negative least squares (NNLS) Haseman-Elston solver. This behaviour can be pre-set by passing
glmm.control$solver="HE"
for Haseman-Elston regression, which is the recommended solver when a covariance matrix is provided,
or glmm.control$solver="HE-NNLS"
which is the constrained HE optimisation algorithm.
A list containing the GLMM output, including inference results. The list elements are as follows:
FE
:numeric
vector of fixed effect parameter estimates.
RE
:list
of the same length as the number of random effect variables. Each slot contains the best
linear unbiased predictors (BLUPs) for the levels of the corresponding RE variable.
Sigma:
numeric
vector of variance component estimates, 1 per random effect variable.
converged:
logical
scalar of whether the model has reached the convergence tolerance or not.
Iters:
numeric
scalar with the number of iterations that the model ran for. Is strictly <= max.iter
.
Dispersion:
numeric
scalar of the dispersion estimate computed off-line
Hessian:
matrix
of 2nd derivative elements from the fixed and random effect parameter inference.
SE:
matrix
of standard error estimates, derived from the hessian, i.e. the square roots of the diagonal elements.
t:
numeric
vector containing the compute t-score for each fixed effect variable.
COEFF:
matrix
containing the coefficient matrix from the mixed model equations.
P:
matrix
containing the elements of the REML projection matrix.
Vpartial:
list
containing the partial derivatives of the (pseudo)variance matrix with respect to each variance
component.
Ginv:
matrix
of the inverse variance components broadcast to the full Z matrix.
Vsinv:
matrix
of the inverse pseudovariance.
Winv:
matrix
of the inverse elements of W = D^-1 V D^-1
VCOV:
matrix
of the variance-covariance for all model fixed and random effect variable parameter estimates.
This is required to compute the degrees of freedom for the fixed effect parameter inference.
DF:
numeric
vector of the number of inferred degrees of freedom. For details see Satterthwaite_df.
PVALS:
numeric
vector of the compute p-values from a t-distribution with the inferred number of degrees of
freedom.
ERROR:
list
containing Rcpp error messages - used for internal checking.
Mike Morgan
data(sim_nbglmm) random.levels <- list("RE1"=paste("RE1", levels(as.factor(sim_nbglmm$RE1)), sep="_"), "RE2"=paste("RE2", levels(as.factor(sim_nbglmm$RE2)), sep="_")) X <- as.matrix(data.frame("Intercept"=rep(1, nrow(sim_nbglmm)), "FE2"=as.numeric(sim_nbglmm$FE2))) Z <- as.matrix(data.frame("RE1"=paste("RE1", as.numeric(sim_nbglmm$RE1), sep="_"), "RE2"=paste("RE2", as.numeric(sim_nbglmm$RE2), sep="_"))) y <- sim_nbglmm$Mean.Count dispersion <- 0.5 glmm.control <- glmmControl.defaults() glmm.control$theta.tol <- 1e-6 glmm.control$max.iter <- 15 model.list <- fitGLMM(X=X, Z=Z, y=y, offsets=rep(0, nrow(X)), random.levels=random.levels, REML = TRUE, glmm.control=glmm.control, dispersion=dispersion, solver="Fisher") model.list
data(sim_nbglmm) random.levels <- list("RE1"=paste("RE1", levels(as.factor(sim_nbglmm$RE1)), sep="_"), "RE2"=paste("RE2", levels(as.factor(sim_nbglmm$RE2)), sep="_")) X <- as.matrix(data.frame("Intercept"=rep(1, nrow(sim_nbglmm)), "FE2"=as.numeric(sim_nbglmm$FE2))) Z <- as.matrix(data.frame("RE1"=paste("RE1", as.numeric(sim_nbglmm$RE1), sep="_"), "RE2"=paste("RE2", as.numeric(sim_nbglmm$RE2), sep="_"))) y <- sim_nbglmm$Mean.Count dispersion <- 0.5 glmm.control <- glmmControl.defaults() glmm.control$theta.tol <- 1e-6 glmm.control$max.iter <- 15 model.list <- fitGLMM(X=X, Z=Z, y=y, offsets=rep(0, nrow(X)), random.levels=random.levels, REML = TRUE, glmm.control=glmm.control, dispersion=dispersion, solver="Fisher") model.list
Iteratively estimate GLMM fixed and random effect parameters, and variance component parameters using Fisher scoring based on the Pseudo-likelihood approximation to a Normal loglihood.
fitPLGlmm( Z, X, muvec, offsets, curr_beta, curr_theta, curr_u, curr_sigma, curr_G, y, u_indices, theta_conv, rlevels, curr_disp, REML, maxit, solver, vardist )
fitPLGlmm( Z, X, muvec, offsets, curr_beta, curr_theta, curr_u, curr_sigma, curr_G, y, u_indices, theta_conv, rlevels, curr_disp, REML, maxit, solver, vardist )
Z |
mat - sparse matrix that maps random effect variable levels to observations |
X |
mat - sparse matrix that maps fixed effect variables to observations |
muvec |
vec vector of estimated phenotype means |
offsets |
vec vector of model offsets |
curr_beta |
vec vector of initial beta estimates |
curr_theta |
vec vector of initial parameter estimates |
curr_u |
vec of initial u estimates |
curr_sigma |
vec of initial sigma estimates |
curr_G |
mat c X c matrix of variance components |
y |
vec of observed counts |
u_indices |
List a List, each element contains the indices of Z relevant to each RE and all its levels |
theta_conv |
double Convergence tolerance for paramter estimates |
rlevels |
List containing mapping of RE variables to individual levels |
curr_disp |
double Dispersion parameter estimate |
REML |
bool - use REML for variance component estimation |
maxit |
int maximum number of iterations if theta_conv is FALSE |
solver |
string which solver to use - either HE (Haseman-Elston regression) or Fisher scoring |
vardist |
string which variance form to use NB = negative binomial, P=Poisson [not yet implemented.] |
Fit a NB-GLMM to the counts provided in y. The model uses an iterative approach that
switches between the joint fixed and random effect parameter inference, and the variance component
estimation. A pseudo-likelihood approach is adopted to minimise the log-likelihood of the model
given the parameter estimates. The fixed and random effect parameters are estimated using
Hendersons mixed model equations, and the variance component parameters are then estimated with
the specified solver, i.e. Fisher scoring, Haseman-Elston or constrained Haseman-Elston regression. As
the domain of the variance components is [0, +Inf
], any negative variance component estimates will
trigger the switch to the HE-NNLS solver until the model converges.
A list
containing the following elements (note: return types are dictated by Rcpp, so the R
types are described here):
FE
:numeric
vector of fixed effect parameter estimates.
RE
:list
of the same length as the number of random effect variables. Each slot contains the best
linear unbiased predictors (BLUPs) for the levels of the corresponding RE variable.
Sigma:
numeric
vector of variance component estimates, 1 per random effect variable.
converged:
logical
scalar of whether the model has reached the convergence tolerance or not.
Iters:
numeric
scalar with the number of iterations that the model ran for. Is strictly <= max.iter
.
Dispersion:
numeric
scalar of the dispersion estimate computed off-line
Hessian:
matrix
of 2nd derivative elements from the fixed and random effect parameter inference.
SE:
matrix
of standard error estimates, derived from the hessian, i.e. the square roots of the diagonal elements.
t:
numeric
vector containing the compute t-score for each fixed effect variable.
COEFF:
matrix
containing the coefficient matrix from the mixed model equations.
P:
matrix
containing the elements of the REML projection matrix.
Vpartial:
list
containing the partial derivatives of the (pseudo)variance matrix with respect to each variance
component.
Ginv:
matrix
of the inverse variance components broadcast to the full Z matrix.
Vsinv:
matrix
of the inverse pseudovariance.
Winv:
matrix
of the inverse elements of W = D^-1 V D^-1
VCOV:
matrix
of the variance-covariance for all model fixed and random effect variable parameter estimates.
This is required to compute the degrees of freedom for the fixed effect parameter inference.
CONVLIST:
list
of list
containing the parameter estimates and differences between current and previous
iteration estimates at each model iteration. These are included for each fixed effect, random effect and variance component parameter.
The list elements for each iteration are: ThetaDiff, SigmaDiff, beta, u, sigma.
Mike Morgan
NULL
NULL
This will give the default values for the GLMM solver
glmmControl.defaults(...)
glmmControl.defaults(...)
... |
see |
The default values for the parameter estimation convergence is 1e-6, and the maximum number of iterations is 100. In practise if the solver converges it generally does so fairly quickly on moderately well conditioned problems. The default solver is Fisher scoring, but this will switch (with a warning produced) to the NNLS Haseman-Elston solver if negative variance estimates are found.
list
containing the default values GLMM solver. This can be saved in the
user environment and then passed to testNhoods directly to modify the convergence
criteria of the solver that is used.
theta.tol:
numeric
scalar that sets the convergence threshold for the
parameter inference - this is applied globally to fixed and random effect parameters, and
to the variance estimates.
max.iter:
numeric
scalar that sets the maximum number of iterations that
the NB-GLMM will run for.
solver:
character
scalar that sets the solver to use. Valid values are
Fisher, HE or HE-NNLS. See fitGLMM for details.
Mike Morgan
mmcontrol <- glmmControl.defaults() mmcontrol mmcontrol$solver <- "HE-NNLS" mmcontrol
mmcontrol <- glmmControl.defaults() mmcontrol mmcontrol$solver <- "HE-NNLS" mmcontrol
Borrowing heavily from cydar
which corrects for multiple-testing
using a weighting scheme based on the volumetric overlap over hyperspheres.
In the instance of graph neighbourhoods this weighting scheme can use graph
connectivity or incorpate different within-neighbourhood distances for the
weighted FDR calculation.
x.nhoods |
A list of vertices and the constituent vertices of their neighbourhood |
graph |
The kNN graph used to define the neighbourhoods |
pvalues |
A vector of p-values calculated from a GLM or other appropriate statistical test for differential neighbourhood abundance |
k |
A numeric integer that determines the kth nearest neighbour distance to use for
the weighted FDR. Only applicaple when using |
weighting |
A string scalar defining which weighting scheme to use. Choices are: max, k-distance, neighbour-distance or graph-overlap. |
reduced.dimensions |
(optional) A |
distances |
(optional) A |
indices |
(optional) A list of neighbourhood index vertices in the same order as the input neighbourhoods. Only used for the k-distance weighting. |
Each neighbourhood is weighted according to the weighting scheme
defined. k-distance uses the distance to the kth nearest neighbour
of the index vertex, neighbour-distance uses the average within-neighbourhood
Euclidean distance in reduced dimensional space, max uses the largest within-neighbourhood distance
from the index vertex, and graph-overlap uses the total number of cells overlapping between
neighborhoods (distance-independent measure). The frequency-weighted version of the
BH method is then applied to the p-values, as in cydar
.
A vector of adjusted p-values
Adapted by Mike Morgan, original function by Aaron Lun
NULL
NULL
This function groups overlapping and concordantly DA neighbourhoods, using the louvain community detection algorithm.
groupNhoods( x, da.res, da.fdr = 0.1, overlap = 1, max.lfc.delta = NULL, merge.discord = FALSE, subset.nhoods = NULL, compute.new = FALSE, na.function = "na.pass" )
groupNhoods( x, da.res, da.fdr = 0.1, overlap = 1, max.lfc.delta = NULL, merge.discord = FALSE, subset.nhoods = NULL, compute.new = FALSE, na.function = "na.pass" )
x |
A |
da.res |
A |
da.fdr |
A numeric scalar that determines at what FDR neighbourhoods are declared DA for the purposes of aggregating across concorantly DA neighbourhoods. |
overlap |
A scalar integer that determines the number of cells that must overlap between adjacent neighbourhoods for merging. |
max.lfc.delta |
A scalar that determines the absolute difference in log fold change below which neighbourhoods should not be considered adjacent. Default=NULL |
merge.discord |
A logical scalar that overrides the default behaviour and allows adjacent neighbourhoods to be merged if they have discordant log fold change signs. Using this argument is generally discouraged, but may be useful for constructing an empirical null group of cells, regardless of DA sign. |
subset.nhoods |
A logical, integer or character vector indicating which neighbourhoods to subset before grouping. All other neighbourhoods will be assigned NA |
compute.new |
A logical scalar indicating whether to force computing a new neighbourhood adjacency matrix if already present. |
na.function |
A valid NA action function to apply, should be one of
|
Louvain clustering is applied to the neighbourhood graph. This graph is first modified
based on two criteria: 1) neighbourhoods share at least overlap
number of cells,
and 2) the DA log fold change sign is concordant.
This behaviour can be modulated by setting overlap
to be more or less stringent.
Additionally, a threshold on the log fold-changes can be set, such that max.lfc.delta
is required to retain edges between adjacent neighbourhoods. Note: adjacent neighbourhoods will
never be merged with opposite signs.
A data.frame
of model results (as da.res
input) with a new column storing the assigned
group label for each neighbourhood (NhoodGroup
column)
Emma Dann & Mike Morgan
This function maps the variance estimates onto the full c x q
levels for each random effect. This
ensures that the matrices commute in the NB-GLMM solver. This function is included for reference, and
should not be used directly
initialiseG(cluster_levels, sigmas, Kin = NULL)
initialiseG(cluster_levels, sigmas, Kin = NULL)
cluster_levels |
A |
sigmas |
A |
Kin |
A |
Broadcast the variance component estimates to the full c\*q x c\*q
matrix.
matrix
of the full broadcast variance component estimates.
Mike Morgan & Alice Kluzer
data(sim_nbglmm) random.levels <- list("RE1"=paste("RE1", levels(as.factor(sim_nbglmm$RE1)), sep="_"), "RE2"=paste("RE2", levels(as.factor(sim_nbglmm$RE2)), sep="_")) rand.sigma <- matrix(runif(2), ncol=1) rownames(rand.sigma) <- names(random.levels) big.G <- initialiseG(random.levels, rand.sigma) dim(big.G)
data(sim_nbglmm) random.levels <- list("RE1"=paste("RE1", levels(as.factor(sim_nbglmm$RE1)), sep="_"), "RE2"=paste("RE2", levels(as.factor(sim_nbglmm$RE2)), sep="_")) rand.sigma <- matrix(runif(2), ncol=1) rownames(rand.sigma) <- names(random.levels) big.G <- initialiseG(random.levels, rand.sigma) dim(big.G)
Using a simplified version of the n x c
Z matrix, with one column per variable, construct the fully broadcast
n x (c*q)
binary matrix that maps each individual onto the random effect variable levels. It is not intended
for this function to be called by the user directly, but it can be useful to debug mappings between random effect
levels and input variables.
initializeFullZ(Z, cluster_levels, stand.cols = FALSE)
initializeFullZ(Z, cluster_levels, stand.cols = FALSE)
Z |
A |
cluster_levels |
A |
stand.cols |
A logical scalar that determines if Z* should be computed which is the row-centered and scaled version of the full Z matrix |
To make sure that matrices commute it is necessary to construct the full n x c*q
matrix. This is a binary
matrix where each level of each random effect occupies a column, and the samples/observations are mapped onto
the correct levels based on the input Z.
matrix
Fully broadcast Z matrix with one column per random effect level for all random effect variables
in the model.
Mike Morgan & Alice Kluzer
data(sim_nbglmm) random.levels <- list("RE1"=paste("RE1", levels(as.factor(sim_nbglmm$RE1)), sep="_"), "RE2"=paste("RE2", levels(as.factor(sim_nbglmm$RE2)), sep="_")) Z <- as.matrix(data.frame("RE1"=paste("RE1", as.numeric(sim_nbglmm$RE1), sep="_"), "RE2"=paste("RE2", as.numeric(sim_nbglmm$RE2), sep="_"))) fullZ <- initializeFullZ(Z, random.levels) dim(Z) dim(fullZ)
data(sim_nbglmm) random.levels <- list("RE1"=paste("RE1", levels(as.factor(sim_nbglmm$RE1)), sep="_"), "RE2"=paste("RE2", levels(as.factor(sim_nbglmm$RE2)), sep="_")) Z <- as.matrix(data.frame("RE1"=paste("RE1", as.numeric(sim_nbglmm$RE1), sep="_"), "RE2"=paste("RE2", as.numeric(sim_nbglmm$RE2), sep="_"))) fullZ <- initializeFullZ(Z, random.levels) dim(Z) dim(fullZ)
This function randomly samples vertices on a graph to define neighbourhoods. These are then refined by either computing the median profile for the neighbourhood in reduced dimensional space and selecting the nearest vertex to this position (refinement_scheme = "reduced_dim"), or by computing the vertex with the highest number of triangles within the neighborhood (refinement_scheme = "graph"). Thus, multiple neighbourhoods may be collapsed down together to prevent over-sampling the graph space.
makeNhoods( x, prop = 0.1, k = 21, d = 30, refined = TRUE, reduced_dims = "PCA", refinement_scheme = "reduced_dim" )
makeNhoods( x, prop = 0.1, k = 21, d = 30, refined = TRUE, reduced_dims = "PCA", refinement_scheme = "reduced_dim" )
x |
A |
prop |
A double scalar that defines what proportion of graph vertices to randomly sample. Must be 0 < prop < 1. |
k |
An integer scalar - the same k used to construct the input graph. |
d |
The number of dimensions to use if the input is a matrix of cells X reduced dimensions. |
refined |
A logical scalar that determines the sampling behavior, default=TRUE implements a refined sampling scheme, specified by the refinement_scheme argument. |
reduced_dims |
If x is an |
refinement_scheme |
A character scalar that defines the sampling scheme, either "reduced_dim" or "graph". Default is "reduced_dim". |
This function randomly samples graph vertices, then refines them to collapse
down the number of neighbourhoods to be tested. The refinement behaviour can
be turned off by setting refine=FALSE
, however, we do not recommend
this as neighbourhoods will contain a lot of redundancy and lead to an
unnecessarily larger multiple-testing burden.
A Milo
object containing a list of vertices and
the indices of vertices that constitute the neighbourhoods in the
nhoods slot. If the input is a igraph
object then the output
is a matrix containing a list of vertices and the indices of vertices that constitute the
neighbourhoods.
Emma Dann, Mike Morgan
require(igraph) m <- matrix(rnorm(100000), ncol=100) milo <- buildGraph(m, d=10) milo <- makeNhoods(milo, prop=0.1) milo
require(igraph) m <- matrix(rnorm(100000), ncol=100) milo <- buildGraph(m, d=10) milo <- makeNhoods(milo, prop=0.1) milo
Exactly what it says on the tin - compute the sum of the matrix diagonal
matrix.trace(x)
matrix.trace(x)
x |
A |
It computes the matrix trace of a square matrix.
numeric
scalar of the matrix trace.
Mike Morgan
matrix.trace(matrix(runif(9), ncol=3, nrow=3))
matrix.trace(matrix(runif(9), ncol=3, nrow=3))
The Milo class extends the SingleCellExperiment class and is designed to work with neighbourhoods of cells. Therefore, it inherits from the SingleCellExperiment class and follows the same usage conventions. There is additional support for cell-to-cell distances via distance, and the KNN-graph used to define the neighbourhoods.
Milo( ..., graph = list(), nhoodDistances = Matrix(0L, sparse = TRUE), nhoods = Matrix(0L, sparse = TRUE), nhoodCounts = Matrix(0L, sparse = TRUE), nhoodIndex = list(), nhoodExpression = Matrix(0L, sparse = TRUE), .k = NULL )
Milo( ..., graph = list(), nhoodDistances = Matrix(0L, sparse = TRUE), nhoods = Matrix(0L, sparse = TRUE), nhoodCounts = Matrix(0L, sparse = TRUE), nhoodIndex = list(), nhoodExpression = Matrix(0L, sparse = TRUE), .k = NULL )
... |
Arguments passed to the Milo constructor to fill the slots of the
base class. This should be either a |
graph |
An igraph object or list of adjacent vertices that represents the KNN-graph |
nhoodDistances |
A list containing sparse matrices of cell-to-cell distances for cells in the same neighbourhoods, one list entry per neighbourhood. |
nhoods |
A list of graph vertices, each containing the indices of the constiuent graph vertices in the respective neighbourhood |
nhoodCounts |
A matrix of neighbourhood X sample counts of the number of cells in each neighbourhood derived from the respective samples |
nhoodIndex |
A list of cells that are the neighborhood index cells. |
nhoodExpression |
A matrix of gene X neighbourhood expression. |
.k |
An integer value. The same value used to build the k-NN graph if already computed. |
In this class the underlying structure is the gene/feature X cell expression data. The additional slots provide a link between these single cells and the neighbourhood representation. This can be further extended by the use of an abstracted graph for visualisation that preserves the structure of the single-cell KNN-graph
A Milo object can also be constructed by inputting a feature X cell gene expression matrix. In this case it simply constructs a SingleCellExperiment and fills the relevant slots, such as reducedDims.
a Milo object
Mike Morgan
library(SingleCellExperiment) ux <- matrix(rpois(12000, 5), ncol=200) vx <- log2(ux + 1) pca <- prcomp(t(vx)) sce <- SingleCellExperiment(assays=list(counts=ux, logcounts=vx), reducedDims=SimpleList(PCA=pca$x)) milo <- Milo(sce) milo
library(SingleCellExperiment) ux <- matrix(rpois(12000, 5), ncol=200) vx <- log2(ux + 1) pca <- prcomp(t(vx)) sce <- SingleCellExperiment(assays=list(counts=ux, logcounts=vx), reducedDims=SimpleList(PCA=pca$x)) milo <- Milo(sce) milo
Get and set methods for Milo object slots. Generally speaking these methods are used internally, but they allow the user to assign their own externally computed values - should be used with caution.
See individual methods for return values
In the following descriptions x
is always a Milo object.
graph(x)
:Returns an igraph
object representation of the
KNN-graph, with number of vertices equal to the number of single-cells.
nhoodDistances(x)
:Returns a list of sparse matrix of cell-to-cell distances
between nearest neighbours, one list entry per neighbourhood. Largely used internally for computing the k-distance
weighting in graphSpatialFDR
.
nhoodCounts(x)
:Returns a NxM sparse matrix of cell counts in
each of N
neighbourhoods with respect to the M
experimental samples defined.
nhoodExpression(x)
:Returns a GxN matrix of gene expression values.
nhoodIndex(x)
:Returns a list of the single-cells that are the neighbourhood indices.
nhoodReducedDim(x)
:Returns an NxP matrix of reduced dimension positions. Either
generated by projectNhoodExpression(x)
or by providing an NxP matrix (see
setter method below).
nhoods(x)
:Returns a sparse matrix of CxN
mapping of C
single-cells toN
neighbourhoods.
nhoodGraph(x)
:Returns an igraph
object representation of the
graph of neighbourhoods, with number of vertices equal to the number of neighbourhoods.
nhoodAdjacency(x)
:Returns a matrix of N
by N
neighbourhoods with entries
of 1 where neighbourhods share cells, and 0 elsewhere.
In the following descriptions x
is always a Milo object.
graph(x) <- value
:Populates the graph slot with value
-
this should be a valid graph representation in either igraph
or list
format.
nhoodDistances(x) <- value
:Replaces the internally comptued neighbourhood distances. This is normally computed internally during graph building, but can be defined externally. Must be a list with one entry per neighbourhood containing the cell-to-cell distances for the cells within that neighbourhood.
nhoodCounts(x) <- value
:Replaces the neighbourhood counts matrix.
This is normally computed and assigned by countCells
, however, it can also be user-defined.
nhoodExpression(x) <- value
:Replaces the nhoodExpression
slot. This is calculated
internally by calcNhoodExpression
, which calculates the mean
expression. An alternative
summary function can be used to assign an alternative in this way.
nhoodIndex(x) <- value
:Replaces the list of neighbourhood indices. This is provided
purely for completeness, and is usually only set internally in makeNhoods
.
nhoodReducedDim(x) <- value
:Replaces the reduced dimensional representation or projection of neighbourhoods. This can be useful for externally computed projections or representations.
nhoods(x) <- value
:Replaces the neighbourhood matrix. Generally use of this function is discouraged, however, it may be useful for users to define their own bespoke neighbourhoods by some means.
nhoodGraph(x) <- value
:Populates the nhoodGraph slot with value
-
this should be a valid graph representation in either igraph
or list
format.
nhoodAdjacency(x) <- value
:Populates the nhoodAdjacency slot with value
-
this should be a N
by N
matrix with elements denoting which neighbourhoods share cells
A collection of non-getter and setter methods that operate on Milo objects.
show(x)
:Prints information to the console regarding the Milo
object.
Mike Morgan
example(Milo, echo=FALSE) show(milo)
example(Milo, echo=FALSE) show(milo)
Milo performs single-cell differential abundance testing. Cell states are modelled as representative neighbourhoods on a nearest neighbour graph. Hypothesis testing is performed using a negative bionomial generalized linear model.
This function constructs a beeswarm plot using the ggplot engine to visualise the distribution of log fold changes across neighbourhood annotations.
plotDAbeeswarm(da.res, group.by = NULL, alpha = 0.1, subset.nhoods = NULL)
plotDAbeeswarm(da.res, group.by = NULL, alpha = 0.1, subset.nhoods = NULL)
da.res |
a data.frame of DA testing results |
group.by |
a character scalar determining which column of |
alpha |
significance level for Spatial FDR (default: 0.1) |
subset.nhoods |
A logical, integer or character vector indicating a subset of nhoods to show in plot (default: NULL, no subsetting) |
The group.by variable will be coerced to a factor. If you want the variables in group.by to be in a given order make sure you set the column to a factor with the levels in the right order before running the function.
a ggplot
object
Emma Dann
NULL
NULL
Plot the number of cells in a neighbourhood per sample and condition
plotNhoodCounts(x, subset.nhoods, design.df, condition, n_col = 3)
plotNhoodCounts(x, subset.nhoods, design.df, condition, n_col = 3)
x |
A |
subset.nhoods |
A logical, integer or character vector indicating the rows of |
design.df |
A |
condition |
String specifying the condition of interest Has to be a column in the |
n_col |
Number of columns in the output |
A ggplot-class
object
Nick Hirschmüller
require(SingleCellExperiment) ux.1 <- matrix(rpois(12000, 5), ncol=300) ux.2 <- matrix(rpois(12000, 4), ncol=300) ux <- rbind(ux.1, ux.2) vx <- log2(ux + 1) pca <- prcomp(t(vx)) sce <- SingleCellExperiment(assays=list(counts=ux, logcounts=vx), reducedDims=SimpleList(PCA=pca$x)) milo <- Milo(sce) milo <- buildGraph(milo, k=20, d=10, transposed=TRUE) milo <- makeNhoods(milo, k=20, d=10, prop=0.3) milo <- calcNhoodDistance(milo, d=10) cond <- sample(c("A","B","C"),300,replace=TRUE) meta.df <- data.frame(Condition=cond, Replicate=c(rep("R1", 100), rep("R2", 100), rep("R3", 100))) meta.df$SampID <- paste(meta.df$Condition, meta.df$Replicate, sep="_") milo <- countCells(milo, meta.data=meta.df, samples="SampID") design.mtx <- data.frame("Condition"=c(rep("A", 3), rep("B", 3), rep("C",3)), "Replicate"=rep(c("R1", "R2", "R3"), 3)) design.mtx$SampID <- paste(design.mtx$Condition, design.mtx$Replicate, sep="_") rownames(design.mtx) <- design.mtx$SampID plotNhoodCounts(x = milo, subset.nhoods = c(1,2), design.df = design.mtx, condition = "Condition")
require(SingleCellExperiment) ux.1 <- matrix(rpois(12000, 5), ncol=300) ux.2 <- matrix(rpois(12000, 4), ncol=300) ux <- rbind(ux.1, ux.2) vx <- log2(ux + 1) pca <- prcomp(t(vx)) sce <- SingleCellExperiment(assays=list(counts=ux, logcounts=vx), reducedDims=SimpleList(PCA=pca$x)) milo <- Milo(sce) milo <- buildGraph(milo, k=20, d=10, transposed=TRUE) milo <- makeNhoods(milo, k=20, d=10, prop=0.3) milo <- calcNhoodDistance(milo, d=10) cond <- sample(c("A","B","C"),300,replace=TRUE) meta.df <- data.frame(Condition=cond, Replicate=c(rep("R1", 100), rep("R2", 100), rep("R3", 100))) meta.df$SampID <- paste(meta.df$Condition, meta.df$Replicate, sep="_") milo <- countCells(milo, meta.data=meta.df, samples="SampID") design.mtx <- data.frame("Condition"=c(rep("A", 3), rep("B", 3), rep("C",3)), "Replicate"=rep(c("R1", "R2", "R3"), 3)) design.mtx$SampID <- paste(design.mtx$Condition, design.mtx$Replicate, sep="_") rownames(design.mtx) <- design.mtx$SampID plotNhoodCounts(x = milo, subset.nhoods = c(1,2), design.df = design.mtx, condition = "Condition")
Plots the average gene expression in neighbourhoods, sorted by DA fold-change
Plots the average gene expression in neighbourhood groups
plotNhoodExpressionDA( x, da.res, features, alpha = 0.1, subset.nhoods = NULL, cluster_features = FALSE, assay = "logcounts", scale_to_1 = FALSE, show_rownames = TRUE, highlight_features = NULL ) plotNhoodExpressionGroups( x, da.res, features, alpha = 0.1, subset.nhoods = NULL, cluster_features = FALSE, assay = "logcounts", scale_to_1 = FALSE, show_rownames = TRUE, highlight_features = NULL, grid.space = "free" )
plotNhoodExpressionDA( x, da.res, features, alpha = 0.1, subset.nhoods = NULL, cluster_features = FALSE, assay = "logcounts", scale_to_1 = FALSE, show_rownames = TRUE, highlight_features = NULL ) plotNhoodExpressionGroups( x, da.res, features, alpha = 0.1, subset.nhoods = NULL, cluster_features = FALSE, assay = "logcounts", scale_to_1 = FALSE, show_rownames = TRUE, highlight_features = NULL, grid.space = "free" )
x |
A |
da.res |
a data.frame of DA testing results |
features |
a character vector of features to plot (they must be in rownames(x)) |
alpha |
significance level for Spatial FDR (default: 0.1) |
subset.nhoods |
A logical, integer or character vector indicating a subset of nhoods to show in plot (default: NULL, no subsetting) |
cluster_features |
logical indicating whether features should be clustered with hierarchical clustering.
If FALSE then the order in |
assay |
A character scalar that describes the assay slot to use for calculating neighbourhood expression.
(default: logcounts)
Of note: neighbourhood expression will be computed only if the requested features are not in the |
scale_to_1 |
A logical scalar to re-scale gene expression values between 0 and 1 for visualisation. |
show_rownames |
A logical scalar whether to plot rownames or not. Generally useful to set this to
|
highlight_features |
A character vector of feature names that should be highlighted on the right side of
the heatmap. Generally useful in conjunction to |
grid.space |
a character setting the |
a ggplot
object
a ggplot
object
Emma Dann
NULL NULL
NULL NULL
Visualize graph of neighbourhoods
plotNhoodGraph( x, layout = "UMAP", colour_by = NA, subset.nhoods = NULL, size_range = c(0.5, 3), node_stroke = 0.3, is.da = FALSE, highlight.da = FALSE, ... )
plotNhoodGraph( x, layout = "UMAP", colour_by = NA, subset.nhoods = NULL, size_range = c(0.5, 3), node_stroke = 0.3, is.da = FALSE, highlight.da = FALSE, ... )
x |
A |
layout |
this can be (a) a character indicating the name of the |
colour_by |
this can be a data.frame of milo results or a character corresponding to a column in colData |
subset.nhoods |
A logical, integer or character vector indicating a subset of nhoods to show in plot
(default: NULL, no subsetting). This is necessary if |
size_range |
a numeric vector indicating the range of node sizes to use for plotting (to avoid overplotting in the graph) |
node_stroke |
a numeric indicating the desired thickness of the border around each node |
is.da |
logical scalar that tells plotNhoodGraph to order nhoods by |LFC| which can help to visually emphasise which nhoods are DA. |
highlight.da |
logical or numeric scalar that emphasises the DA nhoods in the layout by adjusting the transparency
of the non-DA nhoods. Can only be used if |
... |
arguments to pass to |
a ggplot-class
object
Emma Dann
NULL
NULL
Visualize log-FC estimated with differential nhood abundance testing on embedding of original single-cell dataset.
plotNhoodGraphDA(x, milo_res, alpha = 0.05, res_column = "logFC", ...)
plotNhoodGraphDA(x, milo_res, alpha = 0.05, res_column = "logFC", ...)
x |
A |
milo_res |
a data.frame of milo results |
alpha |
significance level for Spatial FDR (default: 0.05) |
res_column |
which column of |
... |
arguments to pass to |
a ggplot
object
Emma Dann
NULL
NULL
Visualize grouping of neighbourhoods obtained with groupNhoods
plotNhoodGroups(x, milo_res, show_groups = NULL, ...)
plotNhoodGroups(x, milo_res, show_groups = NULL, ...)
x |
A |
milo_res |
a data.frame of milo results containing the |
show_groups |
a character vector indicating which groups to plot all other neighbourhoods will be gray |
... |
arguments to pass to |
a ggplot
object
Emma Dann
NULL
NULL
Make an MAplot to visualise the relationship between DA log fold changes and neighbourhood abundance. This is a useful way to diagnose issues with the DA testing, such as large compositional biases and/or issues relating to large imbalances in numbers of cells between condition labels/levels.
plotNhoodMA(da.res, alpha = 0.05, null.mean = 0)
plotNhoodMA(da.res, alpha = 0.05, null.mean = 0)
da.res |
A data.frame of DA testing results |
alpha |
A numeric scalar that represents the Spatial FDR threshold for statistical significance. |
null.mean |
A numeric scalar determining the expected value of the log fold change under the null
hypothesis. |
MA plots provide a useful means to evaluate the distribution of log fold changes after differential abundance testing. In particular, they can be used to diagnose global shifts that occur in the presence of confounding between the number of cells acquired and the experimental variable of interest. The expected null value for the log FC distribution (grey dashed line), along with the mean observed log fold change for non-DA neighbourhoods (purple dashed line) are plotted for reference. The deviation between these two lines can give an indication of biases in the results, such as in the presence of a single strong region of DA leading to an increase in false positive DA neighbourhoods in the opposite direction.
a ggplot
object
Mike Morgan
NULL
NULL
This function plots the histogram of the number of cells belonging to each neighbourhood
plotNhoodSizeHist(milo, bins = 50)
plotNhoodSizeHist(milo, bins = 50)
milo |
A |
bins |
number of bins for |
A ggplot-class
object
Emma Dann
require(igraph) require(SingleCellExperiment) ux.1 <- matrix(rpois(12000, 5), ncol=400) ux.2 <- matrix(rpois(12000, 4), ncol=400) ux <- rbind(ux.1, ux.2) vx <- log2(ux + 1) pca <- prcomp(t(vx)) sce <- SingleCellExperiment(assays=list(counts=ux, logcounts=vx), reducedDims=SimpleList(PCA=pca$x)) colnames(sce) <- paste0("Cell", seq_len(ncol(sce))) milo <- Milo(sce) milo <- buildGraph(milo, k=20, d=10, transposed=TRUE) milo <- makeNhoods(milo, d=10, prop=0.1) plotNhoodSizeHist(milo)
require(igraph) require(SingleCellExperiment) ux.1 <- matrix(rpois(12000, 5), ncol=400) ux.2 <- matrix(rpois(12000, 4), ncol=400) ux <- rbind(ux.1, ux.2) vx <- log2(ux + 1) pca <- prcomp(t(vx)) sce <- SingleCellExperiment(assays=list(counts=ux, logcounts=vx), reducedDims=SimpleList(PCA=pca$x)) colnames(sce) <- paste0("Cell", seq_len(ncol(sce))) milo <- Milo(sce) milo <- buildGraph(milo, k=20, d=10, transposed=TRUE) milo <- makeNhoods(milo, d=10, prop=0.1) plotNhoodSizeHist(milo)
This function is not intended to be called by the user, and is included for reference
Satterthwaite_df( coeff.mat, mint, cint, SE, curr_sigma, curr_beta, V_partial, V_a, G_inv, random.levels )
Satterthwaite_df( coeff.mat, mint, cint, SE, curr_sigma, curr_beta, V_partial, V_a, G_inv, random.levels )
coeff.mat |
A |
mint |
A numeric scalar of the number of fixed effect variables in the model |
cint |
A numeric scalar of the number of random effect variables in the model |
SE |
A |
curr_sigma |
A |
curr_beta |
A |
V_partial |
A |
V_a |
A |
G_inv |
A |
random.levels |
A |
The Satterthwaite degrees of freedom are computed, which estimates the numbers of degrees of freedom in the NB-GLMM based on ratio of the squared standard errors and the product of the Jacobians of the variance-covariance matrix from the fixed effect variable parameter estimation with full variance-covariance matrix. For more details see Satterthwaite FE, Biometrics Bulletin (1946) Vol 2 No 6, pp110-114.
matrix
containing the inferred number of degrees of freedom for the specific model.
Mike Morgan & Alice Kluzer
NULL
NULL
Simulated discrete groups data
data(sim_discrete)
data(sim_discrete)
A list containing a Milo
object in the "mylo" slot,
and a data.frame
containing experimental meta-data in the "meta" slot.
Data are simulated single-cells in 4 distinct groups of cells. Cells in each group are assigned to 1 of 2 conditions: A or B. Specifically, the cells in block 1 are highly abundant in the A condition, whilst cells in block 4 are most abundant in condition B.
NULL
NULL
Simulated counts data from a series of simulated family trees
data(sim_family)
data(sim_family)
A list containing a data.frame
in the "DF" slot containing the
mean counts and meta-data, and a matrix
containing the kinship matrix
across all families in the "IBD" slot.
Data are simulated counts from 30 families and includes X and Z design matrices, as well as a single large kinship matrix. Kinships between family members are dictated by the simulated family, i.e. sibs=0.5, parent-sib=0.5, sib-grandparent=0.25, etc. These kinships, along with 2 other random effects, are used to induce a defined covariance between simulated obserations as such:
Z:= random effect design matrix, n X q G:= matrix of variance components, including kinship matrix
LL^T = Chol(ZGZ^T) := the Cholesky decomposition of the random effect contribution to the sample covariance Ysim:= simulated means based on exp(offset + Xbeta + Zb) Y = LYsim := simulated means with defined covariance
NULL
NULL
Simulated counts data from a NB-GLMM for a single trait
data(sim_nbglmm)
data(sim_nbglmm)
A data.frame
sim_nbglmm containing the following columns:
Mean:
numeric
containing the base mean computed as the linear combination of the
simulated fixed and random effect weights multiplied by their respective weight matrices.
Mean.Count:
numeric
containing the integer count values randomly sampled from a negative
binomail distribution with mean = Mean and dispersion = r
r:
numeric
containing the dispersion value used to simulate the integer counts in
Mean.Count.
Intercept:
numeric
of all 1s which can be used to set the intercept term in the X design
matrix.
FE1:
numeric
a binary fixed effect variable taking on values [0, 1]
FE2:
numeric
a continuous fixed effect variables
RE1:
numeric
a random effect variable with 10 levels
RE2:
numeric
a random effect variable with 7 levels
Data are simulated counts from 50 samples in a single data frame, from which the X and Z design matrices, can be constructed (see examples). There are 2 random effects and 2 fixed effect variables used to simulate the count trait.
data(sim_nbglmm) head(sim_nbglmm)
data(sim_nbglmm) head(sim_nbglmm)
Data are simulated single-cells along a single linear trajectory. Cells are
simulated from 5 groups, and assigned to 1 of 2 conditions; A or B.
Data were generated using in the simulate_linear_trajectory
function in the dyntoy
package.
data(sim_trajectory)
data(sim_trajectory)
A list containing a Milo
object in the "mylo" slot,
and a data.frame
containing experimental meta-data in the "meta" slot.
https://github.com/dynverse/dyntoy
NULL
NULL
This function will perform differential gene expression analysis within
differentially abundant neighbourhoods, by first aggregating adjacent and
concordantly DA neighbourhoods, then comparing cells within these
aggregated groups for differential gene expression using the input design. For
comparing between DA neighbourhoods see findNhoodMarkers
.
testDiffExp( x, da.res, design, meta.data, model.contrasts = NULL, assay = "logcounts", subset.nhoods = NULL, subset.row = NULL, gene.offset = TRUE, n.coef = NULL, na.function = "na.pass" )
testDiffExp( x, da.res, design, meta.data, model.contrasts = NULL, assay = "logcounts", subset.nhoods = NULL, subset.row = NULL, gene.offset = TRUE, n.coef = NULL, na.function = "na.pass" )
x |
A |
da.res |
A |
design |
A |
meta.data |
A cell X variable |
model.contrasts |
A string vector that defines the contrasts used to perform DA testing. This should be the same as was used for DA testing. |
assay |
A character scalar determining which |
subset.nhoods |
A logical, integer or character vector indicating which neighbourhoods to subset before aggregation and DGE testing (default: NULL). |
subset.row |
A logical, integer or character vector indicating the rows
of |
gene.offset |
A logical scalar the determines whether a per-cell offset is provided in the DGE GLM to adjust for the number of detected genes with expression > 0. |
n.coef |
A numeric scalar refering to the coefficient to select from the DGE model. This is especially pertinent when passing an ordered variable and only one specific type of effects are to be tested. |
na.function |
A valid NA action function to apply, should be one of
|
Adjacent neighbourhoods are first merged based on two criteria: 1) they share at
least overlap
number of cells, and 2) the DA log fold change sign is concordant.
This behaviour can be modulated by setting overlap
to be more or less stringent.
Additionally, a threshold on the log fold-changes can be set, such that lfc.threshold
is required to merge adjacent neighbourhoods. Note: adjacent neighbourhoods will never be
merged with opposite signs unless merge.discord=TRUE
.
Within each aggregated group of cells differential gene expression testing is performed
using the single-cell log normalized gene expression with a GLM
(for details see limma-package
), or the single-cell counts using a
negative binomial GLM (for details see edgeR-package
). When using
single-cell data for DGE it is recommended to set gene.offset=TRUE
as this
behaviour adjusts the model by the number of detected genes in each cell as a proxy for
differences in capture efficiency and cellular RNA content.
A list
containing a data.frame
of DGE results for each aggregated
group of neighbourhoods.
Mike Morgan & Emma Dann
data(sim_discrete) milo <- Milo(sim_discrete$SCE) milo <- buildGraph(milo, k=20, d=10, transposed=TRUE) milo <- makeNhoods(milo, k=20, d=10, prop=0.3) meta.df <- sim_discrete$meta meta.df$SampID <- paste(meta.df$Condition, meta.df$Replicate, sep="_") milo <- countCells(milo, meta.data=meta.df, samples="SampID") test.meta <- data.frame("Condition"=c(rep("A", 3), rep("B", 3)), "Replicate"=rep(c("R1", "R2", "R3"), 2)) test.meta$Sample <- paste(test.meta$Condition, test.meta$Replicate, sep="_") rownames(test.meta) <- test.meta$Sample da.res <- testNhoods(milo, design=~Condition, design.df=test.meta[colnames(nhoodCounts(milo)), ]) da.res <- groupNhoods(milo, da.res, da.fdr=0.1) nhood.dge <- testDiffExp(milo, da.res, design=~Condition, meta.data=meta.df) nhood.dge
data(sim_discrete) milo <- Milo(sim_discrete$SCE) milo <- buildGraph(milo, k=20, d=10, transposed=TRUE) milo <- makeNhoods(milo, k=20, d=10, prop=0.3) meta.df <- sim_discrete$meta meta.df$SampID <- paste(meta.df$Condition, meta.df$Replicate, sep="_") milo <- countCells(milo, meta.data=meta.df, samples="SampID") test.meta <- data.frame("Condition"=c(rep("A", 3), rep("B", 3)), "Replicate"=rep(c("R1", "R2", "R3"), 2)) test.meta$Sample <- paste(test.meta$Condition, test.meta$Replicate, sep="_") rownames(test.meta) <- test.meta$Sample da.res <- testNhoods(milo, design=~Condition, design.df=test.meta[colnames(nhoodCounts(milo)), ]) da.res <- groupNhoods(milo, da.res, da.fdr=0.1) nhood.dge <- testDiffExp(milo, da.res, design=~Condition, meta.data=meta.df) nhood.dge
This will perform differential neighbourhood abundance testing after cell counting.
x |
A |
design |
A |
design.df |
A |
kinship |
(optional) An n X n |
min.mean |
A scalar used to threshold neighbourhoods on the minimum average cell counts across samples. |
model.contrasts |
A string vector that defines the contrasts used to perform
DA testing. For a specific comparison we recommend a single contrast be passed to
|
fdr.weighting |
The spatial FDR weighting scheme to use. Choice from max,
neighbour-distance, graph-overlap or k-distance (default). If |
robust |
If robust=TRUE then this is passed to edgeR and limma which use a robust
estimation for the global quasilikelihood dispersion distribution. See |
norm.method |
A character scalar, either |
cell.sizes |
A named numeric vector of cell numbers per experimental samples. Names should correspond
to the columns of |
reduced.dim |
A character scalar referring to the reduced dimensional slot used to compute distances for the spatial FDR. This should be the same as used for graph building. |
REML |
A logical scalar that controls the variance component behaviour to use either restricted maximum likelihood (REML) or maximum likelihood (ML). The former is recommened to account for the bias in the ML variance estimates. |
glmm.solver |
A character scalar that determines which GLMM solver is applied. Must be one of: Fisher, HE or HE-NNLS. HE or HE-NNLS are recommended when supplying a user-defined covariance matrix. |
max.iters |
A scalar that determines the maximum number of iterations to run the GLMM solver if it does not reach the convergence tolerance threshold. |
max.tol |
A scalar that deterimines the GLMM solver convergence tolerance. It is recommended to keep this number small to provide some confidence that the parameter estimates are at least in a feasible region and close to a local optimum |
subset.nhoods |
A character, numeric or logical vector that will subset the analysis to the specific nhoods. If
a character vector these should correspond to row names of |
intercept.type |
A character scalar, either fixed or random that sets the type of the global
intercept variable in the model. This only applies to the GLMM case where additional random effects variables are
already included. Setting |
fail.on.error |
A logical scalar the determines the behaviour of the error reporting. Used for debugging only. |
BPPARAM |
A BiocParallelParam object specifying the arguments for parallelisation. By default
this will evaluate using |
force |
A logical scalar that overrides the default behaviour to nicely error when N < 50 and using a mixed effect model. This is because model parameter estimation may be unstable with these sample sizes, and hence the fixed effect GLM is recommended instead. If used with the LMM, a warning will be produced. |
This function wraps up several steps of differential abundance testing using
the edgeR
functions. These could be performed separately for users
who want to exercise more contol over their DA testing. By default this
function sets the lib.sizes
to the colSums(x), and uses the
Quasi-Likelihood F-test in glmQLFTest
for DA testing. FDR correction
is performed separately as the default multiple-testing correction is
inappropriate for neighbourhoods with overlapping cells.
The GLMM testing cannot be performed using edgeR
, however, a separate
function fitGLMM
can be used to fit a mixed effect model to each
nhood (see fitGLMM
docs for details).
Parallelisation is currently only enabled for the NB-GLMM and uses the BiocParallel paradigm at
the level of R, and OpenMP to allow multi-threading of RCpp code. In general the GLM implementation
in glmQLFit
is sufficiently fast that it does not require
parallelisation. Parallelisation requires the user to pass a BiocParallelParam object
with the parallelisation arguments contained therein. This relies on the user specifying how to
parallelise - for details see the BiocParallel
package.
model.contrasts
are used to define specific comparisons for DA testing. Currently,
testNhoods
will take the last formula variable for comparisons, however, contrasts
need this to be the first variable. A future update will harmonise these behaviours for
consistency. While it is strictly feasible to compute multiple contrasts at once, the
recommendation, for ease of interpretability, is to compute one at a time.
If using the GLMM option, i.e. including a random effect variable in the design
formula, then testNhoods
will check for the sample size of the analysis. If this is
less than 60 it will stop and produce an error. It is strongly recommended that
the GLMM is not used with relatively small sample sizes, i.e. N<60, and even up to N~100
may have unstable parameter estimates across nhoods. This behaviour can be overriden by
setting force=TRUE
, but also be aware that parameter estimates may not be
accurate. A warning will be produced to alert you to this fact.
A data.frame
of model results, which contain:
logFC
:Numeric, the log fold change between conditions, or for an ordered/continous variable the per-unit change in (normalized) cell counts per unit-change in experimental variable.
logCPM
:Numeric, the log counts per million (CPM), which equates to the average log normalized cell counts across all samples.
F
:Numeric, the F-test statistic from the quali-likelihood F-test
implemented in edgeR
.
PValue
:Numeric, the unadjusted p-value from the quasi-likelihood F-test.
FDR
:Numeric, the Benjamini & Hochberg false discovery weight
computed from p.adjust
.
Nhood
:Numeric, a unique identifier corresponding to the specific graph neighbourhood.
SpatialFDR
:Numeric, the weighted FDR, computed to adjust for spatial graph overlaps between neighbourhoods. For details see graphSpatialFDR.
Mike Morgan
library(SingleCellExperiment) ux.1 <- matrix(rpois(12000, 5), ncol=400) ux.2 <- matrix(rpois(12000, 4), ncol=400) ux <- rbind(ux.1, ux.2) vx <- log2(ux + 1) pca <- prcomp(t(vx)) sce <- SingleCellExperiment(assays=list(counts=ux, logcounts=vx), reducedDims=SimpleList(PCA=pca$x)) milo <- Milo(sce) milo <- buildGraph(milo, k=20, d=10, transposed=TRUE) milo <- makeNhoods(milo, k=20, d=10, prop=0.3) milo <- calcNhoodDistance(milo, d=10) cond <- rep("A", ncol(milo)) cond.a <- sample(1:ncol(milo), size=floor(ncol(milo)*0.25)) cond.b <- setdiff(1:ncol(milo), cond.a) cond[cond.b] <- "B" meta.df <- data.frame(Condition=cond, Replicate=c(rep("R1", 132), rep("R2", 132), rep("R3", 136))) meta.df$SampID <- paste(meta.df$Condition, meta.df$Replicate, sep="_") milo <- countCells(milo, meta.data=meta.df, samples="SampID") test.meta <- data.frame("Condition"=c(rep("A", 3), rep("B", 3)), "Replicate"=rep(c("R1", "R2", "R3"), 2)) test.meta$Sample <- paste(test.meta$Condition, test.meta$Replicate, sep="_") rownames(test.meta) <- test.meta$Sample da.res <- testNhoods(milo, design=~Condition, design.df=test.meta[colnames(nhoodCounts(milo)), ], norm.method="TMM") da.res
library(SingleCellExperiment) ux.1 <- matrix(rpois(12000, 5), ncol=400) ux.2 <- matrix(rpois(12000, 4), ncol=400) ux <- rbind(ux.1, ux.2) vx <- log2(ux + 1) pca <- prcomp(t(vx)) sce <- SingleCellExperiment(assays=list(counts=ux, logcounts=vx), reducedDims=SimpleList(PCA=pca$x)) milo <- Milo(sce) milo <- buildGraph(milo, k=20, d=10, transposed=TRUE) milo <- makeNhoods(milo, k=20, d=10, prop=0.3) milo <- calcNhoodDistance(milo, d=10) cond <- rep("A", ncol(milo)) cond.a <- sample(1:ncol(milo), size=floor(ncol(milo)*0.25)) cond.b <- setdiff(1:ncol(milo), cond.a) cond[cond.b] <- "B" meta.df <- data.frame(Condition=cond, Replicate=c(rep("R1", 132), rep("R2", 132), rep("R3", 136))) meta.df$SampID <- paste(meta.df$Condition, meta.df$Replicate, sep="_") milo <- countCells(milo, meta.data=meta.df, samples="SampID") test.meta <- data.frame("Condition"=c(rep("A", 3), rep("B", 3)), "Replicate"=rep(c("R1", "R2", "R3"), 2)) test.meta$Sample <- paste(test.meta$Condition, test.meta$Replicate, sep="_") rownames(test.meta) <- test.meta$Sample da.res <- testNhoods(milo, design=~Condition, design.df=test.meta[colnames(nhoodCounts(milo)), ], norm.method="TMM") da.res