| Title: | Model Gene Expression Dynamics with Spline-Based NB GLMs, GEEs, & GLMMs |
|---|---|
| Description: | Our scLANE model uses truncated power basis spline models to build flexible, interpretable models of single cell gene expression over pseudotime or latent time. The modeling architectures currently supported are Negative-binomial GLMs, GEEs, & GLMMs. Downstream analysis functionalities include model comparison, dynamic gene clustering, smoothed counts generation, gene set enrichment testing, & visualization. |
| Authors: | Jack R. Leary [aut, cre] (ORCID: <https://orcid.org/0009-0004-8821-3269>), Rhonda Bacher [ctb, fnd] (ORCID: <https://orcid.org/0000-0001-5787-476X>) |
| Maintainer: | Jack R. Leary <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 1.3.0 |
| Built: | 2026-05-30 09:38:03 UTC |
| Source: | https://github.com/bioc/scLANE |
Backward selection function for MARGE - uses the Wald information criterion (WIC).
backward_sel_WIC( Y = NULL, B_new = NULL, is.gee = FALSE, id.vec = NULL, cor.structure = NULL, theta.hat = NULL, sandwich.var = FALSE )backward_sel_WIC( Y = NULL, B_new = NULL, is.gee = FALSE, id.vec = NULL, cor.structure = NULL, theta.hat = NULL, sandwich.var = FALSE )
Y |
The response variable. Defaults to NULL. |
B_new |
The model matrix. Defaults to NULL. |
is.gee |
Is the model a GEE? Defaults to FALSE. |
id.vec |
A vector of observation IDs that is necessary for fitting a GEE model. Defaults to NULL. |
cor.structure |
The specified working correlation structure of the GEE model. Must be one of "independence", "ar1", or "exchangeable". Defaults to NULL. |
theta.hat |
An initial estimate of |
sandwich.var |
Should the sandwich variance estimator be used instead of the model-based estimator? Default to FALSE. |
backward_sel_WIC returns the Wald statistic from the fitted
model (the penalty is applied later on).
Jakub Stoklosa
David I. Warton
Jack R. Leary
Stoklosa, J. Gibb, H. Warton, D.I. Fast forward selection for Generalized Estimating Equations With a Large Number of Predictor Variables. Biometrics, 70, 110–120.
Stoklosa, J. and Warton, D.I. (2018). A generalized estimating equation approach to multivariate adaptive regression splines. Journal of Computational and Graphical Statistics, 27, 245–253.
This functions implements several bias-correction methods for the GEE sandwich variance-covariance matrix; they are to be used when the number of subjects is small or the numer of timepoints per-subject is very large.
biasCorrectGEE( fitted.model = NULL, correction.method = "kc", id.vec = NULL, cor.structure = "ar1", verbose = FALSE )biasCorrectGEE( fitted.model = NULL, correction.method = "kc", id.vec = NULL, cor.structure = "ar1", verbose = FALSE )
fitted.model |
The fitted model of class |
correction.method |
A string specifying the correction method to be used. Currently supported options are "df" and "kc". Defaults to "kc". |
id.vec |
A vector of subject IDs. Defaults to NULL. |
cor.structure |
A string specifying the correlation structure used in fitting the model. Defaults to "ar1". |
verbose |
(Optional) A Boolean specifying whether or not verbose output should be printed to the console. Occasionally useful for debugging. Defaults to FALSE. |
An object of class matrix containing the bias-corrected
variance-covariance estimates.
Jack R. Leary
This function leverages the parametric bootstrap to generate empirical confidence intervals for the random effects terms of a fitted model.
bootstrapRandomEffects( glmm.mod = NULL, id.vec = NULL, Y.offset = NULL, n.boot = 500L, alpha = 0.05, n.cores = 4L, random.seed = 312, verbose = TRUE )bootstrapRandomEffects( glmm.mod = NULL, id.vec = NULL, Y.offset = NULL, n.boot = 500L, alpha = 0.05, n.cores = 4L, random.seed = 312, verbose = TRUE )
glmm.mod |
The output from |
id.vec |
A vector of subject IDs. Defaults to NULL. |
Y.offset |
An offset to be included in the final model fit. Defaults to NULL. |
n.boot |
(Optional) The number of bootstrap resamples to generate. Defaults to 500. |
alpha |
(Optional) The desired confidence level. Defaults to good old 0.05. |
n.cores |
(Optional) The number of threads to use in parallel processing of the bootstrap resampling procedure. Defaults to 4. |
random.seed |
(Optional) The seed used to control stochasticity during bootstrap resampling. Defaults to 312. |
verbose |
(Optional) A boolean indicating whether a progress bar should be printed to the console. Defaults to TRUE. |
An object of class data.frame containing the upper and
lower quantiles of the per-subject random effects.
Jack R. Leary
data(sim_counts) data(sim_pseudotime) cell_offset <- createCellOffset(sim_counts) glmm_mod <- fitGLMM( X_pred = sim_pseudotime, Y = BiocGenerics::counts(sim_counts)[4, ], Y.offset = cell_offset, id.vec = sim_counts$subject, return.basis = TRUE ) ranef_sumy <- bootstrapRandomEffects(glmm_mod, id.vec = sim_counts$subject, Y.offset = cell_offset, n.boot = 10L, # in practice use a larger number such as 500 n.cores = 1L )data(sim_counts) data(sim_pseudotime) cell_offset <- createCellOffset(sim_counts) glmm_mod <- fitGLMM( X_pred = sim_pseudotime, Y = BiocGenerics::counts(sim_counts)[4, ], Y.offset = cell_offset, id.vec = sim_counts$subject, return.basis = TRUE ) ranef_sumy <- bootstrapRandomEffects(glmm_mod, id.vec = sim_counts$subject, Y.offset = cell_offset, n.boot = 10L, # in practice use a larger number such as 500 n.cores = 1L )
This function identifies good gene candidates for trajectory differential expression modeling by ranking genes based on their mean expression, SD of expression, and sparsity across cells.
chooseCandidateGenes( obj = NULL, group.by.subject = TRUE, id.vec = NULL, n.desired.genes = 2000L )chooseCandidateGenes( obj = NULL, group.by.subject = TRUE, id.vec = NULL, n.desired.genes = 2000L )
obj |
An object of class
|
group.by.subject |
Boolean specifying whether or not the summary statistics should be computed per-subject and then mean-aggregated. Defaults to TRUE. |
id.vec |
A vector of subject IDs. Defaults to NULL. |
n.desired.genes |
An integer specifying the number of candidate genes to return. Defaults to 2000. |
A vector of candidate gene names.
Jack R. Leary
data(sim_counts) candidate_genes <- chooseCandidateGenes(sim_counts, id.vec = sim_counts$subject)data(sim_counts) candidate_genes <- chooseCandidateGenes(sim_counts, id.vec = sim_counts$subject)
scLANE models.This function takes as input the output from
testDynamic and clusters the fitted values from the model for
each gene using one of several user-chosen algorithms. An approximately
optimal clustering is determined by iterating over reasonable hyperparameter
values & choosing the value with the highest mean silhouette score based on
the cosine distance.
clusterGenes( test.dyn.res = NULL, pt = NULL, size.factor.offset = NULL, clust.algo = "leiden", use.pca = FALSE, n.PC = 15L, lineages = NULL )clusterGenes( test.dyn.res = NULL, pt = NULL, size.factor.offset = NULL, clust.algo = "leiden", use.pca = FALSE, n.PC = 15L, lineages = NULL )
test.dyn.res |
The list returned by |
pt |
A data.frame containing the pseudotime or latent time estimates for each cell. Defaults to NULL. |
size.factor.offset |
(Optional) An offset to be used to rescale the
fitted values. Can be generated easily with |
clust.algo |
The clustering method to use. Can be one of "hclust", "kmeans", "leiden". Defaults to "leiden". |
use.pca |
Should PCA be performed prior to clustering? Defaults to FALSE. |
n.PC |
The number of principal components to use when performing dimension reduction prior to clustering. Defaults to 15. |
lineages |
Should one or more lineages be isolated? If so, specify which one(s). Otherwise, all lineages will be clustered independently. Defaults to NULL. |
Due to some peculiarities of how the fitted values (on the link scale)
are generated for geeM models, it's not necessary to multiply them by
the offset as this is done internally. For GLM & GEE models, the opposite is
true, and size.factor.offset must be provided in order to rescale the
fitted values correctly.
A data.frame of with three columns: Gene, Lineage,
and Cluster.
Jack Leary
data(sim_counts) data(scLANE_models) data(sim_pseudotime) cell_offset <- createCellOffset(sim_counts) gene_clusters <- clusterGenes(scLANE_models, pt = sim_pseudotime, size.factor.offset = cell_offset )data(sim_counts) data(scLANE_models) data(sim_pseudotime) cell_offset <- createCellOffset(sim_counts) gene_clusters <- clusterGenes(scLANE_models, pt = sim_pseudotime, size.factor.offset = cell_offset )
Creates a vector of per-cell size factors to be used as input
to testDynamic as a model offset given a variety of inputs.
createCellOffset(expr.mat = NULL, scale.factor = 10000)createCellOffset(expr.mat = NULL, scale.factor = 10000)
expr.mat |
Either a (sparse or dense) matrix of raw integer counts
(cells as columns), a |
scale.factor |
The scaling factor use to multiply the sequencing depth factor for each cell. The default value is 1e4, which returns counts-per-10k. |
A named numeric vector containing the computed size factor for each cell.
Jack R. Leary
data(sim_counts) cell_offset <- createCellOffset(sim_counts)data(sim_counts) cell_offset <- createCellOffset(sim_counts)
marge model.Creates a data.frame of marge model breakpoints,
p-values, and other info.
createSlopeTestData( marge.model = NULL, pt = NULL, is.gee = FALSE, is.glmm = FALSE )createSlopeTestData( marge.model = NULL, pt = NULL, is.gee = FALSE, is.glmm = FALSE )
marge.model |
A |
pt |
A data.frame containing pseudotime or latent time values. Defaults to NULL. |
is.gee |
Was the GEE mode used? Defaults to FALSE. |
is.glmm |
Was the GLMM mode used? Defaults to FALSE. |
A data.frame containing model data.
Jack R. Leary
Embed genes in dimension-reduced space given a smoothed counts matrix.
embedGenes( smoothed.counts = NULL, genes = NULL, pca.init = FALSE, pc.embed = 30, pc.return = 2, cluster.genes = TRUE, gene.meta.data = NULL, k.param = 20, resolution.param = NULL, random.seed = 312, n.cores = 2L )embedGenes( smoothed.counts = NULL, genes = NULL, pca.init = FALSE, pc.embed = 30, pc.return = 2, cluster.genes = TRUE, gene.meta.data = NULL, k.param = 20, resolution.param = NULL, random.seed = 312, n.cores = 2L )
smoothed.counts |
The output from |
genes |
A character vector of genes to embed. If not specified, all
genes in |
pca.init |
A boolean specifying whether or not the embedded PCs should be used as initialization for clustering and UMAP. The default is to cluster/embed the raw dynamics i.e., defaults to FALSE. |
pc.embed |
(Optional) How many PCs should be used to cluster the genes and run UMAP? Defaults to 30. |
pc.return |
(Optional) How many principal components should be included in the output? Defaults to 2. |
cluster.genes |
(Optional) Should genes be clustered in PCA space using the Leiden algorithm? Defaults to TRUE. |
gene.meta.data |
(Optional) A data.frame of metadata values for each gene (HVG status, Ensembl ID, gene biotype, etc.) that will be included in the result table. Defaults to NULL. |
k.param |
(Optional) The value of nearest-neighbors used in creating the SNN graph prior to clustering & in running UMAP. Defaults to 20. |
resolution.param |
(Optional) The value of the resolution parameter for the Leiden algorithm. If unspecified, silhouette scoring is used to select an optimal value. Defaults to NULL. |
random.seed |
(Optional) The random seed used to control stochasticity in the clustering algorithm. Defaults to 312. |
n.cores |
(Optional) Integer specifying the number of threads used by
|
A data.frame containing embedding coordinates, cluster IDs, and metadata for each gene.
Jack R. Leary
data(sim_pseudotime) data(scLANE_models) smoothed_dynamics <- smoothedCountsMatrix(scLANE_models, pt = sim_pseudotime, n.cores = 1L ) gene_embed <- embedGenes(smoothed_dynamics$Lineage_A, n.cores = 1L)data(sim_pseudotime) data(scLANE_models) smoothed_dynamics <- smoothedCountsMatrix(scLANE_models, pt = sim_pseudotime, n.cores = 1L ) gene_embed <- embedGenes(smoothed_dynamics$Lineage_A, n.cores = 1L)
scLANE.This function uses the gprofiler2 package to perform
pathway analysis on a set of genes from one or more lineages that were
determined to be dynamic with testDynamic.
enrichDynamicGenes(scLANE.de.res = NULL, lineage = NULL, species = "hsapiens")enrichDynamicGenes(scLANE.de.res = NULL, lineage = NULL, species = "hsapiens")
scLANE.de.res |
The output from |
lineage |
A character vector specifying lineages to isolate. Defaults to NULL. |
species |
The species against which to run enrichment analysis. Defaults to "hsapiens". |
The output from gost.
Jack R. Leary
data(scLANE_models) scLANE_de_res <- getResultsDE(scLANE_models) enr_res <- enrichDynamicGenes(scLANE_de_res)data(scLANE_models) scLANE_de_res <- getResultsDE(scLANE_models) enr_res <- enrichDynamicGenes(scLANE_de_res)
marge model.Extracts the breakpoints from a fitted marge model.
Note - this function relies on the name of the pseudotime variable not
having any numeric characters in it e.g., "pseudotime" or "PT" would be
fine but "pseudotime1" would not. If multiple lineages exist, use letters
to denote lineages instead of numbers e.g., "Lineage_A" and "Lineage_B".
This is currently handled automatically in testDynamic, so
don't change anything.
extractBreakpoints(model = NULL, directions = TRUE)extractBreakpoints(model = NULL, directions = TRUE)
model |
The |
directions |
Should the directions of the hinge functions also be extracted? Defaults to TRUE. |
A data.frame of breakpoints & their directions.
Jack R. Leary
data(sim_counts) data(sim_pseudotime) cell_offset <- createCellOffset(sim_counts) marge_model <- marge2(sim_pseudotime, Y = BiocGenerics::counts(sim_counts)[4, ], Y.offset = cell_offset ) breakpoint_df <- extractBreakpoints(model = marge_model)data(sim_counts) data(sim_pseudotime) cell_offset <- createCellOffset(sim_counts) marge_model <- marge2(sim_pseudotime, Y = BiocGenerics::counts(sim_counts)[4, ], Y.offset = cell_offset ) breakpoint_df <- extractBreakpoints(model = marge_model)
Fits a negative-binomial generalized linear mixed model using
truncated power basis function splines as input. The basis matrix can be
created adaptively using subject-specific estimation of optimal knots using
marge2, or basis functions can be evenly spaced across
quantiles. The resulting model can output subject-specific and
population-level fitted values.
fitGLMM( X_pred = NULL, Y = NULL, Y.offset = NULL, id.vec = NULL, adaptive = TRUE, approx.knot = TRUE, M.glm = 3, reg.penalty = "snet", return.basis = FALSE, return.GCV = FALSE, verbose = FALSE )fitGLMM( X_pred = NULL, Y = NULL, Y.offset = NULL, id.vec = NULL, adaptive = TRUE, approx.knot = TRUE, M.glm = 3, reg.penalty = "snet", return.basis = FALSE, return.GCV = FALSE, verbose = FALSE )
X_pred |
A matrix with one column containing cell ordering values. Defaults to NULL. |
Y |
A vector of raw single cell counts. Defaults to NULL. |
Y.offset |
(Optional) An offset to be included in the final model fit. Defaults to NULL. |
id.vec |
A vector of subject IDs. Defaults to NULL. |
adaptive |
Should basis functions be chosen adaptively? Defaults to TRUE. |
approx.knot |
Should knot approximation be used in the calls to
|
M.glm |
The number of possible basis functions to use in the calls to
|
reg.penalty |
(Optional) String specifying the penalty type to be used when fitting a regularized negative-binomial model to select optimal basis functions. Defaults to "snet". |
return.basis |
(Optional) Whether the basis model matrix (denoted
|
return.GCV |
(Optional) Whether the final GCV value should be returned
as part of the |
verbose |
(Optional) Should intermediate output be printed to the console? Defaults to FALSE. |
An object of class marge containing the fitted model & other
optional quantities of interest (basis function matrix, GCV, etc.).
Jack R. Leary
data(sim_counts) data(sim_pseudotime) cell_offset <- createCellOffset(sim_counts) glmm_mod <- fitGLMM( X_pred = sim_pseudotime, Y = BiocGenerics::counts(sim_counts)[4, ], Y.offset = cell_offset, id.vec = sim_counts$subject )data(sim_counts) data(sim_pseudotime) cell_offset <- createCellOffset(sim_counts) glmm_mod <- fitGLMM( X_pred = sim_pseudotime, Y = BiocGenerics::counts(sim_counts)[4, ], Y.offset = cell_offset, id.vec = sim_counts$subject )
This function computes the correlation between smoothed gene expression and gene program scores in order to identify genes are significantly associated with program scores i.e., the "drivers" of the gene program.
geneProgramDrivers( expr.mat = NULL, genes = NULL, gene.program = NULL, cor.method = "spearman", fdr.cutoff = 0.01, p.adj.method = "holm", n.cores = 2L, verbose = TRUE )geneProgramDrivers( expr.mat = NULL, genes = NULL, gene.program = NULL, cor.method = "spearman", fdr.cutoff = 0.01, p.adj.method = "holm", n.cores = 2L, verbose = TRUE )
expr.mat |
Either a |
genes |
A character vector of genes to test. Defaults to NULL. |
gene.program |
A vector of program scores as returned by
|
cor.method |
(Optional) The correlation method to be used. Defaults to "spearman". |
fdr.cutoff |
(Optional) The FDR threshold for determining statistical significance. Defaults to 0.01. |
p.adj.method |
(Optional) The method used to adjust p-values for multiple hypothesis testing. Defaults to "holm". |
n.cores |
(Optional) The number of cores used when iterating over genes to perform testing. Defaults to 2. |
verbose |
(Optional) Should a progress bar be printed to the console during processing? Defaults to TRUE. |
Either a Seurat or SingleCellExperiment object if
expr.mat is in either form, or a data.frame containing per-cell
program scores if expr.mat is a matrix.
Jack R. Leary
data(sim_counts) data(scLANE_models) data(sim_pseudotime) smoothed_dynamics <- smoothedCountsMatrix(scLANE_models, pt = sim_pseudotime, n.cores = 1L ) gene_embed <- embedGenes(smoothed_dynamics$Lineage_A, n.cores = 1L) sim_counts <- geneProgramScoring(sim_counts, genes = gene_embed$gene, gene.clusters = gene_embed$leiden, n.cores = 1L ) program_drivers <- geneProgramDrivers(sim_counts, genes = gene_embed$gene, gene.program = sim_counts$cluster_0, fdr.cutoff = 0.05, n.cores = 1L )data(sim_counts) data(scLANE_models) data(sim_pseudotime) smoothed_dynamics <- smoothedCountsMatrix(scLANE_models, pt = sim_pseudotime, n.cores = 1L ) gene_embed <- embedGenes(smoothed_dynamics$Lineage_A, n.cores = 1L) sim_counts <- geneProgramScoring(sim_counts, genes = gene_embed$gene, gene.clusters = gene_embed$leiden, n.cores = 1L ) program_drivers <- geneProgramDrivers(sim_counts, genes = gene_embed$gene, gene.program = sim_counts$cluster_0, fdr.cutoff = 0.05, n.cores = 1L )
This function uses ScoreSignatures_UCell
to create a per-cell module score for each of the provided gene clusters.
If the input matrix is a Seurat or SingleCellExperiment
object, then the resulting scores will be added to the meta.data or
the colData slot, respectively. Otherwise, a data.frame of the
per-program scores is returned.
geneProgramScoring( expr.mat = NULL, genes = NULL, gene.clusters = NULL, program.labels = NULL, minmax.norm = TRUE, minmax.epsilon = 0.01, n.cores = 2L )geneProgramScoring( expr.mat = NULL, genes = NULL, gene.clusters = NULL, program.labels = NULL, minmax.norm = TRUE, minmax.epsilon = 0.01, n.cores = 2L )
expr.mat |
Either a |
genes |
A character vector of gene IDs. Defaults to NULL. |
gene.clusters |
A factor containing the cluster assignment of each gene
in |
program.labels |
(Optional) A character vector specifying a label for each gene cluster. Defaults to NULL. |
minmax.norm |
(Optional) Should each program's score be min-max normalized to be on (0, 1)? Defaults to TRUE. |
minmax.epsilon |
(Optional) The tolerance used to ensure that program scores equal to 0 or 1 do not occur. Defaults to 0.01. |
n.cores |
(Optional) The number of cores used under the hood in
|
Either a Seurat or SingleCellExperiment object if
expr.mat is in either form, or a data.frame containing per-cell
program scores if expr.mat is a matrix.
Jack R. Leary
data(sim_counts) data(scLANE_models) data(sim_pseudotime) smoothed_dynamics <- smoothedCountsMatrix(scLANE_models, pt = sim_pseudotime, n.cores = 1L ) gene_embed <- embedGenes(smoothed_dynamics$Lineage_A, n.cores = 1L) sim_counts <- geneProgramScoring(sim_counts, genes = gene_embed$gene, gene.clusters = gene_embed$leiden, n.cores = 1L )data(sim_counts) data(scLANE_models) data(sim_pseudotime) smoothed_dynamics <- smoothedCountsMatrix(scLANE_models, pt = sim_pseudotime, n.cores = 1L ) gene_embed <- embedGenes(smoothed_dynamics$Lineage_A, n.cores = 1L) sim_counts <- geneProgramScoring(sim_counts, genes = gene_embed$gene, gene.clusters = gene_embed$leiden, n.cores = 1L )
This function fits a Beta GAM with pseudotime as a covariate and gene program score as the response. The fitted model is then compared to a null, intercept-only model in order to determine the significance of the relationship between gene program and pseudotime.
geneProgramSignificance( gene.programs = NULL, pt = NULL, program.labels = NULL, p.adj.method = "holm" )geneProgramSignificance( gene.programs = NULL, pt = NULL, program.labels = NULL, p.adj.method = "holm" )
gene.programs |
A list of vectors of program scores as returned by
|
pt |
A vector of pseudotime values for each cell. May contain NAs, which are handled internally. Defaults to NULL. |
program.labels |
(Optional) A character vector specifying a label for each gene cluster. Defaults to NULL. |
p.adj.method |
(Optional) The method used to adjust p-values for multiple hypothesis testing. Defaults to "holm". |
This function assumes that the gene program scores have been min-max
normalized to (0, 1) i.e., not including the values 0 or 1. This is
necessary to fit the Beta distribution additive model. This normalization
can be easily generated by setting the argument minmax.norm = TRUE
in the geneProgramScoring function.
A table of statistical output showing the significance of the association between pseudotime and program scores.
Jack R. Leary
data(sim_counts) data(scLANE_models) data(sim_pseudotime) smoothed_dynamics <- smoothedCountsMatrix(scLANE_models, pt = sim_pseudotime, n.cores = 1L ) gene_embed <- embedGenes(smoothed_dynamics$Lineage_A, n.cores = 1L) sim_counts <- geneProgramScoring(sim_counts, genes = gene_embed$gene, gene.clusters = gene_embed$leiden, n.cores = 1L ) program_enrichment_stats <- geneProgramSignificance(list(sim_counts$cluster_0), pt = sim_pseudotime$PT, program.labels = c("Program Name") )data(sim_counts) data(scLANE_models) data(sim_pseudotime) smoothed_dynamics <- smoothedCountsMatrix(scLANE_models, pt = sim_pseudotime, n.cores = 1L ) gene_embed <- embedGenes(smoothed_dynamics$Lineage_A, n.cores = 1L) sim_counts <- geneProgramScoring(sim_counts, genes = gene_embed$gene, gene.clusters = gene_embed$leiden, n.cores = 1L ) program_enrichment_stats <- geneProgramSignificance(list(sim_counts$cluster_0), pt = sim_pseudotime$PT, program.labels = c("Program Name") )
Generate a table of expression counts, model fitted values, celltype metadata, etc. in order to create custom plots of gene dynamics.
getFittedValues( test.dyn.res = NULL, genes = NULL, pt = NULL, expr.mat = NULL, size.factor.offset = NULL, log1p.norm = TRUE, cell.meta.data = NULL, is.gee = FALSE, id.vec = NULL, ci.alpha = 0.05, filter.lineage = NULL )getFittedValues( test.dyn.res = NULL, genes = NULL, pt = NULL, expr.mat = NULL, size.factor.offset = NULL, log1p.norm = TRUE, cell.meta.data = NULL, is.gee = FALSE, id.vec = NULL, ci.alpha = 0.05, filter.lineage = NULL )
test.dyn.res |
The output from |
genes |
A character vector of genes to generate fitted values for. Defaults to NULL. |
pt |
A data.frame of pseudotime values for each cell. Defaults to NULL. |
expr.mat |
Either a |
size.factor.offset |
(Optional) An offset to be used to rescale the
fitted values. Can be generated easily with |
log1p.norm |
(Optional) Should log1p-normalized versions of expression & model predictions be returned as well? Defaults to TRUE. |
cell.meta.data |
(Optional) A data.frame of metadata values for each cell (celltype label, subject characteristics, tissue type, etc.) that will be included in the result table. Defaults to NULL. |
is.gee |
Was the GEE mode used to fit the models? Defaults to FALSE. |
id.vec |
(Optional) A vector of subject IDs used in fitting GEE or GLMM models. Defaults to NULL. |
ci.alpha |
(Optional) The pre-specified Type I Error rate used in
generating ( |
filter.lineage |
(Optional) A character vector of lineages to filter out. Should be letters, i.e. lineage "A" or "B". Defaults to NULL. |
A data.frame containing depth- and log1p-normalized expression, model predictions, and cell-level metadata.
Jack R. Leary
data(sim_counts) data(sim_pseudotime) data(scLANE_models) fitted_vals <- getFittedValues(scLANE_models, genes = sample(names(scLANE_models), 5), pt = sim_pseudotime, expr.mat = sim_counts )data(sim_counts) data(sim_pseudotime) data(scLANE_models) fitted_vals <- getFittedValues(scLANE_models, genes = sample(names(scLANE_models), 5), pt = sim_pseudotime, expr.mat = sim_counts )
Pulls knot locations for dynamic genes across each lineage, allowing comparisons of where transcriptional switches occur between lineages.
getKnotDist(test.dyn.res = NULL, dyn.genes = NULL)getKnotDist(test.dyn.res = NULL, dyn.genes = NULL)
test.dyn.res |
The output from |
dyn.genes |
The set of genes to pull knots for. If unspecified, pulls knots for all modeled genes. Defaults to NULL. |
A data.frame containing gene name, lineage ID, and knot location in pseudotime.
Jack R. Leary
data(scLANE_models) knot_dist <- getKnotDist(scLANE_models)data(scLANE_models) knot_dist <- getKnotDist(scLANE_models)
testDynamic.This function turns the nested list differential expression
results of testDynamic and turns them into a tidy data.frame.
getResultsDE(test.dyn.res = NULL, p.adj.method = "fdr", fdr.cutoff = 0.01)getResultsDE(test.dyn.res = NULL, p.adj.method = "fdr", fdr.cutoff = 0.01)
test.dyn.res |
The nested list returned by |
p.adj.method |
(Optional) The method used to adjust p-values for multiple hypothesis testing. Defaults to "fdr". |
fdr.cutoff |
(Optional) The FDR threshold for determining statistical significance. Defaults to 0.01. |
A data.frame containing differential expression results & test statistics for each gene.
Jack R. Leary
Rhonda Bacher
data(scLANE_models) scLANE_de_res <- getResultsDE(scLANE_models)data(scLANE_models) scLANE_de_res <- getResultsDE(scLANE_models)
MARGE models of single cell counts.MARS fitting function for negative binomial generalized linear models (GLMs) & generalized estimating equations (GEEs).
marge2( X_pred = NULL, Y = NULL, Y.offset = NULL, M = 5, is.gee = FALSE, is.glmm = FALSE, id.vec = NULL, cor.structure = "ar1", sandwich.var = FALSE, approx.knot = TRUE, n.knot.max = 25, glm.backend = "MASS", tols_score = 1e-05, minspan = NULL, return.basis = FALSE, return.WIC = FALSE, return.GCV = FALSE )marge2( X_pred = NULL, Y = NULL, Y.offset = NULL, M = 5, is.gee = FALSE, is.glmm = FALSE, id.vec = NULL, cor.structure = "ar1", sandwich.var = FALSE, approx.knot = TRUE, n.knot.max = 25, glm.backend = "MASS", tols_score = 1e-05, minspan = NULL, return.basis = FALSE, return.WIC = FALSE, return.GCV = FALSE )
X_pred |
A matrix of the predictor variables. Defaults to NULL. |
Y |
The response variable. Defaults to NULL. |
Y.offset |
(Optional) An vector of per-cell size factors to be included in the final model fit as an offset. Defaults to NULL. |
M |
A set threshold for the maximum number of basis functions to be chosen. Defaults to 5. |
is.gee |
Should the |
is.glmm |
Is the overall model to be fit a GLMM? Defaults to FALSE. |
id.vec |
If |
cor.structure |
If |
sandwich.var |
(Optional) Should the sandwich variance estimator be used instead of the model-based estimator? Default to FALSE. |
approx.knot |
(Optional) Should the set of candidate knots be subsampled in order to speed up computation? This has little effect on the final fit, but can improve computation time somewhat. Defaults to TRUE. |
n.knot.max |
(Optional) The maximum number of candidate knots to consider. Uses uniform sampling to select this number of unique values from the reduced set of all candidate knots. Defaults to 25. |
glm.backend |
(Optional) Character specifying which GLM-fitting backend should be used. Must be one of "MASS" or "speedglm". Defaults to "MASS". |
tols_score |
(Optional) The set tolerance for monitoring the convergence for the difference in score statistics between the parent and candidate model (this is the lack-of-fit criterion used for MARGE). Defaults to 0.00001. |
minspan |
(Optional) A set minimum span value. Defaults to NULL. |
return.basis |
(Optional) Whether the basis model matrix should be
returned as part of the |
return.WIC |
(Optional) Whether the WIC matrix should be returned as
part of the |
return.GCV |
(Optional) Whether the final GCV value should be returned
as part of the |
If models are being fit using an offset (as is recommended), it is
assumed that the offset represents a library size factor (or similar
quantity) generated using e.g., createCellOffset or
computeLibraryFactors. Since this quantity represents
a scaling factor divided by sequencing depth, the offset is formulated
as offset(log(1 / cell_offset)). The inversion is necessary because
the rate term, i.e. the sequencing depth, is the denominator of the
estimated size factors.
An object of class marge containing the fitted model & other
optional quantities of interest (basis function matrix, GCV, etc.).
Jakub Stoklosa
David I. Warton.
Jack R. Leary
Rhonda Bacher
Friedman, J. (1991). Multivariate adaptive regression splines. The Annals of Statistics, 19, 1–67.
Stoklosa, J., Gibb, H. and Warton, D.I. (2014). Fast forward selection for generalized estimating equations with a large number of predictor variables. Biometrics, 70, 110–120.
Stoklosa, J. and Warton, D.I. (2018). A generalized estimating equation approach to multivariate adaptive regression splines. Journal of Computational and Graphical Statistics, 27, 245–253.
data(sim_counts) data(sim_pseudotime) cell_offset <- createCellOffset(sim_counts) marge_model <- marge2(sim_pseudotime, Y = BiocGenerics::counts(sim_counts)[4, ], Y.offset = cell_offset )data(sim_counts) data(sim_pseudotime) cell_offset <- createCellOffset(sim_counts) marge_model <- marge2(sim_pseudotime, Y = BiocGenerics::counts(sim_counts)[4, ], Y.offset = cell_offset )
Truncates the predictor variable value to exclude extreme values in knots selection.
max_span(X_red = NULL, q = NULL, alpha = 0.05)max_span(X_red = NULL, q = NULL, alpha = 0.05)
X_red |
A vector of values for the predictor variable. |
q |
The number of predictors used. |
alpha |
See Friedman (1991) equation (45). Defaults to 0.05. |
Note that this equation comes from Friedman (1991) equation (45).
max_span returns a vector of truncated predictor variable
values.
Jakub Stoklosa
David I. Warton.
Friedman, J. (1991). Multivariate adaptive regression splines. The Annals of Statistics, 19, 1–67.
Stoklosa, J. and Warton, D.I. (2018). A generalized estimating equation approach to multivariate adaptive regression splines. Journal of Computational and Graphical Statistics, 27, 245–253.
A truncation function applied on the predictor variable for knot selection.
min_span(X_red = NULL, q = NULL, minspan = NULL, alpha = 0.05)min_span(X_red = NULL, q = NULL, minspan = NULL, alpha = 0.05)
X_red |
A vector of reduced predictor variable values. Defaults to NULL. |
q |
The number of predictor variables used. Defaults to NULL. |
minspan |
The set minimum span value. Defaults to
|
alpha |
See Friedman (1991) equation (43). Defaults to 0.05. |
This function selects a minimum span between the knots to mitigate runs of correlated noise in the input data and hence avoiding estimation issues, this equation comes from Friedman (1991) equation 43.
min_span returns a vector of truncated predictor
variable values.
Jakub Stoklosa
David I. Warton.
Friedman, J. (1991). Multivariate adaptive regression splines. The Annals of Statistics, 19, 1–67.
Stoklosa, J. and Warton, D.I. (2018). A generalized estimating equation approach to multivariate adaptive regression splines. Journal of Computational and Graphical Statistics, 27, 245–253.
This function compares two models using a likelihood ratio test (LRT) under the assumption that the test statistic is asymptotically Chi-squared with degrees freedom equal to the difference in the number of parameters between the larger and smaller model.
modelLRT(mod.1 = NULL, mod.0 = NULL, is.glmm = FALSE)modelLRT(mod.1 = NULL, mod.0 = NULL, is.glmm = FALSE)
mod.1 |
The model corresponding to the alternative hypothesis. Defaults to NULL. |
mod.0 |
The model corresponding to the null hypothesis. Defaults to NULL. |
is.glmm |
Are the models being compared GLMMs? Defaults to FALSE. |
A list containing the LRT test statistic, degrees freedom, and the p-value computed using the Chi-squared assumption.
Jack R. Leary
Fits a negative-binomial family GAM using a cubic basis spline on pseudotime. If data are multi-subject in nature, a random intercept is included for each subject.
nbGAM( expr = NULL, pt = NULL, Y.offset = NULL, id.vec = NULL, penalize.spline = FALSE, spline.df = 5 )nbGAM( expr = NULL, pt = NULL, Y.offset = NULL, id.vec = NULL, penalize.spline = FALSE, spline.df = 5 )
expr |
A vector of integer counts. Defaults to NULL. |
pt |
A dataframe or vector of pseudotime values. Defaults to NULL. |
Y.offset |
(Optional) An offset to be included in the final model fit. Defaults to NULL. |
id.vec |
(Optional) A vector of subject IDs to be used in creating random intercepts in the GAM. Useful for comparing GAMs to GLMMs. Defaults to NULL. |
penalize.spline |
(Optional) Should a P-spline be used to fit the GAM? Otherwise the default cubic basis spline is used instead. Defaults to FALSE. |
spline.df |
(Optional) Degrees of freedom of the cubic basis spline. Unused if a P-spline is being fit, since it's estimated internally. Defaults to 5. |
An object of class gamlss
Jack R. Leary
data(sim_counts) data(sim_pseudotime) cell_offset <- createCellOffset(sim_counts) gam_mod <- nbGAM(BiocGenerics::counts(sim_counts)[4, ], pt = sim_pseudotime, Y.offset = cell_offset, penalize.spline = TRUE, spline.df = 10 )data(sim_counts) data(sim_pseudotime) cell_offset <- createCellOffset(sim_counts) gam_mod <- nbGAM(BiocGenerics::counts(sim_counts)[4, ], pt = sim_pseudotime, Y.offset = cell_offset, penalize.spline = TRUE, spline.df = 10 )
np.convolve.Convolve a vector with a user-specified kernel. Can be useful for heatmap smoothing, weighted moving means, etc.
npConvolve(x = NULL, conv.kernel = NULL)npConvolve(x = NULL, conv.kernel = NULL)
x |
The vector to be convolved. Defaults to NULL. |
conv.kernel |
The kernel to be used in the convolution. If unspecified,
defaults to a vector of |
The convolution here uses convolve, but creates
the kernel and padding in such a way that it matches the output from
np.convolve in Python's numpy matrix algebra package.
A convolution with same length as the input vector.
Jack R. Leary
convolved_vec <- npConvolve(x = rnorm(20), conv.kernel = rep(1 / 5, 5))convolved_vec <- npConvolve(x = rnorm(20), conv.kernel = rep(1 / 5, 5))
clusterGenes to use in plotting.Generate a table of per-lineage, per-cluster fitted values
from scLANE to be used in visualizations.
plotClusteredGenes( test.dyn.res = NULL, gene.clusters = NULL, pt = NULL, size.factor.offset = NULL, n.cores = 2L )plotClusteredGenes( test.dyn.res = NULL, gene.clusters = NULL, pt = NULL, size.factor.offset = NULL, n.cores = 2L )
test.dyn.res |
The list returned by |
gene.clusters |
The data.frame returned by |
pt |
A data.frame containing the pseudotime or latent time estimates for each cell. Defaults to NULL. |
size.factor.offset |
(Optional) An offset to be used to rescale the
fitted values. Can be generated easily with
|
n.cores |
If parallel execution is desired, how many cores should be utilized? Defaults to 2. |
Due to some peculiarities of how the fitted values (on the link scale)
are generated for geeM models,
it's not necessary to multiply them by the offset as this is done
internally. For GLM & GEE models, the opposite is
true, and size.factor.offset must be provided in order to rescale
the fitted values correctly.
A data.frame with ready-to-plot tidy data. Includes columns
for gene name, pseudotime lineage,
cell name, fitted values on link & response scale, pseudotime, &
gene cluster.
Jack R. Leary
data(sim_counts) data(scLANE_models) data(sim_pseudotime) cell_offset <- createCellOffset(sim_counts) gene_clusters <- clusterGenes(scLANE_models, pt = sim_pseudotime, size.factor.offset = cell_offset ) library(ggplot2) plotClusteredGenes( test.dyn.res = scLANE_models, gene.clusters = gene_clusters, pt = sim_pseudotime, n.cores = 1L ) %>% ggplot(aes(x = PT, y = FITTED, color = CLUSTER, group = GENE)) + facet_wrap(~ LINEAGE + CLUSTER) + geom_line() + theme_classic()data(sim_counts) data(scLANE_models) data(sim_pseudotime) cell_offset <- createCellOffset(sim_counts) gene_clusters <- clusterGenes(scLANE_models, pt = sim_pseudotime, size.factor.offset = cell_offset ) library(ggplot2) plotClusteredGenes( test.dyn.res = scLANE_models, gene.clusters = gene_clusters, pt = sim_pseudotime, n.cores = 1L ) %>% ggplot(aes(x = PT, y = FITTED, color = CLUSTER, group = GENE)) + facet_wrap(~ LINEAGE + CLUSTER) + geom_line() + theme_classic()
Generate a plot of gene dynamics over a single pseudotime lineage, along with a table of coefficients across pseudotime intervals.
plotModelCoefs( test.dyn.res = NULL, gene = NULL, pt = NULL, expr.mat = NULL, size.factor.offset = NULL, lineage = "A", log1p.norm = TRUE )plotModelCoefs( test.dyn.res = NULL, gene = NULL, pt = NULL, expr.mat = NULL, size.factor.offset = NULL, lineage = "A", log1p.norm = TRUE )
test.dyn.res |
The output from |
gene |
A character specifying which gene's dynamics should be plotted. Defaults to NULL. |
pt |
A data.frame of pseudotime values for each cell. Defaults to NULL. |
expr.mat |
Either a |
size.factor.offset |
(Optional) An offset to be used to rescale the
fitted values. Can be generated easily with |
lineage |
A character vector specifying which lineage should be plotted. Should be letters, i.e. lineage "A" or "B". Defaults to "A". |
log1p.norm |
(Optional) Should log1p-normalized versions of expression & model predictions be returned as well? Defaults to TRUE. |
A ggplot2 object displaying a gene dynamics plot & a table
of coefficients across pseudotime intervals.
Jack R. Leary
data(sim_counts) data(sim_pseudotime) data(scLANE_models) cell_offset <- createCellOffset(sim_counts) scLANE_de_res <- getResultsDE(scLANE_models) plotModelCoefs(scLANE_models, gene = "ACLY", pt = sim_pseudotime, expr.mat = sim_counts, size.factor.offset = cell_offset )data(sim_counts) data(sim_pseudotime) data(scLANE_models) cell_offset <- createCellOffset(sim_counts) scLANE_de_res <- getResultsDE(scLANE_models) plotModelCoefs(scLANE_models, gene = "ACLY", pt = sim_pseudotime, expr.mat = sim_counts, size.factor.offset = cell_offset )
marge and other models using ggplot2.This function visualizes the fitted values of several types of models over the expression and pseudotime values of each cell.
plotModels( test.dyn.res = NULL, gene = NULL, pt = NULL, expr.mat = NULL, size.factor.offset = NULL, log1p.norm = TRUE, is.gee = FALSE, is.glmm = FALSE, id.vec = NULL, cor.structure = "ar1", ci.alpha = 0.05, plot.null = FALSE, plot.glm = FALSE, plot.gam = FALSE, plot.scLANE = TRUE, filter.lineage = NULL, gg.theme = theme_scLANE() )plotModels( test.dyn.res = NULL, gene = NULL, pt = NULL, expr.mat = NULL, size.factor.offset = NULL, log1p.norm = TRUE, is.gee = FALSE, is.glmm = FALSE, id.vec = NULL, cor.structure = "ar1", ci.alpha = 0.05, plot.null = FALSE, plot.glm = FALSE, plot.gam = FALSE, plot.scLANE = TRUE, filter.lineage = NULL, gg.theme = theme_scLANE() )
test.dyn.res |
The output from |
gene |
The name of the gene that's being analyzed. Used as the title of
the |
pt |
A data.frame of pseudotime values for each cell. Defaults to NULL. |
expr.mat |
Either a |
size.factor.offset |
(Optional) An offset to be included in the final
model fit. Can be generated easily with |
log1p.norm |
(Optional) Should log1p-normalized versions of expression & model predictions be returned instead of raw counts? Defaults to TRUE. |
is.gee |
Should a GEE framework be used instead of the default GLM? Defaults to FALSE. |
is.glmm |
Should a GLMM framework be used instead of the default GLM? Defaults to FALSE. |
id.vec |
If the GEE or GLMM framework is being used, a vector of
subject IDs to use as input to |
cor.structure |
If the GEE framework is used, specifies the desired working correlation structure. Must be one of "ar1", "independence", or "exchangeable". Defaults to "ar1". |
ci.alpha |
(Optional) The pre-specified Type I Error rate used in
generating ( |
plot.null |
(Optional) Should the fitted values from the intercept-only null model be plotted? Defaults to FALSE. |
plot.glm |
(Optional) Should the fitted values from an NB GLM be plotted? If the data are multi-subject, the "GLM" model can be a GEE or GLMM depending on the desired framework. See Examples for more detail. Defaults to FALSE. |
plot.gam |
(Optional) Should the fitted values from an NB GAM be plotted? Defaults to FALSE. |
plot.scLANE |
(Optional) Should the fitted values from
the |
filter.lineage |
(Optional) A character vector of lineages to filter out before generating the final plot. Should be letters, i.e. lineage "A" or "B". Defaults to NULL. |
gg.theme |
(Optional) A |
A ggplot object.
Jack R. Leary
data(sim_counts) data(scLANE_models) data(sim_pseudotime) cell_offset <- createCellOffset(sim_counts) model_plot <- plotModels(scLANE_models, gene = names(scLANE_models)[2], pt = sim_pseudotime, expr.mat = sim_counts, size.factor.offset = cell_offset )data(sim_counts) data(scLANE_models) data(sim_pseudotime) cell_offset <- createCellOffset(sim_counts) model_plot <- plotModels(scLANE_models, gene = names(scLANE_models)[2], pt = sim_pseudotime, expr.mat = sim_counts, size.factor.offset = cell_offset )
Print method for summary.scLANE objects.
## S3 method for class 'summary.scLANE' print(x, ...)## S3 method for class 'summary.scLANE' print(x, ...)
x |
An object of class summary.scLANE. |
... |
Other options passed to |
A printed summary of overall scLANE results
Jack R. Leary
This function takes in the MARGE model fitted during the
running of marge2 and summarizes it.
pullMARGESummary( marge.model = NULL, is.gee = FALSE, sandwich.var = FALSE, is.glmm = FALSE )pullMARGESummary( marge.model = NULL, is.gee = FALSE, sandwich.var = FALSE, is.glmm = FALSE )
marge.model |
The |
is.gee |
Boolean specifying whether GEE mode was used in fitting the null model. Defaults to FALSE. |
sandwich.var |
Boolean specifying whether the robust sandwich variance-covariance matrix should be used. Defaults to FALSE. |
is.glmm |
Boolean specifying whether the GLMM mode was used in fitting the model. Defaults to FALSE. |
A list containing a coefficient summary, fitted values and their standard errors, and the log-likelihood and deviance of the model.
Jack R. Leary
Rhonda Bacher
This function takes in the null model fitted during the running
of marge2 and summarizes it.
pullNullSummary( null.model = NULL, is.gee = FALSE, sandwich.var = FALSE, is.glmm = FALSE )pullNullSummary( null.model = NULL, is.gee = FALSE, sandwich.var = FALSE, is.glmm = FALSE )
null.model |
The null model from |
is.gee |
Boolean specifying whether GEE mode was used in fitting the null model. Defaults to FALSE. |
sandwich.var |
Boolean specifying whether the robust sandwich variance-covariance matrix should be used. Defaults to FALSE. |
is.glmm |
Boolean specifying whether the GLMM mode was used in fitting the model. Defaults to FALSE. |
A list containing a coefficient summary, fitted values and their standard errors, and the log-likelihood and deviance of the model.
Jack R. Leary
Rhonda Bacher
scLANE.Contains the results from running testDynamic
on sim_counts.
data(scLANE_models)data(scLANE_models)
An object of class scLANE.
scLANE results on small simulated trajectory dataset.
Calculate the score statistic for a GEE model.
score_fun_gee( Y = NULL, N = NULL, n_vec = NULL, VS.est_list = NULL, AWA.est_list = NULL, J2_list = NULL, Sigma2_list = NULL, J11.inv = NULL, JSigma11 = NULL, mu.est = NULL, V.est = NULL, B1 = NULL, XA = NULL )score_fun_gee( Y = NULL, N = NULL, n_vec = NULL, VS.est_list = NULL, AWA.est_list = NULL, J2_list = NULL, Sigma2_list = NULL, J11.inv = NULL, JSigma11 = NULL, mu.est = NULL, V.est = NULL, B1 = NULL, XA = NULL )
Y |
The response variable. Defaults to NULL. |
N |
The number of clusters. Defaults to NULL. |
n_vec |
A vector consisting of the cluster sizes for each cluster. Defaults to NULL. |
VS.est_list |
A product of matrices. Defaults to NULL. |
AWA.est_list |
A product of matrices. Defaults to NULL. |
J2_list |
A product of matrices. Defaults to NULL. |
Sigma2_list |
A product of matrices. Defaults to NULL. |
J11.inv |
A product of matrices. Defaults to NULL. |
JSigma11 |
A product of matrices. Defaults to NULL. |
mu.est |
Estimates of the fitted mean under the null model. Defaults to NULL. |
V.est |
Estimates of the fitted variance under the null model. Defaults to NULL. |
B1 |
Design matrix under the null model. Defaults to NULL. |
XA |
Design matrix under the alternative model. Defaults to NULL. |
A calculated score statistic for the null and alternative model when fitting a GEE.
Jakub Stoklosa
David I. Warton
Jack R. Leary
Stoklosa, J., Gibb, H. and Warton, D.I. (2014). Fast forward selection for generalized estimating equations with a large number of predictor variables. Biometrics, 70, 110–120.
Stoklosa, J. and Warton, D.I. (2018). A generalized estimating equation approach to multivariate adaptive regression splines. Journal of Computational and Graphical Statistics, 27, 245–253.
Calculate the score statistic for a GLM model.
score_fun_glm( Y = NULL, VS.est_list = NULL, A_list = NULL, B1_list = NULL, mu.est = NULL, V.est = NULL, B1 = NULL, XA = NULL )score_fun_glm( Y = NULL, VS.est_list = NULL, A_list = NULL, B1_list = NULL, mu.est = NULL, V.est = NULL, B1 = NULL, XA = NULL )
Y |
The response variable. Defaults to NULL. |
VS.est_list |
A product of matrices. Defaults to NULL. |
A_list |
A product of matrices. Defaults to NULL. |
B1_list |
A product of matrices. Defaults to NULL. |
mu.est |
Estimates of the fitted mean under the null model. Defaults to NULL. |
V.est |
Estimates of the fitted variance under the null model. Defaults to NULL. |
B1 |
Design matrix under the null model. Defaults to NULL. |
XA |
Design matrix under the alternative model. Defaults to NULL. |
A calculated score statistic for the null and alternative model when fitting a GLM.
Jakub Stoklosa
David I. Warton
Jack R. Leary
Stoklosa, J., Gibb, H. and Warton, D.I. (2014). Fast forward selection for generalized estimating equations with a large number of predictor variables. Biometrics, 70, 110–120.
Stoklosa, J. and Warton, D.I. (2018). A generalized estimating equation approach to multivariate adaptive regression splines. Journal of Computational and Graphical Statistics, 27, 245–253.
Performs a basic Lagrange multiplier test to determine whether
an alternate model is significantly better than a nested null model. This is
the GEE equivalent (kind of) of modelLRT. Be careful with
small sample sizes.
scoreTestGEE( mod.1 = NULL, mod.0 = NULL, alt.df = NULL, null.df = NULL, id.vec = NULL, cor.structure = "ar1" )scoreTestGEE( mod.1 = NULL, mod.0 = NULL, alt.df = NULL, null.df = NULL, id.vec = NULL, cor.structure = "ar1" )
mod.1 |
The model under the alternative hypothesis. Must be of class
|
mod.0 |
The model under the null hypothesis. Must be of class
|
alt.df |
The dataframe used to fit the alternative model. Defaults to NULL. |
null.df |
The dataframe used to fit the null model. Defaults to NULL. |
id.vec |
A vector of subject IDs to use as input to
|
cor.structure |
A string specifying the working correlation structure used to fit each model. Must be one of "ar1", "independence", or "exchangeable". Defaults to "ar1". |
Calculating the test statistic involves taking the inverse of the variance of the score vector. Ideally this would be done using the true inverse, but in practice this can cause issues when the matrix is near-singular. With this in mind, we use the Moore-Penrose pseudoinverse if the original matrix inversion fails.
The p-value is calculated using an asymptotic Chi-squared distribution, with the degrees of freedom equal to the number of non-intercept coefficients in the alternative model.
A list containing the Score test statistic, a p-value, and the degrees of freedom used in the test.
Jack R. Leary
SingleCellExperiment object containing
simulated counts.Data simulated using the scaffold R package for 50 dynamic and 50
static genes across 1200 cells from 3 subjects.
data(sim_counts)data(sim_counts)
An object of class
SingleCellExperiment.
Small simulated trajectory dataset.
https://www.rhondabacher.com/scaffold-vignette.pdf
The true ordering of the 1200 cells contained in sim_counts.
data(sim_pseudotime)data(sim_pseudotime)
An object of class data.frame with 1200 rows and one variable:
PT: the true pseudotime (0.0025–1)
Small simulated trajectory pseudotime.
scLANE models.This function takes as input the output from
testDynamic and returns the fitted values from each model in
a wide format, with one column per-gene and one row-per cell. This matrix
can be use as input to cell or gene clustering and / or visualizations
such as heatmaps.
smoothedCountsMatrix( test.dyn.res = NULL, size.factor.offset = NULL, pt = NULL, genes = NULL, log1p.norm = FALSE, n.cores = 2L )smoothedCountsMatrix( test.dyn.res = NULL, size.factor.offset = NULL, pt = NULL, genes = NULL, log1p.norm = FALSE, n.cores = 2L )
test.dyn.res |
The list returned by |
size.factor.offset |
(Optional) An offset to be used to rescale the
fitted values. Can be generated easily with |
pt |
A data.frame of pseudotime values for each cell. Defaults to NULL. |
genes |
(Optional) A character vector of genes with which to subset the results. Defaults to NULL. |
log1p.norm |
A boolean specifying whether the smoothed counts should be log1p-transformed after depth normalization. Defaults to FALSE. |
n.cores |
If parallel execution is desired, how many cores should be utilized? Defaults to 2. |
A list of matrices of smoothed counts, with each element of the list being a single pseudotime lineage.
Jack R. Leary
data(sim_pseudotime) data(scLANE_models) smoothed_dynamics <- smoothedCountsMatrix(scLANE_models, pt = sim_pseudotime, n.cores = 1L )data(sim_pseudotime) data(scLANE_models) smoothed_dynamics <- smoothedCountsMatrix(scLANE_models, pt = sim_pseudotime, n.cores = 1L )
Sort genes such that genes with peak expression occurring earlier in pseudotime are first, and vice versa for genes with late peak expression. Useful for ordering genes in order to create heatmaps of expression cascades.
sortGenesHeatmap(heatmap.mat = NULL, pt.vec = NULL)sortGenesHeatmap(heatmap.mat = NULL, pt.vec = NULL)
heatmap.mat |
A matrix of raw or smoothed expression values with genes as columns and cells as rows. Defaults to NULL. |
pt.vec |
A numeric vector of pseudotime values for each cell i.e., for each row in the heatmap matrix. Defaults to NULL. |
A character vector of genes sorted by their peak expression values over pseudotime.
Jack R. Leary
data(sim_pseudotime) data(scLANE_models) smoothed_counts <- smoothedCountsMatrix(scLANE_models, pt = sim_pseudotime, n.cores = 1L ) sorted_genes <- sortGenesHeatmap(smoothed_counts$Lineage_A, pt.vec = sim_pseudotime$PT)data(sim_pseudotime) data(scLANE_models) smoothed_counts <- smoothedCountsMatrix(scLANE_models, pt = sim_pseudotime, n.cores = 1L ) sorted_genes <- sortGenesHeatmap(smoothed_counts$Lineage_A, pt.vec = sim_pseudotime$PT)
Since the GEE & GLMM modes require data to be sorted by sample ID and pseudotime, this function provides a simple way to do so for a range of inputs.
sortObservations(expr.mat = NULL, pt.vec = NULL, id.vec = NULL)sortObservations(expr.mat = NULL, pt.vec = NULL, id.vec = NULL)
expr.mat |
Either a |
pt.vec |
A vector of pseudotime values used to sort the observations. May contain NA values. Defaults to NULL. |
id.vec |
A vector of subject IDs used to sort the observations. Defaults to NULL. |
If the input is a matrix, it is assumed that the columns are cells - and are named as such - and the rows are genes.
If the input is a Seurat object, sorting requires converting
to SingleCellExperiment object first, then ordering, then converting
back to a Seurat object. Some information might be lost, so it is
recommended not to overwrite your original Seurat object.
An object of the same class as the input expr.mat, but
sorted by sample ID & pseudotime.
Jack R. Leary
data(sim_counts) data(sim_pseudotime) sorted_counts <- sortObservations(sim_counts, pt.vec = sim_pseudotime$PT, id.vec = sim_counts$subject )data(sim_counts) data(sim_pseudotime) sorted_counts <- sortObservations(sim_counts, pt.vec = sim_pseudotime$PT, id.vec = sim_counts$subject )
Calculate the final RSS / GCV for a fitted model.
stat_out(Y = NULL, B1 = NULL, TSS = NULL, GCV.null = NULL, pen = 2)stat_out(Y = NULL, B1 = NULL, TSS = NULL, GCV.null = NULL, pen = 2)
Y |
The response variable. Defaults to NULL. |
B1 |
The model matrix of predictor variables. Defaults to NULL. |
TSS |
Total sum of squares. Defaults to NULL. |
GCV.null |
GCV value for the intercept model. Defaults to NULL. |
pen |
The set/fixed penalty used for the GCV. Defaults to 2. |
A list consisting of the RSS, RSSq1, GCV1 and GCVq1 values for the fitted model.
Jakub Stoklosa
David I. Warton
Jack R. Leary
Friedman, J. (1991). Multivariate adaptive regression splines. The Annals of Statistics, 19, 1–67.
Stoklosa, J. and Warton, D.I. (2018). A generalized estimating equation approach to multivariate adaptive regression splines. Journal of Computational and Graphical Statistics, 27, 245–253.
A function that calculates parts of the score statistic for GEEs only (it is used for the full path for forward selection).
stat_out_score_gee_null( Y = NULL, B_null = NULL, id.vec = NULL, cor.structure = NULL, theta.hat = NULL )stat_out_score_gee_null( Y = NULL, B_null = NULL, id.vec = NULL, cor.structure = NULL, theta.hat = NULL )
Y |
The response variable Defaults to NULL. |
B_null |
The design matrix matrix under the null model Defaults to NULL. |
id.vec |
A vector of ID values for the observations. Defaults to NULL. |
cor.structure |
A string specifying the desired correlation structure for the NB GEE. Defaults to NULL. |
theta.hat |
Estimated value to be treated as the "known" theta when
passing the negative binomial family to |
A list of values (mainly products of matrices) that make up the final score statistic calculation (required for another function).
Jakub Stoklosa
David I. Warton
Jack R. Leary
Stoklosa, J., Gibb, H. and Warton, D.I. (2014). Fast forward selection for generalized estimating equations with a large number of predictor variables. Biometrics, 70, 110–120.
Stoklosa, J. and Warton, D.I. (2018). A generalized estimating equation approach to multivariate adaptive regression splines. Journal of Computational and Graphical Statistics, 27, 245–253.
A function that calculates parts of the score statistic for GLMs only (it is used for the full path for forward selection).
stat_out_score_glm_null(Y = NULL, B_null = NULL)stat_out_score_glm_null(Y = NULL, B_null = NULL)
Y |
: The response variable Defaults to NULL. |
B_null |
: Design matrix under the null model. Defaults to NULL. |
A list of values (mainly products of matrices) that make up the final score statistic calculation (required for another function).
Jakub Stoklosa
David I. Warton
Jack R. Leary
Stoklosa, J., Gibb, H. and Warton, D.I. (2014). Fast forward selection for generalized estimating equations with a large number of predictor variables. Biometrics, 70, 110–120.
Stoklosa, J. and Warton, D.I. (2018). A generalized estimating equation approach to multivariate adaptive regression splines. Journal of Computational and Graphical Statistics, 27, 245–253.
This function removes a lot of components from the
default GLM object in order to make it take up less memory. It does however
retain enough pieces for predict() to still work. No promises
beyond that.
stripGLM(glm.obj = NULL)stripGLM(glm.obj = NULL)
glm.obj |
An object of class GLM from which you'd like to strip out unnecessary components. Defaults to NULL. |
A slimmed-down glm object.
Jack R. Leary
marge model as a series of piecewise equations.This function summarizes the model for each gene and allows for quantitative interpretation of fitted gene dynamics.
summarizeModel(marge.model = NULL, pt = NULL, is.glmm = FALSE)summarizeModel(marge.model = NULL, pt = NULL, is.glmm = FALSE)
marge.model |
The fitted model output from |
pt |
The predictor matrix of pseudotime. Defaults to NULL. |
is.glmm |
Flag specifying whether or not data were generated using GLMM mode. Defaults to FALSE. |
A data.frame of the model coefficients, cutpoint intervals, and formatted equations.
Jack R. Leary
Rhonda Bacher
data(sim_counts) data(sim_pseudotime) cell_offset <- createCellOffset(sim_counts) marge_model <- marge2(sim_pseudotime, Y = BiocGenerics::counts(sim_counts)[4, ], Y.offset = cell_offset ) model_summary <- summarizeModel(marge.model = marge_model, pt = sim_pseudotime)data(sim_counts) data(sim_pseudotime) cell_offset <- createCellOffset(sim_counts) marge_model <- marge2(sim_pseudotime, Y = BiocGenerics::counts(sim_counts)[4, ], Y.offset = cell_offset ) model_summary <- summarizeModel(marge.model = marge_model, pt = sim_pseudotime)
Summary method for scLANE objects.
## S3 method for class 'scLANE' summary(object, ...)## S3 method for class 'scLANE' summary(object, ...)
object |
The nested list returned by |
... |
Other options passed to |
A summary list with aggregated statistics concerning the trajectory
DE tests from scLANE.
Jack R. Leary
data(scLANE_models) summary(scLANE_models)data(scLANE_models) summary(scLANE_models)
This function tests whether a NB marge model is better than a null (intercept-only) model using the Likelihood Ratio Test. In effect, the test tells us whether a gene's expression changes (in any way) over pseudotime.
testDynamic( expr.mat = NULL, pt = NULL, genes = NULL, size.factor.offset = NULL, is.gee = FALSE, cor.structure = "ar1", gee.bias.correction.method = NULL, gee.test = "wald", is.glmm = FALSE, glmm.adaptive = TRUE, id.vec = NULL, n.potential.basis.fns = 5, n.cores = 4L, approx.knot = TRUE, verbose = TRUE, random.seed = 312 )testDynamic( expr.mat = NULL, pt = NULL, genes = NULL, size.factor.offset = NULL, is.gee = FALSE, cor.structure = "ar1", gee.bias.correction.method = NULL, gee.test = "wald", is.glmm = FALSE, glmm.adaptive = TRUE, id.vec = NULL, n.potential.basis.fns = 5, n.cores = 4L, approx.knot = TRUE, verbose = TRUE, random.seed = 312 )
expr.mat |
Either a |
pt |
Either the output from |
genes |
A character vector of genes to model. If not provided, defaults
to all genes in |
size.factor.offset |
(Optional) An offset to be included in the final
model fit. Can be generated easily with |
is.gee |
Should a GEE framework be used instead of the default GLM? Defaults to FALSE. |
cor.structure |
If the GEE framework is used, specifies the desired working correlation structure. Must be one of "ar1", "independence", or "exchangeable". Defaults to "ar1". |
gee.bias.correction.method |
(Optional) Specify which small-sample bias correction to be used on the sandwich variance-covariance matrix prior to test statistic estimation. Options are "kc" and "df". Defaults to NULL, indicating the use of the model-based variance. |
gee.test |
A string specifying the type of test used to estimate the significance of the full model. Must be one of "wald" or "score". Defaults to "wald". |
is.glmm |
Should a GLMM framework be used instead of the default GLM? Defaults to FALSE. |
glmm.adaptive |
(Optional) Should the basis functions for the GLMM be chosen adaptively? If not, uses 4 evenly spaced knots. Defaults to TRUE. |
id.vec |
If a GEE or GLMM framework is being used, a vector of subject
IDs to use as input to |
n.potential.basis.fns |
(Optional) The maximum number of possible basis
functions. See the parameter |
n.cores |
(Optional) If running in parallel, how many cores should be used? Defaults to 4L. |
approx.knot |
(Optional) Should the knot space be reduced in order to improve computation time? Defaults to TRUE. |
verbose |
(Optional) A boolean indicating whether a progress bar should be printed to the console. Defaults to TRUE. |
random.seed |
(Optional) The random seed used to initialize RNG streams in parallel. Defaults to 312. |
If expr.mat is a Seurat object, counts will be extracted
from the output of DefaultAssay. If using this
functionality, check to ensure the specified assay is correct before running
the function. If the input is a SingleCellExperiment or
CellDataSet object, the raw counts will be extracted with
counts.
If using the GEE or GLMM model architectures, ensure that the observations are sorted by subject ID (this is assumed by the underlying fit implementations). If they are not, the models will error out.
If gee.bias.correction.method is set to "kc" or "df", a bias
adjustment will be used to inflate the robust variance-covariance matrix
prior to estimating the Wald test statistic. This is useful when the number
of subjects is small and / or the number of per-subject observations is very
large. Doing so will remove the bias in the sandwich estimator in
small-sample cases. Currently, we suggest keeping this NULL and using the
model-based variance estimates and specifying the "ar1" correlation
structure.
A list of lists, where each element is a gene and each gene contains
sublists for each lineage. Each gene-lineage sublist contains a gene name,
lineage number, default marge vs. null model test results, model
statistics, and fitted values. Use getResultsDE to tidy the
results.
Jack R. Leary
data(sim_counts) data(sim_pseudotime) cell_offset <- createCellOffset(sim_counts) scLANE_models <- testDynamic(sim_counts, pt = sim_pseudotime, size.factor.offset = cell_offset, n.cores = 1L )data(sim_counts) data(sim_pseudotime) cell_offset <- createCellOffset(sim_counts) scLANE_models <- testDynamic(sim_counts, pt = sim_pseudotime, size.factor.offset = cell_offset, n.cores = 1L )
This function tests whether each gene's estimated
for pseudotime differs significantly from 0 over each empirically estimated
sets of knots / pseudotime intervals using a Wald test.
testSlope(test.dyn.res = NULL, p.adj.method = "fdr", fdr.cutoff = 0.01)testSlope(test.dyn.res = NULL, p.adj.method = "fdr", fdr.cutoff = 0.01)
test.dyn.res |
The list returned by |
p.adj.method |
The method used to adjust the p-values for each coefficient. Defaults to "fdr". |
fdr.cutoff |
The FDR threshold for determining statistical significance. Defaults to 0.01. |
A dataframe containing the genes, breakpoints, and coefficient p-values from each model.
Jack R. Leary
data(scLANE_models) slope_test_res <- testSlope(scLANE_models)data(scLANE_models) slope_test_res <- testSlope(scLANE_models)
ggplot2 theme for scLANE.A publication-ready theme for creating gene dynamics plots, embedding plots, etc.
theme_scLANE( base.size = 12, base.lwd = 0.75, base.family = "sans", umap = FALSE )theme_scLANE( base.size = 12, base.lwd = 0.75, base.family = "sans", umap = FALSE )
base.size |
The base font size. Defaults to 12. |
base.lwd |
The base linewidth. Defaults to 0.75. |
base.family |
The font family to be used throughout. Defaults to "sans". |
umap |
(Optional) If set to TRUE, removes axis text and ticks for a cleaner look. Defaults to FALSE. |
A ggplot2 theme.
Jack R. Leary
data(sim_counts) data(scLANE_models) data(sim_pseudotime) cell_offset <- createCellOffset(sim_counts) model_plot <- plotModels(scLANE_models, gene = names(scLANE_models)[1], pt = sim_pseudotime, expr.mat = sim_counts, size.factor.offset = cell_offset ) + theme_scLANE()data(sim_counts) data(scLANE_models) data(sim_pseudotime) cell_offset <- createCellOffset(sim_counts) model_plot <- plotModels(scLANE_models, gene = names(scLANE_models)[1], pt = sim_pseudotime, expr.mat = sim_counts, size.factor.offset = cell_offset ) + theme_scLANE()
Truncated p-th power function (positive part).
tp1(x = NULL, t = NULL, p = 1)tp1(x = NULL, t = NULL, p = 1)
x |
A predictor variable value. Defaults to NULL. |
t |
A specified knot value. Defaults to NULL. |
p |
The |
A vector of values that have been transformed using a truncated
power function (positive part) for a specified knot value.
Jakub Stoklosa
David I. Warton
Friedman, J. (1991). Multivariate adaptive regression splines. The Annals of Statistics, 19, 1–67.
Stoklosa, J. and Warton, D.I. (2018). A generalized estimating equation approach to multivariate adaptive regression splines. Journal of Computational and Graphical Statistics, 27, 245–253.
Truncated p-th power function (negative part).
tp2(x = NULL, t = NULL, p = 1)tp2(x = NULL, t = NULL, p = 1)
x |
A predictor variable value. Defaults to NULL. |
t |
A specified knot value. Defaults to NULL. |
p |
The |
A vector of values that have been transformed using a truncated
power function (negative part) for a specified knot value.
Jakub Stoklosa
David I. Warton
Friedman, J. (1991). Multivariate adaptive regression splines. The Annals of Statistics, 19, 1–67.
Stoklosa, J. and Warton, D.I. (2018). A generalized estimating equation approach to multivariate adaptive regression splines. Journal of Computational and Graphical Statistics, 27, 245–253.
Performs a basic Wald test to determine whether an alternate
model is significantly better than a nested null model. This is the GEE
equivalent (kind of) of modelLRT. Be careful with small
sample sizes.
waldTestGEE( mod.1 = NULL, mod.0 = NULL, correction.method = NULL, id.vec = NULL, verbose = FALSE )waldTestGEE( mod.1 = NULL, mod.0 = NULL, correction.method = NULL, id.vec = NULL, verbose = FALSE )
mod.1 |
The model under the alternative hypothesis. Must be of class
|
mod.0 |
The model under the null hypothesis. Must be of class
|
correction.method |
A string specifying the correction method to be used. Currently supported options are "df", "kc", and NULL. Defaults to NULL. |
id.vec |
A vector of subject IDs. Required when
|
verbose |
(Optional) A Boolean specifying whether or not verbose output should be printed to the console. Occasionally useful for debugging. Defaults to FALSE. |
Calculating the test statistic involves taking the inverse of the
variance-covariance matrix of the coefficients. Ideally this would be done
using the "true" inverse with something like solve,
qr.solve, or chol2inv, but in practice this can
cause issues when the variance-covariance matrix is near-singular. With this
in mind, we use the Moore-Penrose pseudoinverse as implemented in
ginv instead, which leads to more stable results.
The p-value is calculated using an asymptotic Chi-squared distribution, with the degrees of freedom equal to the number of non-intercept coefficients in the alternative model.
A list containing the Wald test statistic, a p-value, and the degrees of freedom used in the test.
Jack R. Leary