Cell type enrichment analysis and cellular deconvolution are computational techniques aimed to decipher the cellular heterogeneity in bulk transcriptomics samples.
xCell2
is an R package for cell type enrichment analysis
that builds upon the original xCell methodology (Aran
et al., 2017), offering new features and improved performance.
This vignette introduces the key features of xCell2
and
provides step-by-step guidance for utilizing its functions and
performing cell type enrichment analysis. Whether you are new or an
experienced bioinformatician, this vignette will help you effectively
use xCell2
to advance your research.
For more details on xCell2
and one of its applications
in predicting immune checkpoint blockade responses, see the xCell2 preprint (Angel
et al., 2024).
The most significant advancements in xCell2
is its
flexibility, allowing users to train custom reference objects tailored
to their specific datasets.
Unlike its predecessor, xCell2
supports references
derived from various data types, including single-cell RNA-Seq
(scRNA-Seq), bulk RNA-Seq, and microarrays. This adaptability makes it
possible to generate reference objects that align closely with the
biological context and experimental conditions of your study.
Advantages of Custom Reference Training
Training a custom reference offers several advantages:
Compatibility with Diverse Data Types: Whether
your data originates from scRNA-Seq, bulk RNA-Seq, or microarrays,
xCell2
can adapt to the unique characteristics of each
platform.
Biological Context Relevance: Leveraging data
tailored to your organism, tissue type, or experimental condition
ensures that the resulting xCell2
reference object
accurately reflects the relevant cellular diversity and
characteristics.
Improved Accuracy: Users can further improve the
precision of enrichment analysis by training an xCell2
reference object that include only the genes shared with their bulk
mixture. This ensures that the cell type signatures are exclusively
constructed from genes present in the bulk mixture, addressing potential
limitations of pre-trained references where some signature genes may be
absent.
Expanding Repository of References: Custom
xCell2
reference objects contributed by the developers and
the community will be curated and be accessible within the package.
Those references will makes it convenient to conduct cross-analyses for
the same bulk mixtures by simply loading references that are customized
and shared by researchers in your field.
When to Use Pre-trained References
Deciding whether to use a pre-trained reference or to train your own custom reference depends on the specifics of your study and the data you are analyzing. Below, we outline key considerations where pre-trained references may be more suitable.
Broad Applicability: If your study involves commonly studied tissues or organisms, pre-trained references, such as those available in the xCell2 Reference Repository, may already meet your needs.
Time Efficiency: Using pre-trained references eliminates the need for additional computational time and expertise required to train a custom reference.
Reproducibility: Pre-trained references are standardized and well-documented, ensuring that results are comparable across studies.
Good Compatibility with Reference: When your bulk mixture shares a high percentage of genes with the pre-trained reference, and both datasets originate from the same platform (e.g., RNA-Seq or microarray), using a pre-trained reference can save time while still delivering accurate results.
Another of the key advancements in xCell2 is its ability to leverage cell ontology to account for hierarchical relationships between cell types.
What is Ontological Integration?
Ontological integration uses a structured vocabulary, such as the Cell Ontology (CL),
to map relationships between cell types. These relationships may
include:
- Ancestry: A broader cell type (e.g., “T cells”) encompassing a more
specific cell type (e.g., “CD4+ T cells”).
- Functional Hierarchies: Specialized subtypes within a functional group
(e.g., “Memory CD4+ T cells” as a subset of “CD4+ T cells”).
By incorporating these relationships, xCell2 considers biological lineage when generating cell type signatures and ensures that spillover correction respects the hierarchical dependencies of related cell types.
How Does it Work?
When useOntology
is set to TRUE
(default),
xCell2Train
:
1. Identifies cell type dependencies in the reference using lineage
relationships.
2. Uses this information to adjust gene signatures generation.
3. Incorporates the dependencies during spillover correction to prevent
correction between lineage-related cell types.
Generating Lineage Information
Users can manually inspect and refine lineage relationships using the
xCell2GetLineage
function. xCell2GetLineage
generates a file with descendants and ancestors for each cell type,
which can be reviewed and provided to the xCell2Train
function via the lineageFile
parameter.
Disabling Ontological Integration
While ontological integration is recommended, it can be disabled by
setting useOntology
to FALSE
.
This may be suitable for:
For more information on the implementation and benefits of ontological integration, refer to the xCell2 paper.
Spillover correction is a useful feature of the xCell
and xCell2
algorithms, addressing a challenge in cell type
enrichment analysis.
What is Spillover?
Spillover occurs when the same signature genes are associated with multiple cell types, especially those that are closely related. For instance, some signature genes of CD4+ T cells may also be expressed in CD8+ T cells. This overlap can artificially inflate the enrichment scores of CD8+ T cells, reducing the specificity of the analysis.
How Does xCell2 Correct Spillover?
Simulating Scores: Synthetic mixtures are generated to model the spillover of the enrichment scores between all cell types in the reference.
Creating a Spillover Matrix: This matrix captures the overall degree to which the enrichment score of one cell type influenced by the signatures of the others.
Incorporating Ontology: When ontological integration is enabled, xCell2 uses lineage relationships between cell types to avoid spillover correction. For example, no correction is preformed between parent and child cell types (e.g., T cells and CD4+ T cells).
Applying Spillover Compensation: A linear compensation method, similar to approaches used in flow cytometry, adjusts the enrichment scores. This reduces the contribution of shared genes while preserving the unique expression signals of each cell type.
Maintaining Balance: Overcorrection is avoided by controlling the
correction strength with the spilloverAlpha
parameter.
Controlling Spillover Correction Strength
The strength of spillover correction in xCell2 can be controlled
using the spilloverAlpha
parameter:
Practical Considerations
spilloverAlpha
parameter to suit your
dataset’s characteristics.useOntology
) to prevent
unnecessary spillover corrections by accounting for hierarchical cell
type dependencies.For further details, refer to the first xCell paper
From Bioconductor (Coming Soon)
To install xCell2
from Bioconductor, use:
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install("xCell2")
From GitHub (Development Version)
To install the development version from GitHub, use:
if (!requireNamespace("devtools", quietly = TRUE)) {
install.packages("devtools")
}
devtools::install_github("AlmogAngel/xCell2")
After installation, load the package with:
xCell2Train
The most significant advancement of xCell2
is the
ability to create custom reference objects tailored to your specific
research needs. This section will guide you through the process of
generating a custom xCell2
reference object using the
xCell2Train
function.
Training a custom reference object allows you to:
Include cell types unique to your research area.
Leverage scRNA-seq data as a reference.
Adapt the tool for non-standard organisms, tissues, or experimental conditions.
Improve the accuracy of enrichment analysis.
See also - Custom Reference Training
Before using xCell2Train
, you need to prepare the input
data:
This data frame contain information about each sample or cell in your reference.
It should include the following columns:
"ont"
: Cell type ontology (e.g.,
"CL:0000545"
). Use NA
if not applicable. Read
about assigning ontologies."label"
: Cell type name (e.g.,
"T-helper 1 cell"
)."sample"
: Sample or cell identifier
matching the column names in the reference matrix."dataset"
: Source dataset or subject
identifier (e.g., "DICE"
). If the information is not
available, you can assign a uniform name for all entriesYou can provide a SummarizedExperiment
or
SingleCellExperiment
object as input.
Assays:
- Gene expression data should be stored in one of the following assay
slots:
- "tpm"
(Transcripts Per Million,
recommended)
- "logcounts"
(log-transformed counts)
- "normcounts"
(normalized counts)
- "counts"
(raw counts).
Use assayNames(object)
to check available assay
names.
If multiple assays exist, the function prioritizes
"tpm"
.
Metadata:
- Sample metadata (labels) should be stored in
colData(object)
as a data frame with 4 required
columns:
- "ont"
- Ontology ID (e.g., "CL:0000236"
).
Use NA
if unavailable.
- "label"
- Cell type name.
- "sample"
- Sample or cell identifier matching column
names in the assay.
- "dataset"
- Dataset source for each sample (use a
constant value if not applicable).
Notes:
- Labels with underscores ("_"
) in cell type names are
automatically replaced with dashes ("-"
).
- Microarray references must use "counts"
as the assay
slot.
Let’s walk through an example using a subset of the Database of Immune Cell Expression (DICE):
library(xCell2)
# Load the demo data
data(dice_demo_ref, package = "xCell2")
# Extract reference gene expression matrix
# (Note that values are in TPM but log transformed)
dice_ref <- SummarizedExperiment::assay(dice_demo_ref, "logcounts")
colnames(dice_ref) <- make.unique(colnames(dice_ref)) # Ensure unique sample names
# Extract reference metadata
dice_labels <- SummarizedExperiment::colData(dice_demo_ref)
dice_labels <- as.data.frame(dice_labels) # "label" column already exists
# Prepare labels data frame (the "label" column already exist)
dice_labels$ont <- NA # Here we assign the cell ontology (next section)
dice_labels$sample <- colnames(dice_ref)
dice_labels$dataset <- "DICE"
Assigning cell type ontologies improves the quality of signatures
generated by xCell2Train
. This step is essential for
accurate identification of cell type spesific genes, as it helps account
for lineage dependencies of related cell types (e.g., “T cells, CD4+”
and “T cells, CD4+, memory”).
You may skip this step if:
You can assign cell type ontologies using a controlled vocabulary like the Cell Ontology (CL):
"T CD4+ memory"
)."CD4-positive, alpha-beta memory T cell"
with ID
"CL:0000897"
)."labels"
data frame under the
"ont"
column.# Assign ontologies to cell types
dice_labels[dice_labels$label == "B cells", ]$ont <- "CL:0000236"
dice_labels[dice_labels$label == "Monocytes", ]$ont <- "CL:0000576"
dice_labels[dice_labels$label == "NK cells", ]$ont <- "CL:0000623"
dice_labels[dice_labels$label == "T cells, CD8+", ]$ont <- "CL:0000625"
dice_labels[dice_labels$label == "T cells, CD4+", ]$ont <- "CL:0000624"
dice_labels[dice_labels$label == "T cells, CD4+, memory", ]$ont <- "CL:0000897"
Use the xCell2GetLineage
function to identify and
inspect cell type dependencies based on the Cell Ontology tree. This
function can either output a file in TSV format or return a list mapping
cell types to their ancestors and descendants.
After running the command:
demo_dice_dep.tsv
) in a text
editor or spreadsheet tool."T cells, CD4+, memory"
should be assigned as a
descendant of "T cells, CD4+"
.The lineage file can then be used in the xCell2Train
function via the lineageFile
parameter to enhance the
signatures generation process with user-curated dependencies.
## downloading 1 resources
## retrieving 1 resource
## loading from cache
Interpreting the Output
If a TSV file is generated (outFile provided):
If a list is returned (no outFile specified):
Once your inputs are prepared, you can create a custom
xCell2
reference object using the xCell2Train
function. This object will serve as the basis for cell type enrichment
analysis in your downstream workflows and is a required input for the
xCell2Analysis
function.
xCell2Train
ref
: The prepared reference gene
expression matrix or a SummarizedExperiment
/
SingleCellExperiment
object. For more information, see Preparing the Input Data.
labels
: The labels data frame you
created. If you use a SummarizedExperiment
or
SingleCellExperiment
object, this is not needed because the
metadata should already be included in colData
. See Preparing the Input Data and Assigning Ontologies for
details.
refType
: The type of reference
data. Options are:
"rnaseq"
: Bulk RNA-Seq data."array"
: Microarray data."sc"
: Single-cell RNA-Seq data.useOntology
: Whether to use
ontological integration to account for hierarchical dependencies between
cell types (default: TRUE
). For more information, refer to
the Ontological Integration
in xCell2 section.
lineageFile
: The path to a manually
curated lineage file generated with the xCell2GetLineage
function, if available. For more information, refer to the Checking Lineage
Relationships section.
BPPARAM
: A
BiocParallelParam
instance for parallelization. See Parallelization in xCell2 for
details.
useSpillover
: A Boolean indicating
whether to apply spillover correction during the training process
(default: TRUE
). For a detailed explanation of spillover
correction, see the How
Does xCell2 Correct Spillover? section.
spilloverAlpha
: A numeric value
(default: 0.5
) that controls the strength of spillover
correction. For guidance on spillover correction and choosing alpha,
refer to the How Does
xCell2 Correct Spillover? section.
returnSignatures
: A Boolean
indicating whether to return only the cell type-specific gene signatures
from the training process (default: FALSE
). Setting this
parameter to TRUE
provides the signatures alone, without
generating the full reference object. This can be useful for advanced
users who wish to analyze or modify the signatures directly or use them
in custom workflows.
returnAnalysis
: A Boolean
indicating whether to directly return the results of
xCell2Analysis
using the trained reference object and the
provided bulk mixture data (default: FALSE
). When set to
TRUE
, the function do not return the reference
object.
The following code demonstrates how to generate a reference object.
Before running this example make sure you have dice_ref
and
dice_lables
loaded in R by running previous examples of “Preparing the Input Data”
and “Assigning Cell Type
Ontology”.
library(BiocParallel)
parallel_param <- MulticoreParam(workers = 2) # Adjust workers as needed
set.seed(123) # (optional) For reproducibility
DICE_demo.xCell2Ref <- xCell2Train(
ref = dice_ref,
labels = dice_labels,
refType = "rnaseq",
BPPARAM = parallel_param
)
## Starting xCell2 Train...
## Finding dependencies using cell type ontology...
## loading from cache
## Generating signatures...
## Learning linear transformation and spillover parameters...
## Your custom xCell2 reference object is ready!
## > Please consider sharing with others here: https://dviraran.github.io/xCell2ref
Why Use set.seed()
?
The set.seed()
function is used to ensure
reproducibility. This is only relevant when working with scRNA-Seq data,
as pseudo-bulk samples are generated by randomly sampling cells. Setting
a seed ensures that the same random cells are selected each time the
code is run, allowing you to replicate results and maintain consistency
across analyses. If you are not working with scRNA-Seq data or if exact
reproducibility is not a concern, setting the seed is optional.
After creating your custom reference, you can use it for cell type
enrichment analysis with xCell2Analysis
. We’ll cover this in the next
section.
xCell2
provides a selection of pre-trained reference
objects that can be directly used in your analysis. These references
cover diverse cell types and are based on well-curated datasets.
Pre-trained references are ideal for researchers who prefer not to train
custom references or need a starting point for their analysis.
The following table provides an overview of the available pre-trained references as to January 2025.
Dataset | Study | Species | Normalization | nSamples/Cells | nCellTypes | Platform | Tissues |
---|---|---|---|---|---|---|---|
BlueprintEncode | Martens JHA and Stunnenberg HG (2013), The ENCODE Project Consortium (2012), Aran D (2019) | Homo Sapiens | TPM | 259 | 43 | RNA-seq | Mixed |
ImmGenData | The Immunological Genome Project Consortium (2008), Aran D (2019) | Mus Musculus | RMA | 843 | 19 | Microarray | Immune/Blood |
Immune Compendium | Zaitsev A (2022) | Homo Sapiens | TPM | 3626 | 40 | RNA-seq | Immune/Blood |
LM22 | Chen B (2019) | Homo Sapiens | RMA | 113 | 22 | Microarray | Mixed |
MouseRNAseqData | Benayoun B (2019) | Mus Musculus | TPM | 358 | 18 | RNA-seq | Mixed |
Pan Cancer | Nofech-Mozes I (2023) | Homo Sapiens | Counts | 25084 | 29 | scRNA-seq | Tumor |
Tabula Muris Blood | The Tabula Muris Consortium (2018) | Mus Musculus | Counts | 11145 | 6 | scRNA-seq | Bone Marrow, Spleen, Thymus |
Tabula Sapiens Blood | The Tabula Sapiens Consortium (2022) | Homo Sapiens | Counts | 11921 | 18 | scRNA-seq | Blood, Lymph_Node, Spleen, Thymus, Bone Marrow |
TME Compendium | Zaitsev A (2022) | Homo Sapiens | TPM | 8146 | 25 | RNA-seq | Tumor |
From the Package Popular pre-trained references are
included directly in the xCell2
package. You can load them
using the data()
function:
From the xCell2 References Repository You can download additional pre-trained references from the xCell2 References Repository.
Use the following code to download a pre-trained reference within R:
# Set the URL of the pre-trained reference
ref_url <- "https://dviraran.github.io/xCell2refs/references/BlueprintEncode.xCell2Ref.rds"
# Set the local filename to save the reference
local_filename <- "BlueprintEncode.xCell2Ref.rds"
# Download the file
download.file(ref_url, local_filename, mode = "wb")
# Load the downloaded reference
BlueprintEncode.xCell2Ref <- readRDS(local_filename)
Selecting an appropriate reference is crucial for accurate cell type enrichment analysis. Consider the following:
Organism and Tissue Type: Select a reference that corresponds to the organism and tissue type of your samples.
Platform: While not mandatory, it is recommended to start with a reference that aligns with your dataset type and normalization.
Condition-Specific: If available, select a reference that corresponds to the condition you are investigating. For example, a scRNA-Seq reference created from datasets of blood immune cell types in autoimmune diseases, would be more suitable than a dataset of blood immune cell types from healthy controls.
After loading a pre-trained reference, you can proceed to perform
cell type enrichment analysis with the xCell2Analysis
function. For details, see the Performing Cell Type Enrichment
Analysis section.
xCell2Analysis
After training or obtaining an xCell2
reference object,
the next step is to use it for cell type enrichment analysis on your
bulk transcriptomics data. This section will guide you through using the
xCell2Analysis
function and interpreting its results.
Before running the analysis, ensure you have:
xCell2
reference object (see Creating a Custom
Reference or Using
Pre-trained References).Ensuring Gene Compatibility Between Bulk Mixture and Reference
For accurate cell type enrichment analysis, it is essential to verify
compatibility between the genes in your bulk mixture (mix
)
and the reference object (xcell2object
). Compatibility
means that the gene nomenclature (e.g., gene symbols, HUGO IDs, Ensembl
IDs) match and there is sufficient overlap between the genes in the bulk
mixture and the reference.
The getGenesUsed
function allows you to access the genes
used for the reference training:
library(xCell2)
data("DICE_demo.xCell2Ref", package = "xCell2")
genes_ref <- getGenesUsed(DICE_demo.xCell2Ref)
genes_ref[1:10]
## [1] "CD28" "S100B" "IL2RB" "CD3E" "KRT72" "CTLA4" "CD3D" "GNLY" "IL7R"
## [10] "MAL"
You can calculate the percentage of overlapping genes between your bulk expression matrix and the reference. For example:
# Load demo bulk gene expression mixture
data("mix_demo", package = "xCell2")
genes_mix <- rownames(mix_demo)
# Calculate the percentage of overlapping genes
overlap_percentage <- round(length(intersect(genes_mix, genes_ref)) / length(genes_ref) * 100, 2)
print(paste0("Overlap between mixture and reference genes is: ", overlap_percentage, "%"))
## [1] "Overlap between mixture and reference genes is: 100%"
Ensure the overlap is sufficient (default minimum: 90%). You can
adjust this threshold in xCell2Analysis
using the
minSharedGenes
parameter:
xcell2_results <- xCell2Analysis(
mix = mix_demo,
xcell2object = DICE_demo.xCell2Ref,
minSharedGenes = 0.8 # Allow for a lower overlap threshold
)
## Starting xCell2 Analysis...
## Calculating enrichment scores for all cell types...
## Performing spillover correction...
## xCell2 Analysis completed successfully.
If your expression matrix and the reference share a low percentage of
common genes, consider creating a custom reference using the xCell2Train
function. Custom references ensure better alignment between your bulk
mixture and the reference, leading to more accurate enrichment
results.
xCell2Analysis
mix
: The bulk mixture of gene
expression matrix to analyze, with genes in rows and samples in columns.
The gene nomenclature must match the reference, see Ensuring Gene
Compatibility.
xcell2object
: A pre-trained
reference object of class xCell2Object
, created using the
xCell2Train
function. Pre-trained references are also
available for common use cases. See Using Pre-trained xCell2
References.
minSharedGenes
: The minimum
fraction of shared genes required between the bulk mixture
(mix
) and the reference (xcell2object
). The
default value is 0.9
. If the overlap is insufficient, the
function will terminate with an error. Adjust this threshold if
necessary, but note that higher overlap ensures more reliable results.
To check gene overlap, see Ensuring
Gene Compatibility.
rawScores
: A Boolean indicating
whether to return raw enrichment scores (default: FALSE
).
Raw enrichment scores are computed directly from gene signatures without
applying linear transformation or spillover correction. This option can
be useful for debugging or advanced workflows.
spillover
: A Boolean indicating
whether to apply spillover correction on the enrichment scores (default:
TRUE
). Spillover correction addresses signature genes
overlaps between closely related cell types, improving specificity. For
details, see How Does
xCell2 Correct Spillover?.
spilloverAlpha
: A numeric value
(default: 0.5
) controlling the strength of spillover
correction. Lower values apply weaker correction, while higher values
apply stronger correction. Adjust this parameter based on the similarity
of cell types in your reference. See Spillover Correction in
xCell2.
BPPARAM
: A
BiocParallelParam
instance that determines the
parallelization strategy. This parameter allows you to leverage
multi-core processing for faster computation. For more details, see Parallelization in
xCell2.
To calculate the cell type enrichment, use the following command*:
## Starting xCell2 Analysis...
## Calculating enrichment scores for all cell types...
## Performing spillover correction...
## xCell2 Analysis completed successfully.
## F0303 F0304 F0305
## B cells 0.006400000 0.000000000 0.00647
## Monocytes 0.097510000 0.000000000 0.11779
## NK cells 0.000000000 0.018030000 0.00200
## T cells, CD8+ 0.006078756 0.001443457 0.00000
## T cells, CD4+ 0.009140549 0.020706785 0.00000
## T cells, CD4+, memory 0.001950000 0.009530000 0.00000
The xCell2Analysis
function return a matrix of cell type
enrichment scores with the following structure:
The output of xCell2Analysis
, a matrix of cell type
enrichment scores, can be utilized for a variety of advanced analyses to
uncover valuable biological insights. Below are examples for some common
applications.
What is Tumor Purity? Tumor purity refers to the proportion of tumor cells within a sample, compared to the surrounding non-tumor cells (the tumor microenvironment), such as immune and stromal cells. Tumor purity typically varies across samples and can influence the interpretation of enrichment scores.
Why Normalize by Tumor Purity? In tumor samples, differences in tumor purity can skew the enrichment scores of non-tumor components like immune or stromal cells. For example, a sample with high tumor purity (e.g., 80%) contains a smaller proportion of tumor microenvironment (TME) cells compared to a sample with lower tumor purity (e.g., 40%). This makes direct comparisons of immune cell enrichment scores between these samples challenging. By normalizing the enrichment scores by tumor purity, we can isolate and compare the cellular composition of the TME alone, providing a clearer picture of immune or stromal cell dynamics.
How to Obtain Tumor Purity Values Tumor purity
values can be estimated using external tools or methods. Some common
approaches include: - Genomic Analysis Tools: Tools
like ABSOLUTE or ESTIMATE can compute tumor purity values based on DNA
sequencing or gene expression data. - Published
Datasets: Tumor purity estimates are available for many
publicly available datasets, such as The Cancer Genome Atlas (TCGA). -
Future xCell2 Feature: In a future update,
xCell2
will introduce a feature to compute the tumor
microenvironment (TME) score. This score has been shown to have a
negative correlation with tumor purity, as demonstrated in the study
“Systematic pan-cancer analysis of tumour purity” (DOI: 10.1038/ncomms9971).
Normalization Formula The normalized enrichment score for a cell type is calculated as:
$$ \text{Normalized Score} = \frac{\text{Enrichment Score}}{1 - \text{Tumor Purity}} $$ where 1 − Tumor Purity represents the fraction of non-tumor cells in the sample.
Interpreting Normalized Scores After normalization, the enrichment scores reflect the relative abundance of the cell type within the TME, accounting for the sample’s tumor purity. This adjustment allows for more meaningful comparisons of immune or stromal cell enrichment between samples with varying tumor purity levels. For example, in the plot above, normalized scores for CD8+ T cells highlight differences that might otherwise be obscured in raw scores due to variations in tumor purity.
For one cell type
Compare enrichment scores between groups (e.g., healthy vs. diseased) using statistical tests like the Wilcoxon rank-sum test.
library(ggplot2, quietly = TRUE)
# Generate pseudo enrichment scores for demonstration
set.seed(123)
conditions <- c(rep("Healthy", 5), rep("Disease", 5))
enrichment_scores <- data.frame(
Condition = conditions,
T_cells_CD8 = c(runif(5, 0.4, 0.6), runif(5, 0.6, 0.8)),
T_cells_CD4 = c(runif(5, 0.3, 0.5), runif(5, 0.5, 0.7))
)
# Wilcoxon rank-sum test for CD8+ T cells
wilcox_test <- wilcox.test(T_cells_CD8 ~ Condition, data = enrichment_scores)
print(wilcox_test)
##
## Wilcoxon rank sum exact test
##
## data: T_cells_CD8 by Condition
## W = 25, p-value = 0.007937
## alternative hypothesis: true location shift is not equal to 0
# Boxplot with p-value
ggplot(enrichment_scores, aes(x = Condition, y = T_cells_CD8)) +
geom_boxplot() +
geom_jitter(width = 0.2) +
labs(
title = "CD8+ T Cells Enrichment Across Conditions",
x = "Condition", y = "Enrichment Score"
) +
annotate("text", x = 1.5, y = max(enrichment_scores$T_cells_CD8),
label = paste("p =", signif(wilcox_test$p.value, 3)))
For many cell types
Visualize differential cell type enrichment between conditions using a volcano plot with Wilcoxon rank-sum test.
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Simulate enrichment scores for 50 cell types
set.seed(123)
cell_types <- paste0("CellType_", 1:50)
enrichment_scores_many <- data.frame(
Sample = paste0("Sample", 1:10),
Condition = factor(rep(c("Healthy", "Disease"), each = 5))
)
for (cell_type in cell_types) {
# Healthy group: random around baseline
healthy <- rnorm(5, mean = 0.5, sd = 0.1)
# Disease group: add variability and directional shifts
disease <- rnorm(5, mean = sample(c(0.4, 0.5, 0.6), 1), sd = 0.1)
enrichment_scores_many[[cell_type]] <- c(healthy, disease)
}
special_cells <- sample(cell_types, 2)
for (cell_type in special_cells) {
enrichment_scores_many[1:5, cell_type] <- rnorm(5, mean = 0.4, sd = 0.05) # Lower in healthy
}
# Calculate log fold change (LFC) and p-values for each cell type using Wilcoxon test
differential_results <- enrichment_scores_many %>%
tidyr::pivot_longer(cols = starts_with("CellType"), names_to = "CellType", values_to = "Enrichment") %>%
group_by(CellType) %>%
summarise(
LFC = log2(median(Enrichment[Condition == "Disease"]) / median(Enrichment[Condition == "Healthy"])),
p_value = wilcox.test(Enrichment ~ Condition)$p.value
) %>%
ungroup()
# Prepare data for EnhancedVolcano
volcano_data <- differential_results %>%
mutate(
neg_log10_p = -log10(p_value),
Significance = ifelse(p_value < 0.05, "Significant", "Not Significant")
) %>%
rename(
log2FC = LFC,
pval = p_value
)
# Plot using EnhancedVolcano
EnhancedVolcano(
volcano_data,
lab = volcano_data$CellType,
x = "log2FC",
y = "pval",
pCutoff = 0.05,
FCcutoff = 0.5,
title = "Volcano Plot of Cell Type Enrichment",
subtitle = "Simulated Data for 50 Cell Types",
xlab = "Log2 Fold Change (LFC)",
ylab = "-Log10(p-value)",
pointSize = 2.0,
labSize = 3.0,
xlim = c(-1, 1),
ylim = c(0, 3)
)
Correlate enrichment scores with clinical outcomes or molecular features, such as mutation burden.
# Simulated clinical feature (e.g., mutation burden)
set.seed(123)
mutation_burden <- runif(10, min = 0, max = 100)
# Calculate Pearson and Spearman correlation
pearson_corr <- cor.test(enrichment_scores$T_cells_CD8, mutation_burden, method = "pearson")
spearman_corr <- cor.test(enrichment_scores$T_cells_CD8, mutation_burden, method = "spearman")
# Scatterplot of the correlation with annotations for Pearson and Spearman coefficients
ggplot(enrichment_scores, aes(x = mutation_burden, y = T_cells_CD8)) +
geom_point(size = 3) +
geom_smooth(method = "lm", se = FALSE, color = "blue") +
labs(
title = "Correlation Between Mutation Burden and CD8+ T Cells Enrichment",
x = "Mutation Burden",
y = "CD8+ T Cells Enrichment Score"
) +
annotate("text", x = 15, y = 0.7,
label = paste0("Pearson: ", signif(pearson_corr$estimate, 3),
"\nP-value: ", signif(pearson_corr$p.value, 3)),
hjust = 0, color = "darkred") +
annotate("text", x = 15, y = 0.65,
label = paste0("Spearman: ", signif(spearman_corr$estimate, 3),
"\nP-value: ", signif(spearman_corr$p.value, 3)),
hjust = 0, color = "darkgreen")
## `geom_smooth()` using formula = 'y ~ x'
Use enrichment scores to group samples and visualize their cellular composition landscape.
# Simulated enrichment scores for demonstration
set.seed(123)
enrichment_scores <- data.frame(
Sample = paste0("Sample", 1:10),
Condition = rep(c("Healthy", "Disease"), each = 5),
CD8_T_cells = runif(10, min = 0.1, max = 0.8),
B_cells = runif(10, min = 0.2, max = 0.6),
NK_cells = runif(10, min = 0.05, max = 0.5)
)
# Perform PCA on the simulated enrichment scores
pca <- prcomp(enrichment_scores[, 3:5], scale. = TRUE)
# Create a data frame for PCA results
pca_df <- data.frame(
PC1 = pca$x[, 1],
PC2 = pca$x[, 2],
Condition = enrichment_scores$Condition
)
# Visualize the first two principal components
ggplot(pca_df, aes(x = PC1, y = PC2, color = Condition)) +
geom_point(size = 3) +
labs(
title = "PCA of Enrichment Scores",
x = "PC1 (Principal Component 1)",
y = "PC2 (Principal Component 2)"
) +
theme_minimal()
Perform k-means clustering on the original enrichment scores
kmeans_result <- kmeans(enrichment_scores[, 3:5], centers = 2, nstart = 25)
# Add cluster assignment to the PCA results
pca_df <- data.frame(
PC1 = pca$x[, 1],
PC2 = pca$x[, 2],
Condition = enrichment_scores$Condition,
Cluster = as.factor(kmeans_result$cluster)
)
# Visualize the PCA results with k-means clusters
ggplot(pca_df, aes(x = PC1, y = PC2, color = Cluster, shape = Condition)) +
geom_point(size = 3) +
labs(
title = "PCA of Enrichment Scores with K-Means Clusters",
x = "PC1 (Principal Component 1)",
y = "PC2 (Principal Component 2)"
) +
scale_color_manual(values = c("1" = "blue", "2" = "red")) +
theme_minimal()
Use enrichment scores as predictive features in machine learning models to classify conditions.
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
# Simulated enrichment scores for demonstration
set.seed(123)
enrichment_scores <- data.frame(
Sample = paste0("Sample", 1:10),
Condition = factor(rep(c("Healthy", "Disease"), each = 5)),
CD8_T_cells = runif(10, min = 0.1, max = 0.8),
B_cells = runif(10, min = 0.2, max = 0.6),
NK_cells = runif(10, min = 0.05, max = 0.5)
)
# Train a random forest model
rf_model <- randomForest(Condition ~ CD8_T_cells + B_cells + NK_cells,
data = enrichment_scores,
ntree = 500, importance = TRUE)
# Assess variable importance
var_importance <- data.frame(
Variable = rownames(importance(rf_model)),
Importance = importance(rf_model)[, 1]
)
# Plot variable importance
ggplot(var_importance, aes(x = reorder(Variable, Importance), y = Importance)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() +
labs(
title = "Feature Importance from Random Forest Model",
x = "Features",
y = "Importance Score"
) +
theme_minimal()
xCell2 leverages the BiocParallel
package for parallel
processing, enabling computationally intensive tasks to run faster by
utilizing multiple processors.
By default, xCell2 performs computations sequentially using
BiocParallel::SerialParam
.
To enable parallel processing, users can specify a
BiocParallelParam
object via the BPPARAM argument in
functions like xCell2Train
and
xCell2Analysis
.
To use parallelization, simply create a
BiocParallelParam
instance:
library(BiocParallel)
# Example: Setting up parallel processing with 2 workers
BPPARAM <- MulticoreParam(workers = 2) # Adjust the number of workers as needed
If you are using an operating system where MulticoreParam is
unavailable (e.g., certain Windows configurations),
SnowParam
can be used as an alternative:
Example: Generating the xCell2 Reference Object with Parallelization
# Generate the xCell2 reference object with parallel processing
DICE_demo.xCell2Ref <- xCell2Train(
ref = dice_ref,
labels = dice_labels,
refType = "rnaseq",
BPPARAM = BPPARAM
)
Example: Generating the xCell2 Reference Object with Parallelization
# Perform cell type enrichment analysis with parallel processing
xcell2_results <- xCell2Analysis(
mix = mix_demo,
xcell2object = DICE_demo.xCell2Ref,
BPPARAM = BPPARAM
)
For more information on BiocParallel
and its options,
visit the BiocParallel
package documentation.
If you use xCell2
in your research, please cite: Angel
A, Naom L, Nabel-Levy S, Aran D. xCell 2.0: Robust Algorithm for Cell
Type Proportion Estimation Predicts Response to Immune Checkpoint
Blockade. bioRxiv 2024.
Aran, D., Hu, Z., & Butte, A. J. (2017). xCell: digitally portraying the tissue cellular heterogeneity landscape. Genome biology, 18, 1-14.
Aran, D. (2021). Extracting insights from heterogeneous tissues. Nature Computational Science, 1(4), 247-248.
Angel, A., Naom, L., Nabet-Levy, S., & Aran, D. (2024). xCell 2.0: Robust Algorithm for cell type Proportion Estimation Predicts Response to Immune Checkpoint Blockade. bioRxiv, 2024-09.
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] randomForest_4.7-1.2 dplyr_1.1.4 EnhancedVolcano_1.25.0
## [4] ggrepel_0.9.6 ggplot2_3.5.1 BiocParallel_1.41.0
## [7] xCell2_0.99.107
##
## loaded via a namespace (and not attached):
## [1] DBI_1.2.3 GSEABase_1.69.0
## [3] rlang_1.1.5 magrittr_2.0.3
## [5] matrixStats_1.5.0 compiler_4.4.2
## [7] RSQLite_2.3.9 mgcv_1.9-1
## [9] png_0.1-8 vctrs_0.6.5
## [11] reshape2_1.4.4 RcppZiggurat_0.1.6
## [13] quadprog_1.5-8 stringr_1.5.1
## [15] pkgconfig_2.0.3 crayon_1.5.3
## [17] fastmap_1.2.0 dbplyr_2.5.0
## [19] XVector_0.47.2 labeling_0.4.3
## [21] rmarkdown_2.29 tzdb_0.4.0
## [23] pracma_2.4.4 graph_1.85.1
## [25] UCSC.utils_1.3.1 purrr_1.0.2
## [27] bit_4.5.0.1 xfun_0.50
## [29] Rfast_2.1.4 cachem_1.1.0
## [31] GenomeInfoDb_1.43.2 jsonlite_1.8.9
## [33] progress_1.2.3 blob_1.2.4
## [35] DelayedArray_0.33.4 parallel_4.4.2
## [37] prettyunits_1.2.0 singscore_1.27.0
## [39] R6_2.5.1 bslib_0.8.0
## [41] stringi_1.8.4 limma_3.63.3
## [43] GenomicRanges_1.59.1 jquerylib_0.1.4
## [45] Rcpp_1.0.14 SummarizedExperiment_1.37.0
## [47] knitr_1.49 readr_2.1.5
## [49] IRanges_2.41.2 splines_4.4.2
## [51] Matrix_1.7-1 tidyselect_1.2.1
## [53] abind_1.4-8 yaml_2.3.10
## [55] codetools_0.2-20 minpack.lm_1.2-4
## [57] curl_6.1.0 lattice_0.22-6
## [59] tibble_3.2.1 plyr_1.8.9
## [61] withr_3.0.2 Biobase_2.67.0
## [63] KEGGREST_1.47.0 evaluate_1.0.3
## [65] ontologyIndex_2.12 RcppParallel_5.1.9
## [67] BiocFileCache_2.15.1 Biostrings_2.75.3
## [69] pillar_1.10.1 BiocManager_1.30.25
## [71] filelock_1.0.3 MatrixGenerics_1.19.1
## [73] stats4_4.4.2 generics_0.1.3
## [75] BiocVersion_3.21.1 S4Vectors_0.45.2
## [77] hms_1.1.3 munsell_0.5.1
## [79] scales_1.3.0 xtable_1.8-4
## [81] glue_1.8.0 maketools_1.3.1
## [83] tools_4.4.2 AnnotationHub_3.15.0
## [85] sys_3.4.3 locfit_1.5-9.10
## [87] annotate_1.85.0 buildtools_1.0.0
## [89] XML_3.99-0.18 grid_4.4.2
## [91] tidyr_1.3.1 AnnotationDbi_1.69.0
## [93] edgeR_4.5.1 colorspace_2.1-1
## [95] SingleCellExperiment_1.29.1 nlme_3.1-166
## [97] GenomeInfoDbData_1.2.13 cli_3.6.3
## [99] rappdirs_0.3.3 S4Arrays_1.7.1
## [101] gtable_0.3.6 sass_0.4.9
## [103] digest_0.6.37 BiocGenerics_0.53.3
## [105] SparseArray_1.7.4 farver_2.1.2
## [107] memoise_2.0.1 htmltools_0.5.8.1
## [109] lifecycle_1.0.4 httr_1.4.7
## [111] mime_0.12 statmod_1.5.0
## [113] bit64_4.6.0-1