xCell 2.0: Cell Type Enrichment Analysis

Introduction to xCell 2.0 and Key Features

Cell type enrichment analysis and cellular deconvolution are computational techniques aimed to decipher the cellular heterogeneity in bulk transcriptomics samples.

xCell2 is an R package for cell type enrichment analysis that builds upon the original xCell methodology (Aran et al., 2017), offering new features and improved performance.

This vignette introduces the key features of xCell2 and provides step-by-step guidance for utilizing its functions and performing cell type enrichment analysis. Whether you are new or an experienced bioinformatician, this vignette will help you effectively use xCell2 to advance your research.

For more details on xCell2 and one of its applications in predicting immune checkpoint blockade responses, see the xCell2 preprint (Angel et al., 2024).

Custom Reference Training

The most significant advancements in xCell2 is its flexibility, allowing users to train custom reference objects tailored to their specific datasets.

Unlike its predecessor, xCell2 supports references derived from various data types, including single-cell RNA-Seq (scRNA-Seq), bulk RNA-Seq, and microarrays. This adaptability makes it possible to generate reference objects that align closely with the biological context and experimental conditions of your study.

Advantages of Custom Reference Training

Training a custom reference offers several advantages:

  1. Compatibility with Diverse Data Types: Whether your data originates from scRNA-Seq, bulk RNA-Seq, or microarrays, xCell2 can adapt to the unique characteristics of each platform.

  2. Biological Context Relevance: Leveraging data tailored to your organism, tissue type, or experimental condition ensures that the resulting xCell2 reference object accurately reflects the relevant cellular diversity and characteristics.

  3. Improved Accuracy: Users can further improve the precision of enrichment analysis by training an xCell2 reference object that include only the genes shared with their bulk mixture. This ensures that the cell type signatures are exclusively constructed from genes present in the bulk mixture, addressing potential limitations of pre-trained references where some signature genes may be absent.

  4. Expanding Repository of References: Custom xCell2 reference objects contributed by the developers and the community will be curated and be accessible within the package. Those references will makes it convenient to conduct cross-analyses for the same bulk mixtures by simply loading references that are customized and shared by researchers in your field.

When to Use Pre-trained References

Deciding whether to use a pre-trained reference or to train your own custom reference depends on the specifics of your study and the data you are analyzing. Below, we outline key considerations where pre-trained references may be more suitable.

  • Broad Applicability: If your study involves commonly studied tissues or organisms, pre-trained references, such as those available in the xCell2 Reference Repository, may already meet your needs.

  • Time Efficiency: Using pre-trained references eliminates the need for additional computational time and expertise required to train a custom reference.

  • Reproducibility: Pre-trained references are standardized and well-documented, ensuring that results are comparable across studies.

  • Good Compatibility with Reference: When your bulk mixture shares a high percentage of genes with the pre-trained reference, and both datasets originate from the same platform (e.g., RNA-Seq or microarray), using a pre-trained reference can save time while still delivering accurate results.

Ontological Integration

Another of the key advancements in xCell2 is its ability to leverage cell ontology to account for hierarchical relationships between cell types.

What is Ontological Integration?

Ontological integration uses a structured vocabulary, such as the Cell Ontology (CL), to map relationships between cell types. These relationships may include:
- Ancestry: A broader cell type (e.g., “T cells”) encompassing a more specific cell type (e.g., “CD4+ T cells”).
- Functional Hierarchies: Specialized subtypes within a functional group (e.g., “Memory CD4+ T cells” as a subset of “CD4+ T cells”).

By incorporating these relationships, xCell2 considers biological lineage when generating cell type signatures and ensures that spillover correction respects the hierarchical dependencies of related cell types.

How Does it Work?

When useOntology is set to TRUE (default), xCell2Train:
1. Identifies cell type dependencies in the reference using lineage relationships.
2. Uses this information to adjust gene signatures generation.
3. Incorporates the dependencies during spillover correction to prevent correction between lineage-related cell types.

Generating Lineage Information

Users can manually inspect and refine lineage relationships using the xCell2GetLineage function. xCell2GetLineage generates a file with descendants and ancestors for each cell type, which can be reviewed and provided to the xCell2Train function via the lineageFile parameter.

Disabling Ontological Integration

While ontological integration is recommended, it can be disabled by setting useOntology to FALSE.

This may be suitable for:

  • Small references where all cell types are independent.
  • Situations where hierarchical relationships are irrelevant to the analysis.

For more information on the implementation and benefits of ontological integration, refer to the xCell2 paper.

Spillover Correction

Spillover correction is a useful feature of the xCell and xCell2 algorithms, addressing a challenge in cell type enrichment analysis.

What is Spillover?

Spillover occurs when the same signature genes are associated with multiple cell types, especially those that are closely related. For instance, some signature genes of CD4+ T cells may also be expressed in CD8+ T cells. This overlap can artificially inflate the enrichment scores of CD8+ T cells, reducing the specificity of the analysis.

How Does xCell2 Correct Spillover?

  1. Simulating Scores: Synthetic mixtures are generated to model the spillover of the enrichment scores between all cell types in the reference.

  2. Creating a Spillover Matrix: This matrix captures the overall degree to which the enrichment score of one cell type influenced by the signatures of the others.

  3. Incorporating Ontology: When ontological integration is enabled, xCell2 uses lineage relationships between cell types to avoid spillover correction. For example, no correction is preformed between parent and child cell types (e.g., T cells and CD4+ T cells).

  4. Applying Spillover Compensation: A linear compensation method, similar to approaches used in flow cytometry, adjusts the enrichment scores. This reduces the contribution of shared genes while preserving the unique expression signals of each cell type.

  5. Maintaining Balance: Overcorrection is avoided by controlling the correction strength with the spilloverAlpha parameter.

Controlling Spillover Correction Strength

The strength of spillover correction in xCell2 can be controlled using the spilloverAlpha parameter:

  • Lower Alpha Values: Apply weaker correction, suitable for datasets where cell types are distinct.
  • Higher Alpha Values: Apply stronger correction, ideal for datasets with closely related cell types or high spillover potential.
  • Default Alpha Value (0.5): Provides a balanced approach for most cases.

Practical Considerations

  • Spillover correction is especially important when analyzing closely related cell types, which often have overlapping gene expression patterns.
  • Adjust the spilloverAlpha parameter to suit your dataset’s characteristics.
  • Use ontological integration (useOntology) to prevent unnecessary spillover corrections by accounting for hierarchical cell type dependencies.

For further details, refer to the first xCell paper

Installation

From Bioconductor (Coming Soon)

To install xCell2 from Bioconductor, use:

if (!requireNamespace("BiocManager", quietly = TRUE)) {
 install.packages("BiocManager")
}
BiocManager::install("xCell2")

From GitHub (Development Version)

To install the development version from GitHub, use:

if (!requireNamespace("devtools", quietly = TRUE)) {
 install.packages("devtools")
}
devtools::install_github("AlmogAngel/xCell2")

After installation, load the package with:

library(xCell2)

Creating a Custom Reference with xCell2Train

The most significant advancement of xCell2 is the ability to create custom reference objects tailored to your specific research needs. This section will guide you through the process of generating a custom xCell2 reference object using the xCell2Train function.

Why Create a Custom Reference?

Training a custom reference object allows you to:

  • Include cell types unique to your research area.

  • Leverage scRNA-seq data as a reference.

  • Adapt the tool for non-standard organisms, tissues, or experimental conditions.

  • Improve the accuracy of enrichment analysis.

See also - Custom Reference Training

Preparing the Input Data

Before using xCell2Train, you need to prepare the input data:

1. Reference Gene Expression Matrix

  • Can be generated from various platforms: microarray, bulk RNA-Seq, or single-cell RNA-Seq.
  • Genes should be in rows, and samples or cells should be in columns.
  • You can use different gene nomenclatures (e.g., gene symbols, HUGO IDs, Ensembl IDs) as long as the genes in the reference and bulk mixture are of the same type.
  • For RNA-Seq data, normalization to gene length and library size (e.g., TPM, RPKM, FPKM) is recommended but not mandatory.
  • xCell2 uses ranks of gene expression values, so transformations such as log-normalization do not affect results.
  • Microarray data typically requires no additional normalization.

2. Labels Data Frame

This data frame contain information about each sample or cell in your reference.

It should include the following columns:

  • "ont": Cell type ontology (e.g., "CL:0000545"). Use NA if not applicable. Read about assigning ontologies.
  • "label": Cell type name (e.g., "T-helper 1 cell").
  • "sample": Sample or cell identifier matching the column names in the reference matrix.
  • "dataset": Source dataset or subject identifier (e.g., "DICE"). If the information is not available, you can assign a uniform name for all entries

Using SummarizedExperiment or SingleCellExperiment Objects

You can provide a SummarizedExperiment or SingleCellExperiment object as input.

Assays:
- Gene expression data should be stored in one of the following assay slots:
- "tpm" (Transcripts Per Million, recommended)
- "logcounts" (log-transformed counts)
- "normcounts" (normalized counts)
- "counts" (raw counts).

  • Use assayNames(object) to check available assay names.

  • If multiple assays exist, the function prioritizes "tpm".

Metadata:
- Sample metadata (labels) should be stored in colData(object) as a data frame with 4 required columns:
- "ont" - Ontology ID (e.g., "CL:0000236"). Use NA if unavailable.
- "label" - Cell type name.
- "sample" - Sample or cell identifier matching column names in the assay.
- "dataset" - Dataset source for each sample (use a constant value if not applicable).

Notes:
- Labels with underscores ("_") in cell type names are automatically replaced with dashes ("-").
- Microarray references must use "counts" as the assay slot.

Example: preparing the input data

Let’s walk through an example using a subset of the Database of Immune Cell Expression (DICE):

library(xCell2)

# Load the demo data
data(dice_demo_ref, package = "xCell2")

# Extract reference gene expression matrix
# (Note that values are in TPM but log transformed)
dice_ref <- SummarizedExperiment::assay(dice_demo_ref, "logcounts")
colnames(dice_ref) <- make.unique(colnames(dice_ref)) # Ensure unique sample names

# Extract reference metadata
dice_labels <- SummarizedExperiment::colData(dice_demo_ref)
dice_labels <- as.data.frame(dice_labels) # "label" column already exists

# Prepare labels data frame (the "label" column already exist)
dice_labels$ont <- NA # Here we assign the cell ontology (next section)
dice_labels$sample <- colnames(dice_ref)
dice_labels$dataset <- "DICE"

Assigning Cell Type Ontology (optional but recommended)

Assigning cell type ontologies improves the quality of signatures generated by xCell2Train. This step is essential for accurate identification of cell type spesific genes, as it helps account for lineage dependencies of related cell types (e.g., “T cells, CD4+” and “T cells, CD4+, memory”).

When to Skip Ontology Assignment

You may skip this step if:

  • You do not want to account for cell type dependencies (not recommended).
  • You are confident that no lineage relationships exist within your reference.

Assigning Ontologies

You can assign cell type ontologies using a controlled vocabulary like the Cell Ontology (CL):

  1. Visit the EMBL-EBI Ontology Lookup Service.
  2. Search for your cell type (e.g., "T CD4+ memory").
  3. Identify the best match (e.g., "CD4-positive, alpha-beta memory T cell" with ID "CL:0000897").
  4. Assign the appropriate ontology ID to each cell type in the "labels" data frame under the "ont" column.

Example: assigning cell type ontology

Run previous example first.

# Assign ontologies to cell types
dice_labels[dice_labels$label == "B cells", ]$ont <- "CL:0000236"
dice_labels[dice_labels$label == "Monocytes", ]$ont <- "CL:0000576"
dice_labels[dice_labels$label == "NK cells", ]$ont <- "CL:0000623"
dice_labels[dice_labels$label == "T cells, CD8+", ]$ont <- "CL:0000625"
dice_labels[dice_labels$label == "T cells, CD4+", ]$ont <- "CL:0000624"
dice_labels[dice_labels$label == "T cells, CD4+, memory", ]$ont <- "CL:0000897"

Example: checking lineage relationships {checking-lineage-relationships}

Use the xCell2GetLineage function to identify and inspect cell type dependencies based on the Cell Ontology tree. This function can either output a file in TSV format or return a list mapping cell types to their ancestors and descendants.

  • To generate a TSV file for manual inspection:

After running the command:

  1. Open the generated file (demo_dice_dep.tsv) in a text editor or spreadsheet tool.
  2. Verify the lineage assignments. For example:
    • "T cells, CD4+, memory" should be assigned as a descendant of "T cells, CD4+".
  3. Manually adjust the file if necessary, and save the changes.

The lineage file can then be used in the xCell2Train function via the lineageFile parameter to enhance the signatures generation process with user-curated dependencies.

  • To return a list directly in R:
lineage_list <- xCell2GetLineage(labels = dice_labels)
## downloading 1 resources
## retrieving 1 resource
## loading from cache

Interpreting the Output

If a TSV file is generated (outFile provided):

  • The file demo_dice_dep.tsv contains the following columns:
    • ont: Cell type ontology ID (e.g., CL:0000545).
    • label: Cell type name (e.g., “CD4+ T cells”).
    • descendants: Semicolon-separated list of direct descendants for each cell type.
    • ancestors: Semicolon-separated list of direct ancestors for each cell type.

If a list is returned (no outFile specified):

  • Each entry in the list corresponds to a cell type, with the following components:
    • descendants: A vector of direct descendant cell types.
    • ancestors: A vector of direct ancestor cell types.

Generating the xCell2 Reference Object

Once your inputs are prepared, you can create a custom xCell2 reference object using the xCell2Train function. This object will serve as the basis for cell type enrichment analysis in your downstream workflows and is a required input for the xCell2Analysis function.

Key Parameters of xCell2Train

  • ref: The prepared reference gene expression matrix or a SummarizedExperiment / SingleCellExperiment object. For more information, see Preparing the Input Data.

  • labels: The labels data frame you created. If you use a SummarizedExperiment or SingleCellExperiment object, this is not needed because the metadata should already be included in colData. See Preparing the Input Data and Assigning Ontologies for details.

  • refType: The type of reference data. Options are:

    • "rnaseq": Bulk RNA-Seq data.
    • "array": Microarray data.
    • "sc": Single-cell RNA-Seq data.
  • useOntology: Whether to use ontological integration to account for hierarchical dependencies between cell types (default: TRUE). For more information, refer to the Ontological Integration in xCell2 section.

  • lineageFile: The path to a manually curated lineage file generated with the xCell2GetLineage function, if available. For more information, refer to the Checking Lineage Relationships section.

  • BPPARAM: A BiocParallelParam instance for parallelization. See Parallelization in xCell2 for details.

  • useSpillover: A Boolean indicating whether to apply spillover correction during the training process (default: TRUE). For a detailed explanation of spillover correction, see the How Does xCell2 Correct Spillover? section.

  • spilloverAlpha: A numeric value (default: 0.5) that controls the strength of spillover correction. For guidance on spillover correction and choosing alpha, refer to the How Does xCell2 Correct Spillover? section.

  • returnSignatures: A Boolean indicating whether to return only the cell type-specific gene signatures from the training process (default: FALSE). Setting this parameter to TRUE provides the signatures alone, without generating the full reference object. This can be useful for advanced users who wish to analyze or modify the signatures directly or use them in custom workflows.

  • returnAnalysis: A Boolean indicating whether to directly return the results of xCell2Analysis using the trained reference object and the provided bulk mixture data (default: FALSE). When set to TRUE, the function do not return the reference object.

Example: generating xCell2 reference object

The following code demonstrates how to generate a reference object. Before running this example make sure you have dice_ref and dice_lables loaded in R by running previous examples of “Preparing the Input Data” and “Assigning Cell Type Ontology”.

library(BiocParallel)
parallel_param <- MulticoreParam(workers = 2) # Adjust workers as needed

set.seed(123) # (optional) For reproducibility

DICE_demo.xCell2Ref <- xCell2Train(
  ref = dice_ref,
  labels = dice_labels,
  refType = "rnaseq",
  BPPARAM = parallel_param
)
## Starting xCell2 Train...
## Finding dependencies using cell type ontology...
## loading from cache
## Generating signatures...
## Learning linear transformation and spillover parameters...
## Your custom xCell2 reference object is ready!
## > Please consider sharing with others here: https://dviraran.github.io/xCell2ref

Why Use set.seed()?

The set.seed() function is used to ensure reproducibility. This is only relevant when working with scRNA-Seq data, as pseudo-bulk samples are generated by randomly sampling cells. Setting a seed ensures that the same random cells are selected each time the code is run, allowing you to replicate results and maintain consistency across analyses. If you are not working with scRNA-Seq data or if exact reproducibility is not a concern, setting the seed is optional.

Sharing Your Custom xCell2 Reference Object

Sharing your custom xCell2 reference object with the scientific community can greatly benefit researchers working in similar fields. Here’s how you can contribute:

  1. Save Your Reference Object
    Save your newly generated using the recommended saveRDS() function:

    saveRDS(DICE_demo.xCell2Ref, file = "DICE_demo.xCell2Ref.rds")
  2. Upload to the xCell2 References Repository

    • Visit the xCell2 References Repository
    • Navigate to the “references” directory
    • Click “Add file” > “Upload files” and select your .rds xCell2 reference object file
  3. Update the README

    To help others understand and use your reference:

    • Return to the main page of the xCell2 References Repository
    • Open the README.md file
    • Click the pencil icon (Edit this file) in the top right corner
    • Add an entry to the references table
  4. Submit a Pull Request

    • Click on “Commit changes…”
    • Select “Create a new branch for this commit and start a pull request”
    • Click “Propose changes”
    • On the next page, click “Create pull request”

The repository maintainers will review your submission and merge it if everything is in order.

By sharing your custom reference, you contribute to the scientific community. Your reference object could help researchers in related fields achieve more accurate and relevant results in their studies.

Next Steps

After creating your custom reference, you can use it for cell type enrichment analysis with xCell2Analysis. We’ll cover this in the next section.

Using Pre-trained xCell2 References

xCell2 provides a selection of pre-trained reference objects that can be directly used in your analysis. These references cover diverse cell types and are based on well-curated datasets. Pre-trained references are ideal for researchers who prefer not to train custom references or need a starting point for their analysis.

Available Pre-trained References

The following table provides an overview of the available pre-trained references as to January 2025.

Dataset Study Species Normalization nSamples/Cells nCellTypes Platform Tissues
BlueprintEncode Martens JHA and Stunnenberg HG (2013), The ENCODE Project Consortium (2012), Aran D (2019) Homo Sapiens TPM 259 43 RNA-seq Mixed
ImmGenData The Immunological Genome Project Consortium (2008), Aran D (2019) Mus Musculus RMA 843 19 Microarray Immune/Blood
Immune Compendium Zaitsev A (2022) Homo Sapiens TPM 3626 40 RNA-seq Immune/Blood
LM22 Chen B (2019) Homo Sapiens RMA 113 22 Microarray Mixed
MouseRNAseqData Benayoun B (2019) Mus Musculus TPM 358 18 RNA-seq Mixed
Pan Cancer Nofech-Mozes I (2023) Homo Sapiens Counts 25084 29 scRNA-seq Tumor
Tabula Muris Blood The Tabula Muris Consortium (2018) Mus Musculus Counts 11145 6 scRNA-seq Bone Marrow, Spleen, Thymus
Tabula Sapiens Blood The Tabula Sapiens Consortium (2022) Homo Sapiens Counts 11921 18 scRNA-seq Blood, Lymph_Node, Spleen, Thymus, Bone Marrow
TME Compendium Zaitsev A (2022) Homo Sapiens TPM 8146 25 RNA-seq Tumor

Accessing Pre-trained References

From the Package Popular pre-trained references are included directly in the xCell2 package. You can load them using the data() function:

# Load the BlueprintEncode reference
data(BlueprintEncode.xCell2Ref, package = "xCell2")

From the xCell2 References Repository You can download additional pre-trained references from the xCell2 References Repository.

Use the following code to download a pre-trained reference within R:

# Set the URL of the pre-trained reference
ref_url <- "https://dviraran.github.io/xCell2refs/references/BlueprintEncode.xCell2Ref.rds"
# Set the local filename to save the reference
local_filename <- "BlueprintEncode.xCell2Ref.rds"
# Download the file
download.file(ref_url, local_filename, mode = "wb")
# Load the downloaded reference
BlueprintEncode.xCell2Ref <- readRDS(local_filename)

Choosing the Right Reference

Selecting an appropriate reference is crucial for accurate cell type enrichment analysis. Consider the following:

  • Organism and Tissue Type: Select a reference that corresponds to the organism and tissue type of your samples.

  • Platform: While not mandatory, it is recommended to start with a reference that aligns with your dataset type and normalization.

  • Condition-Specific: If available, select a reference that corresponds to the condition you are investigating. For example, a scRNA-Seq reference created from datasets of blood immune cell types in autoimmune diseases, would be more suitable than a dataset of blood immune cell types from healthy controls.

Next Steps

After loading a pre-trained reference, you can proceed to perform cell type enrichment analysis with the xCell2Analysis function. For details, see the Performing Cell Type Enrichment Analysis section.

Calculating Cell Type Enrichment with xCell2Analysis

After training or obtaining an xCell2 reference object, the next step is to use it for cell type enrichment analysis on your bulk transcriptomics data. This section will guide you through using the xCell2Analysis function and interpreting its results.

Preparing the Input Data

Before running the analysis, ensure you have:

  1. An xCell2 reference object (see Creating a Custom Reference or Using Pre-trained References).
  2. A bulk gene expression matrix to analyze.

Ensuring Gene Compatibility Between Bulk Mixture and Reference

For accurate cell type enrichment analysis, it is essential to verify compatibility between the genes in your bulk mixture (mix) and the reference object (xcell2object). Compatibility means that the gene nomenclature (e.g., gene symbols, HUGO IDs, Ensembl IDs) match and there is sufficient overlap between the genes in the bulk mixture and the reference.

The getGenesUsed function allows you to access the genes used for the reference training:

library(xCell2)
data("DICE_demo.xCell2Ref", package = "xCell2")
genes_ref <- getGenesUsed(DICE_demo.xCell2Ref)
genes_ref[1:10]
##  [1] "CD28"  "S100B" "IL2RB" "CD3E"  "KRT72" "CTLA4" "CD3D"  "GNLY"  "IL7R" 
## [10] "MAL"

You can calculate the percentage of overlapping genes between your bulk expression matrix and the reference. For example:

# Load demo bulk gene expression mixture
data("mix_demo", package = "xCell2")
genes_mix <- rownames(mix_demo)

# Calculate the percentage of overlapping genes
overlap_percentage <- round(length(intersect(genes_mix, genes_ref)) / length(genes_ref) * 100, 2)
print(paste0("Overlap between mixture and reference genes is: ", overlap_percentage, "%"))
## [1] "Overlap between mixture and reference genes is: 100%"

Ensure the overlap is sufficient (default minimum: 90%). You can adjust this threshold in xCell2Analysis using the minSharedGenes parameter:

xcell2_results <- xCell2Analysis(
  mix = mix_demo,
  xcell2object = DICE_demo.xCell2Ref,
  minSharedGenes = 0.8 # Allow for a lower overlap threshold
)
## Starting xCell2 Analysis...
## Calculating enrichment scores for all cell types...
## Performing spillover correction...
## xCell2 Analysis completed successfully.

If your expression matrix and the reference share a low percentage of common genes, consider creating a custom reference using the xCell2Train function. Custom references ensure better alignment between your bulk mixture and the reference, leading to more accurate enrichment results.

Key Parameters of xCell2Analysis

  • mix: The bulk mixture of gene expression matrix to analyze, with genes in rows and samples in columns. The gene nomenclature must match the reference, see Ensuring Gene Compatibility.

  • xcell2object: A pre-trained reference object of class xCell2Object, created using the xCell2Train function. Pre-trained references are also available for common use cases. See Using Pre-trained xCell2 References.

  • minSharedGenes: The minimum fraction of shared genes required between the bulk mixture (mix) and the reference (xcell2object). The default value is 0.9. If the overlap is insufficient, the function will terminate with an error. Adjust this threshold if necessary, but note that higher overlap ensures more reliable results. To check gene overlap, see Ensuring Gene Compatibility.

  • rawScores: A Boolean indicating whether to return raw enrichment scores (default: FALSE). Raw enrichment scores are computed directly from gene signatures without applying linear transformation or spillover correction. This option can be useful for debugging or advanced workflows.

  • spillover: A Boolean indicating whether to apply spillover correction on the enrichment scores (default: TRUE). Spillover correction addresses signature genes overlaps between closely related cell types, improving specificity. For details, see How Does xCell2 Correct Spillover?.

  • spilloverAlpha: A numeric value (default: 0.5) controlling the strength of spillover correction. Lower values apply weaker correction, while higher values apply stronger correction. Adjust this parameter based on the similarity of cell types in your reference. See Spillover Correction in xCell2.

  • BPPARAM: A BiocParallelParam instance that determines the parallelization strategy. This parameter allows you to leverage multi-core processing for faster computation. For more details, see Parallelization in xCell2.

Calculating Cell Type Enrichment

To calculate the cell type enrichment, use the following command*:

xcell2_results <- xCell2Analysis(
 mix = mix_demo,
 xcell2object = DICE_demo.xCell2Ref
)
## Starting xCell2 Analysis...
## Calculating enrichment scores for all cell types...
## Performing spillover correction...
## xCell2 Analysis completed successfully.
xcell2_results
##                             F0303       F0304   F0305
## B cells               0.006400000 0.000000000 0.00647
## Monocytes             0.097510000 0.000000000 0.11779
## NK cells              0.000000000 0.018030000 0.00200
## T cells, CD8+         0.006078756 0.001443457 0.00000
## T cells, CD4+         0.009140549 0.020706785 0.00000
## T cells, CD4+, memory 0.001950000 0.009530000 0.00000

The xCell2Analysis function return a matrix of cell type enrichment scores with the following structure:

  • Rows: Represent individual cell types found in the reference.
  • Columns: Correspond to the samples in your input mixture.

Understanding Enrichment Scores

  1. Enrichment Scores Are Not Proportions:
    • While the raw enrichment scores in xCell2 undergo a linear transformation that makes them resemble proportions, they are not actual proportions.
    • The transformation ensures that scores are on a linear scale within the range of 0 to 1, but they might not sum to 1 across all cell types in a sample.
  2. Differences from Cellular Deconvolution Methods:
    • Cellular deconvolution methods typically model bulk expression data as a linear combination of all reference cell type profiles, which forces the estimated proportions to sum to 1.
    • However, this assumption is often unrealistic, as reference datasets rarely capture all cell type in a complex tissue sample. Missing cell types in the reference can lead to skewed or inaccurate proportions.
  3. Interpreting the Scale of Scores:
    • Enrichment scores fall between 0 and 1, where higher scores indicate a stronger enrichment of a cell type in the corresponding sample relative to the others.
    • Use the scores to compare cell type compositions across samples and identify differences rather than report their absolute value.

Advanced Analysis with xCell2 Results

The output of xCell2Analysis, a matrix of cell type enrichment scores, can be utilized for a variety of advanced analyses to uncover valuable biological insights. Below are examples for some common applications.

Normalizing Enrichment Scores by Tumor Purity

What is Tumor Purity? Tumor purity refers to the proportion of tumor cells within a sample, compared to the surrounding non-tumor cells (the tumor microenvironment), such as immune and stromal cells. Tumor purity typically varies across samples and can influence the interpretation of enrichment scores.

Why Normalize by Tumor Purity? In tumor samples, differences in tumor purity can skew the enrichment scores of non-tumor components like immune or stromal cells. For example, a sample with high tumor purity (e.g., 80%) contains a smaller proportion of tumor microenvironment (TME) cells compared to a sample with lower tumor purity (e.g., 40%). This makes direct comparisons of immune cell enrichment scores between these samples challenging. By normalizing the enrichment scores by tumor purity, we can isolate and compare the cellular composition of the TME alone, providing a clearer picture of immune or stromal cell dynamics.

How to Obtain Tumor Purity Values Tumor purity values can be estimated using external tools or methods. Some common approaches include: - Genomic Analysis Tools: Tools like ABSOLUTE or ESTIMATE can compute tumor purity values based on DNA sequencing or gene expression data. - Published Datasets: Tumor purity estimates are available for many publicly available datasets, such as The Cancer Genome Atlas (TCGA). - Future xCell2 Feature: In a future update, xCell2 will introduce a feature to compute the tumor microenvironment (TME) score. This score has been shown to have a negative correlation with tumor purity, as demonstrated in the study “Systematic pan-cancer analysis of tumour purity” (DOI: 10.1038/ncomms9971).

Normalization Formula The normalized enrichment score for a cell type is calculated as:

$$ \text{Normalized Score} = \frac{\text{Enrichment Score}}{1 - \text{Tumor Purity}} $$ where 1 − Tumor Purity represents the fraction of non-tumor cells in the sample.

Interpreting Normalized Scores After normalization, the enrichment scores reflect the relative abundance of the cell type within the TME, accounting for the sample’s tumor purity. This adjustment allows for more meaningful comparisons of immune or stromal cell enrichment between samples with varying tumor purity levels. For example, in the plot above, normalized scores for CD8+ T cells highlight differences that might otherwise be obscured in raw scores due to variations in tumor purity.

Comparing Cellular Enrichment Across Conditions

For one cell type

Compare enrichment scores between groups (e.g., healthy vs. diseased) using statistical tests like the Wilcoxon rank-sum test.

library(ggplot2, quietly = TRUE)

# Generate pseudo enrichment scores for demonstration
set.seed(123)
conditions <- c(rep("Healthy", 5), rep("Disease", 5))
enrichment_scores <- data.frame(
  Condition = conditions,
  T_cells_CD8 = c(runif(5, 0.4, 0.6), runif(5, 0.6, 0.8)),
  T_cells_CD4 = c(runif(5, 0.3, 0.5), runif(5, 0.5, 0.7))
)

# Wilcoxon rank-sum test for CD8+ T cells
wilcox_test <- wilcox.test(T_cells_CD8 ~ Condition, data = enrichment_scores)
print(wilcox_test)
## 
##  Wilcoxon rank sum exact test
## 
## data:  T_cells_CD8 by Condition
## W = 25, p-value = 0.007937
## alternative hypothesis: true location shift is not equal to 0
# Boxplot with p-value
ggplot(enrichment_scores, aes(x = Condition, y = T_cells_CD8)) +
  geom_boxplot() +
  geom_jitter(width = 0.2) +
  labs(
    title = "CD8+ T Cells Enrichment Across Conditions",
    x = "Condition", y = "Enrichment Score"
  ) +
  annotate("text", x = 1.5, y = max(enrichment_scores$T_cells_CD8), 
           label = paste("p =", signif(wilcox_test$p.value, 3)))

For many cell types

Visualize differential cell type enrichment between conditions using a volcano plot with Wilcoxon rank-sum test.

library(EnhancedVolcano, quietly = TRUE)
library(dplyr, quietly = TRUE, verbose = FALSE)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Simulate enrichment scores for 50 cell types
set.seed(123) 
cell_types <- paste0("CellType_", 1:50)

enrichment_scores_many <- data.frame(
  Sample = paste0("Sample", 1:10),
  Condition = factor(rep(c("Healthy", "Disease"), each = 5))
)

for (cell_type in cell_types) {
  # Healthy group: random around baseline
  healthy <- rnorm(5, mean = 0.5, sd = 0.1)
  # Disease group: add variability and directional shifts
  disease <- rnorm(5, mean = sample(c(0.4, 0.5, 0.6), 1), sd = 0.1)
  enrichment_scores_many[[cell_type]] <- c(healthy, disease)
}

special_cells <- sample(cell_types, 2)
for (cell_type in special_cells) {
  enrichment_scores_many[1:5, cell_type] <- rnorm(5, mean = 0.4, sd = 0.05) # Lower in healthy
}

# Calculate log fold change (LFC) and p-values for each cell type using Wilcoxon test
differential_results <- enrichment_scores_many %>%
  tidyr::pivot_longer(cols = starts_with("CellType"), names_to = "CellType", values_to = "Enrichment") %>%
  group_by(CellType) %>%
  summarise(
    LFC = log2(median(Enrichment[Condition == "Disease"]) / median(Enrichment[Condition == "Healthy"])),
    p_value = wilcox.test(Enrichment ~ Condition)$p.value
  ) %>%
  ungroup()

# Prepare data for EnhancedVolcano
volcano_data <- differential_results %>%
  mutate(
    neg_log10_p = -log10(p_value),
    Significance = ifelse(p_value < 0.05, "Significant", "Not Significant")
  ) %>%
  rename(
    log2FC = LFC,
    pval = p_value
  )

# Plot using EnhancedVolcano
EnhancedVolcano(
  volcano_data,
  lab = volcano_data$CellType,
  x = "log2FC",
  y = "pval",
  pCutoff = 0.05,
  FCcutoff = 0.5,
  title = "Volcano Plot of Cell Type Enrichment",
  subtitle = "Simulated Data for 50 Cell Types",
  xlab = "Log2 Fold Change (LFC)",
  ylab = "-Log10(p-value)",
  pointSize = 2.0,
  labSize = 3.0,
  xlim = c(-1, 1),
  ylim = c(0, 3)
)

Correlating Enrichment Scores with Clinical or Molecular Features

Correlate enrichment scores with clinical outcomes or molecular features, such as mutation burden.

# Simulated clinical feature (e.g., mutation burden)
set.seed(123)

mutation_burden <- runif(10, min = 0, max = 100)

# Calculate Pearson and Spearman correlation
pearson_corr <- cor.test(enrichment_scores$T_cells_CD8, mutation_burden, method = "pearson")
spearman_corr <- cor.test(enrichment_scores$T_cells_CD8, mutation_burden, method = "spearman")

# Scatterplot of the correlation with annotations for Pearson and Spearman coefficients
ggplot(enrichment_scores, aes(x = mutation_burden, y = T_cells_CD8)) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(
    title = "Correlation Between Mutation Burden and CD8+ T Cells Enrichment",
    x = "Mutation Burden",
    y = "CD8+ T Cells Enrichment Score"
  ) +
  annotate("text", x = 15, y = 0.7, 
           label = paste0("Pearson: ", signif(pearson_corr$estimate, 3), 
                          "\nP-value: ", signif(pearson_corr$p.value, 3)),
           hjust = 0, color = "darkred") +
  annotate("text", x = 15, y = 0.65, 
           label = paste0("Spearman: ", signif(spearman_corr$estimate, 3), 
                          "\nP-value: ", signif(spearman_corr$p.value, 3)),
           hjust = 0, color = "darkgreen")
## `geom_smooth()` using formula = 'y ~ x'

Clustering and Dimension Reduction

Use enrichment scores to group samples and visualize their cellular composition landscape.

# Simulated enrichment scores for demonstration
set.seed(123) 
enrichment_scores <- data.frame(
  Sample = paste0("Sample", 1:10),
  Condition = rep(c("Healthy", "Disease"), each = 5),
  CD8_T_cells = runif(10, min = 0.1, max = 0.8),
  B_cells = runif(10, min = 0.2, max = 0.6),
  NK_cells = runif(10, min = 0.05, max = 0.5)
)

# Perform PCA on the simulated enrichment scores
pca <- prcomp(enrichment_scores[, 3:5], scale. = TRUE)

# Create a data frame for PCA results
pca_df <- data.frame(
  PC1 = pca$x[, 1],
  PC2 = pca$x[, 2],
  Condition = enrichment_scores$Condition
)

# Visualize the first two principal components
ggplot(pca_df, aes(x = PC1, y = PC2, color = Condition)) +
  geom_point(size = 3) +
  labs(
    title = "PCA of Enrichment Scores",
    x = "PC1 (Principal Component 1)",
    y = "PC2 (Principal Component 2)"
  ) +
  theme_minimal()

Perform k-means clustering on the original enrichment scores

kmeans_result <- kmeans(enrichment_scores[, 3:5], centers = 2, nstart = 25)

# Add cluster assignment to the PCA results
pca_df <- data.frame(
  PC1 = pca$x[, 1],
  PC2 = pca$x[, 2],
  Condition = enrichment_scores$Condition,
  Cluster = as.factor(kmeans_result$cluster)
)

# Visualize the PCA results with k-means clusters
ggplot(pca_df, aes(x = PC1, y = PC2, color = Cluster, shape = Condition)) +
  geom_point(size = 3) +
  labs(
    title = "PCA of Enrichment Scores with K-Means Clusters",
    x = "PC1 (Principal Component 1)",
    y = "PC2 (Principal Component 2)"
  ) +
  scale_color_manual(values = c("1" = "blue", "2" = "red")) +
  theme_minimal()

Using Enrichment Scores as Features for Predictive Modeling

Use enrichment scores as predictive features in machine learning models to classify conditions.

library(randomForest, quietly = TRUE)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
# Simulated enrichment scores for demonstration
set.seed(123)

enrichment_scores <- data.frame(
  Sample = paste0("Sample", 1:10),
  Condition = factor(rep(c("Healthy", "Disease"), each = 5)),
  CD8_T_cells = runif(10, min = 0.1, max = 0.8),
  B_cells = runif(10, min = 0.2, max = 0.6),
  NK_cells = runif(10, min = 0.05, max = 0.5)
)

# Train a random forest model
rf_model <- randomForest(Condition ~ CD8_T_cells + B_cells + NK_cells, 
                         data = enrichment_scores, 
                         ntree = 500, importance = TRUE)


# Assess variable importance
var_importance <- data.frame(
  Variable = rownames(importance(rf_model)),
  Importance = importance(rf_model)[, 1]
)

# Plot variable importance
ggplot(var_importance, aes(x = reorder(Variable, Importance), y = Importance)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(
    title = "Feature Importance from Random Forest Model",
    x = "Features",
    y = "Importance Score"
  ) +
  theme_minimal()

Parallelization in xCell2

xCell2 leverages the BiocParallel package for parallel processing, enabling computationally intensive tasks to run faster by utilizing multiple processors.

By default, xCell2 performs computations sequentially using BiocParallel::SerialParam.

To enable parallel processing, users can specify a BiocParallelParam object via the BPPARAM argument in functions like xCell2Train and xCell2Analysis.

To use parallelization, simply create a BiocParallelParam instance:

library(BiocParallel)

# Example: Setting up parallel processing with 2 workers
BPPARAM <- MulticoreParam(workers = 2) # Adjust the number of workers as needed

If you are using an operating system where MulticoreParam is unavailable (e.g., certain Windows configurations), SnowParam can be used as an alternative:

BPPARAM <- SnowParam(workers = 2)

Example: Generating the xCell2 Reference Object with Parallelization

# Generate the xCell2 reference object with parallel processing
DICE_demo.xCell2Ref <- xCell2Train(
  ref = dice_ref,
  labels = dice_labels,
  refType = "rnaseq",
  BPPARAM = BPPARAM
)

Example: Generating the xCell2 Reference Object with Parallelization

# Perform cell type enrichment analysis with parallel processing
xcell2_results <- xCell2Analysis(
  mix = mix_demo,
  xcell2object = DICE_demo.xCell2Ref,
  BPPARAM = BPPARAM
)

For more information on BiocParallel and its options, visit the BiocParallel package documentation.

Citing xCell2

If you use xCell2 in your research, please cite: Angel A, Naom L, Nabel-Levy S, Aran D. xCell 2.0: Robust Algorithm for Cell Type Proportion Estimation Predicts Response to Immune Checkpoint Blockade. bioRxiv 2024.

Referece

  • Aran, D., Hu, Z., & Butte, A. J. (2017). xCell: digitally portraying the tissue cellular heterogeneity landscape. Genome biology, 18, 1-14.

  • Aran, D. (2021). Extracting insights from heterogeneous tissues. Nature Computational Science, 1(4), 247-248.

  • Angel, A., Naom, L., Nabet-Levy, S., & Aran, D. (2024). xCell 2.0: Robust Algorithm for cell type Proportion Estimation Predicts Response to Immune Checkpoint Blockade. bioRxiv, 2024-09.

R Session Info

sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] randomForest_4.7-1.2   dplyr_1.1.4            EnhancedVolcano_1.25.0
## [4] ggrepel_0.9.6          ggplot2_3.5.1          BiocParallel_1.41.0   
## [7] xCell2_0.99.107       
## 
## loaded via a namespace (and not attached):
##   [1] DBI_1.2.3                   GSEABase_1.69.0            
##   [3] rlang_1.1.5                 magrittr_2.0.3             
##   [5] matrixStats_1.5.0           compiler_4.4.2             
##   [7] RSQLite_2.3.9               mgcv_1.9-1                 
##   [9] png_0.1-8                   vctrs_0.6.5                
##  [11] reshape2_1.4.4              RcppZiggurat_0.1.6         
##  [13] quadprog_1.5-8              stringr_1.5.1              
##  [15] pkgconfig_2.0.3             crayon_1.5.3               
##  [17] fastmap_1.2.0               dbplyr_2.5.0               
##  [19] XVector_0.47.2              labeling_0.4.3             
##  [21] rmarkdown_2.29              tzdb_0.4.0                 
##  [23] pracma_2.4.4                graph_1.85.1               
##  [25] UCSC.utils_1.3.1            purrr_1.0.2                
##  [27] bit_4.5.0.1                 xfun_0.50                  
##  [29] Rfast_2.1.4                 cachem_1.1.0               
##  [31] GenomeInfoDb_1.43.2         jsonlite_1.8.9             
##  [33] progress_1.2.3              blob_1.2.4                 
##  [35] DelayedArray_0.33.4         parallel_4.4.2             
##  [37] prettyunits_1.2.0           singscore_1.27.0           
##  [39] R6_2.5.1                    bslib_0.8.0                
##  [41] stringi_1.8.4               limma_3.63.3               
##  [43] GenomicRanges_1.59.1        jquerylib_0.1.4            
##  [45] Rcpp_1.0.14                 SummarizedExperiment_1.37.0
##  [47] knitr_1.49                  readr_2.1.5                
##  [49] IRanges_2.41.2              splines_4.4.2              
##  [51] Matrix_1.7-1                tidyselect_1.2.1           
##  [53] abind_1.4-8                 yaml_2.3.10                
##  [55] codetools_0.2-20            minpack.lm_1.2-4           
##  [57] curl_6.1.0                  lattice_0.22-6             
##  [59] tibble_3.2.1                plyr_1.8.9                 
##  [61] withr_3.0.2                 Biobase_2.67.0             
##  [63] KEGGREST_1.47.0             evaluate_1.0.3             
##  [65] ontologyIndex_2.12          RcppParallel_5.1.9         
##  [67] BiocFileCache_2.15.1        Biostrings_2.75.3          
##  [69] pillar_1.10.1               BiocManager_1.30.25        
##  [71] filelock_1.0.3              MatrixGenerics_1.19.1      
##  [73] stats4_4.4.2                generics_0.1.3             
##  [75] BiocVersion_3.21.1          S4Vectors_0.45.2           
##  [77] hms_1.1.3                   munsell_0.5.1              
##  [79] scales_1.3.0                xtable_1.8-4               
##  [81] glue_1.8.0                  maketools_1.3.1            
##  [83] tools_4.4.2                 AnnotationHub_3.15.0       
##  [85] sys_3.4.3                   locfit_1.5-9.10            
##  [87] annotate_1.85.0             buildtools_1.0.0           
##  [89] XML_3.99-0.18               grid_4.4.2                 
##  [91] tidyr_1.3.1                 AnnotationDbi_1.69.0       
##  [93] edgeR_4.5.1                 colorspace_2.1-1           
##  [95] SingleCellExperiment_1.29.1 nlme_3.1-166               
##  [97] GenomeInfoDbData_1.2.13     cli_3.6.3                  
##  [99] rappdirs_0.3.3              S4Arrays_1.7.1             
## [101] gtable_0.3.6                sass_0.4.9                 
## [103] digest_0.6.37               BiocGenerics_0.53.3        
## [105] SparseArray_1.7.4           farver_2.1.2               
## [107] memoise_2.0.1               htmltools_0.5.8.1          
## [109] lifecycle_1.0.4             httr_1.4.7                 
## [111] mime_0.12                   statmod_1.5.0              
## [113] bit64_4.6.0-1