| Title: | Calculates cell type specificity from single cell data |
|---|---|
| Description: | SPICEY (SPecificity Index for Coding and Epigenetic activitY) is an R package designed to quantify cell-type specificity in single-cell transcriptomic and epigenomic data, particularly scRNA-seq and scATAC-seq. It introduces two complementary indices: the Gene Expression Tissue Specificity Index (GETSI) and the Regulatory Element Tissue Specificity Index (RETSI), both based on entropy to provide continuous, interpretable measures of specificity. By integrating gene expression and chromatin accessibility, SPICEY enables standardized analysis of cell-type-specific regulatory programs across diverse tissues and conditions. |
| Authors: | Georgina Fuentes-Páez [aut, cre] (ORCID: <https://orcid.org/0009-0003-9417-7148>), Nacho Molina [aut], Mireia Ramos-Rodriguez [aut], Lorenzo Pasquali [aut], Ministerio de Ciencia e Innovación Spain [fnd] (program: FPI Fellowship) |
| Maintainer: | Georgina Fuentes-Páez <[email protected]> |
| License: | Artistic-2.0 |
| Version: | 1.3.0 |
| Built: | 2026-05-30 09:49:14 UTC |
| Source: | https://github.com/bioc/SPICEY |
Identifies transcription start sites (TSS) overlapping with input peaks and adds logical and gene ID columns to an annotation table.
.add_tss_annotation( annotation, peaks, txdb, annot_dbi, protein_coding_only, verbose ).add_tss_annotation( annotation, peaks, txdb, annot_dbi, protein_coding_only, verbose )
annotation |
A |
peaks |
A |
txdb |
|
annot_dbi |
|
protein_coding_only |
Logical; restrict to protein-coding genes (default TRUE). |
verbose |
Logical; print messages (default TRUE). |
A data.frame or GRanges (depending on input) with added columns:
Logical, TRUE if the peak overlaps a TSS.
Gene symbol of the overlapping TSS, if any.
GRanges
or data.frame, or a GRangesList) into a single tidy
data.frame, with a cell_type column.Parses input data of various types (e.g., named lists of GRanges
or data.frame, or a GRangesList) into a single tidy
data.frame, with a cell_type column.
.parse_input_diff(input).parse_input_diff(input)
input |
An object representing differential results, such as:
|
A data.frame combining all elements, with an added cell_type column indicating the source.
GRanges object,
removes alternate scaffolds (_alt, random, fix, Un),
and assigns region IDs as names.Standardizes peak input, ensuring that input peaks is a GRanges object,
removes alternate scaffolds (_alt, random, fix, Un),
and assigns region IDs as names.
.standardize_peaks(peaks).standardize_peaks(peaks)
peaks |
A |
A GRanges object with:
Coordinates of the input regions.
Unique identifier of the region (e.g., chr1-5000-5800).
Any additional columns present in the input.
Links peaks to genes based on Cicero co-accessibility with promoters or TSSs.
annotate_with_coaccessibility( peaks, txdb, links_df, annot_dbi, protein_coding_only = TRUE, verbose = TRUE, add_tss_annotation = FALSE, upstream, downstream )annotate_with_coaccessibility( peaks, txdb, links_df, annot_dbi, protein_coding_only = TRUE, verbose = TRUE, add_tss_annotation = FALSE, upstream, downstream )
peaks |
A
|
txdb |
|
links_df |
A |
annot_dbi |
|
protein_coding_only |
Logical; restrict to protein-coding genes (default TRUE). |
verbose |
Logical; print messages (default TRUE). |
add_tss_annotation |
Logical; annotate regulatory elements overlapping TSS (default FALSE). If TRUE, use +/- 1bp TSS. |
upstream |
Single integer value indicating the number of bases upstream
from the TSS (transcription start sites) (default |
downstream |
Single integer values indicating the number of bases downstream
from the TSS (transcription start sites) (default |
A data.frame with the original metadata columns from peaks,
along with an added gene_id column containing the symbol of the co-accessible gene.
Peaks with no gene annotation will have NA in the gene_id field.
library(TxDb.Hsapiens.UCSC.hg38.knownGene) library(org.Hs.eg.db) data(atac) data(cicero_links) peaks <- unique(unlist(atac)[, c("region_id")]) annotation_coacc <- annotate_with_coaccessibility( peaks = peaks, txdb = TxDb.Hsapiens.UCSC.hg38.knownGene, links_df = cicero_links, annot_dbi = org.Hs.eg.db, protein_coding_only = TRUE, verbose = TRUE, add_tss_annotation = FALSE, upstream = 2000, downstream = 2000 )library(TxDb.Hsapiens.UCSC.hg38.knownGene) library(org.Hs.eg.db) data(atac) data(cicero_links) peaks <- unique(unlist(atac)[, c("region_id")]) annotation_coacc <- annotate_with_coaccessibility( peaks = peaks, txdb = TxDb.Hsapiens.UCSC.hg38.knownGene, links_df = cicero_links, annot_dbi = org.Hs.eg.db, protein_coding_only = TRUE, verbose = TRUE, add_tss_annotation = FALSE, upstream = 2000, downstream = 2000 )
based on distance to the transcription start site (TSS), using a TxDb
reference and gene annotations from org.*.db packages.
annotate_with_nearest( peaks, txdb, annot_dbi, protein_coding_only = TRUE, verbose = TRUE, add_tss_annotation = FALSE, upstream, downstream )annotate_with_nearest( peaks, txdb, annot_dbi, protein_coding_only = TRUE, verbose = TRUE, add_tss_annotation = FALSE, upstream, downstream )
peaks |
A
|
txdb |
|
annot_dbi |
|
protein_coding_only |
Logical; restrict to protein-coding genes (default TRUE). |
verbose |
Logical; print messages (default TRUE). |
add_tss_annotation |
Logical; annotate regulatory elements overlapping TSS (default FALSE). If TRUE, use +/- 1bp TSS. |
upstream |
Single integer value indicating the number of bases upstream
from the TSS (transcription start sites) (default |
downstream |
Single integer values indicating the number of bases downstream
from the TSS (transcription start sites) (default |
A data.frame of peaks annotated to its nearest gene, with columns:
distanceToTSS |
Distance to the nearest TSS |
gene_id |
Identifier of the gene. Must be an official gene symbol (e.g., |
annotation |
|
library(dplyr) library(TxDb.Hsapiens.UCSC.hg38.knownGene) library(org.Hs.eg.db) data(atac) peaks <- unique(unlist(atac)[, c("region_id")]) annotation_near <- annotate_with_nearest( peaks = peaks, txdb = TxDb.Hsapiens.UCSC.hg38.knownGene, annot_dbi = org.Hs.eg.db, protein_coding_only = TRUE, verbose = TRUE, add_tss_annotation = FALSE, upstream = 2000, downstream = 2000 )library(dplyr) library(TxDb.Hsapiens.UCSC.hg38.knownGene) library(org.Hs.eg.db) data(atac) peaks <- unique(unlist(atac)[, c("region_id")]) annotation_near <- annotate_with_nearest( peaks = peaks, txdb = TxDb.Hsapiens.UCSC.hg38.knownGene, annot_dbi = org.Hs.eg.db, protein_coding_only = TRUE, verbose = TRUE, add_tss_annotation = FALSE, upstream = 2000, downstream = 2000 )
A toy example dataset representing single-cell ATAC-seq differential accessibility results.
Each row corresponds to a chromatin accessibility peak tested for
differential accessibility across one or more cell types.
The dataset is formatted as a list of GRanges objects or
data frames (convertible to GRanges), where each element represents
differential accessibility statistics for a specific cell type.
Multiple rows may exist per peak, each representing results in a different cell type.
data(atac)data(atac)
A data frame or GRanges-like object with the following required columns:
Unique identifier of the region (e.g., chr1-5000-5800).
Average log2 fold-change of accessibility for the peak in the specific cell type
Adjusted p-value (e.g., FDR-corrected)
Cell type or cluster label associated with each measurement (e.g., Acinar)
Precomputed using FindMarkers() (Wilcoxon test, via Presto if available) on control samples
from the Human Pancreas Analysis Program (HPAP), using paired snATAC-seq and snRNA-seq data
from three non-diabetic human donors.
A toy example dataset of co-accessibility links inferred from single-cell ATAC-seq data using tools such as Cicero or Signac's LinkPeaks(). These links support integrative analysis by associating regulatory elements with putative target genes. The peaks referenced here must exactly match those in the ATAC-seq differential accessibility dataset.
data(cicero_links)data(cicero_links)
A data frame with the following columns:
Genomic coordinate or peak identifier for the first peak in the pair (e.g., chr1-110209621-110211746)
Genomic coordinate or peak identifier for the second peak in the pair (e.g., chr1-110209621-110211746)
Co-accessibility score or correlation value quantifying the linkage
Cicero co-accessibility links were computed from UMAP-reduced snATAC-seq data
(HPAP, control donors) using run_cicero() with chromosome sizes from hg38.
Input data matched the peaks in the provided ATAC dataset.
Computes:
GETSI (Gene Expression Tissue Specificity Index) from single-cell RNA-seq differential expression data.
RETSI (Regulatory Element Tissue Specificity Index) from single-cell ATAC-seq differential accessibility data.
Either RNA or ATAC input must be provided.
compute_spicey_index(diff = NULL, id = NULL)compute_spicey_index(diff = NULL, id = NULL)
diff |
Single
|
id |
A character string specifying the name of the column in |
A data frame with specificity scores for each feature:
Specificity score (weighted log2FC).
Shannon entropy-based specificity measure.
Computes the normalized entropy of specificity scores (RETSI or GETSI) across cell types. Entropy quantifies how evenly a feature's activity is distributed among cell types, and if normalized, yields scores from 0 to 1, where values close to 1 indicate widespread distribution across cell types, and values near 0 denote dominating distribution towards one cell type.
entropy_index(spec_df, group_col)entropy_index(spec_df, group_col)
spec_df |
A data.frame containing the computed specificity scores containing at least the following columns:
|
group_col |
A string specifying the name of the column in |
A data.frame with one row per feature, containing:
Feature identifier.
Raw Shannon entropy computed from specificity scores.
Normalized Shannon entropy score (1 - exp(-entropy)) bounded between 0 and 1, where lower values indicate higher specificity.
Identifies overlaps between a set of peaks and promoter regions, optionally restricted to protein-coding genes.
extract_gene_peak_annotations( peaks, txdb, annot_dbi, protein_coding_only = TRUE, upstream, downstream, verbose = FALSE )extract_gene_peak_annotations( peaks, txdb, annot_dbi, protein_coding_only = TRUE, upstream, downstream, verbose = FALSE )
peaks |
A
|
txdb |
|
annot_dbi |
|
protein_coding_only |
Logical; restrict to protein-coding genes (default TRUE). |
upstream |
Single integer value indicating the number of bases upstream
from the TSS (transcription start sites) (default |
downstream |
Single integer values indicating the number of bases downstream
from the TSS (transcription start sites) (default |
verbose |
Logical; print messages (default TRUE). |
A GRanges with with:
Coordinates of the peak that overlaps with a gene promoter.
Unique identifier of the region (e.g., chr1-5000-5800).
Identifier of the gene. Must be an official gene symbol (e.g., GAPDH)
Extract promoter regions annotated gene symbols from a TxDb and AnnotationDbi object
get_promoters( txdb, annot_dbi, upstream, downstream, protein_coding_only = TRUE )get_promoters( txdb, annot_dbi, upstream, downstream, protein_coding_only = TRUE )
txdb |
|
annot_dbi |
|
upstream |
Single integer value indicating the number of bases upstream
from the TSS (transcription start sites) (default |
downstream |
Single integer values indicating the number of bases downstream
from the TSS (transcription start sites) (default |
protein_coding_only |
Logical; restrict to protein-coding genes (default TRUE). |
A GRanges object with the chromosomes, start and end positions
of defined specie promoter regions together with the official gene symbol
stored in the gene_id metadata column.
This function connects regulatory regions scored with RETSI
link_spicey(retsi = NULL, getsi = NULL, annotation = NULL)link_spicey(retsi = NULL, getsi = NULL, annotation = NULL)
retsi |
A data.frame containing RETSI scores for chromatin accessibility regions,
as returned by
|
getsi |
A data.frame containing GETSI scores for genes,
as returned by
.
|
annotation |
(Optional). A data.frame linking
|
A data.frame where each row represents a regulatory element–gene pair
linked within a given cell type. The output includes:
Unique identifier of the region (e.g., chr1-5000-5800)
Identifier of the gene. Must be an official gene symbol (e.g., GAPDH)
.
Cell type or cluster in which the association is observed (e.g., Acinar)
RETSI score: regulatory element specificity in this cell type.
Normalized shannon-entropy of RETSI (lower = more specific).
GETSI score: gene expression specificity in this cell type.
Normalized shannon-entropy of GETSI (lower = more specific).
Any additional columns from the original retsi and getsi inputs, suffixed
with _ATAC and _RNA respectively (e.g., avg_log2FC_ATAC, p_val_RNA).
Generates a heatmap using ggplot2 to visualize expression or accessibility SPICEY scores for genes across different cell types. Genes are ordered by their highest-scoring cell type, and then by maximum SPICEY score within that group.
plot_heatmap(df_z, title_text, fill_label)plot_heatmap(df_z, title_text, fill_label)
df_z |
A data frame with SPICEY scored values. Must contain:
.
|
title_text |
Character. Title of the heatmap. |
fill_label |
Character. Legend label for the color scale. |
A ggplot2 object representing the heatmap.
prepare_heatmap_data, spicey_heatmap
Filters and processes a gene–cell-type matrix for heatmap visualization.
Selects the top n genes per cell type,
and returns a summary matrix suitable for plotting.
prepare_heatmap_data(df, score_col, top_n)prepare_heatmap_data(df, score_col, top_n)
df |
A data frame with at least: |
score_col |
Character. Name of the score column to rank. |
top_n |
Integer. Number of top-ranked genes per cell type. |
A data frame with: gene_id, cell_type, and score.
A toy example dataset representing single-cell RNA-seq
differential expression results.
Each row corresponds to a gene tested across one or more cell types.
The dataset is formatted as a list of GRanges objects or
data frames (convertible to GRanges), where each element contains
differential expression statistics for a specific cell type.
Multiple rows may exist per gene, each representing results in a different cell type.
data(rna)data(rna)
A data frame or GRanges-like object with the following required columns:
Identifier of the gene. Must be an official gene symbol (e.g., GAPDH)
Average log2 fold-change of expression for the gene in the specific cell type
Adjusted p-value (e.g., FDR-corrected)
Cell type or cluster label associated with each measurement (e.g., Acinar)
Precomputed using FindMarkers() (Wilcoxon test, via Presto if available) on control samples
from the Human Pancreas Analysis Program (HPAP), using paired snATAC-seq and scRNA-seq data
from three non-diabetic human donors.
This function computes a specificity index for different features (e.g., genes or regions) based on differential expression/accessibility data. It rescales fold-change values and weights them by significance to quantify how specific a feature's activity is to a particular cell type.
specificity_index(da, group_col)specificity_index(da, group_col)
da |
A data.frame containing differential results with at least the following columns:
|
group_col |
A string specifying the name of the column in |
A data.frame identical to the input but with additional columns:
Fold-change converted from log2 scale.
Maximum fold-change observed within each feature group.
Normalized significance weight derived from adjusted p-values.
Fold-change normalized by maximum fold-change in the group.
Specificity score computed as the product of normalized fold-change significantly weighted.
Computes tissue-specificity scores from differential accessibility (RETSI) and/or gene expression (GETSI) data obtained from single cell experiments. Supports:
RETSI calculation from differential accessibility data in different cell types/clusters (scATAC-seq).
GETSI calculation from differential expression data in different cell types/clusters (scRNA-seq).
Optional integration of RETSI and GETSI scores by linking gene associations
(see annotate_with_nearest or annotate_with_coaccessibility)
SPICEY(atac = NULL, rna = NULL, annotation = NULL, verbose = TRUE)SPICEY(atac = NULL, rna = NULL, annotation = NULL, verbose = TRUE)
atac |
Either a single
|
rna |
Either a single
|
annotation |
(Optional). A data.frame linking
|
verbose |
Logical; print messages (default TRUE). |
Depending on inputs, returns RETSI and/or GETSI data frames, optionally linked and annotated.
data(rna) data(atac) # Calculate RETSI only retsi <- SPICEY(atac = atac) # Calculate GETSI only getsi <- SPICEY(rna = rna) # Calculate both both <- SPICEY( rna = rna, atac = atac ) # Integrate RETSI and GETSI with nearest gene library(TxDb.Hsapiens.UCSC.hg38.knownGene) library(org.Hs.eg.db) peaks <- unique(unlist(atac)[, c("region_id")]) annotation_near <- annotate_with_nearest( peaks = peaks, txdb = TxDb.Hsapiens.UCSC.hg38.knownGene, annot_dbi = org.Hs.eg.db, protein_coding_only = TRUE, verbose = TRUE, add_tss_annotation = FALSE, upstream = 2000, downstream = 2000 ) spicey_near <- SPICEY( rna = rna, atac = atac, annotation = annotation_near ) # Integrate RETSI and GETSI with coaccessibility data(cicero_links) annotation_coacc <- annotate_with_coaccessibility( peaks = peaks, txdb = TxDb.Hsapiens.UCSC.hg38.knownGene, links_df = cicero_links, annot_dbi = org.Hs.eg.db, protein_coding_only = TRUE, verbose = TRUE, add_tss_annotation = FALSE, upstream = 2000, downstream = 2000 ) spicey_coacc <- SPICEY( rna = rna, atac = atac, annotation = annotation_coacc )data(rna) data(atac) # Calculate RETSI only retsi <- SPICEY(atac = atac) # Calculate GETSI only getsi <- SPICEY(rna = rna) # Calculate both both <- SPICEY( rna = rna, atac = atac ) # Integrate RETSI and GETSI with nearest gene library(TxDb.Hsapiens.UCSC.hg38.knownGene) library(org.Hs.eg.db) peaks <- unique(unlist(atac)[, c("region_id")]) annotation_near <- annotate_with_nearest( peaks = peaks, txdb = TxDb.Hsapiens.UCSC.hg38.knownGene, annot_dbi = org.Hs.eg.db, protein_coding_only = TRUE, verbose = TRUE, add_tss_annotation = FALSE, upstream = 2000, downstream = 2000 ) spicey_near <- SPICEY( rna = rna, atac = atac, annotation = annotation_near ) # Integrate RETSI and GETSI with coaccessibility data(cicero_links) annotation_coacc <- annotate_with_coaccessibility( peaks = peaks, txdb = TxDb.Hsapiens.UCSC.hg38.knownGene, links_df = cicero_links, annot_dbi = org.Hs.eg.db, protein_coding_only = TRUE, verbose = TRUE, add_tss_annotation = FALSE, upstream = 2000, downstream = 2000 ) spicey_coacc <- SPICEY( rna = rna, atac = atac, annotation = annotation_coacc )
Visualizes gene-level specificity scores (RETSI and/or GETSI) across cell types
using a SPICEY scored heatmap representation. Depending on the chosen mode, the function
can display either RETSI or GETSI scores independently, or compute and visualize
a combined SPICEY score (mean score of RETSI and GETSI).
If spicey_measure = "SPICEY" and combined_score = TRUE, RETSI and GETSI
scores are scaled, averaged, and shown in a unified heatmap. Otherwise, separate
heatmaps are produced for RETSI and GETSI, respectively.
spicey_heatmap( df, top_n = 5, spicey_measure = c("RETSI", "GETSI", "SPICEY"), combined_score = FALSE )spicey_heatmap( df, top_n = 5, spicey_measure = c("RETSI", "GETSI", "SPICEY"), combined_score = FALSE )
df |
A data frame with at least the following columns:
|
top_n |
Integer. Number of top-ranked genes to include per cell type (default |
spicey_measure |
Character. Score type to visualize. Must be one of the following:
|
combined_score |
Logical. Only relevant if |
A ggplot2 object, or a patchwork layout if two heatmaps are returned.
SPICEY, prepare_heatmap_data, plot_heatmap
library(TxDb.Hsapiens.UCSC.hg38.knownGene) library(org.Hs.eg.db) data(rna) data(atac) data(cicero_links) # Obtain annotatin with coaccessibility peaks <- unique(unlist(atac)[, c("region_id")]) annotation_coacc <- annotate_with_coaccessibility( peaks = peaks, txdb = TxDb.Hsapiens.UCSC.hg38.knownGene, links_df = cicero_links, annot_dbi = org.Hs.eg.db, protein_coding_only = TRUE, verbose = TRUE, add_tss_annotation = FALSE, upstream = 2000, downstream = 2000 ) # Obtain linked SPICEY measures spicey_coacc <- SPICEY( rna = rna, atac = atac, annotation = annotation_coacc ) # Make plots retsi <- spicey_coacc$RETSI |> dplyr::left_join(annotation_coacc, by = c("region_id")) spicey_heatmap(retsi, spicey_measure = "RETSI") spicey_heatmap(spicey_coacc$GETSI, spicey_measure = "GETSI") spicey_heatmap(spicey_coacc$linked, spicey_measure = "SPICEY", combined_score = FALSE) spicey_heatmap(spicey_coacc$linked, spicey_measure = "SPICEY", combined_score = TRUE)library(TxDb.Hsapiens.UCSC.hg38.knownGene) library(org.Hs.eg.db) data(rna) data(atac) data(cicero_links) # Obtain annotatin with coaccessibility peaks <- unique(unlist(atac)[, c("region_id")]) annotation_coacc <- annotate_with_coaccessibility( peaks = peaks, txdb = TxDb.Hsapiens.UCSC.hg38.knownGene, links_df = cicero_links, annot_dbi = org.Hs.eg.db, protein_coding_only = TRUE, verbose = TRUE, add_tss_annotation = FALSE, upstream = 2000, downstream = 2000 ) # Obtain linked SPICEY measures spicey_coacc <- SPICEY( rna = rna, atac = atac, annotation = annotation_coacc ) # Make plots retsi <- spicey_coacc$RETSI |> dplyr::left_join(annotation_coacc, by = c("region_id")) spicey_heatmap(retsi, spicey_measure = "RETSI") spicey_heatmap(spicey_coacc$GETSI, spicey_measure = "GETSI") spicey_heatmap(spicey_coacc$linked, spicey_measure = "SPICEY", combined_score = FALSE) spicey_heatmap(spicey_coacc$linked, spicey_measure = "SPICEY", combined_score = TRUE)