Proteins at the cell surface connect intracellular and extracellular signaling networks and largely determine a cell’s capacity to communicate and interact with its environment.
Importantly, variations in transcriptomic profiles are often observed between healthy and diseased cells, presenting distinct sets of cell-surface proteins. Indeed, cell surface proteins i) may act as biomarkers for the detection of diseased cells in tissues or body fluids and ii) are the most prevalent target of human drugs: 66% of approved human drugs listed in the DrugBank database target a cell-surface protein. The investigation of the cell surfaceome therefore could provide new possibilities for diagnosis, prognosis, treatment development, and therapy response evaluation.
When a study aims to find new biomarkers, the small number of samples often limits the ability to obtain reliable results. However, as sequencing costs continue to decrease, several follow-up experiments will likely be conducted to re-address the same biological question, suggesting a need for methods able to jointly analyze data from multiple studies.
SurfR provides a solution to these issues, generating a list of ranked surface protein-coding differentially-expressed genes starting from the raw count matrix of your own RNA-seq experiment or from bulk transcriptomic data automatically retrieved from public databases such as GEO and TCGA. GEO queries are based on the ArchS4 pipeline. TCGA repository is interrogated through TCGAbiolinks.
SurfR also offers the opportunity to increase the sample size of a cohort by integrating related datasets, therefore enhancing the power to detect differentially expressed genes of interest. Meta-analysis can be performed through metaRNASeq, taking into account inter-study variability that may arise from technical differences among studies (e.g., sample preparation, library protocols, batch effects) as well as additional biological variability.
Gene ontology (GO) and pathway annotation can also be performed within SurfR to gain further insights about surface protein candidates.
Finally, SurfR includes functions to visualize DEG and enrichment results, aiding in the interpretation of findings.
Maurizio, A., Tascini, A.S., Morelli, M. SurfR: Riding the wave of RNA-Seq data with a comprehensive Bioconductor package to identify Surface Protein Coding Genes. Bioinformatics Advances, 2024 (DOI: 10.1093/bioadv/vbae201)
Install the package from Bioconductor or GitHub, ensuring correct Bioconductor dependencies.
if (!"BiocManager" %in% rownames(installed.packages()))
install.packages("BiocManager", repos = "https://cran.r-project.org")
When the package is available on Bioconductor, use
Use the pre-release or devel version from GitHub using devtools with
# install.packages("devtools")
devtools::install_github("auroramaurizio/SurfR", build_vignettes = TRUE)
If you encounter issues during installation or execution, try downgrading dplyr to version 2.3.4.
The basic idea behind SurfR has been to create a
complete framework to detect surface protein coding genes within your
data, or within public datasets. This framework facilitates the
simultaneous analysis and comparison of multiple studies, easily
revealing functional consensus and differences among distinct
conditions. To begin, let’s ask SurfR to detect surface
protein coding genes among a vector of input genes. Gene ID can be
provided as gene_name
, ensembl
,
entrez
or uniProt_name
.
The protein classification is based on a recently developed surfaceome predictor, called SURFY (Bausch-Fluck et al. 2018), based on machine learning.
library(SurfR)
GeneNames <- c("CIITA", "EPCAM", "CD24", "CDCP1", "LYVE1")
SurfaceProteins_df <- Gene2SProtein(GeneNames,
input_type = "gene_name",
output_tsv = FALSE,
Surfy_version = "new")
#The output dataframe contains several information retrieved from Surfy.
colnames(SurfaceProteins_df)
#These are the 5 surface protein coding genes detected by SurfR.
SurfaceProteins_df$UniProt.gene
Although SurfR provides many functions to retrieve public data you can always start from your own dataset.
Here we are going to simulate a small bulkRNA dataset with the R
package SPsimSeq (Assefa, Vandesompele, and Thas 2020) starting
from a subset of Zhang RNA-seq data (Zhang et al.
2015), adding 20% of Differentially Expressed genes
(pDE = 0.2
).
You can than decide to stick to it or combine it with other datasets (public or private).
library(SPsimSeq)
data("zhang.data.sub")
zhang.counts <- zhang.data.sub$counts
MYCN.status <- zhang.data.sub$MYCN.status
# Simulation of bulk RNA data
sim.data.bulk <- SPsimSeq(n.sim = 1,
s.data = zhang.counts,
group = MYCN.status,
n.genes = 1000,
batch.config = 1,
group.config = c(0.5, 0.5),
tot.samples = ncol(zhang.counts),
pDE = 0.2,
lfc.thrld = 0.5,
result.format = "list",
return.details = TRUE)
sim.data.bulk1 <- sim.data.bulk$sim.data.list[[1]]
countMatrix <- sim.data.bulk1$counts
row.names(countMatrix) <- row.names(zhang.counts)
metadata <- sim.data.bulk1$colData
metadata$Group <- as.factor(metadata$Group)
A fundamental task in the analysis of count data from RNA-seq is the detection of differentially expressed genes. For this task we rely on the package DESeq2 (Love, Huber, and Anders 2014), starting from counts data. The count data are presented as a table which reports, for each sample, the number of sequence fragments that have been assigned to each gene.
We use the built-in SurfR DGE function, to perform the differential expression analysis of the simulated dataset.
Good Differential Expressed surface protein are the ones which are strongly expressed in one condition and almost not expressed in the other. To help detecting those candidates, the output data.frame of the DGE function contains information on the average expression in the two group (see Mean_CPM_T and Mean_CPM_C columns).
library(SurfR)
df_zhang <- DGE(expression = countMatrix,
metadata = metadata,
Nreplica = 50,
design = "~Group",
condition = "Group",
alpha = 0.05,
TEST = "1", CTRL = "0",
output_tsv = FALSE)
head(df_zhang)
Once DEGS have been detected, we may want to isolate Surface protein-coding genes.
The protein classification is based on a recently developed surfaceome predictor, called SURFY, based on machine learning.
We use the built-in SurfR Gene2SProtein function to identify Surface protein-coding genes (SP).
# remove NA values
df_zhang <- df_zhang[!is.na(df_zhang$padj), ]
# all fdr
fdr_GeneID <- df_zhang[df_zhang$padj < 0.05, "GeneID"]
SP <- Gene2SProtein(genes = fdr_GeneID, input_type = "gene_name")
# upregulated fdr
fdrUP_GeneID <- df_zhang[df_zhang$padj < 0.05 & df_zhang$log2FoldChange > 0,
"GeneID"]
SPup <- Gene2SProtein(genes = fdrUP_GeneID, input_type = "gene_name")
# dowregulated fdr
fdrDW_GeneID <- df_zhang[df_zhang$padj < 0.05 & df_zhang$log2FoldChange < 0,
"GeneID"]
SPdw <- Gene2SProtein(genes = fdrDW_GeneID, input_type = "gene_name")
GEO (Edgar 2002) is a public functional genomics data repository containing high throughput gene expression data and hybridization arrays. We provide a handy interface to download experiments and curated gene expression profiles.
Here we are going to reanalyze the Cloughesy et al. (Cloughesy et al. 2019) public dataset of recurrent glioblastoma patients undergoing two different treatments: neoadjuvant pembrolizumab or adjuvant pembrolizumab..
This study is available under the GEO accession series GSE121810.
The metadata is downloaded with SurfR built-in function GEOmetadata. Note that this study has been sequenced with only one sequencing platform. If this is not the case, you have to download separately all the metadata specifying the GPL series numbers, and then merge them.
The count matrix is downloaded from ArchS4 (Lachmann et al. 2018) with the SurfR built-in function DownloadArchS4, to ensure handling not normalize data.
library(SurfR)
library(stringr)
# Download metadata from GEO
mGSE121810 <- GEOmetadata(GSE = "GSE121810")
# create new metadata column in order to remove unwanted special characters
unwanted_character <- " "
fx <- function(x) {
str_split(string = x, pattern = unwanted_character)[[1]][1]
}
mGSE121810$condition <- sapply(mGSE121810$therapy, fx)
mGSE121810$condition <- as.factor(mGSE121810$condition)
# Preview metadata
head(mGSE121810)
# only select 3 samples per condition to save time
na_samples <- c("GSM3447013", "GSM3447018", "GSM3447019")
a_samples <- c("GSM3447023", "GSM3447024", "GSM3447026")
mGSE121810 <- mGSE121810[c(na_samples, a_samples), ]
# Download count matrix from ArchS4
cGSE121810 <- DownloadArchS4(mGSE121810$GSM,
species = "human",
print_tsv = FALSE,
filename = NULL)
# Preview count matrix
head(cGSE121810[, ])
A fundamental objective in the analysis of RNA-seq counts data is the detection of differentially expressed genes. For this task we rely on the package DESeq2, starting from count data. Count data reports for each sample the number of sequence fragments that have been assigned to each gene.
Here, we use the built-in SurfR DGE function, to perform the differential expression analysis of the GSE121810 dataset.
Good Differential Expressed surface protein are the ones which are strongly expressed in one condition and almost not expressed in the other. To help detecting the best candidates, the output data.frame of the DGE function contains information on the average expression in the two group (see Mean_CPM_T and Mean_CPM_C columns).
# Perform DGE
df_GEO <- DGE(expression = cGSE121810,
metadata = mGSE121810,
Nreplica = 3,
design = "~condition",
condition = "condition",
alpha = 0.05,
TEST = "neoadjuvant", CTRL = "adjuvant",
output_tsv = FALSE)
# remove NA values
df_GEO <- df_GEO[!is.na(df_GEO$padj), ]
head(df_GEO)
Once DEGS have been detected, we may want to isolate Surface protein-coding genes.
The protein classification is based on a recently developed surfaceome predictor, called SURFY, based on machine learning.
We use the built-in SurfR Gene2SProtein function to identify Surface protein-coding genes (SP).
# Detect SP amoung differentially expressed genes
fdr_GeneID <- df_GEO[df_GEO$padj < 0.1, "GeneID"]
SP <- Gene2SProtein(genes = fdr_GeneID, input_type = "gene_name")
fdrUP_GeneID <- df_GEO[df_GEO$padj < 0.1 & df_GEO$log2FoldChange > 0, "GeneID"]
SPup <- Gene2SProtein(genes = fdrUP_GeneID, input_type = "gene_name")
fdrDW_GeneID <- df_GEO[df_GEO$padj < 0.1 & df_GEO$log2FoldChange < 0, "GeneID"]
SPdw <- Gene2SProtein(genes = fdrDW_GeneID, input_type = "gene_name")
TCGA (The Cancer Genome Atlas Research Network et al. 2013) contains data for thousands of tumor samples across more than 20 types of cancer. Navigating through all of the files manually is impossible. Therefore we provide a function based on TCGAbiolinks that automates and streamlines the retrieval of public TCGA transcriptomics data. Note that to use this function you need to install the developmental version of TCGAbiolinks (Mounir et al. 2019) (Colaprico et al. 2016).
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("BioinformaticsFMRP/TCGAbiolinksGUI.data")
BiocManager::install("BioinformaticsFMRP/TCGAbiolinks")
Here we reanalyze the TCGA-THYM dataset, since it is one of the smallest TCGA datasets, including normal solid tissue samples.
The TCGA count matrix and the metadata are downloaded with SurfR built-in function TCGA_download. The shortLetterCode column of the metadata allows us to distinguish Primary Solid Tumor (TP) and normal (NT) samples.
# Download TCGA data
# To save time only download 3 Tumor samples and 3 normal samples.
# If barcodes are not specified TCGA_download function will download all the
# available samples in the TCGA study
barcodes2Download <- c("TCGA-X7-A8D6-11A-22R-A42C-07",
"TCGA-X7-A8D7-11A-11R-A42C-07",
"TCGA-XM-A8RB-01A-11R-A42C-07",
"TCGA-X7-A8M0-01A-11R-A42C-07")
TCGA.THYM <- TCGA_download(project = "TCGA-THYM", barcodes = barcodes2Download)
cTCGA.THYM <- TCGA.THYM[[1]]
mTCGA.THYM <- TCGA.THYM[[2]]
table(mTCGA.THYM$shortLetterCode)
mTCGA.THYM$shortLetterCode <- as.factor(mTCGA.THYM$shortLetterCode)
A fundamental task in the analysis of count data from RNA-seq is the detection of differentially expressed genes. For this task we rely on the package DESeq2, starting from counts data. The count data are presented as a table which reports, for each sample, the number of sequence fragments that have been assigned to each gene.
We use the built-in SurfR DGE function to perform the differential expression analysis of the TCGA-THYM dataset.
Good Differential Expressed surface protein are the ones which are strongly expressed in one condition and almost not expressed in the other. To help detecting those candidates, the output data.frame of the DGE function contains information on the average expression in the two group (see Mean_CPM_T and Mean_CPM_C columns).
df_TCGA <- DGE(expression = cTCGA.THYM,
metadata = mTCGA.THYM,
Nreplica = 2,
design = "~shortLetterCode",
condition = "shortLetterCode",
alpha = 0.05,
TEST = "TP", CTRL = "NT",
output_tsv = FALSE)
head(df_TCGA)
Once DEGs have been detected, we may want to isolate Surface protein-coding genes.
The protein classification takes advantage of a recently developed surfaceome predictor, called SURFY, based on machine learning.
We use the built-in SurfR Gene2SProtein function to identify Surface protein-coding genes (SP).
# remove NA values
df_TCGA <- df_TCGA[!is.na(df_TCGA$padj), ]
fdr_GeneID <- df_TCGA[df_TCGA$padj < 0.05,
"GeneID"]
SP <- Gene2SProtein(genes = fdr_GeneID, input_type = "gene_name")
fdrUP_GeneID <- df_TCGA[df_TCGA$padj < 0.05 & df_TCGA$log2FoldChange > 0,
"GeneID"]
SPup <- Gene2SProtein(genes = fdrUP_GeneID, input_type = "gene_name")
fdrDW_GeneID <- df_TCGA[df_TCGA$padj < 0.05 & df_TCGA$log2FoldChange < 0,
"GeneID"]
SPdw <- Gene2SProtein(genes = fdrDW_GeneID, input_type = "gene_name")
Analyzing data arising from several experiments studying the same question is a way to obtain more robust results, increasing the detection power of differentially expressed genes.
In SurfR we provide a set of functions based on MetaRNASeq (Rau, Marot, and Jaffrézic 2014) package to combine data from multiple RNAseq experiments.
Let’s suppose we want to integrate Breast cancer data from GEO and TCGA. Breast Cancer datasets are downloaded from TCGA with the SurfR built-in function TCGA_download.
# Download TCGA data
# To save time only download 3 Tumor samples and 3 normal samples.
# If barcodes are not specified TCGA_download function will download all the
# available samples in the TCGA study
barcodes2Download <- c("TCGA-BH-A1FU-11A-23R-A14D-07",
"TCGA-BH-A1FC-11A-32R-A13Q-07",
"TCGA-BH-A0DO-11A-22R-A12D-07",
"TCGA-B6-A0RH-01A-21R-A115-07",
"TCGA-BH-A1FU-01A-11R-A14D-07",
"TCGA-A1-A0SE-01A-11R-A084-07")
TCGA.BRCA <- TCGA_download(project = "TCGA-BRCA", barcodes = barcodes2Download)
In GEO, we want to analyze Varley data (Varley et al. 2014), which includes samples of ER+ breast cancer, Triple Negative Breast cancer, adjacent tissues, and normal breast. These datasets can be retrieved from the GEO series GSE58135, using the SurfR built-in functions GEOmetadata and DownloadArchS4.
mGSE58135 <- GEOmetadata("GSE58135")
mGSE58135 <- mGSE58135[mGSE58135$tissue != "Breast Cancer Cell Line", ]
mGSE58135$condition <- "NT"
mGSE58135$condition[mGSE58135$tissue %in% c("ER+ Breast Cancer Primary Tumor",
"Triple Negative Breast Cancer Primary Tumor")] <- "TP"
# only select 3 samples per condition to save time
TP_samples <- c("GSM1401694", "GSM1401717", "GSM1401729")
NT_samples <- c("GSM1401799", "GSM1401813", "GSM1401814")
mGSE58135 <- mGSE58135[c(TP_samples, NT_samples), ]
cGSE58135 <- DownloadArchS4(mGSE58135$GSM, species = "human")
cGSE58135 <- cGSE58135[, row.names(mGSE58135)]
table(mGSE58135$condition)
The first step in the analysis is to detect differentially expressed genes for each count data, separately. For this task we rely on the package DESeq2, starting from counts data. The count data are presented as a table which reports, for each sample, the number of sequence fragments that have been assigned to each gene.
We use the built-in SurfR DGE function to perform the differential expression analysis of the TCGA-BRCA dataset and GSE58135.
Good differentially expressed surface proteins should be strongly expressed in one condition and almost not expressed in the other. The output dataframe of the DGE function contains information on the average expression in the two groups (see Mean_CPM_T and Mean_CPM_C columns), to help in the detection of the best candidates.
# TCGA DGE
df.TCGA <- DGE(expression = cTCGA.BRCA,
metadata = mTCGA.BRCA,
Nreplica = 3,
design = "~shortLetterCode",
condition = "shortLetterCode",
alpha = 0.05,
TEST = "TP", CTRL = "NT",
output_tsv = FALSE)
head(df.TCGA)
# GSE58135 DGE
df.GSE58135 <- DGE(expression = cGSE58135,
metadata = mGSE58135,
Nreplica = 3,
design = "~condition",
condition = "condition",
alpha = 0.05,
TEST = "TP", CTRL = "NT",
output_tsv = FALSE)
head(df.GSE58135)
Here we provide a function based on metaRNASeq bioconductor package to implement two p-value combination techniques (inverse normal and Fisher methods).
The meta-analysis is performed by the SurfR built-in function metaRNAseq, which requires as input:
ind_deg
);fishercomb
) or the inverse normal
combination technique (invnorm
);BHth
);nrep
).The function automatically produces and saves as .pdf histograms of raw p-values for each of the individual differential analyses performed using the independent filtering from DESeq2 package. You can also examine the p-value distribution after p.value combination.
L_fishercomb <- metaRNAseq(ind_deg = list(TCGA.BRCA = df.TCGA, GEO.GSE58135 = df.GSE58135),
test_statistic = "fishercomb",
BHth = 0.05,
adjpval.t = 0.05)
L_invnorm <- metaRNAseq(ind_deg = list(TCGA.BRCA = df.TCGA, GEO.GSE58135 = df.GSE58135),
test_statistic = "invnorm",
BHth = 0.05,
adjpval.t = 0.05,
nrep = c(102, 56))
Finally, we can summarize the results of the meta-analysis in a data.frame highlighting the statistical information for the common genes to all methods using the built-in SurfR function combine_fisher_invnorm and use the built-in SurfR Gene2SProtein function to identify Surface protein-coding genes (SP) among those.
Genes displaying contradictory differential expression in separate
studies can be identified in the column signFC
= 0 and
removed from the list of differentially expressed genes via
meta-analysis.
metacomb <- combine_fisher_invnorm(ind_deg = list(TCGA.BRCA = df.TCGA,
GEO.GSE58135 = df.GSE58135),
invnorm = L_invnorm,
fishercomb = L_fishercomb,
adjpval = 0.05)
metacomb_GeneID <- metacomb[metacomb$signFC != 0, "GeneID"]
SP <- Gene2SProtein(genes = metacomb_GeneID, input_type = "gene_name")
After identifying the subset of genes enriched in our specific condition of interest, a range of analyses becomes useful to move beyond a mere gene list.
A general enrichment analysis allows to gain further insights about
upregulated or downregulated DEGs. To do so, we use the SurfR built-in
function Enrichment, based on the enrichR
cran package (Kuleshov et al. 2016). You
have the option to indicate the specific database you wish to utilize
among the available in enrichR. The enrichR function
listEnrichrDbs()
allows you to navigate the options.
Sporadically, network connectivity issues may arise with EnrichR server.
If it happens, please, retry to run the function after a few
minutes.
dfList <- list(GEO = df.GSE58135,
TCGA = df.TCGA)
# Enrichment analysis
Enrich <- Enrichment(dfList,
enrich.databases = c("GO_Biological_Process_2021",
"KEGG_2021_Human"),
p_adj = 0.05, logFC = 1)
head(Enrich$GEO$fdr_up$GO_Biological_Process_2021)
SurfR implements several visualization methods to help interpret enrichment results obtained through EnrichR using ggplot2, with the built-in function Enrichment_barplot.
It depicts gene count ratio and enrichment scores (- Log10 adjusted p values) as bar height and color. Users can specify the number of terms (most significant) to display.
library(ggplot2)
# barplot of the top 5 upregulated pathways
Enrichment_barplot(Enrich$GEO,
enrich.databases <- c("GO_Biological_Process_2021",
"KEGG_2021_Human"),
p_adj = 0.05,
num_term = 5,
cond = "UP")
# barplot of the top 5 downregulated pathways
Enrichment_barplot(Enrich$GEO,
enrich.databases <- c("GO_Biological_Process_2021",
"KEGG_2021_Human"),
p_adj = 0.05,
num_term = 5,
cond = "DOWN")
Moreover, we can annotate our list of genes with cross-database identifiers and descriptions (Entrezid, Uniprot, KEGG, etc.), taking advantage of one of the 35 gene-set libraries present in the Enrichr database, using the SurfR built-in function Annotate_SPID.
Utilizing the SurfR function Splot you can create barplots to visualize the annotation classes reported in the dataframe produced by the Gene2SProtein function. The default grouping is the Membranome Almen classification.
# upregulated genes in GEO dataset
fdrUP_GeneID <- df.GSE58135[df.GSE58135$padj < 0.05 & df.GSE58135$log2FoldChange > 0, "GeneID"]
# corresponding Surface Proteins
SPup <- Gene2SProtein(genes = fdrUP_GeneID, input_type = "gene_name")
# Barplot of Almen classification
Splot(SPup,
group.by = "Membranome.Almen.main-class",
main = "Almen class")
You can compare the list of resulting surface proteins from up to 7 different studies, using a venn diagram, with the built-in SurfR function SVenn.
Principal Components Analysis (PCA) is a very useful diagnostic feature to gain insights about your datasets.
You can perform PCA and visualize the result with a customizable plot with the built-in SurfR function plotPCA.
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] BiocStyle_2.35.0
#>
#> loaded via a namespace (and not attached):
#> [1] digest_0.6.37 R6_2.5.1 fastmap_1.2.0
#> [4] xfun_0.49 maketools_1.3.1 cachem_1.1.0
#> [7] knitr_1.49 htmltools_0.5.8.1 rmarkdown_2.29
#> [10] buildtools_1.0.0 lifecycle_1.0.4 cli_3.6.3
#> [13] sass_0.4.9 jquerylib_0.1.4 compiler_4.4.2
#> [16] sys_3.4.3 tools_4.4.2 evaluate_1.0.1
#> [19] bslib_0.8.0 yaml_2.3.10 BiocManager_1.30.25
#> [22] jsonlite_1.8.9 rlang_1.1.4