---
title: "Introduction to SurfR"
author: "Aurora Maurizio, Anna Sofia Tascini, and Marco Jacopo Morelli"
output:
  BiocStyle::html_document:
    toc_float: true
bibliography: references.bib
vignette: >
  %\VignetteIndexEntry{Introduction to SurfR}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# Introduction

Proteins at the cell surface connect intracellular and extracellular signaling networks and largely determine a cell’s capacity to communicate and interact with its environment. 

Importantly, variations in transcriptomic profiles are often observed between healthy and diseased cells, presenting distinct sets of cell-surface proteins. Indeed, cell surface proteins i) may act as biomarkers for the detection of diseased cells in tissues or body fluids and ii) are the most prevalent target of human drugs: 66% of approved human drugs listed in the DrugBank database target a cell-surface protein. The investigation of the cell surfaceome therefore could provide new possibilities for diagnosis, prognosis, treatment development, and therapy response evaluation.


When a study aims to find new biomarkers, the small number of samples often limits the ability to obtain reliable results. However, as sequencing costs continue to decrease, several follow-up experiments will likely be conducted to re-address the same biological question, suggesting a need for methods able to jointly analyze data from multiple studies.


**SurfR** provides a solution to these issues, generating a list of ranked surface protein-coding differentially-expressed genes starting from the raw count matrix of your own RNA-seq experiment or from bulk transcriptomic data automatically retrieved from public databases such as GEO and TCGA.
GEO queries are based on the ArchS4 pipeline. TCGA repository is interrogated through TCGAbiolinks.

**SurfR** also offers the opportunity to increase the sample size of a cohort by integrating related datasets, therefore enhancing the power to detect differentially expressed genes of interest. Meta-analysis can be performed through metaRNASeq, taking into account inter-study variability that may arise from technical differences among studies (e.g., sample preparation, library protocols, batch effects) as well as additional biological variability.

Gene ontology (GO) and pathway annotation can also be performed within **SurfR** to gain further insights about surface protein candidates.

Finally, **SurfR** includes functions to visualize DEG and enrichment results, aiding in the interpretation of findings.


# Installation

Install the package from *Bioconductor* or GitHub, ensuring correct
*Bioconductor* dependencies.

```{r install, eval = FALSE}
if (!"BiocManager" %in% rownames(installed.packages()))
    install.packages("BiocManager", repos = "https://cran.r-project.org")
```

When the package is available on *Bioconductor*, use

```{r install-Bioconductor, eval = FALSE}
BiocManager::install("SurfR")
```

Use the pre-release or devel version from GitHub using devtools with

```{r install-github, eval = FALSE}
# install.packages("devtools")
devtools::install_github("auroramaurizio/SurfR", build_vignettes = TRUE)
```


# Quick Start

The basic idea behind **SurfR** has been to create a complete framework to detect surface protein coding genes within your data, or within public datasets. This framework facilitates the simultaneous analysis and comparison of multiple studies, easily revealing functional consensus and differences among distinct conditions. To begin, let’s ask **SurfR** to detect surface protein coding genes among a vector of input genes. Gene ID can be provided as  `gene_name`, `ensembl`, `entrez` or `uniProt_name`.

The protein classification is based on a recently developed surfaceome predictor, called [**SURFY**](http://wlab.ethz.ch/surfaceome/) [@bausch-fluck_silico_2018], based on machine learning.


```{r gene2protein }
library(SurfR)

GeneNames <- c("CIITA", "EPCAM", "CD24", "CDCP1", "LYVE1")
SurfaceProteins_df <- Gene2SProtein(GeneNames, 
                                    input_type = "gene_name", 
                                    output_tsv = FALSE,
                                    Surfy_version = "new")
#The output dataframe contains several information retrieved from Surfy.
colnames(SurfaceProteins_df)
#These are the 5 surface protein coding genes detected by SurfR.
SurfaceProteins_df$UniProt.gene
```


# Tutorial 

## Start from your own data

Although **SurfR** provides many functions to retrieve public data you can always start from your own dataset.

Here we are going to simulate a small bulkRNA dataset with the R package [SPsimSeq](https://doi.org/10.1093/bioinformatics/btaa105) [@assefa_spsimseq_2020]
starting from a subset of Zhang RNA-seq data [@zhang_targeting_2015], 
adding 20% of Differentially Expressed genes  (`pDE = 0.2`).

You can than decide to stick to it or combine it with other datasets (public or private).

```{r simulate zhang, eval=FALSE}
library(SPsimSeq)

data("zhang.data.sub")
zhang.counts <- zhang.data.sub$counts
MYCN.status  <- zhang.data.sub$MYCN.status

# Simulation of bulk RNA data
sim.data.bulk <- SPsimSeq(n.sim = 1, 
                          s.data = zhang.counts,
                          group = MYCN.status, 
                          n.genes = 1000, 
                          batch.config = 1,
                          group.config = c(0.5, 0.5), 
                          tot.samples = ncol(zhang.counts),
                          pDE = 0.2, 
                          lfc.thrld = 0.5, 
                          result.format = "list",
                          return.details = TRUE)

sim.data.bulk1 <- sim.data.bulk$sim.data.list[[1]]

countMatrix <- sim.data.bulk1$counts
row.names(countMatrix) <- row.names(zhang.counts)
metadata <- sim.data.bulk1$colData
metadata$Group <- as.factor(metadata$Group)
```

A fundamental task in the analysis of count data from RNA-seq is the detection of differentially expressed genes. For this task we rely on the package [DESeq2](https://bioconductor.org/packages/release/bioc/html/DESeq2.html) [@love_moderated_2014], starting from counts data. The count data are presented as a table which reports, for each sample, the number of sequence fragments that have been assigned to each gene. 

We use the built-in SurfR **DGE** function, 
to perform the differential expression analysis of the simulated dataset.

Good Differential Expressed surface protein are the ones which are strongly expressed in one condition and almost not expressed in the other. To help detecting those candidates, the output data.frame of the DGE function contains information on the average expression in the two group (see Mean_CPM_T and Mean_CPM_C columns).

```{r DGE zhang, eval=FALSE}
library(SurfR)

df_zhang <- DGE(expression = countMatrix,
                metadata = metadata,
                Nreplica = 50,
                design = "~Group",
                condition = "Group",
                alpha = 0.05,
                TEST = "1", CTRL =  "0", 
                output_tsv = FALSE)

head(df_zhang)
```

Once DEGS have been detected, we may want to isolate Surface protein-coding genes.

The protein classification is based on a recently developed surfaceome predictor, 
called  [**SURFY**](http://wlab.ethz.ch/surfaceome/), based on machine learning. 

We use the built-in SurfR **Gene2SProtein** function to identify
Surface protein-coding genes (SP). 

```{r G2P zhang, eval=FALSE}
# remove NA values
df_zhang <- df_zhang[!is.na(df_zhang$padj), ]

# all fdr
fdr_GeneID <- df_zhang[df_zhang$padj < 0.05, "GeneID"]
SP <- Gene2SProtein(genes = fdr_GeneID, input_type = "gene_name")

# upregulated fdr
fdrUP_GeneID <- df_zhang[df_zhang$padj < 0.05 & df_zhang$log2FoldChange > 0,
                         "GeneID"]
SPup <- Gene2SProtein(genes = fdrUP_GeneID, input_type = "gene_name")

# dowregulated fdr
fdrDW_GeneID <- df_zhang[df_zhang$padj < 0.05 & df_zhang$log2FoldChange < 0,
                         "GeneID"]
SPdw <- Gene2SProtein(genes = fdrDW_GeneID, input_type = "gene_name")
```


## Explore public datasets

### Dowload from GEO (Gene Expression Omnibus)

[GEO](https://www.ncbi.nlm.nih.gov/geo/) [@edgar_gene_2002] is a public functional genomics data repository containing high throughput gene expression data and hybridization arrays. 
We provide a handy interface to download experiments and curated gene expression profiles.

Here we are going to reanalyze the [Cloughesy et al.](https://doi.org/10.1038/s41591-018-0337-7) 
[@cloughesy_neoadjuvant_2019] public dataset of recurrent glioblastoma patients undergoing two different treatments: 
neoadjuvant pembrolizumab or adjuvant pembrolizumab.. 

This study is available under the GEO accession series [GSE121810](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE121810).

The metadata is downloaded with SurfR built-in function **GEOmetadata**. 
Note that this study has been sequenced with only one sequencing platform. If this is not the case, you have to download separately all the metadata specifying the GPL series numbers, and then merge them.

The count matrix is downloaded from [ArchS4](https://maayanlab.cloud/archs4/)
[@lachmann_massive_2018] with the SurfR built-in function **DownloadArchS4**,
to ensure handling not normalize data.

```{r GEO input meta_counts, eval=FALSE}
library(SurfR)
library(stringr)

# Download metadata from GEO
mGSE121810 <- GEOmetadata(GSE = "GSE121810")

# create new metadata column in order to remove unwanted special characters
unwanted_character <- " "
fx <- function(x) {
  str_split(string = x, pattern = unwanted_character)[[1]][1]
}
mGSE121810$condition <- sapply(mGSE121810$therapy, fx)
mGSE121810$condition <- as.factor(mGSE121810$condition)

# Preview metadata
head(mGSE121810)

# only select 3 samples per condition to save time
na_samples <- c("GSM3447013", "GSM3447018", "GSM3447019")
a_samples <- c("GSM3447023", "GSM3447024", "GSM3447026")
mGSE121810 <- mGSE121810[c(na_samples, a_samples), ]

# Download count matrix from ArchS4
cGSE121810 <- DownloadArchS4(mGSE121810$GSM, 
                             species = "human", 
                             print_tsv = FALSE, 
                             filename = NULL)

# Preview count matrix
head(cGSE121810[, ])
```

A fundamental objective in the analysis of RNA-seq counts data is the detection of differentially expressed genes. For this task we rely on the package DESeq2, starting from count data. Count data reports for each sample the number of sequence fragments that have been assigned to each gene.

Here, we use the built-in SurfR **DGE** function, to perform the differential expression analysis of the GSE121810 dataset.

Good Differential Expressed surface protein are the ones which are strongly expressed in one condition and almost not expressed in the other. To help detecting the best candidates, the output data.frame of the DGE function contains information on the average expression in the two group (see Mean_CPM_T and Mean_CPM_C columns).

```{r GEO dge, eval=FALSE}
# Perform DGE
df_GEO <- DGE(expression = cGSE121810,
              metadata = mGSE121810,
              Nreplica = 3,
              design = "~condition",
              condition = "condition",
              alpha = 0.05,
              TEST = "neoadjuvant", CTRL =  "adjuvant",
              output_tsv = FALSE)

# remove NA values
df_GEO <- df_GEO[!is.na(df_GEO$padj), ]

head(df_GEO)
```

Once DEGS have been detected, we may want to isolate Surface protein-coding genes.

The protein classification is based on a recently developed surfaceome predictor, 
called **SURFY**, based on machine learning. 

We use the built-in SurfR **Gene2SProtein** function to identify
Surface protein-coding genes (SP). 

```{r GEO SP, eval=FALSE}
# Detect SP amoung differentially expressed genes
fdr_GeneID <- df_GEO[df_GEO$padj < 0.1, "GeneID"]

SP <- Gene2SProtein(genes = fdr_GeneID, input_type = "gene_name")

fdrUP_GeneID <- df_GEO[df_GEO$padj < 0.1 & df_GEO$log2FoldChange > 0, "GeneID"]
SPup <- Gene2SProtein(genes = fdrUP_GeneID, input_type = "gene_name")

fdrDW_GeneID <- df_GEO[df_GEO$padj < 0.1 & df_GEO$log2FoldChange < 0, "GeneID"]
SPdw <- Gene2SProtein(genes = fdrDW_GeneID, input_type = "gene_name")

```


### Download from TCGA (The Cancer Genome Atlas Program)

[TCGA](https://tcga-data.nci.nih.gov/tcga/) [@the_cancer_genome_atlas_research_network_cancer_2013] contains data for thousands of tumor samples across more than 20 types of cancer. 
Navigating through all of the files manually is impossible. Therefore we provide a function based on TCGAbiolinks that automates and streamlines the retrieval of public TCGA transcriptomics data. 
Note that to use this function you need to install the developmental version of
[TCGAbiolinks](https://github.com/BioinformaticsFMRP/TCGAbiolinks) [@mounir_new_2019] [@colaprico_tcgabiolinks_2016].

```{r TCGA install, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("BioinformaticsFMRP/TCGAbiolinksGUI.data")
BiocManager::install("BioinformaticsFMRP/TCGAbiolinks")
```

Here we reanalyze the *TCGA-THYM* dataset, since it is one of the smallest 
TCGA datasets, including normal solid tissue samples. 

The TCGA count matrix and the metadata are downloaded with SurfR built-in 
function **TCGA_download**. The shortLetterCode column of the metadata allows us 
to distinguish Primary Solid Tumor (TP) and normal (NT) samples.


```{r echo=FALSE, warning=FALSE, include=FALSE, message = FALSE, eval=FALSE}
# Download TCGA data
# To save time only download 3 Tumor samples and 3 normal samples.
# If barcodes are not specified TCGA_download function will download all the 
# available samples in the TCGA study
barcodes2Download <- c("TCGA-X7-A8D6-11A-22R-A42C-07", 
                       "TCGA-X7-A8D7-11A-11R-A42C-07",
                       "TCGA-XM-A8RB-01A-11R-A42C-07", 
                       "TCGA-X7-A8M0-01A-11R-A42C-07")
TCGA.THYM <- TCGA_download(project = "TCGA-THYM", barcodes = barcodes2Download)

cTCGA.THYM <- TCGA.THYM[[1]]
mTCGA.THYM <- TCGA.THYM[[2]]
table(mTCGA.THYM$shortLetterCode)

mTCGA.THYM$shortLetterCode <- as.factor(mTCGA.THYM$shortLetterCode)
```


A fundamental task in the analysis of count data from RNA-seq is the detection of differentially expressed genes. For this task we rely on the package DESeq2, starting from counts data. The count data are presented as a table which reports, for each sample, the number of sequence fragments that have been assigned to each gene. 

We use the built-in SurfR **DGE** function to perform the differential expression 
analysis of the TCGA-THYM dataset.

Good Differential Expressed surface protein are the ones which are strongly expressed in one condition and almost not expressed in the other. To help detecting those candidates, the output data.frame of the DGE function contains information on the average expression in the two group (see Mean_CPM_T and Mean_CPM_C columns).

```{r TCGA DGE, eval=FALSE}
df_TCGA <- DGE(expression = cTCGA.THYM,
               metadata = mTCGA.THYM,
               Nreplica = 2,
               design = "~shortLetterCode",
               condition = "shortLetterCode",
               alpha = 0.05,
               TEST = "TP", CTRL =  "NT",
               output_tsv = FALSE)

head(df_TCGA)
```


Once DEGs have been detected, we may want to isolate Surface protein-coding genes.

The protein classification takes advantage of a recently developed surfaceome predictor, 
called **SURFY**, based on machine learning. 

We use the built-in SurfR **Gene2SProtein** function to identify
Surface protein-coding genes (SP).

```{r TCGA SP, eval=FALSE}
# remove NA values
df_TCGA <- df_TCGA[!is.na(df_TCGA$padj), ]

fdr_GeneID <- df_TCGA[df_TCGA$padj < 0.05,
                      "GeneID"]

SP <- Gene2SProtein(genes = fdr_GeneID, input_type = "gene_name")

fdrUP_GeneID <- df_TCGA[df_TCGA$padj < 0.05 & df_TCGA$log2FoldChange > 0,
                        "GeneID"]
SPup <- Gene2SProtein(genes = fdrUP_GeneID, input_type = "gene_name")

fdrDW_GeneID <- df_TCGA[df_TCGA$padj < 0.05 & df_TCGA$log2FoldChange < 0,
                        "GeneID"]
SPdw <- Gene2SProtein(genes = fdrDW_GeneID, input_type = "gene_name")
```


## Meta-analysis

Analyzing data arising from several experiments studying the same question is a 
way to obtain more robust results, increasing the detection power of differentially expressed genes.

In SurfR we provide a set of functions based on [MetaRNASeq](https://cran.r-project.org/web/packages/metaRNASeq/vignettes/metaRNASeq.pdf) 
[@rau_differential_2014] package to combine data from multiple RNAseq experiments.

Let’s suppose we want to integrate Breast cancer data from GEO and TCGA. 
Breast Cancer datasets are downloaded from TCGA with the SurfR built-in function  **TCGA_download**.


```{r echo=FALSE, warning=FALSE, include=FALSE, message = FALSE}
# Download TCGA data
# To save time only download 3 Tumor samples and 3 normal samples.
# If barcodes are not specified TCGA_download function will download all the 
# available samples in the TCGA study
barcodes2Download <- c("TCGA-BH-A1FU-11A-23R-A14D-07", 
                       "TCGA-BH-A1FC-11A-32R-A13Q-07",
                       "TCGA-BH-A0DO-11A-22R-A12D-07", 
                       "TCGA-B6-A0RH-01A-21R-A115-07",
                       "TCGA-BH-A1FU-01A-11R-A14D-07", 
                       "TCGA-A1-A0SE-01A-11R-A084-07")
TCGA.BRCA <- TCGA_download(project = "TCGA-BRCA", barcodes = barcodes2Download)
```

```{r meta TCGA_input}
cTCGA.BRCA <- TCGA.BRCA[[1]]
mTCGA.BRCA <- TCGA.BRCA[[2]]
table(mTCGA.BRCA$shortLetterCode)

```

In GEO, we want to analyze [Varley](https://doi.org/10.1007/s10549-014-3019-2) data 
[@varley_recurrent_2014], which includes samples of ER+ breast cancer, Triple Negative Breast cancer, 
adjacent tissues, and normal breast. These datasets can be retrieved from the GEO series [GSE58135](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE58135), 
using the SurfR built-in functions **GEOmetadata** and **DownloadArchS4**.

```{r meta GEO_input metadata}
mGSE58135 <- GEOmetadata("GSE58135")
mGSE58135 <- mGSE58135[mGSE58135$tissue != "Breast Cancer Cell Line", ]
mGSE58135$condition <- "NT"
mGSE58135$condition[mGSE58135$tissue %in% c("ER+ Breast Cancer Primary Tumor", 
                                            "Triple Negative Breast Cancer Primary Tumor")] <- "TP"

# only select 3 samples per condition to save time
TP_samples <- c("GSM1401694", "GSM1401717", "GSM1401729")
NT_samples <- c("GSM1401799", "GSM1401813", "GSM1401814")
mGSE58135 <- mGSE58135[c(TP_samples, NT_samples), ]
```

```{r meta GEO_input counts}
cGSE58135 <- DownloadArchS4(mGSE58135$GSM, species = "human")
cGSE58135 <- cGSE58135[, row.names(mGSE58135)]

table(mGSE58135$condition)

```

The first step in the analysis is to detect differentially expressed genes for each count data, separately. 
For this task we rely on the package DESeq2, starting from counts data. The count data are presented as a table which reports, for each sample, the number of sequence fragments that have been assigned to each gene.

We use the built-in SurfR **DGE** function to perform the differential expression 
analysis of the TCGA-BRCA dataset and GSE58135.

Good differentially expressed surface proteins should be strongly 
expressed in one condition and almost not expressed in the other. The output dataframe of the DGE function contains information 
on the average expression in the two groups (see Mean_CPM_T and Mean_CPM_C columns), to help in the detection of the best candidates.


```{r meta DGE TCGA}
# TCGA DGE
df.TCGA <- DGE(expression = cTCGA.BRCA,
               metadata = mTCGA.BRCA,
               Nreplica = 3,
               design = "~shortLetterCode",
               condition = "shortLetterCode",
               alpha = 0.05,
               TEST = "TP", CTRL =  "NT",
               output_tsv = FALSE)
head(df.TCGA)
```

```{r meta DGE GEO}
# GSE58135 DGE
df.GSE58135 <- DGE(expression = cGSE58135,
                   metadata = mGSE58135,
                   Nreplica = 3,
                   design = "~condition",
                   condition = "condition",
                   alpha = 0.05,
                   TEST = "TP", CTRL =  "NT",
                   output_tsv = FALSE)
head(df.GSE58135)
```

Here we provide a function based on **metaRNASeq** bioconductor package to implement 
two p-value combination techniques (inverse normal and Fisher methods).

The meta-analysis is performed by the SurfR built-in function **metaRNAseq**, 
which requires as input: 

* a list of data.frames with the DGE results of the chosen databases to combine (`ind_deg`); 
* the statistic test to use to combine p.values, which can be the Fisher method (`fishercomb`)
or the inverse normal combination technique (`invnorm`);
* the Benjamini Hochberg threshold (`BHth`);
* if using inverse normal combination technique, a vector of the number of replicates used in each study to calculate the previous one-sided p-values (`nrep`).

The function automatically produces and saves as .pdf histograms of raw p-values for each of the individual differential analyses performed using the independent filtering from DESeq2 package. You can also examine the p-value distribution after p.value combination.


```{r meta fishercomb}

L_fishercomb <- metaRNAseq(ind_deg = list(TCGA.BRCA =  df.TCGA, GEO.GSE58135 = df.GSE58135),
                           test_statistic = "fishercomb",
                           BHth = 0.05,
                           adjpval.t = 0.05)
```

```{r meta invnorm}
L_invnorm <- metaRNAseq(ind_deg = list(TCGA.BRCA =  df.TCGA, GEO.GSE58135 = df.GSE58135),
                        test_statistic = "invnorm",
                        BHth = 0.05,
                        adjpval.t = 0.05,
                        nrep = c(102, 56))
```

Finally, we can summarize the results of the meta-analysis in a data.frame highlighting the statistical information for the common genes to all methods using the built-in SurfR function **combine_fisher_invnorm** and use the built-in SurfR **Gene2SProtein** function to identify Surface protein-coding genes (SP) among those.

Genes displaying contradictory differential expression in separate studies can be identified in the column `signFC`= 0 and removed from the list of differentially expressed genes via meta-analysis.


```{r metacombine}
metacomb <- combine_fisher_invnorm(ind_deg = list(TCGA.BRCA =  df.TCGA, 
                                                  GEO.GSE58135 = df.GSE58135),
                                   invnorm = L_invnorm,
                                   fishercomb = L_fishercomb,
                                   adjpval = 0.05)


metacomb_GeneID <- metacomb[metacomb$signFC != 0, "GeneID"]
SP <- Gene2SProtein(genes = metacomb_GeneID, input_type = "gene_name")
```

```{r metacombine up}
metacombUP_GeneID <- metacomb[metacomb$signFC == 1, "GeneID"]
SPup <- Gene2SProtein(genes = metacombUP_GeneID, input_type = "gene_name")
```

```{r metacombine dw}
metacombDW_GeneID <- metacomb[metacomb$signFC == -1, "GeneID"]
SPdw <- Gene2SProtein(genes = metacombDW_GeneID, input_type = "gene_name")

```


## Functional Enrichment
After identifying the subset of genes enriched in our specific condition of interest, 
a range of analyses becomes useful to move beyond a mere gene list.

A general enrichment analysis allows to gain further insights about upregulated or downregulated DEGs. 
To do so, we use the SurfR built-in function **Enrichment**,  based on the [enrichR](https://cran.r-project.org/web/packages/enrichR/vignettes/enrichR.html) cran package [@kuleshov_enrichr_2016]. 
You have the option to indicate the specific database you wish to utilize among the available in enrichR. 
The enrichR function `listEnrichrDbs()` allows you to navigate the options. 
Sporadically, network connectivity issues may arise with EnrichR server. 
If it happens, please, retry to run the function after a few minutes.


```{r enrich, eval=FALSE}
library(enrichR)

dfList <- list(GEO = df.GSE58135,
               TCGA = df.TCGA)

# Enrichment analysis
Enrich <- Enrichment(dfList, 
                     enrich.databases = c("GO_Biological_Process_2021", 
                                          "KEGG_2021_Human"),
                     p_adj = 0.05, logFC = 1)

head(Enrich$GEO$fdr_up$GO_Biological_Process_2021)
```

**SurfR** implements several visualization methods to help interpret enrichment results 
obtained through EnrichR using ggplot2, with the built-in function 
**Enrichment_barplot**.

It depicts gene count ratio and enrichment scores (- Log10 adjusted p values) as bar height and color. Users can specify the number of terms (most significant) to display.

```{r, fig.width=7, fig.height=2.5, eval=FALSE}
library(ggplot2)

# barplot of the top 5 upregulated pathways
Enrichment_barplot(Enrich$GEO, 
                   enrich.databases <- c("GO_Biological_Process_2021",  
                                         "KEGG_2021_Human"), 
                   p_adj = 0.05, 
                   num_term = 5, 
                   cond = "UP")

# barplot of the top 5 downregulated pathways
Enrichment_barplot(Enrich$GEO, 
                   enrich.databases <- c("GO_Biological_Process_2021", 
                                         "KEGG_2021_Human"), 
                   p_adj = 0.05, 
                   num_term = 5, 
                   cond = "DOWN")
```


Moreover, we can annotate our list of genes with cross-database identifiers and descriptions (Entrezid, Uniprot, KEGG, etc.),
taking advantage of one of the 35 gene-set libraries present in the Enrichr database, 
using the SurfR built-in function **Annotate_SPID**. 

```{r anno spid, eval=FALSE}
annotated <- Annotate_SPID(df.GSE58135, "WikiPathway_2021_Human") 
head(annotated, 10)
```

## Results visualization

### Bar plot 

Utilizing the SurfR function **Splot** you can create barplots to visualize the annotation classes reported in
the dataframe produced by the **Gene2SProtein** function.
The default grouping is the [Membranome Almen classification](https://doi.org/10.1186/1741-7007-7-50). 

```{r, fig.height= 4.5, fig.width=7}
# upregulated genes in GEO dataset
fdrUP_GeneID <- df.GSE58135[df.GSE58135$padj < 0.05 & df.GSE58135$log2FoldChange > 0, "GeneID"]
# corresponding Surface Proteins
SPup <- Gene2SProtein(genes = fdrUP_GeneID, input_type = "gene_name")
# Barplot of Almen classification 
Splot(SPup,
      group.by = "Membranome.Almen.main-class",
      main = "Almen class")
```

### Venn diagram

You can compare the list of resulting surface proteins from up to 7 different studies, 
using a venn diagram,
with the built-in SurfR function **SVenn**.


```{r, fig.height= 6, fig.width=7}

S_list <- list(SP_1 = c("EPCAM", "CD24",  "DLK1",  "CDCP1", "LYVE1"),
               SP_2 = c("DLK1", "EPCAM", "EGFR", "UPK1A", "UPK2"))

SVenn(S_list,
      cols.use = c("green", "blue"),
      opacity = 0.5,
      output_intersectionFile = FALSE)

```


### PCA

Principal Components Analysis (PCA) is a very useful diagnostic feature to gain insights about your datasets.

You can perform PCA and visualize the result with a customizable plot 
with the built-in SurfR function **plotPCA**.


```{r, fig.height= 6, fig.width=7}

SurfR::plotPCA(matrix = edgeR::cpm(cGSE58135), metadata = mGSE58135,               
               dims = c(1, 2),
               color.by = "condition", shape.by = "condition", 
               label = FALSE, main = "PCA GSE58135")
```


# SessionInfo

```{r}
sessionInfo()
```

# References