![Workflow for scIGD](scIGD_SCAE_workflow_final.png)
**Figure 1:** Overview of the scIGD workflow for unraveling immunogenomic diversity in single-cell data, highlighting the integration of the SingleCellAlleleExperiment package for comprehensive data analysis.
# Introduction to the `SingleCellAlleleExperiment (SCAE)` class
The `SingleCellAlleleExperiment (SCAE)` class serves as a comprehensive multi-layer data structure, enabling the representation of immune genes at specific levels, including alleles, genes, and groups of functionally similar genes and thus, allows data analysis across these immunologically relevant, different layers of annotation. The implemented data object is derived from the `r BiocStyle::Biocpkg(pkg="SingleCellExperiment")` class and follows similar conventions, where rows should represent features (genes, transcripts) and columns should represent cells.
![A schematics of the SingleCellAlleleExperiment class](scae_advanced.png)
**Figure 2:** Scheme of SingleCellAlleleExperiment object structure with lookup table.
For the integration of the relevant additional data layers (see **Figure 2**), the quantification data for alleles, generated by the novel `r BiocStyle::Githubpkg(repo="AGImkeller/scIGD", pkg="scIGD")` software package, is aggregated into two additional data layers via an ontology-based design principle using a lookup table during object generation.
For example, the counts of the alleles `A*01:01:01:01` and `A*02:01:01:01` that are present in the raw input data will be combined into the `HLA-A` immune gene layer (see **Table 1** below). Next, all counts of immune genes corresponding to `HLA-class I` are combined into the `HLA-class I` functional class layer. See the structure of the used lookup table below.
**Table 1:** Scheme of the lookup table used to aggregate allele information into multiple data layers.
| Allele | Gene | Function |
| :----------- | :--------- | :---------- |
| A*01:01:01 | HLA-A | HLA class I |
| A*01:01:01 | HLA-A | HLA class I |
| ... | ... | ... |
| DRB1*01:01:01| HLA-DRB1 | HLA class II|
The resulting `SCAE` data object can be used in combination with established single cell analysis packages like `r BiocStyle::Biocpkg(pkg="scater")` and `r BiocStyle::Biocpkg(pkg="scran")` to perform downstream analysis on immune gene expression, allowing data exploration on functional and allele level. See the vignette for further information and insights on how to perform downstream analysis using exemplary data from the accompanying `R/Experimenthub` package `r BiocStyle::Githubpkg(repo="AGImkeller/scaeData", pkg="scaeData")`.
## Expected input and dataset description
The read in function of the SCAE package `read_allele_counts()` expects specific files that are generated by the previous steps of the workflow, performed in the `r BiocStyle::Githubpkg(repo="AGImkeller/scIGD", pkg="scIGD")` package. One input parameter needs to state the path to a directory containing all the input files. The file identifiers can be
specifically stated as parameters. The default file identifiers, as they are outputted from **scIGD** are listed below:
The stated input directory should contain the following files:
- **cells_x_genes.barcodes.txt** (list of barcodes/cell identifiers)
- **cells_x_genes.features.txt** (list of feature identifiers)
- **cells_x_genes_mtx.mtx** (Contains the quantification matrix)
- **lookup_table.csv** (Info for generating multiple data layers)
The dataset used for the here shown downstream analysis is taken from the accompanying data package `r BiocStyle::Githubpkg(repo="AGImkeller/scaeData", pkg="scaeData")`. Specifically we are using the `pbmc_5k` dataset. Find more information about the data, including the source of the raw data here: `r BiocStyle::Githubpkg(repo="AGImkeller/scaeData", pkg="scaeData")`.
# Exemplary downstream analysis
## Loading suggested packages
The following packages are abundant for performing the here stated downstream analysis and visualization. Make sure they are installed to use this vignette.
```{r, message = FALSE}
library(SingleCellAlleleExperiment)
library(scaeData)
library(scater)
library(scran)
library(scuttle)
library(DropletUtils)
library(ggplot2)
library(patchwork)
library(org.Hs.eg.db)
library(AnnotationDbi)
```
## Generation of a SingleCellAlleleExperiment object
Download and save the **"pbmc_10k"** dataset from the accompanying `ExperimentHub/scaeData` package.
```{r}
example_data_10k <- scaeData::scaeDataGet(dataset="pbmc_10k")
```
```{r}
example_data_10k
```
The return value of the `scaeDataGet()` function is a list containing four elements. The `$dir` slot contains the directory where the files downloaded from `ExperimentHub` are saved on your device. The remaining three slots `$barcodes`, `$features`, `$matrix` contain the corresponding file names (named by ExperimentHub). The list is then used to pass the saved information to the related parameters of the `read_allele_counts()` function of the here presented package. See next section.
For the usage of the corresponding lookup table, specify a directory containing the lookup table as seen in the next chunk. For the available example datasets in `scaeData`, you can choose one of the following names. This provides the lookup table to the corresponding dataset:
- "pbmc_5k_lookup_table.csv" is used for the `pbmc_5k" dataset.
- "pbmc_10k_lookup_table.csv" is used for the `pbmc_10k" dataset.
- "pbmc_20k_lookup_table.csv" is used for the `pbmc_20k" dataset.
These are part of the `scaeData` package, but are not fetched from `ExperimentHub` but rather read in as internal files.
```{r}
lookup_name <- "pbmc_10k_lookup_table.csv"
lookup <- utils::read.csv(system.file("extdata", lookup_name, package="scaeData"))
```
```{r}
head(lookup)
```
After the directories of the data files and the lookup table are known and saved, proceed to generate a `SingleCellAlleleExperiment` object.
This is an essential step. The read in function `read_allele_counts()` is used to read in data and generate a `SCAE` object. The `SingleCellAlleleExperiment` Constructor is exported as well. However, as the raw data is processed during the read in, we recommend to use `read_allele_counts()`. Find information about the read in using the Constructor in its function documentation `?SingleCellAlleleExperiment`.
### Description of Read in parameters
If you used the `r BiocStyle::Githubpkg(repo="AGImkeller/scIGD", pkg="scIGD")` package to generate your input files **(RECOMMENDED AND EXPECTED)**, then state the path containing all expected files to the `sample_dir` parameter in the `read_allele_counts()` function and the corresponding lookup table. In case you renamed the files, specify the new file identifiers in the corresponding parameters `lookup_file`, `barcode_file`, `gene_file`, `matrix_file`, otherwise leave it to the stated default values.
The `read_allele_counts()` function also provides multiple parameters to perform filtering steps on the cells contained in your data. For this, **multiple parameter combinations are possible and showcased below.** Advanced filtering can be performed based on a [**knee plot**](https://liorpachter.wordpress.com/tag/knee-plot/).
The first parameter, where you state which **filter mode** you want to use, gives the following valid options: `filter_mode=c("yes", "no", "custom")`.
- `filter_mode = "no"`: Default filter mode. Filtering based on **threshold=0**, filtering out columns (cells) with a count sum of 0 over all features.
- `filter_mode = "yes"` : Advanced filtering is performed on the computed **inflection point of a knee plot** based on barcode ranks. This is an **optional** functionality.
- `filter_mode = "custom"` : Custom filter mode. Filtering performed on a threshold stated in the `filter_threshold` parameter. (Useful for example after you looked at the knee plot after using filter_mode="yes" and decide to use a different filter threshold)
Additionally, the `verbose` parameter gives you an option to toggle runtime-information for the different steps during object generation. Messages regarding the filtering process (used thresholds) will not be toggled off if `verbose = FALSE`.
**Exemplary read in using all described filter modes generating a SCAE object are shown in the following code-chunks:**
#### `filter_mode="no"`
Default filter mode. Filtering based on **threshold=0**, filtering out columns (cells) with a count sum of 0 over all features.
```{r eval=FALSE, warning=FALSE}
scae <- read_allele_counts(example_data_10k$dir,
sample_names="example_data",
filter_mode="no",
lookup_file=lookup,
barcode_file=example_data_10k$barcodes,
gene_file=example_data_10k$features,
matrix_file=example_data_10k$matrix,
filter_threshold=NULL,
verbose=FALSE)
```
#### `filter_mode="yes"`
Advanced filtering is performed on the computed **inflection point of a knee plot** based on barcode ranks. This is an **optional** functionality. To be able to use this filter mode, the `r BiocStyle::Biocpkg(pkg="DropletUtils") ` package needs to be installed on your machine.
```{r warning=FALSE}
scae <- read_allele_counts(example_data_10k$dir,
sample_names="example_data",
filter_mode="yes",
lookup_file=lookup,
barcode_file=example_data_10k$barcodes,
gene_file=example_data_10k$features,
matrix_file=example_data_10k$matrix,
log=TRUE)
scae
```
If this filter mode is chosen, the information to the corresponding plot including the inflection point used for filtering is stored in `metadata(scae)[["knee_info"]]`:
```{r}
metadata(scae)[["knee_info"]]
```
This can be used to plot the corresponding knee plot and potentially infer a new threshold for filtering (then used with **filter_mode="custom"**). Note that only using `filter_mode="yes"` computes the information for the knee plot. See corresponding code chunks for further information about the other filter modes.
```{r, warning=FALSE, fig.height=5, fig.width=5}
knee_df <- metadata(scae)[["knee_info"]]$knee_df
knee_point <- metadata(scae)[["knee_info"]]$knee_point
inflection_point <- metadata(scae)[["knee_info"]]$inflection_point
ggplot(knee_df, aes(x=rank, y=total)) +
geom_point() +
geom_line(aes(y=fitted), color="red") +
scale_x_log10() +
scale_y_log10() +
annotation_logticks() +
labs(x="Barcode rank", y="Total UMI count") +
geom_hline(yintercept=S4Vectors::metadata(knee_df)$knee,
color="dodgerblue", linetype="dashed") +
geom_hline(yintercept=S4Vectors::metadata(knee_df)$inflection,
color="forestgreen", linetype="dashed") +
annotate("text", x=2, y=S4Vectors::metadata(knee_df)$knee * 1.2,
label="knee", color="dodgerblue") +
annotate("text", x=2.25, y=S4Vectors::metadata(knee_df)$inflection * 1.2,
label="inflection", color="forestgreen") + theme_bw()
```
#### `filter_mode="custom"`
Custom filter mode. Filtering performed on a threshold stated in the `filter_threshold` parameter.
```{r warning=FALSE, eval=FALSE}
#this is the object used in the further workflow
scae <- read_allele_counts(example_data_10k$dir,
sample_names="example_data",
filter_mode="custom",
lookup_file=lookup,
barcode_file=example_data_10k$barcodes,
gene_file=example_data_10k$features,
matrix_file=example_data_10k$matrix,
filter_threshold=282)
scae
```
### Description of optional functionalities
Three optional functionalities can be toggled on or off during read in:
- Optional filtering performed on the inflection point of a computed knee plot based on barcode ranks. Use `filter_mode="yes"` for this optional filter mode. Find more information about the different filter modes in the prior section.
- Computation of a `logcounts` assay using Library Factors. Use `log=TRUE/FALSE` to toggle on or off accordingly. **Turned on by default**. If you want to compute the logcounts using a different method, be aware that the count matrix is extended (adding additional rows) during the object generation, **its abundant to compute scaling factors only on data layers present in the raw data (non-immune and alleles)**. Otherwise the scaling factors would not be correct. Find information about subsetting the object in sections further below.
- Computation of additional gene symbols in case the raw data only contains Ensembl gene identifiers. Use `gene_symbols=TRUE/FALSE` to toggle on or off accordingly. **Turned off by default**. Uses the org.HS package.
## Showcasing content of object slots
### RowData
Two new classification columns are introduced in the `rowData` slot. Namely the `NI_I` column (classification of each row as `NI = non_immune` or `I = immune`) and `Quant_type` column (classification of each row to which data layer it is corresponding to). Both columns are used jointly to identify each row of the object to its corresponding data layer (see **figure 1**).
```{r}
rowData(scae)
```
### ColData
Contains sample and barcode information. If the `logcounts` assay is computed, find another column containing the `sizeFactors` here.
```{r}
head(colData(scae))
```
### Metadata
Only contains information if `filter_mode="yes"` is used.
```{r}
metadata(scae)[["knee_info"]]
```
### Utilize layer specific `getter-functions()`
On top of the established `getters` from the SCE package, new getters are implemented to retrieve the different data layers integrated in the SCAE object.
#### Non-immune genes
```{r}
scae_nonimmune_subset <- scae_subset(scae, "nonimmune")
head(rownames(scae_nonimmune_subset))
```
#### Alleles
```{r}
scae_alleles_subset <- scae_subset(scae, "alleles")
head(rownames(scae_alleles_subset))
```
#### Immune genes
```{r}
scae_immune_genes_subset <- scae_subset(scae, "immune_genes")
head(rownames(scae_immune_genes_subset))
```
#### Functional gene groups
```{r}
scae_functional_groups_subset <- scae_subset(scae, "functional_groups")
head(rownames(scae_functional_groups_subset))
```
## Expression evaluation
Checking the expression for the allele-layer, immune gene layer and functional class layer. Allele identifiers are in the form of `A*02:01:01:01`. The immune genes are in the form of `HLA-A` and the functional classes `HLA_class_I`. Here we see that `HLA_class_I` and `HLA-C` are the most abundant functional class and immune gene respectively, given the underlying dataset.
```{r}
scae_immune_layers_subset <- c(rownames(scae_subset(scae, "alleles")),
rownames(scae_subset(scae, "immune_genes")),
rownames(scae_subset(scae, "functional_groups")))
scater::plotExpression(scae, scae_immune_layers_subset)
```
# Downstream analysis on immune genes
In the following sections, main steps for dimensional reduction are performed, offering insights into the different data layers of the SCAE object as well as giving an idea on how to perform immune gene expression analysis.
## Subsetting the different layers
The non-immune genes are combined with each of the integrated immune gene allele-aware layers to determine three different subsets.
### Non-immune genes + alleles
```{r}
rn_scae_ni <- rownames(scae_subset(scae, "nonimmune"))
rn_scae_alleles <- rownames(scae_subset(scae, "alleles"))
scae_nonimmune__allels_subset <- scae[c(rn_scae_ni, rn_scae_alleles), ]
scae_nonimmune__allels_subset
```
### Non-immune genes + immune genes
```{r}
rn_scae_i <- rownames(scae_subset(scae, "immune_genes"))
scae_nonimmune__immune <- scae[c(rn_scae_ni, rn_scae_i), ]
scae_nonimmune__immune
```
### Non-immune genes + functional class
```{r}
rn_scae_functional <- rownames(scae_subset(scae, "functional_groups"))
scae_nonimmune__functional <- scae[c(rn_scae_ni, rn_scae_functional), ]
scae_nonimmune__functional
```
## Dimensional Reduction
### Model variance and HVGs for all data layers
Using the `modelGeneVar()` function prior to `getTopHVGs`. Both functions are part from the `Biocpkg("scran")` package. Compute a list of HVGs for each data layer. Return the top 0.1 % HVGs per layer using `getTopHVGs`.
```{r}
df_ni_a <- modelGeneVar(scae_nonimmune__allels_subset)
top_ni_a <- getTopHVGs(df_ni_a, prop=0.1)
```
```{r}
df_ni_g <- modelGeneVar(scae_nonimmune__immune)
top_ni_g <- getTopHVGs(df_ni_g, prop=0.1)
```
```{r}
df_ni_f <- modelGeneVar(scae_nonimmune__functional)
top_ni_f <- getTopHVGs(df_ni_f, prop=0.1)
```
### PCA
Compute PCA for each layer and store the results in the object. Its Important to make unique identifiers for each layer/run or the results will be overwritten and just saved as `PCA`. Here, the `runPCA` functions from the `Biocpkg("scater")` package is used.
```{r}
assay <- "logcounts"
scae <- runPCA(scae, ncomponents=10, subset_row=top_ni_a,
exprs_values=assay, name="PCA_a")
scae <- runPCA(scae, ncomponents=10, subset_row=top_ni_g,
exprs_values=assay, name="PCA_g")
scae <- runPCA(scae, ncomponents=10, subset_row=top_ni_f,
exprs_values=assay, name="PCA_f")
```
```{r}
reducedDimNames(scae)
```
### t-SNE
The same goes for running t-SNE with the `runTSNE` function from the `Biocpkg("scater")` package. Unique identifiers are stated here for each layer as well. For simplicity, we only compute the t-SNE on the `gene layer`. Information regarding how this process is conducted on the other additional layers, can be seen in the non evaluated code chunks below:
```{r}
set.seed(18)
scae <- runTSNE(scae, dimred="PCA_g", name="TSNE_g")
```
The following two chunks show how the t-SNE could be computed on the `allele` and `functional_group` layer. These chunks are not run by default:
```{r, eval=FALSE}
set.seed(18)
scae <- runTSNE(scae, dimred="PCA_a", name="TSNE_a")
```
```{r, eval=FALSE}
set.seed(18)
scae <- runTSNE(scae, dimred="PCA_f", name="TSNE_f")
```
List of results from the performed reduced dimension analysis.
```{r}
reducedDimNames(scae)
```
## Visualization of results
Exemplary visualization for the t-SNE results on gene level for immune genes that relate to HLA-class I. In the given dataset, it is the immune genes `HLA-DRB1`, plotted alongside their alleles. This allows for insights into potential genetic differences shown on allele-level.
### HLA-A immune gene and alleles
```{r fig3, fig.height = 4, fig.width = 12, fig.align = "center", warning = FALSE, message=FALSE}
tsne <- "TSNE_g"
tsne_g_a <- plotReducedDim(scae, dimred=tsne, colour_by="HLA-DRB1") +
ggtitle("HLA-DRB1 gene")
tsne_g_a1 <- plotReducedDim(scae, dimred=tsne, colour_by="DRB1*07:01:01:01") +
ggtitle("Allele DRB1*07:01:01:01")
tsne_g_a2 <- plotReducedDim(scae, dimred=tsne, colour_by="DRB1*13:01:01:01") +
ggtitle("Allele DRB1*13:01:01:01")
p2 <- tsne_g_a + tsne_g_a1 + tsne_g_a2
p2
```
# Additional
As the SCAE object is extending the SCE object, it is also compatible with the `Biocpkg("iSEE")` package for interactive data exploration.
# Session Information
```{r}
sessionInfo()
```