Process scRNA-Seq reads in scruff

Introduction

scruff is a toolkit for processing single cell RNA-seq FASTQ reads generated by CEL-Seq and CEL-Seq2 protocols. It does demultiplexing, alignment, Unique Molecular Identifier (UMI) filtering, and transcript counting in an automated fashion and generates the gene count matrix, QC metrics and provides visualizations of data quality. This vignette provides a brief introduction to the scruff package by walking through the demultiplexing, alignment, and UMI-counting of a built-in publicly available example dataset (van den Brink, et al. 2017).

Quick Start

# Run scruff on example dataset
# NOTE: Requires Rsubread index and TxDb objects for the reference genome.
# For generation of these files, please refer to the Stepwise Tutorial.

library(scruff)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
# Get the paths to example FASTQ, FASTA, and GTF files.
# Please note that because the following files are included in
# scruff R package, we use system.file() function to extract the paths
# to these files. If the user is running scruff on real data, the
# paths to the input FASTQ, FASTA, and GTF files should be provided,
# and there is no need to call system.file() function. For example,
# v1h1R1 <- "/PATH/TO/vandenBrink_1h1_L001_R1_001.fastq.gz"
# fasta <- "/Path/TO/GRCm38_MT.fa"
v1h1R1 <- system.file("extdata",
    "vandenBrink_1h1_L001_R1_001.fastq.gz",
    package = "scruff")
v1h1R2 <- system.file("extdata",
    "vandenBrink_1h1_L001_R2_001.fastq.gz",
    package = "scruff")
fasta <- system.file("extdata", "GRCm38_MT.fa", package = "scruff")
gtf <- system.file("extdata", "GRCm38_MT.gtf", package = "scruff")

Build Rsubread alignment index. This is for the alignment step. For test purpose, here we are aligning the example FASTQ files to the genes on mitochondrial chromosome only.

# Create index files for GRCm38_MT.
# For details, please refer to Rsubread user manual.
# Specify the basename for Rsubread index
indexBase <- "GRCm38_MT"
Rsubread::buildindex(basename = indexBase,
    reference = fasta,
    indexSplit = FALSE)

Now that everything is ready, we can run scruff. In experiment 1h1, cell barcodes 95 and 96 are empty well controls. Only cell barcodes 49 to 96 are used here to reduce computing time. These information can be set by the cellPerWell argument. We apply cell barcode and UMI correction by setting bcEdit to 1 and umiEdit to 1. scruff makes use of the SingleCellExperiment package. The following command returns a SingleCellExperiment object containing UMI filtered count matrix as well as gene and sample annotations and QC metrics.

data(barcodeExample, package = "scruff")

sce <- scruff(project = "example",
    experiment = c("1h1"),
    lane = c("L001"),
    read1Path = c(v1h1R1),
    read2Path = c(v1h1R2),
    bc = barcodeExample,
    index = indexBase,
    unique = FALSE,
    nBestLocations = 1,
    reference = gtf,
    bcStart = 1,
    bcStop = 8,
    bcEdit = 1,
    umiStart = 9,
    umiStop = 12,
    umiEdit = 1,
    keep = 75,
    cellPerWell = c(rep(1, 46), 0, 0),
    cores = 2,
    verbose = TRUE)

Visualize data quality.

data(sceExample, package = "scruff")
qc <- qcplots(sceExample)

Stepwise Tutorial For CEL-Seq Samples

Load Example Dataset

The scruff package contains 2 single cell RNA-seq FASTQ example files. Each file has 500 sequenced reads.

library(scruff)

# Get the paths to example FASTQ files.
# Please note that because the following files are included in
# scruff R package, we use system.file() function to extract the paths
# to these files. If the user is running scruff on real data, the
# paths to the input FASTQ, FASTA, and GTF files should be provided,
# and there is no need to call system.file() function. For example,
# v1h1R1 <- "/PATH/TO/vandenBrink_1h1_L001_R1_001.fastq.gz"
# fasta <- "/Path/TO/GRCm38_MT.fa"
v1h1R1 <- system.file("extdata",
    "vandenBrink_1h1_L001_R1_001.fastq.gz",
    package = "scruff")
v1h1R2 <- system.file("extdata",
    "vandenBrink_1h1_L001_R2_001.fastq.gz",
    package = "scruff")

Demultiplex and Assign Cell Specific Reads

Now the FASTQ files are ready to be demultiplexed. scruff package provides built-in predefined cell barcodes barcodeExample for demultiplexing the example dataset. Only cell barcodes 49 to 96 are included in barcodeExample to reduce computing time. In the example FASTQ files, read 1 contains cell barcode and UMI sequence information. Read 2 contains transcript sequences. The barcode sequence of each read starts at base 1 and ends at base 8. The UMI sequence starts at base 9 and ends at base 12. They can be set via bcStart, bcStop, and umiStart, umiStop arguments. Cell barcode correction can be set by bcEdit parameter. By default, reads with any nucleotide in the barcode and UMI sequences with sequencing quality lower than 10 (Phred score) will be excluded. The following command demultiplexes the example FASTQ reads and trims reads longer than 75 nucleotides. The command returns a SingleCellExperiment object whose colData contains the cell index, barcode, reads, percentage of reads assigned, sample, and FASTQ file path information for each cell. By default, the cell specific demultiplexed fastq.gz files are stored in ./Demultiplex folder.

data(barcodeExample, package = "scruff")
de <- demultiplex(project = "example",
    experiment = c("1h1"),
    lane = c("L001"),
    read1Path = c(v1h1R1),
    read2Path = c(v1h1R2),
    barcodeExample,
    bcStart = 1,
    bcStop = 8,
    bcEdit = 1,
    umiStart = 9,
    umiStop = 12,
    keep = 75,
    minQual = 10,
    yieldReads = 1e+06,
    verbose = TRUE,
    overwrite = TRUE,
    cores = 2)

Alignment

scruff provides an alignment function alignRsubread which is a wrapper function to align in Rsubread package. It aligns the reads to reference sequence index and outputs sequence alignment map files in “BAM” or “SAM” format. For demonstration purpose, the built-in mitochondrial DNA sequence from GRCm38 reference assembly GRCm38MitochondrialFasta will be used to map the reads. First, a Rsubread index for the reference sequence needs to be generated.

# Create index files for GRCm38_MT. For details, please refer to Rsubread
# user manual.
fasta <- system.file("extdata", "GRCm38_MT.fa", package = "scruff")

# Create index files for GRCm38_MT.
# For details, please refer to Rsubread user manual.
# Specify the basename for Rsubread index
indexBase <- "GRCm38_MT"
Rsubread::buildindex(basename = indexBase,
    reference = fasta,
    indexSplit = FALSE)

The following command maps the FASTQ files to GRCm38 mitochondrial reference sequence GRCm38_MT.fa and returns a SingleCellExperiment object. By default, the files are stored in BAM format in ./Alignment folder.

# Align the reads using Rsubread
al <- alignRsubread(de,
    indexBase,
    unique = FALSE,
    nBestLocations = 1,
    format = "BAM",
    overwrite = TRUE,
    verbose = TRUE,
    cores = 2)

UMI correction and Generation of Count Matrix

Example GTF file GRCm38_MT.gtf will be used for feature counting. Currently, scruff applies the union counting mode of the HTSeq Python package. The following command generates the UMI corrected count matrix for the example dataset by allowing correction of UMIs with Hamming distances smaller than 1 for each gene in each cell.

gtf <- system.file("extdata", "GRCm38_MT.gtf", package = "scruff")
# get the molecular counts of trancsripts for each cell
# In experiment 1h1, cell barcodes 95 and 96 are empty well controls.
# In experiment b1, cell barcode 95 is bulk sample containing 300 cells.
sce <- countUMI(al,
    gtf,
    umiEdit = 1,
    format = "BAM",
    cellPerWell = c(rep(1, 46), 0, 0),
    verbose = TRUE,
    cores = 2)

Visualization of QC metrics

The data quality diagnostic information are contained in the colData of the returned SingleCellExperiment object sce. They can be visualized using the qcplots function.

data(sceExample, package = "scruff")
qc <- qcplots(sceExample)
qc

Each of these boxplots shows the distribution of a quality metric within and across experiments. Y axis shows the number or fraction of these metrics. Each point represents a unique cell barcode associated with the cells or bulk samples in the wells of the plate. These points are colored by the cellPerWell parameter which is the number of sorted cells in each well. For example, in experiment 1h1, cell barcodes 95 and 96 are empty well controls. In experiment b1, cell barcode 95 is a bulk sample containing 300 cells.

These quality metrics include total reads, aligned reads, reads mapped to genes, fraciont of aligned reads, fraction of gene reads out of aligned reads, fraction of gene reads out of total reads, total transcripts, mitochondrial transcripts, fraction of mitochondrial transcripts, transcribed genes, fraction of protein coding genes and transcripts, median and average number of reads per corrected or uncorrectd UMI, and gene detection rate.

Visualization of Read mapping locations

scruff package provides function to visualize read alignment locations at specified genomic coordinates on the reference genome. The following command visualize the reads mapped to gene mt-Rnr2 for the bulk sample vandenBrink_b1_cell_0095. Reads are colored by their uncorrected UMI tags. bamExample in the following example is a GAlignments object generated by the readGAlignments function in the GenomicAlignments package. Please refer to the documentation of readGAlignments and GenomicAlignments for details about how to read BAM files into R.

# Visualize the reads mapped to gene "mt-Rnr2" in
# cell "vandenBrink_b1_cell_0095".
# bamExample is generated by GenomicAlignments::GAlignments function
data(bamExample, package = "scruff")
# gene mt-Rnr2 starts at 1094 and ends at 2675
start <- 1094
end <- 2675

g1 <- rview(bamExample, chr = "MT", start = start, end = end)
g2 <- gview(gtf, chr = "MT", start = start, end = end)
g <- ggbio::tracks(g1, g2, heights = c(4, 1), xlab = "chr MT")
g

This plot shows the read alignment information for gene mt-RNA2. The top panel shows the alignment and orientation of each read aligned to this gene. Y axis indicates the number of aligned reads at specific locations. Each arrow represents a read. The length of the arrow indicates the length of the read. The reads are colored by their UMI tags. In the bottom panel, each arrow represents an isoform of the gene. The isoforms are named by their transcript names. The grey boxes indicate the exonic regions of the transcript. X axis is the genomic coordinates on the corresponding chromosome (e.g. chr MT). It is consistent between the top and bottom panels if legend is set to FALSE for rview function.

10X BAM File Quality Assessment

The function tenxBamqc collects QC metrics from BAM files generated by 10X Cell Ranger pipeline. The collected QC metrics can be visualized by the function qcplots. The collected measurements include number of aligned reads and number of reads aligned to an gene for each valid cell barcode. Users can see the alignment quality for the filtered cells after plugging in the filtered barcode file from the Cell Ranger results. Here we show the visualization of an example BAM file from Cell Ranger output.

# The following example BAM file is the first 5000 BAM file records extracted
# from sample 01 of the 1.3 Million Brain Cells dataset from E18 Mice.
# (https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.3.0/
# 1M_neurons)
# The BAM file for sample 01 is downloaded from here:
# http://sra-download.ncbi.nlm.nih.gov/srapub_files/
# SRR5167880_E18_20160930_Neurons_Sample_01.bam
# see details here:
# https://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP096558
# and here:
# https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE93421
bamfile10x <- system.file("extdata",
    "SRR5167880_E18_20160930_Neurons_Sample_01_5000.bam",
    package = "scruff")

# The filtered cell barcodes are generated using the following script:
# library(TENxBrainData)
# library(data.table)
# tenx <- TENxBrainData()
# # get filtered barcodes for sample 01
# filteredBcIndex <- tstrsplit(colData(tenx)[, "Barcode"], "-")[[2]] == 1
# filteredBc <- colData(tenx)[filteredBcIndex, ][["Barcode"]]
filteredBc <- system.file("extdata",
    "SRR5167880_E18_20160930_Neurons_Sample_01_filtered_barcode.tsv",
    package = "scruff")
# QC results are saved to current working directory
sce <- tenxBamqc(bam = bamfile10x,
    experiment = "Neurons_Sample_01",
    filter = filteredBc)
## tenxBamqc(bam = bamfile10x, experiment = "Neurons_Sample_01", 
##     filter = filteredBc)
sce
## class: SingleCellExperiment 
## dim: 1 1576 
## metadata(0):
## assays(1): ''
## rownames: NULL
## rowData names(0):
## colnames(1576): AAACCTGAGGCCCGTT-Neurons_Sample_01
##   AAACCTGAGTCCGGTC-Neurons_Sample_01 ...
##   TTTGTCATCGGTGTTA-Neurons_Sample_01 TTTGTCATCTTGTATC-Neurons_Sample_01
## colData names(5): cell_barcode reads_mapped_to_genome
##   reads_mapped_to_genes experiment number_of_cells
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
g <- qcplots(sce)
g

Session Information

## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] SingleCellExperiment_1.29.1 SummarizedExperiment_1.37.0
##  [3] Biobase_2.67.0              GenomicRanges_1.59.1       
##  [5] GenomeInfoDb_1.43.1         IRanges_2.41.1             
##  [7] S4Vectors_0.45.2            BiocGenerics_0.53.3        
##  [9] generics_0.1.3              MatrixGenerics_1.19.0      
## [11] matrixStats_1.4.1           scruff_1.25.0              
## [13] BiocStyle_2.35.0           
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3       sys_3.4.3                rstudioapi_0.17.1       
##   [4] jsonlite_1.8.9           magrittr_2.0.3           GenomicFeatures_1.59.1  
##   [7] farver_2.1.2             rmarkdown_2.29           BiocIO_1.17.0           
##  [10] zlibbioc_1.52.0          vctrs_0.6.5              memoise_2.0.1           
##  [13] Rsamtools_2.23.0         RCurl_1.98-1.16          base64enc_0.1-3         
##  [16] progress_1.2.3           htmltools_0.5.8.1        S4Arrays_1.7.1          
##  [19] curl_6.0.1               SparseArray_1.7.2        Formula_1.2-5           
##  [22] parallelly_1.39.0        sass_0.4.9               bslib_0.8.0             
##  [25] htmlwidgets_1.6.4        httr2_1.0.6              plyr_1.8.9              
##  [28] cachem_1.1.0             buildtools_1.0.0         GenomicAlignments_1.43.0
##  [31] lifecycle_1.0.4          pkgconfig_2.0.3          Matrix_1.7-1            
##  [34] R6_2.5.1                 fastmap_1.2.0            GenomeInfoDbData_1.2.13 
##  [37] digest_0.6.37            colorspace_2.1-1         GGally_2.2.1            
##  [40] ShortRead_1.65.0         OrganismDbi_1.49.0       AnnotationDbi_1.69.0    
##  [43] Hmisc_5.2-0              RSQLite_2.3.8            hwriter_1.3.2.1         
##  [46] labeling_0.4.3           filelock_1.0.3           fansi_1.0.6             
##  [49] httr_1.4.7               abind_1.4-8              compiler_4.4.2          
##  [52] withr_3.0.2              bit64_4.5.2              htmlTable_2.4.3         
##  [55] backports_1.5.0          BiocParallel_1.41.0      DBI_1.2.3               
##  [58] ggstats_0.7.0            biomaRt_2.63.0           rappdirs_0.3.3          
##  [61] DelayedArray_0.33.2      rjson_0.2.23             tools_4.4.2             
##  [64] foreign_0.8-87           nnet_7.3-19              glue_1.8.0              
##  [67] restfulr_0.0.15          grid_4.4.2               checkmate_2.3.2         
##  [70] cluster_2.1.6            reshape2_1.4.4           gtable_0.3.6            
##  [73] BSgenome_1.75.0          tidyr_1.3.1              ensembldb_2.31.0        
##  [76] hms_1.1.3                data.table_1.16.2        xml2_1.3.6              
##  [79] utf8_1.2.4               XVector_0.47.0           pillar_1.9.0            
##  [82] stringr_1.5.1            dplyr_1.1.4              BiocFileCache_2.15.0    
##  [85] lattice_0.22-6           rtracklayer_1.67.0       bit_4.5.0               
##  [88] deldir_2.0-4             biovizBase_1.55.0        RBGL_1.83.0             
##  [91] tidyselect_1.2.1         maketools_1.3.1          Biostrings_2.75.1       
##  [94] knitr_1.49               gridExtra_2.3            ProtGenerics_1.39.0     
##  [97] ggbio_1.55.0             xfun_0.49                stringi_1.8.4           
## [100] UCSC.utils_1.3.0         lazyeval_0.2.2           yaml_2.3.10             
## [103] evaluate_1.0.1           codetools_0.2-20         interp_1.1-6            
## [106] tibble_3.2.1             graph_1.85.0             BiocManager_1.30.25     
## [109] cli_3.6.3                rpart_4.1.23             Rsubread_2.21.1         
## [112] munsell_0.5.1            jquerylib_0.1.4          dichromat_2.0-0.1       
## [115] Rcpp_1.0.13-1            dbplyr_2.5.0             png_0.1-8               
## [118] XML_3.99-0.17            parallel_4.4.2           ggplot2_3.5.1           
## [121] blob_1.2.4               prettyunits_1.2.0        latticeExtra_0.6-30     
## [124] jpeg_0.1-10              AnnotationFilter_1.31.0  bitops_1.0-9            
## [127] txdbmaker_1.3.0          pwalign_1.3.0            ggthemes_5.1.0          
## [130] VariantAnnotation_1.53.0 scales_1.3.0             purrr_1.0.2             
## [133] crayon_1.5.3             rlang_1.1.4              KEGGREST_1.47.0