scruff is a toolkit for processing single cell RNA-seq FASTQ reads generated by CEL-Seq and CEL-Seq2 protocols. It does demultiplexing, alignment, Unique Molecular Identifier (UMI) filtering, and transcript counting in an automated fashion and generates the gene count matrix, QC metrics and provides visualizations of data quality. This vignette provides a brief introduction to the scruff package by walking through the demultiplexing, alignment, and UMI-counting of a built-in publicly available example dataset (van den Brink, et al. 2017).
# Run scruff on example dataset
# NOTE: Requires Rsubread index and TxDb objects for the reference genome.
# For generation of these files, please refer to the Stepwise Tutorial.
library(scruff)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
# Get the paths to example FASTQ, FASTA, and GTF files.
# Please note that because the following files are included in
# scruff R package, we use system.file() function to extract the paths
# to these files. If the user is running scruff on real data, the
# paths to the input FASTQ, FASTA, and GTF files should be provided,
# and there is no need to call system.file() function. For example,
# v1h1R1 <- "/PATH/TO/vandenBrink_1h1_L001_R1_001.fastq.gz"
# fasta <- "/Path/TO/GRCm38_MT.fa"
v1h1R1 <- system.file("extdata",
"vandenBrink_1h1_L001_R1_001.fastq.gz",
package = "scruff")
v1h1R2 <- system.file("extdata",
"vandenBrink_1h1_L001_R2_001.fastq.gz",
package = "scruff")
fasta <- system.file("extdata", "GRCm38_MT.fa", package = "scruff")
gtf <- system.file("extdata", "GRCm38_MT.gtf", package = "scruff")
Build Rsubread alignment index. This is for the alignment step. For test purpose, here we are aligning the example FASTQ files to the genes on mitochondrial chromosome only.
# Create index files for GRCm38_MT.
# For details, please refer to Rsubread user manual.
# Specify the basename for Rsubread index
indexBase <- "GRCm38_MT"
Rsubread::buildindex(basename = indexBase,
reference = fasta,
indexSplit = FALSE)
Now that everything is ready, we can run scruff. In
experiment 1h1, cell barcodes 95 and 96 are empty well controls.
Only cell barcodes 49 to 96 are used here to reduce computing
time. These information can be set by the
cellPerWell
argument. We apply cell barcode and UMI
correction by setting bcEdit
to 1 and umiEdit
to 1. scruff makes use of the SingleCellExperiment
package. The following command returns a
SingleCellExperiment
object containing UMI filtered count
matrix as well as gene and sample annotations and QC metrics.
data(barcodeExample, package = "scruff")
sce <- scruff(project = "example",
experiment = c("1h1"),
lane = c("L001"),
read1Path = c(v1h1R1),
read2Path = c(v1h1R2),
bc = barcodeExample,
index = indexBase,
unique = FALSE,
nBestLocations = 1,
reference = gtf,
bcStart = 1,
bcStop = 8,
bcEdit = 1,
umiStart = 9,
umiStop = 12,
umiEdit = 1,
keep = 75,
cellPerWell = c(rep(1, 46), 0, 0),
cores = 2,
verbose = TRUE)
Visualize data quality.
The scruff package contains 2 single cell RNA-seq FASTQ example files. Each file has 500 sequenced reads.
library(scruff)
# Get the paths to example FASTQ files.
# Please note that because the following files are included in
# scruff R package, we use system.file() function to extract the paths
# to these files. If the user is running scruff on real data, the
# paths to the input FASTQ, FASTA, and GTF files should be provided,
# and there is no need to call system.file() function. For example,
# v1h1R1 <- "/PATH/TO/vandenBrink_1h1_L001_R1_001.fastq.gz"
# fasta <- "/Path/TO/GRCm38_MT.fa"
v1h1R1 <- system.file("extdata",
"vandenBrink_1h1_L001_R1_001.fastq.gz",
package = "scruff")
v1h1R2 <- system.file("extdata",
"vandenBrink_1h1_L001_R2_001.fastq.gz",
package = "scruff")
Now the FASTQ files are ready to be demultiplexed. scruff
package provides built-in predefined cell barcodes
barcodeExample
for demultiplexing the example dataset.
Only cell barcodes 49 to 96 are included in
barcodeExample
to reduce computing time. In the
example FASTQ files, read 1 contains cell barcode and UMI sequence
information. Read 2 contains transcript sequences. The barcode sequence
of each read starts at base 1 and ends at base 8. The UMI sequence
starts at base 9 and ends at base 12. They can be set via
bcStart
, bcStop
, and umiStart
,
umiStop
arguments. Cell barcode correction can be set by
bcEdit
parameter. By default, reads with any nucleotide in
the barcode and UMI sequences with sequencing quality lower than 10
(Phred score) will be excluded. The following command demultiplexes the
example FASTQ reads and trims reads longer than 75 nucleotides. The
command returns a SingleCellExperiment
object whose
colData
contains the cell index, barcode, reads, percentage
of reads assigned, sample, and FASTQ file path information for each
cell. By default, the cell specific demultiplexed fastq.gz files are
stored in ./Demultiplex
folder.
data(barcodeExample, package = "scruff")
de <- demultiplex(project = "example",
experiment = c("1h1"),
lane = c("L001"),
read1Path = c(v1h1R1),
read2Path = c(v1h1R2),
barcodeExample,
bcStart = 1,
bcStop = 8,
bcEdit = 1,
umiStart = 9,
umiStop = 12,
keep = 75,
minQual = 10,
yieldReads = 1e+06,
verbose = TRUE,
overwrite = TRUE,
cores = 2)
scruff provides an alignment function
alignRsubread
which is a wrapper function to
align
in Rsubread
package. It aligns the reads
to reference sequence index and outputs sequence alignment map files in
“BAM” or “SAM” format. For demonstration purpose, the built-in
mitochondrial DNA sequence from GRCm38 reference assembly
GRCm38MitochondrialFasta
will be used to map the reads.
First, a Rsubread
index for the reference sequence needs to be generated.
# Create index files for GRCm38_MT. For details, please refer to Rsubread
# user manual.
fasta <- system.file("extdata", "GRCm38_MT.fa", package = "scruff")
# Create index files for GRCm38_MT.
# For details, please refer to Rsubread user manual.
# Specify the basename for Rsubread index
indexBase <- "GRCm38_MT"
Rsubread::buildindex(basename = indexBase,
reference = fasta,
indexSplit = FALSE)
The following command maps the FASTQ files to GRCm38 mitochondrial
reference sequence GRCm38_MT.fa
and returns a
SingleCellExperiment
object. By default, the files are
stored in BAM format in ./Alignment
folder.
Example GTF file GRCm38_MT.gtf
will be used for feature
counting. Currently, scruff applies the union counting mode of
the HTSeq Python package. The following command generates the UMI
corrected count matrix for the example dataset by allowing correction of
UMIs with Hamming distances smaller than 1 for each gene in each
cell.
gtf <- system.file("extdata", "GRCm38_MT.gtf", package = "scruff")
# get the molecular counts of trancsripts for each cell
# In experiment 1h1, cell barcodes 95 and 96 are empty well controls.
# In experiment b1, cell barcode 95 is bulk sample containing 300 cells.
sce <- countUMI(al,
gtf,
umiEdit = 1,
format = "BAM",
cellPerWell = c(rep(1, 46), 0, 0),
verbose = TRUE,
cores = 2)
The data quality diagnostic information are contained in the
colData
of the returned SingleCellExperiment
object sce
. They can be visualized using the
qcplots
function.
Each of these boxplots shows the distribution of a quality metric
within and across experiments. Y axis shows the number or fraction of
these metrics. Each point represents a unique cell barcode associated
with the cells or bulk samples in the wells of the plate. These points
are colored by the cellPerWell
parameter which is the
number of sorted cells in each well. For example, in experiment 1h1,
cell barcodes 95 and 96 are empty well controls. In experiment b1, cell
barcode 95 is a bulk sample containing 300 cells.
These quality metrics include total reads, aligned reads, reads mapped to genes, fraciont of aligned reads, fraction of gene reads out of aligned reads, fraction of gene reads out of total reads, total transcripts, mitochondrial transcripts, fraction of mitochondrial transcripts, transcribed genes, fraction of protein coding genes and transcripts, median and average number of reads per corrected or uncorrectd UMI, and gene detection rate.
scruff package provides function to visualize read alignment
locations at specified genomic coordinates on the reference genome. The
following command visualize the reads mapped to gene mt-Rnr2 for the
bulk sample vandenBrink_b1_cell_0095
. Reads are colored by
their uncorrected UMI tags. bamExample
in the following
example is a GAlignments
object generated by the
readGAlignments
function in the GenomicAlignments
package. Please refer to the documentation of
readGAlignments
and GenomicAlignments
for details about how to read BAM files into R.
# Visualize the reads mapped to gene "mt-Rnr2" in
# cell "vandenBrink_b1_cell_0095".
# bamExample is generated by GenomicAlignments::GAlignments function
data(bamExample, package = "scruff")
# gene mt-Rnr2 starts at 1094 and ends at 2675
start <- 1094
end <- 2675
g1 <- rview(bamExample, chr = "MT", start = start, end = end)
g2 <- gview(gtf, chr = "MT", start = start, end = end)
g <- ggbio::tracks(g1, g2, heights = c(4, 1), xlab = "chr MT")
g
This plot shows the read alignment information for gene mt-RNA2. The
top panel shows the alignment and orientation of each read aligned to
this gene. Y axis indicates the number of aligned reads at specific
locations. Each arrow represents a read. The length of the arrow
indicates the length of the read. The reads are colored by their UMI
tags. In the bottom panel, each arrow represents an isoform of the gene.
The isoforms are named by their transcript names. The grey boxes
indicate the exonic regions of the transcript. X axis is the genomic
coordinates on the corresponding chromosome (e.g. chr MT). It is
consistent between the top and bottom panels if legend
is
set to FALSE for rview
function.
The function tenxBamqc
collects QC metrics from BAM
files generated by 10X Cell Ranger pipeline. The collected QC metrics
can be visualized by the function qcplots
. The collected
measurements include number of aligned reads and number of reads aligned
to an gene for each valid cell barcode. Users can see the alignment
quality for the filtered cells after plugging in the filtered barcode
file from the Cell Ranger results. Here we show the visualization of an
example BAM file from Cell Ranger output.
# The following example BAM file is the first 5000 BAM file records extracted
# from sample 01 of the 1.3 Million Brain Cells dataset from E18 Mice.
# (https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.3.0/
# 1M_neurons)
# The BAM file for sample 01 is downloaded from here:
# http://sra-download.ncbi.nlm.nih.gov/srapub_files/
# SRR5167880_E18_20160930_Neurons_Sample_01.bam
# see details here:
# https://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP096558
# and here:
# https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE93421
bamfile10x <- system.file("extdata",
"SRR5167880_E18_20160930_Neurons_Sample_01_5000.bam",
package = "scruff")
# The filtered cell barcodes are generated using the following script:
# library(TENxBrainData)
# library(data.table)
# tenx <- TENxBrainData()
# # get filtered barcodes for sample 01
# filteredBcIndex <- tstrsplit(colData(tenx)[, "Barcode"], "-")[[2]] == 1
# filteredBc <- colData(tenx)[filteredBcIndex, ][["Barcode"]]
filteredBc <- system.file("extdata",
"SRR5167880_E18_20160930_Neurons_Sample_01_filtered_barcode.tsv",
package = "scruff")
# QC results are saved to current working directory
sce <- tenxBamqc(bam = bamfile10x,
experiment = "Neurons_Sample_01",
filter = filteredBc)
## tenxBamqc(bam = bamfile10x, experiment = "Neurons_Sample_01",
## filter = filteredBc)
## class: SingleCellExperiment
## dim: 1 1576
## metadata(0):
## assays(1): ''
## rownames: NULL
## rowData names(0):
## colnames(1576): AAACCTGAGGCCCGTT-Neurons_Sample_01
## AAACCTGAGTCCGGTC-Neurons_Sample_01 ...
## TTTGTCATCGGTGTTA-Neurons_Sample_01 TTTGTCATCTTGTATC-Neurons_Sample_01
## colData names(5): cell_barcode reads_mapped_to_genome
## reads_mapped_to_genes experiment number_of_cells
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] SingleCellExperiment_1.29.1 SummarizedExperiment_1.37.0
## [3] Biobase_2.67.0 GenomicRanges_1.59.1
## [5] GenomeInfoDb_1.43.1 IRanges_2.41.1
## [7] S4Vectors_0.45.2 BiocGenerics_0.53.3
## [9] generics_0.1.3 MatrixGenerics_1.19.0
## [11] matrixStats_1.4.1 scruff_1.25.0
## [13] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 sys_3.4.3 rstudioapi_0.17.1
## [4] jsonlite_1.8.9 magrittr_2.0.3 GenomicFeatures_1.59.1
## [7] farver_2.1.2 rmarkdown_2.29 BiocIO_1.17.0
## [10] zlibbioc_1.52.0 vctrs_0.6.5 memoise_2.0.1
## [13] Rsamtools_2.23.0 RCurl_1.98-1.16 base64enc_0.1-3
## [16] progress_1.2.3 htmltools_0.5.8.1 S4Arrays_1.7.1
## [19] curl_6.0.1 SparseArray_1.7.2 Formula_1.2-5
## [22] parallelly_1.39.0 sass_0.4.9 bslib_0.8.0
## [25] htmlwidgets_1.6.4 httr2_1.0.6 plyr_1.8.9
## [28] cachem_1.1.0 buildtools_1.0.0 GenomicAlignments_1.43.0
## [31] lifecycle_1.0.4 pkgconfig_2.0.3 Matrix_1.7-1
## [34] R6_2.5.1 fastmap_1.2.0 GenomeInfoDbData_1.2.13
## [37] digest_0.6.37 colorspace_2.1-1 GGally_2.2.1
## [40] ShortRead_1.65.0 OrganismDbi_1.49.0 AnnotationDbi_1.69.0
## [43] Hmisc_5.2-0 RSQLite_2.3.8 hwriter_1.3.2.1
## [46] labeling_0.4.3 filelock_1.0.3 fansi_1.0.6
## [49] httr_1.4.7 abind_1.4-8 compiler_4.4.2
## [52] withr_3.0.2 bit64_4.5.2 htmlTable_2.4.3
## [55] backports_1.5.0 BiocParallel_1.41.0 DBI_1.2.3
## [58] ggstats_0.7.0 biomaRt_2.63.0 rappdirs_0.3.3
## [61] DelayedArray_0.33.2 rjson_0.2.23 tools_4.4.2
## [64] foreign_0.8-87 nnet_7.3-19 glue_1.8.0
## [67] restfulr_0.0.15 grid_4.4.2 checkmate_2.3.2
## [70] cluster_2.1.6 reshape2_1.4.4 gtable_0.3.6
## [73] BSgenome_1.75.0 tidyr_1.3.1 ensembldb_2.31.0
## [76] hms_1.1.3 data.table_1.16.2 xml2_1.3.6
## [79] utf8_1.2.4 XVector_0.47.0 pillar_1.9.0
## [82] stringr_1.5.1 dplyr_1.1.4 BiocFileCache_2.15.0
## [85] lattice_0.22-6 rtracklayer_1.67.0 bit_4.5.0
## [88] deldir_2.0-4 biovizBase_1.55.0 RBGL_1.83.0
## [91] tidyselect_1.2.1 maketools_1.3.1 Biostrings_2.75.1
## [94] knitr_1.49 gridExtra_2.3 ProtGenerics_1.39.0
## [97] ggbio_1.55.0 xfun_0.49 stringi_1.8.4
## [100] UCSC.utils_1.3.0 lazyeval_0.2.2 yaml_2.3.10
## [103] evaluate_1.0.1 codetools_0.2-20 interp_1.1-6
## [106] tibble_3.2.1 graph_1.85.0 BiocManager_1.30.25
## [109] cli_3.6.3 rpart_4.1.23 Rsubread_2.21.1
## [112] munsell_0.5.1 jquerylib_0.1.4 dichromat_2.0-0.1
## [115] Rcpp_1.0.13-1 dbplyr_2.5.0 png_0.1-8
## [118] XML_3.99-0.17 parallel_4.4.2 ggplot2_3.5.1
## [121] blob_1.2.4 prettyunits_1.2.0 latticeExtra_0.6-30
## [124] jpeg_0.1-10 AnnotationFilter_1.31.0 bitops_1.0-9
## [127] txdbmaker_1.3.0 pwalign_1.3.0 ggthemes_5.1.0
## [130] VariantAnnotation_1.53.0 scales_1.3.0 purrr_1.0.2
## [133] crayon_1.5.3 rlang_1.1.4 KEGGREST_1.47.0