Title: | Single Cell RNA-Seq UMI Filtering Facilitator (scruff) |
---|---|
Description: | A pipeline which processes single cell RNA-seq (scRNA-seq) reads from CEL-seq and CEL-seq2 protocols. Demultiplex scRNA-seq FASTQ files, align reads to reference genome using Rsubread, and generate UMI filtered count matrix. Also provide visualizations of read alignments and pre- and post-alignment QC metrics. |
Authors: | Zhe Wang [aut, cre], Junming Hu [aut], Joshua Campbell [aut] |
Maintainer: | Zhe Wang <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.25.0 |
Built: | 2024-12-21 06:14:41 UTC |
Source: | https://github.com/bioc/scruff |
Rsubread
read alignment function align
This function is not available in Windows environment. Align cell
specific reads to reference genome and write sequence alignment results to
output directory. A wrapper to the align
function in Rsubread
package. For details please refer to Rsubread
manual.
alignRsubread( sce, index, unique = FALSE, nBestLocations = 1, format = "BAM", outDir = "./Alignment", cores = max(1, parallelly::availableCores() - 2), threads = 1, summaryPrefix = "alignment", overwrite = FALSE, verbose = FALSE, logfilePrefix = format(Sys.time(), "%Y%m%d_%H%M%S"), ... )
alignRsubread( sce, index, unique = FALSE, nBestLocations = 1, format = "BAM", outDir = "./Alignment", cores = max(1, parallelly::availableCores() - 2), threads = 1, summaryPrefix = "alignment", overwrite = FALSE, verbose = FALSE, logfilePrefix = format(Sys.time(), "%Y%m%d_%H%M%S"), ... )
sce |
A |
index |
Path to the |
unique |
Argument passed to |
nBestLocations |
Argument passed to |
format |
File format of sequence alignment results. "BAM" or "SAM". Default is "BAM". |
outDir |
Output directory for alignment results. Sequence alignment
files will be stored in folders in this directory, respectively.
Make sure the folder is empty. Default is |
cores |
Number of cores used for parallelization. Default is
|
threads |
Do not change. Number of threads/CPUs used for
mapping for each core. Refer to |
summaryPrefix |
Prefix for alignment summary filename. Default is
|
overwrite |
Boolean indicating whether to overwrite the output directory. Default is FALSE. |
verbose |
Boolean indicating whether to print log messages. Useful for debugging. Default to FALSE. |
logfilePrefix |
Prefix for log file. Default is current date and time
in the format of |
... |
Additional arguments passed to the |
A SingleCellExperiment object containing the alignment
summary information in the colData
slot. The alignment_path
column of the annotation table contains the paths to output alignment files.
# The SingleCellExperiment object returned by demultiplex function is # required for running alignRsubread function ## Not run: data(barcodeExample, package = "scruff") fastqs <- list.files(system.file("extdata", package = "scruff"), pattern = "\\.fastq\\.gz", full.names = TRUE) de <- demultiplex( project = "example", experiment = c("1h1"), lane = c("L001"), read1Path = c(fastqs[1]), read2Path = c(fastqs[2]), barcodeExample, bcStart = 1, bcStop = 8, umiStart = 9, umiStop = 12, keep = 75, overwrite = TRUE) # Alignment library(Rsubread) # Create index files for GRCm38_MT. fasta <- system.file("extdata", "GRCm38_MT.fa", package = "scruff") # Specify the basename for Rsubread index indexBase <- "GRCm38_MT" buildindex(basename = indexBase, reference = fasta, indexSplit = FALSE) al <- alignRsubread(de, indexBase, overwrite = TRUE) ## End(Not run)
# The SingleCellExperiment object returned by demultiplex function is # required for running alignRsubread function ## Not run: data(barcodeExample, package = "scruff") fastqs <- list.files(system.file("extdata", package = "scruff"), pattern = "\\.fastq\\.gz", full.names = TRUE) de <- demultiplex( project = "example", experiment = c("1h1"), lane = c("L001"), read1Path = c(fastqs[1]), read2Path = c(fastqs[2]), barcodeExample, bcStart = 1, bcStop = 8, umiStart = 9, umiStop = 12, keep = 75, overwrite = TRUE) # Alignment library(Rsubread) # Create index files for GRCm38_MT. fasta <- system.file("extdata", "GRCm38_MT.fa", package = "scruff") # Specify the basename for Rsubread index indexBase <- "GRCm38_MT" buildindex(basename = indexBase, reference = fasta, indexSplit = FALSE) al <- alignRsubread(de, indexBase, overwrite = TRUE) ## End(Not run)
An example GAlignments object containing read alignment information for cell
"vandenBrink_b1_cell_0095" of example FASTQ files. Used as an example for
rview
function.
bamExample
bamExample
A GAlignments
object.
A vector containing 48 predefined cell barcodes which will be used for demultiplexing the example FASTQ files. Only the cell barcodes from 49 to 96 of the original 96 barcodes are used here to reduce the time to run example codes and compile the vignette.
barcodeExample
barcodeExample
A vector of cell barcode sequences. Cell barcodes for this study (van den Brink, et al.) are of length 8.
The first 10,000 cell barcodes in v1 (737K-april-2014_rc.txt), v2
(737K-august-2016.txt), and v3 (3M-february-2018.txt) cell barcode
whitelist files. This object is used for testing the validity of input
assay chemistry validCb
for tenxBamqc function. The cell
barcodes for the first 10,000 alignments in the input BAM file will be
mapped to each chemistry's whitelist to determine the assay chemistry of
the BAM file.
cbtop10000
cbtop10000
A data.table
object.
Count unique UMI:gene pairs for single cell RNA-sequencing alignment
files. Write resulting count matrix to output directory. Columns are
samples (cells) and rows are gene IDs. The input sequence alignment files
must be generated using FASTQ files generated by the demultiplex
function in scruff package. Return a SingleCellExperiment object containing
the count matrix, cell and gene annotations, and all QC metrics.
countUMI( sce, reference, umiEdit = 0, format = "BAM", outDir = "./Count", cellPerWell = 1, cores = max(1, parallelly::availableCores() - 2), outputPrefix = "countUMI", verbose = FALSE, logfilePrefix = format(Sys.time(), "%Y%m%d_%H%M%S") )
countUMI( sce, reference, umiEdit = 0, format = "BAM", outDir = "./Count", cellPerWell = 1, cores = max(1, parallelly::availableCores() - 2), outputPrefix = "countUMI", verbose = FALSE, logfilePrefix = format(Sys.time(), "%Y%m%d_%H%M%S") )
sce |
A |
reference |
Path to the reference GTF file. The TxDb object of the GTF file will be generated and saved in the current working directory with ".sqlite" suffix. |
umiEdit |
Maximally allowed Hamming distance for UMI correction. For
read alignments in each gene, by comparing to a more abundant UMI with more
reads, UMIs having fewer reads and with mismatches equal or fewer than
|
format |
Format of input sequence alignment files. "BAM" or "SAM". Default is "BAM". |
outDir |
Output directory for UMI counting results. UMI corrected count
matrix will be stored in this directory. Default is |
cellPerWell |
Number of cells per well. Can be an integer (e.g. 1) indicating the number of cells in each well or an vector with length equal to the total number of cells in the input alignment files specifying the number of cells in each file. Default is 1. |
cores |
Number of cores used for parallelization. Default is
|
outputPrefix |
Prefix for expression table filename. Default is
|
verbose |
Print log messages. Useful for debugging. Default to FALSE. |
logfilePrefix |
Prefix for log file. Default is current date and time
in the format of |
A SingleCellExperiment object.
## Not run: data(barcodeExample, package = "scruff") # The SingleCellExperiment object returned by alignRsubread function and the # alignment BAM files are required for running countUMI function # First demultiplex example FASTQ files fastqs <- list.files(system.file("extdata", package = "scruff"), pattern = "\\.fastq\\.gz", full.names = TRUE) de <- demultiplex( project = "example", experiment = c("1h1"), lane = c("L001"), read1Path = c(fastqs[1]), read2Path = c(fastqs[2]), barcodeExample, bcStart = 1, bcStop = 8, umiStart = 9, umiStop = 12, keep = 75, overwrite = TRUE) # Alignment library(Rsubread) # Create index files for GRCm38_MT. fasta <- system.file("extdata", "GRCm38_MT.fa", package = "scruff") # Specify the basename for Rsubread index indexBase <- "GRCm38_MT" buildindex(basename = indexBase, reference = fasta, indexSplit = FALSE) al <- alignRsubread(de, indexBase, overwrite = TRUE) # Counting gtf <- system.file("extdata", "GRCm38_MT.gtf", package = "scruff") sce = countUMI(al, gtf, cellPerWell=c(rep(1, 46), 0, 0)) ## End(Not run) # or use the built-in SingleCellExperiment object generated using # example dataset (see ?sceExample) data(sceExample, package = "scruff")
## Not run: data(barcodeExample, package = "scruff") # The SingleCellExperiment object returned by alignRsubread function and the # alignment BAM files are required for running countUMI function # First demultiplex example FASTQ files fastqs <- list.files(system.file("extdata", package = "scruff"), pattern = "\\.fastq\\.gz", full.names = TRUE) de <- demultiplex( project = "example", experiment = c("1h1"), lane = c("L001"), read1Path = c(fastqs[1]), read2Path = c(fastqs[2]), barcodeExample, bcStart = 1, bcStop = 8, umiStart = 9, umiStop = 12, keep = 75, overwrite = TRUE) # Alignment library(Rsubread) # Create index files for GRCm38_MT. fasta <- system.file("extdata", "GRCm38_MT.fa", package = "scruff") # Specify the basename for Rsubread index indexBase <- "GRCm38_MT" buildindex(basename = indexBase, reference = fasta, indexSplit = FALSE) al <- alignRsubread(de, indexBase, overwrite = TRUE) # Counting gtf <- system.file("extdata", "GRCm38_MT.gtf", package = "scruff") sce = countUMI(al, gtf, cellPerWell=c(rep(1, 46), 0, 0)) ## End(Not run) # or use the built-in SingleCellExperiment object generated using # example dataset (see ?sceExample) data(sceExample, package = "scruff")
Demultiplex fastq files and write cell specific reads in compressed fastq format to output directory
demultiplex( project = paste0("project_", Sys.Date()), experiment, lane, read1Path, read2Path, bc, bcStart, bcStop, bcEdit = 0, umiStart, umiStop, keep, minQual = 10, yieldReads = 1e+06, outDir = "./Demultiplex", summaryPrefix = "demultiplex", overwrite = FALSE, cores = max(1, parallelly::availableCores() - 2), verbose = FALSE, logfilePrefix = format(Sys.time(), "%Y%m%d_%H%M%S") )
demultiplex( project = paste0("project_", Sys.Date()), experiment, lane, read1Path, read2Path, bc, bcStart, bcStop, bcEdit = 0, umiStart, umiStop, keep, minQual = 10, yieldReads = 1e+06, outDir = "./Demultiplex", summaryPrefix = "demultiplex", overwrite = FALSE, cores = max(1, parallelly::availableCores() - 2), verbose = FALSE, logfilePrefix = format(Sys.time(), "%Y%m%d_%H%M%S") )
project |
The project name. Default is
|
experiment |
A character vector of experiment names. Represents the
group label for each FASTQ file, e.g. "patient1, patient2, ...". The number
of cells in a experiment equals the length of cell barcodes |
lane |
A character or character vector of flow cell lane numbers. FASTQ
files from lanes having the same |
read1Path |
A character vector of file paths to the read 1 FASTQ files. These are the read files containing UMI and cell barcode sequences. |
read2Path |
A character vector of file paths to the read 2 FASTQ files. These read files contain genomic transcript sequences. |
bc |
A character vector of pre-determined cell barcodes. For example,
see |
bcStart |
Integer or vector of integers containing the cell barcode start positions (inclusive, one-based numbering). |
bcStop |
Integer or vector of integers containing the cell barcode stop positions (inclusive, one-based numbering). |
bcEdit |
Maximally allowed Hamming distance for barcode correction. Barcodes with mismatches equal or fewer than this will be assigned a corrected barcode if the inferred barcode matches uniquely in the provided predetermined barcode list. Default is 0, meaning no cell barcode correction is performed. |
umiStart |
Integer or vector of integers containing the start positions (inclusive, one-based numbering) of UMI sequences. |
umiStop |
Integer or vector of integers containing the stop positions (inclusive, one-based numbering) of UMI sequences. |
keep |
Read trimming. Read length or number of nucleotides to keep for read 2 (the read that contains transcript sequence information). Longer reads will be clipped at 3' end. Shorter reads will not be affected. |
minQual |
Minimally acceptable Phred quality score for barcode and UMI sequences. Phread quality scores are calculated for each nucleotide in the sequence. Sequences with at least one nucleotide with score lower than this will be filtered out. Default is 10. |
yieldReads |
The number of reads to yield when drawing successive
subsets from a fastq file, providing the number of successive records to be
returned on each yield. This parameter is passed to the |
outDir |
Output folder path for demultiplex results. Demultiplexed
cell specifc FASTQ files will be stored in folders in this path,
respectively. Make sure the folder is empty. Default is
|
summaryPrefix |
Prefix for demultiplex summary filename. Default is
|
overwrite |
Boolean indicating whether to overwrite the output directory. Default is FALSE. |
cores |
Number of cores used for parallelization. Default is
|
verbose |
Poolean indicating whether to print log messages. Useful for debugging. Default to FALSE. |
logfilePrefix |
Prefix for log file. Default is current date and time
in the format of |
A SingleCellExperiment object
containing the demultiplex summary information in the colData
slot.
# Demultiplex example FASTQ files data(barcodeExample, package = "scruff") fastqs <- list.files(system.file("extdata", package = "scruff"), pattern = "\\.fastq\\.gz", full.names = TRUE) de <- demultiplex( project = "example", experiment = c("1h1"), lane = c("L001"), read1Path = c(fastqs[1]), read2Path = c(fastqs[2]), barcodeExample, bcStart = 1, bcStop = 8, umiStart = 9, umiStop = 12, keep = 75, overwrite = TRUE)
# Demultiplex example FASTQ files data(barcodeExample, package = "scruff") fastqs <- list.files(system.file("extdata", package = "scruff"), pattern = "\\.fastq\\.gz", full.names = TRUE) de <- demultiplex( project = "example", experiment = c("1h1"), lane = c("L001"), read1Path = c(fastqs[1]), read2Path = c(fastqs[2]), barcodeExample, bcStart = 1, bcStop = 8, umiStart = 9, umiStop = 12, keep = 75, overwrite = TRUE)
Visualize reference genome. Rectangles represent exons. Arrow represents orientation of transcripts.
gview( gtfFile, chr = 1, start = 1, end = NULL, rect_width = 0.3, line_width = 0.5, arrow_segments = 10, arrow_width = 30, arrow_length = 0.08, arrow_type = "open", text_size = 4 )
gview( gtfFile, chr = 1, start = 1, end = NULL, rect_width = 0.3, line_width = 0.5, arrow_segments = 10, arrow_width = 30, arrow_length = 0.08, arrow_type = "open", text_size = 4 )
gtfFile |
A genome annotation file in GTF format. |
chr |
Chromosome name. Integer or "X", "Y", "MT". |
start |
Genomic coordinate of the start position. |
end |
Genomic coordinate of the end position. If |
rect_width |
Exon widths. Default 0.3. |
line_width |
Line weight. Default 0.5. |
arrow_segments |
The number of segments lines be divided to. The greater the number, more arrows there are. Default 10. |
arrow_width |
The angle of the arrow head in degrees (smaller numbers produce narrower, pointier arrows). Essentially describes the width of the arrow head. Passed to the angle parameter of arrow function. Default 30. |
arrow_length |
The length of the arrow head. Passed to the length argument of arrow function. Default 0.08. |
arrow_type |
One of "open" or "closed" indicating whether the arrow head should be a closed triangle. Passed to the type argument of arrow function. Default "open". |
text_size |
Size of text. Passed to the size argument of the geom_text function. Default 4. |
A ggplot object of genomic view
gtf <- system.file("extdata", "GRCm38_MT.gtf", package = "scruff") g <- gview(gtf, chr = "MT") g
gtf <- system.file("extdata", "GRCm38_MT.gtf", package = "scruff") g <- gview(gtf, chr = "MT") g
Visualize data quality from the colData
of the
SingleCellExperiment
object and return a list of figures in
arrangelist
object.
qcplots(sce)
qcplots(sce)
sce |
An |
A list of grobs
objects ready for plotting
data(sceExample, package = "scruff") qcplots(sceExample)
data(sceExample, package = "scruff") qcplots(sceExample)
Visualize read alignments for UMI tagged single cell RNA-sequencing data. Read names must contain UMI sequences at the end delimited by ":". Arrow represents orientation of alignment. Reads are colored by their UMI and sorted by their start positions and UMI.
rview( bamGA, chr = "1", start = 1, end = max(BiocGenerics::end(bamGA)), legend = FALSE )
rview( bamGA, chr = "1", start = 1, end = max(BiocGenerics::end(bamGA)), legend = FALSE )
bamGA |
A GenomicAlignment object |
chr |
Chromosome. Integer or "X", "Y", "MT". |
start |
Genomic coordinate of the start position. |
end |
Genomic coordinate of the end position. |
legend |
Show legend. Default is FALSE. |
A ggplot object of aligned reads
data(bamExample, package = "scruff") g <- rview(bamExample, chr = "MT", legend = TRUE) g
data(bamExample, package = "scruff") g <- rview(bamExample, chr = "MT", legend = TRUE) g
An example SingleCellExperiment object containing count matrix, cell and gene annotations, and all QC metrics for mouse mitochonrial genes generated from example FASTQ reads.
sceExample
sceExample
A SingleCellExperiment
object.
Run the scruff
pipeline. This function performs all
demultiplex
, alignRsubread
, and countUMI
functions.
Write demultiplex statistics, alignment statistics, and UMI filtered count
matrix in output directories. Return a SingleCellExperiment object
containing the count matrix, cell and gene annotations, and all QC metrics.
scruff( project = paste0("project_", Sys.Date()), experiment, lane, read1Path, read2Path, bc, index, reference, bcStart, bcStop, bcEdit = 0, umiStart, umiStop, umiEdit = 0, keep, cellPerWell = 1, unique = FALSE, nBestLocations = 1, minQual = 10, yieldReads = 1e+06, alignmentFileFormat = "BAM", demultiplexOutDir = "./Demultiplex", alignmentOutDir = "./Alignment", countUmiOutDir = "./Count", demultiplexSummaryPrefix = "demultiplex", alignmentSummaryPrefix = "alignment", countPrefix = "countUMI", logfilePrefix = format(Sys.time(), "%Y%m%d_%H%M%S"), overwrite = FALSE, verbose = FALSE, cores = max(1, parallelly::availableCores() - 2), threads = 1, ... )
scruff( project = paste0("project_", Sys.Date()), experiment, lane, read1Path, read2Path, bc, index, reference, bcStart, bcStop, bcEdit = 0, umiStart, umiStop, umiEdit = 0, keep, cellPerWell = 1, unique = FALSE, nBestLocations = 1, minQual = 10, yieldReads = 1e+06, alignmentFileFormat = "BAM", demultiplexOutDir = "./Demultiplex", alignmentOutDir = "./Alignment", countUmiOutDir = "./Count", demultiplexSummaryPrefix = "demultiplex", alignmentSummaryPrefix = "alignment", countPrefix = "countUMI", logfilePrefix = format(Sys.time(), "%Y%m%d_%H%M%S"), overwrite = FALSE, verbose = FALSE, cores = max(1, parallelly::availableCores() - 2), threads = 1, ... )
project |
The project name. Default is
|
experiment |
A character vector of experiment names. Represents the
group label for each FASTQ file, e.g. "patient1, patient2, ...". The number
of cells in a experiment equals the length of cell barcodes |
lane |
A character or character vector of flow cell lane numbers. If FASTQ files from multiple lanes are concatenated, any placeholder would be sufficient, e.g. "L001". |
read1Path |
A character vector of file paths to the read1 FASTQ files. These are the read files with UMI and cell barcode information. |
read2Path |
A character vector of file paths to the read2 FASTQ files. These read files contain genomic sequences. |
bc |
A vector of pre-determined cell barcodes. For example, see
|
index |
Path to the |
reference |
Path to the reference GTF file. The TxDb object of the GTF file will be generated and saved in the current working directory with ".sqlite" suffix. |
bcStart |
Integer or vector of integers containing the cell barcode start positions (inclusive, one-based numbering). |
bcStop |
Integer or vector of integers containing the cell barcode stop positions (inclusive, one-based numbering). |
bcEdit |
Maximally allowed Hamming distance for barcode correction. Barcodes with mismatches equal or fewer than this will be assigned a corrected barcode if the inferred barcode matches uniquely in the provided predetermined barcode list. Default is 0, meaning no cell barcode correction is performed. |
umiStart |
Integer or vector of integers containing the start positions (inclusive, one-based numbering) of UMI sequences. |
umiStop |
Integer or vector of integers containing the stop positions (inclusive, one-based numbering) of UMI sequences. |
umiEdit |
Maximally allowed Hamming distance for UMI correction. For
read alignments in each gene, by comparing to a more abundant UMI with more
reads, UMIs having fewer reads and with mismatches equal or fewer than
|
keep |
Read trimming. Read length or number of nucleotides to keep for read 2 (the read that contains transcript sequence information). Longer reads will be clipped at 3' end. Shorter reads will not be affected. This number should be determined based on the sequencing kit that was used in library preparation step. |
cellPerWell |
Number of cells per well. Can be an integer (e.g. 1) indicating the number of cells in each well or an vector with length equal to the total number of cells in the input alignment files specifying the number of cells in each file. Default is 1. |
unique |
Argument passed to |
nBestLocations |
Argument passed to |
minQual |
Minimally acceptable Phred quality score for cell barcode and UMI sequences. Phread quality scores are calculated for each nucleotide in these tags. Tags with at least one nucleotide with score lower than this will be filtered out. Default is 10. |
yieldReads |
The number of reads to yield when drawing successive
subsets from a fastq file, providing the number of successive records to be
returned on each yield. This parameter is passed to the |
alignmentFileFormat |
File format of sequence alignment results. "BAM" or "SAM". Default is "BAM". |
demultiplexOutDir |
Output folder path for demultiplex results.
Demultiplexed cell specifc FASTQ files will be stored in folders in this
path, respectively. Make sure the folder is empty. Default is
|
alignmentOutDir |
Output directory for alignment results. Sequence
alignment maps will be stored in folders in this directory, respectively.
Make sure the folder is empty. Default is |
countUmiOutDir |
Output directory for UMI counting results. UMI
filtered count matrix will be stored in this directory. Default is
|
demultiplexSummaryPrefix |
Prefix for demultiplex summary filename.
Default is |
alignmentSummaryPrefix |
Prefix for alignment summary filename. Default
is |
countPrefix |
Prefix for UMI filtered count matrix filename. Default is
|
logfilePrefix |
Prefix for log file. Default is current date and time
in the format of |
overwrite |
Boolean indicating whether to overwrite the output directory. Default is FALSE. |
verbose |
Boolean indicating whether to print log messages. Useful for debugging. Default to FALSE. |
cores |
Number of cores to use for parallelization. Default is
|
threads |
Do not change. Number of threads/CPUs used for
mapping for each core. Refer to |
... |
Additional arguments passed to the |
A SingleCellExperiment
object.
## Not run: # prepare required files data(barcodeExample, package = "scruff") fastqs <- list.files(system.file("extdata", package = "scruff"), pattern = "\\.fastq\\.gz", full.names = TRUE) fasta <- system.file("extdata", "GRCm38_MT.fa", package = "scruff") gtf <- system.file("extdata", "GRCm38_MT.gtf", package = "scruff") library(Rsubread) # Specify the basename for Rsubread index indexBase <- "GRCm38_MT" # Create index files for GRCm38_MT. buildindex(basename = indexBase, reference = fasta, indexSplit = FALSE) # run scruff pipeline sce <- scruff(project = "example", experiment = c("1h1"), lane = c("L001"), read1Path = c(fastqs[1]), read2Path = c(fastqs[2]), bc = barcodeExample, index = indexBase, reference = gtf, bcStart = 1, bcStop = 8, umiStart = 9, umiStop = 12, keep = 75, cellPerWell = c(rep(1, 46), 0, 0), overwrite = TRUE, verbose = TRUE) ## End(Not run) # or use the built-in SingleCellExperiment object generated using # example dataset (see ?sceExample) data(sceExample, package = "scruff")
## Not run: # prepare required files data(barcodeExample, package = "scruff") fastqs <- list.files(system.file("extdata", package = "scruff"), pattern = "\\.fastq\\.gz", full.names = TRUE) fasta <- system.file("extdata", "GRCm38_MT.fa", package = "scruff") gtf <- system.file("extdata", "GRCm38_MT.gtf", package = "scruff") library(Rsubread) # Specify the basename for Rsubread index indexBase <- "GRCm38_MT" # Create index files for GRCm38_MT. buildindex(basename = indexBase, reference = fasta, indexSplit = FALSE) # run scruff pipeline sce <- scruff(project = "example", experiment = c("1h1"), lane = c("L001"), read1Path = c(fastqs[1]), read2Path = c(fastqs[2]), bc = barcodeExample, index = indexBase, reference = gtf, bcStart = 1, bcStop = 8, umiStart = 9, umiStop = 12, keep = 75, cellPerWell = c(rep(1, 46), 0, 0), overwrite = TRUE, verbose = TRUE) ## End(Not run) # or use the built-in SingleCellExperiment object generated using # example dataset (see ?sceExample) data(sceExample, package = "scruff")
Read BAM file generated by Cell Ranger pipeline and output QC metrics including number of aligned reads and reads aligned to an gene.
tenxBamqc( bam, experiment, filter, validCb = NA, tags = c("NH", "GX", "CB", "MM"), yieldSize = 1e+06, outDir = "./", cores = max(1, parallelly::availableCores() - 2) )
tenxBamqc( bam, experiment, filter, validCb = NA, tags = c("NH", "GX", "CB", "MM"), yieldSize = 1e+06, outDir = "./", cores = max(1, parallelly::availableCores() - 2) )
bam |
Paths to input BAM files generated by Cell Ranger pipeline. These files are usually named "possorted_genome_bam.bam" in the "outs" folder of the top-level project output folders, respectively. |
experiment |
A character vector of experiment names. Represents the
group label for each BAM file, e.g. "patient1, patient2, ...". The
length of |
filter |
Paths to the filtered barcode files. Should be in the same length and order of the input BAM files. These files are named "barcodes.tsv" located at outs/filtered_gene_bc_matrices/<reference_genome>/barcodes.tsv. |
validCb |
Path to the cell barcode whitelist file. By default uses the
file "737K-august-2016.txt" which is compatible with the v2 chemistry
protocol. The file can be inspected by calling
|
tags |
BAM tags used for collecting QC metrics. Contains non-standard tags locally-defined by Cell Ranger pipeline. Should not be changed in most cases. |
yieldSize |
The number of records (alignments) to yield when drawing
successive subsets from a BAM file, providing the number of successive
records to be returned on each yield. This parameter is passed to the
|
outDir |
Output directory. The location to write resulting QC table. |
cores |
Number of cores used for parallelization. Default is
|
A SingleCellExperiment object. The
colData
contains the number of aligned reads
(reads_mapped_to_genome) and reads aligned
to genes (reads_mapped_to_genes).
# first 5000 records in the bam file downloaded from here: # http://sra-download.ncbi.nlm.nih.gov/srapub_files/ # SRR5167880_E18_20160930_Neurons_Sample_01.bam # see details here: # https://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP096558 # and here: # https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE93421 bamfile10x <- system.file("extdata", "SRR5167880_E18_20160930_Neurons_Sample_01_5000.bam", package = "scruff") # library(TENxBrainData) # library(data.table) # tenx <- TENxBrainData() # # get filtered barcodes for sample 01 # filteredBcIndex <- tstrsplit(colData(tenx)[, "Barcode"], "-")[[2]] == 1 # filteredBc <- colData(tenx)[filteredBcIndex, ][["Barcode"]] filteredBc <- system.file("extdata", "SRR5167880_E18_20160930_Neurons_Sample_01_filtered_barcode.tsv", package = "scruff") # QC results are saved to current working directory sce <- tenxBamqc(bam = bamfile10x, experiment = "Neurons_Sample_01", filter = filteredBc) sce
# first 5000 records in the bam file downloaded from here: # http://sra-download.ncbi.nlm.nih.gov/srapub_files/ # SRR5167880_E18_20160930_Neurons_Sample_01.bam # see details here: # https://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP096558 # and here: # https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE93421 bamfile10x <- system.file("extdata", "SRR5167880_E18_20160930_Neurons_Sample_01_5000.bam", package = "scruff") # library(TENxBrainData) # library(data.table) # tenx <- TENxBrainData() # # get filtered barcodes for sample 01 # filteredBcIndex <- tstrsplit(colData(tenx)[, "Barcode"], "-")[[2]] == 1 # filteredBc <- colData(tenx)[filteredBcIndex, ][["Barcode"]] filteredBc <- system.file("extdata", "SRR5167880_E18_20160930_Neurons_Sample_01_filtered_barcode.tsv", package = "scruff") # QC results are saved to current working directory sce <- tenxBamqc(bam = bamfile10x, experiment = "Neurons_Sample_01", filter = filteredBc) sce
A barcode whitelist is the list of all known barcode sequences that have been included in the assay kit and are available during library preparation. There are roughly 737,000 cell barcodes in the whitelist (737K-august-2016.txt) for Cell Ranger's Single Cell 3' and V(D)J applications.
validCb
validCb
A data.table
object.