Title: | Pipeline for single cell multi-omic data pre-processing |
---|---|
Description: | A preprocessing pipeline for single cell RNA-seq/ATAC-seq data that starts from the fastq files and produces a feature count matrix with associated quality control information. It can process fastq data generated by CEL-seq, MARS-seq, Drop-seq, Chromium 10x and SMART-seq protocols. |
Authors: | Luyi Tian [aut], Shian Su [aut, cre], Shalin Naik [ctb], Shani Amarasinghe [aut], Oliver Voogd [aut], Phil Yang [aut], Matthew Ritchie [ctb] |
Maintainer: | Shian Su <[email protected]> |
License: | GPL (>= 2) |
Version: | 2.7.0 |
Built: | 2024-10-31 04:58:18 UTC |
Source: | https://github.com/bioc/scPipe |
Detect outliers based on robust linear regression of QQ plot
.qq_outliers_robust(x, df, conf)
.qq_outliers_robust(x, df, conf)
x |
a vector of mahalanobis distance |
df |
degree of freedom for chi-square distribution |
conf |
confidence for linear regression |
cell names of outliers
Because of the variations in data format depending on annotation source, this function has only been tested with human annotation from ENSEMBL, RefSeq and Gencode. If it behaves unexpectedly with any annotation please submit an issue at www.github.com/LuyiTian/scPipe with details.
anno_import(filename)
anno_import(filename)
filename |
The name of the annotation gff3 or gtf file. File can be gzipped. |
Imports and GFF3 or GTF gene annotation file and transforms it into a SAF formatted data.frame. SAF described at http://bioinf.wehi.edu.au/featureCounts/. SAF contains positions for exons, strand and the GeneID they are associated with.
data.frame containing exon information in SAF format
ens_chrY <- anno_import(system.file("extdata", "ensembl_hg38_chrY.gtf.gz", package = "scPipe"))
ens_chrY <- anno_import(system.file("extdata", "ensembl_hg38_chrY.gtf.gz", package = "scPipe"))
This function converts a GRanges object into a data.frame of the SAF format for scPipe's consumption. The GRanges object should contain a "type" column where at least some features are annotated as "exon", in addition there should be a gene_id column specifying the gene to which the exon belongs. In the SAF only the gene ID, chromosome, start, end and strand are recorded, this is a gene-exon centric format, with all entries containing the same gene ID treated as exons of that gene. It is possible to count alternative features by setting the gene_id column to an arbitrary feature name and having alternative features in the SAF table, the main caveat is that the features are still treated as exons, and the mapping statistics for exon and intron will not reflect biological exons and introns but rather the annotation features.
anno_to_saf(anno)
anno_to_saf(anno)
anno |
The GRanges object containing exon information |
Convert a GRanges object containing type and gene_id information into a SAF format data.frame. SAF described at http://bioinf.wehi.edu.au/featureCounts/. SAF contains positions for exons, strand and the GeneID they are associated with.
data.frame containing exon information in SAF format
## Not run: anno <- system.file("extdata", "ensembl_hg38_chrY.gtf.gz", package = "scPipe") saf_chrY <- anno_to_saf(rtracklayer::import(anno)) ## End(Not run)
## Not run: anno <- system.file("extdata", "ensembl_hg38_chrY.gtf.gz", package = "scPipe") saf_chrY <- anno_to_saf(rtracklayer::import(anno)) ## End(Not run)
Calculate QC metrics from gene count matrix
calculate_QC_metrics(sce)
calculate_QC_metrics(sce)
sce |
a |
get QC metrics using gene count matrix. The QC statistics added are
number_of_genes number of genes detected.
total_count_per_cell sum of read number after UMI deduplication.
non_mt_percent 1 - percentage of mitochondrial gene counts. Mitochondrial genes are retrived by GO term GO:0005739
non_ERCC_percent ratio of exon counts to ERCC counts
non_ribo_percent 1 - percentage of ribosomal gene counts ribosomal genes are retrived by GO term GO:0005840.
an SingleCellExperiment
with updated QC metrics
data("sc_sample_data") data("sc_sample_qc") sce <- SingleCellExperiment(assays = list(counts = as.matrix(sc_sample_data))) organism(sce) <- "mmusculus_gene_ensembl" gene_id_type(sce) <- "ensembl_gene_id" QC_metrics(sce) <- sc_sample_qc demultiplex_info(sce) <- cell_barcode_matching UMI_dup_info(sce) <- UMI_duplication # The sample qc data already run through function `calculate_QC_metrics`. # So we delete these columns and run `calculate_QC_metrics` to get them again: colnames(colnames(QC_metrics(sce))) QC_metrics(sce) <- QC_metrics(sce)[,c("unaligned","aligned_unmapped","mapped_to_exon")] sce = calculate_QC_metrics(sce) colnames(QC_metrics(sce))
data("sc_sample_data") data("sc_sample_qc") sce <- SingleCellExperiment(assays = list(counts = as.matrix(sc_sample_data))) organism(sce) <- "mmusculus_gene_ensembl" gene_id_type(sce) <- "ensembl_gene_id" QC_metrics(sce) <- sc_sample_qc demultiplex_info(sce) <- cell_barcode_matching UMI_dup_info(sce) <- UMI_duplication # The sample qc data already run through function `calculate_QC_metrics`. # So we delete these columns and run `calculate_QC_metrics` to get them again: colnames(colnames(QC_metrics(sce))) QC_metrics(sce) <- QC_metrics(sce)[,c("unaligned","aligned_unmapped","mapped_to_exon")] sce = calculate_QC_metrics(sce) colnames(QC_metrics(sce))
This data.frame contains cell barcode demultiplex statistics with several rows:
barcode_unmatch_ambiguous_mapping is the number of reads that do not match any barcode, but aligned to the genome and mapped to multiple features.
barcode_unmatch_mapped_to_intron is the number of reads that do not match any barcode, but aligned to the genome and mapped to intron.
barcode_match is the number of reads that match the cell barcodes
barcode_unmatch_unaligned is the number of reads that do not match any barcode, and not aligned to the genome
barcode_unmatch_aligned is the number of reads that do not match any barcode, but aligned to the genome and do not mapped to any feature
barcode_unmatch_mapped_to_exon is the number of reads that do not match any barcode, but aligned to the genome and mapped to the exon
a data.frame instance, one row per cell.
NULL, but makes a data frame with cell barcode demultiplex statistics
Luyi Tian
Christin Biben (WEHI). She FACS sorted cells from several immune cell types including B cells, granulocyte and some early progenitors.
data("sc_sample_data") data("sc_sample_qc") sce = SingleCellExperiment(assays = list(counts =as.matrix(sc_sample_data))) organism(sce) = "mmusculus_gene_ensembl" gene_id_type(sce) = "ensembl_gene_id" QC_metrics(sce) = sc_sample_qc demultiplex_info(sce) = cell_barcode_matching UMI_dup_info(sce) = UMI_duplication demultiplex_info(sce)
data("sc_sample_data") data("sc_sample_qc") sce = SingleCellExperiment(assays = list(counts =as.matrix(sc_sample_data))) organism(sce) = "mmusculus_gene_ensembl" gene_id_type(sce) = "ensembl_gene_id" QC_metrics(sce) = sc_sample_qc demultiplex_info(sce) = cell_barcode_matching UMI_dup_info(sce) = UMI_duplication demultiplex_info(sce)
Checks to see if the given barcode start position (bstart
)
is valid for the fastq file. If the found barcode percentage is less than
the given threshold
, a new barcode start position is searched for by
checking every position from the start of each read to 10 bases after the bstart
check_barcode_start_position( fastq, barcode_file, barcode_file_realname, bstart, blength, search_lines, threshold )
check_barcode_start_position( fastq, barcode_file, barcode_file_realname, bstart, blength, search_lines, threshold )
fastq |
file containing reads |
barcode_file |
csv file |
barcode_file_realname |
the real name of the csv file |
bstart |
the start position for barcodes in the given reads |
blength |
length of each barcode |
search_lines |
the number of fastq lines to use for the check |
threshold |
the minimum percentage of found barcodes to accept for the program to continue |
Boolean; TRUE if program can continue execution, FALSE otherwise.
convert the gene ids of a SingleCellExperiment object
convert_geneid(sce, returns = "external_gene_name", all = TRUE)
convert_geneid(sce, returns = "external_gene_name", all = TRUE)
sce |
a SingleCellExperiment object |
returns |
the gene id which is set as return. Default to be 'external_gene_name'
A possible list of attributes can be retrieved using the
function |
all |
logic. For genes that cannot covert to new gene id, keep them with the old id or delete them. The default is keep them. |
convert the gene id of all datas in the SingleCellExperiment object
sce with converted id
# the gene id in example data are `external_gene_name` # the following example will convert it to `external_gene_name`. data("sc_sample_data") data("sc_sample_qc") sce = SingleCellExperiment(assays = list(counts = as.matrix(sc_sample_data))) organism(sce) = "mmusculus_gene_ensembl" gene_id_type(sce) = "ensembl_gene_id" QC_metrics(sce) = sc_sample_qc demultiplex_info(sce) = cell_barcode_matching UMI_dup_info(sce) = UMI_duplication head(rownames(sce)) sce = convert_geneid(sce, return="external_gene_name") head(rownames(sce))
# the gene id in example data are `external_gene_name` # the following example will convert it to `external_gene_name`. data("sc_sample_data") data("sc_sample_qc") sce = SingleCellExperiment(assays = list(counts = as.matrix(sc_sample_data))) organism(sce) = "mmusculus_gene_ensembl" gene_id_type(sce) = "ensembl_gene_id" QC_metrics(sce) = sc_sample_qc demultiplex_info(sce) = cell_barcode_matching UMI_dup_info(sce) = UMI_duplication head(rownames(sce)) sce = convert_geneid(sce, return="external_gene_name") head(rownames(sce))
Create an HTML report summarising pro-processed data. This is an alternative to the more verbose create_report
that requires only the processed counts and stats folders.
create_processed_report( outdir = ".", organism, gene_id_type, report_name = "report" )
create_processed_report( outdir = ".", organism, gene_id_type, report_name = "report" )
outdir |
output folder. |
organism |
the organism of the data. List of possible names can be retrieved using the function 'listDatasets'from 'biomaRt' package. (e.g. 'mmusculus_gene_ensembl' or 'hsapiens_gene_ensembl'). |
gene_id_type |
gene id type of the data A possible list of ids can be retrieved using the function 'listAttributes' from 'biomaRt' package. the commonly used id types are 'external_gene_name', 'ensembl_gene_id' or 'entrezgene'. |
report_name |
the name of the report .Rmd and .html files. |
file path of the created compiled document.
## Not run: create_report( outdir="output_dir_of_scPipe", organism="mmusculus_gene_ensembl", gene_id_type="ensembl_gene_id") ## End(Not run)
## Not run: create_report( outdir="output_dir_of_scPipe", organism="mmusculus_gene_ensembl", gene_id_type="ensembl_gene_id") ## End(Not run)
create an HTML report using data generated by proprocessing step.
create_report( sample_name, outdir, r1 = "NA", r2 = "NA", outfq = "NA", read_structure = list(bs1 = 0, bl1 = 0, bs2 = 0, bl2 = 0, us = 0, ul = 0), filter_settings = list(rmlow = TRUE, rmN = TRUE, minq = 20, numbq = 2), align_bam = "NA", genome_index = "NA", map_bam = "NA", exon_anno = "NA", stnd = TRUE, fix_chr = FALSE, barcode_anno = "NA", max_mis = 1, UMI_cor = 1, gene_fl = FALSE, organism, gene_id_type )
create_report( sample_name, outdir, r1 = "NA", r2 = "NA", outfq = "NA", read_structure = list(bs1 = 0, bl1 = 0, bs2 = 0, bl2 = 0, us = 0, ul = 0), filter_settings = list(rmlow = TRUE, rmN = TRUE, minq = 20, numbq = 2), align_bam = "NA", genome_index = "NA", map_bam = "NA", exon_anno = "NA", stnd = TRUE, fix_chr = FALSE, barcode_anno = "NA", max_mis = 1, UMI_cor = 1, gene_fl = FALSE, organism, gene_id_type )
sample_name |
sample name |
outdir |
output folder |
r1 |
file path of read1 |
r2 |
file path of read2 default to be NULL |
outfq |
file path of the output of |
read_structure |
a list contains read structure configuration. For more help see '?sc_trim_barcode' |
filter_settings |
a list contains read filter settings for more help see '?sc_trim_barcode' |
align_bam |
the aligned bam file |
genome_index |
genome index used for alignment |
map_bam |
the mapped bam file |
exon_anno |
the gff exon annotation used. Can have multiple files |
stnd |
whether to perform strand specific mapping |
fix_chr |
add 'chr' to chromosome names, fix inconsistent names. |
barcode_anno |
cell barcode annotation file path. |
max_mis |
maximum mismatch allowed in barcode. Default to be 1 |
UMI_cor |
correct UMI sequence error: 0 means no correction, 1 means simple correction and merge UMI with distance 1. |
gene_fl |
whether to remove low abundant gene count. Low abundant is defined as only one copy of one UMI for this gene |
organism |
the organism of the data. List of possible names can be retrieved using the function 'listDatasets'from 'biomaRt' package. (i.e 'mmusculus_gene_ensembl' or 'hsapiens_gene_ensembl') |
gene_id_type |
gene id type of the data A possible list of ids can be retrieved using the function 'listAttributes' from 'biomaRt' package. the commonly used id types are 'external_gene_name', 'ensembl_gene_id' or 'entrezgene' |
no return
## Not run: create_report(sample_name="sample_001", outdir="output_dir_of_scPipe", r1="read1.fq", r2="read2.fq", outfq="trim.fq", read_structure=list(bs1=-1, bl1=2, bs2=6, bl2=8, us=0, ul=6), filter_settings=list(rmlow=TRUE, rmN=TRUE, minq=20, numbq=2), align_bam="align.bam", genome_index="mouse.index", map_bam="aligned.mapped.bam", exon_anno="exon_anno.gff3", stnd=TRUE, fix_chr=FALSE, barcode_anno="cell_barcode.csv", max_mis=1, UMI_cor=1, gene_fl=FALSE, organism="mmusculus_gene_ensembl", gene_id_type="ensembl_gene_id") ## End(Not run)
## Not run: create_report(sample_name="sample_001", outdir="output_dir_of_scPipe", r1="read1.fq", r2="read2.fq", outfq="trim.fq", read_structure=list(bs1=-1, bl1=2, bs2=6, bl2=8, us=0, ul=6), filter_settings=list(rmlow=TRUE, rmN=TRUE, minq=20, numbq=2), align_bam="align.bam", genome_index="mouse.index", map_bam="aligned.mapped.bam", exon_anno="exon_anno.gff3", stnd=TRUE, fix_chr=FALSE, barcode_anno="cell_barcode.csv", max_mis=1, UMI_cor=1, gene_fl=FALSE, organism="mmusculus_gene_ensembl", gene_id_type="ensembl_gene_id") ## End(Not run)
after we run sc_gene_counting
and finish the preprocessing step. create_sce_by_dir
can be used to generate the SingleCellExperiment object from the folder that contains gene count matrix and QC statistics.
it can also generate the html report based on the gene count and quality control statistics
create_sce_by_dir( datadir, organism = NULL, gene_id_type = NULL, pheno_data = NULL, report = FALSE )
create_sce_by_dir( datadir, organism = NULL, gene_id_type = NULL, pheno_data = NULL, report = FALSE )
datadir |
the directory that contains all the data and 'stat' subfolder. |
organism |
the organism of the data. List of possible names can be retrieved using the function 'listDatasets'from 'biomaRt' package. (i.e 'mmusculus_gene_ensembl' or 'hsapiens_gene_ensembl') |
gene_id_type |
gene id type of the data A possible list of ids can be retrieved using the function 'listAttributes' from 'biomaRt' package. the commonly used id types are 'external_gene_name', 'ensembl_gene_id' or 'entrezgene' |
pheno_data |
the external phenotype data that linked to each single cell. This should be an |
report |
whether to generate the html report in the data folder |
after we run sc_gene_counting
and finish the preprocessing step. create_sce_by_dir
can be used to generate the SingleCellExperiment object from the folder that contains gene count matrix and QC statistics.
a SingleCellExperiment object
## Not run: # the sce can be created fron the output folder of scPipe # please refer to the vignettes sce = create_sce_by_dir(datadir="output_dir_of_scPipe", organism="mmusculus_gene_ensembl", gene_id_type="ensembl_gene_id") ## End(Not run) # or directly from the gene count and quality control matrix: data("sc_sample_data") data("sc_sample_qc") sce = SingleCellExperiment(assays = list(counts = as.matrix(sc_sample_data))) organism(sce) = "mmusculus_gene_ensembl" gene_id_type(sce) = "ensembl_gene_id" QC_metrics(sce) = sc_sample_qc demultiplex_info(sce) = cell_barcode_matching UMI_dup_info(sce) = UMI_duplication dim(sce)
## Not run: # the sce can be created fron the output folder of scPipe # please refer to the vignettes sce = create_sce_by_dir(datadir="output_dir_of_scPipe", organism="mmusculus_gene_ensembl", gene_id_type="ensembl_gene_id") ## End(Not run) # or directly from the gene count and quality control matrix: data("sc_sample_data") data("sc_sample_qc") sce = SingleCellExperiment(assays = list(counts = as.matrix(sc_sample_data))) organism(sce) = "mmusculus_gene_ensembl" gene_id_type(sce) = "ensembl_gene_id" QC_metrics(sce) = sc_sample_qc demultiplex_info(sce) = cell_barcode_matching UMI_dup_info(sce) = UMI_duplication dim(sce)
Get or set cell barcode demultiplex results in a SingleCellExperiment object
demultiplex_info(object) demultiplex_info(object) <- value demultiplex_info.sce(object) ## S4 method for signature 'SingleCellExperiment' demultiplex_info(object) ## S4 replacement method for signature 'SingleCellExperiment' demultiplex_info(object) <- value
demultiplex_info(object) demultiplex_info(object) <- value demultiplex_info.sce(object) ## S4 method for signature 'SingleCellExperiment' demultiplex_info(object) ## S4 replacement method for signature 'SingleCellExperiment' demultiplex_info(object) <- value
object |
A |
value |
Value to be assigned to corresponding object. |
a dataframe of cell barcode demultiplex information
A DataFrame of cell barcode demultiplx results.
Luyi Tian
data("sc_sample_data") data("sc_sample_qc") sce = SingleCellExperiment(assays = list(counts = as.matrix(sc_sample_data))) organism(sce) = "mmusculus_gene_ensembl" gene_id_type(sce) = "ensembl_gene_id" QC_metrics(sce) = sc_sample_qc demultiplex_info(sce) = cell_barcode_matching UMI_dup_info(sce) = UMI_duplication demultiplex_info(sce)
data("sc_sample_data") data("sc_sample_qc") sce = SingleCellExperiment(assays = list(counts = as.matrix(sc_sample_data))) organism(sce) = "mmusculus_gene_ensembl" gene_id_type(sce) = "ensembl_gene_id" QC_metrics(sce) = sc_sample_qc demultiplex_info(sce) = cell_barcode_matching UMI_dup_info(sce) = UMI_duplication demultiplex_info(sce)
This algorithm will try to find comp
number of components
in quality control metrics using a Gaussian mixture model. Outlier
detection is performed on the component with the most genes detected. The
rest of the components will be considered poor quality cells. More cells
will be classified low quality as you increase comp
.
detect_outlier( sce, comp = 1, sel_col = NULL, type = c("low", "both", "high"), conf = c(0.9, 0.99), batch = FALSE )
detect_outlier( sce, comp = 1, sel_col = NULL, type = c("low", "both", "high"), conf = c(0.9, 0.99), batch = FALSE )
sce |
a |
comp |
the number of component used in GMM. Depending on the quality of the experiment. |
sel_col |
a vector of column names which indicate the columns to use for QC. By default it will be the statistics generated by 'calculate_QC_metrics()' |
type |
only looking at low quality cells ('low') or possible doublets ('high') or both ('both') |
conf |
confidence interval for linear regression at lower and upper tails.Usually, this is smaller for lower tail because we hope to pick out more low quality cells than doublets. |
batch |
whether to perform quality control separately for each batch. Default is FALSE. If set to TRUE then you should have a column called 'batch' in the 'colData(sce)'. |
detect outlier using Mahalanobis distances
an updated SingleCellExperiment
object with an 'outlier'
column in colData
data("sc_sample_data") data("sc_sample_qc") sce = SingleCellExperiment(assays = list(counts = as.matrix(sc_sample_data))) organism(sce) = "mmusculus_gene_ensembl" gene_id_type(sce) = "ensembl_gene_id" QC_metrics(sce) = sc_sample_qc demultiplex_info(sce) = cell_barcode_matching UMI_dup_info(sce) = UMI_duplication # the sample qc data already run through function `calculate_QC_metrics` # for a new sce please run `calculate_QC_metrics` before `detect_outlier` sce = detect_outlier(sce) table(QC_metrics(sce)$outliers)
data("sc_sample_data") data("sc_sample_qc") sce = SingleCellExperiment(assays = list(counts = as.matrix(sc_sample_data))) organism(sce) = "mmusculus_gene_ensembl" gene_id_type(sce) = "ensembl_gene_id" QC_metrics(sce) = sc_sample_qc demultiplex_info(sce) = cell_barcode_matching UMI_dup_info(sce) = UMI_duplication # the sample qc data already run through function `calculate_QC_metrics` # for a new sce please run `calculate_QC_metrics` before `detect_outlier` sce = detect_outlier(sce) table(QC_metrics(sce)$outliers)
feature_info
from a SingleCellExperiment objectGet or set feature_info
from a SingleCellExperiment object
feature_info(object) feature_info(object) <- value feature_info.sce(object) ## S4 method for signature 'SingleCellExperiment' feature_info(object) ## S4 replacement method for signature 'SingleCellExperiment' feature_info(object) <- value
feature_info(object) feature_info(object) <- value feature_info.sce(object) ## S4 method for signature 'SingleCellExperiment' feature_info(object) ## S4 replacement method for signature 'SingleCellExperiment' feature_info(object) <- value
object |
A |
value |
Value to be assigned to corresponding object. |
a dataframe of feature info for scATAC-seq data
A DataFrame of feature information
Shani Amarasinghe
feature_type
from a SingleCellExperiment objectGet or set feature_type
from a SingleCellExperiment object
feature_type(object) feature_type(object) <- value feature_type.sce(object) ## S4 method for signature 'SingleCellExperiment' feature_type(object) ## S4 replacement method for signature 'SingleCellExperiment' feature_type(object) <- value
feature_type(object) feature_type(object) <- value feature_type.sce(object) ## S4 method for signature 'SingleCellExperiment' feature_type(object) ## S4 replacement method for signature 'SingleCellExperiment' feature_type(object) <- value
object |
A |
value |
Value to be assigned to corresponding object. |
the feature type used in feature counting for scATAC-Seq data
A string representing the feature type
Shani Amarasinghe
gene_id_type
from a SingleCellExperiment objectGet or set gene_id_type
from a SingleCellExperiment object
gene_id_type(object) gene_id_type(object) <- value gene_id_type.sce(object) ## S4 method for signature 'SingleCellExperiment' gene_id_type(object) ## S4 replacement method for signature 'SingleCellExperiment' gene_id_type(object) <- value
gene_id_type(object) gene_id_type(object) <- value gene_id_type.sce(object) ## S4 method for signature 'SingleCellExperiment' gene_id_type(object) ## S4 replacement method for signature 'SingleCellExperiment' gene_id_type(object) <- value
object |
A |
value |
Value to be assigned to corresponding object. |
the gene id type used by Biomart
gene id type string
Luyi Tian
data("sc_sample_data") data("sc_sample_qc") sce = SingleCellExperiment(assays = list(counts = as.matrix(sc_sample_data))) organism(sce) = "mmusculus_gene_ensembl" gene_id_type(sce) = "ensembl_gene_id" QC_metrics(sce) = sc_sample_qc demultiplex_info(sce) = cell_barcode_matching UMI_dup_info(sce) = UMI_duplication gene_id_type(sce)
data("sc_sample_data") data("sc_sample_qc") sce = SingleCellExperiment(assays = list(counts = as.matrix(sc_sample_data))) organism(sce) = "mmusculus_gene_ensembl" gene_id_type(sce) = "ensembl_gene_id" QC_metrics(sce) = sc_sample_qc demultiplex_info(sce) = cell_barcode_matching UMI_dup_info(sce) = UMI_duplication gene_id_type(sce)
Gets a list of NamedList of chromosomes and the reference length acquired through the bam index file.
get_chromosomes(bam, keep_contigs = "^chr")
get_chromosomes(bam, keep_contigs = "^chr")
bam |
file path to the bam file to get data from |
keep_contigs |
regular expression used with grepl to filter reference names |
a named list where element names are chromosomes reference names and elements are integer lengths
Helper function to retrieve ERCC annotation as a dataframe in SAF format
get_ercc_anno()
get_ercc_anno()
data.frame containing ERCC annotation
ercc_anno <- get_ercc_anno()
ercc_anno <- get_ercc_anno()
Get genes related to certain GO terms from biomart database
get_genes_by_GO( returns = "ensembl_gene_id", dataset = "mmusculus_gene_ensembl", go = NULL )
get_genes_by_GO( returns = "ensembl_gene_id", dataset = "mmusculus_gene_ensembl", go = NULL )
returns |
the gene id which is set as return. Default to be ensembl id
A possible list of attributes can be retrieved using the
function |
dataset |
Dataset you want to use. List of possible datasets can be
retrieved using the function |
go |
a vector of GO terms |
Get genes related to certain GO terms from biomart database
a vector of gene ids.
# get all genes under GO term GO:0005739 in mouse, return ensembl gene id get_genes_by_GO(returns="ensembl_gene_id", dataset="mmusculus_gene_ensembl", go=c('GO:0005739'))
# get all genes under GO term GO:0005739 in mouse, return ensembl gene id get_genes_by_GO(returns="ensembl_gene_id", dataset="mmusculus_gene_ensembl", go=c('GO:0005739'))
The supported protocols are:
CelSeq
CelSeq2
DropSeq
10x (also called ChromiumV1)
If you know the structure of a specific protocol and would like it supported, please leave a issue post at www.github.com/luyitian/scPipe.
get_read_str(protocol)
get_read_str(protocol)
protocol |
name of the protocol |
list of UMI and Barcode locations for use in other scPipe functions
get_read_str("celseq")
get_read_str("celseq")
organism
from a SingleCellExperiment objectGet or set organism
from a SingleCellExperiment object
organism.sce(object) ## S4 method for signature 'SingleCellExperiment' organism(object) ## S4 replacement method for signature 'SingleCellExperiment' organism(object) <- value
organism.sce(object) ## S4 method for signature 'SingleCellExperiment' organism(object) ## S4 replacement method for signature 'SingleCellExperiment' organism(object) <- value
object |
A |
value |
Value to be assigned to corresponding object. |
organism string
Luyi Tian
data("sc_sample_data") data("sc_sample_qc") sce = SingleCellExperiment(assays = list(counts = as.matrix(sc_sample_data))) organism(sce) = "mmusculus_gene_ensembl" gene_id_type(sce) = "ensembl_gene_id" QC_metrics(sce) = sc_sample_qc demultiplex_info(sce) = cell_barcode_matching UMI_dup_info(sce) = UMI_duplication organism(sce)
data("sc_sample_data") data("sc_sample_qc") sce = SingleCellExperiment(assays = list(counts = as.matrix(sc_sample_data))) organism(sce) = "mmusculus_gene_ensembl" gene_id_type(sce) = "ensembl_gene_id" QC_metrics(sce) = sc_sample_qc demultiplex_info(sce) = cell_barcode_matching UMI_dup_info(sce) = UMI_duplication organism(sce)
Plot cell barcode demultiplexing result for the
SingleCellExperiment
. The barcode demultiplexing result is shown
using a barplot, with the bars indicating proportions of total reads.
Barcode matches and mismatches are summarised along with whether or not the
read mapped to the genome. High proportion of genome aligned reads with no
barcode match may indicate barcode integration failure.
plot_demultiplex(sce)
plot_demultiplex(sce)
sce |
a |
a ggplot2 bar chart
data("sc_sample_data") data("sc_sample_qc") sce = SingleCellExperiment(assays = list(counts = as.matrix(sc_sample_data))) organism(sce) = "mmusculus_gene_ensembl" gene_id_type(sce) = "ensembl_gene_id" QC_metrics(sce) = sc_sample_qc demultiplex_info(sce) = cell_barcode_matching UMI_dup_info(sce) = UMI_duplication plot_demultiplex(sce)
data("sc_sample_data") data("sc_sample_qc") sce = SingleCellExperiment(assays = list(counts = as.matrix(sc_sample_data))) organism(sce) = "mmusculus_gene_ensembl" gene_id_type(sce) = "ensembl_gene_id" QC_metrics(sce) = sc_sample_qc demultiplex_info(sce) = cell_barcode_matching UMI_dup_info(sce) = UMI_duplication plot_demultiplex(sce)
SingleCellExperiment
object.Plot mapping statistics for SingleCellExperiment
object.
plot_mapping(sce, sel_col = NULL, percentage = FALSE, dataname = "")
plot_mapping(sce, sel_col = NULL, percentage = FALSE, dataname = "")
sce |
a |
sel_col |
a vector of column names, indicating the columns to use for plot. by default it will be the mapping result. |
percentage |
TRUE to convert the number of reads to percentage |
dataname |
the name of this dataset, used as plot title |
a ggplot2 object
data("sc_sample_data") data("sc_sample_qc") sce = SingleCellExperiment(assays = list(counts = as.matrix(sc_sample_data))) organism(sce) = "mmusculus_gene_ensembl" gene_id_type(sce) = "ensembl_gene_id" QC_metrics(sce) = sc_sample_qc demultiplex_info(sce) = cell_barcode_matching UMI_dup_info(sce) = UMI_duplication plot_mapping(sce,percentage=TRUE,dataname="sc_sample")
data("sc_sample_data") data("sc_sample_qc") sce = SingleCellExperiment(assays = list(counts = as.matrix(sc_sample_data))) organism(sce) = "mmusculus_gene_ensembl" gene_id_type(sce) = "ensembl_gene_id" QC_metrics(sce) = sc_sample_qc demultiplex_info(sce) = cell_barcode_matching UMI_dup_info(sce) = UMI_duplication plot_mapping(sce,percentage=TRUE,dataname="sc_sample")
SingleCellExperiment
objectPlot GGAlly pairs plot of QC statistics from SingleCellExperiment
object
plot_QC_pairs(sce, sel_col = NULL)
plot_QC_pairs(sce, sel_col = NULL)
sce |
a |
sel_col |
a vector of column names which indicate the columns to use for plot. By default it will be the statistics generated by 'calculate_QC_metrics()' |
a ggplot2 object
data("sc_sample_data") data("sc_sample_qc") sce = SingleCellExperiment(assays = list(counts = as.matrix(sc_sample_data))) organism(sce) = "mmusculus_gene_ensembl" gene_id_type(sce) = "ensembl_gene_id" QC_metrics(sce) = sc_sample_qc demultiplex_info(sce) = cell_barcode_matching UMI_dup_info(sce) = UMI_duplication sce = detect_outlier(sce) plot_QC_pairs(sce)
data("sc_sample_data") data("sc_sample_qc") sce = SingleCellExperiment(assays = list(counts = as.matrix(sc_sample_data))) organism(sce) = "mmusculus_gene_ensembl" gene_id_type(sce) = "ensembl_gene_id" QC_metrics(sce) = sc_sample_qc demultiplex_info(sce) = cell_barcode_matching UMI_dup_info(sce) = UMI_duplication sce = detect_outlier(sce) plot_QC_pairs(sce)
Plot the UMI duplication frequency.
plot_UMI_dup(sce, log10_x = TRUE)
plot_UMI_dup(sce, log10_x = TRUE)
sce |
a |
log10_x |
whether to use log10 scale for x axis |
a line chart of the UMI duplication frequency
data("sc_sample_data") data("sc_sample_qc") sce = SingleCellExperiment(assays = list(counts = as.matrix(sc_sample_data))) organism(sce) = "mmusculus_gene_ensembl" gene_id_type(sce) = "ensembl_gene_id" QC_metrics(sce) = sc_sample_qc demultiplex_info(sce) = cell_barcode_matching UMI_dup_info(sce) = UMI_duplication plot_UMI_dup(sce)
data("sc_sample_data") data("sc_sample_qc") sce = SingleCellExperiment(assays = list(counts = as.matrix(sc_sample_data))) organism(sce) = "mmusculus_gene_ensembl" gene_id_type(sce) = "ensembl_gene_id" QC_metrics(sce) = sc_sample_qc demultiplex_info(sce) = cell_barcode_matching UMI_dup_info(sce) = UMI_duplication plot_UMI_dup(sce)
Get or set quality control metrics in a SingleCellExperiment object
QC_metrics(object) QC_metrics(object) <- value QC_metrics.sce(object) ## S4 method for signature 'SingleCellExperiment' QC_metrics(object) ## S4 replacement method for signature 'SingleCellExperiment' QC_metrics(object) <- value
QC_metrics(object) QC_metrics(object) <- value QC_metrics.sce(object) ## S4 method for signature 'SingleCellExperiment' QC_metrics(object) ## S4 replacement method for signature 'SingleCellExperiment' QC_metrics(object) <- value
object |
A |
value |
Value to be assigned to corresponding object. |
a dataframe of quality control matrics
A DataFrame of quality control metrics.
Luyi Tian
data("sc_sample_data") data("sc_sample_qc") sce = SingleCellExperiment(assays = list(counts = as.matrix(sc_sample_data))) QC_metrics(sce) = sc_sample_qc head(QC_metrics(sce))
data("sc_sample_data") data("sc_sample_qc") sce = SingleCellExperiment(assays = list(counts = as.matrix(sc_sample_data))) QC_metrics(sce) = sc_sample_qc head(QC_metrics(sce))
Read Cell barcode file
read_cells(cells)
read_cells(cells)
cells |
the file path to the barcode file. Assumes one barcode per line or barcode csv. Or, cells can be a comma delimited string of barcodes |
a character vector of the provided barcodes
SingleCellExperiment
Removes outliers flagged by detect_outliers()
remove_outliers(sce)
remove_outliers(sce)
sce |
a |
a SingleCellExperiment
object without outliers
data("sc_sample_data") data("sc_sample_qc") sce = SingleCellExperiment(assays = list(counts = as.matrix(sc_sample_data))) organism(sce) = "mmusculus_gene_ensembl" gene_id_type(sce) = "ensembl_gene_id" QC_metrics(sce) = sc_sample_qc demultiplex_info(sce) = cell_barcode_matching UMI_dup_info(sce) = UMI_duplication sce = detect_outlier(sce) dim(sce) sce = remove_outliers(sce) dim(sce)
data("sc_sample_data") data("sc_sample_qc") sce = SingleCellExperiment(assays = list(counts = as.matrix(sc_sample_data))) organism(sce) = "mmusculus_gene_ensembl" gene_id_type(sce) = "ensembl_gene_id" QC_metrics(sce) = sc_sample_qc demultiplex_info(sce) = cell_barcode_matching UMI_dup_info(sce) = UMI_duplication sce = detect_outlier(sce) dim(sce) sce = remove_outliers(sce) dim(sce)
after we run the sc_trim_barcode
or sc_atac_trim_barcode
to demultiplex the fastq files, we are using this
function to align those fastq files to a known reference.
sc_aligning( R1, R2 = NULL, tech = "atac", index_path = NULL, ref = NULL, output_folder = NULL, output_file = NULL, input_format = "FASTQ", output_format = "BAM", type = "dna", nthreads = 1 )
sc_aligning( R1, R2 = NULL, tech = "atac", index_path = NULL, ref = NULL, output_folder = NULL, output_file = NULL, input_format = "FASTQ", output_format = "BAM", type = "dna", nthreads = 1 )
R1 |
a mandatory character vector including names of files that include sequence reads to be aligned. For paired-end reads, this gives the list of files including first reads in each library. File format is FASTQ/FASTA by default. |
R2 |
a character vector, the second fastq file, which is required if the data is paired-end |
tech |
a character string giving the sequencing technology. Possible value includes "atac" or "rna" |
index_path |
character string specifying the path/basename of the index files, if the Rsubread genome build is available |
ref |
a character string specifying the path to reference genome file (.fasta, .fa format) |
output_folder |
a character string, the name of the output folder |
output_file |
a character vector specifying names of output files. By default, names of output files are set as the file names provided in R1 added with an suffix string |
input_format |
a string indicating the input format |
output_format |
a string indicating the output format |
type |
type of sequencing data ('RNA' or 'DNA') |
nthreads |
numeric value giving the number of threads used for mapping. |
the file path of the output aligned BAM file
## Not run: sc_aligning(index_path, tech = 'atac', R1, R2, nthreads = 6) ## End(Not run)
## Not run: sc_aligning(index_path, tech = 'atac', R1, R2, nthreads = 6) ## End(Not run)
Demultiplexes the reads
sc_atac_bam_tagging( inbam, output_folder = NULL, bc_length = NULL, bam_tags = list(bc = "CB", mb = "OX"), nthreads = 1 )
sc_atac_bam_tagging( inbam, output_folder = NULL, bc_length = NULL, bam_tags = list(bc = "CB", mb = "OX"), nthreads = 1 )
inbam |
The input BAM file |
output_folder |
The path of the output folder |
bc_length |
The length of the cellular barcodes |
bam_tags |
The BAM tags |
nthreads |
The number of threads |
sc_atac_bam_tagging()
file path of the resultant demmultiplexed BAM file.
r1 <- system.file("extdata", "small_chr21_R1.fastq.gz", package="scPipe") r2 <- system.file("extdata", "small_chr21_R3.fastq.gz", package="scPipe") barcode_fastq <- system.file("extdata", "small_chr21_R2.fastq.gz", package="scPipe") out <- tempdir() sc_atac_trim_barcode(r1=r1, r2=r2, bc_file=barcode_fastq, output_folder=out) demux_r1 <- file.path(out, "demux_completematch_small_chr21_R1.fastq.gz") demux_r2 <- file.path(out, "demux_completematch_small_chr21_R3.fastq.gz") reference <- system.file("extdata", "small_chr21.fa", package="scPipe") aligned_bam <- sc_aligning(ref=reference, R1=demux_r1, R2=demux_r2, nthreads=6, output_folder=out) out_bam <- sc_atac_bam_tagging( inbam = aligned_bam, output_folder = out, nthreads = 6)
r1 <- system.file("extdata", "small_chr21_R1.fastq.gz", package="scPipe") r2 <- system.file("extdata", "small_chr21_R3.fastq.gz", package="scPipe") barcode_fastq <- system.file("extdata", "small_chr21_R2.fastq.gz", package="scPipe") out <- tempdir() sc_atac_trim_barcode(r1=r1, r2=r2, bc_file=barcode_fastq, output_folder=out) demux_r1 <- file.path(out, "demux_completematch_small_chr21_R1.fastq.gz") demux_r2 <- file.path(out, "demux_completematch_small_chr21_R3.fastq.gz") reference <- system.file("extdata", "small_chr21.fa", package="scPipe") aligned_bam <- sc_aligning(ref=reference, R1=demux_r1, R2=demux_r2, nthreads=6, output_folder=out) out_bam <- sc_atac_bam_tagging( inbam = aligned_bam, output_folder = out, nthreads = 6)
the methods to call true cells are of various ways.
implement (i.e. filtering
from scATAC-Pro
as default
sc_atac_cell_calling( mat, cell_calling = "filter", output_folder, genome_size = NULL, cell_qc_metrics_file = NULL, lower = NULL, min_uniq_frags = 3000, max_uniq_frags = 50000, min_frac_peak = 0.3, min_frac_tss = 0, min_frac_enhancer = 0, min_frac_promoter = 0.1, max_frac_mito = 0.15 )
sc_atac_cell_calling( mat, cell_calling = "filter", output_folder, genome_size = NULL, cell_qc_metrics_file = NULL, lower = NULL, min_uniq_frags = 3000, max_uniq_frags = 50000, min_frac_peak = 0.3, min_frac_tss = 0, min_frac_enhancer = 0, min_frac_promoter = 0.1, max_frac_mito = 0.15 )
mat |
the feature by cell matrix. |
cell_calling |
the cell calling approach, possible options were "emptydrops" , "cellranger" and "filter". But we opten to using "filter" as it was most robust. "emptydrops" is still an opition for data with large umber of cells. |
output_folder |
output directory for the cell called matrix. |
genome_size |
genome size for the data in feature by cell matrix. |
cell_qc_metrics_file |
quality per barcode file for the barcodes in the matrix if using the |
lower |
the lower threshold for the data if using the |
min_uniq_frags |
The minimum number of required unique fragments required for a cell (used for |
max_uniq_frags |
The maximum number of required unique fragments required for a cell (used for |
min_frac_peak |
The minimum proportion of fragments in a cell to overlap with a peak (used for |
min_frac_tss |
The minimum proportion of fragments in a cell to overlap with a tss (used for |
min_frac_enhancer |
The minimum proportion of fragments in a cell to overlap with a enhancer sequence (used for |
min_frac_promoter |
The minimum proportion of fragments in a cell to overlap with a promoter sequence (used for |
max_frac_mito |
The maximum proportion of fragments in a cell that are mitochondrial (used for |
## Not run: sc_atac_cell_calling <- function(mat, cell_calling, output_folder, genome_size = NULL, cell_qc_metrics_file = NULL, lower = NULL) ## End(Not run)
## Not run: sc_atac_cell_calling <- function(mat, cell_calling, output_folder, genome_size = NULL, cell_qc_metrics_file = NULL, lower = NULL) ## End(Not run)
uses the peak file and annotation files for features
sc_atac_create_cell_qc_metrics( frags_file, peaks_file, promoters_file, tss_file, enhs_file, output_folder )
sc_atac_create_cell_qc_metrics( frags_file, peaks_file, promoters_file, tss_file, enhs_file, output_folder )
frags_file |
The fragment file |
peaks_file |
The peak file |
promoters_file |
The path of the promoter annotation file |
tss_file |
The path of the tss annotation file |
enhs_file |
The path of the enhs annotation file |
output_folder |
The path of the output folder for resultant files |
Nothing (Invisible 'NULL')
Takes in a tagged and sorted BAM file and outputs the associated fragments in a .bed file
sc_atac_create_fragments( inbam, output_folder = "", min_mapq = 30, nproc = 1, cellbarcode = "CB", chromosomes = "^chr", readname_barcode = NULL, cells = NULL, max_distance = 5000, min_distance = 10, chunksize = 5e+05 )
sc_atac_create_fragments( inbam, output_folder = "", min_mapq = 30, nproc = 1, cellbarcode = "CB", chromosomes = "^chr", readname_barcode = NULL, cells = NULL, max_distance = 5000, min_distance = 10, chunksize = 5e+05 )
inbam |
The tagged, sorted and duplicate-free input BAM file |
output_folder |
The path of the output folder |
min_mapq |
: int Minimum MAPQ to retain fragment |
nproc |
: int, optional Number of processors to use. Default is 1. |
cellbarcode |
: str Tag used for cell barcode. Default is CB (used by cellranger) |
chromosomes |
: str, optional Regular expression used to match chromosome names to include in the output file. Default is "(?i)^chr" (starts with "chr", case-insensitive). If None, use all chromosomes in the BAM file. |
readname_barcode |
: str, optional Regular expression used to match cell barocde stored in read name. If None (default), use read tags instead. Use "[^:]*" to match all characters before the first colon (":"). |
cells |
: str File containing list of cell barcodes to retain. If None (default), use all cell barcodes found in the BAM file. |
max_distance |
: int, optional Maximum distance between integration sites for the fragment to be retained. Allows filtering of implausible fragments that likely result from incorrect mapping positions. Default is 5000 bp. |
min_distance |
: int, optional Minimum distance between integration sites for the fragment to be retained. Allows filtering implausible fragments that likely result from incorrect mapping positions. Default is 10 bp. |
chunksize |
: int Number of BAM entries to read through before collapsing and writing fragments to disk. Higher chunksize will use more memory but will be faster. |
returns NULL
Generates a HTML report using the output folder produced by the pipeline
sc_atac_create_report( input_folder, output_folder = NULL, organism = NULL, sample_name = NULL, feature_type = NULL, n_barcode_subset = 500 )
sc_atac_create_report( input_folder, output_folder = NULL, organism = NULL, sample_name = NULL, feature_type = NULL, n_barcode_subset = 500 )
input_folder |
The path of the folder produced by the pipeline |
output_folder |
The path of the output folder to store the HTML report in |
organism |
A string indicating the name of the organism being analysed |
sample_name |
A string indicating the name of the sample |
feature_type |
A string indicating the type of the feature ('genome_bin' or 'peak') |
n_barcode_subset |
if you require only to visualise stats for a sample of barcodes to improve processing time (integer) |
the path of the output file
sc_atac_create_sce()
sc_atac_create_sce( input_folder = NULL, organism = NULL, sample_name = NULL, feature_type = NULL, pheno_data = NULL, report = FALSE )
sc_atac_create_sce( input_folder = NULL, organism = NULL, sample_name = NULL, feature_type = NULL, pheno_data = NULL, report = FALSE )
input_folder |
The output folder produced by the pipeline |
organism |
The type of the organism |
sample_name |
The name of the sample |
feature_type |
The type of the feature |
pheno_data |
The pheno data |
report |
Whether or not a HTML report should be produced |
a SingleCellExperiment object created from the scATAC-Seq data provided
## Not run: sc_atac_create_sce( input_folder = input_folder, organism = "hg38", feature_type = "peak", report = TRUE) ## End(Not run)
## Not run: sc_atac_create_sce( input_folder = input_folder, organism = "hg38", feature_type = "peak", report = TRUE) ## End(Not run)
The empty drops cell calling method
sc_atac_emptydrops_cell_calling(mat, output_folder, lower = NULL)
sc_atac_emptydrops_cell_calling(mat, output_folder, lower = NULL)
mat |
The input matrix |
output_folder |
The path of the output folder |
lower |
The lower threshold for the data if using the |
feature matrix is created using a given demultiplexed BAM file and a selected feature type
sc_atac_feature_counting( fragment_file, feature_input = NULL, bam_tags = list(bc = "CB", mb = "OX"), feature_type = "peak", organism = "hg38", cell_calling = "filter", sample_name = "", genome_size = NULL, promoters_file = NULL, tss_file = NULL, enhs_file = NULL, gene_anno_file = NULL, pheno_data = NULL, bin_size = NULL, yieldsize = 1e+06, n_filter_cell_counts = 200, n_filter_feature_counts = 10, exclude_regions = FALSE, excluded_regions_filename = NULL, output_folder = NULL, fix_chr = "none", lower = NULL, min_uniq_frags = 3000, max_uniq_frags = 50000, min_frac_peak = 0.3, min_frac_tss = 0, min_frac_enhancer = 0, min_frac_promoter = 0.1, max_frac_mito = 0.15, create_report = FALSE )
sc_atac_feature_counting( fragment_file, feature_input = NULL, bam_tags = list(bc = "CB", mb = "OX"), feature_type = "peak", organism = "hg38", cell_calling = "filter", sample_name = "", genome_size = NULL, promoters_file = NULL, tss_file = NULL, enhs_file = NULL, gene_anno_file = NULL, pheno_data = NULL, bin_size = NULL, yieldsize = 1e+06, n_filter_cell_counts = 200, n_filter_feature_counts = 10, exclude_regions = FALSE, excluded_regions_filename = NULL, output_folder = NULL, fix_chr = "none", lower = NULL, min_uniq_frags = 3000, max_uniq_frags = 50000, min_frac_peak = 0.3, min_frac_tss = 0, min_frac_enhancer = 0, min_frac_promoter = 0.1, max_frac_mito = 0.15, create_report = FALSE )
fragment_file |
The fragment file |
feature_input |
The feature input data e.g. the .narrowPeak file for peaks of a bed file format |
bam_tags |
The BAM tags |
feature_type |
The type of feature |
organism |
The organism type (contains hg19, hg38, mm10) |
cell_calling |
The desired cell calling method; either |
sample_name |
The sample name to identify which is the data is analysed for. |
genome_size |
The size of the genome (used for the |
promoters_file |
The path of the promoter annotation file (if the specified organism isn't recognised). |
tss_file |
The path of the tss annotation file (if the specified organism isn't recognised). |
enhs_file |
The path of the enhs annotation file (if the specified organism isn't recognised). |
gene_anno_file |
The path of the gene annotation file (gtf or gff3 format). |
pheno_data |
The phenotypic data as a data frame |
bin_size |
The size of the bins |
yieldsize |
The yield size |
n_filter_cell_counts |
An integer value to filter the feature matrix on the number of reads per cell (default = 200) |
n_filter_feature_counts |
An integer value to filter the feature matrix on the number of reads per feature (default = 10). |
exclude_regions |
Whether or not the regions (specified in the file) should be excluded |
excluded_regions_filename |
The filename of the file containing the regions to be excluded |
output_folder |
The output folder |
fix_chr |
Whether chr should be fixed or not |
lower |
the lower threshold for the data if using the |
min_uniq_frags |
The minimum number of required unique fragments required for a cell (used for |
max_uniq_frags |
The maximum number of required unique fragments required for a cell (used for |
min_frac_peak |
The minimum proportion of fragments in a cell to overlap with a peak (used for |
min_frac_tss |
The minimum proportion of fragments in a cell to overlap with a tss (used for |
min_frac_enhancer |
The minimum proportion of fragments in a cell to overlap with a enhancer sequence (used for |
min_frac_promoter |
The minimum proportion of fragments in a cell to overlap with a promoter sequence (used for |
max_frac_mito |
The maximum proportion of fragments in a cell that are mitochondrial (used for |
create_report |
Logical value to say whether to create the report or not (default = TRUE). |
None (invisible 'NULL')
## Not run: sc_atac_feature_counting( fragment_file = fragment_file, cell_calling = "filter", exclude_regions = TRUE, feature_input = feature_file) ## End(Not run)
## Not run: sc_atac_feature_counting( fragment_file = fragment_file, cell_calling = "filter", exclude_regions = TRUE, feature_input = feature_file) ## End(Not run)
specify various qc cutoffs to select the desired cells
sc_atac_filter_cell_calling( mtx, cell_qc_metrics_file, min_uniq_frags = 0, max_uniq_frags = 50000, min_frac_peak = 0.05, min_frac_tss = 0, min_frac_enhancer = 0, min_frac_promoter = 0, max_frac_mito = 0.2 )
sc_atac_filter_cell_calling( mtx, cell_qc_metrics_file, min_uniq_frags = 0, max_uniq_frags = 50000, min_frac_peak = 0.05, min_frac_tss = 0, min_frac_enhancer = 0, min_frac_promoter = 0, max_frac_mito = 0.2 )
mtx |
The input matrix |
cell_qc_metrics_file |
A file containing qc statistics for each cell |
min_uniq_frags |
The minimum number of required unique fragments required for a cell |
max_uniq_frags |
The maximum number of required unique fragments required for a cell |
min_frac_peak |
The minimum proportion of fragments in a cell to overlap with a peak |
min_frac_tss |
The minimum proportion of fragments in a cell to overlap with a tss |
min_frac_enhancer |
The minimum proportion of fragments in a cell to overlap with a enhancer sequence |
min_frac_promoter |
The minimum proportion of fragments in a cell to overlap with a promoter sequence |
max_frac_mito |
The maximum proportion of fragments in a cell that are mitochondrial |
sc_atac_peak_calling()
sc_atac_peak_calling( inbam, ref = NULL, genome_size = NULL, output_folder = NULL )
sc_atac_peak_calling( inbam, ref = NULL, genome_size = NULL, output_folder = NULL )
inbam |
The input tagged, sorted, duplicate-free input BAM file |
ref |
The reference genome file |
genome_size |
The size of the genome |
output_folder |
The path of the output folder |
None (invisible 'NULL')
## Not run: sc_atac_peak_calling( inbam, reference) ## End(Not run)
## Not run: sc_atac_peak_calling( inbam, reference) ## End(Not run)
A convenient function for running the entire pipeline
sc_atac_pipeline( r1, r2, bc_file, valid_barcode_file = "", id1_st = -0, id1_len = 16, id2_st = 0, id2_len = 16, rmN = TRUE, rmlow = TRUE, organism = NULL, reference = NULL, feature_type = NULL, remove_duplicates = FALSE, samtools_path = NULL, genome_size = NULL, bin_size = NULL, yieldsize = 1e+06, exclude_regions = TRUE, excluded_regions_filename = NULL, fix_chr = "none", lower = NULL, cell_calling = "filter", promoters_file = NULL, tss_file = NULL, enhs_file = NULL, gene_anno_file = NULL, min_uniq_frags = 3000, max_uniq_frags = 50000, min_frac_peak = 0.3, min_frac_tss = 0, min_frac_enhancer = 0, min_frac_promoter = 0.1, max_frac_mito = 0.15, report = TRUE, nthreads = 12, output_folder = NULL )
sc_atac_pipeline( r1, r2, bc_file, valid_barcode_file = "", id1_st = -0, id1_len = 16, id2_st = 0, id2_len = 16, rmN = TRUE, rmlow = TRUE, organism = NULL, reference = NULL, feature_type = NULL, remove_duplicates = FALSE, samtools_path = NULL, genome_size = NULL, bin_size = NULL, yieldsize = 1e+06, exclude_regions = TRUE, excluded_regions_filename = NULL, fix_chr = "none", lower = NULL, cell_calling = "filter", promoters_file = NULL, tss_file = NULL, enhs_file = NULL, gene_anno_file = NULL, min_uniq_frags = 3000, max_uniq_frags = 50000, min_frac_peak = 0.3, min_frac_tss = 0, min_frac_enhancer = 0, min_frac_promoter = 0.1, max_frac_mito = 0.15, report = TRUE, nthreads = 12, output_folder = NULL )
r1 |
The first read fastq file |
r2 |
The second read fastq file |
bc_file |
the barcode information, can be either in a |
valid_barcode_file |
optional file path of the valid (expected) barcode sequences to be found in the bc_file (.txt, can be txt.gz).
Must contain one barcode per line on the second column separated by a comma (default ="").
If given, each barcode from bc_file is matched against the barcode of
best fit (allowing a hamming distance of 1). If a FASTQ |
id1_st |
barcode start position (0-indexed) for read 1, which is an extra parameter that is needed if the
|
id1_len |
barcode length for read 1, which is an extra parameter that is needed if the
|
id2_st |
barcode start position (0-indexed) for read 2, which is an extra parameter that is needed if the
|
id2_len |
barcode length for read 2, which is an extra parameter that is needed if the
|
rmN |
ogical, whether to remove reads that contains N in UMI or cell barcode. |
rmlow |
logical, whether to remove reads that have low quality barcode sequences. |
organism |
The name of the organism e.g. hg38 |
reference |
The reference genome file |
feature_type |
The feature type (either 'genome_bin' or 'peak') |
remove_duplicates |
Whether or not to remove duplicates (samtools is required) |
samtools_path |
A custom path of samtools to use for duplicate removal |
genome_size |
The size of the genome (used for the |
bin_size |
The size of the bins for feature counting with the 'genome_bin' feature type |
yieldsize |
The number of reads to read in for feature counting |
exclude_regions |
Whether or not the regions should be excluded |
excluded_regions_filename |
The filename of the file containing the regions to be excluded |
fix_chr |
Specify 'none', 'exclude_regions', 'feature' or 'both' to prepend the string "chr" to the start of the associated file |
lower |
the lower threshold for the data if using the |
cell_calling |
The desired cell calling method either |
promoters_file |
The path of the promoter annotation file (if the specified organism isn't recognised) |
tss_file |
The path of the tss annotation file (if the specified organism isn't recognised) |
enhs_file |
The path of the enhs annotation file (if the specified organism isn't recognised) |
gene_anno_file |
The path of the gene annotation file (gtf or gff3 format) |
min_uniq_frags |
The minimum number of required unique fragments required for a cell (used for |
max_uniq_frags |
The maximum number of required unique fragments required for a cell (used for |
min_frac_peak |
The minimum proportion of fragments in a cell to overlap with a peak (used for |
min_frac_tss |
The minimum proportion of fragments in a cell to overlap with a tss (used for |
min_frac_enhancer |
The minimum proportion of fragments in a cell to overlap with a enhancer sequence (used for |
min_frac_promoter |
The minimum proportion of fragments in a cell to overlap with a promoter sequence (used for |
max_frac_mito |
The maximum proportion of fragments in a cell that are mitochondrial (used for |
report |
Whether or not a HTML report should be produced |
nthreads |
The number of threads to use for alignment (sc_align) and demultiplexing (sc_atac_bam_tagging) |
output_folder |
The path of the output folder |
None (invisible 'NULL')
data.folder <- system.file("extdata", package = "scPipe", mustWork = TRUE) r1 <- file.path(data.folder, "small_chr21_R1.fastq.gz") r2 <- file.path(data.folder, "small_chr21_R3.fastq.gz") # Using a barcode fastq file: # barcodes in fastq format barcode_fastq <- file.path(data.folder, "small_chr21_R2.fastq.gz") ## Not run: sc_atac_pipeline( r1 = r1, r2 = r2, bc_file = barcode_fastq ) ## End(Not run)
data.folder <- system.file("extdata", package = "scPipe", mustWork = TRUE) r1 <- file.path(data.folder, "small_chr21_R1.fastq.gz") r2 <- file.path(data.folder, "small_chr21_R3.fastq.gz") # Using a barcode fastq file: # barcodes in fastq format barcode_fastq <- file.path(data.folder, "small_chr21_R2.fastq.gz") ## Not run: sc_atac_pipeline( r1 = r1, r2 = r2, bc_file = barcode_fastq ) ## End(Not run)
A function that tests the pipeline on a small test sample (without duplicate removal)
sc_atac_pipeline_quick_test()
sc_atac_pipeline_quick_test()
None (invisible 'NULL')
A histogram of the log-number of cells per feature
sc_atac_plot_cells_per_feature(sce)
sc_atac_plot_cells_per_feature(sce)
sce |
The SingleExperimentObject produced by the sc_atac_create_sce function at the end of the pipeline |
returns NULL
A histogram of the log-number of features per cell
sc_atac_plot_features_per_cell(sce)
sc_atac_plot_features_per_cell(sce)
sce |
The SingleExperimentObject produced by the sc_atac_create_sce function at the end of the pipeline |
returns NULL
Plot showing the number of features per cell in ascending order
sc_atac_plot_features_per_cell_ordered(sce)
sc_atac_plot_features_per_cell_ordered(sce)
sce |
The SingleExperimentObject produced by the sc_atac_create_sce function at the end of the pipeline |
returns NULL
A scatter plot of the log-number of fragments and log-number of cells per feature
sc_atac_plot_fragments_cells_per_feature(sce)
sc_atac_plot_fragments_cells_per_feature(sce)
sce |
The SingleExperimentObject produced by the sc_atac_create_sce function at the end of the pipeline |
returns NULL
A scatter plot of the log-number of fragments and log-number of features per cell
sc_atac_plot_fragments_features_per_cell(sce)
sc_atac_plot_fragments_features_per_cell(sce)
sce |
The SingleExperimentObject produced by the sc_atac_create_sce function at the end of the pipeline |
returns NULL
A histogram of the log-number of fragments per cell
sc_atac_plot_fragments_per_cell(sce)
sc_atac_plot_fragments_per_cell(sce)
sce |
The SingleExperimentObject produced by the sc_atac_create_sce function at the end of the pipeline |
returns NULL
A histogram of the log-number of fragments per feature
sc_atac_plot_fragments_per_feature(sce)
sc_atac_plot_fragments_per_feature(sce)
sce |
The SingleExperimentObject produced by the sc_atac_create_sce function at the end of the pipeline |
returns NULL
Takes in a BAM file and removes the PCR duplicates using the samtools markdup function. Requires samtools 1.10 or newer for statistics to be generated.
sc_atac_remove_duplicates(inbam, samtools_path = NULL, output_folder = NULL)
sc_atac_remove_duplicates(inbam, samtools_path = NULL, output_folder = NULL)
inbam |
The tagged, sorted and duplicate-free input BAM file |
samtools_path |
The path of the samtools executable (if a custom installation is to be specified) |
output_folder |
The path of the output folder |
file path to a bam file created from samtools markdup
Takes the binary matrix and generate a TF-IDF so the clutering can take place on the reduced dimentions.
sc_atac_tfidf(binary.mat, output_folder = NULL)
sc_atac_tfidf(binary.mat, output_folder = NULL)
binary.mat |
The final, filtered feature matrix in binary format |
output_folder |
The path of the output folder |
None (invisible 'NULL')
## Not run: sc_atac_tfidf(binary.mat = final_binary_matrix) ## End(Not run)
## Not run: sc_atac_tfidf(binary.mat = final_binary_matrix) ## End(Not run)
single-cell data need to be demultiplexed in order to retain the information of the cell barcodes the data belong to. Here we reformat fastq files so barcode/s (and if available the UMI sequences) are moved from the sequence into the read name. Since scATAC-Seq data are mostly paired-end, both 'r1' and 'r2' are demultiplexed in this function.
sc_atac_trim_barcode( r1, r2, bc_file = NULL, valid_barcode_file = "", output_folder = "", umi_start = 0, umi_length = 0, umi_in = "both", rmN = FALSE, rmlow = FALSE, min_qual = 20, num_below_min = 2, id1_st = -0, id1_len = 16, id2_st = 0, id2_len = 16, no_reverse_complement = FALSE )
sc_atac_trim_barcode( r1, r2, bc_file = NULL, valid_barcode_file = "", output_folder = "", umi_start = 0, umi_length = 0, umi_in = "both", rmN = FALSE, rmlow = FALSE, min_qual = 20, num_below_min = 2, id1_st = -0, id1_len = 16, id2_st = 0, id2_len = 16, no_reverse_complement = FALSE )
r1 |
read one for pair-end reads. |
r2 |
read two for pair-end reads, NULL if single read. |
bc_file |
the barcode information, can be either in a |
valid_barcode_file |
optional file path of the valid (expected) barcode sequences to be found in the bc_file (.txt, can be txt.gz).
Must contain one barcode per line on the second column separated by a comma (default ="").
If given, each barcode from bc_file is matched against the barcode of
best fit (allowing a hamming distance of 1). If a FASTQ |
output_folder |
the output dir for the demultiplexed fastq file, which will contain
fastq files with reformatted barcode and UMI into the read name.
Files ending in |
umi_start |
if available, the start position of the molecular identifier. |
umi_length |
if available, the start position of the molecular identifier. |
umi_in |
umi_in |
rmN |
logical, whether to remove reads that contains N in UMI or cell barcode. |
rmlow |
logical, whether to remove reads that have low quality barcode sequences |
min_qual |
the minimum base pair quality that is allowed (default = 20). |
num_below_min |
the maximum number of base pairs below the quality threshold. |
id1_st |
barcode start position (0-indexed) for read 1, which is an extra parameter that is needed if the
|
id1_len |
barcode length for read 1, which is an extra parameter that is needed if the
|
id2_st |
barcode start position (0-indexed) for read 2, which is an extra parameter that is needed if the
|
id2_len |
barcode length for read 2, which is an extra parameter that is needed if the
|
no_reverse_complement |
specifies if the reverse complement of the barcode sequence should be used for barcode error correction (only when barcode sequences are provided as fastq files). FALSE (default) lets the function decide whether to use reverse complement, and TRUE forces the function to use the forward barcode sequences. |
None (invisible 'NULL')
data.folder <- system.file("extdata", package = "scPipe", mustWork = TRUE) r1 <- file.path(data.folder, "small_chr21_R1.fastq.gz") r2 <- file.path(data.folder, "small_chr21_R3.fastq.gz") # Using a barcode fastq file: # barcodes in fastq format barcode_fastq <- file.path(data.folder, "small_chr21_R2.fastq.gz") sc_atac_trim_barcode ( r1 = r1, r2 = r2, bc_file = barcode_fastq, rmN = TRUE, rmlow = TRUE, output_folder = tempdir()) # Using a barcode csv file: # barcodes in .csv format barcode_1000 <- file.path(data.folder, "chr21_modified_barcode_1000.csv") ## Not run: sc_atac_trim_barcode ( r1 = r1, r2 = r2, bc_file = barcode_1000, id1_st = 0, rmN = TRUE, rmlow = TRUE, output_folder = tempdir()) ## End(Not run)
data.folder <- system.file("extdata", package = "scPipe", mustWork = TRUE) r1 <- file.path(data.folder, "small_chr21_R1.fastq.gz") r2 <- file.path(data.folder, "small_chr21_R3.fastq.gz") # Using a barcode fastq file: # barcodes in fastq format barcode_fastq <- file.path(data.folder, "small_chr21_R2.fastq.gz") sc_atac_trim_barcode ( r1 = r1, r2 = r2, bc_file = barcode_fastq, rmN = TRUE, rmlow = TRUE, output_folder = tempdir()) # Using a barcode csv file: # barcodes in .csv format barcode_1000 <- file.path(data.folder, "chr21_modified_barcode_1000.csv") ## Not run: sc_atac_trim_barcode ( r1 = r1, r2 = r2, bc_file = barcode_1000, id1_st = 0, rmN = TRUE, rmlow = TRUE, output_folder = tempdir()) ## End(Not run)
update the cell barcode tag in bam file with corrected barcode output to a new bam file. the function will be useful for methods that use the cell barcode information from bam file, such as 'Demuxlet'
sc_correct_bam_bc( inbam, outbam, bc_anno, max_mis = 1, bam_tags = list(am = "YE", ge = "GE", bc = "BC", mb = "OX"), mito = "MT", nthreads = 1 )
sc_correct_bam_bc( inbam, outbam, bc_anno, max_mis = 1, bam_tags = list(am = "YE", ge = "GE", bc = "BC", mb = "OX"), mito = "MT", nthreads = 1 )
inbam |
input bam file. This should be the output of
|
outbam |
output bam file with updated cell barcode |
bc_anno |
barcode annotation, first column is cell id, second column is cell barcode sequence |
max_mis |
maximum mismatch allowed in barcode. (default: 1) |
bam_tags |
list defining BAM tags where mapping information is stored.
|
mito |
mitochondrial chromosome name. This should be consistent with the chromosome names in the bam file. |
nthreads |
number of threads to use. (default: 1) |
no return
data_dir="celseq2_demo" barcode_annotation_fn = system.file("extdata", "barcode_anno.csv", package = "scPipe") ## Not run: # refer to the vignettes for the complete workflow ... sc_correct_bam_bc(file.path(data_dir, "out.map.bam"), file.path(data_dir, "out.map.clean.bam"), barcode_annotation_fn) ... ## End(Not run)
data_dir="celseq2_demo" barcode_annotation_fn = system.file("extdata", "barcode_anno.csv", package = "scPipe") ## Not run: # refer to the vignettes for the complete workflow ... sc_correct_bam_bc(file.path(data_dir, "out.map.bam"), file.path(data_dir, "out.map.clean.bam"), barcode_annotation_fn) ... ## End(Not run)
Wrapper to run sc_exon_mapping
,
sc_demultiplex
and sc_gene_counting
with a
single command
sc_count_aligned_bam( inbam, outbam, annofn, bam_tags = list(am = "YE", ge = "GE", bc = "BC", mb = "OX"), bc_len = 8, UMI_len = 6, stnd = TRUE, fix_chr = FALSE, outdir, bc_anno, max_mis = 1, mito = "MT", has_UMI = TRUE, UMI_cor = 1, gene_fl = FALSE, keep_mapped_bam = TRUE, nthreads = 1 )
sc_count_aligned_bam( inbam, outbam, annofn, bam_tags = list(am = "YE", ge = "GE", bc = "BC", mb = "OX"), bc_len = 8, UMI_len = 6, stnd = TRUE, fix_chr = FALSE, outdir, bc_anno, max_mis = 1, mito = "MT", has_UMI = TRUE, UMI_cor = 1, gene_fl = FALSE, keep_mapped_bam = TRUE, nthreads = 1 )
inbam |
input aligned bam file. can have multiple files as input |
outbam |
output bam filename |
annofn |
single string or vector of gff3 annotation filenames, data.frame in SAF format or GRanges object containing complete gene_id metadata column. |
bam_tags |
list defining BAM tags where mapping information is stored.
|
bc_len |
total barcode length |
UMI_len |
UMI length |
stnd |
TRUE to perform strand specific mapping. (default: TRUE) |
fix_chr |
TRUE to add 'chr' to chromosome names, MT to chrM. (default: FALSE) |
outdir |
output folder |
bc_anno |
barcode annotation, first column is cell id, second column is cell barcode sequence |
max_mis |
maximum mismatch allowed in barcode. (default: 1) |
mito |
mitochondrial chromosome name. This should be consistent with the chromosome names in the bam file. |
has_UMI |
whether the protocol contains UMI (default: TRUE) |
UMI_cor |
correct UMI sequencing error: 0 means no correction, 1 means simple correction and merge UMI with distance 1. 2 means merge on both UMI alignment position match. |
gene_fl |
whether to remove low abundance genes. A gene is considered to have low abundance if only one copy of one UMI is associated with it. |
keep_mapped_bam |
TRUE if feature mapped bam file should be retained. |
nthreads |
number of threads to use. (default: 1) |
no return
## Not run: sc_count_aligned_bam( inbam = "aligned.bam", outbam = "mapped.bam", annofn = c("MusMusculus-GRCm38p4-UCSC.gff3", "ERCC92_anno.gff3"), outdir = "output", bc_anno = "barcodes.csv" ) ## End(Not run)
## Not run: sc_count_aligned_bam( inbam = "aligned.bam", outbam = "mapped.bam", annofn = c("MusMusculus-GRCm38p4-UCSC.gff3", "ERCC92_anno.gff3"), outdir = "output", bc_anno = "barcodes.csv" ) ## End(Not run)
Process bam file by cell barcode, output to outdir/count/[cell_id].csv. the output contains information for all reads that can be mapped to exons. including the gene id, UMI of that read and the distance to transcript end position.
sc_demultiplex( inbam, outdir, bc_anno, max_mis = 1, bam_tags = list(am = "YE", ge = "GE", bc = "BC", mb = "OX"), mito = "MT", has_UMI = TRUE, nthreads = 1 )
sc_demultiplex( inbam, outdir, bc_anno, max_mis = 1, bam_tags = list(am = "YE", ge = "GE", bc = "BC", mb = "OX"), mito = "MT", has_UMI = TRUE, nthreads = 1 )
inbam |
input bam file. This should be the output of
|
outdir |
output folder |
bc_anno |
barcode annotation, first column is cell id, second column is cell barcode sequence |
max_mis |
maximum mismatch allowed in barcode. (default: 1) |
bam_tags |
list defining BAM tags where mapping information is stored.
|
mito |
mitochondrial chromosome name. This should be consistent with the chromosome names in the bam file. |
has_UMI |
whether the protocol contains UMI (default: TRUE) |
nthreads |
number of threads to use. (default: 1) |
no return
data_dir="celseq2_demo" barcode_annotation_fn = system.file("extdata", "barcode_anno.csv", package = "scPipe") ## Not run: # refer to the vignettes for the complete workflow ... sc_demultiplex(file.path(data_dir, "out.map.bam"), data_dir, barcode_annotation_fn,has_UMI=FALSE) ... ## End(Not run)
data_dir="celseq2_demo" barcode_annotation_fn = system.file("extdata", "barcode_anno.csv", package = "scPipe") ## Not run: # refer to the vignettes for the complete workflow ... sc_demultiplex(file.path(data_dir, "out.map.bam"), data_dir, barcode_annotation_fn,has_UMI=FALSE) ... ## End(Not run)
Wrapper to run sc_demultiplex
and
sc_gene_counting
with a single command
sc_demultiplex_and_count( inbam, outdir, bc_anno, max_mis = 1, bam_tags = list(am = "YE", ge = "GE", bc = "BC", mb = "OX"), mito = "MT", has_UMI = TRUE, UMI_cor = 1, gene_fl = FALSE, nthreads = 1 )
sc_demultiplex_and_count( inbam, outdir, bc_anno, max_mis = 1, bam_tags = list(am = "YE", ge = "GE", bc = "BC", mb = "OX"), mito = "MT", has_UMI = TRUE, UMI_cor = 1, gene_fl = FALSE, nthreads = 1 )
inbam |
input bam file. This should be the output of
|
outdir |
output folder |
bc_anno |
barcode annotation, first column is cell id, second column is cell barcode sequence |
max_mis |
maximum mismatch allowed in barcode. (default: 1) |
bam_tags |
list defining BAM tags where mapping information is stored.
|
mito |
mitochondrial chromosome name. This should be consistent with the chromosome names in the bam file. |
has_UMI |
whether the protocol contains UMI (default: TRUE) |
UMI_cor |
correct UMI sequencing error: 0 means no correction, 1 means simple correction and merge UMI with distance 1. 2 means merge on both UMI alignment position match. |
gene_fl |
whether to remove low abundance genes. A gene is considered to have low abundance if only one copy of one UMI is associated with it. |
nthreads |
number of threads to use. (default: 1) |
no return
## Not run: refer to the vignettes for the complete workflow, replace demultiplex and count with single command: ... sc_demultiplex_and_count( file.path(data_dir, "out.map.bam"), data_dir, barcode_annotation_fn, has_UMI = FALSE ) ... ## End(Not run)
## Not run: refer to the vignettes for the complete workflow, replace demultiplex and count with single command: ... sc_demultiplex_and_count( file.path(data_dir, "out.map.bam"), data_dir, barcode_annotation_fn, has_UMI = FALSE ) ... ## End(Not run)
Detect cell barcode and generate the barcode annotation
sc_detect_bc( infq, outcsv, prefix = "CELL_", bc_len, max_reads = 1e+06, min_count = 10, number_of_cells = 10000, max_mismatch = 1, white_list_file = NULL )
sc_detect_bc( infq, outcsv, prefix = "CELL_", bc_len, max_reads = 1e+06, min_count = 10, number_of_cells = 10000, max_mismatch = 1, white_list_file = NULL )
infq |
input fastq file, should be the output file of
|
outcsv |
output barcode annotation |
prefix |
the prefix of cell name (default: 'CELL_') |
bc_len |
the length of cell barcode, should be consistent with bl1+bl2
in |
max_reads |
the maximum of reads processed (default: 1,000,000) |
min_count |
minimum counts to keep, barcode will be discarded if
it has lower count. Default value is 10. This should be set according
to |
number_of_cells |
number of cells kept in result. (default: 10000) |
max_mismatch |
the maximum mismatch allowed. Barcodes within this number will be considered as sequence error and merged. (default: 1) |
white_list_file |
a file that list all the possible barcodes each row is a barcode sequence. the list for 10x can be found at: https://community.10xgenomics.com/t5/Data-Sharing/List-of-valid-cellular-barcodes/td-p/527 (default: NULL) |
no return
## Not run: # `sc_detect_bc`` should run before `sc_demultiplex` for # Drop-seq or 10X protocols sc_detect_bc("input.fastq","output.cell_index.csv",bc_len=8) sc_demultiplex(...,"output.cell_index.csv") ## End(Not run)
## Not run: # `sc_detect_bc`` should run before `sc_demultiplex` for # Drop-seq or 10X protocols sc_detect_bc("input.fastq","output.cell_index.csv",bc_len=8) sc_demultiplex(...,"output.cell_index.csv") ## End(Not run)
Map aligned reads to exon annotation. The result will be written into optional fields in bam file with different tags. Following this link for more information regarding to bam file format: http://samtools.github.io/hts-specs
The function can accept multiple bam file as input, if multiple bam file is provided and the 'bc_len' is zero, then the function will use the barcode in the 'barcode_vector' to insert into the 'bc' bam tag. So the length of 'barcode_vector' and the length of 'inbam' should be the same If this is the case then the 'max_mis' argument in 'sc_demultiplex' should be zero. If 'be_len' is larger than zero, then the function will still seek for barcode in fastq headers with given length. In this case each bam file is not treated as from a single cell.
sc_exon_mapping( inbam, outbam, annofn, bam_tags = list(am = "YE", ge = "GE", bc = "BC", mb = "OX"), bc_len = 8, barcode_vector = "", UMI_len = 6, stnd = TRUE, fix_chr = FALSE, nthreads = 1 )
sc_exon_mapping( inbam, outbam, annofn, bam_tags = list(am = "YE", ge = "GE", bc = "BC", mb = "OX"), bc_len = 8, barcode_vector = "", UMI_len = 6, stnd = TRUE, fix_chr = FALSE, nthreads = 1 )
inbam |
input aligned bam file. can have multiple files as input |
outbam |
output bam filename |
annofn |
single string or vector of gff3 annotation filenames, data.frame in SAF format or GRanges object containing complete gene_id metadata column. |
bam_tags |
list defining BAM tags where mapping information is stored.
|
bc_len |
total barcode length |
barcode_vector |
a list of barcode if each individual bam is a single cell. (default: NULL). The barcode should be of the same length for each cell. |
UMI_len |
UMI length |
stnd |
TRUE to perform strand specific mapping. (default: TRUE) |
fix_chr |
TRUE to add 'chr' to chromosome names, MT to chrM. (default: FALSE) |
nthreads |
number of threads to use. (default: 1) |
generates a bam file with exons assigned
data_dir="celseq2_demo" ERCCanno_fn = system.file("extdata", "ERCC92_anno.gff3", package = "scPipe") ## Not run: # for the complete workflow, refer to the vignettes ... sc_exon_mapping(file.path(data_dir, "out.aln.bam"), file.path(data_dir, "out.map.bam"), ERCCanno_fn) ... ## End(Not run)
data_dir="celseq2_demo" ERCCanno_fn = system.file("extdata", "ERCC92_anno.gff3", package = "scPipe") ## Not run: # for the complete workflow, refer to the vignettes ... sc_exon_mapping(file.path(data_dir, "out.aln.bam"), file.path(data_dir, "out.map.bam"), ERCCanno_fn) ... ## End(Not run)
Generate gene counts matrix with UMI deduplication
sc_gene_counting(outdir, bc_anno, UMI_cor = 2, gene_fl = FALSE)
sc_gene_counting(outdir, bc_anno, UMI_cor = 2, gene_fl = FALSE)
outdir |
output folder containing |
bc_anno |
barcode annotation comma-separated-values, first column is cell id, second column is cell barcode sequence |
UMI_cor |
correct UMI sequencing error: 0 means no correction, 1 means simple correction and merge UMI with distance 1. 2 means merge on both UMI alignment position match. |
gene_fl |
whether to remove low abundance genes. A gene is considered to have low abundance if only one copy of one UMI is associated with it. |
no return
data_dir="celseq2_demo" barcode_annotation_fn = system.file("extdata", "barcode_anno.csv", package = "scPipe") ## Not run: # refer to the vignettes for the complete workflow ... sc_gene_counting(data_dir, barcode_annotation_fn) ... ## End(Not run)
data_dir="celseq2_demo" barcode_annotation_fn = system.file("extdata", "barcode_anno.csv", package = "scPipe") ## Not run: # refer to the vignettes for the complete workflow ... sc_gene_counting(data_dir, barcode_annotation_fn) ... ## End(Not run)
Produces a DataFrame containing the UMAP dimensions, as well as all the colData of the sce object for each cell
sc_get_umap_data(sce, n_neighbours = 30)
sc_get_umap_data(sce, n_neighbours = 30)
sce |
The SingleCellExperiment object |
n_neighbours |
No. of neighbours for KNN |
A dataframe containing the UMAP dimensions, as well as all the colData of the sce object for each cell
Generates an integrated SCE object with scRNA-Seq and scATAC-Seq data produced by the scPipe pipelines
sc_integrate( sce_list, barcode_match_file = NULL, sce_column_to_barcode_files = NULL, rev_comp = NULL, cell_line_info = NULL, output_folder = NULL )
sc_integrate( sce_list, barcode_match_file = NULL, sce_column_to_barcode_files = NULL, rev_comp = NULL, cell_line_info = NULL, output_folder = NULL )
sce_list |
A list of SCE objects, named with the corresponding technologies |
barcode_match_file |
A .csv file with columns corresponding to the barcodes for each tech |
sce_column_to_barcode_files |
A list of files containing the barcodes for each tech (if not needed then give a 'NULL' entry) |
rev_comp |
A named list of technologies and logical flags specifying if reverse complements should be applyed for sequences (if not needed then provide a 'NULL' entry) |
cell_line_info |
A list of files, each of which contains 2 columns corresponding to the barcode and cell line for each tech (if not needed then provide a 'NULL' entry) |
output_folder |
The path to the output folder |
Returns a MultiAssayExperiment containing the scRNA-Seq and scATAC-Seq data produced by the scPipe pipelines
## Not run: sc_integrate( sce_list = list("RNA" = sce.rna, "ATAC" = sce.atac), barcode_match_file = bc_match_file, sce_column_to_barcode_files = list("RNA" = rna_bc_anno, "ATAC" = NULL), rev_comp = list("RNA" = FALSE, "ATAC" = TRUE), cell_line_info = list("RNA" = rna_cell_line_info, "ATAC" = atac_cell_line_info,) output_folder = output_folder ) ## End(Not run)
## Not run: sc_integrate( sce_list = list("RNA" = sce.rna, "ATAC" = sce.atac), barcode_match_file = bc_match_file, sce_column_to_barcode_files = list("RNA" = rna_bc_anno, "ATAC" = NULL), rev_comp = list("RNA" = FALSE, "ATAC" = TRUE), cell_line_info = list("RNA" = rna_cell_line_info, "ATAC" = atac_cell_line_info,) output_folder = output_folder ) ## End(Not run)
Can colour the UMAP by any of the colData columns in the SCE object
sc_interactive_umap_plot(sce)
sc_interactive_umap_plot(sce)
sce |
The SingleCellExperiment object |
A shiny object which represents the app. Printing the object or passing it to 'shiny::runApp(...)' will run the app.
Uses feature count data from multiple experiment objects to produce UMAPs for each assay and then overlay them on the same pair of axes
sc_mae_plot_umap(mae, by = NULL, output_file = NULL)
sc_mae_plot_umap(mae, by = NULL, output_file = NULL)
mae |
The MultiAssayExperiment object |
by |
What to colour the points by. Needs to be in colData of all experiments. |
output_file |
The path of the output file |
A ggplot2 object representing the UMAP plot
This data set contains counts for high variable genes for 100 cells. The cells have different cell types. The data contains raw read counts. The cells are chosen randomly from 384 cells and they did not go through quality controls. The rows names are Ensembl gene ids and the columns are cell names, which is the wall position in the 384 plates.
a matrix instance, one row per gene.
NULL, but makes a matrix of count data
Luyi Tian
Christin Biben (WEHI). She FACS sorted cells from several immune cell types including B cells, granulocyte and some early progenitors.
# use the example dataset to perform quality control data("sc_sample_data") data("sc_sample_qc") sce = SingleCellExperiment(assays = list(counts = as.matrix(sc_sample_data))) organism(sce) = "mmusculus_gene_ensembl" gene_id_type(sce) = "ensembl_gene_id" QC_metrics(sce) = sc_sample_qc demultiplex_info(sce) = cell_barcode_matching UMI_dup_info(sce) = UMI_duplication sce = detect_outlier(sce) plot_QC_pairs(sce)
# use the example dataset to perform quality control data("sc_sample_data") data("sc_sample_qc") sce = SingleCellExperiment(assays = list(counts = as.matrix(sc_sample_data))) organism(sce) = "mmusculus_gene_ensembl" gene_id_type(sce) = "ensembl_gene_id" QC_metrics(sce) = sc_sample_qc demultiplex_info(sce) = cell_barcode_matching UMI_dup_info(sce) = UMI_duplication sce = detect_outlier(sce) plot_QC_pairs(sce)
This data.frame contains cell quality control information for the 100 cells. For each cell it has:
unaligned the number of unaligned reads.
aligned_unmapped the number of reads that aligned to genome but fail to map to any features.
mapped_to_exon is the number of reads that mapped to exon.
mapped_to_intron is the number of reads that mapped to intron.
ambiguous_mapping is the number of reads that mapped to multiple features. They are not considered in the following analysis.
mapped_to_ERCC is the number of reads that mapped to ERCC spike-in controls.
mapped_to_MT is the number of reads that mapped to mitochondrial genes.
total_count_per_cell is the number of reads that mapped to exon after UMI deduplication. In contrast, 'mapped_to_exon' is the number of reads mapped to exon before UMI deduplication.
number_of_genes is the number of genes detected for each cells
non_ERCC_percent is 1 - (percentage of ERCC reads). Reads are UMI deduplicated.
non_mt_percent is 1 - (percentage of mitochondrial reads). Reads are UMI deduplicated.
non_ribo_percent is 1- (percentage of ribosomal reads). Reads are UMI deduplicated.
a data.frame instance, one row per cell.
NULL, but makes a data frame with cell quality control data.frame
Luyi Tian
Christin Biben (WEHI). She FACS sorted cells from several immune cell types including B cells, granulocyte and some early progenitors.
data("sc_sample_data") data("sc_sample_qc") sce = SingleCellExperiment(assays = list(counts = as.matrix(sc_sample_data))) organism(sce) = "mmusculus_gene_ensembl" gene_id_type(sce) = "ensembl_gene_id" QC_metrics(sce) = sc_sample_qc head(QC_metrics(sce)) plot_mapping(sce,percentage=TRUE,dataname="sc_sample")
data("sc_sample_data") data("sc_sample_qc") sce = SingleCellExperiment(assays = list(counts = as.matrix(sc_sample_data))) organism(sce) = "mmusculus_gene_ensembl" gene_id_type(sce) = "ensembl_gene_id" QC_metrics(sce) = sc_sample_qc head(QC_metrics(sce)) plot_mapping(sce,percentage=TRUE,dataname="sc_sample")
Reformat fastq files so barcode and UMI sequences are moved from the sequence into the read name.
sc_trim_barcode( outfq, r1, r2 = NULL, read_structure = list(bs1 = -1, bl1 = 0, bs2 = 6, bl2 = 8, us = 0, ul = 6), filter_settings = list(rmlow = TRUE, rmN = TRUE, minq = 20, numbq = 2) )
sc_trim_barcode( outfq, r1, r2 = NULL, read_structure = list(bs1 = -1, bl1 = 0, bs2 = 6, bl2 = 8, us = 0, ul = 6), filter_settings = list(rmlow = TRUE, rmN = TRUE, minq = 20, numbq = 2) )
outfq |
the output fastq file, which reformat the barcode and UMI into
the read name. Files ending in |
r1 |
read one for pair-end reads. This read should contain the transcript. |
r2 |
read two for pair-end reads, NULL if single read. (default: NULL) |
read_structure |
a list containing the read structure configuration:
|
filter_settings |
A list contains read filter settings:
|
Positions used in this function are 0-indexed, so they start from 0
rather than 1. The default read structure in this function represents
CEL-seq paired-ended reads. This contains a transcript in the first read, a
UMI in the first 6bp of the second read followed by a 8bp barcode. So the
read structure will be : list(bs1=-1, bl1=0, bs2=6, bl2=8, us=0,
ul=6)
. bs1=-1, bl1=0
indicates negative start position and zero
length for the barcode on read one, this is used to denote "no barcode" on
read one. bs2=6, bl2=8
indicates there is a barcode in read two that
starts at the 7th base with length 8bp. us=0, ul=6
indicates a UMI
from first base of read two and the length in 6bp.
For a typical Drop-seq experiment the read structure will be
list(bs1=-1, bl1=0, bs2=0, bl2=12, us=12, ul=8)
, which means the read
one only contains transcript, the first 12bp in read two are cell barcode, followed
by a 8bp UMI.
generates a trimmed fastq file named outfq
data_dir="celseq2_demo" ## Not run: # for the complete workflow, refer to the vignettes ... sc_trim_barcode(file.path(data_dir, "combined.fastq"), file.path(data_dir, "simu_R1.fastq"), file.path(data_dir, "simu_R2.fastq")) ... ## End(Not run)
data_dir="celseq2_demo" ## Not run: # for the complete workflow, refer to the vignettes ... sc_trim_barcode(file.path(data_dir, "combined.fastq"), file.path(data_dir, "simu_R1.fastq"), file.path(data_dir, "simu_R2.fastq")) ... ## End(Not run)
The scPipe will do cell barcode demultiplexing, UMI deduplication and quality control on fastq data generated from all protocols
Luyi Tian <[email protected]>; Shian Su <[email protected]>
Returns the TF-IDF normalised version of a binary matrix
TF.IDF.custom(binary.mat, verbose = TRUE)
TF.IDF.custom(binary.mat, verbose = TRUE)
binary.mat |
The binary matrix |
verbose |
boolean flag to print status messages |
Returns the TF-IDF normalised version of a binary matrix
Get or set UMI duplication results in a SingleCellExperiment object
UMI_dup_info(object) UMI_dup_info(object) <- value UMI_dup_info.sce(object) ## S4 method for signature 'SingleCellExperiment' UMI_dup_info(object) ## S4 replacement method for signature 'SingleCellExperiment' UMI_dup_info(object) <- value
UMI_dup_info(object) UMI_dup_info(object) <- value UMI_dup_info.sce(object) ## S4 method for signature 'SingleCellExperiment' UMI_dup_info(object) ## S4 replacement method for signature 'SingleCellExperiment' UMI_dup_info(object) <- value
object |
A |
value |
Value to be assigned to corresponding object. |
a dataframe of cell UMI duplication information
A DataFrame of UMI duplication results.
Luyi Tian
data("sc_sample_data") data("sc_sample_qc") sce = SingleCellExperiment(assays = list(counts = as.matrix(sc_sample_data))) organism(sce) = "mmusculus_gene_ensembl" gene_id_type(sce) = "ensembl_gene_id" QC_metrics(sce) = sc_sample_qc demultiplex_info(sce) = cell_barcode_matching UMI_dup_info(sce) = UMI_duplication head(UMI_dup_info(sce))
data("sc_sample_data") data("sc_sample_qc") sce = SingleCellExperiment(assays = list(counts = as.matrix(sc_sample_data))) organism(sce) = "mmusculus_gene_ensembl" gene_id_type(sce) = "ensembl_gene_id" QC_metrics(sce) = sc_sample_qc demultiplex_info(sce) = cell_barcode_matching UMI_dup_info(sce) = UMI_duplication head(UMI_dup_info(sce))
This data.frame contains UMI duplication statistics, where the first column is the number of duplication, and the second column is the count of UMIs.
a data.frame instance, one row per cell.
NULL, but makes a data frame with UMI dulication statistics
Luyi Tian
Christin Biben (WEHI). She FACS sorted cells from several immune cell types including B cells, granulocyte and some early progenitors.
data("sc_sample_data") data("sc_sample_qc") sce = SingleCellExperiment(assays = list(counts =as.matrix(sc_sample_data))) organism(sce) = "mmusculus_gene_ensembl" gene_id_type(sce) = "ensembl_gene_id" QC_metrics(sce) = sc_sample_qc demultiplex_info(sce) = cell_barcode_matching UMI_dup_info(sce) = UMI_duplication head(UMI_dup_info(sce))
data("sc_sample_data") data("sc_sample_qc") sce = SingleCellExperiment(assays = list(counts =as.matrix(sc_sample_data))) organism(sce) = "mmusculus_gene_ensembl" gene_id_type(sce) = "ensembl_gene_id" QC_metrics(sce) = sc_sample_qc demultiplex_info(sce) = cell_barcode_matching UMI_dup_info(sce) = UMI_duplication head(UMI_dup_info(sce))