Title: | FLAMES: Full Length Analysis of Mutations and Splicing in long read RNA-seq data |
---|---|
Description: | Semi-supervised isoform detection and annotation from both bulk and single-cell long read RNA-seq data. Flames provides automated pipelines for analysing isoforms, as well as intermediate functions for manual execution. |
Authors: | Luyi Tian [aut], Changqing Wang [aut, cre], Yupei You [aut], Oliver Voogd [aut], Jakob Schuster [aut], Shian Su [aut], Matthew Ritchie [ctb] |
Maintainer: | Changqing Wang <[email protected]> |
License: | GPL (>= 3) |
Version: | 1.11.0 |
Built: | 2024-07-20 02:26:54 UTC |
Source: | https://github.com/bioc/FLAMES |
convert the transcript annotation to transcriptome assembly as FASTA file. The genome annotation is first imported as TxDb object and then used to extract transcript sequence from the genome assembly.
annotation_to_fasta(isoform_annotation, genome_fa, outdir, extract_fn)
annotation_to_fasta(isoform_annotation, genome_fa, outdir, extract_fn)
isoform_annotation |
Path to the annotation file (GTF/GFF3) |
genome_fa |
The file path to genome fasta file. |
outdir |
The path to directory to store the transcriptome as |
extract_fn |
(optional) Function to extract |
Path to the outputted transcriptome assembly
fasta <- annotation_to_fasta(system.file("extdata/rps24.gtf.gz", package = "FLAMES"), system.file("extdata/rps24.fa.gz", package = "FLAMES"), tempdir()) cat(readChar(fasta, nchars = 1e3))
fasta <- annotation_to_fasta(system.file("extdata/rps24.gtf.gz", package = "FLAMES"), system.file("extdata/rps24.fa.gz", package = "FLAMES"), tempdir()) cat(readChar(fasta, nchars = 1e3))
Uses BLAZE to generate barcode list and assign reads to cell barcodes.
blaze(expect_cells, fq_in, ...)
blaze(expect_cells, fq_in, ...)
expect_cells |
Integer, expected number of cells. Note: this could be just a rough estimate. E.g., the targeted number of cells. |
fq_in |
File path to the fastq file used as a query sequence file |
... |
Additional BLAZE configuration parameters. E.g., setting ''output-prefix'='some_prefix'' is equivalent to specifying '–output-prefix some_prefix' in BLAZE; Similarly, 'overwrite=TRUE' is equivalent to switch on the '–overwrite' option. Note that the specified parameters will override the parameters specified in the configuration file. All available options can be found at https://github.com/shimlab/BLAZE. |
A data.frame
summarising the reads aligned. Other outputs are written to disk.
The details of the output files can be found at https://github.com/shimlab/BLAZE.
temp_path <- tempfile() bfc <- BiocFileCache::BiocFileCache(temp_path, ask = FALSE) fastq1_url <- 'https://raw.githubusercontent.com/shimlab/BLAZE/main/test/data/FAR20033_pass_51e510db_100.fastq' fastq1 <- bfc[[names(BiocFileCache::bfcadd(bfc, 'Fastq1', fastq1_url))]] outdir <- tempfile() dir.create(outdir) ## Not run: blaze(expect_cells=10, fastq1, overwrite=TRUE) ## End(Not run)
temp_path <- tempfile() bfc <- BiocFileCache::BiocFileCache(temp_path, ask = FALSE) fastq1_url <- 'https://raw.githubusercontent.com/shimlab/BLAZE/main/test/data/FAR20033_pass_51e510db_100.fastq' fastq1 <- bfc[[names(BiocFileCache::bfcadd(bfc, 'Fastq1', fastq1_url))]] outdir <- tempfile() dir.create(outdir) ## Not run: blaze(expect_cells=10, fastq1, overwrite=TRUE) ## End(Not run)
Semi-supervised isofrom detection and annotation for long read data. This variant is meant for bulk samples. Specific parameters relating to analysis can be changed either through function arguments, or through a configuration JSON file.
bulk_long_pipeline( annotation, fastq, outdir, genome_fa, minimap2 = NULL, k8 = NULL, config_file = NULL )
bulk_long_pipeline( annotation, fastq, outdir, genome_fa, minimap2 = NULL, k8 = NULL, config_file = NULL )
annotation |
The file path to the annotation file in GFF3 format |
fastq |
The file path to input fastq file |
outdir |
The path to directory to store all output files. |
genome_fa |
The file path to genome fasta file. |
minimap2 |
Path to minimap2, if it is not in PATH. Only required if either or both of
|
k8 |
Path to the k8 Javascript shell binary. Only required if |
config_file |
File path to the JSON configuration file. If specified, |
By default FLAMES use minimap2 for read alignment. After the genome alignment step (do_genome_align
), FLAMES summarizes the alignment for each read by grouping reads
with similar splice junctions to get a raw isoform annotation (do_isoform_id
). The raw isoform
annotation is compared against the reference annotation to correct potential splice site
and transcript start/end errors. Transcripts that have similar splice junctions
and transcript start/end to the reference transcript are merged with the
reference. This process will also collapse isoforms that are likely to be truncated
transcripts. If isoform_id_bambu
is set to TRUE
, bambu::bambu
will be used to generate the updated annotations.
Next is the read realignment step (do_read_realign
), where the sequence of each transcript from the update annotation is extracted, and
the reads are realigned to this updated transcript_assembly.fa
by minimap2. The
transcripts with only a few full-length aligned reads are discarded.
The reads are assigned to transcripts based on both alignment score, fractions of
reads aligned and transcript coverage. Reads that cannot be uniquely assigned to
transcripts or have low transcript coverage are discarded. The UMI transcript
count matrix is generated by collapsing the reads with the same UMI in a similar
way to what is done for short-read scRNA-seq data, but allowing for an edit distance
of up to 2 by default. Most of the parameters, such as the minimal distance to splice site and minimal percentage of transcript coverage
can be modified by the JSON configuration file (config_file
).
The default parameters can be changed either through the function
arguments are through the configuration JSON file config_file
. the pipeline_parameters
section specifies which steps are to be executed in the pipeline - by default, all
steps are executed. The isoform_parameters
section affects isoform detection - key
parameters include:
Min_sup_cnt
which causes transcripts with less reads aligned than
it's value to be discarded
MAX_TS_DIST
which merges transcripts with the same intron
chain and TSS/TES distace less than MAX_TS_DIST
strand_specific
which specifies if reads are in the same strand as the mRNA (1),
or the reverse complemented (-1) or not strand specific (0), which results in
strand information being based on reference annotation.
if do_transcript_quantification
set to true, bulk_long_pipeline
returns a SummarizedExperiment object, containing a count
matrix as an assay, gene annotations under metadata, as well as a list of the other
output files generated by the pipeline. The pipeline also outputs a number of output
files into the given outdir
directory. These output files generated by the pipeline are:
transcript_count.csv.gz - a transcript count matrix (also contained in the SummarizedExperiment)
isoform_annotated.filtered.gff3 - isoforms in gff3 format (also contained in the SummarizedExperiment)
transcript_assembly.fa - transcript sequence from the isoforms
align2genome.bam - sorted BAM file with reads aligned to genome
realign2transcript.bam - sorted realigned BAM file using the transcript_assembly.fa as reference
tss_tes.bedgraph - TSS TES enrichment for all reads (for QC)
if do_transcript_quantification
set to false, nothing will be returned
sc_long_pipeline()
for single cell data,
SummarizedExperiment()
for how data is outputted
# download the two fastq files, move them to a folder to be merged together temp_path <- tempfile() bfc <- BiocFileCache::BiocFileCache(temp_path, ask = FALSE) file_url <- "https://raw.githubusercontent.com/OliverVoogd/FLAMESData/master/data" # download the required fastq files, and move them to new folder fastq1 <- bfc[[names(BiocFileCache::bfcadd(bfc, "Fastq1", paste(file_url, "fastq/sample1.fastq.gz", sep = "/")))]] fastq2 <- bfc[[names(BiocFileCache::bfcadd(bfc, "Fastq2", paste(file_url, "fastq/sample2.fastq.gz", sep = "/")))]] annotation <- bfc[[names(BiocFileCache::bfcadd(bfc, "annot.gtf", paste(file_url, "SIRV_isoforms_multi-fasta-annotation_C_170612a.gtf", sep = "/")))]] genome_fa <- bfc[[names(BiocFileCache::bfcadd(bfc, "genome.fa", paste(file_url, "SIRV_isoforms_multi-fasta_170612a.fasta", sep = "/")))]] fastq_dir <- paste(temp_path, "fastq_dir", sep = "/") # the downloaded fastq files need to be in a directory to be merged together dir.create(fastq_dir) file.copy(c(fastq1, fastq2), fastq_dir) unlink(c(fastq1, fastq2)) # the original files can be deleted outdir <- tempfile() dir.create(outdir) if (!any(is.na(sys_which(c("minimap2", "k8"))))) { se <- bulk_long_pipeline( annotation = annotation, fastq = fastq_dir, outdir = outdir, genome_fa = genome_fa, config_file = system.file("extdata/SIRV_config_default.json", package = "FLAMES") ) se_2 <- create_se_from_dir(outdir = outdir, annotation = annotation) }
# download the two fastq files, move them to a folder to be merged together temp_path <- tempfile() bfc <- BiocFileCache::BiocFileCache(temp_path, ask = FALSE) file_url <- "https://raw.githubusercontent.com/OliverVoogd/FLAMESData/master/data" # download the required fastq files, and move them to new folder fastq1 <- bfc[[names(BiocFileCache::bfcadd(bfc, "Fastq1", paste(file_url, "fastq/sample1.fastq.gz", sep = "/")))]] fastq2 <- bfc[[names(BiocFileCache::bfcadd(bfc, "Fastq2", paste(file_url, "fastq/sample2.fastq.gz", sep = "/")))]] annotation <- bfc[[names(BiocFileCache::bfcadd(bfc, "annot.gtf", paste(file_url, "SIRV_isoforms_multi-fasta-annotation_C_170612a.gtf", sep = "/")))]] genome_fa <- bfc[[names(BiocFileCache::bfcadd(bfc, "genome.fa", paste(file_url, "SIRV_isoforms_multi-fasta_170612a.fasta", sep = "/")))]] fastq_dir <- paste(temp_path, "fastq_dir", sep = "/") # the downloaded fastq files need to be in a directory to be merged together dir.create(fastq_dir) file.copy(c(fastq1, fastq2), fastq_dir) unlink(c(fastq1, fastq2)) # the original files can be deleted outdir <- tempfile() dir.create(outdir) if (!any(is.na(sys_which(c("minimap2", "k8"))))) { se <- bulk_long_pipeline( annotation = annotation, fastq = fastq_dir, outdir = outdir, genome_fa = genome_fa, config_file = system.file("extdata/SIRV_config_default.json", package = "FLAMES") ) se_2 <- create_se_from_dir(outdir = outdir, annotation = annotation) }
Combine long- and short-read SingleCellExperiment objects
combine_sce( short_read_large, short_read_small, long_read_sce, remove_duplicates = FALSE )
combine_sce( short_read_large, short_read_small, long_read_sce, remove_duplicates = FALSE )
short_read_large |
The SCE object, or path to the HDF5 file, or folder containing the matrix file, corresponding to the larger short-read sample |
short_read_small |
The SCE object, or path to the HDF5 file, or folder containing the matrix file, corresponding to the smaller short-read sample |
long_read_sce |
The SCE object of the transcript counts, from the long-read pipelines. |
remove_duplicates |
determines whether cells with duplicated barcodes aer kept in the smaller library ( they are always removed from the larger library) |
Takes the long-read SCE object from the long-read pipeline and the short-read SCE object,
creates a MultiAssayExperiment
object with the two SingleCellExperiment
objects.
Cells with duplicated barcodes are removed from the larger library.
A MultiAssayExperiment
object, with 'gene_counts' and 'transcript_counts' experiments.
library(SingleCellExperiment) a <- SingleCellExperiment(assays = list(counts = matrix(rpois(100, 5), ncol = 10))) b <- SingleCellExperiment(assays = list(counts = matrix(rpois(100, 5), ncol = 10))) long_read <- SingleCellExperiment(assays = list(counts = matrix(rpois(100, 5), ncol = 10))) colData(a)$Barcode <- paste0(1:10, '-1') colData(b)$Barcode <- paste0(8:17, '-1') colnames(long_read) <- as.character(2:11) rownames(a) <- as.character(101:110) rownames(b) <- as.character(103:112) rownames(long_read) <- as.character(1001:1010) combine_sce(short_read_large = a, short_read_small = b, long_read_sce = long_read)
library(SingleCellExperiment) a <- SingleCellExperiment(assays = list(counts = matrix(rpois(100, 5), ncol = 10))) b <- SingleCellExperiment(assays = list(counts = matrix(rpois(100, 5), ncol = 10))) long_read <- SingleCellExperiment(assays = list(counts = matrix(rpois(100, 5), ncol = 10))) colData(a)$Barcode <- paste0(1:10, '-1') colData(b)$Barcode <- paste0(8:17, '-1') colnames(long_read) <- as.character(2:11) rownames(a) <- as.character(101:110) rownames(b) <- as.character(103:112) rownames(long_read) <- as.character(1001:1010) combine_sce(short_read_large = a, short_read_small = b, long_read_sce = long_read)
Create Configuration File From Arguments
create_config(outdir, type = "sc_3end", ...)
create_config(outdir, type = "sc_3end", ...)
outdir |
the destination directory for the configuratio nfile |
type |
use an example config, available values:
|
... |
Configuration parameters.
|
Create a list object containing the arguments supplied in a format usable for the FLAMES pipeline.
Also writes the object to a JSON file, which is located with the prefix 'config_' in the supplied outdir
.
Default values from extdata/config_sclr_nanopore_3end.json
will be used for unprovided parameters.
file path to the config file created
# create the default configuration file outdir <- tempdir() config <- create_config(outdir)
# create the default configuration file outdir <- tempdir() config <- create_config(outdir)
SingleCellExperiment
object from FLAMES
output folderCreate SingleCellExperiment
object from FLAMES
output folder
create_sce_from_dir(outdir, annotation)
create_sce_from_dir(outdir, annotation)
outdir |
The folder containing |
annotation |
(Optional) the annotation file that was used to produce the output files |
a list of SingleCellExperiment
objects if multiple transcript matrices were
found in the output folder, or a SingleCellExperiment
object if only one were found
outdir <- tempfile() dir.create(outdir) bc_allow <- file.path(outdir, "bc_allow.tsv") genome_fa <- file.path(outdir, "rps24.fa") R.utils::gunzip(filename = system.file("extdata/bc_allow.tsv.gz", package = "FLAMES"), destname = bc_allow, remove = FALSE) R.utils::gunzip(filename = system.file("extdata/rps24.fa.gz", package = "FLAMES"), destname = genome_fa, remove = FALSE) annotation <- system.file("extdata/rps24.gtf.gz", package = "FLAMES") if (!any(is.na(sys_which(c("minimap2", "k8"))))) { sce <- FLAMES::sc_long_pipeline( genome_fa = genome_fa, fastq = system.file("extdata/fastq", package = "FLAMES"), annotation = annotation, outdir = outdir, barcodes_file = bc_allow ) sce_2 <- create_sce_from_dir(outdir, annotation) }
outdir <- tempfile() dir.create(outdir) bc_allow <- file.path(outdir, "bc_allow.tsv") genome_fa <- file.path(outdir, "rps24.fa") R.utils::gunzip(filename = system.file("extdata/bc_allow.tsv.gz", package = "FLAMES"), destname = bc_allow, remove = FALSE) R.utils::gunzip(filename = system.file("extdata/rps24.fa.gz", package = "FLAMES"), destname = genome_fa, remove = FALSE) annotation <- system.file("extdata/rps24.gtf.gz", package = "FLAMES") if (!any(is.na(sys_which(c("minimap2", "k8"))))) { sce <- FLAMES::sc_long_pipeline( genome_fa = genome_fa, fastq = system.file("extdata/fastq", package = "FLAMES"), annotation = annotation, outdir = outdir, barcodes_file = bc_allow ) sce_2 <- create_sce_from_dir(outdir, annotation) }
SummarizedExperiment
object from FLAMES
output folderCreate SummarizedExperiment
object from FLAMES
output folder
create_se_from_dir(outdir, annotation)
create_se_from_dir(outdir, annotation)
outdir |
The folder containing |
annotation |
(Optional) the annotation file that was used to produce the output files |
a SummarizedExperiment
object
# download the two fastq files, move them to a folder to be merged together temp_path <- tempfile() bfc <- BiocFileCache::BiocFileCache(temp_path, ask = FALSE) file_url <- "https://raw.githubusercontent.com/OliverVoogd/FLAMESData/master/data" # download the required fastq files, and move them to new folder fastq1 <- bfc[[names(BiocFileCache::bfcadd(bfc, "Fastq1", paste(file_url, "fastq/sample1.fastq.gz", sep = "/")))]] fastq2 <- bfc[[names(BiocFileCache::bfcadd(bfc, "Fastq2", paste(file_url, "fastq/sample2.fastq.gz", sep = "/")))]] annotation <- bfc[[names(BiocFileCache::bfcadd(bfc, "annot.gtf", paste(file_url, "SIRV_isoforms_multi-fasta-annotation_C_170612a.gtf", sep = "/")))]] genome_fa <- bfc[[names(BiocFileCache::bfcadd(bfc, "genome.fa", paste(file_url, "SIRV_isoforms_multi-fasta_170612a.fasta", sep = "/")))]] fastq_dir <- paste(temp_path, "fastq_dir", sep = "/") # the downloaded fastq files need to be in a directory to be merged together dir.create(fastq_dir) file.copy(c(fastq1, fastq2), fastq_dir) unlink(c(fastq1, fastq2)) # the original files can be deleted outdir <- tempfile() dir.create(outdir) if (!any(is.na(sys_which(c("minimap2", "k8"))))) { se <- bulk_long_pipeline( annotation = annotation, fastq = fastq_dir, outdir = outdir, genome_fa = genome_fa, config_file = system.file("extdata/SIRV_config_default.json", package = "FLAMES") ) se_2 <- create_se_from_dir(outdir = outdir, annotation = annotation) }
# download the two fastq files, move them to a folder to be merged together temp_path <- tempfile() bfc <- BiocFileCache::BiocFileCache(temp_path, ask = FALSE) file_url <- "https://raw.githubusercontent.com/OliverVoogd/FLAMESData/master/data" # download the required fastq files, and move them to new folder fastq1 <- bfc[[names(BiocFileCache::bfcadd(bfc, "Fastq1", paste(file_url, "fastq/sample1.fastq.gz", sep = "/")))]] fastq2 <- bfc[[names(BiocFileCache::bfcadd(bfc, "Fastq2", paste(file_url, "fastq/sample2.fastq.gz", sep = "/")))]] annotation <- bfc[[names(BiocFileCache::bfcadd(bfc, "annot.gtf", paste(file_url, "SIRV_isoforms_multi-fasta-annotation_C_170612a.gtf", sep = "/")))]] genome_fa <- bfc[[names(BiocFileCache::bfcadd(bfc, "genome.fa", paste(file_url, "SIRV_isoforms_multi-fasta_170612a.fasta", sep = "/")))]] fastq_dir <- paste(temp_path, "fastq_dir", sep = "/") # the downloaded fastq files need to be in a directory to be merged together dir.create(fastq_dir) file.copy(c(fastq1, fastq2), fastq_dir) unlink(c(fastq1, fastq2)) # the original files can be deleted outdir <- tempfile() dir.create(outdir) if (!any(is.na(sys_which(c("minimap2", "k8"))))) { se <- bulk_long_pipeline( annotation = annotation, fastq = fastq_dir, outdir = outdir, genome_fa = genome_fa, config_file = system.file("extdata/SIRV_config_default.json", package = "FLAMES") ) se_2 <- create_se_from_dir(outdir = outdir, annotation = annotation) }
trim TSO adaptor with cutadapt
cutadapt(args)
cutadapt(args)
args |
arguments to be passed to cutadapt |
Exit code of cutadapt
## Not run: cutadapt("-h") ## End(Not run)
## Not run: cutadapt("-h") ## End(Not run)
Demultiplex reads using the cell_umi_gene.tsv
file from Sockeye.
demultiplex_sockeye(fastq_dir, sockeye_tsv, out_fq)
demultiplex_sockeye(fastq_dir, sockeye_tsv, out_fq)
fastq_dir |
The folder containing FASTQ files from Sockeye's output under |
sockeye_tsv |
The |
out_fq |
The output FASTQ file. |
returns NULL
Removes isoform annotations that could produce ambigious reads, such as isoforms that only differ by the 5' / 3' end. This could be useful for plotting average coverage plots.
filter_annotation(annotation, keep = "tss_differ")
filter_annotation(annotation, keep = "tss_differ")
annotation |
path to the GTF annotation file, or the parsed GenomicRanges object. |
keep |
string, one of 'tss_differ' (only keep isoforms that all differ by the transcription start site position), 'tes_differ' (only keep those that differ by the transcription end site position), 'both' (only keep those that differ by both the start and end site), or 'single_transcripts' (only keep genes that contains a sinlge transcript). |
GenomicRanges of the filtered isoforms
filtered_annotation <- filter_annotation(system.file('extdata/rps24.gtf.gz', package = 'FLAMES'), keep = 'tes_differ') filtered_annotation
filtered_annotation <- filter_annotation(system.file('extdata/rps24.gtf.gz', package = 'FLAMES'), keep = 'tes_differ') filtered_annotation
demultiplex reads with flexiplex
find_barcode( fastq, barcodes_file, max_bc_editdistance = 2, max_flank_editdistance = 8, reads_out, stats_out, threads = 1, pattern = c(primer = "CTACACGACGCTCTTCCGATCT", BC = paste0(rep("N", 16), collapse = ""), UMI = paste0(rep("N", 12), collapse = ""), polyT = paste0(rep("T", 9), collapse = "")), TSO_seq = "", TSO_prime = 3, full_length_only = FALSE )
find_barcode( fastq, barcodes_file, max_bc_editdistance = 2, max_flank_editdistance = 8, reads_out, stats_out, threads = 1, pattern = c(primer = "CTACACGACGCTCTTCCGATCT", BC = paste0(rep("N", 16), collapse = ""), UMI = paste0(rep("N", 12), collapse = ""), polyT = paste0(rep("T", 9), collapse = "")), TSO_seq = "", TSO_prime = 3, full_length_only = FALSE )
fastq |
input FASTQ file path |
barcodes_file |
path to file containing barcode allow-list, with one barcode in each line |
max_bc_editdistance |
max edit distances for the barcode sequence |
max_flank_editdistance |
max edit distances for the flanking sequences (primer and polyT) |
reads_out |
path of output FASTQ file |
stats_out |
path of output stats file |
threads |
number of threads to be used |
pattern |
named character vector defining the barcode pattern |
TSO_seq |
TSO sequence to be trimmed |
TSO_prime |
either 3 (when |
full_length_only |
boolean, when TSO sequence is provided, whether reads without TSO are to be discarded |
invisible()
outdir <- tempfile() dir.create(outdir) bc_allow <- file.path(outdir, "bc_allow.tsv") R.utils::gunzip(filename = system.file("extdata/bc_allow.tsv.gz", package = "FLAMES"), destname = bc_allow, remove = FALSE) find_barcode( fastq = system.file("extdata/fastq", package = "FLAMES"), stats_out = file.path(outdir, "bc_stat"), reads_out = file.path(outdir, "demultiplexed.fq.gz"), barcodes_file = bc_allow )
outdir <- tempfile() dir.create(outdir) bc_allow <- file.path(outdir, "bc_allow.tsv") R.utils::gunzip(filename = system.file("extdata/bc_allow.tsv.gz", package = "FLAMES"), destname = bc_allow, remove = FALSE) find_barcode( fastq = system.file("extdata/fastq", package = "FLAMES"), stats_out = file.path(outdir, "bc_stat"), reads_out = file.path(outdir, "demultiplexed.fq.gz"), barcodes_file = bc_allow )
Long-read isoform identification with FLAMES or bambu.
find_isoform(annotation, genome_fa, genome_bam, outdir, config)
find_isoform(annotation, genome_fa, genome_bam, outdir, config)
annotation |
Path to annotation file. If configured to use bambu, the annotation must be provided as GTF file. |
genome_fa |
The file path to genome fasta file. |
genome_bam |
File path to BAM alignment file. Multiple files could be provided. |
outdir |
The path to directory to store all output files. |
config |
Parsed FLAMES configurations. |
The updated annotation and the transcriptome assembly will be saved in the
output folder as isoform_annotated.gff3
(GTF if bambu is selected) and
transcript_assembly.fa
respectively.
temp_path <- tempfile() bfc <- BiocFileCache::BiocFileCache(temp_path, ask = FALSE) file_url <- "https://raw.githubusercontent.com/OliverVoogd/FLAMESData/master/data" fastq1 <- bfc[[names(BiocFileCache::bfcadd(bfc, "Fastq1", paste(file_url, "fastq/sample1.fastq.gz", sep = "/")))]] genome_fa <- bfc[[names(BiocFileCache::bfcadd(bfc, "genome.fa", paste(file_url, "SIRV_isoforms_multi-fasta_170612a.fasta", sep = "/")))]] annotation <- bfc[[names(BiocFileCache::bfcadd(bfc, "annot.gtf", paste(file_url, "SIRV_isoforms_multi-fasta-annotation_C_170612a.gtf", sep = "/")))]] outdir <- tempfile() dir.create(outdir) if (!any(is.na(sys_which(c("minimap2", "k8"))))) { config <- jsonlite::fromJSON(system.file("extdata/SIRV_config_default.json", package = "FLAMES")) minimap2_align( config = config, fa_file = genome_fa, fq_in = fastq1, annot = annotation, outdir = outdir ) ## Not run: find_isoform( annotation = annotation, genome_fa = genome_fa, genome_bam = file.path(outdir, "align2genome.bam"), outdir = outdir, config = config ) ## End(Not run) }
temp_path <- tempfile() bfc <- BiocFileCache::BiocFileCache(temp_path, ask = FALSE) file_url <- "https://raw.githubusercontent.com/OliverVoogd/FLAMESData/master/data" fastq1 <- bfc[[names(BiocFileCache::bfcadd(bfc, "Fastq1", paste(file_url, "fastq/sample1.fastq.gz", sep = "/")))]] genome_fa <- bfc[[names(BiocFileCache::bfcadd(bfc, "genome.fa", paste(file_url, "SIRV_isoforms_multi-fasta_170612a.fasta", sep = "/")))]] annotation <- bfc[[names(BiocFileCache::bfcadd(bfc, "annot.gtf", paste(file_url, "SIRV_isoforms_multi-fasta-annotation_C_170612a.gtf", sep = "/")))]] outdir <- tempfile() dir.create(outdir) if (!any(is.na(sys_which(c("minimap2", "k8"))))) { config <- jsonlite::fromJSON(system.file("extdata/SIRV_config_default.json", package = "FLAMES")) minimap2_align( config = config, fa_file = genome_fa, fq_in = fastq1, annot = annotation, outdir = outdir ) ## Not run: find_isoform( annotation = annotation, genome_fa = genome_fa, genome_bam = file.path(outdir, "align2genome.bam"), outdir = outdir, config = config ) ## End(Not run) }
Treat each bam file as a bulk sample and identify variants against the reference
find_variants( bam_path, reference, annotation, min_nucleotide_depth = 100, homopolymer_window = 3, annotated_region_only = FALSE, names_from = "gene_name", threads = 1 )
find_variants( bam_path, reference, annotation, min_nucleotide_depth = 100, homopolymer_window = 3, annotated_region_only = FALSE, names_from = "gene_name", threads = 1 )
bam_path |
character(1) or character(n): path to the bam file(s) aligned to the reference genome (NOT the transcriptome!). |
reference |
DNAStringSet: the reference genome |
annotation |
GRanges: the annotation of the reference genome. You can load
a GTF/GFF annotation file with |
min_nucleotide_depth |
integer(1): minimum read depth for a position to be considered a variant. |
homopolymer_window |
integer(1): the window size to calculate the homopolymer
percentage. The homopolymer percentage is calculated as the percentage of the most
frequent nucleotide in a window of |
annotated_region_only |
logical(1): whether to only consider variants outside
annotated regions. If |
names_from |
character(1): the column name in the metadata column of the annotation
( |
threads |
integer(1): number of threads to use. Threading is done over each
annotated region and (if |
Each bam file is treated as a bulk sample to perform pileup and identify variants.
You can run sc_mutations
with the variants identified with this function
to get single-cell allele counts. Note that reference genome FASTA files may have
the chromosome names field as '>chr1 1' instead of '>chr1'. You may need to remove
the trailing number to match the chromosome names in the bam file, for example with
names(ref) <- sapply(names(ref), function(x) strsplit(x, " ")[[1]][1])
.
A tibble with columns: seqnames, pos, nucleotide, count, sum, freq, ref, region,
homopolymer_pct, bam_path The homopolymer percentage is calculated as the percentage of the
most frequent nucleotide in a window of homopolymer_window
nucleotides around
the variant position, excluding the variant position itself.
outdir <- tempfile() dir.create(outdir) genome_fa <- file.path(outdir, "rps24.fa") R.utils::gunzip(filename = system.file("extdata/rps24.fa.gz", package = "FLAMES"), destname = genome_fa, remove = FALSE) download.file("https://raw.githubusercontent.com/mritchielab/FLAMES/devel/tests/testthat/demultiplexed.fq", destfile = file.path(outdir, "demultipelxed.fq") ) # can't be bothered to run demultiplexing again if (!any(is.na(sys_which(c("minimap2", "k8"))))) { minimap2_align( # align to genome config = jsonlite::fromJSON(system.file("extdata/SIRV_config_default.json", package = "FLAMES")), fa_file = genome_fa, fq_in = file.path(outdir, "demultipelxed.fq"), annot = system.file("extdata/rps24.gtf.gz", package = "FLAMES"), outdir = outdir ) variants <- find_variants( bam_path = file.path(outdir, "align2genome.bam"), reference = genome_fa, annotation = GenomicRanges::GRanges("chr14", IRanges::IRanges(1, 1)), min_nucleotide_depth = 10 ) head(variants) }
outdir <- tempfile() dir.create(outdir) genome_fa <- file.path(outdir, "rps24.fa") R.utils::gunzip(filename = system.file("extdata/rps24.fa.gz", package = "FLAMES"), destname = genome_fa, remove = FALSE) download.file("https://raw.githubusercontent.com/mritchielab/FLAMES/devel/tests/testthat/demultiplexed.fq", destfile = file.path(outdir, "demultipelxed.fq") ) # can't be bothered to run demultiplexing again if (!any(is.na(sys_which(c("minimap2", "k8"))))) { minimap2_align( # align to genome config = jsonlite::fromJSON(system.file("extdata/SIRV_config_default.json", package = "FLAMES")), fa_file = genome_fa, fq_in = file.path(outdir, "demultipelxed.fq"), annot = system.file("extdata/rps24.gtf.gz", package = "FLAMES"), outdir = outdir ) variants <- find_variants( bam_path = file.path(outdir, "align2genome.bam"), reference = genome_fa, annotation = GenomicRanges::GRanges("chr14", IRanges::IRanges(1, 1)), min_nucleotide_depth = 10 ) head(variants) }
FLAMES: full-length analysis of mutations and splicing
demultiplex reads with flexiplex, for detailed description, see documentation for the original flexiplex: https://davidsongroup.github.io/flexiplex
flexiplex( reads_in, barcodes_file, bc_as_readid, max_bc_editdistance, max_flank_editdistance, pattern, reads_out, stats_out, bc_out, n_threads )
flexiplex( reads_in, barcodes_file, bc_as_readid, max_bc_editdistance, max_flank_editdistance, pattern, reads_out, stats_out, bc_out, n_threads )
reads_in |
Input FASTQ or FASTA file |
barcodes_file |
barcode allow-list file |
bc_as_readid |
bool, whether to add the demultiplexed barcode to the read ID field |
max_bc_editdistance |
max edit distance for barcode ' |
max_flank_editdistance |
max edit distance for the flanking sequences ' |
pattern |
StringVector defining the barcode structure, see [find_barcode] |
reads_out |
output file for demultiplexed reads |
stats_out |
output file for demultiplexed stats |
bc_out |
WIP |
n_threads |
number of threads to be used during demultiplexing |
integer return value. 0 represents normal return.
Parse FLAMES' GFF ouputs into a Genomic Ranges List
get_GRangesList(file)
get_GRangesList(file)
file |
the GFF file to parse |
A Genomic Ranges List
temp_path <- tempfile() bfc <- BiocFileCache::BiocFileCache(temp_path, ask = FALSE) file_url <- "https://raw.githubusercontent.com/OliverVoogd/FLAMESData/master/data" fastq1 <- bfc[[names(BiocFileCache::bfcadd(bfc, "Fastq1", paste(file_url, "fastq/sample1.fastq.gz", sep = "/")))]] genome_fa <- bfc[[names(BiocFileCache::bfcadd(bfc, "genome.fa", paste(file_url, "SIRV_isoforms_multi-fasta_170612a.fasta", sep = "/")))]] annotation <- bfc[[names(BiocFileCache::bfcadd(bfc, "annot.gtf", paste(file_url, "SIRV_isoforms_multi-fasta-annotation_C_170612a.gtf", sep = "/")))]] outdir <- tempfile() dir.create(outdir) if (!any(is.na(sys_which(c("minimap2", "k8"))))) { config <- jsonlite::fromJSON(system.file("extdata/SIRV_config_default.json", package = "FLAMES")) minimap2_align( config = config, fa_file = genome_fa, fq_in = fastq1, annot = annotation, outdir = outdir ) find_isoform( annotation = annotation, genome_fa = genome_fa, genome_bam = file.path(outdir, "align2genome.bam"), outdir = outdir, config = config ) grlist <- get_GRangesList(file = file.path(outdir, "isoform_annotated.gff3")) }
temp_path <- tempfile() bfc <- BiocFileCache::BiocFileCache(temp_path, ask = FALSE) file_url <- "https://raw.githubusercontent.com/OliverVoogd/FLAMESData/master/data" fastq1 <- bfc[[names(BiocFileCache::bfcadd(bfc, "Fastq1", paste(file_url, "fastq/sample1.fastq.gz", sep = "/")))]] genome_fa <- bfc[[names(BiocFileCache::bfcadd(bfc, "genome.fa", paste(file_url, "SIRV_isoforms_multi-fasta_170612a.fasta", sep = "/")))]] annotation <- bfc[[names(BiocFileCache::bfcadd(bfc, "annot.gtf", paste(file_url, "SIRV_isoforms_multi-fasta-annotation_C_170612a.gtf", sep = "/")))]] outdir <- tempfile() dir.create(outdir) if (!any(is.na(sys_which(c("minimap2", "k8"))))) { config <- jsonlite::fromJSON(system.file("extdata/SIRV_config_default.json", package = "FLAMES")) minimap2_align( config = config, fa_file = genome_fa, fq_in = fastq1, annot = annotation, outdir = outdir ) find_isoform( annotation = annotation, genome_fa = genome_fa, genome_bam = file.path(outdir, "align2genome.bam"), outdir = outdir, config = config ) grlist <- get_GRangesList(file = file.path(outdir, "isoform_annotated.gff3")) }
Uses minimap2 to align sequences agains a reference databse.
Uses options '-ax splice -t 12 -k14 –secondary=no fa_file
fq_in
'
minimap2_align( config, fa_file, fq_in, annot, outdir, minimap2 = NA, k8 = NA, samtools = NA, prefix = NULL, threads = 1 )
minimap2_align( config, fa_file, fq_in, annot, outdir, minimap2 = NA, k8 = NA, samtools = NA, prefix = NULL, threads = 1 )
config |
Parsed list of FLAMES config file |
fa_file |
Path to the fasta file used as a reference database for alignment |
fq_in |
File path to the fastq file used as a query sequence file |
annot |
Genome annotation file used to create junction bed files |
outdir |
Output folder |
minimap2 |
Path to minimap2 binary |
k8 |
Path to the k8 Javascript shell binary |
samtools |
path to the samtools binary, required for large datasets since |
prefix |
String, the prefix (e.g. sample name) for the outputted BAM file |
threads |
Integer, threads for minimap2 to use, see minimap2 documentation for details, FLAMES will try to detect cores if this parameter is not provided. |
a data.frame
summarising the reads aligned
[minimap2_realign()]
temp_path <- tempfile() bfc <- BiocFileCache::BiocFileCache(temp_path, ask = FALSE) file_url <- 'https://raw.githubusercontent.com/OliverVoogd/FLAMESData/master/data' fastq1 <- bfc[[names(BiocFileCache::bfcadd(bfc, 'Fastq1', paste(file_url, 'fastq/sample1.fastq.gz', sep = '/')))]] genome_fa <- bfc[[names(BiocFileCache::bfcadd(bfc, 'genome.fa', paste(file_url, 'SIRV_isoforms_multi-fasta_170612a.fasta', sep = '/')))]] annotation <- bfc[[names(BiocFileCache::bfcadd(bfc, 'annot.gtf', paste(file_url, 'SIRV_isoforms_multi-fasta-annotation_C_170612a.gtf', sep = '/')))]] outdir <- tempfile() dir.create(outdir) if (!any(is.na(sys_which(c("minimap2", "k8"))))) { minimap2_align( config = jsonlite::fromJSON(system.file('extdata/SIRV_config_default.json', package = 'FLAMES')), fa_file = genome_fa, fq_in = fastq1, annot = annotation, outdir = outdir ) }
temp_path <- tempfile() bfc <- BiocFileCache::BiocFileCache(temp_path, ask = FALSE) file_url <- 'https://raw.githubusercontent.com/OliverVoogd/FLAMESData/master/data' fastq1 <- bfc[[names(BiocFileCache::bfcadd(bfc, 'Fastq1', paste(file_url, 'fastq/sample1.fastq.gz', sep = '/')))]] genome_fa <- bfc[[names(BiocFileCache::bfcadd(bfc, 'genome.fa', paste(file_url, 'SIRV_isoforms_multi-fasta_170612a.fasta', sep = '/')))]] annotation <- bfc[[names(BiocFileCache::bfcadd(bfc, 'annot.gtf', paste(file_url, 'SIRV_isoforms_multi-fasta-annotation_C_170612a.gtf', sep = '/')))]] outdir <- tempfile() dir.create(outdir) if (!any(is.na(sys_which(c("minimap2", "k8"))))) { minimap2_align( config = jsonlite::fromJSON(system.file('extdata/SIRV_config_default.json', package = 'FLAMES')), fa_file = genome_fa, fq_in = fastq1, annot = annotation, outdir = outdir ) }
Uses minimap2 to re-align reads to transcriptome
minimap2_realign( config, fq_in, outdir, minimap2, samtools = NULL, prefix = NULL, threads = 1 )
minimap2_realign( config, fq_in, outdir, minimap2, samtools = NULL, prefix = NULL, threads = 1 )
config |
Parsed list of FLAMES config file |
fq_in |
File path to the fastq file used as a query sequence file |
outdir |
Output folder |
minimap2 |
Path to minimap2 binary |
samtools |
path to the samtools binary, required for large datasets since |
prefix |
String, the prefix (e.g. sample name) for the outputted BAM file |
threads |
Integer, threads for minimap2 to use, see minimap2 documentation for details, FLAMES will try to detect cores if this parameter is not provided. |
a data.frame
summarising the reads aligned
[minimap2_align()]
temp_path <- tempfile() bfc <- BiocFileCache::BiocFileCache(temp_path, ask = FALSE) file_url <- 'https://raw.githubusercontent.com/OliverVoogd/FLAMESData/master/data' fastq1 <- bfc[[names(BiocFileCache::bfcadd(bfc, 'Fastq1', paste(file_url, 'fastq/sample1.fastq.gz', sep = '/')))]] genome_fa <- bfc[[names(BiocFileCache::bfcadd(bfc, 'genome.fa', paste(file_url, 'SIRV_isoforms_multi-fasta_170612a.fasta', sep = '/')))]] annotation <- bfc[[names(BiocFileCache::bfcadd(bfc, 'annot.gtf', paste(file_url, 'SIRV_isoforms_multi-fasta-annotation_C_170612a.gtf', sep = '/')))]] outdir <- tempfile() dir.create(outdir) if (!any(is.na(sys_which(c("minimap2", "k8"))))) { fasta <- annotation_to_fasta(annotation, genome_fa, outdir) minimap2_realign( config = jsonlite::fromJSON(system.file('extdata/SIRV_config_default.json', package = 'FLAMES')), fq_in = fastq1, outdir = outdir ) }
temp_path <- tempfile() bfc <- BiocFileCache::BiocFileCache(temp_path, ask = FALSE) file_url <- 'https://raw.githubusercontent.com/OliverVoogd/FLAMESData/master/data' fastq1 <- bfc[[names(BiocFileCache::bfcadd(bfc, 'Fastq1', paste(file_url, 'fastq/sample1.fastq.gz', sep = '/')))]] genome_fa <- bfc[[names(BiocFileCache::bfcadd(bfc, 'genome.fa', paste(file_url, 'SIRV_isoforms_multi-fasta_170612a.fasta', sep = '/')))]] annotation <- bfc[[names(BiocFileCache::bfcadd(bfc, 'annot.gtf', paste(file_url, 'SIRV_isoforms_multi-fasta-annotation_C_170612a.gtf', sep = '/')))]] outdir <- tempfile() dir.create(outdir) if (!any(is.na(sys_which(c("minimap2", "k8"))))) { fasta <- annotation_to_fasta(annotation, genome_fa, outdir) minimap2_realign( config = jsonlite::fromJSON(system.file('extdata/SIRV_config_default.json', package = 'FLAMES')), fq_in = fastq1, outdir = outdir ) }
Parse a Gff3 file into 3 components: chromasome to gene name, a transcript dictionary, a gene to transcript dictionary and a transcript to exon dictionary. These components are returned in a named list.
parse_gff_tree(gff_file)
parse_gff_tree(gff_file)
gff_file |
the file path to the gff3 file to parse |
a named list with the elements "chr_to_gene", "transcript_dict", "gene_to_transcript", "transcript_to_exon", containing the data parsed from the gff3 file.
temp_path <- tempfile() bfc <- BiocFileCache::BiocFileCache(temp_path, ask = FALSE) file_url <- "https://raw.githubusercontent.com/OliverVoogd/FLAMESData/master/data" gff <- bfc[[names(BiocFileCache::bfcadd(bfc, "GFF", paste(file_url, "SIRV_isoforms_multi-fasta-annotation_C_170612a.gtf", sep = "/")))]] ## Not run: parsed_gff <- parse_gff_tree(gff)
temp_path <- tempfile() bfc <- BiocFileCache::BiocFileCache(temp_path, ask = FALSE) file_url <- "https://raw.githubusercontent.com/OliverVoogd/FLAMESData/master/data" gff <- bfc[[names(BiocFileCache::bfcadd(bfc, "GFF", paste(file_url, "SIRV_isoforms_multi-fasta-annotation_C_170612a.gtf", sep = "/")))]] ## Not run: parsed_gff <- parse_gff_tree(gff)
Plot the average read coverages for each length bin or a perticular isoform
plot_coverage( bam, isoform = NULL, length_bins = c(0, 1, 2, 5, 10, Inf), weight_fn = "read_counts" )
plot_coverage( bam, isoform = NULL, length_bins = c(0, 1, 2, 5, 10, Inf), weight_fn = "read_counts" )
bam |
path to the BAM file (aligning reads to the transcriptome), or the (GenomicAlignments::readGAlignments) parsed GAlignments object |
isoform |
string vector, provide isoform names to plot the coverage for the corresponding isoforms, or provide NULL to plot average coverages for each length bin |
length_bins |
numeric vector to specify the sizes to bin the isoforms by |
weight_fn |
"read_counts" or "sigmoid", determins how the transcripts should be weighted within length bins. |
a ggplot2 object of the coverage plot(s)
temp_path <- tempfile() bfc <- BiocFileCache::BiocFileCache(temp_path, ask = FALSE) file_url <- 'https://raw.githubusercontent.com/OliverVoogd/FLAMESData/master/data' fastq1 <- bfc[[names(BiocFileCache::bfcadd(bfc, 'Fastq1', paste(file_url, 'fastq/sample1.fastq.gz', sep = '/')))]] genome_fa <- bfc[[names(BiocFileCache::bfcadd(bfc, 'genome.fa', paste(file_url, 'SIRV_isoforms_multi-fasta_170612a.fasta', sep = '/')))]] annotation <- bfc[[names(BiocFileCache::bfcadd(bfc, 'annot.gtf', paste(file_url, 'SIRV_isoforms_multi-fasta-annotation_C_170612a.gtf', sep = '/')))]] outdir <- tempfile() dir.create(outdir) if (!any(is.na(sys_which(c("minimap2", "k8"))))) { fasta <- annotation_to_fasta(annotation, genome_fa, outdir) minimap2_realign( config = jsonlite::fromJSON(system.file('extdata/SIRV_config_default.json', package = 'FLAMES')), fq_in = fastq1, outdir = outdir ) plot_coverage(bam = file.path(outdir, 'realign2transcript.bam')) }
temp_path <- tempfile() bfc <- BiocFileCache::BiocFileCache(temp_path, ask = FALSE) file_url <- 'https://raw.githubusercontent.com/OliverVoogd/FLAMESData/master/data' fastq1 <- bfc[[names(BiocFileCache::bfcadd(bfc, 'Fastq1', paste(file_url, 'fastq/sample1.fastq.gz', sep = '/')))]] genome_fa <- bfc[[names(BiocFileCache::bfcadd(bfc, 'genome.fa', paste(file_url, 'SIRV_isoforms_multi-fasta_170612a.fasta', sep = '/')))]] annotation <- bfc[[names(BiocFileCache::bfcadd(bfc, 'annot.gtf', paste(file_url, 'SIRV_isoforms_multi-fasta-annotation_C_170612a.gtf', sep = '/')))]] outdir <- tempfile() dir.create(outdir) if (!any(is.na(sys_which(c("minimap2", "k8"))))) { fasta <- annotation_to_fasta(annotation, genome_fa, outdir) minimap2_realign( config = jsonlite::fromJSON(system.file('extdata/SIRV_config_default.json', package = 'FLAMES')), fq_in = fastq1, outdir = outdir ) plot_coverage(bam = file.path(outdir, 'realign2transcript.bam')) }
produce a barplot of cell barcode demultiplex statistics
plot_demultiplex(outdir, stats_file)
plot_demultiplex(outdir, stats_file)
outdir |
folder containing the |
stats_file |
|
a ggplot
object of the barcode plot
outdir <- tempfile() dir.create(outdir) bc_allow <- file.path(outdir, "bc_allow.tsv") R.utils::gunzip(filename = system.file("extdata/bc_allow.tsv.gz", package = "FLAMES"), destname = bc_allow, remove = FALSE) find_barcode( fastq = system.file("extdata/fastq", package = "FLAMES"), stats_out = file.path(outdir, "bc_stat"), reads_out = file.path(outdir, "demultiplexed.fq.gz"), barcodes_file = bc_allow ) plot_demultiplex(stats_file = file.path(outdir, "bc_stat"))
outdir <- tempfile() dir.create(outdir) bc_allow <- file.path(outdir, "bc_allow.tsv") R.utils::gunzip(filename = system.file("extdata/bc_allow.tsv.gz", package = "FLAMES"), destname = bc_allow, remove = FALSE) find_barcode( fastq = system.file("extdata/fastq", package = "FLAMES"), stats_out = file.path(outdir, "bc_stat"), reads_out = file.path(outdir, "demultiplexed.fq.gz"), barcodes_file = bc_allow ) plot_demultiplex(stats_file = file.path(outdir, "bc_stat"))
Calculate the per gene UMI count matrix by parsing the genome alignment file.
quantify_gene(annotation, outdir, n_process, pipeline = "sc_single_sample")
quantify_gene(annotation, outdir, n_process, pipeline = "sc_single_sample")
annotation |
The file path to the annotation file in GFF3 format |
outdir |
The path to directory to store all output files. |
n_process |
The number of processes to use for parallelization. |
pipeline |
The pipeline type as a character string, either |
After the genome alignment step (do_genome_align
), the alignment file will be parsed to
generate the per gene UMI count matrix. For each gene in the annotation file, the number of
reads whose mapped ranges overlap with the gene's genome coordinates will be assigned to the
gene. For reads can be assigned to multiple gene, the read will be assigned to the gene with
the highest number of overlapping nucleotides. If the read can be assigned to multiple genes
with the same number of overlapping nucleotides, the read will be not be assigned.
After the read-to-gene assignment, the per gene UMI count matrix will be generated. Specifically, for each gene, the reads with similar mapping coordinates of transcript termination sites (TTS, i.e. the end of the the read with a polyT or polyA) will be grouped together. UMIs of reads in the same group will be collapsed to generate the UMI counts for each gene.
Finally, a new fastq file with deduplicated reads by keeping the longest read in each UMI.
The count matrix will be saved in the output folder as transcript_count.csv.gz
.
Calculate the transcript count matrix by parsing the re-alignment file.
quantify_transcript(annotation, outdir, config, pipeline = "sc_single_sample")
quantify_transcript(annotation, outdir, config, pipeline = "sc_single_sample")
annotation |
The file path to the annotation file in GFF3 format |
outdir |
The path to directory to store all output files. |
config |
Parsed FLAMES configurations. |
pipeline |
The pipeline type as a character string, either |
The count matrix will be saved in the output folder as transcript_count.csv.gz
.
temp_path <- tempfile() bfc <- BiocFileCache::BiocFileCache(temp_path, ask = FALSE) file_url <- "https://raw.githubusercontent.com/OliverVoogd/FLAMESData/master/data" fastq1 <- bfc[[names(BiocFileCache::bfcadd(bfc, "Fastq1", paste(file_url, "fastq/sample1.fastq.gz", sep = "/")))]] genome_fa <- bfc[[names(BiocFileCache::bfcadd(bfc, "genome.fa", paste(file_url, "SIRV_isoforms_multi-fasta_170612a.fasta", sep = "/")))]] annotation <- bfc[[names(BiocFileCache::bfcadd(bfc, "annot.gtf", paste(file_url, "SIRV_isoforms_multi-fasta-annotation_C_170612a.gtf", sep = "/")))]] outdir <- tempfile() dir.create(outdir) fasta <- annotation_to_fasta(annotation, genome_fa, outdir) config <- jsonlite::fromJSON(create_config(outdir, bambu_isoform_identification = TRUE, min_tr_coverage = 0.1, min_read_coverage = 0.1, min_sup_cnt = 1)) file.copy(annotation, file.path(outdir, "isoform_annotated.gtf")) ## Not run: if (!any(is.na(sys_which(c("minimap2", "k8"))))) { minimap2_realign( config = config, outdir = outdir, fq_in = fastq1 ) quantify_transcript(annotation, outdir, config, pipeline = "bulk") } ## End(Not run)
temp_path <- tempfile() bfc <- BiocFileCache::BiocFileCache(temp_path, ask = FALSE) file_url <- "https://raw.githubusercontent.com/OliverVoogd/FLAMESData/master/data" fastq1 <- bfc[[names(BiocFileCache::bfcadd(bfc, "Fastq1", paste(file_url, "fastq/sample1.fastq.gz", sep = "/")))]] genome_fa <- bfc[[names(BiocFileCache::bfcadd(bfc, "genome.fa", paste(file_url, "SIRV_isoforms_multi-fasta_170612a.fasta", sep = "/")))]] annotation <- bfc[[names(BiocFileCache::bfcadd(bfc, "annot.gtf", paste(file_url, "SIRV_isoforms_multi-fasta-annotation_C_170612a.gtf", sep = "/")))]] outdir <- tempfile() dir.create(outdir) fasta <- annotation_to_fasta(annotation, genome_fa, outdir) config <- jsonlite::fromJSON(create_config(outdir, bambu_isoform_identification = TRUE, min_tr_coverage = 0.1, min_read_coverage = 0.1, min_sup_cnt = 1)) file.copy(annotation, file.path(outdir, "isoform_annotated.gtf")) ## Not run: if (!any(is.na(sys_which(c("minimap2", "k8"))))) { minimap2_realign( config = config, outdir = outdir, fq_in = fastq1 ) quantify_transcript(annotation, outdir, config, pipeline = "bulk") } ## End(Not run)
Chi-square based differential transcription usage analysis. This variant is meant for single cell data.
Takes the SingleCellExperiment
object from sc_long_pipeline
as input.
Alternatively, the path to the output folder could be provided instead of the SCE object.
A cluster annotation file cluster_annotation.csv
is required, please provide this file under the
output folder of sc_long_pipeline
.
sc_DTU_analysis(sce, min_count = 15)
sc_DTU_analysis(sce, min_count = 15)
sce |
The |
min_count |
The minimum UMI count threshold for filtering isoforms. |
This function will search for genes that have at least two isoforms, each with more than min_count
UMI counts.
For each gene, the per cell transcript counts were merged by group to generate pseudo bulk samples.
Grouping is specified by the cluster_annotation.csv
file.
The top 2 highly expressed transcripts for each group were selected and a UMI count matrix where
the rows are selected transcripts and columns are groups was used as input to a chi-square test of independence (chisq.test).
Adjusted P-values were calculated by Benjamini–Hochberg correction.
a data.frame
containing the following columns:
gene_id - differentially transcribed genes
X_value - the X value for the DTU gene
df - degrees of freedom of the approximate chi-squared distribution of the test statistic
DTU_tr - the transcript_id with the highest squared residuals
DTU_group - the cell group with the highest squared residuals
p_value - the p-value for the test
adj_p - the adjusted p-value (by Benjamini–Hochberg correction)
The table is sorted by decreasing P-values. It will also be saved as sc_DTU_analysis.csv
under the
output folder.
outdir <- tempfile() dir.create(outdir) bc_allow <- file.path(outdir, "bc_allow.tsv") genome_fa <- file.path(outdir, "rps24.fa") R.utils::gunzip(filename = system.file("extdata/bc_allow.tsv.gz", package = "FLAMES"), destname = bc_allow, remove = FALSE) R.utils::gunzip(filename = system.file("extdata/rps24.fa.gz", package = "FLAMES"), destname = genome_fa, remove = FALSE) if (!any(is.na(sys_which(c("minimap2", "k8"))))) { sce <- FLAMES::sc_long_pipeline( genome_fa = genome_fa, fastq = system.file("extdata/fastq", package = "FLAMES"), annotation = system.file("extdata/rps24.gtf.gz", package = "FLAMES"), outdir = outdir, barcodes_file = bc_allow ) group_anno <- data.frame(barcode_seq = colnames(sce), groups = SingleCellExperiment::counts(sce)["ENSMUST00000169826.2", ] > 1) write.csv(group_anno, file.path(outdir, "cluster_annotation.csv"), row.names = FALSE) sc_DTU_analysis(sce, min_count = 1) }
outdir <- tempfile() dir.create(outdir) bc_allow <- file.path(outdir, "bc_allow.tsv") genome_fa <- file.path(outdir, "rps24.fa") R.utils::gunzip(filename = system.file("extdata/bc_allow.tsv.gz", package = "FLAMES"), destname = bc_allow, remove = FALSE) R.utils::gunzip(filename = system.file("extdata/rps24.fa.gz", package = "FLAMES"), destname = genome_fa, remove = FALSE) if (!any(is.na(sys_which(c("minimap2", "k8"))))) { sce <- FLAMES::sc_long_pipeline( genome_fa = genome_fa, fastq = system.file("extdata/fastq", package = "FLAMES"), annotation = system.file("extdata/rps24.gtf.gz", package = "FLAMES"), outdir = outdir, barcodes_file = bc_allow ) group_anno <- data.frame(barcode_seq = colnames(sce), groups = SingleCellExperiment::counts(sce)["ENSMUST00000169826.2", ] > 1) write.csv(group_anno, file.path(outdir, "cluster_annotation.csv"), row.names = FALSE) sc_DTU_analysis(sce, min_count = 1) }
Plot expression heatmap of top n isoforms of a gene
sc_heatmap_expression( gene, multiAssay, impute = FALSE, n_isoforms = 4, transcript_ids, n_pcs = 40, isoform_legend_width = 7, col_low = "#313695", col_mid = "#FFFFBF", col_high = "#A50026", color_quantile = 0.95 )
sc_heatmap_expression( gene, multiAssay, impute = FALSE, n_isoforms = 4, transcript_ids, n_pcs = 40, isoform_legend_width = 7, col_low = "#313695", col_mid = "#FFFFBF", col_high = "#A50026", color_quantile = 0.95 )
gene |
The gene symbol of interest. |
multiAssay |
The |
impute |
Whether to impute expression levels for cells without transcript counts |
n_isoforms |
The number of expressed isoforms to keep. |
transcript_ids |
specify the transcript ids instead of selecting the top |
n_pcs |
The number of principal components to generate. |
isoform_legend_width |
The width of isoform legends in heatmaps, in |
col_low |
Color for cells with low expression levels in UMAPs. |
col_mid |
Color for cells with intermediate expression levels in UMAPs. |
col_high |
Color for cells with high expression levels in UMAPs. |
color_quantile |
The lower and upper expression quantile to be displayed bewteen |
This function takes the combined MultiAssayExperiment
object from
combine_sce
and plots an expression heatmap with the isoform alignment visualisations.
a ggplot
object of the heatmap
combined_sce <- combine_sce( short_read_large = scmixology_lib90, short_read_small = scmixology_lib10, long_read_sce = scmixology_lib10_transcripts, remove_duplicates = FALSE) sc_heatmap_expression(gene = "ENSG00000108107", multiAssay = combined_sce)
combined_sce <- combine_sce( short_read_large = scmixology_lib90, short_read_small = scmixology_lib10, long_read_sce = scmixology_lib10_transcripts, remove_duplicates = FALSE) sc_heatmap_expression(gene = "ENSG00000108107", multiAssay = combined_sce)
Semi-supervised isoform detection and annotation for long read data.
This variant is for multi-sample single cell data. By default, this pipeline demultiplexes input
fastq data (match_cell_barcode = TRUE
). Specific parameters relating to
analysis can be changed either through function arguments, or through a
configuration JSON file.
sc_long_multisample_pipeline( annotation, fastqs, outdir, genome_fa, sample_names = NULL, minimap2 = NULL, k8 = NULL, barcodes_file = NULL, expect_cell_numbers = NULL, config_file = NULL )
sc_long_multisample_pipeline( annotation, fastqs, outdir, genome_fa, sample_names = NULL, minimap2 = NULL, k8 = NULL, barcodes_file = NULL, expect_cell_numbers = NULL, config_file = NULL )
annotation |
The file path to the annotation file in GFF3 format |
fastqs |
The input fastq files for multiple samples. It can be provided in different way: 1) a single path to the folder containing fastq files, each fastq file will be treated as a sample; or 2) a vector of paths to each fastq file, each fastq file will be treated as a sample; or 3) a vector of paths to folders containing fastq files, each folder will be treated as a sample. |
outdir |
The path to directory to store all output files. |
genome_fa |
The file path to genome fasta file. |
sample_names |
A vector of sample names,
Default to the file names of input fastq files, or folder names if |
minimap2 |
Path to minimap2, if it is not in PATH. Only required if either or both of
|
k8 |
Path to the k8 Javascript shell binary. Only required if |
barcodes_file |
The file path to the reference csv used for demultiplexing in flexiplex. If not specified, the
demultiplexing will be performed using BLAZE. Default is |
expect_cell_numbers |
A vector of roughly expected numbers of cells in each sample E.g., the targeted number of cells.
Required if using BLAZE for demultiplexing, specifically, when the |
config_file |
File path to the JSON configuration file. If specified, |
By default FLAMES use minimap2 for read alignment. After the genome alignment step (do_genome_align
), FLAMES summarizes the alignment for each read in every sample by grouping reads
with similar splice junctions to get a raw isoform annotation (do_isoform_id
). The raw isoform
annotation is compared against the reference annotation to correct potential splice site
and transcript start/end errors. Transcripts that have similar splice junctions
and transcript start/end to the reference transcript are merged with the
reference. This process will also collapse isoforms that are likely to be truncated
transcripts. If isoform_id_bambu
is set to TRUE
, bambu::bambu
will be used to generate the updated annotations (Not implemented for multi-sample yet).
Next is the read realignment step (do_read_realign
), where the sequence of each transcript from the update annotation is extracted, and
the reads are realigned to this updated transcript_assembly.fa
by minimap2. The
transcripts with only a few full-length aligned reads are discarded (Not implemented for multi-sample yet).
The reads are assigned to transcripts based on both alignment score, fractions of
reads aligned and transcript coverage. Reads that cannot be uniquely assigned to
transcripts or have low transcript coverage are discarded. The UMI transcript
count matrix is generated by collapsing the reads with the same UMI in a similar
way to what is done for short-read scRNA-seq data, but allowing for an edit distance
of up to 2 by default. Most of the parameters, such as the minimal distance to splice site and minimal percentage of transcript coverage
can be modified by the JSON configuration file (config_file
).
The default parameters can be changed either through the function
arguments are through the configuration JSON file config_file
. the pipeline_parameters
section specifies which steps are to be executed in the pipeline - by default, all
steps are executed. The isoform_parameters
section affects isoform detection - key
parameters include:
Min_sup_cnt
which causes transcripts with less reads aligned than
it's value to be discarded
MAX_TS_DIST
which merges transcripts with the same intron
chain and TSS/TES distace less than MAX_TS_DIST
strand_specific
which specifies if reads are in the same strand as the mRNA (1),
or the reverse complemented (-1) or not strand specific (0), which results in
strand information being based on reference annotation.
a list of SingleCellExperiment
objects if "do_transcript_quantification" set to true.
Otherwise nothing will be returned.
bulk_long_pipeline()
for bulk long data,
SingleCellExperiment()
for how data is outputted
reads <- ShortRead::readFastq(system.file("extdata/fastq/musc_rps24.fastq.gz", package = "FLAMES")) outdir <- tempfile() dir.create(outdir) dir.create(file.path(outdir, "fastq")) bc_allow <- file.path(outdir, "bc_allow.tsv") genome_fa <- file.path(outdir, "rps24.fa") R.utils::gunzip(filename = system.file("extdata/bc_allow.tsv.gz", package = "FLAMES"), destname = bc_allow, remove = FALSE) R.utils::gunzip(filename = system.file("extdata/rps24.fa.gz", package = "FLAMES"), destname = genome_fa, remove = FALSE) ShortRead::writeFastq(sample(reads, size = 500, replace = TRUE), file.path(outdir, "fastq/sample1.fq.gz"), mode = "w", full = FALSE) ShortRead::writeFastq(sample(reads, size = 500, replace = TRUE), file.path(outdir, "fastq/sample2.fq.gz"), mode = "w", full = FALSE) ShortRead::writeFastq(sample(reads, size = 500, replace = TRUE), file.path(outdir, "fastq/sample3.fq.gz"), mode = "w", full = FALSE) if (!any(is.na(sys_which(c("minimap2", "k8"))))) { sce_list <- FLAMES::sc_long_multisample_pipeline( annotation = system.file("extdata/rps24.gtf.gz", package = "FLAMES"), fastqs = file.path(outdir, "fastq", list.files(file.path(outdir, "fastq"))), outdir = outdir, genome_fa = genome_fa, barcodes_file = rep(bc_allow, 3) ) }
reads <- ShortRead::readFastq(system.file("extdata/fastq/musc_rps24.fastq.gz", package = "FLAMES")) outdir <- tempfile() dir.create(outdir) dir.create(file.path(outdir, "fastq")) bc_allow <- file.path(outdir, "bc_allow.tsv") genome_fa <- file.path(outdir, "rps24.fa") R.utils::gunzip(filename = system.file("extdata/bc_allow.tsv.gz", package = "FLAMES"), destname = bc_allow, remove = FALSE) R.utils::gunzip(filename = system.file("extdata/rps24.fa.gz", package = "FLAMES"), destname = genome_fa, remove = FALSE) ShortRead::writeFastq(sample(reads, size = 500, replace = TRUE), file.path(outdir, "fastq/sample1.fq.gz"), mode = "w", full = FALSE) ShortRead::writeFastq(sample(reads, size = 500, replace = TRUE), file.path(outdir, "fastq/sample2.fq.gz"), mode = "w", full = FALSE) ShortRead::writeFastq(sample(reads, size = 500, replace = TRUE), file.path(outdir, "fastq/sample3.fq.gz"), mode = "w", full = FALSE) if (!any(is.na(sys_which(c("minimap2", "k8"))))) { sce_list <- FLAMES::sc_long_multisample_pipeline( annotation = system.file("extdata/rps24.gtf.gz", package = "FLAMES"), fastqs = file.path(outdir, "fastq", list.files(file.path(outdir, "fastq"))), outdir = outdir, genome_fa = genome_fa, barcodes_file = rep(bc_allow, 3) ) }
Semi-supervised isoform detection and annotation for long read data.
This variant is for single cell data. By default, this pipeline demultiplexes input
fastq data (match_cell_barcode = TRUE
). Specific parameters relating to
analysis can be changed either through function arguments, or through a
configuration JSON file.
sc_long_pipeline( annotation, fastq, genome_bam = NULL, outdir, genome_fa, minimap2 = NULL, k8 = NULL, barcodes_file = NULL, expect_cell_number = NULL, config_file = NULL )
sc_long_pipeline( annotation, fastq, genome_bam = NULL, outdir, genome_fa, minimap2 = NULL, k8 = NULL, barcodes_file = NULL, expect_cell_number = NULL, config_file = NULL )
annotation |
The file path to the annotation file in GFF3 format |
fastq |
The file path to input fastq file |
genome_bam |
Optional file path to a bam file to use instead of fastq file (skips initial alignment step) |
outdir |
The path to directory to store all output files. |
genome_fa |
The file path to genome fasta file. |
minimap2 |
Path to minimap2, if it is not in PATH. Only required if either or both of
|
k8 |
Path to the k8 Javascript shell binary. Only required if |
barcodes_file |
The file path to the reference csv used for demultiplexing in flexiplex. If not specified, the
demultiplexing will be performed using BLAZE. Default is |
expect_cell_number |
Expected number of cells for identifying the barcode list in BLAZE.
This could be just a rough estimate. E.g., the targeted number of cells.
Required if the |
config_file |
File path to the JSON configuration file. If specified, |
By default FLAMES use minimap2 for read alignment. After the genome alignment step (do_genome_align
), FLAMES summarizes the alignment for each read by grouping reads
with similar splice junctions to get a raw isoform annotation (do_isoform_id
). The raw isoform
annotation is compared against the reference annotation to correct potential splice site
and transcript start/end errors. Transcripts that have similar splice junctions
and transcript start/end to the reference transcript are merged with the
reference. This process will also collapse isoforms that are likely to be truncated
transcripts. If isoform_id_bambu
is set to TRUE
, bambu::bambu
will be used to generate the updated annotations.
Next is the read realignment step (do_read_realign
), where the sequence of each transcript from the update annotation is extracted, and
the reads are realigned to this updated transcript_assembly.fa
by minimap2. The
transcripts with only a few full-length aligned reads are discarded.
The reads are assigned to transcripts based on both alignment score, fractions of
reads aligned and transcript coverage. Reads that cannot be uniquely assigned to
transcripts or have low transcript coverage are discarded. The UMI transcript
count matrix is generated by collapsing the reads with the same UMI in a similar
way to what is done for short-read scRNA-seq data, but allowing for an edit distance
of up to 2 by default. Most of the parameters, such as the minimal distance to splice site and minimal percentage of transcript coverage
can be modified by the JSON configuration file (config_file
).
The default parameters can be changed either through the function
arguments are through the configuration JSON file config_file
. the pipeline_parameters
section specifies which steps are to be executed in the pipeline - by default, all
steps are executed. The isoform_parameters
section affects isoform detection - key
parameters include:
Min_sup_cnt
which causes transcripts with less reads aligned than
it's value to be discarded
MAX_TS_DIST
which merges transcripts with the same intron
chain and TSS/TES distace less than MAX_TS_DIST
strand_specific
which specifies if reads are in the same strand as the mRNA (1),
or the reverse complemented (-1) or not strand specific (0), which results in
strand information being based on reference annotation.
if do_transcript_quantification
set to true, sc_long_pipeline
returns a SingleCellExperiment
object, containing a count
matrix as an assay, gene annotations under metadata, as well as a list of the other
output files generated by the pipeline. The pipeline also outputs a number of output
files into the given outdir
directory. These output files generated by the pipeline are:
transcript_count.csv.gz - a transcript count matrix (also contained in the SingleCellExperiment)
isoform_annotated.filtered.gff3 - isoforms in gff3 format (also contained in the SingleCellExperiment)
transcript_assembly.fa - transcript sequence from the isoforms
align2genome.bam - sorted BAM file with reads aligned to genome
realign2transcript.bam - sorted realigned BAM file using the transcript_assembly.fa as reference
tss_tes.bedgraph - TSS TES enrichment for all reads (for QC)
if do_transcript_quantification
set to false, nothing will be returned
bulk_long_pipeline()
for bulk long data,
SingleCellExperiment()
for how data is outputted
outdir <- tempfile() dir.create(outdir) bc_allow <- file.path(outdir, "bc_allow.tsv") genome_fa <- file.path(outdir, "rps24.fa") R.utils::gunzip(filename = system.file("extdata/bc_allow.tsv.gz", package = "FLAMES"), destname = bc_allow, remove = FALSE) R.utils::gunzip(filename = system.file("extdata/rps24.fa.gz", package = "FLAMES"), destname = genome_fa, remove = FALSE) if (!any(is.na(sys_which(c("minimap2", "k8"))))) { sce <- FLAMES::sc_long_pipeline( genome_fa = genome_fa, fastq = system.file("extdata/fastq", package = "FLAMES"), annotation = system.file("extdata/rps24.gtf.gz", package = "FLAMES"), outdir = outdir, barcodes_file = bc_allow ) }
outdir <- tempfile() dir.create(outdir) bc_allow <- file.path(outdir, "bc_allow.tsv") genome_fa <- file.path(outdir, "rps24.fa") R.utils::gunzip(filename = system.file("extdata/bc_allow.tsv.gz", package = "FLAMES"), destname = bc_allow, remove = FALSE) R.utils::gunzip(filename = system.file("extdata/rps24.fa.gz", package = "FLAMES"), destname = genome_fa, remove = FALSE) if (!any(is.na(sys_which(c("minimap2", "k8"))))) { sce <- FLAMES::sc_long_pipeline( genome_fa = genome_fa, fastq = system.file("extdata/fastq", package = "FLAMES"), annotation = system.file("extdata/rps24.gtf.gz", package = "FLAMES"), outdir = outdir, barcodes_file = bc_allow ) }
Count the number of reads supporting each variants at the given positions for each cell.
sc_mutations( bam_path, seqnames, positions, indel = FALSE, barcodes, threads = 1 )
sc_mutations( bam_path, seqnames, positions, indel = FALSE, barcodes, threads = 1 )
bam_path |
character(1) or character(n): path to the bam file(s) aligned to the reference genome (NOT the transcriptome! Unless the postions are also from the transcriptome). |
seqnames |
character(n): chromosome names of the postions to count alleles. |
positions |
integer(n): positions, 1-based, same length as seqnames. The positions to count alleles. |
indel |
logical(1): whether to count indels (TRUE) or SNPs (FALSE). |
barcodes |
character(n) when bam_path is a single file, or list of character(n) when bam_path is a list of files paths. The cell barcodes to count alleles for. Only reads with these barcodes will be counted. |
threads |
integer(1): number of threads to use. Maximum number of threads is the number of bam files * number of positions. |
A tibble with columns: allele, barcode, allele_count, cell_total_reads, pct, pos, seqname.
outdir <- tempfile() dir.create(outdir) genome_fa <- file.path(outdir, "rps24.fa") R.utils::gunzip(filename = system.file("extdata/rps24.fa.gz", package = "FLAMES"), destname = genome_fa, remove = FALSE) download.file("https://raw.githubusercontent.com/mritchielab/FLAMES/devel/tests/testthat/demultiplexed.fq", destfile = file.path(outdir, "demultipelxed.fq") ) # can't be bothered to run demultiplexing again if (!any(is.na(sys_which(c("minimap2", "k8"))))) { minimap2_align( # align to genome config = jsonlite::fromJSON(system.file("extdata/SIRV_config_default.json", package = "FLAMES")), fa_file = genome_fa, fq_in = file.path(outdir, "demultipelxed.fq"), annot = system.file("extdata/rps24.gtf.gz", package = "FLAMES"), outdir = outdir ) snps_tb <- sc_mutations( bam_path = file.path(outdir, "align2genome.bam"), seqnames = c("chr14", "chr14"), positions = c(1260, 2714), # positions of interest indel = FALSE, barcodes = read.delim(system.file("extdata/bc_allow.tsv.gz", package = "FLAMES"), header = FALSE)$V1 ) head(snps_tb) snps_tb |> dplyr::filter(pos == 1260) |> dplyr::group_by(allele) |> dplyr::summarise(count = sum(allele_count)) # should be identical to samtools pileup }
outdir <- tempfile() dir.create(outdir) genome_fa <- file.path(outdir, "rps24.fa") R.utils::gunzip(filename = system.file("extdata/rps24.fa.gz", package = "FLAMES"), destname = genome_fa, remove = FALSE) download.file("https://raw.githubusercontent.com/mritchielab/FLAMES/devel/tests/testthat/demultiplexed.fq", destfile = file.path(outdir, "demultipelxed.fq") ) # can't be bothered to run demultiplexing again if (!any(is.na(sys_which(c("minimap2", "k8"))))) { minimap2_align( # align to genome config = jsonlite::fromJSON(system.file("extdata/SIRV_config_default.json", package = "FLAMES")), fa_file = genome_fa, fq_in = file.path(outdir, "demultipelxed.fq"), annot = system.file("extdata/rps24.gtf.gz", package = "FLAMES"), outdir = outdir ) snps_tb <- sc_mutations( bam_path = file.path(outdir, "align2genome.bam"), seqnames = c("chr14", "chr14"), positions = c(1260, 2714), # positions of interest indel = FALSE, barcodes = read.delim(system.file("extdata/bc_allow.tsv.gz", package = "FLAMES"), header = FALSE)$V1 ) head(snps_tb) snps_tb |> dplyr::filter(pos == 1260) |> dplyr::group_by(allele) |> dplyr::summarise(count = sum(allele_count)) # should be identical to samtools pileup }
runPCA and runUMAP wrapper for combined SCE object from combine_sce
sc_reduce_dims(multiAssay, n_pcs = 40, n_hvgs = 2000)
sc_reduce_dims(multiAssay, n_pcs = 40, n_hvgs = 2000)
multiAssay |
The |
n_pcs |
The number of principal components to generate. |
n_hvgs |
The number of variable genes to use for running PCA. |
This function takes the combined MultiAssayExperiment
object from
combine_sce
and run scater::runUMAP
and / or scran::fixedPCA
the MultiAssayExperiment
with reduced dimensions
combined_sce <- combine_sce( short_read_large = scmixology_lib90, short_read_small = scmixology_lib10, long_read_sce = scmixology_lib10_transcripts, remove_duplicates = FALSE) sc_reduce_dims(multiAssay = combined_sce)
combined_sce <- combine_sce( short_read_large = scmixology_lib90, short_read_small = scmixology_lib10, long_read_sce = scmixology_lib10_transcripts, remove_duplicates = FALSE) sc_reduce_dims(multiAssay = combined_sce)
Plot expression UMAPs of top n isoforms of a gene
sc_umap_expression( gene, multiAssay, impute = FALSE, grided = TRUE, n_isoforms = 4, transcript_ids, n_pcs = 40, col_low = "#313695", col_mid = "#FFFFBF", col_high = "#A50026" )
sc_umap_expression( gene, multiAssay, impute = FALSE, grided = TRUE, n_isoforms = 4, transcript_ids, n_pcs = 40, col_low = "#313695", col_mid = "#FFFFBF", col_high = "#A50026" )
gene |
The gene symbol of interest. |
multiAssay |
The |
impute |
Whether to impute expression levels for cells without transcript counts |
grided |
Wheter to produce multiple UMAP plots, with each showing expression level for an isoform, to allow plotting more than 2 isoforms. |
n_isoforms |
The number of expressed isoforms to keep. |
transcript_ids |
specify the transcript ids instead of selecting the top |
n_pcs |
The number of principal components to generate. |
col_low |
Color for cells with low expression levels in UMAPs. |
col_mid |
Color for cells with intermediate expression levels in UMAPs. |
col_high |
Color for cells with high expression levels in UMAPs. |
This function takes the combined MultiAssayExperiment
object from
cexample("MultiAssayExperiment")ombine_sce
and plots UMAPs for each isoform of gene
, where cells
are colored by expression levels. When grided = TRUE
, the UMAPs are combined into a grid, along with the isoforms' visualization along genomic coordinates. Produces a single UMAP with isoform expressions colored by col_low
and col_high
when grided = FALSE
.
a ggplot
object of the UMAP(s)
combined_sce <- combine_sce( short_read_large = scmixology_lib90, short_read_small = scmixology_lib10, long_read_sce = scmixology_lib10_transcripts, remove_duplicates = FALSE) sc_umap_expression(gene = "ENSG00000108107", multiAssay = combined_sce)
combined_sce <- combine_sce( short_read_large = scmixology_lib90, short_read_small = scmixology_lib10, long_read_sce = scmixology_lib10_transcripts, remove_duplicates = FALSE) sc_umap_expression(gene = "ENSG00000108107", multiAssay = combined_sce)
Short-read gene counts from long and short-read single cell RNA-seq profiling of human lung adenocarcinoma cell lines using 10X version 2 chemstry. See Tian, L. et al. Comprehensive characterization of single-cell full-length isoforms in human and mouse with long-read sequencing. Genome Biology 22, 310 (2021).
scmixology_lib10
scmixology_lib10
## 'scmixology_lib10'
A SingleCellExperiment
with 7,240 rows and 60 columns:
<https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE154869>
long-read transcript counts from long and short-read single cell RNA-seq profiling of human lung adenocarcinoma cell lines using 10X version 2 chemstry. See Tian, L. et al. Comprehensive characterization of single-cell full-length isoforms in human and mouse with long-read sequencing. Genome Biology 22, 310 (2021).
scmixology_lib10_transcripts
scmixology_lib10_transcripts
## 'scmixology_lib10_transcripts'
A SingleCellExperiment
with 7,240 rows and 60 columns:
<https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE154869>
Short-read single cell RNA-seq profiling of human lung adenocarcinoma cell lines using 10X version 2 chemstry. Single cells from five human lung adenocarcinoma cell lines (H2228, H1975, A549, H838 and HCC827) were mixed in equal proportions and processed using the Chromium 10X platform, then sequenced using Illumina HiSeq 2500. See Tian L, Dong X, Freytag S, Lê Cao KA et al. Benchmarking single cell RNA-sequencing analysis pipelines using mixture control experiments. Nat Methods 2019 Jun;16(6):479-487. PMID: 31133762
scmixology_lib90
scmixology_lib90
## 'scmixology_lib90'
A SingleCellExperiment
<https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE126906>
The base::Sys.which
function returns "" if the command is not found
on some systems and NA
on others. This wrapper replaces "" with
NA
sys_which(command)
sys_which(command)
command |
character, the command to search for |
character, the path to the command or NA
sys_which("minimap2")
sys_which("minimap2")