Title: | FLAMES: Full Length Analysis of Mutations and Splicing in long read RNA-seq data |
---|---|
Description: | Semi-supervised isoform detection and annotation from both bulk and single-cell long read RNA-seq data. Flames provides automated pipelines for analysing isoforms, as well as intermediate functions for manual execution. |
Authors: | Luyi Tian [aut], Changqing Wang [aut, cre], Yupei You [aut], Oliver Voogd [aut], Jakob Schuster [aut], Shian Su [aut], Matthew Ritchie [ctb] |
Maintainer: | Changqing Wang <[email protected]> |
License: | GPL (>= 3) |
Version: | 2.1.3 |
Built: | 2024-12-20 03:01:44 UTC |
Source: | https://github.com/bioc/FLAMES |
convert the transcript annotation to transcriptome assembly as FASTA file. The genome annotation is first imported as TxDb object and then used to extract transcript sequence from the genome assembly.
annotation_to_fasta(isoform_annotation, genome_fa, outdir, extract_fn)
annotation_to_fasta(isoform_annotation, genome_fa, outdir, extract_fn)
isoform_annotation |
Path to the annotation file (GTF/GFF3) |
genome_fa |
The file path to genome fasta file. |
outdir |
The path to directory to store the transcriptome as |
extract_fn |
(optional) Function to extract |
Path to the outputted transcriptome assembly
fasta <- annotation_to_fasta(system.file("extdata", "rps24.gtf.gz", package = "FLAMES"), system.file("extdata", "rps24.fa.gz", package = "FLAMES"), tempdir()) cat(readChar(fasta, nchars = 1e3))
fasta <- annotation_to_fasta(system.file("extdata", "rps24.gtf.gz", package = "FLAMES"), system.file("extdata", "rps24.fa.gz", package = "FLAMES"), tempdir()) cat(readChar(fasta, nchars = 1e3))
Uses BLAZE to generate barcode list and assign reads to cell barcodes.
blaze(expect_cells, fq_in, ...)
blaze(expect_cells, fq_in, ...)
expect_cells |
Integer, expected number of cells. Note: this could be just a rough estimate. E.g., the targeted number of cells. |
fq_in |
File path to the fastq file used as a query sequence file |
... |
Additional BLAZE configuration parameters. E.g., setting ''output-prefix'='some_prefix'' is equivalent to specifying '–output-prefix some_prefix' in BLAZE; Similarly, 'overwrite=TRUE' is equivalent to switch on the '–overwrite' option. Note that the specified parameters will override the parameters specified in the configuration file. All available options can be found at https://github.com/shimlab/BLAZE. |
A data.frame
summarising the reads aligned. Other outputs are written to disk.
The details of the output files can be found at https://github.com/shimlab/BLAZE.
temp_path <- tempfile() bfc <- BiocFileCache::BiocFileCache(temp_path, ask = FALSE) fastq1_url <- 'https://raw.githubusercontent.com/shimlab/BLAZE/main/test/data/FAR20033_pass_51e510db_100.fastq' fastq1 <- bfc[[names(BiocFileCache::bfcadd(bfc, 'Fastq1', fastq1_url))]] outdir <- tempfile() dir.create(outdir) ## Not run: blaze(expect_cells=10, fastq1, overwrite=TRUE) ## End(Not run)
temp_path <- tempfile() bfc <- BiocFileCache::BiocFileCache(temp_path, ask = FALSE) fastq1_url <- 'https://raw.githubusercontent.com/shimlab/BLAZE/main/test/data/FAR20033_pass_51e510db_100.fastq' fastq1 <- bfc[[names(BiocFileCache::bfcadd(bfc, 'Fastq1', fastq1_url))]] outdir <- tempfile() dir.create(outdir) ## Not run: blaze(expect_cells=10, fastq1, overwrite=TRUE) ## End(Not run)
Semi-supervised isofrom detection and annotation for long read data. This variant is meant for bulk samples. Specific parameters relating to analysis can be changed either through function arguments, or through a configuration JSON file.
bulk_long_pipeline( annotation, fastq, outdir, genome_fa, minimap2 = NULL, k8 = NULL, config_file = NULL )
bulk_long_pipeline( annotation, fastq, outdir, genome_fa, minimap2 = NULL, k8 = NULL, config_file = NULL )
annotation |
The file path to the annotation file in GFF3 format |
fastq |
The file path to input fastq file |
outdir |
The path to directory to store all output files. |
genome_fa |
The file path to genome fasta file. |
minimap2 |
Path to minimap2, if it is not in PATH. Only required if either or both of
|
k8 |
Path to the k8 Javascript shell binary. Only required if |
config_file |
File path to the JSON configuration file. If specified, |
By default FLAMES use minimap2 for read alignment. After the genome alignment step (do_genome_align
), FLAMES summarizes the alignment for each read by grouping reads
with similar splice junctions to get a raw isoform annotation (do_isoform_id
). The raw isoform
annotation is compared against the reference annotation to correct potential splice site
and transcript start/end errors. Transcripts that have similar splice junctions
and transcript start/end to the reference transcript are merged with the
reference. This process will also collapse isoforms that are likely to be truncated
transcripts. If isoform_id_bambu
is set to TRUE
, bambu::bambu
will be used to generate the updated annotations.
Next is the read realignment step (do_read_realign
), where the sequence of each transcript from the update annotation is extracted, and
the reads are realigned to this updated transcript_assembly.fa
by minimap2. The
transcripts with only a few full-length aligned reads are discarded.
The reads are assigned to transcripts based on both alignment score, fractions of
reads aligned and transcript coverage. Reads that cannot be uniquely assigned to
transcripts or have low transcript coverage are discarded. The UMI transcript
count matrix is generated by collapsing the reads with the same UMI in a similar
way to what is done for short-read scRNA-seq data, but allowing for an edit distance
of up to 2 by default. Most of the parameters, such as the minimal distance to splice site and minimal percentage of transcript coverage
can be modified by the JSON configuration file (config_file
).
The default parameters can be changed either through the function
arguments are through the configuration JSON file config_file
. the pipeline_parameters
section specifies which steps are to be executed in the pipeline - by default, all
steps are executed. The isoform_parameters
section affects isoform detection - key
parameters include:
Min_sup_cnt
which causes transcripts with less reads aligned than it's value to be discarded
MAX_TS_DIST
which merges transcripts with the same intron
chain and TSS/TES distace less than MAX_TS_DIST
strand_specific
which specifies if reads are in the same strand as the mRNA (1), or the reverse complemented (-1) or not strand specific (0), which results in strand information being based on reference annotation.
if do_transcript_quantification
set to true, bulk_long_pipeline
returns a SummarizedExperiment object, containing a count
matrix as an assay, gene annotations under metadata, as well as a list of the other
output files generated by the pipeline. The pipeline also outputs a number of output
files into the given outdir
directory. These output files generated by the pipeline are:
- a transcript count matrix (also contained in the SummarizedExperiment)
- isoforms in gff3 format (also contained in the SummarizedExperiment)
- transcript sequence from the isoforms
- sorted BAM file with reads aligned to genome
- sorted realigned BAM file using the transcript_assembly.fa as reference
- TSS TES enrichment for all reads (for QC)
if do_transcript_quantification
set to false, nothing will be returned
sc_long_pipeline()
for single cell data,
SummarizedExperiment()
for how data is outputted
# download the two fastq files, move them to a folder to be merged together temp_path <- tempfile() bfc <- BiocFileCache::BiocFileCache(temp_path, ask = FALSE) file_url <- "https://raw.githubusercontent.com/OliverVoogd/FLAMESData/master/data" # download the required fastq files, and move them to new folder fastq1 <- bfc[[names(BiocFileCache::bfcadd(bfc, "Fastq1", paste(file_url, "fastq/sample1.fastq.gz", sep = "/")))]] fastq2 <- bfc[[names(BiocFileCache::bfcadd(bfc, "Fastq2", paste(file_url, "fastq/sample2.fastq.gz", sep = "/")))]] annotation <- bfc[[names(BiocFileCache::bfcadd(bfc, "annot.gtf", paste(file_url, "SIRV_isoforms_multi-fasta-annotation_C_170612a.gtf", sep = "/")))]] genome_fa <- bfc[[names(BiocFileCache::bfcadd(bfc, "genome.fa", paste(file_url, "SIRV_isoforms_multi-fasta_170612a.fasta", sep = "/")))]] fastq_dir <- paste(temp_path, "fastq_dir", sep = "/") # the downloaded fastq files need to be in a directory to be merged together dir.create(fastq_dir) file.copy(c(fastq1, fastq2), fastq_dir) unlink(c(fastq1, fastq2)) # the original files can be deleted outdir <- tempfile() dir.create(outdir) se <- bulk_long_pipeline( annotation = annotation, fastq = fastq_dir, outdir = outdir, genome_fa = genome_fa, config_file = create_config(outdir, type = "sc_3end", threads = 1, no_flank = TRUE) )
# download the two fastq files, move them to a folder to be merged together temp_path <- tempfile() bfc <- BiocFileCache::BiocFileCache(temp_path, ask = FALSE) file_url <- "https://raw.githubusercontent.com/OliverVoogd/FLAMESData/master/data" # download the required fastq files, and move them to new folder fastq1 <- bfc[[names(BiocFileCache::bfcadd(bfc, "Fastq1", paste(file_url, "fastq/sample1.fastq.gz", sep = "/")))]] fastq2 <- bfc[[names(BiocFileCache::bfcadd(bfc, "Fastq2", paste(file_url, "fastq/sample2.fastq.gz", sep = "/")))]] annotation <- bfc[[names(BiocFileCache::bfcadd(bfc, "annot.gtf", paste(file_url, "SIRV_isoforms_multi-fasta-annotation_C_170612a.gtf", sep = "/")))]] genome_fa <- bfc[[names(BiocFileCache::bfcadd(bfc, "genome.fa", paste(file_url, "SIRV_isoforms_multi-fasta_170612a.fasta", sep = "/")))]] fastq_dir <- paste(temp_path, "fastq_dir", sep = "/") # the downloaded fastq files need to be in a directory to be merged together dir.create(fastq_dir) file.copy(c(fastq1, fastq2), fastq_dir) unlink(c(fastq1, fastq2)) # the original files can be deleted outdir <- tempfile() dir.create(outdir) se <- bulk_long_pipeline( annotation = annotation, fastq = fastq_dir, outdir = outdir, genome_fa = genome_fa, config_file = create_config(outdir, type = "sc_3end", threads = 1, no_flank = TRUE) )
Combine FLT-seq SingleCellExperiment objects
combine_sce(sce_with_lr, sce_without_lr)
combine_sce(sce_with_lr, sce_without_lr)
sce_with_lr |
A |
sce_without_lr |
A |
For protcols like FLT-seq that generate two libraries, one with both short and long reads,
and one with only short reads, this function combines the two libraries into a single
SingleCellExperiment
object. For the library with both long and short reads, the long-read
transcript counts should be stored in the 'transcript' altExp slot of the SingleCellExperiment
object. This function will combine the short-read gene counts of both libraries, and for the
transcripts counts, it will leave NA
values for the cells from the short-read only library.
The sc_impute_transcript
function can then be used to impute the NA
values.
A SingleCellExperiment
object with combined gene counts and a "transcript" altExp slot.
with_lr <- SingleCellExperiment::SingleCellExperiment(assays = list(counts = matrix(rpois(100, 5), ncol = 10))) without_lr <- SingleCellExperiment::SingleCellExperiment(assays = list(counts = matrix(rpois(200, 5), ncol = 20))) long_read <- SingleCellExperiment::SingleCellExperiment(assays = list(counts = matrix(rpois(50, 5), ncol = 10))) SingleCellExperiment::altExp(with_lr, "transcript") <- long_read SummarizedExperiment::colData(with_lr)$Barcode <- paste0(1:10, "-1") SummarizedExperiment::colData(without_lr)$Barcode <- paste0(8:27, "-1") rownames(with_lr) <- as.character(101:110) rownames(without_lr) <- as.character(103:112) rownames(long_read) <- as.character(1001:1005) combined_sce <- FLAMES::combine_sce(sce_with_lr = with_lr, sce_without_lr = without_lr) combined_sce
with_lr <- SingleCellExperiment::SingleCellExperiment(assays = list(counts = matrix(rpois(100, 5), ncol = 10))) without_lr <- SingleCellExperiment::SingleCellExperiment(assays = list(counts = matrix(rpois(200, 5), ncol = 20))) long_read <- SingleCellExperiment::SingleCellExperiment(assays = list(counts = matrix(rpois(50, 5), ncol = 10))) SingleCellExperiment::altExp(with_lr, "transcript") <- long_read SummarizedExperiment::colData(with_lr)$Barcode <- paste0(1:10, "-1") SummarizedExperiment::colData(without_lr)$Barcode <- paste0(8:27, "-1") rownames(with_lr) <- as.character(101:110) rownames(without_lr) <- as.character(103:112) rownames(long_read) <- as.character(1001:1005) combined_sce <- FLAMES::combine_sce(sce_with_lr = with_lr, sce_without_lr = without_lr) combined_sce
Filter out transcripts with sharp drops / rises in coverage,
to be used in filter_coverage
to remove transcripts with potential
misalignments / internal priming etc. Filtering is done by convolving the
coverage with a kernal of 1s and -1s (e.g. c(1, 1, -1, -1)
, where
the width of the 1s and -1s are determined by the width
parameter),
and check if the maximum absolute value of the convolution is below a
threshold. If the convolution is below the threshold, TRUE
is returned,
otherwise FALSE
.
convolution_filter(x, threshold = 0.15, width = 2, trim = 0.05)
convolution_filter(x, threshold = 0.15, width = 2, trim = 0.05)
x |
numeric vector of coverage values |
threshold |
numeric, the threshold for the maximum absolute value of the convolution |
width |
numeric, the width of the 1s and -1s in the kernal. E.g. |
trim |
numeric, the proportion of the coverage values to ignore at both ends before convolution. |
logical, TRUE
if the transcript passes the filter, FALSE
otherwise
# A >30% drop in coverage will fail the filter with threshold = 0.3 convolution_filter(c(1, 1, 1, 0.69, 0.69, 0.69), threshold = 0.3) convolution_filter(c(1, 1, 1, 0.71, 0.7, 0.7), threshold = 0.3)
# A >30% drop in coverage will fail the filter with threshold = 0.3 convolution_filter(c(1, 1, 1, 0.69, 0.69, 0.69), threshold = 0.3) convolution_filter(c(1, 1, 1, 0.71, 0.7, 0.7), threshold = 0.3)
Create Configuration File From Arguments
create_config(outdir, type = "sc_3end", ...)
create_config(outdir, type = "sc_3end", ...)
outdir |
the destination directory for the configuratio nfile |
type |
use an example config, available values:
|
... |
Configuration parameters.
|
Create a list object containing the arguments supplied in a format usable for the FLAMES pipeline.
Also writes the object to a JSON file, which is located with the prefix 'config_' in the supplied outdir
.
Default values from extdata/config_sclr_nanopore_3end.json
will be used for unprovided parameters.
file path to the config file created
# create the default configuration file outdir <- tempdir() config <- create_config(outdir)
# create the default configuration file outdir <- tempdir() config <- create_config(outdir)
SingleCellExperiment
object from FLAMES
output folderCreate SingleCellExperiment
object from FLAMES
output folder
create_sce_from_dir(outdir, annotation)
create_sce_from_dir(outdir, annotation)
outdir |
The folder containing |
annotation |
(Optional) the annotation file that was used to produce the output files |
a list of SingleCellExperiment
objects if multiple transcript matrices were
found in the output folder, or a SingleCellExperiment
object if only one were found
outdir <- tempfile() dir.create(outdir) bc_allow <- file.path(outdir, "bc_allow.tsv") genome_fa <- file.path(outdir, "rps24.fa") R.utils::gunzip( filename = system.file("extdata", "bc_allow.tsv.gz", package = "FLAMES"), destname = bc_allow, remove = FALSE ) R.utils::gunzip( filename = system.file("extdata", "rps24.fa.gz", package = "FLAMES"), destname = genome_fa, remove = FALSE ) annotation <- system.file("extdata", "rps24.gtf.gz", package = "FLAMES") sce <- FLAMES::sc_long_pipeline( genome_fa = genome_fa, fastq = system.file("extdata", "fastq", "musc_rps24.fastq.gz", package = "FLAMES"), annotation = annotation, outdir = outdir, barcodes_file = bc_allow, config_file = create_config(outdir, oarfish_quantification = FALSE) ) sce_2 <- create_sce_from_dir(outdir, annotation)
outdir <- tempfile() dir.create(outdir) bc_allow <- file.path(outdir, "bc_allow.tsv") genome_fa <- file.path(outdir, "rps24.fa") R.utils::gunzip( filename = system.file("extdata", "bc_allow.tsv.gz", package = "FLAMES"), destname = bc_allow, remove = FALSE ) R.utils::gunzip( filename = system.file("extdata", "rps24.fa.gz", package = "FLAMES"), destname = genome_fa, remove = FALSE ) annotation <- system.file("extdata", "rps24.gtf.gz", package = "FLAMES") sce <- FLAMES::sc_long_pipeline( genome_fa = genome_fa, fastq = system.file("extdata", "fastq", "musc_rps24.fastq.gz", package = "FLAMES"), annotation = annotation, outdir = outdir, barcodes_file = bc_allow, config_file = create_config(outdir, oarfish_quantification = FALSE) ) sce_2 <- create_sce_from_dir(outdir, annotation)
SummarizedExperiment
object from FLAMES
output folderCreate SummarizedExperiment
object from FLAMES
output folder
create_se_from_dir(outdir, annotation)
create_se_from_dir(outdir, annotation)
outdir |
The folder containing |
annotation |
(Optional) the annotation file that was used to produce the output files |
a SummarizedExperiment
object
# download the two fastq files, move them to a folder to be merged together temp_path <- tempfile() bfc <- BiocFileCache::BiocFileCache(temp_path, ask = FALSE) file_url <- "https://raw.githubusercontent.com/OliverVoogd/FLAMESData/master/data" # download the required fastq files, and move them to new folder fastq1 <- bfc[[names(BiocFileCache::bfcadd(bfc, "Fastq1", paste(file_url, "fastq/sample1.fastq.gz", sep = "/")))]] fastq2 <- bfc[[names(BiocFileCache::bfcadd(bfc, "Fastq2", paste(file_url, "fastq/sample2.fastq.gz", sep = "/")))]] annotation <- bfc[[names(BiocFileCache::bfcadd(bfc, "annot.gtf", paste(file_url, "SIRV_isoforms_multi-fasta-annotation_C_170612a.gtf", sep = "/")))]] genome_fa <- bfc[[names(BiocFileCache::bfcadd(bfc, "genome.fa", paste(file_url, "SIRV_isoforms_multi-fasta_170612a.fasta", sep = "/")))]] fastq_dir <- paste(temp_path, "fastq_dir", sep = "/") # the downloaded fastq files need to be in a directory to be merged together dir.create(fastq_dir) file.copy(c(fastq1, fastq2), fastq_dir) unlink(c(fastq1, fastq2)) # the original files can be deleted outdir <- tempfile() dir.create(outdir) se <- bulk_long_pipeline( annotation = annotation, fastq = fastq_dir, outdir = outdir, genome_fa = genome_fa, config_file = create_config(outdir, type = "sc_3end", threads = 1, no_flank = TRUE) )
# download the two fastq files, move them to a folder to be merged together temp_path <- tempfile() bfc <- BiocFileCache::BiocFileCache(temp_path, ask = FALSE) file_url <- "https://raw.githubusercontent.com/OliverVoogd/FLAMESData/master/data" # download the required fastq files, and move them to new folder fastq1 <- bfc[[names(BiocFileCache::bfcadd(bfc, "Fastq1", paste(file_url, "fastq/sample1.fastq.gz", sep = "/")))]] fastq2 <- bfc[[names(BiocFileCache::bfcadd(bfc, "Fastq2", paste(file_url, "fastq/sample2.fastq.gz", sep = "/")))]] annotation <- bfc[[names(BiocFileCache::bfcadd(bfc, "annot.gtf", paste(file_url, "SIRV_isoforms_multi-fasta-annotation_C_170612a.gtf", sep = "/")))]] genome_fa <- bfc[[names(BiocFileCache::bfcadd(bfc, "genome.fa", paste(file_url, "SIRV_isoforms_multi-fasta_170612a.fasta", sep = "/")))]] fastq_dir <- paste(temp_path, "fastq_dir", sep = "/") # the downloaded fastq files need to be in a directory to be merged together dir.create(fastq_dir) file.copy(c(fastq1, fastq2), fastq_dir) unlink(c(fastq1, fastq2)) # the original files can be deleted outdir <- tempfile() dir.create(outdir) se <- bulk_long_pipeline( annotation = annotation, fastq = fastq_dir, outdir = outdir, genome_fa = genome_fa, config_file = create_config(outdir, type = "sc_3end", threads = 1, no_flank = TRUE) )
This function creates a SpatialExperiment object from a SingleCellExperiment object and a spatial barcode file.
create_spe(sce, spatial_barcode_file, mannual_align_json, image)
create_spe(sce, spatial_barcode_file, mannual_align_json, image)
sce |
The SingleCellExperiment object obtained from running the |
spatial_barcode_file |
The path to the spatial barcode file, e.g. |
mannual_align_json |
The path to the mannual alignment json file. |
image |
'DataFrame' containing the image data. See |
A SpatialExperiment object.
trim TSO adaptor with cutadapt
cutadapt(args)
cutadapt(args)
args |
arguments to be passed to cutadapt |
Exit code of cutadapt
## Not run: cutadapt("-h") ## End(Not run)
## Not run: cutadapt("-h") ## End(Not run)
Demultiplex reads using the cell_umi_gene.tsv
file from Sockeye.
demultiplex_sockeye(fastq_dir, sockeye_tsv, out_fq)
demultiplex_sockeye(fastq_dir, sockeye_tsv, out_fq)
fastq_dir |
The folder containing FASTQ files from Sockeye's output under |
sockeye_tsv |
The |
out_fq |
The output FASTQ file. |
returns NULL
Removes isoform annotations that could produce ambigious reads, such as isoforms that only differ by the 5' / 3' end. This could be useful for plotting average coverage plots.
filter_annotation(annotation, keep = "tss_differ")
filter_annotation(annotation, keep = "tss_differ")
annotation |
path to the GTF annotation file, or the parsed GenomicRanges object. |
keep |
string, one of 'tss_differ' (only keep isoforms that all differ by the transcription start site position), 'tes_differ' (only keep those that differ by the transcription end site position), 'both' (only keep those that differ by both the start and end site), or 'single_transcripts' (only keep genes that contains a sinlge transcript). |
GenomicRanges of the filtered isoforms
filtered_annotation <- filter_annotation( system.file("extdata", "rps24.gtf.gz", package = 'FLAMES'), keep = 'tes_differ') filtered_annotation
filtered_annotation <- filter_annotation( system.file("extdata", "rps24.gtf.gz", package = 'FLAMES'), keep = 'tes_differ') filtered_annotation
Filter the transcript coverage by applying a filter function to the coverage values.
filter_coverage(x, filter_fn = convolution_filter)
filter_coverage(x, filter_fn = convolution_filter)
x |
The tibble returned by |
filter_fn |
The filter function to apply to the coverage values. The function
should take a numeric vector of coverage values and return a logical value (TRUE if
the transcript passes the filter, FALSE otherwise). The default filter function is
|
a tibble of the transcript information and coverages, with transcipts that pass the filter
# Create a BAM file with minimap2_realign temp_path <- tempfile() bfc <- BiocFileCache::BiocFileCache(temp_path, ask = FALSE) file_url <- 'https://raw.githubusercontent.com/OliverVoogd/FLAMESData/master/data' fastq1 <- bfc[[names(BiocFileCache::bfcadd(bfc, 'Fastq1', paste(file_url, 'fastq/sample1.fastq.gz', sep = '/')))]] genome_fa <- bfc[[names(BiocFileCache::bfcadd(bfc, 'genome.fa', paste(file_url, 'SIRV_isoforms_multi-fasta_170612a.fasta', sep = '/')))]] annotation <- bfc[[names(BiocFileCache::bfcadd(bfc, 'annot.gtf', paste(file_url, 'SIRV_isoforms_multi-fasta-annotation_C_170612a.gtf', sep = '/')))]] outdir <- tempfile() dir.create(outdir) fasta <- annotation_to_fasta(annotation, genome_fa, outdir) minimap2_realign( config = jsonlite::fromJSON( system.file("extdata", "config_sclr_nanopore_3end.json", package = "FLAMES")), fq_in = fastq1, outdir = outdir ) x <- get_coverage(file.path(outdir, 'realign2transcript.bam')) nrow(x) filter_coverage(x) |> nrow()
# Create a BAM file with minimap2_realign temp_path <- tempfile() bfc <- BiocFileCache::BiocFileCache(temp_path, ask = FALSE) file_url <- 'https://raw.githubusercontent.com/OliverVoogd/FLAMESData/master/data' fastq1 <- bfc[[names(BiocFileCache::bfcadd(bfc, 'Fastq1', paste(file_url, 'fastq/sample1.fastq.gz', sep = '/')))]] genome_fa <- bfc[[names(BiocFileCache::bfcadd(bfc, 'genome.fa', paste(file_url, 'SIRV_isoforms_multi-fasta_170612a.fasta', sep = '/')))]] annotation <- bfc[[names(BiocFileCache::bfcadd(bfc, 'annot.gtf', paste(file_url, 'SIRV_isoforms_multi-fasta-annotation_C_170612a.gtf', sep = '/')))]] outdir <- tempfile() dir.create(outdir) fasta <- annotation_to_fasta(annotation, genome_fa, outdir) minimap2_realign( config = jsonlite::fromJSON( system.file("extdata", "config_sclr_nanopore_3end.json", package = "FLAMES")), fq_in = fastq1, outdir = outdir ) x <- get_coverage(file.path(outdir, 'realign2transcript.bam')) nrow(x) filter_coverage(x) |> nrow()
demultiplex reads with flexiplex
find_barcode( fastq, barcodes_file, max_bc_editdistance = 2, max_flank_editdistance = 8, reads_out, stats_out, threads = 1, pattern = c(primer = "CTACACGACGCTCTTCCGATCT", BC = paste0(rep("N", 16), collapse = ""), UMI = paste0(rep("N", 12), collapse = ""), polyT = paste0(rep("T", 9), collapse = "")), TSO_seq = "", TSO_prime = 3, strand = "+", cutadapt_minimum_length = 1, full_length_only = FALSE )
find_barcode( fastq, barcodes_file, max_bc_editdistance = 2, max_flank_editdistance = 8, reads_out, stats_out, threads = 1, pattern = c(primer = "CTACACGACGCTCTTCCGATCT", BC = paste0(rep("N", 16), collapse = ""), UMI = paste0(rep("N", 12), collapse = ""), polyT = paste0(rep("T", 9), collapse = "")), TSO_seq = "", TSO_prime = 3, strand = "+", cutadapt_minimum_length = 1, full_length_only = FALSE )
fastq |
character vector of paths to FASTQ files or folders, if named, the names will be used as sample names, otherwise the file names will be used |
barcodes_file |
path to file containing barcode allow-list, with one barcode in each line |
max_bc_editdistance |
max edit distances for the barcode sequence |
max_flank_editdistance |
max edit distances for the flanking sequences (primer and polyT) |
reads_out |
path to output FASTQ file; if multiple samples are processed,
the sample name will be appended to this argument, e.g. provide |
stats_out |
path of output stats file; similar to |
threads |
number of threads to be used |
pattern |
named character vector defining the barcode pattern |
TSO_seq |
TSO sequence to be trimmed |
TSO_prime |
either 3 (when |
strand |
strand of the barcode pattern, either '+' or '-' (read will be reverse complemented after barcode matching if '-') |
cutadapt_minimum_length |
minimum read length after TSO trimming (cutadapt's –minimum-length) |
full_length_only |
boolean, when TSO sequence is provided, whether reads without TSO are to be discarded |
This function demultiplexes reads by searching for flanking sequences (adaptors) around the barcode sequence, and then matching against allowed barcodes. For single sample, either provide a single FASTQ file or a folder containing FASTQ files. For multiple samples, provide a vector of paths (either to FASTQ files or folders containing FASTQ files). Gzipped file input are supported but the output will be uncompressed.
a list containing: reads_tb
(tibble of read demultiplexed information) and
input
, output
, read1_with_adapter
from cutadapt report
(if TSO trimming is performed)
outdir <- tempfile() dir.create(outdir) bc_allow <- file.path(outdir, "bc_allow.tsv") R.utils::gunzip( filename = system.file("extdata", "bc_allow.tsv.gz", package = "FLAMES"), destname = bc_allow, remove = FALSE ) # single sample find_barcode( fastq = system.file("extdata", "fastq", "musc_rps24.fastq.gz", package = "FLAMES"), stats_out = file.path(outdir, "bc_stat"), reads_out = file.path(outdir, "demultiplexed.fq"), barcodes_file = bc_allow, TSO_seq = "AAGCAGTGGTATCAACGCAGAGTACATGGG", TSO_prime = 5, strand = '-', cutadapt_minimum_length = 10, full_length_only = TRUE ) # multi-sample fastq_dir <- tempfile() dir.create(fastq_dir) file.copy(system.file("extdata", "fastq", "musc_rps24.fastq.gz", package = "FLAMES"), file.path(fastq_dir, "musc_rps24.fastq.gz")) sampled_lines <- readLines(file.path(fastq_dir, "musc_rps24.fastq.gz"), n = 400) writeLines(sampled_lines, file.path(fastq_dir, "copy.fastq")) result <- find_barcode( # you can mix folders and files. each path will be considered as a sample fastq = c(fastq_dir, system.file("extdata", "fastq", "musc_rps24.fastq.gz", package = "FLAMES")), stats_out = file.path(outdir, "bc_stat"), reads_out = file.path(outdir, "demultiplexed.fq"), barcodes_file = bc_allow, TSO_seq = "CCCATGTACTCTGCGTTGATACCACTGCTT" )
outdir <- tempfile() dir.create(outdir) bc_allow <- file.path(outdir, "bc_allow.tsv") R.utils::gunzip( filename = system.file("extdata", "bc_allow.tsv.gz", package = "FLAMES"), destname = bc_allow, remove = FALSE ) # single sample find_barcode( fastq = system.file("extdata", "fastq", "musc_rps24.fastq.gz", package = "FLAMES"), stats_out = file.path(outdir, "bc_stat"), reads_out = file.path(outdir, "demultiplexed.fq"), barcodes_file = bc_allow, TSO_seq = "AAGCAGTGGTATCAACGCAGAGTACATGGG", TSO_prime = 5, strand = '-', cutadapt_minimum_length = 10, full_length_only = TRUE ) # multi-sample fastq_dir <- tempfile() dir.create(fastq_dir) file.copy(system.file("extdata", "fastq", "musc_rps24.fastq.gz", package = "FLAMES"), file.path(fastq_dir, "musc_rps24.fastq.gz")) sampled_lines <- readLines(file.path(fastq_dir, "musc_rps24.fastq.gz"), n = 400) writeLines(sampled_lines, file.path(fastq_dir, "copy.fastq")) result <- find_barcode( # you can mix folders and files. each path will be considered as a sample fastq = c(fastq_dir, system.file("extdata", "fastq", "musc_rps24.fastq.gz", package = "FLAMES")), stats_out = file.path(outdir, "bc_stat"), reads_out = file.path(outdir, "demultiplexed.fq"), barcodes_file = bc_allow, TSO_seq = "CCCATGTACTCTGCGTTGATACCACTGCTT" )
This function is a wrapper for base::Sys.which
to find the path
to a command. It also searches within the FLAMES
basilisk conda
environment. This function also replaces "" with NA
in the
output of base::Sys.which
to make it easier to check if the
binary is found.
find_bin(command)
find_bin(command)
command |
character, the command to search for |
character, the path to the command or NA
find_bin("minimap2")
find_bin("minimap2")
Long-read isoform identification with FLAMES or bambu.
find_isoform(annotation, genome_fa, genome_bam, outdir, config)
find_isoform(annotation, genome_fa, genome_bam, outdir, config)
annotation |
Path to annotation file. If configured to use bambu, the annotation must be provided as GTF file. |
genome_fa |
The file path to genome fasta file. |
genome_bam |
File path to BAM alignment file. Multiple files could be provided. |
outdir |
The path to directory to store all output files. |
config |
Parsed FLAMES configurations. |
The updated annotation and the transcriptome assembly will be saved in the
output folder as isoform_annotated.gff3
(GTF if bambu is selected) and
transcript_assembly.fa
respectively.
temp_path <- tempfile() bfc <- BiocFileCache::BiocFileCache(temp_path, ask = FALSE) file_url <- "https://raw.githubusercontent.com/OliverVoogd/FLAMESData/master/data" fastq1 <- bfc[[names(BiocFileCache::bfcadd(bfc, "Fastq1", paste(file_url, "fastq/sample1.fastq.gz", sep = "/")))]] genome_fa <- bfc[[names(BiocFileCache::bfcadd(bfc, "genome.fa", paste(file_url, "SIRV_isoforms_multi-fasta_170612a.fasta", sep = "/")))]] annotation <- bfc[[names(BiocFileCache::bfcadd(bfc, "annot.gtf", paste(file_url, "SIRV_isoforms_multi-fasta-annotation_C_170612a.gtf", sep = "/")))]] outdir <- tempfile() dir.create(outdir) config <- jsonlite::fromJSON( system.file("extdata", "config_sclr_nanopore_3end.json", package = "FLAMES") ) minimap2_align( config = config, fa_file = genome_fa, fq_in = fastq1, annot = annotation, outdir = outdir ) find_isoform( annotation = annotation, genome_fa = genome_fa, genome_bam = file.path(outdir, "align2genome.bam"), outdir = outdir, config = config )
temp_path <- tempfile() bfc <- BiocFileCache::BiocFileCache(temp_path, ask = FALSE) file_url <- "https://raw.githubusercontent.com/OliverVoogd/FLAMESData/master/data" fastq1 <- bfc[[names(BiocFileCache::bfcadd(bfc, "Fastq1", paste(file_url, "fastq/sample1.fastq.gz", sep = "/")))]] genome_fa <- bfc[[names(BiocFileCache::bfcadd(bfc, "genome.fa", paste(file_url, "SIRV_isoforms_multi-fasta_170612a.fasta", sep = "/")))]] annotation <- bfc[[names(BiocFileCache::bfcadd(bfc, "annot.gtf", paste(file_url, "SIRV_isoforms_multi-fasta-annotation_C_170612a.gtf", sep = "/")))]] outdir <- tempfile() dir.create(outdir) config <- jsonlite::fromJSON( system.file("extdata", "config_sclr_nanopore_3end.json", package = "FLAMES") ) minimap2_align( config = config, fa_file = genome_fa, fq_in = fastq1, annot = annotation, outdir = outdir ) find_isoform( annotation = annotation, genome_fa = genome_fa, genome_bam = file.path(outdir, "align2genome.bam"), outdir = outdir, config = config )
Treat each bam file as a bulk sample and identify variants against the reference
find_variants( bam_path, reference, annotation, min_nucleotide_depth = 100, homopolymer_window = 3, annotated_region_only = FALSE, names_from = "gene_name", threads = 1 )
find_variants( bam_path, reference, annotation, min_nucleotide_depth = 100, homopolymer_window = 3, annotated_region_only = FALSE, names_from = "gene_name", threads = 1 )
bam_path |
character(1) or character(n): path to the bam file(s) aligned to the reference genome (NOT the transcriptome!). |
reference |
DNAStringSet: the reference genome |
annotation |
GRanges: the annotation of the reference genome. You can load
a GTF/GFF annotation file with |
min_nucleotide_depth |
integer(1): minimum read depth for a position to be considered a variant. |
homopolymer_window |
integer(1): the window size to calculate the homopolymer
percentage. The homopolymer percentage is calculated as the percentage of the most
frequent nucleotide in a window of |
annotated_region_only |
logical(1): whether to only consider variants outside
annotated regions. If |
names_from |
character(1): the column name in the metadata column of the annotation
( |
threads |
integer(1): number of threads to use. Threading is done over each
annotated region and (if |
Each bam file is treated as a bulk sample to perform pileup and identify variants.
You can run sc_mutations
with the variants identified with this function
to get single-cell allele counts. Note that reference genome FASTA files may have
the chromosome names field as '>chr1 1' instead of '>chr1'. You may need to remove
the trailing number to match the chromosome names in the bam file, for example with
names(ref) <- sapply(names(ref), function(x) strsplit(x, " ")[[1]][1])
.
A tibble with columns: seqnames, pos, nucleotide, count, sum, freq, ref, region,
homopolymer_pct, bam_path The homopolymer percentage is calculated as the percentage of the
most frequent nucleotide in a window of homopolymer_window
nucleotides around
the variant position, excluding the variant position itself.
outdir <- tempfile() dir.create(outdir) genome_fa <- system.file("extdata", "rps24.fa.gz", package = "FLAMES") minimap2_align( # align to genome config = jsonlite::fromJSON( system.file("extdata", "config_sclr_nanopore_3end.json", package = "FLAMES")), fa_file = genome_fa, fq_in = system.file("extdata", "fastq", "demultiplexed.fq.gz", package = "FLAMES"), annot = system.file("extdata", "rps24.gtf.gz", package = "FLAMES"), outdir = outdir ) variants <- find_variants( bam_path = file.path(outdir, "align2genome.bam"), reference = genome_fa, annotation = system.file("extdata", "rps24.gtf.gz", package = "FLAMES"), min_nucleotide_depth = 4 ) head(variants)
outdir <- tempfile() dir.create(outdir) genome_fa <- system.file("extdata", "rps24.fa.gz", package = "FLAMES") minimap2_align( # align to genome config = jsonlite::fromJSON( system.file("extdata", "config_sclr_nanopore_3end.json", package = "FLAMES")), fa_file = genome_fa, fq_in = system.file("extdata", "fastq", "demultiplexed.fq.gz", package = "FLAMES"), annot = system.file("extdata", "rps24.gtf.gz", package = "FLAMES"), outdir = outdir ) variants <- find_variants( bam_path = file.path(outdir, "align2genome.bam"), reference = genome_fa, annotation = system.file("extdata", "rps24.gtf.gz", package = "FLAMES"), min_nucleotide_depth = 4 ) head(variants)
FLAMES: full-length analysis of mutations and splicing
demultiplex reads with flexiplex, for detailed description, see documentation for the original flexiplex: https://davidsongroup.github.io/flexiplex
flexiplex( reads_in, barcodes_file, bc_as_readid, max_bc_editdistance, max_flank_editdistance, pattern, reads_out, stats_out, bc_out, reverseCompliment, n_threads )
flexiplex( reads_in, barcodes_file, bc_as_readid, max_bc_editdistance, max_flank_editdistance, pattern, reads_out, stats_out, bc_out, reverseCompliment, n_threads )
reads_in |
Input FASTQ or FASTA file |
barcodes_file |
barcode allow-list file |
bc_as_readid |
bool, whether to add the demultiplexed barcode to the read ID field |
max_bc_editdistance |
max edit distance for barcode ' |
max_flank_editdistance |
max edit distance for the flanking sequences ' |
pattern |
StringVector defining the barcode structure, see [find_barcode] |
reads_out |
output file for demultiplexed reads |
stats_out |
output file for demultiplexed stats |
bc_out |
WIP |
reverseCompliment |
bool, whether to reverse complement the reads after demultiplexing |
n_threads |
number of threads to be used during demultiplexing |
integer return value. 0 represents normal return.
Get the read coverages for each transcript in the BAM file (or a GAlignments object).
The read coverages are sampled at 100 positions along the transcript, and the coverage is scaled
by dividing the coverage at each position by the total read counts for the transcript. If a BAM
file is provided, alignment with MAPQ < 5, secondary alignments and supplementary alignments
are filtered out. A GAlignments
object can also be provided in case alternative filtering
is desired.
get_coverage(bam, min_counts = 10, remove_UTR = FALSE, annotation)
get_coverage(bam, min_counts = 10, remove_UTR = FALSE, annotation)
bam |
path to the BAM file, or a parsed GAlignments object |
min_counts |
numeric, the minimum number of alignments required for a transcript to be included |
remove_UTR |
logical, if |
annotation |
(Required if |
a tibble of the transcript information and coverages, with the following columns:
transcript
: the transcript name / ID
read_counts
: the total number of aligments for the transcript
coverage_1-100
: the coverage at each of the 100 positions along the transcript
tr_length
: the length of the transcript
# Create a BAM file with minimap2_realign temp_path <- tempfile() bfc <- BiocFileCache::BiocFileCache(temp_path, ask = FALSE) file_url <- 'https://raw.githubusercontent.com/OliverVoogd/FLAMESData/master/data' fastq1 <- bfc[[names(BiocFileCache::bfcadd(bfc, 'Fastq1', paste(file_url, 'fastq/sample1.fastq.gz', sep = '/')))]] genome_fa <- bfc[[names(BiocFileCache::bfcadd(bfc, 'genome.fa', paste(file_url, 'SIRV_isoforms_multi-fasta_170612a.fasta', sep = '/')))]] annotation <- bfc[[names(BiocFileCache::bfcadd(bfc, 'annot.gtf', paste(file_url, 'SIRV_isoforms_multi-fasta-annotation_C_170612a.gtf', sep = '/')))]] outdir <- tempfile() dir.create(outdir) fasta <- annotation_to_fasta(annotation, genome_fa, outdir) minimap2_realign( config = jsonlite::fromJSON( system.file("extdata", "config_sclr_nanopore_3end.json", package = "FLAMES")), fq_in = fastq1, outdir = outdir ) x <- get_coverage(file.path(outdir, 'realign2transcript.bam')) head(x)
# Create a BAM file with minimap2_realign temp_path <- tempfile() bfc <- BiocFileCache::BiocFileCache(temp_path, ask = FALSE) file_url <- 'https://raw.githubusercontent.com/OliverVoogd/FLAMESData/master/data' fastq1 <- bfc[[names(BiocFileCache::bfcadd(bfc, 'Fastq1', paste(file_url, 'fastq/sample1.fastq.gz', sep = '/')))]] genome_fa <- bfc[[names(BiocFileCache::bfcadd(bfc, 'genome.fa', paste(file_url, 'SIRV_isoforms_multi-fasta_170612a.fasta', sep = '/')))]] annotation <- bfc[[names(BiocFileCache::bfcadd(bfc, 'annot.gtf', paste(file_url, 'SIRV_isoforms_multi-fasta-annotation_C_170612a.gtf', sep = '/')))]] outdir <- tempfile() dir.create(outdir) fasta <- annotation_to_fasta(annotation, genome_fa, outdir) minimap2_realign( config = jsonlite::fromJSON( system.file("extdata", "config_sclr_nanopore_3end.json", package = "FLAMES")), fq_in = fastq1, outdir = outdir ) x <- get_coverage(file.path(outdir, 'realign2transcript.bam')) head(x)
Parse FLAMES' GFF ouputs into a Genomic Ranges List
get_GRangesList(file)
get_GRangesList(file)
file |
the GFF file to parse |
A Genomic Ranges List
temp_path <- tempfile() bfc <- BiocFileCache::BiocFileCache(temp_path, ask = FALSE) file_url <- "https://raw.githubusercontent.com/OliverVoogd/FLAMESData/master/data" fastq1 <- bfc[[names(BiocFileCache::bfcadd(bfc, "Fastq1", paste(file_url, "fastq/sample1.fastq.gz", sep = "/")))]] genome_fa <- bfc[[names(BiocFileCache::bfcadd(bfc, "genome.fa", paste(file_url, "SIRV_isoforms_multi-fasta_170612a.fasta", sep = "/")))]] annotation <- bfc[[names(BiocFileCache::bfcadd(bfc, "annot.gtf", paste(file_url, "SIRV_isoforms_multi-fasta-annotation_C_170612a.gtf", sep = "/")))]] outdir <- tempfile() dir.create(outdir) config <- jsonlite::fromJSON( system.file("extdata", "config_sclr_nanopore_3end.json", package = "FLAMES")) minimap2_align( config = config, fa_file = genome_fa, fq_in = fastq1, annot = annotation, outdir = outdir ) find_isoform( annotation = annotation, genome_fa = genome_fa, genome_bam = file.path(outdir, "align2genome.bam"), outdir = outdir, config = config ) grlist <- get_GRangesList(file = file.path(outdir, "isoform_annotated.gff3"))
temp_path <- tempfile() bfc <- BiocFileCache::BiocFileCache(temp_path, ask = FALSE) file_url <- "https://raw.githubusercontent.com/OliverVoogd/FLAMESData/master/data" fastq1 <- bfc[[names(BiocFileCache::bfcadd(bfc, "Fastq1", paste(file_url, "fastq/sample1.fastq.gz", sep = "/")))]] genome_fa <- bfc[[names(BiocFileCache::bfcadd(bfc, "genome.fa", paste(file_url, "SIRV_isoforms_multi-fasta_170612a.fasta", sep = "/")))]] annotation <- bfc[[names(BiocFileCache::bfcadd(bfc, "annot.gtf", paste(file_url, "SIRV_isoforms_multi-fasta-annotation_C_170612a.gtf", sep = "/")))]] outdir <- tempfile() dir.create(outdir) config <- jsonlite::fromJSON( system.file("extdata", "config_sclr_nanopore_3end.json", package = "FLAMES")) minimap2_align( config = config, fa_file = genome_fa, fq_in = fastq1, annot = annotation, outdir = outdir ) find_isoform( annotation = annotation, genome_fa = genome_fa, genome_bam = file.path(outdir, "align2genome.bam"), outdir = outdir, config = config ) grlist <- get_GRangesList(file = file.path(outdir, "isoform_annotated.gff3"))
Uses minimap2 to align sequences agains a reference databse.
Uses options '-ax splice -t 12 -k14 –secondary=no fa_file
fq_in
'
minimap2_align( config, fa_file, fq_in, annot, outdir, minimap2 = NA, k8 = NA, samtools = NA, prefix = NULL, threads = 1 )
minimap2_align( config, fa_file, fq_in, annot, outdir, minimap2 = NA, k8 = NA, samtools = NA, prefix = NULL, threads = 1 )
config |
Parsed list of FLAMES config file |
fa_file |
Path to the fasta file used as a reference database for alignment |
fq_in |
File path to the fastq file used as a query sequence file |
annot |
Genome annotation file used to create junction bed files |
outdir |
Output folder |
minimap2 |
Path to minimap2 binary |
k8 |
Path to the k8 Javascript shell binary |
samtools |
path to the samtools binary, required for large datasets since |
prefix |
String, the prefix (e.g. sample name) for the outputted BAM file |
threads |
Integer, threads for minimap2 to use, see minimap2 documentation for details, FLAMES will try to detect cores if this parameter is not provided. |
a data.frame
summarising the reads aligned
[minimap2_realign()]
temp_path <- tempfile() bfc <- BiocFileCache::BiocFileCache(temp_path, ask = FALSE) file_url <- 'https://raw.githubusercontent.com/OliverVoogd/FLAMESData/master/data' fastq1 <- bfc[[names(BiocFileCache::bfcadd(bfc, 'Fastq1', paste(file_url, 'fastq/sample1.fastq.gz', sep = '/')))]] genome_fa <- bfc[[names(BiocFileCache::bfcadd(bfc, 'genome.fa', paste(file_url, 'SIRV_isoforms_multi-fasta_170612a.fasta', sep = '/')))]] annotation <- bfc[[names(BiocFileCache::bfcadd(bfc, 'annot.gtf', paste(file_url, 'SIRV_isoforms_multi-fasta-annotation_C_170612a.gtf', sep = '/')))]] outdir <- tempfile() dir.create(outdir) minimap2_align( config = jsonlite::fromJSON( system.file("extdata", "config_sclr_nanopore_3end.json", package = 'FLAMES') ), fa_file = genome_fa, fq_in = fastq1, annot = annotation, outdir = outdir )
temp_path <- tempfile() bfc <- BiocFileCache::BiocFileCache(temp_path, ask = FALSE) file_url <- 'https://raw.githubusercontent.com/OliverVoogd/FLAMESData/master/data' fastq1 <- bfc[[names(BiocFileCache::bfcadd(bfc, 'Fastq1', paste(file_url, 'fastq/sample1.fastq.gz', sep = '/')))]] genome_fa <- bfc[[names(BiocFileCache::bfcadd(bfc, 'genome.fa', paste(file_url, 'SIRV_isoforms_multi-fasta_170612a.fasta', sep = '/')))]] annotation <- bfc[[names(BiocFileCache::bfcadd(bfc, 'annot.gtf', paste(file_url, 'SIRV_isoforms_multi-fasta-annotation_C_170612a.gtf', sep = '/')))]] outdir <- tempfile() dir.create(outdir) minimap2_align( config = jsonlite::fromJSON( system.file("extdata", "config_sclr_nanopore_3end.json", package = 'FLAMES') ), fa_file = genome_fa, fq_in = fastq1, annot = annotation, outdir = outdir )
Uses minimap2 to re-align reads to transcriptome
minimap2_realign( config, fq_in, outdir, minimap2, samtools = NULL, prefix = NULL, minimap2_args, sort_by, threads = 1 )
minimap2_realign( config, fq_in, outdir, minimap2, samtools = NULL, prefix = NULL, minimap2_args, sort_by, threads = 1 )
config |
Parsed list of FLAMES config file |
fq_in |
File path to the fastq file used as a query sequence file |
outdir |
Output folder |
minimap2 |
Path to minimap2 binary |
samtools |
path to the samtools binary, required for large datasets since |
prefix |
String, the prefix (e.g. sample name) for the outputted BAM file |
minimap2_args |
vector of command line arguments to pass to minimap2 |
sort_by |
String, If provided, sort the BAM file by this tag instead of by position. |
threads |
Integer, threads for minimap2 to use, see minimap2 documentation for details, FLAMES will try to detect cores if this parameter is not provided. |
a data.frame
summarising the reads aligned
[minimap2_align()]
outdir <- tempfile() dir.create(outdir) annotation <- system.file('extdata', 'rps24.gtf.gz', package = 'FLAMES') genome_fa <- system.file('extdata', 'rps24.fa.gz', package = 'FLAMES') fasta <- annotation_to_fasta(annotation, genome_fa, outdir) fastq <- system.file('extdata', 'fastq', 'demultiplexed.fq.gz', package = 'FLAMES') minimap2_realign( config = jsonlite::fromJSON( system.file("extdata", "config_sclr_nanopore_3end.json", package = 'FLAMES') ), fq_in = fastq, outdir = outdir )
outdir <- tempfile() dir.create(outdir) annotation <- system.file('extdata', 'rps24.gtf.gz', package = 'FLAMES') genome_fa <- system.file('extdata', 'rps24.fa.gz', package = 'FLAMES') fasta <- annotation_to_fasta(annotation, genome_fa, outdir) fastq <- system.file('extdata', 'fastq', 'demultiplexed.fq.gz', package = 'FLAMES') minimap2_realign( config = jsonlite::fromJSON( system.file("extdata", "config_sclr_nanopore_3end.json", package = 'FLAMES') ), fq_in = fastq, outdir = outdir )
Parse a Gff3 file into 3 components: chromasome to gene name, a transcript dictionary, a gene to transcript dictionary and a transcript to exon dictionary. These components are returned in a named list.
parse_gff_tree(gff_file)
parse_gff_tree(gff_file)
gff_file |
the file path to the gff3 file to parse |
a named list with the elements "chr_to_gene", "transcript_dict", "gene_to_transcript", "transcript_to_exon", containing the data parsed from the gff3 file.
temp_path <- tempfile() bfc <- BiocFileCache::BiocFileCache(temp_path, ask = FALSE) file_url <- "https://raw.githubusercontent.com/OliverVoogd/FLAMESData/master/data" gff <- bfc[[names(BiocFileCache::bfcadd(bfc, "GFF", paste(file_url, "SIRV_isoforms_multi-fasta-annotation_C_170612a.gtf", sep = "/")))]] ## Not run: parsed_gff <- parse_gff_tree(gff)
temp_path <- tempfile() bfc <- BiocFileCache::BiocFileCache(temp_path, ask = FALSE) file_url <- "https://raw.githubusercontent.com/OliverVoogd/FLAMESData/master/data" gff <- bfc[[names(BiocFileCache::bfcadd(bfc, "GFF", paste(file_url, "SIRV_isoforms_multi-fasta-annotation_C_170612a.gtf", sep = "/")))]] ## Not run: parsed_gff <- parse_gff_tree(gff)
Plot the average read coverages for each length bin or a perticular isoform
plot_coverage( x, quantiles = c(0, 0.2375, 0.475, 0.7125, 0.95, 1), length_bins = c(0, 1, 2, 5, 10, Inf), weight_fn = weight_transcripts, filter_fn, detailed = FALSE )
plot_coverage( x, quantiles = c(0, 0.2375, 0.475, 0.7125, 0.95, 1), length_bins = c(0, 1, 2, 5, 10, Inf), weight_fn = weight_transcripts, filter_fn, detailed = FALSE )
x |
path to the BAM file (aligning reads to the transcriptome), or
the (GenomicAlignments::readGAlignments) parsed GAlignments object, or the
tibble returned by |
quantiles |
numeric vector to specify the quantiles to bin the transcripts lengths by if length_bins is missing. The length bins will be determined such that the read counts are distributed acording to the quantiles. |
length_bins |
numeric vector to specify the sizes to bin the transcripts by |
weight_fn |
function to calculate the weights for the transcripts. The function
should take a numeric vector of read counts and return a numeric vector of weights.
The default function is |
filter_fn |
Optional filter function to filter the transcripts before plotting.
See the |
detailed |
logical, if |
a ggplot2 object of the coverage plot(s)
# Create a BAM file with minimap2_realign temp_path <- tempfile() bfc <- BiocFileCache::BiocFileCache(temp_path, ask = FALSE) file_url <- 'https://raw.githubusercontent.com/OliverVoogd/FLAMESData/master/data' fastq1 <- bfc[[names(BiocFileCache::bfcadd(bfc, 'Fastq1', paste(file_url, 'fastq/sample1.fastq.gz', sep = '/')))]] genome_fa <- bfc[[names(BiocFileCache::bfcadd(bfc, 'genome.fa', paste(file_url, 'SIRV_isoforms_multi-fasta_170612a.fasta', sep = '/')))]] annotation <- bfc[[names(BiocFileCache::bfcadd(bfc, 'annot.gtf', paste(file_url, 'SIRV_isoforms_multi-fasta-annotation_C_170612a.gtf', sep = '/')))]] outdir <- tempfile() dir.create(outdir) fasta <- annotation_to_fasta(annotation, genome_fa, outdir) minimap2_realign( config = jsonlite::fromJSON( system.file("extdata", "config_sclr_nanopore_3end.json", package = "FLAMES")), fq_in = fastq1, outdir = outdir ) # Plot the coverages directly from the BAM file plot_coverage(file.path(outdir, 'realign2transcript.bam')) # Get the coverage information first coverage <- get_coverage(file.path(outdir, 'realign2transcript.bam')) |> dplyr::filter(read_counts > 2) |> # Filter out transcripts with read counts < 3 filter_coverage(filter_fn = convolution_filter) # Filter out transcripts with sharp drops / rises # Plot the filtered coverages plot_coverage(coverage, detailed = TRUE) # filtering function can also be passed directly to plot_coverage plot_coverage(file.path(outdir, 'realign2transcript.bam'), filter_fn = convolution_filter)
# Create a BAM file with minimap2_realign temp_path <- tempfile() bfc <- BiocFileCache::BiocFileCache(temp_path, ask = FALSE) file_url <- 'https://raw.githubusercontent.com/OliverVoogd/FLAMESData/master/data' fastq1 <- bfc[[names(BiocFileCache::bfcadd(bfc, 'Fastq1', paste(file_url, 'fastq/sample1.fastq.gz', sep = '/')))]] genome_fa <- bfc[[names(BiocFileCache::bfcadd(bfc, 'genome.fa', paste(file_url, 'SIRV_isoforms_multi-fasta_170612a.fasta', sep = '/')))]] annotation <- bfc[[names(BiocFileCache::bfcadd(bfc, 'annot.gtf', paste(file_url, 'SIRV_isoforms_multi-fasta-annotation_C_170612a.gtf', sep = '/')))]] outdir <- tempfile() dir.create(outdir) fasta <- annotation_to_fasta(annotation, genome_fa, outdir) minimap2_realign( config = jsonlite::fromJSON( system.file("extdata", "config_sclr_nanopore_3end.json", package = "FLAMES")), fq_in = fastq1, outdir = outdir ) # Plot the coverages directly from the BAM file plot_coverage(file.path(outdir, 'realign2transcript.bam')) # Get the coverage information first coverage <- get_coverage(file.path(outdir, 'realign2transcript.bam')) |> dplyr::filter(read_counts > 2) |> # Filter out transcripts with read counts < 3 filter_coverage(filter_fn = convolution_filter) # Filter out transcripts with sharp drops / rises # Plot the filtered coverages plot_coverage(coverage, detailed = TRUE) # filtering function can also be passed directly to plot_coverage plot_coverage(file.path(outdir, 'realign2transcript.bam'), filter_fn = convolution_filter)
produce a barplot of cell barcode demultiplex statistics
plot_demultiplex(find_barcode_result)
plot_demultiplex(find_barcode_result)
find_barcode_result |
output from |
a list of ggplot objects:
reads_count_plot: stacked barplot of: demultiplexed reads
knee_plot: knee plot of UMI counts before TSO trimming
flank_editdistance_plot: flanking sequence (adaptor) edit-distance plot
barcode_editdistance_plot: barcode edit-distance plot
cutadapt_plot: if TSO trimming is performed, number of reads kept by cutadapt
outdir <- tempfile() dir.create(outdir) fastq_dir <- tempfile() dir.create(fastq_dir) file.copy(system.file("extdata", "fastq", "musc_rps24.fastq.gz", package = "FLAMES"), file.path(fastq_dir, "musc_rps24.fastq.gz")) sampled_lines <- readLines(file.path(fastq_dir, "musc_rps24.fastq.gz"), n = 400) writeLines(sampled_lines, file.path(fastq_dir, "copy.fastq")) bc_allow <- file.path(outdir, "bc_allow.tsv") R.utils::gunzip( filename = system.file("extdata", "bc_allow.tsv.gz", package = "FLAMES"), destname = bc_allow, remove = FALSE ) find_barcode( fastq = fastq_dir, stats_out = file.path(outdir, "bc_stat"), reads_out = file.path(outdir, "demultiplexed.fq"), barcodes_file = bc_allow, TSO_seq = "CCCATGTACTCTGCGTTGATACCACTGCTT" ) |> plot_demultiplex()
outdir <- tempfile() dir.create(outdir) fastq_dir <- tempfile() dir.create(fastq_dir) file.copy(system.file("extdata", "fastq", "musc_rps24.fastq.gz", package = "FLAMES"), file.path(fastq_dir, "musc_rps24.fastq.gz")) sampled_lines <- readLines(file.path(fastq_dir, "musc_rps24.fastq.gz"), n = 400) writeLines(sampled_lines, file.path(fastq_dir, "copy.fastq")) bc_allow <- file.path(outdir, "bc_allow.tsv") R.utils::gunzip( filename = system.file("extdata", "bc_allow.tsv.gz", package = "FLAMES"), destname = bc_allow, remove = FALSE ) find_barcode( fastq = fastq_dir, stats_out = file.path(outdir, "bc_stat"), reads_out = file.path(outdir, "demultiplexed.fq"), barcodes_file = bc_allow, TSO_seq = "CCCATGTACTCTGCGTTGATACCACTGCTT" ) |> plot_demultiplex()
Plot expression heatmap of top n isoforms of a gene
plot_isoform_heatmap( sce, gene_id, transcript_ids, n = 4, isoform_legend_width = 7, col_low = "#313695", col_mid = "#FFFFBF", col_high = "#A50026", color_quantile = 1 )
plot_isoform_heatmap( sce, gene_id, transcript_ids, n = 4, isoform_legend_width = 7, col_low = "#313695", col_mid = "#FFFFBF", col_high = "#A50026", color_quantile = 1 )
sce |
The |
gene_id |
The gene symbol of interest, ignored if |
transcript_ids |
The transcript ids to plot. |
n |
The number of top isoforms to plot from the gene. Ignored if |
isoform_legend_width |
The width of isoform legends in heatmaps, in |
col_low |
Color for cells with low expression levels in UMAPs. |
col_mid |
Color for cells with intermediate expression levels in UMAPs. |
col_high |
Color for cells with high expression levels in UMAPs. |
color_quantile |
The lower and upper expression quantile to be displayed bewteen |
Takes SingleCellExperiment
object and plots an expression heatmap with the
isoform visualizations along genomic coordinates.
a ComplexHeatmap
scmixology_lib10_transcripts |> scuttle::logNormCounts() |> plot_isoform_heatmap(gene = "ENSG00000108107")
scmixology_lib10_transcripts |> scuttle::logNormCounts() |> plot_isoform_heatmap(gene = "ENSG00000108107")
Plot expression of top n isoforms of a gene in reduced dimensions
plot_isoform_reduced_dim( sce, gene_id, transcript_ids, n = 4, reduced_dim_name = "UMAP", use_gene_dimred = FALSE, expr_func = function(x) { SingleCellExperiment::logcounts(x) }, col_low = "#313695", col_mid = "#FFFFBF", col_high = "#A50026", color_quantile = 1, format = "plot_grid", ... )
plot_isoform_reduced_dim( sce, gene_id, transcript_ids, n = 4, reduced_dim_name = "UMAP", use_gene_dimred = FALSE, expr_func = function(x) { SingleCellExperiment::logcounts(x) }, col_low = "#313695", col_mid = "#FFFFBF", col_high = "#A50026", color_quantile = 1, format = "plot_grid", ... )
sce |
The |
gene_id |
The gene symbol of interest, ignored if |
transcript_ids |
The transcript ids to plot. |
n |
The number of top isoforms to plot from the gene. Ignored if |
reduced_dim_name |
The name of the reduced dimension to use for plotting cells. |
use_gene_dimred |
Whether to use gene-level reduced dimensions for plotting. Set to |
expr_func |
The function to extract expression values from the |
col_low |
Color for cells with low expression levels in UMAPs. |
col_mid |
Color for cells with intermediate expression levels in UMAPs. |
col_high |
Color for cells with high expression levels in UMAPs. |
color_quantile |
The lower and upper expression quantile to be displayed bewteen |
format |
The format of the output, either "plot_grid" or "list". |
... |
Additional arguments to pass to |
Takes SingleCellExperiment
object and plots an expression on reduced dimensions
with the isoform visualizations along genomic coordinates.
a ggplot
object of the UMAP(s)
scmixology_lib10 <- scmixology_lib10[, colSums(SingleCellExperiment::counts(scmixology_lib10)) > 0] sce_lr <- scmixology_lib10[, colnames(scmixology_lib10) %in% colnames(scmixology_lib10_transcripts)] SingleCellExperiment::altExp(sce_lr, "transcript") <- scmixology_lib10_transcripts[, colnames(sce_lr)] combined_sce <- combine_sce(sce_lr, scmixology_lib90) combined_sce <- combined_sce |> scuttle::logNormCounts() |> scater::runPCA() |> scater::runUMAP() combined_imputed_sce <- sc_impute_transcript(combined_sce) plot_isoform_reduced_dim(combined_sce, 'ENSG00000108107') plot_isoform_reduced_dim(combined_imputed_sce, 'ENSG00000108107')
scmixology_lib10 <- scmixology_lib10[, colSums(SingleCellExperiment::counts(scmixology_lib10)) > 0] sce_lr <- scmixology_lib10[, colnames(scmixology_lib10) %in% colnames(scmixology_lib10_transcripts)] SingleCellExperiment::altExp(sce_lr, "transcript") <- scmixology_lib10_transcripts[, colnames(sce_lr)] combined_sce <- combine_sce(sce_lr, scmixology_lib90) combined_sce <- combined_sce |> scuttle::logNormCounts() |> scater::runPCA() |> scater::runUMAP() combined_imputed_sce <- sc_impute_transcript(combined_sce) plot_isoform_reduced_dim(combined_sce, 'ENSG00000108107') plot_isoform_reduced_dim(combined_imputed_sce, 'ENSG00000108107')
Plot isoforms, either from a gene or a list of transcript ids.
plot_isoforms( sce, gene_id, transcript_ids, n = 4, format = "plot_grid", colors )
plot_isoforms( sce, gene_id, transcript_ids, n = 4, format = "plot_grid", colors )
sce |
The |
gene_id |
The gene symbol of interest, ignored if |
transcript_ids |
The transcript ids to plot. |
n |
The number of top isoforms to plot from the gene. Ignored if |
format |
The format of the output, either "plot_grid" or "list". |
colors |
A character vector of colors to use for the isoforms. If not provided, gray will be used. for all isoforms. |
This function takes a SingleCellExperiment
object and plots the top isoforms of a gene,
or a list of specified transcript ids. Either as a list of plots or together in a grid.
This function wraps the ggbio::geom_alignment
function to plot the isoforms, and orders
the isoforms by expression levels (when specifying a gene) or by the order of the transcript_ids.
When format = "list"
, a list of ggplot
objects is returned.
Otherwise, a grid of the plots is returned.
plot_isoforms(scmixology_lib10_transcripts, gene_id = "ENSG00000108107")
plot_isoforms(scmixology_lib10_transcripts, gene_id = "ENSG00000108107")
This function plots a spatial pie chart for given features.
plot_spatial_isoform(spe, isoforms, assay_type = "counts")
plot_spatial_isoform(spe, isoforms, assay_type = "counts")
spe |
The SpatialExperiment object. |
isoforms |
The isoforms to plot. |
assay_type |
The assay that contains the given features. E.g. 'counts', 'logcounts'. |
A ggplot object.
This function plots a spatial pie chart for given features.
plot_spatial_pie(spe, features, assay_type = "counts")
plot_spatial_pie(spe, features, assay_type = "counts")
spe |
The SpatialExperiment object. |
features |
The features to plot. |
assay_type |
The assay that contains the given features. |
A ggplot object.
Calculate the per gene UMI count matrix by parsing the genome alignment file.
quantify_gene( annotation, outdir, infq, n_process, pipeline = "sc_single_sample", samples = NULL, random_seed = 2024 )
quantify_gene( annotation, outdir, infq, n_process, pipeline = "sc_single_sample", samples = NULL, random_seed = 2024 )
annotation |
The file path to the annotation file in GFF3 format |
outdir |
The path to directory to store all output files. |
infq |
The input FASTQ file. |
n_process |
The number of processes to use for parallelization. |
pipeline |
The pipeline type as a character string, either |
samples |
A vector of sample names, default to the file names of input fastq files,
or folder names if |
random_seed |
The random seed for reproducibility. |
After the genome alignment step (do_genome_align
), the alignment file will be parsed to
generate the per gene UMI count matrix. For each gene in the annotation file, the number of reads
overlapping with the gene’s genomic coordinates will be assigned to that gene. If a read overlaps
multiple genes, it will be assigned to the gene with the highest number of overlapping nucleotides.
If exon coordinates are included in the provided annotation, the decision will first consider the
number of nucleotides aligned to the exons of each gene. In cases of a tie, the overlap with introns
will be used as a tiebreaker. If there is still a tie after considering both exons and introns,
a random gene will be selected from the tied candidates.
After the read-to-gene assignment, the per gene UMI count matrix will be generated. Specifically, for each gene, the reads with similar mapping coordinates of transcript termination sites (TTS, i.e. the end of the the read with a polyT or polyA) will be grouped together. UMIs of reads in the same group will be collapsed to generate the UMI counts for each gene.
Finally, a new fastq file with deduplicated reads by keeping the longest read in each UMI.
The count matrix will be saved in the output folder as transcript_count.csv.gz
.
Calculate the transcript count matrix by parsing the re-alignment file.
quantify_transcript( annotation, outdir, config, pipeline = "sc_single_sample", ... )
quantify_transcript( annotation, outdir, config, pipeline = "sc_single_sample", ... )
annotation |
The file path to the annotation file in GFF3 format |
outdir |
The path to directory to store all output files. |
config |
Parsed FLAMES configurations. |
pipeline |
The pipeline type as a character string, either |
... |
Supply sample names as character vector (e.g. |
A SingleCellExperiment
object for single-cell pipeline, a list of SingleCellExperiment
objects for multi-sample pipeline, or a SummarizedExperiment
object for bulk pipeline.
temp_path <- tempfile() bfc <- BiocFileCache::BiocFileCache(temp_path, ask = FALSE) file_url <- "https://raw.githubusercontent.com/OliverVoogd/FLAMESData/master/data" fastq1 <- bfc[[names(BiocFileCache::bfcadd(bfc, "Fastq1", paste(file_url, "fastq/sample1.fastq.gz", sep = "/")))]] genome_fa <- bfc[[names(BiocFileCache::bfcadd(bfc, "genome.fa", paste(file_url, "SIRV_isoforms_multi-fasta_170612a.fasta", sep = "/")))]] annotation <- bfc[[names(BiocFileCache::bfcadd(bfc, "annot.gtf", paste(file_url, "SIRV_isoforms_multi-fasta-annotation_C_170612a.gtf", sep = "/")))]] outdir <- tempfile() dir.create(outdir) fasta <- annotation_to_fasta(annotation, genome_fa, outdir) config <- jsonlite::fromJSON(create_config(outdir, bambu_isoform_identification = TRUE, min_tr_coverage = 0.1, min_read_coverage = 0.1, min_sup_cnt = 1)) file.copy(annotation, file.path(outdir, "isoform_annotated.gtf")) ## Not run: if (!any(is.na(find_bin(c("minimap2", "k8"))))) { minimap2_realign( config = config, outdir = outdir, fq_in = fastq1 ) quantify_transcript_flames(annotation, outdir, config, pipeline = "bulk") } ## End(Not run)
temp_path <- tempfile() bfc <- BiocFileCache::BiocFileCache(temp_path, ask = FALSE) file_url <- "https://raw.githubusercontent.com/OliverVoogd/FLAMESData/master/data" fastq1 <- bfc[[names(BiocFileCache::bfcadd(bfc, "Fastq1", paste(file_url, "fastq/sample1.fastq.gz", sep = "/")))]] genome_fa <- bfc[[names(BiocFileCache::bfcadd(bfc, "genome.fa", paste(file_url, "SIRV_isoforms_multi-fasta_170612a.fasta", sep = "/")))]] annotation <- bfc[[names(BiocFileCache::bfcadd(bfc, "annot.gtf", paste(file_url, "SIRV_isoforms_multi-fasta-annotation_C_170612a.gtf", sep = "/")))]] outdir <- tempfile() dir.create(outdir) fasta <- annotation_to_fasta(annotation, genome_fa, outdir) config <- jsonlite::fromJSON(create_config(outdir, bambu_isoform_identification = TRUE, min_tr_coverage = 0.1, min_read_coverage = 0.1, min_sup_cnt = 1)) file.copy(annotation, file.path(outdir, "isoform_annotated.gtf")) ## Not run: if (!any(is.na(find_bin(c("minimap2", "k8"))))) { minimap2_realign( config = config, outdir = outdir, fq_in = fastq1 ) quantify_transcript_flames(annotation, outdir, config, pipeline = "bulk") } ## End(Not run)
Calculate the transcript count matrix by parsing the re-alignment file.
quantify_transcript_flames( annotation, outdir, config, pipeline = "sc_single_sample", samples )
quantify_transcript_flames( annotation, outdir, config, pipeline = "sc_single_sample", samples )
annotation |
The file path to the annotation file in GFF3 format |
outdir |
The path to directory to store all output files. |
config |
Parsed FLAMES configurations. |
pipeline |
The pipeline type as a character string, either |
samples |
A vector of sample names, required for |
A SingleCellExperiment
object for single-cell pipeline, a list of SingleCellExperiment
objects for multi-sample pipeline, or a SummarizedExperiment
object for bulk pipeline.
temp_path <- tempfile() bfc <- BiocFileCache::BiocFileCache(temp_path, ask = FALSE) file_url <- "https://raw.githubusercontent.com/OliverVoogd/FLAMESData/master/data" fastq1 <- bfc[[names(BiocFileCache::bfcadd(bfc, "Fastq1", paste(file_url, "fastq/sample1.fastq.gz", sep = "/")))]] genome_fa <- bfc[[names(BiocFileCache::bfcadd(bfc, "genome.fa", paste(file_url, "SIRV_isoforms_multi-fasta_170612a.fasta", sep = "/")))]] annotation <- bfc[[names(BiocFileCache::bfcadd(bfc, "annot.gtf", paste(file_url, "SIRV_isoforms_multi-fasta-annotation_C_170612a.gtf", sep = "/")))]] outdir <- tempfile() dir.create(outdir) fasta <- annotation_to_fasta(annotation, genome_fa, outdir) config <- jsonlite::fromJSON(create_config(outdir, bambu_isoform_identification = TRUE, min_tr_coverage = 0.1, min_read_coverage = 0.1, min_sup_cnt = 1)) file.copy(annotation, file.path(outdir, "isoform_annotated.gtf")) ## Not run: if (!any(is.na(find_bin(c("minimap2", "k8"))))) { minimap2_realign( config = config, outdir = outdir, fq_in = fastq1 ) quantify_transcript_flames(annotation, outdir, config, pipeline = "bulk") } ## End(Not run)
temp_path <- tempfile() bfc <- BiocFileCache::BiocFileCache(temp_path, ask = FALSE) file_url <- "https://raw.githubusercontent.com/OliverVoogd/FLAMESData/master/data" fastq1 <- bfc[[names(BiocFileCache::bfcadd(bfc, "Fastq1", paste(file_url, "fastq/sample1.fastq.gz", sep = "/")))]] genome_fa <- bfc[[names(BiocFileCache::bfcadd(bfc, "genome.fa", paste(file_url, "SIRV_isoforms_multi-fasta_170612a.fasta", sep = "/")))]] annotation <- bfc[[names(BiocFileCache::bfcadd(bfc, "annot.gtf", paste(file_url, "SIRV_isoforms_multi-fasta-annotation_C_170612a.gtf", sep = "/")))]] outdir <- tempfile() dir.create(outdir) fasta <- annotation_to_fasta(annotation, genome_fa, outdir) config <- jsonlite::fromJSON(create_config(outdir, bambu_isoform_identification = TRUE, min_tr_coverage = 0.1, min_read_coverage = 0.1, min_sup_cnt = 1)) file.copy(annotation, file.path(outdir, "isoform_annotated.gtf")) ## Not run: if (!any(is.na(find_bin(c("minimap2", "k8"))))) { minimap2_realign( config = config, outdir = outdir, fq_in = fastq1 ) quantify_transcript_flames(annotation, outdir, config, pipeline = "bulk") } ## End(Not run)
Given a set of mutations and gene annotation, calculate the relative position of each mutation within the gene body they are in.
relative_mutation_positions( mutations, annotation, bin = FALSE, by = c(region = "gene_name"), threads = 1 )
relative_mutation_positions( mutations, annotation, bin = FALSE, by = c(region = "gene_name"), threads = 1 )
mutations |
either the tibble output from |
annotation |
Either path to the annotation file (GTF/GFF) or a GRanges object of the gene annotation. |
bin |
logical(1): whether to bin the relative positions into 100 bins. |
by |
character(1): the column name in the annotation to match with the gene annotation.
E.g. |
threads |
integer(1): number of threads to use. |
If bin = FALSE
, a list of numeric vectors of relative positions of each mutation within the gene body.
If bin = TRUE
, a numeric vector of length 100 of the number of mutations in each bin.
outdir <- tempfile() dir.create(outdir) genome_fa <- system.file("extdata", "rps24.fa.gz", package = "FLAMES") minimap2_align( # align to genome config = jsonlite::fromJSON( system.file("extdata", "config_sclr_nanopore_3end.json", package = "FLAMES")), fa_file = genome_fa, fq_in = system.file("extdata", "fastq", "demultiplexed.fq.gz", package = "FLAMES"), annot = system.file("extdata", "rps24.gtf.gz", package = "FLAMES"), outdir = outdir ) variants <- find_variants( bam_path = file.path(outdir, "align2genome.bam"), reference = genome_fa, annotation = system.file("extdata", "rps24.gtf.gz", package = "FLAMES"), min_nucleotide_depth = 4 ) positions <- relative_mutation_positions( mutations = variants, annotation = system.file("extdata", "rps24.gtf.gz", package = "FLAMES") )
outdir <- tempfile() dir.create(outdir) genome_fa <- system.file("extdata", "rps24.fa.gz", package = "FLAMES") minimap2_align( # align to genome config = jsonlite::fromJSON( system.file("extdata", "config_sclr_nanopore_3end.json", package = "FLAMES")), fa_file = genome_fa, fq_in = system.file("extdata", "fastq", "demultiplexed.fq.gz", package = "FLAMES"), annot = system.file("extdata", "rps24.gtf.gz", package = "FLAMES"), outdir = outdir ) variants <- find_variants( bam_path = file.path(outdir, "align2genome.bam"), reference = genome_fa, annotation = system.file("extdata", "rps24.gtf.gz", package = "FLAMES"), min_nucleotide_depth = 4 ) positions <- relative_mutation_positions( mutations = variants, annotation = system.file("extdata", "rps24.gtf.gz", package = "FLAMES") )
Chi-square based differential transcription usage analysis. This variant is meant for single cell data.
Takes the SingleCellExperiment
object from sc_long_pipeline
as input.
Alternatively, the path to the output folder could be provided instead of the SCE object.
sc_DTU_analysis(sce, min_count = 15)
sc_DTU_analysis(sce, min_count = 15)
sce |
The |
min_count |
The minimum UMI count threshold for filtering isoforms. |
This function will search for genes that have at least two isoforms, each with more than min_count
UMI counts.
For each gene, the per cell transcript counts were merged by group to generate pseudo bulk samples.
Grouping is specified by the colLabels
of the SCE object.
The top 2 highly expressed transcripts for each group were selected and a UMI count matrix where
the rows are selected transcripts and columns are groups was used as input to a chi-square test of independence (chisq.test).
Adjusted P-values were calculated by Benjamini–Hochberg correction.
a data.frame
containing the following columns:
- differentially transcribed genes
- the X value for the DTU gene
- degrees of freedom of the approximate chi-squared distribution of the test statistic
- the transcript_id with the highest squared residuals
- the cell group with the highest squared residuals
- the p-value for the test
- the adjusted p-value (by Benjamini–Hochberg correction)
The table is sorted by decreasing P-values. It will also be saved as sc_DTU_analysis.csv
under the
output folder.
outdir <- tempfile() dir.create(outdir) bc_allow <- file.path(outdir, "bc_allow.tsv") genome_fa <- file.path(outdir, "rps24.fa") R.utils::gunzip( filename = system.file("extdata", "bc_allow.tsv.gz", package = "FLAMES"), destname = bc_allow, remove = FALSE ) R.utils::gunzip( filename = system.file("extdata", "rps24.fa.gz", package = "FLAMES"), destname = genome_fa, remove = FALSE ) sce <- FLAMES::sc_long_pipeline( genome_fa = genome_fa, fastq = system.file("extdata", "fastq", "musc_rps24.fastq.gz", package = "FLAMES"), annotation = system.file("extdata", "rps24.gtf.gz", package = "FLAMES"), outdir = outdir, barcodes_file = bc_allow, config_file = create_config(outdir, oarfish_quantification = FALSE) ) group_anno <- data.frame(barcode_seq = colnames(sce), groups = SingleCellExperiment::counts(sce)["ENSMUST00000169826.2", ] > 1) SingleCellExperiment::colLabels(sce) <- group_anno$groups sc_DTU_analysis(sce, min_count = 1)
outdir <- tempfile() dir.create(outdir) bc_allow <- file.path(outdir, "bc_allow.tsv") genome_fa <- file.path(outdir, "rps24.fa") R.utils::gunzip( filename = system.file("extdata", "bc_allow.tsv.gz", package = "FLAMES"), destname = bc_allow, remove = FALSE ) R.utils::gunzip( filename = system.file("extdata", "rps24.fa.gz", package = "FLAMES"), destname = genome_fa, remove = FALSE ) sce <- FLAMES::sc_long_pipeline( genome_fa = genome_fa, fastq = system.file("extdata", "fastq", "musc_rps24.fastq.gz", package = "FLAMES"), annotation = system.file("extdata", "rps24.gtf.gz", package = "FLAMES"), outdir = outdir, barcodes_file = bc_allow, config_file = create_config(outdir, oarfish_quantification = FALSE) ) group_anno <- data.frame(barcode_seq = colnames(sce), groups = SingleCellExperiment::counts(sce)["ENSMUST00000169826.2", ] > 1) SingleCellExperiment::colLabels(sce) <- group_anno$groups sc_DTU_analysis(sce, min_count = 1)
Impute missing transcript counts using a shared nearest neighbor graph
sc_impute_transcript(combined_sce, dimred = "PCA", ...)
sc_impute_transcript(combined_sce, dimred = "PCA", ...)
combined_sce |
A |
dimred |
The name of the reduced dimension to use for building the shared nearest neighbor graph. |
... |
Additional arguments to pass to |
For cells with NA
values in the "transcript" altExp slot, this function imputes the
missing values from cells with non-missing values. A shared nearest neighbor graph is built using
reduced dimensions from the SingleCellExperiment
object, and the imputation is done where
the imputed value for a cell is the weighted sum of the transcript counts of its neighbors.
Imputed values are stored in the "logcounts" assay of the "transcript" altExp slot.
The "counts" assay is used to obtain logcounts but left unchanged.
A SingleCellExperiment
object with imputed logcounts assay in the "transcript" altExp slot.
sce <- SingleCellExperiment::SingleCellExperiment(assays = list(counts = matrix(rpois(50, 5), ncol = 10))) long_read <- SingleCellExperiment::SingleCellExperiment(assays = list(counts = matrix(rpois(40, 5), ncol = 10))) SingleCellExperiment::altExp(sce, "transcript") <- long_read SingleCellExperiment::counts(SingleCellExperiment::altExp(sce))[,1:2] <- NA SingleCellExperiment::counts(SingleCellExperiment::altExp(sce)) imputed_sce <- sc_impute_transcript(sce, k = 4) SingleCellExperiment::logcounts(SingleCellExperiment::altExp(imputed_sce))
sce <- SingleCellExperiment::SingleCellExperiment(assays = list(counts = matrix(rpois(50, 5), ncol = 10))) long_read <- SingleCellExperiment::SingleCellExperiment(assays = list(counts = matrix(rpois(40, 5), ncol = 10))) SingleCellExperiment::altExp(sce, "transcript") <- long_read SingleCellExperiment::counts(SingleCellExperiment::altExp(sce))[,1:2] <- NA SingleCellExperiment::counts(SingleCellExperiment::altExp(sce)) imputed_sce <- sc_impute_transcript(sce, k = 4) SingleCellExperiment::logcounts(SingleCellExperiment::altExp(imputed_sce))
Semi-supervised isoform detection and annotation for long read data.
This variant is for multi-sample single cell data. By default, this pipeline demultiplexes input
fastq data (match_cell_barcode = TRUE
). Specific parameters relating to
analysis can be changed either through function arguments, or through a
configuration JSON file.
sc_long_multisample_pipeline( annotation, fastqs, outdir, genome_fa, minimap2 = NULL, k8 = NULL, barcodes_file = NULL, expect_cell_numbers = NULL, config_file = NULL )
sc_long_multisample_pipeline( annotation, fastqs, outdir, genome_fa, minimap2 = NULL, k8 = NULL, barcodes_file = NULL, expect_cell_numbers = NULL, config_file = NULL )
annotation |
The file path to the annotation file in GFF3 format |
fastqs |
The input fastq files for multiple samples. Should be a named vector of file paths (eithr to FASTQ files or directories containing FASTQ files). The names of the vector will be used as the sample names. |
outdir |
The path to directory to store all output files. |
genome_fa |
The file path to genome fasta file. |
minimap2 |
Path to minimap2, if it is not in PATH. Only required if either or both of
|
k8 |
Path to the k8 Javascript shell binary. Only required if |
barcodes_file |
The file path to the reference csv used for demultiplexing in flexiplex. If not specified, the
demultiplexing will be performed using BLAZE. Default is |
expect_cell_numbers |
A vector of roughly expected numbers of cells in each sample E.g., the targeted number of cells.
Required if using BLAZE for demultiplexing, specifically, when the |
config_file |
File path to the JSON configuration file. If specified, |
By default FLAMES use minimap2 for read alignment. After the genome alignment step (do_genome_align
), FLAMES summarizes the alignment for each read in every sample by grouping reads
with similar splice junctions to get a raw isoform annotation (do_isoform_id
). The raw isoform
annotation is compared against the reference annotation to correct potential splice site
and transcript start/end errors. Transcripts that have similar splice junctions
and transcript start/end to the reference transcript are merged with the
reference. This process will also collapse isoforms that are likely to be truncated
transcripts. If isoform_id_bambu
is set to TRUE
, bambu::bambu
will be used to generate the updated annotations (Not implemented for multi-sample yet).
Next is the read realignment step (do_read_realign
), where the sequence of each transcript from the update annotation is extracted, and
the reads are realigned to this updated transcript_assembly.fa
by minimap2. The
transcripts with only a few full-length aligned reads are discarded (Not implemented for multi-sample yet).
The reads are assigned to transcripts based on both alignment score, fractions of
reads aligned and transcript coverage. Reads that cannot be uniquely assigned to
transcripts or have low transcript coverage are discarded. The UMI transcript
count matrix is generated by collapsing the reads with the same UMI in a similar
way to what is done for short-read scRNA-seq data, but allowing for an edit distance
of up to 2 by default. Most of the parameters, such as the minimal distance to splice site and minimal percentage of transcript coverage
can be modified by the JSON configuration file (config_file
).
The default parameters can be changed either through the function
arguments are through the configuration JSON file config_file
. the pipeline_parameters
section specifies which steps are to be executed in the pipeline - by default, all
steps are executed. The isoform_parameters
section affects isoform detection - key
parameters include:
Min_sup_cnt
which causes transcripts with less reads aligned than it's value to be discarded
MAX_TS_DIST
which merges transcripts with the same intron
chain and TSS/TES distace less than MAX_TS_DIST
strand_specific
which specifies if reads are in the same strand as the mRNA (1), or the reverse complemented (-1) or not strand specific (0), which results in strand information being based on reference annotation.
If "do_transcript_quantification" set to true, a list with two elements:
A list of metadata from the pipeline run.
A list of SingleCellExperiment objects, one for each sample.
bulk_long_pipeline()
for bulk long data,
SingleCellExperiment()
for how data is outputted
reads <- ShortRead::readFastq( system.file("extdata", "fastq", "musc_rps24.fastq.gz", package = "FLAMES") ) outdir <- tempfile() dir.create(outdir) dir.create(file.path(outdir, "fastq")) bc_allow <- file.path(outdir, "bc_allow.tsv") genome_fa <- file.path(outdir, "rps24.fa") R.utils::gunzip( filename = system.file("extdata", "bc_allow.tsv.gz", package = "FLAMES"), destname = bc_allow, remove = FALSE ) R.utils::gunzip( filename = system.file("extdata", "rps24.fa.gz", package = "FLAMES"), destname = genome_fa, remove = FALSE ) ShortRead::writeFastq(reads[1:100], file.path(outdir, "fastq/sample1.fq.gz"), mode = "w", full = FALSE) reads <- reads[-(1:100)] ShortRead::writeFastq(reads[1:100], file.path(outdir, "fastq/sample2.fq.gz"), mode = "w", full = FALSE) reads <- reads[-(1:100)] ShortRead::writeFastq(reads, file.path(outdir, "fastq/sample3.fq.gz"), mode = "w", full = FALSE) sce_list <- FLAMES::sc_long_multisample_pipeline( annotation = system.file("extdata", "rps24.gtf.gz", package = "FLAMES"), fastqs = c("sampleA" = file.path(outdir, "fastq"), "sample1" = file.path(outdir, "fastq", "sample1.fq.gz"), "sample2" = file.path(outdir, "fastq", "sample2.fq.gz"), "sample3" = file.path(outdir, "fastq", "sample3.fq.gz")), outdir = outdir, genome_fa = genome_fa, barcodes_file = rep(bc_allow, 4) )
reads <- ShortRead::readFastq( system.file("extdata", "fastq", "musc_rps24.fastq.gz", package = "FLAMES") ) outdir <- tempfile() dir.create(outdir) dir.create(file.path(outdir, "fastq")) bc_allow <- file.path(outdir, "bc_allow.tsv") genome_fa <- file.path(outdir, "rps24.fa") R.utils::gunzip( filename = system.file("extdata", "bc_allow.tsv.gz", package = "FLAMES"), destname = bc_allow, remove = FALSE ) R.utils::gunzip( filename = system.file("extdata", "rps24.fa.gz", package = "FLAMES"), destname = genome_fa, remove = FALSE ) ShortRead::writeFastq(reads[1:100], file.path(outdir, "fastq/sample1.fq.gz"), mode = "w", full = FALSE) reads <- reads[-(1:100)] ShortRead::writeFastq(reads[1:100], file.path(outdir, "fastq/sample2.fq.gz"), mode = "w", full = FALSE) reads <- reads[-(1:100)] ShortRead::writeFastq(reads, file.path(outdir, "fastq/sample3.fq.gz"), mode = "w", full = FALSE) sce_list <- FLAMES::sc_long_multisample_pipeline( annotation = system.file("extdata", "rps24.gtf.gz", package = "FLAMES"), fastqs = c("sampleA" = file.path(outdir, "fastq"), "sample1" = file.path(outdir, "fastq", "sample1.fq.gz"), "sample2" = file.path(outdir, "fastq", "sample2.fq.gz"), "sample3" = file.path(outdir, "fastq", "sample3.fq.gz")), outdir = outdir, genome_fa = genome_fa, barcodes_file = rep(bc_allow, 4) )
Semi-supervised isoform detection and annotation for long read data.
This variant is for single cell data. By default, this pipeline demultiplexes input
fastq data (match_cell_barcode = TRUE
). Specific parameters relating to
analysis can be changed either through function arguments, or through a
configuration JSON file.
sc_long_pipeline( annotation, fastq, genome_bam = NULL, outdir, genome_fa, minimap2 = NULL, k8 = NULL, barcodes_file = NULL, expect_cell_number = NULL, config_file = NULL )
sc_long_pipeline( annotation, fastq, genome_bam = NULL, outdir, genome_fa, minimap2 = NULL, k8 = NULL, barcodes_file = NULL, expect_cell_number = NULL, config_file = NULL )
annotation |
The file path to the annotation file in GFF3 format |
fastq |
The file path to input fastq file |
genome_bam |
Optional file path to a bam file to use instead of fastq file (skips initial alignment step) |
outdir |
The path to directory to store all output files. |
genome_fa |
The file path to genome fasta file. |
minimap2 |
Path to minimap2, if it is not in PATH. Only required if either or both of
|
k8 |
Path to the k8 Javascript shell binary. Only required if |
barcodes_file |
The file path to the reference csv used for demultiplexing in flexiplex. If not specified, the
demultiplexing will be performed using BLAZE. Default is |
expect_cell_number |
Expected number of cells for identifying the barcode list in BLAZE.
This could be just a rough estimate. E.g., the targeted number of cells.
Required if the |
config_file |
File path to the JSON configuration file. If specified, |
By default FLAMES use minimap2 for read alignment. After the genome alignment step (do_genome_align
), FLAMES summarizes the alignment for each read by grouping reads
with similar splice junctions to get a raw isoform annotation (do_isoform_id
). The raw isoform
annotation is compared against the reference annotation to correct potential splice site
and transcript start/end errors. Transcripts that have similar splice junctions
and transcript start/end to the reference transcript are merged with the
reference. This process will also collapse isoforms that are likely to be truncated
transcripts. If isoform_id_bambu
is set to TRUE
, bambu::bambu
will be used to generate the updated annotations.
Next is the read realignment step (do_read_realign
), where the sequence of each transcript from the update annotation is extracted, and
the reads are realigned to this updated transcript_assembly.fa
by minimap2. The
transcripts with only a few full-length aligned reads are discarded.
The reads are assigned to transcripts based on both alignment score, fractions of
reads aligned and transcript coverage. Reads that cannot be uniquely assigned to
transcripts or have low transcript coverage are discarded. The UMI transcript
count matrix is generated by collapsing the reads with the same UMI in a similar
way to what is done for short-read scRNA-seq data, but allowing for an edit distance
of up to 2 by default. Most of the parameters, such as the minimal distance to splice site and minimal percentage of transcript coverage
can be modified by the JSON configuration file (config_file
).
The default parameters can be changed either through the function
arguments are through the configuration JSON file config_file
. the pipeline_parameters
section specifies which steps are to be executed in the pipeline - by default, all
steps are executed. The isoform_parameters
section affects isoform detection - key
parameters include:
Min_sup_cnt
which causes transcripts with less reads aligned than it's value to be discarded
MAX_TS_DIST
which merges transcripts with the same intron
chain and TSS/TES distace less than MAX_TS_DIST
strand_specific
which specifies if reads are in the same strand as the mRNA (1), or the reverse complemented (-1) or not strand specific (0), which results in strand information being based on reference annotation.
if do_transcript_quantification
set to true, sc_long_pipeline
returns a SingleCellExperiment
object, containing a count
matrix as an assay, gene annotations under metadata, as well as a list of the other
output files generated by the pipeline. The pipeline also outputs a number of output
files into the given outdir
directory. These output files generated by the pipeline are:
- a transcript count matrix (also contained in the SingleCellExperiment)
- isoforms in gff3 format (also contained in the SingleCellExperiment)
- transcript sequence from the isoforms
- sorted BAM file with reads aligned to genome
- sorted realigned BAM file using the transcript_assembly.fa as reference
- TSS TES enrichment for all reads (for QC)
if do_transcript_quantification
set to false, nothing will be returned
bulk_long_pipeline()
for bulk long data,
SingleCellExperiment()
for how data is outputted
outdir <- tempfile() dir.create(outdir) bc_allow <- file.path(outdir, "bc_allow.tsv") genome_fa <- file.path(outdir, "rps24.fa") R.utils::gunzip( filename = system.file("extdata", "bc_allow.tsv.gz", package = "FLAMES"), destname = bc_allow, remove = FALSE ) R.utils::gunzip( filename = system.file("extdata", "rps24.fa.gz", package = "FLAMES"), destname = genome_fa, remove = FALSE ) if (!any(is.na(find_bin(c("minimap2", "k8"))))) { sce <- FLAMES::sc_long_pipeline( genome_fa = genome_fa, fastq = system.file("extdata", "fastq", "musc_rps24.fastq.gz", package = "FLAMES"), annotation = system.file("extdata", "rps24.gtf.gz", package = "FLAMES"), outdir = outdir, barcodes_file = bc_allow ) }
outdir <- tempfile() dir.create(outdir) bc_allow <- file.path(outdir, "bc_allow.tsv") genome_fa <- file.path(outdir, "rps24.fa") R.utils::gunzip( filename = system.file("extdata", "bc_allow.tsv.gz", package = "FLAMES"), destname = bc_allow, remove = FALSE ) R.utils::gunzip( filename = system.file("extdata", "rps24.fa.gz", package = "FLAMES"), destname = genome_fa, remove = FALSE ) if (!any(is.na(find_bin(c("minimap2", "k8"))))) { sce <- FLAMES::sc_long_pipeline( genome_fa = genome_fa, fastq = system.file("extdata", "fastq", "musc_rps24.fastq.gz", package = "FLAMES"), annotation = system.file("extdata", "rps24.gtf.gz", package = "FLAMES"), outdir = outdir, barcodes_file = bc_allow ) }
Count the number of reads supporting each variants at the given positions for each cell.
sc_mutations(bam_path, seqnames, positions, indel = FALSE, threads = 1)
sc_mutations(bam_path, seqnames, positions, indel = FALSE, threads = 1)
bam_path |
character(1) or character(n): path to the bam file(s) aligned to the reference genome (NOT the transcriptome! Unless the postions are also from the transcriptome). |
seqnames |
character(n): chromosome names of the postions to count alleles. |
positions |
integer(n): positions, 1-based, same length as seqnames. The positions to count alleles. |
indel |
logical(1): whether to count indels (TRUE) or SNPs (FALSE). |
threads |
integer(1): number of threads to use. Maximum number of threads is the number of bam files * number of positions. |
A tibble with columns: allele, barcode, allele_count, cell_total_reads, pct, pos, seqname.
outdir <- tempfile() dir.create(outdir) genome_fa <- file.path(outdir, "rps24.fa") R.utils::gunzip( filename = system.file("extdata", "rps24.fa.gz", package = "FLAMES"), destname = genome_fa, remove = FALSE ) minimap2_align( # align to genome config = jsonlite::fromJSON( system.file("extdata", "config_sclr_nanopore_3end.json", package = "FLAMES") ), fa_file = genome_fa, fq_in = system.file("extdata", "fastq", "demultiplexed.fq.gz", package = "FLAMES"), annot = system.file("extdata", "rps24.gtf.gz", package = "FLAMES"), outdir = outdir ) snps_tb <- sc_mutations( bam_path = file.path(outdir, "align2genome.bam"), seqnames = c("chr14", "chr14"), positions = c(1260, 2714), # positions of interest indel = FALSE ) head(snps_tb) snps_tb |> dplyr::filter(pos == 1260) |> dplyr::group_by(allele) |> dplyr::summarise(count = sum(allele_count)) # should be identical to samtools pileup
outdir <- tempfile() dir.create(outdir) genome_fa <- file.path(outdir, "rps24.fa") R.utils::gunzip( filename = system.file("extdata", "rps24.fa.gz", package = "FLAMES"), destname = genome_fa, remove = FALSE ) minimap2_align( # align to genome config = jsonlite::fromJSON( system.file("extdata", "config_sclr_nanopore_3end.json", package = "FLAMES") ), fa_file = genome_fa, fq_in = system.file("extdata", "fastq", "demultiplexed.fq.gz", package = "FLAMES"), annot = system.file("extdata", "rps24.gtf.gz", package = "FLAMES"), outdir = outdir ) snps_tb <- sc_mutations( bam_path = file.path(outdir, "align2genome.bam"), seqnames = c("chr14", "chr14"), positions = c(1260, 2714), # positions of interest indel = FALSE ) head(snps_tb) snps_tb |> dplyr::filter(pos == 1260) |> dplyr::group_by(allele) |> dplyr::summarise(count = sum(allele_count)) # should be identical to samtools pileup
Short-read gene counts from long and short-read single cell RNA-seq profiling of human lung adenocarcinoma cell lines using 10X version 2 chemstry. See Tian, L. et al. Comprehensive characterization of single-cell full-length isoforms in human and mouse with long-read sequencing. Genome Biology 22, 310 (2021).
scmixology_lib10
scmixology_lib10
## 'scmixology_lib10'
A SingleCellExperiment
with 7,240 rows and 60 columns:
<https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE154869>
long-read transcript counts from long and short-read single cell RNA-seq profiling of human lung adenocarcinoma cell lines using 10X version 2 chemstry. See Tian, L. et al. Comprehensive characterization of single-cell full-length isoforms in human and mouse with long-read sequencing. Genome Biology 22, 310 (2021).
scmixology_lib10_transcripts
scmixology_lib10_transcripts
## 'scmixology_lib10_transcripts'
A SingleCellExperiment
with 7,240 rows and 60 columns:
<https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE154869>
Short-read single cell RNA-seq profiling of human lung adenocarcinoma cell lines using 10X version 2 chemstry. Single cells from five human lung adenocarcinoma cell lines (H2228, H1975, A549, H838 and HCC827) were mixed in equal proportions and processed using the Chromium 10X platform, then sequenced using Illumina HiSeq 2500. See Tian L, Dong X, Freytag S, Lê Cao KA et al. Benchmarking single cell RNA-sequencing analysis pipelines using mixture control experiments. Nat Methods 2019 Jun;16(6):479-487. PMID: 31133762
scmixology_lib90
scmixology_lib90
## 'scmixology_lib90'
A SingleCellExperiment
<https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE126906>
Given a vector of read counts, return a vector of weights.
The weights could be either the read counts themselves (type = 'counts'
),
a binary vector of 0s and 1s where 1s are assigned to transcripts with read
counts above a threshold (type = 'equal', min_counts = 1000
), or a
sigmoid function of the read counts (type = 'sigmoid'
). The sigmoid
function is defined as 1 / (1 + exp(-steepness/inflection * (x - inflection)))
.
weight_transcripts( counts, type = "sigmoid", min_counts = 1000, inflection_idx = 10, inflection_max = 1000, steepness = 5 )
weight_transcripts( counts, type = "sigmoid", min_counts = 1000, inflection_idx = 10, inflection_max = 1000, steepness = 5 )
counts |
numeric vector of read counts |
type |
string, one of 'counts', 'sigmoid', or 'equal' |
min_counts |
numeric, the threshold for the 'equal' type |
inflection_idx |
numeric, the index of the read counts to determine the inflection point for the sigmoid function. The default is 10, i.e. the 10th highest read count will be the inflection point. |
inflection_max |
numeric, the maximum value for the inflection point. If the inflection point according to the inflection_idx is higher than this value, the inflection point will be set to this value instead. |
steepness |
numeric, the steepness of the sigmoid function |
numeric vector of weights
weight_transcripts(1:2000) par(mfrow = c(2, 2)) plot( 1:2000, weight_transcripts(1:2000, type = 'sigmoid'), type = 'l', xlab = 'Read counts', ylab = 'Sigmoid weight' ) plot( 1:2000, weight_transcripts(1:2000, type = 'counts'), type = 'l', xlab = 'Read counts', ylab = 'Weight by counts' ) plot( 1:2000, weight_transcripts(1:2000, type = 'equal'), type = 'l', xlab = 'Read counts', ylab = 'Equal weights' )
weight_transcripts(1:2000) par(mfrow = c(2, 2)) plot( 1:2000, weight_transcripts(1:2000, type = 'sigmoid'), type = 'l', xlab = 'Read counts', ylab = 'Sigmoid weight' ) plot( 1:2000, weight_transcripts(1:2000, type = 'counts'), type = 'l', xlab = 'Read counts', ylab = 'Weight by counts' ) plot( 1:2000, weight_transcripts(1:2000, type = 'equal'), type = 'l', xlab = 'Read counts', ylab = 'Equal weights' )