| Title: | FLAMES: Full Length Analysis of Mutations and Splicing in long read RNA-seq data |
|---|---|
| Description: | Semi-supervised isoform detection and annotation from both bulk and single-cell long read RNA-seq data. Flames provides automated pipelines for analysing isoforms, as well as intermediate functions for manual execution. |
| Authors: | Changqing Wang [aut, cre], Luyi Tian [aut], Oliver Voogd [aut], Jakob Schuster [aut], Shian Su [aut], Yair D.J. Prawer [aut], Yupei You [aut], Matthew Ritchie [ctb] |
| Maintainer: | Changqing Wang <[email protected]> |
| License: | GPL (>= 3) |
| Version: | 2.7.0 |
| Built: | 2026-06-05 11:05:38 UTC |
| Source: | https://github.com/bioc/FLAMES |
SingleCellExperiment objectAdd gene counts to a SingleCellExperiment object
as an altExps slot named gene.
add_gene_counts(sce, gene_count_stem)add_gene_counts(sce, gene_count_stem)
sce |
A |
gene_count_stem |
The file path stem for the gene count MTX files. The function
expects three files: |
A SingleCellExperiment object with gene counts added as
altExps(sce)$gene.
# Set up a mock SingleCellExperiment object sce <- SingleCellExperiment::SingleCellExperiment( assays = list(counts = matrix(0, nrow = 10, ncol = 10)) ) colnames(sce) <- paste0("cell", 1:10) # Write mock gene count MTX files gene_count_stem <- file.path(tempdir(), "gene_count") gene_mtx <- Matrix::Matrix(1:10, nrow = 2, ncol = 5, sparse = TRUE) colnames(gene_mtx) <- paste0("cell", 1:5) rownames(gene_mtx) <- c("gene1", "gene2") Matrix::writeMM(gene_mtx, paste0(gene_count_stem, ".mtx")) writeLines(rownames(gene_mtx), paste0(gene_count_stem, "_features.tsv")) writeLines(colnames(gene_mtx), paste0(gene_count_stem, "_barcodes.tsv")) # Add gene counts to the SingleCellExperiment object sce <- add_gene_counts(sce, gene_count_stem) # verify the gene counts are added SingleCellExperiment::altExps(sce)$gene# Set up a mock SingleCellExperiment object sce <- SingleCellExperiment::SingleCellExperiment( assays = list(counts = matrix(0, nrow = 10, ncol = 10)) ) colnames(sce) <- paste0("cell", 1:10) # Write mock gene count MTX files gene_count_stem <- file.path(tempdir(), "gene_count") gene_mtx <- Matrix::Matrix(1:10, nrow = 2, ncol = 5, sparse = TRUE) colnames(gene_mtx) <- paste0("cell", 1:5) rownames(gene_mtx) <- c("gene1", "gene2") Matrix::writeMM(gene_mtx, paste0(gene_count_stem, ".mtx")) writeLines(rownames(gene_mtx), paste0(gene_count_stem, "_features.tsv")) writeLines(colnames(gene_mtx), paste0(gene_count_stem, "_barcodes.tsv")) # Add gene counts to the SingleCellExperiment object sce <- add_gene_counts(sce, gene_count_stem) # verify the gene counts are added SingleCellExperiment::altExps(sce)$gene
convert the transcript annotation to transcriptome assembly as FASTA file.
annotation_to_fasta(isoform_annotation, genome_fa, outfile, extract_fn)annotation_to_fasta(isoform_annotation, genome_fa, outfile, extract_fn)
isoform_annotation |
Path to the annotation file (GTF/GFF3) |
genome_fa |
The file path to genome fasta file. |
outfile |
The file path to the output FASTA file. |
extract_fn |
(optional) Function to extract a |
This does not return anything. A FASTA file will be created at the specified location.
fasta <- tempfile() annotation_to_fasta(system.file("extdata", "rps24.gtf.gz", package = "FLAMES"), system.file("extdata", "rps24.fa.gz", package = "FLAMES"), fasta) cat(readChar(fasta, 1e3))fasta <- tempfile() annotation_to_fasta(system.file("extdata", "rps24.gtf.gz", package = "FLAMES"), system.file("extdata", "rps24.fa.gz", package = "FLAMES"), fasta) cat(readChar(fasta, 1e3))
Creates a FlexiplexGroup object that links together a set of
MATCHED_SPLIT barcode_segments for concatenatiing segments before barcode
matching. A barcode group lets flexiplex jointly match multiple segment sequences
against a shared allow-list (e.g. the concatenation of two barcode halves).
barcode_group(name, bc_list_name, max_edit_distance = 2)barcode_group(name, bc_list_name, max_edit_distance = 2)
name |
The group identifier. Must match the |
bc_list_name |
A key used to look up the allow-list file for this group from
the |
max_edit_distance |
Non-negative integer. Maximum total edit distance allowed
when matching the concatenated segment sequences against the group allow-list.
Default: |
A FlexiplexGroup object for use in find_barcode.
Creates a FlexiplexSegment object describing one component of a read's
barcode structure. A list of segments is passed to find_barcode
(via the segments argument) to define how reads are parsed from 5' to 3'.
barcode_segment( type = "FIXED", pattern, name, bc_list = NA_character_, group = NA_character_, buffer_size = 2, max_edit_distance = 2 )barcode_segment( type = "FIXED", pattern, name, bc_list = NA_character_, group = NA_character_, buffer_size = 2, max_edit_distance = 2 )
type |
Segment type: one of |
pattern |
The nucleotide pattern for this segment.
For |
name |
Label for this segment in output files (e.g. |
bc_list |
For |
group |
For |
buffer_size |
Non-negative integer. Extra nucleotides searched on each side of
the expected segment position to accommodate small insertions/deletions. Only applies
to |
max_edit_distance |
Non-negative integer. Maximum edit distance allowed when
matching a read sequence against the barcode allow-list. Only applies to
|
Four segment types are supported:
"FIXED"A known, constant flanking sequence (e.g. a sequencing primer or poly-T tail). Used as an alignment anchor; i.e. no barcode allow-list is associated
"MATCHED"A variable sequence matched against a barcode allow-list
(e.g. a cell barcode). The allow-list file is resolved via the barcodes_files
argument of find_barcode. bc_list must be supplied (or
barcodes_files must be a single unnamed path).
"RANDOM"A random sequence of fixed length captured verbatim without matching (e.g. a UMI). No allow-list is needed.
"MATCHED_SPLIT"Like "MATCHED", but participates in a
barcode_group for multi-segment barcode matching,
where the all MATCHED_SPLIT segments of the same group are concatenated and then
matched to the allow-list barcodes.
group must name a corresponding barcode_group.
A FlexiplexSegment object for use in find_barcode.
# A typical 10x Genomics 3' v3 barcode structure: segments <- list( barcode_segment(type = "FIXED", pattern = "CTACACGACGCTCTTCCGATCT", name = "primer"), barcode_segment(type = "MATCHED", pattern = "NNNNNNNNNNNNNNNN", name = "CB", bc_list = "CB", buffer_size = 5, max_edit_distance = 2), barcode_segment(type = "RANDOM", pattern = "NNNNNNNNNNNN", name = "UB"), barcode_segment(type = "FIXED", pattern = "TTTTTTTTT", name = "polyT") )# A typical 10x Genomics 3' v3 barcode structure: segments <- list( barcode_segment(type = "FIXED", pattern = "CTACACGACGCTCTTCCGATCT", name = "primer"), barcode_segment(type = "MATCHED", pattern = "NNNNNNNNNNNNNNNN", name = "CB", bc_list = "CB", buffer_size = 5, max_edit_distance = 2), barcode_segment(type = "RANDOM", pattern = "NNNNNNNNNNNN", name = "UB"), barcode_segment(type = "FIXED", pattern = "TTTTTTTTT", name = "polyT") )
Uses BLAZE to generate barcode list and assign reads to cell barcodes.
blaze( expect_cells, fq_in, outdir, fq_out, sample_name = "", additional_args = NULL, ... )blaze( expect_cells, fq_in, outdir, fq_out, sample_name = "", additional_args = NULL, ... )
expect_cells |
Integer, expected number of cells. Note: this could be just a rough estimate. E.g., the targeted number of cells. |
fq_in |
File path to the fastq file used as a query sequence file |
outdir |
Output directory to save BLAZE results. |
fq_out |
File path to save the output fastq file containing reads assigned to cell barcodes. |
sample_name |
Sample name prefix for output files. Default is an empty string. |
additional_args |
Additional command line style arguments to be passed to BLAZE. E.g. c("–10x-kit-version", "3v3") |
... |
Additional BLAZE configuration parameters. E.g., setting 'overwrite=TRUE' is equivalent to switch on the '–overwrite' option. Note that the specified parameters will override the parameters specified in the configuration file. All available options can be found at https://github.com/shimlab/BLAZE. |
A data.frame summarising the reads aligned. Other outputs are written to disk.
The details of the output files can be found at https://github.com/shimlab/BLAZE.
outdir <- tempfile() dir.create(outdir) fastq <- system.file("extdata", "fastq", "musc_rps24.fastq.gz", package = "FLAMES") blaze( expect_cells = 10, fastq, outdir = outdir, fq_out = file.path(outdir, "blaze_matched_reads.fastq.gz"), overwrite = TRUE )outdir <- tempfile() dir.create(outdir) fastq <- system.file("extdata", "fastq", "musc_rps24.fastq.gz", package = "FLAMES") blaze( expect_cells = 10, fastq, outdir = outdir, fq_out = file.path(outdir, "blaze_matched_reads.fastq.gz"), overwrite = TRUE )
This function is deprecated. Use BulkPipeline instead.
bulk_long_pipeline( annotation, fastq, outdir, genome_fa, minimap2 = NULL, config_file )bulk_long_pipeline( annotation, fastq, outdir, genome_fa, minimap2 = NULL, config_file )
annotation |
The file path to the annotation file in GFF3 / GTF format. |
fastq |
Path to the FASTQ file or a directory containing FASTQ files. Each file will be processed as an individual sample. |
outdir |
Path to the output directory. If it does not exist, it will be created. |
genome_fa |
The file path to the reference genome in FASTA format. |
minimap2 |
(optional) The path to the minimap2 binary. If not provided, FLAMES will
use a copy from bioconda via |
config_file |
Path to the JSON configuration file. See |
A SummarizedExperiment object containing the transcript counts.
BulkPipeline for the new pipeline function.
SingleCellPipeline for single cell pipelines,
MultiSampleSCPipeline for multi sample single cell pipelines.
outdir <- tempfile() dir.create(outdir) # simulate 3 samples via sampling reads <- ShortRead::readFastq( system.file("extdata", "fastq", "musc_rps24.fastq.gz", package = "FLAMES") ) dir.create(file.path(outdir, "fastq")) ShortRead::writeFastq(reads[1:100], file.path(outdir, "fastq/sample1.fq.gz"), mode = "w", full = FALSE ) reads <- reads[-(1:100)] ShortRead::writeFastq(reads[1:100], file.path(outdir, "fastq/sample2.fq.gz"), mode = "w", full = FALSE ) reads <- reads[-(1:100)] ShortRead::writeFastq(reads, file.path(outdir, "fastq/sample3.fq.gz"), mode = "w", full = FALSE ) # prepare the reference genome genome_fa <- file.path(outdir, "rps24.fa") R.utils::gunzip( filename = system.file("extdata", "rps24.fa.gz", package = "FLAMES"), destname = genome_fa, remove = FALSE ) se <- bulk_long_pipeline( fastq = file.path(outdir, "fastq"), annotation = system.file("extdata", "rps24.gtf.gz", package = "FLAMES"), outdir = outdir, genome_fa = genome_fa, config_file = create_config(outdir, type = "sc_3end", threads = 1, no_flank = TRUE) ) seoutdir <- tempfile() dir.create(outdir) # simulate 3 samples via sampling reads <- ShortRead::readFastq( system.file("extdata", "fastq", "musc_rps24.fastq.gz", package = "FLAMES") ) dir.create(file.path(outdir, "fastq")) ShortRead::writeFastq(reads[1:100], file.path(outdir, "fastq/sample1.fq.gz"), mode = "w", full = FALSE ) reads <- reads[-(1:100)] ShortRead::writeFastq(reads[1:100], file.path(outdir, "fastq/sample2.fq.gz"), mode = "w", full = FALSE ) reads <- reads[-(1:100)] ShortRead::writeFastq(reads, file.path(outdir, "fastq/sample3.fq.gz"), mode = "w", full = FALSE ) # prepare the reference genome genome_fa <- file.path(outdir, "rps24.fa") R.utils::gunzip( filename = system.file("extdata", "rps24.fa.gz", package = "FLAMES"), destname = genome_fa, remove = FALSE ) se <- bulk_long_pipeline( fastq = file.path(outdir, "fastq"), annotation = system.file("extdata", "rps24.gtf.gz", package = "FLAMES"), outdir = outdir, genome_fa = genome_fa, config_file = create_config(outdir, type = "sc_3end", threads = 1, no_flank = TRUE) ) se
Semi-supervised isofrom detection and annotation for long read data.
This variant is meant for bulk samples. Specific parameters can be configured in
the config file (see create_config), input files are specified via
arguments.
BulkPipeline( config_file, outdir, fastq, annotation, genome_fa, genome_mmi, minimap2, samtools, controllers )BulkPipeline( config_file, outdir, fastq, annotation, genome_fa, genome_mmi, minimap2, samtools, controllers )
config_file |
Path to the JSON configuration file. See |
outdir |
Path to the output directory. If it does not exist, it will be created. |
fastq |
Path to the FASTQ file or a directory containing FASTQ files. Each file will be processed as an individual sample. |
annotation |
The file path to the annotation file in GFF3 / GTF format. |
genome_fa |
The file path to the reference genome in FASTA format. |
genome_mmi |
(optional) The file path to minimap2's index reference genome. |
minimap2 |
(optional) The path to the minimap2 binary. If not provided, FLAMES will
use a copy from bioconda via |
samtools |
(optional) The path to the samtools binary. If not provided, FLAMES will
use a copy from bioconda via |
controllers |
(optional, experimental) A |
By default FLAMES use minimap2 for read alignment. After the genome alignment step (do_genome_align), FLAMES summarizes the alignment for each read by grouping reads
with similar splice junctions to get a raw isoform annotation (do_isoform_id). The raw isoform
annotation is compared against the reference annotation to correct potential splice site
and transcript start/end errors. Transcripts that have similar splice junctions
and transcript start/end to the reference transcript are merged with the
reference. This process will also collapse isoforms that are likely to be truncated
transcripts. If isoform_id_bambu is set to TRUE, bambu::bambu will be used to generate the updated annotations.
Next is the read realignment step (do_read_realign), where the sequence of each transcript from the update annotation is extracted, and
the reads are realigned to this updated transcript_assembly.fa by minimap2. The
transcripts with only a few full-length aligned reads are discarded.
The reads are assigned to transcripts based on both alignment score, fractions of
reads aligned and transcript coverage. Reads that cannot be uniquely assigned to
transcripts or have low transcript coverage are discarded. The UMI transcript
count matrix is generated by collapsing the reads with the same UMI in a similar
way to what is done for short-read scRNA-seq data, but allowing for an edit distance
of up to 2 by default. Most of the parameters, such as the minimal distance to splice site and minimal percentage of transcript coverage
can be modified by the JSON configuration file (config_file).
A FLAMES.Pipeline object. The pipeline could be run using run_FLAMES, and / or resumed using resume_FLAMES.
create_config for creating a configuration file,
SingleCellPipeline for single cell pipelines,
MultiSampleSCPipeline for multi sample single cell pipelines.
outdir <- tempfile() dir.create(outdir) # simulate 3 samples via sampling reads <- ShortRead::readFastq( system.file("extdata", "fastq", "musc_rps24.fastq.gz", package = "FLAMES") ) dir.create(file.path(outdir, "fastq")) ShortRead::writeFastq(reads[1:100], file.path(outdir, "fastq/sample1.fq.gz"), mode = "w", full = FALSE ) reads <- reads[-(1:100)] ShortRead::writeFastq(reads[1:100], file.path(outdir, "fastq/sample2.fq.gz"), mode = "w", full = FALSE ) reads <- reads[-(1:100)] ShortRead::writeFastq(reads, file.path(outdir, "fastq/sample3.fq.gz"), mode = "w", full = FALSE ) # prepare the reference genome genome_fa <- file.path(outdir, "rps24.fa") R.utils::gunzip( filename = system.file("extdata", "rps24.fa.gz", package = "FLAMES"), destname = genome_fa, remove = FALSE ) ppl <- BulkPipeline( fastq = c( "sample1" = file.path(outdir, "fastq", "sample1.fq.gz"), "sample2" = file.path(outdir, "fastq", "sample2.fq.gz"), "sample3" = file.path(outdir, "fastq", "sample3.fq.gz") ), annotation = system.file("extdata", "rps24.gtf.gz", package = "FLAMES"), genome_fa = genome_fa, config_file = create_config(outdir, type = "sc_3end", threads = 1, no_flank = TRUE), outdir = outdir ) ppl <- run_FLAMES(ppl) # run the pipeline experiment(ppl) # get the result as SummarizedExperimentoutdir <- tempfile() dir.create(outdir) # simulate 3 samples via sampling reads <- ShortRead::readFastq( system.file("extdata", "fastq", "musc_rps24.fastq.gz", package = "FLAMES") ) dir.create(file.path(outdir, "fastq")) ShortRead::writeFastq(reads[1:100], file.path(outdir, "fastq/sample1.fq.gz"), mode = "w", full = FALSE ) reads <- reads[-(1:100)] ShortRead::writeFastq(reads[1:100], file.path(outdir, "fastq/sample2.fq.gz"), mode = "w", full = FALSE ) reads <- reads[-(1:100)] ShortRead::writeFastq(reads, file.path(outdir, "fastq/sample3.fq.gz"), mode = "w", full = FALSE ) # prepare the reference genome genome_fa <- file.path(outdir, "rps24.fa") R.utils::gunzip( filename = system.file("extdata", "rps24.fa.gz", package = "FLAMES"), destname = genome_fa, remove = FALSE ) ppl <- BulkPipeline( fastq = c( "sample1" = file.path(outdir, "fastq", "sample1.fq.gz"), "sample2" = file.path(outdir, "fastq", "sample2.fq.gz"), "sample3" = file.path(outdir, "fastq", "sample3.fq.gz") ), annotation = system.file("extdata", "rps24.gtf.gz", package = "FLAMES"), genome_fa = genome_fa, config_file = create_config(outdir, type = "sc_3end", threads = 1, no_flank = TRUE), outdir = outdir ) ppl <- run_FLAMES(ppl) # run the pipeline experiment(ppl) # get the result as SummarizedExperiment
Combine FLT-seq SingleCellExperiment objects
combine_sce(sce_with_lr, sce_without_lr)combine_sce(sce_with_lr, sce_without_lr)
sce_with_lr |
A |
sce_without_lr |
A |
For protcols like FLT-seq that generate two libraries, one with both short and long reads,
and one with only short reads, this function combines the two libraries into a single
SingleCellExperiment object. For the library with both long and short reads, the long-read
transcript counts should be stored in the 'transcript' altExp slot of the SingleCellExperiment
object. This function will combine the short-read gene counts of both libraries, and for the
transcripts counts, it will leave NA values for the cells from the short-read only library.
The sc_impute_transcript function can then be used to impute the NA values.
A SingleCellExperiment object with combined gene counts and a "transcript" altExp slot.
with_lr <- SingleCellExperiment::SingleCellExperiment(assays = list(counts = matrix(rpois(100, 5), ncol = 10))) without_lr <- SingleCellExperiment::SingleCellExperiment(assays = list(counts = matrix(rpois(200, 5), ncol = 20))) long_read <- SingleCellExperiment::SingleCellExperiment(assays = list(counts = matrix(rpois(50, 5), ncol = 10))) SingleCellExperiment::altExp(with_lr, "transcript") <- long_read SummarizedExperiment::colData(with_lr)$Barcode <- paste0(1:10, "-1") SummarizedExperiment::colData(without_lr)$Barcode <- paste0(8:27, "-1") rownames(with_lr) <- as.character(101:110) rownames(without_lr) <- as.character(103:112) rownames(long_read) <- as.character(1001:1005) combined_sce <- FLAMES::combine_sce(sce_with_lr = with_lr, sce_without_lr = without_lr) combined_scewith_lr <- SingleCellExperiment::SingleCellExperiment(assays = list(counts = matrix(rpois(100, 5), ncol = 10))) without_lr <- SingleCellExperiment::SingleCellExperiment(assays = list(counts = matrix(rpois(200, 5), ncol = 20))) long_read <- SingleCellExperiment::SingleCellExperiment(assays = list(counts = matrix(rpois(50, 5), ncol = 10))) SingleCellExperiment::altExp(with_lr, "transcript") <- long_read SummarizedExperiment::colData(with_lr)$Barcode <- paste0(1:10, "-1") SummarizedExperiment::colData(without_lr)$Barcode <- paste0(8:27, "-1") rownames(with_lr) <- as.character(101:110) rownames(without_lr) <- as.character(103:112) rownames(long_read) <- as.character(1001:1005) combined_sce <- FLAMES::combine_sce(sce_with_lr = with_lr, sce_without_lr = without_lr) combined_sce
This function returns the configuration of the pipeline.
config(pipeline) ## S4 method for signature 'FLAMES.Pipeline' config(pipeline)config(pipeline) ## S4 method for signature 'FLAMES.Pipeline' config(pipeline)
pipeline |
An object of class 'FLAMES.Pipeline'. |
A list containing the configuration of the pipeline.
pipeline <- example_pipeline(type = "BulkPipeline") config(pipeline)pipeline <- example_pipeline(type = "BulkPipeline") config(pipeline)
This function sets the configuration of the pipeline.
config(pipeline) <- value ## S4 replacement method for signature 'FLAMES.Pipeline' config(pipeline) <- valueconfig(pipeline) <- value ## S4 replacement method for signature 'FLAMES.Pipeline' config(pipeline) <- value
pipeline |
An pipeline of class 'FLAMES.Pipeline'. |
value |
A list containing the configuration of the pipeline, or a path to a JSON configuration file. |
An pipeline of class 'FLAMES.Pipeline' with the updated configuration.
pipeline <- example_pipeline(type = "BulkPipeline") # Set a new configuration config(pipeline) <- create_config(outdir = tempdir())pipeline <- example_pipeline(type = "BulkPipeline") # Set a new configuration config(pipeline) <- create_config(outdir = tempdir())
Gets the controllers for the pipeline.
controllers(pipeline) ## S4 method for signature 'FLAMES.Pipeline' controllers(pipeline)controllers(pipeline) ## S4 method for signature 'FLAMES.Pipeline' controllers(pipeline)
pipeline |
A FLAMES.Pipeline object. |
A named list of crew_class_controller objects, where each
controller corresponds to a step in the pipeline.
pipeline <- example_pipeline(type = "MultiSampleSCPipeline") controllers(pipeline) # get the controllerspipeline <- example_pipeline(type = "MultiSampleSCPipeline") controllers(pipeline) # get the controllers
Sets the controllers for the pipeline.
controllers(pipeline) <- value ## S4 replacement method for signature 'FLAMES.Pipeline' controllers(pipeline) <- valuecontrollers(pipeline) <- value ## S4 replacement method for signature 'FLAMES.Pipeline' controllers(pipeline) <- value
pipeline |
A FLAMES.Pipeline object. |
value |
A |
An updated FLAMES.Pipeline object with the specified controllers.
pipeline <- example_pipeline() # Only set the genome alignment controller controllers(pipeline) <- list(genome_alignment = crew::crew_controller_local()) # Same as above controllers(pipeline)[["genome_alignment"]] <- crew::crew_controller_local() # Set a controller for all steps controllers(pipeline) <- crew::crew_controller_local() # Unset all controllers and use the current R session controllers(pipeline) <- list()pipeline <- example_pipeline() # Only set the genome alignment controller controllers(pipeline) <- list(genome_alignment = crew::crew_controller_local()) # Same as above controllers(pipeline)[["genome_alignment"]] <- crew::crew_controller_local() # Set a controller for all steps controllers(pipeline) <- crew::crew_controller_local() # Unset all controllers and use the current R session controllers(pipeline) <- list()
Filter out transcripts with sharp drops / rises in coverage,
to be used in filter_coverage to remove transcripts with potential
misalignments / internal priming etc. Filtering is done by convolving the
coverage with a kernal of 1s and -1s (e.g. c(1, 1, -1, -1), where
the width of the 1s and -1s are determined by the width parameter),
and check if the maximum absolute value of the convolution is below a
threshold. If the convolution is below the threshold, TRUE is returned,
otherwise FALSE.
convolution_filter(x, threshold = 0.15, width = 2, trim = 0.05)convolution_filter(x, threshold = 0.15, width = 2, trim = 0.05)
x |
numeric vector of coverage values |
threshold |
numeric, the threshold for the maximum absolute value of the convolution |
width |
numeric, the width of the 1s and -1s in the kernal. E.g. |
trim |
numeric, the proportion of the coverage values to ignore at both ends before convolution. |
logical, TRUE if the transcript passes the filter, FALSE otherwise
# A >30% drop in coverage will fail the filter with threshold = 0.3 convolution_filter(c(1, 1, 1, 0.69, 0.69, 0.69), threshold = 0.3) convolution_filter(c(1, 1, 1, 0.71, 0.7, 0.7), threshold = 0.3)# A >30% drop in coverage will fail the filter with threshold = 0.3 convolution_filter(c(1, 1, 1, 0.69, 0.69, 0.69), threshold = 0.3) convolution_filter(c(1, 1, 1, 0.71, 0.7, 0.7), threshold = 0.3)
Create Configuration File From Arguments
create_config(outdir, type = "sc_3end", ...)create_config(outdir, type = "sc_3end", ...)
outdir |
the destination directory for the configuration file |
type |
use an example config, available values:
|
... |
Configuration parameters (using dot for nested parameters)
|
Create a list object containing the arguments supplied in a format
usable for the FLAMES pipeline, and writes the object to a JSON file, which
is located with the prefix 'config_' in the supplied outdir. Default
values from extdata/config_sclr_nanopore_3end.json will be used for
unprovided parameters.
Simple scalar parameters can be set via the ... argument using dot
notation (e.g. barcode_parameters.max_bc_editdistance = 3). For
complex structured parameters such as barcode_parameters.segments
(which is a JSON array of objects), it is strongly recommended to call
create_config(outdir) first to generate the default JSON file, then
open that file in a text editor and modify the segments array
directly. See the FLAMES vignette for a worked example.
file path to the config file created
# create the default configuration file outdir <- tempdir() config <- create_config(outdir) # create config with custom parameters including nested ones config <- create_config(outdir, threads = 16, barcode_parameters.max_bc_editdistance = 3, barcode_parameters.pattern.primer = "ATCGATCG", isoform_parameters.min_sup_cnt = 10, # use the coverage model in oarfish # via supplying additional CLI arguments additional_arguments.oarfish = c("--model-coverage") )# create the default configuration file outdir <- tempdir() config <- create_config(outdir) # create config with custom parameters including nested ones config <- create_config(outdir, threads = 16, barcode_parameters.max_bc_editdistance = 3, barcode_parameters.pattern.primer = "ATCGATCG", isoform_parameters.min_sup_cnt = 10, # use the coverage model in oarfish # via supplying additional CLI arguments additional_arguments.oarfish = c("--model-coverage") )
SingleCellExperiment object from FLAMES output folderCreate SingleCellExperiment object from FLAMES output folder
create_sce_from_dir(outdir, annotation, quantification = "FLAMES")create_sce_from_dir(outdir, annotation, quantification = "FLAMES")
outdir |
The folder containing |
annotation |
the annotation file that was used to produce the output files |
quantification |
(Optional) the quantification method used to generate the output files (either "FLAMES" or "Oarfish".). If not specified, the function will attempt to determine the quantification method. |
a list of SingleCellExperiment objects if multiple transcript matrices were
found in the output folder, or a SingleCellExperiment object if only one were found
outdir <- tempfile() dir.create(outdir) bc_allow <- file.path(outdir, "bc_allow.tsv") genome_fa <- file.path(outdir, "rps24.fa") R.utils::gunzip( filename = system.file("extdata", "bc_allow.tsv.gz", package = "FLAMES"), destname = bc_allow, remove = FALSE ) R.utils::gunzip( filename = system.file("extdata", "rps24.fa.gz", package = "FLAMES"), destname = genome_fa, remove = FALSE ) annotation <- system.file("extdata", "rps24.gtf.gz", package = "FLAMES") sce <- sc_long_pipeline( genome_fa = genome_fa, fastq = system.file("extdata", "fastq", "musc_rps24.fastq.gz", package = "FLAMES"), annotation = annotation, outdir = outdir, barcodes_file = bc_allow, config_file = create_config( outdir, pipeline_parameters.demultiplexer = "flexiplex", oarfish_quantification = FALSE ) ) sce_2 <- create_sce_from_dir(outdir, annotation)outdir <- tempfile() dir.create(outdir) bc_allow <- file.path(outdir, "bc_allow.tsv") genome_fa <- file.path(outdir, "rps24.fa") R.utils::gunzip( filename = system.file("extdata", "bc_allow.tsv.gz", package = "FLAMES"), destname = bc_allow, remove = FALSE ) R.utils::gunzip( filename = system.file("extdata", "rps24.fa.gz", package = "FLAMES"), destname = genome_fa, remove = FALSE ) annotation <- system.file("extdata", "rps24.gtf.gz", package = "FLAMES") sce <- sc_long_pipeline( genome_fa = genome_fa, fastq = system.file("extdata", "fastq", "musc_rps24.fastq.gz", package = "FLAMES"), annotation = annotation, outdir = outdir, barcodes_file = bc_allow, config_file = create_config( outdir, pipeline_parameters.demultiplexer = "flexiplex", oarfish_quantification = FALSE ) ) sce_2 <- create_sce_from_dir(outdir, annotation)
SummarizedExperiment object from FLAMES output folderCreate SummarizedExperiment object from FLAMES output folder
create_se_from_dir(outdir, annotation, quantification = "FLAMES")create_se_from_dir(outdir, annotation, quantification = "FLAMES")
outdir |
The folder containing |
annotation |
(Optional) the annotation file that was used to produce the output files |
quantification |
(Optional) the quantification method used to generate the output files (either "FLAMES" or "Oarfish".). If not specified, the function will attempt to determine the quantification method. |
a SummarizedExperiment object
ppl <- example_pipeline("BulkPipeline") ppl <- run_FLAMES(ppl) se1 <- experiment(ppl) se2 <- create_se_from_dir(ppl@outdir, ppl@annotation)ppl <- example_pipeline("BulkPipeline") ppl <- run_FLAMES(ppl) se1 <- experiment(ppl) se2 <- create_se_from_dir(ppl@outdir, ppl@annotation)
This function creates a SpatialExperiment object from a SingleCellExperiment object and a spatial barcode file.
create_spe( sce, spatial_barcode_file, mannual_align_json, image, tissue_positions_file )create_spe( sce, spatial_barcode_file, mannual_align_json, image, tissue_positions_file )
sce |
The SingleCellExperiment object obtained from running the |
spatial_barcode_file |
The path to the spatial barcode file, e.g. |
mannual_align_json |
The path to the mannual alignment json file. |
image |
'DataFrame' containing the image data. See |
tissue_positions_file |
The path to Visium positions file, e.g. |
A SpatialExperiment object.
trim TSO adaptor with cutadapt
cutadapt(args)cutadapt(args)
args |
arguments to be passed to cutadapt |
Exit code of cutadapt
cutadapt("-h")cutadapt("-h")
Demultiplex reads using the cell_umi_gene.tsv file from Sockeye.
demultiplex_sockeye(fastq_dir, sockeye_tsv, out_fq)demultiplex_sockeye(fastq_dir, sockeye_tsv, out_fq)
fastq_dir |
The folder containing FASTQ files from Sockeye's output under |
sockeye_tsv |
The |
out_fq |
The output FASTQ file. |
returns NULL
Provides example pipelines for bulk, single cell and multi-sample single cell.
example_pipeline(type = "SingleCellPipeline", outdir)example_pipeline(type = "SingleCellPipeline", outdir)
type |
The type of pipeline to create. Options are "SingleCellPipeline", "BulkPipeline", and "MultiSampleSCPipeline". |
outdir |
(Optional) The output directory where the example pipeline will be created. If not provided, a temporary directory will be created. |
A pipeline object of the specified type.
SingleCellPipeline for creating the single cell pipeline,
BulkPipeline for bulk long data,
MultiSampleSCPipeline for multi sample single cell pipelines.
example_pipeline("SingleCellPipeline")example_pipeline("SingleCellPipeline")
This function returns the results of the pipeline as a
SummarizedExperiment object, a SingleCellExperiment object, or a
list of SingleCellExperiment objects, depending on the pipeline type.
experiment(pipeline) ## S4 method for signature 'FLAMES.Pipeline' experiment(pipeline)experiment(pipeline) ## S4 method for signature 'FLAMES.Pipeline' experiment(pipeline)
pipeline |
A FLAMES.Pipeline object. |
A SummarizedExperiment object, a SingleCellExperiment object,
or a list of SingleCellExperiment objects.
pipeline <- example_pipeline(type = "BulkPipeline") pipeline <- run_FLAMES(pipeline) se <- experiment(pipeline)pipeline <- example_pipeline(type = "BulkPipeline") pipeline <- run_FLAMES(pipeline) se <- experiment(pipeline)
Removes isoform annotations that could produce ambigious reads, such as isoforms that only differ by the 5' / 3' end. This could be useful for plotting average coverage plots.
filter_annotation(annotation, keep = "tss_differ")filter_annotation(annotation, keep = "tss_differ")
annotation |
path to the GTF annotation file, or the parsed GenomicRanges
object with a valid |
keep |
string, one of 'tss_differ' (only keep isoforms that all differ by the transcription start site position), 'tes_differ' (only keep those that differ by the transcription end site position), 'both' (only keep those that differ by both the start and end site), or 'single_transcripts' (only keep genes that contain a single transcript). |
GenomicRanges of the filtered isoforms
filtered_annotation <- filter_annotation( system.file("extdata", "rps24.gtf.gz", package = 'FLAMES'), keep = 'tes_differ') filtered_annotationfiltered_annotation <- filter_annotation( system.file("extdata", "rps24.gtf.gz", package = 'FLAMES'), keep = 'tes_differ') filtered_annotation
Filter the transcript coverage by applying a filter function to the coverage values.
filter_coverage(x, filter_fn = convolution_filter)filter_coverage(x, filter_fn = convolution_filter)
x |
The tibble returned by |
filter_fn |
The filter function to apply to the coverage values. The function
should take a numeric vector of coverage values and return a logical value (TRUE if
the transcript passes the filter, FALSE otherwise). The default filter function is
|
a tibble of the transcript information and coverages, with transcipts that pass the filter
ppl <- example_pipeline("BulkPipeline") steps(ppl)["isoform_identification"] <- FALSE ppl <- run_step(ppl, "read_realignment") x <- get_coverage(ppl@transcriptome_bam[[1]]) nrow(x) filter_coverage(x) |> nrow()ppl <- example_pipeline("BulkPipeline") steps(ppl)["isoform_identification"] <- FALSE ppl <- run_step(ppl, "read_realignment") x <- get_coverage(ppl@transcriptome_bam[[1]]) nrow(x) filter_coverage(x) |> nrow()
demultiplex reads with flexiplex
find_barcode( fastq, segments, barcode_groups, barcodes_files, max_flank_editdistance = 8, reads_out, stats_out, threads = 1, TSO_seq = "", TSO_prime = 3, strand = "+", cutadapt_minimum_length = 1, full_length_only = FALSE, pattern = c(primer = "CTACACGACGCTCTTCCGATCT", BC = paste0(rep("N", 16), collapse = ""), UMI = paste0(rep("N", 12), collapse = ""), polyT = paste0(rep("T", 9), collapse = "")), max_bc_editdistance = 2 )find_barcode( fastq, segments, barcode_groups, barcodes_files, max_flank_editdistance = 8, reads_out, stats_out, threads = 1, TSO_seq = "", TSO_prime = 3, strand = "+", cutadapt_minimum_length = 1, full_length_only = FALSE, pattern = c(primer = "CTACACGACGCTCTTCCGATCT", BC = paste0(rep("N", 16), collapse = ""), UMI = paste0(rep("N", 12), collapse = ""), polyT = paste0(rep("T", 9), collapse = "")), max_bc_editdistance = 2 )
fastq |
A path to a FASTQ file or a directory containing FASTQ files. |
segments |
A list of |
barcode_groups |
A list of |
barcodes_files |
Path(s) to barcode allow-list file(s), one barcode per line. Can be:
|
max_flank_editdistance |
Maximum edit distance for matching the fixed flanking
sequences (e.g. primer, poly-T tail). Default: |
reads_out |
Path to output FASTQ file containing demultiplexed reads with barcode information added to the read header. |
stats_out |
Path to output TSV (optionally gzip-compressed) file with per-read demultiplex results. |
threads |
Number of threads to use. Default: |
TSO_seq |
TSO (template-switching oligo) sequence to trim from reads using
cutadapt. Set to |
TSO_prime |
Either |
strand |
Strand of the barcode pattern, either |
cutadapt_minimum_length |
Minimum read length (in nucleotides) to retain after
TSO trimming (passed to cutadapt's |
full_length_only |
Logical. When |
pattern |
Deprecated. Named character vector defining the barcode
structure (legacy interface). Use |
max_bc_editdistance |
Deprecated. Maximum edit distance for barcode
matching when using the legacy |
This function demultiplexes reads by searching for flanking sequences (adaptors) around the barcode sequence, and then matching against an allowed barcode list.
The read structure is described by a list of barcode_segment objects
passed to segments. Each segment describes one
component of the read (e.g. a fixed primer, a cell barcode, a UMI, a poly-T tail).
Use barcode_segment to construct segments and, for combinatorial
multi-segment barcodes, barcode_group to define groups.
For backward compatibility, the legacy pattern argument (a named character
vector) is still accepted when segments is not supplied.
A list containing:
read_counts: a named integer vector with total reads,
demultiplexed reads, and single match reads.
stats_out: the path to the demultiplex stats file.
cutadapt: (only when TSO trimming is performed) the parsed cutadapt
JSON report.
barcode_segment, barcode_group,
plot_demultiplex_raw
outdir <- tempfile() dir.create(outdir) bc_allow <- file.path(outdir, "bc_allow.tsv") R.utils::gunzip( filename = system.file("extdata", "bc_allow.tsv.gz", package = "FLAMES"), destname = bc_allow, remove = FALSE ) # Modern interface: define segments explicitly find_barcode( fastq = system.file("extdata", "fastq", "musc_rps24.fastq.gz", package = "FLAMES"), segments = list( barcode_segment("FIXED", "CTACACGACGCTCTTCCGATCT", "primer"), barcode_segment("MATCHED", "NNNNNNNNNNNNNNNN", "CB", bc_list = "CB", buffer_size = 5, max_edit_distance = 2), barcode_segment("RANDOM", "NNNNNNNNNNNN", "UB"), barcode_segment("FIXED", "TTTTTTTTT", "polyT") ), barcode_groups = list(), barcodes_files = c(CB = bc_allow), stats_out = file.path(outdir, "bc_stat.tsv.gz"), reads_out = file.path(outdir, "demultiplexed.fastq.gz"), TSO_seq = "AAGCAGTGGTATCAACGCAGAGTACATGGG", TSO_prime = 5, strand = '-', cutadapt_minimum_length = 10, full_length_only = TRUE )outdir <- tempfile() dir.create(outdir) bc_allow <- file.path(outdir, "bc_allow.tsv") R.utils::gunzip( filename = system.file("extdata", "bc_allow.tsv.gz", package = "FLAMES"), destname = bc_allow, remove = FALSE ) # Modern interface: define segments explicitly find_barcode( fastq = system.file("extdata", "fastq", "musc_rps24.fastq.gz", package = "FLAMES"), segments = list( barcode_segment("FIXED", "CTACACGACGCTCTTCCGATCT", "primer"), barcode_segment("MATCHED", "NNNNNNNNNNNNNNNN", "CB", bc_list = "CB", buffer_size = 5, max_edit_distance = 2), barcode_segment("RANDOM", "NNNNNNNNNNNN", "UB"), barcode_segment("FIXED", "TTTTTTTTT", "polyT") ), barcode_groups = list(), barcodes_files = c(CB = bc_allow), stats_out = file.path(outdir, "bc_stat.tsv.gz"), reads_out = file.path(outdir, "demultiplexed.fastq.gz"), TSO_seq = "AAGCAGTGGTATCAACGCAGAGTACATGGG", TSO_prime = 5, strand = '-', cutadapt_minimum_length = 10, full_length_only = TRUE )
This function is a wrapper for base::Sys.which to find the path
to a command. It also searches within the FLAMES basilisk conda
environment. This function also replaces "" with NA in the
output of base::Sys.which to make it easier to check if the
binary is found.
find_bin(command)find_bin(command)
command |
character, the command to search for |
character, the path to the command or NA
find_bin("minimap2")find_bin("minimap2")
Calculates normalized Shannon entropy for gene isoform expression across cells. Higher entropy indicates more diverse isoform usage, lower entropy indicates dominance by fewer isoforms.
find_diversity( sce, assay = "counts", gene_col = "gene_id", alpha = .Machine$double.xmin, min_counts_per_cell = 5, isoform_min_pct_cells = 0.05, isoform_cumulative_pct = 0.95, min_cell_fraction = 0.25, threads = 1, show_progress = interactive() )find_diversity( sce, assay = "counts", gene_col = "gene_id", alpha = .Machine$double.xmin, min_counts_per_cell = 5, isoform_min_pct_cells = 0.05, isoform_cumulative_pct = 0.95, min_cell_fraction = 0.25, threads = 1, show_progress = interactive() )
sce |
A |
assay |
Name of assay containing isoform counts (default: "counts") |
gene_col |
Column name in rowData containing gene identifiers (default: "gene_id") |
alpha |
Pseudocount added to avoid log(0) (default: .Machine$double.xmin) |
min_counts_per_cell |
Minimum total gene counts per cell to include (default: 5) |
isoform_min_pct_cells |
Minimum fraction of cells expressing each isoform (default: 0.05) |
isoform_cumulative_pct |
Keep top isoforms contributing to this cumulative proportion (default: 0.95) |
min_cell_fraction |
Minimum fraction of cells with valid entropy per gene (default: 0.25) |
threads |
Number of threads for parallel processing (default: 1) |
show_progress |
Logical indicating whether to show progress (default: TRUE if interactive) |
Matrix with genes as rows and cells as columns containing normalized entropy values (0-1).
sce <- scuttle::mockSCE(ncells = 50, ngenes = 30) SummarizedExperiment::rowData(sce)$gene_id <- sort( paste0("gene", sample(1:9, nrow(sce), replace = TRUE)) ) res <- find_diversity(sce, threads = 2)sce <- scuttle::mockSCE(ncells = 50, ngenes = 30) SummarizedExperiment::rowData(sce)$gene_id <- sort( paste0("gene", sample(1:9, nrow(sce), replace = TRUE)) ) res <- find_diversity(sce, threads = 2)
Treat each bam file as a bulk sample and identify variants against the reference
find_variants( bam_path, reference, annotation, min_nucleotide_depth = 100, homopolymer_window = 3, annotated_region_only = FALSE, names_from = "gene_name", threads = 1 )find_variants( bam_path, reference, annotation, min_nucleotide_depth = 100, homopolymer_window = 3, annotated_region_only = FALSE, names_from = "gene_name", threads = 1 )
bam_path |
character(1) or character(n): path to the bam file(s) aligned to the reference genome (NOT the transcriptome!). |
reference |
DNAStringSet: the reference genome |
annotation |
GRanges: the annotation of the reference genome. You can load
a GTF/GFF annotation file with |
min_nucleotide_depth |
integer(1): minimum read depth for a position to be considered a variant. |
homopolymer_window |
integer(1): the window size to calculate the homopolymer
percentage. The homopolymer percentage is calculated as the percentage of the most
frequent nucleotide in a window of |
annotated_region_only |
logical(1): whether to only consider variants outside
annotated regions. If |
names_from |
character(1): the column name in the metadata column of the annotation
( |
threads |
integer(1): number of threads to use. Threading is done over each
annotated region and (if |
Each bam file is treated as a bulk sample to perform pileup and identify variants.
You can run sc_mutations with the variants identified with this function
to get single-cell allele counts. Note that reference genome FASTA files may have
the chromosome names field as '>chr1 1' instead of '>chr1'. You may need to remove
the trailing number to match the chromosome names in the bam file, for example with
names(ref) <- sapply(names(ref), function(x) strsplit(x, " ")[[1]][1]).
A tibble with columns: seqnames, pos, nucleotide, count, sum, freq, ref, region,
homopolymer_pct, bam_path The homopolymer percentage is calculated as the percentage of the
most frequent nucleotide in a window of homopolymer_window nucleotides around
the variant position, excluding the variant position itself.
ppl <- example_pipeline("SingleCellPipeline") ppl <- run_step(ppl, "genome_alignment") variants <- find_variants( bam_path = ppl@genome_bam, reference = ppl@genome_fa, annotation = ppl@annotation, min_nucleotide_depth = 4 ) head(variants)ppl <- example_pipeline("SingleCellPipeline") ppl <- run_step(ppl, "genome_alignment") variants <- find_variants( bam_path = ppl@genome_bam, reference = ppl@genome_fa, annotation = ppl@annotation, min_nucleotide_depth = 4 ) head(variants)
FLAMES: full-length analysis of mutations and splicing
invisible()
demultiplex reads with flexiplex, for detailed description, see documentation for the original flexiplex: https://davidsongroup.github.io/flexiplex
flexiplex( r_segments, r_barcode_groups, max_flank_editdistance, reads_in, reads_out, stats_out, bc_out, reverseCompliment, n_threads )flexiplex( r_segments, r_barcode_groups, max_flank_editdistance, reads_in, reads_out, stats_out, bc_out, reverseCompliment, n_threads )
r_segments |
List defining the barcode structure |
r_barcode_groups |
List defining barcode groups |
max_flank_editdistance |
int, maximum edit distance for matching flanking sequences |
reads_in |
Input FASTQ or FASTA file |
reads_out |
output file for demultiplexed reads |
stats_out |
output file for demultiplexed stats |
bc_out |
WIP |
reverseCompliment |
bool, whether to reverse complement the reads after demultiplexing |
n_threads |
number of threads to be used during demultiplexing |
integer return value. 0 represents normal return.
Get the read coverages for each transcript in the BAM file (or a GAlignments object).
The read coverages are sampled at 100 positions along the transcript, and the coverage is scaled
by dividing the coverage at each position by the total read counts for the transcript. If a BAM
file is provided, alignment with MAPQ < 5, secondary alignments and supplementary alignments
are filtered out. A GAlignments object can also be provided in case alternative filtering
is desired.
get_coverage(bam, min_counts = 10, remove_UTR = FALSE, annotation)get_coverage(bam, min_counts = 10, remove_UTR = FALSE, annotation)
bam |
path to the BAM file, or a parsed GAlignments object |
min_counts |
numeric, the minimum number of alignments required for a transcript to be included |
remove_UTR |
logical, if |
annotation |
(Required if |
a tibble of the transcript information and coverages, with the following columns:
transcript: the transcript name / ID
read_counts: the total number of aligments for the transcript
coverage_1-100: the coverage at each of the 100 positions along the transcript
tr_length: the length of the transcript
ppl <- example_pipeline("BulkPipeline") steps(ppl)["isoform_identification"] <- FALSE ppl <- run_step(ppl, "read_realignment") x <- get_coverage(ppl@transcriptome_bam[[1]]) head(x)ppl <- example_pipeline("BulkPipeline") steps(ppl)["isoform_identification"] <- FALSE ppl <- run_step(ppl, "read_realignment") x <- get_coverage(ppl@transcriptome_bam[[1]]) head(x)
Calls minimap2 to index the reference genome.
index_genome(pipeline, path, additional_args = c("-k", "14")) ## S4 method for signature 'FLAMES.Pipeline' index_genome(pipeline, path, additional_args = c("-k", "14"))index_genome(pipeline, path, additional_args = c("-k", "14")) ## S4 method for signature 'FLAMES.Pipeline' index_genome(pipeline, path, additional_args = c("-k", "14"))
pipeline |
A FLAMES.Pipeline object. |
path |
The file path to save the minimap2 index. If not provided, it will be saved to the output directory with the name "genome.mmi". |
additional_args |
(optional) Additional arguments to pass to minimap2. |
A SummarizedExperiment object, a SingleCellExperiment object,
or a list of SingleCellExperiment objects.
pipeline <- example_pipeline(type = "BulkPipeline") pipeline <- index_genome(pipeline)pipeline <- example_pipeline(type = "BulkPipeline") pipeline <- index_genome(pipeline)
Loads a configuration file and fills in missing values with defaults from the package's default configuration.
load_config(config_file, type = "sc_3end")load_config(config_file, type = "sc_3end")
config_file |
Path to the configuration JSON file |
type |
Config type to use for defaults ("sc_3end" or "SIRV") |
A complete configuration list with all parameters filled
Semi-supervised isofrom detection and annotation for long read data. This variant is
meant for multi-sample scRNA-seq data. Specific parameters can be configured in
the config file (see create_config), input files are specified via
arguments.
MultiSampleSCPipeline( config_file, outdir, fastq, annotation, genome_fa, genome_mmi, minimap2, samtools, barcodes_file, expect_cell_number, controllers )MultiSampleSCPipeline( config_file, outdir, fastq, annotation, genome_fa, genome_mmi, minimap2, samtools, barcodes_file, expect_cell_number, controllers )
config_file |
Path to the JSON configuration file. See |
outdir |
Path to the output directory. If it does not exist, it will be created. |
fastq |
A named vector of fastq file (or folder) paths. Each element of the vector will be treated as a sample. The names of the vector will be used as the sample names. If not named, the sample names will be generated from the file names. |
annotation |
The file path to the annotation file in GFF3 / GTF format. |
genome_fa |
The file path to the reference genome in FASTA format. |
genome_mmi |
(optional) The file path to minimap2's index reference genome. |
minimap2 |
(optional) The path to the minimap2 binary. If not provided, FLAMES will
use a copy from bioconda via |
samtools |
(optional) The path to the samtools binary. If not provided, FLAMES will
use a copy from bioconda via |
barcodes_file |
The file with expected cell barcodes, with each barcode on a new line. |
expect_cell_number |
The expected number of cells in the sample. This is used if
|
controllers |
(optional, experimental) A |
By default the pipeline starts with demultiplexing the input fastq data. If the
cell barcodes are known apriori (e.g. via coupled short-read sequencing), the
barcodes_file argument can be used to specify a file containing the cell
barcodes, and a modified Rcpp version of flexiplex will be used; otherwise,
expect_cell_number need to be provided, and BLAZE will be used to
generate the cell barcodes. The pipeline then aligns the reads to the genome using
minimap2. The alignment is then used for isoform detection (either using
FLAMES or bambu, can be configured). The reads are then realigned
to the detected isoforms. Finally, a transcript count matrix is generated (either
using FLAMES's simplistic counting or oarfish's Expectation
Maximization algorithm, can be configured). The results can be accssed with
experiment(pipeline). If the pipeline errored out / new steps were configured,
it can be resumed by calling resume_FLAMES(pipeline)
A FLAMES.MultiSampleSCPipeline object. The pipeline can be run using
the run_FLAMES function. The resulting list of SingleCellExperiment
objects can be accessed using the experiment method.
SingleCellPipeline for single-sample long data and more details on the
pipeline output,
create_config for creating a configuration file,
BulkPipeline for bulk long data.
reads <- ShortRead::readFastq( system.file("extdata", "fastq", "musc_rps24.fastq.gz", package = "FLAMES") ) outdir <- tempfile() dir.create(outdir) dir.create(file.path(outdir, "fastq")) bc_allow <- file.path(outdir, "bc_allow.tsv") genome_fa <- file.path(outdir, "rps24.fa") R.utils::gunzip( filename = system.file("extdata", "bc_allow.tsv.gz", package = "FLAMES"), destname = bc_allow, remove = FALSE ) R.utils::gunzip( filename = system.file("extdata", "rps24.fa.gz", package = "FLAMES"), destname = genome_fa, remove = FALSE ) ShortRead::writeFastq(reads[1:100], file.path(outdir, "fastq/sample1.fq.gz"), mode = "w", full = FALSE) reads <- reads[-(1:100)] ShortRead::writeFastq(reads[1:100], file.path(outdir, "fastq/sample2.fq.gz"), mode = "w", full = FALSE) reads <- reads[-(1:100)] ShortRead::writeFastq(reads, file.path(outdir, "fastq/sample3.fq.gz"), mode = "w", full = FALSE) ppl <- MultiSampleSCPipeline( config_file = create_config( outdir, pipeline_parameters.demultiplexer = "flexiplex", pipeline_parameters.threads = 1, alignment_parameters.no_flank = TRUE ), outdir = outdir, fastq = c("sampleA" = file.path(outdir, "fastq"), "sample1" = file.path(outdir, "fastq", "sample1.fq.gz"), "sample2" = file.path(outdir, "fastq", "sample2.fq.gz"), "sample3" = file.path(outdir, "fastq", "sample3.fq.gz")), annotation = system.file("extdata", "rps24.gtf.gz", package = "FLAMES"), genome_fa = genome_fa, barcodes_file = rep(bc_allow, 4) ) ppl <- run_FLAMES(ppl) experiment(ppl)reads <- ShortRead::readFastq( system.file("extdata", "fastq", "musc_rps24.fastq.gz", package = "FLAMES") ) outdir <- tempfile() dir.create(outdir) dir.create(file.path(outdir, "fastq")) bc_allow <- file.path(outdir, "bc_allow.tsv") genome_fa <- file.path(outdir, "rps24.fa") R.utils::gunzip( filename = system.file("extdata", "bc_allow.tsv.gz", package = "FLAMES"), destname = bc_allow, remove = FALSE ) R.utils::gunzip( filename = system.file("extdata", "rps24.fa.gz", package = "FLAMES"), destname = genome_fa, remove = FALSE ) ShortRead::writeFastq(reads[1:100], file.path(outdir, "fastq/sample1.fq.gz"), mode = "w", full = FALSE) reads <- reads[-(1:100)] ShortRead::writeFastq(reads[1:100], file.path(outdir, "fastq/sample2.fq.gz"), mode = "w", full = FALSE) reads <- reads[-(1:100)] ShortRead::writeFastq(reads, file.path(outdir, "fastq/sample3.fq.gz"), mode = "w", full = FALSE) ppl <- MultiSampleSCPipeline( config_file = create_config( outdir, pipeline_parameters.demultiplexer = "flexiplex", pipeline_parameters.threads = 1, alignment_parameters.no_flank = TRUE ), outdir = outdir, fastq = c("sampleA" = file.path(outdir, "fastq"), "sample1" = file.path(outdir, "fastq", "sample1.fq.gz"), "sample2" = file.path(outdir, "fastq", "sample2.fq.gz"), "sample3" = file.path(outdir, "fastq", "sample3.fq.gz")), annotation = system.file("extdata", "rps24.gtf.gz", package = "FLAMES"), genome_fa = genome_fa, barcodes_file = rep(bc_allow, 4) ) ppl <- run_FLAMES(ppl) experiment(ppl)
Given a set of mutations and gene annotation, calculate the position of each mutation within the gene body they are in.
mutation_positions( mutations, annotation, type = "relative", bin = FALSE, by = c(region = "gene_name"), threads = 1 )mutation_positions( mutations, annotation, type = "relative", bin = FALSE, by = c(region = "gene_name"), threads = 1 )
mutations |
either the tibble output from |
annotation |
Either path to the annotation file (GTF/GFF) or a GRanges object of the gene annotation. |
type |
character(1): the type of position to calculate. Can be one of "TSS" (distance from the transcription start site), "TES" (distance from the transcription end site), or "relative" (relative position within the gene body). |
bin |
logical(1): whether to bin the relative positions into 100 bins. Only applicable when
|
by |
character(1): the column name in the annotation to match with the gene annotation.
E.g. |
threads |
integer(1): number of threads to use. |
A numeric vector of positions of each mutation within the gene body. When type = "relative",
the positions are normalized to the gene length, ranging from 0 (start of the gene) to 1 (end of the gene).
When type = "TSS" / type = "TES", the distances from the transcription start
/ end site. If bin = TRUE, and type = "relative", the relative positions are binned into 100
bins along the gene body, and the output is a matrix with the number of mutations in each bin, the
rows are named by the by column (e.g. gene name).
variants <- data.frame( seqnames = rep("chr14", 8), pos = c(1084, 1085, 1217, 1384, 2724, 2789, 5083, 5147), region = rep("Rps24", 8) ) positions <- mutation_positions( mutations = variants, annotation = system.file("extdata", "rps24.gtf.gz", package = "FLAMES") )variants <- data.frame( seqnames = rep("chr14", 8), pos = c(1084, 1085, 1217, 1384, 2724, 2789, 5083, 5147), region = rep("Rps24", 8) ) positions <- mutation_positions( mutations = variants, annotation = system.file("extdata", "rps24.gtf.gz", package = "FLAMES") )
Plot the average read coverages for each length bin or a perticular isoform
plot_coverage( x, quantiles = c(0, 0.2375, 0.475, 0.7125, 0.95, 1), length_bins = c(0, 1, 2, 5, 10, Inf), weight_fn = weight_transcripts, filter_fn, detailed = FALSE )plot_coverage( x, quantiles = c(0, 0.2375, 0.475, 0.7125, 0.95, 1), length_bins = c(0, 1, 2, 5, 10, Inf), weight_fn = weight_transcripts, filter_fn, detailed = FALSE )
x |
path to the BAM file (aligning reads to the transcriptome), or
the (GenomicAlignments::readGAlignments) parsed GAlignments object, or the
tibble returned by |
quantiles |
numeric vector to specify the quantiles to bin the transcripts lengths by if length_bins is missing. The length bins will be determined such that the read counts are distributed acording to the quantiles. |
length_bins |
numeric vector to specify the sizes to bin the transcripts by |
weight_fn |
function to calculate the weights for the transcripts. The function
should take a numeric vector of read counts and return a numeric vector of weights.
The default function is |
filter_fn |
Optional filter function to filter the transcripts before plotting.
See the |
detailed |
logical, if |
a ggplot2 object of the coverage plot(s)
ppl <- example_pipeline("BulkPipeline") steps(ppl)["isoform_identification"] <- FALSE ppl <- run_step(ppl, "read_realignment") # Plot the coverages directly from the BAM file plot_coverage(ppl@transcriptome_bam[[1]]) # Get the coverage information first coverage <- get_coverage(ppl@transcriptome_bam[[1]]) |> dplyr::filter(read_counts > 2) |> # Filter out transcripts with read counts < 3 filter_coverage(filter_fn = convolution_filter) # Filter out transcripts with sharp drops / rises # Plot the filtered coverages plot_coverage(coverage, detailed = TRUE) # filtering function can also be passed directly to plot_coverage plot_coverage(ppl@transcriptome_bam[[1]], filter_fn = convolution_filter)ppl <- example_pipeline("BulkPipeline") steps(ppl)["isoform_identification"] <- FALSE ppl <- run_step(ppl, "read_realignment") # Plot the coverages directly from the BAM file plot_coverage(ppl@transcriptome_bam[[1]]) # Get the coverage information first coverage <- get_coverage(ppl@transcriptome_bam[[1]]) |> dplyr::filter(read_counts > 2) |> # Filter out transcripts with read counts < 3 filter_coverage(filter_fn = convolution_filter) # Filter out transcripts with sharp drops / rises # Plot the filtered coverages plot_coverage(coverage, detailed = TRUE) # filtering function can also be passed directly to plot_coverage plot_coverage(ppl@transcriptome_bam[[1]], filter_fn = convolution_filter)
produce a barplot of cell barcode demultiplex statistics
plot_demultiplex(pipeline) ## S4 method for signature 'FLAMES.SingleCellPipeline' plot_demultiplex(pipeline)plot_demultiplex(pipeline) ## S4 method for signature 'FLAMES.SingleCellPipeline' plot_demultiplex(pipeline)
pipeline |
A |
a list of ggplot objects:
reads_count_plot: stacked barplot of: demultiplexed reads
knee_plot: knee plot of UMI counts before TSO trimming
flank_editdistance_plot: flanking sequence (adaptor) edit-distance plot
barcode_editdistance_plot: barcode edit-distance plot
cutadapt_plot: if TSO trimming is performed, number of reads kept by cutadapt
pipeline <- example_pipeline("MultiSampleSCPipeline") |> run_step("barcode_demultiplex") plot_demultiplex(pipeline)pipeline <- example_pipeline("MultiSampleSCPipeline") |> run_step("barcode_demultiplex") plot_demultiplex(pipeline)
This function creates a horizontal bar plot showing the duration of each pipeline step using ggplot2.
plot_durations(x) ## S4 method for signature 'FLAMES.Pipeline' plot_durations(x)plot_durations(x) ## S4 method for signature 'FLAMES.Pipeline' plot_durations(x)
x |
A FLAMES.Pipeline object. |
A ggplot2 object.
pipeline <- example_pipeline("BulkPipeline") pipeline <- run_FLAMES(pipeline) plot_durations(pipeline)pipeline <- example_pipeline("BulkPipeline") pipeline <- run_FLAMES(pipeline) plot_durations(pipeline)
Plot expression heatmap of top n isoforms of a gene
plot_isoform_heatmap( sce, gene_id, transcript_ids, n = 4, isoform_legend_width = 7, col_low = "#313695", col_mid = "#FFFFBF", col_high = "#A50026", color_quantile = 1, cluster_palette, ... )plot_isoform_heatmap( sce, gene_id, transcript_ids, n = 4, isoform_legend_width = 7, col_low = "#313695", col_mid = "#FFFFBF", col_high = "#A50026", color_quantile = 1, cluster_palette, ... )
sce |
The |
gene_id |
The gene symbol of interest, ignored if |
transcript_ids |
The transcript ids to plot. |
n |
The number of top isoforms to plot from the gene. Ignored if |
isoform_legend_width |
The width of isoform legends in heatmaps, in |
col_low |
Color for cells with low expression levels in UMAPs. |
col_mid |
Color for cells with intermediate expression levels in UMAPs. |
col_high |
Color for cells with high expression levels in UMAPs. |
color_quantile |
The lower and upper expression quantile to be displayed bewteen |
cluster_palette |
Optional, named vector of colors for the cluster annotations. |
... |
Additional arguments to pass to |
Takes SingleCellExperiment object and plots an expression heatmap with the
isoform visualizations along genomic coordinates.
a ComplexHeatmap
data(scmixology_lib10_transcripts) scmixology_lib10_transcripts |> scrapper::normalizeRnaCounts.se() |> plot_isoform_heatmap(gene = "ENSG00000108107")data(scmixology_lib10_transcripts) scmixology_lib10_transcripts |> scrapper::normalizeRnaCounts.se() |> plot_isoform_heatmap(gene = "ENSG00000108107")
Plot expression of top n isoforms of a gene in reduced dimensions
plot_isoform_reduced_dim( sce, gene_id, transcript_ids, n = 4, reduced_dim_name = "UMAP", use_gene_dimred = FALSE, expr_func = function(x) { SingleCellExperiment::logcounts(x) }, col_low = "#313695", col_mid = "#FFFFBF", col_high = "#A50026", alpha = 0.5, size = 0.2, ggtheme = theme_minimal() + theme(axis.text = element_blank()), color_quantile = 1, format = "plot_grid", ... )plot_isoform_reduced_dim( sce, gene_id, transcript_ids, n = 4, reduced_dim_name = "UMAP", use_gene_dimred = FALSE, expr_func = function(x) { SingleCellExperiment::logcounts(x) }, col_low = "#313695", col_mid = "#FFFFBF", col_high = "#A50026", alpha = 0.5, size = 0.2, ggtheme = theme_minimal() + theme(axis.text = element_blank()), color_quantile = 1, format = "plot_grid", ... )
sce |
The |
gene_id |
The gene symbol of interest, ignored if |
transcript_ids |
The transcript ids to plot. |
n |
The number of top isoforms to plot from the gene. Ignored if |
reduced_dim_name |
The name of the reduced dimension to use for plotting cells. |
use_gene_dimred |
Whether to use gene-level reduced dimensions for plotting. Set to |
expr_func |
The function to extract expression values from the |
col_low |
Color for cells with low expression levels in UMAPs. |
col_mid |
Color for cells with intermediate expression levels in UMAPs. |
col_high |
Color for cells with high expression levels in UMAPs. |
alpha |
The transparency of the points in the UMAPs. |
size |
The size of the points in the UMAPs. |
ggtheme |
The theme to use for the UMAPs. |
color_quantile |
The lower and upper expression quantile to be displayed bewteen |
format |
The format of the output, either "plot_grid" or "list". |
... |
Additional arguments to pass to |
Takes SingleCellExperiment object and plots an expression on reduced dimensions
with the isoform visualizations along genomic coordinates.
a ggplot object of the UMAP(s)
data(scmixology_lib10_transcripts, scmixology_lib10, scmixology_lib90) scmixology_lib10 <- scmixology_lib10[, colSums(SingleCellExperiment::counts(scmixology_lib10)) > 0] sce_lr <- scmixology_lib10[, colnames(scmixology_lib10) %in% colnames(scmixology_lib10_transcripts)] SingleCellExperiment::altExp(sce_lr, "transcript") <- scmixology_lib10_transcripts[, colnames(sce_lr)] combined_sce <- combine_sce(sce_lr, scmixology_lib90) combined_sce <- combined_sce |> scrapper::normalizeRnaCounts.se() |> scater::runPCA() |> scater::runUMAP() combined_imputed_sce <- sc_impute_transcript(combined_sce) plot_isoform_reduced_dim(combined_sce, 'ENSG00000108107') plot_isoform_reduced_dim(combined_imputed_sce, 'ENSG00000108107')data(scmixology_lib10_transcripts, scmixology_lib10, scmixology_lib90) scmixology_lib10 <- scmixology_lib10[, colSums(SingleCellExperiment::counts(scmixology_lib10)) > 0] sce_lr <- scmixology_lib10[, colnames(scmixology_lib10) %in% colnames(scmixology_lib10_transcripts)] SingleCellExperiment::altExp(sce_lr, "transcript") <- scmixology_lib10_transcripts[, colnames(sce_lr)] combined_sce <- combine_sce(sce_lr, scmixology_lib90) combined_sce <- combined_sce |> scrapper::normalizeRnaCounts.se() |> scater::runPCA() |> scater::runUMAP() combined_imputed_sce <- sc_impute_transcript(combined_sce) plot_isoform_reduced_dim(combined_sce, 'ENSG00000108107') plot_isoform_reduced_dim(combined_imputed_sce, 'ENSG00000108107')
Plot isoforms, either from a gene or a list of transcript ids.
plot_isoforms( sce, gene_id, transcript_ids, n = 4, format = "plot_grid", colors )plot_isoforms( sce, gene_id, transcript_ids, n = 4, format = "plot_grid", colors )
sce |
The |
gene_id |
The gene symbol of interest, ignored if |
transcript_ids |
The transcript ids to plot. |
n |
The number of top isoforms to plot from the gene. Ignored if |
format |
The format of the output, either "plot_grid" or "list". |
colors |
A character vector of colors to use for the isoforms. If not provided, gray will be used. for all isoforms. |
This function takes a SingleCellExperiment object and plots the top isoforms of a gene,
or a list of specified transcript ids. Either as a list of plots or together in a grid.
This function plots transcript isoforms as exon-intron structures and orders
the isoforms by expression levels (when specifying a gene) or by the order of the transcript_ids.
When format = "list", a list of ggplot objects is returned.
Otherwise, a grid of the plots is returned.
data(scmixology_lib10_transcripts) plot_isoforms(scmixology_lib10_transcripts, gene_id = "ENSG00000108107")data(scmixology_lib10_transcripts) plot_isoforms(scmixology_lib10_transcripts, gene_id = "ENSG00000108107")
This function plots a spatial point plot for given feature
plot_spatial_feature( spe, feature, opacity = 50, grayscale = TRUE, size = 1, assay_type = "counts", color = "red", ... )plot_spatial_feature( spe, feature, opacity = 50, grayscale = TRUE, size = 1, assay_type = "counts", color = "red", ... )
spe |
The SpatialExperiment object. |
feature |
The feature to plot. Could be either a feature name or index present in the assay or a numeric vector of length nrow(spe). |
opacity |
The opacity of the background tissue image. |
grayscale |
Whether to convert the background image to grayscale. |
size |
The size of the points. |
assay_type |
The assay that contains the given features. E.g. 'counts', 'logcounts'. |
color |
The maximum color for the feature. Minimum color is transparent. |
... |
Additional arguments to pass to |
A ggplot object.
This function plots a spatial pie chart for given features.
plot_spatial_isoform(spe, isoforms, assay_type = "counts", color_palette, ...)plot_spatial_isoform(spe, isoforms, assay_type = "counts", color_palette, ...)
spe |
The SpatialExperiment object. |
isoforms |
The isoforms to plot. |
assay_type |
The assay that contains the given features. E.g. 'counts', 'logcounts'. |
color_palette |
Named vector of colors for each isoform. |
... |
Additional arguments to pass to |
A ggplot object.
Calculate the per gene UMI count matrix by parsing the genome alignment file.
quantify_gene( annotation, outdir, pipeline = "sc_single_sample", infq, in_bam, out_fastq, n_process, saturation_curve = TRUE, sample_names = NULL, random_seed = 2024 )quantify_gene( annotation, outdir, pipeline = "sc_single_sample", infq, in_bam, out_fastq, n_process, saturation_curve = TRUE, sample_names = NULL, random_seed = 2024 )
annotation |
The file path to the annotation file in GFF3 format |
outdir |
The path to directory to store all output files. |
pipeline |
The pipeline type as a character string, either |
infq |
The input FASTQ file. |
in_bam |
The input BAM file(s) from the genome alignment step. |
out_fastq |
The output FASTQ file(s) to store deduplicated reads. |
n_process |
The number of processes to use for parallelization. |
saturation_curve |
Logical, whether to generate a saturation curve figure. |
sample_names |
A vector of sample names, default to the file names of input fastq files,
or folder names if |
random_seed |
The random seed for reproducibility. |
After the genome alignment step (do_genome_align), the alignment file will be parsed to
generate the per gene UMI count matrix. For each gene in the annotation file, the number of reads
overlapping with the gene’s genomic coordinates will be assigned to that gene. If a read overlaps
multiple genes, it will be assigned to the gene with the highest number of overlapping nucleotides.
If exon coordinates are included in the provided annotation, the decision will first consider the
number of nucleotides aligned to the exons of each gene. In cases of a tie, the overlap with introns
will be used as a tiebreaker. If there is still a tie after considering both exons and introns,
a random gene will be selected from the tied candidates.
After the read-to-gene assignment, the per gene UMI count matrix will be generated. Specifically, for each gene, the reads with similar mapping coordinates of transcript termination sites (TTS, i.e. the end of the the read with a polyT or polyA) will be grouped together. UMIs of reads in the same group will be collapsed to generate the UMI counts for each gene.
Finally, a new fastq file with deduplicated reads by keeping the longest read in each UMI.
The count matrix will be saved in the output folder as transcript_count.csv.gz.
This function resumes a FLAMES pipeline by running configured but unfinished steps.
resume_FLAMES(pipeline) ## S4 method for signature 'FLAMES.Pipeline' resume_FLAMES(pipeline)resume_FLAMES(pipeline) ## S4 method for signature 'FLAMES.Pipeline' resume_FLAMES(pipeline)
pipeline |
A FLAMES.Pipeline object. |
An updated FLAMES.Pipeline object.
run_FLAMES to run the entire pipeline.
pipeline <- example_pipeline("BulkPipeline") pipeline <- run_step(pipeline, "genome_alignment") pipeline <- resume_FLAMES(pipeline)pipeline <- example_pipeline("BulkPipeline") pipeline <- run_step(pipeline, "genome_alignment") pipeline <- resume_FLAMES(pipeline)
This function runs the FLAMES pipeline. It will run all steps in the pipeline.
run_FLAMES(pipeline, overwrite = FALSE) ## S4 method for signature 'FLAMES.Pipeline' run_FLAMES(pipeline, overwrite = FALSE)run_FLAMES(pipeline, overwrite = FALSE) ## S4 method for signature 'FLAMES.Pipeline' run_FLAMES(pipeline, overwrite = FALSE)
pipeline |
A FLAMES.Pipeline object. |
overwrite |
(optional) If TRUE, the pipeline will be re-run even if some steps are already completed. |
An updated FLAMES.Pipeline object.
resume_FLAMES to resume a pipeline from the last completed step.
pipeline <- example_pipeline("BulkPipeline") pipeline <- run_FLAMES(pipeline)pipeline <- example_pipeline("BulkPipeline") pipeline <- run_FLAMES(pipeline)
This function runs the specified step of the FLAMES pipeline.
run_step(pipeline, step, disable_controller = TRUE) ## S4 method for signature 'FLAMES.Pipeline' run_step(pipeline, step, disable_controller = TRUE)run_step(pipeline, step, disable_controller = TRUE) ## S4 method for signature 'FLAMES.Pipeline' run_step(pipeline, step, disable_controller = TRUE)
pipeline |
A FLAMES.Pipeline object. |
step |
The step to run. One of "barcode_demultiplex", "genome_alignment", "gene_quantification", "isoform_identification", "read_realignment", or "transcript_quantification". |
disable_controller |
(optional) If TRUE, the step will be executed in the current R session, instead of using crew controllers. |
An updated FLAMES.Pipeline object.
run_FLAMES to run the entire pipeline.
resume_FLAMES to resume a pipeline from the last completed step.
pipeline <- example_pipeline("BulkPipeline") pipeline <- run_step(pipeline, "genome_alignment")pipeline <- example_pipeline("BulkPipeline") pipeline <- run_step(pipeline, "genome_alignment")
Differential transcription usage testing for single cell data, using
colLabels as cluster labels.
sc_DTU_analysis( sce, gene_col = "gene_id", min_count = 15, threads = 1, method = "transcript usage permutation", permuations = 1000 )sc_DTU_analysis( sce, gene_col = "gene_id", min_count = 15, threads = 1, method = "transcript usage permutation", permuations = 1000 )
sce |
The |
gene_col |
The column name in the |
min_count |
The minimum total counts for a transcript to be tested. |
threads |
Number of threads to use for parallel processing. |
method |
The method to use for testing, listed in |
permuations |
Number of permutations for permutation methods. |
Genes with more than 2 isoforms expressing more than min_count counts
are selected for testing with one of the following methods:
Transcript usage are taken as the test statistic, cluster labels are permuted to generate a null distribution.
Chi-square test of the transcript count matrix for each gene.
Adjusted P-values were calculated by Benjamini–Hochberg correction.
a tibble containing the following columns:
- the raw p-value
- multiple testing adjusted p-value
- the cluster where DTU was observed
- rowname of sce, the DTU isoform
- the transcript usage of the isoform in the cluster
Additional columns from method = "transcript usage permutation":
- transcript usage in other clusters
- the difference between the two transcript usage
- the variance of usage difference in the permuted data
Additional columns from method = "chisq":
- the test statistic
- the degrees of freedom
- the expected usage (mean across all clusters)
- the difference between the observed and expected usage
The table is sorted by P-values.
outdir <- tempfile() dir.create(outdir) bc_allow <- file.path(outdir, "bc_allow.tsv") genome_fa <- file.path(outdir, "rps24.fa") R.utils::gunzip( filename = system.file("extdata", "bc_allow.tsv.gz", package = "FLAMES"), destname = bc_allow, remove = FALSE ) R.utils::gunzip( filename = system.file("extdata", "rps24.fa.gz", package = "FLAMES"), destname = genome_fa, remove = FALSE ) sce <- FLAMES::sc_long_pipeline( genome_fa = genome_fa, fastq = system.file("extdata", "fastq", "musc_rps24.fastq.gz", package = "FLAMES"), annotation = system.file("extdata", "rps24.gtf.gz", package = "FLAMES"), outdir = outdir, barcodes_file = bc_allow, config_file = create_config( outdir, pipeline_parameters.demultiplexer = "flexiplex" ) ) group_anno <- data.frame(barcode_seq = colnames(sce), groups = SingleCellExperiment::counts(sce)["ENSMUST00000169826.2", ] > 1) SingleCellExperiment::colLabels(sce) <- group_anno$groups # DTU with permutation testing: sc_DTU_analysis(sce, min_count = 1, method = "transcript usage permutation") # now try with chisq: sc_DTU_analysis(sce, min_count = 1, method = "chisq")outdir <- tempfile() dir.create(outdir) bc_allow <- file.path(outdir, "bc_allow.tsv") genome_fa <- file.path(outdir, "rps24.fa") R.utils::gunzip( filename = system.file("extdata", "bc_allow.tsv.gz", package = "FLAMES"), destname = bc_allow, remove = FALSE ) R.utils::gunzip( filename = system.file("extdata", "rps24.fa.gz", package = "FLAMES"), destname = genome_fa, remove = FALSE ) sce <- FLAMES::sc_long_pipeline( genome_fa = genome_fa, fastq = system.file("extdata", "fastq", "musc_rps24.fastq.gz", package = "FLAMES"), annotation = system.file("extdata", "rps24.gtf.gz", package = "FLAMES"), outdir = outdir, barcodes_file = bc_allow, config_file = create_config( outdir, pipeline_parameters.demultiplexer = "flexiplex" ) ) group_anno <- data.frame(barcode_seq = colnames(sce), groups = SingleCellExperiment::counts(sce)["ENSMUST00000169826.2", ] > 1) SingleCellExperiment::colLabels(sce) <- group_anno$groups # DTU with permutation testing: sc_DTU_analysis(sce, min_count = 1, method = "transcript usage permutation") # now try with chisq: sc_DTU_analysis(sce, min_count = 1, method = "chisq")
A simplistic function to genotype a single-cell mutation at a given position. It filters the SNPs table for the given reference and alternative alleles, and determines the genotype based on the allele counts and percentages.
sc_genotype( snps_tb, ref, alt, seqname, pos, alt_min_count = 2, alt_min_pct = 0.1, ref_min_count = 1, ref_min_pct = 1 )sc_genotype( snps_tb, ref, alt, seqname, pos, alt_min_count = 2, alt_min_pct = 0.1, ref_min_count = 1, ref_min_pct = 1 )
snps_tb |
tibble: the SNPs table, output from |
ref |
character(1): the reference allele. |
alt |
character(1): the alternative allele. |
seqname |
character(1): the chromosome name of the position. |
pos |
integer(1): the position of the mutation, 1-based. |
alt_min_count |
integer(1): minimum UMI count of the alternative allele to call it "alt". |
alt_min_pct |
numeric(1): minimum percentage of the alternative allele to call it "alt". |
ref_min_count |
integer(1): minimum UMI count of the reference allele to call it "ref". |
ref_min_pct |
numeric(1): minimum percentage of the reference allele to call it "ref". |
A tibble with columns: barcode, allele_count_ref, pct_ref, allele_count_alt, pct_alt, genotype.
# get the SNPs table from sc_mutations example(sc_mutations) genotype_tb <- snps_tb |> sc_genotype( ref = "G", alt = "A", seqname = "chr14", pos = 1260, alt_min_count = 2, alt_min_pct = 0.1, ref_min_count = 1, ref_min_pct = 1 ) dplyr::count(genotype_tb, genotype) head(genotype_tb)# get the SNPs table from sc_mutations example(sc_mutations) genotype_tb <- snps_tb |> sc_genotype( ref = "G", alt = "A", seqname = "chr14", pos = 1260, alt_min_count = 2, alt_min_pct = 0.1, ref_min_count = 1, ref_min_pct = 1 ) dplyr::count(genotype_tb, genotype) head(genotype_tb)
Impute missing transcript counts using a shared nearest neighbor graph
sc_impute_transcript(combined_sce, dimred = "PCA", ...)sc_impute_transcript(combined_sce, dimred = "PCA", ...)
combined_sce |
A |
dimred |
The name of the reduced dimension to use for building the shared nearest neighbor graph. |
... |
Additional arguments to pass to |
For cells with NA values in the "transcript" altExp slot, this function imputes the
missing values from cells with non-missing values. A shared nearest neighbor graph is built using
reduced dimensions from the SingleCellExperiment object, and the imputation is done where
the imputed value for a cell is the weighted sum of the transcript counts of its neighbors.
Imputed values are stored in the "logcounts" assay of the "transcript" altExp slot.
The "counts" assay is used to obtain logcounts but left unchanged.
A SingleCellExperiment object with imputed logcounts assay in the "transcript" altExp slot.
sce <- SingleCellExperiment::SingleCellExperiment(assays = list(counts = matrix(rpois(50, 5), ncol = 10))) long_read <- SingleCellExperiment::SingleCellExperiment(assays = list(counts = matrix(rpois(40, 5), ncol = 10))) SingleCellExperiment::altExp(sce, "transcript") <- long_read SingleCellExperiment::counts(SingleCellExperiment::altExp(sce))[,1:2] <- NA SingleCellExperiment::counts(SingleCellExperiment::altExp(sce)) imputed_sce <- sc_impute_transcript(sce, k = 4) SingleCellExperiment::logcounts(SingleCellExperiment::altExp(imputed_sce))sce <- SingleCellExperiment::SingleCellExperiment(assays = list(counts = matrix(rpois(50, 5), ncol = 10))) long_read <- SingleCellExperiment::SingleCellExperiment(assays = list(counts = matrix(rpois(40, 5), ncol = 10))) SingleCellExperiment::altExp(sce, "transcript") <- long_read SingleCellExperiment::counts(SingleCellExperiment::altExp(sce))[,1:2] <- NA SingleCellExperiment::counts(SingleCellExperiment::altExp(sce)) imputed_sce <- sc_impute_transcript(sce, k = 4) SingleCellExperiment::logcounts(SingleCellExperiment::altExp(imputed_sce))
This function is deprecated. Please use MultiSampleSCPipeline.
sc_long_multisample_pipeline( annotation, fastqs, outdir, genome_fa, minimap2 = NULL, barcodes_file = NULL, expect_cell_numbers = NULL, config_file = NULL )sc_long_multisample_pipeline( annotation, fastqs, outdir, genome_fa, minimap2 = NULL, barcodes_file = NULL, expect_cell_numbers = NULL, config_file = NULL )
annotation |
The file path to the annotation file in GFF3 format |
fastqs |
The file path to input fastq file |
outdir |
The path to directory to store all output files. |
genome_fa |
The file path to genome fasta file. |
minimap2 |
Path to minimap2, optional. |
barcodes_file |
The file with expected cell barcodes, with each barcode on a new line. |
expect_cell_numbers |
The expected number of cells in the sample. This is used if
|
config_file |
File path to the JSON configuration file. |
A list of SingleCellExperiment objects, one for each sample.
MultiSampleSCPipeline for the new pipeline interface,
SingleCellPipeline for single-sample pipeline,
BulkPipeline for bulk long data.
reads <- ShortRead::readFastq( system.file("extdata", "fastq", "musc_rps24.fastq.gz", package = "FLAMES") ) outdir <- tempfile() dir.create(outdir) dir.create(file.path(outdir, "fastq")) bc_allow <- file.path(outdir, "bc_allow.tsv") genome_fa <- file.path(outdir, "rps24.fa") R.utils::gunzip( filename = system.file("extdata", "bc_allow.tsv.gz", package = "FLAMES"), destname = bc_allow, remove = FALSE ) R.utils::gunzip( filename = system.file("extdata", "rps24.fa.gz", package = "FLAMES"), destname = genome_fa, remove = FALSE ) ShortRead::writeFastq(reads[1:100], file.path(outdir, "fastq/sample1.fq.gz"), mode = "w", full = FALSE) reads <- reads[-(1:100)] ShortRead::writeFastq(reads[1:100], file.path(outdir, "fastq/sample2.fq.gz"), mode = "w", full = FALSE) reads <- reads[-(1:100)] ShortRead::writeFastq(reads, file.path(outdir, "fastq/sample3.fq.gz"), mode = "w", full = FALSE) sce_list <- FLAMES::sc_long_multisample_pipeline( annotation = system.file("extdata", "rps24.gtf.gz", package = "FLAMES"), fastqs = c("sampleA" = file.path(outdir, "fastq"), "sample1" = file.path(outdir, "fastq", "sample1.fq.gz"), "sample2" = file.path(outdir, "fastq", "sample2.fq.gz"), "sample3" = file.path(outdir, "fastq", "sample3.fq.gz")), outdir = outdir, genome_fa = genome_fa, barcodes_file = rep(bc_allow, 4), config_file = create_config( outdir, pipeline_parameters.demultiplexer = "flexiplex" ) )reads <- ShortRead::readFastq( system.file("extdata", "fastq", "musc_rps24.fastq.gz", package = "FLAMES") ) outdir <- tempfile() dir.create(outdir) dir.create(file.path(outdir, "fastq")) bc_allow <- file.path(outdir, "bc_allow.tsv") genome_fa <- file.path(outdir, "rps24.fa") R.utils::gunzip( filename = system.file("extdata", "bc_allow.tsv.gz", package = "FLAMES"), destname = bc_allow, remove = FALSE ) R.utils::gunzip( filename = system.file("extdata", "rps24.fa.gz", package = "FLAMES"), destname = genome_fa, remove = FALSE ) ShortRead::writeFastq(reads[1:100], file.path(outdir, "fastq/sample1.fq.gz"), mode = "w", full = FALSE) reads <- reads[-(1:100)] ShortRead::writeFastq(reads[1:100], file.path(outdir, "fastq/sample2.fq.gz"), mode = "w", full = FALSE) reads <- reads[-(1:100)] ShortRead::writeFastq(reads, file.path(outdir, "fastq/sample3.fq.gz"), mode = "w", full = FALSE) sce_list <- FLAMES::sc_long_multisample_pipeline( annotation = system.file("extdata", "rps24.gtf.gz", package = "FLAMES"), fastqs = c("sampleA" = file.path(outdir, "fastq"), "sample1" = file.path(outdir, "fastq", "sample1.fq.gz"), "sample2" = file.path(outdir, "fastq", "sample2.fq.gz"), "sample3" = file.path(outdir, "fastq", "sample3.fq.gz")), outdir = outdir, genome_fa = genome_fa, barcodes_file = rep(bc_allow, 4), config_file = create_config( outdir, pipeline_parameters.demultiplexer = "flexiplex" ) )
This function is deprecated. Please use [SingleCellPipeline()] instead.
sc_long_pipeline( annotation, fastq, outdir, genome_fa, minimap2 = NULL, barcodes_file = NULL, expect_cell_number = NULL, config_file = NULL )sc_long_pipeline( annotation, fastq, outdir, genome_fa, minimap2 = NULL, barcodes_file = NULL, expect_cell_number = NULL, config_file = NULL )
annotation |
The file path to the annotation file in GFF3 format |
fastq |
The file path to input fastq file |
outdir |
The path to directory to store all output files. |
genome_fa |
The file path to genome fasta file. |
minimap2 |
Path to minimap2, optional. |
barcodes_file |
The file with expected cell barcodes, with each barcode on a new line. |
expect_cell_number |
The expected number of cells in the sample. This is used if
|
config_file |
File path to the JSON configuration file. |
A SingleCellPipeline object containing the transcript counts.
SingleCellPipeline for the new pipeline interface,
BulkPipeline for bulk long data,
MultiSampleSCPipeline for multi sample single cell pipelines.
outdir <- tempfile() dir.create(outdir) bc_allow <- file.path(outdir, "bc_allow.tsv") genome_fa <- file.path(outdir, "rps24.fa") R.utils::gunzip( filename = system.file("extdata", "bc_allow.tsv.gz", package = "FLAMES"), destname = bc_allow, remove = FALSE ) R.utils::gunzip( filename = system.file("extdata", "rps24.fa.gz", package = "FLAMES"), destname = genome_fa, remove = FALSE ) sce <- FLAMES::sc_long_pipeline( genome_fa = genome_fa, fastq = system.file("extdata", "fastq", "musc_rps24.fastq.gz", package = "FLAMES"), annotation = system.file("extdata", "rps24.gtf.gz", package = "FLAMES"), outdir = outdir, barcodes_file = bc_allow, config_file = FLAMES::create_config( outdir, pipeline_parameters.demultiplexer = "flexiplex" ) )outdir <- tempfile() dir.create(outdir) bc_allow <- file.path(outdir, "bc_allow.tsv") genome_fa <- file.path(outdir, "rps24.fa") R.utils::gunzip( filename = system.file("extdata", "bc_allow.tsv.gz", package = "FLAMES"), destname = bc_allow, remove = FALSE ) R.utils::gunzip( filename = system.file("extdata", "rps24.fa.gz", package = "FLAMES"), destname = genome_fa, remove = FALSE ) sce <- FLAMES::sc_long_pipeline( genome_fa = genome_fa, fastq = system.file("extdata", "fastq", "musc_rps24.fastq.gz", package = "FLAMES"), annotation = system.file("extdata", "rps24.gtf.gz", package = "FLAMES"), outdir = outdir, barcodes_file = bc_allow, config_file = FLAMES::create_config( outdir, pipeline_parameters.demultiplexer = "flexiplex" ) )
Count the number of reads supporting each variants at the given positions for each cell.
sc_mutations(bam_path, seqnames, positions, indel = FALSE, threads = 1)sc_mutations(bam_path, seqnames, positions, indel = FALSE, threads = 1)
bam_path |
character(1) or character(n): path to the bam file(s) aligned to the reference genome (NOT the transcriptome! Unless the postions are also from the transcriptome). |
seqnames |
character(n): chromosome names of the postions to count alleles. |
positions |
integer(n): positions, 1-based, same length as seqnames. The positions to count alleles. |
indel |
logical(1): whether to count indels (TRUE) or SNPs (FALSE). |
threads |
integer(1): number of threads to use. Maximum number of threads is the number of bam files * number of positions. |
A tibble with columns: allele, barcode, allele_count, cell_total_reads, pct, pos, seqname.
ppl <- example_pipeline("SingleCellPipeline") ppl <- run_step(ppl, "barcode_demultiplex") ppl <- run_step(ppl, "genome_alignment") snps_tb <- sc_mutations( bam_path = ppl@genome_bam, seqnames = c("chr14", "chr14"), positions = c(1260, 2714), # positions of interest indel = FALSE ) head(snps_tb) snps_tb |> dplyr::filter(pos == 1260) |> dplyr::group_by(allele) |> dplyr::summarise(count = sum(allele_count)) # should be identical to samtools pileupppl <- example_pipeline("SingleCellPipeline") ppl <- run_step(ppl, "barcode_demultiplex") ppl <- run_step(ppl, "genome_alignment") snps_tb <- sc_mutations( bam_path = ppl@genome_bam, seqnames = c("chr14", "chr14"), positions = c(1260, 2714), # positions of interest indel = FALSE ) head(snps_tb) snps_tb |> dplyr::filter(pos == 1260) |> dplyr::group_by(allele) |> dplyr::summarise(count = sum(allele_count)) # should be identical to samtools pileup
Plot the genotype of single-cell data on a reduced dimension plot (e.g. UMAP).
sc_plot_genotype( sce, genotype_tb, reduced_dim = "UMAP", na_cell_col = "grey", na_cell_size = 0.1, na_cell_alpha = 0.1, ... )sc_plot_genotype( sce, genotype_tb, reduced_dim = "UMAP", na_cell_col = "grey", na_cell_size = 0.1, na_cell_alpha = 0.1, ... )
sce |
SingleCellExperiment: the single-cell experiment object with reduced dimensions. |
genotype_tb |
tibble: the genotype table, output from |
reduced_dim |
character(1): the name of the reduced dimension to use for plotting. |
na_cell_col |
character(1): the color of the cells with no genotype. |
na_cell_size |
numeric(1): the size of the cells with no genotype. |
na_cell_alpha |
numeric(1): the alpha of the cells with no genotype. |
... |
additional arguments passed to |
A ggplot2 object with the genotype plotted on the reduced dimension.
ppl <- example_pipeline("SingleCellPipeline") |> run_FLAMES() sce <- experiment(ppl) |> scrapper::normalizeRnaCounts.se() |> scater::runPCA() |> scater::runUMAP() snps_tb <- sc_mutations( bam_path = ppl@genome_bam, seqnames = "chr14", positions = 2714 ) genotype_tb <- sc_genotype( snps_tb, ref = "C", alt = "T", seqname = "chr14", pos = 2714, alt_min_count = 2, alt_min_pct = 0.5, ref_min_count = 1, ref_min_pct = 1 ) sc_plot_genotype( sce, genotype_tb, na_cell_col = "black", na_cell_size = 0.5, na_cell_alpha = 0.7, size = 2 )ppl <- example_pipeline("SingleCellPipeline") |> run_FLAMES() sce <- experiment(ppl) |> scrapper::normalizeRnaCounts.se() |> scater::runPCA() |> scater::runUMAP() snps_tb <- sc_mutations( bam_path = ppl@genome_bam, seqnames = "chr14", positions = 2714 ) genotype_tb <- sc_genotype( snps_tb, ref = "C", alt = "T", seqname = "chr14", pos = 2714, alt_min_count = 2, alt_min_pct = 0.5, ref_min_count = 1, ref_min_pct = 1 ) sc_plot_genotype( sce, genotype_tb, na_cell_col = "black", na_cell_size = 0.5, na_cell_alpha = 0.7, size = 2 )
Short-read gene counts from long and short-read single cell RNA-seq profiling of human lung adenocarcinoma cell lines using 10X version 2 chemstry. See Tian, L. et al. Comprehensive characterization of single-cell full-length isoforms in human and mouse with long-read sequencing. Genome Biology 22, 310 (2021).
scmixology_lib10scmixology_lib10
## 'scmixology_lib10'
A SingleCellExperiment with 7,240 rows and 60 columns:
A SingleCellExperiment object
<https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE154869>
long-read transcript counts from long and short-read single cell RNA-seq profiling of human lung adenocarcinoma cell lines using 10X version 2 chemstry. See Tian, L. et al. Comprehensive characterization of single-cell full-length isoforms in human and mouse with long-read sequencing. Genome Biology 22, 310 (2021).
scmixology_lib10_transcriptsscmixology_lib10_transcripts
## 'scmixology_lib10_transcripts'
A SingleCellExperiment with 7,240 rows and 60 columns:
A SingleCellExperiment object
<https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE154869>
Short-read single cell RNA-seq profiling of human lung adenocarcinoma cell lines using 10X version 2 chemstry. Single cells from five human lung adenocarcinoma cell lines (H2228, H1975, A549, H838 and HCC827) were mixed in equal proportions and processed using the Chromium 10X platform, then sequenced using Illumina HiSeq 2500. See Tian L, Dong X, Freytag S, Lê Cao KA et al. Benchmarking single cell RNA-sequencing analysis pipelines using mixture control experiments. Nat Methods 2019 Jun;16(6):479-487. PMID: 31133762
scmixology_lib90scmixology_lib90
## 'scmixology_lib90'
A SingleCellExperiment
A SingleCellExperiment object
<https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE126906>
Semi-supervised isofrom detection and annotation for long read data. This variant is
meant for single sample scRNA-seq data. Specific parameters can be configured in
the config file (see create_config), input files are specified via
arguments.
SingleCellPipeline( config_file, outdir, fastq, annotation, genome_fa, genome_mmi, minimap2, samtools, barcodes_file, expect_cell_number, controllers )SingleCellPipeline( config_file, outdir, fastq, annotation, genome_fa, genome_mmi, minimap2, samtools, barcodes_file, expect_cell_number, controllers )
config_file |
Path to the JSON configuration file. See |
outdir |
Path to the output directory. If it does not exist, it will be created. |
fastq |
Path to the FASTQ file or a directory containing FASTQ files. Each file will be processed as an individual sample. |
annotation |
The file path to the annotation file in GFF3 / GTF format. |
genome_fa |
The file path to the reference genome in FASTA format. |
genome_mmi |
(optional) The file path to minimap2's index reference genome. |
minimap2 |
(optional) The path to the minimap2 binary. If not provided, FLAMES will
use a copy from bioconda via |
samtools |
(optional) The path to the samtools binary. If not provided, FLAMES will
use a copy from bioconda via |
barcodes_file |
The file with expected cell barcodes, with each barcode on a new line. |
expect_cell_number |
The expected number of cells in the sample. This is used if
|
controllers |
(optional, experimental) A |
By default the pipeline starts with demultiplexing the input fastq data. If the
cell barcodes are known apriori (e.g. via coupled short-read sequencing), the
barcodes_file argument can be used to specify a file containing the cell
barcodes, and a modified Rcpp version of flexiplex will be used; otherwise,
expect_cell_number need to be provided, and BLAZE will be used to
generate the cell barcodes. The pipeline then aligns the reads to the genome using
minimap2. The alignment is then used for isoform detection (either using
FLAMES or bambu, can be configured). The reads are then realigned
to the detected isoforms. Finally, a transcript count matrix is generated (either
using FLAMES's simplistic counting or oarfish's Expectation
Maximization algorithm, can be configured). The results can be accssed with
experiment(pipeline). If the pipeline errored out / new steps were configured,
it can be resumed by calling resume_FLAMES(pipeline)
A FLAMES.SingleCellPipeline object. The pipeline can be run using
run_FLAMES(pipeline). The results can be accessed with experiment(pipeline).
The pipeline also outputs a number of output files into the given outdir directory.
Some of these output files include:
- fastq file with reads demultiplexed
- sorted BAM file with reads aligned to genome
- demultiplexed and UMI-deduplicated fastq file
- transcript sequence from the isoforms
- isoforms in gff3 format (also contained in the SingleCellExperiment)
- sorted realigned BAM file using the transcript_assembly.fa as reference
create_config for creating a configuration file,
BulkPipeline for bulk long data,
MultiSampleSCPipeline for multi sample single cell pipelines.
outdir <- tempfile() dir.create(outdir) bc_allow <- file.path(outdir, "bc_allow.tsv") genome_fa <- file.path(outdir, "rps24.fa") R.utils::gunzip( filename = system.file("extdata", "bc_allow.tsv.gz", package = "FLAMES"), destname = bc_allow, remove = FALSE ) R.utils::gunzip( filename = system.file("extdata", "rps24.fa.gz", package = "FLAMES"), destname = genome_fa, remove = FALSE ) ppl <- SingleCellPipeline( config_file = create_config( outdir, pipeline_parameters.demultiplexer = "flexiplex", pipeline_parameters.do_gene_quantification = FALSE ), outdir = outdir, fastq = system.file("extdata", "fastq", "musc_rps24.fastq.gz", package = "FLAMES"), annotation = system.file("extdata", "rps24.gtf.gz", package = "FLAMES"), genome_fa = genome_fa, barcodes_file = bc_allow ) ppl <- run_FLAMES(ppl) experiment(ppl)outdir <- tempfile() dir.create(outdir) bc_allow <- file.path(outdir, "bc_allow.tsv") genome_fa <- file.path(outdir, "rps24.fa") R.utils::gunzip( filename = system.file("extdata", "bc_allow.tsv.gz", package = "FLAMES"), destname = bc_allow, remove = FALSE ) R.utils::gunzip( filename = system.file("extdata", "rps24.fa.gz", package = "FLAMES"), destname = genome_fa, remove = FALSE ) ppl <- SingleCellPipeline( config_file = create_config( outdir, pipeline_parameters.demultiplexer = "flexiplex", pipeline_parameters.do_gene_quantification = FALSE ), outdir = outdir, fastq = system.file("extdata", "fastq", "musc_rps24.fastq.gz", package = "FLAMES"), annotation = system.file("extdata", "rps24.gtf.gz", package = "FLAMES"), genome_fa = genome_fa, barcodes_file = bc_allow ) ppl <- run_FLAMES(ppl) experiment(ppl)
Steps to perform in the pipeline
steps(pipeline) ## S4 method for signature 'FLAMES.Pipeline' steps(pipeline)steps(pipeline) ## S4 method for signature 'FLAMES.Pipeline' steps(pipeline)
pipeline |
An object of class 'FLAMES.Pipeline' |
A named logical vector containing all possible steps for the pipeline. The names of the vector are the step names, and the values are logical indicating whether the step is configured to be performed.
ppl <- example_pipeline() steps(ppl)ppl <- example_pipeline() steps(ppl)
Set steps to perform in the pipeline
steps(pipeline) <- value ## S4 replacement method for signature 'FLAMES.Pipeline' steps(pipeline) <- valuesteps(pipeline) <- value ## S4 replacement method for signature 'FLAMES.Pipeline' steps(pipeline) <- value
pipeline |
An object of class 'FLAMES.Pipeline' |
value |
A named logical vector containing all possible steps for the pipeline. The names of the vector are the step names, and the values are logical indicating whether the step is configured to be performed. |
An pipeline of class 'FLAMES.Pipeline' with the updated steps.
ppl <- example_pipeline() steps(ppl) <- c( barcode_demultiplex = TRUE, genome_alignment = TRUE, gene_quantification = TRUE, isoform_identification = FALSE, read_realignment = FALSE, transcript_quantification = TRUE ) ppl # or partially change a step: steps(ppl)["read_realignment"] <- TRUE pplppl <- example_pipeline() steps(ppl) <- c( barcode_demultiplex = TRUE, genome_alignment = TRUE, gene_quantification = TRUE, isoform_identification = FALSE, read_realignment = FALSE, transcript_quantification = TRUE ) ppl # or partially change a step: steps(ppl)["read_realignment"] <- TRUE ppl
Given a vector of read counts, return a vector of weights.
The weights could be either the read counts themselves (type = 'counts'),
a binary vector of 0s and 1s where 1s are assigned to transcripts with read
counts above a threshold (type = 'equal', min_counts = 1000), or a
sigmoid function of the read counts (type = 'sigmoid'). The sigmoid
function is defined as 1 / (1 + exp(-steepness/inflection * (x - inflection))).
weight_transcripts( counts, type = "sigmoid", min_counts = 1000, inflection_idx = 10, inflection_max = 1000, steepness = 5 )weight_transcripts( counts, type = "sigmoid", min_counts = 1000, inflection_idx = 10, inflection_max = 1000, steepness = 5 )
counts |
numeric vector of read counts |
type |
string, one of 'counts', 'sigmoid', or 'equal' |
min_counts |
numeric, the threshold for the 'equal' type |
inflection_idx |
numeric, the index of the read counts to determine the inflection point for the sigmoid function. The default is 10, i.e. the 10th highest read count will be the inflection point. |
inflection_max |
numeric, the maximum value for the inflection point. If the inflection point according to the inflection_idx is higher than this value, the inflection point will be set to this value instead. |
steepness |
numeric, the steepness of the sigmoid function |
numeric vector of weights
weight_transcripts(1:2000) par(mfrow = c(2, 2)) plot( 1:2000, weight_transcripts(1:2000, type = 'sigmoid'), type = 'l', xlab = 'Read counts', ylab = 'Sigmoid weight' ) plot( 1:2000, weight_transcripts(1:2000, type = 'counts'), type = 'l', xlab = 'Read counts', ylab = 'Weight by counts' ) plot( 1:2000, weight_transcripts(1:2000, type = 'equal'), type = 'l', xlab = 'Read counts', ylab = 'Equal weights' )weight_transcripts(1:2000) par(mfrow = c(2, 2)) plot( 1:2000, weight_transcripts(1:2000, type = 'sigmoid'), type = 'l', xlab = 'Read counts', ylab = 'Sigmoid weight' ) plot( 1:2000, weight_transcripts(1:2000, type = 'counts'), type = 'l', xlab = 'Read counts', ylab = 'Weight by counts' ) plot( 1:2000, weight_transcripts(1:2000, type = 'equal'), type = 'l', xlab = 'Read counts', ylab = 'Equal weights' )