Title: | A toolkit for end-to-end analysis of RNA-seq data for circular genomes |
---|---|
Description: | CircSeqAlignTk is designed for end-to-end RNA-Seq data analysis of circular genome sequences, from alignment to visualization. It mainly targets viroids which are composed of 246-401 nt circular RNAs. In addition, CircSeqAlignTk implements a tidy interface to generate synthetic sequencing data that mimic real RNA-Seq data, allowing developers to evaluate the performance of alignment tools and workflows. |
Authors: | Jianqiang Sun [cre, aut] , Xi Fu [ctb], Wei Cao [ctb] |
Maintainer: | Jianqiang Sun <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.9.0 |
Built: | 2024-10-30 04:42:49 UTC |
Source: | https://github.com/bioc/CircSeqAlignTk |
CircSeqAlignTk is designed for end-to-end RNA-Seq data analysis of circular genome sequences, from alignment to visualization. It mainly targets viroids which are composed of 246-401 nt circular RNAs. In addition, CircSeqAlignTk implements a tidy interface to generate synthetic sequencing data that mimic real RNA-Seq data, allowing developers to evaluate the performance of alignment tools and workflows.
Refer to the vignette for an overview of the package, quick start, and detailed usages.
Maintainer: Jianqiang Sun [email protected] (ORCID)
Other contributors:
Xi Fu [contributor]
Wei Cao [contributor]
Useful links:
browseVignettes("CircSeqAlignTk")
browseVignettes("CircSeqAlignTk")
This function aligns sequence reads in a FASTQ file to the reference sequences of a genome.
align_reads( input, index, output, n_threads = 1, n_mismatch = 1, overwrite = TRUE, aligner = c("bowtie2", "hisat2"), add_args = NULL )
align_reads( input, index, output, n_threads = 1, n_mismatch = 1, overwrite = TRUE, aligner = c("bowtie2", "hisat2"), add_args = NULL )
input |
A path to a FASTQ format file for alignment. |
index |
A |
output |
A path to a directory for saving the intermediate and final results of alignment. |
n_threads |
Number of threads to use for aligning reads. |
n_mismatch |
Number of allowed mismatches in alignment. |
overwrite |
Overwrite the existing files if |
aligner |
A string to specify the alignment is for alignment. |
add_args |
A string of additional arguments to be passed on to the
alignment tool directly.
For example, |
This function aligns sequence reads in a FASTQ format file in two stages:
(i) aligning reads to the type 1 reference sequence (i.e., refseq.t1.fa
)
and (ii) collecting the unaligned reads
and aligning them with the type 2 reference (i.e., refseq.t2.fa
).
The alignment results are saved as BAM format files
in the specified directory with the suffixes *.t1.bam
and *.t2.bam
.
The original alignment results may contain mismatches.
Hence, filtering is performed to remove the alignment with mismatches
over the specified value from the BAM format file.
The filtered results of the *.t1.bam
and *.t2.bam
are saved as
*.clean.t1.bam
and *.clean.t2.bam
, respectively.
Two alignment tools (Bowtie2 and HISAT2) can be specified
for building indexes through the aligner argument.
This function first attempts to call the specified alignment tool
installed on the operation system directly;
however, if the tool is not installed,
then the function attempts to call
bowtie2_build
or hisat2_build
functions implemented in the Rbowtie2 or Rhisat2 packages for alignment.
A CircSeqAlignTkAlign-class
object.
output_dpath <- tempdir() genome_seq <- system.file(package="CircSeqAlignTk", "extdata", "FR851463.fa") fq <- system.file(package="CircSeqAlignTk", "extdata", "srna.fq.gz") ref_index <- build_index(input = genome_seq, output = file.path(output_dpath, 'index')) aln <- align_reads(input = fq, index = ref_index, output = file.path(output_dpath, 'align_results')) slot(aln, 'stats')
output_dpath <- tempdir() genome_seq <- system.file(package="CircSeqAlignTk", "extdata", "FR851463.fa") fq <- system.file(package="CircSeqAlignTk", "extdata", "srna.fq.gz") ref_index <- build_index(input = genome_seq, output = file.path(output_dpath, 'index')) aln <- align_reads(input = fq, index = ref_index, output = file.path(output_dpath, 'align_results')) slot(aln, 'stats')
Build a graphical user interface (GUI) application for CircSeqAlignTk using Shiny package.
build_app(...)
build_app(...)
... |
Arguments to be passed to |
The CircSeqAlignTk graphical user interface (GUI) application is built using the Shiny package. Users need to install the Shiny package and associated packages (shinyFiles, shinyjs) to use this application. Additionally, the installation of RBowtie2 and Rhisat2 is required to use the full functionality of this application.
A Shiny application object.
## Not run: library(shiny) library(CircSeqAlignTk) app <- build_app() runApp(app) ## End(Not run)
## Not run: library(shiny) library(CircSeqAlignTk) app <- build_app() runApp(app) ## End(Not run)
This function internally calls Bowtie2 or HISAT2 to build indexes of reference sequences for alignment preparation.
build_index( input, output = NULL, n_threads = 1, overwrite = TRUE, aligner = c("bowtie2", "hisat2"), add_args = NULL )
build_index( input, output = NULL, n_threads = 1, overwrite = TRUE, aligner = c("bowtie2", "hisat2"), add_args = NULL )
input |
A path to a FASTA format file containing a reference sequence of a genome for indexing. |
output |
A path to a directory for saving the reference sequences and indexes. |
n_threads |
Number of threads to use for aligning reads. |
overwrite |
Overwrite the existing files if |
aligner |
A string to specify the alignment for indexing. |
add_args |
A string of additional arguments to be passed on to the
alignment tool directly (e.g., |
This function generates two types of reference sequences
from a genome and indexes them in preparation for alignment.
The type 1 reference sequence is identical to
the sequence provided by the input
argument.
The type 2 reference sequence is generated
by restoring the type 1 reference sequence to a circular RNA
and opening the circle at the position opposite to that of type 1.
The type 1 and 2 reference sequences are then saved as FASTA format files,
refseq.t1.fa
and refseq.t2.fa
, respectively,
under the directory specified by the output
argument.
Next, the function builds indexes for refseq.t1.fa
and refseq.t2.fa
.
Two alignment tools (Bowtie2 and HISAT2) can be specified for
building indexes through the aligner argument.
This function first attempts
to call the specified alignment tool
installed on the operation system directly;
however, if the tool is not installed,
then the function attempts to call
bowtie2_build
or
hisat2_build
functions
implemented in the Rbowtie2 or Rhisat2 packages for indexing.
A CircSeqAlignTkRefIndex-class
object.
output_dpath <- tempdir() genome_seq <- system.file(package="CircSeqAlignTk", "extdata", "FR851463.fa") ref_index <- build_index(input = genome_seq, output = file.path(output_dpath, "index"))
output_dpath <- tempdir() genome_seq <- system.file(package="CircSeqAlignTk", "extdata", "FR851463.fa") ref_index <- build_index(input = genome_seq, output = file.path(output_dpath, "index"))
This function calculates alignment coverage according to the read strand and length from alignment results.
calc_coverage(x)
calc_coverage(x)
x |
A |
This function calculates alignment coverage from the two BAM files,
*.clean.t1.bam
and *.clean.t2.bam
,
generated by the align_reads
function.
The coverage is then sorted by the strand and length of the aligned reads
and summarized into data frames.
A CircSeqAlignTkCoverage-class
object.
CircSeqAlignTkAlign-class
,
CircSeqAlignTkCoverage-class
, align_reads
output_dpath <- tempdir() genome_seq <- system.file(package="CircSeqAlignTk", "extdata", "FR851463.fa") fq <- system.file(package="CircSeqAlignTk", "extdata", "srna.fq.gz") ref_index <- build_index(input = genome_seq, output = file.path(output_dpath, 'index')) aln <- align_reads(input = fq, index = ref_index, output = file.path(output_dpath, 'align_results')) alncov <- calc_coverage(aln)
output_dpath <- tempdir() genome_seq <- system.file(package="CircSeqAlignTk", "extdata", "FR851463.fa") fq <- system.file(package="CircSeqAlignTk", "extdata", "srna.fq.gz") ref_index <- build_index(input = genome_seq, output = file.path(output_dpath, 'index')) aln <- align_reads(input = fq, index = ref_index, output = file.path(output_dpath, 'align_results')) alncov <- calc_coverage(aln)
A class to store alignment results,
including the paths to FASTQ and BAM format files and the alignment summary.
The object belongs to this class
is generated by align_reads
function.
input_fastq
A path to the query FASTQ format file.
fastq
A vector containing the paths to the two FASTQ format files
used for alignment to the type 1 and type 2 references,
respectively. See align_reads
for
how the FASTQ format files are generated.
bam
A vector containing the paths to the two BAM format files
corresponding to the alignment results of the two FASTQ files
shown in the fastq
slot, respectively.
clean_bam
A vector containing the paths to the two BAM format files
after filtering by number of mismatch
from BAM format files shown in bam
slot.
stats
A data frame containing alignment summary, e.g., number of query reads, aligned reads, and unaligned reads.
reference
A CircSeqAlignTkRefIndex-class
storing the information of reference for alignment.
CircSeqAlignTkRefIndex-class
, align_reads
A class to store the alignment coverage
generated by calc_coverage
function.
forward
A matrix containing the alignment coverage of the forward strand reads.
reverse
A matrix containing the alignment coverage of the reverse strand reads.
.figdata
A string of adapter sequence.
A class to store reference information for alignment.
The object belongs to this class
is generated by build_index
function.
name
Reference name. The sequence name written the header of FASTA format file.
seq
Reference sequence.
length
Length of the reference sequence.
fasta
A vector containing the paths to the two FASTA format files
of the type 1 and type 2 reference sequences, respectively.
See build_index
for how FASTA format files are generated.
index
A vector containing the paths to the two reference indexes
corresponding to the two FASTA format files
stored in the fasta
slot, respectively.
cut_loc
The position on the user-given sequence (i.e., the type 1 sequence) to cut for generating the type 2 reference sequence.
A class to store parameters for generating synthetic sequence reads.
The object belongs to this class
is generated by generate_reads
function.
seq
A string of a genome sequence, which is used for sampling synthetic sequence reads.
adapter
A string of adapter sequence.
read_info
A data frame storing the summary information of read generation. It contains the start and end positions of sampling, strand, and nucleotide sequence of each synthetic read.
peak
A data frame storing the peaks information of alignment coverage.
coverage
A CircSeqAlignTkCoverage-class
storing the information of alignment coverage.
fastq
A path to FASTQ format file saving the synthetic reads.
generate_reads
, CircSeqAlignTkCoverage-class
This function removes sequence reads with lengths outside the specified range from the FASTQ file.
filter_reads(input, output, read_lengths = seq(21, 24), overwrite = TRUE)
filter_reads(input, output, read_lengths = seq(21, 24), overwrite = TRUE)
input |
A path to a FASTQ file targeted for filtering. |
output |
A path to save the filtered reads in FASTQ format. |
read_lengths |
A series of integers to specify read length. Reads other than the length specified will be excluded during alignment. |
overwrite |
Overwrite the existing files if |
Studies on small RNA-seq data from viroid-infected plants
have mostly focused on reads with lengths ranging from 21 nt to 24 nt.
This function is intended to be used to
remove sequence reads with lengths outside the specified range.
The default range is 21-24 nt,
which can be changed through the read_lengths
argument.
Note that, if filtering by read length has already been performed during the quality control process, there is no need to use this function.
A path to the filtered FASTQ file.
output_dpath <- tempdir() fq <- system.file(package="CircSeqAlignTk", "extdata", "srna.fq.gz") output_fq <- file.path(output_dpath, "sran.filtered.fq.gz") filter_reads(fq, output_fq, seq(21, 24))
output_dpath <- tempdir() fq <- system.file(package="CircSeqAlignTk", "extdata", "srna.fq.gz") output_fq <- file.path(output_dpath, "sran.filtered.fq.gz") filter_reads(fq, output_fq, seq(21, 24))
This function generates synthetic sequence reads to mimic RNA-seq reads sequenced from organelles or organisms with circular genome sequences in FASTQ format file.
generate_reads( n = 10000, seq = NULL, output = NULL, adapter = NULL, srna_length = NULL, read_length = 150, mismatch_prob = 0, peaks = NULL, read_name_prefix = NULL )
generate_reads( n = 10000, seq = NULL, output = NULL, adapter = NULL, srna_length = NULL, read_length = 150, mismatch_prob = 0, peaks = NULL, read_name_prefix = NULL )
n |
Number of reads should be generated. |
seq |
A file path to a genome sequence in FASTA format file or a string of genome sequence. |
output |
A file path to store the synthetic reads in FASTQ format file.
The extension should be one of |
adapter |
A path to a FASTA format file containing a string of adapter
sequence.
If |
srna_length |
A data frame to specify the lengths of sequence reads
sampled from the genome sequence.
The data frame should contain two columns named as |
read_length |
The length of synthetic reads.
If |
mismatch_prob |
A vector to specify probabilities
of mismatches occurring in the reads.
In order not to allow any mismatches in the reads,
set the argument to |
peaks |
A data frame to specify the peaks of the alignment coverage.
The data frame should contain four columns named as |
read_name_prefix |
The prefix of read name in FASTQ format file.
If |
A CircSeqAlignTkSim-class
object containing parameters
for read generation.
output_dpath <- tempdir() sim <- generate_reads(output = file.path(output_dpath, 'sample1.fq.gz')) srna_length <- data.frame(length = c(21, 22, 23, 24), prob = c(0.5, 0.3, 0.1, 0.1)) sim <- generate_reads(output = file.path(output_dpath, 'sample2.fq.gz'), srna_length = srna_length) sim <- generate_reads(output = file.path(output_dpath, 'sample3.fq.gz'), mismatch_prob = c(0.1, 0.1)) peaks <- data.frame(mean = c( 50, 100, 150), std = c( 3, 5, 5), strand = c('+', '-', '+'), prob = c(0.4, 0.4, 0.2)) sim <- generate_reads(output = file.path(output_dpath, 'sample4.fq.gz'), peaks = peaks)
output_dpath <- tempdir() sim <- generate_reads(output = file.path(output_dpath, 'sample1.fq.gz')) srna_length <- data.frame(length = c(21, 22, 23, 24), prob = c(0.5, 0.3, 0.1, 0.1)) sim <- generate_reads(output = file.path(output_dpath, 'sample2.fq.gz'), srna_length = srna_length) sim <- generate_reads(output = file.path(output_dpath, 'sample3.fq.gz'), mismatch_prob = c(0.1, 0.1)) peaks <- data.frame(mean = c( 50, 100, 150), std = c( 3, 5, 5), strand = c('+', '-', '+'), prob = c(0.4, 0.4, 0.2)) sim <- generate_reads(output = file.path(output_dpath, 'sample4.fq.gz'), peaks = peaks)
This function returns the slot contents from a formal class.
It is convenient to use @
when accessing the contents of a slot,
however, using @
will generate warnings
during the unit tests under software development.
This function was created to avoid that warning.
Users do not have to use this function.
get_slot_contents(object, name)
get_slot_contents(object, name)
object |
An object from a formally defined class. |
name |
The name of the slot. |
The contents of the specified slot from the given object.
output_dpath <- tempdir() sim <- generate_reads(output = file.path(output_dpath, 'sample1.fq.gz')) head(get_slot_contents(sim, 'peak'))
output_dpath <- tempdir() sim <- generate_reads(output = file.path(output_dpath, 'sample1.fq.gz')) head(get_slot_contents(sim, 'peak'))
Merge multiple synthetic datasets generated by generate_reads
.
## S3 method for class 'CircSeqAlignTkSim' merge(..., output = NULL, overwrite = TRUE)
## S3 method for class 'CircSeqAlignTkSim' merge(..., output = NULL, overwrite = TRUE)
... |
CircSeqAlignTkSim class objects. |
output |
A file path to store the synthetic reads in FASTQ format file.
The extension should be one of |
overwrite |
Overwrite the existing files if |
Merge multiple synthetic datasets generated by generate_reads
into one dataset.
A CircSeqAlignTkSim-class
object.
CircSeqAlignTkSim-class
, generate_reads
output_dpath <- tempdir() sim_params_1 <- data.frame(length = c(21, 22), prob = c(0.5, 0.4)) sim_1 <- generate_reads(n = 5e2, output = file.path(output_dpath, 'sample1.fq.gz'), srna_length = sim_params_1) sim_params_2 <- data.frame(length = c(19, 20, 23), prob = c(0.2, 0.7, 0.1)) sim_2 <- generate_reads(n = 5e2, output = file.path(output_dpath, 'sample2.fq.gz'), srna_length = sim_params_2) sim <- merge(sim_1, sim_2, output = file.path(output_dpath, 'sample.fq.gz'))
output_dpath <- tempdir() sim_params_1 <- data.frame(length = c(21, 22), prob = c(0.5, 0.4)) sim_1 <- generate_reads(n = 5e2, output = file.path(output_dpath, 'sample1.fq.gz'), srna_length = sim_params_1) sim_params_2 <- data.frame(length = c(19, 20, 23), prob = c(0.2, 0.7, 0.1)) sim_2 <- generate_reads(n = 5e2, output = file.path(output_dpath, 'sample2.fq.gz'), srna_length = sim_params_2) sim <- merge(sim_1, sim_2, output = file.path(output_dpath, 'sample.fq.gz'))
This function visualizes the alignment coverage using an area chart. By default, the upper and lower directions of the y-axis represent the alignment coverage of the reads aligned in the forward and reverse strands, respectively.
plot_coverage(x, read_lengths = NULL, fill = "read_length", scale_fun = NULL) ## S3 method for class 'CircSeqAlignTkCoverage' plot(x, ...)
plot_coverage(x, read_lengths = NULL, fill = "read_length", scale_fun = NULL) ## S3 method for class 'CircSeqAlignTkCoverage' plot(x, ...)
x |
A |
read_lengths |
Numeric numbers to specify the lengths of reads
targeted for visualization. If |
fill |
Specify |
scale_fun |
Set |
... |
Other graphical parameters. |
An object of ggplot2.
CircSeqAlignTkCoverage-class
, calc_coverage
output_dpath <- tempdir() genome_seq <- system.file(package="CircSeqAlignTk", "extdata", "FR851463.fa") fq <- system.file(package="CircSeqAlignTk", "extdata", "srna.fq.gz") ref_index <- build_index(input = genome_seq, output = file.path(output_dpath, 'index')) aln <- align_reads(input = fq, index = ref_index, output = file.path(output_dpath, 'align_results')) alncov <- calc_coverage(aln) plot(alncov)
output_dpath <- tempdir() genome_seq <- system.file(package="CircSeqAlignTk", "extdata", "FR851463.fa") fq <- system.file(package="CircSeqAlignTk", "extdata", "srna.fq.gz") ref_index <- build_index(input = genome_seq, output = file.path(output_dpath, 'index')) aln <- align_reads(input = fq, index = ref_index, output = file.path(output_dpath, 'align_results')) alncov <- calc_coverage(aln) plot(alncov)