In plants, systemic signalling is an elaborated molecular system which coordinates plant development, integrating and transmitting the information perceived from the environment to distant organs. An important role in long-distance signalling is played by small RNA molecules (sRNAs). The nucleotide length of a sRNA helps researchers identify the class of sRNA and predict its functionality. Micro-RNAs (miRNAs) are involved in directing translational repression and/or the cleavage of messenger RNAs (mRNAs). Whereas small interfering RNAs (siRNAs) are involved in the maintenance and de novo DNA methylation and account for the majority of sRNAs in plants. These endogenous sRNAs can be produced in a tissue and then transported systemically across the vascular system into recipient organs, where they can induce a molecular response and coordinate physiological changes. Similarly, messenger RNAs (mRNAs) can move across distances, and it’s thought they may translate into proteins which act as transcription factors in the recipient tissues.
Plant grafting can be utilised to create chimeric plant systems
composed of two genotypes, such as different species like tomato and
eggplant, or plant varieties or accessions. Grafting has been used as a
method to study RNA mobilomes and their impact on the phenotype. Yet, it
is clear that there is no standardised genomic approach for the analysis
of sequencing data to identify an RNA mobilome. Here we introduce the R
package, mobileRNA, a recommended pipeline and analysis workflow for the
identification of a sRNAs/mRNA mobilome. In addition, the flexibility
supports standard RNA analysis between treatment and control conditions.
For example, to identify sRNA population changes due to the application
of a treatment such as cold/heat stress or exposure to a pest.
mobileRNA
ultimately assists in pre-processing and analysis
including the characterization of different populations, visualization
of the results, and supporting output for functional analysis.
As stated, this was developed for applications for plant grafting experimental analysis, however, we believe it could have further applications including the analysis of dual-host systems.
In grafted plants, when different genotypes are used as rootstock and scions, the sequence variation between the two genomes involved can be used to discriminate the origin of a sequenced RNA molecule. Therefore, if an RNA molecule sequenced from one of the grafted partners (scion or rootstock) has been found to match the genome of the other grafting partner, this could empirically demonstrate its movement across the graft junction.
Most available genomics approaches to implement this analysis are based on RNA sequencing, followed by alignment on a genotype of reference and post-alignment screening of genetic variants to identify molecules which have a better match for the genotype of the grafted partner. These methods have many limitations, which might include:
Here, to circumvent such problems we propose a method inspired by the RNAseq analysis of plant hybrids (Lopez-Gomollon 2022), including an alignment step performed simultaneously on both genomes involved. The rationale of this approach considers that alignment tools already implement an algorithm ideal for the identification of the best matches (according to set parameters) in a given genome reference, but they do not account for potential matches to DNA sequences which are not provided as reference. Therefore, the two genomes from all partners involved in the system are merged in a single FASTA file and used as a reference for the unique alignment. Ultimately, in a bid to supply the algorithm with as much information as possible to make the best possible predictions and placement of sequencing reads to each genome.
The summarised workflow is shown below (Figure 1) where it contains a core RNA analysis and a mobile sRNA/mRNA analysis. The core analysis represents the standard workflow for the identification of RNA populations which have been gained, lost or changed in abundance, for example, the sRNA population difference between treatment and condition samples, or similarly in a chimeric system, such as plant grafting, we might want to explore the native sRNA population from the sample tissue origin (i.e. leaf) which have been lost or gained or changes in sRNA abundance. While the mobile analysis represents the workflow for the identification of putative mobile sRNAs or mRNA in a plant graft system.
As input, the pipeline requires cleaned sRNA or mRNA sequencing reads
in FASTQ format, along with the genome assemblies which represent the
genotypes in the system. The diagram below illustrates the complete
workflow using mobileRNA
, including essential, optional,
and plotting functions.
The analysis approach includes several underlying features to be aware of which can alter the final output. When working with a chimeric system, the core steps offer the ability to remove mapping errors by comparing control samples to treatment samples. If the genotypes in the chimeric system are fairly distantly related, it is unexpected that unique reads aligned to the mobile genotype will be found in the control samples. With that in mind, we can assume RNAs with reads mapped to the mobile genotype from the control samples could be artifacts or mapping errors. Hence, these RNAs are removed from the analysis when the parameters are utilized. Note that, if the chimeric system is expected to share some or a high level of similarity it might be insightful to the analysis to not remove these sRNA clusters.
At the end of the analysis, the user will have generated a dataframe where rows represent either sRNA clusters or mRNAs and the columns include information on the sRNA clusters or mRNAs, individual sample replicates, and more:
Locus
: Genomic locationchr
: Chromosome or scaffold namestart
: Start position of the clusterend
: End position of the clusterCluster
: Cluster NameDicerConsensus
: Consensus dicercall (Calculated by
RNAdicercall()
)DicerCounts
: Number of replicates which contributed to
the consensus dicercall (Calculated by RNAdicercall()
)CountMean
: Count mean (Calculated by
RNAdifferentialAnalysis()
)log2FoldChange
: Log fold change (Calculated by
RNAdifferentialAnalysis()
)pvalue
: p-value (Calculated by
RNAdifferentialAnalysis()
)padjusted
: Adjusted p-value (Calculated by
RNAdifferentialAnalysis()
)logCPM
: Log counts per million (CPM/RPM) (Calculated
by RNAdifferentialAnalysis()
)DicerCall_
: The nucleotide length of most abundant
sRNACount_
: Number of aligned sRNA reads. As default,
these are uniquely aligned (e.g. not multi-mapping).RPM_
: Reads per MillionFPKM_
: Fragments Per Kilobase of transcript per
MillionMajorRNA_
: RNA sequence of most abundant sRNA in the
clusterFeature
: mRNA nameSampleCounts
: Consensus dicercall (Calculated by
RNAdicercall()
)CountMean
: Count mean (Calculated by
RNAdifferentialAnalysis()
)log2FoldChange
: Log fold change (Calculated by
RNAdifferentialAnalysis()
)pvalue
: p-value (Calculated by
RNAdifferentialAnalysis()
)padjusted
: Adjusted p-value (Calculated by
RNAdifferentialAnalysis()
)logCPM
: Log counts per million (CPM/RPM) (Calculated
by RNAdifferentialAnalysis()
)Count_
: Number of aligned mRNA reads. As default,
these are uniquely aligned (e.g. not multi-mapping).FPKM_
: Fragments Per Kilobase of transcript per
MillionIn addition, the user can utilise additional functions to plot a PCA, distance matrix, sRNA class distribution and heatmaps. Plus, mobileRNA offers functions to assist functional analysis to support the determination of the biological implications. For instance, within sRNA analysis, the user can extract the consensus RNA sequence for each sRNA cluster for target prediction analysis, as well as identifying genomic features associates with each sRNA clusters, such as genes or repeat regions. This can be utilised, for example, to explore the RNA expression of sRNA-producing genes in parallel analysis.
The latest version of mobileRNA
can be installed via
Bioconductor:
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("mobileRNA")
It is also available on GitHub:
if (!require("devtools")) install.packages("devtools")
devtools::install_github("KJeynesCupper/mobileRNA", ref = "main")
Load into R library:
mobileRNA
works on systems with R
, and
depending on the type of sequencing data different OS dependencies
installed via Conda (Anaconda-Inc 2020)
are required for the alignment step.
For sRNA data, ShortStack
(>= 4.0) (Axtell 2013) and the dependencies are required.
Please consider that ShortStack
is not available for
Windows, hence, Windows users will either need to opt to use a virtual
machine or Windows
Subsystem for Linux In either case, both R
and
ShortStack
will need to be installed and used on the Linux
side. Please head to ShortStack
to see the recommended installation instructions with Conda (Anaconda-Inc 2020). This will ensure all
dependencies are available within the same environment.
For mRNA data, HISAT2 (Kim 2015), HTSeq (Anders, Pyl, and Huber 2014), SAMtools (Danecek P 2021) are required within the same Conda environment (Anaconda-Inc 2020).
The package includes a simulated data set to replicate the grafting between eggplant-tomato (Solanum melongena - Solanum lycopersicon) where eggplant represents the scion and tomato represents the rootstock. The FASTQ files represent data extracted from the eggplant leaf tissue. Here we will locate mobile RNAs produced by tomato roots which have travelled into the eggplant leaves. In the example data set, there are heterograft replicate, where each is an individual tomato replicate spiked with the same random set of tomato sRNA clusters. There are two sets of each for mRNA analysis and three sets for sRNA analysis. These are known as:
heterograft_1
heterograft_2
heterograft_3
(sRNA analysis only)There are self-graft replicates, where each is an individual tomato replicate without the spiked tomato sRNA clusters. These are:
selfgraft_1
selfgraft_2
selfgraft_3
(sRNA analysis only)The replicates mirror each other where, for instance,
heterograft_1
and selfgraft_1
are the same
replicate, either with or without the spiked clusters.
Hence, we also provide more comprehensive data sets, for small RNA it
is called sRNA_data
and for messenger RNA it is called
mRNA_data
. Each stores an example dataframe produced by the
importation step with RNAimport()
. This can be loaded in
the R environment by using the following command:
For sRNAseq & mRNAseq alignment step, we provide demo FASTQ
files; these have a reduced number of reads and do not cluster as
expected, so should not be used for downstream analysis. These are
stored within inst/extdata
. Tomato and Eggplant reduced
genome assemblies and annotations have been provided based on those
generated by Hosmani et al., 2019 (Hosmani
2019) and Barchi et al., 2021 (Barchi
2021).
The simulated data was generated from the data published Consortium (2012).
This is a quick-start example analysis of sRNAseq data to locate
putative mobile root-to-shoot sRNAs from a plant grafting experiment
between eggplant and tomato.
Merge the FASTA genome assemblies of tomato and eggplant into a single reference file stored in your desired directory. Eggplant represents the scion as genome A, and tomato represents the rootstock (foreign/mobile) as genome B.
fasta_1 <- system.file("extdata","reduced_chr12_Eggplant.fa.gz",
package="mobileRNA")
fasta_2 <-system.file("extdata","reduced_chr2_Tomato.fa.gz",
package="mobileRNA")
# define temporary output directory - replace with your directory
output_assembly_file <- file.path(tempfile("merged_assembly",
fileext = ".fa"))
# merge
merged_reference <- RNAmergeGenomes(genomeA = fasta_1,
genomeB = fasta_2,
output_file = output_assembly_file)
Now, repeat for the genome annotations (GFF) - this is required for downstream analysis or for mRNA mapping.
It is important that the same alterations are made to each genome
in the merged files. Otherwise, the annotation file will not align with
the chromosome names in the assembly file.
anno1 <- system.file("extdata","reduced_chr12_Eggplant.gff.gz",
package="mobileRNA")
anno2 <- system.file("extdata","reduced_chr2_Tomato.gff.gz",
package="mobileRNA")
# define temporary output directory
output_annotation_file <- file.path(tempfile("merged_annotation",
fileext = ".gff3"))
# merge annotation files into a single file
merged_annotation <- RNAmergeAnnotations(annotationA = anno1,
annotationB = anno2,
output_file = output_annotation_file)
Here we will align our samples containing small RNA sequencing reads
to the merged genome assembly. The mapRNA()
function
invokes a system command to ShortStack
; an application
which employs Bowtie
(Langmead B
2009) mapping and performs comprehensive de novo annotation and
quantification of sRNA genes. The function first undertakes de novo sRNA
cluster detection (stored in 1_de_novo_detection
) and then
quantification of sRNA genes (stored in2_sRNA_results
).
Since we are using chimeric samples, we use mmap = n
to
exclude multi-mapped reads. We exclude multi-mappers as we currently do
not have a method to distinguish whether a read is mapped to multiple
locations within one or both of the genomes in the merged reference.
PLEASE NOTE: For the alignment & import demo, we are using fastq files which have a reduced size for the package. This data will not be used for the continued analysis step, instead, you will need to load the full dataset.
# directory containing only sRNAseq samples
samples <- system.file("extdata/sRNAseq",package="mobileRNA")
# output location
output_location <- tempdir()
mapRNA(input = "sRNA",
input_files_dir = samples,
output_dir = output_location,
genomefile = output_assembly_file,
condaenv = "/Users/user-name/miniconda3/envs/ShortStack4",
mmap = "n")
If there are issues utilising this function, the manual steps are
illustrated in the Appendix
Import the results from the alignment step into R using the
RNAimport()
function.
# Directory containing results
results_dir <- file.path(output_location,"2_sRNA_results")
# Sample names and total number of reads, in the same order.
sample_names <- c("selfgraft_demo_1", "selfgraft_demo_2",
"selfgraft_demo_3", "heterograft_demo_1",
"heterograft_demo_2", "heterograft_demo_3")
sRNA_data_demo <- RNAimport(input = "sRNA",
directory = results_dir,
samples = sample_names)
This will generate a dataframe where rows represent sRNA clusters and contains the following columns:
Locus
: Genomic locationchr
: Name of the chromosome or scaffoldstart
: Start position of the clusterend
: End position of the clusterCluster
: Cluster NameFor each sample, the following columns will be present. Where the sample name follows after the underscore:
DicerCall_
: The size of most abundant small RNA
sizeCount_
: Number of uniquely aligned reads that overlap
the locus.RPM_
: Reads per MillionFPKM_
: Fragments Per Kilobase of transcript per
MillionMajorRNA_
: RNA sequence of the most abundant sRNA in
the clusterLoad the full analysis dataset for a more comprehensive analysis:
Select the putative mobile sRNA clusters using
RNAmobile()
. This requires supplying the function with a
unique identifier of the rootstock genome, which is the prefix “B” to
the tomato chromosome names. Here, the function retains sRNA clusters
which were aligned to the tomato genome.
# define control samples
controls <- c("selfgraft_1", "selfgraft_2", "selfgraft_3")
mobile_sRNA <- RNAmobile(input = "sRNA",
data = sRNA_data,
controls = controls,
genome.ID = "B",
task = "keep")
For a more advanced analysis, users can include further steps. Here we demonstrate a more advanced analysis for the Core RNA analysis and Mobile RNA analysis, both demonstrated with the sRNAseq dataset.
A useful step before analysis is to assess the overall similarity
between sample replicates to understand which samples are similar and
different. This is known as sample-level quality control and can help us
understand where the largest variation is introduced, whether the data
meets the expectations and if there are outliers.
Here, the RNAdistribution()
function can generate a
number of different customized plots to represent the number of sRNA
clusters within each dicercall sRNA class in a sample. In plants, sRNAs
are known to be produced with the length between 20-24 nucleotides, and
the lengths signify the sRNA class and specific functional role in
(epi)genetic regulation. If there was an inconsistent size profile of
sRNAs within the cluster, the dicercall is defined as “N”, ie,
unclassified.
Plotting the distribution of dicercall sRNA classes within each replicate can support expectation for samples. For instance, if a significant number of sRNA clusters are unclassified it might suggest the data contains degraded RNA fragments, or novel types of sRNA genes.
# plot each replicate as a line, separately, and facet
sample_distribution_line <- RNAdistribution(sRNA_data,
style = "line",
overlap = FALSE,
facet = TRUE,
colour = "darkgreen")
# plot each replicate as a bar, separately, and facet
sample_distribution_bar <- RNAdistribution(sRNA_data,
style = "bar",
facet = TRUE,
colour ="lightblue")
Let’s view the plots:
Principal Component Analysis (PCA) is a useful technique to
illustrate sample distance as it emphasizes the variation through the
reduction of dimensions in the data set. Here, we introduce the function
plotSamplePCA()
groups <- c("Heterograft", "Heterograft", "Heterograft",
"Selfgraft", "Selfgraft", "Selfgraft")
plotSamplePCA(data = sRNA_data, group = groups, size.ratio = 2)
Have a look at the sRNA_data
data frame, you will see
that for each sample the sRNA class for a given cluster has been
determined (see columns with names containing with “DicerCall_”) which,
in this data, will state a number from 20-24. This value represent the
length in nucleotides of the most abundant sRNA within the cluster. For
some clusters, there is no particular sRNA which is more abundant than
another, hence, it is stated as “NA” or “N”, which is referred to as
being unclassified.
The RNAdicercall()
function is used to calculate the
consensus dicercall for each sRNA cluster. This is based on the
classification predicted for the cluster by each sample within the
analysis. There are several parameters which will alter the output,
including the handling of ties and the method to draw the consensus
from.
When working along the mobile sRNA analysis workflow, the function
contains a specialised parameter which can be utilized. This is
chimeric=TRUE
, plus with the genome.ID
and
controls
parameters. In the example below B_
represents the prefix added to the mobile/foreign genotype, which is the
rootstock tomato genome. This helps optimise classification by removal
of potential mapping errors.
# define consensus, store as a data summary file.
sRNA_data_dicercall <- RNAdicercall(data = sRNA_data,
chimeric = TRUE,
genome.ID = "B_",
controls = c("selfgraft_1",
"selfgraft_2",
"selfgraft_3"))
For the downstream analysis, it can be useful to define distinct
groups of sRNA classes depending on your organism. For plant samples, it
is beneficial to select a group of 24-nt and another containing 21/22-nt
sRNAs. To subset the data, use the RNAsubset()
function to
choose which sRNA populations.
# Subset data for analysis: 24-nt sRNAs
sRNA_24 <- RNAsubset(sRNA_data_dicercall, class = 24)
# Subset data for analysis: 21/22-nt sRNAs
sRNA_2122 <- RNAsubset(sRNA_data_dicercall, class = c(21, 22))
DESeq2
or
edgeR
Differential analysis is undertaken to identify RNAs which are
statistically significant to discover quantitative changes in the
abundance levels between the treatment and the control groups. This
technique can be undertaken with either the DESeq2
(Love, Huber, and Anders 2014) or
edgeR
(McCarthy, Chen, and Smyth
2012) analytical method.
First, let’s re-order the data frame so we can compare control vs treatment (i.e. Selfgraft vs Heterograft). When the data is imported, it may not be in the correct order/levels for comparison.
#reorder df
controls <- c("selfgraft_1", "selfgraft_2", "selfgraft_3")
reorder_df <- RNAreorder(sRNA_data_dicercall, controls)
# sample conditions in order within dataframe
groups <- c("Selfgraft", "Selfgraft", "Selfgraft",
"Heterograft", "Heterograft", "Heterograft")
## Differential analysis of whole dataset: DESeq2 method
sRNA_DESeq2 <- RNAdifferentialAnalysis(data = reorder_df,
group = groups,
method = "DESeq2")
We can summarise the results using RNAsummary()
:
Comparison : Selfgraft Vs Heterograft out of 8138 with nonzero read count adjusted p-value < 0.1 144 sRNA clusters met the adjusted p-value cutoff LFC > 0 (higher) : 144 , 100 % LFC < 0 (lower) : 0 , 0 %
How about looking at the sRNA population which are statistically significant:
Comparison : Selfgraft Vs Heterograft out of 8138 with nonzero read count adjusted p-value < 0.05 141 sRNA clusters met the adjusted p-value cutoff LFC > 0 (higher) : 141 , 100 % LFC < 0 (lower) : 0 , 0 %
When comparing treatment to control conditions, it might be the case that the same sRNA clusters are found within both, yet, there could be difference in the total abundance of the shared clusters. For instance, for a given sRNA cluster the samples in the treatment condition might have a greater abundance than the samples in the control condition.
The statistical analysis calculated the log2FC values for each sRNA cluster by comparing the normalised counts between treatment and control. Here, a positive log2FC indicates an increased abundance of transcripts for a given sRNA cluster in the treatment compared to the control, while negative log2FC indicates decreased abundance of transcripts for a given sRNA cluster. The statistical significance of the log2FC is determined by the adjusted p-value.
Here we will filter the data to select sRNA clusters which are statistically significant, and then plot the results as a heatmap to compare the conditions.
NOTE: the data used here will not yield any results as the treatment and control samples contain the exact same population of eggplant sRNAs, the only difference in the treatment samples are the spiked tomato sRNA clusters.
# summary of statistical sRNA clusters
RNAsummary(sRNA_DESeq2, alpha=0.05)
#> Comparison : Selfgraft Vs Heterograft
#> out of 8138 with nonzero read count
#> adjusted p-value < 0.05
#> 141 sRNA clusters met the adjusted p-value cutoff
#> LFC > 0 (higher) : 141 , 100 %
#> LFC < 0 (lower) : 0 , 0 %
#>
# select significant
significant_sRNAs <- sRNA_DESeq2[sRNA_DESeq2$padjusted < 0.05, ]
Plot the results:
p1 <- plotHeatmap(significant_sRNAs, value = "RPM", row.names = FALSE,
title = "Heatmap of log-transformed FPKM")
The RNApopulation()
function can be utilised to identify
unique sRNA populations found in the treatment or control conditions.
This represents the sRNA which are gained or lost due to the treatment
conditions.
First let’s look at the sRNA clusters gained to the treatment
condition. In the chimeric heterografts, we expect that the foreign
sRNAs will also be selected in this pick-up, therefore, we can use the
parameter genome.ID
to remove sRNA cluster related to the
foreign genome.
# select sRNA clusters only found in treatment & not in the control samples
gained_sRNA <- RNApopulation(data = sRNA_DESeq2,
conditions = c("heterograft_1",
"heterograft_2" ,
"heterograft_3"),
chimeric = TRUE,
genome.ID = "B_",
controls = c("selfgraft_1",
"selfgraft_2",
"selfgraft_3"))
#> Comparison : Selfgraft Vs Heterograft
#> out of 8138 sRNA clusters in the dataset
#> unique sRNA clusters : 148 , 1.8 %
#> samples : heterograft_1, heterograft_2, heterograft_3
#> --- No statistical cutoff was implemented.
#>
#> LFC > 0 (higher) : 148 , 100 % LFC < 0 (lower) : 0 , 0 % NULL
# look at number of sRNA cluster only found in treatment
nrow(gained_sRNA)
#> [1] 148
Now, the sRNA clusters lost and only produced in the control condition:
# select sRNA clusters only found in control & not in the treatment samples
lost_sRNA <- RNApopulation(data = sRNA_DESeq2,
conditions = c("selfgraft_1",
"selfgraft_2" ,
"selfgraft_3"),
chimeric = TRUE,
genome.ID = "B_",
controls = c("selfgraft_1",
"selfgraft_2",
"selfgraft_3"))
#> Comparison :
#> out of 8138 sRNA clusters in the dataset
#> unique sRNA clusters : 0 , 0 %
#> samples : selfgraft_1, selfgraft_2, selfgraft_3
#> --- No statistical cutoff was implemented.
#>
#> LFC > 0 (higher) : 0 , NaN % LFC < 0 (lower) : 0 , NaN % NULL
# look at number of sRNA cluster only found in control
nrow(lost_sRNA)
#> [1] 0
Now we have identified unique populations produced or not produced in our treatment samples compared to our control samples, we can extract the RNA sequences to undertaken target prediction and then onward to gene ontology enrichment analysis.
Moreover, we can identify genomic features associated with these sRNA clusters which are unique to the treatment and absent in the control (i.e. gained).
gained_sRNA_attributes <- RNAattributes(data = gained_sRNA,
match ="genes",
annotation = output_annotation_file)
We identify candidate mobile RNAs by identifying those which are
mapped to the genotype representing the mobile RNAs. These can be
isolated using the RNAmobile()
function. Then the user can
undertake further steps to assist functional analysis.
In respect to the example data set, we are looking to identify sRNAs
traveling from the tomato rootstock to the eggplant scion in the
heterografts. Hence, this function will look to select clusters mapped
to the tomato genome and remove those mapped to the eggplant genome.
Previously, the example of the prefix B_
was added to the
chromosomes of the tomato genome while prefix A_
added to
the eggplant genome. To remove or keep specific clusters, we align this
request with the "task"
parameter.
# vector of control names
control_names <- c("selfgraft_1", "selfgraft_2", "selfgraft_3")
## Identify potential tomato mobile molecules
mobile_sRNA <- RNAmobile(input = "sRNA",
data = sRNA_DESeq2,
controls = control_names,
genome.ID = "B_",
task = "keep",
statistical = FALSE)
We can plot our results as a heatmap, which represents the normalised
reads-per-million (RPM) values which have been log transformed.
Here we will plot all potential mobile molecules and those which are statistically significant:
Now, we can extrapolate information to assist the prediction of their targets and role in the biological system. mobileRNA offer three different tools to assist the functional analysis.
IMPORTANT: Alterations to the genome assemblies by
the RNAmergeGenomes()
function must be replicated in the
annotations. A merged annotation with the same amendments can be created
with the function RNAmergeAnnotations()
.
Each sRNA cluster contains coordinates, these can be matched with coordinates in an annotation file. A match occurs when the cluster is found within the coordinates of a feature. If there is a match, the function returns the input dataframe with additional fields of information from the annotation file.
This enables users to identify the genomic features producing the mobile sRNAs. Here we will be overlapping the data with genes and adding a buffer region of 1 kb upstream and downstream of each gene.
annotation_file <- system.file("extdata",
"prefix_reduced_chr2_Tomato.gff.gz",
package="mobileRNA")
mobile_attributes <- RNAattributes(data = mobile_sRNA,
match ="genes",
annotation = annotation_file)
Very similar to before, we can find stricter overlaps between our candidate sRNA clusters and genomic features. However, this time we can calculate the number of sRNA clusters which are associated to each type of feature. These include promoter regions, exon, introns, untranslated regions and repeat regions. The results can either be displayed in the data frame as an absolute value or as a percentage of the total:
mobile_features <- RNAfeatures(data = mobile_sRNA,
annotation = annotation_file)
print(mobile_features)
#> Genome Dataset
#> promoters 50% 100%
#> exons 0% 0%
#> introns 50% 0%
#> 5'UTR 0% 0%
#> 3'UTR 0% 0%
#> repeats 0% 0%
#> others 0% 0%
# plot as stacked bar chart
features_plot <- plotRNAfeatures(data = sRNA_data,
annotation = annotation_file)
plot(features_plot)
To predict the targets of the mobile sRNA candidates, we need to
extract the RNA sequence. Here, we introduce the
RNAsequences()
function which extrapolates the most common
RNA sequence for a cluster and determines the complementary
sequences.
The output consists of a dataframe of 6 columns where rows represent each putative sRNA cluster. The columns include:
Cluster
: name of sRNA clusterMatch
: whether the RNA sequence is consistent across
replicates (either “No”, “Yes” or “Duplicate”; where “Duplicate”
indicates a tie)Sequence
: RNA sequence of the most abundant sRNA within
a cluster across samplesWidth
: length of nucleotide sequenceComplementary_RNA
: complementary RNA nucleotide
sequenceComplementary_DNA
: complementary DNA nucleotide
sequenceTo predict the potential targets in a recipient tissue, a target
prediction tool such as psRNATarget
(Xinbin Dai and Zhao 2018) can be used. For such
a tool, we can manipulate the output of RNAsequences()
to
match the input style for target prediction using the following
code:
# load `dplyr` package
library(dplyr)
# select the cluster and sequence columns
sequences <- mobile_sequences %>% select(Cluster, Sequence)
# add prefix, remove row with NA
prefix <- ">"
sequences$Cluster <- paste0(prefix, sequences$Cluster)
sequences <- na.omit(sequences)
# convert to required format:
res <- do.call(rbind, lapply(seq(nrow(sequences)),
function(i) t(sequences[i, ])))
Save output:
# save output
write.table(res, file ="./mobile_sRNA_sequences.txt",
row.names = FALSE, col.names = FALSE, quote = FALSE)
Other steps can be taken at this stage including alignment of the
mobile sRNA sequences to identify consensus sequences. For example, this
can be undertaken with the R package bioseq
:
If users wish to undertake the alignment step independently, the manual steps are shown below, of which, will need modifying to suit your files and directory names.
Here, ShortStack
(Axtell
2013) aligns the reads (includes multimapped reads) with Bowtie
(Langmead B 2009) and undertakes de novo
sRNA clustering analysis.
bash scripting
Step 2 collates all the sRNA loci information from each sample into a text file.
R scripting
# names of samples, represented by folder names produced in previous step
samples <- c("[sample names]")
# location of output folders from previous step
dir <- "[output_dir]"
# merge information of the detected de novo sRNA clusters
gff_alignment <- GenomicRanges::GRangesList()
for (i in samples) {
file_path <- file.path(dir, i, "Results.gff.gz3")
if (file.exists(file_path)) {
gff_alignment[[i]] <- rtracklayer::import.gff.gz(file_path)
} else{
Stop("File does not exist:", file_path, "\n")
}
}
gff_merged <- GenomicRanges::reduce(unlist(gff_alignment),
ignore.strand = TRUE)
gff_merged <- as.data.frame(gff_merged)
colnames(gff_merged)[1] <- "chr"
if('*' %in% gff_merged$strand){
gff_merged <- gff_merged[, -match("strand", colnames(gff_merged))]
}
locifile_txt <- data.frame(Locus = paste0(gff_merged$chr, ":",
gff_merged$start,"-",
gff_merged$end),
Cluster = paste0("cluster_",
seq_len(nrow(gff_merged))))
# save output to location
dir_locifile <- "[output directory]"
loci_out <- file.path(dir_locifile,"locifile.txt")
utils::write.table(locifile_txt, file = loci_out, quote = FALSE,
sep = "\t", row.names = FALSE, col.names = TRUE)
Finally, we repeat the sRNA clustering using the de novo sRNA loci
list to produce our results. The output folders are used as input for
the RNAimport()
function.
bash scripting
ShortStack \
--readfile "[fastqfile.fq]" \
--genomefile "[genomefile.fa.gz]" \
--locifile "[locifile]" \
--threads 6 \
--mmap u \
--nohp \
--pad 200 \
--mincov 0.5 \
--outdir "[output_dir_2]"
Here, Bowtie
(Langmead B
2009) aligns the reads (excluding multimappers) and
ShortStack
(Axtell 2013)
undertakes de novo sRNA clustering analysis.
bash scripting
#index reference genome
bowtie-build --threads 6 [genomefile.fa.gz] [genomefile]
# alignment
bowtie -p 6 -v 0 -a -m 1 --sam -x [genomefile] [fastqfile.fq] | samtools view -bS | samtools sort -o [outputfile.bam]
# cluster analysis with ShortStack - use the alignment files as input
ShortStack \
--bamfile "[outputfile.bam]" \
--genomefile "[genomefile.fa.gz]" \
--threads 6 \
--pad 200 \
--mincov 0.5 \
--nohp \
--outdir "[output_dir]"
Step 2 collates all the sRNA loci information from each sample into a text file.
R scripting
# names of samples, represented by folder names produced in previous step
samples <- c("[sample names]")
# location of output folders from previous step
dir <- "[output_dir]"
# merge information of the detected de novo sRNA clusters
gff_alignment <- GenomicRanges::GRangesList()
for (i in samples) {
file_path <- file.path(dir, i, "Results.gff.gz3")
if (file.exists(file_path)) {
gff_alignment[[i]] <- rtracklayer::import.gff.gz(file_path)
} else{
Stop("File does not exist:", file_path, "\n")
}
}
gff_merged <- GenomicRanges::reduce(unlist(gff_alignment),
ignore.strand = TRUE)
gff_merged <- as.data.frame(gff_merged)
colnames(gff_merged)[1] <- "chr"
if('*' %in% gff_merged$strand){
gff_merged <- gff_merged[, -match("strand", colnames(gff_merged))]
}
locifile_txt <- data.frame(Locus = paste0(gff_merged$chr, ":",
gff_merged$start,"-",
gff_merged$end),
Cluster = paste0("cluster_",
seq_len(nrow(gff_merged))))
# save output to location
dir_locifile <- "[output directory]"
loci_out <- file.path(dir_locifile,"locifile.txt")
utils::write.table(locifile_txt, file = loci_out, quote = FALSE,
sep = "\t", row.names = FALSE, col.names = TRUE)
Finally, we repeat the sRNA clustering using the de novo sRNA loci
list to produce our results. The output folders are used as input for
the RNAimport()
function.
bash scripting
ShortStack \
--bamfile "[outputfile.bam]" \
--genomefile "[genomefile.fa.gz]" \
--locifile "[locifile.txt]" \
--threads 6 \
--nohp \
--mincov 0.5 \
--pad 200 \
--outdir "[output_dir_2]"
Below states the manual steps for mRNAseq alignment for mobile mRNA identification. Keep in mind these commands need modifying to suit your files and directory names.
These steps utilise HISAT2 (Kim 2015) and HTSeq (Anders, Pyl, and Huber 2014) OS software.
If required, index the genome assembly: bash scripting
bash scripting
# alignment with hisat2
hist2 -p 6 -x [genomefile] -U [fastqfile_1.fq] -S [outputfile.sam]
# select reads only mapped to one location:
grep "NH:i:1" [outputfile.sam] > [outputfile_uniqueReads.sam]
# reheader the sam file:
samtools view -H [outputfile.sam] > [outputfile_header.txt]
cat [outputfile_header.txt] [outputfile_uniqueReads.sam] > [outputfile_uniqueReads_H.sam]
# convert to bam and sort
samtools view -Sb [outputfile_uniqueReads_H.sam] | samtools sort -o [outputfile_uniqueReads_H.bam]
# make folder to store output for sample:
mkdir [samplename]
# raw count with HTSeq
python -m HTSeq.scripts.count \
--format=bam \
--order=pos \
--stranded=no \
--mode=union \
--nonunique=none \
--type=mRNA \
--idattr=Name \
[outputfile_uniqueReads_H.bam] \
[annotationfile.gff] \
> [./samplename/Results.txt]
bash scripting
# alignment with hisat2
hist2 -p 6 -x [genomefile] \
-1 [fastqfile_1.fq] \
-2 [fastqfile_2.fq] \
-S [outputfile.sam]
# select reads only mapped to one location:
grep "NH:i:1" [outputfile.sam] > [outputfile_uniqueReads.sam]
# reheader the sam file:
samtools view -H [outputfile.sam] > [outputfile_header.txt]
cat [outputfile_header.txt] [outputfile_uniqueReads.sam] > [outputfile_uniqueReads_H.sam]
# convert to bam and sort
samtools view -Sb [outputfile_uniqueReads_H.sam] | samtools sort -o [outputfile_uniqueReads_H.bam]
# make folder to store output for sample:
mkdir [samplename]
# raw count with HTSeq
python -m HTSeq.scripts.count \
--format=bam \
--order=pos \
--stranded=no \
--mode=union \
--nonunique=none \
--type=mRNA \
--idattr=Name \
[outputfile_uniqueReads_H.bam] \
[annotationfile.gff] \
> [./samplename/Results.txt]
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] dplyr_1.1.4 mobileRNA_1.3.0 BiocStyle_2.35.0
#>
#> loaded via a namespace (and not attached):
#> [1] bitops_1.0-9 pbapply_1.7-2
#> [3] testthat_3.2.1.1 rlang_1.1.4
#> [5] magrittr_2.0.3 matrixStats_1.4.1
#> [7] compiler_4.4.2 png_0.1-8
#> [9] vctrs_0.6.5 stringr_1.5.1
#> [11] pkgconfig_2.0.3 bioseq_0.1.4
#> [13] crayon_1.5.3 fastmap_1.2.0
#> [15] XVector_0.47.0 labeling_0.4.3
#> [17] utf8_1.2.4 Rsamtools_2.23.1
#> [19] rmarkdown_2.29 sessioninfo_1.2.2
#> [21] UCSC.utils_1.3.0 purrr_1.0.2
#> [23] xfun_0.49 zlibbioc_1.52.0
#> [25] cachem_1.1.0 GenomeInfoDb_1.43.2
#> [27] jsonlite_1.8.9 progress_1.2.3
#> [29] DelayedArray_0.33.2 BiocParallel_1.41.0
#> [31] parallel_4.4.2 prettyunits_1.2.0
#> [33] R6_2.5.1 stringi_1.8.4
#> [35] bslib_0.8.0 RColorBrewer_1.1-3
#> [37] limma_3.63.2 reticulate_1.40.0
#> [39] rtracklayer_1.67.0 parallelly_1.39.0
#> [41] brio_1.1.5 GenomicRanges_1.59.1
#> [43] jquerylib_0.1.4 Rcpp_1.0.13-1
#> [45] SummarizedExperiment_1.37.0 knitr_1.49
#> [47] future.apply_1.11.3 snow_0.4-4
#> [49] audio_0.1-11 R.utils_2.12.3
#> [51] IRanges_2.41.1 Matrix_1.7-1
#> [53] tidyselect_1.2.1 abind_1.4-8
#> [55] yaml_2.3.10 codetools_0.2-20
#> [57] curl_6.0.1 listenv_0.9.1
#> [59] lattice_0.22-6 tibble_3.2.1
#> [61] withr_3.0.2 Biobase_2.67.0
#> [63] evaluate_1.0.1 future_1.34.0
#> [65] Biostrings_2.75.1 pillar_1.9.0
#> [67] BiocManager_1.30.25 MatrixGenerics_1.19.0
#> [69] stats4_4.4.2 generics_0.1.3
#> [71] RCurl_1.98-1.16 S4Vectors_0.45.2
#> [73] hms_1.1.3 ggplot2_3.5.1
#> [75] munsell_0.5.1 scales_1.3.0
#> [77] globals_0.16.3 glue_1.8.0
#> [79] RPushbullet_0.3.4 pheatmap_1.0.12
#> [81] maketools_1.3.1 tools_4.4.2
#> [83] BiocIO_1.17.1 sys_3.4.3
#> [85] data.table_1.16.2 beepr_2.0
#> [87] SimDesign_2.17.1 GenomicAlignments_1.43.0
#> [89] locfit_1.5-9.10 XML_3.99-0.17
#> [91] buildtools_1.0.0 grid_4.4.2
#> [93] tidyr_1.3.1 edgeR_4.5.0
#> [95] colorspace_2.1-1 GenomeInfoDbData_1.2.13
#> [97] restfulr_0.0.15 cli_3.6.3
#> [99] fansi_1.0.6 S4Arrays_1.7.1
#> [101] gtable_0.3.6 R.methodsS3_1.8.2
#> [103] DESeq2_1.47.1 sass_0.4.9
#> [105] digest_0.6.37 progressr_0.15.1
#> [107] BiocGenerics_0.53.3 SparseArray_1.7.2
#> [109] ggrepel_0.9.6 farver_2.1.2
#> [111] rjson_0.2.23 htmltools_0.5.8.1
#> [113] R.oo_1.27.0 lifecycle_1.0.4
#> [115] httr_1.4.7 statmod_1.5.0