mobileRNA: Investigate the RNA mobilome & population-scale changes.

Introduction

In plants, systemic signalling is an elaborated molecular system which coordinates plant development, integrating and transmitting the information perceived from the environment to distant organs. An important role in long-distance signalling is played by small RNA molecules (sRNAs). The nucleotide length of a sRNA helps researchers identify the class of sRNA and predict its functionality. Micro-RNAs (miRNAs) are involved in directing translational repression and/or the cleavage of messenger RNAs (mRNAs). Whereas small interfering RNAs (siRNAs) are involved in the maintenance and de novo DNA methylation and account for the majority of sRNAs in plants. These endogenous sRNAs can be produced in a tissue and then transported systemically across the vascular system into recipient organs, where they can induce a molecular response and coordinate physiological changes. Similarly, messenger RNAs (mRNAs) can move across distances, and it’s thought they may translate into proteins which act as transcription factors in the recipient tissues.

Plant grafting can be utilised to create chimeric plant systems composed of two genotypes, such as different species like tomato and eggplant, or plant varieties or accessions. Grafting has been used as a method to study RNA mobilomes and their impact on the phenotype. Yet, it is clear that there is no standardised genomic approach for the analysis of sequencing data to identify an RNA mobilome. Here we introduce the R package, mobileRNA, a recommended pipeline and analysis workflow for the identification of a sRNAs/mRNA mobilome. In addition, the flexibility supports standard RNA analysis between treatment and control conditions. For example, to identify sRNA population changes due to the application of a treatment such as cold/heat stress or exposure to a pest. mobileRNA ultimately assists in pre-processing and analysis including the characterization of different populations, visualization of the results, and supporting output for functional analysis.

As stated, this was developed for applications for plant grafting experimental analysis, however, we believe it could have further applications including the analysis of dual-host systems.


Approach

In grafted plants, when different genotypes are used as rootstock and scions, the sequence variation between the two genomes involved can be used to discriminate the origin of a sequenced RNA molecule. Therefore, if an RNA molecule sequenced from one of the grafted partners (scion or rootstock) has been found to match the genome of the other grafting partner, this could empirically demonstrate its movement across the graft junction.

Most available genomics approaches to implement this analysis are based on RNA sequencing, followed by alignment on a genotype of reference and post-alignment screening of genetic variants to identify molecules which have a better match for the genotype of the grafted partner. These methods have many limitations, which might include:

  • High dependency on arbitrary thresholds to determined mapped and unmapped RNAseq reads.
  • Use of additional statistical tests to discriminate genetic variations from sequencing noise in the post-alignment analysis.
  • High false positive rates.

Here, to circumvent such problems we propose a method inspired by the RNAseq analysis of plant hybrids (Lopez-Gomollon 2022), including an alignment step performed simultaneously on both genomes involved. The rationale of this approach considers that alignment tools already implement an algorithm ideal for the identification of the best matches (according to set parameters) in a given genome reference, but they do not account for potential matches to DNA sequences which are not provided as reference. Therefore, the two genomes from all partners involved in the system are merged in a single FASTA file and used as a reference for the unique alignment. Ultimately, in a bid to supply the algorithm with as much information as possible to make the best possible predictions and placement of sequencing reads to each genome.

The summarised workflow is shown below (Figure 1) where it contains a core RNA analysis and a mobile sRNA/mRNA analysis. The core analysis represents the standard workflow for the identification of RNA populations which have been gained, lost or changed in abundance, for example, the sRNA population difference between treatment and condition samples, or similarly in a chimeric system, such as plant grafting, we might want to explore the native sRNA population from the sample tissue origin (i.e. leaf) which have been lost or gained or changes in sRNA abundance. While the mobile analysis represents the workflow for the identification of putative mobile sRNAs or mRNA in a plant graft system.

As input, the pipeline requires cleaned sRNA or mRNA sequencing reads in FASTQ format, along with the genome assemblies which represent the genotypes in the system. The diagram below illustrates the complete workflow using mobileRNA, including essential, optional, and plotting functions.



Comprehensive diagram of complete mobileRNA workflows.

Comprehensive diagram of complete mobileRNA workflows.


The analysis approach includes several underlying features to be aware of which can alter the final output. When working with a chimeric system, the core steps offer the ability to remove mapping errors by comparing control samples to treatment samples. If the genotypes in the chimeric system are fairly distantly related, it is unexpected that unique reads aligned to the mobile genotype will be found in the control samples. With that in mind, we can assume RNAs with reads mapped to the mobile genotype from the control samples could be artifacts or mapping errors. Hence, these RNAs are removed from the analysis when the parameters are utilized. Note that, if the chimeric system is expected to share some or a high level of similarity it might be insightful to the analysis to not remove these sRNA clusters.

At the end of the analysis, the user will have generated a dataframe where rows represent either sRNA clusters or mRNAs and the columns include information on the sRNA clusters or mRNAs, individual sample replicates, and more:

sRNA Analysis

Information on the sRNA cluster:
  • Locus: Genomic location
  • chr: Chromosome or scaffold name
  • start : Start position of the cluster
  • end : End position of the cluster
  • Cluster: Cluster Name
  • DicerConsensus : Consensus dicercall (Calculated by RNAdicercall())
  • DicerCounts : Number of replicates which contributed to the consensus dicercall (Calculated by RNAdicercall())
  • CountMean : Count mean (Calculated by RNAdifferentialAnalysis())
  • log2FoldChange : Log fold change (Calculated by RNAdifferentialAnalysis())
  • pvalue : p-value (Calculated by RNAdifferentialAnalysis())
  • padjusted : Adjusted p-value (Calculated by RNAdifferentialAnalysis())
  • logCPM : Log counts per million (CPM/RPM) (Calculated by RNAdifferentialAnalysis())
Information on each sample replicate:
  • DicerCall_ : The nucleotide length of most abundant sRNA
  • Count_ : Number of aligned sRNA reads. As default, these are uniquely aligned (e.g. not multi-mapping).
  • RPM_ : Reads per Million
  • FPKM_ : Fragments Per Kilobase of transcript per Million
  • MajorRNA_ : RNA sequence of most abundant sRNA in the cluster

For mRNA Analysis

Information on the mRNA:
  • Feature: mRNA name
  • SampleCounts : Consensus dicercall (Calculated by RNAdicercall())
  • CountMean : Count mean (Calculated by RNAdifferentialAnalysis())
  • log2FoldChange : Log fold change (Calculated by RNAdifferentialAnalysis())
  • pvalue : p-value (Calculated by RNAdifferentialAnalysis())
  • padjusted : Adjusted p-value (Calculated by RNAdifferentialAnalysis())
  • logCPM : Log counts per million (CPM/RPM) (Calculated by RNAdifferentialAnalysis())
Information on each sample replicate:
  • Count_ : Number of aligned mRNA reads. As default, these are uniquely aligned (e.g. not multi-mapping).
  • FPKM_ : Fragments Per Kilobase of transcript per Million

In addition, the user can utilise additional functions to plot a PCA, distance matrix, sRNA class distribution and heatmaps. Plus, mobileRNA offers functions to assist functional analysis to support the determination of the biological implications. For instance, within sRNA analysis, the user can extract the consensus RNA sequence for each sRNA cluster for target prediction analysis, as well as identifying genomic features associates with each sRNA clusters, such as genes or repeat regions. This can be utilised, for example, to explore the RNA expression of sRNA-producing genes in parallel analysis.

Installation

The latest version of mobileRNA can be installed via Bioconductor:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("mobileRNA")

It is also available on GitHub:

if (!require("devtools")) install.packages("devtools")
devtools::install_github("KJeynesCupper/mobileRNA", ref = "main")

Load into R library:

library("mobileRNA")

Installing OS Dependencies

mobileRNA works on systems with R, and depending on the type of sequencing data different OS dependencies installed via Conda (Anaconda-Inc 2020) are required for the alignment step.

For sRNA data, ShortStack (>= 4.0) (Axtell 2013) and the dependencies are required. Please consider that ShortStack is not available for Windows, hence, Windows users will either need to opt to use a virtual machine or Windows Subsystem for Linux In either case, both R and ShortStack will need to be installed and used on the Linux side. Please head to ShortStack to see the recommended installation instructions with Conda (Anaconda-Inc 2020). This will ensure all dependencies are available within the same environment.

For mRNA data, HISAT2 (Kim 2015), HTSeq (Anders, Pyl, and Huber 2014), SAMtools (Danecek P 2021) are required within the same Conda environment (Anaconda-Inc 2020).

Example Data

The package includes a simulated data set to replicate the grafting between eggplant-tomato (Solanum melongena - Solanum lycopersicon) where eggplant represents the scion and tomato represents the rootstock. The FASTQ files represent data extracted from the eggplant leaf tissue. Here we will locate mobile RNAs produced by tomato roots which have travelled into the eggplant leaves. In the example data set, there are heterograft replicate, where each is an individual tomato replicate spiked with the same random set of tomato sRNA clusters. There are two sets of each for mRNA analysis and three sets for sRNA analysis. These are known as:

  • heterograft_1
  • heterograft_2
  • heterograft_3 (sRNA analysis only)

There are self-graft replicates, where each is an individual tomato replicate without the spiked tomato sRNA clusters. These are:

  • selfgraft_1
  • selfgraft_2
  • selfgraft_3 (sRNA analysis only)

The replicates mirror each other where, for instance, heterograft_1 and selfgraft_1 are the same replicate, either with or without the spiked clusters.

Hence, we also provide more comprehensive data sets, for small RNA it is called sRNA_data and for messenger RNA it is called mRNA_data. Each stores an example dataframe produced by the importation step with RNAimport(). This can be loaded in the R environment by using the following command:

# Load small RNA data
data("sRNA_data")

# Load messenger RNA data
data("mRNA_data")

For sRNAseq & mRNAseq alignment step, we provide demo FASTQ files; these have a reduced number of reads and do not cluster as expected, so should not be used for downstream analysis. These are stored within inst/extdata. Tomato and Eggplant reduced genome assemblies and annotations have been provided based on those generated by Hosmani et al., 2019 (Hosmani 2019) and Barchi et al., 2021 (Barchi 2021).

The simulated data was generated from the data published Consortium (2012).

Example Workflow: Locating the sRNA Mobilome

This is a quick-start example analysis of sRNAseq data to locate putative mobile root-to-shoot sRNAs from a plant grafting experiment between eggplant and tomato.

Quick Start

Merging Genome Assemblies

Merge the FASTA genome assemblies of tomato and eggplant into a single reference file stored in your desired directory. Eggplant represents the scion as genome A, and tomato represents the rootstock (foreign/mobile) as genome B.

fasta_1 <- system.file("extdata","reduced_chr12_Eggplant.fa.gz", 
                       package="mobileRNA")

fasta_2 <-system.file("extdata","reduced_chr2_Tomato.fa.gz",
                      package="mobileRNA")
# define temporary output directory - replace with your directory
output_assembly_file <- file.path(tempfile("merged_assembly", 
                                           fileext = ".fa"))

# merge
merged_reference <- RNAmergeGenomes(genomeA = fasta_1,
                                    genomeB = fasta_2,
                                    output_file = output_assembly_file)


Merging Genome Annotations

Now, repeat for the genome annotations (GFF) - this is required for downstream analysis or for mRNA mapping.

It is important that the same alterations are made to each genome in the merged files. Otherwise, the annotation file will not align with the chromosome names in the assembly file.

anno1 <- system.file("extdata","reduced_chr12_Eggplant.gff.gz",
                     package="mobileRNA")

anno2 <- system.file("extdata","reduced_chr2_Tomato.gff.gz", 
                     package="mobileRNA")

# define temporary output directory 
output_annotation_file <- file.path(tempfile("merged_annotation", 
                                             fileext = ".gff3"))

# merge annotation files into a single file
merged_annotation <- RNAmergeAnnotations(annotationA = anno1,
                                         annotationB = anno2,
                                         output_file = output_annotation_file)


Alignment

Here we will align our samples containing small RNA sequencing reads to the merged genome assembly. The mapRNA() function invokes a system command to ShortStack; an application which employs Bowtie (Langmead B 2009) mapping and performs comprehensive de novo annotation and quantification of sRNA genes. The function first undertakes de novo sRNA cluster detection (stored in 1_de_novo_detection) and then quantification of sRNA genes (stored in2_sRNA_results). Since we are using chimeric samples, we use mmap = n to exclude multi-mapped reads. We exclude multi-mappers as we currently do not have a method to distinguish whether a read is mapped to multiple locations within one or both of the genomes in the merged reference.

PLEASE NOTE: For the alignment & import demo, we are using fastq files which have a reduced size for the package. This data will not be used for the continued analysis step, instead, you will need to load the full dataset.

# directory containing only sRNAseq samples 
samples <- system.file("extdata/sRNAseq",package="mobileRNA")

# output location
output_location <- tempdir()

mapRNA(input = "sRNA",
       input_files_dir = samples, 
       output_dir = output_location, 
       genomefile = output_assembly_file,
       condaenv = "/Users/user-name/miniconda3/envs/ShortStack4",
       mmap = "n")

If there are issues utilising this function, the manual steps are illustrated in the Appendix

Import Pre-Processed Data into R

Import the results from the alignment step into R using the RNAimport() function.

# Directory containing results
results_dir <-  file.path(output_location,"2_sRNA_results")

# Sample names and total number of reads, in the same order. 
sample_names <- c("selfgraft_demo_1", "selfgraft_demo_2", 
                  "selfgraft_demo_3", "heterograft_demo_1",
                  "heterograft_demo_2", "heterograft_demo_3")


sRNA_data_demo <- RNAimport(input = "sRNA", 
                            directory = results_dir,
                            samples = sample_names)


This will generate a dataframe where rows represent sRNA clusters and contains the following columns:

  • Locus: Genomic location
  • chr: Name of the chromosome or scaffold
  • start : Start position of the cluster
  • end : End position of the cluster
  • Cluster: Cluster Name

For each sample, the following columns will be present. Where the sample name follows after the underscore:

  • DicerCall_ : The size of most abundant small RNA size
  • Count_ : Number of uniquely aligned reads that overlap the locus.
  • RPM_ : Reads per Million
  • FPKM_ : Fragments Per Kilobase of transcript per Million
  • MajorRNA_ : RNA sequence of the most abundant sRNA in the cluster

Import Example Data

Load the full analysis dataset for a more comprehensive analysis:

data("sRNA_data")

Identify Potential Mobile sRNA clusters

Select the putative mobile sRNA clusters using RNAmobile(). This requires supplying the function with a unique identifier of the rootstock genome, which is the prefix “B” to the tomato chromosome names. Here, the function retains sRNA clusters which were aligned to the tomato genome.

# define control samples
controls <- c("selfgraft_1", "selfgraft_2", "selfgraft_3")

mobile_sRNA <- RNAmobile(input = "sRNA",
                         data = sRNA_data, 
                         controls = controls,
                         genome.ID = "B",
                         task = "keep")
# save output as txt file 
write.table(mobile_sRNA, "./sRNA_mobile_output.txt")


Advancing the Analysis

For a more advanced analysis, users can include further steps. Here we demonstrate a more advanced analysis for the Core RNA analysis and Mobile RNA analysis, both demonstrated with the sRNAseq dataset.

Core sRNA Analysis

Quality control

A useful step before analysis is to assess the overall similarity between sample replicates to understand which samples are similar and different. This is known as sample-level quality control and can help us understand where the largest variation is introduced, whether the data meets the expectations and if there are outliers.

Plot the distribution of sRNA classes within each sample

Here, the RNAdistribution() function can generate a number of different customized plots to represent the number of sRNA clusters within each dicercall sRNA class in a sample. In plants, sRNAs are known to be produced with the length between 20-24 nucleotides, and the lengths signify the sRNA class and specific functional role in (epi)genetic regulation. If there was an inconsistent size profile of sRNAs within the cluster, the dicercall is defined as “N”, ie, unclassified.

Plotting the distribution of dicercall sRNA classes within each replicate can support expectation for samples. For instance, if a significant number of sRNA clusters are unclassified it might suggest the data contains degraded RNA fragments, or novel types of sRNA genes.

# plot each replicate as a line, separately, and facet 
sample_distribution_line <- RNAdistribution(sRNA_data,
                                            style = "line",
                                            overlap =  FALSE, 
                                            facet = TRUE,
                                            colour = "darkgreen")

                                            
 # plot each replicate as a bar, separately, and facet 
sample_distribution_bar <- RNAdistribution(sRNA_data,
                                           style = "bar",
                                           facet = TRUE,
                                           colour ="lightblue")

Let’s view the plots:

# View plot (only)
sample_distribution_line$plot
An example facet line graph to show the distribution of sRNA classes within each sample.

An example facet line graph to show the distribution of sRNA classes within each sample.


# View plot (only)
sample_distribution_bar$plot
An example facet bar graph to show the distribution of sRNA classes within each sample.

An example facet bar graph to show the distribution of sRNA classes within each sample.


PCA

Principal Component Analysis (PCA) is a useful technique to illustrate sample distance as it emphasizes the variation through the reduction of dimensions in the data set. Here, we introduce the function plotSamplePCA()

groups <- c("Heterograft", "Heterograft", "Heterograft",
            "Selfgraft", "Selfgraft", "Selfgraft")

 plotSamplePCA(data = sRNA_data, group = groups, size.ratio = 2)
An example of a PCA, illustrating  the sRNA data set sample similarity

An example of a PCA, illustrating the sRNA data set sample similarity


Distance matrix heatmap

Similarly to a PCA plot, the plotSampleDistance() function undertakes hierarchical clustering with an unbiased log transformation to calculate sample distance illustrated with a distance matrix heatmap.

plotSampleDistance(sRNA_data)
An example of a heatmap, illustrating the sRNA data set sample similarity

An example of a heatmap, illustrating the sRNA data set sample similarity


Define the consensus dicercall

Have a look at the sRNA_data data frame, you will see that for each sample the sRNA class for a given cluster has been determined (see columns with names containing with “DicerCall_”) which, in this data, will state a number from 20-24. This value represent the length in nucleotides of the most abundant sRNA within the cluster. For some clusters, there is no particular sRNA which is more abundant than another, hence, it is stated as “NA” or “N”, which is referred to as being unclassified.

The RNAdicercall() function is used to calculate the consensus dicercall for each sRNA cluster. This is based on the classification predicted for the cluster by each sample within the analysis. There are several parameters which will alter the output, including the handling of ties and the method to draw the consensus from.

When working along the mobile sRNA analysis workflow, the function contains a specialised parameter which can be utilized. This is chimeric=TRUE, plus with the genome.ID and controls parameters. In the example below B_ represents the prefix added to the mobile/foreign genotype, which is the rootstock tomato genome. This helps optimise classification by removal of potential mapping errors.

# define consensus, store as a data summary file.
sRNA_data_dicercall <- RNAdicercall(data = sRNA_data, 
                                     chimeric = TRUE, 
                                     genome.ID = "B_", 
                                     controls = c("selfgraft_1", 
                                                  "selfgraft_2", 
                                                  "selfgraft_3"))

For the downstream analysis, it can be useful to define distinct groups of sRNA classes depending on your organism. For plant samples, it is beneficial to select a group of 24-nt and another containing 21/22-nt sRNAs. To subset the data, use the RNAsubset() function to choose which sRNA populations.

# Subset data for analysis: 24-nt sRNAs
sRNA_24 <- RNAsubset(sRNA_data_dicercall, class = 24)

# Subset data for analysis: 21/22-nt sRNAs
sRNA_2122 <- RNAsubset(sRNA_data_dicercall, class = c(21, 22))


Plot the consensus dicercall

The RNAdistriution() function can be used to visualize the distribution of the consensus dicercall classes across the total data set.

consensus_plot <- RNAdistribution(data = sRNA_data_dicercall,
                                  style = "bar",
                                  data.type = "consensus")

Now, view the plot:

# view 
consensus_plot$plot
An example of the distribution of small RNA consensus dicer classifications.

An example of the distribution of small RNA consensus dicer classifications.


Differential analysis with DESeq2 or edgeR

Differential analysis is undertaken to identify RNAs which are statistically significant to discover quantitative changes in the abundance levels between the treatment and the control groups. This technique can be undertaken with either the DESeq2 (Love, Huber, and Anders 2014) or edgeR (McCarthy, Chen, and Smyth 2012) analytical method.

First, let’s re-order the data frame so we can compare control vs treatment (i.e. Selfgraft vs Heterograft). When the data is imported, it may not be in the correct order/levels for comparison.

#reorder df
controls <- c("selfgraft_1", "selfgraft_2", "selfgraft_3")
reorder_df <- RNAreorder(sRNA_data_dicercall, controls)

# sample conditions in order within dataframe
groups <- c("Selfgraft", "Selfgraft", "Selfgraft", 
            "Heterograft", "Heterograft", "Heterograft")

## Differential analysis of whole dataset: DESeq2 method 
sRNA_DESeq2 <- RNAdifferentialAnalysis(data = reorder_df,
                              group = groups,
                              method = "DESeq2")

We can summarise the results using RNAsummary():

RNAsummary(sRNA_DESeq2)

Comparison : Selfgraft Vs Heterograft out of 8138 with nonzero read count adjusted p-value < 0.1 144 sRNA clusters met the adjusted p-value cutoff LFC > 0 (higher) : 144 , 100 % LFC < 0 (lower) : 0 , 0 %

How about looking at the sRNA population which are statistically significant:

RNAsummary(sRNA_DESeq2, alpha=0.05)

Comparison : Selfgraft Vs Heterograft out of 8138 with nonzero read count adjusted p-value < 0.05 141 sRNA clusters met the adjusted p-value cutoff LFC > 0 (higher) : 141 , 100 % LFC < 0 (lower) : 0 , 0 %

Save output
write.table(sRNA_DESeq2, "./sRNA_core_dataset.txt")

Differences in RNA abundance

When comparing treatment to control conditions, it might be the case that the same sRNA clusters are found within both, yet, there could be difference in the total abundance of the shared clusters. For instance, for a given sRNA cluster the samples in the treatment condition might have a greater abundance than the samples in the control condition.

The statistical analysis calculated the log2FC values for each sRNA cluster by comparing the normalised counts between treatment and control. Here, a positive log2FC indicates an increased abundance of transcripts for a given sRNA cluster in the treatment compared to the control, while negative log2FC indicates decreased abundance of transcripts for a given sRNA cluster. The statistical significance of the log2FC is determined by the adjusted p-value.

Here we will filter the data to select sRNA clusters which are statistically significant, and then plot the results as a heatmap to compare the conditions.

NOTE: the data used here will not yield any results as the treatment and control samples contain the exact same population of eggplant sRNAs, the only difference in the treatment samples are the spiked tomato sRNA clusters.

# summary of statistical sRNA clusters
RNAsummary(sRNA_DESeq2, alpha=0.05)
#> Comparison : Selfgraft Vs Heterograft 
#>  out of 8138 with nonzero read count 
#>  adjusted p-value < 0.05 
#>  141 sRNA clusters met the adjusted p-value cutoff 
#>  LFC > 0 (higher) : 141 , 100 % 
#>  LFC < 0 (lower)  : 0 , 0 % 
#> 

# select significant 
significant_sRNAs <-  sRNA_DESeq2[sRNA_DESeq2$padjusted < 0.05, ]

Plot the results:

p1 <- plotHeatmap(significant_sRNAs, value = "RPM", row.names = FALSE,
                  title = "Heatmap of log-transformed FPKM")
p1$plot
Heatmap of statistically significant sRNA clusters

Heatmap of statistically significant sRNA clusters

Identify gain & loss of RNA populations

The RNApopulation() function can be utilised to identify unique sRNA populations found in the treatment or control conditions. This represents the sRNA which are gained or lost due to the treatment conditions.

First let’s look at the sRNA clusters gained to the treatment condition. In the chimeric heterografts, we expect that the foreign sRNAs will also be selected in this pick-up, therefore, we can use the parameter genome.ID to remove sRNA cluster related to the foreign genome.

# select sRNA clusters only found in treatment & not in the control samples
gained_sRNA <- RNApopulation(data = sRNA_DESeq2, 
                                  conditions = c("heterograft_1", 
                                                 "heterograft_2" , 
                                                 "heterograft_3"),
                                  chimeric = TRUE, 
                                  genome.ID = "B_", 
                                  controls = c("selfgraft_1",
                                               "selfgraft_2", 
                                               "selfgraft_3"))
#> Comparison : Selfgraft Vs Heterograft 
#>  out of 8138 sRNA clusters in the dataset 
#>  unique sRNA clusters : 148 , 1.8 % 
#>  samples : heterograft_1, heterograft_2, heterograft_3 
#>  --- No statistical cutoff was implemented. 
#>  
#>  LFC > 0 (higher) : 148 , 100 % LFC < 0 (lower)  : 0 , 0 %  NULL

# look at number of sRNA cluster only found in treatment 
nrow(gained_sRNA)
#> [1] 148

Now, the sRNA clusters lost and only produced in the control condition:

# select sRNA clusters only found in control & not in the treatment samples
lost_sRNA <- RNApopulation(data = sRNA_DESeq2, 
                                conditions = c("selfgraft_1", 
                                               "selfgraft_2" , 
                                               "selfgraft_3"), 
                                  chimeric = TRUE, 
                                  genome.ID = "B_", 
                                  controls = c("selfgraft_1",
                                               "selfgraft_2", 
                                               "selfgraft_3"))
#> Comparison :  
#>  out of 8138 sRNA clusters in the dataset 
#>  unique sRNA clusters : 0 , 0 % 
#>  samples : selfgraft_1, selfgraft_2, selfgraft_3 
#>  --- No statistical cutoff was implemented. 
#>  
#>  LFC > 0 (higher) : 0 , NaN % LFC < 0 (lower)  : 0 , NaN %  NULL
# look at number of sRNA cluster only found in control  
nrow(lost_sRNA)
#> [1] 0

Functional analysis of gained sRNA populations

Now we have identified unique populations produced or not produced in our treatment samples compared to our control samples, we can extract the RNA sequences to undertaken target prediction and then onward to gene ontology enrichment analysis.

gained_sRNA_sequences <- RNAsequences(gained_sRNA, method = "consensus" )

Moreover, we can identify genomic features associated with these sRNA clusters which are unique to the treatment and absent in the control (i.e. gained).

gained_sRNA_attributes <- RNAattributes(data = gained_sRNA, 
                                        match ="genes",
                            annotation = output_annotation_file)


Mobile sRNA Analysis

We identify candidate mobile RNAs by identifying those which are mapped to the genotype representing the mobile RNAs. These can be isolated using the RNAmobile() function. Then the user can undertake further steps to assist functional analysis.

In respect to the example data set, we are looking to identify sRNAs traveling from the tomato rootstock to the eggplant scion in the heterografts. Hence, this function will look to select clusters mapped to the tomato genome and remove those mapped to the eggplant genome. Previously, the example of the prefix B_ was added to the chromosomes of the tomato genome while prefix A_ added to the eggplant genome. To remove or keep specific clusters, we align this request with the "task" parameter.

# vector of control names
control_names <- c("selfgraft_1", "selfgraft_2", "selfgraft_3")


## Identify potential tomato mobile molecules
mobile_sRNA <- RNAmobile(input = "sRNA",
                        data = sRNA_DESeq2, 
                         controls = control_names,
                         genome.ID = "B_",
                         task = "keep", 
                         statistical = FALSE)


Heatmap plots to represent mobile molecules

We can plot our results as a heatmap, which represents the normalised reads-per-million (RPM) values which have been log transformed.

Here we will plot all potential mobile molecules and those which are statistically significant:

p2 <- plotHeatmap(mobile_sRNA, row.names = FALSE)

p2$plot
An example heatmap of candidate mobile small RNAs. Where the columns represent the sample replicates and the rows represent the small RNA cluster.

An example heatmap of candidate mobile small RNAs. Where the columns represent the sample replicates and the rows represent the small RNA cluster.


Save output

write.table(mobile_sRNA, "./candidate_mobile_sRNAs.txt")


Functional analysis of putative mobile sRNA clusters

Now, we can extrapolate information to assist the prediction of their targets and role in the biological system. mobileRNA offer three different tools to assist the functional analysis.

IMPORTANT: Alterations to the genome assemblies by the RNAmergeGenomes() function must be replicated in the annotations. A merged annotation with the same amendments can be created with the function RNAmergeAnnotations().

Add genomic attributes to sRNA clusters

Each sRNA cluster contains coordinates, these can be matched with coordinates in an annotation file. A match occurs when the cluster is found within the coordinates of a feature. If there is a match, the function returns the input dataframe with additional fields of information from the annotation file.

This enables users to identify the genomic features producing the mobile sRNAs. Here we will be overlapping the data with genes and adding a buffer region of 1 kb upstream and downstream of each gene.

annotation_file <- system.file("extdata", 
                                    "prefix_reduced_chr2_Tomato.gff.gz",
                                    package="mobileRNA")

mobile_attributes <- RNAattributes(data = mobile_sRNA, 
                                      match ="genes",
                                      annotation = annotation_file)


Summarise sRNA cluster overlaps with genomic features

Very similar to before, we can find stricter overlaps between our candidate sRNA clusters and genomic features. However, this time we can calculate the number of sRNA clusters which are associated to each type of feature. These include promoter regions, exon, introns, untranslated regions and repeat regions. The results can either be displayed in the data frame as an absolute value or as a percentage of the total:

mobile_features <- RNAfeatures(data = mobile_sRNA, 
                               annotation = annotation_file)
print(mobile_features)
#>           Genome Dataset
#> promoters    50%    100%
#> exons         0%      0%
#> introns      50%      0%
#> 5'UTR         0%      0%
#> 3'UTR         0%      0%
#> repeats       0%      0%
#> others        0%      0%

# plot as stacked bar chart
features_plot <- plotRNAfeatures(data = sRNA_data,  
                                 annotation = annotation_file)

plot(features_plot)


Retrieve RNA sequence from mobile sRNA clusters

To predict the targets of the mobile sRNA candidates, we need to extract the RNA sequence. Here, we introduce the RNAsequences() function which extrapolates the most common RNA sequence for a cluster and determines the complementary sequences.

mobile_sequences <- RNAsequences(mobile_sRNA, method = "consensus") 


The output consists of a dataframe of 6 columns where rows represent each putative sRNA cluster. The columns include:

  • Cluster: name of sRNA cluster
  • Match: whether the RNA sequence is consistent across replicates (either “No”, “Yes” or “Duplicate”; where “Duplicate” indicates a tie)
  • Sequence: RNA sequence of the most abundant sRNA within a cluster across samples
  • Width: length of nucleotide sequence
  • Complementary_RNA: complementary RNA nucleotide sequence
  • Complementary_DNA: complementary DNA nucleotide sequence
write.table(mobile_sequences, "./candidate_mobile_sRNA_sequences.txt")

To predict the potential targets in a recipient tissue, a target prediction tool such as psRNATarget (Xinbin Dai and Zhao 2018) can be used. For such a tool, we can manipulate the output of RNAsequences() to match the input style for target prediction using the following code:

# load `dplyr` package
library(dplyr)

# select the cluster and sequence columns 
sequences <- mobile_sequences %>% select(Cluster, Sequence)

# add prefix, remove row with NA
prefix <- ">"
sequences$Cluster <- paste0(prefix, sequences$Cluster)
sequences <- na.omit(sequences)

# convert to required format:
res <- do.call(rbind, lapply(seq(nrow(sequences)), 
                             function(i) t(sequences[i, ])))

Save output:

# save output 
write.table(res, file ="./mobile_sRNA_sequences.txt",
            row.names = FALSE, col.names = FALSE, quote = FALSE)


Other steps can be taken at this stage including alignment of the mobile sRNA sequences to identify consensus sequences. For example, this can be undertaken with the R package bioseq:

# load `bioseq` package
library(bioseq)

# build a RNA vector from a character vector: 
mobileRNA_seq <- bioseq::rna(mobile_sequences$Sequence)

# Find a consensus sequence for a set of sequences:
bioseq::seq_consensus(mobileRNA_seq)

Appendix

Manual sRNA Alignment Pipeline

If users wish to undertake the alignment step independently, the manual steps are shown below, of which, will need modifying to suit your files and directory names.

Core sRNA Analysis

Step 1 - De novo sRNA analysis

Here, ShortStack (Axtell 2013) aligns the reads (includes multimapped reads) with Bowtie (Langmead B 2009) and undertakes de novo sRNA clustering analysis.

bash scripting

ShortStack \
--readfile "[fastqfile.fq]" \
--genomefile "[genomefile.fa.gz]" \
--threads 6 \
--mmap u \
--pad 200 \
--mincov 0.5 \
--nohp \
--outdir "[output_dir]"

Step 2 - Build sRNA cluster list

Step 2 collates all the sRNA loci information from each sample into a text file.

R scripting

# names of samples, represented by folder names produced in previous step
samples <- c("[sample names]")

# location of output folders from previous step
dir <- "[output_dir]"

# merge information of the detected de novo sRNA clusters
gff_alignment <- GenomicRanges::GRangesList()
for (i in samples) {
  file_path <- file.path(dir, i, "Results.gff.gz3")
  if (file.exists(file_path)) {
    gff_alignment[[i]] <- rtracklayer::import.gff.gz(file_path)
  } else{
    Stop("File does not exist:", file_path, "\n")
  }
}

gff_merged <- GenomicRanges::reduce(unlist(gff_alignment), 
                                    ignore.strand = TRUE)

gff_merged <- as.data.frame(gff_merged)
colnames(gff_merged)[1] <- "chr"
if('*' %in% gff_merged$strand){
    gff_merged <- gff_merged[, -match("strand", colnames(gff_merged))]
  }

locifile_txt <- data.frame(Locus = paste0(gff_merged$chr, ":", 
                                          gff_merged$start,"-", 
                                          gff_merged$end), 
                           Cluster = paste0("cluster_", 
                                            seq_len(nrow(gff_merged))))

# save output to location 
dir_locifile <- "[output directory]"
loci_out <- file.path(dir_locifile,"locifile.txt")
utils::write.table(locifile_txt, file = loci_out, quote = FALSE, 
                   sep = "\t", row.names = FALSE, col.names = TRUE)

Step 3 - sRNA clustering

Finally, we repeat the sRNA clustering using the de novo sRNA loci list to produce our results. The output folders are used as input for the RNAimport() function.

bash scripting

ShortStack \
--readfile "[fastqfile.fq]" \
--genomefile "[genomefile.fa.gz]" \
--locifile "[locifile]"  \
--threads 6 \
--mmap u \
--nohp \
--pad 200 \
--mincov 0.5 \
--outdir "[output_dir_2]"


Mobile sRNA Analysis

Step 1 - De novo sRNA analysis

Here, Bowtie (Langmead B 2009) aligns the reads (excluding multimappers) and ShortStack (Axtell 2013) undertakes de novo sRNA clustering analysis.

bash scripting

#index reference genome 
bowtie-build --threads 6 [genomefile.fa.gz] [genomefile]

# alignment 
bowtie -p 6 -v 0 -a -m 1 --sam -x [genomefile] [fastqfile.fq] | samtools view -bS | samtools sort -o [outputfile.bam]

# cluster analysis with ShortStack - use the alignment files as input
ShortStack \
--bamfile "[outputfile.bam]" \
--genomefile "[genomefile.fa.gz]" \
--threads 6 \
--pad 200 \
--mincov 0.5 \
--nohp \
--outdir "[output_dir]"

Step 2 - Build sRNA cluster list

Step 2 collates all the sRNA loci information from each sample into a text file.

R scripting

# names of samples, represented by folder names produced in previous step
samples <- c("[sample names]")

# location of output folders from previous step
dir <- "[output_dir]"

# merge information of the detected de novo sRNA clusters
gff_alignment <- GenomicRanges::GRangesList()
for (i in samples) {
  file_path <- file.path(dir, i, "Results.gff.gz3")
  if (file.exists(file_path)) {
    gff_alignment[[i]] <- rtracklayer::import.gff.gz(file_path)
  } else{
    Stop("File does not exist:", file_path, "\n")
  }
}

gff_merged <- GenomicRanges::reduce(unlist(gff_alignment), 
                                    ignore.strand = TRUE)

gff_merged <- as.data.frame(gff_merged)
colnames(gff_merged)[1] <- "chr"
if('*' %in% gff_merged$strand){
    gff_merged <- gff_merged[, -match("strand", colnames(gff_merged))]
  }

locifile_txt <- data.frame(Locus = paste0(gff_merged$chr, ":", 
                                          gff_merged$start,"-", 
                                          gff_merged$end), 
                           Cluster = paste0("cluster_", 
                                            seq_len(nrow(gff_merged))))

# save output to location 
dir_locifile <- "[output directory]"
loci_out <- file.path(dir_locifile,"locifile.txt")
utils::write.table(locifile_txt, file = loci_out, quote = FALSE, 
                   sep = "\t", row.names = FALSE, col.names = TRUE)

Step 3 - sRNA clustering

Finally, we repeat the sRNA clustering using the de novo sRNA loci list to produce our results. The output folders are used as input for the RNAimport() function.

bash scripting

ShortStack \
--bamfile "[outputfile.bam]" \
--genomefile "[genomefile.fa.gz]" \
--locifile "[locifile.txt]" \
--threads 6 \
--nohp \
--mincov 0.5 \
--pad 200 \
--outdir "[output_dir_2]"


Manual mRNA Alignment pipeline

Below states the manual steps for mRNAseq alignment for mobile mRNA identification. Keep in mind these commands need modifying to suit your files and directory names.

These steps utilise HISAT2 (Kim 2015) and HTSeq (Anders, Pyl, and Huber 2014) OS software.

If required, index the genome assembly: bash scripting

hisat2-build --threads 6 [genomefile.fa.gz] [genomefile]

Single-End Sequencing Reads

bash scripting

# alignment with hisat2
hist2 -p 6 -x [genomefile] -U [fastqfile_1.fq] -S [outputfile.sam]

# select reads only mapped to one location:
grep "NH:i:1" [outputfile.sam] > [outputfile_uniqueReads.sam]

# reheader the sam file:
samtools view -H [outputfile.sam] > [outputfile_header.txt]
cat [outputfile_header.txt] [outputfile_uniqueReads.sam] > [outputfile_uniqueReads_H.sam]

# convert to bam and sort
samtools view -Sb [outputfile_uniqueReads_H.sam] | samtools sort -o [outputfile_uniqueReads_H.bam]


# make folder to store output for sample:
mkdir [samplename]

# raw count with HTSeq
python -m HTSeq.scripts.count \
--format=bam  \
--order=pos \
--stranded=no \
--mode=union \
--nonunique=none \
--type=mRNA \
--idattr=Name \
[outputfile_uniqueReads_H.bam] \
[annotationfile.gff] \
> [./samplename/Results.txt]

Pair-End Sequencing Reads

bash scripting

# alignment with hisat2
hist2 -p 6 -x [genomefile] \
-1 [fastqfile_1.fq] \
-2 [fastqfile_2.fq] \
-S [outputfile.sam]

# select reads only mapped to one location:
grep "NH:i:1" [outputfile.sam] > [outputfile_uniqueReads.sam]

# reheader the sam file:
samtools view -H [outputfile.sam] > [outputfile_header.txt]
cat [outputfile_header.txt] [outputfile_uniqueReads.sam] > [outputfile_uniqueReads_H.sam]

# convert to bam and sort
samtools view -Sb [outputfile_uniqueReads_H.sam] | samtools sort -o [outputfile_uniqueReads_H.bam]


# make folder to store output for sample:
mkdir [samplename]

# raw count with HTSeq
python -m HTSeq.scripts.count \
--format=bam  \
--order=pos \
--stranded=no \
--mode=union \
--nonunique=none \
--type=mRNA \
--idattr=Name \
[outputfile_uniqueReads_H.bam] \
[annotationfile.gff] \
> [./samplename/Results.txt]

Session Information

sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] dplyr_1.1.4      mobileRNA_1.3.0  BiocStyle_2.35.0
#> 
#> loaded via a namespace (and not attached):
#>   [1] bitops_1.0-9                pbapply_1.7-2              
#>   [3] testthat_3.2.1.1            rlang_1.1.4                
#>   [5] magrittr_2.0.3              matrixStats_1.4.1          
#>   [7] compiler_4.4.1              png_0.1-8                  
#>   [9] vctrs_0.6.5                 stringr_1.5.1              
#>  [11] pkgconfig_2.0.3             bioseq_0.1.4               
#>  [13] crayon_1.5.3                fastmap_1.2.0              
#>  [15] XVector_0.45.0              labeling_0.4.3             
#>  [17] utf8_1.2.4                  Rsamtools_2.21.2           
#>  [19] rmarkdown_2.28              sessioninfo_1.2.2          
#>  [21] UCSC.utils_1.1.0            purrr_1.0.2                
#>  [23] xfun_0.48                   zlibbioc_1.51.2            
#>  [25] cachem_1.1.0                GenomeInfoDb_1.41.2        
#>  [27] jsonlite_1.8.9              progress_1.2.3             
#>  [29] highr_0.11                  DelayedArray_0.33.1        
#>  [31] BiocParallel_1.41.0         prettyunits_1.2.0          
#>  [33] parallel_4.4.1              R6_2.5.1                   
#>  [35] stringi_1.8.4               bslib_0.8.0                
#>  [37] RColorBrewer_1.1-3          rtracklayer_1.65.0         
#>  [39] reticulate_1.39.0           limma_3.61.12              
#>  [41] parallelly_1.38.0           brio_1.1.5                 
#>  [43] GenomicRanges_1.57.2        jquerylib_0.1.4            
#>  [45] Rcpp_1.0.13                 SummarizedExperiment_1.35.5
#>  [47] knitr_1.48                  future.apply_1.11.3        
#>  [49] snow_0.4-4                  audio_0.1-11               
#>  [51] R.utils_2.12.3              IRanges_2.39.2             
#>  [53] Matrix_1.7-1                tidyselect_1.2.1           
#>  [55] abind_1.4-8                 yaml_2.3.10                
#>  [57] codetools_0.2-20            curl_5.2.3                 
#>  [59] listenv_0.9.1               lattice_0.22-6             
#>  [61] tibble_3.2.1                withr_3.0.2                
#>  [63] Biobase_2.67.0              evaluate_1.0.1             
#>  [65] future_1.34.0               Biostrings_2.75.0          
#>  [67] pillar_1.9.0                BiocManager_1.30.25        
#>  [69] MatrixGenerics_1.17.1       stats4_4.4.1               
#>  [71] generics_0.1.3              RCurl_1.98-1.16            
#>  [73] S4Vectors_0.43.2            hms_1.1.3                  
#>  [75] ggplot2_3.5.1               munsell_0.5.1              
#>  [77] scales_1.3.0                globals_0.16.3             
#>  [79] glue_1.8.0                  RPushbullet_0.3.4          
#>  [81] pheatmap_1.0.12             maketools_1.3.1            
#>  [83] tools_4.4.1                 BiocIO_1.17.0              
#>  [85] sys_3.4.3                   data.table_1.16.2          
#>  [87] beepr_2.0                   SimDesign_2.17.1           
#>  [89] GenomicAlignments_1.41.0    locfit_1.5-9.10            
#>  [91] XML_3.99-0.17               buildtools_1.0.0           
#>  [93] grid_4.4.1                  tidyr_1.3.1                
#>  [95] edgeR_4.3.21                colorspace_2.1-1           
#>  [97] GenomeInfoDbData_1.2.13     restfulr_0.0.15            
#>  [99] cli_3.6.3                   fansi_1.0.6                
#> [101] S4Arrays_1.5.11             gtable_0.3.6               
#> [103] R.methodsS3_1.8.2           DESeq2_1.47.0              
#> [105] sass_0.4.9                  digest_0.6.37              
#> [107] progressr_0.15.0            BiocGenerics_0.53.0        
#> [109] SparseArray_1.5.45          ggrepel_0.9.6              
#> [111] farver_2.1.2                rjson_0.2.23               
#> [113] htmltools_0.5.8.1           R.oo_1.26.0                
#> [115] lifecycle_1.0.4             httr_1.4.7                 
#> [117] statmod_1.5.0


References

Anaconda-Inc. 2020. Anaconda Software Distribution.” https://docs.anaconda.com/.
Anders, Simon, Paul Theodor Pyl, and Wolfgang Huber. 2014. HTSeq – A Python framework to work with high-throughput sequencing data.” Bioinformatics. http://dx.doi.org/10.1093/bioinformatics/btu638.
Axtell, MJ. 2013. ShortStack: comprehensive annotation and quantification of small RNA genes. Rna 19: 740–51. https://doi.org/10.1093/nar/gky316.
Barchi, Rabanus‐Wallace, L. 2021. Improved genome assembly and pan‐genome provide key insights into eggplant domestication and breeding. The Plant Journal 107. https://doi.org/10.1111/tpj.15313.
Consortium, Tomato Genome. 2012. The tomato genome sequence provides insights into fleshy fruit evolution. Nature 458. 10.1038/nature11119.
Danecek P, Liddle J, Bonfield JK. 2021. Twelve years of SAMtools and BCFtools.” GigaScience 10. http://dx.doi.org/10.1186/s13059-014-0550-8.
Hosmani, Flores-Gonzalez, P. S. 2019. An improved de novo assembly and annotation of the tomato reference genome using single-molecule sequencing, Hi-C proximity ligation and optical maps. Biorxiv 767764. https://doi.org/10.1101/767764.
Kim, Langmead, D. 2015. HISAT: a fast spliced aligner with low memory requirements. Nature Methods 12. https://www.nature.com/articles/nmeth.3317.
Langmead B, Pop M, Trapnell C. 2009. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10. https://doi.org/10.1186/gb-2009-10-3-r25.
Li, Zhang, B. 2022. Identification and Functional Analysis of MicroRNAs and Their Target Genes in Reverse Thermosensitive Genic Male Sterility of Eggplant. Journal of the American Society for Horticultural Science 147. https://doi.org/10.21273/JASHS05222-22.
Love, Michael I., Wolfgang Huber, and Simon Anders. 2014. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.” Genome Biology 15: 550. http://dx.doi.org/10.1186/s13059-014-0550-8.
McCarthy, Davis J, Yunshun Chen, and Gordon K Smyth. 2012. Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation.” Nucleic Acids Research 40 (January): 4288–97. https://doi.org/10.1093/nar/gks042.
Qing, Zheng, Y. 2022. Dynamically expressed small RNAs, substantially driven by genomic structural variants, contribute to transcriptomic changes during tomato domestication. The Plant Journal 110. https://doi.org/10.1111/tpj.15798.
Villanueva, Vilanova, G. 2023. Transcriptome profiles of eggplant (Solanum melongena) and its wild relative S. dasyphyllum under different levels of osmotic stress provide insights into response mechanisms to drought. Current Plant Biology 33. https://doi.org/10.1016/j.cpb.2023.100276.
Xinbin Dai, Zhaohong Zhuang, and Patrick X. Zhao. 2018. psRNATarget: a plant small RNA target analysis server (2017 release). Nucleic Acids Research. https://doi.org/10.1093/nar/gky316.