mirTarRnaSeq is a package for miRNA and mRNA interaction analysis through regression and correlation approaches supporting various modeling approaches (gaussian, poisson, negative binomial, zero inflated poisson or negative binomial models for the data). miRNA and mRNA sequencing data are analysed from the same experiment (condition or time point data) and mRNA targets of the miRNAs from same experiment will be identified using statistical approaches.
The example data set for the first approach is 25 matching miRNA and mRNA EBV positive samples from TCGA identified as high EBV load based on Movassagh et al, Scientific Reports, 2019 paper. We attempt to identify the EBV miRNA targets on EBV genome (part1).
The second example set set is the simulated mouse fibroblast differentiated to muscle cells in three time points. Here, we try to identify mRNA targets of miRNA expressed at various time points (parts 2 and 3).
The users can utilize TPM/RPKM or count data for this section of analysis as long as if the data is normalized (TPM/RPKM) is used for the mRNA expression files the miRNA expression files are also normalized (TPM). The format of the files for miRNA and mRNA needs to be odd the class dataframe with the sample names as colnames and mRNA names as rownames. Note, for this section of analysis we have only included all differentially expressed mRNAs for the analysis but the user can included all mRNAs. However, the user should realize if they choose to use all mRNAs expressed the analysis will take a longer time.
# Helper function to access test files in extdata
get_test_file <- function(file_name) {
if (missing(file_name)) {
return(system.file("extdata", "test", package="mirTarRnaSeq"))
}
return(system.file("extdata", "test", file_name, package="mirTarRnaSeq"))
}
# Read input files
DiffExp<-read.table(get_test_file("EBV_mRNA.txt.gz"), as.is=TRUE, header=TRUE, row.names=1)
miRNAExp<-read.table(get_test_file("EBV_miRNA.txt.gz"), as.is=TRUE, header=TRUE, row.names=1)
We currently support miRanda runs (potential miRNA target parsing by score,interaction energy, and interaction or miRNA length ) on seven species (“Human”, “Mouse”, “Drosophila”, “C.elegans”, “Epstein_Barr” (EBV), “Cytomegalovirus” (CMV) and “Kaposi_Sarcoma” (KSHV)). We also support the viral miRNAs targeting human genes for EBV “Epstein_Barr_Human”, CMV “CMV_Human” and KSHV “KSHV_Human”. 1) Here we first import the relevant miRanda file . 2) We only keep targets which are also targets of EBV miRNAs based on our EBV miRanda file. Note, for other organisms (not provided by the pacakged) the appropriate format for miRanda input file is V1 column: name of the miRNA, V2 column: name of the predicted gene, V3: column score/rank of the miranda interaction (or any other score for interaction if TargetScan ( for TargetScan prediction user can provide context++ score and threshold accordingly) is used, note these three columns are required), V4 column: folding energy of miRNA-mRNA interaction. V5 column: target Identity value and V6 column: miRNA idnetity value. (Note, all these values are available after miRanda analysis but V1-V3 must be provided not matter the prediction method)
Running various univariate models(1 to 1 miRNA-mRNA relationships) with various miRNA-mRNA distribution assumptions for either 1 miRNA and 1 mRNA relationship chosen/selected by the user or across all miRNAs and mRNAs in the expression dataframes. Note,the users can choose to run any of the available distribution model assumptions (glm_poisson, glm_gaussian, glm_nb (negative binomial), glm_zeroinfl (zero inflated model or zero inflated negative binomial)).
Here LMP_1 is EBV mRNA and the ebv-mir-BART9-5p is the miRNA and we are running a glm poisson model.
blaGaus <- runModels(Combine,
geneVariant, miRNA_select,
family = glm_gaussian(),
scale = 100)
# Plot the regression relationship for the mRNA and miRNAs (BHLF1 is the EBV
# mRNA). Note, these are standard quality check plot outputs from plot
# regression.
par(oma=c(2,2,2,2))
par(mfrow=c(2,3),mar=c(4,3,3,2))
plot(blaGaus[["all_models"]][["BHLF1"]])
plot(modelData(blaGaus[["all_models"]][["BHLF1"]]))
# To comprehend the performance of all models we look at Akaike information
# criterion (AIC) across all miRNA-mRNA model performances and then look at
# the density plots. Note, the models with lower comparitive AICs have better
# performace.
G <- do.call(rbind.data.frame, blaGaus[["AICvalues"]])
names(G) <- c("AIC_G")
# Low values seems like a reasonable model
plot(density(G$AIC_G))
blaPois <- runModels(Combine,
geneVariant, miRNA_select,
family = glm_poisson(),
scale = 100)
par(oma=c(2,2,2,2))
par(mfrow=c(2,3),mar=c(4,3,3,2))
plot(blaPois[["all_models"]][["LMP-2A"]])
plot(modelData(blaPois[["all_models"]][["LMP-2A"]]))
P <- do.call(rbind.data.frame, blaPois[["AICvalues"]])
names(P) <- c("AIC_Po")
PM <- melt(P)
#> No id variables; using all as measure variables
When using glm_multi(), where all available models will be run, the AICs will be compared and the best model will be chosen based on the miRNA-mRNA model AIC score. In the example bellow we are using the mode= “multi” option for combination of 2 miRNAs (multivariate model) for interaction model the user can choose the mode= “inter” option. Note all_coeff parameters defaults TRUE all interactions (only negative miRNA-mRNA relationships) are reported. More comments on the mode is provided in the next section of the vignette. If all miRNA-mRNA relationships are wanted this parameter can be set to false.
miRNA_select<-c("ebv-mir-BART9-5p","ebv-mir-BART6-3p")
Combine <- combiner(DiffExp, miRNAExp, miRNA_select)
geneVariant <- geneVari(Combine, miRNA_select)
MultiModel <- runModels(Combine, geneVariant,
miRNA_select, family = glm_multi(),
mode="multi", scale = 10)
# Print the name of the models used for the analysis (note the printed outputs
# are the number of models ran by various models based on the AIC scores using
# the glm_multi())
print(table(unlist(lapply(MultiModel$all_models, modelModelName))))
TRUE
TRUE glm_gaussian glm_nb
TRUE 94 2
In GLM multi (multinomial), (model=“multi”), the user can choose to run as many selected miRNAs they choose against all mRNAs datasets.In this example we select two miRNAs. The user can also select particular mRNAs to run this analysis on. We recommend the user always chooses the number of miRNAs for the multinational models and not run it across all the dataset as the analysis could take a very long time. If the user chooses to use more than two miRNAs for the multinomial model they should assign specific miRNAs and we recommend running it on a high memory machine. In GLM multi (synergy) model (mode=“inter”), miRNA interactions. In this example we select two miRNAs. The user can also select particular mRNAs to run this analysis on. We recommend the user always chooses the number of miRNAs for the multinational models and not run it across all the dataset as the analysis could take a very long time. If the user chooses to use more than two miRNAs for the multinomial model they should assign specific miRNAs and we recommend running it on a high memory machine.
miRNA_select<-c("ebv-mir-BART9-5p","ebv-mir-BART6-3p")
Combine <- combiner(DiffExp, miRNAExp, miRNA_select)
geneVariant <- geneVari(Combine, miRNA_select)
InterModel <- runModels(Combine,
geneVariant, miRNA_select,
family = glm_multi(
models=list(glm_gaussian,
glm_poisson())),mode="inter", scale = 10)
# Print the name of the models used for the analysis (note, you can see although
# we have defined for the models to be run using either gaussian or poisson
# options, mirTarRnaSeq chooses the poisson model for all as it performs with a
# better AIC (lower AIC value) for all miRNA-mRNA interactions)
print(table(unlist(lapply(InterModel$all_models, modelModelName))))
TRUE
TRUE glm_gaussian
TRUE 96
Note, for “inter” and “multi” mode options we only support combination of 2 if more than two relationships are of interest, we recommend selecting the miRNAs and running the previously described runModels function due to complication of the models and time consumption.
vMiRNA<-rownames(miRNAExp)
# Note, the user can run all miRNAs but for speed reasons we have chosen the
# first 5 here for mirnas input for the analysis.
All_miRNAs_run<-runAllMirnaModels(mirnas =vMiRNA[1:5] ,
DiffExpmRNA = DiffExpmRNASub,
DiffExpmiRNA = miRNAExp,
miranda_data = miRanda,prob=0.75,
cutoff=0.05,fdr_cutoff = 0.1, method = "fdr",
family = glm_multi(), scale = 2, mode="multi")
TRUE 1: ebv-mir-BART1-3p, ebv-mir-BART1-5p
TRUE 2: ebv-mir-BART1-3p, ebv-mir-BART10-3p
TRUE 3: ebv-mir-BART1-3p, ebv-mir-BART10-5p
TRUE 4: ebv-mir-BART1-3p, ebv-mir-BART11-3p
TRUE 5: ebv-mir-BART1-5p, ebv-mir-BART10-3p
TRUE 6: ebv-mir-BART1-5p, ebv-mir-BART10-5p
TRUE 7: ebv-mir-BART1-5p, ebv-mir-BART11-3p
TRUE 8: ebv-mir-BART10-3p, ebv-mir-BART10-5p
TRUE 9: ebv-mir-BART10-3p, ebv-mir-BART11-3p
TRUE 10: ebv-mir-BART10-5p, ebv-mir-BART11-3p
#select significant genes
hasgenes <- lapply(All_miRNAs_run, function(x) nrow(x$SigFDRGenes)) > 0
All_miRNAs_run <- All_miRNAs_run[hasgenes]
print(table(unlist(lapply
(All_miRNAs_run[[1]][["FDRModel"]][["all_models"]],
modelModelName))))
TRUE
TRUE glm_gaussian glm_nb glm_zeroinfl_negbin
TRUE 78 5 1
TRUE glm_zeroinfl_poisson
TRUE 1
# Print specific models for specific miRNAs (in this example the significant
# multivariate model for ebv-mir-BART1-5p and ebv-mir-BART11-3p )
print(
table(
unlist(lapply(
(All_miRNAs_run[["ebv-mir-BART1-5p and ebv-mir-BART11-3p"]]
[["FDRModel"]]
[["all_models"]]),
modelModelName))))
TRUE
TRUE glm_gaussian glm_nb glm_zeroinfl_negbin
TRUE 79 3 2
TRUE glm_zeroinfl_poisson
TRUE 1
miRNA-mRNA sparse partial correlation prediction using elastic net and compatibility with SPONGE package.Note, we do not recommend using this method for low number of samples (20 or less), or in miRNAs/mRNAs with low variance as the sample size increases the confidence in this analysis increases same goes for miRNA and mRNA variance. Optimal analysis using this method is in 100 or more samples in high variance miRNAs/mRNAs. For viral analysis or samples with low mRNA miRNA variance we recommend using part1 of mirTarRnaSeq or narrowing down the miRNAs and mRNAs to high variance ones.
# Make miRanda file compatible with SPONGE package mir_predicted_targets()
sponge_mir_input<-miranda_sponge_predict(miRNAExp,DiffExp,miRanda)
#Perform sparse partial correlation for miRNA-mRNA relationships
one2many_output<-one2manySponge(miRNAExp,DiffExp,sponge_mir_input)
#> Loading required package: Matrix
#>
#> Attaching package: 'Matrix'
#> The following objects are masked from 'package:tidyr':
#>
#> expand, pack, unpack
#> Loaded glmnet 4.1-8
#> Loading required package: foreach
#> Loading required package: rngtools
Note for this analysis we need fold change data for time points or control versus condition. Hence, a differential expression (DE) analysis needs to be performed before proceeding this analysis (These values should be provided for all miRNA and mRNA in the DE expression and not only the significantly DE miRNAs/mRNAs). Here we are looking at differential expression (DE) files between three time points. The format of each timepoint/control vs condition file needs to be Gene/miRNA names as the first column, log2FC or logfoldchange (FC), (or any other FC metrics as long as for both miRNA, and mRNA the same metrics is used) for column two. The pvalue assigned to the gene(mRNA) expression after the differential expression analysis on the third column.For the miRNA file, the user needs to assign Gene names on the first column and the representative log2FC or logfoldchange (FC) on the second column.
Load files from test directory or you can load individually and feed them in separately in a list: list[(mRNA1,mRNA2,mRNA)]
files <- local({
filenames <- list.files(path=get_test_file(), pattern="^.*\\.txt\\.gz$",
full.names=TRUE)
files <- lapply(filenames, read.table, as.is=TRUE, header=TRUE, sep="\t")
names(files) <- gsub("^.*/(.*)\\.txt\\.gz$", "\\1", filenames)
return(files)
})
Density plot for background and corrs in our data. Note grey is the background distribution and red is the actual data.
Correlation heatmap for cor equal or less than -0.7. Note upperbound for heatmap should be always less than the correlation threshold.
Get miRanda intersection and significant miRNA and mRNA interactions and the plot it.
Heatmap for p value significant miRNA mRNA fold change differences when compared to backgound