MultiRNAflow
on a
RNA-seq raw counts with different time points and several biological
conditionsThis document is a quick workflow describing how to use our R package MultiRNAflow on one dataset (see Dataset used as example). For a more complete description of our package and complete outputs with graphs, the user must read our pdf file entitled ’MultiRNAflow_vignette-knitr.pdf”.
The Mouse dataset 2 (Weger et al. 2021) is accessible on the Gene Expression Omnibus (GEO) database with the accession number GSE135898.
This dataset contains the temporal transcription profile of 16 mice with Bmal1 and Cry1/2 knocked-down under an ad libitum (AL) or night restricted feeding (RF) regimen. Data were collected at 0, 4h, 8h, 16, 20h and 24h. Therefore, there are six time points and eight biological conditions. As there are only two mice per biological condition, we decided not to take into account the effect of the regimen. The dataset contains temporal expression data of 40327 genes.
To illustrate the use of our package, we take 500 genes, over the global 40327 genes in the original dataset. This sub dataset is saved in the file RawCounts_Weger2021_MOUSEsub500.
The dataset must be a data.frame containing raw counts data. If it is
not the case, the functions DATAnormalization()
and
DEanalysisGlobal()
will stop and print an error. Each line
should correspond to a gene, each column to a sample, except a
particular column that may contain strings of characters describing the
names of the genes. The first line of the data.frame should contain the
names of the columns (strings of characters) that must have the
following structure.
head(colnames(RawCounts_Weger2021_MOUSEsub500), n=5)
#> [1] "Gene" "BmKo_t1_r1" "BmKo_t2_r1" "BmKo_t0_r1" "BmKo_t3_r1"
In this example, “Gene” indicates the column which contains the names of the different genes. The other column names contain all kind of information about the sample, including the biological condition, the time of measurement and the name of the individual (e.g patient, replicate, mouse, yeasts culture…). Other kinds of information can be stored in the column names (such as patient information), but they will not be used by the package. The various information in the column names must be separated by underscores. The order of these information is arbitrary but must be the same for all columns. For instance, the sample “BmKo_t0_r1” corresponds to the first replicate (r1) of the biological condition BmKo at time t0 (reference time).
The information located to the left of the first underscore will be
considered to be in position 1, the information located between the
first underscore and the second one will be considered to be in position
2, and so on. In the previous example, the biological condition is in
position 1, the time is in position 2 and the replicate is in position
3. In most of the functions of our package, the order of the previous
information in the column names will be indicated with the inputs
Group.position
, Time.position
and
Individual.position
. Similarly the input
Column.gene
will indicate the number of the column
containing gene names.
DATAprepSE()
resDATAprepSE <- DATAprepSE(RawCounts=RawCounts_Weger2021_MOUSEsub500,
Column.gene=1,
Group.position=1,
Time.position=2,
Individual.position=3)
DEanalysisGlobal
for the statistical (supervised)
analysis.RawCounts
)Column.gene=1
means that the first column
of the dataset contain genes name, Time.position=2
means
that the time of measurements is between the first and the second
underscores in the columns names, Individual.position=3
means that the name of the individual is between the second and the
third underscores in the columns names and
Group.position=NULL
means that there is only one biological
condition in the dataset. Similarly, Time.position=NULL
would mean that there is only one time of measurement for each
individual and Column.gene=NULL
would mean that there is no
column containing gene names.?DATAprepSE
in your console for more information
about the function.DATAnormalization()
resNorm <- DATAnormalization(SEres=resDATAprepSE,
Normalization="vst",
Blind.rlog.vst=FALSE,
Plot.Boxplot=FALSE,
Colored.By.Factors=TRUE,
Color.Group=NULL,
path.result=NULL)
resDATAprepSE
but with the normalized dataPlot.Boxplot=TRUE
) showing the
distribution of the normalized expression
(Normalization="vst"
means that the vst method is used) of
genes for each sample.DATAprepSE()
.Plot.Boxplot=TRUE
.Colored.By.Factors=TRUE
, the color of the boxplots
would be different for different biological conditions.path.result
.?DATAnormalization
in your console for more
information about the function.PCAanalysis()
resPCA <- PCAanalysis(SEresNorm=resNorm,
DATAnorm=TRUE,
gene.deletion=NULL,
sample.deletion=NULL,
Plot.PCA=FALSE,
Mean.Accross.Time=FALSE,
Color.Group=NULL,
Cex.label=0.6,
Cex.point=0.7, epsilon=0.2,
Phi=25,Theta=140,
motion3D=FALSE,
path.result=NULL)
resNorm
but
with the results of the function PCA()
of the package
FactoMineR
.motion3D=FALSE
) where samples are
colored with different colors for different biological conditions.
Furthermore, lines are drawn between each pair of consecutive points for
each sample (if Mean.Accross.Time=FALSE
, otherwise it will
be only between means).motion3D=FALSE
) for each biological
condition, where samples are colored with different colors for different
time points. Furthermore, lines are drawn between each pair of
consecutive points for each sample (if
Mean.Accross.Time=FALSE
, otherwise it will be only between
means).DATAnormalization()
.DATAnorm=TRUE
) for the PCA analysis.Color.Group=NULL
), a color will be
automatically applied for each biological condition. If you want to
change the colors, read our pdf file entitled
’MultiRNAflow_vignette-knitr.pdf”.gene.deletion=c("ENSMUSG00000025921", "ENSMUSG00000026113")
and/or sample.deletion=c("BmKo_t2_r1", "BmKo_t5_r2")
gene.deletion=c(2,6)
and/or
sample.deletion=c(3,13)
. The integers in
gene.deletion
and sample.deletion
represent
respectively the row numbers and the column numbers of
ExprData
where the selected genes and samples are
located.Plot.PCA=TRUE
.path.result
.?PCAanalysis
in your console for more information
about the function.HCPCanalysis()
resHCPC <- HCPCanalysis(SEresNorm=resNorm,
DATAnorm=TRUE,
gene.deletion=NULL,
sample.deletion=NULL,
Plot.HCPC=FALSE,
Phi=25,Theta=140,
Cex.point=0.6,
epsilon=0.2,
Cex.label=0.6,
motion3D=FALSE,
path.result=NULL)
resNorm
but
with the results of the function HCPCanalysis()
of the
package FactoMineR
motion3D=FALSE
). These PCA graphs are
identical to the outputs of PCAanalysis()
with all samples
but samples are colored with different colors for different
clusters.DATAnormalization()
.DATAnorm=TRUE
) for the HCPC analysis.Plot.HCPC=TRUE
.path.result
.HCPC()
.?HCPCanalysis
in your console for more
information about the function.MFUZZanalysis()
.resMFUZZ <- MFUZZanalysis(SEresNorm=resNorm,
DATAnorm=TRUE,
DataNumberCluster=NULL,
Method="hcpc",
Membership=0.5,
Min.std=0.1,
Plot.Mfuzz=FALSE,
path.result=NULL)
resNorm
but
with
mfuzz()
DataNumberCluster=NULL
). If Method="hcpc"
, the
function plots the scaled within-cluster inertia, but if
Method="kmeans"
, the function plots the scaled
within-cluster inertia. As the number of genes can be very high, we
recommend to select Method="hcpc"
which is by default.mfuzz()
which
represents the most common temporal behavior among all genes for the
biological condition ‘BmKo’.DATAnormalization()
.DATAnorm=TRUE
) for the clustering analysis.Membership
(numeric value between 0 and 1) will not be
displayed. The membership values correspond to the probability of gene
to belong to each cluster.Min.std
(numeric positive value) will be
excluded.Plot.Mfuzz=TRUE
.path.result
.?MFUZZanalysis
in your console for more
information about the function.DATAplotExpressionGenes()
resEXPR <- DATAplotExpressionGenes(SEresNorm=resNorm,
DATAnorm=TRUE,
Vector.row.gene=c(17),
Color.Group=NULL,
Plot.Expression=FALSE,
path.result=NULL)
DATAnormalization()
.DATAnorm=TRUE
).Vector.row.gene=c(97,192,194,494)
.Color.Group=NULL
), a color will be
automatically applied for each biological condition. If you want to
change the colors, read our pdf file entitled
’MultiRNAflow_vignette-knitr.pdf”.Plot.Expression=TRUE
.path.result
.?DATAplotExpressionGenes
in your console for more
information about the function.DEanalysisGlobal()
For the speed of the algorithm, we will take only three biological conditions and three times
Sub3bc3T <- RawCounts_Weger2021_MOUSEsub500
Sub3bc3T <- Sub3bc3T[, seq_len(73)]
SelectTime <- grep("_t0_", colnames(Sub3bc3T))
SelectTime <- c(SelectTime, grep("_t1_", colnames(Sub3bc3T)))
SelectTime <- c(SelectTime, grep("_t2_", colnames(Sub3bc3T)))
Sub3bc3T <- Sub3bc3T[, c(1, SelectTime)]
resSEsub <- DATAprepSE(RawCounts=Sub3bc3T,
Column.gene=1,
Group.position=1,
Time.position=2,
Individual.position=3)
resDE <- DEanalysisGlobal(SEres=resSEsub,
pval.min=0.05,
log.FC.min=1,
Plot.DE.graph=FALSE,
path.result=NULL)
#> [1] "Preprocessing"
#> [1] "Differential expression step with DESeq2::DESeq()"
#> [1] "Case 3 analysis : Biological conditions and Times."
#> [1] "DE time analysis for each biological condition."
#> [1] "DE group analysis for each time measurement."
#> [1] "Combined time and group results."
DESeq()
Results
) which contains
DEplotVennBarplotTime()
).DEplotBarplotFacetGrid()
).DEplotVennBarplotGroup()
).DEplotAlluvial()
).DEplotBarplotFacetGrid()
).DEplotAlluvial()
).pval.min
(numeric value between 0 and 1) and if
the absolute log fold change associated to the gene is superior to
log.FC.min
(numeric positive value).Plot.DE.graph=TRUE
.path.result
.RawCounts=RawCounts_Weger2021_MOUSEsub500
in order
to use the complete dataset.?DEanalysisGlobal
in your console for more
information about the function.DEplotVolcanoMA()
resMAvolcano <- DEplotVolcanoMA(SEresDE=resDE,
NbGene.plotted=2,
SizeLabel=3,
Display.plots=FALSE,
Save.plots=FALSE)
DEanalysisGlobal()
.Display.plots=TRUE
.Save.plots=TRUE
and and a folder path in the input
path.result
in the function
DEanalysisGlobal()
.?DEplotVolcanoMA
in your console for more
information about the function.DEplotHeatmaps()
resHEAT <- DEplotHeatmaps(SEresDE=resDE,
ColumnsCriteria=2,
Set.Operation="union",
NbGene.analysis=20,
SizeLabelRows=5,
SizeLabelCols=5,
Display.plots=FALSE,
Save.plots=TRUE)
Set.Operation="union"
then the rows extracted from
the different datasets included in SEresDE
are those such
that the sum of the selected columns of
SummarizedExperiment::rowData(SEresDE)
by
ColumnsCriteria
is >0. For example, the selected genes
can be those DE at least at t1 or t2 (versus the reference time
t0).Display.plots=TRUE
.Save.plots=TRUE
and and a folder path in the input
path.result
in the function
DEanalysisGlobal()
.?DEplotHeatmaps
in your console for more
information about the function.GSEAQuickAnalysis()
and
GSEApreprocessing()
resRgprofiler2 <- GSEAQuickAnalysis(Internet.Connection=FALSE,
SEresDE=resDE,
ColumnsCriteria=2,
ColumnsLog2ordered=NULL,
Set.Operation="union",
Organism="mmusculus",
MaxNumberGO=20,
Background=FALSE,
Display.plots=FALSE,
Save.plots=TRUE)
DEanalysisGlobal()
, an enrichment analysis (GSEA) of a
subset of genes with the R package gprofiler2
.
gprofiler2::gost()
MaxNumberGO
most important
GO.Set.Operation="union"
then the rows extracted from
the different datasets included in SEresDE
are those such
that the sum of the selected columns of
SummarizedExperiment::rowData(SEresDE)
by
ColumnsCriteria
is >0. For example, the selected genes
can be those DE at least at t1 or t2 (versus the reference time
t0).Display.plots=TRUE
.Save.plots=TRUE
and a folder path in the input
path.result
in the function
DEanalysisGlobal()
.?GSEAQuickAnalysis
in your console for more
information about the function.resGSEApreprocess <- GSEApreprocessing(SEresDE=resDE,
ColumnsCriteria=2,
Set.Operation="union",
Rnk.files=FALSE,
Save.files=FALSE)
ColumnsCriteria
and Set.Operation
.Save.files=TRUE
and the path.result
of DEanalysisGlobal()
is not NULL, specific files designed
to be used as input for the following online tools and software : GSEA,
DAVID, WebGestalt, gProfiler, Panther, ShinyGO, Enrichr, GOrillaSet.Operation="union"
then
the rows extracted from the different datasets included in
SEresDE
are those such that the sum of the selected columns
of SummarizedExperiment::rowData(SEresDE)
by
ColumnsCriteria
is >0. For example, the selected genes
can be those DE at least at t1 or t2 (versus the reference time
t0).?GSEApreprocessing
in your console for more
information about the function.Here is the output of sessionInfo()
on the system on
which this document was compiled.
sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] tcltk stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] MultiRNAflow_1.5.0 Mfuzz_2.67.0 DynDoc_1.85.0
#> [4] widgetTools_1.85.0 e1071_1.7-16 Biobase_2.67.0
#> [7] BiocGenerics_0.53.1 generics_0.1.3 BiocStyle_2.35.0
#>
#> loaded via a namespace (and not attached):
#> [1] RColorBrewer_1.1-3 sys_3.4.3
#> [3] jsonlite_1.8.9 shape_1.4.6.1
#> [5] magrittr_2.0.3 estimability_1.5.1
#> [7] farver_2.1.2 rmarkdown_2.28
#> [9] GlobalOptions_0.1.2 fs_1.6.5
#> [11] zlibbioc_1.52.0 vctrs_0.6.5
#> [13] base64enc_0.1-3 rstatix_0.7.2
#> [15] htmltools_0.5.8.1 S4Arrays_1.7.1
#> [17] broom_1.0.7 Formula_1.2-5
#> [19] SparseArray_1.7.0 gridGraphics_0.5-1
#> [21] sass_0.4.9 bslib_0.8.0
#> [23] htmlwidgets_1.6.4 plyr_1.8.9
#> [25] emmeans_1.10.5 plotly_4.10.4
#> [27] cachem_1.1.0 buildtools_1.0.0
#> [29] misc3d_0.9-1 lifecycle_1.0.4
#> [31] iterators_1.0.14 pkgconfig_2.0.3
#> [33] Matrix_1.7-1 R6_2.5.1
#> [35] fastmap_1.2.0 GenomeInfoDbData_1.2.13
#> [37] MatrixGenerics_1.19.0 clue_0.3-65
#> [39] digest_0.6.37 colorspace_2.1-1
#> [41] S4Vectors_0.45.0 DESeq2_1.47.0
#> [43] GenomicRanges_1.59.0 ggpubr_0.6.0
#> [45] labeling_0.4.3 fansi_1.0.6
#> [47] httr_1.4.7 abind_1.4-8
#> [49] compiler_4.4.1 proxy_0.4-27
#> [51] withr_3.0.2 doParallel_1.0.17
#> [53] backports_1.5.0 BiocParallel_1.41.0
#> [55] viridis_0.6.5 carData_3.0-5
#> [57] UpSetR_1.4.0 dendextend_1.18.1
#> [59] ggsignif_0.6.4 MASS_7.3-61
#> [61] tkWidgets_1.85.0 DelayedArray_0.33.1
#> [63] rjson_0.2.23 scatterplot3d_0.3-44
#> [65] flashClust_1.01-2 tools_4.4.1
#> [67] FactoMineR_2.11 glue_1.8.0
#> [69] grid_4.4.1 cluster_2.1.6
#> [71] reshape2_1.4.4 gtable_0.3.6
#> [73] class_7.3-22 tidyr_1.3.1
#> [75] data.table_1.16.2 car_3.1-3
#> [77] utf8_1.2.4 XVector_0.47.0
#> [79] ggrepel_0.9.6 foreach_1.5.2
#> [81] pillar_1.9.0 stringr_1.5.1
#> [83] yulab.utils_0.1.7 circlize_0.4.16
#> [85] dplyr_1.1.4 lattice_0.22-6
#> [87] tidyselect_1.2.1 ComplexHeatmap_2.23.0
#> [89] locfit_1.5-9.10 maketools_1.3.1
#> [91] knitr_1.48 gridExtra_2.3
#> [93] IRanges_2.41.0 SummarizedExperiment_1.37.0
#> [95] stats4_4.4.1 xfun_0.49
#> [97] plot3Drgl_1.0.4 factoextra_1.0.7
#> [99] matrixStats_1.4.1 DT_0.33
#> [101] stringi_1.8.4 UCSC.utils_1.3.0
#> [103] lazyeval_0.2.2 yaml_2.3.10
#> [105] evaluate_1.0.1 codetools_0.2-20
#> [107] tibble_3.2.1 BiocManager_1.30.25
#> [109] multcompView_0.1-10 ggplotify_0.1.2
#> [111] cli_3.6.3 munsell_0.5.1
#> [113] jquerylib_0.1.4 Rcpp_1.0.13-1
#> [115] GenomeInfoDb_1.43.0 gprofiler2_0.2.3
#> [117] png_0.1-8 parallel_4.4.1
#> [119] leaps_3.2 rgl_1.3.12
#> [121] ggplot2_3.5.1 ggalluvial_0.12.5
#> [123] viridisLite_0.4.2 mvtnorm_1.3-1
#> [125] plot3D_1.4.1 scales_1.3.0
#> [127] purrr_1.0.2 crayon_1.5.3
#> [129] GetoptLong_1.0.5 rlang_1.1.4