A Multi Assay Experiment object (Ramos et al. 2017) is the input for all main functions of ELMER and can be generated by the createMAE function.
To perform ELMER analyses, the Multi Assay Experiment needs:
If TCGA data are used, the last two matrices will be generated automatically. Based on the selected reference genome, metadata for the DNA methylation probes, such as genomic coordinates, will be added from Wanding Zhou's annotation (Zhou et al. 2016), and metadata for gene annotation will be added from the Ensembl database (Yates et al. 2015) using biomaRt (Durinck et al. 2009).
DNA methylation data fed to ELMER should be a matrix of DNA methylation beta (β) values with samples in columns and probes in rows, processed from raw HM450K array data. If TCGA data are used, processed data will be downloaded from the GDC website and automatically transformed into this matrix by ELMER. The processed TCGA DNA methylation data were calculated as $\frac{M}{M+U}$, where M represents the methylated allele intensity and U the unmethylated allele intensity. Beta values range from 0 to 1, reflecting the fraction of methylated alleles at each CpG in each tumor; beta values close to 0 indicate low levels of DNA methylation and beta values close to 1 indicate high levels of DNA methylation.
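The beta-value formula above can be illustrated with a minimal base R sketch. The intensity matrices `M` and `U`, the probe names and the sample names are made-up toy values, not TCGA data, and the sketch is not the processing pipeline used by GDC or ELMER.

```r
# Toy methylated (M) and unmethylated (U) intensities:
# 3 probes (rows) x 2 samples (columns). All values are invented.
M <- matrix(c(900, 100, 500, 800, 50, 450), nrow = 3,
            dimnames = list(c("probe1", "probe2", "probe3"),
                            c("sampleA", "sampleB")))
U <- matrix(c(100, 900, 500, 200, 950, 550), nrow = 3,
            dimnames = dimnames(M))

# Beta value: fraction of the total signal that comes from the methylated allele.
beta <- M / (M + U)
print(beta)
```

The result keeps the probes-by-samples layout described above, with every value between 0 and 1.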
If users have raw HM450K data, these data can be processed with methylumi or minfi to generate DNA methylation beta (β) values for each CpG site across multiple samples. The getBeta function in minfi can be used to generate a matrix of DNA methylation beta (β) values to feed into ELMER. We recommend saving this matrix as meth.rda, since createMAE can read files from a given path, which helps to reduce memory usage.
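A minimal sketch of the save step, using only base R. The matrix `Meth` here is a toy stand-in for a real beta-value matrix, and the file is written to a temporary directory (an assumption made to keep the example side-effect free). According to the text above, the resulting path can then be given to createMAE; how the function treats the path should be confirmed in the ELMER documentation.

```r
# Toy beta-value matrix standing in for the output of a preprocessing step.
Meth <- matrix(c(0.9, 0.1, 0.8, 0.05), nrow = 2,
               dimnames = list(c("probe1", "probe2"),
                               c("sampleA", "sampleB")))

# Save the object to an .rda file in a temporary directory.
meth.file <- file.path(tempdir(), "meth.rda")
save(Meth, file = meth.file)

# Reload into a separate environment to confirm what the file contains.
env <- new.env()
loaded <- load(meth.file, envir = env)
print(loaded)
print(identical(env$Meth, Meth))
```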
Gene expression data fed to ELMER should be a matrix of gene expression values with samples in columns and genes in rows. Gene expression values can be generated from different platforms: array or RNA-seq. The raw data should be processed by other software to obtain gene- or transcript-level expression calls, for example by mapping with TopHat and calling expression values with Cufflinks or RSEM, or by using GenomeStudio for expression arrays. It is recommended to normalize the expression data, for example by quantile normalization, so that gene expression is comparable across samples. Users can refer to the TCGA RNA-seq analysis pipeline to generate comparable gene expression data. The gene expression values from each sample should then be combined into a matrix for ELMER. If users want to use TCGA data, ELMER has functions to download the RNA-Seq quantification data (HTSeq-FPKM-UQ) from the GDC website and transform them into the matrix for ELMER. It is recommended to save this matrix as RNA.rda, since createMAE can read files from a given path, which helps to reduce memory usage.
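Quantile normalization, mentioned above as one way to make expression comparable across samples, can be sketched in base R as follows. The helper name `quantile_normalize` and the toy matrix are hypothetical, ties are broken by order of appearance for simplicity, and this is not the TCGA pipeline or an ELMER function.

```r
quantile_normalize <- function(x) {
  # Rank each value within its column (ties broken by order of appearance).
  ranks <- apply(x, 2, rank, ties.method = "first")
  # Reference distribution: mean of the sorted values across samples.
  reference <- rowMeans(apply(x, 2, sort))
  # Replace every value with the reference value of the same rank.
  normalized <- apply(ranks, 2, function(r) reference[r])
  dimnames(normalized) <- dimnames(x)
  normalized
}

# Toy expression matrix: 3 genes (rows) x 3 samples (columns), invented values.
expr <- matrix(c(2, 4, 6, 10, 30, 20, 1, 3, 2), nrow = 3,
               dimnames = list(c("gene1", "gene2", "gene3"),
                               c("sampleA", "sampleB", "sampleC")))
norm <- quantile_normalize(expr)
print(norm)
```

After normalization every sample has the same set of values, while the ordering of genes within each sample is preserved.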
Sample information should be stored as a data.frame object containing the sample ID and group labels (control and experiment). Sample IDs and group labels are required. Other information for each sample can be added to this data.frame object. When TCGA data are used, sample information will be generated automatically by the createMAE function by specifying the option TCGA=TRUE. A column named TN will define the groups Tumor and Normal, assigning the following samples to each group:
Tumor samples are:
Normal samples:
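library(MultiAssayExperiment)
library(ELMER)     # provides createMAE, getExp and getMet
library(DT)        # provides datatable
library(magrittr)  # provides the %>% pipe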
data <- createMAE(
exp = GeneExp,
met = Meth,
met.platform = "450K",
genome = "hg19",
save = FALSE,
TCGA = TRUE
)
## Warning: sampleMap[['assay']] coerced with as.factor()
## A MultiAssayExperiment object of 2 listed
## experiments with user-defined names and respective classes.
## Containing an ExperimentList class object of length 2:
## [1] DNA methylation: RangedSummarizedExperiment with 1687 rows and 234 columns
## [2] Gene expression: RangedSummarizedExperiment with 3808 rows and 234 columns
## Functionality:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample coordination DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
## exportClass() - save data to flat files
as.data.frame(colData(data)[,c("patient","definition","TN")]) %>%
datatable(options = list(scrollX = TRUE,pageLength = 5))
# Adding sample information for non-TCGA samples
# You should have two objects, one for DNA methylation and
# one for gene expression. They should have the same number of samples, and the
# sample names in the gene expression object and in the DNA methylation matrix
# should be the same
not.tcga.exp <- GeneExp # 234 samples
colnames(not.tcga.exp) <- substr(colnames(not.tcga.exp),1,15)
not.tcga.met <- Meth # 268 samples
colnames(not.tcga.met) <- substr(colnames(not.tcga.met),1,15)
# Number of samples in both objects (234)
table(colnames(not.tcga.met) %in% colnames(not.tcga.exp))
##
## FALSE TRUE
## 34 234
# The sample information must use the sample names as row names
phenotype.data <- data.frame(row.names = colnames(not.tcga.exp),
primary = colnames(not.tcga.exp),
group = c(rep("group1", ncol(GeneExp)/2),
rep("group2", ncol(GeneExp)/2)))
data.hg19 <- createMAE(exp = not.tcga.exp,
met = not.tcga.met,
TCGA = FALSE,
met.platform = "450K",
genome = "hg19",
colData = phenotype.data)
## Warning: sampleMap[['assay']] coerced with as.factor()
## A MultiAssayExperiment object of 2 listed
## experiments with user-defined names and respective classes.
## Containing an ExperimentList class object of length 2:
## [1] DNA methylation: RangedSummarizedExperiment with 1687 rows and 234 columns
## [2] Gene expression: RangedSummarizedExperiment with 3808 rows and 234 columns
## Functionality:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample coordination DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
## exportClass() - save data to flat files
# Samples that do not have data for both DNA methylation and gene expression will be removed, including from the phenotype data
phenotype.data <- data.frame(row.names = colnames(not.tcga.met),
primary = colnames(not.tcga.met),
group = c(rep("group1", ncol(Meth)/4),
rep("group2", ncol(Meth)/4),
rep("group3", ncol(Meth)/4),
rep("group4", ncol(Meth)/4)))
data.hg38 <- createMAE(exp = not.tcga.exp,
met = not.tcga.met,
TCGA = FALSE,
save = FALSE,
met.platform = "450K",
genome = "hg38",
colData = phenotype.data)
## Warning: sampleMap[['assay']] coerced with as.factor()
## A MultiAssayExperiment object of 2 listed
## experiments with user-defined names and respective classes.
## Containing an ExperimentList class object of length 2:
## [1] DNA methylation: RangedSummarizedExperiment with 1663 rows and 234 columns
## [2] Gene expression: RangedSummarizedExperiment with 3737 rows and 234 columns
## Functionality:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample coordination DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
## exportClass() - save data to flat files
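The filtering described in the comments above (keeping only samples present in both assays, and trimming the phenotype table accordingly) can be mimicked by hand in base R. This is a sketch with hypothetical toy objects (`exp.toy`, `met.toy`, `pheno.toy`); it is not the internal implementation of createMAE.

```r
# Toy matrices with partially overlapping samples (all values invented).
exp.toy <- matrix(1:6, nrow = 2,
                  dimnames = list(c("gene1", "gene2"), c("s1", "s2", "s3")))
met.toy <- matrix(seq(0.1, 0.8, by = 0.1), nrow = 2,
                  dimnames = list(c("probe1", "probe2"),
                                  c("s2", "s3", "s4", "s5")))
# Phenotype table whose row names are the sample names.
pheno.toy <- data.frame(row.names = colnames(met.toy),
                        primary = colnames(met.toy),
                        group = c("group1", "group1", "group2", "group2"))

# Keep only samples that have both expression and methylation data.
shared <- intersect(colnames(exp.toy), colnames(met.toy))
exp.shared <- exp.toy[, shared, drop = FALSE]
met.shared <- met.toy[, shared, drop = FALSE]
pheno.shared <- pheno.toy[shared, , drop = FALSE]

print(shared)
print(pheno.shared)
```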
Probe information is stored as a GRanges object containing the coordinates and name of each probe on the DNA methylation array. The default probe information is fetched from Wanding Zhou's annotation (Zhou et al. 2016).
## GRanges object with 3 ranges and 8 metadata columns:
## seqnames ranges strand | address_A address_B channel
## <Rle> <IRanges> <Rle> | <integer> <integer> <character>
## cg23733394 chr1 839752-839753 + | 35610394 <NA> Both
## cg23625715 chr1 976172-976173 + | 65764492 71782443 Grn
## cg26222311 chr1 976227-976228 + | 61750445 38637435 Grn
## designType nextBase nextBaseRef probeType orientation
## <character> <character> <character> <character> <character>
## cg23733394 II G/A C cg up
## cg23625715 I C G cg up
## cg26222311 I C G cg up
## -------
## seqinfo: 26 sequences from an unspecified genome; no seqlengths
Gene information is stored as a GRanges object containing the coordinates of each gene, the gene ID, the gene symbol and the gene isoform ID. The default gene information is the Ensembl gene annotation fetched from biomaRt by an ELMER function.
## GRanges object with 3808 ranges and 2 metadata columns:
## seqnames ranges strand | external_gene_name
## <Rle> <IRanges> <Rle> | <character>
## ENSG00000188984 chr1 12776118-12788726 + | AADACL3
## ENSG00000204518 chr1 12704566-12727097 + | AADACL4
## ENSG00000108270 chr17 35306175-35414171 + | AATF
## ENSG00000198691 chr1 94458393-94586688 - | ABCA4
## ENSG00000135776 chr1 229652329-229694442 - | ABCB10
## ... ... ... ... . ...
## ENSG00000132003 chr19 13906274-13943044 + | ZSWIM4
## ENSG00000162415 chr1 45482071-45771881 - | ZSWIM5
## ENSG00000203995 chr1 53308183-53360670 + | ZYG11A
## ENSG00000162378 chr1 53192126-53293014 + | ZYG11B
## ENSG00000036549 chr1 78028101-78149104 - | ZZZ3
## ensembl_gene_id
## <character>
## ENSG00000188984 ENSG00000188984
## ENSG00000204518 ENSG00000204518
## ENSG00000108270 ENSG00000108270
## ENSG00000198691 ENSG00000198691
## ENSG00000135776 ENSG00000135776
## ... ...
## ENSG00000132003 ENSG00000132003
## ENSG00000162415 ENSG00000162415
## ENSG00000203995 ENSG00000203995
## ENSG00000162378 ENSG00000162378
## ENSG00000036549 ENSG00000036549
## -------
## seqinfo: 24 sequences from an unspecified genome; no seqlengths
A Multi Assay Experiment object from the MultiAssayExperiment package is the input for multiple main functions of ELMER. It contains the above components, and creating it with the createMAE function keeps the components consistent with each other. For example, although the DNA methylation and gene expression matrices have different rows (probes for DNA methylation and gene IDs for gene expression), the column (sample) order should be the same in the two matrices. The createMAE function keeps them consistent when it generates the Multi Assay Experiment object.
data <- createMAE(exp = GeneExp,
met = Meth,
genome = "hg19",
save = FALSE,
met.platform = "450K",
TCGA = TRUE)
## Warning: sampleMap[['assay']] coerced with as.factor()
# For TCGA data, characters 1-12 of the barcode identify the patient and characters 1-15 identify the sample (e.g. a primary solid tumor sample)
all(substr(colnames(getExp(data)),1,15) == substr(colnames(getMet(data)),1,15))
## [1] TRUE
# See sample information for data
as.data.frame(colData(data)) %>% datatable(options = list(scrollX = TRUE))
# See sample names for each experiment
as.data.frame(sampleMap(data)) %>% datatable(options = list(scrollX = TRUE))
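The barcode truncation and column-order check used above can be tried on plain character vectors. The barcodes below are hypothetical, written only to follow the TCGA barcode layout; the reordering with `match` is a base R sketch of how two assays could be aligned by hand, not the code createMAE runs.

```r
# Hypothetical TCGA-style barcodes, for illustration only.
exp.cols <- c("TCGA-AA-0001-01A-01R-0000-07", "TCGA-AA-0002-11A-01R-0000-07")
met.cols <- c("TCGA-AA-0002-11A-01D-0000-05", "TCGA-AA-0001-01A-01D-0000-05")

patient    <- substr(exp.cols, 1, 12)  # characters 1-12: patient
sample.exp <- substr(exp.cols, 1, 15)  # characters 1-15: sample
sample.met <- substr(met.cols, 1, 15)

# Position of each expression sample among the methylation columns.
idx <- match(sample.exp, sample.met)
met.cols.ordered <- met.cols[idx]

# After reordering, both assays list the samples in the same order.
print(all(substr(met.cols.ordered, 1, 15) == sample.exp))
```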
You can also use your own data and annotations to create a Multi Assay Experiment object.
# Non-TCGA example: matrices have different column names
gene.exp <- S4Vectors::DataFrame(sample1.exp = c("ENSG00000141510"=2.3,"ENSG00000171862"=5.4),
sample2.exp = c("ENSG00000141510"=1.6,"ENSG00000171862"=2.3))
dna.met <- S4Vectors::DataFrame(sample1.met = c("cg14324200"=0.5,"cg23867494"=0.1),
sample2.met = c("cg14324200"=0.3,"cg23867494"=0.9))
sample.info <- S4Vectors::DataFrame(primary = c("sample1","sample2"),
sample.type = c("Normal", "Tumor"))
sampleMap <- S4Vectors::DataFrame(
assay = c("Gene expression","DNA methylation","Gene expression","DNA methylation"),
primary = c("sample1","sample1","sample2","sample2"),
colname = c("sample1.exp","sample1.met","sample2.exp","sample2.met"))
mae <- createMAE(exp = gene.exp,
met = dna.met,
sampleMap = sampleMap,
met.platform ="450K",
colData = sample.info,
genome = "hg38")
## Warning: sampleMap[['assay']] coerced with as.factor()
# You can also read the sample mapping and sample information tables from TSV files
# The createTSVTemplates function can be used to create the TSV files
readr::write_tsv(as.data.frame(sampleMap), file = "sampleMap.tsv")
readr::write_tsv(as.data.frame(sample.info), file = "sample.info.tsv")
mae <- createMAE(exp = gene.exp,
met = dna.met,
sampleMap = "sampleMap.tsv",
met.platform ="450K",
colData = "sample.info.tsv",
genome = "hg38")
## Warning: sampleMap[['assay']] coerced with as.factor()
## A MultiAssayExperiment object of 2 listed
## experiments with user-defined names and respective classes.
## Containing an ExperimentList class object of length 2:
## [1] DNA methylation: RangedSummarizedExperiment with 2 rows and 2 columns
## [2] Gene expression: RangedSummarizedExperiment with 2 rows and 2 columns
## Functionality:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample coordination DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
## exportClass() - save data to flat files
# Non-TCGA example: matrices have the same column names
gene.exp <- S4Vectors::DataFrame(sample1 = c("ENSG00000141510"=2.3,"ENSG00000171862"=5.4),
sample2 = c("ENSG00000141510"=1.6,"ENSG00000171862"=2.3))
dna.met <- S4Vectors::DataFrame(sample1 = c("cg14324200"=0.5,"cg23867494"=0.1),
sample2= c("cg14324200"=0.3,"cg23867494"=0.9))
sample.info <- S4Vectors::DataFrame(primary = c("sample1","sample2"),
sample.type = c("Normal", "Tumor"))
sampleMap <- S4Vectors::DataFrame(
assay = c("Gene expression","DNA methylation","Gene expression","DNA methylation"),
primary = c("sample1","sample1","sample2","sample2"),
colname = c("sample1","sample1","sample2","sample2")
)
mae <- createMAE(exp = gene.exp,
met = dna.met,
sampleMap = sampleMap,
met.platform ="450K",
colData = sample.info,
genome = "hg38")
## Warning: sampleMap[['assay']] coerced with as.factor()
## A MultiAssayExperiment object of 2 listed
## experiments with user-defined names and respective classes.
## Containing an ExperimentList class object of length 2:
## [1] DNA methylation: RangedSummarizedExperiment with 2 rows and 2 columns
## [2] Gene expression: RangedSummarizedExperiment with 2 rows and 2 columns
## Functionality:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample coordination DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
## exportClass() - save data to flat files
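When there are many samples, typing the sample map by hand becomes error-prone. The sketch below builds the same three columns (assay, primary, colname) programmatically in base R. It assumes a hypothetical naming convention in which assay column names are the primary sample name plus a `.exp` or `.met` suffix, as in the first non-TCGA example; other conventions would need a different rule to derive `primary`.

```r
# Column names of the two assays (hypothetical suffix convention).
exp.cols <- c("sample1.exp", "sample2.exp")
met.cols <- c("sample1.met", "sample2.met")

# One row per assay column; the primary name is recovered by stripping the suffix.
sample.map <- data.frame(
  assay   = rep(c("Gene expression", "DNA methylation"),
                times = c(length(exp.cols), length(met.cols))),
  primary = sub("\\.(exp|met)$", "", c(exp.cols, met.cols)),
  colname = c(exp.cols, met.cols),
  stringsAsFactors = FALSE
)

# Sanity check: every primary sample should appear once per assay.
print(table(sample.map$primary, sample.map$assay))
```

The row order differs from the hand-written table above (all expression rows come first), which should not matter because each row is self-describing.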