Recent advances in biotechnology are introducing new sources of biological information. As a result, developers need to create classes to properly storage and manage these new kinds of data.
MultiDataSet
has methods to deal with three common
datasets: gene expression, SNPs data, and DNA methylation. Gene
expression from an ExpressionSet
can be added to a
MultiDataSet
using add_genexp
(for
microarrays) or add_rnaseq
(for RNAseq) functions. SNP data
can be incorporated into a MultiDataSet
by using the
function add_snp
. DNA methylation encapsulated in a
MethylationSet
object can be added into a
MultiDataSet
using add_methy
. In addition,
MultiDataSet
is also able to work with any other class of
objects based on eSet
or SummarizedExperiment
,
two general classes of Bioconductor framework. Consequently, developers
can implement methods to expand MultiDataSet
to work with
their own classes.
In this document, we show how to create a new method to add a new
class of objects into MultiDataSet
. This process is
exemplified by creating a new class to store proteomic data. To this
end, we will extend the eSet
class. It should be noticed
that the process would be very similar if the class was based on
SummarizedExperiment
.
The objective of this document is to illustrate how to create a
method to add a new class of object to MultiDataSet
. This
tutorial is meant for software developers who have developed a new class
to manage any new omic data or another type of information to be
included in MultiDataSet
objects.
Proteome data is commonly represented as a matrix of protein’s levels. This data has a special characteristic: some of the information cannot be measure due to the limit of detection (LOD) and limit of quantification (LOQ). Having LOD information in the data matrix can be crucial when performing statistical analyses. Considering proteins as the outcome requires using different statistical methods than those used for analyzing, for instance, gene expression data. LOD makes that proteins are left-censored variables making impossible the use of standard methods of analysis such as linear regression or t-test.
One approach that can be adopted to deal with this type of data is to
assign LOD/2 to those values that are below LOD. This will allow the
user to analyze protein data using standard packages such as
limma
, which uses linear models. However, this approach is
biased. Other methods have been developed to properly analyze
left-censored variables. Those methods require knowing the LOD of each
protein. Therefore, having information about both protein’s levels and
LOD is crucial for downstream analyses. Currently, a class that stores
protein data with LOD does not exist. To solve this issue, we propose to
create ProteomeSet
, a new class based on eSet
.
Our new class will contain the raw protein levels, information about LOD
and protein levels having data below LOD equal to LOD/2.
We will begin by defining the new class: ProteomeSet
.
Second, we will develop a function to load the protein data and the LOD
into R and to create our ProteomeSet
. Third, we will
implement a method for MultiDataSet
to add
ProteomeSet
s. Finally, we will show the application of the
code by creating a MultiDataSet
with protein data.
We have chosen to extend eSet
to implement our
ProteomeSet
because we can take profit of
eSet
s’ methods and structure. Therefore, our
ProteomeSet
will also have the phenotype and feature data
as well as methods to retrieve data. Given that eSet
is
defined in Biobase package, we should load it prior to the definition of
ProteomeSet
:
library(Biobase)
setClass (
Class = "ProteomeSet",
contains = "eSet",
prototype = prototype(new("VersionedBiobase",
versions = c(classVersion("eSet"), ProteomeSet = "1.0.0")))
)
This ProteomeSet
is defined as another eSet
object. As previously mentioned, proteome data should contain some
specific features. The setValidity
function defines the
requirements that an object must accomplish to be valid. assayData of
ProteomeSet
objects should have two slots to encapsulate
both raw and modified data (e.g. values below LOD replaced by LOD/2).
These slots will be called raw
and prots
,
respectively. ProteomeSet
should also contain information
about LOD and LOQ. This data will be obtained from the columns
LoD.T
and LoD.B
available as featureData. We
can introduce these requirements with the following lines of code:
setValidity("ProteomeSet", function(object) {
## Check that object has the slots 'prots' and 'raw' in assayData
msg <- validMsg(NULL, assayDataValidMembers(assayData(object), c("prots", "raw")))
## Check that objects has the columns 'LoD.T' and 'LoD.B' in featureData
msg <- validMsg(msg, ifelse(any(!c("LoD.T", "LoD.B") %in% fvarLabels(object)), "fData must contain columns 'LoD.T' and 'LoD.B'", FALSE))
if (is.null(msg)){
TRUE
} else{
msg
}
})
In this subsection, we have covered the essentials of extending a
class based on eSet
. Readers interested in more advanced
features can find more information about extending R classes and
extending eSets.
Here, we create a function that loads proteome data from a text file,
replaces values below LOD and above LOQ and returns a
ProteomeSet
with the available data. Correction of
limit of detection is commonly defined as follows:
The function read_ldset
will perform this task. It
requires four arguments:
assayFile
: (character) A path to the proteome’
measurements file.phenoFile
: (character) A path to the samples’ phenotype
file.featureFile
: (character) A path to the features’
annotation file.sep
: (character) Indicates the field separator
character of the three files above.The three input files need to be TSV style (TSV: tab-separated file)
and must include a header. assayFile
and
phenoFile
must have a column called sample
with a unique sample id. featureFile
must have a column
called feature
with the unique feature id, that must be
equal to assayFile
’s columns names. Moreover,
featureFile
must have two columns called LoD.B
and LoD.T
, corresponding to the LOD (bottom limit of
detection) and the LOQ (top limit of detection) of each
protein.
read_lds
checks that the features’ names are the same in
assay data and in feature data. It also checks that feature data has the
two columns for the limits of detection. After performing the
two checks, read_ldset
creates the matrix with the updated
level of expression of each protein. Then a ProteomeSet
is
created, containing the raw matrix as raw
and the updated
matrix as prots
. The phenotypic data and feature’s
annotations are also included:
read_ldset <- function(assayFile, phenoFile, featureFile, sep="\t") {
## Load the threes files that will be used to create the ProteomeSet
adata <- read.delim(assayFile, sep=sep, header=TRUE, row.names="sample")
pdata <- read.delim(phenoFile, sep=sep, header=TRUE, row.names="sample")
fdata <- read.delim(featureFile, sep=sep, header=TRUE, row.names="feature")
## /
## Check that proteins in assay data are the same in feature data
if(!identical(colnames(adata), rownames(fdata))) {
stop("Features names in assay data (columns) are not equal to ",
"features names in feature data (rownames).")
}
##/
## Check that feature data include columns LoD.B and LoD.T
if(sum(c("LoD.T", "LoD.B") %in% colnames(fdata)) != 2) {
stop("Feature data must contain two columns labeled 'LoD.T' (top ",
"limit of dectection) and 'LoD.B (bottom limit of dectection)")
}
## /
## Perform the transformation of the protein level of expression
low <- fdata[colnames(adata), "LoD.B"]
up <- fdata[colnames(adata), "LoD.T"]
names(low) <- names(up) <- rownames(fdata)
faux <- function(x, low, up) {
x[x < low] <- as.double(low / 2)
x[x > up] <- as.double(up * 1.5)
x
}
tadata <- mapply(FUN = faux, x = as.list(adata), low = as.list(low), up = as.list(up))
dimnames(tadata) <- dimnames(adata)
## /
## Create the ExpressionSet with the two matrices
prot <- new("ProteomeSet",
assayData = assayDataNew("environment", prots = t(tadata), raw = t(adata)),
phenoData = AnnotatedDataFrame(pdata),
featureData = AnnotatedDataFrame(fdata)
)
## /
## Check that the new ProteomeSet is valid
validObject(prot)
## /
return(prot)
}
So far, we have developed a function to define a new class of objects
to encapsulate proteomic data. In this section, we will show how to add
ProteomeSet
objects to MultiDataSet
. To do so,
we will create a generic method for adding proteins
(add_prot
) to MultiDataSet
and its
implementation using add_eset
.
The method add_prot
for MultiDataSet
will
accept two arguments: a MultiDataSet
and a
ProteomeSet
. Following S4 development rules, a new generic
method for add_prot
needs to be set:
setGeneric("add_prot", function(object, protSet, warnings = TRUE, ...)
standardGeneric("add_prot")
)
## [1] "add_prot"
In the definition of add_prot
, we can see the three main
arguments of this function. object
is the
MultiDataSet
where we will add the
ProteomeSet
. protSet
is the new
ProteomeSet
that will be added. Finally,
warnings
is a flag to indicate if warnings are
displayed.
The following code shows the implementation of add_prot
.
In the signature, we specify that the first argument should be a
MultiDataSet
and the second a ProteomeSet
. If
any other class is passed to the function, an error will be returned. In
the code of the function, we see only two lines: a call to
add_eset
and the return of the object.
library(MultiDataSet)
setMethod(
f = "add_prot",
signature = c("MultiDataSet", "ProteomeSet"),
definition = function(object, protSet, warnings = TRUE, ...) {
## Add given ProteomeSet as 'proteome'
object <- MultiDataSet::add_eset(object, protSet, dataset.type = "proteome", GRanges = NA, ...)
## /
return(object)
}
)
Basic method add_eset
is used to add eSet
derived-classes to MultiDataSet
and accepts several
arguments. Four of them are used for implementing add_prot
.
As mentioned above, the first and the second are the
MultiDataSet
object where the proteins will be added and
ProteomeSet
with the proteins data. The third,
dataset.type
, is used to tag the type of omics data that
add_prot
is adding to the MultiDataSet
. This
argument is set to "proteome"
in add_prot
,
"expression"
in add_genexp
and
"snps"
in add_snps
. The fourth argument,
GRanges
is set to NA
. This argument can take
two type of values: a GRanges
object with the equivalent
content of the fData
included into the
ExpressionSet
or NA
in case no genomic
coordinates are available for the set’s features. In this case, since we
are working with proteins, no genomic coordinates are given.
For illustration purposes, we have created three tsv-dummy files that
are used in the following code to create a ProteomeSet
using read_ldset
function. These files are available in the
Supplementary Material of the manuscript
## Create a ProteomeSet with protein data
ps <- read_ldset(assayFile="assay_data.tsv",
phenoFile="pheno_data.tsv",
featureFile="feature_data.tsv"
)
ps
## ProteomeSet (storageMode: environment)
## assayData: 5 features, 33 samples
## element names: prots, raw
## protocolData: none
## phenoData
## sampleNames: sp001 sp002 ... sp035 (33 total)
## varLabels: gender plate kit
## varMetadata: labelDescription
## featureData
## featureNames: Adiponectin CRP ... APO.E (5 total)
## fvarLabels: LoD.T LoD.B unit
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
The created object ps
is a ProteomeSet
and
it contains two elements called prots
and raw
.
Moreover, ps
’s feature data contains two columns called
LoD.B
and LoD.T
.
Now that the proteome data is loaded and stored in a
ProteomeSet
, we can add it to a new
MultiDataSet
. MultiDataSet
objects can be
created using the constructor createMultiDataSet
. Then the
method add_prot
is used to include the proteome data to
md
:
The method names
of MultiDataSet
shows the
datasets stored in the MultiDataSet
.
MultiDataSet
stores datasets calling them by its data
type.
## [1] "proteome"
Finally, the show of the object gives more information
related to the stored in the MultiDataSet
:
## Object of class 'MultiDataSet'
## . assayData: 1 elements
## . proteome: 5 features, 33 samples
## . featureData:
## . proteome: 5 rows, 3 cols (LoD.T, ..., LoD.B)
## . rowRanges:
## . proteome: NO
## . phenoData:
## . proteome: 33 samples, 4 cols (gender, ..., kit)
The name of the set is shown (proteome
), the number of
proteins (5 rows in feature data), the number of samples (33 samples in
pheno data) and, since no GRanges
was provided,
rowRanges
is NO
.