mice
To illustrate how to perform a multiple imputation using
mice
we start loading both rexposome
and
mice
libraries.
The we load the txt
files includes in
rexposome
package so we can load the exposures and see the
amount of missing data (check vignette Exposome Data Analysis
for more information).
The following lines locates where the txt
files were
installed.
path <- file.path(path.package("rexposome"), "extdata")
description <- file.path(path, "description.csv")
phenotype <- file.path(path, "phenotypes.csv")
exposures <- file.path(path, "exposures.csv")
Once the files are located we load them as
data.frames
:
dd <- read.csv(description, header=TRUE, stringsAsFactors=FALSE)
ee <- read.csv(exposures, header=TRUE)
pp <- read.csv(phenotype, header=TRUE)
In order to speed up the imputation process that will be carried in this vignette, we will remove four families of exposures.
dd <- dd[-which(dd$Family %in% c("Phthalates", "PBDEs", "PFOAs", "Metals")), ]
ee <- ee[ , c("idnum", dd$Exposure)]
We can check the amount of missing data in both exposures and
phenotypes data.frames
:
## Set Count
## 1 Exposures 304
## 2 Phenotypes 5
Before running mice
, we need to collapse both the
exposures and the phenotypes in a single data.frame
.
rownames(ee) <- ee$idnum
rownames(pp) <- pp$idnum
dta <- cbind(ee[ , -1], pp[ , -1])
dta[1:3, c(1:3, 52:56)]
## DDE DDT HCB birthdate sex age cbmi blood_pre
## id001 NA NA NA 2004-12-29 male 4.2 16.3 120
## id002 1.713577 0.6931915 1.270750 2005-01-05 male 4.2 16.4 121
## id003 2.594590 0.7448906 2.205519 2005-01-05 male 4.2 19.0 120
Once this is done, the class of each column needs to be set,
so mice
will be able to differentiate between continuous
and categorical exposures.
for(ii in c(1:13, 18:47, 55:56)) {
dta[, ii] <- as.numeric(dta[ , ii])
}
for(ii in c(14:17, 48:54)) {
dta[ , ii] <- as.factor(dta[ , ii])
}
With this data.frame
we perform the imputation calling
mice
functions (for more information about this call, check
mice
’s vignette). We remove the columns birthdate
since it is not necessary for the imputations and carries lots of
categories.
imp <- mice(dta[ , -52], pred = quickpred(dta[ , -52], mincor = 0.2,
minpuc = 0.4), seed = 38788, m = 5, maxit = 10, printFlag = FALSE)
## Warning: Number of logged events: 223
## [1] "mids"
The created object imp
, that is an object of class
mids
contains 20 data-sets with the imputed exposures and
the phenotypes. To work with this information we need to extract each
one of these sets and create a new data-set that includes all of them.
This new data.frame
will be passed to
rexposome
(check next section to see the requirements).
mice
package includes the function complete
that allows to extract a single data-set from an object of class
mids
. We will use this function to extract the sets and
join them in a single data.frame
.
If we set the argument action
of the
complete
function to 0
, it will return the
original data:
## [1] 109 57
## H_pesticides Benzene
## 0 :68 Min. :-0.47427
## 1 :35 1st Qu.:-0.19877
## NA's: 6 Median :-0.11975
## Mean :-0.12995
## 3rd Qu.:-0.06879
## Max. : 0.13086
## NA's :3
If the action
number is between 1 and the m
value, it will return the selected set.
for(set in 1:5) {
im <- complete(imp, action = set)
im[ , ".imp"] <- set
im[ , ".id"] <- rownames(im)
me <- rbind(me, im)
}
me <- me[ , c(".imp", ".id", colnames(me)[-(97:98)])]
rownames(me) <- 1:nrow(me)
dim(me)
## [1] 654 59
The format of the multiple imputation data for rexposome
needs to follow some restrictions:
data.frame
.data.frame
must have a column called
.imp
indicating the number of imputation. This imputation
tagged as 0
are raw exposures (no imputation).data.frame
must have a column called
.id
indicating the name of samples. This will be converted
to character.data.frame
with the description with the
relation between exposures and families.imExposomeSet
With the exposome data.frame
and the description
data.frame
an object of class imExposomeSet
can be created. To this end, the function loadImputed
is
used:
ex_imp <- loadImputed(data = me, description = dd,
description.famCol = "Family",
description.expCol = "Exposure")
The function loadImputed
has several arguments:
## function (data, description, description.famCol = "family", description.expCol = "exposure",
## exposures.asFactor = 5, warnings = TRUE)
## NULL
The argument data
is filled with the
data.frame
of exposures. The argument
decription
with the data.frame
with the
exposures’ description. description.famCol
indicates the
column on the description that corresponds to the family.
description.expCol
indicates the column on the description
that corresponds to the exposures. Finally,
exposures.asFactor
indicates that the exposures with less
that, by default, five different values are considered categorical
exposures, otherwise continuous.
## Object of class 'imExposomeSet'
## . exposures description:
## . categorical: 4
## . continuous: 43
## . #imputations: 6 (raw detected)
## . assayData: 47 exposures 109 individuals
## . phenoData: 109 individuals 12 phenotypes
## . featureData: 47 exposures 3 explanations
The output of this object indicates that we loaded 14 exposures, being 13 continuous and 1 categorical.
The class ExposomeSet
has several accessors to get the
data stored in it. There are four basic methods that returns the names
of the individuals (sampleNames
), the name of the exposures
(exposureNames
), the name of the families of exposures
(familyNames
) and the name of the phenotypes
(phenotypeNames
).
## [1] "id001" "id002" "id003" "id004" "id005" "id006"
## [1] "DDE" "DDT" "HCB" "bHCH" "PCB118" "PCB153"
## [1] "Organochlorines" "Bisphenol A" "Water Pollutants"
## [4] "Cotinine" "Home Environment" "Air Pollutants"
## [7] "Built Environment" "Noise" "Temperature"
## [1] "whistling_chest" "flu" "rhinitis" "wheezing"
## [5] "sex" "age" "cbmi" "blood_pre"
## [9] ".imp.1" ".id.1"
fData
will return the description of the exposures
(including internal information to manage them).
## DataFrame with 3 rows and 4 columns
## Family Exposure Name .type
## <character> <character> <character> <character>
## DDE Organochlorines DDE Dichlorodiphenyldich.. numeric
## DDT Organochlorines DDT Dichlorodiphenyltric.. numeric
## HCB Organochlorines HCB Hexachlorobenzene numeric
pData
will return the phenotypes information.
## DataFrame with 3 rows and 12 columns
## .imp .id whistling_chest flu rhinitis wheezing sex
## <numeric> <character> <factor> <factor> <factor> <factor> <factor>
## 1 0 id001 never no no no male
## 2 0 id002 never no no no male
## 3 0 id003 7-12 epi no no yes male
## age cbmi blood_pre .imp.1 .id.1
## <factor> <numeric> <numeric> <numeric> <character>
## 1 4.2 16.3 120 0 id001
## 2 4.2 16.4 121 0 id002
## 3 4.2 19.0 120 0 id003
The behavior of the exposures through the imputation process can be
studies using the plotFamily
method. This method will draw
the behavior of the exposures in each imputation set in a single
chart.
The method required an argument family
and it will draw
a mosaic with the plots from the exposures within the family. Following
the same strategy than using an ExposomeSet
, when the
exposures are continuous box-plots are used.
## Warning: Removed 104 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
For categorical exposures, the method draws accumulated bar-plot:
The arguments group
and na.omit
are not
available when plotFamily
is used with an
imExposomeSet
.
ExposomeSet
from an
imExposomeSet
Once an imExposomeSet
is created, an
ExposomeSet
can be obtained by selecting one of the
internal imputed-sets. This is done using the method toES
and setting the argument rid
with the number of the
imputed-set to use:
## Object of class 'ExposomeSet' (storageMode: environment)
## . exposures description:
## . categorical: 4
## . continuous: 43
## . exposures transformation:
## . categorical: 0
## . transformed: 0
## . standardized: 0
## . imputed: 0
## . assayData: 47 exposures 109 individuals
## . element names: exp
## . exposures: AbsPM25, ..., Temp
## . individuals: id001, ..., id108
## . phenoData: 109 individuals 10 phenotypes
## . individuals: id001, ..., id108
## . phenotypes: whistling_chest, ..., .imp.1
## . featureData: 47 exposures 8 explanations
## . exposures: AbsPM25, ..., Temp
## . descriptions: Family, ..., .imp
## experimentData: use 'experimentData(object)'
## Annotation:
## Object of class 'ExposomeSet' (storageMode: environment)
## . exposures description:
## . categorical: 4
## . continuous: 43
## . exposures transformation:
## . categorical: 0
## . transformed: 0
## . standardized: 0
## . imputed: 0
## . assayData: 47 exposures 109 individuals
## . element names: exp
## . exposures: AbsPM25, ..., Temp
## . individuals: id001, ..., id108
## . phenoData: 109 individuals 10 phenotypes
## . individuals: id001, ..., id108
## . phenotypes: whistling_chest, ..., .imp.1
## . featureData: 47 exposures 8 explanations
## . exposures: AbsPM25, ..., Temp
## . descriptions: Family, ..., .imp
## experimentData: use 'experimentData(object)'
## Annotation:
The interesting point on working with multiple imputations is to test
the association of the different version of the exposures with a target
phenotype. rexposome
implements the method
exwas
to be used with an imExposomeSet
.
## An object of class 'ExWAS'
##
## ~ blood_pre sex + age
##
## Tested exposures: 47
## Threshold for effective tests (TEF): 2.43e-03
## . Tests < TEF: 7
## Robust standar errors: Computed
As usual, the object obtained from exwas
method can be
plotted using plotExwas
:
clr <- rainbow(length(familyNames(ex_imp)))
names(clr) <- familyNames(ex_imp)
plotExwas(as_iew, color = clr)
The method extract
allows to obtain a table of P-Values
from an ExWAS
object. At the same time, the
tef
method allows to obtain the threshold of effective
tests computed at exwas
. We can use them combined in
order to create a table with the P-Value of the exposures that are
beyond the threshold of efective tests.
## [1] 0.002426352
## DataFrame with 11 rows and 4 columns
## pvalue effect X2.5 X97.5
## <numeric> <numeric> <numeric> <numeric>
## AbsPM25 5.26794e-05 20.09668 10.68607 29.50729
## NOx 7.96448e-05 12.85241 6.69541 19.00941
## NO 8.27436e-05 10.67213 5.55743 15.78682
## NO2 9.90808e-05 15.06292 7.77227 22.35356
## PM25CU 4.00711e-04 11.39992 5.23645 17.56340
## PM25 5.62435e-04 35.76444 16.11183 55.41704
## Temp 1.11891e-03 88.57154 36.30290 140.84017
## PCB180 1.37441e-03 4.00055 1.59345 6.40765
## PCB153 1.67554e-03 3.99374 1.54289 6.44459
## PM10Fe 2.07052e-03 13.57084 5.06769 22.07398
## PCB138 2.21163e-03 4.48011 1.65837 7.30184
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] mice_3.17.0 ggplot2_3.5.1 rexposome_1.29.0
## [4] Biobase_2.67.0 BiocGenerics_0.53.3 generics_0.1.3
## [7] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-9 gridExtra_2.3 formatR_1.14
## [4] sandwich_3.1-1 rlang_1.1.4 magrittr_2.0.3
## [7] ggridges_0.5.6 compiler_4.4.2 reshape2_1.4.4
## [10] vctrs_0.6.5 stringr_1.5.1 pkgconfig_2.0.3
## [13] shape_1.4.6.1 fastmap_1.2.0 backports_1.5.0
## [16] labeling_0.4.3 caTools_1.18.3 rmarkdown_2.29
## [19] nloptr_2.1.1 purrr_1.0.2 jomo_2.7-6
## [22] xfun_0.49 glmnet_4.1-8 cachem_1.1.0
## [25] jsonlite_1.8.9 flashClust_1.01-2 gmm_1.8
## [28] pan_1.9 pryr_0.1.6 broom_1.0.7
## [31] cluster_2.1.8 R6_2.5.1 RColorBrewer_1.1-3
## [34] bslib_0.8.0 stringi_1.8.4 boot_1.3-31
## [37] rpart_4.1.23 jquerylib_0.1.4 estimability_1.5.1
## [40] Rcpp_1.0.13-1 iterators_1.0.14 knitr_1.49
## [43] zoo_1.8-12 base64enc_0.1-3 Matrix_1.7-1
## [46] splines_4.4.2 nnet_7.3-19 tidyselect_1.2.1
## [49] rstudioapi_0.17.1 yaml_2.3.10 gplots_3.2.0
## [52] codetools_0.2-20 plyr_1.8.9 lattice_0.22-6
## [55] tibble_3.2.1 withr_3.0.2 evaluate_1.0.1
## [58] tmvtnorm_1.6 foreign_0.8-87 survival_3.8-3
## [61] norm_1.0-11.1 circlize_0.4.16 pillar_1.10.0
## [64] BiocManager_1.30.25 KernSmooth_2.23-24 corrplot_0.95
## [67] checkmate_2.3.2 DT_0.33 foreach_1.5.2
## [70] stats4_4.4.2 S4Vectors_0.45.2 munsell_0.5.1
## [73] scales_1.3.0 minqa_1.2.8 gtools_3.9.5
## [76] leaps_3.2 glue_1.8.0 emmeans_1.10.6
## [79] scatterplot3d_0.3-44 Hmisc_5.2-1 maketools_1.3.1
## [82] tools_4.4.2 lsr_0.5.2 sys_3.4.3
## [85] data.table_1.16.4 lme4_1.1-35.5 imputeLCMD_2.1
## [88] buildtools_1.0.0 mvtnorm_1.3-2 grid_4.4.2
## [91] impute_1.81.0 tidyr_1.3.1 colorspace_2.1-1
## [94] nlme_3.1-166 htmlTable_2.4.3 Formula_1.2-5
## [97] cli_3.6.3 dplyr_1.1.4 pcaMethods_1.99.0
## [100] gtable_0.3.6 sass_0.4.9 digest_0.6.37
## [103] ggrepel_0.9.6 farver_2.1.2 FactoMineR_2.11
## [106] htmlwidgets_1.6.4 htmltools_0.5.8.1 lifecycle_1.0.4
## [109] multcompView_0.1-10 mitml_0.4-5 GlobalOptions_0.1.2
## [112] MASS_7.3-61