The QTLExperiment
class is a Bioconductor container for
storing and manipulating QTL summary statistics (e.g., effect sizes,
standard errors, significance metrics) from one or more states (e.g.,
tissues, cell-types, environmental conditions). It extends the
RangedSummarizedExperiment
class (from the SummarizedExperiment
package), where rows represent QTL associations, columns represent
states, and assays contain the various summary statistics. Associations
are identified by the feature and variant tested in the format
feature_id|variant_id
. For example, a QTL association
between the gene ENSG00000103888 and the SNP variant rs112057726 would
be stored as ENSG00000103888|rs112057726. This package also provides
convenient and robust methods for loading, merging, and subsetting
multi-state QTL data.
QTLExperiment
objects can be created manually by passing
required information to the QTLExperiment
function. The
assays (i.e., betas, error, and other user defined
assays) can be provided as a named list or as a
SummarizedExperiment
object. All assays should contain the
same rows and columns. Important metadata (i.e., state_id,
feature_id, and variant_id) can be inferred from the
input data or provided directly. If not provided,
QTLExperiment
will use the assay column names as state IDs
and will look for the row names to contain the feature IDs and the
variant IDs separated by a pipe (“|”). For example:
set.seed(42)
nStates <- 6
nQTL <- 40
mock_summary_stats <- mockSummaryStats(nStates=nStates, nQTL=nQTL)
mock_summary_stats$betas[1:5,1:5]
## state1 state2 state3 state4 state5
## geneA|snp36974 1.3709584 0.2059986 1.51270701 -1.493625067 -0.1755259
## geneC|snp96738 -0.5646982 -0.3610573 0.25792144 -1.470435741 -1.0717824
## geneB|snp9285 0.3631284 0.7581632 0.08844023 0.124702386 0.1632069
## geneB|snp66252 0.6328626 -0.7267048 -0.12089654 -0.996639135 -0.3627384
## geneB|snp7450 0.4042683 -1.3682810 -1.19432890 -0.001822614 0.5900135
With input data in this format, the qtle
object can be
generated as shown below.
qtle <- QTLExperiment(
assays=list(
betas=mock_summary_stats$betas,
errors=mock_summary_stats$errors),
metadata=list(study="mock-example"))
qtle
## class: QTLExperiment
## dim: 40 6
## metadata(1): study
## assays(2): betas errors
## rownames(40): geneA|snp36974 geneC|snp96738 ... geneB|snp50164
## geneC|snp16503
## rowData names(2): variant_id feature_id
## colnames(6): state1 state2 ... state5 state6
## colData names(1): state_id
Alternatively, the state IDs, feature IDs, and variant IDs can be provided manually. Note that in this mode the user must ensure the rows and columns of all assays are in the order of the IDs provided!
mock_summary_stats <- mockSummaryStats(nStates=nStates, nQTL=nQTL, names=FALSE)
mock_summary_stats$betas[1:5,1:5]
## [,1] [,2] [,3] [,4] [,5]
## [1,] -1.1872946 -0.8664769 1.93253557 -0.2675735 1.2216340
## [2,] -0.2035310 1.4811912 0.25001404 1.1926643 0.6262407
## [3,] -1.0591162 -0.3604593 0.74765249 1.5921706 1.0063578
## [4,] 0.6613011 1.1137537 1.00108017 -0.9963816 -0.3485696
## [5,] -0.3329064 0.1817592 -0.07813849 -1.0139292 -1.1513481
qtle <- QTLExperiment(assays=list(betas=mock_summary_stats$betas,
errors=mock_summary_stats$errors),
feature_id=paste0("gene_", 1:nQTL),
variant_id=sample(1:1e6, nQTL),
state_id=paste0("state_", LETTERS[1:nStates]),
metadata=list(study="mock-example2"))
qtle
## class: QTLExperiment
## dim: 40 6
## metadata(1): study
## assays(2): betas errors
## rownames(40): gene_1|75729 gene_2|864550 ... gene_39|763732
## gene_40|186281
## rowData names(2): variant_id feature_id
## colnames(6): state_A state_B ... state_E state_F
## colData names(1): state_id
A mock QTLExperiment
object can also be generated
automatically using the mockQTLe
function:
## class: QTLExperiment
## dim: 40 6
## metadata(0):
## assays(3): betas errors pvalues
## rownames(40): geneB|snp48287 geneC|snp98994 ... geneC|snp76205
## geneB|snp88707
## rowData names(2): variant_id feature_id
## colnames(6): state1 state2 ... state5 state6
## colData names(2): state_id sample_size
The sumstats2qtle()
function is a generic function to
load QTL summary statistics from multiple files, where each file
represents a state, and convert them into a QTLExperiment
object. Because different QTL mapping software produce summary
statistics in different formats, this function is flexible, allowing
users to specify the column name or index where required data is stored.
This function can also utilize parallel processing (cores available are
automatically detected, see the vroom
documentation for details.
Required arguments:
state
;
value=path
) or matrix with two columns (state
and path
) where the state is the desired name and the path
is the full local path or the weblink to the summary statistics for that
state.Optional arguments: - na.rm: Logical. To remove QTL tests (rows) with missing data for any state. - pval: The index or name of the column in the summary statistic files containing test statistics. - n_max: The number of rows to read from each file
Note that vroom does not handle compressed files well, and will not load all rows for large files. It is best to provide vroom with uncompressed objects for this reason.
As an example, we can load summary statistics from the EBI eQTL database.
Transcript usage data for lung, thyroid, spleen and blood are available
in the inst/extdata
folder. This data is licensed under the
Creative Commons Attribution 4.0 International License. See
inst/script/data_processing.R
for details about how these
data files were obtained from the database.
input_path <- system.file("extdata", package="QTLExperiment")
state <- c("lung", "thyroid", "spleen", "blood")
input <- data.frame(
state=state,
path=paste0(input_path, "/GTEx_tx_", state, ".tsv"))
qtle4 <- sumstats2qtle(
input,
feature_id="molecular_trait_id",
variant_id="rsid",
betas="beta",
errors="se",
pvalues="pvalue",
verbose=TRUE)
qtle4
## class: QTLExperiment
## dim: 1163 4
## metadata(0):
## assays(3): betas errors pvalues
## rownames(1163): ENST00000428771|rs554008981 ENST00000477976|rs554008981
## ... ENST00000445118|rs368254419 ENST00000483767|rs368254419
## rowData names(2): variant_id feature_id
## colnames(4): lung thyroid spleen blood
## colData names(1): state_id
## lung thyroid spleen blood
## ENST00000428771|rs554008981 -0.1733690 NA 0.134913 NA
## ENST00000477976|rs554008981 0.1616170 0.3173110 NA NA
## ENST00000483767|rs554008981 -0.4161480 -0.0483018 NA -0.204647
## ENST00000623070|rs554008981 -0.1137930 NA NA NA
## ENST00000669922|rs554008981 -0.1921680 -0.1067540 0.724622 -0.117424
## ENST00000428771|rs201055865 -0.0630909 NA NA NA
A convenience function is also available to convert mash
objects (generated here using the simple_sims function from mashr) into
QTLe
objects.
mashr_sim <- mockMASHR(nStates=nStates, nQTL=nQTL)
qtle2 <- mash2qtle(
mashr_sim,
rowData=DataFrame(feature_id=row.names(mashr_sim$Bhat),
variant_id=sample(seq(1:1e5), nQTL)))
qtle2
## class: QTLExperiment
## dim: 40 6
## metadata(0):
## assays(2): betas errors
## rownames(40): geneA|snp79754|12127 geneA|snp87571|72956 ...
## geneA|snp56954|49069 geneB|snp81574|67904
## rowData names(2): variant_id feature_id
## colnames(6): state1 state2 ... state5 state6
## colData names(1): state_id
Any operation that can be applied to a RangedSummarizedExperiment is
also applicable to any instance of a QTLExperiment
. This
includes access to assay data via assay(), column metadata with
colData(), etc.
## [1] 1163 4
## [1] "lung" "thyroid" "spleen" "blood"
## DataFrame with 6 rows and 2 columns
## variant_id feature_id
## <character> <character>
## ENST00000428771|rs554008981 rs554008981 ENST00000428771
## ENST00000477976|rs554008981 rs554008981 ENST00000477976
## ENST00000483767|rs554008981 rs554008981 ENST00000483767
## ENST00000623070|rs554008981 rs554008981 ENST00000623070
## ENST00000669922|rs554008981 rs554008981 ENST00000669922
## ENST00000428771|rs201055865 rs201055865 ENST00000428771
## [1] 1163 8
## [1] 2326 4
## [1] 13 6
## [1] 40 2
## [1] 40 1
The QTLExperiment
assays can be viewed and manipulated
using appropriate getter and setter methods. For common assay types
(i.e., betas, errors, pvalues, and lfsrs), convenient getter and setters
are available. For example, the betas getter and setter is shown
here:
## state1 state2 state3 state4 state5
## geneB|snp48287 1.69640227 0.03453868 0.3160142 1.11069222 -0.6743804
## geneC|snp98994 1.54284685 0.28629965 -1.0381128 2.14072073 -2.3030232
## geneC|snp82890 0.21452759 1.18620809 -0.7551373 -0.83097181 -0.1493329
## geneA|snp37049 1.70068830 -0.03754937 -0.4915043 0.07640586 0.9486896
## geneB|snp97397 -0.00798683 0.42037159 0.1472257 0.93582445 1.1563625
## state1 state2 state3 state4 state5
## geneB|snp48287 -1.69640227 -0.03453868 -0.3160142 -1.11069222 0.6743804
## geneC|snp98994 -1.54284685 -0.28629965 1.0381128 -2.14072073 2.3030232
## geneC|snp82890 -0.21452759 -1.18620809 0.7551373 0.83097181 0.1493329
## geneA|snp37049 -1.70068830 0.03754937 0.4915043 -0.07640586 -0.9486896
## geneB|snp97397 0.00798683 -0.42037159 -0.1472257 -0.93582445 -1.1563625
Users or downstream tools may add and see additional assays to the
QTLe
object using generic getter and setter methods. For
example, to store information about what QTL are significant, we could
add a new assay called significant using the generic setter
method and then look at the data in the new assay using the generic
getter method:
## lung thyroid spleen blood
## ENST00000428771|rs554008981 FALSE NA FALSE NA
## ENST00000477976|rs554008981 FALSE FALSE NA NA
## ENST00000483767|rs554008981 TRUE FALSE NA FALSE
## ENST00000623070|rs554008981 FALSE NA NA NA
## ENST00000669922|rs554008981 FALSE FALSE TRUE FALSE
Because the feature, variant, and state IDs are critical for the
continuity of a QTLExperiment
object, special getters and
setters are used to ensure changes to avoid unintentional mislabeling
and to make sure that changes made to these data are synced across the
QTLe
object. Getter functions include
state_id()
, feature_id()
, and
variant_id()
.
## [1] "lung" "thyroid" "spleen" "blood"
## [1] "ENST00000428771" "ENST00000477976" "ENST00000483767"
## [1] "rs554008981" "rs554008981" "rs554008981"
These functions can also be used as setters. For example, when
state_id()
is used to update the names of the states, this
information is updated in the colData and in the assays.
## DataFrame with 4 rows and 1 column
## state_id
## <character>
## LUNG LUNG
## THYROID THYROID
## SPLEEN SPLEEN
## BLOOD BLOOD
## LUNG THYROID SPLEEN BLOOD
## ENST00000428771|rs554008981 -0.1733690 NA 0.134913 NA
## ENST00000477976|rs554008981 0.1616170 0.3173110 NA NA
## ENST00000483767|rs554008981 -0.4161480 -0.0483018 NA -0.204647
## ENST00000623070|rs554008981 -0.1137930 NA NA NA
## ENST00000669922|rs554008981 -0.1921680 -0.1067540 0.724622 -0.117424
## ENST00000428771|rs201055865 -0.0630909 NA NA NA
Finally, if the feature, variant, and state IDs are accidentally
overwritten or removed in any way (i.e. not using one of the special
setters), they can be retrieved using the recover_qtle_ids
.
For example, here we accidentally removed the state_id information from
our colData by providing new colData.
new_colData <- DataFrame(list(some_info1=LETTERS[1:ncol(qtle4)],
some_info2=c(1:ncol(qtle4))))
colData(qtle4) <- new_colData
head(colData(qtle4))
## DataFrame with 4 rows and 2 columns
## some_info1 some_info2
## <character> <integer>
## 1 A 1
## 2 B 2
## 3 C 3
## 4 D 4
## DataFrame with 4 rows and 3 columns
## some_info1 some_info2 state_id
## <character> <integer> <character>
## LUNG A 1 LUNG
## THYROID B 2 THYROID
## SPLEEN C 3 SPLEEN
## BLOOD D 4 BLOOD
Or here we accidentally shuffled the variant_ids in the rowData, but can retrieve the old labels.
## DataFrame with 6 rows and 2 columns
## variant_id feature_id
## <character> <character>
## ENST00000428771|rs554008981 rs1351019412 ENST00000428771
## ENST00000477976|rs554008981 rs202038446 ENST00000477976
## ENST00000483767|rs554008981 rs577455319 ENST00000483767
## ENST00000623070|rs554008981 rs62642117 ENST00000623070
## ENST00000669922|rs554008981 rs76388980 ENST00000669922
## ENST00000428771|rs201055865 rs369556846 ENST00000428771
## DataFrame with 6 rows and 2 columns
## variant_id feature_id
## <character> <character>
## ENST00000428771|rs554008981 rs554008981 ENST00000428771
## ENST00000477976|rs554008981 rs554008981 ENST00000477976
## ENST00000483767|rs554008981 rs554008981 ENST00000483767
## ENST00000623070|rs554008981 rs554008981 ENST00000623070
## ENST00000669922|rs554008981 rs554008981 ENST00000669922
## ENST00000428771|rs201055865 rs201055865 ENST00000428771
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] QTLExperiment_1.5.0 SummarizedExperiment_1.37.0
## [3] Biobase_2.67.0 GenomicRanges_1.59.1
## [5] GenomeInfoDb_1.43.2 IRanges_2.41.2
## [7] S4Vectors_0.45.2 BiocGenerics_0.53.3
## [9] generics_0.1.3 MatrixGenerics_1.19.0
## [11] matrixStats_1.4.1 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] xfun_0.49 bslib_0.8.0 collapse_2.0.18
## [4] lattice_0.22-6 tzdb_0.4.0 vctrs_0.6.5
## [7] tools_4.4.2 parallel_4.4.2 tibble_3.2.1
## [10] pkgconfig_2.0.3 Matrix_1.7-1 checkmate_2.3.2
## [13] SQUAREM_2021.1 lifecycle_1.0.4 GenomeInfoDbData_1.2.13
## [16] truncnorm_1.0-9 compiler_4.4.2 htmltools_0.5.8.1
## [19] sys_3.4.3 buildtools_1.0.0 sass_0.4.9
## [22] yaml_2.3.10 pillar_1.10.0 crayon_1.5.3
## [25] jquerylib_0.1.4 tidyr_1.3.1 DelayedArray_0.33.3
## [28] cachem_1.1.0 abind_1.4-8 tidyselect_1.2.1
## [31] digest_0.6.37 dplyr_1.1.4 purrr_1.0.2
## [34] ashr_2.2-63 maketools_1.3.1 fastmap_1.2.0
## [37] grid_4.4.2 cli_3.6.3 invgamma_1.1
## [40] SparseArray_1.7.2 magrittr_2.0.3 S4Arrays_1.7.1
## [43] withr_3.0.2 UCSC.utils_1.3.0 backports_1.5.0
## [46] bit64_4.5.2 rmarkdown_2.29 XVector_0.47.0
## [49] httr_1.4.7 bit_4.5.0.1 evaluate_1.0.1
## [52] knitr_1.49 irlba_2.3.5.1 rlang_1.1.4
## [55] Rcpp_1.0.13-1 mixsqp_0.3-54 glue_1.8.0
## [58] BiocManager_1.30.25 vroom_1.6.5 jsonlite_1.8.9
## [61] R6_2.5.1 zlibbioc_1.52.0