An introduction to the QTLExperiment class

1. Motivation

The QTLExperiment class is a Bioconductor container for storing and manipulating QTL summary statistics (e.g., effect sizes, standard errors, significance metrics) from one or more states (e.g., tissues, cell-types, environmental conditions). It extends the RangedSummarizedExperiment class (from the SummarizedExperiment package), where rows represent QTL associations, columns represent states, and assays contain the various summary statistics. Associations are identified by the feature and variant tested in the format feature_id|variant_id. For example, a QTL association between the gene ENSG00000103888 and the SNP variant rs112057726 would be stored as ENSG00000103888|rs112057726. This package also provides convenient and robust methods for loading, merging, and subsetting multi-state QTL data.

QTLExperiment container structure
QTLExperiment container structure

Installation

QTLExperiment can be installed from GitHub:

if (!require("BiocManager", quietly=TRUE))
    install.packages("BiocManager")

BiocManager::install("QTLExperiment", version = "devel")
library(QTLExperiment)

2. Creating QTLExperiment instances

Manually

QTLExperiment objects can be created manually by passing required information to the QTLExperiment function. The assays (i.e., betas, error, and other user defined assays) can be provided as a named list or as a SummarizedExperiment object. All assays should contain the same rows and columns. Important metadata (i.e., state_id, feature_id, and variant_id) can be inferred from the input data or provided directly. If not provided, QTLExperiment will use the assay column names as state IDs and will look for the row names to contain the feature IDs and the variant IDs separated by a pipe (“|”). For example:

set.seed(42)
nStates <- 6
nQTL <- 40

mock_summary_stats <- mockSummaryStats(nStates=nStates, nQTL=nQTL)
mock_summary_stats$betas[1:5,1:5]
##                    state1     state2      state3       state4     state5
## geneA|snp36974  1.3709584  0.2059986  1.51270701 -1.493625067 -0.1755259
## geneC|snp96738 -0.5646982 -0.3610573  0.25792144 -1.470435741 -1.0717824
## geneB|snp9285   0.3631284  0.7581632  0.08844023  0.124702386  0.1632069
## geneB|snp66252  0.6328626 -0.7267048 -0.12089654 -0.996639135 -0.3627384
## geneB|snp7450   0.4042683 -1.3682810 -1.19432890 -0.001822614  0.5900135

With input data in this format, the qtle object can be generated as shown below.

qtle <- QTLExperiment(
    assays=list(
        betas=mock_summary_stats$betas,
        errors=mock_summary_stats$errors),
    metadata=list(study="mock-example"))
qtle
## class: QTLExperiment 
## dim: 40 6 
## metadata(1): study
## assays(2): betas errors
## rownames(40): geneA|snp36974 geneC|snp96738 ... geneB|snp50164
##   geneC|snp16503
## rowData names(2): variant_id feature_id
## colnames(6): state1 state2 ... state5 state6
## colData names(1): state_id

Alternatively, the state IDs, feature IDs, and variant IDs can be provided manually. Note that in this mode the user must ensure the rows and columns of all assays are in the order of the IDs provided!

mock_summary_stats <- mockSummaryStats(nStates=nStates, nQTL=nQTL, names=FALSE)
mock_summary_stats$betas[1:5,1:5]
##            [,1]       [,2]        [,3]       [,4]       [,5]
## [1,] -1.1872946 -0.8664769  1.93253557 -0.2675735  1.2216340
## [2,] -0.2035310  1.4811912  0.25001404  1.1926643  0.6262407
## [3,] -1.0591162 -0.3604593  0.74765249  1.5921706  1.0063578
## [4,]  0.6613011  1.1137537  1.00108017 -0.9963816 -0.3485696
## [5,] -0.3329064  0.1817592 -0.07813849 -1.0139292 -1.1513481
qtle <- QTLExperiment(assays=list(betas=mock_summary_stats$betas,
                                  errors=mock_summary_stats$errors),
                      feature_id=paste0("gene_", 1:nQTL),
                      variant_id=sample(1:1e6, nQTL),
                      state_id=paste0("state_", LETTERS[1:nStates]),
                      metadata=list(study="mock-example2"))
qtle
## class: QTLExperiment 
## dim: 40 6 
## metadata(1): study
## assays(2): betas errors
## rownames(40): gene_1|75729 gene_2|864550 ... gene_39|763732
##   gene_40|186281
## rowData names(2): variant_id feature_id
## colnames(6): state_A state_B ... state_E state_F
## colData names(1): state_id

A mock QTLExperiment object can also be generated automatically using the mockQTLe function:

qtle <- mockQTLE(nStates=nStates, nQTL=nQTL)
qtle
## class: QTLExperiment 
## dim: 40 6 
## metadata(0):
## assays(3): betas errors pvalues
## rownames(40): geneB|snp48287 geneC|snp98994 ... geneC|snp76205
##   geneB|snp88707
## rowData names(2): variant_id feature_id
## colnames(6): state1 state2 ... state5 state6
## colData names(2): state_id sample_size

From QTL summary statistics for each state

The sumstats2qtle() function is a generic function to load QTL summary statistics from multiple files, where each file represents a state, and convert them into a QTLExperiment object. Because different QTL mapping software produce summary statistics in different formats, this function is flexible, allowing users to specify the column name or index where required data is stored. This function can also utilize parallel processing (cores available are automatically detected, see the vroom documentation for details.

Required arguments:

  • input: A named list (name=state; value=path) or matrix with two columns (state and path) where the state is the desired name and the path is the full local path or the weblink to the summary statistics for that state.
  • feature_id: The index or name of the column in the summary statistic files containing the feature ID
  • variant_id: The index or name of the column in the summary statistic files containing the variant ID
  • betas: The index or name of the column in the summary statistic files containing the estimated betas
  • error: The index or name of the column in the summary statistic files containing the estimated beta errors

Optional arguments: - na.rm: Logical. To remove QTL tests (rows) with missing data for any state. - pval: The index or name of the column in the summary statistic files containing test statistics. - n_max: The number of rows to read from each file

Note that vroom does not handle compressed files well, and will not load all rows for large files. It is best to provide vroom with uncompressed objects for this reason.

As an example, we can load summary statistics from the EBI eQTL database. Transcript usage data for lung, thyroid, spleen and blood are available in the inst/extdata folder. This data is licensed under the Creative Commons Attribution 4.0 International License. See inst/script/data_processing.R for details about how these data files were obtained from the database.

input_path <- system.file("extdata", package="QTLExperiment")
state <- c("lung", "thyroid", "spleen", "blood")

input <- data.frame(
    state=state, 
    path=paste0(input_path, "/GTEx_tx_", state, ".tsv"))

qtle4 <- sumstats2qtle(
    input, 
    feature_id="molecular_trait_id",
    variant_id="rsid", 
    betas="beta", 
    errors="se",
    pvalues="pvalue", 
    verbose=TRUE)
qtle4
## class: QTLExperiment 
## dim: 1163 4 
## metadata(0):
## assays(3): betas errors pvalues
## rownames(1163): ENST00000428771|rs554008981 ENST00000477976|rs554008981
##   ... ENST00000445118|rs368254419 ENST00000483767|rs368254419
## rowData names(2): variant_id feature_id
## colnames(4): lung thyroid spleen blood
## colData names(1): state_id
head(betas(qtle4))
##                                   lung    thyroid   spleen     blood
## ENST00000428771|rs554008981 -0.1733690         NA 0.134913        NA
## ENST00000477976|rs554008981  0.1616170  0.3173110       NA        NA
## ENST00000483767|rs554008981 -0.4161480 -0.0483018       NA -0.204647
## ENST00000623070|rs554008981 -0.1137930         NA       NA        NA
## ENST00000669922|rs554008981 -0.1921680 -0.1067540 0.724622 -0.117424
## ENST00000428771|rs201055865 -0.0630909         NA       NA        NA

From mashr data format

A convenience function is also available to convert mash objects (generated here using the simple_sims function from mashr) into QTLe objects.

mashr_sim <- mockMASHR(nStates=nStates, nQTL=nQTL)

qtle2 <- mash2qtle(
    mashr_sim,
    rowData=DataFrame(feature_id=row.names(mashr_sim$Bhat),
                      variant_id=sample(seq(1:1e5), nQTL)))
qtle2
## class: QTLExperiment 
## dim: 40 6 
## metadata(0):
## assays(2): betas errors
## rownames(40): geneA|snp79754|12127 geneA|snp87571|72956 ...
##   geneA|snp56954|49069 geneB|snp81574|67904
## rowData names(2): variant_id feature_id
## colnames(6): state1 state2 ... state5 state6
## colData names(1): state_id

3. Basic object manipulation

Any operation that can be applied to a RangedSummarizedExperiment is also applicable to any instance of a QTLExperiment. This includes access to assay data via assay(), column metadata with colData(), etc.

dim(qtle4)
## [1] 1163    4
colnames(qtle4)
## [1] "lung"    "thyroid" "spleen"  "blood"
head(rowData(qtle4))
## DataFrame with 6 rows and 2 columns
##                              variant_id      feature_id
##                             <character>     <character>
## ENST00000428771|rs554008981 rs554008981 ENST00000428771
## ENST00000477976|rs554008981 rs554008981 ENST00000477976
## ENST00000483767|rs554008981 rs554008981 ENST00000483767
## ENST00000623070|rs554008981 rs554008981 ENST00000623070
## ENST00000669922|rs554008981 rs554008981 ENST00000669922
## ENST00000428771|rs201055865 rs201055865 ENST00000428771
qtle4b <- qtle4
state_id(qtle4b) <- paste0(state_id(qtle4), "_b")
dim(cbind(qtle4, qtle4b))
## [1] 1163    8
qtle4b <- qtle4
feature_id(qtle4b) <- paste0(feature_id(qtle4), "_b")
dim(rbind(qtle4, qtle4b))
## [1] 2326    4
qtle.subset <- qtle[rowData(qtle)$feature_id == "geneA", ]
dim(qtle.subset)
## [1] 13  6
qtle.subset <- qtle[, c("state1", "state2")]
dim(qtle.subset)
## [1] 40  2
qtle.subset <- subset(qtle, , sample_size > 100)
dim(qtle.subset)
## [1] 40  1

4. Working with assays

The QTLExperiment assays can be viewed and manipulated using appropriate getter and setter methods. For common assay types (i.e., betas, errors, pvalues, and lfsrs), convenient getter and setters are available. For example, the betas getter and setter is shown here:

betas(qtle)[1:5,1:5]
##                     state1      state2     state3      state4     state5
## geneB|snp48287  1.69640227  0.03453868  0.3160142  1.11069222 -0.6743804
## geneC|snp98994  1.54284685  0.28629965 -1.0381128  2.14072073 -2.3030232
## geneC|snp82890  0.21452759  1.18620809 -0.7551373 -0.83097181 -0.1493329
## geneA|snp37049  1.70068830 -0.03754937 -0.4915043  0.07640586  0.9486896
## geneB|snp97397 -0.00798683  0.42037159  0.1472257  0.93582445  1.1563625
betas(qtle) <- betas(qtle) * -1

betas(qtle)[1:5,1:5]
##                     state1      state2     state3      state4     state5
## geneB|snp48287 -1.69640227 -0.03453868 -0.3160142 -1.11069222  0.6743804
## geneC|snp98994 -1.54284685 -0.28629965  1.0381128 -2.14072073  2.3030232
## geneC|snp82890 -0.21452759 -1.18620809  0.7551373  0.83097181  0.1493329
## geneA|snp37049 -1.70068830  0.03754937  0.4915043 -0.07640586 -0.9486896
## geneB|snp97397  0.00798683 -0.42037159 -0.1472257 -0.93582445 -1.1563625

Users or downstream tools may add and see additional assays to the QTLe object using generic getter and setter methods. For example, to store information about what QTL are significant, we could add a new assay called significant using the generic setter method and then look at the data in the new assay using the generic getter method:

assay(qtle4, "significant") <- assay(qtle4, "pvalues") < 0.05

assay(qtle4, "significant")[1:5, 1:4]
##                              lung thyroid spleen blood
## ENST00000428771|rs554008981 FALSE      NA  FALSE    NA
## ENST00000477976|rs554008981 FALSE   FALSE     NA    NA
## ENST00000483767|rs554008981  TRUE   FALSE     NA FALSE
## ENST00000623070|rs554008981 FALSE      NA     NA    NA
## ENST00000669922|rs554008981 FALSE   FALSE   TRUE FALSE

5. Working with critical meta data

Because the feature, variant, and state IDs are critical for the continuity of a QTLExperiment object, special getters and setters are used to ensure changes to avoid unintentional mislabeling and to make sure that changes made to these data are synced across the QTLe object. Getter functions include state_id(), feature_id(), and variant_id().

state_id(qtle4)
## [1] "lung"    "thyroid" "spleen"  "blood"
feature_id(qtle4)[1:3]
## [1] "ENST00000428771" "ENST00000477976" "ENST00000483767"
variant_id(qtle4)[1:3]
## [1] "rs554008981" "rs554008981" "rs554008981"

These functions can also be used as setters. For example, when state_id() is used to update the names of the states, this information is updated in the colData and in the assays.

state_id(qtle4) <- c("LUNG", "THYROID", "SPLEEN", "BLOOD")
head(colData(qtle4))
## DataFrame with 4 rows and 1 column
##            state_id
##         <character>
## LUNG           LUNG
## THYROID     THYROID
## SPLEEN       SPLEEN
## BLOOD         BLOOD
head(betas(qtle4))
##                                   LUNG    THYROID   SPLEEN     BLOOD
## ENST00000428771|rs554008981 -0.1733690         NA 0.134913        NA
## ENST00000477976|rs554008981  0.1616170  0.3173110       NA        NA
## ENST00000483767|rs554008981 -0.4161480 -0.0483018       NA -0.204647
## ENST00000623070|rs554008981 -0.1137930         NA       NA        NA
## ENST00000669922|rs554008981 -0.1921680 -0.1067540 0.724622 -0.117424
## ENST00000428771|rs201055865 -0.0630909         NA       NA        NA

Manually syncing metadata

Finally, if the feature, variant, and state IDs are accidentally overwritten or removed in any way (i.e. not using one of the special setters), they can be retrieved using the recover_qtle_ids. For example, here we accidentally removed the state_id information from our colData by providing new colData.

new_colData <- DataFrame(list(some_info1=LETTERS[1:ncol(qtle4)],
                              some_info2=c(1:ncol(qtle4))))
colData(qtle4) <- new_colData
head(colData(qtle4))
## DataFrame with 4 rows and 2 columns
##    some_info1 some_info2
##   <character>  <integer>
## 1           A          1
## 2           B          2
## 3           C          3
## 4           D          4
qtle4 <- recover_qtle_ids(qtle4)
head(colData(qtle4))
## DataFrame with 4 rows and 3 columns
##          some_info1 some_info2    state_id
##         <character>  <integer> <character>
## LUNG              A          1        LUNG
## THYROID           B          2     THYROID
## SPLEEN            C          3      SPLEEN
## BLOOD             D          4       BLOOD

Or here we accidentally shuffled the variant_ids in the rowData, but can retrieve the old labels.

rowData(qtle4)$variant_id <- sample(rowData(qtle4)$variant_id, nrow(qtle4))

head(rowData(qtle4))
## DataFrame with 6 rows and 2 columns
##                               variant_id      feature_id
##                              <character>     <character>
## ENST00000428771|rs554008981 rs1351019412 ENST00000428771
## ENST00000477976|rs554008981  rs202038446 ENST00000477976
## ENST00000483767|rs554008981  rs577455319 ENST00000483767
## ENST00000623070|rs554008981   rs62642117 ENST00000623070
## ENST00000669922|rs554008981   rs76388980 ENST00000669922
## ENST00000428771|rs201055865  rs369556846 ENST00000428771
qtle4 <- recover_qtle_ids(qtle4)
head(rowData(qtle4))
## DataFrame with 6 rows and 2 columns
##                              variant_id      feature_id
##                             <character>     <character>
## ENST00000428771|rs554008981 rs554008981 ENST00000428771
## ENST00000477976|rs554008981 rs554008981 ENST00000477976
## ENST00000483767|rs554008981 rs554008981 ENST00000483767
## ENST00000623070|rs554008981 rs554008981 ENST00000623070
## ENST00000669922|rs554008981 rs554008981 ENST00000669922
## ENST00000428771|rs201055865 rs201055865 ENST00000428771

Session Info

sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] QTLExperiment_1.5.0         SummarizedExperiment_1.37.0
##  [3] Biobase_2.67.0              GenomicRanges_1.59.1       
##  [5] GenomeInfoDb_1.43.2         IRanges_2.41.2             
##  [7] S4Vectors_0.45.2            BiocGenerics_0.53.3        
##  [9] generics_0.1.3              MatrixGenerics_1.19.0      
## [11] matrixStats_1.4.1           BiocStyle_2.35.0           
## 
## loaded via a namespace (and not attached):
##  [1] xfun_0.49               bslib_0.8.0             collapse_2.0.18        
##  [4] lattice_0.22-6          tzdb_0.4.0              vctrs_0.6.5            
##  [7] tools_4.4.2             parallel_4.4.2          tibble_3.2.1           
## [10] pkgconfig_2.0.3         Matrix_1.7-1            checkmate_2.3.2        
## [13] SQUAREM_2021.1          lifecycle_1.0.4         GenomeInfoDbData_1.2.13
## [16] truncnorm_1.0-9         compiler_4.4.2          htmltools_0.5.8.1      
## [19] sys_3.4.3               buildtools_1.0.0        sass_0.4.9             
## [22] yaml_2.3.10             pillar_1.10.0           crayon_1.5.3           
## [25] jquerylib_0.1.4         tidyr_1.3.1             DelayedArray_0.33.3    
## [28] cachem_1.1.0            abind_1.4-8             tidyselect_1.2.1       
## [31] digest_0.6.37           dplyr_1.1.4             purrr_1.0.2            
## [34] ashr_2.2-63             maketools_1.3.1         fastmap_1.2.0          
## [37] grid_4.4.2              cli_3.6.3               invgamma_1.1           
## [40] SparseArray_1.7.2       magrittr_2.0.3          S4Arrays_1.7.1         
## [43] withr_3.0.2             UCSC.utils_1.3.0        backports_1.5.0        
## [46] bit64_4.5.2             rmarkdown_2.29          XVector_0.47.0         
## [49] httr_1.4.7              bit_4.5.0.1             evaluate_1.0.1         
## [52] knitr_1.49              irlba_2.3.5.1           rlang_1.1.4            
## [55] Rcpp_1.0.13-1           mixsqp_0.3-54           glue_1.8.0             
## [58] BiocManager_1.30.25     vroom_1.6.5             jsonlite_1.8.9         
## [61] R6_2.5.1                zlibbioc_1.52.0