The Global Alliance for Genomics and Health (GA4GH) was formed to help accelerate the potential of genomic medicine to advance human health. It brings together over 400 leading institutions working in healthcare, research, disease advocacy, life science, and information technology. The Data Working Group of the GA4GH developed data model schemas and application program interfaces (APIs) for genomic data. These APIs are specifically designed to allow sharing of genomics data in a standardized manner and without having to exchange complete experiments. They developed a reference implementation for these APIs providing a web server for hosting genomic data.
The Bioconductor project provides tools for the analysis and comprehension of high-throughput genomic data. It uses the R statistical programming language, and is open source and open development. The Bioconductor provides stable representations for genomics data such as biological sequences and genomic variants.
We developed GA4GHclient, a Bioconductor package that provides an easy way to access public data servers through GA4GH APIs. The requested output data are converted into tidy data and presented as data frames. Our package provides methods for converting genomic variant data into VCF data allowing the creation of complex data analysis using public genomic data and well validated Bioconductor packages. This package also provides an interactive web application for interacting with genomics data through GA4GH APIs.
Public data servers that use GA4GH Genomics API:
The table below shows all available methods in GA4GHclient package. These methods are based on the official GA4GH schemas. Let us know if some method is missing by opening an issue at https://github.com/labbcb/GA4GHclient/issues.
Method | Description |
---|---|
searchDatasets | Search for datasets |
searchReferenceSets | Search for reference sets (reference genomes) |
searchReferences | Search for references (genome sequences, e.g. chromosomes) |
listReferenceBases | Get the sequence bases of a reference genome by genomic range |
searchVariantSets | Search for for variant sets (VCF files) |
searchVariants | Search for variants by genomic ranges (lines of VCF files) |
searchCallSets | Search for call sets (sample columns of VCF files) |
searchVariantAnnotationSets | Search for variant annotation sets (annotated VCF files) |
searchVariantAnnotations | Search for annotated variants by genomic range |
searchFeatureSets | Search for feature sets (genomic features, e.g. GFF files) |
searchFeatures | Search for features (lines of genomic feature files) |
searchReadGroupSets | Search for read group sets (sequence alignement, e.g BAM files) |
searchReads | Search for reads by genomic range (bases of aligned sequences) |
searchBiosamples | Search for biosamples |
searchIndividuals | Search for individuals |
searchRnaQuantificationSets | Search for RNA quantification sets |
searchRnaQuantifications | Search for RNA quantifications |
searchExpressionLevels | Search for expression levels |
searchPhenotypeAssociationSets | Search for phenotype association sets |
searchPhenotypeAssociations | Search for phenotype associations |
getDataset | Get a dataset by its ID |
getReferenceSet | Get a reference set by its ID |
getReference | Get a reference by its ID |
getVariantSet | Get a variant set by its ID |
getVariant | Get a variant by its ID with all call sets for this variant |
getCallSet | Get a call set by its ID |
getVariantAnnotationSet | Get a variant annotation set by its ID |
getVariantAnnotation | Get an annotated variant by its ID |
getFeatureSet | Get a feature set by its ID |
getFeature | Get a feature by its ID |
getReadGroupSet | Get a read group set by its ID |
getBiosample | Get a biosample by its ID |
getIndividual | Get an individual by its ID |
getRnaQuantificationSet | Get an RNA quantification set by its ID |
getRnaQuantification | Get an RNA quantification by its ID |
getExpressionLevel | Get an expression level by its ID |
First, load required packages for this vignette.
library(GA4GHclient)
library(org.Hs.eg.db)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(VariantAnnotation)
For this example we will use Hosting Thousand Genomes Project server. It contains data from the 1000 Genomes project. Set the URL of GA4GH API data server. This variable will be used by all request functions.
Get basic information about data server. The dataset may contain many
variant sets. A variant set represents the header of an VCF file. A
collection of variants represent lines of an VCF file. The
nrow
attribute define the amount of entries we want to get
from the server. For the dataset and variant set we want only the first
entry.
The searchVariants
function requests genomic variant
data from the server. It uses genomic intervals to retrieve these
varants. As R programming language, the genomic indice is 1-based. If
you want to see what reference names (e.g. chromosomes) are available
run the searchReferences
function.
datasets <- searchDatasets(host, nrows = 1)
datasetId <- datasets$id
variantSets <- searchVariantSets(host, datasetId, nrows = 1)
variantSetId <- variantSets$id
variants <- searchVariants(host, variantSetId, referenceName = "1",
start = 15000, end = 16000, nrows = 10)
## class: CollapsedVCF
## dim: 10 0
## rowRanges(vcf):
## GRanges with 3 metadata columns: ID, REF, ALT
## info(vcf):
## DataFrame with 14 columns: EUR_AF, SAS_AF, AC, AA, AF, AFR_AF, AMR_AF, AN,...
## info(header(vcf)):
## Number Type Description
## EUR_AF A Float Allele frequency in the EUR populations calc...
## SAS_AF A Float Allele frequency in the SAS populations calc...
## AC A Integer Total number of alternate alleles in called ...
## AA 1 String Ancestral Allele. Format: AA|REF|ALT|IndelTy...
## AF A Float Estimated allele frequency in the range (0,1)
## AFR_AF A Float Allele frequency in the AFR populations calc...
## AMR_AF A Float Allele frequency in the AMR populations calc...
## AN 1 Integer Total number of alleles in called genotypes
## VT . String indicates what type of variant the line repr...
## EAS_AF A Float Allele frequency in the EAS populations calc...
## NS 1 Integer Number of samples with data
## EX_TARGET 0 Flag indicates whether a variant is within the ex...
## DP 1 Integer Total read depth; only low coverage data wer...
## MULTI_ALLELIC 0 Flag indicates whether a site is multi-allelic
## geno(vcf):
## List of length 0:
Almost search functions will return an DataFrame
object
from S4Vector
package. The searchVariants
and getVariant
functions will return an VCF
object with header from VariantAnnotation
package. The getVariantSet
function will return an
VCFHeader
object. These three functions will return
DataFrame
object when asVCF
or
asVCFHeader
parameters be FALSE
.
variants.df <- searchVariants(host, variantSetId, referenceName = "1",
start = 15000, end = 16000, nrows = 10, asVCF = FALSE)
Due to internet connection, there is a limit of amount for response
data imposed by the server (or the responseSize
parameter).
By default the function will make requests until get all response data.
Increasing the value of the responseSize
parameter will
reduce the number os requests to server.
Bioconductor has annotation packages for many genomes. For example,
the TxDb.Hsapiens.UCSC.hg19.knownGene
pakage exposes an annotation databases generated from UCSC by exposing
these as TxDb objects. Before using annotation packages, it is very
important to check what version of reference genome was used by the
GA4GH API-based data server. In other words, what reference genome was
used to align sequencing reads and call for variants. This information
can be accessed via searchReferenceSets
function.
In this case, the version of reference genome is NCBI37, which is compatible with TxDb.Hsapiens.UCSC.hg19.knownGene.
Get name of all available genes.
## [1] "A1BG" "A2M" "A2MP1" "NAT1" "NAT2" "NATP"
Get genomic location of all genes.
## 403 genes were dropped because they have exons located on both strands of the
## same reference sequence or on more than one reference sequence, so cannot be
## represented by a single genomic range.
## Use 'single.strand.genes.only=FALSE' to get all the genes in a GRangesList
## object, or use suppressMessages() to suppress this message.
## GRanges object with 6 ranges and 1 metadata column:
## seqnames ranges strand | gene_id
## <Rle> <IRanges> <Rle> | <character>
## 1 chr19 58858172-58874214 - | 1
## 10 chr8 18248755-18258723 + | 10
## 100 chr20 43248163-43280376 - | 100
## 1000 chr18 25530930-25757445 - | 1000
## 10000 chr1 243651535-244006886 - | 10000
## 100008586 chrX 49217763-49233491 + | 100008586
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
Use case: search for variants located in SCN1A gene.
## 'select()' returned 1:1 mapping between keys and columns
gr <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene, filter=list(gene_id=df$ENTREZID))
seqlevelsStyle(gr) <- "NCBI"
## Warning in (function (seqlevels, genome, new_style) : cannot switch some hg19's
## seqlevels from UCSC to NCBI style
## Warning in instance$preRenderHook(instance): It seems your data is too big for
## client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html
Use case: locate annotation data for a group of variants located at SCN1A gene.
variants.scn1a.gr <- makeGRangesFromDataFrame(variants.scn1a[[1]],
seqnames.field = "referenceName")
seqlevelsStyle(variants.scn1a.gr) <- "UCSC"
locateVariants(variants.scn1a.gr, TxDb.Hsapiens.UCSC.hg19.knownGene,
CodingVariants())
## Warning in valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE): GRanges object contains 184 out-of-bound ranges located on sequences 12221,
## 12222, and 12223. Note that ranges located on a sequence whose length is
## unknown (NA) or on a circular sequence are not considered out-of-bound (use
## seqlengths() and isCircular() to get the lengths and circularity flags of the
## underlying sequences). You can use trim() to trim these ranges. See
## ?`trim,GenomicRanges-method` for more information.
## 'select()' returned many:1 mapping between keys and columns
## GRanges object with 308 ranges and 9 metadata columns:
## seqnames ranges strand | LOCATION LOCSTART LOCEND QUERYID
## <Rle> <IRanges> <Rle> | <factor> <integer> <integer> <integer>
## [1] chr2 166847834 - | coding 5867 5867 51
## [2] chr2 166847834 - | coding 5918 5918 51
## [3] chr2 166847834 - | coding 5951 5951 51
## [4] chr2 166847921 - | coding 5780 5780 52
## [5] chr2 166847921 - | coding 5831 5831 52
## ... ... ... ... . ... ... ... ...
## [304] chr2 166930064 - | coding 68 68 2159
## [305] chr2 166930064 - | coding 68 68 2159
## [306] chr2 166930078 - | coding 54 54 2160
## [307] chr2 166930078 - | coding 54 54 2160
## [308] chr2 166930078 - | coding 54 54 2160
## TXID CDSID GENEID PRECEDEID FOLLOWID
## <character> <IntegerList> <character> <CharacterList> <CharacterList>
## [1] 12221 37217 6323
## [2] 12222 37217 6323
## [3] 12223 37217 6323
## [4] 12221 37217 6323
## [5] 12222 37217 6323
## ... ... ... ... ... ...
## [304] 12222 37244 6323
## [305] 12223 37244 6323
## [306] 12221 37244 6323
## [307] 12222 37244 6323
## [308] 12223 37244 6323
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
The searchVariants
and getVariant
functions
return VCF
objects. These VCF
objects contains
the VCF header. This is an example of how to get variants with call
data. We can get calls running the searchCallSets
function.
After convertion, it can be written into disk as VCF file using
writeVcf
.
callSetIds <- searchCallSets(host, variantSetId, nrows = 5)$id
vcf <- searchVariants(host, variantSetId, referenceName = "1",
start = 15000, end = 16000, callSetIds = callSetIds, nrows = 10)
## class: CollapsedVCF
## dim: 10 5
## rowRanges(vcf):
## GRanges with 3 metadata columns: ID, REF, ALT
## info(vcf):
## DataFrame with 14 columns: EUR_AF, SAS_AF, AC, AA, AF, AFR_AF, AMR_AF, AN,...
## info(header(vcf)):
## Number Type Description
## EUR_AF A Float Allele frequency in the EUR populations calc...
## SAS_AF A Float Allele frequency in the SAS populations calc...
## AC A Integer Total number of alternate alleles in called ...
## AA 1 String Ancestral Allele. Format: AA|REF|ALT|IndelTy...
## AF A Float Estimated allele frequency in the range (0,1)
## AFR_AF A Float Allele frequency in the AFR populations calc...
## AMR_AF A Float Allele frequency in the AMR populations calc...
## AN 1 Integer Total number of alleles in called genotypes
## VT . String indicates what type of variant the line repr...
## EAS_AF A Float Allele frequency in the EAS populations calc...
## NS 1 Integer Number of samples with data
## EX_TARGET 0 Flag indicates whether a variant is within the ex...
## DP 1 Integer Total read depth; only low coverage data wer...
## MULTI_ALLELIC 0 Flag indicates whether a site is multi-allelic
## geno(vcf):
## List of length 1: GT
## geno(header(vcf)):
## Number Type Description
## GT 1 String Genotype
## class: VCFHeader
## samples(0):
## meta(0):
## fixed(0):
## info(27): CIEND CIPOS ... EX_TARGET MULTI_ALLELIC
## geno(1): GT
The vcf
object works as expected. We can get the
genotype data for example. More information about working with VCF data
see vignette("VariantAnnotation")
.
## HG00096 HG00097 HG00099 HG00100 HG00101
## [1,] "0|0" "0|0" "0|1" "0|0" "0|0"
## [2,] "0|0" "0|0" "0|1" "0|0" "0|0"
## [3,] "0|0" "0|0" "0|1" "0|0" "0|0"
## [4,] "0|0" "0|0" "0|1" "0|0" "0|0"
## [5,] "0|0" "0|0" "1|0" "0|0" "0|0"
## [6,] "0|0" "0|0" "0|0" "0|0" "1|2"
## [7,] "0|0" "0|0" "0|0" "0|0" "2|2"
## [8,] "0|0" "0|0" "0|0" "0|0" "2|2"
## [9,] "0|0" "0|0" "0|0" "0|0" "1|2"
## [10,] "0|0" "0|0" "0|0" "0|0" "1|2"
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] VariantAnnotation_1.53.0
## [2] Rsamtools_2.23.1
## [3] Biostrings_2.75.3
## [4] XVector_0.47.0
## [5] SummarizedExperiment_1.37.0
## [6] MatrixGenerics_1.19.0
## [7] matrixStats_1.4.1
## [8] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
## [9] GenomicFeatures_1.59.1
## [10] GenomicRanges_1.59.1
## [11] GenomeInfoDb_1.43.2
## [12] org.Hs.eg.db_3.20.0
## [13] AnnotationDbi_1.69.0
## [14] IRanges_2.41.2
## [15] Biobase_2.67.0
## [16] GA4GHclient_1.31.0
## [17] S4Vectors_0.45.2
## [18] BiocGenerics_0.53.3
## [19] generics_0.1.3
## [20] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] KEGGREST_1.47.0 rjson_0.2.23 xfun_0.49
## [4] bslib_0.8.0 htmlwidgets_1.6.4 lattice_0.22-6
## [7] crosstalk_1.2.1 vctrs_0.6.5 tools_4.4.2
## [10] bitops_1.0-9 curl_6.0.1 parallel_4.4.2
## [13] tibble_3.2.1 RSQLite_2.3.9 blob_1.2.4
## [16] pkgconfig_2.0.3 Matrix_1.7-1 BSgenome_1.75.0
## [19] lifecycle_1.0.4 GenomeInfoDbData_1.2.13 compiler_4.4.2
## [22] codetools_0.2-20 htmltools_0.5.8.1 sys_3.4.3
## [25] buildtools_1.0.0 sass_0.4.9 RCurl_1.98-1.16
## [28] yaml_2.3.10 pillar_1.10.0 crayon_1.5.3
## [31] jquerylib_0.1.4 DT_0.33 BiocParallel_1.41.0
## [34] DelayedArray_0.33.3 cachem_1.1.0 abind_1.4-8
## [37] tidyselect_1.2.1 digest_0.6.37 dplyr_1.1.4
## [40] restfulr_0.0.15 maketools_1.3.1 fastmap_1.2.0
## [43] grid_4.4.2 cli_3.6.3 SparseArray_1.7.2
## [46] magrittr_2.0.3 S4Arrays_1.7.1 XML_3.99-0.17
## [49] UCSC.utils_1.3.0 bit64_4.5.2 rmarkdown_2.29
## [52] httr_1.4.7 bit_4.5.0.1 png_0.1-8
## [55] memoise_2.0.1 evaluate_1.0.1 knitr_1.49
## [58] BiocIO_1.17.1 rtracklayer_1.67.0 rlang_1.1.4
## [61] glue_1.8.0 DBI_1.2.3 BiocManager_1.30.25
## [64] jsonlite_1.8.9 R6_2.5.1 GenomicAlignments_1.43.0
## [67] zlibbioc_1.52.0