ensemblVEP: using the REST API with Bioconductor

Introduction

Ensembl’s Variant Effect Predictor is described in McLaren et al. (2016).

Prior to Bioconductor 3.19, the ensemblVEP package provided access to Ensembl’s predictions through an interface between Perl and MySQL.

In 3.19 VariantAnnotation supports the use of the VEP component of the REST API at https://rest.ensembl.org.

Acquire annotation on variants from a VCF file

The function vep_by_region will accept a VCF object as defined in VariantAnnotation.

library(VariantAnnotation)
fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
r22 = readVcf(fl)
r22
## class: CollapsedVCF 
## dim: 10376 5 
## rowRanges(vcf):
##   GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
## info(vcf):
##   DataFrame with 22 columns: LDAF, AVGPOST, RSQ, ERATE, THETA, CIEND, CIPOS,...
## info(header(vcf)):
##              Number Type    Description                                        
##    LDAF      1      Float   MLE Allele Frequency Accounting for LD             
##    AVGPOST   1      Float   Average posterior probability from MaCH/Thunder    
##    RSQ       1      Float   Genotype imputation quality from MaCH/Thunder      
##    ERATE     1      Float   Per-marker Mutation rate from MaCH/Thunder         
##    THETA     1      Float   Per-marker Transition rate from MaCH/Thunder       
##    CIEND     2      Integer Confidence interval around END for imprecise var...
##    CIPOS     2      Integer Confidence interval around POS for imprecise var...
##    END       1      Integer End position of the variant described in this re...
##    HOMLEN    .      Integer Length of base pair identical micro-homology at ...
##    HOMSEQ    .      String  Sequence of base pair identical micro-homology a...
##    SVLEN     1      Integer Difference in length between REF and ALT alleles   
##    SVTYPE    1      String  Type of structural variant                         
##    AC        .      Integer Alternate Allele Count                             
##    AN        1      Integer Total Allele Count                                 
##    AA        1      String  Ancestral Allele, ftp://ftp.1000genomes.ebi.ac.u...
##    AF        1      Float   Global Allele Frequency based on AC/AN             
##    AMR_AF    1      Float   Allele Frequency for samples from AMR based on A...
##    ASN_AF    1      Float   Allele Frequency for samples from ASN based on A...
##    AFR_AF    1      Float   Allele Frequency for samples from AFR based on A...
##    EUR_AF    1      Float   Allele Frequency for samples from EUR based on A...
##    VT        1      String  indicates what type of variant the line represents 
##    SNPSOURCE .      String  indicates if a snp was called when analysing the...
## geno(vcf):
##   List of length 3: GT, DS, GL
## geno(header(vcf)):
##       Number Type   Description                      
##    GT 1      String Genotype                         
##    DS 1      Float  Genotype dosage from MaCH/Thunder
##    GL G      Float  Genotype Likelihoods

In this example we confine attention to single nucleotide variants.

There is a limit of 200 locations in a request, and 55000 requests per hour. We’ll base our query on 100 positions in the chr22 VCF.

dr = which(width(rowRanges(r22))!=1)
r22s = r22[-dr]
res = vep_by_region(r22[1:100], snv_only=FALSE, chk_max=FALSE)
jans = toJSON(content(res))

There are various ways to work with the result of this query to the API. We’ll use the rjsoncons JSON processing infrastructure to dig in and understand aspects of the API behavior.

First, the top-level concepts produced for each variant can be retrieved using

library(rjsoncons)
names(jsonlite::fromJSON(jmespath(jans, "[*]")))
##  [1] "transcript_consequences"         "allele_string"                  
##  [3] "most_severe_consequence"         "input"                          
##  [5] "seq_region_name"                 "id"                             
##  [7] "end"                             "regulatory_feature_consequences"
##  [9] "assembly_name"                   "strand"                         
## [11] "start"                           "colocated_variants"             
## [13] "motif_feature_consequences"

Annotation of the most severe consequence known will typically be of interest:

table(jsonlite::fromJSON(jmespath(jans, "[*].most_severe_consequence")))
## 
##   5_prime_UTR_variant        intron_variant splice_region_variant 
##                     1                    98                     1

There is variability in the structure of data returned for each query.

head(fromJSON(jmespath(jans, "[*].regulatory_feature_consequences")))
## [[1]]
##        biotype variant_allele consequence_terms   impact regulatory_feature_id
## 1 open_chr....              G      regulato.... MODIFIER          ENSR0000....
## 
## [[2]]
##   regulatory_feature_id   impact      biotype consequence_terms variant_allele
## 1          ENSR0000.... MODIFIER open_chr....      regulato....              T
## 
## [[3]]
##     impact regulatory_feature_id consequence_terms variant_allele      biotype
## 1 MODIFIER          ENSR0000....      regulato....              A open_chr....
## 
## [[4]]
##     impact regulatory_feature_id      biotype consequence_terms variant_allele
## 1 MODIFIER          ENSR0000.... open_chr....      regulato....              T
## 
## [[5]]
##   regulatory_feature_id   impact      biotype variant_allele consequence_terms
## 1          ENSR0000.... MODIFIER open_chr....              T      regulato....
## 
## [[6]]
##     impact regulatory_feature_id variant_allele consequence_terms      biotype
## 1 MODIFIER          ENSR0000....              A      regulato.... open_chr....

Furthermore, the content of the motif feature consequences field seems very peculiar.

table(unlist(fromJSON(jmespath(jans, "[*].motif_feature_consequences"))))
## 
##                  -0.012                  -0.015                  -0.087 
##                       1                       1                       1 
##                      -1                       0                   0.001 
##                       6                       4                       1 
##                   0.009                   0.018                       1 
##                       1                       1                       5 
##                      10                      19                       2 
##                       1                       1                       1 
##                      21                       4                       6 
##                       1                       1                       1 
##                       7                       8                       9 
##                       1                       1                       1 
##                       A                   BACH1         ENSM00205804739 
##                       4                       1                       1 
##         ENSM00521974880         ENSM00522105203         ENSM00522219019 
##                       1                       1                       2 
##         ENSM00522779376         ENSM00523964478         ENSM00525238397 
##                       1                       2                       1 
##         ENSM00526271801              ENSPFM0017              ENSPFM0239 
##                       1                       1                       2 
##              ENSPFM0374              ENSPFM0403              ENSPFM0506 
##                       1                       2                       1 
##              ENSPFM0565              ENSPFM0598              ENSPFM0605 
##                       1                       1                       1 
##                       G               GCM1::MAX                    JDP2 
##                       3                       2                       1 
##              MEIS1::MAX                MODIFIER              MYBL1::MAX 
##                       1                      10                       2 
##                       N                    NFE2                       T 
##                       8                       1                       3 
##             TEAD4::ELF1             TEAD4::ELK1              TEAD4::MAX 
##                       1                       1                       1 
##             TEAD4::SPIB             TFAP2C::MAX              TFAP4::MAX 
##                       1                       1                       1 
## TF_binding_site_variant                       Y 
##                      10                       2

Transforming the API response to GRanges

We’ll consider the following approach to converting the API response to a GenomicRanges GRanges instance. Eventually this may become part of the package.

library(GenomicRanges)
.make_GRanges = function( vep_response ) {
  stopifnot(inherits(vep_response, "response"))  # httr
  nested = fromJSON(toJSON(content(vep_response)))
  ini = GRanges(seqnames = unlist(nested$seq_region_name),
    IRanges(start=unlist(nested$start), end=unlist(nested$end)))
  dr = match(c("seq_region_name", "start", "end"), names(nested))
  mcols(ini) = DataFrame(nested[,-dr])
  ini
}
tstg = .make_GRanges( res )
tstg[,1]  # full print is unwieldy
## GRanges object with 100 ranges and 1 metadata column:
##         seqnames    ranges strand |
##            <Rle> <IRanges>  <Rle> |
##     [1]       22  50300078      * |
##     [2]       22  50300086      * |
##     [3]       22  50300101      * |
##     [4]       22  50300113      * |
##     [5]       22  50300166      * |
##     ...      ...       ...    ... .
##    [96]       22  50304748      * |
##    [97]       22  50304805      * |
##    [98]       22  50304935      * |
##    [99]       22  50304943      * |
##   [100]       22  50305084      * |
##                                                                                                            transcript_consequences
##                                                                                                                             <list>
##     [1]                                                protein_coding:-1:G:...,protein_coding:-1:G:...,protein_coding:-1:G:...,...
##     [2]                  intron_variant:HGNC:9104:HGNC:...,intron_variant:HGNC:9104:HGNC:...,intron_variant:HGNC:9104:HGNC:...,...
##     [3]                  intron_variant:HGNC:HGNC:9104:...,intron_variant:HGNC:HGNC:9104:...,intron_variant:HGNC:HGNC:9104:...,...
##     [4]         intron_variant:-1:protein_coding:...,intron_variant:-1:protein_coding:...,intron_variant:-1:protein_coding:...,...
##     [5]      intron_variant:ENST00000359337:-1:...,intron_variant:ENST00000425954:-1:...,intron_variant:ENST00000432455:-1:...,...
##     ...                                                                                                                        ...
##    [96]           PLXNB2:MODIFIER:ENSG00000196576:...,DENND6B:MODIFIER:ENSG00000205593:...,PLXNB2:MODIFIER:ENSG00000196576:...,...
##    [97]              HGNC:HGNC:9104:ENSG00000196576:...,HGNC:HGNC:32690:ENSG00000205593:...,HGNC:HGNC:9104:ENSG00000196576:...,...
##    [98]                                             ENST00000359337:A:-1:...,ENST00000413817:A:-1:...,ENST00000425954:A:-1:...,...
##    [99] intron_variant:protein_coding:-1:...,downstream_gene_vari..:protein_coding:-1:...,intron_variant:protein_coding:-1:...,...
##   [100]           PLXNB2:MODIFIER:ENSG00000196576:...,DENND6B:MODIFIER:ENSG00000205593:...,PLXNB2:MODIFIER:ENSG00000196576:...,...
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
names(mcols(tstg))
##  [1] "transcript_consequences"         "allele_string"                  
##  [3] "most_severe_consequence"         "input"                          
##  [5] "id"                              "regulatory_feature_consequences"
##  [7] "assembly_name"                   "strand"                         
##  [9] "colocated_variants"              "motif_feature_consequences"

Now information about variants can be retrieved with range operations. Deep annotation requires nested structure of the metadata columns.

mcols(tstg)[1, "transcript_consequences"]
## [[1]]
##        biotype strand variant_allele transcript_id gene_symbol_source   hgnc_id
## 1 protein_....     -1              G  ENST0000....               HGNC HGNC:9104
## 2 protein_....     -1              G  ENST0000....               HGNC HGNC:9104
## 3 protein_....     -1              G  ENST0000....               HGNC HGNC:9104
## 4 protein_....     -1              G  ENST0000....               HGNC HGNC:9104
##        gene_id gene_symbol   impact consequence_terms      flags
## 1 ENSG0000....      PLXNB2 MODIFIER      intron_v....           
## 2 ENSG0000....      PLXNB2 MODIFIER      intron_v.... cds_end_NF
## 3 ENSG0000....      PLXNB2 MODIFIER      intron_v.... cds_end_NF
## 4 ENSG0000....      PLXNB2 MODIFIER      intron_v....

Further work

An important element of prior work in ensemblVEP supports feeding annotation back into the VCF used to generate the effect prediction query. This seems feasible but concrete use cases are of interest.

References

McLaren, William, Laurent Gil, Sarah E. Hunt, Harpreet Singh Riat, Graham R. S. Ritchie, Anja Thormann, Paul Flicek, and Fiona Cunningham. 2016. “The Ensembl Variant Effect Predictor.” Genome Biology 17 (1): 122. https://doi.org/10.1186/s13059-016-0974-4.