Title: | Infrastructure to work with genomewide position-specific scores |
---|---|
Description: | Provide infrastructure to store and access genomewide position-specific scores within R and Bioconductor. |
Authors: | Robert Castelo [aut, cre], Pau Puigdevall [ctb], Pablo RodrÃguez [ctb] |
Maintainer: | Robert Castelo <[email protected]> |
License: | Artistic-2.0 |
Version: | 2.19.0 |
Built: | 2024-11-18 03:47:51 UTC |
Source: | https://github.com/bioc/GenomicScores |
Functions to explore genomic scores resources.
availableGScores(use.internet=FALSE) getGScores(x, accept.license=FALSE)
availableGScores(use.internet=FALSE) getGScores(x, accept.license=FALSE)
x |
A |
use.internet |
A logical value specifying whether we want
to check through the internet whether the
annotation packages and |
accept.license |
A logical value specifying whether we accept the license under which we can use the requested data, when the data is distributed under such a license. |
The function availableGScores()
shows genomic score sets available as
AnnotationHub
online resources.
The function availableGScores()
returns a data.frame
object
with a row for each available resource of genomic scores and the following
four columns:
Name:
Name of the Bioconductor package or AnnotationHub
resource.
Organism:
Organism on which the genomic scores are defined.
Category:
Category to which the genomic scores belong to.
Installed:
Whether the resource is installed as a package
(TRUE
) or not (FALSE
).
Cached:
Whether the resource is available within the local
cache of the AnnotationHub
(TRUE
) or
not (FALSE
).
BiocManagerInstall:
Whether the resource can be installed as
an annotation package through
BiocManager::install()
(TRUE
) or not (FALSE
).
AnnotationHub:
Whether the resource can be downloaded as a
GScores
object through the
AnnotationHub
, using the function
getGScores()
(TRUE
), or
not (FALSE
).
The function getGScores()
returns a GScores
object.
R. Castelo
Puigdevall, P. and Castelo, R. GenomicScores: seamless access to genomewide position-specific scores from R and Bioconductor. Bioinformatics, 18:3208-3210, 2018.
getGScores()
phastCons100way.UCSC.hg38
MafDb.1Kgenomes.phase1.hs37d5
availableGScores() ## Not run: gsco <- getGScores("cadd.v1.6.hg38") ## End(Not run)
availableGScores() ## Not run: gsco <- getGScores("cadd.v1.6.hg38") ## End(Not run)
Functions to access genomic gscores through GScores
objects.
## S4 method for signature 'GScores,GenomicRanges' gscores(x, ranges, ...) ## S4 method for signature 'GScores,character' gscores(x, ranges, ...) ## S4 method for signature 'GScores' score(x, ..., simplify=TRUE)
## S4 method for signature 'GScores,GenomicRanges' gscores(x, ranges, ...) ## S4 method for signature 'GScores,character' gscores(x, ranges, ...) ## S4 method for signature 'GScores' score(x, ..., simplify=TRUE)
x |
A |
ranges |
A |
... |
In the call to the
|
simplify |
Flag setting whether the result should be simplified to a
vector ( |
The method gscores()
takes as first argument a GScores
object, previouly loaded from either an annotation package or an
AnnotationHub
resource; see getGScores()
.
The arguments ref
and alt
serve two purposes. One, when there
are multiple scores per position, such as with CADD or M-CAP, and we want to
select a score matching a specific combination of reference and alternate
alleles. The other purpose is when the GScores
object x
is a
MafDb.*
package, then by providing ref
and alt
alelles
we will get separate frequencies for reference and alternate alleles. The
current lossy compression of these values yields a correct assignment for
biallelic variants in the corresponding MafDb.*
package and an
approximation for multiallelic ones.
The method gscores()
returns a GRanges
object with the genomic
scores in a metadata column called score
. The method score()
returns a numeric vector with the genomic scores.
R. Castelo
Puigdevall, P. and Castelo, R. GenomicScores: seamless access to genomewide position-specific scores from R and Bioconductor. Bioinformatics, 18:3208-3210, 2018.
phastCons100way.UCSC.hg38
MafDb.1Kgenomes.phase1.hs37d5
## one genomic range of width 5 gr1 <- GRanges(seqnames="chr22", IRanges(start=50528591, width=5)) gr1 ## five genomic ranges of width 1 gr2 <- GRanges(seqnames="chr22", IRanges(start=50528591:50528596, width=1)) gr2 ## accessing genomic gscores from an annotation package if (require(phastCons100way.UCSC.hg38)) { library(GenomicRanges) gsco <- phastCons100way.UCSC.hg38 gsco gscores(gsco, gr1) score(gsco, gr1) gscores(gsco, gr2) populations(gsco) gscores(gsco, gr2, pop="DP2") } if (require(MafDb.1Kgenomes.phase1.hs37d5)) { mafdb <- MafDb.1Kgenomes.phase1.hs37d5 mafdb populations(mafdb) ## lookup allele frequencies for SNP rs1129038, located at 15:28356859, a ## SNP associated to blue and brown eye colors as reported by Eiberg et al. ## Blue eye color in humans may be caused by a perfectly associated founder ## mutation in a regulatory element located within the HERC2 gene ## inhibiting OCA2 expression. Human Genetics, 123(2):177-87, 2008 ## [http://www.ncbi.nlm.nih.gov/pubmed/18172690] gscores(mafdb, GRanges("15:28356859"), pop=populations(mafdb)) gscores(mafdb, "rs1129038", pop=populations(mafdb)) }
## one genomic range of width 5 gr1 <- GRanges(seqnames="chr22", IRanges(start=50528591, width=5)) gr1 ## five genomic ranges of width 1 gr2 <- GRanges(seqnames="chr22", IRanges(start=50528591:50528596, width=1)) gr2 ## accessing genomic gscores from an annotation package if (require(phastCons100way.UCSC.hg38)) { library(GenomicRanges) gsco <- phastCons100way.UCSC.hg38 gsco gscores(gsco, gr1) score(gsco, gr1) gscores(gsco, gr2) populations(gsco) gscores(gsco, gr2, pop="DP2") } if (require(MafDb.1Kgenomes.phase1.hs37d5)) { mafdb <- MafDb.1Kgenomes.phase1.hs37d5 mafdb populations(mafdb) ## lookup allele frequencies for SNP rs1129038, located at 15:28356859, a ## SNP associated to blue and brown eye colors as reported by Eiberg et al. ## Blue eye color in humans may be caused by a perfectly associated founder ## mutation in a regulatory element located within the HERC2 gene ## inhibiting OCA2 expression. Human Genetics, 123(2):177-87, 2008 ## [http://www.ncbi.nlm.nih.gov/pubmed/18172690] gscores(mafdb, GRanges("15:28356859"), pop=populations(mafdb)) gscores(mafdb, "rs1129038", pop=populations(mafdb)) }
The goal of the GenomicScores
package is to provide support to store
and retrieve genomic scores associated to physical nucleotide positions along
a genome. This is achieved through the GScores
class of objects, which
is a container for genomic score values.
The GScores
class attempts to provide a compact storage and efficient
retrieval of genomic score values that have been typically processed and
stored using some form of lossy compression. This class is currently based
on a former version of the SNPlocs
class defined in the
BSgenome
package, with the following slots:
provider
(character
), the data provider such as UCSC.
provider_version
(character
), the version of the data
as given by the data provider, typically a date in some compact format.
download_url
(character
), the URL of the data provider
from where the original data were downloaded.
download_date
(character
), the date on which the data
were downloaded.
reference_genome
(GenomeDescription
), object with
information about the reference genome whose physical positions have
the genomic scores.
data_pkgname
(character
), name given to the set
of genomic scores associated to a particular genome. When the genomic
scores are stored within an annotation package, then this corresponds to
the name of that package.
data_dirpath
(character
), absolute path to the local
directory where the genomic scores are stored in one file per genome
sequence.
data_serialized_objnames
(character
), named vector of
filenames pointing to files containing the genomic scores in one file per
genome sequence. The names of this vector correspond to the genome
sequence names.
data_group
(character
), name denoting a category of
genomic scores to which the scores stored in the object belong to.
Typical values are "Conservation", "MAF", "Pathogenicity", etc.
data_tag
(character
), name identifying the genomic
scores stored in the object and which can be used, for instance, to
assign a column name storing these scores.
data_pops
(character
), vector of character strings
storing score population names. The term "default" is reserved to denote
a score set that is not associated to a particular population name and
is used by default.
data_nonsnrs
(logical
), flag indicating whether the
object stores genomic scores associated with non-single nucleotide ranges.
data_nsites
(integer
), number of sites in the genome
associated with the genomic scores stored in the object.
.data_cache
(environment
), data structure where
objects storing genomic scores are cached into main memory.
The goal of the design behind the GScores
class is to load into main
memory only the objects associated with the queried sequences to minimize the
memory footprint, which may be advantageous in workflows that parallelize the
access to genomic scores by genome sequence.
GScores
objects are created either from AnnotationHub
resources
or when loading specific annotation packages that store genomic score values.
Two such annotation packages are:
phastCons100way.UCSC.hg19
Nucleotide-level phastCons conservation scores from the UCSC Genome Browser calculated from multiple genome alignments from the human genome version hg19 to 99 vertebrate species.
phastCons100way.UCSC.hg38
Nucleotide-level phastCons conservation scores from the UCSC Genome Browser calculated from multiple genome alignments from the human genome version hg38 to 99 vertebrate species.
GScores(provider, provider_version, download_url,
download_date, reference_genome, data_pkgname, data_dirpath,
data_serialized_objnames, default_pop, data_tag)
:Creates a GScores
object. In principle, the end-user needs not to
call this function.
provider
character string, containing the data provider.
provider_version
character string, containing the version of the data as given by the data provider.
download_url
character string, containing the URL of the data provider from where the original data were downloaded.
reference_genome
GenomeDescription, storing the information about the associated reference genome.
data_pkgname
character string, name given to the set of genomic scores stored through this object.
data_dirpath
character string, absolute path to the local directory where the genomic scores are stored.
data_serialized_objname
character string vector, containing filenames where the genomic scores are stored.
default_pop
character string, containing the name of the default scores population.
data_group
character string, containing a name that indicates a category of genomic scores to which the scores in the object belong to. Typical names could be "Conservation", "MAF", etc.
data_tag
character string, containing a tag that succintly labels genomic scores from a particular source. This can be used to automatically give, for instance, a name to a column storing genomic scores in data frame object. Its default value takes the prefix of the package name.
name(x)
:get the name of the set of genomic scores.
type(x)
: get the substring of the name of the set of genomic
scores comprised between the first character until the first period. This
should typically match the type of genomic scores such as,
phastCons
, phyloP
, etc.
provider(x)
:get the data provider.
providerVersion(x)
:get the provider version.
organism(x)
:get the organism associated with the genomic scores.
seqlevelsStyle(x)
:get the genome sequence style.
seqinfo(x)
:get the genome sequence information.
seqnames(x)
:get the genome sequence names.
seqlengths(x)
:get the genome sequence lengths.
populations(x)
: get the identifiers of the available scores
populations. If only one scores population is available, then it shows
only the term default
.
defaultPopulation(x)
:get or set the default population of scores.
gscoresCategory(x)
:get or set the genomic scores category label.
gscoresTag(x)
:get or set the genomic scores tag label.
gscoresNonSNRs(x)
:get whether there are genomic scores associated with non-single nucleotide ranges.
nsites(x)
:get the number of sites in the genome with genomic scores.
qfun(x)
:get the quantizer function.
dqfun(x)
:get the dequantizer function.
citation(x)
: get citation information for the genomic scores data
in the form of a bibentry
object.
R. Castelo
Puigdevall, P. and Castelo, R. GenomicScores: seamless access to genomewide position-specific scores from R and Bioconductor. Bioinformatics, 18:3208-3210, 2018.
gscores()
score()
phastCons100way.UCSC.hg38
## one genomic range of width 5 gr1 <- GRanges(seqnames="chr22", IRanges(start=50528591, width=5)) gr1 ## five genomic ranges of width 1 gr2 <- GRanges(seqnames="chr22", IRanges(start=50528591:50528596, width=1)) gr2 ## supporting annotation packages with genomic scores if (require(phastCons100way.UCSC.hg38)) { library(GenomicRanges) phast <- phastCons100way.UCSC.hg38 phast gscores(phast, gr1) score(phast, gr1) gscores(phast, gr2) populations(phast) gscores(phast, gr2, pop="DP2") } ## supporting AnnotationHub resources ## Not run: availableGScores() phast <- getGScores("phastCons100way.UCSC.hg38") phast gscores(phast, gr1) ## End(Not run) ## metadata from a GScores object name(phast) type(phast) provider(phast) providerVersion(phast) organism(phast) seqlevelsStyle(phast) seqinfo(phast) head(seqnames(phast)) head(seqlengths(phast)) gscoresTag(phast) populations(phast) defaultPopulation(phast) qfun(phast) dqfun(phast) citation(phast)
## one genomic range of width 5 gr1 <- GRanges(seqnames="chr22", IRanges(start=50528591, width=5)) gr1 ## five genomic ranges of width 1 gr2 <- GRanges(seqnames="chr22", IRanges(start=50528591:50528596, width=1)) gr2 ## supporting annotation packages with genomic scores if (require(phastCons100way.UCSC.hg38)) { library(GenomicRanges) phast <- phastCons100way.UCSC.hg38 phast gscores(phast, gr1) score(phast, gr1) gscores(phast, gr2) populations(phast) gscores(phast, gr2, pop="DP2") } ## supporting AnnotationHub resources ## Not run: availableGScores() phast <- getGScores("phastCons100way.UCSC.hg38") phast gscores(phast, gr1) ## End(Not run) ## metadata from a GScores object name(phast) type(phast) provider(phast) providerVersion(phast) organism(phast) seqlevelsStyle(phast) seqinfo(phast) head(seqnames(phast)) head(seqlengths(phast)) gscoresTag(phast) populations(phast) defaultPopulation(phast) qfun(phast) dqfun(phast) citation(phast)
Starts an interactive GenomicScores shiny web app.
igscores()
igscores()
The goal of the GenomicScores
package is to provide support to store
and retrieve genomic scores associated to physical nucleotide positions along
a genome.
The igscores()
function starts an interactive shiny web app that allows the
user to query annotation packages storing genomic scores. Internally, it calls to
the function gscores()
; see its manual page for a description of the
arguments and their default and alternative values.
None.
P. RodrÃguez and R. Castelo
Puigdevall, P. and Castelo, R. GenomicScores: seamless access to genomewide position-specific scores from R and Bioconductor. Bioinformatics, 18:3208-3210, 2018.
## Not run: igscores() ## this will open your browser with the GenomicScores shiny web app ## End(Not run)
## Not run: igscores() ## this will open your browser with the GenomicScores shiny web app ## End(Not run)
Build a genomic gscores packages from a GScores
object.
makeGScoresPackage(gsco, version, maintainer, author, destDir=".", license="Artistic-2.0")
makeGScoresPackage(gsco, version, maintainer, author, destDir=".", license="Artistic-2.0")
gsco |
|
version |
|
maintainer |
|
author |
|
destDir |
|
license |
|
This function allows one to create an R package from a GScores
object. This may be useful if one wants to have a tar-ball package version of genomic scores available only through the AnnotationHub; see the vignette.
It returns invisibily the package directory.
R. Castelo
Puigdevall, P. and Castelo, R. GenomicScores: seamless access to genomewide position-specific scores from R and Bioconductor. Bioinformatics, 18:3208-3210, 2018.
availableGScores()
getGScores()
## accessing genomic scores from AnnotationHub resources ## and building a package from them ## Not run: availableGScores() gsco <- getGScores("fitCons.UCSC.hg19") makeGScoresPackage(gsco, version="1.0", maintainer="me <[email protected]>", author="me") ## End(Not run)
## accessing genomic scores from AnnotationHub resources ## and building a package from them ## Not run: availableGScores() gsco <- getGScores("fitCons.UCSC.hg19") makeGScoresPackage(gsco, version="1.0", maintainer="me <[email protected]>", author="me") ## End(Not run)
Function for randomly sampling genomic gscores from GScores
objects.
## S4 method for signature 'GScores,missing' rgscores(n, object, ...) ## S4 method for signature 'missing,GScores' rgscores(n, object, ...) ## S4 method for signature 'numeric,GScores' rgscores(n, object, ...) ## S4 method for signature 'integer,GScores' rgscores(n, object, ...)
## S4 method for signature 'GScores,missing' rgscores(n, object, ...) ## S4 method for signature 'missing,GScores' rgscores(n, object, ...) ## S4 method for signature 'numeric,GScores' rgscores(n, object, ...) ## S4 method for signature 'integer,GScores' rgscores(n, object, ...)
n |
Number of scores to sample. |
object |
A |
... |
In the call to the
|
The method rgscores()
samples scores randomly from a GScores
object.
A GRanges
object with the sampled genomic positions and scores. When
scores.only=TRUE
then a numeric vector is returned with the sampled scores.
R. Castelo
Puigdevall, P. and Castelo, R. GenomicScores: seamless access to genomewide position-specific scores from R and Bioconductor. Bioinformatics, 18:3208-3210, 2018.
phastCons100way.UCSC.hg38
MafDb.1Kgenomes.phase1.hs37d5
## accessing genomic gscores from an annotation package if (require(phastCons100way.UCSC.hg38)) { library(GenomicRanges) phast <- phastCons100way.UCSC.hg38 set.seed(123) rgscores(10L, phast, ranges=c("chr22", "chrY")) } if (require(MafDb.1Kgenomes.phase1.hs37d5)) { mafdb <- MafDb.1Kgenomes.phase1.hs37d5 set.seed(123) rgscores(10L, mafdb, ranges=c("21", "22")) }
## accessing genomic gscores from an annotation package if (require(phastCons100way.UCSC.hg38)) { library(GenomicRanges) phast <- phastCons100way.UCSC.hg38 set.seed(123) rgscores(10L, phast, ranges=c("chr22", "chrY")) } if (require(MafDb.1Kgenomes.phase1.hs37d5)) { mafdb <- MafDb.1Kgenomes.phase1.hs37d5 set.seed(123) rgscores(10L, mafdb, ranges=c("21", "22")) }
Functions to discover which genomic scores are present in given genomic ranges through GScores
objects.
## S4 method for signature 'GScores,GenomicRanges' wgscores(x, ranges, ...)
## S4 method for signature 'GScores,GenomicRanges' wgscores(x, ranges, ...)
x |
A |
ranges |
A |
... |
In the call to the
|
The method wgscores()
takes as first argument a GScores
object, previouly loaded from either an annotation package or an
AnnotationHub
resource; see getGScores()
and a
GenomicRanges
object as a second argument. It will search for
which genomic scores fall within the provided genomic ranges and return
them in an ordered GenomicRanges
object with the scores in the
metadata columns.
The method wgscores()
returns a GRanges
object with the
genomic scores in metadata columns named after the corresponding score
population name.
R. Castelo
Puigdevall, P. and Castelo, R. GenomicScores: seamless access to genomewide position-specific scores from R and Bioconductor. Bioinformatics, 18:3208-3210, 2018.
if (require(MafDb.1Kgenomes.phase1.hs37d5)) { mafdb <- MafDb.1Kgenomes.phase1.hs37d5 mafdb populations(mafdb) ## lookup allele frequencies for SNP rs1129038, located at 15:28356859 in ## GRCh37, a SNP associated to blue and brown eye colors as reported by ## Eiberg et al. (Human Genetics, 2008; http://www.ncbi.nlm.nih.gov/pubmed/18172690). ## Blue eye color in humans may be caused by a perfectly associated founder ## mutation in a regulatory element located within the HERC2 gene inhibiting ## OCA2 expression. ## ## for the sake of illustrating this functionality let's create a ## GenomicRanges object with the SNP rs1129038 and enlarge its range ## by 200nt at each flank. snp <- GRanges("15:28356859") rngsnp <- flank(snp, width=100, both=TRUE) width(rngsnp) ## now use this genomic range to search for the rs1129038 SNP wgscores(mafdb, rngsnp) ## let's illustrate this same functionality for INDELs, starting ## from the specific INDEL rs113993960 that leads to the loss of ## phenylalanine at amino acid position 508 of the CFTR protein, ## commonly referred to as F508del in the CFTR gene, which is ## concretely a deletion of four nucleotides at position ## 7:117199644 in GRCh37 and enlarge its range by 20nt on each flank. indel <- GRanges(seqnames="chr7", ranges=IRanges(start=117199644, width=4)) rngindel <- flank(indel, width=20, both=TRUE) width(rngindel) ## now use this genomic range to search for the rs113993960 INDEL wgscores(mafdb, rngindel, type="nonsnrs") }
if (require(MafDb.1Kgenomes.phase1.hs37d5)) { mafdb <- MafDb.1Kgenomes.phase1.hs37d5 mafdb populations(mafdb) ## lookup allele frequencies for SNP rs1129038, located at 15:28356859 in ## GRCh37, a SNP associated to blue and brown eye colors as reported by ## Eiberg et al. (Human Genetics, 2008; http://www.ncbi.nlm.nih.gov/pubmed/18172690). ## Blue eye color in humans may be caused by a perfectly associated founder ## mutation in a regulatory element located within the HERC2 gene inhibiting ## OCA2 expression. ## ## for the sake of illustrating this functionality let's create a ## GenomicRanges object with the SNP rs1129038 and enlarge its range ## by 200nt at each flank. snp <- GRanges("15:28356859") rngsnp <- flank(snp, width=100, both=TRUE) width(rngsnp) ## now use this genomic range to search for the rs1129038 SNP wgscores(mafdb, rngsnp) ## let's illustrate this same functionality for INDELs, starting ## from the specific INDEL rs113993960 that leads to the loss of ## phenylalanine at amino acid position 508 of the CFTR protein, ## commonly referred to as F508del in the CFTR gene, which is ## concretely a deletion of four nucleotides at position ## 7:117199644 in GRCh37 and enlarge its range by 20nt on each flank. indel <- GRanges(seqnames="chr7", ranges=IRanges(start=117199644, width=4)) rngindel <- flank(indel, width=20, both=TRUE) width(rngindel) ## now use this genomic range to search for the rs113993960 INDEL wgscores(mafdb, rngindel, type="nonsnrs") }