This section gives an overview of the operations that can be
performed on a given set of metadata obtained particularly from
data-rich objects such as those obtained from
curatedTCGAData
. There are several operations that work
with microRNA, methylation, mutation, and assays that have gene symbol
annotations.
CpGtoRanges
Using the methylation annotations in
IlluminaHumanMethylation450kanno.ilmn12.hg19
and the
minfi
package, we look up CpG probes and convert to genomic
coordinates with CpGtoRanges
. The function provides two
assays, one with mapped probes and the other with unmapped probes.
Excluding unmapped probes can be done by setting the
unmapped
argument to FALSE
. This will run for
both types of methylation data (27k and 450k).
methcoad <- CpGtoRanges(coad)
## Setting options('download.file.method.GEOquery'='auto')
## Setting options('GEOquery.inmemory.gpl'=FALSE)
## Warning: 'experiments' dropped; see 'drops()'
## harmonizing input:
## removing 535 sampleMap rows not in names(experiments)
mirToRanges
microRNA assays obtained from curatedTCGAData
have
annotated sequences that can be converted to genomic ranges using the
mirbase.db
package. The function looks up all sequences and
converts them to (‘hg19’) ranges. For those rows that cannot be found,
an ‘unranged’ assay is introduced in the resulting MultiAssayExperiment
object.
mircoad <- mirToRanges(coad)
## Warning in (function (seqlevels, genome, new_style) : cannot switch some hg19's
## seqlevels from UCSC to NCBI style
## Warning: 'experiments' dropped; see 'drops()'
## harmonizing input:
## removing 221 sampleMap rows not in names(experiments)
qreduceTCGA
The qreduceTCGA
function converts
RaggedExperiment
mutation data objects to
RangedSummarizedExperiment
using org.Hs.eg.db
and the qreduceTCGA
utility function from
RaggedExperiment
to summarize ‘silent’ and ‘non-silent’
mutations based on a ‘Variant_Classification’ metadata column in the
original object.
It uses ‘hg19’ transcript database (‘TxDb’) package internally to
summarize regions using qreduceAssay
. The current genome
build (‘hg18’) in the data must be translated to ‘hg19’.
In this example, we first set the appropriate build name in the
mutation dataset COAD_Mutation-20160128
according to the NCBI
website and we then use seqlevelsStyle
to match the
UCSC
style in the chain.
rag <- "COAD_Mutation-20160128"
# add the appropriate genome annotation
genome(coad[[rag]]) <- "NCBI36"
# change the style to UCSC
seqlevelsStyle(rowRanges(coad[[rag]])) <- "UCSC"
# inspect changes
seqlevels(rowRanges(coad[[rag]]))
## [1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8" "chr9"
## [10] "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18"
## [19] "chr19" "chr20" "chr21" "chr22" "chrX" "chrY"
## chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11
## "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18"
## chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22
## "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18"
## chrX chrY
## "hg18" "hg18"
Now we use liftOver
from rtracklayer
to
translate ‘hg18’ builds to ‘hg19’ using the chain file obtained via
AnnotationHub
. We use a query to find the file. You can
also query with terms such as
c("Homo sapiens", "chain", "hg18", "hg19")
in the query
function. We are specifically looking for the chain file
“hg18ToHg19.over.chain.gz”.
library(AnnotationHub)
ah <- AnnotationHub()
query(ah, "hg18ToHg19.over.chain.gz")
## AnnotationHub with 1 record
## # snapshotDate(): 2024-10-28
## # names(): AH14220
## # $dataprovider: UCSC
## # $species: Homo sapiens
## # $rdataclass: ChainFile
## # $rdatadateadded: 2014-12-15
## # $title: hg18ToHg19.over.chain.gz
## # $description: UCSC liftOver chain file from hg18 to hg19
## # $taxonomyid: 9606
## # $genome: hg18
## # $sourcetype: Chain
## # $sourceurl: http://hgdownload.cse.ucsc.edu/goldenpath/hg18/liftOver/hg18To...
## # $sourcesize: NA
## # $tags: c("liftOver", "chain", "UCSC", "genome", "homology")
## # retrieve record with 'object[["AH14220"]]'
## downloading 1 resources
## retrieving 1 resource
## loading from cache
ranges19 <- rtracklayer::liftOver(rowRanges(coad[[rag]]), chain)
Note. The same can be done to convert
hg19
to hg38
(the same build that the Genomic
Data Commons uses) with the corresponding chain file as obtained
above.
This will give us a list of ranges, each element corresponding to a
single row in the RaggedExperiment
. We remove rows that had
no matches in the liftOver
process and replace the ranges
in the original RaggedExperiment
with the replacement
method. Finally, we put the RaggedExperiment
object back
into the MultiAssayExperiment
.
re19 <- coad[[rag]][as.logical(lengths(ranges19))]
ranges19 <- unlist(ranges19)
genome(ranges19) <- "hg19"
rowRanges(re19) <- ranges19
# replacement
coad[["COAD_Mutation-20160128"]] <- re19
rowRanges(re19)
## GRanges object with 62523 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr20 1552407-1552408 +
## [2] chr1 161736152-161736153 +
## [3] chr7 100685895 +
## [4] chr7 103824453 +
## [5] chr7 104783644 +
## ... ... ... ...
## [62519] chr9 36369716 +
## [62520] chr9 37692640 +
## [62521] chr9 6007456 +
## [62522] chrX 123785782 +
## [62523] chrX 51487184 +
## -------
## seqinfo: 24 sequences from hg19 genome; no seqlengths
Now that we have matching builds, we can finally run the
qreduceTCGA
function.
coad <- qreduceTCGA(coad, keep.assay = TRUE)
##
## 403 genes were dropped because they have exons located on both strands of the
## same reference sequence or on more than one reference sequence, so cannot be
## represented by a single genomic range.
## Use 'single.strand.genes.only=FALSE' to get all the genes in a GRangesList
## object, or use suppressMessages() to suppress this message.
## Warning in (function (seqlevels, genome, new_style) : cannot switch some hg19's
## seqlevels from UCSC to NCBI style
## 'select()' returned 1:1 mapping between keys and columns
## Warning in .normarg_seqlevelsStyle(value): more than one seqlevels style
## supplied, using the 1st one only
## Warning in .normarg_seqlevelsStyle(value): cannot switch some hg19's seqlevels
## from UCSC to NCBI style
symbolsToRanges
In the cases where row annotations indicate gene symbols, the
symbolsToRanges
utility function converts genes to genomic
ranges and replaces existing assays with
RangedSummarizedExperiment
objects. Gene annotations are
given as ‘hg19’ genomic regions.
## 403 genes were dropped because they have exons located on both strands of the
## same reference sequence or on more than one reference sequence, so cannot be
## represented by a single genomic range.
## Use 'single.strand.genes.only=FALSE' to get all the genes in a GRangesList
## object, or use suppressMessages() to suppress this message.
## Warning in (function (seqlevels, genome, new_style) : cannot switch some hg19's
## seqlevels from UCSC to NCBI style
## 'select()' returned 1:1 mapping between keys and columns
## 403 genes were dropped because they have exons located on both strands of the
## same reference sequence or on more than one reference sequence, so cannot be
## represented by a single genomic range.
## Use 'single.strand.genes.only=FALSE' to get all the genes in a GRangesList
## object, or use suppressMessages() to suppress this message.
## Warning in (function (seqlevels, genome, new_style) : cannot switch some hg19's
## seqlevels from UCSC to NCBI style
## 'select()' returned 1:1 mapping between keys and columns
## Warning: 'experiments' dropped; see 'drops()'
## harmonizing input:
## removing 363 sampleMap rows not in names(experiments)
## A MultiAssayExperiment object of 11 listed
## experiments with user-defined names and respective classes.
## Containing an ExperimentList class object of length 11:
## [1] COAD_CNASeq-20160128: RaggedExperiment with 40530 rows and 136 columns
## [2] COAD_miRNASeqGene-20160128: SummarizedExperiment with 705 rows and 221 columns
## [3] COAD_Mutation-20160128: RaggedExperiment with 62523 rows and 154 columns
## [4] COAD_Methylation_methyl27-20160128: SummarizedExperiment with 27578 rows and 202 columns
## [5] COAD_Methylation_methyl450-20160128: SummarizedExperiment with 485577 rows and 333 columns
## [6] COAD_Mutation-20160128_simplified: RangedSummarizedExperiment with 22913 rows and 154 columns
## [7] COAD_CNASeq-20160128_simplified: RangedSummarizedExperiment with 22913 rows and 136 columns
## [8] COAD_mRNAArray-20160128_ranged: RangedSummarizedExperiment with 14121 rows and 172 columns
## [9] COAD_mRNAArray-20160128_unranged: SummarizedExperiment with 3693 rows and 172 columns
## [10] COAD_RNASeq2GeneNorm-20160128_ranged: RangedSummarizedExperiment with 16970 rows and 191 columns
## [11] COAD_RNASeq2GeneNorm-20160128_unranged: SummarizedExperiment with 3531 rows and 191 columns
## Functionality:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample coordination DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
## exportClass() - save data to flat files