Title: | Spike-in calibration for cell-free MeDIP |
---|---|
Description: | spiky implements methods and model generation for cfMeDIP (cell-free methylated DNA immunoprecipitation) with spike-in controls. CfMeDIP is an enrichment protocol which avoids destructive conversion of scarce template, making it ideal as a "liquid biopsy," but creating certain challenges in comparing results across specimens, subjects, and experiments. The use of synthetic spike-in standard oligos allows diagnostics performed with cfMeDIP to quantitatively compare samples across subjects, experiments, and time points in both relative and absolute terms. |
Authors: | Samantha Wilson [aut], Lauren Harmon [aut], Tim Triche [aut, cre] |
Maintainer: | Tim Triche <[email protected]> |
License: | GPL-2 |
Version: | 1.13.0 |
Built: | 2024-11-19 04:31:46 UTC |
Source: | https://github.com/bioc/spiky |
given a vector of fragment identifiers like 160_2_35
or 80b_1C_35G-2
,
encoded typically as lengthInBp_numberOfCpGs_GCpercent, and optionally a
database of spike-in sequences corresponding to those fragments, add those
columns to the source data (along with, if present in the database, other
metadata such as standard concentrations, GC fraction, etc.) and return i
an updated DataFrame.
add_frag_info(x, frag_grp = "frag_grp", spike = NULL)
add_frag_info(x, frag_grp = "frag_grp", spike = NULL)
x |
data.frame with a column of spike information (see above) |
frag_grp |
column name for the spike contig information ( |
spike |
optional database of spike-in properties (none) |
the data.frame x, augmented with metadata columns
data(spike_cram_counts) data(spike, package="spiky") spike <- subset(spike, methylated == 1) add_frag_info(spike_cram_counts, spike=spike)
data(spike_cram_counts) data(spike, package="spiky") spike <- subset(spike, methylated == 1) add_frag_info(spike_cram_counts, spike=spike)
This function replaces a bedtools call: bedtools intersect -wao -a fragments.bed -b hg38_300bp_windows.bed > data.bed
bam_to_bins(x, width = 300, param = NULL, which = IRangesList(), ...)
bam_to_bins(x, width = 300, param = NULL, which = IRangesList(), ...)
x |
a BAM or CRAM filename (or a BamFile object) |
width |
the width of the bins to tile (default is 300) |
param |
optional ScanBamParam (whence we attempt to extract |
which |
an optional GRanges restricting the bins to certain locations |
... |
additional arguments to pass on to seqinfo_from_header |
The idea is to skip the BED creation step for most runs, and just do it once. In order to count reads in bins, we need bins. In order to have bins, we need to know how long the chromosomes are. In order to have a BAM or CRAM file, we need to have those same lengths. This function takes advantage of all of the above to create binned ranges. Note that a very recent branch of Rsamtools is required for CRAM file bins.
a GRangesList with y-base-pair-wide bins tiled across it
seqinfo_from_header
library(Rsamtools) fl <- system.file("extdata", "ex1.bam", package="Rsamtools", mustWork=TRUE) bam_to_bins(fl)
library(Rsamtools) fl <- system.file("extdata", "ex1.bam", package="Rsamtools", mustWork=TRUE) bam_to_bins(fl)
Given the results of model_glm_pmol and predict_pmol, adjust the predictions to reflect picomoles of captured DNA overlapping a given bin in the genome.
bin_pmol(x)
bin_pmol(x)
x |
results from predict_pmol (a data.frame or GRanges) |
the same object, but with a column `adjusted_pred_con`
model_glm_pmol
predict_pmol
data(spike, package="spiky") data(spike_res, package="spiky") data(genomic_res,package="spiky") fit <- model_glm_pmol(covg_to_df(spike_res, spike=spike),spike=spike) pred <- predict_pmol(fit, genomic_res, ret="df") bin_pmol(pred)
data(spike, package="spiky") data(spike_res, package="spiky") data(genomic_res,package="spiky") fit <- model_glm_pmol(covg_to_df(spike_res, spike=spike),spike=spike) pred <- predict_pmol(fit, genomic_res, ret="df") bin_pmol(pred)
Convert Pairs to GRanges
convertPairedGRtoGR(pairs)
convertPairedGRtoGR(pairs)
pairs |
the Pairs object |
a GRanges
scan_spiked_bam
results into data.frames for model_glm_pmolreshape scan_spiked_bam
results into data.frames for model_glm_pmol
covg_to_df(spike_gr, spike, meth = TRUE, ID = NULL)
covg_to_df(spike_gr, spike, meth = TRUE, ID = NULL)
spike_gr |
GRanges of spike contigs (e.g. output object from scan_spiked_bam, scan_spike_contigs, or scan_spike_bedpe) |
spike |
spike database (as from data(spike, package="spiky")) |
meth |
only keep methylated spike reads? (TRUE; if FALSE, sum both) |
ID |
an identifier for this sample, if running several (autogenerate) |
a data.frame with columns 'frag_grp', 'id', and 'read_count'
scan_spiked_bam
data(spike, package="spiky") data(spike_res, package="spiky") subsetted <- covg_to_df(spike_res, spike=spike, meth=TRUE) summed <- covg_to_df(spike_res, spike=spike, meth=FALSE) round((summed$read_count - subsetted$read_count) / summed$read_count, 3)
data(spike, package="spiky") data(spike_res, package="spiky") subsetted <- covg_to_df(spike_res, spike=spike, meth=TRUE) summed <- covg_to_df(spike_res, spike=spike, meth=FALSE) round((summed$read_count - subsetted$read_count) / summed$read_count, 3)
A data.frame with spike-in results from control samples in the manuscript.
This maps 1:1 onto spike_read_counts
using reshape2::melt.
data(dedup)
data(dedup)
A data.frame object with
the encoded spike contig name: basepairs_CpGs_GCpercent
read coverage for this spike in sample 6547
read coverage for this spike in sample 6548
This data was created using inst/script/loadDedup.R
x
, where !is.null(seqinfo(x))Find the spike-like contigs in a BAM with both natural and spiked contigs. This started out as glue in some other functions and got refactored out.
find_spike_contigs(x, spike)
find_spike_contigs(x, spike)
x |
something with seqlevels |
spike |
a DataFrame with spike-in information |
The indices have an attribute "mappings", which is a character vector
such that attr(find_spike_contigs(x), "mappings") == standardized
for all contig names in the CRAM/BAM/whatever, and standardized is
the rowname in spike
that corresponds to the original contig name.
indices of which contigs in seqlevels(x) are spike-in contigs
get_base_name
rename_spike_seqlevels
sb <- system.file("extdata", "example.spike.bam", package="spiky", mustWork=TRUE) si <- seqinfo_from_header(sb) data(spike, package="spiky") find_spike_contigs(si, spike=spike)
sb <- system.file("extdata", "example.spike.bam", package="spiky", mustWork=TRUE) si <- seqinfo_from_header(sb) data(spike, package="spiky") find_spike_contigs(si, spike=spike)
A DataFrame with species, genome, accession, and sequence for GenBank mitochondrial genome depositions. No concentration provided; add if needed.
data(genbank_mito)
data(genbank_mito)
A DataFrame object with
the species whence the record came, as a character string
the genome assembly whence the mtDNA, as a character string
the genbank accession, as a character string
genome sequence, as a DNAStringSet
www.ncbi.nlm.nih.gov/genbank/
A FASTA reference is not always needed, so long as .crai indices
are available for all contigs in the CRAM. See spike_counts
for a fast
and convenient alternative that extracts spike coverage from index stats.
However, spike_counts has its own issues, and it's better to use fragments.
generate_spike_fasta(bam, spike, assembly = NULL, fa = "spike_contigs.fa")
generate_spike_fasta(bam, spike, assembly = NULL, fa = "spike_contigs.fa")
bam |
a BAM or CRAM file, hopefully with an index |
spike |
the spike contig database (mandatory as of 0.9.99) |
assembly |
optional BSgenome or seqinfo with reference contigs (NULL) |
fa |
the filename for the resulting FASTA ("spikes.fa") |
If the contigs in a CRAM have even slightly different names from those in the reference, decoding will fail. In some cases there are multiple names for a given contig (which raises the question of whether to condense them), and thus the same reference sequence decodes multiple contig names.
This function generates an appropriate spike reference for a BAM or CRAM, using BAM/CRAM headers to figure out which references are used for which.
At the moment, CRAM support in Rsamtools only exists in the GitHub branch:
BiocManager::install("Bioconductor/Rsamtools@cram")
Using other versions of Rsamtools will yield an error on CRAM files.
Note that for merged genomic + spike reference BAMs/CRAMs, this function will only attempt to generate a FASTA for the spike contigs, not reference. If your reference contigs are screwed up, talk to your sequencing people, and keep better track of the FASTA reference against which you compress!
invisibly, a DNAStringSet as exported to `fa`
rename_contigs
library(GenomicRanges) data(spike, package="spiky") sb <- system.file("extdata", "example.spike.bam", package="spiky", mustWork=TRUE) outFasta <- paste(system.file("extdata", package="spiky", mustWork=TRUE),"/spike_contigs.fa",sep="") show(generate_spike_fasta(sb, spike=spike,fa=outFasta))
library(GenomicRanges) data(spike, package="spiky") sb <- system.file("extdata", "example.spike.bam", package="spiky", mustWork=TRUE) outFasta <- paste(system.file("extdata", package="spiky", mustWork=TRUE),"/spike_contigs.fa",sep="") show(generate_spike_fasta(sb, spike=spike,fa=outFasta))
A Granges object with genomic coverage from chr21q22, binned every 300bp for the genomic contigs then averaged across the bin. (In other words, the default output of scan_genomic_contigs or scan_genomic_bedpe, restricted to a small enough set of genomic regions to be practical for examples.) This represents what most users will want to generate from their own genomic BAMs or BEDPEs, and is used repeatedly in downstream examples throughout the package.
data(genomic_res)
data(genomic_res)
A GRanges of coverage results with one metadata column, coverage
Generated using scan_genomic_bedpe or scan_genomic_contigs on an example bedpe or bam containing chr21q22 contigs.
A common task between generate_spike_fasta, rename_spikes, and rename_spike_seqlevels is to determine what the largest common subset of characters between existing contig names and stored standardized contigs might be. This function eases that task.
get_base_name(contig_names, sep = "_")
get_base_name(contig_names, sep = "_")
contig_names |
the names of contigs |
sep |
separator character in contig names ("_") |
a vector of elements 1:3 from each contig name
sb <- system.file("extdata", "example.spike.bam", package="spiky", mustWork=TRUE) bh <- scanBamHeader(BamFile(sb)) orig_contigs <- names(bh$targets) get_base_name(orig_contigs)
sb <- system.file("extdata", "example.spike.bam", package="spiky", mustWork=TRUE) bh <- scanBamHeader(BamFile(sb)) orig_contigs <- names(bh$targets) get_base_name(orig_contigs)
refactored out of scan_spiked_bam
get_binned_coverage(bins, covg)
get_binned_coverage(bins, covg)
bins |
the GRanges with bins |
covg |
the coverage result (an RleList) |
a GRanges of summarized coverage
get_spiked_coverage
scan_spiked_bam
sb <- system.file("extdata", "example.spike.bam", package="spiky", mustWork=TRUE) data(spike,package="spiky") si <- seqinfo_from_header(sb) genome(si) <- "spike" mgr <- get_merged_gr(si,spike=spike) fl <- scanBamFlag(isDuplicate=FALSE, isPaired=TRUE, isProperPair=TRUE) bp <- ScanBamParam(flag=fl) bamMapqFilter(bp) <- 20 covg <- get_spiked_coverage(sb, bp=bp, gr=mgr) get_binned_coverage(bins=GRanges(), covg=covg)
sb <- system.file("extdata", "example.spike.bam", package="spiky", mustWork=TRUE) data(spike,package="spiky") si <- seqinfo_from_header(sb) genome(si) <- "spike" mgr <- get_merged_gr(si,spike=spike) fl <- scanBamFlag(isDuplicate=FALSE, isPaired=TRUE, isProperPair=TRUE) bp <- ScanBamParam(flag=fl) bamMapqFilter(bp) <- 20 covg <- get_spiked_coverage(sb, bp=bp, gr=mgr) get_binned_coverage(bins=GRanges(), covg=covg)
refactored from scan_spiked_bam to clarify information flow
get_merged_gr(si, spike, standard = TRUE)
get_merged_gr(si, spike, standard = TRUE)
si |
seqinfo, usually from a BAM/CRAM file with spike contigs |
spike |
database of spike-in standard sequence features (spike) |
standard |
trim to standard chromosomes? (TRUE) |
By default, get_merged_gr will return a GRanges with "standardized" genomic and spike contig names (i.e. genomic chr1-22, X, Y, M, and the canonical spike names in data(spike, package="spiky")).
The constraint to "standard" chromosomes on genomic contigs can be
removed by setting standard
to FALSE in the function arguments.
GRanges with two genomes: the organism assembly and "spike"
sb <- system.file("extdata", "example.spike.bam", package="spiky", mustWork=TRUE) si <- seqinfo_from_header(sb) genome(si) <- "spike" # no genomic contigs data(spike, package="spiky") get_merged_gr(si, spike=spike) # note canonicalized spikes
sb <- system.file("extdata", "example.spike.bam", package="spiky", mustWork=TRUE) si <- seqinfo_from_header(sb) genome(si) <- "spike" # no genomic contigs data(spike, package="spiky") get_merged_gr(si, spike=spike) # note canonicalized spikes
get the (max, median, or mean) coverage for spike-in contigs from a BAM/CRAM
get_spike_depth(covg, spike_gr = NULL, spike = NULL, how = c("max", "mean"))
get_spike_depth(covg, spike_gr = NULL, spike = NULL, how = c("max", "mean"))
covg |
the coverage RleList |
spike_gr |
the spike-in GRanges (default: figure out from seqinfo) |
spike |
information about the spikes (default: load |
how |
how to summarize the per-spike coverage (max) |
a GRanges with summarized coverage and features for each
sb <- system.file("extdata", "example.spike.bam", package="spiky", mustWork=TRUE) data(spike, package="spiky") si <- seqinfo_from_header(sb) genome(si) <- "spike" mgr <- get_merged_gr(si,spike=spike) fl <- scanBamFlag(isDuplicate=FALSE, isPaired=TRUE, isProperPair=TRUE) bp <- ScanBamParam(flag=fl) bamMapqFilter(bp) <- 20 covg <- get_spiked_coverage(sb, bp=bp, gr=mgr) get_spike_depth(covg, spike_gr=mgr, spike=spike)
sb <- system.file("extdata", "example.spike.bam", package="spiky", mustWork=TRUE) data(spike, package="spiky") si <- seqinfo_from_header(sb) genome(si) <- "spike" mgr <- get_merged_gr(si,spike=spike) fl <- scanBamFlag(isDuplicate=FALSE, isPaired=TRUE, isProperPair=TRUE) bp <- ScanBamParam(flag=fl) bamMapqFilter(bp) <- 20 covg <- get_spiked_coverage(sb, bp=bp, gr=mgr) get_spike_depth(covg, spike_gr=mgr, spike=spike)
FIXME: this is wicked slow, ask Herve if a faster version exists
get_spiked_coverage(bf, bp, gr)
get_spiked_coverage(bf, bp, gr)
bf |
the BamFile object |
bp |
the ScanBamParam object |
gr |
the GRanges with sorted seqlevels |
Refactored from scan_spiked_bam, this is a very simple wrapper
a list of Rles
scan_spiked_bam
coverage
sb <- system.file("extdata", "example.spike.bam", package="spiky", mustWork=TRUE) si <- seqinfo_from_header(sb) genome(si) <- "spike" data(spike, package="spiky") mgr <- get_merged_gr(si, spike=spike) # note canonicalized spikes fl <- scanBamFlag(isDuplicate=FALSE, isPaired=TRUE, isProperPair=TRUE) bp <- ScanBamParam(flag=fl) bamMapqFilter(bp) <- 20 get_spiked_coverage(sb, bp=bp, gr=mgr)
sb <- system.file("extdata", "example.spike.bam", package="spiky", mustWork=TRUE) si <- seqinfo_from_header(sb) genome(si) <- "spike" data(spike, package="spiky") mgr <- get_merged_gr(si, spike=spike) # note canonicalized spikes fl <- scanBamFlag(isDuplicate=FALSE, isPaired=TRUE, isProperPair=TRUE) bp <- ScanBamParam(flag=fl) bamMapqFilter(bp) <- 20 get_spiked_coverage(sb, bp=bp, gr=mgr)
simple contig kmer comparisons
kmax(km, normalize = TRUE)
kmax(km, normalize = TRUE)
km |
kmer summary |
normalize |
normalize (divide by row sums)? (TRUE) |
the most common kmers for each contig, across all contigs
data(genbank_mito, package="spiky") mtk6 <- kmers(genbank_mito, k=6) rownames(mtk6) <- paste0(rownames(mtk6), "_MT") kmax(mtk6) data(phage, package="spiky") phk6 <- kmers(phage, k=6) kmax(phk6, normalize=FALSE) stopifnot(identical(colnames(phk6), colnames(mtk6))) k6 <- rbind(mtk6, phk6) kmax(k6)
data(genbank_mito, package="spiky") mtk6 <- kmers(genbank_mito, k=6) rownames(mtk6) <- paste0(rownames(mtk6), "_MT") kmax(mtk6) data(phage, package="spiky") phk6 <- kmers(phage, k=6) kmax(phk6, normalize=FALSE) stopifnot(identical(colnames(phk6), colnames(mtk6))) k6 <- rbind(mtk6, phk6) kmax(k6)
oligonucleotideFrequency, but less letters and more convenient.
kmers(x, k = 6)
kmers(x, k = 6)
x |
BSgenome, DFrame with |
k |
the length of the kmers (default is 6) |
The companion kmax
function finds the maximum frequency kmer for each
contig and plots all of them together for comparison purposes.
a matrix of contigs (rows) by kmer frequencies (columns)
kmax
data(genbank_mito, package="spiky") mtk6 <- kmers(genbank_mito, k=6) kmax(mtk6) data(phage, package="spiky") phk6 <- kmers(phage, k=6) kmax(phk6)
data(genbank_mito, package="spiky") mtk6 <- kmers(genbank_mito, k=6) kmax(mtk6) data(phage, package="spiky") phk6 <- kmers(phage, k=6) kmax(phk6)
In a cfMeDIP experiment, the yield of methylated fragments should be >95% (ideally 98-99%) due to the nature of the assay.
methylation_specificity(spike_gr, spike)
methylation_specificity(spike_gr, spike)
spike_gr |
GRanges of spike contigs (e.g. output object from scan_spiked_bam, scan_spike_contigs, or scan_spike_bedpe) |
spike |
spike contig database, if needed (e.g. data(spike)) |
list with median and mean coverage across spike contigs
data(genomic_res) data(spike_res) data(spike, package="spiky") methylation_specificity(spike_res, spike=spike)
data(genomic_res) data(spike_res) data(spike, package="spiky") methylation_specificity(spike_res, spike=spike)
Build a Bayesian additive model from spike-ins to correct bias in *-seq
model_bam_standards(x, conc = NULL, fm = NULL, ...)
model_bam_standards(x, conc = NULL, fm = NULL, ...)
x |
data with assorted feature information (GCfrac, CpGs, etc) |
conc |
concentration for each spike (must be provided!) |
fm |
model formula (conc ~ read_count + fraglen + GCfrac + CpGs_3) |
... |
other arguments to pass to |
the model fit for the data
library(bamlss) data(spike_cram_counts,package="spiky") data(spike,package="spiky") scc <- add_frag_info(spike_cram_counts, spike=spike) scc$conc <- scc$conc * 0.9 # adjust for dilution scc$CpGs_3 <- scc$CpGs ^ (1/3) fit0 <- model_bam_standards(scc, fm=conc ~ read_count + fraglen) fit1 <- model_bam_standards(scc, fm=conc ~ read_count + fraglen + GCfrac + CpGs_3) DIC(fit0, fit1)
library(bamlss) data(spike_cram_counts,package="spiky") data(spike,package="spiky") scc <- add_frag_info(spike_cram_counts, spike=spike) scc$conc <- scc$conc * 0.9 # adjust for dilution scc$CpGs_3 <- scc$CpGs ^ (1/3) fit0 <- model_bam_standards(scc, fm=conc ~ read_count + fraglen) fit1 <- model_bam_standards(scc, fm=conc ~ read_count + fraglen + GCfrac + CpGs_3) DIC(fit0, fit1)
formerly '2020_model_glm_fmol'. Note that everything in x can be had from a BAM/CRAM with spike contigs named as frag_grp (len_CpGs_GC) in the index and in fact that is what scan_spiked_bam now does.
model_glm_pmol(x, spike, conc = NULL, ...)
model_glm_pmol(x, spike, conc = NULL, ...)
x |
data w/frag_grp, id, and read_count; or scan_spiked_bam result |
spike |
spike database, e.g. |
conc |
concentration for each spike (will be referenced if NULL) |
... |
other arguments to pass to |
the model fit for the data
data(spike, package="spiky") data(spike_read_counts, package="spiky") fit1 <- model_glm_pmol(spike_read_counts, spike=spike) data(spike_res) # scan_spiked_bam result fit2 <- model_glm_pmol(spike_res, spike=spike)
data(spike, package="spiky") data(spike_read_counts, package="spiky") fit1 <- model_glm_pmol(spike_read_counts, spike=spike) data(spike_res) # scan_spiked_bam result fit2 <- model_glm_pmol(spike_res, spike=spike)
parse out the forward and reverse UMIs and contig for a BED/BAM
parse_spike_UMI(UMI, pos = NULL, seqs = NULL)
parse_spike_UMI(UMI, pos = NULL, seqs = NULL)
UMI |
a vector of UMIs |
pos |
optional vector of positions (else all are set to 1) |
seqs |
optional vector of read sequences (else widths default to 96) |
a GRanges
A DataFrame with sequence, methylated, CpGs, GCfrac, and OECpG for phages
data(phage)
data(phage)
A DataFrame object with
genome sequence, as a DNAStringSet
whether CpGs are methylated, as an integer
the number of CpGs in the phage genome, as an integer
the GC fraction of the phage genome, as a numeric
the observed / expected CpG fraction, as a numeric
www.ncbi.nlm.nih.gov/genbank/
FIXME: this could be made MUCH faster by precomputing CpG/GC stats per bin
predict_pmol( fit, genomic_gr, bsgenome = NULL, ret = c("gr", "df"), slide = FALSE )
predict_pmol( fit, genomic_gr, bsgenome = NULL, ret = c("gr", "df"), slide = FALSE )
fit |
result of model_glm_pmol |
genomic_gr |
the genomic data / new data |
bsgenome |
BSgenome name (if null, will guess from genomic_gr) |
ret |
return a data.frame ("df") or GRanges ("gr")? ("gr") |
slide |
compute a sliding window estimate for GCfrac (1/3 width)? |
Using GRanges as the return value is (perhaps counterintuitively) much faster than the data.frame, since the sequence of the bins gets converted from a BSgenome representation to characters in the latter (it is implied by the bin start, stop, and genome when left as a GRanges).
object with read count, fraglen, GC%, CpG**(1/3), and concentration
data(spike_res) data(genomic_res) data(spike, package="spiky") fit <- model_glm_pmol(covg_to_df(spike_res, spike=spike),spike=spike) preddf <- predict_pmol(fit, genomic_res, ret="df") pred <- predict_pmol(fit, genomic_res, ret="gr") bin_pmol(pred)
data(spike_res) data(genomic_res) data(spike, package="spiky") fit <- model_glm_pmol(covg_to_df(spike_res, spike=spike),spike=spike) preddf <- predict_pmol(fit, genomic_res, ret="df") pred <- predict_pmol(fit, genomic_res, ret="gr") bin_pmol(pred)
Sequence feature verification: never trust anyone, least of all yourself.
process_spikes(fasta, methylated = 0, ...)
process_spikes(fasta, methylated = 0, ...)
fasta |
fasta file (or GRanges or DataFrame) w/spike sequences |
methylated |
whether CpGs in each are methylated (0 or 1, default 0) |
... |
additional arguments, e.g. kernels (currently unused) |
GCfrac is the GC content of spikes as a proportion instead of a percent. OECpG is (observed/expected) CpGs (expectation is 25% of GC dinucleotides).
a DataFrame suitable for downstream processing
kmers
data(spike) spikes <- system.file("extdata", "spikes.fa", package="spiky", mustWork=TRUE) spikemeth <- spike$methylated process_spikes(spikes, spikemeth) data(phage) phages <- system.file("extdata", "phages.fa", package="spiky", mustWork=TRUE) identical(process_spikes(phage), phage) identical(phage, process_spikes(phage)) data(genbank_mito) (mt <- process_spikes(genbank_mito)) # see also genbank_mito.R gb_mito <- system.file("extdata", "genbank_mito.R", package="spiky")
data(spike) spikes <- system.file("extdata", "spikes.fa", package="spiky", mustWork=TRUE) spikemeth <- spike$methylated process_spikes(spikes, spikemeth) data(phage) phages <- system.file("extdata", "phages.fa", package="spiky", mustWork=TRUE) identical(process_spikes(phage), phage) identical(phage, process_spikes(phage)) data(genbank_mito) (mt <- process_spikes(genbank_mito)) # see also genbank_mito.R gb_mito <- system.file("extdata", "genbank_mito.R", package="spiky")
read a BEDPE file into Pairs of GRanges (as if a GAlignmentPairs or similar)
read_bedpe( x, ..., stranded = FALSE, fraglen = TRUE, optional = FALSE, keep = FALSE )
read_bedpe( x, ..., stranded = FALSE, fraglen = TRUE, optional = FALSE, keep = FALSE )
x |
a Tabixed BEDPE file, or a TabixFile of one |
... |
additional arguments to pass to scanTabix internally |
stranded |
Is the data stranded? (FALSE) |
fraglen |
compute the fragment length? (TRUE) |
optional |
scan the optional columns (name, score, strand1)? (FALSE) |
keep |
keep additional columns? (FALSE) |
BEDPE import in R is a shambles. This is a bandaid on a GSW. See the \href{https://bedtools.readthedocs.io/en/latest/content/general-usage.html#bedpe-format}{BEDPE format definition} for full details. In short, for a pair of ranges 1 and 2, we have fields chrom1, start1, end1, chrom2, start2, end2, and (optionally) name, score, strand1, strand2, plus any other user defined fields that may be included (these are not yet supported by read_bedpe). For example, two valid BEDPE lines are: chr1 100 200 chr5 5000 5100 bedpe_example1 30 chr9 900 5000 chr9 3000 3800 bedpe_example2 99 + -
a Pairs of GRanges, perhaps with $score or $fraglen
bedpe_covg
## Not run: bedpe <- "GSM5067076_2020_A64_bedpe.bed.gz" WT1_hg38 <- GRanges("chr11", IRanges(32387775, 32435564), "-") read_bedpe(bedpe, param=WT1_hg38) ## End(Not run)
## Not run: bedpe <- "GSM5067076_2020_A64_bedpe.bed.gz" WT1_hg38 <- GRanges("chr11", IRanges(32387775, 32435564), "-") read_bedpe(bedpe, param=WT1_hg38) ## End(Not run)
This function is essentially the opposite of rename_spikes, except that it works well on GRanges/GAlignments from or for merged genome+spike BAMs. If spike contigs are found, it will assign genome='spike' to those, while changing the seqlevels to standardized names that match rownames(spike).
rename_spike_seqlevels(x, spike = NULL)
rename_spike_seqlevels(x, spike = NULL)
x |
something with seqlevels (GRanges, GAlignments, Seqinfo...) |
spike |
a DataFrame where spike$sequence is a DNAStringSet (or NULL) |
x, but with standardized spike seqlevels and genomes
rename_spikes
spike
rowsThis function does that.
rename_spikes(x, spike)
rename_spikes(x, spike)
x |
a BAM/CRAM file, hopefully with an index |
spike |
a DataFrame where spike$sequence is a DNAStringSet |
a DataFrame with renamed contigs (rows)
generate_spike_fasta
Scan genomic BEDPE
scan_genomic_bedpe( bedpe, bin = TRUE, binwidth = 300L, bins = NULL, standard = TRUE, genome = "hg38" )
scan_genomic_bedpe( bedpe, bin = TRUE, binwidth = 300L, bins = NULL, standard = TRUE, genome = "hg38" )
bedpe |
the BEDPE file path, or output from read_bedpe() |
bin |
Bin reads? (TRUE) |
binwidth |
width of the bins for chromosomal tiling (300) |
bins |
a pre-tiled GRanges for binning coverage (NULL) |
standard |
restrict non-spike contigs to "standard" chromosomes? (TRUE) |
genome |
Name of genome (default hg38) |
a GRanges with coverage
fl <- system.file("extdata", "example_chr21_bedpe.bed.gz", package="spiky",mustWork=TRUE) scan_genomic_bedpe(fl) # will warn user about spike contigs
fl <- system.file("extdata", "example_chr21_bedpe.bed.gz", package="spiky",mustWork=TRUE) scan_genomic_bedpe(fl) # will warn user about spike contigs
The default workflow for spiky is roughly as follows:
scan_genomic_contigs( bam, spike, param = NULL, bin = TRUE, binwidth = 300L, bins = NULL, standard = TRUE, genome = "hg38", ... )
scan_genomic_contigs( bam, spike, param = NULL, bin = TRUE, binwidth = 300L, bins = NULL, standard = TRUE, genome = "hg38", ... )
bam |
the BAM or CRAM filename, or a vector of them |
spike |
the spike-in reference database (e.g. data(spike)) |
param |
a ScanBamParam object specifying which reads to count (NULL) |
bin |
Bin reads? (TRUE) |
binwidth |
width of the bins for chromosomal tiling (300) |
bins |
a pre-tiled GRanges for binning coverage (NULL) |
standard |
restrict non-spike contigs to "standard" chromosomes? (TRUE) |
genome |
Name of genome (default hg38) |
... |
additional arguments to pass to scanBamFlag() |
Identify and quantify the spike-in contigs in an experiment.
Fit a model for sequence-based abundance artifacts using the spike-ins.
Quantify raw fragment abundance on genomic contigs, and adjust per step 2.
scan_genomic_contigs addresses the first half of step 3. The assumption is
that anything which isn't a spike contig, is a genomic contig. This isn't
necessarily true, so the user can also supply a ScanBamParam object for the
param
argument and restrict scanning to whatever contigs they wish, which
also allows for non-default MAPQ, pairing, and quality filters.
If multiple BAM or CRAM filenames are provided, all indices will be checked before attempting to run through any of the files.
a CompressedGRangesList with bin- and spike-level coverage
Rsamtools::ScanBamParam
library(Rsamtools) data(spike, package="spiky") fl <- system.file("extdata", "ex1.bam", package="Rsamtools", mustWork=TRUE) scan_genomic_contigs(fl, spike=spike,standard=FALSE) # will warn user about spike contigs sb <- system.file("extdata", "example_chr21.bam", package="spiky", mustWork=TRUE) scan_genomic_contigs(sb, spike=spike) # will warn user about genomic contigs
library(Rsamtools) data(spike, package="spiky") fl <- system.file("extdata", "ex1.bam", package="Rsamtools", mustWork=TRUE) scan_genomic_contigs(fl, spike=spike,standard=FALSE) # will warn user about spike contigs sb <- system.file("extdata", "example_chr21.bam", package="spiky", mustWork=TRUE) scan_genomic_contigs(sb, spike=spike) # will warn user about genomic contigs
Methylation specificity is here defined as methylated_spike_covg/spike_covg
scan_methylation_specificity(files, spike, sep = "_")
scan_methylation_specificity(files, spike, sep = "_")
files |
a vector of BAM/CRAM file names |
spike |
a spike-in database |
sep |
the separator for spike-in contig names ("_") |
a matrix with columns "mean" and "median"
data(spike) library(GenomicRanges) sb <- system.file("extdata", "example.spike.bam", package="spiky", mustWork=TRUE) scan_methylation_specificity(sb, spike=spike)
data(spike) library(GenomicRanges) sb <- system.file("extdata", "example.spike.bam", package="spiky", mustWork=TRUE) scan_methylation_specificity(sb, spike=spike)
Scan spikes BEDPE
scan_spike_bedpe(bedpe, spike, how = "max")
scan_spike_bedpe(bedpe, spike, how = "max")
bedpe |
the BEDPE file path, or output from read_bedpe() |
spike |
information about the spikes (default: load |
how |
how to summarize the per-spike coverage (max) |
a GRanges with coverage
data(spike, package="spiky") fl <- system.file("extdata", "example_spike_bedpe.bed.gz", package="spiky",mustWork=TRUE) scan_spike_bedpe(fl,spike=spike) # will warn user about spike contigs
data(spike, package="spiky") fl <- system.file("extdata", "example_spike_bedpe.bed.gz", package="spiky",mustWork=TRUE) scan_spike_bedpe(fl,spike=spike) # will warn user about spike contigs
default workflow is
scan_spike_contigs(bam, spike, how = "max", param = NULL, mc.cores = 16, ...)
scan_spike_contigs(bam, spike, how = "max", param = NULL, mc.cores = 16, ...)
bam |
the BAM or CRAM filename, or a vector of such filenames |
spike |
the spike-in reference database (e.g. data(spike)) |
how |
how to summarize the per-spike coverage (max) |
param |
a ScanBamParam object, or NULL (will default to MAPQ=20 etc) |
mc.cores |
Number of cores to run on (default 16) |
... |
additional arguments to pass to scanBamFlag() |
scan spike contigs and count fragments per contig or per bin.
fit the appropriate model for adjusting genomic contigs based on spikes.
scan and adjust binned fragment tallies along genomic contigs per above.
scan_spike_contigs implements step 1.
If multiple BAM or CRAM filenames are provided, all indices will be checked before attempting to run through any of the files.
a CompressedGRangesList with bin- and spike-level coverage
Rsamtools::ScanBamParam
library(GenomicRanges) data(spike, package="spiky") sb <- system.file("extdata", "example.spike.bam", package="spiky", mustWork=TRUE) # switch to a CRAM res <- scan_spike_contigs(sb, spike=spike) # use default ScanBamParam summary(res)
library(GenomicRanges) data(spike, package="spiky") sb <- system.file("extdata", "example.spike.bam", package="spiky", mustWork=TRUE) # switch to a CRAM res <- scan_spike_contigs(sb, spike=spike) # use default ScanBamParam summary(res)
Typically one will want to fit a correction model to multiple samples. This function eases this task by merging the output of spike_counts into a data.frame that model_glm_pmol can directly fit.
scan_spike_counts(files, spike, methylated = 1, sep = "_")
scan_spike_counts(files, spike, methylated = 1, sep = "_")
files |
a vector of BAM/CRAM file names |
spike |
a spike-in database |
methylated |
a logical (0/1) to include only methylated fragments |
sep |
the separator for spike-in contig names ("_") |
a data.frame with columns "frag_grp", "id", and "read_count"
data(spike) library(GenomicRanges) sb <- system.file("extdata", "example.spike.bam", package="spiky", mustWork=TRUE) scan_spike_counts(sb, spike=spike) fit <- model_glm_pmol(scan_spike_counts(sb, spike=spike),spike=spike)
data(spike) library(GenomicRanges) sb <- system.file("extdata", "example.spike.bam", package="spiky", mustWork=TRUE) scan_spike_counts(sb, spike=spike) fit <- model_glm_pmol(scan_spike_counts(sb, spike=spike),spike=spike)
Note: behind the scenes, this is being refactored into scan_spike_contigs and scan_genomic_contigs. Once that is done, perhaps before release, the default workflow will switch to
scan_spiked_bam( bam, spike, mapq = 20, binwidth = 300L, bins = NULL, how = c("max", "mean"), dupe = FALSE, paired = TRUE, standard = TRUE, ... )
scan_spiked_bam( bam, spike, mapq = 20, binwidth = 300L, bins = NULL, how = c("max", "mean"), dupe = FALSE, paired = TRUE, standard = TRUE, ... )
bam |
the BAM file |
spike |
the spike-in reference database (e.g. data(spike)) |
mapq |
minimum mapq value to count a pair (20) |
binwidth |
width of the bins for chromosomal tiling (300) |
bins |
a pre-tiled GRanges for binning coverage (NULL) |
how |
how to record spike read coverage (max or mean)? (max) |
dupe |
unique (FALSE), duplicte (TRUE), or all (NA) reads? (FALSE) |
paired |
restrict coverage to that from properly paired reads? (TRUE) |
standard |
restrict non-spike contigs to "standard" chromosomes? (TRUE) |
... |
additional arguments to pass to scanBamFlag() |
scan spike contigs and count fragments per contig or per bin.
fit the appropriate model for adjusting genomic contigs based on spikes.
scan and adjust binned fragment tallies along genomic contigs per above.
This approach decouples binning schemes from model generation (using spikes) and model-based adjustment (using genomic fragment counts), decreasing code complexity while increasing the opportunities for caching & parallelization.
For a more realistic example (not run), one might do something like:
data(spike, package="spiky"); bam <- "2021_ctl.hg38_withSpikes.bam"; ssb_res <- scan_spiked_bam(bam, mapq=20, spike=spike);
An extract from the resulting ssb_res
object is available via
data(ssb_res, package="spiky");
The full ssb_res is a GRangesList object with 300bp-binned coverage on the standard (chr1-22, chrX, chrY, chrM) chromosomes (as determined by the GenomeInfoDb::standardChromosomes() function against the assembly defined in the BAM or CRAM file, by default; if desired, a user can scan all genomic contigs by setting standard=FALSE when calling the function). By default, the mean base-level coverage of genomic bins is reported, and the maximum spike-level coverage is reported, though this can also be adjusted as needed. The results then inform the reliability of measurements from replicate samples in multiple labs, as well as the adjusted quantitative coverage in each bin once the absolute quantity of captured cell-free methylated DNA has been fit by model_glm_pmol and predict_pmol. In some sense, this function converts BAMs/CRAMs into usable data structures for high-throughput standardized cfMeDIP experiments.
The data extract used in other examples is the same as the full version, with the sole difference being that genomic bins are limited to chr22.
a CompressedGRangesList with bin- and spike-level coverage
GenomeInfoDb::keepStandardChromosomes
Rsamtools::ScanBamParam
library(GenomicRanges) data(spike, package="spiky") sb <- system.file("extdata", "example.spike.bam", package="spiky", mustWork=TRUE) res <- scan_spiked_bam(sb, spike=spike, bins=GRanges()) summary(res$spikes$coverage)
library(GenomicRanges) data(spike, package="spiky") sb <- system.file("extdata", "example.spike.bam", package="spiky", mustWork=TRUE) res <- scan_spiked_bam(sb, spike=spike, bins=GRanges()) summary(res$spikes$coverage)
create seqinfo (and thus a standard chromosome filter) from a BAM header
seqinfo_from_header(x, gen = NA, std = FALSE, ret = c("si", "gr"))
seqinfo_from_header(x, gen = NA, std = FALSE, ret = c("si", "gr"))
x |
the BAM file or its header |
gen |
genome of the BAM file, if known (NULL; autodetect) |
std |
standard chromosomes only? (FALSE; will be empty if spikes) |
ret |
return Seqinfo ("si", the default) or GRanges ("gr")? ("si") |
Setting std=TRUE on a spike-in BAM will produce an empty result.
Seqinfo object or GRanges (or `as(seqinfo, "GRanges")`)
library(Rsamtools) fl <- system.file("extdata", "ex1.bam", package="Rsamtools", mustWork=TRUE) hdr <- scanBamHeader(BamFile(fl)) si <- seqinfo_from_header(hdr) gr <- seqinfo_from_header(fl, ret="gr") stopifnot(identical(gr, as(si, "GRanges"))) std_si <- seqinfo_from_header(fl, std=TRUE) seqlevels(std_si) # for comparison with below data(spike, package="spiky") spike sp <- system.file("extdata", "example.spike.bam", package="spiky") sp_gr <- seqinfo_from_header(sp, ret="gr") sp_gr
library(Rsamtools) fl <- system.file("extdata", "ex1.bam", package="Rsamtools", mustWork=TRUE) hdr <- scanBamHeader(BamFile(fl)) si <- seqinfo_from_header(hdr) gr <- seqinfo_from_header(fl, ret="gr") stopifnot(identical(gr, as(si, "GRanges"))) std_si <- seqinfo_from_header(fl, std=TRUE) seqlevels(std_si) # for comparison with below data(spike, package="spiky") spike sp <- system.file("extdata", "example.spike.bam", package="spiky") sp_gr <- seqinfo_from_header(sp, ret="gr") sp_gr
A DataFrame with sequence, concentration, and other properties of Sam's synthetic cfMeDIP spike-in controls. The row names redudantly encode some of these properties, such as the number of CpGs in the spike-in sequence.
data(spike)
data(spike)
A DataFrame object with
contig sequence, as a DNAStringSet
are the CpGs in this spike-in methylated? 0 or 1
number of CpG dinucleotides in the spike, from 1 to 16
femtomolar concentration of the spike-in for standard mix
molar mass of spike-in sequence
https://doi.org/10.1101/2021.02.12.430289
Bland-Altman plot for cfMeDIP spike standards
spike_bland_altman_plot(fit)
spike_bland_altman_plot(fit)
fit |
a model fit, from predict_pmol (?) |
a ggplot2 object
data(spike_res) data(spike, package="spiky") fit <- model_glm_pmol(covg_to_df(spike_res, spike=spike),spike=spike) ba_plot <- spike_bland_altman_plot(fit)
data(spike_res) data(spike, package="spiky") fit <- model_glm_pmol(covg_to_df(spike_res, spike=spike),spike=spike) ba_plot <- spike_bland_altman_plot(fit)
It dawned on me one day that we don't even have to bother reading the file if we have an index for a spiked BAM/CRAM result, since any fragments that map properly to the spike contigs are generated from synthetic templates. This function takes an index and a spike database (usually a DataFrame) as inputs and provides a rough coverage estimate over "rehabilitated" contig names (i.e., canonicalized contigs mapping to the database) as its output.
spike_counts( bam, spike, sep = "_", ref = "spike", verbose = FALSE, dump_idx = FALSE )
spike_counts( bam, spike, sep = "_", ref = "spike", verbose = FALSE, dump_idx = FALSE )
bam |
the BAM or CRAM file (MUST HAVE AN INDEX) |
spike |
a data.frame, DataFrame, or similar with spikes |
sep |
separator character in contig names ("_") |
ref |
reference name for spike genome ("spike") |
verbose |
be verbose? (FALSE) |
dump_idx |
dump the renamed idxstats to aggregate? (FALSE) |
The argument spike
has no default since we are attempting to refactor the
spike-in databases into their own data packages and allow more general use.
a GRanges of spike-in contig read counts
data(spike, package="spiky") sb <- system.file("extdata", "example.spike.bam", package="spiky", mustWork=TRUE) spike_counts(sb, spike=spike)
data(spike, package="spiky") sb <- system.file("extdata", "example.spike.bam", package="spiky", mustWork=TRUE) spike_counts(sb, spike=spike)
A data.frame with spike-in results from CRAM files (generated from scan_spike_counts(CRAMs, spike=spike))
data(spike_cram_counts)
data(spike_cram_counts)
A data.frame object with
the encoded spike contig name: basepairs_CpGs_GCpercent
subject from whom cfMeDIP spike reads (column 3) were counted
read coverage for this spike in this subject (column 2)
Generated from scan_spike_counts(CRAMs, spike=spike) using example CRAMs containing spike contigs
A data.frame with spike-in results from control samples in the manuscript.
This maps 1:1 onto dedup
using reshape2::melt.
data(spike_read_counts)
data(spike_read_counts)
A data.frame object with
the encoded spike contig name: basepairs_CpGs_GCpercent
subject from whom cfMeDIP spike reads (column 3) were counted
read coverage for this spike in this subject (column 2)
This data was created using inst/script/loadDedup.R
max
coverage. (In other
words, the default output of scan_spike_contigs or scan_spike_bedpe)
This represents what most users will want to generate from their own spike-in BAMs or BEDPEs, and
is used repeatedly in downstream examples throughout the package.A Granges object with spike-in sequence coverage, and
summarized for each spike contig as (the default) max
coverage. (In other
words, the default output of scan_spike_contigs or scan_spike_bedpe)
This represents what most users will want to generate from their own spike-in BAMs or BEDPEs, and
is used repeatedly in downstream examples throughout the package.
data(spike_res)
data(spike_res)
A GRanges of coverage results with one metadata column, coverage
Generated using scan_spike_bedpe or scan_spike_contigs on an example bedpe or bam containing spike contigs.
Particularly, simple methods to plot coverage results.
## S4 method for signature 'Rle,ANY' plot(x, y, ...) ## S4 method for signature 'SimpleRleList,ANY' plot(x, y, ...)
## S4 method for signature 'Rle,ANY' plot(x, y, ...) ## S4 method for signature 'SimpleRleList,ANY' plot(x, y, ...)
x |
an Rle or RleList, usually |
y |
not usedan Rle or RleList, usually |
... |
other params such as |
selectMethod("plot", "Rle") and also selectMethod("plot", "RleList") too.
invisibly, the plot details
A CompressedGRangesList object with genomic
(chr22) and spikes
coverage,
binned every 300bp for the genomic contigs then averaged across the bin, and
summarized for each spike contig as (the default) max
coverage. (In other
words, the default output of scan_spiked_bam, restricted to a small enough
set of genomic regions to be practical for examples.) This represents what
most users will want to generate from their own merged BAMs or CRAMs, and
is used repeatedly in downstream examples throughout the package.
data(ssb_res)
data(ssb_res)
A CompressedGRangesList of coverage results, containing
a GRanges with one metadata column, coverage
a GRanges with one metadata column, coverage
Generated using scan_spiked_bam on an example bam containing chr22 and spike contigs.
Sources and overlap widths of various read sequences in a test CRAM.
data(testGR)
data(testGR)
A GRanges object with an mcols() DataFrame containing
the unique molecular identifier on the forward read
the unique molecular identifier on the reverse read
the sequence of the fragment
the name of the fragment
whether the fragment passes filters (always 1)
Generated using inst/script/loadTest.R
refactored out of scan_spiked_bam for more explicit information flow
tile_bins(gr, binwidth = 300L)
tile_bins(gr, binwidth = 300L)
gr |
the GRanges |
binwidth |
bin width to tile (default is 300) |
a GRanges of bins
bam <- system.file("extdata", "ex1.bam", package="Rsamtools", mustWork=TRUE) gr <- as(seqinfo_from_header(bam), "GRanges") genome(gr) <- "notspike" tile_bins(gr)
bam <- system.file("extdata", "ex1.bam", package="Rsamtools", mustWork=TRUE) gr <- as(seqinfo_from_header(bam), "GRanges") genome(gr) <- "notspike" tile_bins(gr)