An Introduction to Rsamtools

Introduction

Many users will find that the GenomicAlignments package provides a more useful representation of BAM files in R; the GenomicFiles package is also useful for iterating through BAM files.

The Rsamtools package provides an interface to BAM files. BAM files are produced by samtools and other software, and represent a flexible format for storing ‘short’ reads aligned to reference genomes. BAM files typically contain sequence and base qualities, and alignment coordinates and quality measures. BAM files are appealing for several reasons. The format is flexible enough to represent reads generated and aligned using diverse technologies. The files are binary so that file access is relatively efficient. BAM files can be indexed, allowing ready access to localized chromosomal regions. BAM files can be accessed remotely, provided the remote hosting site supports such access and a local index is available. This means that specific regions of remote files can be accessed without retrieving the entire (large!) file. A full description is available in the BAM format specification (http://samtools.sourceforge.net/SAM1.pdf)

The main purpose of the Rsamtools is to import BAM files into R.Rsamtools also provides some facility for file access such as record counting, index file creation, and filtering to create new files containing subsets of the original. An important use case for Rsamtools is as a starting point for creating objects suitable for a diversity of work flows, e.g., AlignedRead objects in the ShortRead package (for quality assessment and read manipulation), or GAlignments objects in GenomicAlignments package (for RNA-seq and other applications). Those desiring more functionality are encouraged to explore samtools and related software efforts.

Input

RfunctionscanBam and ScanBamParam

The essential capability provided by Rsamtools is BAM input. This is accomplished with the scanBam function. scanBam takes as input the name of the BAM file to be parsed. In addition, the param argument determines which genomic coordinates of the BAM file, and what components of each record, will be input. Rparam is an instance of the ScanBamParam class. To create a param object, call ScanBamParam. Here we create a param object to extract reads aligned to three distinct ranges (one on seq1, two on seq2). From each of read in those ranges, we specify that we would like to extract the reference name (rname, e.g., seq1 ), strand, alignment position, query (i.e., read) width, and query sequence:

which <- GRanges(c(
    "seq1:1000-2000",
    "seq2:100-1000",
    "seq2:1000-2000"
))
## equivalent:
## GRanges(
##     seqnames = c("seq1", "seq2", "seq2"),
##     ranges = IRanges(
##         start = c(1000, 100, 1000),
##         end = c(2000, 1000, 2000)
##     )
## )
what <- c("rname", "strand", "pos", "qwidth", "seq")
param <- ScanBamParam(which=which, what=what)

Additional information can be found on the help page for ScanBamParam. Reading the relevant records from the BAM file is accomplished with

bamFile <- system.file("extdata", "ex1.bam", package="Rsamtools")
bam <- scanBam(bamFile, param=param)

Like scan, scanBam returns a list of values. Each element of the list corresponds to a range specified by the which argument to ScanBamParam.

class(bam)
## [1] "list"
names(bam)
## [1] "seq1:1000-2000" "seq2:100-1000"  "seq2:1000-2000"

Each element is itself a list, containing the elements specified by the what and tag arguments to ScanBamParam.

class(bam[[1]])
## [1] "list"
names(bam[[1]])
## [1] "rname"  "strand" "pos"    "qwidth" "seq"

The elements are either basic R or IRanges data types

sapply(bam[[1]], class)
##          rname         strand            pos         qwidth            seq 
##       "factor"       "factor"      "integer"      "integer" "DNAStringSet"

A paradigm for collapsing the list-of-lists into a single list is

.unlist <- function (x)
{
    ## do.call(c, ...) coerces factor to integer, which is undesired
    x1 <- x[[1L]]
    if (is.factor(x1)) {
        structure(unlist(x), class = "factor", levels = levels(x1))
    } else {
        do.call(c, x)
    }
}
bam <- unname(bam) # names not useful in unlisted result
elts <- setNames(bamWhat(param), bamWhat(param))
lst <- lapply(elts, function(elt) .unlist(lapply(bam, "[[", elt)))

This might be further transformed, e.g., to a DataFrame, with

head(do.call("DataFrame", lst))
## DataFrame with 6 rows and 5 columns
##      rname   strand       pos    qwidth                     seq
##   <factor> <factor> <integer> <integer>          <DNAStringSet>
## 1     seq1        +       970        35 TATTAGGAAA...ACTATGAAGA
## 2     seq1        +       971        35 ATTAGGAAAT...CTATGAAGAG
## 3     seq1        +       972        35 TTAGGAAATG...TATGAAGAGA
## 4     seq1        +       973        35 TAGGAAATGC...ATGAAGAGAC
## 5     seq1        +       974        35 AGGAAATGCT...TGAAGAGACT
## 6     seq1        -       975        35 GGAAATGCTT...GAAGAGACTA

Often, an alternative is to use a ScanBamParam object with desired fields specified in what as an argument to GenomicAlignments::readGAlignments; the specified fields are added as columns to the returned GAlignments .

Using BAM index files

The BAM file in the previous example includes an index, represented by a separate file with extension .bai:

list.files(dirname(bamFile), pattern="ex1.bam(.bai)?")
## [1] "ex1.bam"     "ex1.bam.bai"

Indexing provides two significant benefits. First, an index allows a BAM file to be efficiently accessed by range. A corollary is that providing a which argument to scanBamPram requires an index. Second, coordinates for extracting information from a BAM file can be derived from the index, so a portion of a remote BAM file can be retrieved with local access only to the index. For instance, provided an index file exists on the local computer, it is possible to retrieve a small portion of a BAM file residing on the 1000 genomes HTTP server. The url ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/data/NA19240/alignment/NA19240.chrom6.SLX.maq.SRP000032.2009_07.bam points to the BAM file corresponding to individual NA19240 chromosome 6 Solexa (Illumina) sequences aligned using MAQ. The remote file is very large (about 10 GB), but the corresponding index file is small (about 500 KB). With na19240url set to the above address, the following retrieves just those reads in the specified range

which <- GRanges("6:100000-110000")
param <- ScanBamParam(which=which, what=scanBamWhat())
na19240bam <- scanBam(na19240url, param=param)

Invoking scanBam without an index file, as above, first retrieves the index file from the remote location, and then queries the remote file using the index; for repeated queries, it is more efficient to retrieve the index file first (e.g., with download.file) and then use the local index as an argument to scanBam. Many BAM files were created in a way that causes scanBam to report that the “EOF marker is absent”; this message can safely be ignored.

Other ways to work with BAM files

BAM files may be read by functions in packages other than Rsamtools, in particular the readGAlignments family of functions in GenomicAlignments.

Additional ways of interacting with BAM files include scanBamHeader (to extract header information) and countBam (to count records matching param). filterBam filters reads from the source file according to the criteria of the ScanBamParam parameter, writing reads passing the filter to a new file. The function sorts a previously unsorted BAM, while The function indexBam creates an index file from a sorted BAM file. readPileup reads a pileup file created by , importing SNP, indel, or all variants into a GRanges object.

Large bam files

BAM files can be large, containing more information on more genomic regions than are of immediate interest or than can fit in memory. The first strategy for dealing with this is to select, using the what and which arguments to scanBamParam, just those portions of the BAM file that are essential to the current analysis, e.g., specifying what=c('rname', 'qname', 'pos') when wishing to calculate coverage of ungapped reads.

When selective input of BAM files is still too memory-intensive, the file can be processed in chunks, with each chunk distilled to the derived information of interest. Chromosomes will often be the natural chunk to process. For instance, here we write a summary function that takes a single sequence name (chromosome) as input, reads in specific information from the BAM file, and calculates coverage over that sequence.

summaryFunction <-
    function(seqname, bamFile, ...)
{
    param <- ScanBamParam(
        what=c('pos', 'qwidth'),
        which=GRanges(seqname, IRanges(1, 1e7)),
        flag=scanBamFlag(isUnmappedQuery=FALSE)
    )
    x <- scanBam(bamFile, ..., param=param)[[1]]
    coverage(IRanges(x[["pos"]], width=x[["qwidth"]]))
}

This might be used as follows; it is an ideal candidate for evaluation in parallel, e.g., using the parallel package and srapply function in ShortRead.

seqnames <- paste("seq", 1:2, sep="")
cvg <- lapply(seqnames, summaryFunction, bamFile)
names(cvg) <- seqnames
cvg
## $seq1
## integer-Rle of length 1569 with 1054 runs
##   Lengths:  2  2  1  3  4  2  3  4  2  4  1 ...  1  2  1  1  1  1  1  1  1  1
##   Values :  1  2  3  4  5  7  8  9 11 12 13 ... 13 12 10  9  7  6  5  3  2  1
## 
## $seq2
## integer-Rle of length 1567 with 1092 runs
##   Lengths:  1  3  1  1  1  3  1  4  1  1  6 ...  1  1  1  1  1  2  1  4  4  1
##   Values :  3  4  5  8 12 14 15 16 17 18 19 ... 15 14 13 10  8  7  6  3  2  1

The result of the function (a coverage vector, in this case) will often be much smaller than the input. The GenomicFiles package implements strategies for iterating through BAM and other files, including in parallel.

Views

The functions described in the previous section import data in to R. However, sequence data can be very large, and it does not always make sense to read the data in immediately. Instead, it can be useful to marshal references to the data into a container and then act on components of the container. The BamViews class provides a mechanism for creating ‘views’ into a set of BAM files. The view itself is light-weight, containing references to the relevant BAM files and metadata about the view (e.g., the phenotypic samples corresponding to each BAM file).

One way of understanding a instance is as a rectangular data structure. The columns represent BAM files (e.g., distinct samples). The rows represent ranges (i.e., genomic coordinates). For instance, a ChIP-seq experiment might identify a number of peaks of high read counts.

Assembling a BamViews instance

To illustrate, suppose we have an interest in caffeine metabolism in humans. The ‘rows’ contain coordinates of genomic regions associated with genes in a KEGG caffeine metabolism pathway. The ‘columns’ represent individuals in the 1000 genomes project.

To create the ‘rows’, we identify possible genes that KEGG associates with caffeine metabolism. Using the KEGGREST package, the steps are

## uses KEGGREST, dplyr, and tibble packages
org <- "hsa"
caffeine_pathway <-
    KEGGREST::keggList("pathway", org) 
    tibble::enframe(name = "pathway_id", value = "pathway") 
    dplyr::filter(startsWith(.data$pathway, "Caffeine metabolism"))

egid <-
    KEGGREST::keggLink(org, "pathway") %>%
    tibble::enframe(name = "pathway_id", value = "gene_id") 
    dplyr::left_join(x = caffeine_pathway, by = "pathway_id") 
    dplyr::mutate(gene_id = sub("hsa:", "", gene_id)) 
    pull(gene_id)

At the time of writing, genes in the caffeine metabolism pathway are

egid <- c("10", "1544", "1548", "1549", "7498", "9")

Then we use the appropriate TxDb package to translate Entrez identifiers to obtain ranges of interest (one could also use biomaRt to retrieve coordinates for non-model organisms, perhaps making a TxDb object as outlined in the GenomicFeatures vignette). We’ll find that the names used for chromosomes in the alignments differ from those used at Ensembl, so seqlevels<- is used to map between naming schemes and to drop unused levels.

library(TxDb.Hsapiens.UCSC.hg18.knownGene)
bamRanges <- transcripts(
    TxDb.Hsapiens.UCSC.hg18.knownGene,
    filter=list(gene_id=egid)
)
seqlevels(bamRanges) <-                 # translate seqlevels
    sub("chr", "", seqlevels(bamRanges))
lvls <- seqlevels(bamRanges)            # drop unused levels
seqlevels(bamRanges) <- lvls[lvls %in% as.character(seqnames(bamRanges))]

bamRanges
## GRanges object with 18 ranges and 2 metadata columns:
##        seqnames            ranges strand |     tx_id     tx_name
##           <Rle>         <IRanges>  <Rle> | <integer> <character>
##    [1]        2 31410692-31491115      - |      9095  uc002rnv.1
##    [2]        8 18111895-18125100      + |     26333  uc003wyq.1
##    [3]        8 18111895-18125100      + |     26334  uc003wyr.1
##    [4]        8 18111895-18125100      + |     26335  uc003wys.1
##    [5]        8 18113074-18125100      + |     26336  uc003wyt.1
##    ...      ...               ...    ... .       ...         ...
##   [14]       19 46042667-46048192      - |     57448  uc010ehe.1
##   [15]       19 46043701-46048191      - |     57449  uc010ehf.1
##   [16]       19 46073184-46080497      - |     57450  uc002opm.1
##   [17]       19 46073184-46080497      - |     57451  uc002opn.1
##   [18]       19 46073184-46226008      - |     57452  uc002opo.1
##   -------
##   seqinfo: 4 sequences from hg18 genome

The bamRanges ‘knows’ the genome for which the ranges are defined

unique(genome(bamRanges))
## [1] "hg18"

Here we retrieve a vector of BAM file URLs (slxMaq09) from the package.

slxMaq09 <- local({
    fl <- system.file("extdata", "slxMaq09_urls.txt", package="Rsamtools")
    readLines(fl)
})

We now assemble the BamViews instance from these objects; we also provide information to annotated the BAM files (with the bamSamples function argument, which is a DataFrame instance with each row corresponding to a BAM file) and the instance as a whole (with bamExperiment, a simple named list containing information structured as the user sees fit).

bamExperiment <-
    list(description="Caffeine metabolism views on 1000 genomes samples",
         created=date())
bv <- BamViews(
    slxMaq09, bamRanges=bamRanges, bamExperiment=bamExperiment
)
metadata(bamSamples(bv)) <-
    list(description="Solexa/MAQ samples, August 2009",
         created="Thu Mar 25 14:08:42 2010")
bv
## BamViews dim: 18 ranges x 24 samples 
## names: NA06986.SLX.maq.SRP000031.2009_08.bam NA06994.SLX.maq.SRP000031.2009_08.bam ... NA12828.SLX.maq.SRP000031.2009_08.bam NA12878.SLX.maq.SRP000031.2009_08.bam 
## detail: use bamPaths(), bamSamples(), bamRanges(), ...

Using BamViews instances

The BamViews object can be queried for its component parts, e.g.,

bamExperiment(bv)
## $description
## [1] "Caffeine metabolism views on 1000 genomes samples"
## 
## $created
## [1] "Fri Dec 27 05:47:46 2024"

More usefully, methods in Rsamtools are designed to work with BamViews objects, retrieving data from all files in the view. These operations can take substantial time and require reliable network access.

To illustrate, the following code (not evaluated when this vignette was created) downloads the index files associated with the bv object

bamIndexDir <- tempfile()
indexFiles <- paste(bamPaths(bv), "bai", sep=".")
dir.create(bamIndexDir)
bv <- BamViews(
    slxMaq09,
    file.path(bamIndexDir, basename(indexFiles)), # index file location
    bamRanges=bamRanges,
    bamExperiment=bamExperiment
)

idxFiles <- mapply(
    download.file, indexFiles,
    bamIndicies(bv),
    MoreArgs=list(method="curl")
)

and then queries the 1000 genomes project for reads overlapping our transcripts.

library(GenomicAlignments)
olaps <- readGAlignments(bv)

The resulting object is about 11 MB in size. To avoid having to download this data each time the vignette is run, we instead load it from disk

library(GenomicAlignments)
load(system.file("extdata", "olaps.Rda", package="Rsamtools"))
olaps
## List of length 24
## names(24): NA06986.SLX.maq.SRP000031.2009_08.bam ...
head(olaps[[1]])
## GAlignments object with 6 alignments and 0 metadata columns:
##       seqnames strand       cigar    qwidth     start       end     width
##          <Rle>  <Rle> <character> <integer> <integer> <integer> <integer>
##   [1]        2      +         51M        51  31410650  31410700        51
##   [2]        2      +         51M        51  31410658  31410708        51
##   [3]        2      -         51M        51  31410663  31410713        51
##   [4]        2      +         51M        51  31410666  31410716        51
##   [5]        2      -         51M        51  31410676  31410726        51
##   [6]        2      +         51M        51  31410676  31410726        51
##           njunc
##       <integer>
##   [1]         0
##   [2]         0
##   [3]         0
##   [4]         0
##   [5]         0
##   [6]         0
##   -------
##   seqinfo: 114 sequences from an unspecified genome

There are 33964 reads in NA06986.SLX.maq.SRP000031.2009_08.bam overlapping at least one of our transcripts. It is easy to explore this object, for instance discovering the range of read widths in each individual.

head(t(sapply(olaps, function(elt) range(qwidth(elt)))))
##                                       [,1] [,2]
## NA06986.SLX.maq.SRP000031.2009_08.bam   51   51
## NA06994.SLX.maq.SRP000031.2009_08.bam   36   51
## NA07051.SLX.maq.SRP000031.2009_08.bam   51   51
## NA07346.SLX.maq.SRP000031.2009_08.bam   48   76
## NA07347.SLX.maq.SRP000031.2009_08.bam   51   51
## NA10847.SLX.maq.SRP000031.2009_08.bam   36   51

Suppose we were particularly interested in the first transcript, which has a transcript id uc002rnv.1. Here we extract reads overlapping this transcript from each of our samples. As a consequence of the protocol used, reads aligning to either strand could be derived from this transcript. For this reason, we set the strand of our range of interest to *. We use the endoapply function, which is like lapply but returns an object of the same class (in this case, SimpleList) as its first argument.

rng <- bamRanges(bv)[1]
strand(rng) <- "*"
olap1 <- endoapply(olaps, subsetByOverlaps, rng)
olap1 <- lapply(olap1, "seqlevels<-", value=as.character(seqnames(rng)))
head(olap1[[24]])
## GAlignments object with 6 alignments and 0 metadata columns:
##       seqnames strand       cigar    qwidth     start       end     width
##          <Rle>  <Rle> <character> <integer> <integer> <integer> <integer>
##   [1]        2      +         36M        36  31410660  31410695        36
##   [2]        2      -         36M        36  31410670  31410705        36
##   [3]        2      +         36M        36  31410683  31410718        36
##   [4]        2      -         36M        36  31410687  31410722        36
##   [5]        2      -         36M        36  31410694  31410729        36
##   [6]        2      -         36M        36  31410701  31410736        36
##           njunc
##       <integer>
##   [1]         0
##   [2]         0
##   [3]         0
##   [4]         0
##   [5]         0
##   [6]         0
##   -------
##   seqinfo: 1 sequence from an unspecified genome

To carry the example a little further, we calculate coverage of each sample:

minw <- min(sapply(olap1, function(elt) min(start(elt))))
maxw <- max(sapply(olap1, function(elt) max(end(elt))))
cvg <- endoapply(
    olap1, coverage,
    shift=-start(ranges(bamRanges[1])),
    width=width(ranges(bamRanges[1]))
)
cvg[[1]]
## RleList of length 1
## $`2`
## integer-Rle of length 80424 with 13290 runs
##   Lengths:  8  8  5  2  1  3  7  7 10  2  3 ...  4  3  1 11  8  7 17  4  4  9
##   Values :  6  5  4  3  4  3  5  3  4  5  6 ...  4  5  4  5  4  5  4  3  2  1

Since the example includes a single region of uniform width across all samples, we can easily create a coverage matrix with rows representing nucleotide and columns sample and, e.g., document variability between samples and nucleotides

m <- matrix(unlist(lapply(cvg, lapply, as.vector)), ncol=length(cvg))
summary(rowSums(m))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.00   74.00   82.00   81.63   91.00  133.00
summary(colSums(m))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  133924  173925  248333  273528  350823  567727

Directions

This vignette has summarized facilities in the Rsamtools package. Important additional packages include GenomicRanges (for representing and manipulating gapped alignments), ShortRead for I/O and quality assessment of ungapped short read alignments, Biostrings and BSgenome for DNA sequence and whole-genome manipulation, IRanges for range-based manipulation, and rtracklayer for I/O related to the UCSC genome browser. Users might also find the interface to the integrative genome browser (IGV) in SRAdb useful for visualizing BAM files.

packageDescription("Rsamtools")
## Package: Rsamtools
## Type: Package
## Title: Binary alignment (BAM), FASTA, variant call (BCF), and tabix
##         file import
## Description: This package provides an interface to the 'samtools',
##         'bcftools', and 'tabix' utilities for manipulating SAM
##         (Sequence Alignment / Map), FASTA, binary variant call (BCF)
##         and compressed indexed tab-delimited (tabix) files.
## biocViews: DataImport, Sequencing, Coverage, Alignment, QualityControl
## URL: https://bioconductor.org/packages/Rsamtools
## Video:
##         https://www.youtube.com/watch?v=Rfon-DQYbWA&list=UUqaMSQd_h-2EDGsU6WDiX0Q
## BugReports: https://github.com/Bioconductor/Rsamtools/issues
## Version: 2.23.1
## License: Artistic-2.0 | file LICENSE
## Encoding: UTF-8
## Authors@R: c( person("Martin", "Morgan", role = "aut"), person("Hervé",
##         "Pagès", role = "aut"), person("Valerie", "Obenchain", role =
##         "aut"), person("Nathaniel", "Hayden", role = "aut"),
##         person("Busayo", "Samuel", role = "ctb", comment = "Converted
##         Rsamtools vignette from Sweave to RMarkdown / HTML."),
##         person("Bioconductor Package Maintainer", email =
##         "[email protected]", role = "cre"))
## Depends: methods, GenomeInfoDb (>= 1.1.3), GenomicRanges (>= 1.31.8),
##         Biostrings (>= 2.47.6), R (>= 3.5.0)
## Imports: utils, BiocGenerics (>= 0.25.1), S4Vectors (>= 0.17.25),
##         IRanges (>= 2.13.12), XVector (>= 0.19.7), bitops,
##         BiocParallel, stats
## Suggests: GenomicAlignments, ShortRead (>= 1.19.10), GenomicFeatures,
##         TxDb.Dmelanogaster.UCSC.dm3.ensGene,
##         TxDb.Hsapiens.UCSC.hg18.knownGene, RNAseqData.HNRNPC.bam.chr14,
##         BSgenome.Hsapiens.UCSC.hg19, RUnit, BiocStyle, knitr
## LinkingTo: Rhtslib (>= 3.3.1), S4Vectors, IRanges, XVector, Biostrings
## LazyLoad: yes
## SystemRequirements: GNU make
## VignetteBuilder: knitr
## Config/pak/sysreqs: make libssl-dev
## Repository: https://bioc.r-universe.dev
## RemoteUrl: https://github.com/bioc/Rsamtools
## RemoteRef: HEAD
## RemoteSha: 21a5945271e78bd38ebbe42a1368a8e8287c32f9
## Author: Martin Morgan [aut], Hervé Pagès [aut], Valerie Obenchain
##         [aut], Nathaniel Hayden [aut], Busayo Samuel [ctb] (Converted
##         Rsamtools vignette from Sweave to RMarkdown / HTML.),
##         Bioconductor Package Maintainer [cre]
## Maintainer: Bioconductor Package Maintainer
##         <[email protected]>
## Built: R 4.4.2; x86_64-pc-linux-gnu; 2024-12-27 05:47:09 UTC; unix
## 
## -- File: /tmp/RtmpRWWb7t/Rinst17065473f270/Rsamtools/Meta/package.rds
sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] GenomicAlignments_1.43.0               
##  [2] SummarizedExperiment_1.37.0            
##  [3] MatrixGenerics_1.19.0                  
##  [4] matrixStats_1.4.1                      
##  [5] TxDb.Hsapiens.UCSC.hg18.knownGene_3.2.2
##  [6] GenomicFeatures_1.59.1                 
##  [7] AnnotationDbi_1.69.0                   
##  [8] Biobase_2.67.0                         
##  [9] Rsamtools_2.23.1                       
## [10] Biostrings_2.75.3                      
## [11] XVector_0.47.1                         
## [12] GenomicRanges_1.59.1                   
## [13] GenomeInfoDb_1.43.2                    
## [14] IRanges_2.41.2                         
## [15] S4Vectors_0.45.2                       
## [16] BiocGenerics_0.53.3                    
## [17] generics_0.1.3                         
## [18] BiocStyle_2.35.0                       
## 
## loaded via a namespace (and not attached):
##  [1] KEGGREST_1.47.0         rjson_0.2.23            xfun_0.49              
##  [4] bslib_0.8.0             lattice_0.22-6          vctrs_0.6.5            
##  [7] tools_4.4.2             bitops_1.0-9            curl_6.0.1             
## [10] parallel_4.4.2          RSQLite_2.3.9           blob_1.2.4             
## [13] pkgconfig_2.0.3         Matrix_1.7-1            lifecycle_1.0.4        
## [16] GenomeInfoDbData_1.2.13 compiler_4.4.2          codetools_0.2-20       
## [19] htmltools_0.5.8.1       sys_3.4.3               buildtools_1.0.0       
## [22] sass_0.4.9              RCurl_1.98-1.16         yaml_2.3.10            
## [25] crayon_1.5.3            jquerylib_0.1.4         BiocParallel_1.41.0    
## [28] DelayedArray_0.33.3     cachem_1.1.0            abind_1.4-8            
## [31] digest_0.6.37           restfulr_0.0.15         maketools_1.3.1        
## [34] fastmap_1.2.0           grid_4.4.2              SparseArray_1.7.2      
## [37] cli_3.6.3               S4Arrays_1.7.1          XML_3.99-0.17          
## [40] UCSC.utils_1.3.0        bit64_4.5.2             rmarkdown_2.29         
## [43] httr_1.4.7              bit_4.5.0.1             png_0.1-8              
## [46] memoise_2.0.1           evaluate_1.0.1          knitr_1.49             
## [49] BiocIO_1.17.1           rtracklayer_1.67.0      rlang_1.1.4            
## [52] DBI_1.2.3               BiocManager_1.30.25     jsonlite_1.8.9         
## [55] R6_2.5.1                zlibbioc_1.52.0

(APPENDIX) Appendix

Assembling a BamViews instance

Genomic ranges of interest

BAM files

Note: The following operations were performed at the time the vignette was written; location of on-line resources, in particular the organization of the 1000 genomes BAM files, may have changed.

We are interested in collecting the URLs of a number of BAM files from the 1000 genomes project. Our first goal is to identify files that might make for an interesting comparison. First, let’s visit the 1000 genomes FTP site and discover available files. We’ll use the RCurl package to retrieve the names of all files in an appropriate directory

library(RCurl)
ftpBase <-
    "ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/data/"
indivDirs <-
    strsplit(getURL(ftpBase, ftplistonly=TRUE), "\n")[[1]]
alnDirs <-
    paste(ftpBase, indivDirs, "/alignment/", sep="")
urls0 <-
    strsplit(getURL(alnDirs, dirlistonly=TRUE), "\n")

From these, we exclude directories without any files in them, select only the BAM index (extension .bai) files, and choose those files that exactly six '.' characters in their name.

urls <- urls[sapply(urls0, length) != 0]
fls0 <- unlist(unname(urls0))
fls1 <- fls0[grepl("bai$", fls0)]
fls <- fls1[sapply(strsplit(fls1, "\\."), length)==7]

After a little exploration, we focus on those files obtained form Solexa sequencing, aligned using MAQ, and archived in August, 2009; we remove the .bai extension, so that the URL refers to the underlying file rather than index. There are 24 files.

urls1 <- Filter(
    function(x) length(x) != 0,
    lapply(urls, function(x) x[grepl("SLX.maq.*2009_08.*bai$", x)])
)
slxMaq09.bai <-
   mapply(paste, names(urls1), urls1, sep="", USE.NAMES=FALSE)
slxMaq09 <- sub(".bai$", "", slxMaq09.bai)

As a final step to prepare for using a file, we create local copies of the index files. The index files are relatively small (about 190 Mb total).

bamIndexDir <- tempfile()
dir.create(bamIndexDir)
idxFiles <- mapply(
    download.file, slxMaq09.bai,
    file.path(bamIndexDir, basename(slxMaq09.bai)),
    MoreArgs=list(method="curl")
)