To install ompBAM, start R (version “4.2”) and enter:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("ompBAM")
To create new packages using ompBAM, install devtools
and usethis
packages from CRAN:
For MacOS users, make sure OpenMP libraries are
installed correctly. We recommend users follow this guide, but the quickest way
to get started is to install libomp
via brew:
To create a new package (in temporary directory):
library(ompBAM)
pkg_path = file.path(tempdir(), "MyPkg")
use_ompBAM(pkg_path)
#> ✔ Creating '/tmp/Rtmp4u2KQi/MyPkg/'.
#> ✔ Setting active project to "/tmp/Rtmp4u2KQi/MyPkg".
#> ✔ Creating 'R/'.
#> ✔ Writing 'DESCRIPTION'.
#> Package: MyPkg
#> Title: What the Package Does (One Line, Title Case)
#> Version: 0.0.0.9000
#> Authors@R (parsed):
#> * First Last <[email protected]> [aut, cre] (YOUR-ORCID-ID)
#> Description: What the package does (one paragraph).
#> License: `use_mit_license()`, `use_gpl3_license()` or friends to pick a
#> license
#> Encoding: UTF-8
#> Roxygen: list(markdown = TRUE)
#> RoxygenNote: 7.3.2
#> ✔ Writing 'NAMESPACE'.
#> ✔ Setting active project to "<no active project>".
#> ✔ Setting active project to "/tmp/Rtmp4u2KQi/MyPkg".
#> ✔ Creating 'src/'.
#> ✔ Adding "*.o", "*.so", and "*.dll" to 'src/.gitignore'.
#> ✔ Created /tmp/Rtmp4u2KQi/MyPkg/R/ompBAM_imports.R with roxygen tags
#> ✔ Created src/Makevars.in and src/Makevars.win
#> ✔ Created configure scripts
#> ✔ Created src/ompBAM_example.cpp with idxstats_pbam() function
#> ✔ Adding Rcpp to Imports field in DESCRIPTION
#> ✔ Adding ompBAM to LinkingTo field in DESCRIPTION
#> ✔ Adding Rcpp to LinkingTo field in DESCRIPTION
#> ✔ MyPkg successfully created in /tmp/Rtmp4u2KQi/MyPkg
#> Please run roxygen2 using devtools::document() before building the package.
Before this package is functional, it needs to be roxygenised. This
will automate the export of the example function
idxstats()
. Run the following:
devtools::document(pkg_path)
#> ℹ Updating MyPkg documentation
#> Writing 'NAMESPACE'
#> ℹ Loading MyPkg
#> Exports from /tmp/Rtmp4u2KQi/MyPkg/src/ompBAM_example.cpp:
#> int idxstats_pbam(std::string bam_file, int n_threads_to_use = 1)
#>
#> /tmp/Rtmp4u2KQi/MyPkg/src/RcppExports.cpp updated.
#> /tmp/Rtmp4u2KQi/MyPkg/R/RcppExports.R updated.
#> ℹ Re-compiling MyPkg (debug build)
#> ── R CMD INSTALL ───────────────────────────────────────────────────────────────
#> * installing *source* package ‘MyPkg’ ...
#> ** using staged installation
#> ** libs
#> using C++ compiler: ‘g++ (Ubuntu 13.2.0-23ubuntu4) 13.2.0’
#> g++ -std=gnu++17 -I"/usr/share/R/include" -DNDEBUG -I'/tmp/RtmpztTsn3/Rinst10c721d6a4c3/ompBAM/include' -I'/github/workspace/pkglib/Rcpp/include' -fopenmp -fpic -O2 -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer -ffile-prefix-map=/build/r-base-YnquTi/r-base-4.4.1=. -fstack-protector-strong -fstack-clash-protection -Wformat -fcf-protection -fdebug-prefix-map=/build/r-base-YnquTi/r-base-4.4.1=/usr/src/r-base-4.4.1-3.2404.0 -Wdate-time -D_FORTIFY_SOURCE=3 -UNDEBUG -Wall -pedantic -g -O0 -c RcppExports.cpp -o RcppExports.o
#> g++ -std=gnu++17 -I"/usr/share/R/include" -DNDEBUG -I'/tmp/RtmpztTsn3/Rinst10c721d6a4c3/ompBAM/include' -I'/github/workspace/pkglib/Rcpp/include' -fopenmp -fpic -O2 -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer -ffile-prefix-map=/build/r-base-YnquTi/r-base-4.4.1=. -fstack-protector-strong -fstack-clash-protection -Wformat -fcf-protection -fdebug-prefix-map=/build/r-base-YnquTi/r-base-4.4.1=/usr/src/r-base-4.4.1-3.2404.0 -Wdate-time -D_FORTIFY_SOURCE=3 -UNDEBUG -Wall -pedantic -g -O0 -c ompBAM_example.cpp -o ompBAM_example.o
#> g++ -std=gnu++17 -shared -L/usr/lib/R/lib -Wl,-Bsymbolic-functions -flto=auto -ffat-lto-objects -Wl,-z,relro -o MyPkg.so RcppExports.o ompBAM_example.o -fopenmp -L/usr/lib/R/lib -lR
#> installing to /tmp/Rtmp4u2KQi/devtools_install_117d241f06c1/00LOCK-MyPkg/00new/MyPkg/libs
#> ** checking absolute paths in shared objects and dynamic libraries
#> * DONE (MyPkg)
#> Writing 'NAMESPACE'
To simulate the compilation and loading of this package (this is
almost equivalent to running R CMD BUILD
then loading the
MyPkg
package:
To test the new package works as expected, run the following commands:
idxstats(ompBAM::example_BAM("Unsorted"), 2)
#> /tmp/RtmpztTsn3/Rinst10c721d6a4c3/ompBAM/extdata/THP1_ND_1.bam summary:
#> Name Length Number of reads
#> 1 248956422 1308
#> 10 133797422 308
#> 11 135086622 600
#> 12 133275309 648
#> 13 114364328 162
#> 14 107043718 230
#> 15 101991189 294
#> 16 90338345 334
#> 17 83257441 570
#> 18 80373285 78
#> 19 58617616 600
#> 2 242193529 586
#> 20 64444167 168
#> 21 46709983 76
#> 22 50818468 200
#> 3 198295559 454
#> 4 190214555 252
#> 5 181538259 516
#> 6 170805979 824
#> 7 159345973 412
#> 8 145138636 452
#> 9 138394717 372
#> MT 16569 288
#> X 156040895 254
#> Y 57227415 14
#> [1] 0
idxstats(ompBAM::example_BAM("scRNAseq"), 2)
#> /tmp/RtmpztTsn3/Rinst10c721d6a4c3/ompBAM/extdata/MGE_E35_1.bam summary:
#> Name Length Number of reads
#> 1 195471971 50000
#> 10 130694993 0
#> 11 122082543 0
#> 12 120129022 0
#> 13 120421639 0
#> 14 124902244 0
#> 15 104043685 0
#> 16 98207768 0
#> 17 94987271 0
#> 18 90702639 0
#> 19 61431566 0
#> 2 182113224 0
#> 3 160039680 0
#> 4 156508116 0
#> 5 151834684 0
#> 6 149736546 0
#> 7 145441459 0
#> 8 129401213 0
#> 9 124595110 0
#> MT 16299 0
#> X 171031299 0
#> Y 91744698 0
#> JH584299.1 953012 0
#> GL456233.1 336933 0
#> JH584301.1 259875 0
#> GL456211.1 241735 0
#> GL456350.1 227966 0
#> JH584293.1 207968 0
#> GL456221.1 206961 0
#> JH584297.1 205776 0
#> JH584296.1 199368 0
#> GL456354.1 195993 0
#> JH584294.1 191905 0
#> JH584298.1 184189 0
#> JH584300.1 182347 0
#> GL456219.1 175968 0
#> GL456210.1 169725 0
#> JH584303.1 158099 0
#> JH584302.1 155838 0
#> GL456212.1 153618 0
#> JH584304.1 114452 0
#> GL456379.1 72385 0
#> GL456216.1 66673 0
#> GL456393.1 55711 0
#> GL456366.1 47073 0
#> GL456367.1 42057 0
#> GL456239.1 40056 0
#> GL456213.1 39340 0
#> GL456383.1 38659 0
#> GL456385.1 35240 0
#> GL456360.1 31704 0
#> GL456378.1 31602 0
#> GL456389.1 28772 0
#> GL456372.1 28664 0
#> GL456370.1 26764 0
#> GL456381.1 25871 0
#> GL456387.1 24685 0
#> GL456390.1 24668 0
#> GL456394.1 24323 0
#> GL456392.1 23629 0
#> GL456382.1 23158 0
#> GL456359.1 22974 0
#> GL456396.1 21240 0
#> GL456368.1 20208 0
#> JH584292.1 14945 0
#> JH584295.1 1976 0
#> [1] 0
To create an example package, create a new R project using ompBAM by running the following in the R console:
NB: be sure to replace project_path
to the actual
directory path in which you wish to store your project.
After running the example code above, use RStudio to open the newly
created MyPkg.Rproj
file located at the given path. Then,
run the following:
This will process the roxygen2 flags and write to the
NAMESPACE
file.
After this, the new package can be built by running Install and Restart from the Build tab.
ompBAM provides a one-step function called
use_ompBAM()
that creates a new package R project in a new
directory (or adds requisite files to an existing project). The function
is similar to those in the usethis package. The user supplies a
directory path and the directory containing the package must also be the
name of the new package.
Using the R function use_ompBAM()
, we can create a new
package inside the R-generated temporary directory as an example:
library(ompBAM)
pkg_path = file.path(tempdir(), "myPkgName")
use_ompBAM(pkg_path)
#> ✔ Creating '/tmp/Rtmp4u2KQi/myPkgName/'.
#> ✔ Setting active project to "/tmp/Rtmp4u2KQi/myPkgName".
#> ✔ Creating 'R/'.
#> ✔ Writing 'DESCRIPTION'.
#> Package: myPkgName
#> Title: What the Package Does (One Line, Title Case)
#> Version: 0.0.0.9000
#> Authors@R (parsed):
#> * First Last <[email protected]> [aut, cre] (YOUR-ORCID-ID)
#> Description: What the package does (one paragraph).
#> License: `use_mit_license()`, `use_gpl3_license()` or friends to pick a
#> license
#> Encoding: UTF-8
#> Roxygen: list(markdown = TRUE)
#> RoxygenNote: 7.3.2
#> ✔ Writing 'NAMESPACE'.
#> ✔ Setting active project to "/tmp/Rtmp4u2KQi/MyPkg".
#> ✔ Setting active project to "/tmp/Rtmp4u2KQi/myPkgName".
#> ✔ Creating 'src/'.
#> ✔ Adding "*.o", "*.so", and "*.dll" to 'src/.gitignore'.
#> ✔ Created /tmp/Rtmp4u2KQi/myPkgName/R/ompBAM_imports.R with roxygen tags
#> ✔ Created src/Makevars.in and src/Makevars.win
#> ✔ Created configure scripts
#> ✔ Created src/ompBAM_example.cpp with idxstats_pbam() function
#> ✔ Adding Rcpp to Imports field in DESCRIPTION
#> ✔ Adding ompBAM to LinkingTo field in DESCRIPTION
#> ✔ Adding Rcpp to LinkingTo field in DESCRIPTION
#> ✔ myPkgName successfully created in /tmp/Rtmp4u2KQi/myPkgName
#> Please run roxygen2 using devtools::document() before building the package.
We recommend the user run this function in a project directory of
their choice and NOT use tempdir()
, as the
temp directory is destroyed on each R session restart! We only
demonstrate using tempdir()
here to demonstrate the typical
output of the use_ompBAM()
function.
Users following this vignette should instead use something like:
In the remainder of this section we will examine this newly-created project and explain the importance of the configurations made.
After running ompBAM(), please open the newly-generated
myPkgName.Rproj
from within RStudio.
Note that the newly-created package is designed to run with roxygen2.
We recommend users build their package documentation and NAMESPACE using
roxygen2. To do this, simply run devtools::document()
to
process roxygen2 tags.
To compile with ompBAM capabilities, the package must import Rcpp. Check that the following has been added to the DESCRIPTION file:
Imports: Rcpp
LinkingTo: Rcpp, ompBAM
NB: prior versions of ompBAM suggested linking to zlibbioc. This is
now removed as zlibbioc will be deprecated as of Bioconductor 3.20.
Instead, simply make sure Makevars.win
contains the line
PKG_LIBS = -lz $(SHLIB_OPENMP_CXXFLAGS)
Also, use_ompBAM()
creates a “wrapper” function for the
C++ function idxstats_pbam()
function. This is a simple R
function idxstats()
which in turn calls the C++ function.
We export this function with a roxygen2 tag so that
idxstats()
can be called once the MyPkgName
package is loaded.
Open this file to verify that the following has been added:
#' @useDynLib myPkgName, .registration = TRUE
#' @import Rcpp
NULL
#' @export
idxstats <- function(bam_file, n_threads) {
idxstats_pbam(bam_file, n_threads)
}
To make sure your roxygen2 setup works, make sure the new package is
your active project by opening the project within RStudio. Then run
devtools::document()
to allow roxygen2 to do its magic.
When it is done, the NAMESPACE file should contain the following:
# Generated by roxygen2: do not edit by hand
export(idxstats)
import(Rcpp)
useDynLib(myPkgName, .registration = TRUE)
If for whatever reason roxygen2 says it did not edit the NAMESPACE
file because it was not created by roxygen2, we suggest deleting the
NAMESPACE file, then run devtools::document()
again.
Also, you may have noticed, the included source code may have been prompted to compile. Don’t worry about this, we will explain it in detail in the next section.
use_ompBAM() has created two make
files that instructs
the compiler to link with OpenMP and the zlib library. You can view
these files from within the src/
directory.
Make files instruct the C++ compiler to link to the appropriate
libraries at compile-time. For R packages, these are called
Makevars
and are streamlined make
files. For
more information, refer to (Writing R Extensions - Using MakeVars) [https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Using-Makevars]
Makevars.in
is a template make
file used by
Linux and MacOS, and should contain the following:
PKG_CXXFLAGS = $(OMPBAM_PKG_CXXFLAGS)
PKG_LIBS = $(OMPBAM_PKG_LIBS)
Additionally, a configure
script is added to the root
project directory. This is a shell script run when the package is built
from source. It detects whether the operating system is Linux or MacOS,
and whether OpenMP libraries are available for MacOS. It assigns the
correct compile and linker flags to the OMPBAM_PKG_CXXFLAGS
and OMPBAM_PKG_LIBS
variables in the template
make
file. The contents of the configure
script is beyond the scope of this documentation, but feel free to look
at it. It is based on the configure
script for the
data.table R package on CRAN.
For Windows installations, a second make
file called
“Makevars.win” is required. In windows systems, “Makevars.win” is used
to build your package, and contains the following:
PKG_CXXFLAGS = $(SHLIB_OPENMP_CXXFLAGS)
PKG_LIBS = -lz $(SHLIB_OPENMP_CXXFLAGS)
This will ensure that the zlib libraries are linked correctly to Windows systems.
Windows and Linux systems should support OpenMP natively. MacOS,
however, does not. In order to set up OpenMP for MacOS, we recommend
following this guide. In
short, to utilise OpenMP on MacOS users must first install the correct
OpenMP libraries. Secondly, specific compiler and linker flags must be
set. This has already been done using the configure
script
and template Makevars.in
file as described in section
(1c).
To simplify the installation of OpenMP libraries on MacOS, instruct package users to run the following:
As can be seen, OpenMP for MacOS will continue to be a difficult
issue. We recommend writing C++ code that is compatible for both OpenMP
and non-OpenMP systems. For non-OpenMP systems, multi-threading can
still be implemented via BiocParallel::MulticoreParam()
but
the session memory will be multiplied over the number of cores used.
In C++ code, to check whether OpenMP is installed in a system, use
the #ifdef
and #ifndef
directives. An example
below:
use_ompBAM() creates source code contained within
src/ompBAM_example.cpp
which contains a “Hello world!”
equivalent function that is ready to be compiled. This function, called
idxstats_pbam()
, replicates the idxstats function of
Samtools. It reports the chromosome names and lengths of the genome to
which the sequences in the BAM file is aligned, as well as the number of
reads aligned to each chromosome. See the idxstats
documentation for more details
In this section, we will examine the
src/ompBAM_example.cpp
source code in detail and explain
how we used ompBAM to re-create the idxstats function.
First, lets see how this function works. Make sure you have created the example project using the steps as described in Section (0) - Installation and Quick-Start. Make sure you can reproduce all the example output in that section.
Lets look at the line used to run the ompBAM-based idxstats function:
Like the Samtools idxstats, this function counts the number of reads aligned to each chromosome in the genome. However, unlike idxstats, this function can be run on BAM files that are not sorted by chromosome and not indexed.
In the following guide, we will go through the
src/ompBAM_example.cpp
in the example package, step by
step, explaining the code behind this simple function.
To add the ompBAM header files into your source code,
src/ompBAM_example.cpp
contains the following:
#include "Rcpp.h"
using namespace Rcpp;
// Required to print cout output generated by ompBAM
#define cout Rcpp::Rcout
// [[Rcpp::depends(ompBAM)]]
#include <ompBAM.hpp>
ompBAM headers must be added AFTER Rcpp. This is because ompBAM uses
cout
function for console output. The cout
C++
function is not supported in R, and must replaced by Rcout
in Rcpp.
The line #define cout Rcpp::Rcout
ensures that
cout
in all ompBAM headers are interpreted by the compiler
as Rcpp::Rcout
.
src/ompBAM_example.cpp
contains a simple internal
function use_threads()
which takes an int
parameter and returns an unsigned int
parameter:
unsigned int use_threads(int n_threads = 1) {
#ifdef _OPENMP
if(n_threads <= 1) return(1);
if(n_threads > omp_get_max_threads()) {
return((unsigned int)omp_get_max_threads());
}
return((unsigned int)n_threads);
#else
return(1);
#endif
}
This function takes a user request for the number of threads (CPU
workers) to be used. Sometimes these requests may not be appropriate
(e.g. requesting more threads than available in the system). First,
notice that we use the #ifdef
directives to tell the
compiler which block of code to compile, based on whether OpenMP is
available on the system. If not compiled with OpenMP , this function
returns 1
(single core). If compiled with OpenMP, the
function checks the maximum available threads and returns this number if
the user requests more threads than is available.
We recommend users include a similar function to make sure their program knows how many threads to run, as they will be running an OpenMP-based loop in their functions.
Now, lets return to the idxstats_pbam()
function. We
will show a subset of this function below:
// [[Rcpp::export]]
int idxstats_pbam(std::string bam_file, int n_threads_to_use = 1){
// Ensure number of threads requested < number of system threads available
unsigned int n_threads_to_really_use = use_threads(n_threads_to_use);
pbam_in inbam;
inbam.openFile(bam_file, n_threads_to_really_use);
/*
lots of code
*/
return(0);
}
// [[Rcpp::export]]
tells Rcpp to export this
function so that R can call these functions directly. They will not be
exported functions so you will still need to create wrapper functions
from within R, and export them, for example:# this is added to R/ompBAM_imports.R by use_ompBAM()
#' @export
idxstats <- function(bam_file, n_threads) {
idxstats_pbam(bam_file, n_threads)
}
idxstats_pbam()
takes two inputs. The first,
bam_file
, takes a string input which is the file name of
the BAM file. The second, n_threads_to_use
will take a user
input to request the number of threads. We sanitise this number as
discussed in the previous section, to make sure the number of threads in
this system is not above that available to the system.
We create a pbam_in
object called
inbam
. The pbam_in
class is described in the
pbam_in.hpp
file which is included in
ompBAM.hpp
. Users can find the header files in the
include/
directory within the installation path of ompBAM,
but we endeavour to explain as much as you need to know in this
document. For now, we call pbam_in::openFile()
which
directs the pbam_in
object inbam
to open and
examine the given BAM file. openFile()
takes a
std::string
for the path to the BAM file, and an
unsigned int
to specify the number of threads to initialize
pbam_in for parallel file reading and decompression.
Once pbam_in
opens a BAM file, it will automatically
check the BAM file for errors. After noting the file size, it will check
whether the BAM file is truncated (by verifying whether a BGZF EOF
end of file marker is set at the end of the file). It will then
read the BAM header and extract the chromosome names and their genome
lengths.
To obtain the chromosome names and lengths, and to check that the BAM file is indeed valid:
std::vector<std::string> s_chr_names;
std::vector<uint32_t> u32_chr_lens;
int chrom_count = inbam.obtainChrs(s_chr_names, u32_chr_lens);
Here, we create a string vector to store the chromosome names, and an
unsigned 32-bit integer vector for chromosome lengths. These vectors are
passed by reference to the pbam_in::obtainChrs()
function
which will fill these vectors with the chromosome names and lengths.
obtainChrs()
will return a negative number if the BAM
header read fails, and will return the number of chromosomes if the
function is run successfully. We can check for this with the
following:
ompBAM is designed to read an entire BAM file using multiple threads. This functionality is ideal for programs that perform whole-transcriptome calculations. Programs that interrogate small genomic regions are better off using htslib and interrogating sorted BAM files.
To read an entire BAM file, we need to construct code contained within a loop. This is because we can conserve RAM by not committing the entire decompressed BAM to memory. Instead, we read a small portion of the BAM file sequentially, process all the aligned reads therein, before reading / decompressing more of the BAM file again.
First, lets declare a vector to store the per-chromosome reads so we can count these reads:
To sequentially read a “small” portion of the BAM file, we call
pbam_in::fillReads()
and contain this within a
while
loop:
pbam_in::fillReads()
returns 0
if the BAM
file successfully extracts some reads and we are not at the end of file.
If all the reads have already been extracted from the BAM file and no
data was decompressed, fillReads()
returns 1
,
and if the BAM file is corrupt or the decompression returns an error,
fillReads()
returns -1
.
Note that fillReads()
will also return an
-1
error if the program has not processed all the reads
decompressed by the previous call to fillReads()
. We will
explain this further in the next section
OpenMP provides an easy-to-use parallelism which we can use to run
each iteration of a FOR
loop in parallel. That is, each
iteration of the FOR
loop is run simultaneously, each
iteration being processed by a separate thread. We can set up this
parallel FOR
loop in the following code:
#ifdef _OPENMP
#pragma omp parallel for num_threads(n_threads_to_really_use) schedule(static,1)
#endif
for(unsigned int i = 0; i < n_threads_to_really_use; i++) {
std::vector<uint32_t> read_counter(chrom_count);
// Process thread-specific BAM reads here
}
Note the line beginning with #pragma omp parallel for
.
This line simply specifies that the FOR
loop declared after
this line should be run in parallel.
num_threads(n_threads_to_really_use)
declares that we want
to use the specified number of threads, as stored in the
n_threads_to_really_use
variable.
schedule(static,1)
declares that one thread should be used
to run each FOR
iteration.
Also note that this line is enclosed within an
#ifdef _OPENMP
. If OpenMP is not available at compile, the
compiler simply ignores the #pragma
and will run the
FOR
loop sequentially.
Within this loop, we are free to declare any thread-specific
variables. Variables declared within the parallel FOR loop are
thread-specific, meaning their contents cannot be accessed by other
threads processing other iterations. So within the FOR
loop, we declare a read_counter
vector which will count
reads processed by each thread.
To obtain a single aligned read from a thread-specific buffer of
reads processed by ompBAM, we declare a pbam1_t
object and
use pbam_in::supplyRead()
to get a single read from the
thread-specific buffer:
As the above code is contained within the parallel FOR
loop, i
here refers to the thread ID. For n threads,
i
will take any integer between 0
and
n-1
. supplyRead()
will return the next read in
the thread-specific buffer which we receive by calling it within the
pbam1_t
constructor at declaration.
Note that the above code is equivalent to:
pbam1_t
is an ompBAM object that is designed to store
and get values contained in a single aligned BAM read. We will discuss
its full functionality in section (4). For now, the most important
function is pbam1_t::validate()
, which will check whether
the obtained read actually contains any data.
A pbam1_t
can be “invalid” for several reasons. The most
important reason is when supplyRead()
has reached the end
of its buffer. When this occurs, supplyRead()
will simply
return an empty read, which will not validate.
So, to profile all the reads in a single thread, we simply call
supplyRead()
for thread i
until it returns an
invalid read, using the following:
// Gets the first read from the thread-specific buffer
pbam1_t read(inbam.supplyRead(i));
while(read.validate()) {
// Do stuff with this valid read
// Gets the next read
read = inbam.supplyRead(i);
}
pbam1_t::validate()
will return false
for
two other reasons.
If the BAM contains corrupt data, validate()
will
return false. validate()
checks the first 4 bytes of the
read which returns the size of the read (in bytes). It then checks
whether this is larger than the size of the individual components,
including read name, cigar, sequence and quality scores. If there is
corrupt data, we might expect that this will trigger a failure and
safely abort the program, as the next call to fillReads()
will trigger an error notifying that not all reads were
processed.
pbam1_t
optimises performance by storing pointers to
the memory that contains decompressed BAM data, rather than creating
separate memory containers. Therefore, copying a pbam1_t
read will only copy its memory address, not the data contained in the
read. We refer to these as “virtual” pbam1_t
reads. The
consequence is, the next time fillReads()
is called, these
reads will point to an invalid memory address and will become
invalid.
In some situations (e.g. for paired RNA-seq reads), it is desirable to keep some reads around in memory. To do this we can do something like the following:
std::vector<pbam1_t> read_storage;
pbam1_t real_read = read;
real_read.realize();
read_storage.push_back(real_read);
In the above code, we call pbam1_t::realize()
. This
function creates a read- specific data buffer and copies the data from
the thread-specific buffer. A “real” pbam1_t
read simply
means it has its own data storage and is thus “persistent”, i.e. it will
still store the read after the next call to fillReads()
. In
other words:
pbam1_t virtual_read = inbam.supplyRead(i);
pbam1_t real_read = inbam.supplyRead(i);
pbam1_t read = inbam.supplyRead(i);
while(read.validate()){
// consumes the remainder of reads
read = inbam.supplyRead(i);
}
real_read.realize(); // make this read 'real'
virtual_read.isReal(); // returns false
real_read.isReal(); // returns true
inbam.fillReads();
virtual_read.validate(); // returns false
real_read.validate(); // returns true
NB: the pbam1_t::realize()
and
pbam1_t::isReal()
functions are not demonstrated in the
included example code created by use_ompBAM()
. For more
information about these functions, refer to Section (4c)
Now that we have discussed the most important aspects of
pbam1_t
as well as the fillReads()
and
supplyRead()
functions of pbam_in
, we will now
count the reads within the OpenMP parallel FOR loop. The entire
processing loop is included below and we will discuss the remaining
lines in detail:
// Creates a data structure that stores per-chromosome read counts
std::vector<uint32_t> total_reads(chrom_count);
while(0 == inbam.fillReads()) {
// OpenMP parallel FOR loop, each thread runs 1 loop simultaneously.
#ifdef _OPENMP
#pragma omp parallel for num_threads(n_threads_to_really_use) schedule(static,1)
#endif
for(unsigned int i = 0; i < n_threads_to_really_use; i++) {
std::vector<uint32_t> read_counter(chrom_count);
// Gets the first read from the thread read storage buffer
pbam1_t read(inbam.supplyRead(i));
// Keep looping while reads are valid
while(read.validate()) {
// Counts the read if it is mapped to a chromosome
if(read.refID() >= 0 && read.refID() < chrom_count) {
read_counter.at(read.refID())++;
}
// Gets the next read
read = inbam.supplyRead(i);
}
// Adds the counted reads to the final count
// #pragma omp critical ensures only 1 thread at a time runs the following
// block of code.
#ifdef _OPENMP
#pragma omp critical
#endif
for(unsigned int j = 0; j < (unsigned int)chrom_count; j++) {
total_reads.at(j) += read_counter.at(j);
}
}
// At this stage, all threads would have read all their thread-specific reads
// At the next call to pbam_in::fillReads(), if any reads were not read, it
// will throw an error and fillReads() will return -1.
// If we have finished reading the BAM file, fillReads() will return 1.
}
If you have read all the sections until now, you will understand most of the code in this loop. We will discuss the things we have not already addressed.
The pbam1_t::refID()
returns the refID
of
the aligned read. Rather than the name of the chromosome, it returns the
ID as a number from 0
to k-1
for an
k
-chromosome genome. These are in the same order as
described in the BAM header. So here, we simply check whether the read
is mapped to a valid chromosome refID, and add it to the thread-specific
read container vector:
For other “getter” functions of pbam1_t, refer to Section (4e-h).
Any variables declared outside the OpenMP FOR
loop is
accessible to code within the parallel loop. This can be both a blessing
and a curse. It can cause problems when multiple threads write to the
variable at the same time (known as a “race condition”). To prevent
this, we declare that only 1 thread can run the code that is responsible
for accessing the external variable total_reads
, using the
#pragma omp critical
directive. This directive instructs
the following block of code can only be run by a single thread at any
one time. Of course, we declare this within an
#ifdef _OPENMP
directive so that compilers without OpenMP
will ignore it.
After we have read the BAM file, we will close the file.
This will instruct pbam_in
to close the file that it
opened internally using the ifstream object contained within. It will
also destroy all memory associated with its file and decompressed data
buffers, thereby safely freeing up memory. This is also called on
destruction of the pbam_in
object, so files are safely
closed when the pbam_in
object goes out of scope. Of
course, pbam1_t
reads will invalidate once these buffers
are cleared, except “real” pbam1_t
reads created via
pbam1_t::realize()
.
For the output of this function, we use the function Rcout (from Rcpp) to print out the chromosome names, lengths and read counts, just like the samtools idxstats function:
BAM files are very large files (or we wouldn’t use multi-threading to read them). Sometimes a progress bar is very useful in this regard.
pbam_in
has two functions that provide useful data that
can be parsed to the RcppProgress progress bar.
pbam_in::GetFileSize()
will return the file size (in
bytes) of the opened BAM file.pbam_in::IncProgress()
will return the number of bytes
processed (read and decompressed) since the last call to
IncProgress()
was madeThese functions are demonstrated in the example code contained within
the ompBAMExample package contained within the examples/
folder of the ompBAM installation path. Readers are welcome to
peruse this example package which contains the
idxstats_pbam
function with the addition of a RcppProgress
bar.
The path to the source code can be returned using the following in R:
Alternatively, refer to Section (3i) for an example RcppProgress bar
The pbam_in
object is responsible for opening BAM files
for sequential multi-threaded BAM processing.
Internally, pbam_in
checks BAM files for truncation,
data corruption and other issues. Then it reads and decompresses BAM
files using multiple threads. Applications using ompBAM simply
have to retrieve reads from the pbam_in
object using the
supplyRead()
function, after “priming the pump” to fill the
reads buffer using the fillReads()
function.
supplyRead()
is considered “thread-safe”, meaning that
applications using ompBAM can run supplyRead()
within multi-threaded blocks of code. This allows applications using
ompBAM to process reads using multiple threads.
It is highly recommend to read the previous section (2) - “Step-by-step guide to creating your first ompBAM-powered package” before scrutinising this section. The “Step-by-step guide” provides the context behind all the functions mentioned in this section.
Please note that currently, ompBAM only supports BAM file reading. BAM writing may be supported in later versions of ompBAM.
Creates a pbam_in
object.
const size_t file_buffer_cap
: (default 500 Mb) The size
(in bytes) of each of the two file buffersconst size_t data_buffer_cap
: (default 1 Gb) The size
(in bytes) of the data buffer containing uncompressed dataconst unsigned int chunks_per_file_buffer
: (default 5)
How many chunks should the file buffer be divided. See detailsconst bool read_file_using_multiple_threads
(default
true): Whether to use multiple threads to read compressed data from
file.pbam_in
reads a set amount of data from the BAM file to
efficiently process reads. Internally, it allocates two “file” buffers
to store compressed data, each of size file_buffer_cap
, as
well as a larger “data” buffer to store uncompressed data, of size
data_buffer_cap
.
The first call to pbam_in::fillReads()
, the function
that decompresses the BAM file to extract BAM reads, will fill the first
file buffer with data, as well as a fraction (one “chunk”) of data to
fill the second file buffer. This chunk size is determined by
chunks_per_file_buffer
. For example, if
file_buffer_cap = 5e8
i.e. 500 Mb and
chunks_per_file_buffer = 5
, then the chunk size is 100
Mb.
After importing compressed data, pbam_in
will use
multiple threads to decompress enough data to either fill up the data
buffer, or decompress a portion of the file buffer (as determined by
chunk size), whichever occurs first.
Subsequent calls to pbam_in::fillReads()
will check
whether the number of chunks stored in the second file buffer exceeds
the number of chunks already processed in the first file buffer. Should
this occur, it will read compressed data from the BAM file and place
this in the second file buffer. When the amount of unprocessed data in
the first file buffer drop below one chunk size amount, a “buffer swap”
occurs where the data is copied from the second file buffer to the
first. Thereafter, the process continues until the entire BAM file has
been read and decompressed.
As can be seen, the optimal values of file_buffer_cap
,
data_buffer_cap
and chunks_per_file_buffer
will depend on the compression ratio of the BAM file, the I/O speed of
the storage device, as well as the available memory. We believe the
default settings is a good starting point and will consume approximately
2 Gb of RAM.
Storage on hard disk drives (with spinning components) typically
found on traditional desktop computers may experience lower file I/O in
multi-threaded applications. Solid-state drives (typically found in
modern notebook laptops) and some RAID setups may favour multi-threaded
file input. To instruct pbam_in
to read the file using a
single thread and decompress with the remaining cores, set
read_file_using_multiple_threads = false
.
Note that copy constructor and copy assignment operators are disabled
for pbam_in
. Thus, code containing things like the
following will fail at compile-time:
// Creates a pbam_in object with default settings
pbam_in default_pbam;
// Creates a pbam_in object with two 0.5Gb buffers, one 2 Gb data buffer, and
// will prompt file access when the last chunk of the first 10-chunk buffer is
// being decompressed. Set read_file_using_multiple_threads = true to enable
// multi-threaded file read access.
pbam_in custom_pbam(5e8, 2e9, 10, true);
Opens a BAM file given the file path, and the requested number of threads to be used. Also checks the BAM file and reads the BAM header (silently).
std::string filename
: The path to the BAM file to be
openedconst unsigned int n_threads
: The number of threads to
use to read the BAM fileAssigns an istream
handle to pbam_in
. Also
defines how many threads to use to read the BAM file. This is used as an
alternative to pbam_in::openFile()
.
std::istream *in_stream
: The handle of an ifstream that
has opened a BAM file using input binary accessconst unsigned int n_threads
: The number of threads to
use to read the BAM fileFrankly this is superseded by openFile()
.
NB: multi-threaded read access is disabled when pbam_in
accesses a BAM file via this method, because it does not know the BAM
file path.
Closes the BAM file. Also deallocates all memory contained in the
pbam_in
object, essentially resetting this object.
None
Note that closeFile()
internally calls
reset()
, which is also called at object destruction. Thus,
any open BAM files are closed when the pbam_in
object goes
out of scope. This function is provided if closing the
ifstream
associated with the BAM file, and/or freeing up
RAM associated with pbam_in
, is desired.
Retrieves chromosome names and lengths from an opened BAM file
std::vector<std::string> & s_chr_names
A
reference to a string vector to contain chromosome namesstd::vector<uint32_t> & u32_chr_lens
A
reference to an uint32_t vector to contain chromosome lengths(int) The number of chromosomes in the genome the BAM file is aligned to. Chromosome names and lengths are returned to the two vectors supplied by reference.
If pbam_in
has not opened a valid BAM file, this
function returns -1
When a BAM file is opened via openFile()
or
SetInputHandle()
, pbam_in will automatically read and store
the header data, namely chromosome names and lengths.
obtainChrs()
can be called to retrieve this data.
std::string bam_file = "example.bam";
pbam_in inbam;
inbam.openFile(bam_file, 4);
std::vector<std::string> s_chr_names; // A vector to contain chromosome names
std::vector<uint32_t> u32_chr_lens; // A vector to contain chromosome lengths
int chrom_count = inbam.obtainChrs(s_chr_names, u32_chr_lens);
Reads from the BAM file, decompresses the file buffer to extract the reads.
None
Returns 0
if successful, -1
if error, and
1
if end of file is already reached.
A loop is required to read the BAM file until finished. This can be set up by checking the return value of fillReads() and making sure it is zero. If it is non-zero, abort the loop as there is either an error with the file, or end of file was reached in the last call to fillReads(). See example below.
NB1: fillReads()
should only be called by the main
thread (i.e. it should not be called from within child threads).
fillReads()
contain OpenMP parallel FOR
loops
that perform file reads and decompressions using multi-threading.
NB2: Applications must ensure that all reads decompressed by
fillReads()
are obtained from every thread via
pbam_in::supplyRead()
. If some reads are not obtained, the
next call to fillReads()
will return an error.
Reads the thread-specific data buffer to supply a single aligned read from the BAM file.
const unsigned int thread_id
The index of the
thread-specific buffer from which to retrieve the read.pbam1_t
A pbam1_t
object containing the
data from the aligned read.
After fillReads()
is called and the data buffer is
filled, the reads are split into equal portions of decompressed data.
Each chunk of data is processed by a single thread. This allows
developers to set up an OpenMP for-loop such that reads from each chunk
are processed exclusively by each thread.
Reads are divided into contiguous blocks. E.g., thread
i=0
contains a contiguous block of reads, and that the
first read in thread i=1
contains the read following the
last read from thread i=0
.
Setting i
to the thread-ID makes sure
supplyRead()
is thread-safe, ensuring that we don’t have a
situation where multiple threads are accessing the same data (i.e. a
“race condition”).
When the end of the thread-specific buffer is reached,
fillReads()
will return an empty read (which will not
validate using pbam1_t::validate()
. This is useful to check
whether the thread-specific buffer is exhausted.
Queries how much decompressed data is available to be read by the thread- specific buffer
const unsigned int thread_id
(default 0) The index of
the thread- specific buffer from which to retrieve the read.Returns (in bytes) the amount of aligned reads available to be
returned by supplyReads()
This function is useful to distinguish between whether an invalid
read returned by supplyReads()
is because all the data from
the thread-specific buffer has been process, or whether corrupt data is
detected.
Frankly, it is sufficient to check for errors at
fillReads()
, because it is probably inappropriate to keep
reading the BAM file if some reads are corrupt.
Returns data regarding file size, and progress
GetFileSize()
Returns the file size of the BAM
fileGetProgress()
Returns the total number of (compressed)
bytes in the BAM file that has been processed (i.e. read and
decompressed)IncProgress()
Returns the number of bytes processed
since the last call to IncProgress()
These functions are useful for addition of progress bars. The
following example demonstrates a simple method of setting up an
RcppProgress
progress bar.
#include "Rcpp.h"
using namespace Rcpp;
// Required to print cout output generated by ompBAM
#define cout Rcpp::Rcout
// [[Rcpp::depends(ompBAM)]]
#include <ompBAM.hpp>
// [[Rcpp::depends(RcppProgress)]]
#include <progress.hpp>
int main() {
std::string bam_file = "example.bam";
pbam_in inbam;
inbam.openFile(bam_file, 4);
size_t file_size = inbam.GetFileSize();
size_t bytes_read = inbam.GetProgress();
// Initialize the RcppProgress bar with "100%" set as the size of the BAM file
Progress p(inbam.GetFileSize());
while(0 == inbam.fillReads()) {
// Increment the progress bar by the number of bytes incrementally processed
p.increment(inbam.IncProgress());
#ifdef _OPENMP
#pragma omp parallel for num_threads(4) schedule(static,1)
#endif
for(unsigned int i = 0; i < 4; i++) {
pbam1_t read(inbam.supplyRead(i));
// Keep looping while reads are valid
while(read.validate()) {
read = inbam.supplyRead(i);
}
}
}
// Final increment to RcppProgress bar to fill it to 100%
p.increment(inbam.IncProgress());
return(0);
}
The pbam1_t
object is used to retrieve data from a
single aligned read.
pbam1_t
object is returned from the
supplyRead()
function of pbam_in
and is the
primary way in which ompBAM provides access to the alignment data in BAM
files. pbam1_t
contain many useful functions that make it
easy to access data from read alignments. In the documentation below,
“reads” and “read alignments” are use interchangeably and are
equivalent.
NB: In all the examples below, it is assumed that a
pbam_in
object, named inbam
, has been
initialized, and the first fillReads()
has been called
using the following code:
#include "Rcpp.h"
using namespace Rcpp;
// Required to print cout output generated by ompBAM
#define cout Rcpp::Rcout
// [[Rcpp::depends(ompBAM)]]
#include <ompBAM.hpp>
std::string bam_file = "example.bam";
pbam_in inbam;
inbam.openFile(bam_file, 4);
inbam.fillReads();
int i = 0; // Thread 0
Creates a pbam1_t
object
char * src
A raw pointer to the char* in the
pbam_in
data buffer containing the read.const bool realize
Whether to “realize” the read.As explained in the previous section (2g:i Checking “validity” of
reads and “realizing” reads”), pbam1_t
does not
initially contain dedicated memory space to store the data in a BAM
read. Rather, it is a “virtual” read, in which it contain pointers to
the corresponding data on the data buffer in the pbam_in
object. This allows rapid processing of the BAM file and spares
unnecessary memcpy
operations which will increase RAM usage
and slow the program.
However it can be converted into a “real” read, whereby
pbam1_t
will allocate the required memory space and copies
the data from the pbam_in
data buffer. When reads are
“realized”, subsequent calls to pbam_in::fillReads()
will
not result in invalidation of such reads. This is important when data
persistence is required, e.g. when matching paired reads in
coordinate-sorted paired-end RNA-seq data.
NB: In typical usage, users are not expected to use the constructor
to create “real” reads. Instead, it is better to first obtain the
virtual read using pbam_in::supplyRead()
, and subsequently
“realize” the read via pbam1_t::realize()
.
// Creates a pbam1_t object
pbam1_t read;
// Construct and "realize" read based on a known location
// in a buffer given by pointer (char* dest)
pbam1_t read(dest, true)
// Use copy constructor to get a read from the pbam_in object `inbam`
pbam1_t read(inbam::supplyRead(i));
// Use copy assignment to get a read from the pbam_in object `inbam`
pbam1_t read = inbam::supplyRead(i);
Checks whether the read contained within the pbam1_t
object contains valid data for a BAM read.
None
Returns true
if the read contains valid data, and
false
otherwise
validate()
makes the following checks, and returns false
quickly if any of these checks fail:
It is most often used to check whether
pbam_in::supplyRead()
has supplied an empty read signifying
end of the thread-specific buffer. It can also check for invalid data
pointers if users call pbam_in::fillReads()
before
“realizing” the reads to make the pbam1_t
data persist. As
explained previously in Section (2g:i),
pbam_in::fillReads()
will invalidate virtual reads as their
data would have been recycled; however, realized reads will be
persistent and continue to validate.
Internally, validate()
is called prior to returning any
values from the pbam_1
“getter” functions.
Converts a virtual read into a real read, thereby allowing data persistence by allocating specific memory space to store the data.
None
Returns 0
if the pbam1_t
was successfully
converted to a “real” read, and -1
if the given read was
invalid or if this function fails.
realize()
first checks if the pbam1_t
was
already a real read (i.e. it is contained within a dedicated persistent
data buffer). If it is not, it allocates a buffer using
malloc
, and copies the data from the buffer pointed to by
the previously-virtual read, using memcpy
. It then rechecks
validity.
This function is used to convert reads such that they no longer point
to memory initiated by pbam_in
, but instead have their own
dedicated memory buffers (i.e. their memory becomes persistent)
// Stores every read in a vector of `pbam1_t` objects
std::vector<pbam1_t> read_container;
pbam1_t read = inbam.supplyRead(i);
while(read.validate()){
read.realize();
read_container.push_back(read);
// keep getting reads from pbam_in::supplyRead()
// until it returns an empty, invalid read
read = inbam.supplyRead(i);
}
// reads stored inside read_container should now persist beyond the next call
// to pbam_in::fillReads()
Checks whether the pbam1_t is persistent (real) or points to the pbam_in memory (virtual)
None
Returns true
if the read is real, and false
if it is virtual.
See Section (4c) pbam1_t::realize()
for detailed
explanation of “real” vs “virtual” reads
The core is the first 36 bytes of the alignment read data. It contains constant-length data including chromosome refID and the left-most genomic coordinates of the alignment. It also contains metadata including cigar size and sequence length.
uint32_t block_size(); // Size of the alignment data (in bytes)
int32_t refID(); // refID of chromosome of this alignment
int32_t pos(); // leftmost 0-based genome coordinate of this alignment
uint8_t l_read_name(); // length of read name, including terminating '\0' null
uint8_t mapq(); // mapping quality
uint16_t bin(); // BAI index bin
uint16_t n_cigar_op(); // number of cigar operations
uint32_t flag(); // Bitwise flags
uint32_t l_seq(); // Sequence length
int32_t next_refID(); // refID of next segment
int32_t next_pos(); // pos of next segment
int32_t tlen(); // template length
None
See comments in the “Usage” section above. Also, it is helpful to refer to SAMv1.pdf - section 4.2 for further details regarding the BAM format.
pbam1_t read = inbam.supplyRead(i);
Rcpp::Rcout << "The alignment block size is " << read.block_size() << '\n';
Rcpp::Rcout << "The alignment refID is " << read.refID() << '\n';
Rcpp::Rcout << "The alignment left-most coordinate is " << read.pos() << '\n';
Rcpp::Rcout << "The read name length is "
<< unsigned(read.l_read_name()) << '\n';
Rcpp::Rcout << "The alignment mapping quality is "
<< unsigned(read.mapq()) << '\n';
Rcpp::Rcout << "The BAI index bin is " << read.bin() << '\n';
Rcpp::Rcout << "The number of cigar ops is " << read.n_cigar_op() << '\n';
Rcpp::Rcout << "The flag binary value is " << read.flag() << '\n';
Rcpp::Rcout << "The sequence length is " << read.l_seq() << '\n';
Rcpp::Rcout << "The next alignment refID is " << read.next_refID() << '\n';
Rcpp::Rcout << "The next alignment pos " << read.next_pos() << '\n';
Rcpp::Rcout << "The alignment template length is " << read.tlen() << '\n';
The cigar is data which specifies the nature of the alignment.
// Direct uint32_t pointer to raw data
uint32_t * cigar();
// number of cigar operations; includes compatibility for long-read data
uint32_t cigar_size();
// Fills a string with the SAM-based cigar string
// - Returns cigar_size() if success
// - Returns 0 if failed to validate
int cigar(std::string & dest);
// Returns the cigar operation / value (as char / uint32_t)
// given the position of the cigar operation
char cigar_op(const uint16_t pos);
uint32_t cigar_val(const uint16_t pos);
// Returns the cigar operations and values (as char / uint32_t) as a vector
int cigar_ops_and_vals(std::vector<char> & ops, std::vector<uint32_t> & vals);
std::string & dest
The reference to a string to
contain the cigar stringconst uint16_t pos
The index of the cigar operation
(0 <= pos < cigar_size()
)std::vector<char> & ops
The reference to a
char
vector to contain the cigar operationstd::vector<uint32_t> & vals
The reference to
a uint32_t
vector to contain the number of nucleotides to
apply the cigar operationuint32_t * cigar()
returns a pointer to the data buffer
at which the cigar data begins.
uint32_t cigar_size()
returns the number of cigar
operations. The BAM format is limited at 65535 operations which is
insufficient for some long-read applications. Recently, the “CG” tag has
been implemented that allows storage of alignment cigar data beyond this
limit. ompBAM allows for this by first checking whether the
“CG” tag exists. If it does, cigar_size()
returns the
length of this tag. If it does not, cigar_size()
returns
n_cigar_op
which is the 16-bit storage of the cigar
length.
int cigar()
takes as reference a string and fills it
with a string (in SAM format) of the entire cigar string.
cigar_op()
and cigar_val()
returns the
cigar operation and length, respectively, given the index of cigar
operation pos
(where
0 <= pos < cigar_size()
)
cigar_ops_and_vals()
takes two vectors by reference
(ops
and vals
) and fills these with the cigar
operations and lengths, respectively, of the entire cigar.
Refer to SAMv1.pdf - section 1.4 for further details regarding the cigar string (in SAM format).
pbam1_t read = inbam.supplyRead(i);
// Stores raw pointer to the data containing the cigar
uint32_t * cigar_buffer = read.cigar();
uint32_t cigar_len = read.cigar_size();
Rcpp::Rcout << "The cigar has " << cigar_len << " operations\n";
std::string cigar_string;
read.cigar(cigar_string);
Rcpp::Rcout << "The cigar string is " << cigar_string << '\n';
Rcpp::Rcout << "The cigar operation at index 0 is "
<< read.cigar_op(0)
<< " and the length to apply this is "
<< std::to_string(read.cigar_val(0)) << '\n';
std::vector<char> ops;
std::vector<uint32_t> vals;
read.cigar_ops_and_vals(ops, vals);
Rcpp::Rcout << "The cigar operation at index 0 is "
<< ops.at(0)
<< " and the length to apply this is "
<< std::to_string(vals.at(0)) << '\n';
Reads consist of a sequence of nucleotides. FASTQ files also contain per- nucleotide quality scores representing the level of statistical significance of each sequenced nucleotide. These values are often stored in BAM files after alignment.
std::string & dest
A string reference to contain
the sequence.std::vector<uint8_t> & dest
A 8-bit unsigned
integer vector to contain the list of quality scoresuint8_t * seq()
and char * qual()
returns
the pointer to the raw data containing the sequence (in
uint8_t
) and quality scores (in char
).
Advanced users will have to convert the 4-bit sequence to nucleotides as
described in the SAM documentation: SAMv1.pdf.
It is more convenient to use the latter functions.
int seq(std::string & dest)
takes a string reference as
parameter and fills this reference with the sequence of the read. It
returns the length of the read.
qual(std::vector<uint8_t> & dest)
takes a
uint8_t
vector by reference and fills this with
per-nucleotide quality scores for the sequence. It returns the length of
the read.
Quality score can take values between [0,93] (without ASCII conversion). Users wishing to convert these to ASCII must add +33 to these scores before ASCII conversion. Alignments without quality scores will have all QUAL scores set at 255 (0xFF).
It is helpful to refer to SAMv1.pdf - section 4.2.3 for further details regarding SEQ and QUAL encoding.
pbam1_t read = inbam.supplyRead(i);
std::string sequence;
std::vector<uint8_t> qual_scores;
read.seq(sequence);
read.qual(qual_scores);
Rcpp::Rcout << "The sequence is " << sequence << '\n';
if(qual_scores.at(0) != 255) {
// After checking quality scores are not missing, convert to printable ASCII
for(unsigned int k = 0; k < qual_scores.size(); k++) {
qual_scores.at(k) += 33;
}
Rcpp::Rcout << "The Phred-scale per-nucleotide score ASCII is "
<< std::string(qual_scores.begin(), qual_scores.end()) << '\n';
}
Tags are auxiliary information about each alignment.
// Provides an index of tags available to the alignment
int AvailTags(std::vector<std::string> & tags);
// Provides metadata about the specific tag
char Tag_Type(const std::string tag);
char Tag_Subtype(const std::string tag);
uint32_t Tag_Size(const std::string tag);
char Tag_Type_SAM(const std::string tag);
// Returns raw char pointer to the beginning of the info stored by the tag
// - For advanced users only
char * p_tagVal(const std::string tag);
// Returns values of fixed length
// - For tags of type AcCsSiIf
// Returns '\0' or 0 if tag does not exist or if the type is inappropriate
// for the given tag
char tagVal_A(const std::string tag); // tags of type 'A'
int8_t tagVal_c(const std::string tag); // tags of type 'c'
uint8_t tagVal_C(const std::string tag); // tags of type 'C'
int16_t tagVal_s(const std::string tag); // tags of type 's'
uint16_t tagVal_S(const std::string tag); // tags of type 'S'
int32_t tagVal_i(const std::string tag); // tags of type 'i'
uint32_t tagVal_I(const std::string tag); // tags of type 'I'
float tagVal_f(const std::string tag); // tags of type 'f'
// Fills given string reference by given Z-type tag
// Returns tag length of string if success, -1 if fail
int tagVal_Z(const std::string tag, std::string & dest); // tags of type 'Z'
// Returns a B-tag by reference to its respective type
// Returns tag length if success, -1 if fail
int tagVal_B(const std::string tag, std::vector<int8_t> & dest); // 'B, c'
int tagVal_B(const std::string tag, std::vector<uint8_t> & dest); // 'B, C'
int tagVal_B(const std::string tag, std::vector<int16_t> & dest); // 'B, s'
int tagVal_B(const std::string tag, std::vector<uint16_t> & dest); // 'B, S'
int tagVal_B(const std::string tag, std::vector<int32_t> & dest); // 'B, i'
int tagVal_B(const std::string tag, std::vector<uint32_t> & dest); // 'B, I'
int tagVal_B(const std::string tag, std::vector<float> & dest); // 'B, f'
std::vector<std::string> & tags
A reference
to a string vector in which to store a list of available tags for the
read.const std::string tag
A string containing the
two-character tag to querystd::string & dest
A string reference to store the
given string in Z-type tags.std::vector<T> & dest
A reference to a vector
of type AvailTags()
returns the number of tags in the read.
It also fills the given string vector with a list of available tags that
can be queried.
Tag_Type()
, Tag_Subtype()
and
Tag_Size()
returns the type, subtype and tag length of the
given tag
. Tag_Type_SAM()
returns the type as
displayed in SAM format. See details for further information
p_tagVal()
returns the raw pointer to the beginning
of the data contained within the given tag. That is, the pointer is 3
bytes downstream to the start of the tag label in all tags except tags
of type ‘B’, where it is 8 bytes downstream of the tag label. See SAMv1.pdf for
more details.
tagVal_{A/c/C/s/S/i/I/f}()
returns the 1-length
value of the given tag type stored in the given
tag
.
tagVal_Z()
takes by reference a string
dest
in which to store the string contained in Z-type tags.
It returns the length of the tag if success, or -1
if
fail.
tagval_B()
takes by reference a vector
dest
of the appropriate type, in which to store the vector
of values associated with B-type tags. It returns the length of the tag
if success, or -1
if fails.
Tags of type A/c/C/s/S/i/I/f are of length 1. Tags of type Z are of the length of its stored string (including the terminating ‘\0’ byte). Tags of type B are of the length of its vector.
Tag types designate the data type of its stored value. These are ‘A’
char
, ‘c’ int8_t
, ‘C’ uint8_t
,
‘s’ int16_t
, ‘S’ uint16_t
, ‘i’
int32_t
, ‘I’ uint32_t
, and ‘f’
float
.
Subtypes only apply to tags of type B. They can be of type c/C/s/S/i/I/f which refers to the data type of its stored value (see above).
SAM types are slightly different. Tags of BAM type c/C/s/S/i/I are displayed in SAM format as type ‘i’. Other types remain as they are.
Note that tags of type ‘H’ are not supported in ompBAM.
For more details, refer to SAMv1.pdf, section 4.2.4 for more information about how tags are stored in BAM format.
Also, refer to SAMtags.pdf for a list of the commonly-used BAM/SAM tags annotated with their SAM-types and the type of information they store.
// The code below iterates through all the tags in the read and prints them in
// SAM format. B-type tag data is not displayed for brevity of this example.
pbam1_t read = inbam.supplyRead(i);
std::vector<std::string> tags;
read.AvailTags(tags);
for(unsigned int j = 0; j < tags.size(); j++) {
std::string Z_val;
Rcpp::Rcout << tags.at(j) << ":" << read.Tag_Type_SAM(tags.at(j)) << ":";
switch(read.Tag_Type(tags.at(j))) {
case 'A':
Rcpp::Rcout << read.tagVal_A(tags.at(j)) << '\t'; break;
case 'c':
Rcpp::Rcout << std::to_string(read.tagVal_c(tags.at(j))) << '\t'; break;
case 'C':
Rcpp::Rcout << std::to_string(read.tagVal_C(tags.at(j))) << '\t'; break;
case 's':
Rcpp::Rcout << std::to_string(read.tagVal_s(tags.at(j))) << '\t'; break;
case 'S':
Rcpp::Rcout << std::to_string(read.tagVal_S(tags.at(j))) << '\t'; break;
case 'i':
Rcpp::Rcout << std::to_string(read.tagVal_i(tags.at(j))) << '\t'; break;
case 'I':
Rcpp::Rcout << std::to_string(read.tagVal_I(tags.at(j))) << '\t'; break;
case 'f':
Rcpp::Rcout << std::to_string(read.tagVal_f(tags.at(j))) << '\t'; break;
case 'Z':
read.tagVal_Z(tags.at(j), Z_val);
Rcpp::Rcout << Z_val << '\t'; break;
case 'B':
// write code to store B-type tag in vector of appropriate type
Rcpp::Rcout << '\t'; break;
}
}
Rcpp::Rcout << '\n'; // Line break
sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] MyPkg_0.0.0.9000 ompBAM_1.11.0 rmarkdown_2.28
#>
#> loaded via a namespace (and not attached):
#> [1] xfun_0.48 bslib_0.8.0 htmlwidgets_1.6.4 devtools_2.4.5
#> [5] remotes_2.5.0 processx_3.8.4 callr_3.7.6 vctrs_0.6.5
#> [9] tools_4.4.1 ps_1.8.1 tibble_3.2.1 fansi_1.0.6
#> [13] pkgconfig_2.0.3 desc_1.4.3 lifecycle_1.0.4 compiler_4.4.1
#> [17] stringr_1.5.1 brio_1.1.5 credentials_2.0.2 httpuv_1.6.15
#> [21] htmltools_0.5.8.1 sys_3.4.3 usethis_3.0.0 buildtools_1.0.0
#> [25] sass_0.4.9 yaml_2.3.10 pillar_1.9.0 later_1.3.2
#> [29] crayon_1.5.3 jquerylib_0.1.4 urlchecker_1.0.1 ellipsis_0.3.2
#> [33] openssl_2.2.2 cachem_1.1.0 sessioninfo_1.2.2 mime_0.12
#> [37] digest_0.6.37 stringi_1.8.4 purrr_1.0.2 maketools_1.3.1
#> [41] rprojroot_2.0.4 fastmap_1.2.0 cli_3.6.3 magrittr_2.0.3
#> [45] pkgbuild_1.4.5 utf8_1.2.4 withr_3.0.2 promises_1.3.0
#> [49] roxygen2_7.3.2 askpass_1.2.1 memoise_2.0.1 shiny_1.9.1
#> [53] evaluate_1.0.1 knitr_1.48 testthat_3.2.1.1 miniUI_0.1.1.1
#> [57] profvis_0.4.0 rlang_1.1.4 gert_2.1.4 Rcpp_1.0.13
#> [61] xtable_1.8-4 glue_1.8.0 xml2_1.3.6 pkgload_1.4.0
#> [65] rstudioapi_0.17.1 jsonlite_1.8.9 R6_2.5.1 fs_1.6.4