Saving common bioinformatics file formats

Overview

The alabaster.files package implements methods to save common bioinformatics file formats within the alabaster framework. It does not parse the files and performs only cursory validation; it simply provides very lightweight wrappers for processing via alabaster.base::saveObject(). Check out the alabaster.base package for more details on the motivation and concepts behind alabaster.

Quick start

We’ll start with an indexed BAM file from the Rsamtools package:

bam.file <- system.file("extdata", "ex1.bam", package="Rsamtools", mustWork=TRUE)
bam.index <- paste0(bam.file, ".bai")

We can wrap this in a BamFileReference object:

library(alabaster.files)
library(S4Vectors)
wrapped.bam <- BamFileReference(bam.file, index=bam.index)

Then we can save it to a staging directory:

dir <- tempfile()
saveObject(wrapped.bam, dir)

… and load it back at some later time.

readObject(dir)
## BamFileReference object
## path: /tmp/RtmpBuy6Rx/file1771368e603d/file.bam 
## index: /tmp/RtmpBuy6Rx/file1771368e603d/file.bam.bai
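
To see what the save step actually wrote, we can list the contents of the directory. The following is a minimal sketch that repeats the quick start with a fresh directory; the variable names staged and contents are chosen here for illustration only, and the exact set of files may differ between versions of the package.

```r
library(alabaster.files)

# Same inputs as the quick start example above.
bam.file <- system.file("extdata", "ex1.bam", package="Rsamtools", mustWork=TRUE)
wrapped.bam <- BamFileReference(bam.file, index=paste0(bam.file, ".bai"))

# 'staged' is a name chosen for this sketch.
staged <- tempfile()
saveObject(wrapped.bam, staged)

# List everything that was written under the directory.
contents <- list.files(staged, recursive=TRUE)
print(contents)
```

The listing should include the copied file.bam and file.bam.bai that appear in the paths printed by readObject() above.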

Integration with other objects

The example above isn’t very exciting, but it demonstrates how these files can be easily added to an alabaster project. This allows us to incorporate the FileReference objects into other Bioconductor data structures, such as a DataFrame:

df <- DataFrame(Sample=LETTERS[1:4])

# Adding a list column of assorted wrapped files:
df$File <- list(
    wrapped.bam,
    BigWigFileReference(system.file("tests", "test.bw", package = "rtracklayer")),
    BigBedFileReference(system.file("tests", "test.bb", package = "rtracklayer")),
    BcfFileReference(system.file("extdata", "ex1.bcf.gz", package = "Rsamtools"))
)

# Saving it all to the staging directory:
dir <- tempfile()
saveObject(df, dir)

# Now reading it back in:
roundtrip <- readObject(dir)
roundtrip$File
## [[1]]
## BamFileReference object
## path: /tmp/RtmpBuy6Rx/file1771420b5a53/other_columns/1/other_contents/0/file.bam 
## index: /tmp/RtmpBuy6Rx/file1771420b5a53/other_columns/1/other_contents/0/file.bam.bai
## 
## [[2]]
## BigWigFileReference object
## path: /tmp/RtmpBuy6Rx/file1771420b5a53/other_columns/1/other_contents/1/file.bw 
## 
## [[3]]
## BigBedFileReference object
## path: /tmp/RtmpBuy6Rx/file1771420b5a53/other_columns/1/other_contents/2/file.bb 
## 
## [[4]]
## BcfFileReference object
## path: /tmp/RtmpBuy6Rx/file1771420b5a53/other_columns/1/other_contents/3/file.bcf 
## index: NULL

If the staging directory is uploaded to a remote store, the wrapped files are automatically included in the upload, so no separate process is needed to handle them.

Validation

alabaster.files performs some cursory validation of each wrapped file to catch errors in user inputs. The level of validation is format-dependent but is intended to be fast; for example, BAM file validation is performed by scanning the header. In all cases, users should not expect an exhaustive check of file validity, as that would take too long and involve more parsing than is appropriate for the scope of alabaster.files. If stricter validation is required, applications calling alabaster.files should override the saveObject() methods for the relevant FileReference classes.
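
To give a sense of how cheap a header-only check can be, here is a minimal sketch in base R. It is not the check that alabaster.files performs; looks_like_bam() is a hypothetical helper written for this illustration. It relies on two facts about the BAM format: the file is BGZF-compressed, which gzfile() can read, and the decompressed stream starts with the four magic bytes "BAM\1". Only those four bytes are read, so the cost does not depend on the size of the file.

```r
# Hypothetical helper, not part of alabaster.files: returns TRUE if the
# decompressed contents of the file begin with the BAM magic bytes "BAM\1".
looks_like_bam <- function(path) {
    con <- gzfile(path, "rb")  # BGZF is gzip-compatible, so gzfile() can read it
    on.exit(close(con))
    magic <- readBin(con, what="raw", n=4L)
    identical(magic, c(charToRaw("BAM"), as.raw(1L)))
}

# A stand-in file holding only the magic bytes, so the sketch needs no real BAM file.
fake.bam <- tempfile(fileext=".bam")
out <- gzfile(fake.bam, "wb")
writeBin(c(charToRaw("BAM"), as.raw(1L)), out)
close(out)

# A plain text file that is clearly not a BAM file.
not.bam <- tempfile(fileext=".bam")
writeLines("this is not a BAM file", not.bam)

looks_like_bam(fake.bam)
## [1] TRUE
looks_like_bam(not.bam)
## [1] FALSE
```

A check like this catches gross mistakes, such as passing a text file with a .bam extension, but says nothing about whether the alignment records themselves are well-formed. That is the trade-off described above.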

Session information

sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] S4Vectors_0.43.2      BiocGenerics_0.51.3   alabaster.files_1.5.0
## [4] alabaster.base_1.5.10 BiocStyle_2.33.1     
## 
## loaded via a namespace (and not attached):
##  [1] jsonlite_1.8.9          crayon_1.5.3            compiler_4.4.1         
##  [4] BiocManager_1.30.25     Rcpp_1.0.13             Biostrings_2.73.2      
##  [7] GenomicRanges_1.57.2    Rsamtools_2.21.2        rhdf5filters_1.17.0    
## [10] bitops_1.0-9            parallel_4.4.1          jquerylib_0.1.4        
## [13] IRanges_2.39.2          BiocParallel_1.39.0     yaml_2.3.10            
## [16] fastmap_1.2.0           XVector_0.45.0          R6_2.5.1               
## [19] GenomeInfoDb_1.41.2     knitr_1.48              maketools_1.3.1        
## [22] GenomeInfoDbData_1.2.13 bslib_0.8.0             rlang_1.1.4            
## [25] cachem_1.1.0            xfun_0.48               sass_0.4.9             
## [28] sys_3.4.3               cli_3.6.3               Rhdf5lib_1.27.0        
## [31] zlibbioc_1.51.2         digest_0.6.37           alabaster.schemas_1.5.0
## [34] rhdf5_2.49.0            lifecycle_1.0.4         evaluate_1.0.1         
## [37] codetools_0.2-20        buildtools_1.0.0        rmarkdown_2.28         
## [40] httr_1.4.7              tools_4.4.1             htmltools_0.5.8.1      
## [43] UCSC.utils_1.1.0