Saving XStringSets to artifacts and back again

Overview

The alabaster.string package implements methods to save XStringSet objects to file artifacts and load them back into R. See the alabaster.base package for more details on the motivation and concepts of the alabaster framework.

Quick start

Given an XStringSet, we can use saveObject() to save it inside a staging directory:

library(Biostrings)
x <- DNAStringSet(c(seq1="CTCNACCAGTAT", seq2="TTGA", seq3="TACCTAGAG"))
mcols(x)$score <- runif(length(x))
x
## DNAStringSet object of length 3:
##     width seq                                               names               
## [1]    12 CTCNACCAGTAT                                      seq1
## [2]     4 TTGA                                              seq2
## [3]     9 TACCTAGAG                                         seq3
library(alabaster.string)
tmp <- tempfile()
saveObject(x, tmp)

list.files(tmp, recursive=TRUE)
## [1] "OBJECT"                               
## [2] "names.txt.gz"                         
## [3] "sequence_annotations/OBJECT"          
## [4] "sequence_annotations/basic_columns.h5"
## [5] "sequences.fasta.gz"
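
To get a feel for what these files contain, the staging directory can be inspected with ordinary R tools. The following is a minimal sketch rather than a description of the format: it assumes that OBJECT is a small JSON file (read here with jsonlite, which is not otherwise used in this vignette) and that the two .gz files are gzip-compressed text. validateObject() from alabaster.base is used to check the directory against the specification; it is expected to throw an error if the directory is invalid.

```r
library(Biostrings)
library(alabaster.string)

# Rebuild the example object and save it to a fresh staging directory.
x <- DNAStringSet(c(seq1="CTCNACCAGTAT", seq2="TTGA", seq3="TACCTAGAG"))
mcols(x)$score <- runif(length(x))
tmp <- tempfile()
saveObject(x, tmp)

# Assumption: OBJECT is JSON describing the type of the saved object.
meta <- jsonlite::fromJSON(file.path(tmp, "OBJECT"))
str(meta)

# Assumption: names and sequences are stored as gzip-compressed text.
name_lines <- readLines(gzfile(file.path(tmp, "names.txt.gz")))
fasta_lines <- readLines(gzfile(file.path(tmp, "sequences.fasta.gz")))
print(name_lines)
print(fasta_lines)

# Check the directory against the on-disk specification.
alabaster.base::validateObject(tmp)
```

Reading the files directly like this is only meant for exploration; readObject() remains the supported way to load the object.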

We can then load it back into the session with readObject().

roundtrip <- readObject(tmp)
class(roundtrip)
## [1] "DNAStringSet"
## attr(,"package")
## [1] "Biostrings"
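
Checking the class alone does not show that the contents survived the round trip. As a minimal sketch, the sequences, names and per-sequence annotations of the reloaded object can be compared against the original; the comparison of the score column uses all.equal() on the assumption that it comes back as an ordinary numeric vector.

```r
library(Biostrings)
library(alabaster.string)

# Same example object as above, with one annotation column.
x <- DNAStringSet(c(seq1="CTCNACCAGTAT", seq2="TTGA", seq3="TACCTAGAG"))
mcols(x)$score <- runif(length(x))

tmp <- tempfile()
saveObject(x, tmp)
roundtrip <- readObject(tmp)

# Sequences and names should match the original exactly.
stopifnot(identical(as.character(roundtrip), as.character(x)))
stopifnot(identical(names(roundtrip), names(x)))

# The annotation column should be restored alongside the sequences.
stopifnot(isTRUE(all.equal(mcols(roundtrip)$score, mcols(x)$score)))
```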

More details on the metadata and on-disk layout are provided in the schema.

Quality scaled strings

The same approach works with QualityScaledXStringSet objects:

x <- DNAStringSet(c("TTGA", "CTCN"))
q <- PhredQuality(c("*+,-", "6789"))
y <- QualityScaledDNAStringSet(x, q)

library(alabaster.string)
tmp <- tempfile()
saveObject(y, tmp)

roundtrip <- readObject(tmp)
class(roundtrip)
## [1] "QualityScaledDNAStringSet"
## attr(,"package")
## [1] "Biostrings"
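
As with plain XStringSets, it can be worth confirming that the quality strings come back intact and not just the class. The sketch below uses Biostrings' quality() accessor to compare the reloaded qualities against the originals; unname() is applied defensively in case the reloaded object carries names that the original did not.

```r
library(Biostrings)
library(alabaster.string)

# Same quality-scaled example as above.
x <- DNAStringSet(c("TTGA", "CTCN"))
q <- PhredQuality(c("*+,-", "6789"))
y <- QualityScaledDNAStringSet(x, q)

tmp <- tempfile()
saveObject(y, tmp)
roundtrip <- readObject(tmp)

# Compare the sequences and the quality strings of the two objects.
seq_ok <- identical(unname(as.character(roundtrip)), unname(as.character(y)))
qual_ok <- identical(
    unname(as.character(quality(roundtrip))),
    unname(as.character(quality(y)))
)
stopifnot(seq_ok, qual_ok)

# Inspect the class of the restored quality encoding.
class(quality(roundtrip))
```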

Session information

sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] alabaster.string_1.7.0 alabaster.base_1.7.2   Biostrings_2.75.2     
##  [4] GenomeInfoDb_1.43.2    XVector_0.47.0         IRanges_2.41.2        
##  [7] S4Vectors_0.45.2       BiocGenerics_0.53.3    generics_0.1.3        
## [10] BiocStyle_2.35.0      
## 
## loaded via a namespace (and not attached):
##  [1] jsonlite_1.8.9          compiler_4.4.2          BiocManager_1.30.25    
##  [4] crayon_1.5.3            Rcpp_1.0.13-1           rhdf5filters_1.19.0    
##  [7] jquerylib_0.1.4         yaml_2.3.10             fastmap_1.2.0          
## [10] R6_2.5.1                knitr_1.49              maketools_1.3.1        
## [13] GenomeInfoDbData_1.2.13 bslib_0.8.0             rlang_1.1.4            
## [16] cachem_1.1.0            xfun_0.49               sass_0.4.9             
## [19] sys_3.4.3               cli_3.6.3               Rhdf5lib_1.29.0        
## [22] zlibbioc_1.52.0         digest_0.6.37           alabaster.schemas_1.7.0
## [25] rhdf5_2.51.0            lifecycle_1.0.4         evaluate_1.0.1         
## [28] buildtools_1.0.0        rmarkdown_2.29          httr_1.4.7             
## [31] tools_4.4.2             htmltools_0.5.8.1       UCSC.utils_1.3.0