The IndexedFst class provides fast named random access to indexed fst files. It is based on the fst package, which provides fast random reading of data frames. This is particularly useful for manipulating large collections of binding sites without loading them all into memory.
Creating an indexed fst file from a data.frame is very simple:
library(scanMiRApp)
## Loading required package: scanMiR
# we create a temporary directory in which the files will be saved
tmp <- tempdir()
f <- file.path(tmp, "test")
# we create a dummy data.frame
d <- data.frame(category=sample(LETTERS[1:4], 10000, replace=TRUE),
                var2=sample(LETTERS, 10000, replace=TRUE),
                var3=runif(10000))
saveIndexedFst(d, index.by="category", file.prefix=f)
The file can then be loaded (without having all the data in memory) in the following way:
## [1] "IndexedFst"
## attr(,"package")
## [1] "scanMiRApp"
## <fst file>
## 10000 rows, 3 columns (test.fst)
##
## * 'category': character
## * 'var2' : character
## * 'var3' : double
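The calls that produced the output above are not shown. A minimal sketch that should give equivalent output is below; it recreates the example file so that it runs on its own, and the use of `summary()` to print the `<fst file>` listing is an assumption based on the shape of that output.

```r
library(scanMiRApp)

# Recreate the example file so that this chunk is self-contained
f <- file.path(tempdir(), "test")
d <- data.frame(category=sample(LETTERS[1:4], 10000, replace=TRUE),
                var2=sample(LETTERS, 10000, replace=TRUE),
                var3=runif(10000))
saveIndexedFst(d, index.by="category", file.prefix=f)

# Load the indexed file using the prefix; the table itself stays on disk
d2 <- loadIndexedFst(f)
class(d2)
summary(d2)  # assumption: summary() is what printed the column listing
```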
We can see that d2 is considerably smaller than the original d:
## [1] "237 Kb"
## [1] "2.4 Kb"
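The two sizes above can be obtained with base R's `object.size()`. The sketch below assumes that this is how they were computed; the exact values depend on the random data and on the package version.

```r
library(scanMiRApp)

# Self-contained setup: same dummy data as in the example above
f <- file.path(tempdir(), "test")
d <- data.frame(category=sample(LETTERS[1:4], 10000, replace=TRUE),
                var2=sample(LETTERS, 10000, replace=TRUE),
                var3=runif(10000))
saveIndexedFst(d, index.by="category", file.prefix=f)
d2 <- loadIndexedFst(f)

# In-memory footprint of the full data.frame versus the file-backed object
format(object.size(d), units="Kb")
format(object.size(d2), units="Kb")
```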
Nevertheless, a number of functions can be used normally on the object:
## [1] 10000
## [1] 3
## [1] "category" "var2" "var3"
## category var2 var3
## 1 A W 0.20581679
## 2 A I 0.34200166
## 3 A C 0.57244492
## 4 A G 0.28961427
## 5 A P 0.04906822
## 6 A W 0.97997983
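The four outputs above match what `nrow()`, `ncol()`, `colnames()` and `head()` would return. The following sketch assumes these are the functions that were called.

```r
library(scanMiRApp)

# Self-contained setup: same dummy data as in the example above
f <- file.path(tempdir(), "test")
d <- data.frame(category=sample(LETTERS[1:4], 10000, replace=TRUE),
                var2=sample(LETTERS, 10000, replace=TRUE),
                var3=runif(10000))
saveIndexedFst(d, index.by="category", file.prefix=f)
d2 <- loadIndexedFst(f)

# Dimension and inspection functions work on the file-backed object
nrow(d2)
ncol(d2)
colnames(d2)
head(d2)  # reads only the first rows from disk
```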
In addition, the object can be accessed as a list (using the indexed variable). Since in this case the file is indexed using the category column, the different categories can be accessed as names of the object:
## [1] "A" "B" "C" "D"
## A B C D
## 2475 2471 2551 2503
We can read specifically the rows pertaining to one category using:
## category var2 var3
## 1 B N 0.90477901
## 2 B V 0.71859241
## 3 B X 0.53217675
## 4 B Q 0.57689692
## 5 B K 0.08894358
## 6 B H 0.34006295
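The list-like access described above can be sketched as follows. Using `names()`, `lengths()` and the `$` operator is an assumption inferred from the outputs shown, and `catB` is a hypothetical variable name.

```r
library(scanMiRApp)

# Self-contained setup: same dummy data as in the example above
f <- file.path(tempdir(), "test")
d <- data.frame(category=sample(LETTERS[1:4], 10000, replace=TRUE),
                var2=sample(LETTERS, 10000, replace=TRUE),
                var3=runif(10000))
saveIndexedFst(d, index.by="category", file.prefix=f)
d2 <- loadIndexedFst(f)

names(d2)    # the values of the indexing column
lengths(d2)  # number of rows stored for each of them

# Read only the rows of category "B" from disk
catB <- d2$B
head(catB)
```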
In addition to data.frames, GRanges objects can be saved as indexed fst files. To demonstrate this, we first create a dummy GRanges object:
library(GenomicRanges)
gr <- GRanges(sample(LETTERS[1:3],200,replace=TRUE), IRanges(seq_len(200), width=2))
gr$propertyA <- factor(sample(letters[1:5],200,replace=TRUE))
gr
## GRanges object with 200 ranges and 1 metadata column:
## seqnames ranges strand | propertyA
## <Rle> <IRanges> <Rle> | <factor>
## [1] B 1-2 * | c
## [2] A 2-3 * | c
## [3] C 3-4 * | a
## [4] B 4-5 * | e
## [5] C 5-6 * | c
## ... ... ... ... . ...
## [196] C 196-197 * | b
## [197] A 197-198 * | d
## [198] C 198-199 * | e
## [199] B 199-200 * | b
## [200] B 200-201 * | c
## -------
## seqinfo: 3 sequences from an unspecified genome; no seqlengths
Again, the object can be saved and then loaded (without having all the data in memory) in the following way:
f2 <- file.path(tmp, "test2")
saveIndexedFst(gr, index.by="seqnames", file.prefix=f2)
d1 <- loadIndexedFst(f2)
names(d1)
## [1] "B" "A" "C"
## GRanges object with 6 ranges and 1 metadata column:
## seqnames ranges strand | propertyA
## <Rle> <IRanges> <Rle> | <factor>
## [1] A 2-3 * | c
## [2] A 7-8 * | a
## [3] A 19-20 * | e
## [4] A 21-22 * | b
## [5] A 23-24 * | c
## [6] A 24-25 * | c
## -------
## seqinfo: 3 sequences from an unspecified genome; no seqlengths
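The GRanges output above shows the first ranges on sequence `A`, but the call that produced it is not shown. A sketch that should give equivalent output is below, assuming `$` and `head()` were used; `grA` is a hypothetical variable name.

```r
library(scanMiRApp)
library(GenomicRanges)

# Self-contained setup: same dummy GRanges as in the example above
gr <- GRanges(sample(LETTERS[1:3], 200, replace=TRUE),
              IRanges(seq_len(200), width=2))
gr$propertyA <- factor(sample(letters[1:5], 200, replace=TRUE))
f2 <- file.path(tempdir(), "test2")
saveIndexedFst(gr, index.by="seqnames", file.prefix=f2)
d1 <- loadIndexedFst(f2)

# Read only the ranges on sequence "A"; they come back as a GRanges
grA <- d1$A
head(grA)
```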
Similarly, we could index using a different column:
## [1] "a" "b" "c" "d" "e"
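Indexing by another column only requires a different `index.by` value when saving. The sketch below assumes the metadata column `propertyA` was used, which is consistent with the names printed above.

```r
library(scanMiRApp)
library(GenomicRanges)

# Self-contained setup: same dummy GRanges as in the example above
gr <- GRanges(sample(LETTERS[1:3], 200, replace=TRUE),
              IRanges(seq_len(200), width=2))
gr$propertyA <- factor(sample(letters[1:5], 200, replace=TRUE))
f2 <- file.path(tempdir(), "test2")

# Index on the metadata column instead of on seqnames
saveIndexedFst(gr, index.by="propertyA", file.prefix=f2)
d1 <- loadIndexedFst(f2)
names(d1)
```

Saving under the same prefix overwrites the files written earlier, so the previous seqnames index is no longer available afterwards.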
The fst package supports multithreaded reading and writing. This also applies to IndexedFst objects, through the nthreads argument of loadIndexedFst and saveIndexedFst.
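As an illustration, the `nthreads` argument can be passed to both functions as below. The value of 2 threads is an arbitrary choice for this sketch, not a recommendation from the package.

```r
library(scanMiRApp)

# Self-contained setup: same dummy data as in the first example
f <- file.path(tempdir(), "test")
d <- data.frame(category=sample(LETTERS[1:4], 10000, replace=TRUE),
                var2=sample(LETTERS, 10000, replace=TRUE),
                var3=runif(10000))

# Write and read using two threads (arbitrary value)
saveIndexedFst(d, index.by="category", file.prefix=f, nthreads=2)
d2 <- loadIndexedFst(f, nthreads=2)
nrow(d2)
```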
The IndexedFst class is simply a wrapper around the fst package. In addition to the fst file, an rds file containing the index data is saved. In our last example, the following files were written:
## [1] "test2.fst" "test2.idx.rds"
Either file (or the prefix) can be used for loading, but both files need to have the same prefix.
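The file pair and the equivalent ways of loading it can be checked as in the sketch below. The `list.files()` call and the variable names `fromPrefix`, `fromFst` and `fromIdx` are assumptions made for illustration.

```r
library(scanMiRApp)
library(GenomicRanges)

# Self-contained setup: same dummy GRanges as in the example above
tmp <- tempdir()
gr <- GRanges(sample(LETTERS[1:3], 200, replace=TRUE),
              IRanges(seq_len(200), width=2))
gr$propertyA <- factor(sample(letters[1:5], 200, replace=TRUE))
f2 <- file.path(tmp, "test2")
saveIndexedFst(gr, index.by="seqnames", file.prefix=f2)

# Both files share the "test2" prefix
list.files(tmp, pattern="test2")

# Loading from the prefix or from either file should be equivalent
fromPrefix <- loadIndexedFst(f2)
fromFst <- loadIndexedFst(paste0(f2, ".fst"))
fromIdx <- loadIndexedFst(paste0(f2, ".idx.rds"))
```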
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] GenomicRanges_1.59.1 GenomeInfoDb_1.43.2 IRanges_2.41.2
## [4] S4Vectors_0.45.2 BiocGenerics_0.53.3 generics_0.1.3
## [7] fstcore_0.9.18 scanMiRApp_1.13.0 scanMiR_1.13.0
## [10] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] sys_3.4.3 jsonlite_1.8.9
## [3] magrittr_2.0.3 shinyjqui_0.4.1
## [5] GenomicFeatures_1.59.1 rmarkdown_2.29
## [7] BiocIO_1.17.1 zlibbioc_1.52.0
## [9] vctrs_0.6.5 memoise_2.0.1
## [11] Rsamtools_2.23.1 RCurl_1.98-1.16
## [13] htmltools_0.5.8.1 S4Arrays_1.7.1
## [15] progress_1.2.3 AnnotationHub_3.15.0
## [17] curl_6.0.1 SparseArray_1.7.2
## [19] sass_0.4.9 bslib_0.8.0
## [21] htmlwidgets_1.6.4 httr2_1.0.7
## [23] plotly_4.10.4 cachem_1.1.0
## [25] buildtools_1.0.0 GenomicAlignments_1.43.0
## [27] mime_0.12 lifecycle_1.0.4
## [29] pkgconfig_2.0.3 Matrix_1.7-1
## [31] R6_2.5.1 fastmap_1.2.0
## [33] GenomeInfoDbData_1.2.13 MatrixGenerics_1.19.0
## [35] shiny_1.10.0 digest_0.6.37
## [37] colorspace_2.1-1 AnnotationDbi_1.69.0
## [39] shinycssloaders_1.1.0 RSQLite_2.3.9
## [41] seqLogo_1.73.0 filelock_1.0.3
## [43] httr_1.4.7 abind_1.4-8
## [45] compiler_4.4.2 bit64_4.5.2
## [47] BiocParallel_1.41.0 DBI_1.2.3
## [49] biomaRt_2.63.0 rappdirs_0.3.3
## [51] DelayedArray_0.33.3 waiter_0.2.5
## [53] rjson_0.2.23 tools_4.4.2
## [55] httpuv_1.6.15 fst_0.9.8
## [57] glue_1.8.0 restfulr_0.0.15
## [59] promises_1.3.2 grid_4.4.2
## [61] gtable_0.3.6 tidyr_1.3.1
## [63] ensembldb_2.31.0 data.table_1.16.4
## [65] hms_1.1.3 xml2_1.3.6
## [67] XVector_0.47.1 BiocVersion_3.21.1
## [69] pillar_1.10.0 stringr_1.5.1
## [71] later_1.4.1 rintrojs_0.3.4
## [73] dplyr_1.1.4 BiocFileCache_2.15.0
## [75] lattice_0.22-6 rtracklayer_1.67.0
## [77] bit_4.5.0.1 tidyselect_1.2.1
## [79] maketools_1.3.1 Biostrings_2.75.3
## [81] knitr_1.49 ProtGenerics_1.39.1
## [83] SummarizedExperiment_1.37.0 xfun_0.49
## [85] shinydashboard_0.7.2 Biobase_2.67.0
## [87] matrixStats_1.4.1 DT_0.33
## [89] stringi_1.8.4 UCSC.utils_1.3.0
## [91] lazyeval_0.2.2 yaml_2.3.10
## [93] evaluate_1.0.1 codetools_0.2-20
## [95] tibble_3.2.1 BiocManager_1.30.25
## [97] cli_3.6.3 xtable_1.8-4
## [99] munsell_0.5.1 jquerylib_0.1.4
## [101] Rcpp_1.0.13-1 dbplyr_2.5.0
## [103] png_0.1-8 XML_3.99-0.17
## [105] parallel_4.4.2 ggplot2_3.5.1
## [107] blob_1.2.4 prettyunits_1.2.0
## [109] AnnotationFilter_1.31.0 bitops_1.0-9
## [111] pwalign_1.3.1 txdbmaker_1.3.1
## [113] viridisLite_0.4.2 scales_1.3.0
## [115] scanMiRData_1.12.0 purrr_1.0.2
## [117] crayon_1.5.3 rlang_1.1.4
## [119] cowplot_1.1.3 KEGGREST_1.47.0