Title: | Import methods for 10X Genomics files |
---|---|
Description: | Provides a structured S4 approach to importing data files from the 10X pipelines. It mainly supports Single Cell Multiome ATAC + Gene Expression data among other data types. The main Bioconductor data representations used are SingleCellExperiment and RaggedExperiment. |
Authors: | Marcel Ramos [aut, cre] |
Maintainer: | Marcel Ramos <[email protected]> |
License: | Artistic-2.0 |
Version: | 1.9.2 |
Built: | 2024-11-14 03:26:04 UTC |
Source: | https://github.com/bioc/TENxIO |
The TENxFile constructor function serves as the auto-recognizer function for 10X files. It can import several different file extensions, namely:
* H5 - on-disk HDF5
* MTX - matrix market
* .tar.gz - compressed tarball
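As a rough illustration of the auto-recognition described above, the sketch below passes two of the example files bundled with TENxIO to TENxFile and prints the class selected for each. It assumes the TENxIO package is installed; the variable names are illustrative only.

```r
## A minimal sketch, assuming the TENxIO package and its bundled example
## files are installed; variable names are illustrative only.
library(TENxIO)

h5f <- system.file(
    "extdata", "pbmc_granulocyte_ff_bc_ex.h5",
    package = "TENxIO", mustWork = TRUE
)
mtxf <- system.file(
    "extdata", "pbmc_3k_ff_bc_ex.mtx",
    package = "TENxIO", mustWork = TRUE
)

## The extension is taken from the file path, so no override is given here
h5con <- TENxFile(h5f)
mtxcon <- TENxFile(mtxf)

## Inspect which class was selected for each extension
class(h5con)
class(mtxcon)
```

When the file path does not carry a usable extension, the extension argument can be supplied explicitly instead.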
TENxFile(resource, extension, ...)
resource |
character(1) The path to the file |
extension |
character(1) The file extension for the given resource. It
can usually be obtained from the file path. An override can be provided
especially for |
... |
Additional inputs to the low level class generator functions |
Note that the example below includes the use of a large ~ 4 GB ExperimentHub resource obtained from the 10X website.
A subclass of TENxFile according to the input file extension
if (interactive()) {
    ## from ExperimentHub
    hub <- ExperimentHub::ExperimentHub()
    fname <- hub[["EH1039"]]
    TENxFile(fname, extension = "h5", group = "mm10", version = "2")
    TENxFile(fname, extension = "h5", group = "mm10", version = "2") |>
        metadata()
}
The TENxFile class is the default representation for unrecognized subclasses. It inherits from the BiocFile class and adds a few additional slots. The constructor function can handle typical 10X file types. For more details, see the constructor function documentation.
## S4 method for signature 'TENxFile'
metadata(x, ...)
x |
An object of class |
... |
Additional arguments (not used) |
A list of metadata for the given object
metadata(TENxFile): metadata method for TENxFile objects
extension
character(1) The file extension as extracted from the file path or overridden via the extension argument in the constructor function.
colidx
integer(1) The column index corresponding to the columns in the file that will subsequently be imported
rowidx
integer(1) The row index corresponding to rows in the file that will subsequently be imported
remote
logical(1) Whether the file exists on the web, i.e., the resource is a URL
compressed
logical(1) Whether the file is compressed with, e.g., .gz
This constructor function is meant to handle .tar.gz tarball files from 10X Genomics.
TENxFileList(..., version, compressed = FALSE)
... |
Typically, a file path to a tarball archive. Can be named arguments corresponding to file paths, or a named list of file paths. |
version |
character(1) The version in the tarball. See details. |
compressed |
logical(1) Whether or not the file provided is compressed,
usually as |
These tarballs usually contain three files:
* matrix.mtx.gz - the counts matrix
* features.tsv.gz - row metadata usually represented as rowData
* barcodes.tsv.gz - column names corresponding to cell barcode identifiers
If all of the above files are in the tarball, the import method will provide a SingleCellExperiment. Otherwise, a simple list of imported data is given.
Note that version "3" uses 'features.tsv.gz' and version "2" uses 'genes.tsv.gz'. If known, indicate the version argument in the TENxFileList constructor function.
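When the version is not known in advance, the file names inside the tarball can hint at it. The sketch below uses only base R; guess_tenx_version is a hypothetical helper written for illustration and is not part of TENxIO. It is exercised on a tiny stand-in tarball of placeholder files rather than real 10X data.

```r
## Hypothetical helper (not part of TENxIO): guess the tarball version from
## the file names it contains, following the note above.
guess_tenx_version <- function(tarball) {
    contents <- basename(utils::untar(tarball, list = TRUE))
    if ("features.tsv.gz" %in% contents) {
        "3"
    } else if ("genes.tsv.gz" %in% contents) {
        "2"
    } else {
        NA_character_
    }
}

## Build a tiny stand-in tarball with placeholder files to exercise the helper
tdir <- tempfile()
dir.create(tdir)
for (f in c("matrix.mtx.gz", "genes.tsv.gz", "barcodes.tsv.gz")) {
    writeLines("placeholder", file.path(tdir, f))
}
tarball <- tempfile(fileext = ".tar.gz")
owd <- setwd(tdir)
utils::tar(
    tarball, files = list.files("."),
    compression = "gzip", tar = "internal"
)
setwd(owd)

version <- guess_tenx_version(tarball)
version

unlink(tdir, recursive = TRUE)
```

With a real tarball, the result could then be passed along as the version argument of TENxFileList.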
Either a SingleCellExperiment or a list of imported data
fl <- system.file(
    "extdata", "pbmc_granulocyte_sorted_3k_ff_bc_ex_matrix.tar.gz",
    package = "TENxIO", mustWork = TRUE
)

## Method 1 (tarball)
TENxFileList(fl)

## metadata before import
metadata(TENxFileList(fl))

## import() method
import(TENxFileList(fl))

## metadata after import
import(TENxFileList(fl)) |>
    metadata()

## untar to simulate folder output
dir.create(tdir <- tempfile())
untar(fl, exdir = tdir)

## Method 2 (folder)
TENxFileList(tdir)
import(TENxFileList(tdir))

## Method 3 (list of TENxFile objects)
files <- list.files(tdir, recursive = TRUE, full.names = TRUE)
names(files) <- basename(files)
filelist <- lapply(files, TENxFile)
TENxFileList(filelist, compressed = FALSE)

## Method 4 (SimpleList)
TENxFileList(as(filelist, "SimpleList"), compressed = FALSE)

## Method 5 (named arguments)
TENxFileList(
    barcodes.tsv.gz = TENxFile(files[1]),
    features.tsv.gz = TENxFile(files[2]),
    matrix.mtx.gz = TENxFile(files[3])
)

unlink(tdir, recursive = TRUE)
This class was designed to mainly handle tarballs from 10X Genomics. The typical file extension for these tarballs is .tar.gz.
## S4 method for signature 'TENxFileList'
path(object, ...)

## S4 method for signature 'TENxFileList'
decompress(manager, con, ...)

## S4 method for signature 'TENxFileList,ANY,ANY'
import(con, format, text, ...)

## S4 method for signature 'TENxFileList'
metadata(x, ...)
object |
An object containing paths. Even though it will typically contain
a single path, |
... |
Additional arguments (not used) |
manager |
A |
con |
The connection from which data is loaded or to which data is
saved. If this is a |
format |
The format of the output. If missing and |
text |
If |
x |
An object of class |
These tarballs usually contain three files:
* matrix.mtx.gz - the counts matrix
* features.tsv.gz - row metadata usually represented as rowData
* barcodes.tsv.gz - column names corresponding to cell barcode identifiers
Note that version '2' includes genes.tsv.gz, whereas version '3' includes features.tsv.gz instead.
An additional ref argument can be provided when the file contains multiple feature_type in the file or "Type" in the rowData. By default, the first type reported in table() is set as the mainExpName in the SingleCellExperiment object.
A TENxFileList class object
path(TENxFileList): Obtain file paths for all files in the object as a vector
decompress(TENxFileList): An intermediate method for decompressing (via untar) the contents of a .tar.gz file list
import(con = TENxFileList, format = ANY, text = ANY): Recursively import files within a TENxFileList
metadata(TENxFileList): metadata method for TENxFileList objects
listData
list() The data in list format
extension
character() A vector of file extensions for each file
compressed
logical(1) Whether the file is compressed as .tar.gz
version
character(1) The version number of the tarball usually either '2' or '3'
TENxFragments: Import fragments files from 10X
TENxFragments(resource, yieldSize = 200, which = GRanges(), ...)
resource |
character(1) The file path to the fragments resource, usually
a compressed tabix file with extension |
yieldSize |
numeric() The number of records to read; by default, 200 records will be imported. A warning will be emitted if not modified. |
which |
GRanges() A GRanges indicating the regions of interest. This
gets sent to |
... |
Further arguments to the class generator function (currently not used) |
A RaggedExperiment object class
fr <- system.file(
    "extdata", "pbmc_3k_atac_ex_fragments.tsv.gz",
    package = "TENxIO", mustWork = TRUE
)
tfr <- TENxFragments(fr)
fra <- import(tfr)
GRanges
This class is designed to work mainly with fragments.tsv.gz files from 10x pipelines.
## S4 method for signature 'TENxFragments,ANY,ANY'
import(con, format, text, ...)
con |
The connection from which data is loaded or to which data is
saved. If this is a |
format |
The format of the output. If missing and |
text |
If |
... |
Parameters to pass to the format-specific method. |
Fragments data from 10x can be quite large. In order to speed up the initial exploration of the data, we use a default of 200 records for loading. Users can change this default value by specifying a new one via the yieldSize argument in the constructor function.
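For instance, a larger yieldSize might be requested as sketched below. This assumes TENxIO and its dependencies are installed and uses the example fragments file bundled with the package; the value 1000 is arbitrary and chosen only for illustration.

```r
## A minimal sketch, assuming TENxIO and its dependencies are installed;
## 1000 is an arbitrary illustrative value.
library(TENxIO)

fr <- system.file(
    "extdata", "pbmc_3k_atac_ex_fragments.tsv.gz",
    package = "TENxIO", mustWork = TRUE
)

## Request more records than the default of 200
tfr <- TENxFragments(fr, yieldSize = 1000)
fra <- import(tfr)
fra
```

The number of records actually returned depends on how many the file contains.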
A TENxFragments class object
import(con = TENxFragments, format = ANY, text = ANY): Import method for representing fragments.tsv.gz data from 10x via Rsamtools and RaggedExperiment
which
GRanges() A GRanges indicating the regions of interest. This gets sent to Rsamtools as the param input.
yieldSize
numeric() The number of records to read; by default, 200 records will be imported. A warning will be emitted if not modified.
This constructor function was developed using the PBMC 3K dataset from 10X Genomics (version 3). Other versions are supported and input arguments version and group can be overridden.
TENxH5(resource, version, group, ranges, rowidx, colidx, ...)
resource |
character(1) The path to the file |
version |
character(1) There are currently two recognized versions associated with 10X data, either version "2" or "3". See details for more information. |
group |
character(1) The HDF5 group embedded within the file structure. This is usually either the "matrix" or "outs" group, but other groups are supported as well (e.g., "mm10"). |
ranges |
character(1) The HDF5 internal folder location embedded within
the file that points to the ranged data information, e.g.,
"/features/interval". Set to |
rowidx, colidx |
numeric() A vector of indices corresponding to either
rows or columns that will dictate the data imported from the file. The
indices will be passed on to the |
... |
Additional inputs to the low level class generator functions |
The various TENxH5 methods, including rowData and rowRanges, provide a snapshot of the data using a length 12 head and tail subset for efficiency. In contrast, methods such as dimnames and dim give a full view of the dimensions of the data. The show method provides relevant information regarding the dimensions of the data, including metadata such as rowData and the "Type" column, if available. The term "projection" refers to the data class that will be provided once the data file is imported.
An additional ref argument can be provided when the file contains multiple feature_type in the file or "Type" in the rowData. By default, the first type reported in table() is set as the mainExpName in the SingleCellExperiment object.
For data that do not contain genomic coordinate information, the TENxH5 constructor will fail to read "/features/interval" and will set the ranges argument to NA_character_.
The data version "3" mainly includes a "matrix" group and "interval" information within the file. Version "2" data does not include range-based information and has a different directory structure compared to version "3". See the internal data.frame TENxIO:::h5.version.map for a map of fields and their corresponding file locations within the H5 file. This map is used to create the rowData structure from the file.
Usually, a SingleCellExperiment instance
import section in TENxH5
h5f <- system.file(
    "extdata", "pbmc_granulocyte_ff_bc_ex.h5",
    package = "TENxIO", mustWork = TRUE
)
TENxH5(h5f)
import(TENxH5(h5f))

h5f <- system.file(
    "extdata", "10k_pbmc_ATACv2_f_bc_ex.h5",
    package = "TENxIO", mustWork = TRUE
)
## Optional ref input, most frequent Type used by default
th5 <- TENxH5(h5f, ranges = "/features/id", ref = "Peaks")
th5
TENxH5(h5f, ranges = "/features/id")
import(th5)
This class is designed to work with 10x Single Cell datasets. It was developed using the PBMC 3k 10X dataset from the CellRanger v2 pipeline.
## S4 method for signature 'TENxH5'
rowData(x, use.names = TRUE, ...)

## S4 method for signature 'TENxH5'
dim(x)

## S4 method for signature 'TENxH5'
dimnames(x)

## S4 method for signature 'TENxH5'
genome(x)

## S4 method for signature 'TENxH5'
rowRanges(x, ...)

## S4 method for signature 'TENxH5,ANY,ANY'
import(con, format, text, ...)

## S4 method for signature 'TENxH5'
show(object)
x |
A |
use.names |
For For |
... |
For For For other accessors, ignored. |
con |
The connection from which data is loaded or to which data is
saved. If this is a |
format |
The format of the output. If missing and |
text |
If |
object |
A |
The data version "3" mainly includes a "matrix" group and "interval" information within the file. Version "2" data does not include range-based information and has a different directory structure compared to version "3". See the internal data.frame TENxIO:::h5.version.map for a map of fields and their corresponding file locations within the H5 file. This map is used to create the rowData structure from the file.
A TENxH5 class object
rowData(TENxH5): Generate the rowData ad hoc from a TENxH5 file
dim(TENxH5): Get the dimensions of the data as stored in the file
dimnames(TENxH5): Get the dimension names from the file
genome(TENxH5): Read genome string from file
rowRanges(TENxH5): Read interval data and represent as GRanges
import(con = TENxH5, format = ANY, text = ANY): Import TENxH5 data as a SingleCellExperiment; see section below
show(TENxH5): Display a snapshot of the contents within a TENxH5 file before import
version
character(1) There are currently two recognized versions associated with 10X data, either version "2" or "3". See details for more information.
group
character(1) The HDF5 group embedded within the file structure. This is usually either the "matrix" or "outs" group, but other groups are supported as well.
ranges
character(1) The HDF5 internal folder location embedded within the file that points to the ranged data information, e.g., "/features/interval".
The import method uses DelayedArray::TENxMatrix to represent matrix data. Generally, version 3 datasets contain associated genomic coordinates. The associated feature data, as displayed by the rowData method, is queried for the "Type" column which will indicate that a splitAltExps operation is appropriate. If a ref input is provided to the constructor function TENxH5, it will be used as the main experiment; otherwise, the most frequent category in the "Type" column will be used. For example, the Multiome ATAC + Gene Expression feature data contains both 'Gene Expression' and 'Peaks' labels in the "Type" column.
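To see how the "Type" column relates to the imported object, one could inspect it before import and then check the experiment names afterwards. The sketch below assumes TENxIO and SingleCellExperiment are installed and uses an example file bundled with TENxIO; it only prints what it finds and makes no claim about which labels that file contains.

```r
## A minimal sketch, assuming TENxIO and SingleCellExperiment are installed.
library(TENxIO)
library(SingleCellExperiment)

h5f <- system.file(
    "extdata", "pbmc_granulocyte_ff_bc_ex.h5",
    package = "TENxIO", mustWork = TRUE
)
con <- TENxH5(h5f)

## Feature types present in the rowData snapshot (head and tail subset)
table(rowData(con)[["Type"]])

## After import, the main and alternative experiment names reflect the split
sce <- import(con)
mainExpName(sce)
altExpNames(sce)
```

Passing ref to the constructor, as in TENxH5(h5f, ref = "Peaks") for a file that contains that label, would select a different main experiment.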
The package provides file classes based on BiocIO for common file extensions found on the 10X Genomics website.
Here is a table of supported file extensions, their classes, and the classes they are imported as:
Extension | Class | Imported as |
---|---|---|
.h5 | TENxH5 | SingleCellExperiment w/ TENxMatrix |
.mtx / .mtx.gz | TENxMTX | SummarizedExperiment w/ dgCMatrix |
.tar.gz | TENxFileList | SingleCellExperiment w/ dgCMatrix |
peak_annotation.tsv | TENxPeaks | GRanges |
fragments.tsv.gz | TENxFragments | RaggedExperiment |
.tsv / .tsv.gz | TENxTSV | tibble |
Maintainer: Marcel Ramos [email protected] (ORCID)
This constructor function accepts the .mtx format and its compressed .mtx.gz variant for eventual importing. It is mainly used with tarball files from 10X Genomics, where more annotation data is included. Importing solely the .mtx format will provide users with a SummarizedExperiment with an assay of class dgCMatrix from the Matrix package. Currently, other formats are not supported, but if you'd like to request support for a format, please open an issue on GitHub.
TENxMTX(resource, compressed = FALSE, ...)
resource |
character(1) The path to the file |
compressed |
logical(1) Whether the resource file is compressed (default FALSE) |
... |
Additional inputs to the low level class generator functions |
A SummarizedExperiment instance with a dgCMatrix in the assay
mtxf <- system.file(
    "extdata", "pbmc_3k_ff_bc_ex.mtx",
    package = "TENxIO", mustWork = TRUE
)
con <- TENxMTX(mtxf)
import(con)
This class is designed to work with 10x MTX datasets, particularly from the multiome pipelines.
## S4 method for signature 'TENxMTX,ANY,ANY'
import(con, format, text, ...)
con |
The connection from which data is loaded or to which data is
saved. If this is a |
format |
The format of the output. If missing and |
text |
If |
... |
Parameters to pass to the format-specific method. |
The TENxMTX class is a straightforward implementation that allows the user to import a Matrix Market file format using Matrix::readMM. Currently, it returns a SummarizedExperiment with an internal dgCMatrix assay. To request other formats, please open an issue on GitHub.
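The core reading step can be sketched with the Matrix package alone. The example below writes a small toy matrix to a temporary Matrix Market file, reads it back with Matrix::readMM, and converts the result to a dgCMatrix. The coercion via as(., "CsparseMatrix") is an assumption made for illustration, not necessarily how TENxIO performs it, and the wrapping into a SummarizedExperiment is omitted.

```r
## A minimal sketch using only the Matrix package; the toy matrix and the
## coercion step are illustrative assumptions, not TENxIO internals.
library(Matrix)

## Toy 3 x 2 sparse counts matrix with three non-zero entries
m <- sparseMatrix(
    i = c(1, 3, 2), j = c(1, 1, 2), x = c(5, 2, 7),
    dims = c(3, 2)
)

## Round trip through a Matrix Market file
mtx <- tempfile(fileext = ".mtx")
writeMM(m, mtx)
counts <- as(readMM(mtx), "CsparseMatrix")

class(counts)
dim(counts)

unlink(mtx)
```

readMM returns a triplet-form sparse matrix, which is why a conversion to column-compressed form is shown before the matrix would be used as an assay.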
A TENxMTX class object
import(con = TENxMTX, format = ANY, text = ANY): Import method mainly for mtx.gz files from 10x
compressed
logical(1) Whether or not the file is in compressed format, usually gzipped (.gz).
This constructor function is designed to work with the files denoted with "peak_annotation" in the file name. These are usually produced as tab separated value files, i.e., .tsv.
TENxPeaks(resource, extension, ...)
resource |
character(1) The path to the file |
extension |
character(1) The file extension for the given resource. It
can usually be obtained from the file path. An override can be provided
especially for |
... |
Additional inputs to the low level class generator functions |
The output class allows handling of peak data. It can be used in conjunction with the annotation method on a SingleCellExperiment to add peak information to the experiment. The ranged data is represented as a GRanges class object.
A GRanges class object of peak locations
fi <- system.file(
    "extdata", "pbmc_granulocyte_sorted_3k_ex_atac_peak_annotation.tsv",
    package = "TENxIO", mustWork = TRUE
)
peak_file <- TENxPeaks(fi)
peak_anno <- import(peak_file)
peak_anno

example(TENxH5)

## Add peaks to an existing SCE
## First, import the SCE from an example H5 file
h5f <- system.file(
    "extdata", "pbmc_granulocyte_ff_bc_ex.h5",
    package = "TENxIO", mustWork = TRUE
)
con <- TENxH5(h5f)
sce <- import(con)
## auto-import peaks when using annotation<-
annotation(sce, name = "peak_annotation") <- peak_file
annotation(sce)
This class is designed to work with the files denoted with "peak_annotation" in the file name. These are usually produced as tab separated value files, i.e., .tsv.
## S4 method for signature 'TENxPeaks,ANY,ANY'
import(con, format, text, ...)

## S4 replacement method for signature 'SingleCellExperiment,ANY'
annotation(object, ...) <- value

## S4 method for signature 'SingleCellExperiment'
annotation(object, ...)
con |
The connection from which data is loaded or to which data is
saved. If this is a |
format |
The format of the output. If missing and |
text |
If |
... |
Parameters to pass to the format-specific method. |
object |
The object to export. |
value |
The annotation information to set on |
This is a straightforward class for handling peak data. It can be used in conjunction with the annotation method on a SingleCellExperiment to add peak information to the experiment. The ranged data is represented as a GRanges class object.
A TENxPeaks class object
import(con = TENxPeaks, format = ANY, text = ANY): Import a peak_annotation file from 10x as a GRanges representation
annotation(object = SingleCellExperiment) <- value: Replacement method to add annotation data to a SingleCellExperiment
annotation(SingleCellExperiment): Extraction method to obtain annotation data from a SingleCellExperiment representation
This is a general-purpose class for reading in tabular data from the 10x Genomics website with the .tsv file extension. The class also supports compressed files, i.e., those with the .tsv.gz extension.
## S4 method for signature 'TENxTSV,ANY,ANY'
import(con, format, text, ...)

TENxTSV(resource, compressed = FALSE, ...)

## S4 method for signature 'TENxTSV'
metadata(x, ...)
con |
The connection from which data is loaded or to which data is
saved. If this is a |
format |
The format of the output. If missing and |
text |
If |
... |
Parameters to pass to the format-specific method. |
resource |
character(1) The path to the file |
compressed |
logical(1) Whether the resource file is compressed (default FALSE) |
x |
A |
Typical .tsv files obtained from the 10X website are compressed and contain information relevant to 'barcodes' and 'features'. Currently, the code only supports files such as features.tsv.* and barcodes.tsv.*.
A TENxTSV class object; a tibble for the import method
import(con = TENxTSV, format = ANY, text = ANY): General import method for tsv files from 10x; using readr::read_tsv and returning a tibble representation
metadata(TENxTSV): metadata method for TENxTSV objects