Title: | Queries the Human Cell Atlas |
---|---|
Description: | Provides access to a copy of the Human Cell Atlas, but with harmonised metadata. This allows for uniform querying across numerous datasets within the Atlas using common fields such as cell type, tissue type, and patient ethnicity. Usage involves first querying the metadata table for cells of interest, and then downloading the corresponding cells into a SingleCellExperiment object. |
Authors: | Stefano Mangiola [aut, cre, rev], Michael Milton [aut, rev], Martin Morgan [ctb, rev], Vincent Carey [ctb, rev], Julie Iskander [rev], Tony Papenfuss [rev], Silicon Valley Foundation CZF2019-002443 [fnd], NIH NHGRI 5U24HG004059-18 [fnd], Victoria Cancer Agency ECRF21036 [fnd], NHMRC 1116955 [fnd] |
Maintainer: | Stefano Mangiola <[email protected]> |
License: | GPL-3 |
Version: | 1.5.0 |
Built: | 2024-11-14 06:03:44 UTC |
Source: | https://github.com/bioc/CuratedAtlasQueryR |
URL pointing to the full metadata file
DATABASE_URL
An object of class character of length 1.
A character scalar consisting of the URL
get_metadata(remote_url = DATABASE_URL)
Downloads a parquet database of the Human Cell Atlas metadata to a local cache, and then opens it as a data frame. It can then be filtered and passed into get_single_cell_experiment() to obtain a SingleCellExperiment::SingleCellExperiment object.
get_metadata(
  remote_url = DATABASE_URL,
  cache_directory = get_default_cache_dir(),
  use_cache = TRUE
)
remote_url | Optional character vector of length 1. An HTTP URL pointing to the location of the parquet database. |
cache_directory | Optional character vector of length 1. A file path on your local system to a directory (not a file) that will be used to store |
use_cache | Optional logical scalar. If |
The metadata was collected from the Bioconductor package cellxgenedp. Its vignette using_cellxgenedp provides an overview of the columns in the metadata. The data for which the column organism_name included "Homo sapiens" was collected from cellxgenedp.
The columns dataset_id and file_id link the datasets explorable through CuratedAtlasQueryR and cellxgenedp to the CELLxGENE portal.
Our representation harmonises the metadata at dataset, sample and cell levels in a single coherent database table.
Dataset-specific columns (definitions available at cellxgene.cziscience.com)
cell_count, collection_id, created_at.x, created_at.y, dataset_deployments, dataset_id, file_id, filename, filetype, is_primary_data.y, is_valid, linked_genesets, mean_genes_per_cell, name, published, published_at, revised_at, revision, s3_uri, schema_version, tombstone, updated_at.x, updated_at.y, user_submitted, x_normalization
Sample-specific columns (definitions available at cellxgene.cziscience.com)
sample_, .sample_name, age_days, assay, assay_ontology_term_id, development_stage, development_stage_ontology_term_id, ethnicity, ethnicity_ontology_term_id, experiment___, organism, organism_ontology_term_id, sample_placeholder, sex, sex_ontology_term_id, tissue, tissue_harmonised, tissue_ontology_term_id, disease, disease_ontology_term_id, is_primary_data.x
Cell-specific columns (definitions available at cellxgene.cziscience.com)
cell_, cell_type, cell_type_ontology_term_idm, cell_type_harmonised, confidence_class, cell_annotation_azimuth_l2, cell_annotation_blueprint_singler
Through harmonisation and curation we introduced custom columns that are not present in the original CELLxGENE metadata:
tissue_harmonised: a coarser tissue name for better filtering
age_days: the number of days corresponding to the age
cell_type_harmonised: the consensus cell identity call (for immune cells), based on the original annotation and three novel annotations obtained with Seurat Azimuth and SingleR
confidence_class: an ordinal class describing how confident the cell_type_harmonised call is. 1 is complete consensus, 2 is three out of four, and so on.
cell_annotation_azimuth_l2: Azimuth cell annotation
cell_annotation_blueprint_singler: SingleR cell annotation using the Blueprint reference
cell_annotation_blueprint_monaco: SingleR cell annotation using the Monaco reference
sample_id_db: Sample subdivision for internal use
file_id_db: File subdivision for internal use
sample_: Sample ID
.sample_name: How samples were defined
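As a rough illustration of how confidence_class might be used, the sketch below keeps only cells whose harmonised call was supported by at least three of the four annotations. The table is a small in-memory stand-in, not real atlas data: the cell IDs, the type_a/type_b labels, and the cut-off of 2 are all assumptions made for the example.

```r
library(dplyr)

# Hypothetical stand-in for a few rows of the harmonised metadata.
toy_cells <- tibble(
  cell_ = c("cell_1", "cell_2", "cell_3", "cell_4"),
  cell_type_harmonised = c("type_a", "type_b", "type_a", "type_b"),
  confidence_class = c(1, 2, 4, 1)
)

# Lower classes mean stronger agreement: 1 is complete consensus and
# 2 is three out of four, so "<= 2" keeps the better-supported calls.
high_confidence <- toy_cells |>
  filter(confidence_class <= 2)

# Count the retained cells per harmonised cell type.
type_counts <- high_confidence |>
  count(cell_type_harmonised)
```

The same filter() call should also work on the lazy table returned by get_metadata(), since it only uses a plain numeric comparison.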
Possible cache path issues
If your default R cache path includes non-standard characters (e.g. a dash because of your user or organisation name), the following error can occur:
Error in db_query_fields.DBIConnection(): ! Can't query fields. Caused by error: ! Parser Error: syntax error at or near "/" LINE 2: FROM /Users/bob/Library/Caches...
The solution is to choose a different cache directory, for example:
get_metadata(cache_directory = path.expand('~'))
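Before picking an alternative directory, it can help to check whether a candidate path contains characters that might trigger this error. The helper below is a hypothetical sketch, not part of the package: it accepts only letters, digits, underscores, dots, tildes, and forward slashes. That allow-list is an assumption, since the text above names only the dash as a known problem character.

```r
# Hypothetical helper (not part of CuratedAtlasQueryR): TRUE when the path
# consists solely of letters, digits, "_", ".", "~" and "/".
is_plain_path <- function(path) {
  grepl("^[A-Za-z0-9_./~]+$", path)
}

# Example paths; both are made up for illustration.
plain_ok <- is_plain_path("/Users/bob/Library/Caches")
plain_bad <- is_plain_path("/Users/my-org/Library/Caches")

# A candidate cache directory under the home directory.
candidate <- file.path(path.expand("~"), "curated_atlas_cache")
candidate_is_plain <- is_plain_path(candidate)

# If candidate_is_plain is TRUE, the directory could then be passed on:
# get_metadata(cache_directory = candidate)
```

A path that fails this check is not guaranteed to cause the error; the check is deliberately conservative.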
A lazy data.frame subclass containing the metadata. You can interact with this object using most standard dplyr functions. For string matching, it is recommended that you use stringr::str_like to filter character columns, as stringr::str_match will not work.
library(dplyr)
filtered_metadata <- get_metadata() |>
  filter(
    ethnicity == "African" &
      assay %LIKE% "%10x%" &
      tissue == "lung parenchyma" &
      cell_type %LIKE% "%CD4%"
  )
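The recommendation to use stringr::str_like can be tried out offline on a small in-memory table. The sketch below applies the same four conditions to a hypothetical stand-in for the metadata; the rows are invented for the example, and the real object returned by get_metadata() is a lazy table rather than a tibble.

```r
library(dplyr)
library(stringr)

# Hypothetical in-memory stand-in for the table returned by get_metadata().
toy_metadata <- tibble(
  cell_ = c("cell_1", "cell_2", "cell_3", "cell_4"),
  ethnicity = c("African", "African", "European", "African"),
  assay = c("10x 3' v3", "Smart-seq2", "10x 3' v2", "10x 5' v1"),
  tissue = c("lung parenchyma", "lung parenchyma", "blood", "lung parenchyma"),
  cell_type = c(
    "CD4-positive, alpha-beta T cell",
    "CD4-positive, alpha-beta T cell",
    "B cell",
    "CD8-positive, alpha-beta T cell"
  )
)

# str_like() uses SQL LIKE wildcards: "%" matches any run of characters.
filtered_toy <- toy_metadata |>
  filter(
    ethnicity == "African",
    str_like(assay, "%10x%"),
    tissue == "lung parenchyma",
    str_like(cell_type, "%CD4%")
  )
```

Only the first row satisfies every condition. The patterns are written in the same case as the data so the result does not depend on whether the installed stringr version matches case-insensitively.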
Given a data frame of HCA metadata, returns a Seurat object corresponding to the samples in that data frame.
get_seurat(...)
... | Arguments passed on to |
A Seurat object containing the same data as a call to get_single_cell_experiment().
meta <- get_metadata() |> head(2)
seurat <- get_seurat(meta)
Given a data frame of Curated Atlas metadata obtained from get_metadata(), returns a SingleCellExperiment::SingleCellExperiment object corresponding to the samples in that data frame.
get_single_cell_experiment(
  data,
  assays = "counts",
  cache_directory = get_default_cache_dir(),
  repository = COUNTS_URL,
  features = NULL
)
data | A data frame containing, at minimum, a |
assays | A character vector whose elements must be "counts" and/or "cpm", representing the corresponding assay(s) you want to request. By default only the counts assay is downloaded. If you are interested in comparing a limited number of genes, the "cpm" assay is more appropriate. |
cache_directory | An optional character vector of length one. If provided, it should indicate a local file path where any remotely accessed files should be copied. |
repository | A character vector of length one. If provided, it should be an HTTP URL pointing to the location where the single cell data is stored. |
features | An optional character vector of features (i.e. genes) to return the counts for. By default, counts for all features will be returned. |
A SingleCellExperiment object, with one assay for each value in the assays argument.
meta <- get_metadata() |> head(2)
sce <- get_single_cell_experiment(meta)
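To see why the "cpm" assay suits gene-level comparisons, the following sketch computes counts per million by hand on a toy matrix. It assumes "cpm" has its usual meaning (each cell's counts divided by that cell's total, times one million). How the package's stored cpm assay was produced is not described here, and the gene and cell names are invented.

```r
# Toy count matrix: rows are genes, columns are cells (values are made up).
counts <- matrix(
  c(10, 0, 90,
    5, 15, 30),
  nrow = 3,
  dimnames = list(c("gene_a", "gene_b", "gene_c"), c("cell_1", "cell_2"))
)

# Divide each column by its library size, then scale to one million.
library_sizes <- colSums(counts)
cpm <- sweep(counts, 2, library_sizes, "/") * 1e6

# gene_a has raw counts of 10 and 5, but the same share of each cell's total.
gene_a_cpm <- cpm["gene_a", ]
```

After scaling, every column sums to one million, so a gene's values can be compared between cells that were sequenced to different depths.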
Given a data frame of Curated Atlas metadata obtained from get_metadata(), returns a SingleCellExperiment::SingleCellExperiment object corresponding to the samples in that data frame.
get_SingleCellExperiment(...)
... | Arguments passed on to |
A SingleCellExperiment object, with one assay for each value in the assays argument.
meta <- get_metadata() |> head(2)
sce <- get_single_cell_experiment(meta)
Various metadata fields are not common between datasets, so it does not make sense for these to live in the main metadata table. This function is a utility that allows easy fetching of this data if necessary.
get_unharmonised_dataset(
  dataset_id,
  cells = NULL,
  conn = dbConnect(drv = duckdb(), read_only = TRUE),
  remote_url = UNHARMONISED_URL,
  cache_directory = get_default_cache_dir()
)
dataset_id | A character vector, where each entry is a dataset ID obtained from the |
cells | An optional character vector of cell IDs. If provided, only metadata for those cells will be returned. |
conn | An optional DuckDB connection object. If provided, it will re-use the existing connection instead of opening a new one. |
remote_url | Optional character vector of length 1. An HTTP URL pointing to the root URL under which all the unharmonised dataset files are located. |
cache_directory | Optional character vector of length 1. A file path on your local system to a directory (not a file) that will be used to store the unharmonised metadata files. |
A named list, where each name is a dataset file ID, and each value is a "lazy data frame", i.e. a tbl.
dataset <- "838ea006-2369-4e2c-b426-b2a744a2b02b"
harmonised_meta <- get_metadata() |>
  dplyr::filter(file_id == dataset) |>
  dplyr::collect()
unharmonised_meta <- get_unharmonised_dataset(dataset)
unharmonised_tbl <- dplyr::collect(unharmonised_meta[[dataset]])
dplyr::left_join(harmonised_meta, unharmonised_tbl, by = c("file_id", "cell_"))
Various metadata fields are not common between datasets, so it does not make sense for these to live in the main metadata table. This function is a utility that allows easy fetching of this data if necessary.
get_unharmonised_metadata(metadata, ...)
metadata | A lazy data frame obtained from |
... | Arguments passed on to |
A tibble with two columns:
file_id: the same file_id as the main metadata table obtained from get_metadata()
unharmonised: a nested tibble, with one row per cell in the input metadata, containing unharmonised metadata
harmonised <- dplyr::filter(get_metadata(), tissue == "kidney blood vessel")
unharmonised <- get_unharmonised_metadata(harmonised)
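The two-column result described above can be flattened into one long table when that is more convenient. The sketch below does this for a hypothetical nested tibble built by hand; it assumes each element of the unharmonised column is already an in-memory tibble (anything lazy would need dplyr::collect() first), and the donor_age and batch columns are invented to stand in for dataset-specific fields.

```r
library(dplyr)

# Hypothetical stand-in for the result of get_unharmonised_metadata():
# one row per file, with the per-cell metadata nested in a list column.
nested <- tibble(
  file_id = c("file_a", "file_b"),
  unharmonised = list(
    tibble(cell_ = c("c1", "c2"), donor_age = c("34", "51")),
    tibble(cell_ = "c3", batch = "b1")
  )
)

# Name each nested table by its file_id, then stack them. Columns that a
# dataset lacks are filled with NA, and .id restores the file_id column.
flat <- bind_rows(
  setNames(nested$unharmonised, nested$file_id),
  .id = "file_id"
)
```

Because unharmonised fields differ between datasets, the flattened table is usually sparse; keeping the nested form and working on one dataset at a time avoids that.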
URL pointing to the sample metadata file, which is smaller and intended for testing, demonstration, and vignette purposes only
SAMPLE_DATABASE_URL
An object of class character of length 1.
A character scalar consisting of the URL
get_metadata(remote_url = SAMPLE_DATABASE_URL)