This package makes it easy to query the Human Cell Atlas (HCA) Data Portal through its data browser API. Visit the Human Cell Atlas for more information on the project.
Evaluate the following code chunk to install packages required for this vignette.
## install from Bioconductor if you haven't already
pkgs <- c("httr", "dplyr", "LoomExperiment", "hca")
pkgs_needed <- pkgs[!pkgs %in% rownames(installed.packages())]
BiocManager::install(pkgs_needed)
Load the packages into your R session.
library(httr)
library(dplyr)
library(LoomExperiment)
library(hca)
To illustrate use of this package, consider the task of downloading a ‘loom’ file summarizing single-cell gene expression observed in an HCA research project. This could be accomplished by visiting the HCA data portal (at https://data.humancellatlas.org/explore) in a web browser and selecting projects interactively, but it is valuable to accomplish the same goal in a reproducible, flexible, programmatic way. We will (1) discover projects available in the HCA Data Coordinating Center that have loom files; and (2) retrieve a file from the HCA and import the data into R as a ‘LoomExperiment’ object. For illustration, we focus on the first project whose title starts with ‘Single’; in the output shown here, that is ‘Single cell RNA sequencing of human liver reveals distinct intrahepatic macrophage populations’.
Use projects() to retrieve the first 200 projects in the HCA’s default catalog.
projects(size = 200)
## # A tibble: 200 × 14
## projectId projectTitle genusSpecies sampleEntityType specimenOrgan
## <chr> <chr> <list> <list> <list>
## 1 74b6d569-3b11-42ef-… 1.3 Million… <chr [1]> <chr [1]> <chr [1]>
## 2 53c53cd4-8127-4e12-… A Cellular … <chr [1]> <chr [1]> <chr [1]>
## 3 7027adc6-c9c9-46f3-… A Cellular … <chr [1]> <chr [1]> <chr [1]>
## 4 94e4ee09-9b4b-410a-… A Human Liv… <chr [1]> <chr [2]> <chr [1]>
## 5 c5b475f2-76b3-4a8e-… A Partial P… <chr [1]> <chr [1]> <chr [1]>
## 6 60ea42e1-af49-42f5-… A Protocol … <chr [1]> <chr [1]> <chr [1]>
## 7 ef1e3497-515e-4bbe-… A Single-Ce… <chr [1]> <chr [1]> <chr [3]>
## 8 9ac53858-606a-4b89-… A Single-Ce… <chr [1]> <chr [1]> <chr [1]>
## 9 258c5e15-d125-4f2d-… A Single-Ce… <chr [1]> <chr [1]> <chr [1]>
## 10 894ae6ac-5b48-41a8-… A Single-Ce… <chr [1]> <chr [1]> <chr [1]>
## # ℹ 190 more rows
## # ℹ 9 more variables: specimenOrganPart <list>, selectedCellType <list>,
## # libraryConstructionApproach <list>, nucleicAcidSource <list>,
## # pairedEnd <list>, workflow <list>, specimenDisease <list>,
## # donorDisease <list>, developmentStage <list>
Use filters() to restrict the projects to just those that contain at least one ‘loom’ file.
project_filter <- filters(fileFormat = list(is = "loom"))
project_tibble <- projects(project_filter)
project_tibble
## # A tibble: 78 × 14
## projectId projectTitle genusSpecies sampleEntityType specimenOrgan
## <chr> <chr> <list> <list> <list>
## 1 53c53cd4-8127-4e12-… A Cellular … <chr [1]> <chr [1]> <chr [1]>
## 2 7027adc6-c9c9-46f3-… A Cellular … <chr [1]> <chr [1]> <chr [1]>
## 3 c1810dbc-16d2-45c3-… A cell atla… <chr [2]> <chr [1]> <chr [2]>
## 4 a9301beb-e9fa-42fe-… A human cel… <chr [1]> <chr [1]> <chr [14]>
## 5 996120f9-e84f-409f-… A human sin… <chr [1]> <chr [1]> <chr [1]>
## 6 842605c7-375a-47c5-… A single ce… <chr [1]> <chr [1]> <chr [1]>
## 7 cc95ff89-2e68-4a08-… A single ce… <chr [1]> <chr [1]> <chr [3]>
## 8 a004b150-1c36-4af6-… A single-ce… <chr [1]> <chr [1]> <chr [1]>
## 9 1cd1f41f-f81a-486b-… A single-ce… <chr [1]> <chr [1]> <chr [1]>
## 10 8185730f-4113-40d3-… A single-ce… <chr [1]> <chr [1]> <chr [1]>
## # ℹ 68 more rows
## # ℹ 9 more variables: specimenOrganPart <list>, selectedCellType <list>,
## # libraryConstructionApproach <list>, nucleicAcidSource <list>,
## # pairedEnd <list>, workflow <list>, specimenDisease <list>,
## # donorDisease <list>, developmentStage <list>
Use standard R commands to further filter projects to the one we are interested in, with title starting with “Single…”. Extract the unique projectId for the first project with this title.
project_tibble |>
    filter(startsWith(projectTitle, "Single")) |>
    head(1) |>
    t()
## [,1]
## projectId "4d6f6c96-2a83-43d8-8fe1-0f53bffd4674"
## projectTitle "Single cell RNA sequencing of human liver reveals distinct intrahepatic macrophage populations"
## genusSpecies "Homo sapiens"
## sampleEntityType "specimens"
## specimenOrgan "liver"
## specimenOrganPart "caudate lobe"
## selectedCellType character,0
## libraryConstructionApproach "10x 3' v2"
## nucleicAcidSource "single cell"
## pairedEnd FALSE
## workflow character,2
## specimenDisease "normal"
## donorDisease "normal"
## developmentStage "human adult stage"
projectIds <-
    project_tibble |>
    filter(startsWith(projectTitle, "Single")) |>
    dplyr::pull(projectId)
projectId <- projectIds[1]
A project id can be used to discover the title or additional project information.
project_title(projectId)
## [1] "Single cell RNA sequencing of human liver reveals distinct intrahepatic macrophage populations"
project_information(projectId)
## Title
## Single cell RNA sequencing of human liver reveals distinct
## intrahepatic macrophage populations
## Contributors (unknown order; any role)
## Sonya,A,MacParland, Jeff,C,Liu, Gary,D,Bader, Ian,D,McGilvray,
## Xue-Zhong Ma, Brendan,T,Innes, Agata,M,Bartczak, Blair,K,Gage, Justin
## Manuel, Nicholas Khuu, Juan Echeverri, Ivan Linares, Rahul Gupta,
## Michael,L,Cheng, Lewis,Y,Liu, Damra Camat, Sai,W,Chung,
## Rebecca,K,Seliga, Zigong Shao, Elizabeth Lee, Shinichiro Ogawa, Mina
## Ogawa, Michael,D,Wilson, Jason,E,Fish, Markus Selzner, Anand
## Ghanekar, David Grant, Paul Greig, Gonzalo Sapisochin, Nazia Selzner,
## Neil Winegarden, Oyedele Adeyi, Gordon Keller, William,G,Sullivan
## Description
## The liver is the largest solid organ in the body and is critical for
## metabolic and immune functions. However, little is known about the
## cells that make up the human liver and its immune microenvironment.
## Here we report a map of the cellular landscape of the human liver
## using single-cell RNA sequencing. We provide the transcriptional
## profiles of 8444 parenchymal and non-parenchymal cells obtained from
## the fractionation of fresh hepatic tissue from five human livers.
## Using gene expression patterns, flow cytometry, and
## immunohistochemical examinations, we identify 20 discrete cell
## populations of hepatocytes, endothelial cells, cholangiocytes,
## hepatic stellate cells, B cells, conventional and non-conventional T
## cells, NK-like cells, and distinct intrahepatic monocyte/macrophage
## populations. Together, our study presents a comprehensive view of the
## human liver at single-cell resolution that outlines the
## characteristics of resident cells in the liver, and in particular
## provides a map of the human hepatic immune microenvironment.
## DOI
## 10.1038/s41467-018-06318-7
## URL
## https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6197289/
## Project
## https://data.humancellatlas.org/explore/projects/4d6f6c96-2a83-43d8-8fe1-0f53bffd4674
files() retrieves (the first 1000) files from the Human Cell Atlas data portal. Construct a filter to restrict the files to loom files from the project we are interested in.
file_filter <- filters(
    projectId = list(is = projectId),
    fileFormat = list(is = "loom")
)
# only the two smallest files
file_tibble <- files(file_filter, size = 2, sort = "fileSize", order = "asc")
file_tibble
## # A tibble: 2 × 8
## fileId name fileFormat size version projectTitle projectId url
## <chr> <chr> <chr> <int> <chr> <chr> <chr> <chr>
## 1 b8150aca-83a2-5c… d178… loom 1.11e9 2021-0… Single cell… 4d6f6c96… http…
## 2 b1f60da2-db89-55… sc-l… loom 1.18e9 2021-0… Single cell… 4d6f6c96… http…
files_download() will download one or more files (one for each row) in file_tibble. The download is more complicated than simply following the url column of file_tibble, so it is not possible to simply copy the url into a browser. We’ll download the file and then immediately import it into R.
file_locations <- file_tibble |> files_download()
LoomExperiment::import(
    unname(file_locations[1]),
    type = "SingleCellLoomExperiment"
)
## class: SingleCellLoomExperiment
## dim: 58347 348643
## metadata(10): last_modified CreationDate ...
## optimus_output_schema_version pipeline_version
## assays(1): matrix
## rownames: NULL
## rowData names(29): Gene antisense_reads ... reads_per_molecule
## spliced_reads
## colnames: NULL
## colData names(43): CellID antisense_reads ... reads_unmapped
## spliced_reads
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowGraphs(0): NULL
## colGraphs(0): NULL
Note that files_download() uses BiocFileCache (https://bioconductor.org/packages/BiocFileCache), so individual files are only downloaded once.
h5ad files
This example walks through the process of file discovery and retrieval in a little more detail, using h5ad files created by the Python AnnData analysis software and available for some experiments in the default catalog.
The first challenge is to understand what file formats are available from the HCA. Obtain a tibble describing the ‘facets’ of the data, the number of terms used in each facet, and the number of distinct values used to describe projects.
projects_facets()
## # A tibble: 39 × 3
## facet n_terms n_values
## <chr> <int> <int>
## 1 accessible 2 475
## 2 assayType 2 475
## 3 biologicalSex 5 830
## 4 bionetworkName 8 478
## 5 cellLineType 6 492
## 6 contactName 5987 7469
## 7 contentDescription 72 1943
## 8 dataUseRestriction 4 475
## 9 developmentStage 185 1133
## 10 donorDisease 496 1244
## # ℹ 29 more rows
Note the fileFormat facet, and repeat projects_facets() to discover detail about available file formats.
projects_facets("fileFormat")
## # A tibble: 86 × 3
## facet term count
## <chr> <chr> <int>
## 1 fileFormat xlsx 349
## 2 fileFormat fastq.gz 348
## 3 fileFormat tsv.gz 101
## 4 fileFormat tar 88
## 5 fileFormat mtx.gz 86
## 6 fileFormat loom 78
## 7 fileFormat bam 76
## 8 fileFormat csv.gz 74
## 9 fileFormat txt.gz 68
## 10 fileFormat csv 48
## # ℹ 76 more rows
Note that h5ad is one of the available file formats, although it falls outside the first 10 rows shown. Use this as a filter to discover relevant projects.
filters <- filters(fileFormat = list(is = "h5ad"))
projects(filters)
## # A tibble: 40 × 14
## projectId projectTitle genusSpecies sampleEntityType specimenOrgan
## <chr> <chr> <list> <list> <list>
## 1 cdabcf0b-7602-4abf-… A blood atl… <chr [1]> <chr [1]> <chr [1]>
## 2 c1810dbc-16d2-45c3-… A cell atla… <chr [2]> <chr [1]> <chr [2]>
## 3 c0518445-3b3b-49c6-… A cellular … <chr [1]> <chr [1]> <chr [2]>
## 4 b176d756-62d8-4933-… A human emb… <chr [2]> <chr [1]> <chr [2]>
## 5 2fe3c60b-ac1a-4c61-… A human fet… <chr [1]> <chr [2]> <chr [2]>
## 6 73769e0a-5fcd-41f4-… A proximal-… <chr [1]> <chr [1]> <chr [2]>
## 7 cc95ff89-2e68-4a08-… A single ce… <chr [1]> <chr [1]> <chr [3]>
## 8 957261f7-2bd6-4358-… A spatially… <chr [1]> <chr [1]> <chr [1]>
## 9 ae9f439b-bd47-4d6e-… A temporal … <chr [1]> <chr [1]> <chr [1]>
## 10 1dddae6e-3753-48af-… Cell Types … <chr [1]> <chr [2]> <chr [2]>
## # ℹ 30 more rows
## # ℹ 9 more variables: specimenOrganPart <list>, selectedCellType <list>,
## # libraryConstructionApproach <list>, nucleicAcidSource <list>,
## # pairedEnd <list>, workflow <list>, specimenDisease <list>,
## # donorDisease <list>, developmentStage <list>
The default tibble produced by projects() contains only some of the information available; the underlying records are much richer. To obtain a tibble with an expanded set of columns, set the as parameter to "tibble_expanded".
# an expanded set of columns for all or the first 4 projects
projects(as = 'tibble_expanded', size = 4)
## # A tibble: 4 × 127
## projectId cellSuspensions.orga…¹ cellSuspensions.organ cellSuspensions.sele…²
## <chr> <list> <chr> <list>
## 1 74b6d569-… <chr [1]> brain <chr [1]>
## 2 53c53cd4-… <chr [2]> prostate gland <chr [7]>
## 3 7027adc6-… <chr [0]> heart <chr [0]>
## 4 94e4ee09-… <chr [0]> liver <chr [0]>
## # ℹ abbreviated names: ¹cellSuspensions.organPart,
## # ²cellSuspensions.selectedCellType
## # ℹ 123 more variables: cellSuspensions.totalCells <int>,
## # cellSuspensions.totalCellsRedundant <int>,
## # dates.aggregateLastModifiedDate <chr>, dates.aggregateSubmissionDate <chr>,
## # dates.aggregateUpdateDate <chr>, dates.lastModifiedDate <chr>,
## # dates.submissionDate <chr>, dates.updateDate <chr>, …
In the next sections, we’ll cover other options for the as parameter, and the data formats they return.
projects() as an R list
Instead of retrieving the result of projects() as a tibble, retrieve it as a ‘list-of-lists’.
projects_list <- projects(size = 200, as = "list")
This is a complicated structure. We will use lengths(), names(), and standard R list selection operations to navigate this a bit. At the top level there are three elements: pagination, termFacets, and hits.
hits represents each project as a list, e.g.,
lengths(projects_list$hits[[1]])
## protocols entryId sources projects
## 2 1 1 1
## samples specimens cellLines donorOrganisms
## 1 1 0 1
## organoids cellSuspensions dates fileTypeSummaries
## 0 1 1 2
shows that there are 12 different ways in which the first project is described. Each component is itself a list-of-lists, e.g.,
lengths(projects_list$hits[[1]]$projects[[1]])
## projectId projectTitle projectShortname
## 1 1 1
## laboratory estimatedCellCount isTissueAtlasProject
## 1 1 1
## tissueAtlas bionetworkName dataUseRestriction
## 0 1 0
## projectDescription contributors publications
## 1 6 1
## supplementaryLinks matrices contributedAnalyses
## 1 0 1
## accessions accessible
## 3 1
projects_list$hits[[1]]$projects[[1]]$projectTitle
## [1] "1.3 Million Brain Cells from E18 Mice"
One can use standard R commands to navigate this data structure, and to, e.g., extract the projectTitle of each project.
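A minimal sketch of that extraction follows, using a small hand-made list in place of a live query so that it runs offline. The mock_list object and the project_titles() helper are illustrative assumptions and not part of the hca package; with a real result, projects_list$hits would take the place of mock_list$hits.

```r
## Hypothetical stand-in for the list returned by the HCA API; a real
## result has many more fields under each element of 'hits'.
mock_list <- list(
    hits = list(
        list(
            entryId = "id-1",
            projects = list(list(projectTitle = "First project title"))
        ),
        list(
            entryId = "id-2",
            projects = list(list(projectTitle = "Second project title"))
        )
    )
)

## Illustrative helper (not an hca function): collect the projectTitle
## of every entry under 'projects' for a single hit.
project_titles <- function(hit) {
    vapply(hit$projects, function(p) p$projectTitle, character(1))
}

## Apply the helper to each hit and flatten to a character vector
titles <- unlist(lapply(mock_list$hits, project_titles))
titles
```

Using vapply() with character(1) makes the helper fail loudly if a project entry lacks a single title, which is usually preferable to a silently ragged result.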
projects() as an lol
Use as = "lol" to create a more convenient way to select, filter and extract elements from the list-of-lists returned by projects().
lol <- projects(size = 200, as = "lol")
lol
## # class: lol_hca lol
## # number of distinct paths: 26756
## # total number of elements: 187098
## # number of leaf paths: 20749
## # number of leaf elements: 148780
## # lol_path():
## # A tibble: 26,756 × 3
## path n is_leaf
## <chr> <int> <lgl>
## 1 hits 1 FALSE
## 2 hits[*] 200 FALSE
## 3 hits[*].cellLines 200 FALSE
## 4 hits[*].cellLines[*] 29 FALSE
## 5 hits[*].cellLines[*].cellLineType 29 FALSE
## 6 hits[*].cellLines[*].cellLineType[*] 38 TRUE
## 7 hits[*].cellLines[*].id 29 FALSE
## 8 hits[*].cellLines[*].id[*] 122 TRUE
## 9 hits[*].cellLines[*].modelOrgan 29 FALSE
## 10 hits[*].cellLines[*].modelOrgan[*] 39 TRUE
## # ℹ 26,746 more rows
Use lol_select() to restrict the lol to particular paths, and lol_filter() to filter results to paths that are leaves, or that have specific numbers of entries.
lol_select(lol, "hits[*].projects[*]")
## # class: lol_hca lol
## # number of distinct paths: 26631
## # total number of elements: 133542
## # number of leaf paths: 20689
## # number of leaf elements: 110037
## # lol_path():
## # A tibble: 26,631 × 3
## path n is_leaf
## <chr> <int> <lgl>
## 1 hits[*].projects[*] 200 FALSE
## 2 hits[*].projects[*].accessible 200 TRUE
## 3 hits[*].projects[*].accessions 200 FALSE
## 4 hits[*].projects[*].accessions[*] 566 FALSE
## 5 hits[*].projects[*].accessions[*].accession 566 TRUE
## 6 hits[*].projects[*].accessions[*].namespace 566 TRUE
## 7 hits[*].projects[*].bionetworkName 200 FALSE
## 8 hits[*].projects[*].bionetworkName[*] 201 TRUE
## 9 hits[*].projects[*].contributedAnalyses 200 FALSE
## 10 hits[*].projects[*].contributedAnalyses.developmentStage 2 FALSE
## # ℹ 26,621 more rows
lol_select(lol, "hits[*].projects[*]") |>
    lol_filter(n == 44, is_leaf)
## # class: lol_hca lol
## # number of distinct paths: 0
## # total number of elements: 0
## # number of leaf paths: 0
## # number of leaf elements: 0
## # lol_path():
## # A tibble: 0 × 3
## # ℹ 3 variables: path <chr>, n <int>, is_leaf <lgl>
lol_pull() extracts a path from the lol as a vector; lol_lpull() extracts paths as lists.
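The path notation may be easier to read alongside a small base R sketch. The helper below, list_paths(), is a hypothetical illustration and is not how hca implements lol_path(); it walks a hand-made nested list (mock), writes [*] for unnamed, array-like elements and a dot before named ones, and the result is then tabulated to show how often each path occurs.

```r
## Illustrative helper (an assumption, not part of hca): return the path
## of every element reachable in a nested list.
list_paths <- function(x, prefix = "") {
    if (!is.list(x))
        return(character())
    nms <- names(x)
    out <- character()
    for (i in seq_along(x)) {
        ## unnamed elements are treated as array entries, written '[*]'
        key <- if (is.null(nms) || !nzchar(nms[[i]])) "[*]" else nms[[i]]
        path <- if (identical(key, "[*]")) {
            paste0(prefix, key)
        } else if (nzchar(prefix)) {
            paste0(prefix, ".", key)
        } else {
            key
        }
        out <- c(out, path, list_paths(x[[i]], path))
    }
    out
}

## Hand-made stand-in for a query result; values are invented
mock <- list(hits = list(
    list(projects = list(
        list(projectTitle = "A", laboratory = list("x", "y"))
    )),
    list(projects = list(
        list(projectTitle = "B", laboratory = list("z"))
    ))
))

paths <- list_paths(mock)
## one row per distinct path, with the number of occurrences
as.data.frame(table(path = paths), responseName = "n")
```

In this sketch a path ending in [*] under laboratory occurs three times because the two mock projects list two laboratories and one laboratory, which mirrors how the n column can exceed the number of hits.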
projects() tibbles with specific columns
The path or its abbreviation can be used to specify the columns of the tibble to be returned by the projects() query.
Here we retrieve additional details of donor count and total cells by adding appropriate path abbreviations to a named character vector. Names on the character vector can be used to rename the path more concisely, but the paths must uniquely identify elements in the list-of-lists.
columns <- c(
    projectId = "hits[*].entryId",
    projectTitle = "hits[*].projects[*].projectTitle",
    genusSpecies = "hits[*].donorOrganisms[*].genusSpecies[*]",
    donorCount = "hits[*].donorOrganisms[*].donorCount",
    cellSuspensions.organ = "hits[*].cellSuspensions[*].organ[*]",
    totalCells = "hits[*].cellSuspensions[*].totalCells"
)
projects <- projects(filters, columns = columns)
projects
## # A tibble: 40 × 6
## projectId projectTitle genusSpecies donorCount cellSuspensions.organ
## <chr> <chr> <list> <int> <list>
## 1 cdabcf0b-7602-4ab… A blood atl… <chr [1]> 124 <chr [1]>
## 2 c1810dbc-16d2-45c… A cell atla… <chr [2]> 24 <chr [2]>
## 3 c0518445-3b3b-49c… A cellular … <chr [1]> 17 <chr [2]>
## 4 b176d756-62d8-493… A human emb… <chr [2]> 36 <chr [2]>
## 5 2fe3c60b-ac1a-4c6… A human fet… <chr [1]> 38 <chr [2]>
## 6 73769e0a-5fcd-41f… A proximal-… <chr [1]> 3 <chr [2]>
## 7 cc95ff89-2e68-4a0… A single ce… <chr [1]> 28 <chr [3]>
## 8 957261f7-2bd6-435… A spatially… <chr [1]> 13 <chr [1]>
## 9 ae9f439b-bd47-4d6… A temporal … <chr [1]> 8 <chr [1]>
## 10 1dddae6e-3753-48a… Cell Types … <chr [1]> 6 <chr [1]>
## # ℹ 30 more rows
## # ℹ 1 more variable: totalCells <list>
Note that the cellSuspensions.organ and totalCells columns have more than one entry per project.
projects |>
    select(projectId, cellSuspensions.organ, totalCells)
## # A tibble: 40 × 3
## projectId cellSuspensions.organ totalCells
## <chr> <list> <list>
## 1 cdabcf0b-7602-4abf-9afb-3b410e545703 <chr [1]> <int [0]>
## 2 c1810dbc-16d2-45c3-b45e-3e675f88d87b <chr [2]> <int [2]>
## 3 c0518445-3b3b-49c6-b8fc-c41daa4eacba <chr [2]> <int [2]>
## 4 b176d756-62d8-4933-83a4-8b026380262f <chr [2]> <int [2]>
## 5 2fe3c60b-ac1a-4c61-9b59-f6556c0fce63 <chr [2]> <int [1]>
## 6 73769e0a-5fcd-41f4-9083-41ae08bfa4c1 <chr [2]> <int [0]>
## 7 cc95ff89-2e68-4a08-a234-480eca21ce79 <chr [3]> <int [3]>
## 8 957261f7-2bd6-4358-a6ed-24ee080d5cfc <chr [1]> <int [0]>
## 9 ae9f439b-bd47-4d6e-bd72-32dc70b35d97 <chr [1]> <int [1]>
## 10 1dddae6e-3753-48af-b20e-fa22abad125d <chr [1]> <int [0]>
## # ℹ 30 more rows
In this case, the mapping between cellSuspensions.organ and totalCells is clear, but in general more refined navigation of the lol structure may be necessary.
projects |>
    select(projectId, cellSuspensions.organ, totalCells) |>
    filter(
        ## 2023-06-06 two projects have different 'organ' and
        ## 'totalCells' lengths, causing problems with `unnest()`
        lengths(cellSuspensions.organ) == lengths(totalCells)
    ) |>
    tidyr::unnest(c("cellSuspensions.organ", "totalCells"))
## # A tibble: 29 × 3
## projectId cellSuspensions.organ totalCells
## <chr> <chr> <int>
## 1 c1810dbc-16d2-45c3-b45e-3e675f88d87b thymus 456000
## 2 c1810dbc-16d2-45c3-b45e-3e675f88d87b colon 16000
## 3 c0518445-3b3b-49c6-b8fc-c41daa4eacba lung 40200
## 4 c0518445-3b3b-49c6-b8fc-c41daa4eacba nose 7087
## 5 b176d756-62d8-4933-83a4-8b026380262f forelimb 48000
## 6 b176d756-62d8-4933-83a4-8b026380262f hindlimb 56000
## 7 cc95ff89-2e68-4a08-a234-480eca21ce79 immune system 274182
## 8 cc95ff89-2e68-4a08-a234-480eca21ce79 blood 1615910
## 9 cc95ff89-2e68-4a08-a234-480eca21ce79 bone marrow 600000
## 10 ae9f439b-bd47-4d6e-bd72-32dc70b35d97 brain 90000
## # ℹ 19 more rows
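Rather than dropping mismatched rows silently, it can help to list them first. The sketch below uses a small hand-made data frame (mock_projects, with invented ids and counts) and base R only; with a real query, the projects tibble from above would be used instead.

```r
## Hand-made stand-in for the 'projects' tibble; ids and values are invented
mock_projects <- data.frame(
    projectId = c("p1", "p2", "p3"),
    stringsAsFactors = FALSE
)
mock_projects$cellSuspensions.organ <- list(
    c("thymus", "colon"), "brain", c("lung", "nose")
)
mock_projects$totalCells <- list(c(100L, 200L), integer(0), c(300L, 400L))

## TRUE where the two list-columns can be paired element by element
ok <- lengths(mock_projects$cellSuspensions.organ) ==
    lengths(mock_projects$totalCells)

## Projects that the filter step would drop
dropped <- mock_projects$projectId[!ok]
dropped

## Base R equivalent of unnesting the rows that do align
aligned <- mock_projects[ok, ]
long <- data.frame(
    projectId = rep(aligned$projectId, lengths(aligned$totalCells)),
    cellSuspensions.organ = unlist(aligned$cellSuspensions.organ),
    totalCells = unlist(aligned$totalCells),
    stringsAsFactors = FALSE
)
long
```

Inspecting dropped before unnesting makes it explicit which projects are excluded from the long-format table.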
Select the following entry, augment the filter, and query available files.
projects |>
    filter(startsWith(projectTitle, "Reconstruct")) |>
    glimpse()
## Rows: 1
## Columns: 6
## $ projectId <chr> "f83165c5-e2ea-4d15-a5cf-33f3550bffde"
## $ projectTitle <chr> "Reconstructing the human first trimester fetal-…
## $ genusSpecies <list> "Homo sapiens"
## $ donorCount <int> 16
## $ cellSuspensions.organ <list> <"blood", "decidua", "placenta">
## $ totalCells <list> <>
This approach can be used to customize the tibbles returned by the other main functions in the package, files(), samples(), and bundles().
The relevant file can be selected and downloaded using the technique in the first example.
filters <- filters(
    projectId = list(is = "f83165c5-e2ea-4d15-a5cf-33f3550bffde"),
    fileFormat = list(is = "h5ad")
)
files <-
    files(filters) |>
    head(1) # only first file, for demonstration
files |> t()
## [,1]
## fileId "6d4fedcf-857d-5fbb-9928-8b9605500a69"
## name "vento18_ss2.processed.h5ad"
## fileFormat "h5ad"
## size "82121633"
## version "2021-02-10T16:56:40.419579Z"
## projectTitle "Reconstructing the human first trimester fetal-maternal interface using single cell transcriptomics"
## projectId "f83165c5-e2ea-4d15-a5cf-33f3550bffde"
## url "https://service.azul.data.humancellatlas.org/repository/files/6d4fedcf-857d-5fbb-9928-8b9605500a69?catalog=dcp43&version=2021-02-10T16%3A56%3A40.419579Z"
h5ad files can be read as SingleCellExperiment objects using the zellkonverter package.
project_filter <- filters(fileFormat = list(is = "csv"))
project_tibble <- projects(project_filter)
project_tibble |>
    filter(
        startsWith(
            projectTitle,
            "Reconstructing the human first trimester"
        )
    )
## # A tibble: 1 × 14
## projectId projectTitle genusSpecies sampleEntityType specimenOrgan
## <chr> <chr> <list> <list> <list>
## 1 f83165c5-e2ea-4d15-a… Reconstruct… <chr [1]> <chr [1]> <chr [3]>
## # ℹ 9 more variables: specimenOrganPart <list>, selectedCellType <list>,
## # libraryConstructionApproach <list>, nucleicAcidSource <list>,
## # pairedEnd <list>, workflow <list>, specimenDisease <list>,
## # donorDisease <list>, developmentStage <list>
projectId <-
    project_tibble |>
    filter(
        startsWith(
            projectTitle,
            "Reconstructing the human first trimester"
        )
    ) |>
    pull(projectId)
file_filter <- filters(
    projectId = list(is = projectId),
    fileFormat = list(is = "csv")
)
## first 4 files will be returned
file_tibble <- files(file_filter, size = 4)
file_tibble |>
    files_download()
## 7f9a181e-24c5-5462-b308-7fef5b1bda2a-2021-02-10T16:56:40.419579Z
## "/github/home/.cache/R/hca/175e2e000eaa_175e2e000eaa.csv"
## d04c6e3c-b740-5586-8420-4480a1b5706c-2021-02-10T16:56:40.419579Z
## "/github/home/.cache/R/hca/175e5bc0688a_175e5bc0688a.csv"
## d30ffc0b-7d6e-5b85-aff9-21ec69663a81-2021-02-10T16:56:40.419579Z
## "/github/home/.cache/R/hca/175e559dbb3f_175e559dbb3f.csv"
## e1517725-01b0-5346-9788-afca63e9993a-2021-02-10T16:56:40.419579Z
## "/github/home/.cache/R/hca/175e1ea730e2_175e1ea730e2.csv"
files(), bundles(), and samples() can each return many thousands of results. It is necessary to ‘page’ through these to see all of them. We illustrate pagination with projects(), retrieving only 30 projects at a time.
Pagination works for the default tibble output.
page_1_tbl <- projects(size = 30)
page_1_tbl
## # A tibble: 30 × 14
## projectId projectTitle genusSpecies sampleEntityType specimenOrgan
## <chr> <chr> <list> <list> <list>
## 1 74b6d569-3b11-42ef-… 1.3 Million… <chr [1]> <chr [1]> <chr [1]>
## 2 53c53cd4-8127-4e12-… A Cellular … <chr [1]> <chr [1]> <chr [1]>
## 3 7027adc6-c9c9-46f3-… A Cellular … <chr [1]> <chr [1]> <chr [1]>
## 4 94e4ee09-9b4b-410a-… A Human Liv… <chr [1]> <chr [2]> <chr [1]>
## 5 c5b475f2-76b3-4a8e-… A Partial P… <chr [1]> <chr [1]> <chr [1]>
## 6 60ea42e1-af49-42f5-… A Protocol … <chr [1]> <chr [1]> <chr [1]>
## 7 ef1e3497-515e-4bbe-… A Single-Ce… <chr [1]> <chr [1]> <chr [3]>
## 8 9ac53858-606a-4b89-… A Single-Ce… <chr [1]> <chr [1]> <chr [1]>
## 9 258c5e15-d125-4f2d-… A Single-Ce… <chr [1]> <chr [1]> <chr [1]>
## 10 894ae6ac-5b48-41a8-… A Single-Ce… <chr [1]> <chr [1]> <chr [1]>
## # ℹ 20 more rows
## # ℹ 9 more variables: specimenOrganPart <list>, selectedCellType <list>,
## # libraryConstructionApproach <list>, nucleicAcidSource <list>,
## # pairedEnd <list>, workflow <list>, specimenDisease <list>,
## # donorDisease <list>, developmentStage <list>
page_2_tbl <- page_1_tbl |> hca_next()
page_2_tbl
## # A tibble: 30 × 14
## projectId projectTitle genusSpecies sampleEntityType specimenOrgan
## <chr> <chr> <list> <list> <list>
## 1 9f17ed7d-9325-4723-… A single ce… <chr [1]> <chr [1]> <chr [1]>
## 2 842605c7-375a-47c5-… A single ce… <chr [1]> <chr [1]> <chr [1]>
## 3 cc95ff89-2e68-4a08-… A single ce… <chr [1]> <chr [1]> <chr [3]>
## 4 a62dae2e-cd69-4d5c-… A single-ce… <chr [2]> <chr [1]> <chr [6]>
## 5 6663070f-fd8b-41a9-… A single-ce… <chr [1]> <chr [1]> <chr [1]>
## 6 c31fa434-c9ed-4263-… A single-ce… <chr [1]> <chr [1]> <chr [18]>
## 7 dcc28fb3-7bab-48ce-… A single-ce… <chr [1]> <chr [1]> <chr [1]>
## 8 d3446f0c-30f3-4a12-… A single-ce… <chr [1]> <chr [1]> <chr [1]>
## 9 a004b150-1c36-4af6-… A single-ce… <chr [1]> <chr [1]> <chr [1]>
## 10 1defdada-a365-44ad-… A single-ce… <chr [1]> <chr [1]> <chr [1]>
## # ℹ 20 more rows
## # ℹ 9 more variables: specimenOrganPart <list>, selectedCellType <list>,
## # libraryConstructionApproach <list>, nucleicAcidSource <list>,
## # pairedEnd <list>, workflow <list>, specimenDisease <list>,
## # donorDisease <list>, developmentStage <list>
## should be identical to page_1_tbl
page_2_tbl |> hca_prev()
## # A tibble: 30 × 14
## projectId projectTitle genusSpecies sampleEntityType specimenOrgan
## <chr> <chr> <list> <list> <list>
## 1 74b6d569-3b11-42ef-… 1.3 Million… <chr [1]> <chr [1]> <chr [1]>
## 2 53c53cd4-8127-4e12-… A Cellular … <chr [1]> <chr [1]> <chr [1]>
## 3 7027adc6-c9c9-46f3-… A Cellular … <chr [1]> <chr [1]> <chr [1]>
## 4 94e4ee09-9b4b-410a-… A Human Liv… <chr [1]> <chr [2]> <chr [1]>
## 5 c5b475f2-76b3-4a8e-… A Partial P… <chr [1]> <chr [1]> <chr [1]>
## 6 60ea42e1-af49-42f5-… A Protocol … <chr [1]> <chr [1]> <chr [1]>
## 7 ef1e3497-515e-4bbe-… A Single-Ce… <chr [1]> <chr [1]> <chr [3]>
## 8 9ac53858-606a-4b89-… A Single-Ce… <chr [1]> <chr [1]> <chr [1]>
## 9 258c5e15-d125-4f2d-… A Single-Ce… <chr [1]> <chr [1]> <chr [1]>
## 10 894ae6ac-5b48-41a8-… A Single-Ce… <chr [1]> <chr [1]> <chr [1]>
## # ℹ 20 more rows
## # ℹ 9 more variables: specimenOrganPart <list>, selectedCellType <list>,
## # libraryConstructionApproach <list>, nucleicAcidSource <list>,
## # pairedEnd <list>, workflow <list>, specimenDisease <list>,
## # donorDisease <list>, developmentStage <list>
Pagination also works for lol objects.
page_1_lol <- projects(size = 5, as = "lol")
page_1_lol |>
    lol_pull("hits[*].projects[*].projectTitle")
## [1] "1.3 Million Brain Cells from E18 Mice"
## [2] "A Cellular Anatomy of the Normal Adult Human Prostate and Prostatic Urethra"
## [3] "A Cellular Atlas of Pitx2-Dependent Cardiac Development."
## [4] "A Human Liver Cell Atlas reveals Heterogeneity and Epithelial Progenitors"
## [5] "A Partial Picture of the Single-Cell Transcriptomics of Human IgA Nephropathy"
page_2_lol <-
    page_1_lol |>
    hca_next()
page_2_lol |>
    lol_pull("hits[*].projects[*].projectTitle")
## [1] "A Protocol for Revealing Oral Neutrophil Heterogeneity by Single-Cell Immune Profiling in Human Saliva"
## [2] "A Single-Cell Atlas of the Human Healthy Airways"
## [3] "A Single-Cell Characterization of Human Post-implantation Embryos Cultured In Vitro Delineates Morphogenesis in Primary Syncytialization"
## [4] "A Single-Cell Transcriptome Atlas of Glia Diversity in the Human Hippocampus across the Lifespan and in Alzheimer’s Disease"
## [5] "A Single-Cell Transcriptome Atlas of the Human Pancreas."
## should be identical to page_1_lol
page_2_lol |>
    hca_prev() |>
    lol_pull("hits[*].projects[*].projectTitle")
## [1] "1.3 Million Brain Cells from E18 Mice"
## [2] "A Cellular Anatomy of the Normal Adult Human Prostate and Prostatic Urethra"
## [3] "A Cellular Atlas of Pitx2-Dependent Cardiac Development."
## [4] "A Human Liver Cell Atlas reveals Heterogeneity and Epithelial Progenitors"
## [5] "A Partial Picture of the Single-Cell Transcriptomics of Human IgA Nephropathy"
Much like projects() and files(), samples() and bundles() allow you to provide a filter object and additional criteria to retrieve data in the form of samples and bundles, respectively.
heart_filters <- filters(organ = list(is = "heart"))
heart_samples <- samples(filters = heart_filters, size = 4)
heart_samples
## # A tibble: 4 × 6
## entryId projectTitle genusSpecies disease format count
## <chr> <chr> <chr> <chr> <list> <lis>
## 1 012c52ff-4770-4c0c-8c2e-c348da… A Cellular … Mus musculus normal <chr> <int>
## 2 035db5b9-a219-4df8-bfc9-117cd0… A Cellular … Mus musculus normal <chr> <int>
## 3 09e425f7-22d7-487e-b78b-78b449… A Cellular … Mus musculus normal <chr> <int>
## 4 2273e44d-9fbc-4c13-8cb3-3caf8a… A Cellular … Mus musculus normal <chr> <int>
heart_bundles <- bundles(filters = heart_filters, size = 4)
heart_bundles
## # A tibble: 4 × 6
## projectTitle genusSpecies samples files bundleUuid bundleVersion
## <chr> <chr> <list> <lis> <chr> <chr>
## 1 A Cellular Atlas of Pitx2… Mus musculus <chr> <chr> 0d391bd1-… 2021-02-26T0…
## 2 A Cellular Atlas of Pitx2… Mus musculus <chr> <chr> 165a2df1-… 2021-02-26T0…
## 3 A Cellular Atlas of Pitx2… Mus musculus <chr> <chr> 166c1b1a-… 2023-07-19T1…
## 4 A Cellular Atlas of Pitx2… Mus musculus <chr> <chr> 18bad6b1-… 2021-02-26T0…
HCA experiments are organized into catalogs, each of which can be summarized with the hca::summary() function.
heart_filters <- filters(organ = list(is = "heart"))
hca::summary(filters = heart_filters, type = "fileTypeSummaries")
## # A tibble: 34 × 3
## format count totalSize
## <chr> <int> <dbl>
## 1 fastq.gz 30365 3.09e13
## 2 fastq 316 6.53e11
## 3 tsv.gz 273 1.39e11
## 4 png 270 8.96e 6
## 5 h5 180 1.75e10
## 6 loom 169 3.56e11
## 7 bam 164 3.28e12
## 8 zip 148 9.46e 9
## 9 mtx.gz 98 1.91e10
## 10 csv 89 1.15e 8
## # ℹ 24 more rows
first_catalog <- catalogs()[1]
hca::summary(type = "overview", catalog = first_catalog)
## # A tibble: 7 × 2
## name value
## <chr> <dbl>
## 1 projectCount 4.75e 2
## 2 specimenCount 2.25e 4
## 3 speciesCount 3 e 0
## 4 fileCount 5.24e 5
## 5 totalFileSize 3.27e14
## 6 donorCount 9.18e 3
## 7 labCount 8.23e 2
Each project, file, sample, and bundle has its own unique ID which, in conjunction with its catalog, identifies it uniquely.
heart_filters <- filters(organ = list(is = "heart"))
heart_projects <- projects(filters = heart_filters, size = 4)
heart_projects
## # A tibble: 4 × 14
## projectId projectTitle genusSpecies sampleEntityType specimenOrgan
## <chr> <chr> <chr> <list> <list>
## 1 7027adc6-c9c9-46f3-8… A Cellular … Mus musculus <chr [1]> <chr [1]>
## 2 a9301beb-e9fa-42fe-b… A human cel… Homo sapiens <chr [1]> <chr [14]>
## 3 902dc043-7091-445c-9… A human cel… Homo sapiens <chr [1]> <chr [1]>
## 4 2fe3c60b-ac1a-4c61-9… A human fet… Homo sapiens <chr [2]> <chr [2]>
## # ℹ 9 more variables: specimenOrganPart <list>, selectedCellType <lgl>,
## # libraryConstructionApproach <list>, nucleicAcidSource <list>,
## # pairedEnd <lgl>, workflow <list>, specimenDisease <chr>,
## # donorDisease <chr>, developmentStage <list>
projectId <-
    heart_projects |>
    filter(
        startsWith(
            projectTitle,
            "Cells of the adult human"
        )
    ) |>
    dplyr::pull(projectId)
result <- projects_detail(uuid = projectId)
The result is a list containing three elements: information for navigating to the next or previous project (pagination; alphabetical by default), the available filters (termFacets), and details of the project (hits).
As mentioned above, the hits are a complicated list-of-lists structure. A very convenient way to explore this structure visually is with listviewer::jsonedit(result). Selecting individual elements is possible using the lol interface; an alternative is cellxgenedp::jmespath().
lol(result)
## # class: lol
## # number of distinct paths: 687
## # total number of elements: 48234
## # number of leaf paths: 405
## # number of leaf elements: 31961
## # lol_path():
## # A tibble: 687 × 3
## path n is_leaf
## <chr> <int> <lgl>
## 1 hits 1 FALSE
## 2 hits[*] 10 FALSE
## 3 hits[*].cellLines 10 FALSE
## 4 hits[*].cellSuspensions 10 FALSE
## 5 hits[*].cellSuspensions[*] 12 FALSE
## 6 hits[*].cellSuspensions[*].organ 12 FALSE
## 7 hits[*].cellSuspensions[*].organPart 12 FALSE
## 8 hits[*].cellSuspensions[*].organPart[*] 14 TRUE
## 9 hits[*].cellSuspensions[*].organ[*] 12 TRUE
## 10 hits[*].cellSuspensions[*].selectedCellType 12 FALSE
## # ℹ 677 more rows
See the accompanying “Human Cell Atlas Manifests” vignette for details on using the manifest endpoint and further annotation of .loom files.
sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] httr_1.4.7 hca_1.15.0
## [3] LoomExperiment_1.25.0 BiocIO_1.17.1
## [5] rhdf5_2.51.0 SingleCellExperiment_1.29.1
## [7] SummarizedExperiment_1.37.0 Biobase_2.67.0
## [9] GenomicRanges_1.59.1 GenomeInfoDb_1.43.2
## [11] IRanges_2.41.1 S4Vectors_0.45.2
## [13] BiocGenerics_0.53.3 generics_0.1.3
## [15] MatrixGenerics_1.19.0 matrixStats_1.4.1
## [17] dplyr_1.1.4 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 blob_1.2.4 filelock_1.0.3
## [4] fastmap_1.2.0 BiocFileCache_2.15.0 promises_1.3.2
## [7] digest_0.6.37 mime_0.12 lifecycle_1.0.4
## [10] RSQLite_2.3.8 magrittr_2.0.3 compiler_4.4.2
## [13] rlang_1.1.4 sass_0.4.9 tools_4.4.2
## [16] utf8_1.2.4 yaml_2.3.10 knitr_1.49
## [19] S4Arrays_1.7.1 htmlwidgets_1.6.4 bit_4.5.0
## [22] curl_6.0.1 DelayedArray_0.33.2 abind_1.4-8
## [25] miniUI_0.1.1.1 HDF5Array_1.35.1 withr_3.0.2
## [28] purrr_1.0.2 sys_3.4.3 grid_4.4.2
## [31] fansi_1.0.6 xtable_1.8-4 Rhdf5lib_1.29.0
## [34] cli_3.6.3 rmarkdown_2.29 crayon_1.5.3
## [37] tzdb_0.4.0 DBI_1.2.3 cachem_1.1.0
## [40] stringr_1.5.1 zlibbioc_1.52.0 parallel_4.4.2
## [43] BiocManager_1.30.25 XVector_0.47.0 vctrs_0.6.5
## [46] Matrix_1.7-1 jsonlite_1.8.9 hms_1.1.3
## [49] bit64_4.5.2 maketools_1.3.1 jquerylib_0.1.4
## [52] tidyr_1.3.1 glue_1.8.0 DT_0.33
## [55] stringi_1.8.4 later_1.4.1 UCSC.utils_1.3.0
## [58] tibble_3.2.1 pillar_1.9.0 htmltools_0.5.8.1
## [61] rhdf5filters_1.19.0 GenomeInfoDbData_1.2.13 R6_2.5.1
## [64] dbplyr_2.5.0 vroom_1.6.5 evaluate_1.0.1
## [67] shiny_1.9.1 lattice_0.22-6 readr_2.1.5
## [70] memoise_2.0.1 httpuv_1.6.15 bslib_0.8.0
## [73] Rcpp_1.0.13-1 SparseArray_1.7.2 xfun_0.49
## [76] buildtools_1.0.0 pkgconfig_2.0.3