--- title: "Explore Human BioMolecular Atlas Program Data Portal" author: - name: "Christine Hou" affiliation: Department of Biostatistics, Johns Hopkins University email: chris2018hou@gmail.com output: BiocStyle::html_document package: "HuBMAPR" vignette: | %\VignetteIndexEntry{Accessing Human Cell Atlas Data} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} %\VignetteKeywords{Software, SingleCell, DataImport, ThirdPartyClient, Spatial, Infrastructure} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` # Overview 'HuBMAP' data portal () provides an open, global bio-molecular atlas of the human body at the cellular level. `HuBMAPR` package provides an alternative interface to explore the data via R. The HuBMAP Consortium offers several [APIs](https://docs.hubmapconsortium.org/apis.html). To achieve the main objectives, `HuBMAPR` package specifically integrates three APIs: [Search API](https://smart-api.info/ui/7aaf02b838022d564da776b03f357158), [Entity API](https://smart-api.info/ui/0065e419668f3336a40d1f5ab89c6ba3), and [Ontology API](https://smart-api.info/ui/d10ff85265d8b749fbe3ad7b51d0bf0a). Each API serves a distinct purpose with unique query capabilities, tailored to meet various needs. Utilizing the `httr2` and `rjsoncons` packages, `HuBMAPR` effectively manages, modifies, and executes multiple requests via these APIs, presenting responses in formats such as tibble or character. These outputs are further modified for clarity in the final results from the `HuBMAPR` functions. The **Search API** is primarily searching relevant data information and is referenced to the [Elasticsearch API](https://www.elastic.co/guide/en/elasticsearch/). The **Entity API** is specifically utilized in the `bulk_data_transfer()` function for Globus URL retrieval, while the **Ontology API** is applied in the `organ()` function. Referencing to HuBMAP Data Portal, the functions in `HuBAMPR` package reflects the data information of HuBMAP as much as possible. HuBMAP Data incorporates three different [identifiers](https://docs.hubmapconsortium.org/apis): HuBMAP ID, Universally Unique Identifier (UUID), and Digital Object Identifiers (DOI). The `HuBMAPR` package utilizes the UUID - a 32-digit hexadecimal number - and the more human-readable HuBMAP ID as two common identifiers in the retrieved results. Considering precision and compatibility with software implementation and data storage, UUID serves as the primary identifier to retrieve data across various functions, with the UUID mapping uniquely to its corresponding HuBMAP ID. The systematic nomenclature is adopted for functions in the package by appending the entity category prefix to the concise description of the specific functionality. Most of the functions are grouped by entity categories, thereby simplifying the process of selecting the appropriate functions to retrieve the desired information associated with the given UUID from the specific entity category. The structure of these functions is heavily consistent across all entity categories with some exceptions for collection and publication. # Installation `HuBMAPR` is a R package. The package can be installed by ```{r 'install bioc', eval=FALSE} if (!requireNamespace("BiocManager")) { install.packages("BiocManager") } BiocManager::install("HuBMAPR") ``` Install development version from [GitHub](https://christinehou11.github.io/HuBMAPR): ```{r 'install dev', eval = FALSE} remotes::install("christinehou11/HuBMAPR") ``` # Basic User Guide ## Load Necessary Packages Load additional packages. `dplyr` package is widely used in this vignettes to conduct data wrangling and specific information extraction. ```{r 'library', message=FALSE, warning=FALSE} library("dplyr") library("tidyr") library("ggplot2") library("HuBMAPR") ``` ## Data Discovery `HuBMAP` data portal page displays five categories of entity data, which are **Dataset, Sample, Donor, Publication, and Collection**, chronologically (last modified date time). Using corresponding functions to explore entity data. ```{r 'datasets'} datasets_df <- datasets() datasets_df ``` ```{r 'plot', echo=FALSE, warning=FALSE, message=FALSE} datasets_sub <- datasets_df |> select(organ, dataset_type) |> group_by(organ) |> mutate(count = n()) |> filter(!is.na(organ)) plot1 <- ggplot(datasets_sub, aes(y = reorder(organ, count), fill = dataset_type)) + geom_histogram(stat = "count") + labs(x = NULL, y = NULL, fill = "Assay Type") + theme_minimal() + theme( panel.grid.major.y = element_blank(), panel.grid.minor = element_blank(), axis.text.y = element_text(size = 9), axis.text.x = element_text(size = 9), legend.position = "right", legend.title = element_text(size = 9), legend.text = element_text(size = 7), panel.background = element_rect(fill = "white", color = NA), plot.background = element_rect(fill = "white", color = NA)) + guides(fill = guide_legend(ncol = 2)) plot1 ``` The default tibble produced by corresponding entity function only reflects selected information. To see the names of selected information, use following commands for each entity category. Specify `as` parameter to display information in the format of `"character"` or `"tibble"`. ```{r 'cols'} # as = "tibble" (default) datasets_col_tbl <- datasets_default_columns(as = "tibble") datasets_col_tbl # as = "character" datasets_col_char <- datasets_default_columns(as = "character") datasets_col_char ``` `samples_default_columns()`, `donors_default_columns()`, `collections_default_columns()`, and `publications_default_columns()` work same as above. A brief overview of selected information for five entity categories is: ```{r 'summary cols'} tbl <- bind_cols( dataset = datasets_default_columns(as = "character"), sample = c(samples_default_columns(as = "character"), rep(NA, 7)), donor = c(donors_default_columns(as = "character"), rep(NA, 6)), collection = c(collections_default_columns(as = "character"), rep(NA, 10)), publication = c(publications_default_columns(as = "character"), rep(NA, 7)) ) tbl ``` Use `organ()` to read through the available organs included in `HuBMAP`. It can be helpful to filter retrieved data based on organ information. ```{r 'organs'} organs <- organ() organs ``` Data wrangling and filter are welcome to retrieve data based on interested information. ```{r 'datasets filter'} # Example from datasets() datasets_df |> filter(organ == 'Small Intestine') |> count() ``` Any dataset, sample, donor, collection, and publication has special **HuBMAP ID** and **UUID**, and **UUID** is the main ID to be used in most of functions for specific detail retrievals. The column of **donor_hubmap_id** is included in the retrieved tibbles from `samples()` and `datasets()`, which can help to join the tibble. ```{r 'derived using left_join'} donors_df <- donors() donor_sub <- donors_df |> filter(Sex == "Female", Age <= 76 & Age >= 55, Race == "White", `Body Mass Index` <= 25, last_modified_timestamp >= "2020-01-08" & last_modified_timestamp <= "2020-06-30") |> head(1) # Datasets donor_sub_dataset <- donor_sub |> left_join(datasets_df |> select(-c(group_name, last_modified_timestamp)) |> rename("dataset_uuid" = "uuid", "dataset_hubmap_id" = "hubmap_id"), by = c("hubmap_id" = "donor_hubmap_id")) donor_sub_dataset # Samples samples_df <- samples() donor_sub_sample <- donor_sub |> left_join(samples_df |> select(-c(group_name, last_modified_timestamp)) |> rename("sample_uuid" = "uuid", "sample_hubmap_id" = "hubmap_id"), by = c("hubmap_id" = "donor_hubmap_id")) donor_sub_sample ``` You can use `*_detail(uuid)` to retrieve all available information for any entry of any entity category given **UUID**. Use `select()` and `unnest_*()` functions to expand list-columns. It will be convenient to view tables with multiple columns but one row using `glimpse()`. ```{r '*_detail()'} dataset_uuid <- datasets_df |> filter(dataset_type == "Auto-fluorescence", organ == "Kidney (Right)") |> head(1) |> pull(uuid) # Full Information dataset_detail(dataset_uuid) |> glimpse() # Specific Information dataset_detail(uuid = dataset_uuid) |> select(contributors) |> unnest_longer(contributors) |> unnest_wider(everything()) ``` `sample_detail()`, `donor_detail()`, `collection_detail()`, and `publication_detail()` work same as above. ## Metadata To retrieve the metadata for **Dataset**, **Sample**, and **Donor** metadata, use `dataset_metadata()`, `sample_metadata()`, and `donor_metadata()`. ```{r 'metadata'} dataset_metadata("993bb1d6fa02e2755fd69613bb9d6e08") sample_metadata("8ecdbdc3e2d04898e2563d666658b6a9") donor_metadata("b2c75c96558c18c9e13ba31629f541b6") ``` ## Derived Data Some datasets from **Dataset** entity has derived (support) dataset(s). Use `dataset_derived()` to retrieve. A tibble with selected details will be retrieved as if the given dataset has support dataset; otherwise, nothing returns. ```{r 'dataset derived'} # no derived/support dataset dataset_uuid_1 <- "3acdb3ed962b2087fbe325514b098101" dataset_derived(uuid = dataset_uuid_1) # has derived/support dataset dataset_uuid_2 <- "baf976734dd652208d13134bc5c4594b" dataset_derived(uuid = dataset_uuid_2) |> glimpse() ``` **Sample** and **Donor** have derived samples and datasets. In `HuBAMPR` package, `sample_derived()` and `donor_derived()` functions are available to use to see the derived datasets and samples from one sample given sample UUID or one donor given donor UUID. Specify `entity_type` parameter to retrieve derived `Dataset` or `Sample`. ```{r 'derived using sample_derived'} sample_uuid <- samples_df |> filter(last_modified_timestamp >= "2023-01-01" & last_modified_timestamp <= "2023-10-01", organ == "Kidney (Left)") |> head(1) |> pull(uuid) sample_uuid # Derived Datasets sample_derived(uuid = sample_uuid, entity_type = "Dataset") # Derived Samples sample_derived(uuid = sample_uuid, entity_type = "Sample") ``` `donor_derived()` works same as above. ## Provenance Data For individual entries from **Dataset** and **Sample** entities, `uuid_provenance()` helps to retrieve the provenance of the entry as a list of characters (UUID, HuBMAP ID, and entity type) from the most recent ancestor to the furthest ancestor. There is no ancestor for Donor UUID, and an empty list will be returned. ```{r 'provenance'} # dataset provenance dataset_uuid <- "3e4c568d9ce8df9d73b8cddcf8d0fec3" uuid_provenance(dataset_uuid) # sample provenance sample_uuid <- "35e16f13caab262f446836f63cf4ad42" uuid_provenance(sample_uuid) # donor provenance donor_uuid <- "0abacde2443881351ff6e9930a706c83" uuid_provenance(donor_uuid) ``` ## Related Data Each **Collection** has related datasets, and use `collection_data()` to retrieve. ```{r 'collection datasets'} collections_df <- collections() collection_uuid <- collections_df |> filter(last_modified_timestamp >= "2023-01-01") |> head(1) |> pull(uuid) collection_data(collection_uuid) ``` Each publication has related datasets, samples, and donors, and use `publication_data()` to see, while specifying `entity_type` parameter to retrieve derived `Dataset` or `Sample`. ```{r 'publication data'} publications_df <- publications() publication_uuid <- publications_df |> filter(publication_venue == "Nature") |> head(1) |> pull(uuid) publication_data(publication_uuid, entity_type = "Dataset") publication_data(publication_uuid, entity_type = "Sample") ``` ## Additional Information To read the textual description of one **Collection** or **Publication**, use `collection_information()` or `publication_information()` respectively. ```{r 'information'} collection_information(uuid = collection_uuid) publication_information(uuid = publication_uuid) ``` Some additional contact/author/contributor information can be retrieved using `dataset_contributor()` for **Dataset** entity, `collection_contact()` and `collection_contributors()` for **Collection** entity, or `publication_authors()` for **Publication** entity. ```{r 'author'} # Dataset dataset_contributors(uuid = dataset_uuid) # Collection collection_contacts(uuid = collection_uuid) collection_contributors(uuid = collection_uuid) # Publication publication_authors(uuid = publication_uuid) ``` # File Download For each dataset, there are corresponding data files. Most of the datasets' files are available on HuBMAP Globus with corresponding URL. Some of the datasets' files are not available via Globus, but can be accessed via dbGAP (database of Genotypes and Phenotypes) and/or SRA (Sequence Read Archive). But some of the datasets' files are not available in any authorized platform. Each dataset available on Globus has different components of data-related files to preview and download, include but not limited to images, metadata files, downstream analysis reports, raw data products, etc. Use `bulk_data_transfer()` to know whether data files are open-accessed or restricted. The function will direct you to chrome if the files are accessible; otherwise, the error messages will be printed out with addition instructions, either providing dbGAP/SRA URLs or pointing out the unavailability. ```{r 'bulk data transfer', eval=FALSE} uuid_globus <- "d1dcab2df80590d8cd8770948abaf976" bulk_data_transfer(uuid_globus) uuid_dbGAP_SRA <- "d926c41ac08f3c2ba5e61eec83e90b0c" bulk_data_transfer(uuid_dbGAP_SRA) uuid_not_avail <- "0eb5e457b4855ce28531bc97147196b6" bulk_data_transfer(uuid_not_avail) ``` You can choose to download data files on Globus webpage by clicking download button after choosing the desired document. You can also preview and download data files using [rglobus](https://mtmorgan.github.io/rglobus/) package. Follow the instructions [here](https://mtmorgan.github.io/rglobus/articles/a_get_started.html). # `R` session information {.unnumbered} ```{r 'sessionInfo', echo=FALSE} ## Session info options(width = 120) sessionInfo() ```