---
title: "Explore Human BioMolecular Atlas Program Data Portal"
author:
  - name: "Christine Hou"
    affiliation: Department of Biostatistics, Johns Hopkins University
    email: chris2018hou@gmail.com
output:
  BiocStyle::html_document
package: "HuBMAPR"
vignette: |
  %\VignetteIndexEntry{Explore Human BioMolecular Atlas Program Data Portal}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
  %\VignetteKeywords{Software, SingleCell, DataImport, ThirdPartyClient,
  Spatial, Infrastructure}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
```

# Overview

The HuBMAP data portal (<https://portal.hubmapconsortium.org/>) provides an
open, global bio-molecular atlas of the human body at the cellular level. The
`HuBMAPR` package provides an alternative interface for exploring the data in R.

The HuBMAP Consortium offers several
[APIs](https://docs.hubmapconsortium.org/apis.html). 
The `HuBMAPR` package integrates three of them:  
[Search API](https://smart-api.info/ui/7aaf02b838022d564da776b03f357158),
[Entity API](https://smart-api.info/ui/0065e419668f3336a40d1f5ab89c6ba3), and
[Ontology API](https://smart-api.info/ui/d10ff85265d8b749fbe3ad7b51d0bf0a).
Each API serves a distinct purpose and has its own query capabilities. Using
the `httr2` and `rjsoncons` packages, `HuBMAPR` builds, modifies, and executes
requests against these APIs and returns the responses as a tibble or a
character vector. The `HuBMAPR` functions tidy these outputs further so that
the final results are easier to read.
The **Search API** is the main route for finding data and builds on the
[Elasticsearch API](https://www.elastic.co/guide/en/elasticsearch/). 
The **Entity API** is used in the `bulk_data_transfer()` function to retrieve
Globus URLs, while the **Ontology API** is used in the `organ()` function. The
functions in the `HuBMAPR` package are designed to reflect the information
shown on the HuBMAP Data Portal as closely as possible.

HuBMAP data uses three different 
[identifiers](https://docs.hubmapconsortium.org/apis): the HuBMAP ID, the
Universally Unique Identifier (UUID), and the Digital Object Identifier (DOI).
The `HuBMAPR` package reports two of them in retrieved results: the UUID, a
32-digit hexadecimal number, and the more human-readable HuBMAP ID. Because it
is precise and well suited to software implementation and data storage, the
UUID serves as the primary identifier for retrieving data across functions,
and each UUID maps uniquely to its corresponding HuBMAP ID. Function names
follow a systematic pattern: the entity category comes first as a prefix,
followed by a concise description of the specific functionality. Most
functions are grouped by entity category, which makes it easier to select the
appropriate function for retrieving the desired information about a given UUID.
The structure of these functions is largely consistent across entity
categories, with some exceptions for collections and publications. 
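
The UUID format described above can be checked before a string is passed to a
function that expects a UUID. The chunk below is a minimal sketch:
`is_uuid_like()` is a hypothetical helper, not part of `HuBMAPR`, and it only
tests the shape of the string (32 hexadecimal digits), not whether the UUID
exists in HuBMAP.

```{r 'uuid format sketch'}
# Hypothetical helper (not part of HuBMAPR): TRUE when `x` consists of
# exactly 32 hexadecimal digits, the UUID shape described above.
is_uuid_like <- function(x) {
    grepl("^[0-9a-f]{32}$", x, ignore.case = TRUE)
}

uuid_check <- is_uuid_like(c(
    "993bb1d6fa02e2755fd69613bb9d6e08", # a UUID used later in this vignette
    "HBM123.ABCD.456"                   # made-up string, not a UUID
))
uuid_check
```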

# Installation

`HuBMAPR` is an R package. Install the release version with:

```{r 'install bioc', eval=FALSE}
if (!requireNamespace("BiocManager")) {
    install.packages("BiocManager")
}
BiocManager::install("HuBMAPR")
```

Install the development version from
[GitHub](https://christinehou11.github.io/HuBMAPR):

```{r 'install dev', eval = FALSE}
remotes::install_github("christinehou11/HuBMAPR")
```

# Basic User Guide

## Load Necessary Packages

Load the additional packages. The `dplyr` package is used throughout this
vignette for data wrangling and for extracting specific information.

```{r 'library', message=FALSE, warning=FALSE}
library("dplyr")
library("tidyr")
library("ggplot2")
library("HuBMAPR")
```

## Data Discovery

The HuBMAP data portal displays five categories of entity data: **Dataset,
Sample, Donor, Publication, and Collection**. Entries are ordered by
last-modified date and time. Use the corresponding functions to explore each
category.

```{r 'datasets'}
datasets_df <- datasets()
datasets_df
```

```{r 'plot', echo=FALSE, warning=FALSE, message=FALSE}
datasets_sub <- datasets_df |>
    select(organ, dataset_type) |>
    group_by(organ) |>
    mutate(count = n()) |>
    filter(!is.na(organ))

plot1 <- ggplot(datasets_sub,
                aes(y = reorder(organ, count), fill = dataset_type)) + 
    geom_bar() + 
    labs(x = NULL, y = NULL, fill = "Assay Type") + 
    theme_minimal() +
    theme(
        panel.grid.major.y = element_blank(),
        panel.grid.minor = element_blank(),
        axis.text.y = element_text(size = 9),
        axis.text.x = element_text(size = 9),
        legend.position = "right",
        legend.title = element_text(size = 9),
        legend.text = element_text(size = 7), 
        panel.background = element_rect(fill = "white", color = NA),
        plot.background = element_rect(fill = "white", color = NA)) +
    guides(fill = guide_legend(ncol = 2))

plot1
```

The default tibble produced by each entity function contains only a selection
of the available columns. To see the names of the selected columns, use the
following commands for each entity category. Set the `as` parameter to
`"character"` or `"tibble"` to choose the output format.


```{r 'cols'}
# as = "tibble" (default)
datasets_col_tbl <- datasets_default_columns(as = "tibble")
datasets_col_tbl

# as = "character"
datasets_col_char <- datasets_default_columns(as = "character")
datasets_col_char
```

`samples_default_columns()`, `donors_default_columns()`,
`collections_default_columns()`, and `publications_default_columns()` work the
same way.

A brief overview of selected information for five entity categories is:

```{r 'summary cols'}
tbl <- bind_cols(
    dataset = datasets_default_columns(as = "character"),
    sample = c(samples_default_columns(as = "character"), rep(NA, 7)),
    donor = c(donors_default_columns(as = "character"), rep(NA, 6)),
    collection = c(collections_default_columns(as = "character"),
                    rep(NA, 10)),
    publication = c(publications_default_columns(as = "character"),
                    rep(NA, 7))
)

tbl
```
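
The padding counts in the chunk above are written by hand, so they need
updating whenever the number of default columns changes. As a sketch of one
alternative, the lengths can be computed instead. The column names below are
toy stand-ins for the `*_default_columns()` results, and `cols`, `n_max`, and
`padded` are illustrative names.

```{r 'pad columns sketch'}
# Toy character vectors standing in for the *_default_columns() results
cols <- list(
    dataset = c("uuid", "hubmap_id", "organ"),
    sample = c("uuid", "hubmap_id"),
    donor = "uuid"
)

# Pad every vector with NA up to the length of the longest one
n_max <- max(lengths(cols))
padded <- lapply(
    cols,
    function(x) c(x, rep(NA_character_, n_max - length(x)))
)

padded_tbl <- tibble::as_tibble(padded)
padded_tbl
```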

Use `organ()` to list the organs available in HuBMAP. The result is helpful
when filtering retrieved data by organ.

```{r 'organs'}
organs <- organ()
organs
```

Standard data wrangling and filtering can then be applied to the retrieved
tibbles to extract the information of interest.

```{r 'datasets filter'}
# Example from datasets()
datasets_df |>
    filter(organ == 'Small Intestine') |>
    count()
```

Every dataset, sample, donor, collection, and publication has its own 
**HuBMAP ID** and **UUID**. The **UUID** is the identifier used by most
functions to retrieve specific details.

The **donor_hubmap_id** column is included in the tibbles returned by
`samples()` and `datasets()` and can be used to join them to the donor tibble.

```{r 'derived using left_join'}
donors_df <- donors()
donor_sub <- donors_df |>
    filter(Sex == "Female",
            Age <= 76 & Age >= 55,
            Race == "White",
            `Body Mass Index` <= 25,
            last_modified_timestamp >= "2020-01-08" &
            last_modified_timestamp <= "2020-06-30") |>
    head(1)

# Datasets
donor_sub_dataset <- donor_sub |>
    left_join(datasets_df |>
                select(-c(group_name, last_modified_timestamp)) |>
                rename("dataset_uuid" = "uuid",
                        "dataset_hubmap_id" = "hubmap_id"),
                by = c("hubmap_id" = "donor_hubmap_id"))

donor_sub_dataset

# Samples
samples_df <- samples()
donor_sub_sample <- donor_sub |>
    left_join(samples_df |>
                select(-c(group_name, last_modified_timestamp)) |>
                rename("sample_uuid" = "uuid",
                        "sample_hubmap_id" = "hubmap_id"),
                by = c("hubmap_id" = "donor_hubmap_id"))

donor_sub_sample
```
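
The shape of the join above may be easier to see on a small example. The
sketch below uses made-up rows (the identifiers are not real HuBMAP IDs or
UUIDs) to show that a donor with several datasets yields one output row per
matching dataset, and that datasets from other donors are dropped.

```{r 'left join sketch'}
library("dplyr")

# Made-up rows, only to show the shape of the join
toy_donor <- tibble(hubmap_id = "DONOR-1", Sex = "Female")
toy_datasets <- tibble(
    uuid = c("aaa", "bbb", "ccc"),
    donor_hubmap_id = c("DONOR-1", "DONOR-1", "DONOR-2")
)

# Rename the dataset identifier first so it stays distinguishable after joining
toy_joined <- toy_donor |>
    left_join(toy_datasets |> rename("dataset_uuid" = "uuid"),
                by = c("hubmap_id" = "donor_hubmap_id"))

toy_joined
```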

Use `*_detail(uuid)` to retrieve all available information for any entry of
any entity category, given its **UUID**. Use `select()` and the `unnest_*()`
functions to expand list-columns. `glimpse()` is convenient for viewing tables
that have many columns but only one row.

```{r '*_detail()'}
dataset_uuid <- datasets_df |>
    filter(dataset_type == "Auto-fluorescence",
            organ == "Kidney (Right)") |>
    head(1) |>
    pull(uuid)

# Full Information
dataset_detail(dataset_uuid) |> glimpse()

# Specific Information
dataset_detail(uuid = dataset_uuid) |>
    select(contributors) |>
    unnest_longer(contributors) |>
    unnest_wider(everything())
```
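
To illustrate what the two unnesting steps do, the sketch below applies them to
a made-up one-row tibble whose `contributors` list-column holds two records.
The field names and values are invented for illustration and are not the
actual fields returned by `dataset_detail()`.

```{r 'unnest sketch'}
library("dplyr")
library("tidyr")

# Made-up one-row tibble with a list-column of contributor records
toy_detail <- tibble(
    uuid = "toy-uuid",
    contributors = list(list(
        list(name = "A. Author", affiliation = "Lab 1"),
        list(name = "B. Author", affiliation = "Lab 2")
    ))
)

toy_contributors <- toy_detail |>
    select(contributors) |>
    unnest_longer(contributors) |> # one row per contributor record
    unnest_wider(contributors)     # one column per field of each record

toy_contributors
```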

`sample_detail()`, `donor_detail()`, `collection_detail()`, and
`publication_detail()` work the same way.

## Metadata

To retrieve the metadata for a **Dataset**, **Sample**, or **Donor**, use
`dataset_metadata()`, `sample_metadata()`, or `donor_metadata()`.

```{r 'metadata'}
dataset_metadata("993bb1d6fa02e2755fd69613bb9d6e08")

sample_metadata("8ecdbdc3e2d04898e2563d666658b6a9")

donor_metadata("b2c75c96558c18c9e13ba31629f541b6")
```

## Derived Data

Some datasets in the **Dataset** entity have derived (support) datasets. Use
`dataset_derived()` to retrieve them. If the given dataset has a support
dataset, a tibble with selected details is returned; otherwise, nothing is
returned.

```{r 'dataset derived'}
# no derived/support dataset
dataset_uuid_1 <- "3acdb3ed962b2087fbe325514b098101"

dataset_derived(uuid = dataset_uuid_1)

# has derived/support dataset
dataset_uuid_2 <- "baf976734dd652208d13134bc5c4594b"

dataset_derived(uuid = dataset_uuid_2) |> glimpse()
```
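
When looping over many UUIDs, it can help to test whether a call returned
anything before processing the result. The sketch below assumes that an empty
result is either `NULL` or a zero-row tibble; this vignette does not state
which, so the check covers both. `has_rows()` is a hypothetical helper, not
part of `HuBMAPR`.

```{r 'empty result sketch'}
# Hypothetical helper (not part of HuBMAPR): TRUE when a result has rows.
# NROW(NULL) is 0, so NULL and zero-row data frames are treated alike.
has_rows <- function(x) NROW(x) > 0

has_rows(NULL)
has_rows(data.frame())
has_rows(data.frame(uuid = "toy-uuid"))
```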

**Sample** and **Donor** entries have derived samples and datasets. In the
`HuBMAPR` package, `sample_derived()` and `donor_derived()` return the datasets
and samples derived from one sample, given its sample UUID, or from one donor,
given its donor UUID. Set the `entity_type` parameter to retrieve the derived
`Dataset` or `Sample` entries.

```{r 'derived using sample_derived'}
sample_uuid <- samples_df |>
    filter(last_modified_timestamp >= "2023-01-01" &
            last_modified_timestamp <= "2023-10-01",
            organ == "Kidney (Left)") |>
    head(1) |>
    pull(uuid)

sample_uuid

# Derived Datasets
sample_derived(uuid = sample_uuid, entity_type = "Dataset")

# Derived Samples
sample_derived(uuid = sample_uuid, entity_type = "Sample")
```

`donor_derived()` works the same way.

## Provenance Data

For individual entries from the **Dataset** and **Sample** entities,
`uuid_provenance()` retrieves the provenance of the entry as a list of
characters (UUID, HuBMAP ID, and entity type), ordered from the most recent
ancestor to the furthest ancestor. A Donor UUID has no ancestors, so an empty
list is returned.

```{r 'provenance'}
# dataset provenance
dataset_uuid <- "3e4c568d9ce8df9d73b8cddcf8d0fec3"
uuid_provenance(dataset_uuid)

# sample provenance
sample_uuid <- "35e16f13caab262f446836f63cf4ad42"
uuid_provenance(sample_uuid)

# donor provenance
donor_uuid <- "0abacde2443881351ff6e9930a706c83"
uuid_provenance(donor_uuid)
```

## Related Data

Each **Collection** has related datasets. Use `collection_data()` to retrieve
them.

```{r 'collection datasets'}
collections_df <- collections()
collection_uuid <- collections_df |>
    filter(last_modified_timestamp >= "2023-01-01") |>
    head(1) |>
    pull(uuid)

collection_data(collection_uuid)
```

Each **Publication** has related datasets, samples, and donors. Use
`publication_data()` to retrieve them, setting the `entity_type` parameter to
select the related `Dataset` or `Sample` entries.

```{r 'publication data'}
publications_df <- publications()
publication_uuid <- publications_df |>
    filter(publication_venue == "Nature") |>
    head(1) |>
    pull(uuid)

publication_data(publication_uuid, entity_type = "Dataset")

publication_data(publication_uuid, entity_type = "Sample")
```

## Additional Information

To read the textual description of one **Collection** or **Publication**, use
`collection_information()` or `publication_information()` respectively.

```{r 'information'}
collection_information(uuid = collection_uuid)

publication_information(uuid = publication_uuid)
```

Additional contact, author, and contributor information can be retrieved using
`dataset_contributors()` for the **Dataset** entity, `collection_contacts()`
and `collection_contributors()` for the **Collection** entity, and
`publication_authors()` for the **Publication** entity.

```{r 'author'}
# Dataset
dataset_contributors(uuid = dataset_uuid)

# Collection
collection_contacts(uuid = collection_uuid)

collection_contributors(uuid = collection_uuid)

# Publication
publication_authors(uuid = publication_uuid)
```

# File Download

Each dataset has corresponding data files. Most datasets' files are available
on HuBMAP Globus at a corresponding URL. Some datasets' files are not
available via Globus but can be accessed via dbGaP (database of Genotypes and
Phenotypes) and/or SRA (Sequence Read Archive). The files of the remaining
datasets are not available on any authorized platform.

Each dataset available on Globus offers a different set of data-related files
to preview and download, including but not limited to images, metadata files,
downstream analysis reports, and raw data products. 

Use `bulk_data_transfer()` to find out whether a dataset's files are openly
accessible or restricted. If the files are accessible, the function opens the
Globus page in your web browser; otherwise, it prints an error message with
additional instructions, either providing the dbGaP/SRA URLs or noting that
the files are unavailable. 

```{r 'bulk data transfer', eval=FALSE}
uuid_globus <- "d1dcab2df80590d8cd8770948abaf976"

bulk_data_transfer(uuid_globus)

uuid_dbGAP_SRA <- "d926c41ac08f3c2ba5e61eec83e90b0c"

bulk_data_transfer(uuid_dbGAP_SRA)

uuid_not_avail <- "0eb5e457b4855ce28531bc97147196b6"

bulk_data_transfer(uuid_not_avail)
```
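
In a script that checks many datasets, a single unavailable dataset should not
stop the whole run. The sketch below shows the general `tryCatch()` pattern
with a stand-in function. It assumes the failure is signalled as an R error;
this vignette only says that error messages are printed, so check the actual
behaviour of `bulk_data_transfer()` before relying on it. `toy_transfer()` is
hypothetical and is not the real function.

```{r 'transfer error sketch'}
# Stand-in for a call that fails; NOT the real bulk_data_transfer()
toy_transfer <- function(uuid) {
    stop("files for ", uuid, " are not available")
}

# Report the problem and carry on with NULL instead of stopping
transfer_result <- tryCatch(
    toy_transfer("toy-uuid"),
    error = function(e) {
        message("Skipping: ", conditionMessage(e))
        NULL
    }
)

is.null(transfer_result)
```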

You can download data files from the Globus webpage by selecting the desired
file and clicking the download button. You can also preview and download data
files using the [rglobus](https://mtmorgan.github.io/rglobus/) package. Follow
the instructions 
[here](https://mtmorgan.github.io/rglobus/articles/a_get_started.html).

# `R` session information {.unnumbered}

```{r 'sessionInfo', echo=FALSE}
## Session info
options(width = 120)
sessionInfo()
```