barcodetrackR

Introduction

Installation

Currently, barcodetrackR is available at Github and can be downloaded using the devtools package.

devtools::install_github("dunbarlabNIH/barcodetrackR")

Contributors

The R package and functions were created by Diego A. Espinoza, Ryland D. Mortlock, Samson J. Koelle, and others at Cynthia Dunbar’s laboratory at the National Heart, Lung, and Blood Institutes of Health. Issues should be addressed to https://github.com/dunbarlabNIH/barcodetrackR/issues.

Loading data

Loading required packages

The barcodetrackR package operates on SummarizedExperiment objects from the Bioconductor repository. It stores associated colData for each sample in this object as well as any metadata. We load the barcodetrackR and SummarizedExperiment packages here for our analyses, as well as the magrittr package in order to improve legibility of code through using the pipe %>% operator.

require("magrittr")
require("barcodetrackR")
require("SummarizedExperiment")

Creating objects with create_SE

For this vignette, we will load publically available data from the following papers (these sample datasets are included in the R package):

system.file("extdata", "/WuC_etal_appdata/sample_data_ZJ31.txt", package = "barcodetrackR") %>%
  read.delim(row.names = 1) -> wu_dataframe

system.file("extdata", "/WuC_etal_appdata/sample_metadata_ZJ31.txt", package = "barcodetrackR") %>%
  read.delim() -> wu_metadata

wu_SE <- create_SE(your_data = wu_dataframe,
                   meta_data = wu_metadata,
                   threshold = 0)


system.file("extdata", "/BelderbosME_etal/count_matrix_mouse_C21.txt", package = "barcodetrackR") %>%
  read.delim(row.names = 1) -> belderbos_dataframe

system.file("extdata", "/BelderbosME_etal/metadata_mouse_C21.txt", package = "barcodetrackR") %>%
  read.delim() -> belderbos_metadata

belderbos_metadata$weeks <- factor(belderbos_metadata$weeks, levels = c("9", "14", "20", "22", "sac"))

belderbos_SE <- create_SE(your_data = belderbos_dataframe,
                          meta_data = belderbos_metadata,
                          threshold = 0)

system.file("extdata", "/SixE_etal/WAS5_reads.txt", package = "barcodetrackR") %>%
  read.delim(row.names = 1) -> six_dataframe

system.file("extdata", "/SixE_etal/WAS5_metadata.txt", package = "barcodetrackR") %>%
  read.delim() -> six_metadata

six_SE <- create_SE(your_data = six_dataframe,
                    meta_data = six_metadata,
                    threshold = 0)
## No threshold supplied. All barcodes will be retained. Be aware that lower abundance barcodes are likely to be less reliable due to sampling bias. To estimate an appropriate threshold, please see the barcodetrackR function `estimate_barcode_threshold`. 
## No threshold supplied. All barcodes will be retained. Be aware that lower abundance barcodes are likely to be less reliable due to sampling bias. To estimate an appropriate threshold, please see the barcodetrackR function `estimate_barcode_threshold`. 
## No threshold supplied. All barcodes will be retained. Be aware that lower abundance barcodes are likely to be less reliable due to sampling bias. To estimate an appropriate threshold, please see the barcodetrackR function `estimate_barcode_threshold`.

Our input dataframes to create the SummarizedExperiment (SE) objects are each an n x m data.frame where there are n rows of observations (typically cellular barcodes, insertion sites, or the like) and the m columns are the samples. The input metadata must have row order identical to the order of the colums in its corresponding dataframe. The metadata must also have a column titled SAMPLENAME that denotes the column of your_data it refers to.

Assays created by create_SE

create_SE takes the input dataframe and metadata and creates a SummarizedExperiment object with the following assays:

  • counts: the raw values from the input dataframe
  • proportions: the per-column proportions of each entry in each column
  • ranks: the rank of each entry in each column
  • normalized: the normalized read values (CPM)
  • logs: the log of the normalized values
assays(six_SE)
## List of length 5
## names(5): counts proportions ranks normalized logs

Thresholding barcoding data

We also include a function to help users estimate the minimum abundance of reliable barcodes. For a specified capture efficiency C, the minimum clone size N that we can expect to detect with confidence level P is calculated from:

P = 1 − (1 − C)N

The proportional abundance of a clonal tag of size N is:

N/(T * F)

where T is the total population size of cells or genomes and F is the frequency or proportion of the total population which is labeled or genetically modified with the clonal tag.

The population size and proportion labeled must be determined experimentally. The capture efficiency should be estimated for a given clonal tracking technique by simulating the barcode retrieval process in silico and finding the capture efficiency which leads to a total # of detected barcodes matching the experimentally determined number. Adair et al (PMID: 32355868) performed this analysis for viral integration site analysis and DNA barcode sequencing and determined good estimates for the capture efficiencies of these two technologies to be 0.05 and 0.4 respectively.

my_thresh <- estimate_barcode_threshold(capture_efficiency = 0.4,
                                        population_size = 500000,
                                        proportion_labeled = 0.3,
                                        confidence_level = 0.95,
                                        verbose = TRUE)
## Relative threshold. Barcodes above 0.003909661 % of a given sample are estimated to be reliable.

This threshold can be applied to an existing SE to remove low abundance barcodes which do not reach the thresold in any sample.

wu_thresh <- threshold_SE(your_SE = wu_SE,
                          threshold_value = 0.005,
                          threshold_type = "relative",
                          verbose = TRUE)
## Removed 35353 barcodes from the supplied dataframe based on relative threshold of 0.005

Users can also specify an absolute threshold count rather than a relative threshold by changing the threshold_type to “absolute”.

Shiny app

The barcodetrackR package includes a Shiny app for users without programming experience to analyze clonal tracking data and create high-quality visualizations. To launch the app, use the following line of code.

barcodetrackR::launchApp()

Correlations

scatter_plot

A straightforward way to view the relationship between samples in a pairwise manner is to view basic scatter plots of two samples using the provided assays. We provide a scatter_plot function as part of the package.

Here, we view the correlation of barcode “proportions” between different cell types at the 20 month timepoint of the Wu et al study. We compare granulocytes (Gr) to B and T cells.

Gr_B_20 <- c("ZJ31_20m_Gr", "ZJ31_20m_B")
Gr_T_20 <- c("ZJ31_20m_Gr", "ZJ31_20m_T")
wu_scatterplot_1 <- scatter_plot(wu_SE[,Gr_B_20], your_title = "Gr vs B", assay = "proportions")
wu_scatterplot_2 <- scatter_plot(wu_SE[,Gr_T_20], your_title = "Gr vs T", assay = "proportions")
cowplot::plot_grid(wu_scatterplot_1, wu_scatterplot_2, ncol = 2)

cor_plot

A more comprehensive way to view the relationship between samples in a pairwise manner is to use a correlation plot. Here, we visualize the Pearson correlation of the barcode “proportions” between the T, B, Gr, NK CD56+/CD16-, and NK CD56-/CD16+ fractions within the Wu dataset for the 6, 9.5, 12, and 20 month post-transplant timepoints.

wu_cor_plot_sample_selection <- colData(wu_SE)$SAMPLENAME[1:20]
cor_plot(wu_SE[,wu_cor_plot_sample_selection],
         method_corr = "pearson",
         plot_type = "color",
         assay = "proportions")

If desired, we can also calculate and visualize the Pearson correlations for the “logs” assay for the same samples above.

wu_cor_plot_sample_selection <- colData(wu_SE)$SAMPLENAME[1:20]
cor_plot(wu_SE[,wu_cor_plot_sample_selection],
         method_corr = "pearson",
         plot_type = "color",
         assay = "logs")

We can return a table of the Pearson correlations as well as the p-values and confidence intervals for each of the comparisons above. This argument "return_table" is included in all barcodetrackR functions which conduct mathematical or statistical analysis. By setting the option to “TRUE”, users can return the calculated data as a dataframe rather than display the visualization. Here, we return the p-values and confidence intervals for the correlations calculated using the “proportions” assay.

cor_plot(wu_SE[,wu_cor_plot_sample_selection],
         method_corr = "pearson",
         assay = "proportions",
         return_table = TRUE) %>% head
##       sample_i    sample_j correlation_value       p_value     ci_lo     ci_hi
## cor  ZJ31_6m_T   ZJ31_6m_T         1.0000000  0.000000e+00 1.0000000 1.0000000
## cor1 ZJ31_6m_T ZJ31_9.5m_T         0.4672994  0.000000e+00 0.4527245 0.4816246
## cor2 ZJ31_6m_T  ZJ31_12m_T         0.4406121  0.000000e+00 0.4261147 0.4548832
## cor3 ZJ31_6m_T  ZJ31_20m_T         0.3730234  0.000000e+00 0.3564373 0.3893744
## cor4 ZJ31_6m_T   ZJ31_6m_B         0.2263901 7.739959e-201 0.2122372 0.2404479
## cor5 ZJ31_6m_T ZJ31_9.5m_B         0.2346975 2.001192e-191 0.2197056 0.2495785

Above, we used two of the three available correlation visualizations ("color" and "circle") using the standard color palette provided. A no_negative parameter is offered as well to set all negative correlations in the data to equal 0. This may be done to eliminate negative correlations from the data, from which deriving biological meaning may be difficult.

When there are a smaller number of samples to analyze, the "number" option can be used to view the actual correlation within the grid. Here is an example visualizing the Pearson correlations of peripheral blood samples at subsequent timepoints from the Belderbos et al sample data set. Note: within these samples, U stands for unsorted.

belderbos_wk_samples_PB <- c("wk9_U", "wk14_U", "wk20_U", "wk22_U")
cor_plot(your_SE = belderbos_SE[,belderbos_wk_samples_PB],
         method_corr = "pearson",
         plot_type = "number",
         assay = "proportions",
         number_size = 3)

Distances and similarities

dist_plot

While correlation coefficients provide some insight into pairwise comparisons between samples, there exist a number of measures and metrics possible for pairwise sample comparisons (namely, distances and similarities). The proxy package provides a number of distance and similarity measure we can incorporate for our purposes.

summary(proxy::pr_DB)
## * Similarity measures:
## Braun-Blanquet, Chi-squared, Cramer, Dice, Fager, Faith, Gower, Hamman,
## Jaccard, Kulczynski1, Kulczynski2, Michael, Mountford, Mozley, Ochiai,
## Pearson, Phi, Phi-squared, Russel, Simpson, Stiles, Tanimoto,
## Tschuprow, Yule, Yule2, angular, correlation, cosine, eDice, eJaccard,
## simple matching
## 
## * Distance measures:
## Bhjattacharyya, Bray, Canberra, Chord, Euclidean, Geodesic, Hellinger,
## Kullback, Levenshtein, Mahalanobis, Manhattan, Minkowski, Podani,
## Soergel, Wave, Whittaker, divergence, fJaccard, supremum

We will use the Wu dataset as in the correlation plots above, this time using the logs assay to determine pairwise sample distances or similarities. The use of a similarity or distance will be automatically detected in the argument dist_method and plotted appropriately. The samples may be clustered or not, using the cluster_tree argument. We first show the Manhattan distances calculated between samples for the Wu dataset.

dist_plot(wu_SE[,wu_cor_plot_sample_selection],
          dist_method =  "manhattan",
          plot_type = "color",
          assay = "logs")

Here is the same example, this time using the cosine similarity as the pairwise measure, and imposing a clustering on the resulting similarities (note, in the case of calculating hierarchical clustering on a similarity matrix, we use proxy to convert similarities to distances prior to clustering). We can also pick a number of color scales, here choosing “Greens”.

dist_plot(wu_SE[,wu_cor_plot_sample_selection],
          dist_method =  "cosine",
          plot_type = "color",
          assay = "logs",
          color_pal = "Greens",
          cluster_tree = TRUE)

Clonal patterns

barcode_ggheatmap

A useful visualization to study clonal patterns over time is by using a heat map which clusters the top clones based on relatedness and displays their proportion in each sample over time. Our function barcode_ggheatmap does this by choosing the top N clones (n_clones) within each sample and tracking them over time. The argument n_clones assumes that in most cases, the large-contributing clones are of most interest to the user. This assumption can be relaxed by passing a large value to the argument.

We first visualize the top 10 clones from the selected samples in the Wu dataset. The default cell note is stars for the top 10 clones in each sample.

barcode_ggheatmap(your_SE = wu_SE[,wu_cor_plot_sample_selection],
                  n_clones = 10,
                  grid = FALSE,
                  label_size = 14,
                  cellnote_size = 4)

We can also add a dendogram which clusters clones based on the Euclidean distance between each clone’s log assay. Here we plot the top 5 clones from each sample within the Six dataset. First, we order the columns to group them by celltype. The dendrograms make it easily to visually categorize groups of similar clones.

six_celltype_order <- c("m13_TCELLS", "m36_TCELLS", "m43_TCELLS",
                        "m55_TCELLS","m13_BCELLS", "m36_BCELLS",
                        "m43_BCELLS", "m55_BCELLS", "m13_NKCELLS",
                        "m36_NKCELLS", "m43_NKCELLS","m55_NKCELLS",
                        "m13_GRANULOCYTES", "m36_GRANULOCYTES",
                        "m43_GRANULOCYTES", "m55_GRANULOCYTES",
                        "m13_MONOCYTES", "m36_MONOCYTES",
                        "m43_MONOCYTES","m55_MONOCYTES")

barcode_ggheatmap(your_SE = six_SE[,six_celltype_order],
                  n_clones = 5,
                  cellnote_assay = "stars",
                  cellnote_size = 3,
                  label_size = 14,
                  dendro = TRUE,
                  grid = FALSE,
                  clusters = 4,
                  distance_method = "euclidean")

barcode_ggheatmap_stat

In some cases, we may be interested in whether each barcode changed in proportion from one sample to another. The function barcode_ggheatmap_stat allows users to layer information from statistical tests onto the heat map. Note that this test requires an additional piece of information, which is the sample size of cells which the barcoding data approximates. In this case, we view the CD16+ NK cells from the Wu dataset at various timepoints. In this case, 40,000 barcoded cells were used for DNA extraction and high-throughput sequencing for each sample. The sample size does not have to be the same for each sample though.

The stars in the heat map indicate which barcodes in a given sample had a statistically signficant change in proportion, as compared to the previous sample using a chi-squared test. Users can view the p-value from the statistical test on the heatmap by changing the cellnote_assay parameter to "p_val". Users can also compare each sample to a reference sample (such as the first timepoint) by changing the stat_option to "reference" and providing the desired sample name to the reference_sample parameter.

wu_CD16_NK_order <- colnames(wu_SE)[17:20]

barcode_ggheatmap_stat(your_SE = wu_SE[,wu_CD16_NK_order],
                       sample_size = rep(40000,length(wu_CD16_NK_order)),
                       stat_test = "chi-squared",
                       stat_option = "subsequent",
                       p_threshold = 0.05,
                       n_clones = 10,
                       cellnote_assay = "stars",
                       cellnote_size = 6)

Users can also return the results from the statistical test using the barcode_stat_test function which outputs a list containing the FC, log_FC, and p-value of each barcode for each sample. The inputs are similar to the barcode_ggheatmap_stat function. The function only performs statistical testing on the barcodes which have a proportion greater than bc_threshold in at least one sample.

Here we make statistical comparisons on the CD16+ NK samples from the Wu dataset.

sample_size <- rep(40000,length(wu_CD16_NK_order))
wu_CD16_NK_statistics <- barcode_stat_test(your_SE = wu_SE[,wu_CD16_NK_order],
                                           sample_size = sample_size,
                                           stat_test = "chi-squared",
                                           stat_option = "subsequent",
                                           bc_threshold = 0.01)

head(wu_CD16_NK_statistics$p_val)
##                                                    ZJ31_6m_NK_CD56n_CD16p
## GTAGCCAGATAGCAATTCAGAAGACTCATTCGAGGTTCACGAGATCGGAA                     NA
## GTAGCCATCACCGTGAGGATAAATATTGCGTTTTTCTACTGAGATCGGAA                     NA
## GTAGCCATGAGCATCTACCGTGTGCTCATCGTTTTAAAATTAGATCGGAA                     NA
## GTAGCCATTGGTCCGTATTGGTTAGGTAAGACGTCGTTGGCAGATCGGAA                     NA
## GTAGCCCAGGAGACTCCTCCTTAGTCCGATGCGATTGTCTGAGATCGGAA                     NA
## GTAGCCCAGTCCCTGATGTGTGTACACCCTGGGGACGCAAGAGATCGGAA                     NA
##                                                    ZJ31_9.5m_NK_CD56n_CD16p
## GTAGCCAGATAGCAATTCAGAAGACTCATTCGAGGTTCACGAGATCGGAA            3.155084e-199
## GTAGCCATCACCGTGAGGATAAATATTGCGTTTTTCTACTGAGATCGGAA            3.011996e-216
## GTAGCCATGAGCATCTACCGTGTGCTCATCGTTTTAAAATTAGATCGGAA             4.471398e-05
## GTAGCCATTGGTCCGTATTGGTTAGGTAAGACGTCGTTGGCAGATCGGAA                      NaN
## GTAGCCCAGGAGACTCCTCCTTAGTCCGATGCGATTGTCTGAGATCGGAA            7.935321e-108
## GTAGCCCAGTCCCTGATGTGTGTACACCCTGGGGACGCAAGAGATCGGAA             1.810373e-36
##                                                    ZJ31_12m_NK_CD56n_CD16p
## GTAGCCAGATAGCAATTCAGAAGACTCATTCGAGGTTCACGAGATCGGAA            2.107042e-07
## GTAGCCATCACCGTGAGGATAAATATTGCGTTTTTCTACTGAGATCGGAA            7.350457e-06
## GTAGCCATGAGCATCTACCGTGTGCTCATCGTTTTAAAATTAGATCGGAA            5.175835e-27
## GTAGCCATTGGTCCGTATTGGTTAGGTAAGACGTCGTTGGCAGATCGGAA            3.223660e-04
## GTAGCCCAGGAGACTCCTCCTTAGTCCGATGCGATTGTCTGAGATCGGAA            1.725087e-24
## GTAGCCCAGTCCCTGATGTGTGTACACCCTGGGGACGCAAGAGATCGGAA            2.224201e-61
##                                                    ZJ31_20m_NK_CD56n_CD16p
## GTAGCCAGATAGCAATTCAGAAGACTCATTCGAGGTTCACGAGATCGGAA            9.459816e-28
## GTAGCCATCACCGTGAGGATAAATATTGCGTTTTTCTACTGAGATCGGAA            3.877976e-22
## GTAGCCATGAGCATCTACCGTGTGCTCATCGTTTTAAAATTAGATCGGAA            3.303986e-20
## GTAGCCATTGGTCCGTATTGGTTAGGTAAGACGTCGTTGGCAGATCGGAA            3.729529e-99
## GTAGCCCAGGAGACTCCTCCTTAGTCCGATGCGATTGTCTGAGATCGGAA            1.400149e-06
## GTAGCCCAGTCCCTGATGTGTGTACACCCTGGGGACGCAAGAGATCGGAA            7.362162e-44

In either the barcode_ggheatmap_stat or the barcode_stat_test functions, p-value adjustment for multiple testing can be performed by specifying a p value adjustment method to the p_adjust argument.

barcode_binary_heatmap

In some cases, we may be interested in a global view of the presence or absence of barcodes across samples, regardless of read abundance. In that case, a binary heat map can be generated using barcode_binary_heatmap to give the simplest visual representation. Here we view the binary heat map of the belderbos data with a threshold of 0.01, meaning clones that make up less than 1% of a sample are treated as not detected.

barcode_binary_heatmap(your_SE = belderbos_SE[,belderbos_wk_samples_PB],
                       label_size = 12,
                       threshold = 0.01)

clonal_contribution

Another familiar way to visualize clonal patterns over time is using a line or bar chart showing the proportion of top clones. In the above heat maps for wu data, we could see that there are some large uni-lineage CD56-/CD16+ NK cell clones. We can view the expansion of the top clones from the final timepoint through a stacked area line chart showing the proportion of each clone in CD56-/CD16+ NK cell samples across time. Each color indicates one of the top clones from the final timepoint.

clonal_contribution(your_SE = wu_SE,
                    SAMPLENAME_choice = "ZJ31_20m_NK_CD56n_CD16p",
                    n_clones = 10,
                    graph_type = "line",
                    plot_over = "months",
                    filter_by = "celltype",
                    filter_selection = "NK_16",
                    plot_non_selected = FALSE)

For data with fewer clones, a bar chart might be appropriate. We can do so by setting the plot_non_selected argument to TRUE. We can also use categorical spacing on the x-axis rather than numeric by setting keep_numeric to FALSE.

Clonal bias

bias_histogram

The most straightforward way to view the bias between samples is using a histogram. We include the bias_histogram function which allows users to compare the frequency of shared barcode abundance between samples. The histogram shows the frequency of barcodes within different values of log bias (on the x-axis) with values close to 0 signifying similar abundance between the two samples. When the barcode abundance is 0 in one of the samples, it is lumped into the leftmost or right most bin of the histogram. By setting the "remove_unique" argument to TRUE, one can compare only the barcodes found in both samples being compared.

The function allows users to compare two samples from a given piece of metadata "split_bias_on" faceted by another piece of metadata "split_bias_over", which is illustrated below using the Wu et al data comparing clonal bias between B and T cells from the 6, 9.5, 12, and 20 month post-transplant timepoints.

wu_bias_plot_sample_selection <- colData(wu_SE)$SAMPLENAME[1:20]

bias_histogram(your_SE = wu_SE[,wu_bias_plot_sample_selection],
               split_bias_on = "celltype",
               bias_1 = "B",
               bias_2 = "Gr",
               split_bias_over = "months",
               ncols = 2)

The stacked bars of the histogram represent individual clones. For the Wu dataset, there are a multitude of small clones so the stacked bars are not visible. We can view clonal bias between B and T cells from the Six, et al dataset which has a smaller number of larger clones.

bias_histogram(your_SE = six_SE,
               split_bias_on = "celltype",
               bias_1 = "B",
               bias_2 = "T",
               split_bias_over = "months",
               ncols = 2)

bias_ridge_plot

An alternative to the histogram is a ridge plot which shows clonal bias between cell types through a density estimation of the number of clones at each value of the log bias. Since the ridge plot treats clonal bias as a continuous variable, it can reveal trends that are masked by grouping into bins with a histogram.

It is important to note that in order to handle clones which have a count of zero in one of the samples, log+1 normalization is used within the ridge plot function. This differs from the histogram where these clones can be grouped into the farthest bins on either side. The log bias formula for the ridge plot function is given by:

$logbias=log(\frac{normalized_1+1}{normalized_2+1})$

Here, we view a ridge plot showing clonal bias between B cells and Granulocytes from the Wu dataset. We calculate the density statistic using the cumulative sum of the normalized values for each log2 comparison. We also visualize each clone as a dot on the plot, proportionate to the cumulative sum of the normalized values.

bias_ridge_plot(your_SE = wu_SE,
                split_bias_on = "celltype",
                bias_1 = "B",
                bias_2 = "Gr",
                split_bias_over = "months",
                bias_over = c("6","9.5","12","20"),
                weighted = TRUE,
                add_dots = TRUE)

We again view a comparison in the Wu dataset, this time comparing B cells to Granulocytes but turning the weighting OFF. In this visualization, the density statistics does not take into account eh abundance of the clones, so each clone is treated as equal. The fact that these ridge plots appear more biased than in the above visualization shows that many of the high-abundance clones are shared between samples, and the more biased clones tend to have lower overall abundance.

bias_ridge_plot(your_SE = wu_SE,
                split_bias_on = "celltype",
                bias_1 = "B",
                bias_2 = "Gr",
                split_bias_over = "months",
                bias_over = c("6","9.5","12","20"),
                weighted = FALSE,
                add_dots = TRUE)

bias_lineplot

In some cases, there may be enough data to observe more interesting longitudinal trends. In the Six dataset, we can track the Monocyte vs Granulocyte abundance bias of all individual clones over time; each line represents a single clone with shading weighted by its added proportion between both cell types. The clones that have higher expression are evident on the plot and suggest that the highest abundance clones in both lineages are relatively unbiased to lineage over time.

bias_lineplot(your_SE = six_SE,
              split_bias_on = "celltype",
              bias_1 = "Gr",
              bias_2 = "Mo",
              split_bias_over = "months",
              remove_unique = TRUE)

Clone counts

clonal_count

We may be interested in the number of clones detected per group across another variable (such as time). Here, we will use the function clonal_count to plot the number of unique clones detected in each lineage over time for the Six dataset.

clonal_count(your_SE = six_SE,
             plot_over = "months",
             group_by = "celltype")

We can also set the argument cumulative = TRUE to plot the cumulative detection of clones in each lineage over time for the Six dataset. This plot can help us determine whether the clones being detected in each successive time point are newly detected or have previously been detected.

clonal_count(your_SE = six_SE,
             plot_over = "months",
             group_by = "celltype",
             cumulative = TRUE)

Diversity

rank_abundance

A way to depict clone richness and evenness within a plot is by using a rank-abundance plot. Here, the cumulative abundance of every clone in a sample is plotted in descending rank from 1 to n (where there are n clones in the sample being plotted), or by scaling all ranks to the range [0,1].

Here, we plot the first 8 samples from the Belderbos dataset. Note that the week 9 samples appear to have the most evenness across detected clones, while the other samples contain both large and small detected clones.

rank_abundance_plot(belderbos_SE[,1:8],
                    scale_rank = TRUE,
                    point_size = 2)

In addition to visually comparing rank abundance plots, one might want to emply a statistical test to ask whether the rank-abundance profiles from different samples are drawn from the same distribution. We have implemented the two sample Kolmogorov-Smirnov test in the function "rank_abundance_stat_test" in order to compute pairwise p-values between each samples testing the null hypothesis that their rank abundance profile is drawn from the same distribution. Note that this method is agnostic to whether the samples have actual shared barcode sequences - it simply compares the cumulative abundance distribution as barcodes are added from highest to lowest in abundance.

Below, we illustrate this test on the same 8 samples from Belderbos et al shown above and print the p-values for all pairwise comparisons.

stat_result <- rank_abundance_stat_test(belderbos_SE[,1:8])
print(stat_result[["p_value"]])
##                       wk9_U       wk14_U      wk20_U      wk22_U      wk22_B
## wk9_U          1.000000e+00           NA          NA          NA          NA
## wk14_U         5.899677e-03 1.000000e+00          NA          NA          NA
## wk20_U         3.622040e-17 6.010370e-04 1.000000000          NA          NA
## wk22_U         1.316234e-07 2.704595e-01 0.001062126 1.000000000          NA
## wk22_B         1.274820e-05 5.124057e-01 0.005391181 0.864231604 1.000000000
## wk22_T         3.228310e-17 4.637325e-05 0.237191127 0.011976320 0.029622238
## sac_BM_Front_U 7.438368e-18 2.713533e-04 0.716547822 0.000171218 0.001410338
## sac_BM_Front_B 2.431375e-17 3.249388e-04 0.889634009 0.016518714 0.045682062
##                    wk22_T sac_BM_Front_U sac_BM_Front_B
## wk9_U                  NA             NA             NA
## wk14_U                 NA             NA             NA
## wk20_U                 NA             NA             NA
## wk22_U                 NA             NA             NA
## wk22_B                 NA             NA             NA
## wk22_T         1.00000000             NA             NA
## sac_BM_Front_U 0.07232963      1.0000000             NA
## sac_BM_Front_B 0.78034510      0.4296676              1

clonal_diversity

Within-sample diversity indices (also referred to as alpha diversities) are indices computed independently for each sample in a data set. With clonal tracking, these diversity indices can give a global indication about the number of species in a sample using the number of detected species as input and sometimes also leveraging the proportional abundances of species within the sample. We include the function clonal_diversity which can calculate three diversity indices (making use the vegan package):

  • "shannon" $H'=-\sum _{i=1}^{R}p_{i}\ln p_{i}$
  • "simpson" $\lambda =\sum _{i=1}^{R}p_{i}^{2}$
  • "invsimpson"${\displaystyle {\frac {1}{\lambda }}={1 \over \sum _{i=1}^{R}p_{i}^{2}}={}^{2}D}$

We also include "count" as an option for index_type in order to use the total detected clones per sample as a measure for diversity.

As an example, we can plot the shannon diversities for the 5 cell types within the Wu dataset over time.

clonal_diversity(wu_SE[,1:20],
                 plot_over = "months",
                 group_by = "celltype",
                 index_type = "shannon")

Here we show the simpson indices for the Peripheral Blood samples in the Belderbos dataset over time; the last time point splits the cell fraction in to T, B, and Granulocyte fractions, allowing comparison of their shannon indices.

clonal_diversity(belderbos_SE[,1:6],
                 plot_over = "weeks",
                 group_by = "celltype", 
                 index_type = "shannon")

Similar to the Wu dataset, the Six dataset contains clonal tracking information for the T, B, Gr, Monocyte, and NK lineages. We can plot these over time as well and utilize the simpson index as a measure of diversity.

clonal_diversity(six_SE,
                 plot_over = "months",
                 group_by = "celltype",
                 index_type = "simpson")

mds_plot

Measures of simmilarity or dissimilarity between samples are known as beta-diversity indices (or distances if they are metrics). A common way for depicting these beta-diversity indices are using what are known as PCoA (Principal Coordinate Analysis) plots, in which an input distance matrix is plotted in two dimensions. Again, we leverage the "vegan" package here to call vegandist which allows us to calculate a number of dissimilarity indices between our samples (choosing an assay from the SummarizedExperiment object) and then perform principal coordinates analysis using cmdscale. Note that using "euclidean" as our index is equivalent to performing PCA (Prinicipal Components Analysis) on our data.

One of the most commonly used beta-diversity indices is the Bray-Curtis Dissimilarity. Here, we find the Bray-Curtis dissimilarity index between all of the samples in the Wu dataset and use PCoA to plot them on two dimensions. From the plot, it is evident that NK cells are most dissimilar from all other celltypes when considering the Bray-Curtis index.

mds_plot(wu_SE, group_by = "celltype",
         method_dist = "bray", assay = "proportions")

When using the Bray-Curtis dissimilarity index between all of the samples in the Six dataset, we find that similarly to the Wu dataset, NK cells appear dissimilar from the remainder of the celltypes, while Monocytes and Granulocytes appear most similar to one another.

mds_plot(six_SE, group_by = "celltype",
         method_dist = "bray", assay = "proportions")

Chord diagram

The chord diagram provides an informative way to view relationships between variables. We make use of the circlize package to show shared clones between compartments as links between regions around a circle. Here, we will start by subetting our data to a single timepoint and comparing just three cell types: T cells, B cells, and Granulocytes.

wu_circos_selection <- c("ZJ31_12m_T","ZJ31_12m_B","ZJ31_12m_Gr")
chord_diagram(wu_SE[,wu_circos_selection], plot_label = "celltype")

Here, we can see that most clones are present in all three cell types (purple link). There are also clones shared between each pair-wise combination of cell types. Based on the width of the links, you can ascertain that more clones are shared between Granulocytes and B cells (yellow link), followed by B cells and T cells (blue link), then T cells and Granulocytes (green link). The portion of each cell type without a link (empty space) represents the number of clones which are unique to that cell type.

We can also create a weighted circos plot. The difference is that in the previous plot, the width of the links between cell types are proportional to the number of clones shared between those cell types. In the weighted heat map, the width of the links between two cell types is proportional to the proportion of contribution of the shared clones to the overall hematopoiesis in that cell type.

chord_diagram(wu_SE[,wu_circos_selection],
              plot_label = "celltype",
              weighted = TRUE)

In this example, the circos plots look very similar. But you can see the subtle difference when you look at the blue link showing clones shared between B and T cells. The link is wider when connecting to T cells because the shared Bcell-Tcell clones represent a larger fraction of detected hematopoiesis in the T cell compartment than the B cell compartment.

The circos plot can handle any number of inputs but keep in mind that the number of unique combinations rises exponentially with the number of compartments. Here, we use the alpha parameter to control the transparency of the links. And we plot four cell types rather than three.

wu_circos_selection2 <- c("ZJ31_12m_T",
                          "ZJ31_12m_B",
                          "ZJ31_12m_Gr",
                          "ZJ31_12m_NK_CD56n_CD16p")
chord_diagram(wu_SE[,wu_circos_selection2],
              plot_label = "celltype",
              alpha = 0.9)

The plot is quite involved, but we can quickly draw a few high-level conclusions. A large chunk of clones are shared between all four cell types (darkest purple color). Another large chunk of clones is shared between B cells, T cells, and Grans but not NK cells (lighter purple). Note that each unique color signifies a unique combination of the four cell types.

The regions of the circos plot need not be cell types. They could also be timepoints. Here, we show that example using the belderbos dataset from mouse studies showing only the first four timepoints of peripheral blood samples.

belderbos_wk_samples_PB <- c("wk9_U", "wk14_U", "wk20_U", "wk22_U")
chord_diagram(belderbos_SE[,belderbos_wk_samples_PB])

From the chord diagram, there are some clones shared between all timepoints (dark purple links) and some shared only between the wk14, wk20, and wk22 timepoints (light green and yellow links).

chord_diagram(belderbos_SE[,belderbos_wk_samples_PB],
            weighted = TRUE)

When looking at the weighted heat map, one sees that the majority of hematopoiesis in timepoints past 9 weeks is accounted for by the group of clones (dark purple color) that are present at all timepoints.

From this example, one can see that the two plots paint different but complementary pictures. With the regular circos plot, clones are treated as equal so the information on fractional contribution is lost. However, the number of detected clones is indicated by the length of each region around the perimeter of the circos. This information can be useful for some studies and it is lost in the weighted heat map since all compartments poportion adds to 100%. We recommend using the regular and circos plot in combination to obtain maximal information from any given dataset.

Session Info

sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] SummarizedExperiment_1.37.0 Biobase_2.67.0             
##  [3] GenomicRanges_1.59.1        GenomeInfoDb_1.43.2        
##  [5] IRanges_2.41.2              S4Vectors_0.45.2           
##  [7] BiocGenerics_0.53.3         generics_0.1.3             
##  [9] MatrixGenerics_1.19.0       matrixStats_1.4.1          
## [11] barcodetrackR_1.15.0        magrittr_2.0.3             
## [13] BiocStyle_2.35.0           
## 
## loaded via a namespace (and not attached):
##  [1] shape_1.4.6.1           circlize_0.4.16         gtable_0.3.6           
##  [4] xfun_0.49               bslib_0.8.0             ggplot2_3.5.1          
##  [7] GlobalOptions_0.1.2     lattice_0.22-6          vctrs_0.6.5            
## [10] tools_4.4.2             parallel_4.4.2          proxy_0.4-27           
## [13] tibble_3.2.1            cluster_2.1.8           pkgconfig_2.0.3        
## [16] Matrix_1.7-1            RColorBrewer_1.1-3      ggridges_0.5.6         
## [19] lifecycle_1.0.4         GenomeInfoDbData_1.2.13 farver_2.1.2           
## [22] compiler_4.4.2          munsell_0.5.1           permute_0.9-7          
## [25] htmltools_0.5.8.1       sys_3.4.3               buildtools_1.0.0       
## [28] sass_0.4.9              yaml_2.3.10             pillar_1.10.0          
## [31] crayon_1.5.3            jquerylib_0.1.4         tidyr_1.3.1            
## [34] MASS_7.3-61             DelayedArray_0.33.3     cachem_1.1.0           
## [37] vegan_2.6-8             viridis_0.6.5           abind_1.4-8            
## [40] nlme_3.1-166            tidyselect_1.2.1        digest_0.6.37          
## [43] dplyr_1.1.4             purrr_1.0.2             labeling_0.4.3         
## [46] splines_4.4.2           maketools_1.3.1         cowplot_1.1.3          
## [49] fastmap_1.2.0           grid_4.4.2              colorspace_2.1-1       
## [52] cli_3.6.3               SparseArray_1.7.2       S4Arrays_1.7.1         
## [55] withr_3.0.2             UCSC.utils_1.3.0        scales_1.3.0           
## [58] rmarkdown_2.29          XVector_0.47.0          httr_1.4.7             
## [61] gridExtra_2.3           evaluate_1.0.1          knitr_1.49             
## [64] viridisLite_0.4.2       mgcv_1.9-1              ggdendro_0.2.0         
## [67] rlang_1.1.4             Rcpp_1.0.13-1           glue_1.8.0             
## [70] BiocManager_1.30.25     jsonlite_1.8.9          R6_2.5.1               
## [73] plyr_1.8.9              zlibbioc_1.52.0