LinTInd - tutorial

Introduction

Single-cell RNA sequencing has become a common approach for tracing the developmental processes of cells. However, reading lineage from exogenous barcodes is more direct than predicting it from expression profiles. As gene-editing technology matures, combining it with exogenous barcodes can yield richer dynamic information for single cells. In this application note, we introduce an R package, LinTInd, for reconstructing a tree from alleles generated by the genome-editing tool CRISPR over a moderate time period, based on the order in which editing occurs. For scRNA-seq data, LinTInd can also quantify the similarity between each pair of clusters in three ways.

Installation

Via GitHub

devtools::install_github("mana-W/LinTInd")

Via Bioconductor

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("LinTInd")
library(LinTInd)

Import data

The input for LinTInd consists of three required files:

  • sequence
  • reference
  • position of cutsites

and an optional file:

  • celltype

# Locate the example files shipped with the package
datafile<-system.file("extdata","CB_UMI",package = 'LinTInd')
fafile<-system.file("extdata","V3.fasta",package = 'LinTInd')
cutsitefile<-system.file("extdata","V3.cutSites",package = 'LinTInd')
celltypefile<-system.file("extdata","celltype.tsv",package = 'LinTInd')

# Read the sequences, the reference, the cut sites and the cell annotation
data<-read.table(datafile,sep="\t",header=TRUE)
ref<-ReadFasta(fafile)
cutsite<-read.table(cutsitefile,col.names = c("indx","start","end"))
celltype<-read.table(celltypefile,header=TRUE,stringsAsFactors=FALSE)

For the sequence file, only the column containing the read sequences is required; the cell barcode and UMI columns are both optional.

head(data,3)
##                                   Read.ID
## 1  @A01045:289:HM7K3DRXX:2:2101:9896:1031
## 2 @A01045:289:HM7K3DRXX:2:2101:13367:1031
## 3  @A01045:289:HM7K3DRXX:2:2101:9959:1047
##                                                                                                                                                                                                                                                     Read.Seq
## 1 GAACGCGTAGGATAACATGGCCATCATCAAGGAGTTCTCATGCGCTTCAAGGTGCACATGGTTTATTGGAGCCGTACATGAACTGAGGTTAAGGACAGGATGTCCCAGGCGTAGGTAATTGGCCCCCTGCCCTTCGCCTGGGTTATAAGCTTCGGGTTTAAACGGGCCCTGGGGGTGGCATCCCTGTGACCCCTCCCCAGTGCCTCTCCTGGCCCTGGAAGTTGCCACTCCAGTGCCCACCAGCCTTGTC
## 2 GAACGCGTAGGATAACATGGCCATCATCAAGGAGTTCTCATGCGCTTCAAGGTGCACATGGTTTATTGGAGCCGTACATGAACTGAGGTTAAGGACAGGATGTCCCAGGCGTAGGTAATTGGCCCCCTGCCCTTCGCCTGGGTTATAAGCTTCGGGTTTAAACGGGCCCTGGGGGTGGCATCCCTGTGACCCCTCCCCAGTGCCTCTCCTGGCCCTGGAAGTTGCCACTCCAGTGCCCACCAGCCTTGTC
## 3 GAACGCGTAGGATAACATGGCCATCATCAAGGAGTTCTCATGCGCTTCAAGGTGCACATGGTTTATTGGAGCCGTACATGAACTGAGGTTAAGGACAGGATGTCCCAGGCGTAGGTAATTGGCCCCCTGCCCTTCGCCTGGGTTATAAGCTTCGGGTTTAAACGGGCCCTGGGGGTGGCATCCCTGTGACCCCTCCCCAGTGCCTCTCCTGGCCCTGGAAGTTGCCACTCCAGTGCCCACCAGCCTTGTC
##            Cell.BC        UMI
## 1 GAAGGGTAGCCTCAGC CTTCTCCCGA
## 2 ACCCTCACAAGACTGG TGTAATTTTT
## 3 GAAGGGTAGCCTCAGC CTTCTCCCGA
ref
## $scarfull
## 333-letter DNAString object
## seq: GAACGCGTAGGATAACATGGCCATCATCAAGGAGTT...GGAAGTTGCCACTCCAGTGCCCACCAGCCTTGTCCT
cutsite
##   indx start end
## 1    0    39 267
## 2    1     1  23
## 3    2    28  50
## 4    3    55  77
## 5    4    82 104
## 6    5   109 131
## 7    6   136 158
## 8    7   163 185
head(celltype,3)
##            Cell.BC Cell.type
## 1 AAGCGAGTCTTCTGTA         A
## 2 AATCGACTCGTAGTGT         A
## 3 ACATGCAGTCCACACG         A
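
The cut-site table can be sanity-checked against the reference before running the pipeline. The sketch below uses base R only and copies the values printed above. check_cutsites() is a hypothetical helper, not a LinTInd function, and the reference length of 333 is taken from the DNAString summary shown for ref.

```r
# Cut-site table copied from the output printed above
cutsite <- data.frame(
  indx  = 0:7,
  start = c(39, 1, 28, 55, 82, 109, 136, 163),
  end   = c(267, 23, 50, 77, 104, 131, 158, 185)
)
ref_length <- 333  # length reported for the reference DNAString above

# Hypothetical helper (not part of LinTInd): returns TRUE when every
# interval is well formed and lies inside the reference
check_cutsites <- function(cutsite, ref_length) {
  stopifnot(all(c("indx", "start", "end") %in% names(cutsite)))
  all(cutsite$start >= 1 &
      cutsite$start <= cutsite$end &
      cutsite$end <= ref_length)
}

cutsites_ok <- check_cutsites(cutsite, ref_length)

# Width of each interval, counting both end points
cutsite$width <- cutsite$end - cutsite$start + 1
```

With the example values the check passes, and the rows with indx 1 to 7 each span 23 positions.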

Array identification and indel visualization

In the first step, we use FindIndel() to align the reads to the reference and find indels; the function IndelForm() then generates an array-form string for each read.

scarinfo<-FindIndel(data=data,scarfull=ref,scar=cutsite,indel.coverage="All",type="test",cln=1)
scarinfo<-IndelForm(scarinfo,cln=1)
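
To make the idea of reading indels off an alignment concrete, here is a minimal base-R sketch. It is not how FindIndel() is implemented: the helper find_indels() and the two gapped strings are made up for illustration, and the alignment itself is assumed to have been computed already, with "-" marking a gap.

```r
# Hypothetical helper (not part of LinTInd): list the indels in one
# gapped pairwise alignment, where "-" marks a gap
find_indels <- function(ref_aln, read_aln) {
  r <- strsplit(ref_aln, "")[[1]]
  q <- strsplit(read_aln, "")[[1]]
  stopifnot(length(r) == length(q))

  # Label each alignment column: D = base missing from the read,
  # I = base inserted in the read, M = aligned base
  op <- ifelse(q == "-", "D", ifelse(r == "-", "I", "M"))

  # Collapse consecutive columns with the same label into runs
  runs <- rle(op)
  col_start <- cumsum(runs$lengths) - runs$lengths + 1

  # Number of reference bases consumed up to and including each column
  ref_pos <- cumsum(r != "-")

  keep <- runs$values != "M"
  data.frame(
    type    = runs$values[keep],
    length  = runs$lengths[keep],
    # For D: first deleted reference base.
    # For I: reference base immediately before the inserted bases.
    ref_pos = ref_pos[col_start[keep]],
    stringsAsFactors = FALSE
  )
}

# Toy alignment: a 2 bp deletion followed by a 2 bp insertion
indels <- find_indels("ACGTACGT--AC", "AC--ACGTTTAC")
```

For this toy pair, the result has one deletion of length 2 starting at reference position 3 and one insertion of length 2 after reference position 8.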

Then, for single-cell sequencing, we define a final array-form string for each cell with IndelIdents(). Three methods are provided:

  • “reads.num” (default): find the array-form string supported by the most reads in a cell
  • “umi.num”: find the array-form string supported by the most UMIs in a cell
  • “consensus”: find the consensus sequence in each cell, and then generate array-form strings from the new reads

For bulk sequencing, this step generates a “cell barcode” for each read.

cellsinfo<-IndelIdents(scarinfo,method.use="umi.num",cln=1)
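
The difference between “reads.num” and “umi.num” can be illustrated with a small base-R sketch. This is only a reading of the descriptions above, not the code inside IndelIdents(): the toy table, its array column and the helper most_supported() are made up for the example.

```r
# Toy reads: cell c1 has three reads from one UMI supporting array "A"
# and two reads from two different UMIs supporting array "B"
reads <- data.frame(
  Cell.BC = c("c1", "c1", "c1", "c1", "c1", "c2", "c2"),
  UMI     = c("u1", "u1", "u1", "u2", "u3", "u4", "u4"),
  array   = c("A",  "A",  "A",  "B",  "B",  "C",  "C"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the most frequent value (first one wins on ties)
most_supported <- function(x) names(which.max(table(x)))

# "reads.num"-style vote: every read counts once
by_reads <- vapply(split(reads$array, reads$Cell.BC),
                   most_supported, character(1))

# "umi.num"-style vote: every (cell, UMI, array) combination counts once
uniq <- unique(reads[, c("Cell.BC", "UMI", "array")])
by_umis <- vapply(split(uniq$array, uniq$Cell.BC),
                  most_supported, character(1))
```

In this toy data, cell c1 is assigned "A" when counting reads but "B" when counting UMIs, because the three reads supporting "A" all come from a single UMI.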

After defining the indels for each cell, we can use IndelPlot() to visualise them.

IndelPlot(cellsinfo = cellsinfo)

Indel extraction and similarity calculation

We can use the function TagProcess() to extract indels for cells/reads. The parameter Cells is optional.

tag<-TagProcess(cellsinfo$info,Cells=celltype)

If an annotation is provided for each cell, we can also use TagDist() to calculate the relationship between each pair of groups in three ways:

  • “Jaccard” (default): calculate the weighted Jaccard similarity of indels between each pair of groups
  • “P”: right-tailed test that compares the indel intersection level with the hypothetical result generated from random editing; the former is expected to be significantly higher than the latter
  • “spearman”: Spearman correlation of indels between each pair of groups

The heatmap of this result will be saved as a PDF file.

tag_dist<-TagDist(tag,method = "Jaccard")
## Using Cell.type as value column: use value.var to override.
## Aggregation function missing: defaulting to length
tag_dist
##           A         B         C         D         E
## A 1.0000000 0.4925373 0.2794118 0.2985075 0.2058824
## B 0.4925373 1.0000000 0.5588235 0.6060606 0.4117647
## C 0.2794118 0.5588235 1.0000000 0.9047619 0.7500000
## D 0.2985075 0.6060606 0.9047619 1.0000000 0.6666667
## E 0.2058824 0.4117647 0.7500000 0.6666667 1.0000000
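
For readers unfamiliar with the two similarity measures, the sketch below computes them in base R on a made-up count matrix. It assumes the common definition of weighted Jaccard similarity, the sum of element-wise minima divided by the sum of element-wise maxima; whether TagDist() uses exactly this formula is an assumption. The names counts, weighted_jaccard() and pairwise() are hypothetical.

```r
# Toy indel counts: rows are groups, columns are indels (made-up values)
counts <- rbind(
  A = c(indel1 = 4, indel2 = 0, indel3 = 2),
  B = c(indel1 = 2, indel2 = 1, indel3 = 2),
  C = c(indel1 = 0, indel2 = 3, indel3 = 0)
)

# Weighted Jaccard similarity (assumed definition): shared counts
# divided by the counts in the union
weighted_jaccard <- function(x, y) sum(pmin(x, y)) / sum(pmax(x, y))

# Apply a two-vector function to every pair of rows
pairwise <- function(m, f) {
  g <- rownames(m)
  out <- matrix(NA_real_, length(g), length(g), dimnames = list(g, g))
  for (i in g) for (j in g) out[i, j] <- f(m[i, ], m[j, ])
  out
}

jaccard_sim  <- pairwise(counts, weighted_jaccard)
spearman_sim <- pairwise(counts,
                         function(x, y) cor(x, y, method = "spearman"))
```

For groups A and B, the minima sum to 4 and the maxima sum to 7, so their weighted Jaccard similarity is 4/7. Groups A and C share no indels and score 0.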

Tree reconstruction

In the last part, we can use BuildTree() to generate a tree that describes how the arrays were generated.

treeinfo<-BuildTree(tag)
## Using Cell.num as value column: use value.var to override.
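
One simple way to think about ordering arrays by editing history is to treat each array as a set of indels and attach it to the array holding the largest proper subset of those indels. The base-R sketch below illustrates only that idea. It is not the algorithm used by BuildTree(), and the arrays, indel labels and find_parent() helper are all made up.

```r
# Toy arrays, each described by the set of indels it carries
arrays <- list(
  root = character(0),          # unedited
  a1   = c("d3"),
  a2   = c("d3", "i8"),
  a3   = c("d3", "i8", "d20"),
  a4   = c("d5")
)

# Hypothetical helper: the parent of an array is the array with the
# largest indel set that is a proper subset of the child's indel set
find_parent <- function(name, arrays) {
  child <- arrays[[name]]
  is_ancestor <- vapply(
    arrays,
    function(s) length(s) < length(child) && all(s %in% child),
    logical(1)
  )
  if (!any(is_ancestor)) return(NA_character_)
  candidates <- names(arrays)[is_ancestor]
  candidates[which.max(lengths(arrays[candidates]))]
}

parents <- vapply(names(arrays), find_parent, character(1),
                  arrays = arrays)
```

Here a3 attaches to a2, a2 to a1, and both a1 and a4 attach to the unedited root, which has no parent.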

Finally, we can use the function PlotTree() to visualise the tree created before.

plotinfo<-PlotTree(treeinfo = treeinfo,data.extract = "TRUE",annotation = "TRUE")
## Using tags as id variables
## ℹ invalid tbl_tree object. Missing column: parent,node.
## ℹ invalid tbl_tree object. Missing column: parent,node.
## ℹ invalid tbl_tree object. Missing column: parent,node.
## ℹ invalid tbl_tree object. Missing column: parent,node.
plotinfo$p

Session Info

sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
## [1] LinTInd_1.11.0      S4Vectors_0.45.2    BiocGenerics_0.53.5
## [4] generics_0.1.3      ggplot2_3.5.1      
## 
## loaded via a namespace (and not attached):
##  [1] stringdist_0.9.15       gtable_0.3.6            xfun_0.50              
##  [4] bslib_0.8.0             htmlwidgets_1.6.4       rlist_0.4.6.2          
##  [7] lattice_0.22-6          vctrs_0.6.5             tools_4.4.2            
## [10] yulab.utils_0.1.9       tibble_3.2.1            pkgconfig_2.0.3        
## [13] pheatmap_1.0.12         data.table_1.16.4       ggnewscale_0.5.0       
## [16] ggplotify_0.1.2         RColorBrewer_1.1-3      lifecycle_1.0.4        
## [19] GenomeInfoDbData_1.2.13 stringr_1.5.1           farver_2.1.2           
## [22] compiler_4.4.2          treeio_1.31.0           Biostrings_2.75.3      
## [25] munsell_0.5.1           data.tree_1.1.0         ggtree_3.15.0          
## [28] ggfun_0.1.8             GenomeInfoDb_1.43.4     htmltools_0.5.8.1      
## [31] sys_3.4.3               buildtools_1.0.0        sass_0.4.9             
## [34] yaml_2.3.10             lazyeval_0.2.2          pillar_1.10.1          
## [37] crayon_1.5.3            jquerylib_0.1.4         tidyr_1.3.1            
## [40] cachem_1.1.0            nlme_3.1-167            tidyselect_1.2.1       
## [43] aplot_0.2.4             digest_0.6.37           stringi_1.8.4          
## [46] reshape2_1.4.4          dplyr_1.1.4             purrr_1.0.2            
## [49] labeling_0.4.3          maketools_1.3.1         cowplot_1.1.3          
## [52] fastmap_1.2.0           grid_4.4.2              colorspace_2.1-1       
## [55] cli_3.6.3               magrittr_2.0.3          patchwork_1.3.0        
## [58] ape_5.8-1               withr_3.0.2             scales_1.3.0           
## [61] UCSC.utils_1.3.1        pwalign_1.3.2           rmarkdown_2.29         
## [64] XVector_0.47.2          httr_1.4.7              networkD3_0.4          
## [67] igraph_2.1.4            evaluate_1.0.3          knitr_1.49             
## [70] IRanges_2.41.2          gridGraphics_0.5-1      rlang_1.1.5            
## [73] Rcpp_1.0.14             glue_1.8.0              tidytree_0.4.6         
## [76] jsonlite_1.8.9          plyr_1.8.9              R6_2.5.1               
## [79] fs_1.6.5