Analysis of high-throughput sequencing of T and B cell receptors with LymphoSeq

Application of high-throughput sequencing of T and B lymphocyte antigen receptors has great potential for improving the monitoring of lymphoid malignancies, assessing immune reconstitution after hematopoietic stem cell transplantation, and characterizing the composition of lymphocyte repertoires (Warren, E. H. et al. Blood 2013;122:19–22). LymphoSeq is an R package designed to import, analyze, and visualize antigen receptor sequencing data from Adaptive Biotechnologies’ ImmunoSEQ assay. The package can also be adapted to analyze T and B cell receptor sequencing data processed using other platforms such as MiXCR or IMGT/HighV-QUEST. This vignette highlights some of the features of LymphoSeq and guides the user through a typical workflow.

Importing data

The LymphoSeq function readImmunoSeq imports tab-separated value (.tsv) files exported by Adaptive Biotechnologies’ ImmunoSEQ analyzer v2, where each row represents a unique sequence and each column is a variable with information about that sequence such as read count, frequency, or variable gene name. Note that the file format for ImmunoSEQ analyzer v3 is not yet supported, so users must choose to export the v2 format from the analyzer software. Only files with the extension .tsv are imported; all others are disregarded. It is possible to import files processed using other platforms as long as the files are tab-delimited, are given the extension .tsv, and have column names identical to those of the ImmunoSEQ files (see the readImmunoSeq manual for a list of column names used by this file type). Refer to the LymphoSeq manual regarding the required column names used by each function.

To explore the features of LymphoSeq, this package includes 2 example data sets. The first is a data set of T cell receptor beta (TCRB) sequencing from 10 blood samples acquired serially from a single patient who underwent a bone marrow transplant (Kanakry, C.G., et al. JCI Insight 2016;1(5):pii: e86252). The second is a data set of B cell receptor immunoglobulin heavy (IGH) chain sequencing from Burkitt lymphoma tumor biopsies acquired from 10 different individuals (Lombardo, K.A., et al. Blood Advances 2017 1:535-544). To improve performance, each file in both data sets is limited to the 1,000 most frequent sequences. The complete data sets are publicly available through Adaptive’s immuneACCESS portal. As shown in the example below, you can specify the path to the example data sets using the command system.file("extdata", "TCRB_sequencing", package = "LymphoSeq") for the TCRB files and system.file("extdata", "IGH_sequencing", package = "LymphoSeq") for the IGH files.

readImmunoSeq imports each file in the specified directory as a list object where each file becomes a data frame. You can import all columns from each file by setting the columns parameter to "all" or list just those columns of interest. Be aware that Adaptive Biotechnologies has changed the column names of their files over time and if the headings of your files are not all the same, you will need to specify "all" or provide all variations of the column header. By default, the columns parameter is set to import only those columns used by LymphoSeq.

library(LymphoSeq)

## Loading required package: LymphoSeqDB

TCRB.path <- system.file("extdata", "TCRB_sequencing", package = "LymphoSeq")

TCRB.list <- readImmunoSeq(path = TCRB.path)

Notice that each data frame listed in the TCRB.list object is named according to the ImmunoSEQ file names. If different names are desired, you may rename the original .tsv files or assign to names(TCRB.list) a new character vector of desired names in the same order as the list.

names(TCRB.list)

 [1] "TRB_CD4_949"       "TRB_CD8_949"       "TRB_CD8_CMV_369"  
 [4] "TRB_Unsorted_0"    "TRB_Unsorted_1320" "TRB_Unsorted_1496"
 [7] "TRB_Unsorted_32"   "TRB_Unsorted_369"  "TRB_Unsorted_83"  
[10] "TRB_Unsorted_949"
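The renaming step can be sketched on a small stand-in list. The list toy.list, its contents, and the replacement names below are hypothetical and only illustrate the assignment to names(); the same pattern would apply to TCRB.list.

```r
# Hypothetical two-sample list standing in for the output of readImmunoSeq
toy.list <- list(
  TRB_Unsorted_0  = data.frame(count = c(3, 2, 1)),
  TRB_Unsorted_32 = data.frame(count = c(5, 4))
)

# New names must be given in the same order as the list elements
new.names <- c("Day_0", "Day_32")
stopifnot(length(new.names) == length(toy.list))

names(toy.list) <- new.names
names(toy.list)
```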

Having the data in the form of a list makes it easy to apply a function over that list using the base function lapply. For example, you may use the function dim to report the dimensions of each data frame as shown below. Notice that each data frame in the example below has at most 1,000 rows and exactly 11 columns.

lapply(TCRB.list, dim)

$TRB_CD4_949
[1] 1000   11

$TRB_CD8_949
[1] 1000   11

$TRB_CD8_CMV_369
[1] 414  11

$TRB_Unsorted_0
[1] 1000   11

$TRB_Unsorted_1320
[1] 1000   11

$TRB_Unsorted_1496
[1] 1000   11

$TRB_Unsorted_32
[1] 920  11

$TRB_Unsorted_369
[1] 1000   11

$TRB_Unsorted_83
[1] 1000   11

$TRB_Unsorted_949
[1] 1000   11

In place of dim, you may also use colnames, nrow, ncol, or other more complex functions that perform operations on subsetted columns.
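As a minimal sketch of a "more complex function", the example below applies anonymous functions to a subsetted column of each data frame. The list toy.list is hypothetical and uses only the aminoAcid and count column names seen in the example data.

```r
# Hypothetical list of two small data frames
toy.list <- list(
  sampleA = data.frame(aminoAcid = c("CASSF", "", "CASSL"),
                       count = c(10, 5, 2),
                       stringsAsFactors = FALSE),
  sampleB = data.frame(aminoAcid = c("CASSQ", "CASSP"),
                       count = c(7, 3),
                       stringsAsFactors = FALSE)
)

# Total read count per sample (sapply simplifies the result to a named vector)
total.counts <- sapply(toy.list, function(df) sum(df$count))

# Number of rows per sample that have a non-empty amino acid sequence
n.with.aa <- sapply(toy.list, function(df) sum(df$aminoAcid != ""))

total.counts
n.with.aa
```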

Subsetting data

If you imported all of the files from your project but just want to perform an analysis on a subset, use standard R methods to subset the list. Remember that a single bracket [ returns a list and a double bracket [[ returns a single data frame.

CMV <- TCRB.list[grep("CMV", names(TCRB.list))]
names(CMV)

[1] "TRB_CD8_CMV_369"

TRB_Unsorted_0 <- TCRB.list[["TRB_Unsorted_0"]]
head(TRB_Unsorted_0)

          aminoAcid
1                  
2      CASSPVSNEQFF
3    CASSQEVPPYQAFF
4                  
5   CASSQEASGRQTQYF
6 CASSLEHTGATNEKLFF
                                                                               nucleotide
1 TCAATTCCCTGGAGCTTGGTGACTCTGCTGTGTATTTCTGTGCCAGCAGCCATCGGGACAGAGAACACTGAAGCTTTCTTTGGACAA
2 CTGATTCTGGAGTCCGCCAGCACCAACCAGACATCTATGTACCTCTGTGCCAGCAGCCCCGTGAGCAATGAGCAGTTCTTCGGGCCA
3 ATCAATTCCCTGGAGCTTGGTGACTCTGCTGTGTATTTCTGTGCCAGCAGCCAAGAAGTTCCGCCTTACCAAGCTTTCTTTGGACAA
4 TGCCATCCCCAACCAGACAGCTCTTTACTTCTGTGCCACCAGTGTCCACAAACAGGGGGCAGGACCGGGGAGCTGTTTTTTGGAGAA
5 CACACCCTGCAGCCAGAAGACTCGGCCCTGTATCTCTGCGCCAGCAGCCAAGAGGCTAGCGGGAGACAGACCCAGTACTTCGGGCCA
6 GCCAGCACCAACCAGACATCTATGTACCTCTGTGCCAGCAGTTTGGAGCACACGGGTGCAACTAATGAAAAACTGTTTTTTGGCAGT
  count frequencyCount estimatedNumberGenomes vFamilyName dFamilyName
1  1450     0.06606637                   1450     TCRBV03     TCRBD01
2   822     0.03737558                    822     TCRBV28     TCRBD02
3   797     0.03635297                    797     TCRBV03            
4   702     0.03203462                    702     TCRBV24     TCRBD01
5   704     0.03201317                    704     TCRBV04     TCRBD02
6   653     0.02968602                    653     TCRBV28     TCRBD02
  jFamilyName  vGeneName  dGeneName  jGeneName
1     TCRBJ01 unresolved TCRBD01-01 TCRBJ01-01
2     TCRBJ02 TCRBV28-01 TCRBD02-01 TCRBJ02-01
3     TCRBJ01 unresolved unresolved TCRBJ01-01
4     TCRBJ02 unresolved TCRBD01-01 TCRBJ02-02
5     TCRBJ02 TCRBV04-03 TCRBD02-01 TCRBJ02-05
6     TCRBJ01 TCRBV28-01 TCRBD02-01 TCRBJ01-04

For more complex subsetting, you can use a metadata file where one column contains the file names and the other columns have additional information about the sample files. You can then subset the metadata file using criteria from the other columns to obtain a character vector of file names that you can use to subset TCRB.list. In the example below, a metadata file is imported for the example TCRB data set. It records the number of days after bone marrow transplant at which each sample was collected and the cellular phenotype the blood sample was sorted for prior to sequencing.

TCRB.metadata <- read.csv(system.file("extdata", "TCRB_metadata.csv", package = "LymphoSeq"))
TCRB.metadata

             samples  day phenotype
1     TRB_Unsorted_0    0  Unsorted
2    TRB_Unsorted_32   32  Unsorted
3    TRB_Unsorted_83   83  Unsorted
4    TRB_CD8_CMV_369  369  CD8+CMV+
5   TRB_Unsorted_369  369  Unsorted
6        TRB_CD4_949  949      CD4+
7        TRB_CD8_949  949      CD8+
8   TRB_Unsorted_949  949  Unsorted
9  TRB_Unsorted_1320 1320  Unsorted
10 TRB_Unsorted_1496 1496  Unsorted

selected <- as.character(TCRB.metadata[TCRB.metadata$phenotype == "Unsorted" & 
                                 TCRB.metadata$day > 300, "samples"])
TCRB.list.selected <- TCRB.list[selected]
names(TCRB.list.selected)

[1] "TRB_Unsorted_369"  "TRB_Unsorted_949"  "TRB_Unsorted_1320"
[4] "TRB_Unsorted_1496"

Extracting productive sequences

A productive sequence is defined as a sequence that is in frame and does not have an early stop codon. If you sequenced genomic DNA as opposed to complementary DNA made from RNA, then you will have both unproductive and productive sequences in your data files. Use the function productiveSeq to remove unproductive sequences and recompute the frequencyCount for each of your samples.
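The filtering idea can be sketched in base R on a toy data frame. This is not the productiveSeq implementation; it assumes that an out-of-frame sequence has an empty aminoAcid value, that a stop codon is marked with an asterisk, and that frequencyCount is expressed as a percentage of the remaining counts. The data are hypothetical.

```r
# Hypothetical sample: one out-of-frame row (empty amino acid) and one
# row with an early stop codon (asterisk)
toy <- data.frame(
  aminoAcid = c("", "CASSPVSNEQFF", "CASS*EVPPYQAFF", "CASSQEASGRQTQYF"),
  count = c(50, 30, 10, 20),
  stringsAsFactors = FALSE
)

# Keep rows that have an amino acid sequence without a stop codon
is.productive <- toy$aminoAcid != "" & !grepl("*", toy$aminoAcid, fixed = TRUE)
productive <- toy[is.productive, ]

# Recompute the frequency (%) over the productive sequences only
productive$frequencyCount <- productive$count / sum(productive$count) * 100
productive
```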

If you are interested in just the complementarity determining region 3 (CDR3) amino acid sequences, then set aggregate to "aminoAcid" and the count and estimated number of genomes for duplicate amino acid sequences will be summed. Note that the resulting list of data frames will have columns corresponding to “aminoAcid”, “count”, “frequencyCount”, and “estimatedNumberGenomes” (if this column is available) only. All other columns, such as those corresponding to the V, D, and J gene names, will be removed if they were included in your original file list. The reason for this is to avoid confusion since a single amino acid CDR3 sequence may be encoded by multiple different nucleotide sequences with differing V, D, and J genes.

productive.TRB.aa <- productiveSeq(file.list = TCRB.list, aggregate = "aminoAcid", 
                               prevalence = FALSE)
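To illustrate what aggregating by amino acid means, the sketch below sums the counts of two different nucleotide sequences that encode the same CDR3 amino acid sequence. It uses base R aggregate on hypothetical data and is only an approximation of the behavior described above, not the package's own code.

```r
# Hypothetical productive sequences; the first two rows encode the same
# amino acid sequence with different nucleotide sequences
toy <- data.frame(
  aminoAcid = c("CASSPVSNEQFF", "CASSPVSNEQFF", "CASSQEASGRQTQYF"),
  nucleotide = c("TGTGCCAGCAGCCCCGTG", "TGCGCCAGCAGCCCCGTG", "TGCGCCAGCAGCCAAGAG"),
  count = c(30, 10, 20),
  estimatedNumberGenomes = c(30, 10, 20),
  stringsAsFactors = FALSE
)

# Sum count and estimated genomes over rows sharing an amino acid sequence;
# the nucleotide column is dropped because it is no longer unique per row
agg <- aggregate(cbind(count, estimatedNumberGenomes) ~ aminoAcid,
                 data = toy, FUN = sum)

# Recompute the frequency (%) and sort from most to least frequent
agg$frequencyCount <- agg$count / sum(agg$count) * 100
agg <- agg[order(-agg$count), ]
agg
```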

Alternatively, you may set aggregate to "nucleotide" and the resulting list of data frames will all have the same columns as your original file list. Take note that some LymphoSeq functions require a productive sequence list aggregated specifically by amino acid, while others require one aggregated by nucleotide.

productive.TRB.nt <- productiveSeq(file.list = TCRB.list, aggregate = "nucleotide", 
                               prevalence = FALSE)

If the parameter prevalence is set to TRUE, then a new column is added to each of the data frames giving the prevalence (%) of each TCR beta CDR3 amino acid sequence in 55 healthy donor peripheral blood samples. Values range from 0 to 100% where 100% means the sequence appeared in the blood of all 55 individuals. The data for this operation resides in a separate package that is automatically loaded called LymphoSeqDB. Please refer to that package manual for more details.

Notice in the example below that there are no amino acid sequences given in the first and fourth rows of the TCRB.list data frame for sample “TRB_Unsorted_0”. This is because the nucleotide sequence is out of frame and is not translated into a productive amino acid sequence. If an asterisk (*) appears in an amino acid sequence, this indicates an early stop codon.

head(TCRB.list[["TRB_Unsorted_0"]])

          aminoAcid
1                  
2      CASSPVSNEQFF
3    CASSQEVPPYQAFF
4                  
5   CASSQEASGRQTQYF
6 CASSLEHTGATNEKLFF
                                                                               nucleotide
1 TCAATTCCCTGGAGCTTGGTGACTCTGCTGTGTATTTCTGTGCCAGCAGCCATCGGGACAGAGAACACTGAAGCTTTCTTTGGACAA
2 CTGATTCTGGAGTCCGCCAGCACCAACCAGACATCTATGTACCTCTGTGCCAGCAGCCCCGTGAGCAATGAGCAGTTCTTCGGGCCA
3 ATCAATTCCCTGGAGCTTGGTGACTCTGCTGTGTATTTCTGTGCCAGCAGCCAAGAAGTTCCGCCTTACCAAGCTTTCTTTGGACAA
4 TGCCATCCCCAACCAGACAGCTCTTTACTTCTGTGCCACCAGTGTCCACAAACAGGGGGCAGGACCGGGGAGCTGTTTTTTGGAGAA
5 CACACCCTGCAGCCAGAAGACTCGGCCCTGTATCTCTGCGCCAGCAGCCAAGAGGCTAGCGGGAGACAGACCCAGTACTTCGGGCCA
6 GCCAGCACCAACCAGACATCTATGTACCTCTGTGCCAGCAGTTTGGAGCACACGGGTGCAACTAATGAAAAACTGTTTTTTGGCAGT
  count frequencyCount estimatedNumberGenomes vFamilyName dFamilyName
1  1450     0.06606637                   1450     TCRBV03     TCRBD01
2   822     0.03737558                    822     TCRBV28     TCRBD02
3   797     0.03635297                    797     TCRBV03            
4   702     0.03203462                    702     TCRBV24     TCRBD01
5   704     0.03201317                    704     TCRBV04     TCRBD02
6   653     0.02968602                    653     TCRBV28     TCRBD02
  jFamilyName  vGeneName  dGeneName  jGeneName
1     TCRBJ01 unresolved TCRBD01-01 TCRBJ01-01
2     TCRBJ02 TCRBV28-01 TCRBD02-01 TCRBJ02-01
3     TCRBJ01 unresolved unresolved TCRBJ01-01
4     TCRBJ02 unresolved TCRBD01-01 TCRBJ02-02
5     TCRBJ02 TCRBV04-03 TCRBD02-01 TCRBJ02-05
6     TCRBJ01 TCRBV28-01 TCRBD02-01 TCRBJ01-04

After productiveSeq is run, the unproductive sequences are removed and the frequencyCount is recalculated for each sequence. If there were two identical amino acid sequences that differed in their nucleotide sequence, they would be combined and their counts added together.

head(productive.TRB.aa[["TRB_Unsorted_0"]])

          aminoAcid count frequencyCount estimatedNumberGenomes
1      CASSPVSNEQFF   822       5.773283                    822
2    CASSQEVPPYQAFF   797       5.597696                    797
3   CASSQEASGRQTQYF   704       4.944515                    704
4 CASSLEHTGATNEKLFF   653       4.586318                    653
5       CASSPGDEQYF   619       4.347521                    619
6  CSARSPSTGTLAEAFF   429       3.013064                    429

Finally, notice that the productive.TRB.nt data frame for sample “TRB_Unsorted_0” below has additional columns that are not present in productive.TRB.aa but are present in TCRB.list. This is because the data frame was aggregated by nucleotide sequence and all of the original columns from TCRB.list were carried over.

head(productive.TRB.nt[["TRB_Unsorted_0"]])

          aminoAcid
1      CASSPVSNEQFF
2    CASSQEVPPYQAFF
3   CASSQEASGRQTQYF
4 CASSLEHTGATNEKLFF
5       CASSPGDEQYF
6  CSARSPSTGTLAEAFF
                                                                               nucleotide
1 CTGATTCTGGAGTCCGCCAGCACCAACCAGACATCTATGTACCTCTGTGCCAGCAGCCCCGTGAGCAATGAGCAGTTCTTCGGGCCA
2 ATCAATTCCCTGGAGCTTGGTGACTCTGCTGTGTATTTCTGTGCCAGCAGCCAAGAAGTTCCGCCTTACCAAGCTTTCTTTGGACAA
3 CACACCCTGCAGCCAGAAGACTCGGCCCTGTATCTCTGCGCCAGCAGCCAAGAGGCTAGCGGGAGACAGACCCAGTACTTCGGGCCA
4 GCCAGCACCAACCAGACATCTATGTACCTCTGTGCCAGCAGTTTGGAGCACACGGGTGCAACTAATGAAAAACTGTTTTTTGGCAGT
5 CCCCTGACCCTGGAGTCTGCCAGGCCCTCACATACCTCTCAGTACCTCTGTGCCAGCAGTCCGGGGGACGAGCAGTACTTCGGGCCG
6 AGTGCCCATCCTGAAGACAGCAGCTTCTACATCTGCAGTGCTAGATCACCCAGTACAGGGACCCTCGCTGAAGCTTTCTTTGGACAA
  count frequencyCount estimatedNumberGenomes vFamilyName dFamilyName
1   822       5.773283                    822     TCRBV28     TCRBD02
2   797       5.597696                    797     TCRBV03            
3   704       4.944515                    704     TCRBV04     TCRBD02
4   653       4.586318                    653     TCRBV28     TCRBD02
5   619       4.347521                    619     TCRBV25     TCRBD02
6   429       3.013064                    429     TCRBV20     TCRBD01
  jFamilyName  vGeneName  dGeneName  jGeneName
1     TCRBJ02 TCRBV28-01 TCRBD02-01 TCRBJ02-01
2     TCRBJ01 unresolved unresolved TCRBJ01-01
3     TCRBJ02 TCRBV04-03 TCRBD02-01 TCRBJ02-05
4     TCRBJ01 TCRBV28-01 TCRBD02-01 TCRBJ01-04
5     TCRBJ02 TCRBV25-01 TCRBD02-01 TCRBJ02-07
6     TCRBJ01 unresolved TCRBD01-01 TCRBJ01-01

Create a table of summary statistics

To create a table summarizing the total number of sequences, number of unique productive sequences, number of genomes, entropy, clonality, Gini coefficient, and the frequency (%) of the top productive sequence in each imported file, use the function clonality.

clonality(file.list = TCRB.list)

             samples totalSequences uniqueProductiveSequences totalCount
1        TRB_CD4_949           1000                       845      25769
2   TRB_Unsorted_369           1000                       830     339413
3    TRB_Unsorted_83           1000                       823     236732
4        TRB_CD8_949           1000                       794      26239
5    TRB_CD8_CMV_369            414                       281       1794
6  TRB_Unsorted_1320           1000                       838     178190
7  TRB_Unsorted_1496           1000                       832      33669
8   TRB_Unsorted_949           1000                       831       6549
9     TRB_Unsorted_0           1000                       838      18161
10   TRB_Unsorted_32            920                       767      31078
   clonality giniCoefficient topProductiveSequence totalGenomes
1   0.442719       0.8665242             30.091732        25769
2   0.425965       0.8447387             29.720171           NA
3   0.338114       0.7766277             23.645843           NA
4   0.430615       0.9026124             19.346779        26239
5   0.331570       0.7606261             16.487936         1794
6   0.421630       0.9016617             14.579022       178190
7   0.389318       0.8812733             14.248338        33669
8   0.305784       0.7654438             13.837321         6549
9   0.280923       0.8184686              5.773283        18161
10  0.134242       0.6007820              4.865016           NA

The clonality score is derived from the Shannon entropy of the productive sequence frequencies. The entropy is normalized by dividing it by the logarithm of the total number of unique productive sequences, and this normalized entropy value is then inverted (1 - normalized entropy) to produce the clonality metric.
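The calculation can be written out as a short function. This is a minimal sketch of the definition given above, not the code used by the clonality function; the name clonality.sketch and the use of a base-2 logarithm are assumptions (the base cancels in the normalization).

```r
# Clonality as 1 minus the normalized Shannon entropy of sequence frequencies
clonality.sketch <- function(counts) {
  p <- counts / sum(counts)      # convert counts to frequencies
  p <- p[p > 0]                  # zero-frequency entries do not contribute
  entropy <- -sum(p * log2(p))   # Shannon entropy
  # Normalize by the maximum possible entropy for this many sequences.
  # With a single sequence log2(1) is 0, so the result is undefined (NaN).
  1 - entropy / log2(length(p))
}

even   <- clonality.sketch(c(10, 10, 10, 10))  # all sequences equally frequent
skewed <- clonality.sketch(c(97, 1, 1, 1))     # one dominant sequence
even
skewed
```

An even repertoire yields a clonality of 0, and the value approaches 1 as a single sequence comes to dominate.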

The Gini coefficient is an alternative metric used to calculate repertoire diversity and is derived from the Lorenz curve. The Lorenz curve is drawn such that the x-axis represents the cumulative percentage of unique sequences and the y-axis represents the cumulative percentage of reads. A line passing through the origin with a slope of 1 reflects equal frequencies of all clones. The Gini coefficient is the ratio of the area between the line of equality and the observed Lorenz curve over the total area under the line of equality.

Both Gini coefficient and clonality are reported on a scale from 0 to 1 where 0 indicates all sequences have the same frequency and 1 indicates the repertoire is dominated by a single sequence.
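The area ratio that defines the Gini coefficient can be sketched directly from the Lorenz curve. The function gini.sketch below is a hypothetical helper that integrates the curve with the trapezoid rule; it illustrates the definition and is not necessarily how the clonality function computes its giniCoefficient column.

```r
# Gini coefficient from the Lorenz curve of a vector of sequence counts
gini.sketch <- function(counts) {
  x <- sort(counts)                      # order sequences from rare to common
  n <- length(x)
  lorenz <- c(0, cumsum(x) / sum(x))     # cumulative share of reads (y-axis)
  # Trapezoid rule with n equal steps of width 1/n along the x-axis
  area.under <- sum((lorenz[-1] + lorenz[-(n + 1)]) / 2) / n
  # Area between the line of equality and the curve, over the area (0.5)
  # under the line of equality
  (0.5 - area.under) / 0.5
}

gini.even     <- gini.sketch(c(5, 5, 5, 5))     # equal frequencies
gini.dominant <- gini.sketch(c(0, 0, 0, 100))   # one sequence has all reads
gini.even
gini.dominant
```

With a finite number of sequences n, this discrete form reaches at most (n - 1) / n, so the value of 1 is approached only as n grows.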

Calculate clonal relatedness

One of the drawbacks of the clonality metric is that it does not take into account sequence similarity. This is particularly important when studying affinity maturation or B cell malignancies (Lombardo, K.A., et al. Blood Advances 2017 1:535-544). Clonal relatedness is a useful metric that takes into account sequence similarity without regard for clonal frequency. It is defined as the proportion of nucleotide sequences that are related by a defined edit distance threshold. The value ranges from 0 to 1 where 0 indicates no sequences are related and 1 indicates all sequences are related. Edit distance is a way of quantifying how dissimilar two sequences are to one another by counting the minimum number of operations required to transform one sequence into the other. For example, an edit distance of 0 means the sequences are identical and an edit distance of 1 indicates that the sequences differ by a single amino acid or nucleotide.

IGH.path <- system.file("extdata", "IGH_sequencing", package = "LymphoSeq")

IGH.list <- readImmunoSeq(path = IGH.path)

clonalRelatedness(list = IGH.list, editDistance = 10)

             samples clonalRelatedness
1  IGH_MVQ108911A_BL       0.623919308
2  IGH_MVQ194745A_BL       0.845000000
3   IGH_MVQ81231A_BL       0.623919308
4   IGH_MVQ89037A_BL       0.288184438
5   IGH_MVQ90143A_BL       0.006097561
6   IGH_MVQ92552A_BL       0.272727273
7   IGH_MVQ93505A_BL       0.392201835
8   IGH_MVQ93631A_BL       0.757000000
9   IGH_MVQ94865A_BL       0.007000000
10  IGH_MVQ95413A_BL       0.003636364
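The edit distance and the resulting proportion can be sketched with the base R function adist, which computes the generalized Levenshtein distance. The sequences and counts below are hypothetical, and measuring relatedness against the most frequent sequence is an assumption made for this illustration rather than a statement of how clonalRelatedness is implemented.

```r
# Hypothetical nucleotide sequences and their counts
seqs   <- c("ACGTACGT", "ACGTACGA", "ACGTTCGA", "TTTTGGGG")
counts <- c(50, 20, 5, 1)

# Reference sequence: the most frequent one (an assumption of this sketch)
top <- seqs[which.max(counts)]

# Edit distance from the reference to every sequence, including itself
d <- as.vector(adist(top, seqs))

# Proportion of sequences within the edit distance threshold
threshold <- 2
relatedness <- sum(d <= threshold) / length(seqs)
d
relatedness
```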

Draw a phylogenetic tree

A phylogenetic tree is a useful way to visualize the similarity between sequences. The phyloTree function creates a phylogenetic tree of a single sample using neighbor joining tree estimation for amino acid or nucleotide CDR3 sequences. Each leaf in the tree represents a sequence color coded by the V, D, and J gene usage. The number next to each leaf refers to the sequence count. A triangle shaped leaf indicates the most frequent sequence. The distance between leaves on the horizontal axis corresponds to the sequence similarity (i.e. the further apart the leaves are horizontally, the less similar the sequences are to one another).

productive.IGH.nt <- productiveSeq(file.list = IGH.list, aggregate = "nucleotide")

phyloTree(list = productive.IGH.nt, sample = "IGH_MVQ92552A_BL", type = "nucleotide", 
         layout = "rectangular")

Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
of ggplot2 3.3.4.
ℹ The deprecated feature was likely used in the LymphoSeq package.
  Please report the issue to the authors.
This warning is displayed once every 8 hours.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
generated.

Multiple sequence alignment

In LymphoSeq, you can perform a multiple sequence alignment using one of three methods provided by the Bioconductor msa package (ClustalW, ClustalOmega, or Muscle) and output results to the console or as a pdf file. One may perform the alignment of all amino acid or nucleotide sequences in a single sample. Alternatively, one may search for a given sequence within a list of samples using an edit distance threshold.

alignSeq(list = productive.IGH.nt, sample = "IGH_MVQ92552A_BL", type = "aminoAcid", 
         method = "ClustalW", output = "console")

use default substitution matrix

Searching for sequences

To search for one or more amino acid or nucleotide CDR3 sequences in a list of data frames, use the function searchSeq. You may specify to search in either a list of productive or unproductive data frames.

searchSeq(list = productive.TRB.aa, sequence = "CASSPVSNEQFF", type = "aminoAcid", 
          match = "global", editDistance = 0)

          sample    aminoAcid count frequencyCount estimatedNumberGenomes
1 TRB_Unsorted_0 CASSPVSNEQFF   822    5.773282764                    822
2    TRB_CD8_949 CASSPVSNEQFF     2    0.008923791                      2

If you have only a partial sequence, set the parameter match to "partial". If you are looking for related sequences that differ by one or more nucleotides or amino acids, then increase the editDistance value. Edit distance is a way of quantifying how dissimilar two sequences are to one another by counting the minimum number of operations required to transform one sequence into the other. For example, an edit distance of 0 means the sequences are identical and an edit distance of 1 indicates that the sequences differ by a single amino acid or nucleotide.
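The three kinds of matching described here (exact, partial, and within an edit distance) can be sketched in base R. This illustrates the concepts only and does not reproduce searchSeq; the repertoire vector and the partial query are hypothetical.

```r
# Hypothetical CDR3 amino acid sequences from one sample
repertoire <- c("CASSPVSNEQFF", "CASSPVSNEKFF", "CASSQEASGRQTQYF")
query <- "CASSPVSNEQFF"

# Exact match over the full length of the sequence
exact <- repertoire[repertoire == query]

# Partial match: the query fragment appears anywhere in the sequence
partial <- repertoire[grepl("SPVSN", repertoire, fixed = TRUE)]

# Related sequences: at most one edit away from the query
near <- repertoire[as.vector(adist(query, repertoire)) <= 1]

exact
partial
near
```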

Searching for published sequences

To search your entire list of data frames for a published amino acid CDR3 TCRB sequence with known antigen specificity, use the function searchPublished.

published <- searchPublished(list = productive.TRB.aa)
head(published)

             sample     aminoAcid count frequencyCount estimatedNumberGenomes
1   TRB_Unsorted_32 CASASSGTDTQYF    33    0.131594688                      0
2 TRB_Unsorted_1496 CASSETGGTEAFF     2    0.007267706                      2
3  TRB_Unsorted_949  CASSFSTDTQYF     2    0.038277512                      2
4       TRB_CD8_949 CASSIRSAYEQYF     7    0.031233268                      7
5 TRB_Unsorted_1320 CASSIRSAYEQYF    15    0.010313886                     15
6 TRB_Unsorted_1496 CASSIRSSYEQYF     4    0.014535412                      4
      PMID         HLA  antigen         epitope prevalence
1 20647322 HLA-A*24:02 Leukemia            <NA>        7.3
2 19786555 HLA-A*02:01 Melanoma      EAAGIGILTV       34.5
3 23267020    HLA-A*02      EBV BMFL1-GLCTLVAML       85.5
4 21048112    HLA-A*02      EBV BMLF1-GLCTLVAML       18.2
5 21048112    HLA-A*02      EBV BMLF1-GLCTLVAML       18.2
6 21048112    HLA-A*02      EBV BMLF1-GLCTLVAML       45.5

For each sequence found, a table is provided listing the antigen, epitope, HLA type, PubMed ID (PMID), and prevalence (%) of the sequence among 55 healthy donor blood samples. The data for this function resides in the separate LymphoSeqDB package that is automatically loaded when the function is called. Please refer to that package manual for more details.

Visualizing repertoire diversity

Antigen receptor repertoire diversity can be characterized by a number such as clonality or Gini coefficient calculated by the clonality function. Alternatively, you can visualize the repertoire diversity by plotting the Lorenz curve for each sample as defined above. In this plot, the more diverse samples will appear near the dotted diagonal line (the line of equality) whereas the more clonal samples will appear to have a more bowed shape.

lorenzCurve(samples = names(productive.TRB.aa), list = productive.TRB.aa)

Alternatively, you can get a feel for the repertoire diversity by plotting the cumulative frequency of a selected number of the most frequent clones using the function topSeqsPlot. In this case, each of the top sequences is represented by a different color and all less frequent clones are assigned a single color (violet).

topSeqsPlot(list = productive.TRB.aa, top = 10)

Both of these functions are built using the ggplot2 package. You can reformat the plots using ggplot2 functions. Please refer to the lorenzCurve and topSeqsPlot manuals for specific examples.

Comparing samples

To compare the T or B cell repertoires of all samples in a pairwise fashion, use the bhattacharyyaMatrix or similarityMatrix functions. Both the Bhattacharyya coefficient and similarity score are measures of the amount of overlap between two samples. The value for each ranges from 0 to 1 where 1 indicates the sequence frequencies are identical in the two samples and 0 indicates no shared sequences. The Bhattacharyya coefficient differs from the similarity score in that it weights each shared sequence by the geometric mean of its frequencies in the two samples, whereas the similarity score weights each shared sequence by the arithmetic mean of its frequencies in the two samples.

bhattacharyya.matrix <- bhattacharyyaMatrix(productive.seqs = productive.TRB.aa)
bhattacharyya.matrix

                  TRB_Unsorted_1496 TRB_Unsorted_1320 TRB_Unsorted_949
TRB_Unsorted_1496        1.00000000        0.95291865       0.85590946
TRB_Unsorted_1320        0.95291865        1.00000000       0.87674297
TRB_Unsorted_949         0.85590946        0.87674297       1.00000000
TRB_Unsorted_369         0.52401899        0.53276440       0.51749643
TRB_Unsorted_83          0.30375115        0.33141730       0.37518862
TRB_Unsorted_32          0.28782025        0.31292611       0.28018709
TRB_CD8_CMV_369          0.75528817        0.77886610       0.71503376
TRB_Unsorted_0           0.01542965        0.01627338       0.01388704
TRB_CD8_949              0.78871302        0.81794347       0.81414832
TRB_CD4_949              0.44361207        0.43743309       0.42587691
                  TRB_Unsorted_369 TRB_Unsorted_83 TRB_Unsorted_32
TRB_Unsorted_1496      0.524018989      0.30375115     0.287820253
TRB_Unsorted_1320      0.532764399      0.33141730     0.312926108
TRB_Unsorted_949       0.517496433      0.37518862     0.280187090
TRB_Unsorted_369       1.000000000      0.46978757     0.192342747
TRB_Unsorted_83        0.469787569      1.00000000     0.297865580
TRB_Unsorted_32        0.192342747      0.29786558     1.000000000
TRB_CD8_CMV_369        0.512020867      0.40059194     0.272207037
TRB_Unsorted_0         0.008900138      0.01347907     0.008658058
TRB_CD8_949            0.532143928      0.43208064     0.350967319
TRB_CD4_949            0.176016728      0.06449128     0.023973222
                  TRB_CD8_CMV_369 TRB_Unsorted_0 TRB_CD8_949 TRB_CD4_949
TRB_Unsorted_1496     0.755288167    0.015429649  0.78871302 0.443612066
TRB_Unsorted_1320     0.778866101    0.016273376  0.81794347 0.437433093
TRB_Unsorted_949      0.715033758    0.013887037  0.81414832 0.425876907
TRB_Unsorted_369      0.512020867    0.008900138  0.53214393 0.176016728
TRB_Unsorted_83       0.400591940    0.013479071  0.43208064 0.064491279
TRB_Unsorted_32       0.272207037    0.008658058  0.35096732 0.023973222
TRB_CD8_CMV_369       1.000000000    0.008967238  0.86559885 0.001116121
TRB_Unsorted_0        0.008967238    1.000000000  0.04164991 0.006956798
TRB_CD8_949           0.865598846    0.041649912  1.00000000 0.000000000
TRB_CD4_949           0.001116121    0.006956798  0.00000000 1.000000000

similarity.matrix <- similarityMatrix(productive.seqs = productive.TRB.aa)
similarity.matrix

                  TRB_Unsorted_1496 TRB_Unsorted_1320 TRB_Unsorted_949
TRB_Unsorted_1496        1.00000000         0.9581854       0.89906548
TRB_Unsorted_1320        0.95818541         1.0000000       0.93863667
TRB_Unsorted_949         0.89906548         0.9386367       1.00000000
TRB_Unsorted_369         0.74206009         0.6514381       0.77154129
TRB_Unsorted_83          0.54589852         0.4325548       0.59723589
TRB_Unsorted_32          0.32105103         0.4035552       0.26159989
TRB_CD8_CMV_369          0.67188308         0.7063167       0.65222570
TRB_Unsorted_0           0.02730081         0.0371885       0.01387248
TRB_CD8_949              0.78972983         0.7648513       0.90794949
TRB_CD4_949              0.47223856         0.2900181       0.72941015
                  TRB_Unsorted_369 TRB_Unsorted_83 TRB_Unsorted_32
TRB_Unsorted_1496       0.74206009      0.54589852      0.32105103
TRB_Unsorted_1320       0.65143812      0.43255478      0.40355518
TRB_Unsorted_949        0.77154129      0.59723589      0.26159989
TRB_Unsorted_369        1.00000000      0.47127662      0.23387758
TRB_Unsorted_83         0.47127662      1.00000000      0.42829336
TRB_Unsorted_32         0.23387758      0.42829336      1.00000000
TRB_CD8_CMV_369         0.70312873      0.58191936      0.17384922
TRB_Unsorted_0          0.01588281      0.02653832      0.01017423
TRB_CD8_949             0.69642713      0.60793633      0.40165091
TRB_CD4_949             0.11852468      0.04184631      0.03070217
                  TRB_CD8_CMV_369 TRB_Unsorted_0 TRB_CD8_949 TRB_CD4_949
TRB_Unsorted_1496     0.671883079    0.027300812  0.78972983 0.472238563
TRB_Unsorted_1320     0.706316742    0.037188504  0.76485132 0.290018135
TRB_Unsorted_949      0.652225696    0.013872476  0.90794949 0.729410153
TRB_Unsorted_369      0.703128733    0.015882807  0.69642713 0.118524684
TRB_Unsorted_83       0.581919360    0.026538319  0.60793633 0.041846313
TRB_Unsorted_32       0.173849223    0.010174234  0.40165091 0.030702175
TRB_CD8_CMV_369       1.000000000    0.007247298  0.85081995 0.001761048
TRB_Unsorted_0        0.007247298    1.000000000  0.04387449 0.031309635
TRB_CD8_949           0.850819946    0.043874488  1.00000000 0.000000000
TRB_CD4_949           0.001761048    0.031309635  0.00000000 1.000000000
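The two overlap measures in this section can be sketched for a single pair of samples. The Bhattacharyya coefficient below follows its standard definition (the sum over shared sequences of the square root of the product of the two frequencies). The similarity score is written under the assumption that it equals the counts of shared sequences in both samples divided by the total counts of both samples; both function names and the data are hypothetical.

```r
# Hypothetical productive amino acid data frames for two samples
s1 <- data.frame(aminoAcid = c("CASSF", "CASSL", "CASSQ"),
                 count = c(50, 30, 20), stringsAsFactors = FALSE)
s2 <- data.frame(aminoAcid = c("CASSL", "CASSQ", "CASSP"),
                 count = c(10, 40, 50), stringsAsFactors = FALSE)

# Sum of the geometric means of the frequencies of shared sequences
bhattacharyya.sketch <- function(a, b) {
  m <- merge(a, b, by = "aminoAcid")   # inner join keeps shared sequences
  p <- m$count.x / sum(a$count)        # frequency in the first sample
  q <- m$count.y / sum(b$count)        # frequency in the second sample
  sum(sqrt(p * q))
}

# Fraction of all counts that belong to shared sequences (assumed definition)
similarity.sketch <- function(a, b) {
  shared <- intersect(a$aminoAcid, b$aminoAcid)
  shared.counts <- sum(a$count[a$aminoAcid %in% shared]) +
    sum(b$count[b$aminoAcid %in% shared])
  shared.counts / (sum(a$count) + sum(b$count))
}

bc  <- bhattacharyya.sketch(s1, s2)
sim <- similarity.sketch(s1, s2)
bc
sim
```

Comparing a sample with itself returns 1 for both measures, and two samples with no sequences in common return 0.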

The results of either function can be visualized by the pairwisePlot function.

pairwisePlot(matrix = bhattacharyya.matrix)

To view sequences shared between two or more samples, use the function commonSeqs. This function requires that a productive amino acid list be specified.

common <- commonSeqs(samples = c("TRB_Unsorted_0", "TRB_Unsorted_32"), 
                    productive.aa = productive.TRB.aa)
head(common)

        aminoAcid TRB_Unsorted_0 TRB_Unsorted_32
1 CASSQDRTGQYGYTF     0.47057171      0.80551900
2    CAWTGGTTEAFF     0.10535188      0.15153328
3    CAISEGNYGYTF     0.03511729      0.18742274
4   CASSFGIQETQYF     0.01404692      0.09570523
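The idea behind finding shared sequences can be sketched with base R set operations. The example intersects the amino acid sequences of every sample in a hypothetical list and then collects their frequencies side by side; it mirrors the shape of the table above but is not the commonSeqs implementation.

```r
# Hypothetical productive amino acid list with two samples
toy.list <- list(
  s1 = data.frame(aminoAcid = c("CASSF", "CASSL", "CASSQ"),
                  frequencyCount = c(50, 30, 20), stringsAsFactors = FALSE),
  s2 = data.frame(aminoAcid = c("CASSL", "CASSQ", "CASSP"),
                  frequencyCount = c(10, 40, 50), stringsAsFactors = FALSE)
)

# Sequences present in every sample of the list
shared <- Reduce(intersect, lapply(toy.list, function(df) df$aminoAcid))

# One row per shared sequence, one frequency column per sample
common <- data.frame(aminoAcid = shared, stringsAsFactors = FALSE)
for (s in names(toy.list)) {
  df <- toy.list[[s]]
  common[[s]] <- df$frequencyCount[match(shared, df$aminoAcid)]
}
common
```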

To visualize the number of overlapping sequences between two or three samples in the form of a Venn diagram, use the function commonSeqsVenn.

commonSeqsVenn(samples = c("TRB_Unsorted_32", "TRB_Unsorted_83"), 
               productive.seqs = productive.TRB.aa)

commonSeqsVenn(samples = c("TRB_Unsorted_0", "TRB_Unsorted_32", "TRB_Unsorted_83"), 
               productive.seqs = productive.TRB.aa)

To compare the frequency of sequences between two samples as a scatter plot, use the function commonSeqsPlot.

commonSeqsPlot("TRB_Unsorted_32", "TRB_Unsorted_83", 
               productive.aa = productive.TRB.aa, show = "common")

If you have more than 3 samples to compare, use the commonSeqsBar function. You can choose to color a single sample with the color.sample argument or a desired intersection with the color.intersection argument.

commonSeqsBar(productive.aa = productive.TRB.aa, 
              samples = c("TRB_CD4_949", "TRB_CD8_949", 
                          "TRB_Unsorted_949", "TRB_Unsorted_1320"), 
              color.sample = "TRB_CD8_949",
              labels = "no")

Differential abundance

When comparing samples from two different time points, it is useful to identify sequences that are significantly more or less abundant at one time point than at the other (DeWitt, W.S., et al. Journal of Virology 2015 89(8):4517-4526). The differentialAbundance function uses Fisher's exact test to calculate the differential abundance of each sequence between the two time points and reports the log2 transformed fold change, P value, and adjusted P value. Only sequences with an adjusted P value below the threshold given by q are returned.

differentialAbundance(list = productive.TRB.aa, 
                      sample1 = "TRB_Unsorted_949", 
                      sample2 = "TRB_Unsorted_1320", 
                      type = "aminoAcid", q = 0.01)

                aminoAcid TRB_Unsorted_949 TRB_Unsorted_1320            p
1         CASSPPTGERDTQYF       5.62679426       12.93567573 2.406577e-66
2  CASSQDLMTVDSLFAGANVLTF       8.05741627        3.11891910 2.038483e-63
3         CASSLAGDSQETQYF       2.02870813        6.57888404 6.843544e-52
4         CASSSIKTGATEAFF       0.61244019        0.01856499 3.382517e-31
5           CASSQDTGNEQFF       0.32535885        0.00000000 1.481077e-25
6         CASSQGGSYNSPLHF       0.28708134        0.00000000 1.238489e-22
7         CASSFHRDAAYGYTF       0.24880383        0.00000000 1.034867e-19
8         CASSQEGGRDNEQFF       0.36363636        0.01031389 1.991455e-19
9           CASSPWSNEKLFF       0.22966507        0.00000000 2.990614e-18
10        CASSYFRDGGELGTF       0.57416268        0.07013442 2.034201e-16
11         CASSDMAIGREQYF       0.19138756        0.00000000 2.496149e-15
12        CASSRIGRERDEQYF       0.19138756        0.00000000 2.496149e-15
13      CATSDRRQADNVDIQYF       0.19138756        0.00000000 2.496149e-15
14        CASRDGQGSGNTIYF       0.91866029        0.25647196 9.081279e-13
15           CASRQGSYGYTF       0.15311005        0.00000000 2.081896e-12
16      CASSFPRLAGGTDTQYF       0.15311005        0.00000000 2.081896e-12
17       CASSQDERFSGNTIYF       0.15311005        0.00000000 2.081896e-12
18      CASSRQGTFSGNTEAFF       0.24880383        0.01237666 1.186840e-11
19           CASSYVGDGYTF       1.12918660        2.44507856 1.936386e-11
20          CASSAGDGYEQYF       0.13397129        0.00000000 6.010805e-11
21        CASSLRWGATGELFF       0.13397129        0.00000000 6.010805e-11
22     CASSQNLVRTANNSPLHF       0.13397129        0.00000000 6.010805e-11
23        CASTLRWGKTGELFF       0.13397129        0.00000000 6.010805e-11
24         CASLGAGGWTEAFF       0.38277512        0.05225702 1.047460e-10
25         CASSLYPSTDTQYF       0.40191388        1.23285316 8.660086e-10
26      CASSFNRYSSSYNEQFF       0.30622010        0.03506721 8.950179e-10
27        CASSNPGQTSYGYTF       0.11483254        0.00000000 1.735106e-09
28          CASSQDSGGEQYF       0.11483254        0.00000000 1.735106e-09
29          CASSRGPMTEAFF       0.11483254        0.00000000 1.735106e-09
30      CASSLNGPGQGAYEQYF       0.09569378        0.00000000 5.007709e-08
31          CASSQDTGYEQYF       0.09569378        0.00000000 5.007709e-08
32         CASSQSGDYNEQFF       0.09569378        0.00000000 5.007709e-08
33        CASSQVLELGRGYTF       0.09569378        0.00000000 5.007709e-08
34       CSAREMAGGLNSPLHF       0.09569378        0.00000000 5.007709e-08
35      CAWSDFQGPRSGNTIYF       0.47846890        1.15102967 6.651968e-07
36           CASSPDKWGYTF       0.59330144        1.29886203 1.152347e-06
37      CASSLPTGGLMNTEAFF       0.07655502        0.00000000 1.445013e-06
38         CASSNGRKNTEAFF       0.07655502        0.00000000 1.445013e-06
39        CASSPLVTGNTEAFF       0.07655502        0.00000000 1.445013e-06
40          CASSQDSGNEQFF       0.07655502        0.00000000 1.445013e-06
41      CASSQDSQGVGKNIQYF       0.07655502        0.00000000 1.445013e-06
42            CASSSGQRPYF       0.07655502        0.00000000 1.445013e-06
43          CSASLNHALEQYF       0.47846890        1.07401932 6.437625e-06
              q      l2fc
1  3.205561e-63 -1.200970
2  2.713221e-60  1.369271
3  9.101914e-49 -1.697282
4  4.495366e-28  5.043912
5  1.966870e-22  8.345888
6  1.643475e-19  8.165316
7  1.372234e-16  7.958865
8  2.638678e-16  5.139837
9  3.959573e-15  7.843388
10 2.691248e-13  3.033265
11 3.299908e-12  7.580353
12 3.299908e-12  7.580353
13 3.299908e-12  7.580353
14 1.197821e-09  1.840730
15 2.743938e-09  7.258425
16 2.743938e-09  7.258425
17 2.743938e-09  7.258425
18 1.560694e-08  4.329314
19 2.544412e-08 -1.114597
20 7.892187e-08  7.065780
21 7.892187e-08  7.065780
22 7.892187e-08  7.065780
23 7.892187e-08  7.065780
24 1.371125e-07  2.872800
25 1.132739e-06 -1.617043
26 1.169788e-06  3.126374
27 2.266048e-06  6.843388
28 2.266048e-06  6.843388
29 2.266048e-06  6.843388
30 6.525044e-05  6.580353
31 6.525044e-05  6.580353
32 6.525044e-05  6.580353
33 6.525044e-05  6.580353
34 6.525044e-05  6.580353
35 8.634255e-04 -1.266428
36 1.494594e-03 -1.130411
37 1.872737e-03  6.258425
38 1.872737e-03  6.258425
39 1.872737e-03  6.258425
40 1.872737e-03  6.258425
41 1.872737e-03  6.258425
42 1.872737e-03  6.258425
43 8.304536e-03 -1.166523
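
To make the test concrete, the sketch below applies Fisher's exact test to a single sequence using base R. All counts are made up, the variable names are hypothetical, and the choice of the Holm method for the adjusted P value is an assumption made for illustration, so treat this as an outline of the idea and not as the package's code.

```r
# Hypothetical read counts for one sequence in two samples
count1 <- 40; total1 <- 5000   # sample 1: reads of this sequence, all reads
count2 <- 5;  total2 <- 6000   # sample 2: reads of this sequence, all reads

# 2 x 2 contingency table: this sequence versus all other sequences
tab <- matrix(c(count1, total1 - count1,
                count2, total2 - count2),
              nrow = 2,
              dimnames = list(c("sequence", "other"),
                              c("sample1", "sample2")))

# Two-sided Fisher's exact test
p <- stats::fisher.test(tab)$p.value

# log2 fold change of the frequency in sample 1 over sample 2
l2fc <- log2((count1 / total1) / (count2 / total2))

# Adjust for multiple testing across sequences; the second and third
# P values are invented, and "holm" is an assumed choice of method
q <- stats::p.adjust(c(p, 0.04, 0.5), method = "holm")
```

A positive l2fc means the sequence is more frequent in the first sample, which matches the sign convention visible in the output above.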

Finding recurring sequences

To create a data frame with unique, productive amino acid sequences as rows and sample names as column headers, use the seqMatrix function. Each value in the data frame is the frequency (%) of that sequence in the corresponding sample. You can supply your own list of sequences, or use all unique sequences in the list by passing the output of the function uniqueSeqs. The uniqueSeqs function creates a data frame of all unique, productive sequences and reports their total count across all samples.

unique.seqs <- uniqueSeqs(productive.aa = productive.TRB.aa)
head(unique.seqs)

                  aminoAcid count
3143        CASSQDWERLGEQFF 99480
2178          CASSLQGREKLFF 90567
3039 CASSQDLMTVDSLFAGANVLTF 68682
2506        CASSPAGAYYNEQFF 30454
2744        CASSPPTGERDTQYF 24703
1642        CASSLAGDSQETQYF 22147

sequence.matrix <- seqMatrix(productive.aa = productive.TRB.aa, sequences = unique.seqs$aminoAcid)
head(sequence.matrix)

               aminoAcid numberSamples TRB_CD4_949 TRB_CD8_949 TRB_CD8_CMV_369
1        CASSQDRTGQYGYTF             9           0  0.99500268       0.6032172
2          CASSLQGREKLFF             8           0  8.73192932       5.1608579
3 CASSQDLMTVDSLFAGANVLTF             8           0 10.65054435       8.8471850
4          CASSYSGNTEAFF             8           0  0.16955203       0.1340483
5      CASSFMDWTGGNSPLHF             8           0  0.04461895       0.4691689
6          CASSREGDQPQHF             8           0  0.29894699       0.1340483
  TRB_Unsorted_0 TRB_Unsorted_1320 TRB_Unsorted_1496 TRB_Unsorted_32
1     0.47057171        0.43249562        0.50147171      0.80551900
2     0.00000000        7.36548974        4.86209528      4.86501575
3     0.00000000        3.11891910        2.60910644      1.55122224
4     0.04214075        0.07082202        0.07631091      0.00000000
5     0.00000000        0.04606869        0.04360624      0.03588946
6     0.00000000        0.29085158        0.47240089      2.84723053
  TRB_Unsorted_369 TRB_Unsorted_83 TRB_Unsorted_949
1        1.0268620       1.8625084       0.80382775
2        9.0363857      23.6458434       6.18181818
3       12.1288068      12.1922371       8.05741627
4        0.3891633       0.8385633       0.07655502
5        0.5047288       0.6695954       0.03827751
6        0.2220807       0.1573815       0.19138756
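
The layout of this matrix can be reproduced in a few lines of base R. The sketch below uses a made-up list of two samples (toy.list, S1, and S2 are hypothetical names) and is only meant to show how the numberSamples and per-sample columns relate to the input. It is not the seqMatrix implementation.

```r
# Hypothetical toy list of samples
toy.list <- list(
  S1 = data.frame(aminoAcid = c("CASSAAF", "CASSBBF"),
                  frequencyCount = c(70, 30)),
  S2 = data.frame(aminoAcid = c("CASSBBF", "CASSCCF"),
                  frequencyCount = c(40, 60))
)

# All unique sequences across the samples
sequences <- unique(unlist(lapply(toy.list,
                                  function(x) as.character(x$aminoAcid)),
                           use.names = FALSE))

# For each sample, look up the frequency of every sequence (0 if absent)
freq <- sapply(toy.list, function(x) {
  f <- x$frequencyCount[match(sequences, as.character(x$aminoAcid))]
  ifelse(is.na(f), 0, f)
})

# One row per sequence, one column per sample
toy.matrix <- data.frame(aminoAcid = sequences,
                         numberSamples = rowSums(freq > 0),
                         freq, row.names = NULL)
toy.matrix
```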

If just the top clones with a frequency greater than a specified amount are of interest to you, then use the topFreq function. This creates a data frame of the top productive amino acid sequences having a minimum specified frequency and reports the minimum, maximum, and mean frequency that the sequence appears in a list of samples. For TCRB sequences, the prevalence (%) and the published antigen specificity of that sequence are also provided.

top.freq <- topFreq(productive.aa = productive.TRB.aa, percent = 0.1)
head(top.freq)

                 aminoAcid minFrequency maxFrequency meanFrequency
387        CASSQDRTGQYGYTF   0.43249562    1.8625084     0.8334973
276          CASSLQGREKLFF   4.86209528   23.6458434     8.7311794
382 CASSQDLMTVDSLFAGANVLTF   1.55122224   12.1922371     7.3944297
436          CASSREGDQPQHF   0.13404826    2.8472305     0.5767910
158      CASSFMDWTGGNSPLHF   0.03588946    0.6695954     0.2314942
536          CASSYSGNTEAFF   0.04214075    0.8385633     0.2246444
    numberSamples prevalence antigen
387             9        3.6        
276             8       30.9        
382             8        1.8        
436             8       18.2        
158             8        1.8        
536             8       69.1     CMV
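
The output above is consistent with the minimum, maximum, and mean being taken over only those samples in which a sequence was detected. A base R sketch of that summary follows. The toy frequencies are made up, summarize and toy.top are hypothetical names, and applying the threshold to the maximum frequency is an assumption.

```r
# Hypothetical frequency matrix: rows are sequences, columns are samples
freq <- rbind(CASSAAF = c(S1 = 2.0, S2 = 0.00, S3 = 4.0),
              CASSBBF = c(S1 = 0.5, S2 = 1.50, S3 = 1.0),
              CASSCCF = c(S1 = 0.0, S2 = 0.02, S3 = 0.0))

# Summarize one sequence over the samples in which it was detected
summarize <- function(f) {
  present <- f[f > 0]
  c(minFrequency = min(present),
    maxFrequency = max(present),
    meanFrequency = mean(present),
    numberSamples = length(present))
}

# One row of summary statistics per sequence
toy.top <- as.data.frame(t(apply(freq, 1, summarize)))

# Keep sequences reaching at least 0.1% in some sample (assumed rule)
kept <- toy.top[toy.top$maxFrequency >= 0.1, ]
kept
```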

It is often useful to merge the output of seqMatrix and topFreq into a single data frame.

top.freq <- topFreq(productive.aa = productive.TRB.aa, percent = 0)
top.freq.matrix <- merge(top.freq, sequence.matrix)
head(top.freq.matrix)

            aminoAcid numberSamples minFrequency maxFrequency meanFrequency
1       CAAGDTTLYEQYF             1  0.014046917  0.014046917   0.014046917
2        CAAGTSGDTQYF             1  0.019138756  0.019138756   0.019138756
3      CAARGGGESYEQYF             1  0.028093833  0.028093833   0.028093833
4   CAATRRQGDVMNTEAFF             1  0.083541316  0.083541316   0.083541316
5 CAAWGTGPLGSSGANVLTF             1  0.029931447  0.029931447   0.029931447
6         CACALGDGYTF             1  0.001375185  0.001375185   0.001375185
  prevalence antigen TRB_CD4_949 TRB_CD8_949 TRB_CD8_CMV_369 TRB_Unsorted_0
1        0.0                   0           0               0     0.01404692
2        1.8                   0           0               0     0.00000000
3        0.0                   0           0               0     0.02809383
4        0.0                   0           0               0     0.00000000
5        0.0                   0           0               0     0.00000000
6        0.0                   0           0               0     0.00000000
  TRB_Unsorted_1320 TRB_Unsorted_1496 TRB_Unsorted_32 TRB_Unsorted_369
1       0.000000000                 0               0       0.00000000
2       0.000000000                 0               0       0.00000000
3       0.000000000                 0               0       0.00000000
4       0.000000000                 0               0       0.08354132
5       0.000000000                 0               0       0.00000000
6       0.001375185                 0               0       0.00000000
  TRB_Unsorted_83 TRB_Unsorted_949
1      0.00000000       0.00000000
2      0.00000000       0.01913876
3      0.00000000       0.00000000
4      0.00000000       0.00000000
5      0.02993145       0.00000000
6      0.00000000       0.00000000

Tracking sequences across samples

To visually track the frequency of sequences across multiple samples, use the function cloneTrack. This function takes the output from the seqMatrix function. You can specify a character vector of amino acid sequences using the parameter track to highlight those sequences with a different color. Alternatively, you can highlight all of the sequences from a given sample using the parameter map. If the mapping feature is used, then you must specify a productive amino acid list and a character vector of labels to title the mapped samples. To hide sequences that are not being tracked or mapped, set unassigned to FALSE.

cloneTrack(sequence.matrix = sequence.matrix, 
           productive.aa = productive.TRB.aa, 
           map = c("TRB_CD4_949", "TRB_CD8_949"), 
           label = c("CD4", "CD8"), 
           track = "CASSPPTGERDTQYF", 
           unassigned = FALSE)

Refer to the cloneTrack manual for examples of how to reformat the chart using ggplot2 functions.

Comparing V(D)J gene usage

To compare the V, D, and J gene usage across samples, start by creating a data frame of V, D, and J gene counts and frequencies using the function geneFreq. You can specify if you are interested in the “VDJ”, “DJ”, “VJ”, “V”, “D”, or “J” loci using the locus parameter. Set family to TRUE if you prefer the family names instead of the gene names as reported by ImmunoSEQ.

vGenes <- geneFreq(productive.nt = productive.TRB.nt, locus = "V", family = TRUE)
head(vGenes)

           samples familyName count frequencyGene
1 TRB_Unsorted_949    TCRBV02    72      1.377990
2 TRB_Unsorted_949    TCRBV03   165      3.157895
3 TRB_Unsorted_949    TCRBV04   780     14.928230
4 TRB_Unsorted_949    TCRBV05   398      7.617225
5 TRB_Unsorted_949    TCRBV06   413      7.904306
6 TRB_Unsorted_949    TCRBV07   475      9.090909
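
The counts and frequencies above are consistent with read counts being summed per gene family and then divided by the sample total. A minimal base R sketch of that tally for a single made-up sample is shown below. The column name vFamilyName and all values are assumptions for illustration, and the sketch does not reproduce geneFreq itself.

```r
# Hypothetical toy sample: one row per sequence with its V family and read count
toy <- data.frame(vFamilyName = c("TCRBV05", "TCRBV05", "TCRBV07", "TCRBV12"),
                  count = c(10, 30, 40, 20))

# Sum the read counts for each V gene family
toy.counts <- aggregate(count ~ vFamilyName, data = toy, FUN = sum)

# Express each family as a percentage of all reads in the sample
toy.counts$frequencyGene <- toy.counts$count / sum(toy.counts$count) * 100
toy.counts
```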

To create a chord diagram showing VJ or DJ gene associations from one or more samples, combine the output of geneFreq with the function chordDiagramVDJ. This function works well with the topSeqs function, which creates a data frame of a selected number of top productive sequences. In the example below, a chord diagram is made showing the association between V and J genes of just the single dominant clones in each sample. The size of the ribbons connecting VJ genes corresponds to the number of samples that have that recombination event. The thicker the ribbon, the higher the frequency of the recombination.

top.seqs <- topSeqs(productive.seqs = productive.TRB.nt, top = 1)
chordDiagramVDJ(sample = top.seqs, 
                association = "VJ", 
                colors = c("darkred", "navyblue"))

You can also visualize the results of geneFreq as a heat map, word cloud, or cumulative frequency bar plot with the support of additional R packages as shown below.

vGenes <- geneFreq(productive.nt = productive.TRB.nt, locus = "V", family = TRUE)
library(RColorBrewer)
library(grDevices)
RedBlue <- grDevices::colorRampPalette(rev(RColorBrewer::brewer.pal(11, "RdBu")))(256)
library(wordcloud)
wordcloud::wordcloud(words = vGenes[vGenes$samples == "TRB_Unsorted_83", "familyName"], 
                     freq = vGenes[vGenes$samples == "TRB_Unsorted_83", "frequencyGene"], 
                     colors = RedBlue)

library(reshape)
vGenes <- reshape::cast(vGenes, familyName ~ samples, value = "frequencyGene", sum)
rownames(vGenes) <- as.character(vGenes$familyName)
vGenes$familyName <- NULL
library(pheatmap)
pheatmap::pheatmap(vGenes, color = RedBlue, scale = "row")

vGenes <- geneFreq(productive.nt = productive.TRB.nt, locus = "V", family = TRUE)
library(ggplot2)
multicolors <- grDevices::colorRampPalette(rev(RColorBrewer::brewer.pal(9, "Set1")))(28)
ggplot2::ggplot(vGenes, aes(x = samples, y = frequencyGene, fill = familyName)) +
  geom_bar(stat = "identity") +
  theme_minimal() + 
  scale_y_continuous(expand = c(0, 0)) + 
  guides(fill = guide_legend(ncol = 2)) +
  scale_fill_manual(values = multicolors) + 
  labs(y = "Frequency (%)", x = "", fill = "") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

Removing sequences

Occasionally you may identify one or more sequences in your data set that appear to be contamination. You can remove an amino acid sequence from all data frames using the function removeSeq, which then recomputes frequencyCount for all remaining sequences.

searchSeq(list = productive.TRB.aa, sequence = "CASSESAGSTGELFF")

             sample       aminoAcid count frequencyCount estimatedNumberGenomes
1       TRB_CD4_949 CASSESAGSTGELFF  5019      30.091732                   5019
2 TRB_Unsorted_1320 CASSESAGSTGELFF 10326       7.100079                  10326
3  TRB_Unsorted_949 CASSESAGSTGELFF   338       6.468900                    338
4 TRB_Unsorted_1496 CASSESAGSTGELFF  1755       6.377412                   1755

cleansed <- removeSeq(file.list = productive.TRB.aa, sequence = "CASSESAGSTGELFF")
searchSeq(list = cleansed, sequence = "CASSESAGSTGELFF")

No sequences found.
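
The effect of removing a sequence and recomputing frequencies can be sketched in base R as follows. The toy data and the helper drop_sequence are hypothetical, and recomputing frequencyCount as a percentage of the remaining read counts is an assumption about the renormalization.

```r
# Hypothetical toy list of samples
toy.list <- list(
  S1 = data.frame(aminoAcid = c("CASSAAF", "CASSBBF", "CASSCCF"),
                  count = c(50, 30, 20)),
  S2 = data.frame(aminoAcid = c("CASSBBF", "CASSDDF"),
                  count = c(25, 75))
)

# Drop one sequence from a sample, then renormalize the remaining counts to %
drop_sequence <- function(df, sequence) {
  df <- df[df$aminoAcid != sequence, , drop = FALSE]
  df$frequencyCount <- df$count / sum(df$count) * 100
  df
}

# Apply the removal to every sample in the list
toy.cleansed <- lapply(toy.list, drop_sequence, sequence = "CASSBBF")
toy.cleansed
```

After removal, the frequencies in each toy sample again sum to 100.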

Merging samples

If you need to combine multiple samples into one, use the mergeFiles function. It merges two or more sample data frames into a single data frame and aggregates count, frequencyCount, and estimatedNumberGenomes.

TRB_949_Merged <- mergeFiles(samples = c("TRB_CD4_949", "TRB_CD8_949"), 
                                file.list = TCRB.list)
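
As a rough illustration of what merging involves, the base R sketch below stacks two made-up samples and sums the counts of identical sequences. The names toy.list, stacked, and toy.merged are hypothetical, and recomputing frequencyCount from the pooled counts is an assumption. The package may aggregate the frequency column differently.

```r
# Hypothetical toy list of samples
toy.list <- list(
  S1 = data.frame(aminoAcid = c("CASSAAF", "CASSBBF"), count = c(60, 40)),
  S2 = data.frame(aminoAcid = c("CASSBBF", "CASSCCF"), count = c(10, 90))
)

# Stack the selected samples into one data frame
stacked <- do.call(rbind, toy.list[c("S1", "S2")])

# Sum the counts of sequences that occur in more than one sample
toy.merged <- aggregate(count ~ aminoAcid, data = stacked, FUN = sum)

# Recompute the frequency (%) from the pooled counts (assumed definition)
toy.merged$frequencyCount <- toy.merged$count / sum(toy.merged$count) * 100
toy.merged
```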

Conclusion

Advances in high-throughput sequencing have enabled the characterization of T and B lymphocyte repertoires with unprecedented depth. LymphoSeq was developed as a tool to assist in the analysis of targeted next generation sequencing of the hypervariable CDR3 region of T and B cell receptors. The three key features of this R package are to characterize lymphocyte repertoire diversity, compare two or more lymphocyte repertoires, and track the frequency of CDR3 sequences across multiple samples. LymphoSeq also provides the unique ability to search for sequences in a curated database of published TCRB sequences with known antigen specificity. Finally, LymphoSeq can report the percent prevalence with which any given TCRB sequence appears in the peripheral blood of a population of healthy donors.

Session info

sessionInfo()

## R version 4.4.3 (2025-02-28)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_3.5.1      pheatmap_1.0.12    reshape_0.8.9      wordcloud_2.6     
## [5] RColorBrewer_1.1-3 LymphoSeq_1.35.0   LymphoSeqDB_0.99.2
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1        dplyr_1.1.4             farver_2.1.2           
##  [4] Biostrings_2.75.4       fastmap_1.2.0           lazyeval_0.2.2         
##  [7] digest_0.6.37           lifecycle_1.0.4         tidytree_0.4.6         
## [10] VennDiagram_1.7.3       magrittr_2.0.3          compiler_4.4.3         
## [13] rlang_1.1.5             sass_0.4.9              tools_4.4.3            
## [16] igraph_2.1.4            yaml_2.3.10             data.table_1.17.0      
## [19] knitr_1.50              lambda.r_1.2.4          phangorn_2.12.1        
## [22] labeling_0.4.3          plyr_1.8.9              aplot_0.2.5            
## [25] withr_3.0.2             purrr_1.0.4             BiocGenerics_0.53.6    
## [28] sys_3.4.3               grid_4.4.3              stats4_4.4.3           
## [31] msa_1.39.2              colorspace_2.1-1        scales_1.3.0           
## [34] cli_3.6.4               UpSetR_1.4.0            rmarkdown_2.29         
## [37] crayon_1.5.3            treeio_1.31.0           generics_0.1.3         
## [40] stringdist_0.9.15       ggtree_3.15.0           httr_1.4.7             
## [43] ape_5.8-1               cachem_1.1.0            parallel_4.4.3         
## [46] ggplotify_0.1.2         formatR_1.14            XVector_0.47.2         
## [49] yulab.utils_0.2.0       vctrs_0.6.5             Matrix_1.7-3           
## [52] jsonlite_2.0.0          gridGraphics_0.5-1      IRanges_2.41.3         
## [55] patchwork_1.3.0         S4Vectors_0.45.4        maketools_1.3.2        
## [58] ineq_0.2-13             tidyr_1.3.1             jquerylib_0.1.4        
## [61] glue_1.8.0              codetools_0.2-20        gtable_0.3.6           
## [64] shape_1.4.6.1           futile.logger_1.4.3     GenomeInfoDb_1.43.4    
## [67] UCSC.utils_1.3.1        quadprog_1.5-8          munsell_0.5.1          
## [70] tibble_3.2.1            pillar_1.10.1           htmltools_0.5.8.1      
## [73] GenomeInfoDbData_1.2.13 circlize_0.4.16         R6_2.6.1               
## [76] evaluate_1.0.3          lattice_0.22-6          futile.options_1.0.1   
## [79] ggfun_0.1.8             bslib_0.9.0             Rcpp_1.0.14            
## [82] fastmatch_1.1-6         gridExtra_2.3           nlme_3.1-167           
## [85] xfun_0.51               fs_1.6.5                buildtools_1.0.0       
## [88] pkgconfig_2.0.3         GlobalOptions_0.1.2