hiReadsProcessor contains a set of functions that allow users to process LM-PCR products sequenced on any platform. Given an Excel/txt file containing parameters for demultiplexing and sample metadata, these functions automate trimming of adaptors, removal of vector sequences, and identification of host genomic locations.
The basic philosophy of this package is to detect various ‘bits’ of information in a sequencing read, such as barcodes, primers, and linkers, and to store them within a nested list in a space-efficient manner. No information is duplicated once read into the object, and the supplied utility functions enable addition and extraction of needed features on the fly.
Here is the workflow in a nutshell, followed by a few simple steps to get you started.
Please refer to the following publications from the Bushman Lab for more information on sample/amplicon preparation.
First, load this package and the parallel backend of your choice from the BiocParallel library. Although BiocParallel is imported internally and will invoke multicore-like functionality, you may have to load it yourself to invoke snow-like functionality.
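As a hedged sketch, the backend could be chosen as shown below. `register()`, `SnowParam()`, `MulticoreParam()`, and `bpparam()` are BiocParallel functions; the worker count of 2 is an arbitrary assumption, and the calls are guarded so the snippet is simply skipped when the packages are not installed.

```r
# Sketch only: load the package and pick a parallel backend before processing.
# The worker count (2) is an arbitrary assumption for illustration.
backend <- "none"
if (requireNamespace("hiReadsProcessor", quietly = TRUE)) {
  library(hiReadsProcessor)
}
if (requireNamespace("BiocParallel", quietly = TRUE)) {
  library(BiocParallel)
  if (.Platform$OS.type == "windows") {
    # Forked (multicore-like) workers are unavailable on Windows,
    # so fall back to snow-like workers there.
    register(SnowParam(workers = 2))
  } else {
    register(MulticoreParam(workers = 2))
  }
  # Report which backend is now the registered default.
  backend <- class(bpparam())[1]
}
backend
```

Swapping `MulticoreParam` for `SnowParam` on any platform is how snow-like functionality would be requested explicitly.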
The package comes with an example 454 sequencing run: FLX_sample_run. In the rest of this tutorial we will use this dataset to introduce the functionality of this package.
A typical sequencing run will have several file types, but among the most important are the fasta/fastq files, which are created per sector/quadrant/lane. In the example dataset we have compressed fasta files for three quadrants, an Excel spreadsheet holding sample metadata along with the information/parameters needed to process each sample, and vector fasta files to be trimmed if needed.
runData <- system.file("extdata/FLX_sample_run/", package = "hiReadsProcessor")
list.files(runData, recursive = TRUE)
## [1] "RunData/1.TCA.454Reads.fna" "RunData/1.TCA.454Reads.fna.gz"
## [3] "RunData/2.TCA.454Reads.fna" "RunData/2.TCA.454Reads.fna.gz"
## [5] "RunData/3.TCA.454Reads.fna" "RunData/3.TCA.454Reads.fna.gz"
## [7] "Vectors/HIV1.fa" "Vectors/MLV-vector.fa"
## [9] "sampleInfo.xls" "sampleInfo.xlsx"
Function read.SeqFolder will initiate a SimpleList sample information object which will hold everything regarding the sequencing run. The object is structured to store sequencing file paths, sequence data, processed data, and sample metadata. The function locates the required files itself to ease automation. It is important that somewhere in the sequencing folder there is a file called “sampleInfo”; otherwise object initialization will fail.
## Choosing /tmp/RtmpA23cz7/Rinst15f114466c66/hiReadsProcessor/extdata/FLX_sample_run/sampleInfo.xls as sample information file.
## List of length 6
## names(6): sequencingFolderPath seqFilePaths ... sectors callHistory
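A call along the following lines would produce output like that shown above. This is a sketch rather than the verbatim tutorial chunk: the argument name `seqfilePattern` and the regular expression are assumptions based on the package documentation and the file listing, and the code is guarded so it is skipped when hiReadsProcessor is not installed.

```r
# Sketch: initialize the sample information object from the bundled example run.
seqProps <- NULL
if (requireNamespace("hiReadsProcessor", quietly = TRUE)) {
  library(hiReadsProcessor)
  runData <- system.file("extdata/FLX_sample_run/", package = "hiReadsProcessor")
  # Assumption: seqfilePattern restricts the sequence files to the
  # compressed fasta files seen in the listing (*.fna.gz).
  seqProps <- read.SeqFolder(runData, seqfilePattern = ".+fna.gz$")
  print(names(seqProps))
}
```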
On successfully initializing the sample information object, you will see that there is a hierarchy within the object. The root, or topmost level, holds the information regarding the sequencing folder. Each sample within a quadrant/lane is held within the “sectors” list. Within the sectors list, there is a list of samples and the associated metadata, which gets modified and appended to as the reads get trimmed and aligned.
## List of length 54
## names(54): Roth-MLV-CD4T-20100723NCMu ... Roth-MLV3p-CD4TMLVwell6-MuA
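To make the hierarchy concrete, here is a toy stand-in built from plain R lists rather than a SimpleList. The reads and every field name below the sectors level (`barcode`, `decoded`) are hypothetical; the point is only that a detected feature can be stored as coordinates and extracted on demand instead of copying the bases.

```r
# Toy stand-in for the sample information object, using plain lists.
# Reads and the field names "barcode"/"decoded" are made up for illustration.
reads <- c(r1 = "TGCATCGAGGTTCCAAGTTTACGT", r2 = "TGCATCGAGGTTCCAACCCGGGTA")

toy <- list(
  sequencingFolderPath = "path/to/run",
  sectors = list(
    "1" = list(
      samples = list(
        "Roth-MLV-CD4T-20100723NCMu" = list(
          barcode = "TGCATCGA",
          # Store where the feature was found, not a copy of the sequence.
          decoded = data.frame(read = names(reads), start = 1L, end = 8L,
                               stringsAsFactors = FALSE)
        )
      )
    )
  )
)

# Navigate root -> sector -> sample, then extract the feature on the fly.
hit <- toy$sectors[["1"]]$samples[["Roth-MLV-CD4T-20100723NCMu"]]$decoded
barcodes_found <- substr(reads[hit$read], hit$start, hit$end)
barcodes_found
```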
The first step in processing most sequencing runs is to demultiplex reads by barcodes/MIDs. Function findBarcodes automates the demultiplexing process using data already stored from the sample information file. Please see the documentation for read.sampleInfo for the kinds of parameters and information held in a sample information file. An example file is supplied within the FLX_sample_run dataset.
## Decoding sector: 1
## Reading:
## /tmp/RtmpA23cz7/Rinst15f114466c66/hiReadsProcessor/extdata/FLX_sample_run/RunData/1.TCA.454Reads.fna.gz
## Using following schema for barcode to sample associations
## barcodesSample
## TGCATCGA Roth-MLV-CD4T-20100723NCMu
## TCAGTCAG Roth-MLV-CD4T-20100723well1BMu
## ACGTACGA Roth-MLV-CD4T-20100723well2BMu1-
## CTCAGACA Roth-MLV-CD4T-20100723well2BMu10
## CTGAGTCA Roth-MLV-CD4T-20100723well2BMu11
## TCACAGAC Roth-MLV-CD4T-20100723well2BMu12
## ACTCGACA Roth-MLV-CD4T-20100723well2BMu2
## AGACAGTG Roth-MLV-CD4T-20100723well2BMu3
## CACAGACG Roth-MLV-CD4T-20100723well2BMu4
## CACTGTGA Roth-MLV-CD4T-20100723well2BMu5
## CATCTCGA Roth-MLV-CD4T-20100723well2BMu6
## CATGACGA Roth-MLV-CD4T-20100723well2BMu7
## CGATCGTA Roth-MLV-CD4T-20100723well2BMu8
## CGTAGCTA Roth-MLV-CD4T-20100723well2BMu9
## TCAGTCTC Roth-MLV-CD4T-20100723well2Mu1
## TCGTAGCA Roth-MLV-CD4T-20100723well2Mu2
## TCGTCATC Roth-MLV-CD4T-20100723well2Mu3
## TCTCACAC Roth-MLV-CD4T-20100723well2Mu4
## TGACAGTC Roth-MLV-CD4T-20100723well2Mu5
## TGCAGTAC Roth-MLV-CD4T-20100723well2Mu6
## TCACGTGA Roth-MLV3p-CD4T-20100730NC
## TGCTGATG Roth-MLV3p-CD4T-20100730Well1BstYI
## CGTGCGAC Roth-MLV3p-CD4T-20100730Well1MseI
## CAGCTGTA Roth-MLV3p-CD4T-20100730Well1NlaIII
## CAGTCTCA Roth-MLV3p-CD4T-20100730Well1Tsp509I
## AGCTCATG Roth-MLV3p-CD4T-20100730Well2BstyI
## ACACTGAC Roth-MLV3p-CD4T-20100730Well2MseI
## ACTGAGTC Roth-MLV3p-CD4T-20100730Well2NlaIII
## TCGATCGA Roth-MLV3p-CD4T-20100730Well2Tsp509I
## AGCACTAC Roth-MLV3p-CD4TMLVLot16-Mu
## TCTCAGTC Roth-MLV3p-CD4TMLVWell3-MseI
## TCTCGTCA Roth-MLV3p-CD4TMLVWell3-NlaIII
## TCGAGTAC Roth-MLV3p-CD4TMLVWell3-Tsp509I
## TGTGCTGA Roth-MLV3p-CD4TMLVWell3Harri-Mu
## ACACACTG Roth-MLV3p-CD4TMLVWell3Lot60-Mu
## ACAGTGTC Roth-MLV3p-CD4TMLVWell3Lot62-Mu
## ACGATGCT Roth-MLV3p-CD4TMLVWell3Lot64-Mu
## ACGTCATG Roth-MLV3p-CD4TMLVWell3Lot64new-Mu
## TGACTGTG Roth-MLV3p-CD4TMLVWell4-BstYI
## GCTATACA Roth-MLV3p-CD4TMLVWell4-MseI
## GAGCATGA Roth-MLV3p-CD4TMLVWell4-NlaIII
## GCATGCTA Roth-MLV3p-CD4TMLVWell4-Tsp509I
## ACTGATAC Roth-MLV3p-CD4TMLVWell5-BstYI
## GTCTGAGC Roth-MLV3p-CD4TMLVWell5-MseI
## AGCTCTGC Roth-MLV3p-CD4TMLVWell5-NlaIII
## ATACTCTC Roth-MLV3p-CD4TMLVWell5-Tsp509I
## ACTGACAC Roth-MLV3p-CD4TMLVWell6-BstYI
## TGACGTCA Roth-MLV3p-CD4TMLVWell6-MseI
## CAGTCACG Roth-MLV3p-CD4TMLVWell6-NlaIII
## TCGAGCAT Roth-MLV3p-CD4TMLVWell6-Tsp509I
## AGCTGTAC Roth-MLV3p-CD4TMLVwell3-BstYI
## TCGAGACT Roth-MLV3p-CD4TMLVwell4-MuA
## TCGACTGA Roth-MLV3p-CD4TMLVwell5-MuA
## ACAGCAGA Roth-MLV3p-CD4TMLVwell6-MuA
## Number of Sequences with no matching barcode: 196
## Number of Sequences decoded:
## sampleNames Freq
## 1 Roth-MLV-CD4T-20100723well1BMu 219
## 2 Roth-MLV-CD4T-20100723well2BMu1- 37
## 3 Roth-MLV-CD4T-20100723well2BMu10 12
## 4 Roth-MLV-CD4T-20100723well2BMu11 40
## 5 Roth-MLV-CD4T-20100723well2BMu12 9
## 6 Roth-MLV-CD4T-20100723well2BMu2 50
## 7 Roth-MLV-CD4T-20100723well2BMu3 88
## 8 Roth-MLV-CD4T-20100723well2BMu4 85
## 9 Roth-MLV-CD4T-20100723well2BMu5 86
## 10 Roth-MLV-CD4T-20100723well2BMu6 85
## 11 Roth-MLV-CD4T-20100723well2BMu7 47
## 12 Roth-MLV-CD4T-20100723well2BMu8 26
## 13 Roth-MLV-CD4T-20100723well2BMu9 28
## 14 Roth-MLV-CD4T-20100723well2Mu1 57
## 15 Roth-MLV-CD4T-20100723well2Mu2 65
## 16 Roth-MLV-CD4T-20100723well2Mu3 31
## 17 Roth-MLV-CD4T-20100723well2Mu4 22
## 18 Roth-MLV-CD4T-20100723well2Mu5 6
## 19 Roth-MLV-CD4T-20100723well2Mu6 2
## 20 Roth-MLV3p-CD4T-20100730Well1BstYI 139
## 21 Roth-MLV3p-CD4T-20100730Well1MseI 240
## 22 Roth-MLV3p-CD4T-20100730Well1NlaIII 251
## 23 Roth-MLV3p-CD4T-20100730Well1Tsp509I 313
## 24 Roth-MLV3p-CD4T-20100730Well2BstyI 324
## 25 Roth-MLV3p-CD4T-20100730Well2MseI 475
## 26 Roth-MLV3p-CD4T-20100730Well2NlaIII 371
## 27 Roth-MLV3p-CD4T-20100730Well2Tsp509I 437
## 28 Roth-MLV3p-CD4TMLVLot16-Mu 172
## 29 Roth-MLV3p-CD4TMLVWell3-MseI 468
## 30 Roth-MLV3p-CD4TMLVWell3-NlaIII 301
## 31 Roth-MLV3p-CD4TMLVWell3-Tsp509I 487
## 32 Roth-MLV3p-CD4TMLVWell3Harri-Mu 62
## 33 Roth-MLV3p-CD4TMLVWell3Lot60-Mu 36
## 34 Roth-MLV3p-CD4TMLVWell3Lot62-Mu 171
## 35 Roth-MLV3p-CD4TMLVWell3Lot64-Mu 109
## 36 Roth-MLV3p-CD4TMLVWell3Lot64new-Mu 14
## 37 Roth-MLV3p-CD4TMLVWell4-BstYI 116
## 38 Roth-MLV3p-CD4TMLVWell4-MseI 395
## 39 Roth-MLV3p-CD4TMLVWell4-NlaIII 287
## 40 Roth-MLV3p-CD4TMLVWell4-Tsp509I 318
## 41 Roth-MLV3p-CD4TMLVWell5-BstYI 148
## 42 Roth-MLV3p-CD4TMLVWell5-MseI 284
## 43 Roth-MLV3p-CD4TMLVWell5-NlaIII 286
## 44 Roth-MLV3p-CD4TMLVWell5-Tsp509I 542
## 45 Roth-MLV3p-CD4TMLVWell6-BstYI 195
## 46 Roth-MLV3p-CD4TMLVWell6-MseI 378
## 47 Roth-MLV3p-CD4TMLVWell6-NlaIII 261
## 48 Roth-MLV3p-CD4TMLVWell6-Tsp509I 490
## 49 Roth-MLV3p-CD4TMLVwell3-BstYI 211
## 50 Roth-MLV3p-CD4TMLVwell4-MuA 138
## 51 Roth-MLV3p-CD4TMLVwell5-MuA 235
## 52 Roth-MLV3p-CD4TMLVwell6-MuA 155
## Decoding sector: 2
## Reading:
## /tmp/RtmpA23cz7/Rinst15f114466c66/hiReadsProcessor/extdata/FLX_sample_run/RunData/2.TCA.454Reads.fna.gz
## Using following schema for barcode to sample associations
## barcodesSample
## TGCATCGA Roth-MLV-CD4T-20100723NCMu
## TCAGTCAG Roth-MLV-CD4T-20100723well1BMu
## ACGTACGA Roth-MLV-CD4T-20100723well2BMu1-
## CTCAGACA Roth-MLV-CD4T-20100723well2BMu10
## CTGAGTCA Roth-MLV-CD4T-20100723well2BMu11
## TCACAGAC Roth-MLV-CD4T-20100723well2BMu12
## ACTCGACA Roth-MLV-CD4T-20100723well2BMu2
## AGACAGTG Roth-MLV-CD4T-20100723well2BMu3
## CACAGACG Roth-MLV-CD4T-20100723well2BMu4
## CACTGTGA Roth-MLV-CD4T-20100723well2BMu5
## CATCTCGA Roth-MLV-CD4T-20100723well2BMu6
## CATGACGA Roth-MLV-CD4T-20100723well2BMu7
## CGATCGTA Roth-MLV-CD4T-20100723well2BMu8
## CGTAGCTA Roth-MLV-CD4T-20100723well2BMu9
## TCAGTCTC Roth-MLV-CD4T-20100723well2Mu1
## TCGTAGCA Roth-MLV-CD4T-20100723well2Mu2
## TCGTCATC Roth-MLV-CD4T-20100723well2Mu3
## TCTCACAC Roth-MLV-CD4T-20100723well2Mu4
## TGACAGTC Roth-MLV-CD4T-20100723well2Mu5
## TGCAGTAC Roth-MLV-CD4T-20100723well2Mu6
## TCACGTGA Roth-MLV3p-CD4T-20100730NC
## TGCTGATG Roth-MLV3p-CD4T-20100730Well1BstYI
## CGTGCGAC Roth-MLV3p-CD4T-20100730Well1MseI
## CAGCTGTA Roth-MLV3p-CD4T-20100730Well1NlaIII
## CAGTCTCA Roth-MLV3p-CD4T-20100730Well1Tsp509I
## AGCTCATG Roth-MLV3p-CD4T-20100730Well2BstyI
## ACACTGAC Roth-MLV3p-CD4T-20100730Well2MseI
## ACTGAGTC Roth-MLV3p-CD4T-20100730Well2NlaIII
## TCGATCGA Roth-MLV3p-CD4T-20100730Well2Tsp509I
## AGCACTAC Roth-MLV3p-CD4TMLVLot16-Mu
## TCTCAGTC Roth-MLV3p-CD4TMLVWell3-MseI
## TCTCGTCA Roth-MLV3p-CD4TMLVWell3-NlaIII
## TCGAGTAC Roth-MLV3p-CD4TMLVWell3-Tsp509I
## TGTGCTGA Roth-MLV3p-CD4TMLVWell3Harri-Mu
## ACACACTG Roth-MLV3p-CD4TMLVWell3Lot60-Mu
## ACAGTGTC Roth-MLV3p-CD4TMLVWell3Lot62-Mu
## ACGATGCT Roth-MLV3p-CD4TMLVWell3Lot64-Mu
## ACGTCATG Roth-MLV3p-CD4TMLVWell3Lot64new-Mu
## TGACTGTG Roth-MLV3p-CD4TMLVWell4-BstYI
## GCTATACA Roth-MLV3p-CD4TMLVWell4-MseI
## GAGCATGA Roth-MLV3p-CD4TMLVWell4-NlaIII
## GCATGCTA Roth-MLV3p-CD4TMLVWell4-Tsp509I
## ACTGATAC Roth-MLV3p-CD4TMLVWell5-BstYI
## GTCTGAGC Roth-MLV3p-CD4TMLVWell5-MseI
## AGCTCTGC Roth-MLV3p-CD4TMLVWell5-NlaIII
## ATACTCTC Roth-MLV3p-CD4TMLVWell5-Tsp509I
## ACTGACAC Roth-MLV3p-CD4TMLVWell6-BstYI
## TGACGTCA Roth-MLV3p-CD4TMLVWell6-MseI
## CAGTCACG Roth-MLV3p-CD4TMLVWell6-NlaIII
## TCGAGCAT Roth-MLV3p-CD4TMLVWell6-Tsp509I
## AGCTGTAC Roth-MLV3p-CD4TMLVwell3-BstYI
## TCGAGACT Roth-MLV3p-CD4TMLVwell4-MuA
## TCGACTGA Roth-MLV3p-CD4TMLVwell5-MuA
## ACAGCAGA Roth-MLV3p-CD4TMLVwell6-MuA
## Number of Sequences with no matching barcode: 214
## Number of Sequences decoded:
## sampleNames Freq
## 1 Roth-MLV-CD4T-20100723well1BMu 215
## 2 Roth-MLV-CD4T-20100723well2BMu1- 27
## 3 Roth-MLV-CD4T-20100723well2BMu10 4
## 4 Roth-MLV-CD4T-20100723well2BMu11 41
## 5 Roth-MLV-CD4T-20100723well2BMu12 7
## 6 Roth-MLV-CD4T-20100723well2BMu2 41
## 7 Roth-MLV-CD4T-20100723well2BMu3 78
## 8 Roth-MLV-CD4T-20100723well2BMu4 76
## 9 Roth-MLV-CD4T-20100723well2BMu5 74
## 10 Roth-MLV-CD4T-20100723well2BMu6 80
## 11 Roth-MLV-CD4T-20100723well2BMu7 47
## 12 Roth-MLV-CD4T-20100723well2BMu8 15
## 13 Roth-MLV-CD4T-20100723well2BMu9 32
## 14 Roth-MLV-CD4T-20100723well2Mu1 50
## 15 Roth-MLV-CD4T-20100723well2Mu2 68
## 16 Roth-MLV-CD4T-20100723well2Mu3 21
## 17 Roth-MLV-CD4T-20100723well2Mu4 19
## 18 Roth-MLV-CD4T-20100723well2Mu5 4
## 19 Roth-MLV-CD4T-20100723well2Mu6 2
## 20 Roth-MLV3p-CD4T-20100730Well1BstYI 151
## 21 Roth-MLV3p-CD4T-20100730Well1MseI 223
## 22 Roth-MLV3p-CD4T-20100730Well1NlaIII 253
## 23 Roth-MLV3p-CD4T-20100730Well1Tsp509I 315
## 24 Roth-MLV3p-CD4T-20100730Well2BstyI 375
## 25 Roth-MLV3p-CD4T-20100730Well2MseI 428
## 26 Roth-MLV3p-CD4T-20100730Well2NlaIII 370
## 27 Roth-MLV3p-CD4T-20100730Well2Tsp509I 456
## 28 Roth-MLV3p-CD4TMLVLot16-Mu 153
## 29 Roth-MLV3p-CD4TMLVWell3-MseI 487
## 30 Roth-MLV3p-CD4TMLVWell3-NlaIII 311
## 31 Roth-MLV3p-CD4TMLVWell3-Tsp509I 469
## 32 Roth-MLV3p-CD4TMLVWell3Harri-Mu 54
## 33 Roth-MLV3p-CD4TMLVWell3Lot60-Mu 54
## 34 Roth-MLV3p-CD4TMLVWell3Lot62-Mu 165
## 35 Roth-MLV3p-CD4TMLVWell3Lot64-Mu 116
## 36 Roth-MLV3p-CD4TMLVWell3Lot64new-Mu 9
## 37 Roth-MLV3p-CD4TMLVWell4-BstYI 111
## 38 Roth-MLV3p-CD4TMLVWell4-MseI 400
## 39 Roth-MLV3p-CD4TMLVWell4-NlaIII 286
## 40 Roth-MLV3p-CD4TMLVWell4-Tsp509I 338
## 41 Roth-MLV3p-CD4TMLVWell5-BstYI 170
## 42 Roth-MLV3p-CD4TMLVWell5-MseI 319
## 43 Roth-MLV3p-CD4TMLVWell5-NlaIII 283
## 44 Roth-MLV3p-CD4TMLVWell5-Tsp509I 608
## 45 Roth-MLV3p-CD4TMLVWell6-BstYI 205
## 46 Roth-MLV3p-CD4TMLVWell6-MseI 406
## 47 Roth-MLV3p-CD4TMLVWell6-NlaIII 247
## 48 Roth-MLV3p-CD4TMLVWell6-Tsp509I 445
## 49 Roth-MLV3p-CD4TMLVwell3-BstYI 179
## 50 Roth-MLV3p-CD4TMLVwell4-MuA 125
## 51 Roth-MLV3p-CD4TMLVwell5-MuA 250
## 52 Roth-MLV3p-CD4TMLVwell6-MuA 124
## Decoding sector: 3
## Reading:
## /tmp/RtmpA23cz7/Rinst15f114466c66/hiReadsProcessor/extdata/FLX_sample_run/RunData/3.TCA.454Reads.fna.gz
## Using following schema for barcode to sample associations
## barcodesSample
## CCGGAATT Ocwieja-HIV896-CD4TND365-InfectionI
## GATCGACT Ocwieja-HIV896-CD4TND365-InfectionII
## TCGTACAG Ocwieja-HIV896-CD4TND365-InfectionIII
## TATAGCGC Ocwieja-HIV896-CD4TND365-NoVirus
## Number of Sequences with no matching barcode: 140
## Number of Sequences decoded:
## sampleNames Freq
## 1 Ocwieja-HIV896-CD4TND365-InfectionI 3879
## 2 Ocwieja-HIV896-CD4TND365-InfectionII 3111
## 3 Ocwieja-HIV896-CD4TND365-InfectionIII 2869
## 4 Ocwieja-HIV896-CD4TND365-NoVirus 1
## List of length 6
## names(6): sequencingFolderPath seqFilePaths ... sectors callHistory
## List of length 3
## names(3): 1 2 3
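The idea behind the demultiplexing step can be sketched in base R. This is not the package's implementation: it assumes exact barcode matches at the very start of each read, uses two barcode-to-sample pairs from the sector 3 schema, and the reads are made up.

```r
# Barcode-to-sample schema (two entries taken from the sector 3 output).
schema <- c(CCGGAATT = "Ocwieja-HIV896-CD4TND365-InfectionI",
            GATCGACT = "Ocwieja-HIV896-CD4TND365-InfectionII")
bc_len <- 8L

# Made-up reads for illustration; the first 8 bases carry the barcode.
reads <- c(read1 = "CCGGAATTACGTACGTAA",
           read2 = "GATCGACTTTGACCAGTA",
           read3 = "AAAAAAAATTGACCAGTA",
           read4 = "CCGGAATTGGGTACCATG")

# Look up each read's prefix in the schema; NA means no matching barcode.
sample_of <- unname(schema[substr(reads, 1, bc_len)])
unmatched <- sum(is.na(sample_of))

# Group reads by sample, trimming the barcode off each decoded read.
keep <- !is.na(sample_of)
decoded <- split(substring(reads[keep], bc_len + 1L), sample_of[keep])

unmatched
lengths(decoded)
```

The two printed values correspond to the "no matching barcode" count and the per-sample frequency table reported for each sector.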
Following the barcode sequence is the 5’ viral LTR primer. Function findPrimers facilitates the trimming of the respective primer (primerltrsequence) for each sample. The minimum threshold for detecting the primer can be adjusted using primerLTRidentity within the sample information file.
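How an identity threshold decides whether a read is "primed" can be illustrated with a deliberately simplified sketch. It scores ungapped matches at the start of the read, which is not the package's alignment method; the primer sequence, the reads, and the 0.85 cutoff are all hypothetical.

```r
# Fraction of matching positions between a primer and the start of a read.
# Simplification: no gaps or offsets are considered.
primer_identity <- function(read, primer) {
  a <- strsplit(substr(read, 1, nchar(primer)), "")[[1]]
  b <- strsplit(primer, "")[[1]]
  if (length(a) < length(b)) return(0)   # read too short to hold the primer
  mean(a == b)
}

primer <- "GAAAATCTCTAGCA"               # hypothetical primer sequence
reads <- c(exact        = "GAAAATCTCTAGCATTGGCC",
           one_mismatch = "GAAAATCTCTTGCATTGGCC",
           unrelated    = "CCCCCCCCCCCCCCTTGGCC")

min_identity <- 0.85   # plays the role of primerLTRidentity (value assumed)
ident <- vapply(reads, primer_identity, numeric(1), primer = primer)
primed <- ident >= min_identity

# Trim the primer from reads that pass the threshold.
trimmed <- substring(reads[primed], nchar(primer) + 1)
round(ident, 2)
trimmed
```

Lowering `min_identity` admits reads with more mismatches; raising it toward 1 keeps only near-exact primer matches.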
Since it may take a while to process this kind of data, the package ships with a processed data object, which makes things easier for this tutorial. The code chunks below are not evaluated but rather reference the loaded seqProps object.
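One way the processed object might be loaded is sketched here. The file name `FLX_seqProps.RData` and its location in the package's `data` directory are assumptions about the installed package layout, so the code checks that the file exists before loading it.

```r
# Sketch: load the processed object shipped with the package, if available.
# Assumption: the object lives in data/FLX_seqProps.RData and is named seqProps.
loaded <- FALSE
if (requireNamespace("hiReadsProcessor", quietly = TRUE)) {
  rdata <- file.path(system.file("data", package = "hiReadsProcessor"),
                     "FLX_seqProps.RData")
  if (file.exists(rdata)) {
    load(rdata)
    loaded <- exists("seqProps")
  }
}
loaded
```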
If LM-PCR products were designed to include part of the viral LTR (ltrBitSequence) following the primer landing site, then findLTRs confirms the authenticity of the integrated virus. Absence of the LTR part indicates a non-genuine integration. The minimum threshold for detecting the LTR bit can be adjusted using ltrBitIdentity.
If the vectorFile parameter is defined within the sample information file, function findVector tags any reads which match the given vector file. These reads are discarded during the genomic alignment step, which is covered later.
Linker adaptors are found on the 3’ end of sequences. Depending on the experiment, the linkerSequence can be the same or different for each sample. Furthermore, some linker adaptors are designed to carry a primerID, which can help quantify pre-PCR products. Function findLinkers makes it easy to process samples with different linker sequences and types. If primerID technology is used, enabling the primerIdInLinker parameter within the sample information file automates extraction of the random part within the adaptor. Thresholds for linker detection can be controlled by setting the following parameters within the sample information file: linkerIdentity, primerIdInLinker, primerIdInLinkerIdentity1, and primerIdInLinkerIdentity2.
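Extracting the random primerID from a linker can be sketched with a regular expression in base R. The linker, the reads, and the convention of marking the random bases with N's are hypothetical here, and the sketch requires exact matches of the flanking bases, so the identity thresholds listed above are not modelled.

```r
# Hypothetical linker with a 6-base random primerID marked by N's.
linker <- "AGTCCCNNNNNNGGTACC"

# Turn the run of N's into a capture group of the same length.
n_random <- nchar(gsub("[^N]", "", linker))
pattern <- sub("N+", sprintf("([ACGT]{%d})", n_random), linker)

reads <- c(a = "TTGACCGATAGTCCCACGTTAGGTACC",
           b = "CCATGGTTAAGTCCCGGGCCCGGTACC",
           c = "TTGACCGATACGTACGTACGTACGTAA")   # carries no linker

hit <- regexec(pattern, reads)
has_linker <- vapply(hit, function(h) h[1] > 0, logical(1))

# Element 2 of each match is the captured primerID.
parts <- regmatches(reads, hit)
primer_id <- vapply(parts[has_linker], function(p) p[2], character(1))

# Genomic part: everything upstream of where the linker starts.
linker_start <- vapply(hit[has_linker], function(h) h[1], integer(1))
genomic <- substr(reads[has_linker], 1, linker_start - 1L)

primer_id
genomic
```

Reads sharing both the genomic part and the primerID would then be candidates for collapsing into a single pre-PCR product.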
Once all the non-genomic parts have been detected, it is time to find the actual integration sites. Function findIntegrations makes this straightforward, provided that BLAT and indexed genome files are in place.
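Before starting the alignment step, it may help to confirm that the prerequisites are actually in place. The sketch below uses only base R; the genome file path is a placeholder, and looking for an executable named `blat` on the search path is an assumption about how BLAT is installed.

```r
# Sketch: verify prerequisites before attempting genomic alignment.
# The path below is a placeholder; point it at your own indexed genome file.
genome_index <- "path/to/genome.2bit"
blat_path <- Sys.which("blat")   # empty string when no executable is found

ready <- nzchar(blat_path)[[1]] && file.exists(genome_index)
if (!ready) {
  message("BLAT and/or the indexed genome file were not found; ",
          "skipping the alignment step.")
}
ready
```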
Function sampleSummary quantifies seven basic features tracked by this package for each sample (decoded, primed, LTRed, vectored, linkered, psl, and sites):
## Total sectors:1,2,3
## Sector SampleName decoded primed LTRed vectored
## 1 1 Roth-MLV-CD4T-20100723NCMu NA NA NA NA
## 2 1 Roth-MLV-CD4T-20100723well1BMu 18 18 6 1
## 3 1 Roth-MLV-CD4T-20100723well2BMu1- 12 11 NA NA
## 4 1 Roth-MLV-CD4T-20100723well2BMu10 1 1 1 NA
## 5 1 Roth-MLV-CD4T-20100723well2BMu11 6 4 3 NA
## 6 1 Roth-MLV-CD4T-20100723well2BMu12 1 1 NA NA
## 7 1 Roth-MLV-CD4T-20100723well2BMu2 3 3 1 NA
## 8 1 Roth-MLV-CD4T-20100723well2BMu3 8 8 NA NA
## 9 1 Roth-MLV-CD4T-20100723well2BMu4 10 10 4 NA
## 10 1 Roth-MLV-CD4T-20100723well2BMu5 8 8 3 NA
## 11 1 Roth-MLV-CD4T-20100723well2BMu6 9 8 2 NA
## 12 1 Roth-MLV-CD4T-20100723well2BMu7 7 5 1 NA
## 13 1 Roth-MLV-CD4T-20100723well2BMu8 4 4 2 NA
## 14 1 Roth-MLV-CD4T-20100723well2BMu9 1 1 1 NA
## 15 1 Roth-MLV-CD4T-20100723well2Mu1 3 3 NA NA
## 16 1 Roth-MLV-CD4T-20100723well2Mu2 4 4 NA NA
## 17 1 Roth-MLV-CD4T-20100723well2Mu3 3 3 NA NA
## 18 1 Roth-MLV-CD4T-20100723well2Mu4 1 1 NA NA
## 19 1 Roth-MLV-CD4T-20100723well2Mu5 1 1 NA NA
## 20 1 Roth-MLV-CD4T-20100723well2Mu6 NA NA NA NA
## 21 1 Roth-MLV3p-CD4T-20100730NC NA NA NA NA
## 22 1 Roth-MLV3p-CD4T-20100730Well1BstYI 19 18 15 NA
## 23 1 Roth-MLV3p-CD4T-20100730Well1MseI 26 26 23 NA
## 24 1 Roth-MLV3p-CD4T-20100730Well1NlaIII 36 33 28 4
## 25 1 Roth-MLV3p-CD4T-20100730Well1Tsp509I 34 29 25 NA
## 26 1 Roth-MLV3p-CD4T-20100730Well2BstyI 32 31 22 NA
## 27 1 Roth-MLV3p-CD4T-20100730Well2MseI 43 43 39 1
## 28 1 Roth-MLV3p-CD4T-20100730Well2NlaIII 31 31 25 4
## 29 1 Roth-MLV3p-CD4T-20100730Well2Tsp509I 39 39 35 1
## 30 1 Roth-MLV3p-CD4TMLVLot16-Mu 15 14 11 NA
## 31 1 Roth-MLV3p-CD4TMLVwell3-BstYI 18 18 17 NA
## 32 1 Roth-MLV3p-CD4TMLVWell3-MseI 42 40 36 NA
## 33 1 Roth-MLV3p-CD4TMLVWell3-NlaIII 35 35 28 NA
## 34 1 Roth-MLV3p-CD4TMLVWell3-Tsp509I 39 39 37 NA
## 35 1 Roth-MLV3p-CD4TMLVWell3Harri-Mu 9 9 7 NA
## 36 1 Roth-MLV3p-CD4TMLVWell3Lot60-Mu 5 5 2 NA
## 37 1 Roth-MLV3p-CD4TMLVWell3Lot62-Mu 13 13 8 NA
## 38 1 Roth-MLV3p-CD4TMLVWell3Lot64-Mu 8 8 8 NA
## 39 1 Roth-MLV3p-CD4TMLVWell3Lot64new-Mu 2 2 2 NA
## 40 1 Roth-MLV3p-CD4TMLVWell4-BstYI 8 8 6 NA
## 41 1 Roth-MLV3p-CD4TMLVWell4-MseI 38 38 35 1
## 42 1 Roth-MLV3p-CD4TMLVwell4-MuA 18 18 1 NA
## 43 1 Roth-MLV3p-CD4TMLVWell4-NlaIII 22 22 22 NA
## 44 1 Roth-MLV3p-CD4TMLVWell4-Tsp509I 30 26 25 NA
## 45 1 Roth-MLV3p-CD4TMLVWell5-BstYI 10 10 8 NA
## 46 1 Roth-MLV3p-CD4TMLVWell5-MseI 43 41 38 NA
## 47 1 Roth-MLV3p-CD4TMLVwell5-MuA 26 25 6 NA
## 48 1 Roth-MLV3p-CD4TMLVWell5-NlaIII 28 27 27 1
## 49 1 Roth-MLV3p-CD4TMLVWell5-Tsp509I 67 66 60 NA
## 50 1 Roth-MLV3p-CD4TMLVWell6-BstYI 17 17 14 NA
## 51 1 Roth-MLV3p-CD4TMLVWell6-MseI 36 36 31 NA
## 52 1 Roth-MLV3p-CD4TMLVwell6-MuA 13 13 NA NA
## 53 1 Roth-MLV3p-CD4TMLVWell6-NlaIII 22 22 21 NA
## 54 1 Roth-MLV3p-CD4TMLVWell6-Tsp509I 54 52 48 1
## 55 2 Roth-MLV-CD4T-20100723NCMu NA NA NA NA
## 56 2 Roth-MLV-CD4T-20100723well1BMu 21 21 6 NA
## 57 2 Roth-MLV-CD4T-20100723well2BMu1- 2 2 NA NA
## 58 2 Roth-MLV-CD4T-20100723well2BMu10 NA NA NA NA
## 59 2 Roth-MLV-CD4T-20100723well2BMu11 4 4 2 NA
## 60 2 Roth-MLV-CD4T-20100723well2BMu12 1 1 NA NA
## 61 2 Roth-MLV-CD4T-20100723well2BMu2 4 4 NA NA
## 62 2 Roth-MLV-CD4T-20100723well2BMu3 17 17 2 NA
## 63 2 Roth-MLV-CD4T-20100723well2BMu4 8 8 1 NA
## 64 2 Roth-MLV-CD4T-20100723well2BMu5 6 6 3 NA
## 65 2 Roth-MLV-CD4T-20100723well2BMu6 5 5 1 NA
## 66 2 Roth-MLV-CD4T-20100723well2BMu7 8 6 2 NA
## 67 2 Roth-MLV-CD4T-20100723well2BMu8 2 2 1 NA
## 68 2 Roth-MLV-CD4T-20100723well2BMu9 3 3 3 NA
## 69 2 Roth-MLV-CD4T-20100723well2Mu1 2 2 NA NA
## 70 2 Roth-MLV-CD4T-20100723well2Mu2 7 7 NA NA
## 71 2 Roth-MLV-CD4T-20100723well2Mu3 4 4 NA NA
## 72 2 Roth-MLV-CD4T-20100723well2Mu4 1 1 NA NA
## 73 2 Roth-MLV-CD4T-20100723well2Mu5 NA NA NA NA
## 74 2 Roth-MLV-CD4T-20100723well2Mu6 NA NA NA NA
## 75 2 Roth-MLV3p-CD4T-20100730NC NA NA NA NA
## 76 2 Roth-MLV3p-CD4T-20100730Well1BstYI 14 13 8 NA
## 77 2 Roth-MLV3p-CD4T-20100730Well1MseI 20 20 20 1
## 78 2 Roth-MLV3p-CD4T-20100730Well1NlaIII 29 28 24 NA
## 79 2 Roth-MLV3p-CD4T-20100730Well1Tsp509I 25 22 20 NA
## 80 2 Roth-MLV3p-CD4T-20100730Well2BstyI 45 45 36 NA
## 81 2 Roth-MLV3p-CD4T-20100730Well2MseI 46 46 45 NA
## 82 2 Roth-MLV3p-CD4T-20100730Well2NlaIII 36 35 30 4
## 83 2 Roth-MLV3p-CD4T-20100730Well2Tsp509I 34 33 31 NA
## 84 2 Roth-MLV3p-CD4TMLVLot16-Mu 16 16 9 1
## 85 2 Roth-MLV3p-CD4TMLVwell3-BstYI 17 17 14 NA
## 86 2 Roth-MLV3p-CD4TMLVWell3-MseI 64 63 57 NA
## 87 2 Roth-MLV3p-CD4TMLVWell3-NlaIII 32 32 28 1
## 88 2 Roth-MLV3p-CD4TMLVWell3-Tsp509I 42 42 37 NA
## 89 2 Roth-MLV3p-CD4TMLVWell3Harri-Mu 3 3 1 NA
## 90 2 Roth-MLV3p-CD4TMLVWell3Lot60-Mu 3 3 3 NA
## 91 2 Roth-MLV3p-CD4TMLVWell3Lot62-Mu 16 16 10 NA
## 92 2 Roth-MLV3p-CD4TMLVWell3Lot64-Mu 11 11 10 NA
## 93 2 Roth-MLV3p-CD4TMLVWell3Lot64new-Mu NA NA NA NA
## 94 2 Roth-MLV3p-CD4TMLVWell4-BstYI 13 13 13 1
## 95 2 Roth-MLV3p-CD4TMLVWell4-MseI 41 41 37 NA
## 96 2 Roth-MLV3p-CD4TMLVwell4-MuA 21 21 1 NA
## 97 2 Roth-MLV3p-CD4TMLVWell4-NlaIII 31 29 26 NA
## 98 2 Roth-MLV3p-CD4TMLVWell4-Tsp509I 27 26 26 NA
## 99 2 Roth-MLV3p-CD4TMLVWell5-BstYI 19 19 18 NA
## 100 2 Roth-MLV3p-CD4TMLVWell5-MseI 40 40 37 NA
## 101 2 Roth-MLV3p-CD4TMLVwell5-MuA 36 36 10 NA
## 102 2 Roth-MLV3p-CD4TMLVWell5-NlaIII 23 21 19 NA
## 103 2 Roth-MLV3p-CD4TMLVWell5-Tsp509I 54 53 48 1
## 104 2 Roth-MLV3p-CD4TMLVWell6-BstYI 14 14 12 NA
## 105 2 Roth-MLV3p-CD4TMLVWell6-MseI 33 33 31 NA
## 106 2 Roth-MLV3p-CD4TMLVwell6-MuA 15 14 1 NA
## 107 2 Roth-MLV3p-CD4TMLVWell6-NlaIII 23 23 20 1
## 108 2 Roth-MLV3p-CD4TMLVWell6-Tsp509I 46 46 41 NA
## 109 3 Ocwieja-HIV896-CD4TND365-InfectionI 422 422 400 66
## 110 3 Ocwieja-HIV896-CD4TND365-InfectionII 295 293 279 48
## 111 3 Ocwieja-HIV896-CD4TND365-InfectionIII 269 269 253 41
## 112 3 Ocwieja-HIV896-CD4TND365-NoVirus NA NA NA NA
## linkered psl sites
## 1 NA NA NA
## 2 15 5 2
## 3 12 NA NA
## 4 NA NA NA
## 5 NA NA NA
## 6 NA NA NA
## 7 3 1 1
## 8 7 1 1
## 9 9 2 2
## 10 4 1 1
## 11 6 1 NA
## 12 2 NA NA
## 13 2 NA NA
## 14 NA NA NA
## 15 3 NA NA
## 16 4 NA NA
## 17 3 NA NA
## 18 1 NA NA
## 19 1 NA NA
## 20 NA NA NA
## 21 NA NA NA
## 22 19 9 6
## 23 26 31 24
## 24 36 38 29
## 25 34 31 22
## 26 32 24 16
## 27 42 49 34
## 28 31 37 25
## 29 39 41 32
## 30 15 17 11
## 31 18 23 18
## 32 42 74 67
## 33 35 46 39
## 34 39 58 46
## 35 9 7 6
## 36 5 5 5
## 37 13 14 10
## 38 8 16 10
## 39 1 1 1
## 40 8 8 4
## 41 38 51 42
## 42 17 NA NA
## 43 22 39 32
## 44 30 44 34
## 45 9 18 15
## 46 43 56 48
## 47 16 3 2
## 48 28 33 22
## 49 67 80 60
## 50 17 10 8
## 51 36 44 33
## 52 13 NA NA
## 53 22 31 25
## 54 54 63 44
## 55 NA NA NA
## 56 17 5 2
## 57 2 NA NA
## 58 NA NA NA
## 59 NA NA NA
## 60 NA NA NA
## 61 4 1 1
## 62 15 1 1
## 63 6 2 2
## 64 4 1 1
## 65 4 1 NA
## 66 NA NA NA
## 67 NA NA NA
## 68 NA NA NA
## 69 2 NA NA
## 70 7 NA NA
## 71 4 NA NA
## 72 1 NA NA
## 73 NA NA NA
## 74 NA NA NA
## 75 NA NA NA
## 76 14 9 6
## 77 20 31 24
## 78 28 38 29
## 79 24 31 22
## 80 45 24 16
## 81 46 49 34
## 82 36 37 25
## 83 33 41 32
## 84 16 17 11
## 85 17 23 18
## 86 64 74 67
## 87 32 46 39
## 88 42 58 46
## 89 3 7 6
## 90 3 5 5
## 91 13 14 10
## 92 9 16 10
## 93 NA 1 1
## 94 13 8 4
## 95 40 51 42
## 96 19 NA NA
## 97 30 39 32
## 98 27 44 34
## 99 19 18 15
## 100 40 56 48
## 101 27 3 2
## 102 23 33 22
## 103 54 80 60
## 104 14 10 8
## 105 33 44 33
## 106 15 NA NA
## 107 23 31 25
## 108 45 63 44
## 109 357 281 210
## 110 233 201 145
## 111 188 171 114
## 112 NA NA NA
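A summary like this lends itself to quick quality checks. As a small illustration, the sketch below copies the three infected sector 3 samples from the table above into a data frame by hand and computes the percentage of decoded reads retained after each trimming step; it does not call sampleSummary itself.

```r
# Three rows copied from the sector 3 summary above.
smry <- data.frame(
  SampleName = c("Ocwieja-HIV896-CD4TND365-InfectionI",
                 "Ocwieja-HIV896-CD4TND365-InfectionII",
                 "Ocwieja-HIV896-CD4TND365-InfectionIII"),
  decoded  = c(422, 295, 269),
  primed   = c(422, 293, 269),
  LTRed    = c(400, 279, 253),
  linkered = c(357, 233, 188),
  stringsAsFactors = FALSE
)

# Percentage of decoded reads retained after each step.
# Dividing a matrix by a vector recycles down the columns, so each row
# is divided by that sample's own decoded count.
steps <- c("primed", "LTRed", "linkered")
retained <- round(100 * as.matrix(smry[steps]) / smry$decoded, 1)
rownames(retained) <- smry$SampleName
retained
```

A sharp drop between consecutive columns points to the step whose threshold or reference sequence deserves a closer look.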
Before diving into the functions offered by this package, let's first understand the underlying data object holding all the data. For example purposes we will refer to this as the “sampleInfo” object (although it’s essentially a SimpleList object).
The figure below outlines the hierarchy of data storage within the sampleInfo object.
** THE SECTIONS BELOW ARE IN WORKS **

### Sequencing Run related functions

* read.SeqFolder
* read.sampleInfo
* read.seqsFromSector
* findBarcodes|decodeByBarcode
* findPrimers
* findLTRs
* findVector
* findLinkers
* troubleshootLinkers
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] hiReadsProcessor_1.43.0 hiAnnotator_1.41.0
## [3] BiocParallel_1.41.0 GenomicAlignments_1.43.0
## [5] Rsamtools_2.23.1 SummarizedExperiment_1.37.0
## [7] Biobase_2.67.0 MatrixGenerics_1.19.0
## [9] matrixStats_1.4.1 GenomicRanges_1.59.1
## [11] pwalign_1.3.0 Biostrings_2.75.1
## [13] GenomeInfoDb_1.43.2 XVector_0.47.0
## [15] IRanges_2.41.1 S4Vectors_0.45.2
## [17] BiocGenerics_0.53.3 generics_0.1.3
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 rjson_0.2.23 xfun_0.49
## [4] bslib_0.8.0 ggplot2_3.5.1 lattice_0.22-6
## [7] vctrs_0.6.5 tools_4.4.2 bitops_1.0-9
## [10] curl_6.0.1 parallel_4.4.2 fansi_1.0.6
## [13] tibble_3.2.1 pkgconfig_2.0.3 Matrix_1.7-1
## [16] BSgenome_1.75.0 readxl_1.4.3 lifecycle_1.0.4
## [19] GenomeInfoDbData_1.2.13 compiler_4.4.2 sonicLength_1.4.7
## [22] munsell_0.5.1 codetools_0.2-20 htmltools_0.5.8.1
## [25] sys_3.4.3 buildtools_1.0.0 sass_0.4.9
## [28] RCurl_1.98-1.16 yaml_2.3.10 pillar_1.9.0
## [31] crayon_1.5.3 jquerylib_0.1.4 DelayedArray_0.33.2
## [34] cachem_1.1.0 iterators_1.0.14 abind_1.4-8
## [37] foreach_1.5.2 tidyselect_1.2.1 digest_0.6.37
## [40] dplyr_1.1.4 restfulr_0.0.15 splines_4.4.2
## [43] maketools_1.3.1 fastmap_1.2.0 grid_4.4.2
## [46] colorspace_2.1-1 cli_3.6.3 SparseArray_1.7.2
## [49] magrittr_2.0.3 S4Arrays_1.7.1 utf8_1.2.4
## [52] XML_3.99-0.17 UCSC.utils_1.3.0 scales_1.3.0
## [55] rmarkdown_2.29 httr_1.4.7 cellranger_1.1.0
## [58] evaluate_1.0.1 knitr_1.49 BiocIO_1.17.1
## [61] rtracklayer_1.67.0 rlang_1.1.4 glue_1.8.0
## [64] jsonlite_1.8.9 R6_2.5.1 zlibbioc_1.52.0