Package 'hiReadsProcessor' reference manual

Title:	Functions to process LM-PCR reads from 454/Illumina data
Description:	hiReadsProcessor contains set of functions which allow users to process LM-PCR products sequenced using any platform. Given an excel/txt file containing parameters for demultiplexing and sample metadata, the functions automate trimming of adaptors and identification of the genomic product. Genomic products are further processed for QC and abundance quantification.
Authors:	Nirav V Malani <[email protected]>
Maintainer:	Nirav V Malani <[email protected]>
License:	GPL-3
Version:	1.43.0
Built:	2025-03-29 05:18:53 UTC
Source:	https://github.com/bioc/hiReadsProcessor

Add a specific feature/attribute to the sampleInfo object.

Description

Given a sampleInfo object, the function adds a new feature for the given samples & sectors.

Usage

addFeature(
  sampleInfo,
  sector = NULL,
  samplename = NULL,
  feature = NULL,
  value = NULL
)
addFeature(
  sampleInfo,
  sector = NULL,
  samplename = NULL,
  feature = NULL,
  value = NULL
)

Arguments

`sampleInfo`	sample information SimpleList object, which samples per sector/quadrant information along with other metadata.
`sector`	a vector or a specific sector to add the new feature(s) to. Default is NULL, in which case the sectors are searched from samplename parameter.
`samplename`	a character vector or a specific sample to add the new feature(s) to. Default is NULL.
`feature`	a string of naming the new feature to add for the defined samplename and sector.
`value`	named vector of samplenames & values which is assigned for the defined sector, samplename, and feature. Example: c("Sample1"="ACDTDASD")

Value

modified sampleInfo object with new feature(s) added.

Examples

load(file.path(system.file("data", package = "hiReadsProcessor"),
"FLX_seqProps.RData"))
extractFeature(seqProps, sector="2", 
samplename="Roth-MLV3p-CD4TMLVWell6-MseI", feature="metadata")
seqProps <- addFeature(seqProps, sector="2", 
samplename="Roth-MLV3p-CD4TMLVWell6-MseI", feature="foo", 
value=c("Roth-MLV3p-CD4TMLVWell6-MseI"="woo"))
extractFeature(seqProps, sector="2", 
samplename="Roth-MLV3p-CD4TMLVWell6-MseI", feature="metadata")
load(file.path(system.file("data", package = "hiReadsProcessor"),
"FLX_seqProps.RData"))
extractFeature(seqProps, sector="2", 
samplename="Roth-MLV3p-CD4TMLVWell6-MseI", feature="metadata")
seqProps <- addFeature(seqProps, sector="2", 
samplename="Roth-MLV3p-CD4TMLVWell6-MseI", feature="foo", 
value=c("Roth-MLV3p-CD4TMLVWell6-MseI"="woo"))
extractFeature(seqProps, sector="2", 
samplename="Roth-MLV3p-CD4TMLVWell6-MseI", feature="metadata")

Prepend name attribute of a list to DNAStringSet

Description

Given a named listed DNAStringSet object returned from extractSeqs, the function prepends the sample name to read names.

Usage

addListNameToReads(dnaSet, flatten = FALSE)
addListNameToReads(dnaSet, flatten = FALSE)

Arguments

`dnaSet`	output from `extractSeqs`
`flatten`	should the output be unlisted? Default is FALSE.

Value

listed DNAStringSet with the names attribute prepended with the name of the list. If flatten is TRUE, then a DNAStringSet object

Examples

load(file.path(system.file("data", package = "hiReadsProcessor"),
"FLX_seqProps.RData"))
samples <- c('Roth-MLV3p-CD4TMLVWell6-Tsp509I', 
'Roth-MLV3p-CD4TMLVWell6-MseI', 'Roth-MLV3p-CD4TMLVwell5-MuA')
seqs <- extractSeqs(seqProps, sector = '2', samplename = samples,
feature="genomic")
addListNameToReads(seqs, TRUE)
load(file.path(system.file("data", package = "hiReadsProcessor"),
"FLX_seqProps.RData"))
samples <- c('Roth-MLV3p-CD4TMLVWell6-Tsp509I', 
'Roth-MLV3p-CD4TMLVWell6-MseI', 'Roth-MLV3p-CD4TMLVwell5-MuA')
seqs <- extractSeqs(seqProps, sector = '2', samplename = samples,
feature="genomic")
addListNameToReads(seqs, TRUE)

Find the 5' primers and add results to SampleInfo object.

Description

Given a sampleInfo object, the function finds 5' primers for each sample per sector and adds the results back to the object. This is a specialized function which depends on many other functions shown in 'see also section' to perform specialized trimming of 5' primer/adaptor found in the sampleInfo object. The sequence itself is never trimmed but rather coordinates of primer portion is recorded back to the object and used subsequently by extractSeqs function to perform the trimming.

Usage

annotateSites(
  sampleInfo,
  annots = NULL,
  samplenames = NULL,
  parallel = TRUE,
  ...
)
annotateSites(
  sampleInfo,
  annots = NULL,
  samplenames = NULL,
  parallel = TRUE,
  ...
)

Arguments

`sampleInfo`	sample information SimpleList object outputted from `findIntegrations`, which holds genomic integration sites.
`annots`	a named list of GRanges object holding features for annotating integration sites. The name attribute of the list is used as column name.
`samplenames`	a vector of samplenames to process. Default is NULL, which processes all samples from sampleInfo object.
`parallel`	use parallel backend to perform calculation. Defaults to TRUE. If no parallel backend is registered, then a serial version is ran using `SerialParam`. Parllelization is done at sample level per sector. Use parallel2 for parallelization at sequence level.
`...`	additional parameters to be passed to `doAnnotation` except for sites.rd, features.rd, and colnam.

Value

a SimpleList object similar to sampleInfo paramter supplied with new data added under each sector and sample. New data attributes include: primed

Examples

 
## Not run: 
load(file.path(system.file("data", package = "hiReadsProcessor"),
"FLX_seqProps.RData"))
data(genes)
genes <- makeGRanges(genes)
cpgs <- getUCSCtable("cpgIslandExt","CpG Islands")
cpgs <- makeGRanges(cbind(cpgs,strand="*"), chromCol = "chrom")
annots <- list("RefGenes"=genes,"CpG"=cpgs)
annotateSites(seqProps, annots, annotType="nearest", side="5p")

## End(Not run)
## Not run: 
load(file.path(system.file("data", package = "hiReadsProcessor"),
"FLX_seqProps.RData"))
data(genes)
genes <- makeGRanges(genes)
cpgs <- getUCSCtable("cpgIslandExt","CpG Islands")
cpgs <- makeGRanges(cbind(cpgs,strand="*"), chromCol = "chrom")
annots <- list("RefGenes"=genes,"CpG"=cpgs)
annotateSites(seqProps, annots, annotType="nearest", side="5p")

## End(Not run)

Align a listed DNAStringSet to a reference using gfClient or standalone BLAT.

Description

Align sequences from a listed DNAStringSet object returned from extractSeqs to an indexed reference genome using gfServer/gfClient protocol or using standalone BLAT and return the psl file as a GRanges object. This function heavily relies on defaults of blatSeqs.

Usage

blatListedSet(dnaSetList = NULL, ...)
blatListedSet(dnaSetList = NULL, ...)

Arguments

`dnaSetList`	DNAStringSet object containing sequences to be aligned against the reference.
`...`	parameters to be passed to `blatSeqs`.

Value

a list of GRanges object reflecting psl file type per set of sequences.

Align sequences using BLAT.

Description

Align batch of sequences using standalone BLAT or gfServer/gfClient protocol against an indexed reference genome. Depending on parameters provided, the function either aligns batch of files to a reference genome using gfClient or takes sequences from query & subject parameters and aligns them using standalone BLAT. If standaloneBlat=FALSE and gfServer is not launched apriori, this function will start one using startgfServer and kill it using stopgfServer upon successful execution.

Usage

blatSeqs(
  query = NULL,
  subject = NULL,
  standaloneBlat = TRUE,
  port = 5560,
  host = "localhost",
  parallel = TRUE,
  numServers = 1L,
  gzipResults = TRUE,
  blatParameters = c(minIdentity = 90, minScore = 10, stepSize = 5, tileSize = 10,
    repMatch = 112312, dots = 50, maxDnaHits = 10, q = "dna", t = "dna", out = "psl")
)
blatSeqs(
  query = NULL,
  subject = NULL,
  standaloneBlat = TRUE,
  port = 5560,
  host = "localhost",
  parallel = TRUE,
  numServers = 1L,
  gzipResults = TRUE,
  blatParameters = c(minIdentity = 90, minScore = 10, stepSize = 5, tileSize = 10,
    repMatch = 112312, dots = 50, maxDnaHits = 10, q = "dna", t = "dna", out = "psl")
)

Arguments

`query`	an object of DNAStringSet, a character vector of filename(s), or a path/pattern of fasta files to BLAT. Default is NULL.
`subject`	an object of DNAStringSet, a character vector, or a path to an indexed genome (nibs,2bits) to serve as a reference or target to the query. Default is NULL. If the subject is a path to a nib or 2bit file, then standaloneBlat will not work!
`standaloneBlat`	use standalone BLAT as suppose to gfServer/gfClient protocol. Default is TRUE.
`port`	the same number you started the gfServer with. Required if standaloneBlat=FALSE. Default is 5560.
`host`	name of the machine running gfServer. Default is 'localhost' and only used when standaloneBlat=FALSE.
`parallel`	use parallel backend to perform calculation with `BiocParallel`. Defaults to TRUE. If no parallel backend is registered, then a serial version is ran using `SerialParam`.
`numServers`	launch >1 gfServer and load balance jobs? This only applies when parallel=TRUE and standaloneBlat=FALSE. Enable this option only if the machine has a lot of RAM! Option ignored if launched gfServer is found at specified host and port. Default is 1.
`gzipResults`	gzip the output files? Default is TRUE.
`blatParameters`	a character vector of options to be passed to gfClient/BLAT command except for 'nohead' option. Default: c(minIdentity=90, minScore=10, stepSize=5, tileSize=10, repMatch=112312, dots=50, maxDnaHits=10, q="dna", t="dna", out="psl"). Be sure to only pass parameters accepted by either BLAT or gfClient. For example, if repMatch or stepSize parameters are specified when using gfClient, then the function will simply ignore them! The defaults are configured to align a 19bp sequence with 90% identity.

Value

a character vector of psl filenames. Each file provided is split by number of parallel workers and with read number denoting the cut. Files are cut in smaller pieces to for the ease of read & write into a single R session.

Examples

 
## Not run: 
blatSeqs(dnaSeqs, subjectSeqs, blatParameters=c(minIdentity=90, minScore=10, 
tileSize=10, dots=10, q="dna", t="dna", out="blast8"))
blatSeqs(dnaSeqs, "/usr/local/genomeIndex/hg18.2bit", standaloneBlat=FALSE)
blatSeqs("mySeqs.fa", "/usr/local/genomeIndex/hg18.2bit", standaloneBlat=FALSE)
blatSeqs("my.*.fa", "/usr/local/genomeIndex/hg18.2bit", standaloneBlat=FALSE)

## End(Not run)
## Not run: 
blatSeqs(dnaSeqs, subjectSeqs, blatParameters=c(minIdentity=90, minScore=10, 
tileSize=10, dots=10, q="dna", t="dna", out="blast8"))
blatSeqs(dnaSeqs, "/usr/local/genomeIndex/hg18.2bit", standaloneBlat=FALSE)
blatSeqs("mySeqs.fa", "/usr/local/genomeIndex/hg18.2bit", standaloneBlat=FALSE)
blatSeqs("my.*.fa", "/usr/local/genomeIndex/hg18.2bit", standaloneBlat=FALSE)

## End(Not run)

Breaks an object into chunks of N size.

Description

Given a linear/vector like object, the function breaks it into equal sized chunks either by chunkSize. This is a helper function used by functions in 'See Also' section where each chunk is sent to a parallel node for processing.

Usage

chunkize(x, chunkSize = NULL)
chunkize(x, chunkSize = NULL)

Arguments

`x`	a linear object.
`chunkSize`	number of rows to use per chunk of query. Defaults to length(x)/detectCores() or length(query)/bpworkers() depending on parallel backend registered.

Value

a list of object split into chunks.

Examples

x <- c("GAGGCTGTCACCGACAAGGTTCTGA", "AATAGCGTGGTGACAGCCCACATGC", 
       "GGTCTTCTAGGGAACCTACGCCACA", "TTTCCGGCGGCAGTCAGAGCCAAAG", 
       "TCCTGTCAACTCGTAGATCCAATCA", "ATGGTCACCTACACACAACGGCAAT", 
       "GTCAGGACACCTAATCACAAACGGA", "AGACGCAGGTTCAGGCTGGACTAGA", 
       "ATCGTTTCCGGAATTCGTGCACTGG", "CAATGCGGGCACACGCTCTTACAGT")
chunkize(DNAStringSet(x), 5)
x <- c("GAGGCTGTCACCGACAAGGTTCTGA", "AATAGCGTGGTGACAGCCCACATGC", 
       "GGTCTTCTAGGGAACCTACGCCACA", "TTTCCGGCGGCAGTCAGAGCCAAAG", 
       "TCCTGTCAACTCGTAGATCCAATCA", "ATGGTCACCTACACACAACGGCAAT", 
       "GTCAGGACACCTAATCACAAACGGA", "AGACGCAGGTTCAGGCTGGACTAGA", 
       "ATCGTTTCCGGAATTCGTGCACTGG", "CAATGCGGGCACACGCTCTTACAGT")
chunkize(DNAStringSet(x), 5)

Cluster/Correct values within a window based on their frequency given discrete factors

Description

Given a group of discrete factors (i.e. position ids) and integer values, the function tries to correct/cluster the integer values based on their frequency in a defined windowsize.

Usage

clusterSites(
  posID = NULL,
  value = NULL,
  grouping = NULL,
  psl.rd = NULL,
  weight = NULL,
  windowSize = 5L,
  byQuartile = FALSE,
  quartile = 0.7,
  parallel = TRUE,
  sonicAbund = FALSE
)
clusterSites(
  posID = NULL,
  value = NULL,
  grouping = NULL,
  psl.rd = NULL,
  weight = NULL,
  windowSize = 5L,
  byQuartile = FALSE,
  quartile = 0.7,
  parallel = TRUE,
  sonicAbund = FALSE
)

Arguments

`posID`	a vector of groupings for the value parameter (i.e. Chr,strand). Required if psl.rd parameter is not defined.
`value`	a vector of integer with values that needs to corrected or clustered (i.e. Positions). Required if psl.rd parameter is not defined.
`grouping`	additional vector of grouping of length posID or psl.rd by which to pool the rows (i.e. samplenames). Default is NULL.
`psl.rd`	a GRanges object returned from `getIntegrationSites`. Default is NULL.
`weight`	a numeric vector of weights to use when calculating frequency of value by posID and grouping if specified. Default is NULL.
`windowSize`	size of window within which values should be corrected or clustered. Default is 5.
`byQuartile`	flag denoting whether quartile based technique should be employed. See notes for details. Default is TRUE.
`quartile`	if byQuartile=TRUE, then the quartile which serves as the threshold. Default is 0.70.
`parallel`	use parallel backend to perform calculation with `BiocParallel`. Defaults to TRUE. If no parallel backend is registered, then a serial version is ran using `SerialParam`. Process is split by the grouping the column.
`sonicAbund`	calculate breakpoint abundance using `getSonicAbund`. Default is FALSE.

Value

a data frame with clusteredValues and frequency shown alongside with the original input. If psl.rd parameter is defined then a GRanges object is returned with three new columns appended at the end: clusteredPosition, clonecount, and clusterTopHit (a representative for a given cluster chosen by best scoring hit!).

Note

The algorithm for clustering when byQuartile=TRUE is as follows: for all values in each grouping, get a distribution and test if their frequency is >= quartile threshold. For values below the quartile threshold, test if any values overlap with the ones that passed the threshold and is within the defined windowSize. If there is a match, then merge with higher value, else leave it as is. This is only useful if the distribution is wide and polynodal. When byQuartile=FALSE, for each group the values within the defined window are merged with the next highest frequently occuring value, if freuquencies are tied then lowest value is used to represent the cluster. When psl.rd is passed, then multihits are ignored and only unique sites are clustered. All multihits will be tagged as a good 'clusterTopHit'.

Examples

clusterSites(posID = c('chr1-', 'chr1-', 'chr1-', 'chr2+', 'chr15-', 
'chr16-','chr11-'), value = c(rep(1000, 2), 5832, 1000, 12324, 65738, 928042), 
grouping = c('a', 'a', 'a', 'b', 'b', 'b', 'c'), parallel = FALSE)
## Not run: 
data(psl)
psl <- psl[sample(nrow(psl), 100), ]
psl.rd <- getIntegrationSites(pslToRangedObject(psl))
psl.rd$grouping <- sub("(.+)-.+", "\\1", psl.rd$qName)
clusterSites(grouping = psl.rd$grouping, psl.rd = psl.rd)

## End(Not run)
clusterSites(posID = c('chr1-', 'chr1-', 'chr1-', 'chr2+', 'chr15-', 
'chr16-','chr11-'), value = c(rep(1000, 2), 5832, 1000, 12324, 65738, 928042), 
grouping = c('a', 'a', 'a', 'b', 'b', 'b', 'c'), parallel = FALSE)
## Not run: 
data(psl)
psl <- psl[sample(nrow(psl), 100), ]
psl.rd <- getIntegrationSites(pslToRangedObject(psl))
psl.rd$grouping <- sub("(.+)-.+", "\\1", psl.rd$qName)
clusterSites(grouping = psl.rd$grouping, psl.rd = psl.rd)

## End(Not run)

Remove values/positions which are overlapping between discrete groups based on their frequency.

Description

Given a group of discrete factors (i.e. position ids) and integer values, the function tests if they overlap between groups. If overlap is found, then the group having highest frequency of a given position wins, else the position is filtered out from all the groups. The main use of this function is to remove crossover sites from different samples in the data.

Usage

crossOverCheck(
  posID = NULL,
  value = NULL,
  grouping = NULL,
  weight = NULL,
  windowSize = 1,
  psl.rd = NULL
)
crossOverCheck(
  posID = NULL,
  value = NULL,
  grouping = NULL,
  weight = NULL,
  windowSize = 1,
  psl.rd = NULL
)

Arguments

`posID`	a vector of groupings for the value parameter (i.e. Chr,strand). Required if psl.rd parameter is not defined.
`value`	a vector of integer locations/positions that needs to be binned, i.e. genomic location. Required if psl.rd parameter is not defined.
`grouping`	additional vector of grouping of length posID or psl.rd by which to pool the rows (i.e. samplenames). Default is NULL.
`weight`	a numeric vector of weights to use when calculating frequency of value by posID and grouping if specified. Default is NULL.
`windowSize`	size of window within which values should be checked. Default is 1.
`psl.rd`	a GRanges object. Default is NULL.

Value

a data frame of the original input with columns denoting whether a given row was a Candidate and isCrossover. If psl.rd parameter is defined, then a GRanges object with 'isCrossover', 'Candidate', and 'FoundIn' columns appended at the end.

Examples

crossOverCheck(posID = c('chr1-', 'chr1-', 'chr1-', 'chr1-', 'chr2+', 'chr15-', 
'chr16-', 'chr11-'), value = c(rep(1000, 3), 5832, 1000, 12324, 65738, 928042), 
grouping = c('a', 'a', 'b', 'b', 'b', 'b', 'c', 'c'))
crossOverCheck(posID = c('chr1-', 'chr1-', 'chr1-', 'chr1-', 'chr2+', 'chr15-', 
'chr16-', 'chr11-'), value = c(rep(1000, 3), 5832, 1000, 12324, 65738, 928042), 
grouping = c('a', 'a', 'b', 'b', 'b', 'b', 'c', 'c'))

Removes duplicate sequences from DNAStringSet object.

Description

Given a DNAStringSet object, the function dereplicates reads and adds counts=X to the definition line to indicate replication.

Usage

dereplicateReads(dnaSet)
dereplicateReads(dnaSet)

Arguments

dnaSet

DNAStringSet object to dereplicate.

Value

DNAStringSet object with names describing frequency of repeat.

Examples

dnaSet <- c("CCTGAATCCTGGCAATGTCATCATC", "ATCCTGGCAATGTCATCATCAATGG", 
"ATCAGTTGTCAACGGCTAATACGCG", "ATCAATGGCGATTGCCGCGTCTGCA", 
"CCGCGTCTGCAATGTGAGGGCCTAA", "GAAGGATGCCAGTTGAAGTTCACAC", 
"CCTGAATCCTGGCAATGTCATCATC", "ATCCTGGCAATGTCATCATCAATGG", 
"ATCAGTTGTCAACGGCTAATACGCG", "ATCAATGGCGATTGCCGCGTCTGCA", 
"CCGCGTCTGCAATGTGAGGGCCTAA", "GAAGGATGCCAGTTGAAGTTCACAC") 
dereplicateReads(dnaSet)
dnaSet <- c("CCTGAATCCTGGCAATGTCATCATC", "ATCCTGGCAATGTCATCATCAATGG", 
"ATCAGTTGTCAACGGCTAATACGCG", "ATCAATGGCGATTGCCGCGTCTGCA", 
"CCGCGTCTGCAATGTGAGGGCCTAA", "GAAGGATGCCAGTTGAAGTTCACAC", 
"CCTGAATCCTGGCAATGTCATCATC", "ATCCTGGCAATGTCATCATCAATGG", 
"ATCAGTTGTCAACGGCTAATACGCG", "ATCAATGGCGATTGCCGCGTCTGCA", 
"CCGCGTCTGCAATGTGAGGGCCTAA", "GAAGGATGCCAGTTGAAGTTCACAC") 
dereplicateReads(dnaSet)

Test if pattern aligns better in +/- orientation.

Description

Given a fixed length pattern sequence and variable length subject sequences, the function roughly finds which orientation of pattern yields the most hits. The function doing the heavylifting is vcountPattern. This is an accessory function used in function listed under See Also section below.

Usage

doRCtest(
  subjectSeqs = NULL,
  patternSeq = NULL,
  qualityThreshold = 0.5,
  parallel = TRUE
)
doRCtest(
  subjectSeqs = NULL,
  patternSeq = NULL,
  qualityThreshold = 0.5,
  parallel = TRUE
)

Arguments

`subjectSeqs`	DNAStringSet object containing sequences to be searched for the pattern.
`patternSeq`	DNAString object or a sequence containing the query sequence to search.
`qualityThreshold`	percent of patternSeq to match. Default is 0.50, half match. This is supplied to max.mismatch parameter of `vcountPattern` as round(nchar(patternSeq)*(1-qualityThreshold)).
`parallel`	use parallel backend to perform calculation with `BiocParallel`. Defaults to FALSE. If no parallel backend is registered, then a serial version is ran using `SerialParam`.

Value

patternSeq that aligned the best.

Examples

subjectSeqs <- c("CCTGAATCCTGGCAATGTCATCATC", "ATCCTGGCAATGTCATCATCAATGG", 
"ATCAGTTGTCAACGGCTAATACGCG", "ATCAATGGCGATTGCCGCGTCTGCA", 
"CCGCGTCTGCAATGTGAGGGCCTAA", "GAAGGATGCCAGTTGAAGTTCACAC")
subjectSeqs <- xscat("AAAAAAAAAA", subjectSeqs)
doRCtest(subjectSeqs, "TTTTTTTTT")
subjectSeqs <- c("CCTGAATCCTGGCAATGTCATCATC", "ATCCTGGCAATGTCATCATCAATGG", 
"ATCAGTTGTCAACGGCTAATACGCG", "ATCAATGGCGATTGCCGCGTCTGCA", 
"CCGCGTCTGCAATGTGAGGGCCTAA", "GAAGGATGCCAGTTGAAGTTCACAC")
subjectSeqs <- xscat("AAAAAAAAAA", subjectSeqs)
doRCtest(subjectSeqs, "TTTTTTTTT")

Extract a specific feature/attribute of the sampleInfo object.

Description

Given a sampleInfo object, the function extracts a defined feature(s) for given sample or sector.

Usage

extractFeature(sampleInfo, sector = NULL, samplename = NULL, feature = NULL)
extractFeature(sampleInfo, sector = NULL, samplename = NULL, feature = NULL)

Arguments

`sampleInfo`	sample information SimpleList object, which samples per sector/quadrant information along with other metadata.
`sector`	a vector or specific sector to extract the feature from. Default is NULL, which extracts all sectors.
`samplename`	a character vector or a specific sample to extract feature from. Default is NULL, which extracts all samples.
`feature`	Options include: primed, LTRed, linkered, decoded, and any of the metadata. Default is NULL. When feature='metadata', then it returns names of all the metadata elements associated with the sample as a comma separated list.

Value

a list or list of lists depending upon which parameters were supplied.

Examples

load(file.path(system.file("data", package = "hiReadsProcessor"),
"FLX_seqProps.RData"))
samples <- c('Roth-MLV3p-CD4TMLVWell6-Tsp509I', 
'Roth-MLV3p-CD4TMLVWell6-MseI', 'Roth-MLV3p-CD4TMLVwell5-MuA')
extractFeature(seqProps, sector='2', samplename=samples, feature="primed")
extractFeature(seqProps, sector='2', samplename=samples, feature="linkered")
extractFeature(seqProps, sector='2', samplename=samples, feature="metadata")
load(file.path(system.file("data", package = "hiReadsProcessor"),
"FLX_seqProps.RData"))
samples <- c('Roth-MLV3p-CD4TMLVWell6-Tsp509I', 
'Roth-MLV3p-CD4TMLVWell6-MseI', 'Roth-MLV3p-CD4TMLVwell5-MuA')
extractFeature(seqProps, sector='2', samplename=samples, feature="primed")
extractFeature(seqProps, sector='2', samplename=samples, feature="linkered")
extractFeature(seqProps, sector='2', samplename=samples, feature="metadata")

Extract sequences for a feature in the sampleInfo object.

Description

Given a sampleInfo object, the function extracts sequences for a defined feature.

Usage

extractSeqs(
  sampleInfo,
  sector = NULL,
  samplename = NULL,
  feature = "genomic",
  trim = TRUE,
  minReadLength = 1,
  sideReturn = NULL,
  pairReturn = "both",
  strict = FALSE
)
extractSeqs(
  sampleInfo,
  sector = NULL,
  samplename = NULL,
  feature = "genomic",
  trim = TRUE,
  minReadLength = 1,
  sideReturn = NULL,
  pairReturn = "both",
  strict = FALSE
)

Arguments

`sampleInfo`	sample information SimpleList object, which samples per sector/quadrant information along with other metadata.
`sector`	specific sector to extract sequences from. Default is NULL, which extracts all sectors.
`samplename`	specific sample to extract sequences from. Default is NULL, which extracts all samples.
`feature`	which part of sequence to extract (case sensitive). Options include: primed, !primed, LTRed, !LTRed, linkered, !linkered, primerIDs, genomic, genomicLinkered, decoded, and unDecoded. If a sample was primerIDed and processed by `primerIDAlignSeqs`, then all the rejected and unmatched attributes can be prepended to the feature. Example: vectored, Rejectedlinkered, RejectedprimerIDslinkered, Absentlinkered, or unAnchoredprimerIDslinkered. When feature is genomic, it includes sequences which are primed, LTRed, linkered, and !linkered. The genomicLinkered is same as genomic minus the !linkered. When feature is decoded, it includes everything that demultiplexed. The '!' in front of a feature extracts the inverse. One can only get unDecoded sequences if returnUnmatched was TRUE in `findBarcodes`. If `findVector` was run and "vectored" feature was found in the sampleInfo object, then genomic & genomicLinkered output will have vectored reads removed.
`trim`	whether to trim the given feature from sequences or keep it. Default is TRUE. This option is ignored for feature with '!'.
`minReadLength`	threshold for minimum length of trimmed sequences to return.
`sideReturn`	if trim=TRUE, which side of the sequence to return: left, middle, or right. Defaults to NULL and determined automatically. Doesn't apply to features: decoded, genomic or genomicLinkered.
`pairReturn`	if the data is paired end, then from which pair to return the feature. Options are "pair1", "pair2", or defaults to "both". Ignored if data is single end.
`strict`	this option is used when feature is either 'genomic' or 'genomicLinkered'. When a sample has no LTRed reads, primer ends are used as starting points by default to extract the genomic part. Enabling this option will strictly ensure that only reads with primer and LTR are trimmed for the 'genomic' or 'genomicLinkered' feature. Default is FALSE.

Value

a listed DNAStringSet object structed by sector then sample. Note: when feature='genomic' or 'genomicLinkered' and when data is paired end, then "pair2" includes union of reads from both pairs which found LTR.

Examples

load(file.path(system.file("data", package = "hiReadsProcessor"),
"FLX_seqProps.RData"))
samples <- c('Roth-MLV3p-CD4TMLVWell6-Tsp509I', 
'Roth-MLV3p-CD4TMLVWell6-MseI', 'Roth-MLV3p-CD4TMLVwell5-MuA')
extractSeqs(seqProps, sector='2', samplename=samples, feature="primed")
extractSeqs(seqProps, sector='2', samplename=samples, feature="!primed")
extractSeqs(seqProps, sector='2', samplename=samples, feature="linkered")
extractSeqs(seqProps, sector='2', samplename=samples, feature="genomic")
load(file.path(system.file("data", package = "hiReadsProcessor"),
"FLX_seqProps.RData"))
samples <- c('Roth-MLV3p-CD4TMLVWell6-Tsp509I', 
'Roth-MLV3p-CD4TMLVWell6-MseI', 'Roth-MLV3p-CD4TMLVwell5-MuA')
extractSeqs(seqProps, sector='2', samplename=samples, feature="primed")
extractSeqs(seqProps, sector='2', samplename=samples, feature="!primed")
extractSeqs(seqProps, sector='2', samplename=samples, feature="linkered")
extractSeqs(seqProps, sector='2', samplename=samples, feature="genomic")

Find and trim vector sequence from reads.

Description

This function facilitates finding and trimming of long/short fragments of vector present in LM-PCR products. The algorithm looks for vector sequence present anywhere within the read and trims according longest contiguous match on either end of the read. Alignment is doing using BLAT

Usage

findAndRemoveVector(
  reads,
  Vector,
  minLength = 10,
  returnCoords = FALSE,
  parallel = TRUE
)
findAndRemoveVector(
  reads,
  Vector,
  minLength = 10,
  returnCoords = FALSE,
  parallel = TRUE
)

Arguments

`reads`	DNAStringSet object containing sequences to be trimmed for vector.
`Vector`	DNAString object containing vector sequence to be searched in reads.
`minLength`	integer value dictating minimum length of trimmed sequences to return. Default is 10.
`returnCoords`	return the coordinates of vector start-stop for the matching reads. Defaults to FALSE.
`parallel`	use parallel backend to perform calculation with `BiocParallel`. Defaults to TRUE. If no parallel backend is registered, then a serial version is ran using `SerialParam`.

Value

DNAStringSet object with Vector sequence removed from the reads. If returnCoords=TRUE, then a list of two named elements "hits" & "reads". The first element, "hits" is a GRanges object with properties of matched region and whether it was considered valid denoted by 'good.row'. The second element, "reads" is a DNAStringSet object with Vector sequence removed from the reads.

Note

If parallel=TRUE, then be sure to have a parallel backend registered before running the function. One can use any of the following MulticoreParam SnowParam

Find and trim a short pattern sequence from the subject.

Description

This function facilitates finding and trimming of a short pattern sequence from a collection of subject sequences. The trimming is dictated by side parameter. For more information on the trimming process see the 'side' parameter documentation in trimSeqs. For information regarding the pattern alignment see the documentation for pairwiseAlignSeqs. This function is meant for aligning a short pattern onto large collection of subjects. If you are looking to align a vector sequence to subjects, then please use BLAT.

Usage

findAndTrimSeq(
  patternSeq,
  subjectSeqs,
  side = "left",
  offBy = 0,
  alignWay = "slow",
  ...
)
findAndTrimSeq(
  patternSeq,
  subjectSeqs,
  side = "left",
  offBy = 0,
  alignWay = "slow",
  ...
)

Arguments

`patternSeq`	DNAString object or a sequence containing the query sequence to search.
`subjectSeqs`	DNAStringSet object containing sequences to be searched for the pattern.
`side`	which side of the sequence to perform the search & trimming: left, right or middle. Default is 'left'.
`offBy`	integer value dictating if the trimming base should be offset by X number of bases. Default is 0.
`alignWay`	method to utilize for detecting the primers. One of following: "slow" (Default), "fast", or "blat". Fast, calls `vpairwiseAlignSeqs` and uses `vmatchPattern` at its core, which is less accurate with indels and mismatches but much faster. Slow, calls `pairwiseAlignSeqs` and uses `pairwiseAlignment` at its core, which is accurate with indels and mismatches but slower. Blat will use `blatSeqs`.
`...`	parameters to be passed to `pairwiseAlignment`, `vpairwiseAlignSeqs` or `blatSeqs` depending on which method is defined in 'alignWay' parameter.

Value

DNAStringSet object with pattern sequence removed from the subject sequences.

Note

If parallel=TRUE, then be sure to have a parallel backend registered before running the function. One can use any of the following MulticoreParam SnowParam

Examples

findAndTrimSeq(patternSeq="AGACCCTTTT",
subjectSeqs=DNAStringSet(c("AGACCCTTTTGAGCAGCAT","AGACCCTTGGTCGACTCA",
"AGACCCTTTTGACGAGCTAG")), qualityThreshold=.85, doRC=FALSE, side="left", 
offBy=1, alignWay = "slow")

findAndTrimSeq(patternSeq="AGACCCTTTT",
subjectSeqs=DNAStringSet(c("AGACCCTTTTGAGCAGCAT","AGACCCTTGGTCGACTCA",
"AGACCCTTTTGACGAGCTAG")), qualityThreshold=.85, doRC=FALSE, side="left", 
offBy=1, alignWay = "slow")

Demultiplex reads by their barcodes

Description

Given a sample information object, the function reads in the raw fasta/fastq file, demultiplexes reads by their barcodes, and appends it back to the sampleInfo object. Calls splitByBarcode to perform the actual splitting of file by barcode sequences. If supplied with a character vector and reads themselves, the function behaves a bit differently. See the examples.

Usage

findBarcodes(
  sampleInfo,
  sector = NULL,
  dnaSet = NULL,
  showStats = FALSE,
  returnUnmatched = FALSE,
  dereplicate = FALSE,
  alreadyDecoded = FALSE
)
findBarcodes(
  sampleInfo,
  sector = NULL,
  dnaSet = NULL,
  showStats = FALSE,
  returnUnmatched = FALSE,
  dereplicate = FALSE,
  alreadyDecoded = FALSE
)

Arguments

`sampleInfo`	sample information SimpleList object created using `read.sampleInfo`, which holds barcodes and sample names per sector/quadrant/lane or a character vector of barcodes to sample name associations. Ex: c("ACATCCAT"="Sample1", "GAATGGAT"="Sample2",...)
`sector`	If sampleInfo is a SimpleList object, then a numeric/character value or vector representing sector(s) in sampleInfo. Optionally if on high memory machine sector='all' will decode/demultiplex sequences from all sectors/quadrants. This option is ignored if sampleInfo is a character vector. Default is NULL.
`dnaSet`	DNAStringSet object containing sequences to be decoded or demultiplexed. Default is NULL. If sampleInfo is a SimpleList object, then reads are automatically extracted using `read.seqsFromSector` and parameters defined in sampleInfo object.
`showStats`	toggle output of search statistics. Default is FALSE.
`returnUnmatched`	return unmatched sequences. Returns results as a list where x[["unDecodedSeqs"]] has culprits. Default is FALSE.
`dereplicate`	return dereplicated sequences. Calls `dereplicateReads`, which appends counts=X to sequence names/deflines. Default is FALSE. Not applicable for paired end data since it can cause insyncronicity.
`alreadyDecoded`	if reads have be already decoded and split into respective files per sample and 'seqfilePattern' parameter in `read.SeqFolder` is set to reading sample files and not the sector files, then set this to TRUE. Default is FALSE. Enabling this parameter skips the barcode detection step and loads the sequence file as is into the sampleInfo object.

Value

If sampleInfo is an object of SimpleList then decoded sequences are appeneded to respective sample slots, else a named list of DNAStringSet object. If returnUnmatched=TRUE, then x[["unDecodedSeqs"]] has the unmatched sequences.

Examples

dnaSet <- DNAStringSet(c("read1" = "ACATCCATAGAGCTACGACGACATCGACATA",
"read2"="GAATGGATGACGACTACAGCACGACGAGCAGCTACT",
"read3"="GAATGGATGCGCTAAGAAGAGA", "read4"="ACATCCATTCTACACATCT"))
findBarcodes(sampleInfo = c("ACATCCAT" = "Sample1", "GAATGGAT" = "Sample2"),
dnaSet=dnaSet, showStats=TRUE, returnUnmatched=TRUE)
## Not run: 
load(file.path(system.file("data", package = "hiReadsProcessor"),
"FLX_seqProps.RData"))
findBarcodes(seqProps, sector = "all", showStats = TRUE)

## End(Not run)
dnaSet <- DNAStringSet(c("read1" = "ACATCCATAGAGCTACGACGACATCGACATA",
"read2"="GAATGGATGACGACTACAGCACGACGAGCAGCTACT",
"read3"="GAATGGATGCGCTAAGAAGAGA", "read4"="ACATCCATTCTACACATCT"))
findBarcodes(sampleInfo = c("ACATCCAT" = "Sample1", "GAATGGAT" = "Sample2"),
dnaSet=dnaSet, showStats=TRUE, returnUnmatched=TRUE)
## Not run: 
load(file.path(system.file("data", package = "hiReadsProcessor"),
"FLX_seqProps.RData"))
findBarcodes(seqProps, sector = "all", showStats = TRUE)

## End(Not run)

Find the integration sites and add results to SampleInfo object.

Description

Given a SampleInfo object, the function finds integration sites for each sample using their respective settings and adds the results back to the object. This is an all-in-one function which aligns, finds best hit per read per sample, cluster sites, and assign ISU IDs. It calls blatSeqs, read.psl, getIntegrationSites, clusterSites, otuSites. here must be linkered reads within the sampleInfo object in order to use this function using the default parameters. If you are planning on BLATing non-linkered reads, then change the seqType to one of accepted options for the 'feature' parameter of extractSeqs, except for '!' based features.

Usage

findIntegrations(
  sampleInfo,
  seqType = NULL,
  genomeIndices = NULL,
  samplenames = NULL,
  parallel = TRUE,
  autoOptimize = FALSE,
  doSonic = FALSE,
  doISU = FALSE,
  ...
)
findIntegrations(
  sampleInfo,
  seqType = NULL,
  genomeIndices = NULL,
  samplenames = NULL,
  parallel = TRUE,
  autoOptimize = FALSE,
  doSonic = FALSE,
  doISU = FALSE,
  ...
)

Arguments

`sampleInfo`	sample information SimpleList object outputted from `findLinkers`, which holds decoded, primed, LTRed, and Linkered sequences for samples per sector/quadrant along with metadata.
`seqType`	which type of sequence to align and find integration sites. Default is NULL and determined automatically based on type of restriction enzyme or isolation method used. If restriction enzyme is Fragmentase, MuA, Sonication, or Sheared then this parameter is set to genomicLinkered, else it is genomic. Any one of following options are valid: genomic, genomicLinkered, decoded, primed, LTRed, linkered.
`genomeIndices`	an associative character vector of freeze to full or relative path of respective of indexed genomes from BLAT(.nib or .2bit files). For example: c("hg18"="/usr/local/blatSuite34/hg18.2bit", "mm8"="/usr/local/blatSuite34/mm8.2bit"). Be sure to supply an index per freeze supplied in the sampleInfo object. Default is NULL.
`samplenames`	a vector of samplenames to process. Default is NULL, which processes all samples from sampleInfo object.
`parallel`	use parallel backend to perform calculation with `BiocParallel`. Defaults to TRUE. If no parallel backend is registered, then a serial version is ran using `SerialParam`.
`autoOptimize`	if aligner='BLAT', then should the blatParameters be automatically optimized based on the reads? Default is FALSE. When TRUE, following parameters are adjusted within the supplied blatParameters vector: stepSize, tileSize, minScore, minIdentity. This parameter is useful when aligning reads of various lengths to the genome. Optimization is done using only read lengths. In beta phase!
`doSonic`	calculate integration sites abundance using breakpoints. See `getSonicAbund` for more details. Default is FALSE.
`doISU`	calculate integration site unit for multihits. See `isuSites` for more details. Default is FALSE.
`...`	additional parameters to be passed to `blatSeqs`.

Value

a SimpleList object similar to sampleInfo parameter supplied with new data added under each sector and sample. New data attributes include: psl, and sites. The psl attributes holds the genomic hits per read along with QC information. The sites attribute holds the condensed integration sites where genomic hits have been clustered by the Position column and cherry picked to have each site pass all the QC steps.

Note

If parallel=TRUE, then be sure to have a parallel backend registered before running the function. One can use any of the following MulticoreParam SnowParam

Examples

 
## Not run: 
load(file.path(system.file("data", package = "hiReadsProcessor"),
"FLX_seqProps.RData"))
findIntegrations(seqProps, 
genomeIndices=c("hg18"="/usr/local/genomeIndexes/hg18.noRandom.2bit"), 
numServers=2)

## End(Not run)
## Not run: 
load(file.path(system.file("data", package = "hiReadsProcessor"),
"FLX_seqProps.RData"))
findIntegrations(seqProps, 
genomeIndices=c("hg18"="/usr/local/genomeIndexes/hg18.noRandom.2bit"), 
numServers=2)

## End(Not run)

Find the 3' linkers and add results to SampleInfo object.

Description

Given a sampleInfo object, the function finds 3' linkers for each sample per sector and adds the results back to the object. This is a specialized function which depends on many other functions shown in 'see also section' to perform specialized trimming of 3' primer/linker adaptor sequence found in the sampleInfo object. The sequence itself is never trimmed but rather coordinates of linker portion is recorded back to the object and used subsequently by extractSeqs function to perform the trimming. This function heavily relies on either pairwiseAlignSeqs or primerIDAlignSeqs depending upon whether linkers getting aligned have primerID in it or not.

Usage

findLinkers(
  sampleInfo,
  showStats = FALSE,
  doRC = FALSE,
  parallel = TRUE,
  samplenames = NULL,
  bypassChecks = FALSE,
  parallel2 = FALSE,
  ...
)
findLinkers(
  sampleInfo,
  showStats = FALSE,
  doRC = FALSE,
  parallel = TRUE,
  samplenames = NULL,
  bypassChecks = FALSE,
  parallel2 = FALSE,
  ...
)

Arguments

`sampleInfo`	sample information SimpleList object outputted from `findPrimers` or `findLTRs`, which holds decoded sequences for samples per sector/quadrant along with information of sample to primer associations.
`showStats`	toggle output of search statistics. Default is FALSE.
`doRC`	perform reverse complement search of the defined pattern/linker sequence. Default is FALSE.
`parallel`	use parallel backend to perform calculation with `BiocParallel`. Defaults to TRUE. If no parallel backend is registered, then a serial version is ran using `SerialParam`. Parllelization is done at sample level per sector.
`samplenames`	a vector of samplenames to process. Default is NULL, which processes all samples from sampleInfo object.
`bypassChecks`	skip checkpoints which detect if something was odd with the data? Default is FALSE.
`parallel2`	perform parallelization is sequence level. Default is FALSE. Useful in cases where each sector has only one sample with numerous sequences.
`...`	extra parameters to be passed to `pairwiseAlignment`.

Value

a SimpleList object similar to sampleInfo paramter supplied with new data added under each sector and sample. New data attributes include: linkered. If linkers have primerID then, primerIDs attribute is appended as well.

Note

For paired end data, qualityThreshold for pair 2 is increased by 0.25 or set to 1 whichever is lower to increase quality & full match to linker sequence.
If no linker matches are found with default options, then try doRC=TRUE.
If parallel=TRUE, then be sure to have a parallel backend registered before running the function. One can use any of the following MulticoreParam SnowParam

Examples

 
## Not run: 
load(file.path(system.file("data", package = "hiReadsProcessor"),
"FLX_seqProps.RData"))
findLinkers(seqProps, showStats=TRUE, doRC=TRUE)

## End(Not run)
## Not run: 
load(file.path(system.file("data", package = "hiReadsProcessor"),
"FLX_seqProps.RData"))
findLinkers(seqProps, showStats=TRUE, doRC=TRUE)

## End(Not run)

Find the 5' LTRs and add results to SampleInfo object.

Description

Given a sampleInfo object, the function finds 5' LTR following the primer for each sample per sector and adds the results back to the object. This is a specialized function which depends on many other functions shown in 'see also section' to perform specialized trimming of 5' viral LTRs found in the sampleInfo object. The sequence itself is never trimmed but rather coordinates of LTR portion is added to primer coordinates and recorded back to the object and used subsequently by extractSeqs function to perform the trimming. This function heavily relies on pairwiseAlignSeqs.

Usage

findLTRs(
  sampleInfo,
  showStats = FALSE,
  doRC = FALSE,
  parallel = TRUE,
  samplenames = NULL,
  bypassChecks = FALSE,
  parallel2 = FALSE,
  ...
)
findLTRs(
  sampleInfo,
  showStats = FALSE,
  doRC = FALSE,
  parallel = TRUE,
  samplenames = NULL,
  bypassChecks = FALSE,
  parallel2 = FALSE,
  ...
)

Arguments

`sampleInfo`	sample information SimpleList object outputted from `findPrimers`, which holds decoded and primed sequences for samples per sector/quadrant along with information of sample to LTR associations.
`showStats`	toggle output of search statistics. Default is FALSE. For paired end data, stats for "pair2" is relative to decoded and/or primed reads.
`doRC`	perform reverse complement search of the defined pattern/LTR sequence. Default is FALSE.
`parallel`	use parallel backend to perform calculation with `BiocParallel`. Defaults to TRUE. If no parallel backend is registered, then a serial version is ran using `SerialParam`. Parllelization is done at sample level per sector.
`samplenames`	a vector of samplenames to process. Default is NULL, which processes all samples from sampleInfo object.
`bypassChecks`	skip checkpoints which detect if something was odd with the data? Default is FALSE.
`parallel2`	perform parallelization is sequence level. Default is FALSE. Useful in cases where each sector has only one sample with numerous sequences.
`...`	extra parameters to be passed to `pairwiseAlignment`.

Value

a SimpleList object similar to sampleInfo paramter supplied with new data added under each sector and sample. New data attributes include: LTRed

Note

For paired end data, qualityThreshold for pair 2 is decreased by 0.05 to increase chances of matching LTR sequence.
If parallel=TRUE, then be sure to have a parallel backend registered before running the function. One can use any of the following MulticoreParam SnowParam

Examples

 
## Not run: 
load(file.path(system.file("data", package = "hiReadsProcessor"),
"FLX_seqProps.RData"))
findLTRs(seqProps, showStats=TRUE)

## End(Not run)
## Not run: 
load(file.path(system.file("data", package = "hiReadsProcessor"),
"FLX_seqProps.RData"))
findLTRs(seqProps, showStats=TRUE)

## End(Not run)

Find the 5' primers and add results to SampleInfo object.

Description

Usage

findPrimers(
  sampleInfo,
  alignWay = "slow",
  showStats = FALSE,
  doRC = FALSE,
  parallel = TRUE,
  samplenames = NULL,
  bypassChecks = FALSE,
  parallel2 = FALSE,
  ...
)
findPrimers(
  sampleInfo,
  alignWay = "slow",
  showStats = FALSE,
  doRC = FALSE,
  parallel = TRUE,
  samplenames = NULL,
  bypassChecks = FALSE,
  parallel2 = FALSE,
  ...
)

Arguments

`sampleInfo`	sample information SimpleList object outputted from `findBarcodes`, which holds decoded sequences for samples per sector/quadrant along with information of sample to primer associations.
`alignWay`	method to utilize for detecting the primers. One of following: "slow" (Default), or "fast". Fast, calls `vpairwiseAlignSeqs` and uses `vmatchPattern` at its core, which is less accurate with indels and mismatches but much faster. Slow, calls `pairwiseAlignSeqs` and uses `pairwiseAlignment` at its core, which is accurate with indels and mismatches but slower.
`showStats`	toggle output of search statistics. Default is FALSE.
`doRC`	perform reverse complement search of the defined pattern/primer. Default is FALSE.
`parallel`	use parallel backend to perform calculation . Defaults to TRUE. If no parallel backend is registered, then a serial version is ran using `SerialParam`. Parllelization is done at sample level per sector. Use parallel2 for parallelization at sequence level.
`samplenames`	a vector of samplenames to process. Default is NULL, which processes all samples from sampleInfo object.
`bypassChecks`	skip checkpoints which detect if something was odd with the data? Default is FALSE.
`parallel2`	perform parallelization is sequence level. Default is FALSE. Useful in cases where each sector has only one sample with numerous sequences.
`...`	extra parameters to be passed to either `vmatchPattern` or `pairwiseAlignment` depending on 'alignWay' parameter.

Value

a SimpleList object similar to sampleInfo paramter supplied with new data added under each sector and sample. New data attributes include: primed

Note

For paired end data, qualityThreshold for pair 2 is decreased by 0.10 to increase chances of matching primer sequence.
If parallel=TRUE, then be sure to have a parallel backend registered before running the function. One can use any of the following MulticoreParam SnowParam

Examples


## Not run: 
load(file.path(system.file("data", package = "hiReadsProcessor"),
"FLX_seqProps.RData"))
findPrimers(seqProps, showStats=TRUE)

## End(Not run)
## Not run: 
load(file.path(system.file("data", package = "hiReadsProcessor"),
"FLX_seqProps.RData"))
findPrimers(seqProps, showStats=TRUE)

## End(Not run)

Find vector DNA in reads and add results to SampleInfo object.

Description

Given a sampleInfo object, the function finds vector fragments following the LTR piece for each sample per sector and adds the results back to the object. This is a specialized function which depends on many other functions shown in 'see also section' to perform specialized trimming of 5' viral LTRs found in the sampleInfo object. The sequence itself is never trimmed but rather coordinates of vector portion is added to LTR coordinates and recorded back to the object and used subsequently by extractSeqs function to perform the trimming. This function heavily relies on blatSeqs. In order for this function to work, it needs vector sequence which is read in using 'vectorFile' metadata supplied in the sample information file in read.sampleInfo

Usage

findVector(sampleInfo, showStats = FALSE, parallel = TRUE, samplenames = NULL)
findVector(sampleInfo, showStats = FALSE, parallel = TRUE, samplenames = NULL)

Arguments

`sampleInfo`	sample information SimpleList object outputted from `findLTRs`, which holds decoded, primed, and LTRed sequences for samples per sector/quadrant.
`showStats`	toggle output of search statistics. Default is FALSE.
`parallel`	use parallel backend to perform calculation with `BiocParallel`. Defaults to TRUE. If no parallel backend is registered, then a serial version is ran using `SerialParam`. Parllelization is done at sample level per sector. Use parallel2 for parallelization at sequence level.
`samplenames`	a vector of samplenames to process. Default is NULL, which processes all samples from sampleInfo object.

Value

a SimpleList object similar to sampleInfo paramter supplied with new data added under each sector and sample. New data attributes include: vectored

Note

If parallel=TRUE, then be sure to have a parallel backend registered before running the function. One can use any of the following MulticoreParam SnowParam

Examples


## Not run: 
load(file.path(system.file("data", package = "hiReadsProcessor"),
"FLX_seqProps.RData"))
findVector(seqProps, showStats=TRUE)

## End(Not run)
## Not run: 
load(file.path(system.file("data", package = "hiReadsProcessor"),
"FLX_seqProps.RData"))
findVector(seqProps, showStats=TRUE)

## End(Not run)

Obtain integration sites from BLAT output

Description

Given a GRanges object from read.psl, the function uses specified filtering parameters to obtain integration sites and maintain sequence attrition. The function will remove any non-best scoring alignments from the object if not already filtered apriori.

Usage

getIntegrationSites(
  psl.rd = NULL,
  startWithin = 3,
  alignRatioThreshold = 0.7,
  genomicPercentIdentity = 0.98,
  correctByqStart = TRUE,
  oneBased = FALSE
)
getIntegrationSites(
  psl.rd = NULL,
  startWithin = 3,
  alignRatioThreshold = 0.7,
  genomicPercentIdentity = 0.98,
  correctByqStart = TRUE,
  oneBased = FALSE
)

Arguments

`psl.rd`	a GRanges object reflecting psl format where tName is the seqnames.
`startWithin`	upper bound limit of where the alignment should start within the query. Default is 3.
`alignRatioThreshold`	cuttoff for (alignment span/read length). Default is 0.7.
`genomicPercentIdentity`	cuttoff for (1-(misMatches/matches)). Default is 0.98.
`correctByqStart`	use qStart to correct genomic position. This would account for sequencing/trimming errors. Position=ifelse(strand=="+",tStart-qStart,tEnd+qStart). Default is TRUE.
`oneBased`	the coordinates in psl files are "zero based half open". The first base in a sequence is numbered zero rather than one. Enabling this would add +1 to the start and leave the end as is. Default is FALSE.

Value

a GRanges object with integration sites which passed all filtering criteria. Each filtering parameter creates a new column to flag if a sequence/read passed that filter which follows the scheme: 'pass.FilterName'. Integration Site is marked by new column named 'Position'.

Examples

data(psl)
psl.rd <- pslToRangedObject(psl)
getIntegrationSites(psl.rd)
data(psl)
psl.rd <- pslToRangedObject(psl)
getIntegrationSites(psl.rd)

Get sectors for samples defined in the sampleInfo object.

Description

Given a sampleInfo object, the function gets the sectors for each samplename. This is an accessory function utilized by other functions of this package to aid sector retrieval.

Usage

getSectorsForSamples(
  sampleInfo,
  sector = NULL,
  samplename = NULL,
  returnDf = FALSE
)
getSectorsForSamples(
  sampleInfo,
  sector = NULL,
  samplename = NULL,
  returnDf = FALSE
)

Arguments

`sampleInfo`	sample information SimpleList object, which samples per sector/quadrant information along with other metadata.
`sector`	a specific sector or vector of sectors if known ahead of time. Default is NULL, which extracts all sectors.
`samplename`	a specific sample or vector of samplenames to get sectors for. Default is NULL, which extracts all samples.
`returnDf`	return results in a dataframe. Default is FALSE.

Value

If returnDf=TRUE, then a dataframe of sector associated with each samplename, else a named list of length two: x[["sectors"]] and x[["samplenames"]]

Examples

load(file.path(system.file("data", package = "hiReadsProcessor"),
"FLX_seqProps.RData"))
samples <- c('Roth-MLV3p-CD4TMLVWell6-Tsp509I', 
'Roth-MLV3p-CD4TMLVWell6-MseI', 'Roth-MLV3p-CD4TMLVwell5-MuA')
getSectorsForSamples(seqProps, samplename=samples)
getSectorsForSamples(seqProps, samplename=samples, returnDf=TRUE)
load(file.path(system.file("data", package = "hiReadsProcessor"),
"FLX_seqProps.RData"))
samples <- c('Roth-MLV3p-CD4TMLVWell6-Tsp509I', 
'Roth-MLV3p-CD4TMLVWell6-MseI', 'Roth-MLV3p-CD4TMLVwell5-MuA')
getSectorsForSamples(seqProps, samplename=samples)
getSectorsForSamples(seqProps, samplename=samples, returnDf=TRUE)

Calculate breakpoint/sonic abundance of integration sites in a population

Description

Given distinct fragment lengths per integration, the function calculates sonic abundance as described in sonicLength. This function is called by clusterSites and needs all individual fragments lengths per position to properly estimate the clonal abundance of an integration sites in a given population.

Usage

getSonicAbund(
  posID = NULL,
  fragLen = NULL,
  grouping = NULL,
  replicateNum = NULL,
  psl.rd = NULL,
  parallel = TRUE
)
getSonicAbund(
  posID = NULL,
  fragLen = NULL,
  grouping = NULL,
  replicateNum = NULL,
  psl.rd = NULL,
  parallel = TRUE
)

Arguments

`posID`	a vector of discrete positions, i.e. Chr,strand,Position. Required if psl.rd parameter is not defined.
`fragLen`	a vector of fragment length per posID. Required if psl.rd parameter is not defined.
`grouping`	additional vector of grouping of length posID or psl.rd by which to pool the rows (i.e. samplenames). Default is NULL.
`replicateNum`	an optional vector of the replicate number per grouping and posID. Default is NULL.
`psl.rd`	a GRanges object returned from `getIntegrationSites` Default is NULL.
`parallel`	use parallel backend to perform calculation with `BiocParallel`. Defaults to TRUE. If no parallel backend is registered, then a serial version is ran using `SerialParam`. Process is split by the grouping the column.

Value

a data frame with estimated sonic abundance shown alongside with the original input. If psl.rd parameter is defined then a GRanges object is returned with a new column 'estAbund'.

Note

For samples isolated using traditional restriction digest method, the abundance will be inaccurate as it is designed for sonicated or sheared sample preparation method.

Examples

data("A1",package='sonicLength')
A1 <- droplevels(A1[1:1000,])
bore <- with(A1, getSonicAbund(locations, lengths, "A", replicates))
head(bore)
data("A1",package='sonicLength')
A1 <- droplevels(A1[1:1000,])
bore <- with(A1, getSonicAbund(locations, lengths, "A", replicates))
head(bore)

Functions to process LM-PCR reads from 454/Illumina data

Description

hiReadsProcessor contains set of functions which allow users to process LM-PCR products sequenced using any platform. Given an excel/txt file containing parameters for demultiplexing and sample metadata, the functions automate trimming of adaptors and identification of the genomic product. Genomic products are further processed for QC and abundance quantification.

Author(s)

Nirav V Malani

Bin values or make ISUs by assigning a unique ID to them within discrete factors.

Description

Given a group of values or genomic positions per read/clone, the function tries to yield a unique ISU (Integration Site Unit) ID for the collection based on overlap of locations to other reads/clones by grouping. This is mainly useful when each read has many locations which needs to be considered as one single group of sites.

Usage

isuSites(
  posID = NULL,
  value = NULL,
  readID = NULL,
  grouping = NULL,
  psl.rd = NULL,
  maxgap = 5,
  parallel = TRUE
)
isuSites(
  posID = NULL,
  value = NULL,
  readID = NULL,
  grouping = NULL,
  psl.rd = NULL,
  maxgap = 5,
  parallel = TRUE
)

Arguments

`posID`	a vector of groupings for the value parameter (i.e. Chr,strand). Required if psl.rd parameter is not defined.
`value`	a vector of integer locations/positions that needs to be binned, i.e. genomic location. Required if psl.rd parameter is not defined.
`readID`	a vector of read/clone names which is unique to each row, i.e. deflines.
`grouping`	additional vector of grouping of length posID or psl.rd by which to pool the rows (i.e. samplenames). Default is NULL.
`psl.rd`	a GRanges object returned from `clusterSites`. Default is NULL.
`maxgap`	max distance allowed between two non-overlapping position to trigger the merging. Default is 5.
`parallel`	use parallel backend to perform calculation with `BiocParallel`. Defaults to TRUE. If no parallel backend is registered, then a serial version is ran using `SerialParam`. Process is split by the grouping the column.

Value

a data frame with binned values and isuID shown alongside the original input. If psl.rd parameter is defined, then a GRanges object where object is first filtered by clusterTopHit column and the isuID column appended at the end.

Note

The algorithm for making isus of sites is as follows: for each readID check how many positions are there. Separate readIDs with only position from the rest. Check if any readIDs with >1 position match to any readIDs with only one position. If there is a match, then assign both readIDs with the same ISU ID. Check if any positions from readIDs with >1 position match any other readIDs with >1 position. If yes, then assign same ISU ID to all readIDs sharing 1 or more positions.

Examples

isuSites(posID = c('chr1-', 'chr1-', 'chr1-', 'chr2+', 'chr15-', 'chr16-', 'chr11-'), 
value = c(rep(1000, 2), 5832, 1000, 12324, 65738, 928042), 
readID = paste('read', sample(letters, 7), sep = '-'), 
grouping = c('a', 'a',  'a', 'b', 'b', 'b', 'c'), parallel = FALSE)
isuSites(posID = c('chr1-', 'chr1-', 'chr1-', 'chr2+', 'chr15-', 'chr16-', 'chr11-'), 
value = c(rep(1000, 2), 5832, 1000, 12324, 65738, 928042), 
readID = paste('read', sample(letters, 7), sep = '-'), 
grouping = c('a', 'a',  'a', 'b', 'b', 'b', 'c'), parallel = FALSE)

Bin values or make OTUs by assigning a unique ID to them within discrete factors.

Description

Given a group of values or genomic positions per read/clone, the function tries to yield a unique OTU (operation taxinomical unit) ID for the collection based on overlap of locations to other reads/clones by grouping. This is mainly useful when each read has many locations which needs to be considered as one single group of sites.

Usage

otuSites(
  posID = NULL,
  value = NULL,
  readID = NULL,
  grouping = NULL,
  psl.rd = NULL,
  maxgap = 5,
  parallel = TRUE
)
otuSites(
  posID = NULL,
  value = NULL,
  readID = NULL,
  grouping = NULL,
  psl.rd = NULL,
  maxgap = 5,
  parallel = TRUE
)

Arguments

`posID`	a vector of groupings for the value parameter (i.e. Chr,strand). Required if psl.rd parameter is not defined.
`value`	a vector of integer locations/positions that needs to be binned, i.e. genomic location. Required if psl.rd parameter is not defined.
`readID`	a vector of read/clone names which is unique to each row, i.e. deflines.
`grouping`	additional vector of grouping of length posID or psl.rd by which to pool the rows (i.e. samplenames). Default is NULL.
`psl.rd`	a GRanges object returned from `clusterSites`. Default is NULL.
`maxgap`	max distance allowed between two non-overlapping position to trigger the merging. Default is 5.
`parallel`	use parallel backend to perform calculation with `BiocParallel`. Defaults to TRUE. If no parallel backend is registered, then a serial version is ran using `SerialParam`. Process is split by the grouping the column.

Value

a data frame with binned values and otuID shown alongside the original input. If psl.rd parameter is defined, then a GRanges object.

Note

The algorithm for making OTUs of sites is as follows:

for each grouping & posID, fix values off by maxgap parameter
create bins of fixed values per readID
assign arbitrary numeric ID to each distinct bins above & obtain its frequency
perform overlap b/w readIDs with only one value (singletons) to readIDs with >1 value (non-singletons)
- for any overlapping values, tag non-singleton readID with the ID of singleton readID
- if non-singleton readID matched with more than one singleton readID, then pick on at random
for any non-tagged & non-singleton readIDs, perform an overlap of values within themselves using the maxgap parameter
- tag any overlapping positions across any readID with the ID of most frequently occuring bin
positions with no overlap are left as is with the original arbitrary ID

Examples

otuSites(posID = c('chr1-', 'chr1-', 'chr1-', 'chr2+', 'chr15-', 'chr16-', 'chr11-'),
value = c(1000, 1003, 5832, 1000, 12324, 65738, 928042), 
readID = paste('read', sample(letters, 7), sep = '-'), 
grouping = c('a', 'a', 'a', 'b', 'b', 'b', 'c'), parallel = FALSE)
otuSites(posID = c('chr1-', 'chr1-', 'chr1-', 'chr2+', 'chr15-', 'chr16-', 'chr11-'),
value = c(1000, 1003, 5832, 1000, 12324, 65738, 928042), 
readID = paste('read', sample(letters, 7), sep = '-'), 
grouping = c('a', 'a', 'a', 'b', 'b', 'b', 'c'), parallel = FALSE)

Pair up alignments in a GRanges object

Description

Given a GRanges object, the function uses specified gaplength parameter to pair up reads where the qName column ends with "atpersand pairname atpersand" which is outputted by extractSeqs.

Usage

pairUpAlignments(
  psl.rd = NULL,
  maxGapLength = 2500,
  sameStrand = TRUE,
  parallel = TRUE
)
pairUpAlignments(
  psl.rd = NULL,
  maxGapLength = 2500,
  sameStrand = TRUE,
  parallel = TRUE
)

Arguments

`psl.rd`	a GRanges object with qNames ending in "atpersand pairname atpersand".
`maxGapLength`	maximum gap allowed between end of pair1 and start of pair2. Default is 2500 bp.
`sameStrand`	should pairs be aligned to the same strand or in same orientationin the reference genome? Default is TRUE. This is 'TRUE' because pair2 reads are reverseComplemented when reading in data in `findBarcodes`
`parallel`	use parallel backend to perform calculation with `BiocParallel`. Defaults to TRUE. If no parallel backend is registered, then a serial version is ran using `SerialParam`.

Value

a GRanges object with reads paired up denoted by "paired" column. Improper pairs or unpaired reads are returned with "paired" column as FALSE.

Examples

 
## Not run: 
psl.rd <- read.BAMasPSL(bamFile=c("sample1hits.bam","sample2hits.bam"))
pairUpAlignments(psl.rd)

## End(Not run)
## Not run: 
psl.rd <- read.BAMasPSL(bamFile=c("sample1hits.bam","sample2hits.bam"))
pairUpAlignments(psl.rd)

## End(Not run)

Align a short pattern to variable length target sequences.

Description

Align a fixed length short pattern sequence (i.e. primers or adaptors) to subject sequences using pairwiseAlignment. This function uses default of type="overlap", gapOpening=-1, and gapExtension=-1 to align the patternSeq against subjectSeqs. One can adjust these parameters if prefered, but not recommended. This function is meant for aligning a short pattern onto large collection of subjects. If you are looking to align a vector sequence to subjects, then please use BLAT or see one of following blatSeqs, findAndRemoveVector

Usage

pairwiseAlignSeqs(
  subjectSeqs = NULL,
  patternSeq = NULL,
  side = "left",
  qualityThreshold = 1,
  showStats = FALSE,
  bufferBases = 5,
  doRC = TRUE,
  returnUnmatched = FALSE,
  returnLowScored = FALSE,
  parallel = FALSE,
  ...
)
pairwiseAlignSeqs(
  subjectSeqs = NULL,
  patternSeq = NULL,
  side = "left",
  qualityThreshold = 1,
  showStats = FALSE,
  bufferBases = 5,
  doRC = TRUE,
  returnUnmatched = FALSE,
  returnLowScored = FALSE,
  parallel = FALSE,
  ...
)

Arguments

`subjectSeqs`	DNAStringSet object containing sequences to be searched for the pattern. This is generally bigger than patternSeq, and cases where subjectSeqs is smaller than patternSeq will be ignored in the alignment.
`patternSeq`	DNAString object or a sequence containing the query sequence to search. This is generally smaller than subjectSeqs.
`side`	which side of the sequence to perform the search: left, right or middle. Default is 'left'.
`qualityThreshold`	percent of patternSeq to match. Default is 1, full match.
`showStats`	toggle output of search statistics. Default is FALSE.
`bufferBases`	use x number of bases in addition to patternSeq length to perform the search. Beneficial in cases where the pattern has homopolymers or indels compared to the subject. Default is 5. Doesn't apply when side='middle'.
`doRC`	perform reverse complement search of the defined pattern. Default is TRUE.
`returnUnmatched`	return sequences which had no or less than 5% match to the patternSeq. Default is FALSE.
`returnLowScored`	return sequences which had quality score less than the defined qualityThreshold. Default is FALSE.
`parallel`	use parallel backend to perform calculation with `BiocParallel`. Defaults to FALSE. If no parallel backend is registered, then a serial version is ran using `SerialParam`.
`...`	extra parameters for `pairwiseAlignment`

Value

IRanges object with starts, stops, and names of the aligned sequences.
If returnLowScored or returnUnmatched = T, then a CompressedIRangesList where x[["hits"]] has the good scoring hits, x[["Rejected"]] has the failed to match qualityThreshold hits, and x[["Absent"]] has the hits where the aligned bit is <=10% match to the patternSeq.

Note

For qualityThreshold, the alignment score is calculated by (matches*2)-(mismatches+gaps) which programatically translates to round(nchar(patternSeq)*qualityThreshold)*2.
Gaps and mismatches are weighed equally with value of -1 which can be overriden by defining extra parameters 'gapOpening' & 'gapExtension'.
If qualityThreshold is 1, then it is a full match, if 0, then any match is accepted which is useful in searching linker sequences at 3' end. Beware, this function only searches for the pattern sequence in one orientation. If you are expecting to find the pattern in both orientation, you might be better off using BLAST/BLAT!
If parallel=TRUE, then be sure to have a parallel backend registered before running the function. One can use any of the following MulticoreParam SnowParam

Examples

subjectSeqs <- c("CCTGAATCCTGGCAATGTCATCATC", "ATCCTGGCAATGTCATCATCAATGG", 
"ATCAGTTGTCAACGGCTAATACGCG", "ATCAATGGCGATTGCCGCGTCTGCA", 
"CCGCGTCTGCAATGTGAGGGCCTAA", "GAAGGATGCCAGTTGAAGTTCACAC")
subjectSeqs <- DNAStringSet(xscat("AAAAAAAAAA", subjectSeqs))
pairwiseAlignSeqs(subjectSeqs, "AAAAAAAAAA", showStats=TRUE)
pairwiseAlignSeqs(subjectSeqs, "AAATAATAAA", showStats=TRUE, 
qualityThreshold=0.5)

subjectSeqs <- c("CCTGAATCCTGGCAATGTCATCATC", "ATCCTGGCAATGTCATCATCAATGG", 
"ATCAGTTGTCAACGGCTAATACGCG", "ATCAATGGCGATTGCCGCGTCTGCA", 
"CCGCGTCTGCAATGTGAGGGCCTAA", "GAAGGATGCCAGTTGAAGTTCACAC")
subjectSeqs <- DNAStringSet(xscat("AAAAAAAAAA", subjectSeqs))
pairwiseAlignSeqs(subjectSeqs, "AAAAAAAAAA", showStats=TRUE)
pairwiseAlignSeqs(subjectSeqs, "AAATAATAAA", showStats=TRUE, 
qualityThreshold=0.5)

Align a short pattern with PrimerID to variable length target sequences.

Description

Align a fixed length short pattern sequence containing primerID to variable length subject sequences using pairwiseAlignment. This function uses default of type="overlap", gapOpening=-1, and gapExtension=-1 to align the patterSeq against subjectSeqs. The search is broken up into as many pieces +1 as there are primerID and then compared against subjectSeqs. For example, patternSeq="AGCATCAGCANNNNNNNNNACGATCTACGCC" will launch two search jobs one per either side of Ns. For each search, qualityThreshold is used to filter out candidate alignments and the area in between is chosen to be the primerID. This strategy is benefical because of Indels introduced through homopolymer errors. Most likely the length of primerID(s) wont the same as you expected!

Usage

primerIDAlignSeqs(
  subjectSeqs = NULL,
  patternSeq = NULL,
  qualityThreshold1 = 0.75,
  qualityThreshold2 = 0.5,
  doAnchored = FALSE,
  doRC = TRUE,
  returnUnmatched = FALSE,
  returnRejected = FALSE,
  showStats = FALSE,
  ...
)
primerIDAlignSeqs(
  subjectSeqs = NULL,
  patternSeq = NULL,
  qualityThreshold1 = 0.75,
  qualityThreshold2 = 0.5,
  doAnchored = FALSE,
  doRC = TRUE,
  returnUnmatched = FALSE,
  returnRejected = FALSE,
  showStats = FALSE,
  ...
)

Arguments

`subjectSeqs`	DNAStringSet object containing sequences to be searched for the pattern.
`patternSeq`	DNAString object or a sequence containing the query sequence to search with the primerID.
`qualityThreshold1`	percent of first part of patternSeq to match. Default is 0.75.
`qualityThreshold2`	percent of second part of patternSeq to match. Default is 0.50.
`doAnchored`	for primerID based patternSeq, use the base before and after primer ID in patternSeq as anchors?. Default is FALSE.
`doRC`	perform reverse complement search of the defined pattern. Default is TRUE.
`returnUnmatched`	return sequences if it had no or less than 5% match to the first part of patternSeq before the primerID. Default is FALSE.
`returnRejected`	return sequences if it only has a match to one side of patternSeq or primerID length does not match # of Ns +/-2 in the pattern. Default is FALSE.
`showStats`	toggle output of search statistics. Default is FALSE.
`...`	extra parameters for `pairwiseAlignment`

Value

A CompressedIRangesList of length two, where x[["hits"]] is hits covering the entire patternSeq, and x[["primerIDs"]] is the potential primerID region.
If returnUnmatched = T, then x[["Absent"]] is appended which includes reads not matching the first part of patternSeq.
If returnRejected=TRUE, then x[["Rejected"]] includes reads that only matched one part of patternSeq or places where no primerID was found in between two part of patternSeq, and x[["RejectedprimerIDs"]] includes primerIDs that didn't match the correct length.
If doAnchored=TRUE, then x[["unAnchoredprimerIDs"]] includes reads that didn't match the base before and after primer ID on patternSeq.

Note

For qualityThreshold1 & qualityThreshold2, the alignment score is calculated by (matches*2)-(mismatches+gaps) which programatically translates to round(nchar(patternSeq)*qualityThreshold)*2
Gaps and mismatches are weighed equally with value of -1 which can be overriden by defining extra parameters 'gapOpening' & 'gapExtension'.
If qualityThreshold is 1, then it is a full match, if 0, then any match is accepted which is useful in searching linker sequences at 3' end. Beware, this function only searches for the pattern sequence in one orientation. If you are expecting to find the pattern in both orientation, you might be better off using BLAST/BLAT!

Examples

subjectSeqs <- c("CCTGAATCCTGGCAATGTCATCATC", "ATCCTGGCAATGTCATCATCAATGG", 
"ATCAGTTGTCAACGGCTAATACGCG", "ATCAATGGCGATTGCCGCGTCTGCA", 
"CCGCGTCTGCAATGTGAGGGCCTAA", "GAAGGATGCCAGTTGAAGTTCACAC")
ids <- c("GGTTCTACGT", "AGGAGTATGA", "TGTCGGTATA", "GTTATAAAAC", 
"AGGCTATATC", "ATGGTTTGTT")
subjectSeqs <- xscat(subjectSeqs, xscat("AAGCGGAGCCC",ids,"TTTTTTTTTTT"))
patternSeq <- "AAGCGGAGCCCNNNNNNNNNNTTTTTTTTTTT"
primerIDAlignSeqs(DNAStringSet(subjectSeqs), patternSeq, doAnchored = TRUE)
subjectSeqs <- c("CCTGAATCCTGGCAATGTCATCATC", "ATCCTGGCAATGTCATCATCAATGG", 
"ATCAGTTGTCAACGGCTAATACGCG", "ATCAATGGCGATTGCCGCGTCTGCA", 
"CCGCGTCTGCAATGTGAGGGCCTAA", "GAAGGATGCCAGTTGAAGTTCACAC")
ids <- c("GGTTCTACGT", "AGGAGTATGA", "TGTCGGTATA", "GTTATAAAAC", 
"AGGCTATATC", "ATGGTTTGTT")
subjectSeqs <- xscat(subjectSeqs, xscat("AAGCGGAGCCC",ids,"TTTTTTTTTTT"))
patternSeq <- "AAGCGGAGCCCNNNNNNNNNNTTTTTTTTTTT"
primerIDAlignSeqs(DNAStringSet(subjectSeqs), patternSeq, doAnchored = TRUE)

PSL file output

Description

Sample BLAT PSL file output from samples included Integration Sites Sequencing Data seqProps

Format

a data frame of 1000 rows and 21 columns

Return PSL file columns with classes

Description

Print out required fields & classes of PSL file format

Usage

pslCols(withClass = TRUE)
pslCols(withClass = TRUE)

Arguments

withClass

return classes for each column.

Value

vector of PSL column names

Examples

pslCols()

pslCols()

Convert psl dataframe to GRanges

Description

Convert psl dataframe to GRanges object using either the query or target as the reference data column.

Usage

pslToRangedObject(x, useTargetAsRef = TRUE, isblast8 = FALSE)
pslToRangedObject(x, useTargetAsRef = TRUE, isblast8 = FALSE)

Arguments

`x`	dataframe reflecting psl format
`useTargetAsRef`	use target(tName) or query(qName) as the chromosome or the reference data. Default is TRUE.
`isblast8`	the input dataframe blast8 format output from BLAT. Default is FALSE.

Value

a GRanges object reflecting psl file type.

Examples

data(psl)
psl <- head(psl)
pslToRangedObject(psl)
pslToRangedObject(psl, useTargetAsRef = FALSE)
data(psl)
psl <- head(psl)
pslToRangedObject(psl)
pslToRangedObject(psl, useTargetAsRef = FALSE)

Reads a BAM/SAM file and converts it into a PSL like format.

Description

Given filename(s), the function reads the BAM/SAM file, converts into a PSL like format. Any other file format will yield errors or erroneous results. This is intended to be used independently with other short read aligners.

Usage

read.BAMasPSL(bamFile = NULL, removeFile = TRUE, asGRanges = TRUE)
read.BAMasPSL(bamFile = NULL, removeFile = TRUE, asGRanges = TRUE)

Arguments

`bamFile`	BAM/SAM filename, or vector of filenames, or a pattern of files to import.
`removeFile`	remove the file(s) supplied in bamFile paramter after importing. Default is FALSE.
`asGRanges`	return a GRanges object. Default is TRUE

Value

a GRanges or GAlignments object reflecting psl file type.

Examples

 
## Not run: 
read.BAMasPSL(bamFile="processed.*.bam$")
read.BAMasPSL(bamFile=c("sample1hits.bam","sample2hits.bam"))

## End(Not run)
## Not run: 
read.BAMasPSL(bamFile="processed.*.bam$")
read.BAMasPSL(bamFile=c("sample1hits.bam","sample2hits.bam"))

## End(Not run)

Read blast8 file(s) outputted by BLAT

Description

Given filename(s), the function reads the blast8 file format from BLAT as a data frame and performs basic score filtering if indicated. Any other file format will yield errors or erroneous results.

Usage

read.blast8(
  files = NULL,
  asGRanges = FALSE,
  removeFile = TRUE,
  parallel = FALSE
)
read.blast8(
  files = NULL,
  asGRanges = FALSE,
  removeFile = TRUE,
  parallel = FALSE
)

Arguments

`files`	blast8 filename, or vector of filenames, or a pattern of files to import.
`asGRanges`	return a GRanges object instead of a dataframe. Default is TRUE Saves memory!
`removeFile`	remove the blast8 file(s) after importing. Default is FALSE.
`parallel`	use parallel backend to perform calculation with `BiocParallel`. Defaults to TRUE. If no parallel backend is registered, then a serial version is ran using `SerialParam`.

Value

a dataframe or GRanges object reflecting blast8 file type.

Note

If parallel=TRUE, then be sure to have a parallel backend registered before running the function. One can use any of the following MulticoreParam SnowParam

Examples

# this function works similar to read.psl #
#read.blast8(files="processed.*.blast8$")
#read.blast8(files=c("sample1hits.blast8","sample2hits.blast8"))

# this function works similar to read.psl #
#read.blast8(files="processed.*.blast8$")
#read.blast8(files=c("sample1hits.blast8","sample2hits.blast8"))

Read PSL file(s) outputted by BLAT

Description

Given filename(s), the function reads the PSL file format from BLAT as a data frame and performs basic score filtering if indicated. Any other file format will yield errors or erroneous results. Make sure there is no header row! See required columns in pslCols.

Usage

read.psl(
  pslFile = NULL,
  bestScoring = TRUE,
  asGRanges = FALSE,
  removeFile = TRUE,
  parallel = FALSE
)
read.psl(
  pslFile = NULL,
  bestScoring = TRUE,
  asGRanges = FALSE,
  removeFile = TRUE,
  parallel = FALSE
)

Arguments

`pslFile`	PSL filename, or vector of filenames, or a pattern of files to import.
`bestScoring`	report only best scoring hits instead of all hits. Default is TRUE. Score is calculated by matches-misMatches-qBaseInsert-tBaseInsert.
`asGRanges`	return a GRanges object instead of a dataframe. Default is FALSE
`removeFile`	remove the PSL file(s) after importing. Default is FALSE.
`parallel`	use parallel backend to perform calculation with `BiocParallel`. Defaults to TRUE. If no parallel backend is registered, then a serial version is ran using `SerialParam`.

Value

a dataframe reflecting psl file type. If asGRanges=TRUE, then a GRanges object.

Note

If parallel=TRUE, then be sure to have a parallel backend registered before running the function. One can use any of the following MulticoreParam SnowParam

Examples

## Not run: 
data(psl)
pslFile <- tempfile()
write.psl(psl, filename = pslFile)
head(read.psl(pslFile = pslFile))
# read many PSL files matching the regex #
psl <- read.psl(pslFile = "processed.*.psl$")

## End(Not run)
## Not run: 
data(psl)
pslFile <- tempfile()
write.psl(psl, filename = pslFile)
head(read.psl(pslFile = pslFile))
# read many PSL files matching the regex #
psl <- read.psl(pslFile = "processed.*.psl$")

## End(Not run)

Read a sample information file and format appropriate metadata.

Description

Given a sample information file, the function checks if it includes required information to process samples present on each sector/quadrant/region/lane. The function also adds other columns required for processing with default values if not already defined ahead of time.

Usage

read.sampleInfo(sampleInfoPath = NULL, splitBySector = TRUE)
read.sampleInfo(sampleInfoPath = NULL, splitBySector = TRUE)

Arguments

`sampleInfoPath`	full or relative path to the sample information file, which holds samples to quadrant/lane associations along with other metadata required to trim sequences or process it.
`splitBySector`	split the data frame into a list by sector column. Default is TRUE.

Details

Required Column Description:
- sector => region/quadrant/lane of the sequencing plate the sample comes from. If files have been split by samples apriori, then the filename associated per sample without the extension. If this is a filename, then be sure to enable 'alreadyDecoded' parameter in findBarcodes, since contents of this column is pasted together with 'seqfilePattern' parameter in read.SeqFolder to find the appropriate file needed. For paired end data, this is basename of the FASTA/Q file holding the sample data from the LTR side. For example, files such as Lib3_L001_R2_001.fastq.gz or Lib3_L001_R2_001.fastq would be Lib3_L001_R2_001, and consequently Lib3_L001_R1_001 would be used as the second pair!
- barcode => unique 4-12bp DNA sequence which identifies the sample. If providing filename as sector, then leave this blank since it is assumed that the data is already demultiplexed.
- primerltrsequence => DNA sequence of the viral LTR primer with/without the viral LTR sequence following the primer landing site. If already trimmed, then mark this as SKIP.
- sampleName => Name of the sample associated with the barcode
- sampleDescription => Detailed description of the sample
- gender => sex of the sample: male or female or NA
- species => species of the sample: homo sapien, mus musculus, etc.
- freeze => UCSC freeze to which the sample should be aligned to.
- linkerSequence => DNA sequence of the linker adaptor following the genomic sequence. If already trimmed, then mark this as SKIP.
- restrictionEnzyme => Restriction enzyme used for digestion and sample recovery. Can also be one of: Fragmentase or Sonication!
Metadata Parameter Column Description:
- ltrBitSequence => DNA sequence of the viral LTR following the primer landing site. Default is last 7bps of the primerltrsequence.
- ltrBitIdentity => percent of LTR bit sequence to match during the alignment. Default is 1.
- primerLTRidentity => percent of primer to match during the alignment. Default is .85
- linkerIdentity => percent of linker sequence to match during the alignment. Default is 0.55. Only applies to non-primerID/random tag based linker search.
- primerIdInLinker => whether the linker adaptor used has primerID/random tag in it? Default is FALSE.
- primerIdInLinkerIdentity1 => percent of sequence to match before the random tag. Default is 0.75. Only applies to primerID/random tag based linker search and when primeridinlinker is TRUE.
- primerIdInLinkerIdentity2 => percent of sequence to match after the random tag. Default is 0.50. Only applies to primerID/random tag based linker search and when primeridinlinker is TRUE.
- celltype => celltype information associated with the sample
- user => name of the user who prepared or processed the sample
- pairedEnd => is the data paired end? Default is FALSE.
- vectorFile => fasta file containing the vector sequence
Processing Parameter Column Description:
- startWithin => upper bound limit of where the alignment should start within the query. Default is 3.
- alignRatioThreshold => cuttoff for (alignment span/read length). Default is 0.7.
- genomicPercentIdentity => cuttoff for (1-(misMatches/matches)). Default is 0.98.
- clusterSitesWithin => cluster integration sites within a defined window size based on frequency which corrects for any sequencing errors. Default is 5.
- keepMultiHits => whether to keep sequences/reads that return multiple best hits, aka ambiguous locations.
- processingDate => the date of processing

Value

if splitBySector=TRUE, then an object of SimpleList named by quadrant/lane information defined in sampleInfo file, else a dataframe.

Examples

 
runData <- system.file(file.path("extdata", "FLX_sample_run"), 
package = "hiReadsProcessor")
read.sampleInfo(file.path(runData, "sampleInfo.xlsx"))
runData <- system.file(file.path("extdata", "FLX_sample_run"), 
package = "hiReadsProcessor")
read.sampleInfo(file.path(runData, "sampleInfo.xlsx"))

Read contents of a sequencing folder and make a SimpleList object

Description

Given a sequencing folder path, sample information file path, and sequence file extension pattern, the function returns a list of variables required to process the data. The function also calls read.sampleInfo which reads in sample processing metadata and formats it if needed.

Usage

read.SeqFolder(
  sequencingFolderPath = NULL,
  sampleInfoFilePath = NULL,
  seqfilePattern = NULL
)
read.SeqFolder(
  sequencingFolderPath = NULL,
  sampleInfoFilePath = NULL,
  seqfilePattern = NULL
)

Arguments

`sequencingFolderPath`	full or relative path to the sequencing folder
`sampleInfoFilePath`	full or relative path to the sample information file, which holds samples to quadrant/lane associations along with other metadata required to trim sequences or process it. Default to NULL, where the function tries to find xls/xlsx or tab deliminated txt file in the sequencing folder which sounds similar to 'sampleinfo' and present you with choices of file to select from.
`seqfilePattern`	regex/string to describe sequence file endings. See examples. Default is NULL.

Value

a SimpleList list which is used by other functions to process and decode the data.

Note

One must make sure that each sequencing file has sector name/number prefixed at the beginning, else findBarcodes will fail trying to find the filename.
For paired end Illumina runs, make sure the filenames include R1, R2, and I1 somewhere in the name denoting pair1, pair2, and index/barcode reads, respectively.

Examples

runData <- system.file("extdata/FLX_sample_run/", 
package = "hiReadsProcessor")
read.SeqFolder(runData, seqfilePattern=".+fna.gz$")
## Not run: 
read.SeqFolder(".", seqfilePattern = "\\.TCA.454Reads.fna$")
read.SeqFolder(".", seqfilePattern = ".+fastq$")

## End(Not run)
runData <- system.file("extdata/FLX_sample_run/", 
package = "hiReadsProcessor")
read.SeqFolder(runData, seqfilePattern=".+fna.gz$")
## Not run: 
read.SeqFolder(".", seqfilePattern = "\\.TCA.454Reads.fna$")
read.SeqFolder(".", seqfilePattern = ".+fastq$")

## End(Not run)

Read fasta/fastq given the path or sampleInfo object.

Description

Given a sequence reads file path, the function returns a DNAStringSet object.

Usage

read.seqsFromSector(seqFilePath = NULL, sector = 1, isPaired = FALSE)
read.seqsFromSector(seqFilePath = NULL, sector = 1, isPaired = FALSE)

Arguments

`seqFilePath`	a path to fasta/fastq reads file or a sampleInfo object returned by `read.SeqFolder`
`sector`	specific sector to reads sequences from. Default is 1, and not required if seqFilePath is a direct file path rather than sampleInfo object.
`isPaired`	does the sector contain paired end reads? Default is FALSE

Value

if isPaired is FALSE, then a DNAStringSet object, else a list of DNAStringSet objects of three elements corresponding to reads from "barcode", "pair1", and "pair2". Note: "pair2" is reverse complemented!

Examples

## Not run: 
load(file.path(system.file("data", package = "hiReadsProcessor"),
"FLX_seqProps.RData"))
read.seqsFromSector(seqProps, sector="2")

## End(Not run)
## Not run: 
load(file.path(system.file("data", package = "hiReadsProcessor"),
"FLX_seqProps.RData"))
read.seqsFromSector(seqProps, sector="2")

## End(Not run)

Remove sequences with ambiguous nucleotides.

Description

Given a DNAStringSet object, the function removes any reads that has either repeating or total Ns which is greater than to maxNs threshold

Usage

removeReadsWithNs(dnaSet, maxNs = 5, consecutive = TRUE)
removeReadsWithNs(dnaSet, maxNs = 5, consecutive = TRUE)

Arguments

`dnaSet`	DNAStringSet object to evaluate.
`maxNs`	integer value denoting the threshold of maximum allowed Ns. Default is 5.
`consecutive`	boolean flag denoting whether Ns to filter is consecutive or total . Default is TRUE.

Value

DNAStringSet object.

Examples

 
dnaSet <- c("CCTGAATCCTNNCAATGTCATCATC", "ATCCTGGCNATGTCATCATCAATGG", 
"ATCAGTTGTCAACGGCTAATACGCG", "ATCAATGGCGATTGCCGCGTCTGCA", 
"CCGNNTCTGCAATGTGNGGNCCTAN", "GAAGNNNNNNGTTGAAGTTCACAC") 
removeReadsWithNs(dnaSet)
removeReadsWithNs(dnaSet, maxNs = 4, consecutive = FALSE)
dnaSet <- c("CCTGAATCCTNNCAATGTCATCATC", "ATCCTGGCNATGTCATCATCAATGG", 
"ATCAGTTGTCAACGGCTAATACGCG", "ATCAATGGCGATTGCCGCGTCTGCA", 
"CCGNNTCTGCAATGTGNGGNCCTAN", "GAAGNNNNNNGTTGAAGTTCACAC") 
removeReadsWithNs(dnaSet)
removeReadsWithNs(dnaSet, maxNs = 4, consecutive = FALSE)

Replicate sequences from DNAStringSet object using counts identifier or vector

Description

Given a DNAStringSet object, the function replicates reads using counts=X marker at the end of definition line.

Usage

replicateReads(dnaSet, counts = NULL)
replicateReads(dnaSet, counts = NULL)

Arguments

`dnaSet`	DNAStringSet object to replicate.
`counts`	an integer or a numeric vector of length length(dnaSet) indicating how many times to repeat each sequence. Default is NULL, in which it uses counts=X notation from the definition line to replicate reads.

Value

DNAStringSet object.

Examples

 
dnaSet <- c("CCTGAATCCTGGCAATGTCATCATC", "ATCCTGGCAATGTCATCATCAATGG", 
"ATCAGTTGTCAACGGCTAATACGCG", "ATCAATGGCGATTGCCGCGTCTGCA", 
"CCGCGTCTGCAATGTGAGGGCCTAA", "GAAGGATGCCAGTTGAAGTTCACAC", 
"CCTGAATCCTGGCAATGTCATCATC", "ATCCTGGCAATGTCATCATCAATGG", 
"ATCAGTTGTCAACGGCTAATACGCG", "ATCAATGGCGATTGCCGCGTCTGCA", 
"CCGCGTCTGCAATGTGAGGGCCTAA", "GAAGGATGCCAGTTGAAGTTCACAC") 
dnaSet <- dereplicateReads(dnaSet)
replicateReads(dnaSet)
dnaSet <- c("CCTGAATCCTGGCAATGTCATCATC", "ATCCTGGCAATGTCATCATCAATGG", 
"ATCAGTTGTCAACGGCTAATACGCG", "ATCAATGGCGATTGCCGCGTCTGCA", 
"CCGCGTCTGCAATGTGAGGGCCTAA", "GAAGGATGCCAGTTGAAGTTCACAC", 
"CCTGAATCCTGGCAATGTCATCATC", "ATCCTGGCAATGTCATCATCAATGG", 
"ATCAGTTGTCAACGGCTAATACGCG", "ATCAATGGCGATTGCCGCGTCTGCA", 
"CCGCGTCTGCAATGTGAGGGCCTAA", "GAAGGATGCCAGTTGAAGTTCACAC") 
dnaSet <- dereplicateReads(dnaSet)
replicateReads(dnaSet)

Simple summary of a sampleInfo object.

Description

Give a simple summary of major attributes in sampleInfo/SimpleList object.

Usage

sampleSummary(object, ...)
sampleSummary(object, ...)

Arguments

`object`	sample information SimpleList object, which samples per sector/quadrant information along with other metadata.
`...`	ignored for now.

Value

a dataframe summarizing counts of major attributes per sample and sector.

Examples

data(FLX_seqProps)
sampleSummary(seqProps)

data(FLX_seqProps)
sampleSummary(seqProps)

Sample Integration Sites Sequencing Data

Description

This is a processed data object containing raw sequences and respective alignments to UCSC freeze hg18 from 112 integration site samples. The object is of SimpleList class and follows a certain structural hierarchy explained by the Introductory vignette.

Format

A SimpleList object

Split DNAStringSet object using first X number of bases defined by a vector.

Description

Given a character vector of barcodes/MID to sample association and a DNAStringSet object, the function splits/demultiplexes the DNAStringSet object by first few bases dictated by length of barcodes/MID supplied. This is an accessory function used by findBarcodes

Usage

splitByBarcode(
  barcodesSample,
  dnaSet,
  trimFrom = NULL,
  showStats = FALSE,
  returnUnmatched = FALSE
)
splitByBarcode(
  barcodesSample,
  dnaSet,
  trimFrom = NULL,
  showStats = FALSE,
  returnUnmatched = FALSE
)

Arguments

`barcodesSample`	a character vector of barcodes to sample name associations. Ex: c("ACATCCAT"="Sample1", "GAATGGAT"="Sample2",...)
`dnaSet`	DNAStringSet object to evaluate.
`trimFrom`	integer value serving as start point to trim the sequences from. This is calculated internally length barcode+1. Default is NULL.
`showStats`	boolean flag denoting whether to show decoding statistics per sample & barcode. Default is FALSE.
`returnUnmatched`	boolean flag denoting whether to return unmatched reads. Default is FALSE.

Value

DNAStringSet object split by sample name found in barcodesSample.

Examples

dnaSet <- DNAStringSet(c("read1" = "ACATCCATAGAGCTACGACGACATCGACATA",
"read2"="GAATGGATGACGACTACAGCACGACGAGCAGCTACT",
"read3"="GAATGGATGCGCTAAGAAGAGA", "read4"="ACATCCATTCTACACATCT"))
splitByBarcode(c("ACATCCAT" = "Sample1", "GAATGGAT" = "Sample2"), dnaSet,
showStats=TRUE)
dnaSet <- DNAStringSet(c("read1" = "ACATCCATAGAGCTACGACGACATCGACATA",
"read2"="GAATGGATGACGACTACAGCACGACGAGCAGCTACT",
"read3"="GAATGGATGCGCTAAGAAGAGA", "read4"="ACATCCATTCTACACATCT"))
splitByBarcode(c("ACATCCAT" = "Sample1", "GAATGGAT" = "Sample2"), dnaSet,
showStats=TRUE)

Split DNA sequences into smaller files.

Description

Given a vector of sequences or DNAStringSet or a FASTA filename, the function splits it into smaller pieces as denoted by totalFiles parameter.

Usage

splitSeqsToFiles(
  x,
  totalFiles = 4,
  suffix = "tempy",
  filename = "queryFile.fa",
  outDir = getwd()
)
splitSeqsToFiles(
  x,
  totalFiles = 4,
  suffix = "tempy",
  filename = "queryFile.fa",
  outDir = getwd()
)

Arguments

`x`	a DNAStringSet object, or a FASTA filename.
`totalFiles`	an integer indicating how many files to create. Default is 4.
`suffix`	a word to add to each file created. Default is "tempy".
`filename`	name of the file if x is a DNAStringSet object. Default is "queryFile.fa".
`outDir`	directory to write the output file. Default is current directory.

Value

a vector of filename names created.

Examples

seqs <- DNAStringSet(sapply(sample(c(100:1000), 500),
function(size) paste(sample(DNA_BASES, size, replace = TRUE), collapse = ""))) 
splitSeqsToFiles(seqs, 5, "tempyQ", "myDNAseqs.fa", tempdir())
seqs <- DNAStringSet(sapply(sample(c(100:1000), 500),
function(size) paste(sample(DNA_BASES, size, replace = TRUE), collapse = ""))) 
splitSeqsToFiles(seqs, 5, "tempyQ", "myDNAseqs.fa", tempdir())

Start/Stop a gfServer instance

Description

Start or Stop a gfServer with indexed reference genome to align batch of sequences using BLAT gfServer/gfClient protocol.

Usage

startgfServer(
  seqDir = NULL,
  host = "localhost",
  port = 5560,
  gfServerOpts = c(repMatch = 112312, stepSize = 5, tileSize = 10, maxDnaHits = 10)
)

stopgfServer(host = "localhost", port = NULL)
startgfServer(
  seqDir = NULL,
  host = "localhost",
  port = 5560,
  gfServerOpts = c(repMatch = 112312, stepSize = 5, tileSize = 10, maxDnaHits = 10)
)

stopgfServer(host = "localhost", port = NULL)

Arguments

`seqDir`	absolute or relative path to the genome index (nib/2bit files).
`host`	name of the machine to run gfServer on. Default: localhost
`port`	a port number to host the gfServer with. Default is 5560.
`gfServerOpts`	a character vector of options to be passed to gfServer command on top of server defaults. Default: c(repMatch=112312, stepSize=5, tileSize=10, maxDnaHits=10). Set this to NULL to start gfServer with defaults.

Value

system command status for executing gfServer command.

Examples

#startgfServer(seqDir="/usr/local/blatSuite34/hg18.2bit",port=5560)
#stopgfServer(port=5560)
#startgfServer(seqDir="/usr/local/blatSuite34/hg18.2bit",port=5560)
#stopgfServer(port=5560)

Trim sequences from a specific side.

Description

This function trims a DNAStringSet object using the ranges from left, right, or middle of the sequence. This is a helper function utilized in primerIDAlignSeqs and extractSeqs. If dnaSet and coords are not the same length, then they are required to have a names attribute to perform the matched trimming.

Usage

trimSeqs(dnaSet, coords, side = "middle", offBy = 0)
trimSeqs(dnaSet, coords, side = "middle", offBy = 0)

Arguments

`dnaSet`	DNAStringSet object containing sequences to be trimmed.
`coords`	IRanges object containing boundaries.
`side`	either 'left','right',or the Default 'middle'.
`offBy`	integer value dictating if the supplied coordinates should be offset by X number of bases. Default is 0.

Value

a DNAStringSet object with trimmed sequences.

Note

If side is left, then any sequence following end of coords+offBy is returned. If side is right, then sequence preceding start of coords-offBy is returned. If side is middle, then sequence contained in coords is returned where offBy is added to start and subtracted from end in coords.

Examples

dnaSet <- DNAStringSet(c("AAAAAAAAAACCTGAATCCTGGCAATGTCATCATC", 
"AAAAAAAAAAATCCTGGCAATGTCATCATCAATGG", "AAAAAAAAAAATCAGTTGTCAACGGCTAATACGCG",
"AAAAAAAAAAATCAATGGCGATTGCCGCGTCTGCA", "AAAAAAAAAACCGCGTCTGCAATGTGAGGGCCTAA",
"AAAAAAAAAAGAAGGATGCCAGTTGAAGTTCACAC"))
coords <- IRanges(start=1, width=rep(10,6))
trimSeqs(dnaSet, coords, side="left", offBy=1)
trimSeqs(dnaSet, coords, side="middle")
dnaSet <- DNAStringSet(c("AAAAAAAAAACCTGAATCCTGGCAATGTCATCATC", 
"AAAAAAAAAAATCCTGGCAATGTCATCATCAATGG", "AAAAAAAAAAATCAGTTGTCAACGGCTAATACGCG",
"AAAAAAAAAAATCAATGGCGATTGCCGCGTCTGCA", "AAAAAAAAAACCGCGTCTGCAATGTGAGGGCCTAA",
"AAAAAAAAAAGAAGGATGCCAGTTGAAGTTCACAC"))
coords <- IRanges(start=1, width=rep(10,6))
trimSeqs(dnaSet, coords, side="left", offBy=1)
trimSeqs(dnaSet, coords, side="middle")

Compare LTRed/Primed sequences to all linkers.

Description

Given a SampleInfo object, the function compares LTRed sequences from each sample per sector to all the linker sequences present in the run. The output is a summary table of counts of good matches to all the linkers per sample.

Usage

troubleshootLinkers(
  sampleInfo,
  qualityThreshold = 0.55,
  qualityThreshold1 = 0.75,
  qualityThreshold2 = 0.5,
  doRC = TRUE,
  parallel = TRUE,
  samplenames = NULL,
  ...
)
troubleshootLinkers(
  sampleInfo,
  qualityThreshold = 0.55,
  qualityThreshold1 = 0.75,
  qualityThreshold2 = 0.5,
  doRC = TRUE,
  parallel = TRUE,
  samplenames = NULL,
  ...
)

Arguments

`sampleInfo`	sample information SimpleList object outputted from `findPrimers` or `findLTRs`, which holds decoded sequences for samples per sector/quadrant along with information of sample to primer associations.
`qualityThreshold`	percent of linker length to match, round(nchar(linker)*qualityThreshold). Default is 0.55. Only applies to non-primerID based linkers
`qualityThreshold1`	percent of first part of patternSeq to match. Default is 0.75. Only applies to primerID based linker search.
`qualityThreshold2`	percent of second part of patternSeq to match. Default is 0.50. Only applies to primerID based linker search.
`doRC`	perform reverse complement search of the linker sequence. Default is TRUE. Highly recommended!
`parallel`	use parallel backend to perform calculation with `BiocParallel`. Defaults to TRUE. If no parallel backend is registered, then a serial version is ran using `SerialParam`. Parllelization is done at sample level per sector.
`samplenames`	a vector of samplenames to process. Default is NULL, which processes all samples from sampleInfo object.
`...`	extra parameters to be passed to `pairwiseAlignment`.

Value

a dataframe of counts.

Note

If parallel=TRUE, then be sure to have a parallel backend registered before running the function. One can use any of the following MulticoreParam SnowParam

Align a short pattern to variable length target sequences.

Description

Align a fixed length short pattern sequence to subject sequences using vmatchPattern. This function is meant for aligning a short pattern onto large collection of subjects. If you are looking to align a vector sequence to subjects, then please use BLAT.

Usage

vpairwiseAlignSeqs(
  subjectSeqs = NULL,
  patternSeq = NULL,
  side = "left",
  qualityThreshold = 1,
  showStats = FALSE,
  bufferBases = 5,
  doRC = TRUE,
  parallel = FALSE,
  ...
)
vpairwiseAlignSeqs(
  subjectSeqs = NULL,
  patternSeq = NULL,
  side = "left",
  qualityThreshold = 1,
  showStats = FALSE,
  bufferBases = 5,
  doRC = TRUE,
  parallel = FALSE,
  ...
)

Arguments

`subjectSeqs`	DNAStringSet object containing sequences to be searched for the pattern. This is generally bigger than patternSeq, and cases where subjectSeqs is smaller than patternSeq will be ignored in the alignment.
`patternSeq`	DNAString object or a sequence containing the query sequence to search. This is generally smaller than subjectSeqs.
`side`	which side of the sequence to perform the search: left, right, or middle. Default is 'left'.
`qualityThreshold`	percent of patternSeq to match. Default is 1, full match. This is supplied to max.mismatch parameter of `vmatchPattern` as round(nchar(patternSeq)*(1-qualityThreshold)).
`showStats`	toggle output of search statistics. Default is FALSE.
`bufferBases`	use x number of bases in addition to patternSeq length to perform the search. Beneficial in cases where the pattern has homopolymers or indels compared to the subject. Default is 5. Doesn't apply when side='middle'.
`doRC`	perform reverse complement search of the defined pattern. Default is TRUE.
`parallel`	use parallel backend to perform calculation with `BiocParallel`. Defaults to FALSE. If no parallel backend is registered, then a serial version is ran using `SerialParam`.
`...`	extra parameters for `vmatchPattern` except for 'max.mismatch' since it's calculated internally using the 'qualityThreshold' parameter.

Value

IRanges object with starts, stops, and names of the aligned sequences.

Note

For qualityThreshold, the alignment score is calculated by (matches*2)-(mismatches+gaps) which programatically translates to round(nchar(patternSeq)*qualityThreshold)*2.
No indels are allowed in the function, if expecting indels then use pairwiseAlignSeqs.
If qualityThreshold is 1, then it is a full match, if 0, then any match is accepted which is useful in searching linker sequences at 3' end. Beware, this function only searches for the pattern sequence in one orientation. If you are expecting to find the pattern in both orientation, you might be better off using BLAST/BLAT!
If parallel=TRUE, then be sure to have a parallel backend registered before running the function. One can use any of the following MulticoreParam SnowParam

Examples

subjectSeqs <- c("CCTGAATCCTGGCAATGTCATCATC", "ATCCTGGCAATGTCATCATCAATGG", 
"ATCAGTTGTCAACGGCTAATACGCG", "ATCAATGGCGATTGCCGCGTCTGCA", 
"CCGCGTCTGCAATGTGAGGGCCTAA", "GAAGGATGCCAGTTGAAGTTCACAC")
subjectSeqs <- DNAStringSet(xscat("AAAAAAAAAA", subjectSeqs))
vpairwiseAlignSeqs(subjectSeqs, "AAAAAAAAAA", showStats=TRUE)
vpairwiseAlignSeqs(subjectSeqs, "AAAAAAAAAA", showStats=TRUE,
qualityThreshold=0.5)
subjectSeqs <- c("CCTGAATCCTGGCAATGTCATCATC", "ATCCTGGCAATGTCATCATCAATGG", 
"ATCAGTTGTCAACGGCTAATACGCG", "ATCAATGGCGATTGCCGCGTCTGCA", 
"CCGCGTCTGCAATGTGAGGGCCTAA", "GAAGGATGCCAGTTGAAGTTCACAC")
subjectSeqs <- DNAStringSet(xscat("AAAAAAAAAA", subjectSeqs))
vpairwiseAlignSeqs(subjectSeqs, "AAAAAAAAAA", showStats=TRUE)
vpairwiseAlignSeqs(subjectSeqs, "AAAAAAAAAA", showStats=TRUE,
qualityThreshold=0.5)

Write a fasta file per sample in parallel

Description

Given a listed DNAStringSet object return from extractSeqs, the function writes a fasta file for each sample as defined in filePath parameter.

Usage

write.listedDNAStringSet(
  dnaSet,
  filePath = ".",
  filePrefix = "processed",
  prependSamplenames = TRUE,
  format = "fasta",
  parallel = FALSE
)
write.listedDNAStringSet(
  dnaSet,
  filePath = ".",
  filePrefix = "processed",
  prependSamplenames = TRUE,
  format = "fasta",
  parallel = FALSE
)

Arguments

`dnaSet`	listed DNAStringSet object containing sequences to be written.
`filePath`	a path write the fasta files per sample. Default is current working directory.
`filePrefix`	prefix the filenames with a string. Default is 'processed' followed by samplename.
`prependSamplenames`	Prepend definition lines with samplenames. Default is TRUE. Make sure the dnaSet parameter is a named list where names are used as samplenames.
`format`	either fasta (the default) or fastq.
`parallel`	use parallel backend to perform calculation with `BiocParallel`. Defaults to TRUE. If no parallel backend is registered, then a serial version is ran using `SerialParam`.

Note

Writing of the files is done using writeXStringSet with parameter append=TRUE. This is to aggregate reads from a sample which might be present in more than one sector.
If data is paired end, then each pair will be written separately with designations in the filename as well as in the definition line as (at)pairX(at) appended at the end.
If parallel=TRUE, then be sure to have a parallel backend registered before running the function. One can use any of the following MulticoreParam SnowParam

Examples

 
## Not run: 
load(file.path(system.file("data", package = "hiReadsProcessor"),
"FLX_seqProps.RData"))
samples <- c('Roth-MLV3p-CD4TMLVWell6-Tsp509I', 
'Roth-MLV3p-CD4TMLVWell6-MseI', 'Roth-MLV3p-CD4TMLVwell5-MuA')
seqs <- extractSeqs(seqProps, sector='2', samplename=samples, feature="primed")
write.listedDNAStringSet(seqs)

## End(Not run)
## Not run: 
load(file.path(system.file("data", package = "hiReadsProcessor"),
"FLX_seqProps.RData"))
samples <- c('Roth-MLV3p-CD4TMLVWell6-Tsp509I', 
'Roth-MLV3p-CD4TMLVWell6-MseI', 'Roth-MLV3p-CD4TMLVwell5-MuA')
seqs <- extractSeqs(seqProps, sector='2', samplename=samples, feature="primed")
write.listedDNAStringSet(seqs)

## End(Not run)

Write PSL file from dataframe or GRanges

Description

Given a data frame or GRanges object, the function write a tab deliminated PSL file

Usage

write.psl(x, filename = "out.psl", header = FALSE, includeOtherCols = FALSE)
write.psl(x, filename = "out.psl", header = FALSE, includeOtherCols = FALSE)

Arguments

`x`	data frame or GRanges object with required columns for psl file format.
`filename`	name for the output PSL file. Default is "out.psl"
`header`	include PSL header line. Default is FALSE.
`includeOtherCols`	nclude other non PSL specific columns from x in the output. Default is FALSE.

Value

name of the output PSL file

Examples

data(psl)
pslFile <- tempfile()
write.psl(psl, filename = pslFile)
data(psl)
pslFile <- tempfile()
write.psl(psl, filename = pslFile)

Package 'hiReadsProcessor'

Help Index

Add a specific feature/attribute to the sampleInfo object.

Description

Usage

Arguments

Value

See Also

Examples

Prepend name attribute of a list to DNAStringSet

Description

Usage

Arguments

Value

See Also

Examples

Find the 5' primers and add results to SampleInfo object.

Description

Usage

Arguments

Value

See Also

Examples

Align a listed DNAStringSet to a reference using gfClient or standalone BLAT.

Description

Usage

Arguments

Value

See Also

Align sequences using BLAT.

Description

Usage

Arguments

Value

See Also

Examples

Breaks an object into chunks of N size.

Description

Usage

Arguments

Value

See Also

Examples

Cluster/Correct values within a window based on their frequency given discrete factors

Description

Usage

Arguments

Value

Note

See Also

Examples

Remove values/positions which are overlapping between discrete groups based on their frequency.

Description

Usage

Arguments

Value

See Also

Examples

Removes duplicate sequences from DNAStringSet object.

Description

Usage

Arguments

Value

See Also

Examples

Test if pattern aligns better in +/- orientation.

Description

Usage

Arguments

Value

See Also

Examples

Extract a specific feature/attribute of the sampleInfo object.

Description

Usage

Arguments

Value

See Also

Examples

Extract sequences for a feature in the sampleInfo object.