| Title: | MSA2dist calculates pairwise distances between all sequences of a DNAStringSet or a AAStringSet using a custom score matrix and conducts codon based analysis |
|---|---|
| Description: | MSA2dist calculates pairwise distances between all sequences of a DNAStringSet or a AAStringSet using a custom score matrix and conducts codon based analysis. It uses scoring matrices to be used in these pairwise distance calculations which can be adapted to any scoring for DNA or AA characters. E.g. by using literal distances MSA2dist calculates pairwise IUPAC distances. DNAStringSet alignments can be analysed as codon alignments to look for synonymous and nonsynonymous substitutions (dN/dS) in a parallelised fashion using a variety of substitution models. Non-aligned coding sequences can be directly used to construct pairwise codon alignments (global/local) and calculate dN/dS without any external dependencies. |
| Authors: | Kristian K Ullrich [aut, cre] (ORCID: <https://orcid.org/0000-0003-4308-9626>) |
| Maintainer: | Kristian K Ullrich <[email protected]> |
| License: | GPL-3 + file LICENSE |
| Version: | 1.17.0 |
| Built: | 2026-05-30 07:32:38 UTC |
| Source: | https://github.com/bioc/MSA2dist |
This function return the selfscore from an AAStringSet.
aa2selfscore(aa, scorematrix = "BLOSUM62")aa2selfscore(aa, scorematrix = "BLOSUM62")
aa |
|
scorematrix |
score matrix to use [default: BLOSUM62] |
data.frame
Kristian K Ullrich
XStringSet-class,
substitution_matrices
data(woodmouse, package="ape") #cds2aa(dnabin2dnastring(woodmouse), shorten=TRUE, #genetic.code=Biostrings::getGeneticCode("2")) woodmouse |> dnabin2dnastring() |> cds2aa(shorten=TRUE, genetic.code=Biostrings::getGeneticCode("2")) |> aa2selfscore()data(woodmouse, package="ape") #cds2aa(dnabin2dnastring(woodmouse), shorten=TRUE, #genetic.code=Biostrings::getGeneticCode("2")) woodmouse |> dnabin2dnastring() |> cds2aa(shorten=TRUE, genetic.code=Biostrings::getGeneticCode("2")) |> aa2selfscore()
This function converts an ape AAbin into
AAStringSet.
aabin2aastring(aabin)aabin2aastring(aabin)
aabin |
|
An object of class AAStringSet
Kristian K Ullrich
as.alignment
as.DNAbin.alignment
AAStringSet
data(woodmouse, package="ape") ## convert into AAStringSet #aabin2aastring(ape::trans(woodmouse, 2)) ape::trans(woodmouse, 2) |> aabin2aastring()data(woodmouse, package="ape") ## convert into AAStringSet #aabin2aastring(ape::trans(woodmouse, 2)) ape::trans(woodmouse, 2) |> aabin2aastring()
getAAMatrix() from the alakazam package.
data(AAMatrix)data(AAMatrix)
an object of class matrix
score matrix
Gupta N, Vander Heiden J, Uduman M, Gadala-Maria D, Yaari G, Kleinstein S (2015) Change-O: a toolkit for analyzing large-scale B cell immunoglobulin repertoire sequencing data. Bioinformatics. 31(20), 3356-3358.
data("AAMatrix", package="MSA2dist")data("AAMatrix", package="MSA2dist")
This function converts an AAStringSet into an ape
AAbin.
aastring2aabin(aa)aastring2aabin(aa)
aa |
|
An object of class AAbin
Kristian K Ullrich
as.alignment
as.DNAbin.alignment
## define two cds sequences cds1 <- Biostrings::DNAString("ATGCAACATTGC") cds2 <- Biostrings::DNAString("ATG---CATTGC") cds1.cds2.aln <- c(Biostrings::DNAStringSet(cds1), Biostrings::DNAStringSet(cds2)) ## convert into AAbin #aastring2aabin(cds2aa(cds1.cds2.aln)) cds1.cds2.aln |> cds2aa() |> aastring2aabin()## define two cds sequences cds1 <- Biostrings::DNAString("ATGCAACATTGC") cds2 <- Biostrings::DNAString("ATG---CATTGC") cds1.cds2.aln <- c(Biostrings::DNAStringSet(cds1), Biostrings::DNAStringSet(cds2)) ## convert into AAbin #aastring2aabin(cds2aa(cds1.cds2.aln)) cds1.cds2.aln |> cds2aa() |> aastring2aabin()
This function converts a AAStringSet into an
seqinr alignment.
aastring2aln(aa)aastring2aln(aa)
aa |
|
An object of class alignment which is a list with the
following components:nb the number of aligned sequencesnam a vector of strings containing the names of the aligned
sequencesseq a vector of strings containing the aligned sequencescom a vector of strings containing the commentaries for each sequence
or NA if there are no comments
Kristian K Ullrich
## define two cds sequences cds1 <- Biostrings::DNAString("ATGCAACATTGC") cds2 <- Biostrings::DNAString("ATG---CATTGC") cds1.cds2.aln <- c(Biostrings::DNAStringSet(cds1), Biostrings::DNAStringSet(cds2)) #aastring2aln(cds2aa(cds1.cds2.aln)) cds1.cds2.aln |> cds2aa() |> aastring2aln()## define two cds sequences cds1 <- Biostrings::DNAString("ATGCAACATTGC") cds2 <- Biostrings::DNAString("ATG---CATTGC") cds1.cds2.aln <- c(Biostrings::DNAStringSet(cds1), Biostrings::DNAStringSet(cds2)) #aastring2aln(cds2aa(cds1.cds2.aln)) cds1.cds2.aln |> cds2aa() |> aastring2aln()
This function calculates pairwise distances for all combinations
of a AAStringSet.
aastring2dist( aa, threads = 1, symmetric = TRUE, score = NULL, mask = NULL, region = NULL )aastring2dist( aa, threads = 1, symmetric = TRUE, score = NULL, mask = NULL, region = NULL )
aa |
|
threads |
number of parallel threads [default: 1] |
symmetric |
symmetric score matrix [default: TRUE] |
score |
|
mask |
|
region |
|
A data.frame of pairwise distance values
distSTRING, sites used sitesUsed and region used
regionUsed
Kristian K Ullrich
## load example sequence data data("hiv", package="MSA2dist") #aastring2dist(cds2aa(hiv), score=granthamMatrix()) hiv |> cds2aa() |> aastring2dist(score=granthamMatrix()) ## create mask mask1 <- IRanges::IRanges(start=c(11,41,71), end=c(20,50,80)) ## use mask hiv |> cds2aa() |> aastring2dist(score=granthamMatrix(), mask=mask1) ## use region region1 <- IRanges::IRanges(start=c(1,75), end=c(45,85)) hiv |> cds2aa() |> aastring2dist(score=granthamMatrix(), region=region1) ## use mask and region hiv |> cds2aa() |> aastring2dist(score=granthamMatrix(), mask=mask1, region=region1) ## use asymmetric score matrix myscore <- granthamMatrix() myscore[5, 6] <- 0 h <- hiv |> cds2aa() |> aastring2dist(score=myscore, symmetric=FALSE) h$distSTRING[1:2, 1:2]## load example sequence data data("hiv", package="MSA2dist") #aastring2dist(cds2aa(hiv), score=granthamMatrix()) hiv |> cds2aa() |> aastring2dist(score=granthamMatrix()) ## create mask mask1 <- IRanges::IRanges(start=c(11,41,71), end=c(20,50,80)) ## use mask hiv |> cds2aa() |> aastring2dist(score=granthamMatrix(), mask=mask1) ## use region region1 <- IRanges::IRanges(start=c(1,75), end=c(45,85)) hiv |> cds2aa() |> aastring2dist(score=granthamMatrix(), region=region1) ## use mask and region hiv |> cds2aa() |> aastring2dist(score=granthamMatrix(), mask=mask1, region=region1) ## use asymmetric score matrix myscore <- granthamMatrix() myscore[5, 6] <- 0 h <- hiv |> cds2aa() |> aastring2dist(score=myscore, symmetric=FALSE) h$distSTRING[1:2, 1:2]
This function adds mask information as an IRanges object,
START and END information, to a
DNAStringSet or an AAStringSet and puts them into the
metadata information.
This information can be used to restrict the distance calculation to
specific regions of the DNAStringSet or the AAStringSet.
addmask2string(seq, mask = NULL, append = TRUE)addmask2string(seq, mask = NULL, append = TRUE)
seq |
|
mask |
|
append |
indicate if mask should be appended or overwritten [default: TRUE] |
An object of class DNAStringSet or AAStringSet
Kristian K Ullrich
addregion2string,
addpop2string,
addpos2string
## load example sequence data data(iupac, package="MSA2dist") iupac.aa <- iupac |> cds2aa(shorten = TRUE) ## create mask mask1 <- IRanges::IRanges(start=c(1,41), end=c(20,50)) ## add mask iupac.aa <- iupac.aa |> addmask2string(mask=mask1) #(iupac.aa |> slot("metadata"))$mask iupac.aa |> getmask() ## append mask mask2 <- IRanges::IRanges(start=c(21), end=c(30)) iupac.aa <- iupac.aa |> addmask2string(mask=mask2) #(iupac.aa |> slot("metadata"))$mask iupac.aa |> getmask() ## overwrite mask iupac.aa <- iupac.aa |> addmask2string(mask=mask2, append=FALSE) #(iupac.aa |> slot("metadata"))$mask iupac.aa |> getmask() ## reduce by mask #iupac.aa.region <- iupac.aa |> string2region(mask= # (iupac.aa |> slot("metadata"))$mask) iupac.aa.region <- iupac.aa |> string2region(mask= getmask(iupac.aa)) #iupac.aa.region |> slot("metadata") iupac.aa.region |> getmask()## load example sequence data data(iupac, package="MSA2dist") iupac.aa <- iupac |> cds2aa(shorten = TRUE) ## create mask mask1 <- IRanges::IRanges(start=c(1,41), end=c(20,50)) ## add mask iupac.aa <- iupac.aa |> addmask2string(mask=mask1) #(iupac.aa |> slot("metadata"))$mask iupac.aa |> getmask() ## append mask mask2 <- IRanges::IRanges(start=c(21), end=c(30)) iupac.aa <- iupac.aa |> addmask2string(mask=mask2) #(iupac.aa |> slot("metadata"))$mask iupac.aa |> getmask() ## overwrite mask iupac.aa <- iupac.aa |> addmask2string(mask=mask2, append=FALSE) #(iupac.aa |> slot("metadata"))$mask iupac.aa |> getmask() ## reduce by mask #iupac.aa.region <- iupac.aa |> string2region(mask= # (iupac.aa |> slot("metadata"))$mask) iupac.aa.region <- iupac.aa |> string2region(mask= getmask(iupac.aa)) #iupac.aa.region |> slot("metadata") iupac.aa.region |> getmask()
This function adds population information to a
DNAStringSet or an AAStringSet and puts them into the
metadata information.
__Note__: All unassigned sequences will be put into pop "unassigned"!
Do not use "unassigned" as a population name!
__Note__: Names in a population in the poplist must match sequence names!
__Note__: Duplicated assignments are allowed!
addpop2string(seq, poplist)addpop2string(seq, poplist)
seq |
|
poplist |
named |
An object of class DNAStringSet or AAStringSet
Kristian K Ullrich
addmask2string,
addregion2string,
addpos2string
## load example sequence data data(iupac, package="MSA2dist") iupac.aa <- iupac |> cds2aa(shorten = TRUE) ## create poplist poplist <- list(FRA = grep("Mmd.FRA", names(iupac)), GER = grep("Mmd.GER", names(iupac)), IRA = grep("Mmd.IRA", names(iupac)), AFG = grep("Mmm.AFG", names(iupac))) iupac.aa <- iupac.aa |> addpop2string(poplist) #(iupac.aa |> slot("metadata"))$pop.integer iupac.aa |> popinteger() #(iupac.aa |> slot("metadata"))$pop.names iupac.aa |> popnames() ## mxixing index and names poplist <- list(FRA = names(iupac)[grep("Mmd.FRA", names(iupac))], GER = grep("Mmd.GER", names(iupac)), IRA = names(iupac)[grep("Mmd.IRA", names(iupac))], AFG = grep("Mmm.AFG", names(iupac))) iupac.aa <- iupac.aa |> addpop2string(poplist) iupac.aa |> popinteger() iupac.aa |> popnames() ## leaving out some sequences which will be assigned as "unassigned" poplist <- list(FRA = names(iupac)[grep("Mmd.FRA", names(iupac))], GER = grep("Mmd.GER", names(iupac)), IRA = names(iupac)[grep("Mmd.IRA", names(iupac))]) iupac.aa <- iupac.aa |> addpop2string(poplist) iupac.aa |> popinteger() iupac.aa |> popnames()## load example sequence data data(iupac, package="MSA2dist") iupac.aa <- iupac |> cds2aa(shorten = TRUE) ## create poplist poplist <- list(FRA = grep("Mmd.FRA", names(iupac)), GER = grep("Mmd.GER", names(iupac)), IRA = grep("Mmd.IRA", names(iupac)), AFG = grep("Mmm.AFG", names(iupac))) iupac.aa <- iupac.aa |> addpop2string(poplist) #(iupac.aa |> slot("metadata"))$pop.integer iupac.aa |> popinteger() #(iupac.aa |> slot("metadata"))$pop.names iupac.aa |> popnames() ## mxixing index and names poplist <- list(FRA = names(iupac)[grep("Mmd.FRA", names(iupac))], GER = grep("Mmd.GER", names(iupac)), IRA = names(iupac)[grep("Mmd.IRA", names(iupac))], AFG = grep("Mmm.AFG", names(iupac))) iupac.aa <- iupac.aa |> addpop2string(poplist) iupac.aa |> popinteger() iupac.aa |> popnames() ## leaving out some sequences which will be assigned as "unassigned" poplist <- list(FRA = names(iupac)[grep("Mmd.FRA", names(iupac))], GER = grep("Mmd.GER", names(iupac)), IRA = names(iupac)[grep("Mmd.IRA", names(iupac))]) iupac.aa <- iupac.aa |> addpop2string(poplist) iupac.aa |> popinteger() iupac.aa |> popnames()
This function adds GenomicRanges information,
CHROM, START and END to a
DNAStringSet or an AAStringSet and puts them into the
metadata information.
This information can be used to find overlaps with a chromosome wide mask.
addpos2string(seq, chrom = NULL, start = NULL, end = NULL)addpos2string(seq, chrom = NULL, start = NULL, end = NULL)
seq |
|
chrom |
chromosome name [mandatory] |
start |
start position [mandatory] |
end |
end position [mandatory] |
An object of class DNAStringSet or AAStringSet
Kristian K Ullrich
addmask2string,
addregion2string,
addpop2string
## load example sequence data data(iupac, package="MSA2dist") ## add position iupac <- iupac |> addpos2string(chrom="chr1", start=1, end=1000) #(iupac |> slot("metadata"))$GRanges iupac |> getpos()## load example sequence data data(iupac, package="MSA2dist") ## add position iupac <- iupac |> addpos2string(chrom="chr1", start=1, end=1000) #(iupac |> slot("metadata"))$GRanges iupac |> getpos()
This function adds region information as an IRanges
object, START and END information, to a
DNAStringSet or an AAStringSet and puts them into the
metadata information.
This information can be used to restrict the distance calculation to
specific regions of the DNAStringSet or the AAStringSet.
addregion2string(seq, region = NULL, append = TRUE)addregion2string(seq, region = NULL, append = TRUE)
seq |
|
region |
|
append |
indicate if region should be appended or overwritten [default: TRUE] |
An object of class DNAStringSet or AAStringSet
Kristian K Ullrich
addmask2string,
addpop2string,
addpos2string
## load example sequence data data(iupac, package="MSA2dist") iupac.aa <- iupac |> cds2aa(shorten = TRUE) ## create region region1 <- IRanges::IRanges(start=c(1,41), end=c(20,50)) ## add region iupac.aa <- iupac.aa |> addregion2string(region=region1) #(iupac.aa |> slot("metadata"))$region iupac.aa |> region() ## append region region2 <- IRanges::IRanges(start=c(21), end=c(30)) iupac.aa <- iupac.aa |> addregion2string(region=region2) #(iupac.aa |> slot("metadata"))$region iupac.aa |> region() ## overwrite region iupac.aa <- iupac.aa |> addregion2string(region=region2, append=FALSE) #(iupac.aa |> slot("metadata"))$region iupac.aa |> region() ## reduce by region #iupac.aa.region <- iupac.aa |> string2region(region= # (iupac.aa |> slot("metadata"))$region) iupac.aa.region <- iupac.aa |> string2region(region= region(iupac.aa)) #iupac.aa.region |> slot("metadata") iupac.aa.region |> region()## load example sequence data data(iupac, package="MSA2dist") iupac.aa <- iupac |> cds2aa(shorten = TRUE) ## create region region1 <- IRanges::IRanges(start=c(1,41), end=c(20,50)) ## add region iupac.aa <- iupac.aa |> addregion2string(region=region1) #(iupac.aa |> slot("metadata"))$region iupac.aa |> region() ## append region region2 <- IRanges::IRanges(start=c(21), end=c(30)) iupac.aa <- iupac.aa |> addregion2string(region=region2) #(iupac.aa |> slot("metadata"))$region iupac.aa |> region() ## overwrite region iupac.aa <- iupac.aa |> addregion2string(region=region2, append=FALSE) #(iupac.aa |> slot("metadata"))$region iupac.aa |> region() ## reduce by region #iupac.aa.region <- iupac.aa |> string2region(region= # (iupac.aa |> slot("metadata"))$region) iupac.aa.region <- iupac.aa |> string2region(region= region(iupac.aa)) #iupac.aa.region |> slot("metadata") iupac.aa.region |> region()
This function converts a seqinr alignment into
an AAStringSet.
aln2aastring(aln)aln2aastring(aln)
aln |
|
An object of class AAStringSet
Kristian K Ullrich
## define two cds sequences cds1 <- Biostrings::DNAString("ATGCAACATTGC") cds2 <- Biostrings::DNAString("ATG---CATTGC") cds1.cds2.aln <- c(Biostrings::DNAStringSet(cds1), Biostrings::DNAStringSet(cds2)) #aastring2aln(cds2aa(cds1.cds2.aln)) cds1.cds2.aln |> cds2aa() |> aastring2aln() |> aln2aastring()## define two cds sequences cds1 <- Biostrings::DNAString("ATGCAACATTGC") cds2 <- Biostrings::DNAString("ATG---CATTGC") cds1.cds2.aln <- c(Biostrings::DNAStringSet(cds1), Biostrings::DNAStringSet(cds2)) #aastring2aln(cds2aa(cds1.cds2.aln)) cds1.cds2.aln |> cds2aa() |> aastring2aln() |> aln2aastring()
This function converts a seqinr alignment into
an DNAStringSet.
aln2dnastring(aln)aln2dnastring(aln)
aln |
|
An object of class DNAStringSet
Kristian K Ullrich
## define two cds sequences cds1 <- Biostrings::DNAString("ATGCAACATTGC") cds2 <- Biostrings::DNAString("ATG---CATTGC") cds1.cds2.aln <- c(Biostrings::DNAStringSet(cds1), Biostrings::DNAStringSet(cds2)) ## convert into alignment #dnastring2aln(cds1.cds2.aln) cds1.cds2.aln |> dnastring2aln() ## convert back into DNAStringSet #aln2dnastring(dnastring2aln(cds1.cds2.aln)) cds1.cds2.aln |> dnastring2aln() |> aln2dnastring()## define two cds sequences cds1 <- Biostrings::DNAString("ATGCAACATTGC") cds2 <- Biostrings::DNAString("ATG---CATTGC") cds1.cds2.aln <- c(Biostrings::DNAStringSet(cds1), Biostrings::DNAStringSet(cds2)) ## convert into alignment #dnastring2aln(cds1.cds2.aln) cds1.cds2.aln |> dnastring2aln() ## convert back into DNAStringSet #aln2dnastring(dnastring2aln(cds1.cds2.aln)) cds1.cds2.aln |> dnastring2aln() |> aln2dnastring()
This function translates a DNAStringSet into an
AAStringSet.
cds2aa( cds, shorten = FALSE, frame = 1, framelist = NULL, genetic.code = NULL, return.cds = FALSE )cds2aa( cds, shorten = FALSE, frame = 1, framelist = NULL, genetic.code = NULL, return.cds = FALSE )
cds |
|
shorten |
shorten all sequences to multiple of three [default: FALSE] |
frame |
indicates the first base of a the first codon [default: 1] |
framelist |
supply vector of frames for each entry [default: NULL] |
genetic.code |
The genetic code to use for the translation of codons into Amino Acid letters [default: NULL] |
return.cds |
return shorten cds instead of aa [default: FALSE] |
AAStringSet
Kristian K Ullrich
## define two cds sequences cds1 <- Biostrings::DNAString("ATGCAACATTGC") cds2 <- Biostrings::DNAString("ATG---CATTGC") cds1.cds2.aln <- c(Biostrings::DNAStringSet(cds1), Biostrings::DNAStringSet(cds2)) #cds2aa(cds1.cds2.aln) cds1.cds2.aln |> cds2aa() ## alternative genetic code data(woodmouse, package="ape") #cds2aa(dnabin2dnastring(woodmouse), shorten=TRUE) woodmouse |> dnabin2dnastring() |> cds2aa(shorten=TRUE) #cds2aa(dnabin2dnastring(woodmouse), shorten=TRUE, #genetic.code=Biostrings::getGeneticCode("2")) woodmouse |> dnabin2dnastring() |> cds2aa(shorten=TRUE, genetic.code=Biostrings::getGeneticCode("2")) woodmouse |> dnabin2dnastring() |> cds2aa(shorten=TRUE, return.cds=TRUE) |> cds2aa(genetic.code=Biostrings::getGeneticCode("2"))## define two cds sequences cds1 <- Biostrings::DNAString("ATGCAACATTGC") cds2 <- Biostrings::DNAString("ATG---CATTGC") cds1.cds2.aln <- c(Biostrings::DNAStringSet(cds1), Biostrings::DNAStringSet(cds2)) #cds2aa(cds1.cds2.aln) cds1.cds2.aln |> cds2aa() ## alternative genetic code data(woodmouse, package="ape") #cds2aa(dnabin2dnastring(woodmouse), shorten=TRUE) woodmouse |> dnabin2dnastring() |> cds2aa(shorten=TRUE) #cds2aa(dnabin2dnastring(woodmouse), shorten=TRUE, #genetic.code=Biostrings::getGeneticCode("2")) woodmouse |> dnabin2dnastring() |> cds2aa(shorten=TRUE, genetic.code=Biostrings::getGeneticCode("2")) woodmouse |> dnabin2dnastring() |> cds2aa(shorten=TRUE, return.cds=TRUE) |> cds2aa(genetic.code=Biostrings::getGeneticCode("2"))
This function takes two single sequence DNAString's or
two single sequence DNAStringSet's, converts them into aa, calculates
a global alignment and converts this alignment back into a codon alignment.
cds2codonaln( cds1, cds2, type = "global", substitutionMatrix = "BLOSUM62", gapOpening = 10, gapExtension = 0.5, remove.gaps = FALSE, ... )cds2codonaln( cds1, cds2, type = "global", substitutionMatrix = "BLOSUM62", gapOpening = 10, gapExtension = 0.5, remove.gaps = FALSE, ... )
cds1 |
single sequence |
cds2 |
single sequence |
type |
type of alignment (see
|
substitutionMatrix |
substitution matrix representing the fixed
substitution scores for an alignment (see
|
gapOpening |
the cost for opening a gap in the alignment (see
|
gapExtension |
the incremental cost incurred along the length of the
gap in the alignment (see |
remove.gaps |
specify if gaps in the codon alignment should be removed [default: FALSE] |
... |
other cds2aa parameters |
codon alignment as DNAStringSet
Kristian K Ullrich
Pagès, H et al. (2014) Biostrings: Efficient manipulation of biological strings. R package version, 2(0).
## define two cds sequences cds1 <- Biostrings::DNAString("ATGCAACATTGC") cds2 <- Biostrings::DNAString("ATGCATTGC") cds2codonaln(cds1, cds2)## define two cds sequences cds1 <- Biostrings::DNAString("ATGCAACATTGC") cds2 <- Biostrings::DNAString("ATGCATTGC") cds2codonaln(cds1, cds2)
This function takes two sequences as DNAStringSet,
and their corresponding AAStringSet, calculates
a global alignment and converts this alignment back into a codon alignment.
cdsstring2codonaln( cds, aa, type = "global", substitutionMatrix = "BLOSUM62", gapOpening = 10, gapExtension = 0.5, remove.gaps = FALSE )cdsstring2codonaln( cds, aa, type = "global", substitutionMatrix = "BLOSUM62", gapOpening = 10, gapExtension = 0.5, remove.gaps = FALSE )
cds |
two sequences |
aa |
two sequences |
type |
type of alignment (see
|
substitutionMatrix |
substitution matrix representing the fixed
substitution scores for an alignment (see
|
gapOpening |
the cost for opening a gap in the alignment (see
|
gapExtension |
the incremental cost incurred along the length of the
gap in the alignment (see |
remove.gaps |
specify if gaps in the codon alignment should be removed [default: FALSE] |
codon alignment as DNAStringSet
Kristian K Ullrich
Pagès, H et al. (2014) Biostrings: Efficient manipulation of biological strings. R package version, 2(0).
## define two cds sequences cds <- Biostrings::DNAStringSet(c("ATGCAACATTGC", "ATGCATTGC")) names(cds) <- c("cds1", "cds2") ## get protein alignment aa <- MSA2dist::cds2aa(cds) cdsstring2codonaln(cds, aa)## define two cds sequences cds <- Biostrings::DNAStringSet(c("ATGCAACATTGC", "ATGCATTGC")) names(cds) <- c("cds1", "cds2") ## get protein alignment aa <- MSA2dist::cds2aa(cds) cdsstring2codonaln(cds, aa)
This function converts a codon into a number,
but accept N and -.
codon2numberAMBIG(codon)codon2numberAMBIG(codon)
codon |
[mandatory] |
An object of class numeric
Kristian K Ullrich
#unlist(lapply(names(Biostrings::GENETIC_CODE), codon2numberAMBIG)) names(Biostrings::GENETIC_CODE) |> codon2numberAMBIG()#unlist(lapply(names(Biostrings::GENETIC_CODE), codon2numberAMBIG)) names(Biostrings::GENETIC_CODE) |> codon2numberAMBIG()
This function converts a codon into a number.
codon2numberTCAG(codon)codon2numberTCAG(codon)
codon |
[mandatory] |
An object of class numeric
Kristian K Ullrich
#unlist(lapply(names(Biostrings::GENETIC_CODE), codon2numberTCAG)) names(Biostrings::GENETIC_CODE) |> codon2numberTCAG()#unlist(lapply(names(Biostrings::GENETIC_CODE), codon2numberTCAG)) names(Biostrings::GENETIC_CODE) |> codon2numberTCAG()
This function calculates pn/ps according to Nei and Gojobori (1986).
codonmat2pnps(codonmat)codonmat2pnps(codonmat)
codonmat |
|
An object of class pnps which is a list with the following
components:seq1 sequence1 nameseq2 sequence2 nameCodons sequence2 nameCompared sequence2 nameAmbigiuous sequence2 nameIndels sequence2 nameNs sequence2 nameSd sequence2 nameSn sequence2 nameS sequence2 nameN sequence2 nameps sequence2 namepn sequence2 namepnps sequence2 nameds sequence2 namedn sequence2 namednds sequence2 name
Kristian K Ullrich
Nei and Gojobori. (1986) Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol. Biol. Evol., 3(5), 418-426.
Ganeshan et al. (1997) Human immunodeficiency virus type 1 genetic evolution in children with different rates of development of disease. J. Virology. 71(1), 663-677.
Yang et al. (2000) Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics. 155(1), 431-449.
## load example sequence data data("hiv", package="MSA2dist") #codonmat2pnps(dnastring2codonmat(hiv)[,c(1, 2)]) (hiv |> dnastring2codonmat())[,c(1, 2)] |> codonmat2pnps()## load example sequence data data("hiv", package="MSA2dist") #codonmat2pnps(dnastring2codonmat(hiv)[,c(1, 2)]) (hiv |> dnastring2codonmat())[,c(1, 2)] |> codonmat2pnps()
This function calculates average behavior of each codon for all pairwise comparisons for indels, syn, and nonsyn mutations according to Nei and Gojobori (1986).
codonmat2xy(codonmat, threads = 1)codonmat2xy(codonmat, threads = 1)
codonmat |
|
threads |
number of parallel threads [default: 1] |
A data.frame object with the following components:Codon Codon indexn number of comparisonSynSum Sum of synNonSynSum Sum of nonsynIndelSum Sum of indelsSynMean average syn per codonNonSynMean average nonsyn per codonIndelMean average indels per codonCumSumSynMean cumulative average syn per codonCumSumNonSynMean cumulative average nonsyn per codonCumSumIndelMean cumulative indels per codon
Kristian K Ullrich
Nei and Gojobori. (1986) Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol. Biol. Evol., 3(5), 418-426.
Ganeshan et al. (1997) Human immunodeficiency virus type 1 genetic evolution in children with different rates of development of disease. J. Virology. 71(1), 663-677.
Yang et al. (2000) Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics. 155(1), 431-449.
dnastring2codonmat
codonmat2pnps
dnastring2kaks
kaks
## load example sequence data data("hiv", package="MSA2dist") #codonmat2xy(dnastring2codonmat(hiv)) hiv |> dnastring2codonmat() |> codonmat2xy() #codonmat2xy(dnastring2codonmat(hiv), threads=2) hiv |> dnastring2codonmat() |> codonmat2xy(threads=2)## load example sequence data data("hiv", package="MSA2dist") #codonmat2xy(dnastring2codonmat(hiv)) hiv |> dnastring2codonmat() |> codonmat2xy() #codonmat2xy(dnastring2codonmat(hiv), threads=2) hiv |> dnastring2codonmat() |> codonmat2xy(threads=2)
This function compares two codons and returns the number of syn and non-syn sites according to Nei and Gojobori (1986).
compareCodons(codA, codB)compareCodons(codA, codB)
codA |
|
codB |
|
vector of syn and non-syn sites
Kristian K Ullrich
Nei and Gojobori. (1986) Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol. Biol. Evol., 3(5), 418-426.
Ganeshan et al. (1997) Human immunodeficiency virus type 1 genetic evolution in children with different rates of development of disease. J. Virology. 71(1), 663-677.
Yang et al. (2000) Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics. 155(1), 431-449.
compareCodons("AAA","TTA") compareCodons("AAA","TAT") compareCodons("AAA","ATT") compareCodons("AAA","TTT") ## load example sequence data data("hiv", package="MSA2dist") compareCodons(dnastring2codonmat(hiv)[1,1], dnastring2codonmat(hiv)[1,2])compareCodons("AAA","TTA") compareCodons("AAA","TAT") compareCodons("AAA","ATT") compareCodons("AAA","TTT") ## load example sequence data data("hiv", package="MSA2dist") compareCodons(dnastring2codonmat(hiv)[1,1], dnastring2codonmat(hiv)[1,2])
This function converts an ape DNAbin into a
DNAStringSet.
dnabin2dnastring(dnabin)dnabin2dnastring(dnabin)
dnabin |
|
An object of class DNAStringSet
Kristian K Ullrich
as.alignment
as.DNAbin.alignment
DNAStringSet
data(woodmouse, package="ape") ## convert into DNAStringSet #dnabin2dnastring(woodmouse) woodmouse |> dnabin2dnastring()data(woodmouse, package="ape") ## convert into DNAStringSet #dnabin2dnastring(woodmouse) woodmouse |> dnabin2dnastring()
This function converts a DNAStringSet into an
seqinr alignment.
dnastring2aln(dna)dnastring2aln(dna)
dna |
|
An object of class alignment which is a list with the
following components:nb the number of aligned sequencesnam a vector of strings containing the names of the aligned
sequencesseq a vector of strings containing the aligned sequencescom a vector of strings containing the commentaries for each sequence
or NA if there are no comments
Kristian K Ullrich
## define two cds sequences cds1 <- Biostrings::DNAString("ATGCAACATTGC") cds2 <- Biostrings::DNAString("ATG---CATTGC") cds1.cds2.aln <- c(Biostrings::DNAStringSet(cds1), Biostrings::DNAStringSet(cds2)) ## convert into alignment #dnastring2aln(cds1.cds2.aln) cds1.cds2.aln |> dnastring2aln()## define two cds sequences cds1 <- Biostrings::DNAString("ATGCAACATTGC") cds2 <- Biostrings::DNAString("ATG---CATTGC") cds1.cds2.aln <- c(Biostrings::DNAStringSet(cds1), Biostrings::DNAStringSet(cds2)) ## convert into alignment #dnastring2aln(cds1.cds2.aln) cds1.cds2.aln |> dnastring2aln()
This function converts a DNAStringSet into a
codon matrix.
dnastring2codonmat(cds, shorten = FALSE, frame = 1, framelist = NULL)dnastring2codonmat(cds, shorten = FALSE, frame = 1, framelist = NULL)
cds |
|
shorten |
shorten all sequences to multiple of three [default: FALSE] |
frame |
indicates the first base of a the first codon [default: 1] |
framelist |
supply vector of frames for each entry [default: NULL] |
An object of class alignment which is a list with the
following components:nb the number of aligned sequencesnam a vector of strings containing the names of the aligned
sequencesseq a vector of strings containing the aligned sequencescom a vector of strings containing the commentaries for each sequence
or NA if there are no comments
Kristian K Ullrich
## define two cds sequences cds1 <- Biostrings::DNAString("ATGCAACATTGC") cds2 <- Biostrings::DNAString("ATG---CATTGC") cds1.cds2.aln <- c(Biostrings::DNAStringSet(cds1), Biostrings::DNAStringSet(cds2)) ## convert into alignment #dnastring2codonmat(cds1.cds2.aln) cds1.cds2.aln |> dnastring2codonmat() ## use frame 2 and shorten to circumvent multiple of three error cds1 <- Biostrings::DNAString("-ATGCAACATTGC-") cds2 <- Biostrings::DNAString("-ATG---CATTGC-") cds1.cds2.aln <- c(Biostrings::DNAStringSet(cds1), Biostrings::DNAStringSet(cds2)) cds1.cds2.aln |> dnastring2codonmat(frame=2, shorten=TRUE)## define two cds sequences cds1 <- Biostrings::DNAString("ATGCAACATTGC") cds2 <- Biostrings::DNAString("ATG---CATTGC") cds1.cds2.aln <- c(Biostrings::DNAStringSet(cds1), Biostrings::DNAStringSet(cds2)) ## convert into alignment #dnastring2codonmat(cds1.cds2.aln) cds1.cds2.aln |> dnastring2codonmat() ## use frame 2 and shorten to circumvent multiple of three error cds1 <- Biostrings::DNAString("-ATGCAACATTGC-") cds2 <- Biostrings::DNAString("-ATG---CATTGC-") cds1.cds2.aln <- c(Biostrings::DNAStringSet(cds1), Biostrings::DNAStringSet(cds2)) cds1.cds2.aln |> dnastring2codonmat(frame=2, shorten=TRUE)
This function calculates pairwise distances for all
combinations of a DNAStringSet.
dnastring2dist( dna, model = "IUPAC", threads = 1, symmetric = TRUE, score = NULL, mask = NULL, region = NULL, ... )dnastring2dist( dna, model = "IUPAC", threads = 1, symmetric = TRUE, score = NULL, mask = NULL, region = NULL, ... )
dna |
|
model |
specify model either "IUPAC" or any model from
|
threads |
number of parallel threads [default: 1] |
symmetric |
symmetric score matrix [default: TRUE] |
score |
|
mask |
|
region |
|
... |
other |
A data.frame of pairwise distance values distSTRING and
sites used sitesUsed
Kristian K Ullrich
## load example sequence data data("hiv", package="MSA2dist") #dnastring2dist(hiv, model="IUPAC") hiv |> dnastring2dist(model="IUPAC") #dnastring2dist(hiv, model="K80") hiv |> dnastring2dist(model="K80") data("woodmouse", package="ape") #dnastring2dist(dnabin2dnastring(woodmouse), score=iupacMatrix()) woodmouse |> dnabin2dnastring() |> dnastring2dist() #dnastring2dist(hiv, model = "IUPAC", threads = 2) hiv |> dnastring2dist(model = "IUPAC", threads = 2) ## create mask mask1 <- IRanges::IRanges(start=c(1,61,121), end=c(30,90,150)) ## use mask hiv |> dnastring2dist(model="IUPAC", mask=mask1) ## use region region1 <- IRanges::IRanges(start=c(1,139), end=c(75,225)) hiv |> dnastring2dist(model="IUPAC", region=region1) ## use mask and region hiv |> dnastring2dist(model="IUPAC", mask=mask1, region=region1) ## use asymmetric score matrix myscore <- iupacMatrix() myscore[1, 4] <- 0.5 (hiv |> dnastring2dist(score=myscore, symmetric=FALSE))$distSTRING[1:2, 1:2]## load example sequence data data("hiv", package="MSA2dist") #dnastring2dist(hiv, model="IUPAC") hiv |> dnastring2dist(model="IUPAC") #dnastring2dist(hiv, model="K80") hiv |> dnastring2dist(model="K80") data("woodmouse", package="ape") #dnastring2dist(dnabin2dnastring(woodmouse), score=iupacMatrix()) woodmouse |> dnabin2dnastring() |> dnastring2dist() #dnastring2dist(hiv, model = "IUPAC", threads = 2) hiv |> dnastring2dist(model = "IUPAC", threads = 2) ## create mask mask1 <- IRanges::IRanges(start=c(1,61,121), end=c(30,90,150)) ## use mask hiv |> dnastring2dist(model="IUPAC", mask=mask1) ## use region region1 <- IRanges::IRanges(start=c(1,139), end=c(75,225)) hiv |> dnastring2dist(model="IUPAC", region=region1) ## use mask and region hiv |> dnastring2dist(model="IUPAC", mask=mask1, region=region1) ## use asymmetric score matrix myscore <- iupacMatrix() myscore[1, 4] <- 0.5 (hiv |> dnastring2dist(score=myscore, symmetric=FALSE))$distSTRING[1:2, 1:2]
This function converts a DNAStringSet into an
ape DNAbin.
dnastring2dnabin(dna)dnastring2dnabin(dna)
dna |
|
An object of class DNAbin
Kristian K Ullrich
as.alignment
as.DNAbin.alignment
## define two cds sequences cds1 <- Biostrings::DNAString("ATGCAACATTGC") cds2 <- Biostrings::DNAString("ATG---CATTGC") cds1.cds2.aln <- c(Biostrings::DNAStringSet(cds1), Biostrings::DNAStringSet(cds2)) ## convert into DNAbin #dnastring2dnabin(cds1.cds2.aln) cds1.cds2.aln |> dnastring2dnabin()## define two cds sequences cds1 <- Biostrings::DNAString("ATGCAACATTGC") cds2 <- Biostrings::DNAString("ATG---CATTGC") cds1.cds2.aln <- c(Biostrings::DNAStringSet(cds1), Biostrings::DNAStringSet(cds2)) ## convert into DNAbin #dnastring2dnabin(cds1.cds2.aln) cds1.cds2.aln |> dnastring2dnabin()
This function calculates Ka/Ks (pN/pS)
for all combinations of a DNAStringSet.
If the sequences in the DNAStringSet are not a multiple-sequence
alignment, pairwise codon alignments can be calculated on the fly.
Models used and implemented according to
Li (1993) (using seqinr) or
Nei and Gojobori (1986) (own implementation) or models from
KaKs_Calculator2 ported to MSA2dist with Rcpp.
dnastring2kaks( cds, model = "Li", threads = 1, isMSA = TRUE, sgc = "1", verbose = FALSE, ... )dnastring2kaks( cds, model = "Li", threads = 1, isMSA = TRUE, sgc = "1", verbose = FALSE, ... )
cds |
|
model |
specify codon model either "Li" or "NG86" or
one of |
threads |
number of parallel threads [default: 1] |
isMSA |
cds |
sgc |
standard genetic code (for KaKs Calculator models) [default: 1] |
verbose |
verbosity (for KaKs Calculator models) [default: FALSE] |
... |
other codon alignment parameters |
A data.frame of KaKs values
Kristian K Ullrich
"MS/MA/GNG/GLWL/GLPB/GMLWL/GMLPB/GYN:" Wang et al. (2010) KaKs_Calculator 2.0: a toolkit incorporating gamma-series methods and sliding window strategies.Genomics, proteomics & bioinformatics. 8(1), 77-80.
"Li/LWL:" Li et al. (1985) A new method for estimating synonymous and nonsynonymous rates of nucleotide substitution considering the relative likelihood of nucleotide and codon changes. Mol. Biol. Evol., 2(2), 150-174.
"Li/LPB:" Li (1993). Unbiased estimation of the rates of synonymous and nonsynonymous substitution. Journal of molecular evolution, 36(1), pp.96-99.
"NG86/NG:" Nei and Gojobori. (1986) Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol. Biol. Evol., 3(5), 418-426.
"LPB:" Pamilo and Bianchi. (1993) Evolution of the Zfx and Zfy genes: Rates and interdependence between genes. Mol. Biol. Evol., 10, 271-281.
"MLWL/MLPB:" Tzeng et al. (2004). Comparison of three methods for estimating rates of synonymous and nonsynonymous nucleotide substitutions. Mol. Biol. Evol., 21(12), 2290-2298.
"GY:" Goldman and Yang (1994). A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol., 11(5) 725-736.
"YN:" Yang et al. (2000) Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics. 155(1), 431-449.
"MYN:" Zhang et al. (2006). Computing Ka and Ks with a consideration of unequal transitional substitutions. BMC evolutionary biology, 6(1), 1-10.
"data(hiv):" Ganeshan et al. (1997) Human immunodeficiency virus type 1 genetic evolution in children with different rates of development of disease. J. Virology. 71(1), 663-677.
Wang et al. (2009). gamma-MYN: a new algorithm for estimating Ka and Ks with consideration of variable substitution rates. Biology Direct, 4(1), 1-18.
## load example sequence data data("hiv", package="MSA2dist") #dnastring2kaks(hiv, model="Li") hiv |> dnastring2kaks(model="Li") #dnastring2kaks(hiv, model="NG86") hiv |> dnastring2kaks(model="NG86") #dnastring2kaks(hiv, model="NG86", threads=2) hiv |> dnastring2kaks(model="NG86", threads=2) ## define three unaligned cds sequences cds1 <- Biostrings::DNAString("ATGCAACATTGC") cds2 <- Biostrings::DNAString("ATGCATTGC") cds3 <- Biostrings::DNAString("ATGCAATGC") cds_sequences <- Biostrings::DNAStringSet(list(cds1, cds2, cds3)) names(cds_sequences) <- c("cds1", "cds2", "cds3") ## set isMSA to FALSE to automatically create pairwise codon alignments #dnastring2kaks(cds_sequences, model="Li", isMSA=FALSE) cds_sequences |> dnastring2kaks(model="Li", isMSA=FALSE)## load example sequence data data("hiv", package="MSA2dist") #dnastring2kaks(hiv, model="Li") hiv |> dnastring2kaks(model="Li") #dnastring2kaks(hiv, model="NG86") hiv |> dnastring2kaks(model="NG86") #dnastring2kaks(hiv, model="NG86", threads=2) hiv |> dnastring2kaks(model="NG86", threads=2) ## define three unaligned cds sequences cds1 <- Biostrings::DNAString("ATGCAACATTGC") cds2 <- Biostrings::DNAString("ATGCATTGC") cds3 <- Biostrings::DNAString("ATGCAATGC") cds_sequences <- Biostrings::DNAStringSet(list(cds1, cds2, cds3)) names(cds_sequences) <- c("cds1", "cds2", "cds3") ## set isMSA to FALSE to automatically create pairwise codon alignments #dnastring2kaks(cds_sequences, model="Li", isMSA=FALSE) cds_sequences |> dnastring2kaks(model="Li", isMSA=FALSE)
GENETIC_CODE from Biostrings extended by codon
number and number of syn sites.
codon2number(codon)codon2number(codon)
codon |
|
An object of class numeric
Kristian K Ullrich
GENETIC_CODE_TCAGGENETIC_CODE_TCAG
This function shows the mask slot from a
DNAStringSet or an AAStringSet
metadata information.
getmask(seq)getmask(seq)
seq |
|
IRanges information from metadata
Kristian K Ullrich
## load example sequence data data(iupac, package="MSA2dist") iupac.aa <- iupac |> cds2aa(shorten = TRUE) ## create mask mask1 <- IRanges::IRanges(start=c(1,41), end=c(20,50)) ## add mask iupac.aa <- iupac.aa |> addmask2string(mask=mask1) #(iupac.aa |> slot("metadata"))$mask iupac.aa |> getmask()## load example sequence data data(iupac, package="MSA2dist") iupac.aa <- iupac |> cds2aa(shorten = TRUE) ## create mask mask1 <- IRanges::IRanges(start=c(1,41), end=c(20,50)) ## add mask iupac.aa <- iupac.aa |> addmask2string(mask=mask1) #(iupac.aa |> slot("metadata"))$mask iupac.aa |> getmask()
This function shows the position slot from a
DNAStringSet or an AAStringSet
metadata information.
getpos(seq)getpos(seq)
seq |
|
GenomicRanges information from metadata
Kristian K Ullrich
## load example sequence data data(iupac, package="MSA2dist") ## add position iupac <- iupac |> addpos2string(chrom="chr1", start=1, end=1000) #(iupac |> slot("metadata"))$GRanges iupac |> getpos()## load example sequence data data(iupac, package="MSA2dist") ## add position iupac <- iupac |> addpos2string(chrom="chr1", start=1, end=1000) #(iupac |> slot("metadata"))$GRanges iupac |> getpos()
This function returns a DNAStringSet reduced by all
sites containing any gaps ("-", "+", ".") or missing ("N") sites.
globalDeletion(dna)globalDeletion(dna)
dna |
|
DNAStringSet
Kristian K Ullrich
## define two cds sequences cds1 <- Biostrings::DNAString("ATGCAACATTGC") cds2 <- Biostrings::DNAString("ATG---CATTGC") cds1.cds2.aln <- c(Biostrings::DNAStringSet(cds1), Biostrings::DNAStringSet(cds2)) globalDeletion(cds1.cds2.aln)## define two cds sequences cds1 <- Biostrings::DNAString("ATGCAACATTGC") cds2 <- Biostrings::DNAString("ATG---CATTGC") cds1.cds2.aln <- c(Biostrings::DNAStringSet(cds1), Biostrings::DNAStringSet(cds2)) globalDeletion(cds1.cds2.aln)
This function returns an AAStringSet reduced by all
sites containing any gaps ("-", "+", ".") or missing ("X") sites.
globalDeletionAA(aa)globalDeletionAA(aa)
aa |
|
AAStringSet
Kristian K Ullrich
## define two cds sequences cds1 <- Biostrings::DNAString("ATGCAACATTGC") cds2 <- Biostrings::DNAString("ATG---CATTGC") cds1.cds2.aln <- c(Biostrings::DNAStringSet(cds1), Biostrings::DNAStringSet(cds2)) #globalDeletionAA(cds2aa(cds1.cds2.aln)) cds1.cds2.aln |> cds2aa() |> globalDeletionAA()## define two cds sequences cds1 <- Biostrings::DNAString("ATGCAACATTGC") cds2 <- Biostrings::DNAString("ATG---CATTGC") cds1.cds2.aln <- c(Biostrings::DNAStringSet(cds1), Biostrings::DNAStringSet(cds2)) #globalDeletionAA(cds2aa(cds1.cds2.aln)) cds1.cds2.aln |> cds2aa() |> globalDeletionAA()
This function creates a granthamMatrix object to be used
with the rcpp_distSTRING function. By default,the grantham matrix is
defined as from Grantham 1974.
(see https://link.springer.com/article/10.1007/s00335-017-9704-9)
granthamMatrix()granthamMatrix()
matrix
Kristian K Ullrich
Grantham R. (1974). Amino Acid Difference Formula to Help Explain Protein Evolution. Science,185(4154),862-864.
granthamMatrix()granthamMatrix()
Example cds sequences from HIV-1 sample 136 patient 1 from
Sweden envelope glycoprotein (env) gene, V3 region as DNAStringSet.
data(hiv)data(hiv)
an object of class DNAStringSet
see XStringSet-class
Yang et al. (2000) Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics. 155(1), 431-449.
data("hiv", package="MSA2dist")data("hiv", package="MSA2dist")
This function calculates Ka/Ks (pN/pS)
for all combinations given in an indices list of a
DNAStringSet.
If the sequences in the DNAStringSet are not a multiple-sequence
alignment, pairwise codon alignments can be calculated on the fly.
Models used and implemented according to
Li (1993) (using seqinr) or
Nei and Gojobori (1986) (own implementation) or models from
KaKs_Calculator2 ported to MSA2dist with Rcpp.
indices2kaks( cds, indices, model = "Li", threads = 1, isMSA = TRUE, sgc = "1", verbose = FALSE, ... )indices2kaks( cds, indices, model = "Li", threads = 1, isMSA = TRUE, sgc = "1", verbose = FALSE, ... )
cds |
|
indices |
|
model |
specify codon model either "Li" or "NG86" or
one of |
threads |
number of parallel threads [default: 1] |
isMSA |
cds |
sgc |
standard genetic code (for KaKs Calculator models) [default: 1] |
verbose |
verbosity (for KaKs Calculator models) [default: FALSE] |
... |
other codon alignment parameters |
A data.frame of KaKs values
Kristian K Ullrich
"MS/MA/GNG/GLWL/GLPB/GMLWL/GMLPB/GYN:" Wang et al. (2010) KaKs_Calculator 2.0: a toolkit incorporating gamma-series methods and sliding window strategies.Genomics, proteomics & bioinformatics. 8(1), 77-80.
"Li/LWL:" Li et al. (1985) A new method for estimating synonymous and nonsynonymous rates of nucleotide substitution considering the relative likelihood of nucleotide and codon changes. Mol. Biol. Evol., 2(2), 150-174.
"Li/LPB:" Li (1993). Unbiased estimation of the rates of synonymous and nonsynonymous substitution. Journal of molecular evolution, 36(1), pp.96-99.
"NG86/NG:" Nei and Gojobori. (1986) Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol. Biol. Evol., 3(5), 418-426.
"LPB:" Pamilo and Bianchi. (1993) Evolution of the Zfx and Zfy genes: Rates and interdependence between genes. Mol. Biol. Evol., 10, 271-281.
"MLWL/MLPB:" Tzeng et al. (2004). Comparison of three methods for estimating rates of synonymous and nonsynonymous nucleotide substitutions. Mol. Biol. Evol., 21(12), 2290-2298.
"GY:" Goldman and Yang (1994). A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol., 11(5) 725-736.
"YN:" Yang et al. (2000) Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics. 155(1), 431-449.
"MYN:" Zhang et al. (2006). Computing Ka and Ks with a consideration of unequal transitional substitutions. BMC evolutionary biology, 6(1), 1-10.
"data(hiv):" Ganeshan et al. (1997) Human immunodeficiency virus type 1 genetic evolution in children with different rates of development of disease. J. Virology. 71(1), 663-677.
Wang et al. (2009). gamma-MYN: a new algorithm for estimating Ka and Ks with consideration of variable substitution rates. Biology Direct, 4(1), 1-18.
## load example sequence data data("hiv", package="MSA2dist") ## create indices idx <- list(c(2, 3), c(5,7,9)) #indices2kaks(hiv, idx, model="Li") hiv |> indices2kaks(idx, model="Li") #indices2kaks(hiv, idx, model="NG86") hiv |> indices2kaks(idx, model="NG86") #indices2kaks(hiv, idx, model="NG86", threads=2) hiv |> indices2kaks(idx, model="NG86", threads=2) ## define three unaligned cds sequences cds1 <- Biostrings::DNAString("ATGCAACATTGC") cds2 <- Biostrings::DNAString("ATGCATTGC") cds3 <- Biostrings::DNAString("ATGCAATGC") cds_sequences <- Biostrings::DNAStringSet(list(cds1, cds2, cds3)) names(cds_sequences) <- c("cds1", "cds2", "cds3") ## create indices idx <- list(c(1, 2), c(1,3)) ## set isMSA to FALSE to automatically create pairwise codon alignments #indices2kaks(cds_sequences, idx, model="Li", isMSA=FALSE) cds_sequences |> indices2kaks(idx, model="Li", isMSA=FALSE)## load example sequence data data("hiv", package="MSA2dist") ## create indices idx <- list(c(2, 3), c(5,7,9)) #indices2kaks(hiv, idx, model="Li") hiv |> indices2kaks(idx, model="Li") #indices2kaks(hiv, idx, model="NG86") hiv |> indices2kaks(idx, model="NG86") #indices2kaks(hiv, idx, model="NG86", threads=2) hiv |> indices2kaks(idx, model="NG86", threads=2) ## define three unaligned cds sequences cds1 <- Biostrings::DNAString("ATGCAACATTGC") cds2 <- Biostrings::DNAString("ATGCATTGC") cds3 <- Biostrings::DNAString("ATGCAATGC") cds_sequences <- Biostrings::DNAStringSet(list(cds1, cds2, cds3)) names(cds_sequences) <- c("cds1", "cds2", "cds3") ## create indices idx <- list(c(1, 2), c(1,3)) ## set isMSA to FALSE to automatically create pairwise codon alignments #indices2kaks(cds_sequences, idx, model="Li", isMSA=FALSE) cds_sequences |> indices2kaks(idx, model="Li", isMSA=FALSE)
Example IUPAC sequences created with angsd
from different house mouse (Mus musculus) sub-populations
from Harr et al. (2016) DNAStringSet.
data(iupac)data(iupac)
an object of class DNAStringSet
see XStringSet-class
Harr et al. (2016) Genomic resources for wild populations of the house mouse, Mus musculus and its close relative Mus spretus. Scientific data. 3(1), 1-14.
data("iupac", package="MSA2dist")data("iupac", package="MSA2dist")
This function creates a iupacMatrix object to be used
with the rcpp_distSTRING function. By default, the iupac matrix
is defined as literal distance obtained from Chang et al. 2017.
(see https://link.springer.com/article/10.1007/s00335-017-9704-9)
iupacMatrix()iupacMatrix()
score matrix
Kristian K Ullrich
Chang,P. L.,Kopania,E.,Keeble,S.,Sarver,B. A.,Larson, E.,Orth,A.,... & Dean,M. D. (2017). Whole exome sequencing of wild-derived inbred strains of mice improves power to link phenotype and genotype. Mammalian genome,28(9-10),416-425.
iupacMatrix()iupacMatrix()
This function is a fork from an internal function from
Biostrings
makePostalignedSeqs(x)makePostalignedSeqs(x)
x |
x |
get internal function makePostalignedSeqs
Kristian K Ullrich
pairwiseAlignment,
cds2codonaln
## define two cds sequences cds1 <- Biostrings::DNAString("ATGCAACATTGC") cds2 <- Biostrings::DNAString("ATGCATTGC") makePostalignedSeqs(pwalign::pairwiseAlignment( cds2aa(Biostrings::DNAStringSet(cds1)), cds2aa(Biostrings::DNAStringSet(cds2))))## define two cds sequences cds1 <- Biostrings::DNAString("ATGCAACATTGC") cds2 <- Biostrings::DNAString("ATGCATTGC") makePostalignedSeqs(pwalign::pairwiseAlignment( cds2aa(Biostrings::DNAStringSet(cds1)), cds2aa(Biostrings::DNAStringSet(cds2))))
This function takes an AAStringSet alignment and
its corresponding coding sequences DNAStringSet and converts
the protein alignment into a codon alignment.
pal2nal(pal, nal, remove.gaps = FALSE)pal2nal(pal, nal, remove.gaps = FALSE)
pal |
|
nal |
|
remove.gaps |
specify if gaps in the codon alignment should be removed [default: FALSE] |
codon alignment as DNAStringSet
Kristian K Ullrich
Pagès, H et al. (2014) Biostrings: Efficient manipulation of biological strings. R package version, 2(0).
## define two cds sequences cds <- Biostrings::DNAStringSet(c("ATGCAACATTGC", "ATGCATTGC")) names(cds) <- c("cds1", "cds2") ## get protein alignment aa <- MSA2dist::cds2aa(cds) msa <- makePostalignedSeqs(pwalign::pairwiseAlignment(aa[1], aa[2]))[[1L]] names(msa) <- names(aa) ## get codon alignment nal <- MSA2dist::pal2nal(pal=msa, nal=cds) nal## define two cds sequences cds <- Biostrings::DNAStringSet(c("ATGCAACATTGC", "ATGCATTGC")) names(cds) <- c("cds1", "cds2") ## get protein alignment aa <- MSA2dist::cds2aa(cds) msa <- makePostalignedSeqs(pwalign::pairwiseAlignment(aa[1], aa[2]))[[1L]] names(msa) <- names(aa) ## get codon alignment nal <- MSA2dist::pal2nal(pal=msa, nal=cds) nal
This function shows the population integer slot from a
DNAStringSet or an AAStringSet
metadata information.
popinteger(seq)popinteger(seq)
seq |
|
population integer from metadata
Kristian K Ullrich
## load example sequence data data(iupac, package="MSA2dist") iupac.aa <- iupac |> cds2aa(shorten = TRUE) ## create poplist poplist <- list(FRA = grep("Mmd.FRA", names(iupac)), GER = grep("Mmd.GER", names(iupac)), IRA = grep("Mmd.IRA", names(iupac)), AFG = grep("Mmm.AFG", names(iupac))) iupac.aa <- iupac.aa |> addpop2string(poplist) popinteger(iupac.aa)## load example sequence data data(iupac, package="MSA2dist") iupac.aa <- iupac |> cds2aa(shorten = TRUE) ## create poplist poplist <- list(FRA = grep("Mmd.FRA", names(iupac)), GER = grep("Mmd.GER", names(iupac)), IRA = grep("Mmd.IRA", names(iupac)), AFG = grep("Mmm.AFG", names(iupac))) iupac.aa <- iupac.aa |> addpop2string(poplist) popinteger(iupac.aa)
This function shows the population names slot from a
DNAStringSet or an AAStringSet
metadata information.
popnames(seq)popnames(seq)
seq |
|
population names from metadata
Kristian K Ullrich
## load example sequence data data(iupac, package="MSA2dist") iupac.aa <- iupac |> cds2aa(shorten = TRUE) ## create poplist poplist <- list(FRA = grep("Mmd.FRA", names(iupac)), GER = grep("Mmd.GER", names(iupac)), IRA = grep("Mmd.IRA", names(iupac)), AFG = grep("Mmm.AFG", names(iupac))) iupac.aa <- iupac.aa |> addpop2string(poplist) popnames(iupac.aa)## load example sequence data data(iupac, package="MSA2dist") iupac.aa <- iupac |> cds2aa(shorten = TRUE) ## create poplist poplist <- list(FRA = grep("Mmd.FRA", names(iupac)), GER = grep("Mmd.GER", names(iupac)), IRA = grep("Mmd.IRA", names(iupac)), AFG = grep("Mmm.AFG", names(iupac))) iupac.aa <- iupac.aa |> addpop2string(poplist) popnames(iupac.aa)
calculates pairwise distances using a score matrix
rcpp_distSTRING(dnavector, scoreMatrix, ncores = 1L, symmetric = 1L)rcpp_distSTRING(dnavector, scoreMatrix, ncores = 1L, symmetric = 1L)
dnavector |
StringVector [mandatory] |
scoreMatrix |
NumericMatrix [mandatory] |
ncores |
number of cores [default: 1] |
symmetric |
symmetric score matrix [default: 1] |
list
Kristian K Ullrich
## load example sequence data data("hiv", package="MSA2dist") rcpp_distSTRING(dnavector=as.character(hiv), scoreMatrix=iupacMatrix())## load example sequence data data("hiv", package="MSA2dist") rcpp_distSTRING(dnavector=as.character(hiv), scoreMatrix=iupacMatrix())
calculates KaKs as implememted in
KaKs Calculator 2.0 MSA2dist with Rcpp.
rcpp_KaKs(cdsstr, sgc = "1", method = "YN", verbose = FALSE)rcpp_KaKs(cdsstr, sgc = "1", method = "YN", verbose = FALSE)
cdsstr |
StringVector [mandatory] |
sgc |
standard genetic code to use [default: 1] |
method |
KaKs Calculator 2.0 codon model [default: YN] |
verbose |
specify if verbose output [default: FALSE] |
list
Kristian K Ullrich
Wang et al. (2010) KaKs_Calculator 2.0: a toolkit incorporating gamma-series methods and sliding window strategies.Genomics, proteomics & bioinformatics. 8(1), 77-80.
## load example sequence data data("hiv", package="MSA2dist") rcpp_KaKs(cdsstr=as.character(hiv[1:3]))## load example sequence data data("hiv", package="MSA2dist") rcpp_KaKs(cdsstr=as.character(hiv[1:3]))
returns number of AA sites used
rcpp_pairwiseDeletionAA(aavector, ncores = 1L, symmetric = 1L)rcpp_pairwiseDeletionAA(aavector, ncores = 1L, symmetric = 1L)
aavector |
StringVector [mandatory] |
ncores |
number of cores [default: 1] |
symmetric |
symmetric score matrix [default: 1] |
list
Kristian K Ullrich
## load example sequence data data("hiv", package="MSA2dist") h <- hiv |> cds2aa() |> as.character() rcpp_pairwiseDeletionAA(aavector=h, ncores=1)## load example sequence data data("hiv", package="MSA2dist") h <- hiv |> cds2aa() |> as.character() rcpp_pairwiseDeletionAA(aavector=h, ncores=1)
returns number of DNA sites used
rcpp_pairwiseDeletionDNA(dnavector, ncores = 1L, symmetric = 1L)rcpp_pairwiseDeletionDNA(dnavector, ncores = 1L, symmetric = 1L)
dnavector |
StringVector [mandatory] |
ncores |
number of cores [default: 1] |
symmetric |
symmetric score matrix [default: 1] |
list
Kristian K Ullrich
## load example sequence data data("woodmouse", package="ape") w <- woodmouse |> dnabin2dnastring() |> as.character() rcpp_pairwiseDeletionDNA(dnavector=w, ncores=1)## load example sequence data data("woodmouse", package="ape") w <- woodmouse |> dnabin2dnastring() |> as.character() rcpp_pairwiseDeletionDNA(dnavector=w, ncores=1)
This function shows the region slot from a
DNAStringSet or an AAStringSet
metadata information.
region(seq)region(seq)
seq |
|
region IRanges object from metadata
Kristian K Ullrich
## load example sequence data data(iupac, package="MSA2dist") iupac.aa <- iupac |> cds2aa(shorten = TRUE) ## create region region1 <- IRanges::IRanges(start=c(1,41), end=c(20,50)) ## add region iupac.aa <- iupac.aa |> addregion2string(region=region1) iupac.aa |> region()## load example sequence data data(iupac, package="MSA2dist") iupac.aa <- iupac |> cds2aa(shorten = TRUE) ## create region region1 <- IRanges::IRanges(start=c(1,41), end=c(20,50)) ## add region iupac.aa <- iupac.aa |> addregion2string(region=region1) iupac.aa |> region()
This function shows the region used slot from a
DNAStringSet or an AAStringSet
metadata information.
regionused(seq)regionused(seq)
seq |
|
population names from metadata
Kristian K Ullrich
## load example sequence data data("hiv", package="MSA2dist") ## create mask mask1 <- IRanges::IRanges(start=c(11,41,71), end=c(20,50,80)) ## use mask hiv.region <- hiv |> cds2aa() |> string2region(mask=mask1) #(hiv.region |> slot("metadata"))$regionUsed hiv.region |> regionused()## load example sequence data data("hiv", package="MSA2dist") ## create mask mask1 <- IRanges::IRanges(start=c(11,41,71), end=c(20,50,80)) ## use mask hiv.region <- hiv |> cds2aa() |> string2region(mask=mask1) #(hiv.region |> slot("metadata"))$regionUsed hiv.region |> regionused()
This function subsets a DNAStringSet or an
AAStringSet by a mask and region given one or both
options as IRanges.
string2region(seq, mask = NULL, region = NULL, add = TRUE)string2region(seq, mask = NULL, region = NULL, add = TRUE)
seq |
|
mask |
|
region |
|
add |
indicate if mask and region should be added to |
A list object with the following components:DNAStringSet or AAStringSetregionUsed
Kristian K Ullrich
## load example sequence data data("hiv", package="MSA2dist") ## create mask mask1 <- IRanges::IRanges(start=c(11,41,71), end=c(20,50,80)) ## use mask hiv.region <- hiv |> cds2aa() |> string2region(mask=mask1) #(hiv.region |> slot("metadata"))$regionUsed hiv.region |> regionused() ## use region region1 <- IRanges::IRanges(start=c(1,75), end=c(45,85)) hiv.region <- hiv |> cds2aa() |> string2region(region=region1) #(hiv.region |> slot("metadata"))$regionUsed hiv.region |> regionused() ## use mask and region hiv.region <- hiv |> cds2aa() |> string2region(mask=mask1, region=region1) #(hiv.region |> slot("metadata"))$regionUsed hiv.region |> regionused()## load example sequence data data("hiv", package="MSA2dist") ## create mask mask1 <- IRanges::IRanges(start=c(11,41,71), end=c(20,50,80)) ## use mask hiv.region <- hiv |> cds2aa() |> string2region(mask=mask1) #(hiv.region |> slot("metadata"))$regionUsed hiv.region |> regionused() ## use region region1 <- IRanges::IRanges(start=c(1,75), end=c(45,85)) hiv.region <- hiv |> cds2aa() |> string2region(region=region1) #(hiv.region |> slot("metadata"))$regionUsed hiv.region |> regionused() ## use mask and region hiv.region <- hiv |> cds2aa() |> string2region(mask=mask1, region=region1) #(hiv.region |> slot("metadata"))$regionUsed hiv.region |> regionused()
This function gets a subsequence from a DNAString,
RNAString, AAString, BString, DNAStringSet,
RNAStringSet, AAStringSet, BStringSet
object from the Biostrings package.
subString(x, s, e)subString(x, s, e)
x |
|
s |
start vector [mandatory] |
e |
end vector [mandatory] |
subsequence of an Biostrings object
Kristian K Ullrich
## define two cds sequences cds1 <- Biostrings::DNAString("ATGCAACATTGC") cds2 <- Biostrings::DNAString("ATG---CATTGC") cds1.cds2.aln <- c(Biostrings::DNAStringSet(cds1), Biostrings::DNAStringSet(cds2)) subString(cds1.cds2.aln, c(1,7), c(3,12))## define two cds sequences cds1 <- Biostrings::DNAString("ATGCAACATTGC") cds2 <- Biostrings::DNAString("ATG---CATTGC") cds1.cds2.aln <- c(Biostrings::DNAStringSet(cds1), Biostrings::DNAStringSet(cds2)) subString(cds1.cds2.aln, c(1,7), c(3,12))
This function returns upper tri index for usage with
pivot_long reduction.
uptriidx(n, diag = FALSE)uptriidx(n, diag = FALSE)
n |
dimension of initial matrix [mandatory] |
diag |
indicate if diag should be retained [default: FALSE] |
list of positions
Kristian K Ullrich
uptriidx(10)uptriidx(10)