Title: | Molecular Informatics Toolkit for Compound-Protein Interaction in Drug Discovery |
---|---|
Description: | A molecular informatics toolkit with an integration of bioinformatics and chemoinformatics tools for drug discovery. |
Authors: | Nan Xiao [aut, cre] , Dong-Sheng Cao [aut], Qing-Song Xu [aut] |
Maintainer: | Nan Xiao <[email protected]> |
License: | Artistic-2.0 | file LICENSE |
Version: | 1.43.0 |
Built: | 2024-10-31 04:20:52 UTC |
Source: | https://github.com/bioc/Rcpi |
2D Autocorrelations Descriptors for 20 Amino Acids calculated by Dragon
This dataset includes the 2D autocorrelations descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.
AA2DACOR data
data(AA2DACOR)
3D-MoRSE Descriptors for 20 Amino Acids calculated by Dragon
This dataset includes the 3D-MoRSE descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.
AA3DMoRSE data
data(AA3DMoRSE)
Atom-Centred Fragments Descriptors for 20 Amino Acids calculated by Dragon
This dataset includes the atom-centred fragments descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.
AAACF data
data(AAACF)
BLOSUM100 Matrix for 20 Amino Acids
BLOSUM100 Matrix for the 20 amino acids. The matrix was extracted from the Biostrings package of Bioconductor.
AABLOSUM100 data
data(AABLOSUM100)
BLOSUM45 Matrix for 20 Amino Acids
BLOSUM45 Matrix for the 20 amino acids. The matrix was extracted from the Biostrings package of Bioconductor.
AABLOSUM45 data
data(AABLOSUM45)
BLOSUM50 Matrix for 20 Amino Acids
BLOSUM50 Matrix for the 20 amino acids. The matrix was extracted from the Biostrings package of Bioconductor.
AABLOSUM50 data
data(AABLOSUM50)
BLOSUM62 Matrix for 20 Amino Acids
BLOSUM62 Matrix for the 20 amino acids. The matrix was extracted from the Biostrings package of Bioconductor.
AABLOSUM62 data
data(AABLOSUM62)
BLOSUM80 Matrix for 20 Amino Acids
BLOSUM80 Matrix for the 20 amino acids. The matrix was extracted from the Biostrings package of Bioconductor.
AABLOSUM80 data
data(AABLOSUM80)
Burden Eigenvalues Descriptors for 20 Amino Acids calculated by Dragon
This dataset includes the Burden eigenvalues descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.
AABurden data
data(AABurden)
Connectivity Indices Descriptors for 20 Amino Acids calculated by Dragon
This dataset includes the connectivity indices descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.
AAConn data
data(AAConn)
Constitutional Descriptors for 20 Amino Acids calculated by Dragon
This dataset includes the constitutional descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.
AAConst data
data(AAConst)
CPSA Descriptors for 20 Amino Acids calculated by Discovery Studio
This dataset includes the CPSA descriptors of the 20 amino acids calculated by Discovery Studio (version 2.5) used for scales extraction in this package. All amino acid molecules were optimized with MOE 2011.10 (semiempirical AM1) before calculating these CPSA descriptors. The SDF file containing the optimized amino acid molecules is included in this package. See OptAA3d for more information.
AACPSA data
data(AACPSA)
All 2D Descriptors for 20 Amino Acids calculated by Dragon
This dataset includes all the 2D descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.
AADescAll data
data(AADescAll)
Edge Adjacency Indices Descriptors for 20 Amino Acids calculated by Dragon
This dataset includes the edge adjacency indices descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.
AAEdgeAdj data
data(AAEdgeAdj)
Eigenvalue-Based Indices Descriptors for 20 Amino Acids calculated by Dragon
This dataset includes the eigenvalue-based indices descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.
AAEigIdx data
data(AAEigIdx)
Functional Group Counts Descriptors for 20 Amino Acids calculated by Dragon
This dataset includes the functional group counts descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.
AAFGC data
data(AAFGC)
Geometrical Descriptors for 20 Amino Acids calculated by Dragon
This dataset includes the geometrical descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.
AAGeom data
data(AAGeom)
GETAWAY Descriptors for 20 Amino Acids calculated by Dragon
This dataset includes the GETAWAY descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.
AAGETAWAY data
data(AAGETAWAY)
AAindex Data of 544 Physicochemical and Biological Properties for 20 Amino Acids
The data was extracted from the AAindex1 database ver 9.1 (ftp://ftp.genome.jp/pub/db/community/aaindex/aaindex1) as of November 2012 (Data Last Modified 2006-08-14).
With this data, users could investigate each property's accession number and other details. Visit https://www.genome.jp/dbget/aaindex.html for more information.
AAindex data
data(AAindex)
Information Indices Descriptors for 20 Amino Acids calculated by Dragon
This dataset includes the information indices descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.
AAInfo data
data(AAInfo)
Meta Information for the 20 Amino Acids
This dataset includes the meta information of the 20 amino acids used for the 2D and 3D descriptor calculation in this package. Each column represents:
AAName: Amino acid name
Short: One-letter representation
Abbreviation: Three-letter representation
mol: SMILES representation
PUBCHEM_COMPOUND_CID: PubChem CID for the amino acid
PUBCHEM_LINK: PubChem link for the amino acid
AAMetaInfo data
data(AAMetaInfo)
2D Descriptors for 20 Amino Acids calculated by MOE 2011.10
This dataset includes the 2D descriptors of the 20 amino acids calculated by MOE 2011.10 used for scales extraction in this package.
AAMOE2D data
data(AAMOE2D)
3D Descriptors for 20 Amino Acids calculated by MOE 2011.10
This dataset includes the 3D descriptors of the 20 amino acids calculated by MOE 2011.10 used for scales extraction in this package. All amino acid molecules were optimized with MOE (semiempirical AM1) before calculating these 3D descriptors. The SDF file containing the optimized amino acid molecules is included in this package. See OptAA3d for more information.
AAMOE3D data
data(AAMOE3D)
Molecular Properties Descriptors for 20 Amino Acids calculated by Dragon
This dataset includes the molecular properties descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.
AAMolProp data
data(AAMolProp)
PAM120 Matrix for 20 Amino Acids
PAM120 Matrix for the 20 amino acids. The matrix was extracted from the Biostrings package of Bioconductor.
AAPAM120 data
data(AAPAM120)
PAM250 Matrix for 20 Amino Acids
PAM250 Matrix for the 20 amino acids. The matrix was extracted from the Biostrings package of Bioconductor.
AAPAM250 data
data(AAPAM250)
PAM30 Matrix for 20 Amino Acids
PAM30 Matrix for the 20 amino acids. The matrix was extracted from the Biostrings package of Bioconductor.
AAPAM30 data
data(AAPAM30)
PAM40 Matrix for 20 Amino Acids
PAM40 Matrix for the 20 amino acids. The matrix was extracted from the Biostrings package of Bioconductor.
AAPAM40 data
data(AAPAM40)
PAM70 Matrix for 20 Amino Acids
PAM70 Matrix for the 20 amino acids. The matrix was extracted from the Biostrings package of Bioconductor.
AAPAM70 data
data(AAPAM70)
Randic Molecular Profiles Descriptors for 20 Amino Acids calculated by Dragon
This dataset includes the Randic molecular profiles descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.
AARandic data
data(AARandic)
RDF Descriptors for 20 Amino Acids calculated by Dragon
This dataset includes the RDF descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.
AARDF data
data(AARDF)
Topological Descriptors for 20 Amino Acids calculated by Dragon
This dataset includes the topological descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.
AATopo data
data(AATopo)
Topological Charge Indices Descriptors for 20 Amino Acids calculated by Dragon
This dataset includes the topological charge indices descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.
AATopoChg data
data(AATopoChg)
Walk and Path Counts Descriptors for 20 Amino Acids calculated by Dragon
This dataset includes the walk and path counts descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.
AAWalk data
data(AAWalk)
WHIM Descriptors for 20 Amino Acids calculated by Dragon
This dataset includes the WHIM descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.
AAWHIM data
data(AAWHIM)
Calculates auto covariance and auto cross covariance for generating scale-based descriptors of the same length.
acc(mat, lag)
mat | A p * n matrix, where p is the number of scales and n is the number of amino acids. |
lag | The lag parameter. Must be less than the number of amino acids. |
A named vector of length lag * p^2. The element names are constructed from the scales index (crossed scales index) and the lag index.
For more details about auto cross covariance, see the references.
Wold, S., Jonsson, J., Sjöström, M., Sandberg, M., & Rännar, S. (1993). DNA and peptide sequences and chemical processes multivariately modelled by principal component analysis and partial least-squares projections to latent structures. Analytica chimica acta, 277(2), 239–253.
Sjöström, M., Rännar, S., & Wieslander, Å. (1995). Polypeptide sequence property relationships in Escherichia coli based on auto cross covariances. Chemometrics and intelligent laboratory systems, 29(2), 295–305.
See extractPCMScales for generalized scales-based descriptors. For more details, see extractPCMDescScales and extractPCMPropScales.
p = 8    # p is the scales number
n = 200  # n is the amino acid number
lag = 7  # lag parameter
mat = matrix(rnorm(p * n), nrow = p, ncol = n)
acc(mat, lag)
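The auto covariance and auto cross covariance terms described above can be sketched in base R. This is a minimal illustration, not the package's implementation: the function name acc_sketch, the absence of any centering step, and the ordering and naming of the output elements are all assumptions. For each lag l it averages mat[j, i] * mat[k, i + l] over the positions i, which gives p^2 values per lag and lag * p^2 values overall.

```r
# Hypothetical helper, not part of Rcpi.
# mat: p x n matrix (p scales, n amino acids); lag: maximum lag (< n).
acc_sketch <- function(mat, lag) {
  p <- nrow(mat)
  n <- ncol(mat)
  stopifnot(lag >= 1, lag < n)
  out <- numeric(0)
  for (l in seq_len(lag)) {
    head_idx <- 1:(n - l)    # positions i
    tail_idx <- head_idx + l # positions i + l
    # Entry (j, k): mean over i of mat[j, i] * mat[k, i + l].
    # j == k gives auto covariance, j != k gives cross covariance.
    cc <- mat[, head_idx, drop = FALSE] %*%
      t(mat[, tail_idx, drop = FALSE]) / (n - l)
    vals <- as.vector(cc) # column-major: j varies fastest
    names(vals) <- paste0(
      "scl", rep(1:p, times = p), ".", rep(1:p, each = p), ".lag", l
    )
    out <- c(out, vals)
  }
  out
}

set.seed(42)
toy <- matrix(rnorm(3 * 10), nrow = 3, ncol = 10)
res <- acc_sketch(toy, lag = 2)
print(length(res)) # lag * p^2 elements
```

Because every sequence yields the same number of elements for a fixed p and lag, descriptors built this way have equal length regardless of sequence length.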
Calculate Drug Molecule Similarity Derived by Molecular Fingerprints
calcDrugFPSim( fp1, fp2, fptype = c("compact", "complete"), metric = c("tanimoto", "euclidean", "cosine", "dice", "hamming") )
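fp1 | The first molecule's fingerprints, which could be extracted by a fingerprint function such as extractDrugEstate (see the example). |
fp2 | The second molecule's fingerprints. |
fptype | The fingerprint type, must be one of "compact" or "complete". |
metric | The similarity metric, one of "tanimoto", "euclidean", "cosine", "dice", or "hamming". |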
This function calculates the similarity between the fingerprints of two drug molecules.
Define a as the features of object A, b as the features of object B, and c as the number of features common to A and B. The supported metrics are:
Tanimoto: aka Jaccard
Euclidean
Dice: aka Sorensen, Czekanowski, Hodgkin-Richards
Cosine: aka Ochiai, Carbo
Hamming: aka Manhattan, taxi-cab, city-block distance
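The metrics listed above can be illustrated on two small fingerprints. This is a minimal base-R sketch that assumes the standard textbook definitions, with a and b taken as the number of on-bits in each fingerprint and c as the number of shared on-bits. The function name fp_metrics_sketch is hypothetical, and the exact formulas used by calcDrugFPSim may differ. The Euclidean and Hamming entries are shown as distances.

```r
# Hypothetical helper, not part of Rcpi.
# fpA, fpB: integer vectors holding the positions of the on-bits,
# which is the idea behind a "compact" fingerprint.
fp_metrics_sketch <- function(fpA, fpB) {
  a <- length(unique(fpA))                # on-bits in A
  b <- length(unique(fpB))                # on-bits in B
  c_common <- length(intersect(fpA, fpB)) # on-bits shared by A and B
  c(
    tanimoto  = c_common / (a + b - c_common),
    dice      = 2 * c_common / (a + b),
    cosine    = c_common / sqrt(a * b),
    hamming   = a + b - 2 * c_common,      # number of differing bits
    euclidean = sqrt(a + b - 2 * c_common) # for 0/1 vectors
  )
}

res <- fp_metrics_sketch(c(1, 2, 3, 4), c(3, 4, 5))
print(res)
```

A "complete" fingerprint (a full 0/1 vector) can be reduced to this form with which(x == 1) before calling the helper.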
The numeric similarity value.
Gasteiger, Johann, and Thomas Engel, eds. Chemoinformatics. Wiley.com, 2006.
mols = readMolFromSDF(system.file('compseq/tyrphostin.sdf', package = 'Rcpi'))
fp1 = extractDrugEstate(mols[[1]])
fp2 = extractDrugEstate(mols[[2]])
calcDrugFPSim(fp1, fp2, fptype = 'compact', metric = 'tanimoto')
calcDrugFPSim(fp1, fp2, fptype = 'compact', metric = 'euclidean')
calcDrugFPSim(fp1, fp2, fptype = 'compact', metric = 'cosine')
calcDrugFPSim(fp1, fp2, fptype = 'compact', metric = 'dice')
calcDrugFPSim(fp1, fp2, fptype = 'compact', metric = 'hamming')
fp3 = extractDrugEstateComplete(mols[[1]])
fp4 = extractDrugEstateComplete(mols[[2]])
calcDrugFPSim(fp3, fp4, fptype = 'complete', metric = 'tanimoto')
calcDrugFPSim(fp3, fp4, fptype = 'complete', metric = 'euclidean')
calcDrugFPSim(fp3, fp4, fptype = 'complete', metric = 'cosine')
calcDrugFPSim(fp3, fp4, fptype = 'complete', metric = 'dice')
calcDrugFPSim(fp3, fp4, fptype = 'complete', metric = 'hamming')
Calculate Drug Molecule Similarity Derived by Maximum Common Substructure Search
calcDrugMCSSim( mol1, mol2, type = c("smile", "sdf"), plot = FALSE, al = 0, au = 0, bl = 0, bu = 0, matching.mode = "static", ... )
mol1 | The first molecule. R character string object containing the molecule. See examples. |
mol2 | The second molecule. R character string object containing the molecule. See examples. |
type | The input molecule format, 'smile' or 'sdf'. |
plot | Logical. Should we plot the two molecules and their maximum common substructure? |
al | Lower bound for the number of atom mismatches. Default is 0. |
au | Upper bound for the number of atom mismatches. Default is 0. |
bl | Lower bound for the number of bond mismatches. Default is 0. |
bu | Upper bound for the number of bond mismatches. Default is 0. |
matching.mode | Three modes for bond matching are supported: 'static', 'aromatic', and 'ring'. |
... | Other graphical parameters. |
This function calculates drug molecule similarity derived by maximum common substructure search. The maximum common substructure search algorithm is provided by the fmcsR package.
A list containing the detailed MCS information and similarity values. The numeric similarity values include the Tanimoto coefficient and the overlap coefficient.
Wang, Y., Backman, T. W., Horan, K., & Girke, T. (2013). fmcsR: mismatch tolerant maximum common substructure searching in R. Bioinformatics, 29(21), 2792–2794.
mol1 = 'CC(C)CCCCCC(=O)NCC1=CC(=C(C=C1)O)OC'
mol2 = 'O=C(NCc1cc(OC)c(O)cc1)CCCC/C=C/C(C)C'
mol3 = readChar(system.file('compseq/DB00859.sdf', package = 'Rcpi'), nchars = 1e+6)
mol4 = readChar(system.file('compseq/DB00860.sdf', package = 'Rcpi'), nchars = 1e+6)
## Not run:
sim1 = calcDrugMCSSim(mol1, mol2, type = 'smile')
sim2 = calcDrugMCSSim(mol3, mol4, type = 'sdf', plot = TRUE)
print(sim1[[2]])  # Tanimoto Coefficient
print(sim2[[3]])  # Overlap Coefficient
## End(Not run)
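The two coefficients printed in the example can be related to the size of the maximum common substructure. The following is a minimal sketch assuming the usual definitions in terms of atom counts. The helper name mcs_coefficients_sketch and the input sizes are hypothetical, and the values returned by calcDrugMCSSim come from fmcsR rather than from this code.

```r
# Hypothetical helper, not part of Rcpi or fmcsR.
# size1, size2: atom counts of the two molecules
# mcs_size:     atom count of their maximum common substructure
mcs_coefficients_sketch <- function(size1, size2, mcs_size) {
  stopifnot(mcs_size <= min(size1, size2))
  c(
    tanimoto = mcs_size / (size1 + size2 - mcs_size),
    overlap  = mcs_size / min(size1, size2)
  )
}

coefs <- mcs_coefficients_sketch(size1 = 10, size2 = 8, mcs_size = 6)
print(coefs)
```

The overlap coefficient reaches 1 whenever the smaller molecule is fully contained in the larger one, while the Tanimoto coefficient reaches 1 only when both molecules match entirely.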
Protein Sequence Similarity Calculation based on Gene Ontology (GO) Similarity
calcParProtGOSim( golist, type = c("go", "gene"), ont = c("MF", "BP", "CC"), organism = "human", measure = "Resnik", combine = "BMA" )
golist | A character vector, each component contains a character vector of GO terms or one Entrez Gene ID. |
type | Input type of golist, 'go' for GO terms or 'gene' for Entrez Gene IDs. |
ont | Default is 'MF'; one of 'MF', 'BP', or 'CC'. |
organism | Default is 'human'. |
measure | Default is 'Resnik'. |
combine | Default is 'BMA'. |
This function calculates protein sequence similarity based on Gene Ontology (GO) similarity.
An n x n similarity matrix.
See calcTwoProtGOSim for calculating the GO semantic similarity between two groups of GO terms or two Entrez gene IDs. See calcParProtSeqSim for parallelized protein similarity calculation based on sequence alignment.
# By GO Terms
go1 = c('GO:0005215', 'GO:0005488', 'GO:0005515', 'GO:0005625', 'GO:0005802', 'GO:0005905')  # AP4B1
go2 = c('GO:0005515', 'GO:0005634', 'GO:0005681', 'GO:0008380', 'GO:0031202')  # BCAS2
go3 = c('GO:0003735', 'GO:0005622', 'GO:0005840', 'GO:0006412')  # PDE4DIP
glist = list(go1, go2, go3)
calcParProtGOSim(glist, type = 'go', ont = 'CC', measure = 'Wang')
# By Entrez gene id
genelist = list(c('150', '151', '152', '1814', '1815', '1816'))
calcParProtGOSim(genelist, type = 'gene', ont = 'BP', measure = 'Wang')
Parallelized Protein Sequence Similarity Calculation based on Sequence Alignment
calcParProtSeqSim(protlist, cores = 2, type = "local", submat = "BLOSUM62")
protlist | A length n list containing n protein sequences, each component of the list is a character string storing one protein sequence. |
cores | Integer. The number of CPU cores to use for parallel execution, default is 2. |
type | Type of alignment, default is 'local'. |
submat | Substitution matrix, default is 'BLOSUM62'. |
This function implements a parallelized version of protein sequence similarity calculation based on sequence alignment.
An n x n similarity matrix.
See calcTwoProtSeqSim for sequence alignment between two protein sequences. See calcParProtGOSim for protein similarity calculation based on Gene Ontology (GO) semantic similarity.
s1 = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
s2 = readFASTA(system.file('protseq/P08218.fasta', package = 'Rcpi'))[[1]]
s3 = readFASTA(system.file('protseq/P10323.fasta', package = 'Rcpi'))[[1]]
s4 = readFASTA(system.file('protseq/P20160.fasta', package = 'Rcpi'))[[1]]
s5 = readFASTA(system.file('protseq/Q9NZP8.fasta', package = 'Rcpi'))[[1]]
plist = list(s1, s2, s3, s4, s5)
psimmat = calcParProtSeqSim(plist, cores = 2, type = 'local', submat = 'BLOSUM62')
print(psimmat)
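How pairwise alignment scores could be turned into an n x n similarity matrix can be sketched in base R. This is an illustration under stated assumptions, not the package's code: both helper names are hypothetical, the scoring uses a flat match/mismatch scheme with a linear gap penalty instead of BLOSUM62, the loop is serial rather than parallel, and normalizing each score by the geometric mean of the two self-alignment scores is an assumption about how scores are scaled.

```r
# Hypothetical helpers, not part of Rcpi.
# Score-only Smith-Waterman local alignment with a flat match/mismatch
# scheme and a linear gap penalty.
sw_score_sketch <- function(s1, s2, match = 2, mismatch = -1, gap = -2) {
  a <- strsplit(s1, "")[[1]]
  b <- strsplit(s2, "")[[1]]
  H <- matrix(0, nrow = length(a) + 1, ncol = length(b) + 1)
  for (i in seq_along(a)) {
    for (j in seq_along(b)) {
      s <- if (a[i] == b[j]) match else mismatch
      H[i + 1, j + 1] <- max(
        0,                 # local alignment may restart anywhere
        H[i, j] + s,       # align a[i] with b[j]
        H[i, j + 1] + gap, # gap in the second sequence
        H[i + 1, j] + gap  # gap in the first sequence
      )
    }
  }
  max(H)
}

# Assumed normalization: score(i, j) / sqrt(score(i, i) * score(j, j)).
seq_sim_matrix_sketch <- function(seqs, score_fun = sw_score_sketch) {
  n <- length(seqs)
  self <- vapply(seqs, function(s) score_fun(s, s), numeric(1))
  sim <- diag(1, nrow = n)
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      sim[i, j] <- sim[j, i] <-
        score_fun(seqs[[i]], seqs[[j]]) / sqrt(self[i] * self[j])
    }
  }
  sim
}

toy <- list("ACDEFG", "ACDEFG", "WWWW")
sim <- seq_sim_matrix_sketch(toy)
print(sim)
```

Only the upper triangle is computed and then mirrored, which is also the natural unit of work to distribute across cores.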
Protein Similarity Calculation based on Gene Ontology (GO) Similarity
calcTwoProtGOSim( id1, id2, type = c("go", "gene"), ont = c("MF", "BP", "CC"), organism = "human", measure = "Resnik", combine = "BMA" )
id1 | A character vector. length > 1: each element is a GO term; length = 1: the Entrez Gene ID. |
id2 | A character vector. length > 1: each element is a GO term; length = 1: the Entrez Gene ID. |
type | Input type of id1 and id2, 'go' for GO terms or 'gene' for Entrez Gene IDs. |
ont | Default is 'MF'; one of 'MF', 'BP', or 'CC'. |
organism | Default is 'human'. |
measure | Default is 'Resnik'. |
combine | Default is 'BMA'. |
This function calculates the Gene Ontology (GO) similarity between two groups of GO terms or two Entrez gene IDs.
An n x n matrix.
See calcParProtGOSim for protein similarity calculation based on Gene Ontology (GO) semantic similarity. See calcParProtSeqSim for parallelized protein similarity calculation based on sequence alignment.
# By GO terms
go1 = c("GO:0004022", "GO:0004024", "GO:0004023")
go2 = c("GO:0009055", "GO:0020037")
calcTwoProtGOSim(go1, go2, type = 'go', ont = 'MF', measure = 'Wang')
# By Entrez gene id
gene1 = '241'
gene2 = '251'
calcTwoProtGOSim(gene1, gene2, type = 'gene', ont = 'CC', measure = 'Lin')
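The combine = "BMA" default in the usage above refers to a best-match average over term-by-term similarities. The following is a minimal sketch assuming the common definition, the average of all row-wise and column-wise maxima of the similarity matrix between the two groups of GO terms. The helper name bma_sketch and the similarity values are made up for illustration; the actual term similarities are computed by the package from the GO graph.

```r
# Hypothetical helper, not part of Rcpi.
# sim: matrix of term-by-term similarities, rows = terms of the first
# group, columns = terms of the second group.
bma_sketch <- function(sim) {
  sim <- as.matrix(sim)
  row_best <- apply(sim, 1, max) # best match for each term in group 1
  col_best <- apply(sim, 2, max) # best match for each term in group 2
  (sum(row_best) + sum(col_best)) / (nrow(sim) + ncol(sim))
}

# Made-up similarities between two groups of two terms each
term_sim <- matrix(
  c(1.0, 0.2,
    0.5, 0.8),
  nrow = 2,
  dimnames = list(c("t1", "t2"), c("u1", "u2"))
)
print(bma_sketch(term_sim))
```

Averaging best matches in both directions keeps the score symmetric in the two groups even when they contain different numbers of terms.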
Protein Sequence Alignment for Two Protein Sequences
calcTwoProtSeqSim(seq1, seq2, type = "local", submat = "BLOSUM62")
seq1 | A character string, containing one protein sequence. |
seq2 | A character string, containing another protein sequence. |
type | Type of alignment, default is 'local'. |
submat | Substitution matrix, default is 'BLOSUM62'. |
This function implements the sequence alignment between two protein sequences.
A Biostrings object containing the scores and other alignment information.
See calcParProtSeqSim for parallelized pairwise protein similarity calculation based on sequence alignment. See calcTwoProtGOSim for calculating the GO semantic similarity between two groups of GO terms or two Entrez gene IDs.
s1 = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
s2 = readFASTA(system.file('protseq/P10323.fasta', package = 'Rcpi'))[[1]]
seqalign = calcTwoProtSeqSim(s1, s2)
seqalign
slot(seqalign, "score")
Check if the protein sequence's amino acid types are the 20 default types
checkProt(x)
x | A character vector, as the input protein sequence. |
This function checks if the protein sequence's amino acid types are the 20 default types.
Logical. TRUE if all of the amino acid types of the sequence are within the 20 default types.
x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
checkProt(x)  # TRUE
checkProt(paste(x, 'Z', sep = ''))  # FALSE
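The check described above amounts to a set-membership test on the characters of the sequence. Here is a minimal base-R sketch, assuming the 20 default types are the standard one-letter amino acid codes in upper case. The helper name check_prot_sketch is hypothetical, and the package function may treat case or whitespace differently.

```r
# Hypothetical helper, not part of Rcpi.
check_prot_sketch <- function(x) {
  # One-letter codes of the 20 standard amino acids
  aa20 <- strsplit("ARNDCEQGHILKMFPSTWYV", "")[[1]]
  residues <- strsplit(x, "")[[1]]
  all(residues %in% aa20)
}

print(check_prot_sketch("MDAMKRGLCCVLLLCGAVFVSPS"))
print(check_prot_sketch("MDAMKRGLZ"))
```

Running such a check before descriptor extraction makes it easy to filter out sequences containing ambiguity codes such as B, Z, or X.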
Chemical File Formats Conversion
convMolFormat(infile, outfile, from, to)
infile | A character string. Indicating the input file location. |
outfile | A character string. Indicating the output file location. |
from | The format of infile. |
to | The desired format of outfile. |
This function converts between various chemical file formats via OpenBabel. The complete list of supported file formats can be found at https://openbabel.org/docs/FileFormats/Overview.html.
NULL
The supported formats include:
abinit – ABINIT Output Format [Read-only]
acr – ACR format [Read-only]
adf – ADF cartesian input format [Write-only]
adfout – ADF output format [Read-only]
alc – Alchemy format
arc – Accelrys/MSI Biosym/Insight II CAR format [Read-only]
axsf – XCrySDen Structure Format [Read-only]
bgf – MSI BGF format
box – Dock 3.5 Box format
bs – Ball and Stick format
c3d1 – Chem3D Cartesian 1 format
c3d2 – Chem3D Cartesian 2 format
cac – CAChe MolStruct format [Write-only]
caccrt – Cacao Cartesian format
cache – CAChe MolStruct format [Write-only]
cacint – Cacao Internal format [Write-only]
can – Canonical SMILES format
car – Accelrys/MSI Biosym/Insight II CAR format [Read-only]
castep – CASTEP format [Read-only]
ccc – CCC format [Read-only]
cdx – ChemDraw binary format [Read-only]
cdxml – ChemDraw CDXML format
cht – Chemtool format [Write-only]
cif – Crystallographic Information File
ck – ChemKin format
cml – Chemical Markup Language
cmlr – CML Reaction format
com – Gaussian 98/03 Input [Write-only]
CONFIG – DL-POLY CONFIG
CONTCAR – VASP format [Read-only]
copy – Copy raw text [Write-only]
crk2d – Chemical Resource Kit diagram(2D)
crk3d – Chemical Resource Kit 3D format
csr – Accelrys/MSI Quanta CSR format [Write-only]
cssr – CSD CSSR format [Write-only]
ct – ChemDraw Connection Table format
cub – Gaussian cube format
cube – Gaussian cube format
dat – Generic Output file format [Read-only]
dmol – DMol3 coordinates format
dx – OpenDX cube format for APBS
ent – Protein Data Bank format
fa – FASTA format
fasta – FASTA format
fch – Gaussian formatted checkpoint file format [Read-only]
fchk – Gaussian formatted checkpoint file format [Read-only]
fck – Gaussian formatted checkpoint file format [Read-only]
feat – Feature format
fh – Fenske-Hall Z-Matrix format [Write-only]
fhiaims – FHIaims XYZ format
fix – SMILES FIX format [Write-only]
fpt – Fingerprint format [Write-only]
fract – Free Form Fractional format
fs – Fastsearch format
fsa – FASTA format
g03 – Gaussian Output [Read-only]
g09 – Gaussian Output [Read-only]
g92 – Gaussian Output [Read-only]
g94 – Gaussian Output [Read-only]
g98 – Gaussian Output [Read-only]
gal – Gaussian Output [Read-only]
gam – GAMESS Output [Read-only]
gamess – GAMESS Output [Read-only]
gamin – GAMESS Input
gamout – GAMESS Output [Read-only]
gau – Gaussian 98/03 Input [Write-only]
gjc – Gaussian 98/03 Input [Write-only]
gjf – Gaussian 98/03 Input [Write-only]
got – GULP format [Read-only]
gpr – Ghemical format
gr96 – GROMOS96 format [Write-only]
gro – GRO format
gukin – GAMESS-UK Input
gukout – GAMESS-UK Output
gzmat – Gaussian Z-Matrix Input
hin – HyperChem HIN format
HISTORY – DL-POLY HISTORY [Read-only]
inchi – InChI format
inchikey – InChIKey [Write-only]
inp – GAMESS Input
ins – ShelX format [Read-only]
jin – Jaguar input format [Write-only]
jout – Jaguar output format [Read-only]
k – Compare molecules using InChI [Write-only]
log – Generic Output file format [Read-only]
mcdl – MCDL format
mcif – Macromolecular Crystallographic Info
mdl – MDL MOL format
ml2 – Sybyl Mol2 format
mmcif – Macromolecular Crystallographic Info
mmd – MacroModel format
mmod – MacroModel format
mna – Multilevel Neighborhoods of Atoms (MNA) [Write-only]
mol – MDL MOL format
mol2 – Sybyl Mol2 format
mold – Molden format
molden – Molden format
molf – Molden format
molreport – Open Babel molecule report [Write-only]
moo – MOPAC Output format [Read-only]
mop – MOPAC Cartesian format
mopcrt – MOPAC Cartesian format
mopin – MOPAC Internal
mopout – MOPAC Output format [Read-only]
mp – Molpro input format [Write-only]
mpc – MOPAC Cartesian format
mpd – MolPrint2D format [Write-only]
mpo – Molpro output format [Read-only]
mpqc – MPQC output format [Read-only]
mpqcin – MPQC simplified input format [Write-only]
mrv – Chemical Markup Language
msi – Accelrys/MSI Cerius II MSI format [Read-only]
msms – M.F. Sanner's MSMS input format [Write-only]
nul – Outputs nothing [Write-only]
nw – NWChem input format [Write-only]
nwo – NWChem output format [Read-only]
out – Generic Output file format [Read-only]
outmol – DMol3 coordinates format
output – Generic Output file format [Read-only]
pc – PubChem format [Read-only]
pcm – PCModel Format
pdb – Protein Data Bank format
pdbqt – AutoDock PDBQT format
png – PNG 2D depiction
POSCAR – VASP format [Read-only]
pov – POV-Ray input format [Write-only]
pqr – PQR format
pqs – Parallel Quantum Solutions format
prep – Amber Prep format [Read-only]
pwscf – PWscf format [Read-only]
qcin – Q-Chem input format [Write-only]
qcout – Q-Chem output format [Read-only]
report – Open Babel report format [Write-only]
res – ShelX format [Read-only]
rsmi – Reaction SMILES format
rxn – MDL RXN format
sd – MDL MOL format
sdf – MDL MOL format
smi – SMILES format
smiles – SMILES format
svg – SVG 2D depiction [Write-only]
sy2 – Sybyl Mol2 format
t41 – ADF TAPE41 format [Read-only]
tdd – Thermo format
text – Read and write raw text
therm – Thermo format
tmol – TurboMole Coordinate format
txt – Title format
txyz – Tinker XYZ format
unixyz – UniChem XYZ format
vmol – ViewMol format
xed – XED format [Write-only]
xml – General XML format [Read-only]
xsf – XCrySDen Structure Format [Read-only]
xyz – XYZ cartesian coordinates format
yob – YASARA.org YOB format
zin – ZINDO input format [Write-only]
sdf = system.file('sysdata/OptAA3d.sdf', package = 'Rcpi')
# SDF to SMILES
## Not run:
convMolFormat(infile = sdf, outfile = 'aa.smi', from = 'sdf', to = 'smiles')
## End(Not run)
# SMILES to MOPAC Cartesian format
## Not run:
convMolFormat(infile = 'aa.smi', outfile = 'aa.mop', from = 'smiles', to = 'mop')
## End(Not run)
Calculate All Molecular Descriptors in Rcpi at Once
extractDrugAIO(molecules, silent = TRUE, warn = TRUE)
molecules | Parsed molecule object. |
silent | Logical. Whether the calculating process should be shown or not, default is TRUE. |
warn | Logical. Whether the warning that some descriptors need 3D coordinates should be shown after the calculation, default is TRUE. |
This function calculates all the molecular descriptors in the Rcpi package at once.
A data frame, each row represents one of the molecules, each column represents one descriptor. Currently, this function returns a total of 293 descriptors composed of 48 descriptor types.
Note that 3D coordinates of the molecules are needed to calculate some of the descriptors. If they are not provided, the values of these descriptors will be NA.
# Load 20 small molecules that have 3D coordinates
sdf = system.file('sysdata/OptAA3d.sdf', package = 'Rcpi')
mol = readMolFromSDF(sdf)
dat = extractDrugAIO(mol, warn = FALSE)
Calculate Atom Additive logP and Molar Refractivity Values Descriptor
extractDrugALOGP(molecules, silent = TRUE)
molecules |
Parsed molucule object. |
silent |
Logical. Whether the calculating process
should be shown or not, default is |
Calculates ALOGP (Ghose-Crippen LogKow) and the Ghose-Crippen molar refractivity as described by Ghose, A.K. and Crippen, G.M. Note the underlying code in CDK assumes that aromaticity has been detected before evaluating this descriptor. The code also expects that the molecule will have hydrogens explicitly set. For SD files, this is usually not a problem since hydrogens are explicit. But for the case of molecules obtained from SMILES, hydrogens must be made explicit.
A data frame, each row represents one of the molecules, each column represents one feature. This function returns three columns named ALogP, ALogp2 and AMR.
Ghose, A.K. and Crippen, G.M., Atomic physicochemical parameters for three-dimensional structure-directed quantitative structure-activity relationships. I. Partition coefficients as a measure of hydrophobicity, Journal of Computational Chemistry, 1986, 7:565-577.
Ghose, A.K. and Crippen, G.M., Atomic physicochemical parameters for three-dimensional-structure-directed quantitative structure-activity relationships. 2. Modeling dispersive and hydrophobic interactions, Journal of Chemical Information and Computer Sciences, 1987, 27:21-35.
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugALOGP(mol)
head(dat)
Calculate the Number of Amino Acids Descriptor
extractDrugAminoAcidCount(molecules, silent = TRUE)
molecules | Parsed molecule object. |
silent | Logical. Whether the calculating process should be shown or not, default is TRUE. |
Calculates the number of each amino acid (20 types in total) found in the molecules.
A data frame, each row represents one of the molecules, each column represents one feature. This function returns 20 columns named nA, nR, nN, nD, nC, nF, nQ, nE, nG, nH, nI, nP, nL, nK, nM, nS, nT, nY, nV, nW.
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugAminoAcidCount(mol)
head(dat)
Calculate the Sum of the Atomic Polarizabilities Descriptor
extractDrugApol(molecules, silent = TRUE)
molecules | Parsed molecule object. |
silent | Logical. Whether the calculating process should be shown or not, default is TRUE. |
Calculates the sum of the atomic polarizabilities (including implicit hydrogens) descriptor. Polarizabilities are taken from https://bit.ly/3PvNbhe.
A data frame, each row represents one of the molecules, each column represents one feature. This function returns one column named apol.
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugApol(mol)
head(dat)
Calculate the Number of Aromatic Atoms Descriptor
extractDrugAromaticAtomsCount(molecules, silent = TRUE)
molecules | Parsed molecule object. |
silent | Logical. Whether the calculating process should be shown or not, default is TRUE. |
Calculates the number of aromatic atoms of a molecule.
A data frame, each row represents one of the molecules, each column represents one feature. This function returns one column named naAromAtom.
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugAromaticAtomsCount(mol)
head(dat)
Calculate the Number of Aromatic Bonds Descriptor
extractDrugAromaticBondsCount(molecules, silent = TRUE)
molecules | Parsed molecule object. |
silent | Logical. Whether the calculating process should be shown or not, default is TRUE. |
Calculates the number of aromatic bonds of a molecule.
A data frame, each row represents one of the molecules, each column represents one feature. This function returns one column named nAromBond.
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugAromaticBondsCount(mol)
head(dat)
Calculate the Number of Atoms Descriptor
extractDrugAtomCount(molecules, silent = TRUE)
molecules | Parsed molecule object. |
silent | Logical. Whether the calculating process should be shown or not, default is TRUE. |
Calculates the number of atoms of a certain element type in a molecule. By default it returns the count of all atoms.
A data frame, each row represents one of the molecules, each column represents one feature. This function returns one column named nAtom.
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugAtomCount(mol)
head(dat)
Calculate the Moreau-Broto Autocorrelation Descriptors using Partial Charges
extractDrugAutocorrelationCharge(molecules, silent = TRUE)
molecules | Parsed molecule object. |
silent | Logical. Whether the calculating process should be shown or not, default is TRUE. |
Calculates the ATS autocorrelation descriptor, where the weights are the partial charges.
A data frame, each row represents one of the molecules, each column represents one feature. This function returns 5 columns named ATSc1, ATSc2, ATSc3, ATSc4, ATSc5.
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugAutocorrelationCharge(mol)
head(dat)
Calculate the Moreau-Broto Autocorrelation Descriptors using Atomic Weight
extractDrugAutocorrelationMass(molecules, silent = TRUE)
molecules | Parsed molecule object. |
silent | Logical. Whether the calculating process should be shown or not, default is TRUE. |
Calculates the ATS autocorrelation descriptor, where the weights are the scaled atomic masses.
A data frame, each row represents one of the molecules, each column represents one feature. This function returns 5 columns named ATSm1, ATSm2, ATSm3, ATSm4, ATSm5.
Moreau, Gilles, and Pierre Broto. The autocorrelation of a topological structure: a new molecular descriptor. Nouv. J. Chim 4 (1980): 359-360.
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugAutocorrelationMass(mol)
head(dat)
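To make the idea behind the Moreau-Broto (ATS) autocorrelation concrete, here is a small base-R sketch of the general definition: for a given lag, sum the products of atomic weights over all atom pairs whose topological distance equals that lag. The toy molecule, the scaling of masses by the mass of carbon, and the helper name ats are assumptions for illustration; the lag indexing and normalisation used internally by the package may differ.

```r
# Heavy-atom graph of propan-1-ol, C1-C2-C3-O4 (hypothetical toy input)
bonds = rbind(c(1, 2), c(2, 3), c(3, 4))
n = 4

# Symmetric adjacency matrix
adj = matrix(0, n, n)
adj[bonds] = 1
adj[bonds[, 2:1]] = 1

# Topological distance matrix via Floyd-Warshall
d = ifelse(adj == 1, 1, Inf)
diag(d) = 0
for (k in seq_len(n)) {
  d = pmin(d, outer(d[, k], d[k, ], "+"))
}

# Atomic masses scaled by the mass of carbon (one possible scaling choice)
w = c(12.011, 12.011, 12.011, 15.999) / 12.011

# Autocorrelation at a given lag: sum of w_i * w_j over unordered
# atom pairs whose topological distance equals the lag
ats = function(lag) {
  idx = which(d == lag & upper.tri(d), arr.ind = TRUE)
  sum(w[idx[, 1]] * w[idx[, 2]])
}

sapply(1:3, ats)
```

Swapping w for partial charges or polarizabilities gives the charge- and polarizability-weighted variants of the same scheme.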
Calculate the Moreau-Broto Autocorrelation Descriptors using Polarizability
extractDrugAutocorrelationPolarizability(molecules, silent = TRUE)
molecules | Parsed molecule object. |
silent | Logical. Whether the calculating process should be shown or not, default is TRUE. |
Calculates the ATS autocorrelation descriptor using polarizability.
A data frame, each row represents one of the molecules, each column represents one feature. This function returns 5 columns named ATSp1, ATSp2, ATSp3, ATSp4, ATSp5.
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugAutocorrelationPolarizability(mol)
head(dat)
BCUT – Eigenvalue Based Descriptor
extractDrugBCUT(molecules, silent = TRUE)
molecules | Parsed molecule object. |
silent | Logical. Whether the calculating process should be shown or not, default is TRUE. |
Eigenvalue based descriptor noted for its utility in chemical diversity. Described by Pearlman et al. The descriptor is based on a weighted version of the Burden matrix which takes into account both the connectivity as well as atomic properties of a molecule. The weights are a variety of atom properties placed along the diagonal of the Burden matrix. Currently three weighting schemes are employed:
Atomic Weight
Partial Charge (Gasteiger-Marsili)
Polarizability (Kang et al.)
A data frame, each row represents one of the molecules, each column represents one feature. This function returns 6 columns:
BCUTw-1l, BCUTw-2l ... - nhigh lowest atom weighted BCUTS
BCUTw-1h, BCUTw-2h ... - nlow highest atom weighted BCUTS
BCUTc-1l, BCUTc-2l ... - nhigh lowest partial charge weighted BCUTS
BCUTc-1h, BCUTc-2h ... - nlow highest partial charge weighted BCUTS
BCUTp-1l, BCUTp-2l ... - nhigh lowest polarizability weighted BCUTS
BCUTp-1h, BCUTp-2h ... - nlow highest polarizability weighted BCUTS
By default, the descriptor will return the highest and lowest eigenvalues for the three classes of descriptor in a single ArrayList (in the order shown above). However it is also possible to supply a parameter list indicating how many of the highest and lowest eigenvalues (for each class of descriptor) are required. The descriptor works with the hydrogen depleted molecule.
A side effect of specifying the number of highest and lowest eigenvalues is that it is possible to get two copies of all the eigenvalues. That is, if a molecule has 5 heavy atoms, then specifying the 5 highest eigenvalues returns all of them, and specifying the 5 lowest eigenvalues returns all of them, resulting in two copies of all the eigenvalues.
Note that it is possible to specify an arbitrarily large number of eigenvalues to be returned. However, if the number (i.e., nhigh or nlow) is larger than the number of heavy atoms, the remaining eigenvalues will be NaN.
Given the above description, if the aim is to get all the eigenvalues for a molecule, you should set nlow to 0 and specify the number of heavy atoms (or some large number) for nhigh (or vice versa).
Pearlman, R.S. and Smith, K.M., Metric Validation and the Receptor-Relevant Subspace Concept, J. Chem. Inf. Comput. Sci., 1999, 39:28-35.
Burden, F.R., Molecular identification number for substructure searches, J. Chem. Inf. Comput. Sci., 1989, 29:225-227.
Burden, F.R., Chemically Intuitive Molecular Index, Quant. Struct. -Act. Relat., 1997, 16:309-314
Kang, Y.K. and Jhon, M.S., Additivity of Atomic Static Polarizabilities and Dispersion Coefficients, Theoretica Chimica Acta, 1982, 61:41-48
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugBCUT(mol)
head(dat)
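The selection of lowest and highest eigenvalues, and the NaN padding when more values are requested than there are heavy atoms, can be sketched in base R as follows. The 3 x 3 matrix is only a Burden-like toy (a property on the diagonal, small bond terms off the diagonal) and is not the exact CDK parameterization; the helper bcut and its arguments nlow and nhigh are hypothetical names chosen to mirror the description above.

```r
# Toy symmetric "Burden-like" matrix for a 3-heavy-atom chain (e.g. C-C-O).
# Diagonal: an atomic property (here approximate atomic weights);
# off-diagonal: a small term for bonded atoms, a tiny constant otherwise.
# These entries are illustrative, not CDK's exact parameterization.
B = matrix(c(12.011, 0.1,    0.001,
             0.1,    12.011, 0.1,
             0.001,  0.1,    15.999),
           nrow = 3, byrow = TRUE)

# Eigenvalues of the symmetric matrix
ev = eigen(B, symmetric = TRUE, only.values = TRUE)$values

# Pick the nlow lowest and nhigh highest eigenvalues, padding with NaN
# when more values are requested than there are heavy atoms
bcut = function(ev, nlow = 1, nhigh = 1) {
  asc  = sort(ev)
  low  = c(asc, rep(NaN, max(0, nlow - length(asc))))[seq_len(nlow)]
  high = c(rev(asc), rep(NaN, max(0, nhigh - length(asc))))[seq_len(nhigh)]
  list(low = low, high = high)
}

bcut(ev)                       # one lowest and one highest eigenvalue
bcut(ev, nlow = 0, nhigh = 5)  # all three eigenvalues, then NaN padding
```

Requesting nlow = 3 and nhigh = 3 for this three-atom toy would return every eigenvalue twice, which is the duplication side effect described above.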
Calculate the Descriptor Based on the Number of Bonds of a Certain Bond Order
extractDrugBondCount(molecules, silent = TRUE)
molecules | Parsed molecule object. |
silent | Logical. Whether the calculating process should be shown or not, default is TRUE. |
Calculates the descriptor based on the number of bonds of a certain bond order.
A data frame, each row represents one of the molecules, each column represents one feature. This function returns one column named nB.
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugBondCount(mol)
head(dat)
Calculates the Descriptor that Describes the Sum of the Absolute Value of the Difference between Atomic Polarizabilities of All Bonded Atoms in the Molecule
extractDrugBPol(molecules, silent = TRUE)
molecules | Parsed molecule object. |
silent | Logical. Whether the calculating process should be shown or not, default is TRUE. |
This descriptor calculates the sum of the absolute value of the difference between atomic polarizabilities of all bonded atoms in the molecule (including implicit hydrogens) with polarizabilities taken from https://bit.ly/3PvNbhe. This descriptor assumes 2-centered bonds.
A data frame, each row represents one of the molecules, each column represents one feature. This function returns one column named bpol.
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugBPol(mol)
head(dat)
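The two polarizability descriptors, apol (sum over atoms) and bpol (sum of absolute differences over bonded pairs), can be illustrated with a hand-built toy molecule. This is a minimal sketch: the polarizability values below are illustrative assumptions and may not match the table the package actually uses, and the atom and bond lists are written out by hand rather than parsed.

```r
# Illustrative atomic polarizabilities; the values actually used by the
# underlying toolkit may differ
alpha = c(C = 1.76, O = 0.802, H = 0.666793)

# Methanol with all hydrogens written out: atom symbols and bonded pairs
atoms = c("C", "O", "H", "H", "H", "H")
bonds = rbind(c(1, 2),                    # C-O
              c(1, 3), c(1, 4), c(1, 5),  # three C-H bonds
              c(2, 6))                    # O-H

a = unname(alpha[atoms])

# apol: sum of atomic polarizabilities over all atoms
apol = sum(a)

# bpol: sum of |alpha_i - alpha_j| over all two-centre bonds
bpol = sum(abs(a[bonds[, 1]] - a[bonds[, 2]]))

c(apol = apol, bpol = bpol)
```

Hydrogens are listed explicitly here so that every bond contributes one term, matching the "including implicit hydrogens" wording of both descriptors.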
Topological Descriptor Characterizing the Carbon Connectivity in Terms of Hybridization
extractDrugCarbonTypes(molecules, silent = TRUE)
molecules | Parsed molecule object. |
silent | Logical. Whether the calculating process should be shown or not, default is TRUE. |
Calculates the carbon connectivity in terms of hybridization. The function calculates 9 descriptors in the following order:
C1SP1 - triply bound carbon bound to one other carbon
C2SP1 - triply bound carbon bound to two other carbons
C1SP2 - doubly bound carbon bound to one other carbon
C2SP2 - doubly bound carbon bound to two other carbons
C3SP2 - doubly bound carbon bound to three other carbons
C1SP3 - singly bound carbon bound to one other carbon
C2SP3 - singly bound carbon bound to two other carbons
C3SP3 - singly bound carbon bound to three other carbons
C4SP3 - singly bound carbon bound to four other carbons
A data frame, each row represents one of the molecules, each column represents one feature. This function returns 9 columns named C1SP1, C2SP1, C1SP2, C2SP2, C3SP2, C1SP3, C2SP3, C3SP3 and C4SP3.
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugCarbonTypes(mol)
head(dat)
Calculate the Kier and Hall Chi Chain Indices of Orders 3, 4, 5, 6 and 7
extractDrugChiChain(molecules, silent = TRUE)
molecules | Parsed molecule object. |
silent | Logical. Whether the calculating process should be shown or not, default is TRUE. |
Evaluates chi chain descriptors. The code currently evaluates the simple and valence chi chain descriptors of orders 3, 4, 5, 6 and 7. It utilizes the graph isomorphism code of the CDK to find fragments matching SMILES strings representing the fragments corresponding to each type of chain.
A data frame, each row represents one of the molecules, each column represents one feature. This function returns 10 columns, in the following order:
SCH.3 - Simple chain, order 3
SCH.4 - Simple chain, order 4
SCH.5 - Simple chain, order 5
SCH.6 - Simple chain, order 6
SCH.7 - Simple chain, order 7
VCH.3 - Valence chain, order 3
VCH.4 - Valence chain, order 4
VCH.5 - Valence chain, order 5
VCH.6 - Valence chain, order 6
VCH.7 - Valence chain, order 7
These descriptors are calculated using graph isomorphism to identify the various fragments. As a result calculations may be slow. In addition, recent versions of Molconn-Z use simplified fragment definitions (i.e., rings without branches etc.) whereas these descriptors use the older more complex fragment definitions.
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugChiChain(mol)
head(dat)
Evaluates the Kier and Hall Chi cluster indices of orders 3, 4, 5 and 6
extractDrugChiCluster(molecules, silent = TRUE)
molecules | Parsed molecule object. |
silent | Logical. Whether the calculating process should be shown or not, default is TRUE. |
Evaluates chi cluster descriptors. It utilizes the graph isomorphism code of the CDK to find fragments matching SMILES strings representing the fragments corresponding to each type of chain.
A data frame, each row represents one of the molecules, each column represents one feature. This function returns 8 columns. The order and names of the columns returned are:
SC.3 - Simple cluster, order 3
SC.4 - Simple cluster, order 4
SC.5 - Simple cluster, order 5
SC.6 - Simple cluster, order 6
VC.3 - Valence cluster, order 3
VC.4 - Valence cluster, order 4
VC.5 - Valence cluster, order 5
VC.6 - Valence cluster, order 6
These descriptors are calculated using graph isomorphism to identify the various fragments. As a result calculations may be slow. In addition, recent versions of Molconn-Z use simplified fragment definitions (i.e., rings without branches etc.) whereas these descriptors use the older more complex fragment definitions.
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugChiCluster(mol)
head(dat)
Calculate the Kier and Hall Chi Path Indices of Orders 0 to 7
extractDrugChiPath(molecules, silent = TRUE)
molecules | Parsed molecule object. |
silent | Logical. Whether the calculating process should be shown or not, default is TRUE. |
Evaluates chi path descriptors. This function utilizes the graph isomorphism code of the CDK to find fragments matching SMILES strings representing the fragments corresponding to each type of chain.
A data frame, each row represents one of the molecules, each column represents one feature. This function returns 16 columns. The order and names of the columns returned are:
SP.0, SP.1, ..., SP.7 - Simple path, orders 0 to 7
VP.0, VP.1, ..., VP.7 - Valence path, orders 0 to 7
These descriptors are calculated using graph isomorphism to identify the various fragments. As a result calculations may be slow. In addition, recent versions of Molconn-Z use simplified fragment definitions (i.e., rings without branches etc.) whereas these descriptors use the older more complex fragment definitions.
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugChiPath(mol)
head(dat)
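For intuition about what the simple path indices measure, the two lowest orders can be computed by hand from vertex degrees, using the standard Kier and Hall definitions: order 0 sums 1/sqrt(delta) over atoms, and order 1 sums 1/sqrt(delta_i * delta_j) over bonds. The sketch below is base R on a toy heavy-atom graph and does not reproduce the fragment matching that the package delegates to the CDK; higher orders and the valence variants are not covered.

```r
# Heavy-atom graph of isobutane: atom 1 is the central carbon
bonds = rbind(c(1, 2), c(1, 3), c(1, 4))
n = 4

# Simple delta value: number of heavy-atom neighbours of each atom
delta = tabulate(c(bonds), nbins = n)

# Order 0: one term per atom
sp0 = sum(1 / sqrt(delta))

# Order 1: one term per bond
sp1 = sum(1 / sqrt(delta[bonds[, 1]] * delta[bonds[, 2]]))

c(SP.0 = sp0, SP.1 = sp1)
```

The names SP.0 and SP.1 are reused here only to show which output columns these quantities correspond to conceptually.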
Calculate the Kier and Hall Chi Path Cluster Indices of Orders 4, 5 and 6
extractDrugChiPathCluster(molecules, silent = TRUE)
molecules | Parsed molecule object. |
silent | Logical. Whether the calculating process should be shown or not, default is TRUE. |
Evaluates chi path cluster descriptors. The code currently evaluates the simple and valence chi path cluster descriptors of orders 4, 5 and 6. It utilizes the graph isomorphism code of the CDK to find fragments matching SMILES strings representing the fragments corresponding to each type of chain.
A data frame, each row represents one of the molecules, each column represents one feature. This function returns 6 columns named SPC.4, SPC.5, SPC.6, VPC.4, VPC.5, VPC.6:
SPC.4 - Simple path cluster, order 4
SPC.5 - Simple path cluster, order 5
SPC.6 - Simple path cluster, order 6
VPC.4 - Valence path cluster, order 4
VPC.5 - Valence path cluster, order 5
VPC.6 - Valence path cluster, order 6
These descriptors are calculated using graph isomorphism to identify the various fragments. As a result calculations may be slow. In addition, recent versions of Molconn-Z use simplified fragment definitions (i.e., rings without branches etc.) whereas these descriptors use the older more complex fragment definitions.
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugChiPathCluster(mol)
head(dat)
A Variety of Descriptors Combining Surface Area and Partial Charge Information
extractDrugCPSA(molecules, silent = TRUE)
molecules | Parsed molecule object. |
silent | Logical. Whether the calculating process should be shown or not, default is TRUE. |
Calculates 29 Charged Partial Surface Area (CPSA) descriptors. The CPSA's were developed by Stanton et al.
A data frame, each row represents one of the molecules, each column represents one feature. This function returns 29 columns:
PPSA.1 - partial positive surface area: sum of surface area on positive parts of molecule
PPSA.2 - partial positive surface area * total positive charge on the molecule
PPSA.3 - charge weighted partial positive surface area
PNSA.1 - partial negative surface area: sum of surface area on negative parts of molecule
PNSA.2 - partial negative surface area * total negative charge on the molecule
PNSA.3 - charge weighted partial negative surface area
DPSA.1 - difference of PPSA.1 and PNSA.1
DPSA.2 - difference of PPSA.2 and PNSA.2
DPSA.3 - difference of PPSA.3 and PNSA.3
FPSA.1 - PPSA.1 / total molecular surface area
FPSA.2 - PPSA.2 / total molecular surface area
FPSA.3 - PPSA.3 / total molecular surface area
FNSA.1 - PNSA.1 / total molecular surface area
FNSA.2 - PNSA.2 / total molecular surface area
FNSA.3 - PNSA.3 / total molecular surface area
WPSA.1 - PPSA.1 * total molecular surface area / 1000
WPSA.2 - PPSA.2 * total molecular surface area / 1000
WPSA.3 - PPSA.3 * total molecular surface area / 1000
WNSA.1 - PNSA.1 * total molecular surface area / 1000
WNSA.2 - PNSA.2 * total molecular surface area / 1000
WNSA.3 - PNSA.3 * total molecular surface area / 1000
RPCG - relative positive charge: most positive charge / total positive charge
RNCG - relative negative charge: most negative charge / total negative charge
RPCS - relative positive charge surface area: most positive surface area * RPCG
RNCS - relative negative charge surface area: most negative surface area * RNCG
THSA - sum of solvent accessible surface areas of atoms with absolute value of partial charges less than 0.2
TPSA - sum of solvent accessible surface areas of atoms with absolute value of partial charges greater than or equal to 0.2
RHSA - THSA / total molecular surface area
RPSA - TPSA / total molecular surface area
Stanton, D.T. and Jurs, P.C., Development and Use of Charged Partial Surface Area Structural Descriptors in Computer-Assisted Quantitative Structure-Property Relationship Studies, Analytical Chemistry, 1990, 62:2323-2329.
sdf = system.file('sysdata/OptAA3d.sdf', package = 'Rcpi')
mol = readMolFromSDF(sdf)
dat = extractDrugCPSA(mol)
head(dat)
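Several of the CPSA descriptors are simple arithmetic on per-atom partial charges and solvent-accessible surface areas. The base-R sketch below follows the definitions listed in the value description for a handful of them. The charges and areas are made-up numbers for a five-atom toy system; in practice both come from the 3D structure, which is why this function is demonstrated on an SDF file.

```r
# Hypothetical per-atom partial charges and solvent-accessible surface areas
q  = c(0.30, -0.45, 0.10, -0.05, 0.10)
sa = c(12.0, 20.0, 15.0, 30.0, 23.0)

pos = q > 0
neg = q < 0
total_sa = sum(sa)

ppsa1 = sum(sa[pos])            # surface area on positive atoms
pnsa1 = sum(sa[neg])            # surface area on negative atoms

cpsa = c(
  PPSA.1 = ppsa1,
  PNSA.1 = pnsa1,
  PPSA.2 = ppsa1 * sum(q[pos]),          # times total positive charge
  PNSA.2 = pnsa1 * sum(q[neg]),          # times total negative charge
  PPSA.3 = sum(q[pos] * sa[pos]),        # charge-weighted positive area
  PNSA.3 = sum(q[neg] * sa[neg]),        # charge-weighted negative area
  DPSA.1 = ppsa1 - pnsa1,
  FPSA.1 = ppsa1 / total_sa,
  FNSA.1 = pnsa1 / total_sa,
  WPSA.1 = ppsa1 * total_sa / 1000,
  WNSA.1 = pnsa1 * total_sa / 1000,
  RPCG   = max(q[pos]) / sum(q[pos]),    # most positive / total positive
  RNCG   = min(q[neg]) / sum(q[neg])     # most negative / total negative
)
cpsa
```

The remaining descriptors (the .2 and .3 variants of the D, F and W families, and the THSA/TPSA group) follow the same pattern from the listed definitions.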
Calculate Molecular Descriptors Provided by OpenBabel
extractDrugDescOB(molecules, type = c("smile", "sdf"))
molecules | R character string object containing the molecules. See the example section for details. |
type | Either 'smile' or 'sdf', matching the format of molecules. |
This function calculates 14 types of the numerical molecular descriptors provided in OpenBabel.
A data frame, each row represents one of the molecules, each column represents one descriptor. This function returns 14 columns named abonds, atoms, bonds, dbonds, HBA1, HBA2, HBD, logP, MR, MW, nF, sbonds, tbonds, TPSA:
abonds - Number of aromatic bonds
atoms - Number of atoms
bonds - Number of bonds
dbonds - Number of double bonds
HBA1 - Number of Hydrogen Bond Acceptors 1
HBA2 - Number of Hydrogen Bond Acceptors 2
HBD - Number of Hydrogen Bond Donors
logP - Octanol/Water Partition Coefficient
MR - Molar Refractivity
MW - Molecular Weight Filter
nF - Number of Fluorine Atoms
sbonds - Number of single bonds
tbonds - Number of triple bonds
TPSA - Topological Polar Surface Area
mol1 = 'CC(=O)NCCC1=CNc2c1cc(OC)cc2'  # one molecule SMILE in a vector
mol2 = c('OCCc1c(C)[n+](=cs1)Cc2cnc(C)nc(N)2',
         'CCc(c1)ccc2[n+]1ccc3c2Nc4c3cccc4',
         '[Cu+2].[O-]S(=O)(=O)[O-]')  # multiple SMILEs in a vector
mol3 = readChar(system.file('compseq/DB00860.sdf', package = 'Rcpi'),
                nchars = 1e+6)  # single molecule in a sdf file
mol4 = readChar(system.file('sysdata/OptAA3d.sdf', package = 'Rcpi'),
                nchars = 1e+6)  # multiple molecules in a sdf file
## Not run: 
smidesc0 = extractDrugDescOB(mol1, type = 'smile')
smidesc1 = extractDrugDescOB(mol2, type = 'smile')
sdfdesc0 = extractDrugDescOB(mol3, type = 'sdf')
sdfdesc1 = extractDrugDescOB(mol4, type = 'sdf')
## End(Not run)
Calculate the Eccentric Connectivity Index Descriptor
extractDrugECI(molecules, silent = TRUE)
molecules | Parsed molecule object. |
silent | Logical. Whether the calculating process should be shown or not, default is TRUE. |
Eccentric Connectivity Index (ECI) is a topological descriptor combining distance and adjacency information. This descriptor is described by Sharma et al. and has been shown to correlate well with a number of physical properties. The descriptor is also reported to have good discriminatory ability. The eccentric connectivity index for a hydrogen-suppressed molecular graph is given by
$$\xi^{c} = \sum_{i=1}^{n} E(i)\, V(i)$$
where n is the number of atoms in the graph, E(i) is the eccentricity of the i-th atom (path length from the i-th atom to the atom farthest from it) and V(i) is the vertex degree of the i-th atom.
A data frame, each row represents one of the molecules, each column represents one feature. This function returns one column named ECCEN.
Sharma, V. and Goswami, R. and Madan, A.K. (1997), Eccentric Connectivity Index: A Novel Highly Discriminating Topological Descriptor for Structure-Property and Structure-Activity Studies, Journal of Chemical Information and Computer Sciences, 37:273-282
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugECI(mol)
head(dat)
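The eccentric connectivity index, the sum over atoms of eccentricity E(i) times vertex degree V(i), is easy to verify by hand on a small graph. A minimal base-R sketch follows, assuming a hand-written bond list for the hydrogen-suppressed graph of 2-methylbutane; the variable names are illustrative and not part of the package.

```r
# Heavy-atom graph of 2-methylbutane: C1-C2(-C5)-C3-C4
bonds = rbind(c(1, 2), c(2, 3), c(3, 4), c(2, 5))
n = 5

# Symmetric adjacency matrix
adj = matrix(0, n, n)
adj[bonds] = 1
adj[bonds[, 2:1]] = 1

# All-pairs shortest path lengths (Floyd-Warshall)
d = ifelse(adj == 1, 1, Inf)
diag(d) = 0
for (k in seq_len(n)) {
  d = pmin(d, outer(d[, k], d[k, ], "+"))
}

ecc = apply(d, 1, max)   # E(i): distance to the farthest atom
deg = rowSums(adj)       # V(i): vertex degree

eci = sum(ecc * deg)
eci
```

For this graph the eccentricities are 3, 2, 2, 3, 3 and the degrees are 1, 3, 2, 1, 1, giving 3 + 6 + 4 + 3 + 3 = 19.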
Calculate the E-State Molecular Fingerprints (in Compact Format)
extractDrugEstate(molecules, silent = TRUE)
molecules | Parsed molecule object. |
silent | Logical. Whether the calculating process should be shown or not, default is TRUE. |
79 bit fingerprints corresponding to the E-State atom types described by Hall and Kier.
A list, each component represents one of the molecules, each element in the component represents the index of which element in the fingerprint is 1. Each component's name is the length of the fingerprints.
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
fp = extractDrugEstate(mol)
head(fp)
Calculate the E-State Molecular Fingerprints (in Complete Format)
extractDrugEstateComplete(molecules, silent = TRUE)
molecules | Parsed molecule object. |
silent | Logical. Whether the calculating process should be shown or not, default is TRUE. |
79 bit fingerprints corresponding to the E-State atom types described by Hall and Kier.
An integer vector or a matrix. Each row represents one molecule, the columns represent the fingerprints.
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
fp = extractDrugEstateComplete(mol)
dim(fp)
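The compact and complete fingerprint formats carry the same information: the compact form lists the positions of the bits set to 1, while the complete form is a 0/1 matrix with one row per molecule. The base-R sketch below shows the relationship on made-up data. The helper to_complete and the bit positions are hypothetical, and the list here is a simplified stand-in for the structure the package returns.

```r
# Compact format: for each molecule, the indices of the bits set to 1.
# Toy fingerprints of length 79 with made-up bit positions.
fp_len = 79
fp_compact = list(c(7, 12, 35), c(12, 16), integer(0))

# Complete format: one row per molecule, one 0/1 column per bit
to_complete = function(fp, len) {
  m = matrix(0L, nrow = length(fp), ncol = len)
  for (i in seq_along(fp)) {
    m[i, fp[[i]]] = 1L
  }
  m
}

fp_complete = to_complete(fp_compact, fp_len)
dim(fp_complete)
rowSums(fp_complete)   # number of bits set per molecule
```

The compact form is lighter for sparse fingerprints, whereas the complete matrix can be passed directly to most modelling functions.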
Calculate the Extended Molecular Fingerprints (in Compact Format)
extractDrugExtended(molecules, depth = 6, size = 1024, silent = TRUE)
molecules | Parsed molecule object. |
depth | The search depth. Default is 6. |
size | The length of the fingerprint bit string. Default is 1024. |
silent | Logical. Whether the calculating process should be shown or not, default is TRUE. |
Calculate the extended molecular fingerprints. Considers paths of a given length, similar to the standard type, but takes rings and atomic properties into account. This is a hashed fingerprint, with a default length of 1024.
A list, each component represents one of the molecules, each element in the component represents the index of which element in the fingerprint is 1. Each component's name is the length of the fingerprints.
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
fp = extractDrugExtended(mol)
head(fp)
Calculate the Extended Molecular Fingerprints (in Complete Format)
extractDrugExtendedComplete(molecules, depth = 6, size = 1024, silent = TRUE)
molecules | Parsed molecule object. |
depth | The search depth. Default is 6. |
size | The length of the fingerprint bit string. Default is 1024. |
silent | Logical. Whether the calculating process should be shown or not, default is TRUE. |
Calculate the extended molecular fingerprints. Considers paths of a given length, similar to the standard type, but takes rings and atomic properties into account. This is a hashed fingerprint, with a default length of 1024.
An integer vector or a matrix. Each row represents one molecule, the columns represent the fingerprints.
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
fp = extractDrugExtendedComplete(mol)
dim(fp)
Calculate the FMF Descriptor
extractDrugFMF(molecules, silent = TRUE)
molecules | Parsed molecule object. |
silent | Logical. Whether the calculating process should be shown or not, default is TRUE. |
Calculates the FMF descriptor characterizing molecular complexity in terms of its Murcko framework. This descriptor is the ratio of heavy atoms in the framework to the total number of heavy atoms in the molecule. By definition, acyclic molecules which have no frameworks, will have a value of 0. Note that the authors consider an isolated ring system to be a framework (even though there is no linker).
A data frame, each row represents one of the molecules, each column represents one feature. This function returns one column named FMF.
Yang, Y., Chen, H., Nilsson, I., Muresan, S., & Engkvist, O. (2010). Investigation of the relationship between topology and selectivity for druglike molecules. Journal of medicinal chemistry, 53(21), 7709-7714.
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugFMF(mol)
head(dat)
Calculate Complexity of a System
extractDrugFragmentComplexity(molecules, silent = TRUE)
molecules |
Parsed molecule object. |
silent |
Logical. Whether the calculating process should be shown or not. Default is TRUE. |
This descriptor calculates the complexity of a system. The complexity is defined in Nilakantan, R. et al. as:
C = abs(B^2 - A^2 + A) + H / 100
where C is complexity, A is the number of non-hydrogen atoms, B is the number of bonds and H is the number of heteroatoms.
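As a rough illustration of how A, B and H combine, the sketch below assumes the form C = |B^2 - A^2 + A| + H/100, which is the expression used by the underlying CDK descriptor as far as we know. The helper name fragComplexity and the input counts are hypothetical and not part of Rcpi.

```r
# Hypothetical helper (not part of Rcpi).
# Assumption: C = |B^2 - A^2 + A| + H / 100
fragComplexity <- function(A, B, H) {
  abs(B^2 - A^2 + A) + H / 100
}

# Illustrative counts: 3 non-hydrogen atoms, 2 bonds, 1 heteroatom
fragComplexity(A = 3, B = 2, H = 1)
```

The heteroatom term only contributes in the second decimal place, so it mainly separates fragments that share the same atom and bond counts.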
A data frame, each row represents one of the molecules, each column represents one feature. This function returns one column named fragC.
Nilakantan, R. and Nunn, D.S. and Greenblatt, L. and Walker, G. and Haraki, K. and Mobilio, D., A family of ring system-based structural fragments for use in structure-activity studies: database mining and recursive partitioning., Journal of chemical information and modeling, 2006, 46:1069-1077
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugFragmentComplexity(mol)
head(dat)
Calculate the Graph Molecular Fingerprints (in Compact Format)
extractDrugGraph(molecules, depth = 6, size = 1024, silent = TRUE)
molecules |
Parsed molecule object. |
depth |
The search depth. Default is 6. |
size |
The length of the fingerprint bit string. Default is 1024. |
silent |
Logical. Whether the calculating process should be shown or not. Default is TRUE. |
Calculate the graph molecular fingerprints. Similar to the standard type, but simply considers connectivity. This is a hashed fingerprint, with a default length of 1024.
A list. Each component represents one of the molecules, and each element in the component is the index of a fingerprint bit that is set to 1. Each component's name is the length of the fingerprint.
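To make the relationship between the compact and the complete formats concrete, here is a minimal base R sketch. The helper name compactToComplete and the toy input are hypothetical and not part of Rcpi; it assumes each list component holds the 1-based indices of the bits that are set.

```r
# Hypothetical helper (not part of Rcpi): expand a compact fingerprint list
# into a 0/1 matrix with one row per molecule.
compactToComplete <- function(fp, size = 1024L) {
  mat <- matrix(0L, nrow = length(fp), ncol = size)
  for (i in seq_along(fp)) {
    mat[i, fp[[i]]] <- 1L  # switch on the bits listed for molecule i
  }
  mat
}

# Toy input: two molecules with made-up "on" bit indices
toy <- list(c(3L, 17L, 1024L), c(1L, 17L))
complete <- compactToComplete(toy, size = 1024L)
dim(complete)
rowSums(complete)
```

The compact form is the cheaper one to store for sparse fingerprints, while the matrix form is what most modelling functions expect as input.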
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
fp = extractDrugGraph(mol)
head(fp)
Calculate the Graph Molecular Fingerprints (in Complete Format)
extractDrugGraphComplete(molecules, depth = 6, size = 1024, silent = TRUE)
molecules |
Parsed molecule object. |
depth |
The search depth. Default is 6. |
size |
The length of the fingerprint bit string. Default is 1024. |
silent |
Logical. Whether the calculating process should be shown or not. Default is TRUE. |
Calculate the graph molecular fingerprints. Similar to the standard type, but simply considers connectivity. This is a hashed fingerprint, with a default length of 1024.
An integer vector or a matrix. Each row represents one molecule, the columns represent the fingerprints.
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
fp = extractDrugGraphComplete(mol)
dim(fp)
Descriptor Characterizing the Mass Distribution of the Molecule.
extractDrugGravitationalIndex(molecules, silent = TRUE)
molecules |
Parsed molecule object. |
silent |
Logical. Whether the calculating process should be shown or not. Default is TRUE. |
Descriptor characterizing the mass distribution of the molecule described by Katritzky et al. For modelling purposes the value of the descriptor is calculated both with and without H atoms. Furthermore the square and cube roots of the descriptor are also generated as described by Wessel et al.
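A minimal base R sketch of this kind of index follows. It assumes the usual Katritzky form, the sum of m_i * m_j / r_ij^2 over atom pairs, which the text above does not spell out. The helper name gravIndex, the masses and the coordinates are hypothetical and not part of Rcpi.

```r
# Hypothetical helper (not part of Rcpi).
# Assumption: index = sum over atom pairs of m_i * m_j / r_ij^2
gravIndex <- function(mass, xyz, pairs) {
  total <- 0
  for (k in seq_len(nrow(pairs))) {
    i <- pairs[k, 1]
    j <- pairs[k, 2]
    r2 <- sum((xyz[i, ] - xyz[j, ])^2)  # squared distance between atoms i and j
    total <- total + mass[i] * mass[j] / r2
  }
  total
}

# Toy three-atom chain with made-up coordinates (angstroms)
mass <- c(12.011, 12.011, 15.999)
xyz <- rbind(c(0, 0, 0), c(1.5, 0, 0), c(1.5, 1.4, 0))
bonded <- rbind(c(1, 2), c(2, 3))  # bonded pairs only
allPairs <- t(combn(3, 2))         # every pair of atoms

g <- gravIndex(mass, xyz, bonded)
gAll <- gravIndex(mass, xyz, allPairs)
c(g, sqrt(g), g^(1 / 3), gAll)
```

The square and cube roots in the last line mirror the root-transformed variants mentioned above; the all-pairs value is always at least as large as the bonded-pairs value.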
A data frame, each row represents one of the molecules, each column represents one feature. This function returns 9 columns:
GRAV.1 - gravitational index of heavy atoms
GRAV.2 - square root of gravitational index of heavy atoms
GRAV.3 - cube root of gravitational index of heavy atoms
GRAVH.1 - gravitational index - hydrogens included
GRAVH.2 - square root of hydrogen-included gravitational index
GRAVH.3 - cube root of hydrogen-included gravitational index
GRAV.4 - grav1 for all pairs of atoms (not just bonded pairs)
GRAV.5 - grav2 for all pairs of atoms (not just bonded pairs)
GRAV.6 - grav3 for all pairs of atoms (not just bonded pairs)
Katritzky, A.R. and Mu, L. and Lobanov, V.S. and Karelson, M., Correlation of Boiling Points With Molecular Structure. 1. A Training Set of 298 Diverse Organics and a Test Set of 9 Simple Inorganics, J. Phys. Chem., 1996, 100:10400-10407.
Wessel, M.D. and Jurs, P.C. and Tolan, J.W. and Muskal, S.M. , Prediction of Human Intestinal Absorption of Drug Compounds From Molecular Structure, Journal of Chemical Information and Computer Sciences, 1998, 38:726-735.
sdf = system.file('sysdata/OptAA3d.sdf', package = 'Rcpi')
mol = readMolFromSDF(sdf)
dat = extractDrugGravitationalIndex(mol)
head(dat)
Number of Hydrogen Bond Acceptors
extractDrugHBondAcceptorCount(molecules, silent = TRUE)
molecules |
Parsed molecule object. |
silent |
Logical. Whether the calculating process should be shown or not. Default is TRUE. |
This descriptor calculates the number of hydrogen bond acceptors using a slightly simplified version of the PHACIR atom types. The following groups are counted as hydrogen bond acceptors:
- any oxygen where the formal charge of the oxygen is non-positive (i.e. formal charge <= 0), except an aromatic ether oxygen (i.e. an ether oxygen that is adjacent to at least one aromatic carbon) and an oxygen that is adjacent to a nitrogen
- any nitrogen where the formal charge of the nitrogen is non-positive (i.e. formal charge <= 0), except a nitrogen that is adjacent to an oxygen
A data frame, each row represents one of the molecules, each column represents one feature. This function returns one column named nHBAcc.
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugHBondAcceptorCount(mol)
head(dat)
Number of Hydrogen Bond Donors
extractDrugHBondDonorCount(molecules, silent = TRUE)
molecules |
Parsed molecule object. |
silent |
Logical. Whether the calculating process should be shown or not. Default is TRUE. |
This descriptor calculates the number of hydrogen bond donors using a slightly simplified version of the PHACIR atom types (https://bit.ly/3qXQELf). The following groups are counted as hydrogen bond donors:
- any -OH where the formal charge of the oxygen is non-negative (i.e. formal charge >= 0)
- any -NH where the formal charge of the nitrogen is non-negative (i.e. formal charge >= 0)
A data frame, each row represents one of the molecules, each column represents one feature. This function returns one column named nHBDon.
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugHBondDonorCount(mol)
head(dat)
Calculate the Hybridization Molecular Fingerprints (in Compact Format)
extractDrugHybridization(molecules, depth = 6, size = 1024, silent = TRUE)
molecules |
Parsed molecule object. |
depth |
The search depth. Default is 6. |
size |
The length of the fingerprint bit string. Default is 1024. |
silent |
Logical. Whether the calculating process should be shown or not. Default is TRUE. |
Calculate the hybridization molecular fingerprints. Similar to the standard type, but only considers hybridization state. This is a hashed fingerprint, with a default length of 1024.
A list. Each component represents one of the molecules, and each element in the component is the index of a fingerprint bit that is set to 1. Each component's name is the length of the fingerprint.
extractDrugHybridizationComplete
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
fp = extractDrugHybridization(mol)
head(fp)
Calculate the Hybridization Molecular Fingerprints (in Complete Format)
extractDrugHybridizationComplete(molecules, depth = 6, size = 1024, silent = TRUE)
molecules |
Parsed molecule object. |
depth |
The search depth. Default is 6. |
size |
The length of the fingerprint bit string. Default is 1024. |
silent |
Logical. Whether the calculating process should be shown or not. Default is TRUE. |
Calculate the hybridization molecular fingerprints. Similar to the standard type, but only considers hybridization state. This is a hashed fingerprint, with a default length of 1024.
An integer vector or a matrix. Each row represents one molecule, the columns represent the fingerprints.
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
fp = extractDrugHybridizationComplete(mol)
dim(fp)
Descriptor Characterizing Molecular Complexity in Terms of Carbon Hybridization States
extractDrugHybridizationRatio(molecules, silent = TRUE)
molecules |
Parsed molecule object. |
silent |
Logical. Whether the calculating process should be shown or not. Default is TRUE. |
This descriptor calculates the fraction of sp3 carbons relative to the total of sp3 and sp2 carbons. Note that it only considers carbon atoms, and rather than using a simple ratio it reports the value of Nsp3/(Nsp3 + Nsp2). The original form of the descriptor (i.e., the simple ratio) has been used to characterize molecular complexity, especially in the area of natural products, which usually have a high sp3 to sp2 ratio.
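The reported value Nsp3/(Nsp3 + Nsp2) can be sketched in a few lines of base R. The helper name hybRatio is hypothetical and not part of Rcpi, and returning NA when a molecule has neither sp3 nor sp2 carbons is our assumption, not documented behaviour.

```r
# Hypothetical helper (not part of Rcpi): Nsp3 / (Nsp3 + Nsp2).
hybRatio <- function(nSp3, nSp2) {
  # Assumption: the value is undefined when there are no sp3 or sp2 carbons
  if (nSp3 + nSp2 == 0) return(NA_real_)
  nSp3 / (nSp3 + nSp2)
}

hybRatio(2, 6)  # ethylbenzene-like case: 2 sp3 carbons, 6 aromatic sp2 carbons
hybRatio(5, 0)  # fully saturated carbon skeleton
```

Unlike the simple sp3/sp2 ratio, this form stays bounded between 0 and 1 and does not blow up when a molecule has no sp2 carbons.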
A data frame, each row represents one of the molecules, each column represents one feature. This function returns one column named HybRatio.
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugHybridizationRatio(mol)
head(dat)
Calculate the Descriptor that Evaluates the Ionization Potential
extractDrugIPMolecularLearning(molecules, silent = TRUE)
molecules |
Parsed molecule object. |
silent |
Logical. Whether the calculating process should be shown or not. Default is TRUE. |
Calculate the ionization potential of a molecule. The descriptor assumes that explicit hydrogens have been added to the molecules.
A data frame, each row represents one of the molecules, each column represents one feature. This function returns one column named MolIP.
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugIPMolecularLearning(mol)
head(dat)
Descriptor that Calculates Kier and Hall Kappa Molecular Shape Indices
extractDrugKappaShapeIndices(molecules, silent = TRUE)
molecules |
Parsed molecule object. |
silent |
Logical. Whether the calculating process should be shown or not. Default is TRUE. |
Kier and Hall Kappa molecular shape indices compare the molecular graph with minimal and maximal molecular graphs; see https://bit.ly/3ramdBy for details: "they are intended to capture different aspects of molecular shape. Note that hydrogens are ignored. In the following description, n denotes the number of atoms in the hydrogen suppressed graph, m is the number of bonds in the hydrogen suppressed graph. Also, let p2 denote the number of paths of length 2 and let p3 denote the number of paths of length 3".
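Using the symbols n, m, p2 and p3 from the quotation above, the three indices can be sketched in base R. The formulas are the standard Kier definitions as we understand them; they are not listed in this page, so treat them as an assumption. The helper name kappaIndices is hypothetical and not part of Rcpi.

```r
# Hypothetical helper (not part of Rcpi).
# Assumed formulas (standard Kier definitions):
#   kappa1 = n (n - 1)^2 / m^2
#   kappa2 = (n - 1) (n - 2)^2 / p2^2
#   kappa3 = (n - 1) (n - 3)^2 / p3^2  if n is odd
#            (n - 3) (n - 2)^2 / p3^2  if n is even
kappaIndices <- function(n, m, p2, p3) {
  k1 <- n * (n - 1)^2 / m^2
  k2 <- (n - 1) * (n - 2)^2 / p2^2
  k3 <- if (n %% 2 == 1) {
    (n - 1) * (n - 3)^2 / p3^2
  } else {
    (n - 3) * (n - 2)^2 / p3^2
  }
  c(Kier1 = k1, Kier2 = k2, Kier3 = k3)
}

# Unbranched five-atom chain: 5 atoms, 4 bonds,
# 3 paths of length 2 and 2 paths of length 3
kappaIndices(n = 5, m = 4, p2 = 3, p3 = 2)
```

The real descriptor counts the paths on the hydrogen-suppressed graph itself; here the counts are supplied by hand to keep the example self-contained.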
A data frame, each row represents one of the molecules, each column represents one feature. This function returns 3 columns named Kier1, Kier2 and Kier3:
Kier1 - First kappa shape index
Kier2 - Second kappa shape index
Kier3 - Third kappa shape index
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugKappaShapeIndices(mol)
head(dat)
Descriptor that Counts the Number of Occurrences of the E-State Fragments
extractDrugKierHallSmarts(molecules, silent = TRUE)
molecules |
Parsed molecule object. |
silent |
Logical. Whether the calculating process should be shown or not. Default is TRUE. |
A fragment count descriptor that uses e-state fragments. Traditionally the e-state descriptors identify the relevant fragments and then evaluate the actual e-state value. However it has been shown in Butina et al. that simply using the counts of the e-state fragments can lead to QSAR models that exhibit similar performance to those built using the actual e-state indices.
Atom typing and aromaticity perception should be performed prior to calling this descriptor. The atom type definitions are taken from Hall et al. The SMARTS definitions were obtained from RDKit.
A data frame, each row represents one of the molecules, each column represents one feature. This function returns 79 columns:
ID | Name | Pattern |
---|---|---|
0 | khs.sLi | [LiD1]-* |
1 | khs.ssBe | [BeD2](-*)-* |
2 | khs.ssssBe | [BeD4](-*)(-*)(-*)-* |
3 | khs.ssBH | [BD2H](-*)-* |
4 | khs.sssB | [BD3](-*)(-*)-* |
5 | khs.ssssB | [BD4](-*)(-*)(-*)-* |
6 | khs.sCH3 | [CD1H3]-* |
7 | khs.dCH2 | [CD1H2]=* |
8 | khs.ssCH2 | [CD2H2](-*)-* |
9 | khs.tCH | [CD1H]#* |
10 | khs.dsCH | [CD2H](=*)-* |
11 | khs.aaCH | [C,c;D2H](:*):* |
12 | khs.sssCH | [CD3H](-*)(-*)-* |
13 | khs.ddC | [CD2H0](=*)=* |
14 | khs.tsC | [CD2H0](#*)-* |
15 | khs.dssC | [CD3H0](=*)(-*)-* |
16 | khs.aasC | [C,c;D3H0](:*)(:*)-* |
17 | khs.aaaC | [C,c;D3H0](:*)(:*):* |
18 | khs.ssssC | [CD4H0](-*)(-*)(-*)-* |
19 | khs.sNH3 | [ND1H3]-* |
20 | khs.sNH2 | [ND1H2]-* |
21 | khs.ssNH2 | [ND2H2](-*)-* |
22 | khs.dNH | [ND1H]=* |
23 | khs.ssNH | [ND2H](-*)-* |
24 | khs.aaNH | [N,nD2H](:*):* |
25 | khs.tN | [ND1H0]#* |
26 | khs.sssNH | [ND3H](-*)(-*)-* |
27 | khs.dsN | [ND2H0](=*)-* |
28 | khs.aaN | [N,nD2H0](:*):* |
29 | khs.sssN | [ND3H0](-*)(-*)-* |
30 | khs.ddsN | [ND3H0](~[OD1H0])(~[OD1H0])-,:* |
31 | khs.aasN | [N,nD3H0](:*)(:*)-,:* |
32 | khs.ssssN | [ND4H0](-*)(-*)(-*)-* |
33 | khs.sOH | [OD1H]-* |
34 | khs.dO | [OD1H0]=* |
35 | khs.ssO | [OD2H0](-*)-* |
36 | khs.aaO | [O,oD2H0](:*):* |
37 | khs.sF | [FD1]-* |
38 | khs.sSiH3 | [SiD1H3]-* |
39 | khs.ssSiH2 | [SiD2H2](-*)-* |
40 | khs.sssSiH | [SiD3H1](-*)(-*)-* |
41 | khs.ssssSi | [SiD4H0](-*)(-*)(-*)-* |
42 | khs.sPH2 | [PD1H2]-* |
43 | khs.ssPH | [PD2H1](-*)-* |
44 | khs.sssP | [PD3H0](-*)(-*)-* |
45 | khs.dsssP | [PD4H0](=*)(-*)(-*)-* |
46 | khs.sssssP | [PD5H0](-*)(-*)(-*)(-*)-* |
47 | khs.sSH | [SD1H1]-* |
48 | khs.dS | [SD1H0]=* |
49 | khs.ssS | [SD2H0](-*)-* |
50 | khs.aaS | [S,sD2H0](:*):* |
51 | khs.dssS | [SD3H0](=*)(-*)-* |
52 | khs.ddssS | [SD4H0](~[OD1H0])(~[OD1H0])(-*)-* |
53 | khs.sCl | [ClD1]-* |
54 | khs.sGeH3 | [GeD1H3](-*) |
55 | khs.ssGeH2 | [GeD2H2](-*)-* |
56 | khs.sssGeH | [GeD3H1](-*)(-*)-* |
57 | khs.ssssGe | [GeD4H0](-*)(-*)(-*)-* |
58 | khs.sAsH2 | [AsD1H2]-* |
59 | khs.ssAsH | [AsD2H1](-*)-* |
60 | khs.sssAs | [AsD3H0](-*)(-*)-* |
61 | khs.sssdAs | [AsD4H0](=*)(-*)(-*)-* |
62 | khs.sssssAs | [AsD5H0](-*)(-*)(-*)(-*)-* |
63 | khs.sSeH | [SeD1H1]-* |
64 | khs.dSe | [SeD1H0]=* |
65 | khs.ssSe | [SeD2H0](-*)-* |
66 | khs.aaSe | [SeD2H0](:*):* |
67 | khs.dssSe | [SeD3H0](=*)(-*)-* |
68 | khs.ddssSe | [SeD4H0](=*)(=*)(-*)-* |
69 | khs.sBr | [BrD1]-* |
70 | khs.sSnH3 | [SnD1H3]-* |
71 | khs.ssSnH2 | [SnD2H2](-*)-* |
72 | khs.sssSnH | [SnD3H1](-*)(-*)-* |
73 | khs.ssssSn | [SnD4H0](-*)(-*)(-*)-* |
74 | khs.sI | [ID1]-* |
75 | khs.sPbH3 | [PbD1H3]-* |
76 | khs.ssPbH2 | [PbD2H2](-*)-* |
77 | khs.sssPbH | [PbD3H1](-*)(-*)-* |
78 | khs.ssssPb | [PbD4H0](-*)(-*)(-*)-* |
Butina, D. , Performance of Kier-Hall E-state Descriptors in Quantitative Structure Activity Relationship (QSAR) Studies of Multifunctional Molecules, Molecules, 2004, 9:1004-1009.
Hall, L.H. and Kier, L.B. , Electrotopological State Indices for Atom Types: A Novel Combination of Electronic, Topological, and Valence State Information, Journal of Chemical Information and Computer Science, 1995, 35:1039-1045.
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugKierHallSmarts(mol)
head(dat)
Calculate the KR (Klekota and Roth) Molecular Fingerprints (in Compact Format)
extractDrugKR(molecules, silent = TRUE)
molecules |
Parsed molecule object. |
silent |
Logical. Whether the calculating process should be shown or not. Default is TRUE. |
Calculate the 4860 bit fingerprint defined by Klekota and Roth.
A list. Each component represents one of the molecules, and each element in the component is the index of a fingerprint bit that is set to 1. Each component's name is the length of the fingerprint.
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
fp = extractDrugKR(mol)
head(fp)
Calculate the KR (Klekota and Roth) Molecular Fingerprints (in Complete Format)
extractDrugKRComplete(molecules, silent = TRUE)
molecules |
Parsed molecule object. |
silent |
Logical. Whether the calculating process should be shown or not. Default is TRUE. |
Calculate the 4860 bit fingerprint defined by Klekota and Roth.
An integer vector or a matrix. Each row represents one molecule, the columns represent the fingerprints.
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
fp = extractDrugKRComplete(mol)
dim(fp)
Descriptor that Calculates the Number of Atoms in the Largest Chain
extractDrugLargestChain(molecules, silent = TRUE)
molecules |
Parsed molecule object. |
silent |
Logical. Whether the calculating process should be shown or not. Default is TRUE. |
This descriptor calculates the number of atoms in the largest chain. Note that a chain exists if there are two or more atoms. Thus single atom molecules will return 0.
A data frame, each row represents one of the molecules, each column represents one feature. This function returns one column named nAtomLC.
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugLargestChain(mol)
head(dat)
Descriptor that Calculates the Number of Atoms in the Largest Pi Chain
extractDrugLargestPiSystem(molecules, silent = TRUE)
molecules |
Parsed molecule object. |
silent |
Logical. Whether the calculating process should be shown or not. Default is TRUE. |
This descriptor calculates the number of atoms in the largest pi chain.
A data frame, each row represents one of the molecules, each column represents one feature. This function returns one column named nAtomP.
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugLargestPiSystem(mol)
head(dat)
Calculate the Ratio of Length to Breadth Descriptor
extractDrugLengthOverBreadth(molecules, silent = TRUE)
molecules |
Parsed molecule object. |
silent |
Logical. Whether the calculating process should be shown or not. Default is TRUE. |
Calculates the ratio of length to breadth. It does not perform any orientation and only considers the X & Y extents for a series of rotations about the Z axis (in 10 degree increments).
A data frame, each row represents one of the molecules, each column represents one feature. This function returns two columns named LOBMAX and LOBMIN:
LOBMAX - The maximum L/B ratio;
LOBMIN - The L/B ratio for the rotation that results in the minimum area (defined by the product of the X & Y extents for that orientation).
The descriptor assumes that the atoms have been configured.
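The rotation scan described above can be sketched in base R. This is an illustration under stated assumptions rather than the package's implementation: the helper name lengthOverBreadth is hypothetical, the L/B value at each angle is assumed to be the larger extent divided by the smaller one, and the toy coordinates are made up.

```r
# Hypothetical helper (not part of Rcpi). Assumption: L/B is the larger of the
# X and Y extents divided by the smaller one at each rotation angle.
lengthOverBreadth <- function(xy, step = 10) {
  lobMax <- -Inf
  lobMin <- NA_real_
  minArea <- Inf
  for (deg in seq(0, 360 - step, by = step)) {
    a <- deg * pi / 180
    rot <- matrix(c(cos(a), sin(a), -sin(a), cos(a)), nrow = 2)
    p <- xy %*% t(rot)                              # rotate about the Z axis
    ext <- apply(p, 2, function(v) diff(range(v)))  # X and Y extents
    ratio <- max(ext) / min(ext)
    lobMax <- max(lobMax, ratio)
    if (prod(ext) < minArea) {                      # track the minimum-area rotation
      minArea <- prod(ext)
      lobMin <- ratio
    }
  }
  c(LOBMAX = lobMax, LOBMIN = lobMin)
}

# Toy input: corners of a 4 x 2 rectangle (made-up X/Y coordinates)
xy <- rbind(c(0, 0), c(4, 0), c(4, 2), c(0, 2))
lob <- lengthOverBreadth(xy)
lob
```

Because no prior orientation is performed, the result depends only on the extents seen during the scan, which is why the input coordinates need to be set beforehand.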
sdf = system.file('sysdata/OptAA3d.sdf', package = 'Rcpi')
mol = readMolFromSDF(sdf)
dat = extractDrugLengthOverBreadth(mol)
head(dat)
Descriptor that Calculates the Number of Atoms in the Longest Aliphatic Chain
extractDrugLongestAliphaticChain(molecules, silent = TRUE)
molecules |
Parsed molecule object. |
silent |
Logical. Whether the calculating process should be shown or not. Default is TRUE. |
This descriptor calculates the number of atoms in the longest aliphatic chain.
A data frame, each row represents one of the molecules, each column represents one feature. This function returns one column named nAtomLAC.
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugLongestAliphaticChain(mol)
head(dat)
Calculate the MACCS Molecular Fingerprints (in Compact Format)
extractDrugMACCS(molecules, silent = TRUE)
molecules |
Parsed molecule object. |
silent |
Logical. Whether the calculating process should be shown or not. Default is TRUE. |
The popular 166 bit MACCS keys described by MDL.
A list. Each component represents one of the molecules, and each element in the component is the index of a fingerprint bit that is set to 1. Each component's name is the length of the fingerprint.
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
fp = extractDrugMACCS(mol)
head(fp)
Calculate the MACCS Molecular Fingerprints (in Complete Format)
extractDrugMACCSComplete(molecules, silent = TRUE)
molecules |
Parsed molecule object. |
silent |
Logical. Whether the calculating process should be shown or not. Default is TRUE. |
The popular 166 bit MACCS keys described by MDL.
An integer vector or a matrix. Each row represents one molecule, the columns represent the fingerprints.
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
fp = extractDrugMACCSComplete(mol)
dim(fp)
Descriptor that Calculates the LogP Based on a Simple Equation Using the Number of Carbons and Hetero Atoms
extractDrugMannholdLogP(molecules, silent = TRUE)
molecules |
Parsed molecule object. |
silent |
Logical. Whether the calculating process should be shown or not. Default is TRUE. |
This descriptor calculates the LogP based on a simple equation using the number of carbons and hetero atoms. The implemented equation was proposed in Mannhold et al.
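For orientation, the sketch below assumes the equation from Mannhold et al. (2009) in the form logP = 1.46 + 0.11 * NC - 0.11 * NHET, where NC is the number of carbons and NHET the number of hetero atoms. The equation is not printed in this page, so the coefficients are an assumption here; the helper name mannholdLogP is hypothetical and not part of Rcpi.

```r
# Hypothetical helper (not part of Rcpi).
# Assumption: logP = 1.46 + 0.11 * NC - 0.11 * NHET
mannholdLogP <- function(nCarbon, nHetero) {
  1.46 + 0.11 * nCarbon - 0.11 * nHetero
}

mannholdLogP(6, 0)  # six carbons, no hetero atoms
mannholdLogP(2, 1)  # two carbons, one hetero atom
```

Since the estimate depends only on two atom counts, structural isomers always receive the same value.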
A data frame, each row represents one of the molecules, each column represents one feature. This function returns one column named MLogP.
Mannhold, R., Poda, G. I., Ostermann, C., & Tetko, I. V. (2009). Calculation of molecular lipophilicity: State-of-the-art and comparison of log P methods on more than 96,000 compounds. Journal of pharmaceutical sciences, 98(3), 861-893.
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugMannholdLogP(mol)
head(dat)
Calculate Molecular Distance Edge (MDE) Descriptors for C, N and O
extractDrugMDE(molecules, silent = TRUE)
molecules |
Parsed molecule object. |
silent |
Logical. Whether the calculating process should be shown or not. Default is TRUE. |
This descriptor calculates the 10 molecular distance edge (MDE) descriptors described in Liu, S., Cao, C., & Li, Z., and in addition it calculates variants where O and N are considered.
A data frame, each row represents one of the molecules, each column represents one feature. The columns hold the MDE descriptors for C together with the variants for N and O.
Liu, S., Cao, C., & Li, Z. (1998). Approach to estimation and prediction for normal boiling point (NBP) of alkanes based on a novel molecular distance-edge (MDE) vector, lambda. Journal of chemical information and computer sciences, 38(3), 387-394.
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugMDE(mol)
head(dat)
Descriptor that Calculates the Principal Moments of Inertia and Ratios of the Principal Moments
extractDrugMomentOfInertia(molecules, silent = TRUE)
molecules |
Parsed molecule object. |
silent |
Logical. Whether the calculating process should be shown or not. Default is TRUE. |
A descriptor that calculates the moment of inertia and radius of gyration. Moment of inertia (MI) values characterize the mass distribution of a molecule. Related to the MI values, ratios of the MI values along the three principal axes are also well-known modeling variables. This descriptor calculates the MI values along the X, Y and Z axes as well as the ratios X/Y, X/Z and Y/Z. Finally it also calculates the radius of gyration of the molecule.
A data frame, each row represents one of the molecules, each column represents one feature. This function returns 7 columns named MOMI.X, MOMI.Y, MOMI.Z, MOMI.XY, MOMI.XZ, MOMI.YZ, MOMI.R:
MOMI.X - MI along X axis
MOMI.Y - MI along Y axis
MOMI.Z - MI along Z axis
MOMI.XY - X/Y
MOMI.XZ - X/Z
MOMI.YZ - Y/Z
MOMI.R - Radius of gyration
One important aspect of the algorithm is that if the eigenvalues of the MI tensor are below 1e-3, then the ratios are set to a default of 1000.
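The fallback rule can be sketched in base R as follows. The helper name miRatios is hypothetical and not part of Rcpi, and applying the 1e-3 check to the denominator of each ratio is our assumption about how the rule is used.

```r
# Hypothetical helper (not part of Rcpi).
# Assumption: the 1e-3 threshold is checked on the denominator of each ratio.
miRatios <- function(mi, eps = 1e-3, fallback = 1000) {
  safeRatio <- function(num, den) {
    if (abs(den) > eps) num / den else fallback
  }
  c(MOMI.XY = safeRatio(mi[1], mi[2]),
    MOMI.XZ = safeRatio(mi[1], mi[3]),
    MOMI.YZ = safeRatio(mi[2], mi[3]))
}

miRatios(c(120, 60, 30))  # all moments well above the threshold
miRatios(c(120, 60, 0))   # vanishing third moment triggers the default
```

The fixed default keeps the output finite for linear or planar arrangements, at the cost of a value that should be treated as a flag rather than a measured ratio.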
sdf = system.file('sysdata/OptAA3d.sdf', package = 'Rcpi')
mol = readMolFromSDF(sdf)
dat = extractDrugMomentOfInertia(mol)
head(dat)
Calculate the FP2 Molecular Fingerprints
extractDrugOBFP2(molecules, type = c("smile", "sdf"))
molecules |
R character string object containing the molecules. See the example section for details. |
type |
'smile' or 'sdf'. |
Calculate the 1024 bit FP2 fingerprints provided by OpenBabel.
A matrix. Each row represents one molecule, the columns represent the fingerprints.
mol1 = 'C1CCC1CC(CN(C)(C))CC(=O)CC'  # one molecule SMILE in a vector
mol2 = c('CCC', 'CCN', 'CCN(C)(C)', 'c1ccccc1Cc1ccccc1',
         'C1CCC1CC(CN(C)(C))CC(=O)CC')  # multiple SMILEs in a vector
mol3 = readChar(system.file('compseq/DB00860.sdf', package = 'Rcpi'),
                nchars = 1e+6)  # single molecule in a sdf file
mol4 = readChar(system.file('sysdata/OptAA3d.sdf', package = 'Rcpi'),
                nchars = 1e+6)  # multiple molecules in a sdf file
## Not run:
smifp0 = extractDrugOBFP2(mol1, type = 'smile')
smifp1 = extractDrugOBFP2(mol2, type = 'smile')
sdffp0 = extractDrugOBFP2(mol3, type = 'sdf')
sdffp1 = extractDrugOBFP2(mol4, type = 'sdf')
## End(Not run)
Calculate the FP3 Molecular Fingerprints
extractDrugOBFP3(molecules, type = c("smile", "sdf"))
molecules |
R character string object containing the molecules. See the example section for details. |
type |
'smile' or 'sdf'. |
Calculate the 64 bit FP3 fingerprints provided by OpenBabel.
A matrix. Each row represents one molecule, the columns represent the fingerprints.
mol1 = 'C1CCC1CC(CN(C)(C))CC(=O)CC'  # one molecule SMILE in a vector
mol2 = c('CCC', 'CCN', 'CCN(C)(C)', 'c1ccccc1Cc1ccccc1',
         'C1CCC1CC(CN(C)(C))CC(=O)CC')  # multiple SMILEs in a vector
mol3 = readChar(system.file('compseq/DB00860.sdf', package = 'Rcpi'),
                nchars = 1e+6)  # single molecule in a sdf file
mol4 = readChar(system.file('sysdata/OptAA3d.sdf', package = 'Rcpi'),
                nchars = 1e+6)  # multiple molecules in a sdf file
## Not run:
smifp0 = extractDrugOBFP3(mol1, type = 'smile')
smifp1 = extractDrugOBFP3(mol2, type = 'smile')
sdffp0 = extractDrugOBFP3(mol3, type = 'sdf')
sdffp1 = extractDrugOBFP3(mol4, type = 'sdf')
## End(Not run)
Calculate the FP4 Molecular Fingerprints
extractDrugOBFP4(molecules, type = c("smile", "sdf"))
molecules |
R character string object containing the molecules. See the example section for details. |
type |
'smile' or 'sdf'. |
Calculate the 512 bit FP4 fingerprints provided by OpenBabel.
A matrix. Each row represents one molecule, the columns represent the fingerprints.
mol1 = 'C1CCC1CC(CN(C)(C))CC(=O)CC' # one molecule SMILE in a vector
mol2 = c('CCC', 'CCN', 'CCN(C)(C)', 'c1ccccc1Cc1ccccc1',
         'C1CCC1CC(CN(C)(C))CC(=O)CC') # multiple SMILEs in a vector
mol3 = readChar(system.file('compseq/DB00860.sdf', package = 'Rcpi'),
                nchars = 1e+6) # single molecule in a sdf file
mol4 = readChar(system.file('sysdata/OptAA3d.sdf', package = 'Rcpi'),
                nchars = 1e+6) # multiple molecules in a sdf file
## Not run:
smifp0 = extractDrugOBFP4(mol1, type = 'smile')
smifp1 = extractDrugOBFP4(mol2, type = 'smile')
sdffp0 = extractDrugOBFP4(mol3, type = 'sdf')
sdffp1 = extractDrugOBFP4(mol4, type = 'sdf')
## End(Not run)
Calculate the MACCS Molecular Fingerprints
extractDrugOBMACCS(molecules, type = c("smile", "sdf"))
molecules |
R character string object containing the molecules. See the example section for details. |
type |
'smile' or 'sdf'. |
Calculate the 256 bit MACCS fingerprints provided by OpenBabel.
A matrix. Each row represents one molecule, the columns represent the fingerprints.
mol1 = 'C1CCC1CC(CN(C)(C))CC(=O)CC' # one molecule SMILE in a vector
mol2 = c('CCC', 'CCN', 'CCN(C)(C)', 'c1ccccc1Cc1ccccc1',
         'C1CCC1CC(CN(C)(C))CC(=O)CC') # multiple SMILEs in a vector
mol3 = readChar(system.file('compseq/DB00860.sdf', package = 'Rcpi'),
                nchars = 1e+6) # single molecule in a sdf file
mol4 = readChar(system.file('sysdata/OptAA3d.sdf', package = 'Rcpi'),
                nchars = 1e+6) # multiple molecules in a sdf file
## Not run:
# MACCS may not be available in current version of ChemmineOB
smifp0 = extractDrugOBMACCS(mol1, type = 'smile')
smifp1 = extractDrugOBMACCS(mol2, type = 'smile')
sdffp0 = extractDrugOBMACCS(mol3, type = 'sdf')
sdffp1 = extractDrugOBMACCS(mol4, type = 'sdf')
## End(Not run)
Descriptor that Calculates the Petitjean Number of a Molecule
extractDrugPetitjeanNumber(molecules, silent = TRUE)
molecules |
Parsed molecule object. |
silent |
Logical. Whether the calculating process should be shown or not, default is TRUE. |
This descriptor calculates the Petitjean number of a molecule. According to the Petitjean definition, the eccentricity of a vertex corresponds to the distance from that vertex to the most remote vertex in the graph.
The distance is obtained from the distance matrix as the count of edges between the two vertices. If r(i) is the largest matrix entry in row i of the distance matrix D, then the radius is defined as the smallest of the r(i). The graph diameter D is defined as the largest vertex eccentricity in the graph.
(http://www.edusoft-lc.com/molconn/manuals/400/chaptwo.html)
A data frame; each row represents one of the molecules, and each column represents one feature. This function returns one column named PetitjeanNumber.
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugPetitjeanNumber(mol)
head(dat)
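The radius and diameter described above can be illustrated in base R without Rcpi. The sketch below builds a toy heavy-atom graph (isobutane, written by hand), derives vertex eccentricities from the shortest-path distance matrix, and then combines them as (diameter - radius) / diameter. That final formula is an assumption about how the underlying descriptor is defined; the text above only defines the radius and the diameter. The helper name graph_dist is hypothetical.

```r
# Toy heavy-atom graph of isobutane: central carbon 1 bonded to carbons 2, 3, 4
adj <- matrix(0, 4, 4)
adj[1, 2:4] <- 1
adj[2:4, 1] <- 1

# All-pairs shortest path lengths (edge counts) via Floyd-Warshall
graph_dist <- function(adj) {
  n <- nrow(adj)
  d <- ifelse(adj > 0, 1, Inf)
  diag(d) <- 0
  for (k in seq_len(n)) {
    # d[i, j] <- min(d[i, j], d[i, k] + d[k, j]) for all i, j at once
    d <- pmin(d, outer(d[, k], d[k, ], `+`))
  }
  d
}

d <- graph_dist(adj)
ecc <- apply(d, 1, max)   # r(i): eccentricity of each vertex
radius <- min(ecc)        # smallest eccentricity
diameter <- max(ecc)      # largest eccentricity

# Assumed definition of the Petitjean number (not stated in the text above)
petitjean <- if (diameter == 0) 0 else (diameter - radius) / diameter
```

For this star-shaped graph the central atom has eccentricity 1 and the three terminal atoms have eccentricity 2.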
Descriptor that Calculates the Petitjean Shape Indices
extractDrugPetitjeanShapeIndex(molecules, silent = TRUE)
molecules |
Parsed molecule object. |
silent |
Logical. Whether the calculating process should be shown or not, default is TRUE. |
The topological and geometric shape indices described by Petitjean and by Bath et al., respectively. Both measure the anisotropy in a molecule.
A data frame; each row represents one of the molecules, and each column represents one feature. This function returns two columns named topoShape (Topological Shape Index) and geomShape (Geometric Shape Index).
Petitjean, M., Applications of the radius-diameter diagram to the classification of topological and geometrical shapes of chemical compounds, Journal of Chemical Information and Computer Sciences, 1992, 32:331-337.
Bath, P.A., Poirrette, A.R., Willett, P., and Allen, F.H., The Extent of the Relationship between the Graph-Theoretical and the Geometrical Shape Coefficients of Chemical Compounds, Journal of Chemical Information and Computer Sciences, 1995, 35:714-716.
sdf = system.file('sysdata/OptAA3d.sdf', package = 'Rcpi')
mol = readMolFromSDF(sdf)
dat = extractDrugPetitjeanShapeIndex(mol)
head(dat)
Calculate the PubChem Molecular Fingerprints (in Compact Format)
extractDrugPubChem(molecules, silent = TRUE)
molecules |
Parsed molecule object. |
silent |
Logical. Whether the calculating process should be shown or not, default is TRUE. |
Calculate the 881 bit fingerprints defined by PubChem.
A list with one component per molecule. Each component holds the indices of the fingerprint bits that are set to 1. Each component's name is the length of the fingerprint.
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
fp = extractDrugPubChem(mol)
head(fp)
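To make the relationship between the compact and complete formats concrete, here is a minimal base R sketch that expands a compact-format list into a 0/1 matrix. The input is a made-up 8-bit example rather than real output, and the helper compact_to_complete is hypothetical; it only assumes the layout described above (on-bit indices per molecule, with the fingerprint length as the component name).

```r
# Hypothetical compact-format fingerprints for two molecules, 8 bits each
fp_compact <- list(`8` = c(1L, 4L, 7L), `8` = c(2L, 4L))

# Expand on-bit indices into one 0/1 row per molecule
compact_to_complete <- function(fp, size = as.integer(names(fp)[1])) {
  mat <- matrix(0L, nrow = length(fp), ncol = size)
  for (i in seq_along(fp)) {
    mat[i, fp[[i]]] <- 1L
  }
  mat
}

fp_complete <- compact_to_complete(fp_compact)
```

The compact form is the cheaper one to store for sparse fingerprints, while the matrix form is usually what modelling functions expect.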
Calculate the PubChem Molecular Fingerprints (in Complete Format)
extractDrugPubChemComplete(molecules, silent = TRUE)
molecules |
Parsed molecule object. |
silent |
Logical. Whether the calculating process should be shown or not, default is TRUE. |
Calculate the 881 bit fingerprints defined by PubChem.
An integer vector or a matrix. Each row represents one molecule, the columns represent the fingerprints.
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
fp = extractDrugPubChemComplete(mol)
dim(fp)
Descriptor that Calculates the Number of Rotatable Bonds on A Molecule
extractDrugRotatableBondsCount(molecules, silent = TRUE)
molecules |
Parsed molecule object. |
silent |
Logical. Whether the calculating process should be shown or not, default is TRUE. |
The number of rotatable bonds is given by the SMARTS pattern specified in the Daylight SMARTS tutorial (https://www.daylight.com/dayhtml_tutorials/languages/smarts/smarts_examples.html).
A data frame; each row represents one of the molecules, and each column represents one feature. This function returns one column named nRotB.
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugRotatableBondsCount(mol)
head(dat)
Descriptor that Calculates the Number of Failures of Lipinski's Rule of Five
extractDrugRuleOfFive(molecules, silent = TRUE)
molecules |
Parsed molecule object. |
silent |
Logical. Whether the calculating process should be shown or not, default is TRUE. |
This descriptor calculates the number of failures of Lipinski's Rule of Five: http://en.wikipedia.org/wiki/Lipinski%27s_Rule_of_Five.
A data frame; each row represents one of the molecules, and each column represents one feature. This function returns one column named LipinskiFailures.
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugRuleOfFive(mol)
head(dat)
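As a rough illustration of what a "number of failures" means, the sketch below counts violations of the four classical Lipinski criteria (molecular weight over 500, logP over 5, more than 5 hydrogen bond donors, more than 10 hydrogen bond acceptors) on made-up property values. The data frame and its column names are hypothetical, and the descriptor computed by extractDrugRuleOfFive may apply additional or slightly different criteria, so treat this as a sketch rather than a reimplementation.

```r
# Hypothetical, precomputed property values for three molecules
props <- data.frame(
  MW     = c(180.2, 720.9, 510.0),
  XLogP  = c(1.2, 6.3, 4.1),
  nHBDon = c(1, 6, 2),
  nHBAcc = c(4, 12, 9)
)

# Each comparison yields TRUE/FALSE; adding them counts the violated rules
lipinski_failures <- with(
  props,
  (MW > 500) + (XLogP > 5) + (nHBDon > 5) + (nHBAcc > 10)
)
```

Here the first molecule violates no rule, the second violates all four, and the third violates only the molecular weight rule.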
Calculate the Shortest Path Molecular Fingerprints (in Compact Format)
extractDrugShortestPath(molecules, depth = 6, size = 1024, silent = TRUE)
molecules |
Parsed molecule object. |
depth |
The search depth. Default is 6. |
size |
The length of the fingerprint bit string. Default is 1024. |
silent |
Logical. Whether the calculating process should be shown or not, default is TRUE. |
Calculate the fingerprint based on the shortest paths between pairs of atoms, taking into account ring systems, charges, etc.
A list with one component per molecule. Each component holds the indices of the fingerprint bits that are set to 1. Each component's name is the length of the fingerprint.
See extractDrugShortestPathComplete for the complete format.
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
fp = extractDrugShortestPath(mol)
head(fp)
Calculate the Shortest Path Molecular Fingerprints (in Complete Format)
extractDrugShortestPathComplete(molecules, depth = 6, size = 1024, silent = TRUE)
molecules |
Parsed molecule object. |
depth |
The search depth. Default is 6. |
size |
The length of the fingerprint bit string. Default is 1024. |
silent |
Logical. Whether the calculating process should be shown or not, default is TRUE. |
Calculate the fingerprint based on the shortest paths between pairs of atoms, taking into account ring systems, charges, etc.
An integer vector or a matrix. Each row represents one molecule, the columns represent the fingerprints.
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
fp = extractDrugShortestPathComplete(mol)
dim(fp)
Calculate the Standard Molecular Fingerprints (in Compact Format)
extractDrugStandard(molecules, depth = 6, size = 1024, silent = TRUE)
molecules |
Parsed molecule object. |
depth |
The search depth. Default is 6. |
size |
The length of the fingerprint bit string. Default is 1024. |
silent |
Logical. Whether the calculating process should be shown or not, default is TRUE. |
Calculate the standard molecular fingerprints, which consider paths of a given length. These are hashed fingerprints, with a default length of 1024.
A list with one component per molecule. Each component holds the indices of the fingerprint bits that are set to 1. Each component's name is the length of the fingerprint.
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
fp = extractDrugStandard(mol)
head(fp)
Calculate the Standard Molecular Fingerprints (in Complete Format)
extractDrugStandardComplete(molecules, depth = 6, size = 1024, silent = TRUE)
molecules |
Parsed molecule object. |
depth |
The search depth. Default is 6. |
size |
The length of the fingerprint bit string. Default is 1024. |
silent |
Logical. Whether the calculating process should be shown or not, default is TRUE. |
Calculate the standard molecular fingerprints, which consider paths of a given length. These are hashed fingerprints, with a default length of 1024.
An integer vector or a matrix. Each row represents one molecule, the columns represent the fingerprints.
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
fp = extractDrugStandardComplete(mol)
dim(fp)
Descriptor of Topological Polar Surface Area Based on Fragment Contributions (TPSA)
extractDrugTPSA(molecules, silent = TRUE)
molecules |
Parsed molecule object. |
silent |
Logical. Whether the calculating process should be shown or not, default is TRUE. |
Calculate the descriptor of topological polar surface area based on fragment contributions (TPSA).
A data frame; each row represents one of the molecules, and each column represents one feature. This function returns one column named TopoPSA.
Ertl, P., Rohde, B., & Selzer, P. (2000). Fast calculation of molecular polar surface area as a sum of fragment-based contributions and its application to the prediction of drug transport properties. Journal of medicinal chemistry, 43(20), 3714-3717.
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugTPSA(mol)
head(dat)
Descriptor that Calculates the Volume of A Molecule
extractDrugVABC(molecules, silent = TRUE)
molecules |
Parsed molecule object. |
silent |
Logical. Whether the calculating process should be shown or not, default is TRUE. |
This descriptor calculates the volume of a molecule.
A data frame; each row represents one of the molecules, and each column represents one feature. This function returns one column named VABC.
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugVABC(mol)
head(dat)
Descriptor that Calculates the Vertex Adjacency Information of A Molecule
extractDrugVAdjMa(molecules, silent = TRUE)
molecules |
Parsed molecule object. |
silent |
Logical. Whether the calculating process should be shown or not, default is TRUE. |
Vertex adjacency information (magnitude): 1 + log2(m), where m is the number of heavy-heavy bonds. If m is zero, then 0 is returned.
A data frame; each row represents one of the molecules, and each column represents one feature. This function returns one column named VAdjMat.
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugVAdjMa(mol)
head(dat)
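For intuition about the scale of this descriptor, the following base R sketch evaluates the vertex adjacency magnitude for a few bond counts. It assumes the usual definition 1 + log2(m), with m the number of heavy-heavy bonds and 0 returned when m is zero; the function name vadjma is hypothetical and is not part of Rcpi.

```r
# Assumed definition: 1 + log2(m), or 0 when there are no heavy-heavy bonds
vadjma <- function(n_heavy_bonds) {
  ifelse(n_heavy_bonds == 0, 0, 1 + log2(n_heavy_bonds))
}

# Bond counts for, e.g., methane (0), ethane (1), and a molecule with 8 bonds
vals <- vadjma(c(0, 1, 8))
```

Because of the logarithm, doubling the number of heavy-heavy bonds raises the value by exactly 1.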
Descriptor that Calculates the Total Weight of Atoms
extractDrugWeight(molecules, silent = TRUE)
molecules |
Parsed molecule object. |
silent |
Logical. Whether the calculating process should be shown or not, default is TRUE. |
This descriptor calculates the molecular weight.
A data frame; each row represents one of the molecules, and each column represents one feature. This function returns one column named MW.
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugWeight(mol)
head(dat)
Descriptor that Calculates the Weighted Path (Molecular ID)
extractDrugWeightedPath(molecules, silent = TRUE)
molecules |
Parsed molecule object. |
silent |
Logical. Whether the calculating process should be shown or not, default is TRUE. |
This descriptor calculates the weighted path (molecular ID) described by Randic, characterizing molecular branching. Five descriptors are calculated, based on the implementation in the ADAPT software package. Note that the descriptor is based on identifying all paths between pairs of atoms and so is NP-hard. This means that it can take some time for large, complex molecules.
A data frame; each row represents one of the molecules, and each column represents one feature. This function returns 5 columns named WTPT.1, WTPT.2, WTPT.3, WTPT.4, WTPT.5:
WTPT.1 - molecular ID
WTPT.2 - molecular ID / number of atoms
WTPT.3 - sum of path lengths starting from heteroatoms
WTPT.4 - sum of path lengths starting from oxygens
WTPT.5 - sum of path lengths starting from nitrogens
Randic, M., On molecular identification numbers, Journal of Chemical Information and Computer Sciences, 1984, 24:164-175.
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugWeightedPath(mol)
head(dat)
Calculate Holistic Descriptors Described by Todeschini et al.
extractDrugWHIM(molecules, silent = TRUE)
molecules |
Parsed molecule object. |
silent |
Logical. Whether the calculating process should be shown or not, default is TRUE. |
Holistic descriptors described by Todeschini et al. The descriptors are based on a number of atom weightings. There are six different possible weightings:
1. unit weights
2. atomic masses
3. van der Waals volumes
4. Mulliken atomic electronegativities
5. atomic polarizabilities
6. E-state values described by Kier and Hall
Currently weighting schemes 1, 2, 3, 4 and 5 are implemented. The weight values are taken from Todeschini et al. and as a result 19 elements are considered. For each weighting scheme we can obtain:
11 directional WHIM descriptors (lambda1 .. 3, nu1 .. 2, gamma1 .. 3, eta1 .. 3)
6 non-directional WHIM descriptors (T, A, V, K, G, D)
Though Todeschini et al. mention that for planar molecules only 8 directional WHIM descriptors are required, the current code will return all 11.
A data frame, each row represents one of the molecules, each column represents one feature. This function returns 17 columns:
Wlambda1
Wlambda2
Wlambda3
Wnu1
Wnu2
Wgamma1
Wgamma2
Wgamma3
Weta1
Weta2
Weta3
WT
WA
WV
WK
WG
WD
Each name will have a suffix of the form .X where X indicates the weighting scheme used. Possible values of X are:
unity
mass
volume
eneg
polar
Todeschini, R. and Gramatica, P., New 3D Molecular Descriptors: The WHIM theory and QSAR Applications, Perspectives in Drug Discovery and Design, 1998, ?:355-380.
sdf = system.file('sysdata/OptAA3d.sdf', package = 'Rcpi')
mol = readMolFromSDF(sdf)
dat = extractDrugWHIM(mol)
head(dat)
Descriptor that Calculates Wiener Path Number and Wiener Polarity Number
extractDrugWienerNumbers(molecules, silent = TRUE)
molecules |
Parsed molecule object. |
silent |
Logical. Whether the calculating process should be shown or not, default is TRUE. |
This descriptor calculates the Wiener numbers, including the Wiener Path number and the Wiener Polarity Number. Wiener path number: half the sum of all the distance matrix entries; Wiener polarity number: half the sum of all the distance matrix entries with a value of 3.
A data frame; each row represents one of the molecules, and each column represents one feature. This function returns two columns named WPATH (Wiener path number) and WPOL (Wiener polarity number).
Wiener, H. (1947). Structural determination of paraffin boiling points. Journal of the American Chemical Society, 69(1), 17-20.
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugWienerNumbers(mol)
head(dat)
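The two Wiener numbers can be worked through by hand on a small graph. The sketch below uses the heavy-atom chain of n-butane, whose topological distance matrix is simply |i - j|. It reads the polarity number as the number of atom pairs at distance 3 (each pair appears twice in the symmetric matrix, hence the division by 2); that reading of "entries with a value of 3" is an assumption based on the usual definition.

```r
# n-butane heavy-atom chain C1-C2-C3-C4: distance between atoms i and j is |i - j|
d <- abs(outer(1:4, 1:4, `-`))

# Wiener path number: half the sum of all distance matrix entries
wpath <- sum(d) / 2

# Wiener polarity number: number of atom pairs separated by exactly 3 bonds
wpol <- sum(d == 3) / 2
```

The six pairwise distances are 1, 1, 1, 2, 2 and 3, so the path number is their sum and only the C1-C4 pair contributes to the polarity number.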
Descriptor that Calculates the Prediction of logP Based on the Atom-Type Method Called XLogP
extractDrugXLogP(molecules, silent = TRUE)
molecules |
Parsed molecule object. |
silent |
Logical. Whether the calculating process should be shown or not, default is TRUE. |
Prediction of logP based on the atom-type method called XLogP.
A data frame; each row represents one of the molecules, and each column represents one feature. This function returns one column named XLogP.
Wang, R., Fu, Y., and Lai, L., A New Atom-Additive Method for Calculating Partition Coefficients, Journal of Chemical Information and Computer Sciences, 1997, 37:615-621.
Wang, R., Gao, Y., and Lai, L., Calculating partition coefficient by atom-additive method, Perspectives in Drug Discovery and Design, 2000, 19:47-66.
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugXLogP(mol)
head(dat)
Descriptor that Calculates the Sum of the Squared Atom Degrees of All Heavy Atoms
extractDrugZagrebIndex(molecules, silent = TRUE)
molecules |
Parsed molecule object. |
silent |
Logical. Whether the calculating process should be shown or not, default is TRUE. |
Zagreb index: the sum of the squares of atom degree over all heavy atoms i.
A data frame; each row represents one of the molecules, and each column represents one feature. This function returns one column named Zagreb.
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugZagrebIndex(mol)
head(dat)
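The Zagreb index definition translates directly into a one-liner on an adjacency matrix. The sketch below uses a hand-written heavy-atom graph of isobutane and assumes that "atom degree" means the number of heavy-atom neighbours; it does not call Rcpi.

```r
# Heavy-atom adjacency matrix of isobutane: carbon 1 bonded to carbons 2, 3, 4
adj <- matrix(0, 4, 4)
adj[1, 2:4] <- 1
adj[2:4, 1] <- 1

deg <- rowSums(adj)    # heavy-atom degree of each atom: 3, 1, 1, 1
zagreb <- sum(deg^2)   # 3^2 + 1^2 + 1^2 + 1^2
```

Branched atoms dominate the sum because degrees are squared, which is why the index is sensitive to molecular branching.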
Generalized BLOSUM and PAM Matrix-Derived Descriptors
extractPCMBLOSUM(x, submat = "AABLOSUM62", k, lag, scale = TRUE, silent = TRUE)
x |
A character vector, as the input protein sequence. |
submat |
Substitution matrix for the 20 amino acids. Should be one of the BLOSUM or PAM matrices provided by Rcpi. Default is "AABLOSUM62". |
k |
Integer. The number of selected scales (i.e. the first k scales). |
lag |
The lag parameter. Must be less than the number of amino acids in the protein sequence. |
scale |
Logical. Should we auto-scale the substitution matrix? Default is TRUE. |
silent |
Logical. Whether we print the relative importance of each scale (diagonal values of the eigendecomposition result matrix B) or not. Default is TRUE. |
This function calculates the generalized BLOSUM matrix-derived descriptors. For users' convenience, Rcpi provides the BLOSUM45, BLOSUM50, BLOSUM62, BLOSUM80, BLOSUM100, PAM30, PAM40, PAM70, PAM120, and PAM250 matrices for the 20 amino acids to select.
A named vector of length lag * p^2, where p is the number of scales selected.
Georgiev, A. G. (2009). Interpretable numerical descriptors of amino acid space. Journal of Computational Biology, 16(5), 703–723.
x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
blosum = extractPCMBLOSUM(x, submat = 'AABLOSUM62', k = 5, lag = 7,
                          scale = TRUE, silent = FALSE)
Scales-Based Descriptors with 20+ classes of Molecular Descriptors
extractPCMDescScales(x, propmat, index = NULL, pc, lag, scale = TRUE, silent = TRUE)
x |
A character vector, as the input protein sequence. |
propmat |
The matrix containing the descriptor set for the amino acids, which could be chosen from the descriptor datasets provided by Rcpi (e.g. AATopo). |
index |
Integer vector or character vector. Specify which molecular descriptors to select from one of these descriptor sets by giving the numerical or character index of the molecular descriptors in the descriptor set. Default is NULL. |
pc |
Integer. The maximum dimension of the space which the data are to be represented in. Must be no greater than the number of AA properties provided. |
lag |
The lag parameter. Must be less than the number of amino acids in the protein sequence. |
scale |
Logical. Should we auto-scale the property matrix? Default is TRUE. |
silent |
Logical. Whether we print the standard deviation, proportion of variance and the cumulative proportion of the selected principal components or not. Default is TRUE. |
This function calculates the scales-based descriptors with molecular descriptor sets calculated by Dragon, Discovery Studio and MOE. Users can specify which molecular descriptors to select from one of these descriptor sets by giving the numerical or character index of the molecular descriptors in the descriptor set.
A named vector of length lag * p^2, where p is the number of scales selected.
See extractPCMScales for generalized AA-descriptor based scales descriptors.
x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
descscales = extractPCMDescScales(x, propmat = 'AATopo', index = c(37:41, 43:47),
                                  pc = 5, lag = 7, silent = FALSE)
Generalized Scales-Based Descriptors derived by Factor Analysis
extractPCMFAScales(x, propmat, factors, scores = "regression", lag, scale = TRUE, silent = TRUE)
x |
A character vector, as the input protein sequence. |
propmat |
A matrix containing the properties for the amino acids. Each row represents one amino acid type, each column represents one property. Note that the one-letter row names must be provided, because they are used to look up the properties for each AA type. |
factors |
Integer. The number of factors to be fitted. Must be no greater than the number of AA properties provided. |
scores |
Type of scores to produce. The default is "regression". |
lag |
The lag parameter. Must be less than the number of amino acids in the protein sequence. |
scale |
Logical. Should we auto-scale the property matrix? Default is TRUE. |
silent |
Logical. Whether we print the SS loadings, proportion of variance and the cumulative proportion of the selected factors or not. Default is TRUE. |
This function calculates the generalized scales-based descriptors derived by Factor Analysis (FA). Users could provide customized amino acid property matrices.
A named vector of length lag * p^2, where p is the number of scales (factors) selected.
Atchley, W. R., Zhao, J., Fernandes, A. D., & Druke, T. (2005). Solving the protein sequence metric problem. Proceedings of the National Academy of Sciences of the United States of America, 102(18), 6395-6400.
x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
data(AATopo)
tprops = AATopo[, c(37:41, 43:47)] # select a set of topological descriptors
fa = extractPCMFAScales(x, propmat = tprops, factors = 5, lag = 7, silent = FALSE)
Generalized Scales-Based Descriptors derived by Multidimensional Scaling
extractPCMMDSScales(x, propmat, k, lag, scale = TRUE, silent = TRUE)
x |
A character vector, as the input protein sequence. |
propmat |
A matrix containing the properties for the amino acids. Each row represents one amino acid type, each column represents one property. Note that the one-letter row names must be provided, because they are used to look up the properties for each AA type. |
k |
Integer. The maximum dimension of the space which the data are to be represented in. Must be no greater than the number of AA properties provided. |
lag |
The lag parameter. Must be less than the number of amino acids in the protein sequence. |
scale |
Logical. Should we auto-scale the property matrix? Default is TRUE. |
silent |
Logical. Whether we print the |
This function calculates the generalized scales-based descriptors derived by Multidimensional Scaling (MDS). Users could provide customized amino acid property matrices.
A named vector of length lag * p^2, where p is the number of scales (dimensionality) selected.
Venkatarajan, M. S., & Braun, W. (2001). New quantitative descriptors of amino acids based on multidimensional scaling of a large number of physical-chemical properties. Molecular modeling annual, 7(12), 445–453.
See extractPCMScales for generalized scales-based descriptors derived by Principal Components Analysis.
x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
data(AATopo)
tprops = AATopo[, c(37:41, 43:47)] # select a set of topological descriptors
mds = extractPCMMDSScales(x, propmat = tprops, k = 5, lag = 7, silent = FALSE)
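To show what "scales derived by multidimensional scaling" can look like in practice, the sketch below applies classical MDS from base R (stats::cmdscale) to a random stand-in property matrix. The matrix, its property names, and the choice of Euclidean distances on auto-scaled columns are all assumptions made for illustration; whether extractPCMMDSScales performs exactly these steps internally is not confirmed here.

```r
set.seed(1)

# Stand-in property matrix: 20 amino acids (one-letter row names) x 6 made-up properties
aa <- strsplit('ARNDCEQGHILKMFPSTWYV', '')[[1]]
propmat <- matrix(rnorm(20 * 6), nrow = 20,
                  dimnames = list(aa, paste0('prop', 1:6)))

# Auto-scale the columns, compute pairwise distances between amino acids,
# and embed them in k = 3 dimensions with classical MDS
scales <- cmdscale(dist(scale(propmat)), k = 3)
```

Each row of the result gives one amino acid's coordinates on the 3 derived scales, which can then be looked up residue by residue along a protein sequence.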
Generalized AA-Properties Based Scales Descriptors
extractPCMPropScales(x, index = NULL, pc, lag, scale = TRUE, silent = TRUE)
x |
A character vector, as the input protein sequence. |
index |
Integer vector or character vector. Specify which AAindex properties to select from the AAindex database by giving the numerical or character index of the properties in the AAindex database. Default is NULL. |
pc |
Integer. Use the first pc principal components as the scales. Must be no greater than the number of AA properties provided. |
lag |
The lag parameter. Must be less than the number of amino acids in the protein sequence. |
scale |
Logical. Should we auto-scale the property matrix before PCA? Default is TRUE. |
silent |
Logical. Whether we print the standard deviation, proportion of variance and the cumulative proportion of the selected principal components or not. Default is TRUE. |
This function calculates the generalized amino acid properties based scales descriptors. Users can specify which AAindex properties to select from the AAindex database by giving the numerical or character index of the properties in the AAindex database.
A length lag * p^2 named vector, where p is the number of scales (principal components) selected.
See extractPCMScales for generalized scales-based descriptors.
x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
propscales = extractPCMPropScales(x, index = c(160:165, 258:296),
                                  pc = 5, lag = 7, silent = FALSE)
Generalized Scales-Based Descriptors derived by Principal Components Analysis
extractPCMScales(x, propmat, pc, lag, scale = TRUE, silent = TRUE)
x |
A character vector, as the input protein sequence. |
propmat |
A matrix containing the properties for the amino acids. Each row represents one amino acid type, and each column represents one property. Note that one-letter row names must be provided, because they are used to look up the properties for each AA type. |
pc |
Integer. Use the first pc principal components as the scales. Must be no greater than the number of AA properties provided. |
lag |
The lag parameter. Must be less than the number of amino acids in the sequence. |
scale |
Logical. Should we auto-scale the property matrix (propmat) before PCA? Default is TRUE. |
silent |
Logical. If TRUE, the standard deviation, proportion of variance and cumulative proportion of the selected principal components are not printed. Default is TRUE. |
This function calculates the generalized scales-based descriptors derived by Principal Components Analysis (PCA). Users could provide customized amino acid property matrices. This function implements the core computation procedure needed for the generalized scales-based descriptors derived by AA-Properties (AAindex) and generalized scales-based descriptors derived by 20+ classes of 2D and 3D molecular descriptors (Topological, WHIM, VHSE, etc.).
A length lag * p^2 named vector, where p is the number of scales (principal components) selected.
See extractPCMPropScales for generalized AA property based scales descriptors, and extractPCMDescScales for (19 classes) AA descriptor based scales descriptors.
x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
data(AAindex)
AAidxmat = t(na.omit(as.matrix(AAindex[, 7:26])))
scales = extractPCMScales(x, propmat = AAidxmat, pc = 5, lag = 7, silent = FALSE)
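The PCA step described above can be sketched in base R. This is only an illustration under stated assumptions, not the Rcpi implementation: the property matrix is random toy data, and the names propmat, scales and seq_scales are hypothetical. It shows how the first pc principal components become per-amino-acid scales, which are then looked up for each residue of a sequence.

```r
# Toy property matrix: 20 amino acids (rows) x 6 made-up properties (columns).
set.seed(42)
aa <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H", "I",
        "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")
propmat <- matrix(rnorm(20 * 6), nrow = 20,
                  dimnames = list(aa, paste0("prop", 1:6)))

pc <- 3  # number of principal components kept as scales

# Auto-scale the properties, then run PCA (stats::prcomp).
pca <- prcomp(propmat, center = TRUE, scale. = TRUE)
scales <- pca$x[, seq_len(pc)]  # 20 x pc matrix of scales

# Proportion of variance and cumulative proportion of the kept components.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)[seq_len(pc)]

# Map a (toy) sequence onto the scales via the one-letter row names.
residues <- strsplit("MKVLAAGIVK", "")[[1]]
seq_scales <- scales[residues, , drop = FALSE]  # length(residues) x pc
dim(seq_scales)
```

The one-letter row names are what make the lookup `scales[residues, ]` possible, which is why the property matrix must carry them.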
Amino Acid Composition Descriptor
extractProtAAC(x)
x |
A character vector, as the input protein sequence. |
This function calculates the Amino Acid Composition descriptor (Dim: 20).
A length 20 named vector
M. Bhasin, G. P. S. Raghava. Classification of Nuclear Receptors Based on Amino Acid Composition and Dipeptide Composition. Journal of Biological Chemistry, 2004, 279, 23262.
See extractProtDC and extractProtTC for Dipeptide Composition and Tripeptide Composition descriptors.
x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
extractProtAAC(x)
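For intuition, the amino acid composition is simply the fraction of each of the 20 amino acid types in the sequence. The base R sketch below uses a hypothetical helper, aa_composition, and a toy sequence; it is not the package's code.

```r
# Hypothetical helper: fraction of each of the 20 standard amino acids.
aa_composition <- function(seq) {
  aa <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H", "I",
          "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")
  residues <- strsplit(seq, "")[[1]]
  # factor() with fixed levels keeps zero counts for absent amino acids
  counts <- table(factor(residues, levels = aa))
  setNames(as.numeric(counts) / length(residues), aa)
}

aac <- aa_composition("MKVLAAGIVK")
aac[["A"]]  # 2 of the 10 residues are alanine
```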
Amphiphilic Pseudo Amino Acid Composition Descriptor
extractProtAPAAC( x, props = c("Hydrophobicity", "Hydrophilicity"), lambda = 30, w = 0.05, customprops = NULL )
x |
A character vector, as the input protein sequence. |
props |
A character vector, specifying the properties used. 2 properties are used by default: 'Hydrophobicity' and 'Hydrophilicity'. |
lambda |
The lambda parameter for the APAAC descriptors, default is 30. |
w |
The weighting factor, default is 0.05. |
customprops |
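A data frame of customized properties, one property per row: an AccNo column naming each property, followed by one column per amino acid type (as in the example). Default is NULL. |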
This function calculates the Amphiphilic Pseudo Amino Acid Composition (APAAC) descriptor (Dim: 20 + (n * lambda), where n is the number of properties selected; default is 80).
A length 20 + n * lambda named vector, where n is the number of properties selected.
Note that the default 20 * 2 property values are already built into the function. Users can also specify other (up to 544) properties by their Accession Number in the AAindex data, with or without the default two properties; in either case, users should explicitly specify the properties to use.
Kuo-Chen Chou. Prediction of Protein Cellular Attributes Using Pseudo-Amino Acid Composition. PROTEINS: Structure, Function, and Genetics, 2001, 43: 246-255.
Type 2 pseudo amino acid composition. http://www.csbio.sjtu.edu.cn/bioinf/PseAAC/type2.htm
Kuo-Chen Chou. Using Amphiphilic Pseudo Amino Acid Composition to Predict Enzyme Subfamily Classes. Bioinformatics, 2005, 21, 10-19.
JACS, 1962, 84: 4240-4246. (C. Tanford). (The hydrophobicity data)
PNAS, 1981, 78:3824-3828 (T.P.Hopp & K.R.Woods). (The hydrophilicity data)
See extractProtPAAC for pseudo amino acid composition descriptor.
x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
extractProtAPAAC(x)

myprops = data.frame(AccNo = c("MyProp1", "MyProp2", "MyProp3"),
                     A = c(0.62, -0.5, 15), R = c(-2.53, 3, 101),
                     N = c(-0.78, 0.2, 58), D = c(-0.9, 3, 59),
                     C = c(0.29, -1, 47), E = c(-0.74, 3, 73),
                     Q = c(-0.85, 0.2, 72), G = c(0.48, 0, 1),
                     H = c(-0.4, -0.5, 82), I = c(1.38, -1.8, 57),
                     L = c(1.06, -1.8, 57), K = c(-1.5, 3, 73),
                     M = c(0.64, -1.3, 75), F = c(1.19, -2.5, 91),
                     P = c(0.12, 0, 42), S = c(-0.18, 0.3, 31),
                     T = c(-0.05, -0.4, 45), W = c(0.81, -3.4, 130),
                     Y = c(0.26, -2.3, 107), V = c(1.08, -1.5, 43))

# Use 2 default properties, 4 properties in the AAindex database,
# and 3 customized properties
extractProtAPAAC(x, customprops = myprops,
                 props = c('Hydrophobicity', 'Hydrophilicity',
                           'CIDH920105', 'BHAR880101',
                           'CHAM820101', 'CHAM820102',
                           'MyProp1', 'MyProp2', 'MyProp3'))
CTD Descriptors - Composition
extractProtCTDC(x)
x |
A character vector, as the input protein sequence. |
This function calculates the Composition descriptor of the CTD descriptors (Dim: 21).
A length 21 named vector
Inna Dubchak, Ilya Muchnik, Stephen R. Holbrook and Sung-Hou Kim. Prediction of protein folding class using global description of amino acid sequence. Proceedings of the National Academy of Sciences. USA, 1995, 92, 8700-8704.
Inna Dubchak, Ilya Muchnik, Christopher Mayor, Igor Dralyuk and Sung-Hou Kim. Recognition of a Protein Fold in the Context of the SCOP classification. Proteins: Structure, Function and Genetics, 1999, 35, 401-407.
See extractProtCTDT and extractProtCTDD for the Transition and Distribution descriptors.
x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
extractProtCTDC(x)
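The 21 dimensions come from 7 physicochemical attributes with 3 groups each. As a rough sketch, the Composition part for a single attribute can be computed as below. The three-group hydrophobicity partition shown is the one commonly quoted for CTD descriptors and is stated here as an assumption, as is the helper name ctd_composition_one; the package may group residues differently.

```r
# Hypothetical helper: group composition for ONE attribute (hydrophobicity).
# Assumed grouping: polar, neutral and hydrophobic residues.
ctd_composition_one <- function(seq) {
  groups <- list(
    polar       = c("R", "K", "E", "D", "Q", "N"),
    neutral     = c("G", "A", "S", "T", "P", "H", "Y"),
    hydrophobic = c("C", "L", "V", "I", "M", "F", "W")
  )
  residues <- strsplit(seq, "")[[1]]
  # Fraction of residues falling in each group
  sapply(groups, function(g) sum(residues %in% g) / length(residues))
}

comp <- ctd_composition_one("MKVLAAGIVK")
comp
```

Repeating this for each of the 7 attributes and concatenating the results gives the 7 * 3 = 21 values.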
CTD Descriptors - Distribution
extractProtCTDD(x)
x |
A character vector, as the input protein sequence. |
This function calculates the Distribution descriptor of the CTD descriptors (Dim: 105).
A length 105 named vector
Inna Dubchak, Ilya Muchnik, Stephen R. Holbrook and Sung-Hou Kim. Prediction of protein folding class using global description of amino acid sequence. Proceedings of the National Academy of Sciences. USA, 1995, 92, 8700-8704.
Inna Dubchak, Ilya Muchnik, Christopher Mayor, Igor Dralyuk and Sung-Hou Kim. Recognition of a Protein Fold in the Context of the SCOP classification. Proteins: Structure, Function and Genetics, 1999, 35, 401-407.
See extractProtCTDC and extractProtCTDT for the Composition and Transition descriptors.
x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
extractProtCTDD(x)
CTD Descriptors - Transition
extractProtCTDT(x)
x |
A character vector, as the input protein sequence. |
This function calculates the Transition descriptor of the CTD descriptors (Dim: 21).
A length 21 named vector
Inna Dubchak, Ilya Muchnik, Stephen R. Holbrook and Sung-Hou Kim. Prediction of protein folding class using global description of amino acid sequence. Proceedings of the National Academy of Sciences. USA, 1995, 92, 8700-8704.
Inna Dubchak, Ilya Muchnik, Christopher Mayor, Igor Dralyuk and Sung-Hou Kim. Recognition of a Protein Fold in the Context of the SCOP classification. Proteins: Structure, Function and Genetics, 1999, 35, 401-407.
See extractProtCTDC and extractProtCTDD for the Composition and Distribution descriptors.
x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
extractProtCTDT(x)
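A Transition value is the frequency with which a residue of one group is followed by a residue of another group, in either order. The sketch below handles a single attribute, using an assumed three-group hydrophobicity partition (1 = polar, 2 = neutral, 3 = hydrophobic) and a hypothetical helper name, ctd_transition_one. It illustrates the idea and is not the package's code.

```r
# Hypothetical helper: transition frequencies for ONE attribute.
ctd_transition_one <- function(seq) {
  # Assumed grouping: 1 = polar, 2 = neutral, 3 = hydrophobic
  group_of <- c(R = 1, K = 1, E = 1, D = 1, Q = 1, N = 1,
                G = 2, A = 2, S = 2, T = 2, P = 2, H = 2, Y = 2,
                C = 3, L = 3, V = 3, I = 3, M = 3, F = 3, W = 3)
  code <- unname(group_of[strsplit(seq, "")[[1]]])
  left <- head(code, -1)   # residue i
  right <- tail(code, -1)  # residue i + 1
  n_pairs <- length(left)
  # Count a-b and b-a neighbours together, relative to all adjacent pairs
  between <- function(a, b) {
    sum((left == a & right == b) | (left == b & right == a)) / n_pairs
  }
  c(T12 = between(1, 2), T13 = between(1, 3), T23 = between(2, 3))
}

tr <- ctd_transition_one("MKVLAAGIVK")
tr
```

Three transition values per attribute, over 7 attributes, gives the 21 dimensions.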
Conjoint Triad Descriptor
extractProtCTriad(x)
x |
A character vector, as the input protein sequence. |
This function calculates the Conjoint Triad descriptor (Dim: 343).
A length 343 named vector
J.W. Shen, J. Zhang, X.M. Luo, W.L. Zhu, K.Q. Yu, K.X. Chen, Y.X. Li, H.L. Jiang. Predicting Protein-protein Interactions Based Only on Sequences Information. Proceedings of the National Academy of Sciences. 2007, 104, 4337–4341.
x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
extractProtCTriad(x)
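The dimension 343 equals 7^3: residues are reduced to 7 classes, and every window of three consecutive residues is counted as one of the 7 * 7 * 7 possible triads. The sketch below assumes the seven-class partition usually attributed to Shen et al. and returns raw counts only; the helper name conjoint_triad_counts is hypothetical, and any normalization the package applies is omitted.

```r
# Hypothetical helper: raw conjoint triad counts (no normalization).
conjoint_triad_counts <- function(seq) {
  # Assumed 7-class partition of the 20 amino acids
  classes <- c(A = 1, G = 1, V = 1,
               I = 2, L = 2, F = 2, P = 2,
               Y = 3, M = 3, T = 3, S = 3,
               H = 4, N = 4, Q = 4, W = 4,
               R = 5, K = 5,
               D = 6, E = 6,
               C = 7)
  code <- classes[strsplit(seq, "")[[1]]]
  n <- length(code)
  # Class labels of every three consecutive residues, e.g. "351"
  triads <- paste0(code[1:(n - 2)], code[2:(n - 1)], code[3:n])
  grid <- expand.grid(1:7, 1:7, 1:7)
  triad_names <- paste0(grid[[1]], grid[[2]], grid[[3]])  # all 343 triads
  counts <- table(factor(triads, levels = triad_names))
  setNames(as.integer(counts), triad_names)
}

ct <- conjoint_triad_counts("MKVLAAGIVK")
length(ct)  # 343
```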
Dipeptide Composition Descriptor
extractProtDC(x)
x |
A character vector, as the input protein sequence. |
This function calculates the Dipeptide Composition descriptor (Dim: 400).
A length 400 named vector
M. Bhasin, G. P. S. Raghava. Classification of Nuclear Receptors Based on Amino Acid Composition and Dipeptide Composition. Journal of Biological Chemistry, 2004, 279, 23262.
See extractProtAAC and extractProtTC for Amino Acid Composition and Tripeptide Composition descriptors.
x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
extractProtDC(x)
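The 400 dimensions correspond to the 20 * 20 possible ordered pairs of adjacent residues. A minimal base R sketch, with a hypothetical helper dipeptide_composition and a toy sequence, might look as follows; it is an illustration rather than the package's implementation.

```r
# Hypothetical helper: fraction of each of the 400 dipeptides.
dipeptide_composition <- function(seq) {
  aa <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H", "I",
          "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")
  dipeptides <- as.vector(outer(aa, aa, paste0))  # 400 names
  n <- nchar(seq)
  # Overlapping windows of width 2: positions (1,2), (2,3), ...
  observed <- substring(seq, 1:(n - 1), 2:n)
  counts <- table(factor(observed, levels = dipeptides))
  setNames(as.numeric(counts) / (n - 1), dipeptides)
}

dc <- dipeptide_composition("MKVLAAGIVK")
dc[["AA"]]  # one "AA" among the 9 overlapping dipeptides
```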
Geary Autocorrelation Descriptor
extractProtGeary( x, props = c("CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102", "CHOC760101", "BIGC670101", "CHAM810101", "DAYM780201"), nlag = 30L, customprops = NULL )
x |
A character vector, as the input protein sequence. |
props |
A character vector, specifying the Accession Number of the target properties. 8 properties are used by default: CIDH920105, BHAR880101, CHAM820101, CHAM820102, CHOC760101, BIGC670101, CHAM810101 and DAYM780201. |
nlag |
Maximum value of the lag parameter. Default is 30. |
customprops |
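A data frame of customized properties, one property per row: an AccNo column naming each property, followed by one column per amino acid type (as in the example). Default is NULL. |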
This function calculates the Geary autocorrelation descriptor (Dim: length(props) * nlag).
A named vector of length length(props) * nlag
AAindex: Amino acid index database. https://www.genome.jp/dbget/aaindex.html
Feng, Z.P. and Zhang, C.T. (2000) Prediction of membrane protein types based on the hydrophobic index of amino acids. Journal of Protein Chemistry, 19, 269-275.
Horne, D.S. (1988) Prediction of protein helix content from an autocorrelation analysis of sequence hydrophobicities. Biopolymers, 27, 451-477.
Sokal, R.R. and Thomson, B.A. (2006) Population structure inferred by local spatial autocorrelation: an example from an Amerindian tribal population. American Journal of Physical Anthropology, 129, 121-131.
See extractProtMoreauBroto and extractProtMoran for Moreau-Broto autocorrelation descriptors and Moran autocorrelation descriptors.
x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
extractProtGeary(x)

myprops = data.frame(AccNo = c("MyProp1", "MyProp2", "MyProp3"),
                     A = c(0.62, -0.5, 15), R = c(-2.53, 3, 101),
                     N = c(-0.78, 0.2, 58), D = c(-0.9, 3, 59),
                     C = c(0.29, -1, 47), E = c(-0.74, 3, 73),
                     Q = c(-0.85, 0.2, 72), G = c(0.48, 0, 1),
                     H = c(-0.4, -0.5, 82), I = c(1.38, -1.8, 57),
                     L = c(1.06, -1.8, 57), K = c(-1.5, 3, 73),
                     M = c(0.64, -1.3, 75), F = c(1.19, -2.5, 91),
                     P = c(0.12, 0, 42), S = c(-0.18, 0.3, 31),
                     T = c(-0.05, -0.4, 45), W = c(0.81, -3.4, 130),
                     Y = c(0.26, -2.3, 107), V = c(1.08, -1.5, 43))

# Use 4 properties in the AAindex database, and 3 customized properties
extractProtGeary(x, customprops = myprops,
                 props = c('CIDH920105', 'BHAR880101',
                           'CHAM820101', 'CHAM820102',
                           'MyProp1', 'MyProp2', 'MyProp3'))
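As a hedged illustration of the underlying formula, Geary's C at lag d compares the squared differences between residues d positions apart with the overall variance of the property along the sequence. The helper geary_autocorr below is hypothetical and works on a plain numeric vector of per-residue property values; how the package standardizes and looks up the properties is not reproduced here.

```r
# Hypothetical helper: Geary autocorrelation of per-residue values p
# for lags 1..nlag.
geary_autocorr <- function(p, nlag) {
  n <- length(p)
  denom <- sum((p - mean(p))^2) / (n - 1)  # sample variance
  sapply(seq_len(nlag), function(d) {
    # Mean squared difference between positions i and i + d, halved
    num <- sum((p[1:(n - d)] - p[(1 + d):n])^2) / (2 * (n - d))
    num / denom
  })
}

p <- c(1, 2, 3, 4, 5, 6)  # toy property values along a sequence
g <- geary_autocorr(p, nlag = 2)
g
```

One such value per lag and per property yields the length(props) * nlag dimensions.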
Moran Autocorrelation Descriptor
extractProtMoran( x, props = c("CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102", "CHOC760101", "BIGC670101", "CHAM810101", "DAYM780201"), nlag = 30L, customprops = NULL )
x |
A character vector, as the input protein sequence. |
props |
A character vector, specifying the Accession Number of the target properties. 8 properties are used by default: CIDH920105, BHAR880101, CHAM820101, CHAM820102, CHOC760101, BIGC670101, CHAM810101 and DAYM780201. |
nlag |
Maximum value of the lag parameter. Default is 30. |
customprops |
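A data frame of customized properties, one property per row: an AccNo column naming each property, followed by one column per amino acid type (as in the example). Default is NULL. |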
This function calculates the Moran autocorrelation descriptor (Dim: length(props) * nlag).
A named vector of length length(props) * nlag
AAindex: Amino acid index database. https://www.genome.jp/dbget/aaindex.html
Feng, Z.P. and Zhang, C.T. (2000) Prediction of membrane protein types based on the hydrophobic index of amino acids. Journal of Protein Chemistry, 19, 269-275.
Horne, D.S. (1988) Prediction of protein helix content from an autocorrelation analysis of sequence hydrophobicities. Biopolymers, 27, 451-477.
Sokal, R.R. and Thomson, B.A. (2006) Population structure inferred by local spatial autocorrelation: an example from an Amerindian tribal population. American Journal of Physical Anthropology, 129, 121-131.
See extractProtMoreauBroto and extractProtGeary for Moreau-Broto autocorrelation descriptors and Geary autocorrelation descriptors.
x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
extractProtMoran(x)

myprops = data.frame(AccNo = c("MyProp1", "MyProp2", "MyProp3"),
                     A = c(0.62, -0.5, 15), R = c(-2.53, 3, 101),
                     N = c(-0.78, 0.2, 58), D = c(-0.9, 3, 59),
                     C = c(0.29, -1, 47), E = c(-0.74, 3, 73),
                     Q = c(-0.85, 0.2, 72), G = c(0.48, 0, 1),
                     H = c(-0.4, -0.5, 82), I = c(1.38, -1.8, 57),
                     L = c(1.06, -1.8, 57), K = c(-1.5, 3, 73),
                     M = c(0.64, -1.3, 75), F = c(1.19, -2.5, 91),
                     P = c(0.12, 0, 42), S = c(-0.18, 0.3, 31),
                     T = c(-0.05, -0.4, 45), W = c(0.81, -3.4, 130),
                     Y = c(0.26, -2.3, 107), V = c(1.08, -1.5, 43))

# Use 4 properties in the AAindex database, and 3 customized properties
extractProtMoran(x, customprops = myprops,
                 props = c('CIDH920105', 'BHAR880101',
                           'CHAM820101', 'CHAM820102',
                           'MyProp1', 'MyProp2', 'MyProp3'))
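For reference, Moran's I at lag d is the average product of mean-centred property values d positions apart, divided by the overall variance of the property along the sequence. The sketch below uses a hypothetical helper, moran_autocorr, on a numeric vector of per-residue values; the property lookup and any standardization done by the package are left out.

```r
# Hypothetical helper: Moran autocorrelation of per-residue values p
# for lags 1..nlag.
moran_autocorr <- function(p, nlag) {
  n <- length(p)
  centered <- p - mean(p)
  denom <- sum(centered^2) / n  # population variance along the sequence
  sapply(seq_len(nlag), function(d) {
    # Mean product of centred values at positions i and i + d
    num <- sum(centered[1:(n - d)] * centered[(1 + d):n]) / (n - d)
    num / denom
  })
}

p <- c(1, 2, 3, 4, 5, 6)  # toy property values along a sequence
m <- moran_autocorr(p, nlag = 2)
m
```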
Normalized Moreau-Broto Autocorrelation Descriptor
extractProtMoreauBroto( x, props = c("CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102", "CHOC760101", "BIGC670101", "CHAM810101", "DAYM780201"), nlag = 30L, customprops = NULL )
x |
A character vector, as the input protein sequence. |
props |
A character vector, specifying the Accession Number of the target properties. 8 properties are used by default: CIDH920105, BHAR880101, CHAM820101, CHAM820102, CHOC760101, BIGC670101, CHAM810101 and DAYM780201. |
nlag |
Maximum value of the lag parameter. Default is 30. |
customprops |
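A data frame of customized properties, one property per row: an AccNo column naming each property, followed by one column per amino acid type (as in the example). Default is NULL. |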
This function calculates the normalized Moreau-Broto autocorrelation descriptor (Dim: length(props) * nlag).
A named vector of length length(props) * nlag
AAindex: Amino acid index database. https://www.genome.jp/dbget/aaindex.html
Feng, Z.P. and Zhang, C.T. (2000) Prediction of membrane protein types based on the hydrophobic index of amino acids. Journal of Protein Chemistry, 19, 269-275.
Horne, D.S. (1988) Prediction of protein helix content from an autocorrelation analysis of sequence hydrophobicities. Biopolymers, 27, 451-477.
Sokal, R.R. and Thomson, B.A. (2006) Population structure inferred by local spatial autocorrelation: an example from an Amerindian tribal population. American Journal of Physical Anthropology, 129, 121-131.
See extractProtMoran and extractProtGeary for Moran autocorrelation descriptors and Geary autocorrelation descriptors.
x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
extractProtMoreauBroto(x)

myprops = data.frame(AccNo = c("MyProp1", "MyProp2", "MyProp3"),
                     A = c(0.62, -0.5, 15), R = c(-2.53, 3, 101),
                     N = c(-0.78, 0.2, 58), D = c(-0.9, 3, 59),
                     C = c(0.29, -1, 47), E = c(-0.74, 3, 73),
                     Q = c(-0.85, 0.2, 72), G = c(0.48, 0, 1),
                     H = c(-0.4, -0.5, 82), I = c(1.38, -1.8, 57),
                     L = c(1.06, -1.8, 57), K = c(-1.5, 3, 73),
                     M = c(0.64, -1.3, 75), F = c(1.19, -2.5, 91),
                     P = c(0.12, 0, 42), S = c(-0.18, 0.3, 31),
                     T = c(-0.05, -0.4, 45), W = c(0.81, -3.4, 130),
                     Y = c(0.26, -2.3, 107), V = c(1.08, -1.5, 43))

# Use 4 properties in the AAindex database, and 3 customized properties
extractProtMoreauBroto(x, customprops = myprops,
                       props = c('CIDH920105', 'BHAR880101',
                                 'CHAM820101', 'CHAM820102',
                                 'MyProp1', 'MyProp2', 'MyProp3'))
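The normalized Moreau-Broto autocorrelation at lag d is commonly written as the sum of products of property values d positions apart, divided by the number of such pairs. The sketch below assumes that definition; the helper name moreau_broto_autocorr is hypothetical, and the input is a plain vector of per-residue property values rather than a sequence.

```r
# Hypothetical helper: normalized Moreau-Broto autocorrelation of
# per-residue values p for lags 1..nlag.
moreau_broto_autocorr <- function(p, nlag) {
  n <- length(p)
  sapply(seq_len(nlag), function(d) {
    # Sum of products at positions i and i + d, divided by the pair count
    sum(p[1:(n - d)] * p[(1 + d):n]) / (n - d)
  })
}

p <- c(1, 2, 3, 4, 5, 6)  # toy property values along a sequence
mb <- moreau_broto_autocorr(p, nlag = 2)
mb
```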
Pseudo Amino Acid Composition Descriptor
extractProtPAAC( x, props = c("Hydrophobicity", "Hydrophilicity", "SideChainMass"), lambda = 30, w = 0.05, customprops = NULL )
x |
A character vector, as the input protein sequence. |
props |
A character vector, specifying the properties used. 3 properties are used by default: 'Hydrophobicity', 'Hydrophilicity' and 'SideChainMass'. |
lambda |
The lambda parameter for the PAAC descriptors, default is 30. |
w |
The weighting factor, default is 0.05. |
customprops |
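A data frame of customized properties, one property per row: an AccNo column naming each property, followed by one column per amino acid type (as in the example). Default is NULL. |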
This function calculates the Pseudo Amino Acid Composition (PAAC) descriptor (Dim: 20 + lambda, default is 50).
A length 20 + lambda named vector
Note that the default 20 * 3 property values are already built into the function. Users can also specify other (up to 544) properties by their Accession Number in the AAindex data, with or without the default three properties; in either case, users should explicitly specify the properties to use.
Kuo-Chen Chou. Prediction of Protein Cellular Attributes Using Pseudo-Amino Acid Composition. PROTEINS: Structure, Function, and Genetics, 2001, 43: 246-255.
Type 1 pseudo amino acid composition. http://www.csbio.sjtu.edu.cn/bioinf/PseAAC/type1.htm
Kuo-Chen Chou. Using Amphiphilic Pseudo Amino Acid Composition to Predict Enzyme Subfamily Classes. Bioinformatics, 2005, 21, 10-19.
JACS, 1962, 84: 4240-4246. (C. Tanford). (The hydrophobicity data)
PNAS, 1981, 78:3824-3828 (T.P.Hopp & K.R.Woods). (The hydrophilicity data)
CRC Handbook of Chemistry and Physics, 66th ed., CRC Press, Boca Raton, Florida (1985). (The side-chain mass data)
R.M.C. Dawson, D.C. Elliott, W.H. Elliott, K.M. Jones, Data for Biochemical Research 3rd ed., Clarendon Press Oxford (1986). (The side-chain mass data)
See extractProtAPAAC for amphiphilic pseudo amino acid composition descriptor.
x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
extractProtPAAC(x)

myprops = data.frame(AccNo = c("MyProp1", "MyProp2", "MyProp3"),
                     A = c(0.62, -0.5, 15), R = c(-2.53, 3, 101),
                     N = c(-0.78, 0.2, 58), D = c(-0.9, 3, 59),
                     C = c(0.29, -1, 47), E = c(-0.74, 3, 73),
                     Q = c(-0.85, 0.2, 72), G = c(0.48, 0, 1),
                     H = c(-0.4, -0.5, 82), I = c(1.38, -1.8, 57),
                     L = c(1.06, -1.8, 57), K = c(-1.5, 3, 73),
                     M = c(0.64, -1.3, 75), F = c(1.19, -2.5, 91),
                     P = c(0.12, 0, 42), S = c(-0.18, 0.3, 31),
                     T = c(-0.05, -0.4, 45), W = c(0.81, -3.4, 130),
                     Y = c(0.26, -2.3, 107), V = c(1.08, -1.5, 43))

# Use 3 default properties, 4 properties in the AAindex database,
# and 3 customized properties
extractProtPAAC(x, customprops = myprops,
                props = c('Hydrophobicity', 'Hydrophilicity', 'SideChainMass',
                          'CIDH920105', 'BHAR880101',
                          'CHAM820101', 'CHAM820102',
                          'MyProp1', 'MyProp2', 'MyProp3'))
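To see where the 20 + lambda dimensions come from, the sketch below follows the type 1 pseudo amino acid composition as it is usually written: 20 composition terms plus lambda sequence-order correlation factors, all sharing one normalizing denominator weighted by w. Everything here is an assumption for illustration: the helper name pseaac_sketch is hypothetical, the property matrix is random toy data assumed to be already standardized, and the package's actual preprocessing is not reproduced.

```r
# Hypothetical helper: type 1 PseAAC-style vector of length 20 + lambda.
# props: numeric matrix, one row per property, columns named by amino acid.
pseaac_sketch <- function(seq, props, lambda, w = 0.05) {
  aa <- colnames(props)
  residues <- strsplit(seq, "")[[1]]
  n <- length(residues)
  H <- props[, residues, drop = FALSE]  # properties x positions
  # theta[k]: average over residue pairs (i, i + k) of the mean squared
  # property difference between the two residues
  theta <- sapply(seq_len(lambda), function(k) {
    diff_sq <- (H[, 1:(n - k), drop = FALSE] - H[, (1 + k):n, drop = FALSE])^2
    mean(colMeans(diff_sq))
  })
  f <- as.numeric(table(factor(residues, levels = aa))) / n  # composition
  denom <- sum(f) + w * sum(theta)
  c(setNames(f / denom, aa),
    setNames(w * theta / denom, paste0("lambda", seq_len(lambda))))
}

set.seed(1)
aa <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H", "I",
        "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")
toy_props <- matrix(rnorm(3 * 20), nrow = 3,
                    dimnames = list(c("p1", "p2", "p3"), aa))
paac <- pseaac_sketch("MKVLAAGIVKMKVLAAG", toy_props, lambda = 5)
length(paac)  # 20 + lambda = 25
```

Because every term shares the same denominator, the elements sum to 1, and lambda must stay below the sequence length for the lagged pairs to exist.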
Compute PSSM (Position-Specific Scoring Matrix) for given protein sequence
extractProtPSSM( seq, start.pos = 1L, end.pos = nchar(seq), psiblast.path = NULL, makeblastdb.path = NULL, database.path = NULL, iter = 5, silent = TRUE, evalue = 10L, word.size = NULL, gapopen = NULL, gapextend = NULL, matrix = "BLOSUM62", threshold = NULL, seg = "no", soft.masking = FALSE, culling.limit = NULL, best.hit.overhang = NULL, best.hit.score.edge = NULL, xdrop.ungap = NULL, xdrop.gap = NULL, xdrop.gap.final = NULL, window.size = NULL, gap.trigger = 22L, num.threads = 1L, pseudocount = 0L, inclusion.ethresh = 0.002 )
seq |
Character vector, as the input protein sequence. |
start.pos |
Optional integer denoting the start position of the fragment window. Default is 1. |
end.pos |
Optional integer denoting the end position of the fragment window. Default is nchar(seq). |
psiblast.path |
Character string indicating the path of the psiblast program. Default is NULL. |
makeblastdb.path |
Character string indicating the path of the makeblastdb program. Default is NULL. |
database.path |
Character string indicating the path of a reference database (a FASTA file). |
iter |
Number of iterations to perform for PSI-Blast. |
silent |
Logical. Whether to suppress the PSI-Blast running output (may not work on some Windows versions and PSI-Blast versions). Default is TRUE. |
evalue |
Expectation value (E) threshold for saving hits. Default is 10. |
word.size |
Word size for wordfinder algorithm. An integer >= 2. |
gapopen |
Integer. Cost to open a gap. |
gapextend |
Integer. Cost to extend a gap. |
matrix |
Character string. The scoring matrix name (default is "BLOSUM62"). |
threshold |
Minimum word score such that the word is added to the BLAST lookup table. A real value >= 0. |
seg |
Character string. Filter query sequence with SEG. Default is "no". |
soft.masking |
Logical. Apply filtering locations as soft masks? Default is FALSE. |
culling.limit |
An integer >= 0. If the query range of a hit is enveloped by that of at least this many higher-scoring hits, delete the hit. Incompatible with best.hit.overhang and best.hit.score.edge. |
best.hit.overhang |
Best Hit algorithm overhang value (a real value >= 0 and <= 0.5, recommended value: 0.1). Incompatible with culling.limit. |
best.hit.score.edge |
Best Hit algorithm score edge value (a real value >= 0 and <= 0.5, recommended value: 0.1). Incompatible with culling.limit. |
xdrop.ungap |
X-dropoff value (in bits) for ungapped extensions. |
xdrop.gap |
X-dropoff value (in bits) for preliminary gapped extensions. |
xdrop.gap.final |
X-dropoff value (in bits) for final gapped alignment. |
window.size |
An integer >= 0. Multiple hits window size. To specify the 1-hit algorithm, use 0. |
gap.trigger |
Number of bits to trigger gapping. Default is 22. |
num.threads |
Integer. Number of threads (CPUs) to use in the BLAST search. Default is 1. |
pseudocount |
Integer. Pseudo-count value used when constructing PSSM. Default is 0. |
inclusion.ethresh |
E-value inclusion threshold for pairwise alignments. Default is 0.002. |
This function calculates the PSSM (Position-Specific Scoring Matrix) derived by PSI-Blast for a given protein sequence or peptide. For a given sequence, the PSSM represents the log-likelihood of the substitution of the 20 types of amino acids at each position in the sequence. Note that the output value is not normalized.
The original PSSM, a numeric matrix which has end.pos - start.pos + 1 columns and 20 named rows.
The function requires the makeblastdb and psiblast programs to be properly installed in the operating system or their paths provided.
The two command-line programs are included in the NCBI-BLAST+ software package. To install NCBI Blast+, open the NCBI FTP site using a web browser or FTP software: ftp://[email protected]:21/blast/executables/blast+/LATEST/ then download the executable version of BLAST+ for your operating system, and compile or install the downloaded source code or executable program.
Ubuntu/Debian users can directly use the command sudo apt-get install ncbi-blast+ to install NCBI Blast+.
For OS X users, download ncbi-blast- ... .dmg then install.
For Windows users, download ncbi-blast- ... .exe then install.
Altschul, Stephen F., et al. "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." Nucleic acids research 25.17 (1997): 3389–3402.
Ye, Xugang, Guoli Wang, and Stephen F. Altschul. "An assessment of substitution scores for protein profile-profile comparison." Bioinformatics 27.24 (2011): 3356–3363.
Rangwala, Huzefa, and George Karypis. "Profile-based direct kernels for remote homology detection and fold recognition." Bioinformatics 21.23 (2005): 4239–4247.
See extractProtPSSMFeature and extractProtPSSMAcc.
x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
dbpath = tempfile('tempdb', fileext = '.fasta')
invisible(file.copy(from = system.file('protseq/Plasminogen.fasta', package = 'Rcpi'), to = dbpath))
pssmmat = extractProtPSSM(seq = x, database.path = dbpath)
dim(pssmmat)  # 20 x 562 (P00750: length 562, 20 Amino Acids)
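Since the function depends on external programs, it can help to check whether they are visible on the search path before calling it. The snippet below is a small convenience sketch using base R's Sys.which(); the variable names are illustrative, and it is not part of the package.

```r
# Look up the two BLAST+ programs on the system search path.
# Sys.which() returns "" for programs that cannot be found.
blast_tools <- Sys.which(c("psiblast", "makeblastdb"))
missing_tools <- names(blast_tools)[blast_tools == ""]

if (length(missing_tools) > 0) {
  message("Not found on PATH: ", paste(missing_tools, collapse = ", "),
          ". Install NCBI BLAST+ or pass the paths explicitly.")
} else {
  message("psiblast: ", blast_tools[["psiblast"]])
  message("makeblastdb: ", blast_tools[["makeblastdb"]])
}
```

When the programs are installed in a non-standard location, the paths found by other means can be passed through the psiblast.path and makeblastdb.path arguments.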
Profile-based protein representation derived by PSSM (Position-Specific Scoring Matrix) and auto cross covariance
extractProtPSSMAcc(pssmmat, lag)
pssmmat |
The PSSM computed by extractProtPSSM. |
lag |
The lag parameter. Must be less than the number of amino acids in the sequence (i.e. the number of columns in the PSSM matrix). |
This function calculates the feature vector based on the PSSM by running PSI-Blast and auto cross covariance transformation.
A length lag * 20^2 named numeric vector; the element names are derived from the amino acid name abbreviation (crossed amino acid name abbreviation) and lag index.
Wold, S., Jonsson, J., Sjöström, M., Sandberg, M., & Rännar, S. (1993). DNA and peptide sequences and chemical processes multivariately modelled by principal component analysis and partial least-squares projections to latent structures. Analytica chimica acta, 277(2), 239–253.
See extractProtPSSM and extractProtPSSMFeature.
x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
dbpath = tempfile('tempdb', fileext = '.fasta')
invisible(file.copy(from = system.file('protseq/Plasminogen.fasta', package = 'Rcpi'), to = dbpath))
pssmmat = extractProtPSSM(seq = x, database.path = dbpath)
pssmacc = extractProtPSSMAcc(pssmmat, lag = 3)
tail(pssmacc)
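The lag * 20^2 length follows from taking, for each lag, the covariance between every ordered pair of the 20 rows at positions that many steps apart. The sketch below is a generic auto cross covariance transform on a toy 20 x N matrix; the helper name acc_transform is hypothetical, and the exact centring and scaling used by the package may differ.

```r
# Hypothetical helper: auto cross covariance of a matrix whose rows are
# features (e.g. 20 amino acid types) and whose columns are positions.
acc_transform <- function(mat, lag) {
  n <- ncol(mat)
  centered <- mat - rowMeans(mat)  # centre each row along the sequence
  unlist(lapply(seq_len(lag), function(d) {
    left <- centered[, 1:(n - d), drop = FALSE]
    right <- centered[, (1 + d):n, drop = FALSE]
    # nrow x nrow matrix of lagged covariances, flattened to a vector
    as.vector(tcrossprod(left, right)) / (n - d)
  }))
}

set.seed(7)
toy_pssm <- matrix(sample(-5:5, 20 * 30, replace = TRUE), nrow = 20)
acc <- acc_transform(toy_pssm, lag = 3)
length(acc)  # 3 * 20^2 = 1200
```

The lag must be smaller than the number of columns, otherwise no pair of positions is left to compare.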
Profile-based protein representation derived by PSSM (Position-Specific Scoring Matrix)
extractProtPSSMFeature(pssmmat)
pssmmat |
The PSSM computed by extractProtPSSM. |
This function calculates the profile-based protein representation derived by PSSM. The feature vector is based on the PSSM computed by extractProtPSSM. For a given sequence, the PSSM feature represents the log-likelihood of the substitution of the 20 types of amino acids at that position in the sequence. Each PSSM feature value in the vector represents the degree of conservation of a given amino acid type. The value is normalized to the interval (0, 1) by the transformation 1/(1+e^(-x)).
A numeric vector which has 20 x N
named elements,
where N
is the size of the window (number of rows of the PSSM).
Ye, Xugang, Guoli Wang, and Stephen F. Altschul. "An assessment of substitution scores for protein profile-profile comparison." Bioinformatics 27.24 (2011): 3356–3363.
Rangwala, Huzefa, and George Karypis. "Profile-based direct kernels for remote homology detection and fold recognition." Bioinformatics 21.23 (2005): 4239–4247.
extractProtPSSM extractProtPSSMAcc
x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
dbpath = tempfile('tempdb', fileext = '.fasta')
invisible(file.copy(from = system.file('protseq/Plasminogen.fasta', package = 'Rcpi'), to = dbpath))
pssmmat = extractProtPSSM(seq = x, database.path = dbpath)
pssmfeature = extractProtPSSMFeature(pssmmat)
head(pssmfeature)
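The normalization 1/(1+e^(-x)) used by extractProtPSSMFeature is the standard logistic function. The short sketch below applies it to a few made-up scores (not real PSI-BLAST output) to show how raw PSSM values are squashed into the interval (0, 1).

```r
# Toy PSSM scores, chosen only for illustration
scores = c(-5, -1, 0, 2, 7)

# Logistic transformation: large negative scores map close to 0,
# large positive scores map close to 1, and a score of 0 maps to 0.5
scaled = 1 / (1 + exp(-scores))
scaled

# Base R provides the same function as plogis()
all.equal(scaled, plogis(scores))
```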
Quasi-Sequence-Order (QSO) Descriptor
extractProtQSO(x, nlag = 30, w = 0.1)
x |
A character vector, as the input protein sequence. |
nlag |
The maximum lag, default is 30. |
w |
The weighting factor, default is 0.1. |
This function calculates the Quasi-Sequence-Order (QSO) descriptor
(Dim: 20 + 20 + (2 * nlag), default is 100).
A named vector of length 20 + 20 + (2 * nlag)
Kuo-Chen Chou. Prediction of Protein Subcellular Locations by Incorporating Quasi-Sequence-Order Effect. Biochemical and Biophysical Research Communications, 2000, 278, 477-483.
Kuo-Chen Chou and Yu-Dong Cai. Prediction of Protein Subcellular Locations by GO-FunD-PseAA Predictor. Biochemical and Biophysical Research Communications, 2004, 320, 1236-1239.
Gisbert Schneider and Paul Wrede. The Rational Design of Amino Acid Sequences by Artificial Neural Networks and Simulated Molecular Evolution: De Novo Design of an Idealized Leader Cleavage Site. Biophysical Journal, 1994, 66, 335-344.
See extractProtSOCN
for sequence-order-coupling numbers.
x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
extractProtQSO(x)
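The dimension 20 + 20 + (2 * nlag) is consistent with computing 20 + nlag values for each of two amino acid distance matrices. As a rough sketch of the quasi-sequence-order formula from Chou (2000) for a single distance matrix, assuming the amino acid frequencies f and the sequence-order-coupling numbers tau are already available, the hypothetical helper qso_sketch below combines them with the weighting factor w. The toy inputs are made up for illustration.

```r
# qso_sketch is a hypothetical helper, not part of Rcpi.
# f:   20 amino acid frequencies
# tau: nlag sequence-order-coupling numbers for one distance matrix
# w:   weighting factor
qso_sketch = function(f, tau, w = 0.1) {
  denom = sum(f) + w * sum(tau)
  c(f / denom, (w * tau) / denom)
}

f = rep(1 / 20, 20)  # toy uniform composition
tau = seq(1, 30)     # toy coupling numbers for nlag = 30
q = qso_sketch(f, tau)
length(q)  # 20 + 30 values for one distance matrix
sum(q)     # the values sum to 1 by construction
```

With nlag = 30 and two distance matrices, this gives 2 * (20 + 30) = 100 values, matching the default dimension stated above.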
Sequence-Order-Coupling Numbers
extractProtSOCN(x, nlag = 30)
x |
A character vector, as the input protein sequence. |
nlag |
The maximum lag, default is 30. |
This function calculates the Sequence-Order-Coupling Numbers
(Dim: nlag * 2, default is 60).
A named vector of length nlag * 2
Kuo-Chen Chou. Prediction of Protein Subcellular Locations by Incorporating Quasi-Sequence-Order Effect. Biochemical and Biophysical Research Communications, 2000, 278, 477-483.
Kuo-Chen Chou and Yu-Dong Cai. Prediction of Protein Subcellular Locations by GO-FunD-PseAA Predictor. Biochemical and Biophysical Research Communications, 2004, 320, 1236-1239.
Gisbert Schneider and Paul Wrede. The Rational Design of Amino Acid Sequences by Artificial Neural Networks and Simulated Molecular Evolution: De Novo Design of an Idealized Leader Cleavage Site. Biophysical Journal, 1994, 66, 335-344.
See extractProtQSO
for
quasi-sequence-order descriptors.
x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
extractProtSOCN(x)
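As a rough illustration of a sequence-order-coupling number in the sense of Chou (2000), the d-th value sums the squared distances between all residue pairs that are d positions apart. The sketch below uses a hypothetical helper socn_sketch and a made-up distance matrix over a three-letter alphabet; the real descriptor uses published amino acid distance matrices, which are not reproduced here.

```r
# socn_sketch is a hypothetical helper, not part of Rcpi.
# distmat: symmetric distance matrix whose row and column names are
# one-letter residue codes.
socn_sketch = function(x, distmat, nlag) {
  aa = strsplit(x, "")[[1]]
  n = length(aa)
  stopifnot(nlag < n)
  sapply(seq_len(nlag), function(d) {
    from = aa[1:(n - d)]
    to = aa[(1 + d):n]
    # squared distances between residues that are d positions apart
    sum(distmat[cbind(from, to)]^2)
  })
}

# Toy alphabet with made-up distances (illustration only)
toydist = matrix(c(0, 1, 2,
                   1, 0, 3,
                   2, 3, 0), nrow = 3,
                 dimnames = list(c("A", "C", "D"), c("A", "C", "D")))
socn_sketch("ACDA", toydist, nlag = 2)
```

For the toy sequence, lag 1 covers the pairs A-C, C-D and D-A (1 + 9 + 4 = 14) and lag 2 covers A-D and C-A (4 + 1 = 5).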
Tripeptide Composition Descriptor
extractProtTC(x)
x |
A character vector, as the input protein sequence. |
This function calculates the Tripeptide Composition descriptor (Dim: 8000).
A named vector of length 8000
M. Bhasin, G. P. S. Raghava. Classification of Nuclear Receptors Based on Amino Acid Composition and Dipeptide Composition. Journal of Biological Chemistry, 2004, 279, 23262.
See extractProtAAC
and extractProtDC
for Amino Acid Composition and Dipeptide Composition descriptors.
x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
extractProtTC(x)
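The dimension of 8000 corresponds to the 20^3 possible tripeptides. A minimal base R sketch of the idea is shown below: it counts overlapping three-residue windows and divides by the number of windows. The helper tc_sketch and the dictionary ordering are assumptions for illustration, so element order and scaling may not match extractProtTC exactly.

```r
aa = c("A", "R", "N", "D", "C", "E", "Q", "G", "H", "I",
       "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")

# Dictionary of all 20^3 tripeptides
grid = expand.grid(aa, aa, aa, stringsAsFactors = FALSE)
tcdict = paste0(grid[[3]], grid[[2]], grid[[1]])

# tc_sketch is a hypothetical helper, not part of Rcpi
tc_sketch = function(x) {
  n = nchar(x)
  stopifnot(n >= 3)
  tri = substring(x, 1:(n - 2), 3:n)  # overlapping 3-residue windows
  counts = table(factor(tri, levels = tcdict))
  setNames(as.numeric(counts) / (n - 2), tcdict)
}

tc = tc_sketch("ARNDARND")
length(tc)   # one entry per possible tripeptide
tc[["ARN"]]  # "ARN" occurs in 2 of the 6 windows
```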
Generating Compound-Protein Interaction Descriptors
getCPI(drugmat, protmat, type = c("combine", "tensorprod"))
drugmat |
The compound descriptor matrix. |
protmat |
The protein descriptor matrix. |
type |
The interaction type, one or two of
|
This function calculates the compound-protein interaction descriptors by two types of interaction:
combine - combine the two descriptor matrices, result has (p1 + p2) columns
tensorprod - calculate column-by-column (pseudo)-tensor product type interactions, result has (p1 * p2) columns
A matrix containing the compound-protein interaction descriptors
See getPPI
for generating
protein-protein interaction descriptors.
x = matrix(1:10, ncol = 2)
y = matrix(1:15, ncol = 3)
getCPI(x, y, 'combine')
getCPI(x, y, 'tensorprod')
getCPI(x, y, type = c('combine', 'tensorprod'))
getCPI(x, y, type = c('tensorprod', 'combine'))
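To make the column counts of the two getCPI interaction types concrete, here is a base R sketch of both ideas. The helpers combine_sketch and tensorprod_sketch are hypothetical, and the column ordering of the tensor product is an assumption; only the resulting dimensions are meant to mirror the description above.

```r
# Hypothetical helpers, not part of Rcpi.
# combine: place the two descriptor matrices side by side
combine_sketch = function(drugmat, protmat) cbind(drugmat, protmat)

# tensorprod: for each row, all pairwise products of the compound
# descriptors with the protein descriptors
tensorprod_sketch = function(drugmat, protmat) {
  stopifnot(nrow(drugmat) == nrow(protmat))
  t(sapply(seq_len(nrow(drugmat)), function(i) {
    as.vector(outer(drugmat[i, ], protmat[i, ]))
  }))
}

x = matrix(1:10, ncol = 2)  # 5 compounds, p1 = 2 descriptors
y = matrix(1:15, ncol = 3)  # 5 proteins, p2 = 3 descriptors
dim(combine_sketch(x, y))     # 5 rows, p1 + p2 = 5 columns
dim(tensorprod_sketch(x, y))  # 5 rows, p1 * p2 = 6 columns
```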
Retrieve Drug Molecules in MOL and SMILES Format from Databases
getDrug(
  id,
  from = c("pubchem", "chembl", "cas", "kegg", "drugbank"),
  type = c("mol", "smile"),
  parallel = 5
)
id |
A character vector, as the drug ID(s). |
from |
The database, one of |
type |
The returned molecule format, |
parallel |
An integer, the parallel parameter, indicating how many
processes the user would like to use for retrieving
the data (using RCurl), default is |
This function retrieves drug molecules in MOL and SMILES format from five databases.
A character vector with the same length as id,
each element containing the corresponding drug molecule.
See getProt
for retrieving protein sequences
from three databases.
id = c('DB00859', 'DB00860')
getDrug(id, 'drugbank', 'smile')
Retrieve Protein Sequence in FASTA Format from the KEGG Database
getFASTAFromKEGG(id, parallel = 5)
id |
A character vector, as the protein ID. |
parallel |
An integer, the parallel parameter, indicating how many
processes the user would like to use for retrieving
the data (using RCurl), default is |
This function retrieves protein sequences in FASTA format from the KEGG database.
A list, each component contains one of the protein sequences in FASTA format.
See getSeqFromKEGG
for retrieving proteins
represented as amino acid sequences from the KEGG database.
See readFASTA
for reading FASTA format files.
id = c('hsa:10161', 'hsa:10162')
getFASTAFromKEGG(id)
Retrieve Protein Sequence in FASTA Format from the UniProt Database
getFASTAFromUniProt(id, parallel = 5)
id |
A character vector, as the protein ID. |
parallel |
An integer, the parallel parameter, indicating how many
processes the user would like to use for retrieving
the data (using RCurl), default is |
This function retrieves protein sequences in FASTA format from the UniProt database.
A list, each component contains one of the protein sequences in FASTA format.
UniProt. https://www.uniprot.org/
UniProt REST API Documentation. https://www.uniprot.org/help/api
See getSeqFromUniProt
for retrieving proteins
represented as amino acid sequences from the UniProt database.
See readFASTA
for reading FASTA format files.
id = c('P00750', 'P00751', 'P00752')
getFASTAFromUniProt(id)
Retrieve Drug Molecules in InChI Format from the CAS Database
getMolFromCAS(id, parallel = 5)
id |
A character vector, as the CAS drug ID. |
parallel |
An integer, the parallel parameter, indicating how many
processes the user would like to use for retrieving
the data (using RCurl), default is |
This function retrieves drug molecules in InChI format from the CAS database. The CAS database only provides InChI data, so the molecules are returned in InChI format. Users can convert them to SMILES format using Open Babel or other third-party tools.
A character vector with the same length as id,
each element containing the corresponding drug molecule.
See getDrug
for retrieving drug molecules
in MOL and SMILES Format from other databases.
id = '52-67-5'  # Penicillamine
getMolFromCAS(id)
Retrieve Drug Molecules in MOL Format from the ChEMBL Database
getMolFromChEMBL(id, parallel = 5)
id |
A character vector, as the ChEMBL drug ID. |
parallel |
An integer, the parallel parameter, indicating how many
processes the user would like to use for
retrieving the data (using RCurl), default is |
This function retrieves drug molecules in MOL format from the ChEMBL database.
A character vector with the same length as id,
each element containing the corresponding drug molecule.
See getSmiFromChEMBL
for retrieving drug molecules
in SMILES format from the ChEMBL database.
id = 'CHEMBL1430'  # Penicillamine
getMolFromChEMBL(id)
Retrieve Drug Molecules in MOL Format from the DrugBank Database
getMolFromDrugBank(id, parallel = 5)
id |
A character vector, as the DrugBank drug ID. |
parallel |
An integer, the parallel parameter, indicating how many
processes the user would like to use for retrieving
the data (using RCurl), default is |
This function retrieves drug molecules in MOL format from the DrugBank database.
A character vector with the same length as id,
each element containing the corresponding drug molecule.
See getSmiFromDrugBank
for retrieving drug molecules
in SMILES format from the DrugBank database.
id = 'DB00859'  # Penicillamine
getMolFromDrugBank(id)
Retrieve Drug Molecules in MOL Format from the KEGG Database
getMolFromKEGG(id, parallel = 5)
id |
A character vector, as the KEGG drug ID. |
parallel |
An integer, the parallel parameter, indicating how many
processes the user would like to use for retrieving
the data (using RCurl), default is |
This function retrieves drug molecules in MOL format from the KEGG database.
A character vector with the same length as id,
each element containing the corresponding drug molecule.
See getSmiFromKEGG
for retrieving drug molecules
in SMILES format from the KEGG database.
id = 'D00496'  # Penicillamine
getMolFromKEGG(id)
Retrieve Drug Molecules in MOL Format from the PubChem Database
getMolFromPubChem(id, parallel = 5)
id |
A character vector, as the PubChem drug ID. |
parallel |
An integer, the parallel parameter, indicates how many
processes the user would like to use for retrieving
the data (using RCurl), default is |
This function retrieves drug molecules in MOL format from the PubChem database.
A character vector with the same length as id,
each element containing the corresponding drug molecule.
See getSmiFromPubChem
for retrieving drug molecules
in SMILES format from the PubChem database.
id = c('7847562', '7847563')  # Penicillamine
getMolFromPubChem(id)
Retrieve Protein Sequence in PDB Format from RCSB PDB
getPDBFromRCSBPDB(id, parallel = 5)
id |
A character vector, as the protein ID. |
parallel |
An integer, the parallel parameter, indicating how many
processes the user would like to use for retrieving
the data (using RCurl), default is |
This function retrieves protein sequences in PDB format from RCSB PDB.
A list, each component contains one of the protein sequences in PDB format.
See getSeqFromRCSBPDB
for retrieving proteins
represented as amino acid sequences from the RCSB PDB database.
id = c('4HHB', '4FF9')
getPDBFromRCSBPDB(id)
Generating Protein-Protein Interaction Descriptors
getPPI(protmat1, protmat2, type = c("combine", "tensorprod", "entrywise"))
protmat1 |
The first protein descriptor matrix,
must have the same ncol with |
protmat2 |
The second protein descriptor matrix,
must have the same ncol with |
type |
The interaction type, one or more of
|
This function calculates the protein-protein interaction descriptors by three types of interaction:
combine - combine the two descriptor matrices, result has (p + p) columns
tensorprod - calculate column-by-column (pseudo)-tensor product type interactions, result has (p * p) columns
entrywise - calculate entrywise product and entrywise sum of the two matrices, then combine them, result has (p + p) columns
A matrix containing the protein-protein interaction descriptors
See getCPI
for generating
compound-protein interaction descriptors.
x = matrix(1:10, ncol = 2)
y = matrix(5:14, ncol = 2)
getPPI(x, y, type = 'combine')
getPPI(x, y, type = 'tensorprod')
getPPI(x, y, type = 'entrywise')
getPPI(x, y, type = c('combine', 'tensorprod'))
getPPI(x, y, type = c('combine', 'entrywise'))
getPPI(x, y, type = c('entrywise', 'tensorprod'))
getPPI(x, y, type = c('combine', 'entrywise', 'tensorprod'))
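The entrywise interaction type described for getPPI can be sketched in base R as follows. The helper entrywise_sketch is hypothetical, and the order in which the product and sum blocks are placed is an assumption. The sketch also shows one property of this construction: swapping the two proteins leaves the descriptor unchanged, because both the product and the sum are symmetric.

```r
# entrywise_sketch is a hypothetical helper, not part of Rcpi.
entrywise_sketch = function(protmat1, protmat2) {
  stopifnot(identical(dim(protmat1), dim(protmat2)))
  # entrywise product and entrywise sum, placed side by side
  cbind(protmat1 * protmat2, protmat1 + protmat2)
}

x = matrix(1:10, ncol = 2)  # 5 proteins, p = 2 descriptors
y = matrix(5:14, ncol = 2)
e = entrywise_sketch(x, y)
dim(e)  # 5 rows, p + p = 4 columns

# The descriptor does not depend on the order of the two proteins
identical(e, entrywise_sketch(y, x))
```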
Retrieve Protein Sequence in various Formats from Databases
getProt(
  id,
  from = c("uniprot", "kegg", "pdb"),
  type = c("fasta", "pdb", "aaseq"),
  parallel = 5
)
id |
A character vector, as the protein ID(s). |
from |
The database, one of |
type |
The returned protein format, one of |
parallel |
An integer, the parallel parameter, indicating how many
processes the user would like to use for retrieving
the data (using RCurl), default is |
This function retrieves protein sequences in various formats from three databases.
A list with the same length as id, each element
containing the corresponding protein sequence(s) or file(s).
See getDrug
for retrieving drug molecules
from five databases.
id = c('P00750', 'P00751', 'P00752')
getProt(id, from = 'uniprot', type = 'aaseq')
Retrieve Protein Sequence from the KEGG Database
getSeqFromKEGG(id, parallel = 5)
id |
A character vector, as the protein ID. |
parallel |
An integer, the parallel parameter, indicating how many
processes the user would like to use for retrieving
the data (using RCurl), default is |
This function retrieves proteins represented as amino acid sequences from the KEGG database.
A list, each component contains one of the proteins represented as amino acid sequence(s).
See getFASTAFromKEGG
for retrieving protein sequence
in FASTA format from the KEGG database.
id = c('hsa:10161', 'hsa:10162')
getSeqFromKEGG(id)
Retrieve Protein Sequence from RCSB PDB
getSeqFromRCSBPDB(id, parallel = 5)
id |
A character vector, as the protein ID. |
parallel |
An integer, the parallel parameter, indicating how many
processes the user would like to use for retrieving
the data (using RCurl), default is |
This function retrieves protein sequences from RCSB PDB.
A list, each component contains one of the proteins represented as amino acid sequence(s).
See getPDBFromRCSBPDB
for retrieving proteins
in PDB format from the RCSB PDB database.
id = c('4HHB', '4FF9')
getSeqFromRCSBPDB(id)
Retrieve Protein Sequence from the UniProt Database
getSeqFromUniProt(id, parallel = 5)
id |
A character vector, as the protein ID. |
parallel |
An integer, the parallel parameter, indicating how many
processes the user would like to use for retrieving
the data (using RCurl), default is |
This function retrieves proteins represented as amino acid sequences from the UniProt database.
A list, each component contains one of the proteins represented as amino acid sequence(s).
UniProt. https://www.uniprot.org/
UniProt REST API Documentation. https://www.uniprot.org/help/api
See getFASTAFromUniProt
for retrieving protein
sequences in FASTA format from the UniProt database.
id = c('P00750', 'P00751', 'P00752')
getSeqFromUniProt(id)
Retrieve Drug Molecules in SMILES Format from the ChEMBL Database
getSmiFromChEMBL(id, parallel = 5)
id |
A character vector, as the ChEMBL drug ID. |
parallel |
An integer, the parallel parameter, indicating how many
processes the user would like to use for retrieving
the data (using RCurl), default is |
This function retrieves drug molecules in SMILES format from the ChEMBL database.
A character vector with the same length as id,
each element containing the corresponding drug molecule.
See getMolFromChEMBL
for retrieving drug molecules
in MOL format from the ChEMBL database.
id = 'CHEMBL1430'  # Penicillamine
getSmiFromChEMBL(id)
Retrieve Drug Molecules in SMILES Format from the DrugBank Database
getSmiFromDrugBank(id, parallel = 5)
id |
A character vector, as the DrugBank drug ID. |
parallel |
An integer, the parallel parameter, indicating how many
processes the user would like to use for retrieving
the data (using RCurl), default is |
This function retrieves drug molecules in SMILES format from the DrugBank database.
A character vector with the same length as id,
each element containing the corresponding drug molecule.
See getMolFromDrugBank
for retrieving drug molecules
in MOL format from the DrugBank database.
id = 'DB00859'  # Penicillamine
getSmiFromDrugBank(id)
Retrieve Drug Molecules in SMILES Format from the KEGG Database
getSmiFromKEGG(id, parallel = 5)
id |
A character vector, as the KEGG drug ID. |
parallel |
An integer, the parallel parameter, indicating how many
processes the user would like to use for retrieving
the data (using RCurl), default is |
This function retrieves drug molecules in SMILES format from the KEGG database.
A character vector with the same length as id,
each element containing the corresponding drug molecule.
See getMolFromKEGG
for retrieving drug molecules
in MOL format from the KEGG database.
id = 'D00496'  # Penicillamine
getSmiFromKEGG(id)
Retrieve Drug Molecules in SMILES Format from the PubChem Database
getSmiFromPubChem(id, parallel = 5)
id |
A character vector, as the PubChem drug ID. |
parallel |
An integer, the parallel parameter, indicates how many
processes the user would like to use for retrieving
the data (using RCurl), default is |
This function retrieves drug molecules in SMILES format from the PubChem database.
A character vector with the same length as id,
each element containing the corresponding drug molecule.
See getMolFromPubChem
for retrieving drug molecules
in MOL format from the PubChem database.
id = c('7847562', '7847563')  # Penicillamine
getSmiFromPubChem(id)
OptAA3d.sdf - 20 Amino Acids Optimized with MOE 2011.10 (Semiempirical AM1)
OptAA3d data
# This example requires the rcdk package
# library('rcdk')
# optaa3d = load.molecules(system.file('sysdata/OptAA3d.sdf', package = 'Rcpi'))
# view.molecule.2d(optaa3d[[1]])  # view the first amino acid
Read Protein Sequences in FASTA Format
readFASTA(
  file = system.file("protseq/P00750.fasta", package = "Rcpi"),
  legacy.mode = TRUE,
  seqonly = FALSE
)
file |
The name of the file which the sequences in fasta format are
to be read from. If it does not contain an absolute or
relative path, the file name is relative to the current
working directory, |
legacy.mode |
If set to |
seqonly |
If set to |
Character vector.
Pearson, W.R. and Lipman, D.J. (1988) Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences of the United States of America, 85: 2444-2448.
See readPDB
for reading protein sequences
in PDB format.
P00750 = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))
P00750
Read Molecules from SDF Files and Return Parsed Java Molecular Object
readMolFromSDF(sdffile)
sdffile |
Character vector, containing SDF file location(s). |
This function reads molecules from SDF files and returns
the parsed Java molecular objects needed by extractDrug...
functions.
A list, containing the parsed Java molecular objects.
See readMolFromSmi
for reading molecules by SMILES
string and returning parsed Java molecular object.
sdf = system.file('compseq/DB00859.sdf', package = 'Rcpi')
sdfs = c(system.file('compseq/DB00859.sdf', package = 'Rcpi'),
         system.file('compseq/DB00860.sdf', package = 'Rcpi'))
mol = readMolFromSDF(sdf)
mols = readMolFromSDF(sdfs)
Read Molecules from SMILES Files and Return Parsed Java Molecular Object or Plain Text List
readMolFromSmi(smifile, type = c("mol", "text"))
smifile |
Character vector, containing SMILES file location(s). |
type |
|
This function reads molecules from SMILES files and returns
parsed Java molecular objects or a plain text list
needed by extractDrug...()
functions.
A list, containing parsed Java molecular objects or character strings.
See readMolFromSDF
for reading molecules
from SDF files and returning parsed Java molecular object.
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')
mol1 = readMolFromSmi(smi, type = 'mol')
mol2 = readMolFromSmi(smi, type = 'text')
Read Protein Sequences in PDB Format
readPDB(file = system.file("protseq/4HHB.pdb", package = "Rcpi"))
file |
The name of the file which the sequences in PDB format are
to be read from. If it does not contain an absolute or
relative path, the file name is relative to the current
working directory, |
This function reads protein sequences in PDB (Protein Data Bank) format and returns the amino acid sequences in single-letter code.
A character vector, representing the amino acid sequence in single-letter code.
Protein Data Bank Contents Guide: Atomic Coordinate Entry Format Description, Version 3.30. Accessed 2013-06-26. https://files.wwpdb.org/pub/pdb/doc/format_descriptions/Format_v33_Letter.pdf
See readFASTA
for reading protein sequences
in FASTA format.
Seq4HHB = readPDB(system.file('protseq/4HHB.pdb', package = 'Rcpi'))
Seq4HHB
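As an illustration of what returning a sequence in single-letter code involves, the sketch below converts the three-letter residue names of one SEQRES-style record into one-letter codes. It is an assumption that the sequence is taken from SEQRES records; the sample line is written by hand for illustration and the whitespace-based parsing is a simplification of the fixed-column PDB format, so this is not a description of how readPDB is implemented.

```r
# Mapping from three-letter to one-letter codes for the 20 standard amino acids
three2one = c(ALA = "A", ARG = "R", ASN = "N", ASP = "D", CYS = "C",
              GLU = "E", GLN = "Q", GLY = "G", HIS = "H", ILE = "I",
              LEU = "L", LYS = "K", MET = "M", PHE = "F", PRO = "P",
              SER = "S", THR = "T", TRP = "W", TYR = "Y", VAL = "V")

# Hand-written SEQRES-style line (illustration only)
seqres = "SEQRES   1 A  141  VAL LEU SER PRO ALA ASP LYS THR ASN VAL LYS ALA ALA"

# Drop the record name, serial number, chain ID and residue count,
# then translate the remaining residue names
tokens = strsplit(trimws(seqres), "\\s+")[[1]]
residues = tokens[-(1:4)]
oneletter = paste(three2one[residues], collapse = "")
oneletter
```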
Parallelized Drug Molecule Similarity Search by Molecular Fingerprints Similarity or Maximum Common Substructure Search
searchDrug(
  mol,
  moldb,
  cores = 2,
  method = c("fp", "mcs"),
  fptype = c("standard", "extended", "graph", "hybrid", "maccs", "estate", "pubchem", "kr", "shortestpath", "fp2", "fp3", "fp4", "obmaccs"),
  fpsim = c("tanimoto", "euclidean", "cosine", "dice", "hamming"),
  mcssim = c("tanimoto", "overlap"),
  ...
)
mol |
The query molecule. The location of a |
moldb |
The molecule database. The location of a |
cores |
Integer. The number of CPU cores to use for parallel search,
default is |
method |
|
fptype |
The fingerprint type, only available when |
fpsim |
Similarity measure type for fingerprint,
only available when |
mcssim |
Similarity measure type for maximum common substructure search,
only available when |
... |
Other possible parameters for maximum common substructure search,
see |
This function performs a compound similarity search based either on molecular fingerprints (with a choice of similarity measures) or on maximum common substructure search. It compares one query compound against a set of molecules.
A named numeric vector containing the similarity values of the molecules in the database, sorted in decreasing order.
mol = system.file('compseq/DB00530.sdf', package = 'Rcpi')
# DrugBank ID DB00530: Erlotinib
moldb = system.file('compseq/tyrphostin.sdf', package = 'Rcpi')
# Database composed by searching 'tyrphostin' in PubChem and filtered by Lipinski's Rule of Five
searchDrug(mol, moldb, cores = 4, method = 'fp', fptype = 'maccs', fpsim = 'hamming')
searchDrug(mol, moldb, cores = 4, method = 'fp', fptype = 'fp2', fpsim = 'tanimoto')
searchDrug(mol, moldb, cores = 4, method = 'mcs', mcssim = 'tanimoto')
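To show the kind of ranking searchDrug returns for the fingerprint method, here is a small base R sketch of a Tanimoto similarity search over binary fingerprints. The helper tanimoto_sketch, the five-bit fingerprints, and the molecule names m1 to m3 are all made up for illustration; real fingerprints are much longer and are computed from molecular structures.

```r
# tanimoto_sketch is a hypothetical helper, not part of Rcpi.
# a, b: logical vectors of equal length marking the bits set in each fingerprint
tanimoto_sketch = function(a, b) {
  stopifnot(length(a) == length(b))
  sum(a & b) / sum(a | b)  # shared bits divided by bits set in either
}

query = c(TRUE, TRUE, FALSE, TRUE, FALSE)
db = list(
  m1 = c(TRUE, TRUE, FALSE, FALSE, FALSE),
  m2 = c(TRUE, TRUE, FALSE, TRUE, FALSE),
  m3 = c(FALSE, FALSE, TRUE, FALSE, TRUE)
)

# Compare the query against every database entry and sort by similarity
sims = sort(sapply(db, tanimoto_sketch, a = query), decreasing = TRUE)
sims
```

In this toy case m2 is identical to the query (similarity 1), m1 shares two of three set bits, and m3 shares none.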
Protein Sequence Segmentation
segProt(
  x,
  aa = c("A", "R", "N", "D", "C", "E", "Q", "G", "H", "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V"),
  k = 7
)
x |
A character vector, as the input protein sequence. |
aa |
A character, the amino acid type, one of
|
k |
A positive integer, specifies the window size (half of the window), default is 7. |
This function extracts the segments from the protein sequence.
A named list, each component contains one of the segments (a character string); the names of the list components are the positions of the specified amino acid in the sequence.
x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
segProt(x, aa = 'R', k = 5)
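The windowing idea behind segProt can be sketched in base R as below: for every occurrence of the chosen amino acid, take k residues on each side, giving segments of length 2 * k + 1. The helper seg_sketch is hypothetical, and dropping occurrences whose window would run past either end of the sequence is an assumption of this sketch; the package may treat the sequence boundaries differently.

```r
# seg_sketch is a hypothetical helper, not part of Rcpi.
seg_sketch = function(x, aa, k) {
  n = nchar(x)
  pos = which(strsplit(x, "")[[1]] == aa)  # positions of the amino acid
  # keep only positions with a full window on both sides (assumption)
  pos = pos[pos - k >= 1 & pos + k <= n]
  setNames(as.list(substring(x, pos - k, pos + k)), pos)
}

# Toy sequence: R occurs at positions 3 and 7
seg_sketch("MKRAAARGGG", aa = "R", k = 2)
```

The result is a list named by position, here "3" and "7", with the segments "MKRAA" and "AARGG".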