Package 'Rcpi' reference manual

Title:	Molecular Informatics Toolkit for Compound-Protein Interaction in Drug Discovery
Description:	A molecular informatics toolkit with an integration of bioinformatics and chemoinformatics tools for drug discovery.
Authors:	Nan Xiao [aut, cre] , Dong-Sheng Cao [aut], Qing-Song Xu [aut]
Maintainer:	Nan Xiao <[email protected]>
License:	Artistic-2.0 \| file LICENSE
Version:	1.43.0
Built:	2025-03-30 06:43:09 UTC
Source:	https://github.com/bioc/Rcpi

2D Autocorrelations Descriptors for 20 Amino Acids calculated by Dragon

Description

2D Autocorrelations Descriptors for 20 Amino Acids calculated by Dragon

Details

This dataset includes the 2D autocorrelations descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.

Value

AA2DACOR data

Examples

data(AA2DACOR)

data(AA2DACOR)

3D-MoRSE Descriptors for 20 Amino Acids calculated by Dragon

Description

3D-MoRSE Descriptors for 20 Amino Acids calculated by Dragon

Details

This dataset includes the 3D-MoRSE descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.

Value

AA3DMoRSE data

Examples

data(AA3DMoRSE)

data(AA3DMoRSE)

Atom-Centred Fragments Descriptors for 20 Amino Acids calculated by Dragon

Description

Atom-Centred Fragments Descriptors for 20 Amino Acids calculated by Dragon

Details

This dataset includes the atom-centred fragments descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.

Value

AAACF data

Examples

data(AAACF)

data(AAACF)

BLOSUM100 Matrix for 20 Amino Acids

Description

BLOSUM100 Matrix for 20 Amino Acids

Details

BLOSUM100 Matrix for the 20 amino acids. The matrix was extracted from the Biostrings package of Bioconductor.

Value

AABLOSUM100 data

Examples

data(AABLOSUM100)

data(AABLOSUM100)

BLOSUM45 Matrix for 20 Amino Acids

Description

BLOSUM45 Matrix for 20 Amino Acids

Details

BLOSUM45 Matrix for the 20 amino acids. The matrix was extracted from the Biostrings package of Bioconductor.

Value

AABLOSUM45 data

Examples

data(AABLOSUM45)

data(AABLOSUM45)

BLOSUM50 Matrix for 20 Amino Acids

Description

BLOSUM50 Matrix for 20 Amino Acids

Details

BLOSUM50 Matrix for the 20 amino acids. The matrix was extracted from the Biostrings package of Bioconductor.

Value

AABLOSUM50 data

Examples

data(AABLOSUM50)

data(AABLOSUM50)

BLOSUM62 Matrix for 20 Amino Acids

Description

BLOSUM62 Matrix for 20 Amino Acids

Details

BLOSUM62 Matrix for the 20 amino acids. The matrix was extracted from the Biostrings package of Bioconductor.

Value

AABLOSUM62 data

Examples

data(AABLOSUM62)

data(AABLOSUM62)

BLOSUM80 Matrix for 20 Amino Acids

Description

BLOSUM80 Matrix for 20 Amino Acids

Details

BLOSUM80 Matrix for the 20 amino acids. The matrix was extracted from the Biostrings package of Bioconductor.

Value

AABLOSUM80 data

Examples

data(AABLOSUM80)

data(AABLOSUM80)

Burden Eigenvalues Descriptors for 20 Amino Acids calculated by Dragon

Description

Burden Eigenvalues Descriptors for 20 Amino Acids calculated by Dragon

Details

This dataset includes the Burden eigenvalues descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.

Value

AABurden data

Examples

data(AABurden)

data(AABurden)

Connectivity Indices Descriptors for 20 Amino Acids calculated by Dragon

Description

Connectivity Indices Descriptors for 20 Amino Acids calculated by Dragon

Details

This dataset includes the connectivity indices descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.

Value

AAConn data

Examples

data(AAConn)

data(AAConn)

Constitutional Descriptors for 20 Amino Acids calculated by Dragon

Description

Constitutional Descriptors for 20 Amino Acids calculated by Dragon

Details

This dataset includes the constitutional descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.

Value

AAConst data

Examples

data(AAConst)

data(AAConst)

CPSA Descriptors for 20 Amino Acids calculated by Discovery Studio

Description

CPSA Descriptors for 20 Amino Acids calculated by Discovery Studio

Details

This dataset includes the CPSA descriptors of the 20 amino acids calculated by Discovery Studio (version 2.5) used for scales extraction in this package. All amino acid molecules had also been optimized with MOE 2011.10 (semiempirical AM1) before calculating these CPSA descriptors. The SDF file containing the information of the optimized amino acid molecules is included in this package. See OptAA3d for more information.

Value

AACPSA data

Examples

data(AACPSA)

data(AACPSA)

All 2D Descriptors for 20 Amino Acids calculated by Dragon

Description

All 2D Descriptors for 20 Amino Acids calculated by Dragon

Details

This dataset includes all the 2D descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.

Value

AADescAll data

Examples

data(AADescAll)

data(AADescAll)

Edge Adjacency Indices Descriptors for 20 Amino Acids calculated by Dragon

Description

Edge Adjacency Indices Descriptors for 20 Amino Acids calculated by Dragon

Details

This dataset includes the edge adjacency indices descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.

Value

AAEdgeAdj data

Examples

data(AAEdgeAdj)

data(AAEdgeAdj)

Eigenvalue-Based Indices Descriptors for 20 Amino Acids calculated by Dragon

Description

Eigenvalue-Based Indices Descriptors for 20 Amino Acids calculated by Dragon

Details

This dataset includes the eigenvalue-based indices descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.

Value

AAEigIdx data

Examples

data(AAEigIdx)

data(AAEigIdx)

Functional Group Counts Descriptors for 20 Amino Acids calculated by Dragon

Description

Functional Group Counts Descriptors for 20 Amino Acids calculated by Dragon

Details

This dataset includes the functional group counts descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.

Value

AAFGC data

Examples

data(AAFGC)

data(AAFGC)

Geometrical Descriptors for 20 Amino Acids calculated by Dragon

Description

Geometrical Descriptors for 20 Amino Acids calculated by Dragon

Details

This dataset includes the geometrical descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.

Value

AAGeom data

Examples

data(AAGeom)

data(AAGeom)

GETAWAY Descriptors for 20 Amino Acids calculated by Dragon

Description

GETAWAY Descriptors for 20 Amino Acids calculated by Dragon

Details

This dataset includes the GETAWAY descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.

Value

AAGETAWAY data

Examples

data(AAGETAWAY)

data(AAGETAWAY)

AAindex Data of 544 Physicochemical and Biological Properties for 20 Amino Acids

Description

AAindex Data of 544 Physicochemical and Biological Properties for 20 Amino Acids

Details

The data was extracted from the AAindex1 database ver 9.1 (ftp://ftp.genome.jp/pub/db/community/aaindex/aaindex1) as of November 2012 (Data Last Modified 2006-08-14).

With this data, users could investigate each property's accession number and other details. Visit https://www.genome.jp/dbget/aaindex.html for more information.

Value

AAindex data

Examples

data(AAindex)

data(AAindex)

Information Indices Descriptors for 20 Amino Acids calculated by Dragon

Description

Information Indices Descriptors for 20 Amino Acids calculated by Dragon

Details

This dataset includes the information indices descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.

Value

AAInfo data

Examples

data(AAInfo)

data(AAInfo)

Meta Information for the 20 Amino Acids

Description

Meta Information for the 20 Amino Acids

Details

This dataset includes the meta information of the 20 amino acids used for the 2D and 3D descriptor calculation in this package. Each column represents:

AAName: Amino acid name
Short: One-letter representation
Abbreviation: Three-letter representation
mol: SMILES representation
PUBCHEM_COMPOUND_CID: PubChem CID for the amino acid
PUBCHEM_LINK: PubChem link for the amino acid

Value

AAMetaInfo data

Examples

data(AAMetaInfo)
data(AAMetaInfo)

2D Descriptors for 20 Amino Acids calculated by MOE 2011.10

Description

2D Descriptors for 20 Amino Acids calculated by MOE 2011.10

Details

This dataset includes the 2D descriptors of the 20 amino acids calculated by MOE 2011.10 used for scales extraction in this package.

Value

AAMOE2D data

Examples

data(AAMOE2D)

data(AAMOE2D)

3D Descriptors for 20 Amino Acids calculated by MOE 2011.10

Description

3D Descriptors for 20 Amino Acids calculated by MOE 2011.10

Details

This dataset includes the 3D descriptors of the 20 amino acids calculated by MOE 2011.10 used for scales extraction in this package. All amino acid molecules had also been optimized with MOE (semiempirical AM1) before calculating these 3D descriptors. The SDF file containing the information of the optimized amino acid molecules is included in this package. See OptAA3d for more information.

Value

AAMOE3D data

Examples

data(AAMOE3D)

data(AAMOE3D)

Molecular Properties Descriptors for 20 Amino Acids calculated by Dragon

Description

Molecular Properties Descriptors for 20 Amino Acids calculated by Dragon

Details

This dataset includes the molecular properties descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.

Value

AAMolProp data

Examples

data(AAMolProp)

data(AAMolProp)

PAM120 Matrix for 20 Amino Acids

Description

PAM120 Matrix for 20 Amino Acids

Details

PAM120 Matrix for the 20 amino acids. The matrix was extracted from the Biostrings package of Bioconductor.

Value

AAPAM120 data

Examples

data(AAPAM120)

data(AAPAM120)

PAM250 Matrix for 20 Amino Acids

Description

PAM250 Matrix for 20 Amino Acids

Details

PAM250 Matrix for the 20 amino acids. The matrix was extracted from the Biostrings package of Bioconductor.

Value

AAPAM250 data

Examples

data(AAPAM250)

data(AAPAM250)

PAM30 Matrix for 20 Amino Acids

Description

PAM30 Matrix for 20 Amino Acids

Details

PAM30 Matrix for the 20 amino acids. The matrix was extracted from the Biostrings package of Bioconductor.

Value

AAPAM30 data

Examples

data(AAPAM30)

data(AAPAM30)

PAM40 Matrix for 20 Amino Acids

Description

PAM40 Matrix for 20 Amino Acids

Details

PAM40 Matrix for the 20 amino acids. The matrix was extracted from the Biostrings package of Bioconductor.

Value

AAPAM40 data

Examples

data(AAPAM40)

data(AAPAM40)

PAM70 Matrix for 20 Amino Acids

Description

PAM70 Matrix for 20 Amino Acids

Details

PAM70 Matrix for the 20 amino acids. The matrix was extracted from the Biostrings package of Bioconductor.

Value

AAPAM70 data

Examples

data(AAPAM70)

data(AAPAM70)

Randic Molecular Profiles Descriptors for 20 Amino Acids calculated by Dragon

Description

Randic Molecular Profiles Descriptors for 20 Amino Acids calculated by Dragon

Details

This dataset includes the Randic molecular profiles descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.

Value

AARandic data

Examples

data(AARandic)

data(AARandic)

RDF Descriptors for 20 Amino Acids calculated by Dragon

Description

RDF Descriptors for 20 Amino Acids calculated by Dragon

Details

This dataset includes the RDF descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.

Value

AARDF data

Examples

data(AARDF)

data(AARDF)

Topological Descriptors for 20 Amino Acids calculated by Dragon

Description

Topological Descriptors for 20 Amino Acids calculated by Dragon

Details

This dataset includes the topological descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.

Value

AATopo data

Examples

data(AATopo)

data(AATopo)

Topological Charge Indices Descriptors for 20 Amino Acids calculated by Dragon

Description

Topological Charge Indices Descriptors for 20 Amino Acids calculated by Dragon

Details

This dataset includes the topological charge indices descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.

Value

AATopoChg data

Examples

data(AATopoChg)

data(AATopoChg)

Walk and Path Counts Descriptors for 20 Amino Acids calculated by Dragon

Description

Walk and Path Counts Descriptors for 20 Amino Acids calculated by Dragon

Details

This dataset includes the walk and path counts descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.

Value

AAWalk data

Examples

data(AAWalk)

data(AAWalk)

WHIM Descriptors for 20 Amino Acids calculated by Dragon

Description

WHIM Descriptors for 20 Amino Acids calculated by Dragon

Details

This dataset includes the WHIM descriptors of the 20 amino acids calculated by Dragon (version 5.4) used for scales extraction in this package.

Value

AAWHIM data

Examples

data(AAWHIM)

data(AAWHIM)

Auto Cross Covariance (ACC) for Generating Scales-Based Descriptors of the Same Length

Description

Calculates auto covariance and auto cross covariance for generating scale-based descriptors of the same length.

Usage

acc(mat, lag)
acc(mat, lag)

Arguments

`mat`	A `p * n` matrix. Each row represents one scale (total `p` scales), each column represents one amino acid position (total `n` amino acids).
`lag`	The lag parameter. Must be less than the amino acids.

Value

A length lag * p^2 named vector, the element names are constructed by: the scales index (crossed scales index) and lag index.

Note

To see more details about auto cross covariance, check the references.

References

Wold, S., Jonsson, J., Sjörström, M., Sandberg, M., & Rännar, S. (1993). DNA and peptide sequences and chemical processes multivariately modelled by principal component analysis and partial least-squares projections to latent structures. Analytica chimica acta, 277(2), 239–253.

Sjöström, M., Rännar, S., & Wieslander, Å. (1995). Polypeptide sequence property relationships in Escherichia coli based on auto cross covariances. Chemometrics and intelligent laboratory systems, 29(2), 295–305.

Examples

p = 8    # p is the scales number
n = 200  # n is the amino acid number
lag = 7  # lag parameter
mat = matrix(rnorm(p * n), nrow = p, ncol = n)
acc(mat, lag)
p = 8    # p is the scales number
n = 200  # n is the amino acid number
lag = 7  # lag parameter
mat = matrix(rnorm(p * n), nrow = p, ncol = n)
acc(mat, lag)

Calculate Drug Molecule Similarity Derived by Molecular Fingerprints

Description

Calculate Drug Molecule Similarity Derived by Molecular Fingerprints

Usage

calcDrugFPSim(
  fp1,
  fp2,
  fptype = c("compact", "complete"),
  metric = c("tanimoto", "euclidean", "cosine", "dice", "hamming")
)
calcDrugFPSim(
  fp1,
  fp2,
  fptype = c("compact", "complete"),
  metric = c("tanimoto", "euclidean", "cosine", "dice", "hamming")
)

Arguments

`fp1`	The first molecule's fingerprints, could be extracted by `extractDrugMACCS()`, `extractDrugMACCSComplete()` etc.
`fp2`	The second molecule's fingerprints.
`fptype`	The fingerprint type, must be one of `"compact"` or `"complete"`.
`metric`	The similarity metric, one of `"tanimoto"`, `"euclidean"`, `"cosine"`, `"dice"` and `"hamming"`.

Details

This function calculate drug molecule fingerprints similarity. Define a as the features of object A, b is the features of object B, c is the number of common features to A and B:

Tanimoto: aka Jaccard - $c/a+b+c$
Euclidean: $\sqrt(a + b)$
Dice: aka Sorensen, Czekanowski, Hodgkin-Richards - $c/0.5[(a+c) + (b+c)]$
Cosine: aka Ochiai, Carbo - $c/\sqrt((a + c)(b + c))$
Hamming: aka Manhattan, taxi-cab, city-block distance - $(a + b)$

Value

The numeric similarity value.

References

Gasteiger, Johann, and Thomas Engel, eds. Chemoinformatics. Wiley.com, 2006.

Examples


mols = readMolFromSDF(system.file('compseq/tyrphostin.sdf', package = 'Rcpi'))

fp1 = extractDrugEstate(mols[[1]])
fp2 = extractDrugEstate(mols[[2]])
calcDrugFPSim(fp1, fp2, fptype = 'compact', metric = 'tanimoto')
calcDrugFPSim(fp1, fp2, fptype = 'compact', metric = 'euclidean')
calcDrugFPSim(fp1, fp2, fptype = 'compact', metric = 'cosine')
calcDrugFPSim(fp1, fp2, fptype = 'compact', metric = 'dice')
calcDrugFPSim(fp1, fp2, fptype = 'compact', metric = 'hamming')

fp3 = extractDrugEstateComplete(mols[[1]])
fp4 = extractDrugEstateComplete(mols[[2]])
calcDrugFPSim(fp3, fp4, fptype = 'complete', metric = 'tanimoto')
calcDrugFPSim(fp3, fp4, fptype = 'complete', metric = 'euclidean')
calcDrugFPSim(fp3, fp4, fptype = 'complete', metric = 'cosine')
calcDrugFPSim(fp3, fp4, fptype = 'complete', metric = 'dice')
calcDrugFPSim(fp3, fp4, fptype = 'complete', metric = 'hamming')
mols = readMolFromSDF(system.file('compseq/tyrphostin.sdf', package = 'Rcpi'))

fp1 = extractDrugEstate(mols[[1]])
fp2 = extractDrugEstate(mols[[2]])
calcDrugFPSim(fp1, fp2, fptype = 'compact', metric = 'tanimoto')
calcDrugFPSim(fp1, fp2, fptype = 'compact', metric = 'euclidean')
calcDrugFPSim(fp1, fp2, fptype = 'compact', metric = 'cosine')
calcDrugFPSim(fp1, fp2, fptype = 'compact', metric = 'dice')
calcDrugFPSim(fp1, fp2, fptype = 'compact', metric = 'hamming')

fp3 = extractDrugEstateComplete(mols[[1]])
fp4 = extractDrugEstateComplete(mols[[2]])
calcDrugFPSim(fp3, fp4, fptype = 'complete', metric = 'tanimoto')
calcDrugFPSim(fp3, fp4, fptype = 'complete', metric = 'euclidean')
calcDrugFPSim(fp3, fp4, fptype = 'complete', metric = 'cosine')
calcDrugFPSim(fp3, fp4, fptype = 'complete', metric = 'dice')
calcDrugFPSim(fp3, fp4, fptype = 'complete', metric = 'hamming')

Calculate Drug Molecule Similarity Derived by Maximum Common Substructure Search

Description

Calculate Drug Molecule Similarity Derived by Maximum Common Substructure Search

Usage

calcDrugMCSSim(
  mol1,
  mol2,
  type = c("smile", "sdf"),
  plot = FALSE,
  al = 0,
  au = 0,
  bl = 0,
  bu = 0,
  matching.mode = "static",
  ...
)
calcDrugMCSSim(
  mol1,
  mol2,
  type = c("smile", "sdf"),
  plot = FALSE,
  al = 0,
  au = 0,
  bl = 0,
  bu = 0,
  matching.mode = "static",
  ...
)

Arguments

`mol1`	The first molecule. R character string object containing the molecule. See examples.
`mol2`	The second molecule. R character string object containing the molecule. See examples.
`type`	The input molecule format, 'smile' or 'sdf'.
`plot`	Logical. Should we plot the two molecules and their maximum common substructure?
`al`	Lower bound for the number of atom mismatches. Default is 0.
`au`	Upper bound for the number of atom mismatches. Default is 0.
`bl`	Lower bound for the number of bond mismatches. Default is 0.
`bu`	Upper bound for the number of bond mismatches. Default is 0.
`matching.mode`	Three modes for bond matching are supported: `'static'`, `'aromatic'`, and `'ring'`.
`...`	Other graphical parameters

Details

This function calculate drug molecule similarity derived by maximum common substructure search. The maximum common substructure search algorithm is provided by the fmcsR package.

Value

A list containing the detail MCS information and similarity values. The numeric similarity value includes Tanimoto coefficient and overlap coefficient.

References

Wang, Y., Backman, T. W., Horan, K., & Girke, T. (2013). fmcsR: mismatch tolerant maximum common substructure searching in R. Bioinformatics, 29(21), 2792–2794.

Examples

mol1 = 'CC(C)CCCCCC(=O)NCC1=CC(=C(C=C1)O)OC'
mol2 = 'O=C(NCc1cc(OC)c(O)cc1)CCCC/C=C/C(C)C'
mol3 = readChar(system.file('compseq/DB00859.sdf', package = 'Rcpi'), nchars = 1e+6)
mol4 = readChar(system.file('compseq/DB00860.sdf', package = 'Rcpi'), nchars = 1e+6)
## Not run: 
sim1 = calcDrugMCSSim(mol1, mol2, type = 'smile')
sim2 = calcDrugMCSSim(mol3, mol4, type = 'sdf', plot = TRUE)
print(sim1[[2]])  # Tanimoto Coefficient
print(sim2[[3]])  # Overlap Coefficient

## End(Not run)
mol1 = 'CC(C)CCCCCC(=O)NCC1=CC(=C(C=C1)O)OC'
mol2 = 'O=C(NCc1cc(OC)c(O)cc1)CCCC/C=C/C(C)C'
mol3 = readChar(system.file('compseq/DB00859.sdf', package = 'Rcpi'), nchars = 1e+6)
mol4 = readChar(system.file('compseq/DB00860.sdf', package = 'Rcpi'), nchars = 1e+6)
## Not run: 
sim1 = calcDrugMCSSim(mol1, mol2, type = 'smile')
sim2 = calcDrugMCSSim(mol3, mol4, type = 'sdf', plot = TRUE)
print(sim1[[2]])  # Tanimoto Coefficient
print(sim2[[3]])  # Overlap Coefficient

## End(Not run)

Protein Sequence Similarity Calculation based on Gene Ontology (GO) Similarity

Description

Protein Sequence Similarity Calculation based on Gene Ontology (GO) Similarity

Usage

calcParProtGOSim(
  golist,
  type = c("go", "gene"),
  ont = c("MF", "BP", "CC"),
  organism = "human",
  measure = "Resnik",
  combine = "BMA"
)
calcParProtGOSim(
  golist,
  type = c("go", "gene"),
  ont = c("MF", "BP", "CC"),
  organism = "human",
  measure = "Resnik",
  combine = "BMA"
)

Arguments

`golist`	A character vector, each component contains a character vector of GO terms or one Entrez Gene ID.
`type`	Input type of `golist`, `'go'` for GO Terms, `'gene'` for gene ID.
`ont`	Default is `'MF'`, could be one of `'MF'`, `'BP'`, or `'CC'` subontologies.
`organism`	Default is `'human'`, could be one of `'anopheles'`, `'arabidopsis'`, `'bovine'`, `'canine'`, `'chicken'`, `'chimp'`, `'coelicolor'`, `'ecolik12'`, `'ecsakai'`, `'fly'`, `'human'`, `'malaria'`, `'mouse'`, `'pig'`, `'rat'`, `'rhesus'`, `'worm'`, `'xenopus'`, `'yeast'` or `'zebrafish'`.
`measure`	Default is `'Resnik'`, could be one of `'Resnik'`, `'Lin'`, `'Rel'`, `'Jiang'` or `'Wang'`.
`combine`	Default is `'BMA'`, could be one of `'max'`, `'average'`, `'rcmax'` or `'BMA'` for combining semantic similarity scores of multiple GO terms associated with protein.

Details

This function calculates protein sequence similarity based on Gene Ontology (GO) similarity.

Value

A n x n similarity matrix.

Examples

# By GO Terms
go1 = c('GO:0005215', 'GO:0005488', 'GO:0005515', 'GO:0005625', 'GO:0005802', 'GO:0005905')  # AP4B1
go2 = c('GO:0005515', 'GO:0005634', 'GO:0005681', 'GO:0008380', 'GO:0031202')  # BCAS2
go3 = c('GO:0003735', 'GO:0005622', 'GO:0005840', 'GO:0006412')  # PDE4DIP
glist = list(go1, go2, go3)
calcParProtGOSim(glist, type = 'go', ont = 'CC', measure = 'Wang')

# By Entrez gene id
genelist = list(c('150', '151', '152', '1814', '1815', '1816'))
calcParProtGOSim(genelist, type = 'gene', ont = 'BP', measure = 'Wang')
# By GO Terms
go1 = c('GO:0005215', 'GO:0005488', 'GO:0005515', 'GO:0005625', 'GO:0005802', 'GO:0005905')  # AP4B1
go2 = c('GO:0005515', 'GO:0005634', 'GO:0005681', 'GO:0008380', 'GO:0031202')  # BCAS2
go3 = c('GO:0003735', 'GO:0005622', 'GO:0005840', 'GO:0006412')  # PDE4DIP
glist = list(go1, go2, go3)
calcParProtGOSim(glist, type = 'go', ont = 'CC', measure = 'Wang')

# By Entrez gene id
genelist = list(c('150', '151', '152', '1814', '1815', '1816'))
calcParProtGOSim(genelist, type = 'gene', ont = 'BP', measure = 'Wang')

Parallellized Protein Sequence Similarity Calculation based on Sequence Alignment

Description

Parallellized Protein Sequence Similarity Calculation based on Sequence Alignment

Usage

calcParProtSeqSim(protlist, cores = 2, type = "local", submat = "BLOSUM62")
calcParProtSeqSim(protlist, cores = 2, type = "local", submat = "BLOSUM62")

Arguments

`protlist`	A length `n` list containing `n` protein sequences, each component of the list is a character string, storing one protein sequence. Unknown sequences should be represented as `''`.
`cores`	Integer. The number of CPU cores to use for parallel execution, default is `2`. Users could use the `detectCores()` function in the `parallel` package to see how many cores they could use.
`type`	Type of alignment, default is `'local'`, could be `'global'` or `'local'`, where `'global'` represents Needleman-Wunsch global alignment; `'local'` represents Smith-Waterman local alignment.
`submat`	Substitution matrix, default is `'BLOSUM62'`, could be one of `'BLOSUM45'`, `'BLOSUM50'`, `'BLOSUM62'`, `'BLOSUM80'`, `'BLOSUM100'`, `'PAM30'`, `'PAM40'`, `'PAM70'`, `'PAM120'`, `'PAM250'`.

Details

This function implemented the parallellized version for calculating protein sequence similarity based on sequence alignment.

Value

A n x n similarity matrix.

Examples

s1 = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
s2 = readFASTA(system.file('protseq/P08218.fasta', package = 'Rcpi'))[[1]]
s3 = readFASTA(system.file('protseq/P10323.fasta', package = 'Rcpi'))[[1]]
s4 = readFASTA(system.file('protseq/P20160.fasta', package = 'Rcpi'))[[1]]
s5 = readFASTA(system.file('protseq/Q9NZP8.fasta', package = 'Rcpi'))[[1]]
plist = list(s1, s2, s3, s4, s5)

psimmat = calcParProtSeqSim(plist, cores = 2, type = 'local',
                            submat = 'BLOSUM62')
print(psimmat)
s1 = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
s2 = readFASTA(system.file('protseq/P08218.fasta', package = 'Rcpi'))[[1]]
s3 = readFASTA(system.file('protseq/P10323.fasta', package = 'Rcpi'))[[1]]
s4 = readFASTA(system.file('protseq/P20160.fasta', package = 'Rcpi'))[[1]]
s5 = readFASTA(system.file('protseq/Q9NZP8.fasta', package = 'Rcpi'))[[1]]
plist = list(s1, s2, s3, s4, s5)

psimmat = calcParProtSeqSim(plist, cores = 2, type = 'local',
                            submat = 'BLOSUM62')
print(psimmat)

Protein Similarity Calculation based on Gene Ontology (GO) Similarity

Description

Protein Similarity Calculation based on Gene Ontology (GO) Similarity

Usage

calcTwoProtGOSim(
  id1,
  id2,
  type = c("go", "gene"),
  ont = c("MF", "BP", "CC"),
  organism = "human",
  measure = "Resnik",
  combine = "BMA"
)
calcTwoProtGOSim(
  id1,
  id2,
  type = c("go", "gene"),
  ont = c("MF", "BP", "CC"),
  organism = "human",
  measure = "Resnik",
  combine = "BMA"
)

Arguments

`id1`	A character vector. length > 1: each element is a GO term; length = 1: the Entrez Gene ID.
`id2`	A character vector. length > 1: each element is a GO term; length = 1: the Entrez Gene ID.
`type`	Input type of id1 and id2, `'go'` for GO Terms, `'gene'` for gene ID.
`ont`	Default is `'MF'`, could be one of `'MF'`, `'BP'`, or `'CC'` subontologies.
`organism`	Default is `'human'`, could be one of `'anopheles'`, `'arabidopsis'`, `'bovine'`, `'canine'`, `'chicken'`, `'chimp'`, `'coelicolor'`, `'ecolik12'`, `'ecsakai'`, `'fly'`, `'human'`, `'malaria'`, `'mouse'`, `'pig'`, `'rat'`, `'rhesus'`, `'worm'`, `'xenopus'`, `'yeast'` or `'zebrafish'`.
`measure`	Default is `'Resnik'`, could be one of `'Resnik'`, `'Lin'`, `'Rel'`, `'Jiang'` or `'Wang'`.
`combine`	Default is `'BMA'`, could be one of `'max'`, `'average'`, `'rcmax'` or `'BMA'` for combining semantic similarity scores of multiple GO terms associated with protein.

Details

This function calculates the Gene Ontology (GO) similarity between two groups of GO terms or two Entrez gene IDs.

Value

A n x n matrix.

Examples

# By GO terms
go1 = c("GO:0004022", "GO:0004024", "GO:0004023")
go2 = c("GO:0009055", "GO:0020037")
calcTwoProtGOSim(go1, go2, type = 'go', ont = 'MF', measure = 'Wang')

# By Entrez gene id
gene1 = '241'
gene2 = '251'
calcTwoProtGOSim(gene1, gene2, type = 'gene', ont = 'CC', measure = 'Lin')
# By GO terms
go1 = c("GO:0004022", "GO:0004024", "GO:0004023")
go2 = c("GO:0009055", "GO:0020037")
calcTwoProtGOSim(go1, go2, type = 'go', ont = 'MF', measure = 'Wang')

# By Entrez gene id
gene1 = '241'
gene2 = '251'
calcTwoProtGOSim(gene1, gene2, type = 'gene', ont = 'CC', measure = 'Lin')

Protein Sequence Alignment for Two Protein Sequences

Description

Protein Sequence Alignment for Two Protein Sequences

Usage

calcTwoProtSeqSim(seq1, seq2, type = "local", submat = "BLOSUM62")
calcTwoProtSeqSim(seq1, seq2, type = "local", submat = "BLOSUM62")

Arguments

`seq1`	A character string, containing one protein sequence.
`seq2`	A character string, containing another protein sequence.
`type`	Type of alignment, default is `'local'`, could be `'global'` or `'local'`, where `'global'` represents Needleman-Wunsch global alignment; `'local'` represents Smith-Waterman local alignment.
`submat`	Substitution matrix, default is `'BLOSUM62'`, could be one of `'BLOSUM45'`, `'BLOSUM50'`, `'BLOSUM62'`, `'BLOSUM80'`, `'BLOSUM100'`, `'PAM30'`, `'PAM40'`, `'PAM70'`, `'PAM120'`, `'PAM250'`.

Details

This function implements the sequence alignment between two protein sequences.

Value

An Biostrings object containing the scores and other alignment information.

Examples

s1 = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
s2 = readFASTA(system.file('protseq/P10323.fasta', package = 'Rcpi'))[[1]]

seqalign = calcTwoProtSeqSim(s1, s2)
seqalign
slot(seqalign, "score")
s1 = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
s2 = readFASTA(system.file('protseq/P10323.fasta', package = 'Rcpi'))[[1]]

seqalign = calcTwoProtSeqSim(s1, s2)
seqalign
slot(seqalign, "score")

Check if the protein sequence's amino acid types are the 20 default types

Description

Check if the protein sequence's amino acid types are the 20 default types

Usage

checkProt(x)
checkProt(x)

Arguments

`x`	A character vector, as the input protein sequence.

Details

This function checks if the protein sequence's amino acid types are the 20 default types.

Value

Logical. TRUE if all of the amino acid types of the sequence are within the 20 default types.

Examples

x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
checkProt(x)  # TRUE
checkProt(paste(x, 'Z', sep = ''))  # FALSE
x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
checkProt(x)  # TRUE
checkProt(paste(x, 'Z', sep = ''))  # FALSE

Chemical File Formats Conversion

Description

Chemical File Formats Conversion

Usage

convMolFormat(infile, outfile, from, to)
convMolFormat(infile, outfile, from, to)

Arguments

`infile`	A character string. Indicating the input file location.
`outfile`	A character string. Indicating the output file location.
`from`	The format of `infile`. A character string supported by OpenBabel. See the note section for the supported formats.
`to`	The desired format of `outfile`. A character string supported by OpenBabel. See the note section for the supported formats.

Details

This function converts between various chemical file formats via OpenBabel. The complete supported file format list could be found at https://openbabel.org/docs/FileFormats/Overview.html.

Value

NULL

Note

The supported formats include:

abinit – ABINIT Output Format [Read-only]
acr – ACR format [Read-only]
adf – ADF cartesian input format [Write-only]
adfout – ADF output format [Read-only]
alc – Alchemy format
arc – Accelrys/MSI Biosym/Insight II CAR format [Read-only]
axsf – XCrySDen Structure Format [Read-only]
bgf – MSI BGF format
box – Dock 3.5 Box format
bs – Ball and Stick format
c3d1 – Chem3D Cartesian 1 format
c3d2 – Chem3D Cartesian 2 format
cac – CAChe MolStruct format [Write-only]
caccrt – Cacao Cartesian format
cache – CAChe MolStruct format [Write-only]
cacint – Cacao Internal format [Write-only]
can – Canonical SMILES format
car – Accelrys/MSI Biosym/Insight II CAR format [Read-only]
castep – CASTEP format [Read-only]
ccc – CCC format [Read-only]
cdx – ChemDraw binary format [Read-only]
cdxml – ChemDraw CDXML format
cht – Chemtool format [Write-only]
cif – Crystallographic Information File
ck – ChemKin format
cml – Chemical Markup Language
cmlr – CML Reaction format
com – Gaussian 98/03 Input [Write-only]
CONFIG – DL-POLY CONFIG
CONTCAR – VASP format [Read-only]
copy – Copy raw text [Write-only]
crk2d – Chemical Resource Kit diagram(2D)
crk3d – Chemical Resource Kit 3D format
csr – Accelrys/MSI Quanta CSR format [Write-only]
cssr – CSD CSSR format [Write-only]
ct – ChemDraw Connection Table format
cub – Gaussian cube format
cube – Gaussian cube format
dat – Generic Output file format [Read-only]
dmol – DMol3 coordinates format
dx – OpenDX cube format for APBS
ent – Protein Data Bank format
fa – FASTA format
fasta – FASTA format
fch – Gaussian formatted checkpoint file format [Read-only]
fchk – Gaussian formatted checkpoint file format [Read-only]
fck – Gaussian formatted checkpoint file format [Read-only]
feat – Feature format
fh – Fenske-Hall Z-Matrix format [Write-only]
fhiaims – FHIaims XYZ format
fix – SMILES FIX format [Write-only]
fpt – Fingerprint format [Write-only]
fract – Free Form Fractional format
fs – Fastsearch format
fsa – FASTA format
g03 – Gaussian Output [Read-only]
g09 – Gaussian Output [Read-only]
g92 – Gaussian Output [Read-only]
g94 – Gaussian Output [Read-only]
g98 – Gaussian Output [Read-only]
gal – Gaussian Output [Read-only]
gam – GAMESS Output [Read-only]
gamess – GAMESS Output [Read-only]
gamin – GAMESS Input
gamout – GAMESS Output [Read-only]
gau – Gaussian 98/03 Input [Write-only]
gjc – Gaussian 98/03 Input [Write-only]
gjf – Gaussian 98/03 Input [Write-only]
got – GULP format [Read-only]
gpr – Ghemical format
gr96 – GROMOS96 format [Write-only]
gro – GRO format
gukin – GAMESS-UK Input
gukout – GAMESS-UK Output
gzmat – Gaussian Z-Matrix Input
hin – HyperChem HIN format
HISTORY – DL-POLY HISTORY [Read-only]
inchi – InChI format
inchikey – InChIKey [Write-only]
inp – GAMESS Input
ins – ShelX format [Read-only]
jin – Jaguar input format [Write-only]
jout – Jaguar output format [Read-only]
k – Compare molecules using InChI [Write-only]
log – Generic Output file format [Read-only]
mcdl – MCDL format
mcif – Macromolecular Crystallographic Info
mdl – MDL MOL format
ml2 – Sybyl Mol2 format
mmcif – Macromolecular Crystallographic Info
mmd – MacroModel format
mmod – MacroModel format
mna – Multilevel Neighborhoods of Atoms (MNA) [Write-only]
mol – MDL MOL format
mol2 – Sybyl Mol2 format
mold – Molden format
molden – Molden format
molf – Molden format
molreport – Open Babel molecule report [Write-only]
moo – MOPAC Output format [Read-only]
mop – MOPAC Cartesian format
mopcrt – MOPAC Cartesian format
mopin – MOPAC Internal
mopout – MOPAC Output format [Read-only]
mp – Molpro input format [Write-only]
mpc – MOPAC Cartesian format
mpd – MolPrint2D format [Write-only]
mpo – Molpro output format [Read-only]
mpqc – MPQC output format [Read-only]
mpqcin – MPQC simplified input format [Write-only]
mrv – Chemical Markup Language
msi – Accelrys/MSI Cerius II MSI format [Read-only]
msms – M.F. Sanner's MSMS input format [Write-only]
nul – Outputs nothing [Write-only]
nw – NWChem input format [Write-only]
nwo – NWChem output format [Read-only]
out – Generic Output file format [Read-only]
outmol – DMol3 coordinates format
output – Generic Output file format [Read-only]
pc – PubChem format [Read-only]
pcm – PCModel Format
pdb – Protein Data Bank format
pdbqt – AutoDock PDQBT format
png – PNG 2D depiction
POSCAR – VASP format [Read-only]
pov – POV-Ray input format [Write-only]
pqr – PQR format
pqs – Parallel Quantum Solutions format
prep – Amber Prep format [Read-only]
pwscf – PWscf format [Read-only]
qcin – Q-Chem input format [Write-only]
qcout – Q-Chem output format [Read-only]
report – Open Babel report format [Write-only]
res – ShelX format [Read-only]
rsmi – Reaction SMILES format
rxn – MDL RXN format
sd – MDL MOL format
sdf – MDL MOL format
smi – SMILES format
smiles – SMILES format
svg – SVG 2D depiction [Write-only]
sy2 – Sybyl Mol2 format
t41 – ADF TAPE41 format [Read-only]
tdd – Thermo format
text – Read and write raw text
therm – Thermo format
tmol – TurboMole Coordinate format
txt – Title format
txyz – Tinker XYZ format
unixyz – UniChem XYZ format
vmol – ViewMol format
xed – XED format [Write-only]
xml – General XML format [Read-only]
xsf – XCrySDen Structure Format [Read-only]
xyz – XYZ cartesian coordinates format
yob – YASARA.org YOB format
zin – ZINDO input format [Write-only]

Examples

sdf = system.file('sysdata/OptAA3d.sdf', package = 'Rcpi')
# SDF to SMILES
## Not run: 
convMolFormat(infile = sdf, outfile = 'aa.smi',
              from = 'sdf', to = 'smiles')
## End(Not run)
# SMILES to MOPAC Cartesian format
## Not run: 
convMolFormat(infile = 'aa.smi', outfile = 'aa.mop',
              from = 'smiles', to = 'mop')
## End(Not run)
sdf = system.file('sysdata/OptAA3d.sdf', package = 'Rcpi')
# SDF to SMILES
## Not run: 
convMolFormat(infile = sdf, outfile = 'aa.smi',
              from = 'sdf', to = 'smiles')
## End(Not run)
# SMILES to MOPAC Cartesian format
## Not run: 
convMolFormat(infile = 'aa.smi', outfile = 'aa.mop',
              from = 'smiles', to = 'mop')
## End(Not run)

Calculate All Molecular Descriptors in Rcpi at Once

Description

Calculate All Molecular Descriptors in Rcpi at Once

Usage

extractDrugAIO(molecules, silent = TRUE, warn = TRUE)
extractDrugAIO(molecules, silent = TRUE, warn = TRUE)

Arguments

`molecules`	Parsed molucule object.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.
`warn`	Logical. Whether the warning about some descriptors need the 3D coordinates should be shown or not after the calculation, default is `TRUE`.

Details

This function calculates all the molecular descriptors in the Rcpi package at once.

Value

A data frame, each row represents one of the molecules, each column represents one descriptor. Currently, this function returns total 293 descriptors composed of 48 descriptor types.

Note

Note that we need 3-D coordinates of the molecules to calculate some of the descriptors, if not provided, these descriptors values will be NA.

Examples

# Load 20 small molecules that have 3D coordinates
sdf = system.file('sysdata/OptAA3d.sdf', package = 'Rcpi')

mol = readMolFromSDF(sdf)
dat = extractDrugAIO(mol, warn = FALSE)
# Load 20 small molecules that have 3D coordinates
sdf = system.file('sysdata/OptAA3d.sdf', package = 'Rcpi')

mol = readMolFromSDF(sdf)
dat = extractDrugAIO(mol, warn = FALSE)

Calculate Atom Additive logP and Molar Refractivity Values Descriptor

Description

Calculate Atom Additive logP and Molar Refractivity Values Descriptor

Usage

extractDrugALOGP(molecules, silent = TRUE)
extractDrugALOGP(molecules, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

Calculates ALOGP (Ghose-Crippen LogKow) and the Ghose-Crippen molar refractivity as described by Ghose, A.K. and Crippen, G.M. Note the underlying code in CDK assumes that aromaticity has been detected before evaluating this descriptor. The code also expects that the molecule will have hydrogens explicitly set. For SD files, this is usually not a problem since hydrogens are explicit. But for the case of molecules obtained from SMILES, hydrogens must be made explicit.

Value

A data frame, each row represents one of the molecules, each column represents one feature. This function returns three columns named ALogP, ALogp2 and AMR.

References

Ghose, A.K. and Crippen, G.M. , Atomic physicochemical parameters for three-dimensional structure-directed quantitative structure-activity relationships. I. Partition coefficients as a measure of hydrophobicity, Journal of Computational Chemistry, 1986, 7:565-577.

Ghose, A.K. and Crippen, G.M. , Atomic physicochemical parameters for three-dimensional-structure-directed quantitative structure-activity relationships. 2. Modeling dispersive and hydrophobic interactions, Journal of Chemical Information and Computer Science, 1987, 27:21-35.

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugALOGP(mol)
head(dat)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugALOGP(mol)
head(dat)

Calculate the Number of Amino Acids Descriptor

Description

Calculate the Number of Amino Acids Descriptor

Usage

extractDrugAminoAcidCount(molecules, silent = TRUE)
extractDrugAminoAcidCount(molecules, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

Calculates the number of each amino acids (total 20 types) found in the molecues.

Value

A data frame, each row represents one of the molecules, each column represents one feature. This function returns 20 columns named nA, nR, nN, nD, nC, nF, nQ, nE, nG, nH, nI, nP, nL nK, nM, nS, nT, nY nV, nW.

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugAminoAcidCount(mol)
head(dat)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugAminoAcidCount(mol)
head(dat)

Calculate the Sum of the Atomic Polarizabilities Descriptor

Description

Calculate the Sum of the Atomic Polarizabilities Descriptor

Usage

extractDrugApol(molecules, silent = TRUE)
extractDrugApol(molecules, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

Calculates the sum of the atomic polarizabilities (including implicit hydrogens) descriptor. Polarizabilities are taken from https://bit.ly/3PvNbhe.

Value

A data frame, each row represents one of the molecules, each column represents one feature. This function returns one column named apol.

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugApol(mol)
head(dat)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugApol(mol)
head(dat)

Calculate the Number of Aromatic Atoms Descriptor

Description

Calculate the Number of Aromatic Atoms Descriptor

Usage

extractDrugAromaticAtomsCount(molecules, silent = TRUE)
extractDrugAromaticAtomsCount(molecules, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

Calculates the number of aromatic atoms of a molecule.

Value

A data frame, each row represents one of the molecules, each column represents one feature. This function returns one column named naAromAtom.

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugAromaticAtomsCount(mol)
head(dat)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugAromaticAtomsCount(mol)
head(dat)

Calculate the Number of Aromatic Bonds Descriptor

Description

Calculate the Number of Aromatic Bonds Descriptor

Usage

extractDrugAromaticBondsCount(molecules, silent = TRUE)
extractDrugAromaticBondsCount(molecules, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

Calculates the number of aromatic bonds of a molecule.

Value

A data frame, each row represents one of the molecules, each column represents one feature. This function returns one column named nAromBond.

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugAromaticBondsCount(mol)
head(dat)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugAromaticBondsCount(mol)
head(dat)

Calculate the Number of Atom Descriptor

Description

Calculate the Number of Atom Descriptor

Usage

extractDrugAtomCount(molecules, silent = TRUE)
extractDrugAtomCount(molecules, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

Calculates the number of atoms of a certain element type in a molecule. By default it returns the count of all atoms.

Value

A data frame, each row represents one of the molecules, each column represents one feature. This function returns one column named nAtom.

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugAtomCount(mol)
head(dat)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugAtomCount(mol)
head(dat)

Calculate the Moreau-Broto Autocorrelation Descriptors using Partial Charges

Description

Calculate the Moreau-Broto Autocorrelation Descriptors using Partial Charges

Usage

extractDrugAutocorrelationCharge(molecules, silent = TRUE)
extractDrugAutocorrelationCharge(molecules, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

Calculates the ATS autocorrelation descriptor, where the weight equal to the charges.

Value

A data frame, each row represents one of the molecules, each column represents one feature. This function returns 5 columns named ATSc1, ATSc2, ATSc3, ATSc4, ATSc5.

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugAutocorrelationCharge(mol)
head(dat)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugAutocorrelationCharge(mol)
head(dat)

Calculate the Moreau-Broto Autocorrelation Descriptors using Atomic Weight

Description

Calculate the Moreau-Broto Autocorrelation Descriptors using Atomic Weight

Usage

extractDrugAutocorrelationMass(molecules, silent = TRUE)
extractDrugAutocorrelationMass(molecules, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

Calculates the ATS autocorrelation descriptor, where the weight equal to the scaled atomic mass.

Value

A data frame, each row represents one of the molecules, each column represents one feature. This function returns 5 columns named ATSm1, ATSm2, ATSm3, ATSm4, ATSm5.

References

Moreau, Gilles, and Pierre Broto. The autocorrelation of a topological structure: a new molecular descriptor. Nouv. J. Chim 4 (1980): 359-360.

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugAutocorrelationMass(mol)
head(dat)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugAutocorrelationMass(mol)
head(dat)

Calculate the Moreau-Broto Autocorrelation Descriptors using Polarizability

Description

Calculate the Moreau-Broto Autocorrelation Descriptors using Polarizability

Usage

extractDrugAutocorrelationPolarizability(molecules, silent = TRUE)
extractDrugAutocorrelationPolarizability(molecules, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

Calculates the ATS autocorrelation descriptor using polarizability.

Value

A data frame, each row represents one of the molecules, each column represents one feature. This function returns 5 columns named ATSp1, ATSp2, ATSp3, ATSp4, ATSp5.

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugAutocorrelationPolarizability(mol)
head(dat)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugAutocorrelationPolarizability(mol)
head(dat)

BCUT – Eigenvalue Based Descriptor

Description

BCUT – Eigenvalue Based Descriptor

Usage

extractDrugBCUT(molecules, silent = TRUE)
extractDrugBCUT(molecules, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

Eigenvalue based descriptor noted for its utility in chemical diversity. Described by Pearlman et al. The descriptor is based on a weighted version of the Burden matrix which takes into account both the connectivity as well as atomic properties of a molecule. The weights are a variety of atom properties placed along the diagonal of the Burden matrix. Currently three weighting schemes are employed:

Atomic Weight
Partial Charge (Gasteiger Marsilli)
Polarizability (Kang et al.)

Value

A data frame, each row represents one of the molecules, each column represents one feature. This function returns 6 columns:

BCUTw-1l, BCUTw-2l ... - n high lowest atom weighted BCUTS
BCUTw-1h, BCUTw-2h ... - n low highest atom weighted BCUTS
BCUTc-1l, BCUTc-2l ... - n high lowest partial charge weighted BCUTS
BCUTc-1h, BCUTc-2h ... - n low highest partial charge weighted BCUTS
BCUTp-1l, BCUTp-2l ... - n high lowest polarizability weighted BCUTS
BCUTp-1h, BCUTp-2h ... - n low highest polarizability weighted BCUTS

Note

By default, the descriptor will return the highest and lowest eigenvalues for the three classes of descriptor in a single ArrayList (in the order shown above). However it is also possible to supply a parameter list indicating how many of the highest and lowest eigenvalues (for each class of descriptor) are required. The descriptor works with the hydrogen depleted molecule.

A side effect of specifying the number of highest and lowest eigenvalues is that it is possible to get two copies of all the eigenvalues. That is, if a molecule has 5 heavy atoms, then specifying the 5 highest eigenvalues returns all of them, and specifying the 5 lowest eigenvalues returns all of them, resulting in two copies of all the eigenvalues.

Note that it is possible to specify an arbitrarily large number of eigenvalues to be returned. However if the number (i.e., nhigh or nlow) is larger than the number of heavy atoms, the remaining eignevalues will be NaN.

Given the above description, if the aim is to gt all the eigenvalues for a molecule, you should set nlow to 0 and specify the number of heavy atoms (or some large number) for nhigh (or vice versa).

References

Pearlman, R.S. and Smith, K.M., Metric Validation and the Receptor-Relevant Subspace Concept, J. Chem. Inf. Comput. Sci., 1999, 39:28-35.

Burden, F.R., Molecular identification number for substructure searches, J. Chem. Inf. Comput. Sci., 1989, 29:225-227.

Burden, F.R., Chemically Intuitive Molecular Index, Quant. Struct. -Act. Relat., 1997, 16:309-314

Kang, Y.K. and Jhon, M.S., Additivity of Atomic Static Polarizabilities and Dispersion Coefficients, Theoretica Chimica Acta, 1982, 61:41-48

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugBCUT(mol)
head(dat)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugBCUT(mol)
head(dat)

Calculate the Descriptor Based on the Number of Bonds of a Certain Bond Order

Description

Calculate the Descriptor Based on the Number of Bonds of a Certain Bond Order

Usage

extractDrugBondCount(molecules, silent = TRUE)
extractDrugBondCount(molecules, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

Calculates the descriptor based on the number of bonds of a certain bond order.

Value

A data frame, each row represents one of the molecules, each column represents one feature. This function returns one column named nB.

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugBondCount(mol)
head(dat)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugBondCount(mol)
head(dat)

Calculate the Descriptor that Describes the Sum of the Absolute Value of the Difference between Atomic Polarizabilities of All Bonded Atoms in the Molecule

Description

Calculates the Descriptor that Describes the Sum of the Absolute Value of the Difference between Atomic Polarizabilities of All Bonded Atoms in the Molecule

Usage

extractDrugBPol(molecules, silent = TRUE)
extractDrugBPol(molecules, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

This descriptor calculates the sum of the absolute value of the difference between atomic polarizabilities of all bonded atoms in the molecule (including implicit hydrogens) with polarizabilities taken from https://bit.ly/3PvNbhe. This descriptor assumes 2-centered bonds.

Value

A data frame, each row represents one of the molecules, each column represents one feature. This function returns one column named bpol.

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugBPol(mol)
head(dat)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugBPol(mol)
head(dat)

Topological Descriptor Characterizing the Carbon Connectivity in Terms of Hybridization

Description

Topological Descriptor Characterizing the Carbon Connectivity in Terms of Hybridization

Usage

extractDrugCarbonTypes(molecules, silent = TRUE)
extractDrugCarbonTypes(molecules, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

Calculates the carbon connectivity in terms of hybridization. The function calculates 9 descriptors in the following order:

C1SP1 - triply hound carbon bound to one other carbon
C2SP1 - triply bound carbon bound to two other carbons
C1SP2 - doubly hound carbon bound to one other carbon
C2SP2 - doubly bound carbon bound to two other carbons
C3SP2 - doubly bound carbon bound to three other carbons
C1SP3 - singly bound carbon bound to one other carbon
C2SP3 - singly bound carbon bound to two other carbons
C3SP3 - singly bound carbon bound to three other carbons
C4SP3 - singly bound carbon bound to four other carbons

Value

A data frame, each row represents one of the molecules, each column represents one feature. This function returns 9 columns named C1SP1, C2SP1, C1SP2, C2SP2, C3SP2, C1SP3, C2SP3, C3SP3 and C4SP3.

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugCarbonTypes(mol)
head(dat)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugCarbonTypes(mol)
head(dat)

Calculate the Kier and Hall Chi Chain Indices of Orders 3, 4, 5, 6 and 7

Description

Calculate the Kier and Hall Chi Chain Indices of Orders 3, 4, 5, 6 and 7

Usage

extractDrugChiChain(molecules, silent = TRUE)
extractDrugChiChain(molecules, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

Evaluates chi chain descriptors. The code currently evluates the simple and valence chi chain descriptors of orders 3, 4, 5, 6 and 7. It utilizes the graph isomorphism code of the CDK to find fragments matching SMILES strings representing the fragments corresponding to each type of chain.

Value

A data frame, each row represents one of the molecules, each column represents one feature. This function returns 10 columns, in the following order:

SCH.3 - Simple chain, order 3
SCH.4 - Simple chain, order 4
SCH.5 - Simple chain, order 5
SCH.6 - Simple chain, order 6
SCH.7 - Simple chain, order 7
VCH.3 - Valence chain, order 3
VCH.4 - Valence chain, order 4
VCH.5 - Valence chain, order 5
VCH.6 - Valence chain, order 6
VCH.7 - Valence chain, order 7

Note

These descriptors are calculated using graph isomorphism to identify the various fragments. As a result calculations may be slow. In addition, recent versions of Molconn-Z use simplified fragment definitions (i.e., rings without branches etc.) whereas these descriptors use the older more complex fragment definitions.

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugChiChain(mol)
head(dat)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugChiChain(mol)
head(dat)

Evaluates the Kier and Hall Chi cluster indices of orders 3, 4, 5 and 6

Description

Evaluates the Kier and Hall Chi cluster indices of orders 3, 4, 5 and 6

Usage

extractDrugChiCluster(molecules, silent = TRUE)
extractDrugChiCluster(molecules, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

Evaluates chi cluster descriptors. It utilizes the graph isomorphism code of the CDK to find fragments matching SMILES strings representing the fragments corresponding to each type of chain.

Value

A data frame, each row represents one of the molecules, each column represents one feature. This function returns 8 columns, the order and names of the columns returned is:

SC.3 - Simple cluster, order 3
SC.4 - Simple cluster, order 4
SC.5 - Simple cluster, order 5
SC.6 - Simple cluster, order 6
VC.3 - Valence cluster, order 3
VC.4 - Valence cluster, order 4
VC.5 - Valence cluster, order 5
VC.6 - Valence cluster, order 6

Note

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugChiCluster(mol)
head(dat)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugChiCluster(mol)
head(dat)

Calculate the Kier and Hall Chi Path Indices of Orders 0 to 7

Description

Calculate the Kier and Hall Chi Path Indices of Orders 0 to 7

Usage

extractDrugChiPath(molecules, silent = TRUE)
extractDrugChiPath(molecules, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

Evaluates chi path descriptors. This function utilizes the graph isomorphism code of the CDK to find fragments matching SMILES strings representing the fragments corresponding to each type of chain.

Value

A data frame, each row represents one of the molecules, each column represents one feature. This function returns 16 columns, The order and names of the columns returned is:

SP.0, SP.1, ..., SP.7 - Simple path, orders 0 to 7
VP.0, VP.1, ..., VP.7 - Valence path, orders 0 to 7

Note

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugChiPath(mol)
head(dat)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugChiPath(mol)
head(dat)

Calculate the Kier and Hall Chi Path Cluster Indices of Orders 4, 5 and 6

Description

Calculate the Kier and Hall Chi Path Cluster Indices of Orders 4, 5 and 6

Usage

extractDrugChiPathCluster(molecules, silent = TRUE)
extractDrugChiPathCluster(molecules, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

Evaluates chi path cluster descriptors. The code currently evluates the simple and valence chi chain descriptors of orders 4, 5 and 6. It utilizes the graph isomorphism code of the CDK to find fragments matching SMILES strings representing the fragments corresponding to each type of chain.

Value

A data frame, each row represents one of the molecules, each column represents one feature. This function returns 6 columns named SPC.4, SPC.5, SPC.6, VPC.4, VPC.5, VPC.6:

SPC.4 - Simple path cluster, order 4
SPC.5 - Simple path cluster, order 5
SPC.6 - Simple path cluster, order 6
VPC.4 - Valence path cluster, order 4
VPC.5 - Valence path cluster, order 5
VPC.6 - Valence path cluster, order 6

Note

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugChiPathCluster(mol)
head(dat)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugChiPathCluster(mol)
head(dat)

A Variety of Descriptors Combining Surface Area and Partial Charge Information

Description

A Variety of Descriptors Combining Surface Area and Partial Charge Information

Usage

extractDrugCPSA(molecules, silent = TRUE)
extractDrugCPSA(molecules, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

Calculates 29 Charged Partial Surface Area (CPSA) descriptors. The CPSA's were developed by Stanton et al.

Value

A data frame, each row represents one of the molecules, each column represents one feature. This function returns 29 columns:

PPSA.1 - partial positive surface area – sum of surface area on positive parts of molecule
PPSA.2 - partial positive surface area * total positive charge on the molecule
PPSA.3 - charge weighted partial positive surface area
PNSA.1 - partial negative surface area – sum of surface area on negative parts of molecule
PNSA.2 - partial negative surface area * total negative charge on the molecule
PNSA.3 - charge weighted partial negative surface area
DPSA.1 - difference of PPSA.1 and PNSA.1
DPSA.2 - difference of FPSA.2 and PNSA.2
DPSA.3 - difference of PPSA.3 and PNSA.3
FPSA.1 - PPSA.1 / total molecular surface area
FFSA.2 - PPSA.2 / total molecular surface area
FPSA.3 - PPSA.3 / total molecular surface area
FNSA.1 - PNSA.1 / total molecular surface area
FNSA.2 - PNSA.2 / total molecular surface area
FNSA.3 - PNSA.3 / total molecular surface area
WPSA.1 - PPSA.1 * total molecular surface area / 1000
WPSA.2 - PPSA.2 * total molecular surface area /1000
WPSA.3 - PPSA.3 * total molecular surface area / 1000
WNSA.1 - PNSA.1 * total molecular surface area /1000
WNSA.2 - PNSA.2 * total molecular surface area / 1000
WNSA.3 - PNSA.3 * total molecular surface area / 1000
RPCG - relative positive charge – most positive charge / total positive charge
RNCG - relative negative charge – most negative charge / total negative charge
RPCS - relative positive charge surface area – most positive surface area * RPCG
RNCS - relative negative charge surface area – most negative surface area * RNCG
THSA - sum of solvent accessible surface areas of atoms with absolute value of partial charges less than 0.2
TPSA - sum of solvent accessible surface areas of atoms with absolute value of partial charges greater than or equal 0.2
RHSA - THSA / total molecular surface area
RPSA - TPSA / total molecular surface area

References

Stanton, D.T. and Jurs, P.C. , Development and Use of Charged Partial Surface Area Structural Descriptors in Computer Assissted Quantitative Structure Property Relationship Studies, Analytical Chemistry, 1990, 62:2323.2329.

Examples

sdf = system.file('sysdata/OptAA3d.sdf', package = 'Rcpi')

mol = readMolFromSDF(sdf)
dat = extractDrugCPSA(mol)
head(dat)
sdf = system.file('sysdata/OptAA3d.sdf', package = 'Rcpi')

mol = readMolFromSDF(sdf)
dat = extractDrugCPSA(mol)
head(dat)

Calculate Molecular Descriptors Provided by OpenBabel

Description

Calculate Molecular Descriptors Provided by OpenBabel

Usage

extractDrugDescOB(molecules, type = c("smile", "sdf"))
extractDrugDescOB(molecules, type = c("smile", "sdf"))

Arguments

`molecules`	R character string object containing the molecules. See the example section for details.
`type`	`'smile'` or `'sdf'`.

Details

This function calculates 14 types of the numerical molecular descriptors provided in OpenBabel.

Value

A data frame, each row represents one of the molecules, each column represents one descriptor. This function returns 14 columns named abonds, atoms, bonds, dbonds, HBA1, HBA2, HBD, logP, MR, MW, nF, sbonds, tbonds, TPSA:

abonds - Number of aromatic bonds
atoms - Number of atoms
bonds - Number of bonds
dbonds - Number of double bonds
HBA1 - Number of Hydrogen Bond Acceptors 1
HBA2 - Number of Hydrogen Bond Acceptors 2
HBD - Number of Hydrogen Bond Donors
logP - Octanol/Water Partition Coefficient
MR - Molar Refractivity
MW - Molecular Weight Filter
nF - Number of Fluorine Atoms
sbonds - Number of single bonds
tbonds - Number of triple bonds
TPSA - Topological Polar Surface Area

Examples

mol1 = 'CC(=O)NCCC1=CNc2c1cc(OC)cc2'  # one molecule SMILE in a vector
mol2 = c('OCCc1c(C)[n+](=cs1)Cc2cnc(C)nc(N)2',
         'CCc(c1)ccc2[n+]1ccc3c2Nc4c3cccc4',
         '[Cu+2].[O-]S(=O)(=O)[O-]')  # multiple SMILEs in a vector
mol3 = readChar(system.file('compseq/DB00860.sdf', package = 'Rcpi'),
                nchars = 1e+6)  # single molecule in a sdf file
mol4 = readChar(system.file('sysdata/OptAA3d.sdf', package = 'Rcpi'),
                nchars = 1e+6)  # multiple molecules in a sdf file

## Not run: 
smidesc0 = extractDrugDescOB(mol1, type = 'smile')
smidesc1 = extractDrugDescOB(mol2, type = 'smile')
sdfdesc0 = extractDrugDescOB(mol3, type = 'sdf')
sdfdesc1 = extractDrugDescOB(mol4, type = 'sdf')
## End(Not run)
mol1 = 'CC(=O)NCCC1=CNc2c1cc(OC)cc2'  # one molecule SMILE in a vector
mol2 = c('OCCc1c(C)[n+](=cs1)Cc2cnc(C)nc(N)2',
         'CCc(c1)ccc2[n+]1ccc3c2Nc4c3cccc4',
         '[Cu+2].[O-]S(=O)(=O)[O-]')  # multiple SMILEs in a vector
mol3 = readChar(system.file('compseq/DB00860.sdf', package = 'Rcpi'),
                nchars = 1e+6)  # single molecule in a sdf file
mol4 = readChar(system.file('sysdata/OptAA3d.sdf', package = 'Rcpi'),
                nchars = 1e+6)  # multiple molecules in a sdf file

## Not run: 
smidesc0 = extractDrugDescOB(mol1, type = 'smile')
smidesc1 = extractDrugDescOB(mol2, type = 'smile')
sdfdesc0 = extractDrugDescOB(mol3, type = 'sdf')
sdfdesc1 = extractDrugDescOB(mol4, type = 'sdf')
## End(Not run)

Calculate the Eccentric Connectivity Index Descriptor

Description

Calculate the Eccentric Connectivity Index Descriptor

Usage

extractDrugECI(molecules, silent = TRUE)
extractDrugECI(molecules, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

Eccentric Connectivity Index (ECI) is a topological descriptor combining distance and adjacency information. This descriptor is described by Sharma et al. and has been shown to correlate well with a number of physical properties. The descriptor is also reported to have good discriminatory ability. The eccentric connectivity index for a hydrogen supressed molecular graph is given by

$x_i^c = \sum_{i = 1}^{n} E(i) V(i)$

where E(i) is the eccentricity of the i-th atom (path length from the i-th atom to the atom farthest from it) and V(i) is the vertex degree of the i-th atom.

Value

A data frame, each row represents one of the molecules, each column represents one feature. This function returns one column named ECCEN.

References

Sharma, V. and Goswami, R. and Madan, A.K. (1997), Eccentric Connectivity Index: A Novel Highly Discriminating Topological Descriptor for Structure-Property and Structure-Activity Studies, Journal of Chemical Information and Computer Sciences, 37:273-282

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugECI(mol)
head(dat)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugECI(mol)
head(dat)

Calculate the E-State Molecular Fingerprints (in Compact Format)

Description

Calculate the E-State Molecular Fingerprints (in Compact Format)

Usage

extractDrugEstate(molecules, silent = TRUE)
extractDrugEstate(molecules, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

79 bit fingerprints corresponding to the E-State atom types described by Hall and Kier.

Value

A list, each component represents one of the molecules, each element in the component represents the index of which element in the fingerprint is 1. Each component's name is the length of the fingerprints.

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
fp  = extractDrugEstate(mol)
head(fp)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
fp  = extractDrugEstate(mol)
head(fp)

Calculate the E-State Molecular Fingerprints (in Complete Format)

Description

Calculate the E-State Molecular Fingerprints (in Complete Format)

Usage

extractDrugEstateComplete(molecules, silent = TRUE)
extractDrugEstateComplete(molecules, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

79 bit fingerprints corresponding to the E-State atom types described by Hall and Kier.

Value

An integer vector or a matrix. Each row represents one molecule, the columns represent the fingerprints.

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
fp  = extractDrugEstateComplete(mol)
dim(fp)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
fp  = extractDrugEstateComplete(mol)
dim(fp)

Calculate the Extended Molecular Fingerprints (in Compact Format)

Description

Calculate the Extended Molecular Fingerprints (in Compact Format)

Usage

extractDrugExtended(molecules, depth = 6, size = 1024, silent = TRUE)
extractDrugExtended(molecules, depth = 6, size = 1024, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`depth`	The search depth. Default is `6`.
`size`	The length of the fingerprint bit string. Default is `1024`.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

Calculate the extended molecular fingerprints. Considers paths of a given length, similar to the standard type, but takes rings and atomic properties into account into account. This is hashed fingerprints, with a default length of 1024.

Value

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
fp  = extractDrugExtended(mol)
head(fp)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
fp  = extractDrugExtended(mol)
head(fp)

Calculate the Extended Molecular Fingerprints (in Complete Format)

Description

Calculate the Extended Molecular Fingerprints (in Complete Format)

Usage

extractDrugExtendedComplete(molecules, depth = 6, size = 1024, silent = TRUE)
extractDrugExtendedComplete(molecules, depth = 6, size = 1024, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`depth`	The search depth. Default is `6`.
`size`	The length of the fingerprint bit string. Default is `1024`.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

Value

An integer vector or a matrix. Each row represents one molecule, the columns represent the fingerprints.

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
fp  = extractDrugExtendedComplete(mol)
dim(fp)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
fp  = extractDrugExtendedComplete(mol)
dim(fp)

Calculate the FMF Descriptor

Description

Calculate the FMF Descriptor

Usage

extractDrugFMF(molecules, silent = TRUE)
extractDrugFMF(molecules, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

Calculates the FMF descriptor characterizing molecular complexity in terms of its Murcko framework. This descriptor is the ratio of heavy atoms in the framework to the total number of heavy atoms in the molecule. By definition, acyclic molecules which have no frameworks, will have a value of 0. Note that the authors consider an isolated ring system to be a framework (even though there is no linker).

Value

A data frame, each row represents one of the molecules, each column represents one feature. This function returns one column named FMF.

References

Yang, Y., Chen, H., Nilsson, I., Muresan, S., & Engkvist, O. (2010). Investigation of the relationship between topology and selectivity for druglike molecules. Journal of medicinal chemistry, 53(21), 7709-7714.

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugFMF(mol)
head(dat)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugFMF(mol)
head(dat)

Calculate Complexity of a System

Description

Calculate Complexity of a System

Usage

extractDrugFragmentComplexity(molecules, silent = TRUE)
extractDrugFragmentComplexity(molecules, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

This descriptor calculates the complexity of a system. The complexity is defined in Nilakantan, R. et al. as:

$C = abs(B^2 - A^2 + A) + \frac{H}{100}$

where C is complexity, A is the number of non-hydrogen atoms, B is the number of bonds and H is the number of heteroatoms.

Value

A data frame, each row represents one of the molecules, each column represents one feature. This function returns one column named fragC.

References

Nilakantan, R. and Nunn, D.S. and Greenblatt, L. and Walker, G. and Haraki, K. and Mobilio, D., A family of ring system-based structural fragments for use in structure-activity studies: database mining and recursive partitioning., Journal of chemical information and modeling, 2006, 46:1069-1077

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugFragmentComplexity(mol)
head(dat)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugFragmentComplexity(mol)
head(dat)

Calculate the Graph Molecular Fingerprints (in Compact Format)

Description

Calculate the Graph Molecular Fingerprints (in Compact Format)

Usage

extractDrugGraph(molecules, depth = 6, size = 1024, silent = TRUE)
extractDrugGraph(molecules, depth = 6, size = 1024, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`depth`	The search depth. Default is `6`.
`size`	The length of the fingerprint bit string. Default is `1024`.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

Calculate the graph molecular fingerprints. Similar to the standard type by simply considers connectivity. This is hashed fingerprints, with a default length of 1024.

Value

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
fp  = extractDrugGraph(mol)
head(fp)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
fp  = extractDrugGraph(mol)
head(fp)

Calculate the Graph Molecular Fingerprints (in Complete Format)

Description

Calculate the Graph Molecular Fingerprints (in Complete Format)

Usage

extractDrugGraphComplete(molecules, depth = 6, size = 1024, silent = TRUE)
extractDrugGraphComplete(molecules, depth = 6, size = 1024, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`depth`	The search depth. Default is `6`.
`size`	The length of the fingerprint bit string. Default is `1024`.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

Calculate the graph molecular fingerprints. Similar to the standard type by simply considers connectivity. This is hashed fingerprints, with a default length of 1024.

Value

An integer vector or a matrix. Each row represents one molecule, the columns represent the fingerprints.

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
fp  = extractDrugGraphComplete(mol)
dim(fp)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
fp  = extractDrugGraphComplete(mol)
dim(fp)

Descriptor Characterizing the Mass Distribution of the Molecule.

Description

Descriptor Characterizing the Mass Distribution of the Molecule.

Usage

extractDrugGravitationalIndex(molecules, silent = TRUE)
extractDrugGravitationalIndex(molecules, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

Descriptor characterizing the mass distribution of the molecule described by Katritzky et al. For modelling purposes the value of the descriptor is calculated both with and without H atoms. Furthermore the square and cube roots of the descriptor are also generated as described by Wessel et al.

Value

A data frame, each row represents one of the molecules, each column represents one feature. This function returns 9 columns:

GRAV.1 - gravitational index of heavy atoms
GRAV.2 - square root of gravitational index of heavy atoms
GRAV.3 - cube root of gravitational index of heavy atoms
GRAVH.1 - gravitational index - hydrogens included
GRAVH.2 - square root of hydrogen-included gravitational index
GRAVH.3 - cube root of hydrogen-included gravitational index
GRAV.4 - grav1 for all pairs of atoms (not just bonded pairs)
GRAV.5 - grav2 for all pairs of atoms (not just bonded pairs)
GRAV.6 - grav3 for all pairs of atoms (not just bonded pairs)

References

Katritzky, A.R. and Mu, L. and Lobanov, V.S. and Karelson, M., Correlation of Boiling Points With Molecular Structure. 1. A Training Set of 298 Diverse Organics and a Test Set of 9 Simple Inorganics, J. Phys. Chem., 1996, 100:10400-10407.

Wessel, M.D. and Jurs, P.C. and Tolan, J.W. and Muskal, S.M. , Prediction of Human Intestinal Absorption of Drug Compounds From Molecular Structure, Journal of Chemical Information and Computer Sciences, 1998, 38:726-735.

Examples

sdf = system.file('sysdata/OptAA3d.sdf', package = 'Rcpi')

mol = readMolFromSDF(sdf)
dat = extractDrugGravitationalIndex(mol)
head(dat)
sdf = system.file('sysdata/OptAA3d.sdf', package = 'Rcpi')

mol = readMolFromSDF(sdf)
dat = extractDrugGravitationalIndex(mol)
head(dat)

Number of Hydrogen Bond Acceptors

Description

Number of Hydrogen Bond Acceptors

Usage

extractDrugHBondAcceptorCount(molecules, silent = TRUE)
extractDrugHBondAcceptorCount(molecules, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

This descriptor calculates the number of hydrogen bond acceptors using a slightly simplified version of the PHACIR atom types. The following groups are counted as hydrogen bond acceptors: any oxygen where the formal charge of the oxygen is non-positive (i.e. formal charge <= 0) except

an aromatic ether oxygen (i.e. an ether oxygen that is adjacent to at least one aromatic carbon)
an oxygen that is adjacent to a nitrogen

and any nitrogen where the formal charge of the nitrogen is non-positive (i.e. formal charge <= 0) except a nitrogen that is adjacent to an oxygen.

Value

A data frame, each row represents one of the molecules, each column represents one feature. This function returns one column named nHBAcc.

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugHBondAcceptorCount(mol)
head(dat)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugHBondAcceptorCount(mol)
head(dat)

Number of Hydrogen Bond Donors

Description

Number of Hydrogen Bond Donors

Usage

extractDrugHBondDonorCount(molecules, silent = TRUE)
extractDrugHBondDonorCount(molecules, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

This descriptor calculates the number of hydrogen bond donors using a slightly simplified version of the PHACIR atom types (https://bit.ly/3qXQELf). The following groups are counted as hydrogen bond donors:

Any-OH where the formal charge of the oxygen is non-negative (i.e. formal charge >= 0)
Any-NH where the formal charge of the nitrogen is non-negative (i.e. formal charge >= 0)

Value

A data frame, each row represents one of the molecules, each column represents one feature. This function returns one column named nHBDon.

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugHBondDonorCount(mol)
head(dat)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugHBondDonorCount(mol)
head(dat)

Calculate the Hybridization Molecular Fingerprints (in Compact Format)

Description

Calculate the Hybridization Molecular Fingerprints (in Compact Format)

Usage

extractDrugHybridization(molecules, depth = 6, size = 1024, silent = TRUE)
extractDrugHybridization(molecules, depth = 6, size = 1024, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`depth`	The search depth. Default is `6`.
`size`	The length of the fingerprint bit string. Default is `1024`.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

Calculate the hybridization molecular fingerprints. Similar to the standard type, but only consider hybridization state. This is hashed fingerprints, with a default length of 1024.

Value

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
fp  = extractDrugHybridization(mol)
head(fp)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
fp  = extractDrugHybridization(mol)
head(fp)

Calculate the Hybridization Molecular Fingerprints (in Complete Format)

Description

Calculate the Hybridization Molecular Fingerprints (in Complete Format)

Usage

extractDrugHybridizationComplete(
  molecules,
  depth = 6,
  size = 1024,
  silent = TRUE
)
extractDrugHybridizationComplete(
  molecules,
  depth = 6,
  size = 1024,
  silent = TRUE
)

Arguments

`molecules`	Parsed molucule object.
`depth`	The search depth. Default is `6`.
`size`	The length of the fingerprint bit string. Default is `1024`.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

Calculate the hybridization molecular fingerprints. Similar to the standard type, but only consider hybridization state. This is hashed fingerprints, with a default length of 1024.

Value

An integer vector or a matrix. Each row represents one molecule, the columns represent the fingerprints.

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
fp  = extractDrugHybridizationComplete(mol)
dim(fp)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
fp  = extractDrugHybridizationComplete(mol)
dim(fp)

Descriptor that Characterizing Molecular Complexity in Terms of Carbon Hybridization States

Description

Descriptor that Characterizing Molecular Complexity in Terms of Carbon Hybridization States

Usage

extractDrugHybridizationRatio(molecules, silent = TRUE)
extractDrugHybridizationRatio(molecules, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

This descriptor calculates the fraction of sp3 carbons to sp2 carbons. Note that it only considers carbon atoms and rather than use a simple ratio it reports the value of Nsp3/(Nsp3 + Nsp2). The original form of the descriptor (i.e., simple ratio) has been used to characterize molecular complexity, especially in the are of natural products, which usually have a high value of the sp3 to sp2 ratio.

Value

A data frame, each row represents one of the molecules, each column represents one feature. This function returns one column named HybRatio.

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugHybridizationRatio(mol)
head(dat)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugHybridizationRatio(mol)
head(dat)

Calculate the Descriptor that Evaluates the Ionization Potential

Description

Calculate the Descriptor that Evaluates the Ionization Potential

Usage

extractDrugIPMolecularLearning(molecules, silent = TRUE)
extractDrugIPMolecularLearning(molecules, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

Calculate the ionization potential of a molecule. The descriptor assumes that explicit hydrogens have been added to the molecules.

Value

A data frame, each row represents one of the molecules, each column represents one feature. This function returns one column named MolIP.

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugIPMolecularLearning(mol)
head(dat)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugIPMolecularLearning(mol)
head(dat)

Descriptor that Calculates Kier and Hall Kappa Molecular Shape Indices

Description

Descriptor that Calculates Kier and Hall Kappa Molecular Shape Indices

Usage

extractDrugKappaShapeIndices(molecules, silent = TRUE)
extractDrugKappaShapeIndices(molecules, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

Kier and Hall Kappa molecular shape indices compare the molecular graph with minimal and maximal molecular graphs; see https://bit.ly/3ramdBy for details: "they are intended to capture different aspects of molecular shape. Note that hydrogens are ignored. In the following description, n denotes the number of atoms in the hydrogen suppressed graph, m is the number of bonds in the hydrogen suppressed graph. Also, let p2 denote the number of paths of length 2 and let p3 denote the number of paths of length 3".

Value

A data frame, each row represents one of the molecules, each column represents one feature. This function returns 3 columns named Kier1, Kier2 and Kier3:

Kier1 - First kappa shape index
Kier2 - Second kappa shape index
Kier3 - Third kappa shape index

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugKappaShapeIndices(mol)
head(dat)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugKappaShapeIndices(mol)
head(dat)

Descriptor that Counts the Number of Occurrences of the E-State Fragments

Description

Descriptor that Counts the Number of Occurrences of the E-State Fragments

Usage

extractDrugKierHallSmarts(molecules, silent = TRUE)
extractDrugKierHallSmarts(molecules, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

A fragment count descriptor that uses e-state fragments. Traditionally the e-state descriptors identify the relevant fragments and then evaluate the actual e-state value. However it has been shown in Butina et al. that simply using the counts of the e-state fragments can lead to QSAR models that exhibit similar performance to those built using the actual e-state indices.

Atom typing and aromaticity perception should be performed prior to calling this descriptor. The atom type definitions are taken from Hall et al. The SMARTS definitions were obtained from RDKit.

Value

A data frame, each row represents one of the molecules, each column represents one feature. This function returns 79 columns:

ID	Name	Pattern
`0`	`khs.sLi`	`[LiD1]-*`
`1`	`khs.ssBe`	`[BeD2](-)-`
`2`	`khs.ssssBe`	`[BeD4](-)(-)(-)-`
`3`	`khs.ssBH`	`[BD2H](-)-`
`4`	`khs.sssB`	`[BD3](-)(-)-*`
`5`	`khs.ssssB`	`[BD4](-)(-)(-)-`
`6`	`khs.sCH3`	`[CD1H3]-*`
`7`	`khs.dCH2`	`[CD1H2]=*`
`8`	`khs.ssCH2`	`[CD2H2](-)-`
`9`	`khs.tCH`	`[CD1H]#*`
`10`	`khs.dsCH`	`[CD2H](=)-`
`11`	`khs.aaCH`	`[C,c;D2H](:):`
`12`	`khs.sssCH`	`[CD3H](-)(-)-*`
`13`	`khs.ddC`	`[CD2H0](=)=`
`14`	`khs.tsC`	`[CD2H0](#)-`
`15`	`khs.dssC`	`[CD3H0](=)(-)-*`
`16`	`khs.aasC`	`[C,c;D3H0](:)(:)-*`
`17`	`khs.aaaC`	`[C,c;D3H0](:)(:):*`
`18`	`khs.ssssC`	`[CD4H0](-)(-)(-)-`
`19`	`khs.sNH3`	`[ND1H3]-*`
`20`	`khs.sNH2`	`[ND1H2]-*`
`21`	`khs.ssNH2`	`[ND2H2](-)-`
`22`	`khs.dNH`	`[ND1H]=*`
`23`	`khs.ssNH`	`[ND2H](-)-`
`24`	`khs.aaNH`	`[N,nD2H](:):`
`25`	`khs.tN`	`[ND1H0]#*`
`26`	`khs.sssNH`	`[ND3H](-)(-)-*`
`27`	`khs.dsN`	`[ND2H0](=)-`
`28`	`khs.aaN`	`[N,nD2H0](:):`
`29`	`khs.sssN`	`[ND3H0](-)(-)-*`
`30`	`khs.ddsN`	`[ND3H0](~[OD1H0])(~[OD1H0])-,:*`
`31`	`khs.aasN`	`[N,nD3H0](:)(:)-,:*`
`32`	`khs.ssssN`	`[ND4H0](-)(-)(-)-`
`33`	`khs.sOH`	`[OD1H]-*`
`34`	`khs.dO`	`[OD1H0]=*`
`35`	`khs.ssO`	`[OD2H0](-)-`
`36`	`khs.aaO`	`[O,oD2H0](:):`
`37`	`khs.sF`	`[FD1]-*`
`38`	`khs.sSiH3`	`[SiD1H3]-*`
`39`	`khs.ssSiH2`	`[SiD2H2](-)-`
`40`	`khs.sssSiH`	`[SiD3H1](-)(-)-*`
`41`	`khs.ssssSi`	`[SiD4H0](-)(-)(-)-`
`42`	`khs.sPH2`	`[PD1H2]-*`
`43`	`khs.ssPH`	`[PD2H1](-)-`
`44`	`khs.sssP`	`[PD3H0](-)(-)-*`
`45`	`khs.dsssP`	`[PD4H0](=)(-)(-)-`
`46`	`khs.sssssP`	`[PD5H0](-)(-)(-)(-)-*`
`47`	`khs.sSH`	`[SD1H1]-*`
`48`	`khs.dS`	`[SD1H0]=*`
`49`	`khs.ssS`	`[SD2H0](-)-`
`50`	`khs.aaS`	`[S,sD2H0](:):`
`51`	`khs.dssS`	`[SD3H0](=)(-)-*`
`52`	`khs.ddssS`	`[SD4H0](~[OD1H0])(~[OD1H0])(-)-`
`53`	`khs.sCl`	`[ClD1]-*`
`54`	`khs.sGeH3`	`[GeD1H3](-*)`
`55`	`khs.ssGeH2`	`[GeD2H2](-)-`
`56`	`khs.sssGeH`	`[GeD3H1](-)(-)-*`
`57`	`khs.ssssGe`	`[GeD4H0](-)(-)(-)-`
`58`	`khs.sAsH2`	`[AsD1H2]-*`
`59`	`khs.ssAsH`	`[AsD2H1](-)-`
`60`	`khs.sssAs`	`[AsD3H0](-)(-)-*`
`61`	`khs.sssdAs`	`[AsD4H0](=)(-)(-)-`
`62`	`khs.sssssAs`	`[AsD5H0](-)(-)(-)(-)-*`
`63`	`khs.sSeH`	`[SeD1H1]-*`
`64`	`khs.dSe`	`[SeD1H0]=*`
`65`	`khs.ssSe`	`[SeD2H0](-)-`
`66`	`khs.aaSe`	`[SeD2H0](:):`
`67`	`khs.dssSe`	`[SeD3H0](=)(-)-*`
`68`	`khs.ddssSe`	`[SeD4H0](=)(=)(-)-`
`69`	`khs.sBr`	`[BrD1]-*`
`70`	`khs.sSnH3`	`[SnD1H3]-*`
`71`	`khs.ssSnH2`	`[SnD2H2](-)-`
`72`	`khs.sssSnH`	`[SnD3H1](-)(-)-*`
`73`	`khs.ssssSn`	`[SnD4H0](-)(-)(-)-`
`74`	`khs.sI`	`[ID1]-*`
`75`	`khs.sPbH3`	`[PbD1H3]-*`
`76`	`khs.ssPbH2`	`[PbD2H2](-)-`
`77`	`khs.sssPbH`	`[PbD3H1](-)(-)-*`
`78`	`khs.ssssPb`	`[PbD4H0](-)(-)(-)-`

References

Butina, D. , Performance of Kier-Hall E-state Descriptors in Quantitative Structure Activity Relationship (QSAR) Studies of Multifunctional Molecules, Molecules, 2004, 9:1004-1009.

Hall, L.H. and Kier, L.B. , Electrotopological State Indices for Atom Types: A Novel Combination of Electronic, Topological, and Valence State Information, Journal of Chemical Information and Computer Science, 1995, 35:1039-1045.

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugKierHallSmarts(mol)
head(dat)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugKierHallSmarts(mol)
head(dat)

Calculate the KR (Klekota and Roth) Molecular Fingerprints (in Compact Format)

Description

Calculate the KR (Klekota and Roth) Molecular Fingerprints (in Compact Format)

Usage

extractDrugKR(molecules, silent = TRUE)
extractDrugKR(molecules, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

Calculate the 4860 bit fingerprint defined by Klekota and Roth.

Value

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
fp  = extractDrugKR(mol)
head(fp)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
fp  = extractDrugKR(mol)
head(fp)

Calculate the KR (Klekota and Roth) Molecular Fingerprints (in Complete Format)

Description

Calculate the KR (Klekota and Roth) Molecular Fingerprints (in Complete Format)

Usage

extractDrugKRComplete(molecules, silent = TRUE)
extractDrugKRComplete(molecules, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

Calculate the 4860 bit fingerprint defined by Klekota and Roth.

Value

An integer vector or a matrix. Each row represents one molecule, the columns represent the fingerprints.

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
fp  = extractDrugKRComplete(mol)
dim(fp)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
fp  = extractDrugKRComplete(mol)
dim(fp)

Descriptor that Calculates the Number of Atoms in the Largest Chain

Description

Descriptor that Calculates the Number of Atoms in the Largest Chain

Usage

extractDrugLargestChain(molecules, silent = TRUE)
extractDrugLargestChain(molecules, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

This descriptor calculates the number of atoms in the largest chain. Note that a chain exists if there are two or more atoms. Thus single atom molecules will return 0.

Value

A data frame, each row represents one of the molecules, each column represents one feature. This function returns one column named nAtomLC.

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugLargestChain(mol)
head(dat)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugLargestChain(mol)
head(dat)

Descriptor that Calculates the Number of Atoms in the Largest Pi Chain

Description

Descriptor that Calculates the Number of Atoms in the Largest Pi Chain

Usage

extractDrugLargestPiSystem(molecules, silent = TRUE)
extractDrugLargestPiSystem(molecules, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

This descriptor calculates the number of atoms in the largest pi chain.

Value

A data frame, each row represents one of the molecules, each column represents one feature. This function returns one column named nAtomP.

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugLargestPiSystem(mol)
head(dat)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugLargestPiSystem(mol)
head(dat)

Calculate the Ratio of Length to Breadth Descriptor

Description

Calculate the Ratio of Length to Breadth Descriptor

Usage

extractDrugLengthOverBreadth(molecules, silent = TRUE)
extractDrugLengthOverBreadth(molecules, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

Calculates the Ratio of Length to Breadth, as a result ti does not perform any orientation and only considers the X & Y extents for a series of rotations about the Z axis (in 10 degree increments).

Value

A data frame, each row represents one of the molecules, each column represents one feature. This function returns two columns named LOBMAX and LOBMIN:

LOBMAX - The maximum L/B ratio;
LOBMIN - The L/B ratio for the rotation that results in the minimum area (defined by the product of the X & Y extents for that orientation).

Note

The descriptor assumes that the atoms have been configured.

Examples

sdf = system.file('sysdata/OptAA3d.sdf', package = 'Rcpi')

mol = readMolFromSDF(sdf)
dat = extractDrugLengthOverBreadth(mol)
head(dat)
sdf = system.file('sysdata/OptAA3d.sdf', package = 'Rcpi')

mol = readMolFromSDF(sdf)
dat = extractDrugLengthOverBreadth(mol)
head(dat)

Descriptor that Calculates the Number of Atoms in the Longest Aliphatic Chain

Description

Descriptor that Calculates the Number of Atoms in the Longest Aliphatic Chain

Usage

extractDrugLongestAliphaticChain(molecules, silent = TRUE)
extractDrugLongestAliphaticChain(molecules, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

This descriptor calculates the number of atoms in the longest aliphatic chain.

Value

A data frame, each row represents one of the molecules, each column represents one feature. This function returns one column named nAtomLAC.

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugLongestAliphaticChain(mol)
head(dat)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugLongestAliphaticChain(mol)
head(dat)

Calculate the MACCS Molecular Fingerprints (in Compact Format)

Description

Calculate the MACCS Molecular Fingerprints (in Compact Format)

Usage

extractDrugMACCS(molecules, silent = TRUE)
extractDrugMACCS(molecules, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

The popular 166 bit MACCS keys described by MDL.

Value

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
fp  = extractDrugMACCS(mol)
head(fp)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
fp  = extractDrugMACCS(mol)
head(fp)

Calculate the MACCS Molecular Fingerprints (in Complete Format)

Description

Calculate the MACCS Molecular Fingerprints (in Complete Format)

Usage

extractDrugMACCSComplete(molecules, silent = TRUE)
extractDrugMACCSComplete(molecules, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

The popular 166 bit MACCS keys described by MDL.

Value

An integer vector or a matrix. Each row represents one molecule, the columns represent the fingerprints.

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
fp  = extractDrugMACCSComplete(mol)
dim(fp)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
fp  = extractDrugMACCSComplete(mol)
dim(fp)

Descriptor that Calculates the LogP Based on a Simple Equation Using the Number of Carbons and Hetero Atoms

Description

Descriptor that Calculates the LogP Based on a Simple Equation Using the Number of Carbons and Hetero Atoms

Usage

extractDrugMannholdLogP(molecules, silent = TRUE)
extractDrugMannholdLogP(molecules, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

This descriptor calculates the LogP based on a simple equation using the number of carbons and hetero atoms. The implemented equation was proposed in Mannhold et al.

Value

A data frame, each row represents one of the molecules, each column represents one feature. This function returns one column named MLogP.

References

Mannhold, R., Poda, G. I., Ostermann, C., & Tetko, I. V. (2009). Calculation of molecular lipophilicity: State-of-the-art and comparison of log P methods on more than 96,000 compounds. Journal of pharmaceutical sciences, 98(3), 861-893.

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugMannholdLogP(mol)
head(dat)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugMannholdLogP(mol)
head(dat)

Calculate Molecular Distance Edge (MDE) Descriptors for C, N and O

Description

Calculate Molecular Distance Edge (MDE) Descriptors for C, N and O

Usage

extractDrugMDE(molecules, silent = TRUE)
extractDrugMDE(molecules, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

This descriptor calculates the 10 molecular distance edge (MDE) descriptor described in Liu, S., Cao, C., & Li, Z, and in addition it calculates variants where O and N are considered.

Value

A data frame, each row represents one of the molecules, each column represents one feature. This function returns one column named nAtomLAC.

References

Liu, S., Cao, C., & Li, Z. (1998). Approach to estimation and prediction for normal boiling point (NBP) of alkanes based on a novel molecular distance-edge (MDE) vector, lambda. Journal of chemical information and computer sciences, 38(3), 387-394.

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugMDE(mol)
head(dat)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugMDE(mol)
head(dat)

Descriptor that Calculates the Principal Moments of Inertia and Ratios of the Principal Moments

Description

Descriptor that Calculates the Principal Moments of Inertia and Ratios of the Principal Moments

Usage

extractDrugMomentOfInertia(molecules, silent = TRUE)
extractDrugMomentOfInertia(molecules, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

A descriptor that calculates the moment of inertia and radius of gyration. Moment of inertia (MI) values characterize the mass distribution of a molecule. Related to the MI values, ratios of the MI values along the three principal axes are also well know modeling variables. This descriptor calculates the MI values along the X, Y and Z axes as well as the ratio's X/Y, X/Z and Y/Z. Finally it also calculates the radius of gyration of the molecule.

Value

A data frame, each row represents one of the molecules, each column represents one feature. This function returns 7 columns named MOMI.X, MOMI.Y, MOMI.Z, MOMI.XY, MOMI.XZ, MOMI.YZ, MOMI.R:

MOMI.X - MI along X axis
MOMI.Y - MI along Y axis
MOMI.Z - MI along Z axis
MOMI.XY - X/Y
MOMI.XZ - X/Z
MOMI.YZ - Y/Z
MOMI.R - Radius of gyration

One important aspect of the algorithm is that if the eigenvalues of the MI tensor are below 1e-3, then the ratio's are set to a default of 1000.

Examples

sdf = system.file('sysdata/OptAA3d.sdf', package = 'Rcpi')

mol = readMolFromSDF(sdf)
dat = extractDrugMomentOfInertia(mol)
head(dat)
sdf = system.file('sysdata/OptAA3d.sdf', package = 'Rcpi')

mol = readMolFromSDF(sdf)
dat = extractDrugMomentOfInertia(mol)
head(dat)

Calculate the FP2 Molecular Fingerprints

Description

Calculate the FP2 Molecular Fingerprints

Usage

extractDrugOBFP2(molecules, type = c("smile", "sdf"))
extractDrugOBFP2(molecules, type = c("smile", "sdf"))

Arguments

`molecules`	R character string object containing the molecules. See the example section for details.
`type`	`'smile'` or `'sdf'`.

Details

Calculate the 1024 bit FP2 fingerprints provided by OpenBabel.

Value

A matrix. Each row represents one molecule, the columns represent the fingerprints.

Examples

mol1 = 'C1CCC1CC(CN(C)(C))CC(=O)CC'  # one molecule SMILE in a vector
mol2 = c('CCC', 'CCN', 'CCN(C)(C)', 'c1ccccc1Cc1ccccc1',
         'C1CCC1CC(CN(C)(C))CC(=O)CC')  # multiple SMILEs in a vector
mol3 = readChar(system.file('compseq/DB00860.sdf', package = 'Rcpi'),
                nchars = 1e+6)  # single molecule in a sdf file
mol4 = readChar(system.file('sysdata/OptAA3d.sdf', package = 'Rcpi'),
                nchars = 1e+6)  # multiple molecules in a sdf file

## Not run: 
smifp0 = extractDrugOBFP2(mol1, type = 'smile')
smifp1 = extractDrugOBFP2(mol2, type = 'smile')
sdffp0 = extractDrugOBFP2(mol3, type = 'sdf')
sdffp1 = extractDrugOBFP2(mol4, type = 'sdf')
## End(Not run)
mol1 = 'C1CCC1CC(CN(C)(C))CC(=O)CC'  # one molecule SMILE in a vector
mol2 = c('CCC', 'CCN', 'CCN(C)(C)', 'c1ccccc1Cc1ccccc1',
         'C1CCC1CC(CN(C)(C))CC(=O)CC')  # multiple SMILEs in a vector
mol3 = readChar(system.file('compseq/DB00860.sdf', package = 'Rcpi'),
                nchars = 1e+6)  # single molecule in a sdf file
mol4 = readChar(system.file('sysdata/OptAA3d.sdf', package = 'Rcpi'),
                nchars = 1e+6)  # multiple molecules in a sdf file

## Not run: 
smifp0 = extractDrugOBFP2(mol1, type = 'smile')
smifp1 = extractDrugOBFP2(mol2, type = 'smile')
sdffp0 = extractDrugOBFP2(mol3, type = 'sdf')
sdffp1 = extractDrugOBFP2(mol4, type = 'sdf')
## End(Not run)

Calculate the FP3 Molecular Fingerprints

Description

Calculate the FP3 Molecular Fingerprints

Usage

extractDrugOBFP3(molecules, type = c("smile", "sdf"))
extractDrugOBFP3(molecules, type = c("smile", "sdf"))

Arguments

`molecules`	R character string object containing the molecules. See the example section for details.
`type`	`'smile'` or `'sdf'`.

Details

Calculate the 64 bit FP3 fingerprints provided by OpenBabel.

Value

A matrix. Each row represents one molecule, the columns represent the fingerprints.

Examples

mol1 = 'C1CCC1CC(CN(C)(C))CC(=O)CC'  # one molecule SMILE in a vector
mol2 = c('CCC', 'CCN', 'CCN(C)(C)', 'c1ccccc1Cc1ccccc1',
         'C1CCC1CC(CN(C)(C))CC(=O)CC')  # multiple SMILEs in a vector
mol3 = readChar(system.file('compseq/DB00860.sdf', package = 'Rcpi'),
                nchars = 1e+6)  # single molecule in a sdf file
mol4 = readChar(system.file('sysdata/OptAA3d.sdf', package = 'Rcpi'),
                nchars = 1e+6)  # multiple molecules in a sdf file
## Not run: 
smifp0 = extractDrugOBFP3(mol1, type = 'smile')
smifp1 = extractDrugOBFP3(mol2, type = 'smile')
sdffp0 = extractDrugOBFP3(mol3, type = 'sdf')
sdffp1 = extractDrugOBFP3(mol4, type = 'sdf')
## End(Not run)
mol1 = 'C1CCC1CC(CN(C)(C))CC(=O)CC'  # one molecule SMILE in a vector
mol2 = c('CCC', 'CCN', 'CCN(C)(C)', 'c1ccccc1Cc1ccccc1',
         'C1CCC1CC(CN(C)(C))CC(=O)CC')  # multiple SMILEs in a vector
mol3 = readChar(system.file('compseq/DB00860.sdf', package = 'Rcpi'),
                nchars = 1e+6)  # single molecule in a sdf file
mol4 = readChar(system.file('sysdata/OptAA3d.sdf', package = 'Rcpi'),
                nchars = 1e+6)  # multiple molecules in a sdf file
## Not run: 
smifp0 = extractDrugOBFP3(mol1, type = 'smile')
smifp1 = extractDrugOBFP3(mol2, type = 'smile')
sdffp0 = extractDrugOBFP3(mol3, type = 'sdf')
sdffp1 = extractDrugOBFP3(mol4, type = 'sdf')
## End(Not run)

Calculate the FP4 Molecular Fingerprints

Description

Calculate the FP4 Molecular Fingerprints

Usage

extractDrugOBFP4(molecules, type = c("smile", "sdf"))
extractDrugOBFP4(molecules, type = c("smile", "sdf"))

Arguments

`molecules`	R character string object containing the molecules. See the example section for details.
`type`	`'smile'` or `'sdf'`.

Details

Calculate the 512 bit FP4 fingerprints provided by OpenBabel.

Value

A matrix. Each row represents one molecule, the columns represent the fingerprints.

Examples

mol1 = 'C1CCC1CC(CN(C)(C))CC(=O)CC'  # one molecule SMILE in a vector
mol2 = c('CCC', 'CCN', 'CCN(C)(C)', 'c1ccccc1Cc1ccccc1',
         'C1CCC1CC(CN(C)(C))CC(=O)CC')  # multiple SMILEs in a vector
mol3 = readChar(system.file('compseq/DB00860.sdf', package = 'Rcpi'),
                nchars = 1e+6)  # single molecule in a sdf file
mol4 = readChar(system.file('sysdata/OptAA3d.sdf', package = 'Rcpi'),
                nchars = 1e+6)  # multiple molecules in a sdf file

## Not run: 
smifp0 = extractDrugOBFP4(mol1, type = 'smile')
smifp1 = extractDrugOBFP4(mol2, type = 'smile')
sdffp0 = extractDrugOBFP4(mol3, type = 'sdf')
sdffp1 = extractDrugOBFP4(mol4, type = 'sdf')
## End(Not run)
mol1 = 'C1CCC1CC(CN(C)(C))CC(=O)CC'  # one molecule SMILE in a vector
mol2 = c('CCC', 'CCN', 'CCN(C)(C)', 'c1ccccc1Cc1ccccc1',
         'C1CCC1CC(CN(C)(C))CC(=O)CC')  # multiple SMILEs in a vector
mol3 = readChar(system.file('compseq/DB00860.sdf', package = 'Rcpi'),
                nchars = 1e+6)  # single molecule in a sdf file
mol4 = readChar(system.file('sysdata/OptAA3d.sdf', package = 'Rcpi'),
                nchars = 1e+6)  # multiple molecules in a sdf file

## Not run: 
smifp0 = extractDrugOBFP4(mol1, type = 'smile')
smifp1 = extractDrugOBFP4(mol2, type = 'smile')
sdffp0 = extractDrugOBFP4(mol3, type = 'sdf')
sdffp1 = extractDrugOBFP4(mol4, type = 'sdf')
## End(Not run)

Calculate the MACCS Molecular Fingerprints

Description

Calculate the MACCS Molecular Fingerprints

Usage

extractDrugOBMACCS(molecules, type = c("smile", "sdf"))
extractDrugOBMACCS(molecules, type = c("smile", "sdf"))

Arguments

`molecules`	R character string object containing the molecules. See the example section for details.
`type`	`'smile'` or `'sdf'`.

Details

Calculate the 256 bit MACCS fingerprints provided by OpenBabel.

Value

A matrix. Each row represents one molecule, the columns represent the fingerprints.

Examples

mol1 = 'C1CCC1CC(CN(C)(C))CC(=O)CC'  # one molecule SMILE in a vector
mol2 = c('CCC', 'CCN', 'CCN(C)(C)', 'c1ccccc1Cc1ccccc1',
         'C1CCC1CC(CN(C)(C))CC(=O)CC')  # multiple SMILEs in a vector
mol3 = readChar(system.file('compseq/DB00860.sdf', package = 'Rcpi'),
                nchars = 1e+6)  # single molecule in a sdf file
mol4 = readChar(system.file('sysdata/OptAA3d.sdf', package = 'Rcpi'),
                nchars = 1e+6)  # multiple molecules in a sdf file

## Not run: 
# MACCS may not be available in current version of ChemmineOB
smifp0 = extractDrugOBMACCS(mol1, type = 'smile')
smifp1 = extractDrugOBMACCS(mol2, type = 'smile')
sdffp0 = extractDrugOBMACCS(mol3, type = 'sdf')
sdffp1 = extractDrugOBMACCS(mol4, type = 'sdf')
## End(Not run)
mol1 = 'C1CCC1CC(CN(C)(C))CC(=O)CC'  # one molecule SMILE in a vector
mol2 = c('CCC', 'CCN', 'CCN(C)(C)', 'c1ccccc1Cc1ccccc1',
         'C1CCC1CC(CN(C)(C))CC(=O)CC')  # multiple SMILEs in a vector
mol3 = readChar(system.file('compseq/DB00860.sdf', package = 'Rcpi'),
                nchars = 1e+6)  # single molecule in a sdf file
mol4 = readChar(system.file('sysdata/OptAA3d.sdf', package = 'Rcpi'),
                nchars = 1e+6)  # multiple molecules in a sdf file

## Not run: 
# MACCS may not be available in current version of ChemmineOB
smifp0 = extractDrugOBMACCS(mol1, type = 'smile')
smifp1 = extractDrugOBMACCS(mol2, type = 'smile')
sdffp0 = extractDrugOBMACCS(mol3, type = 'sdf')
sdffp1 = extractDrugOBMACCS(mol4, type = 'sdf')
## End(Not run)

Descriptor that Calculates the Petitjean Number of a Molecule

Description

Descriptor that Calculates the Petitjean Number of a Molecule

Usage

extractDrugPetitjeanNumber(molecules, silent = TRUE)
extractDrugPetitjeanNumber(molecules, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

This descriptor calculates the Petitjean number of a molecule. According to the Petitjean definition, the eccentricity of a vertex corresponds to the distance from that vertex to the most remote vertex in the graph.

The distance is obtained from the distance matrix as the count of edges between the two vertices. If r(i) is the largest matrix entry in row i of the distance matrix D, then the radius is defined as the smallest of the r(i). The graph diameter D is defined as the largest vertex eccentricity in the graph. (http://www.edusoft-lc.com/molconn/manuals/400/chaptwo.html)

Value

A data frame, each row represents one of the molecules, each column represents one feature. This function returns one column named PetitjeanNumber.

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugPetitjeanNumber(mol)
head(dat)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugPetitjeanNumber(mol)
head(dat)

Descriptor that Calculates the Petitjean Shape Indices

Description

Descriptor that Calculates the Petitjean Shape Indices

Usage

extractDrugPetitjeanShapeIndex(molecules, silent = TRUE)
extractDrugPetitjeanShapeIndex(molecules, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

The topological and geometric shape indices described Petitjean and Bath et al. respectively. Both measure the anisotropy in a molecule.

Value

A data frame, each row represents one of the molecules, each column represents one feature. This function returns two columns named topoShape (Topological Shape Index) and geomShape (Geometric Shape Index).

References

Petitjean, M., Applications of the radius-diameter diagram to the classification of topological and geometrical shapes of chemical compounds, Journal of Chemical Information and Computer Science, 1992, 32:331-337

Bath, P.A. and Poirette, A.R. and Willet, P. and Allen, F.H. , The Extent of the Relationship between the Graph-Theoretical and the Geometrical Shape Coefficients of Chemical Compounds, Journal of Chemical Information and Computer Science, 1995, 35:714-716.

Examples

sdf = system.file('sysdata/OptAA3d.sdf', package = 'Rcpi')

mol = readMolFromSDF(sdf)
dat = extractDrugPetitjeanShapeIndex(mol)
head(dat)
sdf = system.file('sysdata/OptAA3d.sdf', package = 'Rcpi')

mol = readMolFromSDF(sdf)
dat = extractDrugPetitjeanShapeIndex(mol)
head(dat)

Calculate the PubChem Molecular Fingerprints (in Compact Format)

Description

Calculate the PubChem Molecular Fingerprints (in Compact Format)

Usage

extractDrugPubChem(molecules, silent = TRUE)
extractDrugPubChem(molecules, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

Calculate the 881 bit fingerprints defined by PubChem.

Value

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
fp  = extractDrugPubChem(mol)
head(fp)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
fp  = extractDrugPubChem(mol)
head(fp)

Calculate the PubChem Molecular Fingerprints (in Complete Format)

Description

Calculate the PubChem Molecular Fingerprints (in Complete Format)

Usage

extractDrugPubChemComplete(molecules, silent = TRUE)
extractDrugPubChemComplete(molecules, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

Calculate the 881 bit fingerprints defined by PubChem.

Value

An integer vector or a matrix. Each row represents one molecule, the columns represent the fingerprints.

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
fp  = extractDrugPubChemComplete(mol)
dim(fp)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
fp  = extractDrugPubChemComplete(mol)
dim(fp)

Descriptor that Calculates the Number of Nonrotatable Bonds on A Molecule

Description

Descriptor that Calculates the Number of Nonrotatable Bonds on A Molecule

Usage

extractDrugRotatableBondsCount(molecules, silent = TRUE)
extractDrugRotatableBondsCount(molecules, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

The number of rotatable bonds is given by the SMARTS specified by Daylight on SMARTS tutorial (https://www.daylight.com/dayhtml_tutorials/languages/smarts/smarts_examples.html)

Value

A data frame, each row represents one of the molecules, each column represents one feature. This function returns one column named nRotB.

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugRotatableBondsCount(mol)
head(dat)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugRotatableBondsCount(mol)
head(dat)

Descriptor that Calculates the Number Failures of the Lipinski's Rule Of Five

Description

Descriptor that Calculates the Number Failures of the Lipinski's Rule Of Five

Usage

extractDrugRuleOfFive(molecules, silent = TRUE)
extractDrugRuleOfFive(molecules, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

This descriptor calculates the number failures of the Lipinski's Rule Of Five: http://en.wikipedia.org/wiki/Lipinski%27s_Rule_of_Five.

Value

A data frame, each row represents one of the molecules, each column represents one feature. This function returns one column named LipinskiFailures.

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugRuleOfFive(mol)
head(dat)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugRuleOfFive(mol)
head(dat)

Calculate the Shortest Path Molecular Fingerprints (in Compact Format)

Description

Calculate the Shortest Path Molecular Fingerprints (in Compact Format)

Usage

extractDrugShortestPath(molecules, depth = 6, size = 1024, silent = TRUE)
extractDrugShortestPath(molecules, depth = 6, size = 1024, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`depth`	The search depth. Default is `6`.
`size`	The length of the fingerprint bit string. Default is `1024`.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

Calculate the fingerprint based on the shortest paths between pairs of atoms and takes into account ring systems, charges etc.

Value

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
fp  = extractDrugShortestPath(mol)
head(fp)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
fp  = extractDrugShortestPath(mol)
head(fp)

Calculate the Shortest Path Molecular Fingerprints (in Complete Format)

Description

Calculate the Shortest Path Molecular Fingerprints (in Complete Format)

Usage

extractDrugShortestPathComplete(
  molecules,
  depth = 6,
  size = 1024,
  silent = TRUE
)
extractDrugShortestPathComplete(
  molecules,
  depth = 6,
  size = 1024,
  silent = TRUE
)

Arguments

`molecules`	Parsed molucule object.
`depth`	The search depth. Default is `6`.
`size`	The length of the fingerprint bit string. Default is `1024`.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

Calculate the fingerprint based on the shortest paths between pairs of atoms and takes into account ring systems, charges etc.

Value

An integer vector or a matrix. Each row represents one molecule, the columns represent the fingerprints.

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
fp  = extractDrugShortestPathComplete(mol)
dim(fp)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
fp  = extractDrugShortestPathComplete(mol)
dim(fp)

Calculate the Standard Molecular Fingerprints (in Compact Format)

Description

Calculate the Standard Molecular Fingerprints (in Compact Format)

Usage

extractDrugStandard(molecules, depth = 6, size = 1024, silent = TRUE)
extractDrugStandard(molecules, depth = 6, size = 1024, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`depth`	The search depth. Default is `6`.
`size`	The length of the fingerprint bit string. Default is `1024`.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

Calculate the standard molecular fingerprints. Considers paths of a given length. This is hashed fingerprints, with a default length of 1024.

Value

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
fp  = extractDrugStandard(mol)
head(fp)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
fp  = extractDrugStandard(mol)
head(fp)

Calculate the Standard Molecular Fingerprints (in Complete Format)

Description

Calculate the Standard Molecular Fingerprints (in Complete Format)

Usage

extractDrugStandardComplete(molecules, depth = 6, size = 1024, silent = TRUE)
extractDrugStandardComplete(molecules, depth = 6, size = 1024, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`depth`	The search depth. Default is `6`.
`size`	The length of the fingerprint bit string. Default is `1024`.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

Calculate the standard molecular fingerprints. Considers paths of a given length. This is hashed fingerprints, with a default length of 1024.

Value

An integer vector or a matrix. Each row represents one molecule, the columns represent the fingerprints.

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
fp  = extractDrugStandardComplete(mol)
dim(fp)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
fp  = extractDrugStandardComplete(mol)
dim(fp)

Descriptor of Topological Polar Surface Area Based on Fragment Contributions (TPSA)

Description

Descriptor of Topological Polar Surface Area Based on Fragment Contributions (TPSA)

Usage

extractDrugTPSA(molecules, silent = TRUE)
extractDrugTPSA(molecules, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

Calculate the descriptor of topological polar surface area based on fragment contributions (TPSA).

Value

A data frame, each row represents one of the molecules, each column represents one feature. This function returns one column named TopoPSA.

References

Ertl, P., Rohde, B., & Selzer, P. (2000). Fast calculation of molecular polar surface area as a sum of fragment-based contributions and its application to the prediction of drug transport properties. Journal of medicinal chemistry, 43(20), 3714-3717.

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugTPSA(mol)
head(dat)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugTPSA(mol)
head(dat)

Descriptor that Calculates the Volume of A Molecule

Description

Descriptor that Calculates the Volume of A Molecule

Usage

extractDrugVABC(molecules, silent = TRUE)
extractDrugVABC(molecules, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

This descriptor calculates the volume of a molecule.

Value

A data frame, each row represents one of the molecules, each column represents one feature. This function returns one column named VABC.

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugVABC(mol)
head(dat)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugVABC(mol)
head(dat)

Descriptor that Calculates the Vertex Adjacency Information of A Molecule

Description

Descriptor that Calculates the Vertex Adjacency Information of A Molecule

Usage

extractDrugVAdjMa(molecules, silent = TRUE)
extractDrugVAdjMa(molecules, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

Vertex adjacency information (magnitude): $1 + \log_2^m$ where $m$ is the number of heavy-heavy bonds. If $m$ is zero, then 0 is returned.

Value

A data frame, each row represents one of the molecules, each column represents one feature. This function returns one column named VAdjMat.

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugVAdjMa(mol)
head(dat)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugVAdjMa(mol)
head(dat)

Descriptor that Calculates the Total Weight of Atoms

Description

Descriptor that Calculates the Total Weight of Atoms

Usage

extractDrugWeight(molecules, silent = TRUE)
extractDrugWeight(molecules, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

This descriptor calculates the molecular weight.

Value

A data frame, each row represents one of the molecules, each column represents one feature. This function returns one column named MW.

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugWeight(mol)
head(dat)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugWeight(mol)
head(dat)

Descriptor that Calculates the Weighted Path (Molecular ID)

Description

Descriptor that Calculates the Weighted Path (Molecular ID)

Usage

extractDrugWeightedPath(molecules, silent = TRUE)
extractDrugWeightedPath(molecules, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

This descriptor calculates the weighted path (molecular ID) described by Randic, characterizing molecular branching. Five descriptors are calculated, based on the implementation in the ADAPT software package. Note that the descriptor is based on identifying all paths between pairs of atoms and so is NP-hard. This means that it can take some time for large, complex molecules.

Value

A data frame, each row represents one of the molecules, each column represents one feature. This function returns 5 columns named WTPT.1, WTPT.2, WTPT.3, WTPT.4, WTPT.5:

WTPT.1 - molecular ID
WTPT.2 - molecular ID / number of atoms
WTPT.3 - sum of path lengths starting from heteroatoms
WTPT.4 - sum of path lengths starting from oxygens
WTPT.5 - sum of path lengths starting from nitrogens

References

Randic, M., On molecular identification numbers (1984). Journal of Chemical Information and Computer Science, 24:164-175.

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugWeightedPath(mol)
head(dat)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugWeightedPath(mol)
head(dat)

Calculate Holistic Descriptors Described by Todeschini et al.

Description

Calculate Holistic Descriptors Described by Todeschini et al.

Usage

extractDrugWHIM(molecules, silent = TRUE)
extractDrugWHIM(molecules, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

Holistic descriptors described by Todeschini et al, the descriptors are based on a number of atom weightings. There are six different possible weightings:

unit weights
atomic masses
van der Waals volumes
Mulliken atomic electronegativites
atomic polarizabilities
E-state values described by Kier and Hall

Currently weighting schemes 1, 2, 3, 4 and 5 are implemented. The weight values are taken from Todeschini et al. and as a result 19 elements are considered. For each weighting scheme we can obtain

11 directional WHIM descriptors (lambda1 .. 3, nu1 .. 2, gamma1 .. 3, eta1 .. 3)
6 non-directional WHIM descriptors (T, A, V, K, G, D)

Though Todeschini et al. mentions that for planar molecules only 8 directional WHIM descriptors are required the current code will return all 11.

Value

A data frame, each row represents one of the molecules, each column represents one feature. This function returns 17 columns:

Wlambda1
Wlambda2
wlambda3
Wnu1
Wnu2
Wgamma1
Wgamma2
Wgamma3
Weta1
Weta2
Weta3
WT
WA
WV
WK
WG
WD

Each name will have a suffix of the form .X where X indicates the weighting scheme used. Possible values of X are

unity
mass
volume
eneg
polar

References

Todeschini, R. and Gramatica, P., New 3D Molecular Descriptors: The WHIM theory and QAR Applications, Persepectives in Drug Discovery and Design, 1998, ?:355-380.

Examples

sdf = system.file('sysdata/OptAA3d.sdf', package = 'Rcpi')

mol = readMolFromSDF(sdf)
dat = extractDrugWHIM(mol)
head(dat)
sdf = system.file('sysdata/OptAA3d.sdf', package = 'Rcpi')

mol = readMolFromSDF(sdf)
dat = extractDrugWHIM(mol)
head(dat)

Descriptor that Calculates Wiener Path Number and Wiener Polarity Number

Description

Descriptor that Calculates Wiener Path Number and Wiener Polarity Number

Usage

extractDrugWienerNumbers(molecules, silent = TRUE)
extractDrugWienerNumbers(molecules, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

This descriptor calculates the Wiener numbers, including the Wiener Path number and the Wiener Polarity Number. Wiener path number: half the sum of all the distance matrix entries; Wiener polarity number: half the sum of all the distance matrix entries with a value of 3.

Value

A data frame, each row represents one of the molecules, each column represents one feature. This function returns two columns named WPATH (weiner path number) and WPOL (weiner polarity number).

References

Wiener, H. (1947). Structural determination of paraffin boiling points. Journal of the American Chemical Society, 69(1), 17-20.

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugWienerNumbers(mol)
head(dat)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugWienerNumbers(mol)
head(dat)

Descriptor that Calculates the Prediction of logP Based on the Atom-Type Method Called XLogP

Description

Descriptor that Calculates the Prediction of logP Based on the Atom-Type Method Called XLogP

Usage

extractDrugXLogP(molecules, silent = TRUE)
extractDrugXLogP(molecules, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

Prediction of logP based on the atom-type method called XLogP.

Value

A data frame, each row represents one of the molecules, each column represents one feature. This function returns one column named XLogP.

References

Wang, R., Fu, Y., and Lai, L., A New Atom-Additive Method for Calculating Partition Coefficients, Journal of Chemical Information and Computer Sciences, 1997, 37:615-621.

Wang, R., Gao, Y., and Lai, L., Calculating partition coefficient by atom-additive method, Perspectives in Drug Discovery and Design, 2000, 19:47-66.

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugXLogP(mol)
head(dat)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugXLogP(mol)
head(dat)

Descriptor that Calculates the Sum of the Squared Atom Degrees of All Heavy Atoms

Description

Descriptor that Calculates the Sum of the Squared Atom Degrees of All Heavy Atoms

Usage

extractDrugZagrebIndex(molecules, silent = TRUE)
extractDrugZagrebIndex(molecules, silent = TRUE)

Arguments

`molecules`	Parsed molucule object.
`silent`	Logical. Whether the calculating process should be shown or not, default is `TRUE`.

Details

Zagreb index: the sum of the squares of atom degree over all heavy atoms i.

Value

A data frame, each row represents one of the molecules, each column represents one feature. This function returns one column named Zagreb.

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugZagrebIndex(mol)
head(dat)
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol = readMolFromSmi(smi, type = 'mol')
dat = extractDrugZagrebIndex(mol)
head(dat)

Generalized BLOSUM and PAM Matrix-Derived Descriptors

Description

Generalized BLOSUM and PAM Matrix-Derived Descriptors

Usage

extractPCMBLOSUM(x, submat = "AABLOSUM62", k, lag, scale = TRUE, silent = TRUE)
extractPCMBLOSUM(x, submat = "AABLOSUM62", k, lag, scale = TRUE, silent = TRUE)

Arguments

`x`	A character vector, as the input protein sequence.
`submat`	Substitution matrix for the 20 amino acids. Should be one of `AABLOSUM45`, `AABLOSUM50`, `AABLOSUM62`, `AABLOSUM80`, `AABLOSUM100`, `AAPAM30`, `AAPAM40`, `AAPAM70`, `AAPAM120`, `AAPAM250`. Default is `'AABLOSUM62'`.
`k`	Integer. The number of selected scales (i.e. the first `k` scales) derived by the substitution matrix. This could be selected according to the printed relative importance values.
`lag`	The lag parameter. Must be less than the amino acids.
`scale`	Logical. Should we auto-scale the substitution matrix (`submat`) before doing eigen decomposition? Default is `TRUE`.
`silent`	Logical. Whether we print the relative importance of each scales (diagnal value of the eigen decomposition result matrix B) or not. Default is `TRUE`.

Details

This function calculates the generalized BLOSUM matrix-derived descriptors. For users' convenience, Rcpi provides the BLOSUM45, BLOSUM50, BLOSUM62, BLOSUM80, BLOSUM100, PAM30, PAM40, PAM70, PAM120, and PAM250 matrices for the 20 amino acids to select.

Value

A length lag * p^2 named vector, p is the number of scales selected.

References

Georgiev, A. G. (2009). Interpretable numerical descriptors of amino acid space. Journal of Computational Biology, 16(5), 703–723.

Examples

x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
blosum = extractPCMBLOSUM(x, submat = 'AABLOSUM62', k = 5, lag = 7, scale = TRUE, silent = FALSE)

x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
blosum = extractPCMBLOSUM(x, submat = 'AABLOSUM62', k = 5, lag = 7, scale = TRUE, silent = FALSE)

Scales-Based Descriptors with 20+ classes of Molecular Descriptors

Description

Scales-Based Descriptors with 20+ classes of Molecular Descriptors

Usage

extractPCMDescScales(
  x,
  propmat,
  index = NULL,
  pc,
  lag,
  scale = TRUE,
  silent = TRUE
)
extractPCMDescScales(
  x,
  propmat,
  index = NULL,
  pc,
  lag,
  scale = TRUE,
  silent = TRUE
)

Arguments

`x`	A character vector, as the input protein sequence.
`propmat`	The matrix containing the descriptor set for the amino acids, which could be chosen from `AAMOE2D`, `AAMOE3D`, `AACPSA`, `AADescAll`, `AA2DACOR`, `AA3DMoRSE`, `AAACF`, `AABurden`, `AAConn`, `AAConst`, `AAEdgeAdj`, `AAEigIdx`, `AAFGC`, `AAGeom`, `AAGETAWAY`, `AAInfo`, `AAMolProp`, `AARandic`, `AARDF`, `AATopo`, `AATopoChg`, `AAWalk`, `AAWHIM`.
`index`	Integer vector or character vector. Specify which molecular descriptors to select from one of these deseriptor sets by specify the numerical or character index of the molecular descriptors in the descriptor set. Default is `NULL`, means selecting all the molecular descriptors in this descriptor set.
`pc`	Integer. The maximum dimension of the space which the data are to be represented in. Must be no greater than the number of AA properties provided.
`lag`	The lag parameter. Must be less than the amino acids.
`scale`	Logical. Should we auto-scale the property matrix (`propmat`) before doing MDS? Default is `TRUE`.
`silent`	Logical. Whether we print the standard deviation, proportion of variance and the cumulative proportion of the selected principal components or not. Default is `TRUE`.

Details

This function calculates the scales-based descriptors with molecular descriptors sets calculated by Dragon, Discovery Studio and MOE. Users could specify which molecular descriptors to select from one of these deseriptor sets by specify the numerical or character index of the molecular descriptors in the descriptor set.

Value

A length lag * p^2 named vector, p is the number of scales selected.

Examples

x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
descscales = extractPCMDescScales(x, propmat = 'AATopo', index = c(37:41, 43:47),
                                  pc = 5, lag = 7, silent = FALSE)

x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
descscales = extractPCMDescScales(x, propmat = 'AATopo', index = c(37:41, 43:47),
                                  pc = 5, lag = 7, silent = FALSE)

Generalized Scales-Based Descriptors derived by Factor Analysis

Description

Generalized Scales-Based Descriptors derived by Factor Analysis

Usage

extractPCMFAScales(
  x,
  propmat,
  factors,
  scores = "regression",
  lag,
  scale = TRUE,
  silent = TRUE
)
extractPCMFAScales(
  x,
  propmat,
  factors,
  scores = "regression",
  lag,
  scale = TRUE,
  silent = TRUE
)

Arguments

`x`	A character vector, as the input protein sequence.
`propmat`	A matrix containing the properties for the amino acids. Each row represent one amino acid type, each column represents one property. Note that the one-letter row names must be provided for we need them to seek the properties for each AA type.
`factors`	Integer. The number of factors to be fitted. Must be no greater than the number of AA properties provided.
`scores`	Type of scores to produce. The default is `"regression"`, which gives Thompson's scores, `"Bartlett"` given Bartlett's weighted least-squares scores.
`lag`	The lag parameter. Must be less than the amino acids number in the protein sequence.
`scale`	Logical. Should we auto-scale the property matrix (`propmat`) before doing Factor Analysis? Default is `TRUE`.
`silent`	Logical. Whether we print the SS loadings, proportion of variance and the cumulative proportion of the selected factors or not. Default is `TRUE`.

Details

This function calculates the generalized scales-based descriptors derived by Factor Analysis (FA). Users could provide customized amino acid property matrices.

Value

A length lag * p^2 named vector, p is the number of scales (factors) selected.

References

Atchley, W. R., Zhao, J., Fernandes, A. D., & Druke, T. (2005). Solving the protein sequence metric problem. Proceedings of the National Academy of Sciences of the United States of America, 102(18), 6395-6400.

Examples

x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
data(AATopo)
tprops = AATopo[, c(37:41, 43:47)]  # select a set of topological descriptors
fa = extractPCMFAScales(x, propmat = tprops, factors = 5, lag = 7, silent = FALSE)

x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
data(AATopo)
tprops = AATopo[, c(37:41, 43:47)]  # select a set of topological descriptors
fa = extractPCMFAScales(x, propmat = tprops, factors = 5, lag = 7, silent = FALSE)

Generalized Scales-Based Descriptors derived by Multidimensional Scaling

Description

Generalized Scales-Based Descriptors derived by Multidimensional Scaling

Usage

extractPCMMDSScales(x, propmat, k, lag, scale = TRUE, silent = TRUE)
extractPCMMDSScales(x, propmat, k, lag, scale = TRUE, silent = TRUE)

Arguments

`x`	A character vector, as the input protein sequence.
`propmat`	A matrix containing the properties for the amino acids. Each row represent one amino acid type, each column represents one property. Note that the one-letter row names must be provided for we need them to seek the properties for each AA type.
`k`	Integer. The maximum dimension of the space which the data are to be represented in. Must be no greater than the number of AA properties provided.
`lag`	The lag parameter. Must be less than the amino acids.
`scale`	Logical. Should we auto-scale the property matrix (`propmat`) before doing MDS? Default is `TRUE`.
`silent`	Logical. Whether we print the `k` eigenvalues computed during the scaling process or not. Default is `TRUE`.

Details

This function calculates the generalized scales-based descriptors derived by Multidimensional Scaling (MDS). Users could provide customized amino acid property matrices.

Value

A length lag * p^2 named vector, p is the number of scales (dimensionality) selected.

References

Venkatarajan, M. S., & Braun, W. (2001). New quantitative descriptors of amino acids based on multidimensional scaling of a large number of physical-chemical properties. Molecular modeling annual, 7(12), 445–453.

Examples

x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
data(AATopo)
tprops = AATopo[, c(37:41, 43:47)]  # select a set of topological descriptors
mds = extractPCMMDSScales(x, propmat = tprops, k = 5, lag = 7, silent = FALSE)

x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
data(AATopo)
tprops = AATopo[, c(37:41, 43:47)]  # select a set of topological descriptors
mds = extractPCMMDSScales(x, propmat = tprops, k = 5, lag = 7, silent = FALSE)

Generalized AA-Properties Based Scales Descriptors

Description

Generalized AA-Properties Based Scales Descriptors

Usage

extractPCMPropScales(x, index = NULL, pc, lag, scale = TRUE, silent = TRUE)
extractPCMPropScales(x, index = NULL, pc, lag, scale = TRUE, silent = TRUE)

Arguments

`x`	A character vector, as the input protein sequence.
`index`	Integer vector or character vector. Specify which AAindex properties to select from the AAindex database by specify the numerical or character index of the properties in the AAindex database. Default is `NULL`, means selecting all the AA properties in the AAindex database.
`pc`	Integer. Use the first pc principal components as the scales. Must be no greater than the number of AA properties provided.
`lag`	The lag parameter. Must be less than the amino acids.
`scale`	Logical. Should we auto-scale the property matrix before PCA? Default is `TRUE`.
`silent`	Logical. Whether we print the standard deviation, proportion of variance and the cumulative proportion of the selected principal components or not. Default is `TRUE`.

Details

This function calculates the generalized amino acid properties based scales descriptors. Users could specify which AAindex properties to select from the AAindex database by specify the numerical or character index of the properties in the AAindex database.

Value

A length lag * p^2 named vector, p is the number of scales (principal components) selected.

Examples

x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
propscales = extractPCMPropScales(x, index = c(160:165, 258:296), pc = 5, lag = 7, silent = FALSE)

x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
propscales = extractPCMPropScales(x, index = c(160:165, 258:296), pc = 5, lag = 7, silent = FALSE)

Generalized Scales-Based Descriptors derived by Principal Components Analysis

Description

Generalized Scales-Based Descriptors derived by Principal Components Analysis

Usage

extractPCMScales(x, propmat, pc, lag, scale = TRUE, silent = TRUE)
extractPCMScales(x, propmat, pc, lag, scale = TRUE, silent = TRUE)

Arguments

`x`	A character vector, as the input protein sequence.
`propmat`	A matrix containing the properties for the amino acids. Each row represent one amino acid type, each column represents one property. Note that the one-letter row names must be provided for we need them to seek the properties for each AA type.
`pc`	Integer. Use the first pc principal components as the scales. Must be no greater than the number of AA properties provided.
`lag`	The lag parameter. Must be less than the amino acids.
`scale`	Logical. Should we auto-scale the property matrix (`propmat`) before PCA? Default is `TRUE`.
`silent`	Logical. Whether we print the standard deviation, proportion of variance and the cumulative proportion of the selected principal components or not. Default is `TRUE`.

Details

This function calculates the generalized scales-based descriptors derived by Principal Components Analysis (PCA). Users could provide customized amino acid property matrices. This function implements the core computation procedure needed for the generalized scales-based descriptors derived by AA-Properties (AAindex) and generalized scales-based descriptors derived by 20+ classes of 2D and 3D molecular descriptors (Topological, WHIM, VHSE, etc.).

Value

A length lag * p^2 named vector, p is the number of scales (principal components) selected.

Examples

x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
data(AAindex)
AAidxmat = t(na.omit(as.matrix(AAindex[, 7:26])))
scales = extractPCMScales(x, propmat = AAidxmat, pc = 5, lag = 7, silent = FALSE)
x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
data(AAindex)
AAidxmat = t(na.omit(as.matrix(AAindex[, 7:26])))
scales = extractPCMScales(x, propmat = AAidxmat, pc = 5, lag = 7, silent = FALSE)

Amino Acid Composition Descriptor

Description

Amino Acid Composition Descriptor

Usage

extractProtAAC(x)
extractProtAAC(x)

Arguments

`x`	A character vector, as the input protein sequence.

Details

This function calculates the Amino Acid Composition descriptor (Dim: 20).

Value

A length 20 named vector

References

M. Bhasin, G. P. S. Raghava. Classification of Nuclear Receptors Based on Amino Acid Composition and Dipeptide Composition. Journal of Biological Chemistry, 2004, 279, 23262.

Examples

x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
extractProtAAC(x)

x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
extractProtAAC(x)

Amphiphilic Pseudo Amino Acid Composition Descriptor

Description

Amphiphilic Pseudo Amino Acid Composition Descriptor

Usage

extractProtAPAAC(
  x,
  props = c("Hydrophobicity", "Hydrophilicity"),
  lambda = 30,
  w = 0.05,
  customprops = NULL
)
extractProtAPAAC(
  x,
  props = c("Hydrophobicity", "Hydrophilicity"),
  lambda = 30,
  w = 0.05,
  customprops = NULL
)

Arguments

`x`	A character vector, as the input protein sequence.
`props`	A character vector, specifying the properties used. 2 properties are used by default, as listed below: `'Hydrophobicity'` Hydrophobicity value of the 20 amino acids `'Hydrophilicity'` Hydrophilicity value of the 20 amino acids
`lambda`	The lambda parameter for the APAAC descriptors, default is 30.
`w`	The weighting factor, default is 0.05.
`customprops`	A `n x 21` named data frame contains `n` customize property. Each row contains one property. The column order for different amino acid types is `'AccNo'`, `'A'`, `'R'`, `'N'`, `'D'`, `'C'`, `'E'`, `'Q'`, `'G'`, `'H'`, `'I'`, `'L'`, `'K'`, `'M'`, `'F'`, `'P'`, `'S'`, `'T'`, `'W'`, `'Y'`, `'V'`, and the columns should also be exactly named like this. The `AccNo` column contains the properties' names. Then users should explicitly specify these properties with these names in the argument `props`. See the examples below for a demonstration. The default value for `customprops` is `NULL`.

Details

This function calculates the Amphiphilic Pseudo Amino Acid Composition (APAAC) descriptor (Dim: 20 + (n * lambda), n is the number of properties selected, default is 80).

Value

A length 20 + n * lambda named vector, n is the number of properties selected.

Note

Note the default 20 * 2 prop values have been already independently given in the function. Users could also specify other (up to 544) properties with the Accession Number in the AAindex data, with or without the default three properties, which means users should explicitly specify the properties to use.

References

Kuo-Chen Chou. Prediction of Protein Cellular Attributes Using Pseudo-Amino Acid Composition. PROTEINS: Structure, Function, and Genetics, 2001, 43: 246-255.

Type 2 pseudo amino acid composition. http://www.csbio.sjtu.edu.cn/bioinf/PseAAC/type2.htm

Kuo-Chen Chou. Using Amphiphilic Pseudo Amino Acid Composition to Predict Enzyme Subfamily Classes. Bioinformatics, 2005, 21, 10-19.

JACS, 1962, 84: 4240-4246. (C. Tanford). (The hydrophobicity data)

PNAS, 1981, 78:3824-3828 (T.P.Hopp & K.R.Woods). (The hydrophilicity data)

Examples

x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
extractProtAPAAC(x)

myprops = data.frame(AccNo = c("MyProp1", "MyProp2", "MyProp3"),
                     A = c(0.62,  -0.5, 15),  R = c(-2.53,   3, 101),
                     N = c(-0.78,  0.2, 58),  D = c(-0.9,    3, 59),
                     C = c(0.29,    -1, 47),  E = c(-0.74,   3, 73),
                     Q = c(-0.85,  0.2, 72),  G = c(0.48,    0, 1),
                     H = c(-0.4,  -0.5, 82),  I = c(1.38, -1.8, 57),
                     L = c(1.06,  -1.8, 57),  K = c(-1.5,    3, 73),
                     M = c(0.64,  -1.3, 75),  F = c(1.19, -2.5, 91),
                     P = c(0.12,     0, 42),  S = c(-0.18, 0.3, 31),
                     T = c(-0.05, -0.4, 45),  W = c(0.81, -3.4, 130),
                     Y = c(0.26,  -2.3, 107), V = c(1.08, -1.5, 43))

# Use 2 default properties, 4 properties in the AAindex database,
# and 3 cutomized properties
extractProtAPAAC(x, customprops = myprops,
                 props = c('Hydrophobicity', 'Hydrophilicity',
                           'CIDH920105', 'BHAR880101',
                           'CHAM820101', 'CHAM820102',
                           'MyProp1', 'MyProp2', 'MyProp3'))

x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
extractProtAPAAC(x)

myprops = data.frame(AccNo = c("MyProp1", "MyProp2", "MyProp3"),
                     A = c(0.62,  -0.5, 15),  R = c(-2.53,   3, 101),
                     N = c(-0.78,  0.2, 58),  D = c(-0.9,    3, 59),
                     C = c(0.29,    -1, 47),  E = c(-0.74,   3, 73),
                     Q = c(-0.85,  0.2, 72),  G = c(0.48,    0, 1),
                     H = c(-0.4,  -0.5, 82),  I = c(1.38, -1.8, 57),
                     L = c(1.06,  -1.8, 57),  K = c(-1.5,    3, 73),
                     M = c(0.64,  -1.3, 75),  F = c(1.19, -2.5, 91),
                     P = c(0.12,     0, 42),  S = c(-0.18, 0.3, 31),
                     T = c(-0.05, -0.4, 45),  W = c(0.81, -3.4, 130),
                     Y = c(0.26,  -2.3, 107), V = c(1.08, -1.5, 43))

# Use 2 default properties, 4 properties in the AAindex database,
# and 3 cutomized properties
extractProtAPAAC(x, customprops = myprops,
                 props = c('Hydrophobicity', 'Hydrophilicity',
                           'CIDH920105', 'BHAR880101',
                           'CHAM820101', 'CHAM820102',
                           'MyProp1', 'MyProp2', 'MyProp3'))

CTD Descriptors - Composition

Description

CTD Descriptors - Composition

Usage

extractProtCTDC(x)
extractProtCTDC(x)

Arguments

`x`	A character vector, as the input protein sequence.

Details

This function calculates the Composition descriptor of the CTD descriptors (Dim: 21).

Value

A length 21 named vector

References

Inna Dubchak, Ilya Muchink, Stephen R. Holbrook and Sung-Hou Kim. Prediction of protein folding class using global description of amino acid sequence. Proceedings of the National Academy of Sciences. USA, 1995, 92, 8700-8704.

Inna Dubchak, Ilya Muchink, Christopher Mayor, Igor Dralyuk and Sung-Hou Kim. Recognition of a Protein Fold in the Context of the SCOP classification. Proteins: Structure, Function and Genetics, 1999, 35, 401-407.

Examples

x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
extractProtCTDC(x)
x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
extractProtCTDC(x)

CTD Descriptors - Distribution

Description

CTD Descriptors - Distribution

Usage

extractProtCTDD(x)
extractProtCTDD(x)

Arguments

`x`	A character vector, as the input protein sequence.

Details

This function calculates the Distribution descriptor of the CTD descriptors (Dim: 105).

Value

A length 105 named vector

References

Examples

x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
extractProtCTDD(x)
x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
extractProtCTDD(x)

CTD Descriptors - Transition

Description

CTD Descriptors - Transition

Usage

extractProtCTDT(x)
extractProtCTDT(x)

Arguments

`x`	A character vector, as the input protein sequence.

Details

This function calculates the Transition descriptor of the CTD descriptors (Dim: 21).

Value

A length 21 named vector

References

Examples

x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
extractProtCTDT(x)
x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
extractProtCTDT(x)

Conjoint Triad Descriptor

Description

Conjoint Triad Descriptor

Usage

extractProtCTriad(x)
extractProtCTriad(x)

Arguments

`x`	A character vector, as the input protein sequence.

Details

This function calculates the Conjoint Triad descriptor (Dim: 343).

Value

A length 343 named vector

References

J.W. Shen, J. Zhang, X.M. Luo, W.L. Zhu, K.Q. Yu, K.X. Chen, Y.X. Li, H.L. Jiang. Predicting Protein-protein Interactions Based Only on Sequences Information. Proceedings of the National Academy of Sciences. 007, 104, 4337–4341.

Examples

x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
extractProtCTriad(x)
x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
extractProtCTriad(x)

Dipeptide Composition Descriptor

Description

Dipeptide Composition Descriptor

Usage

extractProtDC(x)
extractProtDC(x)

Arguments

`x`	A character vector, as the input protein sequence.

Details

This function calculates the Dipeptide Composition descriptor (Dim: 400).

Value

A length 400 named vector

References

M. Bhasin, G. P. S. Raghava. Classification of Nuclear Receptors Based on Amino Acid Composition and Dipeptide Composition. Journal of Biological Chemistry, 2004, 279, 23262.

Examples

x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
extractProtDC(x)

x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
extractProtDC(x)

Geary Autocorrelation Descriptor

Description

Geary Autocorrelation Descriptor

Usage

extractProtGeary(
  x,
  props = c("CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102", "CHOC760101",
    "BIGC670101", "CHAM810101", "DAYM780201"),
  nlag = 30L,
  customprops = NULL
)
extractProtGeary(
  x,
  props = c("CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102", "CHOC760101",
    "BIGC670101", "CHAM810101", "DAYM780201"),
  nlag = 30L,
  customprops = NULL
)

Arguments

`x`	A character vector, as the input protein sequence.
`props`	A character vector, specifying the Accession Number of the target properties. 8 properties are used by default, as listed below: AccNo. CIDH920105 Normalized average hydrophobicity scales (Cid et al., 1992) AccNo. BHAR880101 Average flexibility indices (Bhaskaran-Ponnuswamy, 1988) AccNo. CHAM820101 Polarizability parameter (Charton-Charton, 1982) AccNo. CHAM820102 Free energy of solution in water, kcal/mole (Charton-Charton, 1982) AccNo. CHOC760101 Residue accessible surface area in tripeptide (Chothia, 1976) AccNo. BIGC670101 Residue volume (Bigelow, 1967) AccNo. CHAM810101 Steric parameter (Charton, 1981) AccNo. DAYM780201 Relative mutability (Dayhoff et al., 1978b)
`nlag`	Maximum value of the lag parameter. Default is `30`.
`customprops`	A `n x 21` named data frame contains `n` customize property. Each row contains one property. The column order for different amino acid types is `'AccNo'`, `'A'`, `'R'`, `'N'`, `'D'`, `'C'`, `'E'`, `'Q'`, `'G'`, `'H'`, `'I'`, `'L'`, `'K'`, `'M'`, `'F'`, `'P'`, `'S'`, `'T'`, `'W'`, `'Y'`, `'V'`, and the columns should also be exactly named like this. The `AccNo` column contains the properties' names. Then users should explicitly specify these properties with these names in the argument `props`. See the examples below for a demonstration. The default value for `customprops` is `NULL`.

Details

This function calculates the Geary autocorrelation descriptor (Dim: length(props) * nlag).

Value

A length nlag named vector

References

AAindex: Amino acid index database. https://www.genome.jp/dbget/aaindex.html

Feng, Z.P. and Zhang, C.T. (2000) Prediction of membrane protein types based on the hydrophobic index of amino acids. Journal of Protein Chemistry, 19, 269-275.

Horne, D.S. (1988) Prediction of protein helix content from an autocorrelation analysis of sequence hydrophobicities. Biopolymers, 27, 451-477.

Sokal, R.R. and Thomson, B.A. (2006) Population structure inferred by local spatial autocorrelation: an usage from an Amerindian tribal population. American Journal of Physical Anthropology, 129, 121-131.

Examples

x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
extractProtGeary(x)

myprops = data.frame(AccNo = c("MyProp1", "MyProp2", "MyProp3"),
                     A = c(0.62,  -0.5, 15),  R = c(-2.53,   3, 101),
                     N = c(-0.78,  0.2, 58),  D = c(-0.9,    3, 59),
                     C = c(0.29,    -1, 47),  E = c(-0.74,   3, 73),
                     Q = c(-0.85,  0.2, 72),  G = c(0.48,    0, 1),
                     H = c(-0.4,  -0.5, 82),  I = c(1.38, -1.8, 57),
                     L = c(1.06,  -1.8, 57),  K = c(-1.5,    3, 73),
                     M = c(0.64,  -1.3, 75),  F = c(1.19, -2.5, 91),
                     P = c(0.12,     0, 42),  S = c(-0.18, 0.3, 31),
                     T = c(-0.05, -0.4, 45),  W = c(0.81, -3.4, 130),
                     Y = c(0.26,  -2.3, 107), V = c(1.08, -1.5, 43))

# Use 4 properties in the AAindex database, and 3 cutomized properties
extractProtGeary(x, customprops = myprops,
                 props = c('CIDH920105', 'BHAR880101',
                           'CHAM820101', 'CHAM820102',
                           'MyProp1', 'MyProp2', 'MyProp3'))

x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
extractProtGeary(x)

myprops = data.frame(AccNo = c("MyProp1", "MyProp2", "MyProp3"),
                     A = c(0.62,  -0.5, 15),  R = c(-2.53,   3, 101),
                     N = c(-0.78,  0.2, 58),  D = c(-0.9,    3, 59),
                     C = c(0.29,    -1, 47),  E = c(-0.74,   3, 73),
                     Q = c(-0.85,  0.2, 72),  G = c(0.48,    0, 1),
                     H = c(-0.4,  -0.5, 82),  I = c(1.38, -1.8, 57),
                     L = c(1.06,  -1.8, 57),  K = c(-1.5,    3, 73),
                     M = c(0.64,  -1.3, 75),  F = c(1.19, -2.5, 91),
                     P = c(0.12,     0, 42),  S = c(-0.18, 0.3, 31),
                     T = c(-0.05, -0.4, 45),  W = c(0.81, -3.4, 130),
                     Y = c(0.26,  -2.3, 107), V = c(1.08, -1.5, 43))

# Use 4 properties in the AAindex database, and 3 cutomized properties
extractProtGeary(x, customprops = myprops,
                 props = c('CIDH920105', 'BHAR880101',
                           'CHAM820101', 'CHAM820102',
                           'MyProp1', 'MyProp2', 'MyProp3'))

Moran Autocorrelation Descriptor

Description

Moran Autocorrelation Descriptor

Usage

extractProtMoran(
  x,
  props = c("CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102", "CHOC760101",
    "BIGC670101", "CHAM810101", "DAYM780201"),
  nlag = 30L,
  customprops = NULL
)
extractProtMoran(
  x,
  props = c("CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102", "CHOC760101",
    "BIGC670101", "CHAM810101", "DAYM780201"),
  nlag = 30L,
  customprops = NULL
)

Arguments

`x`	A character vector, as the input protein sequence.
`props`	A character vector, specifying the Accession Number of the target properties. 8 properties are used by default, as listed below: AccNo. CIDH920105 Normalized average hydrophobicity scales (Cid et al., 1992) AccNo. BHAR880101 Average flexibility indices (Bhaskaran-Ponnuswamy, 1988) AccNo. CHAM820101 Polarizability parameter (Charton-Charton, 1982) AccNo. CHAM820102 Free energy of solution in water, kcal/mole (Charton-Charton, 1982) AccNo. CHOC760101 Residue accessible surface area in tripeptide (Chothia, 1976) AccNo. BIGC670101 Residue volume (Bigelow, 1967) AccNo. CHAM810101 Steric parameter (Charton, 1981) AccNo. DAYM780201 Relative mutability (Dayhoff et al., 1978b)
`nlag`	Maximum value of the lag parameter. Default is `30`.
`customprops`	A `n x 21` named data frame contains `n` customize property. Each row contains one property. The column order for different amino acid types is `'AccNo'`, `'A'`, `'R'`, `'N'`, `'D'`, `'C'`, `'E'`, `'Q'`, `'G'`, `'H'`, `'I'`, `'L'`, `'K'`, `'M'`, `'F'`, `'P'`, `'S'`, `'T'`, `'W'`, `'Y'`, `'V'`, and the columns should also be exactly named like this. The `AccNo` column contains the properties' names. Then users should explicitly specify these properties with these names in the argument `props`. See the examples below for a demonstration. The default value for `customprops` is `NULL`.

Details

This function calculates the Moran autocorrelation descriptor (Dim: length(props) * nlag).

Value

A length nlag named vector

References

AAindex: Amino acid index database. https://www.genome.jp/dbget/aaindex.html

Feng, Z.P. and Zhang, C.T. (2000) Prediction of membrane protein types based on the hydrophobic index of amino acids. Journal of Protein Chemistry, 19, 269-275.

Horne, D.S. (1988) Prediction of protein helix content from an autocorrelation analysis of sequence hydrophobicities. Biopolymers, 27, 451-477.

Examples

x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
extractProtMoran(x)

myprops = data.frame(AccNo = c("MyProp1", "MyProp2", "MyProp3"),
                     A = c(0.62,  -0.5, 15),  R = c(-2.53,   3, 101),
                     N = c(-0.78,  0.2, 58),  D = c(-0.9,    3, 59),
                     C = c(0.29,    -1, 47),  E = c(-0.74,   3, 73),
                     Q = c(-0.85,  0.2, 72),  G = c(0.48,    0, 1),
                     H = c(-0.4,  -0.5, 82),  I = c(1.38, -1.8, 57),
                     L = c(1.06,  -1.8, 57),  K = c(-1.5,    3, 73),
                     M = c(0.64,  -1.3, 75),  F = c(1.19, -2.5, 91),
                     P = c(0.12,     0, 42),  S = c(-0.18, 0.3, 31),
                     T = c(-0.05, -0.4, 45),  W = c(0.81, -3.4, 130),
                     Y = c(0.26,  -2.3, 107), V = c(1.08, -1.5, 43))

# Use 4 properties in the AAindex database, and 3 cutomized properties
extractProtMoran(x, customprops = myprops,
                 props = c('CIDH920105', 'BHAR880101',
                           'CHAM820101', 'CHAM820102',
                           'MyProp1', 'MyProp2', 'MyProp3'))

x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
extractProtMoran(x)

myprops = data.frame(AccNo = c("MyProp1", "MyProp2", "MyProp3"),
                     A = c(0.62,  -0.5, 15),  R = c(-2.53,   3, 101),
                     N = c(-0.78,  0.2, 58),  D = c(-0.9,    3, 59),
                     C = c(0.29,    -1, 47),  E = c(-0.74,   3, 73),
                     Q = c(-0.85,  0.2, 72),  G = c(0.48,    0, 1),
                     H = c(-0.4,  -0.5, 82),  I = c(1.38, -1.8, 57),
                     L = c(1.06,  -1.8, 57),  K = c(-1.5,    3, 73),
                     M = c(0.64,  -1.3, 75),  F = c(1.19, -2.5, 91),
                     P = c(0.12,     0, 42),  S = c(-0.18, 0.3, 31),
                     T = c(-0.05, -0.4, 45),  W = c(0.81, -3.4, 130),
                     Y = c(0.26,  -2.3, 107), V = c(1.08, -1.5, 43))

# Use 4 properties in the AAindex database, and 3 cutomized properties
extractProtMoran(x, customprops = myprops,
                 props = c('CIDH920105', 'BHAR880101',
                           'CHAM820101', 'CHAM820102',
                           'MyProp1', 'MyProp2', 'MyProp3'))

Normalized Moreau-Broto Autocorrelation Descriptor

Description

Normalized Moreau-Broto Autocorrelation Descriptor

Usage

extractProtMoreauBroto(
  x,
  props = c("CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102", "CHOC760101",
    "BIGC670101", "CHAM810101", "DAYM780201"),
  nlag = 30L,
  customprops = NULL
)
extractProtMoreauBroto(
  x,
  props = c("CIDH920105", "BHAR880101", "CHAM820101", "CHAM820102", "CHOC760101",
    "BIGC670101", "CHAM810101", "DAYM780201"),
  nlag = 30L,
  customprops = NULL
)

Arguments

`x`	A character vector, as the input protein sequence.
`props`	A character vector, specifying the Accession Number of the target properties. 8 properties are used by default, as listed below: AccNo. CIDH920105 Normalized average hydrophobicity scales (Cid et al., 1992) AccNo. BHAR880101 Average flexibility indices (Bhaskaran-Ponnuswamy, 1988) AccNo. CHAM820101 Polarizability parameter (Charton-Charton, 1982) AccNo. CHAM820102 Free energy of solution in water, kcal/mole (Charton-Charton, 1982) AccNo. CHOC760101 Residue accessible surface area in tripeptide (Chothia, 1976) AccNo. BIGC670101 Residue volume (Bigelow, 1967) AccNo. CHAM810101 Steric parameter (Charton, 1981) AccNo. DAYM780201 Relative mutability (Dayhoff et al., 1978b)
`nlag`	Maximum value of the lag parameter. Default is `30`.
`customprops`	A `n x 21` named data frame contains `n` customize property. Each row contains one property. The column order for different amino acid types is `'AccNo'`, `'A'`, `'R'`, `'N'`, `'D'`, `'C'`, `'E'`, `'Q'`, `'G'`, `'H'`, `'I'`, `'L'`, `'K'`, `'M'`, `'F'`, `'P'`, `'S'`, `'T'`, `'W'`, `'Y'`, `'V'`, and the columns should also be exactly named like this. The `AccNo` column contains the properties' names. Then users should explicitly specify these properties with these names in the argument `props`. See the examples below for a demonstration. The default value for `customprops` is `NULL`.

Details

This function calculates the normalized Moreau-Broto autocorrelation descriptor (Dim: length(props) * nlag).

Value

A length nlag named vector

References

AAindex: Amino acid index database. https://www.genome.jp/dbget/aaindex.html

Feng, Z.P. and Zhang, C.T. (2000) Prediction of membrane protein types based on the hydrophobic index of amino acids. Journal of Protein Chemistry, 19, 269-275.

Horne, D.S. (1988) Prediction of protein helix content from an autocorrelation analysis of sequence hydrophobicities. Biopolymers, 27, 451-477.

Examples

x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
extractProtMoreauBroto(x)

myprops = data.frame(AccNo = c("MyProp1", "MyProp2", "MyProp3"),
                     A = c(0.62,  -0.5, 15),  R = c(-2.53,   3, 101),
                     N = c(-0.78,  0.2, 58),  D = c(-0.9,    3, 59),
                     C = c(0.29,    -1, 47),  E = c(-0.74,   3, 73),
                     Q = c(-0.85,  0.2, 72),  G = c(0.48,    0, 1),
                     H = c(-0.4,  -0.5, 82),  I = c(1.38, -1.8, 57),
                     L = c(1.06,  -1.8, 57),  K = c(-1.5,    3, 73),
                     M = c(0.64,  -1.3, 75),  F = c(1.19, -2.5, 91),
                     P = c(0.12,     0, 42),  S = c(-0.18, 0.3, 31),
                     T = c(-0.05, -0.4, 45),  W = c(0.81, -3.4, 130),
                     Y = c(0.26,  -2.3, 107), V = c(1.08, -1.5, 43))

# Use 4 properties in the AAindex database, and 3 cutomized properties
extractProtMoreauBroto(x, customprops = myprops,
                       props = c('CIDH920105', 'BHAR880101',
                                 'CHAM820101', 'CHAM820102',
                                 'MyProp1', 'MyProp2', 'MyProp3'))

x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
extractProtMoreauBroto(x)

myprops = data.frame(AccNo = c("MyProp1", "MyProp2", "MyProp3"),
                     A = c(0.62,  -0.5, 15),  R = c(-2.53,   3, 101),
                     N = c(-0.78,  0.2, 58),  D = c(-0.9,    3, 59),
                     C = c(0.29,    -1, 47),  E = c(-0.74,   3, 73),
                     Q = c(-0.85,  0.2, 72),  G = c(0.48,    0, 1),
                     H = c(-0.4,  -0.5, 82),  I = c(1.38, -1.8, 57),
                     L = c(1.06,  -1.8, 57),  K = c(-1.5,    3, 73),
                     M = c(0.64,  -1.3, 75),  F = c(1.19, -2.5, 91),
                     P = c(0.12,     0, 42),  S = c(-0.18, 0.3, 31),
                     T = c(-0.05, -0.4, 45),  W = c(0.81, -3.4, 130),
                     Y = c(0.26,  -2.3, 107), V = c(1.08, -1.5, 43))

# Use 4 properties in the AAindex database, and 3 cutomized properties
extractProtMoreauBroto(x, customprops = myprops,
                       props = c('CIDH920105', 'BHAR880101',
                                 'CHAM820101', 'CHAM820102',
                                 'MyProp1', 'MyProp2', 'MyProp3'))

Pseudo Amino Acid Composition Descriptor

Description

Pseudo Amino Acid Composition Descriptor

Usage

extractProtPAAC(
  x,
  props = c("Hydrophobicity", "Hydrophilicity", "SideChainMass"),
  lambda = 30,
  w = 0.05,
  customprops = NULL
)
extractProtPAAC(
  x,
  props = c("Hydrophobicity", "Hydrophilicity", "SideChainMass"),
  lambda = 30,
  w = 0.05,
  customprops = NULL
)

Arguments

`x`	A character vector, as the input protein sequence.
`props`	A character vector, specifying the properties used. 3 properties are used by default, as listed below: `'Hydrophobicity'` Hydrophobicity value of the 20 amino acids `'Hydrophilicity'` Hydrophilicity value of the 20 amino acids `'SideChainMass'` Side-chain mass of the 20 amino acids
`lambda`	The lambda parameter for the PAAC descriptors, default is 30.
`w`	The weighting factor, default is 0.05.
`customprops`	A `n x 21` named data frame contains `n` customize property. Each row contains one property. The column order for different amino acid types is `'AccNo'`, `'A'`, `'R'`, `'N'`, `'D'`, `'C'`, `'E'`, `'Q'`, `'G'`, `'H'`, `'I'`, `'L'`, `'K'`, `'M'`, `'F'`, `'P'`, `'S'`, `'T'`, `'W'`, `'Y'`, `'V'`, and the columns should also be exactly named like this. The `AccNo` column contains the properties' names. Then users should explicitly specify these properties with these names in the argument `props`. See the examples below for a demonstration. The default value for `customprops` is `NULL`.

Details

This function calculates the Pseudo Amino Acid Composition (PAAC) descriptor (Dim: 20 + lambda, default is 50).

Value

A length 20 + lambda named vector

Note

Note the default 20 * 3 prop values have been already independently given in the function. Users could also specify other (up to 544) properties with the Accession Number in the AAindex data, with or without the default three properties, which means users should explicitly specify the properties to use.

References

Kuo-Chen Chou. Prediction of Protein Cellular Attributes Using Pseudo-Amino Acid Composition. PROTEINS: Structure, Function, and Genetics, 2001, 43: 246-255.

Type 1 pseudo amino acid composition. http://www.csbio.sjtu.edu.cn/bioinf/PseAAC/type1.htm

Kuo-Chen Chou. Using Amphiphilic Pseudo Amino Acid Composition to Predict Enzyme Subfamily Classes. Bioinformatics, 2005, 21, 10-19.

JACS, 1962, 84: 4240-4246. (C. Tanford). (The hydrophobicity data)

PNAS, 1981, 78:3824-3828 (T.P.Hopp & K.R.Woods). (The hydrophilicity data)

CRC Handbook of Chemistry and Physics, 66th ed., CRC Press, Boca Raton, Florida (1985). (The side-chain mass data)

R.M.C. Dawson, D.C. Elliott, W.H. Elliott, K.M. Jones, Data for Biochemical Research 3rd ed., Clarendon Press Oxford (1986). (The side-chain mass data)

Examples

x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
extractProtPAAC(x)

myprops = data.frame(AccNo = c("MyProp1", "MyProp2", "MyProp3"),
                     A = c(0.62,  -0.5, 15),  R = c(-2.53,   3, 101),
                     N = c(-0.78,  0.2, 58),  D = c(-0.9,    3, 59),
                     C = c(0.29,    -1, 47),  E = c(-0.74,   3, 73),
                     Q = c(-0.85,  0.2, 72),  G = c(0.48,    0, 1),
                     H = c(-0.4,  -0.5, 82),  I = c(1.38, -1.8, 57),
                     L = c(1.06,  -1.8, 57),  K = c(-1.5,    3, 73),
                     M = c(0.64,  -1.3, 75),  F = c(1.19, -2.5, 91),
                     P = c(0.12,     0, 42),  S = c(-0.18, 0.3, 31),
                     T = c(-0.05, -0.4, 45),  W = c(0.81, -3.4, 130),
                     Y = c(0.26,  -2.3, 107), V = c(1.08, -1.5, 43))

# Use 3 default properties, 4 properties in the AAindex database,
# and 3 cutomized properties
extractProtPAAC(x, customprops = myprops,
                props = c('Hydrophobicity', 'Hydrophilicity', 'SideChainMass',
                          'CIDH920105', 'BHAR880101',
                          'CHAM820101', 'CHAM820102',
                          'MyProp1', 'MyProp2', 'MyProp3'))

x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
extractProtPAAC(x)

myprops = data.frame(AccNo = c("MyProp1", "MyProp2", "MyProp3"),
                     A = c(0.62,  -0.5, 15),  R = c(-2.53,   3, 101),
                     N = c(-0.78,  0.2, 58),  D = c(-0.9,    3, 59),
                     C = c(0.29,    -1, 47),  E = c(-0.74,   3, 73),
                     Q = c(-0.85,  0.2, 72),  G = c(0.48,    0, 1),
                     H = c(-0.4,  -0.5, 82),  I = c(1.38, -1.8, 57),
                     L = c(1.06,  -1.8, 57),  K = c(-1.5,    3, 73),
                     M = c(0.64,  -1.3, 75),  F = c(1.19, -2.5, 91),
                     P = c(0.12,     0, 42),  S = c(-0.18, 0.3, 31),
                     T = c(-0.05, -0.4, 45),  W = c(0.81, -3.4, 130),
                     Y = c(0.26,  -2.3, 107), V = c(1.08, -1.5, 43))

# Use 3 default properties, 4 properties in the AAindex database,
# and 3 cutomized properties
extractProtPAAC(x, customprops = myprops,
                props = c('Hydrophobicity', 'Hydrophilicity', 'SideChainMass',
                          'CIDH920105', 'BHAR880101',
                          'CHAM820101', 'CHAM820102',
                          'MyProp1', 'MyProp2', 'MyProp3'))

Compute PSSM (Position-Specific Scoring Matrix) for given protein sequence

Description

Compute PSSM (Position-Specific Scoring Matrix) for given protein sequence

Usage

extractProtPSSM(
  seq,
  start.pos = 1L,
  end.pos = nchar(seq),
  psiblast.path = NULL,
  makeblastdb.path = NULL,
  database.path = NULL,
  iter = 5,
  silent = TRUE,
  evalue = 10L,
  word.size = NULL,
  gapopen = NULL,
  gapextend = NULL,
  matrix = "BLOSUM62",
  threshold = NULL,
  seg = "no",
  soft.masking = FALSE,
  culling.limit = NULL,
  best.hit.overhang = NULL,
  best.hit.score.edge = NULL,
  xdrop.ungap = NULL,
  xdrop.gap = NULL,
  xdrop.gap.final = NULL,
  window.size = NULL,
  gap.trigger = 22L,
  num.threads = 1L,
  pseudocount = 0L,
  inclusion.ethresh = 0.002
)
extractProtPSSM(
  seq,
  start.pos = 1L,
  end.pos = nchar(seq),
  psiblast.path = NULL,
  makeblastdb.path = NULL,
  database.path = NULL,
  iter = 5,
  silent = TRUE,
  evalue = 10L,
  word.size = NULL,
  gapopen = NULL,
  gapextend = NULL,
  matrix = "BLOSUM62",
  threshold = NULL,
  seg = "no",
  soft.masking = FALSE,
  culling.limit = NULL,
  best.hit.overhang = NULL,
  best.hit.score.edge = NULL,
  xdrop.ungap = NULL,
  xdrop.gap = NULL,
  xdrop.gap.final = NULL,
  window.size = NULL,
  gap.trigger = 22L,
  num.threads = 1L,
  pseudocount = 0L,
  inclusion.ethresh = 0.002
)

Arguments

`seq`	Character vector, as the input protein sequence.
`start.pos`	Optional integer denoting the start position of the fragment window. Default is `1`, i.e. the first amino acid of the given sequence.
`end.pos`	Optional integer denoting the end position of the fragment window. Default is `nchar(seq)`, i.e. the last amino acid of the given sequence.
`psiblast.path`	Character string indicating the path of the `psiblast` program. If NCBI Blast+ was previously installed in the operation system, the path will be automatically detected.
`makeblastdb.path`	Character string indicating the path of the `makeblastdb` program. If NCBI Blast+ was previously installed in the system, the path will be automatically detected.
`database.path`	Character string indicating the path of a reference database (a FASTA file).
`iter`	Number of iterations to perform for PSI-Blast.
`silent`	Logical. Whether the PSI-Blast running output should be shown or not (May not work on some Windows versions and PSI-Blast versions), default is `TRUE`.
`evalue`	Expectation value (E) threshold for saving hits. Default is `10`.
`word.size`	Word size for wordfinder algorithm. An integer >= 2.
`gapopen`	Integer. Cost to open a gap.
`gapextend`	Integer. Cost to extend a gap.
`matrix`	Character string. The scoring matrix name (default is `'BLOSUM62'`).
`threshold`	Minimum word score such that the word is added to the BLAST lookup table. A real value >= 0.
`seg`	Character string. Filter query sequence with SEG (`'yes'`, `'window locut hicut'`, or `'no'` to disable) Default is `'no'`.
`soft.masking`	Logical. Apply filtering locations as soft masks? Default is `FALSE`.
`culling.limit`	An integer >= 0. If the query range of a hit is enveloped by that of at least this many higher-scoring hits, delete the hit. Incompatible with `best.hit.overhang` and `best_hit_score_edge`.
`best.hit.overhang`	Best Hit algorithm overhang value (A real value >= 0 and =< 0.5, recommended value: 0.1). Incompatible with `culling_limit`.
`best.hit.score.edge`	Best Hit algorithm score edge value (A real value >=0 and =< 0.5, recommended value: 0.1). Incompatible with `culling_limit`.
`xdrop.ungap`	X-dropoff value (in bits) for ungapped extensions.
`xdrop.gap`	X-dropoff value (in bits) for preliminary gapped extensions.
`xdrop.gap.final`	X-dropoff value (in bits) for final gapped alignment.
`window.size`	An integer >= 0. Multiple hits window size, To specify 1-hit algorithm, use `0`.
`gap.trigger`	Number of bits to trigger gapping. Default is `22`.
`num.threads`	Integer. Number of threads (CPUs) to use in the BLAST search. Default is `1`.
`pseudocount`	Integer. Pseudo-count value used when constructing PSSM. Default is `0`.
`inclusion.ethresh`	E-value inclusion threshold for pairwise alignments. Default is `0.002`.

Details

This function calculates the PSSM (Position-Specific Scoring Matrix) derived by PSI-Blast for given protein sequence or peptides. For given protein sequences or peptides, PSSM represents the log-likelihood of the substitution of the 20 types of amino acids at that position in the sequence. Note that the output value is not normalized.

Value

The original PSSM, a numeric matrix which has end.pos - start.pos + 1 columns and 20 named rows.

Note

The function requires the makeblastdb and psiblast programs to be properly installed in the operation system or their paths provided.

The two command-line programs are included in the NCBI-BLAST+ software package. To install NCBI Blast+, just open the NCBI FTP site using web browser or FTP software: ftp://[email protected]:21/blast/executables/blast+/LATEST/ then download the executable version of BLAST+ according to your operation system, and compile or install the downloaded source code or executable program.

Ubuntu/Debian users can directly use the command sudo apt-get install ncbi-blast+ to install NCBI Blast+. For OS X users, download ncbi-blast- ... .dmg then install. For Windows users, download ncbi-blast- ... .exe then install.

References

Altschul, Stephen F., et al. "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." Nucleic acids research 25.17 (1997): 3389–3402.

Ye, Xugang, Guoli Wang, and Stephen F. Altschul. "An assessment of substitution scores for protein profile-profile comparison." Bioinformatics 27.24 (2011): 3356–3363.

Rangwala, Huzefa, and George Karypis. "Profile-based direct kernels for remote homology detection and fold recognition." Bioinformatics 21.23 (2005): 4239–4247.

Examples

x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]

dbpath = tempfile('tempdb', fileext = '.fasta')
invisible(file.copy(from = system.file('protseq/Plasminogen.fasta', package = 'Rcpi'), to = dbpath))
pssmmat = extractProtPSSM(seq = x, database.path = dbpath)
dim(pssmmat)  # 20 x 562 (P00750: length 562, 20 Amino Acids)
x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]

dbpath = tempfile('tempdb', fileext = '.fasta')
invisible(file.copy(from = system.file('protseq/Plasminogen.fasta', package = 'Rcpi'), to = dbpath))
pssmmat = extractProtPSSM(seq = x, database.path = dbpath)
dim(pssmmat)  # 20 x 562 (P00750: length 562, 20 Amino Acids)

Profile-based protein representation derived by PSSM (Position-Specific Scoring Matrix) and auto cross covariance

Description

Profile-based protein representation derived by PSSM (Position-Specific Scoring Matrix) and auto cross covariance

Usage

extractProtPSSMAcc(pssmmat, lag)
extractProtPSSMAcc(pssmmat, lag)

Arguments

`pssmmat`	The PSSM computed by `extractProtPSSM`.
`lag`	The lag parameter. Must be less than the number of amino acids in the sequence (i.e. the number of columns in the PSSM matrix).

Details

This function calculates the feature vector based on the PSSM by running PSI-Blast and auto cross covariance tranformation.

Value

A length lag * 20^2 named numeric vector, the element names are derived by the amino acid name abbreviation (crossed amino acid name abbreviation) and lag index.

References

Examples

x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]

dbpath = tempfile('tempdb', fileext = '.fasta')
invisible(file.copy(from = system.file('protseq/Plasminogen.fasta', package = 'Rcpi'), to = dbpath))
pssmmat = extractProtPSSM(seq = x, database.path = dbpath)
pssmacc = extractProtPSSMAcc(pssmmat, lag = 3)
tail(pssmacc)
x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]

dbpath = tempfile('tempdb', fileext = '.fasta')
invisible(file.copy(from = system.file('protseq/Plasminogen.fasta', package = 'Rcpi'), to = dbpath))
pssmmat = extractProtPSSM(seq = x, database.path = dbpath)
pssmacc = extractProtPSSMAcc(pssmmat, lag = 3)
tail(pssmacc)

Profile-based protein representation derived by PSSM (Position-Specific Scoring Matrix)

Description

Profile-based protein representation derived by PSSM (Position-Specific Scoring Matrix)

Usage

extractProtPSSMFeature(pssmmat)
extractProtPSSMFeature(pssmmat)

Arguments

pssmmat

The PSSM computed by extractProtPSSM.

Details

This function calculates the profile-based protein representation derived by PSSM. The feature vector is based on the PSSM computed by extractProtPSSM. For a given sequence, The PSSM feature represents the log-likelihood of the substitution of the 20 types of amino acids at that position in the sequence. Each PSSM feature value in the vector represents the degree of conservation of a given amino acid type. The value is normalized to interval (0, 1) by the transformation 1/(1+e^(-x)).

Value

A numeric vector which has 20 x N named elements, where N is the size of the window (number of rows of the PSSM).

References

Ye, Xugang, Guoli Wang, and Stephen F. Altschul. "An assessment of substitution scores for protein profile-profile comparison." Bioinformatics 27.24 (2011): 3356–3363.

Rangwala, Huzefa, and George Karypis. "Profile-based direct kernels for remote homology detection and fold recognition." Bioinformatics 21.23 (2005): 4239–4247.

Examples

x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]

dbpath = tempfile('tempdb', fileext = '.fasta')
invisible(file.copy(from = system.file('protseq/Plasminogen.fasta', package = 'Rcpi'), to = dbpath))
pssmmat = extractProtPSSM(seq = x, database.path = dbpath)
pssmfeature = extractProtPSSMFeature(pssmmat)
head(pssmfeature)
x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]

dbpath = tempfile('tempdb', fileext = '.fasta')
invisible(file.copy(from = system.file('protseq/Plasminogen.fasta', package = 'Rcpi'), to = dbpath))
pssmmat = extractProtPSSM(seq = x, database.path = dbpath)
pssmfeature = extractProtPSSMFeature(pssmmat)
head(pssmfeature)

Quasi-Sequence-Order (QSO) Descriptor

Description

Quasi-Sequence-Order (QSO) Descriptor

Usage

extractProtQSO(x, nlag = 30, w = 0.1)
extractProtQSO(x, nlag = 30, w = 0.1)

Arguments

`x`	A character vector, as the input protein sequence.
`nlag`	The maximum lag, defualt is 30.
`w`	The weighting factor, default is 0.1.

Details

This function calculates the Quasi-Sequence-Order (QSO) descriptor (Dim: 20 + 20 + (2 * nlag), default is 100).

Value

A length 20 + 20 + (2 * nlag) named vector

References

Kuo-Chen Chou. Prediction of Protein Subcellar Locations by Incorporating Quasi-Sequence-Order Effect. Biochemical and Biophysical Research Communications, 2000, 278, 477-483.

Kuo-Chen Chou and Yu-Dong Cai. Prediction of Protein Sucellular Locations by GO-FunD-PseAA Predictor. Biochemical and Biophysical Research Communications, 2004, 320, 1236-1239.

Gisbert Schneider and Paul Wrede. The Rational Design of Amino Acid Sequences by Artifical Neural Networks and Simulated Molecular Evolution: Do Novo Design of an Idealized Leader Cleavge Site. Biophys Journal, 1994, 66, 335-344.

Examples

x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
extractProtQSO(x)

x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
extractProtQSO(x)

Sequence-Order-Coupling Numbers

Description

Sequence-Order-Coupling Numbers

Usage

extractProtSOCN(x, nlag = 30)
extractProtSOCN(x, nlag = 30)

Arguments

`x`	A character vector, as the input protein sequence.
`nlag`	The maximum lag, defualt is 30.

Details

This function calculates the Sequence-Order-Coupling Numbers (Dim: nlag * 2, default is 60).

Value

A length nlag * 2 named vector

References

Kuo-Chen Chou. Prediction of Protein Subcellar Locations by Incorporating Quasi-Sequence-Order Effect. Biochemical and Biophysical Research Communications, 2000, 278, 477-483.

Kuo-Chen Chou and Yu-Dong Cai. Prediction of Protein Sucellular Locations by GO-FunD-PseAA Predictor. Biochemical and Biophysical Research Communications, 2004, 320, 1236-1239.

Examples

x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
extractProtSOCN(x)

x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
extractProtSOCN(x)

Tripeptide Composition Descriptor

Description

Tripeptide Composition Descriptor

Usage

extractProtTC(x)
extractProtTC(x)

Arguments

`x`	A character vector, as the input protein sequence.

Details

This function calculates the Tripeptide Composition descriptor (Dim: 8000).

Value

A length 8000 named vector

References

M. Bhasin, G. P. S. Raghava. Classification of Nuclear Receptors Based on Amino Acid Composition and Dipeptide Composition. Journal of Biological Chemistry, 2004, 279, 23262.

Examples

x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
extractProtTC(x)

x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
extractProtTC(x)

Generating Compound-Protein Interaction Descriptors

Description

Generating Compound-Protein Interaction Descriptors

Usage

getCPI(drugmat, protmat, type = c("combine", "tensorprod"))
getCPI(drugmat, protmat, type = c("combine", "tensorprod"))

Arguments

`drugmat`	The compound descriptor matrix.
`protmat`	The protein descriptor matrix.
`type`	The interaction type, one or two of `"combine"` and `"tensorprod"`.

Details

This function calculates the compound-protein interaction descriptors by three types of interaction:

combine - combine the two descriptor matrix, result has (p1 + p2) columns
tensorprod - calculate column-by-column (pseudo)-tensor product type interactions, result has (p1 * p2) columns

Value

A matrix containing the compound-protein interaction descriptors

Examples

x = matrix(1:10, ncol = 2)
y = matrix(1:15, ncol = 3)

getCPI(x, y, 'combine')
getCPI(x, y, 'tensorprod')
getCPI(x, y, type = c('combine', 'tensorprod'))
getCPI(x, y, type = c('tensorprod', 'combine'))

x = matrix(1:10, ncol = 2)
y = matrix(1:15, ncol = 3)

getCPI(x, y, 'combine')
getCPI(x, y, 'tensorprod')
getCPI(x, y, type = c('combine', 'tensorprod'))
getCPI(x, y, type = c('tensorprod', 'combine'))

Retrieve Drug Molecules in MOL and SMILES Format from Databases

Description

Retrieve Drug Molecules in MOL and SMILES Format from Databases

Usage

getDrug(
  id,
  from = c("pubchem", "chembl", "cas", "kegg", "drugbank"),
  type = c("mol", "smile"),
  parallel = 5
)
getDrug(
  id,
  from = c("pubchem", "chembl", "cas", "kegg", "drugbank"),
  type = c("mol", "smile"),
  parallel = 5
)

Arguments

`id`	A character vector, as the drug ID(s).
`from`	The database, one of `'pubchem'`, `'chembl'`, `'cas'`, `'kegg'`, `'drugbank'`.
`type`	The returned molecule format, `mol` or `smile`.
`parallel`	An integer, the parallel parameter, indicates how many process the user would like to use for retrieving the data (using RCurl), default is `5`. For regular cases, we recommend a number less than `20`.

Details

This function retrieves drug molecules in MOL and SMILES format from five databases.

Value

A length of id character vector, each element containing the corresponding drug molecule.

Examples

id = c('DB00859', 'DB00860')

getDrug(id, 'drugbank', 'smile')
id = c('DB00859', 'DB00860')

getDrug(id, 'drugbank', 'smile')

Retrieve Protein Sequence in FASTA Format from the KEGG Database

Description

Retrieve Protein Sequence in FASTA Format from the KEGG Database

Usage

getFASTAFromKEGG(id, parallel = 5)
getFASTAFromKEGG(id, parallel = 5)

Arguments

`id`	A character vector, as the protein ID.
`parallel`	An integer, the parallel parameter, indicates how many process the user would like to use for retrieving the data (using RCurl), default is `5`. For regular cases, we recommend a number less than `20`.

Details

This function retrieves protein sequences in FASTA format from the KEGG database.

Value

A list, each component contains one of the protein sequences in FASTA format.

Examples

id = c('hsa:10161', 'hsa:10162')

getFASTAFromKEGG(id)
id = c('hsa:10161', 'hsa:10162')

getFASTAFromKEGG(id)

Retrieve Protein Sequence in FASTA Format from the UniProt Database

Description

Retrieve Protein Sequence in FASTA Format from the UniProt Database

Usage

getFASTAFromUniProt(id, parallel = 5)
getFASTAFromUniProt(id, parallel = 5)

Arguments

`id`	A character vector, as the protein ID.
`parallel`	An integer, the parallel parameter, indicates how many process the user would like to use for retrieving the data (using RCurl), default is `5`. For regular cases, we recommend a number less than `20`.

Details

This function retrieves protein sequences in FASTA format from the UniProt database.

Value

A list, each component contains one of the protein sequences in FASTA format.

References

UniProt. https://www.uniprot.org/

UniProt REST API Documentation. https://www.uniprot.org/help/api

Examples

id = c('P00750', 'P00751', 'P00752')

getFASTAFromUniProt(id)
id = c('P00750', 'P00751', 'P00752')

getFASTAFromUniProt(id)

Retrieve Drug Molecules in InChI Format from the CAS Database

Description

Retrieve Drug Molecules in InChI Format from the CAS Database

Usage

getMolFromCAS(id, parallel = 5)
getMolFromCAS(id, parallel = 5)

Arguments

`id`	A character vector, as the CAS drug ID.
`parallel`	An integer, the parallel parameter, indicates how many process the user would like to use for retrieving the data (using RCurl), default is `5`. For regular cases, we recommend a number less than `20`.

Details

This function retrieves drug molecules in InChI format from the CAS database. CAS database only provides InChI data, so here we return the molecule in InChI format, users could convert them to SMILES format using Open Babel or other third-party tools.

Value

A length of id character vector, each element containing the corresponding drug molecule.

Examples

id = '52-67-5'  # Penicillamine

getMolFromCAS(id)
id = '52-67-5'  # Penicillamine

getMolFromCAS(id)

Retrieve Drug Molecules in MOL Format from the ChEMBL Database

Description

Retrieve Drug Molecules in MOL Format from the ChEMBL Database

Usage

getMolFromChEMBL(id, parallel = 5)
getMolFromChEMBL(id, parallel = 5)

Arguments

`id`	A character vector, as the ChEMBL drug ID.
`parallel`	An integer, the parallel parameter, indicates how many process the user would like to use for retrieving the data (using RCurl), default is `5`. For regular cases, we recommend a number less than `20`.

Details

This function retrieves drug molecules in MOL format from the ChEMBL database.

Value

A length of id character vector, each element containing the corresponding drug molecule.

Examples

id = 'CHEMBL1430'  # Penicillamine

getMolFromChEMBL(id)
id = 'CHEMBL1430'  # Penicillamine

getMolFromChEMBL(id)

Retrieve Drug Molecules in MOL Format from the DrugBank Database

Description

Retrieve Drug Molecules in MOL Format from the DrugBank Database

Usage

getMolFromDrugBank(id, parallel = 5)
getMolFromDrugBank(id, parallel = 5)

Arguments

`id`	A character vector, as the DrugBank drug ID.
`parallel`	An integer, the parallel parameter, indicates how many process the user would like to use for retrieving the data (using RCurl), default is `5`. For regular cases, we recommend a number less than `20`.

Details

This function retrieves drug molecules in MOL format from the DrugBank database.

Value

A length of id character vector, each element containing the corresponding drug molecule.

Examples

id = 'DB00859'  # Penicillamine

getMolFromDrugBank(id)
id = 'DB00859'  # Penicillamine

getMolFromDrugBank(id)

Retrieve Drug Molecules in MOL Format from the KEGG Database

Description

Retrieve Drug Molecules in MOL Format from the KEGG Database

Usage

getMolFromKEGG(id, parallel = 5)
getMolFromKEGG(id, parallel = 5)

Arguments

`id`	A character vector, as the KEGG drug ID.
`parallel`	An integer, the parallel parameter, indicates how many process the user would like to use for retrieving the data (using RCurl), default is `5`. For regular cases, we recommend a number less than `20`.

Details

This function retrieves drug molecules in MOL format from the KEGG database.

Value

A length of id character vector, each element containing the corresponding drug molecule.

Examples

id = 'D00496'  # Penicillamine

getMolFromKEGG(id)
id = 'D00496'  # Penicillamine

getMolFromKEGG(id)

Retrieve Drug Molecules in MOL Format from the PubChem Database

Description

Retrieve Drug Molecules in MOL Format from the PubChem Database

Usage

getMolFromPubChem(id, parallel = 5)
getMolFromPubChem(id, parallel = 5)

Arguments

`id`	A character vector, as the PubChem drug ID.
`parallel`	An integer, the parallel parameter, indicates how many processes the user would like to use for retrieving the data (using RCurl), default is `5`. For regular cases, we recommend a number less than `20`.

Details

This function retrieves drug molecules in MOL format from the PubChem database.

Value

A length of id character vector, each element containing the corresponding drug molecule.

Examples

id = c('7847562', '7847563')  # Penicillamine

getMolFromPubChem(id)
id = c('7847562', '7847563')  # Penicillamine

getMolFromPubChem(id)

Retrieve Protein Sequence in PDB Format from RCSB PDB

Description

Retrieve Protein Sequence in PDB Format from RCSB PDB

Usage

getPDBFromRCSBPDB(id, parallel = 5)
getPDBFromRCSBPDB(id, parallel = 5)

Arguments

`id`	A character vector, as the protein ID.
`parallel`	An integer, the parallel parameter, indicates how many process the user would like to use for retrieving the data (using RCurl), default is `5`. For regular cases, we recommend a number less than `20`.

Details

This function retrieves protein sequences in PDB format from RCSB PDB.

Value

A list, each component contains one of the protein sequences in PDB format.

Examples

id = c('4HHB', '4FF9')

getPDBFromRCSBPDB(id)
id = c('4HHB', '4FF9')

getPDBFromRCSBPDB(id)

Generating Protein-Protein Interaction Descriptors

Description

Generating Protein-Protein Interaction Descriptors

Usage

getPPI(protmat1, protmat2, type = c("combine", "tensorprod", "entrywise"))
getPPI(protmat1, protmat2, type = c("combine", "tensorprod", "entrywise"))

Arguments

`protmat1`	The first protein descriptor matrix, must have the same ncol with `protmat2`.
`protmat2`	The second protein descriptor matrix, must have the same ncol with `protmat1`.
`type`	The interaction type, one or more of `"combine"`, `"tensorprod"`, and `"entrywise"`.

Details

This function calculates the protein-protein interaction descriptors by three types of interaction:

combine - combine the two descriptor matrix, result has (p + p) columns
tensorprod - calculate column-by-column (pseudo)-tensor product type interactions, result has (p * p) columns
entrywise - calculate entrywise product and entrywise sum of the two matrices, then combine them, result has (p + p) columns

Value

A matrix containing the protein-protein interaction descriptors

Examples

x = matrix(1:10, ncol = 2)
y = matrix(5:14, ncol = 2)

getPPI(x, y, type = 'combine')
getPPI(x, y, type = 'tensorprod')
getPPI(x, y, type = 'entrywise')
getPPI(x, y, type = c('combine', 'tensorprod'))
getPPI(x, y, type = c('combine', 'entrywise'))
getPPI(x, y, type = c('entrywise', 'tensorprod'))
getPPI(x, y, type = c('combine', 'entrywise', 'tensorprod'))

x = matrix(1:10, ncol = 2)
y = matrix(5:14, ncol = 2)

getPPI(x, y, type = 'combine')
getPPI(x, y, type = 'tensorprod')
getPPI(x, y, type = 'entrywise')
getPPI(x, y, type = c('combine', 'tensorprod'))
getPPI(x, y, type = c('combine', 'entrywise'))
getPPI(x, y, type = c('entrywise', 'tensorprod'))
getPPI(x, y, type = c('combine', 'entrywise', 'tensorprod'))

Retrieve Protein Sequence in various Formats from Databases

Description

Retrieve Protein Sequence in various Formats from Databases

Usage

getProt(
  id,
  from = c("uniprot", "kegg", "pdb"),
  type = c("fasta", "pdb", "aaseq"),
  parallel = 5
)
getProt(
  id,
  from = c("uniprot", "kegg", "pdb"),
  type = c("fasta", "pdb", "aaseq"),
  parallel = 5
)

Arguments

`id`	A character vector, as the protein ID(s).
`from`	The database, one of `'uniprot'`, `'kegg'`, or `'pdb'`.
`type`	The returned protein format, one of `fasta`, `pdb`, or `aaseq`.
`parallel`	An integer, the parallel parameter, indicates how many process the user would like to use for retrieving the data (using RCurl), default is `5`. For regular cases, we recommend a number less than `20`.

Details

This function retrieves protein sequence in various formats from three databases.

Value

A length of id character list, each element containing the corresponding protein sequence(s) or file(s).

Examples

id = c('P00750', 'P00751', 'P00752')

getProt(id, from = 'uniprot', type = 'aaseq')
id = c('P00750', 'P00751', 'P00752')

getProt(id, from = 'uniprot', type = 'aaseq')

Retrieve Protein Sequence from the KEGG Database

Description

Retrieve Protein Sequence from the KEGG Database

Usage

getSeqFromKEGG(id, parallel = 5)
getSeqFromKEGG(id, parallel = 5)

Arguments

`id`	A character vector, as the protein ID.
`parallel`	An integer, the parallel parameter, indicates how many process the user would like to use for retrieving the data (using RCurl), default is `5`. For regular cases, we recommend a number less than `20`.

Details

This function retrieves protein represented by amino acid sequence from the KEGG database.

Value

A list, each component contains one of the protein represented by amino acid sequence(s).

Examples

id = c('hsa:10161', 'hsa:10162')

getSeqFromKEGG(id)
id = c('hsa:10161', 'hsa:10162')

getSeqFromKEGG(id)

Retrieve Protein Sequence from RCSB PDB

Description

Retrieve Protein Sequence from RCSB PDB

Usage

getSeqFromRCSBPDB(id, parallel = 5)
getSeqFromRCSBPDB(id, parallel = 5)

Arguments

`id`	A character vector, as the protein ID.
`parallel`	An integer, the parallel parameter, indicates how many process the user would like to use for retrieving the data (using RCurl), default is `5`. For regular cases, we recommend a number less than `20`.

Details

This function retrieves protein sequences from RCSB PDB.

Value

A list, each component contains one of the protein represented by amino acid sequence(s).

Examples

id = c('4HHB', '4FF9')

getSeqFromRCSBPDB(id)
id = c('4HHB', '4FF9')

getSeqFromRCSBPDB(id)

Retrieve Protein Sequence from the UniProt Database

Description

Retrieve Protein Sequence from the UniProt Database

Usage

getSeqFromUniProt(id, parallel = 5)
getSeqFromUniProt(id, parallel = 5)

Arguments

`id`	A character vector, as the protein ID.
`parallel`	An integer, the parallel parameter, indicates how many process the user would like to use for retrieving the data (using RCurl), default is `5`. For regular cases, we recommend a number less than `20`.

Details

This function retrieves protein represented by amino acid sequence from the UniProt database.

Value

A list, each component contains one of the protein represented by amino acid sequence(s).

References

UniProt. https://www.uniprot.org/

UniProt REST API Documentation. https://www.uniprot.org/help/api

Examples

id = c('P00750', 'P00751', 'P00752')

getSeqFromUniProt(id)
id = c('P00750', 'P00751', 'P00752')

getSeqFromUniProt(id)

Retrieve Drug Molecules in SMILES Format from the ChEMBL Database

Description

Retrieve Drug Molecules in SMILES Format from the ChEMBL Database

Usage

getSmiFromChEMBL(id, parallel = 5)
getSmiFromChEMBL(id, parallel = 5)

Arguments

`id`	A character vector, as the ChEMBL drug ID.
`parallel`	An integer, the parallel parameter, indicates how many process the user would like to use for retrieving the data (using RCurl), default is `5`. For regular cases, we recommend a number less than `20`.

Details

This function retrieves drug molecules in SMILES format from the ChEMBL database.

Value

A length of id character vector, each element containing the corresponding drug molecule.

Examples

id = 'CHEMBL1430'  # Penicillamine

getSmiFromChEMBL(id)
id = 'CHEMBL1430'  # Penicillamine

getSmiFromChEMBL(id)

Retrieve Drug Molecules in SMILES Format from the DrugBank Database

Description

Retrieve Drug Molecules in SMILES Format from the DrugBank Database

Usage

getSmiFromDrugBank(id, parallel = 5)
getSmiFromDrugBank(id, parallel = 5)

Arguments

`id`	A character vector, as the DrugBank drug ID.
`parallel`	An integer, the parallel parameter, indicates how many process the user would like to use for retrieving the data (using RCurl), default is `5`. For regular cases, we recommend a number less than `20`.

Details

This function retrieves drug molecules in SMILES format from the DrugBank database.

Value

A length of id character vector, each element containing the corresponding drug molecule.

Examples

id = 'DB00859'  # Penicillamine

getSmiFromDrugBank(id)
id = 'DB00859'  # Penicillamine

getSmiFromDrugBank(id)

Retrieve Drug Molecules in SMILES Format from the KEGG Database

Description

Retrieve Drug Molecules in SMILES Format from the KEGG Database

Usage

getSmiFromKEGG(id, parallel = 5)
getSmiFromKEGG(id, parallel = 5)

Arguments

`id`	A character vector, as the KEGG drug ID.
`parallel`	An integer, the parallel parameter, indicates how many process the user would like to use for retrieving the data (using RCurl), default is `5`. For regular cases, we recommend a number less than `20`.

Details

This function retrieves drug molecules in SMILES format from the KEGG database.

Value

A length of id character vector, each element containing the corresponding drug molecule.

Examples

id = 'D00496'  # Penicillamine

getSmiFromKEGG(id)
id = 'D00496'  # Penicillamine

getSmiFromKEGG(id)

Retrieve Drug Molecules in SMILES Format from the PubChem Database

Description

Retrieve Drug Molecules in SMILES Format from the PubChem Database

Usage

getSmiFromPubChem(id, parallel = 5)
getSmiFromPubChem(id, parallel = 5)

Arguments

`id`	A character vector, as the PubChem drug ID.
`parallel`	An integer, the parallel parameter, indicates how many processes the user would like to use for retrieving the data (using RCurl), default is `5`. For regular cases, we recommend a number less than `20`.

Details

This function retrieves drug molecules in SMILES format from the PubChem database.

Value

A length of id character vector, each element containing the corresponding drug molecule.

Examples

id = c('7847562', '7847563')  # Penicillamine

getSmiFromPubChem(id)
id = c('7847562', '7847563')  # Penicillamine

getSmiFromPubChem(id)

OptAA3d.sdf - 20 Amino Acids Optimized with MOE 2011.10 (Semiempirical AM1)

Description

OptAA3d.sdf - 20 Amino Acids Optimized with MOE 2011.10 (Semiempirical AM1)

Details

OptAA3d.sdf - 20 Amino Acids Optimized with MOE 2011.10 (Semiempirical AM1)

Value

OptAA3d data

Examples

# This example requires the rcdk package
# library('rcdk')
# optaa3d = load.molecules(system.file('sysdata/OptAA3d.sdf', package = 'Rcpi'))
# view.molecule.2d(optaa3d[[1]])  # view the first amino acid
# This example requires the rcdk package
# library('rcdk')
# optaa3d = load.molecules(system.file('sysdata/OptAA3d.sdf', package = 'Rcpi'))
# view.molecule.2d(optaa3d[[1]])  # view the first amino acid

Read Protein Sequences in FASTA Format

Description

Reads protein sequences in FASTA format.

Usage

readFASTA(
  file = system.file("protseq/P00750.fasta", package = "Rcpi"),
  legacy.mode = TRUE,
  seqonly = FALSE
)
readFASTA(
  file = system.file("protseq/P00750.fasta", package = "Rcpi"),
  legacy.mode = TRUE,
  seqonly = FALSE
)

Arguments

`file`	The name of the file which the sequences in fasta format are to be read from. If it does not contain an absolute or relative path, the file name is relative to the current working directory, `getwd`. The default here is to read the `P00750.fasta` file which is present in the `protseq` directory of the Rcpi package.
`legacy.mode`	If set to `TRUE`, lines starting with a semicolon ';' are ignored. Default value is `TRUE`.
`seqonly`	If set to `TRUE`, only sequences as returned without attempt to modify them or to get their names and annotations (execution time is divided approximately by a factor 3). Default value is `FALSE`.

Value

Character vector.

References

Pearson, W.R. and Lipman, D.J. (1988) Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences of the United States of America, 85: 2444-2448.

Examples

P00750 = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))
P00750
P00750 = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))
P00750

Read Molecules from SDF Files and Return Parsed Java Molecular Object

Description

Read Molecules from SDF Files and Return Parsed Java Molecular Object

Usage

readMolFromSDF(sdffile)
readMolFromSDF(sdffile)

Arguments

sdffile

Character vector, containing SDF file location(s).

Details

This function reads molecules from SDF files and return parsed Java molecular object needed by extractDrug... functions.

Value

A list, containing parsed Java molecular object.

Examples

sdf  = system.file('compseq/DB00859.sdf', package = 'Rcpi')
sdfs = c(system.file('compseq/DB00859.sdf', package = 'Rcpi'),
         system.file('compseq/DB00860.sdf', package = 'Rcpi'))

mol  = readMolFromSDF(sdf)
mols = readMolFromSDF(sdfs)
sdf  = system.file('compseq/DB00859.sdf', package = 'Rcpi')
sdfs = c(system.file('compseq/DB00859.sdf', package = 'Rcpi'),
         system.file('compseq/DB00860.sdf', package = 'Rcpi'))

mol  = readMolFromSDF(sdf)
mols = readMolFromSDF(sdfs)

Read Molecules from SMILES Files and Return Parsed Java Molecular Object or Plain Text List

Description

Read Molecules from SMILES Files and Return Parsed Java Molecular Object or Plain Text List

Usage

readMolFromSmi(smifile, type = c("mol", "text"))
readMolFromSmi(smifile, type = c("mol", "text"))

Arguments

`smifile`	Character vector, containing SMILES file location(s).
`type`	`'mol'` or `'text'`. `'mol'` returns parsed Java molecular object, used for `'text'` returns (plain-text) character string list. For common molecular descriptors and fingerprints, use `'mol'`. For descriptors and fingerprints calculated by OpenBabel, i.e. functions named `extractDrugOB...()` , use `'text'`.

Details

This function reads molecules from SMILES strings and return parsed Java molecular object or plain text list needed by extractDrug...() functions.

Value

A list, containing parsed Java molecular object or character strings.

Examples

smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol1 = readMolFromSmi(smi, type = 'mol')
mol2 = readMolFromSmi(smi, type = 'text')
smi = system.file('vignettedata/FDAMDD.smi', package = 'Rcpi')

mol1 = readMolFromSmi(smi, type = 'mol')
mol2 = readMolFromSmi(smi, type = 'text')

Read Protein Sequences in PDB Format

Description

Read Protein Sequences in PDB Format

Usage

readPDB(file = system.file("protseq/4HHB.pdb", package = "Rcpi"))
readPDB(file = system.file("protseq/4HHB.pdb", package = "Rcpi"))

Arguments

file

The name of the file which the sequences in PDB format are to be read from. If it does not contain an absolute or relative path, the file name is relative to the current working directory, getwd. The default here is to read the 4HHB.PDB file which is present in the protseq directory of the Rcpi package.

Details

This function reads protein sequences in PDB (Protein Data Bank) format, and return the amino acid sequences represented by single-letter code.

Value

A character vector, representing the amino acid sequence of the single-letter code.

References

Protein Data Bank Contents Guide: Atomic Coordinate Entry Format Description, Version 3.30. Accessed 2013-06-26. https://files.wwpdb.org/pub/pdb/doc/format_descriptions/Format_v33_Letter.pdf

Examples

Seq4HHB = readPDB(system.file('protseq/4HHB.pdb', package = 'Rcpi'))
Seq4HHB
Seq4HHB = readPDB(system.file('protseq/4HHB.pdb', package = 'Rcpi'))
Seq4HHB

Parallelized Drug Molecule Similarity Search by Molecular Fingerprints Similarity or Maximum Common Substructure Search

Description

Parallelized Drug Molecule Similarity Search by Molecular Fingerprints Similarity or Maximum Common Substructure Search

Usage

searchDrug(
  mol,
  moldb,
  cores = 2,
  method = c("fp", "mcs"),
  fptype = c("standard", "extended", "graph", "hybrid", "maccs", "estate", "pubchem",
    "kr", "shortestpath", "fp2", "fp3", "fp4", "obmaccs"),
  fpsim = c("tanimoto", "euclidean", "cosine", "dice", "hamming"),
  mcssim = c("tanimoto", "overlap"),
  ...
)
searchDrug(
  mol,
  moldb,
  cores = 2,
  method = c("fp", "mcs"),
  fptype = c("standard", "extended", "graph", "hybrid", "maccs", "estate", "pubchem",
    "kr", "shortestpath", "fp2", "fp3", "fp4", "obmaccs"),
  fpsim = c("tanimoto", "euclidean", "cosine", "dice", "hamming"),
  mcssim = c("tanimoto", "overlap"),
  ...
)

Arguments

`mol`	The query molecule. The location of a `sdf` file containing one molecule.
`moldb`	The molecule database. The location of a `sdf` file containing all the molecules to be searched with.
`cores`	Integer. The number of CPU cores to use for parallel search, default is `2`. Users could use the `detectCores()` function in the `parallel` package to see how many cores they could use.
`method`	`'fp'` or `'mcs'`. Search by molecular fingerprints or by maximum common substructure searching.
`fptype`	The fingerprint type, only available when `method = 'fp'`. Rcpi supports 13 types of fingerprints, including `'standard'`, `'extended'`, `'graph'`, `'hybrid'`, `'maccs'`, `'estate'`, `'pubchem'`, `'kr'`, `'shortestpath'`, `'fp2'`, `'fp3'`, `'fp4'`, `'obmaccs'`.
`fpsim`	Similarity measure type for fingerprint, only available when `method = 'fp'`. Including `'tanimoto'`, `'euclidean'`, `'cosine'`, `'dice'` and `'hamming'`. See `calcDrugFPSim` for details.
`mcssim`	Similarity measure type for maximum common substructure search, only available when `method = 'mcs'`. Including `'tanimoto'` and `'overlap'`.
`...`	Other possible parameter for maximum common substructure search, see `calcDrugMCSSim` for available options.

Details

This function does compound similarity search derived by various molecular fingerprints with various similarity measures or derived by maximum common substructure search. This function runs for a query compound against a set of molecules.

Value

Named numerical vector. With the decreasing similarity value of the molecules in the database.

Examples

mol = system.file('compseq/DB00530.sdf', package = 'Rcpi')
# DrugBank ID DB00530: Erlotinib
moldb = system.file('compseq/tyrphostin.sdf', package = 'Rcpi')
# Database composed by searching 'tyrphostin' in PubChem and filtered by Lipinski's Rule of Five

searchDrug(mol, moldb, cores = 4, method = 'fp', fptype = 'maccs', fpsim = 'hamming')
searchDrug(mol, moldb, cores = 4, method = 'fp', fptype = 'fp2', fpsim = 'tanimoto')
searchDrug(mol, moldb, cores = 4, method = 'mcs', mcssim = 'tanimoto')
mol = system.file('compseq/DB00530.sdf', package = 'Rcpi')
# DrugBank ID DB00530: Erlotinib
moldb = system.file('compseq/tyrphostin.sdf', package = 'Rcpi')
# Database composed by searching 'tyrphostin' in PubChem and filtered by Lipinski's Rule of Five

searchDrug(mol, moldb, cores = 4, method = 'fp', fptype = 'maccs', fpsim = 'hamming')
searchDrug(mol, moldb, cores = 4, method = 'fp', fptype = 'fp2', fpsim = 'tanimoto')
searchDrug(mol, moldb, cores = 4, method = 'mcs', mcssim = 'tanimoto')

Protein Sequence Segmentation

Description

Protein Sequence Segmentation

Usage

segProt(
  x,
  aa = c("A", "R", "N", "D", "C", "E", "Q", "G", "H", "I", "L", "K", "M", "F", "P", "S",
    "T", "W", "Y", "V"),
  k = 7
)
segProt(
  x,
  aa = c("A", "R", "N", "D", "C", "E", "Q", "G", "H", "I", "L", "K", "M", "F", "P", "S",
    "T", "W", "Y", "V"),
  k = 7
)

Arguments

`x`	A character vector, as the input protein sequence.
`aa`	A character, the amino acid type. one of `'A'`, `'R'`, `'N'`, `'D'`, `'C'`, `'E'`, `'Q'`, `'G'`, `'H'`, `'I'`, `'L'`, `'K'`, `'M'`, `'F'`, `'P'`, `'S'`, `'T'`, `'W'`, `'Y'`, `'V'`.
`k`	A positive integer, specifys the window size (half of the window), default is 7.

Details

This function extracts the segmentations from the protein sequence.

Value

A named list, each component contains one of the segmentations (a character string), names of the list components are the positions of the specified amino acid in the sequence.

Examples

x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
segProt(x, aa = 'R', k = 5)

x = readFASTA(system.file('protseq/P00750.fasta', package = 'Rcpi'))[[1]]
segProt(x, aa = 'R', k = 5)

Package 'Rcpi'

Help Index

2D Autocorrelations Descriptors for 20 Amino Acids calculated by Dragon

Description

Details

Value

Examples

3D-MoRSE Descriptors for 20 Amino Acids calculated by Dragon

Description

Details

Value

Examples

Atom-Centred Fragments Descriptors for 20 Amino Acids calculated by Dragon

Description

Details

Value

Examples

BLOSUM100 Matrix for 20 Amino Acids

Description

Details

Value

Examples

BLOSUM45 Matrix for 20 Amino Acids

Description

Details

Value

Examples

BLOSUM50 Matrix for 20 Amino Acids

Description

Details

Value

Examples

BLOSUM62 Matrix for 20 Amino Acids

Description

Details

Value

Examples

BLOSUM80 Matrix for 20 Amino Acids

Description

Details

Value

Examples

Burden Eigenvalues Descriptors for 20 Amino Acids calculated by Dragon

Description

Details

Value

Examples

Connectivity Indices Descriptors for 20 Amino Acids calculated by Dragon

Description

Details

Value

Examples

Constitutional Descriptors for 20 Amino Acids calculated by Dragon

Description

Details

Value

Examples

CPSA Descriptors for 20 Amino Acids calculated by Discovery Studio

Description

Details

Value

Examples

All 2D Descriptors for 20 Amino Acids calculated by Dragon

Description

Details

Value

Examples

Edge Adjacency Indices Descriptors for 20 Amino Acids calculated by Dragon

Description

Details

Value

Examples

Eigenvalue-Based Indices Descriptors for 20 Amino Acids calculated by Dragon

Description

Details

Value

Examples

Functional Group Counts Descriptors for 20 Amino Acids calculated by Dragon

Description

Details