Most proteomics experiments require protein (peptide) separation and cleavage before these molecules can be analyzed or identified by mass spectrometry or other analytical tools.
cleaver performs in-silico cleavage of polypeptide sequences, e.g. to create theoretical mass spectrometry data.
The cleavage rules are taken from the ExPASy PeptideCutter tool (Gasteiger et al. 2005).
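To give a feel for what such a rule encodes, here is a minimal base-R sketch of a deliberately simplified trypsin rule: cut C-terminal to K or R unless the next residue is P. The helper name `simpleTrypsin` is hypothetical and not part of cleaver; the PeptideCutter rule covers more exceptions than this sketch does.

```r
## simplified trypsin rule: cleave after K or R, but not before P
## (hypothetical helper for illustration, not cleaver's implementation)
simpleTrypsin <- function(x) {
  ## positions of K/R that are not followed by P
  sites <- gregexpr("[KR](?!P)", x, perl=TRUE)[[1]]
  ## gregexpr() returns -1 if nothing matches; a cut after the last
  ## residue would not create a new peptide, so it is dropped as well
  sites <- sites[sites > 0 & sites < nchar(x)]
  ## each peptide starts after the previous site and ends at the next one
  starts <- c(1, sites + 1)
  ends <- c(sites, nchar(x))
  substring(x, starts, ends)
}

simpleTrypsin("LAAGKVEDSD")
simpleTrypsin("AAKPGG")
```

The first call should split the sequence after the lysine, while the second should stay uncut because the lysine is followed by proline.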
Loading the cleaver package:
Getting help and listing all available cleavage rules:
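The two steps above might look as follows. This is a sketch that assumes cleaver is already installed from Bioconductor and that the available rules are described on the help page of `cleave`.

```r
## attach the package
library("cleaver")

## package overview
help("cleaver")

## documentation of cleave(), assumed to list the supported cleavage rules
help("cleave")
```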
Cleaving of Gastric juice peptide 1 (P01358) using Trypsin:
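The output that follows can be reproduced with calls along these lines. The sequence `LAAGKVEDSD` is taken from the output itself; `cleave` returns the peptides, `cleavageRanges` their start and end positions, and `cleavageSites` the positions after which the enzyme cuts.

```r
library("cleaver")

## resulting peptides
cleave("LAAGKVEDSD", enzym="trypsin")

## start/end position of each peptide
cleavageRanges("LAAGKVEDSD", enzym="trypsin")

## positions after which the enzyme cuts
cleavageSites("LAAGKVEDSD", enzym="trypsin")
```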
## $LAAGKVEDSD
## [1] "LAAGK" "VEDSD"
## $LAAGKVEDSD
##      start end
## [1,]     1   5
## [2,]     6  10
## $LAAGKVEDSD
## [1] 5
Sometimes cleavage is imperfect and the enzyme misses some cleavage positions:
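Missed cleavages are controlled by the `missedCleavages` argument. The following sketch shows calls that should correspond to the output that follows: first exactly one missed cleavage, then zero or one.

```r
library("cleaver")

## exactly one missed cleavage
cleave("LAAGKVEDSD", enzym="trypsin", missedCleavages=1)
cleavageRanges("LAAGKVEDSD", enzym="trypsin", missedCleavages=1)

## zero or one missed cleavage
cleave("LAAGKVEDSD", enzym="trypsin", missedCleavages=0:1)
cleavageRanges("LAAGKVEDSD", enzym="trypsin", missedCleavages=0:1)
```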
## $LAAGKVEDSD
## [1] "LAAGKVEDSD"
## $LAAGKVEDSD
##      start end
## [1,]     1  10
## $LAAGKVEDSD
## [1] "LAAGK" "VEDSD" "LAAGKVEDSD"
## $LAAGKVEDSD
##      start end
## [1,]     1   5
## [2,]     6  10
## [3,]     1  10
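The ranges above follow a simple pattern: with `m` missed cleavages, each peptide spans `m + 1` neighbouring fragments. A minimal base-R sketch of that bookkeeping is shown below; `rangesWithMissed` is a hypothetical helper written for illustration and is not how cleaver is implemented.

```r
## hypothetical helper: start/end ranges for a sequence of length n,
## given sorted cleavage sites and a number of missed cleavages
rangesWithMissed <- function(sites, n, missed=0) {
  ## fragment boundaries for a perfect digest
  starts <- c(1, sites + 1)
  ends <- c(sites, n)
  ## number of peptides that span (missed + 1) neighbouring fragments
  k <- length(starts) - missed
  if (k < 1) {
    return(cbind(start=integer(), end=integer()))
  }
  cbind(start=starts[seq_len(k)], end=ends[seq_len(k) + missed])
}

## one cleavage site after position 5 in a sequence of length 10
rangesWithMissed(5, n=10, missed=0)
rangesWithMissed(5, n=10, missed=1)

## zero or one missed cleavage, stacked into one matrix
do.call(rbind, lapply(0:1, rangesWithMissed, sites=5, n=10))
```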
Combining cleaver and Biostrings (Pages et al., n.d.):
## create AAStringSet object
p <- AAStringSet(c(gaju="LAAGKVEDSD", pnm="AGEPKLDAGV"))
## cleave it
cleave(p, enzym="trypsin")
## AAStringSetList of length 2
## [["gaju"]] LAAGK VEDSD
## [["pnm"]] AGEPK LDAGV
## get the cleavage ranges
cleavageRanges(p, enzym="trypsin")
## IRangesList object of length 2:
## $gaju
## IRanges object with 2 ranges and 0 metadata columns:
##         start       end     width
##     <integer> <integer> <integer>
##   [1]       1         5         5
##   [2]       6        10         5
##
## $pnm
## IRanges object with 2 ranges and 0 metadata columns:
##         start       end     width
##     <integer> <integer> <integer>
##   [1]       1         5         5
##   [2]       6        10         5
## get the cleavage sites
cleavageSites(p, enzym="trypsin")
## $gaju
## [1] 5
##
## $pnm
## [1] 5
Downloading Insulin (P01308) and Somatostatin (P61278) sequences from the UniProt (The UniProt Consortium 2012) database using UniProt.ws (Carlson, n.d.).
## load UniProt.ws library
library("UniProt.ws")
## select species Homo sapiens
up <- UniProt.ws(taxId=9606)
## download sequences of Insulin/Somatostatin
s <- select(up,
            keys=c("P01308", "P61278"),
            columns=c("sequence"),
            keytype="UniProtKB")
## fetch only sequences
sequences <- setNames(s$Sequence, s$Entry)
## remove whitespaces
sequences <- gsub(pattern="[[:space:]]", replacement="", x=sequences)
Cleaving using Pepsin:
cleave(sequences, enzym="pepsin")
## $P01308
## [1] "MA" "L" "W" "MRLLP"
## [5] "LL" "A" "WGPDPAAA" "F"
## [9] "VNQH" "CGSH" "VEA" "Y"
## [13] "VCGERG" "FF" "YTPKTRREAED" "QVGQVE"
## [17] "GGGPGAGS" "LQP" "LA" "EGS"
## [21] "QKRGIVEQCCTSICS" "Q" "EN" "CN"
##
## $P61278
## [1] "ML" "SCRL" "QCA"
## [4] "L" "AA" "SIV"
## [7] "A" "GCVTGAPSDPRL" "RQ"
## [10] "FL" "QKS" "LAAAAGKQEL"
## [13] "AK" "Y" "AE"
## [16] "SEPNQTENDA" "LEPED" "SQAAEQDEMRL"
## [19] "EL" "QRSANSNPAMAPRERKAGCKN" "FF"
## [22] "W" "KT" "FTSC"
A common use case of in-silico cleavage is calculating the isotopic distribution of peptides that were enzymatically digested in the in-vitro experimental workflow. Here BRAIN (Claesen et al. 2012; Dittwald et al. 2013) is used to calculate the isotopic distribution of cleaver’s output. Please note that this is only a toy example; e.g. the relative intensities between peptides are not correct.
## load BRAIN library
library("BRAIN")
## cleave insulin
cleavedInsulin <- cleave(sequences[1], enzym="trypsin")[[1]]
## create empty plot area
plot(NA, xlim=c(150, 4300), ylim=c(0, 1),
     xlab="mass", ylab="relative intensity",
     main="tryptic digested insulin - isotopic distribution")
## loop through peptides
for (i in seq_along(cleavedInsulin)) {
  ## count C, H, N, O, S atoms in current peptide
  atoms <- BRAIN::getAtomsFromSeq(cleavedInsulin[[i]])
  ## calculate isotopic distribution
  d <- useBRAIN(atoms)
  ## draw peaks
  lines(d$masses, d$isoDistr, type="h", col=2)
}
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] BRAIN_1.53.0 lattice_0.22-6 PolynomF_2.0-8
## [4] UniProt.ws_2.47.1 RSQLite_2.3.8 cleaver_1.45.0
## [7] Biostrings_2.75.1 GenomeInfoDb_1.43.1 XVector_0.47.0
## [10] IRanges_2.41.1 S4Vectors_0.45.2 BiocGenerics_0.53.3
## [13] generics_0.1.3 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] KEGGREST_1.47.0 xfun_0.49 bslib_0.8.0
## [4] Biobase_2.67.0 rjsoncons_1.3.1 vctrs_0.6.5
## [7] tools_4.4.2 curl_6.0.1 tibble_3.2.1
## [10] fansi_1.0.6 AnnotationDbi_1.69.0 blob_1.2.4
## [13] BiocBaseUtils_1.9.0 pkgconfig_2.0.3 dbplyr_2.5.0
## [16] lifecycle_1.0.4 GenomeInfoDbData_1.2.13 compiler_4.4.2
## [19] progress_1.2.3 htmltools_0.5.8.1 sys_3.4.3
## [22] buildtools_1.0.0 sass_0.4.9 yaml_2.3.10
## [25] pillar_1.9.0 crayon_1.5.3 jquerylib_0.1.4
## [28] cachem_1.1.0 tidyselect_1.2.1 digest_0.6.37
## [31] dplyr_1.1.4 maketools_1.3.1 grid_4.4.2
## [34] fastmap_1.2.0 cli_3.6.3 magrittr_2.0.3
## [37] utf8_1.2.4 httpcache_1.2.0 prettyunits_1.2.0
## [40] filelock_1.0.3 UCSC.utils_1.3.0 bit64_4.5.2
## [43] rmarkdown_2.29 httr_1.4.7 bit_4.5.0
## [46] png_0.1-8 hms_1.1.3 memoise_2.0.1
## [49] evaluate_1.0.1 knitr_1.49 BiocFileCache_2.15.0
## [52] rlang_1.1.4 Rcpp_1.0.13-1 glue_1.8.0
## [55] DBI_1.2.3 BiocManager_1.30.25 jsonlite_1.8.9
## [58] R6_2.5.1 zlibbioc_1.52.0