Introduction

Most proteomics experiments require protein (or peptide) separation and cleavage steps before the molecules can be analyzed or identified by mass spectrometry or other analytical tools.

cleaver performs in-silico cleavage of polypeptide sequences, e.g. to create theoretical mass spectrometry data.

The cleavage rules are taken from the ExPASy PeptideCutter tool (Gasteiger et al. 2005).
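To make the idea of a cleavage rule concrete, the following base R sketch encodes a deliberately simplified trypsin rule as a regular expression: cleave C-terminal to K or R unless the next residue is P. This is an illustration only and not the rule set that cleaver implements (the PeptideCutter rules contain further exceptions); the function name `simpleTrypsinSites` is hypothetical.

```r
## simplified trypsin rule (assumption, not cleaver's actual rule):
## cleave after K or R, except when the following residue is P
simpleTrypsinSites <- function(x) {
  ## perl=TRUE enables the negative lookahead (?!P)
  m <- gregexpr("[KR](?!P)", x, perl=TRUE)[[1]]
  sites <- as.integer(m)
  ## gregexpr() returns -1 if nothing matched; a match at the
  ## C-terminus is dropped because nothing follows it
  sites[sites > 0 & sites < nchar(x)]
}

simpleTrypsinSites("LAAGKVEDSD")
```

For real analyses the rules shipped with cleaver should be preferred over such a hand-written pattern.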

Simple Usage

Loading the cleaver package:

library("cleaver")

Getting help and listing all available cleavage rules:

help("cleave")

Cleaving gastric juice peptide 1 (P01358) using trypsin:

## cleave it
cleave("LAAGKVEDSD", enzym="trypsin")

## $LAAGKVEDSD
## [1] "LAAGK" "VEDSD"

## get the cleavage ranges
cleavageRanges("LAAGKVEDSD", enzym="trypsin")

## $LAAGKVEDSD
##      start end
## [1,]     1   5
## [2,]     6  10

## get only cleavage sites
cleavageSites("LAAGKVEDSD", enzym="trypsin")

## $LAAGKVEDSD
## [1] 5
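The three results above are closely related: the ranges and the peptides follow directly from the cleavage sites. A minimal base R sketch of that relationship, using the site reported above, could look like this (the variable names are illustrative and not part of cleaver):

```r
x <- "LAAGKVEDSD"
## cleavage site as reported by cleavageSites() above
sites <- 5L

## every fragment starts one residue after the previous site
start <- c(1L, sites + 1L)
## and ends at the next site or at the end of the sequence
end <- c(sites, nchar(x))

ranges <- cbind(start=start, end=end)
peptides <- substring(x, start, end)

ranges
peptides
```

This yields the same start and end positions and the same two peptides, "LAAGK" and "VEDSD", as the calls to cleavageRanges and cleave above.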

Sometimes cleavage is incomplete and the enzyme misses some cleavage positions:

## miss one cleavage position
cleave("LAAGKVEDSD", enzym="trypsin", missedCleavages=1)

## $LAAGKVEDSD
## [1] "LAAGKVEDSD"

cleavageRanges("LAAGKVEDSD", enzym="trypsin", missedCleavages=1)

## $LAAGKVEDSD
##      start end
## [1,]     1  10

## miss zero or one cleavage positions
cleave("LAAGKVEDSD", enzym="trypsin", missedCleavages=0:1)

## $LAAGKVEDSD
## [1] "LAAGK"      "VEDSD"      "LAAGKVEDSD"

cleavageRanges("LAAGKVEDSD", enzym="trypsin", missedCleavages=0:1)

## $LAAGKVEDSD
##      start end
## [1,]     1   5
## [2,]     6  10
## [3,]     1  10
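One way to think about missed cleavages is that m missed positions join m + 1 neighbouring fragments into one peptide. The following base R sketch illustrates this under that assumption; `missedRanges` is a hypothetical helper and does not show how cleaver computes its result internally.

```r
## sites: cleavage sites, n: sequence length,
## missedCleavages: integer vector of allowed missed cleavages
missedRanges <- function(sites, n, missedCleavages=0L) {
  start <- c(1L, sites + 1L)
  end <- c(sites, n)
  k <- length(start)
  do.call(rbind, lapply(missedCleavages, function(m) {
    ## more missed cleavages than cleavage sites: nothing to report
    if (m >= k) {
      return(NULL)
    }
    ## join fragment i with its m following neighbours
    i <- seq_len(k - m)
    cbind(start=start[i], end=end[i + m])
  }))
}

missedRanges(sites=5L, n=10L, missedCleavages=0:1)
```

For the single site at position 5 this gives the ranges 1 to 5, 6 to 10 and 1 to 10, in line with the cleavageRanges output above.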

Combining cleaver with Biostrings (Pages et al., n.d.):

## create AAStringSet object
p <- AAStringSet(c(gaju="LAAGKVEDSD", pnm="AGEPKLDAGV"))

## cleave it
cleave(p, enzym="trypsin")

## AAStringSetList of length 2
## [["gaju"]] LAAGK VEDSD
## [["pnm"]] AGEPK LDAGV

cleavageRanges(p, enzym="trypsin")

## IRangesList object of length 2:
## $gaju
## IRanges object with 2 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]         1         5         5
##   [2]         6        10         5
## 
## $pnm
## IRanges object with 2 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]         1         5         5
##   [2]         6        10         5

cleavageSites(p, enzym="trypsin")

## $gaju
## [1] 5
## 
## $pnm
## [1] 5

Insulin & Somatostatin Example

Downloading the Insulin (P01308) and Somatostatin (P61278) sequences from the UniProt database (The UniProt Consortium 2012) using UniProt.ws (Carlson, n.d.):

## load UniProt.ws library
library("UniProt.ws")

## select species Homo sapiens
up <- UniProt.ws(taxId=9606)

## download sequences of Insulin/Somatostatin
s <- select(up,
    keys=c("P01308", "P61278"),
    columns=c("sequence"),
    keytype="UniProtKB"
)

## fetch only sequences
sequences <- setNames(s$Sequence, s$Entry)

## remove whitespace
sequences <- gsub(pattern="[[:space:]]", replacement="", x=sequences)

Cleaving using Pepsin:

cleave(sequences, enzym="pepsin")

## $P01308
##  [1] "MA"              "L"               "W"               "MRLLP"          
##  [5] "LL"              "A"               "WGPDPAAA"        "F"              
##  [9] "VNQH"            "CGSH"            "VEA"             "Y"              
## [13] "VCGERG"          "FF"              "YTPKTRREAED"     "QVGQVE"         
## [17] "GGGPGAGS"        "LQP"             "LA"              "EGS"            
## [21] "QKRGIVEQCCTSICS" "Q"               "EN"              "CN"             
## 
## $P61278
##  [1] "ML"                    "SCRL"                  "QCA"                  
##  [4] "L"                     "AA"                    "SIV"                  
##  [7] "A"                     "GCVTGAPSDPRL"          "RQ"                   
## [10] "FL"                    "QKS"                   "LAAAAGKQEL"           
## [13] "AK"                    "Y"                     "AE"                   
## [16] "SEPNQTENDA"            "LEPED"                 "SQAAEQDEMRL"          
## [19] "EL"                    "QRSANSNPAMAPRERKAGCKN" "FF"                   
## [22] "W"                     "KT"                    "FTSC"

Isotopic Distribution of Tryptically Digested Insulin

A common use case of in-silico cleavage is calculating the isotopic distribution of peptides that were enzymatically digested in the in-vitro experimental workflow. Here BRAIN (Claesen et al. 2012; Dittwald et al. 2013) is used to calculate the isotopic distribution of cleaver’s output. Please note that this is only a toy example; e.g. the relative intensities between peptides are not correct.

## load BRAIN library
library("BRAIN")

## cleave insulin
cleavedInsulin <- cleave(sequences[1], enzym="trypsin")[[1]]

## create empty plot area
plot(NA, xlim=c(150, 4300), ylim=c(0, 1),
     xlab="mass", ylab="relative intensity",
     main="tryptic digested insulin - isotopic distribution")

## loop through peptides
for (i in seq_along(cleavedInsulin)) {
  ## count C, H, N, O, S atoms in current peptide
  atoms <- BRAIN::getAtomsFromSeq(cleavedInsulin[[i]])
  ## calculate isotopic distribution
  d <- useBRAIN(atoms)
  ## draw peaks
  lines(d$masses, d$isoDistr, type="h", col=2)
}
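As a rough cross-check of where the peaks in such a plot should start, the monoisotopic mass of a peptide can be estimated in base R by summing monoisotopic residue masses and adding one water. This sketch is independent of BRAIN and cleaver; `residueMass` and `monoMass` are illustrative names, the masses are rounded to five decimals, and modifications and charge states are ignored.

```r
## monoisotopic residue masses in Da (rounded)
residueMass <- c(
  G=57.02146, A=71.03711, S=87.03203, P=97.05276, V=99.06841,
  T=101.04768, C=103.00919, L=113.08406, I=113.08406, N=114.04293,
  D=115.02694, Q=128.05858, K=128.09496, E=129.04259, M=131.04049,
  H=137.05891, F=147.06841, R=156.10111, Y=163.06333, W=186.07931
)

## monoisotopic mass of an unmodified, uncharged peptide
monoMass <- function(x) {
  aa <- strsplit(x, "", fixed=TRUE)[[1]]
  ## only the 20 standard amino acids are supported in this sketch
  stopifnot(all(aa %in% names(residueMass)))
  ## residues plus one water (H at the N-terminus, OH at the C-terminus)
  sum(residueMass[aa]) + 18.01056
}

monoMass("LAAGK")
```

Applied to each element of a character vector of peptides, e.g. via vapply(peptides, monoMass, numeric(1)), this gives the approximate position of the first isotopic peak of every peptide.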

Session Information

## R version 4.4.3 (2025-02-28)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] BRAIN_1.53.0        lattice_0.22-6      PolynomF_2.0-8     
##  [4] UniProt.ws_2.47.6   cleaver_1.45.0      Biostrings_2.75.4  
##  [7] GenomeInfoDb_1.43.4 XVector_0.47.2      IRanges_2.41.3     
## [10] S4Vectors_0.45.4    BiocGenerics_0.53.6 generics_0.1.3     
## [13] BiocStyle_2.35.0   
## 
## loaded via a namespace (and not attached):
##  [1] KEGGREST_1.47.0         xfun_0.51               bslib_0.9.0            
##  [4] httr2_1.1.2             AnVILBase_1.1.0         Biobase_2.67.0         
##  [7] vctrs_0.6.5             rjsoncons_1.3.2         tools_4.4.3            
## [10] curl_6.2.2              tibble_3.2.1            AnnotationDbi_1.69.0   
## [13] RSQLite_2.3.9           blob_1.2.4              pkgconfig_2.0.3        
## [16] BiocBaseUtils_1.9.0     dbplyr_2.5.0            lifecycle_1.0.4        
## [19] GenomeInfoDbData_1.2.13 compiler_4.4.3          progress_1.2.3         
## [22] htmltools_0.5.8.1       sys_3.4.3               buildtools_1.0.0       
## [25] sass_0.4.9              yaml_2.3.10             pillar_1.10.1          
## [28] crayon_1.5.3            jquerylib_0.1.4         cachem_1.1.0           
## [31] tidyselect_1.2.1        digest_0.6.37           dplyr_1.1.4            
## [34] maketools_1.3.2         grid_4.4.3              fastmap_1.2.0          
## [37] cli_3.6.4               magrittr_2.0.3          prettyunits_1.2.0      
## [40] filelock_1.0.3          UCSC.utils_1.3.1        rappdirs_0.3.3         
## [43] bit64_4.6.0-1           rmarkdown_2.29          httr_1.4.7             
## [46] bit_4.6.0               png_0.1-8               hms_1.1.3              
## [49] memoise_2.0.1           evaluate_1.0.3          knitr_1.50             
## [52] BiocFileCache_2.15.1    rlang_1.1.5             Rcpp_1.0.14            
## [55] glue_1.8.0              DBI_1.2.3               BiocManager_1.30.25    
## [58] jsonlite_1.9.1          R6_2.6.1

References

Carlson, Marc. n.d. UniProt.ws: R Interface to UniProt Web Services.

Claesen, Jürgen, Piotr Dittwald, Tomasz Burzykowski, and Dirk Valkenborg. 2012. “An Efficient Method to Calculate the Aggregated Isotopic Distribution and Exact Center-Masses.” Journal of The American Society for Mass Spectrometry 23 (4): 753–63.

Dittwald, Piotr, Jürgen Claesen, Tomasz Burzykowski, Dirk Valkenborg, and Anna Gambin. 2013. “BRAIN: A Universal Tool for High-Throughput Calculations of the Isotopic Distribution for Mass Spectrometry.” Analytical Chemistry 85 (4): 1991–94.

Gasteiger, Elisabeth, Christine Hoogland, Alexandre Gattiker, Séverine Duvaud, Marc R. Wilkins, Ron D. Appel, and Amos Bairoch. 2005. “Protein Identification and Analysis Tools on the ExPASy Server.” In The Proteomics Protocols Handbook, edited by John M. Walker, 571–607. Humana Press. https://doi.org/10.1385/1-59259-890-0:571.

Pages, H., P. Aboyoun, R. Gentleman, and S. DebRoy. n.d. Biostrings: String Objects Representing Biological Sequences, and Matching Algorithms.

The UniProt Consortium. 2012. “Reorganizing the Protein Space at the Universal Protein Resource (UniProt).” Nucleic Acids Research 40 (D1): D71–75. https://doi.org/10.1093/nar/gkr981.