The “Spaced Words Projection (SWeeP)” is a method for representing biological sequences in compact vectors. The sequences are scanned by a reading mask that corresponds to a given k-mer. This reading generates a high-dimensional vector (HDV) which, by random projection, is reduced to a low-dimensional vector (LDV), preserving the relative distances between the high-dimensional vectors (preserving the information).
The SWeeP function is the original implementation of the method as in
Pierri et al. (2019), which uses a random orthonormal basis generated by
the orthBase
function. The SWeePlite
function
is optimized for processing with larger masks (larger k-mer) and for
larger volumes of data (long sequences or large numbers of
sequences).
These functions are suitable for making high quality comparisons between sequences allowing analyzes that are not possible due to the computational limitation of the traditional techniques. The MATLAB version of the method is available at sWeeP (PIERRI, 2019). This tool has it’s main speed gain in constanci processing time. The response time grows linear to the number of inputs, while in other methods it grow is exponencial.
Tutorials and more information are available at https://aibialab.github.io/rSWeeP.
The package has six functions:
The SWeeP and SWeePlite functions are responsible for vectorizing the sequences using the referenced SWeeP method (PIERRI, 2019).
The page https://aibialab.github.io/rSWeeP provides basic tutorials on using and parameterizing the rSWeeP package. Below you can find a quickstart of the package.
Consider a set of 13 mitochondrial proteomes (translated CDSs)
deposited in the folder at the address path
in FASTA
format. These sequences can be vectorized as below:
## Loading required package: foreach
## Loading required package: doParallel
## Loading required package: iterators
## Loading required package: parallel
## Loading required package: Biostrings
## Loading required package: BiocGenerics
## Loading required package: generics
##
## Attaching package: 'generics'
## The following objects are masked from 'package:base':
##
## as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
## setequal, union
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
## as.data.frame, basename, cbind, colnames, dirname, do.call,
## duplicated, eval, evalq, get, grep, grepl, is.unsorted, lapply,
## mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
## rank, rbind, rownames, sapply, saveRDS, table, tapply, unique,
## unsplit, which.max, which.min
## Loading required package: S4Vectors
## Loading required package: stats4
##
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:utils':
##
## findMatches
## The following objects are masked from 'package:base':
##
## I, expand.grid, unname
## Loading required package: IRanges
## Loading required package: XVector
## Loading required package: GenomeInfoDb
##
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
##
## strsplit
path = paste (system.file("examples/aaMitochondrial/",package = "rSWeeP"),'/', sep = '')
sw = SWeePlite(path,seqtype='AA',mask=c(4),psz=1000)
## Starting projection. Please wait.
## starting sequence 1 of 14 - complete
## starting sequence 2 of 14 - complete
## starting sequence 3 of 14 - complete
## starting sequence 4 of 14 - complete
## starting sequence 5 of 14 - complete
## starting sequence 6 of 14 - complete
## starting sequence 7 of 14 - complete
## starting sequence 8 of 14 - complete
## starting sequence 9 of 14 - complete
## starting sequence 10 of 14 - complete
## starting sequence 11 of 14 - complete
## starting sequence 12 of 14 - complete
## starting sequence 13 of 14 - complete
## starting sequence 14 of 14 - complete
In sw$proj
you will find the SWeeP vectors in matrix
format with 13 rows (each row a sample) and 1000 columns (coordinates).
In sw$info
other processing information is stored, which
may be important in subsequent steps.
## $headers
## [1] "01_Pan_troglodytes" "02_Capra_aegagrus"
## [3] "03_Homo_sapiens" "04_Bos_taurus"
## [5] "05_Ara_ararauna" "06_Mus_musculus"
## [7] "07_Brotogeris_cyanoptera" "08_Homo_sapiens_neanderthalensis"
## [9] "09_Gazella_gazella" "10_Rattus_norvegicus"
## [11] "11_Pan_paniscus" "12_Psittacara_rubritorquis"
## [13] "13_Apodemus_sylvaticus" "14_Rhazya_stricta"
##
## $ProjectionSize
## [1] 1000
##
## $bin
## [1] "counting (FALSE)"
##
## $mask
## [1] 1 1 1 1
##
## $SequenceType
## [1] "AA"
##
## $extension
## [1] ""
##
## $version
## [1] '1.19.0'
##
## $norm
## [1] "none"
##
## $timeElapsed
## [1] 4.805
$ProjectionSize
- length of the output vector (length
of the projection)$mask
- mask used (provided in extended binary
format)$SequenceType
- data type (amino acid or
nucleotide)$concatenate
- if the files were concatenated into
one$bin
- whether binary (TRUE) or count (FALSE)$version
- version of rSWeeP used$norm
- type of normalization used (‘none’, ‘log’ or
‘logNeg’)$timeElapsed
- time spent in processing$headers
- name of the vectorized files/samplesWe can obtain the phylogenetic relationship between the vectorized organisms using the Neighbour Joining (NJ) method.
##
## Attaching package: 'ape'
## The following object is masked from 'package:Biostrings':
##
## complement
# get the distance matrix
mdist = dist(sw$proj,method='euclidean')
# use the NJ algorithm to build the tree
tr = nj(mdist)
# root the tree in the plant sample
tr = root(tr,outgroup='14_Rhazya_stricta')
# plot
plot(tr)
To visualize the vectorized data graphically and evaluate the phylogenetic tree, we provide the metadata with the classes at the family taxonomic level.
pathmetadata <- system.file(package = "rSWeeP" , "examples" , "metadata_mitochondrial.csv")
mt = read.csv(pathmetadata,header=TRUE)
We evaluated the phylogenetic tree in terms of taxon groupings with the PCCI metric, and checked the percentage of consistent taxa, mono or paraphyletic with the PMPG metric.
data = data.frame(sp=mt$fileName,family=mt$family)
PCCI(tr,data) # PhyloTaxonomic Consistency Cophenetic Index
## $tab
## taxa cost
## 1 Hominidae 1.0000000
## 2 Bovidae 1.0000000
## 3 Psittacidae 0.8888889
## 4 Muridae 1.0000000
## 5 Apocynaceae 1.0000000
##
## $mean
## [1] 0.9777778
## $tab
## taxa mono para
## 1 Hominidae TRUE FALSE
## 2 Bovidae TRUE FALSE
## 3 Psittacidae TRUE FALSE
## 4 Muridae TRUE FALSE
## 5 Apocynaceae TRUE FALSE
##
## $percMono
## [1] 1
##
## $percPara
## [1] 0
##
## $metric
## [1] 1
We obtain the PCA and visualize the first components.
pca_output <- prcomp (sw$proj , scale = FALSE)
par(mfrow=c(1,2))
plot(pca_output$x[,1],pca_output$x[,2],xlab = 'PC-1' , ylab = 'PC-2' , pch =20 , col = mt$id)
legend("bottomright",unique(mt$family),col=as.character(c(1:length(unique(mt$family)))),pch=20)
plot(pca_output$x[,3],pca_output$x[,4],xlab = 'PC-3' , ylab = 'PC-4' , pch =20 , col = mt$id)
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] ape_5.8 rSWeeP_1.19.0 Biostrings_2.75.1
## [4] GenomeInfoDb_1.43.1 XVector_0.47.0 IRanges_2.41.1
## [7] S4Vectors_0.45.2 BiocGenerics_0.53.3 generics_0.1.3
## [10] doParallel_1.0.17 iterators_1.0.14 foreach_1.5.2
## [13] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] jsonlite_1.8.9 crayon_1.5.3 compiler_4.4.2
## [4] BiocManager_1.30.25 Rcpp_1.0.13-1 jquerylib_0.1.4
## [7] yaml_2.3.10 fastmap_1.2.0 lattice_0.22-6
## [10] R6_2.5.1 knitr_1.49 maketools_1.3.1
## [13] GenomeInfoDbData_1.2.13 bslib_0.8.0 rlang_1.1.4
## [16] stringi_1.8.4 cachem_1.1.0 xfun_0.49
## [19] sass_0.4.9 sys_3.4.3 cli_3.6.3
## [22] zlibbioc_1.52.0 grid_4.4.2 digest_0.6.37
## [25] nlme_3.1-166 lifecycle_1.0.4 evaluate_1.0.1
## [28] codetools_0.2-20 buildtools_1.0.0 rmarkdown_2.29
## [31] httr_1.4.7 tools_4.4.2 htmltools_0.5.8.1
## [34] UCSC.utils_1.3.0