--- title: "rSWeeP: Alignment-free method for vectorising biological sequences" shorttitle: "rSWeeP" author: - name: Camila Pereira Perico affiliation: Federal University of Paraná, Graduate Program in Bioinformatics, Curitiba, Paraná, Brazil. - name: Danrley Rafael Fernandes affiliation: Federal University of Paraná, Graduate in Biological Sciences, Curitiba, Paraná, Brazil. - name: Julia Formighieri Varaschin affiliation: Federal University of Paraná, Graduate Program in Bioinformatics, Curitiba, Paraná, Brazil. - name: Camilla Reginatto De Pierri affiliation: - Federal University of Paraná, Graduate Program in Bioinformatics, Curitiba, Paraná, Brazil. - Federal University of Paraná, Department of Biochemistry and Molecular Biology, Curitiba, Paraná, Brazil. - name: Mariane Gonçalves Kulik affiliation: Federal University of Paraná, Graduate Program in Bioinformatics, Curitiba, Paraná, Brazil. - name: Ricardo Assunção Vialle affiliation: Rush Alzheimer's Disease Center, Chicago, IL, US. - name: Roberto Tadeu Raittz affiliation: - Federal University of Paraná, Graduate Program in Bioinformatics, Curitiba, Paraná, Brazil. - Federal University of Paraná, Department of Biochemistry and Molecular Biology, Curitiba, Paraná, Brazil. output: BiocStyle::html_document package: rSWeeP abstract: | The rSWeeP package is an R implementation of the SWeeP method. The main function of this package is to provide a vector representation of biological sequences (nucleotides or amino acids), and thus favor alignment-free phylogenetic studies. Each sequence provided is represented by a compact numerical vector which is easier to analyze. The method is based on k-mers counting and random projection of high-dimensional vectors into low-dimensional vectors. In addition, the package allows general dimensionality reduction of RNAseq data and generic matrices. vignette: > %\VignetteIndexEntry{rSWeeP} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- # Overview The “Spaced Words Projection (SWeeP)” is a method for representing biological sequences in compact vectors. The sequences are scanned by a reading mask that corresponds to a given k-mer. This reading generates a high-dimensional vector (HDV) which, by random projection, is reduced to a low-dimensional vector (LDV), preserving the relative distances between the high-dimensional vectors (preserving the information). The SWeeP function is the original implementation of the method as in Pierri et al. (2019), which uses a random orthonormal basis generated by the `orthBase` function. The `SWeePlite` function is optimized for processing with larger masks (larger k-mer) and for larger volumes of data (long sequences or large numbers of sequences). These functions are suitable for making high quality comparisons between sequences allowing analyzes that are not possible due to the computational limitation of the traditional techniques. The MATLAB version of the method is available at [sWeeP](https://sourceforge.net/projects/spacedwordsprojection/) (PIERRI, 2019). This tool has it's main speed gain in constanci processing time. The response time grows linear to the number of inputs, while in other methods it grow is exponencial. Tutorials and more information are available at . ## Functions The package has six functions: - SWeeP: conventional SWeeP function, in which it is necessary to generate or load an orthonormal basis from a file. - orthBase: function to generate the orthonormal matrix to be used by SWeeP in the projection. - SWeePlite: SWeeP function with built-in orthonormal basis. Slower function but low RAM usage. - extractHDV: extracts only the High Dimensional Vector (HDV) from biological sequences. - PCCI: PhyloTaxonomic Consistency Cophenetic Index, estimates the grouping rate of the taxa in the phylogenetic tree. - PMPG: percentage of Monophyletic and Paraphyletic taxa in the phylogenetic tree. The SWeeP and SWeePlite functions are responsible for vectorizing the sequences using the referenced SWeeP method (PIERRI, 2019). # Quick Start The page provides basic tutorials on using and parameterizing the rSWeeP package. Below you can find a quickstart of the package. Consider a set of 13 mitochondrial proteomes (translated CDSs) deposited in the folder at the address `path` in FASTA format. These sequences can be vectorized as below: ```{r} library(rSWeeP) path = paste (system.file("examples/aaMitochondrial/",package = "rSWeeP"),'/', sep = '') sw = SWeePlite(path,seqtype='AA',mask=c(4),psz=1000) ``` In `sw$proj` you will find the SWeeP vectors in matrix format with 13 rows (each row a sample) and 1000 columns (coordinates). In `sw$info` other processing information is stored, which may be important in subsequent steps. ```{r} sw$info ``` - `$ProjectionSize` - length of the output vector (length of the projection) - `$mask` - mask used (provided in extended binary format) - `$SequenceType` - data type (amino acid or nucleotide) - `$concatenate` - if the files were concatenated into one - `$bin` - whether binary (TRUE) or count (FALSE) - `$version` - version of rSWeeP used - `$norm` - type of normalization used ('none', 'log' or 'logNeg') - `$timeElapsed` - time spent in processing - `$headers` - name of the vectorized files/samples We can obtain the phylogenetic relationship between the vectorized organisms using the Neighbour Joining (NJ) method. ```{r plot Tree1, fig.height=10,fig.width=10} library(ape) # get the distance matrix mdist = dist(sw$proj,method='euclidean') # use the NJ algorithm to build the tree tr = nj(mdist) # root the tree in the plant sample tr = root(tr,outgroup='14_Rhazya_stricta') # plot plot(tr) ``` To visualize the vectorized data graphically and evaluate the phylogenetic tree, we provide the metadata with the classes at the family taxonomic level. ```{r} pathmetadata <- system.file(package = "rSWeeP" , "examples" , "metadata_mitochondrial.csv") mt = read.csv(pathmetadata,header=TRUE) ``` We evaluated the phylogenetic tree in terms of taxon groupings with the PCCI metric, and checked the percentage of consistent taxa, mono or paraphyletic with the PMPG metric. ```{r eval tree} data = data.frame(sp=mt$fileName,family=mt$family) PCCI(tr,data) # PhyloTaxonomic Consistency Cophenetic Index PMPG(tr,data) # Percentage of Mono or Paraphyletic Groups ``` We obtain the PCA and visualize the first components. ```{r PCA, fig.height=7,fig.width=9} pca_output <- prcomp (sw$proj , scale = FALSE) par(mfrow=c(1,2)) plot(pca_output$x[,1],pca_output$x[,2],xlab = 'PC-1' , ylab = 'PC-2' , pch =20 , col = mt$id) legend("bottomright",unique(mt$family),col=as.character(c(1:length(unique(mt$family)))),pch=20) plot(pca_output$x[,3],pca_output$x[,4],xlab = 'PC-3' , ylab = 'PC-4' , pch =20 , col = mt$id) ``` # Session information ```{r label='Session information', eval=TRUE, echo=FALSE} sessionInfo() ``` # References - Pierri,C. R. *et al*. **SWeeP: Representing large biological sequences data sets in compact vectors**. Scientific Reports, accepted in December 2019.doi: 10.1038/s41598-019-55627-4.