Title: | Rapid Comparison of Surface Protein Isoform Membrane Topologies Through surfaltr |
---|---|
Description: | Cell surface proteins form a major fraction of the druggable proteome and can be used for tissue-specific delivery of oligonucleotide/cell-based therapeutics. Alternatively spliced surface protein isoforms have been shown to differ in their subcellular localization and/or their transmembrane (TM) topology. Surface proteins are hydrophobic and remain difficult to study thereby necessitating the use of TM topology prediction methods such as TMHMM and Phobius. However, there exists a need for bioinformatic approaches to streamline batch processing of isoforms for comparing and visualizing topologies. To address this gap, we have developed an R package, surfaltr. It pairs inputted isoforms, either known alternatively spliced or novel, with their APPRIS annotated principal counterparts, predicts their TM topologies using TMHMM or Phobius, and generates a customizable graphical output. Further, surfaltr facilitates the prioritization of biologically diverse isoform pairs through the incorporation of three different ranking metrics and through protein alignment functions. Citations for programs mentioned here can be found in the vignette. |
Authors: | Pooja Gangras [aut, cre] , Aditi Merchant [aut] |
Maintainer: | Pooja Gangras <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.13.0 |
Built: | 2024-12-19 04:09:06 UTC |
Source: | https://github.com/bioc/surfaltr |
This function allows a user to specify genes of interest and subsequently receive a pdf of all the corresponding aligned human and mouse amino acid sequences. In order for this to work, transcripts for the same genes from both organisms need to be provided in separate files.
align_org_prts(gene_names, hs_data_file, mm_data_file, if_aa = FALSE, temp = FALSE)
gene_names |
Vector containing names of genes of interest (e.g. c("Crb1", "Adgrl1")) |
hs_data_file |
Path to the input file containing the human transcripts |
mm_data_file |
Path to the input file containing the mouse transcripts |
if_aa |
Boolean value indicating if the input file contains amino acid sequence. TRUE indicates that sequences are present and FALSE indicates that only IDs are present. |
temp |
Boolean indicating if the fasta file should be saved to the working directory or not |
Nothing is returned.
Although the function returns nothing, it saves pdfs containing the aligned sequences to the working directory under a file labeled with the gene name. It's also important to note that although the gene names will be standardized to be fully capitalized, this may not match with the case of the gene name for some organisms.
tmhmm_folder_name <- "~/TMHMM2.0c"
if (check_tmhmm_install(tmhmm_folder_name)) {
  align_org_prts(
    c("IGSF1"),
    system.file("extdata", "hpa_example.csv", package = "surfaltr"),
    system.file("extdata", "hpa_mouse_example.csv", package = "surfaltr"),
    FALSE, TRUE
  )
}
This function allows a user to specify genes of interest and subsequently receive a pdf of all the corresponding aligned amino acid sequences.
align_prts(gene_names, data_file, if_aa = FALSE, organism = "human", temp = FALSE)
gene_names |
Vector containing names of genes of interest (e.g. c("Crb1", "Adgrl1")) |
data_file |
Path to the input file |
if_aa |
Boolean value indicating if the input file contains amino acid sequence. TRUE indicates that sequences are present and FALSE indicates that only IDs are present. |
organism |
String indicating if the transcripts are from a human or a mouse |
temp |
Boolean indicating if the fasta file should be saved to the working directory or not |
Nothing is returned.
Although the function returns nothing, it saves pdfs containing the aligned sequences to the working directory under a file labeled with the gene name.
tmhmm_folder_name <- "~/TMHMM2.0c"
if (check_tmhmm_install(tmhmm_folder_name)) {
  align_prts(
    c("Crb1"),
    system.file("extdata", "crb1_example.csv", package = "surfaltr"),
    TRUE, "mouse", TRUE
  )
}
This function checks to make sure that TMHMM is installed correctly at the file path specified by the user. If TMHMM is not installed correctly, then the function will output an error message telling the user to check their installation.
check_tmhmm_install(tmhmm_folder_name)
tmhmm_folder_name |
Full path to folder containing installed TMHMM 2.0 software. This value should end in TMHMM2.0c |
A Boolean stating whether TMHMM is installed correctly: TRUE if TMHMM 2.0 is located at the path specified and FALSE if it is not.
This function also prints a helpful message providing tips on how to fix the installation if TMHMM is not found at the folder path specified.
tmhmm_folder_name <- "~/TMHMM2.0c"
install_correct <- check_tmhmm_install(tmhmm_folder_name)
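The path requirements described for tmhmm_folder_name can be pre-checked with base R before calling the package. The sketch below is only an illustration: looks_like_tmhmm_folder is a hypothetical helper, not part of surfaltr, and it tests only that the path ends in TMHMM2.0c and exists on disk. It does not verify that TMHMM itself runs, which is what check_tmhmm_install is documented to report.

```r
# Hypothetical helper (not part of surfaltr): rough pre-check of the folder path.
looks_like_tmhmm_folder <- function(tmhmm_folder_name) {
  path <- path.expand(tmhmm_folder_name)
  # The documentation states that the path should end in "TMHMM2.0c".
  ends_correctly <- basename(path) == "TMHMM2.0c"
  # The folder must also exist on disk.
  ends_correctly && dir.exists(path)
}

# Demonstration with a throwaway folder under tempdir().
demo_root <- file.path(tempdir(), "tmhmm_demo")
demo_folder <- file.path(demo_root, "TMHMM2.0c")
dir.create(demo_folder, recursive = TRUE, showWarnings = FALSE)

ok_result <- looks_like_tmhmm_folder(demo_folder)
bad_result <- looks_like_tmhmm_folder(file.path(demo_root, "missing", "TMHMM2.0c"))
wrong_name <- looks_like_tmhmm_folder(demo_root)
```

A check like this can give a quicker, more specific hint (wrong folder name versus missing folder) before the package-level check is run.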
This function cleans and formats input data. Cleaning and formatting involve removing any non-protein-coding transcripts, removing any principal transcripts, and standardizing all column names. If the sequence is provided directly, the function also extracts the APPRIS annotation and UniProt IDs of each transcript from Ensembl. Provided data can follow one of two formats: the first contains only transcript IDs and gene names, and the second contains a unique transcript identifier, gene names, and amino acid sequences. The function returns a data frame containing the transcript IDs, gene names, and APPRIS annotation for each inputted transcript. If the amino acid sequence is included in the input data, it is also included in the data frame. If only gene names and transcript IDs are provided, UniProt IDs are included in the data frame.
clean_data(data_file, if_aa, organism)
data_file |
Path to the input file |
if_aa |
Boolean value indicating if the input file contains amino acid sequences with TRUE indicating that sequences are present and FALSE indicating that only IDs are present |
organism |
String indicating if the transcripts are from a human or a mouse |
A data frame containing gene names, transcript IDs, and APPRIS annotations for the given data. If sequences were provided, the data frame will also contain amino acid sequences. If only IDs were provided, the data frame will also contain the UniProt Swissprot ID, UniProt Swissprot isoform ID, and UniProt TREMBL ID.
For the amino acid input, we have utilized supplementary data 1 from Ray et al. 2020. This data includes novel isoforms expressed in mouse retina that were identified by long-read sequencing and further validated by cell surface proteomics approaches. The data has been formatted to be compatible with the package.
Crb1
A data frame with 36 rows and 3 variables:
Gene name corresponding to amino acid sequence
transcript ID corresponding to amino acid sequence
amino acid sequence of transcript
...
https://www.nature.com/articles/s41467-020-17009-7#Sec45
This function retrieves all the primary transcripts in the given organism and their corresponding gene names, APPRIS annotations, and UniProt IDs.
ensembl_db_retrieval(organism)
organism |
String indicating if mouse or human transcripts should be retrieved |
A data frame containing the gene names, transcript IDs, APPRIS annotations, UniProt Swissprot IDs, UniProt Swissprot isoform IDs, and UniProt TREMBL IDs for all the primary transcripts in an organism.
Modify format of data to display all primary and alternative transcripts from the same gene together and remove any duplicates.
format_ids(final_pairs)
final_pairs |
Data frame containing original row-wise pairings of primary and alternative transcripts for inputted data without associated sequences |
A data frame containing the gene names, transcript IDs, APPRIS annotations, UniProt Swissprot IDs, UniProt Swissprot isoform IDs, and UniProt TREMBL IDs for all the given and associated primary transcripts in an alternating fashion
This function creates a fasta file with the transcript ID followed by the amino acid sequence for all inputted and associated primary transcripts. The file is organized so that all transcripts from a gene are next to each other. The function also returns a final table containing the gene names, transcript IDs, APPRIS annotations, and amino acid sequences for each transcript
get_aas(final_pairs, temp = FALSE)
final_pairs |
A data frame containing gene names, transcript IDs, amino acid sequences, and APPRIS annotations for all inputted data and its corresponding primary transcripts. |
temp |
Boolean indicating if the fasta file should be deleted after the function finishes running or not. Recommended to always be set to FALSE. |
A data frame containing the gene names, transcript IDs, APPRIS annotations, and protein sequences for each transcript.
This function also creates a fasta file containing the transcript IDs and associated amino acid sequences in the root directory.
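The fasta layout described here (transcript ID header followed by the amino acid sequence, with transcripts from the same gene kept adjacent) can be illustrated in base R. This is a minimal sketch under stated assumptions: the column names gene, transcript_id and sequence, the toy values, and the output file name are all hypothetical and are not taken from surfaltr's internals.

```r
# Hypothetical table; column names and values are placeholders for illustration.
transcripts <- data.frame(
  gene = c("GENE2", "GENE1", "GENE2", "GENE1"),
  transcript_id = c("T3", "T1", "T4", "T2"),
  sequence = c("MKV", "MAL", "MKVL", "MA"),
  stringsAsFactors = FALSE
)

# Sort so that all transcripts from one gene sit next to each other.
transcripts <- transcripts[order(transcripts$gene, transcripts$transcript_id), ]

# Interleave ">ID" header lines with their sequences (rbind fills column-wise).
fasta_lines <- as.vector(rbind(
  paste0(">", transcripts$transcript_id),
  transcripts$sequence
))

# Write to a temporary location rather than the working directory.
fasta_path <- file.path(tempdir(), "AA_sketch.fasta")
writeLines(fasta_lines, fasta_path)
```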
This function processes the input data to retrieve information from ensembl and uniprot to generate a dataframe containing the gene names, transcript IDs, APPRIS annotations, and protein sequences for each pair of primary and alternative transcripts. Additionally, this function creates a fasta file with the transcript ID followed by the amino acid sequence for all inputted and associated primary transcripts. The file is organized so that all transcripts from a gene are next to each other. Finally, the function also produces a final table in csv form containing the gene names, transcript IDs, APPRIS annotations, and amino acid sequences for each transcript
get_pairs(data_file, if_aa = FALSE, organism = "human", temp = FALSE)
data_file |
Path to the input file |
if_aa |
Boolean value indicating if the input file contains amino acid sequences with TRUE indicating that sequences are present and FALSE indicating that only IDs are present |
organism |
String indicating if the transcripts are from a human or a mouse |
temp |
Boolean indicating if the fasta file should be deleted after the function finishes running or not. Recommended to always be set to FALSE. |
A data frame containing the gene names, transcript IDs, APPRIS annotations, and protein sequences for each pair of primary and alternative transcripts.
This function also creates a fasta file containing the transcript IDs and associated amino acid sequences in the root directory. In addition to the fasta file, a csv file containing the returned dataframe is saved to the working directory.
tmhmm_folder_name <- "~/TMHMM2.0c"
if (check_tmhmm_install(tmhmm_folder_name)) {
  currwd <- getwd()
  AA_seq <- get_pairs(
    system.file("extdata", "crb1_example.csv", package = "surfaltr"),
    TRUE, "mouse", TRUE
  )
  setwd(currwd)
}
Phobius web server is a combined transmembrane topology and signal peptide (N-sp) predictor. Currently only "normal prediction" of signal peptides is supported by the function.
get_phobius(data, ...)

## S3 method for class 'character'
get_phobius(data, progress = FALSE, ...)

## S3 method for class 'data.frame'
get_phobius(data, sequence, id, ...)

## S3 method for class 'list'
get_phobius(data, ...)

## Default S3 method:
get_phobius(data = NULL, sequence, id, ...)
data |
A data frame with protein amino acid sequences as strings in one column and corresponding id's in another. Alternatively a path to a .fasta file with protein sequences. Alternatively a list with elements of class "SeqFastaAA" resulting from |
... |
Currently no additional arguments are accepted apart from the ones documented below. |
progress |
Boolean, whether to show the progress bar, at default set to FALSE. |
sequence |
A vector of strings representing protein amino acid sequences, or the appropriate column name if a data.frame is supplied to data argument. If .fasta file path, or list with elements of class "SeqFastaAA" provided to data, this should be left blank. |
id |
A vector of strings representing protein identifiers, or the appropriate column name if a data.frame is supplied to data argument. If .fasta file path, or list with elements of class "SeqFastaAA" provided to data, this should be left blank. |
The topology (prediction column of the result) is given as the position of the transmembrane helices separated by 'i' if the loop is on the cytoplasmic or 'o' if it is on the non-cytoplasmic side. A signal peptide is given by the position of its h-region separated by a n and a c, and the position of the last amino acid in the signal peptide and the first of the mature protein separated by a /.
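To make the topology notation concrete, the following base R sketch pulls the pieces out of a prediction string. It is an illustration only: parse_phobius_topology is a hypothetical helper that is not part of this package, and the example strings are made up to follow the notation described above rather than taken from real Phobius output.

```r
# Hypothetical helper (not part of the package): parse a topology string.
parse_phobius_topology <- function(prediction) {
  # Signal peptide part: n<h start>-<h end>c<last SP residue>/<first mature residue>
  sp_pattern <- "^n[0-9]+-[0-9]+c[0-9]+/[0-9]+"
  sp <- regmatches(prediction, regexpr(sp_pattern, prediction))
  has_sp <- length(sp) == 1
  mature_start <- if (has_sp) as.integer(sub(".*/", "", sp)) else NA_integer_

  # Drop the signal peptide part, then collect the start-end helix ranges.
  body <- sub(sp_pattern, "", prediction)
  ranges <- regmatches(body, gregexpr("[0-9]+-[0-9]+", body))[[1]]
  bounds <- strsplit(ranges, "-", fixed = TRUE)
  helices <- data.frame(
    start = as.integer(vapply(bounds, `[`, character(1), 1)),
    end = as.integer(vapply(bounds, `[`, character(1), 2))
  )

  list(
    has_signal_peptide = has_sp,
    mature_start = mature_start,
    n_tm = nrow(helices),
    helices = helices
  )
}

# Made-up strings that follow the documented notation.
with_sp <- parse_phobius_topology("n4-15c20/21o40-62i75-97o")
without_sp <- parse_phobius_topology("i12-34o")
```

In the first string the signal peptide ends at residue 20, the mature protein starts at residue 21, and two helices follow; the second string has one helix and no signal peptide.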
A data frame with columns:
Character, name of the submitted sequence.
Integer, the number of predicted transmembrane segments.
Character, Y/0 indicator if a signal peptide was predicted or not.
Character string, predicted topology of the protein.
Integer, first amino acid after removal of the signal peptide
Logical, did Phobius predict the presence of a signal peptide
This function creates temporary files in the working directory.
Kall O. Krogh A. Sonnhammer E. L. L. (2004) A Combined Transmembrane Topology and Signal Peptide Prediction Method. Journal of Molecular Biology 338(5): 1027-1036
This function creates a fasta file with the transcript ID followed by the amino acid sequence for all given alternative transcripts and associated primary transcripts. The file is organized so that all transcripts from a gene are next to each other. The function also returns a final table containing the gene names, transcript IDs, APPRIS annotations, and amino acid sequences for each transcript
get_prts(aa_trans, temp = FALSE)
aa_trans |
A data frame containing the gene names, transcript IDs, APPRIS annotations, UniProt Swissprot IDs, UniProt Swissprot isoform IDs, and UniProt TREMBL IDs for all transcripts. |
temp |
Boolean indicating if the fasta file should be deleted after the function finishes running or not. Recommended to always be set to FALSE. |
A data frame containing the gene names, transcript IDs, APPRIS annotations, UniProt IDs, and protein sequences for each transcript.
This function also creates a fasta file containing the transcript IDs and associated amino acid sequences in the root directory.
This function creates a data frame with columns containing transcript IDs and corresponding output from TMHMM. The TMHMM output includes a location for each amino acid, with O and o representing extracellular, M representing transmembrane, and i representing intracellular.
get_tmhmm(fasta_file_name, tmhmm_folder_name)
fasta_file_name |
Name of .fasta file containing amino acid sequences |
tmhmm_folder_name |
Full path to folder containing installed TMHMM 2.0 software. This path should end in TMHMM2.0c |
A data frame containing each transcript ID and the corresponding membrane location for each amino acid in its sequence formatted as a string
In order for this function to work, there needs to be a .fasta file called "AA.fasta", containing the amino acid sequences for each transcript, saved to a folder called output within the working directory. Additionally, the function saves a copy of the returned data frame in csv format to the output folder in the working directory.
tmhmm_folder_name <- "~/TMHMM2.0c"
if (check_tmhmm_install(tmhmm_folder_name)) {
  AA_seq <- get_pairs(
    system.file("extdata", "crb1_example.csv", package = "surfaltr"),
    TRUE, "mouse", TRUE
  )
  topo <- get_tmhmm("AA.fasta", tmhmm_folder_name)
}
This function creates a ggplot figure showing the differences in membrane location and length between primary and alternative transcripts from the same gene. This process is performed based on input data containing the gene names and amino acid sequences of the proteins in question. Transcripts derived from the same gene are grouped together to facilitate easy interpretation. The y axis lists the gene name and transcript ID for each transcript and the x axis lists the length in amino acids. Each fill color corresponds to a membrane location and either principal or alternative isoform.
graph_from_aas(data_file, organism = "human", rank = "length", n_prts = 20, mode = "phobius", size_txt = 2, space_left = -400, temp = FALSE, tmhmm_folder_name = NULL)
data_file |
Path to the input file |
organism |
String indicating if the transcripts are from a human or a mouse |
rank |
String indicating which method to use to rank proteins in graphical output. Options include "length", "TM", and "combo". |
n_prts |
Integer value indicating the number of genes that should be displayed on the graphical output. Default value is 20. |
mode |
String detailing whether TMHMM or Phobius should be used to predict transmembrane regions. Input values include "phobius" or "tmhmm". |
size_txt |
Integer value specifying the size of the row labels. Default size is 2. |
space_left |
Integer value specifying how far left the graph should extend. |
temp |
Boolean indicating if the fasta file should be deleted after the function finishes running or not. Recommended to always be set to FALSE. |
tmhmm_folder_name |
Full path to folder containing installed TMHMM 2.0 software. This value should end in TMHMM2.0c and needs to be provided if the mode used is TMHMM. |
A ggplot figure showing the membrane location of each part of the surface protein for each alternative and primary transcript.
tmhmm_folder_name <- "~/TMHMM2.0c"
if (check_tmhmm_install(tmhmm_folder_name)) {
  graph_from_aas(
    system.file("extdata", "crb1_example.csv", package = "surfaltr"),
    "mouse", "combo", 1, "tmhmm", 4, -300, TRUE
  )
}
This function creates a ggplot figure showing the differences in membrane location and length between primary and alternative transcripts from the same gene. This process is performed based on input data containing the gene names and transcript IDs of the proteins in question. Transcripts derived from the same gene are grouped together to facilitate easy interpretation. The y axis lists the gene name and transcript ID for each transcript and the x axis lists the length in amino acids. Each fill color corresponds to a membrane location and either principal or alternative isoform.
graph_from_ids(data_file, organism = "human", rank = "length", n_prts = 20, mode = "phobius", size_txt = 2, space_left = -400, temp = FALSE, tmhmm_folder_name = NULL)
data_file |
Path to the input file |
organism |
String indicating if the transcripts are from a human or a mouse |
rank |
String indicating which method to use to rank proteins in graphical output. Options include "length", "TM", and "combo". |
n_prts |
Integer value indicating the number of genes that should be displayed on the graphical output. Default value is 20. |
mode |
String detailing whether TMHMM or Phobius should be used to predict transmembrane regions. Input values include "phobius" or "tmhmm". |
size_txt |
Integer value specifying the size of the row labels. Default size is 2. |
space_left |
Integer value specifying how far left the graph should extend. |
temp |
Boolean indicating if the fasta file should be deleted after the function finishes running or not. Recommended to always be set to FALSE. |
tmhmm_folder_name |
Full path to folder containing installed TMHMM 2.0 software. This value should end in TMHMM2.0c and needs to be provided if the mode used is TMHMM. |
A ggplot figure showing the membrane location of each part of the surface protein for each alternative and primary transcript.
tmhmm_folder_name <- "~/TMHMM2.0c"
if (check_tmhmm_install(tmhmm_folder_name)) {
  graph_from_ids(
    system.file("extdata", "hpa_example.csv", package = "surfaltr"),
    "human", "length", 1, "tmhmm", 5, -300, TRUE
  )
}
This function creates a ggplot figure showing the differences in membrane location and length between primary and alternative transcripts from the same gene. Transcripts derived from the same gene are grouped together to facilitate easy interpretation. The y axis lists the gene name and transcript ID for each transcript and the x axis lists the length in amino acids. Each fill color corresponds to a membrane location and either principal or alternative isoform.
graph_prots(counts, rank = "length", n_prts = 20, size_txt = 2, space_left = -400)
counts |
A data frame containing the overall length and individual lengths of each section of the surface protein corresponding to a certain transcript. |
rank |
String indicating which method to use to rank proteins in graphical output. Options include "length", "TM", and "combo". |
n_prts |
Integer value indicating the number of genes that should be displayed on the graphical output. Default value is 20. |
size_txt |
Integer value specifying the size of the row labels. Default size is 2. |
space_left |
Integer value specifying how far left the graph should extend. |
A ggplot figure showing the membrane location of each part of the surface protein for each alternative and primary transcript.
For the gene name and transcript ID input, we have included 10 unique human transcripts from 7 different genes annotated as alternative by APPRIS. These genes were derived from supplementary data 12 from Uhlén et al. 2015. This data has been formatted to be compatible with the package.
hpa_genes
A data frame with 10 rows and 2 variables:
Gene name corresponding to transcript ID
transcript ID of gene of interest
...
https://science.sciencemag.org/content/347/6220/1260419/tab-figures-data
For the gene name and transcript ID input, we have included 5 unique mouse transcripts from 5 different genes annotated as alternative by APPRIS. These genes were derived from supplementary data 12 from Uhlén et al. 2015. This data has been formatted to be compatible with the package and to match the genes in the HPA human gene dataset.
hpa_mouse_genes
A data frame with 5 rows and 2 variables:
Gene name corresponding to transcript ID
transcript ID of gene of interest
...
https://science.sciencemag.org/content/347/6220/1260419/tab-figures-data
This function matches each inputted transcript with its corresponding primary transcripts and returns a data frame containing the gene name, transcript ID and APPRIS annotation for each.
merge_trans(princ, final_trans, if_aa)
princ |
Data frame containing all primary transcripts and relevant gene information for an organism |
final_trans |
Data frame containing cleaned and formatted input data |
if_aa |
Boolean value indicating if the input file contains amino acid sequences with TRUE indicating that sequences are present and FALSE indicating that only IDs are present |
A data frame containing gene names, transcript IDs, and APPRIS annotations for all inputted data and its corresponding primary transcripts. If sequences were provided, the data frame will also contain the amino acid sequences. If only IDs were provided, the data frame will also contain the UniProt Swissprot ID, UniProt Swissprot isoform ID, and UniProt TREMBL ID for both the inputted data and the primary transcripts.
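The matching step described for this function, pairing each inputted transcript with the primary transcript of the same gene, resembles a join on the gene name. The sketch below shows that idea with base R's merge. It is an assumption-laden illustration, not the package's implementation: the data frames, column names and IDs are hypothetical.

```r
# Hypothetical table of primary (principal) transcripts, one per gene.
principal <- data.frame(
  gene = c("GENE1", "GENE2", "GENE3"),
  principal_id = c("P1", "P2", "P3"),
  stringsAsFactors = FALSE
)

# Hypothetical cleaned input: alternative transcripts supplied by the user.
inputted <- data.frame(
  gene = c("GENE1", "GENE1", "GENE3"),
  transcript_id = c("A1", "A2", "A3"),
  stringsAsFactors = FALSE
)

# Inner join on gene: every inputted transcript gets its gene's primary
# transcript; genes with no inputted transcript (GENE2) drop out.
pairs <- merge(inputted, principal, by = "gene")
```

Each row of pairs is then one alternative/primary pairing, which is the row-wise form that later functions such as format_ids are documented to rearrange.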
This function creates a ggplot figure showing the differences in membrane location and length between primary and alternative transcripts from the same gene. Transcripts derived from the same gene are grouped together to facilitate easy interpretation. The y axis lists the gene name and transcript ID for each transcript and the x axis lists the length in amino acids. Each fill color corresponds to a membrane location and either principal or alternative isoform.
plot_isoforms(topo, AA_seq, rank = "length", n_prts = 20, size_txt = 2, space_left = -400)
topo |
Outputted data frame from the run_phobius or get_tmhmm function showing membrane locations of amino acids and transcript IDs |
AA_seq |
A data frame outputted by the get_pairs function containing the gene names, transcript IDs, APPRIS annotations, and protein sequences for each transcript. |
rank |
String indicating which method to use to rank proteins in graphical output. Options include "length", "TM", and "combo". |
n_prts |
Integer value indicating the number of genes that should be displayed on the graphical output. Default value is 20. |
size_txt |
Integer value specifying the size of the row labels. Default size is 2. |
space_left |
Integer value specifying how far left the graph should extend. |
A ggplot figure showing the membrane location of each part of the surface protein for each alternative and primary transcript.
tmhmm_folder_name <- "~/TMHMM2.0c"
if (check_tmhmm_install(tmhmm_folder_name)) {
  currwd <- getwd()
  AA_seq <- get_pairs(
    system.file("extdata", "crb1_example.csv", package = "surfaltr"),
    TRUE, "mouse", TRUE
  )
  topo <- run_phobius(AA_seq, paste(getwd(), "/AA.fasta", sep = ""))
  plot_isoforms(topo, AA_seq, "combo", 15, 3, -400)
  setwd(currwd)
}
This function creates a data frame with columns containing transcript IDs and corresponding output from tmhmm. The tmhmm output includes a location for each amino acid, with O and o representing extracellular, M representing transmembrane, and i representing intracellular. The data frame includes columns with the transcript ID, membrane location, gene name, starting amino acid, and ending amino acid for a certain transcript. The first row for each transcript contains the overall length of the amino acid sequence.
process_tmhmm(topo, AA_seq)
topo |
A data frame containing each transcript ID and the corresponding membrane location for each amino acid in its sequence formatted as a string. |
AA_seq |
A data frame containing the gene names, transcript IDs, APPRIS annotations, and protein sequences for each transcript. |
A data frame containing the overall length and individual lengths of each section of the surface protein corresponding to a certain transcript.
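Turning a per-residue location string into sections with start and end positions is essentially a run-length encoding. The following base R sketch shows one way to do that with rle. It is not the package's code: location_to_segments is a hypothetical helper, and the input string is a made-up example using the o, M and i codes described above.

```r
# Hypothetical helper (not part of surfaltr): collapse a per-residue
# location string into contiguous segments.
location_to_segments <- function(location_string) {
  residues <- strsplit(location_string, "")[[1]]
  runs <- rle(residues)
  ends <- cumsum(runs$lengths)
  starts <- ends - runs$lengths + 1
  data.frame(
    location = runs$values,
    start = starts,
    end = ends,
    length = runs$lengths,
    stringsAsFactors = FALSE
  )
}

# Made-up 15-residue example: 5 extracellular, 6 transmembrane, 4 intracellular.
segments <- location_to_segments("oooooMMMMMMiiii")
overall_length <- sum(segments$length)
```

The overall length and the per-section lengths obtained this way correspond to the kind of information the returned data frame is documented to hold.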
This function creates a data frame containing the primary and alternative transcripts of each gene ranked by how different the resultant surface proteins are. Transcripts can be ranked by length, number of transmembrane domains, or a combo metric that multiplies the difference in length by the number of transmembrane domains and ranks accordingly. This function can also be set to restrict the number of genes that are returned to the user to show only the most significant gene transcripts.
rank_prts(counts, rank, n_prts)
counts |
A data frame containing the overall length and individual lengths of each section of the surface protein corresponding to a certain transcript. |
rank |
String indicating which method to use to rank proteins in graphical output. Options include "length", "TM", and "combo". |
n_prts |
Integer value indicating the number of genes that should be displayed on the graphical output. Default value is 20. |
A data frame containing the overall length and individual lengths of each section of the surface protein corresponding to a certain transcript ranked by how different the primary and alternative transcripts are functionally.
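The three ranking options can be illustrated on a toy table. This sketch rests on several assumptions that should be read as such: the column names and values are hypothetical, rank_pairs is not a surfaltr function, and the combo score here is taken as the absolute length difference multiplied by the absolute difference in transmembrane domain count, which is only one possible reading of the description above.

```r
# Hypothetical per-gene summary of a primary/alternative pair.
pairs <- data.frame(
  gene = c("GENE1", "GENE2", "GENE3"),
  principal_length = c(500, 300, 800),
  alternative_length = c(450, 100, 790),
  principal_tm = c(1, 7, 2),
  alternative_tm = c(1, 3, 0),
  stringsAsFactors = FALSE
)

# Difference metrics (assumed definitions, see the lead-in).
pairs$length_diff <- abs(pairs$principal_length - pairs$alternative_length)
pairs$tm_diff <- abs(pairs$principal_tm - pairs$alternative_tm)
pairs$combo <- pairs$length_diff * pairs$tm_diff

# Hypothetical ranking helper: sort by the chosen score, keep the top n_prts.
rank_pairs <- function(pairs, rank = "length", n_prts = 20) {
  score <- switch(rank,
    length = pairs$length_diff,
    TM = pairs$tm_diff,
    combo = pairs$combo,
    stop("rank must be \"length\", \"TM\", or \"combo\"")
  )
  head(pairs[order(score, decreasing = TRUE), ], n_prts)
}

top_combo <- rank_pairs(pairs, "combo", 2)
top_length <- rank_pairs(pairs, "length", 1)
```

With these toy numbers GENE1 differs in length but not in transmembrane count, so it scores zero on the combo metric and falls out of the top two, while it would still rank second by length alone.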
This function creates a data frame with columns containing transcript IDs and corresponding output from Phobius. The Phobius output includes a location for each amino acid, with O representing extracellular, M representing transmembrane, S representing signal, and i representing intracellular.
run_phobius(AA_seq, fasta_file_name)
AA_seq |
A data frame outputted by the get_pairs function containing the gene names, transcript IDs, APPRIS annotations, and protein sequences for each transcript. |
fasta_file_name |
Path to fasta file containing amino acid sequences |
A data frame containing each transcript ID and the corresponding membrane location for each amino acid in its sequence formatted as a string
In order for this function to work, there needs to be a .fasta file called "AA.fasta", containing the amino acid sequences for each transcript, saved to the working directory. Additionally, the function saves a copy of the returned data frame in csv format to the working directory.
tmhmm_folder_name <- "~/TMHMM2.0c"
if (check_tmhmm_install(tmhmm_folder_name)) {
  currwd <- getwd()
  AA_seq <- get_pairs(
    system.file("extdata", "crb1_example.csv", package = "surfaltr"),
    TRUE, "mouse", TRUE
  )
  topo <- run_phobius(AA_seq, paste(getwd(), "/AA.fasta", sep = ""))
  setwd(currwd)
}
The function splits a fasta formatted file to a defined number of smaller .fasta files for further processing.
split_fasta(path_in, path_out, num_seq = 20000, trim = FALSE, trunc = NULL, id = FALSE)
path_in |
A path to the .FASTA formatted file that is to be processed. |
path_out |
A path where the resulting .FASTA formatted files should be stored. The path should also contain the prefix name of the fasta files on which _n (integer from 1 to number of fasta files generated) will be appended along with the extension ".fa" |
num_seq |
Integer defining the number of sequences to be in each resulting .fasta file. Defaults to 20000. |
trim |
Logical, should the sequences be trimmed to 4000 amino acids to bypass the CBS server restrictions. Defaults to FALSE. |
trunc |
Integer, truncate the sequences to this length. First 1:trunc amino acids will be kept. |
id |
Logical, should the protein id's be returned. Defaults to FALSE. |
If id = FALSE, a character vector of the paths to the resulting .FASTA formatted files.
If id = TRUE, a list with two elements:
Character, protein identifiers.
Character, paths to the resulting .FASTA formatted files.
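The chunking behaviour described for path_out and num_seq can be sketched in base R as follows. This is a simplified illustration, not the function's actual implementation: split_fasta_sketch is a hypothetical name, the trim, trunc and id options are left out, and the input file is a made-up five-record example written to tempdir().

```r
# Hypothetical simplified splitter (not the package function).
split_fasta_sketch <- function(path_in, path_out, num_seq = 20000) {
  lines <- readLines(path_in)
  # Number each record: a new record starts at every header line (">").
  record <- cumsum(startsWith(lines, ">"))
  # Assign records to chunks of at most num_seq sequences.
  chunk <- ceiling(record / num_seq)
  pieces <- split(lines, chunk)
  # Append _n and the ".fa" extension to the output prefix.
  out_paths <- paste0(path_out, "_", seq_along(pieces), ".fa")
  for (i in seq_along(pieces)) {
    writeLines(pieces[[i]], out_paths[i])
  }
  out_paths
}

# Made-up input with five records; the second sequence spans two lines.
fasta_in <- file.path(tempdir(), "split_demo.fasta")
writeLines(
  c(">s1", "MAAA", ">s2", "MKKK", "LLL", ">s3", "MDDD", ">s4", "MEEE", ">s5", "MFFF"),
  fasta_in
)

chunk_paths <- split_fasta_sketch(fasta_in, file.path(tempdir(), "split_demo"), num_seq = 2)
```

With num_seq = 2, five records produce three files, the last holding a single sequence; multi-line sequences stay with their header because chunks are assigned per record rather than per line.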
This function runs all of surfaltr's other functions on the CRB1 data set to ensure that the function output matches the expected output. An incorrect output or error indicates that something went wrong in installation.
test_surfaltr()
Nothing is returned.
If the results from the test match the expected results, a message stating that the test worked is printed. If not, the user is prompted to check the installation.
tmhmm_folder_name <- "~/TMHMM2.0c"
if (check_tmhmm_install(tmhmm_folder_name)) {
  test_surfaltr()
}
This function retrieves the raw data from tmhmm containing information about the membrane location of each amino acid in a transcript. In order to set a standard path that allows tmhmm to run, the path is set to match that of the fasta file containing the amino acids.
tmhmm_fix_path(fasta_filename, folder_name)
fasta_filename |
Parameter containing input fasta file to be run on tmhmm |
folder_name |
Path to folder containing installed tmhmm software |
Raw results from tmhmm containing membrane locations for each transcript
In order for this function to work, there needs to be a .fasta file containing the amino acid sequences for each transcript called "AA.fasta" saved to a folder called output within the working directory.