Transcription factors (TFs) are proteins that bind to cis-regulatory
elements in promoter regions of genes and regulate their expression.
Identifying them in a genome is useful for a variety of reasons, such as
exploring their evolutionary history across clades and inferring gene
regulatory networks. planttfhunter
allows users to identify plant TFs from whole-genome protein sequences
and classify them into families and subfamilies (when applicable) using
the classification scheme implemented in PlantTFDB. As planttfhunter
interoperates with core Bioconductor packages (i.e.,
AAStringSet
objects as input,
SummarizedExperiment
objects as output), it can be easily
incorporated in pipelines for TF identification and classification in
large-scale genomic data sets.
You can install planttfhunter with the following code:
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install("planttfhunter")
Loading package after installation:
In this vignette, we will use protein sequences of TFs from the algae
species Galdieria sulphuraria as an example, as its proteome is
very small. The proteome file was downloaded from the PLAZA Diatoms
database (Osuna-Cruz et al. 2020), and it
was filtered to keep only TFs for demonstration purposes. The object
gsu
stores the protein sequences in an
AAStringSet
object.
data(gsu)
gsu
#> AAStringSet object of length 290:
#> width seq names
#> [1] 383 MSVLLQTSRGDIVVDLFTDLAP...EKALFQEHRRSSYSKTSKRYK* gsu00340.1
#> [2] 301 MESMSQTKRWSCVIYQVDSILP...PNISLQQLLGDEDEISWSGLP* gsu04730.1
#> [3] 232 MEKGPCSHKWKQLHVGDYQHDN...PNISLQQLLGDEDEISWSGLP* gsu05000.1
#> [4] 340 MDVPSYRFECTPLVEVKKEARD...EVTQSSTRNCNDKEWNPNANE* gsu06140.1
#> [5] 120 MAPKERSKPKSRSSKAGLQFPV...SGGVLPNVHPNLLPKKKAKEE* gsu06160.1
#> ... ... ...
#> [286] 394 MSKLEAASDSGSLKTSSCSFQE...NETNSHDKPGEDRMNIQENTS* gsu98790.1
#> [287] 365 MITEVDNPMSFVHSEHFMNYSS...RRHSRLPVFQTLEEKSDIHSK* gsu99250.1
#> [288] 365 MITEVDNPMSFVHSEHFMNYSS...RRHSRLPVFQTLEEKSDIHSK* gsu99270.1
#> [289] 284 MTSFYIKKGITFSSIVYNHNYK...LFQRIIEINKNYNPLIQLQRI* AIG92462.1
#> [290] 219 MKYKLLVIDDELSIRQSLKKYL...TFTRSRTELVRYAIKNNLIIE* AIG92471.1
TF identification and classification is based on the presence of signature protein domains, which are identified using profile hidden Markov models (HMMs). The family classification scheme is the same as the one used by PlantTFDB (Jin et al. 2016), and it is summarized below: 1
Family | Subfamily | DBD | Auxiliary | Forbidden | |
---|---|---|---|---|---|
2 | AP2/ERF | AP2 | AP2 (>=2) (PF00847) | NA | NA |
3 | AP2/ERF | ERF | AP2 (1) (PF00847) | NA | NA |
4 | AP2/ERF | RAV | AP2 (PF00847) and B3 (PF02362) | NA | NA |
5 | B3 superfamily | ARF | B3 (PF02362) | Auxin_resp (PF06507) | NA |
6 | B3 superfamily | B3 | B3 (PF02362) | NA | NA |
7 | BBR-BPC | BBR-BPC | GAGA_bind (PF06217) | NA | NA |
8 | BES1 | BES1 | DUF822 (PF05687) | NA | NA |
9 | bHLH | bHLH | HLH (PF00010) | NA | NA |
10 | bZIP | bZIP | bZIP_1 (PF00170) | NA | NA |
11 | C2C2 | CO-like | zf-B_box (PF00643) | CCT (PF06203) | NA |
12 | C2C2 | Dof | Zf-Dof (PF02701) | NA | NA |
13 | C2C2 | GATA | GATA-zf (PF00320) | NA | NA |
14 | C2C2 | LSD | Zf-LSD1 (PF06943) | NA | Peptidase_C14 (PF00656) |
15 | C2C2 | YABBY | YABBY (PF04690) | NA | NA |
16 | C2H2 | C2H2 | zf-C2H2 (PF00096) | NA | RNase_T (PF00929) |
17 | C3H | C3H | Zf-CCCH (PF00642) | NA | RRM_1 (PF00076) or Helicase_C (PF00271) |
18 | CAMTA | CAMTA | CG1 (PF03859) | NA | NA |
19 | CPP | CPP | TCR (PF03638) | NA | NA |
20 | DBB | DBB | zf-B_box (>=2) (PF00643) | NA | NA |
21 | E2F/DP | E2F/DP | E2F_TDP (PF02319) | NA | NA |
22 | EIL | EIL | EIN3 (PF04873) | NA | NA |
23 | FAR1 | FAR1 | FAR1 (PF03101) | NA | NA |
24 | GARP | ARR-B | G2-like | Response_reg (PF00072) | NA |
25 | GARP | G2-like | G2-like | NA | NA |
26 | GeBP | GeBP | DUF573 (PF04504) | NA | NA |
27 | GRAS | GRAS | GRAS (PF03514) | NA | NA |
28 | GRF | GRF | WRC (PF08879) | QLQ (PF08880) | NA |
29 | HB | HD-ZIP | Homeobox (PF00046) | HD-ZIP_I/II or SMART (PF01852) | NA |
30 | HB | TALE | Homeobox (PF00046) | BELL or ELK (PF03789) | NA |
31 | HB | WOX | homeobox (PF00046) | Wus type homeobox | NA |
32 | HB | HB-PHD | homeobox (PF00046) | PHD (PF00628) | NA |
33 | HB | HB-other | homeobox (PF00046) | NA | NA |
34 | HRT-like | HRT-like | HRT-like | NA | NA |
35 | HSF | HSF | HSF_dna_bind (PF00447) | NA | NA |
36 | LBD (AS2/LOB) | LBD (AS2/LOB) | DUF260 (PF03195) | NA | NA |
37 | LFY | LFY | FLO_LFY (PF01698) | NA | NA |
38 | MADS | M_type | SRF-TF (PF00319) | NA | NA |
39 | MADS | MIKC | SRF-TF (PF00319) | K-box (PF01486) | NA |
40 | MYB superfamily | MYB | Myb_dna_bind (>=2) (PF00249) | NA | SWIRM (PF04433) |
41 | MYB superfamily | MYB_related | Myb_dna_bind (1) (PF00249) | NA | SWIRM (PF04433) |
42 | NAC | NAC | NAM (PF02365) | NA | NA |
43 | NF-X1 | NF-X1 | Zf-NF-X1 (PF01422) | NA | NA |
44 | NF-Y | NF-YA | CBFB_NFYA (PF02045) | NA | NA |
45 | NF-Y | NF-YB | NF-YB | NA | NA |
46 | NF-Y | NF-YC | NF-YC | NA | NA |
47 | Nin-like | Nin-like | RWP-RK (PF02042) | NA | NA |
48 | NZZ/SPL | NZZ/SPL | NOZZLE (PF08744) | NA | NA |
49 | S1Fa-like | S1Fa-like | S1FA (PF04689) | NA | NA |
50 | SAP | SAP | SAP | NA | NA |
51 | SBP | SBP | SBP (PF03110) | NA | NA |
52 | SRS | SRS | DUF702 (PF05142) | NA | NA |
53 | STAT | STAT | STAT | NA | NA |
54 | TCP | TCP | TCP (PF03634) | NA | NA |
55 | Trihelix | Trihelix | Trihelix | NA | NA |
56 | VOZ | VOZ | VOZ | NA | NA |
57 | Whirly | Whirly | Whirly (PF08536) | NA | NA |
58 | WRKY | WRKY | WRKY (PF03106) | NA | NA |
59 | ZF-HD | ZF-HD | ZF-HD_dimer (PF04770) | NA | NA |
To identify TFs from protein sequence data, you will use the function
annotate_pfam()
. This function takes as input an
AAStringSet
object 2 and returns a data frame of protein domains
associated with each sequence. The HMMER program (Finn, Clements, and Eddy 2011) is used to scan
protein sequences for the presence of DNA-binding protein domains, as
well as auxiliary and forbidden domains. Pre-built HMM profiles can be
found in the extdata/ directory of this package.
This is how you can run annotate_pfam()
3:
data(gsu_annotation)
# Annotate TF-related domains using a local installation of HMMER
if(hmmer_is_installed()) {
gsu_annotation <- annotate_pfam(gsu)
}
# Take a look at the first few lines of the output
head(gsu_annotation)
#> Gene Domain
#> 1 gsu144370.1 PF00010
#> 2 gsu140730.1 PF00010
#> 3 gsu127100.1 PF00046
#> 4 gsu109490.1 PF00046
#> 5 AIG92462.1 PF00072
#> 6 AIG92471.1 PF00072
Now that we have our TF-related domains, we can classify TFs in
families with the function classify_tfs()
.
# Classify TFs into families
gsu_families <- classify_tfs(gsu_annotation)
# Take a look at the output
head(gsu_families)
#> Gene Family
#> 1 gsu04730.1 Nin-like
#> 2 gsu05000.1 Nin-like
#> 3 gsu06140.1 G2-like
#> 4 gsu06140.1 MYB-related
#> 5 gsu09770.1 C3H
#> 6 gsu100410.1 Nin-like
# Count number of TFs per family
table(gsu_families$Family)
#>
#> C2H2 C3H CO-like CPP E2F/DP G2-like
#> 4 11 2 3 8 2
#> GATA HB-other HSF LSD M-type MYB
#> 8 2 6 3 1 16
#> MYB-related NF-X1 NF-YA NF-YB NF-YC Nin-like
#> 25 1 1 4 5 10
#> bHLH bZIP
#> 2 8
If you want to get TF counts per family for multiple species, you can
use the function get_tf_counts()
. This function takes a
list of AAStringSet
objects containing proteomes as input
4, and it
returns a SummarizedExperiment
object containing TF counts
per family in each species, as well as species metadata (optional). If
you are not familiar with the SummarizedExperiment
class,
you should consider checking the vignettes of the SummarizedExperiment
Bioconductor package.
To demonstrate how get_tf_counts()
works, we will
simulate a list of 4 AAStringSet
objects by sampling 50
random genes from the example data set gsu
4 times.
set.seed(123) # for reproducibility
# Simulate 4 different species by sampling 100 random genes from `gsu`
proteomes <- list(
Gsu1 = gsu[sample(names(gsu), 50, replace = FALSE)],
Gsu2 = gsu[sample(names(gsu), 50, replace = FALSE)],
Gsu3 = gsu[sample(names(gsu), 50, replace = FALSE)],
Gsu4 = gsu[sample(names(gsu), 50, replace = FALSE)]
)
proteomes
#> $Gsu1
#> AAStringSet object of length 50:
#> width seq names
#> [1] 405 MSQDTYTFEKLGVCKILCEECKN...SSSYNCPTDERMESEEEKLET* gsu49120.1
#> [2] 292 MGRSTTVFVGNIAYNTSEEQLQE...TLGAQMGQPGLGSSAFSSNNN* gsu102250.1
#> [3] 530 MTFTRIENNLKKYFGYENFRPLQ...KMSKTQNINQKTLLRWFSETM* gsu56420.1
#> [4] 672 MNRSLFTPTLRWSQWFAKTCTTL...EKRKLSVVHNKTQQIVSFNRH* gsu22760.1
#> [5] 606 MSQERLTPFERHLKKVEEKNQRE...EEDLDGVPLEEDEEAIEIEIK* gsu68790.1
#> ... ... ...
#> [46] 465 MNMFSYSTIETPKYVRSNSSDDR...STQMSHLLSVALQDWVEWNQE* gsu47220.1
#> [47] 2225 MSFPPKFRNYLLAQKGQPVSVEQ...MWRWVDNKSQFLSSYQVAWNE* gsu96990.1
#> [48] 292 MRGKSKSVVFPASRIKRIMRINE...LDKVESPSRVFISIEELVHSV* gsu124540.1
#> [49] 477 MDSSKKSTNPKLSESGTKDNRGN...PSDTTAPQVAVNVHAGNGSSK* gsu67550.1
#> [50] 740 MERLQPPYRYLIILDLEATAVDL...CTQDKSVSLGSVSLEGKGDSV* gsu102410.1
#>
#> $Gsu2
#> AAStringSet object of length 50:
#> width seq names
#> [1] 764 MSQSVPVKNDTEDSCGVQKLSNA...NSDDTSRIRRRTLNVHDLLSE* gsu21860.1
#> [2] 406 MQPHVSDHRYPTTVEQREYHSGS...QTTDVTGGAVFLRKETEHKDI* gsu140730.1
#> [3] 714 MEDERNSRLLIQGLPKYIAEKRL...LVFDYLVNWMALIVALVLSSF* gsu84990.1
#> [4] 236 MDTSEQGSEQGEESISNNSQQLC...YTNTASHRFRKLLKASVGDTS* gsu72770.1
#> [5] 524 MKKNAKEIGAQYSALVFIHVALF...PVSTVPYFVMDNNSGGSYSFV* gsu136350.1
#> ... ... ...
#> [46] 284 MTSFYIKKGITFSSIVYNHNYKF...LFQRIIEINKNYNPLIQLQRI* AIG92462.1
#> [47] 546 MAPRLNKTTQTKLKKQSSFREQP...NACGLFWAKHKQLRPKEKWVR* gsu134260.1
#> [48] 310 MSGGSGIYQPSGMSLYVGNLDPR...EIAGCVVQCEWGREGLKSRYF* gsu10640.1
#> [49] 445 MEPISKDDVYGGFSTVENCDSSM...NCKCVDCKNQSSLSLLKNTAM* gsu21090.1
#> [50] 318 MFICGHMIQNLVSVCAHSRIFCI...DCSYIFATDASDYPPPWQYFP* gsu64800.1
#>
#> $Gsu3
#> AAStringSet object of length 50:
#> width seq names
#> [1] 196 MAYLYEDRPVTLYRDRRFQGTQE...RDETDEVFQTEKNPRFREEED* gsu18660.1
#> [2] 171 MEDRMQVVVSETSKGEERGRGRG...SWAFSRGPLGTRLSTRRRSEK* gsu34810.1
#> [3] 461 MTERIDKSRRKKYVLTKKREYWT...HKESFSKRTYPDSVQAVIVGQ* gsu38570.1
#> [4] 1207 MELVSGPLLDQFPFVAGHSRQTL...KYGREHHWQYSLEHPFVSPIL* gsu143710.1
#> [5] 928 MFILVVEVKVEEYFSFQLDDFQQ...LRRVLDILRQIPRLPAKQWLS* gsu79190.1
#> ... ... ...
#> [46] 150 MAAVEDNRVFVGGLPWSVGEDDL...GHGHGGRGGRSGGFRRREDFE* gsu58080.1
#> [47] 122 MAPKERSKPKSRSSKAGLQFPVG...GVLPNVHPNLLPKKKAKEDMQ* gsu59350.1
#> [48] 680 MWSTVDYSLNCDEEFRELSNAAA...QQSTNVVYPSNTNNSETCENS* gsu44800.1
#> [49] 148 MVANGEPGVIYIGHLPHGFYENE...ERRNKELEVKLKKLGVSFSLS* gsu97960.1
#> [50] 226 MAYRKLETRVPSYLDEVLGKVSS...SAPEGVWRCPDCRSGGANRAR* gsu111770.1
#>
#> $Gsu4
#> AAStringSet object of length 50:
#> width seq names
#> [1] 740 MERLQPPYRYLIILDLEATAVDL...CTQDKSVSLGSVSLEGKGDSV* gsu102410.1
#> [2] 701 MFKSSLLTFPALKTVVGAQDQYT...NLTKSFDLKKSQLSKRKKKWK* gsu110850.1
#> [3] 484 MSEVSWEAGRSPVQPGKDHKSSS...DRSSTENRNSGRRRSVEGRAR* gsu114840.1
#> [4] 483 MEEERKQKKGTGTSVSKTRAVQE...AKSADDESSRPVRQYDVENVA* gsu100590.1
#> [5] 122 MAPKERSKPKSRSSKAGLQFPVG...GVLPNVHPNLLPKKKAKEDMQ* gsu59350.1
#> ... ... ...
#> [46] 798 MVLYQSYSSDTNSDVKPSEVSNS...EDENNKILCLCGAPTCRKFLN* gsu40400.1
#> [47] 383 MSVLLQTSRGDIVVDLFTDLAPL...EKALFQEHRRSSYSKTSKRYK* gsu139150.1
#> [48] 518 MRSGTTLSSLHNSHTEDATSLRA...EIGDIASLLEGEEVNYERLER* gsu32730.1
#> [49] 287 MKASQVLASQLCELCQSANSSIY...RKQLAERRCRFKGRFIKNTAS* gsu55840.1
#> [50] 1886 MDDTEYVPVKKRRQRILQQAKEL...TEVTRKQLIYYLSDVLANNKE* gsu43660.1
Great, we have a list of 4 AAStringSet
objects. Now,
let’s also create a simulated species metadata data frame for each
“species” (simulated).
# Create simulated species metadata
species_metadata <- data.frame(
row.names = names(proteomes),
Division = "Rhodophyta",
Origin = c("US", "Belgium", "China", "Brazil")
)
species_metadata
#> Division Origin
#> Gsu1 Rhodophyta US
#> Gsu2 Rhodophyta Belgium
#> Gsu3 Rhodophyta China
#> Gsu4 Rhodophyta Brazil
You can add as many columns as you want to the species metadata data
frame, but make sure that species names are in row
names, and that names(proteomes)
match
rownames(species)
, otherwise get_tf_counts()
will return an error.
Now that we have a list of AAStringSet
objects and
species metadata, we can execute get_tf_counts()
. This
function uses annotate_pfam()
under the hood, so you also
need to have HMMER installed and in your PATH to run it. Here is how you
can run it:
data(tf_counts)
# Get TF counts per family in each species as a SummarizedExperiment object
if(hmmer_is_installed()) {
tf_counts <- get_tf_counts(proteomes, species_metadata)
}
# Take a look at the SummarizedExperiment object
tf_counts
#> class: SummarizedExperiment
#> dim: 19 4
#> metadata(0):
#> assays(1): counts
#> rownames(19): C2H2 C3H ... NF-YA NF-YB
#> rowData names(0):
#> colnames(4): Gsu1 Gsu2 Gsu3 Gsu4
#> colData names(2): Division Origin
# Look at the matrix of counts: assay() function from SummarizedExperiment
SummarizedExperiment::assay(tf_counts)
#> Gsu1 Gsu2 Gsu3 Gsu4
#> C2H2 3 1 1 2
#> C3H 3 2 0 3
#> CPP 1 1 1 0
#> E2F/DP 1 2 2 2
#> GATA 1 2 1 0
#> HSF 2 0 0 3
#> MYB 3 3 0 2
#> MYB-related 4 5 3 9
#> NF-X1 1 0 0 0
#> NF-YC 3 0 0 1
#> Nin-like 2 0 1 1
#> bHLH 0 1 1 0
#> bZIP 0 2 4 0
#> G2-like 0 1 0 0
#> HB-other 0 1 0 0
#> LSD 0 1 0 0
#> M-type 0 1 1 0
#> NF-YA 0 1 0 0
#> NF-YB 0 2 0 1
# Look at the species metadata: colData() function from SummarizedExperiment
SummarizedExperiment::colData(tf_counts)
#> DataFrame with 4 rows and 2 columns
#> Division Origin
#> <character> <character>
#> Gsu1 Rhodophyta US
#> Gsu2 Rhodophyta Belgium
#> Gsu3 Rhodophyta China
#> Gsu4 Rhodophyta Brazil
Cool, huh? In real-world analyses, once you have TF counts per family
in multiple species obtained with get_tf_counts()
, you can
try to find associations between TF counts and eco-evolutionary aspects
or traits of each species (e.g., higher frequencies of a stress-related
TF family in a species that inhabits a stressful environment).
This document was created under the following conditions:
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.4.2 (2024-10-31)
#> os Ubuntu 24.04.1 LTS
#> system x86_64, linux-gnu
#> ui X11
#> language (EN)
#> collate C
#> ctype en_US.UTF-8
#> tz Etc/UTC
#> date 2024-11-18
#> pandoc 3.2.1 @ /usr/local/bin/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> abind 1.4-8 2024-09-12 [2] RSPM (R 4.4.0)
#> Biobase 2.67.0 2024-10-31 [2] https://bioc.r-universe.dev (R 4.4.1)
#> BiocGenerics 0.53.3 2024-11-15 [2] https://bioc.r-universe.dev (R 4.4.2)
#> BiocManager 1.30.25 2024-08-28 [2] RSPM (R 4.4.0)
#> BiocStyle * 2.35.0 2024-10-30 [2] https://bioc.r-universe.dev (R 4.4.1)
#> Biostrings 2.75.1 2024-11-07 [2] https://bioc.r-universe.dev (R 4.4.2)
#> bslib 0.8.0 2024-07-29 [2] RSPM (R 4.4.0)
#> buildtools 1.0.0 2024-11-11 [3] local (/pkg)
#> cachem 1.1.0 2024-05-16 [2] RSPM (R 4.4.0)
#> cli 3.6.3 2024-06-21 [2] RSPM (R 4.4.0)
#> crayon 1.5.3 2024-06-20 [2] RSPM (R 4.4.0)
#> DelayedArray 0.33.2 2024-11-15 [2] https://bioc.r-universe.dev (R 4.4.2)
#> digest 0.6.37 2024-08-19 [2] RSPM (R 4.4.0)
#> evaluate 1.0.1 2024-10-10 [2] RSPM (R 4.4.0)
#> fastmap 1.2.0 2024-05-15 [2] RSPM (R 4.4.0)
#> generics 0.1.3 2022-07-05 [2] RSPM (R 4.4.0)
#> GenomeInfoDb 1.43.1 2024-11-18 [2] https://bioc.r-universe.dev (R 4.4.2)
#> GenomeInfoDbData 1.2.13 2024-11-18 [2] Bioconductor
#> GenomicRanges 1.59.0 2024-10-30 [2] https://bioc.r-universe.dev (R 4.4.1)
#> htmltools 0.5.8.1 2024-04-04 [2] RSPM (R 4.4.0)
#> httr 1.4.7 2023-08-15 [2] RSPM (R 4.4.0)
#> IRanges 2.41.1 2024-11-17 [2] https://bioc.r-universe.dev (R 4.4.2)
#> jquerylib 0.1.4 2021-04-26 [2] RSPM (R 4.4.0)
#> jsonlite 1.8.9 2024-09-20 [2] RSPM (R 4.4.0)
#> knitr 1.49 2024-11-08 [2] RSPM (R 4.4.0)
#> lattice 0.22-6 2024-03-20 [2] RSPM (R 4.4.0)
#> lifecycle 1.0.4 2023-11-07 [2] RSPM (R 4.4.0)
#> maketools 1.3.1 2024-10-04 [3] RSPM (R 4.4.0)
#> Matrix 1.7-1 2024-10-18 [2] RSPM (R 4.4.0)
#> MatrixGenerics 1.19.0 2024-11-06 [2] https://bioc.r-universe.dev (R 4.4.2)
#> matrixStats 1.4.1 2024-09-08 [2] RSPM (R 4.4.0)
#> planttfhunter * 1.7.0 2024-11-18 [1] https://bioc.r-universe.dev (R 4.4.2)
#> R6 2.5.1 2021-08-19 [2] RSPM (R 4.4.0)
#> rlang 1.1.4 2024-06-04 [2] RSPM (R 4.4.0)
#> rmarkdown 2.29 2024-11-04 [2] RSPM (R 4.4.0)
#> S4Arrays 1.7.1 2024-10-31 [2] https://bioc.r-universe.dev (R 4.4.1)
#> S4Vectors 0.45.2 2024-11-16 [2] https://bioc.r-universe.dev (R 4.4.2)
#> sass 0.4.9 2024-03-15 [2] RSPM (R 4.4.0)
#> sessioninfo 1.2.2 2021-12-06 [2] RSPM (R 4.4.0)
#> SparseArray 1.7.2 2024-11-15 [2] https://bioc.r-universe.dev (R 4.4.2)
#> SummarizedExperiment 1.37.0 2024-10-31 [2] https://bioc.r-universe.dev (R 4.4.1)
#> sys 3.4.3 2024-10-04 [2] RSPM (R 4.4.0)
#> UCSC.utils 1.3.0 2024-10-31 [2] https://bioc.r-universe.dev (R 4.4.1)
#> xfun 0.49 2024-10-31 [2] RSPM (R 4.4.0)
#> XVector 0.47.0 2024-10-31 [2] https://bioc.r-universe.dev (R 4.4.1)
#> yaml 2.3.10 2024-07-26 [2] RSPM (R 4.4.0)
#> zlibbioc 1.52.0 2024-10-29 [2] Bioconductor 3.20 (R 4.4.2)
#>
#> [1] /tmp/Rtmp8LX3W4/Rinst104c40ad38e9
#> [2] /github/workspace/pkglib
#> [3] /usr/local/lib/R/site-library
#> [4] /usr/lib/R/site-library
#> [5] /usr/lib/R/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
Tip: You can access this classification
scheme in your R session by loading the data frame
data(classification_scheme)
.↩︎
Tip: If you have protein sequences in a
FASTA file, you can read them into an AAStringSet
object
with the function readAAStringSet()
from the Biostrings
package.↩︎
Note: in the code chunk below, the if
statement is not required. We just added it to make sure that the
function annotate_pfam()
is only executed if HMMER is
installed, to avoid problems when building this vignette in machines
that do not have HMMER installed.↩︎
Tip: If you have whole-genome protein
sequences for multiple species as FASTA files in a given directory, you
can read them all as a list of AAStringSet
objects with the
function fasta2AAStringSetlist()
from the Bioconductor
package syntenet.↩︎