Title: | RNA-binding protein motif analysis |
---|---|
Description: | transite is a computational method that allows comprehensive analysis of the regulatory role of RNA-binding proteins in various cellular processes by leveraging preexisting gene expression data and current knowledge of binding preferences of RNA-binding proteins. |
Authors: | Konstantin Krismer [aut, cre, cph] , Anna Gattinger [aut] , Michael Yaffe [ths, cph] , Ian Cannell [ths] |
Maintainer: | Konstantin Krismer <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.25.0 |
Built: | 2024-10-31 06:22:31 UTC |
Source: | https://github.com/bioc/transite |
Calls compute_kmer_enrichment
to compute k-mer
enrichment values
for multiple foregrounds. Calculates enrichment for foreground sets in
parallel.
calculate_kmer_enrichment( foreground_sets, background_set, k, permutation = FALSE, chisq_p_value_threshold = 0.05, p_adjust_method = "BH", n_cores = 4 )
calculate_kmer_enrichment( foreground_sets, background_set, k, permutation = FALSE, chisq_p_value_threshold = 0.05, p_adjust_method = "BH", n_cores = 4 )
foreground_sets |
list of foreground sets; a foreground set is a
character vector of
DNA or RNA sequences (not both) and a strict subset of the
|
background_set |
character vector of DNA or RNA sequences that constitute the background set |
k |
length of k-mer, either |
permutation |
if |
chisq_p_value_threshold |
threshold below which Fisher's exact test is used instead of Pearson's chi-squared test |
p_adjust_method |
see |
n_cores |
number of computing cores to use |
A list with two entries:
dfs |
a list of data frames with results from
compute_kmer_enrichment for each of the foreground sets |
kmers |
a character vector of all k-mers |
Other k-mer functions:
check_kmers()
,
compute_kmer_enrichment()
,
count_homopolymer_corrected_kmers()
,
draw_volcano_plot()
,
estimate_significance()
,
estimate_significance_core()
,
generate_kmers()
,
generate_permuted_enrichments()
,
run_kmer_spma()
,
run_kmer_tsma()
# define simple sequence sets for foreground and background foreground_set1 <- c( "CAACAGCCUUAAUU", "CAGUCAAGACUCC", "CUUUGGGGAAU", "UCAUUUUAUUAAA", "AAUUGGUGUCUGGAUACUUCCCUGUACAU", "AUCAAAUUA", "AGAU", "GACACUUAAAGAUCCU", "UAGCAUUAACUUAAUG", "AUGGA", "GAAGAGUGCUCA", "AUAGAC", "AGUUC", "CCAGUAA" ) foreground_set2 <- c("UUAUUUA", "AUCCUUUACA", "UUUUUUU", "UUUCAUCAUU") foreground_sets <- list(foreground_set1, foreground_set2) background_set <- c(foreground_set1, foreground_set2, "CCACACAC", "CUCAUUGGAG", "ACUUUGGGACA", "CAGGUCAGCA") # single-threaded kmer_enrichment_values_st <- calculate_kmer_enrichment(foreground_sets, background_set, 6, n_cores = 1) ## Not run: # multi-threaded kmer_enrichment_values_mt <- calculate_kmer_enrichment(foreground_sets, background_set, 6) ## End(Not run)
# define simple sequence sets for foreground and background foreground_set1 <- c( "CAACAGCCUUAAUU", "CAGUCAAGACUCC", "CUUUGGGGAAU", "UCAUUUUAUUAAA", "AAUUGGUGUCUGGAUACUUCCCUGUACAU", "AUCAAAUUA", "AGAU", "GACACUUAAAGAUCCU", "UAGCAUUAACUUAAUG", "AUGGA", "GAAGAGUGCUCA", "AUAGAC", "AGUUC", "CCAGUAA" ) foreground_set2 <- c("UUAUUUA", "AUCCUUUACA", "UUUUUUU", "UUUCAUCAUU") foreground_sets <- list(foreground_set1, foreground_set2) background_set <- c(foreground_set1, foreground_set2, "CCACACAC", "CUCAUUGGAG", "ACUUUGGGACA", "CAGGUCAGCA") # single-threaded kmer_enrichment_values_st <- calculate_kmer_enrichment(foreground_sets, background_set, 6, n_cores = 1) ## Not run: # multi-threaded kmer_enrichment_values_mt <- calculate_kmer_enrichment(foreground_sets, background_set, 6) ## End(Not run)
C++ implementation of Local Consistency Score algorithm.
calculate_local_consistency(x, numPermutations, minPermutations, e)
calculate_local_consistency(x, numPermutations, minPermutations, e)
x |
numeric vector that contains values for shuffling |
numPermutations |
maximum number of permutations performed in Monte Carlo test for consistency score |
minPermutations |
minimum number of permutations performed in Monte Carlo test for consistency score |
e |
stop criterion for consistency score Monte Carlo test:
aborting permutation
process after observing |
list with score
, p_value
, and n
components,
where score
is the raw local consistency score (usually not used),
p_value
is the associated p-value for that score, obtained by
Monte Carlo testing, and n
is the number of permutations performed
in the Monte Carlo test (the higher, the more significant)
poor_enrichment_spectrum <- c(0.1, 0.5, 0.6, 0.4, 0.7, 0.6, 1.2, 1.1, 1.8, 1.6) local_consistency <- calculate_local_consistency(poor_enrichment_spectrum, 1000000, 1000, 5) enrichment_spectrum <- c(0.1, 0.3, 0.6, 0.7, 0.8, 0.9, 1.2, 1.4, 1.6, 1.4) local_consistency <- calculate_local_consistency(enrichment_spectrum, 1000000, 1000, 5)
poor_enrichment_spectrum <- c(0.1, 0.5, 0.6, 0.4, 0.7, 0.6, 1.2, 1.1, 1.8, 1.6) local_consistency <- calculate_local_consistency(poor_enrichment_spectrum, 1000000, 1000, 5) enrichment_spectrum <- c(0.1, 0.3, 0.6, 0.7, 0.8, 0.9, 1.2, 1.4, 1.6, 1.4) local_consistency <- calculate_local_consistency(enrichment_spectrum, 1000000, 1000, 5)
This function is used to calculate binding site enrichment / depletion scores between predefined foreground and background sequence sets. Significance levels of enrichment values are obtained by Monte Carlo tests.
calculate_motif_enrichment( foreground_scores_df, background_scores_df, background_total_sites, background_absolute_hits, n_transcripts_foreground, max_fg_permutations = 1e+06, min_fg_permutations = 1000, e = 5, p_adjust_method = "BH" )
calculate_motif_enrichment( foreground_scores_df, background_scores_df, background_total_sites, background_absolute_hits, n_transcripts_foreground, max_fg_permutations = 1e+06, min_fg_permutations = 1000, e = 5, p_adjust_method = "BH" )
foreground_scores_df |
result of |
background_scores_df |
result of |
background_total_sites |
number of potential binding sites per sequence
(returned by |
background_absolute_hits |
number of putative binding sites per sequence
(returned by |
n_transcripts_foreground |
number of sequences in the foreground set |
max_fg_permutations |
maximum number of foreground permutations performed in Monte Carlo test for enrichment score |
min_fg_permutations |
minimum number of foreground permutations performed in Monte Carlo test for enrichment score |
e |
integer-valued stop criterion for enrichment score Monte Carlo
test: aborting
permutation process after
observing |
p_adjust_method |
adjustment of p-values from Monte Carlo tests to
avoid alpha error
accumulation, see |
A data frame with the following columns:
motif_id |
the motif identifier that is used in the original motif library |
motif_rbps |
the gene symbol of the RNA-binding protein(s) |
enrichment |
binding site enrichment between foreground and background sequences |
p_value |
unadjusted p-value from Monte Carlo test |
p_value_n |
number of Monte Carlo test permutations |
adj_p_value |
adjusted p-value from Monte Carlo test (usually FDR) |
Other matrix functions:
run_matrix_spma()
,
run_matrix_tsma()
,
score_transcripts()
,
score_transcripts_single_motif()
foreground_seqs <- c("CAGUCAAGACUCC", "AAUUGGUGUCUGGAUACUUCCCUGUACAU", "AGAU", "CCAGUAA") background_seqs <- c(foreground_seqs, "CAACAGCCUUAAUU", "CUUUGGGGAAU", "UCAUUUUAUUAAA", "AUCAAAUUA", "GACACUUAAAGAUCCU", "UAGCAUUAACUUAAUG", "AUGGA", "GAAGAGUGCUCA", "AUAGAC", "AGUUC") foreground_scores <- score_transcripts(foreground_seqs, cache = FALSE) background_scores <- score_transcripts(background_seqs, cache = FALSE) enrichments_df <- calculate_motif_enrichment(foreground_scores$df, background_scores$df, background_scores$total_sites, background_scores$absolute_hits, length(foreground_seqs), max_fg_permutations = 1000 )
foreground_seqs <- c("CAGUCAAGACUCC", "AAUUGGUGUCUGGAUACUUCCCUGUACAU", "AGAU", "CCAGUAA") background_seqs <- c(foreground_seqs, "CAACAGCCUUAAUU", "CUUUGGGGAAU", "UCAUUUUAUUAAA", "AUCAAAUUA", "GACACUUAAAGAUCCU", "UAGCAUUAACUUAAUG", "AUGGA", "GAAGAGUGCUCA", "AUAGAC", "AGUUC") foreground_scores <- score_transcripts(foreground_seqs, cache = FALSE) background_scores <- score_transcripts(background_seqs, cache = FALSE) enrichments_df <- calculate_motif_enrichment(foreground_scores$df, background_scores$df, background_scores$total_sites, background_scores$absolute_hits, length(foreground_seqs), max_fg_permutations = 1000 )
C++ implementation of Motif Enrichment calculation
calculate_transcript_mc( absoluteHits, totalSites, relHitsForeground, n, maxPermutations, minPermutations, e )
calculate_transcript_mc( absoluteHits, totalSites, relHitsForeground, n, maxPermutations, minPermutations, e )
absoluteHits |
number of putative binding sites per sequence
(returned by |
totalSites |
number of potential binding sites per sequence
(returned by |
relHitsForeground |
relative number of hits in foreground set |
n |
number of sequences in the foreground set |
maxPermutations |
maximum number of foreground permutations performed in Monte Carlo test for enrichment score |
minPermutations |
minimum number of foreground permutations performed in Monte Carlo test for enrichment score |
e |
stop criterion for enrichment score Monte Carlo test:
aborting permutation process
after observing |
list with p-value and number of iterations of Monte Carlo sampling for foreground enrichment
foreground_seqs <- c("CAGUCAAGACUCC", "AAUUGGUUGUGGGGCUUCCCUGUACAU", "AGAU", "CCAGUAA", "UGUGGGG") background_seqs <- c(foreground_seqs, "CAACAGCCUUAAUU", "CUUUGGGGAAU", "UCAUUUUAUUAAA", "AUCAAAUUA", "GACACUUAAAGAUCCU", "UAGCAUUAACUUAAUG", "AUGGA", "GAAGAGUGCUCA", "AUAGAC", "AGUUC") motif_db <- get_motif_by_id("M178_0.6") fg <- score_transcripts(foreground_seqs, cache = FALSE, motifs = motif_db) bg <- score_transcripts(background_seqs, cache = FALSE, motifs = motif_db) mc_result <- calculate_transcript_mc(unlist(bg$absolute_hits), unlist(bg$total_sites), fg$df$absolute_hits / fg$df$total_sites, length(foreground_seqs), 1000, 500, 5)
foreground_seqs <- c("CAGUCAAGACUCC", "AAUUGGUUGUGGGGCUUCCCUGUACAU", "AGAU", "CCAGUAA", "UGUGGGG") background_seqs <- c(foreground_seqs, "CAACAGCCUUAAUU", "CUUUGGGGAAU", "UCAUUUUAUUAAA", "AUCAAAUUA", "GACACUUAAAGAUCCU", "UAGCAUUAACUUAAUG", "AUGGA", "GAAGAGUGCUCA", "AUAGAC", "AGUUC") motif_db <- get_motif_by_id("M178_0.6") fg <- score_transcripts(foreground_seqs, cache = FALSE, motifs = motif_db) bg <- score_transcripts(background_seqs, cache = FALSE, motifs = motif_db) mc_result <- calculate_transcript_mc(unlist(bg$absolute_hits), unlist(bg$total_sites), fg$df$absolute_hits / fg$df$total_sites, length(foreground_seqs), 1000, 500, 5)
Checks if the provided set of k-mers is valid. A valid set of k-mers is (1) non-empty, (2) contains either only hexamers or only heptamers, and (3) contains only characters from the RNA alphabet (A, C, G, U)
check_kmers(kmers)
check_kmers(kmers)
kmers |
set of k-mers |
TRUE
if set of k-mers is valid
Other k-mer functions:
calculate_kmer_enrichment()
,
compute_kmer_enrichment()
,
count_homopolymer_corrected_kmers()
,
draw_volcano_plot()
,
estimate_significance()
,
estimate_significance_core()
,
generate_kmers()
,
generate_permuted_enrichments()
,
run_kmer_spma()
,
run_kmer_tsma()
# valid set check_kmers(c("ACGCUC", "AAACCC", "UUUACA")) # invalid set (contains hexamers and heptamers) check_kmers(c("ACGCUC", "AAACCC", "UUUACAA"))
# valid set check_kmers(c("ACGCUC", "AAACCC", "UUUACA")) # invalid set (contains hexamers and heptamers) check_kmers(c("ACGCUC", "AAACCC", "UUUACAA"))
Spectra can be classified based on the aggregate spectrum classifier score.
If sum(score) == 3
spectrum considered non-random, random otherwise.
classify_spectrum( adj_r_squared, degree, slope, consistency_score_n, n_significant, n_bins )
classify_spectrum( adj_r_squared, degree, slope, consistency_score_n, n_significant, n_bins )
adj_r_squared |
adjusted |
degree |
degree of polynomial, returned by score_spectrum |
slope |
coefficient of the linear term of the polynomial model (spectrum "direction"), returned by score_spectrum |
consistency_score_n |
number of performed permutations before early stopping, returned by score_spectrum |
n_significant |
number of bins with statistically significant enrichment |
n_bins |
number of bins |
a three-dimensional binary vector with the following components:
coordinate 1 |
adj_r_squared >= 0.4
|
coordinate 2 |
consistency_score_n > 1000000
|
coordinate 3 |
n_significant >= floor(n_bins / 10)
|
Other SPMA functions:
run_kmer_spma()
,
run_matrix_spma()
,
score_spectrum()
,
subdivide_data()
n_bins <- 40 # random spectrum random_sp <- score_spectrum(runif(n = n_bins, min = -1, max = 1), max_model_degree = 1) score <- classify_spectrum( get_adj_r_squared(random_sp), get_model_degree(random_sp), get_model_slope(random_sp), get_consistency_score_n(random_sp), 0, n_bins ) sum(score) # non-random linear spectrum with strong noise component signal <- seq(-1, 0.99, 2 / 40) noise <- rnorm(n = 40, mean = 0, sd = 0.5) linear_sp <- score_spectrum(signal + noise, max_model_degree = 1, max_cs_permutations = 100000) score <- classify_spectrum( get_adj_r_squared(linear_sp), get_model_degree(linear_sp), get_model_slope(linear_sp), get_consistency_score_n(linear_sp), 10, n_bins ) sum(score) ## Not run: # non-random linear spectrum with weak noise component signal <- seq(-1, 0.99, 2 / 40) noise <- rnorm(n = 40, mean = 0, sd = 0.2) linear_sp <- score_spectrum(signal + noise, max_model_degree = 1, max_cs_permutations = 100000) score <- classify_spectrum( get_adj_r_squared(linear_sp), get_model_degree(linear_sp), get_model_slope(linear_sp), get_consistency_score_n(linear_sp), 10, n_bins ) sum(score) ## End(Not run) # non-random quadratic spectrum with strong noise component signal <- seq(-1, 0.99, 2 / 40)^2 - 0.5 noise <- rnorm(n = 40, mean = 0, sd = 0.2) quadratic_sp <- score_spectrum(signal + noise, max_model_degree = 2, max_cs_permutations = 100000) score <- classify_spectrum( get_adj_r_squared(quadratic_sp), get_model_degree(quadratic_sp), get_model_slope(quadratic_sp), get_consistency_score_n(quadratic_sp), 10, n_bins ) sum(score) ## Not run: # non-random quadratic spectrum with weak noise component signal <- seq(-1, 0.99, 2 / 40)^2 - 0.5 noise <- rnorm(n = 40, mean = 0, sd = 0.1) quadratic_sp <- score_spectrum(signal + noise, max_model_degree = 2) score <- classify_spectrum( get_adj_r_squared(quadratic_sp), get_model_degree(quadratic_sp), get_model_slope(quadratic_sp), get_consistency_score_n(quadratic_sp), 10, n_bins ) sum(score) ## End(Not run)
n_bins <- 40 # random spectrum random_sp <- score_spectrum(runif(n = n_bins, min = -1, max = 1), max_model_degree = 1) score <- classify_spectrum( get_adj_r_squared(random_sp), get_model_degree(random_sp), get_model_slope(random_sp), get_consistency_score_n(random_sp), 0, n_bins ) sum(score) # non-random linear spectrum with strong noise component signal <- seq(-1, 0.99, 2 / 40) noise <- rnorm(n = 40, mean = 0, sd = 0.5) linear_sp <- score_spectrum(signal + noise, max_model_degree = 1, max_cs_permutations = 100000) score <- classify_spectrum( get_adj_r_squared(linear_sp), get_model_degree(linear_sp), get_model_slope(linear_sp), get_consistency_score_n(linear_sp), 10, n_bins ) sum(score) ## Not run: # non-random linear spectrum with weak noise component signal <- seq(-1, 0.99, 2 / 40) noise <- rnorm(n = 40, mean = 0, sd = 0.2) linear_sp <- score_spectrum(signal + noise, max_model_degree = 1, max_cs_permutations = 100000) score <- classify_spectrum( get_adj_r_squared(linear_sp), get_model_degree(linear_sp), get_model_slope(linear_sp), get_consistency_score_n(linear_sp), 10, n_bins ) sum(score) ## End(Not run) # non-random quadratic spectrum with strong noise component signal <- seq(-1, 0.99, 2 / 40)^2 - 0.5 noise <- rnorm(n = 40, mean = 0, sd = 0.2) quadratic_sp <- score_spectrum(signal + noise, max_model_degree = 2, max_cs_permutations = 100000) score <- classify_spectrum( get_adj_r_squared(quadratic_sp), get_model_degree(quadratic_sp), get_model_slope(quadratic_sp), get_consistency_score_n(quadratic_sp), 10, n_bins ) sum(score) ## Not run: # non-random quadratic spectrum with weak noise component signal <- seq(-1, 0.99, 2 / 40)^2 - 0.5 noise <- rnorm(n = 40, mean = 0, sd = 0.1) quadratic_sp <- score_spectrum(signal + noise, max_model_degree = 2) score <- classify_spectrum( get_adj_r_squared(quadratic_sp), get_model_degree(quadratic_sp), get_model_slope(quadratic_sp), get_consistency_score_n(quadratic_sp), 10, n_bins ) sum(score) ## End(Not run)
Compares foreground sequence set to background sequence set and computes enrichment values for each possible k-mer.
compute_kmer_enrichment( foreground_kmers, background_kmers, permutation = FALSE, chisq_p_value_threshold = 0.05, p_adjust_method = "BH" )
compute_kmer_enrichment( foreground_kmers, background_kmers, permutation = FALSE, chisq_p_value_threshold = 0.05, p_adjust_method = "BH" )
foreground_kmers |
k-mer counts of the foreground set
(generated by |
background_kmers |
k-mer counts of the background set
(generated by |
permutation |
if |
chisq_p_value_threshold |
threshold below which Fisher's exact test is used instead of Pearson's chi-squared test |
p_adjust_method |
see |
Usually uses Pearson's chi-squared test, but recalculates p-values
with Fisher's exact test
for Pearson's chi-squared test p-values <= chisq_p_value_threshold
.
The reason this is done is
computational efficiency. Fisher's exact tests are computationally
demanding and are only
performed in situations, where exact p-values are preferred, e.g.,
if expected < 5 or
significant p-values.
enrichment of k-mers in specified foreground sequences. A data frame with the following columns is returned:
foreground_count |
foreground counts for each k-mer |
background_count |
background counts for each k-mer |
enrichment |
k-mer enrichment |
p_value |
p-value of k-mer enrichment (either from Fisher's exact test or Pearson's chi-squared test) |
adj_p_value |
multiple testing corrected p-value |
Other k-mer functions:
calculate_kmer_enrichment()
,
check_kmers()
,
count_homopolymer_corrected_kmers()
,
draw_volcano_plot()
,
estimate_significance()
,
estimate_significance_core()
,
generate_kmers()
,
generate_permuted_enrichments()
,
run_kmer_spma()
,
run_kmer_tsma()
# define simple sequence sets for foreground and background foreground_set <- c( "CAACAGCCUUAAUU", "CAGUCAAGACUCC", "CUUUGGGGAAU", "UCAUUUUAUUAAA", "AAUUGGUGUCUGGAUACUUCCCUGUACAU", "AUCAAAUUA", "AGAU", "GACACUUAAAGAUCCU", "UAGCAUUAACUUAAUG", "AUGGA", "GAAGAGUGCUCA", "AUAGAC", "AGUUC", "CCAGUAA" ) background_set <- c( "CAACAGCCUUAAUU", "CAGUCAAGACUCC", "CUUUGGGGAAU", "UCAUUUUAUUAAA", "AAUUGGUGUCUGGAUACUUCCCUGUACAU", "AUCAAAUUA", "AGAU", "GACACUUAAAGAUCCU", "UAGCAUUAACUUAAUG", "AUGGA", "GAAGAGUGCUCA", "AUAGAC", "AGUUC", "CCAGUAA", "UUAUUUA", "AUCCUUUACA", "UUUUUUU", "UUUCAUCAUU", "CCACACAC", "CUCAUUGGAG", "ACUUUGGGACA", "CAGGUCAGCA" ) foreground_kmers <- generate_kmers(foreground_set, 6) background_kmers <- generate_kmers(background_set, 6) kmer_enrichment_values <- compute_kmer_enrichment(foreground_kmers, background_kmers)
# define simple sequence sets for foreground and background foreground_set <- c( "CAACAGCCUUAAUU", "CAGUCAAGACUCC", "CUUUGGGGAAU", "UCAUUUUAUUAAA", "AAUUGGUGUCUGGAUACUUCCCUGUACAU", "AUCAAAUUA", "AGAU", "GACACUUAAAGAUCCU", "UAGCAUUAACUUAAUG", "AUGGA", "GAAGAGUGCUCA", "AUAGAC", "AGUUC", "CCAGUAA" ) background_set <- c( "CAACAGCCUUAAUU", "CAGUCAAGACUCC", "CUUUGGGGAAU", "UCAUUUUAUUAAA", "AAUUGGUGUCUGGAUACUUCCCUGUACAU", "AUCAAAUUA", "AGAU", "GACACUUAAAGAUCCU", "UAGCAUUAACUUAAUG", "AUGGA", "GAAGAGUGCUCA", "AUAGAC", "AGUUC", "CCAGUAA", "UUAUUUA", "AUCCUUUACA", "UUUUUUU", "UUUCAUCAUU", "CCACACAC", "CUCAUUGGAG", "ACUUUGGGACA", "CAGGUCAGCA" ) foreground_kmers <- generate_kmers(foreground_set, 6) background_kmers <- generate_kmers(background_set, 6) kmer_enrichment_values <- compute_kmer_enrichment(foreground_kmers, background_kmers)
Counts all non-overlapping instances of k-mers in a given set of sequences.
count_homopolymer_corrected_kmers(sequences, k, kmers, is_rna = FALSE)
count_homopolymer_corrected_kmers(sequences, k, kmers, is_rna = FALSE)
sequences |
character vector of DNA or RNA sequences |
k |
length of k-mer, either |
kmers |
column sums of return value of
|
is_rna |
if |
Returns a named numeric vector, where the elements are k-mer counts and the names are k-mers.
Other k-mer functions:
calculate_kmer_enrichment()
,
check_kmers()
,
compute_kmer_enrichment()
,
draw_volcano_plot()
,
estimate_significance()
,
estimate_significance_core()
,
generate_kmers()
,
generate_permuted_enrichments()
,
run_kmer_spma()
,
run_kmer_tsma()
Takes a position weight matrix (PWM) and meta info and returns an object of
class RBPMotif
.
create_kmer_motif(id, rbps, kmers, type, species, src)
create_kmer_motif(id, rbps, kmers, type, species, src)
id |
motif id (character vector of length 1) |
rbps |
character vector of names of RNA-binding proteins associated with this motif |
kmers |
character vector of k-mers that are associated with the motif, set of k-mers is valid if (1) all k-mers must have the same length, (2) only hexamers or heptamers allowed, (3) allowed characters are A, C, G, U |
type |
type of motif (e.g., |
species |
species where motif was discovered (e.g.,
|
src |
source of motif (e.g., |
object of class RBPMotif
custom_motif <- create_kmer_motif( "custom_motif", "RBP1", c("AAAAAAA", "CAAAAAA"), "HITS-CLIP", "Homo sapiens", "user" )
custom_motif <- create_kmer_motif( "custom_motif", "RBP1", c("AAAAAAA", "CAAAAAA"), "HITS-CLIP", "Homo sapiens", "user" )
Takes a position weight matrix (PWM) and meta info and returns an object of
class RBPMotif
.
create_matrix_motif(id, rbps, matrix, type, species, src)
create_matrix_motif(id, rbps, matrix, type, species, src)
id |
motif id (character vector of length 1) |
rbps |
character vector of names of RNA-binding proteins associated with this motif |
matrix |
data frame with four columns (A, C, G, U) and 6 - 15 rows (positions), where cell (i, j) contains weight of nucleotide j on position i |
type |
type of motif (e.g., |
species |
species where motif was discovered (e.g.,
|
src |
source of motif (e.g., |
object of class RBPMotif
custom_motif <- create_matrix_motif( "custom_motif", "RBP1", transite:::toy_motif_matrix, "HITS-CLIP", "Homo sapiens", "user" )
custom_motif <- create_matrix_motif( "custom_motif", "RBP1", transite:::toy_motif_matrix, "HITS-CLIP", "Homo sapiens", "user" )
Uses a volcano plot to visualize k-mer enrichment.
X-axis is enrichment value,
y-axis is
significance, i.e., multiple testing
corrected p-value from
Fisher's exact test or Pearson's chi-squared test.
draw_volcano_plot( kmers, motif_kmers, motif_rbps, significance_threshold = 0.01, show_legend = TRUE )
draw_volcano_plot( kmers, motif_kmers, motif_rbps, significance_threshold = 0.01, show_legend = TRUE )
kmers |
data frame with the following columns: kmer, adj_p_value, enrichment |
motif_kmers |
set of k-mers that are associated with a certain motif, will be highlighted in volcano plot |
motif_rbps |
name of RNA-binding proteins associated with highlighted k-mers (character vector of length 1) |
significance_threshold |
p-value threshold for significance,
e.g., |
show_legend |
whether or not a legend should be shown |
volcano plot
Other TSMA functions:
run_kmer_tsma()
,
run_matrix_tsma()
Other k-mer functions:
calculate_kmer_enrichment()
,
check_kmers()
,
compute_kmer_enrichment()
,
count_homopolymer_corrected_kmers()
,
estimate_significance()
,
estimate_significance_core()
,
generate_kmers()
,
generate_permuted_enrichments()
,
run_kmer_spma()
,
run_kmer_tsma()
motif <- get_motif_by_id("951_12324455") draw_volcano_plot(transite:::kmers_enrichment, get_hexamers(motif[[1]]), get_rbps(motif[[1]])) ## Not run: foreground_set <- c("UGUGGG", "GUGGGG", "GUGUGG", "UGUGGU") background_set <- unique(c(foreground_set, c( "CAACAGCCUUAAUU", "CAGUCAAGACUCC", "CUUUGGGGAAU", "UCAUUUUAUUAAA", "AAUUGGUGUCUGGAUACUUCCCUGUACAU", "AUCAAAUUA", "AGAU", "GACACUUAAAGAUCCU", "UAGCAUUAACUUAAUG", "AUGGA", "GAAGAGUGCUCA", "AUAGAC", "AGUUC", "CCAGUAA", "CCACACAC", "CUCAUUGGAG", "ACUUUCCCACA", "CAGGUCAGCA", "CCACACCAG", "CCACACAUCAGU", "CACACACUCC", "CAGCCCCCCACAGGCA" ))) motif <- get_motif_by_id("M178_0.6") results <- run_kmer_tsma(list(foreground_set), background_set, motifs = motif) draw_volcano_plot(results[[1]]$motif_kmers_dfs[[1]], get_hexamers(motif[[1]]), "test RBP") ## End(Not run)
motif <- get_motif_by_id("951_12324455") draw_volcano_plot(transite:::kmers_enrichment, get_hexamers(motif[[1]]), get_rbps(motif[[1]])) ## Not run: foreground_set <- c("UGUGGG", "GUGGGG", "GUGUGG", "UGUGGU") background_set <- unique(c(foreground_set, c( "CAACAGCCUUAAUU", "CAGUCAAGACUCC", "CUUUGGGGAAU", "UCAUUUUAUUAAA", "AAUUGGUGUCUGGAUACUUCCCUGUACAU", "AUCAAAUUA", "AGAU", "GACACUUAAAGAUCCU", "UAGCAUUAACUUAAUG", "AUGGA", "GAAGAGUGCUCA", "AUAGAC", "AGUUC", "CCAGUAA", "CCACACAC", "CUCAUUGGAG", "ACUUUCCCACA", "CAGGUCAGCA", "CCACACCAG", "CCACACAUCAGU", "CACACACUCC", "CAGCCCCCCACAGGCA" ))) motif <- get_motif_by_id("M178_0.6") results <- run_kmer_tsma(list(foreground_set), background_set, motifs = motif) draw_volcano_plot(results[[1]]$motif_kmers_dfs[[1]], get_hexamers(motif[[1]]), "test RBP") ## End(Not run)
estimate_significance
returns an estimate of the significance
of the observed
mean, given a set of random permutations of the data.
estimate_significance( actual_mean, motif_kmers, random_permutations, alternative = c("two_sided", "less", "greater"), conf_level = 0.95, produce_plot = TRUE )
estimate_significance( actual_mean, motif_kmers, random_permutations, alternative = c("two_sided", "less", "greater"), conf_level = 0.95, produce_plot = TRUE )
actual_mean |
observed mean |
motif_kmers |
set of k-mers that were used to compute
the |
random_permutations |
a set of random permutations of the original data, used to generate an empirical null distribution. |
alternative |
side of the test, one of the following:
|
conf_level |
confidence level for the returned confidence interval |
produce_plot |
if distribution plot should be part of the returned list |
A list with the following components:
p_value_estimate |
the estimated p-value of the observed mean |
conf_int |
the confidence interval around that estimate |
plot |
plot of the empirical distribution of geometric means of the enrichment values |
Other k-mer functions:
calculate_kmer_enrichment()
,
check_kmers()
,
compute_kmer_enrichment()
,
count_homopolymer_corrected_kmers()
,
draw_volcano_plot()
,
estimate_significance_core()
,
generate_kmers()
,
generate_permuted_enrichments()
,
run_kmer_spma()
,
run_kmer_tsma()
estimate_significance_core
returns an estimate of the significance
of the observed
mean, given a vector of means based on random permutations of the data.
estimate_significance_core( random_means, actual_mean, alternative = c("two_sided", "less", "greater"), conf_level = 0.95 )
estimate_significance_core( random_means, actual_mean, alternative = c("two_sided", "less", "greater"), conf_level = 0.95 )
random_means |
numeric vector of means based on random permutations of the data (empirical null distribution) |
actual_mean |
observed mean |
alternative |
side of the test, one of the following:
|
conf_level |
confidence level for the returned confidence interval |
A list with the following components:
p_value_estimate |
the estimated p-value of the observed mean |
conf_int |
the confidence interval around that estimate |
Other k-mer functions:
calculate_kmer_enrichment()
,
check_kmers()
,
compute_kmer_enrichment()
,
count_homopolymer_corrected_kmers()
,
draw_volcano_plot()
,
estimate_significance()
,
generate_kmers()
,
generate_permuted_enrichments()
,
run_kmer_spma()
,
run_kmer_tsma()
test_sd <- 1.0 test_null_distribution <- rnorm(n = 10000, mean = 1.0, sd = test_sd) estimate_significance_core(test_null_distribution, test_sd * 2, "greater")
test_sd <- 1.0 test_null_distribution <- rnorm(n = 10000, mean = 1.0, sd = test_sd) estimate_significance_core(test_null_distribution, test_sd * 2, "greater")
This object contains a toy data set based on gene expression measurements and 3'-UTR sequences of 1000 genes. It comprises three data frames with RefSeq identifiers, log fold change values, and 3'-UTR sequences of genes, which are either upregulated or downregulated after some hypothetical treatment, as well as all measured genes. The actual values are not important. This data set merely serves as an example input for various functions.
data(ge)
data(ge)
A list with the following components:
foreground1_df |
data frame that contains down-regulated genes after treatment |
foreground2_df |
data frame that contains up-regulated genes after treatment |
background_df |
data frame that contains all genes measured |
Generates a compact logo of a motif based on IUPAC codes given by a character vector of k-mers
generate_iupac_by_kmers(kmers, code = NULL)
generate_iupac_by_kmers(kmers, code = NULL)
kmers |
character vector of k-mers |
code |
if IUPAC code table has already been initialized by
|
IUPAC RNA nucleotide code:
A |
Adenine |
C |
Cytosine |
G |
Guanine |
U |
Uracil |
R |
A or G |
Y |
C or U |
S |
G or C |
W |
A or U |
K |
G or U |
M |
A or C |
B |
C or G or U |
D |
A or G or U |
H |
A or C or U |
V |
A or C or G |
N |
any base |
the IUPAC string of the binding site
http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html
Other motif functions:
generate_iupac_by_matrix()
,
generate_kmers_from_iupac()
,
get_motif_by_id()
,
get_motif_by_rbp()
,
get_motifs()
,
get_motifs_meta_info()
,
get_ppm()
,
init_iupac_lookup_table()
,
set_motifs()
generate_iupac_by_kmers(c("AACCAA", "AACCGG", "CACCGA"))
generate_iupac_by_kmers(c("AACCAA", "AACCGG", "CACCGA"))
Generates a compact logo of a motif based on IUPAC codes given by a position weight matrix
generate_iupac_by_matrix(matrix, threshold = 0.215, code = NULL)
generate_iupac_by_matrix(matrix, threshold = 0.215, code = NULL)
matrix |
the position probability matrix of an RNA-binding protein |
threshold |
the threshold probability (nucleotides with lower probabilities are ignored) |
code |
if IUPAC code table has already been initialized by
|
IUPAC RNA nucleotide code:
A |
Adenine |
C |
Cytosine |
G |
Guanine |
U |
Uracil |
R |
A or G |
Y |
C or U |
S |
G or C |
W |
A or U |
K |
G or U |
M |
A or C |
B |
C or G or U |
D |
A or G or U |
H |
A or C or U |
V |
A or C or G |
N |
any base |
the IUPAC string of the binding site
http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html
Other motif functions:
generate_iupac_by_kmers()
,
generate_kmers_from_iupac()
,
get_motif_by_id()
,
get_motif_by_rbp()
,
get_motifs()
,
get_motifs_meta_info()
,
get_ppm()
,
init_iupac_lookup_table()
,
set_motifs()
generate_iupac_by_matrix(get_motif_matrix(get_motif_by_id("M178_0.6")[[1]]))
generate_iupac_by_matrix(get_motif_matrix(get_motif_by_id("M178_0.6")[[1]]))
Counts occurrences of k-mers of length k
in the given set of
sequences. Corrects for homopolymeric stretches.
generate_kmers(sequences, k)
generate_kmers(sequences, k)
sequences |
character vector of DNA or RNA sequences |
k |
length of k-mer, either |
Returns a named numeric vector, where the elements are k-mer counts and the names are DNA k-mers.
generate_kmers
always returns DNA k-mers, even if
sequences
contains RNA sequences.
RNA sequences are internally converted to DNA sequences. It is not
allowed to mix DNA and
RNA sequences.
Other k-mer functions:
calculate_kmer_enrichment()
,
check_kmers()
,
compute_kmer_enrichment()
,
count_homopolymer_corrected_kmers()
,
draw_volcano_plot()
,
estimate_significance()
,
estimate_significance_core()
,
generate_permuted_enrichments()
,
run_kmer_spma()
,
run_kmer_tsma()
# count hexamers in set of RNA sequences rna_sequences <- c( "CAACAGCCUUAAUU", "CAGUCAAGACUCC", "CUUUGGGGAAU", "UCAUUUUAUUAAA", "AAUUGGUGUCUGGAUACUUCCCUGUACAU", "AUCAAAUUA", "AGAU", "GACACUUAAAGAUCCU", "UAGCAUUAACUUAAUG", "AUGGA", "GAAGAGUGCUCA", "AUAGAC", "AGUUC", "CCAGUAA", "UUAUUUA", "AUCCUUUACA", "UUUUUUU", "UUUCAUCAUU", "CCACACAC", "CUCAUUGGAG", "ACUUUGGGACA", "CAGGUCAGCA" ) hexamer_counts <- generate_kmers(rna_sequences, 6) # count heptamers in set of DNA sequences dna_sequences <- c( "CAACAGCCTTAATT", "CAGTCAAGACTCC", "CTTTGGGGAAT", "TCATTTTATTAAA", "AATTGGTGTCTGGATACTTCCCTGTACAT", "ATCAAATTA", "AGAT", "GACACTTAAAGATCCT", "TAGCATTAACTTAATG", "ATGGA", "GAAGAGTGCTCA", "ATAGAC", "AGTTC", "CCAGTAA", "TTATTTA", "ATCCTTTACA", "TTTTTTT", "TTTCATCATT", "CCACACAC", "CTCATTGGAG", "ACTTTGGGACA", "CAGGTCAGCA" ) hexamer_counts <- generate_kmers(dna_sequences, 7)
# count hexamers in set of RNA sequences rna_sequences <- c( "CAACAGCCUUAAUU", "CAGUCAAGACUCC", "CUUUGGGGAAU", "UCAUUUUAUUAAA", "AAUUGGUGUCUGGAUACUUCCCUGUACAU", "AUCAAAUUA", "AGAU", "GACACUUAAAGAUCCU", "UAGCAUUAACUUAAUG", "AUGGA", "GAAGAGUGCUCA", "AUAGAC", "AGUUC", "CCAGUAA", "UUAUUUA", "AUCCUUUACA", "UUUUUUU", "UUUCAUCAUU", "CCACACAC", "CUCAUUGGAG", "ACUUUGGGACA", "CAGGUCAGCA" ) hexamer_counts <- generate_kmers(rna_sequences, 6) # count heptamers in set of DNA sequences dna_sequences <- c( "CAACAGCCTTAATT", "CAGTCAAGACTCC", "CTTTGGGGAAT", "TCATTTTATTAAA", "AATTGGTGTCTGGATACTTCCCTGTACAT", "ATCAAATTA", "AGAT", "GACACTTAAAGATCCT", "TAGCATTAACTTAATG", "ATGGA", "GAAGAGTGCTCA", "ATAGAC", "AGTTC", "CCAGTAA", "TTATTTA", "ATCCTTTACA", "TTTTTTT", "TTTCATCATT", "CCACACAC", "CTCATTGGAG", "ACTTTGGGACA", "CAGGTCAGCA" ) hexamer_counts <- generate_kmers(dna_sequences, 7)
Generates all possible k-mers for a given IUPAC string.
generate_kmers_from_iupac(iupac, k)
generate_kmers_from_iupac(iupac, k)
iupac |
IUPAC string |
k |
length of k-mer, |
IUPAC RNA nucleotide code:
A |
Adenine |
C |
Cytosine |
G |
Guanine |
U |
Uracil |
R |
A or G |
Y |
C or U |
S |
G or C |
W |
A or U |
K |
G or U |
M |
A or C |
B |
C or G or U |
D |
A or G or U |
H |
A or C or U |
V |
A or C or G |
N |
any base |
list of k-mers
http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html
Other motif functions:
generate_iupac_by_kmers()
,
generate_iupac_by_matrix()
,
get_motif_by_id()
,
get_motif_by_rbp()
,
get_motifs()
,
get_motifs_meta_info()
,
get_ppm()
,
init_iupac_lookup_table()
,
set_motifs()
generate_kmers_from_iupac(get_iupac(get_motif_by_id("M178_0.6")[[1]]), k = 6)
generate_kmers_from_iupac(get_iupac(get_motif_by_id("M178_0.6")[[1]]), k = 6)
Calculates k-mer enrichment values for randomly sampled (without replacement) foreground sets.
generate_permuted_enrichments( n_transcripts_foreground, background_set, k, n_permutations = 1000, n_cores = 4 )
generate_permuted_enrichments( n_transcripts_foreground, background_set, k, n_permutations = 1000, n_cores = 4 )
n_transcripts_foreground |
number of transcripts in the original foreground set |
background_set |
character vector of DNA or RNA sequences that constitute the background set |
k |
length of k-mer, either |
n_permutations |
number of permutations to perform |
n_cores |
number of computing cores to use |
The result of calculate_kmer_enrichment
for the
random foreground sets.
Other k-mer functions:
calculate_kmer_enrichment()
,
check_kmers()
,
compute_kmer_enrichment()
,
count_homopolymer_corrected_kmers()
,
draw_volcano_plot()
,
estimate_significance()
,
estimate_significance_core()
,
generate_kmers()
,
run_kmer_spma()
,
run_kmer_tsma()
Calculates the geometric mean of the specified values.
geometric_mean(x, na_rm = TRUE)
geometric_mean(x, na_rm = TRUE)
x |
numeric vector of values for which the geometric mean will be computed |
na_rm |
logical. Should missing values (including NaN) be removed? |
Geometric mean of x
or 1
if length of x
is 0
geometric_mean(c(0.123, 0.441, 0.83))
geometric_mean(c(0.123, 0.441, 0.83))
Retrieves one or more motif objects identified by motif id.
get_motif_by_id(id)
get_motif_by_id(id)
id |
character vector of motif identifiers |
A list of objects of class RBPMotif
Other motif functions:
generate_iupac_by_kmers()
,
generate_iupac_by_matrix()
,
generate_kmers_from_iupac()
,
get_motif_by_rbp()
,
get_motifs()
,
get_motifs_meta_info()
,
get_ppm()
,
init_iupac_lookup_table()
,
set_motifs()
get_motif_by_id("M178_0.6") get_motif_by_id(c("M178_0.6", "M188_0.6"))
get_motif_by_id("M178_0.6") get_motif_by_id(c("M178_0.6", "M188_0.6"))
Retrieves one or more motif objects identified by gene symbol.
get_motif_by_rbp(rbp)
get_motif_by_rbp(rbp)
rbp |
character vector of gene symbols of RNA-binding proteins |
A list of objects of class RBPMotif
Other motif functions:
generate_iupac_by_kmers()
,
generate_iupac_by_matrix()
,
generate_kmers_from_iupac()
,
get_motif_by_id()
,
get_motifs()
,
get_motifs_meta_info()
,
get_ppm()
,
init_iupac_lookup_table()
,
set_motifs()
get_motif_by_rbp("ELAVL1") get_motif_by_rbp(c("ELAVL1", "ELAVL2"))
get_motif_by_rbp("ELAVL1") get_motif_by_rbp(c("ELAVL1", "ELAVL2"))
Retrieves all Transite motifs
get_motifs()
get_motifs()
A list of objects of class Motif
Other motif functions:
generate_iupac_by_kmers()
,
generate_iupac_by_matrix()
,
generate_kmers_from_iupac()
,
get_motif_by_id()
,
get_motif_by_rbp()
,
get_motifs_meta_info()
,
get_ppm()
,
init_iupac_lookup_table()
,
set_motifs()
transite_motifs <- get_motifs()
transite_motifs <- get_motifs()
Generates a data frame with meta information about all Transite motifs.
get_motifs_meta_info()
get_motifs_meta_info()
A data frame containing meta information for all Transite motifs, with the following columns:
id
rbps
length
iupac
type
species
src
Other motif functions:
generate_iupac_by_kmers()
,
generate_iupac_by_matrix()
,
generate_kmers_from_iupac()
,
get_motif_by_id()
,
get_motif_by_rbp()
,
get_motifs()
,
get_ppm()
,
init_iupac_lookup_table()
,
set_motifs()
get_motifs_meta_info()
get_motifs_meta_info()
Return the position probability matrix of the specified motif.
get_ppm(motif)
get_ppm(motif)
motif |
object of class |
The position probability matrix of the specified motif
Other motif functions:
generate_iupac_by_kmers()
,
generate_iupac_by_matrix()
,
generate_kmers_from_iupac()
,
get_motif_by_id()
,
get_motif_by_rbp()
,
get_motifs()
,
get_motifs_meta_info()
,
init_iupac_lookup_table()
,
set_motifs()
get_ppm(get_motif_by_id("M178_0.6")[[1]])
get_ppm(get_motif_by_id("M178_0.6")[[1]])
Initializes a hash table that serves as a IUPAC lookup table for the
generate_iupac_by_matrix
function.
init_iupac_lookup_table()
init_iupac_lookup_table()
IUPAC RNA nucleotide code:
A |
Adenine |
C |
Cytosine |
G |
Guanine |
U |
Uracil |
R |
A or G |
Y |
C or U |
S |
G or C |
W |
A or U |
K |
G or U |
M |
A or C |
B |
C or G or U |
D |
A or G or U |
H |
A or C or U |
V |
A or C or G |
N |
any base |
an environment, the IUPAC lookup hash table
http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html
Other motif functions:
generate_iupac_by_kmers()
,
generate_iupac_by_matrix()
,
generate_kmers_from_iupac()
,
get_motif_by_id()
,
get_motif_by_rbp()
,
get_motifs()
,
get_motifs_meta_info()
,
get_ppm()
,
set_motifs()
generate_iupac_by_matrix(get_motif_matrix(get_motif_by_id("M178_0.6")[[1]]), code = init_iupac_lookup_table())
generate_iupac_by_matrix(get_motif_matrix(get_motif_by_id("M178_0.6")[[1]]), code = init_iupac_lookup_table())
This data frame with k-mer enrichment data (as produced by
run_kmer_tsma
) is used in a code example for k-mer volcano plot
function draw_volcano_plot
.
data(kmers_enrichment)
data(kmers_enrichment)
A data frame with the following columns:
kmer |
contains all hexamers (AAAAAA to UUUUUU) |
foreground_count |
absolute k-mer frequency in foreground set |
background_count |
absolute k-mer frequency in background set |
enrichment |
enrichment of k-mer in foreground relative to background |
p_value |
associated p-value of enrichment |
adj_p_value |
multiple testing corrected p-value |
The Transite motif database contains sequence motifs and associated k-mers of more than 100 different RNA-binding proteins, obtained from publicly available motif databases.
data(motifs)
data(motifs)
A list of lists with the following components:
id |
motif id |
rbps |
gene symbols of RNA-binding proteins associated with motif |
matrix |
data frame of sequence motif (position weight matrix) |
hexamers |
all motif-associated hexamers |
heptamers |
all motif-associated heptamers |
length |
length of motif in nucleotides |
iupac |
IUPAC string of sequence motif |
type |
type of motif, e.g., RNAcompete |
species |
usually human |
src |
source of motif, e.g., RNA Zoo |
http://cisbp-rna.ccbr.utoronto.ca/
http://rbpdb.ccbr.utoronto.ca/
p_combine
is used to combine the p-values of independent
significance tests.
p_combine(p, method = c("fisher", "SL", "MG", "tippett"), w = NULL)
p_combine(p, method = c("fisher", "SL", "MG", "tippett"), w = NULL)
p |
vector of p-values |
method |
one of the following: Fisher (1932)
( |
w |
weights, only used in combination with Stouffer-Liptak.
If |
The problem can be specified as follows: Given a vector of
p-values
, find
, the combined p-value of the
significance tests.
Most of the methods introduced here combine the p-values in order to obtain
a test statistic, which follows a known probability distribution.
The general procedure can be stated as:
The function , which returns the test statistic
, takes
two arguments.
is a function defined on the interval
that transforms
the individual
p-values, and
is a correction term.
Fisher's method (1932), also known as the inverse chi-square method is
probably the most widely
used method
for combining p-values. Fisher used the fact that if is
uniformly distributed
(which p-values are under the null hypothesis), then
follows a chi-square
distribution with two degrees of freedom. Therefore, if p-values are
transformed as follows,
and the correction term is neutral, i.e., equals
, the
following statement can be
made about the sampling distribution of the test statistic
under the null hypothesis:
is distributed as chi-square with
degrees of freedom,
where
is the number of p-values.
Stouffer's method, or the inverse normal method, uses a p-value
transformation
function that
leads to a test statistic that follows the standard normal
distribution by transforming
each p-value to its corresponding normal score. The correction
term scales the sum of the normal
scores by the root of the number of p-values.
Under the null hypothesis, is distributed as standard normal.
is the inverse of the cumulative standard
normal distribution function.
An extension of Stouffer's method with weighted p-values is called Liptak's method.
The logit method by Mudholkar and George uses the following transformation:
When the sum of the transformed p-values is corrected in the following way:
the test statistic is approximately t-distributed with
degrees of freedom.
In Tippett's method the smallest p-value is used as the test
statistic and the
combined significance is calculated as follows:
A list with the following components:
statistic |
the test statistic |
p_value |
the corresponding p-value |
method |
the method used |
statistic_name |
the name of the test statistic |
p_combine(c(0.01, 0.05, 0.5)) p_combine(c(0.01, 0.05, 0.5), method = "tippett")
p_combine(c(0.01, 0.05, 0.5)) p_combine(c(0.01, 0.05, 0.5), method = "tippett")
An S4 class to represent a RBPMotif
Getter Method get_id
Getter Method get_rbps
Getter Method get_motif_matrix
Getter Method get_hexamers
Getter Method get_heptamers
Getter Method get_width
Getter Method get_iupac
Getter Method get_type
Getter Method get_species
Getter Method get_source
get_id(object) ## S4 method for signature 'RBPMotif' get_id(object) get_rbps(object) ## S4 method for signature 'RBPMotif' get_rbps(object) get_motif_matrix(object) ## S4 method for signature 'RBPMotif' get_motif_matrix(object) get_hexamers(object) ## S4 method for signature 'RBPMotif' get_hexamers(object) get_heptamers(object) ## S4 method for signature 'RBPMotif' get_heptamers(object) get_width(object) ## S4 method for signature 'RBPMotif' get_width(object) get_iupac(object) ## S4 method for signature 'RBPMotif' get_iupac(object) get_type(object) ## S4 method for signature 'RBPMotif' get_type(object) get_species(object) ## S4 method for signature 'RBPMotif' get_species(object) get_source(object) ## S4 method for signature 'RBPMotif' get_source(object) ## S4 method for signature 'RBPMotif' show(object)
get_id(object) ## S4 method for signature 'RBPMotif' get_id(object) get_rbps(object) ## S4 method for signature 'RBPMotif' get_rbps(object) get_motif_matrix(object) ## S4 method for signature 'RBPMotif' get_motif_matrix(object) get_hexamers(object) ## S4 method for signature 'RBPMotif' get_hexamers(object) get_heptamers(object) ## S4 method for signature 'RBPMotif' get_heptamers(object) get_width(object) ## S4 method for signature 'RBPMotif' get_width(object) get_iupac(object) ## S4 method for signature 'RBPMotif' get_iupac(object) get_type(object) ## S4 method for signature 'RBPMotif' get_type(object) get_species(object) ## S4 method for signature 'RBPMotif' get_species(object) get_source(object) ## S4 method for signature 'RBPMotif' get_source(object) ## S4 method for signature 'RBPMotif' show(object)
object |
RBPMotif object |
Object of type RBPMotif
id
motif id (character vector of length 1)
rbps
character vector of names of RNA-binding proteins associated with this motif
matrix
data frame with four columns (A, C, G, U) and 6 - 15 rows (positions), where cell (i, j) contains weight of nucleotide j on position i
hexamers
character vector of hexamers associated with this motif
heptamers
character vector of heptamers associated with this motif
length
length of the motif (i.e., nrow(matrix)
)
iupac
IUPAC code for motif matrix
(see generate_iupac_by_matrix
)
type
type of motif (e.g., 'HITS-CLIP'
, 'EMSA'
,
'SELEX'
, etc.)
species
species where motif was discovered (e.g.,
'Homo sapiens'
)
src
source of motif (e.g., 'RBPDB v1.3.1'
)
kmers <- c("AAAAAAA", "CAAAAAA") iupac <- generate_iupac_by_kmers(kmers, code = init_iupac_lookup_table()) hexamers <- generate_kmers_from_iupac(iupac, 6) heptamers <- generate_kmers_from_iupac(iupac, 7) new("RBPMotif", id = "custom_motif", rbps = "RBP1", matrix = NULL, hexamers = hexamers, heptamers = heptamers, length = 7L, iupac = iupac, type = "HITS-CLIP", species = "Homo sapiens", src = "user" )
kmers <- c("AAAAAAA", "CAAAAAA") iupac <- generate_iupac_by_kmers(kmers, code = init_iupac_lookup_table()) hexamers <- generate_kmers_from_iupac(iupac, 6) heptamers <- generate_kmers_from_iupac(iupac, 7) new("RBPMotif", id = "custom_motif", rbps = "RBP1", matrix = NULL, hexamers = hexamers, heptamers = heptamers, length = 7L, iupac = iupac, type = "HITS-CLIP", species = "Homo sapiens", src = "user" )
SPMA helps to illuminate the relationship between RBP binding evidence and the transcript sorting criterion, e.g., fold change between treatment and control samples.
run_kmer_spma( sorted_transcript_sequences, sorted_transcript_values = NULL, transcript_values_label = "transcript value", motifs = NULL, k = 6, n_bins = 40, midpoint = 0, x_value_limits = NULL, max_model_degree = 1, max_cs_permutations = 1e+07, min_cs_permutations = 5000, fg_permutations = 5000, p_adjust_method = "BH", p_combining_method = "fisher", n_cores = 1 )
run_kmer_spma( sorted_transcript_sequences, sorted_transcript_values = NULL, transcript_values_label = "transcript value", motifs = NULL, k = 6, n_bins = 40, midpoint = 0, x_value_limits = NULL, max_model_degree = 1, max_cs_permutations = 1e+07, min_cs_permutations = 5000, fg_permutations = 5000, p_adjust_method = "BH", p_combining_method = "fisher", n_cores = 1 )
sorted_transcript_sequences |
character vector of ranked sequences,
either DNA
(only containing upper case characters A, C, G, T) or RNA (A, C, G, U).
The sequences in |
sorted_transcript_values |
vector of sorted transcript values, i.e.,
the fold change or signal-to-noise ratio or any other quantity that was used
to sort the transcripts that were passed to |
transcript_values_label |
label of transcript sorting criterion
(e.g., |
motifs |
a list of motifs that is used to score the specified sequences.
If |
k |
length of k-mer, either |
n_bins |
specifies the number of bins in which the sequences will be divided, valid values are between 7 and 100 |
midpoint |
for enrichment values the midpoint should be |
x_value_limits |
sets limits of the x-value color scale (used to
harmonize color scales of different spectrum plots), see |
max_model_degree |
maximum degree of polynomial |
max_cs_permutations |
maximum number of permutations performed in Monte Carlo test for consistency score |
min_cs_permutations |
minimum number of permutations performed in Monte Carlo test for consistency score |
fg_permutations |
numer of foreground permutations |
p_adjust_method |
see |
p_combining_method |
one of the following: Fisher (1932)
( |
n_cores |
number of computing cores to use |
In order to investigate how motif targets are distributed across a spectrum of transcripts (e.g., all transcripts of a platform, ordered by fold change), Spectrum Motif Analysis visualizes the gradient of RBP binding evidence across all transcripts.
The k-mer-based approach differs from the matrix-based approach by how the sequences are scored. Here, sequences are broken into k-mers, i.e., oligonucleotide sequences of k bases. And only statistically significantly enriched or depleted k-mers are then used to calculate a score for each RNA-binding protein, which quantifies its target overrepresentation.
A list with the following components:
foreground_scores |
the result of run_kmer_tsma
for the binned data |
spectrum_info_df |
a data frame with the SPMA results |
spectrum_plots |
a list of spectrum plots, as generated by
score_spectrum
|
classifier_scores |
a list of classifier scores, as returned by
classify_spectrum
|
Other SPMA functions:
classify_spectrum()
,
run_matrix_spma()
,
score_spectrum()
,
subdivide_data()
Other k-mer functions:
calculate_kmer_enrichment()
,
check_kmers()
,
compute_kmer_enrichment()
,
count_homopolymer_corrected_kmers()
,
draw_volcano_plot()
,
estimate_significance()
,
estimate_significance_core()
,
generate_kmers()
,
generate_permuted_enrichments()
,
run_kmer_tsma()
# example data set background_df <- transite:::ge$background_df # sort sequences by signal-to-noise ratio background_df <- dplyr::arrange(background_df, value) # character vector of named and ranked (by signal-to-noise ratio) sequences background_seqs <- gsub("T", "U", background_df$seq) names(background_seqs) <- paste0(background_df$refseq, "|", background_df$seq_type) results <- run_kmer_spma(background_seqs, sorted_transcript_values = background_df$value, transcript_values_label = "signal-to-noise ratio", motifs = get_motif_by_id("M178_0.6"), n_bins = 20, fg_permutations = 10) ## Not run: results <- run_kmer_spma(background_seqs, sorted_transcript_values = background_df$value, transcript_values_label = "signal-to-noise ratio") ## End(Not run)
# example data set background_df <- transite:::ge$background_df # sort sequences by signal-to-noise ratio background_df <- dplyr::arrange(background_df, value) # character vector of named and ranked (by signal-to-noise ratio) sequences background_seqs <- gsub("T", "U", background_df$seq) names(background_seqs) <- paste0(background_df$refseq, "|", background_df$seq_type) results <- run_kmer_spma(background_seqs, sorted_transcript_values = background_df$value, transcript_values_label = "signal-to-noise ratio", motifs = get_motif_by_id("M178_0.6"), n_bins = 20, fg_permutations = 10) ## Not run: results <- run_kmer_spma(background_seqs, sorted_transcript_values = background_df$value, transcript_values_label = "signal-to-noise ratio") ## End(Not run)
Calculates the enrichment of putative binding sites in foreground sets versus a background set using k-mers to identify putative binding sites
run_kmer_tsma( foreground_sets, background_set, motifs = NULL, k = 6, fg_permutations = 5000, kmer_significance_threshold = 0.01, produce_plot = TRUE, p_adjust_method = "BH", p_combining_method = "fisher", n_cores = 1 )
run_kmer_tsma( foreground_sets, background_set, motifs = NULL, k = 6, fg_permutations = 5000, kmer_significance_threshold = 0.01, produce_plot = TRUE, p_adjust_method = "BH", p_combining_method = "fisher", n_cores = 1 )
foreground_sets |
list of foreground sets; a foreground set is a
character vector of
DNA or RNA sequences (not both) and a strict subset of the
|
background_set |
character vector of DNA or RNA sequences that constitute the background set |
motifs |
a list of motifs that is used to score the specified sequences.
If |
k |
length of k-mer, either |
fg_permutations |
numer of foreground permutations |
kmer_significance_threshold |
p-value threshold for significance,
e.g., |
produce_plot |
if |
p_adjust_method |
see |
p_combining_method |
one of the following: Fisher (1932)
( |
n_cores |
number of computing cores to use |
Motif transcript set analysis can be used to identify RNA binding proteins, whose targets are significantly overrepresented or underrepresented in certain sets of transcripts.
The aim of Transcript Set Motif Analysis (TSMA) is to identify the overrepresentation and underrepresentation of potential RBP targets (binding sites) in a set (or sets) of sequences, i.e., the foreground set, relative to the entire population of sequences. The latter is called background set, which can be composed of all sequences of the genes of a microarray platform or all sequences of an organism or any other meaningful superset of the foreground sets.
The k-mer-based approach breaks the sequences of foreground and background sets into k-mers and calculates the enrichment on a k-mer level. In this case, motifs are not represented as position weight matrices, but as lists of k-mers.
Statistically significantly enriched or depleted k-mers are then used to calculate a score for each RNA-binding protein, which quantifies its target overrepresentation.
A list of lists (one for each transcript set) with the following components:
enrichment_df |
the result of
compute_kmer_enrichment |
motif_df |
|
motif_kmers_dfs |
|
volcano_plots |
volcano plots for each
motif (see draw_volcano_plot ) |
perm_test_plots |
plots of the empirical distribution of k-mer enrichment values for each motif |
enriched_kmers_combined_p_values |
|
depleted_kmers_combined_p_values |
Other TSMA functions:
draw_volcano_plot()
,
run_matrix_tsma()
Other k-mer functions:
calculate_kmer_enrichment()
,
check_kmers()
,
compute_kmer_enrichment()
,
count_homopolymer_corrected_kmers()
,
draw_volcano_plot()
,
estimate_significance()
,
estimate_significance_core()
,
generate_kmers()
,
generate_permuted_enrichments()
,
run_kmer_spma()
# define simple sequence sets for foreground and background foreground_set1 <- c( "CAACAGCCUUAAUU", "CAGUCAAGACUCC", "CUUUGGGGAAU", "UCAUUUUAUUAAA", "AAUUGGUGUCUGGAUACUUCCCUGUACAU", "AUCAAAUUA", "AGAU", "GACACUUAAAGAUCCU", "UAGCAUUAACUUAAUG", "AUGGA", "GAAGAGUGCUCA", "AUAGAC", "AGUUC", "CCAGUAA" ) foreground_set2 <- c("UUAUUUA", "AUCCUUUACA", "UUUUUUU", "UUUCAUCAUU") foreground_sets <- list(foreground_set1, foreground_set2) background_set <- unique(c(foreground_set1, foreground_set2, c( "CCACACAC", "CUCAUUGGAG", "ACUUUGGGACA", "CAGGUCAGCA", "CCACACCGG", "GUCAUCAGU", "GUCAGUCC", "CAGGUCAGGGGCA" ))) # run k-mer based TSMA with all Transite motifs (recommended): # results <- run_kmer_tsma(foreground_sets, background_set) # run TSMA with one motif: motif_db <- get_motif_by_id("M178_0.6") results <- run_kmer_tsma(foreground_sets, background_set, motifs = motif_db) ## Not run: # define example sequence sets for foreground and background foreground_set1 <- gsub("T", "U", transite:::ge$foreground1_df$seq) foreground_set2 <- gsub("T", "U", transite:::ge$foreground2_df$seq) foreground_sets <- list(foreground_set1, foreground_set2) background_set <- gsub("T", "U", transite:::ge$background_df$seq) # run TSMA with all Transite motifs results <- run_kmer_tsma(foreground_sets, background_set) # run TSMA with a subset of Transite motifs results <- run_kmer_tsma(foreground_sets, background_set, motifs = get_motif_by_rbp("ELAVL1")) # run TSMA with user-defined motif toy_motif <- create_kmer_motif( "toy_motif", "example RBP", c("AACCGG", "AAAACG", "AACACG"), "example type", "example species", "user" ) results <- run_matrix_tsma(foreground_sets, background_set, motifs = list(toy_motif)) ## End(Not run)
# define simple sequence sets for foreground and background foreground_set1 <- c( "CAACAGCCUUAAUU", "CAGUCAAGACUCC", "CUUUGGGGAAU", "UCAUUUUAUUAAA", "AAUUGGUGUCUGGAUACUUCCCUGUACAU", "AUCAAAUUA", "AGAU", "GACACUUAAAGAUCCU", "UAGCAUUAACUUAAUG", "AUGGA", "GAAGAGUGCUCA", "AUAGAC", "AGUUC", "CCAGUAA" ) foreground_set2 <- c("UUAUUUA", "AUCCUUUACA", "UUUUUUU", "UUUCAUCAUU") foreground_sets <- list(foreground_set1, foreground_set2) background_set <- unique(c(foreground_set1, foreground_set2, c( "CCACACAC", "CUCAUUGGAG", "ACUUUGGGACA", "CAGGUCAGCA", "CCACACCGG", "GUCAUCAGU", "GUCAGUCC", "CAGGUCAGGGGCA" ))) # run k-mer based TSMA with all Transite motifs (recommended): # results <- run_kmer_tsma(foreground_sets, background_set) # run TSMA with one motif: motif_db <- get_motif_by_id("M178_0.6") results <- run_kmer_tsma(foreground_sets, background_set, motifs = motif_db) ## Not run: # define example sequence sets for foreground and background foreground_set1 <- gsub("T", "U", transite:::ge$foreground1_df$seq) foreground_set2 <- gsub("T", "U", transite:::ge$foreground2_df$seq) foreground_sets <- list(foreground_set1, foreground_set2) background_set <- gsub("T", "U", transite:::ge$background_df$seq) # run TSMA with all Transite motifs results <- run_kmer_tsma(foreground_sets, background_set) # run TSMA with a subset of Transite motifs results <- run_kmer_tsma(foreground_sets, background_set, motifs = get_motif_by_rbp("ELAVL1")) # run TSMA with user-defined motif toy_motif <- create_kmer_motif( "toy_motif", "example RBP", c("AACCGG", "AAAACG", "AACACG"), "example type", "example species", "user" ) results <- run_matrix_tsma(foreground_sets, background_set, motifs = list(toy_motif)) ## End(Not run)
SPMA helps to illuminate the relationship between RBP binding evidence and the transcript sorting criterion, e.g., fold change between treatment and control samples.
run_matrix_spma( sorted_transcript_sequences, sorted_transcript_values = NULL, transcript_values_label = "transcript value", motifs = NULL, n_bins = 40, midpoint = 0, x_value_limits = NULL, max_model_degree = 1, max_cs_permutations = 1e+07, min_cs_permutations = 5000, max_hits = 5, threshold_method = "p_value", threshold_value = 0.25^6, max_fg_permutations = 1e+06, min_fg_permutations = 1000, e = 5, p_adjust_method = "BH", n_cores = 1, cache = paste0(tempdir(), "/sc/") )
run_matrix_spma( sorted_transcript_sequences, sorted_transcript_values = NULL, transcript_values_label = "transcript value", motifs = NULL, n_bins = 40, midpoint = 0, x_value_limits = NULL, max_model_degree = 1, max_cs_permutations = 1e+07, min_cs_permutations = 5000, max_hits = 5, threshold_method = "p_value", threshold_value = 0.25^6, max_fg_permutations = 1e+06, min_fg_permutations = 1000, e = 5, p_adjust_method = "BH", n_cores = 1, cache = paste0(tempdir(), "/sc/") )
sorted_transcript_sequences |
named character vector of ranked sequences
(only containing upper case characters A, C, G, T), where the
names are RefSeq identifiers and sequence
type qualifiers ( |
sorted_transcript_values |
vector of sorted transcript values, i.e.,
the fold change or signal-to-noise ratio or any other quantity that was used
to sort the transcripts that were passed to |
transcript_values_label |
label of transcript sorting criterion
(e.g., |
motifs |
a list of motifs that is used to score the specified sequences.
If |
n_bins |
specifies the number of bins in which the sequences will be divided, valid values are between 7 and 100 |
midpoint |
for enrichment values the midpoint should be |
x_value_limits |
sets limits of the x-value color scale (used to
harmonize color scales of different spectrum plots), see |
max_model_degree |
maximum degree of polynomial |
max_cs_permutations |
maximum number of permutations performed in Monte Carlo test for consistency score |
min_cs_permutations |
minimum number of permutations performed in Monte Carlo test for consistency score |
max_hits |
maximum number of putative binding sites per mRNA that are counted |
threshold_method |
either |
threshold_value |
semantics of the |
max_fg_permutations |
maximum number of foreground permutations performed in Monte Carlo test for enrichment score |
min_fg_permutations |
minimum number of foreground permutations performed in Monte Carlo test for enrichment score |
e |
integer-valued stop criterion for enrichment score Monte Carlo
test: aborting
permutation process after
observing |
p_adjust_method |
adjustment of p-values from Monte Carlo tests to
avoid alpha error
accumulation, see |
n_cores |
the number of cores that are used |
cache |
either logical or path to a directory where scores are cached.
The scores of each
motif are stored in a
separate file that contains a hash table with RefSeq identifiers and
sequence type
qualifiers as keys and the number of putative binding sites as values.
If |
In order to investigate how motif targets are distributed across a spectrum of transcripts (e.g., all transcripts of a platform, ordered by fold change), Spectrum Motif Analysis visualizes the gradient of RBP binding evidence across all transcripts.
The matrix-based approach skips the k-merization step of the k-mer-based approach and instead scores the transcript sequence as a whole with a position specific scoring matrix.
For each sequence in foreground and background sets and each sequence motif, the scoring algorithm evaluates the score for each sequence position. Positions with a relative score greater than a certain threshold are considered hits, i.e., putative binding sites.
By scoring all sequences in foreground and background sets, a hit count for each motif and each set is obtained, which is used to calculate enrichment values and associated p-values in the same way in which motif-compatible hexamer enrichment values are calculated in the k-mer-based approach. P-values are adjusted with one of the available adjustment methods.
An advantage of the matrix-based approach is the possibility of detecting clusters of binding sites. This can be done by counting regions with many hits using positional hit information or by simply applying a hit count threshold per sequence, e.g., only sequences with more than some number of hits are considered. Homotypic clusters of RBP binding sites may play a similar role as clusters of transcription factors.
A list with the following components:
foreground_scores |
the result of score_transcripts
for the foreground
sets (the bins) |
background_scores |
the result of score_transcripts
for the background
set |
enrichment_dfs |
a list of data frames, returned by
calculate_motif_enrichment
|
spectrum_info_df |
a data frame with the SPMA results |
spectrum_plots |
a list of spectrum plots, as generated by
score_spectrum
|
classifier_scores |
a list of classifier scores, as returned by
classify_spectrum
|
Other SPMA functions:
classify_spectrum()
,
run_kmer_spma()
,
score_spectrum()
,
subdivide_data()
Other matrix functions:
calculate_motif_enrichment()
,
run_matrix_tsma()
,
score_transcripts()
,
score_transcripts_single_motif()
# example data set background_df <- transite:::ge$background_df # sort sequences by signal-to-noise ratio background_df <- dplyr::arrange(background_df, value) # character vector of named and ranked (by signal-to-noise ratio) sequences background_seqs <- gsub("T", "U", background_df$seq) names(background_seqs) <- paste0(background_df$refseq, "|", background_df$seq_type) results <- run_matrix_spma(background_seqs, sorted_transcript_values = background_df$value, transcript_values_label = "signal-to-noise ratio", motifs = get_motif_by_id("M178_0.6"), n_bins = 20, max_fg_permutations = 10000) ## Not run: results <- run_matrix_spma(background_seqs, sorted_transcript_values = background_df$value, transcript_values_label = "SNR") ## End(Not run)
# example data set background_df <- transite:::ge$background_df # sort sequences by signal-to-noise ratio background_df <- dplyr::arrange(background_df, value) # character vector of named and ranked (by signal-to-noise ratio) sequences background_seqs <- gsub("T", "U", background_df$seq) names(background_seqs) <- paste0(background_df$refseq, "|", background_df$seq_type) results <- run_matrix_spma(background_seqs, sorted_transcript_values = background_df$value, transcript_values_label = "signal-to-noise ratio", motifs = get_motif_by_id("M178_0.6"), n_bins = 20, max_fg_permutations = 10000) ## Not run: results <- run_matrix_spma(background_seqs, sorted_transcript_values = background_df$value, transcript_values_label = "SNR") ## End(Not run)
Calculates motif enrichment in foreground sets versus a background set using position weight matrices to identify putative binding sites
run_matrix_tsma( foreground_sets, background_set, motifs = NULL, max_hits = 5, threshold_method = "p_value", threshold_value = 0.25^6, max_fg_permutations = 1e+06, min_fg_permutations = 1000, e = 5, p_adjust_method = "BH", n_cores = 1, cache = paste0(tempdir(), "/sc/") )
run_matrix_tsma( foreground_sets, background_set, motifs = NULL, max_hits = 5, threshold_method = "p_value", threshold_value = 0.25^6, max_fg_permutations = 1e+06, min_fg_permutations = 1000, e = 5, p_adjust_method = "BH", n_cores = 1, cache = paste0(tempdir(), "/sc/") )
foreground_sets |
a list of named character vectors of
foreground sequences
(only containing upper case characters A, C, G, T), where the
names are RefSeq identifiers
and sequence
type qualifiers ( |
background_set |
a named character vector of background sequences (naming follows same rules as foreground set sequences) |
motifs |
a list of motifs that is used to score the specified sequences.
If |
max_hits |
maximum number of putative binding sites per mRNA that are counted |
threshold_method |
either |
threshold_value |
semantics of the |
max_fg_permutations |
maximum number of foreground permutations performed in Monte Carlo test for enrichment score |
min_fg_permutations |
minimum number of foreground permutations performed in Monte Carlo test for enrichment score |
e |
integer-valued stop criterion for enrichment score Monte Carlo
test: aborting
permutation process after
observing |
p_adjust_method |
adjustment of p-values from Monte Carlo tests to
avoid alpha error
accumulation, see |
n_cores |
the number of cores that are used |
cache |
either logical or path to a directory where scores are cached.
The scores of each
motif are stored in a
separate file that contains a hash table with RefSeq identifiers and
sequence type
qualifiers as keys and the number of putative binding sites as values.
If |
Motif transcript set analysis can be used to identify RNA binding proteins, whose targets are significantly overrepresented or underrepresented in certain sets of transcripts.
The aim of Transcript Set Motif Analysis (TSMA) is to identify the overrepresentation and underrepresentation of potential RBP targets (binding sites) in a set (or sets) of sequences, i.e., the foreground set, relative to the entire population of sequences. The latter is called background set, which can be composed of all sequences of the genes of a microarray platform or all sequences of an organism or any other meaningful superset of the foreground sets.
The matrix-based approach skips the k-merization step of the k-mer-based approach and instead scores the transcript sequence as a whole with a position specific scoring matrix.
For each sequence in foreground and background sets and each sequence motif, the scoring algorithm evaluates the score for each sequence position. Positions with a relative score greater than a certain threshold are considered hits, i.e., putative binding sites.
By scoring all sequences in foreground and background sets, a hit count for each motif and each set is obtained, which is used to calculate enrichment values and associated p-values in the same way in which motif-compatible hexamer enrichment values are calculated in the k -mer-based approach. P-values are adjusted with one of the available adjustment methods.
An advantage of the matrix-based approach is the possibility of detecting clusters of binding sites. This can be done by counting regions with many hits using positional hit information or by simply applying a hit count threshold per sequence, e.g., only sequences with more than some number of hits are considered. Homotypic clusters of RBP binding sites may play a similar role as clusters of transcription factors.
A list with the following components:
foreground_scores |
the result of score_transcripts
for the foreground
sets |
background_scores |
the result of score_transcripts
for the background
set |
enrichment_dfs |
a list of data frames, returned by
calculate_motif_enrichment
|
Other TSMA functions:
draw_volcano_plot()
,
run_kmer_tsma()
Other matrix functions:
calculate_motif_enrichment()
,
run_matrix_spma()
,
score_transcripts()
,
score_transcripts_single_motif()
# define simple sequence sets for foreground and background foreground_set1 <- c( "CAACAGCCUUAAUU", "CAGUCAAGACUCC", "CUUUGGGGAAU", "UCAUUUUAUUAAA", "AAUUGGUGUCUGGAUACUUCCCUGUACAU", "AUCAAAUUA", "AGAU", "GACACUUAAAGAUCCU", "UAGCAUUAACUUAAUG", "AUGGA", "GAAGAGUGCUCA", "AUAGAC", "AGUUC", "CCAGUAA" ) names(foreground_set1) <- c( "NM_1_DUMMY|3UTR", "NM_2_DUMMY|3UTR", "NM_3_DUMMY|3UTR", "NM_4_DUMMY|3UTR", "NM_5_DUMMY|3UTR", "NM_6_DUMMY|3UTR", "NM_7_DUMMY|3UTR", "NM_8_DUMMY|3UTR", "NM_9_DUMMY|3UTR", "NM_10_DUMMY|3UTR", "NM_11_DUMMY|3UTR", "NM_12_DUMMY|3UTR", "NM_13_DUMMY|3UTR", "NM_14_DUMMY|3UTR" ) foreground_set2 <- c("UUAUUUA", "AUCCUUUACA", "UUUUUUU", "UUUCAUCAUU") names(foreground_set2) <- c( "NM_15_DUMMY|3UTR", "NM_16_DUMMY|3UTR", "NM_17_DUMMY|3UTR", "NM_18_DUMMY|3UTR" ) foreground_sets <- list(foreground_set1, foreground_set2) background_set <- c( "CAACAGCCUUAAUU", "CAGUCAAGACUCC", "CUUUGGGGAAU", "UCAUUUUAUUAAA", "AAUUGGUGUCUGGAUACUUCCCUGUACAU", "AUCAAAUUA", "AGAU", "GACACUUAAAGAUCCU", "UAGCAUUAACUUAAUG", "AUGGA", "GAAGAGUGCUCA", "AUAGAC", "AGUUC", "CCAGUAA", "UUAUUUA", "AUCCUUUACA", "UUUUUUU", "UUUCAUCAUU", "CCACACAC", "CUCAUUGGAG", "ACUUUGGGACA", "CAGGUCAGCA" ) names(background_set) <- c( "NM_1_DUMMY|3UTR", "NM_2_DUMMY|3UTR", "NM_3_DUMMY|3UTR", "NM_4_DUMMY|3UTR", "NM_5_DUMMY|3UTR", "NM_6_DUMMY|3UTR", "NM_7_DUMMY|3UTR", "NM_8_DUMMY|3UTR", "NM_9_DUMMY|3UTR", "NM_10_DUMMY|3UTR", "NM_11_DUMMY|3UTR", "NM_12_DUMMY|3UTR", "NM_13_DUMMY|3UTR", "NM_14_DUMMY|3UTR", "NM_15_DUMMY|3UTR", "NM_16_DUMMY|3UTR", "NM_17_DUMMY|3UTR", "NM_18_DUMMY|3UTR", "NM_19_DUMMY|3UTR", "NM_20_DUMMY|3UTR", "NM_21_DUMMY|3UTR", "NM_22_DUMMY|3UTR" ) # run cached version of TSMA with all Transite motifs (recommended): # results <- run_matrix_tsma(foreground_sets, background_set) # run uncached version with one motif: motif_db <- get_motif_by_id("M178_0.6") results <- run_matrix_tsma(foreground_sets, background_set, motifs = motif_db, cache = FALSE) ## Not run: # define example sequence sets for foreground and background foreground1_df <- transite:::ge$foreground1_df foreground_set1 <- gsub("T", "U", foreground1_df$seq) names(foreground_set1) <- paste0(foreground1_df$refseq, "|", foreground1_df$seq_type) foreground2_df <- transite:::ge$foreground2_df foreground_set2 <- gsub("T", "U", foreground2_df$seq) names(foreground_set2) <- paste0(foreground2_df$refseq, "|", foreground2_df$seq_type) foreground_sets <- list(foreground_set1, foreground_set2) background_df <- transite:::ge$background_df background_set <- gsub("T", "U", background_df$seq) names(background_set) <- paste0(background_df$refseq, "|", background_df$seq_type) # run cached version of TSMA with all Transite motifs (recommended) results <- run_matrix_tsma(foreground_sets, background_set) # run uncached version of TSMA with all Transite motifs results <- run_matrix_tsma(foreground_sets, background_set, cache = FALSE) # run TSMA with a subset of Transite motifs results <- run_matrix_tsma(foreground_sets, background_set, motifs = get_motif_by_rbp("ELAVL1")) # run TSMA with user-defined motif toy_motif <- create_matrix_motif( "toy_motif", "example RBP", toy_motif_matrix, "example type", "example species", "user" ) results <- run_matrix_tsma(foreground_sets, background_set, motifs = list(toy_motif)) ## End(Not run)
# define simple sequence sets for foreground and background foreground_set1 <- c( "CAACAGCCUUAAUU", "CAGUCAAGACUCC", "CUUUGGGGAAU", "UCAUUUUAUUAAA", "AAUUGGUGUCUGGAUACUUCCCUGUACAU", "AUCAAAUUA", "AGAU", "GACACUUAAAGAUCCU", "UAGCAUUAACUUAAUG", "AUGGA", "GAAGAGUGCUCA", "AUAGAC", "AGUUC", "CCAGUAA" ) names(foreground_set1) <- c( "NM_1_DUMMY|3UTR", "NM_2_DUMMY|3UTR", "NM_3_DUMMY|3UTR", "NM_4_DUMMY|3UTR", "NM_5_DUMMY|3UTR", "NM_6_DUMMY|3UTR", "NM_7_DUMMY|3UTR", "NM_8_DUMMY|3UTR", "NM_9_DUMMY|3UTR", "NM_10_DUMMY|3UTR", "NM_11_DUMMY|3UTR", "NM_12_DUMMY|3UTR", "NM_13_DUMMY|3UTR", "NM_14_DUMMY|3UTR" ) foreground_set2 <- c("UUAUUUA", "AUCCUUUACA", "UUUUUUU", "UUUCAUCAUU") names(foreground_set2) <- c( "NM_15_DUMMY|3UTR", "NM_16_DUMMY|3UTR", "NM_17_DUMMY|3UTR", "NM_18_DUMMY|3UTR" ) foreground_sets <- list(foreground_set1, foreground_set2) background_set <- c( "CAACAGCCUUAAUU", "CAGUCAAGACUCC", "CUUUGGGGAAU", "UCAUUUUAUUAAA", "AAUUGGUGUCUGGAUACUUCCCUGUACAU", "AUCAAAUUA", "AGAU", "GACACUUAAAGAUCCU", "UAGCAUUAACUUAAUG", "AUGGA", "GAAGAGUGCUCA", "AUAGAC", "AGUUC", "CCAGUAA", "UUAUUUA", "AUCCUUUACA", "UUUUUUU", "UUUCAUCAUU", "CCACACAC", "CUCAUUGGAG", "ACUUUGGGACA", "CAGGUCAGCA" ) names(background_set) <- c( "NM_1_DUMMY|3UTR", "NM_2_DUMMY|3UTR", "NM_3_DUMMY|3UTR", "NM_4_DUMMY|3UTR", "NM_5_DUMMY|3UTR", "NM_6_DUMMY|3UTR", "NM_7_DUMMY|3UTR", "NM_8_DUMMY|3UTR", "NM_9_DUMMY|3UTR", "NM_10_DUMMY|3UTR", "NM_11_DUMMY|3UTR", "NM_12_DUMMY|3UTR", "NM_13_DUMMY|3UTR", "NM_14_DUMMY|3UTR", "NM_15_DUMMY|3UTR", "NM_16_DUMMY|3UTR", "NM_17_DUMMY|3UTR", "NM_18_DUMMY|3UTR", "NM_19_DUMMY|3UTR", "NM_20_DUMMY|3UTR", "NM_21_DUMMY|3UTR", "NM_22_DUMMY|3UTR" ) # run cached version of TSMA with all Transite motifs (recommended): # results <- run_matrix_tsma(foreground_sets, background_set) # run uncached version with one motif: motif_db <- get_motif_by_id("M178_0.6") results <- run_matrix_tsma(foreground_sets, background_set, motifs = motif_db, cache = FALSE) ## Not run: # define example sequence sets for foreground and background foreground1_df <- transite:::ge$foreground1_df foreground_set1 <- gsub("T", "U", foreground1_df$seq) names(foreground_set1) <- paste0(foreground1_df$refseq, "|", foreground1_df$seq_type) foreground2_df <- transite:::ge$foreground2_df foreground_set2 <- gsub("T", "U", foreground2_df$seq) names(foreground_set2) <- paste0(foreground2_df$refseq, "|", foreground2_df$seq_type) foreground_sets <- list(foreground_set1, foreground_set2) background_df <- transite:::ge$background_df background_set <- gsub("T", "U", background_df$seq) names(background_set) <- paste0(background_df$refseq, "|", background_df$seq_type) # run cached version of TSMA with all Transite motifs (recommended) results <- run_matrix_tsma(foreground_sets, background_set) # run uncached version of TSMA with all Transite motifs results <- run_matrix_tsma(foreground_sets, background_set, cache = FALSE) # run TSMA with a subset of Transite motifs results <- run_matrix_tsma(foreground_sets, background_set, motifs = get_motif_by_rbp("ELAVL1")) # run TSMA with user-defined motif toy_motif <- create_matrix_motif( "toy_motif", "example RBP", toy_motif_matrix, "example type", "example species", "user" ) results <- run_matrix_tsma(foreground_sets, background_set, motifs = list(toy_motif)) ## End(Not run)
C++ implementation of PWM scoring algorithm
score_sequences(sequences, pwm)
score_sequences(sequences, pwm)
sequences |
list of sequences |
pwm |
position weight matrix |
list of PWM scores for each sequence
motif <- get_motif_by_id("M178_0.6")[[1]] sequences <- c("CAACAGCCUUAAUU", "CAGUCAAGACUCC", "CUUUGGGGAAU", "UCAUUUUAUUAAA", "AAUUGGUGUCUGGAUACUUCCCUGUACAU", "AUCAAAUUA", "UGUGGGG", "GACACUUAAAGAUCCU", "UAGCAUUAACUUAAUG", "AUGGA", "GAAGAGUGCUCA", "AUAGAC", "AGUUC", "CCAGUAA") seq_char_vectors <- lapply(sequences, function(seq) { unlist(strsplit(seq, "")) }) score_sequences(seq_char_vectors, as.matrix(get_motif_matrix(motif)))
motif <- get_motif_by_id("M178_0.6")[[1]] sequences <- c("CAACAGCCUUAAUU", "CAGUCAAGACUCC", "CUUUGGGGAAU", "UCAUUUUAUUAAA", "AAUUGGUGUCUGGAUACUUCCCUGUACAU", "AUCAAAUUA", "UGUGGGG", "GACACUUAAAGAUCCU", "UAGCAUUAACUUAAUG", "AUGGA", "GAAGAGUGCUCA", "AUAGAC", "AGUUC", "CCAGUAA") seq_char_vectors <- lapply(sequences, function(seq) { unlist(strsplit(seq, "")) }) score_sequences(seq_char_vectors, as.matrix(get_motif_matrix(motif)))
Spectrum scores are a means to evaluate if a spectrum has a meaningful (i.e., biologically relevant) or a random pattern.
score_spectrum( x, p_values = array(1, length(x)), x_label = "log enrichment", sorted_transcript_values = NULL, transcript_values_label = "transcript value", midpoint = 0, x_value_limits = NULL, max_model_degree = 3, max_cs_permutations = 1e+07, min_cs_permutations = 5000, e = 5 )
score_spectrum( x, p_values = array(1, length(x)), x_label = "log enrichment", sorted_transcript_values = NULL, transcript_values_label = "transcript value", midpoint = 0, x_value_limits = NULL, max_model_degree = 3, max_cs_permutations = 1e+07, min_cs_permutations = 5000, e = 5 )
x |
vector of values (e.g., enrichment values, normalized RBP scores) per bin |
p_values |
vector of p-values (e.g., significance of enrichment values) per bin |
x_label |
label of values (e.g., |
sorted_transcript_values |
vector of sorted transcript values, i.e.,
the fold change or signal-to-noise ratio or any other quantity that was used
to sort the transcripts that were passed to |
transcript_values_label |
label of transcript sorting criterion
(e.g., |
midpoint |
for enrichment values the midpoint should be |
x_value_limits |
sets limits of the x-value color scale (used to
harmonize color scales of different spectrum plots), see |
max_model_degree |
maximum degree of polynomial |
max_cs_permutations |
maximum number of permutations performed in Monte Carlo test for consistency score |
min_cs_permutations |
minimum number of permutations performed in Monte Carlo test for consistency score |
e |
integer-valued stop criterion for consistency score Monte
Carlo test: aborting permutation process after
observing |
One way to quantify the meaningfulness of a spectrum is to calculate the
deviance
between the linear interpolation of the scores of two adjoining bins and
the score
of the middle bin, for each position in the spectrum. The lower the score,
the more consistent
the trend in the spectrum plot. Formally, the local consistency score
is defined as
In order to obtain an estimate of the significance of a particular
score ,
Monte Carlo sampling
is performed by randomly permuting the coordinates of the scores
vector
and recomputing
.
The probability estimate
is given by the lower tail
version of the cumulative
distribution function
where is the indicator function,
is the
sample size, i.e.,
the number of performed permutations, and
equals
in the above equation.
An alternative approach to assess the consistency of a spectrum plot is
via polynomial regression.
In a first step, polynomial regression models of various degrees are
fitted to the data, i.e.,
the dependent variable (vector of scores), and
orthogonal
polynomials of the independent variable
(vector of
bin numbers).
Secondly, the model that reflects best the true nature of the data is
selected by
means of the F-test. And lastly, the adjusted
and the sum of
squared residuals
are calculated to indicate how well the model fits the data. These
statistics are used as
scores to rank the spectrum plots.
In general, the polynomial regression equation is
where is the degree of the polynomial (usually
),
and
is the error term.
The dependent variable
is the vector of
scores
and
to
are the orthogonal polynomials of
the vector of bin
numbers
.
Orthogonal polynomials are used in order to reduce the correlation
between the different
powers of
and therefore avoid multicollinearity in the model. This is important,
because correlated
predictors lead to unstable coefficients, i.e., the coefficients of a
polynomial
regression model of
degree
can be greatly different from a model of degree
.
The orthogonal polynomials of vector are obtained by
centering (subtracting the mean),
QR decomposition, and subsequent normalization.
Given the dependent variable
and the orthogonal
polynomials
of
to
, the model coefficients
are chosen
in a way to minimize
the deviance between the actual and the predicted values characterized by
where L(actual value, predicted value) denotes the loss function.
Ordinary least squares is used as estimation method for the model
coefficients . The loss
function of ordinary least squares is the sum of squared residuals
(SSR)
and is defined as follows
SSR
where
are the observed data
and
the
model predictions.
Thus the ordinary least squares estimate of the
coefficients
(including the
intercept
) of the model
is defined by
After polynomial models of various degrees have been fitted to the data, the F-test is used to select the model that best fits the data. Since the SSR monotonically decreases with increasing model degree (model complexity), the relative decrease of the SSR between the simpler model and the more complex model must outweigh the increase in model complexity between the two models. The F-test gives the probability that a relative decrease of the SSR between the simpler and the more complex model given their respective degrees of freedom is due to chance. A low p-value indicates that the additional degrees of freedom of the more complex model lead to a better fit of the data than would be expected after a mere increase of degrees of freedom.
The F-statistic is calculated as follows
where is the sum of squared residuals
and
is the number
of parameters of
model
. The number of data points, i.e., bins, is denoted as
.
is distributed according to the F-distribution
with
and
.
A list object of class SpectrumScore
with the following
components:
adj_r_squared |
adjusted of polynomial model |
degree |
maximum degree of polynomial |
residuals |
residuals of polynomial model |
slope |
coefficient of the linear term of the polynomial model (spectrum "direction") |
f_statistic |
statistic of the F-test |
f_statistic_p_value |
p-value of F-test |
consistency_score |
normalized sum of deviance between the linear interpolation of the scores of two adjoining bins and the score of the middle bin, for each position in the spectrum |
consistency_score_p_value |
obtained by Monte Carlo sampling (randomly permuting the coordinates of the scores vector) |
consistency_score_n |
number of permutations |
plot |
Other SPMA functions:
classify_spectrum()
,
run_kmer_spma()
,
run_matrix_spma()
,
subdivide_data()
# random spectrum score_spectrum(runif(n = 40, min = -1, max = 1), max_model_degree = 1) # two random spectrums with harmonized color scales plot(score_spectrum(runif(n = 40, min = -1, max = 1), max_model_degree = 1, x_value_limits = c(-2.0, 2.0))) plot(score_spectrum(runif(n = 40, min = -2, max = 2), max_model_degree = 1, x_value_limits = c(-2.0, 2.0))) # random spectrum with p-values score_spectrum(runif(n = 40, min = -1, max = 1), p_values = runif(n = 40, min = 0, max = 1), max_model_degree = 1) # random spectrum with sorted transcript values log_fold_change <- log(runif(n = 1000, min = 0, max = 1) / runif(n = 1000, min = 0, max = 1)) score_spectrum(runif(n = 40, min = -1, max = 1), sorted_transcript_values = sort(log_fold_change), max_model_degree = 1) # non-random linear spectrum signal <- seq(-1, 0.99, 2 / 40) noise <- rnorm(n = 40, mean = 0, sd = 0.5) score_spectrum(signal + noise, max_model_degree = 1, max_cs_permutations = 100000) # non-random quadratic spectrum signal <- seq(-1, 0.99, 2 / 40)^2 - 0.5 noise <- rnorm(n = 40, mean = 0, sd = 0.2) score_spectrum(signal + noise, max_model_degree = 2, max_cs_permutations = 100000)
# random spectrum score_spectrum(runif(n = 40, min = -1, max = 1), max_model_degree = 1) # two random spectrums with harmonized color scales plot(score_spectrum(runif(n = 40, min = -1, max = 1), max_model_degree = 1, x_value_limits = c(-2.0, 2.0))) plot(score_spectrum(runif(n = 40, min = -2, max = 2), max_model_degree = 1, x_value_limits = c(-2.0, 2.0))) # random spectrum with p-values score_spectrum(runif(n = 40, min = -1, max = 1), p_values = runif(n = 40, min = 0, max = 1), max_model_degree = 1) # random spectrum with sorted transcript values log_fold_change <- log(runif(n = 1000, min = 0, max = 1) / runif(n = 1000, min = 0, max = 1)) score_spectrum(runif(n = 40, min = -1, max = 1), sorted_transcript_values = sort(log_fold_change), max_model_degree = 1) # non-random linear spectrum signal <- seq(-1, 0.99, 2 / 40) noise <- rnorm(n = 40, mean = 0, sd = 0.5) score_spectrum(signal + noise, max_model_degree = 1, max_cs_permutations = 100000) # non-random quadratic spectrum signal <- seq(-1, 0.99, 2 / 40)^2 - 0.5 noise <- rnorm(n = 40, mean = 0, sd = 0.2) score_spectrum(signal + noise, max_model_degree = 2, max_cs_permutations = 100000)
This function is used to count the binding sites in a set of sequences for
all or a
subset of RNA-binding protein sequence
motifs and returns the result in a data frame, which is subsequently used by
calculate_motif_enrichment
to
obtain binding site enrichment scores.
score_transcripts( sequences, motifs = NULL, max_hits = 5, threshold_method = c("p_value", "relative"), threshold_value = 0.25^6, n_cores = 1, cache = paste0(tempdir(), "/sc/") )
score_transcripts( sequences, motifs = NULL, max_hits = 5, threshold_method = c("p_value", "relative"), threshold_value = 0.25^6, n_cores = 1, cache = paste0(tempdir(), "/sc/") )
sequences |
character vector of named sequences
(only containing upper case characters A, C, G, T), where the names are
RefSeq identifiers
and sequence
type qualifiers ( |
motifs |
a list of motifs that is used to score the specified sequences.
If |
max_hits |
maximum number of putative binding sites per mRNA that are counted |
threshold_method |
either |
threshold_value |
semantics of the |
n_cores |
the number of cores that are used |
cache |
either logical or path to a directory where scores are cached.
The scores of each
motif are stored in a
separate file that contains a hash table with RefSeq identifiers and
sequence type
qualifiers as keys and the number of putative binding sites as values.
If |
A list with three entries:
(1) df: a data frame with the following columns:
motif_id |
the motif identifier that is used in the original motif library |
motif_rbps |
the gene symbol of the RNA-binding protein(s) |
absolute_hits |
the absolute frequency of putative binding sites per motif in all transcripts |
relative_hits |
the relative, i.e., absolute divided by total, frequency of binding sites per motif in all transcripts |
total_sites |
the total number of potential binding sites |
one_hit , two_hits , ... |
number of transcripts with one, two, three, ... putative binding sites |
(2) total_sites: a numeric vector with the total number of potential binding sites per transcript
(3) absolute_hits: a numeric vector with the absolute (not relative) number of putative binding sites per transcript
Other matrix functions:
calculate_motif_enrichment()
,
run_matrix_spma()
,
run_matrix_tsma()
,
score_transcripts_single_motif()
foreground_set <- c( "CAACAGCCUUAAUU", "CAGUCAAGACUCC", "CUUUGGGGAAU", "UCAUUUUAUUAAA", "AAUUGGUGUCUGGAUACUUCCCUGUACAU", "AUCAAAUUA", "AGAU", "GACACUUAAAGAUCCU", "UAGCAUUAACUUAAUG", "AUGGA", "GAAGAGUGCUCA", "AUAGAC", "AGUUC", "CCAGUAA" ) # names are used as keys in the hash table (cached version only) # ideally sequence identifiers (e.g., RefSeq ids) and region labels # (e.g., 3UTR for 3'-UTR) names(foreground_set) <- c( "NM_1_DUMMY|3UTR", "NM_2_DUMMY|3UTR", "NM_3_DUMMY|3UTR", "NM_4_DUMMY|3UTR", "NM_5_DUMMY|3UTR", "NM_6_DUMMY|3UTR", "NM_7_DUMMY|3UTR", "NM_8_DUMMY|3UTR", "NM_9_DUMMY|3UTR", "NM_10_DUMMY|3UTR", "NM_11_DUMMY|3UTR", "NM_12_DUMMY|3UTR", "NM_13_DUMMY|3UTR", "NM_14_DUMMY|3UTR" ) # specific motifs, uncached motifs <- get_motif_by_rbp("ELAVL1") scores <- score_transcripts(foreground_set, motifs = motifs, cache = FALSE) ## Not run: # all Transite motifs, cached (writes scores to disk) scores <- score_transcripts(foreground_set) # all Transite motifs, uncached scores <- score_transcripts(foreground_set, cache = FALSE) foreground_df <- transite:::ge$foreground1_df foreground_set <- foreground_df$seq names(foreground_set) <- paste0(foreground_df$refseq, "|", foreground_df$seq_type) scores <- score_transcripts(foreground_set) ## End(Not run)
foreground_set <- c( "CAACAGCCUUAAUU", "CAGUCAAGACUCC", "CUUUGGGGAAU", "UCAUUUUAUUAAA", "AAUUGGUGUCUGGAUACUUCCCUGUACAU", "AUCAAAUUA", "AGAU", "GACACUUAAAGAUCCU", "UAGCAUUAACUUAAUG", "AUGGA", "GAAGAGUGCUCA", "AUAGAC", "AGUUC", "CCAGUAA" ) # names are used as keys in the hash table (cached version only) # ideally sequence identifiers (e.g., RefSeq ids) and region labels # (e.g., 3UTR for 3'-UTR) names(foreground_set) <- c( "NM_1_DUMMY|3UTR", "NM_2_DUMMY|3UTR", "NM_3_DUMMY|3UTR", "NM_4_DUMMY|3UTR", "NM_5_DUMMY|3UTR", "NM_6_DUMMY|3UTR", "NM_7_DUMMY|3UTR", "NM_8_DUMMY|3UTR", "NM_9_DUMMY|3UTR", "NM_10_DUMMY|3UTR", "NM_11_DUMMY|3UTR", "NM_12_DUMMY|3UTR", "NM_13_DUMMY|3UTR", "NM_14_DUMMY|3UTR" ) # specific motifs, uncached motifs <- get_motif_by_rbp("ELAVL1") scores <- score_transcripts(foreground_set, motifs = motifs, cache = FALSE) ## Not run: # all Transite motifs, cached (writes scores to disk) scores <- score_transcripts(foreground_set) # all Transite motifs, uncached scores <- score_transcripts(foreground_set, cache = FALSE) foreground_df <- transite:::ge$foreground1_df foreground_set <- foreground_df$seq names(foreground_set) <- paste0(foreground_df$refseq, "|", foreground_df$seq_type) scores <- score_transcripts(foreground_set) ## End(Not run)
This function is used to count the putative binding sites (i.e., motifs)
in a set of
sequences for the specified RNA-binding protein sequence
motifs and returns the result in a data frame, which is aggregated by
score_transcripts
and
subsequently used by calculate_motif_enrichment
to
obtain binding site enrichment scores.
score_transcripts_single_motif( motif, sequences, max_hits = 5, threshold_method = c("p_value", "relative"), threshold_value = 0.25^6, cache_path = paste0(tempdir(), "/sc/") )
score_transcripts_single_motif( motif, sequences, max_hits = 5, threshold_method = c("p_value", "relative"), threshold_value = 0.25^6, cache_path = paste0(tempdir(), "/sc/") )
motif |
a Transite motif that is used to score the specified sequences |
sequences |
character vector of named sequences
(only containing upper case characters A, C, G, T), where the names are
RefSeq identifiers
and sequence
type qualifiers ( |
max_hits |
maximum number of putative binding sites per mRNA that are counted |
threshold_method |
either |
threshold_value |
semantics of the |
cache_path |
the path to a directory where scores are cached. The scores of each motif are stored in a separate file that contains a hash table with RefSeq identifiers and sequence type qualifiers as keys and the number of binding sites as values. If is.null(cache_path), scores will not be cached. |
A list with the following items:
motif_id |
the motif identifier of the specified motif |
motif_rbps |
the gene symbol of the RNA-binding protein(s) |
absolute_hits |
the absolute frequency of binding sites per motif in all transcripts |
relative_hits |
the relative, i.e., absolute divided by total, frequency of binding sites per motif in all transcripts |
total_sites |
the total number of potential binding sites |
one_hit , two_hits , ... |
number of transcripts with one, two, three, ... binding sites |
Other matrix functions:
calculate_motif_enrichment()
,
run_matrix_spma()
,
run_matrix_tsma()
,
score_transcripts()
Globally sets Transite motif database, use with care.
set_motifs(value)
set_motifs(value)
value |
list of Motif objects |
void
Other motif functions:
generate_iupac_by_kmers()
,
generate_iupac_by_matrix()
,
generate_kmers_from_iupac()
,
get_motif_by_id()
,
get_motif_by_rbp()
,
get_motifs()
,
get_motifs_meta_info()
,
get_ppm()
,
init_iupac_lookup_table()
custom_motif <- create_kmer_motif( "custom_motif", "RBP1", c("AAAAAAA", "CAAAAAA"), "HITS-CLIP", "Homo sapiens", "user" ) set_motifs(list(custom_motif))
custom_motif <- create_kmer_motif( "custom_motif", "RBP1", c("AAAAAAA", "CAAAAAA"), "HITS-CLIP", "Homo sapiens", "user" ) set_motifs(list(custom_motif))
An S4 class to represent a scored spectrum
Getter Method get_adj_r_squared
Getter Method get_model_degree
Getter Method get_model_residuals
Getter Method get_model_slope
Getter Method get_model_f_statistic
Getter Method get_model_f_statistic_p_value
Getter Method get_consistency_score
Getter Method get_consistency_score_p_value
Getter Method get_consistency_score_n
get_adj_r_squared(object) ## S4 method for signature 'SpectrumScore' get_adj_r_squared(object) get_model_degree(object) ## S4 method for signature 'SpectrumScore' get_model_degree(object) get_model_residuals(object) ## S4 method for signature 'SpectrumScore' get_model_residuals(object) get_model_slope(object) ## S4 method for signature 'SpectrumScore' get_model_slope(object) get_model_f_statistic(object) ## S4 method for signature 'SpectrumScore' get_model_f_statistic(object) get_model_f_statistic_p_value(object) ## S4 method for signature 'SpectrumScore' get_model_f_statistic_p_value(object) get_consistency_score(object) ## S4 method for signature 'SpectrumScore' get_consistency_score(object) get_consistency_score_p_value(object) ## S4 method for signature 'SpectrumScore' get_consistency_score_p_value(object) get_consistency_score_n(object) ## S4 method for signature 'SpectrumScore' get_consistency_score_n(object) ## S4 method for signature 'SpectrumScore' show(object) ## S4 method for signature 'SpectrumScore,ANY' plot(x)
get_adj_r_squared(object) ## S4 method for signature 'SpectrumScore' get_adj_r_squared(object) get_model_degree(object) ## S4 method for signature 'SpectrumScore' get_model_degree(object) get_model_residuals(object) ## S4 method for signature 'SpectrumScore' get_model_residuals(object) get_model_slope(object) ## S4 method for signature 'SpectrumScore' get_model_slope(object) get_model_f_statistic(object) ## S4 method for signature 'SpectrumScore' get_model_f_statistic(object) get_model_f_statistic_p_value(object) ## S4 method for signature 'SpectrumScore' get_model_f_statistic_p_value(object) get_consistency_score(object) ## S4 method for signature 'SpectrumScore' get_consistency_score(object) get_consistency_score_p_value(object) ## S4 method for signature 'SpectrumScore' get_consistency_score_p_value(object) get_consistency_score_n(object) ## S4 method for signature 'SpectrumScore' get_consistency_score_n(object) ## S4 method for signature 'SpectrumScore' show(object) ## S4 method for signature 'SpectrumScore,ANY' plot(x)
object |
SpectrumScore object |
x |
SpectrumScore object |
Object of type SpectrumScore
adj_r_squared
adjusted of polynomial model
degree
degree of polynomial (integer between 0 and 5)
residuals
residuals of the polynomial model
slope
coefficient of the linear term of the polynomial model (spectrum "direction")
f_statistic
F statistic from the F test used to determine the degree of the polynomial model
f_statistic_p_value
p-value associated with the F statistic
consistency_score
raw local consistency score of the spectrum
consistency_score_p_value
p-value associated with the local consistency score
consistency_score_n
number of permutations performed to calculate p-value of local consistency score (permutations performed before early stopping criterion reached)
plot
spectrum plot
new("SpectrumScore", adj_r_squared = 0, degree = 0L, residuals = 0, slope = 0, f_statistic = 0, f_statistic_p_value = 1, consistency_score = 1, consistency_score_p_value = 1, consistency_score_n = 1000L, plot = NULL )
new("SpectrumScore", adj_r_squared = 0, degree = 0L, residuals = 0, slope = 0, f_statistic = 0, f_statistic_p_value = 1, consistency_score = 1, consistency_score_p_value = 1, consistency_score_n = 1000L, plot = NULL )
Preprocessing function for SPMA, divides transcript sequences into n bins.
subdivide_data(sorted_transcript_sequences, n_bins = 40)
subdivide_data(sorted_transcript_sequences, n_bins = 40)
sorted_transcript_sequences |
character vector of named sequences (names are usually RefSeq identifiers and sequence region labels, e.g., "NM_1_DUMMY|3UTR"). It is important that the sequences are already sorted by fold change, signal-to-noise ratio or any other meaningful measure. |
n_bins |
specifies the number of bins in which the sequences will be divided, valid values are between 7 and 100 |
An array of n_bins
length, containing the binned sequences
Other SPMA functions:
classify_spectrum()
,
run_kmer_spma()
,
run_matrix_spma()
,
score_spectrum()
# toy example toy_seqs <- c( "CAACAGCCUUAAUU", "CAGUCAAGACUCC", "CUUUGGGGAAU", "UCAUUUUAUUAAA", "AAUUGGUGUCUGGAUACUUCCCUGUACAU", "AUCAAAUUA", "AGAU", "GACACUUAAAGAUCCU", "UAGCAUUAACUUAAUG", "AUGGA", "GAAGAGUGCUCA", "AUAGAC", "AGUUC", "CCAGUAA" ) # names are used as keys in the hash table (cached version only) # ideally sequence identifiers (e.g., RefSeq ids) and # sequence region labels (e.g., 3UTR for 3'-UTR) names(toy_seqs) <- c( "NM_1_DUMMY|3UTR", "NM_2_DUMMY|3UTR", "NM_3_DUMMY|3UTR", "NM_4_DUMMY|3UTR", "NM_5_DUMMY|3UTR", "NM_6_DUMMY|3UTR", "NM_7_DUMMY|3UTR", "NM_8_DUMMY|3UTR", "NM_9_DUMMY|3UTR", "NM_10_DUMMY|3UTR", "NM_11_DUMMY|3UTR", "NM_12_DUMMY|3UTR", "NM_13_DUMMY|3UTR", "NM_14_DUMMY|3UTR" ) foreground_sets <- subdivide_data(toy_seqs, n_bins = 7) # example data set background_df <- transite:::ge$background_df # sort sequences by signal-to-noise ratio background_df <- dplyr::arrange(background_df, value) # character vector of named sequences background_seqs <- background_df$seq names(background_seqs) <- paste0(background_df$refseq, "|", background_df$seq_type) foreground_sets <- subdivide_data(background_seqs)
# toy example toy_seqs <- c( "CAACAGCCUUAAUU", "CAGUCAAGACUCC", "CUUUGGGGAAU", "UCAUUUUAUUAAA", "AAUUGGUGUCUGGAUACUUCCCUGUACAU", "AUCAAAUUA", "AGAU", "GACACUUAAAGAUCCU", "UAGCAUUAACUUAAUG", "AUGGA", "GAAGAGUGCUCA", "AUAGAC", "AGUUC", "CCAGUAA" ) # names are used as keys in the hash table (cached version only) # ideally sequence identifiers (e.g., RefSeq ids) and # sequence region labels (e.g., 3UTR for 3'-UTR) names(toy_seqs) <- c( "NM_1_DUMMY|3UTR", "NM_2_DUMMY|3UTR", "NM_3_DUMMY|3UTR", "NM_4_DUMMY|3UTR", "NM_5_DUMMY|3UTR", "NM_6_DUMMY|3UTR", "NM_7_DUMMY|3UTR", "NM_8_DUMMY|3UTR", "NM_9_DUMMY|3UTR", "NM_10_DUMMY|3UTR", "NM_11_DUMMY|3UTR", "NM_12_DUMMY|3UTR", "NM_13_DUMMY|3UTR", "NM_14_DUMMY|3UTR" ) foreground_sets <- subdivide_data(toy_seqs, n_bins = 7) # example data set background_df <- transite:::ge$background_df # sort sequences by signal-to-noise ratio background_df <- dplyr::arrange(background_df, value) # character vector of named sequences background_seqs <- background_df$seq names(background_seqs) <- paste0(background_df$refseq, "|", background_df$seq_type) foreground_sets <- subdivide_data(background_seqs)
This toy motif matrix is used in code examples for various functions.
data(toy_motif_matrix)
data(toy_motif_matrix)
A data frame with four columns (A, C, G, U) and seven rows (position 1 - 7)