Title: | Affinity test for identifying regulatory SNPs |
---|---|
Description: | atSNP performs affinity tests of motif matches with the SNP or the reference genomes and SNP-led changes in motif matches. |
Authors: | Chandler Zuo [aut], Sunyoung Shin [aut, cre], Sunduz Keles [aut] |
Maintainer: | Sunyoung Shin <[email protected]> |
License: | GPL-2 |
Version: | 1.23.0 |
Built: | 2024-10-30 03:34:33 UTC |
Source: | https://github.com/bioc/atSNP |
atSNP implements the affinity test for large sets of SNP-motif interactions using the importance sampling algorithm. Users may identify SNPs that potentially may affect binding affinity of transcription factors. Given a set of SNPs and a library of motif position weight matrices (PWMs), atSNP provides two main functions for analyzing SNP effects: (i) the binding affinity score for each allele and each PWM and the p-values for allele-specific binding affinity scores (ii) the p-values for affinity score changes between the two alleles for each SNP. Compared to other bioinformatics tools that provide similar functionalities, atSNP is highly scalable.
The atSNP main functions are:
LoadMotifLibrary
- Load position weight matrices
LoadSNPData
- Load the SNP information and code the
genome sequences around the SNP locations
LoadFastaData
- Load the SNP data from fasta files
ComputeMotifScore
- Compute the scores for SNP effects on
motifs
ComputePValues
- Compute p-values for affinity scores
Some helper functions are:
MatchSubsequence
- Compute the matching subsequence
GetIUPACSequence
- Get the IUPAC sequence of a motif
dtMotifMatch
- Compute the augmented matching subsequence
on SNP and reference alleles
The composite logo plotting function is:
plotMotifMatch
- Plot sequence logos of the position
weight matrix of the motif and sequences of its corresponding best matching
augmented subsequence on the reference and SNP allele
Chandler Zuo Sunyoung Shin [email protected]
Zuo, Chandler, Shin, Sunyoung, and Keles, Sunduz. (2015). atSNP: Transcription factor binding affinity testing for regulatory SNP detection. Bioinformatics 31 (20): 3353-5.
atSNP vignette for more information
Compute the log-likelihood scores for motifs.
ComputeMotifScore(motif.lib, snp.info, ncores = 1)
ComputeMotifScore(motif.lib, snp.info, ncores = 1)
motif.lib |
A list object with the output format of function
|
snp.info |
A list object with the output format of function
|
ncores |
An integer for the number of parallel process. Default: 1. |
This function computes the binding affinity scores for both alleles at each SNP window. For each pair of SNP and motif, it finds the subsequence from both strand that maximizes the affinity binding score. It returns both the matching positions and the maximized affinity scores.
A list of two data.frame's. Field snp.tbl
contains:
snpid | SNP id. |
ref_seq | Reference allele nucleotide sequence. |
snp_seq | SNP allele nucleotide sequence. |
ref_seq_rev | Reference allele nucleotide sequence on the reverse strand. |
snp_seq_rev | SNP allele nucleotide sequence on the reverse strand. |
Field motif.score
contains:
motif | Name of the motif. |
motif_len | Length of the motif. |
ref_start, ref_end, ref_strand | Location of the best matching subsequence on the reference allele. |
snp_start, snp_end, snp_strand | Location of the best matching subsequence on the SNP allele. |
log_lik_ref | Log-likelihood score for the reference allele. |
log_lik_snp | Log-likelihood score for the SNP allele. |
log_lik_ratio | The log-likelihood ratio. |
log_enhance_odds | Difference in log-likelihood ratio between SNP allele and reference allele based on the best matching subsequence on the reference allele. |
log_reduce_odds | Difference in log-likelihood ratio between reference allele and SNP allele based on the best matching subsequence on the SNP allele. |
Sunyoung Shin [email protected], Chandler Zuo [email protected]
data(example) ComputeMotifScore(motif_library, snpInfo, ncores = 2)
data(example) ComputeMotifScore(motif_library, snpInfo, ncores = 2)
This function computes the p-values for allele-specific affinity scores and between-allele affinity score changes using the importance sampling technique.
ComputePValues(motif.lib, snp.info, motif.scores, ncores = 1, testing.mc = FALSE, figdir = NULL)
ComputePValues(motif.lib, snp.info, motif.scores, ncores = 1, testing.mc = FALSE, figdir = NULL)
motif.lib |
A list object with the output format of function
|
|||||||
snp.info |
A list object with the output format of function
|
|||||||
motif.scores |
A data.frame object containing at least the following columns:
|
|||||||
ncores |
An integer for the number of parallel process. Default: 1. |
|||||||
testing.mc |
Monte Carlo sample size of 200 is considered. Do not change the default unless conducting a quick test. Default: FALSE |
|||||||
figdir |
A string for the path to print p-value plots for monitoring results. Default: NULL (no figure). |
A data.frame extending motif.scores
by the following
additional columns:
pval_ref | P-values for scores on the reference allele. |
pval_snp | P-values for scores on the SNP allele. |
pval_cond_ref | Conditional p-values for scores on the reference allele. |
pval_cond_snp | Conditional p-values for scores on the SNP allele. |
pval_diff | P-values for the difference in scores between the reference and the SNP alleles. |
pval_rank | P-values for the log rank ratio between the reference and the SNP alleles. |
Sunyoung Shin [email protected], Chandler Zuo [email protected]
data(example) ComputePValues(motif_library, snpInfo, motif_scores$motif.scores, ncores = 2, testing.mc=TRUE)
data(example) ComputePValues(motif_library, snpInfo, motif_scores$motif.scores, ncores = 2, testing.mc=TRUE)
Calculate the best matching augmented subsequences on both SNP and reference alleles for motifs. Obtain extra unmatching position on the best matching augmented subsequence of the reference and SNP alleles.
dtMotifMatch(snp.tbl, motif.scores, snpids = NULL, motifs = NULL, motif.lib, ncores = 2)
dtMotifMatch(snp.tbl, motif.scores, snpids = NULL, motifs = NULL, motif.lib, ncores = 2)
snp.tbl |
A data.frame with the following information:
|
|||||||||||||||||||
motif.scores |
A data.frame with the following information:
|
|||||||||||||||||||
snpids |
A subset of snpids to compute the subsequences. Default: NULL, when all snps are computed. |
|||||||||||||||||||
motifs |
A subset of motifs to compute the subsequences. Default: NULL, when all motifs are computed. |
|||||||||||||||||||
motif.lib |
A list of named position weight matrices. |
|||||||||||||||||||
ncores |
The number of cores used for parallel computing. Default: 10 |
A data.frame containing all columns from the function,
MatchSubsequence
. In addition, the following columns are added:
snp_ref_start, snp_ref_end, snp_ref_length | Location and Length of the best matching augmented subsequence on both the reference and SNP allele. |
ref_aug_match_seq_forward | Best matching augmented subsequence or its corresponding sequence to the forward strand on the reference allele. |
snp_aug_match_seq_forward | Best matching augmented subsequence or its corresponding sequence to the forward strand on the SNP allele. |
ref_aug_match_seq_reverse | Best matching augmented subsequence or its corresponding sequence to the reverse strand on the reference allele. |
snp_aug_match_seq_reverse | Best matching augmented subsequence or its corresponding sequence to the reverse strand on the SNP allele. |
ref_location | SNP location of the best matching augmented subsequence on the reference allele. Starting from zero. |
snp_location | SNP location of the best matching augmented subsequence on the SNP allele. Starting from zero. |
ref_extra_pwm_left | Left extra unmatching position on the best matching augmented subsequence of the reference allele. |
ref_extra_pwm_right | Right extra unmatching position on the best matching augmented subsequence of the reference allele. |
snp_extra_pwm_left | Left extra unmatching position on the best matching augmented subsequence of the SNP allele. |
snp_extra_pwm_right | Right extra unmatching position on the best matching augmented subsequence of the SNP allele. |
Sunyoung Shin[email protected]
data(example) dtMotifMatch(motif_scores$snp.tbl, motif_scores$motif.scores, motif.lib = motif_library)
data(example) dtMotifMatch(motif_scores$snp.tbl, motif_scores$motif.scores, motif.lib = motif_library)
This motif library can be loaded by 'data(encode_library)'.
A list object.
Sunyoung Shin [email protected], Chandler Zuo [email protected]
This is a character vector that be loaded by
'data(encode_library)'. The names of this vector are the same as the names
for encode_motif
. The entries of this vector are the
corresponding motif information parsed from the raw file.
A character vector.
Sunyoung Shin [email protected], Chandler Zuo [email protected]
Convert the posotion weight matrix of a motif to the IUPAC sequence.
GetIUPACSequence(pwm, prob = 0.25)
GetIUPACSequence(pwm, prob = 0.25)
pwm |
The position weight matrix, with the columns representing A, C, G, T. |
prob |
The probability threshold. Default: 0.25. |
A character string.
Sunyoung Shin [email protected], Chandler Zuo [email protected]
data(example) GetIUPACSequence(motif_library[[1]], prob = 0.2)
data(example) GetIUPACSequence(motif_library[[1]], prob = 0.2)
This motif library can be loaded by 'data(jaspar_library)'.
A list object.
Sunyoung Shin [email protected], Chandler Zuo [email protected]
This is a character vector that be loaded by
'data(jaspar_library)'. The names of this vector are the same as the names
for jaspar_motif
. The entries of this vector are the
corresponding motif information parsed from the raw file.
A character vector.
Sunyoung Shin [email protected], Chandler Zuo [email protected]
Load SNP data.
LoadFastaData(ref.filename = NULL, snp.filename = NULL, ref.urlname = NULL, snp.urlname = NULL, snpids = NULL, default.par = FALSE)
LoadFastaData(ref.filename = NULL, snp.filename = NULL, ref.urlname = NULL, snp.urlname = NULL, snpids = NULL, default.par = FALSE)
ref.filename |
a fastq file name for the reference allele sequences. |
snp.filename |
a fastq file name for the SNP allele sequences. |
ref.urlname |
URL of a fastq file for the reference allele sequences. |
snp.urlname |
URL of a fastq file for the SNP allele sequences. |
snpids |
SNP IDs |
default.par |
A boolean for whether using the default Markov parameters. Default: FALSE. |
A list object containing the following components:
sequence_matrix | A list of integer vectors representing the deroxyribose sequence around each SNP. |
a1 | An integer vector for the deroxyribose at the SNP location on the reference genome. |
a2 | An integer vector for the deroxyribose at the SNP location on the SNP genome. |
The results are coded as: "A"-1, "C"-2, "G"-3, "T"-4.
Sunyoung Shin [email protected], Chandler Zuo [email protected]
LoadFastaData( ref.urlname="http://pages.stat.wisc.edu/~keles/atSNP-Data/sample_1.fasta", snp.urlname="http://pages.stat.wisc.edu/~keles/atSNP-Data/sample_2.fasta")
LoadFastaData( ref.urlname="http://pages.stat.wisc.edu/~keles/atSNP-Data/sample_1.fasta", snp.urlname="http://pages.stat.wisc.edu/~keles/atSNP-Data/sample_2.fasta")
Load the file for position weight matrices for motifs.
LoadMotifLibrary(filename = NULL, urlname = NULL, tag = "MOTIF", transpose = FALSE, field = 2, sep = c("\t", " "), skipcols = 0, skiprows = 2, pseudocount = 0)
LoadMotifLibrary(filename = NULL, urlname = NULL, tag = "MOTIF", transpose = FALSE, field = 2, sep = c("\t", " "), skipcols = 0, skiprows = 2, pseudocount = 0)
filename |
a MEME format file name. |
urlname |
URL containing a MEME format file. |
tag |
A string that marks the description line of the position weight matrix. |
transpose |
If TRUE (default), then the position weight matrix should have 4 columns. Otherwise, it should have 4 rows. |
field |
The index of the field in the description line, seperated by space, that indicates the motif name. |
sep |
A vector of chars for the string separators to parse each lines of the matrix. Default: c(" ", "\t"). |
skipcols |
Number of columns to be skipped in the position weight matrix. |
skiprows |
Number of description lines before each position weight matrix. |
pseudocount |
An integer for the pseudocount added to each of the original matrices. Default: 0. Recommended to be 1 if the original matrices are position frequency matrices. |
This function reads the formatted file containing motif information and convert them into a list of position weight matrices. The list of arguments should provide enough flexibility of importing a varying number of formats. Some examples are the following: For MEME format, the suggested arguments are: tag = 'Motif', skiprows = 2, skipcols = 0, transpose = FALSE, field = 2, sep = ' '; For motif files from JOHNSON lab (i.e. http://johnsonlab.ucsf.edu/mochi_files/JASPAR_motifs_H_sapiens.txt), the suggested arguments are: tag = '/NAME', skiprows = 1, skipcols = 0, transpose = FALSE, field = 2, sep = "\t"; For JASPAR pfm matrices (i.e. http://jaspar.genereg.net/download/CORE/JASPAR 2018_CORE_vertebrates_non-redundant_pfms_jaspar.txt), the suggested arguments are: tag = ">", skiprows = 1, skipcols = 0, transpose = TRUE, field = 1, sep = "\t"; For the TRANSFAC library provided by UCF bioinformatics groups (i.e. http://gibbs.biomed.ucf.edu/PreDREM/download/nonredundantmotif.transfac ), the suggested arguments are: tag = "DE", skiprows = 1, skipcols = 1, transpose = FALSE, field = 2, sep = "\t".
A list object of position weight matrices.
Sunyoung Shin [email protected], Chandler Zuo [email protected]
pwms <- LoadMotifLibrary( urlname="http://pages.stat.wisc.edu/~keles/atSNP-Data/pfm_vertebrates.txt", tag = ">", transpose = FALSE, field = 1, sep = c("\t", " ", ">"), skipcols = 1, skiprows = 1, pseudocount = 1)
pwms <- LoadMotifLibrary( urlname="http://pages.stat.wisc.edu/~keles/atSNP-Data/pfm_vertebrates.txt", tag = ">", transpose = FALSE, field = 1, sep = c("\t", " ", ">"), skipcols = 1, skiprows = 1, pseudocount = 1)
Load the SNP data.
LoadSNPData(filename = NULL, genome.lib = "BSgenome.Hsapiens.UCSC.hg38", snp.lib = "SNPlocs.Hsapiens.dbSNP144.GRCh38", snpids = NULL, half.window.size = 30, default.par = FALSE, mutation = FALSE, ...)
LoadSNPData(filename = NULL, genome.lib = "BSgenome.Hsapiens.UCSC.hg38", snp.lib = "SNPlocs.Hsapiens.dbSNP144.GRCh38", snpids = NULL, half.window.size = 30, default.par = FALSE, mutation = FALSE, ...)
filename |
A table containing the SNP information. Must contain at least five columns with exactly the following names:
If this file exists already, it is used to extract the SNP information. Otherwise, SNP information extracted using argument 'snpids' is outputted to this file. |
|||||||||||
genome.lib |
A string of the library name for the genome version. Default: "BSgenome.Hsapiens.UCSC.hg38". |
|||||||||||
snp.lib |
A string of the library name to obtain the SNP information based on rs ids. Default: "SNPlocs.Hsapiens.dbSNP144.GRCh38". |
|||||||||||
snpids |
A vector of rs ids for the SNPs. This argument is overidden
if the file with name |
|||||||||||
half.window.size |
An integer for the half window size around the SNP within which the motifs are matched. Default: 30. |
|||||||||||
default.par |
A boolean for whether using the default Markov parameters. Default: FALSE. |
|||||||||||
mutation |
A boolean for whether this is mutation data. See details for more information. Default: FALSE. |
|||||||||||
... |
Other parameters passed to |
This function extracts the nucleotide sequence within a window
around each SNP and code them using 1-A, 2-C, 3-G, 4-T.
There are two ways of obtaining the nucleotide sequences. If filename
is not NULL and the file exists, it should contain the positions and alleles
for each SNP. Based on such information, the sequences around SNP positions
are extracted using the Bioconductor annotation package specified by
genome.lib
. Users should make sure that this annotation package
corresponds to the correct species and genome version of the actual data.
Alternatively, users can also provide a vector of rs ids via the argument
snpids
. The SNP locations and allele information is then obtained via
the Bioconductor annotation package specified by snp.lib
, and passed
on to the package specified by genome.lib
to further obtain the
nucleotide sequences.
If mutation=FALSE
(default), this function assumes that the data is
for SNP analysis, and the reference genome should be consistent with either
the a1 or a2 nucleotide. When extracting the genome sequence around each SNP
position, this function compares the nucleotide at the SNP location on the
reference genome with both a1 and a2 to distinguish between the reference
allele and the SNP allele. If the nucleotide extracted from the reference
genome does not match either a1 or a2, the SNP is discarded. The discarded
SNPs are in the 'rsid.rm' field in the output.
Alternatively, if mutation=TRUE
, this function assumes that the data
is for general single nucleotide mutation analysis. After extracting the
genome sequence around each SNP position, it replaces the nucleotide at the
SNP location by the a1 nucleotide as the 'reference' allele sequence, and by
the a2 nucleotide as the 'snp' allele sequence. It does NOT discard the
sequence even if neither a1 or a2 matches the reference genome. When this
data set is used in other functions, such as ComputeMotifScore
,
ComputePValues
, all the results (i.e. affinity scores and
their p-values) for the reference allele are indeed for the a1 allele, and
results for the SNP allele are indeed for the a2 allele.
If the input is a list of rsid's, the SNP information extracted from
snp.lib
may contain more than two alleles for a single location. For
such cases, LoadSNPData
first extracts all pairs of alleles
associated with those locations. If 'mutation=TRUE', all those pairs are
considered as pairs of reference and SNP alleles, and their information is
contained in 'sequence_matrix', 'a1', 'a2' and 'snpid'. If 'mutation=FALSE',
LoadSNPData
further filters these pairs based on whether one
allele matches to the reference genome nucleotide extracted from
genome.lib
. Only those pairs with one allele matching the reference
genome nucleotide is considered as pairs of reference and SNP alleles, with
their information contained in 'sequence_matrix', 'a1', 'a2' and 'snpid'.
A list object containing the following components:
sequence_matrix | A list of integer vectors representing the deroxyribose sequence around each SNP. |
a1 | An integer vector for the deroxyribose at the SNP location on the reference genome. |
a2 | An integer vector for the deroxyribose at the SNP location on the SNP genome. |
snpid | A string vector for the SNP rsids. |
rsid.missing | If the data source is a list of rsids, this field records rsids for SNPs that are discarded because they are not in the SNPlocs package. |
rsid.duplicate | If the data source is a list of rsids, this field records rsids for SNPs that based on the SNPlocs package, this locus has more than 2 alleles. |
rsid.na | This field records rsids for SNPs that are discarded because the nucleotide sequences contain none ACGT characters. |
rsid.rm | If the data source is a table and mutation=FALSE , this
field records rsids for SNPs that are discarded because the nucleotide on the
reference genome matches neither 'a1' or 'a2' in the data source. |
The results are coded as: "A"-1, "C"-2, "G"-3, "T"-4.
Chandler Zuo [email protected]
## Not run: LoadSNPData(snpids = c("rs53576", "rs7412"), genome.lib ="BSgenome.Hsapiens.UCSC.hg38", snp.lib = "SNPlocs.Hsapiens.dbSNP144.GRCh38", half.window.size = 30, default.par = TRUE , mutation = FALSE) ## End(Not run)
## Not run: LoadSNPData(snpids = c("rs53576", "rs7412"), genome.lib ="BSgenome.Hsapiens.UCSC.hg38", snp.lib = "SNPlocs.Hsapiens.dbSNP144.GRCh38", half.window.size = 30, default.par = TRUE , mutation = FALSE) ## End(Not run)
This function combines the SNP set, the motif library and the affinity score table and produce the matching subsequence found at each SNP location for each motif.
MatchSubsequence(snp.tbl, motif.scores, motif.lib, snpids = NULL, motifs = NULL, ncores = 1)
MatchSubsequence(snp.tbl, motif.scores, motif.lib, snpids = NULL, motifs = NULL, ncores = 1)
snp.tbl |
A data.frame with the following information:
|
|||||||||||||||||||
motif.scores |
A data.frame with the following information:
|
|||||||||||||||||||
motif.lib |
A list of the position weight matrices for the motifs. |
|||||||||||||||||||
snpids |
A subset of snpids to compute the subsequences. Default: NULL, when all snps are computed. |
|||||||||||||||||||
motifs |
A subset of motifs to compute the subsequences. Default: NULL, when all motifs are computed. |
|||||||||||||||||||
ncores |
The number of cores used for parallel computing. |
A data.frame containing all columns in both snp.tbl
and
motif.scores
. In addition, the following columns are added:
ref_match_seq | Best matching subsequence on the reference allele. |
snp_match_seq | Best matching subsequence on the SNP allele. |
ref_seq_snp_match | Subsequence on the reference allele corresponding to the best matching location on the SNP allele. |
snp_seq_ref_match | Subsequence on the SNP allele corresponding to the best matching location on the reference allele. |
Sunyoung Shin [email protected], Chandler Zuo [email protected]
data(example) MatchSubsequence(motif_scores$snp.tbl, motif_scores$motif.scores, motif_library, ncores=2)
data(example) MatchSubsequence(motif_scores$snp.tbl, motif_scores$motif.scores, motif_library, ncores=2)
A list of the position weight matrices corresponding to motifs, loaded by 'data(example)'.
A list object.
Sunyoung Shin [email protected], Chandler Zuo [email protected]
This data.frame object loaded by 'data(example)' contains information about MYC_disc1 match to rs53576.
A data.frame object.
Sunyoung Shin [email protected], Chandler Zuo [email protected]
This list object loaded by 'data(example)' contains two fields:
snp.tbl | A data.frame containing the sequence of nucleobases around each SNP. |
motif.scores | A data.frame containing the likelihood scores computed for each SNP and each motif. |
A data.frame object.
Sunyoung Shin [email protected], Chandler Zuo [email protected]
Plot the best matching augmented subsequences on the reference and SNP alleles. Plot sequence logos of the position weight matrix of the motif to the corresponding positions of the best matching subsequences on the references and SNP alleles.
plotMotifMatch(motif.match, motif.lib, cex.main = 2, ...)
plotMotifMatch(motif.match, motif.lib, cex.main = 2, ...)
motif.match |
a single row ofdtMotifMatch output in data.frame format |
motif.lib |
A list of position weight matrices |
cex.main |
The size of the main title. |
... |
Other parameters passed to plotMotifLogo. |
Sequence logo stacks: Reference subsequences, sequence logo of reference allele matching potision weight matrix, SNP subsequences, sequence logo of SNP allele matching potision weight matrix
Sunyoung Shin[email protected]
data(example) plotMotifMatch(motif_match, motif.lib = motif_library)
data(example) plotMotifMatch(motif_match, motif.lib = motif_library)
This parameter is fitted using 61bp windowns around the SNPs in the NHGRI catalog. Loaded by 'data(default_par)'.
A numeric vector.
Sunyoung Shin [email protected], Chandler Zuo [email protected]
This data frame is loaded by 'data(example)'. It is a table including the following columns:
chr | The chromosome. |
snp | The SNP location coordinate. |
snpid | The SNP label. |
a1,a2 | The nucleotide on the reference and SNP allele. |
A data.frame object.
Sunyoung Shin [email protected], Chandler Zuo [email protected]
This list object loaded by 'data(example)' contains three fields :
sequence_matrix | A sequence matrix, coded by 1-A, 2-C, 3-G, 4-T, with each column corresponding to a subsequence of 61 bp around one SNP. |
transition | The transition matrix used in Markov model. |
prior | The stationary distribution used in the Markov model. |
A list object.
Sunyoung Shin [email protected], Chandler Zuo [email protected]
This parameter is fitted using 61bp windowns around the SNPs in the NHGRI catalog. Loaded by 'data(default_par)'.
A 4 by 4 numeric matrix.
Sunyoung Shin [email protected], Chandler Zuo [email protected]