ChIPanalyser provides a quick an easy method to predict and explain TF binding. The package uses a statistical thermodynamic framework to model the binding of proteins to DNA.
The model assumes that there are four main driver to TF binding:
While binding energy is given by PWM matrices, the other parameters will be inferred by maximizing or minimizing a goodness of fit score between ChIP data and ChIPanalyser predictions. To do so, ChIPanalyser provides a genetic algorithm to infer optimal values for each parameter.
The chromatin state model is described by the following equation derived from statistical thermodynamics:
$$P(N,a,\lambda,\omega)_j = \frac{N \cdot a_{j} \cdot e^{(\frac{1}{\lambda} \cdot \omega_{j})}}{N \cdot a_{j} \cdot e^{(\frac{1}{\lambda} \cdot \omega_{j})}+ L \cdot n \cdot [a_{i} \cdot e^{(\frac{1}{\lambda} \cdot \omega_{j})}]_ {i}} $$
Chromatin state affinity is defined as the following:
aj = ∑kαk ⋅ cjk
with α the chromatin state affinity scores for a given TF and c the chromatin state at site j.
Shortly, the model above generates ChIP like profiles and describes the affinity of a TF to chromatin states. The affinity scores a, the number of bound molecules N, and lambda λ are inferred by optimising the fit between predicted profiles and ChIP data using a genetic algortihm.
To demonstrate the use of ChIPanalyser and the genetic algorithm, we
provide some internal data sets. We complement this data with external
data source such a DNA sequence sets taken from the
BSgenome
package suit.
First, we load ChIPanlyser and the internal data
To reduce the size of the internal data size, we complement our data with the Drosophila DNA sequence taken from the BSgenome package suit.
Now we that have loaded some data, what exactly are we looking at? The environment should contain the following new objects:
chip - GRanges object containing ChIP scores. Essentially, ChIP data that will be used to train our model.
cs - GRanges object containing Chromatin State (CS) information. Essentially, where each CS can be found in the genome.
top - GRanges object containing regions of interest. Esssentially, the genomic regions we will use for training and testing.
PFM - Path to file containing the Position Frequency Matrix. Essentially, this represents the binding affinity of our TF (here we are using BEAF-32)
DNASequenceSet - DNAStringSet containing the DNASequenceSet of the organism we are looking at.
geneRef - GRanges object containing gene reference taken from the (Genome UCSC website)[https://genome.ucsc.edu/]
The next step consitst in setting initial parameters to run the genetic algorithm.
# Number of individuals per generation
pop <- 10
# Number of generations
gen <- 2
# Mutation Probability
mut <- 0.3
# Children - Number of ofspring passed to the next generation
child <- 2
# Method - Goodness of fit metric used to optimise the Genetic algorithm
method <- "MSE"
Please note that the parameters presented here are for the sake of the vignette. We recommend using the following parameters for a first trial with your own data. Please be advised that these parameters will depend on the nature of the data you are using.
# Number of individuals per generation
pop <- 100
# Number of generations
gen <- 50
# Mutation Probability
mut <- 0.3
# Children - Number of ofspring passed to the next generation
child <- 10
# Method - Goodness of fit metric used to optimise the Genetic algorithm
method <- "MSE"
We can also define which parameters we wish to optimise. In this example, we will optimise the number of bound molecules (N), lambda (λ), the PWM Threshold and 11 different chromatin states.
Alternatively, we can also set custom ranges for the parameters. We will use custom ranges here to reduce computational time.
params_custom <- vector("list", 14)
names(params_custom) <- c("N","lambda","PWMThreshold", paste0("CS",seq(1:11)))
# vector in the format of min value, max value and number of values
params_custom$N <- c(1,1000000,5)
params_custom$lambda <- c(1,5,5)
# Bound between 0 and 1
params_custom$PWMThreshold <- c(0.1,0.9,5)
# Bound between 0 and 1
CS <- c(0,1,5)
CS_loc <- grep("CS",names(params_custom))
for(i in CS_loc){
params_custom[[i]] <- CS
}
The first step of any ChIPanalyser anaylsis is to create parameter object. These object contain input paramters and hold basic data. Here, we will load the PFM as PWM and compute the Base Pair Frequency from the DNASequenceSet.
For clairty, we will show what a starting population looks like. This
step is not stricly required as the evolve
function can
take the number of individuals in the poppulation (defined above by
pop
) and the parameters to be optimised (defined above by
params
or params_custom
).
We will pre-proccess the ChIP data by reducing noise and converting GRanges to a ChIPscore object. This object contains normalised and smoothed ChIP scores that will be used to train and test the model.
chipProfile <- processingChIP(chip,loci = top)
# Splitting data into training and testing
# We recommend setting dist to 20/80. However, here we only have 4 loci.
splitdata <- splitData(chipProfile,dist = c(50,50),as.proportion = TRUE)
trainingSet <- splitdata$trainingSet
testingSet <- splitdata$testingSet
We can run the Genetic algorithm provided by ChIPanalyser using a single function. The function will generate intermediate files that allow you to check the status of the algorithm while it is running. It should be noted that the intermediate files are updated at each generation. This method also provides a lambda database. This allows for a faster run time on larger data sets as all values for lambda are pre-computed.
If you do not want intermediate files, set the
checkpoint
argument to FALSE
.
If you do not want to pre-compute lambda values, set the
lambda
argument to FALSE
.
For the purpose of this vignette, we will not save intermediate files.
evo <- evolve(population = pop,
DNASequenceSet = DNASequenceSet,
ChIPScore = trainingSet,
genomicProfiles = GPP,
parameters = params_custom,
mutationProbability = mut,
generations = gen,
offsprings = child,
chromatinState = cs,
method = method,
filename = "This_TF_is_Best_TF",
checkpoint = FALSE,
cores= 1)
## Generating Lambda DataBase
## Generation: 1
## Generation Fitness: 0.0153583153136014
## Generation: 2
## Generation Fitness: 0.0152928555331196
The output of this function returns a list of 3 elements:
Once we have the best perfomring paramters, we can plug them into a single run of ChIPanalyser and use these parameters on a tesing set.
The first step is to extract the best performing individual and its associated traits from the population.
We can use these parameters as input to the singleRun
function to obtain the ChIP like profiles. This run represents the
predicted run and TF binding affinity of our TF of choice. We will set
fitness to all
. This means that the function will return
all goodness of fit metrics available.
# Set chromatin states for single run - create CS Granges with affinity scores
cs_single <- setChromatinStates(single,cs)[[1]]
superFit <- singleRun(indiv = single,
DNAAffinity = cs_single,
genomicProfiles = GPP,
DNASequenceSet = DNASequenceSet,
ChIPScore = testingSet,
fitness = "all")
## Warning in ks.test.default(predicted, locusProfile): p-value will be
## approximate in the presence of ties
## Warning in ks.test.default(predicted, locusProfile): p-value will be
## approximate in the presence of ties
This final sections returns a list containing 3 elements: