KinSwingR aims to predict kinase activity from phoshoproteomics data. It implements the alogorithm described in: Engholm-Keller et al. (2019) (in greater detail below). KinSwingR predicts kinase activity by integrating kinase-substrate predictions and the fold change and signficance of change for peptide sequences obtained from phospho-proteomics studies. The score is based on the network connectivity of kinase-substrate networks and is weighted for the number of substrates as well as the size of the local network. P-values are provided to assess the significance of the KinSwing scores, which are determined through random permuations of the overall kinase-substrate network.
KinSwingR is implemented as 3 core functions:
buildPWM()
builds position weight
matrices (PWMs) from known kinase-substrate sequencesscoreSequences()
score PWMs build
using buildPWM()
against input phosphoproteome dataswing()
integrates PWM scores,
direction of phosphopeptide change and significance of phosphopeptide
change into a “swing” score.The KinSwing score is a metric of kinase activity, ranging from positive to negative, and p-values are provided to determine significance.
Additional functions are also provided:
cleanAnnotation()
function to tidy
annotations and extract peptide sequences.viewPWM()
function to view PWM
modelsDetailed information for each of these functions can be accessed
using the ?
command before the function of interest. E.g.
?buildPWM
We will now consider an example dataset to predict kinase activity. Kinase-substrate sequences and phosphoproteomics data are provided as example data in the KinSwingR package.
Begin by loading the KinSwingR library and the two data libraries included in the package.
library(KinSwingR)
data(example_phosphoproteome)
data(phosphositeplus_human)
# View the datasets:
head(example_phosphoproteome)
## annotation peptide fc pval
## 1 A0A096MJ61|NA|89|PRRVRNLSAVLAART NA -0.08377538 0.218815889
## 2 A0A096MJB0|Adcy9|1296|LDKASLGSDDGAQTK NA 0.03707147 0.751069301
## 3 A0A096MJB0|Adcy9|610|PRGQGTASPGSVSDL NA -0.06885408 0.594494965
## 4 A0A096MJB0|Adcy9|613|QGTASPGSVSDLAQT NA -0.29418446 0.002806832
## 5 A0A096MJN4|Sept4|49|ILEPRPQSPDLCDDD NA 0.09097982 0.078667811
## 6 A0A096MJN4|Sept4|81|FCPPAPLSPSSRPRS NA -0.12246661 0.078619010
## kinase substrate
## [1,] "EIF2AK1" "MILLSELSRRRIRSI"
## [2,] "EIF2AK1" "RILLSELSR______"
## [3,] "EIF2AK1" "IEGMILLSELSRRRI"
## [4,] "PRKCD" "MKKKDEGSYDLGKKP"
## [5,] "PRKCD" "FPLRKTASEPNLKVR"
## [6,] "PRKCD" "PLLARSPSTNRKYPP"
## sample 100 data points for demonstration
sample_data <- head(example_phosphoproteome, 1000)
# Random sample for demosntration purposes
set.seed(1234)
sample_pwm <- phosphositeplus_human[sample(nrow(phosphositeplus_human), 1000),]
# for visualising a motif, sample only CAMK2A
CAMK2A_example <- phosphositeplus_human[phosphositeplus_human[,1]== "CAMK2A",]
Where the centered peptide sequences (on the phosphosite of interest)
are not provided in the format required for
scoreSequences()
(see the argument “input_data”, in
?scoreSequences), these can be required to be extracted from another
column of annotated data. NB. “input_data” table format must contain
columns for “annotation”, “peptide”, “fold-change” and “p-values”.
In the example dataset provided,
example_phosphoproteome
, peptides have not been extracted
into a stand-a-lone peptide column. cleanAnnotation()
is
provided as a function to extract peptides from annotation columns and
place into the peptide column.
In the example dataset, example_phosphoproteome
, the
peptide sequence is the 4th component of the annotation, which
corresponds to using the argument seq_number = 4
below, and
is seperated by |
, which corresponds to the argument
annotation_delimiter = "|"
. In this case, the annotated
data also contains multi-mapped and multi-site information. For example
the following annotation
A1L1I3|Numbl|263;270|PAQPGHVSPTPATTS;SPTPATTSPGEKGEA
contains two peptides PAQPGHVSPTPATTS
and
SPTPATTSPGEKGEA
that map to different sites from the same
reference gene Numbl
, where the peptides are seperated by
;
. The annotated data also includes multi-protein mapped
(where a peptide could map to more than one protein - not shown) and
contains X
instead of _
to indicate sequences
that were outside of the length of the coding sequences. KinSwingR
requires that these sequences outside of the coding region are marked
with _
as deafult and therefore
replace_search = "X"
and replace_with = "_"
can be used as arguments in cleanAnnotation()
to replace
these. This allows for full flexibility of the input data here,
depending of the software used to generate determine the peptide
sequences. NB: characters other than _
can be used, but
these need to be declared when calling buildPWM and scoreSequences
functions later (see their help files).
Calling cleanAnnotation()
will produce a new table with
the unique combinations of peptide sequences extracted from the
annotation column into the peptide column:
annotated_data <- cleanAnnotation(input_data = sample_data,
annotation_delimiter = "|",
multi_protein_delimiter = ":",
multi_site_delimiter = ";",
seq_number = 4,
replace = TRUE,
replace_search = "X",
replace_with = "_")
head(annotated_data)
## annotation peptide fc pval
## 1 A0A096MJ61|NA|89|PRRVRNLSAVLAART PRRVRNLSAVLAART -0.08377538 0.218815889
## 2 A0A096MJB0|Adcy9|1296|LDKASLGSDDGAQTK LDKASLGSDDGAQTK 0.03707147 0.751069301
## 3 A0A096MJB0|Adcy9|610|PRGQGTASPGSVSDL PRGQGTASPGSVSDL -0.06885408 0.594494965
## 4 A0A096MJB0|Adcy9|613|QGTASPGSVSDLAQT QGTASPGSVSDLAQT -0.29418446 0.002806832
## 5 A0A096MJN4|Sept4|49|ILEPRPQSPDLCDDD ILEPRPQSPDLCDDD 0.09097982 0.078667811
## 6 A0A096MJN4|Sept4|81|FCPPAPLSPSSRPRS FCPPAPLSPSSRPRS -0.12246661 0.078619010
The first step to inferring kinase activity, is to build Position
Weight Matrices (PWMs) for kinases. This can be done using
buildPWM()
for any table containing centered substrate
peptide sequences for a list of kinases. The example data
data(phosphositeplus_human)
indicates the required format
for building PWM models. Below, for demosntration, we use a subset that
was sampled above sample_pwm
To generate the PWMs:
This will build the PWM models, accessible as PWM$pwm
and list the number of substrate sequences used to build each PWM,
accesible as PWM$kinase
.
To view the list of kinases and the number of sequences used:
## kinase n
## 1 PLK1 23
## 7 CSNK2A1 60
## 8 MAPK8 21
## 9 AURKA 12
## 10 MAPK3 27
## 11 PRKACA 62
Motif amino acids are coloured according to their properties.
color_scheme
parameter allows options are “lesk” or
“shapely” (default). Y-axis is the information content, measured as
bits.
Next, we will use the PWM models generated, pwms
, to
identify matches in the annotated_data
table that was
cleaned using cleanAnnotation()
above.
``scoreSequencessupports multi-core processing - see the example below for setting the number of workers to 4.
scoreSequencesdraws a random background by default of size
n
=
1000. It is recommended to use
set.seed()prior to calling
scoreSequencesif you wish to reproduce your results. To access the help file, which explains all the arguments, type
?scoreSequences```
into the console.
# As an example of control over multi-core processing
# load BiocParallel library
library(BiocParallel)
# finally set/register the number of cores to use
register(SnowParam(workers = 4))
# set seed for reproducible results
set.seed(1234)
scores <- scoreSequences(input_data = annotated_data,
pwm_in = pwms,
n = 100)
The outputs of scores
are transparent and accessible.
These are however primarily intermediate tables for obtaining swing
scores. scores
is a simple list object that contains
peptide scores (scores$peptide_scores)
, p-values for the
peptide scores (scores$peptide_p)
and the background
peptides used to score significance (scores$background)
for
reproducibility (i.e. the background can saved and reused for
reproducibility).
In summary, scoreSequences()
scores each input sequence
for a match against all PWMs provided using `buildPWM()
and
generates p-values for scores. This is effectively one large network of
kinase-substrate edges of dimensions kinase,
k, by substrate,
s.
Having built a kinase-substrate network, swing()
then
integrates the kinase-substrate predictions, directionality and
significance of phosphopeptide fold change to assess local connectivity
(or swing) of kinase-substrate networks. The final score is a normalised
score of predicted kinase activity that is weighted for the number of
substrates used in the PWM model and number of peptides in the local
kinase-substrate network. By default, this will permute the network 1000
times (here we use 10 for example purposes). It is recommended to use
set.seed()
prior to calling swing
if you wish
to reproduce your results. ``swing``` supports multi-core processing -
see the example below for setting the number of workers to 4.
# after loading BiocParallel library, set/register the number of cores to use
register(SnowParam(workers = 4))
# set seed for reproducible results
set.seed(1234)
swing_out <- swing(input_data = annotated_data,
pwm_in = pwms,
pwm_scores = scores,
permutations = 10)
# This will produce two tables, one is a network for use with e.g. Cytoscape and the other is the scores. To access the scores:
head(swing_out$scores)
## kinase pos neg all pk nk swing_raw n swing p_greater
## 2 AKT1 4 1 5 0.8000000 0.2000000 19.726764 19 1.6151652 0.09090909
## 5 AURKB 4 1 5 0.8000000 0.2000000 15.426556 10 1.2855635 0.09090909
## 10 CHEK1 3 1 4 0.7500000 0.2500000 11.364062 12 0.9741821 0.18181818
## 12 CSNK2A1 4 4 8 0.5000000 0.5000000 0.000000 60 0.1031511 0.18181818
## 4 AURKA 4 2 6 0.6666667 0.3333333 9.266994 12 0.8134463 0.27272727
## 6 CAMK2A 4 3 7 0.5714286 0.4285714 4.660630 16 0.4603784 0.27272727
## p_less
## 2 1.0000000
## 5 0.9090909
## 10 0.9090909
## 12 0.6363636
## 4 0.8181818
## 6 0.8181818
The outputs of this table indicate the following:
kinase
: The kinasepos
: Number of positively
regulated kinase substratesneg
: Number of negatively
regulated kinase substratesall
: Total number of regulated kinase substratespk
: Proportion of positively
regulated kinase substratesnk
: Proportion of negatively
regulated kinase substratesswing_raw
: Raw - weighted scoren
: Number of subtrate sequence in kinase
PWMswing
: Normalised (Z-score transformed) - weighted
scorep_greater
: probability of observing a swing score
greater thanp_less
: probability of observing a swing score less
thanNote that the pos
, neg
and all
include a pseudo-count, that is set in ?swing
, note
pseudo_count
.
*** See Engholm-Keller et al. (2019) for methods description ***
*** For a full description of the KinSwing algorithm, see Engholm-Keller et al. (2019) ***
In brief:
buildPWM()
generates Position Weight Matrices (PWMs) for
kinases based on known substrate sequence (Equation 1), where each
kinase, K, is considered as
the log-likelihood ratio of the average frequency of amino acid, a, at each position, p, divided by background
frequencies, B (C is a pseudo count to avoid log
zero):
Equation 1: $PWM_{(a,p)}=log((1/n∑^n_{i=1}K_i)+C)/B_a+C)$
scoreSequences()
scores each kinase, K, match to a substrate S, given as $S_{score}=∑^n_{(i=1)}f(a,p)$ , which
corresponds to the sum of the corresponding amino acid, a, of peptide sequence length, i, from position, p, of PWM(a, p)
and f(a, p) = PWMap ∈ PWM(a, p).
The probability of observing Sscore
for kinase, K, is determined
as conditional on a randomly sampled reference distribution of size
N sequences P(Sscore|R, N),
where R sequences are
determined to have a test statistic less than or equal to Sscore:
Equation 2: $R= ∑^N_{n=1}I((S_{score})n* ≥ (S_{score})i)$
swing()
integrates phosphosite data and kinase-substrate
scores from scoreSequences()
into a network for scoring
kinase activity based on local connectivity, swingk,
(Equation 3). swingk
is the weighted product of the proportion of positive, Posk,
and negative, Negk,
network edges, determined as the product of a logic function (described
here: Engholm-Keller et al. (2019)) given
a local network of size, Ck, with n substrates for kinase, K:
Equation 3: swingk = log2((Posk + c)/(Negk + c)) * log2(Ck) * log2(Sn)
swingk, is transformed into a z-score, Z(swingk), where, μ, is the mean and, σ, the standard deviation of swing scores, thus allowing for comparison of predicted kinase activity across multiple timepoints and/or conditions.
KinSwingR addresses the question of “how likely is it is observe the predicted activity of kinase, K, by random chance?” by computing swingk given N permutations of kinase node labels, K, to substrates, S, of the total network, Mks. Thus, the probability of observing swingk is conditional on this permuted reference distribution, of size, N (Equation 2). This is computed for each tail of the distribution, that is, positive and negative swingk scores.