PhosphoPICK tutorial page

Introduction

Phosphorylation is the most ubiquitous of post-translational modifications and is responsible for regulating numerous complex operations throughout the cell cycle. Methods for predicting kinase-specific phosphorylation sites have historically operated primarily on amino acid sequences, relying on the information contained in the sequence region surrounding phosphorylation sites. However, these short sequence motifs can lack specificity and be found at random throughout the proteome. Moreover, the phosphorylation of a target substrate by a kinase is not determined solely by its binding affinity, but also by various context factors that determine how a kinase comes into contact with its substrate. The concept behind PhosphoPICK is to take stock of such context information through known kinase-substrate relationships, protein-protein interaction data and cell-cycle data in order to predict kinase substrates.

Algorithm overview

We represent observations about protein interactions, kinase-specific phosphorylation events and cell cycle profiles as Boolean variables in a Bayesian network model (shown below). The model represents observations about a phosphorylation substrate - the kinases that bind to it, its protein interactions, and whether it is up-regulated during the M, G1, S or G2 phase. The "kinase" nodes are linked to protein-protein interaction events that are believed to be relevant for the kinase to phosphorylate substrates. Similarly, the kinase nodes are conditioned on a latent variable, which is used to capture the cell cycle stages during which the kinase's substrates are up-regulated. For more detailed information on the algorithm and its prediction capabilities, please see our publication in Bioinformatics here.

Bayesian network model
Fig. 1: Context Bayesian network model. K nodes represent kinase-substrate events, P nodes represent protein-interaction events, and the top layer of nodes represents the four stages of the cell cycle.

In order to extend the model to incorporate knowledge of kinase-binding profiles, we also created a 'sequence model' that can score kinase binding sites from the protein sequence. We defined a Bayesian network model that represents various aspects of a kinase binding motif (Fig. 2). The model learns expected position-specific amino acid frequencies and the frequency of co-occurring amino acids in the form of k-mers - specifically k-mers of size 2 (dimers) and size 3 (trimers). When trained on some kinase of interest, the model learns the likelihood of these sequence features occurring within the kinase's binding sites, those of its family members, and a general phosphorylation background.

sequence model
Fig. 2: Sequence Bayesian network model. 'R' nodes represent positions in a motif surrounding the phosphorylation site, where R0 is the potential phosphorylation site. Kmer1 to Kmern represent the dimer and trimer configurations incorporated into the model.

The sequence model was incorporated into the context model as shown below in Fig. 3. When scoring a protein, the model is "scanned" along the protein's sequence and the probability of the kinase phosphorylating the substrate is queried at each potential phosphorylation site. The highest scoring value for the sequence is then taken as the best likelihood that the kinase will phosphorylate the protein. Separately, the sequence model is used to score each potential phosphorylation site in the input sequence. The final prediction score for a query kinase phosphorylating a site on an input protein is the average of the 'substrate score' from the combined Bayesian network model and the 'site score' from the sequence model. You can learn more about the training procedures and prediction accuracy of these models from our publication in BBA Proteins and Proteomics [here].

combined model
Fig. 3: Combined Bayesian network model incorporating both sequence and context data. The variable representing some kinase of interest binding to a phosphorylation site is conditioned on the variable in the context model representing the kinase targeting the substrate.
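
The scoring steps described above (take the highest value found while scanning the sequence as the substrate score, then average it with each site score) can be illustrated with a minimal sketch. This is not the PhosphoPICK implementation; the function names and the example scores are hypothetical and chosen only to show the arithmetic.

```python
def substrate_score(scanned_scores):
    """Highest probability found while scanning the model along the sequence."""
    return max(scanned_scores)


def combined_score(substrate, site):
    """Average of the substrate score and the site score."""
    return (substrate + site) / 2.0


# Hypothetical scores for three candidate sites on one protein.
context_scan = {"S10": 0.42, "T57": 0.81, "S203": 0.65}  # combined-model queries
site_scores = {"S10": 0.30, "T57": 0.55, "S203": 0.90}   # sequence-model scores

best = substrate_score(context_scan.values())
combined = {site: combined_score(best, score) for site, score in site_scores.items()}

for site, score in sorted(combined.items(), key=lambda item: item[1], reverse=True):
    print(f"{site}\t{score:.3f}")
```

Note that in this sketch a single substrate score is shared by every site on the protein, so the ranking of sites within one protein is driven by the site scores.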

Submitting a job

In order to submit a job, one or more protein sequences in FASTA format are required as input. These can be provided either through a file upload or by copying and pasting sequences into the input box. Currently, PhosphoPICK can make predictions for human and mouse proteins, so input proteins must belong to one of these species. In order for PhosphoPICK to be able to correctly identify the input proteins, the 'species' selection option must be set to the species of the input proteins. Next, one or more kinases need to be selected to make predictions for. The 'selection type' option allows you to choose between selecting a single kinase, multiple kinases, or all kinases. The 'family' selection option is for filtering kinases according to their families. Fig. 4 below shows an example of a human protein being used as input, and a single kinase from the AGC family being chosen to make predictions. Please note that while PhosphoPICK allows for submissions of up to 2000 proteins, if the number of returned predictions (i.e. number of kinase-specific site predictions) exceeds 250,000 the results will only be available in downloadable form.

submitting screenshot
Fig. 4: Screenshot showing the choosing of a kinase and the entering of a protein sequence.
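
Before uploading a large file, it can be useful to check locally that it parses as FASTA and stays within the 2000-protein limit mentioned above. The following is a small sketch for that purpose only; the helper names and the example sequences are hypothetical, and the server may apply additional checks that are not reproduced here.

```python
MAX_PROTEINS = 2000  # submission limit stated in the tutorial


def read_fasta(text):
    """Parse FASTA text into a list of (identifier, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            parts = line[1:].split()
            header, chunks = (parts[0] if parts else ""), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_submission(text):
    """Return the parsed records, or raise if the input cannot be submitted."""
    records = read_fasta(text)
    if not records:
        raise ValueError("no FASTA records found")
    if len(records) > MAX_PROTEINS:
        raise ValueError(f"{len(records)} proteins exceeds the limit of {MAX_PROTEINS}")
    return records


# Hypothetical two-protein input.
example = ">protA example protein\nMSTSPLR\nKKSA\n>protB\nMTEY\n"
records = check_submission(example)
print(len(records), "proteins ready to submit")
```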

P-value calculation

The 'P-value calculation' section contains an option for calculating a p-value and choosing a significance threshold (i.e. only results that obtain a p-value below the threshold are retained). To compute the p-value, we first calculate empirical p-values for both the context score and the site score. This is done by using the distribution of these scores over the proteome; i.e. the distribution of context scores for each human protein, and the distribution of site scores for each potential phosphorylation site in the human proteome. The empirical p-values are frequencies calculated from the number of times a protein or a site is observed to have a score greater than, or equal to, the query. A final 'combined' p-value is calculated from the context and site p-values using Fisher's method.

pvalue screenshot
Fig. 5: Screenshot showing the choosing of a p-value threshold and entering of an email address.
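
The p-value calculation described in this section can be sketched as follows. This is an illustration, not the server's code: the background distributions are tiny made-up lists, and how PhosphoPICK handles an empirical p-value of zero is not stated in this tutorial, so the zero case below is an assumption. For two p-values, the Fisher statistic -2(ln p1 + ln p2) follows a chi-square distribution with 4 degrees of freedom under the null hypothesis, whose upper tail has the closed form q(1 - ln q) with q = p1 * p2.

```python
import math


def empirical_p_value(query, background):
    """Fraction of background scores greater than or equal to the query."""
    hits = sum(1 for score in background if score >= query)
    return hits / len(background)


def fisher_combined_p(p_context, p_site):
    """Combine two p-values with Fisher's method (chi-square, 4 d.o.f.)."""
    q = p_context * p_site
    if q == 0.0:
        return 0.0  # assumption: a zero empirical p-value is passed through
    return q * (1.0 - math.log(q))


# Hypothetical background distributions standing in for proteome-wide scores.
context_background = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
site_background = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

p_context = empirical_p_value(0.9, context_background)
p_site = empirical_p_value(0.8, site_background)
p_combined = fisher_combined_p(p_context, p_site)
print(p_context, p_site, round(p_combined, 4))
```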

Retrieving your results

There is an option on the submission page (shown in Fig. 5) to enter an email address. If you provide an email address, you will be sent a link to your results when they are completed. If you don't wish to provide an email address, you are still provided with a link after the job is submitted, which you can use to access your results when they are completed. Alternatively, the waiting page you are redirected to after submitting a job will continue to refresh until the results can be displayed. Running times will be longer for submissions with many proteins, and when p-values are calculated for many proteins and/or kinases.

Understanding the results

Your results will be returned in an interactive table with the following columns (Fig. 6 below shows example output):

  • Protein: identifier from the FASTA file.
  • Uniprot ID: identifier of the protein that has been found by the BLAST search (please note that if multiple proteins are returned as the most significant match, PhosphoPICK will return all matching Uniprot IDs). The ID links to the protein's page on the Uniprot website, allowing you to confirm that PhosphoPICK has found the correct protein.
  • Site: the location of a potential phosphorylation site on the protein.
  • Kinase: the kinase being scored for this substrate/site.
  • Context score: probability according to the PhosphoPICK Bayesian network model that your chosen kinase is phosphorylating the protein.
  • Site score: probability according to the naive Bayes model that this site is being phosphorylated by the kinase.
  • Combined: Represents the combined probability according to the context model and the sequence model that this site is phosphorylated by the kinase. Calculated as the average of the context score and the site score.
  • P-value (Optional): a p-value representing the probability that the context and site scores could be seen by chance.

In the below example, NF-Kappa B was submitted for analysis, with all kinases chosen. A p-value threshold of 0.005 was selected to return the most significant hits. The table is shown with the results sorted by the combined score. You can access this example in the results page by clicking here.

results screenshot
Fig. 6: Screenshot showing example results, sorted from greatest to least combined score.

Filtering results

There is an option at the bottom of the results table to filter results according to protein identifier or according to protein identifier and site. To use the filter, select the filtering option and enter one or more entries that you want to display. If you are filtering according to protein and site, the protein names and sites should be separated by either a space or a tab.

The below screenshot is again from the NF-Kappa B example. According to Uniprot, there are phosphorylation sites at positions 429, 713, 715 and 717 that are not yet annotated with kinases. If you wanted to find the most likely kinases for these sites, they could be entered as per the below example.

filter screenshot
Fig. 7: Screenshot showing a list of proteins and sites being submitted to the filter.

Fig. 8 below shows the results after the filter has been applied. They have also been sorted from lowest to highest p-value. It can be seen that in this example, the highest scoring kinase for the phosphorylation sites at positions 715 and 713 is Casein kinase II subunit alpha (CSNK2A1).

results filtered screenshot
Fig. 8: Screenshot showing example results after the filter has been applied. Results have been sorted by P-value.

Protein viewer page

If you click on a protein name within the 'Protein' column of the results table, you will be directed to the protein viewer page, which contains an interactive visualisation of the results for that protein. Fig. 9 below shows the top half of the protein viewer page for our NF-Kappa B example, where isoform three has been selected. The protein sequence is displayed across the top, with the amino acid position numbers displayed at the bottom (not shown in Fig. 9). The kinases with predicted sites are listed along the Y-axis, and their predicted phosphorylation sites are annotated along the sequence as circles. The shade of each circle corresponds to the strength of the context prediction, while the size of each circle corresponds to the strength of the sequence prediction for that site. As can be seen in the example, the kinase that has the strongest prediction is CSNK2A1.

viewer top half screenshot
Fig. 9: Screenshot showing example results for NF-Kappa B in the protein viewer page.

You can also zoom in on a section of the protein by clicking on the start of the section that you wish to view and dragging the mouse to cover it. Fig. 10 below shows a zoomed-in region at the bottom of the viewer. Also shown is the 'selected site info' panel. When one of the phosphorylation site predictions is clicked, the panel will show the prediction information (protein identifier, site, kinase, context score, site score and p-value if calculated).

viewer bottom half screenshot
Fig. 10: Screenshot showing the bottom section of the example from Fig. 9, where the viewer has been zoomed in. Also shown is a panel showing the data associated with a selected site.

Downloading results

You can download results as a tab-delimited text file. This file contains all the information listed above, plus a few extra columns:

  • blastp identity: The percentage sequence identity found between the input protein and the protein represented by the Uniprot ID.
  • context p-value (Optional): The p-value representing the probability of a context score this high being seen by chance in the proteome.
  • site p-value (Optional): The p-value representing the probability of a site score this high being seen by chance in the proteome.
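
Because the download is a tab-delimited text file, it can be post-processed with standard tools. The sketch below reads such a file with Python's csv module, keeps rows below a p-value threshold and sorts them by combined score. The column header names, the placeholder kinase name KIN_X and the example rows are assumptions made for illustration; check the header line of your own download and adjust the keys accordingly.

```python
import csv
import io

# Hypothetical excerpt; the real header names in the download may differ.
sample = (
    "protein\tsite\tkinase\tcombined\tp-value\n"
    "protA\t715\tCSNK2A1\t0.91\t0.001\n"
    "protA\t429\tKIN_X\t0.40\t0.020\n"
    "protA\t713\tCSNK2A1\t0.88\t0.002\n"
)


def load_hits(handle, p_threshold):
    """Keep rows with a p-value below the threshold, best combined score first."""
    reader = csv.DictReader(handle, delimiter="\t")
    hits = [row for row in reader if float(row["p-value"]) < p_threshold]
    hits.sort(key=lambda row: float(row["combined"]), reverse=True)
    return hits


hits = load_hits(io.StringIO(sample), p_threshold=0.005)
for row in hits:
    print(row["protein"], row["site"], row["kinase"], row["combined"])
```

To use this on a real download, replace io.StringIO(sample) with an opened file handle, for example open(path, newline="").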

SNV analysis with PhosphoPICK-SNP

Genome-wide association studies are identifying single nucleotide variants (SNVs) linked to various diseases; however, the functional effect caused by these variants is often unknown. One potential functional effect, the loss or gain of protein phosphorylation sites, can be induced through variations in key amino acids that disrupt or introduce valid kinase binding patterns. There are numerous examples of such 'phosphovariants'; our group has collated examples from the literature, and they can be viewed here. We encourage researchers who have published additional examples of phosphovariants to inform us so we can add to this growing list.

The SNV analysis page allows researchers to submit protein sequences and amino acid variants in those proteins. The PhosphoPICK-SNP method is employed to predict whether the variants will cause a loss or gain of phosphorylation at residues within the vicinity of a variant. This method quantifies the expected effect of a non-synonymous SNV (nsSNV) on phosphorylation based on predictions from the sequence model, and the probability that a query kinase will target the variant protein. Employing distributions of predicted variants across the proteome, the method can then provide a measure of the significance of novel variants.

When submitting variants for prediction to the server, protein sequences and variants corresponding to amino acids in a submitted protein are required. An example submission is shown in Fig. 11. Each line in the variant submission box should contain a protein identifier (corresponding to a protein sequence) and a variant, separated by a space or tab. Variants should be of the form 'R196C'; i.e. the first letter represents the reference amino acid contained in the submitted protein, the number represents the location of the amino acid and the final letter represents the variant amino acid. Any number of variants can be submitted for each submitted protein sequence. As with phosphorylation site prediction, one or more kinases must be selected to make predictions for. Finally, the E-value threshold for returning predictions can be selected. The E-value is calculated using a Bonferroni multiple-testing correction, where the E-value equals the product of the P-value and the number of tests performed; in this case, the number of tests is equal to the number of kinases selected. If only one kinase is selected, the E-value is therefore equal to the P-value.

PhosphoPICK-SNP submission page
Fig. 11: Screenshot showing the submission page for the PhosphoPICK-SNP method, where protein sequences and variants have been entered.
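
The variant line format and the E-value correction described above can be sketched in a few lines. This is only an illustration of the stated rules: the function names are hypothetical, positions are assumed to be 1-based, and the server's own validation is not reproduced here.

```python
import re

# One reference letter, a position, one variant letter, e.g. 'R196C'.
VARIANT_RE = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_variant_line(line):
    """Split 'identifier<space or tab>R196C' into its parts."""
    identifier, variant = line.split()
    match = VARIANT_RE.match(variant)
    if match is None:
        raise ValueError(f"unrecognised variant: {variant!r}")
    ref, pos, alt = match.groups()
    return identifier, ref, int(pos), alt


def reference_matches(sequence, ref, pos):
    """True if the sequence has the reference residue at pos (assumed 1-based)."""
    return 1 <= pos <= len(sequence) and sequence[pos - 1] == ref


def e_value(p_value, n_kinases):
    """Bonferroni-style E-value: p-value times the number of kinases selected."""
    return p_value * n_kinases


# Hypothetical identifier; the variant string is the example from the text.
print(parse_variant_line("protA\tR196C"))
print(e_value(0.01, 5))
```

With a single kinase selected, e_value(p, 1) returns p unchanged, matching the note above that the E-value then equals the P-value.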

SNV analysis results

Your results will be returned in an interactive table with the following columns:

  • Protein: identifier from the FASTA file.
  • Variant: location of the variant impacting the phosphorylation site.
  • Phos Site: the location of a potential phosphorylation site on the protein.
  • Kinase: the kinase being scored for this substrate/site.
  • Context Score: probability according to the PhosphoPICK Bayesian network model that your chosen kinase is phosphorylating the protein.
  • Reference Score: probability according to the sequence model that the kinase can phosphorylate this site on the reference protein.
  • Variant Score: probability according to the sequence model that the kinase can phosphorylate this site on the variant protein.
  • E-value: E-value representing the significance of the variant's effect on phosphorylation.
  • Peptide: The set of amino acids surrounding the potential phosphorylation site. The reference and variant amino acids are denoted as (Reference/Variant). The residue where the phosphorylation site is predicted to occur is marked by an asterisk (*).

You can download results as a tab-delimited text file. This file contains all the information listed above, plus a few extra columns:

  • Uniprot Identifier: identifier of the protein that has been found by the BLAST search (please note that if multiple proteins are returned as the most significant match, PhosphoPICK will return all matching Uniprot IDs).
  • Context P-value: the significance of the context score when compared to the distribution of context scores over the proteome.
  • Sequence Diff. E-value: E-value based on the sequence scores alone.
  • Alternative: Statistical alternative to the null hypothesis used when computing Fisher's exact test. 'Loss' indicates that the variant results in loss of phosphorylation and 'gain' indicates that the variant results in gain of phosphorylation.

Download page

The download page is for those who are interested in obtaining proteome-wide predictions of kinase substrates (i.e. predictions of phosphorylation sites are not included). Similar to the 'submit sequences' page, there are options to choose from the available kinases; however, instead of submitting protein sequences, the set of reviewed proteins in Uniprot (Swissprot) is retrieved for either the canonical or the isoform set of proteins.

The downloadable file contains the following information for each kinase that is queried:

  • substrate: Uniprot Accession for the protein.
  • kinase score: probability according to the PhosphoPICK Bayesian network model that this kinase is phosphorylating the substrate.
  • kinase p-value (Optional): p-value representing the probability that a score this high could be seen by chance in the proteome.
