Alpha&ESM hFolds
A database for the comparison of structural models predicted by ESMFold and AlphaFold2 for 42,942 human proteins.
Part of the Bioinformatics Sweeties collection.
Home page
From the home page of the web server, it is possible to query the model database in two ways:
UniProt accession: the accession is searched against the database and, if present, all data for the entry are shown. Otherwise, an error message is displayed prompting the user to go back to the home page.
The same search functionality is also available outside the home page, through the “Quick search” field in the navigation bar at the top of each page.
FASTA sequence: the protein is aligned against all the sequences present in our database using BLAST. Depending on the results, a list of at most 10 entries is displayed.
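A client script may want to check that a string at least has the shape of a UniProt accession before submitting it. The sketch below uses the accession pattern published in the UniProt help pages. The function name `looks_like_uniprot_accession` is hypothetical, and upper-casing the input is an assumption; this page does not say whether the web server performs a comparable check, only that unknown accessions produce an error message.

```python
import re

# Accession pattern as published in the UniProt help pages.
_UNIPROT_ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def looks_like_uniprot_accession(text: str) -> bool:
    """Return True if `text` has the shape of a UniProt accession.

    Only the format is checked; a well-formed accession may still be
    absent from the model database.
    """
    # Assumption: surrounding whitespace and letter case are ignored.
    candidate = text.strip().upper()
    return _UNIPROT_ACCESSION.fullmatch(candidate) is not None


if __name__ == "__main__":
    for query in ("P07902", "q96p20", "7pzc_A"):
        print(query, looks_like_uniprot_accession(query))
```

A format check of this kind only avoids obviously malformed queries; the database lookup remains the authority on whether an entry exists.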
Search page
From the search page of the web server, it is possible to query the model database using different criteria. The search is case insensitive.
Entries can be filtered by Gene name. If this field is left empty, the filter will not be applied.
Entries can be filtered based on the TM-scores between the ESMFold and AlphaFold2 models. The TM-score goes from 0 (no overlap) to 1 (perfect overlap) and the user can set the minimum and maximum values. The default range FROM 0 TO 1 selects all possible entries.
The search can be restricted to the entries endowed with a PDB structure. If this flag is turned on, the following filters become available.
Entries can be filtered by PDB ID. If this field is left empty, the filter will not be applied. The search is case insensitive. After the PDB ID, an underscore (_) followed by a specific chain can be added (e.g., 7pzc_A).
Entries can be filtered based on the TM-scores between the ESMFold model and the PDB structure, and/or on the TM-scores between the AlphaFold2 model and the PDB structure. The logic for combining the two filters (AND / OR) can be selected.
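The way these filters combine can be sketched in a few lines. This is only an illustration of the logic described above, not the server's code: the `Entry` record, its field names and the `matches` function are hypothetical, and the sketch assumes inclusive range bounds, exact (case-insensitive) gene name matching, and case-insensitive chain identifiers.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Range = Tuple[float, float]


@dataclass
class Entry:
    # Hypothetical record; field names are not taken from the database.
    gene_name: str
    tm_esm_af2: float                    # ESMFold vs AlphaFold2
    pdb_chain: Optional[str] = None      # e.g. "7pzc_A"; None if no PDB structure
    tm_esm_pdb: Optional[float] = None   # ESMFold vs PDB chain
    tm_af2_pdb: Optional[float] = None   # AlphaFold2 vs PDB chain


def in_range(value: Optional[float], bounds: Range) -> bool:
    """True if value lies within [low, high]; a missing value never matches."""
    low, high = bounds
    return value is not None and low <= value <= high


def pdb_id_matches(query: str, chain_id: str) -> bool:
    """'7pzc' matches any chain of that entry, '7pzc_A' only chain A."""
    q_id, _, q_chain = query.strip().lower().partition("_")
    e_id, _, e_chain = chain_id.lower().partition("_")
    return q_id == e_id and (not q_chain or q_chain == e_chain)


def matches(entry: Entry, gene_name: str = "", tm_models: Range = (0.0, 1.0),
            require_pdb: bool = False, pdb_id: str = "",
            tm_esm_pdb: Range = (0.0, 1.0), tm_af2_pdb: Range = (0.0, 1.0),
            combine: str = "AND") -> bool:
    # An empty gene name means the filter is not applied.
    if gene_name and gene_name.strip().lower() != entry.gene_name.lower():
        return False
    if not in_range(entry.tm_esm_af2, tm_models):
        return False
    # The PDB-related filters only apply when the PDB flag is turned on.
    if not require_pdb:
        return True
    if entry.pdb_chain is None:
        return False
    if pdb_id and not pdb_id_matches(pdb_id, entry.pdb_chain):
        return False
    esm_ok = in_range(entry.tm_esm_pdb, tm_esm_pdb)
    af2_ok = in_range(entry.tm_af2_pdb, tm_af2_pdb)
    return (esm_ok and af2_ok) if combine == "AND" else (esm_ok or af2_ok)


if __name__ == "__main__":
    # Illustrative values only.
    entry = Entry("GENE1", 0.58, "7pzc_A", 0.73, 0.57)
    strict = dict(require_pdb=True, tm_esm_pdb=(0.7, 1.0), tm_af2_pdb=(0.7, 1.0))
    print(matches(entry, combine="AND", **strict))  # False: only one score is in range
    print(matches(entry, combine="OR", **strict))   # True: the ESMFold score is in range
```

With the default ranges of 0 to 1 every entry passes the TM-score filters, so the AND / OR choice only matters once at least one range is narrowed.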
Results page
For every entry in the model database, the results page shows several kinds of information.
Protein information: At the top of the page, a table is displayed that contains general information on the selected protein, including:
Protein name, gene name and UniProt accession (with a cross-link to the corresponding UniProt page).
Sequence length.
Protein source (either SwissProt or TrEMBL).
Presence of Signal or Transit peptide.
Highest-coverage PDB Chain (with a link to the corresponding PDB page). This field is present only if the PDB chain has a sequence coverage of the ATOM residues of at least 70%.
Alternatively, if the sequence is highly similar (more than 50% similarity over a coverage of at least 70%) to an entry endowed with a PDB structure, a link to the putative template is shown.
A more complete list of information on the entry can also be downloaded in JSON format.
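The rule that decides which structure link appears in the table can be written as a small decision function. This is a sketch under stated assumptions: the thresholds are the ones quoted above, but the function name, the argument names and the representation of coverage and similarity as fractions between 0 and 1 are hypothetical, and the way the server actually computes these quantities is not described here.

```python
from typing import Optional

MIN_COVERAGE = 0.70     # "at least 70%" coverage (from the text)
MIN_SIMILARITY = 0.50   # "more than 50%" similarity (from the text)


def structure_link(chain_coverage: Optional[float],
                   template_similarity: Optional[float] = None,
                   template_coverage: Optional[float] = None) -> Optional[str]:
    """Return which link would be shown: 'pdb_chain', 'putative_template' or None.

    All arguments are fractions in [0, 1]; None means "not available".
    """
    # A PDB chain of the protein itself takes precedence when it covers enough residues.
    if chain_coverage is not None and chain_coverage >= MIN_COVERAGE:
        return "pdb_chain"
    # Otherwise fall back to a sufficiently similar entry that has a PDB structure.
    if (template_similarity is not None and template_coverage is not None
            and template_similarity > MIN_SIMILARITY
            and template_coverage >= MIN_COVERAGE):
        return "putative_template"
    return None


if __name__ == "__main__":
    print(structure_link(0.85))              # pdb_chain
    print(structure_link(0.40, 0.62, 0.90))  # putative_template
    print(structure_link(None, 0.45, 0.90))  # None
```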
Comparing ESMFold and AlphaFold2 models: A tab displays the comparison between the AlphaFold2 and ESMFold computed structures. This includes:
The sequence alignment obtained from the superimposition of the two structural models. Residues are coloured according to the model confidence (pLDDT), while a green bar highlights residues which correctly match.
A FASTA-like file containing the alignment is available for download. The file will contain two gapped sequences with their corresponding ids.
The structure superimposition of the two models. Two different colours are adopted to distinguish ESMFold models (green) and AlphaFold2 models (purple). The graphical viewer is our implementation of PDBe Mol* and supports similar interactions (see the original documentation; some operations are not available in our viewer). Additionally, residues shown in the sequence alignment can be clicked to zoom in on the corresponding position.
Both models, as well as their superimposition, are available for download in PDB format.
Different statistics, including:
Number and percentage of residues with pLDDT over different thresholds (50, 70 and 90), for each model individually and for the consensus.
Information regarding the sequence alignment (start and end position for each model, length of the alignment, number of matches, mismatches and gaps).
Scores of the structure alignment (TM-score, RMSD, GDT and TM-score obtained when considering only residues with pLDDT over different thresholds).
The statistics reported here are included in the JSON file that can be downloaded from the table at the top of the page.
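The alignment counts listed above can be recomputed from the downloadable alignment file. The following is a minimal sketch, assuming the file follows the usual FASTA layout (a `>` header line followed by sequence lines), that gaps are written as `-`, and that a "match" means two identical residues in the same column; none of these details is stated on this page, and the function names are hypothetical.

```python
from io import StringIO


def read_two_sequences(handle):
    """Read the first two records of a FASTA-like file as (id, gapped sequence)."""
    records = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line[1:].strip(), ""])
        elif records:
            records[-1][1] += line  # sequences may span several lines
    if len(records) < 2:
        raise ValueError("expected two aligned sequences")
    return tuple(records[0]), tuple(records[1])


def alignment_stats(seq_a: str, seq_b: str, gap: str = "-") -> dict:
    """Count matches, mismatches and gapped columns of a pairwise alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    matches = mismatches = gaps = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            gaps += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return {"length": len(seq_a), "matches": matches,
            "mismatches": mismatches, "gaps": gaps}


if __name__ == "__main__":
    # Toy alignment; identifiers and sequences are made up.
    text = ">model_1\nMKT-LLV\n>model_2\nMKTALIV\n"
    (_, first), (_, second) = read_two_sequences(StringIO(text))
    print(alignment_stats(first, second))
    # {'length': 7, 'matches': 5, 'mismatches': 1, 'gaps': 1}
```

For a real file, pass an open file handle to `read_two_sequences` instead of the `StringIO` object.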
Comparing predicted models and PDB chain: Finally, if the entry is endowed with a PDB chain, two similar tabs show the comparison between the experimental model and each predicted model.
In this case, PDB chains are coloured in white in the graphical viewer, and the statistics displayed vary slightly.
Examples
We show here four cases to clarify how the different statistics can be interpreted.
Example 1 - High-Quality models and Good Superimposition: P07902
In this example, both models have good quality since about 90% of residues are predicted with high pLDDT. When this is the case, we usually expect to observe a good superimposition. Indeed, the two models have a very high TM-score (0.9), which approaches 1 if we consider only residues with a higher pLDDT. Looking at the model superimposition, we observe that only the first helix of the protein and a few loops are misaligned, but the majority of the protein is correctly superimposed.
Example 2 - High-Quality models but Poor Superimposition: Q96P20
In this example the two models are high-quality, having about 80% of residues predicted with high pLDDT. Despite that, the two models have a poor TM-score (0.58), although it improves if we consider only residues with a high pLDDT (up to 0.71). Looking at the model superimposition, we observe that the first part of the protein is not aligned (about 430 residues, mostly due to a domain rotation), negatively impacting the TM-score.
When comparing the models to the PDB structure, it appears that the region where the models agree is well superimposed with the experimental structure. Conversely, the first portion of the protein appears to be misplaced in both models. The ESMFold model is closer to the structure and therefore the TM-score is higher for ESMFold (0.73) than for AlphaFold2 (0.57).
Example 3 - Low-Quality models and Poor Superimposition: Q9NVL8
In this example, both models are low quality. The AlphaFold2 model has only about 22% of residues with a pLDDT greater than 70, while the ESMFold model has only one such residue. As expected, this correlates with a very poor TM-score (0.2). Looking at the model superimposition, we observe agreement only in one helix, while the majority of the models are misaligned.
Example 4 - Low-Quality models but Good Superimposition: Q9HD87
In this example, both models are low quality, not having any residues with a pLDDT greater than 70. Despite that, they have a high TM-score (0.72). Looking at the model superimposition, we observe that most of the structure is aligned, the main difference being an 11-residue-long alpha-helix at the N-terminus of the protein. As in the second example, this protein represents a rare case where the quality of the models does not correlate with their agreement.
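The pLDDT percentages quoted in these examples follow from simple counting over per-residue pLDDT values. The sketch below shows one way to do it, under several assumptions: the two lists hold one value per residue on a 0 to 100 scale, "over a threshold" means strictly greater, and the "consensus" counts residues that exceed the threshold in both models at the same sequence position. The exact definitions used by the database are not given on this page, and the function name is hypothetical.

```python
THRESHOLDS = (50, 70, 90)


def plddt_summary(plddt_esm, plddt_af2, thresholds=THRESHOLDS):
    """Count residues above each pLDDT threshold, per model and for both at once.

    Returns {threshold: {"esmfold": (count, percent),
                         "alphafold2": (count, percent),
                         "consensus": (count, percent)}}.
    """
    if len(plddt_esm) != len(plddt_af2):
        raise ValueError("both models must cover the same residues")
    n = len(plddt_esm)
    if n == 0:
        raise ValueError("no residues given")

    def with_percent(count):
        return count, 100.0 * count / n

    summary = {}
    for t in thresholds:
        esm = sum(v > t for v in plddt_esm)
        af2 = sum(v > t for v in plddt_af2)
        # Assumed consensus: the residue passes the threshold in both models.
        both = sum(a > t and b > t for a, b in zip(plddt_esm, plddt_af2))
        summary[t] = {"esmfold": with_percent(esm),
                      "alphafold2": with_percent(af2),
                      "consensus": with_percent(both)}
    return summary


if __name__ == "__main__":
    # Toy values for a four-residue protein.
    summary = plddt_summary([95, 72, 40, 65], [91, 60, 55, 80])
    print(summary[70])
    # {'esmfold': (2, 50.0), 'alphafold2': (2, 50.0), 'consensus': (1, 25.0)}
```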