Rhapsody is a machine learning tool for predicting the impact of amino acid substitutions in proteins. It consists of a random forest classifier trained not only on traditional conservation properties, but also on structural and dynamical properties of the mutation site, localized on the protein's PDB structure, and coevolution properties, extracted from Pfam sequence alignments.
Rhapsody can provide predictions only for Single Amino acid Variants (SAVs) in human proteins for which PDB structures are available. The restriction to human proteins exists because Rhapsody derives sequence conservation properties from PolyPhen-2, which is designed to work only for human SAVs.
Rhapsody only accepts SAVs in Uniprot coordinates, with the format "<Uniprot ID> <position> <wild-type aa> <mutated aa>". For instance, mutation Q99R in human protein GTPase HRas can be queried by submitting the input string "P01112 99 Q R" or "RASH_HUMAN 99 Q R".
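The input format above can be checked before submission with a few lines of Python. This is only a minimal sketch: the helper name parse_sav and its validation rules (four whitespace-separated fields, a positive integer position, one-letter codes for the 20 standard amino acids) are assumptions made for illustration and are not part of Rhapsody itself.

```python
# Hypothetical helper, not part of the Rhapsody package.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # 20 standard one-letter codes


def parse_sav(line):
    """Split '<Uniprot ID> <position> <wild-type aa> <mutated aa>' into fields."""
    fields = line.split()
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
    uniprot_id, pos, wt, mut = fields
    try:
        position = int(pos)
    except ValueError:
        raise ValueError(f"position is not an integer: {pos!r}") from None
    if position < 1:
        raise ValueError(f"position must be >= 1: {position}")
    wt, mut = wt.upper(), mut.upper()
    if wt not in AMINO_ACIDS or mut not in AMINO_ACIDS:
        raise ValueError(f"unknown amino acid code in {line!r}")
    if wt == mut:
        raise ValueError(f"wild-type and mutated residues are identical: {line!r}")
    return uniprot_id, position, wt, mut


print(parse_sav("P01112 99 Q R"))
print(parse_sav("RASH_HUMAN 99 Q R"))
```

The sketch does not check that the wild-type residue matches the Uniprot sequence at the given position; that would require the sequence itself.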
We provide a Uniprot search tool to help with the identification of a sequence's unique accession number. When running an in silico saturation mutagenesis analysis, only the Uniprot sequence identifier (plus, optionally, a specific position) should be provided.
In silico saturation mutagenesis is a complete scan of all 19 possible amino acid substitutions at every position in a protein sequence. The result is a "saturation mutagenesis table" (see example) that not only contains predictions for individual mutations, but also gives an overview of which parts of the sequence are predicted to be more (or less) sensitive to mutations.
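To make the size of such a scan concrete, the sketch below enumerates every substitution for a sequence, or for a single position when one is given. The function name saturation_variants and the five-residue toy sequence are assumptions used purely for illustration; Rhapsody builds this list internally from the Uniprot identifier.

```python
# Hypothetical helper, not part of the Rhapsody package.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard one-letter codes


def saturation_variants(uniprot_id, sequence, position=None):
    """List all SAVs, in Uniprot coordinates, for a sequence or one position."""
    if position is None:
        positions = range(1, len(sequence) + 1)  # Uniprot positions are 1-based
    else:
        positions = [position]
    variants = []
    for pos in positions:
        wt = sequence[pos - 1]
        for mut in AMINO_ACIDS:
            if mut != wt:  # skip the wild-type residue: 19 substitutions remain
                variants.append(f"{uniprot_id} {pos} {wt} {mut}")
    return variants


toy_sequence = "MTEYK"  # toy sequence for illustration only
all_variants = saturation_variants("P01112", toy_sequence)
print(len(all_variants))  # 19 substitutions x 5 positions
print(saturation_variants("P01112", toy_sequence, position=2)[:3])
```

A sequence of N standard residues therefore yields 19 x N variants, which is why scans of long proteins take noticeably longer than individual queries.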
A batch query lets you submit a list of individual variants from one or more protein sequences. The list must contain one variant per line, in Uniprot coordinates.
Normally, when queried with a sequence, Rhapsody searches the Protein Data Bank for the "best" (i.e. the largest) structure available. If a structure is not found, the user can manually provide a custom protein structure, by either indicating a PDB code (for instance, of a homologous protein from another organism) or uploading a file in PDB format (e.g. downloaded from the SWISS-MODEL repository of homology models, see ROMK tutorial for an example). This option can also be used to run predictions on a particular protein structure or conformation (see HRAS tutorial for an example). Please note that Rhapsody will automatically align the Uniprot sequence to the PDB sequence and compute predictions only for matching amino acids: if the two sequences are too dissimilar, the resulting predictions might be too sparse.
When computing structural and dynamical features from a PDB structure, by default Rhapsody will only consider a single chain (the one with the highest sequence similarity to the given Uniprot sequence) and ignore other chains that might be present in the PDB file. Sometimes, for instance in the case of multimers or other complexes, the presence of other chains should not be ignored and those properties should be computed for the entire complex. This is done by using a variant of the Elastic Network Model called "environmental ANM" (more precisely, a "sliced" model, see main publication and ROMK tutorial). In short, environmental effects should be included if the chain of interest is part of a "stable" complex (e.g. a multimer), so that its dynamical properties are influenced by the presence of other chains. On the other hand, please be aware that computing predictions on large complexes will take significantly longer.
Both "full" and "reduced" classifiers are trained on sequence-, structure- and dynamics-based features. The main difference is that the "full" classifier also includes coevolutionary properties computed on Pfam multiple sequence alignments. If part of a sequence is not covered by a Pfam domain, predictions from the "reduced" classifier are returned instead.
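The fallback from "full" to "reduced" predictions can be pictured as a simple per-variant merge. The sketch below assumes that a missing "full" prediction is represented as None or NaN and uses a hypothetical function name, combine_predictions; how Rhapsody stores missing values internally is not stated here.

```python
import math


# Hypothetical helper, not part of the Rhapsody package.
def combine_predictions(full, reduced):
    """Use the 'full' score where available, otherwise the 'reduced' one."""
    combined = []
    for full_score, reduced_score in zip(full, reduced):
        # Assumption: a missing "full" prediction is stored as None or NaN.
        missing = full_score is None or math.isnan(full_score)
        if missing:
            combined.append((reduced_score, "reduced"))
        else:
            combined.append((full_score, "full"))
    return combined


full_scores = [0.8, float("nan"), 0.1]  # second variant lies outside any Pfam domain
reduced_scores = [0.7, 0.4, 0.2]
print(combine_predictions(full_scores, reduced_scores))
```

Keeping the source label next to each value makes it easy to see which predictions did not benefit from coevolutionary features.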
The "full+EVmutation" classifier includes in its list of features used for predictions the "epistatic statistical energy difference of mutant", computed by EVmutation and based on coevolution analysis of multiple sequence alignments. Although it has been shown to slightly improve the accuracy of predictions (see Rhapsody paper), by default this additional feature is not included in order to provide predictions that are independent from those computed by EVmutation. EVmutation predictions alone are always displayed in the final results along with those from Rhapsody and PolyPhen-2.
training info: indicates whether a variant was never seen by the classifier ("new"), so its prediction can be considered genuine, or was included in the training dataset ("known_del" or "known_neu"), so its prediction cannot be considered unbiased.
score: contains the output from the random forest classifier, a real number between 0 and 1.
prob.: contains a "pathogenicity probability" calculated by applying a non-linear monotonic transformation to the random forest score that eliminates the effect of an imbalanced training dataset (where "deleterious" labels usually dominate). After this operation, the threshold between "neutral" and "deleterious" predictions can be set at 0.5.
class: provides a final classification of variants into "neutral" and "deleterious".
The two sets of columns report predictions from the "full" (main) or "reduced" (aux.) classifiers, as explained above. A left arrow between the two sets of columns indicates that "reduced" predictions replace missing "full" predictions in the "combined" results mentioned above.
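As an illustration of how these columns relate, the sketch below derives a class from a pathogenicity probability using the 0.5 threshold and flags predictions for variants that were part of the training set. Both function names are hypothetical, and assigning a probability of exactly 0.5 to "neutral" is an assumption; the transformation from score to probability is not reproduced because its form is not given here.

```python
# Hypothetical helpers, not part of the Rhapsody package.
def classify(prob, threshold=0.5):
    """Map a pathogenicity probability to a class label."""
    # Assumption: values exactly at the threshold are labelled "neutral".
    return "deleterious" if prob > threshold else "neutral"


def is_unbiased(training_info):
    """True only for variants the classifier never saw during training."""
    return training_info == "new"


rows = [
    ("new", 0.73),
    ("known_neu", 0.12),
    ("known_del", 0.91),
]
for training_info, prob in rows:
    print(training_info, prob, classify(prob), is_unbiased(training_info))
```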