Machine learning methods for prediction of protein-protein interaction hot spot residues

Sitani, Divya; Carloni, Paolo; Zimmer-Bensch, Geraldine Marion

doi:HT030884506

%0 Thesis
%A Sitani, Divya
%T Machine learning methods for prediction of protein-protein interaction hot spot residues
%I RWTH Aachen University
%V Dissertation
%C Aachen
%M RWTH-2024-06740
%P 1 online resource : illustrations
%D 2024
%Z Published on the publication server of RWTH Aachen University
%Z Dissertation, RWTH Aachen University, 2024
%X Protein-protein interactions (PPIs) form a vast and intricate network of reactions important for the regulation and execution of most biological processes [Rao+14]. PPIs occur when two proteins make direct physical contact via their surface residues and form an interface, which is a non-uniform surface on a protein-protein complex [GS10]. Even though a protein interface may occupy a large area, only a small subset of its buried residues contributes substantially to the binding free energy of the complex [BT98; Jan95]. These energetically key residues are known as hot spots. The experimental method to identify them is alanine scanning mutagenesis (ASM), in which each interface residue is systematically mutated to alanine and the resulting change in binding free energy, ΔΔGbinding, between the wild-type and the mutant complex is measured. If ΔΔGbinding is larger than a certain threshold, typically 2 kcal/mol, the interface residue is defined as a hot spot; otherwise it is considered a null spot [MFR07; CW89; BT98]. Hot spot residues are often enriched in disease-associated mutations [Ten+09]. These mutations often cause disrupted or erroneous protein interactions, resulting in phenotypic changes that might cause a disease. Moreover, the discovery of hot spots in protein-protein interfaces has made it possible to target a broader range of PPIs with small-molecule drugs. The identification of hot spots has helped researchers find molecules that bind at these sites, thus interfering with PPIs and the downstream pathways they mediate [Pet+16a; Pet+16b; Sco+16]. Therefore, predicting hot spots is crucial both for understanding the effect of disease-associated mutations on PPIs and for drug discovery [Mur+17]. Because ASM is costly and tedious, computational methods are increasingly used to predict hot spot residues.
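The ASM labeling rule described above can be illustrated with a minimal sketch. It assumes the typical 2 kcal/mol cutoff and a strict "larger than" comparison, as worded in the abstract; the function name, residue identifiers, and ΔΔGbinding values are hypothetical and not taken from the thesis.

```python
# Typical cutoff quoted in the abstract (kcal/mol); other cutoffs are possible.
HOT_SPOT_THRESHOLD = 2.0


def label_residue(ddg_binding, threshold=HOT_SPOT_THRESHOLD):
    """Label an interface residue from its ASM-derived ddG of binding.

    A residue is a hot spot if ddG is strictly larger than the threshold,
    otherwise a null spot. (Strictness at the boundary is an assumption.)
    """
    return "hot spot" if ddg_binding > threshold else "null spot"


# Hypothetical measurements: residue identifier -> ddG of binding in kcal/mol.
measurements = {"A:TYR45": 3.1, "A:SER52": 0.4, "B:TRP104": 2.0}

labels = {residue: label_residue(ddg) for residue, ddg in measurements.items()}

for residue, label in labels.items():
    print(f"{residue}: {label}")
```

Labels of this kind are what a classifier for hot spot prediction would be trained to reproduce from sequence- and structure-based features.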
Previous computational approaches included molecular dynamics and knowledge-based methods [GNS02; KB02; MK99; HMK02; GF08; Bre+09]. However, such approaches were time-consuming and hence limited in the number of hot spots they could predict. This led to an increased use of machine learning (ML) based methods for hot spot prediction in recent years [DPM07; Den+13; CKL09a; CKL09b; Ass+10]. Such ML approaches capitalize on the availability of experimental datasets containing protein-protein complex structures and ASM-derived hot spot data. However, as is often the case with biological data repositories, such hot spot datasets often contain noise [Mor+17; KC21]. If ML algorithms are trained and evaluated on this "noisy" data, their predictions will not be accurate [GG19]. Earlier ML-based approaches for hot spot prediction did not take this issue into account. In this thesis, I describe the basic concepts and recent advances of machine learning applications in finding protein-protein interaction hot spots. To reduce the effects of noise in hot spot prediction, I propose the method RBHS (Robust Principal Component Analysis (RPCA)-based Prediction of Protein-Protein Interaction Hot Spots) [Sit+21]. I apply RPCA [Can+11], followed by feature selection using Extreme Gradient Boosting (XGBoost) [CG16], to the data matrix containing protein sequence- and structure-based features calculated on the interface residues. I trained several popular machine learning classifiers on the benchmark dataset HB-34 [LLD18] and evaluated the performance of the proposed method on the independent test set BID-18 [LLD18]. After extensive computational experimentation and comparison with existing state-of-the-art approaches to hot spot prediction, I show that the method is quite efficient in identifying hot spot residues crucial for protein-protein interactions.
Finally, I discuss challenges and future directions in hot spot prediction.
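To make the RPCA step mentioned in the abstract more concrete, the following is a minimal NumPy sketch of principal component pursuit, which splits a data matrix M into a low-rank part L and a sparse part S. It is not the RBHS implementation: the function names, the default parameter choices (a common heuristic for the weight and step size), and the stopping rule are assumptions, and the XGBoost feature selection and classifier training described in the abstract are not shown.

```python
import numpy as np


def shrink(X, tau):
    """Soft-thresholding: move every entry toward zero by tau."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)


def svd_threshold(X, tau):
    """Singular value thresholding: soft-threshold the singular values of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * shrink(s, tau)) @ Vt


def rpca(M, tol=1e-6, max_iter=2000):
    """Decompose M into low-rank L plus sparse S (augmented Lagrangian scheme).

    lam and mu follow a widely used heuristic; they are assumptions here,
    not values taken from the thesis.
    """
    M = np.asarray(M, dtype=float)
    n1, n2 = M.shape
    lam = 1.0 / np.sqrt(max(n1, n2))          # weight of the sparse term
    mu = n1 * n2 / (4.0 * np.abs(M).sum())    # penalty / step-size parameter
    S = np.zeros_like(M)
    Y = np.zeros_like(M)                      # Lagrange multiplier
    norm_M = np.linalg.norm(M)
    L = np.zeros_like(M)
    for _ in range(max_iter):
        L = svd_threshold(M - S + Y / mu, 1.0 / mu)   # update low-rank part
        S = shrink(M - L + Y / mu, lam / mu)          # update sparse part
        residual = M - L - S
        Y = Y + mu * residual                         # dual update
        if np.linalg.norm(residual) <= tol * norm_M:
            break
    return L, S


# Synthetic example: a rank-2 "feature matrix" with a few large corruptions.
rng = np.random.default_rng(0)
low_rank = rng.normal(size=(40, 2)) @ rng.normal(size=(2, 30))
mask = rng.random((40, 30)) < 0.05
noise = mask * rng.normal(scale=10.0, size=(40, 30))
M_demo = low_rank + noise

L_demo, S_demo = rpca(M_demo)
rel_residual = np.linalg.norm(M_demo - L_demo - S_demo) / np.linalg.norm(M_demo)
print(f"relative residual: {rel_residual:.2e}")
```

In a pipeline of the kind the abstract outlines, the low-rank component would stand in for the denoised feature matrix passed on to feature selection and classification; whether RBHS uses it in exactly this way is not stated in this record.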
%F PUB:(DE-HGF)11
%9 Dissertation / PhD Thesis
%R 10.18154/RWTH-2024-06740
%U https://publications.rwth-aachen.de/record/989346
