Machine learning methods for prediction of protein-protein interaction hot spot residues

Sitani, Divya; Carloni, Paolo; Zimmer-Bensch, Geraldine Marion

doi:10.18154/RWTH-2024-06740

Marc 21

001			989346
005			20241108100032.0
024	7	_	\|2 HBZ \|a HT030884506
024	7	_	\|2 Laufende Nummer \|a 43672
024	7	_	\|2 datacite_doi \|a 10.18154/RWTH-2024-06740
037	_	_	\|a RWTH-2024-06740
041	_	_	\|a English
082	_	_	\|a 570
100	1	_	\|0 P:(DE-588)1347518967 \|a Sitani, Divya \|b 0 \|u rwth
245	_	_	\|a Machine learning methods for prediction of protein-protein interaction hot spot residues \|c submitted by Divya Sitani, M.Tech. \|h online
260	_	_	\|a Aachen \|b RWTH Aachen University \|c 2024
300	_	_	\|a 1 online resource : illustrations
336	7	_	\|0 2 \|2 EndNote \|a Thesis
336	7	_	\|0 PUB:(DE-HGF)11 \|2 PUB:(DE-HGF) \|a Dissertation / PhD Thesis \|b phd \|m phd
336	7	_	\|2 BibTeX \|a PHDTHESIS
336	7	_	\|2 DRIVER \|a doctoralThesis
336	7	_	\|2 DataCite \|a Output Types/Dissertation
336	7	_	\|2 ORCID \|a DISSERTATION
500	_	_	\|a Published on the publication server of RWTH Aachen University
502	_	_	\|a Dissertation, RWTH Aachen University, 2024 \|b Dissertation \|c RWTH Aachen University \|d 2024 \|g Fak01 \|o 2024-01-16
520	3	_	\|a Protein-Protein-Interaktionen (PPI) bilden ein umfangreiches und kompliziertes Netz von Reaktionen, die für die Regulierung und Ausführung der meisten biologischen Prozesse wichtig sind [Rao+14]. PPIs treten auf, wenn zwei Proteine über ihre Oberflächenreste in direkten physischen Kontakt treten und eine Grenzfläche bilden, d. h. eine ungleichmäßige Oberfläche auf einem Protein-Protein-Komplex [GS10]. Obwohl eine Proteinschnittstelle eine große Fläche einnehmen kann, spielt nur eine kleine Untergruppe der darin enthaltenen Reste eine entscheidende Rolle für die freie Enthalpie der Bindung des Komplexes [BT98; Jan95]. Diese Reste werden als Hot Spots bezeichnet. Die experimentelle Methode zu ihrer Identifizierung ist die Alanin-Scanning-Mutagenese (ASM), bei der systematisch jeder Schnittstellenrest zu Alanin mutiert und die daraus resultierende Änderung der Bindungsenergie ΔΔGbinding zwischen dem Wildtyp und dem mutierten Komplex gemessen wird. Ist (ΔΔGbinding) größer als ein bestimmter Schwellenwert, in der Regel 2 kcal/mol, wird der Schnittstellenrest als Hot-Spot definiert, andernfalls wird er als Null-Spot [MFR07; CW89; BT98] betrachtet. Die sogenannten Hot-Spot-Reste sind häufig mit krankheitsassoziierten Mutationen beteiligt [Ten+09]. Diese Mutationen führen oft zu gestörten oder fehlerhaften Proteininteraktionen, was zu phänotypischen Veränderungen führt, die eine Krankheit verursachen können. Mit der Entdeckung von Hot Spots in Protein-Protein-Schnittstellen ist es außerdem möglich geworden, eine breitere Palette von PPIs mit kleinen Molekülen zu beeinflussen. Die Identifizierung von Hot Spots hat den Forschern geholfen, Moleküle zu identifizieren, die an diesen Stellen interagieren und so die PPIs und die nachgeschalteten Stoffwechselwege stören [Pet+16a; Pet+16b; Sco+16]. 
Daher ist die Vorhersage von Hot Spots von entscheidender Bedeutung für das Verständnis der Auswirkungen von krankheitsassoziierten Mutationen auf PPIs und für die Entwicklung von Medikamenten [Mur+17]. Wie bereits erwähnt, können Hot Spots experimentell mit Hilfe von ASM ermittelt werden. Dies ist jedoch recht kostspielig und langwierig, was zum Einsatz von Berechnungsmethoden zur Vorhersage von Hot Spot-Resten führte. Zu den früheren Berechnungsmethoden gehörten Molekulardynamik und wissensbasierte Methoden [MK99; HMK02; GF08; Bre+09]. Solche Ansätze waren jedoch zeitaufwändig und daher in der Anzahl der vorhergesagten Hot Spots begrenzt. Dies führte dazu, dass in den letzten Jahren verstärkt Methoden des maschinellen Lernens (ML) für die Hot-Spot-Vorhersage eingesetzt wurden [DPM07; Den+13; CKL09a; CKL09b; Ass+10]. Solche ML-Ansätze machen sich die Verfügbarkeit experimenteller Datensätze zunutze, die Protein-Protein-Komplexstrukturen und von ASM abgeleitete Hotspot-Daten enthalten. Wie bei biologischen Datenbeständen üblich, enthalten solche Hotspot-Datensätze jedoch häufig Rauschen [Mor+17; KC21]. Wenn Algorithmen des maschinellen Lernens (ML) auf diesen "verrauschten" Daten trainiert werden und Vorhersagen getroffen werden, sind die Ergebnisse nicht genau [GG19]. Bei den früheren ML-basierten Ansätzen zur Vorhersage von Hotspots wurde dieses Problem nicht berücksichtigt. In dieser Arbeit beschreibe ich die grundlegenden Konzepte und die jüngsten Fortschritte der Anwendungen des maschinellen Lernens bei der Suche nach den Protein-Protein-Interaktions-Hotspots. Um die Auswirkungen des Rauschens bei der Vorhersage von Hot Spots zu reduzieren, habe ich in dieser Arbeit die Methode RBHS (Robust Principal Component Analysis-(RPCA) based Prediction of Protein-Protein Interaction Hot Spots) vorgeschlagen [Sit+21]. 
Ich wende RPCA [Can+11], gefolgt von einer Merkmalsauswahl mit Extreme Gradient Boosting (XGBoost) [CG16] auf die Datenmatrix an, die Proteinsequenz- und strukturbasierte Merkmale enthält, die für die Schnittstellenreste berechnet wurden. Ich habe mehrere gängige Klassifikatoren für maschinelles Lernen auf dem Benchmark-Datensatz HB-34 [LLD18] trainiert und die Leistung der von mir vorgeschlagenen Methode auf dem unabhängigen Testsatz BID-18 [LLD18] bewertet. Nach ausgiebigen Experimenten und einem Vergleich mit den bestehenden State-of-the-Art-Ansätzen zur Vorhersage von Hot Spots konnte ich zeigen, dass meine Methode bei der Identifizierung von Hot Spot-Resten, die für Protein-Protein-Interaktionen entscheidend sind, recht effizient ist. Abschließend diskutiere ich in dieser Arbeit die Herausforderungen und zukünftigen Richtungen bei der Vorhersage von Hot Spots. \|l ger
520	_	_	\|a Protein–protein interactions (PPIs) form a vast and intricate network of reactions important for the regulation and execution of most biological processes [Rao+14]. PPIs occur when two proteins make direct physical contact via their surface residues and form an interface, a non-uniform surface on a protein–protein complex [GS10]. Even though a protein interface may occupy a large area, only a small subset of its buried residues plays a crucial role in the binding free energy of the complex [BT98; Jan95]. These energetically key residues are known as hot spots. The experimental method for identifying them is alanine scanning mutagenesis (ASM), in which each interface residue is systematically mutated to alanine and the resulting change in binding energy, ΔΔGbinding, between the wild-type and the mutant complex is measured. If ΔΔGbinding exceeds a certain threshold, typically 2 kcal/mol, the interface residue is defined as a hot spot; otherwise it is considered a null spot [MFR07; CW89; BT98]. Hot spot residues are often enriched in disease-associated mutations [Ten+09]. These mutations often cause disrupted or erroneous protein interactions, resulting in phenotypic changes that might cause disease. Moreover, with the discovery of hot spots in protein–protein interfaces, it has become possible to target a broader range of PPIs with small-molecule drugs. The identification of hot spots has helped researchers find molecules that interact at these sites, thus interfering with PPIs and the downstream pathways they mediate [Pet+16a; Pet+16b; Sco+16]. Predicting hot spots is therefore crucial both for understanding the effect of disease-associated mutations on PPIs and for drug discovery [Mur+17]. Hot spots can be found experimentally using ASM, but the procedure is costly and tedious, which has led to the use of computational methods to predict hot spot residues. 
Previous computational approaches included molecular dynamics and knowledge-based methods [GNS02; KB02; MK99; HMK02; GF08; Bre+09]. However, such approaches were time-consuming and hence limited in the number of hot spots they could predict. This has led to an increased use of machine learning (ML) methods for hot spot prediction in recent years [DPM07; Den+13; CKL09a; CKL09b; Ass+10]. Such ML approaches capitalize on the availability of experimental datasets containing protein–protein complex structures and ASM-derived hot spot data. However, as is common with biological data repositories, these hot spot datasets often contain noise [Mor+17; KC21]. If ML algorithms are trained on, and make predictions from, such "noisy" data, the results will not be accurate [GG19]. Earlier ML-based approaches to hot spot prediction did not take this issue into account. In this thesis, I describe the basic concepts and recent advances of machine learning applications in finding protein–protein interaction hot spots. To reduce the effects of noise in hot spot prediction, I propose the method RBHS (Robust Principal Component Analysis-(RPCA) based Prediction of Protein-Protein Interaction Hot Spots) [Sit+21]. I apply RPCA [Can+11], followed by feature selection using Extreme Gradient Boosting (XGBoost) [CG16], to the data matrix containing protein sequence- and structure-based features calculated for the interface residues. I trained several popular machine learning classifiers on the benchmark dataset HB-34 [LLD18] and evaluated the performance of the proposed method on the independent test set BID-18 [LLD18]. After extensive computational experiments and comparison with existing state-of-the-art approaches to hot spot prediction, I was able to show that my method is quite efficient in identifying hot spot residues crucial for protein–protein interactions. 
Finally, I discuss the challenges and future directions in the prediction of hot spots in this thesis. \|l eng
588	_	_	\|a Dataset connected to Lobid/HBZ
591	_	_	\|a Germany
700	1	_	\|0 P:(DE-82)IDM03480 \|a Zimmer-Bensch, Geraldine Marion \|b 1 \|e Thesis advisor \|u rwth
700	1	_	\|0 P:(DE-82)IDM02752 \|a Carloni, Paolo \|b 2 \|e Thesis advisor \|u rwth
856	4	_	\|u https://publications.rwth-aachen.de/record/989346/files/989346.pdf \|y OpenAccess
856	4	_	\|u https://publications.rwth-aachen.de/record/989346/files/989346_source.zip \|y Restricted
909	C	O	\|o oai:publications.rwth-aachen.de:989346 \|p openaire \|p open_access \|p VDB \|p driver \|p dnbdelivery
910	1	_	\|0 I:(DE-588b)36225-6 \|6 P:(DE-588)1347518967 \|a RWTH Aachen \|b 0 \|k RWTH
910	1	_	\|0 I:(DE-588b)36225-6 \|6 P:(DE-82)IDM03480 \|a RWTH Aachen \|b 1 \|k RWTH
910	1	_	\|0 I:(DE-588b)36225-6 \|6 P:(DE-82)IDM02752 \|a RWTH Aachen \|b 2 \|k RWTH
914	1	_	\|y 2023
915	_	_	\|0 StatID:(DE-HGF)0510 \|2 StatID \|a OpenAccess
920	1	_	\|0 I:(DE-82)164620_20181217 \|k 164620 \|l Lehr- und Forschungsgebiet Neuroepigenetik \|x 0
920	1	_	\|0 I:(DE-82)160000_20140620 \|k 160000 \|l Fachgruppe Biologie \|x 1
980	1	_	\|a FullTexts
980	_	_	\|a I:(DE-82)160000_20140620
980	_	_	\|a I:(DE-82)164620_20181217
980	_	_	\|a UNRESTRICTED
980	_	_	\|a VDB
980	_	_	\|a phd
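
The English abstract in field 520 states the labeling rule used with alanine scanning mutagenesis: an interface residue counts as a hot spot when its ΔΔGbinding exceeds a threshold, typically 2 kcal/mol, and as a null spot otherwise. The following is a minimal Python sketch of that rule only, not code from the thesis. The names `AlanineScanResult`, `label_residue`, and `label_interface` are hypothetical, the residue identifiers and ΔΔG values are made up for illustration, and treating a value exactly equal to the threshold as a null spot is an assumption based on the wording "larger than".

```python
from dataclasses import dataclass
from typing import Dict, Iterable

# Typical cutoff quoted in the abstract; other studies may use a different value.
HOT_SPOT_THRESHOLD_KCAL_MOL = 2.0


@dataclass(frozen=True)
class AlanineScanResult:
    """One alanine-scanning measurement (hypothetical container)."""
    residue_id: str     # e.g. chain + residue number; the format is an assumption
    ddg_binding: float  # change in binding energy on mutation to alanine, kcal/mol


def label_residue(ddg_binding: float,
                  threshold: float = HOT_SPOT_THRESHOLD_KCAL_MOL) -> str:
    """Return 'hot spot' if ddg_binding is strictly above the threshold."""
    return "hot spot" if ddg_binding > threshold else "null spot"


def label_interface(results: Iterable[AlanineScanResult],
                    threshold: float = HOT_SPOT_THRESHOLD_KCAL_MOL) -> Dict[str, str]:
    """Label every measured interface residue as hot spot or null spot."""
    return {r.residue_id: label_residue(r.ddg_binding, threshold) for r in results}


# Illustrative values only; these are not measurements from any dataset.
example = [
    AlanineScanResult("A:TRP104", 3.1),
    AlanineScanResult("A:SER58", 0.4),
    AlanineScanResult("B:ARG43", 2.0),
]
labels = label_interface(example)
```

Labels produced this way would serve as the binary targets for a classifier; the noise handling and feature selection described in the abstract (RPCA followed by XGBoost) are separate steps that this sketch does not cover.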
