End-to-end reinforcement learning of Koopman model predictive control

Mayfrank, Daniel Georg; Lucia, Sergio; Mitsos, Alexander

doi:44904

End-to-end reinforcement learning of Koopman model predictive control = Ende-zu-Ende bestärkendes Lernen von Koopman modellprädiktiver Regelung

Mayfrank, Daniel Georg (RWTH)

2025 & 2026

Statement of responsibility: submitted by Daniel Georg Mayfrank

Imprint: Aachen : Aachener Verfahrenstechnik 2025

Extent: 1 online resource : illustrations

Series: Aachener Verfahrenstechnik series - AVT.SVT - Systemverfahrenstechnik - Dissertationen ; 42 (2025)

Dissertation, Rheinisch-Westfälische Technische Hochschule Aachen, 2025

Published on the publication server of RWTH Aachen University, 2026

Approving faculty
Fak04

Main examiner/Reviewers
Mitsos, Alexander (Thesis advisor; RWTH) ; Lucia, Sergio (Thesis advisor)

Date of oral examination/habilitation
2025-11-14

Online
DOI: 10.18154/RWTH-2025-09976
URL: https://publications.rwth-aachen.de/record/1022363/files/1022363.pdf

Institutions

Lehrstuhl für Systemverfahrenstechnik (416710)

Subject classification
DDC: 620

Abstract
Model-based control methods such as model predictive control (MPC) and its variants, e.g., economic nonlinear MPC (eNMPC), remain indispensable in the chemical industry. However, mechanistic models are often unavailable or too computationally expensive for use in (e)NMPC. Data-driven models, usually trained via system identification (SI), can serve as a computationally cheap alternative. SI, however, focuses narrowly on maximizing average prediction accuracy, which can result in suboptimal control performance when the model is used as part of a policy. In contrast, recent research has explored using reinforcement learning (RL) to train data-driven models end-to-end for optimal performance in predictive control policies. This thesis contributes to this emerging research field by developing methods for RL-based end-to-end learning of Koopman models for (e)NMPC policies. Koopman models can accurately represent the dynamics of nonlinear systems while yielding convex optimal control problems (OCPs) when used in (e)NMPC, thus striking a favorable balance between representational capacity and computational efficiency. By performing post-optimal sensitivity analysis on the resulting OCPs, we develop a method for constructing automatically differentiable Koopman-based (e)NMPC policies, which can be optimized with respect to the learnable parameters of the Koopman model. We optimize the (e)NMPC policies for specific control tasks using the actor-critic RL algorithm Proximal Policy Optimization (PPO). Assuming the availability of full state measurements, we demonstrate the effectiveness of our method in NMPC (setpoint tracking) and eNMPC (demand response) case studies. These are based on (i) a small continuous stirred-tank reactor model with two differential states and two control inputs and (ii) an air separation unit with 119 differential states and approximately 2300 algebraic states.
The results show that the policies produced by the proposed method achieve better control performance than traditional benchmarks, including neural network policies trained using RL and Koopman-based eNMPC policies trained via system identification. Furthermore, we show that, in contrast to the neural network policies, the (e)NMPC policies can react to certain changes in the control setting without retraining. However, we observe (i) convergence problems resulting from inaccurate policy gradient estimates and (ii) low sample efficiency. To address the convergence problems, we exploit the automatic differentiability of training environments based on mechanistic simulation models to aid the policy optimization, resulting in substantially improved convergence and control performance. To address the low sample efficiency, we integrate our method for RL-based training of Koopman (e)NMPC policies with Dyna-style model-based RL. We also show that, when leveraging model-based RL, the sample efficiency can be increased further by utilizing partial prior knowledge about the system dynamics via physics-informed model learning. In summary, this thesis contributes to the field of data-driven control and shows avenues toward higher-performance, real-time-capable, data-driven (e)NMPCs.
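
The abstract's statement that Koopman models represent nonlinear dynamics while keeping the prediction model linear can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: the two-state toy system, the parameters LAM and MU, the hand-picked lifting z = (x1, x2, x1^2), and all function names. In the thesis the lifting and the linear operator are learned, and control inputs are present. The toy system is chosen so that the lifted linear model reproduces the nonlinear trajectory exactly.

```python
# Toy illustration (hypothetical, not from the thesis): a nonlinear system
# that evolves exactly linearly in suitably lifted coordinates.
LAM, MU = 0.9, 0.5  # assumed parameters of the toy system


def step_nonlinear(x):
    """One step of the nonlinear system in the original state x = (x1, x2)."""
    x1, x2 = x
    return (LAM * x1, MU * x2 + (LAM**2 - MU) * x1**2)


def lift(x):
    """Hand-picked lifting z = (x1, x2, x1^2); a learned map in practice."""
    x1, x2 = x
    return [x1, x2, x1**2]


# Linear dynamics in the lifted space: z_next = A z
A = [
    [LAM, 0.0, 0.0],
    [0.0, MU, LAM**2 - MU],
    [0.0, 0.0, LAM**2],
]


def step_lifted(z):
    """One linear step z_next = A z (plain matrix-vector product)."""
    return [sum(a * zj for a, zj in zip(row, z)) for row in A]


def decode(z):
    """Recover the original state from the lifted state."""
    return (z[0], z[1])


def rollout_nonlinear(x0, n_steps):
    """Simulate the nonlinear system for n_steps, returning all states."""
    xs = [tuple(x0)]
    for _ in range(n_steps):
        xs.append(step_nonlinear(xs[-1]))
    return xs


def rollout_koopman(x0, n_steps):
    """Lift once, propagate linearly, and decode every lifted state."""
    z = lift(x0)
    xs = [decode(z)]
    for _ in range(n_steps):
        z = step_lifted(z)
        xs.append(decode(z))
    return xs
```

Calling rollout_nonlinear((1.0, -0.5), 20) and rollout_koopman((1.0, -0.5), 20) should give trajectories that agree up to floating-point rounding. Because the prediction model is linear in z, an optimal control problem built on it with a convex cost and convex constraints is convex, which is the property the abstract refers to.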

OpenAccess:
PDF