Transfer learning workflow for I/O bandwidth prediction

Povaliaiev, Dmytro; Kunkel, Julian; Liem, Radita Tapaning Hesti; Müller, Matthias S.

doi:10.18154/RWTH-2023-05063

Transfer learning workflow for I/O bandwidth prediction = Transfer Learning Workflow zur I/O-Leistungsvorhersage der Bandbreite

Povaliaiev, Dmytro (RWTH)

2023

Statement of responsibility: Dmytro Povaliaiev

Imprint: Aachen : RWTH Aachen University 2023

Extent: 1 online resource : illustrations, diagrams

Master's thesis, RWTH Aachen University, 2023

Published on the publication server of RWTH Aachen University

Approving faculty
Fak01

Main referee/Reviewers
Müller, Matthias S. (Thesis advisor, RWTH) ; Kunkel, Julian (Thesis advisor) ; Liem, Radita Tapaning Hesti (Consultant, RWTH)

Date of oral examination/habilitation
2023-03-13

Online
DOI: 10.18154/RWTH-2023-05063
URL: https://publications.rwth-aachen.de/record/958007/files/958007.pdf

Content description (keywords)
HPC I/O (free) ; I/O bandwidth prediction (free) ; explainable AI (free) ; interpretable machine learning (free) ; machine learning (free) ; transfer learning (free) ; HPC (free) ; high-performance computing (free) ; I/O (free)

Subject classification
DDC: 004

Abstract
As the new generation of high-performance computing (HPC) systems reaches exascale performance for the first time, preventing underutilization due to I/O bottlenecks becomes even more critical. However, accurately predicting the I/O performance remains a challenging problem. The existing approaches [29] [37] [92] use a significant amount of data from a particular HPC cluster to create a suitable machine learning model. This is problematic due to the required timescale and I/O instrumentation infrastructure, especially in the case of the new filesystems that have not yet gained widespread adoption. To address this issue, I propose a transfer learning-based workflow for I/O bandwidth prediction that requires less data from the target cluster than the existing methods to produce a model of equivalent quality. As a proof-of-concept (POC), I use it to predict the I/O performance of CLAIX, the supercomputing cluster at RWTH Aachen University, employing data collected at the Blue Waters system of the University of Illinois for the initial training. Even in the POC form, the models produced by the workflow show a slight improvement of 1.08% average residual error over the current state of the art of 10% in bandwidth prediction on HPC clusters [37]. I further verify these results using cross-validation and analyze the models with the help of nine interpretable machine learning (also called explainable AI) techniques to provide insight into the features they consider to be the most important ones.
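
The core idea in the abstract, pretraining on a data-rich source system and then adapting with only a few samples from the target system, can be illustrated with a minimal sketch. This is not the workflow from the thesis: the data are synthetic, the model is a plain linear regressor trained by gradient descent, and every name in it (`fit_linear`, `w_source`, `w_target`, the sample sizes and step counts) is a hypothetical choice made for illustration. The only assumption carried over is that the source and target relationships are similar but not identical.

```python
import numpy as np


def fit_linear(X, y, w_init=None, lr=0.1, steps=200):
    """Fit y ~ X @ w by gradient descent on the mean squared error.

    Passing w_init gives a warm start, which stands in for transfer
    learning here; with w_init=None training starts from zeros.
    """
    n, d = X.shape
    w = np.zeros(d) if w_init is None else w_init.copy()
    for _ in range(steps):
        grad = (2.0 / n) * X.T @ (X @ w - y)
        w -= lr * grad
    return w


def mse(X, y, w):
    """Mean squared error of the linear model w on (X, y)."""
    return float(np.mean((X @ w - y) ** 2))


rng = np.random.default_rng(0)
d = 8  # number of features (hypothetical)

# Synthetic ground truth: the target relation is a small shift of the source.
w_source = np.linspace(1.0, 2.0, d)
w_target = w_source + 0.1 * rng.normal(size=d)


def sample(n, w):
    """Draw n synthetic samples from a noisy linear relation with weights w."""
    X = rng.normal(size=(n, d))
    y = X @ w + 0.05 * rng.normal(size=n)
    return X, y


X_src, y_src = sample(5000, w_source)   # plentiful source-system data
X_tgt, y_tgt = sample(20, w_target)     # scarce target-system data
X_test, y_test = sample(1000, w_target)  # held-out target data

# Step 1: pretrain on the source system.
w_pre = fit_linear(X_src, y_src)

# Step 2: adapt with a few steps on the small target set (warm start),
# and compare against the same budget spent training from scratch.
w_transfer = fit_linear(X_tgt, y_tgt, w_init=w_pre, steps=5)
w_scratch = fit_linear(X_tgt, y_tgt, steps=5)

mse_transfer = mse(X_test, y_test, w_transfer)
mse_scratch = mse(X_test, y_test, w_scratch)
print(f"transfer: {mse_transfer:.4f}  scratch: {mse_scratch:.4f}")
```

In this toy setting the warm-started model begins close to the target relation, so the same small training budget leaves it with a lower held-out error than the model trained from scratch. Whether and how strongly this carries over to real I/O traces depends on how similar the two systems are, which is exactly what the thesis evaluates.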

Open access: PDF