OpenMP scalability limits on large SMPs and how to extend them

Schmidl, Dirk; Müller, Matthias S.; Bischof, Christian

doi:34813

Marc 21

001			659783
005			20230408004734.0
024	7	_	\|2 URN \|a urn:nbn:de:hbz:82-rwth-2016-049355
024	7	_	\|2 HBZ \|a HT019019718
024	7	_	\|2 Laufende Nummer \|a 34813
037	_	_	\|a RWTH-2016-04935
041	_	_	\|a English
082	_	_	\|a 004
100	1	_	\|0 P:(DE-82)IDM01602 \|a Schmidl, Dirk \|b 0 \|u rwth
245	_	_	\|a OpenMP scalability limits on large SMPs and how to extend them \|c vorgelegt von Diplom-Informatiker Dirk Schmidl \|h online
246	_	3	\|a OpenMP-Skalierbarkeitslimits auf großen SMP-Systemen und wie sie erweitert werden können \|y German
260	_	_	\|a Aachen \|c 2016
300	_	_	\|a 1 Online-Ressource (vii, 123 Seiten) : Illustrationen, Diagramme
336	7	_	\|2 DataCite \|a Output Types/Dissertation
336	7	_	\|2 ORCID \|a DISSERTATION
336	7	_	\|2 BibTeX \|a PHDTHESIS
336	7	_	\|0 2 \|2 EndNote \|a Thesis
336	7	_	\|0 PUB:(DE-HGF)11 \|2 PUB:(DE-HGF) \|a Dissertation / PhD Thesis \|b phd \|m phd
336	7	_	\|2 DRIVER \|a doctoralThesis
500	_	_	\|a Veröffentlicht auf dem Publikationsserver der RWTH Aachen University
502	_	_	\|a Dissertation, RWTH Aachen University, 2016 \|b Dissertation \|c RWTH Aachen University \|d 2016 \|g Fak01 \|o 2016-06-28
520	3	_	\|a Aktuell sind Rechenknoten mit zwei Prozessoren die am häufigsten verwendeten Knoten im Bereich des Hochleistungsrechnens. Viele tausend dieser Knoten können über ein schnelles Netzwerk miteinander zu einem Rechencluster gekoppelt werden. Um diese Cluster zu programmieren, wird üblicherweise das Message Passing Interface (MPI) verwendet. MPI erfordert es, die Parallelität und die verwendeten Datentransfers sehr explizit über Funktionsaufrufe zu realisieren. Eine Alternative zu MPI, welche eine Parallelisierung auf höherer Ebene erlaubt, ist OpenMP. In OpenMP können serielle Programme mit Pragmas angereichert werden, um rechenintensive Teile der Anwendung parallel auszuführen. In vielen Fällen ist dies mit weniger Aufwand verbunden als eine Parallelisierung mit MPI, bei der die gesamte Datenverteilung über alle Knoten im gesamten Programm implementiert werden muss. Der Nachteil von OpenMP ist, dass es nur auf Maschinen mit geteiltem Hauptspeicher und nicht auf den weit verbreiteten Clustern eingesetzt werden kann. Eine Reihe von Herstellern hat sich aber darauf spezialisiert, große Maschinen mit geteiltem Hauptspeicher herzustellen. Da geteilter Hauptspeicher und damit einhergehende Anforderungen an die Kohärenz der Speicher und Caches kompliziert zu implementieren sind, haben solche Maschinen Eigenheiten, die bei der Programmierung mit OpenMP berücksichtigt werden müssen, um eine gute Parallelisierung zu erreichen. In dieser Arbeit beschäftige ich mich damit, die Eigenschaften verschiedener dieser großen Maschinen mit geteiltem Hauptspeicher und die Programmierbarkeit mit OpenMP zu untersuchen. An Stellen, an denen OpenMP nicht die nötigen Mittel für eine gute Parallelisierung liefert, werde ich Verbesserungen aufzeigen. Weiterhin beschäftige ich mich in der Arbeit damit, wie Anwendungen mit OpenMP für solche Maschinen systematisch optimiert werden können. Hierbei wird die Nutzbarkeit von Performance-Analyse-Werkzeugen untersucht und Verbesserungen im Bereich der Task-basierten Analyse vorgestellt, welche die Optimierung für große Systeme vereinfachen. Außerdem stelle ich ein Modell vor, welches verwendet werden kann, um eine Performance-Abschätzung für eine Anwendung auf einem solchen System vorzunehmen. Abschließend wird anhand von zwei Anwendungen gezeigt, dass es die vorgestellten Optimierungen erlauben, mit echten Nutzeranwendungen eine Skalierbarkeit mit OpenMP auf großen Systemen zu erreichen. \|l ger
520	_	_	\|a The most widely used node type in high-performance computing today is the 2-socket server node. Thousands of these nodes are coupled into clusters via a fast interconnect, e.g. InfiniBand. The Message Passing Interface (MPI) has become the de facto standard for programming these clusters. However, MPI requires data layout and data transfer to be expressed very explicitly in a parallel program, so parallelizing an application often means rewriting it. An alternative to MPI is OpenMP, which allows a serial application to be parallelized incrementally by adding pragmas to compute-intensive regions of the code. This is often more feasible than rewriting the application with MPI. The disadvantage of OpenMP is that it requires shared memory and thus cannot be used across the nodes of a cluster. Several hardware vendors, however, offer large machines in which all cores of the system share memory. Maintaining coherency between memory and all cores of such a system is a challenging task, so these machines have different characteristics compared to standard 2-socket servers. A programmer must take these characteristics into account to achieve good performance on such a system. In this work, I will investigate different large shared memory machines to highlight these characteristics, and I will show how they can be handled in OpenMP programs. Where OpenMP cannot handle a problem, I will present solutions in user space, which could be added to OpenMP for better support of large systems. Furthermore, I will present a tools-guided workflow to optimize applications for such machines. I will investigate the ability of performance tools to highlight performance issues, and I will present improvements that enable such tools to handle OpenMP tasks. These improvements make it possible to investigate the efficiency of task-parallel execution, especially on large shared memory machines. The workflow also contains a performance model to determine how good the performance of an application on a system is and when to stop tuning the application. Finally, I will present two application case studies in which user codes have been optimized to reach good performance by applying the optimization techniques presented in this thesis. \|l eng
591	_	_	\|a Germany
653	_	7	\|a OpenMP
653	_	7	\|a high performance computing
653	_	7	\|a NUMA-Architektur
653	_	7	\|a Leistungsoptimierung
700	1	_	\|0 P:(DE-82)IDM01074 \|a Müller, Matthias S. \|b 1 \|e Thesis advisor
700	1	_	\|0 P:(DE-82)003917 \|a Bischof, Christian \|b 2 \|e Thesis advisor
856	4	_	\|u https://publications.rwth-aachen.de/record/659783/files/659783.pdf \|y OpenAccess
856	4	_	\|u https://publications.rwth-aachen.de/record/659783/files/659783_source.zip \|y restricted
856	4	_	\|u https://publications.rwth-aachen.de/record/659783/files/659783.gif?subformat=icon \|x icon \|y OpenAccess
856	4	_	\|u https://publications.rwth-aachen.de/record/659783/files/659783.jpg?subformat=icon-1440 \|x icon-1440 \|y OpenAccess
856	4	_	\|u https://publications.rwth-aachen.de/record/659783/files/659783.jpg?subformat=icon-180 \|x icon-180 \|y OpenAccess
856	4	_	\|u https://publications.rwth-aachen.de/record/659783/files/659783.jpg?subformat=icon-640 \|x icon-640 \|y OpenAccess
856	4	_	\|u https://publications.rwth-aachen.de/record/659783/files/659783.jpg?subformat=icon-700 \|x icon-700 \|y OpenAccess
856	4	_	\|u https://publications.rwth-aachen.de/record/659783/files/659783.pdf?subformat=pdfa \|x pdfa \|y OpenAccess
909	C	O	\|o oai:publications.rwth-aachen.de:659783 \|p openaire \|p open_access \|p urn \|p driver \|p VDB \|p dnbdelivery
914	1	_	\|y 2016
915	_	_	\|0 StatID:(DE-HGF)0510 \|2 StatID \|a OpenAccess
920	1	_	\|0 I:(DE-82)123010_20140620 \|k 123010 \|l Lehrstuhl für Informatik 12 (Hochleistungsrechnen) \|x 0
920	1	_	\|0 I:(DE-82)120000_20140620 \|k 120000 \|l Fachgruppe Informatik \|x 1
980	1	_	\|a FullTexts
980	_	_	\|a phd
980	_	_	\|a VDB
980	_	_	\|a I:(DE-82)123010_20140620
980	_	_	\|a I:(DE-82)120000_20140620
980	_	_	\|a UNRESTRICTED
