Please do not disturb: Performance analysis on supercomputers

German and Russian scientists join forces to improve performance tuning on supercomputers

2021/03/03

To better understand the efficiency of programs on supercomputers, scientists rely on performance measurements. However, these measurements also reflect random delays caused, for example, by other programs running concurrently. This makes the performance of the actual algorithm difficult to evaluate. In a German-Russian joint project coordinated by computer science professor Felix Wolf at TU Darmstadt, partners from Darmstadt, Jülich and Moscow are working on analysis methods that are less sensitive to such interference.

The project "ExtraNoise – Performance analysis of HPC applications in noisy environments" is funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) and the Russian Foundation for Basic Research (RFBR). It runs for three years and is coordinated by TU Darmstadt.

High-performance computing is a key technology of the 21st century. Numerous application examples, ranging from the improved understanding of matter to the discovery of new materials and from the study of biological processes to artificial intelligence, give evidence of its tremendous potential.

However, exploiting the full power of HPC systems has always been hard and is becoming even harder as the complexity and size of systems and applications continue to grow. At the same time, application optimization can save enormous amounts of energy and CPU hours.

Key to understanding and ultimately improving the performance of HPC applications is performance measurement. Unfortunately, many HPC systems expose their jobs to substantial amounts of interference (aka noise), leading to significant run-to-run variation. This makes performance measurements generally irreproducible, heavily complicating performance analysis and modeling.

On noisy systems, performance analysts usually have to repeat performance measurements several times and then apply statistics to capture trends. First, this is expensive and, second, extracting trends from a limited series of experiments is far from trivial, as the noise can follow quite irregular patterns.
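The repeat-and-summarize approach described above can be sketched in a few lines of Python. This is only an illustration, not the method developed in the project: the function names `measure_repeated` and `summarize`, the stand-in workload, and the choice of minimum, median and relative spread as summary statistics are all assumptions made for the example.

```python
import statistics
import time


def measure_repeated(workload, repetitions=10):
    """Run `workload` several times; return each run's wall-clock time in seconds."""
    timings = []
    for _ in range(repetitions):
        start = time.perf_counter()
        workload()
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Summarize a series of run times with simple order statistics."""
    ordered = sorted(timings)
    fastest, slowest = ordered[0], ordered[-1]
    return {
        # The fastest run is the one least affected by interference.
        "min": fastest,
        # The median is less sensitive to a few very slow runs than the mean.
        "median": statistics.median(ordered),
        "max": slowest,
        # Run-to-run variation relative to the fastest run.
        "relative_spread": (slowest - fastest) / fastest if fastest > 0 else float("inf"),
    }


def example_workload():
    # Hypothetical stand-in for a real HPC kernel.
    sum(i * i for i in range(100_000))


if __name__ == "__main__":
    print(summarize(measure_repeated(example_workload)))
```

Even in this small sketch, both problems named above are visible: every extra repetition costs a full run of the workload, and a handful of samples says little about noise that follows an irregular pattern.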

In the ExtraNoise project, Felix Wolf is working on solutions to this problem together with Dr Bernd Mohr from Jülich Supercomputing Centre and Drs Dmitry Nikitenko and Konstantin Stefanov from Moscow State University. Computer science professor Torsten Hoefler from ETH Zurich is involved as an associated partner.

In addition to making performance analysis more noise-resilient, the partners also aim to better understand how applications respond to noise in general and which design choices increase or lower their active and passive interference potential.