White Papers – Benchmarking

On this page you will find PRACE White Papers related to Benchmarking.

Title: Description of the initial accelerator benchmark suite

Authors: G. Hautreux, D. Dellis, C. Moulinec, A. Sunderland, A. Gray, A. Proeme, V. Codreanu, A. Emerson, B. Eguzkitza, J. Strassburg, M. Louhivuori

Abstract: The work produced within this task is an extension of the UEABS (Unified European Applications Benchmark Suite) for accelerators. As a first version of the extension, this document presents a full definition of a suite for accelerators. It covers each code, presenting the code itself as well as the test cases defined for the benchmarks and the problems that could occur in the coming months.
Like the UEABS, this suite aims to present results for many scientific fields that can use HPC accelerated resources. Hence, it will help the European scientific communities to decide which infrastructure they could buy in the near future. We focus on Intel Xeon Phi coprocessors and NVIDIA GPU cards for benchmarking, as they are the two most important accelerated resources available at the moment.
The following table lists the codes that will be presented in the next sections as well as their available implementations. It should be noted that OpenMP can be used with the Intel Xeon Phi architecture, while CUDA is used for NVIDIA GPU cards. OpenCL is a third alternative that can be used on both architectures.
The chosen codes are all part of the UEABS except PFARM, which comes from PRACE-2IP, and SHOC, which is a synthetic benchmark suite for accelerators.

Download paper: PDF

Title: Introducing OpenMP Tasks into the HYDRO Benchmark

Authors: Jeremie Gaidamour (a), Dimitri Lecas (a), Pierre-Francois Lavallee (a)
(a) IDRIS/CNRS, Campus universitaire d’Orsay, rue John Von Neumann, Batiment 506, F-91403 Orsay, France

Abstract: The HYDRO mini-application has been successfully used as a research vehicle in previous PRACE projects [6]. In this paper, we evaluate the benefits of the tasking model introduced in recent OpenMP standards [9]. We have developed a new version of HYDRO using the concept of OpenMP tasks and this implementation is compared to already existing and optimized OpenMP versions of HYDRO.

Download paper: PDF

Title: Performance Analysis of Alya on a Tier-0 Machine using Extrae

Authors: Jorge Rodriguez (a)
(a) BSC-CNS: Barcelona Supercomputing Center, Torre Girona, C/Jordi Girona, 31, 08034 Barcelona, Spain

Abstract: Alya [5] is a computational mechanics code capable of solving different physics. It has been used extensively on MareNostrum III (BSC’s Tier-0 machine), and it has also been used as a benchmarking code in the PRACE Unified European Applications Benchmark Suite. In this document, Extrae is used to collect and analyze performance data during an Alya simulation in a petaflop environment.
The performance analysis with Extrae [2] [3] revealed some potential improvements in Alya which, if implemented, could make exascale scalability achievable.
Application Code: Alya

Download paper: PDF

Title: Profiling of Code_Saturne with HPCToolkit and TAU, and autotuning Kernels with Orio

Authors: B. Lindi (a,*), T. Ponweiser (b), P. Jovanovic (c), T. Arslan (a)
(a) Norwegian University of Science and Technology
(b) RISC Software GmbH, a company of Johannes Kepler University Linz
(c) Institute of Physics Belgrade

Abstract: This study has profiled the application Code_Saturne, which is part of the PRACE benchmark suite. The profiling has been carried out with the tools HPCToolkit and Tuning and Analysis Utilities (TAU), with the aim of finding compute kernels suitable for autotuning.
Autotuning is regarded as a necessary step in achieving sustainable performance at the Exascale level, as Exascale systems will most likely have a heterogeneous runtime environment. A heterogeneous runtime environment imposes a parameter space on the application's run-time behavior which cannot be explored by a traditional compiler. Nor can the run-time behavior be explored manually by the developer or code owner, as this would be too time-consuming.
The tool Orio has been used for autotuning the identified compute kernels. Orio has been used on traditional Intel processors, Intel Xeon Phi and NVIDIA GPUs. The compute kernels make only a small contribution to the overall execution time of Code_Saturne. By autotuning with Orio, these kernels have been improved by 3-5%.
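To put the last two statements together, the following minimal Python sketch estimates how a kernel-level gain translates into a whole-application gain. It assumes that "improved by 3-5%" means a 3-5% reduction in kernel run time, and the 10% kernel share of total run time is a purely hypothetical value chosen for illustration; neither figure is taken from the paper.

```python
def overall_time_reduction(kernel_fraction, kernel_improvement):
    """Fraction of total run time saved when kernels that account for
    `kernel_fraction` of the run time get faster by `kernel_improvement`.
    Both arguments are fractions between 0 and 1."""
    return kernel_fraction * kernel_improvement


# Hypothetical share of total run time spent in the tuned kernels
# (an assumption for illustration, not a measured value).
ASSUMED_KERNEL_FRACTION = 0.10

# Kernel improvements in the 3-5% range quoted in the abstract.
low = overall_time_reduction(ASSUMED_KERNEL_FRACTION, 0.03)
high = overall_time_reduction(ASSUMED_KERNEL_FRACTION, 0.05)

print(f"Overall run time reduction: {low:.2%} to {high:.2%}")
```

Under these assumptions the whole application gains well under one percent, which illustrates why a small kernel share limits the overall benefit of tuning those kernels.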

Download paper: PDF

Title: Accelerator Aware MPI Micro-benchmarking using CUDA, OpenACC and OpenCL

Authors: Sadaf Alam, Ugo Varetto (a)
(a) Swiss National Supercomputing Centre, Lugano, Switzerland

Abstract: Recently, MPI implementations have been extended to support accelerator devices, namely Intel Many Integrated Core (MIC) and NVIDIA GPUs. This has been accomplished by changes at different levels of the software stack and of the MPI implementations. In order to evaluate the performance and scalability of accelerator-aware MPI libraries, we developed portable micro-benchmarks to identify factors that influence the efficiency of primitive MPI point-to-point and collective operations. These benchmarks have been implemented in OpenACC, CUDA and OpenCL. On the Intel MIC platform, existing MPI benchmarks can be executed with an appropriate mapping onto the MIC and CPU cores. Our results demonstrate that the MPI operations are highly sensitive to the memory and I/O bus configurations on the node. The current implementation of the MIC on-node communication interface exhibits additional limitations on the placement of the card and on data transfers over the memory bus.

Download paper: PDF

Title: Benchmarking and Thread Scaling of the HBM Ocean Circulation Model

Authors: Mikael Rannar (a), Maciej Szpindler (b)
(a) HPC2N & Department of Computing Science, Umea University
(b) Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw

Abstract: The scaling of the HBM (HIROMB-BOOS Model) ocean circulation model on selected PRACE Tier-0 systems is described. The model has been ported to the BlueGene/Q architecture and tested for OpenMP and mixed OpenMP/MPI parallel performance and scaling with a given test case scenario. Benchmarking of selected computational kernels and model procedures with a micro-benchmarking module has been proposed for further integration with the model code. Details of the micro-benchmark proposal and the results of the scaling tests are described.

Download paper: PDF

Title: Delft3D Performance Benchmarking Report

Authors: J. Donners (a,*), A. Mourits (b), M. Genseberger (b), B. Jagers (b)
(a) SURFsara, Amsterdam, The Netherlands
(b) Deltares, Delft, The Netherlands

Abstract: The Delft3D modelling suite has been ported to the PRACE Tier-0 and Tier-1 infrastructure. The portability of Delft3D was improved by removing platform-dependent options from the build system and replacing non-standard constructs in the source. Three benchmarks were used to investigate the scaling of Delft3D: (1) a large, regular domain; (2) a realistic, irregular domain with a low fill-factor; (3) a regular domain with a sediment transport module. The first benchmark clearly shows good scalability up to a thousand cores for a suitable problem. The other benchmarks show reasonable scalability up to about 100 cores. For test case (2) the main bottleneck is the serialized I/O. An attempt was made to implement a separate I/O server by dedicating the last MPI process to I/O, but this work is not yet finished. The imbalance due to the irregular domain can be reduced somewhat by using a cyclic placement of MPI tasks. Test case (3) benefits from inlining of frequently called routines.
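The effect of cyclic placement can be illustrated with a small Python sketch. It is not taken from the paper: the function names, the two-node layout and the per-task work values are hypothetical, and it assumes that neighbouring MPI ranks carry similar amounts of work, as can happen when an irregular domain is split into consecutive strips.

```python
def block_placement(num_tasks, num_nodes):
    """Place consecutive ranks on the same node (block placement)."""
    per_node = -(-num_tasks // num_nodes)  # ceiling division
    return [rank // per_node for rank in range(num_tasks)]


def cyclic_placement(num_tasks, num_nodes):
    """Distribute ranks over the nodes round-robin (cyclic placement)."""
    return [rank % num_nodes for rank in range(num_tasks)]


def node_loads(placement, work, num_nodes):
    """Sum the work of all ranks placed on each node."""
    loads = [0] * num_nodes
    for rank, node in enumerate(placement):
        loads[node] += work[rank]
    return loads


# Hypothetical work per MPI task: low ranks cover densely filled
# parts of the domain, high ranks cover mostly empty parts.
work = [8, 8, 7, 7, 2, 2, 1, 1]

block = node_loads(block_placement(len(work), 2), work, 2)
cyclic = node_loads(cyclic_placement(len(work), 2), work, 2)

print("block placement loads per node: ", block)
print("cyclic placement loads per node:", cyclic)
```

With these illustrative values, block placement concentrates the heavy tasks on one node, while cyclic placement spreads them evenly. Real gains depend on how the work is actually distributed over the ranks.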

Download paper: PDF

Title: Performance analysis of parallel applications on modern multithreaded processor architectures

Authors: Maciej Cytowski, Maciej Filocha, Jakub Katarzynski, Maciej Szpindler
Interdisciplinary Centre for Mathematical and Computational Modeling (ICM), University of Warsaw, Poland

Abstract: In this whitepaper we describe the effort we have made to measure the performance of applications and synthetic benchmarks using different simultaneous multithreading (SMT) modes. This processor architecture feature is currently available in many petascale HPC systems worldwide. Both the IBM Power7 processors available in Power775 (IH) and the IBM Power A2 processors available in Blue Gene/Q are built upon 4-way simultaneous multithreaded cores. It should also be mentioned that multithreading is predicted to be one of the leading features of future exascale systems available by the end of the next decade [1].

Download paper: PDF

Title: Selection of a Unified European Application Benchmark Suite

Authors: J. Mark Bull (a,*), Andrew Emerson (b)
(a) EPCC, University of Edinburgh, King’s Buildings, Mayfield Road, Edinburgh EH9 3JZ, UK.
(b) CINECA, via Magnanelli 6/3, 40033 Casalecchio di Reno, Bologna, Italy

Abstract: This White Paper reports on the selection of a set of application codes taken from the existing PRACE and DEISA application benchmark suites to form a single Unified European Application Benchmark Suite (UEABS).
The selected codes are: QCD, NAMD, GROMACS, Quantum Espresso, CP2K, GPAW, Code_Saturne, ALYA, NEMO, SPECFEM3D, GENE, and GADGET.

Download paper: PDF