White Papers – Parallel Programming Interfaces

On this page you will find PRACE White Papers related to parallel programming interfaces.

Title: Optimizing ART: Radiative Transfer Forward Modeling Code for Solar Observations with ALMA

Author: Marcin Krotkiewski (a,*)
(a) USIT, University of Oslo / SIGMA2, Norway

Abstract: Various optimizations of the ART software package for the solution of the radiative transfer equation in three dimensions are discussed in this white paper. All critical path functions of the code have been optimized and vectorized using OpenMP directives. Several techniques have been used, among them the rearrangement of input data and internal data structures to facilitate usage of CPU vector units, vectorization of calls to the math library, explicit loop unrolling to allow vectorization of iterative loops with a convergence criterion, and vectorization of data-dependent if-statements through enforced computations on all SIMD lanes and filtering of the final result. Several technical challenges had to be overcome to achieve the best performance. The OpenMPI stack needed to be compiled with a custom (non-native) Glibc library. In some cases, individual vectorized clones generated automatically by the compilers needed to be substituted with custom functions implemented manually using compiler intrinsics. Performance tests have shown that on the Broadwell architecture the optimized code runs from 2.5x faster (RT solver) to 13x faster (EOS solver) on a single core. The MPI implementation of the code scales with 95% efficiency on 2048 cores. Throughout the project, several GCC bugs related to automatic OpenMP vectorization have been reported, which shows that the support for the relatively new OpenMP vectorization features is still not mature. For some of those bugs, effective workarounds have been developed. We also point to some shortcomings in the OpenMP simd vectorization framework and develop several new optimization techniques, which improve the effectiveness of automatic code vectorization. Finally, some generally useful tutorials have been delivered.
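Two of the techniques named in the abstract can be illustrated with a small sketch: a convergence loop run in fixed-size blocks, and a data-dependent branch replaced by computing on every lane and then filtering. This is not code from ART. The function name `sqrt_all_lanes`, the Newton square-root kernel, and the use of Python lists as stand-ins for SIMD lanes are all assumptions made for illustration only.

```python
import math

def sqrt_all_lanes(values, unroll=4, tol=1e-12, max_blocks=50):
    """Newton square roots computed in lockstep over all elements ("lanes")."""
    # Mask for the data-dependent condition: negative inputs have no real root.
    valid = [v >= 0.0 for v in values]
    # Enforce computation on all lanes: invalid lanes get a harmless input.
    safe = [v if ok else 1.0 for v, ok in zip(values, valid)]
    # Initial guess; max(v, 1.0) avoids starting from zero.
    x = [max(v, 1.0) for v in safe]
    for _ in range(max_blocks):
        # Fixed trip count: no per-lane early exit inside the block.
        for _ in range(unroll):
            x = [0.5 * (xi + v / xi) for xi, v in zip(x, safe)]
        # One convergence test per block, reduced over all lanes.
        if all(abs(xi * xi - v) <= tol * max(v, 1.0) for xi, v in zip(x, safe)):
            break
    # Filter the final result with the mask.
    return [xi if ok else math.nan for xi, ok in zip(x, valid)]
```

Lanes that converge early keep iterating until the slowest lane is done. That wastes some arithmetic but removes the per-element exit that would otherwise block vectorization.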

Download paper: PDF

Title: Extending the Scalability and Parallelization of SimuCoast Code to hybrid CPU+GPU Supercomputers

Authors: G. Oyarzun (a,*), Chalmoukis (a,*), G. Leftheriotis (a,*), Th. Koutrouveli (a,*), A. Dimas (a,*), R. Borrell (b,*)
(a) Laboratory of Hydraulics, Department of Civil Engineering, University of Patras, Greece
(b) Barcelona Supercomputing Center, Barcelona, Spain

Abstract: The aim of the project was to extend the scalability and parallelization strategy of the SimuCoast code to enable the use of hybrid CPU+GPU supercomputers. The code is focused on increasing the understanding of coastal processes, utilizing high performance computing (HPC) for the numerical simulation of the three-dimensional turbulent flow that is induced in the coastal zone, and mainly in the surf zone, by wave propagation (oblique to the shore), refraction, breaking and dissipation. A model based on MPI+OpenACC has been implemented in order to increase the computing capabilities of the code. The adapted code was validated using data from the Vittori-Blondeaux simulation, and it was tested using up to 512 computing nodes of the Piz Daint supercomputer.

Download paper: PDF

Title: Performance Portability of OpenCL with Application to Neural Networks

Authors: Jan Christian Meyer (a,*) and Benjamin Adric Dunn (b)
(a) High Performance Computing Section, IT Dept., NTNU
(b) Faculty of Medicine, Kavli Institute for Systems Neuroscience / Centre for Neural Computation, NTNU

Abstract: This whitepaper investigates the parallel performance of a sample application that implements an approximate expectation-maximization method for inferring the network structure and time-varying states of a hidden population within the framework of the kinetic Ising model. The size of networks that can yield informative results can be made arbitrarily large, and the long-running computational demand is highly localized, making the application a strong candidate for future exascale platforms.
Previous investigations using OpenMP on the Intel Xeon Phi architecture have suggested that the class of accelerator unit may play a significant part in attainable application performance. An OpenCL parallelization enables experiments with a variety of accelerator units. We examine how this programming model affects the performance of a portable implementation, and use it to compare accelerator technologies in terms of their suitability for future extreme-scale computations.

Download paper: PDF

Title: An interface for halo exchange pattern

Author: Mauro Bianco
Swiss National Supercomputing Centre (CSCS)

Abstract: Halo exchange patterns are very common in scientific computing, since the solution of PDEs often requires communication between neighbor points. Although this is a common pattern, implementations are often made by programmers from scratch, with an accompanying feeling of “reinventing the wheel”. In this paper we describe GCL, a C++ generic library that implements a flexible yet efficient interface to specify halo-exchange/halo-update operations for regular grids. GCL allows data layout, processor mapping, value types, and other parameters to be specified at compile time, while other parameters are specified at run-time. GCL is also GPU enabled, and we show that, somewhat surprisingly, GPU-to-GPU communication can be faster than the traditional CPU-to-CPU communication, making accelerated platforms more appealing for large scale computations.
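For readers unfamiliar with the pattern, the following minimal sketch shows a halo update for a 1D decomposition held in a single process. It does not use or reproduce the GCL interface. The function name `exchange_halos`, the `[halo, interior..., halo]` layout, and the `periodic` flag are assumptions chosen for illustration.

```python
def exchange_halos(subdomains, periodic=True):
    """Update the one-cell halos of 1D subdomains stored as [halo, interior..., halo]."""
    n = len(subdomains)
    # Snapshot the boundary interior cells first, much as a message-passing
    # code would pack its send buffers before receiving.
    first = [d[1] for d in subdomains]
    last = [d[-2] for d in subdomains]
    for i, d in enumerate(subdomains):
        left, right = i - 1, i + 1
        if periodic:
            left %= n
            right %= n
        if 0 <= left < n:
            d[0] = last[left]      # left halo <- right edge of left neighbor
        if 0 <= right < n:
            d[-1] = first[right]   # right halo <- left edge of right neighbor

doms = [[0, 1, 2, 0], [0, 3, 4, 0], [0, 5, 6, 0]]
exchange_halos(doms)
```

In a distributed setting each copy would become a send/receive pair between neighboring processes. The abstract describes GCL as wrapping that bookkeeping behind a generic interface.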

Download paper: PDF

Title: Selection of Task Implementations in the Nanos++ Runtime

Authors: Judit Planas (a,b), Rosa M. Badia (a,b,c), Eduard Ayguadé (a,b), Jesús Labarta (a,b)
(a) Barcelona Supercomputing Center, Barcelona, Spain
(b) Universitat Politècnica de Catalunya, Barcelona, Spain
(c) Artificial Intelligence Research Institute (IIIA), Spanish National Research Council (CSIC), Madrid, Spain

Abstract: New heterogeneous systems and hardware accelerators can give higher levels of computational power to high performance computers. However, this does not come for free, since the more heterogeneity the system presents, the more complex the programming task becomes in terms of resource utilization.
OmpSs is a task-based programming model and framework focused on the automatic parallelization of sequential applications. We present a set of extensions to this framework: we show how the application programmer can expose different specialized versions of tasks (i.e. pieces of specific code targeted and optimized for a particular architecture) and how the framework will choose between these versions at runtime to obtain the best performance achievable for the given application. Our results, obtained on a multi-GPU system, show that our project gives flexibility to the application’s source code and can potentially increase the application’s performance.

Download paper: PDF

Title: Towards Runtime-Clustering and improved Implementations of collective Operations in MPI

Authors: Chandan Basu (a), Johan Raber (a), and Michael Schliephake (b,*)
(a) National Supercomputer Center, Linköping University, SE-581 83 Linköping
(b) PDC Center for High Performance Computing, KTH Royal Institute of Technology, SE-100 44 Stockholm

Abstract: Further performance improvements of parallel simulation applications will not be reached by simply scaling today’s simulation algorithms and system software. Rather, they need qualitatively different approaches and new developments that address and reduce the typically non-linearly increasing complexity of algorithms with the use of increasing processor counts. We present first results of an activity aimed at improving the performance of collective communication operations relevant to simulation applications through more efficient implementations for large-scale program executions.

Download paper: PDF

Title: The State-of-the-Art in Directive-Guided Auto-Tuning for Accelerator and Heterogeneous Many-Core Architectures

Authors: Renato Miceli (a,*), Francois Bodin (b)
(a) Irish Centre for High-End Computing, Dublin, Ireland
(b) CAPS Enterprise, Rennes, France

Abstract: In this whitepaper we discuss the latest achievements in the field of auto-tuning of applications for accelerator and heterogeneous many-core architectures guided by programming directives. We provide both an academic perspective, presenting preliminary results obtained by the EU FP7 AutoTune project, and an industrial point of view, demonstrated by the commercial uptake by a leader in compiler technology and services, CAPS Enterprise.

Download paper: PDF

Title: Analysis and Optimization of a Hybrid Linear Equation Solver using Task-Based Parallel Programming Models

Authors: Claudia Rosas, Vladimir Subotic, José Carlos Sancho, Jesús Labarta
BSC, Barcelona Supercomputing Center, Jordi Girona, 29, Barcelona, 08034, Spain

Abstract: This paper describes a methodology and tools to analyze and optimize the performance of task-based parallel applications. For illustrative purposes, a cutting-edge implementation of the Jacobi method aimed at addressing software challenges on exascale computers is evaluated. Specifically, the analysis was carried out on synchronous and asynchronous task-based implementations of the Jacobi method. The methodology consists of three basic steps: (i) performance analysis; (ii) prediction; and (iii) implementation. First, by instrumenting and tracing an application, a general overview of its behavior can be obtained. The Paraver visualization tool enables the identification of performance bottlenecks or scalability problems. Secondly, with the help of prediction tools such as Tareador and Dimemas, the inherent parallelism of the application is evaluated. Finally, the code is refactored to resolve potential inefficiencies that prevent it from achieving higher performance. This final step is accomplished by using the OmpSs task-based parallel programming language. Results from applying the methodology highlighted performance issues regarding memory access, synchronization among the threads, and processors with long waiting periods. Additionally, the OmpSs implementation enabled the parallel execution of core functions of the application inside each thread, thereby obtaining a greater utilization of the computational resources.
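As background on the numerical kernel mentioned in the abstract, the sketch below shows a plain sequential Jacobi iteration for a small dense system. It is not the implementation evaluated in the paper and involves no OmpSs tasks. The function name `jacobi`, the tolerance, and the example matrix are assumptions for illustration.

```python
def jacobi(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b by Jacobi iteration (A should be diagonally dominant)."""
    n = len(b)
    x = [0.0] * n
    for _ in range(max_iter):
        # Each component of x_new depends only on the previous iterate x,
        # so the n updates are independent and could run as separate tasks.
        x_new = [
            (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)
        ]
        # Stop when the largest change between iterates is below tol.
        if max(abs(x_new[i] - x[i]) for i in range(n)) < tol:
            return x_new
        x = x_new
    return x

solution = jacobi([[4.0, 1.0], [2.0, 5.0]], [6.0, 12.0])
```

The independence of the per-row updates within one sweep is what makes the method a natural fit for task-based decomposition. The convergence check between sweeps is the kind of synchronization point that the abstract reports as a performance issue.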

Download paper: PDF

Title: Parallelization Using a PGAS Language such as X10 in HYDRO and TRITON

Author: Marc Tajchman
Commissariat à l’énergie atomique et aux énergies alternatives – CEA/DEN/DM2S/STMF/LGLS
91191 Gif-sur-Yvette, France

Abstract: In this whitepaper, after an introduction to X10, one of the PGAS languages, we describe the different parallelization paradigms used to write versions of two computing codes in this language. For HYDRO, a 2D hydrodynamics code, we started from the original sequential C version. We kept the global 1D alternating direction method, thanks to the logical global addressing scheme for distributed arrays in PGAS languages. Remote activities (or threads) of X10 were used to distribute work tasks between the different nodes. Local activities of X10 allow us to distribute local computations between cores on the same node. The only communication steps are the computation of the global time step and a global 2D array transposition. Because of this global transposition, we do not think that this scheme will be scalable on a large set of nodes. TRITON, a simulation platform that performs 3D hydrodynamics computations, was parallelized using a standard 3D domain decomposition method. We implemented a specialized distributed array class by extending the standard X10 array class to transparently handle ghost cells. Again, local and remote activities distribute computing tasks on all the cores. This scheme should show better scalability behaviour. These code porting actions show the flexibility and ease of programming of PGAS languages, even though the absolute performance of our PGAS implementations cannot rival the efficiency of current MPI implementations.

Download paper: PDF

Title: Investigating Performance Benefits from OpenACC Kernel Directives

Authors: Benjamin Eagan, Gilles Civario
Irish Centre for High-End Computing, Ireland

Abstract: This whitepaper explores the possible benefit of using OpenACC performance tuning directives, comparing the two prevalent implementations of the standard, CAPS and PGI. The performance of the default generated code, along with the impact of the gang and vector parameters, is evaluated through a matrix-matrix multiplication and a Classical Gram-Schmidt orthonormalization. Additionally, the impact of a change in the hardware is assessed.

Download paper: PDF

Title: Porting and Optimizing HYDRO to new Platforms

Authors: Pierre-François Lavallée (a), Guillaume Colin de Verdière (b), Philippe Wautelet (a), Dimitri Lecas (a), Jean-Michel Dupays (a)
(a) IDRIS/CNRS, Campus universitaire d’Orsay, rue John Von Neumann, Bâtiment 506, F-91403 Orsay, France
(b) CEA, Centre DAM Ile-de-France, Bruyères-le-Châtel, F-91227 Arpajon, France

Abstract: The purpose of low-level benchmarks is to measure certain important characteristics of the target computer system, such as arithmetic and communication rates and overheads. They are synthetic in the sense that each is designed to measure a particular architectural feature of the computer. In contrast to higher-level kernel and application benchmarks, they solve no real problem and they do not exhibit the properties of the real production codes that scientific software developers work with daily.
On the other hand, real application codes can be very complex and can use multiple specific algorithms. It can be very difficult or costly to port the code to a specific processor or to a new architecture.
Since HYDRO has been extracted from a real code (RAMSES [1]), it occurred to us that it would be a good candidate for benchmarking purposes. HYDRO includes classical algorithms found in many application codes for Tier-0 systems.
It has been written in several versions, including Fortran and C, in order to experiment with many new forms of parallelism and to adapt it easily to emerging architectures.
In this paper, we describe the different versions of HYDRO we have developed using classical or new parallel programming techniques or paradigms. We also summarize the lessons learned from this work: the difficulties we encountered in porting the application, the ease of use and maturity of the new parallel programming paradigms, and the significant performance improvements that could be obtained.

Download paper: PDF

Title: Data-parallel programming with Intel Array Building Blocks (ArBB)

Author: Volker Weinberg
Leibniz Rechenzentrum der Bayerischen Akademie der Wissenschaften, Boltzmannstr. 1, D-85748 Garching b. München, Germany

Abstract: Intel Array Building Blocks is a high-level data-parallel programming environment designed to produce scalable and portable results on existing and upcoming multi- and many-core platforms. We have chosen several mathematical kernels – a dense matrix-matrix multiplication, a sparse matrix-vector multiplication, a 1-D complex FFT and a conjugate gradients solver – as synthetic benchmarks and representatives of scientific codes and ported them to ArBB. This whitepaper describes the ArBB ports and presents performance and scaling measurements on the Westmere-EX based system SuperMIG at LRZ in comparison with OpenMP and MKL.

Download paper: PDF