White Papers – Preparatory Access

On this page you will find PRACE White Papers related to PRACE Preparatory Access.

Title: Parallel I/O Implementation in Hybrid Molecular Dynamics-Self Consistent Field OCCAM Code using fpioLib Library

Authors: C. Basua*, A. De Nicolab, G. Milanob
a National Supercomputer Centre, Linköping University, Sweden
b University of Salerno, Department of Chemistry and Biology, Via Giovanni Paolo II, 132, 84084, Fisciano, Italy

Abstract: A new parallel I/O scheme is implemented in the hybrid particle-field MD simulation code OCCAM. In the new implementation the number of input and output files is greatly reduced. Furthermore, the input and output files are smaller, as the new files are in binary format rather than the ASCII format of the original code. I/O performance is improved by replacing the original code's small and frequent data transfers with bulk transfers. Tests on two different systems show performance improvements of 6-18%.
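To illustrate why switching from per-record ASCII output to bulk binary records shrinks files and reduces the number of I/O calls, here is a minimal, hypothetical sketch; the variable names and data layout are illustrative and not taken from OCCAM or fpioLib:

```python
import struct

# Toy particle coordinates standing in for one trajectory frame.
coords = [(0.1 * i, 0.2 * i, 0.3 * i) for i in range(1000)]

# Original-style output: one formatted ASCII line per particle
# (many small writes, ~46 bytes per particle).
ascii_bytes = "".join(
    f"{x:15.8f}{y:15.8f}{z:15.8f}\n" for x, y, z in coords
).encode()

# New-style output: one bulk binary record for the whole frame
# (a single large write, 24 bytes per particle as three doubles).
flat = [c for xyz in coords for c in xyz]
binary_bytes = struct.pack(f"{len(flat)}d", *flat)

print(len(ascii_bytes), len(binary_bytes))
```

The binary frame is both smaller and written in one call, which is the bulk-transfer effect the abstract describes.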

Download paper: PDF

Title: Optimising UCNS3D, a High-Order finite-Volume WENO Scheme Code for arbitrary unstructured Meshes

Authors: Thomas Ponweisera,*, Panagiotis Tsoutsanisb
a Research Institute for Symbolic Computation (RISC), Johannes Kepler University, Altenberger Straße 69, 4040 Linz, Austria
b Centre for Computational Engineering Sciences, Cranfield University, College Rd, Cranfield MK43 0AL, United Kingdom

Abstract: UCNS3D is a computational-fluid-dynamics (CFD) code for the simulation of viscous flows on arbitrary unstructured meshes. It employs very high-order numerical schemes which inherently are easier to scale than lower-order numerical schemes due to the higher ratio of computation versus communication. In this white paper, we report on optimisations of the UCNS3D code implemented in the course of the PRACE Preparatory Access Type C project “HOVE” in the time frame of February to August 2016. Through the optimisation of dense linear algebra operations, in particular matrix-vector products, by formula rewriting, pre-computation and the usage of BLAS, significant speedups of the code by factors of 2 to 6 have been achieved for representative benchmark cases. Moreover, very good scalability up to the order of 10,000 CPU cores has been demonstrated.

Download paper: PDF

Title: Enabling Space Filling Curves parallel mesh partitioning in Alya

Authors: R. Borrella, J.C. Cajasa, G. Houzeauxa and M. Vazqueza
aBarcelona Supercomputing Center – Centro Nacional de Supercomputación, Spain

Abstract: Larger supercomputers allow the resolution of more complex problems that require denser, and thus larger, meshes. In this context, and extrapolating to the Exascale paradigm, meshing operations such as generation, deformation, adaptation/regeneration and partitioning/load balancing become a critical issue within the simulation workflow. In this paper we focus on mesh partitioning, presenting the work carried out in the context of a PRACE Preparatory Access Project to enable a Space Filling Curve (SFC) based partitioner in the computational mechanics code Alya. In particular, we have run our tests on the MareNostrum III supercomputer of the Barcelona Supercomputing Center. SFC partitioning is a fast and scalable alternative to standard graph-based partitioning and in some cases provides better solutions. We present our approach to implementing a parallel SFC-based partitioner. We have avoided any computational or memory bottleneck in the algorithm, while imposing that the solution achieved is independent (up to round-off errors) of the number of parallel processes used to compute it.
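The core idea of an SFC partitioner can be sketched in a few lines: map each cell centre to a position along a space-filling curve (here a Z-order/Morton curve), sort, and cut the ordered list into equal chunks. This is a serial toy sketch, not Alya's distributed implementation; names like `sfc_partition` are illustrative:

```python
def morton2d(ix, iy, bits=16):
    """Interleave the bits of (ix, iy) into one Z-order (Morton) key."""
    key = 0
    for b in range(bits):
        key |= ((ix >> b) & 1) << (2 * b)
        key |= ((iy >> b) & 1) << (2 * b + 1)
    return key

def sfc_partition(points, nparts, bits=10):
    """Order cell centres along the SFC, then cut into nparts equal chunks."""
    n = 1 << bits
    keyed = sorted(
        (morton2d(int(x * (n - 1)), int(y * (n - 1)), bits), i)
        for i, (x, y) in enumerate(points)
    )
    part = [0] * len(points)
    for rank, (_, i) in enumerate(keyed):
        part[i] = rank * nparts // len(points)
    return part

# Toy mesh: cell centres in the unit square, split into 4 subdomains.
pts = [(0.05 * i % 1.0, 0.07 * i % 1.0) for i in range(100)]
parts = sfc_partition(pts, 4)
```

Because the cut points depend only on the sorted curve ordering, the resulting partition is deterministic, which mirrors the abstract's requirement that the solution be independent of the number of processes used to compute it.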

Download paper: PDF

Title: Particle Transport in a Fluid interacting with an immersed Body with Alya

Authors: B. Eguzkitzaa*, M. Garcíaa, J.C. Cajasa, S. Marrasb, G. Houzeauxa, B. Sainte-Rosec
aBarcelona Supercomputing Center – Centro Nacional de Supercomputación, Spain
bStanford University, Department of Geophysics, Stanford, CA, U.S.A
cThe Ocean Cleanup, Operations Maritime Research, Delft, The Netherlands

Abstract: The Ocean Cleanup (www.theoceancleanup.com) is a foundation that develops technologies to extract plastic pollution from the oceans and prevent more plastic debris from entering ocean waters. Its main technology is the Ocean Cleanup Array, which uses long floating barriers to capture and concentrate the plastic, making the system a passive barrier. Computational Fluid Dynamics (CFD) is being used to study the catch efficiency for debris of different sizes and densities, the transport of plastic along the containment boom, and the forces acting on the boom, in order to determine the appropriate shape for the passive barrier concept. A CFD study of the influence of waves and the boom on particle trajectories is needed to investigate the effects of wind- and wave-induced turbulence on the boom's capture efficiency, as well as to include the interaction between particles and the dynamic structure in the CFD analyses. The objective of this PRACE project is to simulate the flow dynamics around the buoyancy body and the flexible skirt, as well as the interaction of the plastic debris with the skirt. This simulation is a strongly coupled multi-physics problem carried out with the Alya code: the Navier-Stokes equations in a turbulent regime, the free surface, solid mechanics and Lagrangian particle transport have to be solved. On the one hand, we have analysed the performance of the code on this kind of complex problem in terms of computational efficiency. On the other hand, we have overcome the physical and numerical difficulties presented by the simulation.

Download paper: PDF

Title: Parallel curved mesh Subdivision for flow Simulation on curved Topographies

Authors: A. Gargallo-Peiróa*, H. Owena , G. Houzeauxa , X. Rocaa
aBarcelona Supercomputing Center – Centro Nacional de Supercomputación, Carrer de Jordi Girona, 29-31, 08034 Barcelona, Spain

Abstract: We present the implementation in the Alya code of a method to refine a mesh in parallel while preserving the curvature of a target topography. Our approach starts by generating a coarse linear mesh of the computational domain. Then, this coarse mesh is curved to match the curvature of the target geometry. Finally, the curved mesh is given to the improved Alya code, which now reads the curved mesh, partitions it, and sends the subdomain meshes to the slaves. The result is a finer linear mesh obtained in parallel with improved geometric accuracy. The main application of the resulting finer linear mesh is to compute steady-state flow solutions on complex topographies.

Download paper: PDF

Title: Optimization of REDItools Package for investigating RNA Editing in Thousands of human deep sequencing Experiments

Authors: T. Flatia, S. Gioiosaa, G. Pesolea,c, T. Castrignanòb, E. Picardia,c
a Istituto di Biomembrane, Bioenergetica e Biotecnologie Molecolari, Consiglio Nazionale delle Ricerche, Bari, Italy
b SCAI, Cineca, Consorzio Interuniversitario di Supercalcolo, Roma, Italy
c Dipartimento di Bioscienze, Biotecnologie e Biofarmaceutica, Università degli Studi di Bari “A. Moro”, Bari, Italy

Abstract: RNA editing is a widespread post-transcriptional mechanism that alters primary RNA sequences through the insertion/deletion or modification of specific nucleotides. In humans, RNA editing affects nuclear and cytoplasmic transcripts mainly through the deamination of adenosine (A) to inosine (I) by members of the ADAR enzyme family. A-to-I modifications increase transcriptome and proteome diversity and contribute to modulating gene expression at the RNA level. RNA editing by A-to-I change is prominent in non-coding regions containing Alu repetitive elements, whereas the list of ADAR substrates in protein-coding genes is relatively small. RNA editing modifies several human neurotransmitter receptors and plays important roles in modulating their physiology. Indeed, its deregulation has been linked to a variety of human diseases, including neurological and neurodegenerative disorders, as well as cancer.
Current technologies for massive transcriptome sequencing, such as RNA-Seq, are providing accurate maps of the transcriptional dynamics occurring in complex eukaryotic genomes, such as the human genome, and are facilitating the detection of post-transcriptional RNA editing modifications with unprecedented resolution. However, the computational detection of RNA editing events in RNA-Seq experiments is quite intensive, requiring the human genome to be browsed position by position. To investigate RNA editing in very large cohorts of RNA-Seq data, we have developed a novel algorithm called REDItools2.0. Here, we describe the core algorithm as well as the optimization strategies used to efficiently analyze RNA editing on HPC systems.
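A position-by-position scan of the genome parallelises naturally by splitting the coordinate space into contiguous intervals, one work list per process. The sketch below shows a generic interval-splitting scheme of this kind; it is illustrative only and is not claimed to be REDItools2.0's actual decomposition:

```python
def split_intervals(chrom_lengths, nworkers):
    """Divide genome positions into contiguous, near-equal interval lists,
    one per worker. Intervals are (chromosome, start, end) half-open."""
    total = sum(chrom_lengths.values())
    chunk = -(-total // nworkers)  # ceil division: target positions per worker
    work = [[] for _ in range(nworkers)]
    current, filled = 0, 0
    for chrom, length in chrom_lengths.items():
        pos = 0
        while pos < length:
            take = min(length - pos, chunk - filled)
            work[current].append((chrom, pos, pos + take))
            pos += take
            filled += take
            if filled == chunk and current < nworkers - 1:
                current, filled = current + 1, 0
    return work

# Toy genome: two chromosomes split across four workers.
w = split_intervals({"chr1": 1000, "chr2": 600}, 4)
```

Each worker then scans only its own intervals, so the positional browsing cost is spread evenly regardless of how positions are distributed across chromosomes.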

Download paper: PDF

Title: A Numerical Code for The Study of Water Droplets’ Growth, Collision, Coalescence and Clustering Inside Turbulent Warm Cloud-Clear Air Interfaces

Authors: V. Ruggieroa*, D. Codonib and D. Tordellab
aCINECA, SCAI Rome, bPolitecnico di Torino, DISAT

Abstract: In the past literature, most simulations of warm clouds assumed static and homogeneous conditions. We are interested in simulating more realistic regimes of warm clouds, which are in fact systems living in perpetually transitional states. These time evolutions depend strongly on the turbulent air flow hosting the cloud, and on the transport phenomena taking place through the complex surfaces that separate the cloud from the clear air surrounding it.
In our simulations, cloud boundaries (called interfaces in the text) are modelled through shear-less turbulent mixing, matching two interacting flow regions – a small portion of cloud and an adjacent clear-air portion of equivalent volume – at different turbulent intensities. An initial condition reproduces local stable or unstable stratification in density and temperature. The droplet model includes evaporation, condensation, collision and coalescence. The typical water content inside a warm cloud parcel of about 500 m^3, combined with an initial condition where drops are 30 microns in diameter, leads to an initial number of drops of the order of 10^11. A simulation grid of up to 4092x2048x2048 points is sought, which leads to a Taylor-microscale Reynolds number of 500. The governing equations are the Navier-Stokes equations under the Boussinesq approximation, coupled to the transport equation for water vapour, represented as a passive scalar, and for the drops, treated as inertial particles transported by background turbulence and gravity. The code uses a slab parallelization. The system contains a huge number of discrete elements, i.e. the water droplets, which undergo intense clustering due to turbulent fluctuations. Turbulent clustering is not predictable and in turn produces an imbalance in the communication rate among different cores. As a consequence, the computational burden is not evenly distributed among the cores. This, per se, strongly limits performance and binds the parallelization to a slab structure. Furthermore, clustering increases in time, and induces an inhomogeneous enhancement of the local droplet collision rate, as well as a concomitant depression of the growth in size of the water droplets.
The long-term evolution of many kinds of transients must be considered in order to understand the above processes. This, together with the variation of a quite large set of control parameters, will be the main motivation for requesting Tier-0 level computational resources in the future for the simulation of water droplets' growth, collision, coalescence and clustering inside turbulent warm cloud-clear air interfaces.

Download paper: PDF

Title: Extending the Scalability and Parallelization of SimuCoast Code to hybrid CPU+GPU Supercomputers

Authors: G. Oyarzuna*, Chalmoukisa*, G. Leftheriotisa*, Th. Koutrouvelia*, A. Dimasa*, R. Borrellb*
a Laboratory of Hydraulics, Department of Civil Engineering, University of Patras, Greece
b Barcelona Supercomputing Center, Barcelona, Spain

Abstract: The aim of the project was to extend the scalability and parallelization strategy of the SimuCoast code to enable the use of hybrid CPU+GPU supercomputers. The code is focused on increasing the understanding of coastal processes, utilizing high performance computing (HPC) for the numerical simulation of the three-dimensional turbulent flow induced in the coastal zone, and mainly in the surf zone, by wave propagation (oblique to the shore), refraction, breaking and dissipation. An MPI+OpenACC model has been implemented in order to increase the computing capabilities of the code. The adapted code was validated against data from the Vittori-Blondeaux simulation and was tested using up to 512 computing nodes of the Piz Daint supercomputer.

Download paper: PDF

Title: Airinnova: Automation of High-Fidelity CFD Analysis for Aircraft Design and Optimization

Authors: Mengmeng Zhanga*, Jing Gongb, Lilit Axnerb, Michaela Barthb
a Airinnova AB, SE-182 48, Sweden
b PDC Center for High Performance Computing, KTH Royal Institute of Technology, SE-100 44, Sweden

Abstract: Airinnova is a start-up company with a key competency in the automation of high-fidelity computational fluid dynamics (CFD) analysis. Following on from our previous PRACE SHAPE project, we have continued collaborating with the PDC Center for High Performance Computing at the KTH Royal Institute of Technology (KTH-PDC), to investigate the performance analysis of the open source CFD code SU2 and further develop the automation process for the field of aerodynamic optimization and design.

Download paper: PDF

Title: Scalable Delft3D Flexible Mesh for Efficient Modelling of Shallow Water and Transport Processes

Authors: M. Mogéa, M. J. Russchera,b, A. Emersonc, M. Gensebergerb
a SURFsara, The Netherlands
b Deltares, The Netherlands
c CINECA, Italy

Abstract: D-Flow Flexible Mesh (“D-Flow FM”) [1] is the hydrodynamic module of the Delft3D Flexible Mesh Suite [2]. Since typical, real-life applications require D-Flow FM to be more efficient and scalable on high performance computing systems, we profiled and analysed D-Flow FM for representative test cases. In this paper, we discuss the conclusions of our profiling and analysis. We observed that, for specific models, D-Flow FM can be used for parallel simulations on up to a few hundred cores with good efficiency. It was however observed that D-Flow FM becomes MPI bound when scaled up. Therefore, for further improvement, we investigated the two optimisation strategies described below.
The parallelisation is based on mesh decomposition and the use of deep halo regions may lead to significant mesh imbalance. Therefore, we first investigated different partitioning and repartitioning strategies to improve the load balance and thus reduce the time spent waiting on MPI communications. We obtained small performance gains in some cases, but further investigations and broader changes in the numerical methods would be needed for this to be usable in a general case.
As a second option, we tried a communication-hiding conjugate gradient method, PETSc's linear solver KSPPIPECG, to solve the linear system arising from the spatial discretisation, but we were unable to obtain any performance improvement or to reproduce the speedup published by the authors. The performance of this method turns out to be very architecture- and compiler-dependent, which prevents its use in a more general-purpose code like D-Flow FM.
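The pipelined solver mentioned above is exposed in PETSc as a runtime option, so a PETSc-based code can switch between the standard and communication-hiding variants without recompiling. A sketch of the two invocations (the executable name is hypothetical; the `-ksp_type` values are real PETSc options):

```shell
# Standard preconditioned conjugate gradient
./dflowfm_petsc_app -ksp_type cg -pc_type jacobi -ksp_monitor

# Communication-hiding pipelined CG (KSPPIPECG); its benefit depends on
# asynchronous progression of the non-blocking reduction in the MPI stack
./dflowfm_petsc_app -ksp_type pipecg -pc_type jacobi -ksp_monitor
```

The strong architecture and compiler dependence reported in the abstract is consistent with this: the pipelined variant only pays off when the MPI library actually progresses the non-blocking all-reduce in the background.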

Download paper: PDF

Title: Optimizing ART: Radiative Transfer Forward Modeling Code for Solar Observations with ALMA

Authors: Marcin Krotkiewskia*
a USIT, University of Oslo / SIGMA2, Norway

Abstract: Various optimizations of the ART software package for the solution of the radiative transfer equation in three dimensions are discussed in this white paper. All critical path functions of the code have been optimized and vectorized using OpenMP directives. Several techniques have been used, amongst others the rearrangement of input data and internal data structures to facilitate usage of CPU vector units, vectorization of calls to the math library, explicit loop unrolling to allow vectorization of iterative loops with a convergence criterion, and vectorization of data-dependent if-statements through enforced computation on all SIMD lanes and filtering of the final result. Several technical challenges had to be overcome to achieve the best performance. The Open MPI stack needed to be compiled with a custom (non-native) Glibc library. In some cases, individual vectorized clones generated automatically by the compilers needed to be substituted with custom functions implemented manually using compiler intrinsics. Performance tests have shown that on the Broadwell architecture the optimized code runs from 2.5x faster (RT solver) to 13x faster (EOS solver) on a single core. The MPI implementation of the code scales with 95% efficiency on 2048 cores. Throughout the project, several GCC bugs related to automatic OpenMP vectorization were reported, which shows that support for the relatively new OpenMP vectorization features is still not mature. For some of those bugs, effective workarounds have been developed. We also point out some shortcomings in the OpenMP simd vectorization framework and develop several new optimization techniques that improve the effectiveness of automatic code vectorization. Finally, some generally useful tutorials have been delivered.

Download paper: PDF

Title: Optimization of EC-Earth 3.2 Model

Authors: K. Serradell, M. Castrillo and M. Acosta – Barcelona Supercomputing Center


Abstract: The increase in capability of Earth System Models (ESMs) is strongly linked to the amount of computing power, given that the spatial resolution used for global climate experimentation is a limiting factor to correctly reproduce climate mean state and variability. However, higher spatial resolutions require new High Performance Computing (HPC) platforms, where improving the computational efficiency of the ESMs will be mandatory. In this context, porting a new ultra-high resolution configuration to a new and more powerful HPC cluster is a challenging task, requiring technical expertise to deploy and improve the computational performance of such a novel configuration. In this paper, we focus on the work done in the context of a PRACE Preparatory Access Project, aiming to optimize the T1279-ORCA12 configuration of the EC-Earth 3.2 coupled climate model.

In this case, all runs have been performed on the MareNostrum IV supercomputer of the Barcelona Supercomputing Center.

Download paper: PDF

Title: Performance Assessment of Pipelined Conjugate Gradient method in Alya

Authors: Pedro Ojeda-Maya, Jerry Erikssona, Guillaume Houzeauxb and Ricard Borrellb*
a High Performance Computing Center North (HPC2N), MIT Huset, Umeå Universitet, 90187 Umeå, Sweden
b Barcelona Supercomputing Center, C/Jordi Girona 29, 08034-Barcelona, Spain

Abstract: Currently, one of the trending topics in High Performance Computing is exascale computing. Although the hardware is not yet available, the software community is working on developing and updating codes that can efficiently use exascale architectures once they become available. Alya is one of the codes being developed towards exascale computing. It is part of the simulation packages of the Unified European Applications Benchmark Suite (UEABS) and the Accelerators Benchmark Suite of PRACE, and thus complies with the highest standards in HPC. Even though Alya has proven its scalability up to hundreds of thousands of CPU cores, there are some expensive routines that could affect its performance on exascale architectures. One of these routines is the conjugate gradient (CG) algorithm. CG is relevant because it is called at each time step in order to solve a linear system of equations. The bottleneck in CG is the large number of collective communication calls. In particular, the preconditioned CG (PCG) already implemented in Alya uses two collective communications. In the present work, we developed and implemented a pipelined version of PCG (PPCG), which allows us to halve the number of collectives. We then took advantage of non-blocking MPI communications to further reduce the waiting time during message exchange. The resulting implementation was analysed in detail using the Extrae/Paraver profiling tools. The PPCG implementation was tested by studying the flow around a 3D sphere. Several tests were performed using different numbers of processes and workloads to assess the strong and weak scaling of the implemented algorithms. This work was developed in the context of the preparatory access programme of PRACE; simulations were run on the MareNostrum 4 (MN4) supercomputer at the Barcelona Supercomputing Center (BSC).
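The recurrence rearrangement behind a pipelined CG can be sketched compactly. The code below is a generic, unpreconditioned pipelined CG in the Ghysels-Vanroose style, shown only to illustrate how the two dot products end up adjacent (so they can share one non-blocking reduction); it is not Alya's actual Fortran implementation:

```python
def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pipelined_cg(A, b, iters=30, tol=1e-24):
    """Unpreconditioned pipelined CG. The two dot products are formed back
    to back, so in an MPI code they can share ONE non-blocking all-reduce,
    overlapped with the matrix-vector product matvec(A, w)."""
    n = len(b)
    x, r = [0.0] * n, b[:]                 # x0 = 0, so r0 = b
    w = matvec(A, r)
    z, s, p = [0.0] * n, [0.0] * n, [0.0] * n
    alpha = gamma_old = 0.0
    for i in range(iters):
        gamma, delta = dot(r, r), dot(w, r)  # fused reductions
        if gamma < tol:                      # converged
            break
        q = matvec(A, w)                     # overlaps the reduction in MPI
        if i == 0:
            beta, alpha = 0.0, gamma / delta
        else:
            beta = gamma / gamma_old
            alpha = gamma / (delta - beta * gamma / alpha)
        z = [qi + beta * zi for qi, zi in zip(q, z)]
        s = [wi + beta * si for wi, si in zip(w, s)]
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * si for ri, si in zip(r, s)]
        w = [wi - alpha * zi for wi, zi in zip(w, z)]
        gamma_old = gamma
    return x

A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x = pipelined_cg(A, b)
```

In exact arithmetic this produces the same iterates as classical CG; the gain is purely in communication structure, which is what the abstract's halving of collective calls refers to.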

Download paper: PDF

Title: Performance Analysis Of The BiqBin Solver For The Maximum Stable Set Problem

Authors: A. Shamakina – High-Performance Computing Center Stuttgart (HLRS), Germany, T. Hrga and J. Povh – University of Ljubljana, Slovenia

Abstract: The BiqBin application is a high-performance solver for linearly constrained binary quadratic problems. The software is open source and available as an online solver. In this White Paper, we report on optimizations of the branch of the BiqBin code designed for solving the maximum stable set problem, implemented in the course of the PRACE Preparatory Access Type D project “BiqBin solver” in the time frame of September 2018 to February 2020. This research on the BiqBin code consists of a performance analysis and the implementation of a prototype. Optimizing the original BiqBin code v1.0 into the new version v2.0 increased the speedup by a factor of 6.2 for a use case on 192 MPI processes. Further analysis of BiqBin v2.0 revealed poor parallel efficiency due to the high proportion of MPI communications. All suggested optimization ideas were demonstrated on the prototype.

Download paper: PDF

Title: Optimisation of the Higher-Order Finite-Volume Unstructured Code Enhancement for Compressible Turbulent Flows

Authors: A. Shamakinaa, Panagiotis Tsoutsanisb
a High-Performance Computing Center Stuttgart (HLRS), University of Stuttgart, Nobelstrasse 19, 70569 Stuttgart, Germany
b Centre for Computational Engineering Sciences, Cranfield University, College Rd, Cranfield MK43 0AL, United Kingdom

Abstract: The Higher-Order finite-Volume unstructured code Enhancement (HOVE2) is open-source software in the field of computational fluid dynamics (CFD). The code enables the simulation of compressible turbulent flows.

In this White Paper, we report on optimisations of the HOVE2 code implemented in the course of the PRACE Preparatory Access Type C project “HOVE2” in the time frame of December 2018 to June 2019. The optimisation focused on implementing ParMETIS support and MPI-IO. Through the optimisation of the MPI collective communications, significant speedups have been achieved. In particular, on 70 compute nodes, MPI-IO writes were 180 times faster than standard I/O.

Download paper: PDF

Title: PETSc4FOAM – A Library to plug-in PETSc into the OpenFOAM Framework


  • Simone Bna, SuperComputing Applications and Innovation Department, Cineca, Bologna, Italy
  • Ivan Spisso, SuperComputing Applications and Innovation Department, Cineca, Bologna, Italy
  • Mark Olesen, ESI-OpenCFD, Engineering System International GmbH, Munich, Germany
  • Giacomo Rossi, Intel Corporation Italia SpA, Milano, Italy

Abstract: OpenFOAM is a major player in the open-source CFD arena thanks to its flexibility, but its complexity also makes it difficult to correctly define performance figures and scaling. One of the main bottlenecks preventing the full enabling of OpenFOAM on massively parallel clusters is the limit of its MPI-parallelism paradigm, embodied in the Pstream library, which caps scalability at the order of a few thousand cores. The work presented here aims to create an interface to external linear algebra libraries, such as PETSc/Hypre, for solving sparse linear systems, giving users greater choice and flexibility when solving their cases and making available the respective communities' knowledge, developed over decades, which is not currently accessible within the OpenFOAM framework.

Download paper: PDF

Title: Keeping Computational Performance Analysis Simple: An Evaluation Of The NEMO BENCH Test

Authors: Stella V. Paronuzzi Ticco, Mario C. Acosta, M. Castrillo, O. Tintó and K. Serradell – Barcelona Supercomputing Center


Abstract: In 2019 a non-intrusive instrumentation of the NEMO code, aimed at providing information about the MPI communication cost and structure of the model, was created by E. Maisonnave and S. Masson at the LOCEAN Laboratory in Paris. The main goal was to identify which developments should be prioritised to enhance the model's scalability: a new NEMO configuration, called BENCH, was developed specifically for this purpose, offering an easy way to make performance measurements and aiming to simplify future benchmarking activities related to computing efficiency.

In the first part of this work we study whether this configuration can actually be used as a valid tool to gain insight into NEMO's performance, and then proceed to use the BENCH test to study some of NEMO's best-known bottlenecks: I/O and the north fold. Additionally, we take the opportunity to investigate a topic that is gaining popularity among NEMO developers: the variability of the time required to perform a time step and how it influences NEMO's performance, obtaining prescriptions on how to avoid or mitigate such behaviour.

Download paper: PDF

Title: Intelligent HTC for Committor Analysis

Authors: Miłosz Bialczaka, Alan O’Caisb, David Swensonb, Mariusz Uchronskia, Adam Włodarczyka

aWroclaw Centre of Networking and Supercomputing (WCSS), Wroclaw University of Science and Technology
bE-CAM HPC Centre of Excellence


Abstract: Committor analysis is a powerful, but computationally expensive, tool for studying reaction mechanisms in complex systems. The committor can also be used to generate initial trajectories for transition path sampling, a less expensive technique for studying reaction mechanisms. The main goal of the project was to facilitate an implementation of committor analysis in the software application OpenPathSampling (http://openpathsampling.org/) that is performance portable across a range of HPC hardware and hosting sites. We do this through the use of hardware-enabled MD engines in OpenPathSampling, coupled with a custom library extension to the data analytics framework Dask (https://dask.org/) that allows for the execution of MPI-enabled tasks in a steerable High Throughput Computing workflow. The software developed here is being used to generate initial trajectories to study a conformational change in the main protease of the SARS-CoV-2 virus, which causes COVID-19. This conformational change may regulate the accessibility of the active site of the main protease, and a better understanding of its mechanism could aid drug design.
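As a toy illustration of what a brute-force committor computation involves, and why it fans out naturally into independent high-throughput tasks, the sketch below uses 1D free diffusion in place of a real MD engine. None of this code comes from OpenPathSampling; all names are illustrative:

```python
import random

def committor_estimate(x0, n_traj, step=0.05, seed=1):
    """Brute-force committor: the fraction of trajectories launched from x0
    that reach the 'product' state B (x >= 1) before the 'reactant' state
    A (x <= 0). Each trajectory is independent, so in an HTC workflow the
    n_traj shots can be farmed out to separate workers."""
    rng = random.Random(seed)
    hits_b = 0
    for _ in range(n_traj):
        x = x0
        while 0.0 < x < 1.0:          # propagate until A or B is reached
            x += rng.gauss(0.0, step)
        hits_b += x >= 1.0
    return hits_b / n_traj

# For free diffusion between absorbing boundaries the exact committor
# is pB(x) = x, so estimates should increase with the starting point.
print(committor_estimate(0.5, 500))
```

The expense the abstract refers to comes from replacing the one-line toy propagator with full MD trajectories, repeated for many shots per configuration and many configurations.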

Download paper: PDF

Title: OpenMP optimisation of the eXtended Discrete Element Method

Authors: Pedro Ojeda-Maya, Jerry Erikssona, Alban Roussetb, Xavier Besseronb, Abdoul Wahid Mainassara Chekaraoub, Bernhard Petersb

aHigh Performance Computing Center North (HPC2N), MIT Huset, Umeå Universitet, 90187 Umeå, Sweden
bFaculty of Science, Technology and Medicine (FSTM), University of Luxembourg, 1359 Luxembourg


Abstract: The eXtended Discrete Element Method (XDEM) is an extension of the regular Discrete Element Method (DEM), software for simulating the dynamics of granular material. XDEM extends the regular DEM method by adding features whereby both microscopic and macroscopic observables can be computed simultaneously by coupling different time and length scales. In this sense XDEM belongs to the category of multi-scale/multi-physics applications that can be used in realistic simulations. In this white paper, we detail the optimisations carried out during the preparatory PRACE project to overcome known bottlenecks in the OpenMP implementation of XDEM. We analysed the Conversion, Dynamics, and combined Dynamics-Conversion modules with the Extrae/Paraver and Intel VTune profiling tools in order to find the most expensive functions. The proposed code modifications improved the performance of XDEM by ~17% for the computationally expensive combined Dynamics-Conversion modules (with 48 cores, a full node). Our analysis was performed on the MareNostrum 4 (MN4) PRACE infrastructure at the Barcelona Supercomputing Center (BSC).

Download paper: PDF