White Papers – Evaluations on Intel MIC

On this page you will find PRACE White Papers related to Evaluations on Intel MIC.

Title: GEANT4 on Large Intel Xeon Phi Clusters: Service Implementation and Performance

Authors: Nevena Ilieva (a,b), Borislav Pavlov (b,c,*), Peicho Petkov (b,c), and Leandar Litov (b,c)
a Institute of Information and Communication Technologies, BAS, Sofia, Bulgaria
b National Centre for Supercomputing Applications (NCSA), Sofia, Bulgaria
c Sofia University “St. Kl. Ohridski”, Sofia, Bulgaria

Abstract: The fast and precise simulation of particle interactions with matter is an important task for several scientific areas, ranging from fundamental science (high-energy physics experiments, astrophysics and astroparticle physics, nuclear investigations) through applied science (radiation protection, space science), to biology and medicine (DNA investigations, hadron therapy, medical physics and medical imaging) and education. The basic software in all these fields is Geant4 (GEometry ANd Tracking) – a platform for simulation of the passage of particles through matter [1]. Along the way towards enabling the execution of Geant4-based simulations on HPC architectures with large clusters of Intel Xeon Phi co-processors as a general service, we study the performance of this software suite on the supercomputer system Avitohol@BAS, commenting on the pitfalls in the installation and compilation process. Some practical scripts are collected in the supplementary material shown in the appendix.

Download paper: PDF

Title: Optimization of IFS Subroutine LAITRI on Intel Knights Landing

Authors: Oisín Robinson (a,*), Alastair McKinstry (b), Michael Lysaght (a)
a ICHEC, Dublin, Ireland
b ICHEC, Galway, Ireland

Abstract: The landscape of HPC architectures has undergone significant change in the last few years. Notably, the Intel Xeon Phi architecture features a 512-bit-wide vector register allowing fine-grained parallelism, with a marked improvement in memory/cache speeds in the Knights Landing variant over the original Knights Corner version. Complexity of optimization is increased due to the great variety of hardware features, where cache considerations become fundamentally important. The optimization process is greatly facilitated in this regard by the Intel Advisor tool, which not only allows traditional roofline analysis, but also new roofline variations specific to each cache level – the 'cache-aware' roofline model (CARM). The Exascale-driving ESCAPE project is motivated by reducing time/energy costs of numerical weather prediction in part by investigation of 'mini-app/dwarf' performance on emerging hardware. The LAITRI subroutine of IFS is one of the 'seven dwarves of HPC weather codes'. It accounts for about 4% of the total runtime of IFS in production. We investigate various optimization strategies (using coarse and fine-grained parallelism among other ideas) for speeding up LAITRI. The CARM implemented by the Intel Advisor allows realistic evaluation of actual/potential performance gain.

Download paper: PDF
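As a rough illustration of the roofline idea mentioned in the abstract above, the sketch below computes the attainable performance bound min(peak compute, arithmetic intensity × bandwidth) once per memory level, which is the basic shape of a cache-aware roofline. It is not taken from the paper or from Intel Advisor: the function names and all machine figures (`PEAK_GFLOPS`, `BANDWIDTH_GBS`) are hypothetical values chosen only to make the arithmetic visible.

```python
def roofline(arithmetic_intensity, peak_gflops, bandwidth_gbs):
    """Attainable GFLOP/s: the lower of the compute roof and the memory roof."""
    return min(peak_gflops, arithmetic_intensity * bandwidth_gbs)


# Hypothetical machine figures, for illustration only (not measured values).
PEAK_GFLOPS = 2000.0
BANDWIDTH_GBS = {"L1": 8000.0, "L2": 2000.0, "DRAM": 100.0}


def cache_aware_bounds(arithmetic_intensity):
    """One roofline bound per memory level, all at the same arithmetic intensity."""
    return {
        level: roofline(arithmetic_intensity, PEAK_GFLOPS, bandwidth)
        for level, bandwidth in BANDWIDTH_GBS.items()
    }


# A kernel performing 0.5 floating-point operations per byte moved.
bounds = cache_aware_bounds(0.5)
for level, gflops in bounds.items():
    print(f"{level}: at most {gflops:.0f} GFLOP/s")
```

With these assumed figures the kernel is compute-bound if its data stays in L1 and strongly bandwidth-bound if it streams from DRAM, which is the kind of distinction a per-level roofline is meant to expose.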

Title: Characterization and optimization of sparse computations on Intel Xeon Phi

Authors: A. Elafrou (a), G. Goumas (a,*)
a National Technical University of Athens

Abstract: In this paper, we propose a lightweight optimization methodology for the ubiquitous sparse matrix-vector multiplication (SpMV) kernel for the Intel Xeon Phi manycore processors. The large number of cores in this platform overly exposes inherent structural weaknesses of different sparse matrices, intensifying performance issues beyond the traditionally reported memory bandwidth limitation. We, thus, advocate an input-adaptive optimization approach and present a method that identifies the major performance bottleneck of the kernel for every instance of the problem and selects a suitable optimization to tackle it. We describe two models for identifying the bottleneck: our first model requires performance bounds to be determined for the input matrix during an online profiling phase, while our second model only uses comprehensive structural features of the sparse matrix. Our optimizations are based on the widely used Compressed Sparse Row (CSR) storage format and have low preprocessing overheads, making our overall approach practical even in the context of iterative solvers that converge in a small number of iterations. We evaluate our methodology on the Intel Xeon Phi co-processor, codename Knights Corner (KNC), and demonstrate that it is able to distinguish and appropriately optimize the great majority of matrices in a large and diverse test suite, leading to a significant speedup of 2.2× on average over the Intel MKL library.

Download paper: PDF
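For readers unfamiliar with the Compressed Sparse Row (CSR) format named in the abstract above, here is a minimal reference sketch of the baseline SpMV kernel over CSR arrays. It shows only the plain, unoptimized loop structure and does not reproduce the paper's optimizations or bottleneck models; the array names (`values`, `col_idx`, `row_ptr`) are conventional but are assumptions here.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    values  : nonzero entries, stored row by row
    col_idx : column index of each entry in `values`
    row_ptr : row_ptr[i]..row_ptr[i + 1] delimits the entries of row i
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # Indirect access to x through col_idx is what makes SpMV irregular.
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


# The 3x3 matrix [[4, 0, 1], [0, 2, 0], [3, 0, 5]] in CSR form.
values = [4.0, 1.0, 2.0, 3.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]

y = spmv_csr(values, col_idx, row_ptr, [1.0, 2.0, 3.0])
print(y)  # [7.0, 4.0, 18.0]
```

Rows with very different nonzero counts give the outer loop uneven work per iteration, which is one of the structural weaknesses a many-core platform tends to expose.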

Title: Study of Xeon Phi Performance of a Molecular Dynamics Proxy Application

Authors: Benjamin Andreassen Bjørnseth (a,*), Jan Christian Meyer (b), Lasse Natvig (a)
a Dept. of Computer and Information Science, Norwegian University of Science and Technology (NTNU)
b High Performance Computing section, IT Dept., Norwegian University of Science and Technology (NTNU)

Abstract: This whitepaper studies the execution speed of the Intel Xeon Phi coprocessor when running a molecular dynamics proxy application. We aim to describe how particular code transformations influence the performance of different phases in the application code. Results demonstrate the performance response to code transformations on a single accelerator. The whitepaper will also present weak and strong scaling projections, and evaluate the potential for exascale simulations.

Download paper: PDF

Title: Practical Experiences with Intel Xeon Phi

Authors: Olli-Pekka Lehto (a)
a CSC, IT Center for Science, Finland

Abstract: This whitepaper presents experiences integrating Xeon Phi into a cluster system as well as porting and optimizing applications to the Xeon Phi. The focus is on disseminating information and best practices learned from hands-on experience which is not readily available from the standard manuals and other product literature.

Download paper: PDF

Title: Novel HPC Technologies for Rapid Analysis in Bioinformatics

Authors: Tristan Cabel, Gabriel Hautreux, Eric Boyer, Simon Wong, Nicolas Mignerey, Xiangwu Lu, Paul Walsh
Centre Informatique National de l’Enseignement Supérieur, 950, rue de Saint Priest, 34097 Montpellier Cedex 5, France.
Irish Centre for High-End Computing, 7/F Tower Blg., Trinity Technology & Enterprise Campus, Grand Canal Quay, Dublin 2, Ireland.
Grand Equipement National de Calcul Intensif, 12, rue de l’Eglise, 75015 Paris, France.
NSilico Lifescience Ltd., Melbourne Building, CIT Campus, Bishopstown, Cork City, Ireland.

Abstract: NSilico is an Irish-based SME that develops software for the life sciences sector, providing bioinformatics and medical informatics systems to a range of clients. One of the major challenges that their users face is the exponential growth of high-throughput genomic sequence data and the associated computational demands to process such data in a fast and efficient manner. Genomic sequences contain gigabytes of nucleotide data that require detailed comparison with similar sequences in order to determine the nature of functional, structural and evolutionary relationships. In this regard NSilico has been working with computational experts from CINES (France) and ICHEC (Ireland) under the PRACE SHAPE programme to address a key problem, the rapid alignment of short DNA sequences to reference genomes, by deploying the Smith-Waterman algorithm on an emerging many-core technology, the Intel Xeon Phi co-processor. This white paper will discuss some of the parallelisation and optimisation strategies adopted to achieve performance improvements of the algorithm, keeping in mind both existing and future versions of the hardware. The outcome has been an extremely successful collaboration between PRACE and NSilico, resulting in an implementation of the algorithm that can be readily deployed to realise significant performance gains from the next generation of many-core hardware.

Download paper: PDF
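The Smith-Waterman algorithm referred to in the abstract above finds the best-scoring local alignment between two sequences by dynamic programming. The sketch below is a plain serial version that returns only the best score; it is not the parallel Xeon Phi implementation described in the paper. The scoring values (`match`, `mismatch`, `gap`) and the linear gap penalty are assumptions chosen for simplicity.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of sequences a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Clamping at 0 lets an alignment restart anywhere (local alignment).
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTT", "GGACGGG"))  # 6: the shared "ACG"
```

Each cell depends only on its left, upper and upper-left neighbours, so cells on the same anti-diagonal are independent of one another. That property is what vectorized and many-core implementations usually exploit.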

Title: OpenMP Parallelization of the Slilab Code

Authors: Evghenii Gaburov, Minh Do-Quang, Lilit Axner
SURFsara, Science Park 140, 1098XG Amsterdam, the Netherlands
Linné FLOW Centre, Mechanics Department, KTH, SE-100 44 Stockholm, Sweden
KTH-PDC, SE-100 44 Stockholm, Sweden

Abstract: This white paper describes the parallelization of the Slilab code with OpenMP for a shared-memory execution model, focusing on multiphase flow simulations such as fiber suspensions in turbulent channel flows. In such problems the "second phase – fibre" frequently crosses the distributed domain boundaries of the "first phase – fluid", which in turn degrades the work balance between the MPI ranks. The addition of OpenMP parallelization makes it possible to reduce the number of MPI ranks in favor of single-node parallelism, thereby mitigating the MPI imbalance. With OpenMP parallelism in place, we also analyze the performance of Slilab on the Intel Xeon Phi.

Download paper: PDF

Title: Computational Throughput of Accelerator Units with Application to Neural Networks

Authors: Jan Christian Meyer, Benjamin Adric Dunn
High Performance Computing Section, IT Dept., Norwegian University of Science and Technology
Faculty of Medicine, Kavli Institute for Systems Neuroscience / Centre for Neural Computation, Norwegian University of Science and Technology

Abstract: The size of data that can be fitted with a statistical model becomes restrictive when accounting for hidden dynamical effects, but approximations can be computed using loosely coupled computations mainly limited by computational throughput. This whitepaper describes scalability results attained by implementing one approximate approach using accelerator technology identified in the PRACE deliverable D7.2.1 [1], with the aim of adapting the technique to future exascale platforms.

Download paper: PDF

Title: Algebraic Multi-Grid solver for lattice QCD on Exascale hardware: Intel

Authors: A. Abdel-Rehim, G. Koutsou, C. Urbach
Computation Based Science and Technology Research Center, The Cyprus Institute, 20 Konstantinou Kavafi Street 2121 Aglantzia, Nicosia, Cyprus
Helmholtz Institut für Strahlen und Kernphysik (Theorie) and Bethe Center for Theoretical Physics, Universität Bonn, 53115 Bonn, Germany

Abstract: In this white paper we describe work done on the development of an efficient iterative solver for lattice QCD based on the Algebraic Multi-Grid approach (AMG) within the tmLQCD software suite. This development is aimed at modern computer architectures that will be relevant for the exascale regime, namely multicore processors together with the Intel Xeon Phi coprocessor. Because of the complexity of this solver, implementation turned out to take considerable effort. Fine-tuning and optimization will require more work and will be the subject of further investigation. However, the work presented here provides a necessary initial step in this direction.

Download paper: PDF

Title: Evaluating CP2K on Exascale Hardware: Intel Xeon Phi

Authors: Fiona Reid, Iain Bethune
EPCC, The University of Edinburgh, James Clerk Maxwell Building, Mayfield Road, Edinburgh, EH9 3JZ, UK

Abstract: CP2K, a popular open-source European atomistic simulation package, has been ported to the Intel Xeon Phi architecture, requiring no code modifications except minor bug fixes. Benchmarking of a small molecular dynamics simulation has been carried out using CP2K’s existing MPI, OpenMP and mixed-mode MPI/OpenMP implementations to achieve full utilisation of the Xeon Phi’s 240 virtual cores. Running on the Xeon Phi in native mode, CP2K is approximately 4x slower than utilising all 16 cores of a Xeon E5-2670 Sandy Bridge dual-socket node. Careful placement of processes and threads on the virtual cores of the Xeon Phi was found to be crucial in achieving good performance. Analysis of the benchmark results has led to the identification of a number of bottlenecks which must be resolved to achieve competitive performance on the Xeon Phi; this will be carried out as a follow-on to the work reported here.
Application Code: CP2K

Download paper: PDF

Title: Analysis of SuperLU Solvers on Intel® MIC Architecture

Authors: Ahmet Duran, M. Serdar Celebi, Bora Akaydin, Mehmet Tuncel, Figen Oztoprak

Abstract: Intel Xeon Phi is a coprocessor with sixty-one cores in a single chip. The chip has a more powerful FPU that contains 512-bit SIMD registers. The Intel Xeon Phi chip can benefit from algorithms that operate on large vectors. In this work, sequential, multithreaded and distributed versions of SuperLU solvers are tested on the Intel Xeon Phi using the offload programming model, and they work well. There are several offload programming alternatives depending on where the pragma directives are placed. We find that the sequential SuperLU gained up to a 45% performance improvement from offload programming, depending on the sparse matrix type and the size of the transferred and processed data. On the other hand, the partitioning method of SuperLU_DIST and SuperLU_MT generates very small submatrices. Therefore, we observe that the matrix partitioning method and several other tradeoffs influence their performance on the Xeon Phi architecture.

Download paper: PDF

Title: Xeon Phi Performance for HPC-based Computational Mechanics Codes

Authors: M. Vazquez, F. Rubio, G. Houzeaux, J. Gonzalez, J. Gimenez, V. Beltran, R. de la Cruz, A. Folch
Barcelona Supercomputing Center, Spain

Abstract: In this paper we describe different applications we have ported to Intel Xeon Phi architectures and analyze their performance. The applications cover a wide range. The first is Alya, an HPC-based multi-physics code for parallel computers capable of solving coupled engineering problems on non-structured meshes. The second is Waris, a simulation code for Cartesian meshes that makes efficient use of well-ordered data and well-balanced parallel threads. The last analysis is performed for a cornerstone of several simulation codes, a Cholesky decomposition method. The results are very promising, showing the great flexibility and power of Xeon Phi architectures.

Download paper: PDF

Title: Performance Analysis and Enabling of the RayBen Code for the Intel® MIC Architecture

Authors: A. Schnurpfeil, F. Janetzko, St. Janetzko, K. Thust, M. S. Emran, J. Schumacher
Jülich Supercomputing Centre, Institute for Advanced Simulation, Germany
Institut für Thermo- und Fluiddynamik, Postfach
Technische Universität Ilmenau, Germany

Abstract: The subject of this project is the analysis and enabling of the RayBen code, which implements a finite difference scheme for the simulation of turbulent Rayleigh-Bénard convection in a closed cylindrical cell, for the Intel® Xeon Phi coprocessor architecture. After a brief introduction to the physical background of the code, the integration of RayBen into the benchmarking environment JuBE is discussed. The structure of the code is analysed through its call graph. The most performance-critical routines were identified. A detailed analysis of the OpenMP parallelization revealed several race conditions, which were eliminated. The code was ported to the JUROPA cluster at the Jülich Supercomputing Centre as well as to the EURORA cluster at CINECA. The performance of the code is discussed using the results of pure MPI and hybrid MPI/OpenMP benchmarks. It is shown that RayBen is a memory-intensive application that benefits greatly from the MPI parallelization. The offloading mechanism for the Intel® MIC architecture considerably lowers performance, while binaries that run exclusively on the coprocessor show satisfactory performance and a scalability comparable to that of the CPU.

Download paper: PDF

Title: Enabling the UCD-SPH code on the Xeon Phi

Authors: Christian Lalanne, Ashkan Rafiee, Denys Dutykh, Michael Lysaght, Frederic Dias
Irish Centre for High-End Computing, Dublin, Ireland
University College Dublin, Dublin, Ireland

Abstract: This white paper reports on our efforts to enable an SPH-based Fortran code on the Intel Xeon Phi. As a result of the work described here, the two most computationally intensive subroutines (rates and shepard_beta) of the UCD-SPH code were refactored and parallelised with OpenMP for the first time, enabling the code to be executed on multi-core and many-core shared memory systems. This parallelisation achieved speedups of up to 4.3x for the rates subroutine and 6.0x for the shepard_beta subroutine, resulting in overall speedups of up to 4.2x on a 2-processor Sandy Bridge Xeon E5 machine. The code was subsequently enabled and refactored to execute in different modes on the Intel Xeon Phi co-processor, achieving speedups of up to 2.8x for the rates subroutine and up to 3.8x for the shepard_beta subroutine and producing overall speedups of up to 2.7x compared to the original unoptimised code. To explore the capabilities of auto-vectorisation, the shepard_beta subroutine was refactored further, which results in speedups of up to 6.4x relative to the original unoptimised version of that subroutine. The development and testing phases of the project were carried out on the PRACE EURORA machine.

Download paper: PDF

Title: XeonPhi Meets Astrophysical Fluid Dynamics

Authors: Evghenii Gaburov, Yuri Cavecchi
SURFsara, Science Park 140, 1098XG Amsterdam, the Netherlands
Anton Pannekoek Institute, University of Amsterdam, the Netherlands

Abstract: This white paper reports on our efforts to optimize a 2D/3D astrophysical (magneto-)hydrodynamics Fortran code for XeonPhi. The code is parallelized with OpenMP and is suitable for execution on a shared memory system. Due to the complexity of the code combined with the immaturity of the compiler, we were unable to stay within the boundaries of the Intel Compiler Suite. To deliver performance we took two different approaches. First, we optimized and partially rewrote most of the bandwidth-bound Fortran code to recover scalability on XeonPhi. Next, we ported several critical compute-bound hotspots to the Intel SPMD Program Compiler (ISPC), which offers performance portability of a single source code across various architectures, such as Xeon, XeonPhi and possibly even GPU. This approach allowed us to achieve over 4x speed-up of the original code on dual-socket IvyBridge EP, and over 50x speed-up on the XeonPhi coprocessor. While the resulting optimized code can already be used in production to solve specific problems, we consider this project to be a proof-of-concept case reflecting the difficulty of achieving acceptable performance from XeonPhi on a “home-brewed” application.

Download paper: PDF

Title: Multi-Kepler GPU vs. Multi-Intel MIC for spin systems simulations

Authors: M. Bernaschi, M. Bisson, F. Salvadore
Istituto Applicazioni Calcolo, Rome, Italy
CINECA, Rome, Italy

Abstract: We present and compare the performance of two many-core architectures, the Nvidia Kepler and the Intel MIC, both in a single system and in cluster configuration, for the simulation of spin systems. As a benchmark we consider the time required to update a single spin of the 3D Heisenberg spin glass model by using the Over-relaxation algorithm. We also present data for a traditional high-end multi-core architecture: the Intel Sandy Bridge. The results show that although on the two Intel architectures it is possible to use basically the same code, the performance of an Intel MIC changes dramatically depending on (apparently) minor details. Another issue is that to obtain reasonable scalability with the Intel Phi coprocessor (Phi is the coprocessor that implements the MIC architecture) in cluster configuration it is necessary to use the so-called offload mode, which reduces the performance of the single system. As to the GPU, the Kepler architecture offers a clear advantage with respect to the previous Fermi architecture while maintaining exactly the same source code. Scalability of the multi-GPU implementation remains very good by using the CPU as a communication co-processor of the GPU. All source codes are provided for inspection and for double-checking the results.

Download paper: PDF

Title: Enabling Smeagol on Xeon Phi: Lessons Learned

Authors: Alin M. Elena, Ivan Rungger
Irish Centre for High-end Computing, Dublin, Ireland
School of Physics and CRANN, Trinity College, Dublin, Ireland

Download paper: PDF

Title: Code Optimization and Scaling of the Astrophysics Software Gadget on Intel Xeon Phi

Authors: P. Borovska, D. Ivanova
National Centre for Supercomputing Applications, Bulgaria

Abstract: The whitepaper reports our investigation into the porting, optimization and subsequent performance of the astrophysics software package GADGET on the Intel Xeon Phi. The GADGET code is intended for cosmological N-body/SPH simulations to solve a wide range of astrophysical tasks. The test cases within the project were simulations of galaxy systems. A performance analysis of the code was carried out, and porting, tuning and scaling of the GADGET code were completed. As a result, the hybrid MPI/OpenMP parallelization of the code has been enabled, and scalability tests on the Intel Xeon Phi processors of the PRACE EURORA system are reported.

Download paper: PDF

Title: Code Optimization and Scalability Testing of an Artificial Bee Colony Based Software for Massively Parallel Multiple Sequence Alignment on the Intel MIC Architecture

Authors: Plamenka Borovska, Veska Gancheva, Nikolay Landzhev
National Centre for Supercomputing Applications, Bulgaria
Department of Programming and Computer Technologies, Technical University of Sofia, Bulgaria
Department of Computer Systems, Technical University of Sofia, Bulgaria

Abstract: This project activity aims to investigate and improve the performance of the multiple sequence alignment software MSA_BG on the computer system EURORA at CINECA, for the case study of the influenza virus sequences. The objective is code optimization, porting, scaling and performance evaluation of the parallel multiple sequence alignment software MSA_BG for Intel Xeon Phi (the MIC architecture). For this purpose a parallel multithreaded optimization including OpenMP has been implemented and verified. The experimental results show that the hybrid parallel implementation utilizing MPI and OpenMP provides considerably better performance than the original code.

Download paper: PDF

Title: Optimization and Scaling of Multiple Sequence Alignment Software ClustalW on Intel Xeon Phi

Authors: Plamenka Borovska, Veska Gancheva, Simeon Tsvetanov

Abstract: This work aims to investigate and improve the performance of the multiple sequence alignment software ClustalW on the test platform EURORA at CINECA, for the case study of the influenza virus sequences. The objective is code optimization, porting, scaling and performance evaluation of the parallel multiple sequence alignment software ClustalW for Intel Xeon Phi (the MIC architecture). For this purpose a parallel multithreaded optimization including OpenMP has been implemented and verified. The experimental results show that the hybrid parallel implementation utilizing MPI and OpenMP provides considerably better performance than the original code.

Download paper: PDF

Title: Porting FEASTFLOW to the Intel Xeon Phi: Lessons Learned

Authors: Ioannis E. Venetis, Georgios Goumas, Markus Geveler, Dirk Ribbrock
University of Patras, Greece
National Technical University of Athens – NTUA, Greece
Technical University of Dortmund – TUD, Germany

Abstract: In this paper we report our experiences in porting the FEASTFLOW software infrastructure to the Intel Xeon Phi coprocessor. Our efforts involved both the evaluation of programming models including OpenCL, POSIX threads and OpenMP and typical optimization strategies like parallelization and vectorization. Since the straightforward porting process of the already existing OpenCL version of the code encountered performance problems that require further analysis, we focused our efforts on the implementation and optimization of two core building block kernels for FEASTFLOW: an axpy vector operation and a sparse matrix-vector multiplication (spmv). Our experimental results on these building blocks indicate the Xeon Phi can serve as a promising accelerator for our software infrastructure.

Download paper: PDF

Title: Optimising CP2K for the Intel Xeon Phi

Authors: Fiona Reid, Iain Bethune
EPCC, The University of Edinburgh, King’s Buildings, Edinburgh, UK

Abstract: CP2K is an important European program for atomistic simulation for many users of the PRACE Research Infrastructure as well as national and local compute resources. In the context of a PRACE Preparatory Access Type C project, we have parallelised several routines in CP2K to allow the code to gain better performance on the Intel Xeon Phi for a materials science application. We have obtained a 50% speedup in the maximum performance of the code on the Xeon Phi, but have not been able to demonstrate better performance than running the same calculation on a Sandy Bridge 16-core CPU node. We present details of the developments made to CP2K, and discuss several lessons, which will be of wider interest to developers considering porting their codes to Xeon Phi.

Download paper: PDF

Title: Towards Porting a Real-World Seismological Application to the Intel MIC Architecture

Authors: V. Weinberg, M. Allalen and G. Brietzke
Leibniz Rechenzentrum der Bayerischen Akademie der Wissenschaften, München, Germany

Abstract: This whitepaper aims to discuss first experiences with porting an MPI-based real-world geophysical application to the new Intel Many Integrated Core (MIC) architecture. The selected code SeisSol is an application written in Fortran that can be used to simulate earthquake rupture and radiating seismic wave propagation in complex 3-D heterogeneous materials. The PRACE prototype cluster EURORA at CINECA, Italy, was accessed to analyse the MPI-performance of SeisSol on Intel Xeon Phi on both single- and multi-coprocessor level. The whitepaper presents detailed scaling results on EURORA and compares them with the SandyBridge-based HPC system SuperMUC at LRZ at Garching near Munich, Germany. The work was done in a PRACE Preparatory Access project within the PRACE-1IP extension.

Download paper: PDF

Title: FMPS on MIC

Authors: Florian Seybold, Ralf Schneider, David Horak, Lubomir Riha, Vaclav Hapla, Vit Vondrak
High Performance Computing Center Stuttgart (HLRS)
IT4Innovations, VSB-Technical University of Ostrava (VSB)

Abstract: The Finite Element Method software system FMPS is briefly introduced and an overview is given of development work done by HLRS and VSB concerning the acceleration of solutions to problems yielded by FMPS using Intel’s Many Integrated Core architecture. HLRS implemented a hybrid MPI/OpenMP Conjugate Gradient solver. A different scaling behaviour when using Intel Xeon Phi cards, due to a different message transport path (involving a PCIe 2.0 bus), is revealed. VSB shows that the number of iterations of a Conjugate Gradient solver can be significantly reduced by using a Deflated Conjugate Gradient scheme, which requires the additional solution of a dense linear system by an LU factorisation. The solution of the dense system can be accelerated using MIC cards.

Download paper: PDF

Title: Massively parallel Poisson Equation Solver for hybrid Intel Xeon – Xeon Phi HPC Systems

Authors: Peicho Petkov, Damyan Grancharov, Stoyan Markov, Georgi Georgiev, Elena Lilkova, Nevena Ilieva, Leandar Litov
National Centre for Supercomputing Applications, Bulgaria

Abstract: We have optimized an implementation of a massively parallel algorithm for solving the Poisson equation using a 27-point stencil discretization scheme based on the stabilized biconjugate gradient method. The code is written in the C programming language with OpenMP parallelization. The main objective of this whitepaper is the optimization of the code for native Intel Xeon Phi execution, where we observe nearly linear scalability on the MIC architecture for the bigger problem sizes.

Download paper: PDF

Title: Exploiting Locality in Sparse Matrix-Matrix Multiplication on the Many Integrated Core Architecture

Authors: K. Akbudak, C. Aykanat
Bilkent University, Computer Engineering Department, Ankara, Turkey

Abstract: In this whitepaper, we propose outer-product-parallel and inner-product-parallel sparse matrix-matrix multiplication (SpMM) algorithms for the Xeon Phi architecture. We discuss the trade-offs between these two parallelization schemes for the Xeon Phi architecture. We also propose two hypergraph-partitioning-based matrix partitioning and row/column reordering methods that achieve temporal locality in these two parallelization schemes. Both HP models try to minimize the total number of transfers from/to the memory while maintaining balance on computational loads of threads. The experimental results performed for realistic SpMM instances show that the Intel MIC architecture has the potential for attaining high performance in irregular applications, as well as regular applications. However, intelligent data and computation reordering that considers better utilization of temporal locality should be developed for attaining high performance in irregular applications.

Download paper: PDF

Title: Porting and Verification of ExaFMM Library in MIC Architecture

Authors: Valentin Pavlov, Nikola Andonov, Georgi Kremenliev
National Center for Supercomputing Applications (NCSA)

Abstract: ExaFMM is a highly scalable implementation of the Fast Multipole Method (FMM) – an O(N) algorithm for solving N-body interaction with applications in gravitational and electrostatic simulations. The authors report scaling on large systems with O(100k) cores with support for MPI, OpenMP and SIMD vectorization. The library also includes GPU kernels capable of running on a multi-GPU system. The objective of the project is to enable the use of the ExaFMM solver in the MIC architecture by performing porting, verification, scalability testing and providing configuration suggestions to its potential users.

Download paper: PDF

Title: AGBNP2 Implicit Solvent Library for Intel® MIC Architecture

Authors: Peicho Petkov, Elena Lilkova, Stoyan Markov, Damyan Grancharov, Nevena Ilieva, Leandar Litov
National Centre for Supercomputing Applications, Bulgaria

Abstract: A library, implementing the AGBNP2 [1,2] implicit solvation model, was developed, which calculates the nonbonded interactions of a solute with the solvent — implicit water. The library is to be used in Molecular Dynamics (MD) packages for estimation of solvation free energies and studying of hydration effects.
The library was based on a previously developed Fortran code [3]. The code presented in this paper was rewritten in C and parallelized with OpenMP. The main objective was to parallelize the code in a way that allows it to run efficiently on Intel Xeon Phi in native mode. However, no efficient utilization of the MIC was observed, as the performance obtained on the coprocessor was an order of magnitude lower than on the CPU.