
Memory optimisation of a stencil-based code: Intel's Cedric Andreolli highlights the benefits of using memory optimisation techniques.

By definition, HPC is about high performance, so HPC programmers must optimise their code. The performance of an application, measured in floating-point operations per second (flops), is often limited by memory bandwidth. Achieving performance close to the theoretical maximum usually requires threading, vectorisation, and efficient use of memory bandwidth. One way to optimise memory bandwidth is to keep the distance between consecutive memory accesses small. To take advantage of automatic hardware prefetching, a good practice is to design your for-loops so that successive iterations make unit stride accesses to your most heavily used arrays. Once unit stride accesses are in place, it may also be possible to order your memory accesses into blocks, increasing the chance of reusing data from the cache. In this article, we will see how to detect a memory bottleneck caused by non-unit stride accesses, and how to implement cache blocking on a stencil code.


This article is part of a series that demonstrates the optimisation process on Iso3DFD, a wave propagation kernel used in seismic simulation. The equation is implemented with a finite difference stencil that is 16th order in space and second order in time. To keep the code simple, it doesn't implement boundary conditions (as seen in Panel 1).

Detecting non-unit stride memory accesses

The roofline model is a visual model that estimates attainable performance based on the operations and memory movements executed by an application. In this article, we will use it to evaluate the initial performance of the application and to identify the root cause of insufficient performance. An initial version of the roofline model, the Cache-Aware Roofline Model (CARM), has been available in Intel Advisor since the 2017 release. This version, based on instrumentation, provides an interesting way to characterise an algorithm but offers only limited information to identify why maximum performance is not achieved. In the 2019 release of Intel Parallel Studio, Intel Advisor offers a new version of the roofline model: the integrated roofline model. This new version, based on a cache simulator, is closer to the original roofline model described by the University of California, Berkeley. It counts memory transfers between the different memory levels and provides arithmetic intensities for the L1 cache (this level corresponds to the cache-aware roofline model, CARM), L2, L3, and DRAM.

Arithmetic intensity (AI) is the ratio of flops performed to bytes transferred, so an optimised application should see L1 AI < L2 AI < L3 AI < DRAM AI. This ordering reflects data locality: we expect to request a value from the L1 cache more often than from the L2 cache. Data locality is particularly important for stencil-based kernels. In the new integrated roofline model, memory transfers between the caches and DRAM are counted at cache-line granularity: every time a miss is observed at one of these levels, the cache simulator counts the transfer of a complete cache line, not of a single scalar element. For the L1, on the other hand, data transfers are counted per element used and moved from the L1 to the registers.

When running the integrated roofline model with Intel Advisor on this first unoptimised version of Iso3DFD, we can filter the result to look only at the points for a single loop. We see that the code does not have the desired intensity order (L1 < L2 < L3 < DRAM), since the L2 intensity is below the L1 intensity. In other words, when a value is not in the L1, a whole cache line is copied from the L2, but only a few of the bytes in that line are actually used. As a result, we transfer less data between the L1 and the registers than we do between the L2 and the L1. This is a typical symptom of non-unit stride accesses.

The Intel Advisor Memory Access Pattern Analysis also lets us see how memory is accessed. However, this analysis is slow to run, so we apply it only to a limited number of loops. It reports constant strides: the accesses within the main loop are not consecutive, but are always separated by the same distance of 16,384 elements. This is a pattern we want to avoid.

Intel Advisor first reports this problem on the first reference to the pointer, ptr_prev[offset].

To guarantee unit stride accesses between two consecutive iterations, offset needs to be incremented by 1. Because offset is computed as iz*dimnXnY + iy*nx + ix, the innermost loop must run over ix, whereas in the original code it runs over iz (see panel 2). In version dev01, we therefore swap the iz and ix loops (panel 3).

Running the roofline model on this version shows the difference in performance. We can see that changing the way the memory is accessed has also changed the arithmetic intensity for L2, L3 and DRAM.

The application's performance moved from 9.12 Gflop/s to 36.48 Gflop/s on a two-socket Intel Xeon Gold 6148 after this optimisation. The machine has 192GB of memory, and we used Intel Parallel Studio XE 2019 Update 1 on CentOS Linux release 7.5.1804.

Using contiguous memory access patterns is critical to achieving high performance, as we have demonstrated by achieving a 4x performance improvement on this stencil kernel from Iso3DFD.

Intel Advisor provides the tools to allow you to understand and optimise the memory access patterns of your code.
Panel 1

#define HALF_LENGTH 8
#pragma omp parallel for
for(int ix = HALF_LENGTH; ix < nx - HALF_LENGTH; ix++) {
    for(int iy = HALF_LENGTH; iy < ny - HALF_LENGTH; iy++) {
        for(int iz = HALF_LENGTH; iz < nz - HALF_LENGTH; iz++) {
            int offset = iz*dimnXnY + iy*nx + ix;
            float value = 0.0f;
            value += ptr_prev[offset]*coeff[0];
            for(int ir = 1; ir <= HALF_LENGTH; ir++) {
                value += coeff[ir] * (ptr_prev[offset + ir] + ptr_prev[offset - ir]);
                value += coeff[ir] * (ptr_prev[offset + ir*nx] +
                                      ptr_prev[offset - ir*nx]);
                value += coeff[ir] * (ptr_prev[offset + ir*dimnXnY] +
                                      ptr_prev[offset - ir*dimnXnY]);
            }
            // second-order-in-time update
            ptr_next[offset] = 2.0f*ptr_prev[offset] - ptr_next[offset] +
                               value*ptr_vel[offset];
        }
    }
}

Panel 2

for(int ix = HALF_LENGTH; ix < nx - HALF_LENGTH; ix++) {
    for(int iy = HALF_LENGTH; iy < ny - HALF_LENGTH; iy++) {
        for(int iz = HALF_LENGTH; iz < nz - HALF_LENGTH; iz++) {
            int offset = iz*dimnXnY + iy*nx + ix;
            float value = 0.0f;
            value += ptr_prev[offset]*coeff[0];
            // ...

Panel 3

for(int iz = HALF_LENGTH; iz < nz - HALF_LENGTH; iz++) {
    for(int iy = HALF_LENGTH; iy < ny - HALF_LENGTH; iy++) {
        for(int ix = HALF_LENGTH; ix < nx - HALF_LENGTH; ix++) {
            int offset = iz*dimnXnY + iy*nx + ix;
            float value = 0.0f;
            value += ptr_prev[offset]*coeff[0];
            // ...
COPYRIGHT 2019 Europa Science, Ltd.

Author: Cedric Andreolli
Publication: Scientific Computing World
Date: Feb 1, 2019