This article demonstrates techniques that software developers can use to identify and fix NUMA-related performance issues in their applications using the latest Intel® software development tools.
Quick links
- Introduction
- Steps to Compile and Execute ISO3DFD Application
- Identifying NUMA-related Performance Issues
- Code Modifications to Reduce Remote Memory Accesses
- Memory Access Analysis for the Modified Version
- Overall Performance Comparison
- System Configuration
- References
1. Introduction
Non-uniform memory access (NUMA) is a computer memory design used in multiprocessing where the memory access time depends on the memory location relative to the processor. Under NUMA, a processor can access its own local memory faster than the remote memory (memory local to another processor or memory shared between processors).
This article gives an overview of how the memory access analysis feature of Intel® VTune™ Amplifier XE can be used to identify NUMA-related issues in an application. It extends the article published on Intel® Developer Zone (IDZ) on the development and performance comparison of an isotropic 3-dimensional finite difference (ISO3DFD) application running on Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors. We also provide recommendations for source code modifications that achieve consistently high performance in a NUMA environment.
To focus only on the NUMA issues, we discuss only the version optimized for the Intel Xeon processor; the source code is available from the IDZ article listed in the References section. Throughout this article, the dev06 version of the ISO3DFD source code is used to compare the relevant metrics and to show the benefits of introducing NUMA awareness into the application.
2. Steps to Compile and Execute ISO3DFD Application
The application can be compiled using the makefile[1]:
make build version=dev06 simd=avx2
The run_on_xeon.pl script provided with the source can be used to run the application:
./run_on_xeon.pl executable_name n1 n2 n3 nb_iter n1_block \
    n2_block n3_block kmp_affinity nb_threads

where
- executable_name: the executable name
- n1: N1 // X dimension
- n2: N2 // Y dimension
- n3: N3 // Z dimension
- nb_iter: the number of iterations
- n1_block: size of the cache block in the x dimension
- n2_block: size of the cache block in the y dimension
- n3_block: size of the cache block in the z dimension
- kmp_affinity: the thread partitioning
- nb_threads: the number of OpenMP threads
3. Identifying NUMA-related Performance Issues
Current-generation NUMA architectures are complex. Before investigating the memory accesses, it is useful to verify whether NUMA has an effect on the application's performance. The numactl[2] utility can be used to achieve this. Once verified, it is important to track down the memory accesses with higher latencies. Optimizing these memory accesses might result in greater performance gains.
3.1 numactl
One of the quickest ways to find out whether an application is affected by NUMA is to run it entirely on one socket/NUMA node and then compare that to its performance across multiple NUMA nodes. In an ideal scenario with no NUMA effects, the application should scale well across sockets, and we should see twice the single-socket performance (in the case of a two-socket system) unless there are other scaling limitations. As shown below, for ISO3DFD the single-socket performance is better than the full-node performance, so it is possible that the application is affected by NUMA. Although this method answers our first question of whether NUMA affects application performance, it gives no insight into where in the application the problem lies. We can use the memory access analysis feature of Intel VTune Amplifier XE to investigate the NUMA issues in detail.
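Before making this comparison, it can help to confirm the node layout of the machine and the NUMA policy a process would run under. The following standard numactl commands (shown here for convenience; they are not part of the ISO3DFD scripts) print that information:

numactl --hardware    # list the NUMA nodes, their CPUs, and per-node memory sizes
numactl --show        # show the NUMA policy and node bindings of the current shell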
- Dual-socket performance without numactl (22 threads per socket):

./run_on_xeon.pl bin/iso3dfd_dev06_cpu_avx2.exe 448 2016 1056 10 448 24 96 compact 44

n1=448 n2=2016 n3=1056 nreps=10 num_threads=44 HALF_LENGTH=8
n1_thrd_block=448 n2_thrd_block=24 n3_thrd_block=96
allocating prev, next and vel: total 10914.8 Mbytes
-------------------------------
time:       3.25 sec
throughput: 2765.70 MPoints/s
flops:      168.71 GFlops
- Single-socket performance using numactl:

numactl -m 0 -c 0 ./run_on_xeon.pl bin/iso3dfd_dev06_cpu_avx2.exe 448 2016 1056 10 448 24 96 compact 22

n1=448 n2=2016 n3=1056 nreps=10 num_threads=22 HALF_LENGTH=8
n1_thrd_block=448 n2_thrd_block=24 n3_thrd_block=96
allocating prev, next and vel: total 10914.8 Mbytes
-------------------------------
time:       3.05 sec
throughput: 2948.22 MPoints/s
flops:      179.84 GFlops
- Running two processes on the system, each pinned to a different socket (22 threads per socket):

numactl -c 0 -m 0 ./run_on_xeon.pl bin/iso3dfd_dev06_cpu_avx2.exe \
    448 2016 1056 10 448 24 96 compact 22 & \
numactl -c 1 -m 1 ./run_on_xeon.pl bin/iso3dfd_dev06_cpu_avx2.exe \
    448 2016 1056 10 448 24 96 compact 22 &

n1=448 n2=2016 n3=1056 nreps=10 num_threads=22 HALF_LENGTH=8
n1_thrd_block=448 n2_thrd_block=24 n3_thrd_block=96
allocating prev, next and vel: total 10914.8 Mbytes
-------------------------------
time:       2.98 sec
throughput: 2996.78 MPoints/s
flops:      180.08 GFlops

n1=448 n2=2016 n3=1056 nreps=10 num_threads=22 HALF_LENGTH=8
n1_thrd_block=448 n2_thrd_block=24 n3_thrd_block=96
allocating prev, next and vel: total 10914.8 Mbytes
-------------------------------
time:       3.02 sec
throughput: 2951.22 MPoints/s
flops:      179.91 GFlops
3.2 Using Intel® VTune™ Amplifier XE – Memory Access Analysis
On a processor that supports NUMA, it is important to investigate not only the cache misses on the CPU you are running on, but also the references made to remote DRAM and to the cache of the other CPU. To gain insight into these details, we run the memory access analysis on the application as follows:
amplxe-cl -c memory-access -knob analyze-mem-objects=true \
    -knob mem-object-size-min-thres=1024 -data-limit=0 \
    -r ISO_dev06_MA_10 ./run_on_xeon.pl bin/iso3dfd_dev06_cpu_avx2.exe \
    448 2016 1056 10 448 24 96 compact 44
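Once the collection finishes, the result can be opened in the graphical interface, or a text summary can be generated on the command line. The commands below assume that the standard Intel VTune Amplifier XE tool names are on the PATH:

amplxe-gui ISO_dev06_MA_10                    # open the result in the GUI
amplxe-cl -report summary -r ISO_dev06_MA_10  # print a summary report to the terminal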
The metrics relevant to NUMA are the following:
3.2.1 Memory Bound – Is this application memory bound? If yes, does the bandwidth utilization histogram show high DRAM bandwidth utilization? It is important that the bandwidth utilization is balanced between the sockets since the actual compute-intensive work is split equally among the sockets.
In the Summary window, we can determine whether or not the application is memory bound.
Figure 1: Memory bound metric and DRAM bandwidth histogram
Notice that the Memory Bound metric is high and highlighted, but the DRAM bandwidth utilization is only low to medium, which is not what we expected and needs further investigation.
3.2.2 Intel® QuickPath Interconnect (Intel® QPI) bandwidth – The performance of the application can sometimes be limited by the bandwidth of the Intel QPI links between the sockets. Intel VTune Amplifier provides mechanisms to identify the source code and memory objects that lead to this type of bandwidth problem.
In the Summary window, use the Bandwidth Utilization Histogram and select QPI from the Bandwidth Domain drop-down menu.
Figure 2: Intel® QuickPath Interconnect bandwidth utilization histogram.
You can also switch to the bottom-up view and select areas with high QPI bandwidth utilization in the timeline view and filter by this selection.
Figure 3: Bandwidth utilization timeline view.
After the filter is applied, from the timeline graph, we see that the DRAM bandwidth is utilized on only one of the sockets and the QPI bandwidth is high, up to 38 GB/s.
In the same bottom-up view, the grid below the timeline pane shows what was executing during that time range. To see the name of the function contributing to the high Intel QPI traffic, we set the grouping drop-down menu to Bandwidth Domain / Bandwidth Utilization Type / Function / Call Stack and then expand the QPI bandwidth domain with High utilization.
Figure 4: High Intel® QuickPath Interconnect bandwidth - bottom-up grid view.
This is a typical issue commonly observed on NUMA machines: memory for the OpenMP* threads is allocated on a single socket, while the threads themselves are spawned across all the sockets. This forces some of the threads to load data from remote DRAM or from remote caches over the Intel QPI links, which is much slower than accessing local memory.
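To illustrate the pattern, consider the minimal sketch below (hypothetical code, not taken from the ISO3DFD sources). The array is initialized by a single thread, so every page is first touched from one socket; the parallel compute loop that follows then pulls roughly half of the data across the Intel QPI links. The next section shows how ISO3DFD avoids this.

#include <omp.h>

int main() {
    const long n = 1L << 28;      // ~1 GB of floats; sizes are illustrative only
    float* data = new float[n];   // virtual allocation; no physical pages mapped yet

    // Serial initialization: every page is first touched by one thread,
    // so all physical pages land on that thread's NUMA node.
    for (long i = 0; i < n; ++i)
        data[i] = 0.0f;

    // Parallel compute: threads running on the other socket now read and
    // write remote memory over Intel QPI on every access.
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i)
        data[i] = data[i] * 2.0f + 1.0f;

    delete[] data;
    return 0;
}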
4. Code Modifications to Reduce Remote Memory Accesses
To reduce the effects of NUMA, the threads running on each socket should access local memory, thereby reducing Intel QPI traffic. This can be achieved using the first-touch policy. On Linux*, memory pages are allocated on first access; that is, data is not physically mapped in memory until it is first written. This allows the thread touching the data to place it close to the CPU it is running on. To take advantage of this, the memory should be initialized using the same OpenMP loop order as the one used for the computation. Accordingly, the initialize function in src/dev06/iso-3dfd_main.cc (included in the ISO3DFD source code) is replaced by initialize_FT, which enables first touch. As a result, each thread initializes, in local memory, the same blocks of data that it will later work on in the iso_3dfd_it function, where the compute-intensive seismic wave propagation is performed. We also use static OpenMP scheduling for both the initialization and the computation to achieve higher performance gains.
void initialize_FT(float* ptr_prev, float* ptr_next, float* ptr_vel,
                   Parameters* p, size_t nbytes,
                   int n1_Tblock, int n2_Tblock, int n3_Tblock, int nThreads){
  #pragma omp parallel num_threads(nThreads) default(shared)
  {
    float *ptr_line_next, *ptr_line_prev, *ptr_line_vel;
    int n3End = p->n3;
    int n2End = p->n2;
    int n1End = p->n1;
    int ixEnd, iyEnd, izEnd;
    int dimn1n2 = p->n1 * p->n2;
    int n1 = p->n1;

    #pragma omp for schedule(static) collapse(3)
    for(int bz=0; bz<n3End; bz+=n3_Tblock){
      for(int by=0; by<n2End; by+=n2_Tblock){
        for(int bx=0; bx<n1End; bx+=n1_Tblock){
          izEnd = MIN(bz+n3_Tblock, n3End);
          iyEnd = MIN(by+n2_Tblock, n2End);
          ixEnd = MIN(n1_Tblock, n1End-bx);

          for(int iz=bz; iz<izEnd; iz++) {
            for(int iy=by; iy<iyEnd; iy++) {
              ptr_line_next = &ptr_next[iz*dimn1n2 + iy*n1 + bx];
              ptr_line_prev = &ptr_prev[iz*dimn1n2 + iy*n1 + bx];
              ptr_line_vel  = &ptr_vel[iz*dimn1n2 + iy*n1 + bx];
              #pragma ivdep
              for(int ix=0; ix<ixEnd; ix++) {
                ptr_line_prev[ix] = 0.0f;
                ptr_line_next[ix] = 0.0f;
                ptr_line_vel[ix]  = 2250000.0f*DT*DT; // integration of the v² and dt² here
              }
            }
          }
        }
      }
    }
  }

  float val = 1.f;
  for(int s=5; s>=0; s--){
    for(int i=p->n3/2-s; i<p->n3/2+s; i++){
      for(int j=p->n2/4-s; j<p->n2/4+s; j++){
        for(int k=p->n1/4-s; k<p->n1/4+s; k++){
          ptr_prev[i*p->n1*p->n2 + j*p->n1 + k] = val;
        }
      }
    }
    val *= 10;
  }
}
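One quick way to check that the first-touch initialization is taking effect is to inspect the per-node resident memory of the running process, for example with the numastat utility from the numactl package (the pidof pattern below simply matches the binary built earlier):

numastat -p $(pidof iso3dfd_dev06_cpu_avx2.exe)

With initialize_FT in place, the memory allocated for prev, next, and vel (roughly 11 GB in the runs above) should appear split approximately evenly between node 0 and node 1, rather than residing almost entirely on one node.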
5. Memory Access Analysis for the Modified Version
The metrics we are interested in here are DRAM bandwidth utilization and the QPI bandwidth.
Figure 5: Memory bound metric and DRAM bandwidth utilization - modified version.
From the Summary window, we can see that the application is still memory bound and that the DRAM bandwidth utilization is now high.
Using the Bandwidth Utilization Histogram and selecting QPI from the Bandwidth Domain drop-down menu, we can see that the QPI bandwidth utilization has been reduced to low or medium.
Figure 6: QPI bandwidth histogram with first touch.
Switching to the bottom-up view and looking at the timeline, we can see that the DRAM bandwidth utilization is now balanced between the two sockets and that the QPI traffic is roughly 3x lower.
Figure 7: Reduced QPI traffic and balanced per-socket DRAM bandwidth.
6. Overall Performance Comparison
We run the modified version of the application as follows:
./run_on_xeon.pl bin/iso3dfd_dev06_cpu_avx2.exe 448 2016 1056 \
    10 448 24 96 compact 44

n1=448 n2=2016 n3=1056 nreps=10 num_threads=44 HALF_LENGTH=8
n1_thrd_block=448 n2_thrd_block=24 n3_thrd_block=96
allocating prev, next and vel: total 11694.4 Mbytes
-------------------------------
time:       1.70 sec
throughput: 5682.07 MPoints/s
flops:      346.61 GFlops
With better memory access characteristics, the application throughput has increased from 2765 MPoints/s to 5682 MPoints/s, an almost 2x speedup.
To confirm that the performance improvement is consistent, both versions of the code were run 10 times with 100 iterations per run. For clarity, the original dev06 version was run with both static and dynamic OpenMP scheduling in order to separate the performance gains resulting from the change in OpenMP scheduling from those resulting from the introduction of first touch. The highest gains were seen when running the modified NUMA-aware code with static OpenMP scheduling. With dynamic scheduling, the mapping between each OpenMP thread and its cache block(s) (the unit of work for an OpenMP thread in this application) is non-deterministic from iteration to iteration. As a result, the effect of first touch is much less pronounced with dynamic OpenMP scheduling, as illustrated by the sketch at the end of this section.
Figure 8: Performance variation.
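For reference, the following schematic sketch (hypothetical code; the block loop and array only stand in for the cache-block traversal in iso_3dfd_it) shows why the schedule clause matters. With schedule(static), each thread revisits in every time step the same blocks it first touched during initialization; replacing the clause with schedule(dynamic) hands blocks out in arrival order, so the thread computing a block is generally not the one that first touched it, and remote accesses reappear.

#include <cstddef>

int main() {
    // Illustrative sizes only.
    const int num_blocks = 512;
    const std::size_t block_elems = 1 << 16;
    float* grid = new float[num_blocks * block_elems];   // pages not mapped yet

    // First touch with a static schedule: each thread owns a fixed block range.
    #pragma omp parallel for schedule(static)
    for (int b = 0; b < num_blocks; ++b)
        for (std::size_t i = 0; i < block_elems; ++i)
            grid[b * block_elems + i] = 0.0f;

    // Time-step loop: with schedule(static) the thread-to-block mapping matches
    // the initialization above, so accesses stay local; schedule(dynamic) would
    // break that mapping.
    for (int step = 0; step < 10; ++step) {
        #pragma omp parallel for schedule(static)
        for (int b = 0; b < num_blocks; ++b)
            for (std::size_t i = 0; i < block_elems; ++i)
                grid[b * block_elems + i] += 1.0f;
    }

    delete[] grid;
    return 0;
}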
7. System Configuration
The performance results reported in this article were obtained on the following test system. For more information, go to http://www.intel.com/performance.
| Component | Specification |
| --- | --- |
| System | Two-socket server |
| Host Processor | Intel® Xeon® Processor E5-2699 v4 @ 2.20 GHz |
| Cores/Threads | 44/44 |
| Host Memory | 64 GB/socket |
| Compiler | Intel® C++ Compiler Version 16.0.2 |
| Profiler | Intel® VTune™ Amplifier XE 2016 Update 2 |
| Host OS | Linux; Version 3.10.0-327.el7.x86_64 |
8. References
Eight Optimizations for 3-Dimensional Finite Difference (3DFD) Code with an Isotropic (ISO) (https://software.intel.com/en-us/articles/eight-optimizations-for-3-dimensional-finite-difference-3dfd-code-with-an-isotropic-iso)
Intel® VTune™ Amplifier XE 2016 (https://software.intel.com/en-us/intel-vtune-amplifier-xe)
Intel® VTune™ Amplifier XE - Interpreting Memory Usage Data (https://software.intel.com/en-us/node/544170)
Non-uniform Memory Access (https://en.wikipedia.org/wiki/Non-uniform_memory_access)
numactl - Linux man page (http://linux.die.net/man/8/numactl)
About The Author
Sunny Gogar
Software Engineer
Sunny Gogar received a Master’s degree in Electrical and Computer Engineering from the University of Florida, Gainesville and a Bachelor’s degree in Electronics and Telecommunications from the University of Mumbai, India. He is currently a software engineer with Intel Corporation's Software and Services Group. His interests include parallel programming and optimization for Multi-core and Many-core Processor Architectures.
[1] Compile-time flags for newer Intel processors, such as -fma, were used in all the versions compared in this article.
[2] numactl - Control NUMA policy for processes or shared memory.
Notices
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. Check with your system manufacturer or retailer or learn more at intel.com.
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.
This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.
The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request.
Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by visiting www.intel.com/design/literature.htm.
Intel, the Intel logo, Xeon Phi, VTune, and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others.
© 2016 Intel Corporation.
This sample source code is released under the Intel Sample Source Code License Agreement.