Molecular Dynamics Optimization on Intel® Many Integrated Core Architecture (Intel® MIC)

Introduction

Molecular Dynamics (MD) is a computer simulation of the physical movements of atoms and molecules^[1], as shown in Figure 1. The atoms and molecules interact with each other in a specified field and obey Newton's second law: forces and potential energy are defined by molecular mechanics force fields, and velocities and positions are determined by numerically solving Newton's equations of motion.

Atoms and molecules can be stored in cell linked-lists^[2], which make it easy to find the neighbors each one interacts with. Each simulation step runs three phases in order: move, update cells, and collide. In the move phase, velocities and positions are updated. In the update-cells phase, each atom or molecule is assigned to the cell that contains its new position. In the collide phase, forces and potential energy are calculated against the neighbors.
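
The move phase can be sketched as follows. This is a minimal illustration and not the article's code: the `Particles` struct, the `move` function, and the choice of a semi-implicit Euler integrator are all assumptions, since the integration scheme is not named here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 2-D particle state; the field names are illustrative only.
struct Particles {
    std::vector<double> x, y;    // positions
    std::vector<double> vx, vy;  // velocities
    std::vector<double> fx, fy;  // forces accumulated in the collide phase
};

// "Move" phase: advance velocities, then positions, by one time step dt.
// Semi-implicit Euler is an assumption made for this sketch.
void move(Particles& p, double mass, double dt) {
    assert(mass > 0.0);
    const std::size_t n = p.x.size();
    for (std::size_t i = 0; i < n; ++i) {
        p.vx[i] += p.fx[i] / mass * dt;  // a = F / m (Newton's second law)
        p.vy[i] += p.fy[i] / mass * dt;
        p.x[i] += p.vx[i] * dt;          // position uses the updated velocity
        p.y[i] += p.vy[i] * dt;
    }
}
```

A full step would call `move`, then rebuild the cells, then recompute the forces in the collide phase before the next call.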

In chemistry, MD is used in protein structure determination and refinement using experimental tools such as X-ray crystallography and Nuclear Magnetic Resonance (NMR). It has also been applied with limited success as a method of refining protein structure predictions. In physics, MD is used to examine the dynamics of atomic-level phenomena that cannot be observed directly, such as thin film growth and ion subplantation^[1].

Figure 1. Molecular Dynamics

Intel has introduced the Intel® Many Integrated Core (Intel® MIC) architecture, which provides up to 61 cores with 4 hardware threads per core. Compared with a multi-core CPU, Intel MIC offers more parallel computing resources, with more cores and wider vector registers. This report describes how an MD program was ported to Intel MIC and the key optimization methods that were used.

MD Computation Flow

In an MD calculation, the force, velocity, and position of each molecule are recorded as its properties and updated in each time step. To reduce the number of molecules that interact with the molecule being updated, the domain is split into a grid of cells (shown in Figure 2), and only molecules in the same cell and in neighboring cells are taken into account. For example, when the molecules in cell 0 are updated, they interact only with molecules in cell 0 itself and in its neighbor cells 1 to 8. Because the interaction between two molecules is symmetric, each pair needs to be calculated only once, so only half of the neighbor cells have to be visited. Therefore, only cells 1 to 4 are used as neighbor cells when the molecules in cell 0 are updated.
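
The half-neighbor idea can be sketched as below. The function name `halfNeighbors`, the row-by-row cell indexing, the non-periodic boundary, and the particular choice of which four of the eight neighbors count as "forward" are assumptions for illustration; the numbering in Figure 2 may differ.

```cpp
#include <cassert>
#include <vector>

// Returns the linear indices of the four "forward" neighbors of cell (cx, cy)
// in an nx-by-ny grid stored row by row. Visiting only these four (plus the
// cell itself) touches every pair of adjacent cells exactly once.
// Assumption: non-periodic boundaries, so off-grid neighbors are skipped.
std::vector<int> halfNeighbors(int cx, int cy, int nx, int ny) {
    assert(cx >= 0 && cx < nx && cy >= 0 && cy < ny);
    static const int offsets[4][2] = {{1, 0}, {-1, 1}, {0, 1}, {1, 1}};
    std::vector<int> result;
    for (const auto& o : offsets) {
        const int x = cx + o[0];
        const int y = cy + o[1];
        if (x >= 0 && x < nx && y >= 0 && y < ny) {
            result.push_back(y * nx + x);
        }
    }
    return result;
}
```

For each pair found this way, the force computed on one molecule can be applied with the opposite sign to the other, which is what makes the one-directional traversal sufficient.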

In each time step, molecules are reassigned to cells because their positions have changed. Because the number of molecules that one cell can hold is limited, a few molecules may be dropped.
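
A minimal sketch of this reassignment with a per-cell capacity is shown below. The `CellGrid` type, the `updateCells` function, and the decision to also count molecules that leave the domain as dropped are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical fixed-capacity grid; names and layout are illustrative only.
struct CellGrid {
    int nx, ny;        // number of cells along x and y
    double cellSize;   // edge length of a square cell
    std::size_t capacity;                 // maximum molecules per cell
    std::vector<std::vector<int>> cells;  // molecule indices held by each cell

    CellGrid(int nx_, int ny_, double cellSize_, std::size_t capacity_)
        : nx(nx_), ny(ny_), cellSize(cellSize_), capacity(capacity_),
          cells(static_cast<std::size_t>(nx_) * static_cast<std::size_t>(ny_)) {}
};

// Rebuilds cell membership from the current positions and returns how many
// molecules could not be stored (cell full, or position outside the grid).
int updateCells(CellGrid& g, const std::vector<double>& x,
                const std::vector<double>& y) {
    assert(x.size() == y.size());
    for (auto& c : g.cells) c.clear();
    int dropped = 0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double fx = x[i] / g.cellSize;
        const double fy = y[i] / g.cellSize;
        if (!(fx >= 0.0 && fx < g.nx && fy >= 0.0 && fy < g.ny)) {
            ++dropped;  // outside the domain (assumption: treated as dropped)
            continue;
        }
        const std::size_t idx = static_cast<std::size_t>(fy) *
                                static_cast<std::size_t>(g.nx) +
                                static_cast<std::size_t>(fx);
        if (g.cells[idx].size() < g.capacity) {
            g.cells[idx].push_back(static_cast<int>(i));
        } else {
            ++dropped;  // cell already holds `capacity` molecules
        }
    }
    return dropped;
}
```

Returning the drop count makes it possible to check whether the chosen capacity is large enough for a given density.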

After a given number of iterations, the calculated properties of all molecules are the result of the molecular dynamics simulation.

Figure 2. Molecules split into grid with cells

Key Optimization Method

In this section, we describe in detail the key methods used to optimize the MD program. The original program is a serial implementation. To utilize the compute resources of a multi-core CPU and the many-core Intel MIC, we first parallelize the program using Intel® Threading Building Blocks (Intel® TBB).

Modify Data Structure to be Accessed Contiguously

Molecules are represented in a struct. In 2-dimensional MD, each property has two components, one in the x direction and one in the y direction. In the original implementation, those components are stored alternately (interleaved) in one array, so when we process one component, the memory accesses are not contiguous, which is not suitable for vectorization. Our modification changes this Array of Structures (AoS) layout to a Structure of Arrays (SoA) layout.

In the SoA layout, each property is stored in its own separate array. The memory access pattern is then contiguous, and the operations can be vectorized.
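
The layout change can be illustrated with a small sketch. The names `Vec2SoA`, `toSoA`, and `scaleX` are hypothetical, and splitting a property into one array per component is an assumption about how the separation was done.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// SoA layout for one 2-D property: each component has its own contiguous array.
struct Vec2SoA {
    std::vector<float> x;
    std::vector<float> y;
};

// Converts an interleaved (AoS-style) array x0, y0, x1, y1, ... into SoA form.
Vec2SoA toSoA(const std::vector<float>& interleaved) {
    assert(interleaved.size() % 2 == 0);
    const std::size_t n = interleaved.size() / 2;
    Vec2SoA out;
    out.x.resize(n);
    out.y.resize(n);
    for (std::size_t i = 0; i < n; ++i) {
        out.x[i] = interleaved[2 * i];      // stride-2 read in the old layout
        out.y[i] = interleaved[2 * i + 1];
    }
    return out;
}

// Processing one component is now a unit-stride loop, which a compiler can
// vectorize without gather or shuffle operations.
void scaleX(Vec2SoA& v, float s) {
    for (float& xi : v.x) xi *= s;
}
```

In the interleaved layout the same loop would touch every second element, so half of each loaded cache line and vector register would hold data the loop does not need.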

Vectorization with Intel® Cilk™ Plus Array Notation and Intrinsics

Intel® Cilk™ Plus Array Notation is an Intel-specific language extension supported by Intel® compilers. It expresses data-parallel operations on array sections explicitly, so the compiler can vectorize them with less reliance on automatic alias and dependence analysis.

Intel® Streaming SIMD Extensions 4.2 (Intel® SSE 4.2)^[3] is a SIMD instruction set extension to the x86 architecture that allows more than one data element to be processed by a single instruction. Intel has since introduced a 256-bit instruction set extension beyond Intel SSE, Intel® Advanced Vector Extensions (Intel® AVX)^[4]. So that developers can use these instruction sets to accelerate applications without writing assembly, intrinsics are provided. Intrinsics are functions that map to SIMD instructions, and one intrinsic may consist of several instructions.

In summary, array notation and intrinsics both let developers vectorize explicitly with the Intel C++ Compiler, and we use both in the MD optimization. Because each property is already stored in its own array, as described in the previous section, each property can be loaded into a separate vector register.

Another benefit of vectorization is a reduced number of memory access instructions, since more than one operand is loaded into a register at a time.
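
As an illustration of explicit vectorization with intrinsics, the sketch below updates one position array from one velocity array four floats at a time. The function name `advance`, the macro `MD_HAVE_SSE`, and the use of single precision are assumptions; the 128-bit SSE intrinsics are used only because they are widely available, and the scalar loop handles the tail and non-x86 builds.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>  // SSE intrinsics: _mm_loadu_ps, _mm_mul_ps, ...
#define MD_HAVE_SSE 1   // hypothetical macro used only in this sketch
#endif

// Computes pos[i] += vel[i] * dt for i in [0, n).
// With array notation this would be written roughly as
//   pos[0:n] += vel[0:n] * dt;
// which is shown only as a comment because it needs a compiler that
// supports the Intel Cilk Plus extension.
void advance(float* pos, const float* vel, float dt, std::size_t n) {
    assert(pos != nullptr && vel != nullptr);
    std::size_t i = 0;
#ifdef MD_HAVE_SSE
    const __m128 vdt = _mm_set1_ps(dt);        // broadcast dt to 4 lanes
    for (; i + 4 <= n; i += 4) {
        __m128 p = _mm_loadu_ps(pos + i);      // one load brings in 4 operands
        const __m128 v = _mm_loadu_ps(vel + i);
        p = _mm_add_ps(p, _mm_mul_ps(v, vdt));
        _mm_storeu_ps(pos + i, p);
    }
#endif
    for (; i < n; ++i) {                       // scalar tail or fallback
        pos[i] += vel[i] * dt;
    }
}
```

The unaligned load and store forms are chosen so the sketch works for any pointer; with arrays aligned to the vector width, the aligned forms could be used instead.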

Experiment Result

In this section, the optimization methods are evaluated. The platform information is given in Table 1.

Table 1. Platform Information

Platform	WSM* 2 socket Intel® Xeon W5680	SNB** 2 socket Intel® Xeon E5-2670	KNC
# Cores/Threads (per socket for the CPUs)	6/12	8/16	61/244
Frequency (GHz)	3.33	2.6	1.1
Memory	24GB	32GB	8GB

*Intel® microarchitecture code name Westmere

** Intel® microarchitecture code name Sandy Bridge

The optimization results are given in Figure 3. As the results show, the optimizations achieve a speedup of up to 32.8X on Intel® microarchitecture code name Westmere, mainly derived from parallelization and vectorization.

Vectorization can be disabled by adding the “-no-vec” option at compile time, which gives a baseline for measuring its contribution. Compared with this non-vectorized build, the speedup of the program using Intel SSE 4.2 is 2.97X. If we use the newer Intel AVX on Intel® microarchitecture code name Sandy Bridge, the speedup is 2.22X in comparison with the Intel SSE version. The key difference is that the registers are 256 bits wide in Intel AVX rather than 128 bits wide in Intel SSE 4.2. The registers in Intel MIC are 512 bits wide, which brings even more benefit from vectorization on Intel MIC.

Figure 3. Speedup of the optimized MD program

*KNC is Knights Corner

When the hardware platform is upgraded from Intel® microarchitecture code name Westmere to Intel® microarchitecture code name Sandy Bridge, the speedup is 2.98X. Moving from Sandy Bridge to Intel MIC increases performance by a further 1.89X.

Summary

In this report, we optimized a molecular dynamics program on Intel platforms. The experimental results show a large performance improvement. With parallelization and vectorization, the speedup reaches up to 97.74X on Intel Xeon CPUs (consistent with the 32.8X measured on Westmere multiplied by the 2.98X gain from moving to Sandy Bridge), so the computing resources of the Intel Xeon CPU are well utilized. Using Intel MIC adds a further 1.89X speedup, which raises the total speedup to 184.74X.

References:

[1]. Molecular dynamics – Wikipedia. http://en.wikipedia.org/wiki/Molecular_dynamics

[2]. Computer Simulation of Liquids. M. P. Allen, D. J. Tildesley. Oxford: Clarendon Press.

[3]. SSE. http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions

[4]. AVX. http://software.intel.com/en-us/avx/

About Author

Xiangzheng Sun graduated from the Institute of Software, Chinese Academy of Sciences, where he majored in High Performance Computing (HPC) and IPDC. He now mainly focuses on application development, porting, and performance tuning on Intel® multi-core CPUs and Intel® Many Integrated Core (Intel® MIC).
