A case study comparing AoS (Arrays of Structures) and SoA (Structures of Arrays) data layouts for a compute-intensive loop run on Intel® Xeon® processors and Intel® Xeon Phi™ product family coprocessors

by Paul Besl

Abstract

A customer recently purchased a significant number of Intel Xeon Phi coprocessors to augment the capabilities of their cluster of 16-core dual-socket compute nodes based on the Intel Xeon E5-2600 series processors (with 8 cores per socket). The 61-core Intel Xeon Phi coprocessors can be programmed to execute natively as a separate Linux* host or via an offload method controlled by the Intel Xeon processor host. In this case study, we took one mini-application code, tested the host version, tried out native execution on the Intel Xeon Phi coprocessor, and then optimized the native version (and, as often happens, also the host version) by switching two AoS (“arrays of structures”) arrays containing three-dimensional point data to the SoA (“structures of arrays”) format. We also added an “offload pragma” to offload the key O(N²) loop, keeping almost all of the performance of the native version in the offload version. In addition, we applied a cache-blocking transformation to the two main loops by splitting them into four loops and then interchanging the two middle (non-inner, non-outer) loops. Finally, we compare the performance of all tested Intel Xeon processor and Intel Xeon Phi coprocessor versions, showing that a 9-to-1 performance ratio exists between the slowest AoS double-precision executable and the fastest SoA single-precision executable.
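To make the two transformations named above concrete, here is a minimal sketch in C. It is not the mini-application's code: the kernel (a sum of squared distances between every point of one set and every point of another), the type names `PointAoS` and `PointsSoA`, the function names, and the `block` parameter are all assumptions chosen for illustration. The first function walks AoS data with a plain doubly nested loop. The second uses SoA data and the blocked loop order: each of the two loops is split into a block loop and an element loop, giving four loops, and the two middle loops are then interchanged so that both block loops are outermost.

```c
#include <stddef.h>

/* AoS: one struct per point, so x, y and z of a point sit side by side. */
typedef struct { float x, y, z; } PointAoS;

/* SoA: one array per coordinate, so all x values are contiguous. */
typedef struct { float *x, *y, *z; } PointsSoA;

/* Hypothetical O(N^2) kernel on AoS data: sum of squared distances
   between every point in a and every point in b. */
double sum_sq_dist_aos(const PointAoS *a, size_t na,
                       const PointAoS *b, size_t nb)
{
    double total = 0.0;
    for (size_t i = 0; i < na; ++i) {
        for (size_t j = 0; j < nb; ++j) {
            float dx = a[i].x - b[j].x;   /* stride of 3 floats in j */
            float dy = a[i].y - b[j].y;
            float dz = a[i].z - b[j].z;
            total += dx * dx + dy * dy + dz * dz;
        }
    }
    return total;
}

/* Same kernel on SoA data with cache blocking. Loop order after the
   interchange is: ii (block of a), jj (block of b), i, j. */
double sum_sq_dist_soa_blocked(const PointsSoA *a, size_t na,
                               const PointsSoA *b, size_t nb,
                               size_t block)
{
    double total = 0.0;
    if (block == 0)
        block = 1;                        /* guard against a zero step */

    for (size_t ii = 0; ii < na; ii += block) {
        size_t i_end = (na - ii < block) ? na : ii + block;
        for (size_t jj = 0; jj < nb; jj += block) {
            size_t j_end = (nb - jj < block) ? nb : jj + block;
            for (size_t i = ii; i < i_end; ++i) {
                float ax = a->x[i], ay = a->y[i], az = a->z[i];
                float partial = 0.0f;
                /* Unit-stride reads of b->x, b->y, b->z: a form that
                   compilers can usually vectorize more easily. */
                for (size_t j = jj; j < j_end; ++j) {
                    float dx = ax - b->x[j];
                    float dy = ay - b->y[j];
                    float dz = az - b->z[j];
                    partial += dx * dx + dy * dy + dz * dz;
                }
                total += partial;
            }
        }
    }
    return total;
}
```

Both functions compute the same quantity, so one can be used to check the other on small inputs. The block size would need to be tuned per platform so that one block of each point set stays resident in cache; no particular value is implied by the case study.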

Introduction

A customer recently purchased a significant number of Intel Xeon Phi coprocessors to augment the capabilities of their cluster of 16-core dual-socket compute nodes based on Intel Xeon E5-2600 series processors (with 8 cores per socket). They will be supporting a wide variety of software packages on these systems, and they already run the same set of supported software packages on earlier x86_64 clusters. Intel has developed software technology in the form of Intel® compilers, Intel® Math Kernel Library (Intel® MKL), Intel® MPI libraries, and Intel performance analysis tools that allows this customer and their users either to port their software to run natively on the Intel Xeon Phi coprocessor’s Linux* OS, as if it were a separate Linux host, or to offload compute tasks from the host to the Intel Xeon Phi coprocessor located on the PCIe bus of the host. At this point in time, it has been demonstrated that significant compute server codes (i.e., not client-based, Windows*-oriented, user-interactive codes) can be ported to, or partially offload-enabled for, the Intel Xeon Phi coprocessor without a huge amount of software work. However, the remaining task for a software developer is to “tune and optimize” their software so that it runs efficiently and quickly on the coprocessor. The Intel Xeon processor is able to run x86_64, 128-bit SSE (Streaming SIMD Extensions), and 128-bit or 256-bit AVX-1 (Intel® Advanced Vector Extensions, Intel® AVX) codes using “big cores” with out-of-order (OOO) capabilities, but the Intel Xeon Phi coprocessor can only run in-order x86_64 instructions (including x87 instructions) or the new 512-bit Intel® Initial Many Core Instructions (Intel® IMCI) floating-point vector instructions. Ideally, we’d like to develop a decision tree or step-by-step recipe that software developers can follow in their porting and optimization activities, so that excellent performance can be attained with the minimum amount of developer work.
