The Intel® Xeon® processor E7 v3 family now includes an instruction set called Intel® Advanced Vector Extensions 2 (Intel® AVX2), which can potentially improve application performance related to high performance computing, databases, and video processing. To validate this statement, I performed a simple experiment using the Intel® Optimized LINPACK benchmark. The results, shown in Table 1, demonstrate roughly a 2.9x-3.5x performance increase using Intel AVX2 vs. Intel® Streaming SIMD Extensions (Intel® SSE), and roughly a 1.7x-2.1x increase when comparing Intel AVX2 with Intel® Advanced Vector Extensions (Intel® AVX).
The results in Table 1 come from three workload sizes (30K, 75K, and 100K) run on Linux* with each of the Intel AVX2, Intel AVX, and Intel SSE4 code paths. The last two columns show the performance gain of Intel AVX2 over Intel SSE4 and over Intel AVX. Running an Intel AVX2-optimized LINPACK on an Intel AVX2-capable processor, Intel AVX2 performed ~2.89x-3.49x better than Intel SSE4 and ~1.73x-2.12x better than Intel AVX. These numbers are just an example of the potential performance boost for LINPACK; for other applications, the gain will vary depending on how the code is optimized and on the hardware environment.
Table 1 – Results and Performance Gain from Running the LINPACK Benchmark on Quad Intel® Xeon® Processor E7-8890 v3.
Problem Size (Linux* LINPACK v11.2.2) | Intel® AVX2 (Gflops) | Intel® AVX (Gflops) | Intel® SSE4 (Gflops) | Intel AVX2 Gain over Intel SSE4 | Intel AVX2 Gain over Intel AVX |
---|---|---|---|---|---|
30K | 1835.83 | 867.065 | 525.38 | 3.49 | 2.12 |
75K | 2092.87 | 1211.89 | 724.40 | 2.89 | 1.73 |
100K | 2130.31 | 1224.44 | 731.42 | 2.91 | 1.74 |
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Configurations: Intel® Xeon® processor E7-8890 v3 @ 2.50GHz, 45MB L3 cache, 18-core pre-production system. 2x Intel® SSD DC P3700 Series @ 800GB, 256GB memory (32x8GB DDR4-2133MHz), BIOS by Intel Corporation Version: BRHSXSD1.86B.0063.R00.1503261059 (63.R00), BMC 70.7.5334, ME 2.3.0, SDR Package D.00, Power supply: 2x1200W non-redundant, running Microsoft Windows* Server 2012 R2 / SLES 11 SP3 Linux*
For more information go to http://www.intel.com/performance
How to take advantage of Intel® AVX2 in existing vectorized code
Vectorized code that uses floating point operations can get a potential performance boost when running on newer platforms such as the Intel Xeon processor E7 v3 family by doing the following:
- Recompile the code, using the Intel® compiler with the proper Intel AVX2 switch to convert existing Intel SSE code. See the Intel® Compiler Options for Intel® SSE and Intel® AVX generation white paper for more details.
- Modify the code's function calls to leverage the Intel® Math Kernel Library (Intel® MKL), which is already optimized to use Intel AVX2 where supported; an Intel MKL sketch follows this list.
- Use Intel AVX2 intrinsics. High-level language (such as C or C++) developers can call Intel® intrinsic functions and recompile the code; a brief intrinsics sketch follows this list. See the Intel® Intrinsics Guide and the Intel® 64 and IA-32 Architectures Optimization Reference Manual for more details.
- Code in assembly instructions directly. Low-level (assembly) developers can replace their existing Intel SSE instructions with the equivalent Intel AVX2 instructions. See the Intel® 64 and IA-32 Architectures Optimization Reference Manual for more details.
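To make the intrinsics option concrete, below is a minimal sketch of a daxpy-style loop written with Intel AVX2/FMA intrinsics. The function name, loop structure, and build flags are illustrative examples rather than code from this article; with the Intel compiler you would build it with an AVX2 target such as -xCORE-AVX2 (or -mavx2 -mfma with GCC).

```c
/* Hypothetical example, not from the article: a daxpy-style loop written
 * with Intel AVX2/FMA intrinsics. Build for an AVX2 target, for example
 * "icc -xCORE-AVX2" or "gcc -mavx2 -mfma". */
#include <immintrin.h>
#include <stddef.h>

void daxpy_avx2(size_t n, double a, const double *x, double *y)
{
    __m256d va = _mm256_set1_pd(a);            /* broadcast a across the YMM register */
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m256d vx = _mm256_loadu_pd(x + i);   /* unaligned 256-bit load              */
        __m256d vy = _mm256_loadu_pd(y + i);
        vy = _mm256_fmadd_pd(va, vx, vy);      /* a*x + y in one fused multiply-add   */
        _mm256_storeu_pd(y + i, vy);
    }
    for (; i < n; ++i)                         /* scalar remainder                    */
        y[i] = a * x[i] + y[i];
}
```

Each iteration of the vector loop processes four double-precision values with a single fused multiply-add, which is the kind of operation LINPACK-style matrix kernels rely on.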
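For the Intel MKL option, the sketch below (again a hypothetical example, not code from this article) simply calls cblas_dgemm; Intel MKL then dispatches its Intel AVX2 kernels at run time on processors that support them. Link against Intel MKL, for example with the Intel compiler's -mkl option.

```c
/* Hypothetical example, not from the article: let Intel MKL pick the best
 * code path (Intel AVX2 on supporting CPUs) for a matrix multiply. */
#include <stdio.h>
#include <mkl.h>

int main(void)
{
    const MKL_INT n = 512;
    double *a = (double *) mkl_malloc((size_t)n * n * sizeof(double), 64);
    double *b = (double *) mkl_malloc((size_t)n * n * sizeof(double), 64);
    double *c = (double *) mkl_malloc((size_t)n * n * sizeof(double), 64);
    if (!a || !b || !c) return 1;

    for (MKL_INT i = 0; i < n * n; ++i) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

    /* C = 1.0 * A * B + 0.0 * C; MKL chooses its kernels at run time. */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, a, n, b, n, 0.0, c, n);

    printf("c[0] = %.1f\n", c[0]);   /* expected 1024.0 for these inputs */
    mkl_free(a); mkl_free(b); mkl_free(c);
    return 0;
}
```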
Equivalent instructions for Intel® AVX2, Intel® AVX, and Intel® SSE used in the tests
Table 2 lists the equivalent instructions for Intel AVX2, Intel AVX, and Intel SSE (SSE/SSE2/SSE3/SSE4) that may be useful when migrating code. It contains three sets of instructions: the first set has equivalent instructions across all three instruction sets (Intel AVX2, Intel AVX, and Intel SSE); the second set has equivalent instructions across two instruction sets (Intel AVX2 and Intel AVX); and the last set contains Intel AVX2 instructions.
Table 2– Intel® AVX2, Intel® AVX, and Intel® SSE Equivalent Instructions
Intel® AVX and Intel® AVX2 | Equivalent Intel® SSE | Definitions |
---|---|---|
VADDPD | ADDPD | Add packed double-precision floating-point values |
VDIVSD | DIVSD | Divide low double-precision floating point value in xmm2 by low double-precision floating-point value in xmm3/m64 |
VMOVSD | MOVSD | Move or merge scalar double-precision floating-point value |
VMOVUPD | MOVUPD | Move unaligned packed double-precision floating-point values |
VMULPD | MULPD | Multiply packed double-precision floating-point Values |
VPXOR | PXOR | Logical exclusive OR |
VUCOMISD | UCOMISD | Unordered compare scalar double-precision floating-point values and set EFLAGS |
VUNPCKHPD | UNPCKHPD | Unpack and interleave high-packed double-precision floating-point values |
VUNPCKLPD | UNPCKLPD | Unpack and interleave low-packed double-precision floating-point values |
VXORPD | XORPD | Bitwise logical XOR for double-precision floating-point values |
Intel® AVX and Intel® AVX2 | Definitions | |
VADDSD | Add scalar double-precision floating-point values | |
VBROADCASTSD | Broadcast a 64-bit double-precision floating-point element to all elements of a YMM register | |
VCMPPD | Compare packed double-precision floating-point values | |
VCOMISD | Perform ordered comparison of scalar double-precision floating-point values and set flags in EFLAGS register | |
VINSERTF128 | Replace only half of a 256-bit YMM register with the value of a 128-bit source operand. The other half is unchanged. | |
VMAXSD | Return the maximum scalar double-precision floating-point value | |
VMOVQ | Move Quadword | |
VMOVUPS | Move unaligned packed single-precision floating-point values | |
VMULSD | Multiply scalar double-precision floating-point values | |
VPERM2F128 | Permute 128-bit floating-point fields in ymm2 and ymm3/mem using controls from imm8 and store result in ymm1. | |
VPSHUFD | Permute 32-bit blocks of an int32 vector | |
VXORPS | Perform bitwise logical XOR operation on float32 vectors | |
VZEROUPPER | Set the upper half of all YMM registers to zero. Used when switching between 128-bit use and 256-bit use. | |
Intel® AVX2 | Definitions | |
VEXTRACTF128 | Extract 128 bits of float data from ymm2 and store results in xmm1/mem. | |
VEXTRACTI128 | Extract 128 bits of integer data from ymm2 and store results in xmm1/mem. | |
VFMADD213PD | Multiply packed double-precision floating-point values from xmm0 and xmm1, add to xmm2/mem, and put result in xmm0. | |
VFMADD213SD | Multiply scalar double-precision floating-point value from xmm0 and xmm1, add to xmm2/mem, and put result in xmm0. | |
VFMADD231PD | Multiply packed double-precision floating-point values from xmm1 and xmm2/mem, add to xmm0, and put result in xmm0. | |
VFMADD231SD | Multiply scalar double-precision floating-point value in xmm1 and xmm2/mem, add to xmm0, and put result in xmm0. | |
VFNMADD213PD | Multiply packed double-precision floating-point values from xmm0 and xmm1, negate the product, add xmm2/mem, and put the result in xmm0. | |
VFNMADD213SD | Multiply the scalar double-precision floating-point value from xmm0 by xmm1, negate the product, add the low double-precision value from xmm2/mem, and put the result in xmm0. | |
VFNMADD231PD | Multiply packed double-precision floating-point values from ymm1 and ymm2/mem, negate the multiplication result, and add to ymm0. Put the result in ymm0. | |
VMAXPD | Determine the maximum of packed double-precision floating-point values | |
VPADDQ | Add packed quadword (64-bit) integers | |
VPBLENDVB | Conditionally blend byte elements of the source vectors depending on bits in a mask vector | |
VPBROADCASTQ | Take qwords from the source operand and broadcast to all elements of the result vector | |
VPCMPEQD | Compare packed doublewords of two source vectors for equality | |
VPCMPGTQ | Compare packed quadwords of two source vectors for greater than |
Table 2 lists just the instructions used in these tests; you can obtain the full list from the Intel® 64 and IA-32 Architectures Optimization Reference Manual. When the compiler is set to target Intel AVX2, it uses instructions from all three instruction sets as needed.
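As an illustration of how these instructions appear in practice, the short loop below is a hypothetical example (not code from the benchmark) of the kind of pattern an Intel AVX2-targeting compiler can typically turn into the 256-bit packed moves, multiplies, and fused multiply-adds listed in Table 2, rather than their 128-bit Intel SSE counterparts.

```c
/* Hypothetical example: when compiled for Intel AVX2 (e.g., icc -xCORE-AVX2),
 * this loop is a typical candidate for 256-bit packed loads/stores (VMOVUPD)
 * and fused multiply-add instructions such as VFMADD231PD. */
void scale_accumulate(int n, double s, const double *restrict x,
                      double *restrict y)
{
    for (int i = 0; i < n; ++i)
        y[i] += s * x[i];   /* multiply + add: fusable into a single FMA */
}
```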
Procedure for running LINPACK
- Download and install the Intel® Optimized LINPACK benchmark
- Create input files for 30K, 75K, 100K from the “...\linpack” directory
- For optimal performance, make the following operating system and BIOS setting changes before running LINPACK:
- Turn off Intel® Hyper-Threading Technology (Intel® HT Technology) in the BIOS.
- For Linux, export the “MKL_CBWR=AVX2” setting on the command line and update the runme_xeon64 shell script file to use the input files you created.
- The results will be in Gflops, similar to Table 1.
- For Intel AVX runs, set the “MKL_CBWR=AVX” and repeat the above steps.
- For Intel SSE runs, set the “MKL_CBWR=SSE4_2” and repeat the above steps. A sketch after this list shows how to select the same code paths from inside a program instead of through the environment variable.
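As an alternative to exporting the MKL_CBWR environment variable, the code path can also be selected from inside a program through Intel MKL's conditional numerical reproducibility API. The snippet below is a minimal sketch assuming Intel MKL 11.0 or later; call mkl_cbwr_set before any other Intel MKL routine.

```c
/* Sketch: force a specific Intel MKL code path programmatically instead of
 * exporting MKL_CBWR. Must be called before any other MKL routine. */
#include <stdio.h>
#include <mkl.h>

int main(void)
{
    /* MKL_CBWR_AVX or MKL_CBWR_SSE4_2 would mirror the other two runs above. */
    if (mkl_cbwr_set(MKL_CBWR_AVX2) != MKL_CBWR_SUCCESS)
        fprintf(stderr, "Intel AVX2 branch not available on this system\n");

    /* Report which branch MKL will actually use. */
    printf("active MKL CBWR branch code: %d\n", mkl_cbwr_get(MKL_CBWR_BRANCH));
    return 0;
}
```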
Platform Configuration
Component | Details |
---|---|
CPU & Chipset | Model/Speed/Cache: Intel® Xeon® processor E7-8890 v3 (code named Haswell-EX), 2.5GHz, 45MB L3 cache, QGUA D0 step |
Platform | Brand/model: (code named Brickland) |
Memory | Memory size: 256GB (32x8GB) DDR4 1.2V ECC 2133MHz RDIMMs; Brand/model: Micron MTA18ASF1G72PDZ-2G1A1HG; DIMM info: 8GB 2Rx8 PC4-2133P |
Mass storage | Brand & model: Intel® S3700 Series SSD; Number/size/RPM/Cache: 2/800GB/NA |
Operating system | Microsoft Windows* Server 2012 R2 / SLES 11 SP3 Linux* |
Conclusion
From our LINPACK experiment, we see compelling performance benefits when moving to an Intel AVX2-enabled Intel Xeon processor. In this specific case, we saw a performance increase of ~2.89x-3.49x for Intel AVX2 vs. Intel SSE and ~1.73x-2.12x for Intel AVX2 vs. Intel AVX in our test environment. This makes a strong case for developers who have Intel SSE-enabled code and are weighing the benefit of moving to a newer Intel Xeon processor-based system with Intel AVX2. To learn how to migrate existing Intel SSE code to Intel AVX2, refer to the materials below.
References
- Intel® Compiler Options for Intel® SSE and Intel® AVX generation (SSE2, SSE3, SSSE3, ATOM_SSSE3, SSE4.1, SSE4.2, ATOM_SSE4.2, AVX, AVX2) and processor-specific optimizations
- Intel® System Studio: Intel® AVX2 Support in the Intel® C++ Compiler
- Intel® AVX2 optimization in Intel® MKL
- Intel® IPP support for Intel® AVX2
- Processing Arrays of Bits with Intel® Advanced Vector Extensions 2 (Intel® AVX2)
- High Performance Multi-core Networked and Storage Systems for Linux
- Optimized Pseudo Random Number Generators with Intel® AVX2