Improve Server Application Performance with Intel® Advanced Vector Extensions 2

The Intel® Xeon® processor E7 v3 family now includes an instruction set called Intel® Advanced Vector Extensions 2 (Intel® AVX2), which can potentially improve application performance in areas such as high performance computing, databases, and video processing. To validate this statement, I performed a simple experiment using the Intel® Optimized LINPACK benchmark. The results, shown in Table 1, indicate a greater than 2x performance increase when using Intel AVX2 versus Intel® Streaming SIMD Extensions (Intel® SSE), and roughly a 1.7x increase when comparing Intel AVX2 with Intel® Advanced Vector Extensions (Intel® AVX).

The results in Table 1 come from three problem sizes (30K, 75K, and 100K) run on Linux* with three instruction-set settings (Intel AVX2, Intel AVX, and Intel SSE4). The last two columns show the performance gain of Intel AVX2 over Intel SSE4 and over Intel AVX. With the combination of an Intel AVX2 optimized LINPACK and an Intel AVX2-capable processor, Intel AVX2 delivered roughly 2.89x-3.49x the performance of Intel SSE and roughly 1.73x-2.12x the performance of Intel AVX. These numbers are just one example of the potential performance boost, measured for LINPACK; for other applications, the gain will vary depending on how the code is optimized and on the hardware environment.

Table 1 – Results and Performance Gain from Running the LINPACK Benchmark on Quad Intel® Xeon® Processor E7-8890 v3.

Linux* LINPACK v11.2.2 | Intel® AVX2 (Gflops) | Intel® AVX (Gflops) | Intel® SSE4 (Gflops) | Performance Gain over Intel SSE4 | Performance Gain over Intel AVX
30K  | 1835.83 | 867.06  | 525.38 | 3.49 | 2.12
75K  | 2092.87 | 1211.89 | 724.40 | 2.89 | 1.73
100K | 2130.31 | 1224.44 | 731.42 | 2.91 | 1.74

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Configurations: Intel® Xeon® processor E7-8890 v3 @ 2.50GHz, 45MB L3 cache, 18-core pre-production system. 2x Intel® SSD DC P3700 Series @ 800GB, 256GB memory (32x8GB DDR4-2133MHz), BIOS by Intel Corporation Version: BRHSXSD1.86B.0063.R00.1503261059 (63.R00) BMC 70.7.5334 ME 2.3.0 SDR Package D.00, Power supply: 2x1200W non-redundant, running Microsoft Windows* Server 2012 R2 / SLES 11 SP3 Linux*

For more information go to http://www.intel.com/performance

How to take advantage of Intel® AVX2 in existing vectorized code

Vectorized code that uses floating point operations can get a potential performance boost when running on newer platforms such as the Intel Xeon processor E7 v3 family by doing the following:

  1. Recompile the code using the Intel® compiler with the proper Intel AVX2 switch to convert existing Intel SSE code. See the Intel® Compiler Options for Intel® SSE and Intel® AVX generation white paper for more details.
  2. Modify the code's function calls to leverage the Intel® Math Kernel Library (Intel® MKL), which is already optimized to use Intel AVX2 where supported.
  3. Use the Intel AVX2 intrinsic instructions. High-level language (such as C or C++) developers can use Intel® intrinsic instructions to make the calls and recompile code; a short sketch follows this list. See the Intel® Intrinsic Guide and Intel® 64 and IA-32 Architectures Optimization Reference Manual for more details.
  4. Code in assembly instructions directly. Low level language (such as assembly) developers can use equivalent Intel AVX2 instructions from their existing Intel SSE code. See the Intel® 64 and IA-32 Architectures Optimization Reference Manual for more details.
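
As an illustration of option 3, below is a minimal sketch (hypothetical code, not from the LINPACK benchmark or any Intel sample) of a daxpy-style loop written with Intel AVX2/FMA intrinsics from immintrin.h. It assumes a compiler and processor with AVX2 and FMA support; build it, for example, with "icc -O2 -xCORE-AVX2 axpy_avx2.c" or "gcc -O2 -mavx2 -mfma axpy_avx2.c".

/* axpy_avx2.c - hypothetical sketch: y[i] = a*x[i] + y[i] with AVX2/FMA intrinsics. */
#include <immintrin.h>
#include <stdio.h>

static void daxpy_avx2(double a, const double *x, double *y, int n)
{
    __m256d va = _mm256_set1_pd(a);              /* broadcast a to all four lanes (typically VBROADCASTSD) */
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m256d vx = _mm256_loadu_pd(&x[i]);     /* unaligned 256-bit load (VMOVUPD)                       */
        __m256d vy = _mm256_loadu_pd(&y[i]);
        vy = _mm256_fmadd_pd(va, vx, vy);        /* fused multiply-add (VFMADD-class instruction)          */
        _mm256_storeu_pd(&y[i], vy);
    }
    for (; i < n; i++)                           /* scalar remainder loop                                  */
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    double x[6] = {1, 2, 3, 4, 5, 6}, y[6] = {0};
    daxpy_avx2(2.0, x, y, 6);
    for (int i = 0; i < 6; i++)
        printf("%.1f ", y[i]);                   /* expected: 2.0 4.0 6.0 8.0 10.0 12.0                    */
    printf("\n");
    return 0;
}

An equivalent Intel SSE version would use 128-bit _mm_ intrinsics (two doubles per iteration) with separate multiply and add steps; the wider registers and fused multiply-add are where much of the gap shown in Table 1 comes from.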

Equivalent instructions for Intel® AVX2, Intel® AVX, and Intel® SSE used in the tests

Table 2 lists the equivalent instructions for Intel AVX2, Intel AVX, and Intel SSE (SSE/SSE2/SSE3/SSE4) that may be useful for migrating code. It contains three sets of instructions: the first set has equivalents across all three instruction sets (Intel AVX2, Intel AVX, and Intel SSE); the second set contains instructions shared by Intel AVX and Intel AVX2 with no direct Intel SSE equivalent; and the last set contains Intel AVX2 instructions.

Table 2– Intel® AVX2, Intel® AVX, and Intel® SSE Equivalent Instructions

Intel® AVX and Intel® AVX2 | Equivalent Intel® SSE | Definitions
VADDPD | ADDPD | Add packed double-precision floating-point values
VDIVSD | DIVSD | Divide low double-precision floating-point value in xmm2 by low double-precision floating-point value in xmm3/m64
VMOVSD | MOVSD | Move or merge scalar double-precision floating-point value
VMOVUPD | MOVUPD | Move unaligned packed double-precision floating-point values
VMULPD | MULPD | Multiply packed double-precision floating-point values
VPXOR | PXOR | Logical exclusive OR
VUCOMISD | UCOMISD | Unordered compare scalar double-precision floating-point values and set EFLAGS
VUNPCKHPD | UNPCKHPD | Unpack and interleave high packed double-precision floating-point values
VUNPCKLPD | UNPCKLPD | Unpack and interleave low packed double-precision floating-point values
VXORPD | XORPD | Bitwise logical XOR of packed double-precision floating-point values

Intel® AVX and Intel® AVX2 | Definitions
VADDSD | Add scalar double-precision floating-point values
VBROADCASTSD | Broadcast a 64-bit double-precision floating-point element from memory to all elements of the YMM register
VCMPPD | Compare packed double-precision floating-point values
VCOMISD | Perform ordered comparison of scalar double-precision floating-point values and set flags in the EFLAGS register
VINSERTF128 | Replace one half of a 256-bit YMM register with the value of a 128-bit source operand; the other half is unchanged
VMAXSD | Return the maximum scalar double-precision floating-point value
VMOVQ | Move quadword
VMOVUPS | Move unaligned packed single-precision floating-point values
VMULSD | Multiply scalar double-precision floating-point values
VPERM2F128 | Permute 128-bit floating-point fields in ymm2 and ymm3/mem using controls from imm8 and store the result in ymm1
VPSHUFD | Permute 32-bit blocks of an int32 vector
VXORPS | Perform bitwise logical XOR operation on float32 vectors
VZEROUPPER | Set the upper half of all YMM registers to zero; used when switching between 128-bit use and 256-bit use

Intel® AVX2 | Definitions
VEXTRACTF128 | Extract 128 bits of float data from ymm2 and store the result in xmm1/mem
VEXTRACTI128 | Extract 128 bits of integer data from ymm2 and store the result in xmm1/mem
VFMADD213PD | Multiply packed double-precision floating-point values from xmm0 and xmm1, add xmm2/mem, and put the result in xmm0
VFMADD213SD | Multiply scalar double-precision floating-point values from xmm0 and xmm1, add xmm2/mem, and put the result in xmm0
VFMADD231PD | Multiply packed double-precision floating-point values from xmm1 and xmm2/mem, add xmm0, and put the result in xmm0
VFMADD231SD | Multiply scalar double-precision floating-point values from xmm1 and xmm2/mem, add xmm0, and put the result in xmm0
VFNMADD213PD | Multiply packed double-precision floating-point values from xmm0 and xmm1, negate the product, add xmm2/mem, and put the result in xmm0
VFNMADD213SD | Multiply scalar double-precision floating-point values from xmm0 and xmm1, negate the product, add xmm2/mem, and put the result in xmm0
VFNMADD231PD | Multiply packed double-precision floating-point values from ymm1 and ymm2/mem, negate the product, add ymm0, and put the result in ymm0
VMAXPD | Return the maximum of packed double-precision floating-point values
VPADDQ | Add packed quadword integers
VPBLENDVB | Conditionally blend byte elements of the source vectors depending on bits in a mask vector
VPBROADCASTQ | Broadcast a quadword from the source operand to all elements of the result vector
VPCMPEQD | Compare packed doublewords of two source vectors for equality
VPCMPGTQ | Compare packed quadwords of two source vectors for greater than

Table 2 lists just the instructions used in these tests; you can obtain the full list from the Intel® 64 and IA-32 Architectures Optimization Reference Manual. When the compiler is set to generate Intel AVX2 code, it will use instructions from all three instruction sets as needed.
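
As a hypothetical illustration of this, the same C loop can be recompiled for each target; with the Intel compiler, the code-generation switch selects which of the Table 2 instruction sets is used (the exact instructions emitted depend on the compiler version and options):

/* vec_target.c - illustrative loop, not taken from the article or the benchmark.
 * Possible builds with the Intel compiler:
 *   icc -O2 -xSSE4.2    vec_target.c   -> 128-bit MULPD/ADDPD style code
 *   icc -O2 -xAVX       vec_target.c   -> 256-bit VMULPD/VADDPD style code
 *   icc -O2 -xCORE-AVX2 vec_target.c   -> 256-bit VFMADD-class fused multiply-add code
 */
void scale_accumulate(const double *a, const double *b, double *c, int n)
{
    for (int i = 0; i < n; i++)
        c[i] += a[i] * b[i];   /* multiply-add pattern, a candidate for FMA contraction */
}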

Procedure for running LINPACK

  1. Download and install the following:
    1. Intel MKL – LINPACK Download
      http://software.intel.com/en-us/articles/intel-math-kernel-library-linpack-download
    2. Intel MKL
      http://software.intel.com/en-us/intel-math-kernel-library-evaluation-options
  2. Create input files for 30K, 75K, 100K from the “...\linpack” directory
  3. For optimal performance, make the following operating system and BIOS setting changes before running LINPACK:
    1. Turn off Intel® Hyper-Threading Technology (Intel® HT Technology) in the BIOS.
    2. For Linux, export the “MKL_CBWR=AVX2” setting on the command line and update the runme_xeon64 shell script file to use the input files you created.
    3. The results will be reported in Gflops, similar to those shown in Table 1.
  4. For Intel AVX runs, set “MKL_CBWR=AVX” and repeat the above steps.
  5. For Intel SSE runs, set “MKL_CBWR=SSE4_2” and repeat the above steps. (The MKL_CBWR code path can also be selected programmatically; a sketch follows this procedure.)
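
For reference, the MKL_CBWR environment variable used above can also be set from code through Intel MKL's conditional numerical reproducibility interface. The sketch below is a hypothetical, standalone example (not part of the LINPACK package; the file name and matrix size are made up): it pins the Intel MKL code path to the AVX2 branch, the programmatic equivalent of exporting MKL_CBWR=AVX2, and then calls DGEMM, the kernel that dominates LINPACK. It assumes Intel MKL is installed and linked, for example with "icc mkl_cbwr_demo.c -mkl".

/* mkl_cbwr_demo.c - hypothetical sketch of pinning the MKL code path to AVX2. */
#include <stdio.h>
#include <mkl.h>

int main(void)
{
    /* Ask MKL to use only its Intel AVX2 branch (same effect as MKL_CBWR=AVX2).
       Must be called before any MKL computational routine. */
    if (mkl_cbwr_set(MKL_CBWR_AVX2) != MKL_CBWR_SUCCESS)
        printf("AVX2 branch not available; MKL will choose another code path.\n");

    const int n = 512;                                   /* small placeholder size, not a LINPACK input */
    double *A = (double *)mkl_malloc(n * n * sizeof(double), 64);
    double *B = (double *)mkl_malloc(n * n * sizeof(double), 64);
    double *C = (double *)mkl_malloc(n * n * sizeof(double), 64);
    for (int i = 0; i < n * n; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

    /* C = A * B; on an AVX2-capable processor MKL dispatches FMA-based kernels. */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);

    printf("C[0] = %.1f\n", C[0]);                       /* expect 1024.0 (512 * 1.0 * 2.0) */

    mkl_free(A); mkl_free(B); mkl_free(C);
    return 0;
}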

Platform Configuration

CPU & Chipset – Model/Speed/Cache: Intel® Xeon® processor E7-8890 v3 (code named Haswell-EX) (2.5GHz, 45MB) QGUA D0 step
  • # of cores per chip: 18
  • # of sockets: 4
  • Chipset: (code named Patsburg) (J C1 step)
  • System bus: 9.6GT/s QPI
Platform – Brand/model: (code named Brickland)
  • Chassis: Intel 4U Rackable
  • Baseboard: code named Brickland, 3 SPC DDR4
  • BIOS: BRHSXSD1.86B.0063.R00.1503261059 (63.R00)
  • DIMM slots: 96
  • Power supply: 2x1200W non-redundant
  • CD ROM: TEAC Slim
  • Network (NIC): 1x Intel® Ethernet Converged Network Adapter X540-T2 (code named "Twin Pond") (OEM-GEN)
Memory – Memory size: 256GB (32x8GB) DDR4 1.2V ECC 2133MHz RDIMMs; Brand/model: Micron MTA18ASF1G72PDZ-2G1A1HG; DIMM info: 8GB 2Rx8 PC4-2133P
Mass storage – Brand & model: Intel® S3700 Series SSD; Number/size/RPM/Cache: 2/800GB/NA
Operating system – Microsoft Windows* Server 2012 R2 / SLES 11 SP3 Linux*

Conclusion

From our LINPACK experiment, we see compelling performance benefits when moving to an Intel AVX2-enabled Intel Xeon processor. In this specific case, we saw a performance increase of ~2.89x-3.49x for Intel AVX2 vs. Intel SSE and ~1.73x-2.12x for Intel AVX2 vs. Intel AVX in our test environment. This makes a strong case for developers who have Intel SSE-enabled code and are weighing the benefit of moving to a newer Intel Xeon processor-based system with Intel AVX2. To learn how to migrate existing Intel SSE code to Intel AVX2 code, refer to the materials below.

References

