The Intel® Xeon® processor E7 v3 family now includes an instruction set called Intel® Advanced Vector Extensions 2 (Intel® AVX2), which can potentially improve application performance related to high performance computing, databases, and video processing. To validate this statement, I performed a simple experiment using the Intel® Optimized LINPACK benchmark. The results, shown in Table 1, demonstrate roughly a 2.9x-3.5x performance increase using Intel AVX2 vs. Intel® Streaming SIMD Extensions (Intel® SSE), and roughly a 1.7x-2.1x increase when comparing Intel AVX2 with Intel® Advanced Vector Extensions (Intel® AVX).
The results in Table 1 come from three workload sizes (30K, 75K, and 100K) run on Linux* with each of the Intel AVX2, Intel AVX, and Intel SSE4 code paths. The last two columns show the performance gain of Intel AVX2 over Intel SSE4 and over Intel AVX. Running an Intel AVX2-optimized LINPACK on an Intel AVX2-capable processor, Intel AVX2 performed ~2.89x-3.49x better than Intel SSE4 and ~1.73x-2.12x better than Intel AVX. These numbers are just an example of the potential performance boost for LINPACK; for other applications, the gain will vary depending on how the code is optimized and on the hardware environment.
Table 1 – Results and Performance Gain from Running the LINPACK Benchmark on Quad Intel® Xeon® Processor E7-8890 v3.
Problem Size (Linux* LINPACK v11.2.2) | Intel® AVX2 (Gflops) | Intel® AVX (Gflops) | Intel® SSE4 (Gflops) | Intel AVX2 Gain over Intel SSE4 | Intel AVX2 Gain over Intel AVX |
---|---|---|---|---|---|
30K | 1835.83 | 867.065 | 525.38 | 3.49 | 2.12 |
75K | 2092.87 | 1211.89 | 724.40 | 2.89 | 1.73 |
100K | 2130.31 | 1224.44 | 731.42 | 2.91 | 1.74 |
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Configurations: Intel® Xeon® processor E7-8890 v3 @ 2.50GHz, 45MB L3 cache, 18-core pre-production system. 2x Intel® SSD DC P3700 Series @ 800GB, 256GB memory (32x8GB DDR4-2133MHz), BIOS by Intel Corporation Version: BRHSXSD1.86B.0063.R00.1503261059 (63.R00), BMC 70.7.5334, ME 2.3.0, SDR Package D.00, Power supply: 2x1200W non-redundant, running Microsoft Windows* Server 2012 R2 / SLES 11 SP3 Linux*
For more information go to http://www.intel.com/performance
How to take advantage of Intel® AVX2 in existing vectorized code
Vectorized code that uses floating point operations can get a potential performance boost when running on newer platforms such as the Intel Xeon processor E7 v3 family by doing the following:
- Recompile the code, using the Intel® compiler with the proper Intel AVX2 switch to convert existing Intel SSE code. See the Intel® Compiler Options for Intel® SSE and Intel® AVX generation white paper for more details.
- Modify the code's function calls to leverage the Intel® Math Kernel Library (Intel® MKL), which is already optimized to use Intel AVX2 where supported; an Intel MKL sketch follows this list.
- Use Intel AVX2 intrinsics. High-level language (such as C or C++) developers can call Intel® intrinsic functions and recompile the code; a brief intrinsics sketch follows this list. See the Intel® Intrinsics Guide and the Intel® 64 and IA-32 Architectures Optimization Reference Manual for more details.
- Code in assembly instructions directly. Low-level (assembly) developers can replace their existing Intel SSE instructions with the equivalent Intel AVX2 instructions. See the Intel® 64 and IA-32 Architectures Optimization Reference Manual for more details.
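To make the intrinsics option concrete, below is a minimal sketch of a daxpy-style loop written with Intel AVX2/FMA intrinsics. The function name, loop structure, and build flags are illustrative examples rather than code from this article; with the Intel compiler you would build it with an AVX2 target such as -xCORE-AVX2 (or -mavx2 -mfma with GCC).

```c
/* Hypothetical example, not from the article: a daxpy-style loop written
 * with Intel AVX2/FMA intrinsics. Build for an AVX2 target, for example
 * "icc -xCORE-AVX2" or "gcc -mavx2 -mfma". */
#include <immintrin.h>
#include <stddef.h>

void daxpy_avx2(size_t n, double a, const double *x, double *y)
{
    __m256d va = _mm256_set1_pd(a);            /* broadcast a across the YMM register */
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m256d vx = _mm256_loadu_pd(x + i);   /* unaligned 256-bit load              */
        __m256d vy = _mm256_loadu_pd(y + i);
        vy = _mm256_fmadd_pd(va, vx, vy);      /* a*x + y in one fused multiply-add   */
        _mm256_storeu_pd(y + i, vy);
    }
    for (; i < n; ++i)                         /* scalar remainder                    */
        y[i] = a * x[i] + y[i];
}
```

Each iteration of the vector loop processes four double-precision values with a single fused multiply-add, which is the kind of operation LINPACK-style matrix kernels rely on.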
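For the Intel MKL option, the sketch below (again a hypothetical example, not code from this article) simply calls cblas_dgemm; Intel MKL then dispatches its Intel AVX2 kernels at run time on processors that support them. Link against Intel MKL, for example with the Intel compiler's -mkl option.

```c
/* Hypothetical example, not from the article: let Intel MKL pick the best
 * code path (Intel AVX2 on supporting CPUs) for a matrix multiply. */
#include <stdio.h>
#include <mkl.h>

int main(void)
{
    const MKL_INT n = 512;
    double *a = (double *) mkl_malloc((size_t)n * n * sizeof(double), 64);
    double *b = (double *) mkl_malloc((size_t)n * n * sizeof(double), 64);
    double *c = (double *) mkl_malloc((size_t)n * n * sizeof(double), 64);
    if (!a || !b || !c) return 1;

    for (MKL_INT i = 0; i < n * n; ++i) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

    /* C = 1.0 * A * B + 0.0 * C; MKL chooses its kernels at run time. */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, a, n, b, n, 0.0, c, n);

    printf("c[0] = %.1f\n", c[0]);   /* expected 1024.0 for these inputs */
    mkl_free(a); mkl_free(b); mkl_free(c);
    return 0;
}
```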
Equivalent instructions for Intel® AVX2, Intel® AVX, and Intel® SSE used in the tests
Table 2 lists the equivalent instructions for Intel AVX2, Intel AVX, and Intel SSE (SSE/SSE2/SSE3/SSE4) that may be useful when migrating code. It contains three sets of instructions: the first set has equivalent instructions across all three instruction sets (Intel AVX2, Intel AVX, and Intel SSE); the second set has equivalent instructions across two instruction sets (Intel AVX2 and Intel AVX); and the last set contains Intel AVX2 instructions.
Table 2– Intel® AVX2, Intel® AVX, and Intel® SSE Equivalent Instructions
Intel® AVX and Intel® AVX2 | Equivalent Intel® SSE | Definitions |
---|---|---|
VADDPD | ADDPD | Add packed double-precision floating-point values |
VDIVSD | DIVSD | Divide low double-precision floating point value in xmm2 by low double-precision floating-point value in xmm3/m64 |
VMOVSD | MOVSD | Move or merge scalar double-precision floating-point value |
VMOVUPD | MOVUPD | Move unaligned packed double-precision floating-point values |
VMULPD | MULPD | Multiply packed double-precision floating-point Values |
VPXOR | PXOR | Logical exclusive OR |
VUCOMISD | UCOMISD | Unordered compare scalar double-precision floating-point values and set EFLAGS |
VUNPCKHPD | UNPCKHPD | Unpack and interleave high-packed double-precision floating-point values |
VUNPCKLPD | UNPCKLPD | Unpack and interleave low-packed double-precision floating-point values |
VXORPD | XORPD | Bitwise logical XOR for double-precision floating-point values |
Intel® AVX and Intel® AVX2 | Definitions | |
VADDSD | Add scalar double-precision floating-point values | |
VBROADCASTSD | Broadcast a 64-bit double-precision floating-point element to all elements of a YMM register | |
VCMPPD | Compare packed double-precision floating-point values | |
VCOMISD | Perform ordered comparison of scalar double-precision floating-point values and set flags in EFLAGS register | |
VINSERTF128 | Replace only half of a 256-bit YMM register with the value of a 128-bit source operand. The other half is unchanged. | |
VMAXSD | Return the maximum scalar double-precision floating-point value | |
VMOVQ | Move Quadword | |
VMOVUPS | Move unaligned packed single-precision floating-point values | |
VMULSD | Multiply scalar double-precision floating-point values | |
VPERM2F128 | Permute 128-bit floating-point fields in ymm2 and ymm3/mem using controls from imm8 and store result in ymm1. | |
VPSHUFD | Permute 32-bit blocks of an int32 vector | |
VXORPS | Perform bitwise logical XOR operation on float32 vectors | |
VZEROUPPER | Set the upper half of all YMM registers to zero. Used when switching between 128-bit use and 256-bit use. | |
Intel® AVX2 | Definitions | |
VEXTRACTF128 | Extract 128 bits of float data from ymm2 and store results in xmm1/mem. | |
VEXTRACTI128 | Extract 128 bits of integer data from ymm2 and store results in xmm1/mem. | |
VFMADD213PD | Multiply packed double-precision floating-point values from xmm0 and xmm1, add to xmm2/mem, and put result in xmm0. | |
VFMADD213SD | Multiply scalar double-precision floating-point value from xmm0 and xmm1, add to xmm2/mem, and put result in xmm0. | |
VFMADD231PD | Multiply packed double-precision floating-point values from xmm1 and xmm2/mem, add to xmm0, and put result in xmm0. | |
VFMADD231SD | Multiply scalar double-precision floating-point value in xmm1 and xmm2/mem, add to xmm0, and put result in xmm0. | |
VFNMADD213PD | Multiply packed double-precision floating-point values from xmm0 and xmm1, negate the product, add xmm2/mem, and put the result in xmm0. | |
VFNMADD213SD | Multiply the scalar double-precision floating-point value from xmm0 by xmm1, negate the product, add the low double-precision value from xmm2/mem, and put the result in xmm0. | |
VFNMADD231PD | Multiply packed double-precision floating-point values from ymm1 and ymm2/mem, negate the multiplication result, and add to ymm0. Put the result in ymm0. | |
VMAXPD | Determine the maximum of packed double-precision floating-point values | |
VPADDQ | Add packed quadword (64-bit) integers | |
VPBLENDVB | Conditionally blend byte elements of the source vectors depending on bits in a mask vector | |
VPBROADCASTQ | Take qwords from the source operand and broadcast to all elements of the result vector | |
VPCMPEQD | Compare packed doublewords of two source vectors for equality | |
VPCMPGTQ | Compare packed quadwords of two source vectors for greater than |
Table 2 lists just the instructions used in these tests; you can obtain the full list from the Intel® 64 and IA-32 Architectures Optimization Reference Manual. When the compiler is set to target Intel AVX2, it uses instructions from all three instruction sets as needed.
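As an illustration of how these instructions appear in practice, the short loop below is a hypothetical example (not code from the benchmark) of the kind of pattern an Intel AVX2-targeting compiler can typically turn into the 256-bit packed moves, multiplies, and fused multiply-adds listed in Table 2, rather than their 128-bit Intel SSE counterparts.

```c
/* Hypothetical example: when compiled for Intel AVX2 (e.g., icc -xCORE-AVX2),
 * this loop is a typical candidate for 256-bit packed loads/stores (VMOVUPD)
 * and fused multiply-add instructions such as VFMADD231PD. */
void scale_accumulate(int n, double s, const double *restrict x,
                      double *restrict y)
{
    for (int i = 0; i < n; ++i)
        y[i] += s * x[i];   /* multiply + add: fusable into a single FMA */
}
```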
Procedure for running LINPACK
- Download and install the Intel® Optimized LINPACK benchmark
- Create input files for 30K, 75K, 100K from the “...\linpack” directory
- For optimal performance, make the following operating system and BIOS setting changes before running LINPACK:
- Turn off Intel® Hyper-Threading Technology (Intel® HT Technology) in the BIOS.
- For Linux, export the “MKL_CBWR=AVX2” setting on the command line and update the runme_xeon64 shell script file to use the input files you created.
- The results will be in Gflops, similar to Table 1.
- For Intel AVX runs, set the “MKL_CBWR=AVX” and repeat the above steps.
- For Intel SSE runs, set the “MKL_CBWR=SSE4_2” and repeat the above steps. A sketch after this list shows how to select the same code paths from inside a program instead of through the environment variable.
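As an alternative to exporting the MKL_CBWR environment variable, the code path can also be selected from inside a program through Intel MKL's conditional numerical reproducibility API. The snippet below is a minimal sketch assuming Intel MKL 11.0 or later; call mkl_cbwr_set before any other Intel MKL routine.

```c
/* Sketch: force a specific Intel MKL code path programmatically instead of
 * exporting MKL_CBWR. Must be called before any other MKL routine. */
#include <stdio.h>
#include <mkl.h>

int main(void)
{
    /* MKL_CBWR_AVX or MKL_CBWR_SSE4_2 would mirror the other two runs above. */
    if (mkl_cbwr_set(MKL_CBWR_AVX2) != MKL_CBWR_SUCCESS)
        fprintf(stderr, "Intel AVX2 branch not available on this system\n");

    /* Report which branch MKL will actually use. */
    printf("active MKL CBWR branch code: %d\n", mkl_cbwr_get(MKL_CBWR_BRANCH));
    return 0;
}
```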
Platform Configuration
Component | Details |
---|---|
CPU & Chipset | Model/Speed/Cache: Intel® Xeon® processor E7-8890 v3 (code named Haswell-EX), 2.5GHz, 45MB L3 cache, QGUA D0 step |
Platform | Brand/model: (code named Brickland) |
Memory | Memory size: 256GB (32x8GB) DDR4 1.2V ECC 2133MHz RDIMMs; Brand/model: Micron MTA18ASF1G72PDZ-2G1A1HG; DIMM info: 8GB 2Rx8 PC4-2133P |
Mass storage | Brand & model: Intel® S3700 Series SSD; Number/size/RPM/Cache: 2/800GB/NA |
Operating system | Microsoft Windows* Server 2012 R2 / SLES 11 SP3 Linux* |
Conclusion
From our LINPACK experiment, we see compelling performance benefits when moving to an Intel AVX2-enabled Intel Xeon processor. In this specific case, we saw a performance increase of ~2.89x-3.49x for Intel AVX2 vs. Intel SSE and ~1.73x-2.12x for Intel AVX2 vs. Intel AVX in our test environment. This makes a strong case for developers who have Intel SSE-enabled code and are weighing the benefit of moving to a newer Intel Xeon processor-based system with Intel AVX2. To learn how to migrate existing Intel SSE code to Intel AVX2, refer to the materials below.
References
- Intel® Compiler Options for Intel® SSE and Intel® AVX generation (SSE2, SSE3, SSSE3, ATOM_SSSE3, SSE4.1, SSE4.2, ATOM_SSE4.2, AVX, AVX2) and processor-specific optimizations
- Intel® System Studio: Intel® AVX2 Support in the Intel® C++ Compiler
- Intel® AVX2 optimization in Intel® MKL
- Intel® IPP support for Intel® AVX2
- Processing Arrays of Bits with Intel® Advanced Vector Extensions 2 (Intel® AVX2)
- High Performance Multi-core Networked and Storage Systems for Linux
- Optimized Pseudo Random Number Generators with Intel® AVX2