
Avoid frequency drop in GPU cores when executing applications in Heterogeneous mode


Introduction

Intel(R) C++ Compiler 15.0 provides a feature that enables offloading general-purpose compute kernels to the processor graphics, putting the processor graphics silicon to work for general-purpose computing. The key idea is to use the compute power of the CPU cores and the GPU execution units in tandem for better utilization of the available compute resources.
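As a rough illustration of this programming model, the sketch below offloads a simple data-parallel loop to the processor graphics using the compiler's offload pragma. The function, array names, and clauses shown are illustrative assumptions; consult the compiler documentation for the exact syntax supported by your version.

// Minimal sketch: offload a data-parallel loop to the processor graphics.
// Assumes Intel(R) C++ Compiler 15.0 with offload to Intel(R) Graphics Technology
// enabled; the function and array names are illustrative.
void ScaleAdd(float* a, const float* b, const float* c, int n)
{
	// pin() keeps the arrays accessible to the GPU without copying;
	// _Cilk_for expresses the parallel kernel that is offloaded.
	#pragma offload target(gfx) pin(a, b, c : length(n))
	_Cilk_for(int i = 0; i < n; i++)
		a[i] = b[i] * 2.0f + c[i];
}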

Target OS requirements:

Windows* 32 and 64 bit support. Compute offload feature on Microsoft Windows 7* will only work with an active display (no locked screen). This restriction is imposed by DirectX 9 but relaxed in DirectX 11 (on Microsoft Windows 8* and Microsoft Windows 8.1*). 

Linux 64 bit:

  1. Ubuntu 12.04 (Linux kernel numbers: 3.2.0-41 for 3rd generation Intel® Core™ Processors and 3.8.0-23 for 4th generation Intel® Core™ Processors)
  2. SUSE Linux Enterprise Server 11 SP3 (Linux kernel numbers: 3.0.76-11 for both 3rd and 4th generation Intel® Core™ Processors) 

Heterogeneous mode

When an application executes in heterogeneous mode (CPU and GPU cores in action), the processor runs at full throttle. Every processor has an operating TDP limit, and the power-sharing algorithm implemented in hardware takes the necessary action to keep the processor within that limit. Modern processors have both Intel(R) Turbo Boost Technology and Intel(R) HD Graphics Dynamic Frequency Technology. Intel(R) Turbo Boost Technology increases the frequency of the CPU cores when needed; Intel(R) HD Graphics Dynamic Frequency Technology is the equivalent for the GPU cores. When both CPU and GPU are active simultaneously, the processor hits its TDP limit relatively quickly. In these cases, with the default power settings on the system ("Maximize Performance" or "Balanced" for the CPU, Turbo Boost turned ON, and "Balanced" for the GPU), the power-sharing algorithm gives preference to the CPU cores' frequency. Section 2.3 in http://www.intel.com/Assets/PDF/whitepaper/323324.pdf describes how to avoid a drop in GPU core frequency. In short, do the following:

1. Turn off Intel(R) Turbo Boost Technology
2. Switch the power option of Graphics processor from "Balanced" to "Maximize Performance"

This user control helps give priority to the GPU cores for workloads that perform better on the GPU.

This article applies to:
    Products: Intel® System Studio
    Host OS/Platform: Windows (IA-32 or Intel® 64); Linux (Intel® 64)
    Target OS/platform: Windows (IA-32 or Intel® 64); Ubuntu 12.04 (Intel® 64)


Abaqus/Standard Performance Case Study on Intel® Xeon® E5-2600 v3 Product Family


Background

The whole point of simulation is to model the behavior of a design, and of potential changes to it, under various conditions to determine whether we get the expected response. Simulation in software is far cheaper than building hardware, performing a physical test, and modifying the hardware model each time.

Dassault Systèmes [1], through its SIMULIA* brand, is creating a new paradigm to establish Finite Element Analysis and multiphysics simulation software as an integral business process in the engineering value chain. More information about SIMULIA can be found here [2].

The Abaqus* Unified Finite Elements Analysis product suite, from Dassault Systèmes* SIMULIA, offers powerful and complete solutions for both routine and sophisticated engineering problems covering a vast spectrum of industrial applications in Automotive, Aerospace, Consumer Packaged Goods, Energy, High Tech, Industrial Equipment and Life Sciences. As an example,  automotive industry engineering work groups are able to consider full vehicle loads, dynamic vibration, multibody systems, impact/crash, nonlinear static, thermal coupling, and acoustic-structural coupling using a common model data structure and integrated solver technology.

What is Finite Element Analysis (FEA)?

FEA is a computerized method of simulating the behavior of engineering structures and components under a variety of conditions. It is the application of the Finite Element Method (FEM) [3][8]. It works by breaking an object down into a large number of finite elements, each represented by an equation. By integrating all the elements' equations, the whole object can be mathematically modeled.

How Abaqus/Standard takes advantage of Intel® AVX2

Abaqus/Standard is general purpose FEA.  It includes many analysis capabilities. According to Dassault Systèmes web site, it “employs solution technology ideal for static and low-speed dynamic events where highly accurate stress solutions are critically important. Examples include sealing pressure in a gasket joint, steady-state rolling of a tire, or crack propagation in a composite airplane fuselage. Within a single simulation, it is possible to analyze a model both in the time and frequency domain. For example, one may start by performing a nonlinear engine cover mounting analysis including sophisticated gasket mechanics. Following the mounting analysis, the pre-stressed natural frequencies of the cover can be extracted, or the frequency domain mechanical and acoustic response of the pre-stressed cover to engine induced vibrations can be examined.”  More information about Abaqus/Standard can be found at [9].

According to the Dassault Systèmes web site, Abaqus/Standard uses Hilber-Hughes-Taylor time integration [12] by default. The time integration is implicit, meaning that the operator matrix must be inverted and a set of simultaneous nonlinear dynamic equilibrium equations must be solved at each time increment. This solution is done iteratively using Newton's method [13]. The solver relies on a function called DGEMM [5] (Double-Precision General Matrix Multiplication) in the Intel® Math Kernel Library (Intel® MKL) [4] to handle matrix multiplication involving double-precision values.

Analysis of Abaqus workloads using performance monitoring tools, such as Intel® VTune™ [14], showed that a significant number of them spend 40% to 50% of their runtime in DGEMM. Further analysis of the DGEMM function showed that it makes extensive use of the multiply-add operation, since DGEMM is, basically, matrix multiplication.

One of the new features of the Intel® Xeon® E5-2600 v3 Product Family is support for a new extension set called Intel AVX2 [7]. One of the new instructions in Intel AVX2 is the three-operand fused multiply-add (FMA3) [6]. By implementing the combined multiply-add operation in hardware, the speed of this operation is considerably improved.
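To make this concrete, the minimal sketch below (illustrative only; the function and array names are not from Abaqus or Intel MKL) shows the FMA3 intrinsic that AVX2-aware code can use to fuse the multiply and the add into a single instruction:

#include <immintrin.h>

// Computes c[i] += a[i] * b[i], four doubles at a time, using FMA3.
// Assumes 32-byte-aligned pointers and n being a multiple of 4.
void FmaAccumulate(const double* a, const double* b, double* c, int n)
{
	for(int i = 0; i < n; i += 4)
	{
		__m256d va = _mm256_load_pd(a + i);
		__m256d vb = _mm256_load_pd(b + i);
		__m256d vc = _mm256_load_pd(c + i);
		vc = _mm256_fmadd_pd(va, vb, vc);   // one instruction: vc = va * vb + vc
		_mm256_store_pd(c + i, vc);
	}
}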

Abaqus/Standard uses Intel® MKL's DGEMM implementation. It should also be noted that in Intel MKL version 11 update 5 and later, DGEMM was optimized to use Intel AVX2 extensions, thus allowing DGEMM to run optimally on the Intel® Xeon® E5-2600 v3 Product Family.
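For reference, a double-precision matrix multiplication routed through Intel MKL's CBLAS interface looks like the sketch below. The wrapper function and matrix dimensions are illustrative assumptions, while the cblas_dgemm call itself is the standard MKL/CBLAS API that dispatches to AVX2-optimized kernels on supporting hardware:

#include <mkl.h>

// C = alpha * A * B + beta * C with A (m x k), B (k x n), C (m x n), row-major storage.
void MatMul(const double* A, const double* B, double* C, int m, int n, int k)
{
	const double alpha = 1.0, beta = 0.0;
	cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
	            m, n, k, alpha, A, k, B, n, beta, C, n);
}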

Performance test procedure

To demonstrate the performance improvement brought by using a newer DGEMM implementation that takes advantage of Intel AVX2, we performed tests on two platforms. One system was equipped with the Intel Xeon E5-2697 v3 and the other with the Intel Xeon E5-2697 v2. The duration of the tests was measured in seconds.

Performance test Benchmarks

The following four benchmarks from Abaqus/Standard were used: s2a, s3a, s3b and s4b.

Figure 1. S2a is a nonlinear static analysis of a flywheel with centrifugal loading.

Figure 2. S3 extracts the natural frequencies and mode shapes of a turbine impeller.

S3 has three versions.

S3a is a 360,000 degrees-of-freedom (DOF) version using the Lanczos eigensolver [11].

S3b is a 1,100,000 DOF version using the Lanczos eigensolver.

Figure 3. S4 is a benchmark that simulates the bolting of a cylinder head onto an engine block.

S4b is the S4 version with 5,000,000 DOF using the direct solver.

Note that these pictures are the property of Dassault Systèmes*. They are reprinted with permission from Dassault Systèmes.

Test configurations

System equipped with Intel Xeon E5-2697 v3

  • System: Pre-production
  • Processors: Xeon E5-2697 v3 @2.6GHz
  • Memory: 128GB DDR4-2133MHz

System equipped with Intel Xeon E5-2697 v2

  • System: Pre-production
  • Processors: Xeon E5-2697 v2 @2.7GHz
  • Memory: 64GB DDR3-1866MHz

Operating System: Red Hat* Enterprise Linux Server release 6.4

Application: Abaqus/Standard benchmarks version 6.13-1

Note:

1) Although the system equipped with the Intel® Xeon® E5-2697 v3 processor has more memory, the memory capacity does not affect the test results, as the largest workload only used 43GB of memory.

2) The duration was measured by wall-clock time in seconds.

Test Results 

Figure 4. Comparison between Intel Xeon E5-2697 v3 and E5-2697 v2

Figure 4 shows the benchmarks running on a system equipped with the Intel Xeon E5-2697 v3 and on a system equipped with the E5-2697 v2. The performance improvement due to Intel AVX2 and the newer hardware ranges from 1.11X to 1.39X.

 

Figure 5. Comparison between benchmarks with Intel AVX2 enabled and disabled

Figure 5 shows the results of the benchmarks with Intel AVX2 enabled and disabled on a system equipped with the Intel Xeon E5-2697 v3. Using Intel AVX2 allows the benchmarks to finish faster than without it. The performance increase due to Intel AVX2 ranges from 1.03X to 1.11X.

Note: Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance

Conclusion

Simulation software performance is critical since it can significantly reduce model development and analysis time. Abaqus/Standard is a well-known FEA product that relies on DGEMM for its solvers. As a result of the introduction of Intel® AVX2 in the Intel® Xeon® E5-2600 v3 Product Family, and of the Intel MKL updates that take advantage of Intel AVX2, a simple change to Abaqus/Standard to use the latest libraries yielded a considerable performance improvement.

References

[1] www.3ds.com

[2] http://www.3ds.com/products-services/simulia/

[3] http://en.wikipedia.org/wiki/Finite_element_method

[4] https://software.intel.com/en-us/intel-mkl

[5] https://software.intel.com/en-us/node/429920

[6] http://en.wikipedia.org/wiki/FMA_instruction_set

[7] http://en.wikipedia.org/wiki/Advanced_Vector_Extensions

[8] http://people.maths.ox.ac.uk/suli/fem.pdf

[9] http://www.3ds.com/products-services/simulia/products/abaqus/abaqusstandard/

[10] http://www.simulia.com/support/v66/v66_performance.html#s2

[11] http://en.wikipedia.org/wiki/Lanczos_algorithm

[12] http://sbel.wisc.edu/People/schafer/mdexperiments/node13.html

[13] http://en.wikipedia.org/wiki/Newton%27s_method

[14] Intel® VTune™ Amplifier XE

 

Easy SIMD through Wrappers


By Michael Kopietz,  Crytek Render Architect

Download PDF

1. Introduction

This article aims to change your thinking on how SIMD programming can be applied in your code. By thinking of SIMD lanes as functioning similarly to CPU threads, you will gain new insights and be able to apply SIMD more often in your code.

Intel has been shipping CPUs with SIMD support for about twice as long as it has been shipping multi-core CPUs, yet threading is more established in software development. One reason for threading's popularity is the abundance of tutorials that introduce it in a simple "run this entry function n times" manner, skipping all the possible traps. SIMD tutorials, on the other hand, tend to focus on achieving the final 10% of speedup, which requires you to double the size of your code. If these tutorials provide example code, you may find it hard to absorb all the new information and at the same time come up with your own simple and elegant way of using it. Showing a simple, useful way of using SIMD is therefore the topic of this paper.

First, the basic principle of SIMD code: alignment. Probably all SIMD hardware either demands or at least prefers some natural alignment, and explaining the basics alone could fill a paper [1]. In general, unless you're running out of memory, it is important to allocate memory in a cache-friendly way. For Intel CPUs that means allocating memory on a 64-byte boundary, as shown in Code Snippet 1.

#include <stddef.h>     // size_t
#include <xmmintrin.h>  // _mm_malloc / _mm_free

inline void* operator new(size_t size)
{
	return _mm_malloc(size, 64);
}

inline void* operator new[](size_t size)
{
	return _mm_malloc(size, 64);
}

inline void operator delete(void *mem)
{
	_mm_free(mem);
}

inline void operator delete[](void *mem)
{
	_mm_free(mem);
}

Code Snippet 1: Allocation functions that respect cache-friendly 64 byte boundaries

2. The basic idea

The way to begin is simple: assume every lane of a SIMD register executes as a thread. In the case of Intel® Streaming SIMD Extensions (Intel® SSE), you have 4 threads/lanes; with Intel® Advanced Vector Extensions (Intel® AVX), 8 threads/lanes; and 16 threads/lanes on Intel® Xeon Phi™ coprocessors.

To have a 'drop-in' solution, the first step is to implement classes that behave mostly like primitive data types. Wrap 'int', 'float', etc. and use those wrappers as the starting point for every SIMD implementation. For the Intel SSE version, replace the float member with __m128 and the int and unsigned int members with __m128i, and implement the operators using Intel SSE or Intel AVX intrinsics as in Code Snippet 2.

// SSE 128-bit
inline	DRealF	operator+(DRealF R)const{return DRealF(_mm_add_ps(m_V, R.m_V));}
inline	DRealF	operator-(DRealF R)const{return DRealF(_mm_sub_ps(m_V, R.m_V));}
inline	DRealF	operator*(DRealF R)const{return DRealF(_mm_mul_ps(m_V, R.m_V));}
inline	DRealF	operator/(DRealF R)const{return DRealF(_mm_div_ps(m_V, R.m_V));}

// AVX 256-bit
inline	DRealF	operator+(const DRealF& R)const{return DRealF(_mm256_add_ps(m_V, R.m_V));}
inline	DRealF	operator-(const DRealF& R)const{return DRealF(_mm256_sub_ps(m_V, R.m_V));}
inline	DRealF	operator*(const DRealF& R)const{return DRealF(_mm256_mul_ps(m_V, R.m_V));}
inline	DRealF	operator/(const DRealF& R)const{return DRealF(_mm256_div_ps(m_V, R.m_V));}

Code Snippet 2: Overloaded arithmetic operators for SIMD wrappers

3. Usage Example

Now let's assume you're working on two HDR images, where every pixel is a float, and you want to blend between the two images.

void CrossFade(float* pOut,const float* pInA,const float* pInB,size_t PixelCount,float Factor)
{
	const DRealF BlendA(1.f - Factor);
	const DRealF BlendB(Factor);
	for(size_t i = 0; i < PixelCount; i += THREAD_COUNT)
		*(DRealF*)(pOut + i) = *(DRealF*)(pInA + i) * BlendA + *(DRealF*)(pInB + i) * BlendB;
}

Code Snippet 3: Blending function that works with both primitive data types and SIMD data

The executable generated from Code Snippet 3 runs natively on normal registers as well as on Intel SSE and Intel AVX. It's not quite the vanilla way you'd usually write it, but every C++ programmer should still be able to read and understand it. Let's see whether it works the way you expect. The first and second lines of the implementation initialize the blend factors of our linear interpolation by replicating the parameter to whatever width your SIMD register has.

The third line is nearly a normal loop. The only special part is "THREAD_COUNT". It's 1 for normal registers, 4 for Intel SSE, and 8 for Intel AVX, representing the number of lanes in the register, which in our case resembles the thread count.

The fourth line indexes into the arrays; both input pixels are scaled by the blend factors and summed. Depending on your preference, you might want to use some temporaries, but there is no intrinsic you need to look up and no implementation per platform.
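For orientation, here is a minimal sketch of what the SSE flavor of such a wrapper might look like. The member layout, constructors, and THREAD_COUNT definition below are assumptions that match the snippets in this article, not the author's complete implementation:

#include <xmmintrin.h>

#define THREAD_COUNT 4  // number of float lanes in an SSE register

struct DRealF
{
	__m128 m_V;

	DRealF() {}
	DRealF(float F)  : m_V(_mm_set1_ps(F)) {}  // replicate the scalar to all lanes
	DRealF(__m128 V) : m_V(V) {}

	// arithmetic operators as in Code Snippet 2, e.g.:
	DRealF operator+(DRealF R) const { return DRealF(_mm_add_ps(m_V, R.m_V)); }
	DRealF operator*(DRealF R) const { return DRealF(_mm_mul_ps(m_V, R.m_V)); }
};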

4. Drop in

Now it's time to prove that it actually works. Let's take a vanilla MD5 hash implementation and use all of the available CPU power to find a pre-image. To achieve that, we'll replace the primitive types with our SIMD types. MD5 runs several "rounds" that apply various simple bit operations on unsigned integers, as demonstrated in Code Snippet 4.

#define LEFTROTATE(x, c) (((x) << (c)) | ((x) >> (32 - (c))))
#define BLEND(a, b, x) SelectBit(a, b, x)

template<int r>
inline DRealU Step1(DRealU a,DRealU b,DRealU c,DRealU d,DRealU k,DRealU w)
{
	const DRealU f = BLEND(d, c, b);
	return b + LEFTROTATE((a + f + k + w), r);
}

template<int r>
inline DRealU Step2(DRealU a,DRealU b,DRealU c,DRealU d,DRealU k,DRealU w)
{
	const DRealU f = BLEND(c, b, d);
	return b + LEFTROTATE((a + f + k + w),r);
}

template<int r>
inline DRealU Step3(DRealU a,DRealU b,DRealU c,DRealU d,DRealU k,DRealU w)
{
	DRealU f = b ^ c ^ d;
	return b + LEFTROTATE((a + f + k + w), r);
}

template<int r>
inline DRealU Step4(DRealU a,DRealU b,DRealU c,DRealU d,DRealU k,DRealU w)
{
	DRealU f = c ^ (b | (~d));
	return b + LEFTROTATE((a + f + k + w), r);
}

Code Snippet 4: MD5 step functions for SIMD wrappers

Besides the type naming, there is really just one change that could look a little bit like magic: "SelectBit". If a bit of x is set, the corresponding bit of b is returned; otherwise, the corresponding bit of a. In other words, it is a blend. The main MD5 hash function is shown in Code Snippet 5.

inline void MD5(const uint8_t* pMSG,DRealU& h0,DRealU& h1,DRealU& h2,DRealU& h3,uint32_t Offset)
{
	const DRealU w0  =	Offset(DRealU(*reinterpret_cast<const uint32_t*>(pMSG + 0 * 4) + Offset));
	const DRealU w1  =	*reinterpret_cast<const uint32_t*>(pMSG + 1 * 4);
	const DRealU w2  =	*reinterpret_cast<const uint32_t*>(pMSG + 2 * 4);
	const DRealU w3  =	*reinterpret_cast<const uint32_t*>(pMSG + 3 * 4);
	const DRealU w4  =	*reinterpret_cast<const uint32_t*>(pMSG + 4 * 4);
	const DRealU w5  =	*reinterpret_cast<const uint32_t*>(pMSG + 5 * 4);
	const DRealU w6  =	*reinterpret_cast<const uint32_t*>(pMSG + 6 * 4);
	const DRealU w7  =	*reinterpret_cast<const uint32_t*>(pMSG + 7 * 4);
	const DRealU w8  =	*reinterpret_cast<const uint32_t*>(pMSG + 8 * 4);
	const DRealU w9  =	*reinterpret_cast<const uint32_t*>(pMSG + 9 * 4);
	const DRealU w10 =	*reinterpret_cast<const uint32_t*>(pMSG + 10 * 4);
	const DRealU w11 =	*reinterpret_cast<const uint32_t*>(pMSG + 11 * 4);
	const DRealU w12 =	*reinterpret_cast<const uint32_t*>(pMSG + 12 * 4);
	const DRealU w13 =	*reinterpret_cast<const uint32_t*>(pMSG + 13 * 4);
	const DRealU w14 =	*reinterpret_cast<const uint32_t*>(pMSG + 14 * 4);
	const DRealU w15 =	*reinterpret_cast<const uint32_t*>(pMSG + 15 * 4);

	DRealU a = h0;
	DRealU b = h1;
	DRealU c = h2;
	DRealU d = h3;

	a = Step1< 7>(a, b, c, d, k0, w0);
	d = Step1<12>(d, a, b, c, k1, w1);
	.
	.
	.
	d = Step4<10>(d, a, b, c, k61, w11);
	c = Step4<15>(c, d, a, b, k62, w2);
	b = Step4<21>(b, c, d, a, k63, w9);

	h0 += a;
	h1 += b;
	h2 += c;
	h3 += d;
}

Code Snippet 5: The main MD5 function

The majority of the code again reads like a normal C function, except that the first lines prepare the data by loading our SIMD registers with the parameters passed; in this case, the data we want to hash. One specialty is the "Offset" call: since we don't want every SIMD lane to do exactly the same work, this call offsets the register by the lane index. It acts like a thread ID you would add. See Code Snippet 6 for reference.

// Pseudocode: stagger the SIMD lanes by adding the lane index to each element
Offset(Register)
{
	for(i = 0; i < THREAD_COUNT; i++)
		Register[i] += i;
}

Code Snippet 6: Offset is a utility function for dealing with different register widths

That means our first element to hash is not [0, 0, 0, 0] for Intel SSE or [0, 0, 0, 0, 0, 0, 0, 0] for Intel AVX. Instead, the first element is [0, 1, 2, 3] and [0, 1, 2, 3, 4, 5, 6, 7], respectively. This replicates the effect of running the function in parallel on 4 or 8 threads/cores, but in the case of SIMD it is instruction parallel.

Table 1 shows the results of our 10 minutes of hard work to get this function SIMD-ified.

Table 1: MD5 performance with primitive and SIMD types

Type          Time        Speedup
x86 integer   379.389s    1.0x
SSE4          108.108s    3.5x
AVX2          51.490s     7.4x

 

5. Beyond Simple SIMD-threads

The results are satisfying. The scaling is not linear, as there is always some non-threaded part (you can easily identify it in the provided source code), but we're not aiming for the last 10% at the cost of twice the work. As a programmer, you'd probably prefer to go for other quick solutions that maximize the gain. Some considerations always arise, like: would it be worthwhile to unroll the loop?

MD5 hashing frequently depends on the result of previous operations, which is not really friendly to CPU pipelines, but you could become register bound if you unroll. Our wrappers help us evaluate that easily. Unrolling is the software version of hyper-threading: we emulate twice as many threads by repeating every operation on twice as much data as there are SIMD lanes. Therefore, create a duplicate type and implement the unrolling inside it by duplicating every operation in our basic operators, as in Code Snippet 7.

// Two SSE registers bundled together to emulate a register of twice the width
struct __m1282
{
	__m128		m_V0;
	__m128		m_V1;
	inline		__m1282(){}
	inline		__m1282(__m128 C0, __m128 C1):m_V0(C0), m_V1(C1){}
};

// Member operators of the DRealF wrapper whose m_V member is now a __m1282;
// every operation is simply issued twice, once per half.
inline	DRealF	operator+(DRealF R)const
	{return __m1282(_mm_add_ps(m_V.m_V0, R.m_V.m_V0),_mm_add_ps(m_V.m_V1, R.m_V.m_V1));}
inline	DRealF	operator-(DRealF R)const
	{return __m1282(_mm_sub_ps(m_V.m_V0, R.m_V.m_V0),_mm_sub_ps(m_V.m_V1, R.m_V.m_V1));}
inline	DRealF	operator*(DRealF R)const
	{return __m1282(_mm_mul_ps(m_V.m_V0, R.m_V.m_V0),_mm_mul_ps(m_V.m_V1, R.m_V.m_V1));}
inline	DRealF	operator/(DRealF R)const
	{return __m1282(_mm_div_ps(m_V.m_V0, R.m_V.m_V0),_mm_div_ps(m_V.m_V1, R.m_V.m_V1));}

Code Snippet 7: These operators are re-implemented to work with two SSE registers at the same time

That's it, really. Now we can run the timings of the MD5 hash function again.

Table 2: MD5 performance with loop unrolling SIMD types

Type          Time        Speedup
x86 integer   379.389s    1.0x
SSE4          108.108s    3.5x
SSE4 x2       75.659s     4.8x
AVX2          51.490s     7.4x
AVX2 x2       36.014s     10.5x

 

The data in Table 2 shows that it's clearly worth unrolling. We achieve a speedup beyond the SIMD lane count, probably because the x86 integer version was already stalling the pipeline with operation dependencies.

6. More complex SIMD-threads

So far our examples were simple in the sense that the code was the usual candidate for hand vectorization: nothing complex, just a lot of compute-demanding operations. But how would we deal with more complex scenarios like branching?

The solution is again quite simple and widely used: speculative calculation and masking. Especially if you've worked with shader or compute languages, you'll likely have encountered this before. Let's take a look at the basic branch in Code Snippet 8 and rewrite it with the ?: operator, as in Code Snippet 9.

int a = 0;
if(i % 2 == 1)
	a = 1;
else
	a = 3;

Code Snippet 8: Calculates the mask using if-else

int a = (i % 2) ? 1 : 3;

Code Snippet 9: Calculates the mask with ternary operator ?:

If you recall our bit-select operator from Code Snippet 4, we can also use it to achieve the same result with only bit operations, as in Code Snippet 10.

int Mask = (i % 2) ? ~0 : 0;
int a = SelectBit(3, 1, Mask);

Code Snippet 10: Use of SelectBit prepares for SIMD registers as data

Now, that might seem pointless if we still need a ?: operator to create the mask, and the comparison does not result in true or false but in all bits set or cleared. Yet this is not a problem, because all bits set or cleared is exactly what the comparison instructions of Intel SSE and Intel AVX return.
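For completeness, SelectBit itself can be written with plain bit operations. Below is a minimal sketch of a scalar version and an SSE integer version; the function name follows the article, the bodies are assumptions:

#include <emmintrin.h>
#include <stdint.h>

// Scalar: take each bit from b where Mask is set, from a otherwise.
inline uint32_t SelectBit(uint32_t a, uint32_t b, uint32_t Mask)
{
	return (b & Mask) | (a & ~Mask);
}

// SSE integer version of the same blend.
inline __m128i SelectBit(__m128i a, __m128i b, __m128i Mask)
{
	return _mm_or_si128(_mm_and_si128(Mask, b), _mm_andnot_si128(Mask, a));
}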

Of course, instead of assigning just 3 or 1, you can call functions and select the returned result you want. That can improve performance even in non-vectorized code, as you avoid branching and the CPU never suffers from branch misprediction; however, the more complex the called functions are, the more expensive it becomes to evaluate both of them speculatively. Even in vectorized code, we can avoid executing unneeded long branches by checking for the special cases where all elements of our SIMD register have the same comparison result, as demonstrated in Code Snippet 11.

int Mask = (i % 2) ? ~0 : 0;
int a = 0;
if(All(Mask))
	a = Function1();
else if(None(Mask))
	a = Function3();
else
	a = SelectBit(Function3(), Function1(), Mask);

Code Snippet 11: Shows an optimized branchless selection between two functions

This detects the special cases where all of the elements are 'true' or all are 'false'. Those cases run on SIMD the same way as on x86; only the last 'else' case, where the execution flow would diverge, needs the bit select.

If Function1 or Function3 modify any data, you'd need to pass the mask down the call and explicitly bit select the modifications just like we've done here. For a drop-in solution, that's a bit of work, but it still results in code that’s readable by most programmers.
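The All and None helpers used above (and DNone in the raytracing loop later on) can be built on the movemask instructions. Here is a minimal sketch for the SSE integer case, assuming each lane of the mask is either all ones or all zeros:

#include <emmintrin.h>

// True if every lane of the mask has all bits set.
inline bool All(__m128i Mask)
{
	return _mm_movemask_epi8(Mask) == 0xFFFF;
}

// True if no lane has any bit set.
inline bool None(__m128i Mask)
{
	return _mm_movemask_epi8(Mask) == 0;
}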

7. Complex example

Let's again take some source code and drop in our SIMD types. A particularly interesting case is raytracing of distance fields. For this, we'll use the scene from Iñigo Quilez's demo [2] with his friendly permission, as shown in Figure 1.

Figure 1: Test scene from Iñigo Quilez's raycasting demo

The "SIMD threading" is placed at the spot where you'd usually add threading. Every thread handles a pixel, traversing the world until it hits something; then a little bit of shading is applied and the pixel is converted to RGBA and written to the frame buffer.

The scene traversal is done iteratively. Every ray takes an unpredictable number of steps until a hit is recognized. For example, a close-up wall is reached after a few steps, while some rays might reach the maximum trace distance without hitting anything at all. Our main loop in Code Snippet 12 handles both cases using the bit-select method we discussed in the previous section.

DRealU LoopMask(RTrue);
for(; a < 128; a++)
{
      DRealF Dist             =     SceneDist(O.x, O.y, O.z, C);
      DRealU DistU            =     *reinterpret_cast<DRealU*>(&Dist) & DMask(LoopMask);
      Dist                    =     *reinterpret_cast<DRealF*>(&DistU);
      TotalDist               =     TotalDist + Dist;
      O                       +=    D * Dist;
      LoopMask                =     LoopMask && Dist > MinDist && TotalDist < MaxDist;
      if(DNone(LoopMask))
            break;
}

Code Snippet 12: Raycasting with SIMD types

The LoopMask variable marks whether a ray is still active: ~0 means active, and 0 means we are done with that ray. At the end of the loop we test whether no ray is active anymore, and in that case we break out of the loop.

In the line above that, we evaluate our conditions for the rays: whether we're close enough to an object to call it a hit, or whether the ray is already beyond the maximum distance we want to trace. We logically AND this with the previous result, as the ray might already have been terminated in one of the previous iterations.

"SceneDist" is the evaluation function for our tracing. It runs for all SIMD lanes and is the heavyweight function that returns the current distance to the closest object. The next line sets the distance elements to 0 for rays that are no longer active, and each ray then steps this amount further for the next iteration.

The original "SceneDist" had some assembler optimizations and material handling that we don't need for our test, so the function is reduced to the minimum we need for a complex example. Inside are still some if-cases that are handled exactly as before. Overall, "SceneDist" is quite large and rather complex, and it would take a while to rewrite it by hand for every SIMD platform again and again. You might need to convert it all in one go, and typos might generate completely wrong results. Even if it works, only a few people will really understand the code, and the maintenance cost is much higher. Doing it by hand should be the last resort. Compared to that, our changes are relatively minor. The code is easy to modify, and you can extend the visual appearance without worrying about optimizing it again or being the only maintainer who understands it, just as if you had added real threads.

But we've done that work to see results, so let’s check the timings in Table 3.

Table 3: Raytracing performance with primitive and SIMD types, including loop unrolling types

Type      FPS         Speedup
x86       0.992FPS    1.0x
SSE4      3.744FPS    3.8x
SSE4 x2   3.282FPS    3.3x
AVX2      6.960FPS    7.0x
AVX2 x2   5.947FPS    6.0x

 

You can clearly see that the speedup does not scale linearly with the element count, mainly because of divergence: some rays might need 10 times more iterations than others.

8. Why not let the compiler do it?

Compilers nowadays can vectorize to some degree, but the highest priority for the generated code is to deliver correct results; you would not use binaries that are 100 times faster if they delivered wrong results even 1% of the time. Some assumptions we make, such as the data being aligned for SIMD and enough padding being allocated so consecutive allocations are not overwritten, are out of scope for the compiler. You can get annotations from the Intel compiler about all the opportunities it had to skip because of assumptions it could not guarantee, and you can try to rearrange code and make promises to the compiler so it will generate the vectorized version. But that is work you have to repeat every time you modify your code, and in more complex cases like branching you can only guess whether it will result in branchless bit selection or in serialized code.
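As an illustration of the kind of promise you can make, the hypothetical loop below uses restrict-qualified pointers and an OpenMP 4.0 SIMD pragma (supported by the Intel compiler when OpenMP SIMD support is enabled) to assert that the iterations are independent; the function itself is made up for this sketch:

// The __restrict qualifiers promise non-overlapping buffers; the pragma asserts
// that the iterations are safe to execute in SIMD fashion.
void Saxpy(float* __restrict out, const float* __restrict in, float a, int n)
{
	#pragma omp simd
	for(int i = 0; i < n; ++i)
		out[i] = a * in[i] + out[i];
}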

The compiler also has no inside knowledge of what you intend to create. You know whether threads will be diverging or coherent and can implement a branched or a bit-selecting solution accordingly. You see the point of attack, the loop that makes the most sense to convert to SIMD, whereas the compiler can only guess whether it will run 10 times or 1 million times.

Relying on the compiler might be a win in one place and a pain in another. It's good to have this alternative solution you can rely on, just like your hand-placed thread entry points.

9. Real threading?

Yes, real threading is useful, and SIMD-threads are not a replacement; both are orthogonal. SIMD-threads are still not as simple to get running as real threading is, but you'll also run into less trouble with synchronization and fewer rare bugs. The really nice advantage is that every core Intel sells can run your SIMD-thread version with all the 'threads'. A dual-core CPU will run 4 or 8 times faster, just like your quad-socket 15-core Haswell-EP. Some results for our benchmarks in combination with threading are summarized in Table 4 through Table 7.1

Table 4: MD5 Performance on Intel® Core™ i7 4770K with both SIMD and threading

Threads   Type          Time        Speedup
1T        x86 integer   311.704s    1.00x
8T        x86 integer   47.032s     6.63x
1T        SSE4          90.601s     3.44x
8T        SSE4          14.965s     20.83x
1T        SSE4 x2       62.225s     5.01x
8T        SSE4 x2       12.203s     25.54x
1T        AVX2          42.071s     7.41x
8T        AVX2          6.474s      48.15x
1T        AVX2 x2       29.612s     10.53x
8T        AVX2 x2       5.616s      55.50x

 

Table 5: Raytracing Performance on Intel® Core™ i7 4770K with both SIMD and threading

Threads   Type          FPS          Speedup
1T        x86 integer   1.202FPS     1.00x
8T        x86 integer   6.019FPS     5.01x
1T        SSE4          4.674FPS     3.89x
8T        SSE4          23.298FPS    19.38x
1T        SSE4 x2       4.053FPS     3.37x
8T        SSE4 x2       20.537FPS    17.09x
1T        AVX2          8.646FPS     7.19x
8T        AVX2          42.444FPS    35.31x
1T        AVX2 x2       7.291FPS     6.07x
8T        AVX2 x2       36.776FPS    30.60x

 

Table 6: MD5 Performance on Intel® Core™ i7 5960X with both SIMD and threading

Threads   Type          Time        Speedup
1T        x86 integer   379.389s    1.00x
16T       x86 integer   28.499s     13.34x
1T        SSE4          108.108s    3.51x
16T       SSE4          9.194s      41.26x
1T        SSE4 x2       75.694s     5.01x
16T       SSE4 x2       7.381s      51.40x
1T        AVX2          51.490s     7.37x
16T       AVX2          3.965s      95.68x
1T        AVX2 x2       36.015s     10.53x
16T       AVX2 x2       3.387s      112.01x

 

Table 7: Raytracing Performance on Intel® Core™ i7 5960X with both SIMD and threading

Threads   Type          FPS          Speedup
1T        x86 integer   0.992FPS     1.00x
16T       x86 integer   6.813FPS     6.87x
1T        SSE4          3.744FPS     3.774x
16T       SSE4          37.927FPS    38.23x
1T        SSE4 x2       3.282FPS     3.31x
16T       SSE4 x2       33.770FPS    34.04x
1T        AVX2          6.960FPS     7.02x
16T       AVX2          70.545FPS    71.11x
1T        AVX2 x2       5.947FPS     6.00x
16T       AVX2 x2       59.252FPS    59.76x

 

1Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance.

As you can see, the threading results vary depending on the CPU, while the SIMD-thread results scale similarly. It's striking that you can reach speedup factors in the high double digits if you combine both ideas. It makes sense to go for the 8x speedup on a dual-core CPU, but it makes just as much sense to go for an additional 8x speedup on highly expensive server hardware.

Join me, SIMD-ify your code!

About the Author

Michael Kopietz is a Render Architect in Crytek's R&D, where he leads a team of engineers developing the rendering of CryEngine(R) and also guides students during their theses. He has worked, among other things, on cross-platform rendering architecture, software rendering, and highly responsive servers, always with high-performance and reusable code in mind. Prior to that, he worked on rendering for ship-battle and soccer simulation games. Having his roots in assembler programming on the earliest home consoles, he still wants to make every cycle count.

Code License

All code in this article is © 2014 Crytek GmbH, and released under the https://software.intel.com/en-us/articles/intel-sample-source-code-license-agreement license. All rights reserved.

References

[1] Memory Management for Optimal Performance on Intel® Xeon Phi™ Coprocessor: Alignment and Prefetching https://software.intel.com/en-us/articles/memory-management-for-optimal-performance-on-intel-xeon-phi-coprocessor-alignment-and

[2] Rendering Worlds with Two Triangles by Iñigo Quilez http://www.iquilezles.org/www/material/nvscene2008/nvscene2008.htm

License changes in Intel® Parallel Studio XE 2016 Beta


This Beta release of the Intel® Parallel Studio XE 2016 introduces a major change to the 'Named-user' licensing scheme (provided as default for the 2016 Beta licenses).  Read below for more details on this new functionality as well as a list of special exceptions.  Following a thorough Beta testing period, implementation will carry forward into the product release.

Description of changes:

The ‘Named-user’ license provisions in the Intel software EULA have changed to only allow the software to be installed on up to three systems.  During the Intel® Parallel Studio XE 2016 Beta program, product licensing will be updated to check for this when it checks for valid licenses, and it will track systems by the system host ID.  The installer will automatically detect the host ID and create the appropriate license.  If your system cannot access the internet during install-time, you will need to manually create a host-specific Beta license.  For more details on how to determine the host ID on your machine, follow the directions in this article.

We would love to get your feedback on this new license scheme. If you reach the allowable number of activations or have other 'Named-user' license problems, please contact us at the Intel® Premier Customer Support website. You will also be asked to complete a Beta survey at the end of the Beta program where you can give some final thoughts on this new functionality.

Limitations:

Using this new 'Named-user' license scheme may not be possible in one of the following cases:

  • Doing an installation on a machine that does not have access to the internet or is behind a firewall
  • Doing a distributed cluster install of the Beta software on a cluster with more than 3 nodes
    • NOTE: You will only hit this issue if the directory where the Beta tools are being installed is not NFS-mounted across the cluster and a distributed installation is required
  • Installation of the following stand-alone packages:
    • Intel® Advisor XE Beta (Linux*, Windows*)
    • Intel® VTune™ Amplifier XE Beta - OS X* host only
    • NOTE: You will only hit this issue if installing the stand-alone packages.  This does not affect installation of these individual components when done via the Intel Parallel Studio XE 2016 Beta installer.

Workaround:

We expect that the new 'Named-user' license scheme will work in the majority of installation cases.  If you encounter either of the situations described previously, you can easily replace the default 'Named-user' license provided during the Beta with a new license for manual offline installation.

In order to do this, return to the Intel® Parallel Studio XE 2016 Beta registration page and select the first option ("Generate license for manual offline installation") under the Email field:

Enter your email address and select the "Continue" button.  A new license file will be emailed to you.  You do NOT have to download the packages again.

Threading Intel® Integrated Performance Primitives Image Resize with Intel® Threading Building Blocks


Download Now: Threading Intel® IPP Image Resize with Intel® TBB.pdf (157.18 KB)

 

Introduction

The Intel® Integrated Performance Primitives (Intel® IPP) library provides a wide variety of vectorized signal and image processing functions. Intel® Threading Building Blocks (Intel® TBB) adds simple but powerful abstractions for expressing parallelism in C++ programs. This article presents a starting point for using these tools together to combine the benefits of vectorization and threading to resize images.   

From Intel® IPP 8.2 onwards, the multi-threaded (internally threaded) libraries are deprecated due to issues with performance and interoperability with other threading models, but they remain available for legacy applications. However, multithreaded programming is now mainstream, and there is a rich ecosystem of threading tools such as Intel® TBB. In most cases, handling threading at the application level (that is, external to/above the primitives) offers many advantages. Many applications already have their own threading model, and application-level/external threading gives developers the greatest flexibility and control. With a little extra effort to add threading to applications, it is possible to meet or exceed internal threading performance, and this opens the door to more advanced optimization techniques such as reusing local cache data for multiple operations. This is the main reason internal threading is being deprecated in the latest releases.

Getting started with parallel_for

Intel® TBB's parallel_for offers an easy way to get started with parallelism, and it is one of the most commonly used parts of Intel® TBB. Any for() loop in an application where each iteration can be done independently and the order of execution doesn't matter is a candidate. In these scenarios, Intel® TBB parallel_for is useful and takes care of most details, like setting up a thread pool and a scheduler. You supply the partitioning scheme and the code to run on separate threads or cores. More sophisticated approaches are possible. However, the goal of this article and sample code is to provide a simple starting point, not the best possible threading configuration for every situation.

Intel® TBB’s parallel_for takes 2 or 3 arguments. 

parallel_for ( range, body, optional partitioner ) 

The range, for this simplified line-based partitioning, is specified by:

blocked_range<int>(begin, end, grainsize)

This provides information to each thread about which lines of the image it is processing. It will automatically partition a range from begin to end in grainsize chunks.  For Intel® TBB the grainsize is automatically adjusted when ranges don't partition evenly, so it is easy to accommodate arbitrary sizes.

The body is the section of code to be parallelized. This can be implemented separately (including as part of a class); though for simple cases it is often convenient to use a lambda expression. With the lambda approach the entire function body is part of the parallel_for call. Variables to pass to this anonymous function are listed in brackets [alg, pSrc, pDst, stridesrc_8u, …] and range information is passed via blocked_range<int>& range.

This is a general threading abstraction which can be applied to a wide variety of problems.  There are many examples elsewhere showing parallel_for with simple loops such as array operations.  Tailoring for resize follows the same pattern.
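As a point of reference, a minimal array-scaling example of parallel_for with a lambda might look like the sketch below (illustrative only; the resize-specific version appears later in this article):

#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>

void Scale(float* data, int n, float factor)
{
	tbb::parallel_for( tbb::blocked_range<int>(0, n),
		[=]( const tbb::blocked_range<int>& range )
	{
		// each task processes its own chunk of the array
		for(int i = range.begin(); i != range.end(); ++i)
			data[i] *= factor;
	});
}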

External Parallelization for Intel® IPP Resize

A threaded resize can be split into tiles of any shape. However, it is convenient to use groups of rows where the tiles are the width of the image.

Each thread can query range.begin(), range.size(), etc. to determine offsets into the image buffer. Note: this starting point implementation assumes that the entire image is available within a single buffer in memory. 

The image resize functions introduced in Intel® IPP 7.1 and later versions take a new approach that has many advantages:

  • IppiResizeSpec holds precalculated coefficients based on the input/output resolution combination. Multiple resizes can be completed without recomputing them.
  • Separate functions for each interpolation method.
  • Significantly smaller executable size footprint with static linking.
  • Improved support for threading and tiled image processing.
  • For more information, please refer to the article Resize Changes in Intel® IPP 7.1.

Before starting resize, the offsets (number of bytes to add to the source and destination pointers to calculate where each thread’s region starts) must be calculated. Intel® IPP provides a convenient function for this purpose:

ippiResizeGetSrcOffset

This function calculates the corresponding offset/location in the source image for a location in the destination image. In this case, the destination offset is the beginning of the thread’s blocked range.

After this function it is easy to calculate the source and destination addresses for each thread’s current work unit:

pSrcT=pSrc+(srcOffset.y*stridesrc_8u);
pDstT=pDst+(dstOffset.y*stridedst_8u);

These are plugged into the resize function, like this:

ippiResizeLanczos_8u_C1R(pSrcT, stridesrc_8u, pDstT, stridedst_8u, dstOffset, dstSizeT, ippBorderRepl, 0, pSpec, localBuffer);

This specifies how each thread works on a subset of lines of the image. Instead of using the beginning of the source and destination buffers, pSrcT and pDstT provide the starting points of the regions each thread is working with. The height of each thread's region is passed to resize via dstSizeT. Of course, in the special case of 1 thread these values are the same as for a nonthreaded implementation.

Another difference to call out is that since each thread is doing its own resize simultaneously the same working buffer cannot be used for all threads. For simplicity the working buffer is allocated within the lambda function with scalable_aligned_malloc, though further efficiency could be gained by pre-allocating a buffer for each thread.

The following code snippet demonstrates how to set up resize within a parallel_for lambda function, and how the concepts described above could be implemented together.  

 Click here for full source code.

By downloading this sample code, you accept the End User License Agreement.

parallel_for( blocked_range<int>( 0, pnminfo_dst.imgsize.height, grainsize ),
            [pSrc, pDst, stridesrc_8u, stridedst_8u, pnminfo_src,
            pnminfo_dst, bufSize, pSpec]( const blocked_range<int>& range )
        {
            Ipp8u *pSrcT,*pDstT;
            IppiPoint srcOffset = {0, 0};
            IppiPoint dstOffset = {0, 0};

            // resized region is the full width of the image,
            // The height is set by TBB via range.size()
            IppiSize  dstSizeT = {pnminfo_dst.imgsize.width,(int)range.size()};

            // set up working buffer for this thread's resize
            Ipp32s localBufSize=0;
            ippiResizeGetBufferSize_8u( pSpec, dstSizeT,
                pnminfo_dst.nChannels, &localBufSize );

            Ipp8u *localBuffer =
                (Ipp8u*)scalable_aligned_malloc( localBufSize*sizeof(Ipp8u), 32);

            // given the destination offset, calculate the offset in the source image
            dstOffset.y=range.begin();
            ippiResizeGetSrcOffset_8u(pSpec,dstOffset,&srcOffset);

            // pointers to the starting points within the buffers that this thread
            // will read from/write to
            pSrcT=pSrc+(srcOffset.y*stridesrc_8u);
            pDstT=pDst+(dstOffset.y*stridedst_8u);


            // do the resize for greyscale or color
            switch (pnminfo_dst.nChannels)
            {
            case 1: ippiResizeLanczos_8u_C1R(pSrcT,stridesrc_8u,pDstT,stridedst_8u,
                        dstOffset,dstSizeT,ippBorderRepl, 0, pSpec,localBuffer); break;
            case 3: ippiResizeLanczos_8u_C3R(pSrcT,stridesrc_8u,pDstT,stridedst_8u,
                        dstOffset,dstSizeT,ippBorderRepl, 0, pSpec,localBuffer); break;
            default:break; //only 1 and 3 channel images
            }

            scalable_aligned_free((void*) localBuffer);
        });
 

As you can see, a threaded implementation can be quite similar to single threaded.  The main difference is simply that the image is partitioned by Intel® TBB to work across several threads, and each thread is responsible for groups of image lines. This is a relatively straightforward way to divide the task of resizing an image across multiple cores or threads.

Conclusion

Intel® IPP provides a suite of SIMD-optimized functions. Intel® TBB provides a simple but powerful way to handle threading in Intel® IPP applications. Using them together allows access to great vectorized performance on each core as well as efficient partitioning to multiple cores. The deeper level of control available with external threading enables more efficient processing and better performance. 

Example code: As with other  Intel® IPP sample code, by downloading you accept the End User License Agreement.

Intel® IPP - Threading / OpenMP* FAQ


In Intel® IPP 8.2 and later versions, the multi-threaded (internally threaded) libraries are deprecated due to issues with performance and interoperability with other threading models, but they remain available for legacy applications. Multi-threaded static and dynamic libraries are available as a separate download to support legacy applications. For new application development, it is highly recommended to use the single-threaded versions with application-level threading (as shown in the picture below).

Intel® IPP 8.2 and later installations place the single-threaded libraries in the following directory structure:

<ipp directory>/lib/ia32 – single-threaded static and dynamic libraries for the IA-32 architecture

<ipp directory>/lib/intel64 – single-threaded static and dynamic libraries for the Intel® 64 architecture

Static linking (Both single threaded and Multi-threaded libraries)             

  • Windows* OS: mt suffix in a library name (ipp<domain>mt.lib)
  • Linux* OS and OS X*: no suffix in a library name (libipp<domain>.a)

Dynamic Linking: Default (no suffix)

  • Windows* OS: ipp<domain>.lib
  • Linux* OS: libipp<domain>.so
  • OS X*: libipp<domain>.dylib

Q: Does Intel® IPP support external multi-threading? Is it thread-safe?

Answer: Yes, Intel® IPP supports external threading as shown in the picture below. The user has the option to use different threading models like Intel TBB, Intel Cilk Plus, Windows* threads, OpenMP*, or POSIX threads. All Intel® Integrated Performance Primitives functions are thread-safe.

Q: How do I get the Intel® IPP threaded libraries?

Answer: While installing Intel IPP, choose the 'custom' installation option. You will then get the option to select threaded libraries for different architectures.

To select the right package of threaded libraries, right-click it and enable the 'Install' option.

After selecting the threaded libraries, the selection will be marked with a check mark and the additional space required for the threaded libraries will be shown.

Threading in Intel® IPP 8.1 and earlier versions

Threading, within the deprecated multi-threaded add-on packages of the Intel® IPP library, is accomplished by use of the Intel® OpenMP* library. Intel® IPP 8.0 continues the process of deprecating threading inside Intel IPP functions that was started in version 7.1. Though not installed by default, the threaded libraries can be installed so code written with these libraries will still work as before. However, moving to external threading is recommended.

Q: How can I determine the number of threads the Intel IPP creates?
Answer: You can use the function ippGetNumThreads to find the number of threads created by the Intel IPP.

Q: How do I control the number of threads the Intel IPP creates?
Ans: Call the function ippSetNumThreads to set the number of threads created.

Q: Is it possible to prevent Intel IPP from creating threads?
Ans: Yes, if you are calling the Intel IPP functions from multiple threads, it is recommended to have Intel IPP threading turned off. There are 3 ways to disable multi-threading (see the sketch after this list):

  • Link to the non-threaded static libraries
  • Build and link to a custom DLL using the non-threaded static libraries
  • Call ippSetNumThreads(1)
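A minimal sketch of the legacy threading-control calls mentioned above (assumes the deprecated threaded libraries are linked; return-status checks omitted for brevity):

#include <ipp.h>
#include <stdio.h>

int main()
{
	ippInit();                    // dispatch to the best code path for this CPU

	int threads = 0;
	ippGetNumThreads(&threads);   // how many threads Intel IPP would use internally
	printf("Intel IPP internal threads: %d\n", threads);

	ippSetNumThreads(1);          // disable internal threading, e.g. when the
	                              // application does its own threading
	return 0;
}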

Q: When my application calls Intel IPP functions from a separate thread, the application hangs; how do I resolve this?

Ans: This issue occurs because the threading technology used in your application and the OpenMP threading used inside Intel IPP are incompatible. The ippSetNumThreads function was developed so that threading can be disabled in the dynamic libraries. Please also check the sections above for other ways to prevent Intel IPP functions from creating threads.

Q: Which Intel IPP functions contain OpenMP* code?

Ans: The "ThreadedFunctionsList.txt" file in the 'doc' folder of the product installation directory provides a detailed list of the threaded functions in the Intel IPP library. The list is updated in each release.

 

Please let us know if you have any feedback on deprecations via the feedback URL

 

Introducing Batch GEMM Operations


The general matrix-matrix multiplication (GEMM) is a fundamental operation in most scientific, engineering, and data applications. There is an everlasting desire to make this operation run faster. Optimized numerical libraries like Intel® Math Kernel Library (Intel® MKL) typically offer parallel high-performing GEMM implementations to leverage the concurrent threads supported by modern multi-core architectures. This strategy works well when multiplying large matrices because all cores are used efficiently. When multiplying small matrices, however, individual GEMM calls may not optimally use all the cores. Developers wanting to improve utilization usually batch multiple independent small GEMM operations into a group and then spawn multiple threads for different GEMM instances within the group. While this is a classic example of an embarrassingly parallel approach, making it run optimally requires a significant programming effort that involves threads creation/termination, synchronization, and load balancing. That is, until now. 

Intel MKL 11.3 Beta (part of Intel® Parallel Studio XE 2016 Beta) includes a new flavor of GEMM feature called "Batch GEMM". This allows users to achieve the same objective described above with minimal programming effort. Users can specify multiple independent GEMM operations, which can be of different matrix sizes and different parameters, through a single call to the "Batch GEMM" API. At runtime, Intel MKL will intelligently execute all of the matrix multiplications so as to optimize overall performance. Here is an example that shows how "Batch GEMM" works:

Example

Let A0, A1 be two real double precision 4x4 matrices; Let B0, B1 be two real double precision 8x4 matrices. We'd like to perform these operations:

C0 = 1.0 * A0 * B0^T, and C1 = 1.0 * A1 * B1^T

where C0 and C1 are two real double precision 4x8 result matrices. 

Again, let X0, X1 be two real double precision 3x6 matrices; Let Y0, Y1 be another two real double precision 3x6 matrices. We'd like to perform these operations:

Z0 = 1.0 * X0 * Y0^T + 2.0 * Z0, and Z1 = 1.0 * X1 * Y1^T + 2.0 * Z1

where Z0 and Z1 are two real double precision 3x3 result matrices.

We could have accomplished these multiplications using four individual calls to the standard DGEMM API. Instead, we use a single "Batch GEMM" call to do the same work with potentially better overall performance. We illustrate this using the "cblas_dgemm_batch" function in the example below.

#define    GRP_COUNT    2

MKL_INT    m[GRP_COUNT] = {4, 3};
MKL_INT    k[GRP_COUNT] = {4, 6};
MKL_INT    n[GRP_COUNT] = {8, 3};

MKL_INT    lda[GRP_COUNT] = {4, 6};
MKL_INT    ldb[GRP_COUNT] = {4, 6};
MKL_INT    ldc[GRP_COUNT] = {8, 3};

CBLAS_TRANSPOSE    transA[GRP_COUNT] = {CblasNoTrans, CblasNoTrans};
CBLAS_TRANSPOSE    transB[GRP_COUNT] = {CblasTrans, CblasTrans};

double    alpha[GRP_COUNT] = {1.0, 1.0};
double    beta[GRP_COUNT] = {0.0, 2.0};

MKL_INT    size_per_grp[GRP_COUNT] = {2, 2};

// Total number of multiplications: 4
double    *a_array[4], *b_array[4], *c_array[4];
a_array[0] = A0, b_array[0] = B0, c_array[0] = C0;
a_array[1] = A1, b_array[1] = B1, c_array[1] = C1;
a_array[2] = X0, b_array[2] = Y0, c_array[2] = Z0;
a_array[3] = X1, b_array[3] = Y1, c_array[3] = Z1;

// Call cblas_dgemm_batch
cblas_dgemm_batch (
        CblasRowMajor,
        transA,
        transB,
        m,
        n,
        k,
        alpha,
        a_array,
        lda,
        b_array,
        ldb,
        beta,
        c_array,
        ldc,
        GRP_COUNT,
        size_per_grp);



The "Batch GEMM" interface resembles the GEMM interface. It is simply a matter of passing arguments as arrays of pointers to matrices and parameters, instead of as matrices and the parameters themselves. We see that it is possible to batch multiplications of different shapes and parameters by packaging them into groups. Each group consists of multiplications of the same matrix shape (same m, n, and k) and the same parameters.

Performance

While this example does not show performance advantages of "Batch GEMM", when you have thousands of independent small matrix multiplications then the advantages of "Batch GEMM" become apparent. The chart below shows the performance of 11K small matrix multiplications with various sizes using "Batch GEMM" and the standard GEMM, respectively. The benchmark was run on a 28-core Intel Xeon processor (Haswell). The performance metric is Gflops, and higher bars mean higher performance or a faster solution.

The second chart shows the same benchmark running on a 61-core Intel Xeon Phi co-processor (KNC). Because "Batch GEMM" is able to exploit parallelism using many concurrent multiple threads, its advantages are more evident on architectures with a larger core count. 

Summary

This article introduces the new API for batch computation of matrix-matrix multiplications. It is an ideal solution when many small independent matrix multiplications need to be performed. "Batch GEMM" supports all precision types (S/D/C/Z). It has Fortran 77 and Fortran 95 APIs, and also CBLAS bindings. It is available in Intel MKL 11.3 Beta and later releases. Refer to the reference manual for additional documentation.  

 

Accelerating Financial Applications on Intel® architecture


Download PDF: Accelerating Financial Applications on Intel Architecture [PDF 575.55KB]

Download File: QuantLib_optimized_for_IA.tar.gz [TAR 522.48KB]

Abstract:
 

A paper titled Accelerating Financial Applications on the GPU compared GPU vs. CPU performance using four QuantLib library financial workloads. The paper reported significant GPU performance speedups, as high as 1,000X for a Monte-Carlo* workload compared to a single CPU thread. Upon closer inspection we found that the parallelization approach was not sufficient to properly utilize all of the parallel resources available on the CPU. We decided to conduct an in-depth performance analysis by optimizing the original CPU code and re-running the tests on the latest available GPU/CPU hardware. The results are significantly different from what was reported in the paper, and in some cases the CPU actually outperforms the GPU.

 

 

 


Intel® Xeon® Processor E7 v3 Product Family


Based on Intel® Core™ microarchitecture (formerly codenamed Haswell) and manufactured on 22-nanometer process technology, these processors provide a significant performance improvement over the previous-generation Intel Xeon processor E7 v2 product family. This is the first Intel® Xeon® processor product family that supports Intel® Transactional Synchronization Extensions (Intel® TSX).

For a more in-depth discussion of the key features and the architecture of the Intel® Xeon® E7 v3 product family see the technical overview document.

Key supported features you should be aware of, as a Software Developer:

  • Intel® Virtual Machine Control Structure (Intel® VMCS) Shadowing works by reducing the frequency with which the guest virtual machine monitor (VMM) requires assistance from the parent VMM. Its goal is to eliminate the VM-exits due to VMREAD and VMWRITE instructions executed by the guest VMM. The Intel and Citrix collaboration article provides a good description of the benefits of using Intel® VMCS Shadowing. A team at IBM® enabled this feature and gained significant performance improvements.
  • New reliability features include Enhanced Machine Check Architecture Generation 2 (eMCA2). Prior to eMCA2, errors were logged in architected registers and the OS/VMM (Virtual Machine Monitor) was informed, which restricted the platform firmware from performing fault diagnosis. eMCA2 allows errors (corrected and uncorrected) to first signal the BIOS/SMM (System Management Mode) before determining whether the errors need to be reported to the OS/VMM. Another important reliability feature is Memory Address Range Mirroring (MARM). MARM allows the BIOS or OS to determine a range of memory addresses to be mirrored instead of mirroring the entire memory space. More information about these features can be found here

Learn more about the Intel® Xeon® E7 v3 product family here. Also, these software vendors have already developed applications that are highly optimized to run on Intel® Xeon® processor E7 v3 family server platforms.

Download Intel System Studio 2016 Beta


Download Intel® System Studio 2016 Beta

Intel® System Studio 2016 Beta

Register and Download HERE

Note: if you are interested in a FreeBSD* OS based target of Intel System Studio 2016 Beta, or support for unreleased platforms please contact IntelSystemStudio@intel.com for more information.

We recommend the online installer option. It provides a speedy download with many options tailored for your unique host/target environment. You can email us directly at intelsystemstudio@intel.com if you have any questions about getting set up.


What’s New in the 2016 Beta?

We are pleased to announce the release of Intel® System Studio 2016 Beta which offers a wide variety of new features and advances described below. Please also visit our tips and tricks page for in-depth articles specifically on the new beta. We have written over 100 articles covering a complete embedded developer workflow: debugging, power/performance analysis, and compiler/libraries.

Support for the latest embedded/mobile platforms (additional upcoming platforms available under NDA)

  • Intel® Atom™ x3, x5, x7 SoC Processor Series
  • Intel® Core™ M Processors
  • Intel® Xeon® Processor E7 Family

Intel® C++ Compiler

What’s New:

Support and optimizations for

  • Enhanced C++11 feature support
  • Enhanced C++14 feature support
  • FreeBSD* support
  • Added support for Red Hat Enterprise Linux* 7
  • Deprecated Red Hat Enterprise Linux* 5.

ChromeOS* Target – New Component Support

  • Intel® Compiler 16.0
  • Intel® Integrated Performance Primitives 9.0
  • Intel® System Debugger

Clang Compiler and VTune Support for FreeBSD*

Intel® VTune™ Amplifier for Systems

What’s New:

  • Basic hotspots, Locks & Waits and EBS with stacks for RT kernel and RT application for Linux Targets
  • EBS based stack sampling for kernel mode threads
  • Support for Intel® Atom™ x7 Z8700 & x5 Z8500/X8400 processor series (Cherry Trail) including GPU analysis

Additional minor new features:

  • KVM guest OS profiling from host based on Linux Perf tool
  • Support for analysis of applications in virtualized environment (KVM). Requires Linux kernels > 3.2 and Qemu version > 1.4
  • Automated remote EBS analysis on SoFIA  (by leveraging existing sampling driver on target)
  • Super Tiny display mode added for the Timeline pane to easily identify problem areas for results with multiple processes/threads
  • Platform window replacing Tasks and Frames window and providing CPU, GPU, and bandwidth metrics data distributed over time
  • General Exploration analysis views extended to display a confidence indication (greyed-out font) for unreliable metrics data resulting, for example, from a low number of collected samples
  • GPU usage analysis for OpenCL™ applications extended to display compute-originated batch buffers on the GPU software queue in the Timeline pane (Linux* target only)
  • New filtering mode for command line reports to display data for the specified column names only

Intel® Inspector for Systems improvements

What’s New:

  • Added support for DWARF Version 4 symbolics.
  • Improved custom install directory process.
  • For Windows,
    • Added limited support for memory growth when analyzing applications containing Windows* fibers.

GNU GDB* Updates

What’s New:

  • GDB Features
    • The version of GDB provided as part of Intel® System Studio 2016 is based on GDB version 7.8. Notably, it contains the following features added by Intel:
  • Data Race Detection (pdbx):
    • Detect and locate data races for applications threaded using POSIX* threads
  • Branch Trace Store (btrace):
    • Record branches taken in the execution flow to backtrack easily after events like crashes, signals, exceptions, etc.
  • Pointer Checker:
    • Assist in finding pointer issues if the application is compiled with the Intel® C++ Compiler and has the Pointer Checker feature enabled (see the Intel® C++ Compiler documentation for more information)
  • Intel® Processor Trace (Intel® PT) Support:
    • Improved version of Branch Trace Store supporting Intel® TSX. For 5th generation Intel® Core™ Processors and later, access it via the command:
    • (gdb) record btrace pt
  • Note: these features are provided for the command line version only and are not supported via the Eclipse* IDE integration.

Intel® Debugger for Heterogeneous Compute 2016

The version of Intel® Debugger for Heterogeneous Compute 2016 provided as part of Intel® System Studio 2016 uses GDB version 7.6. It provides the following features:

  • Debugging applications containing offload enabled code to Intel® Graphics Technology
  • Eclipse* IDE integration

Intel® System Debugger

What’s New:

  • Windows* support for BIOS development
  • Support for Intel® Atom™ x7 Z8700 & x5 Z8500/X8400 processor series (Cherry Trail)
  • Several bug fixes and stability improvements

Intel® Threading Building Blocks

What’s New:

  • Added a C++11 variadic constructor for enumerable_thread_specific.
  • The arguments from this constructor are used to construct thread-local values.
  • Improved exception safety for enumerable_thread_specific.
  • Added documentation for tbb::flow::tagged_msg class and tbb::flow::output_port function.
  • Fixed build errors for systems that do not support dynamic linking.
  • C++11 move aware insert and emplace methods have been added to concurrent unordered containers

Intel® Integrated Performance Primitives

What’s New:

  • New APIs for external threading and APIs for external memory allocation
  • Special functions and optimizations for the latest Intel® Atom x3, x5, x7 Processor series

Intel® Math Kernel Library

What’s New:

  • New version of MKL Reference Manual highly tailored for each MKL function domain
  • Many improvements to smaller sized matrix multiplication on all platforms

Details

Intel® System Studio 2016 Beta targets development for Android*, Chrome OS*, FreeBSD*, Windows*, Embedded Linux*, Yocto Project*, Tizen* IVI, and Wind River Linux* deployment targets from Linux* or Windows* host. The Beta package is a collection of many different components providing a complete embedded/mobile developer workflow. At the time of download, you can select to install individual products or the full suite.

The first release of the beta will be available the last week of April 2015.

The second release of the beta will be available early June 2015.


Beta Duration and Schedule

The beta program officially ends August 26th, 2015. The beta license will expire August 31st, 2015. At the conclusion of the beta program, you will be asked to complete a survey regarding your experience with the beta software. Note that once you register for the beta, you will be notified automatically when updates are made available.


Support

Technical support will be provided via Intel® Premier Support. The Intel® Registration Center will be used to provide updates to the component products during this beta period.


How to enroll in the Beta program

Please go to the registration link to enroll.

Information collected during registration will be used to evaluate beta testing coverage. Here is a link to the Intel Privacy Policy.

Keep the beta product serial number provided for future reference.

After registration, you will be taken to the Intel Registration Center to download the product.

After registration, you will be able to download all available beta products at any time by returning to the Intel Registration Center.


Beta Webinars

Please stay tuned for announcements of upcoming webinars during the Intel® System Studio Beta program.

We look forward to additionally offering in-depth virtual training sessions on demand to beta program participants. Please contact us at IntelSystemStudio@intel.com for details.


Special Features and Known Issues

This section contains information on known issues (plus associated fixes) and special features of Intel® System Studio 2016 Beta. Check back often for updates.

For a full list of known issues of individual Intel® System Studio components, please refer to the individual component release notes.

Please see Chapter 7 of the following release notes for known issues at the time of the Beta launch. More will be listed on this page as the beta progresses.

Linux Target

https://software.intel.com/sites/default/files/managed/0a/d5/all-release-install.pdf

Windows Target

https://software.intel.com/sites/default/files/managed/01/b7/w-all-release-install.pdf


Next Steps

Review the Intel® System Studio 2016 Beta What’s New or check out the Release Notes.

Register for the beta program and install Intel® System Studio 2016 Beta

Try it out and share your experience with us!


Improve Server Application Performance with Intel® Advanced Vector Extensions 2


The Intel® Xeon® processor E7 v3 family now includes an instruction set called Intel® Advanced Vector Extensions 2 (Intel® AVX2), which can potentially improve application performance related to high performance computing, databases, and video processing. To validate this statement, I performed a simple experiment using the Intel® Optimized LINPACK benchmark. The results, as shown in Table 1, show a greater than 2x performance increase using Intel AVX2 vs. using Intel® Streaming SIMD Extensions (Intel® SSE). They also show an increase of 1.7x when comparing Intel AVX2 with Intel® Advanced Vector Extensions (Intel® AVX) instructions.

The results in Table 1 are from three different workloads running on Linux* (Intel AVX, Intel AVX2, and Intel SSE4). The last two columns show the performance gain from Intel AVX2 compared to Intel AVX or to Intel SSE4. Running with the combination of an Intel AVX2 optimized LINPACK and an Intel AVX2-capable processor, Intel AVX2 performed ~2.89x-3.49x better than Intel SSE while performing ~1.73x-2.12x better than Intel AVX. And these numbers were just an example of the potential performance boost for LINPACK. For other applications, the performance gain will vary depending on the optimized code and the hardware environment.

Table 1 – Results and Performance Gain from Running the LINPACK Benchmark on Quad Intel® Xeon® Processor E7-8890 v3.

Linux* LINPACK v11.2.2 | Intel® AVX2 (Gflops) | Intel® AVX (Gflops) | Intel® SSE4 (Gflops) | Performance Gain over Intel SSE4 | Performance Gain over Intel AVX
30K | 1835.83 | 867.065 | 525.38 | 3.49 | 2.12
75K | 2092.87 | 1211.89 | 724.40 | 2.89 | 1.73
100K | 2130.31 | 1224.44 | 731.42 | 2.91 | 1.74

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Configurations: Intel® Xeon® processor E7-8890 v3 @ 2.50GHz, 45MB L3 cache, 18 core pre-production system. 2x Intel® SSD DC P3700 Series @ 800GB, 256GB memory (32x8GB DDR4-2133MHz), BIOS by Intel Corporation Version: BRHSXSD1.86B.0063.R00.1503261059 (63.R00) BMC 70.7.5334 ME 2.3.0 SDR Package D.00, Power supply: 2x1200W NON-REDUNDANT, running Microsoft Windows* Server 2012 R2 / SLES 11 SP3 Linux*

For more information go to http://www.intel.com/performance

How to take advantage of Intel® AVX2 in existing vectorized code

Vectorized code that uses floating point operations can get a potential performance boost when running on newer platforms such as the Intel Xeon processor E7 v3 family by doing the following:

  1. Recompile the code, using the Intel® compiler with the proper Intel AVX2 switch to convert existing Intel SSE code. See the Intel® Compiler Options for Intel® SSE and Intel® AVX generation white paper  for more details. 
  2. Modify the code's function calls to leverage the Intel® Math Kernel Library (Intel® MKL), which is already optimized to use Intel AVX2 where supported.
  3. Use the Intel AVX2 intrinsic instructions. High-level language (such as C or C++) developers can use Intel® intrinsic instructions to make the calls and recompile code (a minimal intrinsics sketch follows this list). See the Intel® Intrinsic Guide and Intel® 64 and IA-32 Architectures Optimization Reference Manual for more details.
  4. Code in assembly instructions directly. Low level language (such as assembly) developers can use equivalent Intel AVX2 instructions from their existing Intel SSE code. See the Intel® 64 and IA-32 Architectures Optimization Reference Manual for more details.
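As an illustration of option 3, here is a minimal, hedged sketch (not taken from the benchmark above; the function and array names are illustrative assumptions) that uses AVX2/FMA intrinsics to compute c[i] = a[i]*b[i] + c[i] over double-precision arrays. It would typically be compiled with -xCORE-AVX2 (Intel compiler) or -mavx2 -mfma (gcc/clang):

#include <immintrin.h>

void fma_accumulate(const double *a, const double *b, double *c, int n)
{
    // Process 4 doubles per iteration using 256-bit YMM registers.
    for (int i = 0; i <= n - 4; i += 4) {
        __m256d va = _mm256_loadu_pd(&a[i]);
        __m256d vb = _mm256_loadu_pd(&b[i]);
        __m256d vc = _mm256_loadu_pd(&c[i]);
        vc = _mm256_fmadd_pd(va, vb, vc);   // c = a * b + c with a single rounding
        _mm256_storeu_pd(&c[i], vc);
    }
    // Handle any remaining elements (n not a multiple of 4) in scalar code.
    for (int i = n - (n % 4); i < n; ++i)
        c[i] = a[i] * b[i] + c[i];
}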

Equivalent instructions for Intel® AVX2, Intel® AVX, and Intel® SSE used in the tests

Table 2 lists the equivalent instructions for Intel AVX2, Intel AVX, and Intel SSE (SSE/SSE2/SSE3/SSE4) that may be useful for migrating code. It contains three sets of the instructions: the first set are equivalent instructions across all three instruction sets (Intel AVX2, Intel AVX, and Intel SSE); the second set are equivalent instructions across two instruction sets (Intel AVX2 and Intel AVX), and the last set are Intel AVX2 instructions.

Table 2 – Intel® AVX2, Intel® AVX, and Intel® SSE Equivalent Instructions

Intel® AVX and Intel® AVX2 | Equivalent Intel® SSE | Definition
VADDPD | ADDPD | Add packed double-precision floating-point values
VDIVSD | DIVSD | Divide low double-precision floating-point value in xmm2 by low double-precision floating-point value in xmm3/m64
VMOVSD | MOVSD | Move scalar double-precision floating-point value
VMOVUPD | MOVUPD | Move unaligned packed double-precision floating-point values
VMULPD | MULPD | Multiply packed double-precision floating-point values
VPXOR | PXOR | Logical exclusive OR
VUCOMISD | UCOMISD | Unordered compare scalar double-precision floating-point values and set EFLAGS
VUNPCKHPD | UNPCKHPD | Unpack and interleave high packed double-precision floating-point values
VUNPCKLPD | UNPCKLPD | Unpack and interleave low packed double-precision floating-point values
VXORPD | XORPD | Bitwise logical XOR of double-precision floating-point values

Intel® AVX and Intel® AVX2 | Definition
VADDSD | Add scalar double-precision floating-point values
VBROADCASTSD | Copy a 32-bit, 64-bit, or 128-bit memory operand to all elements of an XMM or YMM vector register
VCMPPD | Compare packed double-precision floating-point values
VCOMISD | Perform ordered comparison of scalar double-precision floating-point values and set flags in EFLAGS register
VINSERTF128 | Replace only half of a 256-bit YMM register with the value of a 128-bit source operand; the other half is unchanged
VMAXSD | Determine the maximum of scalar double-precision floating-point values
VMOVQ | Move quadword
VMOVUPS | Move unaligned packed single-precision floating-point values
VMULSD | Multiply scalar double-precision floating-point values
VPERM2F128 | Permute 128-bit floating-point fields in ymm2 and ymm3/mem using controls from imm8 and store result in ymm1
VPSHUFD | Permute 32-bit blocks of an int32 vector
VXORPS | Perform bitwise logical XOR operation on float32 vectors
VZEROUPPER | Set the upper half of all YMM registers to zero; used when switching between 128-bit use and 256-bit use

Intel® AVX2 | Definition
VEXTRACTF128 | Extract 128 bits of float data from ymm2 and store results in xmm1/mem
VEXTRACTI128 | Extract 128 bits of integer data from ymm2 and store results in xmm1/mem
VFMADD213PD | Multiply packed double-precision floating-point values from xmm0 and xmm1, add to xmm2/mem, and put result in xmm0
VFMADD213SD | Multiply scalar double-precision floating-point value from xmm0 and xmm1, add to xmm2/mem, and put result in xmm0
VFMADD231PD | Multiply packed double-precision floating-point values from xmm1 and xmm2/mem, add to xmm0, and put result in xmm0
VFMADD231SD | Multiply scalar double-precision floating-point value in xmm1 and xmm2/mem, add to xmm0, and put result in xmm0
VFNMADD213PD | Multiply packed double-precision floating-point values from xmm1 and xmm2/mem, negate the multiplication result, add to xmm0, and put the result in xmm0
VFNMADD213SD | Multiply the low packed double-precision floating-point value from the second source operand by the low packed double-precision floating-point value in the first source operand, add the negated infinite-precision intermediate result to the low packed double-precision floating-point value in the third source operand, perform rounding, and store the resulting packed double-precision floating-point value to the destination operand (first source operand)
VFNMADD231PD | Multiply packed double-precision floating-point values from ymm1 and ymm2/mem, negate the multiplication result, and add to ymm0; put the result in ymm0
VMAXPD | Determine the maximum of packed double-precision floating-point values
VPADDQ | Add packed quadword integers
VPBLENDVB | Conditionally blend byte elements of the source vector depending on bits in a mask vector
VPBROADCASTQ | Take qwords from the source operand and broadcast to all elements of the result vector
VPCMPEQD | Compare packed doublewords of two source vectors for equality
VPCMPGTQ | Compare packed quadwords of two source vectors for greater than

Table 2 lists just the instructions used in these tests. You can obtain the full list from the Intel® 64 and IA-32 Architectures Optimization Reference Manual. When the compiler is set to generate Intel AVX2 code, it will use instructions from all three instruction sets as needed.

Procedure for running LINPACK

  1. Download and install the following:
    1. Intel MKL – LINPACK Download
      http://software.intel.com/en-us/articles/intel-math-kernel-library-linpack-download
    2. Intel MKL
      http://software.intel.com/en-us/intel-math-kernel-library-evaluation-options
  2. Create input files for 30K, 75K, 100K from the “...\linpack” directory
  3. For optimal performance, make the following operating system and BIOS setting changes before running LINPACK:
    1. Turn off Intel® Hyper-Threading Technology (Intel® HT Technology) in the BIOS.
    2. For Linux, export the “MKL_CBWR=AVX2” setting on the command line and update the runme_xeon64 shell script file to use the input files you created.
    3. The results will be reported in Gflops, similar to Table 1.
  4. For Intel AVX runs, set the “MKL_CBWR=AVX” and repeat the above steps.
  5. For Intel SSE runs, set the “MKL_CBWR=SSE4_2” and repeat the above steps.

Platform Configuration

CPU & Chipset – Model/Speed/Cache: Intel® Xeon® processor E7-8890 v3 (code named Haswell-EX) (2.5GHz, 45MB) QGUA D0 Step
  • # of cores per chip: 18
  • # of sockets: 4
  • Chipset: (code named Patsburg) (J C1 step)
  • System bus: 9.6GT/s QPI
Platform – Brand/model: (code named Brickland)
  • Chassis: Intel 4U Rackable
  • Baseboard: code named Brickland, 3 SPC DDR4
  • BIOS: BRHSXSD1.86B.0063.R00.1503261059 (63.R00)
  • Dimm slots: 96
  • Power supply: 2x1200W NON-REDUNDANT
  • CD ROM: TEAC Slim
  • Network (NIC): 1x Intel® Ethernet Converged Network Adapter x540-T2 (code named "Twin Pond") (OEM-GEN)
Memory – Memory size: 256GB (32x8GB) DDR4 1.2V ECC 2133MHz RDIMMs; Brand/model: Micron MTA18ASF1G72PDZ-2G1A1HG; DIMM info: 8GB 2Rx8 PC4-2133P
Mass storage – Brand & model: Intel® S3700 Series SSD; Number/size/RPM/Cache: 2/800GB/NA
Operating system – Microsoft Windows* Server 2012 R2 / SLES 11 SP3 Linux*

Conclusion

From our LINPACK experiment, we see compelling performance benefits when going to an Intel AVX2-enabled Intel Xeon processor. In this specific case, we saw a performance increase of ~2.89x-3.49x for Intel AVX2 vs. Intel SSE and ~1.73x-2.12x for Intel AVX2 vs. Intel AVX in our test environment, which is a strong case for developers who have Intel SSE-enabled code and are weighing the benefit of moving to a newer Intel Xeon processor-based system with Intel AVX2. To learn how to migrate existing Intel SSE code to Intel AVX2 code, refer to the materials below.

References

Parallel Programming Books


Use these parallel programming resources to optimize with your Intel® Xeon® processor and Intel® Xeon Phi™ coprocessor.

High Performance Parallelism Pearls: Multicore and Many-core Programming Approaches ›
by James Reinders and James Jeffers | Publication Date: November 17, 2014 | ISBN-10: 0128021187 | ISBN-13: 978-0128021187

High Performance Parallelism Pearls shows how to leverage parallelism on processors and coprocessors with the same programming – illustrating the most effective ways to better tap the computational potential of systems with Intel® Xeon Phi™ coprocessors and Intel® Xeon® processors or other multicore processors.


Structured Parallel Programming: Patterns for Efficient Computation ›
by Michael McCool, James Reinders and Arch Robison | Publication Date: July 9, 2012 | ISBN-10: 0124159931 | ISBN-13: 978-0124159938

This book fills a need for learning and teaching parallel programming, using an approach based on structured patterns which should make the subject accessible to every software developer. It is appropriate for classroom usage as well as individual study.


Intel® Xeon Phi™ Coprocessor High Performance Programming ›
by Jim Jeffers and James Reinders – Now available!

The key techniques emphasized in this book are essential to programming any modern parallel computing system whether based on Intel® Xeon® processors, Intel® Xeon Phi™ coprocessors, or other high performance microprocessors.


Parallel Programming and Optimization with Intel® Xeon Phi™ Coprocessors ›
by Colfax International

This book will guide you to the mastery of parallel programming with Intel® Xeon® family products: Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors. It includes a detailed presentation of the programming paradigm for Intel® Xeon® product family, optimization guidelines, and hands-on exercises on systems equipped with the Intel® Xeon Phi™ coprocessors, as well as instructions on using Intel software development tools and libraries included in Intel Parallel Studio XE.


Intel® Xeon Phi™ Coprocessor Architecture and Tools: The Guide for Application Developers ›
by Reza Rahman

Intel® Xeon Phi™ Coprocessor Architecture and Tools: The Guide for Application Developers provides developers a comprehensive introduction and in-depth look at the Intel Xeon Phi coprocessor architecture and the corresponding parallel data structure tools and algorithms used in the various technical computing applications for which it is suitable. It also examines the source code-level optimizations that can be performed to exploit the powerful features of the processor.


Optimizing HPC Applications with Intel Cluster Tools: Hunting Petaflops ›
by Alexander Supalov

Optimizing HPC Applications with Intel® Cluster Tools takes the reader on a tour of the fast-growing area of high performance computing and the optimization of hybrid programs. These programs typically combine distributed memory and shared memory programming models and use the Message Passing Interface (MPI) and OpenMP for multi-threading to achieve the ultimate goal of high performance at low power consumption on enterprise-class workstations and compute clusters.

SGEMM for Intel® Processor Graphics


Introduction

General Matrix Multiply

cl_intel_subgroups Extension

OpenCL Implementation

Performance

Optimization Tips

Controlling the Sample

Conclusion

References

About the Authors

Download Code and PDF Version of the Article

Introduction

In this article we demonstrate how to optimize single-precision floating-point General Matrix Multiply (SGEMM) kernels for the best performance on Intel® Core™ Processors with Intel® Processor Graphics. We implemented our sample using OpenCL and rely heavily on Intel’s cl_intel_subgroups OpenCL extension. After a brief overview of General Matrix Multiply and the cl_intel_subgroups extension, we cover our implementation and summarize its performance on 4th and 5th Generation Intel® Core™ Processors with Intel® Processor Graphics.

We want to thank Brijender Bharti, Tom Craver, Ben Ashbaugh, Girish Ravunnikutty, Allen Hux, and Qually Jiang for their help in reviewing this article and the accompanying code.

General Matrix Multiply

From the Wikipedia page on Matrix Multiplication: “In mathematics, matrix multiplication is a binary operation that takes a pair of matrices, and produces another matrix”. Matrix multiplication is such a common operation, with a wide variety of practical applications, that it has been implemented in numerous programming languages. Since 1979, the Basic Linear Algebra Subprograms (BLAS) specification has prescribed a common set of routines for performing linear algebra operations, including matrix multiplication. Refer to the BLAS Wikipedia page for more details. BLAS functionality has three levels, and here we are going to consider Level 3, which contains matrix-matrix operations of the form:

C := alpha * op(A) * op(B) + beta * C, where op(X) is either X or its transpose, alpha and beta are scalars, and A, B, and C are matrices.

The single-precision General Matrix Multiply (SGEMM) sample we present here shows how to efficiently utilize OpenCL to perform a general matrix multiply operation on two dense square matrices. We developed our sample to target 4th and 5th Generation Intel® Core™ Processors with Intel® Processor Graphics. Our implementation relies on Intel’s cl_intel_subgroups extension to OpenCL to optimize matrix multiplication for more efficient data sharing.

cl_intel_subgroups Extension

From the cl_intel_subgroups extension specification page: “The goal of this extension is to allow programmers to improve the performance of their applications by taking advantage of the fact that some work items in a work group execute together as a group (a "subgroup"), and that work items in a subgroup can take advantage of hardware features that are not available to work items in a work group.  Specifically, this extension is designed to allow work items in a subgroup to share data without the use of local memory and work group barriers, and to utilize specialized hardware to load and store blocks of data.”

The size of a subgroup is equal to the SIMD width. (Note that code targeting Intel® Processor Graphics can be compiled SIMD-8, SIMD-16, or SIMD-32 depending on the size of the kernel, which means that 8, 16, or 32 work items, respectively, can fit on a hardware thread of the Execution Unit (EU). For a deeper introduction please see section 5.3.5, SIMD Code Generation for SPMD Programming Models, of Stephen Junkins’ excellent paper “The Compute Architecture of Intel® Processor Graphics Gen8”.) For example, if the kernel is compiled SIMD-8, then a subgroup is made up of 8 work items that share the 4 KB of register space of a hardware thread and execute together. Programmers can use the kernel function get_sub_group_size to determine the size of the subgroup.

We mainly use two kernel functions in this sample: intel_sub_group_shuffle and intel_sub_group_block_read. We use intel_sub_group_shuffle to share data between work items in a subgroup; we use intel_sub_group_block_read to read a block of data for each work item in a subgroup from a source image at a specific location.

Let’s take a look at the code below. Assume the subgroup size is 8. The block read function intel_sub_group_block_read4 reads four uints of data from a source image in row-major order and, after conversion to four floats, stores the values into the blockA private variable of each work item in the subgroup. The data for the first work item is shown as four blue blocks in the first column (see the diagram below). Then we read the value of the private variable blockA of the work item whose subgroup local id is provided as the second parameter to intel_sub_group_shuffle into the variables acol0-7. After performing eight subgroup shuffles, each work item has the full data of a 4 by 8 block. For a detailed explanation of the intel_sub_group_shuffle and intel_sub_group_block_read functions and their operation, please refer to the cl_intel_subgroups extension specification page.

intel_sub_group_shuffle

OpenCL Implementation

The gemm.cl file provided with the sample contains several different implementations that demonstrate how to optimize the SGEMM kernels for Intel® Processor Graphics. We start with a naïve kernel and follow with kernels using local memory and kernels using the cl_intel_subgroups extension. Since using local memory is a common practice when optimizing an OpenCL kernel, we focus on kernels using the cl_intel_subgroups extension. At the same time we also use the well-known practice of tiling (or blocking) in these kernels, where matrices are divided into blocks and the blocks are multiplied separately to maintain better data locality. We tested the performance of our kernels on a 4th Generation Intel® Core™ Processor with Intel® Iris™ Pro Graphics 5200 (40 EUs running at 1.3GHz, with a theoretical peak compute1 of 832 Gflops) and a 5th Generation Intel® Core™ Processor with Intel® Iris™ Graphics (23 EUs running at 900MHz, with a theoretical peak compute of 331 Gflops), running the SUSE Linux Enterprise Server* (SLES) 12 GM operating system and the Intel® Media Server Studio 16.4.2 release.

The naming convention of the kernels is as follows: optimizationMethod_blockHeight x blockWidth_groupHeight x groupWidth. The matrices in the kernels are in column-major order.

Naïve Kernel

Let’s take a look at the naïve implementation first, which is fairly similar to the original C version with just the addition of a few OpenCL C qualifiers, and with the outermost loop replaced by a parallel kernel. The compute efficiency of the naïve kernel is only about 2%~3% in our test environment.
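For reference, here is a minimal sketch of such an "original C version" (an illustrative assumption, not the sample's actual code): a plain triple loop over column-major N x N matrices with unit alpha and zero beta. In the naïve OpenCL kernel the two outer loops become the 2D NDRange, and each work item computes a single element of C.

#include <stddef.h>

void gemm_serial(const float *A, const float *B, float *C, size_t N)
{
    for (size_t j = 0; j < N; ++j)          /* column of C */
        for (size_t i = 0; i < N; ++i) {    /* row of C */
            float sum = 0.0f;
            for (size_t k = 0; k < N; ++k)
                sum += A[k * N + i] * B[j * N + k];   /* column-major indexing */
            C[j * N + i] = sum;
        }
}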

There are two major issues in the naïve code.

  1. Global memory is accessed repeatedly without any data reuse/sharing;

  2. Each work item only calculates one output without using register space wisely;

gemm_naive kernel

Kernels using Local Memory

L3_SLM_xx_xx kernels load matrix A into local memory, synchronize work items with a barrier, and then proceed with the computation. Using local memory is a common optimization to avoid repeated global memory access. The compute efficiency of these kernels is about 50% in our test environment.

Kernels using cl_intel_subgroups Extension

We developed two different types of kernel using cl_intel_subgroups extension. One is L3_SIMD_xx_xx loading data from a regular OpenCL buffer, and the other one is block_read_xx_xx reading a block of data from OpenCL image2D by using intel_sub_group_block_read. Kernels differ due to different ways of input data access, but the basic idea of data sharing in a subgroup is similar. Let’s take block_read_4x1_1x8 kernel as an example. Note that in this example tiling sizes are too small to be efficient and are used for illustration purposes only. According to the above naming convention the kernel handles 4 * 1 floats in a work item and a work group size is (1, 8).  The following picture shows how the kernel works in a subgroup. The partial kernel code is also shown in “cl_intel_subgroups Extension” section.

This kernel is compiled SIMD-8, thus the subgroup size is 8. In matrix A, a float4 is first read by intel_sub_group_block_read4 in row-major order (refer to the 4 blue blocks in matrix A), then intel_sub_group_shuffle is called to share the adjacent 7 columns of float4 from work items 1~7 in the 1st subgroup (refer to the 28 red blocks in matrix A). In matrix B, a float8 is read by intel_sub_group_block_read8 in row-major order as well (refer to the 8 blue blocks in matrix B). We need to read a float8 from matrix B because 8 columns of data from matrix A can be obtained via the shuffle function. After that we can do the sub-matrix multiplication (4 * 8) * (8 * 1) and get the partial result of sub-matrix C (4 * 1). This is the 1st read and calculation in the 1st work item.

Then we move to the next block of (4 * 8) in column-major order from matrix A and to the next block of (8 * 1) in row-major order from matrix B (refer to the white blocks in matrices A and B). In other words, in the 1st work item we walk across the first 4 rows of matrix A and walk down the 1st column of matrix B, finally obtaining the 1st (4 * 1) block of matrix C (refer to the 4 blue blocks in matrix C). In the following work items, we move to the next 4 rows of matrix A or the next column of matrix B.

The tiling parameters TILE_M, TILE_K and TILE_N decide the partial matrix sizes of three large matrices in a work group. In a work-group matrix C’s size is TILE_M by TILE_N elements, matrix A’s size is TILE_M by TILE_K elements and matrix B’s size is TILE_K by TILE_N elements. In the current implementation work group size is (1 * 8), the size of a work-item sub-matrix C is TILE_M/1 by TILE_N/8 elements, the size of a sub-matrix A is TILE_M/1 by (TILE_K/8 * 8) elements and the size of a sub-matrix B is TILE_K/1 by TILE_N/8 elements. For a sub-matrix A, we need to multiply by 8 because we share eight columns of float4 from eight work items in a subgroup by using shuffle function. Thus in this kernel TILE_M = 4, TILE_K is 8 and TILE_N is 8.

Tiled Matrix Multiply

When we apply the cl_intel_subgroups extension the kernel performance improves further. The compute efficiency of L3_SIMD_xx_xx is about 60% in our test environment. The compute efficiency of block_read_xx_xx is about 80% on the 5th Generation Intel® Processors, and 65~70% on the 4th Generation Intel® Processors. We will discuss performance of these kernels in the next section.

Performance

Here is the kernel performance comparison between different implementations of SGEMM on a 4th Generation Intel® Core™ Processor with Intel® Iris™ Pro Graphics 5200, which contains 40 EUs running at 1.3GHz, and a 5th Generation Intel® Core™ Processor with Intel® Iris™ Graphics, which contains 23 EUs running at 900MHz, on a SLES 12 GM OS and MSS 16.4.2 release. The block_read_32x2_8x1 and block_read_32x1_8x1 show about 90% compute efficiency on a 5th Generation Intel® Core™ Processor.

SGEMM Kernels Performance HSW

SGEMM Kernels Performance BDW

Optimization Tips

Impact of Barriers and Work Group Size on Performance in Non-local Memory Kernels

The built-in function barrier(CLK_LOCAL_MEM_FENCE) is commonly used in kernels with local memory, but in non-local memory kernels it may also provide performance benefits when matrix size is too large to fit into cache. Let’s take a look at L3_SIMD_8x4_1x8, L3_SIMD_8x4_8x8 and L3_SIMD_8x4_8x8_barrier. L3_SIMD_8x4_1x8 is the basic implementation, and work group size is enlarged from (1 * 8) to (8 * 8) in L3_SIMD_8x4_8x8. L3_SIMD_8x4_8x8_barrier adds a barrier after loading matrix B to make the work items of a work group stay in synch for better L3 cache use. Let’s compare the performance when matrix size reaches 1K. The performance improvement can be seen in the following graphs.

Impact of Tiling Parameters on Performance

The tiling technique generally provides a speedup, but we need to avoid performance degradation due to overuse of private memory, which exhausts register space. On the 4th and 5th Generation Intel® Core™ Processors each EU thread has 128 general-purpose registers. Each register stores 32 bytes, accessible as an 8-element vector of 32-bit data elements, for a total of 4KB. Thus each work item in an OpenCL kernel has access to up to 128, 256, or 512 bytes of register space (for SIMD-32, SIMD-16, or SIMD-8 compilation, respectively). If the tiling parameters are so large that they exceed this private memory budget, performance will suffer.

It is hard to find tiling parameters that show good performance across all matrix sizes: some tiling parameters run faster on one matrix size but slower on others. You may also use an auto-tuning code generator to try other tiling parameters and get the best performance on your device. Another limitation of large tiling parameters is that the matrix size must be aligned with the tiling size. Please compare the performance between kernels like L3_SIMD_32x2_1x8, L3_SIMD_16x2_1x8 and L3_SIMD_8x4_1x8 on different matrix sizes.

SGEMM Kernels Optimization with Intel® VTune Amplifier XE

We highly recommend using Intel® VTune™ Amplifier XE to gain a deeper understanding of application performance on Intel® Processor Graphics. Here we focus on the performance analysis of different optimizations of the SGEMM kernel on Intel® Processor Graphics. See the articles by Julia Fedorova in the References section to understand overall OpenCL performance analysis on Intel® Processor Graphics. In this sample we chose the Advanced Hotspots analysis of Intel® VTune™ Amplifier XE, enabled the Analyze GPU Usage option, chose Overview under Analyze Processor Graphics events, enabled OpenCL profiling on the GPU, and selected the option Trace OpenCL kernels on Processor Graphics.

The testing matrix size is (1024 * 1024) and all the kernels are executed once on Windows 8.1 running on a 4th Generation Intel® Core™ Processor with Intel® HD Graphics 4400, which has 20 EUs running at 600MHz. Here is the VTune screenshot of various SGEMM kernel runs sorted by execution time:

SGEMM Kernels in VTune

Let’s take gemm_naive as an example. First, look at the number of compute threads started (hardware threads), which is ~65536; the formula is Global_size / SIMD_width. Second, the EU Array stall rate is highlighted in red together with a large number of L3 misses, which means the EUs are waiting ~30% of the time for data from memory. From this information we can infer that about 20% of the 65536 software threads are not doing productive work during kernel execution. Due to the large number of software threads and the high EU stall rate, the kernel performs poorly.

Since SGEMM should not be memory bandwidth-bound, we will try to optimize memory accesses and layout. We use common optimization techniques like coalescing memory accesses and utilizing shared local memory (SLM). cl_intel_subgroups extension provides another avenue for the optimization. The basic idea is to share the data and let each work item do more work, then the ratio between loading data and computation is more balanced. At the same time using vector data types and block reads also improves the memory access efficiency.

As you can see from the table above, the kernel block_read_32x2_1x8 has the best performance. The EU Array stalls are only 7.3% with 2048 software threads launched. Although each work item takes some time to calculate a block of 32*2 floats, it is likely to hide the memory read stall. The block read and shuffle functions provide efficient memory access, at the same time tiling size of the kernel won’t exhaust the register space. We could also compare the data between L3_SIMD_8x4_1x8, L3_SIMD_8x4_8x8 and L3_SIMD_8x4_8x8_barrier. The optimization mentioned in the 1st tip provides better cache performance. L3_SIMD_8x4_8x8 and its barrier version get the benefits from fewer L3 misses and lower rate of EU Array stalls due to the synchronization in the work group.

Controlling the Sample

  • -h, --help
    Show this help text and exit.
  • -p, --platform <number-or-string>
    Selects the platform whose devices are used. (Default value: Intel)
  • -t, --type all | cpu | gpu | acc | default | <OpenCL constant for device type>
    Selects the device type on which the OpenCL kernel is executed. (Default value: gpu)
  • -d, --device <number-or-string>
    Selects the device on which everything is executed. (Default value: 0)
  • -M, --size1 <integer>
    Rows of the 1st matrix, in elements. (Default value: 0)
  • -K, --size2 <integer>
    Columns of the 1st matrix and rows of the 2nd matrix, in elements. (Default value: 0)
  • -N, --size3 <integer>
    Columns of the 2nd matrix, in elements. (Default value: 0)
  • -i, --iterations <integer>
    Number of kernel invocations. For each invocation, performance information is printed. Zero is allowed: in this case no kernel invocation is performed, but all other host-side setup is still created. (Default value: 10)
  • -k, --kernel naive | L3_SIMD_32x2_1x8 | L3_SIMD_16x2_1x8 | L3_SIMD_16x2_4x8 | L3_SIMD_8x4_1x8 | L3_SIMD_8x4_8x8 | L3_SIMD_8x4_8x8_barrier | L3_SLM_8x8_8x16 | L3_SLM_8x8_4x16 | L3_SLM_8x8_16x16 | block_read_32x2_1x8 | block_read_32x1_1x8
    Determines which kernel is used for the multiplication. There are several supported kernels, from the naïve implementation to versions optimized for Intel GPUs; both matrices A and B are in column-major form. (Default value: NULL)
  • -v, --validation
    Enables the validation procedure on the host (slow for big matrices). (Default: disabled)

  1. Peak compute is different for each product SKU and is calculated as follows: (MUL + ADD) x Physical SIMD x Num FPUs x Num EUs x Clock Speed, where Physical SIMD is 4 for Intel® Processors with Intel® Processor Graphics.

 

Conclusion

In this article we demonstrated how to optimize the single-precision floating-point General Matrix Multiply (SGEMM) algorithm for the best performance on Intel® Core™ Processors with Intel® Processor Graphics. We implemented our sample using OpenCL and relied heavily on Intel’s cl_intel_subgroups OpenCL extension. When used properly, the cl_intel_subgroups OpenCL extension provides an excellent performance boost to SGEMM kernels.

References

  1. Wikipedia page on Matrix Multiplication

  2. BLAS Wikipedia page

  3. “The Compute Architecture of Intel® Processor Graphics Gen8” by Stephen Junkins

  4. cl_intel_subgroups extension specification page

  5. Intel® VTune™ Amplifier XE: Getting started with OpenCL* performance analysis on Intel® HD Graphics by Julia Fedorova

  6. Analyzing OpenCL applications with Intel® VTune™ Amplifier 2015 XE Webinar by Julia Fedorova (Please use Internet Explorer to view the videos)

  7. Intel® VTune™ Amplifier 2015

  8. Optimizing Simple OpenCL Kernels: Modulate Kernel Optimization by Robert Ioffe

  9. Optimizing Simple OpenCL Kernels: Sobel Kernel Optimization by Robert Ioffe

About the Authors

Lingyi Kong is a Software Engineer at Intel’s IT Flex Services Group. He is an expert in GPU programming and optimization, and also has Graphics driver/runtime development experience on Intel® Iris and Intel® Iris Pro Graphics.

Robert Ioffe is a Technical Consulting Engineer at Intel’s Software and Solutions Group. He is an expert in OpenCL programming and OpenCL workload optimization on Intel Iris and Intel Iris Pro Graphics with deep knowledge of Intel Graphics Hardware. He was heavily involved in Khronos standards work, focusing on prototyping the latest features and making sure they can run well on Intel architecture. Most recently he has been working on prototyping Nested Parallelism (enqueue_kernel functions) feature of OpenCL 2.0 and wrote a number of samples that demonstrate Nested Parallelism functionality, including GPU-Quicksort for OpenCL 2.0. He also recorded and released two Optimizing Simple OpenCL Kernels videos and a third video on Nested Parallelism.

You might also be interested in the following articles:

Optimizing Simple OpenCL Kernels: Modulate Kernel Optimization

Optimizing Simple OpenCL Kernels: Sobel Kernel Optimization

Sierpiński Carpet in OpenCL 2.0

GPU-Quicksort in OpenCL 2.0: Nested Parallelism and Work-Group Scan Functions

Download Code and PDF Version of the Article

Intel® Xeon® Processor E7-8800/4800 V3 Product Family Technical Overview


Contents

1.     Executive Summary
2.     Introduction
3.     Intel® Xeon® Processor E7-8800/4800 v3 Product Family Enhancements
3.1    Intel® Advanced Vector Extensions 2 (Intel® AVX2)
3.2    Haswell New Instructions (HNI)
3.3    Intel® Transactional Synchronization Extensions (Intel® TSX)
3.4    Support for DDR4 memory
3.5    Power Improvements
3.6    New RAS features
4.     Brickland platform improvements
4.1    Virtualization features
4.2    New Security features
4.3    Intel® Node Manager 3.0
5      Conclusion
About the Author

 

1.   Executive Summary 

The Intel® Xeon® processor E7-8800/4800 v3 product family, formerly codenamed “Haswell EX”, is a 4-socket platform based on Intel’s most recent microarchitecture, the new “TOCK,” which is based on 22nm process technology. The new processor brings additional capabilities for business intelligence, database and virtualization applications. Platforms based on the Intel Xeon processor E7-8800/4800 v3 product family yield up to 40% average improvement in performance1 compared to the previous generation, “Ivy Bridge EX.”

The latest generation processor has many new hardware and software features. On the hardware side its additional cores and memory bandwidth, DDR4 memory support, power enhancements, virtualization enhancements and some security enhancements (System Management Mode external call trap) can improve application performance significantly without any code changes. On the software side it has Haswell New Instructions (HNI), Intel® Transactional Synchronization Extensions (Intel® TSX), and Intel® Advanced Vector Extensions 2 (Intel® AVX2). Developers must enable these software features in their applications. Haswell-EX also brings additional reliability, availability, and serviceability (RAS) capabilities such as address-based mirroring for granular control of critical memory regions improving uptime.

 

2.   Introduction 

The Intel Xeon processor E7-8800/4800 v3 product family is based on the Haswell microarchitecture and has several enhancements over the Ivy Bridge EX microarchitecture. The platform supporting the Intel Xeon processor E7-8800/4800 v3 product family is based on the Intel C602J Chipset (codenamed “Brickland”). This paper discusses the new features available in the latest product family compared to the previous one. Each section includes information about what developers need to do to take advantage of the new features to improve application performance, security, and reliability.

 

3.   Intel® Xeon® Processor E7-8800/4800 v3 Product Family Enhancements 

Figure 1 shows an overview of the Intel Xeon processor E7-8800/4800 v3 product family microarchitecture. Processors in the family have up to 18 cores (compared to 15 cores in the predecessor), which adds additional computing power. They also have faster, larger cache and more memory bandwidth.

[1] Up to 40% average performance improvement claim based on the geometric mean of 12 key benchmark workloads comparing 4-socket servers using Intel® Xeon® processor E7-8890 v3 to similarly configured Intel® Xeon® processor E7-4890 v2. Source: Internal Technical Reports.  See http://www.intel.com/performance/datacenter for more details.


Figure 1: Intel® Xeon® processor E7-8800/4800 v3 product family overview

The Intel Xeon processor E7-8800/4800 v3 product family includes the following new features:

  1. Intel® Advanced Vector Extensions 2 (Intel® AVX2)
  2. Haswell New Instructions (HNI)
  3. Intel® Transactional Synchronization Extensions (Intel® TSX)
  4. Support for DDR4 memory
  5. Power management feature improvements
  6. New RAS features

Table 1 compares latest and previous generations of product families.

Table 1. Feature Comparison of the Intel® Xeon® processor E7-8800/4800 v3 product family to the Intel® Xeon® processor E7-4800 v2 product family

Features | Intel® Xeon® processor E7-8800/4800/2800 v2 product family (Ivy Bridge-EX) | Intel® Xeon® processor E7-8800/4800 v3 product family (Haswell-EX)
Socket | R1 | R1
Cores | Up to 15 | Up to 18
Process technology | 22 nm | 22 nm
TDP | 155W max | 165W max (includes Integrated Voltage Regulator)
Intel® QuickPath Interconnect (Intel® QPI) ports/speed | 3x Intel QPI v1.1, 8.0 GT/s max. | 3x Intel QPI v1.1, 9.6 GT/s max.
Core addressability | 46 bit / 48 bit virtual | 46 bit / 48 bit virtual
Last Level Cache size | Up to 37.5MB | Up to 45MB
Memory DDR4 speeds | N/A | Perf Mode: 1333, 1600 MT/s; Lockstep Mode: 1333, 1600, 1866 MT/s
Memory DDR3 speeds | Perf Mode: 1066, 1333 MT/s; Lockstep Mode: 1066, 1333, 1600 MT/s | Perf Mode: 1066, 1333, 1600 MT/s; Lockstep Mode: 1066, 1333, 1600 MT/s
VMSE speeds | Up to 2667 MT/s | Up to 3200 MT/s
DIMMs/Socket | 24 DIMMs (3 DIMMs per DDR3 channel) | 24 DIMMs (3 DIMMs per DDR4 & DDR3 channel)
RAS | Westmere-EX baseline features + Enhanced Machine Check Architecture (eMCA) Gen 1 + MCA recovery – Execution Path + MCA recovery – IO + PCIe LER | Ivy Bridge-EX baseline + eMCA Gen2 + Address-Based Memory Mirroring + Multiple Rank Sparing + DDR4 recovery for command & parity errors
Intel® Integrated I/O | 32 PCIe* 3.0, 1x x4 DMI2 | 32 PCIe 3.0, 1x x4 DMI2
Security | Same on both families: Intel® Trusted Execution Technology, Intel® Advanced Encryption Standard New Instructions, Intel® Platform Protection with Intel® OS Guard, Intel® Data Protection Technology with Intel® Secure Key

The rest of this paper discusses some of the main enhancements in the latest product family.

 

3.1   Intel® Advanced Vector Extensions 2 (Intel® AVX2) 

While Intel AVX extended all floating-point vector instructions from 128 bits to 256 bits, Intel AVX2 extends integer vector instructions to 256 bits as well. Intel AVX2 uses the same 256-bit YMM registers as Intel AVX. Intel AVX2 adds the fused multiply-add (FMA), gather, shift, and permute instructions and is designed to benefit high performance computing (HPC), database, and audio and video applications.

The fused multiply-add (FMA) instruction computes ±(a×b)±c with only one rounding. The a×b intermediate result is not rounded, so this instruction brings increased accuracy compared to separate MUL and ADD instructions. FMA increases the performance and accuracy of many floating-point computations such as matrix multiplication, dot products, and polynomial evaluation. With 256 bits, you can perform 8 single-precision or 4 double-precision FMA operations at once. Since FMA combines two operations into one, it is possible to perform more floating-point operations per cycle; additionally, because Haswell has two FMA units, the peak FLOPS are doubled.

The gather instruction loads sparse elements to a single vector. It can gather 8 single precision (Dword) or 4 double precision (Qword) data elements into a vector register in a single operation. There is a base address that points to the data structure in memory. Index (offset) gives the offset of each element from the base address. The mask register tracks which element needs to be gathered. The gather operation is complete when the mask register contains all zeros.
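As a hedged illustration of the gather operation (not taken from the text above; the array contents and indices are assumptions), the sketch below uses the AVX2 gather intrinsic to load four non-contiguous double-precision elements into one YMM register. It would typically be compiled with -mavx2 (gcc/clang) or -xCORE-AVX2 (Intel compiler):

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    double table[16];
    for (int i = 0; i < 16; ++i)
        table[i] = i * 1.5;

    /* Four 32-bit indices selecting elements 0, 5, 10, and 15. */
    __m128i idx = _mm_setr_epi32(0, 5, 10, 15);

    /* Gather the four sparse doubles in a single operation.
       The last argument (8) is the element size in bytes. */
    __m256d v = _mm256_i32gather_pd(table, idx, 8);

    double out[4];
    _mm256_storeu_pd(out, v);
    printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
    return 0;
}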

Other new operations in Intel AVX2 include an integer version of permute instructions and new broadcast and blend instructions.

 

3.2   Haswell New Instructions (HNI) 

Haswell New Instructions include 4 crypto instructions to speed up public key and SHA encryption algorithms and 12 bit-manipulation instructions to speed up compression and signal processing algorithms. The bit manipulation instructions (BMI) perform arbitrary bit-field manipulations, leading and trailing zero bit counts, trailing set-bit manipulations, and improved rotates and arbitrary-precision multiplies. They speed up algorithms that perform bit-field extraction and packing, bit-granular encoded data processing (universal coding in compression algorithms), arbitrary-precision multiplication, and hashing.
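As a hedged illustration (an assumed example, not from the text above), the sketch below exercises two of these bit-manipulation intrinsics: _bextr_u64 (BMI1) to extract a bit field and _pext_u64 (BMI2) to pack selected bits into the low end of a register. It would typically be compiled with -mbmi -mbmi2 (gcc/clang) or an equivalent Intel compiler option:

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    unsigned long long word = 0x0123456789ABCDEFULL;

    /* Extract 8 bits starting at bit 16 (yields 0xAB for this value). */
    unsigned long long field = _bextr_u64(word, 16, 8);

    /* Gather the low bit of every nibble into contiguous low bits. */
    unsigned long long packed = _pext_u64(word, 0x1111111111111111ULL);

    printf("field=0x%llx packed=0x%llx\n", field, packed);
    return 0;
}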

To use HNI, you need an updated compiler, as shown in Table 2 below.

 

3.3   Intel® Transactional Synchronization Extensions (Intel® TSX) 

Intel TSX transactionally executes lock-protected critical sections. It executes instructions without acquiring a lock, thereby exposing hidden concurrency. Hardware manages transactional updates to registers and memory, so everything looks atomic from a software perspective. If the transactional execution fails, hardware rolls back and restarts execution of the critical section.

Intel TSX has two different interfaces:

Hardware Lock Elision (HLE) – includes two prefixes, XACQUIRE/XRELEASE. Software uses these legacy-compatible hints to identify critical sections. The hints are ignored on legacy hardware. Instructions are executed transactionally without acquiring a lock; an abort causes re-execution without elision.

Restricted Transactional Memory (RTM) – allows software to use two new instructions, XBEGIN/XEND, to specify critical sections. This is similar to HLE but gives software a more flexible interface for performing lock elision. If the transaction aborts, control is transferred to the target specified by the XBEGIN operand. The software fallback handler can implement any number of policies, such as exponential back-off; what action to take is up to the developer and depends on the workload. XTEST and XABORT are additional instructions: XTEST allows software to determine quickly whether it is executing within a transaction, and XABORT explicitly aborts a transaction.

Intel TSX has a simple and clean ISA interface. This is particularly useful for shared memory multithreaded applications that employ lock-based synchronization mechanisms. For more details, please visit http://software.intel.com/en-us/blogs/2012/02/07/transactional-synchronization-in-haswell/
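As a hedged illustration of the RTM interface (an assumed example, not from the article above), the sketch below increments a shared counter inside a hardware transaction via the _xbegin/_xend intrinsics and falls back to a conventional lock if the transaction aborts. It would typically be compiled with -mrtm (gcc/clang) or an equivalent Intel compiler option and run on Intel TSX-capable hardware:

#include <immintrin.h>
#include <pthread.h>

static long counter;
static pthread_mutex_t fallback_lock = PTHREAD_MUTEX_INITIALIZER;

void increment(void)
{
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        /* Executed transactionally; conflicting accesses cause an abort.
           A production lock-elision scheme would also read the fallback
           lock here and abort if it is currently held. */
        counter++;
        _xend();   /* commit the transaction */
    } else {
        /* Transaction aborted: fall back to the conventional lock. */
        pthread_mutex_lock(&fallback_lock);
        counter++;
        pthread_mutex_unlock(&fallback_lock);
    }
}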

For more details on Intel AVX2, HNI, and Intel TSX, please refer to the Intel® architecture instruction set extensions programming reference manual at https://software.intel.com/sites/default/files/managed/0d/53/319433-022.pdf

Table 2: Compiler support options for the new instructions


For more details on Intel® C++ Compiler, visit http://software.intel.com/en-us/intel-parallel-studio-xe.

 

3.4   Support for DDR4 memory 

The Intel Xeon processor E7-8800/4800 v3 product family supports both DDR3 and DDR4 memory. The product family also supports Intel® C112 or C114 scalable memory buffers. With 8 DDR4/DDR3 channels per socket and up to 24 DDR4/DDR3 DIMMs per socket, the platform supports DDR4 LR-DIMMs of up to 64GB (see Figure 2), for a total of up to 6 terabytes of memory in a 4-socket/96-DIMM configuration.


Figure 2: Intel® Xeon® processor E7-8800/4800 v3 memory configuration

 

3.5   Power Improvements 

The power improvements in Intel Xeon processor E7-8800/4800 v3 product family include:

  • Per core P-states (PCPS)
    • Each core can be programmed to the Operating System (OS) requested P-state
  • Uncore frequency scaling (UFS)
    • The uncore frequency is independently controlled from the cores’ frequencies
    • Optimizing performance by applying power to where it is most needed
  • Faster C-states
    • Waking a core out of a C3 or C6 state takes time; on Haswell-EX this transition is faster.
  • Lower idle power

Contact your operating system (OS) provider for details on which OS supports these features.

 

3.6   New RAS features 

The Intel Xeon processor E7-8800/4800 v3 product family includes these new RAS features:

  • Enhanced machine check architecture recovery Gen2
  • Address-based memory mirroring
  • Multiple rank sparing
  • DDR4 recovery

Enhanced machine check architecture recovery Gen2 (EMCA2): As shown in Figure 3, the EMCA2 feature implements an enhanced “firmware first model (FFM)” fault handling: all of the corrected and uncorrected errors are first signaled to BIOS/SMM (System Management Mode) allowing the firmware (FW) to determine if and when errors need to be signaled to the virtual machine monitor (VMM)/OS/software (SW). Once FW determines that an error needs to be reported to VMM/OS/SW, it updates the Machine Check Architecture (MCA) banks and/or optionally the Enhanced Error Log Data Structure and signals the OS.

Enhanced machine check architecture recovery Gen2

Figure 3: Enhanced machine check architecture recovery Gen2

Prior to EMCA2, IA32-legacy MCA implemented error handling by logging all the errors in architected registers (MCA Banks) and signaling the OS/VMM. Enhanced error log is a capability for BIOS to present error logs to the OS in an architectural manner using data structures located within the main memory. The OS can traverse the data structures that are pointed to by EXTENDED_MCG_PTR MSR and locate the enhanced error log. The memory range used for error logs is preallocated and reserved by FW during boot time. This allows the OS to provide correct mapping for this range at all times. These memory buffers cannot be part of System Management RAM (SMRAM) since the OS cannot read SMRAM. This range must be 4-K aligned and may be located below or above 4 GB.

To make use of this feature both OS- and application-level enabling are required.

For details on enhanced MCA, see: http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/enhanced-mca-logging-xeon-paper.pdf.

Address-based memory mirroring: Memory mirroring provides protection against uncorrectable memory errors that would otherwise result in a platform failure. Address-based memory mirroring provides further granularity to mirror memory by allowing the FW or OS to determine a range of memory addresses to be mirrored.

A pair of mirrored DIMMs forms a redundant group. In a mirror configuration, one pair of memory DIMMs is designated the primary image and the other the secondary image. For memory writes, the write request is issued to both sets of DIMMs. For memory reads, the read request is issued to the primary DIMM. In the case of a detected correctable error, the primary image will toggle and the read will be issued to the “new” primary image. In the case of a detected uncorrectable error, the definition of the primary image will “hard fail” to the other image. In this case the “failed” image will never become the primary image until the failed DIMMs have been replaced and the image re-built.

Memory mirroring reduces the available memory by one half. You cannot configure memory mirroring and memory RAID at the same time.

Address-based memory mirroring OS memory

Address-based memory mirroring System Memory

Figure 4: Address-based memory mirroring

Address-based mirroring enables more cost-effective mirroring by mirroring just the critical portion of memory versus mirroring the entire memory space.

To use this feature, OS-level enabling is required. Please contact your OS provider for details on which OS versions support or will support this feature.

Multiple rank sparing: Rank sparing provides a spare rank for dynamic failover of a failing rank behind the same memory controller. Multi-rank sparing allows more than one sparing event and therefore increases the uptime of the system. The system BIOS provides multiple options to select from: 1, 2, 3, or auto (default) mode. In auto mode, up to half of the available ranks are identified to be allocated as spare ranks. A spare rank must be equal to or larger in size than every other rank. As shown in Figure 5, if more than one rank has the (equal) largest size, non-terminating ranks are selected as the spare rank (for example, rank 1 in a Dual-Rank (DR) DIMM, and rank 1 or 3 in a Quad-Rank (QR) DIMM). Multiple rank sparing is enabled in firmware and requires no OS involvement. It supports up to two spare ranks per DDR channel.

Multiple rank sparing

Figure 5: Multiple rank sparing

This is an OEM configuration and no enabling is required in the OS or application level.

DDR4 Recovery: In DDR3 technology, recovery from command and address parity errors was not feasible; these errors were reported as fatal, requiring a system reset. DDR4 technology-based DIMMs incorporate logic that allows the integrated memory controller (iMC) to recover from command and address parity errors. A performance penalty of approximately 1% is incurred when this feature is used.

To make use of this feature, no OS or application level enabling is required.

DDR4 error recovery

Figure 6: DDR4 error recovery

 

4.   Brickland platform improvements 

Some of the new features that come with Brickland platform include:

  • New virtualization features
  • New security features
  • Intel® Node Manager 3.0

 

4.1   Virtualization features 

Virtual Machine Control Structure (VMCS) shadowing: Nested virtualization allows a root Virtual Machine Monitor (VMM) to support guest VMMs; however, the additional Virtual Machine (VM) exits it generates can impact performance. As shown in Figure 7, VMCS shadowing directs the guest VMM's VMREAD/VMWRITE instructions to a VMCS shadow structure, which reduces nesting-induced VM exits. VMCS shadowing increases efficiency by reducing virtualization latency.

VMCS Shadowing

Figure 7: VMCS Shadowing

For this feature, VMM enabling is required. Ask your VMM provider when this feature will be supported.

Cache Monitoring Technology (CMT): CMT (also known as “noisy neighbor” management) provides last-level cache occupancy monitoring, which allows the VMM to identify cache occupancy at the individual application or VM level. With this information, virtualization software can better schedule and migrate workloads.

For this feature, VMM enabling is required. Ask your VMM provider when this feature will be supported.

Extended Page Table (EPT) Accessed/Dirty (A/D) bits: On the previous generation platform, accessed and dirty bits (A/D bits) were emulated in the VMM, and accessing them caused VM exits. Brickland implements EPT A/D bits in hardware to reduce VM exits (Figure 8), enabling efficient live migration of virtual machines and fault tolerance.

EPT A/D in HW

Figure 8: EPT A/D in HW

For this feature, VMM enabling is required. Ask your VMM provider when this feature will be supported.

VT-x latency reduction: Performance overheads arise from virtualization transition round trips, the “exits” from VM to VMM and “entries” from VMM to VM caused by handling privileged instructions. Intel has made continuing enhancements to reduce transition times in each platform generation. Brickland reduces the VMM overheads further and increases virtualization performance.

 

4.2   New security features 

System Management Mode (SMM) external call trap (SECT): SMM is an operating mode in which all normal execution (including the OS’s) is suspended, and special separate software (usually firmware or a hardware-assisted debugger) is executed in high-privilege mode. SMM is entered to run handler code due to the SMI (system management interrupt). Without SMM external call trap (SECT), the SMI handler could execute malicious code in user memory. With SECT, the handler can’t invoke code in user memory.

SMM external call trap

Figure 9: SMM external call trap

BIOS level enabling is required to turn on this feature.

General Crypto Assist – Intel AVX2, 4th ALU, RORX for hashing: Intel AVX2 (256-bit integer operations, better bit manipulation, finer permute granularity) and a 4th ALU (arithmetic and logic unit) help crypto algorithms run faster, while RORX accelerates hash algorithms. Please refer to the Intel® Architecture Instruction Set Extensions Programming Reference for more details on these instructions.

Asymmetric Crypto Assist – MULX for public key: The new MULX instruction speeds up the large-integer multiplications used in asymmetric (public-key) cryptography. Please refer to the Intel® Architecture Instruction Set Extensions Programming Reference for more details on this instruction.
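As an illustration, a minimal sketch of a 64x64-to-128-bit multiply using the _mulx_u64 intrinsic from immintrin.h (requires a BMI2-capable target, e.g. compiled with -mbmi2; the function and variable names are illustrative, not taken from this article):

#include <immintrin.h>
#include <stdint.h>

/* Multiply two 64-bit limbs and return the full 128-bit product in two halves.
   MULX does not modify the flags, so several such multiplies can be interleaved
   with carry chains in large-integer arithmetic. */
void mul64x64(uint64_t a, uint64_t b, uint64_t *lo, uint64_t *hi)
{
    unsigned long long high;
    *lo = _mulx_u64(a, b, &high);   /* low 64 bits returned, high 64 bits stored */
    *hi = high;
}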

Symmetric Crypto Assist – Intel® Advanced Encryption Standard New Instructions (Intel® AES-NI) optimization: Brickland includes enhancements and extensions for symmetric cryptography–Intel AES-NI and beyond. Please refer to this article to find out more about Intel AES-NI and how to use it.

PCH-ME Digital Random Number Generator (DRNG): The Manageability Engine (ME) is an independent and autonomous controller in the platform’s architecture. ME requires well secured communication methods given its autonomy and access to low-level platform mechanisms. Providing the ME with a high-quality randomization source is necessary to maximize platform security. PCH-ME DRNG Technology provides real entropy and generates highly unpredictable random numbers for encryption use by ME, isolated from other system resources.

 

4.3   Intel® Node Manager 3.0 

Brickland comes with the latest version of Intel Node Manager 3.0. Its improvements include:

  • Predictive power limiting
    • Power throttles engage predictively as system power approaches limit
  • Power limit enforced during boot
    • “Boot spike” is controlled without complex IT processes or disabling cores
  • Power Management for the Intel® Xeon Phi™ coprocessor
    • Separate power limits and controls for the Intel Xeon Phi coprocessor domain and rest-of-platform
  • Node Manager Power Thermal Utility (PTU)
    • Establishes key power characterization values for CPU and memory domains
    • Delivered as firmware

Please visit this link for more details on Intel Node Manager.

 

5   Conclusion 

The Intel Xeon processor E7-8800/4800 v3 product family, combined with the Brickland platform, provides many new and improved features that can significantly improve the performance and power efficiency of enterprise platforms.

 

About the Author 

Sree Syamalakumari is a software engineer in the Software & Service Group at Intel Corporation. Sree holds a Master's degree in Computer Engineering from Wright State University, Dayton, Ohio.

Elusive Algorithms – Parallel Scan

$
0
0

jim@quickthreadprogramming.com

This article on parallel programming will choose one of those elusive algorithms that upon first glance seem to be neither vectorizable nor parallelizable. The intent of this article is not to address the specific algorithm, but rather to provide you with an approach to problems that share similarities with this algorithm. The elusive algorithm for this article is the inclusive scan:

In:    1    2    3     4     5     6     7     8    …
Out:   1    3    6    10    15    21    28    36    …

Here each output is the sum of the prior output (or 0 for the first element) and the corresponding input value. This loop has a temporal dependency that, at first inspection, defies both vectorization and parallelization. This article will describe how you can attain both vectorization and parallelization with results like this:
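For reference, the serial form of the inclusive scan is a single loop in which every iteration depends on the previous one (a minimal sketch; the array and variable names are illustrative, not taken from the article):

/* Serial inclusive scan: out[i] = out[i-1] + in[i], with out[-1] taken as 0.
   Each iteration reads the result of the previous one; that running total is
   the temporal dependency discussed above. */
void inclusive_scan(const int *in, int *out, int n)
{
    int sum = 0;
    for (int i = 0; i < n; ++i) {
        sum += in[i];
        out[i] = sum;
    }
}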


Intel® Xeon® Processor D Product Family Technical Overview

$
0
0

Contents

1. Form Factor Overview
2. Intel® Xeon® Processor D Product Family Overview
3. Intel® Xeon® Processor D Product Family Feature Overview
4. Intel® Xeon® processor D Product Family introduces new instructions as well as enhancements of previous instructions4
5. Intel® Advanced Vector Extensions 2 (Intel® AVX2) Instructions
6. VT Cache QoS Monitoring/Enforcement and Memory Bandwidth Monitoring4
7. A/D Bits for EPT
8. Intel® Virtual Machine Control Structure Shadowing (Intel® VMCS Shadowing).
9. APICv
10. Supervisor Mode Access Protection (SMAP)
11. RDSEED4
12. Intel ® Trusted Execution Technology (Intel® TXT)
13. Intel® Node Manager
14. RAS – Reliability Availability Serviceability
15. Intel® Processor Trace4
16. Non-Transparent Bridge (NTB)
17. Asynchronous DRAM Refresh (ADR)
18. Intel® QuickData Technology
19. Resources

 

1. Form Factor Overview 

Microservers are an emerging form of servers designed to process lightweight, scale out workloads for hyper-scale data centers. They’re a good form factor example to use to describe the design considerations when implementing an Intel® SoC. Typical workloads suited for microservers include dynamic and static web page serving, entry dedicated hosting, cold and warm storage, and basic content delivery, among others. A microserver consists of a collection of nodes that share a common backplane.  Each node contains a system-on-chip (SoC), local memory for the SoC, and ideally all required IO components for the desired implementation. Because of the microserver’s high-density and energy-efficient design, its infrastructure (including the fan and power supply) can be shared by tens or even hundreds of SoCs, eliminating the space and power consumption demands of duplicate infrastructure components. Even within the microserver category, there is no one-size-fits-all answer to system design or processor choice. Some microservers may have high-performing single-socket processors with robust memory and storage, while others may have a far higher number of miniature dense configurations with lower power and relatively lower compute capacity per SoC.

Comparison of server form factors
Figure 1. Comparison of server form factors

To meet the full breadth of these requirements, Intel provides a range of processors that provide a spectrum of performance options so companies can select what’s appropriate for their lightweight scale out workloads. The Intel® Xeon® processor D product family offers new options for infrastructure optimization, by bringing the performance and advanced intelligence of Intel® Xeon® processors into dense, lower-power SoCs. The Intel® Xeon® processor E3 family offers a choice of integrated graphics, node performance, performance per watt, and flexibility. The Intel® Atom™ processor C2000 product family provides extreme low power and higher density.

The Intel® Xeon® processor D-1500 product family is Intel’s first generation SoC that is based on Intel Xeon processor line and is manufactured using Intel’s low-power 14nm process. This SoC adds additional performance capabilities to Intel’s SoC line up with such features as hyperthreading, improved cache sizes, DDR4 memory capability, Intel® 10GbE Network Adapter and more. Power enhancements are also a point of focus with a SoC thermal design power of 20-45 Watts and additional power capabilities such as Intel® Node Manager. Multiple redundancy features are also available that help mitigate failures with memory and storage.

The data center environment is diversifying both in terms of the infrastructure and the market segments including storage, network, and cloud. Each area has unique requirements, providing opportunities for targeted solutions to best cover these needs. The Intel Xeon processor D-1500 product family extends market segment coverage beyond Intel’s previous microserver product line based on the Intel Atom processor C2000 product family. Cloud service providers can benefit from the SoC with compute-focused workloads associated with hyper scale out such as distributed memcaching, web frontend, content delivery, and dedicated hosting. The Intel Xeon processor D-1500 product family is also beneficial for mid-range network-focused workloads such as those associated with compact PCI advanced mezzanine cards (AMC) found in router mid-range control. For storage-focused workloads it can also provide benefit with entry enterprise SAN/NAS, cloud storage nodes, or warm cloud storage.

These SoCs offer a significant step up from the Intel® Atom™ SoC C2750, delivering up to 3.4 times the performance per node1,3 and up to 1.7x estimated better performance per watt.2,3 With exceptional node performance, up to 12 MB of last level cache, and support for up to 128 GB of high-speed DDR4 memory, these SoCs are ideal for emerging lightweight hyper-scale workloads, including memory caching, dynamic web serving, and dedicated hosting.

 

2. Intel® Xeon® Processor D Product Family Overview 

Table 1 provides a high-level summary of the hardware differences between the Intel Xeon processor D-1500 product family and the Intel Atom SoC C2000 product family. Some of the more notable changes introduced with the Intel Xeon processor D-1500 product family include Intel® Hyper-Threading Technology (Intel® HT Technology), an L3 cache, greater memory capacity and speed, C-states, and more.

Table 1. Comparison of the Intel® Atom™ Processor C2000 Product Family to the Intel® Xeon® Processor D Product Family

Feature | Intel® Atom™ Processor C2000 Product Family on the Edisonville platform | Intel® Xeon® Processor D-1500 Product Family on the Grangeville platform
Silicon core process technology | 22nm | 14nm
Core / thread count | Up to 8 cores / 8 threads | Up to 8 cores / 16 threads
Core frequency | Up to 2.4GHz (2.6GHz with Turbo) | Up to 2.0GHz (2.6GHz with Turbo)
L1 cache | 32KB data, 24KB instruction per core | 32KB data, 32KB instruction per core
L2 cache | 1MB shared per 2 cores | 256KB per core
L3 cache | None | 1.5MB per core
SoC thermal design power | 5W - 20W | ~20W - 45W
C-states | No | Yes
Memory addressing | 38 bits physical / 48 bits virtual | 48 bits physical / 48 bits virtual
Memory | 2 channels, 2 DIMMs per channel, 1600 DDR3/L | 2 channels, 2 DIMMs per channel, 1600 DDR3/L or 2133 DDR4
Maximum memory capacity | 64GB | 128GB
Supported DIMM types | SODIMM, UDIMM, VLP UDIMM ECC | RDIMM, UDIMM, SODIMM ECC
IO: PCI Express* (PCIe) lanes | 16x PCIe Gen2 | 24x Gen3, 8x Gen2
IO: GbE | 4x 1GbE/2.5GbE | 2x 1GbE / 2.5GbE / 10GbE
IO: SATA ports | 4x SATA2, 2x SATA3 | 6x SATA3
IO: USB ports | 4x USB 2.0 | 4x USB 2.0, 4x USB 3.0

A block diagram of the Intel® Xeon® processor D-1500 product
Figure 2. A block diagram of the Intel® Xeon® processor D-1500 product

 

3. Intel® Xeon® Processor D Product Family Feature Overview 

The rest of this paper discusses some of the new features in the Intel Xeon processor D-1500 product family. In Table 2 the items denoted with a4 have been newly introduced with this version of the silicon, while the other features are new to the entire Intel SoC product line, which previously contained only Intel Atom processors. Some of the features previously existed on other Intel Xeon processor product families, but are new to Intel’s SoC product line.

Table 2. Features and associated workload segments

Workload segments:

  • COMPUTE: Hyper Scale Out, Distributed Memcaching, Web Frontend, Content Delivery, Dedicated Hosting
  • NETWORK: Router Mid Control such as with high-density, compact PCI Advanced Mezzanine Cards (AMC)

Features/Technologies:

  • New or Enhanced Instructions (ADC, SBB, ADCX, ADOX, PREFETCHW, MWAIT)4
  • Intel® Advanced Vector Extensions 2 (Intel® AVX2)
  • VT Cache QoS Monitoring/Enforcement4
  • Memory Bandwidth Monitoring4
  • A/D Bits for EPT
  • Intel® Virtual Machine Control Structure Shadowing (Intel® VMCS Shadowing)
  • Posted Interrupts
  • APICv
  • RDSEED4
  • Supervisor Mode Access Protection (SMAP)4
  • Intel® Trusted Execution Technology
  • Intel® Node Manager
  • RAS
  • Intel® Processor Trace4
  • Intel® QuickAssist Technology
  • Intel® QuickData Technology
  • Non-Transparent Bridge
  • Asynchronous DRAM Refresh

 

4. Intel® Xeon® processor D Product Family introduces new instructions as well as enhancements of previous instructions4 

ADCX (unsigned integer add with carry) and ADOX (unsigned integer add with overflow) have been introduced for Asymmetric Crypto Assist5, in addition to faster ADC/SBB instructions (no recompilation is required to benefit from the faster ADC/SBB). ADCX and ADOX are extensions of the ADC (add with carry) instruction for use in large-integer arithmetic, that is, integers wider than 64 bits. The performance improvement comes from two parallel carry chains being supported at the same time. ADCX/ADOX can be combined with MULX for additional performance improvements in public-key encryption such as RSA. Large-integer arithmetic is also used for Elliptic Curve Cryptography (ECC) and Diffie-Hellman (DH) key exchange. Beyond cryptography, there are many use cases in research and high performance computing (HPC). The demand for this functionality is high enough to warrant a number of commonly used optimized libraries, such as the GNU Multi-Precision (GMP) library (used, e.g., by Mathematica); see New Instructions Supporting Large Integer Arithmetic on Intel® Architecture Processors. To take advantage of these new instructions you need updated software libraries and recompilation (Intel® Compiler 14.1+, GCC 4.7+, and Microsoft Visual Studio* 2013+).
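As an illustration, a minimal sketch of a multi-word (256-bit) addition using the _addcarryx_u64 intrinsic from immintrin.h (requires an ADX-capable target, e.g. compiled with -madx). The four-limb, least-significant-first operand layout is an assumption made for the example, not something specified in this article:

#include <immintrin.h>

/* Add two 256-bit integers stored as four 64-bit limbs, least significant first. */
void add256(const unsigned long long a[4],
            const unsigned long long b[4],
            unsigned long long sum[4])
{
    unsigned char carry = 0;
    for (int i = 0; i < 4; ++i)
        /* carry in, carry out propagated through the limbs */
        carry = _addcarryx_u64(carry, a[i], b[i], &sum[i]);
}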

MWAIT extensions for advanced power management can be used by the Operating System to implement power management policy.

PREFETCHW, which prefetches a cache line in anticipation of a write, now helps optimize the network stack.
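As an illustration, a minimal sketch using the GCC/ICC __builtin_prefetch builtin with write intent, which the compiler can emit as PREFETCHW on targets that support it (e.g. built with -mprfchw). The linked-list structure and loop are illustrative placeholders, not code from this article:

struct packet { struct packet *next; unsigned long long bytes; };

unsigned long long walk_and_update(struct packet *p)
{
    unsigned long long total = 0;
    while (p) {
        /* hint that the next node will soon be modified */
        __builtin_prefetch(p->next, 1 /* write intent */, 3 /* high locality */);
        p->bytes += 1;              /* write that benefits from owning the cache line */
        total += p->bytes;
        p = p->next;
    }
    return total;
}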

For more information about these instructions see the Intel® 64 and IA-32 Architectures Developer’s Manual. Currently, Intel® Compiler 14.1+, GCC 4.7+, and Microsoft Visual Studio* 2013+ support these instructions.

 

5. Intel® Advanced Vector Extensions 2 (Intel® AVX2) Instructions 

With Intel® AVX, all the floating-point vector instructions were extended from 128 bits to 256 bits. The Intel Xeon processor D family further improves performance by reducing floating-point multiply (MULPS/MULPD) latency to 3 cycles from 5 cycles on the previous generation of Intel Xeon processors. Intel® AVX2 also extends the integer vector instructions to 256 bits and uses the same 256-bit YMM registers as Intel AVX. Intel AVX2 instructions benefit high performance computing (HPC) applications, databases, and audio and video applications. They include fused multiply-add (FMA), gather, shift, and permute instructions.

The FMA instruction computes ±(a×b)±c with only one rounding: the a×b intermediate result is not rounded, which brings increased accuracy compared to separate MUL and ADD instructions. FMA increases the performance and accuracy of many floating-point computations such as matrix multiplication, dot products, and polynomial evaluation. With 256 bits, one instruction performs 8 single-precision or 4 double-precision FMA operations. Since FMA combines two operations into one, floating-point operations per second (FLOPS) are increased; additionally, because there are two FMA units, the peak FLOPS are doubled.
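As an illustration, a minimal sketch of an 8-wide single-precision multiply-add using the _mm256_fmadd_ps intrinsic (compile with e.g. -mavx2 -mfma). The array names and the assumption that n is a multiple of 8 are illustrative:

#include <immintrin.h>

/* c[i] = a[i] * b[i] + c[i], eight floats per iteration, single rounding per element */
void fma_arrays(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        __m256 vc = _mm256_loadu_ps(c + i);
        _mm256_storeu_ps(c + i, _mm256_fmadd_ps(va, vb, vc));
    }
}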

The gather instruction loads sparse elements into a single vector. It can gather 8 single-precision (dword) or 4 double-precision (qword) data elements into a vector register in a single operation. A base address points to the data structure in memory, and an index register gives the offset of each element from that base address. A mask register tracks which elements still need to be gathered; the gather is complete when the mask register is all zeros. The gather instruction enables vectorization of workloads that previously could not be vectorized for various reasons.
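As an illustration, a minimal sketch using the _mm256_i32gather_ps intrinsic to load 8 single-precision elements from arbitrary offsets of a base array in one operation (compile with e.g. -mavx2; the table and index arrays are illustrative placeholders):

#include <immintrin.h>

/* Gather table[indices[0..7]] into one 256-bit vector. */
__m256 gather8(const float *table, const int *indices)
{
    __m256i vindex = _mm256_loadu_si256((const __m256i *)indices);
    return _mm256_i32gather_ps(table, vindex, 4);   /* scale = sizeof(float) */
}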

Intel Xeon processor D product family adds additional hardware capability with a gather index table (GIT) to improve performance (Figure 3). No recompiling is required to take advantage of this new feature. The GIT provides storage for full width indices near the address generation unit. A special load grabs the correct index, simplifying the index handling. Loaded elements are merged directly into the destination.

Gather Index Table Conceptual Block Diagram
Figure 3. Gather Index Table Conceptual Block Diagram

Other new operations in Intel AVX2 include integer versions of the permute instructions, new broadcast instructions, and blend instructions. A radix-1024 divider for reduced latency, along with a "split" operation for scalar divides in which two scalar divides occur simultaneously, improves performance over previous generations of Intel Xeon processors.

Currently, the Intel Compiler 14.1+, GCC 4.7+, and Microsoft Visual Studio 2013+ support these instructions.

 

6. VT Cache QoS Monitoring/Enforcement and Memory Bandwidth Monitoring4 

The Intel Xeon processor D product family has the ability to monitor the last level of processor cache on a per-thread, application, or VM basis. This allows the VMM or OS scheduler to make changes based on policy enforcement. One scenario where this can be of benefit is if you have a multi-tenant environment and a VM is causing a lot of thrash with the cache. This feature allows the VMM or OS to migrate this “noisy neighbor” to a different location where it may have less of an impact on other VMs. This product family also introduces a new capability to manage the processor LLC based on pre-defined levels of service, independent of the OS or VMM. A QoS mask can be used to provide 16 different levels of enforcement to limit the amount of cache that a thread can consume.

The Intel® 64 and IA-32 Architectures Software Developer’s Manual (SDM), Volume 3, section 17.14, provides the CQM and MBM programming details; section 17.15 provides the CQE programming details. To convert the raw value read from the IA32_QM_CTR register into bytes, multiply it by the upscaling factor reported in CPUID.0xF.1:EBX.
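As an illustration, a minimal sketch of that conversion using the __cpuid_count macro from the GCC/ICC <cpuid.h> header; reading IA32_QM_CTR itself is a privileged operation normally performed by the OS or VMM, so the raw counter value here is a hypothetical input:

#include <cpuid.h>
#include <stdint.h>

/* Convert a raw IA32_QM_CTR occupancy reading into bytes. */
uint64_t qm_counter_to_bytes(uint64_t raw_counter)
{
    unsigned int eax, ebx, ecx, edx;
    __cpuid_count(0xF, 1, eax, ebx, ecx, edx);   /* EBX = upscaling factor in bytes */
    return raw_counter * (uint64_t)ebx;
}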

For additional resources see: Benefits of Intel Cache Monitoring Technology in the Intel® Xeon™ Processor E5 v3 Family, Intel's Cache Monitoring Technology: Software-Visible Interfaces, Intel's Cache Monitoring Technology: Use Models and Data, or Intel's Cache Monitoring Technology: Software Support and Tools

Cache and memory bandwidth monitoring and enforcement vectors
Figure 4. Cache and memory bandwidth monitoring and enforcement vectors.

Another new capability enables the OS or VMM to monitor memory bandwidth. This allows scheduling decisions to be made based on memory bandwidth usage on a per core or thread basis. An example of this situation is when one core is being heavily utilized by two applications, while another core is being underutilized by two other applications. With memory bandwidth monitoring the OS or VMM now has the ability to schedule a VM or an application to a different core to balance out memory bandwidth utilization. In Figure 5 two high memory bandwidth applications are competing for the same resource. The OS or VMM can move one of the high bandwidth memory applications to another resource to balance out the load on the cores.

Memory Bandwidth Monitoring use case
Figure 5. Memory Bandwidth Monitoring use case

 

7. A/D Bits for EPT 

In the previous generation, accessed and dirty bits (A/D bits) were emulated in VMM and accessing them caused VM exits. EPT A/D bits are implemented in hardware to reduce VM exits. This enables efficient live migration of VMs and fault tolerance.

VM exits with EPT A/D in hardware vs emulation
Figure 6. VM exits with EPT A/D in hardware vs emulation

This feature requires enabling VT-x at the BIOS level. Currently it is supported by KVM with 3.6+ kernel and Xen* 4.3+. For other VM providers please contact them to find out when this feature will be supported.

 

8. Intel® Virtual Machine Control Structure Shadowing (Intel® VMCS Shadowing) 

Nested virtualization allows a root Virtual Machine Monitor (VMM) to support guest VMMs. However, additional Virtual Machine (VM) exits can impact performance. As shown in Figure 7, Intel® VMCS Shadowing directs the guest VMM VMREAD/VMWRITE to a VMCS shadow structure. This reduces nesting induced VM exits. Intel VMCS Shadowing increases efficiency by reducing virtualization latency.

VM exits with Intel® VMCS Shadowing vs software-only
Figure 7. VM exits with Intel® VMCS Shadowing vs software-only

This feature requires enabling VT-x at the BIOS level. Currently it is supported by KVM with Linux Kernel 3.10+ and Xen 4.3+. For other VM providers please contact them to find out when this feature will be supported.

 

9. APICv 

The Virtual Machine Monitor emulates most guest accesses to interrupts and the Advanced Programmable Interrupt Controller (APIC) in a virtual environment. This causes VM exits, creating overhead on the system. APICv offloads this task to the hardware, eliminating VM exits and increasing I/O throughput.

VM exits with APICv vs without APICv
Figure 8. VM exits with APICv vs without APICv

This feature requires enabling VT-x at the BIOS level. Currently it is supported by KVM with Linux Kernel 3.10+, ESX(i)* 4.0+. For other VM providers please contact them to find out when this feature will be supported.

 

10. Supervisor Mode Access Protection (SMAP) 4 

Supervisor Mode Access Protection (SMAP) is a new CPU-based mechanism for user-mode address-space protection. It extends the protection that previously was provided by Supervisor Mode Execution Prevention (SMEP). SMEP prevents supervisor mode execution from user pages, while SMAP prevents unintended supervisor mode accesses to data on user pages. There are legitimate instances where the operating system needs to access user pages, and SMAP does provide support for those situations.

SMAP conceptual diagram
Figure 9. SMAP conceptual diagram

SMAP was developed with the Linux community and is supported on kernel 3.12+ and KVM version 3.15+. Support for this feature depends on which operating system or VMM you are using.

 

11. RDSEED4 

The RDSEED instruction is intended for seeding a Pseudorandom Number Generator (PRNG) of arbitrary width, which can be useful when you want to create stronger cryptography keys. If you do not need to seed another PRNG, use the RDRAND instruction instead. For more information see Table 3, Figure 10, and The Difference Between RDRAND and RDSEED.

Table 3. RDSEED and RDRAND compliance and source information

Instruction | Source | NIST Compliance
RDRAND | Cryptographically secure pseudorandom number generator | SP 800-90A
RDSEED | Non-deterministic random bit generator | SP 800-90B & C (drafts)

RDSEED and RDRAND conceptual block diagram
Figure 10. RDSEED and RDRAND conceptual block diagram

Currently the Intel® Compiler 15+, GCC 4.8+, and Microsoft Visual Studio* 2013+ support RDSEED.

RDSEED loads a hardware-generated random value and stores it in the destination register. The random value is generated from an Enhanced NRBG (Non-deterministic Random Bit Generator) that is compliant with NIST SP800-90B and NIST SP800-90C in the XOR construction mode.

In order for the hardware design to meet its security goals, the random number generator continuously tests itself and the random data it is generating. The self-test hardware detects run-time failures in the random number generator circuitry or statistically anomalous data occurring by chance and flags the resulting data as bad. In such extremely rare cases, the RDSEED instruction will return no data instead of bad data.

Intel C/C++ Compiler Intrinsic Equivalent:

  • RDSEED: int _rdseed16_step(unsigned short *);
  • RDSEED: int _rdseed32_step(unsigned int *);
  • RDSEED: int _rdseed64_step(unsigned __int64 *);

As with RDRAND, RDSEED will avoid any OS or library enabling dependencies and can be used directly by any software at any protection level or processor state.
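As an illustration, a minimal sketch of the retry loop typically used with the _rdseed32_step intrinsic, since the instruction can transiently return no data (compile with e.g. -mrdseed; the retry limit is an illustrative choice, not from this article):

#include <immintrin.h>

/* Returns 1 and stores a seed on success, 0 if the entropy source stayed busy. */
int get_seed32(unsigned int *seed)
{
    for (int tries = 0; tries < 100; ++tries) {
        if (_rdseed32_step(seed))   /* carry flag set: a valid seed was stored */
            return 1;
        _mm_pause();                /* brief pause before retrying */
    }
    return 0;
}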

For more information see section 7.3.17.2 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual (SDM).

 

12. Intel ® Trusted Execution Technology (Intel® TXT) 

Intel® TXT is the hardware basis for mechanisms that validate platform trustworthiness during boot and launch, which enables reliable evaluation of the computing platform and its protection level. Intel TXT is compact and difficult to defeat or subvert, and it allows for flexibility and extensibility to verify the integrity during boot and launch of platform components, including BIOS, operating system loader, and hypervisor. Because of the escalating sophistication of malicious threats, mainstream organizations must employ ever-more stringent security requirements and scrutinize every aspect of the execution environment.

Intel TXT reduces the overall attack surface for both individual systems and compute pools. The technology provides a signature that represents the state of an intact system’s launch environment. The corresponding signature at the time of future launches can then be compared against that known-good state to verify a trusted software launch, to execute system software, and to ensure that cloud infrastructure as a service (IaaS) has not been tampered with. Security policies based on a trusted platform or pool status can then be set to restrict (or allow) the deployment or redeployment of virtual machines (VMs) and data to trusted platforms with known security profiles. Rather than relying on the detection of malware, Intel TXT builds trust into a known software environment and thus ensures that the software being executed hasn’t been compromised. This advances security to address key stealth attack mechanisms used to gain access to parts of the data center in order to access or compromise information. Intel TXT works with Intel® Virtualization Technology (Intel® VT) to create a trusted, isolated environment for VMs.

Simplified Intel® TXT Component diagram
Figure 11. Simplified Intel® TXT Component diagram

For more details on Intel TXT and its implementation see Intel® TXT Enabling Guide.

 

13. Intel® Node Manager 

Intel® Node Manager is a core set of power management features providing a smart way to optimize and manage power, cooling, and compute resources in the data center. This server management technology extends component instrumentation to the platform level and can be used to make the most of every watt consumed in the data center. First, Intel Node Manager reports vital platform information, such as power, temperature, and resource utilization using standards-based, out-of-band communications. Second, it provides fine-grained controls to limit platform power in compliance with IT policy. This feature can be found across Intel products segments providing consistency within the data center.

Table 4. Intel® Node Manager features

Intel® Node Manager features

To use this feature you must enable the BMC LAN and the associated BMC user configuration at the BIOS level, which should be available under the server management menu. The Programmer’s Reference Kit is very simple to use and requires no additional external libraries to compile or run. All that is needed is a C/C++ compiler and to then run the configuration and compilation scripts.

Intel® Node Manager website

Intel® Node Manager Programmer’s Reference Kit

Open Source Reference Kit

How to set up Intel® Node Manager

 

14. RAS – Reliability Availability Serviceability 

Server reliability, availability, and serviceability (RAS) are crucial issues for modern enterprise IT data centers that deliver mission-critical applications and services, as application delivery failures can be extremely costly per hour of system downtime. Furthermore, the likelihood of such failures increases statistically with the size of the servers, data, and memory required for these deployments. The Intel Xeon processor D product family offers a set of RAS features in silicon to provide error detection, correction, containment, and recovery. This feature set is a powerful foundation for hardware and software vendors to build higher-level RAS layers and provide overall server reliability across the entire hardware-software stack from silicon to application delivery and services. Table 5 shows a comparison of the RAS features available on the Intel Xeon processor D product family vs the Intel Atom processor C2000 series.

Table 5. Comparison of RAS features

Category | Feature | Intel® Atom™ Processor C2000 Product Family on the Edisonville platform | Intel® Xeon® Processor D-1500 Product Family on the Grangeville platform
Memory | ECC |  | 
Memory | Error detection and correction coverage |  | 
Memory | Failed DIMM Identification |  | 
Memory | Memory Address Parity Protection on Reads/Writes | No | 
Memory | Memory Demand and Patrol Scrubbing |  | 
Memory | Memory Thermal Throttling |  | 
Memory | Memory BIST including Error Injection | No | 
Memory | Data Scrambling with address |  | 
Memory | SDDC | No | 
Platform | PCIe* Device Surprise Removal | No | 
Platform | PCIe and GbE Advanced Error Reporting (AER) |  | 
Platform | PCIe Device Hot Add / Remove / Swap | No | 
Platform | ECRC on PCIe | No | 
Platform | Data Poisoning - Containment | Via parity | 
Platform | Corrected Error Cloaking from OS |  | 
Platform | Disable CMCI | No CMCI support | 
Platform | Uncorrected error signaling to SMI (dual-signaling) |  | 
Platform | Intel® Silicon View Technology | No | 

 

15. Intel® Processor Trace4 

Intel® Processor Trace enables low-overhead instruction tracing of workloads to memory. This can be of value for low-level debugging, fine-tuning performance, or post-mortem analysis (core dumps, save on crash, etc.). The output includes control flow details, enabling precise reconstruction of the path of software execution, and it also provides timing information, software context details, processor frequency indication, and more. Intel Processor Trace has a sampling mode to estimate the number of function calls and loop iterations in an application being profiled. It has a limited impact on system execution and does not require any enabling; you simply need Intel® VTune™ Amplifier 2015 Update 1 or newer.

For additional information see the Intel® Processor Trace lecture or pdf given at IDF14.

Overview of Intel® Processor Trace
Figure 12. Overview of Intel® Processor Trace

 

16. Non-Transparent Bridge (NTB) 

Non-Transparent Bridge (NTB) reduces loss of data by allowing a secondary system to take over the PCIe* storage devices in the event of a CPU failure, providing high availability for your storage devices.

Overview of Non-Transparent Bridge with a local and remote host on the Intel® Xeon® processor D product family
Figure 13. Overview of Non-Transparent Bridge with a local and remote host on the Intel® Xeon® processor D product family

 

17. Asynchronous DRAM Refresh (ADR) 

Asynchronous DRAM Refresh (ADR) preserves key data in the battery-backed DRAM in the event of AC power supply failure.

Figure 14. Overview of Asynchronous DRAM Refresh
Figure 14. Overview of Asynchronous DRAM Refresh

 

18. Intel® QuickData Technology 

Intel® QuickData Technology is a platform solution designed to maximize the throughput of server data traffic across a broader range of configurations and server environments to achieve faster, scalable, and more reliable I/O. It enables the chipset instead of the CPU to copy data, which allows data to move more efficiently through the server.  This technology is supported on Linux kernel 2.6.18+ and Windows* Server 2008 R2 and will require enabling within the BIOS.

For more information, see the Intel® QuickData Technology Software Guide for Linux.

Overview of Intel® QuickData Technology
Figure 15. Overview of Intel® QuickData Technology

 

19. Resources 

Intel® Xeon® processor D product family performance comparisons for general compute, cloud, storage and network.

Intel® 64 and IA-32 Architectures Software Developer’s Manual (SDM)

Intel® Processor Trace IDF 2014 Video Presentation

Intel® Processor Trace IDF 2014 PDF Presentation

Benefits of Intel Cache Monitoring Technology in the Intel® Xeon™ Processor E5 v3 Family

Intel’s Cache Monitoring Technology Software-Visible Interfaces

Intel's Cache Monitoring Technology: Use Models and Data

Intel's Cache Monitoring Technology: Software Support and Tools

The Difference Between RDRAND and RDSEED

Intel® Node Manager website

Intel® Node Manager Programmer’s Reference Kit

Open Source Reference Kit

How to set up Intel® Node Manager

Intel® QuickData Technology Software Guide for Linux

Haswell Cryptographic Performance

Intel® TXT Enabling Guide

Intel® Atom™ processor C2000 product family

Intel® Xeon® processor E3 family

  1. Up to 3.4x better performance on dynamic web serving Intel® Xeon® processor D-based reference platform with one Intel Xeon processor D (8C, 1.9GHz, 45W, ES2), Intel® Turbo Boost Technology enabled, Intel® Hyper-Threading Technology enabled, 64GB memory (4x16GB DDR4-2133 RDIMM ECC), 2x10GBase-T X552, 3x S3700 SATA SSD, Fedora* 20 (3.17.8-200.fc20.x86_64, Nginx* 1.4.4, Php-fpm* 15.4.14, Memcached* 1.4.14, Simultaneous users=43844 Supermicro SuperServer* 5018A-TN4 with one Intel® Atom™ processor C2750 (8C, 2.4GHz,20W), Intel Turbo Boost Technology enabled, 32GB memory (4x8GB DDR3-1600 SO-DIMM ECC), 1x10GBase-T X520, 2x S3700 SATA SSD, Ubuntu* 14.10 (3.16.0-23 generic), Nginx 1.4.4, Php-fpm 15.4.14, Memcached 1.4.14, Simultaneous users=12896.2
  2. Up to 1.7x (estimated) better performance per watt on dynamic web serving Intel® Xeon® processor D-based reference platform with one Intel Xeon processor D (8C, 1.9GHz, 45W, ES2), Intel® Turbo Boost Technology enabled, Intel® Hyper-Threading Technology enabled, 64GB memory (4x16GB DDR4-2133 RDIMM ECC), 2x10GBase-T X552, 3x S3700 SATA SSD, Fedora* 20 (3.17.8-200.fc20.x86_64, Nginx* 1.4.4, Php-fpm* 15.4.14, Memcached* 1.4.14, Simultaneous users=43844, Estimated wall power based on microserver chassis, power=90W, Perf/W=487.15 users/W Supermicro SuperServer* 5018A-TN4 with one Intel® Atom™ processor C2750 (8C, 2.4GHz,20W), Intel® Turbo Boost Technology enabled, 32GB memory (4x8GB DDR3-1600 SO-DIMM ECC), 1x10GBase-T X520, 2x S3700 SATA SSD, Ubuntu* 14.10 (3.16.0-23 generic), Nginx 1.4.4, Php-fpm 15.4.14, Memcached 1.4.14, Simultaneous users=12896. Maximum wall power =46W, Perf/W=280.3 users/W
  3. Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance.
  4. New feature introduced with the Intel® Xeon® processor D product family. Intel technologies may require enabled hardware, specific software, or services activation. Performance varies depending on system configuration. Check with your system manufacturer or retailer.
  5. Intel® processors do not contain crypto algorithms, but support math functionality that accelerates the sub-operations.

GROMACS recipe for symmetric Intel® MPI using PME workloads

$
0
0

Objectives

This package (scripts with instructions) delivers a build and run environment for symmetric MPI runs; this file is the README of the package. Symmetric means that a Xeon® executable and a Xeon Phi™ executable run together, exchanging MPI messages and collective data via Intel MPI.

There is already a GROMACS recipe for symmetric Intel MPI, https://software.intel.com/en-us/articles/gromacs-for-intel-xeon-phi-coprocessor, but that recipe addresses the so-called RF data sets and does not take advantage of the special Particle Mesh Ewald (PME) configuration option.

The symmetric run configurations of this recipe use the PME mode of GROMACS. In this mode the particle-mesh part of GROMACS, which handles the long-range forces, can run in parallel with the direct force calculation. The idea for efficient use of both architectures is to run the direct forces on Xeon Phi™, where highly vectorized kernels exist, while the PME calculations, which make heavy use of FFTs and require very intensive MPI_Alltoall communication, run in the Xeon® executable.

This package contains run scripts for running GROMACS on Clusters equipped with Xeon® and Xeon Phi™ processors. It is also possible to run GROMACS separately on Xeon® and Xeon Phi™ alone. Scripts assist interactive running but can also be integrated in batch scripts. The full package is attached to this recipe.

0. Prerequisites

The following software and files are necessary for the installation. A user may take these packages or download newer versions where they exist.

  1. Download the package GROMACS-SYM-VERSION.tgz provided at the bottom of this article.
     
  2. GROMACS package:    
    ftp://ftp.gromacs.org/pub/gromacs/gromacs-5.0.5.tar.gz
     
  3. cmake package: this software is needed in a recent version. Some OS distributions still ship cmake versions that do not build GROMACS correctly; a version number > 2.8.8 is mandatory for this GROMACS package. cmake can be found at http://www.cmake.org/cmake/resources/software.html; please read the included Readme.txt for help with the installation.
     
  4. *.tpr input file: You have to have an *.tpr GROMACS input file. This package contains an artificial input topol500k.tpr.  

1. Installation of GROMACS

1.1 Preparation

  1. Untar the package GROMACS-SYM-version.tar.gz
    (e.g. GROMACS-SYM-0.9.4.tar.gz)

    $ tar -xvzf GROMACS-SYM-version.tar.gz
     
  2. Enter directory
    $ cd GROMACS-SYM-version/
    This directory (with its absolute path) will be referred to as BASE_DIR from now on: $ BASE_DIR=$PWD
    Update the version string for GROMACS Versions other than 5.0.5

    $ cat VERSION

    if this is 5.0.5 and you intend to use this version you are done. Update the version number for different GROMACS versions.
     
  3. Enter the Package directory and copy the original distribution.
    $ cd $BASE_DIR
    $ cd package

    Copy original GROMACS package to this directory and unpack.
    $ cp /<path to package>/gromacs-5.0.5.tar.gz .
    $ tar -xvzf gromacs-5.0.5.tar.gz
     
  4. Setup environment:
    $ vi $BASE_DIR/env.sh

    Check the environment settings for the compiler and MPI by sourcing the environment. Use an empty env.sh when the environment should be taken from the shell.

    $ source ./env.sh
    $ which icc
    $ which mpiicc

Tested software versions

icc        : 15.0.3, 15.0.2
Intel® MPI : 5.0.3, 5.0.2
gcc        : 4.4.7 20120313 (Xeon®), 4.8.1
MPSS       : 3.2.1, 3.4.2, 3.5

The gcc version is crucial for the stdc++ library and the C++ flags!

1.2 Install Xeon executable

It makes sense to compile on the same architecture as the target Xeon® architecture because the GROMACS cmake configuration script will detect the best options.

  1. Go to the Xeon build directory

    $ cd $BASE_DIR/build-xeon

    this directory contains 3 scripts:
    conf.sh   : configures the build directory using cmake
    build.sh  : builds and installs software (make)
    clean.sh  : removes all configured files if you intend to change parameters and re-install
     
  2. Configure GROMACS: the script conf.sh contains the GROMACS configuration that will be transformed into a makefile using cmake. In case of failure you may inspect conf.log and conf.err; these files contain log information and error output. The GROMACS installation will be in $BASE_DIR/gromacs

    $ ./conf.sh

    conf.sh contains cmake with some proven options (compare original GROMACS installation information). The C++ flags must be different for gcc versions >= 4.7. In case of error follow the instructions given inside conf.sh
     
  3. Build GROMACS for Xeon: the build script simply executes the makefile generated in the previous step and installs the executable

    $ ./build.sh

    in case of success there will be an executable in: $BASE_DIR/gromacs/bin/mdrun_mpi
    in case of error check whether the gcc version matches the conf.sh settings.

1.3 Install MIC executable

MPSS software stack must be present.

  1. Enter the build-mic directory:

    $ cd $BASE_DIR/build-mic

    The following steps are completely analogous to steps 1 to 3 of section 1.2; the only differences are the additional -mmic flag and potentially a different C++ flag. The user should not need to make any changes.
     
  2. $ ./conf.sh
     
  3. $ ./build.sh
    should generate a mic executable in $BASE_DIR/gromacs-mic/bin

2. Run GROMACS

2.0 Run environment

Starting with interactive tests: reserve an interactive node containing one or more mic cards for direct testing. If interactive usage is not allowed, follow the instructions for running under a batch system (see below).
NOTE! You will need to provide a hosts file with the name of your host as the minimal entry. Please check that the mics are named <hostname>-mic0… inside /etc/hosts. Enter the run directory:

$ cd $BASE_DIR/run

Scripts and files:

start.sh : starts a run by defining all environment settings etc. start.sh sources different scripts. These scripts are:

  • functions.sh : define some auxiliary bash functions used in start.sh
  • env.sh       : source scripts for compiler and mpi -- update for your system. For clusters using modules this script may be empty. The environment settings will be taken from shell if no env.sh is present
  • MPI_OMP_settings.sh: contains MPI and OpenMP specific environment
  • application.sh: contains application-specific settings like the program name and program path. This package contains imb.sh for IMB testing and gromacs.sh for running GROMACS
  • run_mpi.sh: executes the MPI command line
  • prg.sh: wrapper script for the executable(s) distinguishes Xeon and MIC environment                          
  • gen_mach.sh: generates MPI machinefile from hostfile. Default name for machinefile is mach.txt. Please check if hostnames are correct.
  • settings_log.sh: protocols settings inside settings.prot
  • env_log.sh   : protocols environment settings inside env.prot
  • conf_test.sh : runs seven different configurations

2.1 IMB tests (optional, test run scripts independent of GROMACS)

To make sure that the run system is working, use it with the Intel MPI Benchmarks (IMB) as a test for different scenarios. The IMB benchmarks are already built for the intel64 and mic architectures. They can be found in: $I_MPI_ROOT/intel64/bin and $I_MPI_ROOT/mic/bin.

  1. Set application to imb

    $ rm application.sh

    make a soft link to imb.sh

    $ ln -s imb.sh application.sh

    imb.sh contains all necessary imb definitions for running different scenarios.
     
  2. Run the test script

    $ ./conf_test.sh

    this will generate an output directory, output_Sendrecv_TEST, which contains 7 sub-directories. The sub-directory names encode the configuration, e.g.:
    N-1_H4T6_2xMIC12T15: 1 node with 4 host processes of 6 threads each, and 2 MICs with 12 processes and 15 threads each. Each directory contains all used scripts from the run directory and the output files:

    settings.prot and env.prot    : configuration logs
    command.txt                          : command line
    OUT.txt                                   : stdout
    OUT.err                                   : stderr

    the stdout files of each directory contain an IMB sendrecv benchmark showing potential bottlenecks in MPI message passing.

2.2 Run GROMACS Tests

Set application to GROMACS:

$ ln -s gromacs.sh application.sh

The settings in gromacs.sh are defined for the artificial test case topol500k.tpr. Please adapt the settings to your input set.

Run the test cases:

$ ./conf_test.sh

This generates an output directory, output_topol500k_TEST, which contains 7 sub-directories. In case of success, each directory contains a GROMACS md.log that prints a performance statement at the end.

Please see 2.1 for an explanation of the directory/file names.

2.3 Define new runs

New configurations can be created by changing 3 variables inside start.sh:

Open start.sh script:

$ vi start.sh

# HOST_PE: Ranks on host,           (=0 : host not used)
# NUM_MIC: number of used MIC cards (=0 : no mic card used)
# PP_MIC : number of Ranks on each MIC card

export HOST_PE=${HOST_PE:-2}
export NUM_MIC=${NUM_MIC:-2}
export PP_MIC=${PP_MIC:-12}

These variables determine the number of MPI ranks on the host, the number of used mic cards and the number of ranks on each of the cards.

The number of threads on host and mic are determined by:

# automatic setting of thread number
# this overwrites explicit thread number
# compare output in file settings.prot

export NUM_CORES=12
export MIC_NUM_CORES=57

export THREADS_PER_CORE=1
export MIC_THREADS_PER_CORE=3

Here we define the number of cores. The choices are minimal and should be adapted; the correct values can be determined by reading the output of micinfo and cpuinfo. Please adapt them to your mic and Xeon.

After changing the parameters it makes sense to do a dry run with

RUN_TYPE="test"

Running start.sh will generate all settings but will not execute the program. This mode will show e.g. if the machine file is correct:

$ cat mach.txt

2.4 Batch Usage

The start script also contains

RUN_TYPE="batch"

This branch works just as in interactive mode but sends the file run_MPI.sh to a batch queue using command-line options of the batch system; compare the settings under run/TEMPLATES for templates of batch.sh. This methodology works for LSF, PBS, and SLURM, but it might need some additional knowledge of the job manager.

It may be easier to write a batch script as suggested by the cluster documentation. The script can look like this:

 #QSUB <your settings>
 #QSUB ...

 #generate hosts file e.g. = $PBS_NODEFILE

 # define configuration

export HOST_PE=<num of host pe>
export NUM_MIC=<number of mics>
export PP_MIC=<number of ranks per mic>

 ./start.sh [<number of nodes>]  

3. Trouble Shooting

  • Check settings.prot and env.prot for the recorded settings.
  • Check the machine file mach.txt.
  • Use imb as the application and check whether the system works with IMB.
  • The timing output is distorted for symmetric runs; the PME part is scaled by the wrong factor (for information only).
  • Before configuring, check that no LDFLAGS and CFLAGS are defined inside the shell; these will confuse the cmake configuration.
  • Please check whether your gcc version number is >= 4.7. This might need an additional flag for the CXX_FLAGS inside conf.sh; please read the note inside conf.sh.
  • Check that the general rule for mic hosts is valid:
    the name for mic0 is: <hostname>-mic0
    if this is not the case, please adapt the function host2mic inside gen_mach.sh

Building and Running 3D-FFT Code that Leverages MPI-3 Non-Blocking Collectives with the Intel® Parallel Studio XE Cluster Edition

$
0
0

Purpose

This application note assists developers with using Intel® Software Development Tools with the 3D-FFT MPI-3 based code sample from the Scalable Parallel Computing Lab (SPCL), ETH Zurich.

Introduction

The original 3D-FFT code based on the prototype library libNBC was developed to help in optimizing parallel high performance applications by overlapping computation and communication [1]. The updated version of the code based on MPI-3 Non-Blocking Collectives (NBC) has now been posted at the SPCL, ETH Zurich web site. This new version relies on the MPI-3 API and therefore can be used by modern MPI libraries that implement it. One such MPI library implementation is Intel® MPI Library that fully supports the MPI-3 Standard [2].     

Obtaining the latest Version of Intel® Parallel Studio XE 2015 Cluster Edition

The Intel® Parallel Studio XE 2015 Cluster Edition software product includes the following components used to build the 3D-FFT code:

  •    Intel® C++ Compiler XE
  •    Intel® MPI Library (version 5.0 or above) which supports the MPI-3 Standard
  •    Intel® Math Kernel Library (Intel® MKL) that contains an optimized FFT (Fast Fourier Transform) solver and the wrappers for FFTW (Fastest Fourier Transform in the West)

 The latest versions of Intel® Parallel Studio XE 2015 Cluster Edition may be purchased, or evaluation copies requested, from the URL https://software.intel.com/en-us/intel-parallel-studio-xe/try-buy.  Existing customers with current support for Intel® Parallel Studio XE 2015 Cluster Edition can download the latest software updates directly from https://registrationcenter.intel.com/     

Code Access

To download the 3D-FFT MPI-3 NBC code, please go to the URL http://spcl.inf.ethz.ch/Research/Parallel_Programming/NB_Collectives/Kernels/3d-fft_nbc_mpi_intel.tgz 

Building the 3D-FFT NBC Binary      

To build the 3D-FFT NBC code:

 

  1. Set up the build environment, e.g.,

source /opt/intel/composer_xe_2015.2.164/bin/compilervars.sh intel64
source /opt/intel/impi/5.0.3.048/bin64/mpivars.sh

 

   Regarding the above mentioned versions (.../composer_xe_2015.2.164 and .../impi/5.0.3.048), please source the corresponding versions that are installed on your system.

  2. Untar the 3D-FFT code download from the link provided in the Code Access section above and build the 3D-FFT NBC binary

mpiicc -o 3d-fft_nbc 3d-fft_nbc.cpp -I$MKLROOT/include/fftw/ -mkl

 

Running the 3D-FFT NBC Application      

Intel® MPI Library support for asynchronous message progressing allows computation and communication to overlap in NBC operations [2]. To enable asynchronous progress in the Intel® MPI Library, set the environment variable MPICH_ASYNC_PROGRESS to 1:

export MPICH_ASYNC_PROGRESS=1
 

Run the application with the mpirun command as usual. For example, the command shown below starts the application with 32 ranks on two nodes (node1 and node2), with 16 processes per node:

mpirun -n 32 -ppn 16 -hosts node1,node2 ./3d-fft_nbc
 

and produces output similar to the following:

1 repetitions of N=320, testsize: 0, testint 0, tests: 0, max_n: 10
approx. size: 62.500000 MB
normal (MPI): 0.192095 (NBC_A2A: 0.037659/0.000000) (Test: 0.000000) (2x1d-fft: 0.069162) - 1x512000 byte
normal (NBC): 0.203643 (NBC_A2A: 0.047140/0.046932) (Test: 0.000000) (2x1d-fft: 0.069410) - 1x512000 byte
pipe (NBC): 0.173483 (NBC_A2A: 0.042651/0.031492) (Test: 0.000000) (2x1d-fft: 0.069383) - 1x512000 byte
tile (NBC): 0.155921 (NBC_A2A: 0.018214/0.010794) (Test: 0.000000) (2x1d-fft: 0.069577) - 1x512000 byte
win (NBC): 0.173479 (NBC_A2A: 0.042485/0.026085) (Pack: 0.000000) (2x1d-fft: 0.069385) - 1x512000 byte
wintile (NBC): 0.169248 (NBC_A2A: 0.028918/0.021769) (Pack: 0.000000) (2x1d-fft: 0.069290) - 1x512000 byte

          

Acknowledgments

Thanks to Torsten Hoefler for hosting the 3D-FFT distribution for Intel tools. Mikhail Brinskiy assisted in porting the libNBC version of the 3D-FFT code to the MPI-3 Standard. James Tullos and Steve Healey suggested corrections and improvements to the draft.

References

1. Torsten Hoefler, Peter Gottschling, Andrew Lumsdaine, Brief Announcement: Leveraging Non-Blocking Collective Communication in High-Performance Applications, SPAA'08, pp. 113-115, June 14-16, 2008, Munich, Germany.

2. Mikhail Brinskiy, Alexander Supalov, Michael Chuvelev, Evgeny Leksikov, Mastering Performance Challenges with the new MPI-3 Standard, PUM issue 18: http://goparallel.sourceforge.net/wp-content/uploads/2014/07/PUM18_Mastering_Performance_with_MPI3.pdf

 

Intel® Inspector Glossary

Intel® Inspector is a dynamic memory and threading error checking tool for users developing serial and multithreaded applications on Windows* and Linux* operating systems.

The following is a glossary for the Intel Inspector.

analysis: A process during which the Intel Inspector performs collection and finalization.

code location: A fact the Intel Inspector observes at a source code location, such as a write code location. Previously called an observation.

collection: A process during which the Intel Inspector executes an application, identifies issues that may need handling, and collects those issues in a result.

false positive: A reported error that is not an error.

finalization: A process during which the Intel Inspector uses debug information from binary files to convert symbol information into filenames and line numbers, performs duplicate elimination (if requested), and forms problem sets.

problem: One or more occurrences of a detected issue, such as an uninitialized memory access. Multiple occurrences have the same call stack but a different thread or timestamp. You can view information for a problem as well as for each occurrence.

problem breakpoint: A breakpoint that halts execution when a memory or threading analysis detects a problem. In the Visual Studio* debugger, a problem breakpoint is indicated by a yellow arrow at the source line where execution halts.

problem set: A group of problems with a common problem type and a shared code location that might share a common solution, such as a problem set resulting from deallocating an object too early during application execution. You can view problem sets only after analysis is complete.

project: A compiled application; a collection of configurable attributes, including suppression rules and search directories; and a container for analysis results.

result: A collection of issues that may need handling.

suppression: An Intel Inspector productivity feature you can use to prevent collection of result data that matches a rule you define.

target: An application the Intel Inspector inspects for errors. 
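
To make a few of the terms above concrete, the fragment below is a hypothetical illustration (it is not an Intel Inspector sample or Inspector output) of a data race: two threads update a shared counter without synchronization. A threading analysis would typically group the conflicting accesses into a single data race problem with multiple occurrences; protecting the counter with a mutex or making it std::atomic removes the problem.

#include <thread>

static long counter = 0;          // shared, unsynchronized state

void work() {
    for (int i = 0; i < 100000; ++i)
        ++counter;                // racy read-modify-write on shared data
}

int main() {
    std::thread t1(work);         // both threads write 'counter' without a lock
    std::thread t2(work);
    t1.join();
    t2.join();
    return 0;                     // final value of 'counter' is non-deterministic
}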

Intel® Inspector Sample Applications

Intel® Inspector is a dynamic memory and threading error checking tool for users developing serial and multithreaded applications on Windows* and Linux* operating systems.

Intel Inspector sample applications are installed as individual compressed files in the samples directory within the Intel Inspector installation directory. After you copy a sample application compressed file to a writable directory, use a suitable tool to extract the contents. Extracted contents include a short README (TXT format) that describes how to build the sample and fix issues. To load a sample into the Microsoft Visual Studio* environment, double-click the .sln file.

NOTE:

  • Companion tutorials are available for some sample applications.
  • Sample applications are non-deterministic.
  • Sample applications are designed only to illustrate the Intel Inspector features and do not represent best practices for creating code.

The following sample applications are included with the Intel Inspector.

Sample Application: tachyon_insp_xe
Summary: Displays a rendering of a graphical image via 2D ray tracing.
Demonstrates: Detecting memory and threading errors in a C++ application.
Data conflicts: Memory leak, invalid memory access, mismatched memory allocation and deallocation, and data race.

Sample Application: banner
Summary: Displays an abcde banner on the command line.
Demonstrates: Detecting memory and threading errors in a C++ application.
Data conflicts: Memory leak, invalid memory access, and data race.

Sample Application: parallel_nqueens_csharp
Summary: Computes the number of solutions to the nQueens problem for a given board size.
Demonstrates: Detecting threading errors in a C# application.
Data conflicts: Data race.

Sample Application: nqueens_fortran
Summary: Solves the nqueens problem for various board sizes.
Demonstrates: Detecting threading and memory errors in a Fortran application.
Data conflicts: Data race, uninitialized memory access, and memory leak.
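
As a rough, hypothetical illustration of the memory error categories listed above (it is not code from any of the samples), the fragment below contains a memory leak, an invalid memory access, and a mismatched allocation/deallocation of the kind a memory analysis would report:

#include <cstdlib>

int main() {
    int* leaked = new int[100];    // memory leak: never released
    leaked[0] = 42;

    int past_end = leaked[100];    // invalid memory access: one element past the end
    (void)past_end;

    double* block = new double[50];
    block[0] = 1.0;
    std::free(block);              // mismatched allocation/deallocation:
                                   // allocated with new[], released with free()
    return 0;
}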
