The following article gives some advice on the types of Intel® Xeon Phi™ applications that may benefit from using Intel® Math Kernel Library (Intel® MKL) and its different usage models. Here are some choices to make concerning your Xeon Phi workload before you move to calling Intel MKL functions:
Optimize your code regardless of your target.
- Any optimizations you make regarding threading, vectorization, and memory hierarchy before you even start making Intel MKL calls will benefit both Xeon and Xeon Phi, and this longer-term investment also helps future-proof your code.
Decide on the various execution and programming models that suit your needs.
- As with other Xeon Phi programming models, you can choose among MPI, offload, and native execution. In addition to compiler-assisted offload (CAO) and native modes, Intel MKL offers an Automatic Offload (AO) model for commonly used functions.
- A select subset of MKL functions is AO enabled.
§ Only functions with sufficient computation to offset data transfer overhead are subject to AO
- In Intel MKL 11.0, AO-enabled functions include:
§ Level-3 BLAS: ?GEMM, ?TRSM, ?TRMM
§ LAPACK "three amigos": LU, QR, and Cholesky factorizations
- AO kicks in only when the matrix sizes are large enough
§ ?GEMM: Offloading only when M, N > 2048
§ ?TRSM/TRMM: Offloading only when M, N > 3072
§ Square matrices may give better performance
- Work division settings are just hints to the MKL runtime
- To disable AO after it has been enabled, use one of the following:
§ mkl_mic_disable(), or
§ mkl_mic_set_workdivision(MKL_TARGET_HOST, 0, 1.0), or
§ MKL_HOST_WORKDIVISION=100
- Use data persistence to avoid unnecessary data copying and memory allocation/de-allocation (see the combined AO/CAO sketch after this list)
o Thread affinity: avoid using the OS core. Example for a 60-core coprocessor:
MIC_KMP_AFFINITY=explicit,granularity=fine,proclist=[1-236:1]
- Use huge (2MB) pages for memory allocation in user code:
o MIC_USE_2MB_BUFFERS=64K
o The value of MIC_USE_2MB_BUFFERS is a threshold. E.g., allocations of 64K bytes or larger will use huge pages.
- You can use AO for some MKL calls and CAO for others in the same program (also shown in the sketch after this list)
§ Only supported by Intel compilers
§ Work division must be set explicitly for AO
· Otherwise, all MKL AO calls are executed on the host
- Choose native execution if:
§ Your code is highly parallel.
§ You want to use the coprocessors as independent compute nodes.
- Choose AO when
§ The computation-to-data-transfer ratio is high enough to make offload beneficial.
§ Level-3 BLAS functions: ?GEMM, ?TRMM, ?TRSM.
§ LU and QR factorization (in upcoming release updates).
- Choose CAO when either:
§ There is enough computation to offset the data transfer overhead, or
§ Transferred data can be reused by multiple operations.
- You can always run on the host if offloading does not achieve better performance
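To make the models above concrete, here is a minimal C sketch, assuming the Intel compiler, Intel MKL 11.x, and a single coprocessor (mic:0). It enables Automatic Offload for a large DGEMM, sets an explicit work division, and then uses compiler-assisted offload with data persistence so one buffer is transferred once and reused across offloads. The 4096x4096 matrices, the 0.5 host/card split, the 10 iterations, and the placeholder card-side work are illustrative assumptions, not tuned recommendations.

    #include <stdio.h>
    #include <mkl.h>

    #define N 4096   /* large enough to pass the ?GEMM AO size threshold (M, N > 2048) */

    int main(void)
    {
        size_t elems = (size_t)N * N;
        double *A = (double *)mkl_malloc(elems * sizeof(double), 64);
        double *B = (double *)mkl_malloc(elems * sizeof(double), 64);
        double *C = (double *)mkl_malloc(elems * sizeof(double), 64);
        if (!A || !B || !C) return 1;
        for (size_t i = 0; i < elems; ++i) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

        /* --- Automatic Offload (AO) --- */
        mkl_mic_enable();                                 /* or MKL_MIC_ENABLE=1 in the environment */
        /* When mixing AO with CAO, set work division explicitly; here the host
           keeps half of the DGEMM work (0.5 is an assumption, not a recommendation). */
        mkl_mic_set_workdivision(MKL_TARGET_HOST, 0, 0.5);
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    N, N, N, 1.0, A, N, B, N, 0.0, C, N); /* eligible for AO */
        mkl_mic_disable();                                /* turn AO back off */

        /* --- Compiler-assisted offload (CAO) with data persistence --- */
        /* Send A to the card once and keep it there: alloc_if(1), free_if(0). */
        #pragma offload_transfer target(mic:0) in(A : length(elems) alloc_if(1) free_if(0))

        for (int iter = 0; iter < 10; ++iter) {
            /* Reuse the persistent copy of A on every offload: nocopy + alloc_if(0). */
            #pragma offload target(mic:0) \
                nocopy(A : length(elems) alloc_if(0) free_if(0)) \
                inout(C : length(elems))
            {
                /* Coprocessor-side work that reuses A, e.g. more MKL calls;
                   a trivial placeholder is used here. */
                C[0] += A[0];
            }
        }

        /* Release the persistent buffer on the card at the end. */
        #pragma offload_transfer target(mic:0) nocopy(A : length(elems) alloc_if(0) free_if(1))

        printf("C[0] = %f\n", C[0]);
        mkl_free(A); mkl_free(B); mkl_free(C);
        return 0;
    }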
The same performance gap analysis tools work with Xeon Phi.
- Understand where your application is relative to its "speed of light" (theoretical peak) limits.
- Diagnose bottlenecks and plan how to close the performance gap.
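As a hedged illustration of such a limit (the core count and clock are assumptions for a 60-core part; check your coprocessor's specification): with 8 double-precision SIMD lanes and fused multiply-add, each core can retire 16 FLOP per cycle, so the peak is roughly 60 cores x 1.05 GHz x 16 FLOP/cycle ≈ 1 TFLOP/s in double precision. Comparing a hotspot's measured GFLOP/s against such a bound gives a first estimate of the remaining gap.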
Look into algorithmic improvements for your custom algorithms to extract more parallelism.
• The additional parallelism helps deliver greater performance at the same power
• These applications are diverse, building upon and frequently combining Messaging, Threading, and Data Parallel constructs
• These "highly parallel" applications, which under Amdahl's law spend more of their time in parallel sections of code, can benefit from the highly parallel Xeon Phi architecture
Understand the differences between Xeon and Xeon Phi
- Xeon Phi vs. Sandy Bridge
o ~3x slower clock
o in-order vs. out-of-order execution
o longer latency
o Xeon Phi needs two threads to keep the front end busy
- The implication is that your application as a whole needs to be at least 99% parallel
o You can rest assured that the Intel MKL calls you make are among the best possible implementations for getting maximum utilization out of the processor, short of hand-coding the algorithm yourself.
o Make use of all threads and the full vector width most of the time.
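A quick Amdahl's law check makes the 99% figure concrete (the 240-thread count is an assumption matching a 60-core coprocessor with 4 threads per core): speedup = 1 / ((1 - P) + P/N), so with P = 0.99 and N = 240 the ceiling is 1 / (0.01 + 0.99/240) ≈ 71x, far short of the ideal 240x; raising P to 0.999 lifts the ceiling to roughly 194x.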
Perform a granularity analysis
- Ideal Scenario for Xeon Phi
o A small number of large chunks of work that fit on the card, with a high computation-to-communication ratio
o Huge codes can become difficult to manage, so you may need to rely on data persistence to minimize host-card transfers.
- Preferred tool for analysis
o Intel VTune™ Amplifier hotspot analysis can indicate granularity for functions and loops
- Maximize the granularity
o The implicit spawn/join overhead for >>100 threads can be large enough to matter (see the sketch below)
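A generic OpenMP sketch of the granularity point, with illustrative loop bounds (neither function is from the article): the first version pays a fork/join per tiny inner loop, while the second gives each of the many threads one large chunk of work.

    #define NOUTER 100000
    #define NINNER 64

    void fine_grained(double *x)
    {
        /* Fork/join per small inner loop: with well over 100 threads the
           spawn/join overhead can swamp the useful work. */
        for (int i = 0; i < NOUTER; ++i) {
            #pragma omp parallel for
            for (int j = 0; j < NINNER; ++j)
                x[i * NINNER + j] *= 2.0;
        }
    }

    void coarse_grained(double *x)
    {
        /* One parallel region over the large outer loop: big chunks per thread
           and a high computation-to-overhead ratio. */
        #pragma omp parallel for
        for (int i = 0; i < NOUTER; ++i)
            for (int j = 0; j < NINNER; ++j)
                x[i * NINNER + j] *= 2.0;
    }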
Functional vs. Data Parallelism
- The goal is to execute concurrently on host and MIC
- Data parallelism
o Use explicit data domain decomposition.
o Load balance work on host and MIC for disjoint data subsets.
o OpenMP 4.0 simd directive – standalone directive; allows forward dependencies
o Array notation – explicit vector syntax (Intel Cilk Plus)
o Intel Compiler pragma simd – guarantees use of the vector version
o Elemental functions – building block for auto-parallelization, array notation, and pragma simd (as well as Cilk); a sketch of these constructs follows this section
- Functional parallelism
o Execute different functions on each of host and MIC based on their characteristics. This typically requires dependence analysis and code refactoring
o Load balancing is less portable across host-MIC pairs over time
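A minimal sketch of the data-parallel constructs listed above, assuming an OpenMP 4.0-capable compiler; the function names and the clipping operation are illustrative only.

    /* Elemental (SIMD-enabled) function: the compiler also emits a vector
       version callable from vectorized loops (OpenMP 4.0 declare simd). */
    #pragma omp declare simd
    static double clip(double v)
    {
        return v > 1.0 ? 1.0 : (v < -1.0 ? -1.0 : v);
    }

    void scale_and_clip(const double *a, double *b, int n, double s)
    {
        /* OpenMP 4.0 simd directive: asks the compiler to vectorize this loop. */
        #pragma omp simd
        for (int i = 0; i < n; ++i)
            b[i] = clip(s * a[i]);

        /* Intel Cilk Plus array notation expresses the same idea as a whole-array
           operation, e.g.:  b[0:n] = s * a[0:n];  (Intel compiler only). */
    }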
Ways to expose thread parallelism
- Intel MKL is of course an excellent option for non-custom algorithms
- OpenMP
o Use collapse to increase the degree of parallelism (see the sketch at the end of this section)
o Consider the experimental crew feature to parallelize within a core
- Intel® Cilk™ Plus
o Use cilk_for and cilk_spawn (also shown in the sketch below)
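A short sketch of the two options above; the function names and loop structure are illustrative, and the Cilk variant assumes a compiler with Intel Cilk Plus support.

    #include <cilk/cilk.h>

    void scale_matrix_omp(double *a, int rows, int cols, double s)
    {
        /* collapse(2) fuses the two loops into one iteration space, creating
           rows*cols units of work to spread over the ~240 hardware threads. */
        #pragma omp parallel for collapse(2)
        for (int i = 0; i < rows; ++i)
            for (int j = 0; j < cols; ++j)
                a[i * cols + j] *= s;
    }

    void scale_matrix_cilk(double *a, int rows, int cols, double s)
    {
        /* cilk_for lets the Cilk runtime schedule the outer loop's iterations. */
        cilk_for (int i = 0; i < rows; ++i)
            for (int j = 0; j < cols; ++j)
                a[i * cols + j] *= s;
    }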