Channel: Intel Developer Zone Articles

Using Intel MKL and Intel TBB in the same application


Intel MKL 11.3 Beta has introduced Intel TBB support.

Intel MKL 11.3 can increase performance of applications threaded using Intel TBB. Applications using Intel TBB can benefit from the following Intel MKL functions:

  • BLAS: dot, gemm, gemv, gels
  • LAPACK: getrf, getrs, syev, gels, gelsy, gesv, pstrf, potrs
  • Sparse BLAS: csrmm, bsrmm
  • Intel MKL Poisson Solver
  • Intel MKL PARDISO

If such applications call functions not listed above, Intel MKL 11.3 executes sequential code. Depending on feedback from customers, future versions of Intel MKL may support Intel TBB in more functions.

Linking applications to Intel TBB and Intel MKL

The simplest way to link applications to Intel TBB and Intel MKL is to use the Intel C/C++ Compiler. While Intel MKL supports both static and dynamic linking, only the dynamic Intel TBB library is available.

Under Linux, use the following commands to compile your application app.c and link it to Intel TBB and Intel MKL.

Dynamic Intel TBB, dynamic Intel MKL:    icc app.c -mkl -tbb

Dynamic Intel TBB, static Intel MKL:     icc app.c -static -mkl -tbb

Under Windows, use the following commands to compile your application app.c and link it to dynamic Intel TBB and Intel MKL.

Dynamic Intel TBB, dynamic Intel MKL:    icl.exe app.c -mkl -tbb

Improving Intel MKL performance with Intel TBB

Performance of Intel MKL can be improved by telling Intel TBB to ensure thread affinity to processor cores. Use the tbb::affinity_partitioner class to this end.

To improve performance of Intel MKL for small input data, you may limit the number of threads allocated by Intel TBB for Intel MKL. Use the tbb::task_scheduler_init class to do so.
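As an illustration (this sketch is added here and is not part of the original article), the following minimal example limits the number of worker threads with tbb::task_scheduler_init and then calls cblas_dgemm, one of the TBB-threaded MKL functions listed above; the thread count of 4 and the matrix size are arbitrary example values.

// Minimal sketch, assuming Intel MKL 11.3 built and linked with TBB support
// (e.g., icc app.cpp -mkl -tbb as shown above).
#include <vector>
#include "mkl.h"
#include "tbb/task_scheduler_init.h"

int main() {
    // Restrict TBB (and therefore the TBB-threaded MKL code) to 4 worker threads;
    // this can help for small input data.
    tbb::task_scheduler_init init(4);

    const MKL_INT n = 256;
    std::vector<double> a(n * n, 1.0), b(n * n, 2.0), c(n * n, 0.0);

    // C = 1.0 * A * B + 0.0 * C; gemm is one of the functions threaded with Intel TBB.
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, a.data(), n, b.data(), n, 0.0, c.data(), n);
    return 0;
}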

For more information on controlling behavior of Intel TBB, see the Intel TBB documentation at https://www.threadingbuildingblocks.org/documentation.

LAPACK performance in applications using Intel TBB and Intel MKL 11.3

* Each call is a single run of a single size, over the range 1000 to 10000 in steps of 1000. Performance (GFlops) is computed as the cumulative number of floating-point operations for all 10 calls divided by the wall-clock time from the start of the very first call to the end of the very last call.

 


Diagnostic 13379: loop was not vectorized with "simd"


Product Version:  Intel® Fortran Compiler 15.0 and above

Cause:

This diagnostic is issued when a loop contains a conditional statement that controls the assignment of a scalar value AND the scalar value is referenced AFTER the loop exits. The vectorization report generated using the Intel® Fortran Compiler's optimization and vectorization report options includes the non-vectorized loop instance:

Windows* OS:  /O2  /Qopt-report:2  /Qopt-report-phase:vec    

Linux OS or OS X:  -O2  -qopt-report=2  -qopt-report-phase=vec

Example:

The example below generates the following remark in the optimization report:

subroutine f13379( a, b, n )
implicit none
integer :: a(n), b(n), n

integer :: i, x=10

!dir$ simd
do i=1,n
  if( a(i) > 0 ) then
     x = i  !...here is the conditional assignment
  end if
  b(i) = x
end do
!... reference the scalar outside of the loop
write(*,*) "last value of x: ", x
end subroutine f13379

ifort -c /O2 /Qopt-report:2 /Qopt-report-phase:vec /Qopt-report-file:stdout f13379.f90

Begin optimization report for: F13379

    Report from: Vector optimizations [vec]

LOOP BEGIN at f13379.f90(8,1)
    ....
   remark #13379: loop was not vectorized with "simd"
LOOP END

Resolution:

The reference to the scalar after the loop requires that the value coming out of the loop is "correct", meaning that the loop iterations were executed strictly in order and sequentially.  If the scalar is NOT referenced outside of the loop, the compiler can vectorize this loop, since the order in which the iterations are evaluated does not matter: without a reference outside the loop, the final value of the scalar does not matter because it is no longer used.

Example

subroutine f13379( a, b, n )
implicit none
integer :: a(n), b(n), n

integer :: i, x=10

!dir$ simd
do i=1,n
  if( a(i) > 0 ) then
     x = i  !...here is the conditional assignment
  end if
  b(i) = x
end do
!... no reference to scalar X outside of the loop
!... removed the WRITE statement for X
end subroutine f13379

Begin optimization report for: F13379
    Report from: Vector optimizations [vec]

LOOP BEGIN at f13379.f90(8,1)
f13379.f90(8,1):remark #15301: SIMD LOOP WAS VECTORIZED
LOOP END

See also:

Requirements for Vectorizable Loops

Vectorization Essentials

Vectorization and Optimization Reports

Back to the list of vectorization diagnostics for Intel® Fortran

Debug SPI BIOS after Power Up Sequence


After PCB assembly and board power-up, the next phase is SPI BIOS debugging. Many system engineers and firmware engineers interested in Intel System Studio (ISS) have asked whether ISS can be used to debug the SPI BIOS immediately after CPU reset de-assertion. The answer is YES, and below we explain how to make it happen with the Intel System Debugger of ISS.

Debugging the SPI BIOS immediately after CPU reset de-assertion is a difficult task, because the time needed to connect from the host to the target is much longer than the platform power-up sequence, even including the BIOS module boot time. To accommodate this demand, the Intel System Debugger provides a feature set to halt the target immediately after CPU reset de-assertion. The steps described below are required for this use case.

 

  1. Launch the Intel System Debugger of ISS 2015 (formerly the Intel JTAG Debugger of ISS 2014).

  2. Connect to the target platform.

  3. Reset the target using the “restart” console command, or by clicking the restart button as shown below.

After the target has been reset, you can debug the SPI BIOS.

 

What is Code Modernization?


Modern high performance computers are built with a combination of resources including: multi-core processors, many-core processors, large caches, high speed memory, high bandwidth inter-processor communications fabric, and high speed I/O capabilities. High performance software needs to be designed to take full advantage of this wealth of resources. Whether re-architecting and/or tuning existing applications for maximum performance or architecting new applications for existing or future machines, it is critical to be aware of the interplay between programming models and the efficient use of these resources. Consider this a starting point for information regarding Code Modernization. When it comes to performance, your code matters!

Building parallel versions of software can enable applications to run a given data set in less time, run multiple data sets in a fixed amount of time, or run large-scale data sets that are prohibitive with un-optimized software. The success of parallelization is typically quantified by measuring the speedup of the parallel version relative to the serial version. In addition to that comparison, however, it is also useful to compare that speedup relative to the upper limit of the potential speedup. That issue can be addressed using Amdahl's Law and Gustafson's Law.
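For reference (this formulation is added here and is not part of the original text), with p denoting the fraction of the work that can be parallelized and N the number of processors, the two laws can be written as:

\[ S_{\text{Amdahl}}(N) = \frac{1}{(1 - p) + p/N}, \qquad S_{\text{Gustafson}}(N) = (1 - p) + p N \]

Amdahl's Law bounds the speedup of a fixed-size problem, while Gustafson's Law describes the scaled speedup when the problem size grows with the number of processors.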

Good code design takes into consideration several levels of parallelism.

  • The first level of parallelism is Vector parallelism (within a core) where identical computational instructions are performed on large chunks of data.  Both scalar and parallel portions of code will benefit from the efficient use of vector computing.
  • A second level of parallelism called Thread parallelism, is characterized by a number of cooperating threads of a single process, communicating via shared memory and collectively cooperating on a given task.
  • The third level is when many codes have been developed in the style of independent cooperating processes, communicating with each other via some message passing system. This is called distributed memory Rank parallelism, so named as each process is given a unique rank number.

Developing code which uses all three levels of parallelism effectively, efficiently, and with high performance is optimal for modernizing code.

Factoring into these considerations is the impact of the memory model of the machine: amount and speed of main memory, memory access times with respect to location of memory, cache sizes and numbers, and requirements for memory coherence.

Poor data alignment for vector parallelism will generate a huge performance impact. Data should be organized in a cache-friendly way. If it is not, performance will suffer when the application requests data that is not in the cache. The fastest memory access occurs when the needed data is already in cache. Data transfers to and from cache are in cache lines, and as such, if the next piece of data is not within the current cache line or is scattered amongst multiple cache lines, the application may have poor cache efficiency.

Divisional and transcendental math functions are expensive even when directly supported by the instruction set. If your application uses many division and square root operations within the run-time code, the resulting performance may be degraded because of the limited functional units within the hardware; the pipeline to these units may become saturated. Since these instructions are expensive, the developer may wish to cache frequently used values to improve performance.
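As a hedged example of the last point (added here, not from the original article; the function is hypothetical), hoisting a division out of a loop and caching the reciprocal replaces many expensive divide operations with multiplications:

#include <cstddef>

// Hypothetical example: scale an array by 1/d.
void scale(float* data, std::size_t n, float d) {
    // Expensive: one divide per iteration keeps the divide unit busy.
    // for (std::size_t i = 0; i < n; ++i) data[i] = data[i] / d;

    // Cheaper: compute the reciprocal once and reuse it as a multiplication.
    const float inv_d = 1.0f / d;
    for (std::size_t i = 0; i < n; ++i)
        data[i] *= inv_d;
}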

There is no “one recipe, one solution” technique. A great deal depends on the problem being solved and the long term requirements for the code, but a good developer will pay attention to all levels of optimization, both for today’s requirements and for the future.

Intel has built a full suite of tools to aid in code modernization - compilers, libraries, debuggers, performance analyzers, parallel optimization tools and more. Intel even has webinars, documentation, training examples, and best known methods and case studies which are all based on over thirty years of experience as a leader in the development of parallel computers.

Code Modernization 5 Stage Framework for Multi-level Parallelism

The Code Modernization optimization framework takes a systematic approach to application performance improvement. This framework takes an application through five optimization stages, each stage iteratively improving the application performance. But before you start the optimization process, you should consider whether the application needs to be re-architected (given the guidelines below) to achieve the highest performance, and then follow the Code Modernization optimization framework.

By following this framework, an application can achieve the highest performance possible on Intel® Architecture. The stepwise approach helps the developer achieve the best application performance in the shortest possible time. In other words, it allows the program to maximize its use of all parallel hardware resources in the execution environment. The stages:

  1. Leverage optimization tools and libraries: Profile the workload using Intel® VTune™ Amplifier to identify hotspots. Use Intel C++ compiler to generate optimal code and apply optimized libraries such as Intel® Math Kernel Library, Intel® TBB, and OpenMP* when appropriate.
  2. Scalar, serial optimization: Maintain the proper precision, type constants, and use appropriate functions and precision flags.
  3. Vectorization: Utilize SIMD features in conjunction with data layout optimizations. Apply cache-aligned data structures, convert from arrays of structures to structures of arrays, and minimize conditional logic (see the sketch after this list).
  4. Thread Parallelization: Profile thread scaling and affinitize threads to cores. Scaling issues typically are a result of thread synchronization or inefficient memory utilization.
  5. Scale your application from multicore to many core (distributed memory Rank parallelism): Scaling is especially important for highly parallel applications. Minimize the changes and maximize the performance as the execution target changes from one flavor of the Intel architecture (Intel® Xeon® processor) to another (Intel® Xeon Phi™ Coprocessor).
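As a hedged illustration of the data-layout point in stage 3 (added here, not taken from the original text; the type and field names are hypothetical), the sketch below contrasts an array of structures with a structure of arrays:

#include <vector>

// Array of Structures (AoS): x, y, z of one point are adjacent, but a loop that
// touches only x strides over y and z as well, wasting cache lines and SIMD lanes.
struct PointAoS { float x, y, z; };

// Structure of Arrays (SoA): each component is stored contiguously, so a loop
// over x alone reads packed, unit-stride data that vectorizes cleanly.
struct PointsSoA {
    std::vector<float> x, y, z;
};

float sum_x_aos(const std::vector<PointAoS>& pts) {
    float s = 0.0f;
    for (const PointAoS& p : pts) s += p.x;   // strided access: 12-byte stride
    return s;
}

float sum_x_soa(const PointsSoA& pts) {
    float s = 0.0f;
    for (float v : pts.x) s += v;             // unit-stride access: SIMD friendly
    return s;
}

A loop that touches only one component walks packed, unit-stride memory in the SoA layout, which is what both the vectorizer and the cache prefer.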

5 Stages of code modernization

Code Modernization – The 5 Stages in Practice

Stage 1
At the beginning of your optimization project, select an optimizing development environment. The decision you make at this step will have a profound influence on the later steps. Not only will it affect the results you get, it could substantially reduce the amount of work to do. The right optimizing development environment can provide you with good compiler tools, optimized, ready-to-use libraries, and debugging and profiling tools to pinpoint exactly what the code is doing at runtime.

Stage 2
Once you have exhausted the available optimization solutions, in order to extract greater performance from your application you will need to begin the optimization process on the application source code. Before you begin active parallel programming, you need to make sure your application delivers the right results before you vectorize and parallelize it. Equally important, you need to make sure it does the minimum number of operations to get that correct result. You should look at the data and algorithm related issues such as:

  • Choosing the right floating point precision
  • Choosing the right approximation method accuracy; polynomial vs. rational
  • Avoiding jump algorithms
  • Reducing the loop operation strength by using iteration calculations
  • Avoiding or minimizing conditional branches in your algorithms
  • Avoiding repetitive calculations, using previously calculated results.

You may also have to deal with language-related performance issues. If you have chosen C/C++, the language-related issues include the following (a brief sketch follows the list):

  • Use explicit typing for all constants to avoid auto-promotion
  • Choose the right types of C runtime function, e.g. doubles vs. floats: exp() vs. expf(); abs() vs. fabs()
  • Explicitly tell the compiler about pointer aliasing
  • Explicitly inline function calls to avoid call overhead
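The sketch below (added here as an illustration; the functions are hypothetical) shows the first two items: explicit float constants avoid auto-promotion to double, and the single-precision C runtime functions expf() and fabsf() are used instead of exp() and fabs():

#include <math.h>

// Single-precision version: the 'f' suffix keeps constants in float, and the
// float variants of the C runtime functions (expf, fabsf) avoid double math.
static inline float decay_fast(float x) {
    return 1.5f * expf(-fabsf(x));
}

// Mixing double literals and double functions silently promotes the whole
// expression to double precision and costs performance:
static inline float decay_slow(float x) {
    return 1.5 * exp(-fabs(x));   /* 1.5, exp() and fabs() are double */
}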

Stage 3
Try vector level parallelism. First try to vectorize the inner most loop. For efficient vector loops, make sure that there is minimal control flow divergence and that memory accesses are coherent. Outer loop vectorization is a technique to enhance performance. By default, compilers attempt to vectorize innermost loops in nested loop structures. But, in some cases, the number of iterations in the innermost loop is small. In this case, inner-loop vectorization is not profitable. However, if an outer loop contains more work, a combination of elemental functions, strip-mining, and pragma/directive SIMD can force vectorization at this outer, profitable level.

  1. SIMD performs best on “packed” and aligned input data, and by its nature penalizes control divergences. In addition, good SIMD and thread performance on modern hardware can be obtained if the application implementation puts a focus on data proximity.
  2. If the innermost loop does not have enough work (e.g., the trip count is very low, so the performance benefit of vectorization is hard to measure) or there are data dependencies that prevent vectorizing the innermost loop, try vectorizing the outer loop (see the sketch after this list). The outer loop is likely to have control flow divergence, especially if the trip count of the inner loop is different for each iteration of the outer loop; this will limit the gains from vectorization. The memory access of the outer loop is also more likely to be divergent than that of an inner loop, which results in gather/scatter instructions instead of vector loads and stores and significantly limits the scaling due to vectorization. Data transformations, such as transposing a two-dimensional array or switching from Arrays of Structures to Structures of Arrays, may alleviate these problems.
  3. When the loop hierarchy is shallow, the above guideline may result in a loop that needs to be both parallelized and vectorized. In that case, that loop has to both provide enough parallel work to compensate for the overhead and also maintain control flow uniformity and memory access coherence.
  4. Check out the Vectorization Essentials for more details.
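As a hedged sketch of outer-loop vectorization (added here, not taken from the original; the array names, sizes, and compile flag are illustrative), the loop below uses an OpenMP* SIMD directive to ask the compiler to vectorize the outer loop when the inner trip count is small:

// Compile with an OpenMP-SIMD-aware compiler, e.g. icpc -qopenmp-simd.
constexpr int N = 1024;  // many outer iterations
constexpr int M = 4;     // short inner trip count: inner-loop vectorization not profitable

void weighted_rows(const float a[N][M], const float b[M], float c[N]) {
    // Ask the compiler to vectorize the outer loop; each SIMD lane then processes
    // a different row i, and the short inner loop runs within each lane.
    #pragma omp simd
    for (int i = 0; i < N; ++i) {
        float acc = 0.0f;
        for (int j = 0; j < M; ++j)
            acc += a[i][j] * b[j];
        c[i] = acc;
    }
}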

Stage 4
Now we get to thread level parallelization. Identify the outermost level and try to parallelize it. Obviously, this requires taking care of potential data races and moving data declaration to inside the loop as necessary. It may also require that the data be maintained in a cache efficient manner, to reduce the overhead of maintaining the data across multiple parallel paths. The rationale for the outermost level is to try to provide as much work as possible to each individual thread. Amdahl’s law states: The speedup of a program using multiple processors in parallel computing is limited by the time needed for the sequential fraction of the program. Since the amount of work needs to compensate for the overhead of parallelization, it helps to have as large a parallel effort in each thread as possible. If the outermost level cannot be parallelized due to unavoidable data dependencies, try to parallelize at the next-outermost level that can be parallelized correctly.

  1. If the amount of parallel work achieved at the outermost level appears sufficient for the target hardware and likely to scale with a reasonable increase of parallel resources, you are done. Do not add more parallelism, as the overhead will be noticeable (thread control overhead will negate any performance improvement) and the gains are unlikely.
  2. If the amount of parallel work is insufficient, e.g. as measured by core scaling that only scales up to a small core count and not to the actual core count, attempt to parallelize an additional layer, as far out as possible. Note that you don’t necessarily need to scale the loop hierarchy to all the available cores, as there may be additional loop hierarchies executing in parallel.
  3. If step 2 did not result in scalable code, there may not be enough parallel work in your algorithm. This may mean that partitioning a fixed amount of work among many threads gives each thread too little work, so the overhead of starting and terminating threads swamps the useful work. Perhaps the algorithms can be scaled to do more work, for example by trying on a bigger problem size.
  4. Make sure your parallel algorithm is cache efficient. If it is not, rework it to be cache efficient, as cache inefficient algorithms do not scale with parallelism.
  5. Check out the Intel Guide for Developing Multithreaded Applications series for more details.
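As a hedged, simplified sketch of the outermost-level threading described in Stage 4 (added here, not from the original text; the function and data are illustrative), the outer loop below is threaded with OpenMP* and per-iteration data is kept private to each thread:

#include <cstddef>
#include <vector>

// Thread the outermost loop so each thread receives a large chunk of work (whole
// rows); this amortizes the threading overhead, in line with Amdahl's law.
void scale_rows(std::vector<std::vector<double> >& grid, double factor) {
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(grid.size()); ++i) {
        // The reference is declared inside the loop, so it is private to each thread.
        std::vector<double>& row = grid[i];
        for (std::size_t j = 0; j < row.size(); ++j)
            row[j] *= factor;
    }
}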

Stage 5
Lastly we get to multi-node (Rank) parallelism. To many developers, the Message Passing Interface (MPI) is a black box that “just works” behind the scenes to transfer data from one MPI task (process) to another. The beauty of MPI for the developer is that the algorithmic coding is hardware independent. The concern developers have is that with many-core architectures of 60+ cores, the communication between tasks may create a communication storm either within a node or across nodes. To mitigate these communication bottlenecks, applications should employ hybrid techniques, with a few MPI tasks and many OpenMP threads.
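To make the hybrid technique concrete, here is a hedged, minimal sketch (added here, not part of the original article; the build command and the workload are illustrative): a few MPI ranks, each using many OpenMP* threads for its node-local work.

// Minimal hybrid MPI + OpenMP sketch: one MPI rank per node (few ranks),
// many OpenMP threads per rank for the node-local loop.
// Build (illustrative): mpiicpc -qopenmp hybrid.cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local_sum = 0.0;
    // Node-local work is threaded with OpenMP instead of using more MPI ranks,
    // which keeps the number of communicating tasks (and messages) small.
    #pragma omp parallel for reduction(+ : local_sum)
    for (int i = rank; i < 1000000; i += size)
        local_sum += 1.0 / (1.0 + i);

    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) std::printf("sum = %f (over %d ranks)\n", global_sum, size);
    MPI_Finalize();
    return 0;
}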

A well-optimized application should address vector parallelization, multi-threading parallelization, and multi-node (Rank) parallelization. However, to do this efficiently it is helpful to use a standard step-by-step methodology to ensure each stage level is considered. The stages described here can be (and often are) reordered depending upon the specific needs of each individual application; you can iterate in a stage more than once to achieve the desired performance.

Experience has shown that all stages must at least be considered to ensure an application delivers great performance on today’s scalable hardware as well as being well positioned to scale effectively on upcoming generations of hardware.

Give it a try!

Books - High Performance Parallelism Pearls


The two “Pearls” books contain an outstanding collection of examples of code modernization, complete with discussions by software developers of how code was modified, with commentary on what worked as well as what did not!  Code for these real world applications is available for download from http://lotsofcores.com whether you have bought the books or not.  The figures are freely available as well, a real bonus for instructors who choose to use these examples when teaching code modernization techniques.  The books, edited by James Reinders and Jim Jeffers, had 67 contributors for volume one and 73 contributors for volume two.

Experts wrote about their experiences in adding parallelism to their real world applications. Most examples illustrate their results on processors and on the Intel® Xeon Phi™ coprocessor. The key issues of scaling, locality of reference, and vectorization are recurring themes, as each contributed chapter contains explanations of the thinking behind adding parallelism to their applications. The actual code is shown and discussed, with step-by-step thinking and analysis of the results.  While OpenMP* and MPI are the dominant methods for parallelism, the books also include usage of TBB, OpenCL, and other models. There is a balance of Fortran, C, and C++ throughout. With such a diverse collection of real world examples, the opportunity to learn from other experts is quite amazing.

 

Volume 1 includes the following chapters:

Foreword by Sverre Jarp, CERN.

Chapter 1: Introduction

Chapter 2: From ‘Correct’ to ‘Correct & Efficient’: A Hydro2D Case Study with Godunov’s Scheme

Chapter 3: Better Concurrency and SIMD on HBM

Chapter 4: Optimizing for Reacting Navier-Stokes Equations

Chapter 5: Plesiochronous Phasing Barriers

Chapter 6: Parallel Evaluation of Fault Tree Expressions

Chapter 7: Deep-Learning and Numerical Optimization

Chapter 8: Optimizing Gather/Scatter Patterns

Chapter 9: A Many-Core Implementation of the Direct N-body Problem

Chapter 10: N-body Methods

Chapter 11: Dynamic Load Balancing Using OpenMP 4.0

Chapter 12: Concurrent Kernel Offloading

Chapter 13: Heterogeneous Computing with MPI

Chapter 14: Power Analysis on the Intel® Xeon Phi™ Coprocessor

Chapter 15: Integrating Intel Xeon Phi Coprocessors into a Cluster Environment

Chapter 16: Supporting Cluster File Systems on Intel® Xeon Phi™ Coprocessors

Chapter 17: NWChem: Quantum Chemistry Simulations at Scale

Chapter 18: Efficient Nested Parallelism on Large-Scale Systems

Chapter 19: Performance Optimization of Black-Scholes Pricing

Chapter 20: Data Transfer Using the Intel COI Library

Chapter 21: High-Performance Ray Tracing

Chapter 22: Portable Performance with OpenCL

Chapter 23: Characterization and Optimization Methodology Applied to Stencil Computations

Chapter 24: Profiling-Guided Optimization

Chapter 25: Heterogeneous MPI optimization with ITAC

Chapter 26: Scalable Out-of-Core Solvers on a Cluster

Chapter 27: Sparse Matrix-Vector Multiplication: Parallelization and Vectorization

Chapter 28: Morton Order Improves Performance

 

Volume 2 includes the following chapters:

Foreword by Dan Stanzione, TACC

Chapter 1: Introduction

Chapter 2: Numerical Weather Prediction Optimization

Chapter 3: WRF Goddard Microphysics Scheme Optimization

Chapter 4: Pairwise DNA Sequence Alignment Optimization

Chapter 5: Accelerated Structural Bioinformatics for Drug Discovery     

Chapter 6: Amber PME Molecular Dynamics Optimization

Chapter 7: Low Latency Solutions for Financial Services

Chapter 8: Parallel Numerical Methods in Finance    

Chapter 9: Wilson Dslash Kernel From Lattice QCD Optimization

Chapter 10: Cosmic Microwave Background Analysis: Nested Parallelism In Practice  

Chapter 11: Visual Search Optimization

Chapter 12: Radio Frequency Ray Tracing

Chapter 13: Exploring Use of the Reserved Core

Chapter 14: High Performance Python Offloading

Chapter 15: Fast Matrix Computations on Asynchronous Streams 

Chapter 16: MPI-3 Shared Memory Programming Introduction

Chapter 17: Coarse-Grain OpenMP for Scalable Hybrid Parallelism  

Chapter 18: Exploiting Multilevel Parallelism with OpenMP

Chapter 19: OpenCL: There and Back Again

Chapter 20: OpenMP vs. OpenCL: Difference in Performance?      

Chapter 21: Prefetch Tuning Optimizations

Chapter 22: SIMD functions via OpenMP

Chapter 23: Vectorization Advice  

Chapter 24: Portable Explicit Vectorization Intrinsics

Chapter 25: Power Analysis for Applications and Data Centers

 

Parallel Programming Books


Use these parallel programming resources to optimize with your Intel® Xeon® processor and Intel® Xeon Phi™ coprocessor.

High Performance Parallelism Pearls: Multicore and Many-core Programming Approaches ›
by James Reinders and James Jeffers | Publication Date: November 17, 2014 | ISBN-10: 0128021187 | ISBN-13: 978-0128021187

High Performance Parallelism Pearls shows how to leverage parallelism on processors and coprocessors with the same programming – illustrating the most effective ways to better tap the computational potential of systems with Intel® Xeon Phi™ coprocessors and Intel® Xeon® processors or other multicore processors.

More details on the 1st and (new) 2nd volume of the High Performance Parallelism Pearls can be found here


Structured Parallel Programming: Patterns for Efficient Computation ›
by Michael McCool, James Reinders and Arch Robison | Publication Date: July 9, 2012 | ISBN-10: 0124159931 | ISBN-13: 978-0124159938

This book fills a need for learning and teaching parallel programming, using an approach based on structured patterns which should make the subject accessible to every software developer. It is appropriate for classroom usage as well as individual study.


Intel® Xeon Phi™ Coprocessor High Performance Programming ›
by Jim Jeffers and James Reinders – Now available!

The key techniques emphasized in this book are essential to programming any modern parallel computing system whether based on Intel® Xeon® processors, Intel® Xeon Phi™ coprocessors, or other high performance microprocessors.



Parallel Programming and Optimization with Intel® Xeon Phi™ Coprocessors ›
by Colfax International

This book will guide you to the mastery of parallel programming with Intel® Xeon® family products: Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors. It includes a detailed presentation of the programming paradigm for Intel® Xeon® product family, optimization guidelines, and hands-on exercises on systems equipped with the Intel® Xeon Phi™ coprocessors, as well as instructions on using Intel software development tools and libraries included in Intel Parallel Studio XE.


Intel® Xeon Phi™ Coprocessor Architecture and Tools: The Guide for Application Developers ›
by Reza Rahman

Intel® Xeon Phi™ Coprocessor Architecture and Tools: The Guide for Application Developers provides developers a comprehensive introduction and in-depth look at the Intel Xeon Phi coprocessor architecture and the corresponding parallel data structure tools and algorithms used in the various technical computing applications for which it is suitable. It also examines the source code-level optimizations that can be performed to exploit the powerful features of the processor.


Optimizing HPC Applications with Intel Cluster Tools: Hunting Petaflops ›
by Alexander Supalov

Optimizing HPC Applications with Intel® Cluster Tools takes the reader on a tour of the fast-growing area of high performance computing and the optimization of hybrid programs. These programs typically combine distributed memory and shared memory programming models and use the Message Passing Interface (MPI) and OpenMP for multi-threading to achieve the ultimate goal of high performance at low power consumption on enterprise-class workstations and compute clusters.

Intel® Parallel Computing Center at Georgia Institute of Technology


Principal Investigator:

Srinivas Aluru is a professor in the School of Computational Science and Engineering within the College of Computing at Georgia Institute of Technology. He serves as a co-director of the Georgia Tech Strategic Initiative in Data Engineering and Science. Aluru conducts research in high performance computing, bioinformatics, systems biology, and combinatorial scientific computing. He pioneered the development of parallel methods in computational genomics and systems biology. He is a Fellow of the American Association for the Advancement of Science (AAAS) and the Institute of Electrical and Electronics Engineers (IEEE).

Description:

The Intel® Parallel Computing Center (Intel® PCC) on Big Data in Biosciences and Public Health is focused on developing and optimizing parallel algorithms and software on Intel® Xeon® Processor and Intel® Xeon Phi™ Coprocessor systems for handling high-throughput DNA sequencing data and gene expression data. Advances in high-throughput sequencing technologies permit massively parallel sequencing of DNA at a low cost, leading to the creation of big data sets even in routine investigations by small research laboratories. Rapid analysis of such large-scale data is of critical importance in many applications, and is the foundation of modern genomics. This is currently an area underserved by high performance computing, but has great economic potential and societal prominence.

Research under this Intel® PCC will be focused on two large-scale projects: The first project is a comprehensive effort to identify core index structures and fundamental building blocks for the numerous applications of high-throughput DNA sequencing, develop parallel algorithms for them, and release them as software libraries to enable application parallelization. The Intel® PCC will support development of novel algorithms optimized for the Intel® Xeon Phi™ coprocessors, and development and release of software libraries specifically optimized for Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors. The second project concerns the development of systems biology tools for biological researchers. Under the Intel® PCC, two objectives will be pursued: Large-scale Intel based clusters and supercomputers will be used to build whole-genome networks for important model organisms and make them widely available to researchers. A second objective is to put mid-scale network capabilities in the hands of individual biology researchers. The project will leverage other collaborative projects that support experimental research, allowing direct experimental verification of some of the tools generated under the Intel® PCC.

The work will lead to the release of open source software optimized for Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors in the important areas of computational genomics and systems biology. These will be used in applications with the potential to impact many important fields including viral and microbial genomics, agricultural biotechnology, and precision medicine. The research is expected to inform future Intel architectural designs regarding suitability in the important area of bioinformatics.

Related websites:

www.cc.gatech.edu/~saluru

Better Concurrency and SIMD On The HIROMB-BOOS-Model 3D Ocean Code


By utilizing the strengths of the Intel® Xeon Phi™ coprocessor, the authors of chapter 3 of High Performance Parallelism Pearls were able to improve and modernize their code and “achieve great scaling, vectorization, bandwidth utilization and performance/watt”. The authors (Jacob Weismann Poulsen, Karthik Raman and Per Berg) note, “The thinking process and techniques used in this chapter have wide applicability: focus on data locality and then apply threading and vectorization techniques.” In particular, they write about the advection routine from the HIROMB-BOOS-Model (HBM), which was initially underperforming on the Intel Xeon Phi coprocessor. They were able to achieve a 3x performance improvement after restructuring the code, which involved changing data structures for better data locality and exploiting the available threads and SIMD lanes for better concurrency at the thread and loop level, to utilize the maximum available memory bandwidth. To avoid data licensing issues, the example code provided in High Performance Parallelism Pearls uses the Baffin Bay setup generated from the freely available ETOPO2 data set.

Click to view entire article.


How to use the Intel® Cluster Checker v3 SDK on a cluster using multiple Linux Distributions


Linux based HPC clusters can use different Linux distributions or different versions of a given Linux distribution for different types of nodes in the HPC cluster.

When the Linux distribution on which the connector extension has been built uses glibc version 2.14 or newer, but the Linux distribution where the connector extension is used, i.e. where clck-analyze is executed, uses a glibc version lower than 2.14, clck-analyze is not able to load the shared library of the connector extension due to a missing symbol.

clck-analyze will show a message like this:

<your check>... not found

and

ldd lib<your check>.so

will show the following message, in addition to other output:

./lib<your check>.so: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by ./lib<your check>.so)

The underlying reason is that memcpy is versioned by default as memcpy@GLIBC_2.14 starting in glibc version 2.14.
glibc versions lower than 2.14 will not have memcpy versioned like this.
The previous version, memcpy@GLIBC_2.2.5, is available in all glibc versions.

There are three solutions to this problem.

  1. The preferred solution is to compile the connector extension, i.e. lib<your check>.so, on a Linux distribution using a glibc version lower than 2.14.
  2. If option #1 cannot be used, you can enforce the use of the compatible memcpy@GLIBC_2.2.5 by adding the following code into the header file of your connector extension (as described here: http://stackoverflow.com/questions/8823267/linking-against-older-symbol-version-in-a-so-file):
    #if defined(__GNUC__) && defined(__LP64__)  /* only with 64bit gcc, just to be sure */
    #include <features.h>       /* get the glibc version */
    /* only change memcpy when the glibc version is 2.14 or newer */
    #if defined(__GLIBC__) && (__GLIBC__ == 2) && (__GLIBC_MINOR__ >= 14)
    /* enforce memcpy to use the earlier, i.e. compatible, memcpy@GLIBC_2.2.5 */
    __asm__(".symver memcpy,memcpy@GLIBC_2.2.5");
    #endif
    #undef _FEATURES_H  /* reload it ... usually necessary */
    #endif
  3. The third solution is to use a wrapper function. This is also described on the above-mentioned web page, but option #2 is simpler and easier to use. A hedged sketch of such a wrapper follows this list.
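The following is only a sketch of one possible wrapper (it is not taken from the Intel Cluster Checker documentation): it binds a local name to the compatible memcpy@GLIBC_2.2.5 and relies on the GNU linker option -Wl,--wrap=memcpy to redirect all memcpy references to the wrapper.

// Hedged sketch of solution #3. Link the connector extension with: -Wl,--wrap=memcpy
#include <cstddef>

extern "C" {

// Bind the local name old_memcpy to the compatible symbol version.
__asm__(".symver old_memcpy, memcpy@GLIBC_2.2.5");
void* old_memcpy(void* dest, const void* src, std::size_t n);

// With -Wl,--wrap=memcpy, references to memcpy resolve to __wrap_memcpy.
void* __wrap_memcpy(void* dest, const void* src, std::size_t n) {
    return old_memcpy(dest, src, n);
}

}  // extern "C"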

Now you can compile your connector extension on a Linux distribution with a glibc version of 2.14 or newer and use it on a Linux distribution with a glibc version lower than 2.14.

Using OpenCL™ 2.0 Read-Write Images


Acknowledgements

We want to thank Javier Martinez, Kevin Patel, and Tejas Budukh for their help in reviewing this article and the associated sample.

Introduction

Prior to OpenCL™ 2.0, there was no ability to read and write to an image within the same kernel. Images could always be declared as a “CL_MEM_READ_WRITE”, but once the image was passed to the kernel, it had to be either “__read_only” or “__write_only”.

input1 = clCreateImage(
    oclobjects.context,
    CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
    &format,
    &desc,
    &input_data1[0],
    &err );
SAMPLE_CHECK_ERRORS( err );

Code 1. Image buffer could be created with CL_MEM_READ_WRITE

__kernel void Alpha( __read_write image2d_t inputImage1,
                     __read_only image2d_t inputImage2,
                     uint width,
                     uint height,
                     float alpha,
                     float beta,
                     int gamma )

Code 2. OpenCL 2.0 introduced the ability to read and write to images in Kernels

The addition, while intuitive, comes with a few caveats that are discussed in the next section.

The value of Read-Write Images

While image convolution is not as effective with the new read-write images functionality, any image processing technique that needs to be done in place may benefit from read-write images. One example of a process that can use them effectively is image composition.

In OpenCL 1.2 and earlier, images were qualified with the “__read_only” and “__write_only” qualifiers. In OpenCL 2.0, images can be qualified with the “__read_write” qualifier, allowing the output to be copied back to the input buffer. This reduces the number of resources that are needed.

In OpenCL 1.2, images are either read_only or write_only. Performing an in-place modification of an image requires treating the image as a buffer and operating on the buffer (see cl_khr_image2d_from_buffer: https://software.intel.com/en-us/articles/using-image2d-from-buffer-extension).

The current solution is to treat the images as buffers and manipulate the buffers. Treating 2D images as buffers may not be a free operation, and it prevents the clamping and filtering capabilities available with read_image calls from being used. As a result, it may be more desirable to use read_write qualified images.

Overview of the Sample

The sample takes two Windows* bitmap images, “input1.bmp” and “input2.bmp”, and puts them into image buffers. These images are then composited based on the value of alpha, a weight factor in the equation for the calculated pixel, which can be passed in as an option.


Figure 1. Using Alpha value 0.84089642

The input images have to be 24- or 32-bit images, and they have to be of the same size. The output is a 24-bit image. The input images are in ARGB format, which is taken into account when they are loaded.


Figure 2. Using Alpha value of 0.32453

The ARGB data is converted to RGBA. Changing the beta value causes a significant change in the output.

Using the Sample SDK

The SDK demonstrates how to use image composition with Read write images. Use the following command-line options to control this sample:

  • -h, --help: Show this text and exit.
  • -p, --platform <number-or-string>: Select the platform whose devices are used.
  • -t, --type all | cpu | gpu | acc | default | <OpenCL constant for device type>: Select the device type on which the OpenCL kernel is executed.
  • -d, --device <number-or-string>: Select the device on which all work is executed.
  • -i, --infile <24/32-bit .bmp file>: Base name of the first .bmp file to read. Default is input1.bmp.
  • -j, --infile <24/32-bit .bmp file>: Base name of the second .bmp file to read. Default is input2.bmp.
  • -o, --outfile <24/32-bit .bmp file>: Base name of the output file to write to. Default is output.bmp for OCL1.2 and 20_output.bmp for OCL2.0.
  • -a, --alpha <floating point value between 0 and 1>: Non-zero positive value that determines how much the two images blend in the composition. Default alpha is 0.84089642; the default beta is then 0.15910358.

The sample SDK has a number of default values that allow the application to run without any user input. Users can supply their own input .bmp files, which also have to be 24- or 32-bit .bmp files. The alpha value determines how much prominence image 1 has over image 2, as follows:

calculatedPixel = ((currentPixelImage1 * alpha) + (currentPixeImage2 * beta) + gamma);

The beta value is determined by subtracting the value of the alpha from 1.

float beta = 1 - alpha;

These two values determine the weighted distribution of image 1 relative to image 2.

The gamma value can be used to brighten each of the pixels. The default value is 0, but the user can use it to brighten the overall composited image.

Example Run of Program


Figure 3. Program running on OpenCL 2.0 Device

Limitations of Read-Write Images

Barriers cannot synchronize work-items in different work-groups, so read-write images cannot be used for operations that require synchronization across work-groups. Image convolution requires synchronizing all threads. Convolution with respect to images usually involves a mathematical operation on two matrices that results in the creation of a third matrix. An example of an image convolution is using Gaussian blur. Other examples are image sharpening, edge detection, and embossing.

Let’s use Gaussian blur as an example. A Gaussian filter is a low-pass filter that removes high-frequency values. The implication of this is to reduce detail and produce a blurring effect. Applying a Gaussian blur is the same as convolving the image with a Gaussian function, often called the mask. To effectively show the functionality of read-write images, both a horizontal and a vertical blur had to be done.

In OpenCL 1.2, this would have to be done in two passes. One kernel would be exclusively used for the horizontal blur, and another does the vertical blur. The result of one of the blurs would be used as the input of the next one depending on which was done first.

__kernel void GaussianBlurHorizontalPass( __read_only image2d_t inputImage, __write_only image2d_t outputImage, __constant float* mask, int maskSize)
{
    int2 currentPosition = (int2)(get_global_id(0), get_global_id(1));
    float4 currentPixel = (float4)(0,0,0,0);
    float4 calculatedPixel = (float4)(0,0,0,0);
    for(int maskIndex = -maskSize; maskIndex < maskSize+1; ++maskIndex)
    {
        currentPixel = read_imagef(inputImage, imageSampler, currentPosition + (int2)(maskIndex, 0));
        calculatedPixel += currentPixel * mask[maskSize + maskIndex];
    }
    write_imagef(outputImage, currentPosition, calculatedPixel);
}

__kernel void GaussianBlurVerticalPass( __read_only image2d_t inputImage, __write_only image2d_t outputImage, __constant float* mask, int maskSize)
{
    int2 currentPosition = (int2)(get_global_id(0), get_global_id(1));
    float4 currentPixel = (float4)(0,0,0,0);
    float4 calculatedPixel = (float4)(0,0,0,0); 
    for(int maskIndex = -maskSize; maskIndex < maskSize+1; ++maskIndex)
    {
        currentPixel = read_imagef(inputImage, imageSampler, currentPosition + (int2)(0, maskIndex));
        calculatedPixel += currentPixel * mask[maskSize + maskIndex];
    }
    write_imagef(outputImage, currentPosition, calculatedPixel);
}

Code 3. Gaussian Blur Kernel in OpenCL 1.2

The idea for OpenCL 2.0 would be to combine these two kernels into one, using a barrier to force completion of the horizontal or vertical blur before the next one begins.

__kernel void GaussianBlurDualPass( __read_only image2d_t inputImage, __read_write image2d_t tempRW, __write_only image2d_t outputImage, __constant float* mask, int maskSize)
{
    int2 currentPosition = (int2)(get_global_id(0), get_global_id(1));
    float4 currentPixel = (float4)(0,0,0,0);  
    float4 calculatedPixel = (float4)(0,0,0,0);
    currentPixel = read_imagef(inputImage, currentPosition);
    for(int maskIndex = -maskSize; maskIndex < maskSize+1; ++maskIndex)
    {
        currentPixel = read_imagef(inputImage, currentPosition + (int2)(maskIndex, 0));     
        calculatedPixel += currentPixel * mask[maskSize + maskIndex];
    }
    write_imagef(tempRW, currentPosition, calculatedPixel);

    barrier(CLK_GLOBAL_MEM_FENCE);

    for(int maskIndex = -maskSize; maskIndex < maskSize+1; ++maskIndex)
    {
        currentPixel = read_imagef(tempRW, currentPosition + (int2)(0, maskIndex));
        calculatedPixel += currentPixel * mask[maskSize + maskIndex];
    }
    write_imagef(outputImage, currentPosition, calculatedPixel);
}

Code 4. Gaussian Blur Kernel in OpenCL 2.0

Barriers were found to be ineffective. Using a barrier does not guarantee that the horizontal blur is completed before the vertical blur begins, assuming you did the horizontal blur first. The implication of this was an inconsistent result in multiple runs. Barriers can be used to synchronize threads within a group. The reason the problem occurs is that edge pixels are read from multiple workgroups, and there is no way to synchronize multiple workgroups. The initial assumption that we can implement a single Gaussian blur using read_write images proved incorrect because the inter-workgroup data dependency cannot be synchronized in OpenCL.


About the Authors

Oludemilade Raji is a Graphics Driver Engineer at Intel’s Visual and Parallel Computing Group. He has been working in the OpenCL programming language for 4 years and contributed to the development of the Intel HD Graphics driver including the development of OpenCL 2.0.

 

Robert Ioffe is a Technical Consulting Engineer at Intel’s Software and Solutions Group. He is an expert in OpenCL programming and OpenCL workload optimization on Intel Iris and Intel Iris Pro Graphics with deep knowledge of Intel Graphics Hardware. He was heavily involved in Khronos standards work, focusing on prototyping the latest features and making sure they can run well on Intel architecture. Most recently he has been working on prototyping Nested Parallelism (enqueue_kernel functions) feature of OpenCL 2.0 and wrote a number of samples that demonstrate Nested Parallelism functionality, including GPU-Quicksort for OpenCL 2.0. He also recorded and released two Optimizing Simple OpenCL Kernels videos and GPU-Quicksort and Sierpinski Carpet in OpenCL 2.0 videos.

 

You might also be interested in the following:

Optimizing Simple OpenCL Kernels: Modulate Kernel Optimization

Optimizing Simple OpenCL Kernels: Sobel Kernel Optimization

GPU-Quicksort in OpenCL 2.0: Nested Parallelism and Work-Group Scan Functions

Sierpiński Carpet in OpenCL 2.0


Intel® Parallel Computing Center at Princeton University, Princeton Neuroscience Institute and Computer Science Dept.


Princeton University

Principal Investigators:

Princeton - Kai Li

Kai Li is a professor in the Computer Science Department of Princeton University. He pioneered Distributed Shared Memory, allowing shared-memory programming on clusters of computers, which won the ACM SIGOPS Hall of Fame Award, and proposed user-level DMA, which evolved into RDMA in the InfiniBand standard.  He led the PARSEC project, which became the de facto benchmark for multicore processors.  He recently co-led the ImageNet project and propelled the advancement of deep learning methods.  He co-founded Data Domain, Inc. (now an EMC division) and led the innovation of deduplication storage system products that displaced the tape automation market.  He is an ACM Fellow, an IEEE Fellow, and a member of the National Academy of Engineering.

Princeton - Sebastian Seung

Sebastian Seung is Professor at the Princeton Neuroscience Institute and Department of Computer Science. Over the past decade, he has helped pioneer the new field of connectomics, developing new computational technologies for mapping the connections between neurons. His lab created EyeWire.org, a site that has recruited 200,000 players from 150 countries to a game to map neural connections. His book Connectome: How the Brain's Wiring Makes Us Who We Are was chosen by the Wall Street Journal as Top Ten Nonfiction of 2012.  Before joining the Princeton faculty in 2014, Seung studied at Harvard University, worked at Bell Laboratories, and taught at the Massachusetts Institute of Technology.

Description:

Over the past few years, convolutional neural networks (rebranded as “deep learning”) have become the leading approach to big data.  In order to perform well, deep learning requires a large amount of training data and a substantial amount of computing power for training and classification.  Most deep learning implementations use GPUs instead of general-purpose CPUs because the conventional wisdom is that a GPU is an order of magnitude faster than a CPU for deep learning at a similar cost.  As a result, the machine learning community as well as vendors have invested a lot of effort in developing deep learning packages.

Intel® Xeon Phi™ coprocessors, based on the Many Integrated Core (MIC) architecture, offer an alternative to GPUs for deep learning, because their peak floating-point performance and cost are on par with a GPU, while offering several advantages such as ease of programming, binary compatibility with the host processor, and direct access to large host memory.  However, it is still challenging to fully take advantage of the hardware capabilities.  It requires running many threads in parallel (e.g. 240+ threads for 60+ cores), executing 16 floating point operations in parallel (for AVX-512), and reducing the working set for each thread (128KB L2 cache per thread).

This center will develop an efficient deep learning package for Intel® Xeon Phi™ coprocessor.  The project is built on Sebastian Seung’s lab’s work on ZNN, a deep learning package (https://github.com/seung-lab/znn-release) based on two key concepts, both of which leverage the advantages of CPUs. (1) FFT-based convolution becomes more efficient when FFTs are cached and reused. This trades memory for speed, and is therefore appropriate for the larger working memory of CPUs. (2) Task parallelism on CPUs can make more efficient use of computing resources than SIMD parallelism on GPUs.  Our preliminary results with ZNN are encouraging. We have shown that CPUs can be competitive with GPUs in speed of deep learning, for certain network architectures. Furthermore, an initial port to Intel® Xeon Phi™ coprocessor (Knights Corner) was done quickly, supporting the idea that CPU implementations are likely to incur relatively low development cost.

The proposed optimizations for the future Intel® Xeon Phi™ processor family include trading memory space for computation (transforming convolution networks to reusable FFTs), intelligently choosing direct vs. FFT-based convolution for each layer of the network, choosing the right flavor of task parallelism, intelligent tiling to optimize L2 cache performance, and careful data structure layouts to maximize the utilization of AVX-512 vector units.  We will carefully evaluate the deep learning package with 2D ImageNet dataset, 3D electron microscopy image dataset, and 4D fMRI dataset.  We plan to deploy the software package and datasets in the public domain.

Related websites:

http://www.cs.princeton.edu/~li/

Intel® Parallel Computing Center at King Abdullah University of Science and Technology


King Abdullah University

Principal Investigators:

KAUST - David Keyes

David Keyes is a founding professor of Applied Mathematics and Computational Science at KAUST, where he focuses on high performance implementations of implicit methods for PDEs.  He received a BSE from Princeton and a PhD from Harvard. He has held faculty positions at Yale, Old Dominion, and Columbia Universities and research positions at NASA and DOE laboratories, and has led the scalable solvers initiative of the DOE SciDAC program. He is a Fellow of AMS and SIAM, and recipient of the IEEE Sidney Fernbach Award, the ACM Gordon Bell Prize, and the SIAM Prize for Distinguished Service to the Profession.

KAUST - Hatem Ltaief

Hatem Ltaief is a Senior Research Scientist in the Extreme Computing Research Center at KAUST, where he directs the KBLAS software project for dense and sparse linear algebraic operations on emerging architectures.  He received an MS in computational science from the University of Lyon and an MS in applied mathematics and a PhD in computer science from the University of Houston.  He has been a Research Scientist at the Innovative Computing Laboratory of the University of Tennessee and a Computational Scientist in the KAUST Supercomputing Laboratory. He is a member of the European Exascale Software Initiative (EESI2).

KAUST - Rio Yokota

Rio Yokota is an associate professor in the Global Scientific Information and Computing Center at the Tokyo Institute of Technology and a consultant at KAUST, where he researches fast multipole methods, their implementation on emerging architectures, and their applications in PDEs, BEMs, molecular dynamics, and particle methods. He received his undergraduate and doctoral degrees in Mechanical Engineering from Keio University, and held postdoctoral appointments at the University of Bristol and Boston University and a Research Scientist appointment at KAUST. He is a recipient of the ACM Gordon Bell Prize.

Description:

The Intel® Parallel Computing Center (Intel® PCC) at King Abdullah University of Science and Technology (KAUST) aims to provide scalable software kernels common to scientific simulation codes that will adapt well to future architectures, including a scheduled upgrade of KAUST’s globally Top10 Intel-based Cray XC40 system. In the spirit of co-design, Intel® PCC at KAUST will also provide feedback that could influence architectural design trade-offs. The Intel® PCC at KAUST is hosted in the KAUST’s Extreme Computing Research Center (ECRC), directed by co-PI Keyes, which aims to smooth the architectural transition of KAUST’s simulation-intensive science and engineering code base.  Rather than taking a specific application code and optimizing it, the ECRC adopts the strategy of optimizing algorithmic kernels that are shared among many application codes, and of providing the results in open source libraries.  Chief among such kernels are Poisson solvers and dense symmetric generalized eigensolvers.

We focus on optimizing two types of scalable hierarchical algorithms – fast multipole methods (FMM) and hierarchical matrices – on next generation Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors. These algorithms have the potential to replace workhorse kernels of molecular dynamics codes (drug/material design), sparse matrix preconditioners (structural/fluid dynamics), and covariance matrix calculations (statistics/big data). Co-PI Yokota is the architect of the open source fast multipole library ExaFMM, which attempts to integrate the best solutions offered by FMM algorithms, including the ability to control expansion order and octree decomposition strategy independently to create the fastest inverter that meets a given accuracy requirement for a solver or a preconditioner on manycore and heterogeneous architectures.  Co-PI Ltaief is the architect of the KBLAS library, which promotes the directed acyclic graph-based dataflow execution model to create NUMA-aware work-stealing tile algorithms of high concurrency, with innermost SIMD structure well suited to floating point accelerators.  The overall software framework of this Intel® PCC at KAUST, Hierarchical Computations on Manycore Architectures (HiCMA), is built upon these linear solvers and the philosophy that dense blocks of low rank should often be replaced with hierarchical matrices as they arise.  Hierarchical matrices are natural algebraic generalizations of fast multipole, and are implementable in data structures similar to those that have made FMM successful on distributed nodes of shared memory cores.

FMM and hierarchical matrix algorithms share a rare combination of O(N) arithmetic complexity and high arithmetic intensity (flops/Byte). This is in contrast to traditional algorithms that have either low arithmetic complexity with low arithmetic intensity (FFT, sparse linear algebra, and stencil application), or high arithmetic intensity with high arithmetic complexity (dense linear algebra, direct N-body summation). In short, FMM and hierarchical matrices are efficient algorithms that will remain compute-bound on future architectures. Furthermore, these methods have a communication complexity of O(log P) for P processors, and permit high asynchronicity in their communication. Therefore, they are amenable to asynchronous programming models that are gaining popularity as architectures approach the exascale.

Related websites:

http://ecrc.kaust.edu.sa

Intel® Parallel Computing Center at Indiana University


Indiana University

Principal Investigators:

Indiana - Judy Qiu

Judy Qiu is an assistant professor of Computer Science at Indiana University. Her general area of research is in data-intensive computing at the intersection of Cloud and HPC multicore technologies. This includes a specialization on programming models that support iterative computation, ranging from storage to analysis, which can scalably execute data intensive applications. Her research has been funded by NSF, NIH, Microsoft, Google, and Indiana University. She is the recipient of an NSF CAREER Award in 2012, the Indiana University Trustees Award for Teaching Excellence in 2013-2014, and the Indiana University Outstanding Junior Faculty Award in 2015.

Indiana - Steven Gottlieb

Steven Gottlieb is a Distinguished Professor of Physics at Indiana University. He works in Lattice QCD, an area of theoretical high energy physics that relies on large scale computing to understand the quantum field theory that describes the strong force.  His research has been funded for many years by the US Department of Energy and the National Science Foundation. He received an A.B. degree from Cornell University with majors in mathematics and physics, as well as Masters and Ph.D. degrees in physics from Princeton University. He was a DOE Outstanding Junior Investigator and an Indiana University Outstanding Junior Faculty Award recipient.

Description:

The Indiana University Intel® Parallel Computing Center (Intel® PCC) is a multi-component interdisciplinary center. The initial activities involve Center Director Judy Qiu, an Assistant Professor in the School of Informatics and Computing, and Distinguished Professor of Physics Steven Gottlieb. Qiu will be researching novel parallel systems supporting data analytics, and Gottlieb will be adapting the physics simulation code of the MILC Collaboration to the Intel® Xeon Phi™ coprocessor.

More generally, the focus of the Center will be grand challenges in high performance simulation and data analytics with innovative applications, and software development using the Intel architecture. Issues of programmer productivity and performance portability will be studied.

Steven Gottlieb is a founding member of the MILC Collaboration, which studies Quantum Chromodynamics, the theory of the strong force, one of nature's four fundamental forces. The open source MILC code is part of the SPEC benchmark and has been used as a performance benchmark for a number of supercomputer acquisitions. Gottlieb will be working on restructuring the MILC code to make optimal use of the SIMD vector units and many-core architecture of the Intel® Xeon Phi™ coprocessor. These will be used in upcoming supercomputers at the National Energy Research Scientific Computing Center (NERSC) and the Argonne Leadership Computing Facility (ALCF). The MILC code currently is used for hundreds of millions of core-hours at NSF and DOE supercomputer centers.

Data analysis plays an important role in data-driven scientific discovery and commercial services. Judy Qiu's earlier research has shown that previous complicated versions of MapReduce can be replaced by Harp (a Hadoop plug-in) that offers both data abstractions useful for high performance iterative computation and MPI-quality communication that can drive libraries like Mahout, MLlib, and DAAL on HPC and Cloud systems. A subset of machine learning algorithms have been selected and will be implemented with optimal performance using Hadoop/Harp and Intel's library DAAL. The code will be tested on Intel’s Haswell and Xeon Phi™ coprocessor architectures.

Related websites:

http://ipcc.soic.iu.edu/

OpenCV 3.0.0 ( IPP & TBB enabled ) on Yocto with Intel® Edison with new Yocto image release


For OpenCV 3.0.0 - Beta , please see this article

The following article is for OpenCV 3.0.0 and Intel(R) Edison with the latest (ww25) Yocto Image.

< Overview >

 This article is a tutorial for setting up OpenCV 3.0.0 on Yocto with Intel® Edison. We will build OpenCV 3.0.0 on Edison Breakout/Expansion Board using a Windows/Linux host machine.

 In this article, we will enable Intel® Integrated Performance Primitives (IPP) and Intel® Threading Building Blocks (TBB) to optimize and parallelize some OpenCV functions. For example, cvHaarDetectObjects(...), an OpenCV function that detects objects of different sizes in the input image, is parallelized with the TBB library. By doing this, we can fully utilize both cores of the Edison.

1. Prepare the new Yocto image for your Edison

   Go to Intel(R) Edison downloads and download 'Release 2.1 Yocto* complete image' and the 'Flash Tool Lite' that matches your OS. Then refer to the Flash Tool Lite User Manual for Edison to flash the new image. Using this new release, you don't need to manually enable UVC for webcams, and you will have enough storage space for OpenCV 3.0.0. Additionally, CMake is already included. To enable UVC by customizing the Linux kernel, or to change the partition settings for a different space configuration, refer to steps 2 and 3 of the previous article.

2. Setup root password and WiFi for ssh and FTP

  Follow Edison Getting Started to connect your host and Edison as you want.

  Setup any FTP method for transferring files from your host to your Edison. ( For an easy file transfer, MobaXterm is recommended for Windows hosts )

3. OpenCV 3.0.0

 When you check your space with 'df -h', you will see a result very similar to the following.

  Go to the OpenCV Official Page and download OpenCV on your host machine. When the download is done, copy the zip file to your Edison through FTP. It is recommended to copy OpenCV to '/home/<User>' and work there, since '/home' has more than 1 GB of free space.

 Unzip the downloaded file by typing 'unzip opencv-3.0.0.zip' and check if your opencv folder is created.

 Go to <OpenCV DIR>, type 'cmake .', and take a look at the available options.

 We will enable IPP and TBB for better performance. The libraries needed for IPP and TBB will be downloaded automatically when the flags are turned on.

 Now, on Edison, go to <OpenCV DIR> and type ( do not forget '.' at the end of the command line )

 root@edison:<OpenCV DIR># cmake -D WITH_IPP=ON -D WITH_TBB=ON -D BUILD_TBB=ON -D WITH_CUDA=OFF -D WITH_OPENCL=OFF -D BUILD_SHARED_LIBS=OFF -D BUILD_PERF_TESTS=OFF -D BUILD_TESTS=OFF .

 which turns on the IPP and TBB flags and turns off irrelevant features to keep things simple. With 'BUILD_SHARED_LIBS=OFF', the resulting executables can run without OpenCV installed, which is convenient for distribution. (If you don't want IPP and TBB, use WITH_TBB=OFF and WITH_IPP=OFF.)

 In the configuration result, you should see IPP and TBB are enabled.

If you observe no problems, then type

 root@edison:<OpenCV DIR># make -j2

 The build will take a while to complete (30 minutes to 1 hour).

 If you encounter an 'undefined reference to symbol 'v4l2_munmap' ... libv4l2.so.0 : error adding symbols: DSO missing from command line' error while building OpenCV or the OpenCV samples later, add '-lv4l2' after '-lv4l1' in the corresponding configuration files. Because this error can occur in more than 50 files, it is easier to fix them all with a single command:

root@edison:<OpenCV DIR># grep -rl -- -lv4l1 samples/* modules/* | xargs sed -i 's/-lv4l1/-lv4l1 -lv4l2/g'

 

 When building is done, install what is made by typing

 root@edison:<OpenCV DIR># make install

 

4. Making applications with OpenCV 3.0.0

 

 The easiest way to make a simple OpenCV application is using the sample came along with the package. Go to '<OpenCV DIR>/samples' and type

 root@edison:<OpenCV DIR>/samples# cmake .

 then it will configure and get ready to compile and link the samples. Now you can replace one of the sample code files in 'samples/cpp' and build it using cmake. For example, we can replace 'facedetect.cpp' with our own code. Now, at '<OpenCV DIR>/samples', type

 root@edison:<OpenCV DIR>/samples# make example_facedetect

 then it will automatically build the example, and the output file will be placed in 'samples/cpp'.
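 As a starting point, below is a minimal, hypothetical replacement for 'facedetect.cpp' that detects faces and writes the result to disk instead of displaying it (the Edison has no video out, as noted below). The cascade file path and image names are assumptions; adjust them to your setup.

#include <opencv2/objdetect.hpp>
#include <opencv2/imgproc.hpp>
#include <opencv2/imgcodecs.hpp>
#include <cstdio>
#include <vector>

int main(int argc, char** argv)
{
    const char* imagePath   = (argc > 1) ? argv[1] : "lena.jpg";
    const char* cascadePath = (argc > 2) ? argv[2]
        : "../../data/haarcascades/haarcascade_frontalface_alt.xml";  /* assumed location */

    cv::CascadeClassifier face_cascade;
    if (!face_cascade.load(cascadePath)) {
        std::printf("Could not load cascade: %s\n", cascadePath);
        return 1;
    }

    cv::Mat img = cv::imread(imagePath);
    if (img.empty()) {
        std::printf("Could not read image: %s\n", imagePath);
        return 1;
    }

    cv::Mat gray;
    cv::cvtColor(img, gray, cv::COLOR_BGR2GRAY);
    cv::equalizeHist(gray, gray);

    /* detectMultiScale is one of the routines that benefits from TBB threading */
    std::vector<cv::Rect> faces;
    face_cascade.detectMultiScale(gray, faces, 1.1, 3, 0, cv::Size(30, 30));

    for (size_t i = 0; i < faces.size(); ++i)
        cv::rectangle(img, faces[i], cv::Scalar(0, 255, 0), 2);

    /* write the result to disk instead of displaying it */
    cv::imwrite("result.jpg", img);
    std::printf("Detected %zu face(s); wrote result.jpg\n", faces.size());
    return 0;
}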


 

One more thing: since the Edison has no video out, an error will occur if you call display functions such as 'imshow', which creates a window and shows an image or video on the screen. Therefore, before you build samples or examples that include those functions, comment those calls out.

 

 

 

 

Benefitting Power and Performance Sleep Loops


by Joe Olivas, Mike Chynoweth, & Tom Propst
 

Abstract

To take full advantage of today’s multicore processors, software developers typically break their work into manageable sizes and spread the work across several simultaneously running threads in a thread pool. Performance and power problems in thread pools can occur when the work queue is highly contended by multiple threads requesting work or when threads are waiting on work to appear in the queue.

While there are numerous algorithmic solutions to this problem, this paper discusses one in particular that we have seen as the most commonly used. We also provide some simple recommendations to improve both performance and power consumption without having to redesign an entire implementation.

Overview of the Problem

A popular approach is to have each thread in the thread pool continually check for work in a queue and split off to process it once work becomes available. This is a very simple approach, but developers often run into problems with how they poll the queue for work or how they handle a highly contended queue. Issues can occur in two extreme conditions:

  1. The case when the work queue is not filling fast enough with work for the worker threads and they must back-off and wait for work to appear.
  2. The case when many threads are trying to get work from the queue in parallel, causing contention on the lock protecting the queue, so the threads must back off the lock to decrease that contention.

Popular thread pool implementations have some pitfalls, yet by making a few simple changes, you can see big differences in both power and performance.

To start, we make a few assumptions about the workload in question. We assume we have a large dataset, which is evenly and independently divisible in order to eliminate complications that are outside the scope of this study.
 

Details of the Sleep Loop Algorithm

In our example, each thread is trying to access the queue of work, so it is necessary to protect access to that queue with a lock so that only a specified number of threads can get work concurrently.

With this added complexity, our algorithm from a single thread view looks like the following:
 


 

Problems with this Approach on Windows* Platforms

Where simple thread pools begin to break down is in the implementation. The key is how to back off the queue when there is no work available or the thread fails to acquire the lock to the queue. The simple approach is to constantly check, otherwise known as a “busy-wait” loop, shown below in pseudo code.

while (!acquire_lock() || no_work_in_queue);
get_work();
release_lock();
do_work();

Busy Wait Loop

The problem with the implementation above is that if a thread cannot obtain the lock, or there is no work in the queue, the thread continues checking as fast as possible. Actively polling consumes all of the available processor resources and has very negative impacts on both performance and power consumption. The upside is that the thread will enter the queue almost immediately when the lock is available or when work appears.

Sleep and SwitchToThread

The solution that many developers use for backing off checking the queue, or locks that are highly contended, is typically to call Sleep(0) or SwitchToThread() from the Win32 APIs. According to the MSDN Sleep function documentation, calling Sleep(0) allows the calling thread to give up the remaining part of its time slice if and only if a thread of equal or greater priority is ready to run.

Similarly, SwitchToThread() allows the calling thread to give up the remaining part of its time slice, but only to another thread on the same processor. This means that instead of constant checking, a thread only checks if no other useful work is pending. If you want the software to back off more aggressively, use a Sleep(1) call, which always gives up the remaining time slice and context switches out, regardless of thread priority or processor residency. The goal of a Sleep(1) is to wake up and recheck in 1 millisecond.

while (!acquire_lock() || no_work_in_queue)
{
  Sleep(0);
}
get_work();
release_lock();
do_work();

Sleep Loop

Unfortunately, a lot more is going on under the hood that can cause some serious performance degradations. The Sleep(0) and SwitchToThread() calls incur overhead since they involve a fairly long instruction path length, combined with an expensive ring 3 to ring 0 transition costing about 1,000 cycles. The processor is fooled into thinking that this “sleep loop” is accomplishing useful work. In executing these instructions, the processor is being fully utilized, filling up the pipeline with instructions, executing them, trashing the cache, and most importantly, using energy that is not benefiting the software.

An additional problem is that a Sleep(1) call probably does not do what you intended if the Windows kernel's tick rate is at the default of 15.6 ms. At the default tick rate, the call is actually equivalent to a sleep that is much larger than 1 ms and can wait as long as 15.6 ms, since a thread can only wake up when the kernel wakes it. Such a call means the thread is inactive for a very long time while the lock could become available or work could be placed in the queue.

Another issue is that immediately giving up a time slice means the running thread will be context switched out. A context switch costs something on the order of 5,000 cycles, so getting switched out and switched back in means the processor has wasted at least 10,000 cycles of overhead, which is not helping the workload get completed any faster. Very often, these loops lead to very high context switch rates, which are signs of overhead and possible opportunities for performance gains.

Fortunately, you have some options for help mitigating the overhead, saving power, and getting a nice boost in performance.

Spinning Out of Control

If you are using a threading library, you may not have control over the spin algorithms implemented.  During performance analysis, you may see a high volume of context switches, calls to Sleep or SwitchToThread, and high processor utilization tagged to the threading library.  In these situations, it is worth looking at alternative threading libraries to determine if their spin algorithms are more efficient.

Resolving the Problems

The approach we recommend in such an algorithm is akin to a more gradual back off. First, we allow the thread to spin on the lock for a brief period of time, but instead of fully spinning, we use the pause instruction in the loop. Introduced with the Intel® Streaming SIMD Extensions 2 (Intel® SSE2) instruction set, the pause instruction gives a hint to the processor that the calling thread is in a "spin-wait" loop. In addition, the pause instruction is a no-op when used on x86 architectures that do not support Intel SSE2, meaning it will still execute without doing anything or raising a fault. While this means older x86 architectures that don’t support Intel SSE2 won’t see the benefits of the pause, it also means that you can keep one straightforward code path that works across the board.

Essentially, the pause instruction delays the next instruction's execution for a finite period of time. By delaying the execution of the next instruction, the processor is not under demand, and parts of the pipeline are no longer being used, which in turn reduces the power consumed by the processor.

The pause  instruction can be used in conjunction with a Sleep(0) to construct something similar to an exponential back-off in situations where the lock or more work may become available in a short period of time, and the performance may benefit from a short spin in ring 3. It is important to note that the number of cycles delayed by the pause instruction may vary from one processor family to another. You should avoid using multiple pause instructions, assuming you will introduce a delay of a specific cycle count.  Since you cannot guarantee the cycle count from one system to the next, you should check the lock in between each pause to avoid introducing unnecessarily long delays on new systems. This algorithm is shown below:

/* Note: the _mm_pause() intrinsic used below is declared in <immintrin.h>. */
ATTEMPT_AGAIN:
  if (!acquire_lock())
  {
    /* Spin on pause max_spin_count times before backing off to sleep */
    for(int j = 0; j < max_spin_count; ++j)
    {
      /* pause intrinsic */
      _mm_pause();
      if (read_volatile_lock())
      {
        if (acquire_lock())
        {
          goto PROTECTED_CODE;
        }
      }
    }
    /* Pause loop didn't work, sleep now */
    Sleep(0);
    goto ATTEMPT_AGAIN;
  }
PROTECTED_CODE:
  get_work();
  release_lock();
  do_work();

Sleep Loop with exponential back off
 

Using pause in the Real World

Using the algorithms described above, including the pause instruction, has shown significant positive impacts on both power and performance. For our tests, we used three workloads with differing periods of active work (granularity). In the high granularity case, the work was relatively extensive, and the threads were not contending for the lock very often. In the low granularity case, the work was quite short, and the threads were more often finishing and ready for further tasks.

These measurements were taken on a 6-core, 12-thread, Intel® Core™ i7 processor 990X  equivalent system. The observed performance gains were quite impressive. Up to 4x gains were seen when using eight threads, and even at thirty-two threads, the performance numbers were approximately 3x over just using Sleep(0).



Performance using pause

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance.

As mentioned before, using the pause instruction allows the processor pipeline to be less active when threads are waiting, resulting in the processor using less energy. Because of this, we were also able to measure the power differences between the two algorithms using a Fluke NetDAQ*.

Power Consumption with optimization.

Knowing that your software is saving 0.73 W over a standard implementation means it is less likely to be a culprit for draining laptop battery life. Combining reduced energy consumption with the gains in performance can lead to enormous power savings over the lifetime of the workload.

Conclusions

In many cases, developers may be overlooking or simply have no way of knowing their applications have hidden performance problems. We were able to get a handle on these performance issues after several years of investigation and measurement.

We hope that this solution is simple enough to be retrofitted into existing software.  It follows common algorithms, but includes a few tweaks that can have large impacts. With battery life and portable devices becoming more prevalent and important to developers, a few software changes can take advantage of new instructions and have positive results for both performance and power consumption. 

About the Authors

Joe Olivas is a Software Engineer at Intel working on software performance optimizations for external software vendors, as well as creating new tools for analysis. He received both his B.S. and M.S. in Computer Science from CSU Sacramento with an emphasis on cryptographic primitives and performance. When Joe is not making software faster, he spends his time working on his house and brewing beer at home with his wife.
Mike Chynoweth is a Software Engineer at Intel, focusing on software performance optimization and analysis. He received his B.S. in Chemical Engineering from the University of Florida. When Mike is not concentrating on new performance analysis methodologies, he is playing didgeridoo, cycling, hiking or spending time with family.

 

Tom Propst is a Software Engineer at Intel focusing on enabling new use cases in business environments. He received his B.S. in Electrical Engineering from Colorado State University. Outside of work, Tom enjoys playing bicycle polo and tinkering with electronics.

 


Coming Soon…The New Intel® Parallel Studio XE


It’s coming. …

The new Intel® Parallel Studio XE will ship soon.  It includes all-new libraries and tools to boost performance of big data analytics, large MPI clusters, or any code that benefits from parallel execution.  Leverage the power of Intel® processors like never before. 

Product highlights include:

  • Make faster code using both vectorization and threading with new vectorization assistance
  • Boost the speed of data analytics and machine learning programs with new data analytics acceleration library
  • Improve large cluster performance with the ability to profile large MPI jobs faster

To learn more, visit Intel Software Development Products in the Intel® Software Pavilion at IDF.  Be sure to attend these technical sessions:

Fast Gathering-based SpMxV for Linear Feature Extraction


1. Background

Sparse Matrix-Vector Multiplication (SpMxV) is a common linear algebra function that often appears in real recognition-related problems, such as speech recognition. In a standard speech/facial recognition framework, the input data directly extracted from outside are not suitable for pattern matching. A mandatory step is to transform the input data into more compact and amenable feature data by multiplying them with a huge-scale constant sparse parameter matrix.

Figure 1: Linear Feature Extraction Equation

A matrix is characterized as sparse if most of its elements are zero. The density of a matrix is defined as the percentage of non-zero elements in the matrix, which for sparse matrices varies from 0% to 50%. The basic idea in optimizing SpMxV is to concentrate the non-zero elements so as to avoid as many unnecessary multiply-by-zero operations as possible. In general, concentration methods can be classified into two kinds.

The first is the widely used Compressed Row Storage (CRS), which stores only the non-zero elements and their position information for each row. But it is so unfriendly to modern SIMD architectures that it can hardly be vectorized with SIMD, and it only outperforms SIMD-accelerated ordinary matrix-vector multiplication when the matrix is extremely sparse. A variation of this approach, tailored for SIMD implementation, is Blocked Compressed Row Storage (BCRS), in which a fixed-size block instead of a single element is handled in the same way. Because of the indirect memory accesses involved, its performance may degrade severely when matrix density increases.
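For reference, a minimal scalar sketch of CRS/CSR-style SpMxV (not the algorithm proposed in this article) looks like the following; note the indirect access to the vector x, which is what hinders SIMD vectorization:

/* values[] holds the non-zeros, col_idx[] their columns,
   and row_ptr[] the start of each row within values[]. */
void csr_spmv(int rows,
              const float *values, const int *col_idx, const int *row_ptr,
              const float *x, float *y)
{
    for (int r = 0; r < rows; ++r) {
        float sum = 0.0f;
        for (int j = row_ptr[r]; j < row_ptr[r + 1]; ++j)
            sum += values[j] * x[col_idx[j]];   /* indirect access to x */
        y[r] = sum;
    }
}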

The second is to reorder matrix rows/columns via permutation. The key to these algorithms is to find the best matrix permutation scheme as measured by some criterion correlated with the degree of non-zero concentration, such as:

  • Group non-zero elements together to facilitate partitioning matrix to sub-matrices
  • Minimize total count of continuous non-zeros N x 1 blocks

Figure 2:  Permutation to minimize N x 1 blocks

However, in some applications, such as speech/facial recognition, there exist permutation-insensitive sparse matrices, that is, matrices for which no permutation operation brings a significant improvement for SpMxV. An extremely simplified example matrix is:

Figure 3: Simplest permutation-insensitive matrix

If non-zero elements are uniformly distributed inside a sparse matrix, it may happen that when any two columns are exchanged, the number of rows that benefit is nearly the same as the number of rows that are hurt. When this situation happens, the matrix is permutation-insensitive.

Additionally, for those sparse matrices of somewhat high density, if no help can be expected from the two methods above, we have to resort to ordinary matrix-vector multiplication merely accelerated by SIMD instructions, illustrated in Figure 4, which is totally sparseness-unaware. To alleviate this problem, we devised and generalized a gathering-based SpMxV algorithm that is effective not only for evenly distributed but also for irregular constant sparse matrices.

 

2. Terms and Task

Before detailing the algorithm, we introduce some terms/definitions/assumptions to ease description.

  • A SIMD Block is a memory block that is the same size as a SIMD register. A SIMD BlockSet consists of one or several SIMD Blocks. A SIMD value is either a SIMD Block or a SIMD register, which can be a SIMD instruction operand.

  • An element is the underlying basic data unit of a SIMD value. The element type can be a built-in integer or float type. The type of a whole SIMD value is called its SIMD type, and is a vector of the element type. An element index is the element's LSB-order position in the SIMD value, equal to element-offset/element-byte-size.

  • Instructions of loading a SIMD Block into a SIMD register are symbolized as SIMD_LOAD. For most element types, there are corresponding SIMD multiplication or multiplication-accumulation instructions. On X86, examples are PMADDUBSW/PMADDWD for integer, MULPS/MULPD for float. These instructions are symbolized as SIMD_MUL.

  • Angular bracket “< >” is used to indicate parameterization similar to C++ template.

  • For a value X in memory or register, X<L>[i] is the ith L-bit slice of X, in LSB order.

On modern SIMD processors, an ordinary matrix-vector multiplication can be greatly accelerated with the help of SIMD instructions as the following pseudo-code:

Figure 4: Plain Matrix-Vector SIMD Multiplication
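Since Figure 4 is an image, the following is a minimal sketch of the kind of plain, sparseness-unaware SSE kernel it describes, assuming float elements and a row length that is a multiple of 4:

#include <xmmintrin.h>

void dense_mv_sse(int rows, int cols,            /* cols % 4 == 0 assumed */
                  const float *m,                /* row-major matrix      */
                  const float *v, float *out)
{
    for (int r = 0; r < rows; ++r) {
        __m128 acc = _mm_setzero_ps();
        const float *row = m + (long)r * cols;
        for (int c = 0; c < cols; c += 4) {
            __m128 mb = _mm_loadu_ps(row + c);   /* SIMD_LOAD of a matrix block */
            __m128 vb = _mm_loadu_ps(v + c);     /* SIMD_LOAD of a vector block */
            acc = _mm_add_ps(acc, _mm_mul_ps(mb, vb));  /* SIMD_MUL + accumulate */
        }
        float tmp[4];
        _mm_storeu_ps(tmp, acc);
        out[r] = tmp[0] + tmp[1] + tmp[2] + tmp[3];     /* horizontal sum */
    }
}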

In the case of a sparse matrix, we propose an innovative technique to compact the non-zeros of the matrix while keeping SpMxV implementable with the SIMD ISA as in the above pseudo-code, with the goal of reducing unnecessary SIMD_MUL instructions. Since the matrix is assumed to be constant, compacting the non-zeros is treated as preprocessing of the matrix, which can be completed during program initialization or off-line matrix data preparation, so no runtime cost is incurred for a matrix-vector multiplication.

 

3. Description

GATHER Operation

First of all, we define a conceptual GATHER operation, which is the basis of this work. Its general description is:

GATHER<T, K>(destination = [D0, D1, …, DE–1],
                           source       = [S0, S1, …, SK*E–1],
                           hint            = [H0, H1, …, HE–1])

The parameters destination and source are SIMD values, whose SIMD type is specified by T. destination is one SIMD value whose element count is denoted by E, while source consists of K SIMD value(s) whose total element count is K*E. The parameter hint, called a Relocation Hint, has E integer values, each of which is called a Relocation Index. A Relocation Index is derived from a virtual index ranging between –1 and K*E–1, and can be described by a mathematical mapping as:

 RELOCATION_INDEX<T>(index), abbreviated as RI<T>(index)

GATHER operation will move elements of source into destination based on Relocation Indices as:

  • If Hi is RI<T>(–1), GATHER will retain context of Di.
  • If Hi is RI<T>(j) (0 ≤ j < K*E), GATHER will move Sj to Di.
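As a concrete reference, a scalar model of this GATHER semantics might look like the following; it uses the virtual indices directly (rather than the encoded Relocation Indices) and assumes 16-bit elements with E = 8 purely for illustration:

#include <stdint.h>

#define E 8                                   /* elements per SIMD Block (assumed) */

void gather_ref(int16_t dst[E],               /* destination block                 */
                const int16_t *src,           /* K*E source elements               */
                const int hint[E])            /* -1 keeps dst[i], j moves src[j]   */
{
    for (int i = 0; i < E; ++i) {
        if (hint[i] >= 0)
            dst[i] = src[hint[i]];
        /* hint[i] == -1: retain the existing destination element */
    }
}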

Implementation of GATHER operation is specific to processor’s ISA. Correspondingly, RI mapping depends on instruction selection for GATHER. Likewise, materialization of hint may be a SIMD value or an integer array, or even mixed with other Relocation Hints, which is totally instruction-specific.

Depending on the ISA availability of a given SIMD processor, we consider only those GATHER operations, called fast or intrinsic GATHER operations, that can be translated to a simple and efficient instruction sequence with low CPU-cycle cost.

 

Fast GATHER on X86

On X86 processors, we propose a method to construct a fast GATHER using a BLEND and SHUFFLE instruction pair.

Given a SIMD type T, the idealized BLEND and SHUFFLE instructions are defined as:

  • BLEND<T, L>(operand1, operand2,mask)  ->  result

    L is power of 2, not more than element bit length of T. And operand1, operand2 and result are values of T; mask is a SIMD value whose element is L-bit integer, and its element count is denoted by E. For the ith (0 ≤ i < E) element of mask, we have:

    • operand1<L>[i]  ->  result<L>[i]      (if the element’s MSB is 0)
    • operand2<L>[i]  ->  result<L>[i]      (if the element’s MSB is 1)
  • SHUFFLE<T, L>(operand1, mask)  ->  result

    Parameters description is same as BLEND. In element of mask, only low log2(E) bits, called SHUFFLE INDEX BITS, and MSB are significant. For the ith (0 ≤ i < E) element of mask, we have:

    • operand1<L>[mask<L>[i] & (E–1) ]  ->  result<L>[i]          (if the element’s MSB is 0)
    • instruction specific value  ->  result<L>[i]                          (if the element’s MSB is 1)

Then we construct fast GATHER<T, K> using a SHUFFLE<T, LS> and BLEND<T, LB> instruction pair. The element bit length of T is denoted by LT, and SHUFFLE INDEX BITS by SIB. The Relocation Hint is materialized as one SIMD value, and each Relocation Index is an LT-bit integer. The mathematical mapping RI<T>( ) is defined as:

  • RI<T>(­virtual index = –1) = –1

  • If virtual index ≥ 0, in other words, we can suppose the element indicated by this index is actually the pth element of the kth (0 ≤ k < K) source SIMD value. Final result, denoted by rid, is computed according to the formulations:

    • LS ≤ LB   (0 ≤ i < LT/LS)
      rid<LS>[i] = k * 2^SIB + p * LT/LS + i                ( i = integer * LB/LS – 1)
      rid<LS>[i] = ? * 2^SIB + p * LT/LS + i                ( i ≠ integer * LB/LS – 1)

    • LS > LB   (0 ≤ i < LT/LB)
      rid<LB>[i] = k * 2^SIB + p * LT/LS + i * LB/LS        ( i = integer * LS/LB)
      rid<LB>[i] = k * 2^SIB + ? & (2^SIB – 1)              ( i ≠ integer * LS/LB)

 Figure 5 is an example illustrating Relocation Hint for a GATHER<8*int16, 2> while LS = LB = 8.

Figure 5:  Relocation Hint For Gathering 2 SSE Blocks

The code sequence of fast GATHER<T, K> is depicted in Figure 6. The destination and the Relocation Hint are symbolized as D and H. The source values are represented by B0, B1, …, BK–1. In addition, an essential SIMD constant I, whose element bit length is min(LS, LB) and each of whose elements is the integer 2^SIB, is used. Additionally, the condition K ≤ 2^(min(LS, LB) – SIB – 1) must be satisfied, which is K ≤ 8 for the above case.

Figure 6:  Fast GATHER Code Sequence

Depending on the SIMD type and the processor's SIMD ISA, SHUFFLE and BLEND should be mapped to specific instructions as optimally as possible. Some existing instruction selections are listed below as examples.

  • SSE128 - Integer:   PSHUFB + PBLENDV       (LS=8,  LB=8)
  • SSE128 - Float:     VPERMILPS + BLENDPS    (LS=32, LB=32)
  • SSE128 - Double:    VPERMILPD + BLENDPD    (LS=64, LB=64)
  • AVX256 - Int32/64:  VPERMD + PBLENDV       (LS=32, LB=8)
  • AVX256 - Float:     VPERMPS + BLENDPS      (LS=32, LB=32)

 

Sparse Matrix Re-organization

In a SpMxV, the two operands, the matrix and the vector, are expressed by M and V respectively. Each row in M is partitioned into several pieces in units of SIMD Blocks according to a certain scheme. As many non-zero elements in a piece as possible are compacted into one SIMD Block. If some non-zero elements remain outside of the compaction, the piece's SIMD Blocks containing them should be as few as possible. Meanwhile, these leftover elements are moved to a left-over matrix ML. Obviously, M*V is theoretically broken up into (M–ML)*V and ML*V. When a proper partition scheme is adopted, which is especially possible for nearly evenly distributed sparse matrices, ML is intended to be an ultra-sparse matrix that is far sparser than M, so that the computation time of ML*V is insignificant in the total time. We can apply a standard compression-based algorithm or the like, which will not be covered in this invention, to ML*V. The organization of ML is subject to its multiplication algorithm, and its storage is separate from M's compacted data, whose organization is detailed in the following.

Given a piece, suppose it contains N+1 SIMD Blocks of type T, expressed by MB0, MB1, …, MBN. We use MB0 as the containing Block, and select and gather non-zero elements of the other N Blocks into MB0. Without loss of generality, we assume that this gathering-N-Block operation is synthesized from one or several intrinsic GATHERs, whose 'K' parameters are K1, K2, …, KG, subject to N = K1 + K2 + … + KG. That is to say, the N Blocks are divided into G groups sized K1, K2, …, KG, and these groups are individually gathered into MB0 one by one. To achieve the best performance, we should find a decomposition that minimizes G. This is a classical knapsack-type problem and can be solved with either dynamic programming or a greedy method. As a special case, when an intrinsic GATHER<T, N> exists, G=1.
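A hypothetical greedy sketch of this decomposition, given the intrinsic GATHER sizes supported by the target ISA in descending order, might look like the following (dynamic programming can be substituted when the greedy choice is not optimal for the available sizes):

/* sizes[] lists the supported intrinsic GATHER 'K' values in descending order
   (it should include 1 so the loop always terminates); groups[] receives the
   chosen K_1..K_G and the function returns G. */
int decompose_groups(int N, const int *sizes, int num_sizes, int *groups)
{
    int G = 0;
    for (int s = 0; s < num_sizes && N > 0; ++s) {
        while (N >= sizes[s]) {
            groups[G++] = sizes[s];
            N -= sizes[s];
        }
    }
    return G;
}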

The Relocation Hints for those G intrinsic GATHERs are expressed by MH1, MH2, …, MHG. So the piece is replaced with its compacted form consisting of two parts: MB0 after compaction and (MH1, MH2, …, MHG). The former is called the Data Block. The latter is called the Relocation Block and is some combined form of all Relocation Hints; the exact combination is specific to implementation or optimization considerations that are out of the scope of this paper. The combination form may be affected by alignment enforcement, memory optimization, or other instruction-specific reasons. For example, if a Relocation Index occupies only half a byte, we can merge two Relocation Indices from two Relocation Hints into one byte so as to reduce memory usage. Ordinarily, a simple way is to lay out the Relocation Hints end to end. Figure 5 also shows how to create the Data Block and Relocation Block for a 3-Block piece. A blank in a SIMD Block means a zero-valued element.

 

Sparse Matrix Partitioning Scheme

To guide the decision on how to partition a row of the matrix, we introduce a cost model. For a piece of N+1 SIMD Blocks, suppose there are R (R ≤ N) SIMD Blocks containing non-zero elements that must be moved to ML. The cost of this piece is 1 + N*CostG + R*(1+CostL), in which:

  • 1 is the cost of a SIMD multiplication in the piece.
  • CostG (CostG < 1) is the cost of gathering one SIMD Block.
  • CostL is the extra effort for a SIMD multiplication in ML, and is always a very small value.
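A direct transcription of this per-piece cost as code, with cost_g and cost_l standing for CostG and CostL, is:

/* 1 SIMD multiplication + N gathered Blocks + R leftover Blocks moved to ML */
double piece_cost(int N, int R, double cost_g, double cost_l)
{
    return 1.0 + N * cost_g + R * (1.0 + cost_l);
}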

In the following description, one or several adjacent pieces in a row may be referred to as a whole, which is termed a piece clique. All rows of the matrix share the same partitioning scheme:

  • Each row is cut into identical primary cliques, except for a possible leftover clique with fewer pieces than a primary one.
  • The number of pieces in any clique should be not more than a pre-defined count limit C (1 ≤ C), which is statically deduced from the characteristics of the non-zero distribution of the sparse matrix and is also used to control code complexity in the final implementation.
  • The total cost of all pieces in the matrix should be minimal for the given count limit C. To find this optimal scheme, we may rely on an exhaustive search or an improved beam algorithm. The beam algorithm will be covered in a separate patent and is not described here.

An example of partitioning is [4, 5, 2], [4, 5, 2], [4, 5, 2], [2, 5] for a 40-Block row when C=3. '[ ]' denotes a piece clique. For evenly distributed matrices, C=1 is always chosen.

 

Gather-Based Matrix-Vector Multiplication

Multiplication between the vector V and a row of M is broken up into sub-multiplications on the partitioned pieces. Given a piece in M whose original form we suppose has N+1 SIMD Blocks, the corresponding SIMD Blocks in the vector V are expressed by VB0, VB1, …, VBN. The previous symbol definitions for a piece carry over to this section.

With the new compacted form, a piece multiplication between [MB0, MB1, …, MBN] and [VB0, VB1, …, VBN] is transformed into gathering the effective vector elements into VB0 followed by only one SIMD multiplication of the Data Block with VB0. Figure 7 depicts the pseudo-code of the new multiplication, in which the Data Block is MD, the Relocation Block is MR, and the vector is VB. We will refer to a conceptual function EXTRACT_HINT(MR, i) (1 ≤ i ≤ G), which extracts the ith Relocation Hint from MR and is the reverse of the combination of (MH1, MH2, …, MHG) mentioned above. To improve performance, this function may keep some internal temporaries; for example, the register value of the previous Relocation Hint can be retained to avoid a memory access. The details of this function are out of the scope of this article.

Figure 7:  Multiplication For Compacted Form of N+1 SIMD Blocks
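Because Figure 7 is an image, the following scalar model illustrates the same per-piece computation under the same illustrative assumptions as the scalar GATHER sketch earlier (E elements per Block, virtual indices, -1 meaning "keep"); it is not the SIMD code of the figure itself:

#define E 8

/* md     : Data Block (E compacted matrix elements)
   vb     : (N+1)*E vector elements covering the piece; vb[0..E-1] is VB0
   hints  : G relocation hints, hints[g][i] (virtual index, -1 = keep)
   counts : group sizes K_1..K_G (they sum to N)                        */
float piece_mul(const float md[E], const float vb[],
                const int (*hints)[E], const int *counts, int G)
{
    float vb0[E];
    const float *src = vb + E;               /* VB1 starts right after VB0 */
    int i, g;

    for (i = 0; i < E; ++i)
        vb0[i] = vb[i];                      /* start from VB0 */

    for (g = 0; g < G; ++g) {
        for (i = 0; i < E; ++i)
            if (hints[g][i] >= 0)
                vb0[i] = src[hints[g][i]];   /* gather the needed vector elements */
        src += counts[g] * E;                /* advance past this group's K_g Blocks */
    }

    /* the single SIMD_MUL of the real code, done element-wise here */
    float acc = 0.0f;
    for (i = 0; i < E; ++i)
        acc += md[i] * vb0[i];
    return acc;
}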

In the code of Figure 7, the original N SIMD multiplications are replaced by G gathering operations. Therefore, computation acceleration is possible and meaningful only if the former is much more time-consuming than the latter. We must compose efficient intrinsic GATHERs to guarantee this. That is easily done on some processors, such as ARM, on which an intrinsic GATHER of a SIMD integer type can be directly mapped to a single low-cost hardware instruction. The fast GATHER elaborately constructed on X86 also satisfies this requirement: for the ith (1 ≤ i ≤ G) SIMD Block group in the piece, Ki SIMD_MULs are replaced by Ki rather faster BLEND and SHUFFLE pairs, and Ki–1 SIMD_LOADs from the matrix are avoided and replaced by Ki–1 much more CPU-cycle-saving SIMD_SUBs.

Finally, the new SpMxV algorithm can be described by the following flowchart:

Figure 8:  New Sparse Matrix-Vector Multiplication

 

4. Summary

The algorithm can be used to improve sparse matrix-vector and matrix-matrix multiplication in any numerical computation. As we know, there are lots of applications involving semi-sparse matrix computation in High Performance Computing. Additionally, in popular perceptual computing low-level engines, especially speech and facial recognition, semi-sparse matrices are found to be very common. Therefore, this invention can be applied to those mathematical libraries dedicated to these kinds of recognition engines.

Optimizing Image Resizing Example of Intel® Integrated Performance Primitives (IPP) With Intel® Threading Building Blocks and Intel® C++ Compiler. Intel® System Studio 2016 Linux


For Intel® System Studio 2015, find the corresponding article here -> click

< Overview >

 In this article, we enable and use Intel® Integrated Performance Primitives (IPP), Intel® Threading Building Blocks (TBB) and Intel® C++ Compiler (ICC) on Linux (Ubuntu 14.04 LTS 64-bit). We will build and run one of the examples that comes with IPP and apply TBB and ICC to the example to observe the performance improvement from using Intel® System Studio features.

  Intel® System Studio (ISS) used for this article is  Intel® System Studio 2016 Beta Ultimate Edition for Linux Host. The components used here in the tool suite are the following

  • Intel® Integrated Performance Primitives 9.0 for Linux
  • Intel® Threading Building Blocks 4.4
  • Intel® C++ Compiler 16.0

 This example was tested on a dual-core i5 platform.

 

< Building the IPP example with TBB libraries and ICC >

STEP 1. Setup the environment variables for IPP, TBB and ICC

  We need to set up environment variables for IPP, TBB and ICC to work appropriately. Use the following 3 commands in the command line, and the variables will be set. You need to pass the right target architecture when you execute them, e.g., 'ia32' for an IA-32 target and 'intel64' for an Intel® 64 target. Additionally, for ICC, you also need to pass a platform type, e.g., 'linux' for a Linux target, 'android' for an Android target and 'mac' for a Mac target. Finally, do not forget to type a dot and a space at the beginning, which is '. '

  • . /opt/intel/compilers_and_libraries_2016.x.xxx/linux/ipp/bin/ippvars.sh <arch type>
  • . /opt/intel/compilers_and_libraries_2016.x.xxx/linux/tbb/bin/tbbvars.sh <arch type>
  • . /opt/intel/compilers_and_libraries_2016.x.xxx/linux/bin/iccvars.sh -arch <arch type> -platform <platform type>

  To verify that the above commands were executed correctly, type 'printenv' and check that 'IPPROOT' and 'TBBROOT' are listed and point to the IPP and TBB install directories, and that 'PATH' includes '/opt/intel/compilers_and_libraries_2016.x.xxx/linux/bin/<arch type>'. For future usage, it is recommended to write a bash script that enables multiple features of ISS at once.

STEP 2. Find the example

  First, we will go find the IPP example and prepare to build with additional ISS features applied such as TBB and ICC.

  When you install ISS 2016 with default setting,  the IPP example archive file is located at

/opt/intel/compilers_and_libraries_2016.x.xxx/linux/ipp/examples

 you will find 'ipp-examples_lin.tgz' in that location. Extract the examples wherever you like (but don't extract them in a directory where you need strict permissions; do it where you can work without typing 'sudo', otherwise building the example gets complicated), and find the 'ipp_resize_mt' example folder. That is the example we are using here. You can find additional documentation at '<Extracted Examples>/documentation/ipp-examples.html' once you extract the examples.

STEP3. Build the example

 If you want to build the example without TBB and ICC, just run 'make' in '<Extracted Examples>/ipp_resize_mt' and save the binary for future comparison. Since the IPP environment setup has already been done, the example should build without any problem.

Now we need to add TBB and ICC to build a faster version of the original example. In the example's 'Makefile', we can see comments that explain how to enable TBB and ICC while building.

Type 'export CC=icc && export CXX=icpc && export CXXFLAGS=-DUSE_TBB'. Now run 'make' in the 'ipp_resize_mt' folder to build the example.

 

< Simple Performance Comparison >

 The IPP example reports its own performance as the average time it spends resizing one image.

 Refer to the following for the options and arguments that can be used to execute the resize sample.

When the resize example runs without TBB, the resize function uses a single thread, which does not fully exploit multiple cores. The following is the result of the resize example with the command './ipp_resize_mt -i ../../lena.bmp -r 960x540 -p 1 -T AVX2 -l 5000'. This command means 'resize ../../lena.bmp to 960x540 using the linear interpolation method and AVX2, 5000 times'.

 As we can see above, resizing a single image takes about 2.275 ms on average. Given this result, we will test the same example with TBB exploiting 2 cores. If TBB has been successfully enabled, the thread option is included in the help page.

 

 When the resize example runs with TBB, the resize function runs on 2 threads simultaneously. The following is the result of the resize example with the command './ipp_resize_mt -i ../../lena.bmp -r 960x540 -p 1 -T AVX2 -t 2 -l 5000'.

 Utilizing 2 threads at the same time exploited both cores, and the performance increased by about 70%.

To verify that the example actually exploits two cores simultaneously, we can use VTune to investigate. The following picture shows the number of CPUs utilized during each execution (blue = resize example without TBB, yellow = resize example with TBB).

 

 The yellow bar at 2.00 tells us that 2 CPUs were running simultaneously for about 4.4 s.

 The VTune results also show how the threads were working on specific tasks. Extracted results for the functions used for resizing are listed below.

 We can see that only a single thread is used to handle the resize function, and it is a heavy load. When this happens, we should consider parallelizing. The following are the results of the version with TBB.

 As expected, 2 threads were running simultaneously for about 4.4 s during the task, and that increased the performance.

 

< Conclusion >

  We saw how easily an IPP example can be built and tested with other features of ISS. It is recommended to take a close look at the IPP example to learn how to program with IPP and TBB. Here, TBB parallelizes the work across the dual-core processor and increases the performance.
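 As a rough idea of what TBB does for the sample, a minimal sketch (not the sample's actual code) of spreading a row-wise image operation across cores with tbb::parallel_for could look like this; resize_rows is a placeholder for the per-row work:

#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>

void resize_rows(int rowBegin, int rowEnd /*, src, dst, ... */)
{
    /* placeholder: process rows [rowBegin, rowEnd) of the destination image */
    (void)rowBegin; (void)rowEnd;
}

void parallel_resize(int dstHeight)
{
    tbb::parallel_for(tbb::blocked_range<int>(0, dstHeight),
        [](const tbb::blocked_range<int>& r) {
            resize_rows(r.begin(), r.end());   /* each worker handles a row range */
        });
}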

 As for ICC in this example, just changing the compiler from GCC to ICC did not bring a big benefit, since the IPP resize function is already optimized with SIMD instructions and the loops were parallelized by TBB, so there are not many other tasks ICC could optimize here. If there were additional functions and loops that could be vectorized or parallelized, so that SIMD instructions, OpenMP, or Cilk could be used with ICC, there would have been further opportunities to optimize the application.

 

Vectorization Sample for Intel® Advisor XE 2016


Intel® Advisor XE 2016 provides two tools to help ensure your Fortran and native/managed C++ applications take full performance advantage of today’s processors:

  • Vectorization Advisor is a vectorization analysis tool that lets you identify loops that will benefit most from vectorization, identify what is blocking effective vectorization, explore the benefit of alternative data reorganizations, and increase the confidence that vectorization is safe.
  • Threading Advisor is a threading design and prototyping tool that lets you analyze, design, tune, and check threading design options without disrupting your normal development.

The following READMEs show how to improve the performance of a C++ sample application with the Vectorization Advisor in the:

Vectorization Sample README-Windows* OS/Standalone GUI 

This README shows how to use the Intel® Advisor XE 2016 standalone GUI to improve the performance of a C++ sample application. Follow these steps:

  1. Prepare the sample application.

  2. Establish a performance baseline.

  3. Get to know the Vectorization Workflow.

  4. Increase the optimization level.

  5. Disambiguate pointers.

  6. Generate instructions for the highest instruction set available.

  7. Handle dependencies.

  8. Align data.

  9. Reorganize code.

Prepare the Sample Application 

Get Software Tools and Unpack the Sample

You need the following tools:

  • Intel Advisor XE 2016

  • Version 15.0 or higher of an Intel C++ compiler or a supported compiler
    Use an Intel compiler to get more benefit from the Vectorization Advisor Survey Report. See the Release Notes for more information on supported compilers.

  • .zip file extraction utility

Acquire and Install Intel Software Tools

If you do not already have access to the Intel Advisor XE 2016 or to Version 15.0 or higher of an Intel C++ compiler, download an evaluation copy from http://software.intel.com/en-us/articles/intel-software-evaluation-center/.

Set Up the Intel Advisor Sample Application

  1. Copy the vec_samples.zip file from the <advisor-install-dir>\samples\<locale>\C++\ directory to a writable directory or share on your system.
    The default installation path, <advisor-install-dir>, is C:\Program Files (x86)\IntelSWTools\Advisor XE 201n\ (on certain systems, instead of Program Files (x86), the directory name is Program Files).

  2. Extract the sample from the .zip file.

Build the Sample Application in Release Mode

  1. Set the environment for version 15.0 or higher of an Intel compiler.

  2. Build the sample application with the following options in release mode:

    • /O1
    • /Qstd=c99
    • /fp:fast
    • /Qopt-report:5

For example:
icl /O1 /Qstd=c99 /fp:fast /Qopt-report:5 Multiply.c Driver.c -o MatVector

Launch the Intel Advisor

Do one of the following:

  • Run the advixe-gui command.

  • From the Microsoft Windows* Start menu, select Intel Parallel Studio XE 2016 > Analyzers > Advisor XE.

  • From the Microsoft Windows Start screen, scroll to access the Parallel Studio XE 2016 tile.

  • From the Microsoft Windows* Apps screen, scroll to access the Intel Parallel Studio XE group.

Create a New Project

  1. Choose File > New > Project… (or click New Project… in the Welcome page) to open the Create a Project dialog box.

  2. Type vec_samples in the Project Name field, supply a location for the sample application project, then click the Create Project button to open the Project Properties dialog box.

  3. On the left side of the Analysis Target tab, ensure the Survey Hotspots/Suitability Analysis type is selected.

  4. Click the Browse… button next to the Application field, and choose the just-built binary.

  5. Click the OK button to close the Project Properties window and open an empty Survey Report window.

Establish a Performance Baseline 

To set a performance baseline for the improvements that will follow, do the following:

  1. In the VECTORIZATION WORKFLOW pane, click the Collect control under Survey Target to produce a Survey Report.

  2. If necessary, select the Do not show this window again checkbox in the infotip, then close the infotip.

In the Survey Report window, notice:

  • The Elapsed time value in the top left corner is ~16 seconds. (Your value may vary.) This is the baseline against which subsequent improvements will be measured.

  • In the Loop Type column in the top pane, all detected loops are Scalar.

Get to Know the Vectorization Workflow 

The VECTORIZATION WORKFLOW in the left pane is a recommended usage scenario.

Survey Target – This analysis produces a Survey Report (currently displayed) that offers integrated compiler report data and performance data all in one place. Use this information to help identify:

  • Where vectorization will pay off the most

  • If vectorized loops are providing benefit, and if not, why not

  • Un-vectorized and under-vectorized loops, and the estimated expected performance gain of vectorization or better vectorization

  • How data accessed by vectorized loops is organized and the estimated expected performance gain of reorganization

Find Trip Counts – This optional analysis dynamically identifies the number of times loops are invoked and execute (sometimes called call count/loop count and iteration count respectively), and adds this information to the Survey Report. Use this information to make better decisions about your vectorization strategy for particular loops, as well as optimize already-parallel loops.

Check Dependencies – This optional analysis produces one of two optional Refinement Reports if you want to dig deeper. For safety purposes, the compiler is often conservative when assuming data dependencies. Use the Dependencies Report to check for real data dependencies in loops the compiler did not vectorize because of assumed dependencies. If real dependencies are detected, the report can provide additional details to help resolve the dependencies. Your objective: Identify and better characterize real data dependencies that could make forced vectorization unsafe.

Check Memory Access Patterns – This optional analysis produces one of two optional Refinement Reports if you want to dig deeper. Use the Memory Access Patterns (MAP) Report to check for various memory issues, such as non-contiguous memory accesses and unit stride vs. non-unit stride accesses. Your objective: Eliminate issues that could lead to significant vector code execution slowdown or block automatic vectorization by the compiler.

Increase the Optimization Level 

To see if increasing the optimization level improves performance, do the following:

  1. Rebuild the sample application with the following options:

    • /O3
    • /Qstd=c99
    • /fp:fast
    • /Qopt-report:5
  2. In the VECTORIZATION WORKFLOW pane, click the Collect control under Survey Target.

In the new Survey Report, notice:

  • Increasing the optimization level does not vectorize the loops; all are still scalar.

  • Advisor XE explains why the loops are still scalar in the Vector Issues and Why No Vectorization? columns: The compiler assumed there are dependencies that could make vectorization unsafe. The Survey Report also offers recommendations for how to fix this issue.
    Try clicking a:

    • Recommendation icon icon in the Vector Issues column to display Recommendations in the bottom pane

    • Compiler Diagnostic icon icon in the Why No Vectorization? column to display Compiler Diagnostic Details in the bottom pane

  • The Elapsed time improves.

Disambiguate Pointers 

Two pointers are aliased if both point to the same memory location. Storing to memory using a pointer that might be aliased may prevent some optimizations. For example, it may create a dependency between loop iterations that would make vectorization unsafe. Sometimes, the compiler can generate both a vectorized and a non-vectorized version of a loop and test for aliasing at runtime to select the appropriate code path. If you know that pointers do not alias and inform the compiler, it can avoid the runtime check and generate a single vectorized code path.

In Multiply.c, the compiler generates runtime checks to determine whether or not the pointer b in function matvec(FTYPE a[][COLWIDTH], FTYPE b[], FTYPE x[]) is aliased to either a or x. If Multiply.c is compiled with the NOALIAS macro, the restrict qualifier on the argument b informs the compiler that the pointer does not alias any other pointer, and in particular that the array b does not overlap with a or x.
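For illustration, the effect of the NOALIAS path can be sketched as follows; this is an assumption-based sketch, not the sample's exact source. The sample uses the C99 restrict keyword (it is built with /Qstd=c99), while __restrict below is a spelling most C and C++ compilers also accept.

// FTYPE and COLWIDTH are placeholders standing in for the sample's definitions.
typedef double FTYPE;
enum { COLWIDTH = 1024 };

void matvec_noalias(int rows, FTYPE a[][COLWIDTH],
                    FTYPE* __restrict b, FTYPE* __restrict x)
{
    for (int i = 0; i < rows; ++i) {
        b[i] = 0;
        for (int j = 0; j < COLWIDTH; ++j)
            b[i] += a[i][j] * x[j];     // no runtime aliasing check is needed
    }
}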

To see if the NOALIAS macro improves performance, do the following:

  1. Rebuild the sample application with the following options:

    • /O3
    • /Qstd=c99
    • /fp:fast
    • /Qopt-report:5
    • /DNOALIAS
  2. In the VECTORIZATION WORKFLOW pane, click the Collect control under Survey Target.

In the new Survey Report, notice:

  • The compiler successfully vectorizes one loop, but still cannot vectorize the other loops because it assumes there are dependencies that could make vectorization unsafe.

  • The Elapsed time improves.

  • The value in the Vector Instruction Set column in the top pane is SSE2, the default Vector Instruction Set Architecture (ISA). AVX2 is preferable.

  • The value in the Vector Length column in the top pane is 2;4, which means some vector lengths are 2 and some are 4.

Generate Instructions for the Highest Instruction Set Available 

To see if generating instructions for the highest instruction set available on the compilation host processor improves performance, do the following:

  1. Rebuild the sample application with the following options:

    • /O3
    • /Qstd=c99
    • /fp:fast
    • /Qopt-report:5
    • /DNOALIAS
    • /QxHost
  2. In the VECTORIZATION WORKFLOW pane, click the Collect control under Survey Target.

In the new Survey Report, notice:

  • The Elapsed time improves.

  • The Vector Instruction Set and Vector Length columns in the top pane (probably) changes.

Handle Dependencies 

For safety purposes, the compiler is often conservative when assuming data dependencies.

To run a Dependencies analysis to identify and explore real loop-carried dependencies, do the following:

  1. Choose Project > Intel Advisor version Project Properties… to open the Project Properties dialog box.

  2. On the left side of the Analysis Target tab, select the Dependencies Analysis type.

  3. If necessary, click the Browse… button next to the Application field to choose the just-built binary.

  4. Click the OK button.

  5. In the Checkbox icon column in the Survey Report window, select the checkbox for the two loops with assumed dependencies.

  6. In the VECTORIZATION WORKFLOW pane, click the Collect control under Check Dependencies to produce a Dependencies Report.
    (If the analysis takes more than 5 minutes, click the Stop current analysis and display result collected thus far control under Check Dependencies.)

In the Refinement Reports window, notice the Intel Advisor reports no dependencies in two loops and a RAW (Read after write) dependency in one loop. Forcing the compiler to vectorize:

  • The loops without dependencies will not result in significant performance improvement

  • The loop with the RAW dependency will generate incorrect code

Align Data 

The compiler can generate faster code when operating on aligned data.

The ALIGNED macro:

  • Aligns the arrays a, b, and x in Driver.c on a 16-byte boundary.

  • Pads the row length of the matrix, a, to be a multiple of 16 bytes, so each individual row of a is 16-byte aligned.

  • Tells the compiler it can safely assume the arrays in Multiply.c are aligned.
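The kind of alignment the ALIGNED macro enables can be sketched as follows (an assumption-based sketch, not the sample's exact code):

#include <xmmintrin.h>   // _mm_malloc / _mm_free

void aligned_example(int size)
{
    // allocate on a 16-byte boundary
    float* x = (float*)_mm_malloc(size * sizeof(float), 16);

#ifdef __INTEL_COMPILER
    __assume_aligned(x, 16);     // tell the Intel compiler the data is aligned
#endif

    for (int i = 0; i < size; ++i)
        x[i] = 0.0f;

    _mm_free(x);
}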

To see if the ALIGNED macro improves performance, do the following:

  1. Rebuild the sample application with the following options:

    • /O3
    • /Qstd=c99
    • /fp:fast
    • /Qopt-report:5
    • /DNOALIAS
    • /QxHost
    • /DALIGNED
  2. In the VECTORIZATION WORKFLOW pane, click the Collect control under Survey Target.

In the new Survey Report, notice the Elapsed time shows little improvement.

Reorganize Code 

When you use the matvec function in the sample application, the compiler cannot determine it is safe to vectorize the loop because it cannot tell if a and b are unique arrays.

When you inline the loop instead, the compiler can determine it is safe to vectorize the loop because it can tell exactly which variables you want processed in the loop.

The NOFUNCCALL macro removes the matvec function.
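
The sketch below illustrates the difference; the array sizes, the FTYPE definition, and the loop body are hypothetical stand-ins, not the sample's exact source:

/* Illustration only: stand-in definitions for the sample's types. */
#define ROWS     101
#define COLWIDTH 101
typedef double FTYPE;

/* Function form: inside matvec, b and x arrive as pointers that might
   alias a, so the compiler must be conservative about the loop. */
void matvec(FTYPE a[][COLWIDTH], FTYPE b[], FTYPE x[])
{
    for (int i = 0; i < ROWS; i++) {
        b[i] = 0;
        for (int j = 0; j < COLWIDTH; j++)
            b[i] += a[i][j] * x[j];
    }
}

/* NOFUNCCALL form: the same loop written where the arrays are declared,
   so the compiler can see they are distinct and vectorize safely. */
void inlined_form(void)
{
    static FTYPE a[ROWS][COLWIDTH], b[ROWS], x[COLWIDTH];
    for (int i = 0; i < ROWS; i++) {
        b[i] = 0;
        for (int j = 0; j < COLWIDTH; j++)
            b[i] += a[i][j] * x[j];
    }
}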

To see if the NOFUNCCALL macro improves performance, do the following:

  1. Rebuild the sample application with the following options:

    • /O3
    • /Qstd=c99
    • /fp:fast
    • /Qopt-report:5
    • /DNOALIAS
    • /QxHost
    • /DALIGNED
    • /DNOFUNCCALL
  2. In the VECTORIZATION WORKFLOW pane, click the Collect control under Survey Target.

In the new Survey Report, notice the Elapsed time improves substantially.

Vectorization Sample README-Windows* OS/Visual Studio IDE

This README shows how to use the Intel® Advisor XE 2016 plug-in to the Microsoft Visual Studio* 2013 IDE to improve the performance of a C++ sample application. Follow these steps:

  1. Prepare the sample application.

  2. Establish a performance baseline.

  3. Get to know the Vectorization Workflow.

  4. Increase the optimization level.

  5. Disambiguate pointers.

  6. Generate instructions for the highest instruction set available.

  7. Handle dependencies.

  8. Align data.

  9. Reorganize code.

Prepare the Sample Application 

Get Software Tools and Unpack the Sample

You need the following tools:

  • Intel Advisor XE 2016

  • Version 15.0 or higher of an Intel C++ compiler or a supported compiler
    Use an Intel compiler to get more benefit from the Vectorization Advisor Survey Report. See the Release Notes for more information on supported compilers.

  • .zip file extraction utility

Acquire and Install Intel Software Tools

If you do not already have access to the Intel Advisor XE 2016 or to Version 15.0 or higher of an Intel C++ compiler, download an evaluation copy from http://software.intel.com/en-us/articles/intel-software-evaluation-center/.

Set Up the Intel Advisor Sample Application

  1. Copy the vec_samples.zip file from the <advisor-install-dir>\samples\<locale>\C++\ directory to a writable directory or share on your system.
    The default installation path, <advisor-install-dir>, is C:\Program Files (x86)\IntelSWTools\Advisor XE 201n\ (on certain systems, instead of Program Files (x86), the directory name is Program Files).

  2. Extract the sample from the .zip file.

Open the Microsoft Visual Studio* Solution

  1. Launch the Microsoft Visual Studio* IDE.

  2. If necessary, choose View > Solution Explorer.

  3. Choose File > Open > Project/Solution.

  4. In the Open Project dialog box, open the vec_samples.sln file.

Prepare the Project

  1. Right-click the vec_samples project in the Solution Explorer. Then choose Intel Compiler XE > Use Intel C++.

  2. If the Solutions Configuration drop-down on the Visual Studio* Standard toolbar is set to Debug, change it to Release.

  3. Choose Build > Clean Solution.

Establish a Performance Baseline 

To set a performance baseline for the improvements that will follow, do the following:

  1. Right-click the vec_samples project in the Solution Explorer. Then choose Project > Properties to display the Property Pages window.

  2. Choose Configuration Properties > C/C++ > Optimization. In the Optimization drop-down, choose Minimum Size (/O1).

  3. Choose Configuration Properties > C/C++ > Diagnostics [Intel C++]. In the Optimization Diagnostic Level drop-down, choose Level 5 [/Qopt-report:5].

  4. Click the Apply button, then click the OK button.

  5. Choose Build > Rebuild Solution.

  6. Right-click the vec_samples project in the Solution Explorer. Then choose Intel Advisor XE 2016 > Start Survey Analysis to produce a Survey Report.
    If necessary, select the Do not show this window again checkbox in the infotip, then close the infotip.

In the Survey Report window, notice:

  • The Elapsed time value in the top left corner is ~16 seconds. (Your value may vary.) This is the baseline against which subsequent improvements will be measured.

  • In the Loop Type column in the top pane, all detected loops are Scalar.

Get to Know the Vectorization Workflow 

The VECTORIZATION WORKFLOW in the left pane is a recommended usage scenario.

Survey Target – This analysis produces a Survey Report (currently displayed) that offers integrated compiler report data and performance data all in one place. Use this information to help identify:

  • Where vectorization will pay off the most

  • If vectorized loops are providing benefit, and if not, why not

  • Un-vectorized and under-vectorized loops, and the estimated expected performance gain of vectorization or better vectorization

  • How data accessed by vectorized loops is organized and the estimated expected performance gain of reorganization

Find Trip Counts – This optional analysis dynamically identifies the number of times loops are invoked and execute (sometimes called call count/loop count and iteration count respectively), and adds this information to the Survey Report. Use this information to make better decisions about your vectorization strategy for particular loops, as well as optimize already-parallel loops.

Check Dependencies – This optional analysis produces one of two optional Refinement Reports if you want to dig deeper. For safety purposes, the compiler is often conservative when assuming data dependencies. Use the Dependencies Report to check for real data dependencies in loops the compiler did not vectorize because of assumed dependencies. If real dependencies are detected, the report can provide additional details to help resolve the dependencies. Your objective: Identify and better characterize real data dependencies that could make forced vectorization unsafe.

Check Memory Access Patterns – This optional analysis produces one of two optional Refinement Reports if you want to dig deeper. Use the Memory Access Patterns (MAP) Report to check for various memory issues, such as non-contiguous memory accesses and unit stride vs. non-unit stride accesses. Your objective: Eliminate issues that could lead to significant vector code execution slowdown or block automatic vectorization by the compiler.
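
For example, the two routines below (a generic illustration with hypothetical names, not code from the sample) perform the same reduction but traverse the matrix in different orders; the first makes unit-stride accesses, while the second jumps a full row between iterations:

/* Unit stride: consecutive iterations read consecutive elements of m. */
float sum_row(float m[][1024], int n, int i)
{
    float sum = 0.0f;
    for (int j = 0; j < n; j++)
        sum += m[i][j];       /* adjacent elements: unit stride */
    return sum;
}

/* Non-unit stride: consecutive iterations are 1024 floats apart, which
   typically produces slower, gather-like vector code. */
float sum_col(float m[][1024], int n, int j)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
        sum += m[i][j];       /* a full row apart: non-unit stride */
    return sum;
}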

Increase the Optimization Level 

To see if increasing the optimization level improves performance, do the following:

  1. Right-click the vec_samples project in the Solution Explorer. Then choose Project > Properties.

  2. Choose Configuration Properties > C/C++ > Optimization. In the Optimization drop-down, choose Highest Optimizations (/O3).

  3. Click the Apply button, then click the OK button.

  4. Choose Build > Rebuild Solution.

  5. In the VECTORIZATION WORKFLOW pane, click the Collect control under Survey Target.

In the new Survey Report, notice:

  • Increasing the optimization level does not vectorize the loops; all are still scalar.

  • Advisor XE explains why the loops are still scalar in the Vector Issues and Why No Vectorization? columns: The compiler assumed there are dependencies that could make vectorization unsafe. The Survey Report also offers recommendations for how to fix this issue.
    Try clicking a:

    • Recommendation icon in the Vector Issues column to display Recommendations in the bottom pane

    • Compiler Diagnostic icon in the Why No Vectorization? column to display Compiler Diagnostic Details in the bottom pane

  • The Elapsed time improves.

Disambiguate Pointers 

Two pointers are aliased if both point to the same memory location. Storing to memory using a pointer that might be aliased may prevent some optimizations. For example, it may create a dependency between loop iterations that would make vectorization unsafe. Sometimes, the compiler can generate both a vectorized and a non-vectorized version of a loop and test for aliasing at runtime to select the appropriate code path. If you know that pointers do not alias and inform the compiler, it can avoid the runtime check and generate a single vectorized code path.

In Multiply.c, the compiler generates runtime checks to determine whether or not the pointer b in function matvec(FTYPE a[][COLWIDTH], FTYPE b[], FTYPE x[]) is aliased to either a or x. If Multiply.c is compiled with the NOALIAS macro, the restrict qualifier of the argument b informs the compiler that the pointer does not alias with any other pointer, and in particular that the array b does not overlap with a or x.
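
Conceptually, the only change is the qualifier on b; the declarations below are a sketch of that difference (with stand-in definitions for FTYPE and COLWIDTH), not the sample's verbatim source:

/* Stand-ins for the sample's real definitions, for illustration only. */
typedef double FTYPE;
#define COLWIDTH 16

/* Without NOALIAS: the compiler must assume b may overlap a or x, so it
   either adds a runtime overlap check or leaves the loop scalar. */
void matvec(FTYPE a[][COLWIDTH], FTYPE b[], FTYPE x[]);

/* With NOALIAS: restrict promises that, inside matvec, b is the only
   handle to the memory it points to, so no runtime check is needed. */
void matvec(FTYPE a[][COLWIDTH], FTYPE *restrict b, FTYPE x[]);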

To see if the NOALIAS macro improves performance, do the following:

  1. Right-click the vec_samples project in the Solution Explorer. Then choose Project > Properties.

  2. Choose Configuration Properties > C/C++ > Command Line. In the Additional Options area, type /DNOALIAS.

  3. Click the Apply button, then click the OK button.

  4. Choose Build > Rebuild Solution.

  5. In the VECTORIZATION WORKFLOW pane, click the Collect control under Survey Target.

In the new Survey Report, notice:

  • The compiler successfully vectorizes one loop, but still cannot vectorize the other loops because it assumes there are dependencies that could make vectorization unsafe.

  • The Elapsed time improves.

  • The value in the Vector Instruction Set column in the top pane is SSE2, the default Vector Instruction Set Architecture (ISA). AVX2 is preferable.

  • The value in the Vector Length column in the top pane is 2;4, which means some vector lengths are 2 and some are 4.

Generate Instructions for the Highest Instruction Set Available 

To see if generating instructions for the highest instruction set available on the compilation host processor improves performance, do the following:

  1. Right-click the vec_samples project in the Solution Explorer. Then choose Project > Properties.

  2. Choose Configuration Properties > C/C++ > Code Generation [Intel C++]. In the Intel Processor-Specific Optimization drop-down, choose Same as the host processor performing the compilation (/QxHost).

  3. Click the Apply button, then click the OK button.

  4. Choose Build > Rebuild Solution.

  5. In the VECTORIZATION WORKFLOW pane, click the Collect control under Survey Target.

In the new Survey Report, notice:

  • The Elapsed time improves.

  • The Vector Instruction Set and Vector Length columns in the top pane (probably) change.

Handle Dependencies 

For safety purposes, the compiler is often conservative when assuming data dependencies.

To run a Dependencies analysis to identify and explore real loop-carried dependencies, do the following:

  1. In the checkbox column, select the checkbox for the two loops with assumed dependencies.

  2. In the VECTORIZATION WORKFLOW pane, click the Collect control under Check Dependencies to produce a Dependencies Report.
    (If the analysis takes more than 5 minutes, click the Stop current analysis and display result collected thus far control under Check Dependencies.)

In the Refinement Reports window, notice the Intel Advisor reports no dependencies in two loops and a RAW (Read after write) dependency in one loop. Forcing the compiler to vectorize:

  • The loops without dependencies will not result in significant performance improvement

  • The loop with the RAW dependency will generate incorrect code
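
A RAW dependency has the general shape shown below (a generic illustration, not the sample's loop): each iteration reads a value that the previous iteration wrote, so the iterations cannot safely execute in parallel vector lanes.

/* Generic illustration of a read-after-write (RAW) loop-carried dependency. */
void prefix_sum(float *a, const float *b, int n)
{
    for (int i = 1; i < n; i++)
        a[i] = a[i - 1] + b[i];   /* reads a[i-1], written one iteration earlier */
}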

Align Data 

The compiler can generate faster code when operating on aligned data.

The ALIGNED macro:

  • Aligns the arrays a, b, and x in Driver.c on a 16-byte boundary.

  • Pads the row length of the matrix, a, to be a multiple of 16 bytes, so each individual row of a is 16-byte aligned.

  • Tells the compiler it can safely assume the arrays in Multiply.c are aligned.

To see if the ALIGNED macro improves performance, do the following:

  1. Right-click the vec_samples project in the Solution Explorer. Then choose Project > Properties.

  2. Choose Configuration Properties > C/C++ > Command Line. In the Additional Options area, type /DALIGNED.

  3. Click the Apply button, then click the OK button.

  4. Choose Build > Rebuild Solution.

  5. In the VECTORIZATION WORKFLOW pane, click the Collect control under Survey Target.

In the new Survey Report, notice the Elapsed time shows little improvement.

Reorganize Code 

When you use the matvec function in the sample application, the compiler cannot determine it is safe to vectorize the loop because it cannot tell if a and b are unique arrays.

When you inline the loop instead, the compiler can determine it is safe to vectorize the loop because it can tell exactly which variables you want processed in the loop.

The NOFUNCCALL macro removes the matvec function.

To see if the NOFUNCCALL macro improves performance, do the following:

  1. Right-click the vec_samples project in the Solution Explorer. Then choose Project > Properties.

  2. Choose Configuration Properties > C/C++ > Command Line. In the Additional Options area, type /DNOFUNCCALL.

  3. Click the Apply button, then click the OK button.

  4. Choose Build > Rebuild Solution.

  5. In the VECTORIZATION WORKFLOW pane, click the Collect control under Survey Target.

In the new Survey Report, notice the Elapsed time improves substantially.

Vectorization Sample README-Linux* OS/Standalone GUI

This README shows how to use the Intel® Advisor XE 2016 GUI to improve the performance of a C++ sample application. Follow these steps:

  1. Prepare the sample application.

  2. Establish a performance baseline.

  3. Get to know the Vectorization Workflow.

  4. Increase the optimization level.

  5. Disambiguate pointers.

  6. Generate instructions for the highest instruction set available.

  7. Handle dependencies.

  8. Align data.

  9. Reorganize code.

Prepare the Sample Application 

Get Software Tools and Unpack the Sample

You need the following tools:

  • Intel Advisor XE 2016

  • Version 15.0 or higher of an Intel C++ compiler or a supported compiler
    Use an Intel compiler to get more benefit from the Vectorization Advisor Survey Report. See the Release Notes for more information on supported compilers.

  • .tgz file extraction utility

Acquire and Install Intel Software Tools

If you do not already have access to the Intel Advisor XE 2016 or to Version 15.0 or higher of an Intel C++ compiler, download an evaluation copy from http://software.intel.com/en-us/articles/intel-software-evaluation-center/.

Set Up the Intel Advisor Sample Application

  1. Copy the vec_samples.tgz file from the <advisor-install-dir>/samples/<locale>/C++/ directory to a writable directory or share on your system.
    The default installation path, <advisor-install-dir>:

    • For root users: /opt/intel/parallel_studio_xe_201n/advisor_xe_201n/

    • For non-root users: $HOME/intel/parallel_studio_xe_201n/advisor_xe_201n/

  2. Extract the sample from the .tgz file.

Build the Sample Application in Release Mode

  1. Set the environment for version 15.0 or higher of an Intel compiler.

  2. Build the sample application with the following options in release mode:

    • -O1
    • -std=c99
    • -fp-model fast
    • -qopt-report=5

For example:
icpc -O1 -std=c99 -fp-model fast -qopt-report=5 Multiply.c Driver.c -o MatVector

Launch the Intel Advisor

Run the advixe-gui command.

NOTE: Make sure you run the Intel Advisor in the same environment as the sample application.

Create a New Project

  1. Choose File > New > Project… (or click New Project… in the Welcome page) to open the Create a Project dialog box.

  2. Type vec_samples in the Project Name field, supply a location for the sample application project, then click the Create Project button to open the Project Properties dialog box.

  3. On the left side of the Analysis Target tab, ensure the Survey Hotspots/Suitability Analysis type is selected.

  4. Click the Browse… button next to the Application field, and choose the just-built binary.

  5. Click the OK button to close the Project Properties window and open an empty Survey Report window.

Establish a Performance Baseline 

To set a performance baseline for the improvements that will follow, do the following:

  1. In the VECTORIZATION WORKFLOW pane, click the Collect control under Survey Target to produce a Survey Report.

  2. If necessary, select the Do not show this window again checkbox in the infotip, then close the infotip.

In the Survey Report window, notice:

  • The Elapsed time value in the top left corner is ~16 seconds. (Your value may vary.) This is the baseline against which subsequent improvements will be measured.

  • In the Loop Type column in the top pane, all detected loops are Scalar.

Get to Know the Vectorization Workflow 

The VECTORIZATION WORKFLOW in the left pane is a recommended usage scenario.

Survey Target – This analysis produces a Survey Report (currently displayed) that offers integrated compiler report data and performance data all in one place. Use this information to help identify:

  • Where vectorization will pay off the most

  • If vectorized loops are providing benefit, and if not, why not

  • Un-vectorized and under-vectorized loops, and the estimated expected performance gain of vectorization or better vectorization

  • How data accessed by vectorized loops is organized and the estimated expected performance gain of reorganization

Find Trip Counts – This optional analysis dynamically identifies the number of times loops are invoked and execute (sometimes called call count/loop count and iteration count respectively), and adds this information to the Survey Report. Use this information to make better decisions about your vectorization strategy for particular loops, as well as optimize already-parallel loops.

Check Dependencies – This optional analysis produces one of two optional Refinement Reports if you want to dig deeper. For safety purposes, the compiler is often conservative when assuming data dependencies. Use the Dependencies Report to check for real data dependencies in loops the compiler did not vectorize because of assumed dependencies. If real dependencies are detected, the report can provide additional details to help resolve the dependencies. Your objective: Identify and better characterize real data dependencies that could make forced vectorization unsafe.

Check Memory Access Patterns – This optional analysis produces one of two optional Refinement Reports if you want to dig deeper. Use the Memory Access Patterns (MAP) Report to check for various memory issues, such as non-contiguous memory accesses and unit stride vs. non-unit stride accesses. Your objective: Eliminate issues that could lead to significant vector code execution slowdown or block automatic vectorization by the compiler.

Increase the Optimization Level 

To see if increasing the optimization level improves performance, do the following:

  1. Rebuild the sample application with the following options:

    • -O3
    • -std=c99
    • -fp-model fast
    • -qopt-report=5
  2. In the VECTORIZATION WORKFLOW pane, click the Collect control under Survey Target.

In the new Survey Report, notice:

  • Increasing the optimization level does not vectorize the loops; all are still scalar.

  • Advisor XE explains why the loops are still scalar in the Vector Issues and Why No Vectorization? columns: The compiler assumed there are dependencies that could make vectorization unsafe. The Survey Report also offers recommendations for how to fix this issue.
    Try clicking a:

    • Recommendation icon in the Vector Issues column to display Recommendations in the bottom pane

    • Compiler Diagnostic icon in the Why No Vectorization? column to display Compiler Diagnostic Details in the bottom pane

  • The Elapsed time improves.

Disambiguate Pointers 

Two pointers are aliased if both point to the same memory location. Storing to memory using a pointer that might be aliased may prevent some optimizations. For example, it may create a dependency between loop iterations that would make vectorization unsafe. Sometimes, the compiler can generate both a vectorized and a non-vectorized version of a loop and test for aliasing at runtime to select the appropriate code path. If you know that pointers do not alias and inform the compiler, it can avoid the runtime check and generate a single vectorized code path.

In Multiply.c, the compiler generates runtime checks to determine whether or not the pointer b in function matvec(FTYPE a[][COLWIDTH], FTYPE b[], FTYPE x[]) is aliased to either a or x. If Multiply.c is compiled with the NOALIAS macro, the restrict qualifier of the argument b informs the compiler that the pointer does not alias with any other pointer, and in particular that the array b does not overlap with a or x.
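
One common way to wire such a build-time switch is a conditional qualifier macro; the sketch below shows the general pattern (with stand-in definitions for FTYPE and COLWIDTH), though the sample's actual macro layout may differ:

/* Sketch of a build-time restrict switch; the macro names are illustrative. */
typedef double FTYPE;      /* stand-in for the sample's FTYPE */
#define COLWIDTH 16        /* stand-in for the sample's COLWIDTH */

#ifdef NOALIAS
#define RESTRICT restrict
#else
#define RESTRICT
#endif

void matvec(FTYPE a[][COLWIDTH], FTYPE * RESTRICT b, FTYPE x[]);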

To see if the NOALIAS macro improves performance, do the following:

  1. Rebuild the sample application with the following options:

    • -O3
    • -std=c99
    • -fp-model fast
    • -qopt-report=5
    • -D NOALIAS
  2. In the VECTORIZATION WORKFLOW pane, click the Collect control under Survey Target.

In the new Survey Report, notice:

  • The compiler successfully vectorizes one loop, but still cannot vectorize the other loops because it assumes there are dependencies that could make vectorization unsafe.

  • The Elapsed time improves.

  • The value in the Vector Instruction Set column in the top pane is SSE2, the default Vector Instruction Set Architecture (ISA). AVX2 is preferable.

  • The value in the Vector Length column in the top pane is 2;4, which means some vector lengths are 2 and some are 4.

Generate Instructions for the Highest Instruction Set Available 

To see if generating instructions for the highest instruction set available on the compilation host processor improves performance, do the following:

  1. Rebuild the sample application with the following options (an example command line follows these steps):

    • -O3
    • -std=c99
    • -fp-model fast
    • -qopt-report=5
    • -D NOALIAS
    • -xHost
  2. In the VECTORIZATION WORKFLOW pane, click the Collect control under Survey Target.
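
Mirroring the baseline command shown earlier, the rebuild in step 1 might look like:
icpc -O3 -std=c99 -fp-model fast -qopt-report=5 -D NOALIAS -xHost Multiply.c Driver.c -o MatVector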

In the new Survey Report, notice:

  • The Elapsed time improves.

  • The Vector Instruction Set and Vector Length columns in the top pane (probably) change.

Handle Dependencies 

For safety purposes, the compiler is often conservative when assuming data dependencies.

To run a Dependencies analysis to identify and explore real loop-carried dependencies, do the following:

  1. Choose Project > Intel Advisor version Project Properties… to open the Project Properties dialog box.

  2. On the left side of the Analysis Target tab, select the Dependencies Analysis type.

  3. If necessary, click the Browse… button next to the Application field to choose the just-built binary.

  4. Click the OK button.

  5. In the checkbox column in the Survey Report window, select the checkbox for the two loops with assumed dependencies.

  6. In the VECTORIZATION WORKFLOW pane, click the Collect control under Check Dependencies to produce a Dependencies Report.
    (If the analysis takes more than 5 minutes, click the Stop current analysis and display result collected thus far control under Check Dependencies.)

In the Refinement Reports window, notice the Intel Advisor reports no dependencies in two loops and a RAW (Read after write) dependency in one loop. Forcing the compiler to vectorize:

  • The loops without dependencies will not result in significant performance improvement

  • The loop with the RAW dependency will generate incorrect code

Align Data 

The compiler can generate faster code when operating on aligned data.

The ALIGNED macro:

  • Aligns the arrays a, b, and x in Driver.c on a 16-byte boundary.

  • Pads the row length of the matrix, a, to be a multiple of 16 bytes, so each individual row of a is 16-byte aligned.

  • Tells the compiler it can safely assume the arrays in Multiply.c are aligned.

To see if the ALIGNED macro improves performance, do the following:

  1. Rebuild the sample application with the following options:

    • -O3
    • -std=c99
    • -fp-model fast
    • -qopt-report=5
    • -D NOALIAS
    • -xHost
    • -D ALIGNED
  2. In the VECTORIZATION WORKFLOW pane, click the Collect control under Survey Target.

In the new Survey Report, notice the Elapsed time shows little improvement.

Reorganize Code 

When you use the matvec function in the sample application, the compiler cannot determine it is safe to vectorize the loop because it cannot tell if a and b are unique arrays.

When you inline the loop instead, the compiler can determine it is safe to vectorize the loop because it can tell exactly which variables you want processed in the loop.

The NOFUNCCALL macro removes the matvec function.

To see if the NOFUNCCALL macro improves performance, do the following:

  1. Rebuild the sample application with the following options (an example command line follows these steps):

    • -O3
    • -std=c99
    • -fp-model fast
    • -qopt-report=5
    • -D NOALIAS
    • -xHost
    • -D ALIGNED
    • -D NOFUNCCALL
  2. In the VECTORIZATION WORKFLOW pane, click the Collect control under Survey Target.
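
Mirroring the baseline command shown earlier, the full rebuild in step 1 with all of the macros enabled might look like:
icpc -O3 -std=c99 -fp-model fast -qopt-report=5 -D NOALIAS -xHost -D ALIGNED -D NOFUNCCALL Multiply.c Driver.c -o MatVector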

In the new Survey Report, notice the Elapsed time improves substantially.

For More Information

Start with the following resources:

Vectorization Resources for Intel® Advisor XE Users


Intel® Advisor XE 2016 provides two tools to help ensure your Fortran and native/managed C++ applications take full performance advantage of today’s processors:

  • Vectorization Advisor is a vectorization analysis tool that lets you identify loops that will benefit most from vectorization, identify what is blocking effective vectorization, explore the benefit of alternative data reorganizations, and increase the confidence that vectorization is safe.
  • Threading Advisor is a threading design and prototyping tool that lets you analyze, design, tune, and check threading design options without disrupting your normal development.

One of the key Vectorization Advisor features is a Survey Report that offers integrated compiler report data and performance data all in one place, including GUI-embedded advice on how to fix vectorization issues specific to your code. This page augments that GUI-embedded advice with links to web-based vectorization resources.

Contents

Getting Started    Landing Pages    Compiler Diagnostics    OpenMP* Resources    Compiler User Guides    See Also

Getting Started With Intel® Advisor XE 

Use the following resources to start taking advantage of the power and flexibility of the Intel Advisor XE:

Recommendation: Follow these SIMD Parallelism workflows (usage scenarios) to maximize your productivity as quickly as possible. (White polygons represent optional workflow steps.)

[Workflow diagrams: Intel Advisor Survey Workflow and Intel Advisor Dig Deeper Workflow, each a quick path to maximizing productivity]

Intel® Developer Zone Landing Pages 

The Intel® Developer Zone offers a wealth of vectorization resources. The following landing pages are an easy way to find vectorization resources of interest.

NOTE:

  • Some of these resources emphasize the vectorization capabilities of other Intel software development tools, such as the Intel compiler vec and opt reports and the Intel® VTune™ Amplifier XE hotspot analyses. Nevertheless, these resources include nuggets of vectorization information useful to Intel Advisor XE users.
  • Many resources are written to support the current Intel compiler version plus two previous versions. Much of the content in compiler resources written for a previous Intel compiler version still applies to the current Intel compiler version; in most cases, version differences are explained.

Intel Compiler Diagnostic Messages 

The Vectorization Advisor requires the Intel Compiler version 15.0 or later to collect a full set of analysis data, including compiler diagnostics about vectorization constraints. The following links provide more information about these compiler diagnostics:  

NOTE: A subset of metrics is available for binaries built with the GCC* or Microsoft* compiler.

OpenMP* 4.0 Resources 

The following resources are helpful if you are using the OpenMP* parallel framework in your application:

Intel Compiler User & Reference Guides 

The Intel Advisor XE ships with a C++ mini-guide and a Fortran mini-guide of vectorization-related, Intel compiler 16.0 options and directives. These mini-guides are composed of excerpted pages from the full Intel compiler user and reference guides. The following links provide more information about Intel compilers:

See Also 

Many Intel Advisor XE users find these additional resources helpful:
