Quantcast
Channel: Intel Developer Zone Articles
Viewing all 312 articles
Browse latest View live

How to use Intel® Inspector for Systems

$
0
0

Background

Intel® System Studio is the new embedded software tool suite and includes Intel® Inspector for Systems. This article will explain the steps you need to follow to run Inspector for Systems on an embedded platform.

Overview

We will use Yocto Project* version 1.2 as an example. This platform supports many Intel board support packages (BSP’s) and it also allows you to work without running any physical embedded hardware by letting you develop via an emulator that they provide. The following steps explain how to setup an application and then run an Intel® Inspector for Systems collection on it via the Yocto Project* emulator(runqemu).  Here are the steps we will take to run our collection:

  1. Setting up a Yocto Project* 1.2 environment.
    1. Cross compilers
    2. Yocto Project* pre-built kernel
    3. File system to NFS mount from host
  2. Install Intel System Studio
    1. Copy installation to root file system created above.
  3. Cross compiling the tachyon sample application
    1. Build the application
    2. Copy to root file system created above.
  4. Start a QEMU emulator session
    1. Login to emulator
    2. cd /home/root
  5. Run an Intel Inspector for Systems on the tachyon sample code
  6. On your Linux* host open the Inspector for Systems  results and view results in the Inspector for systems GUI

Setting up a Yocto Project* 1.2 environment

  1. Download the pre-built toolchain, which includes the runqemu script and support files
    download from: http://downloads.yoctoproject.org/releases/yocto/yocto-1.2/toolchain/
    1. The following tool chain tarball is for a 32-bit development host system and a 32-bit target architecture: poky-eglibc-i686-i586-toolchain-gmae-1.2.tar.bz2
    2. You need to install this tar ball on your Linux* host in the root “/” directory. This will create an installation area “/opt/poky/1.2”
  2. Downloading the Pre-Built Linux* Kernel:
    You can download the pre-built Linux* kernel (*zImage-qemu<arch>.bin OR vmlinux-qemu<arch>.bin).
    1.   http://downloads.yoctoproject.org/releases/yocto/yocto-1.2/machines/qemu/
      1. download: bzImage-qemux86.bi
      2. This article assumes this file is located ~/yocto/ bzImage-qemux86.bin
  3. Create a file system
    1. from: http://downloads.yoctoproject.org/releases/yocto/yocto-1.2/machines/qemu/
    2. Download core-image-sato-sdk-qemux86.tar.bz2
    3. source /opt/poky/1.2/environment-setup-i586-poky-linux 
      mkdir -p ~/yocto/file_system/
      runqemu-extract-sdk core-image-sato-sdk-qemux86.tar.bz2  ~/yocto/file_system/
    4. This will create a root file system that you can access for your host and emulated session.

Install Intel® Inspector 2013 for Systems

  1. Install Intel® System Studio on your Linux* host.
  2. Copy the Intel Inspector for Systems installation to the file system you created above.
    1. cp inspector_for_systems ~/yocto/file_system/home/root

Cross compiling the tachyon sample code

  1. The tachyon sample code is provided as part of the Inspector for Systems release.
  2. On your Linux* host      
    1. cd ~/yocto
    2. untar tachyon : tar xvzf /opt/intel/systems_studio_2013.0.xxx/inspector_for_systems/samples/en/C++/tachon_insp_xe.tgz
    3. You will need to modify the tachyon sample as follows:
    4. In the top level Makefile:  Comment the line containing CXX.
    5. The the lower level Makefile.gmake ('tachyon/common/gui/Makefile.gmake') Add the following lines:
UI = x
EXE = $(NAME)$(SUFFIX)
CXXFLAGS += -I$(OECORE_TARGET_SYSROOT)/X11
LIBS += -lpthread -lX11
#LIBS += -lXext
CXXFLAGS += -DX_NOSHMEM
  1. source /opt/poky/1.2/environment-setup-i586-poky-linux 
  2. e.      make
  3. f.        Copy the tachyon binary and the create libtbb.so file to ~/yocto/file_system/home/yocto/test

 

Start a QEMU emulator session

  1. source /opt/poky/1.2/environment-setup-i586-poky-linux 

  2. runqemu bzImage-qemux86.bin ~/yocto/file_system/

 

Run Intel® Inspector for Systems on the tachyon sample code

  1. Login to the QEMU emulator session
    1. User root no password
    2. cd /home/root/inspector_for_systems
  2. You should see the tachyon binaries and inspector directory in the file system copied from above.
  3. source /home/root/inspector_for_systems/inspxe-vars.sh
  4. Run an Inspector collection:

Create directory test; cd test

inspxe-cl –no-auto-finalize –collect mi2 ../tachyon_find_hotspots

Note: the above will perform a level 2 memory checking analysis.

Run inspxe-cl –collect help to see some other collections you can do.

On your Linux* host: Open the Inspector for Systems results

  1. You can view the results you create above on you Linux host. You should see a directory ~/yocto/file_system/home/root/test/r000mi2.
  2. To view these results in Intel® Inspector for Systems  on your Linux host
    1.  Source /opt/intel/system_studio_2014.0.xxx/inspector_for_systems/inspxe-vars.sh
    2.  inspxe-gui ~/yocto/file_system/home/root/test/r000mi2
    3. You should see the following results similar to the following:

 

Summary

Intel Inspector for Systems is a powerful tool for finding correctness errors in your code.  


Overhead and Spin Time Issue in Intel® Threading Building Blocks Applications Due to Inlining

$
0
0

Intel® Threading Building Blocks (Intel TBB) applications may have an incorrectly high amount of Overhead or Spin Time associated with them due to function inlining without corresponding debug information.

When analyzing an Intel TBB application with Intel® VTune™ Amplifier XE, we recommend that you enable inline debug information to ensure Overhead and Spin Time metrics are as accurate as possible. Using a compiler that performs inlining (as is the case with most optimized builds) in conjunction with Intel TBB may result in aggressive inlining of user code into Intel TBB template functions and methods. This may cause time that would normally be classified as Effective Time being incorrectly classified as Overhead or Spin Time.

If you're seeing results that appear to have much more Overhead or Spin Time and much less Effective Time, make sure you have enabled this option in your compiler.

To enable this option for the various compilers listed below, use the corresponding flag:

GCC* in Linux*

-g flag

Intel Compiler on Linux

"-debug inline-debug-info"

Intel Compiler on Windows*

"/debug:inline-debug-info”

 

Below are two examples of the same application compiled without this flag and with this flag respectively. Notice that the call stack in Figure 1 ends at the Intel TBB execute() function while the call stack in Figure 2 includes function names from the user code, e.g. IsPrime(). In both examples, the compiler inlined the IsPrime() function, however only the second example includes the debug information which is passed along to VTune Amplifier. Also notice the large amount of Overhead Time in the first example. This time was classified as Overhead Time because there was no information from the compiler about the inlined user code. Also notice the large amount of red in the timeline of Figure 1. This red represents Overhead or Spin Time. If your application shows these symptoms and you believe this to be incorrect, ensure you have enabled inline debug information as described above.

No Inline Information

Figure 1: VTune Amplifier result without inline debug information

With Inline Information

Figure 2: VTune Amplifier result with inline debug information enabled

For more information on enabling inline debug information, see the VTune Amplifier Help section titled “Viewing Data on Inline Functions”.

Diagnostic 15523: Loop was not vectorized: cannot compute loop iteration count before executing the loop.

$
0
0

Product Version: Intel(R) Visual Fortran Compiler XE 15.0.0.070

Cause:

The vectorization report generated when using Visual Fortran Compiler's optimization options ( -O3  -Qopt-report:2 ) states that loop was not vectorized since loop iteration count cannot be computed.

Example:

An example below will generate the following remark in optimization report:

subroutine foo(a, n)

       implicit none
       integer, intent(in) :: n
       double precision, intent(inout) :: a(n)
       integer :: bar
       integer :: i

       i=0
 100   CONTINUE
       a(i)=0
       i=i+1
       if (i .lt. bar()) goto 100

  end subroutine foo

 

 

Report from: Loop nest, Vector & Auto-parallelization optimizations [loop, vec, par]

  LOOP BEGIN  
      remark #15523: loop was not vectorized: cannot compute loop iteration count before executing the loop.
 
  LOOP END

Resolution:

 -goto statements prevent vectorization, rewriting the code will get this loop vectorized.

See also:

Requirements for Vectorizable Loops

Vectorization Essentials

Vectorization and Optimization Reports

Back to the list of vectorization diagnostics for Intel Fortran

Sierpiński Carpet in OpenCL 2.0

$
0
0

We demonstrate how to create a Sierpinski Carpet in OpenCL 2.0

Prerequisites:

      A laptop or a workstation with the 5th Generation Intel® Core™ Processor

What is Nested Parallelism?

Device kernels can enqueue kernels to the same device with no host interaction, enabling flexible work scheduling paradigms and avoiding the need to transfer execution control and data between the device and host, often significantly offloading host processor bottlenecks (see Khronos Finalizes OpenCL 2.0 Specification for Heterogeneous Computing). Nested parallelism was introduced to OpenCL 2.0 to meet competitive challenge from CUDA’s Dynamic Parallelism.

What are Blocks?

Blocks simplify nested parallelism (also known as device-side enqueue). Blocks

For more information see Blocks in OpenCL 2.0.

What is Sierpiński Carpet?

The Sierpinski carpet is a plane fractal first described by Wacław Sierpiński in 1916. Start with a white square. Divide the square into 9 sub-squares in a 3-by-3 grid. Paint the central sub-square black. Apply the same procedure recursively to the remaining 8 sub-squares. And so on …

See http://en.wikipedia.org/wiki/Sierpinski_carpet for more info.

Sierpinski carpet

enqueue_kernel API

int enqueue_kernel ( queue_t queue,
                     kernel_enqueue_flags_t flags,
                     const ndrange_t ndrange,
                     void (^block)(void) );
enqueue_kernel is similar to clEnqueuNDRangeKernel API, but in OpenCL C kernel language. It has three more variations available that provide handling of event dependencies and passing local memory. For more info, see enqueue_kernel functions online documentation.
 

Sierpiński Carpet – Host Side

Build your code with "-cl-std=CL2.0“ to enable OpenCL 2.0 compilation. Don’t forget to create device side queue on the host:
// You need to create device side queue for enqueue_kernel to work
// We set the device side queue to 16MB,
// since we are going to have a large number of enqueues
cl_queue_properties qprop[] = {CL_QUEUE_SIZE, 16*1024*1024, 
                               CL_QUEUE_PROPERTIES,         
  (cl_command_queue_properties)CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE |
                               CL_QUEUE_ON_DEVICE |
                               CL_QUEUE_ON_DEVICE_DEFAULT, 0};
  
cl_command_queue my_device_q = clCreateCommandQueueWithProperties(CLU_CONTEXT, cluGetDevice(CL_DEVICE_TYPE_GPU), qprop, &status);

Sierpiński Carpet in OpenCL 2.0

__kernel void sierpinski(__global char* src, int width, int offsetx, int offsety)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    queue_t q = get_default_queue();

    int one_third = get_global_size(0) / 3;
    int two_thirds = 2 * one_third;

    if (x >= one_third && x < two_thirds &&
        y >= one_third && y < two_thirds)
    {
        src[(y+offsety)*width+(x+offsetx)] = BLACK;
    }
    else
    {
        src[(y+offsety)*width+(x+offsetx)] = WHITE;

        if (one_third > 1 && x % one_third == 0 && y % one_third == 0)
        {
            const size_t  grid[2] = {one_third, one_third};
            enqueue_kernel(q, 0, ndrange_2D(grid), ^{ sierpinski (src, width, x+offsetx, y+offsety); });
        }
    }
}

Download the full source code of the sample below.

About the Author

Robert Ioffe is a Technical Consulting Engineer at Intel’s Software and Solutions Group. He is an expert in OpenCL programming and OpenCL workload optimization on Intel Iris and Intel Iris Pro Graphics with deep knowledge of Intel Graphics Hardware. He was heavily involved in Khronos standards work, focusing on prototyping the latest features and making sure they can run well on Intel architecture. Most recently he has been working on prototyping Nested Parallelism (enqueue_kernel functions) feature of OpenCL 2.0 and wrote a number of samples that demonstrate Nested Parallelism functionality, including GPU-Quicksort for OpenCL 2.0. He also recorded and released two Optimizing Simple OpenCL Kernels videos and is in the process of recording a third video on Nested Parallelism.

You might also be interested in the following:

GPU-Quicksort in OpenCL 2.0: Nested Parallelism and Work-Group Scan Functions

Optimizing Simple OpenCL Kernels: Modulate Kernel Optimization

Optimizing Simple OpenCL Kernels: Sobel Kernel Optimization

 

Using Intel® VTune™ Amplifier XE to Tune Software on the Intel® Xeon® Processor E5 v3 Family

$
0
0

Download this guide (see Article Attachments, below) to learn how to identify performance issues on software running on the Intel® Xeon® Processor E5 v3 Family (based on Intel® Microarchitecture Codename Haswell). The guide explains the General Exploration Analysis viewpoint available in Intel® VTune™ Amplifier XE. It also walks through some of the most common performance issues that the VTune Amplifier XE interface highlights, what each issue means, and some suggested ways to fix them.

For other tuning guides, please visit our Processor-specific Performance Analysis web page.

Intel® Concurrent Collections for C++ for Windows* and Linux*

$
0
0

Parallelism Without the Pain

Why CnC?

CnC makes it easy to write C++ programs which take full advantage of the available parallelism. Whether run on multicore systems, Xeon Phi™ or clusters CnC will seamlessly exploit the performance potential of your hardware. Through its portability and composability (with itself and other tools) it provides future-proof scalability.

Intel® Concurrent Collections for C++

Intel® Concurrent Collections for C++ is a C++ template library for letting C++ programmers implement CnC applications which run in parallel on shared and distributed memory. Intel(R) Concurrent Collections for C++ is now also available as open source from github

Primary features

Easy parallelism

  • There is no need to think about lower level parallelization techniques like threading primitives or message passing; no need to understand pthreads, MPI, Windows threads, TBB,...
  • There is no need to think about different types of parallelism such as task, pipeline, fork-join, task or data parallelism.
  • Intel® Concurrent Collections for C++ provides a separation of concerns between what the application means and how to tune it for a specific platform. The application code can be paired with isolated tuning code. This allows programmers to focus on each separately.

 


CnC yields quasi-linear scaling in these example applications

 


CnC yields quasi-linear scaling on thousands of cores in RTM-3dfd

 

CnC makes tuning a separate ingredient
CnC makes tuning a separate ingredient

 

 

Portability

  • The same source runs on Windows and Linux.
  • The same binary runs on shared memory multi-core systems and clusters of workstations. In fact, Intel® Concurrent Collections for C++ is a unified model for shared and distributed memory systems (as opposed to the MPI / OpenMP combination, for example).

Efficiency

  • Because Intel® Concurrent Collections for C++ provides a way to express an algorithm with minimal scheduling constraints, it is very efficient
  • In addition, Intel® Concurrent Collections for C++ supports two types of tuning:
    • Runtime tuning makes the runtime more efficient for a specific application.
    • Application tuning makes the application itself more efficient with user-specified distribution of the work.

Scalability

  • Intel® Concurrent Collections for C++ achieves scalable performance on a wide range of configurations from small multicore systems to large clusters.
  • No need to re-write or re-compile application in order to target a new configuration.

The following downloads are available under the BSD license. The current version is 1.0.100.

Including required TBB bits
Choose one of these if in doubt
Linux* 64bit  
Windows* 64bitWindows* 32bit (old ver. 1.0.002)

Without TBB bits
Requires existing TBB >= 4.2 Update 3
Windows* 64bit Windows* 32bit (old ver. 1.0.002)

The Idea

The major goal of CnC (Concurrent Collections) is a productive path to efficient parallel execution. And yet a CnC program does not indicate what runs in parallel. Instead, it explicitly identifies what precludes parallel execution. There are exactly two reasons that computations cannot execute in parallel. If one computation produces data that the other one consumes, the producer must execute before the consumer. If one computation determines if another will execute, the controller must execute before the controllee. CnC is a data and control flow model together with tuple-space influence. However, it is closer in philosophy to the PDG (Program Dependence Graph) intermediate form than to other parallel programming models. Its high-level abstractions allow flexible and efficient mapping of a CnC program to the target platform. By this it simplifies parallelism and at the same time let's you exploit the full parallel potential of your application.

What's new in version 1.0?

  • Support for re-use
    • Write CnC graphs and use them in other CnC programs (multiple times)
    • Embed non-CnC functionality in a CnC application
  • Reductions
  • Join/Cross
  • Combine MPI programs with CnC code
  • More example codes
    • Attaching to databases (mysql*)
    • Showcasing reductions, mapreduce, join, CnC-SPMD/MPI combo and more
  • Intel(R) Concurrent Collections for C++ is now available as open source from github.
  • Bug fixes etc.
  • Added support for Visual Studio* 2013, dropped support for Visual Studio* 2008 (Microsoft Windows* only)
  • With Update 1.0.100:
    • Improved tracing output: put()s, get()s and reporting never-available items linked to calling step
    • Added warning messages about put()s before get()s in step executions (unless -DNDEBUG)
    • New CnC programs/samples: raytracer and jacobi2d-heureka
    • Improved startup process for SOCKETS on distributed memory
    • Dropped binary releases for ia32 and Visual Studio* 2010 (Microsoft Windows*)
    • Source structure improvements on github

See also the Release Notes.

Docs about Intel® Concurrent Collections for C++

Tutorial
API Documentation
Release Notes
Getting Started
FAQ

CnC Papers and Related Links

The Concurrent Collections Programming Model
Parallel Programming For Distributed Memory Without The Pain
Performance Evaluation of Concurrent Collections on High-Performance Multicore Computing Systems
Measuring the Overhead of Intel C++ CnC over Gauss-Jordan Elimination
Segmentation for the Brain Connectome using a Multi-Scale Parallel Computing Architecture
Cluster Computing using Intel Concurrent Collections 
Habanero Concurrent Collections Project at Rice University

Discussions, Report Problems or Leave Feedback

To report a problem or request a feature, please use the issue tracker on github: https://github.com/icnc/icnc/issues

For questions and feedback on this product you can also visit the "Whatif Alpha Forum" to participate in forum discussions about Intel® Concurrent Collections: http://software.intel.com/en-us/forums/intel-concurrent-collections-for-cc/

To stay in touch with the Intel® Concurrent Collections team and the community, we provide a new email-list you can subscribe to or just watch online:
http://tech.groups.yahoo.com/group/intel_concurrent_collections/

Heterogeneous Computing Pipelining

$
0
0

Download PDF[PDF 657KB]

Content

1. Abstract

2. Intel® Core™ and Atom™ family Processors and Heterogeneous Processing Architecture

3. Introduction

4. Model

5. Memory requirements

6. Implementation – TBB+OpenMP+OpenCL

7. Performance results

8. Conclusion

9. About the author

10. References

1. Abstract

Most image and video effects consist of two or more stages. Software developers often execute them sequentially, one by one. The easy way to performance tune for such algorithms is to optimize each stage separately using vectorization, multi-threading and/or graphics offload. But, processing of the stages still happens sequentially. The Intel® processor provides multiple opportunities to offload and process the algorithms in parallel for the best performance with optimum power usage. Developers can always optimize an algorithm for one device (e.g. CPU or Intel® Processor Graphics), however they ignore significant performance increases by not using other processing components on the chip. Things get more complex if the application distributes the workload among different devices and while processing in parallel. In this article we will review an efficient method to use all available resources on Intel® CPUs via software pipelining CPU and OpenCL code.

2. Intel® Core™ and Atom™ Family Processors And Heterogeneous Processing Architecture

Intel® Core™ and Atom™ processor families provide system-on-a-chip (SoC) solutions that use multiple resources for heterogeneous compute and media processing. Figure 1 shows Intel® Core™ M Processor silicon die layout.

silicon die layout
Figure 1. Silicon die layout for an Intel® Core™ M Processor. This SoC contains 2 CPU cores, outlined in orange dashed boxes. Outlined in the blue dashed box, is Intel® HD Graphics 5300. It is a one slice instantiation of Intel® Processor Graphics [7].

The Intel® microprocessors are complex SoCs integrating multiple CPU Cores, Intel® Processor Graphics, and other fixed functions all on a single shared silicon die. The architecture implements numerous unique clock domains, including a per-CPU core clock domain, a processor graphics clock domain, and a ring interconnect clock domain. The SoC architecture is designed to be extensible, and yet still enable efficient wire routing between components.Intel® Core™ M Processor SoC
Figure 2. An Intel® Core™ M Processor SoC and its ring interconnect architecture [7].

The 2nd Generation Intel® Core™ processor expanded heterogeneous media processing with the introduction of Intel® Quick Sync Video (Intel® QSV) through Intel® Media SDK [9]. These kinds of innovations continued with the 3rd Generation Intel® Core™ processor and the support of OpenCL™ 1.2. It allowed applications to do heterogeneous computing on Intel® Processor Graphics [7]. The new Intel® Core M processors added more flexibility and programming ease with OpenCL™ 2.0 support. Intel® Core™ processors already supported shared physical memory so applications shared data between the CPU and Intel® processor graphics, but OpenCL™ 2.0 shared virtual memory support allows applications to share the data structures seamlessly between two devices.

3. Introduction

Intel® processor architecture provides a cohesive environment for applications by sharing resources (e.g. caches) between the CPU and processor graphics. We will to take advantage of these features through software pipelining.

Let’s review a general algorithm execution model consisting of N stages.

general algorithm model
Figure 3. General algorithm execution model.

Let Ti be the time to execute i-th stage, then the total execution time might be estimated as ∑ Ti.

The straightforward way to optimize is to performance tune each stage individually via scalar code tuning, multi-threading, vectorization [1] or offloading to processor graphics (OpenCL etc.). This approach could be a first step, but might limit performance opportunities. For example, optimizing one out of five stages to be twice as fast might yield only ~10% overall performance increase if all stages take about the same time.

One can achieve parallelization by running multiple effects at the same time, as explained in “Expressing Pipeline Parallelism using TBB constructs” by Eric Reed, Nicholas Chen and Ralph Johnson [2]. Unfortunately, it is applicable only when there are multiple input data (frames) available at the same time. It may not be used, for example, in photo editing. The method might also underutilize available resources, and add additional memory bandwidth requirements. This method is still worth consideration and evaluation as it might be useful for many kinds of video processing algorithms.

We will show you a way to perform parallel stage execution on both devices, which can improve performance, data locality, and uses all available resources effectively. Both methods discussed can be used at the same time.

4. Model

To simplify the discussion, we will review the short version of an effect with only two stages

two stage algorithm
Figure 4. Algorithm with two stages.

As we want to distribute the execution between CPU and GPU we need to implement good and optimized versions of Stage 1 and Stage 2 algorithms. Suppose Stage 1 is optimized for CPU and Stage 2 for GPU. In order to successfully implement pipelined model one must be able to implement efficient tiled version of the algorithm as shown on the Figure 5.

tiled version algorithm
Figure 5. Tiled version of the algorithm.

Key to this method is to execute Stage 1 on the CPU for one tile at a time while finishing the previous tile on the GPU, as shown on Figure 6. Store Stage 1’s output data in an intermediate buffer, which will be used on the next iteration to produce the output on GPU using Stage 2 algorithm. Wait for both CPU and GPU to finish their tasks and repeat the process. There will be two pipelines executing simultaneously, therefore we need two intermediate buffers to pass the data between the stages, which can be allocated once and rotated appropriately.

pipeline execution

Let t1 and t2 be the execution time for Stage 1 and Stage 2, so the total execution time will be T = t1 and t1. . If we ignore threading and synchronization overhead, we can estimate that the wall time is T=max(t1,t2)+(t1 + t2)/N, where N is the number of tiles. Understand that limN→∞T = max(t1,t2), so theoretically we can hide one of the effects execution, and reduce the wall time by a factor of
t1 and t2.

Unfortunately, it is not always possible to execute a “perfect” parallelization because synchronization and data transfer penalties grow with the additional number of tiles.

5. Memory Requirements

As stated in the beginning, we will show how proposed algorithm may reduce memory requirements as well as improve data locality. Let S represent the input data (an image buffer, for example) and assume that we need S space for intermediate data and the same amount of memory for output data, so the total working set size estimate is 3S. In the proposed model we only need two intermediate buffers of S/N size, therefore the total memory requirements will be M=(2+2⁄N)S, which is < 3S for all N> 2. Thus we might expect total memory requirements to drop by a factor of
original algorithm formula We would also like to note that lim formula.

Now count the number of tile buffers in the flight at a time - 2 * Input + 2 * Itermidiate + 1 * Output in the case when Stage2 uses the input data as well. Then the working set size would be m=5S/N vs. 2S for a baseline implementation, which is 1.6 times less for N = 4 and more than 6 times less for N = 16.

On the other side, as mentioned earlier, improved data locality and reduced working set size might also improve performance. One of the interesting cases would be when we choose the number of tiles in such a way that m is equal or less than last level cache (LLC) size to avoid additional latency on the data transfer between the devices.

6. Implementation – TBB+OpenMP+OpenCL

We implemented one of the photo editing effects with two stages using this methodology. We used all of the optimizations, from scalar optimization and vectorization, up to parallel computing using OpenMP and OpenCL.

First, we implemented “tiled” versions of both Stage1 (CPU/AVX2) & Stage2 (OpenCL) algorithms, as shown of Fig. 5, and used OpenMP to improve Stage1 performance and utilize all available CPU cores.

To apply software pipelining and parallel execution, we used Intel TBB [5], which is available open source and/or together with Intel the C/C++ compiler. To use tbb::pipeline for software pipelining, define stage classes (tbb::filter) and token structure. Token is what will be transferred between the stages and includes the tile information and other intermediate data. TBB manages token transfer to the next stage as (void*) operator input parameter.

In order to get the best performance, use zero-copy OpenCL capabilities [6] available in Intel® Graphics architecture or Shared Virtual Memory (SVM) available in OpenCL 2.0 stack and supported by Intel® Graphics starting from Broadwell.

struct token_t {
	void *buffer; 	// intermediate buffer
	int tile_id;		// Tile number
};

class Stage1: public tbb::filter {
public:
	void* operator() (void *) {
		if(tile_id == num_tiles)
			return null;			// tells TBB to stop pipeline execution

		// Create token and select preallocated buffer
		token_t *t=new token_t();
		t->tile_id = tile_id++;
		t->buffer = buffers[tile_id % 2]; // select one of the buffers

		DoStage1_CPU(t, input);		// Process tile on CPU

		return t;	// Transfer intermediate data to the next stage
	}
};


class Stage2: public tbb::filter {
public:
	void* operator(void* token) {
		token_t *t = (token_t*)token;	// Receive data from the previous stage

		DoStage2_OpenCL(output, t);	// Do second stage on the Processor Gfx
							// and write output data

		delete t;				// Destroy token
		return 0;				// Finish pipeline
	}
}


After that, pipeline (tbb::pipeline [4]) creation and execution looks pretty simple.
// Create and initialize stages
Stage1 s1(input, W, H);
Stage2 s2(input, output, W,H);

// Create pipeline
tbb::pipeline ppln;
// Add filters(stages)
ppln.add_filter(s1);
ppln.add_filter(s2);
// Run 2 pipelines
// 	One will be executed on the CPU,
//     while the other one on Processor Gfx
//     and vice versa on the next “step”
ppln.run(2);

 

7. Performance Results

Ideally we want to have an unlimited number of tiles, but multi-threading and synchronization overhead might reduce the algorithm efficiency.

We’ve done experiments on Haswell i7-4702HQ @ 2.2GHz with numerous tiles and got up to 1.7x overall performance improvements over serial implementation. We also found out that four to eight tiles are optimal for our algorithm on this particular platform.

effect execution time
Figure 7. Effect execution time.

8. Conclusion

In this article we illustrated an effective algorithm to that distributed workloads between available computing resources, and gained up to 1.7x better performance by running the algorithm on both CPU and processor graphics and the same time. This approach can be extended to use other devices or features available on Intel® processors. One of most and very popular hardware feature in Intel Processor Graphics is Intel® Quick Sync Video (QSV) [9]. Applications can use same approach to divide the preprocessing stages and then feed this super-fast fixed function blocks to encode the frames.

 

9. About the Author

Ilya Albrekht

Ilya Albrekht been working for Intel for 5 years as an Application Engineer and has strong skills in algorithms optimization, modern CPU microarchitecture and OpenCL. He has Master’s degree in Mathematics and Information Security. He loves to explore natural wonders in Arizona and California with his wife.

 

 

10. References

  1. “Intel Vectorization Tools”, https://software.intel.com/en-us/intel-vectorization-tools
  2. “Expressing Pipeline Parallelism using TBB constructs”, Eric Reed, Nicholas Chen, Ralpf Johnson https://www.ideals.illinois.edu/bitstream/handle/2142/25908/reedchenjohnson.pdf?sequence=2
  3. “OpenCL™ Drivers and Runtimes for Intel® Architecture”, https://software.intel.com/en-us/articles/opencl-drivers
  4. “pipeline Class”,  http://www.threadingbuildingblocks.org/docs/help/reference/algorithms/pipeline_cls.htm
  5. “Threading Building Blocks (Intel® TBB)”, https://www.threadingbuildingblocks.org/
  6. “OpenCL Zero-Copy with OpenCL 1.2”, OpenCL optimization guide, https://software.intel.com/en-us/node/515456
  7. “Intel® Processor Graphics”, https://software.intel.com/en-us/articles/intel-graphics-developers-guides.
  8. “Intel® SDK for OpenCL™ Applications”, https://software.intel.com/en-us/intel-opencl
  9. “Intel® Media SDK”, https://software.intel.com/en-us/vcsource/tools/media-sdk-clients

OpenCV 3.0.0-beta ( IPP & TBB enabled ) on Yocto with Intel® Edison

$
0
0

< Overview >

 This article is a tutorial for setting up OpenCV 3.0.0-beta on Yocto with Intel® Edison. We will build OpenCV 3.0.0-beta on Edison Breakout/Expansion Board using a Linux host machine and it takes up a lot of space on Edison, therefore, it is required to have at least 4GB micro SD Card as an extended storage for your Edison Breakout/Expansion Board.

1. Prepare the image for your Edison

  prepare your standard ( or customized with additional packages ) Edison image following Board Support Package and Startup Guide. You can use the original image or customize the image with your desired additional packages.

2. Enabling UVC ( USB Video device Class ) by customizing the Linux Kernel ( optinal )

  For those who want to use an USB camera module such as a simple webcam, we need to enable UVC in the Linux Kernel configuration. If you are done building your own image ( if you did 'bitbake edison-image') , now you are ready to customize your Linux Kernel. Type

   ~/edison-src> bitbake virtual/kernel -c menuconfig

and then find and enable Device Drivers -> Multimedia support -> Media USB Adapters. When configuration is completed, replace defconfig with .config that you just modified. Type

  ~/edison-src> cp /build/tmp/work/edison-poky-linux/linuxyocto/3.10.17+gitAUTOINC+6ad20f049a_c03195ed6e-r0/linux-edison-standardbuild/.config build/tmp/work/edison-poky-linux/linuxyocto/3.10.17+gitAUTOINC+6ad20f049a_c03195ed6er0/defconfig

Finally, we bitbake again, type

  ~/edison-src> bitbake virtual/kernel -c configure -f -v


  ~/edison-src> bitbake edison-image

 

3. Changing Partition ( recommended )

 

 It is recommended to change the original partition option because the partition of the root file system only is originally 512 MB which will be mostly consumed after flasing an image. '/home' takes whatever much space remained after assining specified partitions and it is more than 2GB. Therefore, making those two partitions even which can be about 1.3GB would give your Edison more flexibility to do more work after flashing.

 Look into 'edison-src/device-software/meta-edison-distro/recipes-bsp/u-boot/files/edison.env' file, change Rootfs’ size from 512MB to 1312MB. As results, /home size will shrink automatically. One more change needs to be made before we flash the image again. The second part of this is the rootfs image size itself, which is set in the 'edison-src/device-software/meta-edison-distro/recipes-core/images/edison-image.bb' and is also 512MB. Change the size of rootfs again then re-build the image again, by typing 'bitbake edison-image' .

 After 'bitbake' is done, type

 ~/edison-src> /device-software/utils/flash/postBuild.sh

 and check if you have 'dfu-util'. If you do not , then install it by typing

 ~/edison-src> sudo apt-get install dfu-util

now we need to flash twice to apply the new partition settings. First complete

 ~/edison-src> /build/toFlash/flashall.sh --recovery

and then complete flashing without '--recovery' ,

 ~/edison-src> /build/toFlash/flashall.sh

After your Edison boots up successfully, connect a USB cable to Edison's serial terminal port. Check what number of USB device your Edison gets detected on your host Linux and connect Edison through 'screen' ( put the corresponding number instead of 'X' ex) ttyUSBX -> ttyUSB0 )

> sudo screen /dev/ttyUSBX 115200

If you see your Edison successfully boots up , login as 'root' and check the space by typing

 root@edison:~# df -h

 

4. Setup root password and WiFi for ssh and FTP

 

 After your host Linux is connected with Edison through the serial port, type

 root@edison:~# configure_edison --setup

 and follow the instruction, you will have no trouble setting up your password and WiFi. If you use no password account, 'ssh' may refuse access requiests from outside of Edison.

Connect your host Linux machine and your Edison to the same wireless access point and enable FTP using any FTP program or command as you want.

5. Install CMake

 

 Edison image does not come with 'cmake' and we need it to build OpenCV. Therefore we need to manually install it. One of many ways to do it, it is fairly recommended to use 'opkg'. There is a repository established by one of the users out there, which can be refered through this link.

 In the link, AlexT instroduces a way to connect his repo through 'opkg'. To configure your Edison to fetch packages from the repo, replace anything you have in /etc/opkg/base-feeds.conf with the following (other opkg config files don't need any changes):

===/etc/opkg/base-feeds.conf contents below===
src/gz all http://repo.opkg.net/edison/repo/all
src/gz edison http://repo.opkg.net/edison/repo/edison
src/gz core2-32 http://repo.opkg.net/edison/repo/core2-32

===end of /etc/opkg/base-feeds.conf contents===

Now type 'opkg update' and

you should see the below output, which means you're successfully communicating with the repo:

root@edison:~# opkg update
Downloading http://repo.opkg.net/edison/repo/all/Packages.gz.
Inflating http://repo.opkg.net/edison/repo/all/Packages.gz.
Updated list of available packages in /var/lib/opkg/all.
Downloading http://repo.opkg.net/edison/repo/edison/Packages.gz.
Inflating http://repo.opkg.net/edison/repo/edison/Packages.gz.
Updated list of available packages in /var/lib/opkg/edison.
Downloading http://repo.opkg.net/edison/repo/core2-32/Packages.gz.
Inflating http://repo.opkg.net/edison/repo/core2-32/Packages.gz.
Updated list of available packages in /var/lib/opkg/core2-32.

Now you are ready to install CMake, type

root@edison:~# opkg install cmake-dev

 try 'cmake' if you can see the help page of it.

 

6. OpenCV 3.0.0-beta

 

 Before we jump in to OpenCV, we need a plenty of space for building OpenCV on Edison. Therefore, it is required to have an external storage such as a micro SD Card. We will format the micro SD Card and will mount it on Edison.

 Insert the card to your Linux host machine and type ( block_device ex-> /dev/mmcblk1 )

 > mkfs.ext4 block_device

or

  > mke4fs -t ext4 block_device

 now we lable the partition using

  > e4label <block_device> new_label

 insert the SD Card to your Edison and mount it.

 root@edison:~# mkdir <Desired DIR>
 root@edison:~# mount block_device <Desired DIR > 

check if it is mounted without problem by typing 'df -h'

  For the future use, it is convinient that you configure 'auto mount'. Add '/dev/block_device <Desired DIR> ' to /etc/fstab . For example type

 root@edison:~# vi /etc/fstab

and add '/dev/mmcblk1  /home/ext'

  Go to OpenCV Official Page and download OpenCV for Linux 3.0 BETA on your host Linux machine. When download is done, copy the zip file to your Edison through FTP. It is recommended to use the external SD card space for OpenCV. When it is built, it uses up more than 1 GB.

 Unzip the downloaded file by typing 'unzip opencv-3.0.0-beta.zip' and check if your opencv folder is created.

 go to <OpenCV DIR> and type 'cmake .' and take a look what kind of options are there.

 We will enable IPP and TBB for better performance. The library to enable IPP will be downloaded automatically when the flag is turned on but unfortunately, the library for TBB needs to be installed manually. Therefore, install TBB package on your host machine and copy the corresponding files to Edison. If your host Linux is 64bit you need to specify i386 when apt-get it. On your host machine, type

> sudo apt-get install libtbb-dev:i386

 and copy all files in /usr/include/tbb to Edison's same named folder, and /usr/lib/libtbb.so, /usr/lib/libtbbmalloc.so, /usr/lib/libtbbmalloc_proxy.so, and /usr/lib/pkgconfig/tbb.pc need to be copied to Edison also. Please refer this page to see the full file list of tbb library.

 Now, on Edison, go to <OpenCV DIR> and type ( do not forget '.' at the end of the command line )

 root@edison:<OpenCV DIR># cmake -D WITH_IPP=ON -D WITH_TBB=ON -D WITH_CUDA=OFF -D WITH_OPENCL=OFF -D BUILD_SHARED_LIBS=OFF -D BUILD_PERF_TESTS=OFF -D BUILD_TESTS=OFF .

 which turns on IPP & TBB flags and turns off irrelevant features to make it simple. With 'BUILD_SHARED_LIBS=OFF' , your Edison will make the executables able to run without OpenCV installed in case of distribution.

 In the configuration result, you should see IPP and TBB are enabled.

If you observe no problems, then type

 root@edison:<OpenCV DIR># make -j2

 It will take a while to complete the building. ( 30mins ~ 1hour )

 When building is done, install what is made by typing

 root@edison:<OpenCV DIR># make install

 

7. Making applications with OpenCV 3.0.0-beta

 

 The easiest way to make a simple OpenCV application is using the sample came along with the package. Go to '<OpenCV DIR>/samples' and type

 root@edison:<OpenCV DIR>/samples# cmake .

 then it will configure and get ready to compile and link the samples. Now you can replace one of the sample code file in 'samples/cpp' and build it using cmake. For example, we can replace 'facedetect.cpp'  with our own code. Now at '<OpenCV DIR>/samples' type

 root@edison:<OpenCV DIR>/samples# make example_facedetect

 then it will automatically get the building done and output file will be placed in 'samples/cpp'

 

 

 

 

 

 

 


Analyzing Intel® SDE's TSX-related log data for capacity aborts

$
0
0

Starting with version 7.12.0, Intel® SDE has Intel® TSX-related instruction and memory access logging features which can be useful for debugging Intel® TSX's capacity aborts. With the log data from the Intel SDE you can diagnose cache set population to determine if there is non-uniform cache set usage causing capacity overflows. A refined log data may be used to further diagnose the source of the aborts. Since the log file may be huge to navigate and diagnose without refining, here, a simple Python script is presented to help analyze Intel TSX-related log files to help root-cause sources of capacity aborts.

TxSDELogAnalyzer.py is a simple program which focuses on capacity-aborted transactions. It uses the SDE's log data which can be collected using the set of parameters as below.

>$ sde -tsx -hle_enabled 1 -rtm_mode full -tsx_debug_log 3 -tsx_log_inst 1 -tsx_log_file 1

We will describe a few features this simple script has and why it might be important to use them to aid in debugging transaction aborts due to capacity overflows. In a typical scenario, you may do the following to find the source of those aborts.

(i)  Look at the TSX read/write set sizes and cache set usage distribution

(ii) Take a random sample of aborted transactions and closely examine them

(ii) In conjunction with case (ii) you may need to look at various log data related to capacity aborted transactions, as well as committed transactions to compare how they differ, hoping that you may spot a difference that could lead to the source of the aborts.

Cache set usage distribution

You might need to see how the cache sets are populated in an aborted transaction to see if there are outliers or a non-uniformity which could be causing premature overflows. Then you can locate the data structure in this transaction and enhance it such that it uniformly uses the available cache sets -- thus avoiding capacity overflows (see CIA allocator as a possible mitigation). Under normal circumstances you may need to compare cache access distributions between committed and aborted transactions to see how they differ by plotting a histogram or any suitable graph with both cases.

TxSDELogAnalyzer.py calculates the average cache population among all transactions of the same type from the input log file. It outputs the final result in CSV format, making it easy to plot a graph in an Excel sheet for example. For average distribution of cache lines access the following command structure is used.

>$ TxSDELogAnalyzer.py -D <type_num> [-o <output_file>] <input_log_file>

Where type_num specifies the type of transactions data to generate the distribution from. type_num can either be 2 for capacity-aborted transactions or 8 for committed transactions. This command calculates unique cache set accesses for each transaction data of type type_num and finally calculates the average of these cache access distributions.

A sample output from this command from capacity aborted transactions log data is given below.

Average cache set population for capacity aborts

Cache set #, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63
Avg. population, 5.00, 1.00, 0.00, 0.00, 1.00, 0.00, 1.00, 2.00, 3.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 1.00, 2.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 2.00, 2.00, 0.00, 2.00, 1.00, 0.00, 0.00, 2.00, 1.00, 0.00, 1.00, 2.00, 0.00, 1.00, 3.00, 0.00, 2.00, 2.00, 2.00, 2.00, 3.00, 2.00, 3.00, 2.00, 4.00, 1.00, 1.00, 0.00, 1.00, 1.00, 3.00

Average cache set population

Fig.1: A graphical representation of the sample output. It shows how many cache lines are used in each of 64 cache sets of a data cache.

To easily interpret the graph in Fig.1 we need to understand the Intel® SDE's "-tsx_cache_set_size " parameter whose default value is 8 but 4 was used to generate the sample output above.

Intel's L1 data cache is 8-way associative that means every set of the data cache has 8 ways. The two logical cores share the L1 cache dynamically (Intel(r) Hyper-Threading technology is enabled). To approximate this behavior we can statically limit the number of ways per thread to 4 using Intel SDE (but this assumes that the usage of ways is equal).

If  "-tsx_cache_set_size " is set to the default value which is 8, it means hyper-threading is not "emulated" as all the ways in a cache set are used by the (only) single hardware thread in a core. To emulate Hyper-Threading, at least at data cache level, we set  "-tsx_cache_set_size " to 4. Thus only a half of the ways in each data cache set are used up. With the Hyper-Threading emulation a maximum of 4 ways per cache set should be uniquely modified in TSX-transaction execution. Otherwise a capacity overflow occurs aborting the transaction. From Figure 1, we see that at cache set #0 there is an overflow (5 unique way accesses instead of max 4) and this could lead to a capacity abort.

Transactions data filtering

Intel® SDE already has knobs to start/stop logging thus limiting the log output size. This is normally done by adding the following command options to the SDE command you use to run your test

-control start:interactive:bcast,stop:interactive:bcast -interactive_file tmp

However, you may sometimes need log data with longer execution coverage; hence a huge log file will be produced. Limiting the log output size may therefore not be enough and you many need filtering options.

Moreover, your log data may contain data for various abort reasons and the commits, too. Even the log information of thread operations outside a transaction executions may be included in the SDE log. Therefore, it may be hard to analyze a huge log file with a mixture of information. Moreover, you may be interested in only logs of certain types -- e.g; capacity aborted transactions -- only. TxSDELogAnalyzer.py helps you filter out the log data and presents you only the data for a specific abort reason, or a commit. The command format to do exactly this is shown below.

>$ TxSDELogAnalyzer.py -t <type_num> [-o <output_file>] <input_log_file>

type_num can be 1 (all log data of both committed and aborted transactions), 2 (for capacity aborted transactions only) or 8 (for committed transactions only),

# The command below outputs only capacity-abort related transaction logs>$ TxSDELogAnalyzer.py -t 2 sde-tsx-out.txt

# The following command line writes only commit-related transaction logs to a file named log_data_for_commits_data.txt
>$ TxSDELogAnalyzer.py -t 8 -o log_data_for_commits_data.txt sde-tsx-out.txt

Sampling transactions

Since you cannot analyse millions of transactions to identify code path patterns in a limited time you may want to extract log data for a few typical transactions out of millions. TxSDELogAnalyzer.py does exactly that. It randomly selects  num random transactions. It supplements these data with cache set accesses giving a full overview of the cache set population at each operation within a given transaction. This features works in conjunction with "Transactions data filtering" described above.

An example of commands for this feature is as follows.

>$ TxSDELogAnalyzer.py -t 8 -r <num> <input_log_file> # for commits, num random commits
>$ TxSDELogAnalyzer.py -t 2 -r <num> <input_log_file> # for capacity aborts, num random capacity aborts

A sample output from these commands is as below:

Transaction #0
                  OPERATION                      ; CACHE LINE ADDRESS; CACHE SET #; NEW CACHE LINE; CACHE SET POPULATION;    FUNCTION              ;  MANGLED FUNCTION         ; LIBRARY                 ; SOURCE
[ 1 1] @0x400b66 write access to 0x7f7bc2a39e7c:4;     0x7f7bc2a39e40;          57;           TRUE;                    1; tm_begin()               ; _Z8tm_beginv              ; /ABank/aBank:0x000000bee; /ABank/hle_lock.h 45
[ 1 1] @0x400b69 read access to 0x7f7bc2a39e7c:4 ;     0x7f7bc2a39e40;          57;          FALSE;                    1; tm_begin()               ; _Z8tm_beginv              ; /ABank/aBank:0x000000bee; /ABank/hle_lock.h 45
[ 1 1] @0x400c29 read access to 0x7f7bc2a39eac:4 ;     0x7f7bc2a39e80;          58;          FALSE;                    1; paySalaries(void*)       ; _Z11paySalariesPv         ; /ABank/aBank:0x000000c29; /ABank/SalaryBatch.cpp 78
[ 1 1] @0x400a36 write access to 0x7f7bc2a39e80:8;     0x7f7bc2a39e80;          58;          FALSE;                    1; myBank::paySalary(int)   ; _ZN6myBank9paySalaryEi    ; /ABank/aBank:0x000000a36; /ABank/SalaryBatch.cpp 62
[ 1 1] @0x4008ca read access to 0x7f7bc2a39e1c:4 ;     0x7f7bc2a39e00;          56;          FALSE;                    1; myBank::addToBalance(int); _ZN6myBank12addToBalanceEi; /ABank/aBank:0x00000082c; /ABank/SalaryBatch.cpp 30
. . .
[ 1 1] @0x4009b0 write access to 0x605dac:4      ;           0x605d80;          54;          FALSE;                    4; myBank::addToBalance(int); _ZN6myBank12addToBalanceEi; /ABank/aBank:0x00000082c; /ABank/SalaryBatch.cpp 30
[ 1 1] @0x4009b7 read access to 0x605db0:4       ;           0x605d80;          54;          FALSE;                    4; myBank::addToBalance(int); _ZN6myBank12addToBalanceEi; /ABank/aBank:0x00000082c; /ABank/SalaryBatch.cpp 30
[ 1 1] @0x400a4b read access to 0x7f7bc2a39e7c:4 ;     0x7f7bc2a39e40;          57;          FALSE;                    4; myBank::paySalary(int)   ; _ZN6myBank9paySalaryEi    ; /ABank/aBank:0x000000a53; /ABank/SalaryBatch.cpp 65
[ 1 1] @0x400837 write access to 0x7f7bc2a39e04:4;     0x7f7bc2a39e00;          56;          FALSE;                    4; myBank::addToBalance(int); _ZN6myBank12addToBalanceEi; /ABank/aBank:0x00000082c; /ABank/SalaryBatch.cpp 30
[ 1 1] @0x400a57 read access to 0x7f7bc2a39e7c:4 ;     0x7f7bc2a39e40;          57;          FALSE;                    4; myBank::paySalary(int)   ; _ZN6myBank9paySalaryEi    ; /ABank/aBank:0x000000a53; /ABank/SalaryBatch.cpp 65
[ 1 1] @0x400a4e write access to 0x7f7bc2a39e60:8;     0x7f7bc2a39e40;          57;          FALSE;                    4; myBank::paySalary(int)   ; _ZN6myBank9paySalaryEi    ; /ABank/aBank:0x000000a53; /ABank/SalaryBatch.cpp 65
[ 1 1] @0x40098f write access to 0x7f7bc2a39e48:8;     0x7f7bc2a39e40;          57;          FALSE;                    4; myBank::addToBalance(int); _ZN6myBank12addToBalanceEi; /ABank/aBank:0x00000082c; /ABank/SalaryBatch.cpp 30
[ 1 1] @0x400993 read access to 0x7f7bc2a39e28:8 ;     0x7f7bc2a39e00;          56;          FALSE;                    4; myBank::addToBalance(int); _ZN6myBank12addToBalanceEi; /ABank/aBank:0x00000082c; /ABank/SalaryBatch.cpp 30
[ 1 1] @0x400997 read access to 0x605dd0:4       ;           0x605dc0;          55;          FALSE;                    4; myBank::addToBalance(int); _ZN6myBank12addToBalanceEi; /ABank/aBank:0x00000082c; /ABank/SalaryBatch.cpp 30
[ 1 1] @0x400a35 read access to 0x7f7bc2a39e60:8 ;     0x7f7bc2a39e40;          57;          FALSE;                    4; myBank::addToBalance(int); _ZN6myBank12addToBalanceEi; /ABank/aBank:0x00000082c; /ABank/SalaryBatch.cpp 30
[ 1 1] @0x400a53 write access to 0x7f7bc2a39e7c:4;     0x7f7bc2a39e40;          57;          FALSE;                    4; myBank::paySalary(int)   ; _ZN6myBank9paySalaryEi    ; /ABank/aBank:0x000000a53; /ABank/SalaryBatch.cpp 65
[ 1 1] @0x400a57 read access to 0x7f7bc2a39e7c:4 ;     0x7f7bc2a39e40;          57;          FALSE;                    4; myBank::paySalary(int)   ; _ZN6myBank9paySalaryEi    ; /ABank/aBank:0x000000a53; /ABank/SalaryBatch.cpp 65
[ 1 1] @0x400a4b read access to 0x7f7bc2a39e7c:4 ;     0x7f7bc2a39e40;          57;          FALSE;                    4; myBank::paySalary(int)   ; _ZN6myBank9paySalaryEi    ; /ABank/aBank:0x000000a53; /ABank/SalaryBatch.cpp 65
[ 1 1] @0x400980 read access to 0x7f7bc2a39e24:4 ;     0x7f7bc2a39e00;          56;          FALSE;                    4; myBank::addToBalance(int); _ZN6myBank12addToBalanceEi; /ABank/aBank:0x00000082c; /ABank/SalaryBatch.cpp 30
[ 1 1] @0x40098f write access to 0x7f7bc2a39e48:8;     0x7f7bc2a39e40;          57;          FALSE;                    4; myBank::addToBalance(int); _ZN6myBank12addToBalanceEi; /ABank/aBank:0x00000082c; /ABank/SalaryBatch.cpp 30
[ 1 1] @0x400993 read access to 0x7f7bc2a39e28:8 ;     0x7f7bc2a39e00;          56;          FALSE;                    4; myBank::addToBalance(int); _ZN6myBank12addToBalanceEi; /ABank/aBank:0x00000082c; /ABank/SalaryBatch.cpp 30
[ 1 1] @0x400997 read access to 0x605dd8:4       ;           0x605dc0;          55;          FALSE;                    4; myBank::addToBalance(int); _ZN6myBank12addToBalanceEi; /ABank/aBank:0x00000082c; /ABank/SalaryBatch.cpp 30
[ 1 1] @0x40099c read access to 0x7f7bc2a39e28:8 ;     0x7f7bc2a39e00;          56;          FALSE;                    4; myBank::addToBalance(int); _ZN6myBank12addToBalanceEi; /ABank/aBank:0x00000082c; /ABank/SalaryBatch.cpp 30
[ 1 1] @0x400a22 read access to 0x7f7bc2a39e48:8 ;     0x7f7bc2a39e40;          57;          FALSE;                    4; myBank::addToBalance(int); _ZN6myBank12addToBalanceEi; /ABank/aBank:0x00000082c; /ABank/SalaryBatch.cpp 30
[ 1 1] @0x400a26 read access to 0x605dfc:4       ;           0x605dc0;          55;          FALSE;                    4; myBank::addToBalance(int); _ZN6myBank12addToBalanceEi; /ABank/aBank:0x00000082c; /ABank/SalaryBatch.cpp 30
[ 1 1] @0x400a2c read access to 0x7f7bc2a39e48:8 ;     0x7f7bc2a39e40;          57;          FALSE;                    4; myBank::addToBalance(int); _ZN6myBank12addToBalanceEi; /ABank/aBank:0x00000082c; /ABank/SalaryBatch.cpp 30
[ 1 1] @0x400a30 write access to 0x605dfc:4      ;           0x605dc0;          55;          FALSE;                    4; myBank::addToBalance(int); _ZN6myBank12addToBalanceEi; /ABank/aBank:0x00000082c; /ABank/SalaryBatch.cpp 30
[ 1 1] @0x400a33 read access to 0x7f7bc2a39e50:8 ;     0x7f7bc2a39e40;          57;          FALSE;                    4; myBank::addToBalance(int); _ZN6myBank12addToBalanceEi; /ABank/aBank:0x00000082c; /ABank/SalaryBatch.cpp 30
[ 1 1] @0x400a34 read access to 0x7f7bc2a39e58:8 ;     0x7f7bc2a39e40;          57;          FALSE;                    4; myBank::addToBalance(int); _ZN6myBank12addToBalanceEi; /ABank/aBank:0x00000082c; /ABank/SalaryBatch.cpp 30
[ 1 1] @0x400a35 read access to 0x7f7bc2a39e60:8 ;     0x7f7bc2a39e40;          57;          FALSE;                    4; myBank::addToBalance(int); _ZN6myBank12addToBalanceEi; /ABank/aBank:0x00000082c; /ABank/SalaryBatch.cpp 30
[ 1 1] @0x400a53 write access to 0x7f7bc2a39e7c:4;     0x7f7bc2a39e40;          57;          FALSE;                    4; myBank::paySalary(int)   ; _ZN6myBank9paySalaryEi    ; /ABank/aBank:0x000000a53; /ABank/SalaryBatch.cpp 65
[ 1 1] @0x400a4e write access to 0x7f7bc2a39e60:8;     0x7f7bc2a39e40;          57;          FALSE;                    4; myBank::paySalary(int)   ; _ZN6myBank9paySalaryEi    ; /ABank/aBank:0x000000a53; /ABank/SalaryBatch.cpp 65
[ 1 1] @0x40082c write access to 0x7f7bc2a39e58:8;     0x7f7bc2a39e40;          57;          FALSE;                    4; myBank::addToBalance(int); _ZN6myBank12addToBalanceEi; /ABank/aBank:0x00000082c; /ABank/SalaryBatch.cpp 30
[ 1 1] @0x400a06 read access to 0x605dfc:4       ;           0x605dc0;          55;          FALSE;                    4; myBank::addToBalance(int); _ZN6myBank12addToBalanceEi; /ABank/aBank:0x00000082c; /ABank/SalaryBatch.cpp 30
[ 1 1] @0x400a0c read access to 0x7f7bc2a39e40:8 ;     0x7f7bc2a39e40;          57;          FALSE;                    4; myBank::addToBalance(int); _ZN6myBank12addToBalanceEi; /ABank/aBank:0x00000082c; /ABank/SalaryBatch.cpp 30
[ 1 1] @0x400a10 write access to 0x605dfc:4      ;           0x605dc0;          55;          FALSE;                    4; myBank::addToBalance(int); _ZN6myBank12addToBalanceEi; /ABank/aBank:0x00000082c; /ABank/SalaryBatch.cpp 30
[ 1 1] @0x400a13 read access to 0x7f7bc2a39e48:8 ;     0x7f7bc2a39e40;          57;          FALSE;                    4; myBank::addToBalance(int); _ZN6myBank12addToBalanceEi; /ABank/aBank:0x00000082c; /ABank/SalaryBatch.cpp 30
[ 1 1] @0x400a17 read access to 0x605e00:4       ;           0x605e00;          56;           TRUE;                    5; myBank::addToBalance(int); _ZN6myBank12addToBalanceEi; /ABank/aBank:0x00000082c; /ABank/SalaryBatch.cpp 30
[ 1 1] @0x400a17 read access to 0x605e00:4       ;           0x605e00;          56;           TRUE;                    5; myBank::addToBalance(int); _ZN6myBank12addToBalanceEi; /ABank/aBank:0x00000082c; /ABank/SalaryBatch.cpp 30
[ 1 1] self abort transaction(1) abort reason 9
                                                 ;  TOTAL FOOTPRINT: 250 cache lines;
                                                 ;  TOTAL WRITE SET: 249 cache lines;

 

 

 

Getting Started with OpenCL Development on Windows with Intel® INDE

Intel® INDE provides a comprehensive tool set for developing applications targeting both CPUs and GPUs, enriching the development experience of an OpenCL developer. However, if you are used to working with the legacy Intel® SDK for OpenCL™ Applications, or if you just want to get started and build your first OpenCL code quickly, you can follow these steps to install only the OpenCL™ Code Builder component of Intel® INDE.

Go to the Intel® INDE Web page, select the edition you want to download, and click the Download link:

Intel(R) INDE Download

At the Intel INDE downloads page select Online Installer (9 MB)

Online Installer Option

At the screen where you select which IDE to integrate the Getting Started tools for Android* development with, click Skip IDE Integration and uncheck the Install Intel® HAXM check box:

Skip IDE Integration

At the component selection screen, select only OpenCL™ Code Builder in the Build category (you are welcome to select any additional components that you need as well), and click Next. The installer will install all the OpenCL™ Code Builder components, including the Visual Studio and Eclipse plug-ins, if applicable.

OpenCL Code Builder

Complete the installation and restart your computer. Now you are ready to start developing your OpenCL code!
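
If you want a quick sanity check that the OpenCL headers and runtime installed by the Code Builder are visible to your compiler, the minimal host program below enumerates the installed platforms. This is a hedged sketch, not part of the Intel® INDE package; it only assumes the standard OpenCL API (clGetPlatformIDs/clGetPlatformInfo) and that you link against the OpenCL library installed by the Code Builder.

// Minimal sanity check (illustrative, not from the INDE package): list the OpenCL platforms.
#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_uint count = 0;
    clGetPlatformIDs(0, NULL, &count);            // ask how many platforms are available
    if (count == 0) {
        printf("No OpenCL platforms found.\n");
        return 1;
    }
    if (count > 8) count = 8;                     // keep the sketch simple: report at most 8
    cl_platform_id platforms[8];
    clGetPlatformIDs(count, platforms, NULL);
    for (cl_uint i = 0; i < count; ++i) {
        char name[256];
        clGetPlatformInfo(platforms[i], CL_PLATFORM_NAME, sizeof(name), name, NULL);
        printf("Platform %u: %s\n", i, name);
    }
    return 0;
}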

If you later decide that you need to install additional components of the Intel® INDE suite, rerun the installer and select the Modify option to change the installed features:

Modify Screen

Then you can select the additional components that you need:

Intel C++ Compiler Selection

Complete the installation and restart your computer. Now you are ready to start using additional components of the Intel® INDE suite!

DrDebug : Linux Command Line Usage

Using DrDebug involves two phases: 1. recording and 2. replaying.


Pre-requisites

  1. GDB version 7.4 or higher with Python support
  2. PinPlay/DrDebug kit (Linux)

Setup 

Unpack the kit:
tar -zxf  <kit_tar_gz>

tcsh/csh
 setenv PIN_ROOT <path-to-DrDebug-kit>
 setenv PATH $PIN_ROOT/extras/pinplay/scripts/:$PATH

bash/sh
 export PIN_ROOT=<path-to-DrDebug-kit>
 export PATH=$PIN_ROOT/extras/pinplay/scripts/:$PATH


Recording

CAUTION: Recording causes a major (triple-digit) slowdown, so it is highly recommended that you narrow down the execution region where the bug appears (both 'root cause' and 'symptom') and record only the 'buggy region'. The best alternative is to use 'gdb_record' to first go to the beginning of your region of interest and then turn on recording. Using the 'attach' mode (with either 'record' or 'gdb_record') is another good alternative. You can also specify a region to 'record' using the 'controller switches' (e.g. '-log:start_address'/'-log:stop_address'). More documentation on region specification will be added soon.

With GDB

Script to use: gdb_record

% gdb_record --help
Usage : gdb_record <options> <pintool options> -- bin args
OR gdb_record --pid=XXXX <pintool options> -- bin #no args
options:
  --help : Print help message
  --pintool_help : Print help from pintool
  --pin_options=<pin options>: Options for Pin in double quotes
  --arch=("intel64" | "ia32") : default "intel64"
  --pinball=<pinball-name>: path+basename for pinball
Example:
% gdb_record --pinball=region.pb/log -- bread-demo
(gdb) b 113
Breakpoint 1 at 0x401908: file bread-demo.cpp, line 113.
(gdb) c
Continuing.
Breakpoint 1, main (argc=1, argv=0x7fffffffd5b8) at bread-demo.cpp:113 
113         Go = true;
(gdb) pin record on 
monitor record on
Started recording region number 0
(gdb) b 149
Breakpoint 2 at 0x401b67: file bread-demo.cpp, line 149.
(gdb) c
Continuing.
Breakpoint 2, main (argc=1, argv=0x7fffffffd5b8) at bread-demo.cpp:149 
149         std::cout << "\n";
(gdb) pin record off 
monitor record off
Stopped recording region number 0.
 Move forward before turning recording on again.
(gdb) c
Continuing.

From command line (without GDB)

Script to use: record

% record --help

Usage : record <options> <pintool options> -- bin args
OR record --pid=XXXX <pintool options> -- bin #no args
options:
  --help : Print help message
  --pintool_help : Print help from pintool
  --pin_options=<pin options>: Options for Pin in double quotes
  --arch=("intel64" | "ia32") : default "intel64"
  --pinball=<pinball-name>: path+basename for pinball
Example:
% record --pinball=myregion.pb/log -log:start_address 0x401908 -log:stop_address 0x401b67 -- bread-demo


Replaying

With GDB

Script to use: gdb_replay

% gdb_replay --help

Usage : gdb_replay <options> <pintool options> -- pinball-basename program-binary
options:
  --help : Print help message
  --pintool_help : Print help from pintool
  --pin_options=<pin options>: Options for Pin in double quotes
  --arch=("intel64" | "ia32") : default "intel64"
  --cross_os : Use address translation for text/data
Example:
% gdb_replay -- region.pb/log_0 bread-demo

(gdb) b 141
Breakpoint 1 at 0x401ae3: file bread-demo.cpp, line 141.
(gdb) c
Continuing.
 Breakpoint 1, main (argc=1, argv=0x7fffffffd5b8) at bread-demo.cpp:141 
141         std::cout << "Total revenue: $"<< std::setprecision(2) << std::fixed << revenue;
(gdb) print revenue
$1 = 21975.129998207092
(gdb) c
Continuing.

From command line (without GDB)

Script to use: replay

% replay --help
 
Usage : replay <options> <pintool options> -- pinball-basename
options:
  --help : Print help message
  --pintool_help : Print help from pintool
  --pin_options=<pin options>: Options for Pin in double quotes
  --arch=("intel64" | "ia32") : default "intel64"
  --cross_os : Use address translation for text/data.
Example:
% replay -- myregion.pb/log_0

Testimonials

I. From Milind Chabbi

I was developing a shared-memory synchronization algorithm, which was recursive in nature and involved complicated interactions between multiple threads via shared memory updates. The code ran into a livelock and the bug was neither apparent from inspecting the algorithm/code nor possible to isolate with traditional debugging techniques. The debugging was further complicated by lack of reproducibility of the bug, need for several threads to reproduce the bug, and run-to-run non determinism. Debugging tricks such as watchpoints, page protection and assertions could only identify symptoms of the problem but failed to help in arriving at the root cause of the problem even for parallel programming experts. 

Intel's PinPlay served as a savior by aiding in identifying the root cause--a data race. With PinPlay, I was able to run the code several times and record the log of a buggy execution. Deterministic replay feature of PinPlay for multi-threaded codes, in conjunction with powerful features of Pin framework to perform sophisticated analysis during execution replay, allowed me to break into the debugger just in time to notice step-by-step memory updates and thread interleaving that caused the data race. The cause of the data race was the following: the programmer assumed 64-bit cache-line-aligned memory writes to be atomically visible on x86_64 machines, whereas the compiler (GNU C++ v.4.4.5) took liberty to split a 64-bit write of an immediate value into two independent 32-bit writes, violating the programmer's assumption. This caused a small execution window of two instructions where a shared variable was in an inconsistent state, leading to the occasional data race and eventual livelock.

Like Intel's Pin framework, PinPlay is also robust and works on real code on real machines, making it my choice for debugging parallel programs. I would most certainly recommend PinPlay to both novice and expert programmers to debug their code that exhibit non determinism. In fact, we plan to introduce PinPlay in one of advanced multi-core programming classes here at Rice University. 

Affiliation:

Milind Chabbi is a doctoral candidate advised by Prof. John Mellor-Crummey in the department of computer science at Rice University. Milind is a member of Rice University's HPCToolkit team, where he develops tools and techniques for performance analysis of complex software systems.


If you are curious this was the issue:

My C++ source code:
   cache_line_aligned_64_bit_variable = 0xdffffffffffffffd; // Expected atomic write

g++ generated assembly on 64-bit machine:
   movl   $0xfffffffd,(%rax) // lower 32-bit update
   movl   $0xdfffffff,0x4(%rax) // Higher 32-bit update

As you can see, an atomic write was turned into a non-atomic write during m/c code generation! I can't really blame compiler for taking liberty to do so.

The Generic Address Space in OpenCL 2.0

Introduction

One of the new features of OpenCL 2.0 is the generic address space. Prior to OpenCL 2.0, the programmer had to specify the address space of what a pointer points to when that pointer was declared or passed as an argument to a function. In OpenCL 2.0 the pointer itself remains in the private address space, but what the pointer points to now defaults to the generic address space, meaning it can point to any of the named address spaces encompassed by the generic address space. This feature requires you to set a flag to turn it on, so OpenCL C 1.2 programs will continue to compile with no changes.

To demonstrate this, we show a brief example of the new syntax on a function declaration and a variable declaration inside the function. In OpenCL 1.2 we may have written this:

void foo(global unsigned int *bar)  // ‘global’ address space on bar, works in both OCL 1.2 and OCL 2.0 with no additional flags to compile
{
    local unsigned int *temp = NULL;//’local’ address space on temp, works in both OCL 1.2 and OCL 2.0 with no additional flags to compile
}

In OpenCL 2.0 the following code will work on pointers that point to the global, local, or private address space:

void foo(unsigned int *bar)  // OCL 2.0, no address space on bar
{
    unsigned int *temp = NULL;//OCL 2.0, no address space on temp
}

Remember to enable OpenCL 2.0 features by passing the flag "-cl-std=CL2.0" in the options of clBuildProgram() or clCompileProgram(). Otherwise, you will see the following error:

1:54:24: error: passing '__local unsigned int *' to parameter of type 'unsigned int *' changes address space of pointer

This is because OpenCL 2.0 compiles programs as OpenCL 1.2 programs by default for backwards compatibility.
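
For reference, here is a hedged host-side sketch of passing the flag. The program and device objects are assumed to have been created earlier with the usual OpenCL host calls (e.g. clCreateProgramWithSource() and clGetDeviceIDs()); error handling is omitted.

/* Hedged sketch: build a program with OpenCL C 2.0 enabled. */
#include <CL/cl.h>

cl_int buildWithCL20(cl_program program, cl_device_id device)
{
    const char *options = "-cl-std=CL2.0";  /* without this flag the source compiles as OpenCL C 1.2 */
    return clBuildProgram(program, 1, &device, options, NULL, NULL);
}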

What is the generic address space?

The generic address space is an abstract address space that encapsulates the local, global, and private address spaces. For practical reasons it does not encompass the constant address space. The generic address space is inspired by the generic address space of the Embedded C Specification:

“In addition to the generic address space, an implementation may support other, named address spaces. Objects may be allocated in these alternate address spaces, and pointers may be defined that point to objects in these address spaces. It is intended that pointers into an address space only need be large enough to support the range of addresses in that address space.”

On some architectures, pointers to the different address spaces are of different sizes and may refer to different memory banks, so we don't want to get rid of the existing named address spaces. However, we do want a means to write programs that don't require specialization of functions when it isn't needed. Some of these use cases are documented below. For performance reasons we want to continue to support specialization for the respective address spaces, but also enable programmers to write a single segment of code that will operate on different address spaces.

A few important points about the generic address space:

  • Parameters to kernels called from the host must continue to be qualified with an address space.
  • Function and kernel parameters remain in the private address space; it is what they point to that has changed. Read this three times, it is important and subtle! Many examples are in Chapter 6.5 of the OpenCL C 2.0 Specification; see the References.
  • Null pointers from two different address spaces will evaluate to be equal as long as one of those pointers is generic.
  • If a null pointer is converted from one address space to another the null pointer will now be a null pointer of that type.
  • The address spaces are considered disjoint in the abstract sense but some implementations may treat them as if they are overlapping or part of the same physical memory. This is perfectly reasonable.
  • global, local, private and constant can be used interchangeably with __global, __local, __private and __constant respectively.  Generic is an unnamed address space and as such has no keyword in OpenCL 2.0.

Enabling the generic address space

As stated in the introduction, to enable the OpenCL C 2.0 generic address space feature, the flag "-cl-std=CL2.0" must be passed to clBuildProgram() or clCompileProgram(). Otherwise, the program will continue to compile in OpenCL 2.0 as an OpenCL 1.2 program. This makes it easier to move to the new OpenCL 2.0 runtime and have older programs 'just work', while migrating to new OpenCL 2.0 features like shared virtual memory and the generic address space incrementally.

Why would I want to use the generic address space?

The generic address space makes writing OpenCL programs easier by removing the requirement of decorating all pointers with a points-to address space when the programmer may not care, or may want to use the same function regardless of the address space of the incoming pointer. For example, one can imagine incrementing the value of a histogram, adding or sorting a set of values in an array, or a set of operations on a per-element basis. In all of these cases the OpenCL C 1.2 specification requires us to write a version of the function for each address space we expect to enter the function. This forces the developer to maintain multiple versions of the same function, increasing the chance of making changes in one segment of code and not the other, which raises the risk of versioning issues. While the new generic address space eliminates the requirement for decorating all of the pointers with an address space, it does not require one to remove the address space, so all of your old OpenCL 1.2 code will continue to work as written. Programs that benefit from specifying the address space can continue to use address spaces declared by the programmer.

Another reason one may choose to leverage the generic address space is to make it easier to cross compile a segment of C code on the CPU and the GPU, for example a data structure or set of core functions used on both the host and the device. Without the generic address space, the programmer is forced to trick the host or device compiler into handling address space keywords, for example by transforming address space qualifiers to whitespace or other cleverness. The generic address space doesn't take care of all the issues, but it makes it easier to compile the same code with different C compilers.

What if I want to write one function, but I still have a few operations when the pointer points to a specific address space?

In some cases a programmer may want to write a function that operates in a generic fashion, but there are some operations needed for a specific address space. In the case of a histogram, imagine that incrementing a value in local memory may not require an atomic, but incrementing a value in a histogram shared globally among the workgroups would require an atomic operation on global memory. In such a case, there are built-ins to help for these portions of a function. The functions to_global(), to_local(), and to_private() can be used to cast a pointer to the respective address space. If for some reason these functions are not able to cast a pointer to the respective address space they will return NULL. This allows the programmer to know whether a pointer can be treated as if it points to the respective address space or not.
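
As a rough illustration, the device-side sketch below uses to_global() to decide whether an atomic update is required when incrementing a histogram bin through a generic pointer. The function and variable names are mine, not from the specification, and the code assumes the program is built with "-cl-std=CL2.0".

/* Hedged sketch: increment a histogram bin through a generic pointer. */
void bumpBin(int *bins, int bin)
{
    global int *gbins = to_global(bins);
    if (gbins != NULL) {
        atomic_inc((volatile global int *)&gbins[bin]);  /* globally shared bins need an atomic update */
    } else {
        bins[bin]++;  /* local or private copy: plain increment, per the assumption described above */
    }
}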

One might wonder why these functions do not return a Boolean based on the value of is_local(), is_global(), and is_private(), for example. The reason is that some implementations may treat one address space as another and it is better to allow them to return the pointer value if they can be treated as if they are in the requested address space.  For example, CPUs may do this for all address spaces.

Address space casting

Casting from one named address space to another named address space is not allowed. A pointer to a named address space can be assigned to a generic pointer, but not the other way around. Also, a single generic pointer may be assigned pointers to different named address spaces in the same code sequence. A variable that points to the constant address space is not convertible to any of the members of the generic address space.

This next example is legal. lp is a pointer to the local address space, g is a pointer in private memory pointing to the generic address space and is assigned a pointer that points to the local address space.

local int *lp;

int *g;

g = lp; //success!

However, the next example is not legal. The pointer lp is in private memory and points to the local address space; attempting to assign it a generic pointer value (p) is an error.

local int *lp;

int *p;

lp = p;  //error!

The OpenCL C specification Section 6.5 included in the References has numerous examples of legal and illegal casting.

Are there performance implications? How do I get around these issues?

In some implementations the performance of a function can be negatively impacted if the compiler cannot resolve the address space being pointed to at compile time. If this is an issue, you can decorate the relevant pointer with a specific address space so the compiler does not have to generate any additional code to handle the generic address space. When the address space cannot be resolved at compile time, the generic address space may have a small performance impact for very small kernels. Ideally, compilers will give feedback to the programmer when the address space of a parameter could not be resolved, but as far as I know no compiler yet publicly supports this feature. Also, most kernels contain many more instructions than the few additional instructions needed at runtime to resolve the address space, so this cost is expected to amortize relative to the execution time of the kernel.

A working example

To demonstrate this functionality we have written a short code sequence that behaves correctly in OpenCL 1.2, and the same code sequence using the generic address space for OpenCL 2.0. We use a simple synthetic kernel that mimics a memcpy() from a buffer segment in either the global or local address space to a global buffer. It is easy to imagine such a kernel doing a set of math operations on each value before writing it back to the global memory buffer, for example the conversion from an RGB image to YUV, which would require several multiplies and adds on the input values. The same set of operations would take place on the data; the only difference would be the address space on the input buffer. This makes it a good candidate for the generic address space, as some implementations may perform better by leaving the buffer in global memory before the RGB to YUV conversion, while others may choose to tile the image into local memory before doing any additional operations.

A simple memcpy() function written in OpenCL 1.2 that loads from global memory and writes back to another location in global memory can be written:

void GlobalXToY_internal(__global unsigned char *in, __global unsigned char *out, unsigned int startFrom, unsigned int startTo, unsigned int length)
{
       for(int i = startFrom; i < startFrom + length;)
       {
              out[startTo++] = in[i++];
       }
}

The equivalent functionality in OpenCL 2.0, which can accept an input buffer in either the global or the local address space, is written as follows. Note the elimination of the keyword __global on the first function argument:

void GenericXToY_internal(unsigned char *in, unsigned char *out, unsigned int startFrom, unsigned int startTo, unsigned int length)
{
       for(int i = startFrom; i < startFrom + length;)
       {
              out[startTo++] = in[i++];
       }
}
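
To see the single generic function servicing both cases, here is a hedged kernel sketch. The kernel name, the tile size, and the single-work-item staging copy are illustrative only and are not part of the sample package; GenericXToY_internal() above is assumed to be defined in the same program source.

#define TILE_SIZE 256

kernel void CopyBothWays(global unsigned char *in,
                         global unsigned char *out,
                         unsigned int length)
{
    local unsigned char tile[TILE_SIZE];
    unsigned int n = (length < TILE_SIZE) ? length : TILE_SIZE;

    if (get_local_id(0) == 0) {                     /* single work-item copy, for brevity only */
        GenericXToY_internal(in,   tile, 0, 0, n);  /* global source -> local destination */
        GenericXToY_internal(tile, out,  0, 0, n);  /* local source  -> global destination */
    }
}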

The sample compiles and runs on any Intel Processor with Intel Processor Graphics supporting OpenCL 2.0. It has not been tested on other implementations at this time.

Future work

In the future we could augment this memcpy() example with a more complicated workload, for example tiled memory operations or a more complicated matrix multiplication. Also, as more OpenCL 2.0 implementations become available, we could enable the sample on those platforms. Additionally, doing analysis to verify the cost of using the generic address space would give us greater assurance about the performance implications.

Acknowledgements

We would like to thank Dillon Sharlet who was a close collaborator on the initial generic address space proposal, as well as Ben Ashbaugh, Stephen Junkins, Ben Gaster, and others who made significant contributions and clarifications during its development. Also thanks to the other vendors in Khronos who worked to do the proper analysis of the feature before inclusion into the OpenCL specification. 

References

Embedded C: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1169.pdf

Shared Virtual Memory Sample: https://software.intel.com/en-us/articles/opencl-20-shared-virtual-memory-code-sample

Additional samples available at: https://software.intel.com/en-us/intel-opencl-support/product-library

Intel® SDK for OpenCL™ Applications: https://software.intel.com/en-us/intel-opencl

OpenCL API Specification: https://www.khronos.org/registry/cl/specs/opencl-2.0.pdf 

OpenCL C Specification: https://www.khronos.org/registry/cl/specs/opencl-2.0-openclc.pdf

Understanding How General Exploration Works in Intel® VTune™ Amplifier XE

The General Exploration Analysis Type in Intel® VTune™ Amplifier XE is used to detect microarchitectural hardware bottlenecks in an application or system. General Exploration uses hardware event counters to detect and locate issues and presents the data in a user-friendly and actionable format. This article will explain the mechanisms used in this analysis, a few best-known-methods for interpreting the results, and the various complexities and issues that arise when doing this type of analysis.

The Mechanisms behind General Exploration

 

The majority of the data collected and displayed in the General Exploration profile is based on hardware events collected by the Performance Monitoring Units (PMUs) on the CPU. These PMUs are hardware registers that can be programmed to count various events, for example cache misses or mispredicted branches. You can find details about the PMUs and events in the Software Developer's Manual. VTune Amplifier collects the events in a mode known as Event-Based Sampling (EBS). In EBS, each PMU register is programmed to count a specific event and given a sample-after value (SAV). When the event occurs, the counter increments, and when it reaches the SAV, an interrupt fires and some data is collected, e.g. instruction pointer (IP), call stack, PID, etc. For example, if we programmed a PMU to count the L2 Cache Miss event with a SAV of 2000, on the 2000th L2 cache miss an interrupt would occur, the data would be collected, and VTune Amplifier would attribute all 2000 cache misses to the single IP collected in the interrupt handler. Then the PMU would reset and start counting to 2000 again.

The fact that all 2000 cache misses are attributed to a single IP is the reason this technique is called Event-Based Sampling. It is unlikely that all 2000 cache misses were actually caused by that single instruction; however, if enough samples are collected in this manner, the results should be a statistical representation of the actual behavior. When sufficient symbol information and source code are available, these instruction pointers can be resolved to specific modules, functions, and often specific lines of code within an application.

There are hundreds of events that can be collected on a given architecture, and these events can change from generation to generation. The General Exploration analysis is designed to abstract away the complexities of selecting which events to collect by predefining these event lists for each architecture and combining events into understandable metrics that remain consistent across platforms. General Exploration collects approximately 60 events on recent architectures. There are a limited number of PMU registers (usually four per logical core) so this set of events must be multiplexed in order to collect them all. This means that four events are collected for some period of time, then they are swapped out for a different four events, and this process repeats continuously for the entire profile. The behavior of specific events must be estimated during the times when that event is not being collected. This multiplexing can introduce issues, which are described in the complexities section later in the article.

This is all done automatically when you select the General Exploration Analysis Type in VTune Amplifier.

Interpreting General Exploration Results

 

After running a General Exploration analysis you are presented with a summary of all the data collected. In addition to some common performance data such as elapsed time, the majority of the General Exploration data are metrics based on the hardware events described above. A metric is a combination of events into some useful value. For example, if we take the CPU_CLK_UNHALTED.THREAD event count and divide it by the INST_RETIRED.ANY event count, we get the metric Cycles per Instruction (CPI). The metrics in VTune Amplifier are organized in a hierarchical fashion which is used to identify microarchitectural bottlenecks. This hierarchy and methodology is known as the Top-Down Characterization. A detailed description of this methodology can be found in the white paper How to Tune Applications Using a Top-Down Characterization of Microarchitectural Issues. Additionally, there is a tuning guide for each hardware platform, largely based on General Exploration, available from www.intel.com/vtune-tuning-guides.

After running an analysis, it's important to focus on the hotspots within your application. These will have the largest counts for the CPU_CLK_UNHALTED.THREAD event. Next, focus on metrics that are highlighted in the GUI at the highest level of the hierarchy. Highlighted metrics have values outside a predefined threshold that represents when performance starts to be impacted negatively. Expand the highlighted metrics to see sub-metrics that relate to the parent; again look for highlighted metrics to identify performance impacts and continue expanding them. Start by focusing on the lowest level of the hierarchy that has a highlighted metric. Use the hover text and documentation to understand the meaning of the various metrics and how they can affect performance. The tuning guides include many suggestions for improving performance depending on which metrics are highlighted. If no metrics are highlighted at the lowest level, focus on the highlighted level above and try to understand what causes the performance issue indicated by that metric.

After making code changes, rerun a General Exploration analysis to see if they have affected the values of the metrics. You can use the VTune Amplifier GUI to compare results side-by-side. This performance tuning process is iterative and can continue as long as the developer is willing to invest the time. Generally, the first fixes will have the most impact on performance, and subsequent changes will begin to have diminishing returns.

Understanding the Complexities of Event Based-Sampling

 

There are a number of complexities associated with EBS that aren’t necessarily prevalent in standard user-mode performance analysis. It is important to understand these issues and how they may be affecting your data as you analyze results from VTune Amplifier.

Complexity 1: Intel® Hyper-Threading Technology

One such set of complexities is introduced by the use of Intel® Hyper-Threading Technology, sometimes referred to as an implementation of simultaneous multithreading (SMT). In a system running with SMT enabled, each physical core has additional hardware which allows it to appear as two logical processors, although most of the hardware is still shared. You can imagine that designing PMUs and events for this case can be complex. Some events become less accurate when SMT is enabled because it can be difficult to distinguish which logical core is responsible for the event. Additionally, calculating metrics using these events is complex because metrics make some assumptions about the underlying hardware.

For example, a modern Intel CPU can allocate and retire four micro-ops (uops) per cycle, so the ideal CPI would be 0.25. However, when SMT is enabled, this limit of four uops per cycle still holds for the core, and each SMT thread must share the allocate and retire resources. Thus, in an application with SMT enabled and multiple software threads consistently running, a per-logical-core (SMT core) CPI of 0.5 is the best you can do, and most applications are a mix of parallel and serial execution. Because of this, some metric values may vary when SMT is enabled and should be interpreted with this in mind.

The main metrics affected by SMT are the "top 4" categories: Retiring, Bad Speculation, Back-End Bound, and Front-End Bound. The effects of SMT are proportional to the amount of time that both SMT threads are executing simultaneously on a core and competing for the same resource, e.g. allocation slots. With lots of simultaneous execution, the Retiring, Bad Speculation, and Front-End Bound metrics can be underestimated and Back-End Bound will be overestimated. In a worst-case scenario, these metrics can be off by ~1X; however, the variance is generally less than that. Additionally, VTune Amplifier has thresholds built in to highlight metrics when they may indicate a performance issue, and in most cases this variance will not affect whether a metric is highlighted. It's important to be aware of how SMT may be affecting your results, and if you suspect such effects, try running an analysis with SMT disabled in the BIOS.

Complexity 2: Event Skid

Another complexity introduced by EBS is the event skid that occurs when PMU interrupt handlers collect the Instruction Pointer (IP) that caused the event. Due to complexities in the hardware and delays in interrupt processing, the IP collected in the interrupt handler and assigned to a given event may skid to an IP several instructions later in the execution than where the event actually occurred. When metrics are being evaluated at a module or function level, this is generally not an issue because the IP will often still be in the same module or function. However, if you are looking at source lines or assembly instructions, the event counts and metric values may skid to a line or instruction later in the binary. If metric values don't seem to make sense, look at the few source lines before and consider whether skid may be affecting the values and whether it makes sense for the metrics to be associated with a previous line in the program.

A special case of event skid can occur on branch instructions, where events may get associated with branch targets that may be far away in the binary. Pay special attention to this effect if it looks like metrics associated with branch targets are actually from branch sources. Some events can be collected in a "precise" mode. These events are usually denoted with a _PS at the end of their name, for example BR_INST_RETIRED.ALL_BRANCHES_PS. These events usually have a predefined skid of one instruction: they are attributed to the IP directly after the eventing IP. It is not always trivial to determine the previous IP, as in the branch case described above, so VTune Amplifier cannot do this automatically. However, when looking at event locations for precise events, keep this IP+1 skid in mind.

Complexity 3: Event Multiplexing

The General Exploration analysis type collects 60+ events on recent microarchitectures, which requires event multiplexing, as described above. This multiplexing can cause additional variance in the collected data if runs are too short or don’t repeat any type of steady state within the time it takes to cycle through all multiplexed event groups multiple times. Obviously there is a lot of variance in the amount of time you must run your specific application to get valid data, but keep in mind that each group is rotated approximately every 10ms and there are ~15 groups in General Exploration. Additionally, there is a “MUX Reliability” metric in VTune Amplifier which attempts to determine how accurate any given set of metrics for a row in the grid are, with respect to multiplexing inaccuracies. This value ranges from 0 to 1, and values above 0.9 generally represent a high confidence in the metrics for that row. If you think your data may be affected by multiplexing based on the symptoms described above (low MUX Reliability values, short run durations), try running the analysis for a longer period of time or enable multiple runs in the GUI.

Complexity 4: Interrupt Handlers and Masking

One more issue to be aware of when doing EBS with VTune Amplifier is the effect interrupt handlers and masking interrupts can have on sampling. As described above, samples are counted through the use of an interrupt and interrupt handler. If the platform or application being analyzed has lots of time spent with interrupts masked, for example in other interrupt handlers, samples cannot be collected during that time. This also means that interrupt handlers themselves cannot be fully profiled by EBS.  If multiple PMU interrupts arrive while interrupts are masked, only the last one will be handled when interrupts are eventually unmasked. This issue is often seen as many more samples than expected being attributed to the kernel idle loop, where interrupts are regularly masked and unmasked.

It’s important to be aware of these issues and be able to recognize when your analysis may be affected by some of these complexities.

PinPlay:FAQ

I. How long does record/replay take?

Record/replay overhead is a function of the number of memory accesses and the amount of sharing in the test program.

1. Time for recording/replaying a 'region': 

Source : CGO2014 paper on DrDebug

2. Slow-down for whole-program recording.

Source: Measured with PinPlay kit 2.0 

Average slowdown (x native):

Benchmark    Input            Logger    Replayer
SPEC2006     'ref'            93x       11x
PARSEC       'native', >=4T   197x      37x

II. Why does PinPlay have a high overhead (especially for recording)?

The design goals of PinPlay were:

  • No special HW requirement (including no reliance on HW performance counters).
  • No special operating system requirement (including no virtual machine or no modified kernel).
  • Complete and faithful reproduction of multi-threaded schedules.
  • Portability (small size, OS-independence) of recording ("pinball").
  • No program source needed. No re-compilation/re-linking required.

As a result, PinPlay works on multiple operating systems 'out of the box' and provides the guarantee that a bug once captured will not escape. However, that comes with a high overhead, especially during recording.

There are two major sources of slow-down in PinPlay:

1. System call side-effect analysis.

A shadow memory is maintained during recording. All real memory writes observed in the program are replicated in the shadow memory. Memory reads lead to a comparison of 'real' and 'shadow' memory values, and a mismatch or missing value leads to an injection being emitted into the *.sel file. At replay time, all memory reads are monitored and the recorded memory values are injected if present. The details are described in our SIGMETRICS 2006 paper "Automatic Logging of Operating System Effects to Guide Application-Level Architecture Simulation".

The overhead of this technique is proportional to the number of memory accesses in the program.

2. Shared memory access order analysis.

During recording, all memory accesses are monitored and a cache coherency protocol is simulated including maintenance of last reader/writer for each shared memory access. A subset of detected read-after-write, write-after-read, and write-after-write dependences is recorded in the *.race file. During replay, all memory accesses are monitored and a thread is delayed if it tries to access a shared memory location out of order.

The overhead of this technique is proportional to the number of shared memory accesses in the program.


Bitonic Sorting

Sample for Windows* | Sample for Linux* | Download Documentation

Features / Description

Demonstrates how to implement an efficient sorting routine with the OpenCL™ technology.

  • Operates on arbitrary input array of integer values
  • Utilizes properties of bitonic sequence and principles of sorting networks
  • Enables efficient SIMD-style parallelism through OpenCL vector data types
  • Fits modern CPUs

Supported Devices: CPU, Intel Processor Graphics, Intel® Xeon Phi™ coprocessor
Supported OS: Windows* and Linux* OS
Complexity Level: Intermediate

For more information about the sample refer to the sample documentation inside the package.

*OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.

Dynamic allocator replacement on OS X* with Intel® TBB

The Intel® Threading Building Blocks (Intel® TBB) library provides an alternative way to dynamically allocate memory - Intel TBB scalable allocator (tbbmalloc). Its purpose is to provide better performance and scalability for memory allocation/deallocation operations in multithreaded applications, compared to the default allocator.

There are two general ways to employ Intel TBB scalable allocator in your application:

  1. Explicitly specifying the Intel TBB scalable allocator in source code, either by using its memory allocation routines (like “scalable_malloc”; a short sketch follows this list) or by specifying the allocator for containers:
#include <tbb/scalable_allocator.h>
std::vector<int, tbb::scalable_allocator<int> > numbers;
  2. Automatically replacing all calls to standard functions for dynamic memory allocation (such as malloc) with the Intel TBB scalable equivalents. This option was introduced in Intel TBB 4.3.
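
Below is a hedged sketch of the routine-based approach from item 1; scalable_malloc and scalable_free come from the Intel TBB scalable allocator interface, and error handling is omitted.

#include <tbb/scalable_allocator.h>

void useScalableAllocator()
{
    void *buf = scalable_malloc(1024);   // request 1 KB from the TBB scalable allocator
    // ... use buf ...
    scalable_free(buf);                  // memory from scalable_malloc must be freed with scalable_free
}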

One way to do the automatic replacement is to link the main executable file with the Intel TBB malloc proxy library:

clang++ foo.o bar.o -ltbbmalloc_proxy -o a.out

Another way does not even require re-building, so you can provide a new memory allocator to the same binary. This is done by loading the malloc proxy library at application start time using the DYLD_INSERT_LIBRARIES environment variable, for example (assuming the library can be found by the dynamic loader):

DYLD_INSERT_LIBRARIES=libtbbmalloc_proxy.dylib ./a.out

In OS X, simply loading libraries with DYLD_INSERT_LIBRARIES requires using flat namespaces in order to access the shared library symbols. If an application was built with two-level namespaces, this approach will not work, and forcing the usage of flat namespaces may lead to a crash.

Intel TBB overcomes this problem in a smart way. When the libtbbmalloc_proxy library is loaded into the process, its static constructor is called and registers a “malloc zone” for TBB memory allocation routines. This allows redirecting memory allocation routine calls from the standard C++ library into TBB scalable allocator routines. The application doesn’t need to use TBB malloc library symbols; it continues to call standard “libc” routines, so there are no problems with namespaces. Also, the OS X “malloc zones” mechanism allows applications to have several memory allocators (e.g. used by different libraries) and still manage memory correctly: it guarantees that Intel TBB will use the same allocator for allocations and deallocations, which is a safeguard against crashes due to calling a deallocation routine on a memory object allocated by another allocator.

Additional links:

Intel TBB: Memory Allocation
Intel TBB documentation: dynamic memory interface replacement on OS X
Intel TBB documentation: Memory Allocation reference

 

Intel Cluster Ready FAQ: Customer benefits

Q: Why should we select a certified Intel Cluster Ready system and registered Intel Cluster Ready applications?
A: Choosing certified systems and registered applications gives you the confidence that your cluster will work as it should, right away, so you can boost productivity and start solving new problems faster.
Learn more about what Intel Cluster Ready is and its benefits.

Q: If the application I use is not registered, can I still benefit from Intel Cluster Ready?
A: Yes. Many unregistered applications run on certified Intel Cluster Ready systems with few or no changes. Even if some configuration adjustments are required, you can focus on the application, without having to simultaneously troubleshoot your cluster.

Intel Cluster Ready FAQ: Hardware vendors, system integrators, platform suppliers

Q: Why should we join the Intel® Cluster Ready program?
A: By offering certified Intel Cluster Ready systems and certified components, you can give customers greater confidence in deploying and running HPC systems. Participating in the program will help you drive HPC adoption, expand your customer base, and streamline customer support. You will also gain access to the Intel Cluster Checker software tool and the library of pre-certified Intel Cluster Ready system reference designs.
Learn more about the Intel Cluster Ready program

Q: How do we build a certified Intel Cluster Ready cluster?
A: First, read and accept the program agreement. You will receive access to the software and documents you need to build the cluster, including the Intel Cluster Ready Specification. Use Intel® Cluster Checker software to ensure the cluster is compliant with the specification. Once the cluster has passed the test, send a certification form with the Intel Cluster Checker output files to Intel to earn the certification. To simplify the process, you can also create an exact copy of an Intel Cluster Ready "reference design" and avoid having to test and re-certify the cluster.
Read more about the certification process

Q: Does the Intel Cluster Ready program restrict vendor-specific additions to clusters?
A: No. The Intel Cluster Ready Specification defines a set of minimum requirements but in no way restricts innovation or the inclusion of additional capabilities. Once the base specification is met, you can add cluster monitoring, management, hardware configuration, interconnect options, or runtime components. If additions are made after certification, the cluster will need to be re-certified.
Read more about the types of changes that require re-certification

Q: What is a “reference design"? What makes a reference design Intel Cluster Ready?
A: A “reference design” is a repeatable process that can be followed to build a cluster platform. The recipe includes a list of essentials (or materials) with step-by-step instructions for setting up, installing, and configuring the cluster hardware and software.

An Intel Cluster Ready reference design has been certified to produce clusters adhering to the Intel Cluster Ready technical specification. Intel also provides Intel Cluster Ready reference designs that can be used by Intel Cluster Ready hardware partners as-is or as a basis for customization. Hardware partners may also create their own recipes for validation and certification.
View Intel Cluster Ready reference recipes

Q: What is a certificate?
A: A certificate is an electronically signed document that provides proof of certification for a cluster recipe. The certificate is issued for the particular reference system built according to the recipe, but all clusters built from the recipe are covered by the same certificate. With a single certificate for a cluster recipe, you can sell multiple configurations.
Learn more about the Intel Cluster Ready program

Q: If a reference cluster was certified with 6-core Intel® Xeon® processors, does the certification automatically cover 8-core processors?
A: Yes, although some Intel Cluster Checker performance thresholds will change. In general, the cluster hardware can be altered as long as the cluster software stack is unchanged. Performance parameters may need to be adjusted so that Intel Cluster Checker can successfully verify the new configuration.
Learn more about the allowed hardware variances

Q: Do we need to go through the certification process for each new software update or upgrade?
A: It depends. The entire hardware and software stack is certified. If a piece is changed, updated, or replaced in a way that falls outside the exceptions in the certification procedure, the resulting modified stack must also be certified. The second certificate does not invalidate the first, so customers using the first stack still have a certified Intel Cluster Ready system.
Read more about the certification process

Q: How do we join the Intel Cluster Ready program?
A: Read and accept the program agreement

Q: What do we receive when we join the Intel Cluster Ready program?
A: You receive access to:

  • The Intel Cluster Ready Specification
  • A list of pre-certified cluster reference designs
  • Intel Cluster Ready cluster test and tools software and documentation
  • Optional training on the Intel Cluster Ready program and its tools
  • Documented procedures for certification and testing
  • Vendor certificates for all certified recipes
  • Registration of compliant applications
  • Sales collateral templates
  • Opportunities to be listed in Intel Cluster Ready marketing material

Q: If we have additional questions about how to build or certify an Intel Cluster Ready system, where can we get technical support?
A: When you accept the program agreement, you will be given an account on the Intel Premier support site. A team of Intel Cluster Ready technical experts monitor the Premier site and will respond to any technical questions you may have.

Intel Cluster Ready FAQ: Software vendors (ISVs)

Q: Why should we join the Intel Cluster Ready program?
A: By offering registered Intel Cluster Ready applications, you can provide the confidence that applications will run as they should, right away, on certified clusters. Participating in the program will help you increase application adoption, expand application flexibility, and streamline customer support.
Learn more about the Intel Cluster Ready program

Q: How do we register an application?
A: First, read and accept the program agreement. Once you accept the agreement, we will send you an easy-to-use manual and provide access to a certified Intel Cluster Ready system. You define your own test suite and then verify that your application will run real-world workloads successfully on the certified system. Provide us with the results, and we will send you the official registration notification. Each time you release an updated version of your software, register the new version to ensure compliance and interoperability have not changed.
Learn more about the registration process

Q: How do we join the Intel Cluster Ready program?
A: Read and accept the program agreement

Q: What do we receive when we join the Intel Cluster Ready program?
A: You receive access to:
 

  • The Intel Cluster Ready Specification
  • A list of pre-certified cluster reference designs
  • Intel Cluster Ready cluster test and tools software and documentation
  • Optional training on the Intel Cluster Ready program and its tools
  • Documented procedures for registration and testing
  • Registration of compliant applications
  • Sales collateral templates
  • Opportunities to be listed in Intel Cluster Ready marketing material

Q: If we have additional questions about how to register our application as Intel Cluster Ready, where can we get technical support?
A: When you accept the program agreement, you will be given an account on the Intel Premier support site. A team of Intel Cluster Ready technical experts monitor the Premier site and will respond to any technical questions you may have.
