Intel® Advisor XE 2016 provides two tools to help ensure your Fortran and native/managed C++ applications take full performance advantage of today’s processors:
- Vectorization Advisor is a vectorization analysis tool that lets you identify loops that will benefit most from vectorization, identify what is blocking effective vectorization, explore the benefit of alternative data reorganizations, and increase the confidence that vectorization is safe.
- Threading Advisor is a threading design and prototyping tool that lets you analyze, design, tune, and check threading design options without disrupting your normal development.
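The following READMEs show how to improve the performance of a C++ sample application with the Vectorization Advisor in the Windows* OS standalone GUI, the Microsoft Visual Studio* IDE on Windows* OS, and the Linux* OS standalone GUI.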
Vectorization Sample README-Windows* OS/Standalone GUI
This README shows how to use the Intel® Advisor XE 2016 standalone GUI to improve the performance of a C++ sample application. Follow these steps:
Prepare the sample application.
Establish a performance baseline.
Get to know the Vectorization Workflow.
Increase the optimization level.
Disambiguate pointers.
Generate instructions for the highest instruction set available.
Handle dependencies.
Align data.
Reorganize code.
Prepare the Sample Application
Get Software Tools and Unpack the Sample
You need the following tools:
Intel Advisor XE 2016
Version 15.0 or higher of an Intel C++ compiler or a supported compiler
Use an Intel compiler to get more benefit from the Vectorization Advisor Survey Report. See the Release Notes for more information on supported compilers.
.zip file extraction utility
Acquire and Install Intel Software Tools
If you do not already have access to the Intel Advisor XE 2016 or to Version 15.0 or higher of an Intel C++ compiler, download an evaluation copy from http://software.intel.com/en-us/articles/intel-software-evaluation-center/.
Set Up the Intel Advisor Sample Application
Copy the vec_samples.zip file from the <advisor-install-dir>\samples\<locale>\C++\ directory to a writable directory or share on your system.
The default installation path, <advisor-install-dir>, is C:\Program Files (x86)\IntelSWTools\Advisor XE 201n\ (on certain systems, instead of Program Files (x86), the directory name is Program Files).
Extract the sample from the .zip file.
Build the Sample Application in Release Mode
Set the environment for version 15.0 or higher of an Intel compiler.
Build the sample application with the following options in release mode:
- /O1
- /Qstd=c99
- /fp:fast
- /Qopt-report:5
For example:
icl /O1 /Qstd=c99 /fp:fast /Qopt-report:5 Multiply.c Driver.c -o MatVector
Launch the Intel Advisor
Do one of the following:
Run the advixe-gui command.
From the Microsoft Windows* Start menu, select Intel Parallel Studio XE 2016 > Analyzers > Advisor XE.
From the Microsoft Windows Start screen, scroll to access the Parallel Studio XE 2016 tile.
From the Microsoft Windows* Apps screen, scroll to access the Intel Parallel Studio XE group.
Create a New Project
Choose File > New > Project… (or click New Project… in the Welcome page) to open the Create a Project dialog box.
Type vec_samples in the Project Name field, supply a location for the sample application project, then click the Create Project button to open the Project Properties dialog box.
On the left side of the Analysis Target tab, ensure the Survey Hotspots/Suitability Analysis type is selected.
Click the Browse… button next to the Application field, and choose the just-built binary.
Click the OK button to close the Project Properties window and open an empty Survey Report window.
Establish a Performance Baseline
To set a performance baseline for the improvements that will follow, do the following:
In the VECTORIZATION WORKFLOW pane, click the Collect control under Survey Target to produce a Survey Report.
If necessary, select the Do not show this window again checkbox in the infotip, then close the infotip.
In the Survey Report window, notice:
The Elapsed time value in the top left corner is ~16 seconds. (Your value may vary.) This is the baseline against which subsequent improvements will be measured.
In the Loop Type column in the top pane, all detected loops are Scalar.
Get to Know the Vectorization Workflow
The VECTORIZATION WORKFLOW in the left pane is a recommended usage scenario.
Survey Target – This analysis produces a Survey Report (currently displayed) that offers integrated compiler report data and performance data all in one place. Use this information to help identify:
Where vectorization will pay off the most
If vectorized loops are providing benefit, and if not, why not
Un-vectorized and under-vectorized loops, and the estimated expected performance gain of vectorization or better vectorization
How data accessed by vectorized loops is organized and the estimated expected performance gain of reorganization
Find Trip Counts – This optional analysis dynamically identifies the number of times loops are invoked and execute (sometimes called call count/loop count and iteration count respectively), and adds this information to the Survey Report. Use this information to make better decisions about your vectorization strategy for particular loops, as well as to optimize already-parallel loops.
Check Dependencies – This optional analysis produces one of two optional Refinement Reports if you want to dig deeper. For safety purposes, the compiler is often conservative when assuming data dependencies. Use the Dependencies Report to check for real data dependencies in loops the compiler did not vectorize because of assumed dependencies. If real dependencies are detected, the report can provide additional details to help resolve the dependencies. Your objective: Identify and better characterize real data dependencies that could make forced vectorization unsafe.
Check Memory Access Patterns – This optional analysis produces one of two optional Refinement Reports if you want to dig deeper. Use the Memory Access Patterns (MAP) Report to check for various memory issues, such as non-contiguous memory accesses and unit stride vs. non-unit stride accesses. Your objective: Eliminate issues that could lead to significant vector code execution slowdown or block automatic vectorization by the compiler.
Increase the Optimization Level
To see if increasing the optimization level improves performance, do the following:
Rebuild the sample application with the following options:
- /O3
- /Qstd=c99
- /fp:fast
- /Qopt-report:5
In the VECTORIZATION WORKFLOW pane, click the Collect control under Survey Target.
In the new Survey Report, notice:
Increasing the optimization level does not vectorize the loops; all are still scalar.
Advisor XE explains why the loops are still scalar in the Vector Issues and Why No Vectorization? columns: The compiler assumed there are dependencies that could make vectorization unsafe. The Survey Report also offers recommendations for how to fix this issue.
Try clicking a:
The Elapsed time improves.
Disambiguate Pointers
Two pointers are aliased if both point to the same memory location. Storing to memory using a pointer that might be aliased may prevent some optimizations. For example, it may create a dependency between loop iterations that would make vectorization unsafe. Sometimes, the compiler can generate both a vectorized and a non-vectorized version of a loop and test for aliasing at runtime to select the appropriate code path. If you know that pointers do not alias and inform the compiler, it can avoid the runtime check and generate a single vectorized code path.
In Multiply.c, the compiler generates runtime checks to determine whether or not the pointer b in function matvec(FTYPE a[][COLWIDTH], FTYPE b[], FTYPE x[]) is aliased to either a or x. If Multiply.c is compiled with the NOALIAS macro, the restrict qualifier of the argument b informs the compiler that the pointer does not alias with any other pointer, and in particular that the array b does not overlap with a or x.
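The effect of the macro can be pictured with a minimal sketch. This is not the sample's actual source: the FTYPE and COLWIDTH values, the rows parameter, the B_QUAL helper macro, and the function name matvec_sketch are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values; the sample defines its own FTYPE and COLWIDTH. */
#define FTYPE double
#define COLWIDTH 4

#ifdef NOALIAS
#define B_QUAL restrict /* promise: b overlaps neither a nor x */
#else
#define B_QUAL          /* no promise: the compiler must allow for overlap */
#endif

/* Hypothetical matrix-vector product b = a * x over `rows` rows. */
void matvec_sketch(size_t rows, FTYPE a[][COLWIDTH],
                   FTYPE *B_QUAL b, const FTYPE x[])
{
    assert(a != NULL && b != NULL && x != NULL);
    for (size_t i = 0; i < rows; i++) {
        b[i] = 0;
        for (size_t j = 0; j < COLWIDTH; j++) {
            /* Each store to b[i] might change a or x unless b is restrict. */
            b[i] += a[i][j] * x[j];
        }
    }
}
```

Compiled without NOALIAS defined, B_QUAL expands to nothing and the compiler has to allow for b overlapping a or x. With NOALIAS defined, b becomes a restrict pointer. The promise is the programmer's responsibility: calling the function with overlapping arrays while restrict is in effect is undefined behavior.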
To see if the NOALIAS macro improves performance, do the following:
Rebuild the sample application with the following options:
- /O3
- /Qstd=c99
- /fp:fast
- /Qopt-report:5
- /DNOALIAS
In the VECTORIZATION WORKFLOW pane, click the Collect control under Survey Target.
In the new Survey Report, notice:
The compiler successfully vectorizes one loop, but still cannot vectorize other loops because the compiler assumes there are dependencies that could make vectorization unsafe.
The Elapsed time improves.
The value in the Vector Instruction Set column in the top pane is SSE2, the default Vector Instruction Set Architecture (ISA). AVX2 is preferable.
The value in the Vector Length column in the top pane is 2;4, which means some vector lengths are 2 and some are 4.
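One way to read these values: Vector Length counts the elements that fit in one vector register. SSE2 registers are 128 bits wide, so a loop over 8-byte elements has length 2 and a loop over 4-byte elements has length 4; 256-bit AVX2 registers double both. The helper below is a hypothetical sketch of that arithmetic, and which element types the sample's loops actually use is not stated here.

```c
#include <assert.h>
#include <stddef.h>

/* Number of elements of elem_bytes each that fit in one vector register. */
size_t vector_lanes(size_t register_bits, size_t elem_bytes)
{
    assert(elem_bytes > 0 && register_bits % (8 * elem_bytes) == 0);
    return register_bits / (8 * elem_bytes);
}
```

For example, vector_lanes(128, sizeof(double)) gives 2 on platforms where double is 8 bytes, and vector_lanes(256, sizeof(double)) gives 4.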
Generate Instructions for the Highest Instruction Set Available
To see if generating instructions for the highest instruction set available on the compilation host processor improves performance, do the following:
Rebuild the sample application with the following options:
- /O3
- /Qstd=c99
- /fp:fast
- /Qopt-report:5
- /DNOALIAS
- /QxHost
In the VECTORIZATION WORKFLOW pane, click the Collect control under Survey Target.
In the new Survey Report, notice:
Handle Dependencies
For safety purposes, the compiler is often conservative when assuming data dependencies.
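To make the distinction concrete, the sketch below contrasts a loop with a real loop-carried read-after-write (RAW) dependency with a loop whose iterations are independent. The function names are hypothetical and the loops are not taken from the sample.

```c
#include <assert.h>
#include <stddef.h>

/* Real RAW dependency: iteration i reads v[i - 1], which iteration i - 1 wrote. */
void running_sum(double *v, size_t n)
{
    assert(v != NULL || n == 0);
    for (size_t i = 1; i < n; i++) {
        v[i] += v[i - 1];
    }
}

/* No loop-carried dependency: each iteration touches only its own element. */
void scale(double *v, size_t n, double k)
{
    assert(v != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        v[i] *= k;
    }
}
```

Forcing vectorization of a loop like running_sum could change its results, whereas the iterations of scale can be processed several at a time.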
To run a Dependencies analysis to identify and explore real loop-carried dependencies, do the following:
Choose Project > Intel Advisor version Project Properties… to open the Project Properties dialog box.
On the left side of the Analysis Target tab, select the Dependencies Analysis type.
If necessary, click the Browse… button next to the Application field to choose the just-built binary.
Click the OK button.
In the
column in the Survey Report window, select the checkbox for the two loops with assumed dependencies.
In the VECTORIZATION WORKFLOW pane, click the Collect control under Check Dependencies to produce a Dependencies Report.
(If the analysis takes more than 5 minutes, click the Stop current analysis and display result collected thus far control under Check Dependencies.)
In the Refinement Reports window, notice the Intel Advisor reports no dependencies in two loops and a RAW (Read after write) dependency in one loop. Forcing the compiler to vectorize:
Align Data
The compiler can generate faster code when operating on aligned data.
The ALIGNED macro:
Aligns the arrays a, b, and x in Driver.c on a 16-byte boundary.
Pads the row length of the matrix, a, to be a multiple of 16 bytes, so each individual row of a is 16-byte aligned.
Tells the compiler it can safely assume the arrays in Multiply.c are aligned.
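The padding and alignment arithmetic behind such a macro can be sketched as follows. The helper names and the ALIGN_BYTES constant are hypothetical; the sample's own macros may be written differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ALIGN_BYTES 16

/* Round a row length (in elements) up so each row spans a multiple of 16 bytes. */
size_t padded_cols(size_t cols, size_t elem_bytes)
{
    assert(elem_bytes > 0 && ALIGN_BYTES % elem_bytes == 0);
    size_t per_block = ALIGN_BYTES / elem_bytes; /* elements per 16 bytes */
    return (cols + per_block - 1) / per_block * per_block;
}

/* Nonzero if p sits on a 16-byte boundary. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p % ALIGN_BYTES) == 0;
}
```

With 8-byte elements, a row of 101 columns would be padded to 102. If the first row starts on a 16-byte boundary, every later row then does too.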
To see if the ALIGNED macro improves performance, do the following:
Rebuild the sample application with the following options:
- /O3
- /Qstd=c99
- /fp:fast
- /Qopt-report:5
- /DNOALIAS
- /QxHost
- /DALIGNED
In the VECTORIZATION WORKFLOW pane, click the Collect control under Survey Target.
In the new Survey Report, notice the Elapsed time shows little improvement.
Reorganize Code
When you use the matvec function in the sample application, the compiler cannot determine it is safe to vectorize the loop because it cannot tell if a and b are unique arrays.
When you inline the loop instead, the compiler can determine it is safe to vectorize the loop because it can tell exactly which variables you want processed in the loop.
The NOFUNCCALL macro removes the matvec function.
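A minimal sketch of the inlined form follows. The array sizes and the function name are illustrative assumptions, not the sample's source.

```c
#include <assert.h>
#include <stddef.h>

#define ROWS 2
#define COLS 4

/* Distinct arrays declared where the loop can see them (sizes are illustrative). */
double a[ROWS][COLS];
double b[ROWS];
double x[COLS];

/* The loop names a, b, and x directly instead of receiving pointer arguments,
   so the compiler can see that they are separate objects. */
void multiply_inline_sketch(size_t rows)
{
    assert(rows <= ROWS);
    for (size_t i = 0; i < rows; i++) {
        b[i] = 0.0;
        for (size_t j = 0; j < COLS; j++) {
            b[i] += a[i][j] * x[j];
        }
    }
}
```

Because the loop refers to declared arrays rather than to pointers of unknown origin, no aliasing assumption is needed to prove that stores to b leave a and x untouched.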
To see if the NOFUNCCALL macro improves performance, do the following:
Rebuild the sample application with the following options:
- /O3
- /Qstd=c99
- /fp:fast
- /Qopt-report:5
- /DNOALIAS
- /QxHost
- /DALIGNED
- /DNOFUNCCALL
In the VECTORIZATION WORKFLOW pane, click the Collect control under Survey Target.
In the new Survey Report, notice the Elapsed time improves substantially.
Vectorization Sample README-Windows* OS/Visual Studio IDE
This README shows how to use the Intel® Advisor XE 2016 plug-in to the Microsoft Visual Studio* 2013 IDE to improve the performance of a C++ sample application. Follow these steps:
Prepare the sample application.
Establish a performance baseline.
Get to know the Vectorization Workflow.
Increase the optimization level.
Disambiguate pointers.
Generate instructions for the highest instruction set available.
Handle dependencies.
Align data.
Reorganize code.
Prepare the Sample Application
Get Software Tools and Unpack the Sample
You need the following tools:
Intel Advisor XE 2016
Version 15.0 or higher of an Intel C++ compiler or a supported compiler
Use an Intel compiler to get more benefit from the Vectorization Advisor Survey Report. See the Release Notes for more information on supported compilers.
.zip file extraction utility
Acquire and Install Intel Software Tools
If you do not already have access to the Intel Advisor XE 2016 or to Version 15.0 or higher of an Intel C++ compiler, download an evaluation copy from http://software.intel.com/en-us/articles/intel-software-evaluation-center/.
Set Up the Intel Advisor Sample Application
Copy the vec_samples.zip file from the <advisor-install-dir>\samples\<locale>\C++\ directory to a writable directory or share on your system.
The default installation path, <advisor-install-dir>, is C:\Program Files (x86)\IntelSWTools\Advisor XE 201n\ (on certain systems, instead of Program Files (x86), the directory name is Program Files).
Extract the sample from the .zip file.
Open the Microsoft Visual Studio* Solution
Launch the Microsoft Visual Studio* IDE.
If necessary, choose View > Solution Explorer.
Choose File > Open > Project/Solution.
In the Open Project dialog box, open the vec_samples.sln file.
Prepare the Project
Right-click the vec_samples project in the Solution Explorer. Then choose Intel Compiler XE > Use Intel C++.
If the Solution Configurations drop-down on the Visual Studio* Standard toolbar is set to Debug, change it to Release.
Choose Build > Clean Solution.
Establish a Performance Baseline
To set a performance baseline for the improvements that will follow, do the following:
Right-click the vec_samples project in the Solution Explorer. Then choose Project > Properties to display the Property Pages window.
Choose Configuration Properties > C/C++ > Optimization. In the Optimization drop-down, choose Minimum Size (/O1).
Choose Configuration Properties > C/C++ > Diagnostics [Intel C++]. In the Optimization Diagnostic Level drop-down, choose Level 5 [/Qopt-report:5].
Click the Apply button, then click the OK button.
Choose Build > Rebuild Solution.
Right-click the vec_samples project in the Solution Explorer. Then choose Intel Advisor XE 2016 > Start Survey Analysis to produce a Survey Report.
If necessary, select the Do not show this window again checkbox in the infotip, then close the infotip.
In the Survey Report window, notice:
The Elapsed time value in the top left corner is ~16 seconds. (Your value may vary.) This is the baseline against which subsequent improvements will be measured.
In the Loop Type column in the top pane, all detected loops are Scalar.
Get to Know the Vectorization Workflow
The VECTORIZATION WORKFLOW in the left pane is a recommended usage scenario.
Survey Target – This analysis produces a Survey Report (currently displayed) that offers integrated compiler report data and performance data all in one place. Use this information to help identify:
Where vectorization will pay off the most
If vectorized loops are providing benefit, and if not, why not
Un-vectorized and under-vectorized loops, and the estimated expected performance gain of vectorization or better vectorization
How data accessed by vectorized loops is organized and the estimated expected performance gain of reorganization
Find Trip Counts – This optional analysis dynamically identifies the number of times loops are invoked and execute (sometimes called call count/loop count and iteration count respectively), and adds this information to the Survey Report. Use this information to make better decisions about your vectorization strategy for particular loops, as well as to optimize already-parallel loops.
Check Dependencies – This optional analysis produces one of two optional Refinement Reports if you want to dig deeper. For safety purposes, the compiler is often conservative when assuming data dependencies. Use the Dependencies Report to check for real data dependencies in loops the compiler did not vectorize because of assumed dependencies. If real dependencies are detected, the report can provide additional details to help resolve the dependencies. Your objective: Identify and better characterize real data dependencies that could make forced vectorization unsafe.
Check Memory Access Patterns – This optional analysis produces one of two optional Refinement Reports if you want to dig deeper. Use the Memory Access Patterns (MAP) Report to check for various memory issues, such as non-contiguous memory accesses and unit stride vs. non-unit stride accesses. Your objective: Eliminate issues that could lead to significant vector code execution slowdown or block automatic vectorization by the compiler.
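As an illustration of the unit-stride versus non-unit-stride distinction that the MAP analysis checks, the sketch below walks a row-major matrix both ways. The COLS value and function names are hypothetical and are not part of the sample.

```c
#include <assert.h>
#include <stddef.h>

#define COLS 4

/* Unit stride: consecutive iterations read adjacent elements of one row. */
double sum_row(double m[][COLS], size_t row)
{
    double sum = 0.0;
    for (size_t j = 0; j < COLS; j++) {
        sum += m[row][j];
    }
    return sum;
}

/* Non-unit stride: consecutive iterations read elements COLS apart in memory. */
double sum_col(double m[][COLS], size_t rows, size_t col)
{
    assert(col < COLS);
    double sum = 0.0;
    for (size_t i = 0; i < rows; i++) {
        sum += m[i][col];
    }
    return sum;
}
```

Both loops do the same amount of arithmetic, but only sum_row reads memory contiguously, which is the pattern vector loads handle best.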
Increase the Optimization Level
To see if increasing the optimization level improves performance, do the following:
Right-click the vec_samples project in the Solution Explorer. Then choose Project > Properties.
Choose Configuration Properties > C/C++ > Optimization. In the Optimization drop-down, choose Highest Optimizations (/O3).
Click the Apply button, then click the OK button.
Choose Build > Rebuild Solution.
In the VECTORIZATION WORKFLOW pane, click the Collect control under Survey Target.
In the new Survey Report, notice:
Increasing the optimization level does not vectorize the loops; all are still scalar.
Advisor XE explains why the loops are still scalar in the Vector Issues and Why No Vectorization? columns: The compiler assumed there are dependencies that could make vectorization unsafe. The Survey Report also offers recommendations for how to fix this issue.
Try clicking a:
The Elapsed time improves.
Disambiguate Pointers
Two pointers are aliased if both point to the same memory location. Storing to memory using a pointer that might be aliased may prevent some optimizations. For example, it may create a dependency between loop iterations that would make vectorization unsafe. Sometimes, the compiler can generate both a vectorized and a non-vectorized version of a loop and test for aliasing at runtime to select the appropriate code path. If you know that pointers do not alias and inform the compiler, it can avoid the runtime check and generate a single vectorized code path.
In Multiply.c, the compiler generates runtime checks to determine whether or not the pointer b in function matvec(FTYPE a[][COLWIDTH], FTYPE b[], FTYPE x[]) is aliased to either a or x. If Multiply.c is compiled with the NOALIAS macro, the restrict qualifier of the argument b informs the compiler that the pointer does not alias with any other pointer, and in particular that the array b does not overlap with a or x.
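The runtime aliasing test described above can be imitated by hand. The sketch below is only an analogue of what a compiler might generate, not its actual output; the function names and the return-value convention are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Nonzero if the n-element ranges starting at p and q share any bytes. */
int ranges_overlap(const double *p, const double *q, size_t n)
{
    uintptr_t ps = (uintptr_t)p, qs = (uintptr_t)q;
    uintptr_t bytes = (uintptr_t)(n * sizeof(double));
    return ps < qs + bytes && qs < ps + bytes;
}

/* Adds x into y element by element. Returns 1 if the no-overlap path ran,
   0 if the conservative path ran. */
int add_into(double *y, const double *x, size_t n)
{
    assert(y != NULL && x != NULL);
    if (!ranges_overlap(y, x, n)) {
        /* Iterations are independent here, so this copy may be vectorized. */
        for (size_t i = 0; i < n; i++) {
            y[i] += x[i];
        }
        return 1;
    }
    /* Possible overlap: keep strict element order. */
    for (size_t i = 0; i < n; i++) {
        y[i] += x[i];
    }
    return 0;
}
```

When y and x overlap, for example add_into(buf + 1, buf, 3), each iteration reads a value the previous iteration wrote, so only the ordered path gives the expected result. Telling the compiler that no overlap exists removes the need for the check and the second copy of the loop.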
To see if the NOALIAS macro improves performance, do the following:
Right-click the vec_samples project in the Solution Explorer. Then choose Project > Properties.
Choose Configuration Properties > C/C++ > Command Line. In the Additional Options area, type /DNOALIAS.
Click the Apply button, then click the OK button.
Choose Build > Rebuild Solution.
In the VECTORIZATION WORKFLOW pane, click the Collect control under Survey Target.
In the new Survey Report, notice:
The compiler successfully vectorizes one loop, but still cannot vectorize other loops because the compiler assumes there are dependencies that could make vectorization unsafe.
The Elapsed time improves.
The value in the Vector Instruction Set column in the top pane is SSE2, the default Vector Instruction Set Architecture (ISA). AVX2 is preferable.
The value in the Vector Length column in the top pane is 2;4, which means some vector lengths are 2 and some are 4.
Generate Instructions for the Highest Instruction Set Available
To see if generating instructions for the highest instruction set available on the compilation host processor improves performance, do the following:
Right-click the vec_samples project in the Solution Explorer. Then choose Project > Properties.
Choose Configuration Properties > C/C++ > Code Generation [Intel C++]. In the Intel Processor-Specific Optimization drop-down, choose Same as the host processor performing the compilation (/QxHost).
Click the Apply button, then click the OK button.
Choose Build > Rebuild Solution.
In the VECTORIZATION WORKFLOW pane, click the Collect control under Survey Target.
In the new Survey Report, notice:
Handle Dependencies
For safety purposes, the compiler is often conservative when assuming data dependencies.
To run a Dependencies analysis to identify and explore real loop-carried dependencies, do the following:
In the
column, select the checkbox for the two loops with assumed dependencies.
In the VECTORIZATION WORKFLOW pane, click the Collect control under Check Dependencies to produce a Dependencies Report.
(If the analysis takes more than 5 minutes, click the Stop current analysis and display result collected thus far control under Check Dependencies.)
In the Refinement Reports window, notice the Intel Advisor reports no dependencies in two loops and a RAW (Read after write) dependency in one loop. Forcing the compiler to vectorize:
Align Data
The compiler can generate faster code when operating on aligned data.
The ALIGNED macro:
Aligns the arrays a, b, and x in Driver.c on a 16-byte boundary.
Pads the row length of the matrix, a, to be a multiple of 16 bytes, so each individual row of a is 16-byte aligned.
Tells the compiler it can safely assume the arrays in Multiply.c are aligned.
To see if the ALIGNED macro improves performance, do the following:
Right-click the vec_samples project in the Solution Explorer. Then choose Project > Properties.
Choose Configuration Properties > C/C++ > Command Line. In the Additional Options area, type /DALIGNED.
Click the Apply button, then click the OK button.
Choose Build > Rebuild Solution.
In the VECTORIZATION WORKFLOW pane, click the Collect control under Survey Target.
In the new Survey Report, notice the Elapsed time shows little improvement.
Reorganize Code
When you use the matvec function in the sample application, the compiler cannot determine it is safe to vectorize the loop because it cannot tell if a and b are unique arrays.
When you inline the loop instead, the compiler can determine it is safe to vectorize the loop because it can tell exactly which variables you want processed in the loop.
The NOFUNCCALL macro removes the matvec function.
To see if the NOFUNCCALL macro improves performance, do the following:
Right-click the vec_samples project in the Solution Explorer. Then choose Project > Properties.
Choose Configuration Properties > C/C++ > Command Line. In the Additional Options area, type /DNOFUNCCALL.
Click the Apply button, then click the OK button.
Choose Build > Rebuild Solution.
In the VECTORIZATION WORKFLOW pane, click the Collect control under Survey Target.
In the new Survey Report, notice the Elapsed time improves substantially.
Vectorization Sample README-Linux* OS/Standalone GUI
This README shows how to use the Intel® Advisor XE 2016 GUI to improve the performance of a C++ sample application. Follow these steps:
Prepare the sample application.
Establish a performance baseline.
Get to know the Vectorization Workflow.
Increase the optimization level.
Disambiguate pointers.
Generate instructions for the highest instruction set available.
Handle dependencies.
Align data.
Reorganize code.
Prepare the Sample Application
Get Software Tools and Unpack the Sample
You need the following tools:
Intel Advisor XE 2016
Version 15.0 or higher of an Intel C++ compiler or a supported compiler
Use an Intel compiler to get more benefit from the Vectorization Advisor Survey Report. See the Release Notes for more information on supported compilers.
.tgz file extraction utility
Acquire and Install Intel Software Tools
If you do not already have access to the Intel Advisor XE 2016 or to Version 15.0 or higher of an Intel C++ compiler, download an evaluation copy from http://software.intel.com/en-us/articles/intel-software-evaluation-center/.
Set Up the Intel Advisor Sample Application
Copy the vec_samples.tgz file from the <advisor-install-dir>/samples/<locale>/C++/ directory to a writable directory or share on your system.
The default installation path, <advisor-install-dir>:
Extract the sample from the .tgz file.
Build the Sample Application in Release Mode
Set the environment for version 15.0 or higher of an Intel compiler.
Build the sample application with the following options in release mode:
- -O1
- -std=c99
- -fp-model fast
- -qopt-report=5
For example:
icpc -O1 -std=c99 -fp-model fast -qopt-report=5 Multiply.c Driver.c -o MatVector
Launch the Intel Advisor
Run the advixe-gui command.
NOTE: Make sure you run the Intel Advisor in the same environment as the sample application.
Create a New Project
Choose File > New > Project… (or click New Project… in the Welcome page) to open the Create a Project dialog box.
Type vec_samples in the Project Name field, supply a location for the sample application project, then click the Create Project button to open the Project Properties dialog box.
On the left side of the Analysis Target tab, ensure the Survey Hotspots/Suitability Analysis type is selected.
Click the Browse… button next to the Application field, and choose the just-built binary.
Click the OK button to close the Project Properties window and open an empty Survey Report window.
Establish a Performance Baseline
To set a performance baseline for the improvements that will follow, do the following:
In the VECTORIZATION WORKFLOW pane, click the Collect control under Survey Target to produce a Survey Report.
If necessary, select the Do not show this window again checkbox in the infotip, then close the infotip.
In the Survey Report window, notice:
The Elapsed time value in the top left corner is ~16 seconds. (Your value may vary.) This is the baseline against which subsequent improvements will be measured.
In the Loop Type column in the top pane, all detected loops are Scalar.
Get to Know the Vectorization Workflow
The VECTORIZATION WORKFLOW in the left pane is a recommended usage scenario.
Survey Target – This analysis produces a Survey Report (currently displayed) that offers integrated compiler report data and performance data all in one place. Use this information to help identify:
Where vectorization will pay off the most
If vectorized loops are providing benefit, and if not, why not
Un-vectorized and under-vectorized loops, and the estimated expected performance gain of vectorization or better vectorization
How data accessed by vectorized loops is organized and the estimated expected performance gain of reorganization
Find Trip Counts – This optional analysis dynamically identifies the number of times loops are invoked and execute (sometimes called call count/loop count and iteration count respectively), and adds this information to the Survey Report. Use this information to make better decisions about your vectorization strategy for particular loops, as well as to optimize already-parallel loops.
Check Dependencies – This optional analysis produces one of two optional Refinement Reports if you want to dig deeper. For safety purposes, the compiler is often conservative when assuming data dependencies. Use the Dependencies Report to check for real data dependencies in loops the compiler did not vectorize because of assumed dependencies. If real dependencies are detected, the report can provide additional details to help resolve the dependencies. Your objective: Identify and better characterize real data dependencies that could make forced vectorization unsafe.
Check Memory Access Patterns – This optional analysis produces one of two optional Refinement Reports if you want to dig deeper. Use the Memory Access Patterns (MAP) Report to check for various memory issues, such as non-contiguous memory accesses and unit stride vs. non-unit stride accesses. Your objective: Eliminate issues that could lead to significant vector code execution slowdown or block automatic vectorization by the compiler.
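The call count and iteration count that the Find Trip Counts analysis reports can be pictured with hand-placed counters. The sketch below is an illustration only; Intel Advisor collects this data without source changes, and the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hand-rolled counters standing in for what the analysis measures. */
struct loop_stats {
    unsigned long calls;      /* times the loop was entered (call count) */
    unsigned long iterations; /* total loop-body executions */
};

/* Sums n elements while recording one call and n iterations. */
double sum_counted(const double *v, size_t n, struct loop_stats *s)
{
    assert(v != NULL && s != NULL);
    double sum = 0.0;
    s->calls++;
    for (size_t i = 0; i < n; i++) {
        s->iterations++;
        sum += v[i];
    }
    return sum;
}

/* Average iterations per invocation, or 0 if the loop never ran. */
double average_trip_count(const struct loop_stats *s)
{
    return s->calls ? (double)s->iterations / (double)s->calls : 0.0;
}
```

A loop that is entered often but iterates only a few times each time leaves little work per vector iteration, which is why the trip count matters when choosing a vectorization strategy.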
Increase the Optimization Level
To see if increasing the optimization level improves performance, do the following:
Rebuild the sample application with the following options:
- -O3
- -std=c99
- -fp-model fast
- -qopt-report=5
In the VECTORIZATION WORKFLOW pane, click the Collect control under Survey Target.
In the new Survey Report, notice:
Increasing the optimization level does not vectorize the loops; all are still scalar.
Advisor XE explains why the loops are still scalar in the Vector Issues and Why No Vectorization? columns: The compiler assumed there are dependencies that could make vectorization unsafe. The Survey Report also offers recommendations for how to fix this issue.
Try clicking a:
The Elapsed time improves.
Disambiguate Pointers
Two pointers are aliased if both point to the same memory location. Storing to memory using a pointer that might be aliased may prevent some optimizations. For example, it may create a dependency between loop iterations that would make vectorization unsafe. Sometimes, the compiler can generate both a vectorized and a non-vectorized version of a loop and test for aliasing at runtime to select the appropriate code path. If you know that pointers do not alias and inform the compiler, it can avoid the runtime check and generate a single vectorized code path.
In Multiply.c, the compiler generates runtime checks to determine whether or not the pointer b in function matvec(FTYPE a[][COLWIDTH], FTYPE b[], FTYPE x[]) is aliased to either a or x. If Multiply.c is compiled with the NOALIAS macro, the restrict qualifier of the argument b informs the compiler that the pointer does not alias with any other pointer, and in particular that the array b does not overlap with a or x.
To see if the NOALIAS macro improves performance, do the following:
Rebuild the sample application with the following options:
- -O3
- -std=c99
- -fp-model fast
- -qopt-report=5
- -D NOALIAS
In the VECTORIZATION WORKFLOW pane, click the Collect control under Survey Target.
In the new Survey Report, notice:
The compiler successfully vectorizes one loop, but still cannot vectorize other loops because the compiler assumes there are dependencies that could make vectorization unsafe.
The Elapsed time improves.
The value in the Vector Instruction Set column in the top pane is SSE2, the default Vector Instruction Set Architecture (ISA). AVX2 is preferable.
The value in the Vector Length column in the top pane is 2;4, which means some vector lengths are 2 and some are 4.
Generate Instructions for the Highest Instruction Set Available
To see if generating instructions for the highest instruction set available on the compilation host processor improves performance, do the following:
Rebuild the sample application with the following options:
- -O3
- -std=c99
- -fp-model fast
- -qopt-report=5
- -D NOALIAS
- -xHost
In the VECTORIZATION WORKFLOW pane, click the Collect control under Survey Target.
In the new Survey Report, notice:
Handle Dependencies
For safety purposes, the compiler is often conservative when assuming data dependencies.
To run a Dependencies analysis to identify and explore real loop-carried dependencies, do the following:
Choose Project > Intel Advisor version Project Properties… to open the Project Properties dialog box.
On the left side of the Analysis Target tab, select the Dependencies Analysis type.
If necessary, click the Browse… button next to the Application field to choose the just-built binary.
Click the OK button.
In the
column in the Survey Report window, select the checkbox for the two loops with assumed dependencies.
In the VECTORIZATION WORKFLOW pane, click the Collect control under Check Dependencies to produce a Dependencies Report.
(If the analysis takes more than 5 minutes, click the Stop current analysis and display result collected thus far control under Check Dependencies.)
In the Refinement Reports window, notice the Intel Advisor reports no dependencies in two loops and a RAW (Read after write) dependency in one loop. Forcing the compiler to vectorize:
Align Data
The compiler can generate faster code when operating on aligned data.
The ALIGNED macro:
Aligns the arrays a, b, and x in Driver.c on a 16-byte boundary.
Pads the row length of the matrix, a, to be a multiple of 16 bytes, so each individual row of a is 16-byte aligned.
Tells the compiler it can safely assume the arrays in Multiply.c are aligned.
To see if the ALIGNED macro improves performance, do the following:
Rebuild the sample application with the following options:
- -O3
- -std=c99
- -fp-model fast
- -qopt-report=5
- -D NOALIAS
- -xHost
- -D ALIGNED
In the VECTORIZATION WORKFLOW pane, click the Collect control under Survey Target.
In the new Survey Report, notice the Elapsed time shows little improvement.
Reorganize Code
When you use the matvec function in the sample application, the compiler cannot determine it is safe to vectorize the loop because it cannot tell if a and b are unique arrays.
When you inline the loop instead, the compiler can determine it is safe to vectorize the loop because it can tell exactly which variables you want processed in the loop.
The NOFUNCCALL macro removes the matvec function.
To see if the NOFUNCCALL macro improves performance, do the following:
Rebuild the sample application with the following options:
- -O3
- -std=c99
- -fp-model fast
- -qopt-report=5
- -D NOALIAS
- -xHost
- -D ALIGNED
- -D NOFUNCCALL
In the VECTORIZATION WORKFLOW pane, click the Collect control under Survey Target.
In the new Survey Report, notice the Elapsed time improves substantially.
For More Information
Start with the following resources: