Calculating estimated call counts with Intel® VTune™ Amplifier XE 2013

When you profile your software with VTune™ Amplifier XE, you often start by looking at the list of top hotspot functions. This shows which functions consume the most CPU resources, so you can focus your optimization efforts there.

Function call counts can provide some additional information to assist in further optimization.

A hotspot function’s CPU time is a measure of overall time spent there during a collection. There may be multiple calls to a function, some of longer duration and some shorter. If you know call counts along with CPU time/clock ticks, you can then calculate the clock ticks spent in a function during each call. Depending on the call counts you may choose different optimization techniques:

If you’re thinking about introducing parallelism and time-per-call is large, you can do it inside the heavy function.
If time-per-call is small, it may make sense to move your parallel construct to a higher level in the function call stack.
Also, don’t forget about inlining: it makes sense for functions with a significant call count and a small time-per-call, because function invocation overhead may be comparable to the work done in each call.
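The rules of thumb above can be sketched as a small helper. This is only an illustration: the function name `suggest` and all three thresholds are hypothetical and are not part of VTune Amplifier XE. Time per call here is a function's CPU time divided by its estimated call count.

```python
# Hypothetical thresholds, chosen only for illustration; tune them for your workload.
HEAVY_CALL_SECONDS = 1e-3   # a single call holds enough work to split across threads
SMALL_CALL_SECONDS = 1e-6   # call overhead may be comparable to the work done
MANY_CALLS = 100_000        # "significant" call count


def suggest(time_per_call_s, call_count):
    """Map time per call and call count to one of the article's suggestions."""
    if time_per_call_s >= HEAVY_CALL_SECONDS:
        return "parallelize inside the function"
    if time_per_call_s <= SMALL_CALL_SECONDS:
        if call_count >= MANY_CALLS:
            return "parallelize higher in the call stack; consider inlining"
        return "parallelize higher in the call stack"
    # The article gives no rule for the middle range.
    return "no clear guidance from call counts alone"


if __name__ == "__main__":
    print(suggest(0.02, 100))
    print(suggest(5e-7, 1_000_000))
```

The thresholds are deliberately kept as named constants, because what counts as "heavy" or "small" depends on the machine and on the threading runtime in use.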

VTune Amplifier XE 2013 can provide call count information. This metric is available for Hardware Event-based Sampling analysis types, such as Lightweight Hotspots.

The “Collect stacks” and “Estimate call counts” options must both be enabled to collect call counts:

This can also be done from the command line:

$ amplxe-cl -collect lightweight-hotspots -knob enable-stack-collection=true -knob enable-call-counts=true -- <target_application>

With these options you’ll be able to see the estimated call counts. See how it looks in this Bottom-up view:

In the Top-down view you can see total and self call counts:

If you switch to the “Hardware Event Counts” viewpoint, you can easily calculate events per call, e.g. clock ticks per single function run:
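As a minimal sketch of that arithmetic, the snippet below divides an event count by the estimated call count. The function names and all numbers are hypothetical and were not taken from a real result. A zero estimate is treated as "unknown" rather than as a divisor, since the tool reports zero when it cannot produce an estimate.

```python
def events_per_call(event_count, estimated_calls):
    """Return events per call, or None when no call count estimate exists."""
    if estimated_calls == 0:
        return None  # a zero estimate means "too few calls to estimate"
    return event_count / estimated_calls


# Hypothetical rows: (function, clockticks, estimated call count).
rows = [
    ("solve", 8_000_000_000, 2_000),
    ("dot", 3_000_000_000, 6_000_000),
    ("init", 40_000_000, 0),
]

for name, clockticks, calls in rows:
    per_call = events_per_call(clockticks, calls)
    if per_call is None:
        print(f"{name}: no call count estimate")
    else:
        print(f"{name}: {per_call:,.0f} clockticks per call")
```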

Things to remember about the “estimated call counts” feature

Call counts are estimated, which means they are calculated statistically. They are not exact call counts. A zero value only means that the function was called relatively few times, which might still be hundreds or even thousands of calls.
The call count column often appears in the right part of the grid, so it is not visible initially. Scroll right to find it, and move it to the left if needed (as was done in the screenshots above).
Call count collection introduces additional overhead of 20% or more, though this is much lower than the overhead of collecting exact call counts with binary instrumentation.
Collecting call count info significantly increases profile data volume. This leads to increased size of analysis results and significantly increased RAM usage when browsing the results in the GUI.
If you experience a significant slowdown or excessive memory usage, consider reducing the amount of analysis data, for example by increasing the “sample after value” for collected events (by creating a custom analysis type).
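To see why a larger “sample after value” reduces data volume, recall that event-based sampling takes one sample each time the event counter overflows, so the sample count is roughly the total event count divided by the sample after value. The sketch below illustrates this with hypothetical numbers (a 3 GHz core running for 10 seconds). The function name is an assumption made for illustration.

```python
def expected_samples(total_events, sample_after_value):
    """Approximate number of samples: one per counter overflow."""
    if sample_after_value <= 0:
        raise ValueError("sample_after_value must be positive")
    return total_events // sample_after_value


# Hypothetical workload: about 3e10 clockticks on a single core.
total_clockticks = 30_000_000_000

for sav in (2_000_000, 10_000_000):
    print(f"SAV {sav:>10,}: ~{expected_samples(total_clockticks, sav):,} samples")
```

Fewer samples also mean fewer branch trace bursts, at the cost of coarser statistics for functions that are not very hot.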

A few words about the technology

Collecting estimated call counts is based on BTS (Branch Trace Store). This is a hardware feature of Intel® processors that automatically stores information about all taken branches in a memory buffer. Function calls are treated as branches, and call information is extracted from this buffer.

If a function is hot, it is statistically visible: it is likely to be interrupted by a performance monitoring interrupt (PMI), which occurs when a hardware event counter overflows. Once the function is interrupted, the collection of branches is initiated. When the memory buffer containing branch records overflows, a branch tracing interrupt (BTI) is raised and the information is saved to disk (into a trace file). Collection then waits for the next sample and gathers another “branch bunch”, and so on.

After collection is finished, the trace files are analyzed and call counts are separated from other branching information. VTune Amplifier XE then estimates statistical call counts from the call counts in the trace files, the frequency of samples, and the total number of branches in the program. Rarely called functions appear in only a few samples or not at all. Estimates based on such data would be too far from reality, so call counts for these functions are shown as zeros.
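The general idea of this extrapolation can be sketched as follows. This is not VTune Amplifier XE's actual algorithm, which the article does not describe in detail. It assumes a simple linear scaling of the calls seen in the traced branch bursts by the ratio of total branches to traced branches. The cutoff `MIN_OBSERVED_CALLS` is a hypothetical value that stands in for "too few observations to estimate".

```python
from collections import Counter

MIN_OBSERVED_CALLS = 5  # hypothetical cutoff below which the estimate is reported as 0


def estimate_call_counts(traced_calls, traced_branches, total_branches,
                         min_observed=MIN_OBSERVED_CALLS):
    """Scale call counts seen in traced branch bursts up to the whole run.

    traced_calls: iterable of callee names, one entry per call branch found
    in the trace files. traced_branches and total_branches are branch totals
    for the traced bursts and for the whole program, respectively.
    """
    if traced_branches <= 0:
        raise ValueError("traced_branches must be positive")
    scale = total_branches / traced_branches
    observed = Counter(traced_calls)
    return {
        name: round(count * scale) if count >= min_observed else 0
        for name, count in observed.items()
    }


if __name__ == "__main__":
    # Hypothetical trace: "dot" was seen 50 times, "init" only twice.
    calls = ["dot"] * 50 + ["init"] * 2
    print(estimate_call_counts(calls, traced_branches=1_000,
                               total_branches=1_000_000))
```

In this sketch a rarely seen function such as `init` is reported as zero even though it was called, which mirrors the behavior described above.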

Conclusion

The estimated call count feature of VTune Amplifier XE allows you to detect frequently called functions, so you can make informed decisions about inlining, introducing parallel constructs, and data decomposition. Statistical collection adds less overhead than exact call count collection methods, but the overhead may still be significant, especially in terms of memory usage when results are explored. Be sure to take that into account.