Extending the performance analysis tool box: Multi-stage CPI stacks and FLOPS stacks

S Eyerman, W Heirman, K Du Bois… - 2018 IEEE International Symposium on Performance Analysis of …, 2018 - ieeexplore.ieee.org
CPI stacks are an intuitive way to visualize processor core performance bottlenecks. However, they often do not provide a full view of all bottlenecks, because stall events can occur concurrently (e.g., an instruction cache miss and a data cache miss). To avoid double-counting penalties, typically only one of the events is selected, which means information about the non-chosen stall events is lost. Furthermore, we show that there is no single correct CPI stack: stall penalties can be hidden, can overlap, or can cause second-order effects, making total CPI more complex than just a sum of components. Instead of showing a single CPI stack, we propose to measure multiple CPI stacks during program execution: a CPI stack at each stage of the processor pipeline. This representation reveals all performance bottlenecks and provides a more complete view of the performance of an application. Additionally, we propose FLOPS stacks, targeted at HPC performance analysis. FLOPS stacks are a variant of CPI stacks at the issue stage, but instead of considering all instructions, they focus specifically on floating-point performance, which is the common definition of useful work in the HPC domain. Multi-stage CPI stacks and FLOPS stacks are easy to collect. We show that they can be included in a simulator with negligible slowdown, and we provide recommendations on how to include them in a hardware core.
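The idea of keeping one CPI stack per pipeline stage can be illustrated with a minimal sketch. This is not the authors' implementation. The stage names, the per-cycle trace format, the function name `multi_stage_cpi_stacks`, and the accounting rule (each cycle charges the unused fraction of the stage width to that stage's own stall reason) are all assumptions made for illustration. The toy trace shows an instruction cache miss and a data cache miss overlapping in the same cycles. A single stack would have to pick one, while the per-stage stacks keep both.

```python
from collections import defaultdict

# Hypothetical stage names, chosen for illustration only.
STAGES = ("dispatch", "issue", "commit")


def multi_stage_cpi_stacks(trace, width, instructions):
    """Build one CPI stack per pipeline stage from a per-cycle trace.

    trace: list of cycles; each cycle maps a stage name to a tuple
           (instructions processed this cycle, stall reason or None).
    width: maximum instructions a stage can process per cycle.
    instructions: total committed instructions, used to normalize to CPI.
    """
    cycles = {stage: defaultdict(float) for stage in STAGES}
    for cycle in trace:
        for stage in STAGES:
            n, reason = cycle[stage]
            used = n / width
            # The used fraction of the cycle counts as useful (base) work.
            cycles[stage]["base"] += used
            # Assumed rule: the unused fraction goes to this stage's own stall reason.
            if n < width:
                cycles[stage][reason] += 1.0 - used
    return {
        stage: {comp: c / instructions for comp, c in comps.items()}
        for stage, comps in cycles.items()
    }


# Toy trace: 4 cycles, width 2, 4 instructions. In cycles 2 and 3 the front end
# waits on an instruction cache miss while the back end waits on a data cache miss.
busy = {stage: (2, None) for stage in STAGES}
stalled = {"dispatch": (0, "icache"), "issue": (0, "dcache"), "commit": (0, "dcache")}
trace = [busy, stalled, stalled, busy]

stacks = multi_stage_cpi_stacks(trace, width=2, instructions=4)
for stage in STAGES:
    print(stage, stacks[stage])
```

In this sketch every stage's components sum to the same total CPI, but the dispatch stack attributes the stall cycles to the instruction cache while the issue and commit stacks attribute them to the data cache, so neither bottleneck is hidden.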