Enabling Branch-Mispredict Level Parallelism by Selectively Flushing Instructions

Authors:

Sam Van Den Steen,

Ibrahim Hur

MICRO '21: MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture

Pages 767 - 778

https://doi.org/10.1145/3466752.3480045

Published: 17 October 2021

Abstract

Conventionally, branch mispredictions are resolved by flushing wrongly speculated instructions from the reorder buffer and refetching instructions along the correct path. However, a large part of the misspeculated instructions could have reconverged with the correct path and executed correctly. Yet, they are flushed to ensure in-order commit. This inefficiency has been recognized in prior work, which proposes either complex additions to a core to reuse the correctly executed instructions, or less intrusive solutions that only reuse part of the converged instructions.

We propose a hardware-software cooperative mechanism to recover correctly executed instructions, avoiding the need to refetch and re-execute them. It combines relatively limited additions to the core architecture with a high reuse of reconverged instructions. Adding the software hints that enable our mechanism takes an effort similar to parallelizing an application, which is already necessary to extract high performance from current multicore processors. We evaluate the technique on emerging graph applications and on sorting, both of which are known to perform poorly on conventional CPUs, and report an average performance increase of 29%.
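The contrast the abstract draws between a conventional flush and a selective flush can be illustrated with a toy model. The sketch below is not the paper's mechanism: the `Instr` record, the `control_dependent` flag, and both recovery functions are hypothetical names introduced only for illustration. It also assumes that reconverged instructions have no data dependence on wrong-path results, which a real design would have to check.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    """One reorder-buffer (ROB) entry in this toy model (hypothetical)."""
    seq: int                 # program-order sequence number
    control_dependent: bool  # True if the entry exists only on the wrong path


def full_flush(rob, branch_seq):
    """Conventional recovery: discard every entry younger than the branch."""
    kept = [i for i in rob if i.seq <= branch_seq]
    flushed = [i for i in rob if i.seq > branch_seq]
    return kept, flushed


def selective_flush(rob, branch_seq):
    """Idealized selective recovery: discard only younger wrong-path entries.

    Assumption: reconverged entries do not consume wrong-path data.
    """
    kept = [i for i in rob if i.seq <= branch_seq or not i.control_dependent]
    flushed = [i for i in rob if i.seq > branch_seq and i.control_dependent]
    return kept, flushed


# Hypothetical ROB contents: mispredicted branch at seq 2, wrong-path
# instructions at seq 3-4, reconverged instructions at seq 5-9.
rob = [Instr(s, control_dependent=(3 <= s <= 4)) for s in range(10)]

_, flushed_all = full_flush(rob, branch_seq=2)
_, flushed_sel = selective_flush(rob, branch_seq=2)

# Number of instructions that must be refetched under each policy.
print(len(flushed_all), len(flushed_sel))  # prints "7 2"
```

In this example the conventional policy refetches all seven younger instructions, while the idealized selective policy refetches only the two that lie on the wrong path.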

Cited By

Aananthakrishnan S, Abedin S, Cavé V, Checconi F, Bois K, Eyerman S, Fryman J, Heirman W, Howard J, Hur I, Jain S, Landowski M, Ma K, Nelson J, Pawlowski R, Petrini F, Szkoda S, Tayal S, Tithi J, and Vandriessche Y. 2023. The Intel Programmable and Integrated Unified Memory Architecture Graph Analytics Processor. IEEE Micro 43, 5 (2023), 78-87. Online publication date: 1-Sep-2023. https://dl.acm.org/doi/10.1109/MM.2023.3295848

Eyerman S, Steen S, Heirman W, and Hur I. 2023. Simulating Wrong-Path Instructions in Decoupled Functional-First Simulation. In 2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 124-133. Online publication date: Apr-2023. https://doi.org/10.1109/ISPASS57527.2023.00021

Lan M, Huang L, Yang L, Ma S, Yan R, Wang Y, and Xu W. 2022. Late-Stage Optimization of Modern ILP Processor Cores via FPGA Simulation. Applied Sciences 12, 23 (2022), 12225. Online publication date: 29-Nov-2022. https://doi.org/10.3390/app122312225

Information & Contributors

Information

Published In

MICRO '21: MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture

October 2021

1322 pages

ISBN: 9781450385572

DOI: 10.1145/3466752

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGMICRO: ACM Special Interest Group on Microarchitectural Research and Processing

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

MICRO '21

Sponsor:

SIGMICRO

MICRO '21: 54th Annual IEEE/ACM International Symposium on Microarchitecture

October 18 - 22, 2021

Virtual Event, Greece

Acceptance Rates

Overall Acceptance Rate 484 of 2,242 submissions, 22%
