Enabling Branch-Mispredict Level Parallelism by Selectively Flushing Instructions

Published: 17 October 2021

Abstract

Conventionally, a branch misprediction is resolved by flushing the wrongly speculated instructions from the reorder buffer and refetching instructions along the correct path. However, a large fraction of the misspeculated instructions may lie beyond the point where the wrong path reconverges with the correct path, and may already have executed correctly. They are nevertheless flushed to ensure in-order commit. Prior work has recognized this inefficiency and proposes either complex additions to the core that reuse the correctly executed instructions, or less intrusive solutions that reuse only part of the reconverged instructions.
We propose a hardware-software cooperative mechanism that recovers correctly executed instructions, avoiding the need to refetch and re-execute them. It combines relatively limited additions to the core architecture with high reuse of reconverged instructions. Adding the software hints that enable our mechanism takes an effort similar to parallelizing an application, which is already necessary to extract high performance from current multicore processors. We evaluate the technique on emerging graph applications and on sorting, workloads that are known to perform poorly on conventional CPUs, and report an average performance increase of 29%.
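
To make the contrast between a conventional flush and a selective flush concrete, the following is a minimal toy sketch, not the mechanism proposed in the paper. The `Instr` record, its two flags, and the functions `full_flush` and `selective_flush` are hypothetical names introduced here for illustration. The model assumes it is already known which younger instructions were fetched only on the wrong path and which consume wrong-path values, and it ignores register renaming, the insertion of missing correct-path instructions, and commit ordering.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """Hypothetical reorder-buffer entry younger than a mispredicted branch."""
    name: str
    wrong_path: bool                     # fetched only because of the misprediction
    depends_on_wrong_path: bool = False  # consumes a value produced on the wrong path


def full_flush(younger):
    """Conventional recovery: discard every instruction younger than the branch."""
    return [], list(younger)


def selective_flush(younger):
    """Idealized selective recovery (assumption): keep instructions that lie past
    the reconvergence point and do not depend on wrong-path results."""
    kept, flushed = [], []
    for ins in younger:
        if ins.wrong_path or ins.depends_on_wrong_path:
            flushed.append(ins)
        else:
            kept.append(ins)
    return kept, flushed


def demo_window():
    """Made-up window contents: two wrong-path instructions, then three
    instructions after the paths reconverge, one of which is data dependent."""
    return [
        Instr("then_1", wrong_path=True),
        Instr("then_2", wrong_path=True),
        Instr("join_1", wrong_path=False),
        Instr("join_2", wrong_path=False, depends_on_wrong_path=True),
        Instr("join_3", wrong_path=False),
    ]


if __name__ == "__main__":
    window = demo_window()
    for policy in (full_flush, selective_flush):
        kept, flushed = policy(window)
        print(policy.__name__,
              "kept:", [i.name for i in kept],
              "flushed:", [i.name for i in flushed])
```

In this made-up window the conventional policy discards all five younger instructions, while the idealized selective policy retains `join_1` and `join_3`. The numbers are illustrative only and say nothing about the reuse rates or speedups reported in the paper.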

Cited By

  • (2023) The Intel Programmable and Integrated Unified Memory Architecture Graph Analytics Processor. IEEE Micro 43:5 (78-87). https://doi.org/10.1109/MM.2023.3295848. Online publication date: 1-Sep-2023
  • (2023) Simulating Wrong-Path Instructions in Decoupled Functional-First Simulation. 2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) (124-133). https://doi.org/10.1109/ISPASS57527.2023.00021. Online publication date: Apr-2023
  • (2022) Late-Stage Optimization of Modern ILP Processor Cores via FPGA Simulation. Applied Sciences 12:23 (12225). https://doi.org/10.3390/app122312225. Online publication date: 29-Nov-2022

Published In

MICRO '21: MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture
October 2021
1322 pages
ISBN:9781450385572
DOI:10.1145/3466752
Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

MICRO '21
Acceptance Rates

Overall Acceptance Rate 484 of 2,242 submissions, 22%
