Open access

MInGLE: An Efficient Framework for Domain Acceleration Using Low-Power Specialized Functional Units

Published: 14 June 2016

Abstract

The end of Dennard scaling leads to new research directions that try to cope with the utilization wall in modern chips, such as the design of specialized architectures. Processor customization utilizes transistors more efficiently, optimizing not only for performance but also for power. However, hardware specialization for each application is costly and impractical due to time-to-market constraints. Domain-specific specialization is an alternative that can increase hardware reutilization across applications that share similar computations. This article explores the specialization of low-power processors with custom instructions (CIs) that run on a specialized functional unit. We are the first, to our knowledge, to design CIs for an application domain and across basic blocks, selecting CIs that maximize both performance and energy efficiency improvements.
We present the Merged Instructions Generator for Large Efficiency (MInGLE), an automated framework that identifies and selects CIs. Our framework analyzes large sequences of code (across basic blocks) to maximize acceleration potential while also performing partial matching across applications to optimize for reuse of the specialized hardware. To do this, we convert the code into a new canonical representation, the Merging Diagram, which represents the code’s functionality instead of its structure. This is key to being able to find similarities across such large code sequences from different applications with different coding styles. Groups of potential CIs are clustered depending on their similarity score to effectively reduce the search space. Additionally, we create new CIs that cover not only whole-body loops but also fragments of the code to optimize hardware reutilization further. For a set of 11 applications from the media domain, our framework generates CIs that significantly improve the energy-delay product (EDP) and performance speedup. CIs with the highest utilization opportunities achieve an average EDP improvement of 3.8× compared to a baseline processor modeled after an Intel Atom. We demonstrate that we can efficiently accelerate a domain with partially matched CIs, and that their design time, from identification to selection, stays within tractable bounds.
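The clustering step mentioned in the abstract can be pictured with a small sketch. The abstract does not define the similarity score or the Merging Diagram, so the code below substitutes a plain Jaccard similarity over sets of operation names and a single-linkage grouping rule; both are assumptions made for illustration, as are the candidate names, the operation lists, and the threshold. It is not the authors' algorithm, only an indication of how grouping similar candidates shrinks the set of pairs that a later merging stage has to examine.

```python
from itertools import combinations


def jaccard(a, b):
    """Similarity in [0, 1] between two collections of operation names.

    Stand-in for the paper's similarity score, which is not given here.
    """
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_candidates(candidates, threshold):
    """Group candidates by single linkage using a union-find structure.

    Two candidates end up in the same cluster when a chain of pairwise
    similarities >= threshold connects them.
    """
    parent = {name: name for name in candidates}

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in combinations(candidates, 2):
        if jaccard(candidates[u], candidates[v]) >= threshold:
            parent[find(u)] = find(v)

    groups = {}
    for name in candidates:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical candidate CIs: name -> operations the code region contains.
candidates = {
    "jpeg_dct": ["mul", "add", "sub", "shr"],
    "mpeg_idct": ["mul", "add", "sub", "shr", "clamp"],
    "h264_sad": ["sub", "abs", "add"],
    "adpcm_step": ["shl", "and", "or"],
}

clusters = cluster_candidates(candidates, threshold=0.5)
print(clusters)
```

With this threshold only the two transform-like candidates are grouped, so a subsequent partial-matching pass would compare candidates within that cluster rather than across all pairs. Lowering the threshold merges more candidates into fewer, larger clusters, trading search-space reduction against the quality of the matches.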

Cited By

  • (2024) Reinforcement Learning for Selecting Custom Instructions Under Area Constraint. IEEE Transactions on Artificial Intelligence 5:4 (1882-1894). DOI: 10.1109/TAI.2023.3308099. Online publication date: Apr-2024
  • (2024) Automating application-driven customization of ASIPs. Journal of Systems Architecture: the EUROMICRO Journal 148:C. DOI: 10.1016/j.sysarc.2024.103080. Online publication date: 1-Mar-2024
  • (2021) NOVIA: A Framework for Discovering Non-Conventional Inline Accelerators. MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture (507-521). DOI: 10.1145/3466752.3480094. Online publication date: 18-Oct-2021

    Published In

    ACM Transactions on Architecture and Code Optimization  Volume 13, Issue 2
    June 2016
    200 pages
    ISSN:1544-3566
    EISSN:1544-3973
    DOI:10.1145/2952301
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 14 June 2016
    Accepted: 01 February 2016
    Revised: 01 February 2016
    Received: 01 August 2015
    Published in TACO Volume 13, Issue 2

    Author Tags

    1. Customization
    2. acceleration
    3. canonical representation
    4. clustering
    5. domain specific

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • European Research Council under the European Community's Seventh Framework Programme
    • ERC
    • Spanish Ministry of Science and Technology
    • Generalitat de Catalunya
    • Spanish Government under the Severo Ochoa program
