
Communication Minimized Model-Architecture Co-design for Efficient Convolution Acceleration

Published: 12 June 2024

Abstract

Convolutional Neural Networks (CNNs) are indispensable for today’s Artificial Intelligence (AI) applications, but their data communication dominates accelerator overhead. Prior work focuses either on off-chip access or on intuitive/heuristic on-chip access optimization; with the development of Near-Memory Processing (NMP), however, DRAM access cost has dropped sharply, and on- and off-chip access optimization needs rethinking as a whole. This paper therefore proposes a holistic, on- and off-chip communication-minimized model-architecture acceleration scheme for CNNs. First, we derive the layer-wise off-chip communication Lower Bound (LB) under different data-reuse strategies. Second, we derive the on-chip LB and present an overall on- and off-chip communication analysis model that provides solid guidance for on-chip storage allocation, dataflow, and architecture design. Finally, we design a Window-Primitive (WP) dataflow and a Systolic-Cross-Line (SCL) CNN accelerator based on the proposed theoretical model. SCL achieves a 3.8× pJ/MAC energy reduction with 1.4× less on-chip storage area than Eyeriss, and a 1.3–1.8× reduction with 3–4× less area than CLB. For NMP, we reduce access energy by about 2× compared with a previous systolic NMP architecture.
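The abstract's first step — comparing off-chip traffic under different data-reuse strategies — can be illustrated with a minimal sketch. This is not the paper's actual lower-bound model (which derives true layer-wise LBs); it is a hypothetical first-order estimator with made-up parameters, showing why the best reuse strategy depends on layer shape and on-chip buffer size:

```python
# Illustrative sketch only: first-order off-chip traffic estimates for one
# conv layer under three simple data-reuse strategies. All names and the
# cost model are hypothetical, not the paper's communication LB.

def conv_traffic(H, W, C, K, R, S, buf_bytes, elem=2):
    """Estimate off-chip bytes for an HxWxC input, K filters of RxSxC,
    stride 1, 'same'-sized output, elem bytes per element."""
    inp = H * W * C * elem        # input feature map footprint
    wts = K * R * S * C * elem    # weight footprint
    out = H * W * K * elem        # output feature map footprint

    est = {}
    # Weight-stationary: weights loaded once; inputs re-fetched once per
    # pass when the full weight set does not fit in the buffer.
    w_passes = max(1, -(-wts // buf_bytes))   # ceil division
    est["weight_stationary"] = wts + w_passes * inp + out
    # Input-stationary: inputs loaded once; weights re-streamed per tile.
    i_passes = max(1, -(-inp // buf_bytes))
    est["input_stationary"] = inp + i_passes * wts + out
    # Output-stationary: partial sums stay on chip; inputs and weights
    # each streamed once per output tile.
    o_passes = max(1, -(-out // buf_bytes))
    est["output_stationary"] = out + o_passes * (inp + wts)
    return est

# Example: a mid-network VGG-like layer with a 128 KiB on-chip buffer.
traffic = conv_traffic(H=56, W=56, C=64, K=64, R=3, S=3, buf_bytes=128 * 1024)
best = min(traffic, key=traffic.get)
```

Here the small weight set fits on chip, so weight-stationary wins; for early layers with large feature maps and few channels, the ranking flips, which is why the paper ties storage allocation and dataflow choice to a per-layer analysis.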

References

[1]
Tianshi Chen, Zidong Du, et al. 2014. DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. In ASPLOS ’14. Association for Computing Machinery, 269–284.
[2]
Xiaoming Chen, Yinhe Han, and Yu Wang. 2020. Communication lower bound in convolution accelerators. In 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 529–541.
[3]
Yu-Hsin Chen, Joel Emer, and Vivienne Sze. 2016. Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks. ACM SIGARCH Computer Architecture News 44, 3 (2016), 367–379.
[4]
Yu-Hsin Chen, Tushar Krishna, et al. 2016. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE Journal of Solid-State Circuits 52, 1 (2016), 127–138.
[5]
Fujun Bai, et al. 2020. A Stacked Embedded DRAM Array for LPDDR4/4X using Hybrid Bonding 3D Integration with 34GB/s/1Gb 0.88pJ/b Logic-to-Memory Interface. In 2020 IEEE International Electron Devices Meeting (IEDM).
[6]
Norman P. Jouppi, Cliff Young, et al. 2017. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA). 1–12.
[7]
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems, F. Pereira, C.J. Burges, L. Bottou, and K.Q. Weinberger (Eds.). Vol. 25. Curran Associates, Inc.
[8]
Yann LeCun, et al. 2010. Convolutional Networks and Applications in Vision. In Proceedings of 2010 IEEE International Symposium on Circuits and Systems (ISCAS).
[9]
Dimin Niu, Shuangchen Li, et al. 2022. 184QPS/W 64Mb/mm² 3D Logic-to-DRAM Hybrid Bonding with Process-Near-Memory Engine for Recommendation System. In 2022 IEEE International Solid-State Circuits Conference (ISSCC), Vol. 65. 1–3.
[10]
Prasanth Prabu Ravichandiran and Paul D. Franzon. 2021. A review of 3D-dynamic random-access memory based near-memory computation. In 2021 IEEE International 3D Systems Integration Conference (3DIC). IEEE, 1–6.
[11]
Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv preprint arXiv:1409.1556 (2014).
[12]
Haiping Wu, et al. 2021. CvT: Introducing Convolutions to Vision Transformers. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV). 22–31.
[13]
Xin Chen, et al. 2017. COSY: An Energy-Efficient Hardware Architecture for Deep Convolutional Neural Networks Based on Systolic Array. In 2017 IEEE 23rd International Conference on Parallel and Distributed Systems (ICPADS). 180–189.
[14]
Shijin Zhang, Zidong Du, et al. 2016. Cambricon-X: An accelerator for sparse neural networks. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 1–12.


Published In

GLSVLSI '24: Proceedings of the Great Lakes Symposium on VLSI 2024
June 2024
797 pages
ISBN: 9798400706059
DOI: 10.1145/3649476

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. CNN accelerator
  2. communication analysis model
  3. convolutional neural network (CNN)
  4. dataflow

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • University Synergy Innovation Program of Anhui Province
  • Strategic Priority Research Program of Chinese Academy of Sciences
  • CAS Project for Young Scientists in Basic Research

Conference

GLSVLSI '24: Great Lakes Symposium on VLSI 2024
June 12–14, 2024
Clearwater, FL, USA

Acceptance Rates

Overall Acceptance Rate 312 of 1,156 submissions, 27%
