
Fast Loosely-Timed Deep Neural Network Models with Accurate Memory Contention

Published: 14 August 2024

Abstract

The emergence of data-intensive applications, such as Deep Neural Networks (DNN), exacerbates the well-known memory bottleneck in computer systems and demands early attention in the design flow. Electronic System-Level (ESL) design using SystemC Transaction Level Modeling (TLM) enables effective performance estimation, design space exploration (DSE), and gradual refinement. However, memory contention is often only detectable after detailed TLM-2.0 approximately-timed or cycle-accurate RTL models are developed. A memory bottleneck detected at such a late stage can severely limit the available design choices or even require costly redesign.
In this work, we propose a novel TLM-2.0 loosely-timed contention-aware (LT-CA) modeling style that offers high-speed simulation close to traditional loosely-timed (LT) models, yet shows the same accuracy for memory contention as low-level approximately-timed (AT) models. Thus, our proposed LT-CA modeling breaks the speed/accuracy tradeoff between regular LT and AT models and offers fast and accurate observation and visualization of memory contention.
Our extensible SystemC model generator automatically produces desired TLM-1 and TLM-2.0 models from a DNN architecture description for design space exploration focusing on memory contention. We demonstrate our approach with a real-world industry-strength DNN application, GoogLeNet. The experimental results show that the proposed LT-CA modeling is 46× faster in simulation than equivalent AT models with an average error of less than 1% in simulated time. Early detection of memory contentions also suggests that local memories close to computing cores can eliminate memory contention in such applications.

1 Introduction

Emerging computing applications create an ever-increasing demand for higher memory bandwidth and lower access latency. While massively parallel processor arrays [26, 36] allow an order-of-magnitude improvement in computational capacity, a severe performance gap exists in the state-of-the-art memory architectures. Additionally, the low-power requirements of embedded systems create extra design challenges to achieve on-par performance improvements.
Advances in 3-D integration of memory-logic fabrics [29], and continuing trends towards data-intensive applications necessitate new hardware-software codesign approaches with particular emphasis on memory contention. A system-level memory-aware modeling framework is a cornerstone to building the next-generation system-on-chips (SoCs), capable of addressing memory bandwidth and latency issues. Such a modeling framework identifies memory bottlenecks and explores new architectures before RTL implementation with high accuracy and faster simulation speed.
In prior work [3], we have studied the impact of communication mechanisms on the available parallelism in transaction level modeling (TLM). Specifically, we have demonstrated the effects of varying synchronization mechanisms and buffering schemes on the exposed parallelism using different modeling styles of a deep neural network (DNN), GoogLeNet. In this Phase I modeling, we developed six untimed SystemC TLM-1 and TLM-2.0 models. Figure 1 places the six models generated in this Phase I, models A to F (green nodes), in a chart with the number of buffers indicated on the x-axis and the communication mechanism on the y-axis. We have further quantified the improved parallelism in the above models by measuring the performance of aggressive out-of-order parallel simulation in the Recoding Infrastructure for SystemC (RISC) [24]. As a result, we have demonstrated that the design with the highest amount of parallelism exposed, i.e., model F, is best suited for further refinement in the system design flow.
Fig. 1. TLM of GoogLeNet DNN with a focus on exposing parallelism (Phase I) and memory contention (Phase II).
Expanding on our prior work [3], this article explores the critical aspects of modeling and analysis of timing accuracy and memory contention. In this Phase II modeling, we further refine the untimed TLM-2.0 back-pressure model F with double-buffering to loosely-timed (G) and approximately-timed (H) models. Moreover, we propose a new loosely-timed contention-aware (LT-CA) modeling style to expose memory contention in a fast and yet accurate manner (model I).
Furthermore, we define a system-level exploration framework to automatically generate TLM from an abstract DNN specification. As illustrated in Figure 2, the DNN specification and modeling parameters constitute the inputs to our proposed model generator, netspec. Based on user-specified design metrics, netspec (green box) automatically creates models at desired abstraction levels.
Fig. 2. DNN TLM exploration framework.
In general, TLM trades off timing accuracy for simulation speed. TLM allows system designers and chip architects to rapidly prototype and verify their design candidates before generating detailed RTL. RTL simulations tend to be an order of magnitude slower than SystemC TLM. Traditionally, higher-level abstractions of TLM (i.e., loosely-timed—LT) mainly focus on functionality and define the programmer’s view of the design for early software development. On the other hand, lower-level TLM (approximately-timed, AT) can represent finer-grained timing details at the price of sacrificing simulation speed. However, with the enormous increase in today’s design complexity, running lower-level models has become a severe obstacle in agile hardware development. Accurate and fast high-level TLM that can expose the critical aspect of memory contention without sacrificing simulation performance is needed to efficiently build future computing platforms (orange box in Figure 2).
By having a fast and accurate contention model, we can rapidly evaluate design candidates on performance metrics for a lower-level implementation, e.g., RTL (blue box in Figure 2). Here, a data visualization tool that can generate transaction-level timing diagrams for early feedback to system designers is beneficial to analyze and address any memory contention in the design. Early detection of memory contentions allows the exploration of different memory organizations to find the optimal designs with minimal memory contention.
To summarize, the key contributions of this work are the following:
(1)
A novel system-level modeling framework and automatic SystemC model generator for design space exploration (DSE) with a focus on mitigating memory contention, lowering memory footprint, and increasing the performance of DNNs (green box)
(2)
Early contention modeling in selected SystemC loosely-timed (LT) models with awareness of first-come-first-served (FCFS) and round-robin (RR) arbitration policies focusing on high accuracy and fast simulation speed (orange box)
(3)
Extensive performance measurement results and data visualization to generate transaction-level timing diagrams for memory contention analysis (blue box)
We organize the rest of this article as follows: Section 2 reviews some relevant background and related work and introduces the DNN application used for our study. We then lay the foundation of our system modeling strategy in Section 3.1. To demonstrate how each abstraction level focuses on the aspect of memory contention, we describe the modeling details from the highest to the lowest level of abstraction in Sections 3 and 4, with our novel modeling approach for fast and accurate memory contention in Section 4.3, and proposed local memory organization with minimum memory contention in Section 4.4. Section 5 describes the structure of our TLM generator for DSE. Finally, we present our extensive results and analysis in Section 6 and conclude this study in Section 7.

2 Background

This section briefly reviews the memory bottleneck and system-level modeling context. We also introduce the application driver used in this study and briefly review related work.

2.1 Memory Bottleneck

The term von Neumann bottleneck, widely known as the memory bottleneck, was coined by John Backus in 1978 [5]. Von Neumann computers are built around an inherent bottleneck that is “the word-at-a-time tube connecting the CPU to the memory” [5]. Since the birth of the first von Neumann computer in 1945, various innovations have been developed to alleviate the memory bottleneck. Multi-level cache hierarchies, shared scratchpad memory, multi-channel memory architecture, Non-Uniform Memory Access (NUMA) architecture, and more recently, computation-in-memory [32] are only a few of the inventions to tackle the memory bottleneck in computer systems. Despite all these efforts, the memory bottleneck remains one of the grand challenges of computer science and engineering.

2.2 Electronic System Level and Transaction Level Modeling

With the rapid growth in the complexity of electronic devices and the drastic reduction in time to market, Electronic System Level (ESL) methodology has been proposed for modeling systems at higher levels of abstraction [7, 13]. ESL ideas resulted in defining System-level Description Languages (SLDL), such as SpecC [14] and SystemC [15], that can model both hardware and software components.
ESL techniques focus on TLM, which separates computation from communication in the model [9]. TLM allows the refinement of computation and communication independently and on different abstraction levels. In this way, TLM can speed up simulation significantly by replacing many pin-level events in RTL simulation with an abstract function call. Generally, the higher the level of abstraction is, the faster the simulation runs. Naturally, this simulation speedup typically comes at the price of lower model accuracy.
In summary, ESL and TLM raise the design abstraction above RTL to overcome the challenges of designing today’s complex SoCs. In particular, TLM provides an agile hardware-software codesign framework for the early exploration of the wide range of design metrics and the evaluation of design candidates. Moreover, TLM provides a codesign environment wherein software can be developed in parallel with hardware. TLM is beneficial not only for earlier system integration but also for rapid feedback to system designers.

2.3 Deep Learning and Convolutional Neural Networks (CNN)

Deep Learning (DL) is a known technique in machine learning to extract useful features from input data, perform data transformations, and arrive at a final meaningful representation. One of the main application areas of DL is visual recognition, and in particular, image classification, which is the problem of assigning a descriptive label to an input image from a fixed set of categories. DL and convolutional neural networks (CNNs) have been shown to solve this challenging problem fast and with acceptable precision.
Early work on CNNs dates back to 1989 with the LeNet network for handwritten digit recognition [22]. However, the early 2010s started a new era for CNN applications with the introduction of AlexNet [20] for image classification. Growth of computing power, availability of massive datasets for training, and rapid innovation in DL architectures have paved the way for the success of DL techniques in recent years [33].
A CNN consists of alternating convolution layers and pooling (sub-sampling) layers. Each convolution layer extracts features by applying trainable filters to its input. The convolved features are then fed to an activation function, for example, a Rectified Linear Unit (ReLU), to introduce non-linearity and obtain activation maps. Each pooling layer down-samples the activation maps to reduce computation and memory usage in the network. Features extracted from the previous convolution and pooling layers are fed to a fully connected layer to perform classification. Typically, a softmax activation function is placed after the final fully connected layer to output the probability corresponding to each classification label.
Choosing state-of-the-art deep CNNs for TLM modeling enables our investigation of the memory bottleneck problem.

2.3.1 GoogLeNet Structure.

GoogLeNet is a deep CNN for image classification and detection. It won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2014 with only 6.67% top-5 error [34]. GoogLeNet was designed with computational efficiency and deployability in mind. Its two main features are (1) using 1×1 convolution layers for dimension reduction and (2) applying the Network-in-Network architecture to increase the representational power of the neural network [34]. GoogLeNet is 22 layers deep when counting only layers with parameters. As detailed in Table 1, the overall number of independent building blocks is 142 distinct layers.
Table 1. GoogLeNet Layer Summary

Layer type   | Count
Convolution  | 57
ReLU         | 57
Pooling      | 14
LRN          | 2
Concat       | 9
Dropout      | 1
InnerProduct | 1
Softmax      | 1
Total        | 142
Based on an initial model described in SystemC TLM-1 [2] and further study on improving parallelism in TLM-1 and TLM-2.0 [3], we design and analyze timed memory-accurate models of GoogLeNet in this work.

2.4 Related Work

A large body of research exists on performance modeling and on the modeling and analysis of memory contention. We can broadly categorize most system-level performance analysis methods into two main classes: analytical and simulation-based. In analytical approaches, we mathematically model the system and analytically derive its performance as a function of workload and input parameters. Frank et al. [11] define an analytical contention model for parallel algorithms on a multiprocessor workstation. Chen et al. [10] use queueing theory to model contention in bus-based system design. Analytical models depend on the architecture described, and a new model must be developed for each new architecture or application [8]. Moreover, analytical modeling does not take into account the dynamic behavior of the system, and the use of more realistic assumptions often makes meaningful analysis difficult [6].
Simulation-based approaches can capture many dynamic and complex interactions in a system. SpecC [14] and SystemC [15] are widely used SLDLs for the modeling, simulation, and validation of complex SoC models. The SystemC C++ class library is an IEEE standard that enables system-level modeling and TLM using discrete event simulation (DES) [16]. However, simulation techniques often suffer from long simulator run-times at lower abstraction levels. Furthermore, there are high costs associated with manually building and debugging simulation models.
A systematic and quantitative analysis of TLM’s speed/accuracy tradeoff has been studied in [30]. A method of overcoming this general tradeoff for the specific case of processor models is proposed in [31]. The study of speed and accuracy tradeoff in DES has also been carried out in other scientific fields, such as network simulation. Packet-level network simulators enable high-accuracy simulation but can lead to long simulation times. For example, SimGrid, developed by Legrand et al. [23], is a simulation framework that simulates networks at higher levels, thus enabling fast simulation but losing accuracy. This speed-accuracy tradeoff is quantitatively evaluated and confirmed by Fujiwara et al. [12]. Packet-level models are essentially the analog of non-blocking TLM transactions as both model the atomic elements circulating on the interconnect.
To overcome the limitations of strictly simulation-based methods, hybrid approaches that combine analytical and simulation methodologies have been proposed. Künzli et al. [21] offer a technique to combine SystemC-based simulation with formal analysis based on real-time calculus. Bobrek et al. [8] also combine simulation with an analytical method focusing on the study of shared resource contention. While these mixed methodologies help to shorten simulator run-times, the coverage of corner cases in simulation remains difficult [35]. Furthermore, [8] operates at a much higher level of abstraction than TLM and thus sacrifices some accuracy for higher simulation speedup.
Aside from analytical and simulation-based modeling approaches, there are also experimental techniques to measure the effect of memory contention. More recently, DNN library profilers such as PyTorch Profiler [28] and performance profilers such as Intel VTune Profiler [17] provide some coarse-grain measures on memory usage and footprint. However, the results are valid only for a specific processor architecture and memory hierarchy. This hardware dependency is not helpful for DSE or refinement to lower-level abstraction.
Our proposed TLM framework is based on the well-defined SystemC methodology, which makes it easy to deploy. Our automatic model generation dramatically reduces the burden of constructing and debugging simulation models. Furthermore, memory contention is modeled accurately and simulated fast, enabling efficient early DSE.

3 Phase I: Untimed Model Design and Parallelization

For Phase I of our study, we first introduce high-level aspects of our system modeling framework. We then describe the TLM-1 and TLM-2.0 modeling of DNNs to provide early feedback on the amount of available parallelism in the application. This forms TLM-1 models A, B, C, D and TLM-2.0 models E and F in Figure 1.

3.1 System-level Modeling Framework

A well-defined modeling strategy is essential to manage the system’s complexity and provide maximum flexibility. Our system modeling framework follows three criteria introduced in [2]:
(1)
Generic layers: Since a CNN is composed of a handful of layer types, the layers shall be parameterized by their attributes using a custom constructor. For example, a pooling layer shall be parameterized by its type (max-pooling or average pooling), kernel size, stride, and the number of padding pixels (see the sketch after this list).
(2)
Self-contained layers: Each layer shall implement the functionality it requires without needing an external scheduler to load its input, or in some cases, load its parameters. For example, a convolution layer shall have a dedicated method to load its parameters (weight matrix and bias vector) used only at construction time.
(3)
Reusability and modularity: Since most CNNs share a standard set of layers, the code shall be structured to feed any kind of CNN with minimum effort. For example, the layer implementation shall be organized as code template blocks, and the SystemC model shall be automatically generated using only the network model defined by the AI framework.
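To make criteria (1) and (2) concrete, the following minimal sketch shows what a parameterized, self-contained pooling-layer module could look like; the class and attribute names are illustrative and not the exact code emitted by our generator.

```cpp
#include <systemc>

// Hypothetical sketch of a generic, self-contained pooling layer (criteria 1 and 2):
// all layer attributes are passed to a custom constructor, so one module type
// covers every pooling instance in the network.
enum class PoolType { Max, Average };

struct PoolingLayer : sc_core::sc_module
{
    const PoolType type;      // max-pooling or average pooling
    const int      kernel;    // kernel size F
    const int      stride;    // stride S
    const int      padding;   // number of padding pixels P

    SC_HAS_PROCESS(PoolingLayer);

    PoolingLayer(sc_core::sc_module_name name,
                 PoolType type_, int kernel_, int stride_, int padding_)
        : sc_core::sc_module(name),
          type(type_), kernel(kernel_), stride(stride_), padding(padding_)
    {
        SC_THREAD(main);       // each layer runs its own thread (Section 3.2)
    }

    void main()
    {
        // Read the input, down-sample with the configured kernel/stride, write the output.
    }
};
```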
We have used the Caffe (Convolutional Architecture for Fast Feature Embedding) model zoo to obtain pre-trained network parameters. Caffe is a DL framework originally developed at the University of California, Berkeley, and is available under BSD license [18]. Caffe models come with (1) a binary file .caffemodel that contains network parameters and (2) a text file .prototxt that specifies the network architecture. Class labels are also provided in a text file format that includes a synonym ring or synset of those labels.
Our SystemC models rely on efficient, optimized code inside OpenCV 3.4.1. OpenCV is a library of computer vision functions mainly aimed at real-time applications written in C/C++ [27]. The OpenCV library was originally developed by Intel and is now freely accessible under the open-source BSD license. OpenCV uses an internal data structure to represent an n-dimensional dense numerical single-channel or multi-channel array, a so-called Mat class. Therefore, our models use the Mat data type to store images, weight matrices, bias vectors, feature maps, and class scores. This design decision becomes practical while interacting with various OpenCV application programming interfaces (APIs).
Our previous study [2] shows that the multi-threaded OpenCV library delivers the highest level of parallelism for simulation speedup compared to existing thread-level parallelism at the SystemC level. Therefore, we rely on multi-threaded OpenCV with a sequential SystemC simulator instead of a parallel simulator such as the Recoding Infrastructure for SystemC (RISC) [24] for better simulation performance. Moreover, RISC does not yet support all language constructs required for approximately-timed modeling.
Our system-level modeling framework follows the well-known Specify-Explore-Refine (SER) methodology [13], which is a successive, stepwise refinement of design models, as described in subsequent sections.

3.2 TLM-1 Modeling of DNNs

TLM-1 implements message-passing semantics primarily to separate communication from computation. Through well-defined TLM-1 interface method calls, any internal state changes in one SystemC module are hidden from other modules [16].
Following the TLM-1 coding style, each layer in the CNN is modeled as a sc_module with input and output ports. Ports in each module are defined as sc_port and are parameterized either by primitive or user-defined interface classes. The user-defined interfaces are derived from sc_interface and declare read and write access methods with a granularity of Mat. The choice of Mat for the granularity of port parameterization simplifies the design by focusing on the proper level of abstraction at this level of modeling.
Each module has a main thread that continuously reads its input port, computes results, and writes those to its output port. Data processing is handled by a run method that interacts with the OpenCV library. The run method creates an instance of an OpenCV layer and calls its forward method, passing references to the input Mat and output Mat objects.
Channels are modeled as queues with FIFO semantics, so that data is produced and consumed in a first-in, first-out discipline. These channels implement interface methods for read and write access. Encapsulating communication in channels allows various communication mechanisms and buffer sizes to be modeled independently from the module functionality. This exploratory approach provides early feedback on the amount of available parallelism and local communication interactions. More details on the characteristics of our four channel variants and simulation results for the corresponding TLM-1 models A, B, C, and D in Figure 1 can be found in [3, 4].
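The sketch below illustrates this TLM-1 structure with hypothetical interface and module names; the run method is reduced to a stand-in OpenCV call rather than the full layer implementation.

```cpp
#include <systemc>
#include <opencv2/core.hpp>

// Hypothetical TLM-1 sketch: user-defined interfaces with Mat granularity and a
// layer module whose main thread reads, computes via run(), and writes.
struct mat_read_if : virtual sc_core::sc_interface {
    virtual void read(cv::Mat& data) = 0;           // blocking read of one Mat
};
struct mat_write_if : virtual sc_core::sc_interface {
    virtual void write(const cv::Mat& data) = 0;    // blocking write of one Mat
};

struct ReluLayer : sc_core::sc_module
{
    sc_core::sc_port<mat_read_if>  in;              // bound to a FIFO-style channel
    sc_core::sc_port<mat_write_if> out;

    SC_CTOR(ReluLayer) { SC_THREAD(main); }

    void main() {
        cv::Mat input, output;
        while (true) {
            in->read(input);                        // consume one Mat from the channel
            run(input, output);                     // OpenCV-based computation
            out->write(output);                     // produce the result
        }
    }

    void run(const cv::Mat& input, cv::Mat& output) {
        output = cv::max(input, 0.0);               // stand-in for the OpenCV DNN layer call
    }
};
```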

3.3 TLM-2.0 Modeling of DNNs

While TLM-1 provides early feedback on parallelism and local communication, it is not specifically intended for bus modeling or interoperability. TLM-2.0 introduces a generic payload and blocking/non-blocking transport interfaces for the abstract modeling of memory-mapped buses.
Instead of defining a strict taxonomy of abstraction levels, TLM-2.0 establishes a set of APIs and describes a set of appropriate coding styles for various use cases. For example, the LT coding style utilizes the blocking transport interface (b_transport) for the use cases of software development and performance optimization. LT models simulate fast and have sufficient timing details to boot an operating system. On the other hand, the approximately-timed (AT) coding style utilizes the non-blocking transport interface (nb_transport) for the use cases of architecture exploration and detailed performance analysis. Generally, AT models simulate slower but carry better timing accuracy than LT models [16].
In TLM-2.0, a socket is instantiated within each initiator and each target module for every transaction-level connection. The generic payload captures the information to pass on with each bus transaction between the initiator and the target. The initiator module instantiates the generic payload transaction object and sets its attributes before passing a reference to this object to the target module via its transport interface.
For an initial estimation of the specific DNN memory usage at the system level, we can extract the read/write accesses initiated from each module to a memory. At this stage of modeling, the granularity of accesses can be the size of an input or output buffer, and the latency of each access can be a delta-cycle delay.
In our proposed model, the initiator sockets are connected to target sockets of shared memory, as shown in Figure 4(a). Each module has a dedicated address space in the memory to read and write its buffers. This model uses a blocking transport interface (b_transport) to pass transactions between the initiator and the target memory. The transaction is a tlm_generic_payload object, and its data pointer points to the start address of the input/output buffer. For early estimation of memory usage and accessible visualizations at this stage of modeling, the data length of the generic payload is set to the entire buffer inside the shared memory. Since the model is untimed, the timing annotation argument of b_transport is set to a delta cycle.
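A minimal sketch of the initiator side of this untimed model is shown below; the socket, buffer size, and address values are placeholders, and the zero-time annotation stands for the delta-cycle delay described above.

```cpp
#include <systemc>
#include <tlm>
#include <tlm_utils/simple_initiator_socket.h>
#include <vector>

// Hypothetical initiator-side sketch of the untimed TLM-2.0 model: one write
// transaction covers an entire output buffer in the shared memory.
struct LayerInitiator : sc_core::sc_module
{
    tlm_utils::simple_initiator_socket<LayerInitiator> socket;

    SC_CTOR(LayerInitiator) : socket("socket") { SC_THREAD(main); }

    void main()
    {
        const unsigned int buffer_bytes = 802816;               // placeholder buffer size
        std::vector<unsigned char> out_buffer(buffer_bytes);
        const sc_dt::uint64 out_address = 0x1000;                // placeholder map entry

        tlm::tlm_generic_payload trans;
        sc_core::sc_time delay = sc_core::SC_ZERO_TIME;          // untimed: delta cycle only

        trans.set_command(tlm::TLM_WRITE_COMMAND);
        trans.set_address(out_address);                          // start of the output buffer
        trans.set_data_ptr(out_buffer.data());                   // whole buffer in one payload
        trans.set_data_length(buffer_bytes);
        trans.set_streaming_width(buffer_bytes);
        trans.set_byte_enable_ptr(nullptr);
        trans.set_dmi_allowed(false);
        trans.set_response_status(tlm::TLM_INCOMPLETE_RESPONSE);

        socket->b_transport(trans, delay);                       // blocking transport to memory

        if (trans.get_response_status() != tlm::TLM_OK_RESPONSE)
            SC_REPORT_ERROR("LayerInitiator", "memory write failed");
    }
};
```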
Fig. 3. (a) Inception module in GoogLeNet (b) TLM-1 model diagram of inception module with double-buffering scheme in GoogLeNet.
Fig. 4. (a) TLM-2.0 feed-forward model connections (b) TLM-2.0 back-pressure events connections.
Generally, the communication mechanism in a single-buffer scheme is as follows: each producer places its output into a buffer in the shared memory. Each consumer reads its input from the same shared buffer. To avoid race conditions, each consumer waits for an event notification from its producer. The arrows between the modules in Figure 4(a) illustrate this feed-forward notification mechanism. This design forms the TLM-2.0 untimed model with feed-forward events mechanism, model E in Figure 1.
Without a balanced graph topology, event synchronization for multiple producers or consumers requires delta-cycle delay compensation. Such behavior can be seen in every inception module in GoogLeNet, as shown in Figure 3(a). Note that the four parallel tracks contain 2, 4, 4, and 3 modules. Our proposed untimed model compensates for these irregularities to guarantee correct synchronization between the modules.
We also devise a back-pressure events mechanism to allow the untimed TLM-2.0 model to execute safely even in aggressive out-of-order parallel simulation for maximum speedup [3]. Event connections for the first convolution and ReLU layers in GoogLeNet are depicted in Figure 4(b). Each module has a set of two sc_events for each input and output. The stb event is notified once a module has valid data inside the memory to be read, and the ready event signals a module is ready to read new data. By connecting events between all subsequent modules, the model forms a robust back-pressure mechanism that safely controls the data flow inside the pipeline.
Support for the back-pressure events mechanism is extended to all neighboring modules in the TLM-2.0 model. The double-buffering scheme guarantees a continuous stream of data inside the design pipeline, maximizing model parallelism and model throughput with the minimum number of buffers in the memory. This design forms the TLM-2.0 untimed model with back-pressure events mechanism, model F in Figure 1.

4 Phase II: Timed Modeling and Visualization of Memory Contention

While our modeling framework can generate TLM-1 and TLM-2.0 DNN models to study performance metrics such as parallelization, we focus on memory contention in Phase II. In the following sections, we extend our TLM-2.0 model F and add a latency model for memory, computation and interconnect following the LT coding style, model G in Figure 1. Next, to achieve higher timing accuracy, we refine the entire model following the AT coding style, model H in Figure 1. Then, we describe our novel method for LT interconnect and memory contention modeling that simulates as fast as an LT model yet shows memory contention as accurately as an AT model, model I in Figure 1. Finally, to highlight the strength of LT-CA modeling, we explore an alternative memory organization and describe our proposed local memory organization with minimum memory contention.

4.1 TLM-2.0 Loosely-timed (LT) Model

Given that the TLM-2.0 untimed model F provides only causal ordering between processes, timing is introduced at the next lower abstraction level. Our LT approach models a transaction’s start and end times using the blocking transport interface with a timing annotation, providing a good tradeoff between timing accuracy and simulation speed.
While LT modeling generally aims at timed simulation as fast as possible, our goal of observing memory contention requires the explicit modeling of interconnect components and memory interfaces. Thus, we cannot use more aggressive TLM-2.0 abstraction techniques, such as Direct Memory Interface (DMI)1 or temporal decoupling.2 However, we maintain the LT assumption that memory transactions are complete in one function call and cannot overlap with others.3
The LT model adds three sources of latency: (1) memory, (2) computation, and (3) interconnect. Our step-wise approach incrementally refines the model, adding one source of latency at each step. We keep the system model topology as in the TLM-2.0 untimed model, i.e., the mapping of each layer to a separate module. We refine further the point-to-point connection between layers and memory with a generic interconnect. The interconnect can be considered a shared bus matrix at this modeling stage. With the interoperability mechanism offered by TLM-2.0, our framework can also support different interconnect topologies.

4.1.1 Memory.

To add memory latency, the inter-module communication must be revisited. Since events occur at precise points in simulation time, a consumer that incurs a delay due to memory access would miss events from the producer. Therefore, we replace the feed-forward event notification with a pair of sc_signals for each input and output in every module (Figure 5). Once a producer fills the shared buffer, it increments the num_sent output signal to inform the waiting consumer that new data is available. When the consumer finishes reading the buffer, it increments the num_rcvd input signal. To implement back-pressure, the producer postpones its next write transaction while the output buffer is full.
Fig. 5. TLM-2.0 LT model module connections.
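The sketch below shows the producer side of this handshake; the signal names follow Figure 5, while the surrounding module structure and the NUM_BUFFERS constant are illustrative assumptions.

```cpp
#include <systemc>

// Producer-side back-pressure sketch (hypothetical module): counting signals replace
// the feed-forward events, so a consumer delayed by memory latency cannot miss an update.
static const int NUM_BUFFERS = 2;                   // double-buffering

struct ProducerLayer : sc_core::sc_module
{
    sc_core::sc_out<int> num_sent;                  // incremented by this producer
    sc_core::sc_in<int>  num_rcvd;                  // incremented by the consumer

    SC_CTOR(ProducerLayer) { SC_THREAD(main); }

    void main()
    {
        while (true) {
            compute_output();                       // layer computation (placeholder)

            // Back-pressure: stall while all output buffers are still unread.
            while (num_sent.read() - num_rcvd.read() >= NUM_BUFFERS)
                wait(num_rcvd.value_changed_event());

            write_buffer_to_memory();               // b_transport write (adds latency)
            num_sent.write(num_sent.read() + 1);    // new data is valid in the memory
        }
    }

    void compute_output()         {}                // placeholders for the sketch
    void write_buffer_to_memory() {}
};
```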
Algorithm 1 lists the pseudo-code used in our TLM-2.0 LT model. The initiator module writes the layer output using b_transport to send transactions to the target memory. When the memory serves the request, it updates the delay object inside the timing annotation argument with the memory latency and returns immediately, because the simulation is faster when b_transport does not block. The memory latency value depends on the access type and the transaction’s data size and is configurable for each LT memory module as follows:
\[\begin{gather*} payload \; delay = \frac{generic \; payload \; length}{memory \; bus \; width} \cdot memory \; latency \end{gather*}\]
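A minimal sketch of such an LT memory target is given below; the storage size, bus width, and word latency are assumptions, and b_transport only copies the data, annotates the computed payload delay, and returns.

```cpp
#include <systemc>
#include <tlm>
#include <tlm_utils/simple_target_socket.h>
#include <cstring>
#include <vector>

// LT memory target sketch: annotates payload delay = (data length / bus width) * word latency.
struct LtMemory : sc_core::sc_module
{
    tlm_utils::simple_target_socket<LtMemory> socket;
    std::vector<unsigned char> storage;
    unsigned         bus_width_bytes = 8;                       // memory bus width
    sc_core::sc_time word_latency{1.0, sc_core::SC_PS};         // latency per bus word

    SC_CTOR(LtMemory) : socket("socket"), storage(128 * 1024 * 1024)
    {
        socket.register_b_transport(this, &LtMemory::b_transport);
    }

    void b_transport(tlm::tlm_generic_payload& trans, sc_core::sc_time& delay)
    {
        unsigned char* mem = storage.data() + trans.get_address();
        unsigned int   len = trans.get_data_length();

        if (trans.get_command() == tlm::TLM_WRITE_COMMAND)
            std::memcpy(mem, trans.get_data_ptr(), len);
        else if (trans.get_command() == tlm::TLM_READ_COMMAND)
            std::memcpy(trans.get_data_ptr(), mem, len);

        // Annotate the access latency instead of blocking; the caller accounts for it.
        delay += (len / bus_width_bytes) * word_latency;

        trans.set_response_status(tlm::TLM_OK_RESPONSE);
    }
};
```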

4.1.2 Computation.

To estimate the computational latency, we analyze the computational complexity of the most common constituent layers in a DNN in terms of the number of multiplications (\(N_{mul}\)) and the number of additions (\(N_{add}\)). Given a 32-bit single-precision floating-point multiply-accumulate (FP32-MAC) unit available, we assume the total computational latency of a layer to be the product of the number of MAC operations and the inverse of the peak floating-point operations per second (FLOPS): \(N_{MAC} \cdot \frac{s}{flop}\). Here, the peak FLOPS value is the maximum number of single-precision floating-point MAC operations that a processing element (PE) can perform per second. A PE is a basic arithmetic component that at least includes a 32-bit floating-point multiplier and an accumulator register. It is worth mentioning that the maximum throughput of a PE is the main focus at this modeling stage. In other words, implementation details of the PE, such as its clock frequency, number of parallel MAC units, and the amount of control logic and congestion overhead, are all abstracted away.
We describe the timing estimation separately for each layer type. The size of the input volume to each layer is \(W_{i} \times H_{i} \times C_{i}\) where \(W_{i}\), \(H_{i}\), and \(C_{i}\) represent the width, height, and number of channels, respectively.
Convolution. Convolution has the following hyper-parameters: a number of filters \(K\), kernel size \(F\), stride \(S\), and padding \(P\). Convolution also has learned parameters: weights and biases. The total number of weights is \(F \cdot F \cdot C_{i} \cdot K\), and the total number of biases is \(K\). Convolution produces an output volume of size \(W_{o} \times H_{o} \times C_{o}\) where \(W_{o} = \lfloor {\frac{W_{i}-F+2\cdot P}{S}+1}\rfloor\), \(H_{o} = \lfloor {\frac{H_{i}-F+2 \cdot P}{S}+1}\rfloor\) and \(C_{o} = K\). To compute one output element for one channel, \(N_{mul_{elem}} = F \cdot F\) and \(N_{add_{elem}} = F \cdot F-1\). To compute one output element for all channels, \(N_{mul_{chans}} = C_{i} \cdot N_{mul_{elem}} = C_{i} \cdot F \cdot F\) and \(N_{add_{chans}} = C_{i} \cdot N_{add_{elem}} + C_{i}-1 + 1 = C_{i} \cdot F \cdot F\), where the \(C_{i}-1\) additions accumulate the partial sums across channels and the extra addition adds the bias value. To compute all output elements for one filter, \(N_{mul_{filter}} = W_{o} \times H_{o} \times N_{mul_{chans}} \approx \frac{W_{i} \cdot H_{i} \cdot C_{i} \cdot F^2}{S^2}\) and \(N_{add_{filter}} = W_{o} \times H_{o} \times N_{add_{chans}} \approx \frac{W_{i} \cdot H_{i} \cdot C_{i} \cdot F^2}{S^2}\). To compute all output elements for all \(K\) filters, \(N_{mul} \approx \frac{W_{i} \cdot H_{i} \cdot C_{i} \cdot F^2 \cdot K}{S^2}\) and \(N_{add} \approx \frac{W_{i} \cdot H_{i} \cdot C_{i} \cdot F^2 \cdot K}{S^2}\).
ReLU. The ReLU is an activation function defined as the positive part of its argument (\(max(0,x)\)). This unit is implemented by a comparator that can be simply modeled as an adder. Therefore, \(N_{add} = W_{i} \cdot H_{i} \cdot C_{i}\).
Pooling. To reduce the spatial size of volumes in the network, pooling down-samples the input volume by choosing the maximum element inside the kernel. This unit has two hyper-parameters: kernel size (\(F\)) and stride (\(S\)). Pooling produces an output volume of size \(W_{o} \times H_{o} \times C_{o}\) where \(W_{o} = \lfloor {\frac{W_{i}-F}{S}+1}\rfloor\), \(H_{o} = \lfloor {\frac{H_{i}-F}{S}+1}\rfloor\) and \(C_{o} = C_{i}\). Finding the maximum element inside a kernel requires \(N_{add_{elem}} = F \cdot F\). To compute the output for one channel, \(N_{add_{chan}} = W_{o} \times H_{o} \times N_{add_{elem}} \approx \frac{W_{i} \cdot H_{i} \cdot F^2}{S^2}\). The total number of additions to compute the output for all channels is \(N_{add} \approx \frac{W_{i} \cdot H_{i} \cdot C_{i} \cdot F^2}{S^2}\).
Concat. A concat layer concatenates two or more volumes and does not perform any computation on the inputs. Hence, its computational complexity is zero.
Since the majority of layer types in our example have been considered, as shown in Table 1, the computational complexities of the remaining layers are deferred for now. Furthermore, the peak computational capacity available in each layer is also configurable.
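As a worked example of these estimates, the sketch below computes the latency of a convolution layer from its hyper-parameters and the configured peak MAC throughput; the helper is illustrative and not the netspec implementation.

```cpp
#include <cstdio>

// Sketch of the convolution latency estimate from the formulas above
// (hypothetical helper, not the netspec implementation).
double conv_latency_seconds(int Wi, int Hi, int Ci,         // input width/height/channels
                            int K, int F, int S, int P,     // filters, kernel, stride, padding
                            double peak_flops)              // peak MAC throughput (MAC/s)
{
    int Wo = (Wi - F + 2 * P) / S + 1;                      // output width
    int Ho = (Hi - F + 2 * P) / S + 1;                      // output height

    // N_MAC = Wo * Ho * Ci * F^2 * K  (approximately Wi * Hi * Ci * F^2 * K / S^2)
    double n_mac = double(Wo) * Ho * Ci * F * F * K;

    return n_mac / peak_flops;                              // N_MAC * (s / flop)
}

int main()
{
    // Example: a 7x7 convolution with stride 2, padding 3, and 64 filters on a
    // 224x224x3 input, using the default 1 GFLOPS peak capacity per layer (Table 2).
    double t = conv_latency_seconds(224, 224, 3, 64, 7, 7, 2, 3, 1e9);
    std::printf("estimated convolution latency: %.6f s\n", t);
    return 0;
}
```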

4.1.3 Interconnect.

To refine a dedicated point-to-point communication between a layer and a memory, we design a generic TLM-2.0 LT interconnect module. The interconnect module arbitrates and forwards existing transaction objects from initiator layers to target memory. The interconnect can also model the latency to accept transaction objects before forwarding them to the memory.
As an example depicted in Figure 6(a), the interconnect is placed between the 142 initiator layers and a single target memory. In this architecture, the interconnect has 142 target sockets and one initiator socket. The thick double arrows represent the inter-module communication mechanism supporting the timing in the model. Furthermore, the interconnect supports the modeling of multiple memories. As another example illustrated in Figure 6(b), the interconnect is placed between four separate memories with segmented address space.
Fig. 6. (a) TLM-2.0 LT model with an interconnect (b) TLM-2.0 LT model with an interconnect and multiple memories.
The main functionality of the interconnect is to route transactions from an incoming target socket to an outgoing initiator socket. Since each memory has a dedicated address space, the interconnect routes transactions to the correct memory depending on the address embedded in the transaction. After address translation, the interconnect forwards the transaction via the corresponding initiator socket connected to the memory.
We design a programmable memory map to allow maximum flexibility in the interconnect. Each address region has an entry in this table which contains the start address, the size, and the index of the initiator socket to forward the transaction to. To decode an address, the interconnect inspects the address attribute in the generic payload and looks it up in the memory map to determine which outgoing initiator socket to forward the transaction to. If the address is found in the table, the interconnect overwrites the address attribute with the decoded local address in the memory and forwards the transaction to the correct target memory. Otherwise, it aborts the simulation with an error message.
According to TLM-2.0 guidelines, an interconnect module cannot change the generic payload’s data length attribute. This restriction means that if the data in a transaction is split into two separate memories, the interconnect must act as the endpoint for that particular transaction. Here, the interconnect forms two separate transactions to the memories. In that respect, the role of the interconnect is dynamic. It functions as an interconnect component for some transactions and as a target for other transactions.
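A sketch of this decode step is shown below with a hypothetical map-entry structure; the TLM socket plumbing and the splitting of a transaction across two memories are omitted.

```cpp
#include <cstdint>
#include <vector>

// Sketch of the interconnect's programmable memory map and address decoding.
struct MapEntry {
    uint64_t start;        // global start address of the region
    uint64_t size;         // region size in bytes
    unsigned out_socket;   // index of the initiator socket toward the memory
};

struct Decoded {
    unsigned out_socket;   // which memory to forward to
    uint64_t local_addr;   // address local to that memory
};

// Returns true and fills 'result' when the global address hits a mapped region;
// the real interconnect aborts the simulation on a decode miss.
bool decode(const std::vector<MapEntry>& memory_map, uint64_t global_addr, Decoded& result)
{
    for (const MapEntry& e : memory_map) {
        if (global_addr >= e.start && global_addr < e.start + e.size) {
            result.out_socket = e.out_socket;
            result.local_addr = global_addr - e.start;   // decoded local address
            return true;
        }
    }
    return false;
}
```

Inside its b_transport implementation, the interconnect would call such a decode function, overwrite the payload address attribute with the local address, and forward the transaction through the selected initiator socket.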

4.2 TLM-2.0 Approximately-timed (AT) Model

As described in Section 4.1, an LT model completes a whole transaction in one function call. LT has two timing points: the start and the end. It is possible to increase the number of timing points for each transaction to achieve a higher degree of timing accuracy. More timing points help to more accurately monitor throughput, latency, and bandwidth utilization and can support interleaving transactions and pipelined bus protocols. However, with more timing points, processes are more likely to run in lockstep with the SystemC scheduler. Hence, it is generally expected that AT models simulate significantly slower than their LT counterparts.
The AT coding style uses a non-blocking transport interface which, in addition to the timing annotation, supports multiple phases within the lifetime of a transaction. The base protocol for AT modeling defines four phases to represent four timing points for each transaction: the start and end of the request and the beginning and the end of the response. These four phases of the base protocol can model three timing parameters: (1) the request accept delay, (2) the latency of the target, and (3) the response accept delay [16, 19].
To build an AT model for a DNN, we first refine the communication part of the LT-style initiator. The blocking transport interface is replaced with an implementation of the nb_transport_fw and nb_transport_bw functions. Following the TLM-2.0 guidelines with regard to the usage of the generic payload in non-blocking interfaces, we instantiate a memory manager to acquire a generic payload transaction from a pool of transaction objects and release it back to the same pool once the transaction is no longer in use. Furthermore, the logic for handling the base protocol call sequence is added with the help of a Payload Event Queue (PEQ) and its callback method.
Next, the LT interconnect is replaced with an interconnect that supports the handling of non-blocking interfaces and the AT base protocol. Unlike its LT counterpart, the AT interconnect can queue incoming requests if there is already a request in progress and the interconnect has not completed the END_REQ phase for that request. The AT interconnect can also queue incoming responses from the memory to forward them later on the backward path to the initiators. The sequence of phase transitions for each transaction is what enables contention to be modeled. Our AT interconnect supports two arbitration strategies, namely, the FCFS and RR policies. Finally, we reuse the logic for address mapping and address translation from the LT interconnect.
As the final refinement step, we replace the LT memory with an AT counterpart. Our AT memory model implements the base protocol with four phases to provide the proper timing granularity for the AT coding style and accurate contention modeling. The bus width, size, request, and response delays for the AT memory are all configurable. To provide accurate timing for comparative analysis, we devise an estimation for read and write request and response delays for each transaction as follows:
\[\begin{gather*} request \; accept \; delay = memory \; latency \\ response \; accept \; delay = \frac{generic \; payload \; length}{memory \; bus \; width} \cdot memory \; latency \end{gather*}\]
Given TLM-2.0 AT compliance, we connect the multi_passthrough_initiator_socket of the AT interconnect to the target_socket of the AT memory module.
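For illustration, the sketch below outlines an AT-style memory target under the base protocol with the two delays defined above; the socket and member names and the default latencies are assumptions, and details such as the END_REQ exclusion rule, data copying, and the interaction with the AT interconnect's arbitration are omitted.

```cpp
#include <systemc>
#include <tlm>
#include <tlm_utils/simple_target_socket.h>
#include <tlm_utils/peq_with_cb_and_phase.h>

// Sketch of a simplified AT memory target: the request phase is completed on the
// return path, and the response phase is scheduled through a payload event queue.
struct AtMemory : sc_core::sc_module
{
    tlm_utils::simple_target_socket<AtMemory>  socket;
    tlm_utils::peq_with_cb_and_phase<AtMemory> peq;
    sc_core::sc_time word_latency{1.0, sc_core::SC_PS};
    unsigned         bus_width_bytes = 8;

    SC_CTOR(AtMemory) : socket("socket"), peq(this, &AtMemory::peq_cb)
    {
        socket.register_nb_transport_fw(this, &AtMemory::nb_transport_fw);
    }

    tlm::tlm_sync_enum nb_transport_fw(tlm::tlm_generic_payload& trans,
                                       tlm::tlm_phase& phase, sc_core::sc_time& delay)
    {
        if (phase == tlm::BEGIN_REQ) {
            delay += word_latency;                             // request accept delay
            sc_core::sc_time resp_delay =
                (trans.get_data_length() / bus_width_bytes) * word_latency;
            tlm::tlm_phase resp_phase = tlm::BEGIN_RESP;
            peq.notify(trans, resp_phase, delay + resp_delay); // schedule the response
            phase = tlm::END_REQ;                              // complete request on return
            return tlm::TLM_UPDATED;
        }
        if (phase == tlm::END_RESP)                            // initiator closed the response
            return tlm::TLM_COMPLETED;
        return tlm::TLM_ACCEPTED;
    }

    void peq_cb(tlm::tlm_generic_payload& trans, const tlm::tlm_phase& phase)
    {
        // The payload stays valid here via the initiator's memory manager (see above).
        trans.set_response_status(tlm::TLM_OK_RESPONSE);
        tlm::tlm_phase ph = phase;                             // BEGIN_RESP
        sc_core::sc_time zero = sc_core::SC_ZERO_TIME;
        socket->nb_transport_bw(trans, ph, zero);              // send the response back
    }
};
```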

4.3 TLM-2.0 Loosely-timed Contention-aware (LT-CA) Model

Transaction modeling using LT coding style simulates fast because transactions are complete in a single blocking transport method call, namely, a b_transport call. For the same reason, LT models are usually not used for contention analysis which typically requires a detailed sequence of interactions between the initiator and the target within the life of a transaction. On the contrary, AT coding style uses a non-blocking transport interface that supports multiple phases within the lifetime of a transaction. The AT modeling naturally enables resource contention modeling to find performance bottlenecks in the design. However, an AT model simulates slower than its LT counterpart because it can contain up to four function calls to complete a transaction.
AT modeling is one of the more complex aspects of TLM-2.0, which makes AT model development a non-trivial task. Therefore, the development of AT models is often postponed to later stages of the design flow, typically only after the LT model is in place. Moreover, when AT model development is disregarded due to a tight project schedule, RTL simulations are most likely used to find performance bottlenecks. However, chip-level RTL simulations suffer from an order of magnitude slower simulation speed compared to system-level AT models.
A cycle-accurate model is the closest abstraction to the final hardware and provides the most accurate estimation of the design’s memory contention. However, as mentioned earlier, cycle-accurate and even AT models typically simulate orders of magnitude slower than an LT model. Importantly, techniques to tackle memory contention at the lower levels of abstraction usually have a sub-optimal impact on the overall performance. Hence, memory contention must already become visible in the early stages of the design flow. Our new modeling approach enables system designers and chip architects to codesign both hardware and software to mitigate memory bottlenecks optimally.
Our LT-CA modeling supports two major arbitration policies, namely, FCFS and RR scheduling, which we detail in the following two sections.

4.3.1 First-Come-First-Served (FCFS) Arbitration Policy.

To have visibility of memory contention early on with fast simulation, we propose to use the timing annotation in the blocking transport interface to keep track of memory congestion. By storing the memory-busy status in a state variable inside the interconnect, we can schedule transactions at the correct simulation time without holding pending transactions in a complex PEQ.
As listed in Algorithm 2, we store a timestamp marking the end of memory occupation in a state variable busy_until. Since the memory is not busy at the start of the simulation, we initialize busy_until to zero. Once a new transaction arrives at the interconnect, we calculate the remaining time left until the memory becomes available again (busy_delay). If busy_delay is zero, this indicates that the transaction has arrived after the point that the memory was busy. Hence, the memory is available, and busy_until is set to the current timestamp. Before forwarding the transaction to the memory, the timing annotation of the transaction (delay) is updated with the sum of the interconnect latency and busy_delay. We then perform the memory transaction where the LT memory updates the delay with its read or write latency and returns immediately. Finally, before returning the transaction to the initiator, the busy_until state variable is incremented with the observed time for the memory latency (memory_delay).
Figure 7(a) illustrates the FCFS policy for an example where three transaction requests A, B, and C arrive at times 0, 2, and 1, and then again at times 9, 11, and 10, respectively. Assuming that each transaction has a memory delay of 3 time units, Algorithm 2 schedules them in arrival order, A from time 0 to 2, C from time 3 to 5 (busy_delay = 2), and B from time 6 to 8 (busy_delay = 4). The same schedule then repeats again from time 9.
Fig. 7. LT-CA modeling of contention with different arbitration policies.
Note that keeping state in the busy_until variable is simple yet fully effective for an FCFS arbitration policy. All record-keeping is performed inside the b_transport call without any explicit PEQ, allowing our LT-CA to simulate much faster than the corresponding AT model.
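Based on this description of Algorithm 2, the FCFS bookkeeping inside the LT-CA interconnect can be sketched as follows; the single target memory, socket types, and latency values are simplifying assumptions.

```cpp
#include <systemc>
#include <tlm>
#include <tlm_utils/simple_initiator_socket.h>
#include <tlm_utils/simple_target_socket.h>

// Sketch of LT-CA contention tracking with FCFS arbitration (single target memory).
struct LtCaInterconnect : sc_core::sc_module
{
    tlm_utils::simple_target_socket<LtCaInterconnect>    layer_socket;  // from the layers
    tlm_utils::simple_initiator_socket<LtCaInterconnect> mem_socket;    // to the LT memory

    sc_core::sc_time busy_until;                           // memory free at time zero
    sc_core::sc_time interconnect_latency{1.0, sc_core::SC_NS};

    SC_CTOR(LtCaInterconnect) : layer_socket("layer_socket"), mem_socket("mem_socket")
    {
        layer_socket.register_b_transport(this, &LtCaInterconnect::b_transport);
    }

    void b_transport(tlm::tlm_generic_payload& trans, sc_core::sc_time& delay)
    {
        sc_core::sc_time now = sc_core::sc_time_stamp() + delay;

        // Remaining time until the memory is free again (zero if it is already free).
        sc_core::sc_time busy_delay = (busy_until > now) ? busy_until - now
                                                         : sc_core::SC_ZERO_TIME;
        if (busy_delay == sc_core::SC_ZERO_TIME)
            busy_until = now;                              // arrived after the busy period

        // Charge the interconnect latency plus the wait for the memory.
        delay += interconnect_latency + busy_delay;

        // Forward to the LT memory, which annotates its own latency and returns.
        sc_core::sc_time before = delay;
        mem_socket->b_transport(trans, delay);
        sc_core::sc_time memory_delay = delay - before;

        busy_until += memory_delay;                        // occupied until the access ends
    }
};
```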

4.3.2 Approximate Round-Robin (RR) Arbitration Policy.

While not as simple as FCFS, our LT-CA modeling approach also supports the RR scheduling policy with high efficiency. To avoid complex AT data structures, such as a PEQ, we exploit the LT idea of only loosely tracking the timing in our LT-CA model. In other words, we trade off some accuracy for speed by approximating the interconnect contention of RR arbitration.
Note that RR scheduling is inherently dynamic in the sense that the delay for a transaction cannot be determined entirely at its arrival time because it cannot be known if later transactions will affect the waiting time. Figure 7(b) illustrates this problem for an RR policy that requires requests to be scheduled in order, i.e., A, B, C. Given the out-of-order arrival times of 0, 2, and 1, respectively, an RR scheduler will initially plan C to execute immediately after A but has to revise that plan at the arrival time of B with an extra delay of 3 units (shown in blue) when B gets priority over C.
Such dynamic rescheduling is not possible in LT-CA if we want to maintain high simulation speed by calculating delays immediately within the same b_transport call. So we approximate the RR policy with the following idea. We optimistically assume the best-case scenario of in-order arrival times, and if that proves incorrect for a transaction, we compensate for the mistake by recording a penalty for the next time the transaction appears. This approximate RR policy is illustrated in Figure 7(c) where transaction C is scheduled immediately at its arrival time 1 from time 3 to 5. At time 2, the incoming request B proves this optimistic schedule for C incorrect (now marked in red), so we record a penalty of 3 time units (i.e., the memory delay of B) for C. To not make any further mistakes, we let B perform at its correct slot from 3 to 5 but also set busy_until to 8 so that no other transaction can take place at the time C was supposed to occur.
The recorded penalty of 3 time units for C then ensures that the RR violation does not occur again, as shown at time 10 when C is scheduled with busy_delay of 2 and the added penalty of 3 (purple). Note that in this example, only 1 out of the 6 memory transactions is scheduled too early (83.3% are correct), and the total delay is maintained accurately (busy_until = 17). As such, our approximation meets the loosely timed criterion of LT modeling.
Algorithm 3 lists our contention modeling with approximate RR arbitration in detail. Similar to Algorithm 2, the procedure interconnect_b_transport performs the memory transaction, updates the scheduler state, and updates the transaction delay. The added argument id identifies the calling initiator via the used socket so that the interconnect can handle transactions with RR priority. For this purpose, the scheduler state is maintained in a circular array request where each active initiator’s transaction is recorded with its start time, memory delay, and any penalty. The delay of each transaction is computed optimistically for the shortest time possible given the current state of all other active requests.4 Here, the function FindEarliestOtherActiveRequest locates the earliest active transaction in the circular request array. If such a request exists, the function RescheduleActiveRequests recalculates the busy delay (busy_until) and updates the start times of all active transactions in the RR order accordingly. If an active request moves to a later time, a penalty is recorded for the next transaction from the same initiator. Last but not least, a transaction’s penalty is reset to zero once it has been applied.
Compared to the FCFS Algorithm 2, our RR approximation Algorithm 3 is more elaborate and requires more calculations to maintain the circular request array. From the perspective of theoretical complexity analysis, however, both algorithms are of low complexity. With \(N\) denoting the number of input sockets of the interconnect, the complexity of the RR algorithm is \(O(N)\) (size of the request array is \(N\)), whereas the complexity of FCFS is constant (\(O(1)\)).
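For illustration only, the sketch below shows the per-initiator state and the penalty application that Algorithm 3 relies on; the rescheduling step (FindEarliestOtherActiveRequest / RescheduleActiveRequests) and request completion are summarized as comments, and all names are hypothetical.

```cpp
#include <systemc>
#include <vector>

// Per-initiator bookkeeping for the approximate RR policy (partial sketch).
struct RrRequestState {
    bool             active = false;       // a transaction from this initiator is in flight
    sc_core::sc_time start;                // optimistically scheduled start time
    sc_core::sc_time memory_delay;         // duration of the memory access
    sc_core::sc_time penalty;              // compensation for an earlier optimistic mistake
};

struct RrScheduler {
    std::vector<RrRequestState> request;   // circular array, one slot per initiator socket
    sc_core::sc_time            busy_until;

    explicit RrScheduler(unsigned num_sockets) : request(num_sockets) {}

    // Called from the interconnect's b_transport with the id of the calling socket;
    // returns the waiting time charged to this access.
    sc_core::sc_time schedule(unsigned id, sc_core::sc_time now, sc_core::sc_time memory_delay)
    {
        sc_core::sc_time busy_delay =
            (busy_until > now) ? busy_until - now : sc_core::SC_ZERO_TIME;

        // Apply and then clear the penalty recorded for this initiator's previous mistake.
        sc_core::sc_time extra = request[id].penalty;
        request[id].penalty = sc_core::SC_ZERO_TIME;

        // Optimistically book the earliest possible slot for this request.
        request[id].active       = true;
        request[id].start        = now + busy_delay + extra;
        request[id].memory_delay = memory_delay;

        // Algorithm 3 additionally checks for an earlier active request from another
        // initiator that should have gone first in RR order, reschedules it, and
        // records a penalty for it (not shown here).

        busy_until = request[id].start + memory_delay;
        return busy_delay + extra;
    }
};
```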
Overall, our LT-CA interconnect modeling for FCFS and RR arbitration requires only a small local change in the LT interconnect model and does not require any knowledge of AT modeling. Furthermore, any LT-memory type with an arbitrary memory delay can be used with this interconnect. As a result, the LT-CA system model is contention-aware and reflects accurate simulation time. At the same time, it only needs the regular blocking transport interface and thus simulates fast.5

4.4 TLM-2.0 Loosely-timed Contention-aware (LT-CA) Model with Local Memories

Our fast and accurate LT-CA modeling enables the efficient exploration of alternative memory organizations. To demonstrate this, we describe an alternative architecture with local memories and interconnect components adjacent to computing units which improves the locality of data and thus minimizes contention. As illustrated in Figure 8, we can model each layer with two initiator sockets connected to two local memories that store layers’ input and output. The global interconnect also breaks into separate components that exclusively service two adjacent layers. In other words, the global shared memory transforms into many local memories which store the intermediate results.
Fig. 8. TLM-2.0 LT-CA model with local interconnects and local private memories.
For the case that layers have more than one input or output in the network, we can arrange multiple local memories for each input or output layer. In the case of a multi-consumer layer, we partition the output data into multiple smaller memories and devise a static scheduling scheme between the consumers to avoid contention in simultaneous read accesses. The sizes of these local memories equal the output buffer size divided by the number of consumers. This design leaves the total memory requirement of the application unchanged.
In the case of a multi-producer layer, each producer owns a local interconnect and memory. A dedicated memory and local interconnect per producer prevents the producers from competing for a single shared local memory. In this organization, producers write their output to local memory as soon as they complete their processing, without any risk of contention from the other producers.
Since data is stored and processed locally in separate memories, there is no need for any extra logic to implement cache coherence protocols between multiple computing units. This memory organization also eliminates the performance penalty for cache coherency, high interconnect congestion, and high global memory contention.

5 Transaction Level Model Generator

A model generator should be parameterizable, customizable, and extensible so that it can be flexibly utilized for wide DSE. The automatic generation of the TLM models has two benefits: (a) it saves time for model development and manual optimization, and (b) due to the absence of manual coding, it allows for easy verification and can minimize human errors. Based on the model generator initially developed in [2] and [3], we design a significantly improved generator framework, called netspec (Figure 2), to automatically generate customized SystemC TLM-1 and TLM-2.0 models with different timing accuracy (untimed, LT, AT, and LT-CA) from an abstract DNN specification.
Figure 9 illustrates the internal structure of our automatic generator. First, netspec is instrumented by a set of modeling parameters for DSE, which describe a wide range of modeling features, such as the desired TLM standard, coding style, inter-module communication, and buffer architecture. Second, netspec extracts the network architecture and network learned parameters by parsing the DNN textual protocol buffer file (.prototxt) and a DNN binary protocol buffer file (.caffemodel). Third, netspec constructs an internal graph data structure that stores each node’s inputs and outputs, the input and output buffer shapes of each node, and the shapes of weights/biases for those nodes with learned parameters. Fourth, based on the TLM parameters and network hyper-parameters, netspec constructs a custom generator for each SystemC module. Each module generator captures the attributes for a customized constructor, the specific method for TLM communication, the support for temporal decoupling, and the buffer addressing. Finally, netspec generates SystemC code for all the modules in the network, and the top-level network module with all its connections.
Fig. 9. Internal structure of the netspec TLM generator for transaction-level DSE.
Netspec is written in Python 3 and uses the Python interface to the Caffe library, pyCaffe, to read the input files and construct its internal data representation of the DNN.
Netspec can generate both TLM-1 and TLM-2.0 untimed, LT, AT, and LT-CA models based on modeling type, coding style, and contention configuration. In the case of TLM-2.0, netspec automatically generates an address map file based on the buffer architecture and supports memory address generation for multiple buffers for any layer in the network. For DSE, the latency of memory modules, the acceptance latency of the interconnect, and the peak computational performance available to each layer are all configurable. The global interconnect supports multiple memories with arbitrary sizes. To support TLM-2.0 model generation with local memories, a configuration file describes the local memory organization, including modules connected via local memories and the number of local memories attached to each module. Table 2 summarizes the features of netspec, which provide a DNN system synthesis framework that automatically generates SystemC models from an abstract specification.
Feature | Possible values | Description
TLM standard | TLM1 / TLM2 | Specify the TLM standard
Coding style | UT / LT / AT | Specify the model’s timing points: untimed, loosely-timed, or approximately-timed
Channel type | BLK / NBLK / SC | In case of TLM-1, user-defined blocking FIFO, user-defined non-blocking FIFO, or SystemC FIFO
Inter-module communication | FF / BP | In case of TLM-2.0, feed-forward or back-pressure
Buffer architecture | 1..N | In case of TLM-1, the number of buffers inside arbitrary channels; in case of TLM-2.0, the number of buffers allocated for each module inside memory
Interconnect architecture | 1..N | In case of TLM-2.0, the number of initiator sockets connected to the memory(ies) for multiple-memory support
Global memory architecture | (Num x Size) (in MiB) | In case of TLM-2.0, the number and size of memories connected to the interconnect
Computational capacity | X (GFLOPS) | In case of TLM-2.0, the peak computational capacity available to each layer (default 1 GFLOPS)
Memory latency | X (ps) | In case of TLM-2.0, the word latency for read/write memory accesses (default 1 ps)
Contention | True / False | In case of TLM-2.0, disable/enable modeling of interconnect/memory contention in a loosely-timed model
Local memory organization | filename | In case of TLM-2.0, local memory specification describing which modules are connected via local memories and the number of local memories attached to each module
Table 2. Parameterized Features for netspec Model Generation and Component Customization
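To illustrate the kind of output netspec produces, the sketch below shows what a generated TLM-2.0 LT layer module might look like. The module name, socket name, buffer addresses, burst size, and the annotated computation time are hypothetical placeholders chosen for this example; the actual generated code is customized per layer as described above.

```cpp
// Hypothetical sketch of a netspec-generated TLM-2.0 LT layer module.
// All names and constants here are illustrative placeholders.
#include <systemc>
#include <tlm>
#include <tlm_utils/simple_initiator_socket.h>
#include <vector>

SC_MODULE(ConvLayerLT) {
    tlm_utils::simple_initiator_socket<ConvLayerLT> socket;  // to the interconnect

    sc_dt::uint64 in_addr;   // address of the input buffer in global memory
    sc_dt::uint64 out_addr;  // address of the output buffer in global memory

    SC_CTOR(ConvLayerLT) : socket("socket"), in_addr(0), out_addr(0) {
        SC_THREAD(main_thread);
    }

    void main_thread() {
        std::vector<unsigned char> buf(64);  // one 64 B burst (simplified)
        while (true) {
            tlm::tlm_generic_payload trans;
            sc_core::sc_time delay = sc_core::SC_ZERO_TIME;

            // Read one burst of input data from global memory.
            trans.set_command(tlm::TLM_READ_COMMAND);
            trans.set_address(in_addr);
            trans.set_data_ptr(buf.data());
            trans.set_data_length(buf.size());
            trans.set_streaming_width(buf.size());
            trans.set_byte_enable_ptr(nullptr);
            trans.set_dmi_allowed(false);
            trans.set_response_status(tlm::TLM_INCOMPLETE_RESPONSE);
            socket->b_transport(trans, delay);
            wait(delay);                       // no temporal decoupling: sync immediately
            delay = sc_core::SC_ZERO_TIME;

            // Model the layer computation time (illustrative placeholder value).
            wait(sc_core::sc_time(10, sc_core::SC_US));

            // Write the result back to the output buffer.
            trans.set_command(tlm::TLM_WRITE_COMMAND);
            trans.set_address(out_addr);
            trans.set_response_status(tlm::TLM_INCOMPLETE_RESPONSE);
            socket->b_transport(trans, delay);
            wait(delay);
        }
    }
};
```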

6 Experiments and Results

Using our TLM model generator, we have generated a set of TLM-1 and TLM-2.0 models of GoogLeNet.6 This section describes our experiments, the obtained simulation results, and the insights gained from analyzing the models.

6.1 Simulation Setup

We use SystemC 2.3.1 and OpenCV 3.4.1, built with the default release mode settings, for simulation. For benchmarking, we measure the simulator run-time using Linux /usr/bin/time under CentOS 6.10. To obtain reproducible experiments, the Linux CPU scaling governor is set to “performance” mode so that all cores run at the maximum frequency, and file I/O operations are minimized. An Intel Xeon E5-2680 CPU running at 2.7 GHz, with eight physical cores, two threads per core, and two CPU sockets, is used as our simulation platform.7 Lastly, the stimulus module is configured to feed 100 images of 224 × 224 pixels to the model, which results in reasonable simulator run-times for our experiments.

6.2 Memory Load Estimation

For an estimation of the load that the DNN puts on the memory, we generate an untimed TLM-2.0 model (model F in Figure 1). The layers’ input and output data are stored in a single global memory, and each layer uses local storage to process its data. We use double-buffering in the memory so that the producer layer can write to the front buffer while the consumer layer simultaneously reads data from the back buffer, and vice versa. Since the untimed model runs on a delta-cycle basis, we devise a specific buffer architecture for the concat layers at the bottom of the inception modules to compensate for the unbalanced graph topology. The concat layers require 4, 2, 2, and 3 buffers in their tracks to synchronize correctly in the double-buffering scheme. Figure 10 shows the total read and write memory accesses to the single global memory while the model classifies 100 images. As the figure shows, the memory accesses peak at over 90 million bytes per delta cycle once all layers are active and processing data. This simulation result confirms that GoogLeNet is a very memory-intensive application.
Fig. 10. Total memory accesses for pass of 100 images in GoogLeNet.
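As a minimal sketch of the double-buffering scheme described above (the base address and buffer size are placeholders, not values from the generated models), a small helper can toggle the front/back roles between producer and consumer:

```cpp
// Minimal sketch of the double-buffering address scheme (illustrative values).
#include <cstdint>

struct DoubleBuffer {
    uint64_t base;    // base address of the buffer pair in global memory
    uint64_t size;    // size of one buffer in bytes
    bool     front;   // which half is currently the front (write) buffer

    // Producer writes into the front buffer.
    uint64_t write_addr() const { return front ? base : base + size; }
    // Consumer simultaneously reads from the back buffer.
    uint64_t read_addr()  const { return front ? base + size : base; }
    // Swap roles once both sides have finished the current image.
    void swap() { front = !front; }
};
```

Swapping only flips a flag, so the producer and consumer never write to and read from the same half of the buffer pair at the same time.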
The TLM-2.0 untimed model also measures the memory usage of the DNN. For example, Table 3 lists the memory requirements for each layer type in GoogLeNet. The total required memory for each layer is the sum of its input buffers, its output buffer, and, for layers with learned parameters, its weights and biases. To reduce the memory footprint of the DNN, adjacent modules can share their input/output buffers, so that a consumer module simply points to the output of the corresponding producer module.
Layer type | Input [MiB] | Output [MiB] | Weights [MiB] | Bias [MiB] | Total memory [MiB]
Input | 0.000 | 0.574 | 0.000 | 0.000 | 0.574
Convolution | 17.78 | 12.30 | 22.75 | 0.027 | 52.87
ReLU | 12.30 | 12.30 | 0.000 | 0.000 | 24.61
Pooling | 11.16 | 5.411 | 0.000 | 0.000 | 16.57
LRN | 3.062 | 3.062 | 0.000 | 0.000 | 6.125
Concat | 4.713 | 4.713 | 0.000 | 0.000 | 9.426
Dropout | 0.003 | 0.003 | 0.000 | 0.000 | 0.007
InnerProduct | 0.003 | 0.003 | 3.906 | 0.003 | 3.917
Softmax | 0.003 | 0.003 | 0.000 | 0.000 | 0.007
Extra Fillers | 0.000 | 1.638 | 0.000 | 0.000 | 1.638
Total | 49.04 | 40.02 | 26.66 | 0.030 | 115.76
Table 3. GoogLeNet Total Memory Footprint
As described in Section 3.3, to implement a double-buffering scheme, extra filler buffers are required to balance the graph architecture. Therefore, the total memory requirement of GoogLeNet in a double-buffering mode is double the size of the total output buffers: \(40.02 \times 2 = 80.04 \; MiB\). It is also worth mentioning that the memory footprint can be further reduced with a smart addressing scheme. For example, the consumers of the concat layers can simply point to the output buffers of the concat producers.

6.3 Comparison of LT, AT, and LT-CA Models

Our proposed netspec can automatically generate TLM-2.0 models for early software performance analysis and virtual prototyping. The generated LT models carry sufficient timing information to provide a coarse-grain estimation of the application execution time. However, the generic LT does not take into account any bus contention. In contrast, our proposed LT-CA counterpart is developed to analyze the effect of interconnect and memory contention. Additionally, AT models provide even better timing accuracy for memory contention analysis. Therefore, we also generate AT models of our application for comparison.
By choosing four different memory latencies8 (1 ns, 10 ns, 100 ns, 1,000 ns) and four different computational capacities9, we instruct netspec to generate 16 LT models (model G in Figure 1). We also instruct netspec to generate 16 LT-CA models (model I in Figure 1) and 16 AT models (model H in Figure 1) across the same parameters. The LT memory bus is configured with a 64-bit width, and the interconnect is configured to add no extra accept latency. Furthermore, we set the burst length to 8, so the size of the generic payload for each transaction is the burst length multiplied by the memory data width (8 * 8 B = 64 B). This choice of generic payload size reflects the more realistic operation of memory in burst mode, resulting in a more accurate timing estimation of the memory accesses. Since the AT memory can model both request and response accept latencies, we configure the AT memory request accept delay equal to the memory latency, and the response accept delay identical to the LT-CA and LT memory delays. Finally, we set the size of the generic payload for each transaction identical to the LT-CA and LT models (64 B). Table 4 summarizes the total simulated time for all 48 models.
Table 4. Total Simulated Time of GoogLeNet for Different Computational Capacities and Memory Latencies Using FCFS Scheduling (in Seconds)
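As an illustration of this configuration (the constant names and the linear delay model are our assumptions for this sketch, not code from the article), the 64-bit bus and burst length of 8 yield 64 B payloads, and the memory access delay per transaction scales with the number of words in the burst:

```cpp
// Sketch of the payload sizing and burst delay used in the experiments:
// 64-bit memory data width, burst length 8 => 64 B per generic payload.
#include <systemc>

constexpr unsigned kBusWidthBytes = 8;   // 64-bit memory data width
constexpr unsigned kBurstLength   = 8;   // words per transaction
constexpr unsigned kPayloadBytes  = kBusWidthBytes * kBurstLength;  // 64 B

// Per-word latency is swept from 1 ns to 1,000 ns in the experiments;
// here we assume the burst delay simply scales with the word count.
sc_core::sc_time burst_delay(const sc_core::sc_time& word_latency) {
    return kBurstLength * word_latency;
}
```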
As shown in Table 4, the total simulated times of the LT models (left box) are significantly lower than those of their LT-CA and AT counterparts. In generic LT modeling, transactions that simultaneously access the shared memory complete their accesses at the same simulated time point. This lack of contention modeling incorrectly makes the total simulated time shorter than in the other modeling styles that reflect contention. As shown in the middle box of Table 4, the total simulated times of the LT-CA models increase significantly compared to the LT models by considering the effect of contention. Finally, the right box shows the total simulated times of the AT models, which accurately model contention of both memory requests and memory responses. Comparing the LT-CA and AT simulated times shows the high accuracy and high fidelity of our proposed LT-CA modeling.
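To illustrate how contention can be accounted for in a loosely-timed model, the sketch below tracks the time at which the shared memory becomes free again and stretches the annotated delay of any transaction that arrives earlier, serving requests in FCFS order. This is our own simplified reconstruction of the idea, not the literal interconnect code of the article.

```cpp
// Illustrative FCFS contention bookkeeping for an LT-CA interconnect:
// the interconnect remembers when the memory becomes free again and
// extends the annotated delay of transactions that arrive before then.
#include <systemc>

class ContentionTracker {
    sc_core::sc_time busy_until_ = sc_core::SC_ZERO_TIME;  // memory free again at this time
public:
    // 'start' is the simulated time at which the transaction reaches the memory,
    // 'access' is the time the memory needs for this burst.
    // Returns the extra waiting time caused by contention.
    sc_core::sc_time account(const sc_core::sc_time& start,
                             const sc_core::sc_time& access) {
        sc_core::sc_time wait = (busy_until_ > start) ? (busy_until_ - start)
                                                      : sc_core::SC_ZERO_TIME;
        busy_until_ = start + wait + access;  // memory is occupied until this point
        return wait;
    }
};

// Inside the interconnect's b_transport (no temporal decoupling, so
// sc_time_stamp() + delay is the transaction's effective start time):
//   sc_core::sc_time start  = sc_core::sc_time_stamp() + delay;
//   sc_core::sc_time access = burst_delay(word_latency);
//   delay += tracker.account(start, access) + access;
```

With this kind of bookkeeping, simultaneous LT accesses are serialized in simulated time, in the spirit of the contention diagrams in Section 6.4, while each access still costs only one blocking function call.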
Figure 11(a) and (b) visualize the total simulated times of the LT and LT-CA models reported in Table 4. As expected, lower computational capacities and higher memory latencies increase the simulated time. However, the impact of higher memory latencies on performance is more significant. For example, the simulated time with 1,000 ns memory latency remains nearly constant across all computational capacities, indicating that an increase in computational power has no considerable effect on simulated time. Moreover, the impact of higher memory latencies on performance becomes more pronounced at higher computational capacities. For example, with 1,000 GFLOPS of computational capacity available, a 10× increase in memory latency from 10 ns to 100 ns leads to a 10× decrease in performance. In contrast, over the same latency interval at 1 GFLOPS, performance decreases only 3×. This result clearly shows that the application is heavily memory-bound rather than compute-bound.
Fig. 11. Simulated time of GoogLeNet for a pass of 100 images across four computational capacities and four memory latencies (a) LT (b) LT-CA (c) Contention ratio.
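This memory-bound behavior can be understood with a simple first-order model (our simplification for illustration, not a formula from the measurements), in which the time per image is approximately the sum of a compute term and a memory term:
\[ T_{\text{image}} \;\approx\; \frac{W}{C} \;+\; N_{\text{words}} \cdot L_{\text{mem}}, \]
where \(W\) is the workload in FLOP, \(C\) the computational capacity, \(N_{\text{words}}\) the number of memory words transferred, and \(L_{\text{mem}}\) the per-word latency. For large \(C\) the memory term dominates, so a 10× higher \(L_{\text{mem}}\) yields nearly a 10× longer \(T_{\text{image}}\), matching the behavior observed at 1,000 GFLOPS; at 1 GFLOPS the compute term still contributes, which is consistent with the smaller 3× slowdown.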
Figure 11(c) illustrates the impact of memory contention by showing the ratio of the simulated times of the LT-CA models over those of the LT models. The negative effect of memory contention is visible in every computation/memory configuration. However, the contention is more noticeable at higher computational capacities. For example, at 1,000 GFLOPS, the contention reduces the performance by 13×. Meanwhile, contention has a lower impact at lower computational capacities. At 1 GFLOPS, decreasing the memory latency by 10× from 10 ns to 1 ns (which is quite an expensive design decision) has little effect on performance.
To further show the generality of the LT-CA modeling approach, we change the interconnect scheduling from FCFS to approximate RR scheduling. Table 5 shows the total simulated time for the LT-CA and AT models using RR scheduling. Since LT modeling supports neither contention nor scheduling policies, we do not report the total simulated time of the LT models. As in Table 4, it is evident that the LT-CA simulated times show high accuracy and high fidelity compared to the reference AT models. Notably, the AT simulated times for the RR and FCFS scheduling policies match, which shows that the arbitration policy does not play a significant role for this application.
Table 5. Total Simulated Time of GoogLeNet for Different Computational Capacities and Memory Latencies Using RR Scheduling (in Seconds)
Table 6 reports the accuracy of the LT and LT-CA models compared to the reference AT model for each scheduling policy. As shown in Table 6(a) for FCFS scheduling, the LT models show very low accuracy (7% for 1,000 GFLOPS) compared to their AT counterparts, while the LT-CA models are almost perfectly accurate. The same pattern applies to the LT and LT-CA models under RR scheduling (Table 6(b)). While LT-CA with RR shows a minor decrease in accuracy, it remains a very accurate model compared to the LT model.
Table 6. Accuracy of LT and LT-CA Compared to Reference AT
In contrast to the simulated execution times, Table 7 lists the total simulator run-times of all LT, LT-CA, and AT models. As shown, the LT models simulate faster than their LT-CA and AT counterparts because they use only a single function call to complete a transaction. The LT-CA models simulate slightly slower than the LT models (1.2×) because they track the memory congestion status inside the interconnect. The AT models have much longer simulator run-times, as each transaction can have multiple phases and can use up to four function calls to complete. In particular, the simulation speed of the LT-CA models is an order of magnitude higher than that of their AT counterparts. Notably, the LT-CA models show an impressive total speedup of 46× in simulation while providing the same accuracy.
Table 7. Total Simulator Run-time of GoogLeNet for Different Computational Capacities and Memory Latencies Using FCFS Scheduling on a 32-core Host (in Seconds)
We also measure the total simulator run-time for LT-CA and AT models using RR scheduling in Table 8. As expected, AT models have long run-times (about 2 hours) while LT-CA models simulate much faster (about 20 minutes). As also expected, the AT reference models for both RR and FCFS policies are equally slow.
Table 8. Total Simulator Run-time of GoogLeNet for Different Computational Capacities and Memory Latencies Using RR Scheduling on a 32-core Host (in Seconds)
To quantify the improved speed/accuracy tradeoff, we compare the simulator run-times of the LT and LT-CA modeling styles under both scheduling policies. Table 9 shows the speedup of the LT and LT-CA models over the reference AT models for the two policies. For both policies, the LT models show the maximum simulation speedup (50×-60×). However, the simulator run-time for RR is about 9× longer than for FCFS, which is expected since the complexity of the RR approximation (Algorithm 3) is higher than that of the simple FCFS scheme (Algorithm 2).
Table 9. Simulator Run-time Speedup of LT and LT-CA Compared to Reference AT Simulator Run-time

6.4 Contention Visualization

With the fast and accurate LT-CA models available, we can further analyze the memory access patterns and the effect of interconnect/memory contention in the network. We simulate the application with 1,000 GFLOPS computational capacity and 1 ns memory latency, using the same LT-CA setup and the buffer size as the generic payload data length. Figure 12 shows the transaction-level timing diagram of the first inception module, inception_3a, in GoogLeNet. The left diagram shows the timing without contention (LT model) and the right one with contention (LT-CA). The x- and y-axes represent the simulated time and the names of the parallel tracks in the inception module, respectively. The elapsed times for memory write, memory read, computation, and contention are colored blue, light green, dark green, and red, respectively.
Fig. 12. Simulated time of the inception_3a module without (left) and with (right) contention (1,000 GFLOPS, 1ns memory latency).
Since the LT model does not capture contention, the layers in parallel tracks access memory without blocking each other. In the LT-CA model, however, when one layer accesses the memory, the other layers are blocked and wait until the memory becomes available again (red areas). For instance, at the beginning of the contention diagram in Figure 12, once the first layer in the 1x1 track issues a read transaction, all layers in the other tracks are blocked until that read transaction completes.
The visual charts in Figure 12 make clear that the LT-CA model accurately reflects the red idle waiting periods of the modules blocked by memory contention. For example, the first layers in all four tracks of inception_3a simultaneously initiate read accesses to the memory at 5.07 ms with a payload size of 602,112 bytes. The layer in the 1x1 track is granted access and occupies the memory for approximately 0.07 ms (the total delay to access 602,112 bytes of data). Therefore, the read transaction of the layer in the 3x3 track experiences a delay of 0.07 ms. Subsequently, the layers in the 5x5 and pool tracks face contention delays of 0.15 ms and 0.23 ms, respectively. Later, as soon as the layer in the 1x1 track completes its computation at 5.15 ms, it issues a transaction to write its result to the memory. Since there are already pending transactions, the layer in the 1x1 track must wait until the memory becomes available again at 5.37 ms, once the layer in the pool track completes its read transaction. Hence, the timing annotation of the write transaction for the layer in the 1x1 track is updated with the waiting time until the next memory availability (5.37 ms - 5.15 ms = 0.22 ms). This schedules the write operation to start at 5.37 ms, precisely when the memory has become available again (start of the blue bar in the 1x1 track in Figure 12).
Figures 13 and 14 show the transaction-level timing diagrams of all inception modules in GoogLeNet for one pass of an image, without and with contention, respectively. As seen in Figure 14, layers in parallel tracks block each other and contention is high. Furthermore, the 3×3 track has the highest elapsed time in all inception modules, making it the critical path of execution. This is valuable feedback to system architects on how to allocate computational resources.
Fig. 13. Simulated time of all nine inception modules without contention (1,000 GFLOPS, 1 ns memory latency).
Fig. 14. Simulated time of all nine inception modules with contention for image #0 (1,000 GFLOPS, 1 ns memory latency).
The effect of contention becomes even more significant once the DNN pipeline is full of images to process and the layers frequently access the shared memory. We simulate the LT-CA model of GoogLeNet while feeding in 100 images to classify. Figure 15 shows the timing diagram for the 75th image. As the almost entirely red coloring in the figure shows, the layers are mostly blocked due to contention. The effect of back-pressure is also clearly visible, especially in the first inception module, whose layers are mostly blocked while waiting for the subsequent inception modules to process their data.
Fig. 15. Simulated time of all nine inception modules with contention for image #75 (1,000 GFLOPS, 1 ns memory latency).
For the final experiment, we also build the LT-CA model of the proposed local memory organization discussed in Section 4.4. Since all data is stored close to the compute cores and potential access conflicts are explicitly avoided, memory contention is eliminated entirely. Indeed, our measurements confirm that the simulated time of the network matches precisely that of the LT model with global shared memory that ignores contention (model G in Figure 1, Table 4). Thus, there is indeed no contention. This benefit of the local memory organization becomes even more significant when the DNN processes many images at once.
The high amount of contention on the shared memory suggests new architectures that place private local memories close to the computing units (such as the one in Figure 8). Memory organizations that exploit the fine-grained data locality of the application reduce the average bandwidth demand on the global shared memory. Such a memory organization eliminates the performance overhead of memory contention and complex cache coherency. However, new processor architectures with private local memories close to the computing units require us to rethink the conventional programming model, compilation flow, and run-time system support. For the same reason, transaction-level modeling of architectures with local memories requires fast yet accurate estimation of contention. Our contention-aware modeling is therefore attractive for enabling efficient codesign of hardware and software solutions in the future.
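As an architectural sketch of such a local memory organization (the module name, the 1 MiB size, and the 1 ns access delay are placeholders chosen for illustration, not values from Section 4.4), each layer’s initiator socket can be bound directly to its own private memory target, so its accesses never traverse the shared interconnect:

```cpp
// Sketch of a private local-memory binding (illustrative names and sizes only):
// each layer's initiator socket is bound directly to its own memory target,
// so its accesses never compete on the shared interconnect.
#include <systemc>
#include <tlm>
#include <tlm_utils/simple_target_socket.h>
#include <cstring>
#include <vector>

SC_MODULE(LocalMemory) {
    tlm_utils::simple_target_socket<LocalMemory> tsock;
    std::vector<unsigned char> storage;

    SC_CTOR(LocalMemory) : tsock("tsock"), storage(1 << 20) {  // 1 MiB placeholder
        tsock.register_b_transport(this, &LocalMemory::b_transport);
    }

    void b_transport(tlm::tlm_generic_payload& trans, sc_core::sc_time& delay) {
        unsigned char* mem = storage.data() + trans.get_address();
        if (trans.is_read())
            std::memcpy(trans.get_data_ptr(), mem, trans.get_data_length());
        else
            std::memcpy(mem, trans.get_data_ptr(), trans.get_data_length());
        delay += sc_core::sc_time(1, sc_core::SC_NS);  // no contention to account for
        trans.set_response_status(tlm::TLM_OK_RESPONSE);
    }
};

// Binding in the top-level module (sketch): one private memory per layer.
//   layer.socket.bind(local_mem.tsock);
```

Binding each compute module to a private target in this way removes the shared resource altogether, which is why the simulated time matches the contention-free LT model.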

7 Conclusion

Interconnect and memory contention is a critical aspect of system-level models that requires attention already at the early design stages. This article presents a SystemC TLM framework that automatically generates a configurable set of TLM-1 and TLM-2.0 models from a high-level DNN specification. For efficient DSE and performance optimization, our novel LT-CA modeling breaks the speed/accuracy tradeoff: it offers high simulation speed together with accurate observation of memory contention in system models that do not use temporal decoupling, DMI, or pipelined transactions.
We have demonstrated the effectiveness of this approach for representative complex DNN graph structures such as GoogLeNet. Our LT-CA modeling is an order of magnitude faster than its equivalent AT model (46×) while maintaining the same timing accuracy. Furthermore, we have been able to visualize memory contention to a greater extent using transaction-level timing diagrams, enabling effortless detection of excessive concurrent memory accesses.
In future work, we intend to apply our modeling framework to more DNNs and other memory-intensive applications. We also plan to explore different local memory topologies and conduct a tradeoff analysis concerning the costs incurred by local interconnects and local memories in terms of area, bandwidth, and latency.
One can also study the simulator run-time of the different transaction-level models and compare them against a purely algorithmic model to estimate the overhead of TLM over plain C/C++ implementations. Finally, an accuracy comparison of our LT-CA modeling against lower-level abstractions, e.g., RTL hardware, is desirable future work.

Acknowledgments

The authors would like to thank the anonymous reviewers for their valuable comments and helpful suggestions to improve this work.

Footnotes

1
DMI bypasses the interconnect network by use of pointers and thus offers no concept of monitoring contention.
2
Temporal decoupling often results in reordered timing of memory accesses from different initiators, making contention hard to observe. For example, two initiators \(A\) and \(B\) using a memory \(M\) may in reality yield alternating accesses, say \(A_1\), \(B_1\), \(A_2\), \(B_2\), \(A_3\), \(B_3\), whereas a temporally decoupled simulation can merge these as \(A_1\), \(A_2\), \(A_3\), and \(B_1\), \(B_2\), \(B_3\). Thus, the interleaved timing is hidden and contention cannot be monitored accurately.
3
For advanced bus protocols with overlapping requests or pipelined transactions, modeling at lower abstraction levels would be needed, such as cycle-count accurate AT or clocked RTL.
4
Active requests are transactions already executed by the simulator but with an end date in the future, hence possibly to be rescheduled in time by the LT-CA algorithm.
5
A cycle-accurate model is the closest abstraction to the final hardware implementation and can exhibit the most accurate estimation of memory contention in the design. One such model with a cycle-accurate shared-memory subsystem has been studied in [1].
6
To further demonstrate the generality and effectiveness of our modeling framework and contention modeling, we have also studied another state-of-the-art deep CNN application, Single Shot MultiBox Detector (SSD) [25] for object detection rather than image classification [1]. Given our automatic model-based design flow, the SSD models can be quickly generated by simply feeding the abstract SSD specification to netspec. For space reasons, however, we focus in this article solely on the GoogLeNet application.
7
We have also measured our simulator run-time using another simulation platform with only four physical cores and two threads per core. The simulation results confirm an identical pattern in measured total simulator run-time for the host with fewer available cores.
8
The swept memory latency and computational capacity values are based on today’s technology candidates for fulfilling those requirements. For example, memory technologies that can realize the aforementioned access delays include static random-access memory (SRAM), high-bandwidth on-chip memory, dynamic random-access memory (DRAM), and flash memory.
9
A viable candidate for realizing such computational capacities is a massively parallel processor array. For example, chips such as Epiphany-V [26] (2016) and Manticore [36] (2020) have already demonstrated high-performance and energy-efficient many-core architectures. Considering their competitive performance and energy metrics, such as GFLOPS/mm\(^2\), GFLOPS/Watt, and Watt/mm\(^2\), there are good indications that, with future increases in transistor density, such computing power may become available.

References

[1]
Emad Arasteh. 2022. Transaction-Level Modeling of Deep Neural Networks for Efficient Parallelism and Memory Accuracy. Ph.D. Dissertation. UC Irvine, Irvine, CA.
[2]
E. M. Arasteh and R. Dömer. 2023. An untimed SystemC model of GoogLeNet. In Analysis, Estimations, and Applications of Embedded Systems (IESS’19), IFIP Advances in Information and Communication Technology, M. A. Wehrmeister, M. Kreutz, M. Götz, S. Henkler, A. D. Pimentel, and A. Rettberg (Eds.). Vol. 576. Springer, Cham.
[3]
E. M. Arasteh and R. Dömer. 2021. Improving parallelism in system level models by assessing PDES performance. 2021 Forum on Specification & Design Languages (FDL’21), Antibes, 01–07. DOI:
[4]
Emad Arasteh and Rainer Dömer. 2021. Systematic Evaluation of Six Models of GoogLeNet using PDES. Technical Report CECS-TR-21-03. CECS, UCI.
[5]
John W. Backus. 1978. Can programming be liberated from the von Neumann style? A functional style and its algebra of programs. Communications of the ACM 21, 8 (1978), 613–641. DOI:
[6]
Dimitri Bertsekas and Robert Gallager. 1992. Data Networks (2nd. ed.). Prentice-Hall, Inc.
[7]
David C. Black, Jack Donovan, Bill Bunton, and Anna Keist. 2009. SystemC: From the Ground Up, Second Edition (2nd. ed.). Springer Publishing Company, Incorporated.
[8]
Alex Bobrek, JoAnn M. Paul, and Donald E. Thomas. 2007. Shared resource access attributes for high-level contention models. In Proceedings of the 2007 44th ACM/IEEE Design Automation Conference. 720–725.
[9]
L. Cai and D. Gajski. 2003. Transaction level modeling: An overview. In Proceedings of the International Conference on Hardware/Software Codesign and System Synthesis. Newport Beach, CA.
[10]
Chung-Ho Chen and Feng-Fu Lin. 1999. An easy-to-use approach for practical bus-based system design. IEEE Transactions on Computers 48, 8 (1999), 780–793. DOI:
[11]
Matthew I. Frank, Anant Agarwal, and Mary K. Vernon. 1997. LoPC: Modeling contention in parallel algorithms. ACM SIGPLAN Notices 32, 7 (1997), 276–287. DOI:
[12]
Kayo Fujiwara and Henri Casanova. 2007. Speed and accuracy of network simulation in the simgrid framework. In Proceedings of the 2nd International Conference on Performance Evaluation Methodologies and Tools (ValueTools’07). ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), Brussels, BEL, 10 pages.
[13]
Daniel D. Gajski, Samar Abdi, Andreas Gerstlauer, and Gunar Schirner. 2009. Embedded System Design: Modeling, Synthesis and Verification (1st. ed.). Springer Publishing Company, Incorporated.
[14]
Daniel D. Gajski, Jianwen Zhu, Rainer Dömer, Andreas Gerstlauer, and Shuqing Zhao. 2000. SpecC: Specification Language and Design Methodology. Kluwer Academic Publishers.
[15]
Thorsten Grötker, Stan Liao, Grant Martin, and Stuart Swan. 2002. System Design with SystemC. Kluwer Academic Publishers.
[16]
IEEE Computer Society. 2011. IEEE Standard 1666-2011 for Standard SystemC Language Reference Manual. IEEE, New York.
[17]
Intel VTune Profiler. Performance Analysis for Applications and Systems. Retrieved June 26, 2021 from https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/vtune-profiler.html
[18]
Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia (MM’14). Association for Computing Machinery, New York, NY, 675–678.
[19]
John Aynsley. 2009. TLM-2.0 base protocol checker. Retrieved February 02, 2020 from https://www.doulos.com/knowhow/systemc/tlm2/at_example
[20]
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Proceedings of the NIPS.
[21]
S. Kunzli, F. Poletti, L. Benini, and L. Thiele. 2006. Combining simulation and formal methods for system-level performance analysis. In Proceedings of the Design Automation Test in Europe Conference. 1–6. DOI:
[22]
Y. Le Cun, L. D. Jackel, B. Boser, J. S. Denker, H. P. Graf, I. Guyon, D. Henderson, R. E. Howard, and W. Hubbard. 1989. Handwritten digit recognition: Applications of neural network chips and automatic learning. IEEE Communications Magazine 27, 11(1989), 41–46.
[23]
A. Legrand, L. Marchal, and H. Casanova. 2003. Scheduling distributed applications: The SimGrid simulation framework. In Proceedings of the 3rd IEEE/ACM International Symposium on Cluster Computing and the Grid, 2003.138–145. DOI:
[24]
Guantao Liu, Tim Schmidt, and Rainer Dömer. 2015. RISC Compiler and Simulator, Alpha Release V0.2.1: Out-of-Order Parallel Simulatable SystemC Subset. Technical Report CECS-TR-15-02. Center for Embedded and Cyber-physical Systems, University of California, Irvine.
[25]
W. Liu et al. 2016. SSD: Single shot multiBox detector. In Computer Vision-(ECCV’16). B. Leibe, J. Matas, N. Sebe, and M. Welling (Eds.). Vol. 9905, Lecture Notes in Computer Science, Springer, Cham.
[26]
Andreas Olofsson. 2016. Epiphany-V: A 1024 processor 64-bit RISC System-On-Chip. arXiv:1610.01832. Retrieved from https://arxiv.org/abs/1610.01832
[27]
Open Source Computer Vision. OpenCV Tutorials, Load Caffe framework models. Retrieved May 11, 2019 from https://docs.opencv.org/3.4/d5/de7/tutorial_dnn_googlenet.html
[28]
Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. (2017).
[29]
M. M. Sabry Aly, T. F. Wu, A. Bartolo, Y. H. Malviya, W. Hwang, G. Hills, I. Markov, M. Wootters, M. M. Shulaker, H. Philip Wong, and S. Mitra. 2019. The N3XT approach to energy-efficient abundant-data computing. Proceedings of the IEEE 107, 1 (2019), 19–48. DOI:
[30]
Gunar Schirner and Rainer Dömer. 2009. Quantitative analysis of the speed/accuracy tradeoff in transaction level modeling. ACM Transactions on Embedded Computing Systems 8, 1(2009), 29 pages. DOI:
[31]
Gunar Schirner, Andreas Gerstlauer, and Rainer Dömer. 2010. Fast and accurate processor models for efficient MPSoC design. ACM Transactions on Design Automation of Electronic Systems 15, 2 (2010), 26 pages. DOI:
[32]
Gagandeep Singh, Lorenzo Chelini, Stefano Corda, Ahsan Javed Awan, Sander Stuijk, Roel Jordans, Henk Corporaal, and Albert-Jan Boonstra. 2019. Near-memory computing: Past, present, and future. arXiv:1908.02640. Retrieved from https://arxiv.org/abs/1908.02640
[33]
Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, and Joel S. Emer. 2017. Efficient processing of deep neural networks: A tutorial and survey. Proceedings of the IEEE 105, 12 (2017), 2295–2329.
[34]
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1–9.
[35]
Ernesto Wandeler. 2006. Modular performance analysis and interface-based design for embedded real-time systems. Ph.D. Dissertation. Swiss Federal Institute of Technology, Zurich, Switzerland.
[36]
F. Zaruba, F. Schuiki, and L. Benini. 2020. A 4096-core RISC-V chiplet architecture for ultra-efficient floating-point computing. 2020 IEEE Hot Chips 32 Symposium (HCS), Palo Alto, CA, 1–24. DOI:
