## VTA: Open & Flexible DL Acceleration

Thierry Moreau

TVM Conference, Dec 12th 2018





























### Tensor Expression IR





























Figure labels: LLVM; CUDA; Metal; VTA: Open Hardware Accelerator; Edge FPGA; Cloud FPGA; ASIC





Transparent End-to-End Deep Learning System Stack





## TVM+VTA Stack Goals


- Blueprint for a complete deep learning acceleration stack
- Experimentation framework for cross-stack deep learning optimizations
- Open-source community for industrial-strength deep learning acceleration







## Extensible Hardware Architecture

## Programmability Across the Stack

## Facilitates HW-SW Co-Design




## VTA: General DL Architecture



Hardware Datatype: `<16 x i8>` vs. `<32 x i4>`

**Memory Subsystem**

Operation Support: {ADD, MUL, SHL, MAX} vs. {ADD, SHL, MAX}



## VTA Hardware Architecture

Philosophy: simple hardware, provide software-defined flexibility

Monolithic Design

Load Stage, Execute Stage, Store Stage

Low-level synchronization between tasks is explicitly managed by the software.





## Two-Level ISA Overview

Provides the right tradeoff between expressiveness and code compactness

DENSE, ALU

- Use RISC micro-ops to perform single-cycle tensor operations
  - R0: R0 + GEMM(A8, W3)
  - R2: MAX(R0, ZERO)

## **VTA RISC Micro-Kernels**

Multiple RISC instructions define a micro-kernel, which can be invoked by a CISC instruction

- CONV2D: layout=NCHW, chan=128, kernel=(3,3), padding=(1,1), strides=(1,1)
- CONV2D: layout=NCHW, chan=256, kernel=(1,1), padding=(0,0), strides=(2,2)
- CONV2D TRANSPOSE: ...
- GROUP CONV2D: ...

Micro-kernel programming gives us software-defined flexibility

Figure labels: DCGAN, ResNet50

# How is VTA Programmed?

```
// Pseudo-code for convolution program for the VTA accelerator
// Virtual Thread 0
0x00: LOAD(PARAM[ 0-71])                                              // LD@TID0
0x01: LOAD(ACTIV[ 0-24])                                              // LD@TID0
0x02: LOAD(LDBUF[ 0-31])                                              // LD@TID0
0x03: PUSH(LD->EX)                                                    // LD@TID0
0x04: POP (LD->EX)                                                    // EX@TID0
0x05: EXE (ACTIV[ 0-24], PARAM[ 0-71], LDBUF[ 0-31], STBUF[ 0- 7])    // EX@TID0
0x06: PUSH(EX->LD)                                                    // EX@TID0
0x07: PUSH(EX->ST)                                                    // EX@TID0
0x08: POP (EX->ST)                                                    // ST@TID0
0x09: STOR(STBUF[ 0- 7])                                              // ST@TID0
0x0A: PUSH(ST->EX)                                                    // ST@TID0
// Virtual Thread 1
0x0B: LOAD(ACTIV[25-50])                                              // LD@TID1
0x0C: LOAD(LDBUF[32-63])                                              // LD@TID1
0x0D: PUSH(LD->EX)                                                    // LD@TID1
0x0E: POP (LD->EX)                                                    // EX@TID1
0x0F: EXE (ACTIV[25-50], PARAM[ 0-71], LDBUF[32-63], STBUF[32-39])    // EX@TID1
0x10: PUSH(EX->LD)                                                    // EX@TID1
0x11: PUSH(EX->ST)                                                    // EX@TID1
0x12: POP (EX->ST)                                                    // ST@TID1
0x13: STOR(STBUF[32-39])                                              // ST@TID1
0x14: PUSH(ST->EX)                                                    // ST@TID1
// Virtual Thread 2
0x15: POP (EX->LD)                                                    // LD@TID2
0x16: LOAD(PARAM[ 0-71])                                              // LD@TID2
0x17: LOAD(ACTIV[ 0-24])                                              // LD@TID2
0x18: LOAD(LDBUF[ 0-31])                                              // LD@TID2
0x19: PUSH(LD->EX)                                                    // LD@TID2
0x1A: POP (LD->EX)                                                    // EX@TID2
0x1B: POP (ST->EX)                                                    // EX@TID2
0x1C: EXE (ACTIV[ 0-24], PARAM[ 0-71], LDBUF[ 0-31], STBUF[ 0- 7])    // EX@TID2
0x1D: PUSH(EX->ST)                                                    // EX@TID2
0x1E: POP (EX->ST)                                                    // ST@TID2
0x1F: STOR(STBUF[ 0- 7])                                              // ST@TID2
// Virtual Thread 3
0x20: POP (EX->LD)                                                    // LD@TID3
0x21: LOAD(ACTIV[25-50])                                              // LD@TID3
0x22: LOAD(LDBUF[32-63])                                              // LD@TID3
0x23: PUSH(LD->EX)                                                    // LD@TID3
0x24: POP (LD->EX)                                                    // EX@TID3
0x25: POP (ST->EX)                                                    // EX@TID3
0x26: EXE (ACTIV[25-50], PARAM[ 0-71], LDBUF[32-63], STBUF[32-39])    // EX@TID3
0x27: PUSH(EX->ST)                                                    // EX@TID3
0x28: POP (EX->ST)                                                    // ST@TID3
0x29: STOR(STBUF[32-39])                                              // ST@TID3
```

(a) Blocked convolution program with multiple thread contexts
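The PUSH/POP pairs in the program above can be read as tokens moving through dependence queues between the load (LD), execute (EX), and store (ST) stages. The following is a minimal Python sketch of that idea, not the VTA runtime: the `run` helper, the instruction tuples, and the round-robin scheduling are all hypothetical, and the four queue names are simply taken from the arrows in the listing.

```python
from collections import deque

# Each stage owns an in-order instruction queue (an assumption of this sketch).
# ("POP", q) blocks until queue q holds a token; ("PUSH", q) adds a token.
def run(programs):
    """Execute per-stage programs; return the issue order and any stalled ops."""
    tokens = {"LD->EX": 0, "EX->LD": 0, "EX->ST": 0, "ST->EX": 0}
    queues = {stage: deque(prog) for stage, prog in programs.items()}
    trace = []
    progress = True
    while progress:
        progress = False
        for stage, q in queues.items():
            if not q:
                continue
            op, arg = q[0]
            if op == "POP":
                if tokens[arg] == 0:
                    continue  # stage stalls until the producer pushes a token
                tokens[arg] -= 1
            elif op == "PUSH":
                tokens[arg] += 1
            q.popleft()
            trace.append((stage, op, arg))
            progress = True
    stalled = {s: list(q) for s, q in queues.items() if q}
    return trace, stalled

# Virtual thread 0 from the listing, split by the stage that issues each op.
programs = {
    "LD": [("LOAD", "PARAM"), ("LOAD", "ACTIV"), ("LOAD", "LDBUF"), ("PUSH", "LD->EX")],
    "EX": [("POP", "LD->EX"), ("EXE", "conv"), ("PUSH", "EX->LD"), ("PUSH", "EX->ST")],
    "ST": [("POP", "EX->ST"), ("STOR", "STBUF"), ("PUSH", "ST->EX")],
}
trace, stalled = run(programs)
```

In this sketch the execute stage cannot issue `EXE` until the load stage has pushed its token, and the store stage waits on the execute stage in the same way. A program whose POP has no matching PUSH shows up in `stalled` instead of completing.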










```
// Convolution access pattern dictated by micro-coded program.
// Each register index is derived as a 2-D affine function.
// e.g. idx_{rf} = a_{rf}y + b_{rf}x + c_{rf}^{0}, where c_{rf}^{0} is specified by
//      micro op 0 fields.
//
for y in [0...i)
  for x in [0...j)
    rf[idx_{rf}^{0}] = GEVM(act[idx_{act}^{0}], par[idx_{par}^{0}])
    rf[idx_{rf}^{1}] += GEVM(act[idx_{act}^{1}], par[idx_{par}^{1}])
    ...
    rf[idx_{rf}^{n}] += GEVM(act[idx_{act}^{n}], par[idx_{par}^{n}])
```

(b) Convolution micro-coded program
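The loop nest in (b) can be sketched in plain Python. This is an illustration under stated assumptions, not the VTA implementation: `Affine`, `MicroOp`, `gevm`, and `run_micro_kernel` are hypothetical names, each index follows the affine form `a*y + b*x + c` from the comment in the listing, and every micro-op accumulates into a register file that starts at zero.

```python
from dataclasses import dataclass

@dataclass
class Affine:
    a: int  # multiplier for the outer loop variable y
    b: int  # multiplier for the inner loop variable x
    c: int  # constant offset taken from the micro-op fields

    def at(self, y, x):
        return self.a * y + self.b * x + self.c

@dataclass
class MicroOp:
    rf: Affine   # destination register-file index
    act: Affine  # activation buffer index
    par: Affine  # parameter (weight) buffer index

def gevm(vec, mat):
    """Vector-matrix product: vec (K) times mat (K x N) gives a list of N sums."""
    return [sum(v * row[n] for v, row in zip(vec, mat)) for n in range(len(mat[0]))]

def run_micro_kernel(uops, i, j, act, par, rf):
    """One CISC-level invocation: replay the micro-op list over a 2-D loop nest."""
    for y in range(i):
        for x in range(j):
            for u in uops:
                out = gevm(act[u.act.at(y, x)], par[u.par.at(y, x)])
                dst = rf[u.rf.at(y, x)]
                for n, val in enumerate(out):
                    dst[n] += val  # accumulate into the register file
    return rf

# Two activation vectors, one 2x2 weight tile, two accumulator registers.
act = [[1, 2], [3, 4]]
par = [[[1, 0], [0, 1]]]  # identity weights, so outputs equal inputs
rf = [[0, 0], [0, 0]]
uops = [MicroOp(rf=Affine(0, 1, 0), act=Affine(0, 1, 0), par=Affine(0, 0, 0))]
run_micro_kernel(uops, i=1, j=2, act=act, par=par, rf=rf)
```

Changing only the `Affine` coefficients and loop extents yields a different access pattern with the same micro-op list, which is the kind of software-defined flexibility the micro-kernel slides describe.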

```
// Max-pool, batch normalization and activation function
// access pattern dictated by micro-coded program.
// Each register index is derived as a 2-D affine function.
// e.g. idx_{dst} = a_{dst}y + b_{dst}x + c_{dst}^{0}, where c_{dst}^{0} is specified by
//      micro op 0 fields.
//
for y in [0...i)
  for x in [0...j)
    // max pooling
    rf[idx_{dst}^{0}] = MAX(rf[idx_{dst}^{0}], rf[idx_{src}^{0}])
    rf[idx_{dst}^{1}] = MAX(rf[idx_{dst}^{1}], rf[idx_{src}^{1}])
    ...
    // batch norm
    rf[idx_{dst}^{m}] = MUL(rf[idx_{dst}^{m}], rf[idx_{src}^{m}])
    rf[idx_{dst}^{m+1}] = ADD(rf[idx_{dst}^{m+1}], rf[idx_{src}^{m+1}])
    rf[idx_{dst}^{m+2}] = MUL(rf[idx_{dst}^{m+2}], rf[idx_{src}^{m+2}])
    rf[idx_{dst}^{m+3}] = ADD(rf[idx_{dst}^{m+3}], rf[idx_{src}^{m+3}])
    ...
    // activation
    rf[idx_{dst}^{n-1}] = RELU(rf[idx_{dst}^{n-1}], rf[idx_{src}^{n-1}])
    rf[idx_{dst}^{n}] = RELU(rf[idx_{dst}^{n}], rf[idx_{src}^{n}])
```

(c) Max pool, batch norm and activation micro-coded program






## Programmability Across the Stack

### programmer friendly construct

// Virtual Threading
tx, co = s[OUT\_L].split(co, factor=2)
s[OUT\_L].bind(tx, thread\_axis("cthread"))

low-level pipelined execution

- Tensor Expression Optimizer (TVM): inserts dependence ops based on thread scope
- VTA Runtime & JIT Compiler: generates instruction stream
- VTA Hardware/Software Interface (ISA): exposes explicit dependences
- VTA MicroArchitecture: execution predicated on dependences

9-60% better compute utilization





1. How do we partition work and explicitly manage on-chip memories?

```
// Tile
yo, xo, yi, xi = s[OUT].tile(y, x, 4, 4)
// Scoped cache read
INP_L = s.cache_read(INP, vta.inp, [OUT])
s[INP_L].compute_at(s[OUT], xo)
```

2. How do we take advantage of tensor computation intrinsics?

```
// Tensorize
s[OUT_L].tensorize(ni)
```

3. How do we hide memory access latency?

```
// Virtual Threading
tx, co = s[OUT_L].split(co, factor=2)
s[OUT_L].bind(tx, thread_axis("cthread"))
```
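To make the tiling primitive concrete, the sketch below shows the loop structure that a 4x4 `tile` of a 2-D iteration space corresponds to. It is written in plain Python rather than the TVM schedule API, the function name `tiled_indices` is hypothetical, and it assumes the extents divide evenly by the tile factors.

```python
def tiled_indices(height, width, y_factor=4, x_factor=4):
    """Yield (y, x) in the order a yo, xo, yi, xi loop nest visits them."""
    assert height % y_factor == 0 and width % x_factor == 0
    for yo in range(height // y_factor):        # outer tile row
        for xo in range(width // x_factor):     # outer tile column
            # A scoped cache read computed at xo would load this tile's
            # inputs into on-chip memory here, once per (yo, xo).
            for yi in range(y_factor):          # rows inside the tile
                for xi in range(x_factor):      # columns inside the tile
                    yield yo * y_factor + yi, xo * x_factor + xi

order = list(tiled_indices(8, 8))
```

Every point is still visited exactly once. Only the order changes, so that each 4x4 block is finished before the next one starts and its inputs can stay in on-chip memory for the duration of the block.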



## Facilitates HW-SW Co-Design

# Hardware Exploration with VTA

### HW / SW Constraints



VTA Design Space

- GEMM Intrinsic: e.g. (1,32) x (32,32) vs. (4,16) x (16,16)
- BRAM allocation between buffers, register file, micro-op cache
- Circuit Pipelining: e.g. for GEMM core between [11, 20] stages
- PLL Frequency Sweeps: e.g. 250 vs. 300 vs. 333 MHz

### VTA Candidate Designs

- #1 Design AAA @ 307 GOPs
- #2 Design BBB @ 307 GOPs
- #3 Design CCC @ 307 GOPs
- #4 Design DDD @ 256 GOPs

Needs to pass place & route and pass timing closure
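A design space like this lends itself to a simple enumeration. The sketch below is illustrative only: it assumes peak throughput is 2 ops (multiply and add) per MAC per cycle, ignores BRAM allocation and pipelining depth, and uses hypothetical names (`peak_gops`, `candidates`). It does not model place & route or timing closure, which every candidate must still pass.

```python
from itertools import product

def peak_gops(batch, block_in, block_out, freq_mhz):
    """Peak throughput of a (batch, block_in) x (block_in, block_out) GEMM unit."""
    macs_per_cycle = batch * block_in * block_out
    return 2 * macs_per_cycle * freq_mhz / 1000.0  # assumes 2 ops per MAC

gemm_shapes = [(1, 32, 32), (4, 16, 16)]  # from the GEMM intrinsic examples
frequencies = [250, 300, 333]             # MHz, from the PLL sweep examples

# Rank every (shape, frequency) pair by estimated peak throughput.
candidates = sorted(
    ((peak_gops(*shape, f), shape, f) for shape, f in product(gemm_shapes, frequencies)),
    reverse=True,
)
```

Both example shapes contain 1024 MACs, so under this assumption they tie on peak throughput at any given frequency and differ only in how well real workloads map onto them. The slides do not state how the GOPs figures of the candidate designs were derived, so these estimates are not expected to reproduce them.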



### AutoTVM for Conv2D on Hardware Candidates

# Schedule Exploration with VTA

### Operator Performance





Figure legend: ARM Cortex A53 (TVM), Mali T860 (ARMCL), FPGA Ultra96 (VTA)

Models: ResNet-34, ResNet-50













# VTA Released in the Summer


### About VTA

The Versatile Tensor Accelerator (VTA) is an extension of the TVM framework designed to advance deep learning and hardware innovation. VTA is a programmable accelerator that exposes a RISC-like programming abstraction to describe compute and memory operations at the tensor level. We designed VTA to expose the most salient and common characteristics of mainstream deep learning accelerators, such as tensor operations, DMA load/stores, and explicit compute/memory arbitration.

VTA is more than a standalone accelerator design: it's an end-to-end solution that includes drivers, a JIT runtime, and an optimizing compiler stack based on TVM. The current release includes a behavioral hardware simulator, as well as the infrastructure to deploy VTA on low-cost FPGA hardware for fast prototyping. By extending the TVM stack with a customizable and open-source deep learning hardware accelerator design, we are exposing a transparent end-to-end deep learning stack from the high-level deep learning framework down to the actual hardware design and implementation. This forms a truly end-to-end, software-to-hardware open-source stack for deep learning systems.

![](_page_71_Picture_5.jpeg)
#### Out-of-the-box FPGA demo & tutorials that you can try on your own!







"cat"



Inputs: pre-compiled bitstream, pre-trained network model









1. CPU Only Inference (ResNet34, W8): 2.6 FPS
2. VTA Inference (ResNet34, W8): 10 FPS
3. Fast VTA Inference (ResNet18, W4): 19 FPS

## TVM 0.5 VTA Release Features

- FPGA Support: Ultra96, ZCU102, Intel DE10-Nano
- TOPI Operator Library & AutoTVM support
- Relay graph conversion front end, push-button 8-bit quantization

# 2019 VTA Timeline

- Q1:
  - Chisel Generator for ASIC backends
  - Initial Datacenter FPGA Prototype
- Q2:
  - Initial Training Prototype
  - Novel Numerical Representation Support (Posit)

# More at tvm.ai/vta



Transparent End-to-End Deep Learning System Stack

Figure labels: High-Level Differentiable IR, Tensor Expression IR, Edge FPGA, ASIC

