## Supporting TVM on RISC-V Architectures

#### Jenq-Kuen Lee<sup>1</sup>, Allen Lu<sup>2</sup>, Yuan-Ming Chang<sup>1,2</sup>, Chao-Lin Lee<sup>1,2</sup>, Piyo Chen<sup>1</sup>, and Shao-Chung Wang<sup>3</sup>

<sup>1</sup>Department of Computer Science, National Tsing Hua University, Taiwan
 <sup>2</sup>Peakhills Group Corporation
 <sup>3</sup>Andes Technology Corporation








# RISC-V with two vector ISAs to support a fall-back engine for AI models



Courtesy: Vector ISA, Roger Espasa, Esperanto Technologies

TVM and Deep Learning Compiler Conference, December 2018

Packed Vector (Subword SIMD) with Fixed-Point and Integer Instructions



RISC-V DSP (P) Extension Proposal, Chuan-Hua Chang, Andes Technology Corporation

- We add a RISC-V target to the TVM codegen phase. The TVM RISC-V codegen lowers SIMD computation to subword SIMD intrinsics.
- The LLVM backend needs to generate the corresponding SIMD instructions.
- Work is also ongoing to add a TVM scheduling primitive, `quantize(width, exponent)`, that quantizes computation into fixed-point.
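The slides do not define the semantics of `quantize(width, exponent)`. As a minimal sketch, assume it maps a real value `x` to a signed `width`-bit integer `q` such that `x ≈ q * 2**exponent`, saturating at the representable range. The function bodies and the `dequantize` helper below are hypothetical illustrations, not the TVM primitive itself.

```python
def quantize(x, width, exponent):
    """Hypothetical semantics: return a signed `width`-bit integer q
    with x ~= q * 2**exponent, saturating on overflow."""
    lo = -(1 << (width - 1))          # most negative representable value
    hi = (1 << (width - 1)) - 1       # most positive representable value
    q = round(x / (2.0 ** exponent))  # scale and round to nearest integer
    return max(lo, min(hi, q))        # saturate instead of wrapping


def dequantize(q, exponent):
    """Inverse mapping (hypothetical helper): recover the approximate real value."""
    return q * (2.0 ** exponent)


if __name__ == "__main__":
    q = quantize(0.5, 16, -15)        # Q15 format: 16-bit width, exponent -15
    print(q, dequantize(q, -15))
```

Under this reading, `width=16` and `exponent=-15` correspond to the Q15 format, which matches the 16-bit lane size used in the matrix multiply example.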



#### Support TVM on RISC-V with Subword SIMD Computation

[Figure: Schedule Space and Optimizations. Groups: primitives from prior works (Halide, Loopy); new primitives to support GPUs, accelerators; new primitives to support RISC-V with vector units. Labels: Loop, Thread Binding, Cache locality, Cooperation, Tensorization, Latency Hiding, Subword SIMD, SIMD Rewriting.]
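One way to read the subword SIMD scheduling idea is as a loop split by the number of lanes, with a guard for a ragged tail. The sketch below is plain Python rather than TVM code. It assumes two 16-bit lanes per register and uses the same index expressions as the matrix multiply example (`A[i*n + k] * B[j*n + k]`). The function name `matmul_split` is hypothetical.

```python
def matmul_split(A, B, n, lanes=2):
    """Compute out[i*n + j] = sum_k A[i*n + k] * B[j*n + k] with the i loop
    split into (i_outer, i_inner), where i_inner ranges over the SIMD lanes."""
    out = [0] * (n * n)
    for i_outer in range((n + lanes - 1) // lanes):
        for j in range(n):
            for k in range(n):
                # The inner loop is the part a subword SIMD backend could
                # execute as a single packed instruction.
                for i_inner in range(lanes):
                    i = i_outer * lanes + i_inner
                    if i < n:  # tail guard, needed when n is not a multiple of lanes
                        out[i * n + j] += A[i * n + k] * B[j * n + k]
    return out


if __name__ == "__main__":
    n = 3
    A = list(range(1, 10))
    B = list(range(9, 0, -1))
    print(matmul_split(A, B, n))
```

The guard `i < n` plays the same role as the `likely(...)` condition in the lowered IR: it keeps the second lane from touching out-of-range rows when `n` is odd.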

### Example – Matrix Multiply

Lowered TVM IR:

```
produce compute {
  for (i.outer, 0, ((n + 1)/2)) {
    for (j, 0, n) {
      for (i.inner.s, 0, 2) {
        if (likely(((i.outer*2) < (n - i.inner.s)))) {
          compute[((((i.outer*2) + i.inner.s)*n) + j)]
            = (int16)0
        }
      }
      for (k, 0, n) {
        for (i.inner.s, 0, 2) {
          if (likely(((i.outer*2) < (n - i.inner.s)))) {
            compute[((((i.outer*2) + i.inner.s)*n) + j)]
              = (compute[((((i.outer*2) + i.inner.s)*n) + j)]
              + (A[((((i.outer*2) + i.inner.s)*n) + k)]*B[((j*n) + k)]))
          }
        }
      }
    }
  }
}
```

RISC-V assembly (excerpt):

```
pkbb16 a6, a7, a6
lhu    a7, 0(a2)
li     a5, 7
li     a3, 8
pkbb16 a4, a4, a4
ksll16 a4, a4, a3
ksll16 a6, a6, a5
khm16  t2, a6, a4
pkbb16 a4, a7, a7
pkbb16 a7, t1, t0
lhu    t1, 16(a2)
ksll16 a4, a4, a3
ksll16 a7, a7, a5
lhu    t3, 4(a1)
lhu    t4, 12(a1)
khm16  t0, a7, a4
pkbb16 a4, zero, zero
add16  t0, a4, t0
```

In this example, 104 of 229 instructions are SIMD instructions, each of which processes two elements.

```
%22 = bitcast i8* %21 to i16*
%23 = load i16, i16* %4, align 2, !tbaa !115
%24 = insertelement <2 x i16> undef, i16 %23, i32 0
%25 = shufflevector <2 x i16> %24, <2 x i16> undef,
      <2 x i32> zeroinitializer
%26 = tail call <2 x i16> @llvm.riscv.simd.khm16
      (<2 x i16> %13, <2 x i16> %25)
%27 = tail call <2 x i16> @llvm.riscv.simd.sadd16
      (<2 x i16> zeroinitializer, <2 x i16> %26)
```

Subword SIMD Intrinsic LLVM IR
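To make the packed instructions in the assembly listing easier to follow, here is a small Python emulation of `pkbb16`, `ksll16`, `khm16`, and `add16` on 32-bit registers holding two signed 16-bit lanes. The semantics are a simplified reading of the RISC-V P extension proposal and should be treated as assumptions: `pkbb16` packs the bottom halves of two registers, `ksll16` is a saturating left shift per lane, `khm16` is a saturating Q15 multiply per lane, and `add16` is a wrapping add per lane. The helper names `lanes` and `pack` are hypothetical.

```python
MASK16 = 0xFFFF


def to_signed16(v):
    """Interpret the low 16 bits of v as a signed integer."""
    v &= MASK16
    return v - 0x10000 if v & 0x8000 else v


def sat16(v):
    """Saturate v to the signed 16-bit range."""
    return max(-0x8000, min(0x7FFF, v))


def lanes(word):
    """Split a 32-bit word into (high, low) signed 16-bit lanes."""
    return to_signed16(word >> 16), to_signed16(word)


def pack(hi, lo):
    """Build a 32-bit word from two 16-bit lane values."""
    return ((hi & MASK16) << 16) | (lo & MASK16)


def pkbb16(rs1, rs2):
    """Pack the bottom half of rs1 (high lane) and of rs2 (low lane)."""
    return pack(rs1, rs2)


def ksll16(rs1, shamt):
    """Saturating shift left on each 16-bit lane."""
    hi, lo = lanes(rs1)
    return pack(sat16(hi << shamt), sat16(lo << shamt))


def khm16(rs1, rs2):
    """Saturating Q15 multiply on each lane: (a * b) >> 15."""
    def q15(a, b):
        if a == -0x8000 and b == -0x8000:
            return 0x7FFF  # the only overflowing case saturates
        return (a * b) >> 15
    (ah, al), (bh, bl) = lanes(rs1), lanes(rs2)
    return pack(q15(ah, bh), q15(al, bl))


def add16(rs1, rs2):
    """Wrapping add on each 16-bit lane."""
    (ah, al), (bh, bl) = lanes(rs1), lanes(rs2)
    return pack(ah + bh, al + bl)


if __name__ == "__main__":
    a = pkbb16(3, -2)   # lanes (3, -2)
    b = pkbb16(5, 5)    # broadcast 5 into both lanes
    prod = khm16(ksll16(a, 7), ksll16(b, 8))
    print(lanes(prod))
```

Under these assumed semantics, shifting one operand left by 7 and the other by 8 before `khm16` cancels the Q15 right shift by 15, so each lane holds the plain integer product as long as nothing saturates. That is one plausible explanation for the `li a5, 7` / `li a3, 8` / `ksll16` / `khm16` pattern in the listing.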

## Summary and Future Work

- We have also had discussions with the AWS team about adding a RISC-V back-end to the TVM deep learning compiler.
- We look forward to contributing the code to the TVM source tree.
- The work currently runs on the Spike RISC-V simulator; we look forward to using the gem5 and SID simulators and real chips for performance tuning.

