# **SkyCastle:** A Resource-Aware Multi-Loop Scheduler for High-Level Synthesis

Julian Oppermann<sup>1</sup>, Lukas Sommer<sup>1</sup>, Lukas Weber<sup>1</sup>, Melanie Reuter-Oppermann<sup>2</sup>, Andreas Koch<sup>1</sup>, <u>Oliver Sinnen<sup>3</sup></u>







Given: a kernel, an HLS tool, and an FPGA

- Given: a kernel, an HLS tool, and an FPGA
  - What's the fastest mircoarchitecture that still fits on the device?

Given: a kernel, an HLS tool, and an FPGA

 What's the fastest mircoarchitecture that still fits on the device?

| <pre>{ /* 10 FP mul, 1 FP add */ } { /* 8 FP mul, 1 FP add */ }</pre>                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| <pre>double spn() { /* 10 FP mul, 1 FP add */ } double spn_marginal() { /* 8 FP mul, 1 FP add */ }</pre>                                                                                                                                                                                                                                                                                                                                                                                                                                   |
| <pre>double top(char i1, char i2, char i3, char i4) {     // most probable explanation for "i5"     char maxClause = -1; double maxProb = -1.0;     MPE: for (char x = 0; x &lt; 0xFF; x += 4) {         double p0 = spn(i1, i2, i3, i4, x);         double p1 = spn(i1, i2, i3, i4, x+1);         double p2 = spn(i1, i2, i3, i4, x+2);         double p3 = spn(i1, i2, i3, i4, x+3);         maxProb = // max(maxProb, p0, p1, p2, p3);         maxClause = // argument value for i5 that         // yielded new value for maxProb</pre> |
| <pre>} double pM = spn_marginal(i2, i3, i4, maxClause); return maxProb / pM; }</pre>                                                                                                                                                                                                                                                                                                                                                                                                                                                       |



src: Xilinx

Given: a kernel, an HLS tool, and an FPGA

 What's the fastest mircoarchitecture that still fits on the device?



Given: a kernel, an HLS tool, and an FPGA

 What's the fastest mircoarchitecture that still fits on the device?



 Most influential control "knob": amount of (loop) pipelining



- Most influential control "knob": amount of (loop) pipelining
- Tweak manually?



- Most influential control "knob": amount of (loop) pipelining
- Tweak manually?
- Use external exploration tool?



- Most influential control "knob": amount of (loop) pipelining
- Tweak manually?
- Use external exploration tool?
- Integrate into core HLS algorithms!
  - Optimisation problem:
     maximise "performance"
     subject to "resource constraints"



throughput

Max

Least

resource

demand

HLS = Automatic microarchitecture construction from a behavioural description think: C code



- HLS = Automatic microarchitecture construction from a behavioural description think: C code
- Terminology
  - Loops (and other regions) are transformed to control-data-flow-graphs comprised of operations and dependence edges
  - Operations require **operators** to perform intended function (e.g. floating-point addition)
  - Operators occupy resources on the FPGA device (e.g. DSP blocks)





- HLS = Automatic microarchitecture construction from a behavioural description think: C code
- Terminology
  - Loops (and other regions) are transformed to control-data-flow-graphs comprised of operations and dependence edges
  - Operations require **operators** to perform intended function (e.g. floating-point addition)
  - Operators occupy resources on the FPGA device (e.g. DSP blocks)
- Algorithmic steps



- HLS = Automatic microarchitecture construction from a behavioural description think: C code
- Terminology
  - Loops (and other regions) are transformed to control-data-flow-graphs comprised of operations and dependence edges
  - Operations require **operators** to perform intended function (e.g. floating-point addition)
  - Operators occupy resources on the FPGA device (e.g. DSP blocks)
- Algorithmic steps
  - Allocation <u>how many</u> operators?



- HLS = Automatic microarchitecture construction from a behavioural description think: C code
- Terminology
  - Loops (and other regions) are transformed to control-data-flow-graphs comprised of operations and dependence edges
  - Operations require **operators** to perform intended function (e.g. floating-point addition)
  - Operators occupy resources on the FPGA device (e.g. DSP blocks)
- Algorithmic steps
  - Allocation <u>how many</u> operators?
  - Scheduling <u>when</u> is an operation executed?



- HLS = Automatic microarchitecture construction from a behavioural description think: C code
- Terminology
  - Loops (and other regions) are transformed to control-data-flow-graphs comprised of operations and dependence edges
  - Operations require **operators** to perform intended function (e.g. floating-point addition)
  - Operators occupy resources on the FPGA device (e.g. DSP blocks)
- Algorithmic steps
  - Allocation <u>how many</u> operators?
  - Scheduling <u>when</u> is an operation executed?
  - Binding <u>where</u> is an operation executed?



- HLS = Automatic microarchitecture construction from a behavioural description think: C code
- Terminology
  - Loops (and other regions) are transformed to control-data-flow-graphs comprised of operations and dependence edges
  - Operations require **operators** to perform intended function (e.g. floating-point addition)
  - Operators occupy **resources** on the FPGA device (e.g. DSP blocks)
- Algorithmic steps
  - Allocation <u>how many</u> operators?
  - Scheduling <u>when</u> is an operation executed?
  - Binding <u>where</u> is an operation executed?



#### Trade-offs

For <u>one</u> loop, trade-offs can be computed with a resource-aware modulo scheduler [Euro-Par'19]



| Allocation: |     |     |     |  |
|-------------|-----|-----|-----|--|
| ADD         | ADD | ADD | ADD |  |
| MUL         | MUL | MUL | MUL |  |

Highest throughput



Least resource demand

#### Trade-offs

For <u>one</u> loop, trade-offs can be computed with a resource-aware modulo scheduler [Euro-Par'19]





#### Trade-offs

For <u>one</u> loop, trade-offs can be computed with a resource-aware modulo scheduler [Euro-Par'19]



# In Reality...

- Typical HLS kernels have:
  - More than one loop
  - Non-pipelined parts

# In Reality...

- Typical HLS kernels have:
  - More than one loop
  - Non-pipelined parts
- Typical HLS tools share operators between loops

# In Reality...

- Typical HLS kernels have:
  - More than one loop
  - Non-pipelined parts
- Typical HLS tools share operators between loops

#### Need a formal model for <u>that</u>!















- Operator sharing at the <u>function</u> level
  - Implicit in the formal model
  - Other schemes also possible

- Operator sharing at the <u>function</u> level
  - Implicit in the formal model
  - Other schemes also possible
- Pipelined regions cannot contain loops

- Operator sharing at the <u>function</u> level
  - Implicit in the formal model
  - Other schemes also possible
- Pipelined regions cannot contain loops
- Pipelined regions may contain calls to pipelined functions
  - Callee's II must divide II of caller's graph

- Operator sharing at the <u>function</u> level
  - Implicit in the formal model
  - Other schemes also possible
- Pipelined regions cannot contain loops
- Pipelined regions may contain calls to pipelined functions
  - Callee's II must divide II of caller's graph
- Closed tool, no interface to influence HLS steps
  - Faithful reproduction of the scheduling and allocation problems inside of Vivado HLS
Kernel





















# SkyCastle

Novel SkyCastle approach for a subclass of kernels



# SkyCastle

Novel SkyCastle approach for a subclass of kernels



- = first "level" of the multi-loop scheduling problem
  - Arbitrary number and nesting structure of loops in top-level function
  - Only innermost loops may be pipelined

## SkyCastle ILP



RASP = <u>Resource-Aware Scheduling Problem</u> RAMSP = <u>Resource-Aware</u> <u>Modulo Scheduling Problem</u>

## SkyCastle ILP



- Uses Moovac
  formulation
  [TRETS'19]
  - II is decision variable

RAMSP = <u>R</u>esource-<u>A</u>ware <u>Modulo</u> <u>S</u>cheduling <u>P</u>roblem

SkyCastle ILP

RAMSP = Resource-Aware Modulo Scheduling Problem



- Uses Moovac
  formulation
  [TRETS'19]
  - II is decision variable
- characteristics are variable for a subset of operations/ operators

- Different queries of a Sum-Product Network
  - Motivational example from the first slide



12 / 18

- Different queries of a Sum-Product Network
  - Motivational example from the first slide



- Different queries of a Sum-Product Network
  - Motivational example from the first slide



- Different queries of a Sum-Product Network
  - Motivational example from the first slide



- Different queries of a Sum-Product Network
  - Motivational example from the first slide



#### Evaluation: Case study "FFT"



#### Fast Fourier Transformation

 code from MachSuite/ fft\_transpose

## Evaluation: Case study "FFT"



#### Fast Fourier Transformation

- code from MachSuite/ fft\_transpose
- 3 of 11 loops are pipelined

## Evaluation: Case study "FFT"



#### Fast Fourier Transformation

- code from MachSuite/ fft\_transpose
- 3 of 11 loops are pipelined
- Pipelined functions are called from different loops
  - → consider II divisibility

## **Evaluation: Setup**

# Gurobi 8.1, 8 threads, 16 GiB RAM on Xeon E5-2680 v3 servers @ 2.8 GHz





src: HHLR TU-DA

# **Evaluation: Setup**

Gurobi 8.1, 8 threads, 16 GiB RAM
 on Xeon E5-2680 v3 servers @ 2.8 GHz



Timelimits (for solving ILP)
 15 min: minimise latency
 5 min: minimise resource utilisation



src: HHLR TU-DA

14 / 18

#### O. Sinnen, University of Auckland — SkyCastle: A Resource-Aware Multi-Loop Scheduler for High-Level Synthesis

#### **Evaluation: Setup**

- Gurobi 8.1, 8 threads, 16 GiB RAM
  on Xeon E5-2680 v3 servers @ 2.8 GHz
- Timelimits (for solving ILP)
  15 min: minimise latency
  5 min: minimise resource utilisation
- 2 FPGA boards
  - ZedBoard XC7Z020 "small"
  - VCU108 XCVU095 "medium"





src: HHLR TU-DA

src: Xilinx

14 / 18

#### O. Sinnen, University of Auckland — SkyCastle: A Resource-Aware Multi-Loop Scheduler for High-Level Synthesis

## **Evaluation: Setup**

- Gurobi 8.1, 8 threads, 16 GiB RAM
  on Xeon E5-2680 v3 servers @ 2.8 GHz
- Timelimits (for solving ILP)
  15 min: minimise latency
  5 min: minimise resource utilisation
- 2 FPGA boards
  - ZedBoard XC7Z020 "small"
  - VCU108 XCVU095 "medium"
- Xilinx Vivado HLS 2018.3





src: HHLR TU-DA

src: Xilinx

The next slides illustrate that ...

The next slides illustrate that ...

... solving the ILP is **tractable** 

The next slides illustrate that ...

... solving the ILP is tractable

... we **capture** the most important **aspects** of the Vivado HLS **scheduling & allocation problem** 

The next slides illustrate that ...

... solving the ILP is tractable

... we **capture** the most important **aspects** of the Vivado HLS **scheduling & allocation problem** 

... using SkyCastle leads to synthesisable designs

The next slides illustrate that ...

... solving the ILP is tractable

... we **capture** the most important **aspects** of the Vivado HLS **scheduling & allocation problem** 

... using SkyCastle leads to synthesisable designs

... we expect **improved throughput** from **replicating** slower-but-smaller kernel implementations

#### Results – Trade-off Solutions

#### **FFT**, VCU108



O. Sinnen, University of Auckland — SkyCastle: A Resource-Aware Multi-Loop Scheduler for High-Level Synthesis

# Results – Replication

#### **SPN**, VCU108



# Results – Replication

#### **SPN**, VCU108



O. Sinnen, University of Auckland — SkyCastle: A Resource-Aware Multi-Loop Scheduler for High-Level Synthesis
# Results – Replication

### **SPN**, VCU108



# Conclusion / Outlook

 New approach to automatic hardware design using mathematical optimisation

# Conclusion / Outlook

- New approach to automatic hardware design using mathematical optimisation
- Would benefit tremendously from public interface into the HLS steps (similar to RapidWright)

# Conclusion / Outlook

- New approach to automatic hardware design using mathematical optimisation
- Would benefit tremendously from public interface into the HLS steps (similar to RapidWright)
- Could be a key ingredient to the automatic designspace exploration of multi-kernel OpenCL applications

18 / 18