#### A Multi-Threaded Reactive Processor

Xin Li Marian Boldt Reinhard v. Hanxleden

Real-Time Systems and Embedded Systems Group Department of Computer Science Christian-Albrechts-Universität zu Kiel, Germany www.informatik.uni-kiel.de/rtsys

> ASPLOS'06 24 October 2006



Institut für Informatik Christian-Albrechts-Universität zu Kiel

Reactive Systems Reactive Processing I: Language Reactive Processing II: Execution Platform Why bother?

#### Reactive vs. Non-Reactive Systems

# Transformational systems numerical computation programs, compilers . . .

Interactive systems operating systems, databases ....

Reactive Systems Reactive Processing I: Language Reactive Processing II: Execution Platform Why bother?

#### Reactive vs. Non-Reactive Systems

Transformational systems numerical computation programs, compilers . . .

Interactive systems operating systems, databases ...

Reactive systems process controllers, signal processors ....

Reactive Systems Reactive Processing I: Language Reactive Processing II: Execution Platform Why bother?

# Why "Reactive Processing"?

Control flow on traditional (non-embedded) computing systems:

- Jumps, conditional branches, loops
- Procedure/method calls

Control flow on embedded, reactive systems: all of the above, plus

Reactive Systems Reactive Processing I: Language Reactive Processing II: Execution Platform Why bother?

# Why "Reactive Processing"?

Control flow on traditional (non-embedded) computing systems:

- Jumps, conditional branches, loops
- Procedure/method calls

Control flow on embedded, reactive systems: all of the above, plus

- Concurrency
- Preemption

Reactive Systems Reactive Processing I: Language Reactive Processing II: Execution Platform Why bother?

# Why "Reactive Processing"?

Control flow on traditional (non-embedded) computing systems:

- Jumps, conditional branches, loops
- Procedure/method calls

Control flow on embedded, reactive systems: all of the above, plus

- Concurrency
- Preemption

The problem: mismatch between traditional processing architectures and reactive control flow patterns

- Processing overhead, e.g. due to OS involvement or need to save thread states at application level
- Timing unpredictability

Reactive Systems Reactive Processing I: Language Reactive Processing II: Execution Platform Why bother?

# Reactive Processing Part I: The Language

#### Have chosen Esterel:

- Created in the early 1980's
- For programming control-dominated reactive systems
- Used as intermediate language for Statechart (Safe State Machines)
- Textual imperative language with reactive control flow constructs
  - Concurrency
  - Weak/strong abortion
  - Exceptions
  - Suspension
- A synchronous language
- Deterministic behavior, clean semantics
- Currently undergoing IEEE standardization

Reactive Systems Reactive Processing I: Language Reactive Processing II: Execution Platform Why bother?

#### Reactive Processing Part II: The Execution Platform



◆□▶ ◆冊▶ ★ヨ▶ ★ヨ▶ ヨヨ のへで

CIAU

Reactive Systems Reactive Processing I: Language Reactive Processing II: Execution Platform Why bother?

#### Reactive Processing Part II: The Execution Platform



Reactive Systems Reactive Processing I: Language Reactive Processing II: Execution Platform Why bother?

### Why bother?

Reactive processing yields

- Low power requirements
- Deterministic control flow
- Predictable timing
- Short design cycle

Reactive Systems Reactive Processing I: Language Reactive Processing II: Execution Platform Why bother?

# Why bother?

Reactive processing yields

- Low power requirements
- Deterministic control flow
- Predictable timing
- Short design cycle

Can use reactive processor

 in stand alone, small reactive applications

Reactive Systems Reactive Processing I: Language Reactive Processing II: Execution Platform Why bother?

# Why bother?

Reactive processing yields

- Low power requirements
- Deterministic control flow
- Predictable timing
- Short design cycle

Can use reactive processor

- in stand alone, small reactive applications
- as building block in SoC designs



The Esterel Language Instruction Set Architecture Processor Architecture Compiler

### Overview

#### Introduction

#### The Kiel Esterel Processor

The Esterel Language Instruction Set Architecture Processor Architecture Compiler

#### **Experimental Results**

#### Summary and Outlook

The Esterel Language Instruction Set Architecture Processor Architecture Compiler

# The Esterel Language

#### Logical Ticks

- Execution is divided into ticks
- Synchrony hypothesis:
  - Outputs generated from given inputs occur at the same tick

- Present or absent throughout a tick
- Used to communicate internally and with the environment

The Esterel Language Instruction Set Architecture Processor Architecture Compiler

### The Esterel Language

#### Logical Ticks

- Execution is divided into ticks
- Synchrony hypothesis: Outputs generated from given inputs occur at the same tick

- Present or absent throughout a tick
- Used to communicate internally and with the environment

```
module ABRO:
input A, B, R;
output O;
loop
   abort
   [ await A
   |!
    await B ];
   emit O
   halt;
   when R
end loop;
end module
```

The Esterel Language Instruction Set Architecture Processor Architecture Compiler

### The Esterel Language

#### Logical Ticks

- Execution is divided into ticks
- Synchrony hypothesis: Outputs generated from given inputs occur at the same tick

- Present or absent throughout a tick
- Used to communicate internally and with the environment

```
module ABRO:
input A, B, R;
output O;
loop
 abort
    [ await A
     11
     await B 1:
   emit O
   halt:
 when R
end loop:
end module
```





The Esterel Language Instruction Set Architecture Processor Architecture Compiler

### The Esterel Language

#### Logical Ticks

- Execution is divided into ticks
- Synchrony hypothesis: Outputs generated from given inputs occur at the same tick

- Present or absent throughout a tick
- Used to communicate internally and with the environment

```
module ABRO:
     input A, B, R;
     output O;
     loop
       abort
          [ await A
           11
           await B 1:
         emit O
         halt:
       when R
     end loop:
     end module
             A
Tick
             B
```

The Esterel Language Instruction Set Architecture Processor Architecture Compiler

### The Esterel Language

#### Logical Ticks

- Execution is divided into ticks
- Synchrony hypothesis: Outputs generated from given inputs occur at the same tick

#### Signals

- Present or absent throughout a tick
- Used to communicate internally and with the environment



Slide 8

The Esterel Language Instruction Set Architecture Processor Architecture Compiler

### The Esterel Language

#### Logical Ticks

- Execution is divided into ticks
- Synchrony hypothesis: Outputs generated from given inputs occur at the same tick

- Present or absent throughout a tick
- Used to communicate internally and with the environment



The Esterel Language Instruction Set Architecture Processor Architecture Compiler

### The Esterel Language

#### Logical Ticks

- Execution is divided into ticks
- Synchrony hypothesis: Outputs generated from given inputs occur at the same tick

- Present or absent throughout a tick
- Used to communicate internally and with the environment



The Esterel Language Instruction Set Architecture Processor Architecture Compiler

### Candidates for the Instruction Set

Esterel kernel statements

```
> ||
> suspend ... when S
> trap T in ... exit T ... end trap
> pause
> signal S in ... end
> emit S
> present S then ... end
> nothing
> loop ... end loop
```

The Esterel Language Instruction Set Architecture Processor Architecture Compiler

# Candidates for the Instruction Set

Esterel kernel statements

```
suspend ... when S
▶ trap T in ... exit T ... end trap
pause
▶ signal S in ... end
▶ emit S
present S then ... end
nothing
loop ... end loop
▶ ;
```

Derived statements

[weak] abort ... when S
await S

The Esterel Language Instruction Set Architecture Processor Architecture Compiler

### The KEP Instruction Set

- Includes all kernel statements
- In addition, some derived statements This redundancy improves space/time efficiency

The Esterel Language Instruction Set Architecture Processor Architecture Compiler

### The KEP Instruction Set

- Includes all kernel statements
- In addition, some derived statements

This redundancy improves space/time efficiency

| TOS:          | % trap T in      |
|---------------|------------------|
| AO:           | % loop           |
| PAUSE         | % pause;         |
| PRESENT S,A1  | % present S then |
| EXIT TOE, TOS | % exit T         |
| A1:           | % end present    |
| GOTO AO       | % end loop       |
| TOE:          | % end trap;      |

The Esterel Language Instruction Set Architecture Processor Architecture Compiler

### The KEP Instruction Set

- Includes all kernel statements
- In addition, some derived statements

This redundancy improves space/time efficiency

| TOS:          | % trap T in      |
|---------------|------------------|
| AO:           | % loop           |
| PAUSE         | % pause;         |
| PRESENT S,A1  | % present S then |
| EXIT TOE, TOS | % exit T         |
| A1:           | % end present    |
| GOTO AO       | % end loop       |
| TOE:          | % end trap;      |

The Esterel Language Instruction Set Architecture Processor Architecture Compiler

### The KEP Instruction Set

- Includes all kernel statements
- In addition, some derived statements

This redundancy improves space/time efficiency

| TOS:          | % trap T in      |
|---------------|------------------|
| AO:           | % loop           |
| PAUSE         | % pause;         |
| PRESENT S,A1  | % present S then |
| EXIT TOE, TOS | % exit T         |
| A1:           | % end present    |
| GOTO AO       | % end loop       |
| TOE:          | % end trap;      |

| AWAIT S | % au | ait S |
|---------|------|-------|
|---------|------|-------|

The Esterel Language Instruction Set Architecture Processor Architecture Compiler

# The KEP Instruction Set

- Includes all kernel statements
- In addition, some derived statements

This redundancy improves space/time efficiency





# Refined ISA to reduce HW usage

Example: abort can translate to

ABORT in the most general case

LABORT if no other [L] ABORTS are included in abort scope

TABORT if neither || nor other [L|T] ABORTS are included

The Esterel Language Instruction Set Architecture Processor Architecture Compiler

# The KEP Instruction Set

- Includes all kernel statements
- In addition, some derived statements

This redundancy improves space/time efficiency



AWAIT S % await S

Refined ISA to reduce HW usage

Example: abort can translate to

ABORT in the most general case

LABORT if no other [L] ABORTS are included in abort scope TABORT if neither || nor other [L]T] ABORTS are included

Furthermore: valued signals, pre, delay expressions, ...

The Esterel Language Instruction Set Architecture Processor Architecture Compiler

#### The Kiel Esterel Processor Architecture



- Reactive Core
  - Decoder & Controller, Reactive Block, Thread Block
- Interface Block
  - Interface signals, Local signals, ...
- Data Handling
  - Register file, ALU, .

C A U Xin Li, Marian Boldt, Reinhard v. Hanxleden

The Esterel Language Instruction Set Architecture Processor Architecture Compiler

#### The Architecture of the Reactive Core



The Esterel Language Instruction Set Architecture Processor Architecture Compiler

A D b 4 A b

### The Compilation Challenge: Thread Dependencies



The Esterel Language Instruction Set Architecture Processor Architecture Compiler

# The KEP Compiler

#### Thread scheduling:

- 1. Construct Concurrent KEP Assembler Graph (CKAG)
- 2. Compute thread priorities/ids that respect dependencies
- 3. Generate PAR and PRIO statements accordingly

The Esterel Language Instruction Set Architecture Processor Architecture Compiler

# The KEP Compiler

#### Thread scheduling:

- 1. Construct Concurrent KEP Assembler Graph (CKAG)
- 2. Compute thread priorities/ids that respect dependencies
- 3. Generate PAR and PRIO statements accordingly

#### Other tasks:

- Analyze Watcher requirements
- Map Esterel statements to KEP refined ISA

The Esterel Language Instruction Set Architecture Processor Architecture Compiler

# The KEP Compiler

#### Thread scheduling:

- 1. Construct Concurrent KEP Assembler Graph (CKAG)
- 2. Compute thread priorities/ids that respect dependencies
- 3. Generate PAR and PRIO statements accordingly

#### Other tasks:

- Analyze Watcher requirements
- Map Esterel statements to KEP refined ISA

#### Optimizations:

- Dead code elimination, based on CKAG
- "Undismantling" of kernel statements

The Esterel Language Instruction Set Architecture Processor Architecture Compiler

#### **Example Compilation**



The Esterel Language Instruction Set Architecture Processor Architecture Compiler

#### Example—Execution Trace

Scheduling criteria: 1. active, 2. highest priority, 3. highest id

| Xmodule Example:<br>OUTPUT 0output 0;<br>signal A,R in<br>[<br>weak abort<br>sustain R;<br>when immediate A;<br>emit 0[L00,T0]EMIT _TICKLEN,#12[L01,T0]SIGNAL A[L03,T0]PAR 2,A0,1[L04,T0]PAR 1,A1,2[L05,T0]PAR 2,22[L06,T1]A0: WABORTI A,A3[L07,T1]A4: EMIT R[L07,T1]PAR 2,2[L06,T1]PAR 1,A1,2[L07,T1]PAE 42,2[L07,T1]PAE 42,2[L07,T1]PAE 1,A3[L10,T1]PAUSE[L10,T1]PAUSE[L11,T1]GOTO A4[L12,T1]A3: EMIT 0[L13,T2]A1:AWAIT R[L15,T0]A2:JOIN 0[L16,T0]HALT | - Tick 1 -<br>! reset;<br>% In:<br>% Out: R<br>TO: L01, L02, L03, L04, L05<br>T1: L06, L07, L08<br>T2: L13<br>T1: L09, L10<br>T0: L15<br>- Tick 2 -<br>% In:<br>% Out: A R O<br>T1: L10, L11, L07, L08<br>T2: L13, L14<br>T1: L09, L10, L12<br>T0: L15, L16<br>- Tick 3 -<br>% In:<br>% Out:<br>T0: L16 |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|

KEP Evaluation Platform Performance Scalability

### Overview

Introduction

The Kiel Esterel Processor

Experimental Results KEP Evaluation Platform Performance Scalability

### Summary and Outlook

KEP Evaluation Platform Performance Scalability

### The KEP Evaluation Platform



- ▶ Highly automated process, currently using 470+ benchmarks
- End to end validation of hardware and compiler against "trusted" reference (Esterel Studio)
- Detailed performance measurements

### Performance

Memory usage

- Unoptimized: 25–94% (83% avg) reduction of memory usage (Code+RAM)
- ▶ Optimized: Yield further 5% to 30+% improvements

### Speed

- WCRT speedup: typically >4x
- ACRT speedup: typically >5x
- Optimizations yield further improvements

#### Power

- Peak energy usage reduction: 46–84% (75% avg)
- Idle (= no inputs) energy usage reduction: 58–97% (86% avg)

KEP Evaluation Platform Performance Scalability

### The worst-/average-case reaction times comparison

|              |       |       | Micro | Blaze |       |       | ۴    | KEP3a-Un | optimi | zed      |      | KEP3a-c  | ptimiz | zed      |
|--------------|-------|-------|-------|-------|-------|-------|------|----------|--------|----------|------|----------|--------|----------|
| Module Name  |       | WCRT  |       |       | ACRT  |       | WCRT | Ratio to | ACRT   | Ratio to | WCRT | Ratio to | ACRT   | Ratio    |
|              | V5    | V7    | CEC   | V5    | V7    | CEC   |      | best MB  |        | best MB  |      | Unopt    |        | to Unopt |
| abcd         | 1559  | 954   | 1476  | 1464  | 828   | 1057  | 135  | 0.14     | 87     | 0.11     | 135  | 1        | 84     | 0.97     |
| abcdef       | 2281  | 1462  | 1714  | 2155  | 1297  | 1491  | 201  | 0.14     | 120    | 0.09     | 201  | 1        | 117    | 0.98     |
| eight_but    | 3001  | 1953  | 2259  | 2833  | 1730  | 1931  | 267  | 0.14     | 159    | 0.09     | 267  | 1        | 153    | 0.96     |
| chan_prot    | 754   | 375   | 623   | 683   | 324   | 435   | 117  | 0.31     | 60     | 0.19     | 117  | 1        | 54     | 0.90     |
| reactor_ctrl | 487   | 230   | 397   | 456   | 214   | 266   | 54   | 0.23     | 45     | 0.21     | 51   | 0.94     | 39     | 0.87     |
| runner       | 566   | 289   | 657   | 512   | 277   | 419   | 36   | 0.12     | 15     | 0.05     | 30   | 0.83     | 6      | 0.40     |
| example      | 467   | 169   | 439   | 404   | 153   | 228   | 42   | 0.25     | 24     | 0.16     | 42   | 1        | 24     | 1        |
| ww_button    | 1185  | 578   | 979   | 1148  | 570   | 798   | 72   | 0.12     | 51     | 0.09     | 48   | 0.67     | 36     | 0.71     |
| greycounter  | 1965  | 1013  | 2376  | 1851  | 928   | 1736  | 528  | 0.52     | 375    | 0.40     | 528  | 1        | 375    | 1        |
| tcint        | 3580  | 1878  | 2350  | 3488  | 1797  | 2121  | 408  | 0.22     | 252    | 0.14     | 342  | 0.84     | 204    | 0.81     |
| mca200       | 75488 | 29078 | 12497 | 73824 | 24056 | 11479 | 2862 | 0.23     | 1107   | 0.10     | 2862 | 1        | 1107   | 1        |

The worst-/average-case reaction times, in clock cycles, for the KEP3a and MicroBlaze

- ▶ WCRT speedup: typically >4x
- ACRT speedup: typically >5x
- Optimizations can yield further improvements

KEP Evaluation Platform Performance Scalability

## Memory Usage

|              | Esterel | M      | croBlaz   | e     |      | KEF     | 3a-Unop | ot.         | KEP  | 3a-opt. |
|--------------|---------|--------|-----------|-------|------|---------|---------|-------------|------|---------|
| Module Name  | LOC     | Code-  | -Data (   | byte) | Code | (word)  | Code+I  | Data (byte) | Code | (word)  |
|              |         | V5     | V7        | CEC   | abs. | rel.    | abs.    | rel.        | abs. | rel.    |
|              | [1]     | [2     | 2] (best) | )     | [3]  | [3]/[1] | [4]     | [4]/[2]     | [5]  | [5]/[3] |
| abcd         | 160     | 6680   | 7928      | 7212  | 168  | 1.05    | 756     | 0.11        | 164  | 0.93    |
| abcdef       | 236     | 9352   | 9624      | 9220  | 252  | 1.07    | 1134    | 0.12        | 244  | 0.94    |
| eight_but    | 312     | 12016  | 11276     | 11948 | 336  | 1.08    | 1512    | 0.13        | 324  | 0.94    |
| $chan_prot$  | 42      | 3808   | 6204      | 3364  | 66   | 1.57    | 297     | 0.09        | 62   | 0.94    |
| reactor_ctrl | 27      | 2668   | 5504      | 2460  | 38   | 1.41    | 171     | 0.07        | 34   | 0.89    |
| runner       | 31      | 3140   | 5940      | 2824  | 39   | 1.22    | 175     | 0.06        | 27   | 0.69    |
| example      | 20      | 2480   | 5196      | 2344  | 31   | 1.55    | 139     | 0.06        | 28   | 0.94    |
| ww_button    | 76      | 6112   | 7384      | 5980  | 129  | 1.7     | 580     | 0.10        | 95   | 0.74    |
| greycounter  | 143     | 7612   | 7936      | 8688  | 347  | 2.43    | 1567    | 0.21        | 343  | 1       |
| tcint        | 355     | 14860  | 11376     | 15340 | 437  | 1.23    | 1968    | 0.17        | 379  | 0.87    |
| mca200       | 3090    | 104536 | 77112     | 52998 | 8650 | 2.79    | 39717   | 0.75        | 8650 | 1       |

Unoptimized: 83% avg reduction of memory usage (Code+RAM)

▶ Optimized: May yield further 5% to 30+% improvements > < => = ∽ < @

CIAU

Xin Li, Marian Boldt, Reinhard v. Hanxleden

A Multi-Threaded Reactive Processor

Slide 21

KEP Evaluation Platform Performance Scalability

### Scalability

### Synthesis results for Xilinx 3S1500-4fg-676<sup>1</sup>

| Thread | Slices  | Gates (k) |
|--------|---------|-----------|
| Count  | Silices | Gates (K) |
| 2      | 1295    | 295       |
| 10     | 1566    | 299       |
| 20     | 1871    | 311       |
| 40     | 2369    | 328       |
| 60     | 3235    | 346       |
| 80     | 4035    | 373       |
| 100    | 4569    | 389       |
| 120    | 5233    | 406       |

- 48 valued signals up to 256 possible
- 2 Watchers, 8 Local Watchers either up to 64 possible
- 1k (1024) instruction words up to 16k possible
- 128 registers (in word) up to 512 possible
- 16-bits (65536) max counter value
- Frequency is stable (around 60 MHz)

<sup>1</sup>For comparison, a MicroBlaze implementation requires around 1k slices and 309k gates; a two threads EMPEROR platform requires around 2k slices

Summary Reactive Processors Related Work/Contributions Outlook

### Summary Reactive Processors

Processor supports reactive control flow directly, at hardware level

- "Watchers" monitor preemption signals No need for polling, interrupts
- Support for concurrency Multi-threading or multi-processing
- Synchronous model of computation Perfectly deterministic, predictable timing

Summary Reactive Processors Related Work/Contributions Outlook

## Related Work/Contributions

RePIC [Roop et al.'04]/EMPEROR [Yoong et al.'06]

- Multi-processing patched reactive processor
- ► Three-valued signal logic + cyclic executive

Summary Reactive Processors Related Work/Contributions Outlook

# Related Work/Contributions

RePIC [Roop et al.'04]/EMPEROR [Yoong et al.'06]

- Multi-processing patched reactive processor
- ► Three-valued signal logic + cyclic executive

#### Kiel Esterel Processor 1–3

- Multi-threading custom reactive processor
- Provides most Esterel primitives, but still incomplete
- No compilation scheme to support concurrency

Summary Reactive Processors Related Work/Contributions Outlook

# Related Work/Contributions

RePIC [Roop et al.'04]/EMPEROR [Yoong et al.'06]

- Multi-processing patched reactive processor
- ► Three-valued signal logic + cyclic executive

### Kiel Esterel Processor 1-3

- Multi-threading custom reactive processor
- Provides most Esterel primitives, but still incomplete
- No compilation scheme to support concurrency

KEP3a (this work)

- Provides all Esterel primitives
- Refined ISA
- Compiler exploits multi-threading

Summary Reactive Processors Related Work/Contributions Outlook

### Outlook

- Improve priority assignments
- Speedup signal expression computations with external logic block
- WCRT analysis with concurrency
- Extend to Esterel v7
- ▶ KEP in Esterel—*e.g.*, to produce Esterel virtual machine
- Combination with multi-core (for data handling)
- Adaptation to non-Esterel languages

Summary Reactive Processors Related Work/Contributions Outlook

### Outlook

- Improve priority assignments
- Speedup signal expression computations with external logic block
- WCRT analysis with concurrency
- Extend to Esterel v7
- ▶ KEP in Esterel—*e.g.*, to produce Esterel virtual machine
- Combination with multi-core (for data handling)
- Adaptation to non-Esterel languages

Thanks! Questions/Comments? Appendix

### KEP3a Instruction Set + Architecture

Esterel-Type Instructions Handling Concurrency Handling Preemption Handling Exceptions WCRT Self-Monitoring

### The Compiler

Three Compilation Steps The Concurrent KEP Assembler Graph Cyclicity Constraints

#### Further Measurements

Code Characteristics and Compilation Times Speed, Size, Power, Scalability Analysis of context switches Another Example

#### Summary

Multi-processing vs. Multi-threading Comparison of Synthesis Options Application Scenarios KEP3a Instruction Set + Architecture

The Compiler Further Measurements Summary Esterel-Type Instructions Handling Concurrency Handling Preemption Handling Exceptions WCRT Self-Monitoring

A D > A D >

. . . . . . .

# Instruction Set Summary 1/2

| Mnemonic,<br>Operands                                           | Esterel Syntax                       | Notes                                                                                                                                               |  |  |  |
|-----------------------------------------------------------------|--------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------|--|--|--|
| PAR Prio, startAddr [, ID]<br>PARE endAddr<br>JOIN<br>PRIO Prio | [<br>p    q<br>]                     | Fork and join.<br>An optional <i>ID</i> explicitly specifies the ID of<br>the created thread.<br>Set the priority of the current thread             |  |  |  |
| [W]ABORT [n,] S, endAddr                                        | [weak] abort<br><br>when [n] S       | S can be one of the following:         1. S: signal status (present/absent)         2. PRE(S): previous status of signal                            |  |  |  |
| [W]ABORTI S, endAddr                                            | [weak] abort<br><br>when immediate S | <ol> <li>TICK: always present</li> <li>n can be one of the following:</li> <li>1. #data: immediate data</li> </ol>                                  |  |  |  |
| SUSPEND[I] S, endAddr                                           | suspend<br><br>when [immediate] S    | <ol> <li>reg: register contents</li> <li>?S: value of a signal</li> <li>PRE(?S): previous value of a signal</li> </ol>                              |  |  |  |
| EXIT TrapEnd[,TrapStart]                                        | trap T in<br>exit T<br>end trap      | Exit from a trap, <i>TrapStart</i> and <i>TrapEnd</i> spec-<br>ify trap scope. Unlike GOTO, check for concur-<br>rent EXITs and terminate enclosing |  |  |  |

-

KEP3a Instruction Set + Architecture

The Compiler Further Measurements Summary Esterel-Type Instructions Handling Concurrency Handling Preemption Handling Exceptions WCRT Self-Monitoring

## Instruction Set Summary 2/2

| Mnemonic,<br>Operands                                      | Esterel Syntax                              | Notes                                                             |  |  |  |
|------------------------------------------------------------|---------------------------------------------|-------------------------------------------------------------------|--|--|--|
| PAUSE<br>AWAIT [n,] S<br>AWAIT[I] S                        | pause<br>await [n] S<br>await [immediate] S | Wait for a signal. AWAIT TICK is equivalent to PAUSE              |  |  |  |
| CAWAITS<br>CAWAIT[I] S, addr<br>CAWAITE                    | await<br>case [immediate]<br>S do<br>end    | wait for several signals in parallel                              |  |  |  |
| SIGNAL S                                                   | signal S inend                              | Initialize a local signal <i>S</i>                                |  |  |  |
| EMIT S [, {#data reg}]           SUSTAIN S [, {#data reg}] | emit S [(val)]<br>sustain S [(val)]         | Emit (valued) signal <i>S</i><br>Sustain (valued) signal <i>S</i> |  |  |  |
| PRESENT S, elseAddr                                        | present S thenend                           | Jump to <i>elseAddr</i> if <i>S</i> is absent                     |  |  |  |
| NOTHING                                                    | nothing                                     | Do nothing                                                        |  |  |  |
| HALT                                                       | halt                                        | Halt the program                                                  |  |  |  |
| GOTO addr                                                  | loopend loop                                | Jump to addr                                                      |  |  |  |
| CALL addr<br>RETURN                                        | call P                                      | call a procedure,<br>and return from the procedure                |  |  |  |

-

< E

KEP3a Instruction Set + Architecture The Compiler

Esterel-Type Instructions Handling Concurrency Handling Preemption Further Measurements Handling Exceptions Summary WCRT Self-Monitoring

### Handling Concurrency



#### Execution status of a single thread



The status of the whole program, as managed by the Thread Block

(4月) キョン・イヨ

KEP3a Instruction Set + Architecture The Compiler

Further Measurements

Summary

Esterel-Type Instructions Handling Concurrency Handling Preemption Handling Exceptions WCRT Self-Monitoring

# Handling Concurrency

- A thread has its
  - thread id
  - address range and independent program counter
  - priority value
    - assigned when a thread is created
    - dynamically changed via PRIO instruction
  - status flags
    - ThreadEnable
    - ThreadActive

% Esterel [ p || q ];

KEP3a Instruction Set + Architecture

The Compiler Further Measurements Summary Esterel-Type Instructions Handling Concurrency Handling Preemption Handling Exceptions WCRT Self-Monitoring

# Handling Preemption

#### Watcher contains Enable Watcher (EW)

- Watches the PC (Program Counter)
- Compares PC
- Preemption enabled?
- Trigger Watcher (TW)
  - Watches the Signal
  - Counts down the counter (abortion)
  - Preemption active?





Esterel-Type Instructions Handling Concurrency Handling Preemption Handling Exceptions WCRT Self-Monitoring

# Watcher Refinement

Thread Watcher

- belongs to a thread directly
- can neither include concurrent threads nor other preemptions
- least powerful, but also cheapest

#### Local Watcher

- may include concurrent threads and also preemptions handled by a Thread Watcher
- cannot include another Local Watcher

#### Watcher

- may include concurrent threads and any preemptions
- most powerful, but also most expensive

#### KEP3a Instruction Set + Architecture

The Compiler Further Measurements Summary Esterel-Type Instructions Handling Concurrency Handling Preemption Handling Exceptions WCRT Self-Monitoring

# Handling Exceptions

#### Exception

- has its address range
- sets an exitFlag
  - cleared when reaching the end of the trap scope
  - effects control at the join point
- can be overridden based on the corresponding trap scopes (address range)

% Esterel trap T1 in trap T2 in [ p; exit T1: 11 a: exit T2: 1: end trap; r:end trap:



Esterel-Type Instructions Handling Concurrency Handling Preemption Handling Exceptions WCRT Self-Monitoring

. . . . . . .

# WCRT (Tick Length) Self-Monitoring

- OscClk: external clock; InstrClk: instructions; Tick: logical ticks
- Emitting special signal <u>TICKLEN</u> configures Tick Manager with WCRT
- TickWarn pin indicates WCRT timing violation



Three Compilation Steps The Concurrent KEP Assembler Graph Cyclicity Constraints

### Step 1: Construct Concurrent KEP Assembler Graph



|          | % module Example<br>OUTPUT O |
|----------|------------------------------|
| [LO0,T0] | EMIT _TICKLEN,#12            |
| [L01,T0] | SIGNAL A                     |
| [L02,T0] | SIGNAL R                     |
| [L03,T0] | PAR 2, A0, 1                 |
| [L04,T0] | PAR 1, A1, 2                 |
| [L05,T0] | PARE A2,2                    |
| [L06,T1] | AO: WABORTI A,A3             |
| [L07,T1] | A4: EMIT R                   |
| [L08,T1] |                              |
| [L09,T1] |                              |
|          | PAUSE                        |
| [L11,T1] | GOTO A4                      |
| [L12,T1] | A3: EMIT O                   |
|          | A1:AWAIT R                   |
| [L14,T2] |                              |
| [L15,T0] | A2:JOIN 0                    |
| [L16,T0] | HALT                         |
|          |                              |

CAU

A D b 4 A b

 -

Three Compilation Steps The Concurrent KEP Assembler Graph Cyclicity Constraints

### Step 2: Compute Thread Priorities/ids

- Compute priority for current tick at each node
- Compute priority for next tick at tick boundaries

Three Compilation Steps The Concurrent KEP Assembler Graph Cyclicity Constraints

### Step 2: Compute Thread Priorities/ids

- Compute priority for current tick at each node
- Compute priority for next tick at tick boundaries
- Priority within tick must not increase
- Initialize tick boundaries with lowest priority, compute priorites backwards

Three Compilation Steps The Concurrent KEP Assembler Graph Cyclicity Constraints

## Step 2: Compute Thread Priorities/ids

- Compute priority for current tick at each node
- Compute priority for next tick at tick boundaries
- Priority within tick must not increase
- Initialize tick boundaries with lowest priority, compute priorites backwards
- Judicious traversal of CKAG allows to compute each priority just once
  - Facilitates correctness argument
  - Complexity linear in CKAG size

Three Compilation Steps The Concurrent KEP Assembler Graph Cyclicity Constraints

### Step 3: Generate PAR/PRIO Statements

- Enforce that a statement is always executed with same priority, irrespective of control flow
- Must consider priorities for current and for next tick
- Again linear complexity

Three Compilation Steps The Concurrent KEP Assembler Graph Cyclicity Constraints

# CKAG Node Types

The CKAG distinguishes the following sets of nodes:

- D: Delay nodes (octagons)
  - PAUSE, AWAIT, HALT, SUSTAIN
- F: Fork nodes (triangles)
  - PAR/PARE
- T: Transient nodes (rectangles/inverted triangles)
  - ▶ EMIT, PRESENT, etc. (rectangles)
  - JOIN nodes (inverted triangles)

N: Set of all nodes,  $N = D \cup F \cup T$ 

KEP3a Instruction Set + Architecture The Compiler Further Measurements Summary Cyclicity Constraints

### The Concurrent KEP Assembler Graph (CKAG) Define

▶ for each fork node *n*:

- **n.join**: the JOIN statement corresponding to *n*,
- **n.sub**: the transitive closure of nodes in threads generated by *n*.
- ▶ for abort nodes *n* ([L|T] [W] ABORT [I], SUSPEND [I]):
  - **n.end**: the end of the abort scope opened by *n*, **n.scope**: the nodes within *n*'s abort scope.
- for all nodes n:
  - **n.prio**: the priority that the thread executing *n* should be running with
- for  $n \in D \cup F$ ,

**n.prionext**: the priority that the thread executing *n* should be resumed with in the subsequent tick.

Three Compilation Steps The Concurrent KEP Assembler Graph Cyclicity Constraints

## CKAG Dependency Types

Define dependencies

- **n.dep**<sub>i</sub>: the dependency sinks with respect to *n* at the current tick (the *immediate dependencies*)
- n.depd: the dependency sinks with respect to n at the next tick (the delayed dependencies) Induced by emissions of strong abort trigger signals and corresponding delay nodes within the abort scope

Three Compilation Steps The Concurrent KEP Assembler Graph Cyclicity Constraints

### CKAG Successor Types

Define following types of successors for each *n*:

- **n.suc**: the *control successors*.
- **n.suc**<sub>w</sub>: the weak abort successors
- **n.suc**s: the strong abort successors
- **n.suc**<sub>f</sub>: the flow successors the set  $n.suc_c \cup n.suc_w \cup n.suc_s$
- For  $n \in F$  we also define the following fork abort successors **n.suc**<sub>wf</sub>: the weak fork abort successors **n.suc**<sub>sf</sub>: the strong fork abort successors

KEP3a Instruction Set + Architecture The Compiler Further Measurements Summary Constraints Three Compilation Steps The Concurrent KEP Assembler Graph Cyclicity Constraints

### Program Cycle

An Esterel program is considered cyclic iff the corresponding CKAG contains a path from a node to itself, where for all nodes n and their successors along that path, n' and n'', the following holds:

$$n \in D \land n' \in n.suc_w$$

$$\lor n \in F \land n' \in n.suc_c \cup n.suc_{wf}$$

$$\lor n \in T \land n' \in n.suc_c \cup n.dep_i$$

$$\lor n \in T \land n' \in n.dep_d \land n'' \in n'.suc_c \cup n'.suc_s \cup n'.suc_{sf}.$$

### Constraints

A correct priority assignment must fulfill the following constraints, where m, n are arbitrary nodes in the CKAG

### Constraint (Dependencies)

For 
$$m \in n.dep_i$$
:  $n.prio > m.prio$ 

### Constraint (Intra-Tick Priority)

KEP3a Instruction Set + Architecture The Compiler Further Measurents Summary KEP3a Instruction Steps The Concurrent KEP Assembler Graph Coglicity Constraints

### **Computing Thread Priorities**

Constraint (Inter-Tick Priority for Delay Nodes)

▶ For all  $m \in n.suc_c \cup n.suc_s$ :  $n.prionext \ge m.prio$ 

Constraint (Inter-Tick Priority for Fork Nodes)

- ▶ n.prionext ≥ n.join.prio
- For all  $m \in n.suc_{sf}$ :  $n.prionext \ge m.prio$

Three Compilation Steps The Concurrent KEP Assembler Graph Cyclicity Constraints

### Computing Thread Priorities

module Edwards02. input S, I; output O; signal A,R in every S do await I; weak abort sustain R; when immediate A; emit O; loop pause; pause; present R then emit A; end present end loop end every end signal end module

module Edwards02-dism: input S,I; output O; signal A, R in abort 100p pause end loop when S: 1000 abort [ abort 1000 pause end loop when I; weak abort loop emit R; pause end loop when immediate A; emit O Ш % cont...



CAU



|          | INPUT S,I                    |
|----------|------------------------------|
|          | OUTPUT O                     |
| [LO0,T0] |                              |
| [L01,T0] | SIGNAL A                     |
| [L02,T0] | SIGNAL R                     |
| [LO3,T0] | AWAIT S                      |
|          | A2: LABORT S,A3              |
| [L05,T0] | PAR 1,A4,1                   |
| [L06,T0] | PAR 1, A5, 2                 |
| [L07,T0] | PARE A6,1<br>A4: TABORT I,A7 |
|          |                              |
| [L09,T1] |                              |
| [L10,T1] | PAUSE                        |
| [L11,T1] | PRIO 1                       |
| [L12,T1] | GOTO A8                      |
|          | A7: TWABORTI A,A9            |
|          | A10:EMIT R                   |
|          | PRIO 1                       |
| [L16,T1] | PRIO 3                       |
| [L17,T1] | PAUSE                        |
| [L18,T1] | GOTO A10<br>A9: EMIT O       |
|          |                              |
|          | A5:A11: PAUSE                |
|          | PRIO 2                       |
|          | PAUSE                        |
| [L23,T2] | PRESENT R, A12               |
| [L24,T2] | EMIT A                       |
|          | A12:PRIO 1                   |
| [L26,T2] | GOTO A11                     |
|          | A6: JOIN                     |
| [L28,T0] | A3: GOTO A2                  |
|          |                              |

고기는

Three Compilation Steps The Concurrent KEP Assembler Graph Cyclicity Constraints

### Optimized Priority Assignment

|                      | INPUT S,I         |
|----------------------|-------------------|
| FT 00 mol            | OUTPUT O          |
| [L00,T0]             | EMIT _TICKLEN,#20 |
| [L01,T0]             | SIGNAL A          |
| [L02,T0]             | SIGNAL R          |
| [L03,T0]             | AWAIT S           |
| [L04,T0]             |                   |
| [L05,T0]             | PAR 1,A4,1        |
| [L06,T0]             | PAR 1, A5, 2      |
| [L07,T0]             | PARE A6,1         |
| [L08,T1]             | A4: TABORT I,A7   |
| [L09,T1]             | A8: PRIO 3        |
| [L10,T1]             | PAUSE<br>PRIO 1   |
| [L11,T1]<br>[L12,T1] |                   |
| [L12,11]             |                   |
| [L13,11]             | A10:EMIT R        |
| [L15,T1]             |                   |
| [L16,T1]             |                   |
| [L17,T1]             | PAUSE             |
| [L18,T1]             |                   |
| [L19,T1]             |                   |
| [L20,T2]             |                   |
| [L21,T2]             | PRIO 2            |
| [L22,T2]             |                   |
| [L23,T2]             |                   |
| [L24,T2]             | EMIT A            |
| [L25,T2]             | A12:PRI0 1        |
| [L26,T2]             | GOTO A11          |
| [L27,T0]             | A6: JOIN          |
| [L28,T0]             |                   |
|                      |                   |

| [L01,T0]<br>[L02,T0]<br>[L03,T0]<br>[L04,T0]<br>[L05,T0]<br>[L05,T0]<br>[L08,T1]<br>[L09,T1]<br>[L10,T1]<br>[L11,T1]<br>[L12,T1]<br>[L13,T1]<br>[L14,T1]<br>[L14,T1]<br>[L15,T1]<br>[L15,T1]<br>[L15,T2]<br>[L17,T2]<br>[L18,T2] | GOTO A10<br>A9: EMIT O<br>A5:A11: PAUSE<br>PAUSE<br>PRESENT R,A12 |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------|
| [L17,T2]                                                                                                                                                                                                                         | PAUSE                                                             |
|                                                                                                                                                                                                                                  |                                                                   |
| [L19,T2]                                                                                                                                                                                                                         |                                                                   |
| [L20, T2]<br>[L21, T0]                                                                                                                                                                                                           | A12:GOTO A11                                                      |
|                                                                                                                                                                                                                                  | AG: JUIN<br>A3: GOTO A2                                           |
| [L22,10]                                                                                                                                                                                                                         | AS: GUIU AZ                                                       |

A D b 4 A b

 $\Rightarrow$ 

Code Characteristics and Compilation Times Speed, Size, Power, Scalability Analysis of context switches Another Example

## Code Characteristics and Compilation Times

|              | Esterel |       |      |      |         |       |      | MicroBlaze |       |         |           |          |           |                  |
|--------------|---------|-------|------|------|---------|-------|------|------------|-------|---------|-----------|----------|-----------|------------------|
| Module       |         | Threa | ds   | Pree | mptions |       | C٢   | (AG        |       | Preemp  | otion han | idled by | Compiling | Compiling        |
| Name         | Cnt     | Max   | Max  | Cnt  | Max     | Nodes | Dep. | Max        | PRIO  |         | Local     | Thread   | Time      | Time (Sec)       |
|              |         | Depth | Conc |      | Depth   |       | Num  | Priority   | Instr | Watcher | Watcher   | Watcher  | (Sec)     | (V5/V7/CEC)      |
| abcd         | 4       | 2     | 4    | 20   | 2       | 211   | 36   | 3          | 30    | 0       | 4 3       | 16 11    | 0.15      | 0.12 0.09 0.30   |
| abcdef       | 6       | 2     | 6    | 30   | 2       | 313   | 90   | 3          | 48    | 0       | 6 5       | 24 17    | 0.21      | 0.71 0.46 0.96   |
| eight_but    | 8       | 2     | 8    | 40   | 2       | 415   | 168  | 3          | 66    | 0       | 8 7       | 32 23    | 0.26      | 0.99 0.54 1.25   |
| chan_prot    | 5       | 3     | 4    | 6    | 1       | 80    | 4    | 2          | 10    | 0       | 0         | 6 4      | 0.07      | 0.35 0.35 0.43   |
| reactor_ctrl | 3       | 2     | 3    | 5    | 1       | 51    | 5    | 1          | 0     | 0       | 1 0       | 4        | 0.06      | 0.29 0.31 0.36   |
| runner       | 2       | 2     | 2    | 9    | 3       | 61    | 0    | 1          | 0     | 3 2     | 1         | 5 3      | 0.05      | 0.30 0.34 0.40   |
| example      | 2       | 2     | 2    | 4    | 2       | 36    | 2    | 3          | 6     | 0       | 1         | 3 2      | 0.05      | 0.28 0.31 0.31   |
| ww_button    | 13      | 3     | 4    | 27   | 2       | 194   | 0    | 1          | 0     | 0       | 5         | 22 10    | 0.10      | 0.44 0.40 0.64   |
| greycounter  | 17      | 3     | 13   | 19   | 2       | 414   | 53   | 6          | 58    | 0       | 4         | 15       | 0.34      | 0.57 0.43 0.75   |
| tcint        | 39      | 5     | 17   | 18   | 2       | 583   | 65   | 3          | 20    | 0       | 1         | 17 10    | 0.34      | 0.41 0.52 1.11   |
| mca200       | 59      | 5     | 49   | 64   | 4       | 11219 | 129  | 11         | 190   | 2       | 14        | 48       | 11.25     | 69.81 12.99 7.37 |

Note: In the mca200, the watcher refinement reduces the hardware requirements from 4033 slices (if all preemptions were handled by general purpose Watchers) to 3265 slices (19% reduction).

Code Characteristics and Compilation Times Speed, Size, Power, Scalability Analysis of context switches Another Example

## Worst-/Average-Case Reaction Times

|              | MicroBlaze |       |       |       |       |       | KEP3a-Unoptimized |          |      |          | KEP3a-optimized |          |      |          |
|--------------|------------|-------|-------|-------|-------|-------|-------------------|----------|------|----------|-----------------|----------|------|----------|
| Module Name  |            | WCRT  |       |       | ACRT  |       | WCRT              | Ratio to | ACRT | Ratio to | WCRT            | Ratio to | ACRT | Ratio    |
|              | V5         | V7    | CEC   | V5    | V7    | CEC   |                   | best MB  |      | best MB  |                 | Unopt    |      | to Unopt |
| abcd         | 1559       | 954   | 1476  | 1464  | 828   | 1057  | 135               | 0.14     | 87   | 0.11     | 135             | 1        | 84   | 0.97     |
| abcdef       | 2281       | 1462  | 1714  | 2155  | 1297  | 1491  | 201               | 0.14     | 120  | 0.09     | 201             | 1        | 117  | 0.98     |
| eight_but    | 3001       | 1953  | 2259  | 2833  | 1730  | 1931  | 267               | 0.14     | 159  | 0.09     | 267             | 1        | 153  | 0.96     |
| chan_prot    | 754        | 375   | 623   | 683   | 324   | 435   | 117               | 0.31     | 60   | 0.19     | 117             | 1        | 54   | 0.90     |
| reactor_ctrl | 487        | 230   | 397   | 456   | 214   | 266   | 54                | 0.23     | 45   | 0.21     | 51              | 0.94     | 39   | 0.87     |
| runner       | 566        | 289   | 657   | 512   | 277   | 419   | 36                | 0.12     | 15   | 0.05     | 30              | 0.83     | 6    | 0.40     |
| example      | 467        | 169   | 439   | 404   | 153   | 228   | 42                | 0.25     | 24   | 0.16     | 42              | 1        | 24   | 1        |
| ww_button    | 1185       | 578   | 979   | 1148  | 570   | 798   | 72                | 0.12     | 51   | 0.09     | 48              | 0.67     | 36   | 0.71     |
| greycounter  | 1965       | 1013  | 2376  | 1851  | 928   | 1736  | 528               | 0.52     | 375  | 0.40     | 528             | 1        | 375  | 1        |
| tcint        | 3580       | 1878  | 2350  | 3488  | 1797  | 2121  | 408               | 0.22     | 252  | 0.14     | 342             | 0.84     | 204  | 0.81     |
| mca200       | 75488      | 29078 | 12497 | 73824 | 24056 | 11479 | 2862              | 0.23     | 1107 | 0.10     | 2862            | 1        | 1107 | 1        |

The worst-/average-case reaction times, in clock cycles, for the KEP3a and MicroBlaze

- WCRT speedup: typically >4x
- ACRT speedup: typically >5x
- Optimizations yield further improvements

Code Characteristics and Compilation Times Speed, Size, Power, Scalability Analysis of context switches Another Example

# Memory Usage

| Esterel MicroBlaze |                 |        | e         |       | KEF                          | KEP3a-opt. |       |         |             |         |
|--------------------|-----------------|--------|-----------|-------|------------------------------|------------|-------|---------|-------------|---------|
| Module Name        | lodule Name LOC |        | -Data (   | byte) | Code (word) Code+Data (byte) |            |       |         | Code (word) |         |
|                    |                 | V5     | V7        | CEC   | abs.                         | rel.       | abs.  | rel.    | abs.        | rel.    |
|                    | [1]             | [2     | 2] (best) | )     | [3]                          | [3]/[1]    | [4]   | [4]/[2] | [5]         | [5]/[3] |
| abcd               | 160             | 6680   | 7928      | 7212  | 168                          | 1.05       | 756   | 0.11    | 164         | 0.93    |
| abcdef             | 236             | 9352   | 9624      | 9220  | 252                          | 1.07       | 1134  | 0.12    | 244         | 0.94    |
| eight_but          | 312             | 12016  | 11276     | 11948 | 336                          | 1.08       | 1512  | 0.13    | 324         | 0.94    |
| $chan_prot$        | 42              | 3808   | 6204      | 3364  | 66                           | 1.57       | 297   | 0.09    | 62          | 0.94    |
| reactor_ctrl       | 27              | 2668   | 5504      | 2460  | 38                           | 1.41       | 171   | 0.07    | 34          | 0.89    |
| runner             | 31              | 3140   | 5940      | 2824  | 39                           | 1.22       | 175   | 0.06    | 27          | 0.69    |
| example            | 20              | 2480   | 5196      | 2344  | 31                           | 1.55       | 139   | 0.06    | 28          | 0.94    |
| ww_button          | 76              | 6112   | 7384      | 5980  | 129                          | 1.7        | 580   | 0.10    | 95          | 0.74    |
| greycounter        | 143             | 7612   | 7936      | 8688  | 347                          | 2.43       | 1567  | 0.21    | 343         | 1       |
| tcint              | 355             | 14860  | 11376     | 15340 | 437                          | 1.23       | 1968  | 0.17    | 379         | 0.87    |
| mca200             | 3090            | 104536 | 77112     | 52998 | 8650                         | 2.79       | 39717 | 0.75    | 8650        | 1       |

Unoptimized: 83% avg reduction of memory usage (Code+RAM)

▶ Optimized: May yield further 5% to 30+% improvements > < => = ∽ < @

CAU

Xin Li, Marian Boldt, Reinhard v. Hanxleden

A Multi-Threaded Reactive Processor

Code Characteristics and Compilation Times Speed, Size, Power, Scalability Analysis of context switches Another Example

### Power Consumption

CIAU

|              | MicroBlaze   | KEP  | 3a <sup>2</sup> | Ratio       |      |  |
|--------------|--------------|------|-----------------|-------------|------|--|
| Module       | (82mW@50MHz) | (m\  | N)              | (KEP to MB) |      |  |
| Name         | Idle         | Peak | Idle            | Peak        | Idle |  |
| abcd         | 69           | 13   | 8               | 0.16        | 0.12 |  |
| abcdef       | 74           | 13   | 7               | 0.16        | 0.09 |  |
| eight_but    | 74           | 13   | 7               | 0.16        | 0.09 |  |
| chan_prot    | 70           | 28   | 12              | 0.34        | 0.17 |  |
| reactor_ctrl | 76           | 20   | 13              | 0.24        | 0.17 |  |
| runner       | 78           | 14   | 2               | 0.17        | 0.03 |  |
| example      | 77           | 25   | 9               | 0.30        | 0.12 |  |
| ww_button    | 81           | 13   | 4               | 0.16        | 0.05 |  |
| greycounter  | 78           | 44   | 33              | 0.54        | 0.42 |  |
| tcint        | 80           | 18   | 10              | 0.22        | 0.13 |  |

- Peak energy usage reduction: 75% avg
- ▶ Idle (= no inputs) energy usage reduction: 86% avg

<sup>&</sup>lt;sup>2</sup>Based on Xilinx 3S200-4ft256, requires an additional 37mW as quiescent power for the chip itself

Code Characteristics and Compilation Times Speed, Size, Power, Scalability Analysis of context switches Another Example

# Scalability

Synthesis results for Xilinx 3S1500-4fg-676<sup>3</sup>

| Thread | Slicos | Gates (k) |
|--------|--------|-----------|
| Count  | Silces | Gales (K) |
| 2      | 1295   | 295       |
| 10     | 1566   | 299       |
| 20     | 1871   | 311       |
| 40     | 2369   | 328       |
| 60     | 3235   | 346       |
| 80     | 4035   | 373       |
| 100    | 4569   | 389       |
| 120    | 5233   | 406       |
|        |        |           |

- 48 valued signals up to 256 possible
- 2 Watchers, 8 Local Watchers either up to 64 possible
- 1k (1024) instruction words up to 64k possible
- 128 registers (in word) up to 512 possible
- 16-bits (65536) max counter value
- Frequency is stable (around 60 MHz)

Note: In the mca200, the watcher refinement reduces the hardware requirements from 4033 slices (if all preemptions were handled by general purpose Watchers) to 3265 slices (19% reduction).

<sup>3</sup>For comparison, a MicroBlaze implementation requires around 1k slices

Code Characteristics and Compilation Times Speed, Size, Power, Scalability Analysis of context switches Another Example

A D b 4 A b

### Analysis of Context Switches

|              | Instr's | CS     | Ss      | CSs at | same    | PRI    | Os      | CS     | s due   | to      |
|--------------|---------|--------|---------|--------|---------|--------|---------|--------|---------|---------|
| Module       | total   | tot    | al      | prio   | rity    | tot    | al      |        | PRIO    |         |
| Name         | abs.    | abs.   | ratio   | abs.   | rel.    | abs.   | rel.    | abs.   | rel.    | rel.    |
|              | [1]     | [2]    | [1]/[2] | [3]    | [3]/[2] | [4]    | [4]/[1] | [5]    | [5]/[2] | [5]/[4] |
| abcd         | 16513   | 3787   | 4.36    | 1521   | 0.40    | 3082   | 0.19    | 1243   | 0.33    | 0.40    |
| abcdef       | 29531   | 7246   | 4.08    | 3302   | 0.46    | 6043   | 0.20    | 2519   | 0.35    | 0.42    |
| eight_but    | 39048   | 10073  | 3.88    | 5356   | 0.53    | 8292   | 0.21    | 3698   | 0.37    | 0.45    |
| chan_prot    | 5119    | 1740   | 2.94    | 707    | 0.41    | 990    | 0.19    | 438    | 0.25    | 0.44    |
| reactor_ctrl | 151     | 48     | 3.15    | 29     | 0.60    | 0      | 0       | 0      | 0       | -       |
| runner       | 5052    | 704    | 7.18    | 307    | 0.44    | 0      | 0       | 0      | 0       | -       |
| example      | 208     | 60     | 3.47    | 2      | 0.30    | 26     | 0.13    | 9      | 0.15    | 0.35    |
| ww_button    | 292     | 156    | 1.87    | 92     | 0.59    | 0      | 0       | 0      | 0       | -       |
| greycounter  | 160052  | 34560  | 4.63    | 14043  | 0.41    | 26507  | 0.17    | 12725  | 0.37    | 0.48    |
| tcint        | 80689   | 33610  | 2.4     | 16769  | 0.50    | 5116   | 0.06    | 2129   | 0.06    | 0.42    |
| mca200       | 982417  | 256988 | 3.82    | 125055 | 0.49    | 242457 | 0.25    | 105258 | 0.41    | 0.43    |

-

< E

Code Characteristics and Compilation Times Speed, Size, Power, Scalability Analysis of context switches Another Example

## Edwards02: Esterel to KEP



・ロト ・ 同ト ・ ヨト ・ ヨト

Code Characteristics and Compilation Times Speed, Size, Power, Scalability Analysis of context switches Another Example

# Edwards02: a Possible Execution Trace



ヘロト ヘ戸ト ヘヨト

KEP3a Instruction Set + Architecture The Compiler

Further Measurements Summary Code Characteristics and Compilation Times Speed, Size, Power, Scalability Analysis of context switches Another Example

|                   | INPUT S,I<br>OUTPUT 0                  | - Tick 1 -                 |
|-------------------|----------------------------------------|----------------------------|
| module Edwards02: | [LOO,TO] EMIT _TICKLEN,#20             | ! reset;                   |
| input S, I;       | [LOU, TO] SIGNAL A                     | % In:                      |
| output 0;         | [L01,10] SIGNAL A<br>[L02,T0] SIGNAL R | % Out:                     |
| output U;         |                                        | [L01,T0] [L02,T0] [L03,T0] |
|                   | [L03,T0] AWAIT S                       | - Tick 2 -                 |
| signal A,R in     | [L04,T0] A2: LABORT S,A3               | % In: S                    |
| every S do        | [L05,T0] PAR 1,A4,1                    | % Out:                     |
|                   | [L06,T0] PAR 1,A5,2                    | [L03,T0] [L04,T0] [L05,T0] |
| await I;          | [L07,T0] PARE A6,1                     |                            |
| weak abort        | [L08,T1] A4: TABORT I,A7               | [L06,T0] [L07,T0]          |
| sustain R;        | [L09,T1] A8: PRIO 3                    | [L20,T2] [L08,T1]          |
|                   | [L10,T1] PAUSE                         | [L09,T1] [L10,T1]          |
| when immediate A; | [L11,T1] PRIO 1                        | [L27,T0]                   |
| emit O;           | [L12,T1] GOTO A8                       | - Tick 3 -                 |
|                   | [L13,T1] A7: TWABORTI A,A9             | % In: I                    |
|                   | [L14,T1] A10:EMIT R                    | % Out: R                   |
| loop              | [L15,T1] PRIO 1                        | [L10,T1] [L13,T1]          |
| pause;            | [L16,T1] PRIO 3                        | [L14,T1] [L15,T1]          |
| pause;            | [L17,T1] PAUSE                         | [L20,T2] [L21,T2] [L22,T2] |
| · · · ·           | [L18,T1] GOTO A10                      | [L16,T1] [L17,T1] [L27,T0] |
| present R then    | [L19,T1] A9: EMIT O                    | - Tick 4 -                 |
| emit A;           | [L20,T2] A5:A11: PAUSE                 | % In:                      |
| end present       | [L21,T2] PRIO 2                        | % Out: A R O               |
| end loop          | [L22,T2] PAUSE                         | [L17,T1] [L18,T1]          |
|                   | [L23,T2] PRESENT R,A12                 | [L14,T1] [L15,T1]          |
| end every         | [L24,T2] EMIT A                        | [L22,T2] [L23,T2] [L24,T2] |
| end signal        | [L25,T2] A12:PRIO 1                    | [L25,T2] [L26,T2] [L20,T2] |
| end module        | [L26,T2] GOTO A11                      | [L16,T1] [L17,T1] [L19,T1] |
|                   | [L27,T0] A6: JOIN                      | [L27,T0]                   |
|                   | [L28,T0] A3: GOTO A2                   | L                          |

イロト イポト イヨト イヨト

Multi-processing vs. Multi-threading Comparison of Synthesis Options Application Scenarios

# Multi-processing vs. Multi-threading



Multi-processing (EMPEROR/RePIC)

- ► Esterel thread ≈ one independent RePIC processor
- Thread Control Unit handles the synchronization and communication
- Three-valued signal representation
- sync command to synchronize threads



- Multi-threading (KEP)
  - ► Esterel thread ≈ several registers
  - priority-based scheduler

. . . . . . .

 PRIO command to synchronize threads

Multi-processing vs. Multi-threading Comparison of Synthesis Options Application Scenarios

< 4 P ►

## Comparison of Synthesis Options

|       |                | HW   | SW | Co-design | Reactive F<br>Multi-processing |     |
|-------|----------------|------|----|-----------|--------------------------------|-----|
| Speed |                | ++   | -  | +         | +                              | +   |
|       | Flexibility    |      | ++ | -         | +/-                            | +   |
|       | Scalability    |      | ++ | +         |                                | +   |
|       | Logic Area     | ++/- | +  | +         |                                | +/- |
| Cost  | Memory         | ++   |    | -         | +                              | +   |
|       | Power Usage    | ++   | -  | -         |                                | +   |
| Appl. | . Design Cycle |      | ++ | +/-       | ++                             | ++  |

-

- A 🗐

Multi-processing vs. Multi-threading Comparison of Synthesis Options Application Scenarios

#### Scenario I: DSP + Reactive Processor



Multi-processing vs. Multi-threading Comparison of Synthesis Options Application Scenarios

## Scenario II: DSP + HW Block + Reactive Processor



Slide 60

Multi-processing vs. Multi-threading Comparison of Synthesis Options Application Scenarios

#### Scenario III: HW Block + Reactive Processor



Multi-processing vs. Multi-threading Comparison of Synthesis Options Application Scenarios

### Possible Co-Design Development Flow



Reactive processing ...

- permits a simple mapping strategy
- allows optimizations on high-level
- can meet stricter constraints than classical architectures
- permits a better tradeoff between all cost factors

- E - E

< 4 P

-

Multi-processing vs. Multi-threading Comparison of Synthesis Options Application Scenarios

### Possible Co-Design Development Flow



Reactive processing ...

- permits a simple mapping strategy
- allows optimizations on high-level
- can meet stricter constraints than classical architectures
- permits a better tradeoff between all cost factors

∃ >

#### Thanks/Comments?

-