
Lecture 07 – CH 3 CH 4
Data Flow Implementation in Hardware and Software
Analysis of Control and Data Flow

Ryan Robucci



Mapping Dataflow to Software

  • Choices for mapping dataflow to software:
    • Parallel using multiple CPUs (Processor Networks)
    • Sequential using a single CPU
      • Dynamic Schedule: firing decisions are made at run-time; executive thread(s) monitor the queues and activate actors when their firing rules are satisfied
      • Static Schedule:
        • firing order can be decided at compile/synthesis time
        • therefore, firing-rule checking can be removed from run-time execution
        • multiple actor firings can be mapped to a deterministic sequential execution, with the actors' code combined as in-line code and optimized

FIFO Queues

#include <assert.h>
#include <stdio.h>
#include <stdbool.h>

#define MAXFIFOSIZE 3

typedef struct fifo {
  int data [MAXFIFOSIZE]; // contiguous block of memory for token storage
  unsigned int writeOffset;   // write offset into contiguous block
  unsigned int readOffset;    // read offset into contiguous block
  bool flagFull;
} fifo_t;


void init_fifo (fifo_t * F){
  F->writeOffset=0;
  F->readOffset=0;
  F->flagFull=false;
}

unsigned fifo_size(fifo_t *F) {
  unsigned size;
  if (F->writeOffset >= F->readOffset){
    size = F->writeOffset - F->readOffset;
  }else{
    size = F->writeOffset + MAXFIFOSIZE - F->readOffset;
  }
  if (size==0 && F->flagFull){
    size=MAXFIFOSIZE;
  }
  return size;
}

void put_fifo(fifo_t *F, int d) {
  if (!(F->flagFull)) {
    printf("Write %d\n",d);
    F->data[F->writeOffset]=d;
    F->writeOffset = ((F->writeOffset)+1) % MAXFIFOSIZE;
    F->flagFull = (F->writeOffset == F->readOffset);
    printf("New writeOffset: %u\n",F->writeOffset);
    printf("New flagFull: %d\n",F->flagFull);
  }else{
    printf("No Write, FIFO is already Full\n");
  }
}

int get_fifo(fifo_t *F) {
  int result;
  if (F->writeOffset!=F->readOffset || F->flagFull) {
    result = F->data[F->readOffset];
    F->readOffset = ((F->readOffset)+1) % MAXFIFOSIZE;
    F->flagFull = false;
    printf("New readOffset: %u\n",F->readOffset);
    printf("New flagFull: %d\n",F->flagFull);
  } else {
    result = -1;
  }
  return result;
}

void print_fifo(fifo_t *F){
  unsigned size = fifo_size(F);
  printf("FIFO size: %u :: ",size);
  printf("FIFO Contents: ");
  for (unsigned index=0;index<size;++index){
    printf("%d ",F->data[(F->readOffset+index)%MAXFIFOSIZE]);
  }
  printf("\n");
}

int main() {
  fifo_t F1;
  int token;

  init_fifo(&F1);                     
  put_fifo(&F1, 3);                     // put 3
  print_fifo(&F1);                      
  put_fifo(&F1, 5);                     // put 5
  print_fifo(&F1);
  token = get_fifo(&F1);                // get 3
  printf("token:%d\n",token);
  put_fifo(&F1, 7);                     // put 7
  print_fifo(&F1);
  put_fifo(&F1, 11);                    // put 11
  print_fifo(&F1);
  put_fifo(&F1, 12);                    // put 12 fails (token lost, system result would be incorrect)
  print_fifo(&F1);                    
  token = get_fifo(&F1);                // get 5
  printf("token:%d\n",token);
  token = get_fifo(&F1);                // get 7
  printf("token:%d\n",token);
  token = get_fifo(&F1);                // get 11
  printf("token:%d\n",token);
  token = get_fifo(&F1);                // failed get
  printf("token:%d\n",token);
}

Write 3
New writeOffset: 1
New flagFull: 0
FIFO size: 1 :: FIFO Contents: 3 
Write 5
New writeOffset: 2
New flagFull: 0
FIFO size: 2 :: FIFO Contents: 3 5 
New readOffset: 1
New flagFull: 0
token:3
Write 7
New writeOffset: 0
New flagFull: 0
FIFO size: 2 :: FIFO Contents: 5 7 
Write 11
New writeOffset: 1
New flagFull: 1
FIFO size: 3 :: FIFO Contents: 5 7 11 
No Write, FIFO is already Full
FIFO size: 3 :: FIFO Contents: 5 7 11 
New readOffset: 2
New flagFull: 0
token:5
New readOffset: 0
New flagFull: 0
token:7
New readOffset: 1
New flagFull: 0
token:11
token:-1

Depiction of Length-24 and Length-3 circular buffer

Example Length-3 Implementation

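Initially empty array, with writePtr and readPtr at the same location

  slots: [ _ | _ | _ ]    writePtr: slot 0, readPtr: slot 0

Put: write 3

  slots: [ 3 | _ | _ ]    writePtr: slot 1, readPtr: slot 0

Put: write 5

  slots: [ 3 | 5 | _ ]    writePtr: slot 2, readPtr: slot 0

Get: read returning 3 (consumed values remain in memory until overwritten)

  slots: [ 3 | 5 | _ ]    writePtr: slot 2, readPtr: slot 1

Put: write 7, writePtr wraps around to the beginning of the allocated block

  slots: [ 3 | 5 | 7 ]    writePtr: slot 0, readPtr: slot 1

Put: write 11, writePtr becomes the same as readPtr, defining the onset of Queue Full

  slots: [ 11 | 5 | 7 ]   writePtr: slot 1, readPtr: slot 1

Get: returns 5

  slots: [ 11 | 5 | 7 ]   writePtr: slot 1, readPtr: slot 2

Get: returns 7, readPtr wraps around to the beginning of the allocated block

  slots: [ 11 | 5 | 7 ]   writePtr: slot 1, readPtr: slot 0

Get: returns 11, readPtr becomes the same as writePtr, defining the onset of Queue Empty

  slots: [ 11 | 5 | 7 ]   writePtr: slot 1, readPtr: slot 1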

Actors

Scheduler

Static Firing Schedule

Example

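[Figure: dataflow graph with actors pulse, add21, scale, and print.
  Queue A: pulse -> pulse (produce 1, consume 1), initial token of value 1
  Queue B: pulse -> add21 (produce 2, consume 2)
  Queue C: add21 -> print (produce 2, consume 1)
  Queue D: add21 -> scale (produce 1, consume 1)
  Queue E: scale -> add21 (produce 1, consume 1), initial token of value 0]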

#define MAXIO 8
typedef struct actorio {
  fifo_t *in[MAXIO];
  fifo_t *out[MAXIO];
}actorio_t;

Assume fifo_t now stores float tokens rather than int.

[Figure sequence: token values held on each queue as the actors fire during one pass]
  initial:      A: 1              E: 0
  after pulse:  A: 0   B: 1 1     E: 0
  after add21:  A: 0   C: 2 2     D: 2
  after scale:  A: 0   C: 2 2     E: 1
  after print:  A: 0   C: 2       E: 1
  after print:  A: 0              E: 1

end of one PASS cycle
start a new cycle

[Figure sequence: token values held on each queue at the start of the next pass]
  after pulse:  A: 0   B: 0 0     E: 1
  after add21:  A: 0   C: 1 1     D: 1

and so on…

output will be 2 2 1 1 0.5 0.5 …

A simple scheduler can be implemented if firing rules are in-built

 while(1){
   pulse_inbuiltfr(&pulse_io);
   add21_inbuiltfr(&add21_io);
   scale_inbuiltfr(&scale_io);
   print_inbuiltfr(&print_io);
 }
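
As an illustration of what an actor with an in-built firing rule might look like, here is a minimal sketch of add21 built on actorio_t. It is not the lecture's implementation: the float FIFO is simplified to use a token count instead of flagFull, fifo_room is a hypothetical helper, and the port assignment (in[0] = Queue B, in[1] = Queue E, out[0] = Queue C, out[1] = Queue D) is an assumption. The function returns false without touching any queue when the rule is not satisfied.

```c
#include <assert.h>
#include <stdbool.h>

#define MAXFIFOSIZE 4
#define MAXIO 8

// Simplified float FIFO (assumption): a token count replaces flagFull.
typedef struct fifo {
  float data[MAXFIFOSIZE];
  unsigned writeOffset;
  unsigned readOffset;
  unsigned count;
} fifo_t;

typedef struct actorio {
  fifo_t *in[MAXIO];
  fifo_t *out[MAXIO];
} actorio_t;

unsigned fifo_size(const fifo_t *F) { return F->count; }
unsigned fifo_room(const fifo_t *F) { return MAXFIFOSIZE - F->count; }  // hypothetical helper

void put_fifo(fifo_t *F, float d) {
  assert(F->count < MAXFIFOSIZE);  // callers check room first
  F->data[F->writeOffset] = d;
  F->writeOffset = (F->writeOffset + 1) % MAXFIFOSIZE;
  F->count++;
}

float get_fifo(fifo_t *F) {
  assert(F->count > 0);            // callers check size first
  float d = F->data[F->readOffset];
  F->readOffset = (F->readOffset + 1) % MAXFIFOSIZE;
  F->count--;
  return d;
}

// add21 with its firing rule built in; returns true only if it fired.
bool add21_inbuiltfr(actorio_t *io) {
  fifo_t *B = io->in[0], *E = io->in[1];    // assumed port order
  fifo_t *C = io->out[0], *D = io->out[1];  // assumed port order
  // Firing rule: 2 tokens on B, 1 token on E, and room for all outputs.
  if (fifo_size(B) < 2 || fifo_size(E) < 1) return false;
  if (fifo_room(C) < 2 || fifo_room(D) < 1) return false;
  float x = get_fifo(B);
  float y = get_fifo(B);
  float z = get_fifo(E);
  float r = x + y + z;
  put_fifo(C, r);
  put_fifo(C, r);
  put_fifo(D, r);
  return true;
}
```

Because the rule is checked inside the actor, a scheduler can call every actor in any order on each loop iteration; a call that cannot fire simply returns.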

If the inbuilt firing rules are removed from each actor function, a dynamic scheduler needs to test each firing rule before invoking the actor function.
Assume actors pulse, add21, scale, and print with no inbuilt firing rules:

 while(1){
   if (/*firing rule for pulse satisfied*/) 
      pulse (&pulse_io);
   if (/*firing rule for add21 satisfied*/) 
      add21 (&add21_io); 
   if (/*firing rule for scale satisfied*/) 
      scale (&scale_io);  
   if (/*firing rule for print satisfied*/) 
      print (&print_io); 
 }

If a known valid firing pattern exists, the firing tests are not required
Static Scheduler:

 while(1){
   pulse (&pulse_io);
   add21 (&add21_io);  
   scale (&scale_io);
   print(&print_io);
   print(&print_io);
 }

Static and Inline not requiring firing rule tests anywhere:

 while(1){

   //pulse (&pulse_io);
   x = get_fifo(A);
   put_fifo(B,x);
   put_fifo(B,x);

   //add21 (&add21_io);  
   x = get_fifo(B);
   y = get_fifo(B);
   z = get_fifo(E);
   r=x+y+z;
   put_fifo(C,r);
   put_fifo(C,r);
   put_fifo(D,r);

   //scale (&scale_io);
   x = get_fifo(D);
   r = x/2;
   put_fifo(E,r);

   //print
   printf("%f\n",get_fifo(C));

   //print
   printf("%f\n",get_fifo(C));
 }

Example of inline with FIFOs around scale actor reduced to variables.

 while(1){
   //pulse (&pulse_io);
   x = get_fifo(A);
   put_fifo(B,x);
   put_fifo(B,x);
   //add21 (&add21_io);  
   x = get_fifo(B);
   y = get_fifo(B);
   z = e;
   r=x+y+z;
   put_fifo(C,r);
   put_fifo(C,r);
   d = r;

   //scale (&scale_io);
   e = d/2;

   //print
   printf("%f\n",get_fifo(C));
   //print
   printf("%f\n",get_fifo(C));
 }

Continuing along such lines, the entire program could be converted to inline code with fixed-length queues implemented as multiple variables, and desired initialization for initial token values would happen before the while loop.
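
A minimal sketch of that end point follows, with every queue reduced to plain variables. The initial token values (1 on Queue A, 0 on Queue E) and the 0 that pulse writes back onto Queue A are assumptions inferred from the token figures and the stated output; run_inline_schedule and the out[] array are hypothetical names added so the result can be checked.

```c
#include <assert.h>
#include <stdio.h>

// Runs `passes` iterations of the static schedule pulse, add21, scale, print, print.
// Prints each output and also stores it in out[], which needs 2*passes entries.
void run_inline_schedule(int passes, float out[]) {
  // Initial tokens are set before the loop (assumed values).
  float a = 1.0f;  // Queue A
  float e = 0.0f;  // Queue E

  for (int n = 0; n < passes; ++n) {
    // pulse: consume one token from A, produce two copies on B
    float x = a;
    a = 0.0f;      // assumption: pulse puts 0 back on its self-loop
    float b0 = x;
    float b1 = x;

    // add21: consume two tokens from B and one from E
    float r = b0 + b1 + e;
    float c0 = r;  // Queue C holds two tokens per pass
    float c1 = r;
    float d = r;   // Queue D holds one token per pass

    // scale
    e = d / 2;

    // print, fired twice per pass
    printf("%f\n", c0);
    printf("%f\n", c1);
    out[2 * n] = c0;
    out[2 * n + 1] = c1;
  }
}
```

Under these assumptions, three passes produce 2 2 1 1 0.5 0.5, matching the output stated earlier.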

HW Implementation of Single-Rate Data Flow Graphs

  • Book’s mapping rules (assumptions):
    • actors implemented as combinational circuits
    • queues implemented as wires
    • initial tokens become registers
  • Therefore
    • Actors implemented as combinational circuits operate within a single clock cycle
    • A chain of back-to-back actors with no register (initial token) between them must complete as a whole within one clock cycle; their combinational delays add
    • The critical path can be identified from the resource graph as the back-to-back actor chain with the largest sum of delays; it determines the maximum allowed clock frequency
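
The critical-path rule above can be sketched as a longest-path computation over the register-free connections between actors. This is only an illustration: the delay values, the adjacency-matrix representation, and the function names are assumptions, and the actors are assumed to be numbered in topological order (every wire goes from a lower index to a higher one).

```c
#include <assert.h>

#define MAXACTORS 8

// delay_ns[i]: combinational delay of actor i, in ns.
// wire[i][j] != 0: actor i drives actor j with no register in between (requires i < j).
// Returns the largest sum of delays along any register-free chain.
double critical_path_ns(int n, const double delay_ns[], int wire[][MAXACTORS]) {
  double arrival[MAXACTORS];  // latest time at which each actor's output settles
  double worst = 0.0;
  for (int j = 0; j < n; ++j) {
    double latest_input = 0.0;  // actors fed only by registers start at time 0
    for (int i = 0; i < j; ++i) {
      if (wire[i][j] && arrival[i] > latest_input) {
        latest_input = arrival[i];
      }
    }
    arrival[j] = latest_input + delay_ns[j];
    if (arrival[j] > worst) {
      worst = arrival[j];
    }
  }
  return worst;
}

// Upper bound on clock frequency in MHz, ignoring register setup time and clock skew.
double max_clock_mhz(double critical_ns) {
  return 1000.0 / critical_ns;
}
```

For example, two chained actors with delays of 2 ns and 3 ns give a 5 ns critical path and a bound of 200 MHz, even if a separate, unconnected actor has a 4 ns delay.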

HW Implementation of C Code

Data and Control Edges of a C Program

  • For the moment, view C as algorithm behavior description
  • Data Edge
    • defines data production and consumption relationship
    • fundamental aspect of algorithm (information flow)
    • must always be implemented
  • Control Edge
    • defines order of execution/firing (e.g. actor X must fire before actor Y)
    • consequence of implementation
    • not fundamental, may be removed in case of possible concurrency

Control and Data Flow Graph (CFG and DFG) Analysis of C Code

Code                        | Operation    | Consume    | Produce
1: int max(int a, int b){   | enter        | _          | a, b
   int r;                   | _            | _          | _
2: if (a>b)                 | if then else | a, b       | a flag known as (a>b)
3:   r=a;                   | _            | a, (a>b)   | r
   else                     | _            | _          | _
4:   r=b;                   | _            | b, (a>b)   | r
5: return r;                | return max   | r          | _

†Schaumont

Control Edges†:

[Figure: CFG with edges 1->2, 2->3, 2->4, 3->5, 4->5]

DFG†:

[Figure: DFG with edges 1->2 (a,b), 1->3 (a), 1->4 (b), 2->3 (a>b), 2->4 (a>b), 3->5 (r), 4->5 (r)]

Constructing DFG and CFG

CFG for conditional branching

Constructing DFG

  • Edges from Explicit Assignment Statements:
    • Start by identifying a node j which consumes a variable v.
    • Identify all possible nodes i which write to v.
    • Draw an edge from i to j if
      a control path exists from i to j
      AND
      no other node on that path also writes to v
    • Repeat for every node j and every variable v it uses
  • Edges from Conditional Expression Evaluations (these have an implicit output flag variable)
    • Add a data edge from conditional expression evaluation node i to node j if a control path exists from i to j
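
The rule for explicit assignments can be sketched as a graph search: starting from a producer i, follow control edges, never continue past a node that writes v, and report an edge if consumer j is reached. The code below is an illustrative sketch rather than the book's algorithm; cfg_t, has_data_edge, and example_loop_cfg are hypothetical names, and the hard-coded CFG is the while-loop function shown in the comment, tracking writes to a only.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAXNODES 8

typedef struct cfg {
  int n;                                   // nodes are numbered 1..n
  bool edge[MAXNODES + 1][MAXNODES + 1];   // edge[i][j]: control edge i -> j
  bool writes[MAXNODES + 1];               // writes[i]: node i assigns the variable v
} cfg_t;

// True if some control path i -> ... -> j has no intermediate node writing v.
bool has_data_edge(const cfg_t *g, int i, int j) {
  bool visited[MAXNODES + 1] = {false};
  int stack[MAXNODES + 1];
  int top = 0;
  stack[top++] = i;
  while (top > 0) {
    int cur = stack[--top];
    for (int s = 1; s <= g->n; ++s) {
      if (!g->edge[cur][s]) continue;
      if (s == j) return true;             // reached the consumer
      if (!visited[s] && !g->writes[s]) {  // a writer of v blocks the path
        visited[s] = true;
        stack[top++] = s;
      }
    }
  }
  return false;
}

// CFG of:  1: int func(int a,int b){  2: while (b>a){  3: a=b;  4: a=a*2; }  5: return a; }
cfg_t example_loop_cfg(void) {
  cfg_t g;
  memset(&g, 0, sizeof g);
  g.n = 5;
  g.edge[1][2] = g.edge[2][3] = g.edge[2][5] = g.edge[3][4] = g.edge[4][2] = true;
  g.writes[1] = g.writes[3] = g.writes[4] = true;  // nodes that define a
  return g;
}
```

For consumer node 4 and variable a, only producer 3 yields an edge; the candidates from nodes 1 and 4 are rejected because every path to node 4 passes through node 3, which rewrites a.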

Example 0

1: int func(int a){
2:   a=a*2;
3:   a=a+1;
4:   return a;  
   }

CFG:

[Figure: CFG with edges 1->2, 2->3, 3->4]

DFG:

initial:

[Figure: candidate data edges for a: 1->2, 1->3, 1->4, 2->3, 2->4, 3->4]

after pruning using the CFG:

[Figure: remaining data edges for a: 1->2, 2->3, 3->4; pruned: 1->3, 1->4, 2->4]

Example 1

1: int func(int a,int b){
2:   while (b>a){
3:     a=b;
4:     a=a*2;
     }  
5:   return a;  
   }

CFG:

[Figure: CFG with edges 1->2, 2->3 when (b>a), 2->5 when !(b>a), 3->4, 4->2]

DFG:
examining only consumer node j=4 and only symbol a with all potential producers

[Figure: candidate data edges into node 4 for a: 3->4 kept; 1->4 and 4->4 pruned]

Example 2

CODE†:

CFG†:

DFG for node 5†:

DFG† considering only a and b; the computed condition flags must then be included:

†Schaumont

A First Approach to Translate C to Hardware

Designing Data Paths

  1. C variables are implemented in HW as registers, with a multiplexer at the input if there are multiple sources (mux controlled by the FSM controller)
  2. C expressions are implemented as combinational logic, with results output as either
    (a) data for the datapath (example where b is a register: “b=a+1;” the expression a+1 is in the datapath)
    (b) flags for the controller (example: “if (a>b) b = a+1;” a>b is a generated flag fed to the controller)
  3. datapath and register variables are connected according to the DFG
    (a) for each assignment, connect the combinational circuit output to the register
    (b) for each data edge, connect the register to the input of the combinational circuit
    (c) connect the appropriate system inputs and outputs

Example: GCD

CFG†:

Data Path Hardware†:

†Schaumont

Designing the Controller

Example: GCD

CFG:
  • next state logic directly from CFG with conditions
  • control outputs are added to each state to complete the FSM
  • a decoder implemented as a lookup table can help
†Schaumont

Controller State machine and Datapath:

[Figure: controller and datapath block diagram. Blocks: Next State Logic, State, LUT, Datapath. Signals: command {nop, run1, run4, run5}, upd_a, upd_b, a_in, b_in, flag_while, flag_if, result]

LUT:
instruction | upd_a mux control  | upd_b mux control
nop         | (a) use previous   | (b) use previous
run1        | (a_in) use input   | (b_in) use input
run4        | (a-b) use sub      | (b) use previous
run5        | (a) use previous   | (b-a) use sub
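
To make the split between controller and datapath concrete, the following sketch simulates both in C. It is a hedged illustration, not the book's design: the state names (S_LOAD, S_LOOP, S_DONE), the flag definitions (flag_while as a != b, flag_if as a > b), and the function names are assumptions. The datapath follows the LUT rows above, with each command selecting the mux input for registers a and b. Inputs are assumed to be positive.

```c
#include <assert.h>

typedef enum { NOP, RUN1, RUN4, RUN5 } command_t;
typedef enum { S_LOAD, S_LOOP, S_DONE } state_t;  // hypothetical state names

typedef struct datapath {
  int a;  // register a
  int b;  // register b
} datapath_t;

// One clock edge of the datapath: the command picks each register's mux input.
static void datapath_step(datapath_t *dp, command_t cmd, int a_in, int b_in) {
  int a = dp->a;
  int b = dp->b;
  switch (cmd) {
    case RUN1: dp->a = a_in;  dp->b = b_in;  break;  // use input
    case RUN4: dp->a = a - b;                break;  // a uses sub, b keeps previous
    case RUN5: dp->b = b - a;                break;  // b uses sub, a keeps previous
    case NOP:                                break;  // both keep previous
  }
}

// Runs the controller until it reaches S_DONE and returns the result register.
int gcd_fsmd(int a_in, int b_in) {
  datapath_t dp = {0, 0};
  state_t state = S_LOAD;
  while (state != S_DONE) {
    // Flags produced by the datapath from the current register values (assumed definitions).
    int flag_while = (dp.a != dp.b);
    int flag_if = (dp.a > dp.b);
    command_t cmd = NOP;
    // Next-state logic and command selection.
    switch (state) {
      case S_LOAD:
        cmd = RUN1;
        state = S_LOOP;
        break;
      case S_LOOP:
        if (!flag_while) {
          state = S_DONE;
        } else {
          cmd = flag_if ? RUN4 : RUN5;
        }
        break;
      default:
        break;
    }
    datapath_step(&dp, cmd, a_in, b_in);
  }
  return dp.a;
}
```

Each loop iteration stands for one clock cycle: the controller reads the flags, issues one command, and the registers update once.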

Allowing for Multiple Computations or Lines of C Per Clock Cycle

Single Assignment Code

Single Assignment Code GCD

Single Assignment Code

int gcd(int a1, int b1) {
  while (MERGE(__?a1:a2) != MERGE(__?b1:b2)) {
    a3 = MERGE(__?a1:a2);
    b3 = MERGE(__?b1:b2);
    if (a3 > b3)
      a2 = a3-b3;
    else
      b2 = b3-a3;
  }
  return a2;
}

Single-Assignment Code Hardware Implementation

Forward Discussion: Synthesis of Multicycle Operations

Single Cycle, Two Multipliers:

[Figure: RA, RB -> ADD1 (+); ADD1, RC -> ADD2 (+); RD, RD -> MULT1 (×); ADD2, MULT1 -> MULT2 (×); MULT2 -> RQ]


Two Cycle, Two Multipliers, Reduced Critical Path:

[Figure: RA, RB -> ADD1 (+); ADD1, RC -> ADD2 (+); RD, RD -> MULT1 (×) -> RE; ADD2, RE -> MULT2 (×); MULT2 -> RQ]


Three Cycle, Two Multipliers, Reduced Critical Path, Fully Pipelined:

[Figure: RA, RB -> ADD1 (+) -> RG; RC -> RF; RG, RF -> ADD2 (+) -> RH; RD, RD -> MULT1 (×) -> RE; RE, RH -> MULT2 (×) -> RQ]


Three Cycle, One Fast Multiplier, One Slow Multiplier, Reduced Critical Path, Partially Pipelined:

[Figure: RA, RB -> ADD1 (+) -> RG; RG, RC -> ADD2 (+) -> RH; RD, RD -> MULT1 (slow ×) -> RE; RE, RH -> MULT2 (×) -> RQ]


Three Cycle, One Fast Multiplier:

[Figure: RA, RB -> ADD1 (+) -> RG; RC -> RF; RG, RF -> ADD2 (+) -> RH; RD, RH -> MUXA; RD, RE -> MUXB; MUXA, MUXB -> MULT1 (×); MULT1 -> RE and MULT1 -> RQ]
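
The shared-multiplier version can be sketched as a cycle-by-cycle register-transfer simulation. This is an illustration under stated assumptions: every register loads on every clock edge (no enables), a single select signal drives both muxes, and the names regs_t, clock_edge, and compute_q are hypothetical. The computed function, read off the connections above, is Q = ((A + B) + C) × (D × D).

```c
#include <assert.h>

typedef struct regs {
  int RA, RB, RC, RD;  // input registers
  int RE, RF, RG, RH;  // intermediate registers
  int RQ;              // output register
} regs_t;

// One clock edge. sel = 0: both muxes pass RD. sel = 1: MUXA passes RH, MUXB passes RE.
static void clock_edge(regs_t *r, int sel) {
  regs_t next = *r;                    // all updates use the pre-edge values
  int mux_a = sel ? r->RH : r->RD;
  int mux_b = sel ? r->RE : r->RD;
  int product = mux_a * mux_b;         // the single multiplier, MULT1
  next.RG = r->RA + r->RB;             // ADD1
  next.RF = r->RC;                     // delay register for C
  next.RH = r->RG + r->RF;             // ADD2
  next.RE = product;
  next.RQ = product;
  *r = next;
}

int compute_q(int a, int b, int c, int d) {
  regs_t r = {a, b, c, d, 0, 0, 0, 0, 0};
  clock_edge(&r, 0);  // cycle 1: RG = A + B, RF = C, RE = D * D
  clock_edge(&r, 0);  // cycle 2: RH = RG + RF, RE = D * D again
  clock_edge(&r, 1);  // cycle 3: RQ = RH * RE
  return r.RQ;
}
```

The multiplier is used for D × D in the early cycles and for the final product in cycle 3, which is what allows one multiplier to replace two at the cost of the muxes and a fixed three-cycle schedule.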