Challenges & Implications For VLSI Architectures For Multimedia Processing

Vineet Sahula
sahula@ieee.org
Deptt of ECE
Malaviya National Institute of Technology
Jaipur
Outline
• Motivation & challenges
• Choice of architectures
• Tasks in Multimedia processing
• Design optimization approach
– Throughput enhancement
– Power optimization
IETE'05 VLSI Arch. for Multimedia 2

Motivation
• Performance improvement of wireless
systems attributed to recent advances in
– Comm. Technology Standards & networking
– VLSI Technology
• VLSI design technology has evolved due to
– Demands raised by applications viz. mobile
computing & communication (almost converged)

Multimedia Processing
• Multiple medium
– Amalgamation of Data, voice, audio, images, graphics,
speech
• Characteristics
• Desire for high bandwidth, high throughput, and low power
• Mobile applications
• Low cost, very low power, real time

Classification of Multimedia Tasks
• Low level
• Characterized by highly regular sequences of operations & data
accesses
– Implies identical operations on large number of samples with high
potential for data parallelism
• Medium level
• Tasks are link between simple data structures (pixel) and
symbolic information
– Data dependent decisions & lower regularity
• High level
• Operations on symbols & complex objects of variable sizes
– Highly data dependent computation-flow
– Advance prediction not possible

Video Compression
• Throughput (MOPS)
– Motion estimation (~80%)
• Not matched by a GPP
• Needs specific optimizations
– ASICs
• Power consumption profile
– Motion estimation (~70%)
• Requires an integral approach from system level to gate level design

Implementation Architecture
[Figure: energy efficiency (MOPS/mW) versus flexibility for dedicated HW, reconfigurable HW, DSP/ASIPs, and programmable processors]
• Software programmable
• general purpose (GPP)
• Application specific (DSP/ASIP)
• Hardware programmable (CPLD/FPGA)
• Dedicated hardware (ASIC)
Data-Path & Control
• z=(a+b)+(c+d)
• Dedicated HW
– 2 time steps with 2 ALUs
– 1 time step with 3 ALUs
• Control FSM
– 1-hot encoded [HW control]
– Micro-program control [Control memory]
[Figure: data path with inputs a, b, c, d, multiplexers Mx and My, register R with load signal LR, and output z; control FSM with states Start, S1, S2, Stop driving Mx, My, and LR]
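Programmable Processors
• z=(a+b)+(c+d)
• Instruction sequence
– Load R1
– Load R2
– R3←R1+R2
– Load R1
– Load R2
– R4←R1+R2
– R3←R3+R4
• Control: HW or microprogram
• CISC
• RISC
• DSP- Multiply-Accumulate
[Figure: memory, register bank with load signals LR1 to LR4, operand registers Rx and Ry, result register Rz, instruction register (I-Reg), and control unit]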
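The instruction sequence for z=(a+b)+(c+d) can be traced with a toy register machine. This is only an illustrative sketch: the tuple encoding of instructions, the `run` helper, and the choice of which memory operand each Load fetches are assumptions, since the slide lists only the register names.

```python
# Toy load/store machine (hypothetical encoding, for illustration only).
def run(program, memory):
    """Execute ("load", dst, addr) and ("add", dst, src1, src2) instructions."""
    regs = {}
    for op, dst, *src in program:
        if op == "load":
            regs[dst] = memory[src[0]]
        elif op == "add":
            regs[dst] = regs[src[0]] + regs[src[1]]
        else:
            raise ValueError(f"unknown operation: {op}")
    return regs

# Assumed operand mapping: the first pair of loads fetches a, b; the second c, d.
PROGRAM = [
    ("load", "R1", "a"),
    ("load", "R2", "b"),
    ("add", "R3", "R1", "R2"),   # R3 <- R1 + R2
    ("load", "R1", "c"),
    ("load", "R2", "d"),
    ("add", "R4", "R1", "R2"),   # R4 <- R1 + R2
    ("add", "R3", "R3", "R4"),   # R3 <- R3 + R4
]

result = run(PROGRAM, {"a": 1, "b": 2, "c": 3, "d": 4})
print(result["R3"], "computed in", len(PROGRAM), "instructions")
```

The seven sequential instructions contrast with the one or two time steps of the dedicated data path, which is the control and time overhead attributed to processors.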
Architecture Characteristics
• Processors
• Instruction set is fixed/customized
• Algorithm changes adapted through SW rewriting
• Power & computation-time overheads are large
• Reconfigurable HW
• Architecture at logic level is fixed
• Architecture reconfiguration requires interconnection
programming
• Dedicated HW
• HW can’t be reconfigured
• Can be extremely power-efficient and high performance
Dedicated HW
• Suitable for the most compute-intensive low-level tasks
• Dedicated HW: the VLSI implementation with the highest efficiency
• Overhead for control is minimum
• Power consumption can be made low
• Functionality is fixed
• Redesign means new design

Programmable Architecture
• Suitable for High Level Tasks
• Highly flexible for irregular tasks

• Can address larger application domain
• Control overhead is large

• Silicon area is large
• Software development time is an additional overhead
Hybrid Architecture
• Suitable for Medium Level Tasks as well as
HLTs
• Choice of mixing ASIPs/DSP/FPGAs [2,3]
[2] D. Chauhan et al, Hardware Design evaluation for fast motion estimation, B. Tech. Thesis, MNIT Jaipur, 2004
[3] Govind S. and V. Sahula, ASIP Design Space exploration for motion estimation IEEE VDAT 2003
Dedicated HW Implementation
• 2D DCT/IDCT for Video codec
– Matrix multiplication: regular and parallelizable
• Motion Estimation
– Estimate the motion vector (MV) through block matching
• Very regular & parallelizable
– Minimize a distortion metric
– Mean absolute difference (MAD)
– Object based?
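Block matching with the MAD metric can be sketched as an exhaustive search. This is a minimal illustration, not the design evaluated in [2,3]: the function names, the frame layout (lists of pixel rows), and the search range are all assumptions.

```python
def mad(cur, ref, bx, by, dx, dy, n):
    """Mean absolute difference between the n x n block of `cur` at (bx, by)
    and the block of `ref` displaced by (dx, dy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total / (n * n)

def full_search(cur, ref, bx, by, n, search):
    """Try every displacement within +/- `search` pixels and return
    (best motion vector, its MAD)."""
    h, w = len(ref), len(ref[0])
    best_mv, best_cost = None, float("inf")
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            # Skip candidate blocks that fall outside the reference frame.
            if not (0 <= bx + dx and bx + dx + n <= w
                    and 0 <= by + dy and by + dy + n <= h):
                continue
            cost = mad(cur, ref, bx, by, dx, dy, n)
            if cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost

# Synthetic 8x8 frames: `cur` is `ref` shifted by 2 pixels in x and 1 in y.
ref = [[x * x + 10 * y * y for x in range(8)] for y in range(8)]
cur = [[ref[y + 1][x + 2] if y + 1 < 8 and x + 2 < 8 else 0
        for x in range(8)] for y in range(8)]
print(full_search(cur, ref, bx=2, by=2, n=2, search=2))
```

Every candidate displacement repeats the same subtract, absolute-value, and accumulate operations on independent pixels, which is the regularity that makes the task a good fit for dedicated hardware.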
Media Processor Chips
• Philips TriMedia
• Audio/visual, graphics, communication tasks
• VLIW
– 25 FU: ALUs,multipliers, FP units
• Texas Instruments TMS320C6X

• Not exactly MM chip, but a general purpose DSP
• VLIW
– 2 symmetrical data paths
– 4 FU
– Multi-port registers

Media Processor Chips
• AxPe1280V
• Video signal processor
• RISC core
• SIMD/MIMD
– 8 Data paths units
• AT & T AVP4000
• DSP
• 3 ASICs

Throughput Enhancement
• Exploit parallelism in data operations/ computation
• Explicit parallel implementation
– Multiprocessors
• SIMD, MIMD
• Implicit parallel solution
– Pipelined FU
– Pipelined Data-path
• Dedicated HW, processors
– Pipelined Control (Instruction level parallelism)
• Processors only

Terminology
• Critical path delay, TD
– From primary input Ii to primary output Oj
• TD in ns
• Throughput: rate of getting output/sec
– Throughput = number of operations/sec, which can be much higher than 1/TD
[Figure: data path with primary inputs I1 I2 .. Ii … and primary outputs O1 O2 … Oj ..]
Data-path/FU Pipelining
• Un-Pipelined
– Delay TD
– Throughput
• 1/TD
• Linear pipeline of k stages
– Stage delay is TD/k
– Throughput is k/TD
• Non-linear P/L
– Latency is L including register delays
– Delay is L
– Throughput is a complex function of k and L
• Wave-pipelining
– Asynchronous circuit
– No registers
FU Pipelining- FP Adder
• Unpipelined
– Delay kT
– For n data, total delay knT
– Throughput 1/kT
• Pipelined
– Delay of a stage T
– Stages k (=4)
– Throughput < 1/T
• Speed-up
$$SU = \frac{\text{delay-unpipelined}}{\text{delay-pipelined}} = \frac{nk}{n+k-1}, \qquad \lim_{n \to \infty} SU = k$$
$$SU = \frac{\text{throughput-pipelined}}{\text{throughput-unpipelined}}$$
[Figure: four-stage pipeline S1, S2, S3, S4]
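The speed-up of a k-stage pipeline can be checked numerically. The sketch below assumes ideal stages of equal delay T and no register overhead, so the unpipelined time for n data items is nkT and the pipelined time is (k + n - 1)T; the function name is hypothetical.

```python
def pipeline_speedup(n, k):
    """Speed-up of an ideal k-stage linear pipeline over the unpipelined
    unit for n data items. The stage delay T cancels out:
    (n * k * T) / ((k + n - 1) * T)."""
    return (n * k) / (n + k - 1)

# The speed-up approaches k (here 4) only for long data streams.
for n in (1, 4, 100, 10_000):
    print(n, round(pipeline_speedup(n, 4), 3))
```

For a single data item there is no gain at all, because the pipeline must first fill; the benefit appears only when many items stream through.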
Pipelined Circuit

Data Path with FB/FF Connections
Area-pipeline

Pipelining Explorations- Infeasible Alternatives

Pipelining- Feasible Alternatives

Limitations During Parallelizing
• Amdahl's law
$$\frac{1}{SU} = (1 - Fraction_{enhanced}) + \frac{Fraction_{enhanced}}{SU_{enhanced}}$$
1. For x=0.9 (x = Fractionenhanced), SUenhanced=100
1/SU = 0.1 + 0.9/100 = 0.109, so SU ≈ 9.2
2. To achieve SU=80 with 100 parallel resources, what x is feasible?
1/80 = 1 - x + x/100, so 99x/100 = 79/80 and x ≈ 99.75%
Only about 0.25% of the code may be sequential!
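Both worked examples can be recomputed directly from Amdahl's law. This is a small sketch; the function names are hypothetical, and x denotes the enhanced (parallelizable) fraction.

```python
def amdahl_speedup(x, su_enhanced):
    """Overall speed-up when a fraction x of the work is sped up
    by a factor su_enhanced and the rest is unchanged."""
    return 1.0 / ((1.0 - x) + x / su_enhanced)

def required_fraction(su_target, su_enhanced):
    """Solve 1/su_target = (1 - x) + x/su_enhanced for x."""
    return (1.0 - 1.0 / su_target) / (1.0 - 1.0 / su_enhanced)

print(amdahl_speedup(0.9, 100))    # example 1: x = 0.9, 100 resources
print(required_fraction(80, 100))  # example 2: x needed for SU = 80
```

Example 1 evaluates to a speed-up of roughly 9.17, and example 2 needs x of roughly 0.9975, so the sequential share must stay near a quarter of a percent.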

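Power Dissipation in a CMOS Gate - Inverter
• Switching power
– Energy drawn from the supply: $\int V_{CC}\, i(t)\, dt$
– Influenced by supply voltage
[Figure: CMOS inverter with supply VCC driving load capacitor CL, shown for Vin=0 and Vin='1']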

Dynamic Power Dissipation-
Capacitor Path
• During capacitor charging
• P1 gets dissipated in the p-channel transistor (resistive dissipation in Rp)
• P2 gets transferred to CL
• Vout transits 0→1
• During capacitor discharging
• P2 gets dissipated
• Vout transits 1→0
• Thus, over a full cycle, the total power is Pdyn = P1 + P2
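The split between P1 and P2 can be made explicit with the standard first-order RC model of the inverter. The derivation below is a textbook sketch added for illustration; it assumes a full-swing output and ignores short-circuit and leakage currents, and the symbols $E_{supply}$, $E_{C_L}$, and $E_{R_p}$ are introduced here.

```latex
\begin{aligned}
E_{supply} &= \int_0^{\infty} V_{CC}\, i(t)\, dt
            = V_{CC} \int_0^{V_{CC}} C_L\, dV_{out} = C_L V_{CC}^2 \\
E_{C_L}    &= \int_0^{V_{CC}} C_L V_{out}\, dV_{out} = \tfrac{1}{2} C_L V_{CC}^2 \\
E_{R_p}    &= E_{supply} - E_{C_L} = \tfrac{1}{2} C_L V_{CC}^2
\end{aligned}
```

Half of the supplied energy is dissipated in Rp during charging and the other half, stored on CL, is dissipated during discharging. One charge/discharge cycle therefore costs $C_L V_{CC}^2$, and f such cycles per second give $P_{dyn} = C_L V_{CC}^2 f$.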

Dynamic capacitive power
• Formula for dynamic power:
$$P_{dyn} = C_L V_{CC}^2 f$$
• Observations
– Depends on CL
• Fanout number
– Depends on frequency of operation f
– Depends on supply voltage VCC

Reducing Power in a Gate
• Lower the supply voltage!
– Quadratic effect on dynamic power
• Reduce capacitance
– Short interconnect lengths
– Drive small gate load (small gates, small fan-
out)
• Reduce frequency
– Lower clock frequency -> use more parallelism
– Lower signal activity
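The trade between clock frequency, parallelism, and supply voltage can be sketched numerically. Everything specific here is an assumption for illustration: the function names, the capacitance and voltage values, and the 10% capacitance overhead for the extra routing of the parallel version. The relation between frequency and the lowest usable supply voltage is technology dependent and is not modeled.

```python
def dynamic_power(c_load, v_supply, freq):
    """Dynamic power P = C * V^2 * f of a fully switching node."""
    return c_load * v_supply ** 2 * freq

def parallel_power(c_load, v_scaled, freq, n_units, overhead=0.0):
    """Power of n_units parallel copies, each clocked at freq / n_units
    with a reduced supply v_scaled. `overhead` is the assumed fractional
    increase in switched capacitance due to the extra hardware."""
    c_total = n_units * c_load * (1.0 + overhead)
    return dynamic_power(c_total, v_scaled, freq / n_units)

# Assumed numbers: 1 pF, 3.3 V, 100 MHz reference design.
p_ref = dynamic_power(1e-12, 3.3, 100e6)
# Two units at half the clock; the relaxed timing is assumed to allow 2.0 V.
p_par = parallel_power(1e-12, 2.0, 100e6, n_units=2, overhead=0.1)
print(p_ref, p_par, p_par / p_ref)
```

Throughput is unchanged because two units at half the clock produce the same number of results per second; the saving comes entirely from the quadratic dependence on the lowered supply voltage.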

Switching Activity
[Figure: input and output bit patterns of a logic gate driving load CL]
• Out of 4 possible output transitions
– Output transition occurs for two input pattern-pairs (IPP) only
– CL remains connected
– Out of 4 clocks, power dissipating switching is for one clock only
– Average power for 4 clocks
• $\frac{1}{4} C_L f V_{DD}^2$
• In general
– $P = a\, C_L f V_{DD}^2$
– a is switching activity of gate or composite logic-circuit
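The switching activity a can be estimated from a sampled output sequence by counting the 0→1 transitions, which are the ones that draw energy from the supply. This is a minimal sketch; the function names and the sample bit sequence are assumptions.

```python
def switching_activity(outputs):
    """Fraction of clock cycles in which the output makes a 0 -> 1
    transition (the transition that charges CL from the supply)."""
    if len(outputs) < 2:
        return 0.0
    rising = sum(1 for prev, cur in zip(outputs, outputs[1:])
                 if prev == 0 and cur == 1)
    return rising / (len(outputs) - 1)

def average_power(activity, c_load, freq, v_dd):
    """Average dynamic power P = a * CL * f * VDD^2."""
    return activity * c_load * freq * v_dd ** 2

# Assumed output samples over four clock cycles: one rising edge.
a = switching_activity([0, 1, 1, 0, 0])
print(a, average_power(a, c_load=1e-12, freq=100e6, v_dd=2.0))
```

With one charging transition in four clocks the activity is 1/4, matching the average-power case above.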

Low Power VLSI Implementation
• During HW implementation
– Explore possibility of low power solution
• System level power management
– Clock gating
– Dynamic power management
• Behavioral level transformations
– Suited for DSP circuits
– Algorithm level, Filter-structure level, dataflow transformations,
Voltage scaling, clock gating
• Architecture level
– Bit-width reduction for arithmetic operations [4], pre-computation based architectures, clock gating
[4] G. Singh, Low power Floating Point Arithmetic circuits, M. tech. Thesis, MNIT, 2003

Low Power HW…
• Logic Level
– FSM synthesis
• State assignment for low power, minimum Hamming distance to minimize
switching
– Low power technology mapping
• Circuit Level
– Apply low power data-sequence to a logic gate [5]
– Transistor sizing for low power and high performance
[5] P. Jain, V. Sahula, Low power IPP characterization for small digital circuits, IEEE VDAT, 2002

Low Power Software
• Sources of dissipation
– Buses, memory, control & clock distribution
scheme
• Power optimization
– Match algorithm to architectural resources
– Minimize memory accesses
– Proper sequencing of data transfer on bus, to
reduce bus switching
– Instruction reordering
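The effect of sequencing on bus switching can be illustrated by counting bit toggles between consecutive bus words. This is a sketch under stated assumptions: the helper names are hypothetical, the words are assumed free to be sent in any order (no data dependencies), and the brute-force search is only practical for a handful of words.

```python
from itertools import permutations

def hamming(a, b):
    """Number of bus lines that toggle between words a and b."""
    return bin(a ^ b).count("1")

def bus_transitions(words):
    """Total bit toggles when `words` are sent in the given order."""
    return sum(hamming(a, b) for a, b in zip(words, words[1:]))

def best_order(words):
    """Exhaustively find an order with the fewest toggles."""
    return min(permutations(words), key=bus_transitions)

words = [0b0000, 0b1111, 0b0001, 0b1110]  # assumed 4-bit transfers
print(bus_transitions(words), bus_transitions(best_order(words)))
```

For these four words the given order costs 11 toggles while the best order costs 5, since each toggle charges or discharges bus capacitance.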

Thanks.
