Data Driven Clock Gating: Bar Ilan University School of Engineering Vlsi Lab
School of Engineering
VLSI Lab
Data Driven
Clock Gating
Academic Advisor: Prof. Shmuel Wimer
Instructor: Mr. Moshe Doron
Industry correspondent: Mr. Roey Mioni
Dov Gropper
Dvir Shasha
Final Fourth Year Project
Computer Engineering
Table of Contents
Main Project Goals
Motivation
Theory
Design Flow
    Design:
    Simulation environment:
    Iterative Perfect Matching Algorithm (IPM):
    Clock gating Implementation:
Hardware and Design Components
Problems and Solutions
Direct Memory Access Controller
    Behavior
    System level
    The block diagram of the DMA controller's state machine:
    The Design:
    Top design, with Verification diagram:
Results
    The SpyGlass Results:
    Result review:
Conclusions
References and Sources
Appendices
    DMAC Spec:
Main Project Goals
Data Driven Clock Gating is a research study by Professor Shmuel Wimer. Its main
purpose is to reduce the power consumption of electronic circuits.
Our project implements the technique described in Professor Wimer’s research on a
design in register transfer level (RTL).
Motivation
The increasing demand for low power mobile computing and consumer electronics
products has refocused VLSI design in the last two decades on lowering power and
increasing energy efficiency. Power reduction is treated at all design levels of VLSI chips,
from the architecture through block and logic levels, down to gate level, circuit and
physical implementation. One of the major dynamic power consumers is the system clock
signal, typically responsible for up to 50% of the total dynamic power consumption. Clock
network design is a delicate procedure, and is therefore done in a very conservative
manner under worst case assumptions. It incorporates many diverse aspects such as
selection of sequential elements, controlling the clock skew, and deciding on the topology
and physical implementation of the clock distribution network.
Theory
Clock gating
Several techniques to reduce the dynamic power have been developed, of which clock
gating is predominant. Ordinarily, when a logic unit is clocked, its underlying sequential
elements receive the clock signal regardless of whether or not they will toggle in the next
cycle.
Clock enabling signals are usually introduced by designers during the system and clock
design phases, where the inter-dependencies of the various functions are well
understood. In contrast, it is very difficult to define such signals at the gate level,
especially in control logic, since the inter-dependencies among the states of various
flip-flops depend on automatically synthesized logic. There is a big gap between block
disabling that is driven from the HDL definitions, and what can be achieved with data
knowledge regarding the activities of the flip-flops and how they are correlated with each other.
The research presents an approach to maximize clock disabling at the gate level, where
the clock signal driving a flip-flop is disabled (gated) when the state of the flip-flop is not subject
to a change in the next clock cycle.
Clock gating does not come for free. Extra logic and interconnects are required to
generate the clock enabling signals, and the resulting area and power overhead must be
considered. In the extreme case, each clock input of a flip-flop can be disabled
individually, yielding maximum clock separation. This, however, results in high overhead.
Thus, the clock disabling circuit is shared by a group of several flip-flops in an attempt to
reduce the overhead.
On the other hand, such grouping may lower the disabling effectiveness, since the clock
will be disabled only when the inputs to all the flip-flops in a group do not change. It is,
therefore, beneficial to group flip-flops whose switching activities are highly correlated in
order to derive a joint enabling signal.
This requires gathering statistical information about the flip-flops using simulations and
statistical analysis.
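The effect of correlation on grouping can be illustrated with a minimal Python sketch. The toggle vectors below are hypothetical (they are not taken from the project's simulations), and the two helper functions are illustrative names, not part of the design flow's scripts. A 1 in a toggle vector means the flip-flop changes state in that cycle.

```python
def agreement(a, b):
    """Fraction of cycles in which two flip-flops behave the same (both toggle or both idle)."""
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)

def joint_idle_fraction(toggle_vectors):
    """Fraction of cycles in which no flip-flop of the group toggles,
    i.e. the cycles in which a shared gater could disable the clock."""
    cycles = len(toggle_vectors[0])
    idle = sum(1 for t in range(cycles)
               if not any(vec[t] for vec in toggle_vectors))
    return idle / cycles

# Hypothetical toggle vectors over 8 clock cycles.
ff_a = [1, 0, 0, 1, 0, 0, 0, 1]
ff_b = [1, 0, 0, 1, 0, 0, 0, 0]   # toggles almost always together with ff_a
ff_c = [0, 1, 1, 0, 1, 0, 1, 0]   # toggles mostly when ff_a is idle

print(agreement(ff_a, ff_b), joint_idle_fraction([ff_a, ff_b]))  # 0.875 0.625
print(agreement(ff_a, ff_c), joint_idle_fraction([ff_a, ff_c]))  # 0.125 0.125
```

In this toy data, pairing the correlated flip-flops leaves the shared clock idle in 5 of 8 cycles, while pairing the uncorrelated ones leaves it idle in only 1 of 8.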
Another issue that influences the effectiveness of this suggested technique is the fan-out
of the gater. The theory presents a formula for calculating the optimal fan-out of the gater,
referred to as k:
The graph above shows the normalized net power savings per flip-flop obtained by
adaptive gating at the first level of the clock tree, according to the equation above. The saving
is compared to the non-gated situation. The optimal fan-out is marked for each toggling probability.
Using the statistical information gathered and the optimal fan-out, we could form groups
of matching flip-flops for the clock gating.
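The trade-off behind the optimal fan-out can be sketched with a deliberately simplified toy model. This is not the formula from the referenced research: it assumes the flip-flops of a group toggle independently with the same probability p, and the cost constants c_ff and c_gater are hypothetical normalized values chosen only for illustration.

```python
def all_idle_probability(p, k):
    """Probability that none of k independent flip-flops toggles in a given cycle."""
    return (1.0 - p) ** k

def toy_net_saving(p, k, c_ff=1.0, c_gater=1.0):
    """Toy per-flip-flop net saving: clock power avoided while the group is idle,
    minus the cost of one gater shared by k flip-flops.
    c_ff and c_gater are hypothetical, normalized constants."""
    return all_idle_probability(p, k) * c_ff - c_gater / k

def best_fan_out(p, candidates=range(1, 65)):
    """Fan-out with the largest toy net saving among the candidate values."""
    return max(candidates, key=lambda k: toy_net_saving(p, k))

for p in (0.01, 0.05, 0.1):
    print(p, best_fan_out(p))
```

Even in this toy model the qualitative behavior is visible: a larger group amortizes the gater overhead better but is idle less often, so the preferred fan-out shrinks as the toggling probability grows.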
Design Flow
Design:
The design flow begins with a design in RTL. It is important to begin with a design that
has been proven to work properly. The design must not include any IP (intellectual
property) blocks or RTL sources that are not visible to the user and therefore cannot be edited.
At this point, the design flow supports implementation for a single clock domain only.
Moreover, the sequential and combinational logic must be separated in the RTL in order
for the scripts to run properly.
Simulation environment:
In this stage, simulations on the RTL are performed and statistical information is gathered
for analysis. The simulation must represent a typical use of the design, so that the
statistical information obtained is realistic. Currently, only one simulation per design is supported.
The simulation runs with Cadence's SimVision.
The Simulation environment steps:
● Add tracing code to the design. This is done by running the program ftrc.exe.
To run this program, the user must first modify the file inputs.rti. In the file, the user
sets the following attributes:
○ The name of the design files list, including extension (*.vc).
○ Whether to get the output design as one file or as
multiple files.
● Add tracing code to the test-bench manually. The code must be added before the
DUT instantiation.
● At this point, the simulation can run.
● The simulation outputs two files:
○ Activities.rpt - the report file contains the active flip-flops at each point in time, in
milliseconds.
○ FF_lists.rpt - the report file contains a list of the flip-flops in the design.
Clock gating Implementation:
This stage uses an executable file, “rcg.exe”, that creates a clock gate for each group of FFs
received from the IPM, thus completing the process.
These are the necessary steps:
● Modify the “inputs.rcg” file with the following:
○ The name of the design files list (*.vc).
○ The name of the folder containing the FF-lists for grouping.
○ Specify whether to get the output design as one-file or multiple.
● Run “rcg.exe”.
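The behavior of one gated group can be sketched in Python as follows. This is a behavioral illustration only, not the code generated by rcg.exe: it assumes the enable signal is asserted whenever the input of at least one flip-flop in the group differs from its current state, as described in the Theory chapter. The function name and the input sequences are hypothetical.

```python
def simulate_group(d_sequences, initial_q=None):
    """Simulate one gated group of flip-flops.

    d_sequences[i][t] is the D input of flip-flop i at cycle t.
    Returns (final Q values, number of cycles in which the clock was enabled).
    """
    n = len(d_sequences)
    q = list(initial_q) if initial_q is not None else [0] * n
    enabled_cycles = 0
    for t in range(len(d_sequences[0])):
        d = [seq[t] for seq in d_sequences]
        # Enable = OR over the group of (D XOR Q): some flip-flop is about to change.
        enable = any(di ^ qi for di, qi in zip(d, q))
        if enable:
            enabled_cycles += 1
            q = d  # the clock pulse reaches the whole group, so every flip-flop samples D
        # Otherwise the clock is gated; Q is held, and it already equals D.
    return q, enabled_cycles

# Hypothetical D inputs of two flip-flops over six cycles.
d_inputs = [[0, 1, 1, 1, 0, 0],
            [0, 0, 0, 1, 1, 1]]
final_q, enabled = simulate_group(d_inputs)
print(final_q, enabled)  # [0, 1] 3
```

The final state equals the last D values, exactly as it would without gating, while the clock reaches the group in only 3 of the 6 cycles.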
Hardware and Design Components
Direct Memory Access Controller in RTL design:
We implemented our design flow on a Direct Memory Access Controller in RTL.
The design does not include any IP (intellectual property) blocks and was developed by
Mr. Moshe Doron in Bar-Ilan’s VLSI lab.
The design has a single clock domain and, after editing, its sequential and combinational
logic are separated.
Xilinx's ISE:
ISE is a software tool produced by Xilinx that we used for synthesis and analysis of our
HDL design. In addition, after we received a gate level design from ISE’s synthesis, we
burned the design onto the ML605 FPGA board, similar to our usage of Altera’s
Quartus II.
Cadence's SimVision:
SimVision is the waveform viewer in the Cadence EDA suite. We mainly used it for
behavioral design verification before FPGA implementation. Due to the RTL changes
made, it was necessary to verify our design before applying our design flow of Data
Driven Clock-Gating. After the implementation, we used it to make sure the design still
had exactly the same behavior. We used a test bench with two “OK” signals that indicated
proper behavior of the design. It was crucial to use the same test bench on both
designs, so that we could be certain that the two designs really had the same
behavior.
Problems and Solutions
The evil design problem:
We started our project with the Animation Graphic Engine (AGE) RTL design.
The design used Altera's tools (IP) to implement some elements.
This code could not be edited or viewed. Trying to design these components ourselves
failed due to a shortage of memory resources on the FPGA board. In other words, Quartus
synthesized the design with our components and the output was too large to be
implemented on any of the FPGA boards in the VLSI lab’s possession.
The solution we chose was to switch to an alternative RTL design, the DMA Controller, which
uses no IP.
The problem was solved by moving the design flow to RTL, and allowing the Altera
Quartus to synthesize the design without these limitations.
In addition, this change also shortened the runtime of the entire design flow.
Sequential logic - Like combinational logic circuits, a sequential logic circuit has inputs
and outputs. However, the output depends on the state of an FSM as well as on the inputs.
Furthermore, it contains a clock.
An example of sequential logic:
Direct Memory Access Controller
Background
As mentioned before, the DUT was changed from the AGE to the DMAC. This meant we
had to become familiar with the DMAC logic and behavior, since we
needed to create test-benches for it.
Behavior
The DMAC is an integral part of the vendor-specific Graphics-On-Key (GOK) USB2.0
Device. The Device is dedicated to USB Communication Channel. It has the potential of
being integrated into the Protocol Engine (PE) Device. The DMAC function, within the
GOK Device, is to transfer data between the USB2.0 Protocol Engine Receive/Transmit
(RX/TX) Packet Buffers and the Device Animation Graphics Engine (AGE) Function
Endpoints, in response to PE service requests. The DMAC is the only Bus Master in the
system. It is pre-configured to perform the required data transfers to and from the AGE
Application Function Core. The DMAC is capable of performing word gather-scatter and
supports a system data bus width of up to 48 bits (6 bytes) and an address bus of up to 24 bits
(16 Mbyte address range).
Flyby and gather-scatter data transfer modes are supported, but memory-to-memory
transfers are not.
System level
The USB 2.0 Device DMAC is pre-programmed (ROM), to perform the required data
transfers to and from the AGE Application Function Core. The DMAC Configuration
Memory contains the necessary information to access any Endpoint Buffer (Memory
or Register Files), in the AGE Core.
The PE issues a Transaction Request command signal and a Packet Transfer
Request signal to the DMAC, for a specific AGE Endpoint. The DMAC responds with
an Acknowledge signal to the PE and starts data transfer transactions between PE
Packet Buffers and EP Buffers – Registers or Memory, over the system bus by
issuing Endpoint Buffer Address, Read and Write control signals, while monitoring
AGE Wait signal (for slow Memories). Data transfers are performed in either single
bus cycle 16bit words data transfer (flyby mode) or in multiple bus cycles (gather-
scatter mode), to match different source and destination bus widths. In both single
packet and multiple packets data transfers, terminating specific EP Input Transaction
(from EP to Host), is done by the DMAC monitoring the End-Of-Transaction signal,
issued by the Function (last EP Buffer address reached). In case of Output
Transaction (from Host to EP), if last packet size is smaller than the predefined EP
MaxPacketSize or packet having data size = 0 (zero), the PE de-asserts its Transfer
Request signal. In case of multiple packets data transfers, only the Packet Transfer
Request signal is de-asserted and the DMAC will carry on with next packet data
transfers as soon as the Packet Transfer Request signal is asserted again. When both
Transfer Request and Packet Request are de-asserted, the DMAC returns to its idle
state and is ready to perform the next transfer request.
The DMAC accesses the PE’s RX/TX Buffers (FIFOs) as I/O devices, using dedicated
PE read/write signals. Data is transferred over the system data bus as 16-bit words.
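The request handshake described above can be summarized in a small Python sketch. It is a strong simplification and not the DMAC's actual state machine: the three state names are hypothetical, the Wait and End-Of-Transaction signals are ignored, and the request inputs are modeled as booleans that are True when the (active-low) signal is asserted.

```python
def next_state(state, treq, preq):
    """One step of a simplified request-handling model.

    treq: Transaction Request is asserted.
    preq: Packet Transfer Request is asserted.
    States (hypothetical names): IDLE, ACK (transaction acknowledged,
    waiting for a packet request) and TRANSFER (moving packet data).
    """
    if not treq and not preq:
        return "IDLE"                      # both requests de-asserted
    if state == "IDLE":
        return "ACK" if treq else "IDLE"   # acknowledge a new transaction
    return "TRANSFER" if preq else "ACK"   # carry on when a packet is requested

# Hypothetical two-packet transaction.
requests = [(False, False), (True, False), (True, True), (True, True),
            (True, False), (True, True), (False, False)]
state, trace = "IDLE", []
for treq, preq in requests:
    state = next_state(state, treq, preq)
    trace.append(state)
print(trace)
```

The trace passes through ACK between the two packets and returns to IDLE only when both requests are de-asserted, which mirrors the multiple-packet behavior described in the text.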
The block diagram of the DMA controller's state machine:
Notice that the flow splits left and right for the two directions: Rx path, and Tx path. Inside
each direction there are more splits, for different data sizes.
The Design:
This is a block diagram of the DMAC design. It consists of a Data Interface Unit
(DIU), a Finite State Machine unit and a Configuration ROM. On the left is the interface with
the protocol engine. On the right is the interface with the System Bus and the function.
The next stage was to create an environment that would allow us to visually verify the
design on an FPGA board.
Top design, with Verification diagram:
This block diagram represents the design that was implemented on the FPGA board.
In addition to the DMA controller, we used a stimuli ROM driven by an 8-bit counter to
emulate data coming from the Protocol Engine or from the AGE.
To confirm the correctness of the data transfer, we used a 77-bit comparator and a monitor
ROM. The comparator compared the transferred data to the expected result stored in the
monitor ROM and indicated, using two LEDs, whether the data was transferred correctly and
whether the DMA control signals were in the correct state.
The components:
● 8-bit counter: a regular 8-bit counter. On each clock cycle the count is increased by 1. The
output returns to 0x00 after reaching 0xFF or on reset.
● Stimuli ROM: a Read Only Memory component that contains the data that will be
pushed in the inputs of the DMAC. It is made of 57 bit words. It receives the
address from the 8-bit counter as an input.
● Monitor ROM: a Read Only Memory component that contains the data that should
be the output of the DMAC according to the input address. It receives the address
from the 8-bit counter as an input.
● 77-bit Comparator: A unit that compares the expected data (from monitor ROM)
to the collected data (from the DMAC). It splits the comparison into two: data, and
control signals. If the expected and collected are identical - both LEDs should be
on.
And so, if both LEDs are on while the design is running, the design is working properly. It is
important to note that during reset, only the data OK LED will be on.
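The checking scheme can be mimicked in software with a short Python sketch. It is only an illustration: toy_dut is a hypothetical stand-in for the DMAC, the ROM contents are made up, and each monitor ROM entry is modeled as a (data, control) pair without reproducing the real 57-bit and 77-bit word formats.

```python
def run_check(stimuli_rom, monitor_rom, dut):
    """Step an 8-bit counter through both ROMs and compare the DUT output
    with the expected values. Returns one (data_ok, control_ok) pair per address."""
    results = []
    count = 0
    for _ in range(256):
        data, ctrl = dut(stimuli_rom[count])
        exp_data, exp_ctrl = monitor_rom[count]
        results.append((data == exp_data, ctrl == exp_ctrl))
        count = (count + 1) & 0xFF   # wraps from 0xFF back to 0x00
    return results

def toy_dut(word):
    """Hypothetical stand-in for the DMAC: splits a word into data and control parts."""
    return word & 0xFF, word >> 8

stimuli = [3 * i for i in range(256)]        # made-up stimuli ROM contents
monitor = [toy_dut(w) for w in stimuli]      # expected outputs
all_ok = all(d and c for d, c in run_check(stimuli, monitor, toy_dut))
print(all_ok)                                # True: both "LEDs" stay on

monitor[5] = (monitor[5][0], monitor[5][1] ^ 1)   # inject a control mismatch
faulty = run_check(stimuli, monitor, toy_dut)[5]
print(faulty)                                # (True, False)
```

Splitting the comparison into a data flag and a control flag, as the comparator does, shows at a glance which kind of mismatch occurred.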
After debugging the test bench, we achieved two working designs: with and without
Data Driven Clock Gating.
Results
In parallel to our work this year, our flow was run on designs at CEVA.
The VLSI department at CEVA already used clock gating in their design flow.
Their gaters are based on control signals. That means that if the entire clock domain is
not functioning at a given time, the clock signal is blocked and is not forwarded to the
specific clock domain.
The clock gaters we suggest in the design flow are based on data and statistical
information.
The data driven clock gaters were added to the design in addition to the control driven
clock gaters. This limited the achievable power reduction, because the
design had already been optimized for power.
To prove the potential of the design flow, an activity test was made on the DUT. In this test,
flip-flops that did not need a clock signal were sampled:
The table above shows that almost 98% of the flip-flops are active only 0-5% of the entire
test. This means that there is potential for saving power by implementing the technique on
the design. However, that is not enough to ensure that saving is possible. It is also
necessary to show that many flip-flops have high correlation between their clock-toggle
vectors, in order to gate them together. The following graph shows just that:
The X-axis is the correlation percentage. The Y-axis is the number of flip-flops with the
corresponding correlation percentage. As can be seen, there is a very small percentage of
flip-flops with low correlation, and a very large percentage of flip-flops with high
correlation.
Now we can soundly predict high power saving potential.
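The kind of activity statistic used in this test can be computed as in the following Python sketch. The toggle vectors are hypothetical and are not the CEVA data; the function names are illustrative only.

```python
def activity_percent(toggle_vector):
    """Percentage of cycles in which the flip-flop toggles."""
    return 100.0 * sum(toggle_vector) / len(toggle_vector)

def share_below(toggle_vectors, threshold_percent=5.0):
    """Percentage of flip-flops whose activity is at most threshold_percent."""
    low = sum(1 for v in toggle_vectors
              if activity_percent(v) <= threshold_percent)
    return 100.0 * low / len(toggle_vectors)

# Hypothetical design: three nearly idle flip-flops and one busy one, over 20 cycles.
quiet = [1] + [0] * 19          # toggles in 1 of 20 cycles (5%)
busy = [1, 0] * 10              # toggles in 10 of 20 cycles (50%)
vectors = [quiet, quiet, quiet, busy]

print(activity_percent(quiet))  # 5.0
print(share_below(vectors))     # 75.0
```

A high share of low-activity flip-flops indicates gating potential, but as noted above it has to be combined with a correlation measurement before groups are formed.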
After implementing the entire design flow on three different designs and measuring power with
the SpyGlass tool, the results received at CEVA were:
The tables below show the detailed results received with SpyGlass on Design C:
Memory Power: 0W 0W 0W 0W
IO PAD Power: 0W 0W 0W 0W
Above is the power measurement report that was derived from analysis of the golden
design. This means that no data-driven clock gating was performed on the design.
The next table shows the main power consumption data according to a given k. This
means that the data-driven clock gating process ran, and a separate design was created
for each gater fan-in size.
k Leakage Internal Switching Total
The following table shows the power consumption data for a fan-in of k=16, with a few
variations that were done outside of the design flow.
K=16 (with few variations)   Leakage   Internal   Switching   Total
Memory Power:                0W        0W         0W          0W
IO PAD Power:                0W        0W         0W          0W
This design was 22% more efficient than the original golden design above.
Result review:
● It can be seen that with k=16 the power saving is maximal. Also, when the
fan-in is too small, as in k=4, the power increases.
● The combinational power increases with Data Driven Clock Gating as a result of
the extra logic components, the gaters. But the sequential power and the clock
power decrease more significantly because of the clock disabling.
● Although the design already had control driven clock gating, the activity test shows
that there is still room to save power, because the activity of 98% of the flip-flops
was low and the correlation between most of them was high.
Conclusions
The results that have been shown in the last chapter have proven beyond doubt that
Professor Shmuel Wimer’s research “Data-Driven Clock Gating” is a practical and
efficient power reduction tool. The design flow that was developed during this project
made the research a practical tool that could transform a given RTL design into a more
energy efficient one.
● The ability to work at the RTL level saved a lot of design flow runtime and made
it more effective. This becomes more relevant, and even crucial,
when implementing the design flow on a large design, due to the exponential
growth of runtime in every stage of the design flow.
● We added overhead to the design in the form of logic components, the gaters.
The ability to group a number of flip-flops together, using statistical knowledge as
a tool, was the main power saving element. Both of these aspects appear in the
results, as an increase and a decrease of power in the final design, respectively.
● Even when a design has clock gaters driven by control the Data-Driven Clock
Gating proves effective. The fact that most of the Flip Flops were not active in most
of the run time, and the high correlation between most of them made it possible to
decrease power despite the control driven gaters.
● There is still room for improvement in the versatility and user friendliness of the
scripts and the design flow. The limitations of the scripts create a need to
change the design, because the scripts can’t handle a design in which
combinational and sequential logic are mixed. In addition, the scripts won’t
work on a design that has a synchronous reset. The code addition to the
test-bench necessary for the tracing stage should be done as part of the flow (by one of
the programs) and not manually. It would be ideal to create a main program with a
graphical user interface (GUI) that would combine the entire design flow. That way, the flow
would be easier to run and more user friendly.
● There is still a need to achieve results in ASIC to confirm the efficiency of
implementing Data-Driven Clock Gating.
● A good simulation that mimics real application use of the design
has a significant influence on the effectiveness of the design flow. This is because
the technique is based on statistics and correlation: the more realistic
the simulation, the more accurate the statistical results.
● Our attempts to measure the power consumption on the FPGA boards were not
successful. The reason was that the board has a very high static power
consumption level, due to all its BRAMs and LUTs. Even after replicating the
design 100 times and measuring the power consumption with ISE ChipScope
using the built-in 0.005 ohm series resistor, the power difference was not apparent.
That is probably the reason FPGA boards are used in the industry in order to
check design integrity of low power devices, while the actual devices are
manufactured as ASICs.
References and Sources
1. The Optimal Fan-Out of Clock Network for Power Minimization by Adaptive Gating
– By Shmuel Wimer and Israel Koren.
2. Optimal Flip-Flop Grouping in Data-Driven Clock Gating for Maximal Power Saving
– By Shmuel Wimer and Israel Koren.
Appendix A
DMAC Spec:
USB2.0 aware DMAC Specification
1. Introduction
The document defines a USB2.0 protocol-aware Direct Memory Access Controller (DMAC)
Device.
The DMAC is an integral part of the vendor-specific Graphics-On-Key (GOK) USB2.0
Device.
The Device is dedicated to USB Communication Channel. It has the potential of being
integrated into the Protocol Engine (PE) Device. The DMAC function, within the GOK
Device, is to transfer data between the USB2.0 Protocol Engine Receive/Transmit (RX/TX)
Packet Buffers and the Device Animation Graphics Engine (AGE) Function Endpoints, in
response to PE service requests. The DMAC is the only Bus Master in the system. It is pre-
configured to perform the required data transfers to and from the AGE Application Function
Core. The DMAC is capable of performing word gather-scatter and supports a system data bus
width of up to 48 bits (6 bytes) and an address bus of up to 24 bits (16 Mbyte address range).
Flyby and gather-scatter data transfer modes are supported, but memory-to-memory transfers
are not.
System Bus
data transfers, only the Packet Transfer Request signal is de-asserted and the DMAC will carry
on with next packet data transfers as soon as the Packet Transfer Request signal is asserted again.
When both Transfer Request and Packet Request are de-asserted, the DMAC returns to its idle
state and is ready to perform the next transfer request.
The DMAC accesses the PE’s RX/TX Buffers (FIFOs) as I/O devices, using dedicated PE
read/write signals. Data is transferred over the system data bus as 16-bit words.
USB2.0-aware DMAC
DMAC's three main modules are the Control Core (FSM), the Configuration ROM and the Data
Interface Unit (DIU).
2.1. DMAC Top Level Introduction
[Fig. 2 - DMAC Block Diagram: Control Core (FSM), DIU and Configuration ROM, with the
Protocol Engine and System Bus interfaces]
2.2. DMAC Modules
The DMAC is partitioned into modules as shown in Fig. 2 Block Diagram and described
below.
2.2.1. Configuration ROM
The Configuration ROM contains the essential information necessary to access any
pre-defined Application Function Endpoint Buffer (Memory or Register Files). The
Configuration information enables the DMAC to properly carry out the data
transactions requested by the PE. Since the PE supplies, at transaction request time, the
Endpoint's transfer direction (IN/OUT) and the Endpoint number (1-15), the specific
Endpoint Buffer can be selected; the EP Buffer data width (DW), however, must reside within
the Configuration ROM.
packets), address counter is cleared. Address counter increment or reset at transaction
completion is performed under the FSM as well as DIU's Gather-Scatter Registers
read & write. When the PE issues a transaction request signal, the DMAC responds with an
Acknowledge signal to the PE, and when the PE issues a packet transfer request, the
DMAC starts transferring data as requested. The Control Core issues the required control
signals for both the EP Buffer and the PE RX/TX FIFOs, in the correct sequence, to
perform either a flyby or a gather-scatter data transfer operation (it issues Read/Write
control signals, monitors the Wait signal and increments the address counter) for as long as
the data transfer is carried on.
When the last data byte has been received or sent from/to the PE Packet Buffer, the
PE negates the DMA Request signals.
Register:   R4      R3      R2      R1
Bits:       47-32   31-24   23-16   15-0
Gather-Scatter operation:
- IN (from EP to PE TX FIFO)
32bit: Read 32bit word into R1-R2-R3. Write 2 16bit words from R1 & R2+R3.
24bit: Read 2 24bit words into R1+R2 & R3+R4. Write 3 16bit words from R1,
R2+R3 & R4.
48bit: Read 48bit word into R1-R2-R3-R4. Write 3 16bit words from R1, R2+R3 &
R4.
- OUT (from PE RX FIFO to EP)
32bit: Read 2 16bit words into R1 & R2+R3. Write 32bit word from R1-R2-R3.
24bit: Read 3 16bit words into R1-R2-R3-R4. Write 2 24bit words from R1+R2 &
R3+R4.
48bit: Read 3 16bit words into R1-R2-R3-R4. Write 48bit word from R1-R2-R3-R4.
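The width conversion performed by the gather-scatter registers can be sketched in Python. This is an illustration under one assumption that the specification above does not state explicitly: the first word read is placed in the lowest bits (R1) and the following words in successively higher bits. The function names are hypothetical.

```python
def gather(words, width):
    """Pack words of `width` bits into one wide value.
    Assumption: the first word lands in the lowest bits (R1)."""
    mask = (1 << width) - 1
    value = 0
    for i, w in enumerate(words):
        value |= (w & mask) << (width * i)
    return value

def scatter(value, width, count):
    """Split a wide value into `count` words of `width` bits, lowest bits first."""
    mask = (1 << width) - 1
    return [(value >> (width * i)) & mask for i in range(count)]

# OUT, 24-bit endpoint: read three 16-bit words, write two 24-bit words.
registers = gather([0x1234, 0x5678, 0x9ABC], 16)
out_words = scatter(registers, 24, 2)
print([hex(w) for w in out_words])   # ['0x781234', '0x9abc56']

# IN, 24-bit endpoint: read two 24-bit words, write three 16-bit words.
in_words = scatter(gather(out_words, 24), 16, 3)
print([hex(w) for w in in_words])    # ['0x1234', '0x5678', '0x9abc']
```

The same two helpers cover the 32-bit and 48-bit cases by changing the widths and word counts.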
2.3. Interfaces
Signal Name   Signal Type      Description
dbus[47:0]    Bi-directional   Data Bus. These pins serve as input and output System data bus (for local µC, PE Packet Buffers and Application Buffers).
abus[23:0]    Bi-directional   Address Bus. Serves as System Address Bus for the DMAC. 16 LSBs are used by the µC to access the Control Registers.
nrd           Bi-directional   System Read signal issued by Bus Masters (DMAC or µC)
nwr           Bi-directional   System Write signal issued by Bus Masters (DMAC or µC)
npbrd         Out              Read signal for PE during data transfers. Active low.
npbwr         Out              Write signal for PE during data transfers. Active low.
nwait         In-Active low    Used to extend bus cycle for slow Application Memories.
ndack         In-Active low    DMAC Acknowledge to PE Transfer Request.
ntreq         In-Active low    DMA Transaction Request signal from Protocol Engine.
npreq         In-Active low    DMA Packet Request signal from Protocol Engine.
neot          In-Active low    End-Of-Transaction signal, issued by the Function
epn[3:0]      In-Active Hi     Endpoint number (1-15) for requested data transfer.
ep_dir        In-Active Hi     Endpoint Direction IN (1) or OUT (0) for requested data transfer.
clk           Input            Oscillator input. Connected to an External Oscillator.
nrst          In-Active low    Reset. External asynchronous static reset.
Vcc           Input            Internal Power Source (derived from USB V+, via LDO)
Vss           Input            Internal Power Ground (derived from USB V-, via LDO)
Note: The DMAC uses the Endpoint number (epn[3:0]) and Transaction direction (ep_dir) as
the internal ROM address, to perform the expected data transfer to/from the specific End Point
at the Function Core. They serve as chip selects for the Buffers within the Function Cores.
DMAC also issues nrd/nwr, npbrd/npbwr signals and current EP Buffer address, to handle
data transfer. Control signals npbrd or npbwr are used by the PE to drive RX FIFO output
data onto the system data bus or to latch the data from the system bus to the TX FIFO,
depending on transfer direction.
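How an endpoint number and a direction bit can form a ROM address is sketched below in Python. The bit order is an assumption made for illustration only (direction as the most significant bit, endpoint number in the four low bits); the specification states only that both signals together serve as the internal ROM address. The function name is hypothetical.

```python
def config_rom_address(epn, ep_dir):
    """Form a 5-bit Configuration ROM address from epn[3:0] and ep_dir.
    Assumed layout (not stated in the spec): ep_dir is bit 4, epn is bits 3..0."""
    if not 1 <= epn <= 15:
        raise ValueError("endpoint number must be in the range 1-15")
    if ep_dir not in (0, 1):
        raise ValueError("ep_dir must be 0 (OUT) or 1 (IN)")
    return (ep_dir << 4) | epn

print(hex(config_rom_address(3, 1)))    # 0x13: endpoint 3, IN
print(hex(config_rom_address(15, 0)))   # 0xf: endpoint 15, OUT
```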
| EP8 | EP7 | EP6 | EP5 | EP4 | EP3 | EP2 | EP1 |
D15 D14 D13 D12 D11 D10 D9 D8 D7 D6 D5 D4 D3 D2 D1 D0
| EP15 | EP14 | EP13 | EP12 | EP11 | EP10 | EP9 |
3. Implementation
The DMAC is designed as a front end for a near-future ASIC implementation. It is designed
in Verilog HDL and simulated and logically verified for correct operation using the Cadence
Incisive Simulator.
An intermediate hardware implementation, for proof of concept and correct functionality, is
performed using the FPGA device located on the Altera DE2 Development Board, under the
Quartus II Development Environment. The Verilog code logically verified with Incisive is used for
implementation.
The Quartus II MegaFunction Wizard is not used.
There is an option to incorporate the Protocol-Aware DMAC into the USB2.0 Protocol Engine.
4. USB2.0 Device System Diagram
[Figure: USB2.0 Device system diagram. Blocks: USB Connector, USB2.0 PHY (UTMI), USB2.0
Protocol Engine, USB2.0-Aware DMAC, XTAL Oscillator and Function Core. Signals: abus[23:0],
dbus[47:0], nrd, nwr, npbrd, npbwr, ntreq, npreq, ndack, neot, nwait, ep_dir, epn[3:0], clk,
nrst, +Vcc, -Vss.]
these transfers are very efficient; however, memory-to-memory transfers are not
possible in this mode.
4.1.2. Gather-Scatter DMA Transfer
This type of transfer is useful for interfacing devices with different data bus sizes. The
DMA employs multiple-cycle, multiple-address data transfers, called Gather-Scatter
transfers.
The data being transferred is first read from the I/O device or memory into
temporary DMA internal data registers. The data is then written to the memory or I/O
device in the next cycles.
This device has only a single address counter and hence supports only memory-to-I/O
transfers.