Algorithm and Design
Abstract— An 8-bit CPU is designed at gate level from scratch using a custom-chip approach. The CPU has an 8-bit integer unit and a 16-bit floating-point unit. The instruction set includes shift, logic, integer and floating-point arithmetic instructions. The circuits are optimized by using a more efficient algorithm. The algorithm discussed in this paper was applied to an 8-bit CPU design; however, there is no reason it could not be used for more powerful CPU development. Currently no attempt has been made to include any special support or design for parallel MUL/ADD/SUB operations [1][2]. An attempt has been made to improve the conventional [6] algorithm. This paper discusses the design of the FP ADD/SUB unit, with respect to algorithm and VHDL implementation, as all the functional units cannot be discussed in this paper.
The project was implemented in VHDL and simulated using Altera MAX+PLUS II simulation software, which can map the design onto an Altera CPLD.

Index Terms—CPU, simulation, algorithm, floating-point unit, VHDL.

I. INTRODUCTION

This paper focuses on one functional unit of a 16-bit FPU that is part of a CPU with an 8-bit integer unit. The CPU has 4x16-bit FPU registers, 16-bit data and address buses, and a 16-bit program counter. The data path is where most of the operations directed by the processor's control unit are carried out. There are seven functional units, three of which belong to the FPU. This paper discusses one functional unit, the floating-point adder. The logic of the algorithm is discussed in detail, together with the implementation block diagram, the VHDL code and the simulation results. The goal is not to address any issues of clock rate or IPC [3]. The main focus is on the improved algorithm and the relevant design.

A. 3 Functional units:

Barrel shifter:
This unit has been designed to use five control signals: one enable signal, one shift (left/right) signal and 3 signals to determine the number of shifts. This unit accepts one operand from the ALU registers.

FP add/subtractor:
Its operands, FA and FB, are sourced from the floating-point register file. It requires one control signal to indicate the start and another control signal to decide whether addition or subtraction is to be performed. The result is 16 bits wide.

FP multiplier:
Similar to the FP add unit, the operands are sourced from the floating-point register file, but it requires only one control signal to indicate the start.

B. Specifications of Floating-point add/sub unit:

In short, the specifications are 2x16-bit FP registers for operands, 1x16-bit FP register for the final result, 1x8-bit register counter, 1x adder, 1x 4-to-1 selector, 1x 2-to-1 selector, 6x8-bit registers, 6x 2-to-1 multiplexers, 6x 2-to-1 multiplexers, 3x 3-to-1 multiplexers, a left barrel shifter, a right barrel shifter, 1 zero counter, 1x7-bit register and an output signal logic.
All simulations of the VHDL code were done using the MAX7000 device family in Altera MAX+PLUS II. It is impossible to discuss every unit in this paper; however, the floating-point add unit is discussed. The logic of the algorithm is discussed in detail, and an implementation block diagram of the add unit is shown with the detailed design of the 16-bit floating-point register. It is worth mentioning that excellent work has been done on improving FP arithmetic [7][8].

C. The Algorithm:

- Initially, the operands are loaded into 2 temporary registers.
- The 2 biased exponents are compared.
- The difference is stored.
- The mantissa with the smaller biased exponent is shifted by the difference.
- It is then subtracted from the other mantissa and the result is stored.
- The result is rounded up or down.
- The result is normalized and stored.

There are some exceptions to this algorithm:
- If the difference between the two biased exponents is greater than 7, which is the length of the mantissa, then the operand with the higher exponent value is stored without going through the following steps.
- If the subtraction of the mantissas yields zero, then zero is stored as the result and normalization is skipped.

The conventional algorithm uses the following steps:
- Zero checking.
- Significand adjustment.
- Addition/Subtraction.
- Normalization.
- Rounding.

D. Rationale:
With respect to the conventional algorithm above, we have a slightly different method of obtaining the result. The differences are as follows:

- There is no zero checking on operands in this method. We think that, as an operand with zero value does not occur very often, there would be no significant degradation of performance of floating-point calculations. Further, one clock cycle is saved for every FADD/FSUB instruction with non-zero operands, and fewer gates are used.

- Our method chooses not to compare the exponents and test the significand for zero every clock cycle to make the exponents equal. Instead, we chose to find the difference between the two exponents and store the difference (which is positive). The larger exponent value is stored and the significand with the smaller exponent is shifted by the difference using a barrel shifter (with the exception that the difference must not be larger than 7).

- The significand is not checked for zero after adding the signed significands. Since most additions do not result in zero, we feel that it is not necessary to introduce an extra cycle just to check this.

- If significand overflow occurs after adding both significands, exponent overflow is not checked immediately by this algorithm. The maximum biased exponent value that can be stored is 1111111 - 1 = 1111110; 1111111 indicates an overflow. In a worst-case scenario, the maximum value of the biased exponent after being incremented is 1111111. However, since the result is normalized later, which can decrement the biased exponent back into the permissible range, we check this after normalization and rounding.

- Round to nearest: a representable significand value nearest to the result is stored. If the result is exactly in between two representable values, then the current least significant bit determines whether to round up or down, in order to force the result to be even. For example, if the current least significant bit is zero, the result is rounded down, and if it is one, the result is rounded up.

The algorithm also allows for exceptions such as exponent underflow/overflow and significand overflow. The algorithm is faster, as it takes a maximum of 8 clock cycles to complete, compared with the conventional algorithm, which takes 13 clock cycles. There are different ways to improve performance [4][5]; our approach is different. A small drawback is that this algorithm requires more components; as a result, the block diagram may look complicated and confusing. However, this is far outweighed by the benefit.

The overall block diagram, Figure 2, can be found on page 5 of the paper.

II. DESIGN: ADD/SUB UNIT

We have tried quite a few different designs, such as the Carry Look-ahead adder (CLA) and the Ripple Carry adder (RCA). The CLA provided good speed but was much larger and consumed more power than the RCA. The RCA, on the other hand, is compact but slower than the CLA. Hence we decided to design Hybrid adders to take advantage of both. We have tried quite a few different designs for Hybrid adders as well, and here we discuss the type 1 hybrid adder (HA-1). HA-1 is very fast but high on power consumption, and it is used for FADD/FSUB.

A. VHDL Code and Simulation

Code for CLA (3) with normal carry input

LIBRARY ieee;
USE ieee.std_logic_1164.all;

ENTITY add_cla3_n IS
  PORT ( a0, a1, a2 : IN  STD_LOGIC;
         b0, b1, b2 : IN  STD_LOGIC;
         ci         : IN  STD_LOGIC;
         o0, o1, o2 : OUT STD_LOGIC;
         co         : OUT STD_LOGIC);
END ENTITY;

ARCHITECTURE a OF add_cla3_n IS
  SIGNAL g0, g1, g2 : STD_LOGIC; -- intermediate signals for generate (G)
  SIGNAL p0, p1, p2 : STD_LOGIC; -- intermediate signals for propagate (P)
  SIGNAL c1, c2, c3 : STD_LOGIC; -- intermediate signals for carry out
BEGIN
  g0 <= a0 AND b0; g1 <= a1 AND b1; g2 <= a2 AND b2;
  p0 <= a0 OR b0;  p1 <= a1 OR b1;  p2 <= a2 OR b2;
  c1 <= g0 OR (p0 AND ci);  -- carry generation for bit 1
  c2 <= g1 OR (p1 AND c1);  -- carry generation for bit 2
  c3 <= g2 OR (p2 AND c2);  -- carry generation for carry out
  o0 <= (a0 XOR b0) XOR ci; -- sum output bit 0
  o1 <= (a1 XOR b1) XOR c1; -- sum output bit 1
  o2 <= (a2 XOR b2) XOR c2; -- sum output bit 2
  co <= c3;                 -- carry output
END a;

Code for Hybrid Adder 1:

LIBRARY ieee;
USE ieee.std_logic_1164.all;

ENTITY add_ha1 IS
  PORT ( a, b : IN  STD_LOGIC_VECTOR (0 to 7);
         ci   : IN  STD_LOGIC;
         o    : OUT STD_LOGIC_VECTOR (0 to 7);
         co   : OUT STD_LOGIC);
END ENTITY;

ARCHITECTURE a OF add_ha1 IS
  COMPONENT add_full1 IS -- declare full adder 1
    PORT ( a, b, ci : IN  STD_LOGIC;
           o, co    : OUT STD_LOGIC);
  END COMPONENT;
BEGIN
  -- XOR2 gates at the B input for ADD/SUB function
  x0 <= b(0) XOR ci; x1 <= b(1) XOR ci; x2 <= b(2) XOR ci;
  x3 <= b(3) XOR ci; x4 <= b(4) XOR ci; x5 <= b(5) XOR ci;
  x6 <= b(6) XOR ci; x7 <= b(7) XOR ci;
  -- ...
  -- The inverted signal from carry out C4.
  c4n <= NOT c4;
  o(0) <= s0; -- sum bit 0 (from full adder 1)
  o(1) <= s1; -- sum bit 1 (from CLA3)
  o(2) <= s2; -- sum bit 2 (from CLA3)
  o(3) <= s3; -- sum bit 3 (from CLA3)
  o(4) <= (s40 AND c4n) OR (s41 AND c4); -- sum bit 4 (from CSA)
  o(5) <= (s50 AND c4n) OR (s51 AND c4); -- sum bit 5 (from CSA)
  o(6) <= (s60 AND c4n) OR (s61 AND c4); -- sum bit 6 (from CSA)
  o(7) <= (s70 AND c4n) OR (s71 AND c4); -- sum bit 7 (from CSA)
  co <= (c80 AND c4n) OR (c81 AND c4);   -- carry out
END a;

The logic diagram, Figure 3, gives an idea of the circuit of HA-1.
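A gate-level adder such as HA-1 is easier to check against a behavioral reference. The following is a minimal sketch of such a reference, not the authors' code: it models only the add/sub function implied by the XOR2 gates at the B input, that is, o = a + (b XOR ci) + ci. The names addsub8_ref and addsub8_ref_tb are hypothetical, the vectors are declared (7 DOWNTO 0) so that binary literals read MSB first (the listing above uses (0 to 7) with index 0 as the least significant bit), and a VHDL-93 simulator is assumed. The operand values are the ones reported in the simulation results of this paper.

```vhdl
LIBRARY ieee;
USE ieee.std_logic_1164.all;
USE ieee.numeric_std.all;

-- Behavioral reference (hypothetical name): o = a + (b XOR ci) + ci
ENTITY addsub8_ref IS
  PORT ( a, b : IN  STD_LOGIC_VECTOR (7 DOWNTO 0);
         ci   : IN  STD_LOGIC;                -- '0' = add, '1' = subtract
         o    : OUT STD_LOGIC_VECTOR (7 DOWNTO 0);
         co   : OUT STD_LOGIC);
END ENTITY;

ARCHITECTURE behav OF addsub8_ref IS
BEGIN
  PROCESS (a, b, ci)
    VARIABLE x   : STD_LOGIC_VECTOR (7 DOWNTO 0); -- b after the XOR2 gates
    VARIABLE cin : UNSIGNED (8 DOWNTO 0);         -- carry in as a 9-bit value
    VARIABLE sum : UNSIGNED (8 DOWNTO 0);         -- carry out plus 8-bit sum
  BEGIN
    FOR i IN 0 TO 7 LOOP
      x(i) := b(i) XOR ci;                        -- invert b when subtracting
    END LOOP;
    cin    := (OTHERS => '0');
    cin(0) := ci;
    sum    := ('0' & UNSIGNED(a)) + ('0' & UNSIGNED(x)) + cin;
    o      <= STD_LOGIC_VECTOR(sum(7 DOWNTO 0));
    co     <= sum(8);
  END PROCESS;
END behav;

LIBRARY ieee;
USE ieee.std_logic_1164.all;

-- Self-checking testbench (hypothetical name)
ENTITY addsub8_ref_tb IS
END ENTITY;

ARCHITECTURE tb OF addsub8_ref_tb IS
  SIGNAL a, b, o : STD_LOGIC_VECTOR (7 DOWNTO 0);
  SIGNAL ci, co  : STD_LOGIC;
BEGIN
  dut : ENTITY work.addsub8_ref
    PORT MAP (a => a, b => b, ci => ci, o => o, co => co);

  PROCESS
  BEGIN
    a <= "01100100"; b <= "00110010";  -- A = 100, B = 50
    ci <= '0'; WAIT FOR 10 ns;         -- add: expect 150, no carry out
    ASSERT o = "10010110" AND co = '0'
      REPORT "add mismatch" SEVERITY ERROR;
    ci <= '1'; WAIT FOR 10 ns;         -- subtract: expect 50, carry out set
    ASSERT o = "00110010" AND co = '1'
      REPORT "sub mismatch" SEVERITY ERROR;
    WAIT;                              -- end of stimulus
  END PROCESS;
END tb;
```

For subtraction the reference computes 100 + 205 + 1 = 306 = 256 + 50, so the 8-bit sum is 50 and the carry out is '1', which is why a set carry out marks a non-negative difference.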
Figure 3. HA-1 Logic diagram

B. Simulation results:

As seen in the Simulation diagram, Figure 1, on page 4 of this paper:

A = 01100100B = 100D
B = 00110010B = 50D

- The results of the addition simulation are correct: when control is ‘0’, the add function is selected. So, A+B = 150D = 10010110B. The output result as seen in the diagram shows 10010110, which confirms proper operation. Co = ‘0’, which indicates that the result does not overflow.
- The results of the subtraction simulation are correct: when control is ‘1’, the sub function is selected. So, A-B = 50D = 00110010B. The output result (o0 to o7), as seen in the diagram, matches this result. Co = ‘1’ indicates that the result is not a negative number after subtraction.
- The MAX7000 CPLD device used in this simulation has 8.1 ns of output delay.
- Glitches appeared during the simulation on the sum output (from 200 ns to 208.1 ns). This means that a CPLD is not suitable for an HA-1 type of implementation, though the results were good.

III. CONCLUSION

REFERENCES
[1] A. Akkas and M. J. Schulte, “Dual-mode floating-point multiplier architectures with parallel operations,” Journal of Systems Architecture, vol. 52, pp. 549-562, October 2006.
[2] A. Akkas, “Dual-Mode Quadruple Precision Floating-Point Adder,” Proceedings of the 9th EUROMICRO Conference on Digital System Design, 2006, pp. 211-220, ISBN: 0-7695-2609-8.
[3] V. Agarwal, M. S. Hrishikesh, S. W. Keckler and D. Burger, “Clock rate versus IPC: the end of the road for conventional microarchitectures,” Proceedings of the 27th Annual International Symposium on Computer Architecture, vol. 28, May 2000, pp. 248-259, ISSN: 0163-5964.
[4] A. Beaumont-Smith, N. Burgess, S. Lefrere and C. C. Lim, “Reduced Latency IEEE Floating-Point Standard Adder Architectures,” Proceedings of the 14th IEEE Symposium on Computer Arithmetic, pp. 35, 1999, ISBN: 0-7695-0116-8.
[5] G. Even, S. M. Mueller and P. M. Seidel, “A dual precision IEEE floating-point multiplier,” Integration, the VLSI Journal, vol. 29, issue 2, 2000, pp. 167-180, ISSN: 0167-9260.
[6] W. Stallings, Computer Organization and Architecture, sixth edition, Pearson & Prentice-Hall, 2003.
[7] P. M. Seidel and G. Even, “Delay-Optimized Implementation of IEEE Floating-Point Addition,” IEEE Transactions on Computers, vol. 53, issue 2, February 2004, pp. 97-113, ISSN: 0018-9340.
[8] Y. Hida, X. S. Li and D. H. Bailey, “Algorithms for Quad-Double Precision Floating Point Arithmetic,” Proceedings of the 15th IEEE Symposium on Computer Arithmetic, pg. 155, 2001.

Dr. Joshi is a member of IEEE and leads the Computer Systems group at the IEEE Trinidad chapter.

S. L. Lam graduated from Multimedia University, Cyberjaya, Malaysia. He later joined Xilinx in Malaysia as an Engineer.
Figure 2. FP ADD/SUB Unit Block diagram.
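The exponent alignment stage of the unit in Figure 2, as described in Section I.C, can be sketched behaviorally as follows. This is an illustrative sketch under stated assumptions, not the authors' design: the entity name fp_align and all port names are hypothetical, the 7-bit exponent and 7-bit mantissa widths are taken from the text, the mantissas are treated as plain unsigned values (sign handling and any hidden bit are not modeled), and the smaller operand is assumed to be shifted right.

```vhdl
LIBRARY ieee;
USE ieee.std_logic_1164.all;
USE ieee.numeric_std.all;

-- Alignment sketch (hypothetical entity and port names)
ENTITY fp_align IS
  PORT ( ea, eb  : IN  UNSIGNED (6 DOWNTO 0);  -- biased exponents
         ma, mb  : IN  UNSIGNED (6 DOWNTO 0);  -- mantissas (assumed unsigned)
         e_big   : OUT UNSIGNED (6 DOWNTO 0);  -- larger exponent, kept
         m_big   : OUT UNSIGNED (6 DOWNTO 0);  -- mantissa of larger operand
         m_small : OUT UNSIGNED (6 DOWNTO 0);  -- shifted mantissa of smaller operand
         bypass  : OUT STD_LOGIC);             -- '1' when difference > 7
END ENTITY;

ARCHITECTURE behav OF fp_align IS
BEGIN
  PROCESS (ea, eb, ma, mb)
    VARIABLE diff : NATURAL RANGE 0 TO 127;    -- positive exponent difference
  BEGIN
    IF ea >= eb THEN
      diff    := TO_INTEGER(ea - eb);
      e_big   <= ea;
      m_big   <= ma;
      m_small <= SHIFT_RIGHT(mb, diff);        -- assumed right shift
    ELSE
      diff    := TO_INTEGER(eb - ea);
      e_big   <= eb;
      m_big   <= mb;
      m_small <= SHIFT_RIGHT(ma, diff);
    END IF;

    -- Exception from Section I.C: if the difference exceeds the mantissa
    -- length, the operand with the larger exponent is stored directly.
    IF diff > 7 THEN
      bypass <= '1';
    ELSE
      bypass <= '0';
    END IF;
  END PROCESS;
END behav;
```

Storing the positive difference once and applying it in a single barrel shift is what removes the per-cycle compare-and-shift loop of the conventional algorithm.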