# A Design and Implementation of 32-paths Parallel 256-Point FFT/IFFT for Optical OFDM Systems

Hun-Sik Kang\*, Sun Hyok Chang\*, In-Ki Hwang\*, Joon-Ki Lee\*

\*Optical Internet Research Department, ETRI, 218 Gajeong-ro, Yuseong-gu, Daejeon, Korea gadin@etri.re.kr, shchang@etri.re.kr, ikhwang74@etri.re.kr, juneki@etri.re.kr

Abstract— This paper presents the design and implementation of thirty-two paths parallel FFT/IFFT for optical OFDM systems. Employing OFDM over high speed optical transmission systems requires large hardware resources due to computationalintensive operations such as full-parallel FFT implementation. The proposed 256-point FFT/IFFT adopts 32-paths parallel architecture with mixed radix-2<sup>3</sup> and radix-2<sup>5</sup> algorithm to reduce large number of complex multipliers. With this architecture, 256 full parallel data signals can be divided into eight sequential groups. Each group, composed of 32 parallel signals, is pipeline- processed. As a result, the hardware resources for full parallel 256-point FFT operation can be reduced to as much as those of 32-point FFT resource plus 32path 8-point single-path delay feedback (SDF). The proposed FFT architecture is implemented with field programmable gate array and integrated in 12 Gbps, 64-QAM encoded real-time optical OFDM system. The implementation result shows that an 86% complex multiplier reduction can be achieved in comparison with full parallel architecture. The attained error vector magnitude (EVM) is approximately up to -32 dB for 64-**QAM OFDM signals.** 

*Keywords*— Fast Fourier Transform (FFT), Optical OFDM (Orthogonal Frequency Division Multiplexing), Single-path Delay Feedback (SDF)

# I. INTRODUCTION

To meet the rapid growth of internet traffic and increase of high bandwidth demand for optical networks, Optical Orthogonal Frequency Division Multiplexing (OOFDM) technology has been widely researched and developed in recent years. OFDM, as a multicarrier transmission technique, has several advantages of simple equalization, intolerant to dispersive fiber channel, spectral efficiency, and flexibility [1]. Since the modulation and demodulation in OFDM are achieved digitally with the aid of Inverse Fast Fourier Transform (IFFT) and FFT requiring high computational complexity, hardware-efficient implementation is very crucial for real-time realization of OOFDM system.

To realize FFT implementation for real-time OOFDM applications, experimental results and demonstrations using field programmable gate array (FPGA) have been reported, which are, however, obtained by adopting the full-parallel architecture. 32/128/256-point full-parallel decimation-in-time FFT are presented for intensity modulation, direct detection (IMDD) and coherent optical OFDM (CO-OFDM) applications [2-5]. Gokhan [4] reported full parallel a 256-

point FFT processor for 100 Gbps applications, which uses radix-4 scheme to reduce complexity as well as maintaining high-speed data flow. For further complexity reduction, Schmogrow [5] proposed IFFT implementation using precomputed IFFT values with lookup tables. But its method could not be applied to FFT operation because of not being able to precompute receive signal in advance. In addition, its applicability is limited since its complexity grows as the number of QAM level and FFT points increase. From all these studies, full parallel architecture has been adopted, resulting in low speed internal clock operation, to meet excessive data rate demands at the cost of higher logic consumption. However, considering current state of the art FPGA devices has enough marginal slack of operation speed, there needs to adopt FFT architecture with smaller paths. Thus we propose 32-paths parallel 256-point FFT architecture to reduce logic resources for real-time realization of OOFDM systems.

The rest of paper is organized as follows: Section II describes the mathematical formulations of the 32-paths 256-point FFT algorithm. Section III gives the proposed FFT architecture and hardware complexity. And explains implementation and test results under the environment of optical OFDM experiment, and finally conclusion follows in Section IV.

### **II. 32-PATHS 256-POINT FFT DESIGN**

To derive a 32-paths 256-point decimation-in-frequency (DIF) FFT mathematical formulation, Cooley-Turkey (CT) algorithm firstly is applied to FFT equation [6]. Given a length *N* complex input sequence x(n), N-point FFT equation is given in (1) where *n* represents a discrete time-domain index, *k* is a discrete frequency index, and  $W_N^{nk} = e^{-j2\pi \frac{nk}{N}}$  is the twiddle factor (TF).

$$X(k) = \sum_{n=0}^{N-1} x(n) W_N^{nk}$$
(1)

N-point FFT can be divided into two smaller FFT, as represented in (2), by applying CT algorithm to (1) with conditions of  $N N_l \cdot N_2 = 8 \cdot 32$ ,  $n = r + qN_2$ ,  $r = 0, ..., N_2 - 1$ , and  $k = l + mN_1$ ,  $l = 0, ..., N_1 - 1$ . Mixed radix-2<sup>3</sup> and radix-2<sup>5</sup> form can be obtained by using that  $n = 2^{\nu-1} \cdot n_1 + 2^{\nu-2} \cdot n_2 + \cdots + n_{\nu}$  and  $k = k_1 + 2 \cdot k_2 + \cdots + 2^{\nu-1} \cdot k_{\nu}$ , where  $\nu = log_2 N$ .





$$X_{mN_{1}+l} = \sum_{r=0}^{N_{2}-1} \left[ \left\{ W_{N}^{rl} \left[ \sum_{\substack{q=0\\N_{1}-point \ DFTs}}^{N_{1}-1} X_{r+qN_{2}} W_{N_{1}}^{lq} \right] \right\} W_{N_{2}}^{mr} \right]$$
(2)

For radix-2<sup>3</sup> N<sub>1</sub>-point DIF, rewriting the index n and k as ( $n = \frac{N}{2}n_1 + \frac{N}{4}n_2 + \frac{N}{8}n_3 + \overline{n_4} = n_A + \overline{n_4}$ ;  $n_1 = n_2 = n_3 = 0,1$ ;  $\overline{n_4} = 0,1,2,...,\frac{N}{8} - 1$ ;) and ( $k = k_1 + 2k_2 + 4k_3 + 8\overline{k_4} = k_A + \overline{k_4}$ ;  $k_1 = k_2 = k_3 = 0,1$ ;  $\overline{k_4} = 0,1,2,...,\frac{N}{8} - 1$ ;) , respectively, (2) becomes

$$X(k_{A} + \overline{k_{4}}) = \sum_{n_{4}=0}^{31} \sum_{n_{3}=0}^{1} \sum_{n_{2}=0}^{1} \sum_{n_{1}=0}^{1} x(n_{A} + \overline{n_{4}})W_{N}^{(n)(k)}$$

$$= \sum_{n_{4}=0}^{31} \left\{ \left[ X_{N/8}(\overline{n_{4}}, k_{1}, k_{2}, k_{3}) \right] W_{N}^{\overline{n_{4}k_{4}}} W_{N/8}^{\overline{n_{4}k_{4}}} \right\},$$
where  $X_{N/8}(\overline{n_{4}}, k_{1}, k_{2}, k_{3}) =$ 

$$\left( \sum_{n_{3}=0}^{1} \left( \sum_{n_{2}=0}^{1} \left( \sum_{n_{1}=0}^{1} x(n_{A} + \overline{n_{4}}) W_{2}^{n_{1}k_{1}} \right) W_{4}^{n_{2}k_{1}} W_{2}^{n_{2}k_{2}} \right) W_{8}^{n_{3}(2k_{2}+k_{1})} W_{2}^{n_{3}k_{3}} \right)$$
(3)

From (3),  $X_{8/N}$  of (3) can be expressed as  $2^3$ -radix 8-point FFT with  $TF1 = W_4^{n_2k_1}$  and  $TF2 = W_8^{n_3(2k_2+k_1)}$ , where TF1 is twiddle factor of stage1, TF2 for stage2. And  $X(k_A + \overline{k_4})$  is composed of thirty-two  $X_{8/N}$  and eight 32-point FFT, in which the twiddle factors of  $W_N^{\overline{n_4}k_A}$  are multiplied between  $X_{8/N}$  and 32-point FFTs. For further decomposition of 32-point FFT into radix- $2^5$  form, similar procedures can be applied by

replacing 
$$\overline{k_4} = k_4 + 2k_5 + 4k_6 + 8k_7 + 16k_8$$
 and  $(\overline{n_4} = \frac{32}{2}n_4 + \frac{32}{4}n_5 + \frac{32}{8}n_6 + \frac{32}{16}n_7 + \frac{32}{32}n_8)$ . Thus, (3) can be rewritten as  
 $X(k_1 + 2k_2 + 4k_3 + 8k_4 + 16k_5 + 32k_6 + 64k_7 + 128k_8)$   
 $= \sum_{n_8=0}^{1} \sum_{n_7=0}^{1} \sum_{n_6=0}^{1} \sum_{n_4=0}^{1} \sum_{n_4=0}^{1} \left\{ \left[ D_{N/8}(\overline{n_4}, k_1, k_2, k_3) \right] W_{N/8}^{\overline{n_4}k_4} \right\}$  (4).  
where  $D_{N/8}(\overline{n_4}, k_1, k_2, k_3) = X_{N/8}(\overline{n_4}, k_1, k_2, k_3) W_{N}^{\overline{n_4}k_4}$ 

By evaluating  $W_{N/8}^{\overline{n_4}\cdot\overline{k_4}} = W_{N/8}^{(16n_4+8n_5+4n_6+2n_7+n_8)(k_4+2k_5+4k_6+8k_7+16k_8)}$ equation (4) can be express as

$$(4) = D_{N/128}(0) + \underbrace{W_{4}^{k_{7}} \cdot D_{N/128}(1) \cdot W_{2}^{k_{8}}}_{TF7}, where$$

$$\begin{cases} D_{N/128}(n) = D_{N/64}(n) + \underbrace{W_{32}^{(4k_{6}+2k_{5}+k_{4})}}_{TF6} \cdot D_{N/64}(n + \frac{N}{128}) \cdot W_{2}^{k_{7}}, \\ D_{N/64}(n) = D_{N/32}(n) + \underbrace{W_{8}^{(2k_{5}+k_{4})}}_{TF5} \cdot D_{N/32}(n + \frac{N}{64}) \cdot W_{2}^{k_{6}}, \\ D_{N/32}(n) = D_{N/16}(n) + \underbrace{W_{4}^{(k_{4})} \cdot D_{N/16}(n + \frac{N}{32}) \cdot W_{2}^{k_{5}}, \end{cases}$$
(5),

By considering (3) and (5), we can decompose the 256-point FFT into an 8-point FFT and a 32-point FFT in which are implemented radix- $2^3$  and radix- $2^5$  algorithm.

Signal flow graph (SFG) of a 256-point FFT algorithm, as shown in figure 1, represents that  $X_{N/8}(\overline{n_4}, k_1, k_2, k_3)$  has the same 8-point FFT structure for input sequence of  $x_B(r)|_{r=0,1,\dots,31} = \{x_{r+qN_2}, q = 0,1,\dots,7\}$ . In addition, it shows that  $X(k_A + \overline{k_4})$  for each value of  $k_A$  has the same SFG.



FF: delay elements for SDF (ex: shifter register), BFU1/2/3: Butter-fly unit type-I/type-II/type-III, CCM: Constant Complex Multiplier, SRM: Complex Multiplier with strength reduction transformation.

Figure 2. Architecture of the proposed 32-paths parallel 256-point FFT

Therefore, if we break input sequence x(n) into  $x_B(r)|_{r=0,1,...31}$  groups and multiply TF3 in accordance with  $X_{N/8}(\overline{n_4}, k_1, k_2, k_3)$ , a full parallel 256-point FFT can be processed with only one 8-point FFT and a 32-point FFT, which results in reducing hardware resources.

Inverse FFT calculation can be performed by swapping the real and imaginary parts of input sequence, taking forward FFT, and swapping real and imaginary parts of FFT's results [7].

# **III. IMPLEMENTATION AND EXPERIMENTAL RESULTS**

In this section, we describe 32-paths parallel 256-point FFT architecture based on the SGF in section II. A full parallel 256-point FFT can be processed sequentially if  $X_{N/_{o}}(\overline{n_{4}}, k_{1}, k_{2}, k_{3})$  are implemented with single-delay path feedback (SDF) units [8]. In this case input sequences of  $x_B(r)$ is performed as an 8-point FFT operation. The proposed architecture, as shown in figure 2, consists of thirty-two 8point SDF blocks, twiddle factor of TF3, and a 32-point FFT block in which contains sixteen butterfly units (BFU1, BUF2 and BFU3), constant complex multipliers (CCM), complex multiplier with strength reduction transformation (SRM), and delay elements such as flip-flop (FF). Non-trivial complex multipliers in this architecture are exploited in TF2, TF3, TF5 and TF6, while other TFs are -j multiplication resulting in exchange between real and imaginary parts. Since TF2 and TF3 vary depending on the sequence order q and r of  $x_B(r) =$ x(q+32r), variable complex multipliers with strength reduction are used. The total number of real multiplications is reduced to only three for one complex multiplication [9]. CCM in TF5 and TF6 can be made with shift and add operations. Butterfly unit type-I (BFU1), as depicted in figure 3, carries out complex addition and subtraction by controlling the switch to make them work in a serial way [10]. BFU2 in figure 2 shows the butterfly structure with -j multiplication which is performed in stage 4 and stage 7. BFU3 is a conventional radix-2 butterfly.

Table 1 summarizes complexities in the proposed architecture. Assuming full parallel architecture is implemented with radix-2 form, the proposed architecture has an 86% complex multiplier reduction, an 80 % of complex adders, and a 76% of complex registers in comparison with full parallel architecture. On the other hand, operating clock rate requires eight times of full-parallel-architecture's one. Thus the proposed architecture accomplishes high reduction of complexities unless clock rate is constrained.

|                                            | The proposed<br>architecture | Full parallel<br>architecture |  |
|--------------------------------------------|------------------------------|-------------------------------|--|
| No. of constant complex multipliers (CCMs) | 28                           | 642                           |  |
| No. of variable complex multipliers (SRMs) | 63                           | 0                             |  |
| No. of butterflies (BFs)                   | 186                          | 1024                          |  |
| No. of complex adders                      | 2x186 = 392                  | 2048                          |  |
| No. of complex registers for SDFs          | 224                          | 0                             |  |
| No. of complex registers for pipelining    | 256                          | 2048                          |  |
| Throughput rate<br>(R : clock rate)        | 8R                           | R                             |  |

TABLE 1. COMPLEXITY SUMMARY OF THE PROPOSED ARCHITECTURE

Fixed point simulation using SPW and Matlab is done to acquire bit accuracies of FFT. The word length of input and output of each stage is 12 bits. Different bit size processed internally in each stage. Twiddle factor bits are also 12 bits. The acquired SQNR is approximately up to 45 dB. Based on the word length chosen, the FFT architecture was simulated with VHDL and ModelSim simulator. And the proposed architecture is implemented with Xilinx FPGA - XC7K480T.

|                                                                                                                                        | The proposed architecture                                       | Gokan Polat et al.<br>[5] | Beril Inan et al. [4] <sup>(b)</sup> | R. Schmogrow et al.<br>[3] <sup>(b)</sup>         |  |
|----------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------|---------------------------|--------------------------------------|---------------------------------------------------|--|
| Architecture                                                                                                                           | 32-paths mixed radix-2 <sup>3</sup><br>and radix-2 <sup>5</sup> | Full parallel<br>Radix-4  | Full parallel                        | Full parallel<br>(precomputed IFFT<br>using LUTs) |  |
| No. of FFT points                                                                                                                      | 256                                                             | 256                       | 1024                                 | 64                                                |  |
| Word length [bits]                                                                                                                     | (input: 12), (output: 12)                                       | (input: 5), (output: -)   | (input: 5), (output: 10)             | (input: -), (output: -)                           |  |
| QAM Levels                                                                                                                             | 64-QAM                                                          | - <sup>(a)</sup>          | 4-QAM                                | 16-QAM                                            |  |
| Used FPGA devices                                                                                                                      | VC7V480T 2                                                      | VCAULV565T 2              | XC5VFX200T [11]                      | XC5VFX200T                                        |  |
|                                                                                                                                        | AC/K4801-2                                                      | AC0 V11A3031-2            | (2 devices)                          | (2 devices)                                       |  |
| Number of Slice LUTs                                                                                                                   | 87,209                                                          | 213,700                   | 172,032                              | 152,371                                           |  |
| Number of Registers                                                                                                                    | 28,806                                                          | 123,406                   | 203,980                              | 164,659                                           |  |
| Number of DSP                                                                                                                          | 0                                                               | 0                         | 630                                  | 0                                                 |  |
| Block RAM                                                                                                                              | 0                                                               | 0                         | 3612 Kb                              | 0                                                 |  |
| Operating clock rate (Max.)                                                                                                            | 219.546 MHz                                                     | 335.914 MHz               | -                                    | -                                                 |  |
| Error Vector Magnitude(EVM)                                                                                                            | -32 dB (2.5%)                                                   | -                         | -                                    | -19.72 dB (11%)                                   |  |
| (notes) (a) '-'means no record in the reference paper. (b) Considering that one slice contains 4 LUTs and 4 registers, numbers of used |                                                                 |                           |                                      |                                                   |  |
| resources are calculated from the utilization percentage per one device written in [4] and [3].                                        |                                                                 |                           |                                      |                                                   |  |

TABLE 1. COMPASION OF THE PROPOSED WITH EXISTING RESULTS FOR OPTICAL OFDM

Table 2 shows the implementation results and comparison of the existing ones. The utilization factor is found to be 4% of slice registers and 29% of slice LUTs available. In spite of wider word-length and higher modulation levels, it is shown that the proposed FFT processor consumes less hardware resources than the previous works. Table 2 indicates that the proposed 256-point FFT architecture is hardware-efficient and thus appropriate to high-speed OOFDM applications. Experimentally, OOFDM transmitter using the proposed FFT works with 10 bits digital-to-analog converter (DAC) running at 4 GS/s, which results in an aggregate line rate of 12 Gbit/s. Figure 3 illustrates the experimental results of constellations (a), EVM per subcarrier (b), frequency spectrum (c), and demodulation information (d), respectively. Average EVM of approximately 32 dB is obtained for 64-QAM modulation.

# **IV.CONCLUSIONS**

In this paper, we proposed 32-paths parallel 256-point FFT architecture by applying the mixed radix- $2^3$  and radix- $2^5$  algorithm. The proposed FFT processor proved that it significantly reduce hardware complexities through the FPGA design and implementation. And also we confirmed experimentally that it can be used for OOFDM systems



Figure 3. Experimental results using the proposed FFT architecture

requiring excessive data rates. For further studies, we need to investigate on more sophisticated multiplier structures for additional optimizations.

### ACKNOWLEDGMENT

This work was supported by Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIP) (No.B0126-15-1024, 100 Gbps Coherent/OFDM DSP Development).

### REFERENCES

- J. Armstrong, "OFDM for optical communications," J. Lightwave Technol. 27(3), 189–204 (2009).
- [2] R. P. Giddings, X. Q. Jin, E. Hugues-Salas, E. Giacoumidis, J. L. Wei, J. M. Tang, "Experimental demonstration of a record high 11.25Gb/s real-time optical OFDM transceiver supporting 25km SMF end-to-end transmission in simple IMDD systems," *Optics Express(2010)*, Vol 18, Issue 6, pp 5541-5555.
- [3] B. Inan et al., "Realization of a 23.9 Gb/s real time optical-OFDM transmitter with a 1024 Point IFFT," in *Proc. Optical Fiber Communication Conference (OFC'11)*, paper OMS2.
- [4] Gokan Polat, Sitki Ozturk, and Mehmet Yakut, "Design and Implementation of 256-Point Radix-4 100 Gbit/s FFT Algorithm into FPGA for High-Speed Applications," *ETRI Journal*, vol.37, no.4, Aug. 2015, pp. 667-676.
- [5] R. Schogrow et al., "Real-time OFDM transmitter beyond 100 Gbit/s," Optical Express 19(13), pp.12740-12749, 2011.
- [6] Cooley, James W., Turkey, Jon W., "An algorithm for the machine calculation of complex Fourier series", *Math. Comput.* 19: 297-301, 1965.
- [7] Richard G. Lyons, Understanding Digital Signal Processing, Addison-Wesley, 1'st Edition (1996), pp. 425-429.
- [8] S. He and M. Torkelson, "A new approach to pipeline FFT processor architecture," in *Proc. IEEE Int. Parallel Processing Symp.*, Apr. 1996, pp. 766-770
- [9] A. Wenzler and E. Luder, "New structures for complex multipliers and their noise analysis," in *Proc. IEEE Int. Symp. on Circuits and Systems*, Seattle, Washington, vol. 2, pp. 1432-1435, 1995.
- [10] M. Vergara, M. Strum, W. Eberle, B. Gyselinekx., "A 195K FFT/s (256-points) high performance FFT/IFFT processor for OFDM applications," in *Proc. IEEE Int. Telecommun. Symp.*, vol. 1, pp 273 – 278, 1998.
- [11] Virtex-5 FPGA User Guide (UG190), Xilinx, 2012.

- Hun-Sik Kang was born in Daegu, Korea, in 1969. He received the M.Sc. degree in electrical engineering from Kyung-book national university, Daegu, Rep. of Korea, in 1994, and the Ph.D. degree in Information & Communication from KAIST, Daejeon, Korea, in 2011. From 1995 to 2000, he worked as a Design Engineer in Hynix Semiconductor Ltd., Korea. In 2000, he joined Electronics and Telecommunication Research Institute and currently worked as a principal senior researcher. His primary research interest include digital signal processing on wireless/optical communication system and hardware engineering. He is presently researching signal processing technique and hardware engineering in optical communication systems that use orthogonal frequency division multiplexing (OFDM) or discrete multitone (DMT) modulation.
- Sun Hyok Chang received his BS in physics from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Rep. of Korea, in 1994. He received his MS and PhD in physics from KAIST in 1996 and 2000, respectively. He joined ETRI, Daejeon, Rep. of Korea, in 2000 and has been engaged in research on optical fiber telecommunication systems. He is currently interested in coherent optical communication and flexible WDM networks.
- In-Ki Hwang received his M.Sc. degree in electrical engineering from Sungkyunkwan University, Rep. of Korea, in 2001. In 2001, he joined Electronics and Telecommunication Research Institute and currently worked as a senior researcher. His primary research interest include high-speed interface technology and hardware engineering in optical communication system. He is presently researching high-speed interface design in optical communication systems that use orthogonal frequency division multiplexing (OFDM) or discrete multitone (DMT) modulation.
- Joon-Ki Lee received his BS degree in material engineering from Sungkyunkwan University, Suwon, Rep. of Korea, in 1995 and his MS degree in communication engineering from the Gwangju Institute of Science and Technology, Gwangju, Rep. of Korea, in 1997. He joined ETRI in 2001 and has worked on high-speed Ethernet transceivers and optical devices. He is the director of the optical transmission research section in ETRI.