1 Introduction

DS-CDMA (direct-sequence code-division multiple access) systems are a bandwidth-efficient solution for meeting the requirements of third-generation (3G) mobile wireless communication [1, 2, 4]. Unlike TDMA or FDMA, a CDMA system lets all active users share the entire bandwidth at the same time by assigning each user a unique orthogonal pseudo-noise code.

Theoretically, in the uplink (from users to base station), the DS-CDMA scheme allows the data of each active user to be detected at the base station (BS) receiver. In practice, however, this multiple access technique remains interference-limited. Indeed, the transmission channel destroys the orthogonality between the active users [3, 4], and the BS receiver cannot correctly separate the users’ information. This phenomenon is referred to as multiple access interference (MAI). In addition, inter-symbol interference (ISI) appears because of the multipath channel and increases proportionally with the transmission data rate.

It is well known that, to combat these uplink interferences, the base station receiver has to be designed efficiently [4]. However, a conventional receiver such as the Rake (commonly used in the second generation, 2G) ignores MAI and treats it as additive white Gaussian noise (AWGN) while detecting the users of interest [3, 4]. Accordingly, the Rake receiver suffers substantial performance degradation as the number of users and/or the data throughput increases. Many proposals in the literature therefore address MAI and ISI through multiuser detection (MUD) schemes [1, 3–15].

The interest in multiuser signal processing for CDMA stemmed from Verdú’s seminal work [1, 5], in which he proposed and analyzed the optimal uplink multiuser detector: the maximum likelihood sequence detector (MLSD). Unfortunately, this optimal MUD is prohibitively complex for real-time implementation, since its complexity increases exponentially with the number of users. Over the last two decades, research in this area has therefore focused on sub-optimal uplink MUD solutions [6–24] whose design objective is a good tradeoff between performance and complexity: (1) combating MAI and ISI in order to reliably detect the information data of as many active users as possible, while (2) keeping the complexity low enough to allow real-time implementation for this maximum number of users. This implementation complexity issue explains why the Rake receiver is still present in 3G base stations [4]. Indeed, even though many sub-optimal MUD methods have been proposed, most of them have a high computational complexity and are not feasible in real time on the commercial baseband processors used in the BS. Nevertheless, some work has been conducted on the real-time implementation of MUDs [16, 24].

One possible classification of these sub-optimal MUD methods assigns them to one of two major classes, according to whether or not channel estimates are required. Direct methods require channel estimates, in terms of the channel’s attenuations and delays, to perform the detection process [6–9]. Indirect techniques, on the other hand, resort to an adaptive process that designs the receiver using whatever training information is available [10–13]. To improve the performance-complexity tradeoff, MUD methods based on both direct and indirect processes have been proposed [14] (patented).

Owing to their reduced computational complexity, the indirect (adaptive) techniques [10–13] are the main focus of this paper. The computational saving inherent in most adaptive methods, such as LMS and related techniques [13, 28, 29], stems from their reliance on simple multiply–add operations; matrix manipulations (matrix–matrix and matrix–vector multiplications, matrix inversion, etc.) are usually not involved. This approach suits VLSI implementation techniques such as pipelining (e.g. [30]), which maximize the number of users handled in a single device. Matrix manipulations become more expensive in time-varying channels, where the variations of the channel attenuations dictate how often the MUD parameters (for instance, the correlation matrices used for MAI suppression in parallel interference cancellation techniques) must be updated [4].

In [4], J. G. Andrews identifies uplink MUD methods based on Interference Cancellation (IC) as the best solution to the DS-CDMA uplink problems. Given previously estimated symbols, this direct approach regenerates the interference and reconstructs the contribution of each user by subtracting, in turn, this interference from the received signal; the result is a set of refined contributions from which new symbol estimates are deduced. This process is carried out for a limited but sufficient number of iterations (stages). There are two main IC structures: (1) the Multistage Parallel IC (MPIC), which cancels the interference of all users in parallel over several stages [7, 9]; and (2) the Successive IC (SIC), which cancels the interference successively, from the least corrupted user to the most corrupted one [6–8]. In both cases, a soft decision can be computed at the output of every stage so that a more accurate symbol detection is obtained (Soft-MPIC). Depending on the transmission conditions, one technique may outperform the other [7]. It is worth mentioning at this point that interference cancellation can be performed at the bit rate, using the code auto-correlation matrix, for better performance, or at the chip rate, for lower complexity. IC MUD methods based on the code auto-correlation matrix show excellent performance in combating MAI, at the expense of very high complexity. To keep the complexity acceptable, most works do not use the auto-correlation matrix [4]; work on their VLSI implementation appears in [17–22]. In comparison with the conventional Rake receiver, Hagerman et al. showed in [18] the uplink capacity improvement obtained for WCDMA with an MPIC: a 40% system-level increase was found in a typical urban environment for voice transmission (12.2 kbps). The authors estimated that this MUD is 5 times more complex than the Rake receiver, doubling the total receiver complexity.
This increase in receiver complexity is the principal obstacle to deploying MUD technology in commercial BS networks. Reducing complexity while retaining high performance is the goal of the proposed MUD solution.

This paper presents an Adaptive Duplicated Filters and Interference Canceller (ADIC) based on a mixed direct–indirect adaptive filtering approach to MUD design. The approach consists of creating synthesized training signals from the known channel parameters, which are supplied by the channel estimator (e.g. a Correlator [26, 27]). These synthesized training signals can be generated at the receiver and used to adapt the filter coefficients following the indirect adaptation method. ADIC is a multistage detector in which each stage comprises a bank of filters followed by an IC. The filter bank’s coefficients are obtained from a synthesized training signal using an appropriate adaptation rule (LMS with an adaptive step size, e.g. [29]). To maintain a pipelined implementation structure with low data dependency, a non-feedback structure is applied to the ADIC MUD to obtain the expected performance. This pipeline property is fully exploited in the presented VLSI implementation strategy (e.g. [30]).

The paper is organized as follows. Section 2 presents the DS-CDMA signal model, which takes into account both traffic and pilot signal transmissions. Section 3 describes in detail the proposed MUD, ADIC. Section 4 describes a VLSI implementation strategy for the ADIC MUD. Performance and complexity comparisons between ADIC and the Decision Feedback Soft MPIC (DF-Soft-MPIC) MUD, together with an evaluation of the hardware resources needed to implement the ADIC MUD in an FPGA, are deferred to Section 5, and Section 6 draws some brief conclusions.

2 DS-CDMA Signal Model

Consider an uplink data transmission from K mobile units to a base station. To simplify the notation, we consider one receiving antenna. The kth user’s transmitted baseband signal, carrying the information sequence \(\left\{ {b_k \left[ n \right]} \right\}_{n = 1}^N \) with k = 1,2,...,K, can be written as

$$x_k \left( t \right) = \sum\limits_{n = 1}^N {A_k b_k \left[ n \right]d_k \left( {t - nT;n} \right)} ,$$
(1)

where N is the number of symbols, \(A_k\) the signal gain and \(b_k \left[ n \right]\) the nth symbol, of duration T. For BPSK signals, \(b_k \left[ n \right]\) is assumed to belong to the alphabet \(S = \left\{ \pm 1 \right\}\). \(d_k \left( {t;n} \right)\) is the spreading (scrambling) sequence for the nth symbol, given by

$$d_k \left( {t;n} \right) = \sum\limits_{\ell = 1}^{N_c } {c_{k,\ell } \left[ n \right]\psi \left( {t - \ell T_c } \right)} ,$$
(2)

Here \(N_c \triangleq T / T_c\) is the spreading factor (processing gain), with \(T_c\) the chip period, and \(c_{k,\ell } \left[ n \right]\) is the ℓth element of the spreading sequence for the nth symbol, where \(c_{k,\ell } \left[ n \right] \in \left\{ {\left( { \pm 1 \pm j} \right) / \sqrt 2 } \right\}\). ψ(t) is a unit-energy pulse-shaping filter, possibly a raised cosine filter. We define \(h_k \left( {t;n} \right)\), the kth user’s multipath channel corresponding to the nth symbol, as

$$h_k \left( {t;n} \right) = \sum\limits_{p = 1}^{P_k } {h_{k,p} \left[ n \right]\delta \left( {t - \tau _{k,p} } \right)} ,$$
(3)

where \(P_k\) is the number of paths, \(h_{k,p} \left[ n \right]\) the pth path’s complex amplitude (attenuation), \(\tau _{k,p}\) the pth path’s propagation delay and δ(t) the Dirac impulse function.

From (1) and (3), the total received signal for the nth symbol at the base station, including the pilot-transmission signal \(\overline r _{k,p} \left( {t - \tau _{k,p} ;n} \right)\), is

$$y\left( {t;n} \right) = \eta \left( {t;n} \right) + \sum\limits_{k = 1}^K {\sum\limits_{p = 1}^{P_k } {\left[ {r_{k,p} \left( {t - \tau _{k,p} ;n} \right) + \overline r _{k,p} \left( {t - \tau _{k,p} ;n} \right)} \right]} } ,$$
(4)

where η(t; n) denotes the additive white Gaussian noise (AWGN) with zero mean and two-sided power spectral density \(N_0 / 2\). The pilot-transmission signal undergoes the same multipath channel as the information (traffic) bearing signal \(r_{k,p} \left( {t - \tau _{k,p} ;n} \right)\), such that

$$r_k \left( {t;n} \right) = \sum\limits_{p = 1}^{P_k } {r_{k,p} \left( {t - \tau _{k,p} ;n} \right)} = A_k b_k \left[ n \right]\Theta _k \left( {t;n} \right),$$
(5)

where \(\Theta _k \left( {t;n} \right) = h_k \left( {t;n} \right) \otimes d_k \left( {t;n} \right)\) is the effective code and ⊗ denotes the temporal convolution between the kth user’s spreading waveform (2) at symbol instant n and the multipath channel (3).
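The signal model (1)–(5) can be illustrated at chip rate with a small numerical sketch. It is not part of the original formulation: it assumes a rectangular chip pulse, path delays equal to whole numbers of chips, a single user and a single symbol, and the helper names (`spread`, `convolve`, `channel_taps`) are hypothetical.

```python
# Chip-rate sketch of Eqs. (1)-(5) for one user and one symbol.
# Assumptions (not from the paper): rectangular chip pulse, integer-chip
# path delays, hypothetical helper names.

def spread(symbol, gain, code):
    """Chip samples of A_k * b_k[n] * d_k, following Eqs. (1)-(2)."""
    return [gain * symbol * c for c in code]

def convolve(h, d):
    """Full discrete convolution of two sequences."""
    out = [0j] * (len(h) + len(d) - 1)
    for i, hi in enumerate(h):
        for j, dj in enumerate(d):
            out[i + j] += hi * dj
    return out

def channel_taps(amplitudes, delays):
    """Tapped-delay-line form of Eq. (3): one tap per path."""
    taps = [0j] * (max(delays) + 1)
    for a, tau in zip(amplitudes, delays):
        taps[tau] += a
    return taps

s = 2 ** -0.5
code = [s * (1 + 1j), s * (1 - 1j), s * (-1 + 1j), s * (-1 - 1j)]  # N_c = 4
h = channel_taps([1.0, 0.5j], [0, 2])       # two paths, delays of 0 and 2 chips
theta = convolve(h, code)                   # effective code Theta_k
A, b = 2.0, -1
received = convolve(h, spread(b, A, code))  # traffic signal after the channel
model = [A * b * t for t in theta]          # right-hand side of Eq. (5)
```

Under these assumptions, `received` and `model` coincide, which is the content of (5): passing the spread symbol through the channel is equivalent to scaling the effective code by the gain and the symbol.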

3 ADIC Method

Each stage of the proposed multistage MUD, named ADIC (Adaptive Duplicated filters Interference Canceller; patent pending), see Fig. 1, consists of two distinct blocks: the adaptive filters block (AFB) and the interference canceller block (ICB). Of particular novelty is the use of adaptive filters (AFB) per user. Once adapted, the AFB is duplicated over the remaining stages, which considerably reduces the implementation complexity: all the stages of a given user share the same AFB, designed to combat both MAI and ISI. For effective MAI and ISI cancellation, the ICB is used in a cascade arrangement (see Fig. 1). This block regenerates all or part of the users’ contributions using the AFB outputs. Once interference cancellation has been performed on the received signal, the interference-free signal of each user is fed to the next AFB stage. As regards the AFB, two operational phases are considered: the adaptation phase, in which the filter coefficients are adapted, and the detection phase, which involves the ICB.

Figure 1. Detection phase of the proposed multistage MUD, for \(N_s\) stages.

3.1 Detection Phase

Since all stages are similar, we detail the signal flow of the detection phase within one stage only: the AFB and ICB of stage s, where \(s = 1,2,\ldots,N_s\) (\(N_s\) being the total number of stages). The AFB takes as input either the received signal (4), \(\left\{ {y\left( {t;n} \right)} \right\}_{n = 1}^N \), if s = 1, or \(\left\{ {\hat y_{k,s - 1} \left( {t;n} \right)} \right\}_{n = 1}^N \), the K estimated signals obtained from the previous stage s−1, if s ≠ 1. These inputs are sampled at the chip rate (\(1/T_c\)). The AFB outputs are estimates of the traffic symbols, at the symbol rate (1/T), denoted by \(\left\{ {\hat b_{k,s} \left[ n \right]} \right\}_{n = 1}^N \). For a given user k, in order to describe the AFB operations, we adopt a vector representation of the AFB input \(\left\{ {\hat y_{k,s - 1} \left( {t;n} \right)} \right\}_{n = 1}^N \) from stage s−1:

$${\mathbf{\hat y}}_{k,s - 1} \left[ n \right] = \left[ {\begin{array}{*{20}c}{\hat y_{k,s - 1} \left( {\left( {\left( {n - 1} \right)N_w + 1} \right)T_c + \overline \tau _k ;n} \right)} \\{\hat y_{k,s - 1} \left( {\left( {\left( {n - 1} \right)N_w + 2} \right)T_c + \overline \tau _k ;n} \right)} \\\vdots \\{\hat y_{k,s - 1} \left( {\left( {nN_w } \right)T_c + \overline \tau _k ;n} \right)} \\\end{array} } \right].$$

This vector representation corresponds to the simplest structure. To obtain such a structure, one can consider one filter per path, wherein each filter input is synchronized with the appropriate path delay. Here \(N_w\) is the dimension of the filter coefficient vector \({\mathbf{w}}_k\), i.e. \(\dim \left( {{\mathbf{w}}_k } \right) = N_w \times 1\), and \(\overline \tau _k \) is a given path delay of the kth user’s channel.

In the sth stage of the kth user, the corresponding raw AFB’s output is

$$\tilde b_{k,s} \left[ n \right] = {\mathbf{w}}_k^{\text{H}} {\mathbf{\hat y}}_{k,s - 1} \left[ n \right],$$
(6)

while the final output is

$$\hat b_{k,s} \left[ n \right] = f\left( {\tilde b_{k,s} \left[ n \right]} \right),$$
(7)

where the kth user’s filter coefficients, in vector form, are \({\mathbf{w}}_k \triangleq \left[ {w_k \left( 1 \right),w_k \left( 2 \right),\; \ldots ,\;w_k \left( {N_w } \right)} \right]^{\text{T}} \). The function f(⋅) in (7) is a decision function: for example, the signum function for a hard decision, or a hyperbolic tangent or any other relevant function for a soft decision (e.g. [6]). The final outputs, which depend on s and correspond to the estimates of the traffic information symbols, are given by

$$\hat b_{k,s} \left[ n \right] = \left\{ {\begin{array}{*{20}c} {f\left( {\tilde b_{k,s} \left[ n \right]} \right) = \tanh \left( {\tilde b_{k,s} \left[ n \right]} \right)} & {if} & {s = 1} \\ {f\left( {\tilde b_{k,s - 1} \left[ n \right],\tilde b_{k,s} \left[ n \right]} \right) = \tanh \left( {{{\left( {{\text{sign}}\left( {\tilde b_{k,s - 1} \left[ n \right]} \right) + {\text{sign}}\left( {\tilde b_{k,s} \left[ n \right]} \right)} \right)} \mathord{\left/ {\vphantom {{\left( {{\text{sign}}\left( {\tilde b_{k,s - 1} \left[ n \right]} \right) + {\text{sign}}\left( {\tilde b_{k,s} \left[ n \right]} \right)} \right)} 2}} \right. \kern-\nulldelimiterspace} 2}} \right)} & {if} & {1 < s < N_s } \\ {f\left( {\tilde b_{k,s} \left[ n \right]} \right) = {\text{sign}}\left( {\tilde b_{k,s} \left[ n \right]} \right)} & {if} & {s = N_s } \\ \end{array} } \right.$$
(8)

In the first stage, a hyperbolic tangent function can be used. Such a function softly limits the estimated (binary) information to within the pre-assumed safe dynamic range. At the last stage, \(s = N_s\), a hard decision is made. In the intermediate stages, \(1 < s < N_s\), in order to limit the flip-flop effect [6], the decision function in (8) operates on both the current and the previous filter outputs, namely \(\tilde b_{k,s} \left[ n \right]\) and \(\tilde b_{k,s - 1} \left[ n \right]\). The outcome of \(\left( {{\text{sign}}\left( {\tilde b_{k,s - 1} \left[ n \right]} \right) + {\text{sign}}\left( {\tilde b_{k,s} \left[ n \right]} \right)} \right) / 2\) is −1, +1 or 0. An outcome of +1 or −1 means that the sth and (s−1)th stages agree that +1 or −1, respectively, has been transmitted. An outcome of 0, on the other hand, signals a flip-flop phenomenon, and the hard estimates are not used by the ICB in the sth stage (the related interference is neither constructed nor eliminated), which prevents an erroneous decision from propagating to the (s+1)th stage. Applying the hyperbolic tangent function to this outcome reduces to multiplying by 0.75, since tanh(±1) ≈ ±0.76 is approximated by ±0.75.
It is important to note that this decision function not only improves the interference cancellation performance at each stage, but also reduces the method’s complexity by avoiding hyperbolic tangent calculations.
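The stage-dependent decision rule (8) can be sketched as follows. This is only an illustration for real-valued raw outputs, assuming at least two stages; the function names `sign` and `decide` are hypothetical.

```python
import math

def sign(x):
    """Signum of a real number: -1, 0 or +1."""
    return (x > 0) - (x < 0)

def decide(s, n_stages, raw_now, raw_prev=None):
    """Decision function of Eq. (8) for real-valued raw filter outputs.

    s is the 1-based stage index (n_stages >= 2 is assumed);
    raw_prev is the raw output of stage s-1, needed only for 1 < s < n_stages.
    """
    if s == 1:                                     # first stage: soft limiter
        return math.tanh(raw_now)
    if s == n_stages:                              # last stage: hard decision
        return float(sign(raw_now))
    agree = (sign(raw_prev) + sign(raw_now)) / 2   # -1, 0 or +1
    return math.tanh(agree)                        # 0 when the stages disagree
```

For example, `decide(2, 3, 0.4, -0.7)` returns 0.0 because the two stages disagree, so that symbol contributes nothing to the regenerated interference, while agreeing stages give tanh(±1). Since only three output values are possible in the intermediate stages, a hardware design can replace the tanh evaluation by a constant, as the text describes.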

The data \(\left\{ {\hat b_{k,s} \left[ n \right]} \right\}_{n = 1}^N \) are the ICB inputs (Fig. 1). For a given stage s, the first role of the ICB is to construct the kth user’s contribution \(z_{k,s} \left( {t;n} \right)\) using

$$z_{k,s} \left( {t;n} \right) = \hat A_k \hat b_{k,s} \left[ n \right]\sum\limits_{p = 1}^{P_k } {\hat h_{k,p} \left[ n \right]d_k \left( {t - nT - \hat \tau _{k,p} ;n} \right) = \hat A_k \hat b_{k,s} \left[ n \right]\hat \Theta _k \left( {t;n} \right)} .$$
(9)

This process mirrors (5). Unlike in (5), however, \(\hat A_k \), \(\hat h_{k,p} \left[ n \right]\) and the delays \(\widehat\tau _{k,p} \), for k = 1,2,...,K and \(p = 1,\ldots,P_k\), are provided by a channel estimator, possibly a Correlator [27] or a better-performing method [26]. Accordingly, the total contribution of all K users is given by the sum of the individual users’ contributions:

$$Z_s \left( {t;\;n} \right) = \sum\limits_{k = 1}^K {z_{k,s} \left( {t;\;n} \right)} $$
(10)

Therefore, the kth user’s interference can be deduced as:

$$\zeta _{k,s} \left( {t;\;n} \right) = Z_s \left( {t;\;n} \right) - z_{k,s} \left( {t;\;n} \right)$$
(11)

The next stage (s+1) input is built using the received signal (4) and the pre-estimated interference as

$$\hat y_{k,s} \left( {t;n} \right)\, = y\left( {t;n} \right) - \left( {\sum\limits_{k = 1}^K {z_{k,s} \left( {t;n} \right)} - \hat A_k \hat b_{k,s} \left[ n \right]\hat \Theta _k \left( {t;n} \right)} \right),$$
(12)

where \(\left\{ {\hat y_{k,s} \left( {t;n} \right)} \right\}_{n = 1}^N \) constitute the estimates of the received spread spectrum signals, essentially free from MAI and ISI.
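A minimal chip-rate sketch of the cancellation steps (10)–(12) is given below. It assumes that the regenerated contributions \(z_{k,s}\) are already available as equal-length sample lists; the function name `cancel_interference` is hypothetical.

```python
def cancel_interference(y, contributions):
    """Return one cleaned signal per user, following Eqs. (10)-(12).

    y: received chip samples; contributions: list of K sample lists z_k.
    """
    total = [sum(col) for col in zip(*contributions)]             # Z_s, Eq. (10)
    cleaned = []
    for z_k in contributions:
        interference = [Z - z for Z, z in zip(total, z_k)]        # Eq. (11)
        cleaned.append([r - i for r, i in zip(y, interference)])  # Eq. (12)
    return cleaned

# Toy case: two users, no noise, perfectly regenerated contributions.
z1 = [1.0, -1.0, 1.0]
z2 = [0.5, 0.5, -0.5]
y = [a + b for a, b in zip(z1, z2)]
cleaned = cancel_interference(y, [z1, z2])
```

In this idealized case each cleaned signal equals the user's own contribution. The sum in (10) is computed once and reused for every user, so each user only requires subtractions.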

3.2 Adaptation Phase

This phase consists of computing the filter coefficients \({\mathbf{w}}_k\), for k = 1,2,...,K, on the basis of one filter per user. Upon convergence, \({\mathbf{w}}_k\) represents, to some extent, the inverse of the effective code \(\Theta _k \left( {t;n} \right)\) in (5). Of interest is the fact that \({\mathbf{w}}_k\) is intended to be much shorter than \(\Theta _k \left( {t;n} \right)\), which saves considerable computational complexity. The adaptation phase is applied in the first stage only; the resulting coefficients are then duplicated in the subsequent stages, which considerably reduces the adaptation complexity of the MUD.

Before describing the coefficient adaptation process, we need to construct training data (Fig. 2). Indeed, existing commercial DS-CDMA systems (WCDMA and cdma2000) do not give our method access to pre-known or training data [2], with the exception of pilot bits, with which to adjust the filter coefficients. It is important to note that, to ensure convergence, the filters need more than the already-available pilot bits in order to track channel variations, as in a fast-fading context. Therefore, we synthesize such training data, along with a received signal, using the estimated channel impulse response as follows [14] (patented):

  1. Randomly (or using a given distribution), we draw training symbols \(b_k^{{\text{synth}}} \left[ {n^\prime } \right]\), per user k, from the same alphabet S as the original traffic symbols, with \(n^\prime = 1,2,\ldots,N_{{\text{synth}}}\), \(N_{{\text{synth}}}\) being the training sequence length;

  2. Using the pre-estimated channel parameters \(\hat A_k \), \(\hat h_{k,p} \left[ n \right]\) and \(\widehat\tau _{k,p} \), as in (12), we synthesize a received signal \(y_k^{{\text{synth}}} \left( {t;n^\prime } \right)\) per user k as

    $$\begin{gathered} y_k^{{\text{synth}}} \left( {t;n'} \right) = r_k^{{\text{synth}}} \left( {t;n'} \right) + \overline r _k^{{\text{synth}}} \left( {t;n'} \right) + \eta ^{{\text{synth}}} \left( {t;n'} \right) \hfill \\ = \hat A_k \hat b_k^{{\text{synth}}} \left[ {n'} \right]\hat \Theta _k \left( {t;n'} \right) + \overline r _k^{{\text{synth}}} \left( {t;n'} \right) + \eta ^{{\text{synth}}} \left( {t;n'} \right) \hfill \\ \end{gathered} $$
    (13)

Figure 2. Adaptation phase of the proposed multistage MUD, for \(n^\prime = 1,2,\ldots,N_{{\text{synth}}}\) and \(k = 1,2,\ldots,K\).

Because the training data are synthesized using the channel model, we have replaced the real sample index n (Section 2) by n′ to denote synthetic sampling, which has no constraint on or dependence upon the real time of the transmitted data. As shown in (13), \(y_k^{{\text{synth}}} \left( {t;n^\prime } \right)\) contains traffic, pilot and noise contributions. \(b_k^{{\text{synth}}} \left[ {n^\prime } \right]\) is of length \(N_{{\text{synth}}}\) symbols per user. With \(y_k^{{\text{synth}}} \left( {t;n^\prime } \right)\) and \(b_k^{{\text{synth}}} \left[ {n^\prime } \right]\), the adaptation process can start.
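The two synthesis steps can be sketched at chip rate as follows. The sketch makes simplifying assumptions that are not in the text: real-valued signals, each training symbol generated in isolation (no overlap between consecutive symbols), and Gaussian synthetic noise of a chosen standard deviation. The name `synthesize_training` and its parameters are hypothetical.

```python
import random

def synthesize_training(theta, gain, n_synth, noise_std=0.1, pilot=None, seed=0):
    """Sketch of Eq. (13) at chip rate for one user.

    theta: estimated effective code (real chip samples);
    pilot: optional pilot contribution, same length as theta.
    Returns (symbols, signals): one synthesized signal per training symbol.
    """
    rng = random.Random(seed)
    pilot = pilot if pilot is not None else [0.0] * len(theta)
    symbols, signals = [], []
    for _ in range(n_synth):
        b = rng.choice((-1, 1))                    # step 1: draw from S = {+1, -1}
        y = [gain * b * t + p + rng.gauss(0.0, noise_std)  # step 2: Eq. (13)
             for t, p in zip(theta, pilot)]
        symbols.append(b)
        signals.append(y)
    return symbols, signals

symbols, signals = synthesize_training([1.0, -1.0, 1.0, 1.0], gain=1.0, n_synth=8)
```

Because both the symbols and the signals are generated locally, the training length is a design parameter of the receiver and is not limited by the number of pilot bits actually transmitted.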

In the short-code WCDMA context, the traffic spreading sequence (the chip-by-chip product of the scrambling code and the OVSF, Orthogonal Variable Spreading Factor, channelization code) is 256 chips long [2]. In WCDMA, spreading factors \(N_c\) (OVSF) of 16, 8 and 4 correspond to payload data throughputs of 64, 144 and 384 kbps, respectively. We therefore consider \(N_{nc} = 256/N_c\) effective codes; this holds assuming that the channel is constant during one pilot symbol duration. Hence, for each user k, \({\mathbf{w}}_k \left[ {n^\prime } \right]\) consists of \(N_{nc}\) sub-filters, each of which aims to represent a short version of an inverse of the effective code. At first, we take \(N_{SF} = 2N_c\) to be the length of each sub-filter, which yields a total filter length of \(N_w = N_{nc} N_{SF}\). One can thus write

$${\mathbf{w}}_k \left[ {n\prime } \right] \triangleq \left[ {{\mathbf{w}}_{k,1}^{\text{T}} \left[ {n\prime } \right],{\mathbf{w}}_{k,2}^{\text{T}} \left[ {n\prime } \right], \ldots ,{\mathbf{w}}_{k,v}^{\text{T}} \left[ {n\prime } \right], \ldots ,{\mathbf{w}}_{k,N_{nc} }^{\text{T}} \left[ {n\prime } \right]} \right]^{\text{T}} ,$$
(14)

where \({\mathbf{w}}_{k,v} \left[ {n^\prime } \right]\) is the sub-filter corresponding to the n′th training symbol, with \(1 \leqslant v \triangleq \bmod \left( {n^\prime ,N_{nc} } \right) \leqslant N_{nc} \), and mod(⋅) is the modulo operator. The above specialization to short-code WCDMA signaling extends to \(y_k^{{\text{synth}}} \left( {t;n^\prime } \right)\), which is considered in vector form as

$${\mathbf{y}}_k^{{\text{synth}}} \left[ {n^\prime } \right] = \left[ {\begin{array}{*{20}c}{y_k^{{\text{synth}}} \left( {\left( {\left( {n^\prime - 1} \right)N_{SF} - \frac{{N_{SF} }}{4} + 1} \right)T_c + \overline \tau _k ;n^\prime } \right)} \\{y_k^{{\text{synth}}} \left( {\left( {\left( {n^\prime - 1} \right)N_{SF} - \frac{{N_{SF} }}{4} + 2} \right)T_c + \overline \tau _k ;n^\prime } \right)} \\\vdots \\{y_k^{{\text{synth}}} \left( {\left( {n^\prime N_{SF} - \frac{{N_{SF} }}{4}} \right)T_c + \overline \tau _k ;n^\prime } \right)} \\\end{array} } \right].$$
(15)

It can be observed that the term \(N_{SF}/4\) centers \({\mathbf{y}}_k^{{\text{synth}}} \left[ {n^\prime } \right]\) on the n′th symbol of the kth user, in order to take inter-symbol interference (ISI) into account. These representations of \({\mathbf{w}}_k \left[ {n^\prime } \right]\) (14) and \({\mathbf{y}}_k^{{\text{synth}}} \left[ {n^\prime } \right]\) (15) are the ones used in the ADIC method.
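As a quick check of the dimensions involved, the following sketch evaluates the sub-filter layout and the chip indices spanned by the input vector (15). The function names are hypothetical, and the indices are taken relative to the path delay \(\overline \tau _k\).

```python
def subfilter_layout(n_c, code_len=256):
    """Return (N_nc, N_SF, N_w) for spreading factor n_c."""
    n_nc = code_len // n_c       # number of effective codes (sub-filters)
    n_sf = 2 * n_c               # length of each sub-filter
    return n_nc, n_sf, n_nc * n_sf

def window_chips(n_prime, n_sf):
    """Chip indices (in units of T_c) covered by the vector in Eq. (15)."""
    first = (n_prime - 1) * n_sf - n_sf // 4 + 1
    last = n_prime * n_sf - n_sf // 4
    return range(first, last + 1)
```

With these definitions, `subfilter_layout(16)` gives (16, 32, 512) and `subfilter_layout(4)` gives (64, 8, 512): the total length \(N_w = (256/N_c)(2N_c) = 512\) does not depend on the spreading factor. For n′ = 1 and \(N_{SF} = 32\), the window runs from chip −7 to chip 24, so it starts \(N_{SF}/4\) chips early and reaches into the preceding symbol interval.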

Coefficient adaptation can be implemented using many adaptive techniques (e.g. [28]). Set-membership normalized LMS (SM-NLMS) offers a good performance-complexity trade-off, with a convergence speed superior to that of its parent technique, NLMS [29]. The SM-NLMS algorithm has been considered in the DS-CDMA context for multiuser detection [16], for channel estimation [17] and for estimating the interference power [13]. An important feature is its self-adapting step size. For a given user k, the adaptation is described by

$$e_{k,v} \left[ {n^\prime } \right] = b_k^{{\text{synth}}} \left[ {n^\prime } \right] - {\mathbf{w}}_{k,v} \left[ {n^\prime } \right]^{\text{H}} {\mathbf{y}}_k^{{\text{synth}}} \left[ {n^\prime } \right],$$
(16)
$${\mathbf{w}}_{k,v} \left[ {n^\prime + 1} \right] = {\mathbf{w}}_{k,v} \left[ {n^\prime } \right] + \mu _{k,v} \left[ {n^\prime } \right]\frac{{e_{k,v} \left[ {n^\prime } \right]{\mathbf{y}}_k^{{\text{synth}}} \left[ {n^\prime } \right]}}{{{\mathbf{y}}_k^{{\text{synth}}} \left[ {n^\prime } \right]^{\text{H}} {\mathbf{y}}_k^{{\text{synth}}} \left[ {n^\prime } \right]}},$$
(17)
$$\mu _{k,v} \left[ {n^\prime } \right] = \left\{ {\begin{array}{*{20}l}{{{1 - \lambda } \mathord{\left/{\vphantom {{1 - \lambda } {\left| {e_{k,v} \left[ {n^\prime } \right]} \right|}}} \right.\kern-\nulldelimiterspace} {\left| {e_{k,v} \left[ {n^\prime } \right]} \right|}},\quad {\text{if}}} \hfill & {\left| {e_{k,v} \left[ {n^\prime } \right]} \right| >\lambda } \hfill \\{0,} \hfill & {{\text{elsewhere}}} \hfill \\\end{array} } \right.\quad ,$$
(18)

where \(e_{k,v} \left[ {n^\prime } \right]\) is the error for the n′th bit at the output of the vth sub-filter, and the step size \(\mu _{k,v} \left[ {n^\prime } \right]\) depends dynamically on a preset value of λ. Notice that (18) establishes two facts: (1) the term \(1 - \lambda / \left| {e_{k,v} \left[ {n^\prime } \right]} \right|\) is always less than 1 if \(\left| {e_{k,v} \left[ {n^\prime } \right]} \right| >\lambda \), so that SM-NLMS is inherently stable; (2) otherwise \(\mu _{k,v} \left[ {n^\prime } \right]\) is set to 0, which alleviates some of the computational burden. The complexity of SM-NLMS is thus lower than that of NLMS, since no coefficient is updated when \(\mu _{k,v} \left[ {n^\prime } \right] = 0\). Moreover, as shown in (19), replacing \(\mu _{k,v} \left[ {n^\prime } \right]\) in (17) by its expression in (18) when \(\left| {e_{k,v} \left[ {n^\prime } \right]} \right| >\lambda \) removes the division in (18):

$$\begin{array}{*{20}c}{{\mathbf{w}}_{k,v} \left[ {n^\prime + 1} \right] = {\mathbf{w}}_{k,v} \left[ {n^\prime } \right] + \left( {1 - \frac{\lambda }{{\left| {e_{k,v} \left[ {n^\prime } \right]} \right|}}} \right)\frac{{e_{k,v} \left[ {n^\prime } \right]{\mathbf{y}}_k^{{\text{synth}}} \left[ {n^\prime } \right]}}{{{\mathbf{y}}_k^{{\text{synth}}} \left[ {n^\prime } \right]^{\text{H}} {\mathbf{y}}_k^{{\text{synth}}} \left[ {n^\prime } \right]}},} \\{ = {\mathbf{w}}_{k,v} \left[ {n^\prime } \right] + \left( {e_{k,v} \left[ {n^\prime } \right] - {\text{sign}}\left( {e_{k,v} \left[ {n^\prime } \right]} \right)\lambda } \right)\frac{{{\mathbf{y}}_k^{{\text{synth}}} \left[ {n^\prime } \right]}}{{{\mathbf{y}}_k^{{\text{synth}}} \left[ {n^\prime } \right]^{\text{H}} {\mathbf{y}}_k^{{\text{synth}}} \left[ {n^\prime } \right]}}.} \\\end{array} $$
(19)

Thus only one scalar division is needed to update the vector of coefficients. As explained in the next section, this division is not applied at every update, in order to save hardware resources.

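Finally, after convergence, the filter coefficients obtained at synthetic iteration \(N_{{\text{synth}}}\), \({\mathbf{w}}_k \left[ {N_{{\text{synth}}} } \right]\), are used by the detection phase as \({\mathbf{w}}_k = {\mathbf{w}}_k \left[ {N_{{\text{synth}}} } \right]\).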
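One SM-NLMS update, in the division-saving form (19), can be sketched as below. The sketch is restricted to real-valued vectors for simplicity (the paper's signals are complex), and the function name `sm_nlms_step` is hypothetical.

```python
def sm_nlms_step(w, y, b, lam):
    """One SM-NLMS update, Eqs. (16)-(19), for real-valued vectors.

    Returns (new_w, error). The coefficients are left unchanged when
    |error| <= lam, i.e. when the step size of Eq. (18) is zero.
    """
    e = b - sum(wi * yi for wi, yi in zip(w, y))        # a priori error, Eq. (16)
    if abs(e) <= lam:
        return list(w), e                               # no update needed
    energy = sum(yi * yi for yi in y)                   # y^H y, assumed non-zero
    scale = (e - (1 if e > 0 else -1) * lam) / energy   # single division, Eq. (19)
    return [wi + scale * yi for wi, yi in zip(w, y)], e

w = [0.0, 0.0, 0.0, 0.0]
y = [1.0, -1.0, 1.0, 1.0]
w, e = sm_nlms_step(w, y, b=1.0, lam=0.25)
posterior = 1.0 - sum(wi * yi for wi, yi in zip(w, y))
```

In this real-valued setting, an update moves the coefficients just far enough for the a posteriori error magnitude to equal λ (here 0.25), so presenting the same pair again triggers no further update. This is the mechanism by which SM-NLMS skips computations once the error is within the preset bound.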

4 Implementation Description

In WCDMA, the received signal y(t) is composed of blocks of 4 × 10 ms frames, over which the block error rate is computed after decoding. Each 10 ms frame has 38,400 chips and is divided into 15 slots of 38,400/15 = 2,560 chips, each lasting 10 ms/15 ≈ 667 μs. Figure 3 shows the block diagram of the ADIC MUD method. Three phases are used to estimate the transmitted data of the K users, k = 1,...,K:

  • The channel estimation provides all the channel amplitudes, \(\hat h_k \left( {t;n} \right)\), and propagation delays, \(\widehat\tau _k \), needed by the ADIC adaptation phase. Channel estimation is not the subject of this paper; a Correlator can be used, but other methods can be used to boost the MUD performance (e.g. [26]);

  • During ADIC’s adaptation phase, (13)–(18), the channel information is used to construct the effective code \(\widehat\Theta _k \left( {t;n} \right)\) for all k and to compute the corresponding coefficients \({\mathbf{w}}_k\) of each user;

  • Finally, the detection phase, (6)–(12), suppresses the interference with its multistage arrangement and returns the estimated symbols \(\left\{ {\tilde b_{k,N_s } \left[ n \right]} \right\}_{n = 1}^N \), \(N_s\) being the last stage.

Figure 3. Timing diagram of ADIC MUD method.

These three phases are implemented to respect the timing constraint. We refer to [26] for implementation of channel estimation. As shown in Fig. 3, the latency is two slots; at the third slot all phases work concurrently until the end of the received data.
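The timing budget implied by these figures can be worked out with a few lines of arithmetic. The variable names are illustrative only; the numerical values are those quoted above.

```python
FRAME_MS = 10.0            # frame duration in milliseconds
CHIPS_PER_FRAME = 38400
SLOTS_PER_FRAME = 15

chips_per_slot = CHIPS_PER_FRAME // SLOTS_PER_FRAME      # 2560 chips
slot_us = FRAME_MS * 1000.0 / SLOTS_PER_FRAME            # about 666.7 microseconds
chip_rate_mcps = CHIPS_PER_FRAME / (FRAME_MS * 1000.0)   # 3.84 Mchip/s
latency_us = 2 * slot_us                                 # two-slot start-up latency
```

Each of the three phases therefore has roughly 667 μs to process one slot once the pipeline of Fig. 3 is full, and the two-slot latency corresponds to about 1.33 ms.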

4.1 Detection Phase

In this phase, we assume that the filter coefficients \({\mathbf{w}}_k = {\mathbf{w}}_k \left[ {N_{{\text{synth}}} } \right]\), computed in the preceding adaptation phase, are available. Figure 4 describes the ADIC detection procedure for the kth user, which follows a pipeline structure composed of three processing elements (PEs). Each PE is shown in Fig. 5. We consider \(N_s = 3\) stages; as shown in the performance analysis of Fig. 10, three stages give the best performance-complexity trade-off for the ADIC MUD. To reduce the required memory size and to keep data communications local to the respective PE, we divide the N symbols (\(N = 2560/N_c\), the number of data symbols per slot) into Q sequences of length \(N_p\), with \(N_p = 16\) and \(Q = N/N_p\). We can write the data sequence \(\left\{ {\tilde b_{k,s} \left[ n \right]} \right\}_{n = 1}^N \) as a vector \(\widetilde{\mathbf{b}}_{k,s} \) formed by concatenating the sub-vectors \(\widetilde{\mathbf{b}}_{k,s,q} \left[ {i_q } \right]\):

$$\widetilde{\mathbf{b}}_{k,s} = \left[ {\widetilde{\mathbf{b}}_{k,s,1}^T \left[ {i_1 } \right],\;{\mathbf{\tilde b}}_{k,s,2}^T \left[ {i_2 } \right],\;\; \ldots ,\;\,\widetilde{\mathbf{b}}_{k,s,q}^T \left[ {i_q } \right],\;\; \ldots ,\;\widetilde{\mathbf{b}}_{k,s,Q}^T \left[ {i_Q } \right]\;} \right]^T ,$$
(20)

where \(\widetilde{\mathbf{b}}_{k,s,q} \left[ {i_q } \right] = \left[ {\tilde b_{k,s,q} \left[ {i_{q,1} } \right],\;\tilde b_{k,s,q} \left[ {i_{q,2} } \right],\; \ldots ,\;\tilde b_{k,s,q} \left[ {i_{q,n_p } } \right],\; \ldots ,\;\tilde b_{k,s,q} \left[ {i_{q,N_p } } \right]} \right]\) and \(i_q = \left\{ {i_{q,n_p } } \right\}_{n_p = 1}^{N_p } \) with \(i_{q,n_p } = \left( {q - 1} \right)N_p + n_p \), for \(q = 1,2,\ldots,Q\) and \(n_p = 1,2,\ldots,N_p\).
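The partitioning of one slot into Q groups of \(N_p\) symbols can be sketched as follows. The function name `partition_indices` is hypothetical, and the indices are 1-based to match the notation above.

```python
def partition_indices(n_c, n_p=16, chips_per_slot=2560):
    """Return the Q index groups i_q of Eq. (20), using 1-based indices."""
    n = chips_per_slot // n_c      # N: number of symbols in the slot
    q_total = n // n_p             # Q: number of partitions
    return [[(q - 1) * n_p + k for k in range(1, n_p + 1)]
            for q in range(1, q_total + 1)]

groups = partition_indices(16)     # N_c = 16 gives N = 160 and Q = 10
```

For \(N_c\) = 16, 8 and 4 this yields Q = 10, 20 and 40 partitions, respectively, each holding 16 consecutive symbol indices, so the buffers inside a PE only need to hold \(N_p\) symbols at a time.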

Figure 4. Procedure of ADIC detection phase for 3 stages (\(N_s = 3\)) and a user k.

Figure 5. Hardware resources description for ADIC detection phase: (a) FB PE, (b) SB PE and (c) IC PE.

In Fig. 4, the detection filter block (FB) uses the same coefficients \({\mathbf{w}}_k\) for every stage s and every partition q of the kth user, and corresponds to

$$\tilde b_{k,s,q} \left[ {i_{q,n_p } } \right] = {\mathbf{w}}_k^{\text{H}} \hat y_{k,s - 1,q} \left[ {i_{q,n_p } } \right].$$
(21)

The detection filter block consists of the PE presented in Fig. 5a.

The spreading block (SB) PE executes equation (22). This PE is described in Fig. 5b, where the block Badd1 consists of 5 parallel adders.

$$\begin{aligned} z_{k,s,q} \left( {t;i_{q,n_p } } \right) &= \hat b_{k,s,q} \left[ {i_{q,n_p } } \right]\hat \Theta _k \left( {t;i_{q,n_p } } \right) = f\left( {\tilde b_{k,s,q} \left[ {i_{q,n_p } } \right]} \right)\hat \Theta _k \left( {t;i_{q,n_p } } \right) \\ &= \hat b_{k,s,q} \left[ {i_{q,n_p } } \right]h_k \left( {t;i_{q,n_p } } \right) \otimes d_k \left( {t;i_{q,n_p } } \right). \end{aligned}$$
(22)

A look-up table (LUT) is employed to implement the hyperbolic tangent of the decision function f(⋅), equation (8). Furthermore, \(\widehat\Theta _k \left( {t;i_{q,n_p } } \right)\) can be computed with a multiplier-free design, since \(\left\{ {d_k \left( {t;i_{q,n_p } } \right)} \right\}\) is a sequence of ±1.
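Both ideas can be sketched in a few lines. The table depth, the clipping range and the function names below are hypothetical choices made for illustration, and ⊗ is assumed here to denote a convolution of the channel taps with the chip sequence.

```python
import math

LUT_SIZE = 64      # hypothetical table depth
LUT_RANGE = 4.0    # hypothetical clipping range: inputs are limited to |x| <= 4

# tanh sampled on a uniform grid over [0, LUT_RANGE]; odd symmetry covers x < 0.
TANH_LUT = [math.tanh(i * LUT_RANGE / (LUT_SIZE - 1)) for i in range(LUT_SIZE)]


def tanh_lut(x):
    """Approximate tanh(x) by the nearest table entry."""
    mag = min(abs(x), LUT_RANGE)
    idx = int(mag * (LUT_SIZE - 1) / LUT_RANGE + 0.5)
    return math.copysign(TANH_LUT[idx], x)


def effective_code(h, d):
    """Convolve channel taps h with a +/-1 chip sequence d using add/subtract only."""
    out = [0] * (len(h) + len(d) - 1)
    for i, chip in enumerate(d):
        for j, tap in enumerate(h):
            if chip > 0:
                out[i + j] += tap   # multiplying by +1 is a plain addition
            else:
                out[i + j] -= tap   # multiplying by -1 is a plain subtraction
    return out
```

With these table parameters the grid step is about 0.063, so the nearest-entry error stays below about 0.032.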

Expressions (23)–(25) represent the operations of the interference canceller block (ICB). As shown in Fig. 5c, the corresponding ICB PE is realized with only 5 parallel adders, B add2.

$$Z_{s,q} \left( {t;i_{q,n_p } } \right) = \sum\limits_{k = 1}^K {z_{k,s,q} \left( {t;i_{q,n_p } } \right)} ,$$
(23)
$$\xi _{k,s,q} \left( {t;i_{q,n_p } } \right) = z_{k,s,q} \left( {t;i_{q,n_p } } \right) + Z_{s,q} \left( {t;i_{q,n_p } } \right),$$
(24)
$$\hat y_{k,s,q} \left( {t;i_{q,n_p } } \right)\, = y_q \left( {t;i_{q,n_p } } \right) + \xi _{k,s,q} \left( {t;i_{q,n_p } } \right).$$
(25)

Notice that in Figs. 4 to 7, vectors with indices i q , rather than \(i_{q,n_p } \), are used in order to represent groups of N p data.

The 3-D graph in Fig. 6 shows the data flow of the ADIC detection structure as a function of the PEs, the stages (s) and the users. For the kth user: (1) FB PEs share the same coefficients; (2) SB PEs share the same effective code; and (3) for the same stage s, ICB PEs share the sum in (23) for all users. Note that the same multistage detection structure is repeated for all users, even if the signals are different.

The detection phase timing diagram in Fig. 7, based on the data dependency graph in Fig. 6, describes how the Q partitions of user k propagate through the detection structure in a pipelined fashion. The detection clock cycle, \(T_{clk}^d \), is the same for each PE and is imposed by the slowest PE, which depends on the N c considered: the FB PE for N c = 16 and the SB PE for N c = 8 and 4. The pipeline is full at q = 3 and at s-2, after which estimated data are produced at each clock cycle. This yields the first 16 estimated symbols. Finally, the latency and the throughput are \(7T_{clk}^d \) and \(T_{clk}^d \) respectively.
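The stated latency and throughput can be turned into a rough schedule. The sketch below assumes an idealized pipeline in which one partition enters per clock cycle and no stall occurs; this is an assumption made for illustration and is not read off Fig. 7. The function names are hypothetical.

```python
PIPELINE_LATENCY = 7   # latency in detection clock cycles, as stated in the text


def output_cycle(q, latency=PIPELINE_LATENCY):
    """Clock cycle (1-based) at which partition q leaves an ideal, stall-free pipeline."""
    return latency + (q - 1)


def frame_cycles(q_total, latency=PIPELINE_LATENCY):
    """Clock cycles needed to output all Q partitions of one user."""
    return output_cycle(q_total, latency)


cycles = frame_cycles(10)   # Q = 10 partitions for N c = 16
```

Under these assumptions, the Q = 10 partitions of the N c = 16 case are all detected after 16 clock cycles.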

Figure 6

3-D data dependency graph of ADIC detection phase.

Figure 7

Timing diagram of ADIC detection phase for a user k.

4.2 Adaptation Phase

The adaptation phase consists of three operations, as depicted in Fig. 8: (1) the effective code computation, \(\widehat\Theta _k \left( {t;n} \right)\), for all K users, (2) the synthetic signal construction using (13), and (3) the coefficient update using an adaptive method based on SM-NLMS [29], (16)–(18), which returns w k . Figure 8 presents the timing diagram for these operations. The coefficient adaptation is divided into three sub-operations, each implemented as a PE (Fig. 9): (1) the adaptation filter block (FBadapt), which computes equations (6)–(8), (2) the error and step-size block (ESB), which computes (16) and (18), and (3) the update block (UB) for (17). Each operation is executed by its respective PE, which has particular characteristics:

  • FBadapt PE, Fig. 9a, computes the real and imaginary parts of the data separately. In order to reduce hardware resources, this same PE is multiplexed to perform: (1) the effective code computation, (2) the bit estimation [right side of (16)], and (3) the synthetic signal norm calculation (19) of the adaptive SM-NLMS processing for one user.

  • ESB PE, Fig. 9b, only computes the right side of the SM-NLMS update expression (19) of the kth user. From an implementation point of view, it is important to notice that the complex division operator present in this PE is not used at every instant n′ and can be multiplexed with other users. Thanks to the low dynamic range of the magnitudes of \({\mathbf{y}}_k^{{\text{synth}}} \left[ {n^\prime } \right]\), the divider can be used (3N c /256)N synth times instead of N synth times, without performance loss for all users and all data rates.

  • UB PE, Fig. 9c: the addition block Badd3, an arrangement of 3 parallel adders, is the only arithmetic operator of the UB PE, which computes: (1) the synthesized received signal \({\mathbf{y}}_k^{{\text{synth}}} \left[ {n^\prime } \right]\) from \(b_k^{{\text{synth}}} \left[ {n^\prime } \right]\) and \(\widehat\Theta _k \left( {t;n^\prime } \right)\), and (2) the new coefficients \({\mathbf{w}}_{k,v} \left[ {n^\prime + 1} \right]\), from the coefficients \({\mathbf{w}}_{k,v} \left[ {n^\prime } \right]\), during the SM-NLMS adaptation process (19). Once calculated, the coefficients \({\mathbf{w}}_{k,v} \left[ {n^\prime + 1} \right]\) replace the previous coefficients in the corresponding memory.
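The division of work between the three PEs can be illustrated with a sketch of one set-membership NLMS iteration. It follows the textbook form of SM-NLMS, which may differ in detail from (16)–(18); the function name `sm_nlms_step` and the argument `lam` (the error bound λ) are hypothetical names.

```python
def sm_nlms_step(w, y, d, lam):
    """One textbook set-membership NLMS step; returns (new_w, updated)."""
    # Filter output and a priori error (roles of the FBadapt and ESB PEs).
    e = d - sum(wi.conjugate() * yi for wi, yi in zip(w, y))
    if abs(e) <= lam:
        return list(w), False        # error within the bound: no update
    norm = sum(abs(yi) ** 2 for yi in y)
    if norm == 0:
        return list(w), False        # all-zero input: nothing to adapt on
    mu = 1.0 - lam / abs(e)          # data-dependent step size
    g = mu * e.conjugate() / norm    # the only division of the iteration
    # Coefficient update (role of the UB PE).
    return [wi + g * yi for wi, yi in zip(w, y)], True
```

After an update, the a posteriori error magnitude equals the bound λ, which is why samples whose error is already within the bound can be skipped.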

Figure 8

Timing diagram of ADIC adaptation phase N T=N synth.

Figure 9

Hardware resources description for ADIC adaptation phase: a FBadapt PE, b ESB PE, and c UB PE.

With the pipeline process of Fig. 8, the adaptation clock cycle, \(T_{clk}^a \), is given by the slowest PE block. At the beginning, for n′ = 1, the FB is in operation and produces a symbol estimate, \({\mathbf{w}}_{k,v} \left[ {n^\prime } \right]^{\text{H}} {\mathbf{y}}_k^{{\text{synth}}} \left[ {n^\prime } \right]\) [cf. (16) for the kth user]. After \(T_{clk}^a \), signals at n′ = 2 are available for the FB and at n′ = 1 for the ESB, which gives e k,v [n′] (16) and μ k,v [n′] (18). At the next \(T_{clk}^a \), signals at n′ = 3 are available for the FB, at n′ = 2 for the ESB and at n′ = 1 for the UB. The architecture is pipelined in this way until n′ = N synth. The latency and throughput are \(2T_{clk}^a \) and \(T_{clk}^a \) respectively. There are K identical and independent adaptation processes and structures, one per user. Note that \(T_{clk}^a \) is independent of \(T_{clk}^d \) and that each of them depends on the timing diagram shown in Fig. 3.

5 Simulation

5.1 Performance Results

Experiments are conducted in a WCDMA environment. The basic simulation conditions are: a raised-cosine pulse-shaping filter, ψ(t), with a roll-off factor of 0.22; a «Vehicular A» channel with P k = P = 6 paths; a mobile speed of 3 km/h; a carrier frequency of 2 GHz; one transmitting and one receiving antenna. The channel amplitudes, h k,p , are estimated by a Correlator [27], and the estimates of the channel delays τ k,p are considered perfect. Note that the K pilot signals have been cancelled at the receiver using a pilot cancellation process [14]. For reference and comparison, the Rake and the Decision Feedback Soft Multistage Interference Canceller (DF-Soft-MPIC) [3, 4], using an auto-correlation matrix with 5 stages, are included. Table 1 presents the ADIC parameters used for all simulations unless otherwise indicated. Finally, 6000 data slots have usually been considered in each simulation in order to generate satisfactory average raw bit error rate (BER) results. In our simulations, we consider that, for BER results under 5%, the decoder following the MUD is able to recover all of the transmitted data.

Table 1 ADIC’s parameters for two different spreading codes.

For N c = 16 (Fig. 10a) and N c = 8 (Fig. 10b), the fifth stage of ADIC gives a BER equivalent to or better than that of DF-Soft-MPIC. Performance-wise, the ADIC MUD can be tailored to work with N s = 3 stages while maintaining a good performance-complexity trade-off. It is worth mentioning that the ADIC MUD provides the same results as the Rake at the first stage (N s = 1). This is an interesting point, since the Rake output can be used for other applications inside the BS, such as power control.

Figure 10

E b /N o (dB) versus users’ number at 64 kbps a and 144 kbps b, at 3 kph, to obtain BER = 5% with the Rake, DF-Soft-MPIC and ADIC.

As the mobile speed increases (Fig. 11), the performance of the MUD methods degrades because of the non-optimal performance of the Correlator channel estimator. However, ADIC is less sensitive than DF-Soft-MPIC. Indeed, the more the speed increases, the more the BER of the ADIC fourth stage improves on that of the DF-Soft-MPIC fifth stage.

Figure 11

E b /N o (dB) versus mobiles speed for K = 10 at 64 kbps to obtain BER = 5% with DF-Soft-MPIC and ADIC.

It is known that the key to the commercial deployment of a MUD method is a low implementation complexity [4] for the desired performance. The adaptive approach of ADIC makes it possible to fine-tune the performance-complexity trade-off. We studied the sensitivity of the ADIC and DF-Soft-MPIC outputs to the period at which the coefficients are adapted within one time frame (1 frame = 15 slots [2]). When the adaptation period was changed from one slot to 15 slots, losses of 0.35 dB and 0.45 dB were observed for ADIC and DF-Soft-MPIC, respectively. For DF-Soft-MPIC, the adaptation phase consists of the computation of the Rake and of its auto-correlation matrix; for ADIC, it consists of (13)–(18). Here again ADIC is less sensitive: to obtain the same results as DF-Soft-MPIC with 15 adaptations per frame, ADIC needs only 10 adaptations per frame. In pedestrian and high-speed mobile contexts, the filter coefficients can be adapted every 15 slots and every slot, respectively.

5.2 Interest of the Proposed Adaptive Structure

To show the interest of the ADIC adaptation phase structure, the use of the SM-NLMS method has been studied as a way to reduce the adaptation complexity while ensuring convergence. For that purpose, we introduced into the simulations another adaptive MUD, AL-MMSE [10], based on an NLMS adaptation. To ensure convergence in pedestrian conditions, the AL-MMSE adaptation needs a long training sequence of size N synth = 2400, 5 times longer than that of ADIC. The AL-MMSE MUD adapts its K filters using the same received signal containing the contributions of the K users, whereas ADIC updates each filter using only the contribution of the corresponding user (AB k in Fig. 2). Results from two differently parameterized ADIC methods (λ = 0.005 and 0.02) are presented in Fig. 12. As explained before, λ represents the error value below which the update is not carried out. The AL-MMSE method, with its long training sequence, performs worse than the ADIC second stage at a much higher complexity (because of the required training sequence size). Moreover, the use of the SM-NLMS adaptive method, depending on the selected value of λ, also allows significant computation savings. Indeed, compared to NLMS, which uses all (100%) of the training data for the coefficient update, SM-NLMS uses 65% of the update sequence (Eq. (17)) with λ = 0.005 and 55% with λ = 0.02, which is equivalent to ≈20 and ≈17 iterations per sub-filter, respectively, instead of 30 iterations; a saving favorable to hardware implementation. These update reductions are observed to be constant over the whole E b /N 0 range of Fig. 10.
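The iteration counts quoted above follow from the update percentages by simple arithmetic, as the following sketch shows. The function name `expected_updates` is hypothetical; the 30 iterations per sub-filter and the 65% and 55% update rates are the values given in the text.

```python
def expected_updates(iterations, update_percent):
    """Round iterations * update_percent / 100 to the nearest integer, halves up."""
    # Integer arithmetic avoids floating-point rounding surprises at x.5 values.
    return (iterations * update_percent + 50) // 100


updates_small_bound = expected_updates(30, 65)   # error bound of 0.005
updates_large_bound = expected_updates(30, 55)   # error bound of 0.02
```

The two calls give 20 and 17 updates, respectively, which matches the ≈20 and ≈17 iterations reported.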

Figure 12

BER versus E b / N o with K = 10 at 64 kbps and 3 kph, for ADIC with λ = 0.02 and 0.005, DF-Soft-MPIC, the Rake and AL-MMSE receivers, simulated with 1500 data slots.

5.3 Complexity Analysis

In this section, we apply an approach for a fair arithmetic complexity comparison, based on a complexity benchmark from the point of view of VLSI technologies such as FPGA and ASIC hardware implementations.

As a first step, it is necessary to count the number of additions and multiplications. We consider the following parameters: N c the spreading factor, N h the maximum delay spread of the channel, P the number of paths, N synth the number of adaptation symbols in ADIC, and m the MPIC parameter that takes ISI into account in its correlation matrix, \(m = \left\lceil {\left( {N_c + N_h - 1} \right)/N_c } \right\rceil \), where \(\left\lceil \cdot \right\rceil \) denotes the ceiling. Notice that these algorithms need many more additions than multiplications because of the presence of ±1 values in the algorithm execution. In our evaluation, we excluded multiplications by ±1. In order to make a fair arithmetic complexity comparison, we use a unified framework for all these techniques based on the elementary arithmetic unit used to build an adder or a multiplier: the full adder (FA). In VLSI technology, multiplication and addition operations have the same binary structure, with a word length adjusted to ensure the required precision. We consider that an addition requires N q FA and a multiplication \(N_q^2 \) FA, N q being the number of bits used to quantize each parameter of the MUDs studied.
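The full-adder cost model can be written down directly. In the sketch below, the function names and the operation counts passed to them are hypothetical; only the cost rule (N q FA per addition, N q² FA per multiplication) and the definition of m come from the text.

```python
def full_adders(n_add, n_mul, n_q=16):
    """Full-adder count: N_q per addition plus N_q**2 per multiplication."""
    return n_add * n_q + n_mul * n_q ** 2


def mpic_isi_span(n_c, n_h):
    """m = ceil((N_c + N_h - 1) / N_c), computed with integer arithmetic."""
    return -(-(n_c + n_h - 1) // n_c)


cost = full_adders(n_add=10, n_mul=2)   # hypothetical operation counts
```

With N q = 16, one multiplication costs as much as 16 additions, which is why excluding the ±1 multiplications matters in the comparison.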

For a fully loaded BS receiver, K = N c , the required number of FA for ADIC and DF-Soft-MPIC relative to the Rake receiver is shown in Fig. 13 for N c = 16, 8 and 4. For all methods, 15 updates (adaptation phases) per frame (one per slot), N s = 3 and N q = 16 bits are considered. This result reveals that DF-Soft-MPIC is 34 times more complex than the conventional Rake receiver, while ADIC is only 4.0 to 6.8 times more complex. For K = N c , ADIC presents a 4- to 8-fold complexity reduction compared to DF-Soft-MPIC. ADIC presents a constant FA ratio relative to the Rake versus the number K of simultaneously received mobile users.

Figure 13

Required number of FA for ADIC and DF-Soft-MPIC relative to the Rake receiver with 15 updates per frame, N q = 16 bits and N s = 3, for N c = {4,8,16}.

5.4 Implementation Preliminary Results

In this section, we give some preliminary results on the processing time and hardware resource estimates of the ADIC architecture described in the previous section. Targeting FPGA technology, we described the architecture of each block of the detection (Fig. 5) and adaptation (Fig. 9) phases in terms of additions, multiplications, multiplexers, registers, etc. We take N synth = 3N for the adaptation and N s = 3 for the detection. We evaluated ADIC in fixed point: a word length of 16 bits is sufficient to maintain performance similar to floating point, with an E b /N 0 loss below 0.1 dB.

Assuming the pipeline implementation structure and that an addition and a multiplication operation can be respectively performed at a frequency of 200 MHz and 100 MHz [23], Fig. 14 presents processing time results for both adaptation and detection phases. Here, we consider for all N c a full load receiver. From Fig. 14, one can draw the following remarks:

  • For both ADIC phases and all N c , processing times are lower than the slot time (the reference time constraint);

  • The processing time is independent of K because the resources grow with K (cf. Fig. 15);

  • The adaptation process is slower than the detection process. Indeed, the two phases work in parallel and share the same FPGA;

  • The detection needs 3 FB, 2 SB and 2 ICB per user, which requires many FPGA embedded multipliers and slices. We therefore had to economize hardware to implement the adaptation phase, and this hardware economy is reflected in the adaptation processing time.

Figure 14

ADIC MUD treatment time for N c  = {4,8,16}, for adaptation and detection phases and for a lower resources detection structure at N c  = 16.

Figure 15

Total number of hardware arithmetic resources for ADIC: a adders and b multipliers for the proposed structure and a lower resources detection structure at N c  = 16.

Arithmetic operations represent the most important hardware resources needed to realize the pipeline structure of ADIC, given that its control units are less complex than those of the previously proposed MUD [23]. These resources are shown in Fig. 15, which presents the total number of 16-bit adders and 16-bit multipliers corresponding to Fig. 14. For N c = 16, no more than 500 adders (7500 slices) and 160 embedded multipliers are needed to implement 16 users. With these results and an analysis of the required memory, and according to the Virtex-II Pro data sheet [25], the fully loaded ADIC MUD integrated into a Virtex-II Pro XC2VP40, which contains 19 392 slices and 192 embedded multipliers, accepts more users than the results reported in [23].

From Fig. 15, we observe the hardware constraint imposed by the N c = 16 case and the low processing time of the detection phase. To decrease the hardware resources, a second detection structure has been proposed for 64 kbps, with no impact for N c = 8 and 4. The modifications take advantage of the PE regularities to time-multiplex the data computations: (1) use only one multiplier in the spreading block (SB) PE (Fig. 5b) and (2) decrease the number of parallel adders in Badd1 and Badd2 (Fig. 5c) to 3 instead of 5 in both cases. Of course, the processing time of the detection phase increases, but those modifications make it possible, as shown in Fig. 15, to implement 16 users with 416 16-bit adders (6,240 slices) and 128 16-bit multipliers instead of 160 (a 20% reduction). In this case, a Virtex-II Pro XC2VP30, which has 13,693 slices and 136 embedded multipliers, can be used to implement the ADIC MUD for 64 kbps at full load.
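The device selection above can be checked with a rough resource budget. The sketch below is illustrative only: the figure of 15 slices per 16-bit adder is inferred from the adder and slice counts quoted in the text, the device capacities are those quoted in the text, and memory, control logic and routing are ignored. The names `DEVICES` and `fits` are hypothetical.

```python
SLICES_PER_ADDER = 15   # inferred from the text: 500 adders -> 7500 slices

# (slices, embedded multipliers) per device, as quoted in the text
DEVICES = {
    "XC2VP40": (19392, 192),
    "XC2VP30": (13693, 136),
}


def fits(device, adders, multipliers):
    """True if the adder slices and the multipliers fit; memory and control are ignored."""
    slices, mults = DEVICES[device]
    return adders * SLICES_PER_ADDER <= slices and multipliers <= mults


original_on_vp40 = fits("XC2VP40", 500, 160)   # proposed structure
original_on_vp30 = fits("XC2VP30", 500, 160)   # limited by the multipliers
reduced_on_vp30 = fits("XC2VP30", 416, 128)    # lower-resource structure
```

Under these assumptions, the original structure fits the XC2VP40 but exceeds the multiplier count of the XC2VP30, whereas the lower-resource structure fits the smaller device.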

5.5 Beyond the Arithmetic Complexity

Another important aspect when comparing implementation complexity is the algorithmic structure: regularity, recursiveness, data flow, memory quantity and inherent parallelism, all of which are qualities intrinsic to the proposed ADIC method.

In this study, these aspects have not been included in the comparison of MUD methods. However, an obvious observation can be made about the decision feedback structure of MUDs. Indeed, even if a decision feedback structure might have roughly the same complexity level, its main drawback is the lack of exploitable parallelism, especially for the MPIC, because of data dependencies. In fact, a DF-Soft-MPIC at instant n and for user k must wait until the other users have detected their current data before processing its own data. Such a structure cannot exploit pipeline or parallel techniques and reduces to serial operation, as in a sequential DSP implementation. Hence, the DF-Soft-MPIC will always be limited by the DSP clock speed in meeting the computational time imposed by the 3GPP time frame. Note that ADIC does not use a decision feedback structure, so that parallel implementation techniques can be exploited.

When 3 ≤ N s ≤ 5, it is worth mentioning that ADIC can be optimized for a better performance-complexity trade-off. Performance is the gain in dB achieved at a target bit error rate compared to the reference method, and complexity is the implementation cost in a VLSI technology such as a DSP (digital signal processor), an FPGA (field-programmable gate array) or an ASIC (application-specific integrated circuit). ADIC inherently offers the flexibility to tune the performance-complexity trade-off through parameters such as N synth and N s . Compared to the best-known technique, DF-Soft-MPIC, the same performance in dB is obtained with less arithmetic implementation complexity (see the results in Sect. 5.3).

6 Conclusion

We proposed and investigated the performance and complexity of a new MUD based on an adaptive filter block and an interference canceller block in a cascade arrangement without decision feedback. It is known that the key to the commercial deployment of MUD in the BS is a low-complexity method whose performance reaches that of the soft multistage parallel interference canceller (DF-Soft-MPIC). The proposed adaptive duplicated filters and interference canceller (ADIC) meets this expectation. The AFB uses synthesized signals, built with the aid of the channel estimates, to form a synthesized received signal per user. The latter is used as a training signal. The per-user adaptation lowers the complexity of the adaptation process, while the introduction of the ICB ensures, as the number of stages increases, interference-free signals at the input of the next AFB. One short filter per user allows a considerable complexity reduction.

In addition, we have proposed a VLSI implementation strategy and a hardware resource evaluation of the ADIC MUD method. The implementation strategy takes into account the regularity of the algorithm by applying pipeline processes. In terms of arithmetic complexity, ADIC is 4 to 8 times less complex than the reference method, DF-Soft-MPIC. From the evaluation of the hardware implementation, based on a pipeline strategy that takes advantage of the ADIC method, we noted that it is possible to implement this MUD method in a Virtex-II Pro XC2VP30 for a fully loaded base station receiver.

Future work will consist of applying the ADIC method to multiple-input multiple-output (MIMO) systems and to orthogonal frequency-division multiple access (OFDMA) technologies, while always respecting the performance-complexity trade-off.