In this article, we propose to directly match 3D traces generated through different methods, instead of phase sequences, to avoid losing spatial resolution. The previous section has introduced how we generate observed traces through a pure vision method. In this section, we introduce the proposed trace conversion model, a Transformer-based seq2seq model that takes the RFID phase sequence of a tag and the frame sequence corresponding to a tagged object as inputs, and outputs a simulated trace based on a hypothesized correspondence between them. The implementation of the model is elaborated as follows.
4.1 Theoretical Analysis
Before diving into the details of the trace conversion model, let us start by analyzing the theoretical basis for generating 3D traces from phase measurements and 2D images.
Typically, the signal received at an RFID reader can be viewed as a superposition of carrier signals generated by the reader and modulated signals backscattered by the tag. The former come from circulator leakage and environmental scattering [
17] while the latter include modulated signals that are transmitted through direct and indirect paths [
39]. Accordingly, if the distance between an RFID reader and a tag is
\(d(t)\) at time
t, the received signal
\(r(t)\) can be expressed as
where
\(\lambda\) is the wavelength,
\(b(t)\) is the modulated signal generated by the tag, and
\(s(t)\) is the carrier signal transmitted by the reader, which typically is a continuous sinusoid wave represented as
\(s(t)=e^{j2\pi ct/\lambda }\), where
c is the speed of light. We denote the attenuation introduced in a propagation process with
\(\alpha\). Among these terms,
\(\alpha _l\),
\(\alpha _T\), and
\(\alpha _w\) are determined by the electromagnetic characteristics of the circuits in the reader and the tag, and by the material of the reflective surfaces [
5]. Considering that each of them can introduce an additional phase change, we denote the unknown phase terms as
\(\theta _l\),
\(\theta _T\), and
\(\theta _w\) accordingly.
\(\alpha _d\) denotes path loss in free-space propagation, which is related to the distance as defined in the Friis equation [
23].
N is the number of propagation paths and
\(n(t)\) is the additive Gaussian white noise. Signals scattered twice by the surroundings are ignored in Equation (
4) since they tend to be severely attenuated.
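For concreteness, the superposition described above can be written as the following sketch, in which the exact grouping of the attenuation and phase factors is our assumption and \(\alpha _{d_i}\), \(d_i\) aggregate the indirect-path attenuations and lengths as defined below:
\[ \begin{aligned} r(t) ={}& \alpha _l e^{j\theta _l} s(t) + \sum _{i=1}^{N-1}\alpha _{w_i} e^{j\theta _{w_i}} s(t) + \alpha _T \alpha _d^{2}\, e^{j\theta _T} e^{-j\frac{4\pi d(t)}{\lambda }}\, b(t)\, s(t) \\ &+ \sum _{i=1}^{N-1}\alpha _T \alpha _d \alpha _{d_i}\, e^{j(\theta _T + \theta _{w_i})} e^{-j\frac{2\pi (d(t)+d_i)}{\lambda }}\, b(t)\, s(t) \\ &+ \sum _{i=1}^{N-1}\alpha _T \alpha _d \alpha _{d_i}\, e^{j(\theta _T + \theta _{w_i})} e^{-j\frac{2\pi (d_i+d(t))}{\lambda }}\, b(t)\, s(t) + n(t), \end{aligned} \]
where the five terms correspond to circulator leakage, environmental scattering, the direct backscatter path, and the two single-reflection indirect paths (reader-wall-tag-reader and reader-tag-wall-reader), respectively.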
As can be seen from Equation (
4), the first two terms, which are irrelevant to the tag, do not introduce new frequency components into the carrier signal and can be filtered out after demodulation. Furthermore, with channel reciprocity, the fourth and fifth terms can be merged. Therefore, we can simplify Equation (
4) for the demodulated, DC-filtered received signal as
where
\(\alpha _{d_i} = \alpha _{w_i}\alpha _{d_{R-\gt w_i}}\alpha _{d_{w_i-\gt T}} = \alpha _{w_i}\alpha _{d_{T-\gt w_i}}\alpha _{d_{w_i-\gt R}}\) and
\(d_i = d_{R-\gt w_i}+d_{w_i-\gt T} = d_{T-\gt w_i}+d_{w_i-\gt R}\).
Accordingly, the transfer function can be calculated as
where
\(n^{\prime }(t) = \frac{n(t)}{b(t)}\). Then, we can obtain the phase measurement
\(\theta (t)\) at time
t as
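Under these definitions, the simplified received signal, the transfer function, and the reported phase can be sketched as follows; the factor of 2 from merging the two reciprocal indirect terms and the \(\bmod\, 2\pi\) convention are our assumptions:
\[ \tilde{r}(t) = \alpha _T \alpha _d^{2}\, e^{j\theta _T} e^{-j\frac{4\pi d(t)}{\lambda }}\, b(t) + \sum _{i=1}^{N-1} 2\alpha _T \alpha _d \alpha _{d_i}\, e^{j(\theta _T + \theta _{w_i})} e^{-j\frac{2\pi (d(t)+d_i)}{\lambda }}\, b(t) + n(t), \]
\[ h(t) = \frac{\tilde{r}(t)}{b(t)}, \qquad \theta (t) = \angle\, h(t) \bmod 2\pi . \]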
Considering a situation where the signal of the direct path dominates the received signal, which means that
the measured phase can then be expressed as
which turns the reported phase value
\(\theta (t)\) into an indirect estimation of the tag-antenna distance
\(d(t)\). Based on Equation (
9), we can represent the distance as
where
n is an unknown integer caused by phase wrapping. Unfortunately, the reported phase
\(\theta (t)\) still cannot be directly utilized for estimating the tag-antenna distance due to the two ambiguity terms. To overcome this problem, a commonly utilized solution is to estimate the change of the tag-antenna distance instead, which can be denoted as
where we assume that the random term
\(\theta _T\) remains constant across the two sampling instants. As Equation (
11) implies, we can eliminate the ambiguity terms when estimating the change of the tag-antenna distance, as long as the distance changes by less than half a wavelength between the two sampling instants; this can be satisfied by adding speed restrictions according to the typical sampling rates of individual tags in specific applications. Then, Equation (
11) can be simplified as
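Putting these steps together, the chain from the dominant-path phase to the distance-change estimate can be sketched as follows, where the signs and the \(\bmod\, 2\pi\) convention are our assumptions:
\[ \theta (t) \approx \left(\frac{4\pi d(t)}{\lambda } + \theta _T\right) \bmod 2\pi \quad \Longrightarrow \quad d(t) = \frac{\lambda }{4\pi }\left(\theta (t) - \theta _T + 2n\pi \right), \]
\[ \Delta d(t_i,t_j) = \frac{\lambda }{4\pi }\left(\theta (t_i) - \theta (t_j) + 2(n_i - n_j)\pi \right) \approx \frac{\lambda \left(\theta (t_i) - \theta (t_j)\right)}{4\pi }, \]
where the final approximation holds once the movement restriction guarantees \(n_i = n_j\) and the constant \(\theta _T\) cancels.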
Of course, the approximation in Equation (
9) cannot hold in situations where the direct path does not dominate the received signal, e.g., rooms full of metal surfaces, or does not exist at all, e.g., NLoS situations. To deal with such cases, more antennas can be deployed to mitigate the influence of multipath interference, as proposed in [
35,
38,
39], where the distance from the tag to each antenna will be estimated. However, this issue is beyond the scope of this article, and we leave further study to future work.
With an estimation of the change of the tag-antenna distance, the remaining task is to decompose it into 3D coordinates. A naive method to fulfill this task can be described as follows. Without loss of generality, let us set the camera as the origin and the coordinate of an antenna as
\(\overrightarrow{a}\). Supposing we have already obtained a series of position estimations of a tagged object, denoted as
\(\lbrace \overrightarrow{l}(t_1),\ldots ,\overrightarrow{l}(t_{i-1})\rbrace\), and the corresponding vision estimations and phase measurements,
\(\lbrace (x(t_1),y(t_1)),\ldots ,(x(t_{i-1}),y(t_{i-1}))\rbrace\) and
\(\lbrace \theta (t_1),\ldots ,\theta (t_{i-1})\rbrace\), now we want to estimate
\(\overrightarrow{l}(t_i)\) with these data and the new vision estimation
\((x(t_{i}),y(t_{i}))\) and phase measurement
\(\theta (t_{i})\). Then, we can randomly pick a historical estimation
\(\overrightarrow{l}(t_j)\) to estimate
\(\overrightarrow{l}(t_{i})\) through
where
\(\Vert \cdot \Vert\) is the L2-norm. As the antenna position
\(\overrightarrow{a}\), the historical position
\(\overrightarrow{l}(t_j)\), the wavelength
\(\lambda\), and the phase difference
\(\theta (t_i)-\theta (t_j)\) are known, Equation (
13) can be rewritten as
where
Considering that the
x and
y coordinates have been estimated as
\(x(t_{i})\) and
\(y(t_{i})\), Equation (
13) only requires calculating the coordinate in the third dimension. It should be noted that Equation (
13) can produce two results, and we can select one of them by assuming a smooth movement pattern. Furthermore, as
\((x(t_{i}), y(t_{i}))\) is merely a rough estimation, we can also set it aside and solve an optimization problem to obtain the 3D coordinate as
where
\(\lbrace j_1,\ldots ,j_n\rbrace\) denotes a set of historical results utilized for the estimation.
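In this notation, the pairwise distance-difference constraint, its rearrangement into a sphere constraint, and the optimization-based variant can be sketched as follows; the plain least-squares objective is our assumption:
\[ \big\Vert \overrightarrow{l}(t_i) - \overrightarrow{a}\big\Vert - \big\Vert \overrightarrow{l}(t_j) - \overrightarrow{a}\big\Vert = \frac{\lambda \left(\theta (t_i) - \theta (t_j)\right)}{4\pi } \quad \Longleftrightarrow \quad \big\Vert \overrightarrow{l}(t_i) - \overrightarrow{a}\big\Vert = \rho , \]
\[ \overrightarrow{l}(t_i) = \mathop{\arg \min }_{\overrightarrow{l}} \sum _{k=1}^{n}\left(\big\Vert \overrightarrow{l} - \overrightarrow{a}\big\Vert - \big\Vert \overrightarrow{l}(t_{j_k}) - \overrightarrow{a}\big\Vert - \frac{\lambda \left(\theta (t_i) - \theta (t_{j_k})\right)}{4\pi }\right)^{2}, \]
where \(\rho = \Vert \overrightarrow{l}(t_j) - \overrightarrow{a}\Vert + \frac{\lambda (\theta (t_i) - \theta (t_j))}{4\pi }\) is a known constant, so that fixing \(x(t_i)\) and \(y(t_i)\) leaves a quadratic equation in the third coordinate with two roots.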
However, modeling the problem as above ignores the influence of nearby objects, an important factor that can affect the estimation result, as observed in our tests. Specifically, the coupling effect between two nearby RFID tags can distort the signal features of RFID signals and, as a result, invalidate Equation (
9). Therefore, the influence of the surrounding environment, especially nearby tagged objects, shall be considered. Moreover, considering that the movement of tagged objects in a given application tends to follow certain patterns, we choose a data-driven method and build a deep learning-based model, i.e., the trace conversion model, to solve the problem of generating 3D traces from phase measurements and 2D images.
4.2 Model Overview
The structure and working flow of the trace conversion model are illustrated in the bottom-left part of Figure
1. Our idea of forming a simulated trace is to estimate the position of a tagged object at each timestamp with the current video frame and RFID phase value, as well as historical phase values and position estimations. We treat the video frame as a measurement of the object in the spatial domain and utilize time-domain data, i.e., continuous phase changes and historical positions, to calibrate and complement the measured result. In particular, we take into account not only the current target object but also other nearby targets through a multi-object detector. Based on this idea, the trace conversion model
\(\mathcal {T}(\cdot)\) can be formulated as
where
\(l_i\),
\(F_i\), and
\(\theta _i\) denote the position estimation, captured video frame, and collected phase value at time
\(t_i\), and
n is the number of historical positions involved in the estimation.
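In the notation above, one consistent form of this mapping is the following sketch, where the exact grouping of the inputs is our assumption:
\[ l_i = \mathcal {T}\left(F_i,\; \lbrace \theta _{i-n},\ldots ,\theta _{i}\rbrace ,\; \lbrace l_{i-n},\ldots ,l_{i-1}\rbrace \right). \]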
4.3 Feature Extraction
We start by encoding a preprocessed RFID phase measurement sequence and a frame sequence, i.e., \(\boldsymbol {\bar{R}}=\lbrace (\theta _1^F,t_1^F),\ldots ,(\theta _{m}^F,t_{m}^F)\rbrace\) and \(\boldsymbol {F}=\lbrace (F_1,t_1^F),\ldots ,(F_{m},t_{m}^F)\rbrace\), into temporal and spatial features.
First, temporal features. For each timestamp
\(t_i^F\), a fully connected layer
\(\mathtt {FC_{en}}\) is utilized to form a
\(128\times 1\) vector
\(v_T[t_i^F]\) as
where
n is a variable that shall be adjusted according to the typical moving speed in a given application,
\(l_i\) denotes the position estimation at
\(t_i^F\), and
\(W_\text{en}\) is the weight matrix of
\(\mathtt {FC_{en}}\). If the number of available historical phase measurements or position estimations is smaller than
n, the missing entries are padded with zeros. Then, a Transformer
\(\mathcal {T}_{t}(\cdot)\) is utilized to capture the temporal dependency among phase measurements and historical positions and to output a
\(64\times 1\) temporal feature as
As this Transformer is utilized for generating temporal features, we name it the temporal Transformer.
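As a concrete illustration, a minimal PyTorch sketch of this temporal path is given below. The 128-dimensional encoding and 64-dimensional output follow the text; the per-timestep token layout, the number of layers and attention heads, and taking the last token as the feature are our assumptions:

```python
import torch
import torch.nn as nn

class TemporalBranch(nn.Module):
    """Sketch of FC_en + temporal Transformer: each (phase, position) pair is
    encoded into a 128-d token, and the temporal Transformer fuses the n+1
    tokens into a 64-d temporal feature."""

    def __init__(self, d_model: int = 128, d_feat: int = 64):
        super().__init__()
        # One phase value plus a 3D position estimation per timestep.
        self.fc_en = nn.Linear(1 + 3, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.temporal_transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.proj = nn.Linear(d_model, d_feat)  # 128 -> 64 temporal feature

    def forward(self, phases: torch.Tensor, positions: torch.Tensor) -> torch.Tensor:
        # phases: (B, n+1, 1), current + n historical phase measurements;
        # positions: (B, n+1, 3), historical position estimations.
        # Missing history is zero-padded by the caller, as in the text.
        tokens = self.fc_en(torch.cat([phases, positions], dim=-1))  # (B, n+1, 128)
        hidden = self.temporal_transformer(tokens)   # temporal dependencies
        return self.proj(hidden[:, -1])              # (B, 64) feature at t_i^F


# Usage: n = 8 historical steps plus the current one.
branch = TemporalBranch()
v_T = branch(torch.randn(4, 9, 1), torch.randn(4, 9, 3))  # -> (4, 64)
```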
Second, spatial features. For the target tagged object, we use an object detector as described in Section
3.1 to generate bounding boxes for all frames in its frame sequence
\(\boldsymbol {F}\). For each frame
\(F_i\), we erase the contents of the detected objects by setting all pixels in the corresponding bounding boxes to 0 and output a matrix
\(A_{n\times n}^i\), where
n denotes the number of detected objects in
\(F_i\). We denote the processed frame sequence as
\(\boldsymbol {I}=\lbrace (I_1,t_1^F),\ldots ,(I_m,t_m^F)\rbrace\). Then, we use GoogLeNet [
33] pre-trained on ImageNet [
31], denoted as
\(\mathtt {CNN_{en}}(\cdot)\), to extract a
\(64\times 1\) spatial vector
\(v_S[t_i^F]\) based on
\(I_i\) as
where
\(W_\text{CNN}\) denotes the fixed parameters of its network structure. Then, a spatial Transformer
\(\mathcal {T}_{s}(\cdot)\) is utilized to turn the spatial vector
\(v_S[t_i^F]\) into a
\(64\times 1\) spatial feature
\(\bar{v}_S[t_i^F]\) with
\(A_{n\times n}^i\) as
Now, for each timestamp, there are two vectors, \(\bar{v}_{T}[t_i^F]\) and \(\bar{v}_{S}[t_i^F]\), representing features of the movement and the surrounding spatial environment of a tagged object, respectively.
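For illustration, a minimal PyTorch sketch of this spatial path follows, using torchvision's GoogLeNet as the pre-trained \(\mathtt {CNN_{en}}\). How the object matrix \(A_{n\times n}^i\) enters the spatial Transformer is not specified above, so the sketch omits it; the 224x224 input size and the single-token Transformer usage are likewise our assumptions:

```python
import torch
import torch.nn as nn
from torchvision.models import googlenet, GoogLeNet_Weights

class SpatialBranch(nn.Module):
    """Sketch of the spatial path: detected objects are erased from the frame,
    a frozen ImageNet-pretrained GoogLeNet extracts a spatial vector, and a
    spatial Transformer turns it into a 64-d spatial feature."""

    def __init__(self, d_feat: int = 64):
        super().__init__()
        backbone = googlenet(weights=GoogLeNet_Weights.IMAGENET1K_V1)
        backbone.fc = nn.Identity()  # keep the 1024-d penultimate representation
        for p in backbone.parameters():
            p.requires_grad = False  # W_CNN stays fixed, as in the text
        self.cnn_en = backbone
        self.proj = nn.Linear(1024, d_feat)  # 1024 -> 64 spatial vector v_S
        layer = nn.TransformerEncoderLayer(d_model=d_feat, nhead=4, batch_first=True)
        self.spatial_transformer = nn.TransformerEncoder(layer, num_layers=2)

    @staticmethod
    def erase_boxes(frame: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        # frame: (3, H, W); boxes: (k, 4) pixel coordinates (x1, y1, x2, y2).
        out = frame.clone()
        for x1, y1, x2, y2 in boxes.long().tolist():
            out[:, y1:y2, x1:x2] = 0  # set the bounding-box pixels to 0 -> I_i
        return out

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, 3, 224, 224), already-erased frames I_i.
        v_s = self.proj(self.cnn_en(frames)).unsqueeze(1)  # (B, 1, 64)
        return self.spatial_transformer(v_s)[:, 0]         # (B, 64) feature
```

In training, only the projection layer and the spatial Transformer would receive gradients, matching the fixed \(W_\text{CNN}\) described above.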
However, as analyzed in Section
4.1, when multiple tagged objects move in a dynamic environment, the RF features (e.g., RSSI, phase) of each tagged object will be affected by both the surrounding environment and the other tagged objects, which means that both the position estimation and the phase measurement are related to spatial information. Therefore, instead of treating the temporal and spatial features separately, we integrate them to depict the relationship between temporal-spatial observations. Accordingly, we interleave temporal and spatial Transformers to form the temporal-spatial feature
\(v_{TS}[t_i^F]\) as