In this article, we propose to directly match 3D traces generated through different methods, instead of phase sequences, to avoid losing spatial resolution. The previous section has introduced how we generate observed traces through a pure vision method. In this section, we introduce the proposed trace conversion model, a Transformer-based seq2seq model that takes the RFID phase sequence of a tag and the frame sequence corresponding to a tagged object as inputs, and outputs a simulated trace based on a hypothesized correspondence between them. The implementation of the model is elaborated as follows.
4.1 Theoretical Analysis
Before diving into the details of the trace conversion model, let us start by analyzing the theoretical basis for generating 3D traces from phase measurements and 2D images.
Typically, the signal received at an RFID reader can be viewed as a superposition of carrier signals generated by the reader and modulated signals backscattered by the tag. The former come from circulator leakage and environmental scattering [
17] while the latter include modulated signals that are transmitted through direct and indirect paths [
39]. Accordingly, if the distance between an RFID reader and a tag is
\(d(t)\) at time
t, the received signal
\(r(t)\) can be expressed as
where
\(\lambda\) is the wavelength,
\(b(t)\) is the modulated signal generated by the tag, and
\(s(t)\) is the carrier signal transmitted by the reader, which typically is a continuous sinusoid wave represented as
\(s(t)=e^{j2\pi ct/\lambda }\), where
c is the speed of light. We denote the attenuation introduced in a propagation process with
\(\alpha\). Among these terms,
\(\alpha _l\),
\(\alpha _T\), and
\(\alpha _w\) are determined by the electromagnetic characteristics of the circuits in the reader and the tag, and by the material of the reflective surfaces [
5]. Considering that each of them can introduce an additional phase change, we denote the unknown phase terms as
\(\theta _l\),
\(\theta _T\), and
\(\theta _w\) accordingly.
\(\alpha _d\) denotes path loss in free-space propagation, which is related to the distance as defined in the Friis equation [
23].
N is the number of propagation paths and
\(n(t)\) is the additive Gaussian white noise. Signals scattered twice by the surroundings are ignored in Equation (
4) since they tend to be severely attenuated.
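For concreteness, the superposition described above can be written as the following sketch, in which the exact grouping of the attenuation and phase factors is our assumption and \(\alpha _{d_i}\), \(d_i\) aggregate the indirect-path attenuations and lengths as defined below:
\[ \begin{aligned} r(t) ={}& \alpha _l e^{j\theta _l} s(t) + \sum _{i=1}^{N-1}\alpha _{w_i} e^{j\theta _{w_i}} s(t) + \alpha _T \alpha _d^{2}\, e^{j\theta _T} e^{-j\frac{4\pi d(t)}{\lambda }}\, b(t)\, s(t) \\ &+ \sum _{i=1}^{N-1}\alpha _T \alpha _d \alpha _{d_i}\, e^{j(\theta _T + \theta _{w_i})} e^{-j\frac{2\pi (d(t)+d_i)}{\lambda }}\, b(t)\, s(t) \\ &+ \sum _{i=1}^{N-1}\alpha _T \alpha _d \alpha _{d_i}\, e^{j(\theta _T + \theta _{w_i})} e^{-j\frac{2\pi (d_i+d(t))}{\lambda }}\, b(t)\, s(t) + n(t), \end{aligned} \]
where the five terms correspond to circulator leakage, environmental scattering, the direct backscatter path, and the two single-reflection indirect paths (reader-wall-tag-reader and reader-tag-wall-reader), respectively.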
As can be seen from Equation (
4), the first two terms, which are irrelevant to the tag, do not introduce new frequency components into the carrier signal and can be filtered out after demodulation. Furthermore, with channel reciprocity, the fourth and fifth terms can be merged. Therefore, we can simplify Equation (
4) for the demodulated, DC-filtered received signal as
where
\(\alpha _{d_i} = \alpha _{w_i}\alpha _{d_{R-\gt w_i}}\alpha _{d_{w_i-\gt T}} = \alpha _{w_i}\alpha _{d_{T-\gt w_i}}\alpha _{d_{w_i-\gt R}}\) and
\(d_i = d_{R-\gt w_i}+d_{w_i-\gt T} = d_{T-\gt w_i}+d_{w_i-\gt R}\).
Accordingly, the transfer function can be calculated as
where
\(n^{\prime }(t) = \frac{n(t)}{b(t)}\). Then, we can obtain the phase measurement
\(\theta (t)\) at time
t as
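Under these definitions, the simplified received signal, the transfer function, and the reported phase can be sketched as follows; the factor of 2 from merging the two reciprocal indirect terms and the \(\bmod\, 2\pi\) convention are our assumptions:
\[ \tilde{r}(t) = \alpha _T \alpha _d^{2}\, e^{j\theta _T} e^{-j\frac{4\pi d(t)}{\lambda }}\, b(t) + \sum _{i=1}^{N-1} 2\alpha _T \alpha _d \alpha _{d_i}\, e^{j(\theta _T + \theta _{w_i})} e^{-j\frac{2\pi (d(t)+d_i)}{\lambda }}\, b(t) + n(t), \]
\[ h(t) = \frac{\tilde{r}(t)}{b(t)}, \qquad \theta (t) = \angle\, h(t) \bmod 2\pi . \]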
Considering a situation where the signal of the direct path dominates the received signal, which means that
the measured phase can then be expressed as
which turns the reported phase value
\(\theta (t)\) into an indirect estimation of the tag-antenna distance
\(d(t)\). Based on Equation (
9), we can represent the distance as
where
n is an unknown integer caused by phase wrapping. Unfortunately, the reported phase
\(\theta (t)\) still cannot be directly utilized for estimating the tag-antenna distance due to the two ambiguity terms. To overcome this problem, a commonly utilized solution is to estimate the change of the tag-antenna distance instead, which can be denoted as
where we assume that the random term
\(\theta _T\) remains constant across the two sampling instants. As Equation (
11) implies, we can eliminate the ambiguity terms when estimating the change of the tag-antenna distance, as long as the distance changes by less than half a wavelength between the two sampling instants; this can be satisfied by adding speed restrictions according to the typical sampling rates of individual tags in specific applications. Then, Equation (
11) can be simplified as
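Putting these steps together, the chain from the dominant-path phase to the distance-change estimate can be sketched as follows, where the signs and the \(\bmod\, 2\pi\) convention are our assumptions:
\[ \theta (t) \approx \left(\frac{4\pi d(t)}{\lambda } + \theta _T\right) \bmod 2\pi \quad \Longrightarrow \quad d(t) = \frac{\lambda }{4\pi }\left(\theta (t) - \theta _T + 2n\pi \right), \]
\[ \Delta d(t_i,t_j) = \frac{\lambda }{4\pi }\left(\theta (t_i) - \theta (t_j) + 2(n_i - n_j)\pi \right) \approx \frac{\lambda \left(\theta (t_i) - \theta (t_j)\right)}{4\pi }, \]
where the final approximation holds once the movement restriction guarantees \(n_i = n_j\) and the constant \(\theta _T\) cancels.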
Of course, the approximation in Equation (
9) cannot hold in situations where the direct path does not dominate the received signal, e.g., rooms full of metal surfaces, or does not exist at all, e.g., NLoS situations. To deal with such cases, more antennas can be deployed to mitigate the influence of multipath interference, as proposed in [
35,
38,
39], where the distance from the tag to each antenna will be estimated. However, this issue is beyond the scope of this article, and we leave further study to future work.
With an estimation of the change of the tag-antenna distance, the remaining task is to decompose it into 3D coordinates. A naive method to fulfill this task can be described as follows. Without loss of generality, let us set the camera as the origin and the coordinate of an antenna as
\(\overrightarrow{a}\). Supposing we have already obtained a series of position estimations of a tagged object, denoted as
\(\lbrace \overrightarrow{l}(t_1),\ldots ,\overrightarrow{l}(t_{i-1})\rbrace\), and the corresponding vision estimations and phase measurements,
\(\lbrace (x(t_1),y(t_1)),\ldots ,(x(t_{i-1}),y(t_{i-1}))\rbrace\) and
\(\lbrace \theta (t_1),\ldots ,\theta (t_{i-1})\rbrace\), now we want to estimate
\(\overrightarrow{l}(t_i)\) with these data and the new vision estimation
\((x(t_{i}),y(t_{i}))\) and phase measurement
\(\theta (t_{i})\). Then, we can randomly pick a historical estimation
\(\overrightarrow{l}(t_j)\) to estimate
\(\overrightarrow{l}(t_{i})\) through
where
\(\Vert \cdot \Vert\) is the L2-norm. As the antenna position
\(\overrightarrow{a}\), the historical position
\(\overrightarrow{l}(t_j)\), the wavelength
\(\lambda\), and the phase difference
\(\theta (t_i)-\theta (t_j)\) are known, Equation (
13) can be rewritten as
where
Considering that the
x and
y coordinates have been estimated as
\(x(t_{i})\) and
\(y(t_{i})\), Equation (
13) only requires calculating the coordinate in the third dimension. It should be noted that Equation (
13) can produce two results, and we can select one of them by assuming a smooth movement pattern. Furthermore, as
\((x(t_{i}), y(t_{i}))\) is merely a rough estimation, we can also set it aside and solve an optimization problem to obtain the 3D coordinate as
where
\(\lbrace j_1,\ldots ,j_n\rbrace\) denotes a set of historical results utilized for the estimation.
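In this notation, the pairwise distance-difference constraint, its rearrangement into a sphere constraint, and the optimization-based variant can be sketched as follows; the plain least-squares objective is our assumption:
\[ \big\Vert \overrightarrow{l}(t_i) - \overrightarrow{a}\big\Vert - \big\Vert \overrightarrow{l}(t_j) - \overrightarrow{a}\big\Vert = \frac{\lambda \left(\theta (t_i) - \theta (t_j)\right)}{4\pi } \quad \Longleftrightarrow \quad \big\Vert \overrightarrow{l}(t_i) - \overrightarrow{a}\big\Vert = \rho , \]
\[ \overrightarrow{l}(t_i) = \mathop{\arg \min }_{\overrightarrow{l}} \sum _{k=1}^{n}\left(\big\Vert \overrightarrow{l} - \overrightarrow{a}\big\Vert - \big\Vert \overrightarrow{l}(t_{j_k}) - \overrightarrow{a}\big\Vert - \frac{\lambda \left(\theta (t_i) - \theta (t_{j_k})\right)}{4\pi }\right)^{2}, \]
where \(\rho = \Vert \overrightarrow{l}(t_j) - \overrightarrow{a}\Vert + \frac{\lambda (\theta (t_i) - \theta (t_j))}{4\pi }\) is a known constant, so that fixing \(x(t_i)\) and \(y(t_i)\) leaves a quadratic equation in the third coordinate with two roots.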
However, modeling the problem as above ignores the influence of nearby objects, an important factor that can affect the estimation result, as observed in our tests. Specifically, the coupling effect between two nearby RFID tags can distort the signal features of RFID signals and, as a result, invalidate Equation (
9). Therefore, the influence of the surrounding environment, especially nearby tagged objects, shall be considered. Moreover, considering that the movement of tagged objects in a given application tends to follow certain patterns, we choose a data-driven method and build a deep learning-based model, i.e., the trace conversion model, to solve the problem of generating 3D traces from phase measurements and 2D images.
4.2 Model Overview
The structure and working flow of the trace conversion model are illustrated in the bottom-left part of Figure
1. Our idea of forming a simulated trace is to estimate the position of a tagged object at each timestamp with the current video frame and RFID phase value, as well as historical phase values and position estimations. We treat the video frame as a measurement of the object in the spatial domain and utilize time-domain data, i.e., continuous phase changes and historical positions, to calibrate and complement the measured result. In particular, we take into account not only the current target object but also other nearby targets through a multi-object detector. Based on this idea, the trace conversion model
\(\mathcal {T}(\cdot)\) can be formulated as
where
\(l_i\),
\(F_i\), and
\(\theta _i\) denote the position estimation, captured video frame, and collected phase value at time
\(t_i\), and
n is the number of historical positions involved in the estimation.
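In the notation above, one consistent form of this mapping is the following sketch, where the exact grouping of the inputs is our assumption:
\[ l_i = \mathcal {T}\left(F_i,\; \lbrace \theta _{i-n},\ldots ,\theta _{i}\rbrace ,\; \lbrace l_{i-n},\ldots ,l_{i-1}\rbrace \right). \]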
4.3 Feature Extraction
We start by encoding a preprocessed RFID phase measurement sequence and a frame sequence, i.e., \(\boldsymbol {\bar{R}}=\lbrace (\theta _1^F,t_1^F),\ldots ,(\theta _{m}^F,t_{m}^F)\rbrace\) and \(\boldsymbol {F}=\lbrace (F_1,t_1^F),\ldots ,(F_{m},t_{m}^F)\rbrace\), into temporal and spatial features.
First, temporal features. For each timestamp
\(t_i^F\), a fully connected layer
\(\mathtt {FC_{en}}\) is utilized to form a
\(128\times 1\) vector
\(v_T[t_i^F]\) as
where
n is a variable that shall be adjusted according to the typical moving speed in a given application,
\(l_i\) denotes the position estimation at
\(t_i^F\), and
\(W_\text{en}\) is the weight matrix of
\(\mathtt {FC_{en}}\). If the number of available historical phase measurements or position estimations is smaller than
n, the missing entries are padded with zeros. Then, a Transformer
\(\mathcal {T}_{t}(\cdot)\) is utilized to capture the temporal dependency among phase measurements and historical positions and to output a
\(64\times 1\) temporal feature as
As this Transformer is utilized for generating temporal features, we name it the temporal Transformer.
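As a concrete illustration, a minimal PyTorch sketch of this temporal path is given below. The 128-dimensional encoding and 64-dimensional output follow the text; the per-timestep token layout, the number of layers and attention heads, and taking the last token as the feature are our assumptions:

```python
import torch
import torch.nn as nn

class TemporalBranch(nn.Module):
    """Sketch of FC_en + temporal Transformer: each (phase, position) pair is
    encoded into a 128-d token, and the temporal Transformer fuses the n+1
    tokens into a 64-d temporal feature."""

    def __init__(self, d_model: int = 128, d_feat: int = 64):
        super().__init__()
        # One phase value plus a 3D position estimation per timestep.
        self.fc_en = nn.Linear(1 + 3, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.temporal_transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.proj = nn.Linear(d_model, d_feat)  # 128 -> 64 temporal feature

    def forward(self, phases: torch.Tensor, positions: torch.Tensor) -> torch.Tensor:
        # phases: (B, n+1, 1), current + n historical phase measurements;
        # positions: (B, n+1, 3), historical position estimations.
        # Missing history is zero-padded by the caller, as in the text.
        tokens = self.fc_en(torch.cat([phases, positions], dim=-1))  # (B, n+1, 128)
        hidden = self.temporal_transformer(tokens)   # temporal dependencies
        return self.proj(hidden[:, -1])              # (B, 64) feature at t_i^F


# Usage: n = 8 historical steps plus the current one.
branch = TemporalBranch()
v_T = branch(torch.randn(4, 9, 1), torch.randn(4, 9, 3))  # -> (4, 64)
```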
Second, spatial features. For the target tagged object, we use an object detector as described in Section
3.1 to generate bounding boxes for all frames in its frame sequence
\(\boldsymbol {F}\). For each frame
\(F_i\), we erase the contents of the detected objects by setting all pixels in the corresponding bounding boxes to 0 and output a matrix
\(A_{n\times n}^i\), where
n denotes the number of detected objects in
\(F_i\). We denote the processed frame sequence as
\(\boldsymbol {I}=\lbrace (I_1,t_1^F),\ldots ,(I_m,t_m^F)\rbrace\). Then, we use GoogLeNet [
33] pre-trained on ImageNet [
31], denoted as
\(\mathtt {CNN_{en}}(\cdot)\), to extract a
\(64\times 1\) spatial vector
\(v_S[t_i^F]\) based on
\(I_i\) as
where
\(W_\text{CNN}\) denotes the fixed parameters of its network structure. Then, a spatial Transformer
\(\mathcal {T}_{s}(\cdot)\) is utilized to turn the spatial vector
\(v_S[t_i^F]\) into a
\(64\times 1\) spatial feature
\(\bar{v}_S[t_i^F]\) with
\(A_{n\times n}^i\) as
Now, for each timestamp, there are two vectors, \(\bar{v}_{T}[t_i^F]\) and \(\bar{v}_{S}[t_i^F]\), representing features of the movement and the surrounding spatial environment of a tagged object, respectively.
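For illustration, a minimal PyTorch sketch of this spatial path follows, using torchvision's GoogLeNet as the pre-trained \(\mathtt {CNN_{en}}\). How the object matrix \(A_{n\times n}^i\) enters the spatial Transformer is not specified above, so the sketch omits it; the 224x224 input size and the single-token Transformer usage are likewise our assumptions:

```python
import torch
import torch.nn as nn
from torchvision.models import googlenet, GoogLeNet_Weights

class SpatialBranch(nn.Module):
    """Sketch of the spatial path: detected objects are erased from the frame,
    a frozen ImageNet-pretrained GoogLeNet extracts a spatial vector, and a
    spatial Transformer turns it into a 64-d spatial feature."""

    def __init__(self, d_feat: int = 64):
        super().__init__()
        backbone = googlenet(weights=GoogLeNet_Weights.IMAGENET1K_V1)
        backbone.fc = nn.Identity()  # keep the 1024-d penultimate representation
        for p in backbone.parameters():
            p.requires_grad = False  # W_CNN stays fixed, as in the text
        self.cnn_en = backbone
        self.proj = nn.Linear(1024, d_feat)  # 1024 -> 64 spatial vector v_S
        layer = nn.TransformerEncoderLayer(d_model=d_feat, nhead=4, batch_first=True)
        self.spatial_transformer = nn.TransformerEncoder(layer, num_layers=2)

    @staticmethod
    def erase_boxes(frame: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        # frame: (3, H, W); boxes: (k, 4) pixel coordinates (x1, y1, x2, y2).
        out = frame.clone()
        for x1, y1, x2, y2 in boxes.long().tolist():
            out[:, y1:y2, x1:x2] = 0  # set the bounding-box pixels to 0 -> I_i
        return out

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, 3, 224, 224), already-erased frames I_i.
        v_s = self.proj(self.cnn_en(frames)).unsqueeze(1)  # (B, 1, 64)
        return self.spatial_transformer(v_s)[:, 0]         # (B, 64) feature
```

In training, only the projection layer and the spatial Transformer would receive gradients, matching the fixed \(W_\text{CNN}\) described above.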
However, as analyzed in Section
4.1, when multiple tagged objects move in a dynamic environment, the RF features (e.g., RSSI, phase) of each tagged object will be affected by both the surrounding environment and the other tagged objects, which means that both the position estimation and the phase measurement are related to spatial information. Therefore, instead of treating the temporal and spatial features separately, we integrate them to depict the relationship between temporal-spatial observations. Accordingly, we interleave temporal and spatial Transformers to form the temporal-spatial feature
\(v_{TS}[t_i^F]\) as