1. Introduction
Smart retail is regarded as the application of the Internet of Things and big data analytics to retail [1]. It usually collects data from videos captured by ubiquitous cameras in retail stores, so valuable information must be extracted from these videos. Customer behavior (CB) is commonly considered valuable analytic material for business management [2]. As there is an almost infinite number of classes of CBs in retail environments, specific CBs are generally selected as recognition targets, called target CBs, based on needs. Typically, customer-centric retailing demands different target CBs to analyze the customer decision-making process. The target CBs also change frequently with different products or in-store layouts because of the different customer-product interactions: for instance, trying on clothes in a clothes shop, sitting on a bed in a furniture shop, picking up a bottle from a shelf, or picking up an ice cream from a freezer. Accordingly, CB recognition (CBR) methods must be modified to recognize the changed target CBs. In some cases, a current target CB must be discriminated further; e.g., for “pick a product”, discriminating whether a customer picks a product with one hand or both hands indicates the effort the customer makes to pick it. Therefore, a CBR method is expected to be flexible enough to address frequent changes in the target CBs.
As CBR is a branch of human activity recognition (HAR), current CBR methods use machine learning (ML)-based models [3] due to their remarkable accuracy in HAR tasks. Nevertheless, in contrast to general HAR, CBR also requires flexibility. To recognize changed target CBs, i.e., to change the model’s output, ML-based models require time-consuming re-collection of training data and re-training of the model. Although transfer learning can accelerate training in some cases, the unavoidable data-collection step remains time-consuming. This makes current methods inflexible when coping with changes in target CBs. Additionally, in existing methods, target CBs are mostly selected according to the available training data rather than business needs, which indicates that change adaptation is not considered in their design. Thus, current CBR methods are not well suited to target-CBR tasks in retail environments.
To cope with target changes, we propose a rule-based method that recognizes CBs as combinations of primitives, each of which is a partitioned unit of CB. Since primitives can be combined to customize various CBs, our method reuses them to customize the changed target CBs. The number of combinations of primitives increases exponentially as the number of primitives increases linearly, so our method can cover a wide range of CBs with a small number of primitives. As CB analysis focuses on customer-product interaction, we designed each primitive as a unit that describes an object’s motion or the relationship between multiple objects.
To conclude, rather than improving accuracy, we focus on the method’s flexibility, which is equally important for CBR. The main contribution of this paper is therefore a flexible CBR method that copes with frequent changes in target CBs.
We evaluated our method on a self-collected laboratory dataset and the public MERL dataset. Compared to time-consuming data collection and model training, our method dealt with target changes in a short time, demonstrating its flexibility. Moreover, the assessment of recognition accuracy indicated that we did not lose too much accuracy as the cost of achieving this flexibility.
The remainder of this paper is organized as follows: Section 2 explains the problems of existing methods in terms of their methodology and their rationale for selecting target CBs. Section 3 describes our proposed CB decomposition and the matching of CB patterns in detail. Section 4 presents the evaluation of the proposed method on two different datasets. Finally, Section 5 concludes the paper with final remarks and suggestions for future research.
2. Related Work
In retail environments, we analyze CBs to meet the demands of customer-centric retailing. As a result, CBR tasks should not only address methodological issues but also consider the difficulty of deployment and the customer’s experience. Currently, various types of sensors are used in HAR research to acquire data on human movements. In contrast, almost all research on CBR uses visual data. The major reason is that visual-data-based approaches can be applied directly to video acquired by surveillance cameras in the store, which makes them hardware-free and avoids active customer participation [2]. In addition, visual data contain much more information than most other types of sensor data.
With video input, existing CBR methods mainly follow the pipeline of extracting features from consecutive frames within a certain period and recognizing behavior from the feature sequence using ML-based models, especially the hidden Markov model (HMM). Popa et al. [4] proposed an HMM-based model to recognize customers’ buying behavior from optical flow features. Within the next two years, they improved the HMM-based model by partitioning CB into basic actions [5], which are similar to our proposed primitives. However, they determined the basic actions from optical flow features; thus, the model is not explainable, which results in poor flexibility when dealing with target CB changes. Merad et al. [6] applied an HMM for hand-movement analysis and an SVM on eye-tracking descriptors to classify a customer’s purchasing type. Specific CB classes were not given because the authors conducted CBR indirectly. Moreover, their wearable device was difficult to apply to every customer and required customers’ active participation; however, people are generally reluctant to cooperate without tangible rewards [2].
Apart from HMM models, convolutional neural networks (CNNs) are also widely used due to their excellent performance in spatial feature extraction. Singh et al. [7] used a CNN connected to a long short-term memory (LSTM) [8] model to recognize CBs such as reaching into the shelf and inspecting products, and avoided most object occlusions by using top-view cameras. Improved CNN-based models [3,9] have recently been proposed to detect customers and recognize basic customer-product interactions, such as picking up products and returning products to the shelf. Liu et al. [10] employed a dynamic Bayesian network to recognize six CBs, including turning to the shelf, touching, picking, and returning, based on hand movements and the orientation of the head and body. Yamamoto et al. [11] estimated CB classes in a book store using a support vector machine (SVM) on depth features from a top-view camera and pixel state analysis (PSA) features.
In addition, several studies that did not use ML-based models [12,13] implemented complete CBR systems with an RGB-D camera. Basic CBs, such as picking and returning, were recognized mainly by processing depth information with background subtraction. Unfortunately, since these systems were designed for specific purposes using simple and efficient methods, their flexibility was compromised.
In sum, although the aforementioned ML-based methods achieved improvements in CBR accuracy, they share common limitations with respect to flexibility, as follows:
Difficulty in adapting to changes in target CBs: ML models cannot be reused when the changed CBs differ substantially from the training data. In this event, time-consuming collection of new training data and re-training of the model are required, which implies inflexibility.
The model is not explainable: Unexplainable models can only be tuned based on their outputs, which implies poor flexibility when modifications are driven by changes in business needs.
Furthermore, since few approaches in the field of CBR are similar to ours, we discuss the similarities and differences of several HAR methods with respect to their application to CBR. Liu et al. [14] proposed an HMM-based method that divides human activity into several phases, called “motion units”, analogous to phonemes in speech recognition. Yale et al. [15] proposed interpretable high-level features based on motion units; different activities sharing the same motion units allow the model to derive more explanatory power from human activities. Although motion units are similar to our proposed primitives, these methods encounter two issues when applied to CBR tasks, which highlight how they differ. Firstly, they use data from a smartphone’s acceleration sensor. Although providing tangible rewards is less of a problem, the methods require the active participation of customers, e.g., downloading an app and agreeing to its terms of service, which increases their saliency to customers. Consequently, the rewards increase costs and the active participation raises privacy issues [2]. Secondly, despite a fairly complete categorization of human activities based on motion units, the methods do not focus on human-item interactions. Since purchase behavior can easily be detected from cashier records, recognizing non-purchase CBs is one of the objectives of CBR, and human-item interactions are the main component of non-purchase CBs. As an illustration, without item information, “picking up a product” and “returning a product” would be practically indistinguishable due to their similar hand motions. Rai et al. [16] divided human activities in indoor living spaces into atomic actions, analogous to the primitives in this paper. Their use of both visual and audio data avoided users’ active participation, and the training data included human-item interactions. The authors improved recognition accuracy by training the model with annotations of both atomic actions and activities; in contrast, we concentrate on improving flexibility without sacrificing too much accuracy, as flexibility is an important factor for CBR tasks. Mansour et al. [17] combined a faster R-CNN and a deep Q-network to detect anomalous entities or human activities in videos. As a typical ML-based HAR method, it requires re-collecting training data and re-training models to adapt to changed recognition targets, which is inflexible for CBR tasks. In conclusion, the described HAR methods would require major modifications before they could be applied to CBR tasks.
3. Proposal
In this paper, we designed a unit, called a primitive, which is a partitioned piece of CB. Our CBR process consists of object tracking, primitive recognition, and CBR by matching recognized primitives against a predefined pattern of primitives. Since the innovative part of our approach is CBR using combinations of primitives, we apply existing methods for object tracking. The workflow of our approach is shown in Figure 1. First, an existing tracking method tracks objects in the input video captured by in-store cameras. Then, the primitives in each frame are recognized based on the object trajectories. We predefine each CB as a pattern consisting of primitives, and finally match the recognized primitives against the predefined primitive patterns; a matched pattern is regarded as the corresponding CB. This section explains our proposed method in detail, including how we design the primitives, the method for primitive recognition, customizing CBs using primitives, and CBR by pattern matching.
3.1. Primitive
The dictionary definition of a behavior is the accomplishment of a thing, usually over a period of time or in stages. We believe this definition reveals the process by which the human brain recognizes a behavior from visual information: a behavior consists of several stages, and our brains recognize it by checking whether these stages occur in the correct order. In this paper, we refer to these stages as primitives; thus, a CB can be decomposed into one or more primitives.
Table 1 lists the target CBs in existing methods and the primitives from our subjective decomposition of these target CBs. We did not list one type of CB [18] in Table 1 because it recognizes customers’ emotions from facial expressions and speech text, which might breach customers’ privacy. During the decomposition, we controlled the granularity to avoid redundancy from over-decomposition. We found that the objects in the target CBs were body parts or products, and that there are two types of primitives: one describes an object’s motion state, and the other describes the relationship between two objects. These findings determine what kind of information a primitive should contain and how detailed it should be.
It is necessary to design an expression format for primitives. Generally, natural language is considered an efficient way to let others know that we understand a behavior. Therefore, we define a primitive as a sentence with reference to natural language grammar. The syntax is:

*subject* *verb* *object* *place*

where the italic words are syntax elements that can be replaced by words from the vocabulary below. If there is no *object* (a motion primitive), the syntax can be simplified as *subject* *verb* *place*. As the syntax shows, the primitive consists of *subject*, *verb*, *object*, and *place*, each of which has a corresponding vocabulary, as follows:

*subject*: person, hand, product
*verb*: move, stay, follow, face to
*object*: hand, shelf, cart, product
*place*: in the shelf/cart, out of shelf/cart

*subject* and *object* refer to the name of an entity. *verb* describes the movement of the *subject* or the relation between the *subject* and the *object*. *place* indicates where the primitive happens. As our proposed method should cover a wide range of CBs, the vocabulary should be a selection of words commonly used in retail environments; therefore, these words were selected based on our aforementioned findings from the existing methods in Table 1. Nevertheless, more words will become available as our research progresses. There are some constraints and options on the syntax to avoid confusing definition sentences, as below:

*subject* and *verb* are required: both must always be filled in. *object* is required in relation primitives. *place* is optional.
Any ignored optional element can be omitted: e.g., if *place* is ignored, meaning we do not care about its value, the syntax can be simplified as *subject* *verb* *object*.
*subject* ≠ *object*: using the same word for the *subject* and the *object* is not allowed, as it is illogical.
The logical operator NOT (!) is allowed: it indicates all words except the negated one.
In sum, the syntax describes what an object does or what happens to it, and with certain verbs it can represent the relationship between two objects. This design can define motion primitives (the motion of an object) and relation primitives (the relation between two objects). For cases involving more than two objects, combining several relation primitives can describe a CB composed of multiple objects.
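As a concrete illustration, a primitive sentence can be represented as a small record that enforces the constraints above. This is a minimal sketch, not the paper's implementation; the Python names (`Primitive`, `SUBJECTS`, etc.) and the field labels subject/verb/object/place are our own, while the word lists follow the vocabulary given in the text.

```python
from dataclasses import dataclass
from typing import Optional

# Vocabularies from the primitive syntax described above.
SUBJECTS = {"person", "hand", "product"}
VERBS = {"move", "stay", "follow", "face to"}
OBJECTS = {"hand", "shelf", "cart", "product"}
PLACES = {"in the shelf/cart", "out of shelf/cart"}
RELATION_VERBS = {"follow", "face to"}  # verbs that relate two objects

@dataclass(frozen=True)
class Primitive:
    subject: str
    verb: str
    obj: Optional[str] = None    # required only in relation primitives
    place: Optional[str] = None  # always optional

    def __post_init__(self):
        # enforce the syntax constraints from the text
        assert self.subject in SUBJECTS and self.verb in VERBS
        if self.verb in RELATION_VERBS:
            # relation primitives need an object distinct from the subject
            assert self.obj in OBJECTS and self.obj != self.subject
        if self.place is not None:
            assert self.place in PLACES

# {a product is moving together with one's hand}
follow_hand = Primitive("product", "follow", "hand")
```

Representing primitives as data rather than model outputs is what makes them reusable: a changed target CB only requires a new combination of such records.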
Table 1. Primitives in target CBs of current approaches.
| Target CB | Related Approaches | Primitives ({} = primitive) |
|---|---|---|
| Passing by the Shelf | [3,10,12] | {a person is moving in front of the shelf} |
| Turning to the Shelf | [10] | {a person is turning to face the shelf} |
| Viewing the Shelf | [5,10,11] | {a person is standing and watching the shelf} |
| Touch the Shelf | [3,4,5,10,13] | {one’s hand moves to the shelf}, {one’s hand moves back from the shelf} |
| Pick up a Product from the Shelf | [3,4,5,9,10,12,13] | {one’s hand moves to the shelf}, {one’s hand moves back from the shelf}, {a product is moving together with one’s hand} |
| Return a Product back to the Shelf | [3,5,10,12,13] | {one’s hand moves to the shelf}, {a product is moving together with one’s hand}, {one’s hand moves back from the shelf} |
| Put a Product into a Basket/Cart | [10] | {one’s hand moves to the cart}, {a product is moving together with one’s hand}, {one’s hand moves back from the cart} |
| Holding a Product | [11] | {a product is moving together with one’s hand} |
| Browsing a Product in a Hand | [5,11,13] | {a person is watching his hand}, {a product is moving together with one’s hand} |
However, though the proposed syntax suffices for our current research, its application range is limited by the design of *subject*, *verb*, *object*, and *place*. Despite being able to define multi-object interactions in theory, each sentence only defines a one-to-one relationship between two objects; therefore, the resources required to define multi-object relationships grow exponentially with the number of related objects. Nevertheless, this is currently sufficient for us, because at most two objects interact at a time. In addition, since *place* limits the number of positions to only the start and end, the syntax cannot describe complex motion, such as spiral movement.
3.2. Primitive Recognition
In this section, we consider how to recognize the elements of the syntax from the objects’ trajectories. Since most CBs last a few seconds, which corresponds to many frames in a 30 fps video, the trajectories produced by the object-tracking method contain redundancy. Consequently, we first perform trajectory segmentation to reduce this redundancy, and then recognize primitive elements from the segmentation results.
Trajectory segmentation refers to compressing a trajectory into several segments that preserve most features of the trajectory. Current approaches [19,20] separate a trajectory based on the moving distance and direction of each vector in the trajectory; thus, we designed an approximate trajectory partitioning (ATP)-based algorithm [19] for trajectory segmentation. However, ATP is sensitive to direction changes, and in our case an object’s frequent direction changes over short distances usually indicate idling; we want the algorithm to react only to changes in moving distance in this case. Hence, we designed a thresholding algorithm based on ATP, as shown in Algorithm 1. The algorithm receives two inputs: the list of key-points output by ATP, whose i-th element is the i-th key-point and whose length N is the number of key-points, and a distance threshold that preserves only key-points separated by a distance longer than the threshold. Since the time complexities of ATP and Algorithm 1 are both linear, the overall time complexity of trajectory segmentation is O(n), where n is the length of the trajectory.
Algorithm 1: Thresholding Algorithm for Trajectory Segmentation. (The pseudocode appears as an image in the original.)
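Since the pseudocode of Algorithm 1 is only available as an image, the following is a minimal sketch of the thresholding step under our reading of the text: a key-point survives only if it lies farther than the threshold `d_thr` from the previously kept key-point, so frequent short back-and-forth motion (idling) does not produce extra segments. The function name and the rule of always keeping the first and last points are our assumptions.

```python
import math

def threshold_keypoints(points, d_thr):
    """Post-filter ATP key-points: keep a key-point only when it is
    farther than d_thr from the previously kept one. points is a list
    of (x, y) key-points output by ATP; runs in O(N) time."""
    if len(points) < 2:
        return list(points)
    kept = [points[0]]                    # always keep the start point
    for p in points[1:-1]:
        if math.dist(kept[-1], p) > d_thr:
            kept.append(p)                # far enough: a real movement
    kept.append(points[-1])               # always keep the end point
    return kept
```

With `d_thr = 2`, a jitter of less than two units around a kept point is absorbed, while genuine displacements are preserved as segment boundaries.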
In the primitive’s syntax, the *subject* and *object* are entity names that can be obtained directly from the trajectory information, and the *place* values “in the shelf/cart” and “out of shelf/cart” can be acquired directly from the trajectory coordinates. Therefore, only the *verb* needs to be recognized from the trajectories. Algorithm 2 explains the recognition of “move” and “stay”. These two words are a pair of antonyms meaning that an object is moving faster than a certain speed or staying still. The input is the segmented trajectory produced by the segmentation algorithm, given as a list of points whose length is the number of points remaining after segmentation. The distance threshold from the segmentation step is reused here to detect whether an object is moving. To improve robustness to noise, we apply a window of length w to filter the noise. The algorithm outputs either “move” or “stay” as the recognition result for the current frame. The time complexity is O(n), where n is the smaller of the length of the segmented trajectory and w.
Algorithm 2: Verb Recognition (move, stay). (The pseudocode appears as an image in the original.)
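The pseudocode of Algorithm 2 is likewise given only as an image, so the sketch below captures the idea described in the text: classify the current frame as “move” when the displacement within a window of the last `w` points exceeds the distance threshold, and “stay” otherwise. The exact windowing rule is our assumption.

```python
import math

def recognize_move_stay(seg_traj, d_thr, w):
    """Classify the current frame as 'move' or 'stay'.
    seg_traj: segmented trajectory, a list of (x, y) points with the
    most recent point last. The object is 'moving' when the path
    length over the last w points exceeds d_thr; the window filters
    single-frame noise. Runs in O(min(len(seg_traj), w)) time."""
    window = seg_traj[-w:] if w > 0 else seg_traj
    if len(window) < 2:
        return "stay"                     # too little history: assume still
    path_len = sum(math.dist(a, b) for a, b in zip(window, window[1:]))
    return "move" if path_len > d_thr else "stay"
```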
Algorithm 3 shows the recognition of the *verb* “follow”, which means that the *subject* is moving or staying together with the *object*. The inputs are the two objects’ segmented trajectories, each given as a list of points whose length is the number of points after segmentation. A distance threshold is used to detect whether one object is close to the other. As in Algorithm 2, a window length w is passed to the algorithm for denoising. The algorithm outputs either “follow” or null as the recognition result for the current frame. In addition, the *verb* “face to” means that the *subject* is facing the *object*; since recognizing it requires detecting the orientation of the body or head, which our method does not currently support, we omit it in this paper and consider it in future work. The time complexity of Algorithm 3 is O(n), where n is the smaller of the length of the segmented trajectories and w.
Algorithm 3: Verb Recognition (follow). (The pseudocode appears as an image in the original.)
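Again the pseudocode is only available as an image; this sketch follows the description in the text. The denoising choice here, requiring *all* of the last `w` synchronized point pairs to be within the distance threshold, is our assumption.

```python
import math

def recognize_follow(traj_a, traj_b, d_thr, w):
    """Output 'follow' when object A has stayed within distance d_thr
    of object B over the last w synchronized frames, else None.
    traj_a, traj_b: segmented trajectories as lists of (x, y) points,
    most recent point last. Runs in O(min(len, w)) time."""
    n = min(len(traj_a), len(traj_b), w)
    if n == 0:
        return None                       # no overlapping history
    pairs = zip(traj_a[-n:], traj_b[-n:])
    close = all(math.dist(a, b) <= d_thr for a, b in pairs)
    return "follow" if close else None
```

For example, a product trajectory that stays within a few pixels of a hand trajectory over the window yields the primitive {a product is moving together with one’s hand}.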
3.3. Define CB by Primitives
With our designed primitives, we are able to customize a wide range of CBs through combinations of primitives. Since our primitives are designed with reference to the target CBs in existing methods, we applied them to define those target CBs. The clothes-related CBs are excluded because they are not common in ordinary retail stores and are too complex for our proposal. We defined the CBs in Table 1 by primitives, as shown in Table 2. The symbol “→” defines the primitives’ chronological order: primitives that precede the symbol are assumed to occur first. Since, in our implementation, a product is occluded while it is on the shelf, a precise definition of “touch the shelf” is difficult to formulate; therefore, we defined it broadly as the primitive pattern in Table 2.
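As a concrete illustration of such definitions, the “→”-ordered patterns can be stored as ordered lists. The shorthand strings below abbreviate the primitives of Table 1 and are our own labels, not the paper’s full sentence syntax; the ordering of the two patterns follows Table 1’s primitive listings.

```python
# "→" chronological order is expressed as list order. Note that
# "pick up" and "return" use the same primitives and differ only
# in the order in which {product follows hand} occurs.
CB_PATTERNS = {
    "pick up a product from the shelf": [
        "hand moves to shelf",
        "hand moves back from shelf",
        "product follows hand",
    ],
    "return a product back to the shelf": [
        "hand moves to shelf",
        "product follows hand",
        "hand moves back from shelf",
    ],
}
```

This is the sense in which primitives are reusable: adapting to a new or changed target CB only means writing a new ordered list, with no data collection or training.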
3.4. Primitive Pattern Matching
The recognized primitives are stored in a sequence that retains their chronological order. Once any primitive has been recognized in the current frame, our method matches the primitive sequence against the predefined primitive patterns, and any matched pattern is considered a recognized CB. Algorithm 4 explains the details of the pattern matching. Forward matching in chronological order consumes a great deal of computational resources because a separate matching state must be saved for each primitive pattern, so the running speed drops as the running time grows. Therefore, we match recognized primitives in reverse chronological order; that is, we start matching from the most recently recognized primitives, which saves a great deal of computational resources because no matching states need to be saved. The algorithm takes as inputs the sequence of recognized primitives, a predefined primitive pattern, and a number that stops the algorithm when no primitive has been matched within that number of recent frames. The output is a Boolean value indicating whether the corresponding CB is matched. The time complexity is O(n), where n is the smaller of the length of the primitive sequence and the length of the pattern.
Algorithm 4: Primitive Pattern Matching. (The pseudocode appears as an image in the original.)
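In the spirit of Algorithm 4, whose pseudocode is only given as an image, the following sketch matches a pattern in reverse chronological order, starting from the most recently recognized primitives. The per-frame representation (a set of primitive labels) and the early-stop rule are our assumptions.

```python
def match_pattern(frames, pattern, n_stop):
    """Match a predefined primitive pattern against recognized
    primitives in reverse chronological order.
    frames: list of sets; frames[i] holds the primitives recognized
    at frame i, most recent frame last.
    pattern: primitives in chronological order, so matching starts
    from the pattern's last element. Gives up after n_stop consecutive
    frames without a match. Returns True if the whole pattern matched."""
    idx = len(pattern) - 1          # next pattern element to find
    gap = 0                         # consecutive frames without a match
    for frame in reversed(frames):
        if idx >= 0 and pattern[idx] in frame:
            idx -= 1
            gap = 0
        else:
            gap += 1
            if gap >= n_stop:
                return False        # stale sequence: stop searching
        if idx < 0:
            return True             # every pattern element was found
    return idx < 0
```

Because the search walks backwards from the newest primitive, no partial matching state has to be carried between frames, which is the resource saving the text describes.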
5. Conclusions
Smart retail solutions usually require the recognition of a wide range of CBs from video captured in stores. The CBs selected as recognition targets are called target CBs, and they change frequently with changes in needs, environments, etc. To adapt flexibly to such changes, we proposed a flexible CBR approach. Our main idea is to recognize CBs using combinations of primitives, which are partitioned units of CB. Since different CBs share the same primitives, the primitives can be reused when adapting to target CB changes, which avoids time-consuming steps such as re-collecting training data and re-training the recognition models. Consequently, our method can adapt to changes in target CBs simply by changing the combinations of primitives. In addition, we designed a syntax based on natural language grammar to define primitives; the readable syntax improves the explanatory power of our method. Therefore, the use of primitives and our proposed syntax enables a high degree of flexibility in adapting to target CB changes. Our evaluation experiments demonstrated that the method achieved an acceptable level of accuracy and great flexibility across different datasets.
Nevertheless, the experiments also revealed some limitations of our proposed method. Since the method is difficult to fine-tune to individual situations, its recognition accuracy is lower than that of ML-based methods; a possible solution would be to replace the current pattern matching algorithm with a probabilistic model. In addition, because the *place* element in the primitive syntax limits the number of positions, the syntax cannot represent complex movement, such as spiral movement, which limits the range of CBs that can be covered; increasing the vocabulary of *place* could improve the syntax’s expressive power. Furthermore, though the *verb* vocabulary includes orientation information (“face to”), orientation detection is not currently applied. These limitations may be addressed in future work.