Towards Revenue Maximization with Popular and Profitable Products

Authors:

Wensheng Gan,

Guoting Chen,

Hongzhi Yin,

Philippe Fournier-Viger,

Chien-Ming Chen,

Philip S. Yu

ACM/IMS Transactions on Data Science (TDS), Volume 2, Issue 4

Article No.: 42, Pages 1 - 21

https://doi.org/10.1145/3488058

Published: 24 May 2022

Abstract

From an economic perspective, a common goal for companies conducting marketing is to maximize the return revenue/profit by applying effective marketing strategies. Consumer behavior is crucially important in the economy and in targeted marketing, where behavioral economics can provide valuable insights for identifying biases and profiting from customers. Finding credible and reliable information on products’ profitability is, however, quite difficult, since most products tend to peak at certain times within the seasonal sales cycles of a year. On-Shelf Availability (OSA) is a key factor in performance evaluation. Besides, staying ahead of hot product trends means we can increase marketing efforts without selling out the inventory. To fill this gap, in this paper, we first propose a general profit-oriented framework to address the problem of revenue maximization based on economic behavior, and compute the On-shelf Popular and most Profitable Products (OPPPs) for targeted marketing. To tackle the revenue maximization problem, we model the k-satisfiable product concept and propose an algorithmic framework for searching for OPPPs and their variants. Extensive experiments are conducted on several real-world datasets to evaluate the effectiveness and efficiency of the proposed algorithm.

1 Introduction

From an economic perspective, a common goal for companies conducting marketing is to maximize the return profit by applying effective marketing strategies. Consumer behavior plays a very important role in the economy and in targeted marketing [34, 44, 45, 50]. A successful business influences the behavior of consumers to encourage them to buy its products. In marketing, behavioral economics can provide valuable insights via various web services, e.g., Amazon, by helping people to identify biases and profit from all customers. On the other hand, manufacturers can use information about consumers’ requirements for various products to select appropriate products for the market. As a result, the evolving ecosystem of personal behavioral data from web services has many real applications [1, 12, 16, 35]. However, understanding consumer behavior is quite challenging; for example, it is hard to find credible and reliable information on products’ profitability.

Consumer behavior analytics is crucially important for decision-makers; it can be used to support the global market and has attracted considerable attention [2, 34, 38, 44, 45]. In economics, utility [32] is a measure of a consumer’s preferences over alternative sets of goods or services. However, only a few works in data mining and information search [35, 37, 44, 50, 51] have studied, from an economic perspective, targeted marketing for revenue/profit maximization based on economic behavior. Considering consumer choice [3] and preference constraints, several factors are crucial to success in commerce (w.r.t. profit/utility maximization): (i) rational behavior; (ii) preferences that are known and measurable; (iii) inventory management; and (iv) price. To obtain higher profit from the products, the decision-maker should find the most profitable and popular products in the historical records. The reason is that business people are likely to know the total amount of profit in their business, but find it difficult to know which product makes the most money [34, 44].

In recent decades, many studies have been proposed in computational economics, which explores the intersection of economics and computation. At the same time, computer scientists have incorporated interesting concepts from economics into various research domains, such as data mining, databases, social networks, and the Internet of Things. In data mining, which is also called knowledge discovery from data (KDD), there is tremendous interest in developing novel utility-oriented methodologies and models for obtaining insights from rich data. Therefore, a new utility-oriented data mining paradigm called utility mining [8, 10, 38] has become an emerging technology and has been successfully applied to various fields. For example, high-utility itemset mining [30, 33, 38], high-utility sequence mining [19, 42, 47], and high-utility episode mining [28, 43] have been extensively studied to deal with different types of data, including itemset-based transactional data, sequence data, and complex event data.

Most products operate on a seasonal sales cycle that tends to peak at a certain time of the year. To sell seasonal products with maximal profit, the sales department can find which products are most likely to be purchased in particular periods by utilizing seasonal consumer behavior data (i.e., weekly, monthly, quarterly, and annual sales data). Although On-Shelf Availability (OSA) is a key factor [4], out-of-stock levels on the shelf, as shown in Figure 1,¹ remain highly persistent. In today’s challenging economic environment, it is more critical than ever for retailers and manufacturers to ensure that every product a customer wants to buy is available every time. Historical data can be analyzed to classify groups of products and identify those that are most likely to sell out first. A sustainable OSA management process should consider the frequency and quantity of ordering, as well as the inventory, to address the root causes of out-of-stocks. In general, seasonal products may become popular/hot during some special periods and fly off the shelves. For example, sunglasses may sell out during the summer, but fewer people purchase them in the coming winter.

Fig. 1.

Product bundling is one of the sales strategies for successful commerce. Bundling pairs less profitable (even negative-profit) or slower-moving products with more profitable ones, which reduces the storage taken up by the less valuable products and frees more space for products with higher profitability. To improve the overall profitability of a business, it is important to decide when and how to adjust product pricing and sell bundled products. Compared with focusing only on hot products that sell out, finding the most profitable products can increase the overall profitability [11]. Ensuring that the hottest and most profitable products are on the shelf is essential for any retailer, but even today it remains a major challenge.

Motivated by existing research in economics, in this paper, we propose a profit-oriented framework to address this problem and compute the on-shelf popular and most profitable products for targeted marketing. We also model $k$-satisfiable product searching and propose an algorithmic framework. The principal contributions of this paper are summarized as follows:

• This is the first work to systematically study the problem of computing on-shelf hot and most profitable products for targeted marketing based on economic behavior, including purchase frequency, purchase time periodicity, on-shelf availability, and utility/profit theory. It can help us understand users’ economic behaviors, find the On-shelf Popular and most Profitable Products (OPPPs), and then conduct targeted marketing with profit maximization.

• The solution space of the designed approach can be reduced to a finite number of points by a tree-based searching technique. Two compact data structures are developed to store the necessary information of the databases.

• The concept of remaining positive profit is adopted to calculate the estimated upper bound. Based on the developed pruning strategies, the OP3M algorithm can directly discover OPPPs using the OPP-list with only two database scans. Without candidate generation-and-test, it performs a depth-first search that spans the search space while the OPP-lists are constructed.

• The extensive performance evaluation on several real-world e-commerce datasets demonstrates the effectiveness and efficiency of the proposed OP3M framework.

The rest of this paper is organized as follows. Related work is reviewed in Section 2. The preliminaries and problem statement are given in Section 3. Details of the proposed OP3M algorithm are described in Section 4. The evaluation of the effectiveness and efficiency of the proposed OP3M framework is provided in Section 5. Finally, conclusions are drawn in Section 6.

2 Related Work

Our research is related to work in computational economics, utility-based mining, and other profit-oriented searching and mining. In particular, the advent of the Internet has resulted in large sets of user behavior records, which makes it possible for targeted marketing to maximize the return profit. Up to now, the evolving ecosystem of personal behavioral data from web services has found many real applications [1, 6, 12, 16, 35]. For example, frequency-based pattern or rule mining [1, 16] is one of the common approaches to discovering hidden relationships among items in transactions. In contrast to frequency-based mining models, the rare pattern mining framework, which aims at discovering non-frequent but interesting patterns, has also been proposed [17]. Geng and Hamilton [12] reviewed measures for selecting and ranking patterns according to their potential interest to the user. Consumer behavior analytics is crucially important for decision-makers; it can be used to support the global market and has attracted considerable attention [34, 38, 44, 45]. Recently, some researchers have studied profit in web mining, such as profit-oriented pattern mining [2, 38], association [46], market share [35], and decision making to maximize the revenue from products [1, 12, 37]. Periodicity is prevalent in the physical world, and many events involve more than one period. The previous profit-oriented works [29, 38] and the skyline operator [40] have not been optimized to handle temporal on-shelf data and on-shelf availability [4], let alone both positive and negative unit profits. For this class of time-related data, the period of interest needs to be added as an additional constraint to be evaluated together with the decision criteria of the addressed problem.

In economics, utility [31] is a key measure of a consumer’s preferences over alternative sets of goods or services. It is a basic building block of rational choice theory [3]. Up to now, some works have explored economic behavior data [35, 37, 44, 51] from the data mining and information search perspectives. For example, considering the utility concept, a new data mining paradigm called utility mining [8, 10, 30, 38, 42] has been extensively studied and applied to different applications. Although a straightforward enumeration of all high-utility patterns (HUPs) sounds promising, it unfortunately does not yield a scalable solution for computing the utility of patterns. The reason is that the utility of HUPs does not satisfy the well-known Apriori property [1] (also known as the downward closure property [1]) [2, 8, 10, 38]. Utility mining has been extended to deal with different types of data, including itemset-based transactional data [30, 38], sequence data [19, 42, 47], uncertain data [22, 24], and complex event data [28, 43]. Furthermore, various interesting and challenging issues in utility mining have been addressed, including top-$K$ high-utility pattern mining [5, 39, 48], utility mining in dynamic databases [10, 25, 27], mining high-utility patterns under different special constraints [18, 23, 26, 33], and privacy-preserving utility mining [9]. Overall, utility-driven pattern mining has been shown to be of considerable value in a wide range of applications.

There are many heuristic search algorithms in artificial intelligence. Up to now, there have been several studies on heuristic search [15], constraint programming [13, 14], and multi-objective optimization [40] (e.g., Pareto-based methods) for pattern mining. Note that an exhaustive search strategy explores many possible subsets, while a heuristic search strategy explores only a limited number of them. In particular, in heuristic search methods, the state space is not fully explored and randomization is often employed. Most studies of utility mining and pattern mining aim at discovering an optimal set of interesting patterns under the given constraints, while approaches based on heuristic search may return results that are not necessarily optimal.

In other related research fields, some interesting works have applied utility theory [31] to recommender systems [20, 41, 49, 50]. Wang and Zhang [41] were the first to incorporate marginal utility into product recommender systems. They adapted the widely used Cobb-Douglas utility function [3] to model product-specific diminishing marginal return and user-specific basic utility to personalize recommendation. Li et al. [20] highlight that product recommender systems differ from music or movie recommender systems, as the former should take the utility of products into account in their ranking. Their approach employs the utility and utility surplus [50] theories from economics and marketing to improve the list of recommended products. The work in [53] studies multi-product utility maximization for economic recommendation. These existing product recommender systems, however, do not consider the time periodicity of purchased products or on-shelf availability. The work in [52] utilizes purchase interval information to improve performance for e-commerce, but it does not consider the on-shelf availability and purchase frequency of products.

3 Preliminaries and Problem Formulation

3.1 Utility-based Computing Model

A fundamental notion in utility theory is that each consumer is endowed with an associated utility function, which is “a measure of the satisfaction from consumption of various goods and services” [32]. In the context of purchasing decisions, we assume that the consumer has access to a set of products, each product having a price. Informally, buying a product involves the exchange of money for a product. Given the utility of a product, to analyze consumers’ motivation to trade money for the product, it is also necessary to analyze consumer behavior. In economics, the utility that a consumer has for a product can be decomposed into a set of utilities for each product characteristic. According to this utility theory, we have the following concepts and formulation. The notations of symbols are first summarized in Table 1.

Table 1.

Symbol	Description
$I$	A set of $m$ items/products, $I = \lbrace i_1, i_2, \ldots , i_m\rbrace$ .
$X$	A group of products, $X = \lbrace i_1, i_2, \dots , i_j\rbrace$ .
$D$	A quantitative database, $D = \lbrace T_1, T_2, \ldots , T_n\rbrace$ .
minfre	A minimum frequency threshold.
minpro	A minimum profit threshold.
$sup(X)$	The total support value of $X$ in $D$ .
${OPPP}$	On-shelf most popular and profitable product.
$q(i_j, T_c)$	The purchase quantity of an item $i_j$ in $T_c$ .
$up(i_j)$	The unit profit of an item $i_j \in I$ .
$p(i_j, T_c)$	The profit of an item $i_j$ in $T_c$ .
$p(X, h)$	The sum of profits of $X$ in a period $h$ .
RTWU	Redefined transaction-weighted utilization.
$pp(X)$	The sum of positive profits of $X$ in $D$ .
$np(X)$	The sum of negative profits of $X$ in $D$ .
$rpp(X)$	The sum of remaining positive profits of $X$ in $D$ .
OPP-list	List structure with On-shelf Popularity and Profit.
OFU $^{\pm }$ -table	An On-shelf Frequency-Utility table (with both positive and negative profit values).
$X.list$	The OPP-list of a group of products $X$ .

Table 1. Summary of Notations

Example 3.1.

Consider the e-commerce database shown in Table 2, which will be used as a running example in the following sections. Similar to the e-commerce database provided by RecSys Challenge 2015² (which contains some negative profit values since many all-occasion gifts are sold), this example database contains five purchase behavior records ( $T_1, T_2, \dots , T_5$ ) and three time periods (1, 2, 3). Behavior $T_1$ occurred in time period 1 and contains products $b$ , $c$ , and $e$ , which appear in $T_1$ with purchase quantities of 2, 1, and 3, respectively. Table 3 indicates that the external profits (unit profits) of these products are, respectively, $-$ $2, $4, and $7. Notice that the negative unit profit of product $b$ indicates that this product is sold at a loss.

Table 2.

Tid	User	Purchase record	Period
$T_1$	$U_1$	$(b,2)(c,1)(e,3)$	1
$T_2$	$U_2$	$(a,1)(b,1)(c,2)(f,1)$	1
$T_3$	$U_3$	$(a,3)(b,6)(c,4)(d,1)(e,1)(f,2)$	2
$T_4$	$U_4$	$(c,3)(d,3)(e,1)$	2
$T_5$	$U_5$	$(a,1)(d,2)(e,3)(f,1)$	3

Table 2. An E-commerce Database

Table 3.

Product	$a$	$b$	$c$	$d$	$e$	$f$
Profit ($)	3	$-$ 2	4	1	7	5

Table 3. External Profit Values (unit profit)

Let $I = \lbrace i_1, i_2, \dots , i_m\rbrace$ be a set of $m$ distinct products/items. A time-varying e-commerce database $D = \lbrace T_1, T_2, \dots , T_n\rbrace$ contains a set of temporal consumer purchase behaviors. Each transaction $T_c \in D$ is a behavior record of one consumer; it is a subset of $I$ and has a unique identifier $c$ called its Tid. Each product/item $i_j \in I$ is associated with a positive or negative number $up(i_j)$ , called its unit profit. For each transaction $T_c$ such that $i_j \in T_c$ , a positive number $q(i_j, T_c)$ is called the purchase quantity of $i_j$ . Let PE be a set of positive integers representing time periods; a period can be a weekly, monthly, quarterly, or yearly timespan. Each transaction $T_c \in D$ is associated with a time period $pe(T_c) \in PE$ , the period in which the transaction occurred.

In general, the profit of $X \subseteq I$ depends on the cost price and the selling price. For the problem addressed in this paper, we assume that the unit profit of each distinct product is given in a pre-defined profit table, as shown in Table 3. As mentioned before, negative profits are commonly seen in cross-promotion. For instance, in $T_1$ , although giving away two units of product $\lbrace b\rbrace$ results in a loss of $4 for the supermarket, the bundled products $\lbrace (c,1) (e,3)\rbrace$ that are cross-promoted with $\lbrace b\rbrace$ generate $25, so the transaction still yields a net profit of $21.

Given a set of products $I$ and a set of customers $C$ = { $c_1$ , $c_2$ , $\dots$ , $c_j$ }, the market contribution of a group of products $X$ = { $i_1$ , $i_2$ , $\dots$ , $i_j$ } is related to the profit obtained from $X$ after marketing. The contribution of a group of products $X$ is the sum of the profits it receives from all the customers in the market. The key concepts used in this paper are introduced as follows. A utility function is a map $U$ : $X \longrightarrow \Re$ . The profit of a combined product $X$ (a group of products $X \subseteq I$ ) in a transaction $T_c$ is:

\begin{equation} p(X, T_c) = \sum _{i_j \in X \wedge X \subseteq T_c} {p(i_j, T_c)}. \end{equation}

(1)

where $p(i_j, T_c)$ is the profit of a product $i_j \in I$ in a transaction $T_c$ , and $p(i_j, T_c)$ can be calculated as $p(i_j, T_c)$ = $up(i_j)$ $\times$ $q(i_j, T_c)$ . It represents the profit generated by the products $i_j \in X$ in $T_c$ . Considering the set of time periods in which $X$ was sold, the time periods (on-shelf time) of a group of products $X \subseteq I$ are $os(X)$ = $\lbrace pe(T_c) | T_c \in D$ $\wedge X$ $\subseteq T_c\rbrace$ . Let $p(X, h)$ denote the profit of a group of products $X \subseteq I$ in a time period $h \in os(X)$ ; then by Equation (2) we have:

\begin{equation} p(X, h) = \sum _{T_c \in D \wedge X \subseteq T_c \wedge pe(T_c) = h} {p(X, T_c)}. \end{equation}

(2)

Summing Equation (2) over the on-shelf periods gives the overall profit of a group of products $X \subseteq I$ in an e-commerce database $D$ as $p(X)$ = $\sum _{h \in os(X)} {p(X,h)}$ . Given a group of products $X$ , let $top(X)$ denote the total profit of the time periods in which $X$ is on the shelf; it is defined as follows:

\begin{equation} top(X) = \sum _{T_c \in D \wedge pe(T_c) \in os(X)} {tp(T_c)}, \end{equation}

(3)

where $tp(T_c)$ is the transaction profit (tp) of a transaction $T_c$ , i.e.,

\begin{equation} tp(T_c) = \sum _{i \in T_c}{p(i, T_c)}. \end{equation}

(4)

Let $rp(X)$ denote the relative profit of a group of products $X \subseteq I$ in an e-commerce database $D$ . Using Equation (3), we have $rp(X)$ = $p(X)$ / $top(X)$ , which represents the percentage of the profit that was generated by $X$ during the time periods when $X$ was sold.

Example 3.2.

The profit of product $e$ in $T_1$ is $p(e, T_1)$ = 3 $\times$ $7 = $21, and the profit of products $\lbrace c,e\rbrace$ in $T_1$ is $p(\lbrace c,e\rbrace , T_1)$ = $p(c, T_1)$ + $p(e, T_1)$ = 1 $\times$ $4 + 3 $\times$ $7 = $25. The time periods of $\lbrace c,e\rbrace$ are $os(\lbrace c,e\rbrace)$ = {period 1, period 2}. The profits of $\lbrace c,e\rbrace$ in periods 1 and 2 are, respectively, $p(\lbrace c,e\rbrace$ , period 1) = $25 and $p(\lbrace c,e\rbrace$ , period 2) = $42. The profit of $\lbrace c,e\rbrace$ in the database is $p(\lbrace c,e\rbrace)$ = $p(\lbrace c,e\rbrace$ , period 1) + $p(\lbrace c,e\rbrace$ , period 2) = $25 + $42 = $67. The transaction profits of $T_1$ , $T_2$ , $\dots$ , $T_5$ are $tp(T_1)$ = $21, $tp(T_2)$ = $14, $tp(T_3)$ = $31, $tp(T_4)$ = $22, and $tp(T_5)$ = $31. Since periods 1 and 2 contain the transactions $T_1$ to $T_4$ , the total profit of the time periods of $\lbrace c,e\rbrace$ is $top(\lbrace c,e\rbrace)$ = $tp(T_1)$ + $tp(T_2)$ + $tp(T_3)$ + $tp(T_4)$ = $88. The relative profit of $\lbrace c,e\rbrace$ is $rp(\lbrace c,e\rbrace)$ = $p(\lbrace c,e\rbrace)$ / $top(\lbrace c,e\rbrace)$ = $67 / $88 = 0.76.
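
The calculations in this example can be reproduced with a few lines of Python. The sketch below is only an illustration and not part of OP3M: the dictionary layout and the function names are assumptions made here, while the data are taken from Tables 2 and 3 and the formulas follow Equations (1) and (2).

```python
# Running example: Table 2 as tid -> (period, {product: quantity}), Table 3 as unit profits.
DB = {
    "T1": (1, {"b": 2, "c": 1, "e": 3}),
    "T2": (1, {"a": 1, "b": 1, "c": 2, "f": 1}),
    "T3": (2, {"a": 3, "b": 6, "c": 4, "d": 1, "e": 1, "f": 2}),
    "T4": (2, {"c": 3, "d": 3, "e": 1}),
    "T5": (3, {"a": 1, "d": 2, "e": 3, "f": 1}),
}
UNIT_PROFIT = {"a": 3, "b": -2, "c": 4, "d": 1, "e": 7, "f": 5}


def profit_in_transaction(X, items):
    """p(X, T_c) as in Equation (1); 0 if T_c does not contain all of X."""
    if not set(X).issubset(items):
        return 0
    return sum(UNIT_PROFIT[i] * items[i] for i in X)


def on_shelf_periods(X):
    """os(X): the periods of the transactions that contain X."""
    return {period for period, items in DB.values() if set(X).issubset(items)}


def profit_in_period(X, h):
    """p(X, h): profit of X over the transactions of period h that contain X."""
    return sum(profit_in_transaction(X, items)
               for period, items in DB.values() if period == h)


def profit(X):
    """p(X): profit of X summed over its on-shelf periods."""
    return sum(profit_in_period(X, h) for h in on_shelf_periods(X))


X = ("c", "e")
print(profit_in_transaction(X, DB["T1"][1]))      # 25
print(sorted(on_shelf_periods(X)))                # [1, 2]
print([profit_in_period(X, h) for h in (1, 2)])   # [25, 42]
print(profit(X))                                  # 67
```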

3.2 On-Shelf Availability

On-Shelf Availability (OSA) [4] is a key factor in whether a product is available for sale to the customer. It is impacted by a host of different factors all along the supply chain. Out of Stock (OOS) [4], also known as stock-out, is a situation in which the retailer does not physically have a particular product on its shelf to sell to the customer. It can be estimated from store inventory data.

The retail industry is highly competitive, and fulfilling customer expectations and demands has become essential for sustainable growth and profit margins. On-shelf availability is a key indicator for the retail industry and can greatly impact profit and customer loyalty. Based on the dataset shown in Table 2, and taking OSA, popularity, and profit as the primary decision criteria, the details of each product are shown in Table 4.

Table 4.

Product	Profit	Quantity	Seasonal period start	Seasonal period end
$a$	$15	5	1	3
$b$	$-$ $18	9	1	2
$c$	$40	10	1	2
$d$	$6	6	2	3
$e$	$56	8	1	3
$f$	$20	4	1	3

Table 4. Details of Each Product

Let $sup(h)$ denote the frequency of a time period $h$ , i.e., the number of transactions in $h$ : $sup(h) = |\lbrace T_c \in D \mid pe(T_c) = h\rbrace |$ , and let $sup(X,h) = |\lbrace T_c \in D \mid X \subseteq T_c \wedge pe(T_c) = h\rbrace |$ denote the number of transactions in $h$ that contain $X$ . The relative frequency of a group of products $X$ for a time period $h$ is then defined as:

\begin{equation} rf(X, h) = sup(X,h)/ sup(h). \end{equation}

(5)

Consider products $\lbrace c\rbrace$ and $\lbrace a,c\rbrace$ in period 1. Since $sup({period} 1)$ = 2, we have $rf(\lbrace c\rbrace , {period} 1)$ = $2/2$ = 1.0 and $rf(\lbrace a,c\rbrace , {period} 1)$ = $1/2$ = 0.5, so $rf(\lbrace c\rbrace , {period} 1) \gt$ $rf(\lbrace a,c\rbrace , {period} 1)$ . The relative profit of a group of products $X$ for a time period $h$ is

\begin{equation} rp(X, h) = p(X, h)/top(h), \end{equation}

(6)

where $top(h)$ is the total profit of a time period $h$ , i.e., $top(h) =\sum _{T_c \in D \wedge pe(T_c) = h} {tp(T_c)}$ . Consider products $\lbrace c\rbrace$ and $\lbrace a,c\rbrace$ in period 1: $top({period} 1)$ = $tp(T_1)$ + $tp(T_2)$ = $21 + $14 = $35, thus $rp(\lbrace c\rbrace , {period} 1)$ = $12/$35 = 0.343 and $rp(\lbrace a,c\rbrace , {period} 1)$ = $11/$35 = 0.314.
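
The per-period measures of Equations (5) and (6) can be checked on the running example with the following minimal sketch. The function names and the data layout are assumptions made for illustration; the data come from Tables 2 and 3.

```python
# Running example: Table 2 as tid -> (period, {product: quantity}), Table 3 as unit profits.
DB = {
    "T1": (1, {"b": 2, "c": 1, "e": 3}),
    "T2": (1, {"a": 1, "b": 1, "c": 2, "f": 1}),
    "T3": (2, {"a": 3, "b": 6, "c": 4, "d": 1, "e": 1, "f": 2}),
    "T4": (2, {"c": 3, "d": 3, "e": 1}),
    "T5": (3, {"a": 1, "d": 2, "e": 3, "f": 1}),
}
UNIT_PROFIT = {"a": 3, "b": -2, "c": 4, "d": 1, "e": 7, "f": 5}


def support(h, X=()):
    """sup(X, h): transactions of period h that contain X; sup(h) when X is empty."""
    return sum(1 for period, items in DB.values()
               if period == h and set(X).issubset(items))


def relative_frequency(X, h):
    """rf(X, h) = sup(X, h) / sup(h), Equation (5)."""
    return support(h, X) / support(h)


def transaction_profit(items):
    """tp(T_c): sum of the profits of all products in the transaction."""
    return sum(UNIT_PROFIT[i] * q for i, q in items.items())


def period_total_profit(h):
    """top(h): sum of tp(T_c) over the transactions of period h."""
    return sum(transaction_profit(items)
               for period, items in DB.values() if period == h)


def relative_profit(X, h):
    """rp(X, h) = p(X, h) / top(h), Equation (6)."""
    p_xh = sum(UNIT_PROFIT[i] * items[i]
               for period, items in DB.values()
               if period == h and set(X).issubset(items)
               for i in X)
    return p_xh / period_total_profit(h)


print(relative_frequency(("c",), 1), relative_frequency(("a", "c"), 1))  # 1.0 0.5
print(period_total_profit(1))                                            # 35
print(round(relative_profit(("c",), 1), 3))                              # 0.343
print(round(relative_profit(("a", "c"), 1), 3))                          # 0.314
```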

3.3 Problem Formulation

A group of products $X$ is said to be an on-shelf most popular and profitable product (OPPP) if it is popular in one or more periods $pt$ (its relative frequency is no less than a user-specified minimum frequency threshold $minfre$ ) and it is highly profitable in $pt$ (its relative profit $rp(X)$ is no less than a user-specified minimum profit threshold $minpro$ , with $0 \le minpro \le 1$ ). Otherwise, $X$ is a non-OPPP.

For clarity, the term “OPPP” stands for “On-shelf most Popular and Profitable Product”. An OPPP yields maximal profit while satisfying customers in terms of on-shelf period, high popularity, and high profitability. Thus, the concept of OPPP is in fact a $k$-satisfiable product: if a product satisfies at least $k$ constraints (i.e., the OSA constraint, popularity w.r.t. inventory control, and high profitability), we say that it is a $k$-satisfiable product for maximizing profit, where $k$ is a user-predefined non-negative integer. In the addressed OP3M problem and the running example, $k$ is equal to 3. To satisfy customers, retailers must be able to obtain their feedback and improve their services. Based on the above analytics, the decision-maker can devise an efficient business strategy, which can improve the overall profitability of his/her business. Ensuring that products are on the shelf is essential for any retailer, but even today it remains a major challenge.

Problem statement. The problem of computing the most popular on-shelf and profitable products for targeted marketing is to discover all OPPPs in an e-commerce database in which all unit profit values are positive. The problem of computing the most popular on-shelf and profitable products for targeted marketing with negative values is to discover all OPPPs in an e-commerce database in which external unit profit values may be positive or negative.

A naive approach to this problem is to enumerate all possible subsets of the products $I$ , calculate the frequency and the sum of profits of each subset, and choose the frequent subsets with high total profit. However, this approach is not scalable because the number of possible subsets is exponential. This motivates us to propose an efficient algorithm named OP3M for searching for OPPPs. For efficient multi-criteria decision analyses, as mentioned previously, one could utilize heuristic search [15], constraint programming [13, 14], or multi-objective optimization [40]. An alternative approach is to formulate a truly multi-objective optimization problem in which the heuristic search tries to optimize each criterion with respect to the others. However, the state space of heuristic search methods is not fully explored and randomization is often employed. Therefore, OP3M utilizes an exhaustive search strategy with various user-specified constraints instead of heuristic search. Overall, OP3M is an exact utility-based framework rather than a randomized one.
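
To make the cost of the naive baseline concrete, the following sketch enumerates all $2^m - 1$ non-empty groups of products of the running example and tests each one. It is not the OP3M algorithm. It rests on one reading of the OPPP definition that is an assumption here: a group qualifies if there is at least one period $h$ in which both $rf(X, h) \ge minfre$ and $rp(X, h) \ge minpro$ hold. The threshold values and all names are hypothetical.

```python
from itertools import combinations

# Running example: Table 2 as tid -> (period, {product: quantity}), Table 3 as unit profits.
DB = {
    "T1": (1, {"b": 2, "c": 1, "e": 3}),
    "T2": (1, {"a": 1, "b": 1, "c": 2, "f": 1}),
    "T3": (2, {"a": 3, "b": 6, "c": 4, "d": 1, "e": 1, "f": 2}),
    "T4": (2, {"c": 3, "d": 3, "e": 1}),
    "T5": (3, {"a": 1, "d": 2, "e": 3, "f": 1}),
}
UNIT_PROFIT = {"a": 3, "b": -2, "c": 4, "d": 1, "e": 7, "f": 5}
PERIODS = sorted({period for period, _ in DB.values()})


def transaction_profit(items):
    """tp(T_c): sum of the profits of all products in the transaction."""
    return sum(UNIT_PROFIT[i] * q for i, q in items.items())


def qualifies(X, h, minfre, minpro):
    """Assumed test: rf(X, h) >= minfre and rp(X, h) >= minpro in period h."""
    in_period = [items for period, items in DB.values() if period == h]
    total = sum(transaction_profit(items) for items in in_period)   # top(h)
    containing = [items for items in in_period if set(X).issubset(items)]
    if not containing or total <= 0:
        return False
    rf = len(containing) / len(in_period)
    rp = sum(UNIT_PROFIT[i] * items[i] for items in containing for i in X) / total
    return rf >= minfre and rp >= minpro


def naive_oppp(minfre, minpro):
    """Brute force over every non-empty subset of products (exponential in m)."""
    products = sorted(UNIT_PROFIT)
    found = []
    for size in range(1, len(products) + 1):
        for X in combinations(products, size):
            if any(qualifies(X, h, minfre, minpro) for h in PERIODS):
                found.append(X)
    return found


result = naive_oppp(minfre=0.5, minpro=0.3)   # hypothetical thresholds
print(len(result), "of", 2 ** len(UNIT_PROFIT) - 1, "candidate groups qualify")
```

Even with only six products, 63 candidate groups are examined, each against every period; the number of candidates doubles with each additional product, which is why pruning is needed.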

4 The Proposed OP3M Algorithm

4.1 Properties of On-Shelf Availability

It can be demonstrated that the popularity measure is anti-monotonic [1, 40], i.e., any superset of a non-popular pattern cannot be a popular pattern, while the (relative) profit measure is neither monotonic nor anti-monotonic [29, 38]. In other words, a group of products may have a lower, equal, or higher profit than its subsets. We extend the concept of transaction-weighted utilization (TWU) [2, 29] to OPPPs to show the properties of on-shelf availability. For a given time period $h$ , let ${RTWU}(X, h)$ denote the redefined transaction-weighted utilization of a group of products $X$ in $h$ , i.e., the sum of the redefined transaction profits of the transactions in $h$ that contain $X$ .

To handle negative product profits, the redefined transaction-weighted utilization (RTWU) [21] of a group of products $X$ is defined as:

\begin{equation} {RTWU}(X) = \sum _{T_c \in D \wedge X \subseteq T_c}{rtp(T_c)}. \end{equation}

(7)

where $rtp(T_c)$ is the redefined transaction profit (abbreviated as rtp) of a transaction $T_c$ , that is, $rtp(T_c)$ = $\sum _{x \in T_c \wedge up(x) \gt 0}{p(x, T_c)}$ . Thus, $rtp(T_c)$ is the sum of the positive profits of the products in $T_c$ , while negative external profits are ignored. Similarly, the redefined transaction-weighted utilization of a group of products $X$ for a time period $h$ can be represented as:

\begin{equation} {RTWU}(X, h) = \sum _{T_c \in D \wedge X \subseteq T_c \wedge pe(T_c) = h}{rtp(T_c)}. \end{equation}

(8)

For the running example (considering that $b$ has an external profit value of $-$ $2), the rtp values of $T_1$ , $T_2$ , $T_3$ , $T_4$ , and $T_5$ are, respectively, $25, $16, $43, $22, and $31. The RTWU values of products $a$ , $b$ , $c$ , $d$ , $e$ , and $f$ are, respectively, $90, $84, $106, $96, $121, and $90. If reflexivity of OSA makes sense at all, the continuous utility function has several components, and the addressed problem becomes quite complicated. According to previous studies [7, 21], we generalize the properties of on-shelf availability for the case of an e-commerce database with time-sensitive periods as follows.
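
Equations (7) and (8) translate almost directly into code. The sketch below is an illustration with assumed function names (`rtp`, `rtwu`) on the data of Tables 2 and 3; passing `h=None` gives the database-wide RTWU and passing a period gives the per-period variant.

```python
# Running example: Table 2 as tid -> (period, {product: quantity}), Table 3 as unit profits.
DB = {
    "T1": (1, {"b": 2, "c": 1, "e": 3}),
    "T2": (1, {"a": 1, "b": 1, "c": 2, "f": 1}),
    "T3": (2, {"a": 3, "b": 6, "c": 4, "d": 1, "e": 1, "f": 2}),
    "T4": (2, {"c": 3, "d": 3, "e": 1}),
    "T5": (3, {"a": 1, "d": 2, "e": 3, "f": 1}),
}
UNIT_PROFIT = {"a": 3, "b": -2, "c": 4, "d": 1, "e": 7, "f": 5}


def rtp(items):
    """Redefined transaction profit: only products with positive unit profit count."""
    return sum(UNIT_PROFIT[i] * q for i, q in items.items() if UNIT_PROFIT[i] > 0)


def rtwu(X, h=None):
    """RTWU(X) over the whole database, or RTWU(X, h) when a period h is given."""
    return sum(rtp(items) for period, items in DB.values()
               if set(X).issubset(items) and (h is None or period == h))


print(rtp(DB["T1"][1]), rtp(DB["T2"][1]), rtp(DB["T3"][1]))   # 25 16 43
print(rtwu(("a",)), rtwu(("b",)), rtwu(("f",)))               # 90 84 90
print(rtwu(("a", "c")))                                       # 59
print(rtwu(("c",), h=1))                                      # 41
```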

Property 4.1.

The RTWU of a group of products $X$ for a period $h$ is an upper bound on the profit of $X$ in period $h$ , that is ${RTWU}(X, h)$ $\ge$ $p(X, h)$ .
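
A short justification of Property 4.1, using only the definitions above (a sketch, assuming that all purchase quantities are positive, so that the sign of $p(i, T_c)$ equals the sign of $up(i)$ ): for every transaction $T_c$ with $X \subseteq T_c$ ,

\begin{equation*} p(X, T_c) = \sum _{i \in X} p(i, T_c) \le \sum _{i \in X \wedge up(i) \gt 0} p(i, T_c) \le \sum _{i \in T_c \wedge up(i) \gt 0} p(i, T_c) = rtp(T_c). \end{equation*}

The first inequality drops the negative terms and the second adds non-negative ones. Summing both sides over the transactions of period $h$ that contain $X$ gives $p(X, h) \le {RTWU}(X, h)$ .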

Property 4.2.

The RTWU measure is anti-monotonic in the whole database or in a specific period. Let $X$ and $Y$ be two groups of products. If $X \subset Y$ , then ${RTWU}(X)$ $\ge$ ${RTWU}(Y)$ ; likewise, for a period $h$ , ${RTWU}(X,h)$ $\ge$ ${RTWU}(Y, h)$ .

Property 4.3.

Let $X$ be a group of products. If ${RTWU}(X)$ / $top(X)$ $\lt$ minpro, then $X$ and all its supersets are low-profit.

Property 4.4.

The RTWU of a group of products $X$ in a period $h$ divided by the total profit of that period is higher than or equal to its relative profit in $h$ , i.e., ${RTWU}(X, h)$ / $top(h)$ $\ge$ $rp(X, h)$ . It is thus an upper bound on the relative profit of a group of products.

Property 4.5.

Given a group of products $X$ , if there does not exist a time period $h$ such that ${RTWU}(X, h)$ / $top(h)$ $\ge$ minpro, then $X$ is not a profitable on-shelf product. Otherwise, $X$ may or may not be a profitable on-shelf product.

Consider products $\lbrace a\rbrace$ and $\lbrace a,c\rbrace$ in the whole database: the values ${RTWU}(\lbrace a\rbrace)$ and ${RTWU}(\lbrace a,c\rbrace)$ are $90 and $59, which overestimate $p(\lbrace a\rbrace)$ = $15 and $p(\lbrace a,c\rbrace)$ = $36, respectively, and thus respect the upper-bound properties above. Even with these properties, however, the OP3M problem still cannot be solved easily.

4.2 Properties of Positive and Negative Profits

Economic profit can be positive, negative, or zero. If a product generates a negative profit, it is sold for less than its cost price. Thus, some products are less successful than others and yield a negative profit. Nevertheless, the profits of all products together make up the overall profit, and such a mix is commonly seen in a successful business. What if the profit target is negative and the final actual result is positive? The following properties hold.

First, let the total order $\prec$ on products in the designed algorithm be the ascending order of RTWU, with negative-profit products always placed after all positive-profit products. In the running example, sorting the positive-profit products by the RTWU values computed in Section 4.1 and placing the negative-profit product $b$ last gives the total order $a \prec$ $f \prec$ $d \prec c$ $\prec e \prec b$ . The complete search space of the addressed problem can be represented by a set-enumeration tree [36] in which products are sorted according to the total order $\prec$ . Since it is built with the OPP-list and the OFU $^{\pm }$ -table, we call this tree the OPP-tree. In the OPP-tree, according to the total order $\prec$ , all child nodes of a tree node are called its extension nodes. For any group of products $X$ , let $pp(X)$ and $np(X)$ respectively denote the sum of positive profits and the sum of negative profits of $X$ in a transaction, a period, or the database, such that $p(X)$ = $pp(X)$ + $np(X)$ . We have the following important observations.
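
The total order can be derived mechanically from the data of Tables 2 and 3, as in the sketch below. The text does not state how ties in RTWU are broken (products $a$ and $f$ share the same value), so breaking ties by product name is an assumption made here; the function names are also illustrative.

```python
# Running example: Table 2 as tid -> (period, {product: quantity}), Table 3 as unit profits.
DB = {
    "T1": (1, {"b": 2, "c": 1, "e": 3}),
    "T2": (1, {"a": 1, "b": 1, "c": 2, "f": 1}),
    "T3": (2, {"a": 3, "b": 6, "c": 4, "d": 1, "e": 1, "f": 2}),
    "T4": (2, {"c": 3, "d": 3, "e": 1}),
    "T5": (3, {"a": 1, "d": 2, "e": 3, "f": 1}),
}
UNIT_PROFIT = {"a": 3, "b": -2, "c": 4, "d": 1, "e": 7, "f": 5}


def rtp(items):
    """Redefined transaction profit: only products with positive unit profit count."""
    return sum(UNIT_PROFIT[i] * q for i, q in items.items() if UNIT_PROFIT[i] > 0)


def rtwu(product):
    """RTWU of a single product over the whole database."""
    return sum(rtp(items) for _, items in DB.values() if product in items)


def total_order():
    """Positive-profit products by ascending RTWU, negative-profit products last.

    Ties are broken by product name (an assumption, not stated in the text).
    """
    return sorted(UNIT_PROFIT, key=lambda i: (UNIT_PROFIT[i] < 0, rtwu(i), i))


print(total_order())   # ['a', 'f', 'd', 'c', 'e', 'b']
```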

Property 4.6.

Relationships between the positive profits and negative profits of a group of products: $np(X)$ $\le p(X)$ $\le pp(X)$ in a transaction, period, or database [21]. These relationships are respectively denoted as $np(X, T_c)$ $\le p(X, T_c)$ $\le$ $pp(X, T_c)$ in a transaction $T_c$ , $np(X, h)$ $\le$ $p(X, h)$ $\le$ $pp(X, h)$ in a period $h$ , and $np(X)$ $\le$ $p(X)$ $\le$ $pp(X)$ in a database.

It can be seen that the positive profit value of a group of products is always no less than its actual profit, while the negative profit value is always no greater than it. Thus, the positive profit value is an upper bound on the profit of $X$ itself. However, neither $pp(X)$ nor $np(X)$ can be used to overestimate the profit of the extensions of $X$ : although $pp(X)$ is an upper bound on $p(X)$ , it does not satisfy the downward closure property for extensions with positive or negative products.
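The following minimal sketch illustrates Property 4.6 and the caveat above. The function name `split_profit` and the per-item profits are hypothetical values chosen for illustration; they are not taken from the paper's tables.

```python
def split_profit(item_profits):
    """Return (pp, np, p) for a mapping item -> profit in one transaction."""
    pp = sum(v for v in item_profits.values() if v >= 0)   # positive part
    np_ = sum(v for v in item_profits.values() if v < 0)   # negative part
    return pp, np_, pp + np_

# Hypothetical profits of X = {b, c} and of its extension {b, c, e}.
pp_x, np_x, p_x = split_profit({"b": -6, "c": 4})
pp_ext, np_ext, p_ext = split_profit({"b": -6, "c": 4, "e": 21})

# Property 4.6: np(X) <= p(X) <= pp(X).
assert np_x <= p_x <= pp_x
# pp(X) bounds p(X) but not the profit of an extension of X.
assert p_ext > pp_x
```

In this toy case $pp(X)$ = 4 while the extension earns 19, which is why an additional term (the remaining positive profit introduced next) is needed to bound extensions.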

4.3 OPP-List and OFU $^{\pm }$ -Table

In this subsection, we introduce a new concept called the “list structure of a group of products with its On-shelf Popularity and Profit” (OPP-list for short), which is a component used for storing information and performing calculations. Besides, a new concept called remaining positive profit is introduced and applied to obtain the estimated upper bound, which will be presented in the next subsection. First, the remaining positive profit and the OPP-list structure are defined as follows.

Definition 1.

Let $rpp(X,T_c)$ denote the remaining positive profit of a group of products $X$ in a transaction $T_c$ . Thus, $rpp(X,T_c)$ is the sum of the positive profit values of each product appearing after $X$ in $T_c$ according to the total order $\prec$ . It is represented as:

\begin{equation} rpp(X,T_c) = \sum _{ i_j \in T_c \wedge i_j \notin X \wedge X \prec i_j \wedge p(i_j,T_c) \ge 0 } p(i_j,T_c), \quad X \subseteq T_c. \tag{9} \end{equation}
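As a rough illustration of Equation (9), the sketch below computes the remaining positive profit for one transaction. The total order and the profits of $f$ , $d$ , $c$ , and $e$ in $T_3$ are those used in the running example; the profits assigned to $a$ and $b$ are placeholders (with $b$ assumed negative, as its position in the order suggests), and the helper name `rpp` is ours.

```python
ORDER = ["a", "f", "d", "c", "e", "b"]          # total order of the running example
RANK = {item: i for i, item in enumerate(ORDER)}

def rpp(x, profits):
    """Remaining positive profit of a non-empty product-set x in one transaction.

    profits maps every item of the transaction to its profit there.
    """
    if not set(x) <= set(profits):
        return 0                                 # x does not occur in the transaction
    last = max(RANK[i] for i in x)               # position of the last item of x
    return sum(v for i, v in profits.items() if RANK[i] > last and v >= 0)

# p(a, T3) and p(b, T3) are placeholder values, not taken from the paper.
T3 = {"a": 9, "b": -18, "c": 16, "d": 1, "e": 7, "f": 10}

print(rpp({"a"}, T3))        # 10 + 1 + 16 + 7 = 34
print(rpp({"a", "c"}, T3))   # only e remains with a positive profit: 7
```

Negative items that follow $X$ are simply skipped, so the value never decreases the bound.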

Definition 2.

The OPP-list of a group of products $X$ in an e-commerce database $D$ is a set of tuples corresponding to the transactions where $X$ appears. A tuple is defined as $\lt$ $\underline{tid}$ , $\underline{pp}$ , $\underline{np}$ , $\underline{rpp}$ , $\underline{period}$ $\gt$ for each transaction $T_{c}$ containing $X$ .

• $\underline{tid}$ : the transaction identifier of $T_{c}$ ;

• $\underline{pp}$ : the positive profit of $X$ in $T_{c}$ , i.e., the sum of the item profits of $X$ with $p(i_j,T_{c}) \ge 0$ ;

• $\underline{np}$ : the negative profit of $X$ in $T_{c}$ , i.e., the sum of the item profits of $X$ with $p(i_j,T_{c}) \lt 0$ ;

• $\underline{rpp}$ : the remaining positive profit of $X$ in $T_{c}$ , i.e., $rpp(X,T_c)$ ;

• $\underline{period}$ : the period in which $T_{c}$ occurs.

Example 4.1.

Since the total order $\prec$ on products is $a \prec f \prec d \prec c \prec e \prec b$ , we have that $rpp(a,T_3)$ = $p(f,T_3)$ + $p(d,T_3)$ + $p(c,T_3)$ + $p(e,T_3)$ = $10 + $1 + $16 + $7 = $34, and $rpp(\lbrace a,c\rbrace ,T_{3})$ = $p(e,T_3)$ = $7. Thus, the OPP-list of product $(c)$ is $\lbrace (T_1$ , $4, 0, $21, 1), $(T_2$ , $3, 0, 0, 1), $(T_3$ , $16, 0, $7, 2), $(T_4$ , $12, 0, $7, 2) $\rbrace$ , as shown in Figure 2. A single database scan suffices to create the OPP-lists of all 1-products in the processed database. Based on the designed OPP-list and previous studies [7, 21], we can extract the following information.

For a group of products $X$ , let $pp(X, h)$ , $np(X, h)$ , and $rpp(X, h)$ be, respectively, the sum of the pp values, the sum of the np values, and the sum of the rpp values in a period $h$ w.r.t. the OPP-list of $X$ , that is:

\begin{equation} pp(X, h) = \sum _{X \subseteq T_c \wedge T_c \in h}pp(X,T_c); \tag{10} \end{equation}

\begin{equation} np(X, h) = \sum _{X \subseteq T_c \wedge T_c \in h}np(X,T_c); \tag{11} \end{equation}

\begin{equation} rpp(X, h) = \sum _{X \subseteq T_c \wedge T_c \in h}rpp(X,T_c). \tag{12} \end{equation}

Fig. 2.

Similarly, let $pp(X)$ , $np(X)$ , and $rpp(X)$ be, respectively, the sum of the pp values, the sum of the np values, and the sum of the rpp values for a group of products $X$ in the database $D$ w.r.t. the OPP-list of $X$ . We have:

\begin{equation} pp(X) = \sum _{X \subseteq T_c \wedge T_c \in D}pp(X,T_c); \tag{13} \end{equation}

\begin{equation} np(X) = \sum _{X \subseteq T_c \wedge T_c \in D}np(X,T_c); \tag{14} \end{equation}

\begin{equation} rpp(X) = \sum _{X \subseteq T_c \wedge T_c \in D}rpp(X,T_c). \tag{15} \end{equation}

By utilizing the property of OPP-list, we further design a structure called on-shelf frequency-profit table with both positive and negative values, hereafter termed OFU $^{\pm }$ -table. The OFU $^{\pm }$ -table of a pattern is built after the construction of its OPP-list, and it stores the following information.

Definition 3.

An OFU $^{\pm }$ -table of a group of products $X$ contains nine parts: (1) name: the name of $X$ ; (2) sup(X, h): the support of $X$ in a period $h$ ; (3) sup(X): the support of $X$ in database $D$ ; (4) pp(X, h): the summation of the positive profits of $X$ in a period $h$ ; (5) pp(X): the summation of the positive profits of $X$ in database $D$ ; (6) np(X, h): the summation of the negative profits of $X$ in a period $h$ ; (7) np(X): the summation of the negative profits of $X$ in database $D$ ; (8) rpp(X, h): the summation of the remaining positive profits of $X$ in $h$ ; and (9) rpp(X): the summation of the remaining positive profits of $X$ in $D$ . Notice that a HashMap is used to keep the sup(X, h), pp(X, h), np(X, h), and rpp(X, h) values of $X$ for each period $h$ .

Example 4.2.

The construction process of an OFU $^{\pm }$ -table is as follows. Consider the product $(c)$ in Table 2, which appears in $T_1$ , $T_2$ , $T_3$ , and $T_4$ . From the OPP-list of $(c)$ shown in Figure 2, the OFU $^{\pm }$ -table of $(c)$ is efficiently constructed using the support count, positive profit, negative profit, and remaining positive profit. These values are calculated during the construction of the OPP-list, and the resulting OFU $^{\pm }$ -table is { $sup(c)$ = 4, $pp(c,1)$ = $7, $pp(c,2)$ = $28, $pp(c)$ = $pp(c,1)$ + $pp(c,2)$ = $35, $np(c,1)$ = $np(c,2)$ = $np(c)$ = $0, and $rpp(c,1)$ = $21, $rpp(c,2)$ = $14, $rpp(c)$ = $rpp(c,1)$ + $rpp(c,2)$ = $35}.
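The aggregation in this example can be sketched in a few lines. The tuple layout follows Definition 2 and the numbers are the OPP-list of $(c)$ given in Example 4.1; the function name `build_ofu_table` and the dictionary layout are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict

# OPP-list of product (c): (tid, pp, np, rpp, period)
OPP_LIST_C = [
    ("T1", 4, 0, 21, 1),
    ("T2", 3, 0, 0, 1),
    ("T3", 16, 0, 7, 2),
    ("T4", 12, 0, 7, 2),
]

def build_ofu_table(name, opp_list):
    """Aggregate an OPP-list per period and over the whole database."""
    per_period = defaultdict(lambda: {"sup": 0, "pp": 0, "np": 0, "rpp": 0})
    total = {"sup": 0, "pp": 0, "np": 0, "rpp": 0}
    for _tid, pp, np_, rpp, period in opp_list:
        # Each tuple contributes once to its period and once to the database totals.
        for acc in (per_period[period], total):
            acc["sup"] += 1
            acc["pp"] += pp
            acc["np"] += np_
            acc["rpp"] += rpp
    return {"name": name, "total": total, "periods": dict(per_period)}

table_c = build_ofu_table("c", OPP_LIST_C)
print(table_c["total"])       # {'sup': 4, 'pp': 35, 'np': 0, 'rpp': 35}
print(table_c["periods"][1])  # {'sup': 2, 'pp': 7, 'np': 0, 'rpp': 21}
```

The per-period dictionary plays the role of the HashMap mentioned in Definition 3.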

After initially constructing the OPP-list and OFU $^{\pm }$ -table of each 1-product/item, the OPP-list of any $k$ -product ( $k \ge 2$ ) can be calculated directly from the OPP-lists of some of its subsets, without scanning the database. The construction details of the OPP-list and OFU $^{\pm }$ -table of a $k$ -product are shown in Algorithm 1. Given a product $X$ , let $X_a$ and $X_b$ be the extensions of $X$ obtained by respectively adding two distinct products $a$ and $b$ to $X$ . The construction procedure takes as input the OPP-lists of $X$ , $X_a$ and $X_b$ , and outputs the OPP-list and OFU $^{\pm }$ -table of the pattern $X_{ab}$ . In particular, it is important to notice that the construction for a $k$ -product with $k \ge 3$ (Lines 4 to 7) differs from that for $k$ = 2 (Line 9). For instance, the profit value of { $a,b$ } is the sum of the profit values of { $a$ } and { $b$ }. In contrast, the OPP-list of the 3-itemset { $a,b,c$ } is constructed from the OPP-lists of { $a,b$ } and { $b,c$ }, and the sum of the profit values of { $a,b$ } and { $b,c$ } counts the profit value of { $b$ } twice. Thus, this duplication must be removed for a $k$ -product ( $k \ge 3$ ). This is easy to implement, because all the necessary information about the ( $k$ -1)-products has been calculated before the OPP-list of the $k$ -product is constructed w.r.t. the extension pattern.
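The join described above can be sketched as follows, in the style of utility-list joins. This is not a reproduction of Algorithm 1: the function name `construct`, the dictionary layout, and the rule that the joined list inherits the rpp of the later extension are our assumptions. The list for $(c)$ comes from Example 4.1, while the list for $(e)$ is inferred from the running example and should be read as illustrative.

```python
def construct(p_list, px_list, py_list):
    """Join the OPP-lists of Px and Py (extensions of P) into the list of Pxy.

    Each list maps tid -> (pp, np, rpp, period). p_list is None when P is
    empty, i.e. when Pxy is a 2-product.
    """
    pxy = {}
    for tid, (pp_x, np_x, _rpp_x, period) in px_list.items():
        if tid not in py_list:
            continue                      # Pxy occurs only where Px and Py both occur
        pp_y, np_y, rpp_y, _ = py_list[tid]
        pp, np_ = pp_x + pp_y, np_x + np_y
        if p_list is not None:
            # k >= 3: the profit of the shared part P was counted twice.
            pp_p, np_p, _, _ = p_list[tid]
            pp -= pp_p
            np_ -= np_p
        # y follows x in the total order, so what remains after Pxy
        # is what remains after Py.
        pxy[tid] = (pp, np_, rpp_y, period)
    return pxy

C_LIST = {"T1": (4, 0, 21, 1), "T2": (3, 0, 0, 1), "T3": (16, 0, 7, 2), "T4": (12, 0, 7, 2)}
# Assumed OPP-list of (e); rpp is 0 because only b follows e in the order.
E_LIST = {"T1": (21, 0, 0, 1), "T3": (7, 0, 0, 2), "T4": (7, 0, 0, 2), "T5": (21, 0, 0, 3)}

CE_LIST = construct(None, C_LIST, E_LIST)
print(CE_LIST)  # {'T1': (25, 0, 0, 1), 'T3': (23, 0, 0, 2), 'T4': (19, 0, 0, 2)}
```

No database scan is involved: only transaction identifiers shared by both input lists survive the join.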

4.4 Filtering Strategies for Searching

Lemma 1 (Anti-monotonicity of the Unpromising Product with Support).

In the search space w.r.t. the OPP-tree, if a tree node is a popular product in the whole database $D$ or a period $h$ , its parent node is also a popular product in $D$ or $h$ . Let $X$ be a $k$ -product (node) and let its parent node, a ( $k$ -1)-product, be denoted as $X^{\prime }$ . For a given database $D$ or a period $h$ , the relative frequency measure is anti-monotonic: $rf(X) \le rf(X^{\prime })$ always holds.

Proof.

According to the well-known Apriori property [1], the relationship $sup(X,h)$ $\le$ $sup(X^{\prime },h)$ always holds. Thus, the downward closure property of the relative frequency measure holds.□
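The anti-monotonicity used in this proof can be checked directly on the purchase records of Table 2. The sketch below ignores quantities and periods and only counts occurrences; the names `DB` and `sup` are ours.

```python
# Items of each purchase record in Table 2 (quantities omitted).
DB = {
    "T1": {"b", "c", "e"},
    "T2": {"a", "b", "c", "f"},
    "T3": {"a", "b", "c", "d", "e", "f"},
    "T4": {"c", "d", "e"},
    "T5": {"a", "d", "e", "f"},
}

def sup(x):
    """Number of transactions that contain every product of x."""
    return sum(1 for items in DB.values() if set(x) <= items)

# Support can only shrink as a product-set grows.
print(sup({"a"}), sup({"a", "c"}), sup({"a", "c", "e"}))  # 3 2 1
```

Dividing each count by the same number of transactions leaves the inequality unchanged, which is the statement of Lemma 1 for the relative frequency.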

Lemma 2 (Anti-monotonicity of Unpromising Product with Profit Upper-bound).

For any node $X$ in the search space w.r.t. the OPP-tree, the sum of ${SUM}(X.pp)$ and ${SUM}(X.rpp)$ in the OPP-list of $X$ (within a period or the whole database) is larger than or equal to profit of any one of its children (within any period $h$ or the whole database $D$ w.r.t. the whole/maximal period in the database).

Thus, there exists an upper bound on the profit of any pattern/node with respect to a specific period. Lemma 2 guarantees that the profit $p(X)$ of $X$ in $D$ or $h$ is always less than or equal to the sum of SUM $(X^{\prime }.pp)$ and SUM $(X^{\prime }.rpp)$ in $D$ or $h$ , where $X^{\prime }$ is the parent node of $X$ . This ensures the downward closure property for transitive extensions with positive or negative products. Based on these observations, we can use the following two filtering strategies.

Strategy 1.

When performing a depth-first search on the OPP-tree, if the relative frequency of any product $X$ within a time period $h$ is less than minfre (w.r.t. $rf(X,h)$ = $sup(X,h)$ / $sup(h)$ ), none of its child nodes can be an OPPP; they can be regarded as irrelevant and directly pruned.

Strategy 2.

When traversing the OPP-tree with a depth-first search, if the sum of SUM $(X.pp)$ and SUM $(X.rpp)$ of any node/product $X$ within the related period of $X$ is less than minpro (w.r.t. $rp(X)$ = $p(X)$ / $top(X)$ ), none of its child nodes can be an OPPP; they can be regarded as irrelevant and directly pruned.
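A minimal sketch of how the two strategies might be combined into a single test is given below. Dividing the upper bound by $top(X)$ before comparing it with minpro is our reading of the relative profit $rp(X)$ ; the function name `is_promising` and all numeric inputs are hypothetical.

```python
def is_promising(sup_x_h, sup_h, sum_pp, sum_rpp, top_x, minfre, minpro):
    """Return True if the extensions of X are still worth exploring."""
    rf = sup_x_h / sup_h                   # Strategy 1: relative frequency in h
    bound = (sum_pp + sum_rpp) / top_x     # Strategy 2: relative profit upper bound
    return rf >= minfre and bound >= minpro

# Hypothetical inputs: X occurs in 2 of 2 transactions of the period,
# SUM(X.pp) = 7, SUM(X.rpp) = 21, and top(X) = 100.
print(is_promising(2, 2, 7, 21, 100, minfre=0.5, minpro=0.2))  # True
print(is_promising(2, 2, 7, 21, 100, minfre=0.5, minpro=0.3))  # False, bound is 0.28
print(is_promising(1, 4, 7, 21, 100, minfre=0.5, minpro=0.2))  # False, rf is 0.25
```

When the function returns False for a node, the whole subtree below that node can be skipped.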

4.5 Main Procedure

To clarify our methodology, we have so far illustrated the key properties of OSA and profit, the key data structures, and the profit upper bound. Utilizing these techniques, as shown in Algorithm 2, the main procedure takes as input: (1) an e-commerce database, $D$ ; (2) a user-specified profit-table, ptable; (3) a minimum frequency threshold, minfre; and (4) a user-specified minimum profit threshold, minpro. How can these thresholds be selected systematically, and how can they be optimized to ensure the performance of the proposed approach? Note that both minfre and minpro are specified by the user based on prior knowledge and experience. In other words, when applying the OP3M algorithm to mine OPPPs in different databases, the parameter settings differ. For example, a pattern may be on-shelf popular and highly profitable in one database but not in another. Besides, the number of time periods in a database is an inherent characteristic. We can set it manually according to domain knowledge and experience, such as a week, month, quarter, or year. Therefore, minfre, minpro, and the time periods can be specified manually on a case-by-case basis. It is worth mentioning that different parameter settings may lead to different performance. It is hard to optimize them to ensure the performance of the OP3M algorithm on all databases, but they can be tuned to achieve good performance on a specific database.

The OP3M algorithm first scans the database to calculate ${RTWU}(\lbrace i\rbrace)$ , ${RTWU}(\lbrace i\rbrace ,h)$ and $os(\lbrace i\rbrace)$ for each product $i$ . Moreover, the set of all time periods $PE$ and the profit $top(h)$ of each period $h$ are computed during this first database scan. Notice that, for optimization, the set of time periods $os(\lbrace i\rbrace)$ of each product $i$ is represented as a bitset in which the $k$ -th bit is set to 1 if $i$ appears in period $k$ , and to 0 otherwise. With this representation, the time periods of any product $X$ = $\lbrace x_1$ , $x_2$ , $\dots$ , $x_n\rbrace$ can be calculated quickly using the logical AND operation (denoted here by $\oplus$ ), i.e., $os(X)$ = $os(x_1) \oplus os(x_2)$ $\oplus \dots \oplus os(x_n)$ .
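The bitset trick can be sketched with plain Python integers. The periods of $a$ and $c$ below are read off Table 2 ( $a$ occurs in periods 1, 2, and 3; $c$ in periods 1 and 2); the helper names are ours.

```python
from functools import reduce

def periods_to_bitset(periods):
    """Encode a collection of 1-based period numbers as an integer bitset."""
    bits = 0
    for k in periods:
        bits |= 1 << (k - 1)          # period k is stored in bit k-1
    return bits

def os_of_set(item_bitsets):
    """Periods shared by all items: bitwise AND of their bitsets."""
    return reduce(lambda x, y: x & y, item_bitsets)

def bitset_to_periods(bits):
    """Decode an integer bitset back into a sorted list of period numbers."""
    return [k + 1 for k in range(bits.bit_length()) if (bits >> k) & 1]

os_a = periods_to_bitset([1, 2, 3])
os_c = periods_to_bitset([1, 2])
print(bitset_to_periods(os_of_set([os_a, os_c])))  # [1, 2]
```

One AND per item replaces a set intersection, which is why the representation is cheap even for long product-sets.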

Then, the algorithm computes for each product $i$ the value $top(\lbrace i\rbrace)$ using $os(i)$ and the previously obtained profits of the periods. This allows us to create the set $I^*$ containing all products $i$ such that RTWU( $\lbrace i\rbrace ,h$ ) / $top(\lbrace i\rbrace)$ $\ge$ minpro. Thereafter, all products not in $I^*$ are ignored since they cannot be part of any OPPP. The RTWU values of products are then used to establish the total order $\prec$ on products, which is the ascending RTWU order. A second database scan is then performed, in which the products in transactions are reordered according to the total order $\prec$ and the OPP-list of each product $i \in I^*$ is built. After the construction of the OPP-lists, the depth-first search exploration of products starts by calling the recursive procedure Search with the empty product $\emptyset$ , the set of single products $I^*$ , minfre, minpro, and the set of all time periods $PE$ .
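The ordering step can be sketched as a single sort. The RTWU values are those of the running example in Section 4.2; treating $b$ as the only negative product and breaking RTWU ties alphabetically are assumptions made so that the sketch reproduces the order stated there.

```python
RTWU = {"a": 90, "b": 84, "c": 104, "d": 94, "e": 119, "f": 90}
NEGATIVE = {"b"}   # assumption: b is the only product with a negative unit profit

def total_order(rtwu, negative):
    """Ascending RTWU, with negative products placed after all positive ones."""
    return sorted(rtwu, key=lambda i: (i in negative, rtwu[i], i))

print(total_order(RTWU, NEGATIVE))  # ['a', 'f', 'd', 'c', 'e', 'b']
```

The sort key puts the negative flag first, so a negative product ends up last even when its RTWU is the smallest.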

The Search procedure (Algorithm 3) takes as input: (1) a group of products $P$ , (2) extensions of $P$ having the form $Pz$ , meaning that $Pz$ was previously obtained by appending a product $z$ to $P$ , (3) minfre, (4) minpro, and (5) the time periods of $P$ ( $os(P)$ ). The search procedure operates as follows. For each extension $Px$ of $P$ , if the relative frequency of $Px$ is no less than minfre, and the sum of the actual relative profit values of $Px$ in the OPP-list is no less than minpro, then $Px$ is output as an OPPP (Lines 2 to 6). Then, it uses the pruning strategies to determine whether the extensions of $Px$ could be OPPPs and should be explored (Line 7). This is performed by merging $Px$ with all extensions $Py$ of $P$ such that $y \succ x$ , $rf(Px,h)$ $\ge$ minfre and RTWU( $\lbrace x,y\rbrace ,h$ ) $\ge$ minpro (Line 10), to form extensions of the form $Pxy$ containing $|Px|$ +1 products. The OPP-list of $Pxy$ is then constructed by calling the Construct procedure to join the OPP-lists of $P$ , $Px$ and $Py$ (Lines 10 to 15). Only the promising OPP-lists are explored in the next extension (Line 14). Then, a recursive call to the Search procedure with $Pxy$ is performed to calculate its on-shelf popularity and profit and to explore its extension(s) (Line 17).
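The recursion can be illustrated with a deliberately simplified sketch. It is not Algorithm 3: periods are ignored, absolute thresholds `min_sup` and `min_profit` stand in for minfre and minpro, the RTWU-based check on pairs is omitted, and the three-transaction database is hypothetical.

```python
def join(p, px, py):
    """OPP-list of Pxy from those of P, Px, Py; entries are tid -> (pp, np, rpp)."""
    out = {}
    for tid, (pp_x, np_x, _) in px.items():
        if tid in py:
            pp_y, np_y, rpp_y = py[tid]
            pp_p, np_p = (p[tid][0], p[tid][1]) if p is not None else (0, 0)
            out[tid] = (pp_x + pp_y - pp_p, np_x + np_y - np_p, rpp_y)
    return out

def search(prefix, p_list, extensions, min_sup, min_profit, found):
    """Depth-first search; extensions is a list of (item, opp_list) in total order."""
    for idx, (x, px) in enumerate(extensions):
        sup = len(px)
        profit = sum(pp + np_ for pp, np_, _ in px.values())
        if sup >= min_sup and profit >= min_profit:
            found[tuple(prefix + [x])] = profit
        bound = sum(pp + rpp for pp, _, rpp in px.values())
        if sup < min_sup or bound < min_profit:
            continue                      # no extension of Px can qualify
        new_exts = []
        for y, py in extensions[idx + 1:]:
            pxy = join(p_list, px, py)
            if pxy:
                new_exts.append((y, pxy))
        search(prefix + [x], px, new_exts, min_sup, min_profit, found)

# Hypothetical 1-product OPP-lists under the order a < c < e.
A = {"T2": (5, 0, 8), "T3": (15, 0, 23)}
C = {"T1": (4, 0, 21), "T2": (8, 0, 0), "T3": (16, 0, 7)}
E = {"T1": (21, 0, 0), "T3": (7, 0, 0)}

found = {}
search([], None, [("a", A), ("c", C), ("e", E)], min_sup=2, min_profit=30, found=found)
print(found)  # {('a', 'c'): 44, ('c', 'e'): 48}
```

Here no single product reaches the profit threshold, yet two pairs do, and the branch below $e$ is cut because its bound of 28 is below 30.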

Complexity analysis. Let $I$ = $\lbrace i_1,$ $i_2,$ $\dots , i_m\rbrace$ be a finite set of $m$ distinct items, and $D$ = { $T_1$ , $T_2$ , $\dots , T_n$ }. First, a single database scan takes $O(nz)$ time, where $z$ is the average transaction length; in the worst case, it takes $O(nm)$ time. The construction of the OPP-list and OFU $^{\pm }$ -table is done in linear time. An exhaustive search of the search space in OP3M would take $O(2^{m} - 1)$ time. However, in a real situation, the database is rarely extremely sparse or extremely dense. Thus, the number of items in the longest products in OP3M is generally much less than $m$ , and the search space reaches $2^{m} - 1$ (the complete number of itemsets in the search space) only in the worst case. The space analysis is as follows. The number of initial OPP-lists and OFU $^{\pm }$ -tables is $|I^*|$ , and each OPP-list takes $O(n)$ space if it contains an entry for each transaction. The total space required for building the initial OPP-lists of 1-products is therefore $O(|I^*| \times n)$ in the worst case, and the worst-case space complexity of OP3M is $O((2^{m} - 1) \times n)$ . By incorporating the constraints of frequency, time periods, and profit, the filtering strategies above can lead to a smaller search space than the worst case. In practice, the real search space of the OP3M algorithm is reasonable, and its runtime scales linearly with the number of transactions, as will be shown in the experimental evaluation.
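As a quick sanity check on the $2^{m} - 1$ figure, the following sketch enumerates every non-empty product-set over the six products of the running example. The helper name is ours.

```python
from itertools import combinations

def count_nonempty_subsets(items):
    """Count all non-empty subsets by explicit enumeration."""
    return sum(1 for k in range(1, len(items) + 1) for _ in combinations(items, k))

items = ["a", "b", "c", "d", "e", "f"]
print(count_nonempty_subsets(items))  # 63, i.e. 2**6 - 1
```

With the 119 items of mushroom the same count has 36 decimal digits, which is why the filtering strategies matter in practice.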

5 Experimental Study

In this section, we study the proposed algorithm on several real datasets to evaluate its effectiveness and efficiency. To the best of our knowledge, this is the first work to address the targeted marketing problem of finding the on-shelf popular and most profitable products by considering product frequency, purchase time periods, on-shelf availability, and utility theory. Thus, no existing method in the literature can reasonably be used as a baseline to evaluate the efficiency (w.r.t. execution time, memory usage, etc.) of the proposed model.

To analyze the usefulness of the OP3M framework, the derived frequent patterns (FPs, generated by the well-known FP-growth algorithm [16]), high utility/profitable patterns (HUPs, generated by the FHN algorithm [21]), and OPPPs (generated by the proposed OP3M algorithm) on the same datasets are examined. Thus, in Section 5.2, the well-known FP-growth and UP-growth are both used as the baseline algorithms. A comparison with the frequency-based FP-growth approach will demonstrate whether the utility-based method is superior. In particular, we also compare the total utility of the derived patterns (e.g., HUPs and OPPPs) to evaluate how well the proposed OP3M algorithm works toward revenue maximization. It is worth mentioning that OP3M, as an exact utility-based framework, uses an exhaustive search strategy with constraints rather than a heuristic search. Therefore, heuristic-based methods are not compared here.

5.1 Datasets and Data Preprocessing

All compared algorithms are implemented in Java and executed on a ThinkPad T470p PC with an Intel Core i7-7700HQ CPU @2.80GHz and 32 GB of memory, running the 64-bit Microsoft Windows 10 platform. Typically, e-commerce datasets are proprietary and consequently hard to find among publicly available data. To support reproducibility, we use four publicly available e-commerce datasets³ (mushroom, chess, retail, and kosarak) in our experiments. These datasets have varied characteristics and represent the main types of data typically encountered in real-life scenarios. The characteristics of the datasets are described below in detail.

$\bullet$ mushroom: a very dense dataset containing 8,124 transactions with 119 distinct items. Its average item count per transaction is 23, giving a density ratio of 19.33%.

$\bullet$ chess: a very dense dataset containing 3,196 transactions with 75 distinct products and an average transaction length of 37 products, giving a density ratio of 49.33%.

$\bullet$ retail: a sparse e-commerce dataset containing 88,162 purchase records with 16,470 distinct products and an average transaction length of 10.30 products.

$\bullet$ kosarak: a very large dataset containing 990,002 transactions of click-stream data from a Hungarian on-line news portal, with 41,270 distinct products.
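The density ratios quoted above appear to be the average transaction length divided by the number of distinct items. This is our assumption rather than a definition stated in the text, although it reproduces the 19.33% reported for mushroom. A small sketch:

```python
def density(avg_len, n_items):
    """Density ratio in percent: average transaction length over distinct items."""
    return 100.0 * avg_len / n_items

print(round(density(23, 119), 2))        # 19.33 (mushroom)
print(round(density(10.30, 16470), 4))   # 0.0625 (retail)
```

Under this definition retail is roughly 300 times sparser than mushroom, which matches its description as a sparse dataset.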

5.2 Effectiveness Analytics

The addressed OP3M problem aims at computing the $k$ -satisfiable on-shelf most popular and profitable products. Thereby, an OPPP explicitly incorporates on-shelf availability, frequency, and profit. How can we know whether the results are interpretable? Numerous studies have shown that two metrics, frequency and utility, are good estimations of the importance of a pattern, and frequency-based or utility-based methods have broad applications (see [8, 10, 12] for an overview). Therefore, it makes sense to evaluate the number, frequency, and profit of the mining results of OP3M. Results for the different kinds of generated patterns under various parameters are shown in Table 5.

Table 5. Derived Patterns Under Various Parameters

Columns test $_1$ to test $_5$ report the results with a varied threshold (minfre or minpro).

Dataset (fixed threshold)	Pattern	test $_1$	test $_2$	test $_3$	test $_4$	test $_5$
mushroom (fix minfre: 6%)	FPs	1,843,327	1,843,327	1,843,327	1,843,327	1,843,327
	HUPs	-	-	-	-	-
	# ${P5}$	222,251	154,513	102,642	66,917	42,717
	# ${P25}$	222,251	154,513	102,642	66,917	42,717
	# ${P50}$	222,251	154,513	102,644	66,917	42,717
mushroom (fix minpro: 5%)	FPs	1,843,327	650,003	600,817	150,137	104,629
	HUPs	-	-	-	-	-
	# ${P5}$	222,251	102,452	94,882	30,659	25,579
	# ${P25}$	222,251	102,452	94,882	30,659	25,579
	# ${P50}$	222,251	102,452	94,882	30,659	25,579
chess (fix minfre: 50%)	FPs	1,272,932	1,272,932	1,272,932	1,272,932	1,272,932
	HUPs	32,324	14,114	5,847	2,304	859
	# ${P5}$	16,596	8,639	4,146	1,848	761
	# ${P25}$	16,596	8,639	4,146	1,848	761
	# ${P50}$	16,596	8,639	4,146	1,848	761
chess (fix minpro: 40%)	FPs	2,832,777	1,272,932	574,998	254,944	111,239
	HUPs	14,114	14,114	14,114	14,114	14,114
	# ${P5}$	11,335	8,639	5,584	3,097	1,413
	# ${P25}$	11,335	8,639	5,584	3,097	1,413
	# ${P50}$	11,335	8,639	5,584	3,097	1,413
retail (fix minfre: 0.07%)	FPs	12,418	12,418	12,418	12,418	12,418
	HUPs	10,524	9,103	7,872	6,965	6,194
	# ${P5}$	5,092	4,850	4,619	4,384	4,159
	# ${P25}$	13,696	12,943	12,149	11,437	10,675
	# ${P50}$	8,590,024	8,501,157	8,394,752	8,287,240	8,187,650
retail (fix minpro: 0.20%)	FPs	19,242	12,418	8,829	6,749	5,282
	HUPs	10,524	10,524	10,524	10,524	10,524
	# ${P5}$	6,857	5,092	3,901	3,116	2,515
	# ${P25}$	13,477,640	13,696	4,199	3,222	2,555
	# ${P50}$	731,663,259	8,590,024	7,907,427	7,173,798	3,361

Notice that mushroom is first tested with a fixed minfre of 6% and minpro varied from 5% to 13%, and then with a fixed minpro of 5% and minfre varied from 6% to 14%. Chess is tested in a similar way. The retail dataset is tested with a fixed minfre of 0.07% and minpro varied from 0.20% to 0.28%, and then with a fixed minpro of 0.20% and minfre varied from 0.05% to 0.13%. Besides, we ran the OP3M algorithm on the same datasets with transactions randomly grouped into 5, 25, and 50 time periods; # ${P5}$ , # ${P25}$ and # ${P50}$ denote the numbers of OPPPs derived by OP3M on the dataset with 5, 25, and 50 time periods, respectively.

It can be clearly observed that the number of OPPPs always differs from that of FPs and HUPs under various minfre and minpro thresholds. Specifically, the minfre and minpro thresholds and the number of time periods all influence the OPPP results, whereas the FPs are influenced only by minfre and the HUPs only by minpro. For example, when setting minfre: 0.07% and minpro: 0.20% on retail, the numbers of HUPs, # ${P5}$ (with 5 periods), # ${P25}$ (with 25 periods), and # ${P50}$ (with 50 periods) are, respectively, 10,524, 5,092, 13,696, and 8,590,024, while there are 12,418 FPs. These patterns (HUPs and OPPPs with 5, 25, and 50 periods) have total utilities of 3.76E7, 2.70E7, 2.97E7, and 1.40E9, respectively; these values are not included in Table 5 due to the space limit. When setting minfre: 0.07% and minpro: 0.28% on retail, the numbers of HUPs, # ${P5}$ , # ${P25}$ and # ${P50}$ change to 6,194, 4,159, 10,675, and 8,187,650, and their total utilities are 3.05E7, 2.55E7, 2.77E7, and 1.38E9, respectively. The reason is that a huge number of frequent patterns are always found, while few of them are on-shelf popular with high profit. Moreover, it is clear that both the period and popularity factors affect the derived results of the addressed problem in terms of the number of patterns and the achieved total utility. In addition, the larger the granularity of the on-shelf periods, the higher the revenue that can be achieved.

What is more, it is interesting to notice that the number of OPPPs found by the OP3M framework may increase when the number of time periods in the processed dataset increases. This can be clearly observed from the results of # ${P5}$ , # ${P25}$ and # ${P50}$ on retail. Besides, the number of time periods does not affect the results on the very dense datasets: since each transaction contains similar products and quantities, an on-shelf hot and profitable pattern within a short time period is likely to be an OPPP within a long time period as well. The distribution of the $k$ -patterns of the derived # ${P5}$ , # ${P25}$ and # ${P50}$ is omitted here due to space limitations. In general, frequent patterns do not usually contain a large portion of the desired profitable on-shelf patterns, because the complete information (e.g., OSA, profit) is ignored. This implies the importance of inferring and understanding consumers’ adoption behavior. In particular, using methods derived from information search and economic utility theory, we can focus on understanding the economic behavior of users from the periodic behavior in the historical data.

5.3 Efficiency Analytics

From Table 5, we can observe the influence of the minfre threshold, the minpro threshold, and the number of time periods. Furthermore, we performed an experiment to assess the influence of the number of time periods on the execution time, with the same parameter settings as in Table 5. Notice that OP3M $_{P5}$ , OP3M $_{P25}$ and OP3M $_{P50}$ denote the running times of the OP3M algorithm on the dataset with 5, 25, and 50 time periods, respectively. Results are shown in Figure 3 for the four datasets. As we can see, the designed OP3M scales well w.r.t. the number of periods on all datasets under various parameters. For example, when varying minpro on chess, as shown in Figure 3(b), we always have OP3M $_{P5} \lt$ OP3M $_{P25} \lt$ OP3M $_{P50}$ ; the same trend can also be observed in Figures 3(a) and 3(c). The reason why OP3M performs better with fewer periods is that it mines all time periods at the same time, and fewer periods make it easier to meet the conditions of the pruning strategies in OP3M earlier, thus leading to less computation time. By contrast, mining the periods separately and merging the results found in each time period would degrade performance when the number of time periods is large.

Fig. 3.

Furthermore, we performed an experiment to assess the scalability of OP3M w.r.t. the number of transactions. We ran the proposed profit-based OP3M model on the kosarak dataset with minfre: 0.15% and minpro: 1%, and varied the number of transactions from 100,000 to 900,000. The results of the scalability test are shown in Figure 4. In general, runtime is an estimate of how long it takes to perform such an analysis. From the results, it can be observed that OP3M has linear scalability w.r.t. the number of transactions.

Fig. 4.

6 Conclusion

In this paper, we have presented a novel framework named OP3M for searching on-shelf popular and most profitable products in databases where both positive and negative profit values appear and the on-shelf time of products is considered. This is the first work to systematically study the problem of profit-based optimization for economic behavior, including purchase frequency, on-shelf availability w.r.t. purchase time periods, and profit. Given historical datasets on market share, the designed algorithm can help us make sense of users’ economic behavior and find the on-shelf popular and most profitable products. OP3M also brings several improvements over existing technologies. Based on the developed profit-based OPP-lists, it is a single-phase algorithm that does not need to maintain candidates in memory. It relies on the novel concept of remaining positive profit, and uses a depth-first search rather than a level-wise search. Moreover, OP3M finds OPPPs in all time periods at the same time rather than searching each period separately and performing costly intersection operations on the results of each time period. The extensive performance evaluation on several datasets demonstrates the effectiveness and efficiency of the OP3M algorithm.

Footnotes

https://www.groceryinsight.com/blog/.

https://recsys.acm.org/recsys15/challenge/.

http://www.philippe-fournier-viger.com/spmf/.

References

[1]

Rakesh Agrawal, Ramakrishnan Srikant, et al. 1994. Fast algorithms for mining association rules. In Proceedings of the 20th International Conference on Very Large Data Bases, Vol. 1215. 487–499.

Symbol	Description
\(I\)	A set of \(m\) items/products, I = {i \(_{1}\) , i \(_{2}\) , \(\ldots ,\) i \(_{m}\) }.
\(X\)	A group of products \(X\) = \(\lbrace i_1,\) \(i_2,\) \(\dots , i_j\rbrace\) .
\(D\)	A quantitative database, D = {T \(_{1}\) , T \(_{2}\) , \(\ldots ,\) T \(_{n}\) }.
minfre	A minimum frequency threshold.
minpro	A minimum profit threshold.
\(sup(X)\)	The total support value of \(X\) in \(D\) .
\({OPPP}\)	On-shelf most popular and profitable product.
\(q(i_j, T_c)\)	The occurred quantity of an item \(i_j\) in \(T_c\) .
\(up(i_j)\)	Each item \(i_j \in I\) has a unit profit.
\(p(i_j, T_c)\)	The profit of an item \(i_j\) in \(T_c\) .
\(p(X, h)\)	The sum of profits of \(X\) in a period \(h\) .
RTWU	Redefined transaction-weighted utilization.
pp(X)	The sum of positive profit of \(X\) in \(D\) .
np(X)	The sum of negative profit of \(X\) in \(D\) .
rpp(X)	The sum of remaining positive profit of \(X\) in \(D\) .
OPP-list	List structure with On-shelf Popularity and Profit.
OFU \(^{\pm }\) -table	An On-shelf Frequency-Utility (with both positive and negative profit value) table.
X.list	The OPP-list of a group of products \(X\) .

Tid	User	Purchase record	Period
\(T_1\)	\(U_1\)	\((b,2)(c,1)(e,3)\)	1
\(T_2\)	\(U_2\)	\((a,1)(b,1)(c,2)(f,1)\)	1
\(T_3\)	\(U_3\)	\((a,3)(b,6)(c,4)(d,1)(e,1)(f,2)\)	2
\(T_4\)	\(U_4\)	\((c,3)(d,3)(e,1)\)	2
\(T_5\)	\(U_5\)	\((a,1)(d,2)(e,3)(f,1)\)	3

