Optimizing A Dynamic Order-Picking Process: Yossi Bukchin, Eugene Khmelnitsky, Pini Yakuel
Abstract
This research studies the problem of batching orders in a dynamic, finite-horizon
environment to minimize order tardiness and overtime costs of the pickers. The
problem introduces the following trade-off: at every period, the picker has to decide
whether to go on a tour and pick the accumulated orders, or to wait for more orders to
arrive. By waiting, the picker risks higher tardiness of existing orders in exchange
for lower tardiness of future orders. We use a Markov Decision Process (MDP) based
approach to set an optimal decision making policy. In order to evaluate the potential
improvement of the proposed approach in practice, we compare the optimal policy
with two naïve heuristics: (1) “Go on tour immediately after an order arrives”, and
(2) “Wait as long as the current orders can be picked and supplied on time”. The
optimal policy shows a considerable improvement over the naïve heuristics, in the
range of 7%–99%, where the specific value depends on the picking process
parameters. We have found that one measure, the slack percentage of the picking
process, associated with the difference between the promised lead time and the single
item picking time, predicts quite accurately the cost reduction generated by the
optimal policy. The structure and the properties of the optimal solutions have led to
the construction of a more comprehensive heuristic method. Numerical results show
that the proposed heuristic, MDP-H, outperforms the naïve heuristics in all
experiments. As compared to the optimal solution, MDP-H provides close to optimal
results for a slack of up to 40%.
1. Introduction
Order-picking is the process of retrieving items from stocking locations in a warehouse to
satisfy given demands. This process may involve as much as 60% of all labor activities in a
warehouse and may account for as much as 65% of all operating expenses (Gademann and
Van de Velde, 2005).
The performance of an order picking system is typically determined by seven factors:
batching, picking sequence, storage policy, zoning, layout design, picking equipment and
design of picking information. Some research has been concerned with studying the
joint effect of several factors on the performance of order picking systems. Petersen and
Aase (2004) evaluated a number of picking, routing and storing methods, in order to
determine which combination of these factors is best in terms of picking time. Each
combination was compared to a basic scenario, in which orders are picked separately, items
are stored randomly and the traversal strategy is used for routing. They concluded that
batching of orders leads to the largest improvement, especially when small sized orders are
frequent. Moreover, an improved storage policy (one which is not random, for example, class
based) also achieves significant improvement, and with less sensitivity to order size. The best
combination reduced the picking time by almost 30%. Other papers address each order
picking performance factor separately. In that context, batching related studies are very
common. Generally, the order batching problem is the problem of simultaneously assigning
orders to batches and determining a picking tour for every batch so as to optimize an
objective function. The main driver for batching is to reduce the average picking travel
distance and thereby increase the throughput, and improve due date performance. Gademann
and Van de Velde (2005) addressed the problem of batching orders to minimize total travel
time in a parallel aisle warehouse. This problem is also referred to as proximity batching,
since the obvious motivation is to batch orders that are stored in near locations. They proved
that the problem is NP-hard in the strong sense, but can be solved in a polynomial time when
the batch size is no greater than two orders. In the past, many heuristics have been presented
in the literature for proximity batching. Most of these heuristics first select a seed order for a
batch and subsequently expand the batch with orders that have "proximity" to the seed order
as long as the picking cart capacity is not exceeded. The distinctive factor is the measure of
the proximity of orders. Armstrong et al. (1979) considered proximity batching with fixed
batch sizes and presented an integer programming model. Gibson and Sharp (1992)
considered order batching in an order picking operation of storage and retrieval (S/R)
machines. Elsayed and Lee (1996) investigated automated storage/retrieval (AS/R) systems
where a due date is specified for each retrieval order. They considered the inclusion of both
order retrieval and storage in the same tour when possible. Their main results include a set of
rules for sequencing and batching orders to tours such that the total tardiness of retrievals per
group of orders is minimized.
The routing strategies of pickers in the warehouse were investigated in Hall (1993). Three
strategies for routing manual pickers are compared: (1) traversal, (2) midpoint, and (3) largest
gap. The comparison was made by estimating the expected route length of each strategy. The
results include a few rules of thumb which assist in choosing one strategy over another. For
example, the third strategy is best when the average number of picks per aisle is relatively
small. Another study was conducted by Roodbergen and de Koster (2001), who considered a
parallel aisle warehouse, where order pickers can change aisles at the ends of every aisle and
also at a cross aisle halfway along the aisles. They concluded that in many cases the average
order picking time can be decreased significantly by adding a middle aisle to the layout.
In zoning, the warehouse is divided into zones so that each order is split into sub-orders
which are allocated to the different zones. Every sub-order is picked in the respective
zone and the entire order is rejoined in the packing area. Jane and Laih (2005) studied
a synchronized zone order picking system. In such a system, the pickers of all zones work on
the same order simultaneously. In order to prevent balance loss, the authors suggest storing
items, which are likely to be a part of the same order, in different zones. Next, they developed
a natural cluster model for item assignment in the warehouse. In one case study, the proposed
item clustering approach improved the system's efficiency by 29% and the order picking time
by 18%. Jane (2000) has developed a heuristic algorithm for a progressive zone picking
system. Unlike synchronized zoning, under progressive zoning each order is processed by
one zone picker at a time. The research objective was to balance workloads among all pickers
so each one has almost the same load and to adjust the zone size for order volume
fluctuations. The proposed method was illustrated and verified to achieve the objective
through empirical data and simulation experiments.
As described above, most of the related literature deals with the static problem of picking
a fixed number of orders in the most efficient way while finding the best picking sequence or
picking strategy (batching or zoning). However, in many warehouses and distribution centers
(DC), the picking activity is executed under uncertainty, since the inter-arrival time of
customer orders is stochastic by nature. Both a DC satisfying customer orders made via the
Internet and an automotive warehouse providing spare parts for auto-shops are examples of
such an environment.
In this research, we address the problem of batching orders in a dynamic, finite-horizon
environment to minimize order tardiness and overtime costs of the pickers. This problem is
solved to optimality using a Markov decision process based approach. The performance of
the optimal procedure was compared with two naïve heuristics and found to be significantly
superior. The structure and the properties of the obtained solutions lead to constructing an
efficient heuristic, called MDP-H. The comparison between the proposed heuristic and the
optimal one shows that MDP-H provides close-to-optimal solutions (within 0.62% of the optimum) for a slack
up to 40%. In all experiments, MDP-H provides better solutions than the two naïve heuristics.
Although this paper mainly refers to the manual order picking system, we expect the
results to be applicable to automatic systems as well, where AS/R machines are responsible
for the picking operations. Equipped with a dual or triple shuttle, an AS/R
machine is capable of picking a small number of orders simultaneously, just like a human
picker who uses a multi-bin picking cart. Given this analogy, consider, for example, an
AS/R machine operating in a Blockbuster DVD rental center: customers demand DVDs at
random times during the day, and a picking policy for the S/R machine must be defined with
the purpose of maximizing the customer service level (i.e., minimizing order tardiness).
The structure of this paper is as follows. In Section 2 the problem description is
presented. Section 3 formulates the problem as a Markov Decision Process (MDP) and
briefly outlines the solution algorithm. In Section 4 the optimal solution is compared with
naïve batching strategies and some numerical results are presented. Section 5 analyzes a new
heuristic, which is developed on the basis of the optimal strategies' properties learned from the
MDP solutions. The performance of the heuristic is then compared both with the optimal
approach and the naïve heuristics. Finally, in Section 6 we discuss the main contribution of
the paper and indicate further research opportunities.
2. Problem description
The problem studied can be outlined in the following manner. Orders, each of a single line
item, are picked by one picker who uses a cart of limited capacity. Different orders/items are
being placed in different bins of the cart during the picking tour. This picking method is
referred to as sort-while-pick. Orders arrive according to a Poisson process with a mean of
λ orders per period of time. All orders
are supplied under the same service level, by having the same customer lead time.
Whenever an order is supplied after its due date, a penalty, proportional to the
number of tardy periods, is incurred. A finite horizon is considered, as the warehouse is
closed at the end of each working day, after fulfilling all the orders of that day. Consequently,
another kind of penalty is incurred whenever the picker keeps on working after the end of the
working day. This penalty is proportional to the number of overtime periods.
The fundamental trade-off existing in our problem can be explained as follows. At every
period, regardless of whether a new order has arrived or not, the picker has to decide whether
to go on a picking tour and supply the orders accumulated so far or to wait for more orders to
arrive (to batch orders). The former decision may speed up the supply of the currently
available orders. However, by doing this, the picker may miss an opportunity to batch more
orders had he waited one more period. That is, by waiting, the picker risks higher tardiness of
existing orders for the potential lower tardiness of future orders. Our goal is to set a decision
making policy that will minimize the average cost of order tardiness and worker overtime
during a finite working day.
It is clear that the time to pick a batch of n orders changes according to their storage
locations in the warehouse. However, in this model, we assume that the picking tour time of n
items, T(n), is an increasing function of the number of items, n, and independent of their
locations. Moreover, we assume that T(n) is a concave function of n and therefore there is a
motivation for batching items before going on a tour.
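To see why concavity of T(n) encourages batching, consider a hypothetical concave tour-time function; the numeric values below are illustrative assumptions, not taken from the paper:

```python
# Hypothetical concave tour-time function: each extra item adds less
# travel time than the previous one, so the marginal picking cost falls.
def tour_time(n, base=20, increments=(4, 3, 2)):
    """Total tour time T(n) for a batch of n items (illustrative values)."""
    return base + sum(increments[:n - 1])

# The tour time per order decreases as the batch grows, motivating batching.
per_order = [tour_time(n) / n for n in (1, 2, 3)]   # [20.0, 12.0, 9.0]
```

Because the increments 4, 3, 2 are decreasing, T(n) is concave, and the per-order tour time shrinks from 20 periods for a lone order to 9 periods when three orders share a tour.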
3. MDP formulation
Decision epochs
Let {1, 2,..., N } be a finite set of decision epochs, where the decision epoch N denotes the
end of a working day. According to the policy of the DC, no orders arrive at the last I periods
of the working day, in order to allow the picker to supply all the orders that arrived during the
first (N − I) periods. If I is chosen to be relatively small, then there is a good chance that the DC would
have to remain open after decision epoch N and therefore pay for overtime. If I is relatively
large, then there is only a small chance of overtime.
System states
Let S = S′ ∪ Δ denote the set of the possible system states, where S′ is the set of states
describing the order batching process, and Δ is the set of states describing the picking tour.
Let γi denote the remaining time to supply order i, and Γn = (γ1, γ2, …, γn) the vector of
remaining times to supply the n orders batched so far, where γ1 < γ2 < … < γn. Recall that the
strict inequality results from the fact that at most one order can enter the system within a
single period. A member of the set S′ = {s′ | s′ = (n, (γ1, γ2, …, γn))} contains the number of
orders batched, n, and their corresponding remaining times to supply, Γn. For example, the
system state s′ = (2, (3,5)) implies that two orders were batched so far. The first order is due
in three periods and the second one is due in five periods. The state set S′ is bounded because
the values of n and γi for all i are bounded. For all i, γi is bounded from above by d, the
planned lead time of an order, and from below by L, the lowest time left to the due date:
L ≤ γi ≤ d; n is bounded by the number of bins in the picking cart, C. In case either the
remaining time to supply the oldest order, γ1, reaches the value of L, or the cart is full, the
picker is forced to go on a picking tour. The state s′ = (0, ∅) describes the system with no
orders.
As mentioned above, Δ is the set of states describing the picking tour. The members of
this set are defined by the time left to the end of the picking tour and the expected length of
the tour; i.e., Δ = {δ | δ = (k, T(n))}, where k is the time left to the end of the tour, and T(n) is
the length of the tour. For example, the system state δ = (3, T(5)) implies that a picking tour
will be over in three periods and its total length is T(5) periods. The Δ state space counts the
periods left in the picking tour, in order to determine the epoch in which the system comes
back to the S' state space. The tour length is also kept as a part of the state in order to
calculate the correct transition probability to the S' state space.
The state set Δ consists of the following members:
Δ = {(T(n) − 1, T(n)), (T(n) − 2, T(n)), …, (1, T(n)), (0, 0)} for all n = 1, …, C
where δ = (0, 0) is the state of a picking tour that ends in one of the last I periods of the
working day (i.e., an absorbing state, since no more orders arrive in the last I periods).
Actions
The action set As depends on state s and includes at most two actions for each state. The first
action, a1 , is to wait for one more period and the second action, a 2 , is to go on a picking
tour. Clearly, a choice of a2 is prohibited during a picking tour and when no orders have been
batched in the system. More precisely,
As = {a1}        if s ∈ Δ
As = {a1}        if s ∈ S′, n = 0
As = {a2}        if s ∈ S′, n = C or γ1 = L
As = {a1, a2}    otherwise
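The definition of As can be sketched as a small lookup function; the state encoding below is a hypothetical convention for illustration, not the paper's notation:

```python
def actions(state, C, L):
    """Admissible actions per the definition of A_s.

    state is ('tour', k, Tn) for a Delta state, or ('batch', n, gammas)
    for an S' state, where gammas is the sorted tuple of remaining times
    (gamma_1 smallest). 'wait' stands for a1, 'go' for a2.
    """
    if state[0] == 'tour':              # during a tour only waiting is possible
        return {'wait'}
    _, n, gammas = state
    if n == 0:                          # nothing batched: cannot go on tour
        return {'wait'}
    if n == C or gammas[0] == L:        # cart full or oldest order at deadline
        return {'go'}
    return {'wait', 'go'}
```

For example, with C = 3 and L = 0, the state (2, (3,5)) admits both actions, while (3, (2,4,6)) forces the picker to go on a tour.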
Rewards
Whenever action a1 is chosen, the decision maker receives no reward. If action a2 is chosen,
then an immediate penalty, which is proportional to the tardiness of all the orders
accumulated thus far, is incurred. Notice that since the length of the tour given n is assumed
known, T(n), the tardiness can be calculated before the tour has actually started. We denote
by cT the tardiness penalty per period and by cO the overtime penalty per period. Note that
the overtime penalty is incurred only once, at the end of the working day, in epoch N. The
value of the tardiness penalty at every time of the working day, t, and for every possible
action and state combination is
rt(s, a) =
  0                                      if a = a1
  −cT · Σ_{i=1..n} max(T(n) − γi, 0)     if a = a2
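A sketch of this reward under the stated assumptions (the piecewise form is reconstructed from the surrounding text; cT is the per-period tardiness cost):

```python
def reward(action, gammas, tour_len, c_T):
    """Immediate reward r_t(s, a): zero for waiting (a1); for going on a
    tour (a2) of length tour_len, the order with remaining time g is tardy
    by max(tour_len - g, 0) periods, each period costing c_T."""
    if action == 'wait':                                       # a1
        return 0
    return -c_T * sum(max(tour_len - g, 0) for g in gammas)    # a2
```

For instance, going on a 6-period tour with remaining times (3, 5) makes the two orders tardy by 3 and 1 periods, for a reward of −4·cT.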
Transition probabilities
In constructing the transition probabilities, two simplifying
assumptions are taken: the expected time to pick one order is larger than two periods, and the
probability of more than C orders entering during the longest picking tour, T(C), is negligible.
The transition probabilities differ in two distinct time frames. The first time frame is
comprised of the first N-I periods during which orders can enter the system, and the second
time frame is comprised of the last I periods during which orders do not enter the system. The
transition probability matrix of the first time frame is presented in (1).
For t < N − I:

Pt(j | s, a) =
  1             if s ∈ S′ with n > 0; j = (T(n) − 1, T(n)) ∈ Δ; a = a2
  1             if s = (k, T(n)) ∈ Δ; j = (k − 1, T(n)) ∈ Δ, for k = 2, …, T(n) − 1; a = a1
  1 − λe^(−λ)    if s = (n < C, (γ1 > L, γ2, …, γn)) ∈ S′;
                j = (n, (γ1 − 1, γ2 − 1, …, γn − 1)) ∈ S′; a = a1
  λe^(−λ)        if s = (n < C, (γ1 > L, γ2, …, γn)) ∈ S′;                                  (1)
                j = (n + 1, (γ1 − 1, …, γn − 1, γn+1 = d)) ∈ S′; a = a1
  P             if s = (1, T(n)) ∈ Δ; j = (n′, (γ1 = d − (T(n) − p1), …, γn′ = d − (T(n) − pn′))) ∈ S′
                when n′ > 0, or j = (0, ∅) when n′ = 0; a = a1
  0             otherwise
In the first line of (1), action a 2 (go on tour) is chosen, and the system evolves into the
set of the picking tour states with a probability of 1. The remaining tour time in the next state,
j, is T(n) − 1. In the second line, the system occupies a state from Δ and moves to another state
from Δ with a probability of 1. The remaining tour time is decreased by one period. This is
true for all Δ states apart from δ = (1, T(n)). The third and fourth lines consider a case in
which the system occupies a state from S' and does not have to go on a tour immediately.
That is, the number of batched orders is smaller than C and the oldest order has more than L
periods left to its due date. Then, if an action a1 is chosen, the next state will be determined
by whether an order has entered the system (line 4) or not (line 3). The fifth line addresses a
transition from the state δ = (1, T ( n )) into a state from S'. The transition to a specific state, j,
is determined by both the number of orders that entered the system during the picking tour,
n', and the time periods of the picking tour, denoted by p1, p2 ,..., pn ' , in which the n' orders
have entered the system.
To elaborate on the transition presented in the fifth line of (1), we consider the following
example, presented in Figure 1. Let a picking tour last five periods, and suppose two orders
entered the system during that tour, at periods 2 and 4. Then n′ = 2, p1 = 2 and
p2 = 4.
Figure 1. A five-period picking tour with two arrivals; at its end the system returns to a state from S′.
The state to which the system will transit is (2, (d − 3, d − 1)): there are two
orders to be supplied, with ages of three and one periods, respectively. The probability P
in the example is calculated as follows. The probability that two orders arrive within a
picking tour of five periods is e^(−5λ)(5λ)²/2!. By the properties of the Poisson process,
conditional on two orders arriving during the tour, the arrivals are equally likely to fall in
any two of the five periods. Since the number of such options is C(5,2), the transition
probability is finally obtained as

P = e^(−5λ) · (5λ)²/2! · C(5,2)^(−1).
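The value of P can be computed directly; the rate λ below is an arbitrary illustrative choice:

```python
import math

def tour_arrival_prob(n_arrivals, tour_len, lam):
    """Probability that exactly n_arrivals orders arrive during a tour of
    tour_len periods AND fall in one particular set of arrival periods:
    the Poisson probability of the count, divided by the number of
    equally likely placements of the arrivals among the tour periods."""
    count_prob = (math.exp(-lam * tour_len)
                  * (lam * tour_len) ** n_arrivals / math.factorial(n_arrivals))
    return count_prob / math.comb(tour_len, n_arrivals)

# The example: two arrivals during a five-period tour, at fixed periods.
P = tour_arrival_prob(2, 5, 0.1)
```

With λ = 0.1 this evaluates the formula e^(−0.5)(0.5)²/2! divided by C(5,2) = 10.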
The transition probability matrix for the second time frame is given in (2). In this frame a
picking tour is taken immediately, since no new orders can arrive.
For N − I ≤ t < N
In the first line of (2), N − t < T(n), and hence the last tour does not end before the end of the
working day. In this case, the overtime length is kept for future calculation of the overtime
cost. In the second line, there is enough time to complete the tour and the system evolves to
the absorbing state. In the third and fourth lines, the system moves to the absorbing state
immediately. The dynamics of the states within a picking tour is addressed in the fifth line.
The sixth line is similar to line 5 in (1). The only difference is that orders enter the system
only in the first N − I − t − 1 + T(n) periods of the tour rather than during the entire tour.
An optimal policy
The problem described above is characterized by a finite set of states, S, and a finite set of
actions, As , for each s∈ S . Therefore, there exists an optimal deterministic Markovian
policy, as stated in Puterman (1994). Let u*t(st) be the maximum total expected reward
starting from state st over decision epochs t, t + 1, …, N − 1. Then u*t(st) is obtained by the
following backward induction algorithm, which also gives the optimal actions for each state
and each epoch, A*st,t:

u*t(st) = max_{a ∈ Ast} { rt(st, a) + Σ_{j ∈ S} Pt(j | st, a) · u*t+1(j) }
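The backward induction recursion can be sketched generically; the interface below (states, actions, rewards and transitions supplied as functions) is a schematic of the recursion, not the paper's full order-picking model:

```python
def backward_induction(states, actions, reward, trans, N):
    """Finite-horizon backward induction.

    actions(t, s) -> iterable of actions; reward(t, s, a) -> immediate reward;
    trans(t, s, a) -> dict {next_state: probability}.
    Returns the value functions u[t][s] and an optimal policy pi[t][s]."""
    u = [dict() for _ in range(N + 1)]
    pi = [dict() for _ in range(N)]
    for s in states:
        u[N][s] = 0.0                      # terminal value at the horizon
    for t in range(N - 1, -1, -1):         # sweep backwards in time
        for s in states:
            best_val, best_a = float('-inf'), None
            for a in actions(t, s):
                val = reward(t, s, a) + sum(
                    p * u[t + 1][j] for j, p in trans(t, s, a).items())
                if val > best_val:
                    best_val, best_a = val, a
            u[t][s], pi[t][s] = best_val, best_a
    return u, pi
```

Because tardiness and overtime enter as negative rewards, maximizing total expected reward is equivalent to minimizing total expected cost.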
4. Experiments
4.1. MDP model versus naïve heuristics
The main objectives of the experiments conducted in this section are:
• To validate the mathematical model.
• To evaluate possible cost reduction via applying the proposed approach in a real order
picking system.
• To gain insights into the structure and the properties of optimal solutions that will
assist in developing new MDP based heuristic methods.
In order to implement the MDP model, we have developed a computer code, and obtained
as output a table containing the optimal policy. Each row of the table corresponds to one of
the possible system states, excluding the Δ states, which do not involve any decision. Each
column of the table corresponds to a time period of the working day. The entries of the
table specify the optimal action choice: "1" means "go on tour"
and "0" means "wait another time period". An example is presented in Figure 2, where the
upper left hand side of an optimal policy table is shown. For demonstration purposes, the "go
on tour" policy was painted in green while "wait another time period" policy was painted in
red. One can see, for example, that when the system contains a single item, the picker waits
when the time to supply the order is relatively large; however, when this value is smaller than
or equal to 16, the picker goes on a tour.
A simulation model of the order picking system was developed to evaluate the
performance of the proposed MDP based solution procedure versus two naïve heuristics. The
first heuristic (referred to hereafter as the Green heuristic) is quite
straightforward: whenever an order is waiting to be picked and the picker is available, the
picker goes on a picking tour. The second heuristic (referred to hereafter as the Slack
heuristic) prescribes that "waiting another time period" is preferred as long as no
certain tardiness will occur. We say that the system has slack if the picking time of the
batched orders is smaller than their remaining times to supply. This heuristic is called Slack since as long
as there is slack available in the system, the action choice is "wait another time period". Once
there is no slack, the action choice is "go on tour".
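The two decision rules can be written compactly; this is a sketch under the same hypothetical state encoding used earlier, with tour_time playing the role of T(n):

```python
def green_rule(n, gammas, tour_time):
    """Green heuristic: go on a tour as soon as any order is waiting.
    (gammas and tour_time are unused; kept for a uniform interface.)"""
    return 'go' if n > 0 else 'wait'

def slack_rule(n, gammas, tour_time):
    """Slack heuristic: wait while the system has slack, i.e. while a tour
    of the current batch would still finish before the oldest due date."""
    if n == 0:
        return 'wait'
    return 'wait' if tour_time(n) < min(gammas) else 'go'
```

For a 3-period tour and remaining times (4, 6), the Slack rule waits; once the oldest remaining time drops to 3, the slack is gone and the rule switches to "go on tour".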
When examining the tables of the optimal policies obtained in the experiments, we
were able to identify two major effects, demonstrated in Figure 3:
1. Steady state effect – at a certain point in time, which is far back from the end of the
working day, the optimal policy becomes independent of time. In fact, at the steady state,
the optimal policy can be expressed by a vector instead of a table, as each element
denotes the optimal action given a certain state. The steady state effect is clearly
illustrated in the left hand side of Figure 3(a).
2. Transient state effect – toward the end of the working day the optimal policy shows a
time-dependent, irregular pattern, as different actions are associated with the same state at
different points in time (see the right hand side of Figure 3(a)). Note that in the last I
periods, where no new orders arrive, the only action is “go on tour”. Moreover, in some
experiments we were able to identify an additional red shape adjacent to the last I periods,
denoted as the “tail”. An example of such a “tail” is shown in Figure 3(b). In the "tail"
region, despite the certain tardiness, the picker chooses to wait in order to save future
overtime costs. We were also able to identify an influence of the cost parameters on the
transient state as the length of this state increases with the ratio of the overtime and
tardiness cost parameters.
Figure 3. Optimal policy tables: (a) the steady state and transient effects; (b) the "tail" region.
Another observation indicated that the optimal solution is mostly "green", i.e., the action
“go on tour” is made more frequently than the action “wait another time period”. We believe
such a behavior results from the relatively low order arrival rate. Indeed, when the arrival rate
is low, the chance to batch an additional order while waiting another period is relatively low as
well.
4.2. Experimental design
Based on the results of the preliminary experiments, we have determined the
configuration of the final experiments in such a way that all the assumptions of the model are
satisfied and all aspects of the optimal policy are clearly expressed. In particular, λ is chosen
so that the probability of more than C orders arriving during a picking tour is small enough
(1%). For tractability purposes the value of C was set to three orders. The values of the other
parameters are detailed in Table 1. Overall we have conducted 25 experiments that model 25
different warehouse configurations. The tour time function is chosen linear, with T(2) = T(1) + 1
and T(3) = T(2) + 1, consistent with the concavity assumption of Section 2.
Table 1. Parameter values used in the experiments

Parameter | C | I      | N   | d              | L | Slack: d − T(1) | cO | cT
Set at    | 3 | T(3)−1 | 256 | 25,27,30,32,35 | 0 | 5,7,10,12,15    | 10 | 10
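Table 1 implies T(1) = d − Slack = 20 in every configuration. If the slack percentage discussed later is taken relative to the lead time d (an assumption on our part), the five d values span roughly 20% to 43%:

```python
# Slack percentage per configuration, assuming T(1) = 20 so that
# Slack = d - T(1) reproduces the tabulated slack values, and assuming
# the percentage is measured relative to the lead time d.
T1 = 20
slack_pct = {d: round(100 * (d - T1) / d, 1) for d in (25, 27, 30, 32, 35)}
# e.g. d = 25 gives 20.0%, d = 35 gives 42.9%
```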
Table 2. Experimental results
The results indicate that the MDP optimal solution outperforms any of the heuristics in all
the experiments; i.e., its average cost is always lower than the average costs of the two
heuristics. Another observation, which is demonstrated in Figure 4, is that the slack
percentage predicts the relative improvement over the best heuristic quite accurately, as the
improvement percentage increases with the slack percentage.
Figure 4. The relative improvement as a function of the slack value
Clearly, systems with larger slacks suffer from less tardiness and consequently, enjoy
lower average costs. One can notice from Table 2 that cases with high relative improvement
are associated with low absolute values of improvement which may be sometimes negligible.
To stress this point we have divided the results into three groups with respect to high,
medium and small relative slack (see Table 3). In the medium relative slack scenarios the
average improvement is still significant while the average cost is far from being negligible.
Therefore, we conclude that the strength of our model lies in medium slack size scenarios.
The steady state vector of the Green heuristic is characterized by n1 = 0 and n2 = 0.
Similarly, for the Slack heuristic, n1 = Slack and n2 = Slack. Now, we can notice that in all of the
experiments, the MDP steady state vector is almost identical (there might be a difference of
one or two action choices in the entire vector) to the steady state vector of one of the two
naïve heuristics. Therefore, we conclude that the major part of the MDP model benefit is due
to the transient state effect. In addition, the structure of the optimal policy indicates that the
higher the slack percentage, the more preferable the Green heuristic is against the Slack
heuristic.
5. Heuristic methods
5.1 Background
In this section, a heuristic approach for large-scale problems is proposed. To this end, the
structure of the optimal policy, expressed by the colored table (see Section 4.1.), was
analyzed. Fortunately, regular patterns were identified in the optimal policy. These patterns
and their characteristics were the cornerstones of our heuristic design. We distinguish
between patterns of the steady state and patterns of the transient state, and use these patterns
in developing the heuristic. The main purpose of the proposed heuristic is to develop a close
to optimal procedure which outperforms the best practice heuristics, named Green and Slack
in the previous section. The patterns of the optimal procedure are outlined next.
Patterns in the steady state
We define the steady state as a time period in which the action choice depends only on the
system state s and not on the decision epoch t. Thus, the steady state can be defined by a
policy vector instead of a policy table.
According to the optimal results, the steady state vector has only a few configurations.
The structure of the steady state is described by two parameters, n1 and n2, that take only three
values each. Specifically, one form of the steady state is a ‘green’ vector, which prescribes “go
on tour” for every possible state. Figure 5(a) illustrates such a case. Another form of the
steady state vector is one of full slack usage or one with full slack usage minus one1. This is
illustrated in Figure 5(b). Rarely, the slack usage could be uneven between states of two
orders and states of one order. Namely, n1 and n2 are not necessarily equal in all of the
optimal solutions.
¹ Full slack usage indicates that system states in which slack is available are painted red. Similarly, full slack
usage minus one indicates that the same states are painted red, apart from the state with only one slack time
period, which is painted green.
Figure 5. Steady state and transient patterns for a “green” solution (a) and a “full slack usage”
(b).
When analyzing the results of the main experiments, we have noticed that the steady state
vector seems to have a strong link to the slack percentage. This observation was extremely
helpful in the construction of the heuristic policy. Table 4 shows the 25 experiments, sorted
by the slack percentage. It is easily seen that (i) in low slack percentage systems the steady
state is described by a ‘green’ vector (i.e., n1 = n2 =0); (ii) in medium slack percentage
systems the steady state is described by a full slack usage minus one vector (i.e., n1 = n2 =1);
(iii) in high slack percentage systems the steady state is described by a full slack usage vector
(i.e., n1 = n2 = Slack). As mentioned above, note that in experiments 13 and 17 the values of
n1 and n2 are not equal. We refer to this issue later on.
Patterns in the transient state
In the transient time, just before the end of the planning period, the system shows an unstable
behavior. Nevertheless, clear and repetitive patterns still exist. One clear pattern occurs in
systems for which the steady state vector is not ‘green’. In these cases, at least three green
holes are seen in the policy table. Such a case is illustrated in Figure 5(b). Another noticeable
pattern occurs in systems in which the steady state vector is ‘green’. In these cases, at least
four red cubes are observed in the policy table. Such a case is illustrated in Figure 5(a).
Furthermore, the thickness of the cubes is the same in most of the cases.
The transient state patterns also depend strongly on the slack percentage.
Interestingly, they depend on the length of the picking tour as well. For example, we have
discovered that the exact starting position of each of the three green holes can be determined
by T(1). This is illustrated in Figure 6.
The tail patterns are also apparent in the transient state. These patterns usually appear
in low slack systems, specifically in systems with a slack percentage lower than 32% (see
Table 4). The tail is typically characterized by a fixed thickness and appears at specified
places in the policy table (see Figure 3(b)).
The last pattern, associated with the transient state, shows that the last I periods are
always green, since in that time period no orders arrive and therefore there is no need to
wait.
Key points of the heuristic design
Three main principles have guided us in designing the heuristic approach:
1. A rough cut of the optimal policy. The basic idea of our design is to follow the
general visual form of the optimal policy table. Consequently, we identify several
problem types based on their parameters and construct a typical generic heuristic
policy for each type, based on the above patterns of the optimal policy. Still,
our heuristic policy does not imitate the exact pattern of the optimal policy; for
example, we ignore the jagged left side of the red patterns shown in Figure 6 and
replace it with a rectangular pattern.
2. The maximum similarity principle. The heuristic policy comprises several
parameters, all of which are set based on the results of the optimal solutions.
Consequently, we identified empirical properties of the optimal solution with
regard to each parameter and determined the parameters of the heuristic policy
accordingly.
3. A ‘don’t damage’ approach. The heuristic policy attempts to achieve better results
than the two naive heuristics. Accordingly, we wanted the MDP heuristic policy to
deviate from the naive heuristics only when such a deviation yields improved
performance over the best naive heuristic. Therefore, we were very conservative in
the parameter setting. When the optimal policy follows a pattern similar to that in
Figure 5(a), the green policy heuristic clearly outperforms the slack heuristic; in
this case, we added only those cubes that were observed in all cases. Similarly,
when the optimal policy follows a pattern similar to that in Figure 5(b), the slack
heuristic clearly outperforms the green heuristic; in these cases, only those green
holes that were identified in all of the cases were added.
Figure 7 illustrates the rough cut approach by showing four policies, two optimal and two
heuristic, for two problems.
(a) Optimal policy (high slack) (b) Optimal policy (low slack)
(c) Heuristic policy (high slack) (d) Heuristic policy (low slack)
Figure 7. Optimal versus heuristic policy in high and low slack percentage systems
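The green (go immediately) and slack (wait while still feasible) naive heuristics referenced above can be sketched as simple decision rules. This is a minimal illustration, not the paper's exact formulation: the state summary (number of waiting orders, age of the oldest order) and the function names are our assumptions.

```python
def green_policy(n_orders: int) -> str:
    """'Green' heuristic: go on tour as soon as any order is waiting."""
    return "go" if n_orders > 0 else "wait"

def slack_policy(n_orders: int, oldest_age: int, lead_time: int, tour_time: int) -> str:
    """'Slack' heuristic: wait as long as the accumulated orders can still be
    picked and supplied on time; go once the remaining slack is exhausted."""
    if n_orders == 0:
        return "wait"
    remaining = lead_time - oldest_age  # periods left before the oldest order is due
    return "go" if remaining <= tour_time else "wait"
```

The ‘don’t damage’ principle then amounts to defaulting to whichever of these two rules is better for the problem instance, and overriding it only in the regions (holes or cubes) identified from the optimal policy.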
5.2 Algorithmic formulation
The following steps work as the instructions to the construction of an MDP based heuristic.
These instructions are general and they fit different warehouses with different configurations.
Every parameter in the following formulation was generated according to the maximum
similarity principle and the ‘don’t damage approach’, which were described above.
Step 1: Calculate the slack percentage.
Step 2: Set n1 and n2 in the following manner (shown for n1; n2 is set identically):

    n1 = 0      if 0 ≤ slack percentage ≤ 0.36
    n1 = 1      if 0.36 < slack percentage ≤ 0.46
    n1 = slack  if 0.46 < slack percentage ≤ 1
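Steps 1-2 can be sketched in code. The exact normalization of the slack percentage is not restated here, so the formula (d − T(1)) / d below is an assumption based on the paper's description of slack as the difference between the promised lead time and the single-order picking time:

```python
def slack_percentage(d: int, t1: int) -> float:
    """Step 1 (sketch): slack as the relative difference between the promised
    lead time d and the single-order picking time T(1). The normalization
    (d - T(1)) / d is our assumption."""
    return (d - t1) / d

def set_n1(slack_pct: float, slack: int) -> int:
    """Step 2: choose n1 (and likewise n2) from the slack percentage."""
    if slack_pct <= 0.36:
        return 0
    if slack_pct <= 0.46:
        return 1
    return slack
```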
Step 4: Set the three green holes or the four red cubes.
If the slack percentage is lower than 0.44, construct three green holes as follows:
1. t1 = N − I − T(1): the starting point of the first green hole.
2. t2 = t1 − T(1): the starting point of the second green hole.
3. t3 = t2 − T(1): the starting point of the third green hole.
4. A = ⌈0.31818·T(1)⌉: the length of the first hole.
5. B = ⌈0.18182·T(1)⌉: the length of the second hole.
6. C = ⌈0.04545·T(1)⌉: the length of the third hole.
7. V1 = {t1, t1+1, .., t1+A−1}: the group of decision epochs in which the first hole is present.
8. V2 = {t2, t2+1, .., t2+B−1}: the group of decision epochs in which the second hole is present.
9. V3 = {t3, t3+1, .., t3+C−1}: the group of decision epochs in which the third hole is present.
10. V = V1 ∪ V2 ∪ V3.
Set the three green holes as follows: for every decision epoch t in the group V and for every
state s in the group S, set a = a2 (go on tour).
Otherwise, construct four red cubes as follows:
1. t1 = N − I − 1: the starting point of the first red cube.
2. t2 = t1 − T(3): the starting point of the second red cube.
3. t3 = t2 − T(3): the starting point of the third red cube.
4. t4 = t3 − T(3): the starting point of the fourth red cube.
5. A = ⌊0.38306·T(3)⌋: the length of the first cube.
6. B = ⌊0.48254·T(3)⌋: the length of the second cube.
7. C = ⌊0.52844·T(3)⌋: the length of the third cube.
8. D = ⌊0.76339·T(3)⌋: the length of the fourth cube.
9. W1 = {t1, t1−1, .., t1−A−1}: the group of decision epochs in which the first cube is present.
10. W2 = {t2, t2−1, .., t2−B−1}: the group of decision epochs in which the second cube is present.
11. W3 = {t3, t3−1, .., t3−C−1}: the group of decision epochs in which the third cube is present.
12. W4 = {t4, t4−1, .., t4−D−1}: the group of decision epochs in which the fourth cube is present.
13. W = W1 ∪ W2 ∪ W3 ∪ W4.
Set the four red cubes as follows: for every decision epoch t in the group W and for every
state s that uses the slack fully except the last unit (i.e., n1 and n2 equal
1), set a = a1 (wait another time period).
Based on these parameters (m1, m2, tail thickness, stopping state), set the action a1 (wait
another period) on the regions of the tail.
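The region construction in Step 4 can be sketched as follows. This minimal Python illustration returns only the set of decision epochs (V or W) and a label for the branch taken, leaving the assignment of actions a1/a2 in the policy table to the surrounding code; the function name and return convention are ours.

```python
import math

def build_heuristic_regions(N, I, T1, T3, slack_pct):
    """Step 4 sketch: decision epochs of the three green holes (low slack)
    or the four red cubes (high slack)."""
    if slack_pct < 0.44:
        # Three green holes, spaced T(1) apart, shrinking in length.
        t1 = N - I - T1
        t2 = t1 - T1
        t3 = t2 - T1
        A = math.ceil(0.31818 * T1)
        B = math.ceil(0.18182 * T1)
        C = math.ceil(0.04545 * T1)
        V = (set(range(t1, t1 + A)) | set(range(t2, t2 + B))
             | set(range(t3, t3 + C)))
        return "green_holes", V   # at epochs in V, for every state: a = a2 (go on tour)
    else:
        # Four red cubes, spaced T(3) apart, extending backwards in time.
        t1 = N - I - 1
        t2 = t1 - T3
        t3 = t2 - T3
        t4 = t3 - T3
        A = math.floor(0.38306 * T3)
        B = math.floor(0.48254 * T3)
        C = math.floor(0.52844 * T3)
        D = math.floor(0.76339 * T3)
        W = (set(range(t1 - A - 1, t1 + 1)) | set(range(t2 - B - 1, t2 + 1))
             | set(range(t3 - C - 1, t3 + 1)) | set(range(t4 - D - 1, t4 + 1)))
        return "red_cubes", W     # at epochs in W, in states with n1 = n2 = 1: a = a1 (wait)
```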
5.3 Experiments
After completing the design of the MDP-based heuristic, we conducted experiments to
evaluate the performance of the heuristic procedure. The main purpose of the
experiments was to compare its performance with the two naive heuristics and the optimal
algorithm. In addition, the effect of the length of the planning period was examined.
Since the slack percentage turned out to be very meaningful in the first session of
experiments, we now determine T(1) indirectly, by defining the slack percentage as a
direct independent parameter. Three parameters were examined. First, the length of the
planning period was set to two levels, 256 and 540 decision epochs. Next, the order lead
time, d, was set to 15, 30 and 45. Last, the slack percentage, identified in the previous
set of experiments as the most influential parameter, was set to five values: 20, 33, 40,
53 and 60 percent. All other parameters were kept the same as in the first session of
experiments.
The experimental results are presented in Table 6. The first five columns contain the
experiment number, the length (in time periods) of the working day N, the order lead time
d, the picking time of one order T(1), and the slack percentage. These data define the
warehouse configuration. The next four columns contain the average daily cost, evaluated
over 10,000 runs of the simulation model (10,000 working days), for each of the four
order-picking policies (the optimal MDP policy, the MDP-based heuristic and the two naive
heuristics). The tenth column presents the relative improvement (in terms of average daily
cost) of the MDP-based heuristic over the better naive heuristic. The next column indicates
whether this difference is statistically significant (at the 95% confidence level). Finally,
the percentage distance between the optimal policy and the MDP heuristic policy is shown.
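The text does not state which statistical test underlies the significance column. With 10,000 independent simulated days per policy, a two-sample z-test on the mean daily costs is a natural sketch; the function name and its inputs are our assumptions.

```python
import math

def diff_significant(mean_a, mean_b, std_a, std_b, n=10000, z=1.96):
    """Normal-approximation two-sample test: is the difference between two
    policies' average daily costs, each estimated from n independent simulated
    days, significant at the 95% level?"""
    se = math.sqrt(std_a ** 2 / n + std_b ** 2 / n)  # standard error of the difference
    return abs(mean_a - mean_b) > z * se
```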
Figure 8. Average improvement of MDP-H over naïve heuristic.
Figure 9. The effect of the working day length on the average improvement of MDP-H over
the best naïve heuristic.
6. Concluding remarks
We study a dynamic order-picking problem. In today's fast-paced economy, such order-picking
systems are very common, and consequently the possible applications of this work are
abundant. Besides the considerable cost reduction generated by the optimal policy,
the policy displayed clear patterns in its configuration. Moreover, the patterns appear to
be strongly linked to the problem's parameters, especially the slack percentage. Based on
this observation, we have developed an MDP-based heuristic approach to the problem. The
MDP-based heuristic achieved better results than the two naive heuristics and, unlike the
MDP optimal policy, is generated with relative ease.
The MDP approach used in this research is quite flexible and enables
addressing different variations of the problem. However, the approach suffers from the
curse of dimensionality: as the problem becomes more detailed (with fewer assumptions),
the number of possible system states grows very large and causes computational
difficulties. Therefore, we recommend that further research be conducted on the basis of
a reinforcement learning approach. In reinforcement learning, one does not search for an
optimal solution directly; instead, the decision-making policy is constantly improved
based on the results of past decisions. Using this approach would allow relaxing some of
the assumptions adopted here. In particular, the order arrival process would not have to
be Poisson. Additionally, an environment with multiple pickers and orders of multiple
items with different due dates could easily be addressed.
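As a toy illustration of the suggested direction (not a method developed in this paper), a tabular Q-learning agent for the wait/go decision might look as follows. The state summary, transition dynamics, and tardiness-proxy reward are all our simplifications.

```python
import random

def q_learning_wait_or_go(episodes=500, N=40, d=8, tour=3,
                          arrival_p=0.3, alpha=0.1, gamma=0.95, eps=0.1, seed=0):
    """Toy tabular Q-learning for the wait/go decision.
    State: (epoch, waiting orders capped at 5, age of oldest order capped at d).
    Actions: 0 = wait one period (a1), 1 = go on tour (a2).
    Reward: minus a tardiness proxy (lateness of the oldest order when picked)."""
    rng = random.Random(seed)
    Q = {}
    def q(s, a):
        return Q.get((s, a), 0.0)
    for _ in range(episodes):
        t, waiting, oldest = 0, 0, 0
        while t < N:
            s = (t, min(waiting, 5), min(oldest, d))
            # epsilon-greedy action selection
            if rng.random() < eps:
                a = rng.randrange(2)
            else:
                a = 1 if q(s, 1) >= q(s, 0) else 0
            if a == 1 and waiting > 0:
                reward = -max(0, oldest + tour - d)  # lateness of oldest order
                waiting, oldest = 0, 0
                t_next = t + tour
            else:
                reward = 0.0
                t_next = t + 1
                if waiting > 0:
                    oldest += 1
            # Poisson-like arrivals simplified to a Bernoulli arrival per period
            if t_next < N and rng.random() < arrival_p:
                waiting += 1
            s_next = (min(t_next, N), min(waiting, 5), min(oldest, d))
            target = reward + (gamma * max(q(s_next, 0), q(s_next, 1))
                               if t_next < N else 0.0)
            Q[(s, a)] = q(s, a) + alpha * (target - q(s, a))
            t = t_next
    return Q
```

Because the policy is improved from sampled experience rather than from the transition matrix, the Bernoulli arrival line above could be replaced by any arrival process, which is precisely the flexibility argued for in the text.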
References
Armstrong, R.D., Cook, W.D., and Saipe, A.L., (1979). Optimal batching in a semi-automated
order picking system. Journal of the Operational Research Society, 30(8), 711-720.
Elsayed, E.A., and Lee, M.-K., (1996). Order processing in automated storage/retrieval
systems with due dates. IIE Transactions, 28(7), 567-577.
Gademann, N., and Van de Velde, S., (2005). Order batching to minimize total travel time in
a parallel-aisle warehouse. IIE Transactions, 37, 63-75.
Gibson, D.R., and Sharp, G.P., (1992). Order batching procedures. European Journal of
Operational Research, 58, 57-67.
Hall, R.W., (1993). Distance approximations for routing manual pickers in a warehouse. IIE
Transactions, 25(4), 76-87.
Hane, C.C., and Laih, Y.W., (2005). A clustering algorithm for item assignment in a
synchronized zone order picking system. European Journal of Operational Research, 166,
489-496.
Jane, C.-C., (2000). Storage location assignment in a distribution center. International
Journal of Physical Distribution & Logistics Management, 30(1), 55-71.
Petersen, C.G., and Aase, G., (2004). A comparison of picking, storage, and routing policies
in manual order picking. International Journal of Production Economics, 92, 11-19.
Puterman, M.L., (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming.
Wiley Series in Probability and Mathematical Statistics.
Roodbergen, K.J., and De Koster, R., (2001). Routing order pickers in a warehouse with a
middle aisle. European Journal of Operational Research, 133, 32-43.