Optimizing A Dynamic Order-Picking Process: Yossi Bukchin, Eugene Khmelnitsky, Pini Yakuel

Optimizing a Dynamic Order-Picking Process
Yossi Bukchin, Eugene Khmelnitsky, Pini Yakuel

Department of Industrial Engineering, Tel-Aviv University, Tel-Aviv 69978, ISRAEL
Abstract
This research studies the problem of batching orders in a dynamic, finite-horizon
environment to minimize order tardiness and overtime costs of the pickers. The
problem introduces the following trade-off: at every period, the picker has to decide
whether to go on a tour and pick the accumulated orders, or to wait for more orders to
arrive. By waiting, the picker risks higher tardiness of existing orders on the account
of lower tardiness of future orders. We use a Markov Decision Process (MDP) based
approach to set an optimal decision making policy. In order to evaluate the potential
improvement of the proposed approach in practice, we compare the optimal policy
with two naive heuristics: (1) “Go on tour immediately after an order arrives”, and,
(2) “Wait as long as the current orders can be picked and supplied on time”. The
optimal policy shows a considerable improvement over the naïve heuristics, in the
range of 7%-99%, where the specific values depend on the picking process
parameters. We have found that one measure, the slack percentage of the picking
process, associated with the difference between the promised lead time and the single
item picking time, predicts quite accurately the cost reduction generated by the
optimal policy. The structure and the properties of the optimal solutions have led to
the construction of a more comprehensive heuristic method. Numerical results show
that the proposed heuristic, MDP-H, outperforms the naïve heuristics in all
experiments. As compared to the optimal solution, MDP-H provides close to optimal
results for a slack of up to 40%.
1. Introduction
Order-picking is the process of retrieving items from stocking locations in a warehouse to
satisfy given demands. This process may involve as much as 60% of all labor activities in a
warehouse and may account for as much as 65% of all operating expenses (Gademann and
Van de velde, 2005).
The performance of an order picking system is typically determined by seven factors:
batching, picking sequence, storage policy, zoning, layout design, picking equipment and
design of picking information. Some research has been mainly concerned with studying the
joint effect of several factors, on the performance of order picking systems. Petersen and
Aase (2004) evaluated a number of picking, routing and storing methods, in order to
determine which combination of these factors is best in terms of picking time. Each
combination was compared to a basic scenario, in which orders are picked separately, items
are stored randomly and the traversal strategy is used for routing. They concluded that
batching of orders leads to the largest improvement, especially when small sized orders are
frequent. Moreover, an improved storage policy (one which is not random, for example, class
based) also achieves significant improvement, and with less sensitivity to order size. The best
combination reduced the picking time by almost 30%. Other papers address each order
picking performance factor separately. In that context, batching related studies are very
common. Generally, the order batching problem is the problem of simultaneously assigning
orders to batches and determining a picking tour for every batch so as to optimize an
objective function. The main driver for batching is to reduce the average picking travel
distance and thereby increase the throughput, and improve due date performance. Gademann
and Van De Velde (2005) addressed the problem of batching orders to minimize total travel
time in a parallel aisle warehouse. This problem is also referred to as proximity batching,
since the obvious motivation is to batch orders that are stored in near locations. They proved
that the problem is NP-hard in the strong sense, but can be solved in a polynomial time when
the batch size is no greater than two orders. In the past, many heuristics have been presented
in the literature for proximity batching. Most of these heuristics first select a seed order for a
batch and subsequently expand the batch with orders that have "proximity" to the seed order
as long as the picking cart capacity is not exceeded. The distinctive factor is the measure of
the proximity of orders. Armstrong et al. (1979) considered proximity batching with fixed
batch sizes and presented an integer programming model. Gibson and Sharp (1992)
considered order batching in an order picking operation of storage and retrieval (S/R)
machines. Elsayed and Lee (1996) investigated automated storage/retrieval (AS/R) systems
where a due date is specified for each retrieval order. They considered the inclusion of both
order retrieval and storage in the same tour when possible. Their main results include a set of
rules for sequencing and batching orders to tours such that the total tardiness of retrievals per
group of orders is minimized.
2
The routing strategies of pickers in the warehouse were investigated in Hall (1993). Three
strategies for routing manual pickers are compared: (1) traversal, (2) midpoint, and (3) largest
gap. The comparison was made by estimating the expected route length of each strategy. The
results include a few rules of thumb which assist in choosing one strategy over another. For
example, the third strategy is best when the average number of picks per aisle is relatively
small. Another research was made by Roodbergen and de Koster (2001) who considered a
parallel aisle warehouse, where order pickers can change aisles at the ends of every aisle and
also at a cross aisle halfway along the aisles. They concluded that in many cases the average
order picking time can be decreased significantly by adding a middle aisle to the layout.
In zoning, the warehouse is divided into zones so that each order is divided into sub
orders which are allocated to the different zones. Every sub order is picked in the respective
zone and the entire order is being rejoined in the packing area. Hane and Laih (2005) studied
a synchronized zone order picking system. In such a system, the pickers of all zones work on
the same order simultaneously. In order to prevent balance loss, the authors suggest storing
items, which are likely to be a part of the same order, in different zones. Next, they developed
a natural cluster model for item assignment in the warehouse. In one case study, the proposed
item clustering approach improved the system's efficiency by 29% and the order picking time
by 18%. Jane (2000) has developed a heuristic algorithm for a progressive zone picking
system. Unlike synchronized zoning, under progressive zoning each order is processed by
one zone picker at a time. The research objective was to balance workloads among all pickers
so each one has almost the same load and to adjust the zone size for order volume
fluctuations. The proposed method was illustrated and verified to achieve the objective
through empirical data and simulation experiments.
As described above, most of the related literature deals with the static problem of picking
a fixed number of orders in the most efficient way while finding the best picking sequence or
picking strategy (batching or zoning). However, in many warehouses and distribution centers
(DC), the picking activity is executed under uncertainty, since the inter-arrival time of
customer orders is stochastic by nature. Both a DC satisfying customer orders made via the
Internet and an automotive warehouse providing spare parts for auto-shops are examples of
such an environment.
In this research, we address the problem of batching orders in a dynamic, finite-horizon
environment to minimize order tardiness and overtime costs of the pickers. This problem is
solved to optimality using a Markov decision process based approach. The performance of
the optimal procedure was compared with two naïve heuristics and is found significantly
3
superior. The structure and the properties of the obtained solutions lead to constructing an
efficient heuristic, called MDP-H. The comparison between the proposed heuristic and the
optimal one shows that MDP-H provides close to optimal solution (up to 0.62%) for a slack
up to 40%. In all experiments, MDP-H provides better solutions than the two naïve heuristics.
Although this paper mainly refers to the manual order picking system, we expect the
results to be applicable to automatic systems as well, where AS/R machines are responsible
for these operations in an AS/R system. Equipped with a dual or triple shuttle, the AS/R
machine is capable of picking a small number of orders simultaneously, just like a human
picker who uses a multi-bin, picking cart. Having this analogy defined, one might realize the
possible applicability of the automatic system. For example, an AS/R machine that operates
in a Blockbuster DVD rent center. Customers demand to rent DVDs randomly during the day
and a picking policy for the S/R Machine must be defined with the purpose of maximizing
customer service level (minimizing order tardiness).
The structure of this paper is as follows. In Section 2 the problem description is
presented. Section 3 formulates the problem as a Markov Decision Process (MDP) and
briefly outlines the solution algorithm. In Section 4 the optimal solution is compared with
naïve batching strategies and some numerical results are presented. Section 5 analyses a new
heuristic, which is developed on the basis of the optimal strategies' properties learnt from the
MDP solutions. The performance of the heuristic is then compared both with the optimal
approach and the naïve heuristics. Finally, in Section 6 we discuss the main contribution of
the paper and indicate further research opportunities.
2. Problem description
The problem studied can be outlined in the following manner. Orders, each of a single line
item, are picked by one picker who uses a cart of limited capacity. Different orders/items are
being placed in different bins of the cart during the picking tour. This picking method is
referred to as sort-while-pick. The arrival rate of the orders is a random variable, which
follows the Poisson distribution with a mean value of λ orders per period of time. All orders
are supplied according to the same service level, by having the same customer lead time.
Whenever an order is supplied after its due date, a penalty, which is proportional to the
amount of tardiness periods, is consumed. A finite horizon is considered, as the warehouse is
closed at the end of each working day, after fulfilling all the orders of that day. Consequently,
4
another kind of penalty is incurred whenever the picker keeps on working after the end of the
working day. This penalty is proportional to the amount of overtime periods.
The fundamental trade-off existing in our problem can be explained as follows. At every
period, regardless of whether a new order has arrived or not, the picker has to decide whether
to go on a picking tour and supply the orders accumulated so far or to wait for more orders to
arrive (to batch orders). The former decision may speed up the supply of the currently
available orders. However, by doing this, the picker may miss an opportunity to batch more
orders had he waited one more period. That is, by waiting, the picker risks higher tardiness of
existing orders for the potential lower tardiness of future orders. Our goal is to set a decision
making policy that will minimize the average cost of order tardiness and worker overtime
during a finite working day.
It is clear that the time to pick a batch of n orders changes according to their storage
locations in the warehouse. However, in this model, we assume that the picking tour time of n
items, T(n), is an increasing function of the number of items, n, and independent of their
locations. Still, we assume that T(n) is a concave function of n and therefore there is a
motivation for batching items before going on a tour.
3. The solution approach

We have modeled the above problem via MDP. The time horizon over which the optimal
order picking policy is obtained is finite and associated with one working day. The working
day is divided into periods of time, where the period length is set in such a way that the
probability of the arrival of more than one order during a period is negligible.
At the beginning of each period (i.e., at each decision epoch), the decision maker has two
actions to choose from: to go on a picking tour or to wait another period. If he decided to
wait, he receives no reward. Otherwise, if he decided to go on a picking tour, he incurs a cost
proportional to the tardiness of the orders batched so far. The system then evolves into the
next decision epoch according to the transition probability matrix of the Markov chain. Each
state is defined by the number of orders batched and their corresponding remaining times to
supply. The number of states in our problem is finite since the picker cannot accumulate more
orders than the number of bins in the picking cart, and since we have defined the waiting time
of an order to be limited (i.e., every order must be supplied after a predefined amount of
time). The elements of the MDP model are described next.
5
Decision epochs
Let {1, 2,..., N } be a finite set of decision epochs, where the decision epoch N denotes the
end of a working day. According to the policy of the DC, no orders arrive at the last I periods
of the working day, in order to allow the picker to supply all the orders arrived during the (N-
I) periods. If I is chosen to be relatively small, then there is a good chance that the DC would
have to remain open after decision epoch N and therefore pay for overtime. If I is relatively
large, then there is only a small chance of overtime.
System states
Let S = S ' U Δ denote the set of the possible system’s states, where S' is the set of states
describing the order batching process, and Δ is the set of states describing the picking tour.
Let γ i denote the remaining time to supply order i and Γn = (γ 1 , γ 2 , K, γ n ) , the vector of
remaining times to supply the n orders batched so far, where γ 1 < γ 2 < ... < γ n . Recall that the
strict inequality results from the fact that at most one order can enter the system within a
single period. A member of the set S ' = {s' | s' = (n, (γ 1 , γ 2 ,K , γ n ))} contains the number of
orders batched, n, and their corresponding remaining times to supply, Γn . For example, the
system's state s '= {2, (3,5)} implies that two orders were batched so far. The first order is due
in three periods and the second one is due in five periods. The S' state set is bounded because
the values of n and γ i for all i are bounded. For all i, γ i is bounded from above by d, the
planned lead time of an order, and from below by L, the lowest time left to the due date,
L ≤ γ i ≤ d ; n is bounded by the number of bins in the picking cart, C. In case either the
remaining time to supply the oldest order, γ 1 , reaches the value of L, or, the cart is full, the
picker is forced going on a picking tour. The state s ' = (0, φ ) describes the system with no
orders.
As mentioned above, Δ is the set of states describing the picking tour. The members of
this set are defined by the time left to the end of the picking tour and the expected length of
the tour; i.e., Δ = {δ | δ = (k , T (n))} , where k is the time left to the end of the tour, and T(n) is
the length of the tour. For example, the system's state δ = (3, T (5)) implies that a picking tour
will be over in three periods and its total length is T(5) periods. The Δ state space counts the
periods left in the picking tour, in order to determine the epoch in which the system comes
back to the S' state space. The tour length is also kept as a part of the state in order to
calculate the correct transition probability to the S' state space.
6
The state set Δ consists of the following members:
Δ = {(T ( n ) − 1, T ( n )), (T ( n ) − 2, T ( n )),..., (1, T ( n )), (0,0)} for all n=1,…,C
where, δ = ( 0,0 ) is the state of a picking tour that ends in one of the last I periods of the
working day (i.e., an absorbing state, since no more orders arrive in the last I periods).
Actions
The action set As depends on state s and includes at most two actions for each state. The first
action, a1 , is to wait for one more period and the second action, a 2 , is to go on a picking
tour. Clearly, a choice of a2 is prohibited during a picking tour and when no orders have been
batched in the system. More precisely,
⎧{a1} s∈Δ
⎪{a } s∈S '| n = 0
⎪
As = ⎨ 1
⎪{a2 } s ∈ S ' | n = C or γ 1 = L
⎪⎩{a1 , a2 } otherwise
Rewards
Whenever action a1 is chosen, the decision maker receives no reward. If action a2 is chosen,
then an immediate penalty, which is proportional to the tardiness of all the orders
accumulated thus far, is incurred. Notice that since the length of the tour given n is assumed
known, T(n), the tardiness can be calculated before the tour has actually started. We denote
by cT the tardiness penalty per period and by cO the overtime penalty per period. Note that
the overtime penalty is incurred only once, at the end of the working day, in epoch N. The
value of the tardiness penalty at every time of the working day, t, and for every possible
action and state combination is
⎧0 a = a1
⎪
rt ( s, a) = ⎨ n
⎪ −cT ∑ max(T (n) − γ i , 0) a = a2 , s = {n, (γ 1 , K , γ n )}

⎩ i =1
and the overtime penalty value at time N is

rN ((k , T (n))) = −k cO ∀k = 1KT (n) − 1
rN ((0,0)) = 0
Transition probabilities
At each decision epoch t, given the system state s and the action a, we determine the
probability to reach a state j in the subsequent decision epoch t+1. Two additional
7
assumptions are taken: the expected time to pick one order is larger than two periods; and the
probability of more than C orders entering during the longest picking tour, T(C), is negligible.
The transition probabilities differ in two distinct time frames. The first time frame is
comprised of the first N-I periods during which orders can enter the system, and the second
time frame is comprised of the last I periods during which orders do not enter the system. The
transition probability matrix of the first time frame is presented in (1).
For t < N − I
⎧1 s ∈ S ' | n > 0, j ∈ Δ, j = (T (n) − 1, T (n)), a = a2
⎪1 s ∈ Δ, s = (k , T (n)), j ∈ Δ, j = (k − 1, T (n)), ∀k = 2 KT (n) − 1, a = a1
⎪
⎪1 − λ e− λ s ∈ S ' | s = (n < C , (γ 1 > L, γ 2 ,..., γ n )),
⎪
⎪ j ∈ S ' | j = (n, (γ 1 := γ 1 − 1,..., γ n := γ n − 1)), a = a1
⎪ (1)
Pt ( j | s, a) = ⎨λ e − λ s ∈ S ' | s = (n < C , (γ 1 > L, γ 2 ,..., γ n )),
⎪ j ∈ S ' | j = (n + 1, (γ 1 := γ 1 − 1,..., γ n := γ n − 1, γ n +1 := d )), a = a1
⎪
⎪P s ∈ Δ, s = (1, T (n)), j ∈ S ' | j = (n ', (γ 1 = d − (T (n) − p1 ),..., γ n ' = d −
⎪
⎪ − (T (n) − pn ' ))) if n ' > 0, or j = (0, ∅) if n ' = 0, a = a1
⎪0 otherwise
⎩
where, P = e − λT ( n ) × (λT (n)) ×

n'
1 .
n'! ⎛ T ( n) ⎞
⎜⎜ ⎟⎟
⎝ n' ⎠
In the first line of (1), action a 2 (go on tour) is chosen, and the system evolves into the
set of the picking tour states with a probability of 1. The remaining tour time in the next state,
j, is T(n)-1. In the second line, the system occupies a state from Δ and moves in another state
from Δ with a probability of 1. The remaining tour time is decreased by one period. This is
true for all Δ states apart from δ = (1, T ( n )) . The third and forth lines consider a case in
which the system occupies a state from S' and does not have to go on a tour immediately.
That is, the number of batched orders is smaller than C and the oldest order has more than L
periods left to its due date. Then, if an action a1 is chosen, the next state will be determined
by whether an order has entered the system (line 4) or not (line 3). The fifth line addresses a
transition from the state δ = (1, T ( n )) into a state from S'. The transition to a specific state, j,
is determined by both the number of orders that entered the system during the picking tour,
n', and the time periods of the picking tour, denoted by p1, p2 ,..., pn ' , in which the n' orders
have entered the system.
To elaborate on the transition presented in the fifth line of (1), we consider the following
example, presented in Figure 1. Let a picking tour last five periods, and there are two orders
8
that entered the system during that tour, at the periods 2 and 4. Then, n'=2, p1 = 2 and
p2 = 4 .
The system
returns to a
state from S'
Figure 1. Arrivals during picking tour.
The state to which the system will transmit is defined as (2,(d-3,d-1)). There are two
orders to be supplied with the ages of three and one periods, respectively. The probability P
in the example is calculated as follows. The probability of arriving two orders within the
picking tour of five periods is e −5 λ (5λ ) . Due to the lack of memory property of the Poisson
2
2!
distribution, the two orders could have entered in any two periods with the same probability.
Since the number of options is ⎛⎜⎜ ⎞⎟⎟ , the transition probability is finally obtained as,
5
2 ⎝ ⎠
−1
−5 λ (5λ ) 2 ⎛ 5 ⎞ .
P=e ⎜ ⎟
2! ⎝ 2 ⎠
The transition probability matrix for the second time frame is given in (2). In this frame a
picking tour is taken immediately, since no new orders can arrive.
For N − I ≤ t < N
⎧1 s ∈S '| n > 0, j ∈Δ, j = (T(n) −1,T(n)), a = a2 , N − t < T (n)

⎪1 s ∈S '| n > 0, j ∈Δ, j = (0,0), a = a2 , N − t ≥ T (n)
⎪
⎪1 s ∈S '| n = 0, j ∈Δ, j = (0,0), a = a1
⎪
⎪1 s ∈Δ, s = (0,0), j ∈Δ, j = (0,0), a = a1
Pt ( j | s, a) = ⎨ (2)
⎪1 s ∈Δ, s = (k,T(n)), j ∈Δ, j = (k −1,T(n)), ∀k = 2,K,T(n) −1, a = a1
⎪P* s ∈Δ, s = (1,T (n)), j ∈S '| j = (n ',(γ1 = d − (T(n) − p1),...,γ n' = d − (T (n) − pn' ))),
⎪
⎪ or j = (0, ∅) if n ' = 0, a = a1, t +1−T(n) < N − I , t ≤ N − 2
⎪0 otherwise
⎩
(λ ( N − I − t − 1 + T (n))) n ' 1
where, P * = e − λ ( N − I −t −1+T ( n )) × ×
n'! ⎛ N − I − t − 1 + T ( n) ⎞
⎜⎜ ⎟⎟
⎝ n' ⎠
9
In the first line of (2), N-t<T(n), and hence, the last tour does not end before the end of the
working day. In this case, the overtime length is kept for future calculation of the overtime
cost. In the second line, there is enough time to complete the tour and the system evolves to
the absorbing state. In the third and fourth lines, the system moves to the absorbing state
immediately. The dynamics of the states within a picking tour is addressed in the fifth line.
The sixth line is similar to line 5 in (1). The only difference is that orders enter the system
only in the first N − I − t − 1 + T ( n ) periods of the tour rather than during the entire tour.
An optimal policy
The problem described above is characterized by a finite set of states, S, and a finite set of
actions, As , for each s∈ S . Therefore, there exists an optimal deterministic Markovian
policy, as stated in Puterman (1994). Let ut* (st ) be the maximum total expected reward,
starting from state s t at decision epochs t , t + 1,..., N − 1 . Then, ut* (st ) is obtained by the
following Backwards induction algorithm, which gives also the optimal actions for each state
and each epoch, As*t ,t .
The Backwards induction algorithm

1. Set t = N and u *N (s N ) = rN (s N ) for all s N ∈ S
2. Substitute t-1 for t and compute ut* (st ) for each s t ∈ S by
u t* ( st ) = max{rt ( st , a ) + ∑ Pt ( j | s t , a )u t*+1 ( j )}
a∈ Ast
j∈ S
Set As*t ,t = arg max{rt ( st , a ) + ∑ Pt ( j | st , a )u t*+1 ( j )}

a∈ Ast
j∈S
3. If t = 1 , stop. Otherwise return to step 2.
4. Experiments
4.1. MDP model versus naïve heuristics
The main objectives of the experiments conducted in this section are:
• To validate the mathematical model.
• To evaluate possible cost reduction via applying the proposed approach in a real order
picking system.
• To gain insights into the structure and the properties of optimal solutions that will
assist in developing new MDP based heuristic methods.
10
In order to implement the MDP model, we have developed a computer code, and obtained
as output a table containing the optimal policy. Each row in the table stands for each one of
the possible system’s states, excluding the Δ states which do not involve any decision. Each
column of the table stands for each time period during a working day. The data within the
table specifies instructions, regarding the optimal action choice, as "1" means "go on tour"
and "0" means "wait another time period". An example is presented in Figure 2, where the
upper left hand side of an optimal policy table is shown. For demonstration purposes, the "go
on tour" policy was painted in green while "wait another time period" policy was painted in
red. One can see, for example, that when the system contains a single item, the picker waits
when the time to supply the order is relatively large, however, when this value is smaller than
or equal to 16, the picker goes on a tour.
Figure 2. An example of an optimal policy table
A simulation model of the order picking system was developed to evaluate the
performance of the proposed MDP based solution procedure versus two naïve heuristics. The
first heuristic (will be referred to as the Green heuristic from here on) is quite
straightforward. Whenever an order is waiting to be picked and the picker is available, the
picker will go on a picking tour. The second heuristic (will be referred to as the Slack
heuristic from here on) indicates that "waiting another time period" is preferred as long as no
certain tardiness will occur. We say that the system has slack if the orders picking time is
smaller than the orders' remaining times to supply. This heuristic is called Slack since as long
as there is slack available in the system, the action choice is "wait another time period". Once
there is no slack, the action choice is "go on tour".
11
When examining the tables of the optimal policies obtained in the experiments, we
were able to identify two major effects, demonstrated in Figure 3:
1. Steady state effect – at a certain point in time, which is far back from the end of the
working day, the optimal policy becomes independent of time. In fact, at the steady state,
the optimal policy can be expressed by a vector instead of a table, as each element
denotes the optimal action given a certain state. The steady state effect is clearly
illustrated in the left hand side of Figure 3(a).
2. Transient state effect – toward the end of the working day the optimal policy shows time
dependent irregular pattern, as different actions are associated with the same state in
different points in time (see the right hand side of Figure 3(a)). Note that in the last I
periods, where no new orders arrive, the only action is “go on tour”. Moreover, in some
experiments we were able to identify an additional red shape adjacent to the last I periods,
denoted as the “tail”. An example of such a “tail” is shown in Figure 3(b). In the "tail"
region, despite the certain tardiness, the picker chooses to wait in order to save future
overtime costs. We were also able to identify an influence of the cost parameters on the
transient state as the length of this state increases with the ratio of the overtime and
tardiness cost parameters.
(a) (b)
Figure 3. The two major effects identified in the optimal policy.
Another observation indicated that the optimal solution is mostly "green", i.e., the action
“go on tour” is made more frequently than the action “wait another time period”. We believe
such a behavior results from the relatively low order arrival rate. Indeed, when the arrival rate
is low, the chance to batch additional order while waiting another period is relatively low as
well.
12
4.2. Experimental design
When analyzing the results of the preliminary experiments, we have determined the
configuration of the final experiments in such a way that all the assumptions of the model are
satisfied and all aspects of the optimal policy are clearly expressed. In particular, λ is chosen
so that the probability of more than C orders arriving during a picking tour is small enough
(1%). For tractability purposes the value of C was set to three orders. The values of the other
parameters are detailed in Table 1. Overall we have conducted 25 experiments that model 25
different warehouse configurations. The tour time function is chosen convex, T(2)=T(1)+1,
and T(3)=T(2)+1.
Table1 . The values of the problem parameters in the main experiments.
Parameter C I N d L Slack:d-T(1) cO cT
Set at 3 T(3)-1 256 25,27,30,32,35 0 5,7,10,12,15 10 10
4.3. Analysis of results

As noted above, the optimal policy was compared via simulation against two naïve heuristics,
denoted as Green and Slack. The simulation results are presented in Table 2, as each row
refers to one experiment. The order lead time, d, and the system's slack (which is defined as
d-T(1)) are both displayed in the second and third column, respectively. Next, the slack as a
percentage of d is given. Column 5 contains the MDP optimal steady state vector, determined
by the parameters, n1 and n2; the parameters represent the threshold between the green area
and the red area in terms of the slack for one and two-order states, respectively. The value
Slack indicates that all the slack periods are painted green, and therefore the entire steady
state vector is green. The value "0" indicates that all the slack periods are red, and therefore
the steady state vector becomes identical to the vector of the Slack heuristic. The value "1"
indicates that all of the slack periods, except the last one, are painted red. Surprisingly, in all
the experiments, n1 and n2 took only three values, 0, 1 and Slack. That is, as slack decreases
(from 15 down to 5), n1 and n2 decrease as well, while skipping all intermediate values
between Slack and 1. The existence of “tail” is indicated in column 6. Note that “tail” appears
only for states with two orders. Columns 7, 8 and 9 contain the average daily cost (based on
simulation of 10,000 replications of a working day) of the optimal policy, Green heuristic
and Slack heuristic, respectively. The last column contains the relative improvement of the
optimal policy against the best heuristic.
13
Table 2. Experimental results
MDP MDP Green Slack

Improvement to
# D slack Slack% steady “tail” average average average
best H [%]
state daily cost daily cost daily cost
1 35 15 43% n1=1;n2=1 No 35.93 51.12 43.92 22.2
2 35 12 34% n1=0;n2=0 No 117.32 142.31 126.02 7.4
3 35 10 29% n1=0;n2=0 Yes 184.66 212.55 192.96 4.5
4 35 7 20% n1=0;n2=0 Yes 290.11 317.76 296.28 2.1
5 35 5 14% n1=0;n2=0 Yes 362.98 387.1 367.28 1.2
n1=Slack
6 32 15 47% No 12.34 20.86 21.41 69.0
n2=Slack
7 32 12 38% n1=1;n2=1 No 84.28 105.22 93.32 10.7
8 32 10 31% n1=0;n2=0 Yes 152.28 178.69 162.17 6.5
9 32 7 22% n1=0;n2=0 Yes 268.38 295.41 274.21 2.2
10 32 5 16% n1=0;n2=0 Yes 345.52 369.94 350.41 1.4
n1=Slack
11 30 15 50% No 2.37 9.69 12.03 309.17
n2= Slack
12 30 12 40% n1=1;n2=1 No 60.86 77.91 71.32 17.20
13 30 10 33% n1=0;n2=1 No 130 153.94 140.36 7.97
14 30 7 23% n1=0;n2=0 Yes 250.8 277.14 256.88 2.42
15 30 5 17% n1=0;n2=0 Yes 335.79 358.5 339.69 1.16
n1=Slack
16 27 15 56% No 0.423 6.093 8.211 1340.43
n2=Slack
n1=1
17 27 12 44% No 26.87 35.45 39.389 31.93
n2= Slack
18 27 10 37% n1=1;n2=1 No 90.124 108.62 100.93 11.99
19 27 7 26% n1=0;n2=0 Yes 218.11 245.92 225.14 3.22
20 27 5 19% n1=0;n2=0 Yes 312.51 336.1 317.27 1.52
n1=Slack
21 25 15 60% No 0.359 4.544 8.026 1165.74
n2=Slack
n1= Slack
22 25 12 48% No 8.287 14.523 23.487 75.25
n2=Slack
23 25 10 40% n1=1;n2=1 No 61.303 75.156 75.658 22.60
24 25 7 28% n1=0;n2=0 Yes 194.58 219.37 201.79 3.71
25 25 5 20% n1=0;n2=0 Yes 295.63 319.45 300.67 1.70
The results indicate that the MDP optimal solution outperforms any of the heuristics in all
the experiments; i.e., its average cost is always lower than the average costs of the two
heuristics. Another observation, which is demonstrated in Figure 4, is that the slack
percentage predicts the relative improvement to the best heuristic quite accurately, as the
improvement percentage increases with the slack percentage.
14
Figure 4. The relative improvement as a function of the slack value
Clearly, systems with larger slacks suffer from less tardiness and consequently, enjoy
lower average costs. One can notice from Table 2 that cases with high relative improvement
are associated with low absolute values of improvement which may be sometimes negligible.
To stress this point we have divided the results into three groups with respect to high,
medium and small relative slack (see Table 3). In the medium relative slack scenarios the
average improvement is still significant while the average cost is far from being negligible.
Therefore, we conclude that the strength of our model lies in medium slack size scenarios.
Table 3. Absolute value improvement versus relative improvement

Improvement to best
Slack percentage range Average cost
heuristic
44%-60% 8.4 498.6%
26%-43% 121 10.7%
14%-23% 380 1.7%
The steady state vector of the Green heuristic is characterized by n1=slack and n2=slack.
Similarly, for the Slack heuristic, n1=0 and n2=0. Now, we can notice that in all of the
experiments, the MDP steady state vector is almost identical (there might be a difference of
one or two action choices in the entire vector) to the steady state vector of one of the two
15
naïve heuristics. Therefore, we conclude that the major part of the MDP model benefit is due
to the transient state effect. In addition, the structure of the optimal policy indicates that the
higher the slack percentage, the more preferable the Green heuristic is against the Slack
heuristic.
5. Heuristic methods
5.1 Background
In this section, a heuristic approach for large-scale problem is proposed. To this end, the
structure of the optimal policy, expressed by the colored table (see Section 4.1.), was
analyzed. Fortunately, regular patterns were identified in the optimal policy. These patterns
and their characteristics were the cornerstones of our heuristic design. We distinguish
between patterns of the steady state and patterns of the transient state, and use these patterns
in developing the heuristic. The main purpose of the proposed heuristic is to develop a close
to optimal procedure which outperforms the best practice heuristics, named Green and Slack
in the previous section. The patterns of the optimal procedure are outlined next.
Patterns in the steady state
We define the steady state as a time period in which the action choice depends only on the
systems state s and not on the decision epoch t. Thus, the steady state can be defined by a
policy vector instead of a policy table.
According to the optimal results, the steady state vector has only a few configurations.
The structure of steady state is described by the two parameters, n1 and n2 that take only three
values each. Specifically, one form of the steady state is a ‘green’ vector, which orders “go
on tour” for every possible state. Figure 5(a) illustrates such a case. Another form of the
steady state vector is one of full slack usage or one with full slack usage minus one1. This is
illustrated in Figure 5(b). Rarely, the slack usage could be uneven between states of two
orders and states of one order. Namely, n1 and n2 are not necessarily equal in all of the
optimal solutions.
1
Full slack usage indicates that system states, in which slack is available, are painted red. Similarly, full slack
usage minus one indicates that the same states are painted red, apart from the state with only one slack time
period. This state is painted green.
16
Four red cubes
A full slack usage steady state vector

A ‘green’ steady state vector
(a) (b)
Figure 5. Steady state and transient patterns for a “green” solution (a) and a “full slack usage”
(b).
When analyzing the results of the main experiments, we have noticed that the steady state
vector seems to have a strong link to the slack percentage. This observation was extremely
helpful in the construction of the heuristic policy. Table 4 shows the 25 experiments, sorted
by the slack percentage. It is easily seen that (i) in low slack percentage systems the steady
state is described by a ‘green’ vector (i.e., n1 = n2 =0); (ii) in medium slack percentage
systems the steady state is described by a full slack usage minus one vector (i.e., n1 = n2 =1);
(iii) in high slack percentage systems the steady state is described by a full slack usage vector
(i.e., n1 = n2 = Slack). As mentioned above, note that in experiments 13 and 17 the values of
n1 and n2 are not even. We refer to this issue later on.
Table 4: The 25 experiments sorted by the slack percentage.

MDP steady Slack Tail # MDP steady Slack Tail #
Slack d Slack d
state (n1,n2) (%) (Y/N) exp. state (n1,n2) (%) (Y/N) exp.
(0,0) 5 14% Y 35 5 (0,0) 12 34% N 35 2
(0,0) 5 16% Y 32 10 (1,1) 10 37% N 27 18
(0,0) 5 17% Y 30 15 (1,1) 12 38% N 32 7
(0,0) 5 19% Y 27 20 (1,1) 12 40% N 30 12
(0,0) 7 20% Y 35 4 (1,1) 10 40% N 25 23
(0,0) 5 20% Y 25 25 (1,1) 15 43% N 35 1
(0,0) 7 22% Y 32 9 (1,Slack) 12 44% N 27 17
(0,0) 7 23% Y 30 14 (Slack,Slack) 15 47% N 32 6
(0,0) 7 26% Y 27 19 (Slack,Slack) 12 48% N 25 22
(0,0) 7 28% Y 25 24 (Slack,Slack) 15 50% N 30 11
(0,0) 10 29% Y 35 3 (Slack,Slack) 15 56% N 27 16
(0,0) 10 31% Y 32 8 (Slack,Slack) 15 60% N 25 21
(0,1) 10 33% N 30 13
17
Patterns in the transient state
In the transient time, just before the end of the planning period, the system shows an unstable
behavior. Nevertheless, clear and repetitive patterns still exist. One clear pattern occurs in
systems for which the steady state vector is not ‘green’. In these cases, at least three green
holes are seen in the policy table. Such a case is illustrated in Figure 5(b). Another noticeable
pattern occurs in systems in which the steady state vector is ‘green’. In these cases, at least
four red cubes are observed in the policy table. Such a case is illustrated in Figure 5(a).
Furthermore, the thickness of the cubes is the same in most of the cases.
The transient state patterns also very much depend on the slack percentage.
Interestingly, they also depend on the length of the picking tour. For example, we have
discovered that the exact starting position of each of the three green holes can be determined
by T(1). This is illustrated in Figure 6.
T(1) T(1) T(1)
Figure 6. The position of the three green holes.
The tail patterns are also apparent in the transient state. These patterns usually appear
in low slack systems. Specifically, in systems with slack percentage lower than 32% (see
Table 4). The tail is typically characterized by a fixed thickness and appears at specified
places in the policy table (see Figure 3(b)).
The last pattern, associated with the transient state, shows that the last I periods are
always green, since in that time period no orders arrive and therefore there is no need in
waiting.
Key points of the heuristic design
Three main principles have guided us in designing the heuristic approach:
18
1. A rough cut of the optimal policy. The basic idea of our design is to follow the
general visual form of the optimal policy table. Consequently, we identify several
types of problems based on their parameters and construct a typical generic heuristic
policy for each problem type based on the above patterns of the optimal policy. Still,
our heuristic policy does not imitate the exact pattern of the optimal policy. For
example, we ignore the jagged left side of the red patterns shown in Figure 6 and
replace it with a rectangular pattern.
2. The maximum similarity principle. The heuristic policy comprises from several
parameters. The setting strategy of all these parameters was based on the results of
the optimal solutions. Consequently, we have identified empirical properties of the
optimal solution with regard to each of the parameters, and determined the
parameters of the heuristic policy accordingly.
3. A ‘don’t damage approach’. The heuristic policy attempts to achieve better results
than the two naive heuristics. Accordingly, we wanted the MDP heuristic policy to
indicate the action choices different from the naïve heuristics, only when such an
Figure 7 illustrates the rough cut approach by showing four policies, two optimal and two
heuristic, for two problems.
(a) Optimal policy (high slack) (b) Optimal policy (low slack)
(c) Heuristic policy (high slack) (d) Heuristic policy (low slack)
Figure 7. Optimal versus heuristic policy in a high and low slack percentage system
19
action yields an improved performance over the best naïve heuristic. Therefore, we
were very conservative in the parameter setting. When the optimal policy follows a
pattern similar to those appears in Figure 5(a), clearly green policy heuristic
outperforms the slack heuristic. In this case, we have added the only those cubes that
were observed in all cases. Similarly, when the optimal policy follows a pattern
similar to those in Figure 5(b), clearly the slack heuristic outperforms the green
heuristic. In these cases, only those green holes that were identified in all of the cases
were added.
5.2 Algorithmic formulation
The following steps work as the instructions to the construction of an MDP based heuristic.
These instructions are general and they fit different warehouses with different configurations.
Every parameter in the following formulation was generated according to the maximum
similarity principle and the ‘don’t damage approach’, which were described above.
Step 1: Calculate the slack percentage.
Step 2: Set n1 and n2 in the following manner:
⎧0 if 0 ≤ slack percentage ≤ 0.36
⎪
n1 = ⎨1 if 0.36 < slack percentage ≤ 0.46
⎪slack if 0.46 < slack percentage ≤ 1
⎩
⎧0 if 0 ≤ slack percentage ≤ 0.36

⎪
n2 = ⎨1 if 0.36 < slack percentage ≤ 0.44
⎪slack if 0.44 < slack percentage ≤ 1
⎩
Step 3: (generalize the steady state vector on the entire policy table)
1. According to the values of n1 and n2 set the steady state policy vector.
2. Use the steady state policy vector to paint the entire policy table in a unified
manner.
Step 4: (set the three green holes or the four red cubes)
If slack percentage < 0.44 then set the following steps in the left column (constructing
three green holes). Else, set the following steps in the right column (constructing four
red cubes).
1. t1 = N − I − T (1) : the starting point 1. t1 = N − I − 1 : the starting point of
of the first green hole. the first red cube.
2. t2 = t1 − T (1) : the starting point of 2. t2 = t1 − T (3) : the starting point of the
20
the second green hole. second red cube.
3. t3 = t2 − T (1) : the starting point of 3. t3 = t2 − T (3) : the starting point of the
the third green hole. third red cube.
4. A = ⎡0.31818 ⋅ T (1) ⎤ : the length of 4. t4 = t3 − T (3) : the starting point of the
the first hole. forth red cube.
5. B = ⎡0.18182 ⋅ T (1)⎤ : the length of 5. A = ⎣0.38306 ⋅ T (3) ⎦ : the length of
the second hole. the first cube.
6. C = ⎡0.04545 ⋅ T (1)⎤ : the length of 6. B = ⎣0.48254 ⋅ T (3) ⎦ : the length of
the third hole. the second cube.
7. V1 = {t1 , t1 + 1,.., t1 + A − 1}: the group 7. C = ⎣0.52844 ⋅ T (3)⎦ : the length of
of decision epochs in which the first the third cube.
hole is present. 8. D = ⎣0.76339 ⋅ T (3)⎦ : the length of
8. V2 = {t2 , t2 + 1,.., t2 + B − 1} :the group the fourth cube.
of decision epochs in which the 9. W1 = {t1 , t1 − 1,.., t1 − A − 1}: the group
second hole is present. of decision epochs in which the first
9. V3 = {t3 , t3 + 1,.., t3 + C − 1}:the group cube is present.
of decision epochs in which the 10. W2 = {t2 , t2 − 1,.., t2 − B − 1} : the
third hole is present. group of decision epochs in which the
10. V = V1 ∪ V2 ∪ V3 second cube is present.
Set the three green holes as follows. 11. W3 = {t3 , t3 − 1,.., t3 − C − 1}: the
For every decision epoch t, which is a group of decision epochs in which the
member of the group V, and for every third cube is present.
state s, which is a member of the group 12. W4 = {t4 , t4 − 1,.., t4 − D − 1}: the
S, set a=a2 (go on tour).
group of decision epochs in which the
forth cube is present.
13. W = W1 ∪ W2 ∪ W3 ∪ W4
Set the four red cubes as follows. For
every decision epoch t which is a
member of the group W and for every
state s which uses the slack to its full
except the last unit (i.e., n1 and n2 equal
21
1), set a=a1 (wait another time period).
Step 5: (set the tail)

On the basis of the slack percentage set m1 and m2 (which determine whether a tail exists or
not, in states of one order and two orders accordingly), set the tail thickness and the stopping
state of the tail. The stopping state of the tail indicates that no more tails appear below this
state. The tail’s parameters are determined according to Table 5.
Table 5. Tail parameter determination.

Slack percentage m1 – m2 – Tail thickness Stopping state
[1 - tail exists, 0 – [1 - tail exists, 0 – (uniform thickness)
tail does not exist] tail does not exist]
0-0.15 1 1 4 s|γ2=T(2)
0.15-0.21 1 1 3 s|γ2=T(2)
0.21-0.27 1 1 2 s|γ2=T(2)
0.27-0.28 1 1 1 s|γ2=T(2)
0.28-0.32 0 1 1 s|γ2=T(2)+1
0.32-1 0 0 - -
Based on these parameters (m1, m2, Tail thickness, stopping state) set the action a1 (wait
another period) on the regions of the tail.
5.3 Experiments
After completing the design of the MDP based heuristic, we have conducted experimentation
for evaluation the performance of the heuristic procedure. The main purpose of the
experiments was to compare its performance with the two naïve heuristics and the optimal
algorithm. In addition, the effect of the length of the planning period was also examined.
As a result of the first session of experiments, in which the slack percentage turned
out to be very meaningful, we now choose to determine T(1) in an indirect manner. We did
so, by defining the slack percentage as a direct independent parameter. Three parameters
were examined. First, the length of the planning period was set to two levels, 256 and 540
decision epochs. Next, the order lead time, d, was set to 15, 30 and 45. Last, the slack
percentage, which was identified in the previous set of experiments as the most influencing
parameter, was set to five values: 20, 33, 40, 53 and 60 percent. All other parameters were
left the same as in the first session of experiments.
The experimental results are presented in Table 6. The first five columns contain the
number of experiment, the length (in time periods) of the working day N, the order lead time,
d, the picking time of one order T(1) and the slack percentage. These data define the
22
warehouse configuration. The next four columns contain the average daily cost, which was
evaluated by 10,000 runs of the simulation model (10,000 working days), for each of the four
order-picking policies (the optimal MDP policy, the MDP based heuristic and the two naive
heuristics). In the tenth column, the relative improvement (in terms of average daily cost) of
the MDP based heuristic to the better naive heuristic is presented. In the next column one can
see whether the above difference is statistically significant (based on significant level of
95%). Finally, the percentile distance between the optimal policy and the MDP heuristic
policy is shown.
Table 6. Experimental results in the second session.
Slack MDP-H vs best MDP-H vs

# N d T(1) Opt. MDP-H Green Slack Significance
(%) naive H (%) optimal (%)
1 256 15 12 20 293.25 293.74 305.27 294.62 0.30 No 0.17
2 256 15 10 33 131.75 134.61 142.09 137.16 1.89 No 2.17
3 256 15 9 40 65.969 66.042 71.377 74.247 8.08 Yes 0.11
4 256 15 7 53 1.689 2.432 3.942 15.247 62.09 Yes 43.99
5 256 15 6 60 0.298 0.47 2.044 12.78 334.89 Yes 57.72
6 256 30 24 20 291.45 291.21 317.23 294.12 1.00 No 0.00
7 256 30 20 33 130 130.29 153.94 135.77 4.21 Yes 0.22
8 256 30 18 40 61.123 61.824 77.83 68.014 10.01 Yes 1.15
9 256 30 14 53 0.511 2.415 7.428 8.671 207.58 Yes 372.60
10 256 30 12 60 0.371 1.837 6.058 7.581 229.78 Yes 395.15
11 256 45 36 20 281.08 281.64 315.68 287.64 2.13 No 0.20
12 256 45 30 33 125.42 126.1 157.27 135.07 7.11 Yes 0.54
13 256 45 27 40 57.463 58.635 80.737 67.915 15.83 Yes 2.04
14 256 45 21 53 0.763 4.799 12.842 10.669 122.32 Yes 528.96
15 256 45 18 60 0.61 3.533 10.474 8.61 143.70 Yes 479.18
16 540 15 12 20 633.18 633.01 657.67 634.11 0.17 No 0.00
17 540 15 10 33 285.29 291.15 303.48 293.68 0.87 No 2.05
18 540 15 9 40 142.8 142.85 149.25 157.73 4.48 Yes 0.04
19 540 15 7 53 3.312 3.982 5.511 30.318 38.40 Yes 20.23
20 540 15 6 60 0.317 0.529 2.03 25.536 283.74 Yes 66.88
21 540 30 24 20 663.11 662.78 707.05 665.18 0.36 No 0.00
22 540 30 20 33 288.61 288.68 329 294.1 1.88 No 0.02
23 540 30 18 40 133.9 135.23 159.06 143.84 6.37 Yes 0.99
24 540 30 14 53 0.786 2.309 7.433 12.326 221.91 Yes 193.77
25 540 30 12 60 0.31 1.855 5.972 11.157 221.94 Yes 498.39
26 540 45 36 20 660.21 661.04 721.18 668 1.05 No 0.13
27 540 45 30 33 285.83 285.82 337.46 294.12 2.90 No 0.00
28 540 45 27 40 129.35 131.08 163.5 141 7.57 Yes 1.34
29 540 45 21 53 0.685 4.695 12.511 11.902 153.50 Yes 585.40
30 540 45 18 60 0.537 3.85 10.437 10.362 169.14 Yes 616.95
MDP heuristic versus the two naive heuristics

Results show that the MDP based heuristic (MDP-H) outperforms the best naïve heuristic in
all of the experiments. The slack percentage has been identified as the only parameter which
significantly affects the difference between MDP-H and the best naïve heuristic. This value
increases with the value of the slack (see Figure 8). Note that high difference in high slack
systems stems from the fact that the total cost in these systems is very low and mostly
associated with the transient stage. This issue is addressed next.
23
Figure 8. Average improvement of MDP-H over naïve heuristic.
Sensitivity to different working day lengths

Two types of costs were considered in the analysis, the tardiness cost and the overtime cost.
The former occurred mostly during the steady state, while the latter was observed in the
transient state. When analyzing different lengths of working day, clearly, the relative effect of
the transient state decreases with the length of working day. One may expect that the main
advantage of MDP-H over the naïve heuristics is related to the transient state. Consequently,
we expected the difference between the two heuristics to increase as the working day
becomes shorter. When we compared the experiments of 256 to those of 540 time periods, no
significant effect of length of the working day was identified in high slack systems (slacks of
53% and 60%). The reason was that in such systems, the tardiness cost, associated with the
steady state, was close to zero, and hence the overtime cost became the most significant cost
element. As a result, the effect of the working day length became negligible. Small and
medium slack systems (up to 40%) demonstrate the increased effectiveness of the 256 time
epochs experiments over the 540 time epochs, as shown in Figure 9.
Distance from the optimal policy
The optimal policy was generated and compared to MDP-H. Not surprisingly, we have seen
that also in this case the slack percentage significantly affects the distance from optimality. In
particular, we have observed that the distance from optimality increases with the slack.
Figure 10 demonstrates this phenomenon, as one can see that the difference between MDP-H
and the optimal solution is relatively small (up to 0.62%) for a slack percentage of up to 0.4.
For higher slacks we can see much larger differences. However, as can be seen in Figure 14,
high slack systems are characterized by a very low costs. Hence the absolute value of the
difference is insignificant.
24
Figure 9. The effect of the working day length on the average improvement of MDP-H over
best naïve heuristic.
Figure 10. The reciprocal relationship between distance and cost.
5. Concluding remarks
We study a dynamic order picking problem. In today's fast pace economy, such order picking
systems are very common and consequently the possible applications of this work are
abundant. Besides the impressive cost reduction, which was generated by the optimal policy,
the policy displayed obvious patterns in its configuration. Moreover, the patterns seem to
have a strong link to the problems parameters, especially to the slack percentage. Based on
this observation, we have developed an MDP based heuristic approach to the problem. The
MDP based heuristic achieved better results than the two naive heuristics, and unlike the
MDP optimal policy, it is generated at relative ease.
25
The MDP approach, which was used in this research, is quite flexible and enables
addressing different variations of the problem. However, the approach suffers from the curse
of dimensionality, i.e., as the problem becomes more detailed (with less assumptions), the
number of possible system’s states becomes very large and causes complexity difficulties.
Therefore, we recommend that further research will be conducted on the basis of a
reinforcement learning approach. In reinforcement learning, an optimal solution is
not searched for. Instead, the decision making policy is constantly improved based on the
results of decisions that were made in the past. Using this approach will allow relaxing some
of the assumptions adopted here. In particular, the order arrival rate does not have to follow
the Poisson distribution. Additionally, an environment with multiple pickers and orders of
multiple items with different due dates, could be easily addressed.
References
Armstrong, R.D., Cook, W.D., and Shaipe, A.L., Optimal batching in a semi-automated order
picking system, (1979). Journal of the Operational Research Society, 30(8), 711-720.
Elsayed, E.A., and Lee, M.-K., Order processing in automated storage/retrieval systems with
due dates, (1996). IIE Transactions, 28 (7), 567-577.
Gademann, N. and Van de velde, S., (2005). Order batching to minimize total travel time in a
parallel-aisle warehouse, IIE Transactions, 37, 63-75.
Gibson, D.R., and Sharp, G.P., Order batching procedures (1992). European Journal of
Operational Research, 58, 57-67.
Hall, R.W., Distance approximations for routing manual pickers in a warehouse, (1993). IIE
Transactions, 25 (4), 76-87.
Hane, C.C., and Laih, Y.W., A clustering algorithm for item assignment in a synchronized
zone order picking system, (2005). European Journal of Operational Research, 166, 489-496.
Jane, C.-C., Storage location assignment in a distribution center, (2000). International Journal
of Physical Distribution & Logistics Management, 30 (1), 55-71.
Petersen, C.G., and Aase, G., A comparison of picking, storage, and routing policies in
manual order picking, (2004). Int. J. Production Economics, 92, 11-19.
Puterman M. L., 1994. Markov decision process: discrete stochastic dynamic programming,
Wiley series in probability and mathematical statistics.
Roodbergen, K.J., and De Koster, R., Routing order pickers in a warehouse with a middle
aisle, (2001). European Journal of Operational Research, 133, 32-43.
26

Optimizing A Dynamic Order-Picking Process: Yossi Bukchin, Eugene Khmelnitsky, Pini Yakuel

Uploaded by

Copyright:

Available Formats

Optimizing A Dynamic Order-Picking Process: Yossi Bukchin, Eugene Khmelnitsky, Pini Yakuel

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Optimizing A Dynamic Order-Picking Process: Yossi Bukchin, Eugene Khmelnitsky, Pini Yakuel

Uploaded by

Copyright:

Available Formats

Optimizing a Dynamic Order-Picking Process

Yossi Bukchin, Eugene Khmelnitsky, Pini Yakuel

3. The solution approach

⎪ −cT ∑ max(T (n) − γ i , 0) a = a2 , s = {n, (γ 1 , K , γ n )}

and the overtime penalty value at time N is

where, P = e − λT ( n ) × (λT (n)) ×

Figure 1. Arrivals during picking tour.

⎧1 s ∈S '| n > 0, j ∈Δ, j = (T(n) −1,T(n)), a = a2 , N − t < T (n)

The Backwards induction algorithm

2. Substitute t-1 for t and compute ut* (st ) for each s t ∈ S by

Set As*t ,t = arg max{rt ( st , a ) + ∑ Pt ( j | st , a )u t*+1 ( j )}

3. If t = 1 , stop. Otherwise return to step 2.

Figure 2. An example of an optimal policy table

Figure 3. The two major effects identified in the optimal policy.

Table1 . The values of the problem parameters in the main experiments.

4.3. Analysis of results

MDP MDP Green Slack

Table 3. Absolute value improvement versus relative improvement

A full slack usage steady state vector

Table 4: The 25 experiments sorted by the slack percentage.

T(1) T(1) T(1)

Figure 6. The position of the three green holes.

⎧0 if 0 ≤ slack percentage ≤ 0.36

Step 5: (set the tail)

Table 5. Tail parameter determination.

Table 6. Experimental results in the second session.

Slack MDP-H vs best MDP-H vs

MDP heuristic versus the two naive heuristics

Sensitivity to different working day lengths

Figure 10. The reciprocal relationship between distance and cost.

You might also like

Set Ast ,t = arg max{rt ( st , a ) + ∑ Pt ( j | st , a )u t+1 ( j )}