2.1. Markov Decision Process
The method we present in this paper is based on a Markov decision process (MDP). An MDP is a stochastic control process that can be seen as an extension of a Markov chain, adding actions and rewards [8]. The MDP can be described as a 5-tuple $(S, A, P, R, \gamma)$, where $S$ is a set of states, $A$ a set of actions, $P$ the transition probabilities between states, given actions, $R$ a real-valued reward (or penalty) function that assigns a reward (or penalty) to any given state, and $\gamma$ a discount factor. As the name suggests, this process assumes the Markov property; therefore, the effects of an action taken in a state depend only on that state and not on the prior history of the process. An example of a Markov decision process is presented in Figure 1.
In the present framework, the set of states $S$ contains a finite number of states; in the example in Figure 1, six states are shown. (Infinite sets of states are possible in the framework of Markov decision processes. For more information about the mathematical concept, please refer to the literature, e.g., [8].) Each state can be described by one or multiple properties. These can be, e.g., a location (distance from some fixed point), the reward given in the respective state, or, in the case of offshore wind farm maintenance, the status of the turbine, a sea state observation, or the time needed to complete a repair. Each state differs from all other states in at least one characteristic, so no duplicates exist. The actions in the MDP can be either deterministic or stochastic. A deterministic action leads to a (fixed) new state that the process will continue in after the current state. A stochastic action specifies a probability distribution over the next states. The transition probabilities between states depend on the action undertaken in that state and specify the new state, subject to that action. Therefore, for each state and possible action in that state, there is at least one positive transition probability to another state. For each state and action, the transition probabilities sum to one. A deterministic action is a special case of a stochastic action, with exactly one positive transition probability equal to one. The example in Figure 1 includes two stochastic actions and the associated transition probabilities. The reward function is a real-valued function, assigning a value to each state and action combination. When a negative value is assigned by the reward function, it is often called a penalty function instead. In the example in Figure 1, each of the six states has one of two reward values, namely 1 and 0.
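As an illustration of these definitions, the following Python sketch encodes a small MDP as plain data structures. The state and action names, transition probabilities, rewards, and discount factor are placeholders chosen for this sketch, not the values of the example in Figure 1.

```python
# Minimal sketch of an MDP as plain data structures (illustrative values only).
# P[s][a] maps successor states to probabilities; R[s] is the reward in state s.

states = ["s1", "s2", "s3"]
actions = ["a1", "a2"]

P = {
    "s1": {"a1": {"s2": 1.0},                # deterministic action: one successor, probability 1
           "a2": {"s2": 0.3, "s3": 0.7}},    # stochastic action: distribution over successors
    "s2": {"a1": {"s1": 0.5, "s3": 0.5},
           "a2": {"s3": 1.0}},
    "s3": {"a1": {"s3": 1.0},
           "a2": {"s1": 1.0}},
}
R = {"s1": 0.0, "s2": 1.0, "s3": 0.0}
gamma = 0.9

# Sanity check: for each state and action, the transition probabilities sum to one.
for s in states:
    for a in actions:
        assert abs(sum(P[s][a].values()) - 1.0) < 1e-9
```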
In addition to the Markov decision process that describes how the system works, our setup contains a set of policies $\Pi$. A policy $\pi \in \Pi$ is a mapping from $S$ to $A$ and can be understood as a decision maker's rule for choosing one of the possible actions $a \in A$ in each state. In order to follow a policy, one must (a) determine the current state $s$, (b) determine the action $\pi(s)$ to be executed in that state, (c) determine the new state $s'$ and continue, alternating (b) and (c). The goal of using an MDP is of course to find an optimal (or at least better than existing) maintenance strategy. In the framework of the MDP, this is done by finding an optimal policy $\pi^*$. In order to evaluate a policy (and ultimately to find the optimal policy), it is necessary to determine the expectation of the total reward gained by following it. Intuitively, one could try to sum all rewards obtained in the MDP when following the policy, but this can quickly become overwhelming. (Typically, summing all rewards will yield an infinite sum, namely for all MDPs with either an infinite state space or an infinite horizon. For more information about these cases, refer to, e.g., [8].) The solution is to use an objective function that maps the sequence of rewards to a (single, real) utility value. Options to obtain an objective function are (1) setting a finite horizon, (2) using discounting to favour earlier rewards over later rewards, and (3) averaging the reward rate in the limit.
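For illustration, the following sketch follows a given policy by simulation and accumulates the discounted return, corresponding to option (2) above. The small MDP, the policy, and the number of simulated steps are assumptions made for this sketch only.

```python
import random

# Tiny illustrative MDP (placeholder values, not those of Figure 1).
P = {"s1": {"a1": {"s1": 0.2, "s2": 0.8}},
     "s2": {"a1": {"s1": 1.0}}}
R = {"s1": 0.0, "s2": 1.0}
gamma = 0.9
policy = {"s1": "a1", "s2": "a1"}   # a fixed rule: one action per state

def discounted_return(start, policy, P, R, gamma, n_steps=1000):
    """Follow the policy from 'start' and accumulate the discounted reward."""
    s, total, discount = start, 0.0, 1.0
    for _ in range(n_steps):
        total += discount * R[s]                      # reward collected in the current state
        a = policy[s]                                 # (b) action prescribed by the policy
        nxt, probs = zip(*P[s][a].items())
        s = random.choices(nxt, weights=probs)[0]     # (c) sample the next state
        discount *= gamma                             # discounting favours earlier rewards
    return total

print(discounted_return("s1", policy, P, R, gamma))
```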
Instead of optimizing the policy, in some cases it might be desirable to compare different policies with each other. When combining an MDP with a fixed policy that chooses exactly one action for each state, the result is a Markov chain. This is because all of the actions are defined by the policy and one is left with the transition probabilities between states. One example of a resulting Markov chain is visualized in Figure 2. In this Markov chain, the value $V(s)$ of each state $s$ can be calculated based on the reward $R(s)$ of that state and on the values of the states that can be reached from it. It is calculated as
$$V(s) = R(s) + \gamma \sum_{s' \in S} P_{\pi(s)}(s, s')\, V(s') \quad \text{for all } s \in S, \qquad (1)$$
where $P_{\pi(s)}(s, s')$ is the transition probability from state $s$ to state $s'$ under the action prescribed by $\pi$. The equations in (1) are known as Bellman equations, named after Richard Bellman. We can solve the linear equation system (LES) defined by the transition probabilities and the reward function to find the value $V(s)$ of each state. When comparing two policies, one can look up the value of a specific state one is interested in, usually a ‘starting’ point. In the case of OWF maintenance, this could, e.g., be a state in which a failure occurs; the value would then be representative of the time it takes for this failure to be corrected, with a penalty incurred for each step taken without resolving the failure. A case study comparing different policies is presented in Section 4.
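As a sketch of how the linear equation system (1) can be solved in practice, the following snippet uses NumPy to evaluate a fixed policy on a small Markov chain. The transition matrix, rewards, and discount factor are hypothetical values chosen for this sketch, not those of the example in the figures.

```python
import numpy as np

# Markov chain induced by a fixed policy (illustrative values).
# P_pi[i, j] = probability of moving from state i to state j under the policy.
P_pi = np.array([[0.0, 0.8, 0.2],
                 [0.5, 0.0, 0.5],
                 [0.0, 0.0, 1.0]])
R = np.array([0.0, 1.0, 0.0])   # reward collected in each state
gamma = 0.9                     # discount factor

# Bellman equations (1) in matrix form: V = R + gamma * P_pi @ V,
# rearranged into the linear system (I - gamma * P_pi) V = R.
V = np.linalg.solve(np.eye(len(R)) - gamma * P_pi, R)
print(V)   # value of each state under this policy
```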
In the example shown in Figure 1, a possible policy would be to always choose the same one of the two actions. The corresponding Markov chain is presented on the right-hand side of the figure in the form of its transition probabilities. The rewards (presented in the figure in blue next to the states) are either 1 or 0, as noted above. In order to calculate the value of each of the states, we solve the equation system defined by the Bellman Equation (1); solving this system yields the value of each of the six states.
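To make the step from Equation (1) to a concrete linear equation system explicit, the following spelled-out system corresponds to the small hypothetical three-state chain used in the NumPy sketch above (not to the chain of the figures):
$$\begin{aligned}
V(s_1) &= 0 + \gamma\,\bigl(0.8\, V(s_2) + 0.2\, V(s_3)\bigr),\\
V(s_2) &= 1 + \gamma\,\bigl(0.5\, V(s_1) + 0.5\, V(s_3)\bigr),\\
V(s_3) &= 0 + \gamma\, V(s_3).
\end{aligned}$$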
If one is interested in comparing the value of a specific state under two different policies, the calculation is repeated for the second policy and the values are compared. It is also possible to find the optimal policy without comparing the values based on the resulting Markov chains. In order to find the optimal policy, we define the optimal value function $V^*$ by the recursive set of equations
$$V^*(s) = R(s) + \gamma \max_{a \in A} \sum_{s' \in S} P_a(s, s')\, V^*(s') \quad \text{for all } s \in S,$$
so the optimal value of a state $s$ is the reward in that state, plus the (discounted) maximum over all actions we could take in the state. This is the generalized form of the Bellman Equation (1), which was stated for a fixed policy. In the example shown in Figure 1, the possible actions are the two actions shown there. The idea behind this maximum is that in every state we aim to choose the action that maximizes the value of the future. The optimal value function $V^*$ can be found, e.g., by value iteration. When $V^*$ is known, the optimal policy $\pi^*$ can be found by picking, in each state, the action that maximizes the expected optimal value:
$$\pi^*(s) = \underset{a \in A}{\arg\max}\ \sum_{s' \in S} P_a(s, s')\, V^*(s').$$
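As a hedged illustration of value iteration and of extracting the optimal policy from $V^*$, the following Python sketch iterates the optimality equations above on a small MDP. All state and action names and numerical values are assumptions made for this sketch.

```python
import numpy as np

# Hypothetical MDP: 3 states, 2 actions (illustrative values only).
# P[a][i, j] = probability of moving from state i to state j when taking action a.
P = {
    "a1": np.array([[0.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0],
                    [1.0, 0.0, 0.0]]),
    "a2": np.array([[0.5, 0.5, 0.0],
                    [0.0, 0.5, 0.5],
                    [0.5, 0.0, 0.5]]),
}
R = np.array([0.0, 0.0, 1.0])   # reward collected in each state
gamma = 0.9

# Value iteration: repeatedly apply the optimality equation until convergence.
V = np.zeros(len(R))
for _ in range(1000):
    V_new = R + gamma * np.max([P[a] @ V for a in P], axis=0)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

# Policy extraction: in each state, pick the action maximizing the expected optimal value.
actions = list(P)
policy = [actions[i] for i in np.argmax([P[a] @ V for a in P], axis=0)]
print(V, policy)
```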
In the example shown in Figure 1, the optimal policy conducts one of the two actions in some of the states and the other action in the remaining ones. Figure 2 shows the Markov chain corresponding to this optimal policy.