1 Introduction

Today’s organizations typically need to monitor their processes to guarantee that they are executed within given boundaries. These boundaries can be set internally, e.g., by process managers, to enhance operational efficiency, or can be derived from external legal requirements such as the Sarbanes-Oxley Act [1]. As a result, organizations often define execution procedures to ensure that their processes meet these constraints. Deviating from these procedures can expose organizations to abuse and fraud.

However, procedures are often not enforced by design or can be bypassed in order to ensure business continuity [3]. In fact, preventive systems are typically too rigid to deal with real-world, dynamic environments in which unpredictable circumstances and exceptions often arise. In these settings, a crucial challenge for organizations is to predict whether a given (set of) undesired behavior(s) will or will not occur in running process executions, so that actions can be taken in time to prevent or mitigate potential risks.

To address this issue, one can exploit predictive process monitoring [15]. This comprises a family of techniques aimed at predicting the “outcome” of running process executions, which in our context corresponds to the occurrence of undesired behaviors. One of the main challenges in predictive monitoring is balancing the reliability and effectiveness of predictions. On the one hand, the predictive system should raise an alarm only when there is “enough” evidence of the forthcoming occurrence of an undesired behavior. High false-alarm rates significantly hamper the usability of such systems: the analyst has to spend a large amount of time verifying the alerts, which ultimately leads to a loss of trust in the system. On the other hand, a system that provides late predictions, or that fails to provide a prediction in most cases, is of little or no use.

Predictive monitoring approaches often allow an analyst to determine the best trade-off between these two forces by acting on a set of metrics that the analyst can customize to her needs. Two commonly used metrics are the support and confidence of predictions. The first metric accounts for the amount of history corresponding to the current state of the process execution and is used to ensure that the prediction is backed by enough evidence. Varying the support threshold impacts the effectiveness of the predictive system: the higher the threshold, the more evidence is required to make a prediction. The confidence is instead used to evaluate to what extent the predictive system was able to provide a correct prediction when the given state of execution occurred in the past. This metric impacts both the effectiveness and the reliability of the system; requiring high confidence leads to generating predictions only for executions for which high-quality predictions have been obtained in the past, thus reducing the number of false positives and false negatives.

However, previous work typically does not allow customizing the sensitivity of the prediction; namely, the system returns the outcome that it estimates to be the most likely, without accounting for the “gap” between the probabilities of occurrence and non-occurrence of a behavior. In contrast, different contexts may require different levels of sensitivity. For example, in cases where reacting to an alarm is difficult and/or costly, the analyst might be willing to take the risk of waiting until she is sure that an action is actually necessary. Ideally, this can be achieved by setting a low sensitivity for the prediction, that is, configuring the system so that it raises an alarm only when the probability of occurrence is sufficiently higher than the probability of non-occurrence. This, however, requires changing the way predictions are usually computed, which is not trivial and not always possible in existing approaches.

To deal with this challenge, in this work we perform an exploratory study on the application of Subjective Logic [12, 20] in the context of predictive process monitoring. Subjective Logic is an evidence-based opinion algebra used to evaluate the belief that a given proposition is true or false, explicitly modeling the uncertainty involved in generating a prediction. By treating the occurrence of a given behavior as a proposition and past process executions as evidence supporting or contradicting it, Subjective Logic can be used to determine the likelihood that this behavior will or will not occur, or whether there is not enough evidence to make a prediction, thus providing a sound and rigorous method to deal with uncertainty.

Building upon Subjective Logic, we introduce a novel prediction approach that allows analysts to customize the reliability, effectiveness, and sensitivity of predictions. We developed a proof-of-concept implementation and tested it on a synthetic dataset to evaluate the validity of the approach and to perform a first assessment of its performance. Results show that our approach is comparable to existing techniques in terms of prediction quality, while providing overall better effectiveness, as it is able to make a prediction for a higher number of samples than the tested competitor.

The remainder of the paper is organized as follows. Section 2 introduces a running example that is used throughout the paper. Section 3 describes our approach. Section 4 presents an evaluation of the approach along with a comparison with a well-known predictive monitoring approach. Finally, Sect. 5 discusses related work and draws conclusions.

2 Running Example

Consider, as a running example, a loan management process derived from previous work on the event log of a financial institute made available for the BPI2012 challenge [2, 11]. Figure 1 shows the process in Petri net notation. Places are graphically represented by circles and transitions by boxes. Labels below the transitions report the activity names, whereas labels inside the transitions report the corresponding acronyms. Black boxes represent invisible transitions, i.e., transitions that are not recorded by the information system and are mainly used for routing purposes.

Fig. 1. Loan management process

The process starts with the submission of an application. Then, the application passes through a first assessment, aimed at verifying whether the applicant meets the requirements. If the requested amount is greater than 10000 euros, the application also goes through a more accurate analysis to detect possible frauds. If the application is not eligible, the process ends; otherwise, the application is accepted. An offer to be sent to the customer is selected and the details of the application are finalized. After the offer has been created and sent to the customer, the latter is contacted to check whether she intends to accept the offer. If this is not the case, the offer is renegotiated and a new offer is sent to the customer. At the end of the negotiation, the agreed application is registered in the system. At this point, further checks can be performed on the application, if needed, before approving it.

Let us assume that two deviating behaviors are allowed in our scenario:

  • Delaying the completion of fraud checking. Since fraud checking is usually a time-consuming activity, in some cases users can execute other tasks of the process before the check has finished.

  • Resuming declined applications. In some cases, a previously rejected application can be resumed, e.g., when the current salary of the customer does not provide enough guarantees for the requested loan amount, but the customer claims that he expects it to be increased. To speed up the process, employees can decide to (temporarily) reject the application, wait for the customer’s salary to be raised, and reuse the previous application, without restarting the process from scratch.

Although these deviating behaviors might be considered acceptable practices, they pave the way to possible abuses. We report below some executions of the process in Fig. 1 in which these behaviors occur:

\( \sigma _1= \langle S, AS, WFC_s , WFA_s , WFA_e , AA, AF, OS, OC, OSE, WCC_s , WCC_e , WFC_e , AR, AAP \rangle \)

\( \sigma _2 = \langle S, AS, WFA_s , WFA_e , AD, AA, AF, OS, OC, OSE, WCC_s , WCC_e , AR, AAP \rangle \)

\( \sigma _3 = \langle S, AS, WFA_s , WFA_e , AA, AF, OS, OC, OSE, WCC_s , WCC_e , OC, OSE, WCC_s , WCC_e , AR, AAP \rangle \)

The first process execution shows that fraud checking was completed (\( WFC_e \)) only after the offer had been sent to the customer (\( OSE \)). This is clearly undesired. In fact, interrupting an application at this point is costly and likely leads to a loss of the customer’s trust. As a result, the employee performing the fraud check might be more inclined to accept some risks and allow the process to proceed.

Process execution \(\sigma _2\) shows the management of a declined application (\( AD \)) that was resumed and then approved (\( AAP \)) without further assessment. Although resuming rejected applications is acceptable, it is highly advisable to perform further assessments on the application before the final approval, in order to verify whether the issues that led to the initial rejection have been solved.

Compliant behaviors, too, might hide possible threats when misused. As an example, \(\sigma _3\) represents multiple repetitions of the application negotiation with the customer, followed by an approval (\( AAP \)) without any further assessment. This behavior might signal that, when the negotiation takes a long time, the application is approved as soon as an agreement is reached, without further assessment. An insider might exploit this practice to his own advantage to obtain desirable offers that would not be approved otherwise.

It is worth noting that neglecting the sensitivity of predictions in this kind of scenario can easily lead to a large number of alerts. For example, if in past process executions the completion of the fraud assessment was delayed for slightly more applications (within a given range of amounts) than those for which the check was performed in time, the predictive system may raise an alert for every application with a similar amount. However, it is clearly neither convenient nor feasible to perform additional checks on every application based only on the requested amount, unless it is much more likely that a fraud will occur for applications with given amounts.

3 Approach

The goal of this work is to devise an approach to predict the occurrence of critical behaviors in a running process execution. We assume that the behavior to be monitored is modeled through a set of patterns representing (portions of) process behavior an analyst is interested in.

Our approach, depicted in Fig. 2, follows traditional machine-learning-based approaches to predictive process monitoring and consists of an off-line (training) phase and an on-line (prediction) phase. A characterizing aspect of this work is the use of Subjective Logic [12] to assess the quality of predictions. Subjective Logic is an opinion algebra that allows assessing the probability that a given pattern occurs, explicitly accounting for uncertainty based on the amount of available evidence. The following sections detail the steps of our approach.

Fig. 2. Approach

3.1 Data Preprocessing

Process executions are usually recorded in event logs. To build the predictive model, the log has to be preprocessed into a format suitable for the analysis. Below, we first formally define event logs and then present the preprocessing steps.

Definition 1 (Event, Event Trace, Event Log)

Let A be the set of process activities and V the set of data attributes. Given an attribute \(v \in V\), U(v) denotes the domain of v and \(\mathcal {U}=\cup _{v \in V} U(v)\) the union of all attribute domains. An event \(e = (a,\varphi _{e})\) consists of an activity \(a\in A\) and a function \(\varphi _{e}\) that assigns values to attributes \(V_e\subseteq V\): \(\varphi _{e} : V_e \rightarrow \mathcal {U}\) such that, for all v occurring in \(\varphi _{e}\), \(\varphi _{e}(v)\in U(v)\). The set of events is denoted by \(\mathcal {E}\). An event trace \(\sigma \in \mathcal {E}^{*}\) is a sequence of events. An event log \(\mathcal {L}\in \mathbb {B}(\mathcal {E^{*}})\) is a multiset of event traces.

Given an event \(e=(a,\varphi _{e})\), we use act(e) to denote the activity label associated with e, i.e., \(act(e)=a\). This notation extends to event traces. Given an event trace \(\sigma =\langle e_1,\dots ,e_n\rangle \in \mathcal {E}^{*}\), \(act(\sigma )\) denotes the sequence of activities obtained by projecting the events in \(\sigma \) onto their activity labels, i.e., \(act(\sigma )=\langle act(e_1),\dots ,act(e_n)\rangle \).
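For illustration only, the following Python sketch (the representation and names are our own, not part of the formal framework) shows one possible encoding of events, traces, and logs, together with the projection \(act(\sigma )\):

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union

Value = Union[str, int, float]

@dataclass
class Event:
    """An event: an activity label plus a partial assignment of data attributes."""
    activity: str
    attributes: Dict[str, Value] = field(default_factory=dict)

# An event trace is a sequence of events; an event log is a multiset of traces,
# represented here simply as a list of traces (duplicates model the multiset).
Trace = List[Event]
Log = List[Trace]

def act(trace: Trace) -> List[str]:
    """Project a trace onto its sequence of activity labels, i.e. act(sigma)."""
    return [e.activity for e in trace]

# A fragment of sigma_2 from the running example, with a hypothetical amount value
sigma_2 = [Event("S", {"amount": 9500}), Event("AS"), Event("AD")]
print(act(sigma_2))  # ['S', 'AS', 'AD']
```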

To build and train a predictive model, we label the event log by indicating which patterns occurred in each trace. The problem of determining whether a pattern occurred in a process execution can be modeled as a compliance checking problem [10]. In this work, we model the patterns of interest as data Petri nets, since several well-known techniques exist to detect the occurrence of this kind of pattern in a process execution (see, e.g., [3, 18]). Note, however, that our approach does not pose any constraint on the choice of the pattern formalism, nor on the technique employed to detect the patterns.

Figure 3 shows four patterns representing undesired behaviors for the process in Fig. 1: delayed fraud check, application resuming violation, multiple negotiations and old application resuming. Inspired by [11], we use \(\omega \) transitions as placeholders to specify that at a given point of the process execution any activity can be executed.

The first three patterns represent the behaviors already discussed in Sect. 2. The last one is a variant of the application resuming pattern in which the time between the rejection and the resuming of an application is also constrained. The idea behind this pattern is that resuming applications that are too old might introduce some risk, since the information initially provided by the applicant on his financial situation might have become outdated.

Once past process executions have been labelled, we apply data discretization to the labelled data. To generate accurate predictions, we need to take into account those data attributes that are related to the patterns of interest. This is, however, far from trivial. Especially when dealing with numerical attributes, it is not feasible to consider all values an attribute has assumed/can assume in the trace; therefore, we need an effective strategy to discretize the attribute domain into a finite set of intervals.

Fig. 3. Patterns of undesired behaviors for the loan management process

To discretize continuous data, we resort to supervised discretization. This approach discretizes continuous variables by taking into account the class values, i.e., it selects discretization intervals that best discriminate between positive and negative classes, in our case between the occurrence and the non-occurrence of a pattern. The approach orders the numerical values of each continuous variable in the training set and selects the split point that produces the highest information gain, i.e., the amount of information gained by knowing the value of the attribute, to build the discretization intervals.
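As a minimal, single-attribute illustration of this criterion (our own sketch; the approach itself relies on the decision-tree-based discretization described next), the split point with the highest information gain can be computed as follows:

```python
import math
from collections import Counter
from typing import List, Tuple

def entropy(labels: List[int]) -> float:
    """Shannon entropy of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def best_split(values: List[float], labels: List[int]) -> Tuple[float, float]:
    """Return the split point with the highest information gain for one attribute."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_gain, best_point = 0.0, None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no split between identical values
        point = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for _, l in pairs[:i]]
        right = [l for _, l in pairs[i:]]
        gain = base - (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if gain > best_gain:
            best_gain, best_point = gain, point
    return best_point, best_gain

# Toy example: amounts labelled with fraud-pattern occurrence (1) / non-occurrence (0)
amounts = [5000, 8000, 9500, 10071, 12000, 15000]
occurred = [0, 0, 0, 1, 1, 1]
print(best_split(amounts, occurred))  # split close to 10000, with maximal gain
```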

However, since the conditions that discriminate between positive and negative classes do not depend on a single variable only, we leverage the supervised discretization provided by the decision tree algorithm [17]. To build the tree, this algorithm selects both the variable to be split and the value to be used for the splitting by maximizing the resulting information gain. When inspecting the tree, only the root-to-leaf paths ending in a leaf with sufficiently high confidence and support, i.e., above user-defined thresholds, are used to extract the discretization intervals. The retrieved intervals thus depend on the classes they are supposed to discriminate, i.e., on the occurrence of a specific pattern. Alternatively, multilabel classification can be leveraged to retrieve discretization intervals that globally discriminate on all patterns together. We tested both discretization strategies in our experiments (Sect. 4), to check whether the increased efficiency of the global discretization allows for obtaining predictions as accurate as those obtained with separate discretization intervals for each pattern (single pattern).
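The following sketch illustrates one possible implementation of this path-based interval extraction, assuming a scikit-learn DecisionTreeClassifier rather than the specific decision-tree implementation used in our experiments; the variable and parameter names are our own:

```python
from sklearn.tree import DecisionTreeClassifier

def extract_intervals(tree: DecisionTreeClassifier, feature_names,
                      min_confidence=0.8, min_support=50):
    """Collect the split thresholds found on root-to-leaf paths whose leaves have
    sufficiently high confidence and support; the thresholds delimit the
    discretization intervals of each attribute."""
    t = tree.tree_
    thresholds = {name: set() for name in feature_names}

    def walk(node, path):
        if t.children_left[node] == -1:                     # leaf node
            class_counts = t.value[node][0]
            confidence = class_counts.max() / class_counts.sum()
            support = t.n_node_samples[node]
            if confidence >= min_confidence and support >= min_support:
                for feat, thr in path:                      # keep the path's splits
                    thresholds[feat].add(round(float(thr), 2))
            return
        feat = feature_names[t.feature[node]]
        walk(t.children_left[node], path + [(feat, t.threshold[node])])
        walk(t.children_right[node], path + [(feat, t.threshold[node])])

    walk(0, [])
    return thresholds

# Hypothetical usage: X contains encoded trace prefixes, y the pattern labels.
# clf = DecisionTreeClassifier(criterion="entropy").fit(X, y)
# print(extract_intervals(clf, ["amount", "duration"]))
```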

3.2 Training

This step takes as input (i) a (preprocessed) event log and (ii) the set of patterns the analyst wants to predict, and returns a prediction model representing, for each pattern, the likelihood that the pattern occurs in each “state” of the process. Roughly speaking, the state of a process represents its execution at a given time, i.e., the performed activities along with the values of the data attributes. As in [4], we define the state of a process execution as follows:

Definition 2 (State)

Let \(\mathcal {A}\) be the set of activities, V the set of attributes and \(\mathcal {U}\) the attributes’ domain. A state s for an event trace \(\sigma \) is a pair \((act(\sigma ), \varphi _s)\) where \(\varphi _s\) is a function that associates a value to each attribute, i.e. \(\varphi _s : V \rightarrow \mathcal {U} \cup \{\bot \}\) such that for all \(v \in V\), \(\varphi _s(v) \in U(v)\cup \{\bot \}\) (where \(\bot \) indicates undefined). The initial state is denoted \(s_I = (\langle \rangle , \varphi _I )\) where \(\varphi _I\) is the initial assignment of values to attributes.

The initial attribute assignment \(\varphi _I\) represents the values of the attributes in V before the process is executed. For some attributes, a value may not be (initially) defined and is thus considered undefined (\(\bot \)). The execution of activities changes the state of the process. We first define the notion of state transition, i.e., the change from one state to another due to the occurrence of an event, and then extend this definition to event traces.

Definition 3 (State Transition)

Let V be the set of attributes. Given an event \(e=(a,\varphi _e)\) and a state \(s=(act(\sigma ),\varphi _s)\) for an event trace \(\sigma \), e transforms s into a state \(s'=(act(\sigma '),\varphi _{s'})\) such that \(\sigma '=\sigma \oplus e\) and for every \(v\in V\)

$$\begin{aligned} \varphi _{s'}(v)= \left\{ \begin{array}{ll} \varphi _{e}(v) &{} \text {if } v \in dom(\varphi _{e})\\ \varphi _{s}(v) &{} \text {otherwise} \end{array} \right. \end{aligned}$$
(1)

We denote \(s \xrightarrow {e} s'\) the state transition given by e.

Intuitively, Eq. 1 states that the occurrence of an event updates the data attributes associated with the event (i.e., the attributes \(v \in dom(\varphi _{e})\)), while the other attributes in \(\varphi _{s}\) remain unchanged.

Definition 4 (Trace Execution)

Given an event trace \(\sigma =\langle e_{1},...,e_{n} \rangle \in \mathcal {E}^*\), \(\sigma \) transforms the initial state \(s_I\) into a state s if there exist states \(s_0, s_1,\dots ,s_n\) such that

$$ s_I = s_0\xrightarrow {e_{1}} s_1 \xrightarrow {e_{2}} \dots \xrightarrow {e_{n}} s_n =s $$

We denote \(\mathsf{state}(\sigma )\) the state yielded by an event trace \(\sigma \).
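To make Definitions 3 and 4 concrete, the sketch below (our own; it assumes events are given as (activity, attributes) pairs and uses None for the undefined value \(\bot \)) implements the state update of Eq. 1 and the replay yielding \(\mathsf{state}(\sigma )\):

```python
from typing import Dict, List, Optional, Tuple, Union

Value = Union[str, int, float]
# A state: the sequence of executed activities plus the current attribute assignment;
# attributes not yet defined map to None (the "undefined" value).
State = Tuple[Tuple[str, ...], Dict[str, Optional[Value]]]

def apply_event(state: State, activity: str, attrs: Dict[str, Value]) -> State:
    """State transition of Eq. 1: event attributes overwrite, the rest is kept."""
    activities, assignment = state
    new_assignment = dict(assignment)
    new_assignment.update(attrs)
    return activities + (activity,), new_assignment

def replay(trace: List[Tuple[str, Dict[str, Value]]],
           initial: Dict[str, Optional[Value]]) -> State:
    """Compute state(sigma) by folding the events over the initial state s_I."""
    state: State = ((), dict(initial))
    for activity, attrs in trace:
        state = apply_event(state, activity, attrs)
    return state

# Example: a prefix of sigma_2 with a hypothetical 'amount' attribute
prefix = [("S", {"amount": 9500}), ("AS", {}), ("AD", {})]
print(replay(prefix, {"amount": None, "duration": None}))
# (('S', 'AS', 'AD'), {'amount': 9500, 'duration': None})
```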

Missing data attributes introduce uncertainty on the reached state as different states could have been reached. To deal with missing values in an event trace, we adopt the notion of state subsumption from [4]. State subsumption is used to determine the possible states of the process that could have been yielded by an event trace.

| Executed activities | amount | duration | \(\dots \) | Pattern | # |
| --- | --- | --- | --- | --- | --- |
| \(\langle \rangle \) | \(A_1\) | \(D_1\) | \(\dots \) | \(\pi _1\) | 5 |
|  |  |  |  | \(\pi _2\) | 0 |
| \(\langle S\rangle \) | \(A_1\) | \(D_2\) | \(\dots \) | \(\pi _1\) | 5 |
|  |  |  |  | \(\pi _2\) | 0 |
| \(\langle S, AS\rangle \) | \(A_1\) | \(D_3\) | \(\dots \) | \(\pi _1\) | 5 |
|  |  |  |  | \(\pi _2\) | 0 |
| \(\langle S, AS, WFA_s \rangle \) | \(A_1\) | \(D_4\) | \(\dots \) | \(\pi _1\) | 3 |
|  |  |  |  | \(\pi _2\) | 0 |
| \(\dots \) | \(\dots \) | \(\dots \) | \(\dots \) | \(\dots \) | \(\dots \) |

Definition 5 (State Subsumption)

Given two states \(s=(r_s,\varphi _s)\) and \(s'=(r_{s'},\varphi _{s'})\), we say that s subsumes \(s'\), denoted \(s\succ s'\), if and only if (i) \(r_s=r_{s'}\) and (ii) for all \(v\in V\) s.t. \(\varphi _s(v)\ne \bot \) , \(\varphi _{s'}(v)=\varphi _{s}(v)\).
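Under the same state representation as in the earlier sketch, Definition 5 translates directly into a membership test (again, an illustrative sketch rather than the actual implementation):

```python
from typing import Dict, Optional, Tuple, Union

Value = Union[str, int, float]
State = Tuple[Tuple[str, ...], Dict[str, Optional[Value]]]

def subsumes(s: State, s_prime: State) -> bool:
    """Definition 5: s subsumes s' iff the activity sequences coincide and every
    attribute defined in s has the same value in s' (None plays the role of bottom)."""
    (acts, phi), (acts_p, phi_p) = s, s_prime
    if acts != acts_p:
        return False
    return all(phi_p.get(v) == val for v, val in phi.items() if val is not None)

# A state with an undefined 'duration' subsumes one in which it is defined
s = (("S", "AS"), {"amount": 9500, "duration": None})
s_prime = (("S", "AS"), {"amount": 9500, "duration": 12})
print(subsumes(s, s_prime))  # True
```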

The table above shows the prediction model obtained from (a portion of) \(\sigma _1\) after the attributes amount and duration have been discretized in the preprocessing step. States consist of the prefixes of the trace along with the attribute values (or discretization intervals) obtained after the partial process execution corresponding to the prefix. The occurrence of an event may or may not change the value of an attribute. For instance, the amount does not change during a process execution, whereas the duration of the process execution is updated after each event. The last two columns report the patterns of interest (\(\pi _1\) and \(\pi _2\) in the table) and the number of occurrences of these patterns in the historical logging data. Specifically, the latter is the number of traces in the historical logging data for which there exists a prefix that leads to a state subsumed by the given state and in which the pattern occurred.
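The construction of such a prediction model can be sketched as a map from states to evidence counts per pattern. The following Python fragment is our own illustration; it assumes the labelling step has already produced, for each training trace, a dictionary telling which patterns occurred:

```python
from collections import defaultdict
from typing import Dict, Optional, Tuple, Union

Value = Union[str, int, float]

def state_key(activities: Tuple[str, ...], assignment: Dict[str, Optional[Value]]):
    """Hashable representation of a state (activity sequence plus attribute values)."""
    return activities, tuple(sorted(assignment.items()))

def build_prediction_model(training_data, initial: Dict[str, Optional[Value]]):
    """training_data: list of (trace, labels) pairs; a trace is a list of
    (activity, attributes) pairs and labels is a dict {pattern: True/False}
    produced by the labelling step. Returns state -> pattern -> [pos, neg]."""
    model = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for trace, labels in training_data:
        activities, assignment = (), dict(initial)
        keys = {state_key(activities, assignment)}
        for activity, attrs in trace:
            activities += (activity,)
            assignment.update(attrs)                 # state update of Eq. 1
            keys.add(state_key(activities, assignment))
        for key in keys:                             # each trace counts once per state
            for pattern, occurred in labels.items():
                model[key][pattern][0 if occurred else 1] += 1
    return model

# Hypothetical usage with one labelled trace and one pattern pi_1
data = [([("S", {"amount": 9500}), ("AS", {})], {"pi_1": True})]
model = build_prediction_model(data, {"amount": None, "duration": None})
```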

3.3 Evidence-Enhanced Prediction

This step takes as input (i) the prediction model built in the previous step, (ii) the set of patterns to predict, and (iii) the (partial) trace(s) corresponding to the running process execution(s), and returns (for each trace) a prediction on the patterns of interest. Each prediction should account for the amount of evidence available for a given pattern in a given state. To this end, we employ principles of Subjective Logic [12]. This is an opinion algebra commonly used in the context of online communities, where users have to decide whether to interact with another user to achieve some goal. To assess the trust level between users who do not know each other beforehand, an opinion is computed for each user based on his past interactions with other users in the community. Opinions are defined as follows:

Definition 6

An opinion x about a proposition P is a tuple \(x=(x_b, x_d, x_u)\), where \(x_b\) represents the belief that P is provable (belief), \(x_d\) the belief that P is disprovable (disbelief), and \(x_u\) the belief that P is neither provable nor disprovable (uncertainty). The components of x satisfy \(x_b+x_d+x_u=1\).

Opinions are computed from evidence. Let p and n be the amounts of evidence that support and contradict the proposition, respectively. The opinion x on the proposition P is computed as follows:

$$\begin{aligned} x_b=\frac{p}{p+n+c}; \quad x_d=\frac{n}{p+n+c}; \quad x_u=\frac{c}{p+n+c} \end{aligned}$$
(2)

where \(c>0\) is a constant that represents the minimum amount of evidence required to form an opinion.

For our purposes, we formulate the proposition P as an assertion on the occurrence of a given pattern \(\pi \). In this setting, the evidence used to compute opinions about P is derived from the prediction model. More precisely, given a state, p is the number of traces leading to that state in which \(\pi \) occurred, whereas n is the number of traces leading to that state in which the pattern did not occur. Accordingly, \(x_b\) represents the belief that \(\pi \) will occur in a given state, \(x_d\) the disbelief that \(\pi \) will occur, and \(x_u\) the uncertainty of the computed opinion.
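In code, the mapping from evidence counts to an opinion (Eq. 2) is straightforward; the following sketch and its example numbers are purely illustrative:

```python
def opinion(p: int, n: int, c: float = 2.0):
    """Compute a Subjective Logic opinion (belief, disbelief, uncertainty) from
    p pieces of supporting and n pieces of contradicting evidence (Eq. 2)."""
    total = p + n + c
    return p / total, n / total, c / total

# 5 traces in which the pattern occurred, 12 in which it did not, c = 2
print(opinion(5, 12))  # (0.263..., 0.631..., 0.105...)
```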

Given an opinion x on the occurrence of pattern \(\pi \) for a trace \(\sigma \), we compute a prediction from x at run-time. In doing so, two main aspects have to be taken into account. First, we need to define a “proper” value for the minimum amount of evidence, which allows us to discard all predictions that are too uncertain. Second, we need to determine a suitable sensitivity for the prediction, since we want an alert to be raised only when we are “reasonably sure” that an undesired behavior is about to occur. In other words, we expect a positive answer only when the belief \(x_b\) is “reasonably larger” than the disbelief \(x_d\).

It is worth noting that the concrete choice of what constitutes a “reasonable” amount of minimum evidence and of how much \(x_b\) has to exceed \(x_d\) to obtain a positive answer is a domain-dependent decision. Therefore, we model these notions in the form of parameters, which can be set by the decision maker based on her needs and preferences.

More precisely, given an opinion \(x = (x_b, x_d, x_u)\) on the occurrence of a pattern \(\pi \) for a running execution \(\sigma \) w.r.t. a prediction model m, we compute the prediction as follows:

$$ pred(x)=\left\{ \begin{array}{ll} Unpredicted &{} \text {if } x_u> x_b \wedge x_u> x_d\\ Yes &{} \text {if } (x_u< x_b \vee x_u<x_d) \wedge x_b > \alpha \cdot x_d\\ No &{} \text {otherwise} \end{array} \right. $$

where \(\alpha \) is the sensitivity threshold (i.e., the required gap between belief and disbelief).
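The decision rule translates directly into code; the sketch below reuses the opinion helper from the previous fragment:

```python
def pred(x_b: float, x_d: float, x_u: float, alpha: float = 1.0) -> str:
    """Turn an opinion into a prediction, using alpha as sensitivity threshold."""
    if x_u > x_b and x_u > x_d:
        return "Unpredicted"            # not enough evidence to take a stance
    if (x_u < x_b or x_u < x_d) and x_b > alpha * x_d:
        return "Yes"                    # belief sufficiently exceeds disbelief
    return "No"

b, d, u = opinion(5, 12)                # opinion() from the sketch above
print(pred(b, d, u, alpha=1.5))         # 'No': disbelief dominates belief
```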

4 Experiments

This section describes the evaluation of our approach. In detail, we are interested in answering the following research questions:

RQ1: To what extent do the parameters c and \(\alpha \) affect the accuracy of the predictions?

RQ2: To what extent does the adopted data discretization strategy affect the accuracy of predictions?

RQ3: Is the accuracy of the predictions obtained with our approach in line with the one provided by previous predictive process monitoring techniques?

The first research question aims to investigate the impact of the two parameters used by our approach, \(\alpha \) and c, on the quality of predictions. RQ2 aims to provide insights on the differences between “local” and “global” data discretization in terms of prediction quality. In the first case, we perform data discretization w.r.t. each pattern individually. This way, we obtain a preprocessed log for each pattern, which is used to make predictions for the corresponding pattern only. In the second case, attributes are discretized by considering the occurrences of all patterns at once. Finally, RQ3 aims to compare our approach with existing techniques. To address these questions and conduct an insightful experiment on realistic logs, we need datasets that allow us to evaluate the correctness of our predictions, i.e., how close the outcome of the proposed technique is to the desired one, and to gain meaningful insights about the possible reasons underlying unexpectedly good or bad results. In other words, we need event logs with data, labelled based on the occurrence of patterns that also involve data, together with some domain knowledge about these patterns. As far as we know, publicly available real-life datasets with these characteristics do not exist. Indeed, publicly available real-world event logs (e.g., https://data.4tu.nl/repository/collection:event_logs) typically involve complex data, for which little or no domain knowledge on the generating process is available, thus making it challenging to assess the correctness and relevance of the derived insights. We hence made a significant effort to generate a realistic dataset starting from one of these real-life logs (BPI2012): we discovered the corresponding process model (see Fig. 1), injected realistic data (see details below) into the simulated event log, and labelled the traces according to the occurrence of patterns that also involve data.

We evaluated our approach against the results obtained by the clustering-based approach in [9], which explicitly takes into account the amount of evidence to compute predictions. In this approach, traces with a similar control flow are first grouped together, and a classifier is trained on each cluster. The most suitable cluster (and, hence, classifier) is chosen at run-time to classify the current sample. If there is not enough evidence to make a decision, i.e., if the support of the (cluster of) traces representing the same state as the current one is below a user-defined threshold, no prediction is provided.

The following subsections describe the implementation and parameter settings of each tested technique, the metrics used for the evaluation, and the results obtained.

4.1 Experiments Settings

Dataset generation: To design a synthetic experiment exhibiting the complexity of real-world scenarios, we chose for our experiments a loan management process derived from the event log made available for the BPI2012 challenge, based on several previous works (see Fig. 1). Based on this model, we generated a synthetic event log using CPNTools (http://cpntools.org/), a widely used tool for Petri net editing and simulation, setting for each pattern a probability of occurrence of 20%. We exploited the simulation options available in CPNTools to deal with changes in the control flow, e.g., delaying the completion of fraud checking, while we developed a script in Java for the generation of values for the amount and duration attributes. To set possible values of the amount attribute, we collected the amount values from the BPI2012 log; then, for each trace in our event log, we randomly selected one of these values, with a probability of 70% of selecting values higher than 10000, which is the threshold set for the fraud checking pattern. To capture the old application resuming pattern, we first introduced a waiting time between each pair of consecutive events, randomly chosen from the interval between 4 and 100 hours. Then, in those traces in which an application was resumed, we set a probability of 80% of increasing the timestamp of activity \( a\_registered \) by 31 days, to ensure a reasonable number of cases in which the pattern occurred. The final support of each pattern derives from the combination of the changes performed within CPNTools and the changes performed by our script. More precisely, we obtained the following support values: 22.85% for the old application resuming pattern; 14.81% for the delayed fraud check pattern; 29.86% for the multiple negotiations pattern; and, finally, 21.62% for the application resuming pattern. For the sake of simplicity, hereafter we refer to these patterns as the duration, fraud, negotiation, and resumed patterns, respectively. It is worth noting that, by construction, we also generated traces involving partial patterns, e.g., patterns in which an old application was resumed but within an acceptable time window. This makes the log more realistic, as we expect a certain behavior to be undesirable only under certain conditions.

Data discretization: For data discretization, we used the supervised discretization approach provided by the Weka J48 decision tree implementation of the C4.5 algorithm [17]. Specifically, we looked at the discretization intervals returned by the decision tree algorithm, taking into account not only the continuous variables but also the categorical ones. In detail, we encoded the execution traces with the last-payload encoding [14] and trained the decision tree. We then inspected the resulting decision tree and extracted the intervals of the continuous variables whenever they had enough discriminative power with respect to the specific pattern. We set the confidence threshold to 0.8 and the support threshold to 50. We tested both the local and the global discretization strategy.

Subjective Logic classifier: We varied \(\alpha \) in the interval [1, 2], with steps of 0.1. For each value of \(\alpha \), we tested three values of c, i.e., 2, 10, and 50. We did not consider higher values, since we already observed a significant worsening of the classification performance when setting the minimum amount of evidence to 20 traces.

Clustering-based approach: For the clustering-based approach, we used the K-means algorithm with 18 clusters to group together execution traces with a similar control flow, and a decision tree trained on the data payload as classifier. As for the classification thresholds, we varied the confidence threshold \(\gamma \) between 0.5 and 0.9 and the support threshold \(\rho \) in the set \(\{2, 20, 50\}\).

Evaluation metrics: We evaluated our results along two dimensions:

  • Classification accuracy. We evaluate this dimension both in terms of the standard classifier Accuracy, i.e., the percentage of samples in the dataset that are correctly classified, and in terms of the F1 measure (F1 hereafter). The latter is a metric widely used for imbalanced datasets, where one class is more represented than the other. F1 balances the precision of the classifier, intended as the exactness of the predictions, and its recall, intended as the completeness of the results. Both accuracy and F1 range between 0 (minimum) and 1 (maximum).

  • Failure rate. A classifier usually does not provide a prediction when too little evidence is available. We measure failure rate (FR) as the percentage of unclassified samples in the dataset. Given the same values for classification accuracy, the best classifier is the one that achieves the lowest failure rate.

We evaluated the performance of the classifiers by means of a 10-fold cross validation.
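For concreteness, the sketch below (our own; it computes the metrics over a single fold and assumes predictions are the strings returned by pred) shows how accuracy, F1, and failure rate can be derived from the predictions:

```python
def evaluate(predictions, labels):
    """Accuracy and F1 over the classified samples, failure rate over all samples.
    predictions: 'Yes' / 'No' / 'Unpredicted'; labels: True if the pattern occurred."""
    classified = [(p, y) for p, y in zip(predictions, labels) if p != "Unpredicted"]
    fr = 1 - len(classified) / len(predictions)
    tp = sum(1 for p, y in classified if p == "Yes" and y)
    tn = sum(1 for p, y in classified if p == "No" and not y)
    fp = sum(1 for p, y in classified if p == "Yes" and not y)
    fn = sum(1 for p, y in classified if p == "No" and y)
    accuracy = (tp + tn) / len(classified) if classified else float("nan")
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else float("nan")
    return accuracy, f1, fr

print(evaluate(["Yes", "No", "Unpredicted", "Yes"], [True, False, True, False]))
# (0.666..., 0.666..., 0.25)
```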

4.2 Results

In the following we discuss the results related to each research question.

RQ1: Figure 4 shows the values of accuracy, F1, and FR when varying \(\alpha \) and c. The approach performs well in terms of accuracy for all tested configurations. It achieved an accuracy of around 90% for all patterns, with the exception of the negotiation pattern, for which we obtained an accuracy of around 75%. Varying the c parameter does not seem to impact the accuracy metric. For F1, we still obtained quite good results for most of the patterns, around 70–75%, although we observe a degradation of the results when increasing \(\alpha \) and c. However, the approach scored quite poorly in terms of F1 for the negotiation pattern, for which we obtained values between 0.4 and 0.5. As a general trend, we can observe an evident worsening of the performance in terms of F1 as \(\alpha \) and, in particular, c increase. As regards the FR, the approach performs well for \(c=2\) (predictions are missing for at most 0.04% of the samples), while performance worsens for higher values of c. In particular, for \(c=50\), relevant portions of the samples (around 30–40% on average) are missed. Note that, as expected, the FR is not affected by variations of \(\alpha \).

Fig. 4. Results of our approach for c equal to (a) 2, (b) 10 and (c) 50.

RQ2: By applying local discretization, we identified two classes for the attribute amount (for the fraud pattern) and six classes for duration (for the duration violation pattern). Comparing the thresholds used by construction for generating the log against the closest values delimiting the corresponding discretization classes, we can observe that the results are in line with the construction parameters. For instance, for amount the construction threshold is 10000 and the delimiter of the discretization class obtained from the decision tree is 10030. Moreover, by comparing the thresholds identified by the discretization against the actual data in the log, we can observe that the identified thresholds actually fit the trace labelling of the log. For instance, when amount is lower than or equal to 10030, the fraud pattern never occurs (the smallest amount value for which a violation occurs is 10071). With the global discretization configuration, we obtain intervals only slightly different from those of the local discretization. This preliminary and qualitative analysis allows us to assess that the returned classes are reasonable with respect to the criteria used for the log construction and with respect to the actual log.

The prediction results obtained with the local and the global strategy are similar with respect to all three metrics (Fig. 4). An exception is the duration pattern, for which the locally discretized log leads to better values in terms of accuracy and F1; however, this comes at the cost of a much higher rate of unpredicted samples.

RQ3: A comparison between our approach (SL) and the clustering-based approach is reported in Table 1 for the local discretization strategy and in Table 2 for the global discretization strategy. For each pattern, we report the configuration of parameters that optimizes each metric, along with the values of the other metrics for that configuration. For example, the first group of columns shows the configuration that optimizes accuracy.

Table 1. Results on the locally discretized data

For locally discretized data, the two approaches provided comparable performance in most of the cases. Overall, the clustering approach seems to return slightly better values for accuracy and F1 when considering the corresponding best configurations; however, this often comes with a much higher failure rate, in particular for the fraud and negotiation patterns. When considering the configurations that optimize the failure rate, the clustering approach was able to classify all samples, even though our approach achieves a failure rate close to 0 with comparable results in terms of accuracy and F1. Results on the globally discretized data show trends for accuracy and F1 similar to those observed for the locally discretized data. However, the clustering approach performs much worse than our approach in terms of failure rate for all configurations; indeed, its failure rate is never below 23%, against the 0.04% achieved by our approach.

Discussion: Our approach scores fairly well for most of the patterns, although performance is worse for the negotiation pattern. This likely happens because this pattern contains a loop. Indeed, different numbers of loop iterations are modeled as different states in the predictive model, so that some states do not have enough support to be accounted for in the prediction, leading to missed positive samples. In some cases, e.g., for \(\alpha =1\), \(c=50\), no positive samples were found, thus leaving F1 undefined. The results show that low values of \(\alpha \) and c typically provide better results. This is not unexpected: the higher the thresholds, the higher the probability of missing true positives, which explains the worsening of performance. The approach seems, however, to be much more sensitive to c than to \(\alpha \). While the worsening in performance is negligible when varying \(\alpha \) (with the exception of the duration pattern), we observed a worsening both in terms of F1 and, especially, in terms of failure rate when increasing c. According to these results, it seems advisable to set low thresholds for the minimum amount of evidence, when possible. As regards RQ2, even though the final outcome is clearly affected by the employed discretization strategy, the results suggest that adopting a local or a global strategy does not have a significant impact on the performance. For RQ3, we observed that the performance provided by our approach is similar to that of the clustering-based approach. The latter sometimes performs better in terms of accuracy and F1, but misses a higher percentage of samples in most of the tested configurations. This is because the clustering-based approach uses stricter constraints and, in particular, relies on the classifier confidence to decide whether a prediction should be made. The similarity between the performance of the two approaches suggests that relaxing the assumption on the confidence does not lead to significantly worse performance in terms of accuracy and F1, while providing a lower failure rate. Moreover, the performance of the clustering-based approach seems to be more sensitive to the discretization strategy than that of our approach.

Threats to validity: One of the threats to the external validity of the evaluation is the application of the approach only to synthetic data. The use of more logs, including real-life ones, would clearly allow for more general results. However, this threat is mitigated by the fact that the considered log was generated by simulating a realistic and widely known model, with a realistic number, type, and range of data attributes. A second threat to external validity is the choice of the investigated patterns. Also in this case, the threat is mitigated by the fact that the chosen patterns are realistic for the considered scenario.

Table 2. Results on the globally discretized data

5 Related Work and Concluding Remarks

Predictive business process monitoring has received increasing attention in recent years [15, 16]. Existing approaches can be grouped into three main categories based on the aim of the prediction: (i) approaches that aim to predict the remaining execution time of running process instances, e.g., [19, 22]; (ii) approaches that aim to predict the next activity to be executed, e.g., [5]; (iii) the so-called outcome-oriented approaches, which classify ongoing executions according to a given set of possible categorical outcomes [7, 9, 14, 21]. Our work belongs to the third group, since we predict the value of an indicator (i.e., the occurrence of a given pattern) for each running execution.

Within the outcome-oriented approaches, some works focus on predictions and recommendations to reduce risks [6, 7, 15]. For example, in [6, 7], the authors present a technique to support process participants in making risk-informed decisions with the aim of reducing process failures, by considering process executions both in isolation [7] and propagating information about risks to similar running instances [6]. In [15], three different approaches for the prediction of process instance constraint violations are investigated: machine learning, constraint satisfaction and QoS aggregation. Our work, by making predictions on the occurrence of undesired behavior, is related to this group of works, although the focus is slightly different.

Moreover, our approach is close to those applying a lossless encoding strategy (e.g., [14]), i.e., an encoding that allows recovering the original trace. Since a lossless encoding leads to prefixes of different lengths, a common strategy adopted by lossless outcome-oriented approaches to employ classification techniques consists in splitting the set of prefixes into buckets and training a classifier for each bucket. Different strategies have been explored to build the set of prefix buckets. For example, Lakshmanan et al. [13] build a classifier for each prefix length. Di Francescomarino et al. [9] exploit trace clustering techniques to group similar traces, building a classifier for each cluster. Leontjeva et al. [14] build a classifier for each state in a process model. Once the traces have been properly encoded and the prefixes have been grouped, well-known classification techniques are employed. Compared to previous work, our approach exploits a single predictive model, without requiring to group prefixes and train multiple classifiers. Moreover, our approach provides analysts with a simple and intuitive mechanism to set both the reliability and the sensitivity of predictions. To the best of our knowledge, sensitivity aspects have been largely neglected by previous approaches; to support this dimension, one would have to delve into the classification model and change the logic with which predictions are provided, which is not trivial and not always possible in existing approaches.

In this work, we introduced an approach based on Subjective Logic for predicting the occurrence of undesired behaviors in running process executions. The approach allows the process analyst to customize both the reliability and the sensitivity desired for the prediction. This makes our predictive process monitoring system suitable for scenarios in which reacting to undesired behaviors might be costly or complex, as the system can be configured to raise an alert only in the presence of strong evidence supporting the prediction. The evaluation showed that the approach performs well both in terms of classification performance and effectiveness, obtaining results mostly comparable with those of the tested competitor and often leading to a significant reduction of the failure rate.

In future work, we plan to perform a more exhaustive set of experiments by considering real-world datasets as well as testing other discretization techniques. Moreover, we plan to investigate and develop solutions tailored to deal with the actor dimension. This dimension is necessary to predict behaviors involving actor-related constraints, such as separation/binding of duties, which cannot be handled by standard discretization techniques.