
1 Introduction

Business Process Simulation (BPS) is a technique for estimating the performance of business processes under different scenarios [9]. BPS enables analysts to address questions such as “what would be the cycle time of a process if one or more resources became unavailable?” or “what would be the impact of automating an activity on the waiting times of other activities in the process?”. The starting point of BPS is a process model, e.g. in the Business Process Model and Notation (BPMN), enhanced with simulation parameters [21] (herein, a BPS model). These simulation parameters capture, for example, the processing times of each activity or the rate at which new process instances (cases) are created.

BPS models may be manually created based on information collected via interviews or empirical observations, or they may be automatically discovered from execution data recorded in process-aware information systems (event logs) [6, 7, 17, 22]. Regardless of the origin, a key question when using a BPS model is how to assess its quality. This question is particularly relevant when tuning the simulation parameters. Several approaches have been proposed to address this problem. However, these approaches are either manual and qualitative [22] or they produce a single number that does not allow one to identify the source(s) of deviations between the BPS model and the observed reality [6, 10].

In this paper, we study the problem of automatically measuring the quality of a BPS model w.r.t. its ability to replicate the observed behavior of a process as recorded in an event log. We advocate a multi-perspective approach to this problem, thus proposing a set of quality measures that address different perspectives of process performance. The starting point is the idea that a good BPS model is one that generates traces consisting of events similar to the observed data. Accordingly, the proposed approach maps an event log produced by the BPS model and an event log recording the observed behavior into histograms or time series capturing a given perspective, and then compares the resulting histograms or time series using a distance metric.

We conduct a two-fold evaluation of the measures using synthetic and real-life datasets. In the synthetic evaluation, we study the ability of the proposed measures to discern the impact of modifications to a BPS model, whereas in the real-life evaluation, we analyze their ability to uncover the relative strengths and weaknesses of two approaches for automated discovery of BPS models. Our results show that the measures not only capture how close a BPS model is to the observed behavior, but also help us identify sources of discrepancies.

The rest of the paper is structured as follows. Section 2 gives an overview of prior research related to the discovery and evaluation of BPS models. Section 3 introduces relevant process mining concepts and distance measures. Section 4 analyzes the problem and proposes a set of measures of quality of BPS models. Section 5 discusses the empirical evaluation, and Sect. 6 draws conclusions and sketches future work.

2 Background

2.1 Business Process Simulation Models

A BPS model consists of i) a stochastic control-flow model, ii) an activity performance model, and iii) an arrival and congestion model. The stochastic control-flow model is composed of a process model (e.g., a BPMN model or a Petri net) and a stochastic component capturing the probability of occurrence of each path in the model (branching probabilities). The activity performance model determines the duration of activity instances (e.g., by associating a parametric distribution with each activity in the model). Finally, the arrival and congestion model determines when new cases arrive in the system, and when the execution of each enabled activity instance starts, given the available resource capacity.
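For concreteness, the following Python sketch illustrates these three components as a plain dictionary. The structure and all names are purely illustrative assumptions on our part: actual simulation engines (e.g., BIMP, Prosimos) each define their own BPS model formats.

```python
# Purely illustrative sketch of the three BPS model components;
# real engines (e.g., BIMP, Prosimos) use their own model formats.
bps_model = {
    # i) stochastic control-flow model: process model + branching probabilities
    "process_model": "loan_process.bpmn",  # hypothetical file name
    "branching_probabilities": {"XOR_1": {"approve": 0.8, "reject": 0.2}},
    # ii) activity performance model: a parametric duration distribution
    #     per activity (here, log-normal with mean and stddev in seconds)
    "activity_durations": {"Assess application": ("lognorm", 3600, 1200)},
    # iii) arrival and congestion model: case arrivals, resource capacity,
    #      and availability timetables
    "inter_arrival_time": ("expon", 1800),  # mean of 30 minutes
    "resources": {"Clerk": {"capacity": 2,
                            "calendar": "Mon-Fri 09:00-17:00",
                            "performs": ["Assess application"]}},
}
```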

Traditionally, BPS models are constructed manually by experts. Recent approaches advocate for the automated discovery of BPS models from event logs. Below, we consider two such approaches. The first one, namely SIMOD [6], starts by constructing a stochastic process model by applying the SplitMiner algorithm [3] to discover a BPMN model from the input log, and replaying the traces of the log to calculate the branching probabilities. Next, SIMOD discovers the activity performance model (activity duration distributions) and a congestion model consisting of: i) a case inter-arrival time distribution; ii) a set of resources, their availability timetables, and the activities they perform; and iii) the distribution of extraneous waiting times between activities (i.e. waiting times not attributable to congestion) [8]. Once a BPS model is discovered, its parameters are tuned to fit the data using a Bayesian hyper-parameter optimizer.

The second BPS model discovery technique we consider is ServiceMiner\(^{\copyright }\). ServiceMiner operates in three steps: i) data preprocessing, where techniques for data cleaning and categorical feature encoding are applied; ii) data enhancement, where new data attributes that capture trend, seasonality, and system congestion are created using methods described in [23]; and iii) model learning, where the BPS model is created by combining process discovery, queue mining (learning of queueing building blocks from data), and machine learning (to boost the accuracy of arrival and activity time generation). For process discovery, ServiceMiner mines a Markov chain, estimating the case routing probabilities between consecutive activity pairs. An abstraction mechanism allows for filtering out rare activities, paths, and transitions. Next, using queue mining, the various queueing building blocks are fitted from data using techniques described in [24]. Lastly, ServiceMiner applies a machine learning technique that uses congestion features derived from queueing theory, which, via cross-validation, improves the accuracy of the generated inter-arrival times and activity durations.

While the evaluation reported below focuses on BPS models discovered by SIMOD and ServiceMiner, the proposed measures can be used to assess the quality of any model that generates event logs. For example, the proposal can also be used to evaluate generative deep learning models of business processes [7]. On the other hand, it cannot be used to assess coarse-grained BPS models, e.g. based on system dynamics [20], unless these are refined to generate event logs.

2.2 Quality Measures for Business Process Simulation Models

Leemans et al. [12, 13] and Burke et al. [5] studied the evaluation of stochastic models using, among other measures, Earth Movers’ Distance. However, they focus solely on the control-flow perspective, and their purpose is mainly conformance checking. In this paper, we focus on the assessment of BPS model quality considering both temporal and control-flow dimensions.

Prior studies have considered the evaluation of BPS models. Rozinat et al. [22] evaluate BPS models via manual comparisons, but do not propose concrete, automatable evaluation measures. Camargo et al. [7] study the performance of data-driven simulation and deep learning techniques, proposing measures that combine the control-flow and the temporal perspectives. The latter measures are not scalable, and they do not identify the sources of discrepancies between BPS models. To overcome these shortcomings, we propose an approach that views the process from different perspectives and provides a separation of concerns between the three BPS model components (the stochastic model, the activity performance model, and the congestion model). We then propose efficient measures for each component.

3 Preliminaries

3.1 Event Logs

Modern enterprise systems maintain records of business process executions, which can be used to extract event logs: sets of timestamped events capturing the execution of the activities in a process [9]. We assume that each event record in the log relates to a case, an activity, and an activity start and end timestamp (as in Table 1). However, the proposal of this paper can be generalized to include other life-cycle events (e.g., activity enablement or cancellation). We shall refer to events and activity instances interchangeably, even though they could mean different things in other contexts. Let \(\mathcal {E}\) be the universe of events, C be the universe of case identifiers, A be the set of possible activity labels, and T be the time domain.

Definition 1 (Event Log)

An event log (denoted by \(\mathcal {L}\)) is a set of executed activity instances, \(E \subseteq \mathcal {E}\), with each event having a schema \(\sigma _\mathcal {E} = \{\xi , \alpha , \tau _{start}, \tau _{end}\}\), which assigns the following attribute values to events:

  • \(\xi : \mathcal {E} \rightarrow C\) assigns a case identifier,

  • \(\alpha : \mathcal {E} \rightarrow A\) assigns an activity label,

  • \(\tau _{start}: \mathcal {E} \rightarrow T\) assigns the start timestamp of the executed activity, and,

  • \(\tau _{end}: \mathcal {E} \rightarrow T\) assigns the end timestamp of the executed activity.

Note that the transformation from a traditional event log that contains only a single timestamp to our notion of an event log is straightforward (see [18]).

Table 1. Example of 6 events of an event log from a Procure-to-Pay Process
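As the table itself is not reproduced here, the following Python sketch shows a minimal pandas encoding of the event-log schema of Definition 1. The case identifiers, activity labels, and timestamps are invented for illustration and do not reproduce the actual content of Table 1; the sketches in Sect. 4 assume this column layout.

```python
import pandas as pd

# Illustrative encoding of Definition 1: one row per activity instance,
# with a case identifier (xi), an activity label (alpha), and start/end
# timestamps (tau_start, tau_end). All values are invented examples.
event_log = pd.DataFrame({
    "case_id":  ["c1", "c1", "c1", "c2", "c2", "c2"],
    "activity": ["Create PO", "Approve PO", "Pay Invoice",
                 "Create PO", "Approve PO", "Pay Invoice"],
    "start_time": pd.to_datetime([
        "2022-05-02 09:05", "2022-05-02 10:12", "2022-05-03 14:30",
        "2022-05-02 09:40", "2022-05-02 11:05", "2022-05-04 08:15"]),
    "end_time": pd.to_datetime([
        "2022-05-02 09:35", "2022-05-02 10:40", "2022-05-03 15:00",
        "2022-05-02 10:10", "2022-05-02 11:30", "2022-05-04 08:45"]),
})
```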

3.2 Measures for Time-Series and Histogram Comparison

To analyze the temporal performance of a process, an event log can be mapped to a variety of time series (e.g., activity starts, activity ends). Accordingly, we consider techniques to quantify the distance between two time series, \(x = (x_1, \ldots , x_n)\) and \(y = (y_1, \ldots , y_m)\), of (potentially different) lengths n and m, respectively. To this end, one may employ various measures, such as computing \(||x-y||_{l}\) in any of the standard norms (i.e., \(l=1, 2, \infty \)). These comparisons are only possible after padding the shorter time series. In addition, standard norms do not capture the temporal differences between the two time series. For example, a temporal shift of x relative to y may produce the same l1 or l2 norm regardless of the magnitude of the shift, even though it represents a significant failure of the model to properly capture time-series patterns. To overcome these two limitations, namely the need for padding and the insensitivity to temporal differences, a natural measure is the Wasserstein Distance (WD) [19]; in this work, we consider two variations of WD.

  • Earth Mover’s Distance (EMD) [13] computes the effort it takes to balance two vectors x and y of different lengths, treating each entry \(x_i, y_j\) as ‘masses’ to move from location to location until the two time series are equal. EMD does not assume that \( \sum _i x_i = \sum _j y_j \), i.e., the sum of the ‘earth mass’ to be moved can be different; in such cases, we add a penalty for creating redundant mass to fill in gaps. Herein, we consider the EMD problem with absolute distance measure [14].

  • 1st Wasserstein Distance (1WD) [14] is a computationally efficient variation of the EMD. It introduces the constraint that the sum of masses must be the same in x and y (i.e., the constraint \( \sum _i x_i = \sum _j y_j \) is enforced). 1WD is suitable for comparing empirical distribution functions (histograms), since the sum of the mass in each is 1.

When comparing two histograms, we let \(f = (f_1, \ldots , f_n) \) be the n normalized frequency values of the first histogram, and let \(g = (g_1, \ldots , g_m) \) be the m normalized frequencies of the second histogram. We treat the two histograms f and g similarly to the two time series x and y, and employ 1WD distance, since the sum of masses is 1 (EMD and 1WD lead to the same results).
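For illustration, the 1WD between two histograms can be computed with SciPy's wasserstein_distance, treating bin indices as locations and normalized frequencies as weights. This is a minimal sketch with invented values; the unequal-mass EMD with its penalty term is not covered by this SciPy function.

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Two normalized histograms f and g over bin indices 0..n-1 and 0..m-1
# (illustrative values; each sums to 1).
f = np.array([0.1, 0.4, 0.3, 0.2])
g = np.array([0.2, 0.2, 0.2, 0.2, 0.2])

# 1WD: bin indices are the locations, frequencies are the weights.
# SciPy normalizes the weights internally, which matches the
# equal-mass constraint of 1WD.
d = wasserstein_distance(np.arange(len(f)), np.arange(len(g)), f, g)
print(f"1WD = {d:.3f}")
```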

4 Framework for Measuring BPS Model Quality

In this section, we develop an approach for measuring the quality of BPS models. There are two main reasons why directly evaluating a BPS model would be impractical: i) typically, the ‘true’ BPS model of the process is not available (and often does not exist), thus, we cannot perform a model-to-model comparison; ii) different simulation engines (e.g., BIMP [1], Prosimos [16]) support different BPS model formats, hindering a generic comparison of BPS models. Therefore, we propose to generate a collection of logs simulated with the BPS model under evaluation, and compare them to event logs of the actual system (i.e., the system that the model aims to mimic). Consequently, one can apply a ‘transitive argument’: the ‘closer’ the simulated logs are to the actual data, the better the model. In other words, we treat the (test) data as our ‘ground truth’, since useful models are supposed to be faithful generators of ‘reality’.

Two challenges arise when measuring the quality of a BPS model: i) a model can be very close to the data in one aspect (e.g., control-flow), yet very different in another (e.g., in inter-arrival times); and ii) a model can generate many realities, as it is probabilistic in nature (durations and routing are stochastic), while the data consists of a single realization. To overcome the first challenge, we propose a collection of measures to quantify the distance across multiple process perspectives. Specifically, we consider control-flow, temporal, and congestion distance measures. As for the second challenge, our approach is to generate multiple event logs simulating the ‘ground-truth’ event log (i.e., with the same number of cases, and starting from the same instant in time), and use the generated logs to construct confidence intervals for each of our measures.

For all measures, we consider a collection of K generated logs (GLogs) produced by K simulation runs, and compare these K GLogs to the actual test event log (ALog) that, importantly, was not used to construct the BPS model. Below, we outline the control-flow, temporal, and congestion measures, discuss their rationale, and briefly describe their computation by comparing the GLogs and the ALog.

4.1 Control-Flow Measures

To evaluate the quality w.r.t. the control-flow perspective (i.e., the capability of the model to represent the event sequences in the actual event log), we propose two measures. The first one, namely the control-flow log distance (CFLD), is a variation of a measure introduced by Camargo et al. in [7]. CFLD precisely penalizes the differences in the control-flow by pairing each case in the GLog with the case in the ALog that minimizes the sum of their distances. However, due to its steep computational complexity, we propose an additional measure, the n-gram distance (NGD), that approaches the problem in a more efficient way.

Control-Flow Log Distance (CFLD). Given two logs \(\mathcal {L}_1\) and \(\mathcal {L}_2\) with the same number of cases, we compute the average distance to transform each case in \(\mathcal {L}_1\) into a case in \(\mathcal {L}_2\) (see [7] for a description of a similarity version of this measure). To compute this measure, we first transform each process case of \(\mathcal {L}_1\) and \(\mathcal {L}_2\) into its corresponding activity sequence, abstracting from temporal information. Then, we compute the Damerau-Levenshtein (DL) distance [26] between each pair of cases i, j belonging to \(\mathcal {L}_1\) and \(\mathcal {L}_2\), respectively, normalizing it by the maximum of their lengths (obtaining a value in [0, 1]). Subsequently, we compute the matching between the cases of both logs (such that each i is matched to a different j, and vice versa) that minimizes the sum of distances, using the Hungarian algorithm for optimal alignment. The CFLD is the average of the normalized distance values.

CFLD requires pairing each case in the simulated log with a case in the original log, minimizing the total sum of distances. The computational complexity of computing the DL-distance for all possible pairings is \(O(N^2 \times MTL^3)\), where N is the number of traces in the logs (assuming both logs have an equal number of cases, which holds in our setting) and MTL is the maximum trace length. Since all pairings are put into a matrix to compute the optimal alignment of cases (the one that minimizes the total sum of distances), CFLD’s memory complexity is quadratic in the number of cases. The optimal alignment of traces using the Hungarian algorithm has cubic complexity in the number of cases.
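The following Python sketch illustrates CFLD for logs given as lists of activity sequences. The helper names are ours, not part of the formal definition; SciPy's linear_sum_assignment solves the same optimal assignment problem as the Hungarian algorithm (via a Jonker-Volgenant variant).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def dl_distance(s, t):
    """Damerau-Levenshtein distance (restricted: adjacent transpositions)."""
    n, m = len(s), len(t)
    d = np.zeros((n + 1, m + 1), dtype=int)
    d[:, 0] = np.arange(n + 1)
    d[0, :] = np.arange(m + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,         # deletion
                          d[i, j - 1] + 1,         # insertion
                          d[i - 1, j - 1] + cost)  # substitution
            if i > 1 and j > 1 and s[i-1] == t[j-2] and s[i-2] == t[j-1]:
                d[i, j] = min(d[i, j], d[i - 2, j - 2] + 1)  # transposition
    return d[n, m]

def cfld(log1, log2):
    """CFLD between two logs given as lists of activity sequences
    (both logs must contain the same number of cases)."""
    # Normalized DL distance for every possible pairing of cases.
    costs = np.array([[dl_distance(s, t) / max(len(s), len(t))
                       for t in log2] for s in log1])
    # Optimal 1-to-1 matching minimizing the total distance.
    rows, cols = linear_sum_assignment(costs)
    return costs[rows, cols].mean()

# Example: three cases per log, each case a sequence of activity labels.
print(cfld([list("ABCD")] * 3, [list("ABED")] * 3))  # 0.25
```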

N-Gram Distance (NGD). Leemans et al. [13] measure the quality of a stochastic process model by mapping the model and a log to their Directly-Follows Graph (DFG), viewing each DFG as a histogram, and measuring the distance between these histograms. We note that the histogram of 2-grams of a log is equal to the histogram of its DFG. Given this observation, we generalize the approach of [13] to n-grams, noting that the histogram of n-grams of a log is equal to the (n-1)\(^{\textrm{th}}\)-Markovian abstraction of the log [2]. In other words, the histogram of 2-grams is the \(1^{st}\)-order Markovian abstraction (the DFG), the histogram of 3-grams is the \(2^{nd}\)-order Markovian abstraction, and so on.

Given two logs \(\mathcal {L}_1\) and \(\mathcal {L}_2\), and a positive integer n, we compute the difference in the frequencies of the n-grams observed in \(\mathcal {L}_1 \bigcup \mathcal {L}_2\). To compute this measure, we transform each case of \(\mathcal {L}_1\) and \(\mathcal {L}_2\) into its corresponding activity sequence, abstracting from temporal information, and adding \(n-1\) dummy activities to both the start and the end of the case (e.g., 0-A-B-C-0 for case A-B-C and \(n=2\)). Then, we compute all sequences of n activities (n-grams) observed in each log, and measure their frequencies. Finally, we compute the sum of absolute differences between the frequencies of each observed n-gram, and normalize the total distance by the sum of the frequencies of all n-grams in both logs (obtaining a value in [0, 1]).

For example, consider \(\mathcal {L}_1\) having three cases A-B-C-D, and \(\mathcal {L}_2\) having three cases A-B-E-D. Given \(n=2\), the observed n-grams are 0-A, A-B, B-C, C-D, and D-0 in \(\mathcal {L}_1\); and 0-A, A-B, B-E, E-D, and D-0 in \(\mathcal {L}_2\) (each one with a frequency of three). The n-grams B-C, C-D, B-E, and E-D have a frequency of 3 in one log, and 0 in the other, thus, the NGD between \(\mathcal {L}_1\) and \(\mathcal {L}_2\) is 0.4 (12 divided by 30). By adding dummy activities, all activity instances have the same weight in the measure, as each of them is present in n n-grams. Otherwise, the first and last activity instances of each trace would be present only in one n-gram. Note that we do not use the EMD to compute the NGD, because the order of the n-grams in the histogram is irrelevant and EMD would take this order into account.

NGD is considerably more efficient than CFLD, as the construction of the histogram of n-grams is linear on the number of events in the log, and the same goes for computing the differences between the n-gram histograms.
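A minimal Python sketch of NGD, assuming logs as lists of activity sequences and using '0' as the dummy label (as in the example above); the function name is ours. It reproduces the 0.4 of the running example.

```python
from collections import Counter

def ngd(log1, log2, n=2):
    """N-gram distance between two logs given as lists of activity
    sequences; returns a value in [0, 1]."""
    def ngram_histogram(log):
        hist = Counter()
        pad = ["0"] * (n - 1)  # dummy activities at both ends
        for case in log:
            seq = pad + list(case) + pad
            hist.update(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))
        return hist
    h1, h2 = ngram_histogram(log1), ngram_histogram(log2)
    diff = sum(abs(h1[g] - h2[g]) for g in set(h1) | set(h2))
    total = sum(h1.values()) + sum(h2.values())
    return diff / total

# The running example from the text: three cases A-B-C-D vs. A-B-E-D.
print(ngd([list("ABCD")] * 3, [list("ABED")] * 3))  # 0.4
```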

4.2 Temporal Measures

We propose three measures that assess the ability of a BPS model to capture the temporal performance perspective, based on the idea that the time series of events generated by a BPS model should be similar to the time series of the test data, with respect to seasonality, trend, and time-to-event.

The first two measures come from time-series analysis, where most approaches in the literature (e.g., SARIMA) decompose the time series into components of trend, seasonality, and noise [4]. We follow a similar path by analyzing the trend (comparing the absolute distribution of events), and the seasonality (comparing the circadian distribution of events). The third measure comes from time-to-event (or survival) analysis [11], a field in statistics that analyzes the behavior of individuals from some point in time until an event of interest occurs. Specifically, we are interested in analyzing the capability of the simulator to correctly reconstruct the occurrence of events (and their timestamps) from the beginning of the corresponding case to its end. Below, we provide the details of the three aforementioned measures.

Absolute Event Distribution (AED). Given two event logs \(\mathcal {L}_1\) and \(\mathcal {L}_2\), we transform the events into a time series by binning the timestamps in the event log (both start and end timestamps) by date and hour of the day (e.g., timestamps between ‘02/05/2022 10:00:00’ and ‘02/05/2022 10:59:59’ will be placed into the same bin). Let \(i = 1, \ldots , B\) be the hours from the first until the last timestamp in \(\mathcal {L}_1 \bigcup \mathcal {L}_2\) (i.e., the timeline of both logs), and let \(dh(\tau (e))\) be a function returning the i corresponding to the date and hour of the day of a timestamp of event e (for brevity, we refer to both \(\tau _{start}\) and \(\tau _{end}\) as \(\tau \)). Then, the binning procedure is as follows,

$$\begin{aligned} x_i = |\{ e \in \mathcal {L}_1 \mid dh(\tau (e)) = i\}|, \ \ \ y_i = |\{ e \in \mathcal {L}_2 \mid dh(\tau (e)) = i\}|\end{aligned}$$
(1)

Finally, the AED distance between \(\mathcal {L}_1\) and \(\mathcal {L}_2\) corresponds to the EMD between \(x_1, \ldots , x_B\) and \(y_1, \ldots , y_B\).
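A minimal sketch of AED, assuming the DataFrame encoding of Sect. 3.1. For brevity, it uses SciPy's 1WD on the hourly-binned timestamps rather than the full EMD with the unequal-mass penalty of Sect. 3.2.

```python
import pandas as pd
from scipy.stats import wasserstein_distance

def absolute_event_distribution(log1, log2):
    """AED sketch: bin all start/end timestamps by hour since the
    origin of the joint timeline, then compare the two discretized
    distributions with 1WD (the unequal-mass penalty of the full
    EMD is omitted for brevity)."""
    def hourly_bins(log, origin):
        # Stack start and end timestamps into one series of events.
        stamps = pd.concat([log["start_time"], log["end_time"]])
        return ((stamps - origin) // pd.Timedelta(hours=1)).astype(int)
    origin = min(log1["start_time"].min(), log2["start_time"].min())
    return wasserstein_distance(hourly_bins(log1, origin),
                                hourly_bins(log2, origin))
```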

Circadian Event Distribution (CED). Given two event logs \(\mathcal {L}_1\) and \(\mathcal {L}_2\), we partition each log into sub-logs by the day of the week (Mon-Sun). Let \(wd(\tau (e))\) be a function that returns the day of the week for timestamp \(\tau (e)\). Then, for \(i = 1, \ldots , 7\), we obtain the corresponding sub-logs as follows,

$$\begin{aligned} \mathcal {L}_{1, i}= \{ e \in \mathcal {L}_1 \ | \ wd(\tau (e)) = i\}, \ \ \ \mathcal {L}_{2, i}= \{ e \in \mathcal {L}_2 \ | \ wd(\tau (e)) = i\} \end{aligned}$$
(2)

Subsequently, we bin each sub-log into hours with Eq. (1), using \(h(\tau (e))\), a function returning the hour of the day of a timestamp of event e, instead of \(dh(\tau (e))\) (in this way, all the timestamps recorded on any Monday between ‘10:00:00’ and ‘10:59:59’ will be placed in the same bin), obtaining \(x_{1, d}, \ldots , x_{B, d}\) and \(y_{1, d}, \ldots , y_{B, d}\) with \(d \in \{1, \ldots , 7\}\). Finally, the CED distance between \(\mathcal {L}_1\) and \(\mathcal {L}_2\) corresponds to the average of the EMD between \(x_{1, d}, \ldots , x_{B, d}\) and \(y_{1, d}, \ldots , y_{B, d}\) with \(d \in \{1, \ldots , 7\}\).
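A corresponding sketch of CED under the same assumptions; days with no events in either log are skipped for simplicity.

```python
import pandas as pd
from scipy.stats import wasserstein_distance

def circadian_event_distribution(log1, log2):
    """CED sketch: for each day of the week, compare the hour-of-day
    distributions of all timestamps, and average the distances."""
    def stamps(log):
        return pd.concat([log["start_time"], log["end_time"]])
    s1, s2 = stamps(log1), stamps(log2)
    distances = []
    for day in range(7):  # 0 = Monday, ..., 6 = Sunday
        x = s1[s1.dt.dayofweek == day].dt.hour
        y = s2[s2.dt.dayofweek == day].dt.hour
        if len(x) and len(y):
            distances.append(wasserstein_distance(x, y))
    return sum(distances) / len(distances)
```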

Relative Event Distribution (RED). Here, we wish to analyze the ability of the simulator to mimic the temporal distribution of events w.r.t. the origin of the case (i.e., the case arrival). To this end, given two event logs \(\mathcal {L}_1\) and \(\mathcal {L}_2\), we offset all log timestamps by their corresponding case arrival time (the first timestamp in a case is mapped to time 0, and every other timestamp to its elapsed time from the case arrival). Formally, let \(a(\xi (e)) = \min _{t'} \{ t' \ | \ t' = \tau _{start}(e') \wedge e' \in \mathcal {L} \wedge \xi (e') = \xi (e)\} \) be the arrival time of the case associated with an event in the log. Then, the relative event times \(\rho (e)\) are defined as,

$$\begin{aligned} \rho (e) = \tau (e) - a(\xi (e)), \end{aligned}$$
(3)

with \(\tau (e)\) denoting \(\tau _{start}(e)\) for start times and \(\tau _{end}(e)\) for end times. We apply Eq. (3) to the timestamps in \(\mathcal {L}_1\) and \(\mathcal {L}_2\) and, for each log, discretize the resulting \(\rho (e)\) into hourly bins (e.g., durations between 0 and 3,599 s go to the same bin). Finally, the RED distance between \(\mathcal {L}_1\) and \(\mathcal {L}_2\) corresponds to the EMD between the discretized \(\rho (e)\) of each log.
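A sketch of RED under the same assumptions, computing Eq. (3) per event and comparing the hourly-binned offsets with SciPy's 1WD in place of the full EMD.

```python
import pandas as pd
from scipy.stats import wasserstein_distance

def relative_event_distribution(log1, log2):
    """RED sketch: offset every timestamp by the arrival time of its
    case (Eq. 3), discretize into hourly bins, and compare with 1WD."""
    def relative_hours(log):
        # Arrival time a(xi(e)): earliest start timestamp in the case.
        arrival = log.groupby("case_id")["start_time"].transform("min")
        rel = pd.concat([log["start_time"] - arrival,
                         log["end_time"] - arrival])
        return (rel // pd.Timedelta(hours=1)).astype(int)
    return wasserstein_distance(relative_hours(log1), relative_hours(log2))
```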

4.3 Congestion Measures

To measure the capability of a model to represent congestion, we rely on queueing theory, a field in applied probability that studies the behavior of congested systems [25]. The workload in a queueing system is dominated by two factors: the arrival rate of cases over time, and the cycle time, i.e., the length of stay of a case in the system. Below, we propose two measures to compare these two workload components over pairs of event logs, by comparing the time series of arrivals and the distribution of cycle times (assuming that the variability of the latter is captured by the arrival time-series comparison).

Case Arrival Rate (CAR). This measure compares case arrival patterns (shape) and counts (number of arrivals per bin). Given two event logs \(\mathcal {L}_1\) and \(\mathcal {L}_2\), we use the function \(a(c), c \in C\), to obtain the sets of arrival timestamps of each log. Subsequently, we bin them using Eq. (1) (timestamps between ‘02/05/2022 10:00:00’ and ‘02/05/2022 10:59:59’ are placed in the same bin), obtaining two vectors \(x_1, \ldots , x_B\) and \(y_1, \ldots , y_B\) corresponding to the binned arrival timestamps of \(\mathcal {L}_1\) and \(\mathcal {L}_2\), respectively. Finally, the CAR distance between \(\mathcal {L}_1\) and \(\mathcal {L}_2\) corresponds to the EMD between \(x_1, \ldots , x_B\) and \(y_1, \ldots , y_B\).

Cycle Time Distribution (CTD). Here, we seek to measure the ability of the BPS model to capture the end-to-end cycle time of the process. Given two event logs \(\mathcal {L}_1\) and \(\mathcal {L}_2\), we collect all cycle times into a single histogram per log, which depicts their empirical probability distribution functions (PDF). The CTD distance between \(\mathcal {L}_1\) and \(\mathcal {L}_2\) corresponds to the 1WD between both histograms.
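Both congestion measures can be sketched under the same assumptions as above. Again we substitute SciPy's 1WD for the EMD; for CAR this coincides with the EMD when both logs contain the same number of cases (equal masses, cf. Sect. 3.2).

```python
import pandas as pd
from scipy.stats import wasserstein_distance

def case_arrival_rate(log1, log2):
    """CAR sketch: one arrival timestamp per case (earliest start),
    binned by hour since a common origin, compared with 1WD."""
    a1 = log1.groupby("case_id")["start_time"].min()
    a2 = log2.groupby("case_id")["start_time"].min()
    origin = min(a1.min(), a2.min())
    def to_bins(a):
        return ((a - origin) // pd.Timedelta(hours=1)).astype(int)
    return wasserstein_distance(to_bins(a1), to_bins(a2))

def cycle_time_distribution(log1, log2):
    """CTD sketch: one cycle time per case (last end minus first
    start), in hours, compared with 1WD."""
    def cycle_times(log):
        g = log.groupby("case_id")
        return (g["end_time"].max() - g["start_time"].min()) \
               / pd.Timedelta(hours=1)
    return wasserstein_distance(cycle_times(log1), cycle_times(log2))
```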

5 Evaluation

We report on a two-fold experimental evaluation. The first part aims to validate the applicability of the proposed measures by testing the following evaluation question: are the proposed measures able to discern the impact of different known modifications to a BPS model? (EQ1). Given the potential efficiency issues of CFLD, the first part of the evaluation also aims to answer the question: is the N-Gram Distance’s performance significantly different from the CFLD’s performance? (EQ2). The second part of the evaluation is designed to test the following: given two BPS models discovered by existing automated BPS model discovery techniques in real-life scenarios, are the proposed measures able to identify the strengths and weaknesses of each technique? (EQ3). Given the complexity of the EMD (cf. Sect. 3), the second part of this evaluation also focuses on answering: does the 1WD report the same insights in real-life scenarios as the EMD? (EQ4).

In the case of the NGD, we report on this measure for size \(n=2\). The distance computed by the EMD is not directly interpretable, as it is an absolute number on a scale that depends on the range of values of the input time series. Accordingly, we divide the raw EMD by the number of observations in the original log. In this way, we can interpret the resulting scaled-down EMD as the average number of bins that each observation of the original log must be moved to transform it into the simulated log. For example, a value of 10 implies that, on average, each observation had to be moved 10 bins.
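For instance, with invented numbers:

```python
# Illustrative numbers only: scaling the raw EMD by the number of
# observations in the original log yields the average number of
# hourly bins each observation must be moved.
raw_emd = 12_500.0      # hypothetical raw EMD value
n_observations = 1_250  # hypothetical number of events in the original log
print(raw_emd / n_observations)  # 10.0 -> ~10 bins per observation on average
```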

5.1 Synthetic Evaluation

Datasets. To assess EQ1 and EQ2, we manually created the BPS model of a loan application process based on the examples from [9, Chapter 10.8]. The process comprises 12 activities (with one loop, a 3-branch parallel structure, 3 exclusive split gateways, and 3 possible endings) and 6 different resource types (performing different activities with a working schedule from Monday to Friday, from 9am to 5pm). We simulated a log of 1,000 cases as the log recording the process (i.e., the ALog). We created 7 modifications of the original BPS model: i) altering the control-flow by arranging the parallel activities as a sequence (\(\text {Loan}_{SEQ}\)); ii) altering, on top of the previous modification, the branching probabilities (\(\text {Loan}_{S\text {-}G}\)); iii) modifying the rate of case arrivals (\(\text {Loan}_{ARR}\)); iv) increasing the duration of the activities of the process (\(\text {Loan}_{DUR}\)); v) halving the available resources to create resource contention (\(\text {Loan}_{RC}\)); vi) changing the resource working schedules from 9 am–5 pm to 2 pm–10 pm (\(\text {Loan}_{CAL}\)); and vii) adding timer events to simulate extraneous waiting time [8] delaying the start of 4 of the activities (\(\text {Loan}_{EXT}\)).

We simulated \(K=10\) logs (as the GLogs) with 1,000 cases for each altered BPS model. Table 2 shows the results of the proposed measures for each modified scenario, and for the original BPS model as ground truth (\(\text {Loan}_{GT}\)) to measure the distance associated with the stochastic nature of the simulation.

Table 2. Results (average and 95% confidence interval) of the proposed measures for the original and modified BPS models of a loan application process.

Results and Discussion. Regarding EQ1, Table 2 shows how the proposed measures appropriately penalize the BPS models for the modifications affecting their corresponding perspectives. In the control-flow measures, the BPS models showing significant differences w.r.t. the ground truth are those with control-flow modifications. The distances of \(\text {Loan}_{RC}\) and \(\text {Loan}_{SEQ}\) are explained by the parallel activities being executed more frequently in a specific order: in the first BPS model, due to resource contention, which delays the execution of one of the parallel activities in some cases; in the second one, due to the control-flow modification. Finally, \(\text {Loan}_{S \text {-} G}\) reports the highest distance as, in addition to the modification in \(\text {Loan}_{SEQ}\), it also alters the frequency of each process variant.

For the temporal measures, the AED distance captures the difference in the distribution of events along the entire process. However, to identify the sources of these differences, we require a combination of the penalties incurred by CED, RED, and CAR; thus, we must analyze them to find the root causes of the discrepancies in AED. Starting with the seasonal aspects captured by CED, only \(\text {Loan}_{S \text {-} G}\) and \(\text {Loan}_{CAL}\) report significant differences, the latter being the only BPS model altering seasonal aspects. \(\text {Loan}_{S \text {-} G}\)’s distance is due to the change in the gateway probabilities, which in turn impacts the overall distribution of executed events. As expected, \(\text {Loan}_{CAL}\) presents the highest CED distance due to the change in schedules, which displaces executed events from morning to evening.

Moving to RED, which reports the distance in the distribution of events over time within each case, we observe that all modifications except \(\text {Loan}_{CAL}\) affect this perspective. The slightly higher penalization of \(\text {Loan}_{ARR}\) is due to the higher case arrival rates, which delay the start of activities due to resource contention. \(\text {Loan}_{SEQ}\) presents a higher distance (close to a displacement of 2 h per event) as the three parallel activities are executed as a sequence, delaying subsequent activities. Similarly, in \(\text {Loan}_{DUR}\), \(\text {Loan}_{EXT}\), and \(\text {Loan}_{RC}\), activity delays are caused by longer durations, extraneous delays, and resource contention waiting times, respectively. Finally, \(\text {Loan}_{S \text {-} G}\) presents the highest RED distance due to the large differences in the frequency of each process variant.

Switching to CAR, we do not observe significant differences in BPS models that exhibit the same arrival rate, except for \(\text {Loan}_{CAL}\). The latter is explained by the change in schedules, as cases cannot start until the resources start their working period (which skews effective start times). Unsurprisingly, for \(\text {Loan}_{ARR}\), the difference in CAR is due to the change in the arrival model.

Finally, the last proposed measure is CTD, which reports the distance in case duration among all the cases. The results of CTD follow a pattern similar to RED (yet, with different values), since cycle times correspond to the time distance between the first and last events of a case. However, this correlation might not hold across all scenarios. Specifically, if the distribution of executed activities in the middle of each case differs, but the last event does not change, RED would detect discrepancies that CTD would not (as the cycle time would remain the same). Thus, CTD is most relevant when the analysis revolves around total cycle times, while disregarding the temporal distribution of events within the case.

To answer EQ2, we computed the Kendall rank correlation coefficient between NGD and CFLD, and we obtained a correlation of 1.0. Thus, in light of the complexity of CFLD (cf. Sect. 4), we recommend using NGD to assess the quality of a BPS model from the control-flow perspective.

5.2 Real-Life Evaluation

Datasets. To evaluate EQ3 and EQ4, we selected four real-life logs of different complexities: i) a log from an academic credentials’ management process (AC_CRE), containing a high number of resources exhibiting low participation in the process; ii) a log of a loan application process from the Business Process Intelligence Challenge (BPIC) of 2012, which we preprocessed by retaining only the events corresponding to activities performed by human resources (i.e., only activity instances that have a duration); iii) the log from the BPIC of 2017, which we preprocessed following the recommendations reported by the winning teams participating in the competition; and iv) a log from a call centre process (CALL) containing numerous cases of short duration – on average, two activities per case. To avoid data leakage, we split the log of each dataset into two sets (training and testing), corresponding to disjoint (non-overlapping) time intervals with similar case and event intensity. The training dataset contains the cases fully contained in the training period, and likewise for the testing dataset. Table 3 shows the characteristics of the four training and four testing event logs. For each dataset, we ran two automated BPS model discovery techniques (SIMOD and ServiceMiner) on the training log, and evaluated the quality of the discovered BPS models on the test log.

Table 3. Characteristics of the real-life logs used in the evaluation.
Table 4. Distance measures for the BPS models discovered by SIMOD and ServiceMiner on the logs in Table 3. The CFLD ran out of memory (48 GB of allocated memory) on the CALL dataset after > 2 h, thus no values are reported in those cells.

Results and Discussion. Regarding EQ3, Table 4 shows the results of the proposed measures for the BPS models automatically discovered by SIMOD and ServiceMiner (henceforth \(M_{SIMOD}\) and \(M_{SerMin}\), respectively). From the control-flow perspective, \(M_{SerMin}\) performs closer to the original log than \(M_{SIMOD}\) for all four datasets. The reason lies in the methods that the approaches use to model the control-flow. SIMOD is designed to discover an interpretable process model to support modification for what-if analyses. To this end, SIMOD uses a model discovery algorithm that applies multiple pruning techniques to simplify the discovered model. Conversely, ServiceMiner discovers a Markov chain, which yields more accurate results, yet can lead to complex ‘spaghetti models’.

Two main differences are reported w.r.t. the temporal and congestion aspects. First, for the Case Arrival Rate (CAR), \(M_{SerMin}\) presents better results in BPIC12, BPIC17, and CALL, while \(M_{SIMOD}\) outperforms it in AC_CRE. To model the arrival of new cases, ServiceMiner splits the timeline into one-hour windows and bootstraps the arrivals per time window, whereas SIMOD computes the inter-arrival times (i.e., the time between each arrival and the next one) and estimates a parametrized distribution to model them. The complexity of ServiceMiner’s arrival model allows it to better capture the arrival rate in scenarios where the density of case arrivals per hour is high, and/or the rate of arrivals varies through time (BPIC12, BPIC17, and CALL). On the contrary, if cases are scattered over time (AC_CRE), SIMOD’s approach presents a better result.

The second main difference lies in the Relative Event Distribution (RED) and the Cycle Time Distribution (CTD) distances. Here, \(M_{SIMOD}\) obtains better results in both measures, except for the RED distance in the CALL dataset, where \(M_{SerMin}\) obtains a smaller value (both methods perform well w.r.t. the original log). SIMOD outperforms ServiceMiner due to the high amount of extraneous activity delays (i.e., waiting times not related to resource allocation or activity performance) exhibited in these processes. Specifically, SIMOD includes a component to discover extraneous delays, which improves the distribution of the events within the case. Both techniques perform close to the original log in the CALL dataset because extraneous delays are rare in the call centre process.

For seasonality, the Circadian Event Distribution (CED) reports slight differences between the two methods for AC_CRE and BPIC17, and a moderately better result for \(M_{SIMOD}\) in BPIC12. The CALL dataset presents the highest difference, where \(M_{SerMin}\) obtains better results, which can be attributed to its highly accurate arrival model. The CALL dataset has mostly cases with one or two events. Hence, case execution depends more on the arrival time of the case, than on the activity performance and congestion models.

Combining all the temporal perspectives in one measure, the results of the Absolute Event Distribution (AED) follow the same pattern as CAR, with \(M_{SerMin}\) presenting better results in BPIC12, BPIC17, and CALL, and \(M_{SIMOD}\) performing better in AC_CRE. Although this measure summarizes all the temporal performance in one value, it is highly affected by the performance of the arrival model: a wrong arrival rate propagates the error to all the events of a case, displacing them even if their relative distribution is accurate.

The proposed measures detected key differences between the considered BPS model discovery techniques. Additionally, our results can help to identify potential improvements in these techniques. SIMOD’s inferior performance in the control-flow perspective is expected, given that it takes a simplified process model as input. Moreover, there is a natural fit between the control-flow measures (e.g., NGD) and the Markovian approach of ServiceMiner, as a Markov chain is, in essence, a generative 2-gram model. The results also highlight the benefits of SIMOD’s extraneous waiting time discovery component (a feature that ServiceMiner does not have). Finally, although ServiceMiner’s arrival model achieved the best results in most of the scenarios, the evaluation in AC_CRE points towards an improvement opportunity in situations where cases arrive at a slow rate.

Table 5. Results of the proposed measures for the BPS models discovered by SIMOD and ServiceMiner with the real-life logs in Table 3.

To evaluate EQ4, Table 5 shows the results of the AED, CAR, CED, and RED measures when computing the distance with 1WD instead of EMD. The results follow the same pattern in all cases, except for the CED measure on the BPIC17 dataset and the RED measure on the CALL dataset; in both cases, the slight differences reported by the EMD shrink, with both techniques obtaining similar values. For arrivals, as explained in Sect. 3, computing the distance with EMD and 1WD provides the same result, as the number of observations in both samples is the same (i.e., the number of cases). In conclusion, computing the distance using 1WD leads to similar conclusions at a lower computational cost. Thus, we recommend using 1WD when the masses of both time series are close to each other and the number of observations (amount of mass) is large.

5.3 Threats to Validity

The evaluation reported above is potentially affected by the following threats to validity. First, regarding internal validity, the experiments rely on only 8 BPS models of one synthetic process, and 8 automatically discovered BPS models from 4 real-life processes; the results could be different for other datasets. Second, regarding external validity, the evaluation was conducted with real-life event logs from processes of different domains; however, the results might not generalize to processes from domains with characteristics unseen in this evaluation. Third, regarding construct validity, we proposed a set of quality measures based on discretized distributions and time series; the results could be different for other measures. Finally, regarding ecological validity, the evaluation compares the BPS models against the original log. While this allows us to measure how well the simulation models replicate the as-is process, it does not allow us to assess the goodness of the simulation models in a what-if setting, e.g., predicting the performance of the process after a change.

6 Conclusion

We proposed a multi-perspective approach to measure the ability of a BPS model to replicate the behavior recorded in an event log. The approach decomposes simulation models into three perspectives: control-flow, temporal, and congestion. We defined measures for each of these perspectives, and evaluated their adequacy by analyzing their ability to discern the impact of modifications to a BPS model. The results showed that the measures are able to detect the alterations in their corresponding perspectives. Furthermore, we analyzed the usefulness of the measures in real-life scenarios w.r.t. their ability to uncover the relative strengths and weaknesses of two approaches for the automated discovery of BPS models. The findings showed that, beyond capturing the quality of a BPS model and identifying the sources of discrepancies, the measures can also assist in eliciting areas for improvement in these techniques. Finally, as some of the proposed measures have a higher computational cost, we evaluated more efficient alternatives, finding that they perform similarly to the computationally heavier ones.

In future work, we will explore the applicability of the proposed measures to other process mining problems, e.g., concept drift detection and variant analysis. Studying how to assess the quality of BPS models in the context of object-centric event logs is another future work avenue. Lastly, we aim to study other quality measures for BPS models adapted from the field of generative machine learning, for example, by using a discriminative model that attempts to distinguish between data generated by the BPS model and real data.

Reproducibility. The scripts to reproduce the experiments, the datasets, and the results are publicly available at: https://doi.org/10.5281/zenodo.7761252. The measures have been implemented as a Python package (log-distance-measures) installable from pip, and the code is publicly available at: https://github.com/AutomatedProcessImprovement/log-distance-measures.