1 Introduction

Clustering data is a widely used technique across various fields such as machine learning [9], bioinformatics [14, 30], information retrieval [6], and text mining [2]. It helps us to discover patterns in data by grouping together supposedly similar data instances. However, explaining a clustering, which can be conceived as a set of clusters, can be challenging because the concept of a cluster is often fuzzy [8]. It might not be obvious to us why certain data instances belong to one particular cluster instead of another one. Furthermore, it might not be clear what the underlying concept of a cluster actually is. So instead of focusing on a cluster and its associated data instances as a whole, we could rather try to identify representative examples for this cluster. In the context of explaining raw data sets, such representative examples are typically known as prototypes and criticisms. A prototype is a data instance that best describes a data set [12], and a criticism is a data instance that provides insights into parts of the data set that prototypes do not explain well [17]. The use of criticisms in addition to prototypes is beneficial because it has been shown that criticisms can simplify human understanding and reasoning [17]. We, however, examine whether prototypes and criticisms can be directly inferred from clusterings so that the associated clusters can be explained. This is the major motivation for this work.

Digital public participation processes enable individuals to engage in different areas of public life, such as city budget planning or the planning of public green spaces. Their goal is to ensure that diverse opinions can be heard and considered in the design and decision-making processes [5]. This might lead to more acceptance of and trust in the final planning result. Different groups participate in digital public participation processes. Two main groups are citizens and public administrations. While citizens submit contributions in order to voice their ideas or concerns, public administrations have to assess these contributions, typically by accepting or rejecting them. In the end, a digital public participation process yields a collection of contributions to explore. However, this exploration is challenging. Citizens, on the one hand, might want to find out whether their opinion is already shared by others, so they look for similar contributions. This can be cumbersome. Public administrations, on the other hand, need to be consistent in deciding whether a contribution is accepted or rejected, i.e., when they reject one particular contribution, they should also reject all other contributions that have the same meaning. Thus, public administrations might also look for similar contributions. This is an elaborate endeavor. In this regard, clustering can help to group together similar contributions. However, as we have already pointed out, the interpretation of clusterings is challenging. That is why we examine prototypes and criticisms for clusterings. We hypothesize that the clustering of contributions and the use of prototypes and criticisms for explaining the associated clusters can support citizens and public administrations in digital public participation processes. Thus, this work is also motivated by demands from practice.

In the remainder of this paper, we present related work first (Sect. 2). We then describe a novel centroid-based clusterings method for finding prototypes and criticisms (Sect. 3). This method is a generalization of the \(k\)-medoids [15] algorithm when used for finding prototypes in raw data sets. It retrieves multiple prototypes and criticisms per cluster and infers them directly from the clustering. We also briefly explain the key idea of the MMD-critic algorithm [17], which can be considered a general baseline for computing prototypes and criticisms (Sect. 3). The main part of this paper is about a user study that we conducted with 21 participants to evaluate our centroid-based clusterings method and the MMD-critic algorithm for finding prototypes and criticisms in clustered contributions of digital public participation processes. In this regard, we describe the associated experiments (Sect. 4), and we present the results of the user study (Sect. 5). Finally, we conclude our research and outline potential future work (Sect. 6).

In summary, our main contributions are (1) the introduction of the novel centroid-based clusterings method for finding prototypes and criticisms in clusterings, (2) the idea of applying the MMD-critic algorithm to clusterings, (3) the application of the previous two methods as well as other, naive methods to text data from digital public participation processes of the e-participation domain, and (4) the evaluation of all methods in a user study.

2 Related Work

This work is part of the broad field of interpretable machine learning. The need for methods in this field has long been recognized [1, 18], and research activities in this area have been increasing for years [7]. One specific subarea of interpretable machine learning concerns example-based explanations such as prototypes and criticisms.

In psychology, the concept of prototypes refers to the idea that certain stimuli or objects serve as the best or most representative examples of a particular category. These prototypes are thought to be stored in memory and used as a reference point for making judgments about other stimuli or objects that belong to the same category [25]. The underlying prototype theory posits that people form mental representations of prototypes based on the most representative examples of a category [29]. People tend to rate category members that are similar to the prototype as more typical or representative of the category. The distance from an example to a prototype also affects the human categorization process, i.e., the nearer an example is to the prototype, the more likely it is for this example to be categorized in that prototype’s category [13, 21]. The related exemplar theory even considers multiple representative examples of a category to extract important properties of the category [22]. The inclusion of edge cases can provide valuable information about the boundaries and limits of a category [17].

Considering the computation of prototypes for raw data sets, there is the \(k\)-medoids [15] clustering algorithm. It assigns all data instances of a raw data set to different clusters. Each cluster then contains a medoid, which can be used as a prototype for this particular cluster. However, the \(k\)-medoids algorithm outputs only one prototype per cluster. This might not be enough for explaining a cluster according to the exemplar theory. Other related clustering algorithms such as the \(k\)-means [19, 20] and fuzzy \(c\)-means [4] algorithms do not guarantee to return a medoid. They provide cluster centroids that typically are not actual data instances of the data set, whereas prototypes and criticisms must be actual data instances. Furthermore, there is no concept of criticisms in any of the previously mentioned clustering algorithms. We, in contrast, consider the computation of multiple prototypes and criticisms per cluster and use them to explain the individual clusters of a clustering.

There are also the MMD-critic [17] and ProtoDash [12] algorithms for computing prototypes and criticisms. We briefly explain the MMD-critic algorithm in Sect. 3. The ProtoDash algorithm is a generalization of the MMD-critic algorithm because it additionally computes importance weights for each prototype. These methods are model-agnostic [23], i.e., in our context, they can be applied to raw data sets and to individual clusters of a clustering, although we are not aware of this having been done before. However, these methods do not operate directly on the clusterings but instead perform extensive additional computations on top of them. To the best of our knowledge, there is no other (model-specific) method that directly infers prototypes and criticisms from clusterings. At the same time, we observe that these methods have been applied to image data rather than text data [17]. Additionally, evaluations of these methods in user studies appear to be rare.

Referring briefly to digital public participation processes, there exist general applications of natural language processing and machine learning methods that aim to provide sophisticated computer-aided support for the participants. For example, citizens are clustered to connect citizens with common interests, or comprehensive contributions are summarized to provide an overview of the contents [3]. Furthermore, certain approaches exist which utilize similarity-based ranking methods to aid public administrations in assessing contributions more effectively [26]. There is also work that considers the clustering of contributions [27]. However, that particular work focuses on comparing such clusterings. We focus on providing representative examples for a clustering instead. Furthermore, there are semi-supervised approaches to automatically categorize contributions by their topics while reducing the efforts needed to label the contributions [24]. However, the interpretation of the methods used is not covered at all. There is also a broader perspective that considers the visual analytics [16] approach including machine learning methods for digital public participation processes [28], e.g., for decision-making. However, no promising machine learning algorithm is described, and there is no relation to or mention of interpretable machine learning. We, however, consider it important in that domain.

3 Prototypes and Criticisms for Clusterings

We examine two primary methods for identifying prototypes and criticisms in clusterings as a foundation for our user study. On the one hand, we introduce a centroid-based clusterings method. On the other hand, we briefly examine the MMD-critic algorithm as a baseline. Both methods operate under the assumption that we have a pre-existing clustering of data, such as one obtained through the \(k\)-means algorithm. Our goal is to present this clustering to a user by focusing on prototypes and criticisms. To achieve this, we adapt existing approaches.

3.1 Centroid-Based Clusterings Method

We adapt the idea of inferring prototypes directly from clusterings, similar to using the medoids of a \(k\)-medoids clustering as prototypes. However, we introduce a novel generalization of this idea in three aspects. First, we take clustering algorithms into account that do not directly yield actual data instances from the data set as prototypical examples. Second, we consider the selection of more than one prototype for every cluster of a clustering. Third, we also allow the retrieval of criticisms for each cluster.

Consider a set of data instances \(X = \{\textbf{x}_1, \textbf{x}_2, \ldots , \textbf{x}_n\}, \textbf{x}_i \in \mathbb {R}^m\). We use the partitioning \(\{X_1, X_2, \ldots , X_k\}\) of \(X\) into \(k\) disjoint subsets \(X_1, X_2, \ldots , X_k\) to denote a clustering of \(X\). The subsets \(X_1, X_2, \ldots , X_k\) represent the associated clusters. Every cluster \(X_i\) then contains at least one representative data instance which we denote by \(\textbf{p}_{i, 1}, 1 \le i \le k\). This is the first prototype of the \(i\)-th cluster. Using a \(k\)-medoids clustering, for example, \(\textbf{p}_{i, 1}\) is the medoid of the \(i\)-th cluster.

For all centroid-based clustering algorithms such as \(k\)-means, we retrieve the centroid \(\boldsymbol{\upmu }_i = \frac{1}{|X_i|}\sum _{\textbf{x} \in X_i}^{}{\textbf{x}}\) for every cluster \(X_i\) through the clustering algorithm. If this centroid is equal to an actual data instance from the data set, we consider it as the first prototype for the particular cluster, i.e., \(\textbf{p}_{i, 1} = \boldsymbol{\upmu }_i\). Otherwise, we select the nearest data instance to \(\boldsymbol{\upmu }_i\) as the first prototype. This is our first generalization aspect. Through this small adaptation, we can use any centroid-based clustering algorithm as a basis.

The second generalization aspect is about choosing \(p\) prototypes for every cluster \(X_i\) rather than being limited to only one prototype, \(1 \le p \le \min _{1 \le i \le k}|X_i|\). This is motivated by the exemplar theory. For this purpose, we consider the next \(p-1\) nearest data instances to \(\boldsymbol{\upmu }_i\) as the next prototypes \(\textbf{p}_{i, 2}, \textbf{p}_{i, 3}, \ldots , \textbf{p}_{i, p}\). In case of a tie, the selection can be made randomly.

For the computation of criticisms, we adopt a similar approach as for the selection of prototypes. However, we consider the \(c\) most distant data instances to \(\boldsymbol{\upmu }_i\) as the criticisms \(\textbf{c}_{i, 1}, \textbf{c}_{i, 2}, \ldots , \textbf{c}_{i, c}\), \(1 \le c \le \min _{1 \le i \le k}|X_i|\), with \(\textbf{c}_{i, 1}\) being the most distant criticism, \(\textbf{c}_{i, 2}\) being the second most distant criticism, and so on. This is our third generalization aspect. Figure 1 shows an example.

Fig. 1. Example of our centroid-based clusterings method with a different number of prototypes and criticisms per cluster. Prototype \(\textbf{p}_{i, j}\) is the \(j\)-th prototype of cluster \(i\), criticism \(\textbf{c}_{i, j}\) is the \(j\)-th criticism of cluster \(i\), and \(\boldsymbol{\upmu }_i\) is the centroid of cluster \(i\).
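To make the selection rule concrete, the following minimal sketch shows how prototypes and criticisms could be extracted from a \(k\)-means clustering. It assumes Euclidean distances and scikit-learn's KMeans; the function name and the random example data are illustrative choices of ours, not part of the method description above.

```python
import numpy as np
from sklearn.cluster import KMeans

def prototypes_and_criticisms(X, labels, centroids, p=2, c=2):
    """For each cluster, return the p instances nearest to the centroid
    (prototypes) and the c instances farthest from it (criticisms)."""
    result = {}
    for i, mu in enumerate(centroids):
        members = np.where(labels == i)[0]               # indices of cluster i
        dists = np.linalg.norm(X[members] - mu, axis=1)  # distance to centroid
        order = np.argsort(dists)                        # ascending by distance
        result[i] = {
            "prototypes": members[order[:p]],            # p_{i,1}, ..., p_{i,p}
            "criticisms": members[order[::-1][:c]],      # c_{i,1}, ..., c_{i,c}
        }
    return result

# Example with a k-means clustering of hypothetical embedding vectors.
X = np.random.rand(100, 300)
km = KMeans(n_clusters=10, n_init=10).fit(X)
examples = prototypes_and_criticisms(X, km.labels_, km.cluster_centers_, p=2, c=2)
```

The same helper works with any clustering algorithm that exposes cluster labels and centroids, which reflects the first generalization aspect described above.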

3.2 MMD-Critic for Clusterings

We briefly explain the MMD-critic algorithm [17] because we consider it as a model-agnostic baseline in our user study. While this algorithm has been successfully applied to raw data sets, we exclusively apply it to clusterings instead.

The MMD-critic algorithm relies on the maximum mean discrepancy (MMD) between two distributions for identifying prototypes. In our context, we need to consider a distribution of prototypes and a distribution of data instances. The MMD represents how much these distributions differ from each other. The MMD can be estimated by the empirical MMD \(\hat{f}\) using the set of \(n\) data instances \(X\), the set of \(p\) prototypes \(P\), and a kernel function \(k\) as stated in (1) [11].

$$\begin{aligned} \hat{f}(X, P) = \left[ \frac{1}{n^2}\sum _{i, j=1}^{n}{k(\textbf{x}_i, \textbf{x}_j)} + \frac{1}{p^2}\sum _{i, j=1}^{p}{k(\textbf{p}_i, \textbf{p}_j)} - \frac{2}{np}\sum _{i, j=1}^{n, p}{k(\textbf{x}_i, \textbf{p}_j)}\right] ^{\frac{1}{2}} \end{aligned}$$
(1)

The MMD-critic algorithm greedily selects \(p\) prototypes. Each data instance is examined and evaluated as a potential prototype until \(p\) prototypes have been found. For this purpose, the squared empirical MMD is calculated once when the particular data instance would be added as a prototype and once when it is not considered as a prototype. The data instance for which the difference between these two values is the lowest is selected as the next prototype, because then the discrepancy between the prototypes and the data instances has been reduced the most. This approach can be applied to every cluster of a clustering.
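The following sketch is one possible reading of this greedy selection, using an RBF kernel and plain NumPy/scikit-learn; it is a simplified illustration, not the reference implementation of the MMD-critic algorithm [17].

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def squared_mmd(K, proto_idx):
    """Squared empirical MMD between all instances and the chosen prototypes,
    given the precomputed kernel matrix K (cf. Eq. (1))."""
    K_pp = K[np.ix_(proto_idx, proto_idx)]
    K_xp = K[:, proto_idx]
    return K.mean() + K_pp.mean() - 2.0 * K_xp.mean()

def select_prototypes(X, p, gamma=None):
    """Greedily pick p prototypes; in each round, the candidate whose addition
    yields the lowest squared MMD reduces the discrepancy the most."""
    K = rbf_kernel(X, gamma=gamma)
    selected = []
    for _ in range(p):
        candidates = [j for j in range(len(X)) if j not in selected]
        best = min(candidates, key=lambda j: squared_mmd(K, selected + [j]))
        selected.append(best)
    return selected
```

Note that the first term of (1) does not depend on the prototypes, so only the last two terms would actually need to be recomputed per candidate; the full expression is kept here for clarity.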

The criticisms are selected based on the witness function \(g(\textbf{x})\) that computes how much two distributions differ at a specific data instance \(\textbf{x}\). In our context, we consider the distribution of prototypes and the distribution of data instances. Its empirical estimate \(\hat{g}(\textbf{x})\), given a set of \(n\) data instances \(X\), a set of \(p\) prototypes \(P\), and a kernel function \(k\), is stated in (2) [11].

$$\begin{aligned} \hat{g}(\textbf{x}) = \frac{1}{n}\sum _{i=1}^{n}{k(\textbf{x}, \textbf{x}_i)} - \frac{1}{p}\sum _{i=1}^{p}{k(\textbf{x}, \textbf{p}_i)} \end{aligned}$$
(2)

The MMD-critic algorithm also identifies the criticisms greedily. Each data instance is evaluated again; however, now the data instance with the highest witness function value is selected as the next criticism. This procedure is repeated until \(c\) criticisms have been found, and it can be applied to each cluster of a clustering.
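A corresponding sketch for the criticism selection based on the empirical witness function in (2); the RBF kernel is again our assumption, and any additional regularization used in [17] is omitted.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def select_criticisms(X, prototype_idx, c, gamma=None):
    """Pick the c instances with the highest witness value (cf. Eq. (2)).

    Without a diversity regularizer, the witness values do not change as
    criticisms are added, so the greedy loop reduces to taking the top c."""
    K = rbf_kernel(X, gamma=gamma)
    witness = K.mean(axis=1) - K[:, prototype_idx].mean(axis=1)
    candidates = [j for j in range(len(X)) if j not in prototype_idx]
    ranked = sorted(candidates, key=lambda j: witness[j], reverse=True)
    return ranked[:c]
```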

4 Experiments

We conducted a user study to evaluate our centroid-based clusterings method and the MMD-critic algorithm for finding prototypes and criticisms in clustered contributions. We also wanted to find out whether these methods are suitable for text data in general, as previous research in this area has primarily focused on image data. The data used in the user study were contributions from past, real-life digital public participation processes. This is our specific application scenario. In the following, we provide details about the participants, data set, task, design and procedure, and research questions and measures.

4.1 Participants

After contacting students and employees at our institution, we were able to recruit a total of \(21\) participants (ten female, eleven male) for our user study. In the post-experiment questionnaire, the participants reported their ages within the following groups: ten participants were 21–30 years old, six participants were 31–40 years old, two participants were 41–50 years old, and three participants were 51–60 years old. The participants were also asked to rate their proficiency as computer users. None of them considered themselves inexperienced, and none considered themselves beginners. Eleven participants reported having average experience with computers, while ten participants considered themselves advanced computer users. The participants had diverse educational backgrounds. Among them were two computer science professors, eight data science students, two employees with degrees in marketing, and one employee each with degrees in German studies and marketing, office administration, and IT administration. Additionally, six participants were employees with degrees in computer science.

Fig. 2. Two sample contributions of the data set used in the user study. The literal translations of the contents are shown.

4.2 Data Set and Task

We merged two data sets of contributions from past, real-life digital public participation processes into one larger data set to be used in the user study. The contributions concern either ideas for a regional development planning project or complaints about city noise sources. The contributions are in German. Each contribution belongs to exactly one of ten categories. The literal translations of these categories are: (1) residence, (2) city equipment, (3) pedestrian and bicycle paths, (4) miscellaneous, (5) play/sport/exercise, (6) vegetation, (7) aviation noise, (8) train noise, (9) miscellaneous noise, and (10) traffic noise. We considered these categories as the ground truth for the clustering of the contributions, i.e., every category represents a single cluster. This allowed us to bypass both the manual and the computer-assisted clustering of the contributions, reducing the potential for inaccuracies. However, the participants in the user study were unaware of the existence of these categories and their origin. To provide a clearer understanding of the contributions, we show two sample contributions with their literal translations in Fig. 2.

The participants were tasked with repeatedly assigning a reference contribution to one of three groups of other contributions. For each reference contribution, they were told to find the best fitting group of similar contributions. We hypothesize that it should be easier to assign a reference contribution to a cluster when the cluster is well-explained. We explain a cluster by selecting a subset of contributions from the cluster.

We used four different methods to populate the clusters with contributions: (1) the MMD-critic algorithm (MMD), (2) our centroid-based clusterings method (CEN), (3) a random method (RND), and (4) a raw method (RAW). The MMD and CEN methods find prototypes and criticisms in a cluster to populate that cluster, whereas the RND method randomly selects data instances from the cluster without any specific selection criteria for prototypes and criticisms. The RAW method simply includes all contributions of a cluster.

We exclusively considered the contents of the contributions. Before applying the MMD-critic algorithm and our centroid-based clusterings method to determine the prototypes and criticisms for each cluster, we preprocessed the contributions by tokenizing the content and creating averaged word embeddings using pre-trained models [10]. For the remaining two methods, we simply queried the original data set. Except for the raw method, we always chose two prototypes and two criticisms per cluster, motivated by the exemplar theory. The resulting contributions were arranged vertically for each cluster, starting with the prototypes. For display purposes, we limited each contribution to a maximum of five lines so that reading long texts would not take up too much time per participant.
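As an illustration of this preprocessing step, the sketch below computes averaged word embeddings with spaCy. The model name de_core_news_md is a hypothetical choice; the paper only refers to pre-trained models [10] without naming a specific library or model.

```python
import numpy as np
import spacy

# German pipeline with word vectors; the model name is an assumption.
nlp = spacy.load("de_core_news_md")

def embed(contribution: str) -> np.ndarray:
    """Tokenize a contribution and return the average of its token vectors."""
    doc = nlp(contribution)
    vectors = [tok.vector for tok in doc if tok.has_vector and not tok.is_punct]
    return np.mean(vectors, axis=0) if vectors else doc.vector

# One row per contribution; this matrix is then clustered and passed to the
# prototype and criticism methods described in Sect. 3.
texts = ["Mehr Grünflächen im Park.", "Der Fluglärm ist nachts zu laut."]
embeddings = np.vstack([embed(text) for text in texts])
```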

4.3 Design and Procedure

The user study employed a within-subject design. We presented the four methods in a balanced Latin block design to avoid potential bias due to usage order. Every participant completed the task three times for each method, and the trials were averaged per method for the final analysis. The reference contribution and groups were varied across trials to eliminate potential memorization effects. Importantly, the participants were not informed about the number of categories or their labels, nor were they aware of the method used for each trial.

In the beginning (phase 1), we explained the purpose of the experiment and the task to the participants. The participants were informed that they would perform the task multiple times with breaks in between. The participants were encouraged to solve the task as quickly and accurately as possible.

Fig. 3. Graphical user interface of the user study system consisting of the reference contribution (1), the three groups with different contributions aligned in a column layout (2–4) (each group represents a cluster by displaying selected prototypes and criticisms), and the button group (5) (one button for each group) for selecting the alleged group of the reference contribution.

After completing phase 1, participants received a brief, self-paced tutorial that introduced them to the basic graphical elements and layout of the user study system (phase 2). Figure 3 shows the graphical user interface of the user study system. The tutorial emphasized key elements by highlighting them and providing text descriptions via tooltips. Participants learned about the reference contribution (cf. (1) in Fig. 3), the three groups of contributions arranged in a three-column layout (cf. (2)–(4) in Fig. 3), and the button group containing three buttons (cf. (5) in Fig. 3), each corresponding to a group of contributions. Participants were reminded that their decisions were final, meaning that once a button was clicked, the task would be completed and the contribution would be assigned irrevocably.

After completing phase 2, the actual experiment began (phase 3). The participants solved the task, with the reference contribution randomly chosen from one of the three selected clusters. Upon completion, they were required to take a mandatory break of at least ten seconds, which was enforced by the user study system (phase 4). During this break, participants were asked to rate the perceived difficulty of the previous task on a 5-point Likert scale, with 1 indicating “very easy” and 5 indicating “very difficult”. Once they had answered this question and the break had ended, the participants clicked a button to proceed to the next task. Phases 3 and 4 were repeated eleven more times. In the final phase (phase 5), the participants answered a post-experiment questionnaire, providing information on their gender, age group, computer proficiency, and educational and job background.

4.4 Research Questions and Measures

We defined the following research questions \(q_1\) to \(q_6\):

  • What is the accuracy of the cluster assignments (\(q_1\))?

  • How long does it take to complete the tasks (\(q_2\))?

  • What is the perceived difficulty of the tasks (\(q_3\))?

  • Is there a correlation between accuracy, time, and difficulty (\(q_4\))?

  • Do participants explore all clusters (\(q_5\))?

  • How do the methods compare with each other (considering the previous research questions) (\(q_6\))?

We then defined the following measures to address the research questions:

  • Accuracy: Ratio of correct assignments and the total number of assignments (per method) (for \(q_1\) and \(q_6\))

  • Efficiency: Time spent to complete the task (per method) (for \(q_2\) and \(q_6\))

  • Difficulty: Participant’s perceived level of difficulty (per method) (for \(q_3\) and \(q_6\))

  • Correlation: Spearman’s correlation coefficient \(\rho \) (per method) (for \(q_4\) and \(q_6\))

  • Exploration: Kernel density estimation of recorded mouse positions (per method) (for \(q_5\) and \(q_6\))

5 Results

This section presents the findings of the conducted user study. We refer to the research questions and measures used to evaluate the proposed methods for computing prototypes and criticisms.

5.1 Accuracy

For all methods, the 21 participants completed the repeated task with a mean accuracy of 0.746 (SD = 0.119). The results are clearly better than random guessing. However, there is still potential for improvement. Table 1 lists all accuracies per method. Our centroid-based clusterings method leads to the best results (mean = 0.794, SD = 0.197), followed by the MMD-critic algorithm (mean = 0.762, SD = 0.261). The raw method resulted in the lowest accuracy (mean = 0.698, SD = 0.256). Figure 4 provides a detailed comparison of the paired mean difference effect sizes between each method pair. Additionally, we report these effect sizes in Table 2. This table also lists the corresponding \(p\)-values computed using the Wilcoxon matched-pairs signed rank test at a significance level of \(\alpha =0.05\). The largest effect, with an absolute paired mean difference of 0.096 (\(p=0.279\)), can be observed between the centroid-based clusterings method and the raw method. These findings suggest that prototypes and criticisms accurately represent a cluster. Both the centroid-based clusterings method and the MMD-critic method are effective for textual data representations, with little difference between the two methods. Furthermore, the results suggest that a sophisticated method for finding prototypes and criticisms is preferable to randomly selecting an arbitrary subset of data instances from a cluster.
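The following sketch shows how such a paired comparison could be reproduced with SciPy, using hypothetical per-participant accuracies; the BCa bootstrap is one way to obtain the type of confidence interval reported in Table 2.

```python
import numpy as np
from scipy.stats import wilcoxon, bootstrap

# Hypothetical per-participant accuracies (21 participants) for two methods.
rng = np.random.default_rng(0)
acc_cen = rng.uniform(0.5, 1.0, size=21)
acc_raw = rng.uniform(0.4, 1.0, size=21)

# Wilcoxon matched-pairs signed rank test (alpha = 0.05).
stat, p_value = wilcoxon(acc_cen, acc_raw)

# Paired mean difference with a BCa bootstrap 95% confidence interval.
diff = acc_cen - acc_raw
ci = bootstrap((diff,), np.mean, n_resamples=1000, confidence_level=0.95,
               method="BCa", random_state=0)
print(diff.mean(), ci.confidence_interval, p_value)
```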

5.2 Efficiency

Across all methods, we observe a mean task completion time of 2.345 minutes (SD = 0.812) per task. This generally depends on how long the contributions are and how carefully the participants actually read them. Table 3 lists the average efficiency values per method. Our centroid-based clusterings method achieved the best result (mean = 2.043, SD = 0.876), followed by the MMD-critic algorithm (mean = 2.325, SD = 1.053). The raw method achieved the worst efficiency (mean = 2.596, SD = 1.806). The random method (mean = 2.416, SD = 1.195) achieves similar values to the raw method. This indicates that prototypes and criticisms lead to more efficient task completion than randomly selecting a subset of data instances or selecting all data instances. Figure 5 shows the paired mean difference effect sizes for all pairs of methods. We also provide these effect sizes and the \(p\)-values computed with the Wilcoxon matched-pairs signed rank test at significance level \(\alpha =0.05\) in Table 4. We find that the centroid-based clusterings method is more efficient than the MMD-critic method, with an absolute paired mean difference of 0.282 min (\(p=0.191\)). To our surprise, there is almost no effect between the MMD-critic method and the random method, with an absolute paired mean difference of 0.091 min (\(p=0.973\)), i.e., in terms of efficiency, both methods perform about the same.

Table 1. Average accuracy (mean (SD)) per method. The best average accuracy is in bold.
Fig. 4. Four shared-control estimation plots of paired mean differences in accuracy between each of the methods (a) MMD, (b) CEN, (c) RND, (d) RAW and all other methods. For each plot, all methods are plotted on the top axes as a slopegraph, the paired mean differences are plotted on the bottom axes as bootstrap sampling distributions, each mean difference is depicted as a dot, and each vertical error bar indicates the 95% confidence interval. It becomes immediately apparent that the CEN method outperforms all other methods while the RAW method performs the worst.

Table 2. Paired mean difference effect sizes of the accuracies, bias-corrected and accelerated 95% bootstrap confidence intervals (CI, 1000 resamples), and associated \(p\)-values computed with the Wilcoxon matched-pairs signed rank test (\(\alpha =0.05\)). The largest absolute effect size is in bold. The absolute value of the effect size equals the effect size of \(m_1\) minus \(m_2\).
Table 3. Average efficiency (mean (SD)) per method. The best average efficiency is in bold.
Fig. 5. Four shared-control estimation plots of paired mean differences in efficiency between each of the methods (a) MMD, (b) CEN, (c) RND, (d) RAW and all other methods. For each plot, all methods are plotted on the top axes as a slopegraph, the paired mean differences are plotted on the bottom axes as bootstrap sampling distributions, each mean difference is depicted as a dot, and each vertical error bar indicates the 95% confidence interval. The CEN method clearly outperforms all other methods in terms of efficiency. The RAW method performs the worst.

Table 4. Paired mean difference effect sizes of the task completion time (in min), bias-corrected and accelerated 95% bootstrap confidence intervals (CI, 1000 resamples), and associated \(p\)-values computed with the Wilcoxon matched-pairs signed rank test (\(\alpha =0.05\)). The largest absolute effect size is in bold. The absolute value of the effect size equals the effect size of \(m_1\) minus \(m_2\).

5.3 Difficulty

The results show a mean perceived difficulty of 3.000 (SD = 0.404). This indicates that the task was generally neither easy nor difficult for the participants. Table 5 lists all perceived difficulty values per method. The centroid-based clusterings method obtained the best results (mean = 2.873, SD = 0.853). Interestingly, the raw method (mean = 2.905, SD = 1.050) achieved the second best results. The many contributions shown by the raw method apparently did not overburden the participants. Only then follows the MMD-critic method (mean = 3.079, SD = 0.802). This means that the participants found the task easier when they used the centroid-based clusterings method instead of the MMD-critic method. The highest perceived difficulty was recorded for the random method (mean = 3.143, SD = 0.764). However, we have to acknowledge that all values are still close to each other. Figure 6 shows the paired mean difference effect sizes for all pairs of methods. Again, we provide these effect sizes and the \(p\)-values computed with the Wilcoxon matched-pairs signed rank test at a significance level of \(\alpha =0.05\) in Table 6. There is almost no effect in perceived difficulty between the centroid-based clusterings method and the raw method, with an absolute paired mean difference of 0.032 (\(p=0.747\)). The largest effect, however, can be observed between our centroid-based clusterings method and the random method, with an absolute paired mean difference of 0.270 (\(p=0.601\)). In contrast, the MMD-critic method and the random method only differ by a small amount of 0.064 (\(p=0.917\)). Thus, the centroid-based clusterings method should be preferred when deciding between the sophisticated methods for computing prototypes and criticisms.

Table 5. Average difficulty (mean (SD), based on a 5-point Likert scale) per method. The best average difficulty is in bold.
Fig. 6. Four shared-control estimation plots of paired mean differences in perceived difficulty between each of the methods (a) MMD, (b) CEN, (c) RND, (d) RAW and all other methods. For each plot, all methods are plotted on the top axes as a slopegraph, the paired mean differences are plotted on the bottom axes as bootstrap sampling distributions, each mean difference is depicted as a dot, and each vertical error bar indicates the 95% confidence interval. The CEN method once again outperforms all other methods. However, the difference to the RAW method is very small. The RND method performs the worst.

Fig. 7. Spearman's \(\rho \) between the measures accuracy, efficiency, and difficulty.

Table 6. Paired mean difference effect sizes of the difficulties, bias-corrected and accelerated 95% bootstrap confidence intervals (CI, 1000 resamples), and associated \(p\)-values computed with the Wilcoxon matched-pairs signed rank test (\(\alpha =0.05\)). The largest absolute effect size is in bold. The absolute value of the effect size equals the effect size of \(m_1\) minus \(m_2\).

5.4 Correlation

We are interested in the correlations between (1) accuracy–efficiency, (2) accuracy–difficulty, and (3) efficiency–difficulty although we cannot derive causal relationships. All results are shown in Fig. 7. Table 7 lists the numeric values.

Table 7. Spearman’s \(\rho \) and the bias-corrected and accelerated 95% bootstrap confidence intervals (CI, 1000 resamples)

Concerning (1) accuracy–efficiency, the Spearman's correlation coefficient \(\rho \) for the centroid-based method (\(\rho =0.074\)) is close to 0, i.e., there is no correlation between accuracy and efficiency for this method. However, the Spearman's correlation coefficient \(\rho \) for the MMD-critic method (\(\rho =-0.394\)) shows a weak monotonic decreasing relationship, i.e., as the accuracy of the cluster assignments increased, the time needed for finishing the task decreased.

Referring to (2) accuracy–difficulty, only negative correlations are present. This means that, for all methods, tasks that seemed easier led to more accurate results. However, there are subtle differences in the effect sizes. The MMD-critic method (\(\rho =-0.419\)) has the largest negative correlation coefficient. This is a moderate monotonic decreasing relationship and the best result.

Finally, we observe only positive correlations for (3) efficiency–difficulty. The centroid-based clusterings method (\(\rho =0.207\)), the random method (\(\rho =0.237\)), and the raw method (\(\rho =0.282\)) have similar correlation coefficients. These values indicate a weak monotonic increasing correlation. This means that when the time needed for finishing the task increased, which is equivalent to a decreasing efficiency, the perceived difficulty of assigning the reference contribution to the associated cluster increased. However, this does not apply to the MMD-critic method (\(\rho =0.090\)), because there is almost no effect present.
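As a minimal sketch, these correlation coefficients could be computed with SciPy as follows, here with hypothetical per-participant values for a single method.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical per-participant values for one method.
accuracy   = np.array([1.0, 0.67, 1.0, 0.67, 1.0])  # ratio of correct assignments
time_min   = np.array([1.8, 2.9, 2.1, 3.2, 1.5])    # task completion time in minutes
difficulty = np.array([2, 4, 3, 4, 2])              # 5-point Likert ratings

rho_acc_eff, _ = spearmanr(accuracy, time_min)
rho_acc_dif, _ = spearmanr(accuracy, difficulty)
rho_eff_dif, _ = spearmanr(time_min, difficulty)
```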

5.5 Exploration

The user study system recorded the participants' mouse positions (in screen coordinates) every two seconds. We use this data as an indicator of exploration activity, with the assumption that the participants moved the mouse while reading the contributions. To assess whether the participants explored all groups during the task, we analyze the kernel density estimation of the mouse positions for each method. Figure 8 depicts these kernel density estimations. The results are similar across all methods. The participants explored the first and second groups the most (first and second columns), with the third group being explored less. The raw method best reflects the three-column layout of the user study system's graphical user interface. This suggests that some participants explored the contributions towards the end of the groups. Based on these results, we conclude that there were no abnormalities.

Fig. 8. Kernel density estimation plots of the computer mouse positions (in screen coordinates) per method.
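The sketch below shows one way to produce such density estimates with SciPy's gaussian_kde; the mouse positions and screen dimensions are hypothetical.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical mouse positions (x, y in screen coordinates), sampled every 2 s.
rng = np.random.default_rng(1)
positions = rng.uniform([0, 0], [1920, 1080], size=(500, 2))

# Kernel density estimate over the screen, evaluated on a coarse grid for plotting.
kde = gaussian_kde(positions.T)  # gaussian_kde expects shape (dims, samples)
xs, ys = np.meshgrid(np.linspace(0, 1920, 96), np.linspace(0, 1080, 54))
density = kde(np.vstack([xs.ravel(), ys.ravel()])).reshape(xs.shape)
```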

6 Conclusions and Future Work

We motivated the need for clustering explanations in general and especially in digital public participation processes so that citizens and public administrations can explore the contributions more easily. We introduced a novel centroid-based clusterings method for finding prototypes and criticisms in clusterings. These are directly inferred from the clustering. The user study results show that both the centroid-based clusterings method and the MMD-critic method are suitable for explaining clustered contributions. However, the centroid-based clusterings method outperforms the MMD-critic algorithm regarding accuracy, efficiency, and perceived difficulty.

Nonetheless, there is potential for future work. We only considered the raw representations of the contributions, i.e., we displayed the raw content to the participants of the user study. We plan to investigate other representations for the prototypes and criticisms. For example, content summarization or the inclusion of topic tags could enhance the comprehensibility of the prototypes and criticisms.