1 Introduction
Recent progress in computer vision has led to algorithms that are comparable to human performance on some tasks [24]. Previously, vision algorithms have been limited by what has been termed the “cross-depiction” problem [35]: while achieving impressive performance in detecting objects in photographic images, algorithms would struggle to detect the same objects in other depictions such as drawings, paintings, and other art styles. However, recent advances in machine learning algorithms trained on multimodal image-text datasets, and approaches such as Contrastive Language-Image Pre-training (CLIP) [63] and Grounded Language-Image Pre-Training (GLIP) [49], offer promising performance across visual domains. In this paper, we explore how these technologies may be applied to the large art collection of the National Gallery of Denmark (Danish abbreviation: SMK) to facilitate new ways of exploring and experiencing art. While techniques such as computer vision and, more broadly, Artificial Intelligence (AI) raise both legal and ethical concerns relating to authorship, bias, trust, and more [20, 29], these technologies also offer the potential to make art collections more accessible and to open up new ways of experiencing art.
As suggested by Lev Manovich, while visual art and aesthetics are traditionally experienced and studied by looking at individual images and artworks, computational analysis of images opens up the perspective of exploring large datasets, inviting us to shift our perspective from unique exemplars to “seeing one billion images” and the patterns therein [56]. Within Human-Computer Interaction (HCI), such perspectives have been explored through the design of novel systems for visualization and exploratory search [26, 87, 90]. Exploratory search is of particular importance for museums and art collections, because it allows non-expert users to find pathways to explore and discover art that they don’t know about and so wouldn’t know how to search for in a traditional search interface - an issue that is strongly aligned with museums’ mission to inspire and educate [8, 88, 92].
Museums have long served as a productive environment for research and experiments in HCI [40], and debates around the use of computer vision in museums have become increasingly prominent [17, 84]. The recent significant developments in object detection suggest that computer vision may now be applied to museum collections to make them searchable not just through metadata about the images but through the subject matter of the artworks - i.e., the objects appearing in the images. Presently, such search has been limited by the extent to which information about the objects has been manually entered into the collection metadata by museum curators - information which is often far more sparse (if available at all) than the rich visual information in the images [4, 17]. Thus, this paper presents a research through design [95] exploration of the following research question:
RQ: How can object detection be used to support exploration of an art museum’s digital collection?
As this approach represents a novel application of object detection techniques, a significant part of the effort has been to implement an object detection workflow for extracting object data about the art collection. We present the design and evaluation of an interactive application, SMKExplore, which allows users to explore a museum’s digital collection of art paintings by browsing through objects detected in the images, as a novel form of exploration.
In this paper, we provide three contributions. First, we show how an object detection pipeline can be integrated into a design process for visual exploration. Second, we present the design and development of an app that enables exploration in the context of a museum collection. Third, we offer reflections on future possibilities for museums and HCI researchers to incorporate object detection techniques in digital museum collections.
4 Object Detection
Since no ground truth is available for the National Gallery’s collection, our object detection approach relies on pre-trained models. As COCO [50] is the most prominent object detection dataset available, many approaches applied to artwork are trained or pre-trained on it; for instance, see [43, 76]. Designed to mirror modern-day visual environments, the COCO dataset emphasizes contemporary objects, which makes its classes poorly suited to the nuanced motifs and historical themes of traditional art paintings. For this reason, we decided to utilize a contrastively pre-trained model that allows the labels of detected objects to be customized in a way suitable for the context of artwork.
Our approach involves defining a set of labels (up to 120), which is presented to the GLIP model [49] as a single string of words separated by full stops. The pre-trained GLIP model then represents each class label as a vector. This vector representation of an object label can be compared to similar representations extracted from image patches. The model matches the image patch and object label vectors that are most similar to each other and generates labelled bounding boxes in the image based on these similarities. To evaluate the suitability of our approach for art images in principle, we applied it to the People-Art [86] test set, where we achieved an average precision (AP) of 0.56 (label = “person”, confidence cutoff = 0.25, intersection over union 0.5-0.95). In comparison, recent work [43, 76] reported APs of 0.36 and 0.44, respectively. We therefore concluded that our approach is suited for object detection in the National Gallery’s collection.
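To make the workflow concrete, the following is a minimal sketch of how such a prompt-based detection pass could be organized. The `load_glip` loader and the `model.detect` call are hypothetical stand-ins for the actual GLIP inference code, and the label list is abbreviated; only the prompt format (labels joined by full stops) and the confidence cutoff follow the setup described above.

```python
# Minimal sketch of a prompt-based detection pass (hypothetical wrapper around GLIP).
from PIL import Image

LABELS = ["man", "woman", "child", "horse", "dog", "church", "sword"]  # abbreviated; the full set has 120 labels

def build_prompt(labels):
    """GLIP accepts the candidate labels as one caption string, separated by full stops."""
    return ". ".join(labels) + "."

def detect_objects(model, image_path, labels, confidence_cutoff=0.25):
    """Return labelled bounding boxes above the confidence cutoff for one painting.
    Assumes the (hypothetical) model returns dicts with keys "box" (x0, y0, x1, y1),
    "label", and "score"."""
    image = Image.open(image_path).convert("RGB")
    prompt = build_prompt(labels)
    detections = model.detect(image, prompt)  # hypothetical inference call
    return [d for d in detections if d["score"] >= confidence_cutoff]

# model = load_glip("glip_large_model.pth")          # hypothetical loader
# boxes = detect_objects(model, "painting_0001.jpg", LABELS)
```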
4.1 Defining Custom Labels
In order to minimize errors, we built a dataset consisting only of digitized artworks from the museum collection that were labeled as “painting” in the collection metadata, resulting in a set of 6,750 artworks. Initially, we applied the approach described above, using the 80 object categories in COCO. This gave impressive results: the system could recognize a wide range of objects even in crowded scenes (Fig. 3a) and with unclear depiction styles (Fig. 3b). However, the system appeared to have a bias towards modern object categories: for instance, mislabelling some old books as suitcases (Fig. 3c) or the shield of a female warrior figure as a handbag - in the latter case perhaps also revealing a gender bias.
In order to mitigate this problem, we pivoted to Iconclass, a comprehensive index for the classification of objects depicted in art images [19]. We needed to construct a custom set of labels, as Iconclass contains over 28,000 unique concepts, whereas our system could only accept a maximum of 120 labels. We iteratively explored categories and labels from Iconclass and observed the frequency of such objects in a random subset of the National Gallery’s paintings. As the first iteration of object detection with the COCO labels had shown a strong dominance of the label Person (44% of all detected objects), we prioritized including various labels relating to people and clothing. However, we omitted labels for small details such as mouths and eyes, as we expected such objects would most often be too small to crop at high resolution and, therefore, difficult to use in our user interface. Furthermore, we included several labels relating to themes we observed occurring often in the artworks, such as religious themes, architecture, food items, musical instruments, furniture, weapons, vehicles, and nature. This process resulted in a list of 120 labels, as seen in Table 1.
4.2 Selecting a Subset
Based on the 6,750 paintings from the museum collection and the aforementioned 120 labels, a total of 109,145 objects were detected in 6,477 of the paintings. Analyzing the data revealed a skewed distribution of objects, with 4 categories (Human, Nature, Architecture, and Clothing) representing more than 70% of the total objects detected. Similarly, there was high variance in the number of objects detected per label: the object Man had an instance count of 5,975, whereas Bird Cage had a total count of only 5. Due to technical constraints, we needed to reduce the dataset to one-tenth of the total data. In order to get a more uniform representation of objects, we defined the subset by retrieving up to 100 object instances per label, selected by highest confidence. This resulted in a dataset consisting of 10,775 objects detected in 3,906 of the paintings from the collection.
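As a rough illustration of this selection step, the sketch below assumes the detections have been exported to a table with one row per detected object (the file name and column names are hypothetical) and keeps at most the 100 highest-confidence instances per label.

```python
import pandas as pd

# Hypothetical detections table with one row per detected object.
detections = pd.read_csv("detections.csv")  # columns: painting_id, label, confidence, x0, y0, x1, y1

# Keep at most the 100 highest-confidence instances per label.
subset = (
    detections.sort_values("confidence", ascending=False)
    .groupby("label", group_keys=False)
    .head(100)
)

print(len(subset), "objects detected in", subset["painting_id"].nunique(), "paintings")
```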
Finally, we developed a script that cropped the detected objects out of the images of the original paintings, leaving us with an image collection of the individual objects, as illustrated in Figure 4. These individual object images were used as a key component throughout the design and development of the interactive application.
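A minimal sketch of such a cropping script, assuming each row of the subset carries the pixel coordinates of its bounding box (column and folder names are hypothetical):

```python
from pathlib import Path
from PIL import Image

def crop_objects(detections, image_dir, out_dir):
    """Cut each detected object out of its painting and save it as a separate image."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for i, det in detections.iterrows():
        painting = Image.open(Path(image_dir) / f"{det['painting_id']}.jpg")
        crop = painting.crop((det["x0"], det["y0"], det["x1"], det["y1"]))
        crop.save(out_dir / f"{det['painting_id']}_{det['label']}_{i}.png")

# crop_objects(subset, "paintings/", "object_crops/")
```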
5 SMKExplore
Working with the data as described above gave us valuable insights that helped inform our design process. In combination with the results from the object detection, we drew inspiration from existing literature on designing for exploration (as summarized in Section 2) to create SMKExplore: a web application that allows users to browse and explore a digital art collection through the objects detected in the paintings, as well as to use those objects to create new images with a generative image algorithm. In the following, we first present the design process and the insights leading up to the final design, which is then presented in the subsequent subsection.
5.1 Design Process
The design process for SMKExplore ran from January to August 2023. The design and development were carried out by the first three authors of this paper and were structured as a combination of UX design and agile software development using Scrum, comprising a total of five combined design and software development sprints based on the model presented in [36].
From the outset, our aim was to explore how object detection data could be used to facilitate exploration of a digital museum collection, using the object data to create new entry points and alternative ways of browsing. The data processing described in Section 4 was conducted prior to the design process. Exploring the data helped us frame the design space and guide the process.
During the initial phases we investigated opportunities and qualities inspired by techniques from interaction-driven design as presented by [54]. This led us to a preliminary concept that fostered immersive interactions with the objects in a 3D gallery, which we envisioned implementing in WebVR. However, we had concerns about usability and motion sickness (cf. [15]) as well as technical feasibility, and chose instead to develop a simpler 2D concept.
Based on insights from the research literature, we established a set of design principles for our system. First, as suggested by [90] and [26], we emphasized finding a balance between an overview of the data and opportunities to explore information in detail. This led us to establish a clear information hierarchy that allows the user to gain an overview of the collection, while we also designed pathways to detailed information on each artwork. Dörk and colleagues [26] furthermore inspired us to use visual cues to design navigational paths and enhance the possibility of serendipitous discoveries. Inspired by [21, 90], we additionally decided to enable users to save objects that caught their interest, as a means to revisit parts of the collection they enjoyed.
Furthermore, inspired by [57, 89], we decided to cluster the objects based on similarity and to provide options to filter the data, as a means to create overview and to support various information-seeking strategies, such as comparing, combining, and evaluating. Insights from [57, 81, 88] led us to design for accessing the collection through multiple entry points and navigating via multiple paths (cf. [26]), in order to enhance the sense of free exploration and varied forms of interaction with the items of the collection.
In addition to this, past research [39, 51] has highlighted the benefits of using playful elements to advance and encourage non-expert user engagement and maintain user attention. Thus, we decided to include a playful element in the application in the form of an interactive canvas where users, with the help of generative AI, can create their own art with objects of personal interest from the collection.
With the aim of establishing a clear connection between the application, SMKExplore, and the National Gallery, we defined the visual identity with inspiration from the museum’s website. In particular, we drew upon the color scheme, font types, square frames, and the layout of the Painting Screen (Fig. 5).
The design was revised and implemented through five iterations (sprints), informed in part by usability testing and in part by technical considerations, until an initial prototype was tested on users in May 2023. Based on findings and feedback from this test, a final revision of the design was carried out at the beginning of August 2023.
5.2 Final Design
In the following, we present the version of SMKExplore (Fig. 5) that was used during the evaluation reported in Section 6.
SMKExplore allows “bottom-up search”, where the user encounters the digital collection by moving from details (i.e., objects) to the full-sized paintings in which they originally appear. The design concept focuses on navigating the collection based on thematic interest rather than more traditional goal-oriented search. The aim is to allow users to compare depictions of similar objects in different paintings across time periods, styles, etc., and to help users discover artworks and details they otherwise might not have noticed.
When the user enters the application, they are met by the Home Screen, which showcases a slider with three examples of objects that have been detected in the collection. By clicking the “Start Exploring” button, they are led to a screen displaying the 13 categories defined for the objects in the data processing (see Table 1). The Category Screen constitutes the first level of an information hierarchy, in which the objects are presented only as a high-order category.
Once the user chooses a category, they are directed to the Object Screen, where all objects within that particular category are displayed. Users may filter the category further by selecting a label, so that only images of one object label are shown, for instance skulls in the category Occultism.
Clicking on an object leads the user to the Painting Screen, which presents the entire painting on which the object appears. Alongside the painting, more detailed metadata are provided, such as title, artist, technique, production year, and color palette. Other objects detected in the painting are also displayed, making it possible to navigate to other types of objects. Detected objects can also be discovered by hovering over the painting itself.
Users can save objects that catch their interest to a list of favorites. The list of favorites gives users an opportunity to revisit these parts of the collection later on. The saved objects can also be used to create new imagery on the interactive canvas. The canvas utilizes the outpainting function of OpenAI’s DALL·E 2 API, which generates an image based on visual input(s) and a text prompt. The user may place objects on the canvas, resizing them as needed and leaving a generous amount of white space. Once they are satisfied with the composition, they type in a text prompt describing, for instance, the desired image’s style or theme. When the image has been generated, the user can create a new image using the same or other objects, search the collection further, or compare their image to the original paintings by navigating through the list of objects that were used on the canvas.
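As an illustration of how such a canvas composition can be handed to the outpainting step, the sketch below uses the image-edits endpoint of the OpenAI Python library (the pre-1.0 `openai.Image.create_edit` interface). DALL·E 2 paints into the transparent regions of the supplied image, so the empty space the user leaves around the placed objects becomes the area the model fills. The composition step, file names, and prompt are simplified stand-ins for the application’s actual canvas logic.

```python
import openai
from PIL import Image

openai.api_key = "sk-..."  # placeholder

def compose_canvas(object_crops, positions, size=(1024, 1024)):
    """Paste the saved object crops onto a transparent RGBA canvas.
    Transparent pixels correspond to the 'white space' the model will paint in."""
    canvas = Image.new("RGBA", size, (0, 0, 0, 0))
    for crop_path, (x, y) in zip(object_crops, positions):
        crop = Image.open(crop_path).convert("RGBA")
        canvas.paste(crop, (x, y), crop)
    canvas.save("canvas.png")
    return "canvas.png"

canvas_path = compose_canvas(["skull.png", "violin.png"], [(120, 300), (600, 450)])

# The edits endpoint generates into the transparent regions, guided by the prompt.
response = openai.Image.create_edit(
    image=open(canvas_path, "rb"),
    mask=open(canvas_path, "rb"),   # same file: its alpha channel marks the area to fill
    prompt="a baroque still life in muted earth tones",
    n=1,
    size="1024x1024",
)
print(response["data"][0]["url"])
```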
6 Evaluation
SMKExplore was evaluated on-site at the National Gallery on 11-12 August 2023 during museum opening hours. Two authors were placed in the museum’s foyer, inviting visitors to participate in a short user test. All test participants were visitors we encountered at the museum and were unknown to us beforehand. In total, 22 participants interacted with the application and were interviewed, aged 18 to 59 (median: 26), 9 males and 13 females. The participants represented 14 nationalities across Europe, North America, Asia, and Oceania.
6.1 Procedure
Before the participants interacted with SMKExplore, its content and functionality were briefly explained to them. Subsequently, they were instructed to explore the application freely, without time constraints. Towards the end of the session, the participants were asked to generate an image on the Canvas Screen. The interactions with the application were documented using system logs as well as screen recordings.
A semi-structured interview was conducted immediately following the participants’ interaction with the application. The interviews focused on the participants’ experience of utilizing detected objects as the primary visual entry point to the collection. The participants were also asked about their general thoughts on using AI in art contexts and whether they noticed or reflected upon mislabelled objects. The interviews were recorded using an audio recorder. All participants were informed about the data collected and signed an information statement in accordance with the university’s policies and the General Data Protection Regulation (GDPR).
The following analysis is based on findings from the system logs, interviews, and observations during the tests. Screen recordings have been used to supplement and clarify some details. The system logs were analyzed through descriptive statistics. The interviews were analyzed through thematic analysis, following the phases and guidelines presented by Braun and Clarke [13]. Additionally, following the guidelines by McDonald et al. [58], consistency and validity of the qualitative results were ensured through agreement among the authors, who collaboratively developed the coding schemes through iterative discussions.
The initial phase of the thematic analysis was conducted by the first and second authors, who had both been part of the design team. They initially familiarized themselves with the data by transcribing and iteratively reading the interviews. From the transcripts, one author generated the initial codes for all interviews using the software ATLAS.ti. This amounted to a total of 158 distinct codes related to our research question. These codes were assessed and revised by the second author. Subsequently, 16 groups of codes were established by combining patterns such as “questioning own interpretation”, “noticing details through objects”, and “surprised by personal interest”. In the following phases, two additional authors, who had taken part in neither the design process nor the interviews, joined the analysis and discussion to broaden the perspective. Through this process we found several overlaps among the 16 groups, which led us to narrow them down to six potential themes touching upon object representation, attention to detail, interest-driven search and discoveries, mislabelling and interpretation, and contextualizing the objects. Through an additional, conclusive phase of analysis, the themes and underlying patterns were re-evaluated and the final six themes were established. These will be unfolded in the following.
6.2 Overall Experience
In general, the participants became immersed rather quickly in browsing through the objects, appearing concentrated throughout the process. When interacting with the canvas, they became more talkative towards the end of their session and seemed both entertained and surprised by the image they generated. They spent between 4 and 15 minutes with the application (median: 8 minutes). The majority wanted to continue their exploration or said they could imagine themselves trying it again.
Generally, the participants interacted intuitively with the different features. They initially navigated from the Home Screen to the Category Screen and onward to the Object Screen. On their first visit to the Category Screen, they quickly (on average within 10 seconds) found a category of interest to investigate further. Only 3 participants needed guidance on how to save an object to their favorites list. In addition, 4 participants said the functionality of the canvas could have been more apparent to them, while 3 participants said they could have used more tips on how it works. These issues primarily concerned confusion about how to resize objects on the canvas.
20 of the 22 test participants said they enjoyed using the application and described the overall experience with words such as “fun”, “interesting”, “intuitive”, “enjoyable”, and “ludic”. 2 participants had somewhat more mixed feelings.
6.3 Representation of Objects
The participants generally spent the most time on the Object Screen (see Table 2), which shows all the detected objects for a specific label or category. During the interviews, several participants shared that exploring the collection through objects made them reflect on the different depictions of these objects and the wide range of motifs represented in the collection.
“It was nice to see bikes from different paintings [...] I have never thought about looking at paintings and being like, oh, this is a bike here, and there is also a bike there” (P1). Several participants mentioned the wide variety of object types as an element of surprise to them:
“Wow, there are many of these objects I have never noticed in many of the artworks before” (P7).
In addition to the rich variation of objects, the participants commented on the effect of seeing the different depictions of these objects side by side on the Object Screen: “Many motifs are reappearing. It makes sense, but when you see it like this, it is wild” (P5). The participants shared how distinct representations of the same objects made them reflect on different styles of painting through time. “It shows how different artists from different parts of the world, during different times have treated that object. Say, an apple would be very different in the Renaissance than today” (P16).
6.4 Focusing on Details
When asked to describe their experience of accessing the collection primarily from the objects as opposed to the entire painting, several participants emphasized that it offered them a perspective on the artwork that made them notice things they would usually disregard. Removing the objects from their original context also made many aware of the complexity that goes into a painting, which they might not have discovered otherwise. In addition to this, several of the participants also expressed that experiencing the collection in this manner made them pay attention to what they were seeing and inspired them to look at the details more: “I think you just become more thoughtful of what actually is happening in a painting like this and what is present” (P1).
Some also suggested that focusing on details can serve as an interesting new way to discover the entire paintings, as browsing the objects made them aware of artworks that they had not noticed when going through the exhibition: “[...] by going through details that maybe struck me, I also had the chance to pay attention to paintings that maybe I disregarded in the exhibition” (P3).
While most participants enjoyed or found it interesting to experience the collection through the objects, some also described the lack of context around the objects as problematic or as something they did not enjoy. In particular, worries about losing the artist’s entire vision were mentioned by those who would rather see the entire painting up front: “I like the whole vision that the artist had rather than just a small piece of it that somebody else had decided I would look at” (P4).
6.5 Interests and Discoveries
A recurring pattern in the participants’ interview answers was the ability to explore the museum collection based on their personal interests. They shared reflections on how this influenced their navigation in the application and that they discovered patterns in what caught their attention: “I learned what I am interested in when I look at art” (P19). Several participants stated they had found new and unexpected artworks by pursuing their interest in particular objects: “I didn’t think I would be interested in a painting of cows, but that was very surprising and interesting” (P16).
8 of the 22 participants mentioned the categories as an element that helped them follow their interests while exploring the collection. On average, each participant visited the Category Screen 6 times during their session. Throughout the 22 sessions all categories were visited; however, the popularity of the categories varied, as illustrated in Figure 6.
The log data revealed that 21 out of 22 participants explored the same category more than once during their session. In the interviews, multiple participants said they were drawn to unfold the content of the categories further, either going back and forth to it or by selecting filters in the category to narrow their search.
“I chose human, I think, which is a bit broad, and then I went into it and I was like, I am interested to see women in the collection, so I started to select those to go deeper into a more narrow category.” (P19)
Similarly, the log data, showcasing which objects the participants saved during their session, supports the notion that participants wanted to investigate categories of interest more in-depth. The participants on average saved 6 objects during their session, and most participants (17) saved more than one object from the same category. One particularly eager participant saved 25 objects, all from the category “Weaponry”.
6.6 Mislabelling and Interpretation
Out of the 22 test participants, 12 said that they noticed objects that were not correctly tagged. However, when asked if this influenced their experience, none of the 12 said that it bothered them. Interestingly, participants seemed to express a large degree of understanding, and perhaps even sympathy, for the algorithm’s mislabelling. Some suggested that it couldn’t necessarily be determined what the correct label should be, pointing out that strict interpretation is not always possible. Others suggested that it was understandable why the object detection model would classify a given object as something other than what it actually is because of visual similarities, for instance a spear being labelled as a flute because of its similar shape and colour:
“Especially with the flute, he showed a lot of pictures of long, thin objects, which I do understand why he would think is a flute. And I always find this very interesting, because this is quite difficult, especially analyzing photos. It’s quite difficult for artificial intelligence to do it. And as a human, you take a single look at it and you instantly know.” (P12)
Several of the participants that encountered incorrect labels found it interesting and said that being confronted with the AI’s “interpretation” made them question their own interpretation:
“For example, for a mirror, there was one that was a full painting. That’s why I clicked on it, I think, at some point, because I was like, that’s not really a mirror. But I thought it was interesting because it kind of made you question whether it was you or the AI that was making a mistake, or it made you explore that.” (P6)
Several said that the incorrect labeling made them reflect or think differently about the potential visual interpretations of a particular object when taken out of context, thus challenging their own interpretation:
“I guess it made it a bit more exciting, because you didn’t know if it was going to be the actual thing. One of them said it was a guitar, but it was open-heart surgery. It looked like a guitar. It was quite interesting.” (P14)
Interview participants often speculated why the AI had labeled an object the way it had. Particularly, participants noticed discrepancies in how the AI labelled the objects in contrast to how a human might interpret them. Similarly, people also speculated on what shared visual characteristics objects might have and how these shared characteristics would lead the AI to recognize a particular object incorrectly, but consistently: “I began to think about what the AI saw to think it was that object and what similarities it would have to the other objects” (P10).
One participant stated that they thought the flaws were “charming” and that it made the AI seem more human. The same participant, however, also reflected on being misled and becoming suspicious of whether they could trust the AI at all when noticing a wrongly labeled object:
“[...] all of a sudden, I became very aware that I suddenly couldn’t trust it, that something I had clicked on and that it almost had me convinced was a skull, and I was like maybe it isn’t that at all. That I am looking at it all of a sudden as an abstract, kind of distorted skull, but maybe it isn’t.” (P7)
6.7 (Re)contextualizing the Objects
Second to the Object Screen, the participants spent the most time on the Canvas Screen (Table 2), on which they could generate a new image using objects they had saved. When asked if playing with the objects on the canvas contributed to their interest in exploring the art or their experience of the artwork, most participants shared reflections concerning (re)contextualization, composition in paintings, and piecing together different styles and details.
“I think it’s just interesting to maybe take some details or take things in general, change the context and see what happens by reframing this relation.” (P3)
The opportunity to play with positioning objects on the canvas and creating new imagery made several participants contemplate how the objects were represented in the collection.
“Putting these objects together, you could give it your own context, and that also changed the way the objects were in the collection [...] it is an interesting way of combining images, not just generating completely new images, but combining specific objects from images to a completely new image was an interesting experience.” (P10)
By placing the objects in a new and different context, the participants expressed that they became aware of compositional aspects of the art. This awareness concerned the composition of existing paintings and the composition they were creating on the canvas: “For instance, in the Baroque exhibition, there were a few areas in the paintings where there weren’t a lot going on, and that didn’t make sense to me. It made me think you could have added something” (P8).
Furthermore, several participants also reflected on how combining objects from various parts of art history could reveal differences and similarities between styles across time:
“I thought it was cool how you could compose different pieces together. And I think if you did a bunch of different art movements together, you could learn a lot about how they evolved and how they could be intertwined.” (P6)
During testing, we noticed that the canvas increased the participants’ inclination to explore the collection further. All but one spent so much time exploring the various objects that they had to be asked to stop exploring and generate an image. When asked to do this, some asked for more time to find other objects. Others reflected on this aspect during the interviews, expressing that, had they been given more time, they would have gone back and collected more or different objects to use in their image, indicating a desire to investigate the collection even further.
When asked if anything unexpected happened while interacting with the application, the most frequent answer was that they were surprised by the resulting image they had generated on the canvas:
“I was really surprised to see an actual painting that could hang in a museum [...] The painting that AI generated reminded me really of one of my favorite artists. But none of the pictures were from him. So that’s quite interesting. I really like that.” (P22)
Participants were particularly surprised by the way the outpainting functionality worked, saying that they did not expect the canvas to take the size and position of the objects into account in the final result (although, in fact, most participants had opened an instruction screen which explained and visualized how the Canvas Screen worked). While several participants were familiar with other generative image systems such as Midjourney, they were generally not familiar with outpainting.
7 Discussion
In the following, we reflect on four themes emerging from our design process as well as from the testing and evaluation presented above: how the system affected the participants’ view of the artworks, how the labelset influenced the design, the participants’ experience of errors in the labelling, and the participants’ experience with the Canvas Screen. Finally, we reflect on some implications for design.
7.1 Experiencing Art Through the Lens of AI
As proposed by Lev Manovich, machine learning offers new ways to experience art and visual culture by enabling the exploration of large collections and patterns, in contrast to the traditional approach of inspecting artworks individually [56]. Through the evaluation we found that participants reflected on patterns in the museum collection, specifically objects recurring across multiple paintings. Seeing the recurrence of these objects side by side made them reflect on the motifs repeatedly depicted by artists and their various styles. In addition, the evaluation highlights that participants were inspired to focus more on details by exploring the collection through objects instead of full-sized paintings.
It is particularly interesting to consider the participants’ experience in light of the fact that they encountered our prototype after having visited the physical exhibitions at the museum. Many users expressed that they noticed new details and recurring objects in the art when exploring the application, which they had not noticed in the museum exhibition beforehand. Thus, the experience of exploring the art collection based on the objects detected by the machine learning model offered participants a new perspective on the art collection compared to the physical visit to the museum exhibition.
7.2 Labelling
The labels applied in an object detection model determine what objects can be detected - what the computer vision system can “see” in the image. The process of constructing the list of labels described in Section 4 demonstrates the importance of building a set of labels that enables the model to detect the most relevant objects. Our setup was limited by the number of labels that could be fitted into an input string for the GLIP model - 120. This means that the model could not label objects that were not included in the list in Table 1 - thus the objects labelled by the model represent only a partial view of all the objects in the collection, limited by the selection we had made. This may help explain the mislabelling of some details, such as the example mentioned by participant 14 in Section 6.6, where open-heart surgery was mistaken for a guitar (see Figure 9): since the labelset did not include any labels relating to surgery, the model could not label this detail correctly - instead settling on a label with some visual similarity but a very different meaning. Indeed, in our first iteration of the object detection analysis using labels from COCO (see Section 4), this same detail was labelled ’bowl’; COCO does not have a label called ’guitar’, nor any other label that seems appropriate for this detail.
Given more time and technical resources, it might have been possible for us to increase the number of distinct labels by re-running GLIP over the artwork collection with a different set of labels each time. This might result in a dataset with several different labels for the same object, which would either have to be disambiguated through a separate process - or we could simply adjust the design to allow for multiple (possibly contradictory) labels for the same object, inviting users to reflect on the resulting ambiguity. We can only speculate on how such a larger set of labels would affect the user experience: one might hope that it would allow for an even richer experience, with more nuance and more opportunities for surprising discoveries. However, it is also possible that adding more labels would increase the proportion of mislabelling, as the system would have to contend with a larger number of categories overall while applying a limited ontology for each run of the object detection algorithm.
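To sketch what such a multi-pass setup could look like, the snippet below reuses the hypothetical `detect_objects` function from the Section 4 sketch, runs it once per label set, and groups boxes that overlap strongly (by intersection over union), so that a single detail may end up carrying several, possibly contradictory, labels.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x1 - x0) * max(0, y1 - y0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def multi_pass_detect(model, image_path, label_sets, iou_threshold=0.7):
    """Run one detection pass per label set and merge boxes that overlap strongly,
    keeping every label assigned to the same region."""
    merged = []  # each entry: {"box": (x0, y0, x1, y1), "labels": [(label, score), ...]}
    for labels in label_sets:
        for det in detect_objects(model, image_path, labels):  # sketch from Section 4
            for entry in merged:
                if iou(entry["box"], det["box"]) >= iou_threshold:
                    entry["labels"].append((det["label"], det["score"]))
                    break
            else:
                merged.append({"box": det["box"], "labels": [(det["label"], det["score"])]})
    return merged
```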
Future developments of GLIP and similar algorithms may increase the number of labels that can be applied at a time. However, it is unlikely that this will remove all limitations on the ability of vision algorithms to detect objects in artwork. First, it may take some time before models can include a sufficiently large number of labels without forgoing precision: a comprehensive classification like Iconclass contains over 28,000 unique concepts, and the Getty Art & Architecture Thesaurus contains 73,831 concept records, over 600 times the number of labels used in our setup. Furthermore, even if a future system allowed a very long list of labels, some fundamental challenges would remain with mapping between concepts and images precisely and comprehensively. Debates about the large image classification dataset ImageNet [69] have demonstrated that classifying images of humans with labels from a lexical database can lead to unintended consequences and controversy [20]. Similar complications may occur in object detection, as some concepts may mean different things in different times and cultural contexts. For instance, gender labels have acquired new meanings in recent times, adding nuance to what was formerly mostly considered a binary concept. Many other concepts relating to technology, societal roles, norms, institutions, or culture have changed meaning over time and across societal and cultural contexts. Ciecko and colleagues [17] provide a striking example of how the use of image classification might inadvertently trigger controversy: an image of iron ankle manacles from Australia’s convict history is labelled as “Fashion Accessory” and “Jewelry” by a commonly used image classification algorithm. One could easily imagine that if a similar mislabelling were to occur in the collection of a museum relating to the history of slavery or the Holocaust, it could be offensive and hurtful for visitors and highly problematic for the museum.
In the first version of our labelset we included the label ’non-binary person’, in order to accommodate a broader variation of gender identities and to supplement the labels ’man’ and ’woman’. However, the results made us question the classification. GLIP returned 210 bounding boxes with this label, of which the majority were depictions of children and/or nude people with displeased or uncomfortable facial expressions. We judged this to be potentially both inaccurate and harmful, and for these reasons we omitted the label from the final version of the labelset, with the consequence that our application only provides two labels reflecting gender. This is unfortunate. While we do not have ground truth data that could help us verify whether there are (few or many) images of people in the collection that should be tagged as non-binary person, the absence of this label might render a broader variety of gender identities invisible to the model. However, it seems that capturing nuances in gender presentation is difficult with the technology used in this study. It is worth reflecting on whether it is possible at all to classify gender with computer vision techniques that rely solely on visual appearance. For future work, it could be interesting to explore other ways to classify motifs of people in art instead of (or in addition to) gender, e.g., by clothing, hair, age, etc.
7.3 Mislabelling and Trust
Seen in light of the challenges with labelling objects correctly outlined above, it is striking how much the test participants generally trusted our system’s algorithmic labelling. Only 12 of the 22 participants noticed objects that were incorrectly labelled, even though mislabelled - or at least questionably labelled - objects can easily be found in most categories. (Consider, for instance, some of the objects shown in Fig. 1 and 4.) Those who did notice questionable labels often seemed willing to offer explanations on behalf of the algorithm, one participant even personifying it: “...he showed a lot of pictures of long, thin objects, which I do understand why he would think is a flute” (P12). Others suggested the mislabelling made them question their own interpretations - which aligns well with the typical ideals of art education in museums, which often emphasize questioning one’s preconceptions and interpretations and opening oneself up to seeing artworks in different ways.
While these observations align with other research pointing towards a tendency to overtrust AI systems [6, 41, 66], we do not have data to assess clearly why the participants were so willing to trust the system or make excuses for its errors. However, there is a striking similarity with the observations made by Benford and colleagues when exploring the use of emotion detection AI in an art museum: “...visitors tended to construct post hoc rationalizations of their emotional experience that agreed with, or at least accommodated, the ’results’ reported by the system, even when this differed from their initial reflections” [6, p. 12].
We can only speculate about why visitors appear so willing to trust the output of these computer vision systems - object detection in our case, emotion detection in [6]. First, several factors in the presentation at the museum may inspire trust among visitors: the system is presented to them by university researchers, which may influence the participants to see the system as trustworthy and authoritative, and the context of the museum as a highly trusted institution may add to this impression. Second, visitors may be extra understanding towards the system’s errors due to the application domain: interpreting art is a difficult task, one that is often seen as having no single correct answer, and one to which computer systems are not commonly applied. Third, given the large amount of visual information in the interface and the focus on exploration, it is possible that some mislabellings - like those showing unclear images and shapes - were overlooked as “noise” as participants focused on the higher-resolution, and thus clearer and more recognizable, images.
To some degree, the discussion above has presupposed that there is a correct and an incorrect label for each object in an art image. That assumption might be challenged in several ways. First, art images frequently appear ambiguous and resist interpretation from audiences, art critics, and scholars alike. For instance, should Salvador Dalí’s “Lincoln in Dalivision” be classified as depicting the face of a bearded man, or a naked woman standing by a window? Much art is even more abstract and difficult to interpret unambiguously; and as ambiguity is a central quality of art, removing ambiguity from art is not a desirable goal. Second, one might argue that computer vision may encode ways of seeing objects that subtly differ from our assumptions about how objects should be seen - and which may offer interesting perspectives. For instance, Leahu demonstrates that a vision algorithm may end up encoding not just objects as discrete entities, but also aspects of the relations that constitute them - such as when a neural network trained to recognize dumbbells also encodes the images of arms holding the dumbbells [46]. For Leahu, this raises the possibility of "ontological surprises" - that computer vision algorithms may reveal unexpected relations between objects. In our analysis with GLIP we could sometimes see that the context surrounding an object might affect the algorithm’s classification, as in the example in Figure 10. It would be an interesting challenge for future research to explore whether this sensitivity to context - or other particular aspects of the way computer vision encodes objects - could be used to help art viewers or even art scholars discover new ways of seeing art.
7.4 Creating New Images
As highlighted in Section 5, we included the canvas feature to support user engagement in exploration of the collection. Through the test, we found that the canvas encouraged participants to continue their search for objects: when given the task of using the Canvas Screen to make an image, many users were eager to go back and look for more objects they could use. Several also said they would have liked to spend more time going back and forth between the Canvas Screen and the collection. One particularly eager user (P8) spent a long while creating multiple images and only stopped when we insisted that we needed to end the testing session. These observations confirm that the generative feature helped support engagement.
In addition, we found that generating an image through positioning and combining objects on the canvas made the participants reflect on the artworks’ context, time periods, styles, details, and composition. With outpainting, the participants were able to visually experience how styles and details can be merged into something new that goes beyond the original context of the object(s) (Fig. 7). This suggests that outpainting may have promising potential as a device to facilitate practice-based learning about these dimensions of visual art, in a manner that would be much more rapid and less dependent on practical skills than traditional exercises in drawing and painting.
7.5 Implications for Design
Based on the observations outlined above, we suggest a few topics that might be relevant to consider for designers working with object detection in digital art collections.
First, designers should pay close attention to the labelset used for object detection. As long as object metadata for the collection is unavailable or incomplete, it will be difficult to assess - other than by trial and error - which types of objects are prevalent in the collection and can be detected reliably. However, working with subject experts like museum curators or art historians might help identify appropriate labels, particularly when working with collections dominated by older art.
Second, designers might be interested in deliberately introducing flaws or errors in labelling as a way to provoke reflection and nudge users to question their own interpretation. However, our observations suggest that such errors might need to stand out strongly to ensure that users notice them and identify them as errors. If users place trust in, and even feel some sympathy for, the algorithm, then designers who wish to inspire critical reflection about the algorithm will need to work carefully on communicating to users that the algorithm is not necessarily to be trusted fully. Designers might explore ways to include confidence measures or other visualizations of uncertainty in the labelling; however, this would need to be balanced against the need to avoid disrupting the aesthetic of the art presentation, which is a strong design norm in art museums. Alternatively, designers might create deliberately ambiguous presentations of the algorithm’s outputs in order to provoke critical reflection, following tactics similar to those presented in [30, 72].
Third, future designers and art educators might use generative systems (such as in our Canvas Screen) to facilitate learning about visual art and composition. For instance, one might use a more narrowly curated set of paintings based on time period or style to provide insight into frequently depicted motifs and typical composition. This could be supported by predefined text prompts that exemplify styles, details, and compositions recurring within the particular collection of artworks, allowing the user to explore objects and visually experiment with image generation while working towards more focused learning outcomes.
8 Conclusion
We have presented an approach to using object detection to facilitate exploration of a large digital art collection. First, we have demonstrated that recent leaps in computer vision, in particular the emergence of multimodal models like CLIP and GLIP, have made it feasible to apply object detection to digitised art images with sufficient precision to support a meaningful and satisfying user experience for a general art-interested audience, such as the visitors to the National Gallery of Denmark.
Second, we have presented the design of a web application that uses the object detection data as the basis for an interface that allows users to explore the collection in a novel way, using objects of interest as an entry point and a generative system with outpainting to facilitate creative and playful exploration. The evaluation has demonstrated that this interface inspired test participants to see the art in a new light and to discover new things about it. We have highlighted the importance of constructing an appropriate labelset for the object detection, and drawn attention to the participants’ tendency to trust the system’s output and perhaps overlook errors in the object labelling. Finally, we have suggested some design implications that might inform future work with object detection in artwork.
Our study has been limited to artworks classified as paintings in the museum collection. Further research would be needed to explore whether the technology can be applied across diverse media types such as drawings and sketches, sculptures, photos and video, engravings, and so on. Furthermore, there is a need for cross-disciplinary research collaboration with art experts (for instance in art history or the digital humanities) to explore the aesthetic and pedagogical implications of extracting details from their original context in the artworks and presenting them to users as lists of objects from a variety of different artworks, styles, periods, and artistic agendas. While such an approach may seem problematic to some curators, as it means that image fragments are presented detached from their original context in the artwork, our study has demonstrated that it has the potential to inspire and engage museum visitors to discover and learn more about art. Tapping into this potential would be beneficial for both museums and their visitors - and would break new ground for the use of computer vision in art education and dissemination.