Cooperative learning of Pl@ntNet’s Artificial Intelligence algorithm: how does it work and how can we improve it?

Tanguy Lefort 1    Antoine Affouard 2    Benjamin Charlier 3    Jean-Christophe Lombardo 4    Mathias Chouet 5    Hervé Goëau 6    Joseph Salmon 7    Pierre Bonnet 8    Alexis Joly 9
1 Univ. Montpellier, CNRS, IMAG, Inria, LIRMM, France tanguy.lefort@umontpellier.fr
2 Inria, CIRAD, Montpellier, France antoine.affouard@cirad.fr
3 Univ. Montpellier, CNRS, IMAG, France benjamin.charlier@umontpellier.fr
4 Inria, LIRMM, France, jean-christophe.lombardo@inria.fr
5 CIRAD, AMAP, Montpellier, France, mathias.chouet@cirad.fr
6 CIRAD, AMAP, Montpellier, France, herve.goeau@cirad.fr
7 Univ. Montpellier, CNRS, IMAG, Institut Universitaire de France (IUF), joseph.salmon@umontpellier.fr
8 CIRAD, AMAP, Montpellier, France, pierre.bonnet@cirad.fr
9 Inria, LIRMM, France, alexis.joly@inria.fr
Abstract
  1.

    Deep learning models for plant species identification rely on large annotated datasets. The Pl@ntNet system enables global data collection by allowing users to upload and annotate plant observations, leading to noisy labels due to diverse user skills. Achieving consensus is crucial for training, but the vast scale of collected data (number of observations, users and species) makes traditional label aggregation strategies challenging. Existing methods either retain all observations, resulting in noisy training data, or selectively keep those with sufficient votes, discarding valuable information. Additionally, as many species are rarely observed, user expertise cannot be evaluated as an inter-user agreement: otherwise, botanical experts would have a lower weight in the AI training step than the average user.

  2.

    Our proposed label aggregation strategy aims to cooperatively train plant identification AI models. This strategy estimates user expertise as a trust score per user based on their ability to identify plant species from crowdsourced data. The trust score is recursively estimated from correctly identified species given the current estimated labels. This interpretable score exploits botanical experts’ knowledge and the heterogeneity of users. Subsequently, our strategy removes unreliable observations but retains those with limited trusted annotations, unlike other approaches.

  3.

    We evaluate Pl@ntNet’s strategy on a newly released large subset of the Pl@ntNet database focused on European flora, comprising over 6M observations and 800K users. This anonymized dataset of votes and observations is released openly at https://doi.org/10.5281/zenodo.10782465. We demonstrate that estimating users’ skills based on the diversity of their expertise enhances labeling performance.

  4.

    Our findings emphasize the synergy of human annotation and data filtering in improving AI performance for a refined training dataset. We explore incorporating AI-based votes alongside human input in the label aggregation. This can further enhance human-AI interactions to detect unreliable observations (even with few votes).

Keywords: crowdsourcing, botanical skills, human-AI interaction, label aggregation, Pl@ntNet, plant identification

Running Headline

Citizen science for plant identification

Acknowledgments

This work was funded by the French National Research Agency (ANR) through the grant Pl@ntAgroEco 22-PEAE0009 and by the Chaire IA CaMeLOt (ANR-20-CHIA-0001-01). It was granted access to the HPC resources of IDRIS under the allocation A0151011389 made by GENCI.

Data Availability

The dataset is available at https://doi.org/10.5281/zenodo.10782465

Conflict of Interest

The authors declare no conflicts of interest.

Author Contributions

T. Lefort, A. Affouard, A. Joly, B. Charlier, P. Bonnet and J. Salmon conceived the ideas and designed the evaluation methodology; A. Affouard and M. Chouet are the main developers of Pl@ntNet’s backend; T. Lefort and A. Affouard collected the evaluation data used in this paper; T. Lefort re-implemented Pl@ntNet’s algorithm in Python and conducted the evaluation; J-C. Lombardo, H. Goëau and A. Joly conceived and trained Pl@ntNet’s AI model; T. Lefort, B. Charlier, A. Joly and J. Salmon analyzed the outcomes of the study. All authors contributed critically to the drafts and gave final approval for publication.

1 Introduction

Computer vision models are a great aid in plant species recognition in the field \citep{vidal2021perspectives,borowiec2022,mader2021flora}. However, to train them we need large annotated datasets. These datasets are often created thanks to citizen science approaches, collecting both reliable and useful information \citep{brown2019potential}. Among existing plant recognition applications, the Pl@ntNet citizen science platform \citep{affouard2017pl} enables global data collection by allowing users to upload and annotate plant observations \citep{bonnet2020citizen}.

Figure 1: Pl@ntNet system of human-AI interaction for plant species recognition. Users take their plant observations in the Pl@ntNet application. A prediction is output by the AI model. Users can validate the prediction or propose another species. The whole collection of votes is used to evaluate user expertise (see Algorithm 1) and actively revise the identifications of observations.

At the time of writing, this participatory approach has resulted in the collection of over 20 million observations (images or groups of images of the same plant), belonging to almost 46 000 species, by more than 6 million users worldwide. In total, more than 25 million images are shared in these observations. The collaborative process of Pl@ntNet is summarized in Figure 1. The AI model interacts with the human decision by proposing possible species given an observation. For each returned species, using a similarity search, the Pl@ntNet system also shows similar pictures from the database. This lets users visually check that their observation is likely to belong to a predicted species given the most similar observations. For instance, such a visual control can help to compare two plants at various growth stages.

Plant species identification is a task that requires skills to recognize morphological traits (shapes, measurements, environments and specific characteristics). A large number of users with diverse skills have participated in gathering plant observations and helped improve the training dataset of our computer vision model. Their participation is based on votes that they can cast on others’ observations, or by the initial species determination of their observation. The quality of each vote is then processed by the algorithm presented in Section 2.2.

Other citizen science projects such as iNaturalist \citep{van2018inaturalist} or eBird \citep{sullivan2009ebird} use a similar approach to collect data, but differ in their label aggregation strategy. The iNaturalist project, with more than 2.5 million users, records the votes at different taxonomic levels. The resulting label is the aggregation of at least two votes on a species-level identification (or coarser or finer taxonomic level). A taxon requires at least a two-thirds agreement among identifiers, and all users have the same weight in the decision-making. Over time, a taxon can be further refined by the community, debated or revoked. eBird handles taxon quality control by using a checklist in each region for observers. Quality control on the checklist is performed and, combined with user knowledge (number of species and checklists submitted, number of flagged observations, discussions among local experts), the species observation is accepted. The eBird project also showed that monitoring species accumulation from observers can help to sort their skills \citep{kelling2015}. While they consider the species accumulation by hours spent on each collected observation, we propose a strategy that takes into account the entire history of observations of the observer.

In this article, we present the Pl@ntNet label aggregation strategy. Using a new large-scale dataset of more than 6 million observations and 800 thousand users, we show that our strategy can improve the quality of the collected data, without removing every observation that was only labeled by a single user. Finally, aggregated labels are used in practice to train an AI model. We explore how the information contained in the AI predictions can be integrated into the label aggregation strategy to generate new votes and help control data quality. By using the model’s predictions within the label aggregation, the goal is to correct possible mistakes from non-expert users without contradicting botanical experts.

2 Methods

2.1 Dataset and notation

To compare the different label aggregation strategies on large-scale datasets, we introduce a subset of the Pl@ntNet database focused on Southwestern European flora observations (Balearic Islands, Corsica, France, Portugal, Sardinia and Spain) from 2017 to October 2023. In total, 9 005 108 votes are cast by $n_{\text{user}} = 823\,251$ users on 6 699 593 observations after two cleaning steps on the voted species. The first one is a filtering step: we only keep the votes with plant species belonging to the World Checklist of Vascular Plants (WCVP) \citep{govaerts2023world}. For the second step, following the Royal Botanic Gardens, Kew, we matched synonyms to their backbone species if the species is part of the k-southwestern-europe checklist from the Plants of the World Online (POWO) system \citep{powo2024}. Note that there are plant species listed in the accepted species from WCVP that are not in the k-southwestern-europe POWO checklist. As there is a possible taxon ambiguity in this case (multiple species possible for a given synonym depending on the reference checklist), we leave the proposed label untouched. The dataset is available at https://zenodo.org/records/10782465.

Notation

In the following, denote $K$ the number of species within the dataset. We index the observations by $i \in [n_{\bullet}] = \{1, \dots, n_{\bullet}\}$ where $\mathcal{D}_{\bullet}$ is the considered dataset composed of $n_{\bullet}$ observations and their associated votes. For example, the full south-western European flora dataset from Pl@ntNet of $n_{\mathrm{SWE}} = 6\,699\,593$ observations is denoted $\mathcal{D}_{\mathrm{SWE}}$.
Other subsets are presented in Section 2.3. We write $\mathcal{U}$ the set of all users. Each user $u$ has a unique identifier used as an index, and we denote $\mathcal{U}_i$ the set of users that have voted on observation $i$, i.e. $\mathcal{U} = \cup_{i \in [n_{\mathrm{SWE}}]} \mathcal{U}_i$. The vote of user $u$ on observation $i$ is denoted $y_i^u \in [K]$. Estimated labels are denoted $\hat{y}_i \in [K]$. Each observation $i$ is created by an author $u$ stored in $\mathrm{Author}(i)$.

2.2 Proposed label aggregation strategy

Algorithm 1 Pl@ntNet iterative weighted majority vote
1: Input: votes $(u, y_i^u)_{i \in [n_{\mathrm{SWE}}], u \in [n_{\text{user}}]}$ for each observation $i$ and user $u$ answering the voted species $y_i^u$, accuracy threshold $\theta_{\text{acc}}$, confidence threshold $\theta_{\text{conf}}$, weight function $f$, initial weight $\gamma > 0$
2: Output: estimated labels $\hat{y}_i$ and validity indicator $s_i$ for each observation $i$
3: Initialize user weights as $w_u = \gamma$ for each user $u \in [n_{\text{user}}]$
4: while not converged do
5:     Get current estimated labels with a weighted majority vote:
           $\forall i \in [n_{\mathrm{SWE}}],\ \hat{y}_i = \operatorname*{arg\,max}_{k \in [K]} \sum_{u \in \mathcal{U}_i} w_u \mathds{1}(y_i^u = k)$
6:     for each observation $i \in [n_{\mathrm{SWE}}]$ do
7:         Compute label confidence: $\mathrm{conf}_i(\hat{y}_i) = \sum_{u \in \mathcal{U}_i} w_u \mathds{1}(y_i^u = \hat{y}_i)$
8:         Compute label accuracy: $\mathrm{acc}_i(\hat{y}_i) = \mathrm{conf}_i(\hat{y}_i) / \sum_{k \in [K]} \mathrm{conf}_i(k)$
9:         Compute validity indicator: $s_i = \mathds{1}(\mathrm{acc}_i(\hat{y}_i) \geq \theta_{\text{acc}} \text{ and } \mathrm{conf}_i(\hat{y}_i) \geq \theta_{\text{conf}})$
10:     end for
11:     for each user $u \in [n_{\text{user}}]$ do
12:         Compute the number of valid identified species for authored observations:
               $n_u^{\text{author}} = |\{y_i^u \in [K] \,|\, y_i^u = \hat{y}_i,\ s_i = 1,\ \mathrm{Author}(i) = u\}|$
13:         Compute the number of identified species by voting on others’ observations:
               $n_u^{\text{vote}} = |\{y_i^u \in [K] \,|\, y_i^u = \hat{y}_i,\ \mathrm{Author}(i) \neq u\}|$
14:         Compute the rounded number of identified species per user:
               $n_u = \mathrm{Round}\left(n_u^{\text{author}} + \frac{1}{10} n_u^{\text{vote}}\right)$
15:         Transform the number of identified species per user into a trust score: $w_u = f(n_u)$
16:     end for
17: end while

The Pl@ntNet label aggregation strategy relies on estimating the number of correctly identified species for each user. Similar to other strategies, we rely on an EM-based iterative procedure \citep{Dempster_Laird_Rubin77} to estimate alternately the users’ skills and each observation’s species. The detailed iterative algorithm is provided in Algorithm 1 and available at https://github.com/peerannot/peerannot. The label aggregation strategy generates a trust indicator ($s_i$) on the observation that can reveal whether an observation is valid or not. Notice that in Algorithm 1 we value authored observations 10 times more than votes on others’ observations: a user who proposes a new observation with a label (species name) contributes more than a user who proposes a label by clicking. Indeed, being in the field leads to more information on the environment and a better determination of the species. Finally, note that an identified species is counted exclusively as author (part of $n_u^{\text{author}}$ in Algorithm 1) or as click (part of $n_u^{\text{vote}}$) to avoid redundant skills.
The final number of species identified by a user is the aggregation of these two terms: $n_u = \mathrm{Round}\left(n_u^{\text{author}} + \frac{1}{10} n_u^{\text{vote}}\right)$.
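
As an illustration, the iterative procedure of Algorithm 1 can be sketched in a few lines of Python. This is a minimal sketch and not the peerannot implementation: the function and variable names, the convergence test (labels and validity indicators unchanged between two iterations), the tie-breaking rule of the arg max (first species encountered) and the use of Python's built-in `round` (which rounds halves to the nearest even integer) are assumptions made for the example.

```python
import math
from collections import defaultdict

ALPHA, BETA, GAMMA = 0.5, 0.2, math.log(2.1)  # values reported in the paper


def weight(n_species):
    """Trust score f(n) = n^alpha - n^beta + gamma."""
    return n_species**ALPHA - n_species**BETA + GAMMA


def aggregate(votes, authors, theta_conf=2.0, theta_acc=0.7, max_iter=50):
    """votes: list of (obs_id, user_id, species); authors: dict obs_id -> user_id."""
    users = {u for _, u, _ in votes}
    w = {u: GAMMA for u in users}  # every user starts with weight gamma
    labels, valid = {}, {}
    for _ in range(max_iter):
        # Weighted majority vote, confidence and accuracy per observation.
        scores = defaultdict(lambda: defaultdict(float))
        for i, u, k in votes:
            scores[i][k] += w[u]
        new_labels, new_valid = {}, {}
        for i, per_species in scores.items():
            best = max(per_species, key=per_species.get)
            conf = per_species[best]
            acc = conf / sum(per_species.values())
            new_labels[i] = best
            new_valid[i] = acc >= theta_acc and conf >= theta_conf
        # Assumed stopping rule: nothing changed since the last iteration.
        if new_labels == labels and new_valid == valid:
            break
        labels, valid = new_labels, new_valid
        # Count the distinct species each user identified correctly.
        author_sp, vote_sp = defaultdict(set), defaultdict(set)
        for i, u, k in votes:
            if k != labels[i]:
                continue
            if authors[i] == u:
                if valid[i]:
                    author_sp[u].add(k)
            else:
                vote_sp[u].add(k)
        for u in users:
            n_author = len(author_sp[u])
            # A species already counted as author is not counted again as vote.
            n_vote = len(vote_sp[u] - author_sp[u])
            w[u] = weight(round(n_author + n_vote / 10))
    return labels, valid, w
```

With three new users agreeing on one observation, the total weight $3\gamma \simeq 2.23$ exceeds $\theta_{\text{conf}} = 2$ and the observation is valid, whereas an observation carrying a single vote from a new user is kept with its label but flagged as not valid.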

Figure 2: Weight function in Equation 1 used to map the number of identified species to a trust score in the Pl@ntNet label aggregation strategy. A new user starts with a weight of $f(0) = f(1) = \gamma \simeq 0.74$. The user confidence threshold $\theta_{\text{conf}} = 2$ requires a user to have identified at least $n_u = 8$ species to become self-validating. The parameters $\alpha = 0.5$, $\beta = 0.2$ and $\gamma \simeq 0.74$ are used in practice.

The weight function $f$ shown in Figure 2 is a non-decreasing function that maps the number of identified species $n_u$ to a trust score in the form of:

$$w_u = f(n_u) = n_u^{\alpha} - n_u^{\beta} + \gamma \enspace, \tag{1}$$

where $\alpha, \beta \in \mathbb{R}_+^{\star}$ are hyperparameters that were calibrated internally to fit prior knowledge and $\gamma > 0$ is the constant representing the initial weight of each user. In practice, we use $\alpha = 0.5$, $\beta = 0.2$ and $\gamma = \log(2.1) \simeq 0.74$ in the weight function. This function is sub-linear ($\mathcal{O}(\sqrt{n_u})$) but with two different behaviors. The goal of Equation 1 is to separate new users from experts and then help sort multiple experts. This is modeled by the two behaviors of the weight function. In the first part, which corresponds to new users with low $n_u$, the term to the power of $\beta$ decreases the weight. We chose an initial weight $w_u = \gamma$ such that a user has a weight equal to $1$ (rounding to two decimals) with two different identifications. This separates the users who only come once to test the application from others. In the second part, with a higher number of identified species, the term to the power of $\beta$ becomes negligible and the function tends to the square root function. The sub-linear scale reduces discrepancies between people who have identified a comparable number of species (and thus have presumably comparable expertise).
The two thresholds that control the level of uncertainty accepted for a given label are set to $\theta_{\text{conf}} = 2$, to control the total weight on an observation, and $\theta_{\text{acc}} = 0.7$, to control the agreement between users given their expertise.
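
The interplay between Equation 1 and the confidence threshold can be checked numerically. The short Python sketch below evaluates the weight function with the parameters quoted above and searches for the smallest number of identified species whose weight reaches $\theta_{\text{conf}} = 2$. The helper names `trust_score` and `self_validation_threshold` are hypothetical and only serve this example.

```python
import math

ALPHA, BETA = 0.5, 0.2
GAMMA = math.log(2.1)  # initial weight, approximately 0.74


def trust_score(n_u):
    """Equation 1: f(n_u) = n_u^alpha - n_u^beta + gamma."""
    return n_u**ALPHA - n_u**BETA + GAMMA


def self_validation_threshold(theta_conf=2.0):
    """Smallest number of identified species whose weight reaches theta_conf."""
    n = 0
    while trust_score(n) < theta_conf:
        n += 1
    return n


if __name__ == "__main__":
    for n in (0, 1, 7, 8, 100):
        print(n, round(trust_score(n), 3))
    print("self-validating from n_u =", self_validation_threshold())
```

Under these parameters the search stops at $n_u = 8$, in agreement with the caption of Figure 2, and $f(0) = f(1) = \gamma$ since both power terms cancel.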

Users are said to be self-validating when they are trusted enough so that their proposed label single-handedly makes an observation valid ($s_i = 1$). From Algorithm 1, we see that this is verified when their weight $w_u$ is greater than the level $\theta_{\text{conf}}$. Indeed, with a single label we obtain $\mathrm{conf}_i(\hat{y}_i) = w_u > \theta_{\text{conf}}$ and $\mathrm{acc}_i(\hat{y}_i) = 1 > \theta_{\text{acc}}$. In practice, this means that an experienced user who has collected enough weight can validate any observation without any other user’s vote. Note that this identification can later be invalidated by other users with enough weight thanks to the accuracy threshold $\theta_{\text{acc}}$.
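
To make the role of the two thresholds concrete, the following Python sketch evaluates the validity indicator of a single observation from a list of weighted votes. It is an illustrative example only: the function name `validity`, the input format (pairs of weight and species) and the weights used below are assumptions, and ties in the arg max are resolved by taking the first species encountered.

```python
from collections import defaultdict


def validity(weighted_votes, theta_conf=2.0, theta_acc=0.7):
    """Return (label, s_i) for one observation from (weight, species) pairs."""
    totals = defaultdict(float)
    for w, species in weighted_votes:
        totals[species] += w
    label = max(totals, key=totals.get)   # weighted majority vote
    conf = totals[label]                  # total weight behind the label
    acc = conf / sum(totals.values())     # share of the total weight
    return label, conf >= theta_conf and acc >= theta_acc


# A self-validating user (weight above 2) validates the observation alone.
print(validity([(2.05, "A")]))
# A single new user (weight 0.74) disagreeing is not enough to invalidate it.
print(validity([(2.05, "A"), (0.74, "B")]))
# A second experienced user disagreeing drops the accuracy below 0.7.
print(validity([(2.05, "A"), (2.05, "B")]))
```

With a supporting weight $w$ and an opposing weight $w'$, the accuracy condition $w / (w + w') \geq 0.7$ fails as soon as $w' > \frac{3}{7} w$, which is why a disagreeing new user alone does not invalidate the label of a user with weight $2.05$ in this example.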

2.3 Evaluation against other aggregation strategies

Existing aggregation strategies.

Plant species label aggregation is a challenging task due to the large number of species, $K = 11\,425$. Hence, many classical strategies in the label aggregation literature such as Dawid and Skene’s \citep{dawid_maximum_1979} and other variations \citep{passonneau-carpenter-2014-benefits,sinha2018fast} are not applicable as they require estimating a $K \times K$ confusion matrix for each user. For the considered dataset $\mathcal{D}_{\mathrm{SWE}}$, this would result in $11\,425^2 \times 823\,251 \approx 10^{14}$ parameters to estimate. Similar issues occur for other label aggregation strategies \citep{whitehill_whose_2009,hovy2013learning,ma2020adversarial}. We do not consider deep-learning-based crowdsourcing strategies such as \citet{rodrigues2018deep,chu2021learning} or \citet{lefort2022improve}: they train a neural network from crowdsourced labels but do not output aggregated labels on the training set, whereas the Pl@ntNet application needs to propose one or multiple species to users for each observation. To overcome these issues, we consider the following label aggregation strategies that can scale with $K$ and the number of users:
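
The order of magnitude quoted above follows from a direct count, reproduced in the short Python snippet below. The function name is hypothetical; the two constants are the number of species and users of $\mathcal{D}_{\mathrm{SWE}}$ given in the text.

```python
def confusion_matrix_params(n_species, n_users):
    """Number of entries when each user has a K x K confusion matrix."""
    return n_species**2 * n_users


n_params = confusion_matrix_params(n_species=11_425, n_users=823_251)
print(f"{n_params:.2e} parameters")  # on the order of 1e14
```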

  • Majority Vote (MV) \citepjames1998majority: it selects the most answered label111Ties are broken at a random – creating sometimes some variability in the labeling process. and is the most common aggregation strategy. More formally, given an observation i𝑖iitalic_i:

    MV(i,{yiu}u)=argmaxk[K]u𝒰i𝟙(yiu=k).MV𝑖subscriptsuperscriptsubscript𝑦𝑖𝑢𝑢subscriptargmax𝑘delimited-[]𝐾subscript𝑢subscript𝒰𝑖1superscriptsubscript𝑦𝑖𝑢𝑘\mathrm{MV}(i,\{y_{i}^{u}\}_{u})=\operatorname*{arg\,max}_{k\in[K]}\sum_{u\in% \mathcal{U}_{i}}\mathds{1}(y_{i}^{u}=k)\enspace.roman_MV ( italic_i , { italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_u end_POSTSUPERSCRIPT } start_POSTSUBSCRIPT italic_u end_POSTSUBSCRIPT ) = start_OPERATOR roman_arg roman_max end_OPERATOR start_POSTSUBSCRIPT italic_k ∈ [ italic_K ] end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_u ∈ caligraphic_U start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT blackboard_1 ( italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_u end_POSTSUPERSCRIPT = italic_k ) .
  • Worker Agreement With Aggregate (WAWA) \citepappen_wawa_2021: this strategy, also known as the inter-rater agreement, weights each user by how much they agree with the MV labels on average. More formally, given an observation i𝑖iitalic_i:

    \begin{align*}
    \mathrm{WAWA}(i,\mathcal{D}_{\mathrm{SWE}}) &= \operatorname*{arg\,max}_{k\in[K]} \sum_{u\in\mathcal{U}_{i}} w_{u}\,\mathds{1}(y_{i}^{u}=k) \\
    \text{with } w_{u} &= \frac{1}{|\{y_{i'}^{u}\}_{i'}|} \sum_{i'=1}^{n_{\mathrm{SWE}}} \mathds{1}\left(y_{i'}^{u}=\mathrm{MV}(i',\{y_{i'}^{u}\}_{u})\right)\enspace.
    \end{align*}

    As there is no observation filter for $\mathrm{MV}$ and $\mathrm{WAWA}$, we set $s_{i}=1$ for every observation $i$ under these two strategies.

  • TwoThird: The TwoThird aggregation generates a label for observations with at least two votes. The estimated label is the one chosen by at least two-thirds of the voters. Every user has the same weight in the aggregation. It is part of iNaturalist’s label aggregation system \citep{van2018inaturalist}. More formally:

    \begin{align*}
    \mathrm{TwoThird}(i,\{y_{i}^{u}\}_{u}) &= \begin{cases}\mathrm{MV}(i,\{y_{i}^{u}\}_{u}) & \text{if } s_{i}=1 \\ \text{undefined} & \text{otherwise}\end{cases} \\
    \text{with } s_{i} &= \mathds{1}\left(\max_{k\in[K]} \frac{1}{|\mathcal{U}_{i}|}\sum_{u\in\mathcal{U}_{i}} \mathds{1}(y_{i}^{u}=k) \geq \frac{2}{3}\right)\enspace.
    \end{align*}
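The three baseline strategies above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names and the dictionary layout (`observation -> {user: label}`) are assumptions made for the example, ties are broken deterministically (smallest label) rather than at random, and the "at least two votes" condition of TwoThird is taken from the prose since the formula for $s_i$ alone does not enforce it.

```python
from collections import Counter, defaultdict


def majority_vote(votes):
    """votes: dict user -> label for one observation.
    Assumption: ties go to the smallest label (the text breaks them at random)."""
    counts = Counter(votes.values())
    best = max(counts.values())
    return min(k for k, c in counts.items() if c == best)


def wawa_weights(all_votes):
    """all_votes: dict observation -> dict user -> label.
    Returns w_u, the fraction of user u's votes that agree with the MV label."""
    agree, total = defaultdict(int), defaultdict(int)
    for votes in all_votes.values():
        mv = majority_vote(votes)
        for user, label in votes.items():
            total[user] += 1
            agree[user] += int(label == mv)
    return {user: agree[user] / total[user] for user in total}


def wawa(votes, weights):
    """Weighted majority vote using the WAWA weights."""
    scores = defaultdict(float)
    for user, label in votes.items():
        scores[label] += weights[user]
    best = max(scores.values())
    return min(k for k, s in scores.items() if s == best)


def two_third(votes):
    """Returns (label, s_i); label is None when the observation is invalid."""
    counts = Counter(votes.values())
    top = max(counts.values())
    # Integer form of top / len(votes) >= 2/3, plus the two-vote minimum.
    valid = len(votes) >= 2 and 3 * top >= 2 * len(votes)
    return (majority_vote(votes) if valid else None), int(valid)


# Toy data (hypothetical users a, b, c and species ids 0..3).
all_votes = {
    "o1": {"a": 0, "b": 0, "c": 1},
    "o2": {"a": 2, "c": 2},
    "o3": {"b": 1, "c": 3},
}
weights = wawa_weights(all_votes)
```

On this toy data, user `c` disagrees with the majority on two of three observations and therefore receives a lower WAWA weight than `a` and `b`, while TwoThird rejects `o3` because no label reaches two-thirds of the votes.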

Creation of an evaluation set in a crowdsourcing setting.

To evaluate the performance of a label aggregation strategy, it is necessary to know the ground truth on a subset of the data. However, in the context of crowdsourced data, there is no known truth for the observations. The sheer volume of data makes it impossible to ask botanical experts to create such ground truth for the whole database. Moreover, identifying species from images is much less accurate than identifying them in the field, due to the partial information contained in the image \citep{experts2018plant}.

Instead of asking experts to label a subset of the data, we identify botanical experts in the Pl@ntNet user database and consider their determinations as ground truth. We asked known botanical experts to refer other experts likely to have a Pl@ntNet account, in order to create a list of expert users. To these we added TelaBotanica \citep{heaton2010tela} users whose accounts register confirmed botanical experience and who are also Pl@ntNet users that participated in the South-Western Europe flora subset. In total, $98$ Pl@ntNet users were identified as botanical experts. Observations with at least one vote from one of these experts constitute our test set, denoted $\mathcal{D}_{\text{expert}}$. The answers of these experts are considered ground truth labels and used to evaluate the strategies’ performance. Despite our selection process of supposedly ‘indisputable’ experts, a few observations in the test set still end up with contradictory labels ($4$ observations in total). As they represent a very small percentage, we simply removed them from $\mathcal{D}_{\text{expert}}$.

Our evaluation set $\mathcal{D}_{\text{expert}}$ is finally composed of $\numprint{26811}$ observations which received at least one vote from one of the experts. Despite the large number of users, not all observations obtain multiple annotations. Indeed, $\numprint{310564}$ users were single-time voters (meaning they interacted with the system only once). The lack of votes is a major source of difficulty in the Pl@ntNet database, as the distribution of votes between observations is highly imbalanced, as represented in Figure 4(b). There is a high concentration of votes for a small percentage of the observations as shown in Figure 4(a). Of these evaluation data, $\numprint{17125}$ received at least two identifications and are stored in $\mathcal{D}_{\text{multiple votes}}$. Then, $\numprint{1263}$ have at least two votes with at least one disagreement between users and are stored in $\mathcal{D}_{\text{disagreement}}$. Figure 3 shows the distribution of observations from $\mathcal{D}_{\mathrm{SWE}}$ to the finer and more ambiguous $\mathcal{D}_{\text{disagreement}}$.
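The construction of the nested evaluation subsets can be illustrated as follows. This is a hedged sketch: the function name, the data layout and the `min_votes` parameter are hypothetical, and the minimum of two votes is an assumption chosen to match the definition of the subsets used for the evaluation. Observations on which experts contradict each other are dropped, as described above.

```python
def build_evaluation_subsets(all_votes, experts, min_votes=2):
    """all_votes: dict observation -> dict user -> label.
    experts: set of user ids treated as ground-truth providers.
    Returns three dicts observation -> expert label, nested by construction."""
    d_expert, d_multiple, d_disagreement = {}, {}, {}
    for obs, votes in all_votes.items():
        expert_labels = {label for user, label in votes.items() if user in experts}
        if len(expert_labels) != 1:
            # Either no expert voted, or experts contradict each other.
            continue
        truth = next(iter(expert_labels))
        d_expert[obs] = truth
        if len(votes) >= min_votes:
            d_multiple[obs] = truth
            if len(set(votes.values())) > 1:
                # At least one vote differs from the others.
                d_disagreement[obs] = truth
    return d_expert, d_multiple, d_disagreement


# Toy example with hypothetical experts e1, e2 and regular users u1, u2.
toy_votes = {
    "o1": {"e1": 0},
    "o2": {"e1": 1, "u1": 1},
    "o3": {"e1": 2, "u1": 3},
    "o4": {"u1": 0, "u2": 0},
    "o5": {"e1": 1, "e2": 2},
}
d_expert, d_multiple, d_disagreement = build_evaluation_subsets(toy_votes, {"e1", "e2"})
```

Here `o4` is excluded for lack of an expert vote and `o5` because the two experts disagree; the remaining sets satisfy the nesting shown in Figure 3.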

Evaluation metric.

Figure 3: Log-scale distribution of the observations in the South-West European Flora subset from the Pl@ntNet database. Note that the (sub-)datasets introduced are nested: $\mathcal{D}_{\mathrm{SWE}}\supset\mathcal{D}_{\text{expert}}\supset\mathcal{D}_{\text{multiple votes}}\supset\mathcal{D}_{\text{disagreement}}$. $\mathcal{D}_{\text{expert}}$ and the following subsets contain observations that received at least one vote from one of the experts.
(a) Relationship between the number of observations per user and the variety of species proposed per user. Each point represents a concentration of users in the SWE flora subset. $\numprint{310564}$ users proposed a single vote.
(b) Lorenz curves representing the imbalanced distribution of the number of votes in the South-West European Flora subset from the Pl@ntNet database. This imbalance is mitigated but still present in the created test set.
Figure 4: Pl@ntNet activity summary in the SWE flora subset. (A): The majority of users have proposed a small number of observations and species. However, some users have proposed a large number of observations and species. (B): In a perfectly balanced dataset, the Lorenz curve would be the diagonal: $50\%$ of the votes would be for $50\%$ of the observations. In practice, there is a high imbalance in the distribution of votes between observations: $80\%$ of the observations are represented by $10\%$ of votes.

To evaluate the label aggregation strategies, we use the following label recovery accuracy computed on the evaluation datasets:

\[ \mathrm{Acc}(\hat{y},y;\mathcal{D}_{\bullet}) = \frac{1}{n_{\bullet}}\sum_{i=1}^{n_{\bullet}} \mathds{1}(\hat{y}_{i}=y_{i})\,\mathds{1}(s_{i}=1)\enspace, \]

with $\hat{y}=(\hat{y}_{i})_{i}$ the estimated labels on $\mathcal{D}_{\bullet}\in\{\mathcal{D}_{\text{expert}},\mathcal{D}_{\text{multiple votes}},\mathcal{D}_{\text{disagreement}}\}$, and $y=(y_{i})_{i}$ the associated expert labels, considered as ground truth. When the aggregation strategy indicates the observation as invalid ($s_{i}=0$ for Pl@ntNet and TwoThird), this metric considers the sample as wrongly classified. Precision and recall scores are also computed to respectively measure the correctness of the observations indicated as valid and the ability to recover the ground truth observations in the valid set. We take into account the species imbalance by using a macro-average for these metrics. This treats rare species as equally important to common ones.
Denoting respectively $\mathrm{TP}_{k}$, $\mathrm{FP}_{k}$ and $\mathrm{FN}_{k}$ the true positives, false positives and false negatives related to the species $k$, the macro-averaged precision and recall write

\[ \mathrm{Precision}_{\mathrm{macro}}=\frac{1}{K}\sum_{k=1}^{K}\frac{\mathrm{TP}_{k}}{\mathrm{TP}_{k}+\mathrm{FP}_{k}}\quad\text{and}\quad\mathrm{Recall}_{\mathrm{macro}}=\frac{1}{K}\sum_{k=1}^{K}\frac{\mathrm{TP}_{k}}{\mathrm{TP}_{k}+\mathrm{FN}_{k}}\enspace. \]
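The accuracy and the macro-averaged scores can be computed with a short sketch such as the one below. It is an illustration under stated assumptions, not the authors' evaluation code: the function names are hypothetical, an invalidated observation is represented by a predicted label of `None` (so it counts as a false negative for its true species and never as a false positive), and a species with an empty denominator contributes $0$ to the average.

```python
def label_recovery_accuracy(y_hat, y_true, valid):
    """Fraction of observations that are both valid (s_i = 1) and correctly labeled.
    Invalid observations count as errors, as in the Acc formula."""
    hits = sum(1 for p, t, s in zip(y_hat, y_true, valid) if s == 1 and p == t)
    return hits / len(y_true)


def macro_precision_recall(y_hat, y_true, classes):
    """Per-species precision and recall, averaged with equal weight per species.
    Assumption: a 0/0 ratio is counted as 0."""
    precisions, recalls = [], []
    for k in classes:
        tp = sum(1 for p, t in zip(y_hat, y_true) if p == k and t == k)
        fp = sum(1 for p, t in zip(y_hat, y_true) if p == k and t != k)
        fn = sum(1 for p, t in zip(y_hat, y_true) if p != k and t == k)
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    return sum(precisions) / len(classes), sum(recalls) / len(classes)


# Toy example: four observations, three species, last observation invalidated.
y_true = [0, 0, 1, 2]
y_hat = [0, 1, 1, None]
valid = [1, 1, 1, 0]
acc = label_recovery_accuracy(y_hat, y_true, valid)
precision, recall = macro_precision_recall(y_hat, y_true, classes=[0, 1, 2])
```

Because each species contributes equally, the single invalidated observation of species `2` pulls both macro scores down as much as an error on a common species would.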

As both the Pl@ntNet and the TwoThird strategies can invalidate some of the observations, we also compute the proportion of observations removed from the whole dataset (whereas previous metrics are computed on the evaluation dataset). This complementary metric measures the proportion of samples "lost" for the training of the AI model after the aggregation step. In practice, filtering data might remove some noisy samples from the dataset. Yet, in general, the more samples are filtered out, the fewer remain to train the neural network. Finally, we also consider the proportion of species retrieved by the aggregation strategies on $\mathcal{D}_{\text{expert}}$, $\mathcal{D}_{\text{multiple votes}}$ and $\mathcal{D}_{\text{disagreement}}$. This is a critical consideration because if a species identified by experts is absent from the aggregated data, the neural network trained on this data will be unable to make predictions for that species.
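These two complementary quantities are simple to compute; the sketch below shows one possible reading. The function names are hypothetical, and the definition of a "retrieved" species (an expert-identified species that appears at least once among the valid aggregated labels) is an interpretation of the description above rather than a confirmed formula.

```python
def proportion_removed(valid):
    """Share of observations with s_i = 0, i.e. lost for training."""
    return 1 - sum(valid) / len(valid)


def proportion_species_retrieved(y_hat, y_true, valid):
    """Share of expert-identified species that still appear among
    the valid aggregated labels (interpretation, see lead-in)."""
    expert_species = set(y_true)
    kept_species = {p for p, s in zip(y_hat, valid) if s == 1}
    return len(expert_species & kept_species) / len(expert_species)


# Toy example: the only observation of species 2 is invalidated.
y_true = [0, 0, 1, 2]
y_hat = [0, 1, 1, None]
valid = [1, 1, 1, 0]
removed = proportion_removed(valid)
retrieved = proportion_species_retrieved(y_hat, y_true, valid)
```

In the toy example a quarter of the observations is removed, and species `2` disappears from the aggregated data, so a model trained on it could never predict that species.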

We evaluate the label recovery $\mathrm{Acc}$ of each strategy on $\mathcal{D}_{\text{expert}}$, $\mathcal{D}_{\text{multiple votes}}$ and $\mathcal{D}_{\text{disagreement}}$ (see also Figure 3): the test set where experts have provided at least one vote ($\mathcal{D}_{\text{expert}}$), its subset of observations with at least $2$ votes and one from an expert ($\mathcal{D}_{\text{multiple votes}}$) and its subset of observations with at least $2$ votes, one from an expert, and one disagreement ($\mathcal{D}_{\text{disagreement}}$). The latter is the most challenging as it contains the observations with the most ambiguity. We selected these subsets to investigate the label aggregation strategies’ performance depending on the ambiguity level.

2.4 Taking into account AI votes

While we restricted ourselves to the SWE subset, Pl@ntNet’s data is collected internationally. The more correctly identified observations are added to the training set, the better the trained model’s predictions for end-users. This classifier is trained from valid observations and aggregated labels (see Figure 1). Note that, in addition to Algorithm 1 and the filter on species names, more pre-processing steps are implemented for better performance \citep{affouard2017pl}, such as an additional rejection class (e.g. non-plant observations) and the handling of malformed observations (multiple images of different species in a single observation). At the time of writing, the model in use in Pl@ntNet is DINOv2 \citep{oquab2024dinov2}, a transformer-based network. This network is based on contrastive learning \citep{waida2023understanding}: it represents similar images as close embeddings to learn similar features for similar observations, and is then fine-tuned with supervised learning. Several transformations are performed during training, such as data augmentation \citep{yang2023image}, data standardization and label smoothing \citep{szegedy2016rethinking}. However, note that some observations from $\mathcal{D}_{\text{SWE}}$ have been processed by an earlier version of Pl@ntNet’s AI: either an InceptionV3 \citep{szegedy2015rethinking} or a BEIT \citep{bao2021beit} classifier. We can use the classifiers to generate votes. For an observation $i$, the AI vote is denoted $y_{i}^{\text{AI}}\in[K]$. The probability output by the classifier for its predicted species is denoted $\mathbb{P}(y_{i}^{\text{AI}})$.

If we consider the trained model as any other user, denoted as AI as user, the same label aggregation strategies as in Section 2.2 are available. However, with the Pl@ntNet aggregation algorithm, the AI weight increases drastically and overpowers human users (see Section 3.1). This would mean the next Pl@ntNet model is mostly trained on the predictions of the previous one. This defeats the purpose of a cooperative active learning system and the human-AI interaction. It would result in a dangerous feedback loop, and possible mode collapse. Thus, we explore alternative ways of integrating the AI votes in the aggregation algorithm:

  • AI as user: This is the naive approach we just described. The AI is considered as any other user in the database. The total number of users is thus raised to $n_{\text{user}}+1$.

  • Fixed weight AI: Give a fixed weight $w_{\text{AI}}=1.7>0$ to the AI. The weight is below the threshold $\theta_{\text{conf}}$ so that it cannot self-validate its predictions. The confidence writes

    \[ \mathrm{conf}_{i}(\hat{y}_{i})=\sum_{u\in\mathcal{U}_{i}}w_{u}\,\mathds{1}(y_{i}^{u}=\hat{y}_{i})+w_{\text{AI}}\,\mathds{1}(y_{i}^{\text{AI}}=\hat{y}_{i})\enspace. \tag{2} \]

    The final estimated label becomes

    \[ \hat{y}_{i}=\operatorname*{arg\,max}_{k\in[K]}\sum_{u\in\mathcal{U}_{i}}w_{u}\,\mathds{1}(y_{i}^{u}=k)+w_{\text{AI}}\,\mathds{1}(y_{i}^{\text{AI}}=k)\enspace. \tag{3} \]
  • Invalidating AI: The AI is considered as a user with a fixed weight and can only participate in invalidating identifications, i.e. in setting $s_{i}=0$. The confidence is updated as in Equation 2, but the final Weighted MV remains unchanged from Algorithm 1.

  • Confident AI: The AI is considered a user with a fixed weight and can only participate if the confidence in its prediction $\mathbb{P}(y_{i}^{\text{AI}})$ is over a threshold $\theta_{\text{score}}\in[0,1]$. The confidence writes

    \[ \mathrm{conf}_{i}(\hat{y}_{i})=\sum_{u\in\mathcal{U}_{i}}w_{u}\,\mathds{1}(y_{i}^{u}=\hat{y}_{i})+w_{\text{AI}}\,\mathds{1}(y_{i}^{\text{AI}}=\hat{y}_{i},\,\mathbb{P}(y_{i}^{\text{AI}})\geq\theta_{\text{score}})\enspace. \tag{4} \]

    The final estimated label becomes

    \[ \hat{y}_{i}=\operatorname*{arg\,max}_{k\in[K]}\sum_{u\in\mathcal{U}_{i}}w_{u}\,\mathds{1}(y_{i}^{u}=k)+w_{\text{AI}}\,\mathds{1}(y_{i}^{\text{AI}}=k,\,\mathbb{P}(y_{i}^{\text{AI}})\geq\theta_{\text{score}})\enspace. \tag{5} \]
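The three fixed-weight variants listed above differ only in when the AI vote enters the label choice and the confidence. The sketch below makes that explicit; it is an illustration, not Pl@ntNet's code. The function name, the `mode` strings and the argument layout are hypothetical, the user weights $w_u$ are assumed to be given, at least one human vote is assumed, ties go to the smallest label, and no value is assumed for $\theta_{\text{score}}$ (it must be passed for the Confident AI mode). The confidence adds $w_{\text{AI}}$ only when the AI vote equals the estimated label.

```python
from collections import defaultdict


def aggregate_with_ai(votes, weights, ai_label, ai_prob, mode,
                      w_ai=1.7, theta_score=None):
    """votes: dict user -> label (at least one human vote assumed).
    weights: dict user -> w_u.
    mode: "fixed", "invalidating" or "confident" (hypothetical names).
    Returns (estimated label, confidence in that label)."""
    human = defaultdict(float)
    for user, label in votes.items():
        human[label] += weights[user]

    # The Confident AI votes only above the score threshold; others always vote.
    ai_active = mode != "confident" or ai_prob >= theta_score

    # Label choice: the Invalidating AI never takes part in the weighted vote.
    scores = dict(human)
    if ai_active and mode != "invalidating":
        scores[ai_label] = scores.get(ai_label, 0.0) + w_ai
    label = max(sorted(scores), key=scores.get)

    # Confidence: human weight on the chosen label, plus w_AI if the AI agrees.
    conf = human.get(label, 0.0)
    if ai_active and ai_label == label:
        conf += w_ai
    return label, conf


# Toy example: two users disagree and the AI sides with the lighter one.
votes = {"u1": 3, "u2": 5}
weights = {"u1": 1.0, "u2": 1.5}
fixed = aggregate_with_ai(votes, weights, ai_label=3, ai_prob=0.9, mode="fixed")
invalidating = aggregate_with_ai(votes, weights, ai_label=3, ai_prob=0.9, mode="invalidating")
unsure = aggregate_with_ai(votes, weights, ai_label=3, ai_prob=0.2,
                           mode="confident", theta_score=0.5)
```

In the toy example the Fixed weight AI flips the label to species `3`, whereas the Invalidating AI and an under-threshold Confident AI leave the human weighted vote for species `5` untouched.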

On the choice of the AI weight.

The AI has a fixed weight $w_{\text{AI}}>0$ for the Fixed weight AI, the Invalidating AI and the Confident AI strategies. The choice of this weight must meet several constraints. First, we would like to prevent the AI votes from being self-validating, as this would validate all the AI predictions on a large part of the database; thus we must have $w_{\text{AI}}<\theta_{\mathrm{conf}}$ in Algorithm 1. We also want the AI votes to help clean the database by invalidating some observations from low-weight users (with weight $0<w_{\text{low}}\leq\theta_{\mathrm{conf}}$). Thus $w_{\text{low}}/(w_{\text{low}}+w_{\text{AI}})<\theta_{\mathrm{acc}}$. Hence, our constraints read:

\[ \begin{cases}w_{\text{AI}}&<\theta_{\mathrm{conf}}\\ \dfrac{w_{\text{low}}}{w_{\text{low}}+w_{\text{AI}}}&<\theta_{\mathrm{acc}}\end{cases}\enspace. \tag{6} \]

Taking the extreme case where a user becomes self-validating, $w_{\text{low}}=\theta_{\mathrm{conf}}$, we obtain that $w_{\text{AI}}>\theta_{\mathrm{conf}}\left(\frac{1-\theta_{\mathrm{acc}}}{\theta_{\mathrm{acc}}}\right)$. Using the first condition in Equation 6, we obtain the bounds

\[ \theta_{\mathrm{conf}}\left(\frac{1-\theta_{\mathrm{acc}}}{\theta_{\mathrm{acc}}}\right)<w_{\mathrm{AI}}<\theta_{\mathrm{conf}}\quad\left(\Longleftrightarrow 0.85<w_{\mathrm{AI}}<2\right)\enspace. \tag{7} \]
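The lower bound comes from rearranging the second constraint of Equation 6. As a short worked step, assuming $w_{\text{low}}>0$, $w_{\text{AI}}>0$ and $\theta_{\mathrm{acc}}\in(0,1)$ (assumptions made explicit here so that the inequality can be multiplied through without changing its direction):

\begin{align*}
\frac{w_{\text{low}}}{w_{\text{low}}+w_{\text{AI}}}<\theta_{\mathrm{acc}}
&\iff w_{\text{low}}<\theta_{\mathrm{acc}}\,(w_{\text{low}}+w_{\text{AI}}) \\
&\iff w_{\text{low}}\,(1-\theta_{\mathrm{acc}})<\theta_{\mathrm{acc}}\,w_{\text{AI}} \\
&\iff w_{\text{AI}}>w_{\text{low}}\left(\frac{1-\theta_{\mathrm{acc}}}{\theta_{\mathrm{acc}}}\right).
\end{align*}

The right-hand side grows with $w_{\text{low}}$, so taking $w_{\text{low}}=\theta_{\mathrm{conf}}$ gives the most demanding case over $0<w_{\text{low}}\leq\theta_{\mathrm{conf}}$.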

As more than a million observations from our dataset only have two votes, one way to choose the AI weight is to consider that the AI can invalidate two erroneous non-experts that would both have just enough weight to make the observation valid: $1.95=w_{\text{low}}<2$. Then, the AI weight should be greater than their cumulated confidence: $w_{\mathrm{AI}}>2w_{\text{low}}\left(\frac{1-\theta_{\mathrm{acc}}}{\theta_{\mathrm{acc}}}\right)$. We finally take the upper rounded value $w_{\mathrm{AI}}=1.70$ (which satisfies Equation 7).
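The arithmetic behind this choice can be checked numerically. The sketch below is only a sanity check under assumptions: $\theta_{\mathrm{conf}}=2$ is read off the upper bound in Equation 7, while $\theta_{\mathrm{acc}}=0.7$ is inferred here because it reproduces the stated lower bound of about $0.85$; neither value is restated in this section, and the function names are hypothetical.

```python
def ai_weight_bounds(theta_conf, theta_acc):
    """Bounds of Equation 7: theta_conf * (1 - theta_acc) / theta_acc < w_AI < theta_conf."""
    lower = theta_conf * (1 - theta_acc) / theta_acc
    return lower, theta_conf


def two_voter_lower_bound(w_low, theta_acc):
    """Minimal AI weight needed to invalidate two agreeing users of weight w_low each."""
    return 2 * w_low * (1 - theta_acc) / theta_acc


# Assumed thresholds (see lead-in): theta_conf = 2, theta_acc = 0.7.
lower, upper = ai_weight_bounds(theta_conf=2.0, theta_acc=0.7)
two_voters = two_voter_lower_bound(w_low=1.95, theta_acc=0.7)
w_ai = 1.70
```

Under these assumed thresholds, the two-voter requirement evaluates to roughly $1.67$, which rounds up to the chosen $w_{\mathrm{AI}}=1.70$ and stays strictly inside the interval of Equation 7.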

3 Results

3.1 Label aggregation performance comparison

(a) Accuracy on $\mathcal{D}_{\text{multiple votes}}$ w.r.t. the proportion of classes recovered
(b) Accuracy on $\mathcal{D}_{\text{disagreement}}$ w.r.t. the proportion of classes recovered
Figure 5: Accuracy of the aggregation strategies w.r.t. the proportion of classes (species) retrieved on subsets with at least two votes – either agreeing (A) or with at least one disagreeing vote (B). The Pl@ntNet aggregation is more accurate, especially in a highly ambiguous setting (B). The TwoThird data filter highly impacts how many classes are kept in the dataset and the overall accuracy in both settings. WAWA and MV perform similarly with a benefit for WAWA when skill evaluation is needed.

Accuracy of the aggregation strategies.

We begin by evaluating the accuracy of the label aggregation strategies on the set of observations labeled by experts, $\mathcal{D}_{\text{expert}}$. Figure 5 shows how many predicted labels match the experts’ answers on $\mathcal{D}_{\text{multiple votes}}$ and $\mathcal{D}_{\text{disagreement}}$. More importantly, we compare this quantity with the proportion of species retrieved by the aggregation strategy. We observe that the data filtering from the TwoThird strategy, which requires at least two thirds of agreement, highly degrades performance with respect to other strategies. On $\mathcal{D}_{\text{expert}}$, MV reaches $97\%$ accuracy, WAWA $98\%$, TwoThird $60\%$ and Pl@ntNet $99\%$. To differentiate between the best-performing strategies, we need to look at more ambiguous observations like those in $\mathcal{D}_{\text{multiple votes}}$ and $\mathcal{D}_{\text{disagreement}}$. In highly ambiguous settings, the WAWA strategy outperforms the MV one. However, overall the Pl@ntNet aggregation agrees more often with the experts and retrieves almost $90\%$ of the plant species identified by experts in highly ambiguous datasets, against $73\%$ for WAWA, $71\%$ for MV and only $41\%$ for TwoThird.
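The two quantities compared in Figure 5 can be sketched as follows. This is a minimal illustration and not the evaluation code behind the figures: it assumes that labels are stored in dictionaries keyed by observation identifier, that an invalidated observation is represented by `None` and counted as an error, and that a species is "recovered" when at least one of its expert-labeled observations receives the same aggregated label. Function names and the toy data are hypothetical.

```python
from typing import Dict, Hashable, Optional


def accuracy(pred: Dict[Hashable, Optional[str]], expert: Dict[Hashable, str]) -> float:
    """Share of expert-labeled observations whose aggregated label matches the expert."""
    hits = sum(1 for obs, label in expert.items() if pred.get(obs) == label)
    return hits / len(expert)


def proportion_of_classes_recovered(
    pred: Dict[Hashable, Optional[str]], expert: Dict[Hashable, str]
) -> float:
    """Share of expert species matched by at least one aggregated label."""
    expert_species = set(expert.values())
    recovered = {label for obs, label in expert.items() if pred.get(obs) == label}
    return len(recovered) / len(expert_species)


# Toy data: observation 2 is mislabeled, observation 4 is invalidated (None).
expert = {1: "Quercus robur", 2: "Quercus ilex", 3: "Acer campestre", 4: "Acer campestre"}
pred = {1: "Quercus robur", 2: "Quercus robur", 3: "Acer campestre", 4: None}

print(accuracy(pred, expert))
print(proportion_of_classes_recovered(pred, expert))
```

In this toy case half of the observations are correctly labeled and two of the three expert species are recovered, which shows that the two quantities can differ.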

(a) Precision and recall of label aggregation strategies on $\mathcal{D}_{\text{disagreement}}$.
(b) Number of observations in $\mathcal{D}_{\mathrm{SWE}}$ indicated as valid for training ($s_i = 1$).
Figure 6: (A): TwoThird strategy has better precision than MV and WAWA strategies, with lower recall because of the heavy filter on the validity of observations. Pl@ntNet aggregation strategy obtains best precision and recall and outperforms other strategies. (B): TwoThird performance drop can be explained in part by the high proportion of data considered invalid. Note that MV and WAWA strategies do not invalidate any observation, hence keeping potentially mislabeled or low-quality observations. Pl@ntNet achieves a balance between filtering out observations and achieving high performance.
Figure 7: Examples of invalid ($s_i = 0$) and valid ($s_i = 1$) observations using the Pl@ntNet strategy described in Algorithm 1.

Precision and recall.

To better evaluate each aggregation strategy, we compute the precision and recall for each species and report their macro averages. Results are shown in Figure 6(a). The observation filter ($s_i = 0$) of the TwoThird strategy strongly limits its ability to identify most of the positive observations for a given species, which lowers its recall. This agreement threshold filter is designed to keep as few noisy samples as possible in research-grade observations (a data quality indicator for research database usage in iNaturalist). Accordingly, TwoThird obtains better precision than MV and WAWA, but Pl@ntNet’s precision is markedly higher. The WAWA strategy outperforms a naive MV aggregation, showing that weighing users can indeed lead to better performance. The Pl@ntNet strategy outperforms all others in both precision and recall. Weighing users based on their number of identified species is both interpretable and effective, and the observation filter does not negatively impact the recall.
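For reference, macro-averaged precision and recall can be computed as in the sketch below. It is an illustration under stated assumptions rather than the evaluation code used here: a filtered observation is represented by `None` (it counts against recall but not against precision), the average runs over the species present in the reference labels, and a species that is never predicted gets a precision of 0. The function name and the toy labels are hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


def macro_precision_recall(pairs: Iterable[Tuple[Optional[str], str]]) -> Tuple[float, float]:
    """Macro precision and recall from (aggregated label, reference label) pairs."""
    true_pos, predicted, actual = Counter(), Counter(), Counter()
    for pred, true in pairs:
        actual[true] += 1
        if pred is None:  # observation filtered out: no prediction is made
            continue
        predicted[pred] += 1
        if pred == true:
            true_pos[true] += 1
    species = sorted(actual)
    # Precision is set to 0 for species that are never predicted (assumption).
    precisions = [true_pos[s] / predicted[s] if predicted[s] else 0.0 for s in species]
    recalls = [true_pos[s] / actual[s] for s in species]
    return sum(precisions) / len(species), sum(recalls) / len(species)


# Toy labels: one confusion (B labeled as A) and one filtered observation of B.
pairs = [("A", "A"), ("A", "B"), ("B", "B"), (None, "B")]
precision, recall = macro_precision_recall(pairs)
print(precision, recall)
```

In this toy case the filtered observation lowers the recall of species B without affecting its precision, which is the mechanism by which a heavy validity filter trades recall for precision.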

Volume of valid data.

The community labels are aggregated to generate training data for the AI model. More data is generally better; however, we need to filter out observations that have low visual quality or are potentially mislabeled. This is the reason for the validity indicator $s_i$ in the TwoThird and Pl@ntNet strategies. On $\mathcal{D}_{\mathrm{SWE}}$, Figure 6(b) shows how much data is kept for later training. MV and WAWA keep all proposed observations for training, including potentially noisy ones. TwoThird filters out most observations and keeps nearly $1.5$ million (representing $23.43\%$ of the total observations). Pl@ntNet finds an improved balance between filtering invalid observations and keeping enough data for training.

Qualitative results on Pl@ntNet observation filter.

In this section, we show some examples of observations invalidated by the Pl@ntNet strategy (see Figure 7). Invalid observations often result from a lack of user participation on others’ observations. Disagreements between users can arise from many factors: blurriness, multiple species in the same observation, a distance from the plant that does not allow precise identification, etc. Valid observations, as shown in the second row of Figure 7, are zoomed in on the plant’s flower, leaf or other organ to help the identification process.

3.2 Aggregation considering AI vote

The neural network currently deployed in Pl@ntNet’s system makes predictions based on its training on the Pl@ntNet database (across different floras). We compare four strategies, presented in Section 2.4, to integrate the AI vote into the Pl@ntNet label aggregation strategy: AI as user, fixed weight AI, invalidating AI and confident AI. For the confident AI strategy, we evaluate multiple thresholds $\theta_{\text{score}}$. Note that if $\theta_{\text{score}} = 0$ the AI votes for all observations, and if $\theta_{\text{score}} = 1$ the AI does not vote and we recover the performance of the current Pl@ntNet aggregation strategy presented in Algorithm 1. We see in Figure 8 that the confident AI strategy with $\theta_{\text{score}} = 0.7$ seems to perform best and keep the most data in both $\mathcal{D}_{\mathrm{SWE}}$ and $\mathcal{D}_{\mathrm{expert}}$.
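The role of the threshold $\theta_{\text{score}}$ can be illustrated with a short sketch. It is not the implementation of Algorithm 1: it assumes that votes are (label, weight) pairs, that the AI votes with the fixed weight $w_{\mathrm{AI}} = 1.70$ only when its top score is strictly greater than the threshold, and that the aggregated label is the one with the largest cumulated weight. All function names and the toy votes are hypothetical.

```python
from collections import defaultdict
from typing import List, Tuple

W_AI = 1.70  # fixed AI weight chosen in the text

Vote = Tuple[str, float]  # (proposed label, weight of the voter)


def votes_with_confident_ai(
    user_votes: List[Vote], ai_label: str, ai_score: float, theta_score: float
) -> List[Vote]:
    """Append the AI vote only if its top score is strictly above theta_score.

    With theta_score = 0 any positive score votes; with theta_score = 1 the AI
    never votes, since a probability cannot exceed 1.
    """
    votes = list(user_votes)
    if ai_score > theta_score:
        votes.append((ai_label, W_AI))
    return votes


def weighted_label(votes: List[Vote]) -> str:
    """Return the label with the largest cumulated weight (assumed aggregation rule)."""
    totals = defaultdict(float)
    for label, weight in votes:
        totals[label] += weight
    return max(totals, key=totals.get)


# Toy example: two users disagree and the AI is confident in the second label.
user_votes = [("Quercus robur", 1.95), ("Quercus ilex", 1.0)]
with_ai = votes_with_confident_ai(user_votes, "Quercus ilex", ai_score=0.9, theta_score=0.7)
without_ai = votes_with_confident_ai(user_votes, "Quercus ilex", ai_score=0.9, theta_score=1.0)

print(weighted_label(with_ai))
print(weighted_label(without_ai))
```

In this toy case the confident AI vote switches the aggregated label, while setting the threshold to 1 leaves the user-only aggregation unchanged.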

Figure 8: Performance in label recovery and number of observations marked as valid depending on how the AI vote is integrated. MV, WAWA and Pl@ntNet strategy without AI vote are used as reference. The best-performing strategy overall is confident AI with $\theta_{\text{score}} = 0.7$. We also see that when $\theta_{\text{score}}$ tends to $1$, we recover the vanilla Pl@ntNet aggregation strategy.

4 Discussion

We demonstrated that collaborative identification of plant species can effectively be used to obtain expert-level labels. Releasing a large subset of millions of observations and thousands of users from the Pl@ntNet organization, we investigated a label aggregation strategy that weighs user answers based on their estimated number of species correctly identified without using prior expert knowledge. Many previously used strategies either do not scale to the magnitude of current databases such as Pl@ntNet, iNaturalist or eBird, or are outperformed by our aggregation.

Our strategy weighs users based on the number of correctly identified species. This weight is interpretable and reflects the diversity of the user’s skill set. It can be directly applied to other crowdsourced frameworks with a high number of classes like iNaturalist or eBird. The values of both hyperparameters $\theta_{\mathrm{conf}}$ and $\theta_{\mathrm{acc}}$, which respectively control the cumulated weight on an observation and the agreement level for the given label, can be applied as is.

Note that Pl@ntNet’s label control system heavily rests on visual analysis of observations and inter-user agreements. Additional metadata such as geolocation, date, phenological stage or visual description can be registered in Pl@ntNet and help identify the plant’s species but are currently not directly taken into account for user evaluation. Such information – in particular spatial information – could also be used to generate more interaction between users and collect more votes through possible common interests. In addition, users are helped by the system with images similar to the identification proposed in a given checklist. The additional information could guide users in their vote – for example by notifying a possible incoherence between the current botanical knowledge on a species and the metadata entered (such as the altitude, the distance to the sea, a species not known to survive in a given area).

As for the inclusion of the AI vote, some concerns should be raised. First, as the AI model is trained from the aggregated labels and observations, integrating its vote should not make the AI predictions run out of control. If we consider the AI as a user, then, because training is iterative, the system fails to learn from the human labels. However, using the AI vote with a fixed weight to invalidate data can help clean the database, and with enough weight other users can switch the validity back. This alone would not help in correcting a wrong label. To address this, we investigate in Section 3.2 a strategy in which the AI votes with a fixed weight only when the model is confident enough. We observe that this strategy leads to better performance. As we use the output probabilities, we should also discuss the calibration of our network.

(a) Reliability diagram of Pl@ntNet AI on $\mathcal{D}_{\mathrm{expert}}$
(b) Reliability diagram of Pl@ntNet AI on $\mathcal{D}_{\mathrm{disagreement}}$
Figure 9: Reliability diagrams for the Pl@ntNet AI on the expert dataset and on the more ambiguous $\mathcal{D}_{\mathrm{disagreement}}$ subset. The AI is overall underconfident (A). However, on more ambiguous observations it is overconfident for observations leading to high predicted probabilities (B).

Calibration is the measure of how close the output confidence is to the true probability \citep{niculescu2005predicting}. Currently, the Pl@ntNet AI is not calibrated using post-processing methods \citep{platt1999probabilistic, guo_calibration_2017}. We discuss hereafter the calibration of the current AI model and possible guidelines for further integration of AI votes.
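A reliability diagram groups predictions by confidence and compares, in each group, the mean confidence with the observed accuracy. The sketch below shows one common way to compute these quantities; it assumes equal-width bins and also reports the expected calibration error, a standard summary that is not reported in this paper. The function names and the toy values are hypothetical.

```python
from typing import List, Sequence, Tuple

Bin = Tuple[float, float, int]  # (mean confidence, accuracy, number of samples)


def reliability_bins(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> List[Bin]:
    """Summarise each non-empty equal-width confidence bin."""
    sums = [[0.0, 0, 0] for _ in range(n_bins)]  # [confidence sum, hits, count]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes to the last bin
        sums[idx][0] += conf
        sums[idx][1] += int(ok)
        sums[idx][2] += 1
    return [(c / n, h / n, n) for c, h, n in sums if n > 0]


def expected_calibration_error(bins: List[Bin]) -> float:
    """Sample-weighted average gap between accuracy and confidence."""
    total = sum(n for _, _, n in bins)
    return sum(n / total * abs(acc - conf) for conf, acc, n in bins)


# Toy values: accuracy above confidence means underconfident, below means overconfident.
confidences = [0.55, 0.55, 0.95, 0.95]
correct = [True, True, True, False]
bins = reliability_bins(confidences, correct)
for conf, acc, n in bins:
    status = "underconfident" if acc > conf else "overconfident"
    print(f"confidence {conf:.2f}, accuracy {acc:.2f}, n={n}: {status}")
print(expected_calibration_error(bins))
```

A bin lying above the diagonal of the diagram (accuracy higher than confidence) corresponds to underconfidence, and a bin below it to overconfidence.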

From Figure 9(a), we see that the Pl@ntNet AI is currently underconfident, meaning that it consistently underestimates its confidence and conveys more uncertainty to users than it should. One factor that can influence the results is that the calibration is computed on the test set where experts either authored or voted on observations. Botanical experts have more experience with taking pictures of plants and better equipment than the average citizen. Thus, the observation quality, and subsequently the probability distribution output by the AI, can be biased. Another factor known to lead to such suboptimal predictions is data augmentation \citep{kapoor2022uncertainty}. As the model trains on multiple versions of each original sample with multiple distortions, these variations can become unrepresentative of the underlying sample distribution and cause unnecessary prediction difficulties. The data augmentation is used to mitigate the species imbalance of the database.

However, this imbalance is also known to lead to miscalibrations in predictions \citep{ao2023two}. In Figure 9(b), we see that for ambiguous observations (where users disagree), the AI is overconfident in its highest predictions – which represent half of the dataset – and underconfident in the other half. These different calibration behaviors inform us that, if a given strategy should incorporate the AI votes in the label aggregation based on the output probabilities, we need to be able to rely on such probabilities. Therefore, even if the confident AI strategy leads to the best performance in Section 3.2, it should not be used directly without recalibration of the model, using for example temperature scaling \citep{guo_calibration_2017}. Future work should investigate the relationship between the model’s confidence gap and the ambiguity of observations as reflected in users’ labels. The current large-scale and interpretable aggregation strategy from Pl@ntNet already outperforms others without the AI votes.
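Temperature scaling, mentioned above as a possible recalibration method, divides the logits by a single scalar $T$ fitted on held-out data by minimizing the negative log-likelihood. The sketch below is a minimal illustration and not part of the Pl@ntNet pipeline: it replaces the usual gradient-based fit with a simple grid search, and the function names, the grid and the toy logits are assumptions.

```python
import math
from typing import List, Optional, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Softmax of the logits divided by the temperature (numerically stabilised)."""
    scaled = [z / temperature for z in logits]
    shift = max(scaled)
    exps = [math.exp(z - shift) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logits_batch: Sequence[Sequence[float]], labels: Sequence[int], temperature: float) -> float:
    """Mean negative log-likelihood of the true labels at a given temperature."""
    losses = [-math.log(softmax(z, temperature)[y]) for z, y in zip(logits_batch, labels)]
    return sum(losses) / len(losses)


def fit_temperature(
    logits_batch: Sequence[Sequence[float]],
    labels: Sequence[int],
    grid: Optional[Sequence[float]] = None,
) -> float:
    """Grid search for the temperature minimising the held-out NLL (stand-in for an optimiser)."""
    grid = grid or [0.25 + 0.05 * k for k in range(96)]  # from 0.25 to 5.0
    return min(grid, key=lambda t: nll(logits_batch, labels, t))


# Toy held-out set: every prediction is made with high confidence, one in four is wrong.
logits_batch = [[4.0, 0.0], [4.0, 0.0], [4.0, 0.0], [0.0, 4.0]]
labels = [0, 0, 1, 1]

temperature = fit_temperature(logits_batch, labels)
print(temperature)
print(max(softmax([4.0, 0.0])), max(softmax([4.0, 0.0], temperature)))
```

On this overconfident toy model the fitted temperature is greater than 1, which lowers the top confidence towards the observed accuracy. Since dividing by a positive scalar preserves the ordering of the logits, the predicted class is unchanged and only the confidence used for thresholding would be affected.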

Statement on inclusion

We affirm our commitment to promoting diversity and inclusivity in scientific research. Our collected crowdsourced data brings together a wide range of participants. We actively encourage and welcome involvement from individuals of diverse backgrounds, expertise, and perspectives, recognizing the value of their contributions in advancing ecological research and promoting a more comprehensive understanding of plant biodiversity.

\printbibliography