
Cooperative learning of Pl@ntNet’s Artificial Intelligence algorithm: how does it work and how can we improve it?

Tanguy Lefort ¹ Antoine Affouard ² Benjamin Charlier ³ Jean-Christophe Lombardo ⁴ Mathias Chouet ⁵ Hervé Goëau⁶ Joseph Salmon ⁷ Pierre Bonnet ⁸ Alexis Joly ⁹
¹Univ. Montpellier, CNRS, IMAG, Inria, LIRMM, France tanguy.lefort@umontpellier.fr
² Inria, CIRAD, Montpellier, France antoine.affouard@cirad.fr
³ Univ. Montpellier, CNRS, IMAG, France benjamin.charlier@umontpellier.fr
⁴ Inria, LIRMM, France, jean-christophe.lombardo@inria.fr
⁵ CIRAD, AMAP, Montpellier, France, mathias.chouet@cirad.fr
⁶ CIRAD, AMAP, Montpellier, France, herve.goeau@cirad.fr
⁷ Univ. Montpellier, CNRS, IMAG, Institut Universitaire de France (IUF), joseph.salmon@umontpellier.fr
⁸ CIRAD, AMAP, Montpellier, France, pierre.bonnet@cirad.fr
⁹ Inria, LIRMM, France, alexis.joly@inria.fr

Abstract

1. Deep learning models for plant species identification rely on large annotated datasets. The Pl@ntNet system enables global data collection by allowing users to upload and annotate plant observations, leading to noisy labels due to diverse user skills. Achieving consensus is crucial for training, but the vast scale of collected data (number of observations, users and species) makes traditional label aggregation strategies challenging. Existing methods either retain all observations, resulting in noisy training data, or selectively keep those with sufficient votes, discarding valuable information. Additionally, as many species are rarely observed, user expertise cannot be evaluated as an inter-user agreement: otherwise, botanical experts would have a lower weight in the AI training step than the average user.

2. Our proposed label aggregation strategy aims to cooperatively train plant identification AI models. This strategy estimates user expertise as a trust score per user based on their ability to identify plant species from crowdsourced data. The trust score is recursively estimated from correctly identified species given the current estimated labels. This interpretable score exploits botanical experts’ knowledge and the heterogeneity of users. Subsequently, our strategy removes unreliable observations but retains those with limited trusted annotations, unlike other approaches.

3. We evaluate Pl@ntNet’s strategy on a newly released large subset of the Pl@ntNet database focused on European flora, comprising over 6M observations and 800K users. This anonymized dataset of votes and observations is released openly at https://doi.org/10.5281/zenodo.10782465. We demonstrate that estimating users’ skills based on the diversity of their expertise enhances labeling performance.

4. Our findings emphasize the synergy of human annotation and data filtering in improving AI performance for a refined training dataset. We explore incorporating AI-based votes alongside human input in the label aggregation. This can further enhance human-AI interactions to detect unreliable observations (even with few votes).

Keywords: crowdsourcing, botanical skills, human-AI interaction, label aggregation, Pl@ntNet, plant identification

Running Headline

Citizen science for plant identification

Acknowledgments

This work was funded by the French National Research Agency (ANR) through the grant Pl@ntAgroEco 22-PEAE0009 and by the Chaire IA CaMeLOt (ANR-20-CHIA-0001-01). It was granted access to the HPC resources of IDRIS under the allocation A0151011389 made by GENCI.

Data Availability

The dataset is available at https://doi.org/10.5281/zenodo.10782465

Conflict of Interest

The authors declare no conflicts of interest

Author Contributions

T. Lefort, A. Affouard, A. Joly, B. Charlier, P. Bonnet and J. Salmon conceived the ideas and designed the evaluation methodology; A. Affouard and M. Chouet are the main developers of Pl@ntNet’s backend; T. Lefort and A. Affouard collected the evaluation data used in this paper; T. Lefort re-implemented Pl@ntNet’s algorithm in Python and conducted the evaluation; J-C. Lombardo, H. Goëau and A. Joly conceived and trained Pl@ntNet’s AI model; T. Lefort, B. Charlier, A. Joly and J. Salmon analyzed the outcomes of the study. All authors contributed critically to the drafts and gave final approval for publication.

1 Introduction

Computer vision models are a great aid in plant species recognition in the field \citep{vidal2021perspectives,borowiec2022,mader2021flora}. However, to train them we need large annotated datasets. These datasets are often created thanks to citizen science approaches, collecting both reliable and useful information \citep{brown2019potential}. Among existing plant recognition applications, the Pl@ntNet citizen science platform \citep{affouard2017pl} enables global data collection by allowing users to upload and annotate plant observations \citep{bonnet2020citizen}.

Figure 1: Pl@ntNet system of human-AI interaction for plant species recognition. Users take their plant observations in the Pl@ntNet application. A prediction is output by the AI model. Users can validate the prediction or propose another species. The whole collection of votes is used to evaluate user expertise (see Algorithm 1) and actively revise observation identifications.

At the time of writing, this participatory approach has resulted in the collection of over 20 million observations (images or groups of images of the same plant), belonging to almost $\numprint{46000}$ species, by more than 6 million users worldwide. In total, more than $25$ million images are shared in these observations. The collaborative process of Pl@ntNet is summarized in Figure 1. The AI model interacts with the human decision by proposing possible species given an observation. For each returned species, using a similarity search, the Pl@ntNet system also shows similar pictures from the database. This lets users visually check that their observation is likely to belong to a predicted species given the most similar observations. For instance, such a visual control can help to compare two plants at various growth stages.

Plant species identification is a task that requires skills to recognize morphological traits (shapes, measurements, environments and specific characteristics). A large number of users with diverse skills have participated in gathering plant observations and helped improve the training dataset of our computer vision model. Their participation is based on votes that they can cast on others’ observations, or on the initial species determination of their own observation. The quality of each vote is then assessed by the algorithm presented in Section 2.2.

Other citizen science projects such as iNaturalist \citep{van2018inaturalist} or eBird \citep{sullivan2009ebird} use a similar approach to collect data, but differ in their label aggregation strategy. The iNaturalist project, with more than $2.5$ million users, records the votes at different taxonomic levels. The resulting label is the aggregation of at least two votes on a species-level identification (or coarser or finer taxonomic level). A taxon requires at least a two-thirds agreement among identifiers and all users have the same weight in the decision-making. Over time, a taxon can be further refined by the community, debated or revoked. eBird handles taxon quality control by using a checklist in each region for observers. Quality control on the checklist is performed and, combined with user knowledge – number of species and checklists submitted, number of flagged observations, discussions among local experts – the species observation is accepted. The eBird project also showed that monitoring species accumulation from observers can help to sort their skills \citep{kelling2015}. While they consider the species accumulation by hours spent on each collected observation, we propose a strategy that takes into account the entire history of observations of the observer.

In this article, we present the Pl@ntNet label aggregation strategy. Using a new large-scale dataset of more than $6$ million observations and $800$ thousand users, we show that our strategy can improve the quality of the collected data, without removing every observation that was only labeled by a single user. Finally, aggregated labels are used in practice to train an AI model. We explore how the information contained in the AI predictions can be integrated into the label aggregation strategy to generate new votes and help control data quality. By using the model’s predictions within the label aggregation, the goal is to correct possible mistakes from non-expert users without contradicting botanical experts.

2 Methods

2.1 Dataset and notation

To compare the different label aggregation strategies on large-scale datasets, we introduce a subset of the Pl@ntNet database focused on Southwestern European flora observations – Baleares, Corsica, France, Portugal, Sardegna and Spain – from $2017$ to October $2023$. In total, $\numprint{9005108}$ votes are cast by $n_{\text{user}}=\numprint{823251}$ users on $\numprint{6699593}$ observations after two cleaning steps on the voted species. The first one is a filtering step: we only keep the votes with plant species belonging to the World Checklist of Vascular Plants (WCVP) \citep{govaerts2023world}. For the second step, following the Royal Botanic Gardens, Kew, we matched synonyms to their backbone species if the species is part of the k-southwestern-europe checklist from the Plants of the World Online (POWO) system \citep{powo2024}. Note that there are plant species listed in the accepted species from WCVP that are not in the k-southwestern-europe POWO checklist. As there is a possible taxon ambiguity in this case – multiple species possible for a given synonym depending on the referential – we leave the proposed label untouched. The dataset is available at https://zenodo.org/records/10782465.

Notation

In the following, denote $K$ the number of species within the dataset. We index the observations by $i\in[n_{\bullet}]=\{1,\dots,n_{\bullet}\}$ where $\mathcal{D}_{\bullet}$ is the considered dataset composed of $n_{\bullet}$ observations and their associated votes. For example, the full south-western European flora dataset from Pl@ntNet of $n_{\mathrm{SWE}}=\numprint{6699593}$ observations is denoted $\mathcal{D}_{\mathrm{SWE}}$. Other subsets are presented in Section 2.3. We write $\mathcal{U}$ the set of all users. Each user $u$ has a unique identifier used as an index, and we denote $\mathcal{U}_{i}$ the set of users that have voted on observation $i$ – i.e. $\mathcal{U}=\cup_{i\in[n_{\mathrm{SWE}}]}\mathcal{U}_{i}$. The vote of user $u$ on observation $i$ is denoted $y_{i}^{u}\in[K]$. Estimated labels are denoted $\hat{y}_{i}\in[K]$.
Each observation $i$ is created by an author $u$ stored in $\mathrm{Author}(i)$ .

2.2 Proposed label aggregation strategy

Algorithm 1 Pl@ntNet iterative weighted majority vote

1: Input: votes $(u,y_{i}^{u})_{i\in[n_{\mathrm{SWE}}],u\in[n_{\text{user}}]}$ for each observation $i$ and user $u$ answering the voted species $y_{i}^{u}$, accuracy threshold $\theta_{\text{acc}}$, confidence threshold $\theta_{\text{conf}}$, weight function $f$, initial weight $\gamma>0$
2: Output: estimated labels $\hat{y}_{i}$ and validity indicator $s_{i}$ for each observation $i$
3: Initialize user weights as $w_{u}=\gamma$ for each user $u\in[n_{\text{user}}]$
4: while not converged do
5:   Get current estimated labels with a weighted majority vote: $\forall i\in[n_{\mathrm{SWE}}],\ \hat{y}_{i}=\operatorname*{arg\,max}_{k\in[K]}\sum_{u\in\mathcal{U}_{i}}w_{u}\mathds{1}(y_{i}^{u}=k)$
6:   for each observation $i\in[n_{\mathrm{SWE}}]$ do
7:     Compute label confidence: $\mathrm{conf}_{i}(\hat{y}_{i})=\sum_{u\in\mathcal{U}_{i}}w_{u}\mathds{1}(y_{i}^{u}=\hat{y}_{i})$
8:     Compute label accuracy: $\mathrm{acc}_{i}(\hat{y}_{i})=\mathrm{conf}_{i}(\hat{y}_{i})/\sum_{k\in[K]}\mathrm{conf}_{i}(k)$
9:     Compute validity indicator: $s_{i}=\mathds{1}(\mathrm{acc}_{i}(\hat{y}_{i})\geq\theta_{\text{acc}}\text{ and }\mathrm{conf}_{i}(\hat{y}_{i})\geq\theta_{\text{conf}})$
10:   end for
11:   for each user $u\in[n_{\text{user}}]$ do
12:     Compute the number of valid identified species for authoring observations: $n_{u}^{\text{author}}=|\{y_{i}^{u}\in[K]\,|\,y_{i}^{u}=\hat{y}_{i},s_{i}=1,\mathrm{Author}(i)=u\}|$
13:     Compute the number of identified species by voting on other’s observations: $n_{u}^{\text{vote}}=|\{y_{i}^{u}\in[K]\,|\,y_{i}^{u}=\hat{y}_{i},\mathrm{Author}(i)\neq u\}|$
14:     Compute the rounding number of identified species per user: $n_{u}=\mathrm{Round}\left(n_{u}^{\text{author}}+\frac{1}{10}n_{u}^{\text{vote}}\right)$
15:     Transform number of estimated species per user into trust score: $w_{u}=f(n_{u})$
16:   end for
17: end while

The Pl@ntNet label aggregation strategy relies on estimating the number of correctly identified species for each user. Similar to other strategies, we rely on an EM-based iterative procedure \citep{Dempster_Laird_Rubin77} to alternately estimate the users’ skills and each observation’s species. The detailed iterative algorithm is provided in Algorithm 1 and available at https://github.com/peerannot/peerannot. The label aggregation strategy generates a trust indicator ($s_{i}$) on the observation that can reveal whether an observation is valid or not. Notice that in Algorithm 1 we value authored observations $10$ times more than votes on others’ observations: a user who proposes a new observation with a label (species name) provides more useful information than one who proposes a label by clicking. Indeed, being in the field gives more information on the environment and leads to a better determination of the species. Finally, note that an identified species is counted exclusively as authored (part of $n_{u}^{\text{author}}$ in Algorithm 1) or as clicked (part of $n_{u}^{\text{vote}}$) to avoid redundant skills. The final number of species identified by a user is the aggregation of these two terms: $n_{u}=\mathrm{Round}\left(n_{u}^{\text{author}}+\frac{1}{10}n_{u}^{\text{vote}}\right)$.
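To make the iteration of Algorithm 1 concrete, the following minimal Python sketch runs it on a toy set of votes. It is an illustration only and not the peerannot implementation. The function name `plantnet_aggregate`, the dictionary-based data layout, the alphabetical tie-break, the stopping rule (labels and validity indicators no longer change) and the use of Python's built-in `round` for $\mathrm{Round}$ are all assumptions made for this example. The weight function is passed in as `f` and follows the form of Equation 1.

```python
import math
from collections import defaultdict


def plantnet_aggregate(votes, authors, f, gamma,
                       theta_acc=0.7, theta_conf=2.0, max_iter=100):
    """Toy version of Algorithm 1 (hypothetical helper, not the peerannot API).

    votes:   dict observation -> dict user -> voted species
    authors: dict observation -> user who created the observation
    f:       weight function mapping a species count to a trust score
    """
    users = {u for obs_votes in votes.values() for u in obs_votes}
    w = {u: gamma for u in users}  # initial weights
    labels, valid = {}, {}
    for _ in range(max_iter):
        new_labels, new_valid = {}, {}
        for i, obs_votes in votes.items():
            conf = defaultdict(float)  # total weight per voted species
            for u, k in obs_votes.items():
                conf[k] += w[u]
            # Weighted majority vote; ties broken alphabetically (assumption).
            y_hat = max(sorted(conf), key=conf.get)
            acc = conf[y_hat] / sum(conf.values())
            new_labels[i] = y_hat
            new_valid[i] = acc >= theta_acc and conf[y_hat] >= theta_conf
        authored, voted = defaultdict(set), defaultdict(set)
        for i, obs_votes in votes.items():
            for u, k in obs_votes.items():
                if k != new_labels[i]:
                    continue
                if authors[i] == u:
                    if new_valid[i]:  # authored species must be valid
                        authored[u].add(k)
                else:
                    voted[u].add(k)
        new_w = {}
        for u in users:
            # A species counts either as authored or as voted, never both.
            n_vote = len(voted[u] - authored[u])
            # Built-in round() is an assumption for the paper's Round.
            n_u = round(len(authored[u]) + n_vote / 10)
            new_w[u] = f(n_u)
        converged = new_labels == labels and new_valid == valid
        labels, valid, w = new_labels, new_valid, new_w
        if converged:  # assumed stopping rule
            break
    return labels, valid, w


gamma = math.log(2.1)
f = lambda n: n**0.5 - n**0.2 + gamma

# Eight observations authored by "expert" and confirmed by two other users,
# plus one observation that only "expert" has labeled.
votes = {f"obs{j}": {"expert": f"sp{j}", "v1": f"sp{j}", "v2": f"sp{j}"}
         for j in range(8)}
votes["solo"] = {"expert": "sp_rare"}
authors = {i: "expert" for i in votes}

labels, valid, w = plantnet_aggregate(votes, authors, f, gamma)
print(valid["solo"], round(w["expert"], 2))
```

In this toy run the single-vote observation is invalid at the first iteration, since all users start with weight $\gamma$. It becomes valid once the author's weight has grown from the eight confirmed observations.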

The weight function $f$ shown in Figure 2 is a non-decreasing function that maps the number of identified species $n_{u}$ to a trust score in the form of:

w_{u}=f(n_{u})=n_{u}^{\alpha}-n_{u}^{\beta}+\gamma\enspace,

(1)

where $\alpha,\beta\in\mathbb{R}_{+}^{\star}$ are hyperparameters that were calibrated internally to fit prior knowledge and $\gamma>0$ is the constant representing the initial weight of each user. In practice, we use $\alpha=0.5$, $\beta=0.2$ and $\gamma=\log(2.1)\simeq 0.74$ in the weight function. This function is sub-linear ($\mathcal{O}(\sqrt{n_{u}})$) but with two different behaviors. The goal of Equation 1 is to separate new users from experts and then help sort multiple experts. This is modeled by the two behaviors of the weight function. In the first part, which corresponds to new users with low $n_{u}$, the term with exponent $\beta$ decreases the weight. We chose an initial weight $w_{u}=\gamma$ such that a user has a weight equal to $1$ (rounding to two decimals) with two different identifications. This separates the users who only come once to test the application from others. In the second part, with a higher number of identified species, the term with exponent $\beta$ becomes negligible and the function tends to the square root function. The sub-linear scale allows for reducing discrepancies between people who have identified a comparable number of species (and thus have presumably comparable expertise). As for the two thresholds that control the level of uncertainty accepted for a given label, they are set to $\theta_{\text{conf}}=2$ to control the total weight on an observation and $\theta_{\text{acc}}=0.7$ to control the agreement between users given their expertise.
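As a quick numerical check of Equation 1, the sketch below evaluates the weight function with the reported values $\alpha=0.5$, $\beta=0.2$, $\gamma=\log(2.1)$ and searches for the smallest number of identified species at which a single vote reaches $\theta_{\text{conf}}=2$. The function names `trust_score` and `min_species_to_reach` are hypothetical and only serve this illustration.

```python
import math

ALPHA, BETA, GAMMA = 0.5, 0.2, math.log(2.1)
THETA_CONF = 2.0


def trust_score(n_species, alpha=ALPHA, beta=BETA, gamma=GAMMA):
    """Equation 1: f(n) = n^alpha - n^beta + gamma."""
    return n_species**alpha - n_species**beta + gamma


def min_species_to_reach(threshold=THETA_CONF, n_max=10_000):
    """Smallest integer n with f(n) >= threshold (None if not found)."""
    for n in range(n_max + 1):
        if trust_score(n) >= threshold:
            return n
    return None


for n in (0, 1, 2, 10, 100, 1000):
    print(n, round(trust_score(n), 3))
print("species needed to reach theta_conf:", min_species_to_reach())
```

Since $0^{\alpha}=0^{\beta}=0$ and $1^{\alpha}=1^{\beta}=1$, users with zero or one identified species both keep the initial weight $\gamma$. The weight only starts to grow from the second identified species.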

Users are said to be self-validating when they are trusted enough so that their proposed label single-handedly makes an observation valid $(s_{i}=1)$. From Algorithm 1, we see that this holds when their weight $w_{u}$ is at least $\theta_{\text{conf}}$. Indeed, with a single label we obtain $\mathrm{conf}_{i}(\hat{y}_{i})=w_{u}\geq\theta_{\text{conf}}$ and $\mathrm{acc}_{i}(\hat{y}_{i})=1\geq\theta_{\text{acc}}$. In practice, this means that an experienced user who has collected enough weight can validate any observation without any other user’s vote. Note that this identification can later be invalidated by other users with enough weight thanks to the accuracy threshold $\theta_{\text{acc}}$.
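The interplay between the two thresholds can be illustrated with a short sketch of the validity test of Algorithm 1, applied to a single observation. The helper `validity` and the weights used below ($2.05$ for a self-validating user, $0.74$ for each new user) are hypothetical values chosen for the example.

```python
def validity(conf_by_species, theta_acc=0.7, theta_conf=2.0):
    """Return (label, is_valid) from the total weight voted per species."""
    # Label with the largest total weight; ties broken alphabetically
    # (an assumption of this sketch).
    y_hat = max(sorted(conf_by_species), key=conf_by_species.get)
    conf = conf_by_species[y_hat]
    acc = conf / sum(conf_by_species.values())
    return y_hat, (acc >= theta_acc and conf >= theta_conf)


# A self-validating user alone.
alone = validity({"Quercus ilex": 2.05})
# The same vote contradicted by one new user, then by two new users.
one_opponent = validity({"Quercus ilex": 2.05, "Quercus suber": 0.74})
two_opponents = validity({"Quercus ilex": 2.05, "Quercus suber": 1.48})
print(alone, one_opponent, two_opponents)
```

Under these assumed weights, one contradicting new user leaves the accuracy above $\theta_{\text{acc}}$, whereas two contradicting new users bring it below the threshold and invalidate the observation.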

2.3 Evaluation against other aggregation strategies

Existing aggregation strategies.

Plant species label aggregation is a challenging task due to the large number of species $K=\numprint{11425}$. Hence, many classical strategies in the label aggregation literature such as Dawid and Skene’s \citep{dawid_maximum_1979} and other variations \citep{passonneau-carpenter-2014-benefits,sinha2018fast} are not applicable as they require estimating a $K\times K$ confusion matrix for each user. For the considered dataset $\mathcal{D}_{\text{SWE}}$, this would result in $\numprint{11425}^{2}\times\numprint{823251}\approx 10^{14}$ parameters to estimate. Similar issues occur for other label aggregation strategies \citep{whitehill_whose_2009,hovy2013learning,ma2020adversarial}. We do not consider deep-learning-based crowdsourcing strategies such as \citet{rodrigues2018deep,chu2021learning} or \citet{lefort2022improve}, as they require training a neural network from crowdsourced labels but do not output aggregated labels on the training set. In the Pl@ntNet application, we need to propose one or multiple species for each observation to users. To overcome these issues, we consider the following label aggregation strategies that can scale with $K$ and the number of users:

•

Majority Vote (MV) \citep{james1998majority}: it selects the most answered label\footnote{Ties are broken at random, which sometimes creates some variability in the labeling process.} and is the most common aggregation strategy. More formally, given an observation $i$:

\mathrm{MV}(i,\{y_{i}^{u}\}_{u})=\operatorname*{arg\,max}_{k\in[K]}\sum_{u\in% \mathcal{U}_{i}}\mathds{1}(y_{i}^{u}=k)\enspace.

•

Worker Agreement With Aggregate (WAWA) \citep{appen_wawa_2021}: this strategy, also known as the inter-rater agreement, weights each user by how much they agree with the MV labels on average. More formally, given an observation $i$:

\mathrm{WAWA}(i,\mathcal{D}_{\mathrm{SWE}})=\operatorname*{arg\,max}_{k\in[K]}\sum_{u\in\mathcal{U}_{i}}w_{u}\mathds{1}(y_{i}^{u}=k)\enspace,

\text{with }w_{u}=\frac{1}{|\{y_{i^{\prime}}^{u}\}_{i^{\prime}}|}\sum_{i^{\prime}=1}^{n_{\mathrm{SWE}}}\mathds{1}\left(y_{i^{\prime}}^{u}=\mathrm{MV}(i^{\prime},\{y_{i^{\prime}}^{u}\}_{u})\right)\enspace.

As there is no observation filter for $\mathrm{MV}$ and $\mathrm{WAWA}$, we consider that $s_{i}=1$ for every observation $i$ with these two strategies.

•

TwoThird: The TwoThird aggregation generates a label for observations with at least two votes. The estimated label is the one on which at least two-thirds of the voters agree. Every user has the same weight in the aggregation. It is part of iNaturalist’s label aggregation system \citep{van2018inaturalist}. More formally:

\mathrm{TwoThird}(i,\{y_{i}^{u}\}_{u})=\begin{cases}\mathrm{MV}(i,\{y_{i}^{u}\}_{u})&\text{if }s_{i}=1\\ \text{undefined}&\text{otherwise}\end{cases}\enspace,

\text{with }s_{i}=\mathds{1}\left(\max_{k\in[K]}\frac{1}{|\mathcal{U}_{i}|}\sum_{u\in\mathcal{U}_{i}}\mathds{1}(y_{i}^{u}=k)\geq\frac{2}{3}\right)\enspace.
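The three baseline strategies can be sketched in a few lines of Python on a toy set of votes. This is a minimal illustration under stated assumptions: the function names and the dictionary layout are hypothetical, ties are broken alphabetically rather than at random so that the output is reproducible, and the TwoThird rule combines the two-vote minimum stated in the text with the two-thirds share of the formula.

```python
from collections import Counter, defaultdict


def majority_vote(obs_votes):
    """MV on one observation: dict user -> species."""
    counts = Counter(obs_votes.values())
    best = max(counts.values())
    # Alphabetical tie-break (assumption; the paper breaks ties at random).
    return min(k for k, c in counts.items() if c == best)


def wawa(votes):
    """WAWA: weight users by their agreement rate with MV, then re-vote."""
    mv = {i: majority_vote(v) for i, v in votes.items()}
    agree, total = defaultdict(int), defaultdict(int)
    for i, obs_votes in votes.items():
        for u, k in obs_votes.items():
            total[u] += 1
            agree[u] += int(k == mv[i])
    w = {u: agree[u] / total[u] for u in total}
    labels = {}
    for i, obs_votes in votes.items():
        score = defaultdict(float)
        for u, k in obs_votes.items():
            score[k] += w[u]
        labels[i] = max(sorted(score), key=score.get)
    return labels, w


def two_third(obs_votes):
    """TwoThird: (label, valid); needs >= 2 votes and a >= 2/3 share."""
    n = len(obs_votes)
    if n < 2:
        return None, False
    label = majority_vote(obs_votes)
    top = Counter(obs_votes.values())[label]
    valid = 3 * top >= 2 * n  # integer form of top / n >= 2/3
    return (label if valid else None), valid


votes = {
    "o1": {"a": "rose", "b": "rose", "c": "tulip"},
    "o2": {"a": "oak", "b": "oak", "c": "elm"},
    "o3": {"b": "pine", "c": "fir"},
}
wawa_labels, wawa_weights = wawa(votes)
print(majority_vote(votes["o3"]), wawa_labels["o3"], two_third(votes["o3"]))
```

On the third observation the two votes are tied, so MV depends on the tie-break, WAWA follows the user with the higher agreement rate, and TwoThird rejects the observation.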

Creation of an evaluation set in a crowdsourcing setting.

To evaluate the performance of a label aggregation strategy, it is necessary to know the ground truth on a subset of the data. However, in the context of crowdsourced data, there is no known truth for the observations. The sheer volume of data makes it impossible to ask botanical experts to create such ground truth for the whole database. Moreover, identifying species from images is much less accurate than identifying them in the field, due to the partial information contained in the image \citep{experts2018plant}.

Instead of asking experts to label a subset of the data, we identify botanical experts in the Pl@ntNet user database and consider their determinations as ground truth. We asked known botanical experts to refer us to other experts who could have a Pl@ntNet account, in order to create a list of expert users. To these we added TelaBotanica \citep{heaton2010tela} users whose account registers confirmed botanical experience and who are also Pl@ntNet users that participated in the South-Western Europe flora subset. In total, $98$ Pl@ntNet users were identified as botanical experts. Observations with at least one vote from one of these experts constitute our test set, denoted $\mathcal{D}_{\text{expert}}$. The answers of these experts are considered ground truth labels and used to evaluate the strategies’ performance. Despite our selection process of supposedly ‘indisputable’ experts, a few observations in the test set still end up with contradictory labels ($4$ observations in total). As they represent a very small percentage, we simply removed them from $\mathcal{D}_{\text{expert}}$.

Our evaluation set $\mathcal{D}_{\text{expert}}$ is finally composed of $\numprint{26811}$ observations which received at least one vote from one of the experts. Despite the large number of users, not all observations obtain multiple annotations. Indeed, $\numprint{310564}$ users were single-time voters (meaning they interacted with the system only once). The lack of votes is a major source of difficulty in the Pl@ntNet database, as the distribution of votes between observations is highly imbalanced, as represented in Figure 4(b). There is a high concentration of votes for a small percentage of the observations, as shown in Figure 4(a). Of these evaluation data, $\numprint{17125}$ received at least two identifications and are stored in $\mathcal{D}_{\text{multiple votes}}$. Then, $\numprint{1263}$ have at least two votes with at least one disagreement between users and are stored in $\mathcal{D}_{\text{disagreement}}$. Figure 3 shows the distribution of observations from $\mathcal{D}_{\mathrm{SWE}}$ to the finer and more ambiguous $\mathcal{D}_{\text{disagreement}}$.

Evaluation metric.

To evaluate the label aggregation strategies, we use the following label recovery accuracy computed on the evaluation datasets:

\mathrm{Acc}(\hat{y},y;\mathcal{D}_{\bullet})=\frac{1}{n_{\bullet}}\sum_{i=1}^{n_{\bullet}}\mathds{1}(\hat{y}_{i}=y_{i})\mathds{1}(s_{i}=1)\enspace,

with $\hat{y}=(\hat{y}_{i})_{i}$ the estimated labels on $\mathcal{D}_{\bullet}\in\{\mathcal{D}_{\text{expert}},\mathcal{D}_{\text{multiple votes}},\mathcal{D}_{\text{disagreement}}\}$ and $y=(y_{i})_{i}$ the associated expert labels, considered as ground truth. When the aggregation strategy indicates the observation as invalid ($s_{i}=0$ for Pl@ntNet and TwoThird), this metric considers the sample as wrongly classified. Precision and recall scores are also computed to respectively measure the correctness of the observations indicated as valid and the ability to recover the ground truth observations in the valid set. We take into account the species imbalance by using a macro-average for these metrics. This treats rare species as equally important to common ones. Denoting respectively $\mathrm{TP}_{k}$, $\mathrm{FP}_{k}$ and $\mathrm{FN}_{k}$ the true positives, false positives and false negatives related to the species $k$, the macro-averaged precision and recall read

\mathrm{Precision}_{\mathrm{macro}}=\frac{1}{K}\sum_{k=1}^{K}\frac{\mathrm{TP}% _{k}}{\mathrm{TP}_{k}+\mathrm{FP}_{k}}\quad\text{and}\quad\mathrm{Recall}_{% \mathrm{macro}}=\frac{1}{K}\sum_{k=1}^{K}\frac{\mathrm{TP}_{k}}{\mathrm{TP}_{k% }+\mathrm{FN}_{k}}\enspace.
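The metrics above can be computed with the short Python sketch below. Two conventions are assumptions of this sketch, since the text does not specify them: an invalid observation ($s_{i}=0$) counts as a false negative for its true species without producing a false positive, and a species whose denominator is zero contributes $0$ to the macro-average. The function names are hypothetical.

```python
def label_recovery_accuracy(y_hat, y_true, valid):
    """Acc: share of observations that are valid AND correctly labeled."""
    hits = sum(1 for i in y_true if valid[i] and y_hat[i] == y_true[i])
    return hits / len(y_true)


def macro_precision_recall(y_hat, y_true, valid, species):
    """Macro-averaged precision and recall over the given species list."""
    tp = dict.fromkeys(species, 0)
    fp = dict.fromkeys(species, 0)
    fn = dict.fromkeys(species, 0)
    for i, truth in y_true.items():
        pred = y_hat[i] if valid[i] else None  # invalid -> no prediction
        if pred == truth:
            tp[truth] += 1
        else:
            fn[truth] += 1
            if pred is not None:
                fp[pred] = fp.get(pred, 0) + 1

    def ratio(num, den):
        # Assumption: a species with an empty denominator contributes 0.
        return num / den if den else 0.0

    precision = sum(ratio(tp[k], tp[k] + fp[k]) for k in species) / len(species)
    recall = sum(ratio(tp[k], tp[k] + fn[k]) for k in species) / len(species)
    return precision, recall


y_true = {1: "a", 2: "a", 3: "b", 4: "c"}
y_hat = {1: "a", 2: "b", 3: "b", 4: "c"}
valid = {1: True, 2: True, 3: True, 4: False}

acc = label_recovery_accuracy(y_hat, y_true, valid)
precision, recall = macro_precision_recall(y_hat, y_true, valid, ["a", "b", "c"])
print(acc, precision, recall)
```

In this toy example, observation 4 carries the correct label but is invalid, so it lowers both the accuracy and the recall of species "c".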

As both the Pl@ntNet and the TwoThird strategies can invalidate some of the observations, we also compute the proportion of observations removed from the whole dataset (whereas previous metrics are computed on the evaluation dataset). This complementary metric measures the proportion of samples "lost" for the training of the AI model after the aggregation step. In practice, filtering data might remove some noisy samples from the dataset. Yet, in general, the more samples are filtered, the fewer remain to train the neural network. Finally, we also consider the proportion of species retrieved by the aggregation strategies on $\mathcal{D}_{\text{expert}},\mathcal{D}_{\text{multiple votes}}$ and $\mathcal{D}_{\text{disagreement}}$. This is a critical consideration because if a species identified by experts is absent from the aggregated data, the neural network trained on this data will be unable to make predictions for that very species.

We evaluate the label recovery $\mathrm{Acc}$ of each strategy on $\mathcal{D}_{\text{expert}},\mathcal{D}_{\text{multiple votes}}$ and $\mathcal{D}_{\text{disagreement}}$ (see also Figure 3): the test set where experts have provided at least one vote ($\mathcal{D}_{\text{expert}}$), its subset of observations with at least $2$ votes including one from an expert ($\mathcal{D}_{\text{multiple votes}}$), and its subset of observations with at least $2$ votes, one from an expert, and one disagreement ($\mathcal{D}_{\text{disagreement}}$). The latter is the most challenging as it contains the observations with the most ambiguity. We selected these subsets to investigate the label aggregation strategies’ performance depending on the ambiguity level.

2.4 Taking into account AI votes

While we restricted ourselves to the SWE subset, Pl@ntNet’s data is collected internationally. The more correctly identified observations are added to the training set, the better the prediction of the trained model for end-users. This classifier is trained from valid observations and aggregated labels (see Figure 1). Note that, in addition to Algorithm 1 and the filter on species names, further pre-processing steps are implemented for better performance \citep{affouard2017pl}, such as an additional rejection class (e.g. non-plant observations) and the treatment of malformed observations (multiple images of different species in a single observation). At the time of writing, the model in use in Pl@ntNet is DINOv2 \citep{oquab2024dinov2}, a transformer-based network. This network is based on contrastive learning \citep{waida2023understanding}: it represents similar images as nearby embeddings to learn similar features for similar observations, and then uses supervised learning to fine-tune the model. Several transformations are performed during training such as data augmentation \citep{yang2023image}, data standardization and label smoothing \citep{szegedy2016rethinking}. However, note that some observations from $\mathcal{D}_{\text{SWE}}$ have been processed by an earlier version of Pl@ntNet’s AI: either an InceptionV3 \citep{szegedy2015rethinking} or a BEIT \citep{bao2021beit} classifier. We can use the classifiers to generate votes. For an observation $i$, the AI vote is denoted $y_{i}^{\text{AI}}\in[K]$. The probability output by the classifier for its predicted species is denoted $\mathbb{P}(y_{i}^{\text{AI}})$.

If we consider the trained model as any other user, denoted as AI as user, the same label aggregation strategies as in Section 2.2 are available. However, with the Pl@ntNet aggregation algorithm, the AI weight increases drastically and overpowers human users (see Section 3.1). This would mean the next Pl@ntNet model is mostly trained on the predictions of the previous one. This defeats the purpose of a cooperative active learning system and the human-AI interaction. It would result in a dangerous feedback loop, and possible mode collapse. Thus, we explore alternative ways of integrating the AI votes in the aggregation algorithm:

•

AI as user: This is the naive approach we just described. The AI is considered as any other user in the database. The total number of users is thus raised to $n_{\text{user}}+1$ .

•

Fixed weight AI: Give a fixed weight $w_{\text{AI}}=1.7>0$ to the AI. The weight is below the threshold $\theta_{\text{conf}}$ so that it cannot self-validate its predictions. The confidence reads

\mathrm{conf}_{i}(\hat{y}_{i})=\sum_{u\in\mathcal{U}_{i}}w_{u}\mathds{1}(y_{i}^{u}=\hat{y}_{i})+w_{\text{AI}}\mathds{1}(y_{i}^{\text{AI}}=\hat{y}_{i})\enspace.

(2)

The final estimated label becomes

\hat{y}_{i}=\operatorname*{arg\,max}_{k\in[K]}\sum_{u\in\mathcal{U}_{i}}w_{u}% \mathds{1}(y_{i}^{u}=k)+w_{\text{AI}}\mathds{1}(y_{i}^{\text{AI}}=k)\enspace.

(3)

•

Invalidating AI: The AI is considered as a user with a fixed weight and can only participate in invalidating identifications, i.e. in setting $s_{i}=0$. This translates into the confidence being updated as in Equation 2, while the final weighted MV remains unchanged from Algorithm 1.

•

Confident AI: The AI is considered a user with a fixed weight and can only participate if the confidence in its prediction $\mathbb{P}(y_{i}^{\text{AI}})$ is over a threshold $\theta_{\text{score}}\in[0,1]$ . The confidence writes

\mathrm{conf}_{i}(\hat{y}_{i})=\sum_{u\in\mathcal{U}_{i}}w_{u}\mathds{1}(y_{i}^{u}=\hat{y}_{i})+w_{\text{AI}}\mathds{1}(y_{i}^{\text{AI}}=\hat{y}_{i},\,\mathbb{P}(y_{i}^{\text{AI}})\geq\theta_{\text{score}})\enspace.

(4)

The final estimated label becomes

\hat{y}_{i}=\operatorname*{arg\,max}_{k\in[K]}\sum_{u\in\mathcal{U}_{i}}w_{u}% \mathds{1}(y_{i}^{u}=k)+w_{\text{AI}}\mathds{1}(y_{i}^{\text{AI}}=k,\,\mathbb{% P}(y_{i}^{\text{AI}})\geq\theta_{\text{score}})\enspace.

(5)
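The fixed-weight variants listed above can be compared on a single observation with the following Python sketch. It reflects one reading of the text and is not the production implementation: for the Invalidating AI, the label is taken from human votes only while the AI weight enters the confidence and accuracy. The function name `aggregate_with_ai`, the mode strings and the values of `theta_score` used in the example are hypothetical. At least one human vote is assumed.

```python
from collections import defaultdict


def aggregate_with_ai(obs_votes, w, ai_vote, ai_prob, mode, theta_score,
                      w_ai=1.7, theta_acc=0.7, theta_conf=2.0):
    """Return (label, is_valid) for one observation.

    obs_votes: dict user -> species (at least one human vote assumed)
    w:         dict user -> trust score
    mode:      "fixed", "invalidating" or "confident" (hypothetical names)
    """
    human = defaultdict(float)
    for u, k in obs_votes.items():
        human[k] += w[u]
    with_ai = defaultdict(float, human)  # copy, then add the AI vote
    if mode in ("fixed", "invalidating") or ai_prob >= theta_score:
        with_ai[ai_vote] += w_ai
    # Invalidating AI: the label comes from human votes only (our reading).
    scores = human if mode == "invalidating" else with_ai
    y_hat = max(sorted(scores), key=scores.get)
    conf = with_ai[y_hat]
    acc = conf / sum(with_ai.values())
    return y_hat, (acc >= theta_acc and conf >= theta_conf)


w = {"u1": 1.95, "u2": 1.95}
obs = {"u1": "x", "u2": "x"}

fixed = aggregate_with_ai(obs, w, "y", 0.9, "fixed", theta_score=0.5)
invalidating = aggregate_with_ai(obs, w, "y", 0.9, "invalidating", theta_score=0.5)
unsure_ai = aggregate_with_ai(obs, w, "y", 0.9, "confident", theta_score=0.95)
ai_alone = aggregate_with_ai({"new": "x"}, {"new": 0.74}, "y", 0.9, "fixed",
                             theta_score=0.5)
print(fixed, invalidating, unsure_ai, ai_alone)
```

In this example the AI vote invalidates an observation on which two users of weight $1.95$ agree, unless its prediction score falls below the chosen `theta_score`. When it contradicts a single new user, the AI label wins the vote but stays below $\theta_{\text{conf}}$, so the observation is not validated.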

On the choice of the AI weight.

The AI has a fixed weight $w_{\text{AI}}>0$ for the Fixed weight AI, the Invalidating AI and the Confident AI strategies. The choice of this weight must meet several constraints. First, we want to prevent the AI votes from being self-validating, as this would validate all the AI predictions on a large part of the database. Thus we must have $w_{\text{AI}}<\theta_{\mathrm{conf}}$ in Algorithm 1. We also want the AI votes to help clean the database by invalidating some observations from low-weight users (with weight $0<w_{\text{low}}\leq\theta_{\mathrm{conf}}$). Thus $w_{\text{low}}/(w_{\text{low}}+w_{\text{AI}})<\theta_{\mathrm{acc}}$. Hence, our constraints read:

\begin{cases}w_{\text{AI}}&<\theta_{\mathrm{conf}}\\ \frac{w_{\text{low}}}{w_{\text{low}}+w_{\text{AI}}}&<\theta_{\mathrm{acc}}\end{cases}\enspace.

(6)

Taking the extreme case where a user becomes self-validating, $w_{\text{low}}=\theta_{\mathrm{conf}}$, we obtain $w_{\text{AI}}>\theta_{\mathrm{conf}}\left(\frac{1-\theta_{\mathrm{acc}}}{\theta_{\mathrm{acc}}}\right)$. Combining this with the first condition in Equation 6, we obtain the bounds

\theta_{\mathrm{conf}}\left(\frac{1-\theta_{\mathrm{acc}}}{\theta_{\mathrm{acc}}}\right)<w_{\mathrm{AI}}<\theta_{\mathrm{conf}}\quad\left(\Longleftrightarrow 0.85<w_{\mathrm{AI}}<2\right)\enspace.

(7)

As more than a million observations in our dataset have only two votes, one way to choose the AI weight is to require that the AI can invalidate two erroneous non-experts whose weights are both just under the threshold $\theta_{\mathrm{conf}}$: $w_{\text{low}}=1.95<2$. The AI weight should then exceed the bound obtained from their cumulative confidence: $w_{\mathrm{AI}}>2w_{\text{low}}\left(\frac{1-\theta_{\mathrm{acc}}}{\theta_{\mathrm{acc}}}\right)$. We finally take this bound rounded up, $w_{\mathrm{AI}}=1.70$ (which satisfies Equation 7).
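The arithmetic behind this choice can be checked with a few lines of Python. This is only a sketch under stated assumptions: the values $\theta_{\mathrm{conf}}=2$ and $\theta_{\mathrm{acc}}=0.7$ are not restated in this section and are inferred here from the numerical bounds of Equation 7, and rounding up to one decimal place is our reading of "rounded up".

```python
import math

theta_conf = 2.0  # assumed: upper bound in Eq. 7
theta_acc = 0.7   # assumed: value consistent with the lower bound in Eq. 7

ratio = (1 - theta_acc) / theta_acc

# Bounds of Eq. 7 on the AI weight.
lower = theta_conf * ratio
upper = theta_conf

# Two erroneous low-weight users, each just under theta_conf.
w_low = 1.95
two_voter_bound = 2 * w_low * ratio

# Round the bound up to one decimal place (assumed rounding convention).
w_ai = math.ceil(two_voter_bound * 10) / 10

# Agreement level of the two users once the AI disagrees with them.
acc = 2 * w_low / (2 * w_low + w_ai)

print(f"Eq. 7 bounds: {lower:.3f} < w_AI < {upper:.1f}")
print(f"two-voter bound: {two_voter_bound:.3f}, chosen w_AI: {w_ai}")
print(f"agreement with AI against: {acc:.3f} (threshold {theta_acc})")
```

Under these assumptions, the chosen weight lies inside the bounds of Equation 7, and the agreement level of the two users falls just below $\theta_{\mathrm{acc}}$ when the AI votes against them.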

3 Results

3.1 Label aggregation performance comparison

Accuracy of the aggregation strategies.

We begin by evaluating the accuracy of the label aggregation strategies on the set of observations labeled by experts, $\mathcal{D}_{\text{expert}}$. Figure 5 shows how many predicted labels match the experts' answers on $\mathcal{D}_{\text{multiple votes}}$ and $\mathcal{D}_{\text{disagreement}}$. More importantly, we compare this quantity with the proportion of species retrieved by the aggregation strategy. We observe that the data filtering of the TwoThird strategy, which requires at least two thirds of the votes to agree, strongly degrades performance relative to the other strategies. On $\mathcal{D}_{\text{expert}}$, MV reaches $97\%$ accuracy, WAWA $98\%$, TwoThird $60\%$ and Pl@ntNet $99\%$. To differentiate between the best-performing strategies, we need to look at more ambiguous observations like those in $\mathcal{D}_{\text{multiple votes}}$ and $\mathcal{D}_{\text{disagreement}}$. In highly ambiguous settings, the WAWA strategy outperforms the MV one. Overall, however, the Pl@ntNet aggregation agrees more often with the experts and retrieves almost $90\%$ of the plant species identified by experts in highly ambiguous datasets, against $73\%$ for WAWA, $71\%$ for MV and only $41\%$ for TwoThird.

Precision and recall.

To better evaluate each aggregation strategy, we compute the precision and recall for each species and report their macro averages. Results are shown in Figure 6(a). The observation filter ($s_{i}=0$) of the TwoThird strategy strongly reduces its ability to identify most of the positive observations for a given species. This agreement threshold filter is designed to keep as few noisy samples as possible in research-grade observations (research grade being the data quality indicator for research database usage in iNaturalist). Accordingly, TwoThird obtains better precision than MV and WAWA, but Pl@ntNet's precision is higher still. The WAWA strategy outperforms a naive MV aggregation, showing that weighing users can indeed lead to better performance. The Pl@ntNet strategy outperforms all others. Weighing users based on their number of identified species is both interpretable and effective. Pl@ntNet's observation filter does not negatively impact the recall.
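For reference, macro-averaged precision and recall can be computed as in the sketch below. The function name and the convention that a filtered observation ($s_{i}=0$) is encoded as `None` are assumptions made for this illustration. Under this convention a filtered observation counts as a miss for its true species, which is how a strict filter lowers recall.

```python
def macro_precision_recall(y_true, y_pred):
    """Macro-averaged precision and recall over species.

    y_true: expert labels; y_pred: aggregated labels, with None for
    observations removed by the validity filter (assumed convention).
    """
    species = sorted(set(y_true) | {p for p in y_pred if p is not None})
    precisions, recalls = [], []
    for k in species:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == k and p == k for t, p in pairs)
        fp = sum(t != k and p == k for t, p in pairs)
        fn = sum(t == k and p != k for t, p in pairs)
        # A species never predicted (or never present) contributes 0.
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    n = len(species)
    return sum(precisions) / n, sum(recalls) / n
```

Because each species contributes equally to the average, rarely observed species weigh as much as common ones, which matters for a database as imbalanced as this one.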

Volume of valid data.

The community labels are aggregated to generate training data for the AI model. More data is generally better, but we need to filter out observations that have low visual quality or are potentially mislabeled. This is the reason for the validity indicator $s_{i}$ in the TwoThird and Pl@ntNet strategies. On $\mathcal{D}_{\mathrm{SWE}}$, Figure 6(b) shows how much data is kept for later training. $\mathrm{MV}$ and $\mathrm{WAWA}$ keep all proposed observations for training, including potentially noisy ones. TwoThird filters out most observations and keeps nearly $1.5$ million (representing $23.43\%$ of the total observations). Pl@ntNet finds a better balance between filtering invalid observations and keeping enough data for training.

Qualitative results on Pl@ntNet observation filter.

In this section, we show some examples of observations invalidated by the Pl@ntNet strategy (see Figure 7). Invalid observations often result from a lack of user participation on others' observations. Disagreements between users can arise from many factors: blurriness, multiple species in the same observation, a distance from the plant that does not allow precise identification, etc. Valid observations, as shown in the second row of Figure 7, are zoomed in on the plant's flower, leaf or organ to help the identification process.

3.2 Aggregation considering AI vote

The neural network currently deployed in Pl@ntNet's system can make predictions based on its training on the Pl@ntNet database (across different floras). We compare the four strategies presented in Section 2.4 (AI as user, fixed weight AI, invalidating AI and confident AI) for integrating the AI vote into the Pl@ntNet label aggregation strategy. For the confident AI strategy, we evaluate multiple thresholds $\theta_{\text{score}}$. Note that if $\theta_{\mathrm{score}}=0$ the AI votes on all observations, and if $\theta_{\mathrm{score}}=1$ the AI does not vote and we recover the performance of the current Pl@ntNet aggregation strategy presented in Algorithm 1. We see in Figure 8 that the confident AI strategy with $\theta_{\mathrm{score}}=0.7$ appears to perform best and keeps the most data in both $\mathcal{D}_{\mathrm{SWE}}$ and $\mathcal{D}_{\mathrm{expert}}$.

4 Discussion

We demonstrated that collaborative identification of plant species can effectively be used to obtain expert-level labels. Releasing a large subset of millions of observations and thousands of users from the Pl@ntNet organization, we investigated a label aggregation strategy that weighs user answers based on their estimated number of correctly identified species, without using prior expert knowledge. Many previously used strategies either do not scale to the magnitude of current databases such as Pl@ntNet, iNaturalist or eBird, or are outperformed by our aggregation.

Our strategy weighs users based on the number of correctly identified species. This weight is interpretable and reflects the diversity of each user's skill set. It can be directly applied to other crowdsourced frameworks with a high number of classes, like iNaturalist or eBird. The values of both hyperparameters $\theta_{\mathrm{conf}}$ and $\theta_{\mathrm{acc}}$, which respectively control the cumulative weight on an observation and the agreement level for the given label, can be applied as is.

Note that Pl@ntNet’s label control system heavily rests on visual analysis of observations and inter-user agreements. Additional metadata such as geolocation, date, phenological stage or visual description can be registered in Pl@ntNet and help identify the plant’s species but are currently not directly taken into account for user evaluation. Such information – in particular spatial information – could also be used to generate more interaction between users and collect more votes through possible common interests. In addition, users are helped by the system with images similar to the identification proposed in a given checklist. The additional information could guide users in their vote – for example by notifying a possible incoherence between the current botanical knowledge on a species and the metadata entered (such as the altitude, the distance to the sea, a species not known to survive in a given area).

As for the inclusion of the AI vote, some concerns should be raised. First, as the AI model is trained on the aggregated labels and observations, integrating its vote should not let the AI predictions run out of control. If we treat the AI as any other user, then, because training is iterative, the system fails to learn from the human labels. Using the AI vote with a fixed weight only to invalidate data can help clean the database, and users with enough weight can switch the validity back. However, this does not help correct a wrong label. To do so, we investigate in Section 3.2 a strategy where the fixed-weight AI vote is only counted when the AI model is confident enough in its prediction. We observe that this strategy leads to better performance. As it relies on the output probabilities, the calibration of our network must also be discussed.

Calibration is the measure of how close the output confidence is to the true probability \citep{niculescu2005predicting}. Currently, the Pl@ntNet AI is not calibrated using post-processing methods \citep{platt1999probabilistic, guo_calibration_2017}. We discuss hereafter the calibration of the current AI model and possible guidelines for further integration of AI votes.

From Figure 9(a), we see that the Pl@ntNet AI is currently underconfident: its predicted confidence is consistently lower than its observed accuracy, so it reports more uncertainty to users than it should. One factor that can influence the results is that the calibration is computed on the test set where experts either authored or voted on observations. Botanical experts have more experience with taking pictures of plants and better equipment than the average citizen. Thus, the observation quality, and subsequently the probability distribution output by the AI, can be biased. Another factor known to lead to such suboptimal predictions is data augmentation \citep{kapoor2022uncertainty}. As the model trains on multiple versions of each original sample with multiple distortions, these variations can become unrepresentative of the underlying sample distribution and cause unnecessary prediction difficulties. Data augmentation is used to mitigate the species imbalance of the database.
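To make the notion of underconfidence concrete, the sketch below computes a binned calibration gap from predicted confidences and correctness indicators. It is a generic illustration with equal-width bins, not the procedure used to produce Figure 9; the function name and the number of bins are assumptions. A positive signed gap means that accuracy exceeds confidence, i.e. the model is underconfident.

```python
def calibration_gap(confidences, correct, n_bins=10):
    """Return (expected calibration error, signed gap).

    confidences: top-1 output probabilities in [0, 1]
    correct:     1 if the top-1 prediction is right, else 0
    signed gap > 0: underconfident; signed gap < 0: overconfident.
    """
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes to the last bin
        bins[idx].append((c, ok))

    n = len(confidences)
    ece, signed = 0.0, 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(ok for _, ok in b) / len(b)
        # Each bin is weighted by the fraction of samples it contains.
        ece += len(b) / n * abs(acc - avg_conf)
        signed += len(b) / n * (acc - avg_conf)
    return ece, signed
```

For instance, four predictions made with confidence 0.6, three of which are correct, give a signed gap of about 0.15: the model is right more often than its confidence suggests.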

However, this imbalance is also known to lead to miscalibrated predictions \citep{ao2023two}. In Figure 9(b), we see that for ambiguous observations (where users disagree), the AI is overconfident in its highest predictions, which represent half of the dataset, and underconfident in the other half. These different calibration behaviors tell us that, if a strategy is to incorporate the AI votes in the label aggregation based on the output probabilities, we need to be able to rely on such probabilities. Therefore, even if the confident AI strategy leads to the best performance in Section 3.2, it should not be used directly without recalibration of the model, using for example temperature scaling \citep{guo_calibration_2017}. In future work, more study is needed to investigate the confidence gap of the model and the observations' ambiguity from users' labels. The current large-scale and interpretable aggregation strategy from Pl@ntNet already outperforms others without the AI votes.
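As an indication of what such a recalibration could look like, the following sketch fits a single temperature on held-out logits by minimizing the negative log-likelihood. It is a simplified illustration and not part of the Pl@ntNet pipeline: the temperature is selected by grid search rather than by the gradient-based optimization usually employed, and the function names and grid are assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature (numerically stable)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels."""
    losses = [-math.log(softmax(row, temperature)[y])
              for row, y in zip(logit_rows, labels)]
    return sum(losses) / len(losses)

def fit_temperature(logit_rows, labels, grid=None):
    """Pick the temperature minimizing the NLL on held-out data."""
    if grid is None:
        grid = [0.05 * i for i in range(1, 101)]  # 0.05, 0.10, ..., 5.0
    return min(grid, key=lambda t: nll(logit_rows, labels, t))
```

A fitted temperature above 1 softens overconfident probabilities, while a temperature below 1 sharpens underconfident ones. Dividing the logits by a positive constant does not change the predicted species, so only the probability compared with $\theta_{\text{score}}$ is affected. Since a single temperature cannot correct both behaviors at once, the mixed pattern observed on ambiguous observations may require a more flexible method.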

Statement on inclusion

We affirm our commitment to promoting diversity and inclusivity in scientific research. Our collected crowdsourced data brings together a wide range of participants. We actively encourage and welcome involvement from individuals of diverse backgrounds, expertise, and perspectives, recognizing the value of their contributions in advancing ecological research and promoting a more comprehensive understanding of plant biodiversity.

\printbibliography