ZooScan
Received July 14, 2009; accepted in principle October 26, 2009; accepted for publication November 14, 2009
more rapid means for characterizing plankton distributions assessed from a variety of different sampling methods.

Early attempts to use optical bench-top methods for treatment of plankton samples were undertaken by Ortner et al. (Ortner et al., 1979), who used silhouette photography to record the contents of a plankton sample. Silhouette imaging of plankton samples on photographic film or video imaging and a limited digitization of plankton samples followed by automatic identification was further developed in the 1980s

Indicators program: (http://www.ciesm.org/marine/programs/zooplankton.htm).

In this paper, we first describe the overall approach used, including ZooScan hardware together with ZooProcess and PkID software. We discuss building and validating training sets, the selection of classification algorithms and the accuracy of body size and biomass estimations that can be derived from the ZooScan system. We propose standards for long-term archiving and sharing of raw and processed images and output files. We demonstrate a semiautomatic classification
ZooScan measurements of body length and cross-sectional area. Automated measurements of preserved zooplankton as recorded by ImageJ in ZooProcess were compared with manual measurements of several zooplankton taxa (appendicularians, chaetognaths, copepods, euphausiids, ostracods and thecosome pteropods) collected in the California Current on CalCOFI (California Cooperative Oceanic Fisheries Investigations) cruises. Specimens were collected along CalCOFI line 80 between February 2006 and August 2008 and preserved in formaldehyde buffered with

between 22 August 2007 and 8 October 2008. These samples were scanned by ZooScan in two size fractions, <1 mm and >1 mm, leading to a set of 60 scans. For classification, we began with a learning set of 13 categories (10 zooplankton + 3 non-zooplankton) that we had created previously. This can be downloaded at: (http://www.obs-vlfr.fr/LOV/ZooPart/ZooScan/Training_Set_Villefranche/esmeraldo_learning_set.zip).
Software. The sequence followed in scanning and analysis of zooplankton samples is shown schematically in Fig. 2. The initial steps are completed using ZooProcess software: (i) scan and process a blank background image, (ii) scan the sample to acquire a high quality raw image, linked to associated metadata, (iii) normalize the raw image and convert to full grey scale range, (iv) process images by subtracting the background and removing frame edges, (v) extract and measure individual objects. Subsequent analysis steps are done with ZooProcess in combination with PkID: (vi) create a

Sample treatment. The aliquot volume of a plankton sample to be analyzed is determined by the abundance and size distribution of the organisms. It is important to minimize coincidence of overlapping animals on the optical surface. At present, a maximum of approximately 1000–1500 objects is scanned in the larger frame, although this value can be exceeded. Because the abundance of organisms usually decreases with increasing body size, it is preferable to scan two (or more) separate size fractions of each sample. One fraction contains larger individuals that are less abundant, obtained from a

Fig. 2. Schematic illustration of the primary steps in the scanning and analysis of zooplankton samples with the ZooScan/ZooProcess/Plankton Identifier system.
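The processing chain in steps (i)-(v) can be approximated outside ZooProcess; the following minimal Python sketch uses scikit-image under stated assumptions (the file names, the Otsu threshold and the minimum object size are illustrative choices, not ZooProcess defaults).

```python
"""Sketch (not the ZooProcess code) of steps (i)-(v): background scan,
sample scan, grey-level normalization, background subtraction, and
extraction/measurement of individual objects with scikit-image."""
import numpy as np
from skimage import exposure, io, measure, util
from skimage.filters import threshold_otsu

# (i)-(ii) blank background scan and raw sample scan (hypothetical file names)
background = util.img_as_float(io.imread("background_scan.tif", as_gray=True))
raw = util.img_as_float(io.imread("sample_scan.tif", as_gray=True))

# (iii) normalize the raw image to the full grey-scale range
norm = exposure.rescale_intensity(raw, out_range=(0.0, 1.0))

# (iv) subtract the background; organisms are darker than the background,
# so the difference highlights them
diff = np.clip(background - norm, 0.0, 1.0)

# (v) segment and measure individual objects
mask = diff > threshold_otsu(diff)
labels = measure.label(mask)
for region in measure.regionprops(labels):
    if region.area < 50:  # ignore tiny specks (arbitrary cut-off)
        continue
    minr, minc, maxr, maxc = region.bbox
    vignette = norm[minr:maxr, minc:maxc]  # per-object vignette
    print(region.label, region.area, region.perimeter, region.major_axis_length)
```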
to avoid air bubble formation. We do not stain samples, in order to maintain them unaltered for future comparative studies. Although ZooProcess software provides a tool to separate overlapping organisms once the sample has been scanned, it is important to physically separate touching organisms in the scanning frame and separate them from the frame edges prior to digitizing the sample. Manual separation takes 10 min per sample.

Detailed description

Image processing. ZooProcess provides two methods for removing a heterogeneous background. A daily scan of the cell filled with filtered water is recommended, because the background image provides a blank and also records instrument stability over time. A background scan is faster to process and requires less computer memory than the second option, the rolling ball method (Sternberg, 1983), which requires no blank image to be scanned. A lower setting of the rolling ball diameter parameter will clean the background, but may create artifacts in zones of uneven contrast on the bodies of larger organisms.
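As an illustration of the rolling-ball alternative (not the ZooProcess implementation), scikit-image ships a rolling-ball background estimator; the radius below is arbitrary and, as noted above, smaller radii clean more aggressively but can introduce artifacts on large organisms.

```python
"""Sketch of rolling-ball background removal (Sternberg, 1983) with
scikit-image. ZooScan images have dark organisms on a light background,
so the image is inverted before estimating the background."""
import numpy as np
from skimage import io, util
from skimage.restoration import rolling_ball

gray = util.img_as_float(io.imread("sample_scan.tif", as_gray=True))

inverted = 1.0 - gray                           # bright objects on dark background
background = rolling_ball(inverted, radius=50)  # radius is an illustrative setting
flattened = np.clip(inverted - background, 0.0, 1.0)
cleaned = 1.0 - flattened                       # back to dark objects on a flat light background
```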
but it can also be used standalone. It has been developed in DELPHI (Borland), because the source code can be compiled. For supervised learning, PkID works in conjunction with Tanagra (Rakotomalala, 2005; http://eric.univ-lyon2.fr/~ricco/tanagra/en/tanagra.html), also developed in DELPHI. Source code for PkID can be obtained on request for customization (see Gasparini, 2007).

Three successive steps are followed in applying PkID: (i) “Learning” creates training files that link measurements from groups of similar objects (vignettes); (ii)

combinations of variables and immediately test their suitability for a particular classification task using cross validation.

Data analysis and performance evaluations. Evaluation of classifier performance requires the examination of a CM, which is a contingency table crossing true (manually validated) and predicted (assigned by the classifier) identification of objects. Object counts on the matrix diagonal represent correct identifications and the sum
Table I: The different classifiers in Plankton Identifier (Gasparini, 2007) analyzed in the present study

Name | Short description | Reference
5-NN | k-nearest neighbor using heterogeneous value difference metric | Peters et al. (2002)
S-SVC linear | Support Vector Machine from LIBSVM library, using linear functions | Chang and Lin (2001)
S-SVC RBF | Support Vector Machine from LIBSVM library, using radial basis activation functions | Chang and Lin (2001)
Random Forest | Bagging, decision tree algorithm | Breiman (2001)
C 4.5 | Decision tree algorithm | Quinlan (1993)
Multilayer Perceptron | Multilayer Perceptron neural network | Simpson et al. (1992)
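Readers who want to reproduce this kind of comparison outside PkID/Tanagra can use scikit-learn analogues of the Table I algorithms, as in the sketch below. Assumptions: a hypothetical learning_set.csv of object features with a "category" column; scikit-learn's k-NN uses a Euclidean rather than the HVDM metric, and its decision tree is CART rather than C4.5.

```python
"""Sketch comparing classifier families from Table I by 10-fold
cross-validation with scikit-learn stand-ins."""
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

data = pd.read_csv("learning_set.csv")          # hypothetical features + 'category'
X = data.drop(columns="category").values
y = data["category"].values

classifiers = {
    "5-NN": KNeighborsClassifier(n_neighbors=5),
    "SVC linear": SVC(kernel="linear"),
    "SVC RBF": SVC(kernel="rbf"),
    "Random Forest": RandomForestClassifier(n_estimators=500, random_state=0),
    "C4.5-like tree": DecisionTreeClassifier(criterion="entropy"),
    "Multilayer Perceptron": MLPClassifier(max_iter=1000, random_state=0),
}

for name, clf in classifiers.items():
    # scaling matters for k-NN, SVC and the perceptron; trees ignore it
    scores = cross_val_score(make_pipeline(StandardScaler(), clf), X, y, cv=10)
    print(f"{name:22s} accuracy = {scores.mean():.3f} +/- {scores.std():.3f}")
```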
Data management. Here we recommend appropriate practices for archiving ZooScan data and metadata. ZooScan data include: (i) raw images of zooplankton samples or sub-samples, (ii) raw background images from the system’s hardware, (iii) digital images of individual objects (i.e. vignettes), (iv) measurements made by ZooProcess software on individual objects, (v) classification results determined automatically or semi-automatically using PkID and (vi) computed abundances, biovolumes and biomass. ZooScan metadata include: (i) information about sampling and measured

measurements. Then we present a brief case study from the Bay of Villefranche, in order to illustrate the sequential processes involved in sample and data analysis.

Learning set creation

After scanning, normalization, background subtraction and extraction of vignettes, the first step is to create a preliminary learning set or to use an existing learning set to classify (“predict”) a small number of dominant groups. Our experiments to determine the optimal number of objects to sort into each category for con-
Fig. 4. Dependence of (A) recall (true positives) and (B) contamination (false positives) rate on the number of categories predicted by the
classifier, using different classifier algorithms (see Table I).
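A curve of the kind shown in Fig. 4 can be approximated from a learning set by retraining while progressively increasing the number of categories. The sketch below is one such approximation, not the procedure used for Fig. 4 itself: it uses a scikit-learn Random Forest, macro-averaged recall and contamination (1 - precision), a hypothetical learning_set.csv, and the simplification of lumping rare categories into "other".

```python
"""Sketch: recall and contamination versus the number of predicted categories."""
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import cross_val_predict

data = pd.read_csv("learning_set.csv")           # hypothetical features + 'category'
X = data.drop(columns="category").values
y = data["category"]

counts = y.value_counts()
for k in range(2, len(counts) + 1):
    keep = set(counts.index[:k])                 # keep the k most abundant categories
    y_k = y.where(y.isin(keep), other="other").values
    pred = cross_val_predict(
        RandomForestClassifier(n_estimators=300, random_state=0), X, y_k, cv=5)
    recall = recall_score(y_k, pred, average="macro", zero_division=0)
    contamination = 1.0 - precision_score(y_k, pred, average="macro", zero_division=0)
    print(f"{k:2d} categories: recall={recall:.2f}  contamination={contamination:.2f}")
```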
length measurements (data not shown), although feret diameter typically showed the best relationship. Comparison of automated measurements of surface area (as area excluded) with manual measurements of the same individuals was carried out for three taxa (copepods, euphausiids and chaetognaths: Fig. 6). In all cases, there was a linear relationship between manual and automated measurements. The automated measurements were somewhat higher for copepods and euphausiids, but lower for chaetognaths. These results suggest that automated measurements are consistent and reproducible, although their values may differ somewhat from manually determined values.

The relationships between C and N content and automated measurements of linear or areal dimensions were well described by power curves (Fig. 7). Much of the scatter in the relationships shown in Fig. 7 is attributable to the mixture of different species included in these analyses. The exponents for C and N were similar to each other, implying relatively constant C:N ratios. In the case of both copepods and chaetognaths, the exponents relating C or N content to linear dimensions (feret diameter) were close to
Fig. 6. Relationship between automated measurements of area excluded and manual measurements of projected area for (A) copepods,
(B) euphausiids, (C) chaetognaths from the California Current.
3 and the exponents in relation to areal measurements (area excluded) were close to 2. However, for euphausiids, the exponents were close to 2 and 1, respectively. These differences are consistent with the changing body shapes with ontogeny of euphausiids, as the cephalothorax width and depth tend to increase in proportion
to total body length through their development. In the case of copepods and euphausiids, area excluded was a slightly better predictor of C or N content than feret diameter, although for all three taxa the results indicate that automated measurements of either linear or areal dimensions of vignettes can be related in a useful manner to the biomass of these organisms.

Results from the Bay of Villefranche-sur-mer

Learning set optimization and application

To create our initial learning set for the Villefranche case study, we utilized a pre-existing learning set (see Methods) to predict 5000 objects from the Villefranche time series. We then manually validated the prediction into 30 categories (which took ca. 4 h). We included categories for “bad focus” objects, artifacts, bubbles and fibers. To improve the classifier, we then randomly selected a fixed number of vignettes drawn from each of these 30 categories from the Villefranche time series and created a new learning set. This second learning set was tested by cross validation in PkID using the Random Forest algorithm. Categories containing only a few objects with low accuracy of detection were not retained (they were left to contaminate the prediction). The accuracy of the prediction for this second learning set was much better than for the first iteration, and the subsequent manual validation was done faster.

As samples were analyzed from different seasons in the Villefranche time series, newly encountered taxonomic categories were added into the learning set when they became sufficiently numerous, provided that confusion with other dominant categories remained low. This occurred, for example, with cladocerans that bloomed only in autumn and were nearly absent during other time periods. Sometimes categories with relatively high contamination were maintained in the learning set because of their ecological value. For example, the Limacina category showed a 34.5% error rate (Table II). Nevertheless, it was maintained as a separate group because subsequent manual validation was rapid and the seasonal development of this taxonomic group was important. After the prediction results did not improve significantly with additional iterations, we considered the learning set satisfactory. It contained 14 zooplankton and 6 other categories (Table II) and was applied to the rest of the samples.
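One refinement iteration of this semi-automatic workflow can be expressed roughly as follows; the sketch is not the ZooProcess/PkID implementation, and the file name, column names, per-category quota and minimum-size threshold are illustrative assumptions.

```python
"""Sketch of one learning-set refinement step: draw a fixed number of
manually validated objects per category, drop very rare categories, and
check the resulting learning set by Random Forest cross-validation."""
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

validated = pd.read_csv("validated_objects.csv")   # features + manually checked 'category'
N_PER_CATEGORY = 300                                # fixed number of vignettes per category
MIN_OBJECTS = 50                                    # categories rarer than this are dropped

counts = validated["category"].value_counts()
kept = counts[counts >= MIN_OBJECTS].index

learning_set = (validated[validated["category"].isin(kept)]
                .groupby("category", group_keys=False)
                .apply(lambda g: g.sample(min(len(g), N_PER_CATEGORY), random_state=0)))

X = learning_set.drop(columns="category").values
y = learning_set["category"].values
scores = cross_val_score(
    RandomForestClassifier(n_estimators=500, random_state=0), X, y, cv=10)
print(f"cross-validated accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```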
Table II: Confusion matrix for the 20 categories in the learning set used for machine classification of the 2007–2008 time series by the Random Forest algorithm. The matrix crosses the categories Aggregates, Aggregates_dark, Appendicularia, Bad focus, Bubbles, Chaetognatha, Cladocera, Copepoda_other, Copepoda_small, Decapoda_large, Egg-like, Fibers, Limacina, Medusae, Nectophores, Oithona, Pteropoda_other, Radiolaria, Scratch and Thaliacea, and reports for each category its percentage in the learning set, total count, recall and 1-precision. (Matrix values not reproduced here.)
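A Table II-style summary (confusion matrix plus per-category recall and 1-precision) can be derived along the following lines; the two label files are hypothetical stand-ins for PkID's predicted and manually validated output.

```python
"""Sketch: confusion matrix and per-category recall / 1-precision."""
import pandas as pd
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

y_true = pd.read_csv("validated_labels.csv")["category"]   # hypothetical files
y_pred = pd.read_csv("predicted_labels.csv")["category"]
cats = sorted(y_true.unique())

cm = pd.DataFrame(confusion_matrix(y_true, y_pred, labels=cats),
                  index=cats, columns=cats)
precision, recall, _, support = precision_recall_fscore_support(
    y_true, y_pred, labels=cats, zero_division=0)

summary = pd.DataFrame(
    {"Total": support, "Recall": recall, "1-Precision": 1.0 - precision}, index=cats)
print(cm)
print(summary.round(2))
```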
Table III: Final categories used for classifying the 2007–2008 time series in the Bay of Villefranche, after initial machine classification, followed by manual validation, then manual subdivision into additional categories.

Categories used for classifying the Villefranche time series include: Aggregates, Aggregates_dark, Algae, Amphipoda, Decapoda_large, Decapoda_other, Echinodermata and Egg like (remainder of the table not recovered).

Our semi-automated analysis of an annual cycle of zooplankton variation in the Bay of Villefranche revealed pronounced seasonal variation in abundance, with substantial changes in the composition of the mesozooplankton (Fig. 9). Calanoid copepods were the numerically dominant organisms at all times of year, increasing from 75% before the peak of the bloom to a maximum of 95% at the peak and declining to 55% afterwards, as cladocerans, appendicularians and other taxa increased in relative importance (Figs 9 and 10). Poecilostome and oithonid copepods were abundant
Fig. 9. Total abundance of mesozooplankton from 2007 to 2008, and the proportion of primary mesozooplankton categories (inset pie
diagrams) before, during and after the 2008 spring bloom in the Bay of Villefranche. All classifications were validated manually.
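Seasonal summaries of this kind (total abundance and category proportions per sampling date) can be derived from the validated object tables roughly as sketched below; the file name, column names and the simple aliquot/volume normalization are assumptions, not the exact ZooProcess formulas.

```python
"""Sketch: total mesozooplankton abundance and category proportions per date."""
import pandas as pd

objects = pd.read_csv("validated_objects.csv")   # one row per validated object

# each counted object represents 1 / (fraction scanned) individuals,
# normalized by the volume of water filtered by the net (ind. m^-3)
objects["ind_per_m3"] = 1.0 / (objects["subsample_fraction"] * objects["volume_filtered_m3"])

by_date_cat = (objects.groupby(["sample_date", "category"])["ind_per_m3"]
                      .sum()
                      .unstack(fill_value=0.0))

total_abundance = by_date_cat.sum(axis=1)                # total mesozooplankton, ind. m^-3
proportions = by_date_cat.div(total_abundance, axis=0)   # category share per date
print(total_abundance.head())
print(proportions.head().round(3))
```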
Equally important for such sample collections is the archiving of digital representations of the samples, to facilitate permanent records of their contents as a complement to the conservation of the physical samples themselves. Such digital images permit automatic or semiautomatic image analysis, rapid measurement of
Here we endorse a practical semi-automated method that may help biologists obtain taxonomically more detailed data sets with sufficient accuracy. Comparison between machine predicted and manually validated classifications showed that for dominant taxa such as copepods, automatic recognition was sufficiently accurate. However, for less abundant taxa such as appendicularians and chaetognaths, automatic recognition generally overestimated true abundances (but underestimated the abundance of decapods). Fully automated classification would have resulted in inaccurate descrip-

dimensions of digitized organisms can be related to their biomass, applied on a taxon-specific basis.

The classification method proposed here allows a relatively detailed taxonomic characterization of zooplankton samples and provides a practical compromise between fully automatic but less accurate classification and accurate manual classification of zooplankton. Useful size and biomass estimations may be rapidly obtained for ecologically oriented studies. Results from different ZooScan data sets can be combined using PANGAEA's data warehouse, thus encouraging coop-
Breiman, L. (2001) Random forests. Mach. Learn., 45, 5–32.

Brown, J. H., Gillooly, J. F., Allen, A. P. et al. (2004) Toward a metabolic theory of ecology. Ecology, 85, 1771–1789.

Chang, C.-C. and Lin, C.-J. (2001) Training nu-support vector classifiers: theory and algorithms. Neural Comput., 13, 2119–2147.

Checkley, D. M., Jr, Davis, R. E., Herman, A. W. et al. (2008) Assessing plankton and other particles in situ with the SOLOPC. Limnol. Oceanogr., 53, 2123–2126.

Culverhouse, P. F., Williams, R., Reguera, B. et al. (1996) Automatic categorisation of 23 species of Dinoflagellate by artificial neural network. Mar. Ecol. Prog. Ser., 139, 281–287.

Culverhouse, P. F., Williams, R., Reguera, B. et al. (2003) Do experts

Irigoien, X., Fernandes, J. A., Grosjean, Ph. et al. (2009) Spring zooplankton distribution in the Bay of Biscay from 1998 to 2006 in relation with anchovy recruitment. J. Plankton Res., 31, 1–17.

Jeffries, H. P., Sherman, K., Maurer, R. et al. (1980) Computer processing of zooplankton samples. In Kennedy, V. (ed.), Estuarine Perspectives. Academic Press, New York, pp. 303–316.

Jeffries, H. P., Berman, M. S., Poularikas, A. D. et al. (1984) Automated sizing, counting and identification of zooplankton by pattern recognition. Mar. Biol., 78, 329–334.

Kennett, J. P. (1968) Globorotalia truncatulinoides as a paleo-oceanographic index. Science, 159, 1461–1463.

Ortner, P. B., Cummings, S. R., Aftring, R. P. et al. (1979) Silhouette
Other variables
Perim: The length of the outside boundary of the object

Major: Primary axis of the best fitting ellipse for the object

Minor: Secondary axis of the best fitting ellipse for the object

Circ: Circularity = (4 * Pi * Area) / Perim^2; a value of 1 indicates a perfect circle, a value approaching 0 indicates an increasingly elongated polygon

Feret: Maximum feret diameter, i.e. the longest distance between any two points along the object's boundary

Skelarea: Surface area of skeleton in pixels. In a binary image, the skeleton is obtained by repeatedly removing pixels from the edges of objects until they are reduced to the width of a single pixel

APPENDIX 5. Examples of extracted vignettes and measurements. Vignettes of (a) an appendicularian, (b and c) copepods with antennules in different orientations and (c) a chaetognath
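For reference, equivalent values for the measurements defined above (Perim, Major, Minor, Circ, Feret, Skelarea) can be obtained for a single binary object mask with scikit-image, as in the sketch below; ZooProcess records these through ImageJ, so exact values may differ slightly, and the mask file is hypothetical.

```python
"""Sketch: computing Perim, Major, Minor, Circ, Feret and Skelarea for one object."""
import math
import numpy as np
from skimage import measure, morphology

mask = np.load("object_mask.npy")               # hypothetical binary mask of one object

props = measure.regionprops(measure.label(mask))[0]
perim = props.perimeter                          # Perim
major = props.major_axis_length                  # Major
minor = props.minor_axis_length                  # Minor
circ = 4.0 * math.pi * props.area / perim ** 2   # Circ: 1 = perfect circle
feret = props.feret_diameter_max                 # Feret (maximum caliper diameter)
skelarea = morphology.skeletonize(mask).sum()    # Skelarea: pixels in the 1-pixel-wide skeleton

print(perim, major, minor, round(circ, 3), feret, skelarea)
```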