Eurographics/ACM SIGGRAPH Symposium on Computer Animation (2004)
R. Boulic, D. K. Pai (Editors)
Cartoon Textures
Christina de Juan† and Bobby Bodenheimer‡
Vanderbilt University
Abstract
In this paper we present a method for creating novel animations from a library of existing two-dimensional cartoon
data. Drawing inspiration from the idea of video textures, sequences of similar-looking cartoon data are combined
into a user-directed sequence. Starting with a small amount of cartoon data, we employ a method of nonlinear
dimensionality reduction to discover a lower-dimensional structure of the data. The user selects a start and end
frame and the system traverses this lower-dimensional manifold to re-sequence the data into a new animation.
The system can automatically detect when a new sequence has visual discontinuities and may require additional
source material.
Categories and Subject Descriptors (according to ACM CCS): I.3.7 [Computer Graphics]: Animation
1. Introduction
The process of traditional cel animation has seen a number
of enhancements in recent years, but these have focused on
such tasks as texture mapping the cels [CJTF98], creating
shadows [PFWF00], or retargeting the motion of one character onto another character [BLCD02]. However, cel animation remains a very tedious and time-consuming task, requiring twenty-four hand-drawn frames per second of animation. For a typical animated TV series, artists bring life to
familiar cartoon characters for every episode, yet no method
exists that would allow them to reuse their drawings for future episodes.
Software packages such as Toon Boom Technologies,
[FBC∗ 95], can create simple inbetweens based on vector animation. Although an animator could reuse the original models of the characters, the basic animation still has to be created, and these animations tend to lack the expressiveness of
familiar styles, such as the distinctive style of animations by
Chuck Jones. The same issues arise when creating 3D models for cartoon characters and ’toon-rendering them. ’Toon-rendering is a technique that can render 3D scenes in styles
that have the look of a traditionally animated film; it is often
called ’toon shading. Adding a great deal of deformation,
like squash and stretch or incorporating other principles of
animation [Bla94] to a 3D model of a character is often challenging, requiring the skills of a talented artist. Even when
’toon-rendering a 3D character, one cannot expect it to look
like the traditionally hand-drawn Wile E. Coyote getting flattened or stretched in a visually extreme manner.
† email:cdejuan@vuse.vanderbilt.edu
‡ email:bobbyb@vuse.vanderbilt.edu
© The Eurographics Association 2004.
This work presents a method for creating novel animations from a library of existing cartoon data. Drawing inspiration from the idea of video textures [SSSE00], sequences
of similar-looking cartoon data are combined into a user-directed sequence. Our goal is re-sequencing cartoon data
to create new motion from the original data that retains the
same characteristics and exposes similar or new behaviors.
The number of new behaviors that can be re-sequenced
is restricted by the amount of data in our library for each
character. Starting with a small amount of cartoon data, we
use an unsupervised learning method for nonlinear dimensionality reduction to discover a lower-dimensional structure
of the data. The user selects a desired start and end frame
and the system traverses this lower-dimensional manifold to
re-sequence the data into a new animation. Our method is
model-free, i.e., no a priori knowledge of the drawing or
character is required. The user does not need the ability to
animate, or know what an acceptable inbetween is, since
the data is already provided. The system can detect when
a transition is abrupt, allowing the user to inspect the new
animation and determine if any additional source material is
needed. Minimal user input is required to generate new animations,
and the system requires much less data than the
video textures method for re-sequencing.
Christina de Juan & Bobby Bodenheimer / Cartoon Textures
2. Previous Work
2.1. Animation-Based Methods
The issue of generating inbetweens for cartoon animation
has been studied. Reeves [Ree81] presented a method for
creating inbetweens by using moving-point constraints. A
moving-point is a curve in space and time that provides a
constraint on the path and speed of a specific point on the
keyframe for a character. These moving-points are manually
specified, and allow for multiple paths and speeds of interpolation. While this method provides control in creating a
new animated sequence by generating the inbetweens “automatically,” a great deal of manual effort is involved.
Sederberg and Greenwood [SG92] studied how to
smoothly blend between a pair of 2-dimensional polygonal shapes. By modelling a pair of contours as thin wires,
[SG92] minimize equations of work for deforming thin
wires to achieve smooth shape transformation between the
two contours. They address the problem of vertex correspondences by specifying a small number of initial corresponding
point pairs on the input contours. While their results show
nice shape blending, the shapes must be polygonal, therefore using existing animations would require polygonalizing
every image. Their results also depend on the initial manual
placement of the corresponding vertex pairs.
Bregler et al. [BLCD02] reused cartoon motion data by
capturing the motion of one character and retargeting it onto
a new cartoon character. This approach does not generate a
new cartoon motion. Their system requires a great deal of
expert user intervention to train the system and a talented
artist to draw all the key-shapes. Each of the key-shapes must
be manually specified for the source and target character,
and parameterized by hand to find the affine deformations
that the source key-shapes undergo before applying them to
the target key-shapes. Their work provides a method for reusing
the overall motion of the cartoon data, but it does not
look at the structure of the data itself and therefore cannot
re-sequence the data to expose meaningful new behaviors.
We are motivated by the work of Schödl et al. [SSSE00]
on video textures to retain the original images in motion sequences
but play them back in non-repetitive streams of arbitrary length.
Video textures is most similar to our goal of
re-sequencing cartoon images, specifically the “video-based
animation” section of their work, although it is not user-directable.
They use the L2 distance to compute the differences between frames
for building the video structure. We
want to compare the differences between frames in a similar fashion
to analyze the data for re-sequencing. [SSSE00]
assume a large data set with incremental changes between
frames. Their methods do not extend well to cartoon data,
which is inherently sparse and contains exaggerated deformations
between frames. In their follow-up work [SE01],
they use user-directed video sprites for creating new character
animations. However, the examples shown require a
vast amount of video data: 30 minutes of video footage for
a hamster yielding 15,000 sprite frames. In our work, the
largest cartoon data set we use has 2,000 frames, yet we still
achieve good results with sparser data of 560 frames.
Recently, other researchers have found inspiration from
video textures and have applied it to motion capture data.
Sidenbladh et al. [SBS02] employ a probabilistic search
method to find the next pose in a motion stream and obtain
it from a motion database. Arikan and Forsyth [AF02] construct
a hierarchy of graphs connecting a motion database
and use randomized search to extract motion satisfying specified
constraints. Kovar et al. [KGP02] use a similar idea to
construct a directed graph of motion that can be traversed
to generate different styles of motion. Lee et al. [LCR∗ 02]
model motion as a first-order Markov process and also construct
a graph of motion. They demonstrate three interfaces
for controlling the traversal of their graph. In our work, once
the structure of the data is learned, the manifold that represents
the data can be traversed to re-sequence the data.
2.2. Dimensionality Reduction
Dimensionality reduction for image data sets consisting of a
large number of images has been used to represent a mean
image, or subset of images, that is representative of the
entire data set. A commonly used dimensionality reduction
method is Principal Component Analysis (PCA) [Jol86], a
linear embedding technique that will generate a mean image
and eigenvectors that span the principal shape variations
in the image space. However, this technique does not
retain the spatio-temporal structure in the data that we are
seeking. We assume our data have some underlying spatial
surface (manifold) for which we wish to discover an embedding
into a lower-dimensional space. Multidimensional
scaling (MDS) [KW78] is another approach that finds an embedding
that preserves the pairwise distances, equivalent to
PCA when those distances are Euclidean. However, many
data sets contain essential nonlinear structures that are invisible to PCA and MDS.
Two techniques for manifold-based nonlinear dimensionality reduction are Isomap [TdSL00] and Locally Linear
Embedding (LLE) [RS00]. Both methods use local neighborhoods of nearby data to find a low-dimensional manifold
embedded in a high-dimensional space. However, neither
of these methods accounts for temporal structure in cartoon
data. A modified version of Isomap, called Spatio-Temporal
Isomap (ST-Isomap) [JM03], can account for the temporal
dependencies between sequentially adjacent frames. We borrow the idea of extending Isomap using temporal neighborhoods from [JM03], and use ST-Isomap for dimensionality
reduction of cartoon data to maintain the temporal structure
in the embedding. [JM03] focuses on synthesizing humanoid
motions from a motion database by automatically learning
motion vocabularies. Starting with manually segmented motion capture data, ST-Isomap is applied to the motion segments in two passes, along with clustering techniques for
each of the resulting sets of embeddings. Motion primitives
and behaviors are then extracted and used for motion synthesis. This type of analysis and synthesis also requires more
data than is typically available for cartoon synthesis. Thus,
we adapt the methods of [JM03] to use images as input, and
use only one pass of ST-Isomap for creating the embedding
used for re-sequencing.
3. Technical Approach
Since we are not generating new frames, the types of new
motions that can be re-sequenced are restricted by the
amount of data in our library for each character. Our method
is model-free, requiring no a priori knowledge of the cartoon
character. First, the cartoon data is pre-processed. Next, nonlinear dimensionality reduction is used to learn the structure
of the data. Finally, by selecting a start and end frame from
the original data set, the data is re-sequenced to create a new
motion.
3.1. Pre-Processing Cartoon Data
Our input data comes from 2D animated video or ’toon-rendered motion capture. The video is pre-processed to remove the background and register the character relative to a
fixed location throughout the sequence. There are a number
of video-based tracking techniques that can be used for background subtraction, although currently we manually segment
the images. Since our representation of the data is model-free, we do not need to identify any specific region of the
character, i.e., limbs or joints, so it does not matter that
the characters may undergo deformation. The registration is
done using the centroid of the character in each frame and
repositioning it to the center of the image, facilitating the
computation of a distance matrix later.
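As an illustration of the registration step, the following is a minimal sketch, assuming each frame comes with a boolean segmentation mask of the character. The function names `shift_image` and `register_to_center`, the mask input, and the background `fill` value are hypothetical and not part of the described system.

```python
import numpy as np

def shift_image(image, dy, dx, fill=0):
    """Translate a 2D image by (dy, dx) pixels, filling exposed areas with `fill`."""
    out = np.full_like(image, fill)
    h, w = image.shape[:2]
    src_y = slice(max(0, -dy), min(h, h - dy))
    dst_y = slice(max(0, dy), min(h, h + dy))
    src_x = slice(max(0, -dx), min(w, w - dx))
    dst_x = slice(max(0, dx), min(w, w + dx))
    out[dst_y, dst_x] = image[src_y, src_x]
    return out

def register_to_center(image, mask, fill=0):
    """Move the character so that the centroid of `mask` lies at the image center.

    Assumption: `mask` is True on character pixels and has the same shape as `image`.
    """
    rows, cols = np.nonzero(mask)
    if rows.size == 0:          # no character pixels: nothing to register
        return image.copy()
    h, w = mask.shape
    dy = int(round((h - 1) / 2 - rows.mean()))
    dx = int(round((w - 1) / 2 - cols.mean()))
    return shift_image(image, dy, dx, fill)
```

Because only the centroid is used, no limbs or joints need to be identified, which matches the model-free setting described above.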
We examine four cartoon sequences with different characters: a gremlin, Daffy Duck, the Grinch, and Michigan J.
Frog. For the gremlin data set, there are 2,000 images of size
320 by 240 that are cropped and scaled to 150 by 180. The
gremlin data set is created from three clips of motion capture
of free-style dancing performed by the same subject, which
is played through a gremlin model and ’toon-rendered on
a constant white background. There are 560 images in the
Daffy data set, with images of size 720 by 480, cropped and
scaled to 310 by 238. There are 295 images in the Grinch
data set and 146 images in the Frog data set, both sets with
images of size 640 by 480. For these sequences, the characters are segmented and placed on a constant blue background. Figure 1 shows examples of the frames from the
original data along with the corresponding segmented images. Our focus in this work is primarily with the gremlin
and Daffy Duck data sets because of their larger size. We
later discuss the issues with the smaller data sets.
Figure 1: The top row shows a frame from the gremlin data
set before and after processing. The second row shows an
original and cleaned up frame from the Daffy Duck data.
The third row shows an example frame from the Grinch data, and
the last row shows one from Michigan J. Frog. Daffy and M.J. Frog ™&
©Warner Bros. Entertainment Inc. (s04). Grinch ©Turner
Entertainment Co.
3.2. Dimensionality Reduction
Nonlinear dimensionality reduction finds an embedding of
the data into a lower-dimensional space. We use a modified Isomap, ST-Isomap, to perform the manifold-based nonlinear dimensionality reduction. Like standard Isomap, ST-Isomap preserves the intrinsic geometry of the data as captured in the geodesic manifold distances between all pairs of
data points. It also retains the notion of temporal coherence,
which is critical to the resulting output for cartoon data. ST-Isomap uses an algorithm similar to Isomap, as follows:
1. Compute the local neighborhoods based on the distances
$D_X(i, j)$ between all pairs of points $i, j$ in the input space
$X$ based on a chosen distance metric.
2. Adjust $D_X(i, j)$ to account for temporal neighbors.
3. Estimate the geodesic distances into a full distance matrix
$D(i, j)$ by computing all-pairs shortest paths from $D_X$,
which contains the pairwise distances.
4. Apply MDS to construct a d-dimensional embedding of
the data.
The difference between Isomap and ST-Isomap is in step 2,
where the temporal dependencies are accounted for.
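The four steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation used in this work: the input `D` is assumed to be a symmetric distance matrix with frames in temporal order, step 2 is simplified to linking each frame to its `k_temporal` following frames (the actual adjustment is described in [JM03]), and the function name and default parameters are hypothetical.

```python
import numpy as np

def st_isomap_sketch(D, k_spatial=7, k_temporal=2, d=2):
    """Return (Y, G): a d-dimensional embedding and the geodesic distance matrix."""
    n = D.shape[0]
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)

    # Step 1: keep only edges to the k_spatial nearest neighbors of each point.
    for i in range(n):
        nbrs = [j for j in np.argsort(D[i]) if j != i][:k_spatial]
        G[i, nbrs] = D[i, nbrs]
        G[nbrs, i] = D[nbrs, i]

    # Step 2 (simplifying assumption): also link temporally adjacent frames.
    for i in range(n):
        for t in range(1, k_temporal + 1):
            if i + t < n:
                G[i, i + t] = G[i + t, i] = D[i, i + t]

    # Step 3: all-pairs shortest paths (Floyd-Warshall) give geodesic distances.
    for m in range(n):
        G = np.minimum(G, G[:, m:m + 1] + G[m:m + 1, :])

    # Step 4: classical MDS on the geodesic distances.
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (G ** 2) @ J
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:d]
    Y = vecs[:, order] * np.sqrt(np.maximum(vals[order], 0.0))
    return Y, G
```

In this sketch, any `k_temporal` of at least one links every frame to its successor, so the graph is always a single connected component and the MDS step never sees infinite distances.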
One issue with Isomap is determining the size of the spatial neighborhoods. If the data is sufficiently dense, Isomap
can form a single connected component, which is important in representing the data as a single manifold structure.
The connected components of a graph represent the distinct
pieces of the graph. Two data points (nodes in the graph) are
in the same connected component if and only if there exists
some path between them.
In our experiments, enlarging the spatial neighborhood (step 1) far enough
always yields a single connected
component, regardless of the sparseness of the
data. However, depending on the distance metric used and
the sparseness of the data, the spatial neighborhoods may need to
be enlarged to the point where no meaningful structure is
found. This issue arises with Isomap because its main objective is to preserve the global structure and the
geodesic distances of the manifold. ST-Isomap, by including adjacent temporal neighbors, remedies this deficiency,
allowing a smaller spatial neighborhood size while forming
a single connected component. Having all of the data points
in the same embedding is desirable for re-sequencing. Using from one to three temporal neighbors and a small spatial
neighborhood results in a meaningful structure that is usable
for re-sequencing.
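The effect of temporal neighbors on connectivity can be illustrated with a small connected-component count. This is a sketch only; the edge lists below are made-up toy data, and `count_components` is a hypothetical helper.

```python
from collections import deque

def count_components(n, edges):
    """Count connected components of an undirected graph with nodes 0..n-1."""
    adj = [[] for _ in range(n)]
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    seen = [False] * n
    count = 0
    for start in range(n):
        if seen[start]:
            continue
        count += 1                 # found a node not reached so far
        seen[start] = True
        queue = deque([start])
        while queue:               # breadth-first search from `start`
            u = queue.popleft()
            for v in adj[u]:
                if not seen[v]:
                    seen[v] = True
                    queue.append(v)
    return count

# Toy data: two groups of similar frames that share no spatial neighbors.
spatial_edges = [(0, 1), (1, 2), (3, 4), (4, 5)]
# Linking each frame to its successor in time bridges the two groups.
temporal_edges = [(i, i + 1) for i in range(5)]
```

With only the spatial edges the toy graph has two components; adding the temporal edges merges them into one without enlarging the spatial neighborhood.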
3.3. Distance Metrics
The key to creating a good lower-dimensional embedding
of our data is the distance metric used to create the input
to Isomap. When computing the local neighborhoods for
D(i, j), we examined three different distance metrics: the L2
distance, the cross-correlation between pairs of images, and
an approximation to the Hausdorff distance [DHR93]. As
mentioned previously, video textures uses the L2 distance for
computing the similarity between video frames. Although
this works well for densely sampled video, it is insufficient
for dealing with sparse cartoon data.
3.3.1. L2 Distance
The first distance metric is the L2 distance between all pairs
of images. Given two input images $I_i$ and $I_j$:
$$d_{L2}(I_i, I_j) = \sqrt{\|I_i\|^2 + \|I_j\|^2 - 2\,(I_i \cdot I_j)}$$
Only the luminance of the images is used for the L2 distance.
The distance matrix $D_{L2}(i, j)$ is created such that
$$D_{L2}(i, j) = d_{L2}(I_i, I_j)$$
This metric is simple and works well for large data sets with
incremental changes between frames, but is unable to handle
cartoon data, which is inherently sparse and contains exaggerated deformations between frames.
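The L2 distance matrix can be computed for all pairs at once from the expansion above. The following is a minimal sketch, assuming the frames are already registered and supplied as equally sized luminance arrays; the function name is hypothetical.

```python
import numpy as np

def l2_distance_matrix(images):
    """Pairwise L2 distances between luminance images of identical shape."""
    X = np.asarray(images, dtype=np.float64).reshape(len(images), -1)
    sq_norms = np.einsum("ij,ij->i", X, X)        # ||I_i||^2 for every image
    gram = X @ X.T                                # I_i . I_j for every pair
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * gram
    np.maximum(d2, 0.0, out=d2)                   # guard against tiny negative round-off
    return np.sqrt(d2)
```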
3.3.2. Cross-Correlation Distance
The second distance metric is based on the cross-correlation
between a pair of images. This metric also uses only the luminance of the images. Given two input images $I_i$ and $I_j$:
$$c_{i,j} = \frac{\sum_m \sum_n (I_{i,mn} - \bar{I}_i)(I_{j,mn} - \bar{I}_j)}{\sqrt{\left(\sum_m \sum_n (I_{i,mn} - \bar{I}_i)^2\right)\left(\sum_m \sum_n (I_{j,mn} - \bar{I}_j)^2\right)}}$$
where $\bar{I}_i$ and $\bar{I}_j$ are the mean values of $I_i$ and $I_j$ respectively.
This equation gives us a scalar value $c_{i,j}$ for the correlation coefficient between image $I_i$ and image $I_j$ in the range
[−1.0, 1.0]. However, we want the correlation-based distance metric to be 0.0 for highly correlated images and 1.0
for anti-correlated images. Therefore the correlation-based
distance matrix between images $I_i$ and $I_j$ is $D_{corr}(i, j) = (1.0 - c_{i,j})/2.0$.
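A minimal sketch of the correlation-based distance follows, assuming equally sized luminance images, none of which is constant (a constant image has zero variance and would cause a division by zero). The function name is hypothetical.

```python
import numpy as np

def correlation_distance_matrix(images):
    """Pairwise distances (1 - c_ij) / 2, where c_ij is the correlation coefficient."""
    X = np.asarray(images, dtype=np.float64).reshape(len(images), -1)
    Xc = X - X.mean(axis=1, keepdims=True)        # subtract each image's mean
    norms = np.sqrt((Xc ** 2).sum(axis=1))
    C = (Xc @ Xc.T) / np.outer(norms, norms)      # correlation coefficients
    C = np.clip(C, -1.0, 1.0)                     # remove round-off outside [-1, 1]
    return (1.0 - C) / 2.0
```

The mapping sends perfectly correlated images to a distance of 0.0 and anti-correlated images to 1.0, as required above.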
3.3.3. Hausdorff Distance
The third distance metric is an approximation to the Hausdorff distance. This metric uses an edge map and a distance
map of each image. The edge map E is computed using a
standard Canny edge detector [Can86]. The distance map X
is the distance transform calculated from E, and represents
the pixel distance to the nearest edge in E for each pixel in
X. Then, the Hausdorff distance between a pair of images $I_i$
and $I_j$ is:
$$D_{Haus}(i, j) = \frac{\sum_{(x,y) \in E_i \equiv 1} X_j(x, y)}{\sum_{(x,y) \in E_i \equiv 1} E_i(x, y)}$$
where $E_i$ is the edge map of image $I_i$, $X_j$ is the distance map
of image $I_j$, and $(x, y)$ denote the corresponding pixel coordinates for each image. Figure 2 shows an example of the
edge map and distance map for a given frame.
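The approximation above averages, over the edge pixels of one image, the distance to the nearest edge in the other. The following sketch assumes the binary edge maps have already been computed (for example by a Canny detector) and that each contains at least one edge pixel. The brute-force distance map is for illustration on small images only; a dedicated distance-transform routine would be preferable in practice. Function names are hypothetical.

```python
import numpy as np

def distance_map(edges):
    """Distance from every pixel to the nearest edge pixel (brute force)."""
    edges = np.asarray(edges, dtype=bool)
    ys, xs = np.nonzero(edges)
    h, w = edges.shape
    gy, gx = np.mgrid[0:h, 0:w]
    # Shape (h, w, number_of_edge_pixels): distance to every edge pixel.
    d = np.sqrt((gy[..., None] - ys) ** 2 + (gx[..., None] - xs) ** 2)
    return d.min(axis=-1)

def hausdorff_approx(edges_i, edges_j):
    """Mean of the distance map X_j sampled at the edge pixels of E_i."""
    edges_i = np.asarray(edges_i, dtype=bool)
    x_j = distance_map(edges_j)
    return x_j[edges_i].sum() / edges_i.sum()
```

Note that this quantity is directed: swapping the two edge maps can give a different value, so a symmetric distance matrix would require an extra step such as averaging both directions, which is not specified here.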
Figure 2: An edge map in the center, and distance map on
the right, for a frame from the Daffy Duck data set. ™&
©Warner Bros. Entertainment Inc. (s04).
Figure 3 shows an example of the L2, correlation-based, and
Hausdorff distance matrices for the Daffy
data set. A value of zero corresponds to similar images,
while a value of one corresponds to dissimilar images. Note
that the diagonal is zero as expected, and the banding indicates structure in the data. We found that the Hausdorff
distance metric works best for all data sets.
Figure 3: L2, Correlation-based and Hausdorff distance
matrices for the Daffy data set.
3.4. Embedding
Once the distance matrix for a data set is computed, we apply ST-Isomap to obtain the lower-dimensional embedding
of the cartoon data. The dimensionality of the embedding
space must be determined. Choosing a dimensionality too
low or too high results in incoherent re-sequencing.
Estimating the true dimensionality of the data using ST-Isomap is different than with PCA. In PCA, picking the dimensionality of a reduced data set can be done automatically
such that the proportion of variance (shape variations) retained by mapping down to n dimensions can be found as the
normalized sum of the n largest eigenvalues. This retained
variance is typically chosen to be greater than 80% (usually
90%), while the remaining variance is assumed to be noise.
PCA seeks to maximize the principal shape variations in the
data, while minimizing the error associated with reconstructing the data from the lower-dimensional representation. The
intrinsic dimensionality of the data estimates the lower dimensional subspace where the high dimensional data actually “lives”.
In ST-Isomap, the residual variance is computed using the
intrinsic manifold differences, which take into account possible nonlinear folding or twisting. We pre-select the number of dimensions in which to embed the data, from one to
10 dimensions. The true dimensionality of the data can be
estimated from the decrease in residual variance error as the
dimensionality of the embedding space is increased. We select the “knee” of the curve, or the point at which the residual variance does not significantly decrease with added dimensions. Figure 4 shows the residual variances and the 2-dimensional projections of the neighborhood graphs for all
the data sets. The neighborhood graphs represent the manifold structure of the data, but only 2-dimensional embedding spaces are shown in the figure. Notice that the gremlin
and the Daffy data sets are reduced to about five dimensions
(Figure 4(i,k,m,o)), as indicated by the variance plots. The
Grinch data can be reduced down to a three dimensional
manifold (Figure 4(a,c)). The Frog data set is very sparse,
and can at best be reduced to a five dimensional manifold.
The differences in the variances and neighborhood graphs
for the data sets in the figure are also influenced by varying the spatial neighborhood size for creating the original
Isomap and the ST-Isomap embeddings. More spatial neighbors are included in the original Isomap to ensure that a single connected component is embedded for all data sets except the gremlin, which is sufficiently dense. The Daffy data
set requires 20 spatial neighbors using original Isomap. For
the Grinch data, 74 spatial neighbors are needed to generate
a single connected component using original Isomap. Similarly, the Frog data requires 10 spatial neighbors to generate a single connected component using original Isomap. All
data sets required seven spatial neighbors to generate the single connected component using ST-Isomap, while the temporal neighbors varied from one to three.
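Selecting the dimensionality from the residual variance curve can be sketched as follows. The sketch assumes the definition of residual variance commonly used with Isomap, one minus the squared correlation between geodesic distances and distances in the embedding. The `min_drop` tolerance used to locate the knee is a hypothetical heuristic, and the sample curve is illustrative rather than measured data.

```python
import numpy as np

def residual_variance(d_geo, Y):
    """1 - R^2 between geodesic distances and Euclidean distances in embedding Y."""
    diff = Y[:, None, :] - Y[None, :, :]
    d_emb = np.sqrt((diff ** 2).sum(axis=-1))
    iu = np.triu_indices(len(Y), k=1)             # each unordered pair once
    r = np.corrcoef(d_geo[iu], d_emb[iu])[0, 1]
    return 1.0 - r ** 2

def pick_knee(residuals, min_drop=0.01):
    """Smallest dimensionality after which the residual variance stops dropping.

    residuals[k] is the residual variance of the (k + 1)-dimensional embedding.
    """
    for k in range(len(residuals) - 1):
        if residuals[k] - residuals[k + 1] < min_drop:
            return k + 1
    return len(residuals)

# Illustrative curve for dimensionalities 1..5 (not taken from Figure 4).
example_curve = [0.30, 0.15, 0.06, 0.055, 0.054]
```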
3.5. Re-sequencing New Animations
To generate a new animation, the user selects a start frame
and an end frame, and the system traverses the Isomap embedding space to find the shortest cost path through the manifold. This path gives the indices of the images used for the
resulting animation, which is created by re-sequencing the
original, unregistered images. Dijkstra’s algorithm [Sed02]
is used to find the shortest cost path through the manifold. The dimensionality of the embedding space used for
re-sequencing, i.e., for traversing the neighborhood graph,
varies for each data set. The Daffy data set and the Frog data
set use a 5-dimensional embedding space, the gremlin data
set uses a 4-dimensional embedding space, and the Grinch
data set uses a 3-dimensional embedding space.
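The path search can be illustrated with a standard implementation of Dijkstra's algorithm over a neighborhood graph. In this sketch the graph is a dictionary from frame index to `(neighbor, cost)` pairs; how the edge costs are derived from the embedding is not specified here, so the costs in the toy graph are hypothetical.

```python
import heapq

def shortest_path(graph, start, goal):
    """Return the list of frame indices on the cheapest path, or None if unreachable."""
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue                      # stale heap entry
        done.add(u)
        if u == goal:
            break
        for v, cost in graph.get(u, ()):
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if goal not in dist:
        return None
    path = [goal]
    while path[-1] != start:              # walk predecessors back to the start frame
        path.append(prev[path[-1]])
    return path[::-1]

# Toy neighborhood graph with made-up costs.
toy_graph = {20: [(24, 1.0), (60, 5.0)], 24: [(60, 1.5)], 60: [(70, 1.0)], 70: []}
```

The returned indices are the frames that would be played back, in order, as the re-sequenced animation.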
3.5.1. Post-Processing
To ensure the smoothest looking re-sequenced animations,
we add a small amount of automatic post-processing. Only
the start and end keyframes for each re-sequenced segment
are specified, but currently there are no restrictions on the
number of inbetweens that the path should have. As such,
the shortest cost path may not visit all temporally adjacent
frames in the embedding space. To improve the re-sequenced
animation, we process the frames specified from the path
using the following automatic techniques. First, any missing sequentially adjacent frames within eight frames are inserted, helping to smooth some of the choppiness associated with skipping the missing frames. Sequentially adjacent frames are those that are adjacent in the original sequence. For example, if the re-sequenced path selected is [20
24 60 70] before inserting the sequentially adjacent frames,
the resulting path becomes [20 21 22 23 24 60 70]. Using up
to eight sequentially adjacent frames does not significantly
change the overall re-sequenced path since the temporally
adjacent frames are usually near each other in the embedding space.
Figure 4: Results showing the residual variance and 2-dimensional projection of the neighborhood graph generated with
original Isomap and ST-Isomap using the Hausdorff distance matrix on the Grinch, Michigan J. Frog, the gremlin, and Daffy
Duck. The number of temporal neighbors used is indicated in the figure.
Panels: (a) Grinch original Isomap, variance plot; (b) Grinch original Isomap, 2D graph; (c) Grinch ST-Isomap with three temporal neighbors, variance plot; (d) Grinch ST-Isomap with three temporal neighbors, 2D graph; (e) Frog original Isomap, variance plot; (f) Frog original Isomap, 2D graph; (g) Frog ST-Isomap with three temporal neighbors, variance plot; (h) Frog ST-Isomap with three temporal neighbors, 2D graph; (i) Gremlin original Isomap, variance plot; (j) Gremlin original Isomap, 2D graph; (k) Gremlin ST-Isomap with two temporal neighbors, variance plot; (l) Gremlin ST-Isomap with two temporal neighbors, 2D graph; (m) Daffy original Isomap, variance plot; (n) Daffy original Isomap, 2D graph; (o) Daffy ST-Isomap with two temporal neighbors, variance plot; (p) Daffy ST-Isomap with two temporal neighbors, 2D graph.
Axes: the variance plots show residual variance against Isomap dimensionality (1 to 10); the 2D graphs show the two-dimensional Isomap embedding with the neighborhood graph.
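The insertion of missing sequentially adjacent frames can be sketched as below. This is one reading of the rule, assuming only forward gaps of at most eight frames are filled; the handling of backward jumps is not specified, so they are left untouched. The function name is hypothetical.

```python
def fill_adjacent(path, max_gap=8):
    """Insert the skipped original frames wherever the path jumps forward by <= max_gap."""
    if not path:
        return []
    out = [path[0]]
    for nxt in path[1:]:
        prev = out[-1]
        gap = nxt - prev
        if 1 < gap <= max_gap:
            out.extend(range(prev + 1, nxt))   # frames skipped by the shortest path
        out.append(nxt)
    return out
```

Applied to the example path [20 24 60 70], only the jump from 20 to 24 is short enough to be filled, giving [20 21 22 23 24 60 70].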
After adding these frames, we further improve the
smoothness of the re-sequenced animations by matching the
velocity of the centroid of each character from frame to
frame in the new path. The new sequence was found based
on the distance metric using registered images, described
in Section 3.1. The registered images thus no longer possess any offset of the character within the frame. In post-processing, the original, unregistered images are used. For
each original image in the data set, the character’s centroid
is calculated and stored. Then a velocity vector is computed
based on each frame’s previous and next temporal neighbor
in the original (unregistered) sequence. When given a path
for re-sequencing, the position and velocity of the centroid
for the character in every frame are known. The position of
the character is adjusted from one frame to the next in the
new sequence based on the projected position indicated by
the first frame’s velocity vector from the original sequence.
This adjustment is done whenever the path jumps from one
single frame or subsequence in the path to another. Subsequences in the path are handled such that the first frame
in the subsequence has its character repositioned based on
the previous frame’s projected position, while the remaining
frames in that subsequence are adjusted to the first frame’s
new position.
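The centroid velocity matching can be sketched as follows. Several details are assumptions made for illustration: the velocity is taken as a central difference of the stored centroids, a "jump" is any step in the path that does not go to the next original frame, and the result is a per-frame translation offset to apply when compositing. The function name is hypothetical.

```python
def velocity_match(path, centroids):
    """Return one (dx, dy) offset per path entry.

    centroids[t] is the (x, y) centroid of the character in original frame t.
    """
    n = len(centroids)

    def velocity(t):
        lo, hi = max(t - 1, 0), min(t + 1, n - 1)
        if hi == lo:
            return (0.0, 0.0)
        span = hi - lo
        return ((centroids[hi][0] - centroids[lo][0]) / span,
                (centroids[hi][1] - centroids[lo][1]) / span)

    offsets = [(0.0, 0.0)]
    for prev, cur in zip(path, path[1:]):
        ox, oy = offsets[-1]
        if cur == prev + 1:
            # Same subsequence: reuse the offset chosen for its first frame.
            offsets.append((ox, oy))
        else:
            # Jump: place `cur` at the position projected from the previous frame.
            vx, vy = velocity(prev)
            target_x = centroids[prev][0] + ox + vx
            target_y = centroids[prev][1] + oy + vy
            offsets.append((target_x - centroids[cur][0],
                            target_y - centroids[cur][1]))
    return offsets
```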
Finally, if the character translates along the z-axis then
the figure often changes in size within the frame. The final
re-sequenced frames are adjusted using a scale factor based
on the average pixel volume in the sequence. The scale factor $s$ is defined as
$$s = \sqrt{\frac{\mathrm{Vol}_a}{\mathrm{Vol}_s}}$$
where $\mathrm{Vol}_a$ is the average pixel
volume in the entire path, and $\mathrm{Vol}_s$ is the average pixel volume of a subsequence (or just the pixel volume of a single
frame). Then $s$ is applied to each frame of the subsequence
(or single frame) in the path.
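As a worked example of the scale factor, the sketch below takes the pixel volume of each frame as a plain number (for instance, a count of character pixels, which is an assumption about how the volume is measured). The function name is hypothetical.

```python
import math

def scale_factor(path_volumes, subsequence_volumes):
    """s = sqrt(Vol_a / Vol_s) for one subsequence (or single frame) of the path."""
    vol_a = sum(path_volumes) / len(path_volumes)
    vol_s = sum(subsequence_volumes) / len(subsequence_volumes)
    return math.sqrt(vol_a / vol_s)
```

For a path whose frames average 250 pixels and a subsequence averaging 1000 pixels, the factor is sqrt(250 / 1000) = 0.5, so the subsequence is drawn at half its linear size. The square root converts a ratio of areas into a ratio of lengths.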
We found that for the gremlin data set and the Daffy data set,
using two or three temporal neighbors yielded the best results. The gremlin data set is well populated with only a few
large jumps at the transitions between motion capture clips,
but the Hausdorff distance metric is an improvement over the
L2 distance. For the Daffy data set, there are also a few large
jumps in the original data resulting from the camera cuts for
those scenes. The Hausdorff distance metric is significantly
better than the L2, and reasonable paths are found through
the embedding space.
We are able to re-sequence the gremlin data into a short
motion clip that retains the same characteristics of the original dance motion, but shows a new dance behavior. This
result was achieved by selecting six keyframes (sets of start
and end frames) and applying ST-Isomap with two temporal
neighbors, and post processed as described in Section 3.5.1.
The result is a sequence with a total of 57 frames.
We also re-sequence the Daffy data into two short motion clips, each retaining the original characteristics of the
gesturing motion, but showing a new gesturing behavior.
The clips were created by selecting six and seven keyframes
and applying ST-Isomap with two temporal neighbors. The
first clip was minimally post-processed, only the missing
temporally adjacent frames were inserted, and resulted in
a sequence with a total of 59 frames. The second clip was
post-processed by including any missing temporally adjacent frames and velocity-matching the centroids, resulting
in a sequence with a total of 98 frames. Both clips show new
gesturing behaviors.
3.5.2. Threshold Detection
In re-sequencing cartoon data, the transitions from the shortest cost path may result in a visual discontinuity. A small cost
would indicate a good transition, while a large cost would indicate a bad transition. The system can automatically identify when the cost of a transition is too large. A threshold is
determined for each data set, and the system notifies the user of abrupt
transitions in the re-sequenced animation. The threshold is
currently determined manually for each data set by examining the embedding structure and its associated costs. This
notification allows the user to decide if additional source material or inbetweens are needed to produce a more visually
compelling sequence.
4. Results
A demonstration of re-sequencing cartoon data with ST-Isomap can be seen in the accompanying video. The Hausdorff distance metric works best for all of our cartoon data.
Data set   Transition   DHaus      Assessment
Daffy      246 → 235    0.413511   good
Daffy      326 → 77     6.173898   bad
Daffy      99 → 243     3.010666   accept
Daffy      235 → 236    0.094055   good
Daffy      98 → 99      7.270829   bad
Table 1: Examples of the distance values between pairs of
frames using the Hausdorff distance metric on the Daffy
data set. Adjacent frames in the original data set may not
always have a low distance value, as shown in the table. The
transition from frame 98 to 99 is an abrupt transition according to the distance metric.
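For readers who want to experiment with distances of the kind reported in Table 1, the following is a minimal sketch of the classical symmetric Hausdorff distance between two 2-D point sets, for example the coordinates of edge pixels in two frames. It is an illustration only: the function names are hypothetical, the brute-force search is not the fast computation mentioned in the acknowledgments, and the exact variant and preprocessing behind the reported values are not reproduced here.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def directed_hausdorff(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Largest distance from any point in `a` to its nearest point in `b`."""
    if not a or not b:
        raise ValueError("both point sets must be non-empty")
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Hausdorff distance: the larger of the two directed distances."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


# Toy point sets (made up for illustration, not taken from the cartoon data).
edges_a = [(0.0, 0.0), (1.0, 0.0)]
edges_b = [(0.0, 0.0), (4.0, 0.0)]
distance = hausdorff(edges_a, edges_b)
```

The brute-force version costs O(|a| |b|) distance evaluations per pair of frames, so it is only practical for small edge maps.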
After generating several re-sequenced animations for a
particular data set, we inspect the cost values associated with
the transitions and determine a threshold value for abrupt
transitions. Once the threshold is determined, the system
can use threshold detection to indicate to the user when a
large transition cost has occurred. Our findings indicate that
a threshold value of DHaus < 2.2 represents a good transition while DHaus > 3.9 represents an abrupt transition for
the Daffy data set. Table 1 shows some of the distance values associated with the transitions for a re-sequenced animation, while Figure 5 shows the frames referred to in
the table. The transition from frame 99 to 243 has a value
2.2 ≤ DHaus ≤ 3.9, representing a region that should be inspected by the animator before accepting or rejecting. In this
case it is accepted.
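The three-way decision described above can be sketched as a small classifier. This is a hedged illustration, not the system's implementation: the function name, the label strings, and the treatment of the boundary values 2.2 and 3.9 as part of the "inspect" region are assumptions; the default thresholds are the ones reported for the Daffy data set.

```python
from typing import Literal

Label = Literal["good", "inspect", "abrupt"]


def classify_transition(cost: float,
                        good_below: float = 2.2,
                        abrupt_above: float = 3.9) -> Label:
    """Label a transition cost using two per-data-set thresholds.

    Costs strictly below `good_below` are good, costs strictly above
    `abrupt_above` are abrupt, and everything in between is left for
    the animator to inspect.
    """
    if cost < good_below:
        return "good"
    if cost > abrupt_above:
        return "abrupt"
    return "inspect"


# Transition costs for the Daffy data set, copied from Table 1.
table_1 = {
    (246, 235): 0.413511,
    (326, 77): 6.173898,
    (99, 243): 3.010666,
    (235, 236): 0.094055,
    (98, 99): 7.270829,
}
labels = {pair: classify_transition(cost) for pair, cost in table_1.items()}
```

With these thresholds, the transition from frame 99 to 243 falls in the "inspect" band, matching the case that the animator reviewed and accepted.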
To test the system’s ability to detect a large transition,
an example is generated with three images from the Daffy
data set removed. ST-Isomap is applied using two temporal
neighbors and seven spatial neighbors. In the path generated
from the data set with missing frames, the transition cost
exceeded the pre-set threshold and resulted in a sequence
with visual discontinuities. Inserting inbetween frames at
the point of highest transition cost generates an improved
sequence. Figure 6 shows the two paths without any post-processing. The sequence generated from the data set with
missing images differs from the other sequence only in the
transition from the first frame 326 to the second frame 77,
which is where the inbetweens were added.
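Locating the point where inbetweens are needed amounts to scanning the costs of consecutive frames along a path. The sketch below assumes a cost function supplied by the caller; the helper names are hypothetical, and the example path is a toy sequence assembled from two Table 1 transitions rather than a path produced by the system.

```python
from typing import Callable, List, Sequence, Tuple

Transition = Tuple[int, int, float]  # (from_frame, to_frame, cost)


def transition_costs(path: Sequence[int],
                     cost: Callable[[int, int], float]) -> List[Transition]:
    """Cost of each consecutive pair of frames along a re-sequenced path."""
    if len(path) < 2:
        raise ValueError("a path needs at least two frames")
    return [(a, b, cost(a, b)) for a, b in zip(path, path[1:])]


def worst_transition(path: Sequence[int],
                     cost: Callable[[int, int], float]) -> Transition:
    """The transition with the highest cost, a candidate spot for inbetweens."""
    return max(transition_costs(path, cost), key=lambda t: t[2])


def flag_abrupt(path: Sequence[int],
                cost: Callable[[int, int], float],
                threshold: float) -> List[Transition]:
    """All transitions whose cost exceeds the per-data-set threshold."""
    return [t for t in transition_costs(path, cost) if t[2] > threshold]


# Costs copied from Table 1; the path itself is a toy example.
known_costs = {(98, 99): 7.270829, (99, 243): 3.010666}
toy_path = [98, 99, 243]


def lookup(a: int, b: int) -> float:
    return known_costs[(a, b)]


worst = worst_transition(toy_path, lookup)
flagged = flag_abrupt(toy_path, lookup, threshold=3.9)
```

Here both calls single out the transition from frame 98 to 99, the one Table 1 marks as bad.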
The Michigan J. Frog data set illustrates the challenges in
re-sequencing cartoon data. This data set has 146 frames, of
which only 73 are unique. Although ST-Isomap can reduce
the data to approximately five dimensions, traversing the
resulting embedding space for re-sequencing yields jumpy
motion. A transition threshold can still be found even though
the data set is so sparse. A threshold value Dcorr ≤ 0.58 represents a good transition. Figure 7 shows examples of good
and bad transitions for the Frog, and the corresponding transition costs, for a path generated using ST-Isomap with three
temporal neighbors.
Figure 5: An example of good, bad, and acceptable transitions for the Daffy data set from a path generated using ST-Isomap with two temporal neighbors. The pairs of frames
shown correspond with the values shown in Table 1. ™ & © Warner Bros. Entertainment Inc. (s04).
5. Conclusion and Future Work
We are able to re-sequence cartoon data to new animations
that retain the characteristics of the original motion. Our
method is model-free, i.e., no a priori knowledge of the
drawing or cartoon character is required. The keys to the
method are the identification of a suitable metric to characterize the differences in cartoon images and the use of a nonlinear dimensionality reduction and embedding technique,
ST-Isomap. The system can characterize when a novel re-sequencing requires additional source material to produce a
visually compelling animation.
We foresee that this system will be useful as an aid to
artists charged with generating inbetweens in cel animation.
If a sufficient body of prior animation is available, the inbetween artist could use the system to match keyframes in a
new animation and generate inbetweens from existing data.
Only if the keyframes were sufficiently novel or the transition cost too high would the inbetween artist be required to
generate new art.
Figure 6: A filmstrip of two paths without any post-processing. The bottom row shows the path generated from the Daffy data
set with three frames removed. The top row shows the same path with inbetweens inserted at the point of highest transition cost,
in this case between frames 326 and 77. ™ & © Warner Bros. Entertainment Inc. (s04).
Figure 7: An example of good and bad transitions for the Frog data set. The first pair of images demonstrates a good transition
from frame 22 to 27 with a cost of 0.198132. The second pair of images demonstrates a bad transition from frame 12 to 109
with a cost of 0.609729. ™ & © Warner Bros. Entertainment Inc. (s04).
We would like to address the issue of synthesizing new
data, i.e., generating transitions with blending or interpolating. Currently, the system will only determine that an abrupt
transition has been made, but it cannot automatically generate the necessary inbetweens. Using optical flow for generating the inbetween frames may work if the two frames are
not significantly dissimilar. Another technique to investigate
is smoothing the transitions by adding motion blur [BE01].
Another improvement would be to explore other methods of traversing the Isomap embedding space. The shortest cost only represents the similarity between two frames.
Some cartoon motions that are very expressive and exaggerated may call for a quick transition between dissimilar
frames. In this case, the lowest cost would not be appropriate. Post-processing the re-sequenced animations helps
produce smoother results, but the results may still be too
discontinuous. We would like to investigate how adding a
component of velocity (similar to the post-processing) to the
distance metric may change the embedding space, and thus
change the resulting re-sequencing. One possible way of incorporating the “velocity” into the distance metric would be
to calculate the optic flow of the edge map of each image
and use it to estimate a velocity term.
Other error metrics more specific to cartoon images, such
as perception-based image metrics, may reveal how the human visual system accepts certain types of transitions in
an animated character, while other transitions are obviously
bad. Even though some of the data is sparse, it was originally
drawn that way, and when playing back the animation, some
of the frames that are considered abrupt transitions by our
system may actually be visually acceptable by the viewer.
User studies may provide some insight into this behavior.
Finally, the sparseness of the data is an issue because of
the slow acquisition of clean and segmented images of cartoon characters. To acquire a larger amount of data more
quickly, we are looking into automatic methods of background segmentation. One method we have begun using is
a level set method that looks at the character as regions of
specific color values.
6. Acknowledgments
The authors thank Chad Jenkins for his insights in using ST-Isomap. We thank Steve Park and the Graphics, Visualization, and Usability Center at the Georgia Institute of Technology for supplying the motion capture data used in this
study. We also thank Robert Pless for his insightful suggestion on how to quickly compute the Hausdorff metric. We
wish to thank Gary R. Simon at Warner Bros. Entertainment
Inc. for giving us permission to use the images of Daffy Duck
and Michigan J. Frog. We also thank the reviewers for their
helpful comments. This research was supported by National
Science Foundation Grant IIS-0237621.
References
[AF02] Arikan O., Forsyth D. A.: Synthesizing constrained motions from examples. ACM Transactions on Graphics 21, 3 (July 2002), 483–490.
[BE01] Brostow G. J., Essa I.: Image-based motion blur for stop motion animation. In Proceedings of the 28th annual conference on Computer graphics and interactive techniques (2001), ACM Press, pp. 561–566.
[Bla94] Blair P.: Cartoon Animation. Walter Foster Publishing, Inc., 1994.
[BLCD02] Bregler C., Loeb L., Chuang E., Deshpande H.: Turning to the masters: Motion capturing cartoons. ACM Transactions on Graphics 21, 3 (July 2002), 399–407.
[Can86] Canny J.: A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 8, 6 (1986), 679–698.
[CJTF98] Correa W. T., Jensen R. J., Thayer C. E., Finkelstein A.: Texture mapping for cel animation. Computer Graphics 32, Annual Conference Series (1998), 435–446.
[DHR93] Huttenlocher D. P., Klanderman G. A., Rucklidge W. J.: Comparing images using the Hausdorff distance. IEEE Transactions on Pattern Analysis and Machine Intelligence 15, 9 (September 1993), 850–863.
[FBC∗95] Fekete J., Bizouarn É., Cournarie É., Galas T., Taillefer F.: TicTacToon: A paperless system for professional 2-D animation. In SIGGRAPH 95 Conference Proceedings (1995), Cook R., (Ed.), Addison Wesley, pp. 79–90.
[JM03] Jenkins O. C., Mataric M. J.: Automated derivation of behavior vocabularies for autonomous humanoid motion. In Proc. of the 2nd Intl. joint conf. on Autonomous agents and multiagent systems (2003), pp. 225–232.
[Jol86] Jolliffe I.: Principal Component Analysis. Springer-Verlag, New York, 1986.
[KGP02] Kovar L., Gleicher M., Pighin F.: Motion graphs. ACM Transactions on Graphics 21, 3 (July 2002), 473–482.
[KW78] Kruskal J. B., Wish M.: Multidimensional Scaling. Sage Publications, Beverly Hills, 1978.
[LCR∗02] Lee J., Chai J., Reitsma P. S. A., Hodgins J. K., Pollard N. S.: Interactive control of avatars animated with human motion data. ACM Transactions on Graphics 21, 3 (July 2002), 491–500.
[PFWF00] Petrovic L., Fujito B., Williams L., Finkelstein A.: Shadows for cel animation. In Proceedings of ACM SIGGRAPH 2000 (July 2000), ACM Press / ACM SIGGRAPH / Addison Wesley Longman, pp. 511–516.
[Ree81] Reeves W. T.: Inbetweening for computer animation utilizing moving point constraints. In Siggraph 1981, Computer Graphics Proceedings (1981), pp. 263–269.
[RS00] Roweis S., Saul L.: Nonlinear dimensionality reduction by locally linear embedding. Science 290 (2000), 2323–2326.
[SBS02] Sidenbladh H., Black M. J., Sigal L.: Implicit probabilistic models of human motion for synthesis and tracking. In Computer Vision — ECCV 2002 (1) (May 2002), Heyden A., Sparr G., Nielsen M., Johansen P., (Eds.), Springer-Verlag, pp. 784–800. 7th European Conference on Computer Vision.
[SE01] Schödl A., Essa I.: Controlled animation of video sprites. In Symposium on Computer Animation (2001), ACM Press / ACM SIGGRAPH, pp. 121–127.
[Sed02] Sedgewick R.: Algorithms in C: Part 5 Graph Algorithms, 3 ed. Addison-Wesley, 2002.
[SG92] Sederberg T. W., Greenwood E.: A physically based approach to 2-D shape blending. In Proceedings of ACM SIGGRAPH 1992 (1992), Catmull E. E., (Ed.), vol. 26, pp. 25–34.
[SSSE00] Schödl A., Szeliski R., Salesin D. H., Essa I.: Video textures. In Proceedings of ACM SIGGRAPH 2000 (July 2000), ACM Press / ACM SIGGRAPH / Addison Wesley Longman, pp. 489–498.
[TdSL00] Tenenbaum J. B., de Silva V., Langford J. C.: A global geometric framework for nonlinear dimensionality reduction. Science 290 (2000), 2319–2323.