Human motion synthesis from 3D video

Jonathan Starck

Human Motion Synthesis from 3D Video

Peng Huang, Adrian Hilton, Jonathan Starck
Centre for Vision, Speech and Signal Processing
University of Surrey, Guildford, GU2 7XH, UK
{p.huang,a.hilton,j.starck}@surrey.ac.uk

Abstract

Multiple view 3D video reconstruction of actor performance captures a level of detail for body and clothing movement which is time-consuming to produce using existing animation tools. In this paper we present a framework for concatenative synthesis from multiple 3D video sequences according to user constraints on movement, position and timing. Multiple 3D video sequences of an actor performing different movements are automatically constructed into a surface motion graph, which represents the possible transitions with similar shape and motion between sequences without unnatural movement artefacts. Shape similarity over an adaptive temporal window is used to identify transitions between 3D video sequences. Novel 3D video sequences are synthesized by finding the optimal path in the surface motion graph between user-specified key-frames for control of movement, location and timing. The optimal path, which satisfies the user constraints whilst minimizing the total transition cost between 3D video sequences, is found using integer linear programming. Results demonstrate that this framework allows flexible production of novel 3D video sequences which preserve the detailed dynamics of the captured movement for an actress with loose clothing and long hair without visible artefacts.

1. Introduction

Acquisition and reconstruction of human motion from temporal sequences of people has been a central issue in computer vision over the past decade, with advances in the video-based recovery of both skeletal motion and temporal surface sequences which capture the body, loose clothing and hair movement.
The reuse of captured temporal sequences of people (2D video, 3D marker positions, skeletal motion, 3D video surfaces) for animation production is an important problem. Both 2D and 3D video sequences contain detailed information on changes in shape and appearance which is not represented in skeletal motion. There is considerable interest in using this surface detail in animation production, as it is prohibitively expensive to reproduce from the underlying skeletal motion. In this paper we introduce a framework for user-controlled animation from captured 3D video sequences of people with loose clothing that preserves the non-rigid surface deformation from multiple view video reconstruction whilst allowing constraints on motion, timing and position.

Multiple view reconstruction of human performance as 3D video has advanced to the stage of capturing detailed non-rigid dynamic surface shape of the body, clothing and hair during motion [1, 24, 19, 22]. Full 3D video scene capture holds the potential to create truly realistic synthetic animated content by reproducing the dynamics of shape and appearance currently missing from marker-based motion capture. However, 3D video capture yields an unstructured volumetric or mesh approximation of the surface shape at each frame without temporal correspondence; estimating this correspondence has been the subject of much recent work [2, 1, 24, 23, 18, 17]. Although these techniques could be combined with the surface motion graph to achieve smooth transitions, accurate dense correspondence of dynamic surfaces remains an open problem. This makes the reuse of 3D video data more challenging than the reuse of conventional motion capture data. In this work, concatenative synthesis is based on 3D shape and motion similarity and does not require explicit surface correspondence.
For conventional motion capture, reuse is performed either by learning motion characteristics [13] or by example-based methods [21, 12, 11, 4]. Since learning methods risk losing important detail by abstraction, they cannot guarantee that synthesized motion is physically realistic, and existing systems do not focus on the satisfaction of high-level constraints. Example-based methods, which resample the captured data, retain the realism and allow higher-level control [5, 6, 16]. Previous research in example-based concatenative synthesis from motion capture [21, 12, 11, 4] employed a directed graph or Markov chain to represent temporal connections between frames and searched for a path satisfying user constraints. These approaches only deal with low degree-of-freedom (DOF) skeletal motion capture data and cannot be directly extended to high DOF 3D video surface motion capture data.

Concatenative animation was introduced in video-based rendering as a means to record and replay the detailed dynamics of a scene from 2D video [16]. Previous work constructed animation transition graphs from 3D video either manually [19] or interactively [26]. Video segments are re-ordered and concatenated at transition points to generate new animated content. The smoothness of transitions between video segments strongly affects the quality of the final synthesis results, so automatic identification of transitions based on similarity metrics is an important problem for high-quality synthesis.

In this paper, we extend example-based methods for motion synthesis from conventional motion capture to 3D video. Temporal 3D video sequence matching [7, 8] is used to automatically identify transitions. A framework is introduced to allow synthesis according to user-defined constraints on location, timing and motion key-frames. The system is able to automatically detect transitions, construct a motion graph and search for the optimal path to satisfy user-defined constraints.
The realism and flexibility of the motion synthesis is demonstrated on real data from a public database of 3D video which contains sequences of an actress performing multiple motions with complex non-rigid movement of clothing and hair.

2. Related work

2.1. Motion synthesis

Motion synthesis from conventional motion capture can be categorized into learning and example-based methods. Learning approaches model general motion characteristics and cannot guarantee that the synthesized motion is physically realistic or looks natural. Example-based methods provide an attractive alternative as there is no loss of detail from the original motion dynamics. Current example-based methods that allow high-level constraints on synthesized motion (timing, position etc.) for skeletal motion capture and 2D video are reviewed below.

Tanco and Hilton [21] introduced a two-level statistical model for skeletal motion capture to synthesize novel motions: a Markov chain on states (clusters of frames) and a Hidden Markov Model on frames. According to user-defined key-frames, synthesis is performed by searching for an optimal state sequence which minimizes the transition cost between key-frames. This approach does not allow user-defined constraints on position or timing. Similarly, Lee et al. [12] provided a two-layer structure allowing efficient search and interactive control from skeletal data. The recursive search terminates when the depth of the spanning tree reaches a given maximum. In both approaches a similarity metric based on the skeletal pose is used to identify transitions over a fixed temporal window.

Kovar et al. [11] construct a directed graph on skeletal motion capture sequences, referred to as a motion graph, where edges correspond to segments of motion and nodes identify connections between segments. Motion segments include original motions and generated transitions. Synthesis is performed by an optimal graph walk that satisfies user-defined constraints. Similarly, Arikan et al. [4] employ a directed graph to connect motion segments, where each node corresponds to a motion and each edge to a transition. A hierarchical randomized search is used to generate motions. Reitsma and Pollard [15] evaluate motion graphs for skeletal motion data and find that their capability degrades rapidly as the complexity of the environment or task increases. Wang and Bodenheimer [25] evaluate the optimally weighted cost metric for finding transitions through cross-validation and a user study.

Previous approaches on databases of skeletal motion capture exploit the known temporal correspondence for similarity metrics to identify transitions. Skeletal motion capture does not retain the detail of captured surface dynamics. Multiple view reconstruction of 3D video captures the detailed non-rigid surface dynamics as a surface mesh sequence without temporal correspondence. However, current methods cannot be directly extended to 3D video since their similarity metrics consider only skeletal pose, which does not account for surface shape deformation in clothing and hair. The challenge addressed in this work is to identify transitions for 3D video sequences without temporal correspondence and to allow user-controlled synthesis of novel animations.

In previous research on 3D video, surface similarity has been defined either manually or through a shape descriptor. Starck et al. [20] manually identify transitions to construct a motion graph for interactive control using 3D video sequences to preserve dynamic surface shape and appearance. Xu et al. [26] re-use 3D video in a motion editing framework, computing shape histograms in a spherical coordinate system to measure frame-to-frame dissimilarity. In this paper we identify transitions between 3D video sequences using shape similarity over an adaptive window. A framework is introduced to allow concatenative synthesis from 3D video according to high-level user-specified constraints on motion, position and timing.
This overcomes limitations of previous example-based approaches to animation from either skeletal motion capture or 3D video data.

2.2. 3D shape matching

In the 3D object retrieval literature, shape descriptors have been widely used to measure similarity. However, these descriptors aim to discriminate between rigid shapes for different object classes (book, mug, chair) and inter-class variations (cars, chairs), rather than between instances from sequences of the same moving non-rigid object, a person, which differ in both shape and motion. Although a number of researchers have addressed the problem of temporal similarity for skeletal motion in the concatenative motion synthesis literature [21, 12, 11, 4], temporal shape matching for 3D video sequences of people with unknown temporal correspondence has received limited investigation.

Osada et al. [14] introduced the Shape Distribution, which computes the probability distribution of geometric properties of an object as a signature to discriminate similar and dissimilar models. Johnson and Hebert [9] presented a 3D shape-based object recognition system using Spin Images, which encode the density of mesh vertices projected onto an object-centred space into a 2D histogram. Ankerst et al. [3] provided a 3D Shape Histogram descriptor, based on a partitioning of the space in which an object resides, to classify a molecular database. Kazhdan et al. [10] introduced a Spherical Harmonic Representation as a 3D shape descriptor for 3D object retrieval that is constructed by measuring the energy contained in different frequency bands. A comparison of these shape descriptors and their natural extension to temporal matching (via temporal filtering of static similarity) is provided in [8].

3. Surface Motion Graph

The framework comprises two stages: pre-processing the database of 3D video sequences to construct a surface motion graph; and motion synthesis by optimizing the graph path to satisfy user-defined constraints and minimize the transition cost between 3D video segments. The surface motion graph represents the possible transitions between 3D video sequences and is analogous to motion graphs [11] for skeletal motion capture sequences. Transition points between 3D video sequences are identified without temporal correspondence using a volumetric temporal shape similarity metric.

Surface motion transitions from a 3D video sequence $X = \{x_i\}$ to $Y = \{y_j\}$ are defined as an overlapped subsequence of $n$ frames. Smooth transitions could be generated by linearly blending the overlapped frames,

$$z_k = (1 - \alpha(t))\, x_{i+k} + \alpha(t)\, y_{j+k} \tag{1}$$

where $z_k$ denotes the blended frame of $x_{i+k}$ and $y_{j+k}$, $k = -\frac{n}{2}, \ldots, \frac{n}{2}$. Since blending requires surface correspondence, which is unknown in this work, Equation 1 is only used as a guide to identify the transition frames $x_i \in X$ and $y_j \in Y$ and the adaptive window length $n$ which maximize the similarity between the motion sequences $X$ and $Y$ over the transition. Concatenation is then performed as a switch from $X$ to $Y$ at the centre of the window. Although previous work on estimating dense surface correspondence [2, 1, 24, 23, 18, 17] could be combined with the surface motion graph to achieve smooth transitions, accurately estimating dense correspondences of dynamic surfaces remains an open problem. Linear blending of overlapped frames to generate a smooth transition will be addressed in future work. Subsequent sections present a metric for temporal shape similarity without surface correspondence and introduce the estimation of transitions according to Equation 1 for constructing the surface motion graph.

3.1. Temporal Shape Similarity

To identify possible transitions between 3D video sequences without temporal correspondence we use a time-filtered volumetric shape histogram to define a similarity metric over a temporal window. The time-filtered volumetric shape histogram has previously been shown to give good performance for human motion recognition on 3D video sequences of people [8]. A shape histogram partitions the space containing an object into disjoint cells corresponding to the bins of a histogram. Given a 3D surface mesh, a volume-sampling spherical shape histogram is constructed as follows:

1. A volumetric representation is constructed by rasterizing the surface into a set of voxels that lie inside the model.
2. Space is transformed to a spherical coordinate system $(r, \phi, \theta)$ around the centre of mass of the model.
3. A 3D histogram is constructed in spherical coordinates, accumulating the voxels in the volume representation with bin size $(\Delta r, \Delta\phi, \Delta\theta)$.
4. The final histogram is normalized by the total number of occupied voxels.

To estimate the orientation about the vertical axis which maximizes the shape similarity, or equivalently minimizes the difference in shape distribution, the spherical histogram similarity is evaluated for all feasible rotations in $\theta$. Instead of rotating the 3D mesh, we first generate a high-resolution histogram, shift it with 1° resolution in $\theta$, and re-bin it to a coarse histogram. For frame $i$ with no rotation, the high-resolution histogram is $H^*_{i,0} = h^*_i(r, \phi, \theta)$ and its coarse histogram is $H_{i,0}$; for frame $j$ with a rotation of $\alpha$° ($\alpha$ an integer), the high-resolution histogram is $H^*_{j,\alpha} = h^*_j(r, \phi, f(\theta, \alpha))$, with $f(\theta, \alpha) = (\theta + \alpha) \bmod 360°$, and its coarse histogram is $H_{j,\alpha}$.
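As an illustration of the four construction steps and the rotation search described above, the following Python sketch builds a normalised spherical histogram from a list of voxel centres and compares two frames over all 1° shifts in θ, taking the Euclidean distance between coarse histograms as the per-rotation measure. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the choice of z as the vertical axis, the radius bound `r_max`, and the bin counts are all hypothetical choices made for the example, and voxelisation of the mesh is assumed to have been done already.

```python
import math

def spherical_histogram(voxels, n_r, n_phi, n_theta, r_max):
    """Normalised histogram of voxel centres over (r, phi, theta) bins.

    Assumptions for this sketch: z is the vertical axis and r_max bounds
    the distance of any voxel from the centre of mass.
    """
    n = len(voxels)
    cx = sum(v[0] for v in voxels) / n
    cy = sum(v[1] for v in voxels) / n
    cz = sum(v[2] for v in voxels) / n
    hist = [[[0.0] * n_theta for _ in range(n_phi)] for _ in range(n_r)]
    for x, y, z in voxels:
        dx, dy, dz = x - cx, y - cy, z - cz
        r = math.sqrt(dx * dx + dy * dy + dz * dz)
        # Polar angle from the vertical axis, in [0, pi].
        phi = math.acos(max(-1.0, min(1.0, dz / r))) if r > 0 else 0.0
        # Azimuth about the vertical axis, in [0, 2*pi).
        theta = math.atan2(dy, dx) % (2 * math.pi)
        ir = min(int(r / r_max * n_r), n_r - 1)
        ip = min(int(phi / math.pi * n_phi), n_phi - 1)
        it = min(int(theta / (2 * math.pi) * n_theta), n_theta - 1)
        hist[ir][ip][it] += 1.0 / n  # step 4: normalise by voxel count
    return hist

def shift_and_rebin(hi, alpha, n_theta):
    """Shift a histogram with 360 theta bins by alpha degrees, then re-bin
    theta to n_theta coarse bins (n_theta must divide 360)."""
    width = 360 // n_theta
    out = []
    for shell in hi:
        out_shell = []
        for ring in shell:
            coarse = [0.0] * n_theta
            for t in range(360):
                coarse[t // width] += ring[(t + alpha) % 360]
            out_shell.append(coarse)
        out.append(out_shell)
    return out

def frame_dissimilarity(hi_i, hi_j, n_theta):
    """Minimum Euclidean distance between coarse histograms over all
    integer rotations of frame j about the vertical axis."""
    base = shift_and_rebin(hi_i, 0, n_theta)
    best = math.inf
    for alpha in range(360):
        cand = shift_and_rebin(hi_j, alpha, n_theta)
        dist = math.sqrt(sum((a - b) ** 2
                             for sa, sb in zip(base, cand)
                             for ra, rb in zip(sa, sb)
                             for a, b in zip(ra, rb)))
        best = min(best, dist)
    return best
```

In this sketch the high-resolution histogram is obtained by calling `spherical_histogram` with `n_theta=360`; shifting bin indices then stands in for rotating the mesh, which is the reason the text gives for building the fine histogram first.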
The similarity between frames $i$ and $j$ is computed as the minimum Euclidean distance between $H_{i,0}$ and $H_{j,\alpha}$,

$$s(i, j) = \min_{\alpha \in \{0, \ldots, 359\}} \left\{ \lVert H_{i,0} - H_{j,\alpha} \rVert \right\} \tag{2}$$

In this paper, we set $(\Delta r, \Delta\phi, \Delta\theta) = (10, 20, 40)$ for the coarse histogram, the optimal bin size reported for human shape similarity [8].

Figure 1. An example of a Surface Motion Graph. Double circles denote start and end key-frames. The higher-level graph represents the 3D video sequences and transitions; the lower-level graph represents all possible motion graphs between a particular start and end key-frame.

3.2. Transitions

In this work, we define transitions to maximize the temporal shape similarity according to the linear blend in Equation 1. Given two 3D video sequences $X = \{x_i\}$ and $Y = \{y_j\}$, the $(i, j)$ element of the frame-to-frame shape similarity matrix $S_{XY}$ is defined as

$$S_{XY}(i, j) = s(x_i, y_j) \tag{3}$$

where $s(x_i, y_j)$ is computed by Equation 2. Since the shape similarity is symmetric, $s(i, j) = s(j, i)$, the temporal shape similarity is computed as a linearly weighted average of the shape similarity for individual frames about the central frame of the window,

$$S'_{XY}(i, j, n) = \sum_{k=-\frac{n-1}{2}}^{\frac{n-1}{2}} S_{XY}(i + k, j + k) \cdot w(k) \tag{4}$$

where $w(k)$ is the normalised version of $w'(k)$, i.e. $w(k) = w'(k) / \sum_{k=-\frac{n-1}{2}}^{\frac{n-1}{2}} w'(k)$ and $w'(k) = \min\{k + \frac{n+1}{2},\, -k + \frac{n+1}{2}\}$. Note that $n$ is both the window size and the transition length; if $n$ is even, the central frame $(i, j)$ lies half way between frames. The transition between $X$ and $Y$ is the global minimum of $S'_{XY}(i, j, n)$ over all possible $i$, $j$, $n$,

$$(i_{opt}, j_{opt}, n_{opt}) = \arg\min_{i,j,n} \{ S'_{XY}(i, j, n) \} \tag{5}$$

Without blending, the concatenation is performed as a switch at the central frame, i.e. $x_{\lfloor i_{opt} \rfloor} \to y_{\lfloor j_{opt} \rfloor + 1}$.

3.3. Automatic Graph Construction

A surface motion graph is a two-level directed graph. In the higher level, each node represents a motion and each edge a directed transition.
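Returning briefly to the transition search of Section 3.2, the windowed measure of Equation 4 and the exhaustive search of Equation 5 can be sketched in Python as follows. This is a hedged illustration rather than the authors' code: the function names are hypothetical, `S` is assumed to be a precomputed frame-to-frame matrix (a list of rows) from Equation 3, and only odd window sizes are handled so that the window is centred on a frame.

```python
def window_weights(n):
    """Normalised triangular weights w(k) for k = -(n-1)/2 .. (n-1)/2.

    Assumption for this sketch: n is odd.
    """
    half = (n - 1) // 2
    raw = [min(k + (n + 1) / 2, -k + (n + 1) / 2)
           for k in range(-half, half + 1)]
    total = sum(raw)
    return [w / total for w in raw]

def temporal_dissimilarity(S, i, j, n):
    """Weighted average of S along the diagonal through (i, j), Equation 4."""
    half = (n - 1) // 2
    w = window_weights(n)
    return sum(S[i + k][j + k] * w[k + half] for k in range(-half, half + 1))

def best_transition(S, window_sizes):
    """Exhaustive search for the minimum of Equation 4, as in Equation 5.

    Returns (cost, i, j, n); windows that would run off the matrix are skipped.
    """
    best = None
    for n in window_sizes:
        half = (n - 1) // 2
        for i in range(half, len(S) - half):
            for j in range(half, len(S[0]) - half):
                cost = temporal_dissimilarity(S, i, j, n)
                if best is None or cost < best[0]:
                    best = (cost, i, j, n)
    return best
```

The weights peak at the central frame and fall off linearly towards the window ends, so a pair of frames scores well only when the frames around it also match, which is what makes the measure sensitive to motion as well as shape.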
In the lower level, each node represents a frame and each edge a sequence of frames connecting them. Figure 1 illustrates an example of a surface motion graph for four motions. Once the user has defined the higher-level graph and selected key-frames, the lower-level graph is generated automatically as follows:

1. Initialization: Insert all 3D video sequences as edges and their terminal frames as nodes to create a disconnected graph (Figure 1 upper-right).
2. Identify transitions: If there is an edge between 3D video sequences $X$ and $Y$ in the higher-level graph, the transition represented as $(i_{opt}, j_{opt}, n_{opt})$ is evaluated according to Equation 5.
3. Insert transitions and key-frames: The start and end of transitions and the key-frames, if they are not already in the graph, are inserted as nodes, breaking existing edges (3D video sub-sequences) into smaller ones. Transitions are inserted as edges to join either different 3D videos or different parts of the same sequence.

4. Motion Synthesis

Motion synthesis is performed by optimizing over all possible paths on the lower-level surface motion graph between user-selected key-frames to minimize the total transition cost whilst satisfying constraints on position and timing. Note that the optimization considers loops to allow repetition of cyclic motion.

4.1. Cost Function

The cost $C(F)$ of a path $F$ through the surface motion graph is defined as a combination of the total transition cost $C_s(F)$, location cost $C_d(F)$ and time cost $C_t(F)$,

$$C(F) = C_s(F) + w_d \cdot C_d(F) + w_t \cdot C_t(F) \tag{6}$$

where $w_d$ and $w_t$ are weights for the distance and time constraints respectively. We set $w_d = 1/0.3$ and $w_t = 1/10$, which equates an error of 30 centimetres in distance with an error of 10 frames in time [4]; the transition cost keeps a relative weight of 1.0 to emphasise smoothness.
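To make the effect of the weights in Equation 6 concrete, here is a small Python sketch that evaluates the combined cost from three cost terms supplied as plain numbers. The function and constant names are hypothetical; the only values taken from the text are $w_d = 1/0.3$ and $w_t = 1/10$, with distance in metres and time in frames.

```python
W_D = 1 / 0.3   # distance weight from the text: 0.3 m of error costs 1.0
W_T = 1 / 10    # time weight from the text: 10 frames of error cost 1.0

def combined_cost(c_s, c_d, c_t, w_d=W_D, w_t=W_T):
    """Equation 6: transition cost plus weighted distance and time costs."""
    return c_s + w_d * c_d + w_t * c_t

# A 0.3 m distance error and a 10 frame timing error contribute equally.
distance_only = combined_cost(0.0, 0.3, 0.0)
time_only = combined_cost(0.0, 0.0, 10.0)
```

Under these weights the two calls above give the same contribution (up to floating-point rounding), which is the sense in which the text equates 30 centimetres with 10 frames.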
Individual cost terms for transitions, distance and time are defined as follows.

Total Transition Cost $C_s(F)$ for a path $F$ is defined as the sum of dissimilarity over all transitions between the input 3D video sequences concatenated along the path,

$$C_s(F) = s(F) = \sum_{i=1}^{N_t} s_i \tag{7}$$

where $N_t$ denotes the total number of transitions and $s_i$ the filtered dissimilarity for the $i$th transition.

Distance Cost $C_d(F)$ for a path $F$ with $N_f$ frames is computed as the absolute difference between the user-specified target distance $d_T$ and the total travelled distance $d(F)$,

$$C_d(F) = |d(F) - d_T| \tag{8}$$

where $d(F) = \sum_{i=1}^{N_f - 1} \lVert c_{i+1} - c_i \rVert$, and $c_{i+1}$ and $c_i$ denote the projections of the mesh centroid at frames $i + 1$ and $i$ onto the ground along the vertical axis.

Time Cost $C_t(F)$ for a path $F$ with $N_f$ frames is evaluated as the absolute difference between the user-specified target time $t_T$ and the total travelled time $t(F)$,

$$C_t(F) = |t(F) - t_T| \tag{9}$$

where $t(F) = N_f$. Here, time is measured in frames and the captured 3D video sequences have a rate of 25 fps.

4.2. Path Optimization

In this section, we present an efficient approach to search for the optimal path that best satisfies the user-defined constraints. The optimal path $F_{opt}$ is found to minimize the combined cost $C(F)$ defined by Equation 6,

$$F_{opt} = \arg\min_{F} \{ C(F) \} \tag{10}$$

Enumerating all possible paths from the start to the end key-frame and evaluating the combined cost will give the global optimum. But when cycles appear in the graph, the number of paths may become infinite. To avoid this, instead of enumerating all paths, we enumerate all walks (paths without any loop) from the start to the end key-frame and all loops attached to each walk. The globally optimal path must be one of the compositions of a walk and its attached loops. Given $l = \{l_i\}$ and $n = \{n_i\}$, $i = 0, \ldots, N_l$, let $l_0$ denote a walk with $n_0 = 1$, $\{l_1, \ldots, l_{N_l}\}$ the attached loops and $\{n_1, \ldots, n_{N_l}\}$ the corresponding repetitions. A path $F$ can then be represented as $n \cdot l$ and the optimization becomes,

$$F_{opt} = n_{opt} \cdot l_{opt} = \arg\min_{n, l} \{ C(n \cdot l) \} \tag{11}$$

The transition, distance and time costs become,

$$C_d(F) = |n \cdot d(l) - d_T| \tag{12}$$

$$C_t(F) = |n \cdot t(l) - t_T| \tag{13}$$

$$C_s(F) = n \cdot s(l) \tag{14}$$

where $f(l) = \{f(l_0), \ldots, f(l_{N_l})\}$, $f = d$ or $t$ or $s$.

Once the surface motion graph is constructed, there may be more than one walk from the start to the end key-frame, say $N_k$ walks. For each walk $l_{k,0}$, the corresponding set $l_k = \{l_{k,0}, \ldots, l_{k,N_{k,l}}\}$ is determined, and the objective is to find an optimal set of repetitions $n_{k,opt}$ according to Equation 11,

$$n_{k,opt} = \arg\min_{n_k} \{ C(n_k \cdot l_k) \} \tag{15}$$

The index of the globally optimal walk and repetitions, $k_{opt}$, is then

$$k_{opt} = \arg\min_{k = 1, \ldots, N_k} \{ C(n_{k,opt} \cdot l_k) \} \tag{16}$$

Finally, the optimal path $F_{opt}$ is found as a composition of the optimal walk and loops $l_{opt} = l_{k_{opt}}$ together with the optimal repetitions $n_{opt} = n_{k_{opt},opt}$,

$$F_{opt} = n_{k_{opt},opt} \cdot l_{k_{opt}} \tag{17}$$

The decomposition of an arbitrary path into a walk and attached loops is described in Section 4.2.1, and the composition of a walk and attached loops back into a path in Section 4.2.3. The optimization of repetitions for a given walk and loops according to Equation 15 is solved as a set of Integer Linear Programming (ILP) problems in Section 4.2.2.

4.2.1 Graph Path Decomposition

Depth-First Search (DFS) is performed to decompose all possible paths between the user-selected start and end key-frames into walks and loops. Given an adjacency matrix adj for the graph and a source and sink (the start and end key-frames), the algorithm is implemented recursively; its pseudo code is presented in Algorithm 1. This gives a set of walks, each associated with a set of loops. Note that although loops may be nested such that one loop includes another, in our optimization all possible nested sub-loops are included, as illustrated in Figure 2.

Figure 2. Examples (a) and (b) of Graph Path Decomposition that consider nested sub-loops.
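The recursive decomposition of Section 4.2.1 can be sketched in Python as below. This is a minimal reading of the DFS described in the text and in Algorithm 1, with some assumptions of our own: the graph is given as an adjacency dictionary rather than a matrix, the function name is hypothetical, and loops are collected in a single shared list instead of being attached to individual walks.

```python
def find_walks_and_loops(adj, source, sink):
    """Enumerate loop-free walks from source to sink and the loops met on the way.

    adj maps each node to a list of its successor nodes.
    """
    walks, loops = [], []

    def visit(node, walk):
        walk = walk + [node]          # extend a copy of the current walk
        if node == sink:
            walks.append(walk)        # reached the end key-frame
            return
        for nxt in adj.get(node, []):
            if nxt not in walk:
                visit(nxt, walk)      # keep exploring loop-free extensions
            else:
                # nxt was already visited: the tail of the walk from nxt
                # onwards closes a loop back to nxt.
                loop = walk[walk.index(nxt):]
                if loop not in loops:
                    loops.append(loop)

    visit(source, [])
    return walks, loops

# Example: A -> B -> C with a back edge B -> A.
example_adj = {"A": ["B"], "B": ["C", "A"], "C": []}
example_walks, example_loops = find_walks_and_loops(example_adj, "A", "C")
```

For the small example graph, the search finds the single walk A, B, C and the loop A, B formed by the back edge; each loop found this way is a candidate for repetition in the optimization that follows.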
4.2.2 Integer Linear Programming

Optimization of Equation 15 is non-linear because of the absolute values in the distance and time costs (Equations 12 and 13). It can be converted into constrained Integer Linear Programming (ILP) sub-problems, one for each combination of signs of the two absolute-value terms. Given a walk and attached loops $l = \{l_i\}$, where $l_0$ denotes the walk and $n_0 = 1$, the corresponding repetitions $n = \{n_i\}$ are optimized as four independent ILP sub-problems, $p = 1, 2, 3, 4$, as follows,

$$\text{minimize} \quad C_p \cdot n_p \tag{18a}$$

$$\text{subject to} \quad n_{p,0} = 1 \tag{18b}$$

$$0 \le n_{p,i} \le +\infty, \ \text{integer} \tag{18c}$$

$$C_{d,p} \cdot n_p \ge 0 \tag{18d}$$

$$C_{t,p} \cdot n_p \ge 0 \tag{18e}$$

$C_{d,p}$, $C_{t,p}$, $C_{s,p}$ and $C_p$ are 1D vectors whose elements are computed as follows,

$$C_{d,p,i} = \mathrm{sign}_d(p) \cdot w_d \cdot (d(l_i) - d_T) \tag{19a}$$

$$C_{t,p,i} = \mathrm{sign}_t(p) \cdot w_t \cdot (t(l_i) - t_T) \tag{19b}$$

$$C_{s,p,i} = C_s(l_i) \tag{19c}$$

$$C_{p,i} = C_{d,p,i} + C_{t,p,i} + C_{s,p,i} \tag{19d}$$

where $\mathrm{sign}_d = (1, 1, -1, -1)$ and $\mathrm{sign}_t = (1, -1, 1, -1)$. For each sub-problem, $n_{p,opt}$ is solved efficiently using a standard ILP solver. The optimal number of repetitions of the loops for a given $l$ is then the one that achieves the minimum combined cost,

$$n_{opt} = \arg\min_{p = 1, 2, 3, 4} \{ C_p \cdot n_{p,opt} \} \tag{20}$$

4.2.3 Graph Path Composition

Once the optimal walk and loops $l_{opt}$ with repetitions $n_{opt}$ have been evaluated, a complete path can be composed as $F = n_{opt} \cdot l_{opt}$. The final motion sub-sequences are concatenated head-to-tail by matching the centroid of the mesh at transitions. However, some loops may connect to the walk only indirectly via other loops, e.g. $l_2$ connecting to $l_0$ via $l_1$ in Figure 2(b); if the via-loops have zero repetitions, the path cannot be composed. The feedback strategy presented in Algorithm 2 handles this case: when such isolation of loops occurs, a constraint is added to the ILP solver.

5. Results and Evaluation

3D character animations are synthesized from a publicly available database of 3D video [19] which comprises an actress (Roxanne) wearing three different costumes: a game character with shorts and t-shirt (Character1); a long flowing dress (Fashion1); and a shorter tight-fitting dress (Fashion2).
For each costume, periodic motions (walk, run, stagger), motion transitions (walk to run, run to walk, walk to stand, stand to walk) and other motions (hit, twirl) are included. The captured 3D video sequences are unstructured meshes with unknown temporal correspondence and different mesh connectivity at each time-frame. A mesh contains approximately 100k vertices and 200k triangles.

Surface motion graphs are automatically constructed from the 3D video sequences of the performer in the three different costumes. Optimization takes seconds for a given set of user-defined constraints on distance and time. Motions are concatenated without post-processing or blending to show the raw 3D video transitions. Synthesis results are presented in the accompanying videos. An example of selected frames from a synthesized motion for Fashion1, captured in a virtual camera view, is shown in Figure 3. These results demonstrate that the motion synthesis preserves the detailed clothing and hair dynamics in the captured 3D video sequences and does not produce unnatural movements at transitions.

Motion synthesis is evaluated for the three surface motion graphs, which represent potential transitions, with four pairs of key-frames for each costume as shown in Table 1. Evaluation is performed by synthesizing motions for target constraints on distance of 1-20 m in 1 m intervals and times of 1-40 s in 1 s intervals, giving 800 sequences for each key-frame pair and 9600 synthesized sequences in total. The maximum, minimum and root mean square errors over all synthesized sequences for distance moved and timing are presented in Table 1 for the sequences generated from each key-frame pair. This analysis shows that the maximum distance and timing errors are less than 1% of the target, indicating that the path optimization generates sequences which accurately satisfy the user-defined constraints. Smoothness cost is evaluated as a weighted average of the Hausdorff distance for overlapped individual frames at transitions.
The weights are set to decrease about the central frame in the same way as $w(k)$ in Equation 4. Computation times are given for an ILP solver together with a Matlab implementation of the synthesis framework running on a single-processor machine. The computation time is approximately constant with respect to the distance and timing constraints, as indicated by the low standard deviation.

Figure 3. Example meshes for synthesized motion from the 3D video database of Roxanne. Fashion1: Pose#1 → Twirl#85. Target traversing distance and time are 10 metres and 500 frames.

SMG: Key-frames              Smoothness (cm)   Distance error (m)      Time error (frame)   CPU time (sec.)
                             min     max       min      rms    max     min   rms    max     mean ± dev.
Character1:
  Stand#1 → Hit#45           4.24    19.79     0.0001   0.17   0.96    0     0.16   22      14.43 ± 7.05
  Stand#1 → Walk#16          4.24    26.80     0.0002   0.21   0.92    0     4.28   23      12.21 ± 4.74
  Walk#16 → Jog#13           4.24    26.80     0.0001   0.21   0.96    0     4.81   24      13.54 ± 4.98
  Jog#13 → Hit#45            4.24    25.76     0.0002   0.20   0.98    0     3.88   24      12.20 ± 3.42
Fashion1:
  Pose#1 → Twirl#85          5.82    20.54     0.0001   0.23   0.95    0     3.76   24      12.79 ± 2.92
  Pose#1 → Walk#15           5.82    20.54     0.0001   0.47   0.99    0     4.31   24       5.65 ± 1.46
  Walk#15 → WalkPose#37      4.23    13.09     0.0000   0.33   1.00    0     5.85   25      12.01 ± 4.85
  WalkPose#37 → Twirl#85     4.23    20.54     0.0001   0.29   1.00    0     4.77   25      10.10 ± 3.07
Fashion2:
  Pose#1 → Twirl#100         4.49    21.20     0.0000   0.23   0.95    0     7.74   25      12.87 ± 3.61
  Pose#1 → Walk#15           4.87    18.12     0.0008   0.39   0.98    0     8.78   25       7.09 ± 1.91
  Walk#15 → WalkPose#37      4.49    16.05     0.0002   0.36   1.00    0     8.24   25      14.95 ± 6.11
  WalkPose#37 → Twirl#100    4.49    18.12     0.0002   0.28   1.00    0     8.16   24      10.51 ± 2.77

Table 1. Evaluation for Roxanne. A 20 × 40 grid of targets (metres × seconds) is tested for each pair of key-frames shown in the first column.

6. Conclusion

In this paper, we presented a framework for concatenative human motion synthesis from 3D video sequences according to user-defined constraints on movement, position and timing. Transitions between 3D video sequences are identified without the requirement for temporal correspondence using 3D shape similarity over an adaptive temporal window. A surface motion graph is automatically constructed to represent potential transitions, both cross-transitions between different motion sequences and self-transitions within cyclic motion. Path optimization is performed between user-specified key-frames using a standard ILP solver to satisfy constraints on distance and timing, with repeated motions for loops in the graph. Results demonstrate that concatenative synthesis of novel sequences accurately satisfies the user constraints and produces motions which preserve the detailed non-rigid dynamics of clothing and loose hair. Linear blending of meshes to produce smooth transitions, which requires accurate dense correspondences of dynamic surfaces, will be addressed in future work. This approach greatly increases the flexibility in the reuse of 3D video sequences, allowing specification of high-level user constraints to produce novel complex 3D video sequences of motion.

Acknowledgements

This work was supported by EPSRC grant EP/E001351.

References

[1] E. Aguiar, C. Stoll, C. Theobalt, N. Ahmed, H.-P. Seidel, and S. Thrun. Performance capture from sparse multi-view video. ACM Trans. Graph., 27(3):1–10, 2008.
[2] N. Ahmed, C. Theobalt, C. Rossl, S. Thrun, and H.-P. Seidel. Dense correspondence finding for parametrization-free animation reconstruction from video. pages 1–8, June 2008.
[3] M. Ankerst, G. Kastenmüller, H.-P. Kriegel, and T. Seidl. 3D shape histograms for similarity search and classification in spatial databases. In SSD '99: Proceedings of the 6th International Symposium on Advances in Spatial Databases, pages 207–226, London, UK, 1999. Springer-Verlag.
[4] O. Arikan and D. A. Forsyth. Interactive motion generation from examples. ACM Trans. Graph., 21(3):483–490, 2002.
[5] C. Bregler, M. Covell, and M. Slaney. Video rewrite: driving visual speech with audio. In SIGGRAPH '97: Proceedings of the 24th annual conference on Computer graphics and interactive techniques, pages 353–360, New York, NY, USA, 1997. ACM Press/Addison-Wesley Publishing Co.
[6] E. Cosatto and H. Graf. Sample-based synthesis of photorealistic talking heads. In CA '98: Proceedings of the Computer Animation, page 103, Washington, DC, USA, 1998. IEEE Computer Society.
[7] P. Huang, J. Starck, and A. Hilton. A study of shape similarity for temporal surface sequences of people. In 3DIM '07: Proceedings of the Sixth International Conference on 3-D Digital Imaging and Modeling, pages 408–418, Washington, DC, USA, 2007. IEEE Computer Society.
[8] P. Huang, J. Starck, and A. Hilton. Temporal 3D shape matching. The Fourth European Conference on Visual Media Production (CVMP'07), pages 1–10, Nov. 2007.
[9] A. E. Johnson and M. Hebert. Using spin images for efficient object recognition in cluttered 3D scenes. IEEE Trans Pattern Anal Mach Intell, 21(5):433–449, 1999.
[10] M. Kazhdan, T. Funkhouser, and S. Rusinkiewicz. Rotation invariant spherical harmonic representation of 3D shape descriptors. SGP '03: Proceedings of the 2003 Eurographics/ACM SIGGRAPH symposium on Geometry processing, pages 156–164, 2003.
[11] L. Kovar, M. Gleicher, and F. Pighin. Motion graphs. In SIGGRAPH '02: Proceedings of the 29th annual conference on Computer graphics and interactive techniques, volume 21, pages 473–482, New York, NY, USA, July 2002. ACM Press.
[12] J. Lee, J. Chai, P. S. A. Reitsma, J. K. Hodgins, and N. S. Pollard. Interactive control of avatars animated with human motion data. ACM Trans. Graph., 21(3):491–500, 2002.
[13] Y. Li, T. Wang, and H. Y. Shum. Motion texture: a two-level statistical model for character motion synthesis. In SIGGRAPH '02: Proceedings of the 29th annual conference on Computer graphics and interactive techniques, volume 21, pages 465–472, New York, NY, USA, July 2002. ACM Press.
[14] R. Osada, T. Funkhouser, B. Chazelle, and D. Dobkin. Shape distributions. ACM Trans. Graph., 21(4):807–832, October 2002.
[15] P. S. A. Reitsma and N. S. Pollard. Evaluating motion graphs for character animation. ACM Trans. Graph., 26(4):18, 2007.
[16] A. Schödl and I. A. Essa. Controlled animation of video sprites. In SCA '02: Proceedings of the 2002 ACM SIGGRAPH/Eurographics symposium on Computer animation, pages 121–127, New York, NY, USA, 2002. ACM.
[17] J. Starck and A. Hilton. Spherical matching for temporal correspondence of non-rigid surfaces. IEEE International Conference on Computer Vision (ICCV), pages 1387–1394, 2005.
[18] J. Starck and A. Hilton. Correspondence labelling for wide-timeframe free-form surface matching. Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, pages 1–8, Oct. 2007.
[19] J. Starck and A. Hilton. Surface capture for performance-based animation. IEEE Computer Graphics and Applications, 27(3):21–31, 2007.
[20] J. Starck, G. Miller, and A. Hilton. Video-based character animation. In SCA '05: Proceedings of the 2005 ACM SIGGRAPH/Eurographics symposium on Computer animation, pages 49–58, New York, NY, USA, 2005. ACM Press.
[21] L. M. Tanco and A. Hilton. Realistic synthesis of novel human movements from a database of motion capture examples. In HUMO '00: Proceedings of the Workshop on Human Motion (HUMO'00), page 137, Washington, DC, USA, 2000. IEEE Computer Society.
[22] C. Theobalt, N. Ahmed, H. Lensch, M. Magnor, and H. P. Seidel. Seeing people in different light-joint shape, motion, and reflectance capture. IEEE Transactions on Visualization and Computer Graphics, 13(4):663–674, 2007.
[23] K. Varanasi, A. Zaharescu, E. Boyer, and R. P. Horaud. Temporal surface tracking using mesh evolution. In Proceedings of the Tenth European Conference on Computer Vision, volume Part II of LNCS, pages 30–43, Marseille, France, October 2008. Springer-Verlag.
[24] D. Vlasic, I. Baran, W. Matusik, and J. Popović. Articulated mesh animation from multi-view silhouettes. ACM Trans. Graph., 27(3):1–9, 2008.
[25] J. Wang and B. Bodenheimer. Synthesis and evaluation of linear motion transitions. ACM Trans. Graph., 27(1):1–15, 2008.
[26] J. Xu, T. Yamasaki, and K. Aizawa. Motion editing in 3D video database. In 3DPVT '06: Proceedings of the Third International Symposium on 3D Data Processing, Visualization, and Transmission, pages 472–479, Washington, DC, USA, 2006. IEEE Computer Society.

Appendix

Find all walks(adj, source, sink, walk)
    walk ← source;
    if source = sink then
        walks = walk;
        return;
    else
        walks = [];
        foreach node in source's neighbours do
            if node not in walk then
                new walks = Find all walks(adj, node, sink, walk);
                walks ← new walks;
            else
                loop = [];
                i = index of node in walk;
                loop ← walk(i:end);
                if loop not in loops then
                    loops ← loop;
                end
            end
        end
    end

Algorithm 1: Find all walks and loops by DFS

I := 1..Nl;
n ← Solve ILP;
k = 1;
while exist i ∈ I, ni ≥ 1 and li not connected to l0 do
    set a 2D matrix A according to n:
    foreach i ∈ I do
        if ni == 1 then
            ak,i = 1;
        else
            ak,i = 0;
        end
    end
    add constraints An ≥ 1 to ILP solver;
    n ← Solve ILP;
    k = k + 1;
end

Algorithm 2: Feedback strategy
