Search | arXiv e-print repository

Instant 3D Human Avatar Generation using Image Diffusion Models

Authors: Nikos Kolotouros, Thiemo Alldieck, Enric Corona, Eduard Gabriel Bazavan, Cristian Sminchisescu

Abstract: We present AvatarPopUp, a method for fast, high quality 3D human avatar generation from different input modalities, such as images and text prompts and with control over the generated pose and shape. The common theme is the use of diffusion-based image generation networks that are specialized for each particular task, followed by a 3D lifting network. We purposefully decouple the generation from the 3D modeling which allow us to leverage powerful image synthesis priors, trained on billions of text-image pairs. We fine-tune latent diffusion networks with additional image conditioning for image generation and back-view prediction, and to support qualitatively different multiple 3D hypotheses. Our partial fine-tuning approach allows to adapt the networks for each task without inducing catastrophic forgetting. In our experiments, we demonstrate that our method produces accurate, high-quality 3D avatars with diverse appearance that respect the multimodal text, image, and body control signals. Our approach can produce a 3D model in as few as 2 seconds, a four orders of magnitude speedup wrt the vast majority of existing methods, most of which solve only a subset of our tasks, and with fewer controls. AvatarPopUp enables applications that require the controlled 3D generation of human avatars at scale. The project website can be found at https://www.nikoskolot.com/avatarpopup/.

Submitted 12 July, 2024; v1 submitted 11 June, 2024; originally announced June 2024.

Comments: Camera-ready version

arXiv:2404.00485

DiffHuman: Probabilistic Photorealistic 3D Reconstruction of Humans

Authors: Akash Sengupta, Thiemo Alldieck, Nikos Kolotouros, Enric Corona, Andrei Zanfir, Cristian Sminchisescu

Abstract: We present DiffHuman, a probabilistic method for photorealistic 3D human reconstruction from a single RGB image. Despite the ill-posed nature of this problem, most methods are deterministic and output a single solution, often resulting in a lack of geometric detail and blurriness in unseen or uncertain regions. In contrast, DiffHuman predicts a probability distribution over 3D reconstructions conditioned on an input 2D image, which allows us to sample multiple detailed 3D avatars that are consistent with the image. DiffHuman is implemented as a conditional diffusion model that denoises pixel-aligned 2D observations of an underlying 3D shape representation. During inference, we may sample 3D avatars by iteratively denoising 2D renders of the predicted 3D representation. Furthermore, we introduce a generator neural network that approximates rendering with considerably reduced runtime (55x speed up), resulting in a novel dual-branch diffusion framework. Our experiments show that DiffHuman can produce diverse and detailed reconstructions for the parts of the person that are unseen or uncertain in the input image, while remaining competitive with the state-of-the-art when reconstructing visible surfaces.

Submitted 30 March, 2024; originally announced April 2024.

Comments: CVPR 2024

arXiv:2403.08764

VLOGGER: Multimodal Diffusion for Embodied Avatar Synthesis

Authors: Enric Corona, Andrei Zanfir, Eduard Gabriel Bazavan, Nikos Kolotouros, Thiemo Alldieck, Cristian Sminchisescu

Abstract: We propose VLOGGER, a method for audio-driven human video generation from a single input image of a person, which builds on the success of recent generative diffusion models. Our method consists of 1) a stochastic human-to-3d-motion diffusion model, and 2) a novel diffusion-based architecture that augments text-to-image models with both spatial and temporal controls. This supports the generation of high quality video of variable length, easily controllable through high-level representations of human faces and bodies. In contrast to previous work, our method does not require training for each person, does not rely on face detection and cropping, generates the complete image (not just the face or the lips), and considers a broad spectrum of scenarios (e.g. visible torso or diverse subject identities) that are critical to correctly synthesize humans who communicate. We also curate MENTOR, a new and diverse dataset with 3d pose and expression annotations, one order of magnitude larger than previous ones (800,000 identities) and with dynamic gestures, on which we train and ablate our main technical contributions. VLOGGER outperforms state-of-the-art methods in three public benchmarks, considering image quality, identity preservation and temporal consistency while also generating upper-body gestures. We analyze the performance of VLOGGER with respect to multiple diversity metrics, showing that our architectural choices and the use of MENTOR benefit training a fair and unbiased model at scale. Finally we show applications in video editing and personalization.

Submitted 13 March, 2024; originally announced March 2024.

Comments: Project web: https://enriccorona.github.io/vlogger/

arXiv:2312.14024

NICP: Neural ICP for 3D Human Registration at Scale

Authors: Riccardo Marin, Enric Corona, Gerard Pons-Moll

Abstract: Aligning a template to 3D human point clouds is a long-standing problem crucial for tasks like animation, reconstruction, and enabling supervised learning pipelines. Recent data-driven methods leverage predicted surface correspondences. However, they are not robust to varied poses, identities, or noise. In contrast, industrial solutions often rely on expensive manual annotations or multi-view capturing systems. Recently, neural fields have shown promising results. Still, their purely data-driven and extrinsic nature does not incorporate any guidance toward the target surface, often resulting in a trivial misalignment of the template registration. Currently, no method can be considered the standard for 3D Human registration, limiting the scalability of downstream applications. In this work, we propose a neural scalable registration method, NSR, a pipeline that, for the first time, generalizes and scales across thousands of shapes and more than ten different data sources. Our essential contribution is NICP, an ICP-style self-supervised task tailored to neural fields. NSR takes a few seconds, is self-supervised, and works out of the box on pre-trained neural fields. NSR combines NICP with a localized neural field trained on a large MoCap dataset, achieving the state of the art over public benchmarks. The release of our code and checkpoints provides a powerful tool useful for many downstream tasks like dataset alignments, cleaning, or asset animation.

Submitted 21 July, 2024; v1 submitted 21 December, 2023; originally announced December 2023.

Comments: Accepted at ECCV 2024

arXiv:2311.10591

FOCAL: A Cost-Aware Video Dataset for Active Learning

Authors: Kiran Kokilepersaud, Yash-Yee Logan, Ryan Benkert, Chen Zhou, Mohit Prabhushankar, Ghassan AlRegib, Enrique Corona, Kunjan Singh, Mostafa Parchami

Abstract: In this paper, we introduce the FOCAL (Ford-OLIVES Collaboration on Active Learning) dataset which enables the study of the impact of annotation-cost within a video active learning setting. Annotation-cost refers to the time it takes an annotator to label and quality-assure a given video sequence. A practical motivation for active learning research is to minimize annotation-cost by selectively labeling informative samples that will maximize performance within a given budget constraint. However, previous work in video active learning lacks real-time annotation labels for accurately assessing cost minimization and instead operates under the assumption that annotation-cost scales linearly with the amount of data to annotate. This assumption does not take into account a variety of real-world confounding factors that contribute to a nonlinear cost such as the effect of an assistive labeling tool and the variety of interactions within a scene such as occluded objects, weather, and motion of objects. FOCAL addresses this discrepancy by providing real annotation-cost labels for 126 video sequences across 69 unique city scenes with a variety of weather, lighting, and seasonal conditions. We also introduce a set of conformal active learning algorithms that take advantage of the sequential structure of video data in order to achieve a better trade-off between annotation-cost and performance while also reducing floating point operations (FLOPS) overhead by at least 77.67%. We show how these approaches better reflect how annotations on videos are done in practice through a sequence selection framework. We further demonstrate the advantage of these approaches by introducing two performance-cost metrics and show that the best conformal active learning method is cheaper than the best traditional active learning method by 113 hours.

Submitted 17 November, 2023; originally announced November 2023.

Comments: This paper was accepted as a main conference paper at the IEEE International Conference on Big Data

arXiv:2302.12018

Gaussian Switch Sampling: A Second Order Approach to Active Learning

Authors: Ryan Benkert, Mohit Prabhushankar, Ghassan AlRegib, Armin Pacharmi, Enrique Corona

Abstract: In active learning, acquisition functions define informativeness directly on the representation position within the model manifold. However, for most machine learning models (in particular neural networks) this representation is not fixed due to the training pool fluctuations in between active learning rounds. Therefore, several popular strategies are sensitive to experiment parameters (e.g. architecture) and do not consider model robustness to out-of-distribution settings. To alleviate this issue, we propose a grounded second-order definition of information content and sample importance within the context of active learning. Specifically, we define importance by how often a neural network "forgets" a sample during training - artifacts of second order representation shifts. We show that our definition produces highly accurate importance scores even when the model representations are constrained by the lack of training data. Motivated by our analysis, we develop Gaussian Switch Sampling (GauSS). We show that GauSS is setup agnostic and robust to anomalous distributions with exhaustive experiments on three in-distribution benchmarks, three out-of-distribution benchmarks, and three different architectures. We report an improvement of up to 5% when compared against four popular query strategies.

Submitted 16 February, 2023; originally announced February 2023.

arXiv:2212.06820

Structured 3D Features for Reconstructing Controllable Avatars

Authors: Enric Corona, Mihai Zanfir, Thiemo Alldieck, Eduard Gabriel Bazavan, Andrei Zanfir, Cristian Sminchisescu

Abstract: We introduce Structured 3D Features, a model based on a novel implicit 3D representation that pools pixel-aligned image features onto dense 3D points sampled from a parametric, statistical human mesh surface. The 3D points have associated semantics and can move freely in 3D space. This allows for optimal coverage of the person of interest, beyond just the body shape, which in turn, additionally helps modeling accessories, hair, and loose clothing. Owing to this, we present a complete 3D transformer-based attention framework which, given a single image of a person in an unconstrained pose, generates an animatable 3D reconstruction with albedo and illumination decomposition, as a result of a single end-to-end model, trained semi-supervised, and with no additional postprocessing. We show that our S3F model surpasses the previous state-of-the-art on various tasks, including monocular 3D reconstruction, as well as albedo and shading estimation. Moreover, we show that the proposed methodology allows novel view synthesis, relighting, and re-posing the reconstruction, and can naturally be extended to handle multiple input images (e.g. different views of a person, or the same view, in different poses, in video). Finally, we demonstrate the editing capabilities of our model for 3D virtual try-on applications.

Submitted 15 April, 2023; v1 submitted 13 December, 2022; originally announced December 2022.

Comments: Accepted at CVPR 2023. Project page: https://enriccorona.github.io/s3f/, Video: https://www.youtube.com/watch?v=mcZGcQ6L-2s

arXiv:2210.08399

Tensor-Train Compression of Discrete Element Method Simulation Data

Authors: Saibal De, Eduardo Corona, Paramsothy Jayakumar, Shravan Veerapaneni

Abstract: We propose a framework for discrete scientific data compression based on the tensor-train (TT) decomposition. Our approach is tailored to handle unstructured output data from discrete element method (DEM) simulations, demonstrating its effectiveness in compressing both raw (e.g. particle position and velocity) and derived (e.g. stress and strain) datasets. We show that geometry-driven "tensorization" coupled with the TT decomposition (known as quantized TT) yields a hierarchical compression scheme, achieving high compression ratios for key variables in these DEM datasets.

Submitted 15 October, 2022; originally announced October 2022.

arXiv:2207.10758

DEVIANT: Depth EquiVarIAnt NeTwork for Monocular 3D Object Detection

Authors: Abhinav Kumar, Garrick Brazil, Enrique Corona, Armin Parchami, Xiaoming Liu

Abstract: Modern neural networks use building blocks such as convolutions that are equivariant to arbitrary 2D translations. However, these vanilla blocks are not equivariant to arbitrary 3D translations in the projective manifold. Even then, all monocular 3D detectors use vanilla blocks to obtain the 3D coordinates, a task for which the vanilla blocks are not designed for. This paper takes the first step towards convolutions equivariant to arbitrary 3D translations in the projective manifold. Since the depth is the hardest to estimate for monocular detection, this paper proposes Depth EquiVarIAnt NeTwork (DEVIANT) built with existing scale equivariant steerable blocks. As a result, DEVIANT is equivariant to the depth translations in the projective manifold whereas vanilla networks are not. The additional depth equivariance forces the DEVIANT to learn consistent depth estimates, and therefore, DEVIANT achieves state-of-the-art monocular 3D detection results on KITTI and Waymo datasets in the image-only category and performs competitively to methods using extra information. Moreover, DEVIANT works better than vanilla networks in cross-dataset evaluation. Code and models at https://github.com/abhi1kumar/DEVIANT

Submitted 21 July, 2022; originally announced July 2022.

Comments: ECCV 2022

arXiv:2205.06254

Learned Vertex Descent: A New Direction for 3D Human Model Fitting

Authors: Enric Corona, Gerard Pons-Moll, Guillem Alenyà, Francesc Moreno-Noguer

Abstract: We propose a novel optimization-based paradigm for 3D human model fitting on images and scans. In contrast to existing approaches that directly regress the parameters of a low-dimensional statistical body model (e.g. SMPL) from input images, we train an ensemble of per-vertex neural fields network. The network predicts, in a distributed manner, the vertex descent direction towards the ground truth, based on neural features extracted at the current vertex projection. At inference, we employ this network, dubbed LVD, within a gradient-descent optimization pipeline until its convergence, which typically occurs in a fraction of a second even when initializing all vertices into a single point. An exhaustive evaluation demonstrates that our approach is able to capture the underlying body of clothed people with very different body shapes, achieving a significant improvement compared to state-of-the-art. LVD is also applicable to 3D model fitting of humans and hands, for which we show a significant improvement to the SOTA with a much simpler and faster method.

Submitted 19 July, 2022; v1 submitted 12 May, 2022; originally announced May 2022.

Comments: Project page: https://www.iri.upc.edu/people/ecorona/lvd/

Journal ref: ECCV 2022

arXiv:2204.01695

LISA: Learning Implicit Shape and Appearance of Hands

Authors: Enric Corona, Tomas Hodan, Minh Vo, Francesc Moreno-Noguer, Chris Sweeney, Richard Newcombe, Lingni Ma

Abstract: This paper proposes a do-it-all neural model of human hands, named LISA. The model can capture accurate hand shape and appearance, generalize to arbitrary hand subjects, provide dense surface correspondences, be reconstructed from images in the wild and easily animated. We train LISA by minimizing the shape and appearance losses on a large set of multi-view RGB image sequences annotated with coarse 3D poses of the hand skeleton. For a 3D point in the hand local coordinate, our model predicts the color and the signed distance with respect to each hand bone independently, and then combines the per-bone predictions using predicted skinning weights. The shape, color and pose representations are disentangled by design, allowing to estimate or animate only selected parameters. We experimentally demonstrate that LISA can accurately reconstruct a dynamic hand from monocular or multi-view sequences, achieving a noticeably higher quality of reconstructed hand shapes compared to baseline approaches. Project page: https://www.iri.upc.edu/people/ecorona/lisa/.

Submitted 4 April, 2022; originally announced April 2022.

Comments: Published at CVPR 2022

arXiv:2201.02017

Enhancing Egocentric 3D Pose Estimation with Third Person Views

Authors: Ameya Dhamanaskar, Mariella Dimiccoli, Enric Corona, Albert Pumarola, Francesc Moreno-Noguer

Abstract: In this paper, we propose a novel approach to enhance the 3D body pose estimation of a person computed from videos captured from a single wearable camera. The key idea is to leverage high-level features linking first- and third-views in a joint embedding space. To learn such embedding space we introduce First2Third-Pose, a new paired synchronized dataset of nearly 2,000 videos depicting human activities captured from both first- and third-view perspectives. We explicitly consider spatial- and motion-domain features, combined using a semi-Siamese architecture trained in a self-supervised fashion. Experimental results demonstrate that the joint multi-view embedded space learned with our dataset is useful to extract discriminatory features from arbitrary single-view egocentric videos, without needing domain adaptation nor knowledge of camera parameters. We achieve significant improvement of egocentric 3D body pose estimation performance on two unconstrained datasets, over three supervised state-of-the-art approaches. Our dataset and code will be available for research purposes.

Submitted 15 June, 2022; v1 submitted 6 January, 2022; originally announced January 2022.

arXiv:2109.14065

Localization of a Smart Infrastructure Fisheye Camera in a Prior Map for Autonomous Vehicles

Authors: Subodh Mishra, Armin Parchami, Enrique Corona, Punarjay Chakravarty, Ankit Vora, Devarth Parikh, Gaurav Pandey

Abstract: This work presents a technique for localization of a smart infrastructure node, consisting of a fisheye camera, in a prior map. These cameras can detect objects that are outside the line of sight of the autonomous vehicles (AV) and send that information to AVs using V2X technology. However, in order for this information to be of any use to the AV, the detected objects should be provided in the reference frame of the prior map that the AV uses for its own navigation. Therefore, it is important to know the accurate pose of the infrastructure camera with respect to the prior map. Here we propose to solve this localization problem in two steps, (i) we perform feature matching between perspective projection of fisheye image and bird's eye view (BEV) satellite imagery from the prior map to estimate an initial camera pose, (ii) we refine the initialization by maximizing the Mutual Information (MI) between intensity of pixel values of fisheye image and reflectivity of 3D LiDAR points in the map data. We validate our method on simulated data and also present results with real world data.

Submitted 28 September, 2021; originally announced September 2021.

Comments: Submitted to ICRA 2022

arXiv:2104.14068

Fast and accurate solvers for simulating Janus particle suspensions in Stokes flow

Authors: Ryan Kohl, Eduardo Corona, Vani Cheruvu, Shravan Veerapaneni

Abstract: We present a novel computational framework for simulating suspensions of rigid spherical Janus particles in Stokes flow. We show that long-range Janus particle interactions for a wide array of applications may be resolved using fast, spectrally accurate boundary integral methods tailored to polydisperse suspensions of spherical particles. These are incorporated into our rigid body Stokes platform. Our approach features the use of spherical harmonic expansions for spectrally accurate integral operator evaluation, complementarity-based collision resolution, and optimal O(n) scaling with the number of particles when accelerated via fast summation techniques. We demonstrate the flexibility of our platform through three key examples of Janus particle systems prominent in biomedical applications: amphiphilic, bipolar electric and phoretic particles. We formulate Janus particle interactions in boundary integral form and showcase characteristic self-assembly and complex collective behavior for each particle type.

Submitted 28 April, 2021; originally announced April 2021.

arXiv:2103.07637

AIR4Children: Artificial Intelligence and Robotics for Children

Authors: Rocio Montenegro, Elva Corona, Donato Badillo-Perez, Angel Mandujano, Leticia Vazquez, Dago Cruz, Miguel Xochicale

Abstract: We introduce AIR4Children, Artificial Intelligence for Children, as a way to (a) tackle aspects for inclusion, accessibility, transparency, equity, fairness and participation and (b) to create affordable child-centred materials in AI and Robotics (AIR). We present current challenges and opportunities for a child-centred approaches for AIR. Similarly, we touch on open-sourced software and hardware technologies to make a more inclusive, affordable and fair participation of children in areas of AIR. Then, we describe the avenues that AIR4Children can take with the development of open-sourced software and hardware based on our initial pilots and experiences. Similarly, we propose to follow the philosophy of Montessori education to help children to not only develop computational thinking but also to internalise new concepts and learning skills through activities of movement and repetition. Finally, we conclude with the opportunities of our work and mainly we pose the future work of putting in practice what is proposed here to evaluate the potential impact on AIR to children, instructors, parents and their community.

Submitted 13 March, 2021; originally announced March 2021.

arXiv:2103.06871

SMPLicit: Topology-aware Generative Model for Clothed People

Authors: Enric Corona, Albert Pumarola, Guillem Alenyà, Gerard Pons-Moll, Francesc Moreno-Noguer

Abstract: In this paper we introduce SMPLicit, a novel generative model to jointly represent body pose, shape and clothing geometry. In contrast to existing learning-based approaches that require training specific models for each type of garment, SMPLicit can represent in a unified manner different garment topologies (e.g. from sleeveless tops to hoodies and to open jackets), while controlling other properties like the garment size or tightness/looseness. We show our model to be applicable to a large variety of garments including T-shirts, hoodies, jackets, shorts, pants, skirts, shoes and even hair. The representation flexibility of SMPLicit builds upon an implicit model conditioned with the SMPL human body parameters and a learnable latent space which is semantically interpretable and aligned with the clothing attributes. The proposed model is fully differentiable, allowing for its use into larger end-to-end trainable systems. In the experimental section, we demonstrate SMPLicit can be readily used for fitting 3D scans and for 3D reconstruction in images of dressed people. In both cases we are able to go beyond state of the art, by retrieving complex garment geometries, handling situations with multiple clothing layers and providing a tool for easy outfit editing. To stimulate further research in this direction, we will make our code and model publicly available at http://www.iri.upc.edu/people/ecorona/smplicit/.

Submitted 2 April, 2021; v1 submitted 11 March, 2021; originally announced March 2021.

Comments: Accepted at CVPR 2021

arXiv:2103.02743

Efficient data-driven encoding of scene motion using Eccentricity

Authors: Bruno Costa, Enrique Corona, Mostafa Parchami, Gint Puskorius, Dimitar Filev

Abstract: This paper presents a novel approach of representing dynamic visual scenes with static maps generated from video/image streams. Such representation allows easy visual assessment of motion in dynamic environments. These maps are 2D matrices calculated recursively, in a pixel-wise manner, that is based on the recently introduced concept of Eccentricity data analysis. Eccentricity works as a metric of a discrepancy between a particular pixel of an image and its normality model, calculated in terms of mean and variance of past readings of the same spatial region of the image. While Eccentricity maps carry temporal information about the scene, actual images do not need to be stored nor processed in batches. Rather, all the calculations are done recursively, based on a small amount of statistical information stored in memory, thus resulting in a very computationally efficient (processor- and memory-wise) method. The list of potential applications includes video-based activity recognition, intent recognition, object tracking, video description, and so on.

Submitted 3 March, 2021; originally announced March 2021.
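
The recursive, per-pixel computation described in the Eccentricity abstract above can be illustrated with a small sketch. It keeps only a running mean and variance per pixel and scores each new reading against them. The class name `EccentricityMap`, the Welford-style variance update, and the score `1/k + (x - mean)^2 / (k * var)` are assumptions made for illustration; the paper's exact definition may differ.

```python
from typing import List


class EccentricityMap:
    """Running per-pixel mean/variance with an eccentricity-style score.

    Hypothetical helper: frames are flat lists of pixel intensities, and
    no past frame is stored, only the per-pixel statistics.
    """

    def __init__(self, n_pixels: int) -> None:
        self.count = 0
        self.mean = [0.0] * n_pixels
        self.var = [0.0] * n_pixels  # population variance per pixel

    def update(self, frame: List[float]) -> List[float]:
        """Fold one frame into the statistics and return per-pixel scores."""
        self.count += 1
        k = self.count
        scores = []
        for i, x in enumerate(frame):
            delta = x - self.mean[i]
            self.mean[i] += delta / k
            # Welford-style recursive update of the population variance.
            self.var[i] += (delta * (x - self.mean[i]) - self.var[i]) / k
            # Assumed score: 1/k plus the normalized squared deviation.
            score = 1.0 / k
            if self.var[i] > 0.0:
                score += (x - self.mean[i]) ** 2 / (k * self.var[i])
            scores.append(score)
        return scores


if __name__ == "__main__":
    emap = EccentricityMap(n_pixels=2)
    for frame in ([1.0, 7.0], [1.0, 7.0], [1.0, 7.0], [5.0, 7.0]):
        scores = emap.update(frame)
    print(scores)  # the pixel that changed scores higher than the static one
```

In this toy run the first pixel jumps from 1 to 5 on the fourth frame and receives a score of 1.0, while the unchanged second pixel stays at the baseline 1/k = 0.25.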

arXiv:2012.09696 [pdf, other]

Multi-FinGAN: Generative Coarse-To-Fine Sampling of Multi-Finger Grasps

Authors: Jens Lundell, Enric Corona, Tran Nguyen Le, Francesco Verdoja, Philippe Weinzaepfel, Gregory Rogez, Francesc Moreno-Noguer, Ville Kyrki

Abstract: While there exist many methods for manipulating rigid objects with parallel-jaw grippers, grasping with multi-finger robotic hands remains a quite unexplored research topic. Reasoning and planning collision-free trajectories on the additional degrees of freedom of several fingers represents an important challenge that, so far, involves computationally costly and slow processes. In this work, we present Multi-FinGAN, a fast generative multi-finger grasp sampling method that synthesizes high quality grasps directly from RGB-D images in about a second. We achieve this by training in an end-to-end fashion a coarse-to-fine model composed of a classification network that distinguishes grasp types according to a specific taxonomy and a refinement network that produces refined grasp poses and joint angles. We experimentally validate and benchmark our method against a standard grasp-sampling method on 790 grasps in simulation and 20 grasps on a real Franka Emika Panda. All experimental results using our method show consistent improvements both in terms of grasp quality metrics and grasp success rate. Remarkably, our approach is up to 20-30 times faster than the baseline, a significant improvement that opens the door to feedback-based grasp re-planning and task informative grasping. Code is available at https://irobotics.aalto.fi/multi-fingan/.

Submitted 15 March, 2021; v1 submitted 17 December, 2020; originally announced December 2020.

Comments: Accepted to IEEE Conference on Robotics and Automation 2021 (ICRA). Code is available at https://irobotics.aalto.fi/multi-fingan/

arXiv:2011.13961 [pdf, other]

D-NeRF: Neural Radiance Fields for Dynamic Scenes

Authors: Albert Pumarola, Enric Corona, Gerard Pons-Moll, Francesc Moreno-Noguer

Abstract: Neural rendering techniques combining machine learning with geometric reasoning have arisen as one of the most promising approaches for synthesizing novel views of a scene from a sparse set of images. Among these, Neural Radiance Fields (NeRF) stands out: it trains a deep network to map 5D input coordinates (representing spatial location and viewing direction) into a volume density and view-dependent emitted radiance. However, despite achieving an unprecedented level of photorealism on the generated images, NeRF is only applicable to static scenes, where the same spatial location can be queried from different images. In this paper we introduce D-NeRF, a method that extends neural radiance fields to a dynamic domain, making it possible to reconstruct and render novel images of objects under rigid and non-rigid motions from a single camera moving around the scene. For this purpose we consider time as an additional input to the system, and split the learning process in two main stages: one that encodes the scene into a canonical space and another that maps this canonical representation into the deformed scene at a particular time. Both mappings are simultaneously learned using fully-connected networks. Once the networks are trained, D-NeRF can render novel images, controlling both the camera view and the time variable, and thus, the object movement. We demonstrate the effectiveness of our approach on scenes with objects under rigid, articulated and non-rigid motions. Code, model weights and the dynamic scenes dataset will be released.

Submitted 27 November, 2020; originally announced November 2020.

arXiv:2010.05302 [pdf, other]

PI-Net: Pose Interacting Network for Multi-Person Monocular 3D Pose Estimation

Authors: Wen Guo, Enric Corona, Francesc Moreno-Noguer, Xavier Alameda-Pineda

Abstract: Recent literature addressed the monocular 3D pose estimation task very satisfactorily. In these studies, different persons are usually treated as independent pose instances to estimate. However, in many every-day situations, people are interacting, and the pose of an individual depends on the pose of his/her interactees. In this paper, we investigate how to exploit this dependency to enhance current - and possibly future - deep networks for 3D monocular pose estimation. Our pose interacting network, or PI-Net, inputs the initial pose estimates of a variable number of interactees into a recurrent architecture used to refine the pose of the person-of-interest. Evaluating such a method is challenging due to the limited availability of public annotated multi-person 3D human pose datasets. We demonstrate the effectiveness of our method in the MuPoTS dataset, setting the new state-of-the-art on it. Qualitative results on other multi-person datasets (for which 3D pose ground-truth is not available) showcase the proposed PI-Net. PI-Net is implemented in PyTorch and the code will be made available upon acceptance of the paper.

Submitted 11 October, 2020; originally announced October 2020.

Comments: Accepted at WACV 2021

arXiv:1909.06623 [pdf, other]

doi 10.1016/j.jcp.2020.109524

A scalable computational platform for particulate Stokes suspensions

Authors: Wen Yan, Eduardo Corona, Dhairya Malhotra, Shravan Veerapaneni, Michael Shelley

Abstract: We describe a computational framework for simulating suspensions of rigid particles in Newtonian Stokes flow. One central building block is a collision-resolution algorithm that overcomes the numerical constraints arising from particle collisions. This algorithm extends the well-known complementarity method for non-smooth multi-body dynamics to resolve collisions in dense rigid body suspensions. This approach formulates the collision resolution problem as a linear complementarity problem with geometric 'non-overlapping' constraints imposed at each timestep. It is then reformulated as a constrained quadratic programming problem and the Barzilai-Borwein projected gradient descent method is applied for its solution. This framework is designed to be applicable for any convex particle shape, e.g., spheres and spherocylinders, and applicable to any Stokes mobility solver, including the Rotne-Prager-Yamakawa approximation, Stokesian Dynamics, and PDE solvers (e.g., boundary integral and immersed boundary methods). In particular, this method imposes Newton's Third Law and records the entire contact network. Further, we describe a fast, parallel, and spectrally-accurate boundary integral method tailored for spherical particles, capable of resolving lubrication effects. We show weak and strong parallel scalings up to $8\times 10^4$ particles with approximately $4\times 10^7$ degrees of freedom on $1792$ cores. We demonstrate the versatility of this framework with several examples, including sedimentation of particle clusters, and active matter systems composed of ensembles of particles driven to rotate.

Submitted 15 May, 2020; v1 submitted 14 September, 2019; originally announced September 2019.

arXiv:1904.03419 [pdf, other]

Context-aware Human Motion Prediction

Authors: Enric Corona, Albert Pumarola, Guillem Alenyà, Francesc Moreno-Noguer

Abstract: The problem of predicting human motion given a sequence of past observations is at the core of many applications in robotics and computer vision. Current state-of-the-art methods formulate this problem as a sequence-to-sequence task, in which a history of 3D skeletons feeds a Recurrent Neural Network (RNN) that predicts future movements, typically in the order of 1 to 2 seconds. However, one aspect that has been overlooked so far is the fact that human motion is inherently driven by interactions with objects and/or other humans in the environment. In this paper, we explore this scenario using a novel context-aware motion prediction architecture. We use a semantic-graph model where the nodes parameterize the human and objects in the scene and the edges their mutual interactions. These interactions are iteratively learned through a graph attention layer, fed with the past observations, which now include both object and human body motions. Once this semantic graph is learned, we inject it into a standard RNN to predict future movements of the humans and objects. We consider two variants of our architecture, either freezing the contextual interactions in the future or updating them. A thorough evaluation on the "Whole-Body Human Motion Database" shows that in both cases, our context-aware networks clearly outperform baselines in which the context information is not considered.

Submitted 23 March, 2020; v1 submitted 6 April, 2019; originally announced April 2019.

Comments: Accepted at CVPR20

arXiv:1810.05780 [pdf, other]

Pose Estimation for Objects with Rotational Symmetry

Authors: Enric Corona, Kaustav Kundu, Sanja Fidler

Abstract: Pose estimation is a widely explored problem, enabling many robotic tasks such as grasping and manipulation. In this paper, we tackle the problem of pose estimation for objects that exhibit rotational symmetry, which are common in man-made and industrial environments. In particular, our aim is to infer poses for objects not seen at training time, but whose 3D CAD models are available at test time. Previous work has tackled this problem by learning to compare captured views of real objects with the rendered views of their 3D CAD models, by embedding them in a joint latent space using neural networks. We show that sidestepping the issue of symmetry in this scenario during training leads to poor performance at test time. We propose a model that reasons about rotational symmetry during training by having access to only a small set of symmetry-labeled objects, while exploiting a large collection of unlabeled CAD models. We demonstrate that our approach significantly outperforms a naively trained neural network on a new pose dataset containing images of tools and hardware.

Submitted 12 October, 2018; originally announced October 2018.

Comments: Accepted at IROS 2018. More details available at http://www.cs.utoronto.ca/~ecorona/symmetry_pose_estimation

arXiv:1808.02558 [pdf, other]

Tensor Train accelerated solvers for nonsmooth rigid body dynamics

Authors: Eduardo Corona, David Gorsich, Paramsothy Jayakumar, Shravan Veerapaneni

Abstract: In the last two decades, increased need for high-fidelity simulations of the time evolution and propagation of forces in granular media has spurred renewed interest in discrete element method (DEM) modeling of frictional contact. Force penalty methods, while economic and accessible, introduce artificial stiffness, requiring small time steps to retain numerical stability. Optimization-based methods, which enforce contacts geometrically through complementarity constraints, allow the use of larger time steps at the expense of solving a nonlinear complementarity problem (NCP) at each time step. We review the latest efforts to produce solvers for this NCP, focusing on its relaxation to a cone complementarity problem (CCP) and solution via an equivalent quadratic optimization problem with conic constraints. We distinguish between linearly convergent first order methods and second order methods, which gain quadratic convergence and more robust performance at the expense of the solution of large sparse linear systems. We propose a novel acceleration for the solution of Newton step linear systems in second order methods using low-rank compression based fast direct solvers. We use the Quantized Tensor Train (QTT) decomposition to produce efficient approximate representations of the system matrix and its inverse. This provides a robust framework to accelerate its solution in a direct or a preconditioned iterative method. In a number of numerical tests, we demonstrate that this approach displays sublinear scaling of precomputation costs, may be efficiently updated across Newton iterations as well as across time steps, and leads to a fast, optimal complexity solution of the Newton step. This allows our method to gain order-of-magnitude speedups over state-of-the-art preconditioning techniques for moderate to large-scale systems, mitigating the computational bottleneck of second order methods.

Submitted 7 August, 2018; originally announced August 2018.

Comments: Submitted to the Journal Applied Mechanics Reviews (ASME) (invited article)

arXiv:1707.06551 [pdf, other]

doi 10.1016/j.jcp.2018.02.017

Boundary integral equation analysis for suspension of spheres in Stokes flow

Authors: Eduardo Corona, Shravan Veerapaneni

Abstract: We show that the standard boundary integral operators, defined on the unit sphere, for the Stokes equations diagonalize on a specific set of vector spherical harmonics and provide formulas for their spectra. We also derive analytical expressions for evaluating the operators away from the boundary. When two particles are located close to each other, we use a truncated series expansion to compute the hydrodynamic interaction. On the other hand, we use the standard spectrally accurate quadrature scheme to evaluate smooth integrals on the far-field, and accelerate the resulting discrete sums using the fast multipole method (FMM). We employ this discretization scheme to analyze several boundary integral formulations of interest including those arising in porous media flow, active matter and magneto-hydrodynamics of rigid particles. We provide numerical results verifying the accuracy and scaling of their evaluation.

Submitted 9 February, 2018; v1 submitted 18 July, 2017; originally announced July 2017.

arXiv:1606.07428 [pdf, other]

doi 10.1016/j.jcp.2016.12.018

An integral equation formulation for rigid bodies in Stokes flow in three dimensions

Authors: Eduardo Corona, Leslie Greengard, Manas Rachh, Shravan Veerapaneni

Abstract: We present a new derivation of a boundary integral equation (BIE) for simulating the three-dimensional dynamics of arbitrarily-shaped rigid particles of genus zero immersed in a Stokes fluid, on which are prescribed forces and torques. Our method is based on a single-layer representation and leads to a simple second-kind integral equation. It avoids the use of auxiliary sources within each particle that play a role in some classical formulations. We use a spectrally accurate quadrature scheme to evaluate the corresponding layer potentials, so that only a small number of spatial discretization points per particle are required. The resulting discrete sums are computed in $\mathcal{O}(n)$ time, where $n$ denotes the number of particles, using the fast multipole method (FMM). The particle positions and orientations are updated by a high-order time-stepping scheme. We illustrate the accuracy, conditioning and scaling of our solvers with several numerical examples.

Submitted 22 June, 2016; originally announced June 2016.

arXiv:1511.06029 [pdf, other]

A Tensor-Train accelerated solver for integral equations in complex geometries

Authors: Eduardo Corona, Abtin Rahimian, Denis Zorin

Abstract: We present a framework using the Quantized Tensor Train (QTT) decomposition to accurately and efficiently solve volume and boundary integral equations in three dimensions. We describe how the QTT decomposition can be used as a hierarchical compression and inversion scheme for matrices arising from the discretization of integral equations. For a broad range of problems, computational and storage costs of the inversion scheme are extremely modest $O(\log N)$ and once the inverse is computed, it can be applied in $O(N \log N)$. We analyze the QTT ranks for hierarchically low rank matrices and discuss its relationship to commonly used hierarchical compression techniques such as FMM and HSS. We prove that the QTT ranks are bounded for translation-invariant systems and argue that this behavior extends to non-translation invariant volume and boundary integrals. For volume integrals, the QTT decomposition provides an efficient direct solver requiring significantly less memory compared to other fast direct solvers. We present results demonstrating the remarkable performance of the QTT-based solver when applied to both translation and non-translation invariant volume integrals in 3D. For boundary integral equations, we demonstrate that using a QTT decomposition to construct preconditioners for a Krylov subspace method leads to an efficient and robust solver with a small memory footprint. We test the QTT preconditioners in the iterative solution of an exterior elliptic boundary value problem (Laplace) formulated as a boundary integral equation in complex, multiply connected geometries.

Submitted 1 October, 2016; v1 submitted 18 November, 2015; originally announced November 2015.

arXiv:1303.5466 [pdf, other]

An O(N) Direct Solver for Integral Equations on the Plane

Authors: Eduardo Corona, Per-Gunnar Martinsson, Denis Zorin

Abstract: An efficient direct solver for volume integral equations with $O(N)$ complexity for a broad range of problems is presented. The solver relies on hierarchical compression of the discretized integral operator, and exploits that off-diagonal blocks of certain dense matrices have numerically low rank. Technically, the solver is inspired by previously developed direct solvers for integral equations based on "recursive skeletonization" and "Hierarchically Semi-Separable" (HSS) matrices, but it improves on the asymptotic complexity of existing solvers by incorporating an additional level of compression. The resulting solver has optimal $O(N)$ complexity for all stages of the computation, as demonstrated by both theoretical analysis and numerical examples. The computational examples further display good practical performance in terms of both speed and memory usage. In particular, it is demonstrated that even problems involving $10^{7}$ unknowns can be solved to precision $10^{-10}$ using a simple Matlab implementation of the algorithm executed on a single core.

Submitted 14 May, 2013; v1 submitted 21 March, 2013; originally announced March 2013.

Comments: Submitted to the SIAM Journal of Scientific Computing (May 14, 2013). 32 pages, 12 figures, 6 sections

Showing 1–28 of 28 results for author: Corona, E