Front Matter
Audio-Synchronized Visual Animation
Current visual generation methods can produce high-quality videos guided by text prompts. However, effectively controlling object dynamics remains a challenge. This work explores audio as a cue to generate temporally synchronized image animations. ...
Expressive Whole-Body 3D Gaussian Avatar
Facial expressions and hand motions are necessary to express our emotions and interact with the world. Nevertheless, most 3D human avatars modeled from a casually captured video support only body motions without facial expressions and hand ...
Canonical Shape Projection Is All You Need for 3D Few-Shot Class Incremental Learning
- Ali Cheraghian,
- Zeeshan Hayder,
- Sameera Ramasinghe,
- Shafin Rahman,
- Javad Jafaryahya,
- Lars Petersson,
- Mehrtash Harandi
In recent years, robust pre-trained foundation models have been successfully used in many downstream tasks. Here, we would like to use such powerful models to address few-shot class incremental learning (FSCIL) on 3D point ...
Controllable Human-Object Interaction Synthesis
Synthesizing semantic-aware, long-horizon, human-object interaction is critical to simulate realistic human behaviors. In this work, we address the challenging problem of generating synchronized object motion and human motion guided by language ...
Freditor: High-Fidelity and Transferable NeRF Editing by Frequency Decomposition
This paper enables high-fidelity, transferable NeRF editing by frequency decomposition. Recent NeRF editing pipelines lift 2D stylization results to 3D scenes while suffering from blurry results, and fail to capture detailed structures caused by ...
DoughNet: A Visual Predictive Model for Topological Manipulation of Deformable Objects
Manipulation of elastoplastic objects like dough often involves topological changes such as splitting and merging. The ability to accurately predict these topological changes that a specific action might incur is critical for planning interactions ...
PAV: Personalized Head Avatar from Unstructured Video Collection
We propose PAV, Personalized Head Avatar for the synthesis of human faces under arbitrary viewpoints and facial expressions. PAV introduces a method that learns a dynamic deformable neural radiance field (NeRF), in particular from a collection of ...
Strike a Balance in Continual Panoptic Segmentation
This study explores the emerging area of continual panoptic segmentation, highlighting three key balances. First, we introduce past-class backtrace distillation to balance the stability of existing knowledge with the adaptability to new ...
In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation
We present Lazy Visual Grounding for open-vocabulary semantic segmentation, which decouples unsupervised object mask discovery from object grounding. Much of the previous art casts this task as pixel-to-text classification without object-level ...
MultiDelete for Multimodal Machine Unlearning
Machine Unlearning removes specific knowledge about training data samples from an already trained model. It has significant practical benefits, such as purging private, inaccurate, or outdated information from trained models without the need for ...
Unified Local-Cloud Decision-Making via Reinforcement Learning
Embodied vision-based real-world systems, such as mobile robots, require a careful balance between energy consumption, compute latency, and safety constraints to optimize operation across dynamic tasks and contexts. As local computation tends to ...
UniTalker: Scaling up Audio-Driven 3D Facial Animation Through A Unified Model
Audio-driven 3D facial animation aims to map input audio to realistic facial motion. Despite significant progress, limitations arise from inconsistent 3D annotations, restricting previous models to training on specific annotations and thereby ...
Robo-ABC: Affordance Generalization Beyond Categories via Semantic Correspondence for Robot Manipulation
Enabling robotic manipulation that generalizes to out-of-distribution scenes is a crucial step toward open-world embodied intelligence. For human beings, this ability is rooted in the understanding of semantic correspondence among different ...
Efficient Frequency-Domain Image Deraining with Contrastive Regularization
Most current single-image deraining (SID) methods are based on Transformers with global modeling for high-quality reconstruction. However, their architectures build long-range features only in the spatial domain, which suffers from a ...
Stitched ViTs are Flexible Vision Backbones
Large pretrained plain vision Transformers (ViTs) have been the workhorse for many downstream tasks. However, existing works utilizing off-the-shelf ViTs are inefficient in terms of training and deployment, because adopting ViTs with individual ...
TrajPrompt: Aligning Color Trajectory with Vision-Language Representations
- Li-Wu Tsao,
- Hao-Tang Tsui,
- Yu-Rou Tuan,
- Pei-Chi Chen,
- Kuan-Lin Wang,
- Jhih-Ciang Wu,
- Hong-Han Shuai,
- Wen-Huang Cheng
Cross-modal learning shows promising potential to overcome the limitations of single-modality tasks. However, without proper design for representation alignment between different data sources, the external modality cannot fully exhibit its value. ...
SemReg: Semantics Constrained Point Cloud Registration
Despite the recent success of Transformers in point cloud registration, the cross-attention mechanism, while enabling point-wise feature exchange between point clouds, suffers from redundant feature interactions among semantically unrelated ...
Cascade-Zero123: One Image to Highly Consistent 3D with Self-prompted Nearby Views
- Yabo Chen,
- Jiemin Fang,
- Yuyang Huang,
- Taoran Yi,
- Xiaopeng Zhang,
- Lingxi Xie,
- Xinggang Wang,
- Wenrui Dai,
- Hongkai Xiong,
- Qi Tian
Synthesizing multi-view 3D from one single image is a significant but challenging task. Zero-1-to-3-style methods have achieved great success by lifting a 2D latent diffusion model to the 3D scope. The target-view image is generated with a single-view ...
ReSyncer: Rewiring Style-Based Generator for Unified Audio-Visually Synced Facial Performer
- Jiazhi Guan,
- Zhiliang Xu,
- Hang Zhou,
- Kaisiyuan Wang,
- Shengyi He,
- Zhanwang Zhang,
- Borong Liang,
- Haocheng Feng,
- Errui Ding,
- Jingtuo Liu,
- Jingdong Wang,
- Youjian Zhao,
- Ziwei Liu
Lip-syncing videos with given audio is the foundation for various applications, including the creation of virtual presenters or performers. While recent studies explore high-fidelity lip-sync with different techniques, their task-oriented models ...
Language-Driven Physics-Based Scene Synthesis and Editing via Feature Splatting
Scene representations using 3D Gaussian primitives have produced excellent results in modeling the appearance of static and dynamic 3D scenes. Many graphics applications, however, demand the ability to manipulate both the appearance and the ...
AlignDiff: Aligning Diffusion Models for General Few-Shot Segmentation
Text-to-image diffusion models have shown remarkable success in synthesizing photo-realistic images. Apart from creative applications, can we use such models to synthesize samples that aid the few-shot training of discriminative models? In this ...
SkateFormer: Skeletal-Temporal Transformer for Human Action Recognition
Skeleton-based action recognition, which classifies human actions based on the coordinates of joints and their connectivity within skeleton data, is widely utilized in various scenarios. While Graph Convolutional Networks (GCNs) have been proposed ...
R²-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding
Tree-D Fusion: Simulation-Ready Tree Dataset from Single Images with Diffusion Priors
We introduce Tree-D Fusion, featuring the first collection of 600,000 environmentally aware, 3D simulation-ready tree models generated through Diffusion priors. Each reconstructed 3D tree model corresponds to an image from Google’s Auto Arborist ...
Parameterization-Driven Neural Surface Reconstruction for Object-Oriented Editing in Neural Rendering
The advancements in neural rendering have increased the need for techniques that enable intuitive editing of 3D objects represented as neural implicit surfaces. This paper introduces a novel neural algorithm for parameterizing neural implicit ...
DomainFusion: Generalizing to Unseen Domains with Latent Diffusion Models
Latent Diffusion Models (LDMs) are powerful and promising tools for facilitating generation-based methods for domain generalization. However, existing diffusion-based DG methods are restricted to offline augmentation using LDM and suffer from ...
Index Terms
- Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part XLI