Front Matter
No abstract available.
RCS-Prompt: Learning Prompt to Rearrange Class Space for Prompt-Based Continual Learning
Prompt-based Continual Learning is an emerging direction in leveraging pre-trained knowledge for downstream continual learning. When arriving at a new session, existing prompt-based continual learning methods usually adapt features from pre-...
Text-Anchored Score Composition: Tackling Condition Misalignment in Text-to-Image Diffusion Models
Text-to-image diffusion models have advanced towards more controllable generation via supporting various additional conditions (e.g., depth map, bounding box) beyond text. However, these models are learned based on the premise of perfect alignment ...
Grounding DINO: Marrying DINO with Grounded Pre-training for Open-Set Object Detection
- Shilong Liu,
- Zhaoyang Zeng,
- Tianhe Ren,
- Feng Li,
- Hao Zhang,
- Jie Yang,
- Qing Jiang,
- Chunyuan Li,
- Jianwei Yang,
- Hang Su,
- Jun Zhu,
- Lei Zhang
In this paper, we develop an open-set object detector, called Grounding DINO, by marrying the Transformer-based detector DINO with grounded pre-training, which can detect arbitrary objects given human inputs such as category names or referring ...
Make Your ViT-Based Multi-view 3D Detectors Faster via Token Compression
Slow inference speed is one of the most crucial concerns for deploying multi-view 3D detectors to tasks with high real-time requirements like autonomous driving. Although many sparse query-based methods have already attempted to improve the ...
OV-Uni3DETR: Towards Unified Open-Vocabulary 3D Object Detection via Cycle-Modality Propagation
In the current state of 3D object detection research, the severe scarcity of annotated 3D data, substantial disparities across different data modalities, and the absence of a unified architecture, have impeded the progress towards the goal of ...
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
- Shilong Liu,
- Hao Cheng,
- Haotian Liu,
- Hao Zhang,
- Feng Li,
- Tianhe Ren,
- Xueyan Zou,
- Jianwei Yang,
- Hang Su,
- Jun Zhu,
- Lei Zhang,
- Jianfeng Gao,
- Chunyuan Li
This paper presents LLaVA-Plus (Large Language and Vision Assistants that Plug and Learn to Use Skills), a general-purpose multimodal assistant trained using an end-to-end approach that systematically expands the capabilities of large multimodal ...
ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference
Despite the success of large-scale pretrained Vision-Language Models (VLMs) especially CLIP in various open-vocabulary tasks, their application to semantic segmentation remains challenging, producing noisy segmentation maps with mis-segmented ...
Two-Stage Active Learning for Efficient Temporal Action Segmentation
Training a temporal action segmentation (TAS) model on long and untrimmed videos requires gathering framewise video annotations, which is very costly. We propose a two-stage active learning framework to efficiently learn a TAS model using only a ...
MVPGS: Excavating Multi-view Priors for Gaussian Splatting from Sparse Input Views
Recently, advances in Neural Radiance Fields (NeRF) have facilitated few-shot Novel View Synthesis (NVS), which is a significant challenge in 3D vision applications. Despite numerous attempts to reduce the dense input requirement in NeRF, it ...
Domain-Adaptive 2D Human Pose Estimation via Dual Teachers in Extremely Low-Light Conditions
Existing 2D human pose estimation research predominantly concentrates on well-lit scenarios, with limited exploration of poor lighting conditions, which are a prevalent aspect of daily life. Recent studies on low-light pose estimation require the ...
Towards More Practical Group Activity Detection: A New Benchmark and Model
Group activity detection (GAD) is the task of simultaneously identifying the members of each group and classifying the group's activity in a video. While GAD has been studied recently, there is still much room for improvement in both dataset ...
Depicting Beyond Scores: Advancing Image Quality Assessment Through Multi-modal Language Models
We introduce a Depicted image Quality Assessment method (DepictQA), overcoming the constraints of traditional score-based methods. DepictQA allows for detailed, language-based, human-like evaluation of image quality by leveraging Multi-modal Large ...
Zero-Shot Image Feature Consensus with Deep Functional Maps
Correspondences emerge from large-scale vision models trained for generative and discriminative tasks. This has been revealed and benchmarked by computing correspondence maps between pairs of images, using nearest neighbors on the feature grids. ...
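The benchmarking procedure this abstract mentions, computing correspondence maps via nearest neighbors on feature grids, can be sketched minimally as follows. This is an illustrative toy, not the paper's method: shapes, the function name `nn_correspondences`, and the use of cosine similarity are all assumptions.

```python
import numpy as np

def nn_correspondences(feat_a: np.ndarray, feat_b: np.ndarray) -> np.ndarray:
    """Match each location in feature grid A to its nearest neighbor in grid B.

    feat_a: (Ha, Wa, D) per-pixel features of image A.
    feat_b: (Hb, Wb, D) per-pixel features of image B.
    Returns an (Ha, Wa, 2) array of (row, col) matches into image B.
    """
    Ha, Wa, D = feat_a.shape
    Hb, Wb, _ = feat_b.shape
    a = feat_a.reshape(-1, D)
    b = feat_b.reshape(-1, D)
    # L2-normalize so the dot product ranks neighbors by cosine similarity.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    sim = a @ b.T                       # (Ha*Wa, Hb*Wb) similarity matrix
    idx = sim.argmax(axis=1)            # best target index per source location
    rows, cols = np.divmod(idx, Wb)     # flat index -> (row, col) in grid B
    return np.stack([rows, cols], axis=-1).reshape(Ha, Wa, 2)

# Sanity check: matching a grid against itself recovers the identity map.
rng = np.random.default_rng(0)
fa = rng.normal(size=(4, 4, 8))
matches = nn_correspondences(fa, fa)
print(matches[2, 3])                    # → [2 3]
```

In practice the grids would come from a frozen backbone's feature maps rather than random noise, but the matching step itself is just this argmax over pairwise similarities.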
WindPoly: Polygonal Mesh Reconstruction via Winding Numbers
Polygonal mesh reconstruction of a raw point cloud is a valuable topic in the field of computer graphics and 3D vision. Especially for 3D architectural models, polygonal mesh provides concise expressions for fundamental geometric structures while ...
MinD-3D: Reconstruct High-Quality 3D Objects in Human Brain
In this paper, we introduce Recon3DMind, an innovative task aimed at reconstructing 3D visuals from Functional Magnetic Resonance Imaging (fMRI) signals, marking a significant advancement in the fields of cognitive neuroscience and computer ...
Tokenize Anything via Prompting
We present a unified, promptable model capable of simultaneously segmenting, recognizing, and captioning anything. Unlike SAM, we aim to build a versatile region representation in the wild via visual prompting. To achieve this, we train a ...
Geospecific View Generation: Geometry-Context Aware High-Resolution Ground View Inference from Satellite Views
Predicting realistic ground views from satellite imagery in urban scenes is a challenging task due to the significant view gaps between satellite and ground-view images. We propose a novel pipeline to tackle this challenge, by generating ...
Scissorhands: Scrub Data Influence via Connection Sensitivity in Networks
Machine unlearning has become a pivotal task to erase the influence of data from a trained model. It adheres to recent data regulation standards and enhances the privacy and security of machine learning applications. In this work, we present a new ...
City-on-Web: Real-Time Neural Rendering of Large-Scale Scenes on the Web
Existing neural radiance field-based methods can achieve real-time rendering of small scenes on the web platform. However, extending these methods to large-scale scenes still poses significant challenges due to limited resources in computation, ...
GRAPE: Generalizable and Robust Multi-view Facial Capture
Deep learning-based multi-view facial capture methods have shown impressive accuracy while being several orders of magnitude faster than traditional mesh registration pipelines. However, the existing systems (e.g. TEMPEH) are strictly restricted ...
Training-Free Model Merging for Multi-target Domain Adaptation
In this paper, we study multi-target domain adaptation of scene understanding models. While previous methods achieved commendable results through inter-domain consistency losses, they often assumed unrealistic simultaneous access to images from ...
Multi-RoI Human Mesh Recovery with Camera Consistency and Contrastive Losses
Besides a 3D mesh, Human Mesh Recovery (HMR) methods usually need to estimate a camera for computing 2D reprojection loss. Previous approaches may encounter the following problem: both the mesh and camera are not correct but the combination of ...
Open-Vocabulary Camouflaged Object Segmentation
Recently, the emergence of the large-scale vision-language model (VLM), such as CLIP, has opened the way towards open-world object perception. Many works have explored the utilization of pre-trained VLM for the challenging open-vocabulary dense ...
Index Terms
- Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part XLVII