Showing 1–50 of 188 results for author: Tombari, F

  1. arXiv:2407.08707  [pdf, other]

    cs.CV cs.LG

    Extracting Training Data from Document-Based VQA Models

    Authors: Francesco Pinto, Nathalie Rauschmayr, Florian Tramèr, Philip Torr, Federico Tombari

    Abstract: Vision-Language Models (VLMs) have made remarkable progress in document-based Visual Question Answering (i.e., responding to queries about the contents of an input document provided as an image). In this work, we show these models can memorize responses for training samples and regurgitate them even when the relevant visual information has been removed. This includes Personal Identifiable Informat…

    Submitted 11 July, 2024; originally announced July 2024.

    Comments: ICML 2024

    ACM Class: I.2.7; I.2.10; K.4.1

  2. arXiv:2407.00503  [pdf, other]

    cs.CV

    Toward a Diffusion-Based Generalist for Dense Vision Tasks

    Authors: Yue Fan, Yongqin Xian, Xiaohua Zhai, Alexander Kolesnikov, Muhammad Ferjad Naeem, Bernt Schiele, Federico Tombari

    Abstract: Building generalized models that can solve many computer vision tasks simultaneously is an intriguing direction. Recent works have shown image itself can be used as a natural interface for general-purpose visual perception and demonstrated inspiring results. In this paper, we explore diffusion-based vision generalists, where we unify different types of dense prediction tasks as conditional image g…

    Submitted 29 June, 2024; originally announced July 2024.

    Comments: Published at CVPR 2024 as a workshop paper

  3. arXiv:2406.18717  [pdf, other]

    cs.CV

    Dynamic Gaussian Marbles for Novel View Synthesis of Casual Monocular Videos

    Authors: Colton Stearns, Adam Harley, Mikaela Uy, Florian Dubost, Federico Tombari, Gordon Wetzstein, Leonidas Guibas

    Abstract: Gaussian splatting has become a popular representation for novel-view synthesis, exhibiting clear strengths in efficiency, photometric quality, and compositional editability. Following its success, many works have extended Gaussians to 4D, showing that dynamic Gaussians maintain these benefits while also tracking scene geometry far better than alternative representations. Yet, these methods assume d…

    Submitted 26 June, 2024; originally announced June 2024.

  4. arXiv:2406.14599  [pdf, other]

    cs.CV

    Stylebreeder: Exploring and Democratizing Artistic Styles through Text-to-Image Models

    Authors: Matthew Zheng, Enis Simsar, Hidir Yesiltepe, Federico Tombari, Joel Simon, Pinar Yanardag

    Abstract: Text-to-image models are becoming increasingly popular, revolutionizing the landscape of digital art creation by enabling highly detailed and creative visual content generation. These models have been widely employed across various domains, particularly in art generation, where they facilitate a broad spectrum of creative expression and democratize access to artistic creation. In this paper, we in…

    Submitted 20 June, 2024; originally announced June 2024.

  5. arXiv:2406.09801  [pdf, other]

    cs.CV

    RaNeuS: Ray-adaptive Neural Surface Reconstruction

    Authors: Yida Wang, David Joseph Tan, Nassir Navab, Federico Tombari

    Abstract: Our objective is to leverage a differentiable radiance field, e.g., NeRF, to reconstruct detailed 3D surfaces in addition to producing the standard novel view renderings. There have been related methods that perform such tasks, usually by utilizing a signed distance field (SDF). However, the state-of-the-art approaches still fail to correctly reconstruct the small-scale details, such as the leaves, ro…

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: 3DV 2024, oral. In: Proceedings of the IEEE/CVF International Conference on 3D Vision (2023)

  6. arXiv:2405.21066  [pdf, other]

    cs.CV

    Mixed Diffusion for 3D Indoor Scene Synthesis

    Authors: Siyi Hu, Diego Martin Arroyo, Stephanie Debats, Fabian Manhardt, Luca Carlone, Federico Tombari

    Abstract: Realistic conditional 3D scene synthesis significantly enhances and accelerates the creation of virtual environments, which can also provide extensive training data for computer vision and robotics research among other applications. Diffusion models have shown great performance in related applications, e.g., making precise arrangements of unordered sets. However, these models have not been fully e…

    Submitted 31 May, 2024; originally announced May 2024.

    Comments: 19 pages, 14 figures. Under review. Code to be released at: https://github.com/MIT-SPARK/MiDiffusion

  7. arXiv:2405.16544  [pdf, other]

    cs.CV

    Splat-SLAM: Globally Optimized RGB-only SLAM with 3D Gaussians

    Authors: Erik Sandström, Keisuke Tateno, Michael Oechsle, Michael Niemeyer, Luc Van Gool, Martin R. Oswald, Federico Tombari

    Abstract: 3D Gaussian Splatting has emerged as a powerful representation of geometry and appearance for RGB-only dense Simultaneous Localization and Mapping (SLAM), as it provides a compact dense map representation while enabling efficient and high-quality map rendering. However, existing methods show significantly worse reconstruction quality than competing methods using other 3D representations, e.g. neur…

    Submitted 26 May, 2024; originally announced May 2024.

    Comments: 21 pages

  8. arXiv:2405.03690  [pdf, other]

    cs.CV

    How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs

    Authors: Muhammad Uzair Khattak, Muhammad Ferjad Naeem, Jameel Hassan, Muzammal Naseer, Federico Tombari, Fahad Shahbaz Khan, Salman Khan

    Abstract: Recent advancements in Large Language Models (LLMs) have led to the development of Video Large Multi-modal Models (Video-LMMs) that can handle a wide range of video understanding tasks. These models have the potential to be deployed in real-world applications such as robotics, AI assistants, medical surgery, and autonomous vehicles. The widespread adoption of Video-LMMs in our daily lives undersco…

    Submitted 8 May, 2024; v1 submitted 6 May, 2024; originally announced May 2024.

    Comments: Technical report

  9. arXiv:2405.00915  [pdf, other]

    cs.CV cs.AI cs.LG

    EchoScene: Indoor Scene Generation via Information Echo over Scene Graph Diffusion

    Authors: Guangyao Zhai, Evin Pınar Örnek, Dave Zhenyu Chen, Ruotong Liao, Yan Di, Nassir Navab, Federico Tombari, Benjamin Busam

    Abstract: We present EchoScene, an interactive and controllable generative model that generates 3D indoor scenes on scene graphs. EchoScene leverages a dual-branch diffusion model that dynamically adapts to scene graphs. Existing methods struggle to handle scene graphs due to varying numbers of nodes, multiple edge combinations, and manipulator-induced node-edge operations. EchoScene overcomes this by assoc…

    Submitted 1 May, 2024; originally announced May 2024.

    Comments: 25 pages. 10 figures

  10. arXiv:2404.07204  [pdf, other]

    cs.CV cs.AI cs.LG

    BRAVE: Broadening the visual encoding of vision-language models

    Authors: Oğuzhan Fatih Kar, Alessio Tonioni, Petra Poklukar, Achin Kulshrestha, Amir Zamir, Federico Tombari

    Abstract: Vision-language models (VLMs) are typically composed of a vision encoder, e.g. CLIP, and a language model (LM) that interprets the encoded features to solve downstream tasks. Despite remarkable progress, VLMs are subject to several shortcomings due to the limited capabilities of vision encoders, e.g. "blindness" to certain image features, visual hallucination, etc. To address these issues, we stud…

    Submitted 10 April, 2024; originally announced April 2024.

    Comments: Project page at https://brave-vlms.epfl.ch/

  11. arXiv:2404.04421  [pdf, other]

    cs.GR cs.CV

    PhysAvatar: Learning the Physics of Dressed 3D Avatars from Visual Observations

    Authors: Yang Zheng, Qingqing Zhao, Guandao Yang, Wang Yifan, Donglai Xiang, Florian Dubost, Dmitry Lagun, Thabo Beeler, Federico Tombari, Leonidas Guibas, Gordon Wetzstein

    Abstract: Modeling and rendering photorealistic avatars is of crucial importance in many applications. Existing methods that build a 3D avatar from visual observations, however, struggle to reconstruct clothed humans. We introduce PhysAvatar, a novel framework that combines inverse rendering with inverse physics to automatically estimate the shape and appearance of a human from multi-view video data along w…

    Submitted 9 April, 2024; v1 submitted 5 April, 2024; originally announced April 2024.

    Comments: Project Page: https://qingqing-zhao.github.io/PhysAvatar

  12. arXiv:2404.03658  [pdf, other]

    cs.CV

    Know Your Neighbors: Improving Single-View Reconstruction via Spatial Vision-Language Reasoning

    Authors: Rui Li, Tobias Fischer, Mattia Segu, Marc Pollefeys, Luc Van Gool, Federico Tombari

    Abstract: Recovering the 3D scene geometry from a single view is a fundamental yet ill-posed problem in computer vision. While classical depth estimation methods infer only a 2.5D scene representation limited to the image plane, recent approaches based on radiance fields reconstruct a full 3D representation. However, these methods still struggle with occluded regions since inferring geometry without visual…

    Submitted 4 April, 2024; originally announced April 2024.

    Comments: CVPR 2024. Project page: https://ruili3.github.io/kyn

  13. arXiv:2404.03650  [pdf, other]

    cs.CV

    OpenNeRF: Open Set 3D Neural Scene Segmentation with Pixel-Wise Features and Rendered Novel Views

    Authors: Francis Engelmann, Fabian Manhardt, Michael Niemeyer, Keisuke Tateno, Marc Pollefeys, Federico Tombari

    Abstract: Large visual-language models (VLMs), like CLIP, enable open-set image segmentation to segment arbitrary concepts from an image in a zero-shot manner. This goes beyond the traditional closed-set assumption, i.e., where models can only segment classes from a pre-defined training set. More recently, first works on open-set segmentation in 3D scenes have appeared in the literature. These methods are h…

    Submitted 4 April, 2024; originally announced April 2024.

    Comments: ICLR 2024, Project page: https://opennerf.github.io

    Journal ref: ICLR 2024

  14. arXiv:2404.01887   

    cs.CV

    3D scene generation from scene graphs and self-attention

    Authors: Pietro Bonazzi, Mengqi Wang, Diego Martin Arroyo, Fabian Manhardt, Nico Messikomer, Federico Tombari, Davide Scaramuzza

    Abstract: Synthesizing realistic and diverse indoor 3D scene layouts in a controllable fashion opens up applications in simulated navigation and virtual reality. As concise and robust representations of a scene, scene graphs have proven to be well-suited as the semantic control on the generated layout. We present a variant of the conditional variational autoencoder (cVAE) model to synthesize 3D scenes from…

    Submitted 23 April, 2024; v1 submitted 2 April, 2024; originally announced April 2024.

    Comments: Some authors were not timely informed of the submission

  15. arXiv:2404.01112   

    cs.CV cs.CG

    Few-shot point cloud reconstruction and denoising via learned Gaussian splats renderings and fine-tuned diffusion features

    Authors: Pietro Bonazzi, Marie-Julie Rakatosaona, Marco Cannici, Federico Tombari, Davide Scaramuzza

    Abstract: Existing deep learning methods for the reconstruction and denoising of point clouds rely on small datasets of 3D shapes. We circumvent the problem by leveraging deep learning methods trained on billions of images. We propose a method to reconstruct point clouds from few images and to denoise point clouds from their rendering by exploiting prior knowledge distilled from image-based deep learning mo…

    Submitted 23 April, 2024; v1 submitted 1 April, 2024; originally announced April 2024.

    Comments: An author was not timely informed before the released submission

  16. arXiv:2404.00469  [pdf, other]

    cs.CV

    SceneGraphLoc: Cross-Modal Coarse Visual Localization on 3D Scene Graphs

    Authors: Yang Miao, Francis Engelmann, Olga Vysotska, Federico Tombari, Marc Pollefeys, Dániel Béla Baráth

    Abstract: We introduce a novel problem, i.e., the localization of an input image within a multi-modal reference map represented by a database of 3D scene graphs. These graphs comprise multiple modalities, including object-level point clouds, images, attributes, and relationships between objects, offering a lightweight and efficient alternative to conventional methods that rely on extensive image databases.…

    Submitted 12 July, 2024; v1 submitted 30 March, 2024; originally announced April 2024.

  17. arXiv:2403.19776  [pdf, other]

    cs.CV cs.LG

    CLoRA: A Contrastive Approach to Compose Multiple LoRA Models

    Authors: Tuna Han Salih Meral, Enis Simsar, Federico Tombari, Pinar Yanardag

    Abstract: Low-Rank Adaptations (LoRAs) have emerged as a powerful and popular technique in the field of image generation, offering a highly effective way to adapt and refine pre-trained deep learning models for specific tasks without the need for comprehensive retraining. By employing pre-trained LoRA models, such as those representing a specific cat and a particular dog, the objective is to generate an ima…

    Submitted 28 March, 2024; originally announced March 2024.

  18. arXiv:2403.14279  [pdf, other]

    cs.CV

    Zero123-6D: Zero-shot Novel View Synthesis for RGB Category-level 6D Pose Estimation

    Authors: Francesco Di Felice, Alberto Remus, Stefano Gasperini, Benjamin Busam, Lionel Ott, Federico Tombari, Roland Siegwart, Carlo Alberto Avizzano

    Abstract: Estimating the pose of objects through vision is essential to make robotic platforms interact with the environment. Yet, it presents many challenges, often related to the lack of flexibility and generalizability of state-of-the-art solutions. Diffusion models are a cutting-edge neural architecture transforming 2D and 3D computer vision, outlining remarkable performances in zero-shot novel-view syn…

    Submitted 30 July, 2024; v1 submitted 21 March, 2024; originally announced March 2024.

    Comments: 6 pages, 2 reference pages, 4 figures

  19. arXiv:2403.13806  [pdf, other]

    cs.CV cs.GR

    RadSplat: Radiance Field-Informed Gaussian Splatting for Robust Real-Time Rendering with 900+ FPS

    Authors: Michael Niemeyer, Fabian Manhardt, Marie-Julie Rakotosaona, Michael Oechsle, Daniel Duckworth, Rama Gosula, Keisuke Tateno, John Bates, Dominik Kaeser, Federico Tombari

    Abstract: Recent advances in view synthesis and real-time rendering have achieved photorealistic quality at impressive rendering speeds. While Radiance Field-based methods achieve state-of-the-art quality in challenging scenarios such as in-the-wild captures and large-scale scenes, they often suffer from excessively high compute requirements linked to volumetric rendering. Gaussian Splatting-based methods,…

    Submitted 20 March, 2024; originally announced March 2024.

    Comments: Project page at https://m-niemeyer.github.io/radsplat/

  20. arXiv:2403.11324  [pdf, other]

    cs.CV

    GeoGaussian: Geometry-aware Gaussian Splatting for Scene Rendering

    Authors: Yanyan Li, Chenyu Lyu, Yan Di, Guangyao Zhai, Gim Hee Lee, Federico Tombari

    Abstract: During the Gaussian Splatting optimization process, the scene's geometry can gradually deteriorate if its structure is not deliberately preserved, especially in non-textured regions such as walls, ceilings, and furniture surfaces. This degradation significantly affects the rendering quality of novel views that deviate significantly from the viewpoints in the training data. To mitigate this issue,…

    Submitted 17 July, 2024; v1 submitted 17 March, 2024; originally announced March 2024.

    Comments: accepted to ECCV 2024

  21. arXiv:2403.10099  [pdf, other]

    cs.CV

    KP-RED: Exploiting Semantic Keypoints for Joint 3D Shape Retrieval and Deformation

    Authors: Ruida Zhang, Chenyangguang Zhang, Yan Di, Fabian Manhardt, Xingyu Liu, Federico Tombari, Xiangyang Ji

    Abstract: In this paper, we present KP-RED, a unified KeyPoint-driven REtrieval and Deformation framework that takes object scans as input and jointly retrieves and deforms the most geometrically similar CAD models from a pre-processed database to tightly match the target. Unlike existing dense matching based methods that typically struggle with noisy partial scans, we propose to leverage category-consisten…

    Submitted 20 March, 2024; v1 submitted 15 March, 2024; originally announced March 2024.

    Comments: Accepted by CVPR 2024

  22. arXiv:2403.06904  [pdf, other]

    cs.CV

    FocusCLIP: Multimodal Subject-Level Guidance for Zero-Shot Transfer in Human-Centric Tasks

    Authors: Muhammad Saif Ullah Khan, Muhammad Ferjad Naeem, Federico Tombari, Luc Van Gool, Didier Stricker, Muhammad Zeshan Afzal

    Abstract: We propose FocusCLIP, integrating subject-level guidance--a specialized mechanism for target-specific supervision--into the CLIP framework for improved zero-shot transfer on human-centric tasks. Our novel contributions enhance CLIP on both the vision and text sides. On the vision side, we incorporate ROI heatmaps emulating human visual attention mechanisms to emphasize subject-relevant image regio…

    Submitted 25 March, 2024; v1 submitted 11 March, 2024; originally announced March 2024.

  23. arXiv:2403.00372  [pdf, other]

    cs.CV

    HyperSDFusion: Bridging Hierarchical Structures in Language and Geometry for Enhanced 3D Text2Shape Generation

    Authors: Zhiying Leng, Tolga Birdal, Xiaohui Liang, Federico Tombari

    Abstract: 3D shape generation from text is a fundamental task in 3D representation learning. The text-shape pairs exhibit a hierarchical structure, where a general text like "chair" covers all 3D shapes of the chair, while more detailed prompts refer to more specific shapes. Furthermore, both text and 3D shapes are inherently hierarchical structures. However, existing Text2Shape methods, such as SDFusion,…

    Submitted 30 April, 2024; v1 submitted 1 March, 2024; originally announced March 2024.

    Journal ref: IEEE/CVF conference on computer vision and pattern recognition 2024

  24. arXiv:2402.15321  [pdf, other]

    cs.CV cs.AI cs.LG

    OpenSUN3D: 1st Workshop Challenge on Open-Vocabulary 3D Scene Understanding

    Authors: Francis Engelmann, Ayca Takmaz, Jonas Schult, Elisabetta Fedele, Johanna Wald, Songyou Peng, Xi Wang, Or Litany, Siyu Tang, Federico Tombari, Marc Pollefeys, Leonidas Guibas, Hongbo Tian, Chunjie Wang, Xiaosheng Yan, Bingwen Wang, Xuanyang Zhang, Xiao Liu, Phuc Nguyen, Khoi Nguyen, Anh Tran, Cuong Pham, Zhening Huang, Xiaoyang Wu, Xi Chen, et al. (3 additional authors not shown)

    Abstract: This report provides an overview of the challenge hosted at the OpenSUN3D Workshop on Open-Vocabulary 3D Scene Understanding held in conjunction with ICCV 2023. The goal of this workshop series is to provide a platform for exploration and discussion of open-vocabulary 3D scene understanding tasks, including but not limited to segmentation, detection and mapping. We provide an overview of the chall…

    Submitted 17 March, 2024; v1 submitted 23 February, 2024; originally announced February 2024.

    Comments: Our OpenSUN3D workshop website for ICCV 2023: https://opensun3d.github.io/index_iccv23.html

  25. arXiv:2402.03466  [pdf, other]

    cs.CV cs.CG cs.RO

    Physics-Encoded Graph Neural Networks for Deformation Prediction under Contact

    Authors: Mahdi Saleh, Michael Sommersperger, Nassir Navab, Federico Tombari

    Abstract: In robotics, it's crucial to understand object deformation during tactile interactions. A precise understanding of deformation can elevate robotic simulations and have broad implications across different industries. We introduce a method using Physics-Encoded Graph Neural Networks (GNNs) for such predictions. Similar to robotic grasping and manipulation scenarios, we focus on modeling the dynamics…

    Submitted 5 February, 2024; originally announced February 2024.

    Comments: Accepted at 2024 IEEE International Conference on Robotics and Automation (ICRA2024)

  26. arXiv:2402.03445  [pdf, other]

    cs.CV cs.GR cs.LG

    Denoising Diffusion via Image-Based Rendering

    Authors: Titas Anciukevičius, Fabian Manhardt, Federico Tombari, Paul Henderson

    Abstract: Generating 3D scenes is a challenging open problem, which requires synthesizing plausible content that is fully consistent in 3D space. While recent methods such as neural radiance fields excel at view synthesis and 3D reconstruction, they cannot synthesize plausible details in unobserved regions since they lack a generative capability. Conversely, existing generative methods are typically not cap…

    Submitted 20 February, 2024; v1 submitted 5 February, 2024; originally announced February 2024.

    Comments: Accepted at ICLR 2024. Project page: https://anciukevicius.github.io/generative-image-based-rendering

  27. arXiv:2401.05335  [pdf, other]

    cs.CV cs.GR cs.LG

    InseRF: Text-Driven Generative Object Insertion in Neural 3D Scenes

    Authors: Mohamad Shahbazi, Liesbeth Claessens, Michael Niemeyer, Edo Collins, Alessio Tonioni, Luc Van Gool, Federico Tombari

    Abstract: We introduce InseRF, a novel method for generative object insertion in the NeRF reconstructions of 3D scenes. Based on a user-provided textual description and a 2D bounding box in a reference viewpoint, InseRF generates new objects in 3D scenes. Recently, methods for 3D scene editing have been profoundly transformed, owing to the use of strong priors of text-to-image diffusion models in 3D generat…

    Submitted 10 January, 2024; originally announced January 2024.

  28. arXiv:2401.02418  [pdf, other]

    cs.CV

    Learning to Prompt with Text Only Supervision for Vision-Language Models

    Authors: Muhammad Uzair Khattak, Muhammad Ferjad Naeem, Muzammal Naseer, Luc Van Gool, Federico Tombari

    Abstract: Foundational vision-language models such as CLIP are becoming a new paradigm in vision, due to their excellent generalization abilities. However, adapting these models for downstream tasks while maintaining their generalization remains a challenge. In literature, one branch of methods adapts CLIP by learning prompts using visual information. While effective, most of these works require labeled dat…

    Submitted 4 January, 2024; originally announced January 2024.

    Comments: Project Page: https://muzairkhattak.github.io/ProText/

  29. arXiv:2312.17232  [pdf, other]

    cs.CV

    Segment3D: Learning Fine-Grained Class-Agnostic 3D Segmentation without Manual Labels

    Authors: Rui Huang, Songyou Peng, Ayca Takmaz, Federico Tombari, Marc Pollefeys, Shiji Song, Gao Huang, Francis Engelmann

    Abstract: Current 3D scene segmentation methods are heavily dependent on manually annotated 3D training datasets. Such manual annotations are labor-intensive, and often lack fine-grained details. Importantly, models trained on this data typically struggle to recognize object classes beyond the annotated classes, i.e., they do not generalize well to unseen domains and require additional domain-specific annot…

    Submitted 28 December, 2023; originally announced December 2023.

    Comments: Project Page: http://segment3d.github.io

  30. arXiv:2312.13285  [pdf, other]

    cs.CV

    UniSDF: Unifying Neural Representations for High-Fidelity 3D Reconstruction of Complex Scenes with Reflections

    Authors: Fangjinhua Wang, Marie-Julie Rakotosaona, Michael Niemeyer, Richard Szeliski, Marc Pollefeys, Federico Tombari

    Abstract: Neural 3D scene representations have shown great potential for 3D reconstruction from 2D images. However, reconstructing real-world captures of complex scenes still remains a challenge. Existing generic 3D reconstruction methods often struggle to represent fine geometric details and do not adequately model reflective surfaces of large-scale scenes. Techniques that explicitly focus on reflective su…

    Submitted 20 December, 2023; originally announced December 2023.

    Comments: Project page: https://fangjinhuawang.github.io/UniSDF

  31. arXiv:2312.11897  [pdf, other]

    cs.CV

    Text-Conditioned Resampler For Long Form Video Understanding

    Authors: Bruno Korbar, Yongqin Xian, Alessio Tonioni, Andrew Zisserman, Federico Tombari

    Abstract: In this paper we present a text-conditioned video resampler (TCR) module that uses a pre-trained and frozen visual encoder and large language model (LLM) to process long video sequences for a task. TCR localises relevant visual features from the video given a text condition and provides them to a LLM to generate a text response. Due to its lightweight design and use of cross-attention, TCR can pro…

    Submitted 25 March, 2024; v1 submitted 19 December, 2023; originally announced December 2023.

  32. arXiv:2312.09256  [pdf, other]

    cs.CV

    LIME: Localized Image Editing via Attention Regularization in Diffusion Models

    Authors: Enis Simsar, Alessio Tonioni, Yongqin Xian, Thomas Hofmann, Federico Tombari

    Abstract: Diffusion models (DMs) have gained prominence due to their ability to generate high-quality, varied images, with recent advancements in text-to-image generation. The research focus is now shifting towards the controllability of DMs. A significant challenge within this domain is localized editing, where specific areas of an image are modified without affecting the rest of the content. This paper in…

    Submitted 14 December, 2023; originally announced December 2023.

  33. arXiv:2312.06059  [pdf, other]

    cs.CV cs.AI cs.LG

    CONFORM: Contrast is All You Need For High-Fidelity Text-to-Image Diffusion Models

    Authors: Tuna Han Salih Meral, Enis Simsar, Federico Tombari, Pinar Yanardag

    Abstract: Images produced by text-to-image diffusion models might not always faithfully represent the semantic intent of the provided text prompt, where the model might overlook or entirely fail to produce certain objects. Existing solutions often require customly tailored functions for each of these problems, leading to sub-optimal results, especially for complex prompts. Our work introduces a novel perspe…

    Submitted 10 December, 2023; originally announced December 2023.

  34. arXiv:2312.04201  [pdf, other]

    math.AT

    Matching distance via the extended Pareto grid

    Authors: Patrizio Frosini, Eloy Mósig García, Nicola Quercioli, Francesca Tombari

    Abstract: One of the most animated themes of multidimensional persistence is the comparison between invariants. The matching distance between persistent Betti numbers functions (or rank invariants), is among the most studied metrics in this context, particularly in 2-parameter persistence. The main reason for this interest is that, in the 2-parameter case, the foliation method allows for an effective comput…

    Submitted 6 July, 2024; v1 submitted 7 December, 2023; originally announced December 2023.

    Comments: Theorem 5 is now a stronger result and Appendix B has been added

  35. arXiv:2312.02255  [pdf, other]

    cs.CV cs.GR cs.LG

    Re-Nerfing: Improving Novel Views Synthesis through Novel Views Synthesis

    Authors: Felix Tristram, Stefano Gasperini, Nassir Navab, Federico Tombari

    Abstract: Neural Radiance Fields (NeRFs) have shown remarkable novel view synthesis capabilities even in large-scale, unbounded scenes, albeit requiring hundreds of views or introducing artifacts in sparser settings. Their optimization suffers from shape-radiance ambiguities wherever only a small visual overlap is available. This leads to erroneous scene geometry and artifacts. In this paper, we propose Re-…

    Submitted 17 April, 2024; v1 submitted 4 December, 2023; originally announced December 2023.

    Comments: Code will be released upon acceptance

  36. arXiv:2312.00204  [pdf, other]

    cs.CV

    DNS SLAM: Dense Neural Semantic-Informed SLAM

    Authors: Kunyi Li, Michael Niemeyer, Nassir Navab, Federico Tombari

    Abstract: In recent years, coordinate-based neural implicit representations have shown promising results for the task of Simultaneous Localization and Mapping (SLAM). While achieving impressive performance on small synthetic scenes, these methods often suffer from oversmoothed reconstructions, especially for complex real-world scenes. In this work, we introduce DNS SLAM, a novel neural RGB-D semantic SLAM a…

    Submitted 30 November, 2023; originally announced December 2023.

  37. arXiv:2311.16241  [pdf, other]

    cs.CV

    SemiVL: Semi-Supervised Semantic Segmentation with Vision-Language Guidance

    Authors: Lukas Hoyer, David Joseph Tan, Muhammad Ferjad Naeem, Luc Van Gool, Federico Tombari

    Abstract: In semi-supervised semantic segmentation, a model is trained with a limited number of labeled images along with a large corpus of unlabeled images to reduce the high annotation effort. While previous methods are able to learn good segmentation boundaries, they are prone to confuse classes with similar visual appearance due to the limited supervision. On the other hand, vision-language models (VLMs…

    Submitted 27 November, 2023; originally announced November 2023.

  38. arXiv:2311.14189  [pdf, other]

    cs.CV

    D-SCo: Dual-Stream Conditional Diffusion for Monocular Hand-Held Object Reconstruction

    Authors: Bowen Fu, Gu Wang, Chenyangguang Zhang, Yan Di, Ziqin Huang, Zhiying Leng, Fabian Manhardt, Xiangyang Ji, Federico Tombari

    Abstract: Reconstructing hand-held objects from a single RGB image is a challenging task in computer vision. In contrast to prior works that utilize deterministic modeling paradigms, we employ a point cloud denoising diffusion model to account for the probabilistic nature of this problem. In the core, we introduce centroid-fixed dual-stream conditional diffusion for monocular hand-held object reconstruction…

    Submitted 22 March, 2024; v1 submitted 23 November, 2023; originally announced November 2023.

  39. arXiv:2311.13009  [pdf, ps, other]

    cs.CV

    3D Compression Using Neural Fields

    Authors: Janis Postels, Yannick Strümpler, Klara Reichard, Luc Van Gool, Federico Tombari

    Abstract: Neural Fields (NFs) have gained momentum as a tool for compressing various data modalities - e.g. images and videos. This work leverages previous advances and proposes a novel NF-based compression algorithm for 3D data. We derive two versions of our approach - one tailored to watertight shapes based on Signed Distance Fields (SDFs) and, more generally, one for arbitrary non-watertight shapes using…

    Submitted 21 November, 2023; originally announced November 2023.

  40. arXiv:2311.11125  [pdf, other]

    cs.CV

    SecondPose: SE(3)-Consistent Dual-Stream Feature Fusion for Category-Level Pose Estimation

    Authors: Yamei Chen, Yan Di, Guangyao Zhai, Fabian Manhardt, Chenyangguang Zhang, Ruida Zhang, Federico Tombari, Nassir Navab, Benjamin Busam

    Abstract: Category-level object pose estimation, aiming to predict the 6D pose and 3D size of objects from known categories, typically struggles with large intra-class shape variation. Existing works utilizing mean shapes often fall short of capturing this variation. To address this issue, we present SecondPose, a novel approach integrating object-specific geometric features with semantic category priors fr…

    Submitted 21 March, 2024; v1 submitted 18 November, 2023; originally announced November 2023.

    Comments: CVPR 2024 accepted. Code is available at: https://github.com/NOrangeeroli/SecondPose

  41. arXiv:2310.13355  [pdf, other]

    cs.CV

    SILC: Improving Vision Language Pretraining with Self-Distillation

    Authors: Muhammad Ferjad Naeem, Yongqin Xian, Xiaohua Zhai, Lukas Hoyer, Luc Van Gool, Federico Tombari

    Abstract: Image-Text pretraining on web-scale image caption datasets has become the default recipe for open vocabulary classification and retrieval models thanks to the success of CLIP and its variants. Several works have also used CLIP features for dense prediction tasks and have shown the emergence of open-set abilities. However, the contrastive objective used by these models only focuses on image-text al…

    Submitted 7 December, 2023; v1 submitted 20 October, 2023; originally announced October 2023.

  42. arXiv:2310.11696  [pdf, other]

    cs.CV

    MOHO: Learning Single-view Hand-held Object Reconstruction with Multi-view Occlusion-Aware Supervision

    Authors: Chenyangguang Zhang, Guanlong Jiao, Yan Di, Gu Wang, Ziqin Huang, Ruida Zhang, Fabian Manhardt, Bowen Fu, Federico Tombari, Xiangyang Ji

    Abstract: Previous works concerning single-view hand-held object reconstruction typically rely on supervision from 3D ground-truth models, which are hard to collect in the real world. In contrast, readily accessible hand-object videos offer a promising training data source, but they only give heavily occluded object observations. In this paper, we present a novel synthetic-to-real framework to exploit Multi-vie…

    Submitted 13 March, 2024; v1 submitted 17 October, 2023; originally announced October 2023.

    Comments: CVPR 2024

  43. arXiv:2310.10931  [pdf, other]

    cs.RO

    Open-Structure: a Structural Benchmark Dataset for SLAM Algorithms

    Authors: Yanyan Li, Zhao Guo, Ze Yang, Yanbiao Sun, Liang Zhao, Federico Tombari

    Abstract: This paper introduces a new benchmark dataset, Open-Structure, for evaluating visual odometry and SLAM methods, which directly supplies point and line measurements, correspondences, structural associations, and co-visibility factor graphs instead of providing raw images. Based on the proposed benchmark dataset, these 2D or 3D data can be directly input to different stages of SLAM pipelines to avoid…

    Submitted 16 October, 2023; originally announced October 2023.

  44. arXiv:2309.12188  [pdf, other]

    cs.RO cs.CV

    SG-Bot: Object Rearrangement via Coarse-to-Fine Robotic Imagination on Scene Graphs

    Authors: Guangyao Zhai, Xiaoni Cai, Dianye Huang, Yan Di, Fabian Manhardt, Federico Tombari, Nassir Navab, Benjamin Busam

    Abstract: Object rearrangement is pivotal in robotic-environment interactions, representing a significant capability in embodied AI. In this paper, we present SG-Bot, a novel rearrangement framework that utilizes a coarse-to-fine scheme with a scene graph as the scene representation. Unlike previous methods that rely on either known goal priors or zero-shot large models, SG-Bot exemplifies lightweight, real…

    Submitted 24 March, 2024; v1 submitted 21 September, 2023; originally announced September 2023.

    Comments: ICRA 2024 accepted. Project website: https://sites.google.com/view/sg-bot

  45. arXiv:2309.02965  [pdf, other]

    cs.CV

    Dynamic Hyperbolic Attention Network for Fine Hand-object Reconstruction

    Authors: Zhiying Leng, Shun-Cheng Wu, Mahdi Saleh, Antonio Montanaro, Hao Yu, Yin Wang, Nassir Navab, Xiaohui Liang, Federico Tombari

    Abstract: Reconstructing both objects and hands in 3D from a single RGB image is complex. Existing methods rely on manually defined hand-object constraints in Euclidean space, leading to suboptimal feature learning. Compared with Euclidean space, hyperbolic space better preserves the geometric properties of meshes thanks to its exponentially-growing space distance, which amplifies the differences between th…

    Submitted 6 September, 2023; originally announced September 2023.

    Comments: Accepted by ICCV 2023

    ACM Class: I.4.5

  46. arXiv:2308.15827  [pdf, other]

    cs.CV

    Introducing Language Guidance in Prompt-based Continual Learning

    Authors: Muhammad Gul Zain Ali Khan, Muhammad Ferjad Naeem, Luc Van Gool, Didier Stricker, Federico Tombari, Muhammad Zeshan Afzal

    Abstract: Continual Learning aims to learn a single model on a sequence of tasks without having access to data from previous tasks. The biggest challenge in the domain still remains catastrophic forgetting: a loss in performance on seen classes of earlier tasks. Some existing methods rely on an expensive replay buffer to store a chunk of data from previous tasks. This, while promising, becomes expensive whe…

    Submitted 30 August, 2023; originally announced August 2023.

    Comments: Accepted at ICCV 2023

  47. 3D Adversarial Augmentations for Robust Out-of-Domain Predictions

    Authors: Alexander Lehner, Stefano Gasperini, Alvaro Marcos-Ramiro, Michael Schmidt, Nassir Navab, Benjamin Busam, Federico Tombari

    Abstract: Since real-world training datasets cannot properly sample the long tail of the underlying data distribution, corner cases and rare out-of-domain samples can severely hinder the performance of state-of-the-art models. This problem becomes even more severe for dense tasks, such as 3D semantic segmentation, where points of non-standard objects can be confidently associated with the wrong class. In this…

    Submitted 29 August, 2023; originally announced August 2023.

    Comments: 37 pages, 12 figures

  48. arXiv:2308.13357  [pdf, other]

    stat.ML cs.LG math.AT

    A topological model for partial equivariance in deep learning and data analysis

    Authors: Lucia Ferrari, Patrizio Frosini, Nicola Quercioli, Francesca Tombari

    Abstract: In this article, we propose a topological model to encode partial equivariance in neural networks. To this end, we introduce a class of operators, called P-GENEOs, that change data expressed by measurements, respecting the action of certain sets of transformations, in a non-expansive way. If the set of transformations acting is a group, then we obtain the so-called GENEOs. We then study the spaces…

    Submitted 25 August, 2023; originally announced August 2023.

  49. Robust Monocular Depth Estimation under Challenging Conditions

    Authors: Stefano Gasperini, Nils Morbitzer, HyunJun Jung, Nassir Navab, Federico Tombari

    Abstract: While state-of-the-art monocular depth estimation approaches achieve impressive results in ideal settings, they are highly unreliable under challenging illumination and weather conditions, such as at nighttime or in the presence of rain. In this paper, we uncover these safety-critical issues and tackle them with md4all: a simple and effective solution that works reliably under both adverse and ide…

    Submitted 18 August, 2023; originally announced August 2023.

    Comments: ICCV 2023. Source code and data: https://md4all.github.io

  50. arXiv:2308.08231  [pdf, other]

    cs.CV

    DDF-HO: Hand-Held Object Reconstruction via Conditional Directed Distance Field

    Authors: Chenyangguang Zhang, Yan Di, Ruida Zhang, Guangyao Zhai, Fabian Manhardt, Federico Tombari, Xiangyang Ji

    Abstract: Reconstructing hand-held objects from a single RGB image is an important and challenging problem. Existing works utilizing Signed Distance Fields (SDF) reveal limitations in comprehensively capturing the complex hand-object interactions, since SDF is only reliable within the proximity of the target, and hence, infeasible to simultaneously encode local hand and object cues. To address this issue, w…

    Submitted 26 October, 2023; v1 submitted 16 August, 2023; originally announced August 2023.

    Comments: Camera Ready for NeurIPS 2023