Showing 1–50 of 69 results for author: Rehg, J M

Searching in archive cs.
  1. arXiv:2407.09648  [pdf, other]

    cs.CV

    3x2: 3D Object Part Segmentation by 2D Semantic Correspondences

    Authors: Anh Thai, Weiyao Wang, Hao Tang, Stefan Stojanov, Matt Feiszli, James M. Rehg

    Abstract: 3D object part segmentation is essential in computer vision applications. While substantial progress has been made in 2D object part segmentation, the 3D counterpart has received less attention, in part due to the scarcity of annotated 3D datasets, which are expensive to collect. In this work, we propose to leverage a few annotated 3D shapes or richly annotated 2D datasets to perform 3D object par… ▽ More

    Submitted 12 July, 2024; originally announced July 2024.

    Comments: Accepted to ECCV 2024

  2. arXiv:2406.18848  [pdf, other]

    cs.LG

    Temporally Multi-Scale Sparse Self-Attention for Physical Activity Data Imputation

    Authors: Hui Wei, Maxwell A. Xu, Colin Samplawski, James M. Rehg, Santosh Kumar, Benjamin M. Marlin

    Abstract: Wearable sensors enable health researchers to continuously collect data pertaining to the physiological state of individuals in real-world settings. However, such data can be subject to extensive missingness due to a complex combination of factors. In this work, we study the problem of imputation of missing step count data, one of the most ubiquitous forms of wearable sensor data. We construct a n… ▽ More

    Submitted 26 June, 2024; originally announced June 2024.

    Comments: Accepted by Conference on Health, Inference, and Learning (CHIL) 2024

  3. arXiv:2406.17126  [pdf, other]

    cs.CV cs.LG

    MM-SpuBench: Towards Better Understanding of Spurious Biases in Multimodal LLMs

    Authors: Wenqian Ye, Guangtao Zheng, Yunsheng Ma, Xu Cao, Bolin Lai, James M. Rehg, Aidong Zhang

    Abstract: Spurious bias, a tendency to use spurious correlations between non-essential input attributes and target variables for predictions, has revealed a severe robustness pitfall in deep learning models trained on single modality data. Multimodal Large Language Models (MLLMs), which integrate both vision and language models, have demonstrated strong capability in joint vision-language understanding. How… ▽ More

    Submitted 24 June, 2024; originally announced June 2024.

  4. arXiv:2406.10424  [pdf, other]

    cs.CV cs.AI

    What is the Visual Cognition Gap between Humans and Multimodal LLMs?

    Authors: Xu Cao, Bolin Lai, Wenqian Ye, Yunsheng Ma, Joerg Heintz, Jintai Chen, Jianguo Cao, James M. Rehg

    Abstract: Recently, Multimodal Large Language Models (MLLMs) have shown great promise in language-guided perceptual tasks such as recognition, segmentation, and object detection. However, their effectiveness in addressing visual cognition problems that require high-level reasoning is not well-established. One such challenge is abstract visual reasoning (AVR) -- the cognitive ability to discern relationships… ▽ More

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: 14 pages, 4 figures, the appendix will be updated soon

    MSC Class: 68T01

  5. arXiv:2404.03566  [pdf, other]

    cs.CV

    PointInfinity: Resolution-Invariant Point Diffusion Models

    Authors: Zixuan Huang, Justin Johnson, Shoubhik Debnath, James M. Rehg, Chao-Yuan Wu

    Abstract: We present PointInfinity, an efficient family of point cloud diffusion models. Our core idea is to use a transformer-based architecture with a fixed-size, resolution-invariant latent representation. This enables efficient training with low-resolution point clouds, while allowing high-resolution point clouds to be generated during inference. More importantly, we show that scaling the test-time reso… ▽ More

    Submitted 4 April, 2024; originally announced April 2024.

    Comments: Accepted to CVPR 2024, project website at https://zixuanh.com/projects/pointinfinity

  6. arXiv:2403.02090  [pdf, other]

    cs.CV cs.CL cs.LG

    Modeling Multimodal Social Interactions: New Challenges and Baselines with Densely Aligned Representations

    Authors: Sangmin Lee, Bolin Lai, Fiona Ryan, Bikram Boote, James M. Rehg

    Abstract: Understanding social interactions involving both verbal and non-verbal cues is essential for effectively interpreting social situations. However, most prior works on multimodal social cues focus predominantly on single-person behaviors or rely on holistic visual representations that are not aligned to utterances in multi-party environments. Consequently, they are limited in modeling the intricate… ▽ More

    Submitted 29 April, 2024; v1 submitted 4 March, 2024; originally announced March 2024.

    Comments: CVPR 2024 Oral

  7. arXiv:2312.14198  [pdf, other]

    cs.CV

    ZeroShape: Regression-based Zero-shot Shape Reconstruction

    Authors: Zixuan Huang, Stefan Stojanov, Anh Thai, Varun Jampani, James M. Rehg

    Abstract: We study the problem of single-image zero-shot 3D shape reconstruction. Recent works learn zero-shot shape reconstruction through generative modeling of 3D assets, but these models are computationally expensive at train and inference time. In contrast, the traditional approach to this problem is regression-based, where deterministic models are trained to directly regress the object shape. Such reg… ▽ More

    Submitted 16 January, 2024; v1 submitted 20 December, 2023; originally announced December 2023.

    Comments: Project page: https://zixuanh.com/projects/zeroshape.html

  8. arXiv:2312.12870  [pdf, other]

    cs.CV

    The Audio-Visual Conversational Graph: From an Egocentric-Exocentric Perspective

    Authors: Wenqi Jia, Miao Liu, Hao Jiang, Ishwarya Ananthabhotla, James M. Rehg, Vamsi Krishna Ithapu, Ruohan Gao

    Abstract: In recent years, the thriving development of research related to egocentric videos has provided a unique perspective for the study of conversational interactions, where both visual and audio signals play a crucial role. While most prior work focuses on learning about behaviors that directly involve the camera wearer, we introduce the Ego-Exocentric Conversational Graph Prediction problem, marking th… ▽ More

    Submitted 3 April, 2024; v1 submitted 20 December, 2023; originally announced December 2023.

  9. arXiv:2312.04524  [pdf, other]

    cs.CV

    RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models

    Authors: Ozgur Kara, Bariscan Kurtkaya, Hidir Yesiltepe, James M. Rehg, Pinar Yanardag

    Abstract: Recent advancements in diffusion-based models have demonstrated significant success in generating images from text. However, video editing models have not yet reached the same level of visual quality and user control. To address this, we introduce RAVE, a zero-shot video editing method that leverages pre-trained text-to-image diffusion models without additional training. RAVE takes an input video… ▽ More

    Submitted 7 December, 2023; originally announced December 2023.

    Comments: Project webpage: https://rave-video.github.io | GitHub: http://github.com/rehg-lab/RAVE

  10. arXiv:2312.04372  [pdf, other]

    cs.CL cs.AI

    LaMPilot: An Open Benchmark Dataset for Autonomous Driving with Language Model Programs

    Authors: Yunsheng Ma, Can Cui, Xu Cao, Wenqian Ye, Peiran Liu, Juanwu Lu, Amr Abdelraouf, Rohit Gupta, Kyungtae Han, Aniket Bera, James M. Rehg, Ziran Wang

    Abstract: Autonomous driving (AD) has made significant strides in recent years. However, existing frameworks struggle to interpret and execute spontaneous user instructions, such as "overtake the car ahead." Large Language Models (LLMs) have demonstrated impressive reasoning capabilities showing potential to bridge this gap. In this paper, we present LaMPilot, a novel framework that integrates LLMs into AD… ▽ More

    Submitted 4 April, 2024; v1 submitted 7 December, 2023; originally announced December 2023.

    Comments: CVPR 2024

  11. arXiv:2312.03849  [pdf, other]

    cs.CV

    LEGO: Learning EGOcentric Action Frame Generation via Visual Instruction Tuning

    Authors: Bolin Lai, Xiaoliang Dai, Lawrence Chen, Guan Pang, James M. Rehg, Miao Liu

    Abstract: Generating instructional images of human daily actions from an egocentric viewpoint serves as a key step towards efficient skill transfer. In this paper, we introduce a novel problem -- egocentric action frame generation. The goal is to synthesize an image depicting an action in the user's context (i.e., action frame) by conditioning on a user prompt and an input egocentric image. Notably, existin… ▽ More

    Submitted 22 March, 2024; v1 submitted 6 December, 2023; originally announced December 2023.

    Comments: 34 pages

  12. arXiv:2312.03533  [pdf, other]

    cs.CV

    Low-shot Object Learning with Mutual Exclusivity Bias

    Authors: Anh Thai, Ahmad Humayun, Stefan Stojanov, Zixuan Huang, Bikram Boote, James M. Rehg

    Abstract: This paper introduces Low-shot Object Learning with Mutual Exclusivity Bias (LSME), the first computational framing of mutual exclusivity bias, a phenomenon commonly observed in infants during word learning. We provide a novel dataset, comprehensive baselines, and a state-of-the-art method to enable the ML community to tackle this challenging learning task. The goal of LSME is to analyze an RGB im… ▽ More

    Submitted 6 December, 2023; originally announced December 2023.

    Comments: Accepted at NeurIPS 2023, Datasets and Benchmarks Track. Project website https://ngailapdi.github.io/projects/lsme/

  13. arXiv:2312.00151  [pdf, other]

    cs.CV cs.AI

    Which way is `right'?: Uncovering limitations of Vision-and-Language Navigation model

    Authors: Meera Hahn, Amit Raj, James M. Rehg

    Abstract: The challenging task of Vision-and-Language Navigation (VLN) requires embodied agents to follow natural language instructions to reach a goal location or object (e.g. `walk down the hallway and turn left at the piano'). For agents to complete this task successfully, they must be able to ground objects referenced in the instruction (e.g. `piano') into the visual scene as well as ground directional… ▽ More

    Submitted 30 November, 2023; originally announced December 2023.

  14. arXiv:2311.18259  [pdf, other]

    cs.CV cs.AI

    Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

    Authors: Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Mohaiminul Islam, Suyog Jain, et al. (76 additional authors not shown)

    Abstract: We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric and exocentric video of skilled human activities (e.g., sports, music, dance, bike repair). 740 participants from 13 cities worldwide performed these activities in 123 different natural scene contexts, yielding long-form captures from… ▽ More

    Submitted 29 April, 2024; v1 submitted 30 November, 2023; originally announced November 2023.

    Comments: updated baseline results and dataset statistics to match the released v2 data; added table to appendix comparing stats of Ego-Exo4D alongside other datasets

  15. arXiv:2311.00519  [pdf, other]

    cs.LG

    REBAR: Retrieval-Based Reconstruction for Time-series Contrastive Learning

    Authors: Maxwell A. Xu, Alexander Moreno, Hui Wei, Benjamin M. Marlin, James M. Rehg

    Abstract: The success of self-supervised contrastive learning hinges on identifying positive data pairs, such that when they are pushed together in embedding space, the space encodes useful information for subsequent downstream tasks. Constructing positive pairs is non-trivial as the pairing must be similar enough to reflect a shared semantic meaning, but different enough to capture within-class variation.… ▽ More

    Submitted 16 March, 2024; v1 submitted 1 November, 2023; originally announced November 2023.

    Comments: ICLR 2024 | Code available at: https://github.com/maxxu05/rebar

    Journal ref: The Eleventh International Conference on Learning Representations (2024)

  16. arXiv:2306.06325  [pdf, other]

    cs.LG

    Explaining a machine learning decision to physicians via counterfactuals

    Authors: Supriya Nagesh, Nina Mishra, Yonatan Naamad, James M. Rehg, Mehul A. Shah, Alexei Wagner

    Abstract: Machine learning models perform well on several healthcare tasks and can help reduce the burden on the healthcare system. However, the lack of explainability is a major roadblock to their adoption in hospitals. How can the decision of an ML model be explained to a physician? The explanations considered in this paper are counterfactuals (CFs), hypothetical scenarios that would have resulte… ▽ More

    Submitted 9 June, 2023; originally announced June 2023.

  17. arXiv:2305.03907  [pdf, other]

    cs.CV

    Listen to Look into the Future: Audio-Visual Egocentric Gaze Anticipation

    Authors: Bolin Lai, Fiona Ryan, Wenqi Jia, Miao Liu, James M. Rehg

    Abstract: Egocentric gaze anticipation serves as a key building block for the emerging capability of Augmented Reality. Notably, gaze behavior is driven by both visual cues and audio signals during daily activities. Motivated by this observation, we introduce the first model that leverages both the video and audio modalities for egocentric gaze anticipation. Specifically, we propose a Contrastive Spatial-Te… ▽ More

    Submitted 22 March, 2024; v1 submitted 5 May, 2023; originally announced May 2023.

    Comments: 30 pages

  18. arXiv:2304.06247  [pdf, other]

    cs.CV

    ShapeClipper: Scalable 3D Shape Learning from Single-View Images via Geometric and CLIP-based Consistency

    Authors: Zixuan Huang, Varun Jampani, Anh Thai, Yuanzhen Li, Stefan Stojanov, James M. Rehg

    Abstract: We present ShapeClipper, a novel method that reconstructs 3D object shapes from real-world single-view RGB images. Instead of relying on laborious 3D, multi-view or camera pose annotation, ShapeClipper learns shape reconstruction from a set of single-view segmented images. The key idea is to facilitate shape learning via CLIP-based shape consistency, where we encourage objects with similar CLIP en… ▽ More

    Submitted 12 April, 2023; originally announced April 2023.

    Comments: Accepted to CVPR 2023, project website at https://zixuanh.com/projects/shapeclipper.html

  19. arXiv:2303.16024  [pdf, other]

    cs.CV cs.SD eess.AS

    Egocentric Auditory Attention Localization in Conversations

    Authors: Fiona Ryan, Hao Jiang, Abhinav Shukla, James M. Rehg, Vamsi Krishna Ithapu

    Abstract: In a noisy conversation environment such as a dinner party, people often exhibit selective auditory attention, or the ability to focus on a particular speaker while tuning out others. Recognizing who somebody is listening to in a conversation is essential for developing technologies that can understand social behavior and devices that can augment human hearing by amplifying particular sound source… ▽ More

    Submitted 28 March, 2023; originally announced March 2023.

  20. arXiv:2212.08279  [pdf, other]

    cs.LG cs.CL cs.CV

    Werewolf Among Us: A Multimodal Dataset for Modeling Persuasion Behaviors in Social Deduction Games

    Authors: Bolin Lai, Hongxin Zhang, Miao Liu, Aryan Pariani, Fiona Ryan, Wenqi Jia, Shirley Anugrah Hayati, James M. Rehg, Diyi Yang

    Abstract: Persuasion modeling is a key building block for conversational agents. Existing works in this direction are limited to analyzing textual dialogue corpus. We argue that visual signals also play an important role in understanding human persuasive behaviors. In this paper, we introduce the first multimodal dataset for modeling persuasion behaviors. Our dataset includes 199 dialogue transcriptions and… ▽ More

    Submitted 15 December, 2022; originally announced December 2022.

    Comments: 17 pages

  21. arXiv:2212.07514  [pdf, other]

    cs.LG cs.AI

    PulseImpute: A Novel Benchmark Task for Pulsative Physiological Signal Imputation

    Authors: Maxwell A. Xu, Alexander Moreno, Supriya Nagesh, V. Burak Aydemir, David W. Wetter, Santosh Kumar, James M. Rehg

    Abstract: The promise of Mobile Health (mHealth) is the ability to use wearable sensors to monitor participant physiology at high frequencies during daily life to enable temporally-precise health interventions. However, a major challenge is frequent missing data. Despite a rich imputation literature, existing techniques are ineffective for the pulsative signals which comprise many mHealth applications, and… ▽ More

    Submitted 15 December, 2023; v1 submitted 14 December, 2022; originally announced December 2022.

    Comments: NeurIPS 2022 | Code available at: https://github.com/rehg-lab/pulseimpute | Data available at: https://doi.org/10.5281/zenodo.7129964

    Journal ref: Advances in Neural Information Processing Systems 35 (2022) 26874-26888

  22. arXiv:2211.15059  [pdf, other]

    cs.CV

    Learning Dense Object Descriptors from Multiple Views for Low-shot Category Generalization

    Authors: Stefan Stojanov, Anh Thai, Zixuan Huang, James M. Rehg

    Abstract: A hallmark of the deep learning era for computer vision is the successful use of large-scale labeled datasets to train feature representations for tasks ranging from object recognition and semantic segmentation to optical flow estimation and novel view synthesis of 3D scenes. In this work, we aim to learn dense discriminative object representations for low-shot category recognition without requiri… ▽ More

    Submitted 27 November, 2022; originally announced November 2022.

    Comments: Accepted at NeurIPS 2022. Code and data available at https://github.com/rehg-lab/dope_selfsup

  23. arXiv:2210.04864  [pdf, other]

    cs.CV cs.AI cs.CL

    Transformer-based Localization from Embodied Dialog with Large-scale Pre-training

    Authors: Meera Hahn, James M. Rehg

    Abstract: We address the challenging task of Localization via Embodied Dialog (LED). Given a dialog from two agents, an Observer navigating through an unknown environment and a Locator who is attempting to identify the Observer's location, the goal is to predict the Observer's final location in a map. We develop a novel LED-Bert architecture and present an effective pretraining strategy. We show that a grap… ▽ More

    Submitted 10 October, 2022; originally announced October 2022.

    Journal ref: International Joint Conference on Natural Language Processing (2022)

  24. arXiv:2208.04464  [pdf, other]

    cs.CV

    In the Eye of Transformer: Global-Local Correlation for Egocentric Gaze Estimation

    Authors: Bolin Lai, Miao Liu, Fiona Ryan, James M. Rehg

    Abstract: In this paper, we present the first transformer-based model to address the challenging problem of egocentric gaze estimation. We observe that the connection between the global scene context and local visual information is vital for localizing the gaze fixation from egocentric video frames. To this end, we design the transformer encoder to embed the global context as one additional visual token and… ▽ More

    Submitted 10 August, 2022; v1 submitted 8 August, 2022; originally announced August 2022.

    Comments: 23 pages

  25. arXiv:2204.10235  [pdf, other]

    cs.CV

    Planes vs. Chairs: Category-guided 3D shape learning without any 3D cues

    Authors: Zixuan Huang, Stefan Stojanov, Anh Thai, Varun Jampani, James M. Rehg

    Abstract: We present a novel 3D shape reconstruction method which learns to predict an implicit 3D shape representation from a single RGB image. Our approach uses a set of single-view images of multiple object categories without viewpoint annotation, forcing the model to learn across multiple object categories without 3D supervision. To facilitate learning with such minimal supervision, we use category labe… ▽ More

    Submitted 21 April, 2022; originally announced April 2022.

    Comments: Project page: https://zixuanh.com/multiclass3D

  26. arXiv:2203.11305  [pdf, other]

    cs.CV

    Generative Adversarial Network for Future Hand Segmentation from Egocentric Video

    Authors: Wenqi Jia, Miao Liu, James M. Rehg

    Abstract: We introduce the novel problem of anticipating a time series of future hand masks from egocentric video. A key challenge is to model the stochasticity of future head motions, which globally impact the head-worn camera video analysis. To this end, we propose a novel deep generative model -- EgoGAN, which uses a 3D Fully Convolutional Network to learn a spatio-temporal video representation for pixel… ▽ More

    Submitted 20 July, 2022; v1 submitted 21 March, 2022; originally announced March 2022.

  27. arXiv:2111.01222  [pdf, other]

    cs.LG stat.ML

    Kernel Deformed Exponential Families for Sparse Continuous Attention

    Authors: Alexander Moreno, Supriya Nagesh, Zhenke Wu, Walter Dempsey, James M. Rehg

    Abstract: Attention mechanisms take an expectation of a data representation with respect to probability weights. This creates summary statistics that focus on important features. Recently, (Martins et al. 2020, 2021) proposed continuous attention mechanisms, focusing on unimodal attention densities from the exponential and deformed exponential families: the latter has sparse support. (Farinhas et al. 2021)… ▽ More

    Submitted 12 November, 2021; v1 submitted 1 November, 2021; originally announced November 2021.

  28. arXiv:2111.01193  [pdf, other]

    cs.CL cs.LG

    Transformers for prompt-level EMA non-response prediction

    Authors: Supriya Nagesh, Alexander Moreno, Stephanie M. Carpenter, Jamie Yap, Soujanya Chatterjee, Steven Lloyd Lizotte, Neng Wan, Santosh Kumar, Cho Lam, David W. Wetter, Inbal Nahum-Shani, James M. Rehg

    Abstract: Ecological Momentary Assessments (EMAs) are an important psychological data source for measuring current cognitive states, affect, behavior, and environmental factors from participants in mobile health (mHealth) studies and treatment programs. Non-response, in which participants fail to respond to EMA prompts, is an endemic problem. The ability to accurately predict non-response could be utilized… ▽ More

    Submitted 1 November, 2021; originally announced November 2021.

  29. arXiv:2110.13998  [pdf, other]

    cs.LG cs.AI

    Efficient Learning and Decoding of the Continuous-Time Hidden Markov Model for Disease Progression Modeling

    Authors: Yu-Ying Liu, Alexander Moreno, Maxwell A. Xu, Shuang Li, Jena C. McDaniel, Nancy C. Brady, Agata Rozga, Fuxin Li, Le Song, James M. Rehg

    Abstract: The Continuous-Time Hidden Markov Model (CT-HMM) is an attractive approach to modeling disease progression due to its ability to describe noisy observations arriving irregularly in time. However, the lack of an efficient parameter learning algorithm for CT-HMM restricts its use to very small models or requires unrealistic constraints on the state transitions. In this paper, we present the first co… ▽ More

    Submitted 26 October, 2021; originally announced October 2021.

  30. arXiv:2110.09470  [pdf, other]

    cs.CV

    No RL, No Simulation: Learning to Navigate without Navigating

    Authors: Meera Hahn, Devendra Chaplot, Shubham Tulsiani, Mustafa Mukadam, James M. Rehg, Abhinav Gupta

    Abstract: Most prior methods for learning navigation policies require access to simulation environments, as they need online policy interaction and rely on ground-truth maps for rewards. However, building simulators is expensive (requires manual effort for each and every scene) and creates challenges in transferring learned policies to robotic platforms in the real-world, due to the sim-to-real domain gap.… ▽ More

    Submitted 22 October, 2021; v1 submitted 18 October, 2021; originally announced October 2021.

  31. arXiv:2110.07058  [pdf, other]

    cs.CV cs.AI

    Ego4D: Around the World in 3,000 Hours of Egocentric Video

    Authors: Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Cartillier, Sean Crane, Tien Do, et al. (60 additional authors not shown)

    Abstract: We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 931 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards with cons… ▽ More

    Submitted 11 March, 2022; v1 submitted 13 October, 2021; originally announced October 2021.

    Comments: To appear in the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. This version updates the baseline result numbers for the Hands and Objects benchmark (appendix)

  32. arXiv:2105.09544  [pdf, other]

    cs.CV

    Egocentric Activity Recognition and Localization on a 3D Map

    Authors: Miao Liu, Lingni Ma, Kiran Somasundaram, Yin Li, Kristen Grauman, James M. Rehg, Chao Li

    Abstract: Given a video captured from a first person perspective and the environment context of where the video is recorded, can we recognize what the person is doing and identify where the action occurs in the 3D space? We address this challenging problem of jointly recognizing and localizing actions of a mobile user on a known 3D map from egocentric videos. To this end, we propose a novel deep probabilist… ▽ More

    Submitted 12 August, 2022; v1 submitted 20 May, 2021; originally announced May 2021.

    Comments: European Conference on Computer Vision (ECCV) 2022

  33. arXiv:2101.12159  [pdf, other]

    cs.CV

    Discriminative Appearance Modeling with Multi-track Pooling for Real-time Multi-object Tracking

    Authors: Chanho Kim, Li Fuxin, Mazen Alotaibi, James M. Rehg

    Abstract: In multi-object tracking, the tracker maintains in its memory the appearance and motion information for each object in the scene. This memory is utilized for finding matches between tracks and detections and is updated based on the matching result. Many approaches model each target in isolation and lack the ability to use all the targets in the scene to jointly update the memory. This can be probl… ▽ More

    Submitted 28 January, 2021; originally announced January 2021.

  34. arXiv:2101.07296  [pdf, other]

    cs.CV cs.LG

    Using Shape to Categorize: Low-Shot Learning with an Explicit Shape Bias

    Authors: Stefan Stojanov, Anh Thai, James M. Rehg

    Abstract: It is widely accepted that reasoning about object shape is important for object recognition. However, the most powerful object recognition methods today do not explicitly make use of object shape during learning. In this work, motivated by recent developments in low-shot learning, findings in developmental psychology, and the increased use of synthetic data in computer vision research, we investig… ▽ More

    Submitted 20 June, 2021; v1 submitted 18 January, 2021; originally announced January 2021.

    Comments: Accepted at CVPR 2021. Project page, code and data available at https://rehg-lab.github.io/publication-pages/lowshot-shapebias/

  35. arXiv:2101.07295  [pdf, other]

    cs.LG cs.CV

    The Surprising Positive Knowledge Transfer in Continual 3D Object Shape Reconstruction

    Authors: Anh Thai, Stefan Stojanov, Zixuan Huang, Isaac Rehg, James M. Rehg

    Abstract: Continual learning has been extensively studied for classification tasks with methods developed to primarily avoid catastrophic forgetting, a phenomenon where earlier learned concepts are forgotten at the expense of more recent samples. In this work, we present a set of continual 3D object shape reconstruction tasks, including complete 3D shape reconstruction from different input modalities, as we… ▽ More

    Submitted 8 September, 2022; v1 submitted 18 January, 2021; originally announced January 2021.

    Comments: Accepted to 3DV 2022

  36. arXiv:2011.13341  [pdf, other]

    cs.CV

    4D Human Body Capture from Egocentric Video via 3D Scene Grounding

    Authors: Miao Liu, Dexin Yang, Yan Zhang, Zhaopeng Cui, James M. Rehg, Siyu Tang

    Abstract: We introduce a novel task of reconstructing a time series of second-person 3D human body meshes from monocular egocentric videos. The unique viewpoint and rapid embodied camera motion of egocentric videos raise additional technical barriers for human body capture. To address those challenges, we propose a simple yet effective optimization-based approach that leverages 2D observations of the entire… ▽ More

    Submitted 15 October, 2021; v1 submitted 26 November, 2020; originally announced November 2020.

  37. arXiv:2011.08277  [pdf, other]

    cs.CV cs.CL

    Where Are You? Localization from Embodied Dialog

    Authors: Meera Hahn, Jacob Krantz, Dhruv Batra, Devi Parikh, James M. Rehg, Stefan Lee, Peter Anderson

    Abstract: We present Where Are You? (WAY), a dataset of ~6k dialogs in which two humans -- an Observer and a Locator -- complete a cooperative localization task. The Observer is spawned at random in a 3D environment and can navigate from first-person views while answering questions from the Locator. The Locator must localize the Observer in a detailed top-down map by asking questions and giving instructions… ▽ More

    Submitted 3 September, 2021; v1 submitted 16 November, 2020; originally announced November 2020.

    Journal ref: EMNLP 2020

  38. arXiv:2006.07752  [pdf, other]

    cs.CV

    3D Reconstruction of Novel Object Shapes from Single Images

    Authors: Anh Thai, Stefan Stojanov, Vijay Upadhya, James M. Rehg

    Abstract: Accurately predicting the 3D shape of any arbitrary object in any pose from a single image is a key goal of computer vision research. This is challenging as it requires a model to learn a representation that can infer both the visible and occluded portions of any object using a limited training set. A training set that covers all possible object shapes is inherently infeasible. Such learning-based… ▽ More

    Submitted 1 September, 2021; v1 submitted 13 June, 2020; originally announced June 2020.

    Comments: First two authors contributed equally

  39. arXiv:2006.00626  [pdf, other]

    cs.CV

    In the Eye of the Beholder: Gaze and Actions in First Person Video

    Authors: Yin Li, Miao Liu, James M. Rehg

    Abstract: We address the task of jointly determining what a person is doing and where they are looking based on the analysis of video captured by a headworn camera. To facilitate our research, we first introduce the EGTEA Gaze+ dataset. Our dataset comes with videos, gaze tracking data, hand masks and action annotations, thereby providing the most comprehensive benchmark for First Person Vision (FPV). Movin… ▽ More

    Submitted 31 October, 2020; v1 submitted 31 May, 2020; originally announced June 2020.

    Comments: Submitted to TPAMI

  40. arXiv:2004.08051  [pdf, other]

    cs.RO cs.AI cs.CV cs.LG

    Approximate Inverse Reinforcement Learning from Vision-based Imitation Learning

    Authors: Keuntaek Lee, Bogdan Vlahov, Jason Gibson, James M. Rehg, Evangelos A. Theodorou

    Abstract: In this work, we present a method for obtaining an implicit objective function for vision-based navigation. The proposed methodology relies on Imitation Learning, Model Predictive Control (MPC), and an interpretation technique used in Deep Neural Networks. We use Imitation Learning as a means to do Inverse Reinforcement Learning in order to create an approximate cost function generator for a visua…

    Submitted 8 April, 2021; v1 submitted 16 April, 2020; originally announced April 2020.

  41. arXiv:2004.04690  [pdf, other]

    cs.LG cs.CV stat.ML

    Orthogonal Over-Parameterized Training

    Authors: Weiyang Liu, Rongmei Lin, Zhen Liu, James M. Rehg, Liam Paull, Li Xiong, Le Song, Adrian Weller

    Abstract: The inductive bias of a neural network is largely determined by the architecture and the training algorithm. To achieve good generalization, how to effectively train a neural network is of great importance. We propose a novel orthogonal over-parameterized training (OPT) framework that can provably minimize the hyperspherical energy which characterizes the diversity of neurons on a hypersphere. By…

    Submitted 4 June, 2021; v1 submitted 9 April, 2020; originally announced April 2020.

    Comments: CVPR 2021 Oral (43 Pages, Substantial Update from v3, Typos Fixed from v5)

  42. arXiv:2003.02501  [pdf, other]

    cs.CV

    Detecting Attended Visual Targets in Video

    Authors: Eunji Chong, Yongxin Wang, Nataniel Ruiz, James M. Rehg

    Abstract: We address the problem of detecting attention targets in video. Our goal is to identify where each person in each frame of a video is looking, and correctly handle the case where the gaze target is out-of-frame. Our novel architecture models the dynamic interaction between the scene and head features and infers time-varying attention targets. We introduce a new annotated dataset, VideoAttentionTar…

    Submitted 30 March, 2020; v1 submitted 5 March, 2020; originally announced March 2020.

    Comments: Accepted to CVPR 2020

  43. arXiv:2003.01169  [pdf, other]

    stat.ME cs.LG stat.ML

    A Robust Functional EM Algorithm for Incomplete Panel Count Data

    Authors: Alexander Moreno, Zhenke Wu, Jamie Yap, David Wetter, Cho Lam, Inbal Nahum-Shani, Walter Dempsey, James M. Rehg

    Abstract: Panel count data describes aggregated counts of recurrent events observed at discrete time points. To understand dynamics of health behaviors, the field of quantitative behavioral research has evolved to increasingly rely upon panel count data collected via multiple self reports, for example, about frequencies of smoking using in-the-moment surveys on mobile devices. However, missing reports are c…

    Submitted 19 June, 2020; v1 submitted 2 March, 2020; originally announced March 2020.

    Comments: 25 pages

  44. arXiv:1910.13003  [pdf, other]

    cs.LG cs.CV stat.ML

    Neural Similarity Learning

    Authors: Weiyang Liu, Zhen Liu, James M. Rehg, Le Song

    Abstract: Inner product-based convolution has been the founding stone of convolutional neural networks (CNNs), enabling end-to-end learning of visual representation. By generalizing inner product with a bilinear matrix, we propose the neural similarity which serves as a learnable parametric similarity measure for CNNs. Neural similarity naturally generalizes the convolution and enhances flexibility. Further…

    Submitted 6 December, 2019; v1 submitted 28 October, 2019; originally announced October 2019.

    Comments: NeurIPS 2019 (v3)

  45. arXiv:1906.04892  [pdf, other]

    cs.CV cs.LG

    Regularizing Neural Networks via Minimizing Hyperspherical Energy

    Authors: Rongmei Lin, Weiyang Liu, Zhen Liu, Chen Feng, Zhiding Yu, James M. Rehg, Li Xiong, Le Song

    Abstract: Inspired by the Thomson problem in physics where the distribution of multiple propelling electrons on a unit sphere can be modeled via minimizing some potential energy, hyperspherical energy minimization has demonstrated its potential in regularizing neural networks and improving their generalization power. In this paper, we first study the important role that hyperspherical energy plays in neural…

    Submitted 9 April, 2020; v1 submitted 11 June, 2019; originally announced June 2019.

    Comments: CVPR 2020

  46. arXiv:1905.05162  [pdf, other]

    cs.RO cs.LG

    Locally Weighted Regression Pseudo-Rehearsal for Online Learning of Vehicle Dynamics

    Authors: Grady Williams, Brian Goldfain, James M. Rehg, Evangelos A. Theodorou

    Abstract: We consider the problem of online adaptation of a neural network designed to represent vehicle dynamics. The neural network model is intended to be used by an MPC control law to autonomously control the vehicle. This problem is challenging because both the input and target distributions are non-stationary, and naive approaches to online adaptation result in catastrophic forgetting, which can in tu…

    Submitted 13 May, 2019; originally announced May 2019.

    Comments: 10 pages, 4 figures

  47. arXiv:1904.09936  [pdf, other]

    cs.CV

    Tripping through time: Efficient Localization of Activities in Videos

    Authors: Meera Hahn, Asim Kadav, James M. Rehg, Hans Peter Graf

    Abstract: Localizing moments in untrimmed videos via language queries is a new and interesting task that requires the ability to accurately ground language into video. Previous works have approached this task by processing the entire video, often more than once, to localize relevant activities. In the real world applications of this approach, such as video surveillance, efficiency is a key system requiremen…

    Submitted 18 August, 2020; v1 submitted 22 April, 2019; originally announced April 2019.

    Comments: Presented at BMVC, 2020

  48. arXiv:1904.05475  [pdf, other]

    cs.CV

    Learning to Generate Synthetic Data via Compositing

    Authors: Shashank Tripathi, Siddhartha Chandra, Amit Agrawal, Ambrish Tyagi, James M. Rehg, Visesh Chari

    Abstract: We present a task-aware approach to synthetic data generation. Our framework employs a trainable synthesizer network that is optimized to produce meaningful training samples by assessing the strengths and weaknesses of a `target' network. The synthesizer and target networks are trained in an adversarial manner wherein each network is updated with a goal to outdo the other. Additionally, we ensure…

    Submitted 8 July, 2019; v1 submitted 10 April, 2019; originally announced April 2019.

    Comments: Accepted to CVPR 2019, supplementary material included

  49. arXiv:1904.04812  [pdf, other]

    cs.CV

    Unsupervised 3D Pose Estimation with Geometric Self-Supervision

    Authors: Ching-Hang Chen, Ambrish Tyagi, Amit Agrawal, Dylan Drover, Rohith MV, Stefan Stojanov, James M. Rehg

    Abstract: We present an unsupervised learning approach to recover 3D human pose from 2D skeletal joints extracted from a single image. Our method does not require any multi-view image data, 3D skeletons, correspondences between 2D-3D points, or use previously learned 3D priors during training. A lifting network accepts 2D landmarks as inputs and generates a corresponding 3D skeleton estimate. During trainin…

    Submitted 9 April, 2019; originally announced April 2019.

  50. arXiv:1904.03249  [pdf, other]

    cs.CV

    Attention Distillation for Learning Video Representations

    Authors: Miao Liu, Xin Chen, Yun Zhang, Yin Li, James M. Rehg

    Abstract: We address the challenging problem of learning motion representations using deep models for video recognition. To this end, we make use of attention modules that learn to highlight regions in the video and aggregate features for recognition. Specifically, we propose to leverage output attention maps as a vehicle to transfer the learned representation from a motion (flow) network to an RGB network.…

    Submitted 14 August, 2020; v1 submitted 5 April, 2019; originally announced April 2019.