-
Large Language Models for Automatic Milestone Detection in Group Discussions
Authors:
Zhuoxu Duan,
Zhengye Yang,
Samuel Westby,
Christoph Riedl,
Brooke Foucault Welles,
Richard J. Radke
Abstract:
Large language models like GPT have proven widely successful on natural language understanding tasks based on written text documents. In this paper, we investigate an LLM's performance on recordings of a group oral communication task in which utterances are often truncated or not well-formed. We propose a new group task experiment involving a puzzle with several milestones that can be achieved in any order. We investigate methods for processing transcripts to detect if, when, and by whom a milestone has been completed. We demonstrate that iteratively prompting GPT with transcription chunks outperforms semantic similarity search methods using text embeddings, and further discuss the quality and randomness of GPT responses under different context window sizes.
Submitted 16 June, 2024;
originally announced June 2024.
-
Context-aware Video Anomaly Detection in Long-Term Datasets
Authors:
Zhengye Yang,
Richard Radke
Abstract:
Video anomaly detection research is generally evaluated on short, isolated benchmark videos only a few minutes long. However, in real-world environments, security cameras observe the same scene for months or years at a time, and the notion of anomalous behavior critically depends on context, such as the time of day, day of week, or schedule of events. Here, we propose a context-aware video anomaly detection algorithm, Trinity, specifically targeted to these scenarios. Trinity is especially well-suited to crowded scenes in which individuals cannot be easily tracked, and anomalies are due to speed, direction, or absence of group motion. Trinity is a contrastive learning framework that aims to learn alignments between context, appearance, and motion, and uses alignment quality to classify videos as normal or anomalous. We evaluate our algorithm on both conventional benchmarks and a public webcam-based dataset we collected that spans more than three months of activity.
Submitted 11 April, 2024;
originally announced April 2024.
-
Multimodality in Group Communication Research
Authors:
Robin Lange,
Brooke Foucault Welles,
Gyanendra Sharma,
Richard J. Radke,
Javier O. Garcia,
Christoph Riedl
Abstract:
Team interactions are often multisensory, requiring members to pick up on verbal, visual, spatial and body language cues. Multimodal research, research that captures multiple modes of communication such as audio and visual signals, is therefore integral to understanding these multisensory group communication processes. This type of research has gained traction in biomedical engineering and neuroscience, but it is unclear to what extent communication and management researchers conduct multimodal research. Our study finds that despite its utility, multimodal research is underutilized in the communication and management literatures. This paper then covers introductory guidelines for creating new multimodal research including considerations for sensors, data integration and ethical considerations.
Submitted 26 January, 2024;
originally announced January 2024.
-
Building Better Human-Agent Teams: Balancing Human Resemblance and Contribution in Voice Assistants
Authors:
Samuel Westby,
Richard J. Radke,
Christoph Riedl,
Brooke Foucault Welles
Abstract:
Voice assistants are increasingly prevalent, from personal devices to team environments. This study explores how voice type and contribution quality influence human-agent team performance and perceptions of anthropomorphism, animacy, intelligence, and trustworthiness. By manipulating both, we reveal mechanisms of perception and clarify ambiguity in previous work. Our results show that the human resemblance of a voice assistant's voice negatively interacts with the helpfulness of an agent's contribution to flip its effect on perceived anthropomorphism and perceived animacy. This means human teammates interpret the agent's contributions differently depending on its voice. Our study found no significant effect of voice on perceived intelligence, trustworthiness, or team performance. We find differences in these measures are caused by manipulating the helpfulness of an agent. These findings suggest that function matters more than form when designing agents for high-performing human-agent teams, but controlling perceptions of anthropomorphism and animacy can be unpredictable even with high human resemblance.
Submitted 16 May, 2024; v1 submitted 22 August, 2023;
originally announced August 2023.
-
Self-supervised Learning with Local Contrastive Loss for Detection and Semantic Segmentation
Authors:
Ashraful Islam,
Ben Lundell,
Harpreet Sawhney,
Sudipta Sinha,
Peter Morales,
Richard J. Radke
Abstract:
We present a self-supervised learning (SSL) method suitable for semi-global tasks such as object detection and semantic segmentation. We enforce local consistency between self-learned features, representing corresponding image locations of transformed versions of the same image, by minimizing a pixel-level local contrastive (LC) loss during training. LC-loss can be added to existing self-supervised learning methods with minimal overhead. We evaluate our SSL approach on two downstream tasks -- object detection and semantic segmentation, using COCO, PASCAL VOC, and CityScapes datasets. Our method outperforms the existing state-of-the-art SSL approaches by 1.9% on COCO object detection, 1.4% on PASCAL VOC detection, and 0.6% on CityScapes segmentation.
Submitted 7 December, 2022; v1 submitted 10 July, 2022;
originally announced July 2022.
-
Dynamic Distillation Network for Cross-Domain Few-Shot Recognition with Unlabeled Data
Authors:
Ashraful Islam,
Chun-Fu Chen,
Rameswar Panda,
Leonid Karlinsky,
Rogerio Feris,
Richard J. Radke
Abstract:
Most existing works in few-shot learning rely on meta-learning the network on a large base dataset which is typically from the same domain as the target dataset. We tackle the problem of cross-domain few-shot learning where there is a large shift between the base and target domain. The problem of cross-domain few-shot recognition with unlabeled target data is largely unaddressed in the literature. STARTUP was the first method that tackles this problem using self-training. However, it uses a fixed teacher pretrained on a labeled base dataset to create soft labels for the unlabeled target samples. As the base dataset and unlabeled dataset are from different domains, projecting the target images in the class-domain of the base dataset with a fixed pretrained model might be sub-optimal. We propose a simple dynamic distillation-based approach to facilitate unlabeled images from the novel/base dataset. We impose consistency regularization by calculating predictions from the weakly-augmented versions of the unlabeled images from a teacher network and matching them with the strongly augmented versions of the same images from a student network. The parameters of the teacher network are updated as an exponential moving average of the parameters of the student network. We show that the proposed network learns a representation that can be easily adapted to the target domain even though it has not been trained with target-specific classes during the pretraining phase. Our model outperforms the current state-of-the-art method by 4.4% for 1-shot and 3.6% for 5-shot classification in the BSCD-FSL benchmark, and also shows competitive performance on traditional in-domain few-shot learning tasks.
Submitted 1 November, 2021; v1 submitted 14 June, 2021;
originally announced June 2021.
-
A Broad Study on the Transferability of Visual Representations with Contrastive Learning
Authors:
Ashraful Islam,
Chun-Fu Chen,
Rameswar Panda,
Leonid Karlinsky,
Richard Radke,
Rogerio Feris
Abstract:
Tremendous progress has been made in visual representation learning, notably with the recent success of self-supervised contrastive learning methods. Supervised contrastive learning has also been shown to outperform its cross-entropy counterparts by leveraging labels for choosing where to contrast. However, there has been little work to explore the transfer capability of contrastive learning to a different domain. In this paper, we conduct a comprehensive study on the transferability of learned representations of different contrastive approaches for linear evaluation, full-network transfer, and few-shot recognition on 12 downstream datasets from different domains, and object detection tasks on MSCOCO and VOC0712. The results show that the contrastive approaches learn representations that are easily transferable to a different downstream task. We further observe that the joint objective of self-supervised contrastive loss with cross-entropy/supervised-contrastive loss leads to better transferability of these models over their supervised counterparts. Our analysis reveals that the representations learned from the contrastive approaches contain more low/mid-level semantics than cross-entropy models, which enables them to quickly adapt to a new task. Our codes and models will be publicly available to facilitate future research on transferability of visual representations.
Submitted 15 August, 2021; v1 submitted 24 March, 2021;
originally announced March 2021.
-
A Hybrid Attention Mechanism for Weakly-Supervised Temporal Action Localization
Authors:
Ashraful Islam,
Chengjiang Long,
Richard Radke
Abstract:
Weakly supervised temporal action localization is a challenging vision task due to the absence of ground-truth temporal locations of actions in the training videos. With only video-level supervision during training, most existing methods rely on a Multiple Instance Learning (MIL) framework to predict the start and end frame of each action category in a video. However, the existing MIL-based approach has a major limitation of only capturing the most discriminative frames of an action, ignoring the full extent of an activity. Moreover, these methods cannot model background activity effectively, which plays an important role in localizing foreground activities. In this paper, we present a novel framework named HAM-Net with a hybrid attention mechanism which includes temporal soft, semi-soft and hard attentions to address these issues. Our temporal soft attention module, guided by an auxiliary background class in the classification module, models the background activity by introducing an "action-ness" score for each video snippet. Moreover, our temporal semi-soft and hard attention modules, calculating two attention scores for each video snippet, help to focus on the less discriminative frames of an action to capture the full action boundary. Our proposed approach outperforms recent state-of-the-art methods by at least 2.2% mAP at IoU threshold 0.5 on the THUMOS14 dataset, and by at least 1.3% mAP at IoU threshold 0.75 on the ActivityNet1.2 dataset. Code can be found at: https://github.com/asrafulashiq/hamnet.
Submitted 24 March, 2021; v1 submitted 2 January, 2021;
originally announced January 2021.
-
Towards Visually Explaining Similarity Models
Authors:
Meng Zheng,
Srikrishna Karanam,
Terrence Chen,
Richard J. Radke,
Ziyan Wu
Abstract:
We consider the problem of visually explaining similarity models, i.e., explaining why a model predicts two images to be similar in addition to producing a scalar score. While much recent work in visual model interpretability has focused on gradient-based attention, these methods rely on a classification module to generate visual explanations. Consequently, they cannot readily explain other kinds of models that do not use or need classification-like loss functions (e.g., similarity models trained with a metric learning loss). In this work, we bridge this crucial gap, presenting a method to generate gradient-based visual attention for image similarity predictors. By relying solely on the learned feature embedding, we show that our approach can be applied to any kind of CNN-based similarity architecture, an important step towards generic visual explainability. We show that our resulting attention maps serve more than just interpretability; they can be infused into the model learning process itself with new trainable constraints. We show that the resulting similarity models perform, and can be visually explained, better than the corresponding baseline models trained without these constraints. We demonstrate our approach using extensive experiments on three different kinds of tasks: generic image retrieval, person re-identification, and low-shot semantic segmentation.
Submitted 13 October, 2020; v1 submitted 13 August, 2020;
originally announced August 2020.
-
Weakly Supervised Temporal Action Localization Using Deep Metric Learning
Authors:
Ashraful Islam,
Richard J. Radke
Abstract:
Temporal action localization is an important step towards video understanding. Most current action localization methods depend on untrimmed videos with full temporal annotations of action instances. However, it is expensive and time-consuming to annotate both action labels and temporal boundaries of videos. To this end, we propose a weakly supervised temporal action localization method that only requires video-level action instances as supervision during training. We propose a classification module to generate action labels for each segment in the video, and a deep metric learning module to learn the similarity between different action instances. We jointly optimize a balanced binary cross-entropy loss and a metric loss using a standard backpropagation algorithm. Extensive experiments demonstrate the effectiveness of both of these components in temporal localization. We evaluate our algorithm on two challenging untrimmed video datasets: THUMOS14 and ActivityNet1.2. Our approach improves the current state-of-the-art result for THUMOS14 by 6.5% mAP at IoU threshold 0.5, and achieves competitive performance for ActivityNet1.2.
Submitted 21 January, 2020;
originally announced January 2020.
-
Multi-person Spatial Interaction in a Large Immersive Display Using Smartphones as Touchpads
Authors:
Gyanendra Sharma,
Richard J Radke
Abstract:
In this paper, we present a multi-user interaction interface for a large immersive space that supports simultaneous screen interactions by combining (1) user input via personal smartphones and Bluetooth microphones, (2) spatial tracking via an overhead array of Kinect sensors, and (3) WebSocket interfaces to a webpage running on the large screen. Users are automatically, dynamically assigned personal and shared screen sub-spaces based on their tracked location with respect to the screen, and use a webpage on their personal smartphone for touchpad-type input. We report user experiments using our interaction framework that involve image selection and placement tasks, with the ultimate goal of realizing display-wall environments as viable, interactive workspaces with natural multimodal interfaces.
Submitted 26 November, 2019;
originally announced November 2019.
-
Towards Visually Explaining Variational Autoencoders
Authors:
Wenqian Liu,
Runze Li,
Meng Zheng,
Srikrishna Karanam,
Ziyan Wu,
Bir Bhanu,
Richard J. Radke,
Octavia Camps
Abstract:
Recent advances in Convolutional Neural Network (CNN) model interpretability have led to impressive progress in visualizing and understanding model predictions. In particular, gradient-based visual attention methods have driven much recent effort in using visual attention maps as a means for visual explanations. A key problem, however, is that these methods are designed for classification and categorization tasks, and their extension to explaining generative models, e.g., variational autoencoders (VAEs), is not trivial. In this work, we take a step towards bridging this crucial gap, proposing the first technique to visually explain VAEs by means of gradient-based attention. We present methods to generate visual attention from the learned latent space, and also demonstrate that such attention explanations serve more than just explaining VAE predictions. We show how these attention maps can be used to localize anomalies in images, demonstrating state-of-the-art performance on the MVTec-AD dataset. We also show how they can be infused into model training, helping bootstrap the VAE into learning improved latent space disentanglement, demonstrated on the dSprites dataset.
Submitted 14 April, 2020; v1 submitted 17 November, 2019;
originally announced November 2019.
-
Visual Similarity Attention
Authors:
Meng Zheng,
Srikrishna Karanam,
Terrence Chen,
Richard J. Radke,
Ziyan Wu
Abstract:
While there has been substantial progress in learning suitable distance metrics, these techniques in general lack transparency and decision reasoning, i.e., explaining why the input set of images is similar or dissimilar. In this work, we solve this key problem by proposing the first method to generate generic visual similarity explanations with gradient-based attention. We demonstrate that our technique is agnostic to the specific similarity model type, e.g., we show applicability to Siamese, triplet, and quadruplet models. Furthermore, we make our proposed similarity attention a principled part of the learning process, resulting in a new paradigm for learning similarity functions. We demonstrate that our learning mechanism results in more generalizable, as well as explainable, similarity models. Finally, we demonstrate the generality of our framework by means of experiments on a variety of tasks, including image retrieval, person re-identification, and low-shot semantic segmentation.
Submitted 3 May, 2022; v1 submitted 17 November, 2019;
originally announced November 2019.
-
Re-Identification with Consistent Attentive Siamese Networks
Authors:
Meng Zheng,
Srikrishna Karanam,
Ziyan Wu,
Richard J. Radke
Abstract:
We propose a new deep architecture for person re-identification (re-id). While re-id has seen much recent progress, spatial localization and view-invariant representation learning for robust cross-view matching remain key, unsolved problems. We address these questions by means of a new attention-driven Siamese learning architecture, called the Consistent Attentive Siamese Network. Our key innovations compared to existing, competing methods include (a) a flexible framework design that produces attention with only identity labels as supervision, (b) explicit mechanisms to enforce attention consistency among images of the same person, and (c) a new Siamese framework that integrates attention and attention consistency, producing principled supervisory signals as well as the first mechanism that can explain the reasoning behind the Siamese framework's predictions. We conduct extensive evaluations on the CUHK03-NP, DukeMTMC-ReID, and Market-1501 datasets and report competitive performance.
Submitted 11 April, 2019; v1 submitted 18 November, 2018;
originally announced November 2018.
-
Measuring the Temporal Behavior of Real-World Person Re-Identification
Authors:
Meng Zheng,
Srikrishna Karanam,
Richard J. Radke
Abstract:
Designing real-world person re-identification (re-id) systems requires attention to operational aspects not typically considered in academic research. Typically, the probe image or image sequence is matched to a gallery set with a fixed candidate list. On the other hand, in real-world applications of re-id, we would search for a person of interest in a gallery set that is continuously populated by new candidates over time. A key question of interest for the operator of such a system is: how long is a correct match to a probe likely to remain in a rank-k shortlist of candidates? In this paper, we propose to distill this information into what we call a Rank Persistence Curve (RPC), which, unlike a conventional cumulative match characteristic (CMC) curve, helps directly compare the temporal performance of different re-id algorithms. To carefully illustrate the concept, we collected a new multi-shot person re-id dataset called RPIfield. The RPIfield dataset is constructed using a network of 12 cameras with 112 explicitly time-stamped actor paths among about 4000 distractors. We then evaluate the temporal performance of different re-id algorithms with the proposed RPCs on single and pairwise camera videos from RPIfield, and discuss considerations for future research.
Submitted 16 August, 2018;
originally announced August 2018.
-
Rank Persistence: Assessing the Temporal Performance of Real-World Person Re-Identification
Authors:
Srikrishna Karanam,
Eric Lam,
Richard J. Radke
Abstract:
Designing useful person re-identification systems for real-world applications requires attention to operational aspects not typically considered in academic research. Here, we focus on the temporal aspect of re-identification; that is, instead of finding a match to a probe person of interest in a fixed candidate gallery, we consider the more realistic scenario in which the gallery is continuously populated by new candidates over a long time period. A key question of interest for an operator of such a system is: how long is a correct match to a probe likely to remain in a rank-k shortlist of possible candidates? We propose to distill this information into a Rank Persistence Curve (RPC), which allows different algorithms' temporal performance characteristics to be directly compared. We present examples to illustrate the RPC using a new long-term dataset with multiple candidate reappearances, and discuss considerations for future re-identification research that explicitly involves temporal aspects.
Submitted 4 June, 2017; v1 submitted 2 June, 2017;
originally announced June 2017.
-
A Systematic Evaluation and Benchmark for Person Re-Identification: Features, Metrics, and Datasets
Authors:
Srikrishna Karanam,
Mengran Gou,
Ziyan Wu,
Angels Rates-Borras,
Octavia Camps,
Richard J. Radke
Abstract:
Person re-identification (re-id) is a critical problem in video analytics applications such as security and surveillance. The public release of several datasets and code for vision algorithms has facilitated rapid progress in this area over the last few years. However, directly comparing re-id algorithms reported in the literature has become difficult since a wide variety of features, experimental protocols, and evaluation metrics are employed. In order to address this need, we present an extensive review and performance evaluation of single- and multi-shot re-id algorithms. The experimental protocol incorporates the most recent advances in both feature extraction and metric learning. To ensure a fair comparison, all of the approaches were implemented using a unified code library that includes 11 feature extraction algorithms and 22 metric learning and ranking techniques. All approaches were evaluated using a new large-scale dataset that closely mimics a real-world problem setting, in addition to 16 other publicly available datasets: VIPeR, GRID, CAVIAR, DukeMTMC4ReID, 3DPeS, PRID, V47, WARD, SAIVT-SoftBio, CUHK01, CUHK02, CUHK03, RAiD, iLIDSVID, HDA+ and Market1501. The evaluation codebase and results will be made publicly available for community use.
Submitted 14 February, 2018; v1 submitted 31 May, 2016;
originally announced May 2016.