Search | arXiv e-print repository

Referring Atomic Video Action Recognition

Authors: Kunyu Peng, Jia Fu, Kailun Yang, Di Wen, Yufan Chen, Ruiping Liu, Junwei Zheng, Jiaming Zhang, M. Saquib Sarfraz, Rainer Stiefelhagen, Alina Roitberg

Abstract: We introduce a new task called Referring Atomic Video Action Recognition (RAVAR), aimed at identifying atomic actions of a particular person based on a textual description and the video data of this person. This task differs from traditional action recognition and localization, where predictions are delivered for all present individuals. In contrast, we focus on recognizing the correct atomic acti… ▽ More We introduce a new task called Referring Atomic Video Action Recognition (RAVAR), aimed at identifying atomic actions of a particular person based on a textual description and the video data of this person. This task differs from traditional action recognition and localization, where predictions are delivered for all present individuals. In contrast, we focus on recognizing the correct atomic action of a specific individual, guided by text. To explore this task, we present the RefAVA dataset, containing 36,630 instances with manually annotated textual descriptions of the individuals. To establish a strong initial benchmark, we implement and validate baselines from various domains, e.g., atomic action localization, video question answering, and text-video retrieval. Since these existing methods underperform on RAVAR, we introduce RefAtomNet -- a novel cross-stream attention-driven method specialized for the unique challenges of RAVAR: the need to interpret a textual referring expression for the targeted individual, utilize this reference to guide the spatial localization and harvest the prediction of the atomic actions for the referring person. The key ingredients are: (1) a multi-stream architecture that connects video, text, and a new location-semantic stream, and (2) cross-stream agent attention fusion and agent token fusion which amplify the most relevant information across these streams and consistently surpasses standard attention-based fusion on RAVAR. Extensive experiments demonstrate the effectiveness of RefAtomNet and its building blocks for recognizing the action of the described individual. The dataset and code will be made publicly available at https://github.com/KPeng9510/RAVAR. △ Less

Submitted 10 July, 2024; v1 submitted 1 July, 2024; originally announced July 2024.

Comments: Accepted to ECCV 2024. The dataset and code will be made publicly available at https://github.com/KPeng9510/RAVAR

arXiv:2406.11438 [pdf, other]

Search for Majorana Neutrinos with the Complete KamLAND-Zen Dataset

Authors: S. Abe, T. Araki, K. Chiba, T. Eda, M. Eizuka, Y. Funahashi, A. Furuto, A. Gando, Y. Gando, S. Goto, T. Hachiya, K. Hata, K. Ichimura, S. Ieki, H. Ikeda, K. Inoue, K. Ishidoshiro, Y. Kamei, N. Kawada, Y. Kishimoto, M. Koga, A. Marthe, Y. Matsumoto, T. Mitsui, H. Miyake , et al. (48 additional authors not shown)

Abstract: We present a search for neutrinoless double-beta ($0νββ$) decay of $^{136}$Xe using the full KamLAND-Zen 800 dataset with 745 kg of enriched xenon, corresponding to an exposure of $2.097$ ton yr of $^{136}$Xe. This updated search benefits from a more than twofold increase in exposure, recovery of photo-sensor gain, and reduced background from muon-induced spallation of xenon. Combining with the se… ▽ More We present a search for neutrinoless double-beta ($0νββ$) decay of $^{136}$Xe using the full KamLAND-Zen 800 dataset with 745 kg of enriched xenon, corresponding to an exposure of $2.097$ ton yr of $^{136}$Xe. This updated search benefits from a more than twofold increase in exposure, recovery of photo-sensor gain, and reduced background from muon-induced spallation of xenon. Combining with the search in the previous KamLAND-Zen phase, we obtain a lower limit for the $0νββ$ decay half-life of $T_{1/2}^{0ν} > 3.8 \times 10^{26}$ yr at 90% C.L., a factor of 1.7 improvement over the previous limit. The corresponding upper limits on the effective Majorana neutrino mass are in the range 28-122 meV using phenomenological nuclear matrix element calculations. △ Less

Submitted 17 June, 2024; originally announced June 2024.

Comments: arXiv admin note: text overlap with arXiv:2203.02139

arXiv:2405.14497 [pdf, other]

Improving Single Domain-Generalized Object Detection: A Focus on Diversification and Alignment

Authors: Muhammad Sohail Danish, Muhammad Haris Khan, Muhammad Akhtar Munir, M. Saquib Sarfraz, Mohsen Ali

Abstract: In this work, we tackle the problem of domain generalization for object detection, specifically focusing on the scenario where only a single source domain is available. We propose an effective approach that involves two key steps: diversifying the source domain and aligning detections based on class prediction confidence and localization. Firstly, we demonstrate that by carefully selecting a set o… ▽ More In this work, we tackle the problem of domain generalization for object detection, specifically focusing on the scenario where only a single source domain is available. We propose an effective approach that involves two key steps: diversifying the source domain and aligning detections based on class prediction confidence and localization. Firstly, we demonstrate that by carefully selecting a set of augmentations, a base detector can outperform existing methods for single domain generalization by a good margin. This highlights the importance of domain diversification in improving the performance of object detectors. Secondly, we introduce a method to align detections from multiple views, considering both classification and localization outputs. This alignment procedure leads to better generalized and well-calibrated object detector models, which are crucial for accurate decision-making in safety-critical applications. Our approach is detector-agnostic and can be seamlessly applied to both single-stage and two-stage detectors. To validate the effectiveness of our proposed methods, we conduct extensive experiments and ablations on challenging domain-shift scenarios. The results consistently demonstrate the superiority of our approach compared to existing methods. Our code and models are available at: https://github.com/msohaildanish/DivAlign △ Less

Submitted 23 May, 2024; originally announced May 2024.

arXiv:2405.13580 [pdf, other]

AltChart: Enhancing VLM-based Chart Summarization Through Multi-Pretext Tasks

Authors: Omar Moured, Jiaming Zhang, M. Saquib Sarfraz, Rainer Stiefelhagen

Abstract: Chart summarization is a crucial task for blind and visually impaired individuals as it is their primary means of accessing and interpreting graphical data. Crafting high-quality descriptions is challenging because it requires precise communication of essential details within the chart without vision perception. Many chart analysis methods, however, produce brief, unstructured responses that may c… ▽ More Chart summarization is a crucial task for blind and visually impaired individuals as it is their primary means of accessing and interpreting graphical data. Crafting high-quality descriptions is challenging because it requires precise communication of essential details within the chart without vision perception. Many chart analysis methods, however, produce brief, unstructured responses that may contain significant hallucinations, affecting their reliability for blind people. To address these challenges, this work presents three key contributions: (1) We introduce the AltChart dataset, comprising 10,000 real chart images, each paired with a comprehensive summary that features long-context, and semantically rich annotations. (2) We propose a new method for pretraining Vision-Language Models (VLMs) to learn fine-grained chart representations through training with multiple pretext tasks, yielding a performance gain with ${\sim}2.5\%$. (3) We conduct extensive evaluations of four leading chart summarization models, analyzing how accessible their descriptions are. Our dataset and codes are publicly available on our project page: https://github.com/moured/AltChart. △ Less

Submitted 22 May, 2024; originally announced May 2024.

Comments: Accepted in ICDAR 2024. Project page is at: https://github.com/moured/AltChart

arXiv:2405.02678 [pdf, other]

Position: Quo Vadis, Unsupervised Time Series Anomaly Detection?

Authors: M. Saquib Sarfraz, Mei-Yen Chen, Lukas Layer, Kunyu Peng, Marios Koulakis

Abstract: The current state of machine learning scholarship in Timeseries Anomaly Detection (TAD) is plagued by the persistent use of flawed evaluation metrics, inconsistent benchmarking practices, and a lack of proper justification for the choices made in novel deep learning-based model designs. Our paper presents a critical analysis of the status quo in TAD, revealing the misleading track of current resea… ▽ More The current state of machine learning scholarship in Timeseries Anomaly Detection (TAD) is plagued by the persistent use of flawed evaluation metrics, inconsistent benchmarking practices, and a lack of proper justification for the choices made in novel deep learning-based model designs. Our paper presents a critical analysis of the status quo in TAD, revealing the misleading track of current research and highlighting problematic methods, and evaluation practices. Our position advocates for a shift in focus from solely pursuing novel model designs to improving benchmarking practices, creating non-trivial datasets, and critically evaluating the utility of complex methods against simpler baselines. Our findings demonstrate the need for rigorous evaluation protocols, the creation of simple baselines, and the revelation that state-of-the-art deep anomaly detection models effectively learn linear mappings. These findings suggest the need for more exploration and development of simple and interpretable TAD methods. The increment of model complexity in the state-of-the-art deep-learning based models unfortunately offers very little improvement. We offer insights and suggestions for the field to move forward. Code: https://github.com/ssarfraz/QuoVadisTAD △ Less

Submitted 5 June, 2024; v1 submitted 4 May, 2024; originally announced May 2024.

Comments: ICML 2024

arXiv:2401.16923 [pdf, other]

Fourier Prompt Tuning for Modality-Incomplete Scene Segmentation

Authors: Ruiping Liu, Jiaming Zhang, Kunyu Peng, Yufan Chen, Ke Cao, Junwei Zheng, M. Saquib Sarfraz, Kailun Yang, Rainer Stiefelhagen

Abstract: Integrating information from multiple modalities enhances the robustness of scene perception systems in autonomous vehicles, providing a more comprehensive and reliable sensory framework. However, the modality incompleteness in multi-modal segmentation remains under-explored. In this work, we establish a task called Modality-Incomplete Scene Segmentation (MISS), which encompasses both system-level… ▽ More Integrating information from multiple modalities enhances the robustness of scene perception systems in autonomous vehicles, providing a more comprehensive and reliable sensory framework. However, the modality incompleteness in multi-modal segmentation remains under-explored. In this work, we establish a task called Modality-Incomplete Scene Segmentation (MISS), which encompasses both system-level modality absence and sensor-level modality errors. To avoid the predominant modality reliance in multi-modal fusion, we introduce a Missing-aware Modal Switch (MMS) strategy to proactively manage missing modalities during training. Utilizing bit-level batch-wise sampling enhances the model's performance in both complete and incomplete testing scenarios. Furthermore, we introduce the Fourier Prompt Tuning (FPT) method to incorporate representative spectral information into a limited number of learnable prompts that maintain robustness against all MISS scenarios. Akin to fine-tuning effects but with fewer tunable parameters (1.1%). Extensive experiments prove the efficacy of our proposed approach, showcasing an improvement of 5.84% mIoU over the prior state-of-the-art parameter-efficient methods in modality missing. The source code is publicly available at https://github.com/RuipingL/MISS. △ Less

Submitted 10 April, 2024; v1 submitted 30 January, 2024; originally announced January 2024.

Comments: Accepted to IEEE IV 2024. The source code is publicly available at https://github.com/RuipingL/MISS

arXiv:2312.06330 [pdf, other]

Navigating Open Set Scenarios for Skeleton-based Action Recognition

Authors: Kunyu Peng, Cheng Yin, Junwei Zheng, Ruiping Liu, David Schneider, Jiaming Zhang, Kailun Yang, M. Saquib Sarfraz, Rainer Stiefelhagen, Alina Roitberg

Abstract: In real-world scenarios, human actions often fall outside the distribution of training data, making it crucial for models to recognize known actions and reject unknown ones. However, using pure skeleton data in such open-set conditions poses challenges due to the lack of visual background cues and the distinct sparse structure of body pose sequences. In this paper, we tackle the unexplored Open-Se… ▽ More In real-world scenarios, human actions often fall outside the distribution of training data, making it crucial for models to recognize known actions and reject unknown ones. However, using pure skeleton data in such open-set conditions poses challenges due to the lack of visual background cues and the distinct sparse structure of body pose sequences. In this paper, we tackle the unexplored Open-Set Skeleton-based Action Recognition (OS-SAR) task and formalize the benchmark on three skeleton-based datasets. We assess the performance of seven established open-set approaches on our task and identify their limits and critical generalization issues when dealing with skeleton information. To address these challenges, we propose a distance-based cross-modality ensemble method that leverages the cross-modal alignment of skeleton joints, bones, and velocities to achieve superior open-set recognition performance. We refer to the key idea as CrossMax - an approach that utilizes a novel cross-modality mean max discrepancy suppression mechanism to align latent spaces during training and a cross-modality distance-based logits refinement method during testing. CrossMax outperforms existing approaches and consistently yields state-of-the-art results across all datasets and backbones. The benchmark, code, and models will be released at https://github.com/KPeng9510/OS-SAR. △ Less

Submitted 11 December, 2023; originally announced December 2023.

Comments: Accepted to AAAI 2024. The benchmark, code, and models will be released at https://github.com/KPeng9510/OS-SAR

arXiv:2311.04815 [pdf, other]

Domain Adaptive Object Detection via Balancing Between Self-Training and Adversarial Learning

Authors: Muhammad Akhtar Munir, Muhammad Haris Khan, M. Saquib Sarfraz, Mohsen Ali

Abstract: Deep learning based object detectors struggle generalizing to a new target domain bearing significant variations in object and background. Most current methods align domains by using image or instance-level adversarial feature alignment. This often suffers due to unwanted background and lacks class-specific alignment. A straightforward approach to promote class-level alignment is to use high confi… ▽ More Deep learning based object detectors struggle generalizing to a new target domain bearing significant variations in object and background. Most current methods align domains by using image or instance-level adversarial feature alignment. This often suffers due to unwanted background and lacks class-specific alignment. A straightforward approach to promote class-level alignment is to use high confidence predictions on unlabeled domain as pseudo-labels. These predictions are often noisy since model is poorly calibrated under domain shift. In this paper, we propose to leverage model's predictive uncertainty to strike the right balance between adversarial feature alignment and class-level alignment. We develop a technique to quantify predictive uncertainty on class assignments and bounding-box predictions. Model predictions with low uncertainty are used to generate pseudo-labels for self-training, whereas the ones with higher uncertainty are used to generate tiles for adversarial feature alignment. This synergy between tiling around uncertain object regions and generating pseudo-labels from highly certain object regions allows capturing both image and instance-level context during the model adaptation. We report thorough ablation study to reveal the impact of different components in our approach. Results on five diverse and challenging adaptation scenarios show that our approach outperforms existing state-of-the-art methods with noticeable margins. △ Less

Submitted 8 November, 2023; originally announced November 2023.

Comments: Accepted for publication in IEEE Transactions on Pattern Analysis and Machine Intelligence (Volume: 45, Issue: 12, December 2023); Extended version of our conference paper, arXiv link: arXiv:2110.00249

arXiv:2305.08420 [pdf, other]

Exploring Few-Shot Adaptation for Activity Recognition on Diverse Domains

Authors: Kunyu Peng, Di Wen, David Schneider, Jiaming Zhang, Kailun Yang, M. Saquib Sarfraz, Rainer Stiefelhagen, Alina Roitberg

Abstract: Domain adaptation is essential for activity recognition to ensure accurate and robust performance across diverse environments, sensor types, and data sources. Unsupervised domain adaptation methods have been extensively studied, yet, they require large-scale unlabeled data from the target domain. In this work, we focus on Few-Shot Domain Adaptation for Activity Recognition (FSDA-AR), which leverag… ▽ More Domain adaptation is essential for activity recognition to ensure accurate and robust performance across diverse environments, sensor types, and data sources. Unsupervised domain adaptation methods have been extensively studied, yet, they require large-scale unlabeled data from the target domain. In this work, we focus on Few-Shot Domain Adaptation for Activity Recognition (FSDA-AR), which leverages a very small amount of labeled target videos to achieve effective adaptation. This approach is appealing for applications because it only needs a few or even one labeled example per class in the target domain, ideal for recognizing rare but critical activities. However, the existing FSDA-AR works mostly focus on the domain adaptation on sports videos, where the domain diversity is limited. We propose a new FSDA-AR benchmark using five established datasets considering the adaptation on more diverse and challenging domains. Our results demonstrate that FSDA-AR performs comparably to unsupervised domain adaptation with significantly fewer labeled target domain samples. We further propose a novel approach, RelaMiX, to better leverage the few labeled target domain samples as knowledge guidance. RelaMiX encompasses a temporal relational attention network with relation dropout, alongside a cross-domain information alignment mechanism. Furthermore, it integrates a mechanism for mixing features within a latent space by using the few-shot target domain samples. The proposed RelaMiX solution achieves state-of-the-art performance on all datasets within the FSDA-AR benchmark. To encourage future research of few-shot domain adaptation for activity recognition, our code will be publicly available at https://github.com/KPeng9510/RelaMiX. △ Less

Submitted 27 April, 2024; v1 submitted 15 May, 2023; originally announced May 2023.

Comments: The benchmark and source code will be publicly available at https://github.com/KPeng9510/RelaMiX

arXiv:2303.00952 [pdf, other]

Towards Activated Muscle Group Estimation in the Wild

Authors: Kunyu Peng, David Schneider, Alina Roitberg, Kailun Yang, Jiaming Zhang, Chen Deng, Kaiyu Zhang, M. Saquib Sarfraz, Rainer Stiefelhagen

Abstract: In this paper, we tackle the new task of video-based Activated Muscle Group Estimation (AMGE) aiming at identifying active muscle regions during physical activity in the wild. To this intent, we provide the MuscleMap dataset featuring >15K video clips with 135 different activities and 20 labeled muscle groups. This dataset opens the vistas to multiple video-based applications in sports and rehabil… ▽ More In this paper, we tackle the new task of video-based Activated Muscle Group Estimation (AMGE) aiming at identifying active muscle regions during physical activity in the wild. To this intent, we provide the MuscleMap dataset featuring >15K video clips with 135 different activities and 20 labeled muscle groups. This dataset opens the vistas to multiple video-based applications in sports and rehabilitation medicine under flexible environment constraints. The proposed MuscleMap dataset is constructed with YouTube videos, specifically targeting High-Intensity Interval Training (HIIT) physical exercise in the wild. To make the AMGE model applicable in real-life situations, it is crucial to ensure that the model can generalize well to numerous types of physical activities not present during training and involving new combinations of activated muscles. To achieve this, our benchmark also covers an evaluation setting where the model is exposed to activity types excluded from the training set. Our experiments reveal that the generalizability of existing architectures adapted for the AMGE task remains a challenge. Therefore, we also propose a new approach, TransM3E, which employs a multi-modality feature fusion mechanism between both the video transformer model and the skeleton-based graph convolution model with novel cross-modal knowledge distillation executed on multi-classification tokens. The proposed method surpasses all popular video classification models when dealing with both, previously seen and new types of physical activities. The database and code can be found at https://github.com/KPeng9510/MuscleMap. △ Less

Submitted 5 August, 2024; v1 submitted 1 March, 2023; originally announced March 2023.

Comments: Accepted to ACM MM 2024. The database and code can be found at https://github.com/KPeng9510/MuscleMap

arXiv:2209.07601 [pdf, other]

Towards Improving Calibration in Object Detection Under Domain Shift

Authors: Muhammad Akhtar Munir, Muhammad Haris Khan, M. Saquib Sarfraz, Mohsen Ali

Abstract: With deep neural network based solution more readily being incorporated in real-world applications, it has been pressing requirement that predictions by such models, especially in safety-critical environments, be highly accurate and well-calibrated. Although some techniques addressing DNN calibration have been proposed, they are only limited to visual classification applications and in-domain pred… ▽ More With deep neural network based solution more readily being incorporated in real-world applications, it has been pressing requirement that predictions by such models, especially in safety-critical environments, be highly accurate and well-calibrated. Although some techniques addressing DNN calibration have been proposed, they are only limited to visual classification applications and in-domain predictions. Unfortunately, very little to no attention is paid towards addressing calibration of DNN-based visual object detectors, that occupy similar space and importance in many decision making systems as their visual classification counterparts. In this work, we study the calibration of DNN-based object detection models, particularly under domain shift. To this end, we first propose a new, plug-and-play, train-time calibration loss for object detection (coined as TCD). It can be used with various application-specific loss functions as an auxiliary loss function to improve detection calibration. Second, we devise a new implicit technique for improving calibration in self-training based domain adaptive detectors, featuring a new uncertainty quantification mechanism for object detection. We demonstrate TCD is capable of enhancing calibration with notable margins (1) across different DNN-based object detection paradigms both in in-domain and out-of-domain predictions, and (2) in different domain-adaptive detectors across challenging adaptation scenarios. Finally, we empirically show that our implicit calibration technique can be used in tandem with TCD during adaptation to further boost calibration in diverse domain shift scenarios. △ Less

Submitted 29 October, 2022; v1 submitted 15 September, 2022; originally announced September 2022.

Comments: To appear in NeurIPS 2022

arXiv:2205.07139 [pdf, other]

doi 10.1007/978-3-031-16443-9_66

Breaking with Fixed Set Pathology Recognition through Report-Guided Contrastive Training

Authors: Constantin Seibold, Simon Reiß, M. Saquib Sarfraz, Rainer Stiefelhagen, Jens Kleesiek

Abstract: When reading images, radiologists generate text reports describing the findings therein. Current state-of-the-art computer-aided diagnosis tools utilize a fixed set of predefined categories automatically extracted from these medical reports for training. This form of supervision limits the potential usage of models as they are unable to pick up on anomalies outside of their predefined set, thus, m… ▽ More When reading images, radiologists generate text reports describing the findings therein. Current state-of-the-art computer-aided diagnosis tools utilize a fixed set of predefined categories automatically extracted from these medical reports for training. This form of supervision limits the potential usage of models as they are unable to pick up on anomalies outside of their predefined set, thus, making it a necessity to retrain the classifier with additional data when faced with novel classes. In contrast, we investigate direct text supervision to break away from this closed set assumption. By doing so, we avoid noisy label extraction via text classifiers and incorporate more contextual information. We employ a contrastive global-local dual-encoder architecture to learn concepts directly from unstructured medical reports while maintaining its ability to perform free form classification. We investigate relevant properties of open set recognition for radiological data and propose a method to employ currently weakly annotated data into training. We evaluate our approach on the large-scale chest X-Ray datasets MIMIC-CXR, CheXpert, and ChestX-Ray14 for disease classification. We show that despite using unstructured medical report supervision, we perform on par with direct label supervision through a sophisticated inference setting. △ Less

Submitted 14 May, 2022; originally announced May 2022.

Comments: Provisionally Accepted at MICCAI2022

arXiv:2203.12997 [pdf, other]

Hierarchical Nearest Neighbor Graph Embedding for Efficient Dimensionality Reduction

Authors: M. Saquib Sarfraz, Marios Koulakis, Constantin Seibold, Rainer Stiefelhagen

Abstract: Dimensionality reduction is crucial both for visualization and preprocessing high dimensional data for machine learning. We introduce a novel method based on a hierarchy built on 1-nearest neighbor graphs in the original space which is used to preserve the grouping properties of the data distribution on multiple levels. The core of the proposal is an optimization-free projection that is competitiv… ▽ More Dimensionality reduction is crucial both for visualization and preprocessing high dimensional data for machine learning. We introduce a novel method based on a hierarchy built on 1-nearest neighbor graphs in the original space which is used to preserve the grouping properties of the data distribution on multiple levels. The core of the proposal is an optimization-free projection that is competitive with the latest versions of t-SNE and UMAP in performance and visualization quality while being an order of magnitude faster in run-time. Furthermore, its interpretable mechanics, the ability to project new data, and the natural separation of data clusters in visualizations make it a general purpose unsupervised dimension reduction technique. In the paper, we argue about the soundness of the proposed method and evaluate it on a diverse collection of datasets with sizes varying from 1K to 11M samples and dimensions from 28 to 16K. We perform comparisons with other state-of-the-art methods on multiple metrics and target dimensions highlighting its efficiency and performance. Code is available at https://github.com/koulakis/h-nne △ Less

Submitted 29 May, 2022; v1 submitted 24 March, 2022; originally announced March 2022.

Comments: CVPR 2022

arXiv:2110.15741 [pdf, ps, other]

On new parameters concerning a generalization of the parallelogram law in Banach spaces

Authors: Qi Liu, Zhijian Yang, Muhammad Sarfraz, Yongjin Li

Abstract: We shall introduce a new geometric constant $L^{\prime}_{\mathrm{Y}}(λ,X)$ based on a generalization of the parallelogram law. We first investigate some basic properties of this new coefficient. Next, it is shown that, for a Banach space, $L^{\prime}_{\mathrm{Y}}(λ,X)$ becomes $1$ for some $λ_0\in (0,1)$ if and only if the norm is induced by an inner product. Moreover, some relations between other… ▽ More We shall introduce a new geometric constant $L^{\prime}_{\mathrm{Y}}(λ,X)$ based on a generalization of the parallelogram law. We first investigate some basic properties of this new coefficient. Next, it is shown that, for a Banach space, $L^{\prime}_{\mathrm{Y}}(λ,X)$ becomes $1$ for some $λ_0\in (0,1)$ if and only if the norm is induced by an inner product. Moreover, some relations between other well-known geometric constants are studied. Finally, a sufficient condition which implies normal structure is presented. △ Less

Submitted 29 October, 2021; originally announced October 2021.

MSC Class: 46B20

arXiv:2110.00249 [pdf, other]

Synergizing between Self-Training and Adversarial Learning for Domain Adaptive Object Detection

Authors: Muhammad Akhtar Munir, Muhammad Haris Khan, M. Saquib Sarfraz, Mohsen Ali

Abstract: We study adapting trained object detectors to unseen domains manifesting significant variations of object appearance, viewpoints and backgrounds. Most current methods align domains by either using image or instance-level feature alignment in an adversarial fashion. This often suffers due to the presence of unwanted background and as such lacks class-specific alignment. A common remedy to promote c… ▽ More We study adapting trained object detectors to unseen domains manifesting significant variations of object appearance, viewpoints and backgrounds. Most current methods align domains by either using image or instance-level feature alignment in an adversarial fashion. This often suffers due to the presence of unwanted background and as such lacks class-specific alignment. A common remedy to promote class-level alignment is to use high confidence predictions on the unlabelled domain as pseudo labels. These high confidence predictions are often fallacious since the model is poorly calibrated under domain shift. In this paper, we propose to leverage model predictive uncertainty to strike the right balance between adversarial feature alignment and class-level alignment. Specifically, we measure predictive uncertainty on class assignments and the bounding box predictions. Model predictions with low uncertainty are used to generate pseudo-labels for self-supervision, whereas the ones with higher uncertainty are used to generate tiles for an adversarial feature alignment stage. This synergy between tiling around the uncertain object regions and generating pseudo-labels from highly certain object regions allows us to capture both the image and instance level context during the model adaptation stage. We perform extensive experiments covering various domain shift scenarios. Our approach improves upon existing state-of-the-art methods with visible margins. △ Less

Submitted 1 October, 2021; originally announced October 2021.

Comments: To appear in NeurIPS2021

arXiv:2103.11264 [pdf, other]

Temporally-Weighted Hierarchical Clustering for Unsupervised Action Segmentation

Authors: M. Saquib Sarfraz, Naila Murray, Vivek Sharma, Ali Diba, Luc Van Gool, Rainer Stiefelhagen

Abstract: Action segmentation refers to inferring boundaries of semantically consistent visual concepts in videos and is an important requirement for many video understanding tasks. For this and other video understanding tasks, supervised approaches have achieved encouraging performance but require a high volume of detailed frame-level annotations. We present a fully automatic and unsupervised approach for… ▽ More Action segmentation refers to inferring boundaries of semantically consistent visual concepts in videos and is an important requirement for many video understanding tasks. For this and other video understanding tasks, supervised approaches have achieved encouraging performance but require a high volume of detailed frame-level annotations. We present a fully automatic and unsupervised approach for segmenting actions in a video that does not require any training. Our proposal is an effective temporally-weighted hierarchical clustering algorithm that can group semantically consistent frames of the video. Our main finding is that representing a video with a 1-nearest neighbor graph by taking into account the time progression is sufficient to form semantically and temporally consistent clusters of frames where each cluster may represent some action in the video. Additionally, we establish strong unsupervised baselines for action segmentation and show significant performance improvements over published unsupervised methods on five challenging action segmentation datasets. Our code is available at https://github.com/ssarfraz/FINCH-Clustering/tree/master/TW-FINCH △ Less

Submitted 27 March, 2021; v1 submitted 20 March, 2021; originally announced March 2021.

Comments: CVPR 2021

arXiv:2101.03141 [pdf, other]

An Isolation Forest Learning Based Outlier Detection Approach for Effectively Classifying Cyber Anomalies

Authors: Rony Chowdhury Ripan, Iqbal H. Sarker, Md Musfique Anwar, Md. Hasan Furhad, Fazle Rahat, Mohammed Moshiul Hoque, Muhammad Sarfraz

Abstract: Cybersecurity has recently gained considerable interest in today's security issues because of the popularity of the Internet-of-Things (IoT), the considerable growth of mobile networks, and many related apps. Therefore, detecting numerous cyber-attacks in a network and creating an effective intrusion detection system plays a vital role in today's security. In this paper, we present an Isolation Fo… ▽ More Cybersecurity has recently gained considerable interest in today's security issues because of the popularity of the Internet-of-Things (IoT), the considerable growth of mobile networks, and many related apps. Therefore, detecting numerous cyber-attacks in a network and creating an effective intrusion detection system plays a vital role in today's security. In this paper, we present an Isolation Forest Learning-Based Outlier Detection Model for effectively classifying cyber anomalies. In order to evaluate the efficacy of the resulting Outlier Detection model, we also use several conventional machine learning approaches, such as Logistic Regression (LR), Support Vector Machine (SVM), AdaBoost Classifier (ABC), Naive Bayes (NB), and K-Nearest Neighbor (KNN). The effectiveness of our proposed Outlier Detection model is evaluated by conducting experiments on Network Intrusion Dataset with evaluation metrics such as precision, recall, F1-score, and accuracy. Experimental results show that the classification accuracy of cyber anomalies has been improved after removing outliers. △ Less

Submitted 9 December, 2020; originally announced January 2021.

Comments: 10 pages

arXiv:2008.08418 [pdf]

Anchor-free Small-scale Multispectral Pedestrian Detection

Authors: Alexander Wolpert, Michael Teutsch, M. Saquib Sarfraz, Rainer Stiefelhagen

Abstract: Multispectral images consisting of aligned visual-optical (VIS) and thermal infrared (IR) image pairs are well-suited for practical applications like autonomous driving or visual surveillance. Such data can be used to increase the performance of pedestrian detection especially for weakly illuminated, small-scaled, or partially occluded instances. The current state-of-the-art is based on variants o… ▽ More Multispectral images consisting of aligned visual-optical (VIS) and thermal infrared (IR) image pairs are well-suited for practical applications like autonomous driving or visual surveillance. Such data can be used to increase the performance of pedestrian detection especially for weakly illuminated, small-scaled, or partially occluded instances. The current state-of-the-art is based on variants of Faster R-CNN and thus passes through two stages: a proposal generator network with handcrafted anchor boxes for object localization and a classification network for verifying the object category. In this paper we propose a method for effective and efficient multispectral fusion of the two modalities in an adapted single-stage anchor-free base architecture. We aim at learning pedestrian representations based on object center and scale rather than direct bounding box predictions. In this way, we can both simplify the network architecture and achieve higher detection performance, especially for pedestrians under occlusion or at low object resolution. In addition, we provide a study on well-suited multispectral data augmentation techniques that improve the commonly used augmentations. The results show our method's effectiveness in detecting small-scaled pedestrians. We achieve 5.68% log-average miss rate in comparison to the best current state-of-the-art of 7.49% (25% improvement) on the challenging KAIST Multispectral Pedestrian Detection Benchmark. Code: https://github.com/HensoldtOptronicsCV/MultispectralPedestrianDetection △ Less

Submitted 20 August, 2020; v1 submitted 19 August, 2020; originally announced August 2020.

Comments: BMVC2020

arXiv:2004.02195 [pdf, other]

Clustering based Contrastive Learning for Improving Face Representations

Authors: Vivek Sharma, Makarand Tapaswi, M. Saquib Sarfraz, Rainer Stiefelhagen

Abstract: A good clustering algorithm can discover natural groupings in data. These groupings, if used wisely, provide a form of weak supervision for learning representations. In this work, we present Clustering-based Contrastive Learning (CCL), a new clustering-based representation learning approach that uses labels obtained from clustering along with video constraints to learn discriminative face features… ▽ More A good clustering algorithm can discover natural groupings in data. These groupings, if used wisely, provide a form of weak supervision for learning representations. In this work, we present Clustering-based Contrastive Learning (CCL), a new clustering-based representation learning approach that uses labels obtained from clustering along with video constraints to learn discriminative face features. We demonstrate our method on the challenging task of learning representations for video face clustering. Through several ablation studies, we analyze the impact of creating pair-wise positive and negative labels from different sources. Experiments on three challenging video face clustering datasets: BBT-0101, BF-0502, and ACCIO show that CCL achieves a new state-of-the-art on all datasets. △ Less

Submitted 5 April, 2020; originally announced April 2020.

Comments: To appear at IEEE International Conference on Automatic Face and Gesture Recognition (FG), 2020

arXiv:1908.00274 [pdf, other]

Content and Colour Distillation for Learning Image Translations with the Spatial Profile Loss

Authors: M. Saquib Sarfraz, Constantin Seibold, Haroon Khalid, Rainer Stiefelhagen

Abstract: Generative adversarial networks has emerged as a defacto standard for image translation problems. To successfully drive such models, one has to rely on additional networks e.g., discriminators and/or perceptual networks. Training these networks with pixel based losses alone are generally not sufficient to learn the target distribution. In this paper, we propose a novel method of computing the loss… ▽ More Generative adversarial networks has emerged as a defacto standard for image translation problems. To successfully drive such models, one has to rely on additional networks e.g., discriminators and/or perceptual networks. Training these networks with pixel based losses alone are generally not sufficient to learn the target distribution. In this paper, we propose a novel method of computing the loss directly between the source and target images that enable proper distillation of shape/content and colour/style. We show that this is useful in typical image-to-image translations allowing us to successfully drive the generator without relying on additional networks. We demonstrate this on many difficult image translation problems such as image-to-image domain mapping, single image super-resolution and photo realistic makeup transfer. Our extensive evaluation shows the effectiveness of the proposed formulation and its ability to synthesize realistic images. [Code release: https://github.com/ssarfraz/SPL] △ Less

Submitted 1 August, 2019; originally announced August 2019.

Comments: BMVC 2019

arXiv:1903.01000 [pdf, other]

Self-Supervised Learning of Face Representations for Video Face Clustering

Authors: Vivek Sharma, Makarand Tapaswi, M. Saquib Sarfraz, Rainer Stiefelhagen

Abstract: Analyzing the story behind TV series and movies often requires understanding who the characters are and what they are doing. With improving deep face models, this may seem like a solved problem. However, as face detectors get better, clustering/identification needs to be revisited to address increasing diversity in facial appearance. In this paper, we address video face clustering using unsupervis… ▽ More Analyzing the story behind TV series and movies often requires understanding who the characters are and what they are doing. With improving deep face models, this may seem like a solved problem. However, as face detectors get better, clustering/identification needs to be revisited to address increasing diversity in facial appearance. In this paper, we address video face clustering using unsupervised methods. Our emphasis is on distilling the essential information, identity, from the representations obtained using deep pre-trained face networks. We propose a self-supervised Siamese network that can be trained without the need for video/track based supervision, and thus can also be applied to image collections. We evaluate our proposed method on three video face clustering datasets. The experiments show that our methods outperform current state-of-the-art methods on all datasets. Video face clustering is lacking a common benchmark as current works are often evaluated with different metrics and/or different sets of face tracks. △ Less

Submitted 3 March, 2019; originally announced March 2019.

Comments: To appear at International Conference on Automatic Face and Gesture Recognition (2019) as an Oral. The datasets and code are available at https://github.com/vivoutlaw/SSIAM

arXiv:1902.11266 [pdf, other]

Efficient Parameter-free Clustering Using First Neighbor Relations

Authors: M. Saquib Sarfraz, Vivek Sharma, Rainer Stiefelhagen

Abstract: We present a new clustering method in the form of a single clustering equation that is able to directly discover groupings in the data. The main proposition is that the first neighbor of each sample is all one needs to discover large chains and finding the groups in the data. In contrast to most existing clustering algorithms our method does not require any hyper-parameters, distance thresholds an… ▽ More We present a new clustering method in the form of a single clustering equation that is able to directly discover groupings in the data. The main proposition is that the first neighbor of each sample is all one needs to discover large chains and finding the groups in the data. In contrast to most existing clustering algorithms our method does not require any hyper-parameters, distance thresholds and/or the need to specify the number of clusters. The proposed algorithm belongs to the family of hierarchical agglomerative methods. The technique has a very low computational overhead, is easily scalable and applicable to large practical problems. Evaluation on well known datasets from different domains ranging between 1077 and 8.1 million samples shows substantial performance gains when compared to the existing clustering techniques. △ Less

Submitted 28 February, 2019; originally announced February 2019.

Comments: CVPR 2019

arXiv:1711.10886 [pdf]

doi 10.1007/s00287-017-1077-7

A Multimodal Assistive System for Helping Visually Impaired in Social Interactions

Authors: M. Saquib Sarfraz, Angela Constantinescu, Melanie Zuzej, Rainer Stiefelhagen

Abstract: Access to non-verbal cues in social interactions is vital for people with visual impairment. It has been shown that non-verbal cues such as eye contact, number of people, their names and positions are helpful for individuals who are blind. While there is an increasing interest in developing systems to provide these cues less emphasis has been put in evaluating its impact on the visually impaired u… ▽ More Access to non-verbal cues in social interactions is vital for people with visual impairment. It has been shown that non-verbal cues such as eye contact, number of people, their names and positions are helpful for individuals who are blind. While there is an increasing interest in developing systems to provide these cues less emphasis has been put in evaluating its impact on the visually impaired users. In this paper, we provide this analysis by conducting a user study with 12 visually impaired participants in a typical social interaction setting. We design a real time multi-modal system that provides such non-verbal cues via audio and haptic interfaces. The study shows that such systems are generally perceived as useful in social interaction and brings forward some concerns that are not being addressed in its usability aspects. The study provides important insight about developing such technology for this significant part of society. △ Less

Submitted 29 November, 2017; originally announced November 2017.

Journal ref: Informatik Spectrum, Springer volume 40,No. 6. 2017

arXiv:1711.10378 [pdf, other]

A Pose-Sensitive Embedding for Person Re-Identification with Expanded Cross Neighborhood Re-Ranking

Authors: M. Saquib Sarfraz, Arne Schumann, Andreas Eberle, Rainer Stiefelhagen

Abstract: Person re identification is a challenging retrieval task that requires matching a person's acquired image across non overlapping camera views. In this paper we propose an effective approach that incorporates both the fine and coarse pose information of the person to learn a discriminative embedding. In contrast to the recent direction of explicitly modeling body parts or correcting for misalignmen… ▽ More Person re identification is a challenging retrieval task that requires matching a person's acquired image across non overlapping camera views. In this paper we propose an effective approach that incorporates both the fine and coarse pose information of the person to learn a discriminative embedding. In contrast to the recent direction of explicitly modeling body parts or correcting for misalignment based on these, we show that a rather straightforward inclusion of acquired camera view and/or the detected joint locations into a convolutional neural network helps to learn a very effective representation. To increase retrieval performance, re-ranking techniques based on computed distances have recently gained much attention. We propose a new unsupervised and automatic re-ranking framework that achieves state-of-the-art re-ranking performance. We show that in contrast to the current state-of-the-art re-ranking methods our approach does not require to compute new rank lists for each image pair (e.g., based on reciprocal neighbors) and performs well by using simple direct rank list based comparison or even by just using the already computed euclidean distances between the images. We show that both our learned representation and our re-ranking method achieve state-of-the-art performance on a number of challenging surveillance image and video datasets. The code is available online at: https://github.com/pse-ecn/pose-sensitive-embedding △ Less

Submitted 2 April, 2018; v1 submitted 28 November, 2017; originally announced November 2017.

Comments: CVPR 2018: v2 (fixes, added new results on PRW dataset)

arXiv:1707.06089 [pdf, other]

Deep View-Sensitive Pedestrian Attribute Inference in an end-to-end Model

Authors: M. Saquib Sarfraz, Arne Schumann, Yan Wang, Rainer Stiefelhagen

Abstract: Pedestrian attribute inference is a demanding problem in visual surveillance that can facilitate person retrieval, search and indexing. To exploit semantic relations between attributes, recent research treats it as a multi-label image classification task. The visual cues hinting at attributes can be strongly localized and inference of person attributes such as hair, backpack, shorts, etc., are hig… ▽ More Pedestrian attribute inference is a demanding problem in visual surveillance that can facilitate person retrieval, search and indexing. To exploit semantic relations between attributes, recent research treats it as a multi-label image classification task. The visual cues hinting at attributes can be strongly localized and inference of person attributes such as hair, backpack, shorts, etc., are highly dependent on the acquired view of the pedestrian. In this paper we assert this dependence in an end-to-end learning framework and show that a view-sensitive attribute inference is able to learn better attribute predictions. Our proposed model jointly predicts the coarse pose (view) of the pedestrian and learns specialized view-specific multi-label attribute predictions. We show in an extensive evaluation on three challenging datasets (PETA, RAP and WIDER) that our proposed end-to-end view-aware attribute prediction model provides competitive performance and improves on the published state-of-the-art on these datasets. △ Less

Submitted 19 July, 2017; originally announced July 2017.

Comments: accepted BMVC 2017

arXiv:1601.05347 [pdf, other]

doi 10.1007/s11263-016-0933-2

Deep Perceptual Mapping for Cross-Modal Face Recognition

Authors: M. Saquib Sarfraz, Rainer Stiefelhagen

Abstract: Cross modal face matching between the thermal and visible spectrum is a much desired capability for night-time surveillance and security applications. Due to a very large modality gap, thermal-to-visible face recognition is one of the most challenging face matching problem. In this paper, we present an approach to bridge this modality gap by a significant margin. Our approach captures the highly n… ▽ More Cross modal face matching between the thermal and visible spectrum is a much desired capability for night-time surveillance and security applications. Due to a very large modality gap, thermal-to-visible face recognition is one of the most challenging face matching problem. In this paper, we present an approach to bridge this modality gap by a significant margin. Our approach captures the highly non-linear relationship between the two modalities by using a deep neural network. Our model attempts to learn a non-linear mapping from visible to thermal spectrum while preserving the identity information. We show substantive performance improvement on three difficult thermal-visible face datasets. The presented approach improves the state-of-the-art by more than 10\% on UND-X1 dataset and by more than 15-30\% on NVESD dataset in terms of Rank-1 identification. Our method bridges the drop in performance due to the modality gap by more than 40\%. △ Less

Submitted 7 July, 2016; v1 submitted 20 January, 2016; originally announced January 2016.

Comments: This is the extended version (invited IJCV submission) with new results of our previous submission (arXiv:1507.02879)

arXiv:1507.02879 [pdf, other]

Deep Perceptual Mapping for Thermal to Visible Face Recognition

Authors: M. Saquib Sarfraz, Rainer Stiefelhagen

Abstract: Cross modal face matching between the thermal and visible spectrum is a much de- sired capability for night-time surveillance and security applications. Due to a very large modality gap, thermal-to-visible face recognition is one of the most challenging face matching problem. In this paper, we present an approach to bridge this modality gap by a significant margin. Our approach captures the highly… ▽ More Cross modal face matching between the thermal and visible spectrum is a much de- sired capability for night-time surveillance and security applications. Due to a very large modality gap, thermal-to-visible face recognition is one of the most challenging face matching problem. In this paper, we present an approach to bridge this modality gap by a significant margin. Our approach captures the highly non-linear relationship be- tween the two modalities by using a deep neural network. Our model attempts to learn a non-linear mapping from visible to thermal spectrum while preserving the identity in- formation. We show substantive performance improvement on a difficult thermal-visible face dataset. The presented approach improves the state-of-the-art by more than 10% in terms of Rank-1 identification and bridge the drop in performance due to the modality gap by more than 40%. △ Less

Submitted 10 July, 2015; originally announced July 2015.

Comments: BMVC 2015 (oral)

Showing 1–27 of 27 results for author: Sarfraz, M