Search | arXiv e-print repository

PEPL: Precision-Enhanced Pseudo-Labeling for Fine-Grained Image Classification in Semi-Supervised Learning

Authors: Bowen Tian, Songning Lai, Lujundong Li, Zhihao Shuai, Runwei Guan, Tian Wu, Yutao Yue

Abstract: Fine-grained image classification has witnessed significant advancements with the advent of deep learning and computer vision technologies. However, the scarcity of detailed annotations remains a major challenge, especially in scenarios where obtaining high-quality labeled data is costly or time-consuming. To address this limitation, we introduce Precision-Enhanced Pseudo-Labeling(PEPL) approach s… ▽ More Fine-grained image classification has witnessed significant advancements with the advent of deep learning and computer vision technologies. However, the scarcity of detailed annotations remains a major challenge, especially in scenarios where obtaining high-quality labeled data is costly or time-consuming. To address this limitation, we introduce Precision-Enhanced Pseudo-Labeling(PEPL) approach specifically designed for fine-grained image classification within a semi-supervised learning framework. Our method leverages the abundance of unlabeled data by generating high-quality pseudo-labels that are progressively refined through two key phases: initial pseudo-label generation and semantic-mixed pseudo-label generation. These phases utilize Class Activation Maps (CAMs) to accurately estimate the semantic content and generate refined labels that capture the essential details necessary for fine-grained classification. By focusing on semantic-level information, our approach effectively addresses the limitations of standard data augmentation and image-mixing techniques in preserving critical fine-grained features. We achieve state-of-the-art performance on benchmark datasets, demonstrating significant improvements over existing semi-supervised strategies, with notable boosts in accuracy and robustness.Our code has been open sourced at https://github.com/TianSuya/SemiFG. △ Less

Submitted 4 September, 2024; originally announced September 2024.

Comments: Under review

arXiv:2409.01867 [pdf, other]

ASD-Chat: An Innovative Dialogue Intervention System for Children with Autism based on LLM and VB-MAPP

Authors: Chengyun Deng, Shuzhong Lai, Chi Zhou, Mengyi Bao, Jingwen Yan, Haifeng Li, Lin Yao, Yueming Wang

Abstract: Early diagnosis and professional intervention can help children with autism spectrum disorder (ASD) return to normal life. However, the scarcity and imbalance of professional medical resources currently prevent many autistic children from receiving the necessary diagnosis and intervention. Therefore, numerous paradigms have been proposed that use computer technology to assist or independently cond… ▽ More Early diagnosis and professional intervention can help children with autism spectrum disorder (ASD) return to normal life. However, the scarcity and imbalance of professional medical resources currently prevent many autistic children from receiving the necessary diagnosis and intervention. Therefore, numerous paradigms have been proposed that use computer technology to assist or independently conduct ASD interventions, with the aim of alleviating the aforementioned problem. However, these paradigms often lack a foundation in clinical intervention methods and suffer from a lack of personalization. Addressing these concerns, we propose ASD-Chat, a social intervention system based on VB-MAPP (Verbal Behavior Milestones Assessment and Placement Program) and powered by ChatGPT as the backbone for dialogue generation. Specifically, we designed intervention paradigms and prompts based on the clinical intervention method VB-MAPP and utilized ChatGPT's generative capabilities to facilitate social dialogue interventions. Experimental results demonstrate that our proposed system achieves competitive intervention effects to those of professional interventionists, making it a promising tool for long-term interventions in real healthcare scenario in the future. △ Less

Submitted 3 September, 2024; originally announced September 2024.

arXiv:2409.01256 [pdf, other]

Real-time Accident Anticipation for Autonomous Driving Through Monocular Depth-Enhanced 3D Modeling

Authors: Haicheng Liao, Yongkang Li, Chengyue Wang, Songning Lai, Zhenning Li, Zilin Bian, Jaeyoung Lee, Zhiyong Cui, Guohui Zhang, Chengzhong Xu

Abstract: The primary goal of traffic accident anticipation is to foresee potential accidents in real time using dashcam videos, a task that is pivotal for enhancing the safety and reliability of autonomous driving technologies. In this study, we introduce an innovative framework, AccNet, which significantly advances the prediction capabilities beyond the current state-of-the-art (SOTA) 2D-based methods by… ▽ More The primary goal of traffic accident anticipation is to foresee potential accidents in real time using dashcam videos, a task that is pivotal for enhancing the safety and reliability of autonomous driving technologies. In this study, we introduce an innovative framework, AccNet, which significantly advances the prediction capabilities beyond the current state-of-the-art (SOTA) 2D-based methods by incorporating monocular depth cues for sophisticated 3D scene modeling. Addressing the prevalent challenge of skewed data distribution in traffic accident datasets, we propose the Binary Adaptive Loss for Early Anticipation (BA-LEA). This novel loss function, together with a multi-task learning strategy, shifts the focus of the predictive model towards the critical moments preceding an accident. {We rigorously evaluate the performance of our framework on three benchmark datasets--Dashcam Accident Dataset (DAD), Car Crash Dataset (CCD), and AnAn Accident Detection (A3D), and DADA-2000 Dataset--demonstrating its superior predictive accuracy through key metrics such as Average Precision (AP) and mean Time-To-Accident (mTTA). △ Less

Submitted 2 September, 2024; originally announced September 2024.

arXiv:2408.17443 [pdf, other]

Bridging Episodes and Semantics: A Novel Framework for Long-Form Video Understanding

Authors: Gueter Josmy Faure, Jia-Fong Yeh, Min-Hung Chen, Hung-Ting Su, Winston H. Hsu, Shang-Hong Lai

Abstract: While existing research often treats long-form videos as extended short videos, we propose a novel approach that more accurately reflects human cognition. This paper introduces BREASE: BRidging Episodes And SEmantics for Long-Form Video Understanding, a model that simulates episodic memory accumulation to capture action sequences and reinforces them with semantic knowledge dispersed throughout the… ▽ More While existing research often treats long-form videos as extended short videos, we propose a novel approach that more accurately reflects human cognition. This paper introduces BREASE: BRidging Episodes And SEmantics for Long-Form Video Understanding, a model that simulates episodic memory accumulation to capture action sequences and reinforces them with semantic knowledge dispersed throughout the video. Our work makes two key contributions: First, we develop an Episodic COmpressor (ECO) that efficiently aggregates crucial representations from micro to semi-macro levels. Second, we propose a Semantics reTRiever (SeTR) that enhances these aggregated representations with semantic information by focusing on the broader context, dramatically reducing feature dimensionality while preserving relevant macro-level information. Extensive experiments demonstrate that BREASE achieves state-of-the-art performance across multiple long video understanding benchmarks in both zero-shot and fully-supervised settings. The project page and code are at: https://joslefaure.github.io/assets/html/hermes.html. △ Less

Submitted 30 August, 2024; originally announced August 2024.

Comments: Accepted to the EVAL-FoMo Workshop at ECCV'24. Project page: https://joslefaure.github.io/assets/html/hermes.html

arXiv:2408.15996 [pdf, other]

Spatio-Temporal Context Prompting for Zero-Shot Action Detection

Authors: Wei-Jhe Huang, Min-Hung Chen, Shang-Hong Lai

Abstract: Spatio-temporal action detection encompasses the tasks of localizing and classifying individual actions within a video. Recent works aim to enhance this process by incorporating interaction modeling, which captures the relationship between people and their surrounding context. However, these approaches have primarily focused on fully-supervised learning, and the current limitation lies in the lack… ▽ More Spatio-temporal action detection encompasses the tasks of localizing and classifying individual actions within a video. Recent works aim to enhance this process by incorporating interaction modeling, which captures the relationship between people and their surrounding context. However, these approaches have primarily focused on fully-supervised learning, and the current limitation lies in the lack of generalization capability to recognize unseen action categories. In this paper, we aim to adapt the pretrained image-language models to detect unseen actions. To this end, we propose a method which can effectively leverage the rich knowledge of visual-language models to perform Person-Context Interaction. Meanwhile, our Context Prompting module will utilize contextual information to prompt labels, thereby enhancing the generation of more representative text features. Moreover, to address the challenge of recognizing distinct actions by multiple people at the same timestamp, we design the Interest Token Spotting mechanism which employs pretrained visual knowledge to find each person's interest context tokens, and then these tokens will be used for prompting to generate text features tailored to each individual. To evaluate the ability to detect unseen actions, we propose a comprehensive benchmark on J-HMDB, UCF101-24, and AVA datasets. The experiments show that our method achieves superior results compared to previous approaches and can be further extended to multi-action videos, bringing it closer to real-world applications. The code and data can be found in https://webber2933.github.io/ST-CLIP-project-page. △ Less

Submitted 29 August, 2024; v1 submitted 28 August, 2024; originally announced August 2024.

Comments: Project page: https://webber2933.github.io/ST-CLIP-project-page

arXiv:2408.15628 [pdf, other]

CSAD: Unsupervised Component Segmentation for Logical Anomaly Detection

Authors: Yu-Hsuan Hsieh, Shang-Hong Lai

Abstract: To improve logical anomaly detection, some previous works have integrated segmentation techniques with conventional anomaly detection methods. Although these methods are effective, they frequently lead to unsatisfactory segmentation results and require manual annotations. To address these drawbacks, we develop an unsupervised component segmentation technique that leverages foundation models to aut… ▽ More To improve logical anomaly detection, some previous works have integrated segmentation techniques with conventional anomaly detection methods. Although these methods are effective, they frequently lead to unsatisfactory segmentation results and require manual annotations. To address these drawbacks, we develop an unsupervised component segmentation technique that leverages foundation models to autonomously generate training labels for a lightweight segmentation network without human labeling. Integrating this new segmentation technique with our proposed Patch Histogram module and the Local-Global Student-Teacher (LGST) module, we achieve a detection AUROC of 95.3% in the MVTec LOCO AD dataset, which surpasses previous SOTA methods. Furthermore, our proposed method provides lower latency and higher throughput than most existing approaches. △ Less

Submitted 1 September, 2024; v1 submitted 28 August, 2024; originally announced August 2024.

arXiv:2408.07889 [pdf, other]

MambaVT: Spatio-Temporal Contextual Modeling for robust RGB-T Tracking

Authors: Simiao Lai, Chang Liu, Jiawen Zhu, Ben Kang, Yang Liu, Dong Wang, Huchuan Lu

Abstract: Existing RGB-T tracking algorithms have made remarkable progress by leveraging the global interaction capability and extensive pre-trained models of the Transformer architecture. Nonetheless, these methods mainly adopt imagepair appearance matching and face challenges of the intrinsic high quadratic complexity of the attention mechanism, resulting in constrained exploitation of temporal informatio… ▽ More Existing RGB-T tracking algorithms have made remarkable progress by leveraging the global interaction capability and extensive pre-trained models of the Transformer architecture. Nonetheless, these methods mainly adopt imagepair appearance matching and face challenges of the intrinsic high quadratic complexity of the attention mechanism, resulting in constrained exploitation of temporal information. Inspired by the recently emerged State Space Model Mamba, renowned for its impressive long sequence modeling capabilities and linear computational complexity, this work innovatively proposes a pure Mamba-based framework (MambaVT) to fully exploit spatio-temporal contextual modeling for robust visible-thermal tracking. Specifically, we devise the long-range cross-frame integration component to globally adapt to target appearance variations, and introduce short-term historical trajectory prompts to predict the subsequent target states based on local temporal location clues. Extensive experiments show the significant potential of vision Mamba for RGB-T tracking, with MambaVT achieving state-of-the-art performance on four mainstream benchmarks while requiring lower computational costs. We aim for this work to serve as a simple yet strong baseline, stimulating future research in this field. The code and pre-trained models will be made available. △ Less

Submitted 14 August, 2024; originally announced August 2024.

arXiv:2408.02073 [pdf]

Case-based reasoning approach for diagnostic screening of children with developmental delays

Authors: Zichen Song, Jiakang Li, Songning Lai, Sitan Huang

Abstract: According to the World Health Organization, the population of children with developmental delays constitutes approximately 6% to 9% of the total population. Based on the number of newborns in Huaibei, Anhui Province, China, in 2023 (94,420), it is estimated that there are about 7,500 cases (suspected cases of developmental delays) of suspicious cases annually. Early identification and appropriate… ▽ More According to the World Health Organization, the population of children with developmental delays constitutes approximately 6% to 9% of the total population. Based on the number of newborns in Huaibei, Anhui Province, China, in 2023 (94,420), it is estimated that there are about 7,500 cases (suspected cases of developmental delays) of suspicious cases annually. Early identification and appropriate early intervention for these children can significantly reduce the wastage of medical resources and societal costs. International research indicates that the optimal period for intervention in children with developmental delays is before the age of six, with the golden treatment period being before three and a half years of age. Studies have shown that children with developmental delays who receive early intervention exhibit significant improvement in symptoms; some may even fully recover. This research adopts a hybrid model combining a CNN-Transformer model with Case-Based Reasoning (CBR) to enhance the screening efficiency for children with developmental delays. The CNN-Transformer model is an excellent model for image feature extraction and recognition, effectively identifying features in bone age images to determine bone age. CBR is a technique for solving problems based on similar cases; it solves current problems based on past experiences, similar to how humans solve problems through learning from experience. Given CBR's memory capability to judge and compare new cases based on previously stored old cases, it is suitable for application in support systems with latent and variable characteristics. Therefore, this study utilizes the CNN-Transformer-CBR to establish a screening system for children with developmental delays, aiming to improve screening efficiency. △ Less

Submitted 18 July, 2024; originally announced August 2024.

arXiv:2408.00794 [pdf]

CCSRP: Robust Pruning of Spiking Neural Networks through Cooperative Coevolution

Authors: Zichen Song, Jiakang Li, Songning Lai, Sitan Huang

Abstract: Spiking neural networks (SNNs) have shown promise in various dynamic visual tasks, yet those ready for practical deployment often lack the compactness and robustness essential in resource-limited and safety-critical settings. Prior research has predominantly concentrated on enhancing the compactness or robustness of artificial neural networks through strategies like network pruning and adversarial… ▽ More Spiking neural networks (SNNs) have shown promise in various dynamic visual tasks, yet those ready for practical deployment often lack the compactness and robustness essential in resource-limited and safety-critical settings. Prior research has predominantly concentrated on enhancing the compactness or robustness of artificial neural networks through strategies like network pruning and adversarial training, with little exploration into similar methodologies for SNNs. Robust pruning of SNNs aims to reduce computational overhead while preserving both accuracy and robustness. Current robust pruning approaches generally necessitate expert knowledge and iterative experimentation to establish suitable pruning criteria or auxiliary modules, thus constraining their broader application. Concurrently, evolutionary algorithms (EAs) have been employed to automate the pruning of artificial neural networks, delivering remarkable outcomes yet overlooking the aspect of robustness. In this work, we propose CCSRP, an innovative robust pruning method for SNNs, underpinned by cooperative co-evolution. Robust pruning is articulated as a tri-objective optimization challenge, striving to balance accuracy, robustness, and compactness concurrently, resolved through a cooperative co-evolutionary pruning framework that independently prunes filters across layers using EAs. Our experiments on CIFAR-10 and SVHN demonstrate that CCSRP can match or exceed the performance of the latest methodologies. △ Less

Submitted 18 July, 2024; originally announced August 2024.

arXiv:2407.20224 [pdf, other]

Can Editing LLMs Inject Harm?

Authors: Canyu Chen, Baixiang Huang, Zekun Li, Zhaorun Chen, Shiyang Lai, Xiongxiao Xu, Jia-Chen Gu, Jindong Gu, Huaxiu Yao, Chaowei Xiao, Xifeng Yan, William Yang Wang, Philip Torr, Dawn Song, Kai Shu

Abstract: Knowledge editing has been increasingly adopted to correct the false or outdated knowledge in Large Language Models (LLMs). Meanwhile, one critical but under-explored question is: can knowledge editing be used to inject harm into LLMs? In this paper, we propose to reformulate knowledge editing as a new type of safety threat for LLMs, namely Editing Attack, and conduct a systematic investigation wi… ▽ More Knowledge editing has been increasingly adopted to correct the false or outdated knowledge in Large Language Models (LLMs). Meanwhile, one critical but under-explored question is: can knowledge editing be used to inject harm into LLMs? In this paper, we propose to reformulate knowledge editing as a new type of safety threat for LLMs, namely Editing Attack, and conduct a systematic investigation with a newly constructed dataset EditAttack. Specifically, we focus on two typical safety risks of Editing Attack including Misinformation Injection and Bias Injection. For the risk of misinformation injection, we first categorize it into commonsense misinformation injection and long-tail misinformation injection. Then, we find that editing attacks can inject both types of misinformation into LLMs, and the effectiveness is particularly high for commonsense misinformation injection. For the risk of bias injection, we discover that not only can biased sentences be injected into LLMs with high effectiveness, but also one single biased sentence injection can cause a bias increase in general outputs of LLMs, which are even highly irrelevant to the injected sentence, indicating a catastrophic impact on the overall fairness of LLMs. Then, we further illustrate the high stealthiness of editing attacks, measured by their impact on the general knowledge and reasoning capacities of LLMs, and show the hardness of defending editing attacks with empirical evidence. Our discoveries demonstrate the emerging misuse risks of knowledge editing techniques on compromising the safety alignment of LLMs and the feasibility of disseminating misinformation or bias with LLMs as new channels. △ Less

Submitted 16 August, 2024; v1 submitted 29 July, 2024; originally announced July 2024.

Comments: The first two authors contributed equally. 9 pages for main paper, 36 pages including appendix. The code, results, dataset for this paper and more resources are on the project website: https://llm-editing.github.io

arXiv:2407.10981 [pdf, other]

Systematic Literature Review of AI-enabled Spectrum Management in 6G and Future Networks

Authors: Bushra Sabir, Shuiqiao Yang, David Nguyen, Nan Wu, Alsharif Abuadbba, Hajime Suzuki, Shangqi Lai, Wei Ni, Ding Ming, Surya Nepal

Abstract: Artificial Intelligence (AI) has advanced significantly in various domains like healthcare, finance, and cybersecurity, with successes such as DeepMind's medical imaging and Tesla's autonomous vehicles. As telecommunications transition from 5G to 6G, integrating AI is crucial for complex demands like data processing, network optimization, and security. Despite ongoing research, there's a gap in co… ▽ More Artificial Intelligence (AI) has advanced significantly in various domains like healthcare, finance, and cybersecurity, with successes such as DeepMind's medical imaging and Tesla's autonomous vehicles. As telecommunications transition from 5G to 6G, integrating AI is crucial for complex demands like data processing, network optimization, and security. Despite ongoing research, there's a gap in consolidating AI-enabled Spectrum Management (AISM) advancements. Traditional spectrum management methods are inadequate for 6G due to its dynamic and complex demands, making AI essential for spectrum optimization, security, and network efficiency. This study aims to address this gap by: (i) Conducting a systematic review of AISM methodologies, focusing on learning models, data handling techniques, and performance metrics. (ii) Examining security and privacy concerns related to AI and traditional network threats within AISM contexts. Using the Systematic Literature Review (SLR) methodology, we meticulously analyzed 110 primary studies to: (a) Identify AI's utility in spectrum management. (b) Develop a taxonomy of AI approaches. (c) Classify datasets and performance metrics used. (d) Detail security and privacy threats and countermeasures. Our findings reveal challenges such as under-explored AI usage in critical AISM systems, computational resource demands, transparency issues, the need for real-world datasets, imbalances in security and privacy research, and the absence of testbeds, benchmarks, and security analysis tools. Addressing these challenges is vital for maximizing AI's potential in advancing 6G technology. △ Less

Submitted 12 June, 2024; originally announced July 2024.

Comments: 35 pages

arXiv:2406.19101 [pdf, other]

DocKylin: A Large Multimodal Model for Visual Document Understanding with Efficient Visual Slimming

Authors: Jiaxin Zhang, Wentao Yang, Songxuan Lai, Zecheng Xie, Lianwen Jin

Abstract: Current multimodal large language models (MLLMs) face significant challenges in visual document understanding (VDU) tasks due to the high resolution, dense text, and complex layouts typical of document images. These characteristics demand a high level of detail perception ability from MLLMs. While increasing input resolution improves detail perception capability, it also leads to longer sequences… ▽ More Current multimodal large language models (MLLMs) face significant challenges in visual document understanding (VDU) tasks due to the high resolution, dense text, and complex layouts typical of document images. These characteristics demand a high level of detail perception ability from MLLMs. While increasing input resolution improves detail perception capability, it also leads to longer sequences of visual tokens, increasing computational costs and straining the models' ability to handle long contexts. To address these challenges, we introduce DocKylin, a document-centric MLLM that performs visual content slimming at both the pixel and token levels, thereby reducing token sequence length in VDU scenarios. We introduce an Adaptive Pixel Slimming (APS) preprocessing module to perform pixel-level slimming, increasing the proportion of informative pixels. Moreover, we propose a novel Dynamic Token Slimming (DTS) module to conduct token-level slimming, filtering essential tokens and removing others to adaptively create a more compact visual sequence. Experiments demonstrate DocKylin's promising performance across various VDU benchmarks and the effectiveness of each component. △ Less

Submitted 2 September, 2024; v1 submitted 27 June, 2024; originally announced June 2024.

arXiv:2406.17213 [pdf, other]

doi 10.18653/v1/2021.findings-emnlp.339

Detecting Frames in News Headlines and Lead Images in U.S. Gun Violence Coverage

Authors: Isidora Chara Tourni, Lei Guo, Hengchang Hu, Edward Halim, Prakash Ishwar, Taufiq Daryanto, Mona Jalal, Boqi Chen, Margrit Betke, Fabian Zhafransyah, Sha Lai, Derry Tanti Wijaya

Abstract: News media structure their reporting of events or issues using certain perspectives. When describing an incident involving gun violence, for example, some journalists may focus on mental health or gun regulation, while others may emphasize the discussion of gun rights. Such perspectives are called \say{frames} in communication research. We study, for the first time, the value of combining lead i… ▽ More News media structure their reporting of events or issues using certain perspectives. When describing an incident involving gun violence, for example, some journalists may focus on mental health or gun regulation, while others may emphasize the discussion of gun rights. Such perspectives are called \say{frames} in communication research. We study, for the first time, the value of combining lead images and their contextual information with text to identify the frame of a given news article. We observe that using multiple modes of information(article- and image-derived features) improves prediction of news frames over any single mode of information when the images are relevant to the frames of the headlines. We also observe that frame image relevance is related to the ease of conveying frames via images, which we call frame concreteness. Additionally, we release the first multimodal news framing dataset related to gun violence in the U.S., curated and annotated by communication researchers. The dataset will allow researchers to further examine the use of multiple information modalities for studying media framing. △ Less

Submitted 24 June, 2024; originally announced June 2024.

Comments: published at Findings of the Association for Computational Linguistics: EMNLP 2021

arXiv:2406.16529 [pdf, other]

Towards Better Graph-based Cross-document Relation Extraction via Non-bridge Entity Enhancement and Prediction Debiasing

Authors: Hao Yue, Shaopeng Lai, Chengyi Yang, Liang Zhang, Junfeng Yao, Jinsong Su

Abstract: Cross-document Relation Extraction aims to predict the relation between target entities located in different documents. In this regard, the dominant models commonly retain useful information for relation prediction via bridge entities, which allows the model to elaborately capture the intrinsic interdependence between target entities. However, these studies ignore the non-bridge entities, each of… ▽ More Cross-document Relation Extraction aims to predict the relation between target entities located in different documents. In this regard, the dominant models commonly retain useful information for relation prediction via bridge entities, which allows the model to elaborately capture the intrinsic interdependence between target entities. However, these studies ignore the non-bridge entities, each of which co-occurs with only one target entity and offers the semantic association between target entities for relation prediction. Besides, the commonly-used dataset--CodRED contains substantial NA instances, leading to the prediction bias during inference. To address these issues, in this paper, we propose a novel graph-based cross-document RE model with non-bridge entity enhancement and prediction debiasing. Specifically, we use a unified entity graph to integrate numerous non-bridge entities with target entities and bridge entities, modeling various associations between them, and then use a graph recurrent network to encode this graph. Finally, we introduce a novel debiasing strategy to calibrate the original prediction distribution. Experimental results on the closed and open settings show that our model significantly outperforms all baselines, including the GPT-3.5-turbo and InstructUIE, achieving state-of-the-art performance. Particularly, our model obtains 66.23% and 55.87% AUC points in the official leaderboard\footnote{\url{https://codalab.lisn.upsaclay.fr/competitions/3770#results}} under the two settings, respectively, ranking the first place in all submissions since December 2023. Our code is available at https://github.com/DeepLearnXMU/CoRE-NEPD. △ Less

Submitted 24 June, 2024; originally announced June 2024.

Comments: Accepted to ACL 2024 Findings

arXiv:2406.12330 [pdf, other]

Security and Privacy of 6G Federated Learning-enabled Dynamic Spectrum Sharing

Authors: Viet Vo, Thusitha Dayaratne, Blake Haydon, Xingliang Yuan, Shangqi Lai, Sharif Abuadbba, Hajime Suzuki, Carsten Rudolph

Abstract: Spectrum sharing is increasingly vital in 6G wireless communication, facilitating dynamic access to unused spectrum holes. Recently, there has been a significant shift towards employing machine learning (ML) techniques for sensing spectrum holes. In this context, federated learning (FL)-enabled spectrum sensing technology has garnered wide attention, allowing for the construction of an aggregated… ▽ More Spectrum sharing is increasingly vital in 6G wireless communication, facilitating dynamic access to unused spectrum holes. Recently, there has been a significant shift towards employing machine learning (ML) techniques for sensing spectrum holes. In this context, federated learning (FL)-enabled spectrum sensing technology has garnered wide attention, allowing for the construction of an aggregated ML model without disclosing the private spectrum sensing information of wireless user devices. However, the integrity of collaborative training and the privacy of spectrum information from local users have remained largely unexplored. This article first examines the latest developments in FL-enabled spectrum sharing for prospective 6G scenarios. It then identifies practical attack vectors in 6G to illustrate potential AI-powered security and privacy threats in these contexts. Finally, the study outlines future directions, including practical defense challenges and guidelines. △ Less

Submitted 18 June, 2024; originally announced June 2024.

Comments: 7 pages, 5 figures. The paper is submitted to IEEE Networks for review

arXiv:2406.12299 [pdf, other]

Exploiting and Securing ML Solutions in Near-RT RIC: A Perspective of an xApp

Authors: Thusitha Dayaratne, Viet Vo, Shangqi Lai, Sharif Abuadbba, Blake Haydon, Hajime Suzuki, Xingliang Yuan, Carsten Rudolph

Abstract: Open Radio Access Networks (O-RAN) are emerging as a disruptive technology, revolutionising traditional mobile network architecture and deployments in the current 5G and the upcoming 6G era. Disaggregation of network architecture, inherent support for AI/ML workflows, cloud-native principles, scalability, and interoperability make O-RAN attractive to network providers for beyond-5G and 6G deployme… ▽ More Open Radio Access Networks (O-RAN) are emerging as a disruptive technology, revolutionising traditional mobile network architecture and deployments in the current 5G and the upcoming 6G era. Disaggregation of network architecture, inherent support for AI/ML workflows, cloud-native principles, scalability, and interoperability make O-RAN attractive to network providers for beyond-5G and 6G deployments. Notably, the ability to deploy custom applications, including Machine Learning (ML) solutions as xApps or rApps on the RAN Intelligent Controllers (RICs), has immense potential for network function and resource optimisation. However, the openness, nascent standards, and distributed architecture of O-RAN and RICs introduce numerous vulnerabilities exploitable through multiple attack vectors, which have not yet been fully explored. To address this gap and ensure robust systems before large-scale deployments, this work analyses the security of ML-based applications deployed on the RIC platform. We focus on potential attacks, defence mechanisms, and pave the way for future research towards a more robust RIC platform. △ Less

Submitted 18 June, 2024; originally announced June 2024.

arXiv:2406.05036 [pdf, other]

TimeSieve: Extracting Temporal Dynamics through Information Bottlenecks

Authors: Ninghui Feng, Songning Lai, Jiayu Yang, Fobao Zhou, Zhenxiao Yin, Hang Zhao

Abstract: Time series forecasting has become an increasingly popular research area due to its critical applications in various real-world domains such as traffic management, weather prediction, and financial analysis. Despite significant advancements, existing models face notable challenges, including the necessity of manual hyperparameter tuning for different datasets, and difficulty in effectively disting… ▽ More Time series forecasting has become an increasingly popular research area due to its critical applications in various real-world domains such as traffic management, weather prediction, and financial analysis. Despite significant advancements, existing models face notable challenges, including the necessity of manual hyperparameter tuning for different datasets, and difficulty in effectively distinguishing signal from redundant features in data characterized by strong seasonality. These issues hinder the generalization and practical application of time series forecasting models. To solve this issues, we propose an innovative time series forecasting model TimeSieve designed to address these challenges. Our approach employs wavelet transforms to preprocess time series data, effectively capturing multi-scale features without the need for additional parameters or manual hyperparameter tuning. Additionally, we introduce the information bottleneck theory that filters out redundant features from both detail and approximation coefficients, retaining only the most predictive information. This combination reduces significantly improves the model's accuracy. Extensive experiments demonstrate that our model outperforms existing state-of-the-art methods on 70% of the datasets, achieving higher predictive accuracy and better generalization across diverse datasets. Our results validate the effectiveness of our approach in addressing the key challenges in time series forecasting, paving the way for more reliable and efficient predictive models in practical applications. The code for our model is available at https://github.com/xll0328/TimeSieve. △ Less

Submitted 21 August, 2024; v1 submitted 7 June, 2024; originally announced June 2024.

arXiv:2405.19647 [pdf, other]

FTS: A Framework to Find a Faithful TimeSieve

Authors: Songning Lai, Ninghui Feng, Jiechao Gao, Hao Wang, Haochen Sui, Xin Zou, Jiayu Yang, Wenshuo Chen, Hang Zhao, Xuming Hu, Yutao Yue

Abstract: The field of time series forecasting has garnered significant attention in recent years, prompting the development of advanced models like TimeSieve, which demonstrates impressive performance. However, an analysis reveals certain unfaithfulness issues, including high sensitivity to random seeds, input and layer noise perturbations and parametric perturbations. Recognizing these challenges, we emba… ▽ More The field of time series forecasting has garnered significant attention in recent years, prompting the development of advanced models like TimeSieve, which demonstrates impressive performance. However, an analysis reveals certain unfaithfulness issues, including high sensitivity to random seeds, input and layer noise perturbations and parametric perturbations. Recognizing these challenges, we embark on a quest to define the concept of \textbf{\underline{F}aithful \underline{T}ime\underline{S}ieve \underline{(FTS)}}, a model that consistently delivers reliable and robust predictions. To address these issues, we propose a novel framework aimed at identifying and rectifying unfaithfulness in TimeSieve. Our framework is designed to enhance the model's stability and faithfulness, ensuring that its outputs are less susceptible to the aforementioned factors. Experimentation validates the effectiveness of our proposed framework, demonstrating improved faithfulness in the model's behavior. △ Less

Submitted 10 August, 2024; v1 submitted 29 May, 2024; originally announced May 2024.

Journal ref: IJCAI2024 workshop

arXiv:2405.04009 [pdf, other]

Structured Click Control in Transformer-based Interactive Segmentation

Authors: Long Xu, Yongquan Chen, Rui Huang, Feng Wu, Shiwu Lai

Abstract: Click-point-based interactive segmentation has received widespread attention due to its efficiency. However, it's hard for existing algorithms to obtain precise and robust responses after multiple clicks. In this case, the segmentation results tend to have little change or are even worse than before. To improve the robustness of the response, we propose a structured click intent model based on gra… ▽ More Click-point-based interactive segmentation has received widespread attention due to its efficiency. However, it's hard for existing algorithms to obtain precise and robust responses after multiple clicks. In this case, the segmentation results tend to have little change or are even worse than before. To improve the robustness of the response, we propose a structured click intent model based on graph neural networks, which adaptively obtains graph nodes via the global similarity of user-clicked Transformer tokens. Then the graph nodes will be aggregated to obtain structured interaction features. Finally, the dual cross-attention will be used to inject structured interaction features into vision Transformer features, thereby enhancing the control of clicks over segmentation results. Extensive experiments demonstrated the proposed algorithm can serve as a general structure in improving Transformer-based interactive segmenta?tion performance. The code and data will be released at https://github.com/hahamyt/scc. △ Less

Submitted 7 May, 2024; originally announced May 2024.

Comments: 10 pages, 6 figures, submitted to NeurIPS 2024

arXiv:2404.14135 [pdf, other]

Text in the Dark: Extremely Low-Light Text Image Enhancement

Authors: Che-Tsung Lin, Chun Chet Ng, Zhi Qin Tan, Wan Jun Nah, Xinyu Wang, Jie Long Kew, Pohao Hsu, Shang Hong Lai, Chee Seng Chan, Christopher Zach

Abstract: Extremely low-light text images are common in natural scenes, making scene text detection and recognition challenging. One solution is to enhance these images using low-light image enhancement methods before text extraction. However, previous methods often do not try to particularly address the significance of low-level features, which are crucial for optimal performance on downstream scene text t… ▽ More Extremely low-light text images are common in natural scenes, making scene text detection and recognition challenging. One solution is to enhance these images using low-light image enhancement methods before text extraction. However, previous methods often do not try to particularly address the significance of low-level features, which are crucial for optimal performance on downstream scene text tasks. Further research is also hindered by the lack of extremely low-light text datasets. To address these limitations, we propose a novel encoder-decoder framework with an edge-aware attention module to focus on scene text regions during enhancement. Our proposed method uses novel text detection and edge reconstruction losses to emphasize low-level scene text features, leading to successful text extraction. Additionally, we present a Supervised Deep Curve Estimation (Supervised-DCE) model to synthesize extremely low-light images based on publicly available scene text datasets such as ICDAR15 (IC15). We also labeled texts in the extremely low-light See In the Dark (SID) and ordinary LOw-Light (LOL) datasets to allow for objective assessment of extremely low-light image enhancement through scene text tasks. Extensive experiments show that our model outperforms state-of-the-art methods in terms of both image quality and scene text metrics on the widely-used LOL, SID, and synthetic IC15 datasets. Code and dataset will be released publicly at https://github.com/chunchet-ng/Text-in-the-Dark. △ Less

Submitted 22 April, 2024; originally announced April 2024.

Comments: The first two authors contributed equally to this work

arXiv:2403.12999 [pdf]

Prompt Selection and Augmentation for Few Examples Code Generation in Large Language Model and its Application in Robotics Control

Authors: On Tai Wu, Frodo Kin Sun Chan, Zunhao Zhang, Yan Nei Law, Benny Drescher, Edmond Shiao Bun Lai

Abstract: Few-shot prompting and step-by-step reasoning have enhanced the capabilities of Large Language Models (LLMs) in tackling complex tasks including code generation. In this paper, we introduce a prompt selection and augmentation algorithm aimed at improving mathematical reasoning and robot arm operations. Our approach incorporates a multi-stage example augmentation scheme combined with an example sel… ▽ More Few-shot prompting and step-by-step reasoning have enhanced the capabilities of Large Language Models (LLMs) in tackling complex tasks including code generation. In this paper, we introduce a prompt selection and augmentation algorithm aimed at improving mathematical reasoning and robot arm operations. Our approach incorporates a multi-stage example augmentation scheme combined with an example selection scheme. This algorithm improves LLM performance by selecting a set of examples that increase diversity, minimize redundancy, and increase relevance to the question. When combined with the Program-of-Thought prompting, our algorithm demonstrates an improvement in performance on the GSM8K and SVAMP benchmarks, with increases of 0.3% and 1.1% respectively. Furthermore, in simulated tabletop environments, our algorithm surpasses the Code-as-Policies approach by achieving a 3.4% increase in successful task completions and a decrease of over 70% in the number of examples used. Its ability to discard examples that contribute little to solving the problem reduces the inferencing time of an LLM-powered robotics system. This algorithm also offers important benefits for industrial process automation by streamlining the development and deployment process, reducing manual programming effort, and enhancing code reusability. △ Less

Submitted 11 March, 2024; originally announced March 2024.

Comments: 17 pages, 4 figures

arXiv:2402.12590 [pdf, other]

Evolving AI Collectives to Enhance Human Diversity and Enable Self-Regulation

Authors: Shiyang Lai, Yujin Potter, Junsol Kim, Richard Zhuang, Dawn Song, James Evans

Abstract: Large language model behavior is shaped by the language of those with whom they interact. This capacity and their increasing prevalence online portend that they will intentionally or unintentionally "program" one another and form emergent AI subjectivities, relationships, and collectives. Here, we call upon the research community to investigate these "societies" of interacting artificial intellige… ▽ More Large language model behavior is shaped by the language of those with whom they interact. This capacity and their increasing prevalence online portend that they will intentionally or unintentionally "program" one another and form emergent AI subjectivities, relationships, and collectives. Here, we call upon the research community to investigate these "societies" of interacting artificial intelligences to increase their rewards and reduce their risks for human society and the health of online environments. We use a small "community" of models and their evolving outputs to illustrate how such emergent, decentralized AI collectives can spontaneously expand the bounds of human diversity and reduce the risk of toxic, anti-social behavior online. Finally, we discuss opportunities for AI cross-moderation and address ethical issues and design challenges associated with creating and maintaining free-formed AI collectives. △ Less

Submitted 18 June, 2024; v1 submitted 19 February, 2024; originally announced February 2024.

Comments: ICML 2024

arXiv:2402.04557 [pdf]

doi 10.1021/acs.iecr.3c02520

An Artificial Intelligence (AI) workflow for catalyst design and optimization

Authors: Nung Siong Lai, Yi Shen Tew, Xialin Zhong, Jun Yin, Jiali Li, Binhang Yan, Xiaonan Wang

Abstract: In the pursuit of novel catalyst development to address pressing environmental concerns and energy demand, conventional design and optimization methods often fall short due to the complexity and vastness of the catalyst parameter space. The advent of Machine Learning (ML) has ushered in a new era in the field of catalyst optimization, offering potential solutions to the shortcomings of traditional… ▽ More In the pursuit of novel catalyst development to address pressing environmental concerns and energy demand, conventional design and optimization methods often fall short due to the complexity and vastness of the catalyst parameter space. The advent of Machine Learning (ML) has ushered in a new era in the field of catalyst optimization, offering potential solutions to the shortcomings of traditional techniques. However, existing methods fail to effectively harness the wealth of information contained within the burgeoning body of scientific literature on catalyst synthesis. To address this gap, this study proposes an innovative Artificial Intelligence (AI) workflow that integrates Large Language Models (LLMs), Bayesian optimization, and an active learning loop to expedite and enhance catalyst optimization. Our methodology combines advanced language understanding with robust optimization strategies, effectively translating knowledge extracted from diverse literature into actionable parameters for practical experimentation and optimization. In this article, we demonstrate the application of this AI workflow in the optimization of catalyst synthesis for ammonia production. The results underscore the workflow's ability to streamline the catalyst development process, offering a swift, resource-efficient, and high-precision alternative to conventional methods. △ Less

Submitted 6 February, 2024; originally announced February 2024.

Comments: 31 pages, 7 figures

Journal ref: Ind. Eng. Chem. Res. 2023, 62, 43, 17835-17848

arXiv:2401.04403 [pdf, other]

MST: Adaptive Multi-Scale Tokens Guided Interactive Segmentation

Authors: Long Xu, Shanghong Li, Yongquan Chen, Jun Luo, Shiwu Lai

Abstract: Interactive segmentation has gained significant attention for its application in human-computer interaction and data annotation. To address the target scale variation issue in interactive segmentation, a novel multi-scale token adaptation algorithm is proposed. By performing top-k operations across multi-scale tokens, the computational complexity is greatly simplified while ensuring performance. T… ▽ More Interactive segmentation has gained significant attention for its application in human-computer interaction and data annotation. To address the target scale variation issue in interactive segmentation, a novel multi-scale token adaptation algorithm is proposed. By performing top-k operations across multi-scale tokens, the computational complexity is greatly simplified while ensuring performance. To enhance the robustness of multi-scale token selection, we also propose a token learning algorithm based on contrastive loss. This algorithm can effectively improve the performance of multi-scale token adaptation. Extensive benchmarking shows that the algorithm achieves state-of-the-art (SOTA) performance, compared to current methods. An interactive demo and all reproducible codes will be released at https://github.com/hahamyt/mst. △ Less

Submitted 2 February, 2024; v1 submitted 9 January, 2024; originally announced January 2024.

Comments: 11 pages, 10 figures

arXiv:2312.16044 [pdf, other]

LLMLight: Large Language Models as Traffic Signal Control Agents

Authors: Siqi Lai, Zhao Xu, Weijia Zhang, Hao Liu, Hui Xiong

Abstract: Traffic Signal Control (TSC) is a crucial component in urban traffic management, aiming to optimize road network efficiency and reduce congestion. Traditional methods in TSC, primarily based on transportation engineering and reinforcement learning (RL), often exhibit limitations in generalization across varied traffic scenarios and lack interpretability. This paper presents LLMLight, a novel frame… ▽ More Traffic Signal Control (TSC) is a crucial component in urban traffic management, aiming to optimize road network efficiency and reduce congestion. Traditional methods in TSC, primarily based on transportation engineering and reinforcement learning (RL), often exhibit limitations in generalization across varied traffic scenarios and lack interpretability. This paper presents LLMLight, a novel framework employing Large Language Models (LLMs) as decision-making agents for TSC. Specifically, the framework begins by instructing the LLM with a knowledgeable prompt detailing real-time traffic conditions. Leveraging the advanced generalization capabilities of LLMs, LLMLight engages a reasoning and decision-making process akin to human intuition for effective traffic control. Moreover, we build LightGPT, a specialized backbone LLM tailored for TSC tasks. By learning nuanced traffic patterns and control strategies, LightGPT enhances the LLMLight framework cost-effectively. Extensive experiments on nine real-world and synthetic datasets showcase the remarkable effectiveness, generalization ability, and interpretability of LLMLight against nine transportation-based and RL-based baselines. △ Less

Submitted 5 March, 2024; v1 submitted 26 December, 2023; originally announced December 2023.

arXiv:2312.09480 [pdf, other]

TAB: Text-Align Anomaly Backbone Model for Industrial Inspection Tasks

Authors: Ho-Weng Lee, Shang-Hong Lai

Abstract: In recent years, the focus on anomaly detection and localization in industrial inspection tasks has intensified. While existing studies have demonstrated impressive outcomes, they often rely heavily on extensive training datasets or robust features extracted from pre-trained models trained on diverse datasets like ImageNet. In this work, we propose a novel framework leveraging the visual-linguisti… ▽ More In recent years, the focus on anomaly detection and localization in industrial inspection tasks has intensified. While existing studies have demonstrated impressive outcomes, they often rely heavily on extensive training datasets or robust features extracted from pre-trained models trained on diverse datasets like ImageNet. In this work, we propose a novel framework leveraging the visual-linguistic CLIP model to adeptly train a backbone model tailored to the manufacturing domain. Our approach concurrently considers visual and text-aligned embedding spaces for normal and abnormal conditions. The resulting pre-trained backbone markedly enhances performance in industrial downstream tasks, particularly in anomaly detection and localization. Notably, this improvement is substantiated through experiments conducted on multiple datasets such as MVTecAD, BTAD, and KSDD2. Furthermore, using our pre-trained backbone weights allows previous works to achieve superior performance in few-shot scenarios with less training data. The proposed anomaly backbone provides a foundation model for more precise anomaly detection and localization. △ Less

Submitted 14 December, 2023; originally announced December 2023.

arXiv:2310.12846 [pdf, other]

Physical Information Neural Networks for Solving High-index Differential-algebraic Equation Systems Based on Radau Methods

Authors: Jiasheng Chen, Juan Tang, Ming Yan, Shuai Lai, Kun Liang, Jianguang Lu, Wenqiang Yang

Abstract: As is well known, differential algebraic equations (DAEs), which are able to describe dynamic changes and underlying constraints, have been widely applied in engineering fields such as fluid dynamics, multi-body dynamics, mechanical systems and control theory. In practical physical modeling within these domains, the systems often generate high-index DAEs. Classical implicit numerical methods typic… ▽ More As is well known, differential algebraic equations (DAEs), which are able to describe dynamic changes and underlying constraints, have been widely applied in engineering fields such as fluid dynamics, multi-body dynamics, mechanical systems and control theory. In practical physical modeling within these domains, the systems often generate high-index DAEs. Classical implicit numerical methods typically result in varying order reduction of numerical accuracy when solving high-index systems.~Recently, the physics-informed neural network (PINN) has gained attention for solving DAE systems. However, it faces challenges like the inability to directly solve high-index systems, lower predictive accuracy, and weaker generalization capabilities. In this paper, we propose a PINN computational framework, combined Radau IIA numerical method with a neural network structure via the attention mechanisms, to directly solve high-index DAEs. Furthermore, we employ a domain decomposition strategy to enhance solution accuracy. We conduct numerical experiments with two classical high-index systems as illustrative examples, investigating how different orders of the Radau IIA method affect the accuracy of neural network solutions. The experimental results demonstrate that the PINN based on a 5th-order Radau IIA method achieves the highest level of system accuracy. Specifically, the absolute errors for all differential variables remains as low as $10^{-6}$, and the absolute errors for algebraic variables is maintained at $10^{-5}$, surpassing the results found in existing literature. Therefore, our method exhibits excellent computational accuracy and strong generalization capabilities, providing a feasible approach for the high-precision solution of larger-scale DAEs with higher indices or challenging high-dimensional partial differential algebraic equation systems. △ Less

Submitted 19 October, 2023; originally announced October 2023.

arXiv:2310.05056 [pdf, other]

Open-Vocabulary Animal Keypoint Detection with Semantic-feature Matching

Authors: Hao Zhang, Lumin Xu, Shenqi Lai, Wenqi Shao, Nanning Zheng, Ping Luo, Yu Qiao, Kaipeng Zhang

Abstract: Current image-based keypoint detection methods for animal (including human) bodies and faces are generally divided into full-supervised and few-shot class-agnostic approaches. The former typically relies on laborious and time-consuming manual annotations, posing considerable challenges in expanding keypoint detection to a broader range of keypoint categories and animal species. The latter, though… ▽ More Current image-based keypoint detection methods for animal (including human) bodies and faces are generally divided into full-supervised and few-shot class-agnostic approaches. The former typically relies on laborious and time-consuming manual annotations, posing considerable challenges in expanding keypoint detection to a broader range of keypoint categories and animal species. The latter, though less dependent on extensive manual input, still requires necessary support images with annotation for reference during testing. To realize zero-shot keypoint detection without any prior annotation, we introduce the Open-Vocabulary Keypoint Detection (OVKD) task, which is innovatively designed to use text prompts for identifying arbitrary keypoints across any species. In pursuit of this goal, we have developed a novel framework named Open-Vocabulary Keypoint Detection with Semantic-feature Matching (KDSM). This framework synergistically combines vision and language models, creating an interplay between language features and local keypoint visual features. KDSM enhances its capabilities by integrating Domain Distribution Matrix Matching (DDMM) and other special modules, such as the Vision-Keypoint Relational Awareness (VKRA) module, improving the framework's generalizability and overall performance.Our comprehensive experiments demonstrate that KDSM significantly outperforms the baseline in terms of performance and achieves remarkable success in the OVKD task.Impressively, our method, operating in a zero-shot fashion, still yields results comparable to state-of-the-art few-shot species class-agnostic keypoint detection methods.We will make the source code publicly accessible. △ Less

Submitted 11 December, 2023; v1 submitted 8 October, 2023; originally announced October 2023.

arXiv:2309.11798 [pdf, other]

A Comprehensive Review of Community Detection in Graphs

Authors: Jiakang Li, Songning Lai, Zhihao Shuai, Yuan Tan, Yifan Jia, Mianyang Yu, Zichen Song, Xiaokang Peng, Ziyang Xu, Yongxin Ni, Haifeng Qiu, Jiayu Yang, Yutong Liu, Yonggang Lu

Abstract: The study of complex networks has significantly advanced our understanding of community structures which serves as a crucial feature of real-world graphs. Detecting communities in graphs is a challenging problem with applications in sociology, biology, and computer science. Despite the efforts of an interdisciplinary community of scientists, a satisfactory solution to this problem has not yet been… ▽ More The study of complex networks has significantly advanced our understanding of community structures which serves as a crucial feature of real-world graphs. Detecting communities in graphs is a challenging problem with applications in sociology, biology, and computer science. Despite the efforts of an interdisciplinary community of scientists, a satisfactory solution to this problem has not yet been achieved. This review article delves into the topic of community detection in graphs, which serves as a thorough exposition of various community detection methods from perspectives of modularity-based method, spectral clustering, probabilistic modelling, and deep learning. Along with the methods, a new community detection method designed by us is also presented. Additionally, the performance of these methods on the datasets with and without ground truth is compared. In conclusion, this comprehensive review provides a deep understanding of community detection in graphs. △ Less

Submitted 12 July, 2024; v1 submitted 21 September, 2023; originally announced September 2023.

arXiv:2309.10641 [pdf, other]

KFC: Kinship Verification with Fair Contrastive Loss and Multi-Task Learning

Authors: Jia Luo Peng, Keng Wei Chang, Shang-Hong Lai

Abstract: Kinship verification is an emerging task in computer vision with multiple potential applications. However, there's no large enough kinship dataset to train a representative and robust model, which is a limitation for achieving better performance. Moreover, face verification is known to exhibit bias, which has not been dealt with by previous kinship verification works and sometimes even results in… ▽ More Kinship verification is an emerging task in computer vision with multiple potential applications. However, there's no large enough kinship dataset to train a representative and robust model, which is a limitation for achieving better performance. Moreover, face verification is known to exhibit bias, which has not been dealt with by previous kinship verification works and sometimes even results in serious issues. So we first combine existing kinship datasets and label each identity with the correct race in order to take race information into consideration and provide a larger and complete dataset, called KinRace dataset. Secondly, we propose a multi-task learning model structure with attention module to enhance accuracy, which surpasses state-of-the-art performance. Lastly, our fairness-aware contrastive loss function with adversarial learning greatly mitigates racial bias. We introduce a debias term into traditional contrastive loss and implement gradient reverse in race classification task, which is an innovative idea to mix two fairness methods to alleviate bias. Exhaustive experimental evaluation demonstrates the effectiveness and superior performance of the proposed KFC in both standard deviation and accuracy at the same time. △ Less

Submitted 20 September, 2023; v1 submitted 19 September, 2023; originally announced September 2023.

Comments: Accepted by BMVC 2023

arXiv:2308.13229 [pdf, other]

ReST: A Reconfigurable Spatial-Temporal Graph Model for Multi-Camera Multi-Object Tracking

Authors: Cheng-Che Cheng, Min-Xuan Qiu, Chen-Kuo Chiang, Shang-Hong Lai

Abstract: Multi-Camera Multi-Object Tracking (MC-MOT) utilizes information from multiple views to better handle problems with occlusion and crowded scenes. Recently, the use of graph-based approaches to solve tracking problems has become very popular. However, many current graph-based methods do not effectively utilize information regarding spatial and temporal consistency. Instead, they rely on single-came… ▽ More Multi-Camera Multi-Object Tracking (MC-MOT) utilizes information from multiple views to better handle problems with occlusion and crowded scenes. Recently, the use of graph-based approaches to solve tracking problems has become very popular. However, many current graph-based methods do not effectively utilize information regarding spatial and temporal consistency. Instead, they rely on single-camera trackers as input, which are prone to fragmentation and ID switch errors. In this paper, we propose a novel reconfigurable graph model that first associates all detected objects across cameras spatially before reconfiguring it into a temporal graph for Temporal Association. This two-stage association approach enables us to extract robust spatial and temporal-aware features and address the problem with fragmented tracklets. Furthermore, our model is designed for online tracking, making it suitable for real-world applications. Experimental results show that the proposed graph model is able to extract more discriminating features for object tracking, and our model achieves state-of-the-art performance on several public datasets. △ Less

Submitted 25 August, 2023; originally announced August 2023.

Comments: Accepted by ICCV2023

arXiv:2306.17020

Classifying Crime Types using Judgment Documents from Social Media

Authors: Haoxuan Xu, Zeyu He, Mengfan Shen, Songning Lai, Ziqiang Han, Yifan Peng

Abstract: The task of determining crime types based on criminal behavior facts has become a very important and meaningful task in social science. But the problem facing the field now is that the data samples themselves are unevenly distributed, due to the nature of the crime itself. At the same time, data sets in the judicial field are less publicly available, and it is not practical to produce large data s… ▽ More The task of determining crime types based on criminal behavior facts has become a very important and meaningful task in social science. But the problem facing the field now is that the data samples themselves are unevenly distributed, due to the nature of the crime itself. At the same time, data sets in the judicial field are less publicly available, and it is not practical to produce large data sets for direct training. This article proposes a new training model to solve this problem through NLP processing methods. We first propose a Crime Fact Data Preprocessing Module (CFDPM), which can balance the defects of uneven data set distribution by generating new samples. Then we use a large open source dataset (CAIL-big) as our pretraining dataset and a small dataset collected by ourselves for Fine-tuning, giving it good generalization ability to unfamiliar small datasets. At the same time, we use the improved Bert model with dynamic masking to improve the model. Experiments show that the proposed method achieves state-of-the-art results on the present dataset. At the same time, the effectiveness of module CFDPM is proved by experiments. This article provides a valuable methodology contribution for classifying social science texts such as criminal behaviors. Extensive experiments on public benchmarks show that the proposed method achieves new state-of-the-art results. △ Less

Submitted 21 October, 2023; v1 submitted 29 June, 2023; originally announced June 2023.

Comments: The paper has no errors; it just needs to be supplemented to become a new article

arXiv:2306.14421 [pdf, other]

doi 10.1145/3580305.3599767

A Preference-aware Meta-optimization Framework for Personalized Vehicle Energy Consumption Estimation

Authors: Siqi Lai, Weijia Zhang, Hao Liu

Abstract: Vehicle Energy Consumption (VEC) estimation aims to predict the total energy required for a given trip before it starts, which is of great importance to trip planning and transportation sustainability. Existing approaches mainly focus on extracting statistically significant factors from typical trips to improve the VEC estimation. However, the energy consumption of each vehicle may diverge widely… ▽ More Vehicle Energy Consumption (VEC) estimation aims to predict the total energy required for a given trip before it starts, which is of great importance to trip planning and transportation sustainability. Existing approaches mainly focus on extracting statistically significant factors from typical trips to improve the VEC estimation. However, the energy consumption of each vehicle may diverge widely due to the personalized driving behavior under varying travel contexts. To this end, this paper proposes a preference-aware meta-optimization framework Meta-Pec for personalized vehicle energy consumption estimation. Specifically, we first propose a spatiotemporal behavior learning module to capture the latent driver preference hidden in historical trips. Moreover, based on the memorization of driver preference, we devise a selection-based driving behavior prediction module to infer driver-specific driving patterns on a given route, which provides additional basis and supervision signals for VEC estimation. Besides, a driver-specific meta-optimization scheme is proposed to enable fast model adaption by learning and sharing transferable knowledge globally. Extensive experiments on two real-world datasets show the superiority of our proposed framework against ten numerical and data-driven machine learning baselines. The source code is available at https://github.com/usail-hkust/Meta-Pec. △ Less

Submitted 26 June, 2023; originally announced June 2023.

arXiv:2306.14313 [pdf, other]

A Closer Look at Geometric Temporal Dynamics for Face Anti-Spoofing

Authors: Chih-Jung Chang, Yaw-Chern Lee, Shih-Hsuan Yao, Min-Hung Chen, Chien-Yi Wang, Shang-Hong Lai, Trista Pei-Chun Chen

Abstract: Face anti-spoofing (FAS) is indispensable for a face recognition system. Many texture-driven countermeasures were developed against presentation attacks (PAs), but the performance against unseen domains or unseen spoofing types is still unsatisfactory. Instead of exhaustively collecting all the spoofing variations and making binary decisions of live/spoof, we offer a new perspective on the FAS tas… ▽ More Face anti-spoofing (FAS) is indispensable for a face recognition system. Many texture-driven countermeasures were developed against presentation attacks (PAs), but the performance against unseen domains or unseen spoofing types is still unsatisfactory. Instead of exhaustively collecting all the spoofing variations and making binary decisions of live/spoof, we offer a new perspective on the FAS task to distinguish between normal and abnormal movements of live and spoof presentations. We propose Geometry-Aware Interaction Network (GAIN), which exploits dense facial landmarks with spatio-temporal graph convolutional network (ST-GCN) to establish a more interpretable and modularized FAS model. Additionally, with our cross-attention feature interaction mechanism, GAIN can be easily integrated with other existing methods to significantly boost performance. Our approach achieves state-of-the-art performance in the standard intra- and cross-dataset evaluations. Moreover, our model outperforms state-of-the-art methods by a large margin in the cross-dataset cross-type protocol on CASIA-SURF 3DMask (+10.26% higher AUC score), exhibiting strong robustness against domain shifts and unseen spoofing types. △ Less

Submitted 25 June, 2023; originally announced June 2023.

Comments: 2023 CVPR Biometrics Workshop, Best Paper Award

arXiv:2306.10351 [pdf, other]

Bkd-FedGNN: A Benchmark for Classification Backdoor Attacks on Federated Graph Neural Network

Authors: Fan Liu, Siqi Lai, Yansong Ning, Hao Liu

Abstract: Federated Graph Neural Network (FedGNN) has recently emerged as a rapidly growing research topic, as it integrates the strengths of graph neural networks and federated learning to enable advanced machine learning applications without direct access to sensitive data. Despite its advantages, the distributed nature of FedGNN introduces additional vulnerabilities, particularly backdoor attacks stemmin… ▽ More Federated Graph Neural Network (FedGNN) has recently emerged as a rapidly growing research topic, as it integrates the strengths of graph neural networks and federated learning to enable advanced machine learning applications without direct access to sensitive data. Despite its advantages, the distributed nature of FedGNN introduces additional vulnerabilities, particularly backdoor attacks stemming from malicious participants. Although graph backdoor attacks have been explored, the compounded complexity introduced by the combination of GNNs and federated learning has hindered a comprehensive understanding of these attacks, as existing research lacks extensive benchmark coverage and in-depth analysis of critical factors. To address these limitations, we propose Bkd-FedGNN, a benchmark for backdoor attacks on FedGNN. Specifically, Bkd-FedGNN decomposes the graph backdoor attack into trigger generation and injection steps, and extending the attack to the node-level federated setting, resulting in a unified framework that covers both node-level and graph-level classification tasks. Moreover, we thoroughly investigate the impact of multiple critical factors in backdoor attacks on FedGNN. These factors are categorized into global-level and local-level factors, including data distribution, the number of malicious attackers, attack time, overlapping rate, trigger size, trigger type, trigger position, and poisoning rate. Finally, we conduct comprehensive evaluations on 13 benchmark datasets and 13 critical factors, comprising 1,725 experimental configurations for node-level and graph-level tasks from six domains. These experiments encompass over 8,000 individual tests, allowing us to provide a thorough evaluation and insightful observations that advance our understanding of backdoor attacks on FedGNN.The Bkd-FedGNN benchmark is publicly available at https://github.com/usail-hkust/BkdFedGCN. △ Less

Submitted 17 June, 2023; originally announced June 2023.

arXiv:2306.05310 [pdf, other]

A framework for dynamically training and adapting deep reinforcement learning models to different, low-compute, and continuously changing radiology deployment environments

Authors: Guangyao Zheng, Shuhao Lai, Vladimir Braverman, Michael A. Jacobs, Vishwa S. Parekh

Abstract: While Deep Reinforcement Learning has been widely researched in medical imaging, the training and deployment of these models usually require powerful GPUs. Since imaging environments evolve rapidly and can be generated by edge devices, the algorithm is required to continually learn and adapt to changing environments, and adjust to low-compute devices. To this end, we developed three image coreset… ▽ More While Deep Reinforcement Learning has been widely researched in medical imaging, the training and deployment of these models usually require powerful GPUs. Since imaging environments evolve rapidly and can be generated by edge devices, the algorithm is required to continually learn and adapt to changing environments, and adjust to low-compute devices. To this end, we developed three image coreset algorithms to compress and denoise medical images for selective experience replayed-based lifelong reinforcement learning. We implemented neighborhood averaging coreset, neighborhood sensitivity-based sampling coreset, and maximum entropy coreset on full-body DIXON water and DIXON fat MRI images. All three coresets produced 27x compression with excellent performance in localizing five anatomical landmarks: left knee, right trochanter, left kidney, spleen, and lung across both imaging environments. Maximum entropy coreset obtained the best performance of $11.97\pm 12.02$ average distance error, compared to the conventional lifelong learning framework's $19.24\pm 50.77$. △ Less

Submitted 8 June, 2023; originally announced June 2023.

arXiv:2306.04032 [pdf, other]

BokehOrNot: Transforming Bokeh Effect with Image Transformer and Lens Metadata Embedding

Authors: Zhihao Yang, Wenyi Lian, Siyuan Lai

Abstract: Bokeh effect is an optical phenomenon that offers a pleasant visual experience, typically generated by high-end cameras with wide aperture lenses. The task of bokeh effect transformation aims to produce a desired effect in one set of lenses and apertures based on another combination. Current models are limited in their ability to render a specific set of bokeh effects, primarily transformations fr… ▽ More Bokeh effect is an optical phenomenon that offers a pleasant visual experience, typically generated by high-end cameras with wide aperture lenses. The task of bokeh effect transformation aims to produce a desired effect in one set of lenses and apertures based on another combination. Current models are limited in their ability to render a specific set of bokeh effects, primarily transformations from sharp to blur. In this paper, we propose a novel universal method for embedding lens metadata into the model and introducing a loss calculation method using alpha masks from the newly released Bokeh Effect Transformation Dataset(BETD) [3]. Based on the above techniques, we propose the BokehOrNot model, which is capable of producing both blur-to-sharp and sharp-to-blur bokeh effect with various combinations of lenses and aperture sizes. Our proposed model outperforms current leading bokeh rendering and image restoration models and renders visually natural bokeh effects. Our code is available at: https://github.com/indicator0/bokehornot. △ Less

Submitted 6 June, 2023; originally announced June 2023.

arXiv:2306.00188 [pdf, other]

Multi-environment lifelong deep reinforcement learning for medical imaging

Authors: Guangyao Zheng, Shuhao Lai, Vladimir Braverman, Michael A. Jacobs, Vishwa S. Parekh

Abstract: Deep reinforcement learning(DRL) is increasingly being explored in medical imaging. However, the environments for medical imaging tasks are constantly evolving in terms of imaging orientations, imaging sequences, and pathologies. To that end, we developed a Lifelong DRL framework, SERIL to continually learn new tasks in changing imaging environments without catastrophic forgetting. SERIL was devel… ▽ More Deep reinforcement learning(DRL) is increasingly being explored in medical imaging. However, the environments for medical imaging tasks are constantly evolving in terms of imaging orientations, imaging sequences, and pathologies. To that end, we developed a Lifelong DRL framework, SERIL to continually learn new tasks in changing imaging environments without catastrophic forgetting. SERIL was developed using selective experience replay based lifelong learning technique for the localization of five anatomical landmarks in brain MRI on a sequence of twenty-four different imaging environments. The performance of SERIL, when compared to two baseline setups: MERT(multi-environment-best-case) and SERT(single-environment-worst-case) demonstrated excellent performance with an average distance of $9.90\pm7.35$ pixels from the desired landmark across all 120 tasks, compared to $10.29\pm9.07$ for MERT and $36.37\pm22.41$ for SERT($p<0.05$), demonstrating the excellent potential for continuously learning multiple tasks across dynamically changing imaging environments. △ Less

Submitted 31 May, 2023; originally announced June 2023.

arXiv:2305.20055 [pdf, other]

Cross-Domain Car Detection Model with Integrated Convolutional Block Attention Mechanism

Authors: Haoxuan Xu, Songning Lai, Xianyang Li, Yang Yang

Abstract: Car detection, particularly through camera vision, has become a major focus in the field of computer vision and has gained widespread adoption. While current car detection systems are capable of good detection, reliable detection can still be challenging due to factors such as proximity between the car, light intensity, and environmental visibility. To address these issues, we propose cross-domain… ▽ More Car detection, particularly through camera vision, has become a major focus in the field of computer vision and has gained widespread adoption. While current car detection systems are capable of good detection, reliable detection can still be challenging due to factors such as proximity between the car, light intensity, and environmental visibility. To address these issues, we propose cross-domain Car Detection Model with integrated convolutional block Attention mechanism(CDMA) that we apply to car recognition for autonomous driving and other areas. CDMA includes several novelties: 1)Building a complete cross-domain target detection framework. 2)Developing an unpaired target domain picture generation module with an integrated convolutional attention mechanism which specifically emphasizes the car headlights feature. 3)Adopting Generalized Intersection over Union (GIOU) as the loss function of the target detection framework. 4)Designing an object detection model integrated with two-headed Convolutional Block Attention Module(CBAM). 5)Utilizing an effective data enhancement method. To evaluate the model's effectiveness, we performed a reduced will resolution process on the data in the SSLAD dataset and used it as the benchmark dataset for our task. Experimental results show that the performance of the cross-domain car target detection model improves by 40% over the model without our framework, and our improvements have a significant impact on cross-domain car recognition. △ Less

Submitted 29 June, 2023; v1 submitted 31 May, 2023; originally announced May 2023.

Comments: It needs to be returned for major modifications

arXiv:2305.09221 [pdf, other]

A Multi-Client Searchable Encryption Scheme for IoT Environment

Authors: Nazatul H. Sultan, Shabnam Kasra-Kermanshahi, Yen Tran, Shangqi Lai, Vijay Varadharajan, Surya Nepal, Xun Yi

Abstract: The proliferation of connected devices through Internet connectivity presents both opportunities for smart applications and risks to security and privacy. It is vital to proactively address these concerns to fully leverage the potential of the Internet of Things. IoT services where one data owner serves multiple clients, like smart city transportation, smart building management and healthcare can… ▽ More The proliferation of connected devices through Internet connectivity presents both opportunities for smart applications and risks to security and privacy. It is vital to proactively address these concerns to fully leverage the potential of the Internet of Things. IoT services where one data owner serves multiple clients, like smart city transportation, smart building management and healthcare can offer benefits but also bring cybersecurity and data privacy risks. For example, in healthcare, a hospital may collect data from medical devices and make it available to multiple clients such as researchers and pharmaceutical companies. This data can be used to improve medical treatments and research but if not protected, it can also put patients' personal information at risk. To ensure the benefits of these services, it is important to implement proper security and privacy measures. In this paper, we propose a symmetric searchable encryption scheme with dynamic updates on a database that has a single owner and multiple clients for IoT environments. Our proposed scheme supports both forward and backward privacy. Additionally, our scheme supports a decentralized storage environment in which data owners can outsource data across multiple servers or even across multiple service providers to improve security and privacy. Further, it takes a minimum amount of effort and costs to revoke a client's access to our system at any time. The performance and formal security analyses of the proposed scheme show that our scheme provides better functionality, and security and is more efficient in terms of computation and storage than the closely related works. △ Less

Submitted 16 May, 2023; originally announced May 2023.

Comments: 22 pages, 5 figures, this version was submitted to ESORICS 2023

arXiv:2305.08473 [pdf, other]

Shared and Private Information Learning in Multimodal Sentiment Analysis with Deep Modal Alignment and Self-supervised Multi-Task Learning

Authors: Songning Lai, Jiakang Li, Guinan Guo, Xifeng Hu, Yulong Li, Yuan Tan, Zichen Song, Yutong Liu, Zhaoxia Ren, Chun Wan, Danmin Miao, Zhi Liu

Abstract: Designing an effective representation learning method for multimodal sentiment analysis tasks is a crucial research direction. The challenge lies in learning both shared and private information in a complete modal representation, which is difficult with uniform multimodal labels and a raw feature fusion approach. In this work, we propose a deep modal shared information learning module based on the… ▽ More Designing an effective representation learning method for multimodal sentiment analysis tasks is a crucial research direction. The challenge lies in learning both shared and private information in a complete modal representation, which is difficult with uniform multimodal labels and a raw feature fusion approach. In this work, we propose a deep modal shared information learning module based on the covariance matrix to capture the shared information between modalities. Additionally, we use a label generation module based on a self-supervised learning strategy to capture the private information of the modalities. Our module is plug-and-play in multimodal tasks, and by changing the parameterization, it can adjust the information exchange relationship between the modes and learn the private or shared information between the specified modes. We also employ a multi-task learning strategy to help the model focus its attention on the modal differentiation training data. We provide a detailed formulation derivation and feasibility proof for the design of the deep modal shared information learning module. We conduct extensive experiments on three common multimodal sentiment analysis baseline datasets, and the experimental results validate the reliability of our model. Furthermore, we explore more combinatorial techniques for the use of the module. Our approach outperforms current state-of-the-art methods on most of the metrics of the three public datasets. △ Less

Submitted 19 March, 2024; v1 submitted 15 May, 2023; originally announced May 2023.

Journal ref: International Joint Conference on Neural Networks (IJCNN) 2024

arXiv:2305.07611 [pdf, other]

Multimodal Sentiment Analysis: A Survey

Authors: Songning Lai, Xifeng Hu, Haoxuan Xu, Zhaoxia Ren, Zhi Liu

Abstract: Multimodal sentiment analysis has become an important research area in the field of artificial intelligence. With the latest advances in deep learning, this technology has reached new heights. It has great potential for both application and research, making it a popular research topic. This review provides an overview of the definition, background, and development of multimodal sentiment analysis.… ▽ More Multimodal sentiment analysis has become an important research area in the field of artificial intelligence. With the latest advances in deep learning, this technology has reached new heights. It has great potential for both application and research, making it a popular research topic. This review provides an overview of the definition, background, and development of multimodal sentiment analysis. It also covers recent datasets and advanced models, emphasizing the challenges and future prospects of this technology. Finally, it looks ahead to future research directions. It should be noted that this review provides constructive suggestions for promising research directions and building better performing multimodal sentiment analysis models, which can help researchers in this field. △ Less

Submitted 3 July, 2023; v1 submitted 12 May, 2023; originally announced May 2023.

Comments: It needs to be returned for major modifications

arXiv:2304.04688 [pdf, other]

Interaction-Aware Prompting for Zero-Shot Spatio-Temporal Action Detection

Authors: Wei-Jhe Huang, Jheng-Hsien Yeh, Min-Hung Chen, Gueter Josmy Faure, Shang-Hong Lai

Abstract: The goal of spatial-temporal action detection is to determine the time and place where each person's action occurs in a video and classify the corresponding action category. Most of the existing methods adopt fully-supervised learning, which requires a large amount of training data, making it very difficult to achieve zero-shot learning. In this paper, we propose to utilize a pre-trained visual-la… ▽ More The goal of spatial-temporal action detection is to determine the time and place where each person's action occurs in a video and classify the corresponding action category. Most of the existing methods adopt fully-supervised learning, which requires a large amount of training data, making it very difficult to achieve zero-shot learning. In this paper, we propose to utilize a pre-trained visual-language model to extract the representative image and text features, and model the relationship between these features through different interaction modules to obtain the interaction feature. In addition, we use this feature to prompt each label to obtain more appropriate text features. Finally, we calculate the similarity between the interaction feature and the text feature for each label to determine the action category. Our experiments on J-HMDB and UCF101-24 datasets demonstrate that the proposed interaction module and prompting make the visual-language features better aligned, thus achieving excellent accuracy for zero-shot spatio-temporal action detection. The code will be available at https://github.com/webber2933/iCLIP. △ Less

Submitted 20 September, 2023; v1 submitted 10 April, 2023; originally announced April 2023.

Comments: Accepted by ICCV Workshop 2023 (What is Next in Multimodal Foundation Models?)

arXiv:2304.04546 [pdf, other]

Kinship Representation Learning with Face Componential Relation

Authors: Weng-Tai Su, Min-Hung Chen, Chien-Yi Wang, Shang-Hong Lai, Trista Pei-Chun Chen

Abstract: Kinship recognition aims to determine whether the subjects in two facial images are kin or non-kin, which is an emerging and challenging problem. However, most previous methods focus on heuristic designs without considering the spatial correlation between face images. In this paper, we aim to learn discriminative kinship representations embedded with the relation information between face component… ▽ More Kinship recognition aims to determine whether the subjects in two facial images are kin or non-kin, which is an emerging and challenging problem. However, most previous methods focus on heuristic designs without considering the spatial correlation between face images. In this paper, we aim to learn discriminative kinship representations embedded with the relation information between face components (e.g., eyes, nose, etc.). To achieve this goal, we propose the Face Componential Relation Network, which learns the relationship between face components among images with a cross-attention mechanism, which automatically learns the important facial regions for kinship recognition. Moreover, we propose Face Componential Relation Network (FaCoRNet), which adapts the loss function by the guidance from cross-attention to learn more discriminative feature representations. The proposed FaCoRNet outperforms previous state-of-the-art methods by large margins for the largest public kinship recognition FIW benchmark. △ Less

Submitted 29 September, 2023; v1 submitted 10 April, 2023; originally announced April 2023.

Comments: ICCV 2023 Workshop (Analysis and Modeling of Faces and Gestures)

arXiv:2303.10826 [pdf, other]

Visual Prompt Multi-Modal Tracking

Authors: Jiawen Zhu, Simiao Lai, Xin Chen, Dong Wang, Huchuan Lu

Abstract: Visible-modal object tracking gives rise to a series of downstream multi-modal tracking tributaries. To inherit the powerful representations of the foundation model, a natural modus operandi for multi-modal tracking is full fine-tuning on the RGB-based parameters. Albeit effective, this manner is not optimal due to the scarcity of downstream data and poor transferability, etc. In this paper, inspi… ▽ More Visible-modal object tracking gives rise to a series of downstream multi-modal tracking tributaries. To inherit the powerful representations of the foundation model, a natural modus operandi for multi-modal tracking is full fine-tuning on the RGB-based parameters. Albeit effective, this manner is not optimal due to the scarcity of downstream data and poor transferability, etc. In this paper, inspired by the recent success of the prompt learning in language models, we develop Visual Prompt multi-modal Tracking (ViPT), which learns the modal-relevant prompts to adapt the frozen pre-trained foundation model to various downstream multimodal tracking tasks. ViPT finds a better way to stimulate the knowledge of the RGB-based model that is pre-trained at scale, meanwhile only introducing a few trainable parameters (less than 1% of model parameters). ViPT outperforms the full fine-tuning paradigm on multiple downstream tracking tasks including RGB+Depth, RGB+Thermal, and RGB+Event tracking. Extensive experiments show the potential of visual prompt learning for multi-modal tracking, and ViPT can achieve state-of-the-art performance while satisfying parameter efficiency. Code and models are available at https://github.com/jiawen-zhu/ViPT. △ Less

Submitted 24 March, 2023; v1 submitted 19 March, 2023; originally announced March 2023.

Comments: Accepted by CVPR2023

arXiv:2212.05786 [pdf, other]

Multi-scale Feature Imitation for Unsupervised Anomaly Localization

Authors: Chao Hu, Shengxin Lai

Abstract: The unsupervised anomaly localization task faces the challenge of missing anomaly sample training, detecting multiple types of anomalies, and dealing with the proportion of the area of multiple anomalies. A separate teacher-student feature imitation network structure and a multi-scale processing strategy combining an image and feature pyramid are proposed to solve these problems. A network module… ▽ More The unsupervised anomaly localization task faces the challenge of missing anomaly sample training, detecting multiple types of anomalies, and dealing with the proportion of the area of multiple anomalies. A separate teacher-student feature imitation network structure and a multi-scale processing strategy combining an image and feature pyramid are proposed to solve these problems. A network module importance search method based on gradient descent optimization is proposed to simplify the network structure. The experimental results show that the proposed algorithm performs better than the feature modeling anomaly localization method on the real industrial product detection dataset in the same period. The multi-scale strategy can effectively improve the effect compared with the benchmark method. △ Less

Submitted 12 December, 2022; v1 submitted 12 December, 2022; originally announced December 2022.

Comments: International Joint Conference on Neural Networks 2023

Journal ref: International Joint Conference on Neural Networks 2023

arXiv:2212.04611 [pdf]

doi 10.1016/j.tourman.2022.104665

Multidimensional Service Quality Scoring System

Authors: Shiyang Lai

Abstract: This supplementary paper aims to introduce the Multidimensional Service Quality Scoring System (MSQs), a review-based method for quantifying host service quality mentioned and employed in the paper Exit and transition: Exploring the survival status of Airbnb listings in a time of professionalization. MSQs is not an end-to-end implementation and is essentially composed of three pipelines, namely Da… ▽ More This supplementary paper aims to introduce the Multidimensional Service Quality Scoring System (MSQs), a review-based method for quantifying host service quality mentioned and employed in the paper Exit and transition: Exploring the survival status of Airbnb listings in a time of professionalization. MSQs is not an end-to-end implementation and is essentially composed of three pipelines, namely Data Collection and Preprocessing, Objects Recognition and Grouping, and Aspect-based Service Scoring. Using the study mentioned above as a case, the technical details of MSQs are explained in this article. △ Less

Submitted 8 December, 2022; originally announced December 2022.

Journal ref: Tourism Management, 95, 104665 (2022)

arXiv:2211.15955 [pdf, other]

Generalized Face Anti-Spoofing via Multi-Task Learning and One-Side Meta Triplet Loss

Authors: Chu-Chun Chuang, Chien-Yi Wang, Shang-Hong Lai

Abstract: With the increasing variations of face presentation attacks, model generalization becomes an essential challenge for a practical face anti-spoofing system. This paper presents a generalized face anti-spoofing framework that consists of three tasks: depth estimation, face parsing, and live/spoof classification. With the pixel-wise supervision from the face parsing and depth estimation tasks, the re… ▽ More With the increasing variations of face presentation attacks, model generalization becomes an essential challenge for a practical face anti-spoofing system. This paper presents a generalized face anti-spoofing framework that consists of three tasks: depth estimation, face parsing, and live/spoof classification. With the pixel-wise supervision from the face parsing and depth estimation tasks, the regularized features can better distinguish spoof faces. While simulating domain shift with meta-learning techniques, the proposed one-side triplet loss can further improve the generalization capability by a large margin. Extensive experiments on four public datasets demonstrate that the proposed framework and training strategies are more effective than previous works for model generalization to unseen domains. △ Less

Submitted 29 November, 2022; originally announced November 2022.

Comments: 2023 IEEE International Conference on Automatic Face and Gesture Recognition (FG)

arXiv:2211.15940 [pdf, other]

PiggyBack: Pretrained Visual Question Answering Environment for Backing up Non-deep Learning Professionals

Authors: Zhihao Zhang, Siwen Luo, Junyi Chen, Sijia Lai, Siqu Long, Hyunsuk Chung, Soyeon Caren Han

Abstract: We propose a PiggyBack, a Visual Question Answering platform that allows users to apply the state-of-the-art visual-language pretrained models easily. The PiggyBack supports the full stack of visual question answering tasks, specifically data processing, model fine-tuning, and result visualisation. We integrate visual-language models, pretrained by HuggingFace, an open-source API platform of deep… ▽ More We propose a PiggyBack, a Visual Question Answering platform that allows users to apply the state-of-the-art visual-language pretrained models easily. The PiggyBack supports the full stack of visual question answering tasks, specifically data processing, model fine-tuning, and result visualisation. We integrate visual-language models, pretrained by HuggingFace, an open-source API platform of deep learning technologies; however, it cannot be runnable without programming skills or deep learning understanding. Hence, our PiggyBack supports an easy-to-use browser-based user interface with several deep learning visual language pretrained models for general users and domain experts. The PiggyBack includes the following benefits: Free availability under the MIT License, Portability due to web-based and thus runs on almost any platform, A comprehensive data creation and processing technique, and ease of use on deep learning-based visual language pretrained models. The demo video is available on YouTube and can be found at https://youtu.be/iz44RZ1lF4s. △ Less

Submitted 30 November, 2022; v1 submitted 29 November, 2022; originally announced November 2022.

Comments: Accepted by WSDM 2023

arXiv:2211.15181 [pdf, other]

MixFairFace: Towards Ultimate Fairness via MixFair Adapter in Face Recognition

Authors: Fu-En Wang, Chien-Yi Wang, Min Sun, Shang-Hong Lai

Abstract: Although significant progress has been made in face recognition, demographic bias still exists in face recognition systems. For instance, it usually happens that the face recognition performance for a certain demographic group is lower than the others. In this paper, we propose MixFairFace framework to improve the fairness in face recognition models. First of all, we argue that the commonly used a… ▽ More Although significant progress has been made in face recognition, demographic bias still exists in face recognition systems. For instance, it usually happens that the face recognition performance for a certain demographic group is lower than the others. In this paper, we propose MixFairFace framework to improve the fairness in face recognition models. First of all, we argue that the commonly used attribute-based fairness metric is not appropriate for face recognition. A face recognition system can only be considered fair while every person has a close performance. Hence, we propose a new evaluation protocol to fairly evaluate the fairness performance of different approaches. Different from previous approaches that require sensitive attribute labels such as race and gender for reducing the demographic bias, we aim at addressing the identity bias in face representation, i.e., the performance inconsistency between different identities, without the need for sensitive attribute labels. To this end, we propose MixFair Adapter to determine and reduce the identity bias of training samples. Our extensive experiments demonstrate that our MixFairFace approach achieves state-of-the-art fairness performance on all benchmark datasets. △ Less

Submitted 28 November, 2022; originally announced November 2022.

Comments: Accepted in AAAI-23; Code: https://github.com/fuenwang/MixFairFace

Showing 1–50 of 104 results for author: Lai, S