Search | arXiv e-print repository

Latent Ewald summation for machine learning of long-range interactions

Abstract: Machine learning interatomic potentials (MLIPs) often neglect long-range interactions, such as electrostatic and dispersion forces. In this work, we introduce a straightforward and efficient method to account for long-range interactions by learning a latent variable from local atomic descriptors and applying an Ewald summation to this variable. We demonstrate that in systems including charged, pol… ▽ More Machine learning interatomic potentials (MLIPs) often neglect long-range interactions, such as electrostatic and dispersion forces. In this work, we introduce a straightforward and efficient method to account for long-range interactions by learning a latent variable from local atomic descriptors and applying an Ewald summation to this variable. We demonstrate that in systems including charged, polar, or apolar molecular dimers, bulk water, and water-vapor interface, standard short-ranged MLIPs can lead to unphysical predictions even when employing message passing. The long-range models effectively eliminate these artifacts, with only about twice the computational cost of short-range MLIPs. △ Less

Submitted 27 August, 2024; originally announced August 2024.

arXiv:2408.14074 [pdf, other]

Abstraction Engineering

Authors: Nelly Bencomo, Jordi Cabot, Marsha Chechik, Betty H. C. Cheng, Benoit Combemale, Andrzej Wąsowski, Steffen Zschaler

Abstract: Modern software-based systems operate under rapidly changing conditions and face ever-increasing uncertainty. In response, systems are increasingly adaptive and reliant on artificial-intelligence methods. In addition to the ubiquity of software with respect to users and application areas (e.g., transportation, smart grids, medicine, etc.), these high-impact software systems necessarily draw from m… ▽ More Modern software-based systems operate under rapidly changing conditions and face ever-increasing uncertainty. In response, systems are increasingly adaptive and reliant on artificial-intelligence methods. In addition to the ubiquity of software with respect to users and application areas (e.g., transportation, smart grids, medicine, etc.), these high-impact software systems necessarily draw from many disciplines for foundational principles, domain expertise, and workflows. Recent progress with lowering the barrier to entry for coding has led to a broader community of developers, who are not necessarily software engineers. As such, the field of software engineering needs to adapt accordingly and offer new methods to systematically develop high-quality software systems by a broad range of experts and non-experts. This paper looks at these new challenges and proposes to address them through the lens of Abstraction. Abstraction is already used across many disciplines involved in software development -- from the time-honored classical deductive reasoning and formal modeling to the inductive reasoning employed by modern data science. The software engineering of the future requires Abstraction Engineering -- a systematic approach to abstraction across the inductive and deductive spaces. We discuss the foundations of Abstraction Engineering, identify key challenges, highlight the research questions that help address these challenges, and create a roadmap for future research. △ Less

Submitted 26 August, 2024; originally announced August 2024.

arXiv:2408.08189 [pdf, other]

FancyVideo: Towards Dynamic and Consistent Video Generation via Cross-frame Textual Guidance

Authors: Jiasong Feng, Ao Ma, Jing Wang, Bo Cheng, Xiaodan Liang, Dawei Leng, Yuhui Yin

Abstract: Synthesizing motion-rich and temporally consistent videos remains a challenge in artificial intelligence, especially when dealing with extended durations. Existing text-to-video (T2V) models commonly employ spatial cross-attention for text control, equivalently guiding different frame generations without frame-specific textual guidance. Thus, the model's capacity to comprehend the temporal logic c… ▽ More Synthesizing motion-rich and temporally consistent videos remains a challenge in artificial intelligence, especially when dealing with extended durations. Existing text-to-video (T2V) models commonly employ spatial cross-attention for text control, equivalently guiding different frame generations without frame-specific textual guidance. Thus, the model's capacity to comprehend the temporal logic conveyed in prompts and generate videos with coherent motion is restricted. To tackle this limitation, we introduce FancyVideo, an innovative video generator that improves the existing text-control mechanism with the well-designed Cross-frame Textual Guidance Module (CTGM). Specifically, CTGM incorporates the Temporal Information Injector (TII), Temporal Affinity Refiner (TAR), and Temporal Feature Booster (TFB) at the beginning, middle, and end of cross-attention, respectively, to achieve frame-specific textual guidance. Firstly, TII injects frame-specific information from latent features into text conditions, thereby obtaining cross-frame textual conditions. Then, TAR refines the correlation matrix between cross-frame textual conditions and latent features along the time dimension. Lastly, TFB boosts the temporal consistency of latent features. Extensive experiments comprising both quantitative and qualitative evaluations demonstrate the effectiveness of FancyVideo. Our video demo, code and model are available at https://360cvgroup.github.io/FancyVideo/. △ Less

Submitted 16 August, 2024; v1 submitted 15 August, 2024; originally announced August 2024.

arXiv:2408.07957 [pdf, other]

Joint Optimization of Buffer Delay and HARQ for Video Communications

Authors: Baoping Cheng, Peng Lei, Xiaoyan Xie, Tao Fu, Yukun Zhang, Xiaoming Tao

Abstract: To improve the quality of experience (QoE) in video communication over lossy networks, this paper presents a transmission method that jointly optimizes buffer delay and Hybrid Automatic Repeat request (HARQ), referred to as BD-HARQ. This method operates on packet group and employs dynamic buffer delay combined with HARQ strategy for transmission. By defining the QoE based on metrics such as buffer… ▽ More To improve the quality of experience (QoE) in video communication over lossy networks, this paper presents a transmission method that jointly optimizes buffer delay and Hybrid Automatic Repeat request (HARQ), referred to as BD-HARQ. This method operates on packet group and employs dynamic buffer delay combined with HARQ strategy for transmission. By defining the QoE based on metrics such as buffer delay, Forward Error Correction (FEC) redundancy, and data recovery rate, the proposed method derives its closed-form expression through rigorous mathematical modeling and analysis. The optimal transmission parameters, i.e., the buffer delay and the FEC redundancy, are then determined and implemented, guaranteeing the real-time performance, transmission efficiency, and data recovery rate of video communication. Experimental results demonstrate that the proposed method aligns well with its theoretical expectations, and that it can provide up to 13.7% higher QoE compared to existing methods and increase the tolerance for packet loss rate from 15%-22% to up to 31% while maintaining a high QoE. △ Less

Submitted 15 August, 2024; originally announced August 2024.

Comments: 6 pages, 5figures

arXiv:2408.07397 [pdf, other]

Bridging Training and Execution via Dynamic Directed Graph-Based Communication in Cooperative Multi-Agent Systems

Authors: Zhuohui Zhang, Bin He, Bin Cheng, Gang Li

Abstract: Multi-agent systems must learn to communicate and understand interactions between agents to achieve cooperative goals in partially observed tasks. However, existing approaches lack a dynamic directed communication mechanism and rely on global states, thus diminishing the role of communication in centralized training. Thus, we propose the transformer-based graph coarsening network (TGCNet), a novel… ▽ More Multi-agent systems must learn to communicate and understand interactions between agents to achieve cooperative goals in partially observed tasks. However, existing approaches lack a dynamic directed communication mechanism and rely on global states, thus diminishing the role of communication in centralized training. Thus, we propose the transformer-based graph coarsening network (TGCNet), a novel multi-agent reinforcement learning (MARL) algorithm. TGCNet learns the topological structure of a dynamic directed graph to represent the communication policy and integrates graph coarsening networks to approximate the representation of global state during training. It also utilizes the transformer decoder for feature extraction during execution. Experiments on multiple cooperative MARL benchmarks demonstrate state-of-the-art performance compared to popular MARL algorithms. Further ablation studies validate the effectiveness of our dynamic directed graph communication mechanism and graph coarsening networks. △ Less

Submitted 14 August, 2024; originally announced August 2024.

Comments: 9 pages, 7 figures

arXiv:2408.04344 [pdf, other]

Semantic-Enhanced Indirect Call Analysis with Large Language Models

Authors: Baijun Cheng, Cen Zhang, Kailong Wang, Ling Shi, Yang Liu, Haoyu Wang, Yao Guo, Xiangqun Chen

Abstract: In contemporary software development, the widespread use of indirect calls to achieve dynamic features poses challenges in constructing precise control flow graphs (CFGs), which further impacts the performance of downstream static analysis tasks. To tackle this issue, various types of indirect call analyzers have been proposed. However, they do not fully leverage the semantic information of the pr… ▽ More In contemporary software development, the widespread use of indirect calls to achieve dynamic features poses challenges in constructing precise control flow graphs (CFGs), which further impacts the performance of downstream static analysis tasks. To tackle this issue, various types of indirect call analyzers have been proposed. However, they do not fully leverage the semantic information of the program, limiting their effectiveness in real-world scenarios. To address these issues, this paper proposes Semantic-Enhanced Analysis (SEA), a new approach to enhance the effectiveness of indirect call analysis. Our fundamental insight is that for common programming practices, indirect calls often exhibit semantic similarity with their invoked targets. This semantic alignment serves as a supportive mechanism for static analysis techniques in filtering out false targets. Notably, contemporary large language models (LLMs) are trained on extensive code corpora, encompassing tasks such as code summarization, making them well-suited for semantic analysis. Specifically, SEA leverages LLMs to generate natural language summaries of both indirect calls and target functions from multiple perspectives. Through further analysis of these summaries, SEA can determine their suitability as caller-callee pairs. Experimental results demonstrate that SEA can significantly enhance existing static analysis methods by producing more precise target sets for indirect calls. △ Less

Submitted 8 August, 2024; originally announced August 2024.

Comments: Accepted by ASE'24

arXiv:2408.00290 [pdf, other]

Multi-Modal Parameter-Efficient Fine-tuning via Graph Neural Network

Authors: Bin Cheng, Jiaxuan Lu

Abstract: With the advent of the era of foundation models, pre-training and fine-tuning have become common paradigms. Recently, parameter-efficient fine-tuning has garnered widespread attention due to its better balance between the number of learnable parameters and performance. However, some current parameter-efficient fine-tuning methods only model a single modality and lack the utilization of structural… ▽ More With the advent of the era of foundation models, pre-training and fine-tuning have become common paradigms. Recently, parameter-efficient fine-tuning has garnered widespread attention due to its better balance between the number of learnable parameters and performance. However, some current parameter-efficient fine-tuning methods only model a single modality and lack the utilization of structural knowledge in downstream tasks. To address this issue, this paper proposes a multi-modal parameter-efficient fine-tuning method based on graph networks. Each image is fed into a multi-modal large language model (MLLM) to generate a text description. The image and its corresponding text description are then processed by a frozen image encoder and text encoder to generate image features and text features, respectively. A graph is constructed based on the similarity of the multi-modal feature nodes, and knowledge and relationships relevant to these features are extracted from each node. Additionally, Elastic Weight Consolidation (EWC) regularization is incorporated into the loss function to mitigate the problem of forgetting during task learning. The proposed model achieves test accuracies on the OxfordPets, Flowers102, and Food101 datasets that improve by 4.45%, 2.92%, and 0.23%, respectively. The code is available at https://github.com/yunche0/GA-Net/tree/master. △ Less

Submitted 1 August, 2024; originally announced August 2024.

arXiv:2407.13664 [pdf, other]

Decision Focused Causal Learning for Direct Counterfactual Marketing Optimization

Authors: Hao Zhou, Rongxiao Huang, Shaoming Li, Guibin Jiang, Jiaqi Zheng, Bing Cheng, Wei Lin

Abstract: Marketing optimization plays an important role to enhance user engagement in online Internet platforms. Existing studies usually formulate this problem as a budget allocation problem and solve it by utilizing two fully decoupled stages, i.e., machine learning (ML) and operation research (OR). However, the learning objective in ML does not take account of the downstream optimization task in OR, whi… ▽ More Marketing optimization plays an important role to enhance user engagement in online Internet platforms. Existing studies usually formulate this problem as a budget allocation problem and solve it by utilizing two fully decoupled stages, i.e., machine learning (ML) and operation research (OR). However, the learning objective in ML does not take account of the downstream optimization task in OR, which causes that the prediction accuracy in ML may be not positively related to the decision quality. Decision Focused Learning (DFL) integrates ML and OR into an end-to-end framework, which takes the objective of the downstream task as the decision loss function and guarantees the consistency of the optimization direction between ML and OR. However, deploying DFL in marketing is non-trivial due to multiple technological challenges. Firstly, the budget allocation problem in marketing is a 0-1 integer stochastic programming problem and the budget is uncertain and fluctuates a lot in real-world settings, which is beyond the general problem background in DFL. Secondly, the counterfactual in marketing causes that the decision loss cannot be directly computed and the optimal solution can never be obtained, both of which disable the common gradient-estimation approaches in DFL. Thirdly, the OR solver is called frequently to compute the decision loss during model training in DFL, which produces huge computational cost and cannot support large-scale training data. In this paper, we propose a decision focused causal learning framework (DFCL) for direct counterfactual marketing optimization, which overcomes the above technological challenges. Both offline experiments and online A/B testing demonstrate the effectiveness of DFCL over the state-of-the-art methods. Currently, DFCL has been deployed in several marketing scenarios in Meituan, one of the largest online food delivery platform in the world. △ Less

Submitted 18 July, 2024; originally announced July 2024.

Comments: Accepted by KDD 2024

arXiv:2406.17257 [pdf, other]

Leveraging Parameter-Efficient Transfer Learning for Multi-Lingual Text-to-Speech Adaptation

Authors: Yingting Li, Ambuj Mehrish, Bryan Chew, Bo Cheng, Soujanya Poria

Abstract: Different languages have distinct phonetic systems and vary in their prosodic features making it challenging to develop a Text-to-Speech (TTS) model that can effectively synthesise speech in multilingual settings. Furthermore, TTS architecture needs to be both efficient enough to capture nuances in multiple languages and efficient enough to be practical for deployment. The standard approach is to… ▽ More Different languages have distinct phonetic systems and vary in their prosodic features making it challenging to develop a Text-to-Speech (TTS) model that can effectively synthesise speech in multilingual settings. Furthermore, TTS architecture needs to be both efficient enough to capture nuances in multiple languages and efficient enough to be practical for deployment. The standard approach is to build transformer based model such as SpeechT5 and train it on large multilingual dataset. As the size of these models grow the conventional fine-tuning for adapting these model becomes impractical due to heavy computational cost. In this paper, we proposes to integrate parameter-efficient transfer learning (PETL) methods such as adapters and hypernetwork with TTS architecture for multilingual speech synthesis. Notably, in our experiments PETL methods able to achieve comparable or even better performance compared to full fine-tuning with only $\sim$2.5\% tunable parameters.The code and samples are available at: https://anonymous.4open.science/r/multilingualTTS-BA4C. △ Less

Submitted 24 June, 2024; originally announced June 2024.

arXiv:2406.03177 [pdf, other]

FAPNet: An Effective Frequency Adaptive Point-based Eye Tracker

Authors: Xiaopeng Lin, Hongwei Ren, Bojun Cheng

Abstract: Eye tracking is crucial for human-computer interaction in different domains. Conventional cameras encounter challenges such as power consumption and image quality during different eye movements, prompting the need for advanced solutions with ultra-fast, low-power, and accurate eye trackers. Event cameras, fundamentally designed to capture information about moving objects, exhibit low power consump… ▽ More Eye tracking is crucial for human-computer interaction in different domains. Conventional cameras encounter challenges such as power consumption and image quality during different eye movements, prompting the need for advanced solutions with ultra-fast, low-power, and accurate eye trackers. Event cameras, fundamentally designed to capture information about moving objects, exhibit low power consumption and high temporal resolution. This positions them as an alternative to traditional cameras in the realm of eye tracking. Nevertheless, existing event-based eye tracking networks neglect the pivotal sparse and fine-grained temporal information in events, resulting in unsatisfactory performance. Moreover, the energy-efficient features are further compromised by the use of excessively complex models, hindering efficient deployment on edge devices. In this paper, we utilize Point Cloud as the event representation to harness the high temporal resolution and sparse characteristics of events in eye tracking tasks. We rethink the point-based architecture PEPNet with preprocessing the long-term relationships between samples, leading to the innovative design of FAPNet. A frequency adaptive mechanism is designed to realize adaptive tracking according to the speed of the pupil movement and the Inter Sample LSTM module is introduced to utilize the temporal correlation between samples. In the Event-based Eye Tracking Challenge, we utilize vanilla PEPNet, which is the former work to achieve the $p_{10}$ accuracy of 97.95\%. On the SEET synthetic dataset, FAPNet can achieve state-of-the-art while consuming merely 10\% of the PEPNet's computational resources. Notably, the computational demand of FAPNet is independent of the sensor's spatial resolution, enhancing its applicability on resource-limited edge devices. △ Less

Submitted 5 June, 2024; originally announced June 2024.

Comments: Accepted by CVPRW 2024 (AIS)

arXiv:2405.19688 [pdf, other]

DNPM: A Neural Parametric Model for the Synthesis of Facial Geometric Details

Authors: Haitao Cao, Baoping Cheng, Qiran Pu, Haocheng Zhang, Bin Luo, Yixiang Zhuang, Juncong Lin, Liyan Chen, Xuan Cheng

Abstract: Parametric 3D models have enabled a wide variety of computer vision and graphics tasks, such as modeling human faces, bodies and hands. In 3D face modeling, 3DMM is the most widely used parametric model, but can't generate fine geometric details solely from identity and expression inputs. To tackle this limitation, we propose a neural parametric model named DNPM for the facial geometric details, w… ▽ More Parametric 3D models have enabled a wide variety of computer vision and graphics tasks, such as modeling human faces, bodies and hands. In 3D face modeling, 3DMM is the most widely used parametric model, but can't generate fine geometric details solely from identity and expression inputs. To tackle this limitation, we propose a neural parametric model named DNPM for the facial geometric details, which utilizes deep neural network to extract latent codes from facial displacement maps encoding details and wrinkles. Built upon DNPM, a novel 3DMM named Detailed3DMM is proposed, which augments traditional 3DMMs by including the synthesis of facial details only from the identity and expression inputs. Moreover, we show that DNPM and Detailed3DMM can facilitate two downstream applications: speech-driven detailed 3D facial animation and 3D face reconstruction from a degraded image. Extensive experiments have shown the usefulness of DNPM and Detailed3DMM, and the progressiveness of two proposed applications. △ Less

Submitted 13 June, 2024; v1 submitted 30 May, 2024; originally announced May 2024.

arXiv:2405.14159 [pdf, other]

Super Tiny Language Models

Authors: Dylan Hillier, Leon Guertler, Cheston Tan, Palaash Agrawal, Chen Ruirui, Bobby Cheng

Abstract: The rapid advancement of large language models (LLMs) has led to significant improvements in natural language processing but also poses challenges due to their high computational and energy demands. This paper introduces a series of research efforts focused on Super Tiny Language Models (STLMs), which aim to deliver high performance with significantly reduced parameter counts. We explore innovativ… ▽ More The rapid advancement of large language models (LLMs) has led to significant improvements in natural language processing but also poses challenges due to their high computational and energy demands. This paper introduces a series of research efforts focused on Super Tiny Language Models (STLMs), which aim to deliver high performance with significantly reduced parameter counts. We explore innovative techniques such as byte-level tokenization with a pooling mechanism, weight tying, and efficient training strategies. These methods aim to significantly reduce reduce the parameter count compared to traditional models -- in future works, we aim to build on these in a way that maintains and improves upon the performance of base transformer models. This series of papers will explore into various subproblems, including tokenizer-free models, self-play based training, and alternative training objectives. We will target models with 10M, 50M, and 100M parameters. Our ultimate goal is to make high-performance language models more accessible and practical for a wide range of applications. △ Less

Submitted 26 June, 2024; v1 submitted 23 May, 2024; originally announced May 2024.

Comments: 11 pages, 4 figures

ACM Class: I.2.7

arXiv:2405.09057 [pdf, other]

Response Matching for generating materials and molecules

Authors: Bingqing Cheng

Abstract: Machine learning has recently emerged as a powerful tool for generating new molecular and material structures. The success of state-of-the-art models stems from their ability to incorporate physical symmetries, such as translation, rotation, and periodicity. Here, we present a novel generative method called Response Matching (RM), which leverages the fact that each stable material or molecule exis… ▽ More Machine learning has recently emerged as a powerful tool for generating new molecular and material structures. The success of state-of-the-art models stems from their ability to incorporate physical symmetries, such as translation, rotation, and periodicity. Here, we present a novel generative method called Response Matching (RM), which leverages the fact that each stable material or molecule exists at the minimum of its potential energy surface. Consequently, any perturbation induces a response in energy and stress, driving the structure back to equilibrium. Matching to such response is closely related to score matching in diffusion models. By employing the combination of a machine learning interatomic potential and random structure search as the denoising model, RM exploits the locality of atomic interactions, and inherently respects permutation, translation, rotation, and periodic invariances. RM is the first model to handle both molecules and bulk materials under the same framework. We demonstrate the efficiency and generalization of RM across three systems: a small organic molecular dataset, stable crystals from the Materials Project, and one-shot learning on a single diamond configuration. △ Less

Submitted 14 May, 2024; originally announced May 2024.

arXiv:2405.06116 [pdf, other]

Rethinking Efficient and Effective Point-based Networks for Event Camera Classification and Regression: EventMamba

Authors: Hongwei Ren, Yue Zhou, Jiadong Zhu, Haotian Fu, Yulong Huang, Xiaopeng Lin, Yuetong Fang, Fei Ma, Hao Yu, Bojun Cheng

Abstract: Event cameras, drawing inspiration from biological systems, efficiently detect changes in ambient light with low latency and high dynamic range while consuming minimal power. The most current approach to processing event data often involves converting it into frame-based representations, which is well-established in traditional vision. However, this approach neglects the sparsity of event data, lo… ▽ More Event cameras, drawing inspiration from biological systems, efficiently detect changes in ambient light with low latency and high dynamic range while consuming minimal power. The most current approach to processing event data often involves converting it into frame-based representations, which is well-established in traditional vision. However, this approach neglects the sparsity of event data, loses fine-grained temporal information during the transformation process, and increases the computational burden, making it ineffective for characterizing event camera properties. In contrast, Point Cloud is a popular representation for 3D processing and is better suited to match the sparse and asynchronous nature of the event camera. Nevertheless, despite the theoretical compatibility of point-based methods with event cameras, the results show a performance gap that is not yet satisfactory compared to frame-based methods. In order to bridge the performance gap, we propose EventMamba, an efficient and effective Point Cloud framework that achieves competitive results even compared to the state-of-the-art (SOTA) frame-based method in both classification and regression tasks. This notable accomplishment is facilitated by our rethinking of the distinction between Event Cloud and Point Cloud, emphasizing effective temporal information extraction through optimized network structures. Specifically, EventMamba leverages temporal aggregation and State Space Model (SSM) based Mamba boasting enhanced temporal information extraction capabilities. Through a hierarchical structure, EventMamba is adept at abstracting local and global spatial features and implicit and explicit temporal features. By adhering to the lightweight design principle, EventMamba delivers impressive results with minimal computational resource utilization, demonstrating its efficiency and effectiveness. △ Less

Submitted 2 July, 2024; v1 submitted 9 May, 2024; originally announced May 2024.

Comments: Extension Journal of TTPOINT and PEPNet, modify the dataset split method

arXiv:2404.18599 [pdf, other]

Self-supervised learning for classifying paranasal anomalies in the maxillary sinus

Authors: Debayan Bhattacharya, Finn Behrendt, Benjamin Tobias Becker, Lennart Maack, Dirk Beyersdorff, Elina Petersen, Marvin Petersen, Bastian Cheng, Dennis Eggert, Christian Betz, Anna Sophie Hoffmann, Alexander Schlaefer

Abstract: Purpose: Paranasal anomalies, frequently identified in routine radiological screenings, exhibit diverse morphological characteristics. Due to the diversity of anomalies, supervised learning methods require large labelled dataset exhibiting diverse anomaly morphology. Self-supervised learning (SSL) can be used to learn representations from unlabelled data. However, there are no SSL methods designed… ▽ More Purpose: Paranasal anomalies, frequently identified in routine radiological screenings, exhibit diverse morphological characteristics. Due to the diversity of anomalies, supervised learning methods require large labelled dataset exhibiting diverse anomaly morphology. Self-supervised learning (SSL) can be used to learn representations from unlabelled data. However, there are no SSL methods designed for the downstream task of classifying paranasal anomalies in the maxillary sinus (MS). Methods: Our approach uses a 3D Convolutional Autoencoder (CAE) trained in an unsupervised anomaly detection (UAD) framework. Initially, we train the 3D CAE to reduce reconstruction errors when reconstructing normal maxillary sinus (MS) image. Then, this CAE is applied to an unlabelled dataset to generate coarse anomaly locations by creating residual MS images. Following this, a 3D Convolutional Neural Network (CNN) reconstructs these residual images, which forms our SSL task. Lastly, we fine-tune the encoder part of the 3D CNN on a labelled dataset of normal and anomalous MS images. Results: The proposed SSL technique exhibits superior performance compared to existing generic self-supervised methods, especially in scenarios with limited annotated data. When trained on just 10% of the annotated dataset, our method achieves an Area Under the Precision-Recall Curve (AUPRC) of 0.79 for the downstream classification task. This performance surpasses other methods, with BYOL attaining an AUPRC of 0.75, SimSiam at 0.74, SimCLR at 0.73 and Masked Autoencoding using SparK at 0.75. Conclusion: A self-supervised learning approach that inherently focuses on localizing paranasal anomalies proves to be advantageous, particularly when the subsequent task involves differentiating normal from anomalous maxillary sinuses. Access our code at https://github.com/mtec-tuhh/self-supervised-paranasal-anomaly △ Less

Submitted 29 April, 2024; originally announced April 2024.

arXiv:2404.12888 [pdf, other]

Learn2Talk: 3D Talking Face Learns from 2D Talking Face

Authors: Yixiang Zhuang, Baoping Cheng, Yao Cheng, Yuntao Jin, Renshuai Liu, Chengyang Li, Xuan Cheng, Jing Liao, Juncong Lin

Abstract: Speech-driven facial animation methods usually contain two main classes, 3D and 2D talking face, both of which attract considerable research attention in recent years. However, to the best of our knowledge, the research on 3D talking face does not go deeper as 2D talking face, in the aspect of lip-synchronization (lip-sync) and speech perception. To mind the gap between the two sub-fields, we prop… ▽ More Speech-driven facial animation methods usually contain two main classes, 3D and 2D talking face, both of which attract considerable research attention in recent years. However, to the best of our knowledge, the research on 3D talking face does not go deeper as 2D talking face, in the aspect of lip-synchronization (lip-sync) and speech perception. To mind the gap between the two sub-fields, we propose a learning framework named Learn2Talk, which can construct a better 3D talking face network by exploiting two expertise points from the field of 2D talking face. Firstly, inspired by the audio-video sync network, a 3D sync-lip expert model is devised for the pursuit of lip-sync between audio and 3D facial motion. Secondly, a teacher model selected from 2D talking face methods is used to guide the training of the audio-to-3D motions regression network to yield more 3D vertex accuracy. Extensive experiments show the advantages of the proposed framework in terms of lip-sync, vertex accuracy and speech perception, compared with state-of-the-arts. Finally, we show two applications of the proposed framework: audio-visual speech recognition and speech-driven 3D Gaussian Splatting based avatar animation. △ Less

Submitted 19 April, 2024; originally announced April 2024.

arXiv:2404.11770 [pdf, other]

Event-Based Eye Tracking. AIS 2024 Challenge Survey

Authors: Zuowen Wang, Chang Gao, Zongwei Wu, Marcos V. Conde, Radu Timofte, Shih-Chii Liu, Qinyu Chen, Zheng-jun Zha, Wei Zhai, Han Han, Bohao Liao, Yuliang Wu, Zengyu Wan, Zhong Wang, Yang Cao, Ganchao Tan, Jinze Chen, Yan Ru Pei, Sasskia Brüers, Sébastien Crouzet, Douglas McLelland, Oliver Coenen, Baoheng Zhang, Yizhao Gao, Jingyuan Li , et al. (14 additional authors not shown)

Abstract: This survey reviews the AIS 2024 Event-Based Eye Tracking (EET) Challenge. The task of the challenge focuses on processing eye movement recorded with event cameras and predicting the pupil center of the eye. The challenge emphasizes efficient eye tracking with event cameras to achieve good task accuracy and efficiency trade-off. During the challenge period, 38 participants registered for the Kaggl… ▽ More This survey reviews the AIS 2024 Event-Based Eye Tracking (EET) Challenge. The task of the challenge focuses on processing eye movement recorded with event cameras and predicting the pupil center of the eye. The challenge emphasizes efficient eye tracking with event cameras to achieve good task accuracy and efficiency trade-off. During the challenge period, 38 participants registered for the Kaggle competition, and 8 teams submitted a challenge factsheet. The novel and diverse methods from the submitted factsheets are reviewed and analyzed in this survey to advance future event-based eye tracking research. △ Less

Submitted 17 April, 2024; originally announced April 2024.

Comments: Qinyu Chen is the corresponding author

arXiv:2404.09000 [pdf, other]

MaSkel: A Model for Human Whole-body X-rays Generation from Human Masking Images

Authors: Yingjie Xi, Boyuan Cheng, Jingyao Cai, Jian Jun Zhang, Xiaosong Yang

Abstract: The human whole-body X-rays could offer a valuable reference for various applications, including medical diagnostics, digital animation modeling, and ergonomic design. The traditional method of obtaining X-ray information requires the use of CT (Computed Tomography) scan machines, which emit potentially harmful radiation. Thus it faces a significant limitation for realistic applications because it… ▽ More The human whole-body X-rays could offer a valuable reference for various applications, including medical diagnostics, digital animation modeling, and ergonomic design. The traditional method of obtaining X-ray information requires the use of CT (Computed Tomography) scan machines, which emit potentially harmful radiation. Thus it faces a significant limitation for realistic applications because it lacks adaptability and safety. In our work, We proposed a new method to directly generate the 2D human whole-body X-rays from the human masking images. The predicted images will be similar to the real ones with the same image style and anatomic structure. We employed a data-driven strategy. By leveraging advanced generative techniques, our model MaSkel(Masking image to Skeleton X-rays) could generate a high-quality X-ray image from a human masking image without the need for invasive and harmful radiation exposure, which not only provides a new path to generate highly anatomic and customized data but also reduces health risks. To our knowledge, our model MaSkel is the first work for predicting whole-body X-rays. In this paper, we did two parts of the work. The first one is to solve the data limitation problem, the diffusion-based techniques are utilized to make a data augmentation, which provides two synthetic datasets for preliminary pretraining. Then we designed a two-stage training strategy to train MaSkel. At last, we make qualitative and quantitative evaluations of the generated X-rays. In addition, we invite some professional doctors to assess our predicted data. These evaluations demonstrate the MaSkel's superior ability to generate anatomic X-rays from human masking images. The related code and links of the dataset are available at https://github.com/2022yingjie/MaSkel. △ Less

Submitted 13 April, 2024; originally announced April 2024.

arXiv:2404.07663 [pdf, other]

Interactive Ontology Matching with Cost-Efficient Learning

Authors: Bin Cheng, Jonathan Fürst, Tobias Jacobs, Celia Garrido-Hidalgo

Abstract: The creation of high-quality ontologies is crucial for data integration and knowledge-based reasoning, specifically in the context of the rising data economy. However, automatic ontology matchers are often bound to the heuristics they are based on, leaving many matches unidentified. Interactive ontology matching systems involving human experts have been introduced, but they do not solve the fundam… ▽ More The creation of high-quality ontologies is crucial for data integration and knowledge-based reasoning, specifically in the context of the rising data economy. However, automatic ontology matchers are often bound to the heuristics they are based on, leaving many matches unidentified. Interactive ontology matching systems involving human experts have been introduced, but they do not solve the fundamental issue of flexibly finding additional matches outside the scope of the implemented heuristics, even though this is highly demanded in industrial settings. Active machine learning methods appear to be a promising path towards a flexible interactive ontology matcher. However, off-the-shelf active learning mechanisms suffer from low query efficiency due to extreme class imbalance, resulting in a last-mile problem where high human effort is required to identify the remaining matches. To address the last-mile problem, this work introduces DualLoop, an active learning method tailored to ontology matching. DualLoop offers three main contributions: (1) an ensemble of tunable heuristic matchers, (2) a short-term learner with a novel query strategy adapted to highly imbalanced data, and (3) long-term learners to explore potential matches by creating and tuning new heuristics. We evaluated DualLoop on three datasets of varying sizes and domains. Compared to existing active learning methods, we consistently achieved better F1 scores and recall, reducing the expected query cost spent on finding 90% of all matches by over 50%. Compared to traditional interactive ontology matchers, we are able to find additional, last-mile matches. Finally, we detail the successful deployment of our approach within an actual product and report its operational performance results within the Architecture, Engineering, and Construction (AEC) industry sector, showcasing its practical value and efficiency. △ Less

Submitted 11 April, 2024; originally announced April 2024.

arXiv:2404.04645 [pdf, other]

HyperTTS: Parameter Efficient Adaptation in Text to Speech using Hypernetworks

Authors: Yingting Li, Rishabh Bhardwaj, Ambuj Mehrish, Bo Cheng, Soujanya Poria

Abstract: Neural speech synthesis, or text-to-speech (TTS), aims to transform a signal from the text domain to the speech domain. While developing TTS architectures that train and test on the same set of speakers has seen significant improvements, out-of-domain speaker performance still faces enormous limitations. Domain adaptation on a new set of speakers can be achieved by fine-tuning the whole model for… ▽ More Neural speech synthesis, or text-to-speech (TTS), aims to transform a signal from the text domain to the speech domain. While developing TTS architectures that train and test on the same set of speakers has seen significant improvements, out-of-domain speaker performance still faces enormous limitations. Domain adaptation on a new set of speakers can be achieved by fine-tuning the whole model for each new domain, thus making it parameter-inefficient. This problem can be solved by Adapters that provide a parameter-efficient alternative to domain adaptation. Although famous in NLP, speech synthesis has not seen much improvement from Adapters. In this work, we present HyperTTS, which comprises a small learnable network, "hypernetwork", that generates parameters of the Adapter blocks, allowing us to condition Adapters on speaker representations and making them dynamic. Extensive evaluations of two domain adaptation settings demonstrate its effectiveness in achieving state-of-the-art performance in the parameter-efficient regime. We also compare different variants of HyperTTS, comparing them with baselines in different studies. Promising results on the dynamic adaptation of adapter parameters using hypernetworks open up new avenues for domain-generic multi-speaker TTS systems. The audio samples and code are available at https://github.com/declare-lab/HyperTTS. △ Less

Submitted 6 April, 2024; originally announced April 2024.

arXiv:2404.00569 [pdf, other]

CM-TTS: Enhancing Real Time Text-to-Speech Synthesis Efficiency through Weighted Samplers and Consistency Models

Authors: Xiang Li, Fan Bu, Ambuj Mehrish, Yingting Li, Jiale Han, Bo Cheng, Soujanya Poria

Abstract: Neural Text-to-Speech (TTS) systems find broad applications in voice assistants, e-learning, and audiobook creation. The pursuit of modern models, like Diffusion Models (DMs), holds promise for achieving high-fidelity, real-time speech synthesis. Yet, the efficiency of multi-step sampling in Diffusion Models presents challenges. Efforts have been made to integrate GANs with DMs, speeding up infere… ▽ More Neural Text-to-Speech (TTS) systems find broad applications in voice assistants, e-learning, and audiobook creation. The pursuit of modern models, like Diffusion Models (DMs), holds promise for achieving high-fidelity, real-time speech synthesis. Yet, the efficiency of multi-step sampling in Diffusion Models presents challenges. Efforts have been made to integrate GANs with DMs, speeding up inference by approximating denoising distributions, but this introduces issues with model convergence due to adversarial training. To overcome this, we introduce CM-TTS, a novel architecture grounded in consistency models (CMs). Drawing inspiration from continuous-time diffusion models, CM-TTS achieves top-quality speech synthesis in fewer steps without adversarial training or pre-trained model dependencies. We further design weighted samplers to incorporate different sampling positions into model training with dynamic probabilities, ensuring unbiased learning throughout the entire training process. We present a real-time mel-spectrogram generation consistency model, validated through comprehensive evaluations. Experimental results underscore CM-TTS's superiority over existing single-step speech synthesis systems, representing a significant advancement in the field. △ Less

Submitted 31 March, 2024; originally announced April 2024.

Comments: Accepted by Findings of NAACL 2024. Code is available at https://github.com/XiangLi2022/CM-TTS

arXiv:2403.19412 [pdf, other]

A Simple and Effective Point-based Network for Event Camera 6-DOFs Pose Relocalization

Authors: Hongwei Ren, Jiadong Zhu, Yue Zhou, Haotian FU, Yulong Huang, Bojun Cheng

Abstract: Event cameras exhibit remarkable attributes such as high dynamic range, asynchronicity, and low latency, making them highly suitable for vision tasks that involve high-speed motion in challenging lighting conditions. These cameras implicitly capture movement and depth information in events, making them appealing sensors for Camera Pose Relocalization (CPR) tasks. Nevertheless, existing CPR network… ▽ More Event cameras exhibit remarkable attributes such as high dynamic range, asynchronicity, and low latency, making them highly suitable for vision tasks that involve high-speed motion in challenging lighting conditions. These cameras implicitly capture movement and depth information in events, making them appealing sensors for Camera Pose Relocalization (CPR) tasks. Nevertheless, existing CPR networks based on events neglect the pivotal fine-grained temporal information in events, resulting in unsatisfactory performance. Moreover, the energy-efficient features are further compromised by the use of excessively complex models, hindering efficient deployment on edge devices. In this paper, we introduce PEPNet, a simple and effective point-based network designed to regress six degrees of freedom (6-DOFs) event camera poses. We rethink the relationship between the event camera and CPR tasks, leveraging the raw Point Cloud directly as network input to harness the high-temporal resolution and inherent sparsity of events. PEPNet is adept at abstracting the spatial and implicit temporal features through hierarchical structure and explicit temporal features by Attentive Bi-directional Long Short-Term Memory (A-Bi-LSTM). By employing a carefully crafted lightweight design, PEPNet delivers state-of-the-art (SOTA) performance on both indoor and outdoor datasets with meager computational resources. Specifically, PEPNet attains a significant 38% and 33% performance improvement on the random split IJRR and M3ED datasets, respectively. Moreover, the lightweight design version PEPNet$_{tiny}$ accomplishes results comparable to the SOTA while employing a mere 0.5% of the parameters. △ Less

Submitted 28 March, 2024; originally announced March 2024.

Comments: Accepted by CVPR 2024

arXiv:2403.15917 [pdf, other]

Who Uses Personas in Requirements Engineering: The Practitioners' Perspective

Authors: Yi Wang, Chetan Arora, Xiao Liu, Thuong Hoang, Vasudha Malhotra, Ben Cheng, John Grundy

Abstract: Personas are commonly used in software projects to gain a better understanding of end-users' needs. However, there is a limited understanding of their usage and effectiveness in practice. This paper presents the results of a two-step investigation, comprising interviews with 26 software developers, UI/UX designers, business analysts and product managers and a survey of 203 practitioners, aimed at… ▽ More Personas are commonly used in software projects to gain a better understanding of end-users' needs. However, there is a limited understanding of their usage and effectiveness in practice. This paper presents the results of a two-step investigation, comprising interviews with 26 software developers, UI/UX designers, business analysts and product managers and a survey of 203 practitioners, aimed at shedding light on the current practices, methods and challenges of using personas in software development. Our findings reveal variations in the frequency and effectiveness of personas across different software projects and IT companies, the challenges practitioners face when using personas and the reasons for not using them at all. Furthermore, we investigate the coverage of human aspects in personas, often assumed to be a key feature of persona descriptions. Contrary to the general perception, our study shows that human aspects are often ignored for various reasons in personas or requirements engineering in general. Our study provides actionable insights for practitioners to overcome challenges in using personas during requirements engineering stages, and we identify areas for future research. △ Less

Submitted 23 March, 2024; originally announced March 2024.

arXiv:2403.00128 [pdf, other]

From Flies to Robots: Inverted Landing in Small Quadcopters with Dynamic Perching

Authors: Bryan Habas, Bo Cheng

Abstract: Inverted landing is a routine behavior among a number of animal fliers. However, mastering this feat poses a considerable challenge for robotic fliers, especially to perform dynamic perching with rapid body rotations (or flips) and landing against gravity. Inverted landing in flies have suggested that optical flow senses are closely linked to the precise triggering and control of body flips that l… ▽ More Inverted landing is a routine behavior among a number of animal fliers. However, mastering this feat poses a considerable challenge for robotic fliers, especially to perform dynamic perching with rapid body rotations (or flips) and landing against gravity. Inverted landing in flies have suggested that optical flow senses are closely linked to the precise triggering and control of body flips that lead to a variety of successful landing behaviors. Building upon this knowledge, we aimed to replicate the flies' landing behaviors in small quadcopters by developing a control policy general to arbitrary ceiling-approach conditions. First, we employed reinforcement learning in simulation to optimize discrete sensory-motor pairs across a broad spectrum of ceiling-approach velocities and directions. Next, we converted the sensory-motor pairs to a two-stage control policy in a continuous augmented-optical flow space. The control policy consists of a first-stage Flip-Trigger Policy, which employs a one-class support vector machine, and a second-stage Flip-Action Policy, implemented as a feed-forward neural network. To transfer the inverted-landing policy to physical systems, we utilized domain randomization and system identification techniques for a zero-shot sim-to-real transfer. As a result, we successfully achieved a range of robust inverted-landing behaviors in small quadcopters, emulating those observed in flies. △ Less

Submitted 29 February, 2024; originally announced March 2024.

Comments: 17 pages, 19 Figures, Journal paper currently under review

arXiv:2402.15713 [pdf, other]

Making Pre-trained Language Models Better Continual Few-Shot Relation Extractors

Authors: Shengkun Ma, Jiale Han, Yi Liang, Bo Cheng

Abstract: Continual Few-shot Relation Extraction (CFRE) is a practical problem that requires the model to continuously learn novel relations while avoiding forgetting old ones with few labeled training data. The primary challenges are catastrophic forgetting and overfitting. This paper harnesses prompt learning to explore the implicit capabilities of pre-trained language models to address the above two chal… ▽ More Continual Few-shot Relation Extraction (CFRE) is a practical problem that requires the model to continuously learn novel relations while avoiding forgetting old ones with few labeled training data. The primary challenges are catastrophic forgetting and overfitting. This paper harnesses prompt learning to explore the implicit capabilities of pre-trained language models to address the above two challenges, thereby making language models better continual few-shot relation extractors. Specifically, we propose a Contrastive Prompt Learning framework, which designs prompt representation to acquire more generalized knowledge that can be easily adapted to old and new categories, and margin-based contrastive learning to focus more on hard samples, therefore alleviating catastrophic forgetting and overfitting issues. To further remedy overfitting in low-resource scenarios, we introduce an effective memory augmentation strategy that employs well-crafted prompts to guide ChatGPT in generating diverse samples. Extensive experiments demonstrate that our method outperforms state-of-the-art methods by a large margin and significantly mitigates catastrophic forgetting and overfitting in low-resource scenarios. △ Less

Submitted 23 February, 2024; originally announced February 2024.

Comments: Accepted as COLING2024

arXiv:2402.13252 [pdf, other]

Improving Robustness for Joint Optimization of Camera Poses and Decomposed Low-Rank Tensorial Radiance Fields

Authors: Bo-Yu Cheng, Wei-Chen Chiu, Yu-Lun Liu

Abstract: In this paper, we propose an algorithm that allows joint refinement of camera pose and scene geometry represented by decomposed low-rank tensor, using only 2D images as supervision. First, we conduct a pilot study based on a 1D signal and relate our findings to 3D scenarios, where the naive joint pose optimization on voxel-based NeRFs can easily lead to sub-optimal solutions. Moreover, based on th… ▽ More In this paper, we propose an algorithm that allows joint refinement of camera pose and scene geometry represented by decomposed low-rank tensor, using only 2D images as supervision. First, we conduct a pilot study based on a 1D signal and relate our findings to 3D scenarios, where the naive joint pose optimization on voxel-based NeRFs can easily lead to sub-optimal solutions. Moreover, based on the analysis of the frequency spectrum, we propose to apply convolutional Gaussian filters on 2D and 3D radiance fields for a coarse-to-fine training schedule that enables joint camera pose optimization. Leveraging the decomposition property in decomposed low-rank tensor, our method achieves an equivalent effect to brute-force 3D convolution with only incurring little computational overhead. To further improve the robustness and stability of joint optimization, we also propose techniques of smoothed 2D supervision, randomly scaled kernel parameters, and edge-guided loss mask. Extensive quantitative and qualitative evaluations demonstrate that our proposed framework achieves superior performance in novel view synthesis as well as rapid convergence for optimization. △ Less

Submitted 20 February, 2024; originally announced February 2024.

Comments: AAAI 2024. Project page: https://alex04072000.github.io/Joint-TensoRF/

arXiv:2402.07472 [pdf, other]

Cartesian atomic cluster expansion for machine learning interatomic potentials

Authors: Bingqing Cheng

Abstract: Machine learning interatomic potentials are revolutionizing large-scale, accurate atomistic modelling in material science and chemistry. Many potentials use atomic cluster expansion or equivariant message passing frameworks. Such frameworks typically use spherical harmonics as angular basis functions, and then use Clebsch-Gordan contraction to maintain rotational symmetry, which may introduce redu… ▽ More Machine learning interatomic potentials are revolutionizing large-scale, accurate atomistic modelling in material science and chemistry. Many potentials use atomic cluster expansion or equivariant message passing frameworks. Such frameworks typically use spherical harmonics as angular basis functions, and then use Clebsch-Gordan contraction to maintain rotational symmetry, which may introduce redundancies in representations and computational overhead. We propose an alternative: a Cartesian-coordinates-based atomic density expansion. This approach provides a complete set of polynormially indepedent features of atomic environments while maintaining interaction body orders. Additionally, we integrate low-dimensional embeddings of various chemical elements and inter-atomic message passing. The resulting potential, named Cartesian Atomic Cluster Expansion (CACE), exhibits good accuracy, stability, and generalizability. We validate its performance in diverse systems, including bulk water, small molecules, and 25-element high-entropy alloys. △ Less

Submitted 29 July, 2024; v1 submitted 12 February, 2024; originally announced February 2024.

arXiv:2402.04663 [pdf, other]

CLIF: Complementary Leaky Integrate-and-Fire Neuron for Spiking Neural Networks

Authors: Yulong Huang, Xiaopeng Lin, Hongwei Ren, Haotian Fu, Yue Zhou, Zunchang Liu, Biao Pan, Bojun Cheng

Abstract: Spiking neural networks (SNNs) are promising brain-inspired energy-efficient models. Compared to conventional deep Artificial Neural Networks (ANNs), SNNs exhibit superior efficiency and capability to process temporal information. However, it remains a challenge to train SNNs due to their undifferentiable spiking mechanism. The surrogate gradients method is commonly used to train SNNs, but often c… ▽ More Spiking neural networks (SNNs) are promising brain-inspired energy-efficient models. Compared to conventional deep Artificial Neural Networks (ANNs), SNNs exhibit superior efficiency and capability to process temporal information. However, it remains a challenge to train SNNs due to their undifferentiable spiking mechanism. The surrogate gradients method is commonly used to train SNNs, but often comes with an accuracy disadvantage over ANNs counterpart. We link the degraded accuracy to the vanishing of gradient on the temporal dimension through the analytical and experimental study of the training process of Leaky Integrate-and-Fire (LIF) Neuron-based SNNs. Moreover, we propose the Complementary Leaky Integrate-and-Fire (CLIF) Neuron. CLIF creates extra paths to facilitate the backpropagation in computing temporal gradient while keeping binary output. CLIF is hyperparameter-free and features broad applicability. Extensive experiments on a variety of datasets demonstrate CLIF's clear performance advantage over other neuron models. Furthermore, the CLIF's performance even slightly surpasses superior ANNs with identical network structure and training conditions. The code is available at https://github.com/HuuYuLong/Complementary-LIF. △ Less

Submitted 14 July, 2024; v1 submitted 7 February, 2024; originally announced February 2024.

arXiv:2401.11644 [pdf, other]

Friends Across Time: Multi-Scale Action Segmentation Transformer for Surgical Phase Recognition

Authors: Bokai Zhang, Jiayuan Meng, Bin Cheng, Dean Biskup, Svetlana Petculescu, Angela Chapman

Abstract: Automatic surgical phase recognition is a core technology for modern operating rooms and online surgical video assessment platforms. Current state-of-the-art methods use both spatial and temporal information to tackle the surgical phase recognition task. Building on this idea, we propose the Multi-Scale Action Segmentation Transformer (MS-AST) for offline surgical phase recognition and the Multi-S… ▽ More Automatic surgical phase recognition is a core technology for modern operating rooms and online surgical video assessment platforms. Current state-of-the-art methods use both spatial and temporal information to tackle the surgical phase recognition task. Building on this idea, we propose the Multi-Scale Action Segmentation Transformer (MS-AST) for offline surgical phase recognition and the Multi-Scale Action Segmentation Causal Transformer (MS-ASCT) for online surgical phase recognition. We use ResNet50 or EfficientNetV2-M for spatial feature extraction. Our MS-AST and MS-ASCT can model temporal information at different scales with multi-scale temporal self-attention and multi-scale temporal cross-attention, which enhances the capture of temporal relationships between frames and segments. We demonstrate that our method can achieve 95.26% and 96.15% accuracy on the Cholec80 dataset for online and offline surgical phase recognition, respectively, which achieves new state-of-the-art results. Our method can also achieve state-of-the-art results on non-medical datasets in the video action segmentation domain. △ Less

Submitted 21 January, 2024; originally announced January 2024.

arXiv:2401.02737 [pdf, other]

The Vulnerability Is in the Details: Locating Fine-grained Information of Vulnerable Code Identified by Graph-based Detectors

Authors: Baijun Cheng, Kailong Wang, Cuiyun Gao, Xiapu Luo, Yulei Sui, Li Li, Yao Guo, Xiangqun Chen, Haoyu Wang

Abstract: Vulnerability detection is a crucial component in the software development lifecycle. Existing vulnerability detectors, especially those based on deep learning (DL) models, have achieved high effectiveness. Despite their capability of detecting vulnerable code snippets from given code fragments, the detectors are typically unable to further locate the fine-grained information pertaining to the vul… ▽ More Vulnerability detection is a crucial component in the software development lifecycle. Existing vulnerability detectors, especially those based on deep learning (DL) models, have achieved high effectiveness. Despite their capability of detecting vulnerable code snippets from given code fragments, the detectors are typically unable to further locate the fine-grained information pertaining to the vulnerability, such as the precise vulnerability triggering locations.In this paper, we propose VULEXPLAINER, a tool for automatically locating vulnerability-critical code lines from coarse-level vulnerable code snippets reported by DL-based detectors.Our approach takes advantage of the code structure and the semantics of the vulnerabilities. Specifically, we leverage program slicing to get a set of critical program paths containing vulnerability-triggering and vulnerability-dependent statements and rank them to pinpoint the most important one (i.e., sub-graph) as the data flow associated with the vulnerability. We demonstrate that VULEXPLAINER performs consistently well on four state-of-the-art graph-representation(GP)-based vulnerability detectors, i.e., it can flag the vulnerability-triggering code statements with an accuracy of around 90% against eight common C/C++ vulnerabilities, outperforming five widely used GNN-based explanation approaches. The experimental results demonstrate the effectiveness of VULEXPLAINER, which provides insights into a promising research line: integrating program slicing and deep learning for the interpretation of vulnerable code fragments. △ Less

Submitted 21 February, 2024; v1 submitted 5 January, 2024; originally announced January 2024.

arXiv:2401.02686 [pdf, other]

doi 10.1145/3641543

Beyond Fidelity: Explaining Vulnerability Localization of Learning-based Detectors

Authors: Baijun Cheng, Shengming Zhao, Kailong Wang, Meizhen Wang, Guangdong Bai, Ruitao Feng, Yao Guo, Lei Ma, Haoyu Wang

Abstract: Vulnerability detectors based on deep learning (DL) models have proven their effectiveness in recent years. However, the shroud of opacity surrounding the decision-making process of these detectors makes it difficult for security analysts to comprehend. To address this, various explanation approaches have been proposed to explain the predictions by highlighting important features, which have been… ▽ More Vulnerability detectors based on deep learning (DL) models have proven their effectiveness in recent years. However, the shroud of opacity surrounding the decision-making process of these detectors makes it difficult for security analysts to comprehend. To address this, various explanation approaches have been proposed to explain the predictions by highlighting important features, which have been demonstrated effective in other domains such as computer vision and natural language processing. Unfortunately, an in-depth evaluation of vulnerability-critical features, such as fine-grained vulnerability-related code lines, learned and understood by these explanation approaches remains lacking. In this study, we first evaluate the performance of ten explanation approaches for vulnerability detectors based on graph and sequence representations, measured by two quantitative metrics including fidelity and vulnerability line coverage rate. Our results show that fidelity alone is not sufficient for evaluating these approaches, as fidelity incurs significant fluctuations across different datasets and detectors. We subsequently check the precision of the vulnerability-related code lines reported by the explanation approaches, and find poor accuracy in this task among all of them. This can be attributed to the inefficiency of explainers in selecting important features and the presence of irrelevant artifacts learned by DL-based detectors. △ Less

Submitted 21 February, 2024; v1 submitted 5 January, 2024; originally announced January 2024.

Comments: Accepted by Tosem

arXiv:2401.01585 [pdf, other]

On Completely Edge-Independent Spanning Trees in Locally Twisted Cubes

Authors: Xiaorui Li, Baolei Cheng, Jianxi Fan, Yan Wang, Dajin Wang

Abstract: A network can contain numerous spanning trees. If two spanning trees $T_i,T_j$ do not share any common edges, $T_i$ and $T_j$ are said to be pairwisely edge-disjoint. For spanning trees $T_1, T_2, ..., T_m$, if every two of them are pairwisely edge-disjoint, they are called completely edge-independent spanning trees (CEISTs for short). CEISTs can facilitate many network functionalities, and constr… ▽ More A network can contain numerous spanning trees. If two spanning trees $T_i,T_j$ do not share any common edges, $T_i$ and $T_j$ are said to be pairwisely edge-disjoint. For spanning trees $T_1, T_2, ..., T_m$, if every two of them are pairwisely edge-disjoint, they are called completely edge-independent spanning trees (CEISTs for short). CEISTs can facilitate many network functionalities, and constructing CEISTs as maximally allowed as possible in a given network is a worthy undertaking. In this paper, we establish the maximal number of CEISTs in the locally twisted cube network, and propose an algorithm to construct $\lfloor \frac{n}{2} \rfloor$ CEISTs in $LTQ_n$, the $n$-dimensional locally twisted cube. The proposed algorithm has been actually implemented, and we present the outputs. Network broadcasting in the $LTQ_n$ was simulated using $\lfloor\frac{n}{2}\rfloor$ CEISTs, and the performance compared with broadcasting using a single tree. △ Less

Submitted 1 August, 2024; v1 submitted 3 January, 2024; originally announced January 2024.

arXiv:2312.00296 [pdf, other]

Towards Aligned Canonical Correlation Analysis: Preliminary Formulation and Proof-of-Concept Results

Authors: Biqian Cheng, Evangelos E. Papalexakis, Jia Chen

Abstract: Canonical Correlation Analysis (CCA) has been widely applied to jointly embed multiple views of data in a maximally correlated latent space. However, the alignment between various data perspectives, which is required by traditional approaches, is unclear in many practical cases. In this work we propose a new framework Aligned Canonical Correlation Analysis (ACCA), to address this challenge by iter… ▽ More Canonical Correlation Analysis (CCA) has been widely applied to jointly embed multiple views of data in a maximally correlated latent space. However, the alignment between various data perspectives, which is required by traditional approaches, is unclear in many practical cases. In this work we propose a new framework Aligned Canonical Correlation Analysis (ACCA), to address this challenge by iteratively solving the alignment and multi-view embedding. △ Less

Submitted 7 December, 2023; v1 submitted 30 November, 2023; originally announced December 2023.

Comments: 4 pages, 7 figures, KDD SoCal symposium 2023 (extended version)

arXiv:2310.07572 [pdf, other]

Impact of Label Types on Training SWIN Models with Overhead Imagery

Authors: Ryan Ford, Kenneth Hutchison, Nicholas Felts, Benjamin Cheng, Jesse Lew, Kyle Jackson

Abstract: Understanding the impact of data set design on model training and performance can help alleviate the costs associated with generating remote sensing and overhead labeled data. This work examined the impact of training shifted window transformers using bounding boxes and segmentation labels, where the latter are more expensive to produce. We examined classification tasks by comparing models trained… ▽ More Understanding the impact of data set design on model training and performance can help alleviate the costs associated with generating remote sensing and overhead labeled data. This work examined the impact of training shifted window transformers using bounding boxes and segmentation labels, where the latter are more expensive to produce. We examined classification tasks by comparing models trained with both target and backgrounds against models trained with only target pixels, extracted by segmentation labels. For object detection models, we compared performance using either label type when training. We found that the models trained on only target pixels do not show performance improvement for classification tasks, appearing to conflate background pixels in the evaluation set with target pixels. For object detection, we found that models trained with either label type showed equivalent performance across testing. We found that bounding boxes appeared to be sufficient for tasks that did not require more complex labels, such as object segmentation. Continuing work to determine consistency of this result across data types and model architectures could potentially result in substantial savings in generating remote sensing data sets for deep learning. △ Less

Submitted 11 October, 2023; originally announced October 2023.

arXiv:2310.07189 [pdf, other]

SpikePoint: An Efficient Point-based Spiking Neural Network for Event Cameras Action Recognition

Authors: Hongwei Ren, Yue Zhou, Yulong Huang, Haotian Fu, Xiaopeng Lin, Jie Song, Bojun Cheng

Abstract: Event cameras are bio-inspired sensors that respond to local changes in light intensity and feature low latency, high energy efficiency, and high dynamic range. Meanwhile, Spiking Neural Networks (SNNs) have gained significant attention due to their remarkable efficiency and fault tolerance. By synergistically harnessing the energy efficiency inherent in event cameras and the spike-based processin… ▽ More Event cameras are bio-inspired sensors that respond to local changes in light intensity and feature low latency, high energy efficiency, and high dynamic range. Meanwhile, Spiking Neural Networks (SNNs) have gained significant attention due to their remarkable efficiency and fault tolerance. By synergistically harnessing the energy efficiency inherent in event cameras and the spike-based processing capabilities of SNNs, their integration could enable ultra-low-power application scenarios, such as action recognition tasks. However, existing approaches often entail converting asynchronous events into conventional frames, leading to additional data mapping efforts and a loss of sparsity, contradicting the design concept of SNNs and event cameras. To address this challenge, we propose SpikePoint, a novel end-to-end point-based SNN architecture. SpikePoint excels at processing sparse event cloud data, effectively extracting both global and local features through a singular-stage structure. Leveraging the surrogate training method, SpikePoint achieves high accuracy with few parameters and maintains low power consumption, specifically employing the identity mapping feature extractor on diverse datasets. SpikePoint achieves state-of-the-art (SOTA) performance on four event-based action recognition datasets using only 16 timesteps, surpassing other SNN methods. Moreover, it also achieves SOTA performance across all methods on three datasets, utilizing approximately 0.3\% of the parameters and 0.5\% of power consumption employed by artificial neural networks (ANNs). These results emphasize the significance of Point Cloud and pave the way for many ultra-low-power event-based data processing applications. △ Less

Submitted 23 January, 2024; v1 submitted 11 October, 2023; originally announced October 2023.

Comments: Accepted by ICLR 2024 (Spotlight)

arXiv:2310.06470 [pdf, other]

Focus on Local Regions for Query-based Object Detection

Authors: Hongbin Xu, Yamei Xia, Shuai Zhao, Bo Cheng

Abstract: Query-based methods have garnered significant attention in object detection since the advent of DETR, the pioneering query-based detector. However, these methods face challenges like slow convergence and suboptimal performance. Notably, self-attention in object detection often hampers convergence due to its global focus. To address these issues, we propose FoLR, a transformer-like architecture wit… ▽ More Query-based methods have garnered significant attention in object detection since the advent of DETR, the pioneering query-based detector. However, these methods face challenges like slow convergence and suboptimal performance. Notably, self-attention in object detection often hampers convergence due to its global focus. To address these issues, we propose FoLR, a transformer-like architecture with only decoders. We improve the self-attention by isolating connections between irrelevant objects that makes it focus on local regions but not global regions. We also design the adaptive sampling method to extract effective features based on queries' local regions from feature maps. Additionally, we employ a look-back strategy for decoders to retain previous information, followed by the Feature Mixer module to fuse features and queries. Experimental results demonstrate FoLR's state-of-the-art performance in query-based detectors, excelling in convergence speed and computational efficiency. Index Terms: Local regions, Attention mechanism, Object detection △ Less

Submitted 13 December, 2023; v1 submitted 10 October, 2023; originally announced October 2023.

arXiv:2310.00288 [pdf]

doi 10.1038/s41928-023-00965-5

Parallel in-memory wireless computing

Authors: Cong Wang, Gong-Jie Ruan, Zai-Zheng Yang, Xing-Jian Yangdong, Yixiang Li, Liang Wu, Yingmeng Ge, Yichen Zhao, Chen Pan, Wei Wei, Li-Bo Wang, Bin Cheng, Zaichen Zhang, Chuan Zhang, Shi-Jun Liang, Feng Miao

Abstract: Parallel wireless digital communication with ultralow power consumption is critical for emerging edge technologies such as 5G and Internet of Things. However, the physical separation between digital computing units and analogue transmission units in traditional wireless technology leads to high power consumption. Here we report a parallel in-memory wireless computing scheme. The approach combines… ▽ More Parallel wireless digital communication with ultralow power consumption is critical for emerging edge technologies such as 5G and Internet of Things. However, the physical separation between digital computing units and analogue transmission units in traditional wireless technology leads to high power consumption. Here we report a parallel in-memory wireless computing scheme. The approach combines in-memory computing with wireless communication using memristive crossbar arrays. We show that the system can be used for the radio transmission of a binary stream of 480 bits with a bit error rate of 0. The in-memory wireless computing uses two orders of magnitude less power than conventional technology (based on digital-to-analogue and analogue-to-digital converters). We also show that the approach can be applied to acoustic and optical wireless communications △ Less

Submitted 30 September, 2023; originally announced October 2023.

Journal ref: Nat Electron 6, 381-389 (2023)

arXiv:2308.09993 [pdf, other]

doi 10.1145/3581783.3612258

TTPOINT: A Tensorized Point Cloud Network for Lightweight Action Recognition with Event Cameras

Authors: Hongwei Ren, Yue Zhou, Haotian Fu, Yulong Huang, Renjing Xu, Bojun Cheng

Abstract: Event cameras have gained popularity in computer vision due to their data sparsity, high dynamic range, and low latency. As a bio-inspired sensor, event cameras generate sparse and asynchronous data, which is inherently incompatible with the traditional frame-based method. Alternatively, the point-based method can avoid additional modality transformation and naturally adapt to the sparsity of even… ▽ More Event cameras have gained popularity in computer vision due to their data sparsity, high dynamic range, and low latency. As a bio-inspired sensor, event cameras generate sparse and asynchronous data, which is inherently incompatible with the traditional frame-based method. Alternatively, the point-based method can avoid additional modality transformation and naturally adapt to the sparsity of events. Still, it typically cannot reach a comparable accuracy as the frame-based method. We propose a lightweight and generalized point cloud network called TTPOINT which achieves competitive results even compared to the state-of-the-art (SOTA) frame-based method in action recognition tasks while only using 1.5 % of the computational resources. The model is adept at abstracting local and global geometry by hierarchy structure. By leveraging tensor-train compressed feature extractors, TTPOINT can be designed with minimal parameters and computational complexity. Additionally, we developed a straightforward downsampling algorithm to maintain the spatio-temporal feature. In the experiment, TTPOINT emerged as the SOTA method on three datasets while also attaining SOTA among point cloud methods on all five datasets. Moreover, by using the tensor-train decomposition method, the accuracy of the proposed TTPOINT is almost unaffected while compressing the parameter size by 55 % in all five datasets. △ Less

Submitted 19 August, 2023; originally announced August 2023.

arXiv:2307.14352 [pdf, other]

General Image-to-Image Translation with One-Shot Image Guidance

Authors: Bin Cheng, Zuhao Liu, Yunbo Peng, Yue Lin

Abstract: Large-scale text-to-image models pre-trained on massive text-image pairs show excellent performance in image synthesis recently. However, image can provide more intuitive visual concepts than plain text. People may ask: how can we integrate the desired visual concept into an existing image, such as our portrait? Current methods are inadequate in meeting this demand as they lack the ability to pres… ▽ More Large-scale text-to-image models pre-trained on massive text-image pairs show excellent performance in image synthesis recently. However, image can provide more intuitive visual concepts than plain text. People may ask: how can we integrate the desired visual concept into an existing image, such as our portrait? Current methods are inadequate in meeting this demand as they lack the ability to preserve content or translate visual concepts effectively. Inspired by this, we propose a novel framework named visual concept translator (VCT) with the ability to preserve content in the source image and translate the visual concepts guided by a single reference image. The proposed VCT contains a content-concept inversion (CCI) process to extract contents and concepts, and a content-concept fusion (CCF) process to gather the extracted information to obtain the target image. Given only one reference image, the proposed VCT can complete a wide range of general image-to-image translation tasks with excellent results. Extensive experiments are conducted to prove the superiority and effectiveness of the proposed methods. Codes are available at https://github.com/CrystalNeuro/visual-concept-translator. △ Less

Submitted 20 September, 2023; v1 submitted 20 July, 2023; originally announced July 2023.

Comments: accepted by ICCV 2023

arXiv:2307.08178 [pdf]

Robot motor learning shows emergence of frequency-modulated, robust swimming with an invariant Strouhal-number

Authors: Hankun Deng, Donghao Li, Colin Nitroy, Andrew Wertz, Shashank Priya, Bo Cheng

Abstract: Fish locomotion emerges from a diversity of interactions among deformable structures, surrounding fluids and neuromuscular activations, i.e., fluid-structure interactions (FSI) controlled by fish's motor systems. Previous studies suggested that such motor-controlled FSI may possess embodied traits. However, their implications in motor learning, neuromuscular control, gait generation, and swimming… ▽ More Fish locomotion emerges from a diversity of interactions among deformable structures, surrounding fluids and neuromuscular activations, i.e., fluid-structure interactions (FSI) controlled by fish's motor systems. Previous studies suggested that such motor-controlled FSI may possess embodied traits. However, their implications in motor learning, neuromuscular control, gait generation, and swimming performance remain to be uncovered. Using robot models, we studied how swimming behaviours emerged from the FSI and the embodied traits. We developed modular robots with various designs and used Central Pattern Generators (CPGs) to control the torque acting on robot body. We used reinforcement learning to learn CPG parameters to maximize the swimming speed. The results showed that motor frequency converged faster than other parameters, and the emergent swimming gaits were robust against disruptions applied to motor control. For all robots and frequencies tested, swimming speed was proportional to the mean undulation velocity of body and caudal-fin combined, yielding an invariant, undulation-based Strouhal number. The Strouhal number also revealed two fundamental classes of undulatory swimming in both biological and robotic fishes. The robot actuators also demonstrated diverse functions as motors, virtual springs, and virtual masses. These results provide novel insights into the embodied traits of motor-controlled FSI for fish-inspired locomotion. △ Less

Submitted 16 July, 2023; originally announced July 2023.

arXiv:2307.04849 [pdf, other]

doi 10.1016/j.knosys.2023.110604

SigOpt Mulch: An Intelligent System for AutoML of Gradient Boosted Trees

Authors: Aleksei Sorokin, Xinran Zhu, Eric Hans Lee, Bolong Cheng

Abstract: Gradient boosted trees (GBTs) are ubiquitous models used by researchers, machine learning (ML) practitioners, and data scientists because of their robust performance, interpretable behavior, and ease-of-use. One critical challenge in training GBTs is the tuning of their hyperparameters. In practice, selecting these hyperparameters is often done manually. Recently, the ML community has advocated fo… ▽ More Gradient boosted trees (GBTs) are ubiquitous models used by researchers, machine learning (ML) practitioners, and data scientists because of their robust performance, interpretable behavior, and ease-of-use. One critical challenge in training GBTs is the tuning of their hyperparameters. In practice, selecting these hyperparameters is often done manually. Recently, the ML community has advocated for tuning hyperparameters through black-box optimization and developed state-of-the-art systems to do so. However, applying such systems to tune GBTs suffers from two drawbacks. First, these systems are not \textit{model-aware}, rather they are designed to apply to a \textit{generic} model; this leaves significant optimization performance on the table. Second, using these systems requires \textit{domain knowledge} such as the choice of hyperparameter search space, which is an antithesis to the automatic experimentation that black-box optimization aims to provide. In this paper, we present SigOpt Mulch, a model-aware hyperparameter tuning system specifically designed for automated tuning of GBTs that provides two improvements over existing systems. First, Mulch leverages powerful techniques in metalearning and multifidelity optimization to perform model-aware hyperparameter optimization. Second, it automates the process of learning performant hyperparameters by making intelligent decisions about the optimization search space, thus reducing the need for user domain knowledge. These innovations allow Mulch to identify good GBT hyperparameters far more efficiently -- and in a more seamless and user-friendly way -- than existing black-box hyperparameter tuning systems. △ Less

Submitted 10 July, 2023; originally announced July 2023.

Journal ref: Knowledge-Based Systems Volume 273, 3 August 2023, 110604

arXiv:2306.13780 [pdf, other]

doi 10.1109/WSC57314.2022.10015472

Achieving Diversity in Objective Space for Sample-efficient Search of Multiobjective Optimization Problems

Authors: Eric Hans Lee, Bolong Cheng, Michael McCourt

Abstract: Efficiently solving multi-objective optimization problems for simulation optimization of important scientific and engineering applications such as materials design is becoming an increasingly important research topic. This is due largely to the expensive costs associated with said applications, and the resulting need for sample-efficient, multiobjective optimization methods that efficiently explor… ▽ More Efficiently solving multi-objective optimization problems for simulation optimization of important scientific and engineering applications such as materials design is becoming an increasingly important research topic. This is due largely to the expensive costs associated with said applications, and the resulting need for sample-efficient, multiobjective optimization methods that efficiently explore the Pareto frontier to expose a promising set of design solutions. We propose moving away from using explicit optimization to identify the Pareto frontier and instead suggest searching for a diverse set of outcomes that satisfy user-specified performance criteria. This method presents decision makers with a robust pool of promising design decisions and helps them better understand the space of good solutions. To achieve this outcome, we introduce the Likelihood of Metric Satisfaction (LMS) acquisition function, analyze its behavior and properties, and demonstrate its viability on various problems. △ Less

Submitted 23 June, 2023; originally announced June 2023.

Journal ref: 2022 Winter Simulation Conference

arXiv:2306.01492 [pdf, other]

Multi-Modal Emotion Recognition for Enhanced Requirements Engineering: A Novel Approach

Authors: Ben Cheng, Chetan Arora, Xiao Liu, Thuong Hoang, Yi Wang, John Grundy

Abstract: Requirements engineering (RE) plays a crucial role in developing software systems by bridging the gap between stakeholders' needs and system specifications. However, effective communication and elicitation of stakeholder requirements can be challenging, as traditional RE methods often overlook emotional cues. This paper introduces a multi-modal emotion recognition platform (MEmoRE) to enhance the… ▽ More Requirements engineering (RE) plays a crucial role in developing software systems by bridging the gap between stakeholders' needs and system specifications. However, effective communication and elicitation of stakeholder requirements can be challenging, as traditional RE methods often overlook emotional cues. This paper introduces a multi-modal emotion recognition platform (MEmoRE) to enhance the requirements engineering process by capturing and analyzing the emotional cues of stakeholders in real-time. MEmoRE leverages state-of-the-art emotion recognition techniques, integrating facial expression, vocal intonation, and textual sentiment analysis to comprehensively understand stakeholder emotions. This multi-modal approach ensures the accurate and timely detection of emotional cues, enabling requirements engineers to tailor their elicitation strategies and improve overall communication with stakeholders. We further intend to employ our platform for later RE stages, such as requirements reviews and usability testing. By integrating multi-modal emotion recognition into requirements engineering, we aim to pave the way for more empathetic, effective, and successful software development processes. We performed a preliminary evaluation of our platform. This paper reports on the platform design, preliminary evaluation, and future development plan as an ongoing project. △ Less

Submitted 2 June, 2023; originally announced June 2023.

arXiv:2304.06274 [pdf, other]

EWT: Efficient Wavelet-Transformer for Single Image Denoising

Authors: Juncheng Li, Bodong Cheng, Ying Chen, Guangwei Gao, Tieyong Zeng

Abstract: Transformer-based image denoising methods have achieved encouraging results in the past year. However, it must uses linear operations to model long-range dependencies, which greatly increases model inference time and consumes GPU storage space. Compared with convolutional neural network-based methods, current Transformer-based image denoising methods cannot achieve a balance between performance im… ▽ More Transformer-based image denoising methods have achieved encouraging results in the past year. However, it must uses linear operations to model long-range dependencies, which greatly increases model inference time and consumes GPU storage space. Compared with convolutional neural network-based methods, current Transformer-based image denoising methods cannot achieve a balance between performance improvement and resource consumption. In this paper, we propose an Efficient Wavelet Transformer (EWT) for image denoising. Specifically, we use Discrete Wavelet Transform (DWT) and Inverse Wavelet Transform (IWT) for downsampling and upsampling, respectively. This method can fully preserve the image features while reducing the image resolution, thereby greatly reducing the device resource consumption of the Transformer model. Furthermore, we propose a novel Dual-stream Feature Extraction Block (DFEB) to extract image features at different levels, which can further reduce model inference time and GPU memory usage. Experiments show that our method speeds up the original Transformer by more than 80%, reduces GPU memory usage by more than 60%, and achieves excellent denoising results. All code will be public. △ Less

Submitted 13 April, 2023; originally announced April 2023.

Comments: 12 pages, 11 figurs

arXiv:2303.17915 [pdf, other]

Multiple Instance Ensembling For Paranasal Anomaly Classification In The Maxillary Sinus

Authors: Debayan Bhattacharya, Finn Behrendt, Benjamin Tobias Becker, Dirk Beyersdorff, Elina Petersen, Marvin Petersen, Bastian Cheng, Dennis Eggert, Christian Betz, Anna Sophie Hoffmann, Alexander Schlaefer

Abstract: Paranasal anomalies are commonly discovered during routine radiological screenings and can present with a wide range of morphological features. This diversity can make it difficult for convolutional neural networks (CNNs) to accurately classify these anomalies, especially when working with limited datasets. Additionally, current approaches to paranasal anomaly classification are constrained to ide… ▽ More Paranasal anomalies are commonly discovered during routine radiological screenings and can present with a wide range of morphological features. This diversity can make it difficult for convolutional neural networks (CNNs) to accurately classify these anomalies, especially when working with limited datasets. Additionally, current approaches to paranasal anomaly classification are constrained to identifying a single anomaly at a time. These challenges necessitate the need for further research and development in this area. In this study, we investigate the feasibility of using a 3D convolutional neural network (CNN) to classify healthy maxillary sinuses (MS) and MS with polyps or cysts. The task of accurately identifying the relevant MS volume within larger head and neck Magnetic Resonance Imaging (MRI) scans can be difficult, but we develop a straightforward strategy to tackle this challenge. Our end-to-end solution includes the use of a novel sampling technique that not only effectively localizes the relevant MS volume, but also increases the size of the training dataset and improves classification results. Additionally, we employ a multiple instance ensemble prediction method to further boost classification performance. Finally, we identify the optimal size of MS volumes to achieve the highest possible classification performance on our dataset. With our multiple instance ensemble prediction strategy and sampling strategy, our 3D CNNs achieve an F1 of 0.85 whereas without it, they achieve an F1 of 0.70. We demonstrate the feasibility of classifying anomalies in the MS. We propose a data enlarging strategy alongside a novel ensembling strategy that proves to be beneficial for paranasal anomaly classification in the MS. △ Less

Submitted 31 March, 2023; originally announced March 2023.

arXiv:2303.14408 [pdf, other]

VL-SAT: Visual-Linguistic Semantics Assisted Training for 3D Semantic Scene Graph Prediction in Point Cloud

Authors: Ziqin Wang, Bowen Cheng, Lichen Zhao, Dong Xu, Yang Tang, Lu Sheng

Abstract: The task of 3D semantic scene graph (3DSSG) prediction in the point cloud is challenging since (1) the 3D point cloud only captures geometric structures with limited semantics compared to 2D images, and (2) long-tailed relation distribution inherently hinders the learning of unbiased prediction. Since 2D images provide rich semantics and scene graphs are in nature coped with languages, in this stu… ▽ More The task of 3D semantic scene graph (3DSSG) prediction in the point cloud is challenging since (1) the 3D point cloud only captures geometric structures with limited semantics compared to 2D images, and (2) long-tailed relation distribution inherently hinders the learning of unbiased prediction. Since 2D images provide rich semantics and scene graphs are in nature coped with languages, in this study, we propose Visual-Linguistic Semantics Assisted Training (VL-SAT) scheme that can significantly empower 3DSSG prediction models with discrimination about long-tailed and ambiguous semantic relations. The key idea is to train a powerful multi-modal oracle model to assist the 3D model. This oracle learns reliable structural representations based on semantics from vision, language, and 3D geometry, and its benefits can be heterogeneously passed to the 3D model during the training stage. By effectively utilizing visual-linguistic semantics in training, our VL-SAT can significantly boost common 3DSSG prediction models, such as SGFN and SGGpoint, only with 3D inputs in the inference stage, especially when dealing with tail relation triplets. Comprehensive evaluations and ablation studies on the 3DSSG dataset have validated the effectiveness of the proposed scheme. Code is available at https://github.com/wz7in/CVPR2023-VLSAT. △ Less

Submitted 25 March, 2023; originally announced March 2023.

Comments: CVPR2023 Highlight

arXiv:2301.01850 [pdf]

doi 10.1109/RAMS51473.2023.10088222

Bayesian Weapon System Reliability Modeling with Cox-Weibull Neural Network

Authors: Michael Potter, Benny Cheng

Abstract: We propose to integrate weapon system features (such as weapon system manufacturer, deployment time and location, storage time and location, etc.) into a parameterized Cox-Weibull [1] reliability model via a neural network, like DeepSurv [2], to improve predictive maintenance. In parallel, we develop an alternative Bayesian model by parameterizing the Weibull parameters with a neural network and e… ▽ More We propose to integrate weapon system features (such as weapon system manufacturer, deployment time and location, storage time and location, etc.) into a parameterized Cox-Weibull [1] reliability model via a neural network, like DeepSurv [2], to improve predictive maintenance. In parallel, we develop an alternative Bayesian model by parameterizing the Weibull parameters with a neural network and employing dropout methods such as Monte-Carlo (MC)-dropout for comparative purposes. Due to data collection procedures in weapon system testing we employ a novel interval-censored log-likelihood which incorporates Monte-Carlo Markov Chain (MCMC) [3] sampling of the Weibull parameters during gradient descent optimization. We compare classification metrics such as receiver operator curve (ROC) area under the curve (AUC), precision-recall (PR) AUC, and F scores to show our model generally outperforms traditional powerful models such as XGBoost and the current standard conditional Weibull probability density estimation model. △ Less

Submitted 14 April, 2023; v1 submitted 4 January, 2023; originally announced January 2023.

Comments: Pre-print with minor revisions, published at The 69th Annual Reliability and Maintainability Symposium, January 23-26, 2023, FL, USA

arXiv:2212.04450 [pdf, other]

GAUCHE: A Library for Gaussian Processes in Chemistry

Authors: Ryan-Rhys Griffiths, Leo Klarner, Henry B. Moss, Aditya Ravuri, Sang Truong, Samuel Stanton, Gary Tom, Bojana Rankovic, Yuanqi Du, Arian Jamasb, Aryan Deshwal, Julius Schwartz, Austin Tripp, Gregory Kell, Simon Frieder, Anthony Bourached, Alex Chan, Jacob Moss, Chengzhi Guo, Johannes Durholt, Saudamini Chaurasia, Felix Strieth-Kalthoff, Alpha A. Lee, Bingqing Cheng, Alán Aspuru-Guzik , et al. (2 additional authors not shown)

Abstract: We introduce GAUCHE, a library for GAUssian processes in CHEmistry. Gaussian processes have long been a cornerstone of probabilistic machine learning, affording particular advantages for uncertainty quantification and Bayesian optimisation. Extending Gaussian processes to chemical representations, however, is nontrivial, necessitating kernels defined over structured inputs such as graphs, strings… ▽ More We introduce GAUCHE, a library for GAUssian processes in CHEmistry. Gaussian processes have long been a cornerstone of probabilistic machine learning, affording particular advantages for uncertainty quantification and Bayesian optimisation. Extending Gaussian processes to chemical representations, however, is nontrivial, necessitating kernels defined over structured inputs such as graphs, strings and bit vectors. By defining such kernels in GAUCHE, we seek to open the door to powerful tools for uncertainty quantification and Bayesian optimisation in chemistry. Motivated by scenarios frequently encountered in experimental chemistry, we showcase applications for GAUCHE in molecular discovery and chemical reaction optimisation. The codebase is made available at https://github.com/leojklarner/gauche △ Less

Submitted 21 February, 2023; v1 submitted 6 December, 2022; originally announced December 2022.

arXiv:2211.08888 [pdf, other]

ELDA: Using Edges to Have an Edge on Semantic Segmentation Based UDA

Authors: Ting-Hsuan Liao, Huang-Ru Liao, Shan-Ya Yang, Jie-En Yao, Li-Yuan Tsao, Hsu-Shen Liu, Bo-Wun Cheng, Chen-Hao Chao, Chia-Che Chang, Yi-Chen Lo, Chun-Yi Lee

Abstract: Many unsupervised domain adaptation (UDA) methods have been proposed to bridge the domain gap by utilizing domain invariant information. Most approaches have chosen depth as such information and achieved remarkable success. Despite their effectiveness, using depth as domain invariant information in UDA tasks may lead to multiple issues, such as excessively high extraction costs and difficulties in… ▽ More Many unsupervised domain adaptation (UDA) methods have been proposed to bridge the domain gap by utilizing domain invariant information. Most approaches have chosen depth as such information and achieved remarkable success. Despite their effectiveness, using depth as domain invariant information in UDA tasks may lead to multiple issues, such as excessively high extraction costs and difficulties in achieving a reliable prediction quality. As a result, we introduce Edge Learning based Domain Adaptation (ELDA), a framework which incorporates edge information into its training process to serve as a type of domain invariant information. In our experiments, we quantitatively and qualitatively demonstrate that the incorporation of edge information is indeed beneficial and effective and enables ELDA to outperform the contemporary state-of-the-art methods on two commonly adopted benchmarks for semantic segmentation based UDA tasks. In addition, we show that ELDA is able to better separate the feature distributions of different classes. We further provide an ablation analysis to justify our design decisions. △ Less

Submitted 16 November, 2022; originally announced November 2022.

Comments: Accepted by BMVC2022. Ting-Hsuan Liao and Huang-Ru Liao contributed equally to this work

arXiv:2210.17313 [pdf, ps, other]

DiscreteCommunication and ControlUpdating in Event-Triggered Consensus

Authors: Bin Cheng, Yuezu Lv, Zhongkui Li, Zhisheng Duan

Abstract: This paper studies the consensus control problem faced with three essential demands, namely, discrete control updating for each agent, discrete-time communications among neighboring agents, and the fully distributed fashion of the controller implementation without requiring any global information of the whole network topology. Noting that the existing related results only meeting one or two demand… ▽ More This paper studies the consensus control problem faced with three essential demands, namely, discrete control updating for each agent, discrete-time communications among neighboring agents, and the fully distributed fashion of the controller implementation without requiring any global information of the whole network topology. Noting that the existing related results only meeting one or two demands at most are essentially not applicable, in this paper we establish a novel framework to solve the problem of fully distributed consensus with discrete communication and control. The first key point in this framework is the design of controllers that are only updated at discrete event instants and do not depend on global information by introducing time-varying gains inspired by the adaptive control technique. Another key point is the invention of novel dynamic triggering functions that are independent of relative information among neighboring agents. Under the established framework, we propose fully distributed state-feedback event-triggered protocols for undirected graphs and also further study the more complexed cases of output-feedback control and directed graphs. Finally, numerical examples are provided to verify the effectiveness of the proposed event-triggered protocols. △ Less

Submitted 26 October, 2022; originally announced October 2022.

Showing 1–50 of 128 results for author: Cheng, B