-
PassTSL: Modeling Human-Created Passwords through Two-Stage Learning
Authors:
Yangde Wang,
Haozhang Li,
Weidong Qiu,
Shujun Li,
Peng Tang
Abstract:
Textual passwords are still the most widely used user authentication mechanism. Due to the close connections between textual passwords and natural languages, advanced technologies in natural language processing (NLP) and machine learning (ML) could be used to model passwords for different purposes such as studying human password-creation behaviors and developing more advanced password cracking met…
▽ More
Textual passwords are still the most widely used user authentication mechanism. Due to the close connections between textual passwords and natural languages, advanced technologies in natural language processing (NLP) and machine learning (ML) could be used to model passwords for different purposes such as studying human password-creation behaviors and developing more advanced password cracking methods for informing better defence mechanisms. In this paper, we propose PassTSL (modeling human-created Passwords through Two-Stage Learning), inspired by the popular pretraining-finetuning framework in NLP and deep learning (DL). We report how different pretraining settings affected PassTSL and proved its effectiveness by applying it to six large leaked password databases. Experimental results showed that it outperforms five state-of-the-art (SOTA) password cracking methods on password guessing by a significant margin ranging from 4.11% to 64.69% at the maximum point. Based on PassTSL, we also implemented a password strength meter (PSM), and our experiments showed that it was able to estimate password strength more accurately, causing fewer unsafe errors (overestimating the password strength) than two other SOTA PSMs when they produce the same rate of safe errors (underestimating the password strength): a neural-network based method and zxcvbn. Furthermore, we explored multiple finetuning settings, and our evaluations showed that, even a small amount of additional training data, e.g., only 0.1% of the pretrained data, can lead to over 3% improvement in password guessing on average. We also proposed a heuristic approach to selecting finetuning passwords based on JS (Jensen-Shannon) divergence and experimental results validated its usefulness. In summary, our contributions demonstrate the potential and feasibility of applying advanced NLP and ML methods to password modeling and cracking.
△ Less
Submitted 19 July, 2024;
originally announced July 2024.
-
VEON: Vocabulary-Enhanced Occupancy Prediction
Authors:
Jilai Zheng,
Pin Tang,
Zhongdao Wang,
Guoqing Wang,
Xiangxuan Ren,
Bailan Feng,
Chao Ma
Abstract:
Perceiving the world as 3D occupancy supports embodied agents to avoid collision with any types of obstacle. While open-vocabulary image understanding has prospered recently, how to bind the predicted 3D occupancy grids with open-world semantics still remains under-explored due to limited open-world annotations. Hence, instead of building our model from scratch, we try to blend 2D foundation model…
▽ More
Perceiving the world as 3D occupancy supports embodied agents to avoid collision with any types of obstacle. While open-vocabulary image understanding has prospered recently, how to bind the predicted 3D occupancy grids with open-world semantics still remains under-explored due to limited open-world annotations. Hence, instead of building our model from scratch, we try to blend 2D foundation models, specifically a depth model MiDaS and a semantic model CLIP, to lift the semantics to 3D space, thus fulfilling 3D occupancy. However, building upon these foundation models is not trivial. First, the MiDaS faces the depth ambiguity problem, i.e., it only produces relative depth but fails to estimate bin depth for feature lifting. Second, the CLIP image features lack high-resolution pixel-level information, which limits the 3D occupancy accuracy. Third, open vocabulary is often trapped by the long-tail problem. To address these issues, we propose VEON for Vocabulary-Enhanced Occupancy predictioN by not only assembling but also adapting these foundation models. We first equip MiDaS with a Zoedepth head and low-rank adaptation (LoRA) for relative-metric-bin depth transformation while reserving beneficial depth prior. Then, a lightweight side adaptor network is attached to the CLIP vision encoder to generate high-resolution features for fine-grained 3D occupancy prediction. Moreover, we design a class reweighting strategy to give priority to the tail classes. With only 46M trainable parameters and zero manual semantic labels, VEON achieves 15.14 mIoU on Occ3D-nuScenes, and shows the capability of recognizing objects with open-vocabulary categories, meaning that our VEON is label-efficient, parameter-efficient, and precise enough.
△ Less
Submitted 16 July, 2024;
originally announced July 2024.
-
Proportional Dynamics in Linear Fisher Markets with Auto-bidding: Convergence, Incentives and Fairness
Authors:
Juncheng Li,
Pingzhong Tang
Abstract:
Proportional dynamics, originated from peer-to-peer file sharing systems, models a decentralized price-learning process in Fisher markets. Previously, items in the dynamics operate independently of one another, and each is assumed to belong to a different seller. In this paper, we show how it can be generalized to the setting where each seller brings multiple items and buyers allocate budgets at t…
▽ More
Proportional dynamics, originated from peer-to-peer file sharing systems, models a decentralized price-learning process in Fisher markets. Previously, items in the dynamics operate independently of one another, and each is assumed to belong to a different seller. In this paper, we show how it can be generalized to the setting where each seller brings multiple items and buyers allocate budgets at the granularity of sellers rather than individual items. The generalized dynamics consistently converges to the competitive equilibrium, and interestingly relates to the auto-bidding paradigm currently popular in online advertising auction markets. In contrast to peer-to-peer networks, the proportional rule is not imposed as a protocol in auto-bidding markets. Regarding this incentive concern, we show that buyers have a strong tendency to follow the rule, but it is easy for sellers to profitably deviate (given buyers' commitment to the rule). Based on this observation, we further study the seller-side deviation game and show that it admits a unique pure Nash equilibrium. Though it is generally different from the competitive equilibrium, we show that it attains a good fairness guarantee as long as the market is competitive enough and not severely monopolized.
△ Less
Submitted 16 July, 2024;
originally announced July 2024.
-
Price Competition in Linear Fisher Markets: Stability, Equilibrium and Personalization
Authors:
Juncheng Li,
Pingzhong Tang
Abstract:
Linear Fisher market is one of the most fundamental economic models. The market is traditionally examined on the basis of individual's price-taking behavior. However, this assumption breaks in markets such as online advertising and e-commerce, where several oligopolists dominate the market and are able to compete with each other via strategic actions. Motivated by this, we study the price competit…
▽ More
Linear Fisher market is one of the most fundamental economic models. The market is traditionally examined on the basis of individual's price-taking behavior. However, this assumption breaks in markets such as online advertising and e-commerce, where several oligopolists dominate the market and are able to compete with each other via strategic actions. Motivated by this, we study the price competition among sellers in linear Fisher markets. From an algorithmic game-theoretic perspective, we establish a model to analyze behaviors of buyers and sellers that are driven by utility-maximizing purposes and also constrained by computational tractability. The main economic observation is the role played by personalization: the classic benchmark market outcome, namely competitive equilibrium, remains to be a steady-state if every buyer must be treated "equally"; however, sellers have the incentive to personalize, and as a result the market would become more unpredictable and less efficient. In addition, we build a series of algorithmic and complexity results along the road to justify our modeling choices and reveal market structures. We find interesting connections between our model and other computational problems such as stable matching, network flow, etc. We believe these results and techniques are of independent interest.
△ Less
Submitted 16 July, 2024;
originally announced July 2024.
-
Pay Less On Clinical Images: Asymmetric Multi-Modal Fusion Method For Efficient Multi-Label Skin Lesion Classification
Authors:
Peng Tang,
Tobias Lasser
Abstract:
Existing multi-modal approaches primarily focus on enhancing multi-label skin lesion classification performance through advanced fusion modules, often neglecting the associated rise in parameters. In clinical settings, both clinical and dermoscopy images are captured for diagnosis; however, dermoscopy images exhibit more crucial visual features for multi-label skin lesion classification. Motivated…
▽ More
Existing multi-modal approaches primarily focus on enhancing multi-label skin lesion classification performance through advanced fusion modules, often neglecting the associated rise in parameters. In clinical settings, both clinical and dermoscopy images are captured for diagnosis; however, dermoscopy images exhibit more crucial visual features for multi-label skin lesion classification. Motivated by this observation, we introduce a novel asymmetric multi-modal fusion method in this paper for efficient multi-label skin lesion classification. Our fusion method incorporates two innovative schemes. Firstly, we validate the effectiveness of our asymmetric fusion structure. It employs a light and simple network for clinical images and a heavier, more complex one for dermoscopy images, resulting in significant parameter savings compared to the symmetric fusion structure using two identical networks for both modalities. Secondly, in contrast to previous approaches using mutual attention modules for interaction between image modalities, we propose an asymmetric attention module. This module solely leverages clinical image information to enhance dermoscopy image features, considering clinical images as supplementary information in our pipeline. We conduct the extensive experiments on the seven-point checklist dataset. Results demonstrate the generality of our proposed method for both networks and Transformer structures, showcasing its superiority over existing methods We will make our code publicly available.
△ Less
Submitted 13 July, 2024;
originally announced July 2024.
-
PDMLP: Patch-based Decomposed MLP for Long-Term Time Series Forecasting
Authors:
Peiwang Tang,
Weitai Zhang
Abstract:
Recent studies have attempted to refine the Transformer architecture to demonstrate its effectiveness in Long-Term Time Series Forecasting (LTSF) tasks. Despite surpassing many linear forecasting models with ever-improving performance, we remain skeptical of Transformers as a solution for LTSF. We attribute the effectiveness of these models largely to the adopted Patch mechanism, which enhances se…
▽ More
Recent studies have attempted to refine the Transformer architecture to demonstrate its effectiveness in Long-Term Time Series Forecasting (LTSF) tasks. Despite surpassing many linear forecasting models with ever-improving performance, we remain skeptical of Transformers as a solution for LTSF. We attribute the effectiveness of these models largely to the adopted Patch mechanism, which enhances sequence locality to an extent yet fails to fully address the loss of temporal information inherent to the permutation-invariant self-attention mechanism. Further investigation suggests that simple linear layers augmented with the Patch mechanism may outperform complex Transformer-based LTSF models. Moreover, diverging from models that use channel independence, our research underscores the importance of cross-variable interactions in enhancing the performance of multivariate time series forecasting. The interaction information between variables is highly valuable but has been misapplied in past studies, leading to suboptimal cross-variable models. Based on these insights, we propose a novel and simple Patch-based Decomposed MLP (PDMLP) for LTSF tasks. Specifically, we employ simple moving averages to extract smooth components and noise-containing residuals from time series data, engaging in semantic information interchange through channel mixing and specializing in random noise with channel independence processing. The PDMLP model consistently achieves state-of-the-art results on several real-world datasets. We hope this surprising finding will spur new research directions in the LTSF field and pave the way for more efficient and concise solutions.
△ Less
Submitted 27 May, 2024; v1 submitted 22 May, 2024;
originally announced May 2024.
-
Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection
Authors:
Tianhe Ren,
Qing Jiang,
Shilong Liu,
Zhaoyang Zeng,
Wenlong Liu,
Han Gao,
Hongjie Huang,
Zhengyu Ma,
Xiaoke Jiang,
Yihao Chen,
Yuda Xiong,
Hao Zhang,
Feng Li,
Peijun Tang,
Kent Yu,
Lei Zhang
Abstract:
This paper introduces Grounding DINO 1.5, a suite of advanced open-set object detection models developed by IDEA Research, which aims to advance the "Edge" of open-set object detection. The suite encompasses two models: Grounding DINO 1.5 Pro, a high-performance model designed for stronger generalization capability across a wide range of scenarios, and Grounding DINO 1.5 Edge, an efficient model o…
▽ More
This paper introduces Grounding DINO 1.5, a suite of advanced open-set object detection models developed by IDEA Research, which aims to advance the "Edge" of open-set object detection. The suite encompasses two models: Grounding DINO 1.5 Pro, a high-performance model designed for stronger generalization capability across a wide range of scenarios, and Grounding DINO 1.5 Edge, an efficient model optimized for faster speed demanded in many applications requiring edge deployment. The Grounding DINO 1.5 Pro model advances its predecessor by scaling up the model architecture, integrating an enhanced vision backbone, and expanding the training dataset to over 20 million images with grounding annotations, thereby achieving a richer semantic understanding. The Grounding DINO 1.5 Edge model, while designed for efficiency with reduced feature scales, maintains robust detection capabilities by being trained on the same comprehensive dataset. Empirical results demonstrate the effectiveness of Grounding DINO 1.5, with the Grounding DINO 1.5 Pro model attaining a 54.3 AP on the COCO detection benchmark and a 55.7 AP on the LVIS-minival zero-shot transfer benchmark, setting new records for open-set object detection. Furthermore, the Grounding DINO 1.5 Edge model, when optimized with TensorRT, achieves a speed of 75.2 FPS while attaining a zero-shot performance of 36.2 AP on the LVIS-minival benchmark, making it more suitable for edge computing scenarios. Model examples and demos with API will be released at https://github.com/IDEA-Research/Grounding-DINO-1.5-API
△ Less
Submitted 31 May, 2024; v1 submitted 16 May, 2024;
originally announced May 2024.
-
Automated Conversion of Static to Dynamic Scheduler via Natural Language
Authors:
Paul Mingzheng Tang,
Kenji Kah Hoe Leong,
Nowshad Shaik,
Hoong Chuin Lau
Abstract:
In this paper, we explore the potential application of Large Language Models (LLMs) that will automatically model constraints and generate code for dynamic scheduling problems given an existing static model. Static scheduling problems are modelled and coded by optimization experts. These models may be easily obsoleted as the underlying constraints may need to be fine-tuned in order to reflect chan…
▽ More
In this paper, we explore the potential application of Large Language Models (LLMs) that will automatically model constraints and generate code for dynamic scheduling problems given an existing static model. Static scheduling problems are modelled and coded by optimization experts. These models may be easily obsoleted as the underlying constraints may need to be fine-tuned in order to reflect changes in the scheduling rules. Furthermore, it may be necessary to turn a static model into a dynamic one in order to cope with disturbances in the environment. In this paper, we propose a Retrieval-Augmented Generation (RAG) based LLM model to automate the process of implementing constraints for Dynamic Scheduling (RAGDyS), without seeking help from an optimization modeling expert. Our framework aims to minimize technical complexities related to mathematical modelling and computational workload for end-users, thereby allowing end-users to quickly obtain a new schedule close to the original schedule with changes reflected by natural language constraint descriptions.
△ Less
Submitted 8 May, 2024;
originally announced May 2024.
-
Modal Folding: Discovering Smooth Folding Patterns for Sheet Materials using Strain-Space Modes
Authors:
Pengbin Tang,
Ronan Hinchet,
Roi Poranne,
Bernhard Thomaszewski,
Stelian Coros
Abstract:
Folding can transform mundane objects such as napkins into stunning works of art. However, finding new folding transformations for sheet materials is a challenging problem that requires expertise and real-world experimentation. In this paper, we present Modal Folding -- an automated approach for discovering energetically optimal folding transformations, i.e., large deformations that require little…
▽ More
Folding can transform mundane objects such as napkins into stunning works of art. However, finding new folding transformations for sheet materials is a challenging problem that requires expertise and real-world experimentation. In this paper, we present Modal Folding -- an automated approach for discovering energetically optimal folding transformations, i.e., large deformations that require little mechanical work. For small deformations, minimizing internal energy for fixed displacement magnitudes leads to the well-known elastic eigenmodes. While linear modes provide promising directions for bending, they cannot capture the rotational motion required for folding. To overcome this limitation, we introduce strain-space modes -- nonlinear analogues of elastic eigenmodes that operate on per-element curvatures instead of vertices. Using strain-space modes to determine target curvatures for bending elements, we can generate complex nonlinear folding motions by simply minimizing the sheet's internal energy. Our modal folding approach offers a systematic and automated way to create complex designs. We demonstrate the effectiveness of our method with simulation results for a range of shapes and materials, and validate our designs with physical prototypes.
△ Less
Submitted 7 May, 2024;
originally announced May 2024.
-
Improving Domain Generalization on Gaze Estimation via Branch-out Auxiliary Regularization
Authors:
Ruijie Zhao,
Pinyan Tang,
Sihui Luo
Abstract:
Despite remarkable advancements, mainstream gaze estimation techniques, particularly appearance-based methods, often suffer from performance degradation in uncontrolled environments due to variations in illumination and individual facial attributes. Existing domain adaptation strategies, limited by their need for target domain samples, may fall short in real-world applications. This letter introdu…
▽ More
Despite remarkable advancements, mainstream gaze estimation techniques, particularly appearance-based methods, often suffer from performance degradation in uncontrolled environments due to variations in illumination and individual facial attributes. Existing domain adaptation strategies, limited by their need for target domain samples, may fall short in real-world applications. This letter introduces Branch-out Auxiliary Regularization (BAR), an innovative method designed to boost gaze estimation's generalization capabilities without requiring direct access to target domain data. Specifically, BAR integrates two auxiliary consistency regularization branches: one that uses augmented samples to counteract environmental variations, and another that aligns gaze directions with positive source domain samples to encourage the learning of consistent gaze features. These auxiliary pathways strengthen the core network and are integrated in a smooth, plug-and-play manner, facilitating easy adaptation to various other models. Comprehensive experimental evaluations on four cross-dataset tasks demonstrate the superiority of our approach.
△ Less
Submitted 2 May, 2024;
originally announced May 2024.
-
Empirical Studies of Propagation Characteristics and Modeling Based on XL-MIMO Channel Measurement: From Far-Field to Near-Field
Authors:
Haiyang Miao,
Jianhua Zhang,
Pan Tang,
Lei Tian,
Weirang Zuo,
Qi Wei,
Guangyi Liu
Abstract:
In the sixth-generation (6G), the extremely large-scale multiple-input-multiple-output (XL-MIMO) is considered a promising enabling technology. With the further expansion of array element number and frequency bands, near-field effects will be more likely to occur in 6G communication systems. The near-field radio communications (NFRC) will become crucial in 6G communication systems. It is known tha…
▽ More
In the sixth-generation (6G), the extremely large-scale multiple-input-multiple-output (XL-MIMO) is considered a promising enabling technology. With the further expansion of array element number and frequency bands, near-field effects will be more likely to occur in 6G communication systems. The near-field radio communications (NFRC) will become crucial in 6G communication systems. It is known that the channel research is very important for the development and performance evaluation of the communication systems. In this paper, we will systematically investigate the channel measurements and modeling for the emerging NFRC. First, the principle design of massive MIMO channel measurement platform are solved. Second, an indoor XL-MIMO channel measurement campaign with 1600 array elements is conducted, and the channel characteristics are extracted and validated in the near-field region. Then, the outdoor XL-MIMO channel measurement campaign with 320 array elements is conducted, and the channel characteristics are extracted and modeled from near-field to far-field (NF-FF) region. The spatial non-stationary characteristics of angular spread at the transmitting end are more important in modeling. We hope that this work will give some reference to the near-field and far-field research for 6G.
△ Less
Submitted 26 April, 2024;
originally announced April 2024.
-
OccGen: Generative Multi-modal 3D Occupancy Prediction for Autonomous Driving
Authors:
Guoqing Wang,
Zhongdao Wang,
Pin Tang,
Jilai Zheng,
Xiangxuan Ren,
Bailan Feng,
Chao Ma
Abstract:
Existing solutions for 3D semantic occupancy prediction typically treat the task as a one-shot 3D voxel-wise segmentation perception problem. These discriminative methods focus on learning the mapping between the inputs and occupancy map in a single step, lacking the ability to gradually refine the occupancy map and the reasonable scene imaginative capacity to complete the local regions somewhere.…
▽ More
Existing solutions for 3D semantic occupancy prediction typically treat the task as a one-shot 3D voxel-wise segmentation perception problem. These discriminative methods focus on learning the mapping between the inputs and occupancy map in a single step, lacking the ability to gradually refine the occupancy map and the reasonable scene imaginative capacity to complete the local regions somewhere. In this paper, we introduce OccGen, a simple yet powerful generative perception model for the task of 3D semantic occupancy prediction. OccGen adopts a ''noise-to-occupancy'' generative paradigm, progressively inferring and refining the occupancy map by predicting and eliminating noise originating from a random Gaussian distribution. OccGen consists of two main components: a conditional encoder that is capable of processing multi-modal inputs, and a progressive refinement decoder that applies diffusion denoising using the multi-modal features as conditions. A key insight of this generative pipeline is that the diffusion denoising process is naturally able to model the coarse-to-fine refinement of the dense 3D occupancy map, therefore producing more detailed predictions. Extensive experiments on several occupancy benchmarks demonstrate the effectiveness of the proposed method compared to the state-of-the-art methods. For instance, OccGen relatively enhances the mIoU by 9.5%, 6.3%, and 13.3% on nuScenes-Occupancy dataset under the muli-modal, LiDAR-only, and camera-only settings, respectively. Moreover, as a generative perception model, OccGen exhibits desirable properties that discriminative models cannot achieve, such as providing uncertainty estimates alongside its multiple-step predictions.
△ Less
Submitted 23 April, 2024;
originally announced April 2024.
-
SparseOcc: Rethinking Sparse Latent Representation for Vision-Based Semantic Occupancy Prediction
Authors:
Pin Tang,
Zhongdao Wang,
Guoqing Wang,
Jilai Zheng,
Xiangxuan Ren,
Bailan Feng,
Chao Ma
Abstract:
Vision-based perception for autonomous driving requires an explicit modeling of a 3D space, where 2D latent representations are mapped and subsequent 3D operators are applied. However, operating on dense latent spaces introduces a cubic time and space complexity, which limits scalability in terms of perception range or spatial resolution. Existing approaches compress the dense representation using…
▽ More
Vision-based perception for autonomous driving requires an explicit modeling of a 3D space, where 2D latent representations are mapped and subsequent 3D operators are applied. However, operating on dense latent spaces introduces a cubic time and space complexity, which limits scalability in terms of perception range or spatial resolution. Existing approaches compress the dense representation using projections like Bird's Eye View (BEV) or Tri-Perspective View (TPV). Although efficient, these projections result in information loss, especially for tasks like semantic occupancy prediction. To address this, we propose SparseOcc, an efficient occupancy network inspired by sparse point cloud processing. It utilizes a lossless sparse latent representation with three key innovations. Firstly, a 3D sparse diffuser performs latent completion using spatially decomposed 3D sparse convolutional kernels. Secondly, a feature pyramid and sparse interpolation enhance scales with information from others. Finally, the transformer head is redesigned as a sparse variant. SparseOcc achieves a remarkable 74.9% reduction on FLOPs over the dense baseline. Interestingly, it also improves accuracy, from 12.8% to 14.1% mIOU, which in part can be attributed to the sparse representation's ability to avoid hallucinations on empty voxels.
△ Less
Submitted 15 April, 2024;
originally announced April 2024.
-
Single-Shared Network with Prior-Inspired Loss for Parameter-Efficient Multi-Modal Imaging Skin Lesion Classification
Authors:
Peng Tang,
Tobias Lasser
Abstract:
In this study, we introduce a multi-modal approach that efficiently integrates multi-scale clinical and dermoscopy features within a single network, thereby substantially reducing model parameters. The proposed method includes three novel fusion schemes. Firstly, unlike current methods that usually employ two individual models for for clinical and dermoscopy modalities, we verified that multimodal…
▽ More
In this study, we introduce a multi-modal approach that efficiently integrates multi-scale clinical and dermoscopy features within a single network, thereby substantially reducing model parameters. The proposed method includes three novel fusion schemes. Firstly, unlike current methods that usually employ two individual models for for clinical and dermoscopy modalities, we verified that multimodal feature can be learned by sharing the parameters of encoder while leaving the individual modal-specific classifiers. Secondly, the shared cross-attention module can replace the individual one to efficiently interact between two modalities at multiple layers. Thirdly, different from current methods that equally optimize dermoscopy and clinical branches, inspired by prior knowledge that dermoscopy images play a more significant role than clinical images, we propose a novel biased loss. This loss guides the single-shared network to prioritize dermoscopy information over clinical information, implicitly learning a better joint feature representation for the modal-specific task. Extensive experiments on a well-recognized Seven-Point Checklist (SPC) dataset and a collected dataset demonstrate the effectiveness of our method on both CNN and Transformer structures. Furthermore, our method exhibits superiority in both accuracy and model parameters compared to currently advanced methods.
△ Less
Submitted 28 March, 2024;
originally announced March 2024.
-
Synthesize Step-by-Step: Tools, Templates and LLMs as Data Generators for Reasoning-Based Chart VQA
Authors:
Zhuowan Li,
Bhavan Jasani,
Peng Tang,
Shabnam Ghadar
Abstract:
Understanding data visualizations like charts and plots requires reasoning about both visual elements and numerics. Although strong in extractive questions, current chart visual question answering (chart VQA) models suffer on complex reasoning questions. In this work, we address the lack of reasoning ability by data augmentation. We leverage Large Language Models (LLMs), which have shown to have s…
▽ More
Understanding data visualizations like charts and plots requires reasoning about both visual elements and numerics. Although strong in extractive questions, current chart visual question answering (chart VQA) models suffer on complex reasoning questions. In this work, we address the lack of reasoning ability by data augmentation. We leverage Large Language Models (LLMs), which have shown to have strong reasoning ability, as an automatic data annotator that generates question-answer annotations for chart images. The key innovation in our method lies in the Synthesize Step-by-Step strategy: our LLM-based data generator learns to decompose the complex question into step-by-step sub-questions (rationales), which are then used to derive the final answer using external tools, i.e. Python. This step-wise generation procedure is trained on synthetic data generated using a template-based QA generation pipeline. Experimental results highlight the significance of the proposed step-by-step generation. By training with the LLM-augmented data (LAMENDA), we significantly enhance the chart VQA models, achieving the state-of-the-art accuracy on the ChartQA and PlotQA datasets. In particular, our approach improves the accuracy of the previous state-of-the-art approach from 38% to 54% on the human-written questions in the ChartQA dataset, which needs strong reasoning. We hope our work underscores the potential of synthetic data and encourages further exploration of data augmentation using LLMs for reasoning-heavy tasks.
△ Less
Submitted 28 March, 2024; v1 submitted 24 March, 2024;
originally announced March 2024.
-
Federated Semi-supervised Learning for Medical Image Segmentation with intra-client and inter-client Consistency
Authors:
Yubin Zheng,
Peng Tang,
Tianjie Ju,
Weidong Qiu,
Bo Yan
Abstract:
Medical image segmentation plays a vital role in clinic disease diagnosis and medical image analysis. However, labeling medical images for segmentation task is tough due to the indispensable domain expertise of radiologists. Furthermore, considering the privacy and sensitivity of medical images, it is impractical to build a centralized segmentation dataset from different medical institutions. Fede…
▽ More
Medical image segmentation plays a vital role in clinic disease diagnosis and medical image analysis. However, labeling medical images for segmentation task is tough due to the indispensable domain expertise of radiologists. Furthermore, considering the privacy and sensitivity of medical images, it is impractical to build a centralized segmentation dataset from different medical institutions. Federated learning aims to train a shared model of isolated clients without local data exchange which aligns well with the scarcity and privacy characteristics of medical data. To solve the problem of labeling hard, many advanced semi-supervised methods have been proposed in a centralized data setting. As for federated learning, how to conduct semi-supervised learning under this distributed scenario is worth investigating. In this work, we propose a novel federated semi-supervised learning framework for medical image segmentation. The intra-client and inter-client consistency learning are introduced to smooth predictions at the data level and avoid confirmation bias of local models. They are achieved with the assistance of a Variational Autoencoder (VAE) trained collaboratively by clients. The added VAE model plays three roles: 1) extracting latent low-dimensional features of all labeled and unlabeled data; 2) performing a novel type of data augmentation in calculating intra-client consistency loss; 3) utilizing the generative ability of itself to conduct inter-client consistency distillation. The proposed framework is compared with other federated semi-supervised or self-supervised learning methods. The experimental results illustrate that our method outperforms the state-of-the-art method while avoiding a lot of computation and communication overhead.
△ Less
Submitted 19 March, 2024;
originally announced March 2024.
-
SSDRec: Self-Augmented Sequence Denoising for Sequential Recommendation
Authors:
Chi Zhang,
Qilong Han,
Rui Chen,
Xiangyu Zhao,
Peng Tang,
Hongtao Song
Abstract:
Traditional sequential recommendation methods assume that users' sequence data is clean enough to learn accurate sequence representations to reflect user preferences. In practice, users' sequences inevitably contain noise (e.g., accidental interactions), leading to incorrect reflections of user preferences. Consequently, some pioneer studies have explored modeling sequentiality and correlations in…
▽ More
Traditional sequential recommendation methods assume that users' sequence data is clean enough to learn accurate sequence representations to reflect user preferences. In practice, users' sequences inevitably contain noise (e.g., accidental interactions), leading to incorrect reflections of user preferences. Consequently, some pioneer studies have explored modeling sequentiality and correlations in sequences to implicitly or explicitly reduce noise's influence. However, relying on only available intra-sequence information (i.e., sequentiality and correlations in a sequence) is insufficient and may result in over-denoising and under-denoising problems (OUPs), especially for short sequences. To improve reliability, we propose to augment sequences by inserting items before denoising. However, due to the data sparsity issue and computational costs, it is challenging to select proper items from the entire item universe to insert into proper positions in a target sequence. Motivated by the above observation, we propose a novel framework--Self-augmented Sequence Denoising for sequential Recommendation (SSDRec) with a three-stage learning paradigm to solve the above challenges. In the first stage, we empower SSDRec by a global relation encoder to learn multi-faceted inter-sequence relations in a data-driven manner. These relations serve as prior knowledge to guide subsequent stages. In the second stage, we devise a self-augmentation module to augment sequences to alleviate OUPs. Finally, we employ a hierarchical denoising module in the third stage to reduce the risk of false augmentations and pinpoint all noise in raw sequences. Extensive experiments on five real-world datasets demonstrate the superiority of \model over state-of-the-art denoising methods and its flexible applications to mainstream sequential recommendation models. The source code is available at https://github.com/zc-97/SSDRec.
△ Less
Submitted 7 March, 2024;
originally announced March 2024.
-
An Interactive Agent Foundation Model
Authors:
Zane Durante,
Bidipta Sarkar,
Ran Gong,
Rohan Taori,
Yusuke Noda,
Paul Tang,
Ehsan Adeli,
Shrinidhi Kowshika Lakshmikanth,
Kevin Schulman,
Arnold Milstein,
Demetri Terzopoulos,
Ade Famoti,
Noboru Kuno,
Ashley Llorens,
Hoi Vo,
Katsu Ikeuchi,
Li Fei-Fei,
Jianfeng Gao,
Naoki Wake,
Qiuyuan Huang
Abstract:
The development of artificial intelligence systems is transitioning from creating static, task-specific models to dynamic, agent-based systems capable of performing well in a wide range of applications. We propose an Interactive Agent Foundation Model that uses a novel multi-task agent training paradigm for training AI agents across a wide range of domains, datasets, and tasks. Our training paradi…
▽ More
The development of artificial intelligence systems is transitioning from creating static, task-specific models to dynamic, agent-based systems capable of performing well in a wide range of applications. We propose an Interactive Agent Foundation Model that uses a novel multi-task agent training paradigm for training AI agents across a wide range of domains, datasets, and tasks. Our training paradigm unifies diverse pre-training strategies, including visual masked auto-encoders, language modeling, and next-action prediction, enabling a versatile and adaptable AI framework. We demonstrate the performance of our framework across three separate domains -- Robotics, Gaming AI, and Healthcare. Our model demonstrates its ability to generate meaningful and contextually relevant outputs in each area. The strength of our approach lies in its generality, leveraging a variety of data sources such as robotics sequences, gameplay data, large-scale video datasets, and textual information for effective multimodal and multi-task learning. Our approach provides a promising avenue for developing generalist, action-taking, multimodal systems.
△ Less
Submitted 17 June, 2024; v1 submitted 8 February, 2024;
originally announced February 2024.
-
Feature Norm Regularized Federated Learning: Transforming Skewed Distributions into Global Insights
Authors:
Ke Hu,
WeiDong Qiu,
Peng Tang
Abstract:
In the field of federated learning, addressing non-independent and identically distributed (non-i.i.d.) data remains a quintessential challenge for improving global model performance. This work introduces the Feature Norm Regularized Federated Learning (FNR-FL) algorithm, which uniquely incorporates class average feature norms to enhance model accuracy and convergence in non-i.i.d. scenarios. Our…
▽ More
In the field of federated learning, addressing non-independent and identically distributed (non-i.i.d.) data remains a quintessential challenge for improving global model performance. This work introduces the Feature Norm Regularized Federated Learning (FNR-FL) algorithm, which uniquely incorporates class average feature norms to enhance model accuracy and convergence in non-i.i.d. scenarios. Our comprehensive analysis reveals that FNR-FL not only accelerates convergence but also significantly surpasses other contemporary federated learning algorithms in test accuracy, particularly under feature distribution skew scenarios. The novel modular design of FNR-FL facilitates seamless integration with existing federated learning frameworks, reinforcing its adaptability and potential for widespread application. We substantiate our claims through rigorous empirical evaluations, demonstrating FNR-FL's exceptional performance across various skewed data distributions. Relative to FedAvg, FNR-FL exhibits a substantial 66.24\% improvement in accuracy and a significant 11.40\% reduction in training time, underscoring its enhanced effectiveness and efficiency.
△ Less
Submitted 11 December, 2023;
originally announced December 2023.
-
Joint-Individual Fusion Structure with Fusion Attention Module for Multi-Modal Skin Cancer Classification
Authors:
Peng Tang,
Xintong Yan,
Yang Nan,
Xiaobin Hu,
Xiaobin Hu,
Bjoern H Menzee. Sebastian Krammer,
Tobias Lasser
Abstract:
Most convolutional neural network (CNN) based methods for skin cancer classification obtain their results using only dermatological images. Although good classification results have been shown, more accurate results can be achieved by considering the patient's metadata, which is valuable clinical information for dermatologists. Current methods only use the simple joint fusion structure (FS) and fu…
▽ More
Most convolutional neural network (CNN) based methods for skin cancer classification obtain their results using only dermatological images. Although good classification results have been shown, more accurate results can be achieved by considering the patient's metadata, which is valuable clinical information for dermatologists. Current methods only use the simple joint fusion structure (FS) and fusion modules (FMs) for the multi-modal classification methods, there still is room to increase the accuracy by exploring more advanced FS and FM. Therefore, in this paper, we design a new fusion method that combines dermatological images (dermoscopy images or clinical images) and patient metadata for skin cancer classification from the perspectives of FS and FM. First, we propose a joint-individual fusion (JIF) structure that learns the shared features of multi-modality data and preserves specific features simultaneously. Second, we introduce a fusion attention (FA) module that enhances the most relevant image and metadata features based on both the self and mutual attention mechanism to support the decision-making pipeline. We compare the proposed JIF-MMFA method with other state-of-the-art fusion methods on three different public datasets. The results show that our JIF-MMFA method improves the classification results for all tested CNN backbones and performs better than the other fusion methods on the three public datasets, demonstrating our method's effectiveness and robustness
△ Less
Submitted 7 December, 2023;
originally announced December 2023.
-
Favour: FAst Variance Operator for Uncertainty Rating
Authors:
Thomas D. Ahle,
Sahar Karimi,
Peter Tak Peter Tang
Abstract:
Bayesian Neural Networks (BNN) have emerged as a crucial approach for interpreting ML predictions. By sampling from the posterior distribution, data scientists may estimate the uncertainty of an inference. Unfortunately many inference samples are often needed, the overhead of which greatly hinder BNN's wide adoption. To mitigate this, previous work proposed propagating the first and second moments…
▽ More
Bayesian Neural Networks (BNN) have emerged as a crucial approach for interpreting ML predictions. By sampling from the posterior distribution, data scientists may estimate the uncertainty of an inference. Unfortunately many inference samples are often needed, the overhead of which greatly hinder BNN's wide adoption. To mitigate this, previous work proposed propagating the first and second moments of the posterior directly through the network. However, on its own this method is even slower than sampling, so the propagated variance needs to be approximated such as assuming independence between neural nodes. The resulting trade-off between quality and inference time did not match even plain Monte Carlo sampling.
Our contribution is a more principled variance propagation framework based on "spiked covariance matrices", which smoothly interpolates between quality and inference time. This is made possible by a new fast algorithm for updating a diagonal-plus-low-rank matrix approximation under various operations. We tested our algorithm against sampling based MC Dropout and Variational Inference on a number of downstream uncertainty themed tasks, such as calibration and out-of-distribution testing. We find that Favour is as fast as performing 2-3 inference samples, while matching the performance of 10-100 samples.
In summary, this work enables the use of BNN in the realm of performance critical tasks where they have previously been out of reach.
△ Less
Submitted 21 November, 2023;
originally announced November 2023.
-
AcademicGPT: Empowering Academic Research
Authors:
Shufa Wei,
Xiaolong Xu,
Xianbiao Qi,
Xi Yin,
Jun Xia,
Jingyi Ren,
Peijun Tang,
Yuxiang Zhong,
Yihao Chen,
Xiaoqin Ren,
Yuxin Liang,
Liankai Huang,
Kai Xie,
Weikang Gui,
Wei Tan,
Shuanglong Sun,
Yongquan Hu,
Qinxian Liu,
Nanjin Li,
Chihao Dai,
Lihua Wang,
Xiaohui Liu,
Lei Zhang,
Yutao Xie
Abstract:
Large Language Models (LLMs) have demonstrated exceptional capabilities across various natural language processing tasks. Yet, many of these advanced LLMs are tailored for broad, general-purpose applications. In this technical report, we introduce AcademicGPT, designed specifically to empower academic research. AcademicGPT is a continual training model derived from LLaMA2-70B. Our training corpus…
▽ More
Large Language Models (LLMs) have demonstrated exceptional capabilities across various natural language processing tasks. Yet, many of these advanced LLMs are tailored for broad, general-purpose applications. In this technical report, we introduce AcademicGPT, designed specifically to empower academic research. AcademicGPT is a continual training model derived from LLaMA2-70B. Our training corpus mainly consists of academic papers, thesis, content from some academic domain, high-quality Chinese data and others. While it may not be extensive in data scale, AcademicGPT marks our initial venture into a domain-specific GPT tailored for research area. We evaluate AcademicGPT on several established public benchmarks such as MMLU and CEval, as well as on some specialized academic benchmarks like PubMedQA, SCIEval, and our newly-created ComputerScienceQA, to demonstrate its ability from general knowledge ability, to Chinese ability, and to academic ability. Building upon AcademicGPT's foundation model, we also developed several applications catered to the academic area, including General Academic Question Answering, AI-assisted Paper Reading, Paper Review, and AI-assisted Title and Abstract Generation.
△ Less
Submitted 20 November, 2023;
originally announced November 2023.
-
DEED: Dynamic Early Exit on Decoder for Accelerating Encoder-Decoder Transformer Models
Authors:
Peng Tang,
Pengkai Zhu,
Tian Li,
Srikar Appalaraju,
Vijay Mahadevan,
R. Manmatha
Abstract:
Encoder-decoder transformer models have achieved great success on various vision-language (VL) tasks, but they suffer from high inference latency. Typically, the decoder takes up most of the latency because of the auto-regressive decoding. To accelerate the inference, we propose an approach of performing Dynamic Early Exit on Decoder (DEED). We build a multi-exit encoder-decoder transformer model…
▽ More
Encoder-decoder transformer models have achieved great success on various vision-language (VL) tasks, but they suffer from high inference latency. Typically, the decoder takes up most of the latency because of the auto-regressive decoding. To accelerate the inference, we propose an approach of performing Dynamic Early Exit on Decoder (DEED). We build a multi-exit encoder-decoder transformer model which is trained with deep supervision so that each of its decoder layers is capable of generating plausible predictions. In addition, we leverage simple yet practical techniques, including shared generation head and adaptation modules, to keep accuracy when exiting at shallow decoder layers. Based on the multi-exit model, we perform step-level dynamic early exit during inference, where the model may decide to use fewer decoder layers based on its confidence of the current layer at each individual decoding step. Considering different number of decoder layers may be used at different decoding steps, we compute deeper-layer decoder features of previous decoding steps just-in-time, which ensures the features from different decoding steps are semantically aligned. We evaluate our approach with two state-of-the-art encoder-decoder transformer models on various VL tasks. We show our approach can reduce overall inference latency by 30%-60% with comparable or even higher accuracy compared to baselines.
△ Less
Submitted 14 November, 2023;
originally announced November 2023.
-
Multiple-Question Multiple-Answer Text-VQA
Authors:
Peng Tang,
Srikar Appalaraju,
R. Manmatha,
Yusheng Xie,
Vijay Mahadevan
Abstract:
We present Multiple-Question Multiple-Answer (MQMA), a novel approach to do text-VQA in encoder-decoder transformer models. The text-VQA task requires a model to answer a question by understanding multi-modal content: text (typically from OCR) and an associated image. To the best of our knowledge, almost all previous approaches for text-VQA process a single question and its associated content to p…
▽ More
We present Multiple-Question Multiple-Answer (MQMA), a novel approach to do text-VQA in encoder-decoder transformer models. The text-VQA task requires a model to answer a question by understanding multi-modal content: text (typically from OCR) and an associated image. To the best of our knowledge, almost all previous approaches for text-VQA process a single question and its associated content to predict a single answer. In order to answer multiple questions from the same image, each question and content are fed into the model multiple times. In contrast, our proposed MQMA approach takes multiple questions and content as input at the encoder and predicts multiple answers at the decoder in an auto-regressive manner at the same time. We make several novel architectural modifications to standard encoder-decoder transformers to support MQMA. We also propose a novel MQMA denoising pre-training task which is designed to teach the model to align and delineate multiple questions and content with associated answers. MQMA pre-trained model achieves state-of-the-art results on multiple text-VQA datasets, each with strong baselines. Specifically, on OCR-VQA (+2.5%), TextVQA (+1.4%), ST-VQA (+0.6%), DocVQA (+1.1%) absolute improvements over the previous state-of-the-art approaches.
△ Less
Submitted 14 November, 2023;
originally announced November 2023.
-
Evoke: Evoking Critical Thinking Abilities in LLMs via Reviewer-Author Prompt Editing
Authors:
Xinyu Hu,
Pengfei Tang,
Simiao Zuo,
Zihan Wang,
Bowen Song,
Qiang Lou,
Jian Jiao,
Denis Charles
Abstract:
Large language models (LLMs) have made impressive progress in natural language processing. These models rely on proper human instructions (or prompts) to generate suitable responses. However, the potential of LLMs are not fully harnessed by commonly-used prompting methods: many human-in-the-loop algorithms employ ad-hoc procedures for prompt selection; while auto prompt generation approaches are e…
▽ More
Large language models (LLMs) have made impressive progress in natural language processing. These models rely on proper human instructions (or prompts) to generate suitable responses. However, the potential of LLMs are not fully harnessed by commonly-used prompting methods: many human-in-the-loop algorithms employ ad-hoc procedures for prompt selection; while auto prompt generation approaches are essentially searching all possible prompts randomly and inefficiently. We propose Evoke, an automatic prompt refinement framework. In Evoke, there are two instances of a same LLM: one as a reviewer (LLM-Reviewer), it scores the current prompt; the other as an author (LLM-Author), it edits the prompt by considering the edit history and the reviewer's feedback. Such an author-reviewer feedback loop ensures that the prompt is refined in each iteration. We further aggregate a data selection approach to Evoke, where only the hard samples are exposed to the LLM. The hard samples are more important because the LLM can develop deeper understanding of the tasks out of them, while the model may already know how to solve the easier cases. Experimental results show that Evoke significantly outperforms existing methods. For instance, in the challenging task of logical fallacy detection, Evoke scores above 80, while all other baseline methods struggle to reach 20.
△ Less
Submitted 20 October, 2023;
originally announced October 2023.
-
Individually Rational Collaborative Vehicle Routing through Give-And-Take Exchanges
Authors:
Paul Mingzheng Tang,
Ba Phong Tran,
Hoong Chuin Lau
Abstract:
In this paper, we are concerned with the automated exchange of orders between logistics companies in a marketplace platform to optimize total revenues. We introduce a novel multi-agent approach to this problem, focusing on the Collaborative Vehicle Routing Problem (CVRP) through the lens of individual rationality. Our proposed algorithm applies the principles of Vehicle Routing Problem (VRP) to pa…
▽ More
In this paper, we are concerned with the automated exchange of orders between logistics companies in a marketplace platform to optimize total revenues. We introduce a novel multi-agent approach to this problem, focusing on the Collaborative Vehicle Routing Problem (CVRP) through the lens of individual rationality. Our proposed algorithm applies the principles of Vehicle Routing Problem (VRP) to pairs of vehicles from different logistics companies, optimizing the overall routes while considering standard VRP constraints plus individual rationality constraints. By facilitating cooperation among competing logistics agents through a Give-and-Take approach, we show that it is possible to reduce travel distance and increase operational efficiency system-wide. More importantly, our approach ensures individual rationality and faster convergence, which are important properties of ensuring the long-term sustainability of the marketplace platform. We demonstrate the efficacy of our approach through extensive experiments using real-world test data from major logistics companies. The results reveal our algorithm's ability to rapidly identify numerous optimal solutions, underscoring its practical applicability and potential to transform the logistics industry.
△ Less
Submitted 31 August, 2023;
originally announced August 2023.
-
Learning the hub graphical Lasso model with the structured sparsity via an efficient algorithm
Authors:
Chengjing Wang,
Peipei Tang,
Wenling He,
Meixia Lin
Abstract:
Graphical models have exhibited their performance in numerous tasks ranging from biological analysis to recommender systems. However, graphical models with hub nodes are computationally difficult to fit, particularly when the dimension of the data is large. To efficiently estimate the hub graphical models, we introduce a two-phase algorithm. The proposed algorithm first generates a good initial po…
▽ More
Graphical models have exhibited their performance in numerous tasks ranging from biological analysis to recommender systems. However, graphical models with hub nodes are computationally difficult to fit, particularly when the dimension of the data is large. To efficiently estimate the hub graphical models, we introduce a two-phase algorithm. The proposed algorithm first generates a good initial point via a dual alternating direction method of multipliers (ADMM), and then warm starts a semismooth Newton (SSN) based augmented Lagrangian method (ALM) to compute a solution that is accurate enough for practical tasks. The sparsity structure of the generalized Jacobian ensures that the algorithm can obtain a nice solution very efficiently. Comprehensive experiments on both synthetic data and real data show that it obviously outperforms the existing state-of-the-art algorithms. In particular, in some high dimensional tasks, it can save more than 70\% of the execution time, meanwhile still achieves a high-quality estimation.
△ Less
Submitted 17 August, 2023;
originally announced August 2023.
-
SR-R$^2$KAC: Improving Single Image Defocus Deblurring
Authors:
Peng Tang,
Zhiqiang Xu,
Pengfei Wei,
Xiaobin Hu,
Peilin Zhao,
Xin Cao,
Chunlai Zhou,
Tobias Lasser
Abstract:
We propose an efficient deep learning method for single image defocus deblurring (SIDD) by further exploring inverse kernel properties. Although the current inverse kernel method, i.e., kernel-sharing parallel atrous convolution (KPAC), can address spatially varying defocus blurs, it has difficulty in handling large blurs of this kind. To tackle this issue, we propose a Residual and Recursive Ke…
▽ More
We propose an efficient deep learning method for single image defocus deblurring (SIDD) by further exploring inverse kernel properties. Although the current inverse kernel method, i.e., kernel-sharing parallel atrous convolution (KPAC), can address spatially varying defocus blurs, it has difficulty in handling large blurs of this kind. To tackle this issue, we propose a Residual and Recursive Kernel-sharing Atrous Convolution (R$^2$KAC). R$^2$KAC builds on a significant observation of inverse kernels, that is, successive use of inverse-kernel-based deconvolutions with fixed size helps remove unexpected large blurs but produces ringing artifacts. Specifically, on top of kernel-sharing atrous convolutions used to simulate multi-scale inverse kernels, R$^2$KAC applies atrous convolutions recursively to simulate a large inverse kernel. Specifically, on top of kernel-sharing atrous convolutions, R$^2$KAC stacks atrous convolutions recursively to simulate a large inverse kernel. To further alleviate the contingent effect of recursive stacking, i.e., ringing artifacts, we add identity shortcuts between atrous convolutions to simulate residual deconvolutions. Lastly, a scale recurrent module is embedded in the R$^2$KAC network, leading to SR-R$^2$KAC, so that multi-scale information from coarse to fine is exploited to progressively remove the spatially varying defocus blurs. Extensive experimental results show that our method achieves the state-of-the-art performance.
△ Less
Submitted 30 July, 2023;
originally announced July 2023.
-
Graph-Ensemble Learning Model for Multi-label Skin Lesion Classification using Dermoscopy and Clinical Images
Authors:
Peng Tang,
Yang Nan,
Tobias Lasser
Abstract:
Many skin lesion analysis (SLA) methods recently focused on developing a multi-modal-based multi-label classification method due to two factors. The first is multi-modal data, i.e., clinical and dermoscopy images, which can provide complementary information to obtain more accurate results than single-modal data. The second one is that multi-label classification, i.e., seven-point checklist (SPC) c…
▽ More
Many skin lesion analysis (SLA) methods recently focused on developing a multi-modal-based multi-label classification method due to two factors. The first is multi-modal data, i.e., clinical and dermoscopy images, which can provide complementary information to obtain more accurate results than single-modal data. The second one is that multi-label classification, i.e., seven-point checklist (SPC) criteria as an auxiliary classification task can not only boost the diagnostic accuracy of melanoma in the deep learning (DL) pipeline but also provide more useful functions to the clinical doctor as it is commonly used in clinical dermatologist's diagnosis. However, most methods only focus on designing a better module for multi-modal data fusion; few methods explore utilizing the label correlation between SPC and skin disease for performance improvement. This study fills the gap that introduces a Graph Convolution Network (GCN) to exploit prior co-occurrence between each category as a correlation matrix into the DL model for the multi-label classification. However, directly applying GCN degraded the performances in our experiments; we attribute this to the weak generalization ability of GCN in the scenario of insufficient statistical samples of medical data. We tackle this issue by proposing a Graph-Ensemble Learning Model (GELN) that views the prediction from GCN as complementary information of the predictions from the fusion model and adaptively fuses them by a weighted averaging scheme, which can utilize the valuable information from GCN while avoiding its negative influences as much as possible. To evaluate our method, we conduct experiments on public datasets. The results illustrate that our GELN can consistently improve the classification performance on different datasets and that the proposed method can achieve state-of-the-art performance in SPC and diagnosis classification.
△ Less
Submitted 4 July, 2023;
originally announced July 2023.
-
DeepTagger: Knowledge Enhanced Named Entity Recognition for Web-Based Ads Queries
Authors:
Simiao Zuo,
Pengfei Tang,
Xinyu Hu,
Qiang Lou,
Jian Jiao,
Denis Charles
Abstract:
Named entity recognition (NER) is a crucial task for online advertisement. State-of-the-art solutions leverage pre-trained language models for this task. However, three major challenges remain unresolved: web queries differ from natural language, on which pre-trained models are trained; web queries are short and lack contextual information; and labeled data for NER is scarce. We propose DeepTagger…
▽ More
Named entity recognition (NER) is a crucial task for online advertisement. State-of-the-art solutions leverage pre-trained language models for this task. However, three major challenges remain unresolved: web queries differ from natural language, on which pre-trained models are trained; web queries are short and lack contextual information; and labeled data for NER is scarce. We propose DeepTagger, a knowledge-enhanced NER model for web-based ads queries. The proposed knowledge enhancement framework leverages both model-free and model-based approaches. For model-free enhancement, we collect unlabeled web queries to augment domain knowledge; and we collect web search results to enrich the information of ads queries. We further leverage effective prompting methods to automatically generate labels using large language models such as ChatGPT. Additionally, we adopt a model-based knowledge enhancement method based on adversarial data augmentation. We employ a three-stage training framework to train DeepTagger models. Empirical results in various NER tasks demonstrate the effectiveness of the proposed framework.
△ Less
Submitted 30 June, 2023;
originally announced June 2023.
-
Efficient and Interpretable Compressive Text Summarisation with Unsupervised Dual-Agent Reinforcement Learning
Authors:
Peggy Tang,
Junbin Gao,
Lei Zhang,
Zhiyong Wang
Abstract:
Recently, compressive text summarisation offers a balance between the conciseness issue of extractive summarisation and the factual hallucination issue of abstractive summarisation. However, most existing compressive summarisation methods are supervised, relying on the expensive effort of creating a new training dataset with corresponding compressive summaries. In this paper, we propose an efficie…
▽ More
Recently, compressive text summarisation offers a balance between the conciseness issue of extractive summarisation and the factual hallucination issue of abstractive summarisation. However, most existing compressive summarisation methods are supervised, relying on the expensive effort of creating a new training dataset with corresponding compressive summaries. In this paper, we propose an efficient and interpretable compressive summarisation method that utilises unsupervised dual-agent reinforcement learning to optimise a summary's semantic coverage and fluency by simulating human judgment on summarisation quality. Our model consists of an extractor agent and a compressor agent, and both agents have a multi-head attentional pointer-based structure. The extractor agent first chooses salient sentences from a document, and then the compressor agent compresses these extracted sentences by selecting salient words to form a summary without using reference summaries to compute the summary reward. To our best knowledge, this is the first work on unsupervised compressive summarisation. Experimental results on three widely used datasets (e.g., Newsroom, CNN/DM, and XSum) show that our model achieves promising performance and a significant improvement on Newsroom in terms of the ROUGE metric, as well as interpretability of semantic coverage of summarisation results.
△ Less
Submitted 6 June, 2023;
originally announced June 2023.
-
DocFormerv2: Local Features for Document Understanding
Authors:
Srikar Appalaraju,
Peng Tang,
Qi Dong,
Nishant Sankaran,
Yichu Zhou,
R. Manmatha
Abstract:
We propose DocFormerv2, a multi-modal transformer for Visual Document Understanding (VDU). The VDU domain entails understanding documents (beyond mere OCR predictions) e.g., extracting information from a form, VQA for documents and other tasks. VDU is challenging as it needs a model to make sense of multiple modalities (visual, language and spatial) to make a prediction. Our approach, termed DocFo…
▽ More
We propose DocFormerv2, a multi-modal transformer for Visual Document Understanding (VDU). The VDU domain entails understanding documents (beyond mere OCR predictions) e.g., extracting information from a form, VQA for documents and other tasks. VDU is challenging as it needs a model to make sense of multiple modalities (visual, language and spatial) to make a prediction. Our approach, termed DocFormerv2 is an encoder-decoder transformer which takes as input - vision, language and spatial features. DocFormerv2 is pre-trained with unsupervised tasks employed asymmetrically i.e., two novel document tasks on encoder and one on the auto-regressive decoder. The unsupervised tasks have been carefully designed to ensure that the pre-training encourages local-feature alignment between multiple modalities. DocFormerv2 when evaluated on nine datasets shows state-of-the-art performance over strong baselines e.g. TabFact (4.3%), InfoVQA (1.4%), FUNSD (1%). Furthermore, to show generalization capabilities, on three VQA tasks involving scene-text, Doc- Formerv2 outperforms previous comparably-sized models and even does better than much larger models (such as GIT2, PaLi and Flamingo) on some tasks. Extensive ablations show that due to its pre-training, DocFormerv2 understands multiple modalities better than prior-art in VDU.
△ Less
Submitted 2 June, 2023;
originally announced June 2023.
-
RNN-Guard: Certified Robustness Against Multi-frame Attacks for Recurrent Neural Networks
Authors:
Yunruo Zhang,
Tianyu Du,
Shouling Ji,
Peng Tang,
Shanqing Guo
Abstract:
It is well-known that recurrent neural networks (RNNs), although widely used, are vulnerable to adversarial attacks including one-frame attacks and multi-frame attacks. Though a few certified defenses exist to provide guaranteed robustness against one-frame attacks, we prove that defending against multi-frame attacks remains a challenging problem due to their enormous perturbation space. In this p…
▽ More
It is well-known that recurrent neural networks (RNNs), although widely used, are vulnerable to adversarial attacks including one-frame attacks and multi-frame attacks. Though a few certified defenses exist to provide guaranteed robustness against one-frame attacks, we prove that defending against multi-frame attacks remains a challenging problem due to their enormous perturbation space. In this paper, we propose the first certified defense against multi-frame attacks for RNNs called RNN-Guard. To address the above challenge, we adopt the perturb-all-frame strategy to construct perturbation spaces consistent with those in multi-frame attacks. However, the perturb-all-frame strategy causes a precision issue in linear relaxations. To address this issue, we introduce a novel abstract domain called InterZono and design tighter relaxations. We prove that InterZono is more precise than Zonotope yet carries the same time complexity. Experimental evaluations across various datasets and model structures show that the certified robust accuracy calculated by RNN-Guard with InterZono is up to 2.18 times higher than that with Zonotope. In addition, we extend RNN-Guard as the first certified training method against multi-frame attacks to directly enhance RNNs' robustness. The results show that the certified robust accuracy of models trained with RNN-Guard against multi-frame attacks is 15.47 to 67.65 percentage points higher than those with other training methods.
△ Less
Submitted 16 April, 2023;
originally announced April 2023.
-
A Strategy-proof Mechanism For Networked Housing Markets
Authors:
Youjia Zhang,
Pingzhong Tang
Abstract:
This paper studies a house allocation problem in a networked housing market, where agents can invite others to join the system in order to enrich their options. Top Trading Cycle is a well-known matching mechanism that achieves a set of desirable properties in a market without invitations. However, under a tree-structured networked market, existing agents have to strategically propagate the barter…
▽ More
This paper studies a house allocation problem in a networked housing market, where agents can invite others to join the system in order to enrich their options. Top Trading Cycle is a well-known matching mechanism that achieves a set of desirable properties in a market without invitations. However, under a tree-structured networked market, existing agents have to strategically propagate the barter market as their invitees may compete in the same house with them. Our impossibility result shows that TTC cannot work properly in a networked housing market. Hence, we characterize the possible competitions between inviters and invitees, which lead agents to fail to refer others truthfully (strategy-proof). We then present a novel mechanism based on TTC, avoiding the aforementioned competition to ensure all agents report preference and propagate the barter market truthfully. Unlike the existing mechanisms, the agents' preferences are less restricted under our mechanism. Furthermore, we show by simulations that our mechanism outperforms the existing matching mechanisms in terms of the number of swaps and agents' satisfaction.
△ Less
Submitted 19 March, 2023;
originally announced March 2023.
-
Sequential Persuasion Using Limited Experiments
Authors:
Bonan Ni,
Weiran Shen,
Pingzhong Tang
Abstract:
Bayesian persuasion and its derived information design problem has been one of the main research agendas in the economics and computation literature over the past decade. However, when attempting to apply its model and theory, one is often limited by the fact that the sender can only implement very restricted information structures. Moreover, in this case, the sender can possibly achieve higher ex…
▽ More
Bayesian persuasion and its derived information design problem has been one of the main research agendas in the economics and computation literature over the past decade. However, when attempting to apply its model and theory, one is often limited by the fact that the sender can only implement very restricted information structures. Moreover, in this case, the sender can possibly achieve higher expected utility by performing a sequence of feasible experiments, where the choice of each experiment depends on the outcomes of all previous experiments. Indeed, it has been well observed that real life persuasions often take place in rounds during which the sender exhibits experiments/arguments sequentially.
We study the sender's expected utility maximization using finite and infinite sequences of experiments. For infinite sequences of experiments, we characterize the supremum of the sender's expected utility using a function that generalizes the concave closure definition in the standard Bayesian persuasion problem. With this characterization, we first study a special case where the sender can use feasible experiments to achieve the optimal expected utility of the standard Bayesian persuasion without feasibility constraints, which is a trivial utility upper bound, and establish structural findings about the sender's optimal sequential design in this case. Then we derive conditions under which the sender's optimal sequential design exists; when an optimal sequential design exists, there exists an optimal design that is Markovian, i.e., the choice of the next experiment only depends on the receiver's current belief.
△ Less
Submitted 19 March, 2023;
originally announced March 2023.
-
A Truthful Referral Auction Over Networks
Authors:
Youjia Zhang,
Pingzhong Tang
Abstract:
This paper studies a mechanism design problem over a network, where agents can only participate by referrals. The Bulow-Klemberer theorem proposes that expanding the number of participants is a more effective approach to increase revenue than modifying the auction format. However, agents lack the motivation to invite others because doing so intensifies competition among them. On the other hand, mi…
▽ More
This paper studies a mechanism design problem over a network, where agents can only participate by referrals. The Bulow-Klemberer theorem proposes that expanding the number of participants is a more effective approach to increase revenue than modifying the auction format. However, agents lack the motivation to invite others because doing so intensifies competition among them. On the other hand, misreporting social networks is also a common problem that can reduce revenue. Examples of misreporting include Sybil attacks (an agent pretending to be multiple bidders) and coalition groups (multiple agents pretending to be an agent). To address these challenges, we introduce a novel mechanism called the Truthful Referral Diffusion Mechanism (TRDM). TRDM incentivizes agents to report their social networks truthfully, and some of them are rewarded by the seller for improving revenue. In spite of the fact that some agents overbid in TRDM, the revenue is fixed, and it is higher than the revenue of any mechanism without referrals. TRDM is budget-balanced (non-negative revenue) and generates an efficient outcome (maximized social welfare), making it attractive for both the seller and the buyers as it improves revenue and reward.
△ Less
Submitted 16 March, 2023; v1 submitted 16 February, 2023;
originally announced February 2023.
-
Collusion-proof And Sybil-proof Reward Mechanisms For Query Incentive Networks
Authors:
Youjia Zhang,
Pingzhong Tang
Abstract:
This paper explores reward mechanisms for a query incentive network in which agents seek information from social networks. In a query tree issued by the task owner, each agent is rewarded by the owner for contributing to the solution, for instance, solving the task or inviting others to solve it. The reward mechanism determines the reward for each agent and motivates all agents to propagate and re…
▽ More
This paper explores reward mechanisms for a query incentive network in which agents seek information from social networks. In a query tree issued by the task owner, each agent is rewarded by the owner for contributing to the solution, for instance, solving the task or inviting others to solve it. The reward mechanism determines the reward for each agent and motivates all agents to propagate and report their information truthfully. In particular, the reward cannot exceed the budget set by the task owner. However, our impossibility results demonstrate that a reward mechanism cannot simultaneously achieve Sybil-proof (agents benefit from manipulating multiple fake identities), collusion-proof (multiple agents pretend as a single agent to improve the reward), and other essential properties. In order to address these issues, we propose two novel reward mechanisms. The first mechanism achieves Sybil-proof and collusion-proof, respectively; the second mechanism sacrifices Sybil-proof to achieve the approximate versions of Sybil-proof and collusion-proof. Additionally, we show experimentally that our second reward mechanism outperforms the existing ones.
△ Less
Submitted 12 February, 2023;
originally announced February 2023.
-
Infomaxformer: Maximum Entropy Transformer for Long Time-Series Forecasting Problem
Authors:
Peiwang Tang,
Xianchao Zhang
Abstract:
The Transformer architecture yields state-of-the-art results in many tasks such as natural language processing (NLP) and computer vision (CV), since the ability to efficiently capture the precise long-range dependency coupling between input sequences. With this advanced capability, however, the quadratic time complexity and high memory usage prevents the Transformer from dealing with long time-ser…
▽ More
The Transformer architecture yields state-of-the-art results in many tasks such as natural language processing (NLP) and computer vision (CV), since the ability to efficiently capture the precise long-range dependency coupling between input sequences. With this advanced capability, however, the quadratic time complexity and high memory usage prevents the Transformer from dealing with long time-series forecasting problem (LTFP). To address these difficulties: (i) we revisit the learned attention patterns of the vanilla self-attention, redesigned the calculation method of self-attention based the Maximum Entropy Principle. (ii) we propose a new method to sparse the self-attention, which can prevent the loss of more important self-attention scores due to random sampling.(iii) We propose Keys/Values Distilling method motivated that a large amount of feature in the original self-attention map is redundant, which can further reduce the time and spatial complexity and make it possible to input longer time-series. Finally, we propose a method that combines the encoder-decoder architecture with seasonal-trend decomposition, i.e., using the encoder-decoder architecture to capture more specific seasonal parts. A large number of experiments on several large-scale datasets show that our Infomaxformer is obviously superior to the existing methods. We expect this to open up a new solution for Transformer to solve LTFP, and exploring the ability of the Transformer architecture to capture much longer temporal dependencies.
△ Less
Submitted 4 January, 2023;
originally announced January 2023.
-
MMBench: Benchmarking End-to-End Multi-modal DNNs and Understanding Their Hardware-Software Implications
Authors:
Cheng Xu,
Xiaofeng Hou,
Jiacheng Liu,
Chao Li,
Tianhao Huang,
Xiaozhi Zhu,
Mo Niu,
Lingyu Sun,
Peng Tang,
Tongqiao Xu,
Kwang-Ting Cheng,
Minyi Guo
Abstract:
The explosive growth of various types of big data and advances in AI technologies have catalyzed a new type of workloads called multi-modal DNNs. Multi-modal DNNs are capable of interpreting and reasoning about information from multiple modalities, making them more applicable to real-world AI scenarios. In recent research, multi-modal DNNs have outperformed the best uni-modal DNN in a wide range o…
▽ More
The explosive growth of various types of big data and advances in AI technologies have catalyzed a new type of workloads called multi-modal DNNs. Multi-modal DNNs are capable of interpreting and reasoning about information from multiple modalities, making them more applicable to real-world AI scenarios. In recent research, multi-modal DNNs have outperformed the best uni-modal DNN in a wide range of distributed computing applications from traditional multimedia systems to emerging autonomous edge systems. However, despite their importance and superiority, very limited research attention has been devoted to understand the characteristics of multi-modal DNNs and their implications on current computing software/hardware platforms. Existing benchmarks either target uni-modal DNNs or only focus on the algorithm characteristics of multi-modal DNNs. There lacks representative benchmark suites that provide comprehensive system and architecture level analysis of multi-modal networks.
To advance the understanding of these multi-modal DNN workloads and facilitate related research, we present MMBench, an open-source, end-to-end benchmark suite consisting of a set of real-world multi-modal DNN workloads with relevant performance metrics for evaluation. We then use MMBench to conduct an in-depth analysis on the characteristics of multi-modal DNNs. We demonstrate their unique characteristics of clear multi-stage execution, frequent synchronization and high heterogeneity, which distinguish them from conventional uni-modal DNNs. Finally, we conduct a case study and extend our benchmark to edge devices. We hope that our work can provide insights for future software/hardware design and optimization to underpin multi-modal DNNs on both cloud and edge computing platforms.
△ Less
Submitted 28 August, 2023; v1 submitted 2 December, 2022;
originally announced December 2022.
-
An efficient algorithm for the $\ell_{p}$ norm based metric nearness problem
Authors:
Peipei Tang,
Bo Jiang,
Chengjing Wang
Abstract:
Given a dissimilarity matrix, the metric nearness problem is to find the nearest matrix of distances that satisfy the triangle inequalities. This problem has wide applications, such as sensor networks, image processing, and so on. But it is of great challenge even to obtain a moderately accurate solution due to the $O(n^{3})$ metric constraints and the nonsmooth objective function which is usually…
▽ More
Given a dissimilarity matrix, the metric nearness problem is to find the nearest matrix of distances that satisfy the triangle inequalities. This problem has wide applications, such as sensor networks, image processing, and so on. But it is of great challenge even to obtain a moderately accurate solution due to the $O(n^{3})$ metric constraints and the nonsmooth objective function which is usually a weighted $\ell_{p}$ norm based distance. In this paper, we propose a delayed constraint generation method with each subproblem solved by the semismooth Newton based proximal augmented Lagrangian method (PALM) for the metric nearness problem. Due to the high memory requirement for the storage of the matrix related to the metric constraints, we take advantage of the special structure of the matrix and do not need to store the corresponding constraint matrix. A pleasing aspect of our algorithm is that we can solve these problems involving up to $10^{8}$ variables and $10^{13}$ constraints. Numerical experiments demonstrate the efficiency of our algorithm.
In theory, firstly, under a mild condition, we establish a primal-dual error bound condition which is very essential for the analysis of local convergence rate of PALM. Secondly, we prove the equivalence between the dual nondegeneracy condition and nonsingularity of the generalized Jacobian for the inner subproblem of PALM. Thirdly, when $q(\cdot)=\|\cdot\|_{1}$ or $\|\cdot\|_{\infty}$, without the strict complementarity condition, we also prove the equivalence between the the dual nondegeneracy condition and the uniqueness of the primal solution.
△ Less
Submitted 2 November, 2022;
originally announced November 2022.
-
CPS-MEBR: Click Feedback-Aware Web Page Summarization for Multi-Embedding-Based Retrieval
Authors:
Wenbiao Li,
Pan Tang,
Zhengfan Wu,
Weixue Lu,
Minghua Zhang,
Zhenlei Tian,
Daiting Shi,
Yu Sun,
Simiu Gu,
Dawei Yin
Abstract:
Embedding-based retrieval (EBR) is a technique to use embeddings to represent query and document, and then convert the retrieval problem into a nearest neighbor search problem in the embedding space. Some previous works have mainly focused on representing the web page with a single embedding, but in real web search scenarios, it is difficult to represent all the information of a long and complex s…
▽ More
Embedding-based retrieval (EBR) is a technique to use embeddings to represent query and document, and then convert the retrieval problem into a nearest neighbor search problem in the embedding space. Some previous works have mainly focused on representing the web page with a single embedding, but in real web search scenarios, it is difficult to represent all the information of a long and complex structured web page as a single embedding. To address this issue, we design a click feedback-aware web page summarization for multi-embedding-based retrieval (CPS-MEBR) framework which is able to generate multiple embeddings for web pages to match different potential queries. Specifically, we use the click data of users in search logs to train a summary model to extract those sentences in web pages that are frequently clicked by users, which are more likely to answer those potential queries. Meanwhile, we introduce sentence-level semantic interaction to design a multi-embedding-based retrieval (MEBR) model, which can generate multiple embeddings to deal with different potential queries by using frequently clicked sentences in web pages. Offline experiments show that it can perform high quality candidate retrieval compared to single-embedding-based retrieval (SEBR) model.
△ Less
Submitted 7 May, 2023; v1 submitted 18 October, 2022;
originally announced October 2022.
-
TLDW: Extreme Multimodal Summarisation of News Videos
Authors:
Peggy Tang,
Kun Hu,
Lei Zhang,
Jiebo Luo,
Zhiyong Wang
Abstract:
Multimodal summarisation with multimodal output is drawing increasing attention due to the rapid growth of multimedia data. While several methods have been proposed to summarise visual-text contents, their multimodal outputs are not succinct enough at an extreme level to address the information overload issue. To the end of extreme multimodal summarisation, we introduce a new task, eXtreme Multimo…
▽ More
Multimodal summarisation with multimodal output is drawing increasing attention due to the rapid growth of multimedia data. While several methods have been proposed to summarise visual-text contents, their multimodal outputs are not succinct enough at an extreme level to address the information overload issue. To the end of extreme multimodal summarisation, we introduce a new task, eXtreme Multimodal Summarisation with Multimodal Output (XMSMO) for the scenario of TL;DW - Too Long; Didn't Watch, akin to TL;DR. XMSMO aims to summarise a video-document pair into a summary with an extremely short length, which consists of one cover frame as the visual summary and one sentence as the textual summary. We propose a novel unsupervised Hierarchical Optimal Transport Network (HOT-Net) consisting of three components: hierarchical multimodal encoders, hierarchical multimodal fusion decoders, and optimal transport solvers. Our method is trained, without using reference summaries, by optimising the visual and textual coverage from the perspectives of the distance between the semantic distributions under optimal transport plans. To facilitate the study on this task, we collect a large-scale dataset XMSMO-News by harvesting 4,891 video-document pairs. The experimental results show that our method achieves promising performance in terms of ROUGE and IoU metrics.
△ Less
Submitted 16 October, 2022;
originally announced October 2022.
-
Vulnerabilities of Single-Round Incentive Compatibility in Auto-bidding: Theory and Evidence from ROI-Constrained Online Advertising Markets
Authors:
Juncheng Li,
Pingzhong Tang
Abstract:
Most of the work in the auction design literature assumes that bidders behave rationally based on the information available for every individual auction, and the revelation principle enables designers to restrict their efforts to incentive compatible (IC) mechanisms. However, in today's online advertising markets, one of the most important real-life applications of auction design, the data and com…
▽ More
Most of the work in the auction design literature assumes that bidders behave rationally based on the information available for every individual auction, and the revelation principle enables designers to restrict their efforts to incentive compatible (IC) mechanisms. However, in today's online advertising markets, one of the most important real-life applications of auction design, the data and computational power required to bid optimally are only available to the platform, and an advertiser can only participate by setting performance objectives and constraints for its proxy auto-bidder provided by the platform. The prevalence of auto-bidding necessitates a review of auction theory. In this paper, we examine the markets through the lens of ROI-constrained value-maximizing campaigns. We show that second price auction exhibits many undesirable properties (computational hardness, non-monotonicity, instability of bidders' utilities, and interference in A/B testing) and loses its dominant theoretical advantages in single-item scenarios. In addition, we make it clear how IC and its runner-up-winner interdependence contribute to each property. We hope that our work could bring new perspectives to the community and benefit practitioners to attain a better grasp of real-world markets.
△ Less
Submitted 11 May, 2024; v1 submitted 12 October, 2022;
originally announced October 2022.
-
MTSMAE: Masked Autoencoders for Multivariate Time-Series Forecasting
Authors:
Peiwang Tang,
Xianchao Zhang
Abstract:
Large-scale self-supervised pre-training Transformer architecture have significantly boosted the performance for various tasks in natural language processing (NLP) and computer vision (CV). However, there is a lack of researches on processing multivariate time-series by pre-trained Transformer, and especially, current study on masking time-series for self-supervised learning is still a gap. Differ…
▽ More
Large-scale self-supervised pre-training Transformer architecture have significantly boosted the performance for various tasks in natural language processing (NLP) and computer vision (CV). However, there is a lack of researches on processing multivariate time-series by pre-trained Transformer, and especially, current study on masking time-series for self-supervised learning is still a gap. Different from language and image processing, the information density of time-series increases the difficulty of research. The challenge goes further with the invalidity of the previous patch embedding and mask methods. In this paper, according to the data characteristics of multivariate time-series, a patch embedding method is proposed, and we present an self-supervised pre-training approach based on Masked Autoencoders (MAE), called MTSMAE, which can improve the performance significantly over supervised learning without pre-training. Evaluating our method on several common multivariate time-series datasets from different fields and with different characteristics, experiment results demonstrate that the performance of our method is significantly better than the best method currently available.
△ Less
Submitted 3 October, 2022;
originally announced October 2022.
-
Fuzzy Attention Neural Network to Tackle Discontinuity in Airway Segmentation
Authors:
Yang Nan,
Javier Del Ser,
Zeyu Tang,
Peng Tang,
Xiaodan Xing,
Yingying Fang,
Francisco Herrera,
Witold Pedrycz,
Simon Walsh,
Guang Yang
Abstract:
Airway segmentation is crucial for the examination, diagnosis, and prognosis of lung diseases, while its manual delineation is unduly burdensome. To alleviate this time-consuming and potentially subjective manual procedure, researchers have proposed methods to automatically segment airways from computerized tomography (CT) images. However, some small-sized airway branches (e.g., bronchus and termi…
▽ More
Airway segmentation is crucial for the examination, diagnosis, and prognosis of lung diseases, while its manual delineation is unduly burdensome. To alleviate this time-consuming and potentially subjective manual procedure, researchers have proposed methods to automatically segment airways from computerized tomography (CT) images. However, some small-sized airway branches (e.g., bronchus and terminal bronchioles) significantly aggravate the difficulty of automatic segmentation by machine learning models. In particular, the variance of voxel values and the severe data imbalance in airway branches make the computational module prone to discontinuous and false-negative predictions. especially for cohorts with different lung diseases. Attention mechanism has shown the capacity to segment complex structures, while fuzzy logic can reduce the uncertainty in feature representations. Therefore, the integration of deep attention networks and fuzzy theory, given by the fuzzy attention layer, should be an escalated solution for better generalization and robustness. This paper presents an efficient method for airway segmentation, comprising a novel fuzzy attention neural network and a comprehensive loss function to enhance the spatial continuity of airway segmentation. The deep fuzzy set is formulated by a set of voxels in the feature map and a learnable Gaussian membership function. Different from the existing attention mechanism, the proposed channel-specific fuzzy attention addresses the issue of heterogeneous features in different channels. Furthermore, a novel evaluation metric is proposed to assess both the continuity and completeness of airway structures. The efficiency, generalization and robustness of the proposed method have been proved by training on normal lung disease while testing on datasets of lung cancer, COVID-19 and pulmonary fibrosis.
△ Less
Submitted 9 September, 2022; v1 submitted 5 September, 2022;
originally announced September 2022.
-
Features Fusion Framework for Multimodal Irregular Time-series Events
Authors:
Peiwang Tang,
Xianchao Zhang
Abstract:
Some data from multiple sources can be modeled as multimodal time-series events which have different sampling frequencies, data compositions, temporal relations and characteristics. Different types of events have complex nonlinear relationships, and the time of each event is irregular. Neither the classical Recurrent Neural Network (RNN) model nor the current state-of-the-art Transformer model can…
▽ More
Some data from multiple sources can be modeled as multimodal time-series events which have different sampling frequencies, data compositions, temporal relations and characteristics. Different types of events have complex nonlinear relationships, and the time of each event is irregular. Neither the classical Recurrent Neural Network (RNN) model nor the current state-of-the-art Transformer model can deal with these features well. In this paper, a features fusion framework for multimodal irregular time-series events is proposed based on the Long Short-Term Memory networks (LSTM). Firstly, the complex features are extracted according to the irregular patterns of different events. Secondly, the nonlinear correlation and complex temporal dependencies relationship between complex features are captured and fused into a tensor. Finally, a feature gate are used to control the access frequency of different tensors.
Extensive experiments on MIMIC-III dataset demonstrate that the proposed framework significantly outperforms to the existing methods in terms of AUC (the area under Receiver Operating Characteristic curve) and AP (Average Precision).
△ Less
Submitted 4 September, 2022;
originally announced September 2022.
-
Unsupervised Tissue Segmentation via Deep Constrained Gaussian Network
Authors:
Yang Nan,
Peng Tang,
Guyue Zhang,
Caihong Zeng,
Zhihong Liu,
Zhifan Gao,
Heye Zhang,
Guang Yang
Abstract:
Tissue segmentation is the mainstay of pathological examination, whereas the manual delineation is unduly burdensome. To assist this time-consuming and subjective manual step, researchers have devised methods to automatically segment structures in pathological images. Recently, automated machine and deep learning based methods dominate tissue segmentation research studies. However, most machine an…
▽ More
Tissue segmentation is the mainstay of pathological examination, whereas the manual delineation is unduly burdensome. To assist this time-consuming and subjective manual step, researchers have devised methods to automatically segment structures in pathological images. Recently, automated machine and deep learning based methods dominate tissue segmentation research studies. However, most machine and deep learning based approaches are supervised and developed using a large number of training samples, in which the pixelwise annotations are expensive and sometimes can be impossible to obtain. This paper introduces a novel unsupervised learning paradigm by integrating an end-to-end deep mixture model with a constrained indicator to acquire accurate semantic tissue segmentation. This constraint aims to centralise the components of deep mixture models during the calculation of the optimisation function. In so doing, the redundant or empty class issues, which are common in current unsupervised learning methods, can be greatly reduced. By validation on both public and in-house datasets, the proposed deep constrained Gaussian network achieves significantly (Wilcoxon signed-rank test) better performance (with the average Dice scores of 0.737 and 0.735, respectively) on tissue segmentation with improved stability and robustness, compared to other existing unsupervised segmentation approaches. Furthermore, the proposed method presents a similar performance (p-value > 0.05) compared to the fully supervised U-Net.
△ Less
Submitted 4 August, 2022;
originally announced August 2022.
-
Sharing Construction Safety Inspection Experiences and Site-Specific Knowledge through XR-Augmented Visual Assistance
Authors:
Pengkun Liu,
Jinding Xing,
Ruoxin Xiong,
Pingbo Tang
Abstract:
Early identification of on-site hazards is crucial for accident prevention in the construction industry. Currently, the construction industry relies on experienced safety advisors (SAs) to identify site hazards and generate mitigation measures to guide field workers. However, more than half of the site hazards remain unrecognized due to the lack of field experience or site-specific knowledge of so…
▽ More
Early identification of on-site hazards is crucial for accident prevention in the construction industry. Currently, the construction industry relies on experienced safety advisors (SAs) to identify site hazards and generate mitigation measures to guide field workers. However, more than half of the site hazards remain unrecognized due to the lack of field experience or site-specific knowledge of some SAs. To address these limitations, this study proposed an Extended Reality (XR)-augmented visual assistance framework, including Virtual Reality (VR) and Augmented Reality (AR), that enables capturing and transferring subconscious inspection strategies between workers or workers/machines for a construction safety inspection. The purpose is to enhance SA's training and real-time situational awareness for identifying on-site hazards while reducing their mental workloads.
△ Less
Submitted 31 May, 2022;
originally announced May 2022.
-
Mining Observation and Cognitive Behavior Process Patterns of Bridge Inspector
Authors:
Pengkun Liu,
Ruoxin Xiong,
Pingbo Tang
Abstract:
In bridge inspection, engineers should diagnose the observed bridge defects by identifying the factors underlying those defects. Traditionally, engineers search and organize structural condition-related information based on visual inspections. Even following the same qualitative inspection standards, experienced engineers tend to find the critical defects and predict the underlying reasons more re…
▽ More
In bridge inspection, engineers should diagnose the observed bridge defects by identifying the factors underlying those defects. Traditionally, engineers search and organize structural condition-related information based on visual inspections. Even following the same qualitative inspection standards, experienced engineers tend to find the critical defects and predict the underlying reasons more reliably than less experienced ones. Unique bridge and site conditions, quality of available data, and personal skills and knowledge collectively influence such a subjective nature of data-driven bridge diagnosis. Unfortunately, the lack of detailed data about how experienced engineers observe bridge defects and identify failure modes makes it hard to comprehend what engineers' behaviors form the best practice of producing reliable bridge inspection. Besides, even experienced engineers could sometimes fail to notice critical defects, thereby producing inconsistent, conflicting condition assessments. Therefore, a detailed cognitive behavior analysis of bridge inspectors is critical for enabling a proactive inspector coaching system that uses inspectors' behavior histories to complement personal limitations. This paper presents a computational framework for revealing engineers' observation and cognitive-behavioral processes to identify bridge defects and produce diagnosis conclusions. The authors designed a bridge inspection game consisting of FEM simulation data and inspection reports to capture and analyze experienced and inexperienced engineers' diagnosis behaviors. Mining these behavioral logs have revealed reusable behavioral process patterns that map critical bridge defects and diagnosis conclusions. The results indicate that the proposed method can proactively share inspection experiences and improve inspection processes' explainability and reliability.
△ Less
Submitted 18 May, 2022;
originally announced May 2022.
-
Optimal Anonymous Independent Reward Scheme Design
Authors:
Mengjing Chen,
Pingzhong Tang,
Zihe Wang,
Shenke Xiao,
Xiwang Yang
Abstract:
We consider designing reward schemes that incentivize agents to create high-quality content (e.g., videos, images, text, ideas). The problem is at the center of a real-world application where the goal is to optimize the overall quality of generated content on user-generated content platforms. We focus on anonymous independent reward schemes (AIRS) that only take the quality of an agent's content a…
▽ More
We consider designing reward schemes that incentivize agents to create high-quality content (e.g., videos, images, text, ideas). The problem is at the center of a real-world application where the goal is to optimize the overall quality of generated content on user-generated content platforms. We focus on anonymous independent reward schemes (AIRS) that only take the quality of an agent's content as input. We prove the general problem is NP-hard. If the cost function is convex, we show the optimal AIRS can be formulated as a convex optimization problem and propose an efficient algorithm to solve it. Next, we explore the optimal linear reward scheme and prove it has a 1/2-approximation ratio, and the ratio is tight. Lastly, we show the proportional scheme can be arbitrarily bad compared to AIRS.
△ Less
Submitted 30 April, 2022;
originally announced May 2022.