-
Role-RL: Online Long-Context Processing with Role Reinforcement Learning for Distinct LLMs in Their Optimal Roles
Authors:
Lewei He,
Tianyu Shi,
Pengran Huang,
Bingzhi Chen,
Qianglong Chen,
Jiahui Pan
Abstract:
Large language models (LLMs) with long-context processing are still challenging because of their implementation complexity, training efficiency and data sparsity. To address this issue, a new paradigm named Online Long-context Processing (OLP) is proposed when we process a document of unlimited length, which typically occurs in the information reception and organization of diverse streaming media…
▽ More
Large language models (LLMs) with long-context processing are still challenging because of their implementation complexity, training efficiency and data sparsity. To address this issue, a new paradigm named Online Long-context Processing (OLP) is proposed when we process a document of unlimited length, which typically occurs in the information reception and organization of diverse streaming media such as automated news reporting, live e-commerce, and viral short videos. Moreover, a dilemma was often encountered when we tried to select the most suitable LLM from a large number of LLMs amidst explosive growth aiming for outstanding performance, affordable prices, and short response delays. In view of this, we also develop Role Reinforcement Learning (Role-RL) to automatically deploy different LLMs in their respective roles within the OLP pipeline according to their actual performance. Extensive experiments are conducted on our OLP-MINI dataset and it is found that OLP with Role-RL framework achieves OLP benchmark with an average recall rate of 93.2% and the LLM cost saved by 79.4%. The code and dataset are publicly available at: https://anonymous.4open.science/r/Role-RL.
△ Less
Submitted 26 September, 2024;
originally announced September 2024.
-
A QoE-Aware Split Inference Accelerating Algorithm for NOMA-based Edge Intelligence
Authors:
Xin Yuan,
Ning Li,
Quan Chen,
Wenchao Xu,
Zhaoxin Zhang,
Song Guo
Abstract:
Even the AI has been widely used and significantly changed our life, deploying the large AI models on resource limited edge devices directly is not appropriate. Thus, the model split inference is proposed to improve the performance of edge intelligence, in which the AI model is divided into different sub models and the resource-intensive sub model is offloaded to edge server wirelessly for reducin…
▽ More
Even the AI has been widely used and significantly changed our life, deploying the large AI models on resource limited edge devices directly is not appropriate. Thus, the model split inference is proposed to improve the performance of edge intelligence, in which the AI model is divided into different sub models and the resource-intensive sub model is offloaded to edge server wirelessly for reducing resource requirements and inference latency. However, the previous works mainly concentrate on improving and optimizing the system QoS, ignore the effect of QoE which is another critical item for the users except for QoS. Even the QoE has been widely learned in EC, considering the differences between task offloading in EC and split inference in EI, and the specific issues in QoE which are still not addressed in EC and EI, these algorithms cannot work effectively in edge split inference scenarios. Thus, an effective resource allocation algorithm is proposed in this paper, for accelerating split inference in EI and achieving the tradeoff between inference delay, QoE, and resource consumption, abbreviated as ERA. Specifically, the ERA takes the resource consumption, QoE, and inference latency into account to find the optimal model split strategy and resource allocation strategy. Since the minimum inference delay and resource consumption, and maximum QoE cannot be satisfied simultaneously, the gradient descent based algorithm is adopted to find the optimal tradeoff between them. Moreover, the loop iteration GD approach is developed to reduce the complexity of the GD algorithm caused by parameter discretization. Additionally, the properties of the proposed algorithms are investigated, including convergence, complexity, and approximation error. The experimental results demonstrate that the performance of ERA is much better than that of the previous studies.
△ Less
Submitted 24 September, 2024;
originally announced September 2024.
-
Autotuning Bipedal Locomotion MPC with GRFM-Net for Efficient Sim-to-Real Transfer
Authors:
Qianzhong Chen,
Junheng Li,
Sheng Cheng,
Naira Hovakimyan,
Quan Nguyen
Abstract:
Bipedal locomotion control is essential for humanoid robots to navigate complex, human-centric environments. While optimization-based control designs are popular for integrating sophisticated models of humanoid robots, they often require labor-intensive manual tuning. In this work, we address the challenges of parameter selection in bipedal locomotion control using DiffTune, a model-based autotuni…
▽ More
Bipedal locomotion control is essential for humanoid robots to navigate complex, human-centric environments. While optimization-based control designs are popular for integrating sophisticated models of humanoid robots, they often require labor-intensive manual tuning. In this work, we address the challenges of parameter selection in bipedal locomotion control using DiffTune, a model-based autotuning method that leverages differential programming for efficient parameter learning. A major difficulty lies in balancing model fidelity with differentiability. We address this difficulty using a low-fidelity model for differentiability, enhanced by a Ground Reaction Force-and-Moment Network (GRFM-Net) to capture discrepancies between MPC commands and actual control effects. We validate the parameters learned by DiffTune with GRFM-Net in hardware experiments, which demonstrates the parameters' optimality in a multi-objective setting compared with baseline parameters, reducing the total loss by up to 40.5$\%$ compared with the expert-tuned parameters. The results confirm the GRFM-Net's effectiveness in mitigating the sim-to-real gap, improving the transferability of simulation-learned parameters to real hardware.
△ Less
Submitted 23 September, 2024;
originally announced September 2024.
-
PolicyCraft: Supporting Collaborative and Participatory Policy Design through Case-Grounded Deliberation
Authors:
Tzu-Sheng Kuo,
Quan Ze Chen,
Amy X. Zhang,
Jane Hsieh,
Haiyi Zhu,
Kenneth Holstein
Abstract:
Community and organizational policies are typically designed in a top-down, centralized fashion, with limited input from impacted stakeholders. This can result in policies that are misaligned with community needs or perceived as illegitimate. How can we support more collaborative, participatory approaches to policy design? In this paper, we present PolicyCraft, a system that structures collaborati…
▽ More
Community and organizational policies are typically designed in a top-down, centralized fashion, with limited input from impacted stakeholders. This can result in policies that are misaligned with community needs or perceived as illegitimate. How can we support more collaborative, participatory approaches to policy design? In this paper, we present PolicyCraft, a system that structures collaborative policy design through case-grounded deliberation. Building on past research that highlights the value of concrete cases in establishing common ground, PolicyCraft supports users in collaboratively proposing, critiquing, and revising policies through discussion and voting on cases. A field study across two university courses showed that students using PolicyCraft reached greater consensus and developed better-supported course policies, compared with those using a baseline system that did not scaffold their use of concrete cases. Reflecting on our findings, we discuss opportunities for future HCI systems to help groups more effectively bridge between abstract policies and concrete cases.
△ Less
Submitted 23 September, 2024;
originally announced September 2024.
-
FACET: Fast and Accurate Event-Based Eye Tracking Using Ellipse Modeling for Extended Reality
Authors:
Junyuan Ding,
Ziteng Wang,
Chang Gao,
Min Liu,
Qinyu Chen
Abstract:
Eye tracking is a key technology for gaze-based interactions in Extended Reality (XR), but traditional frame-based systems struggle to meet XR's demands for high accuracy, low latency, and power efficiency. Event cameras offer a promising alternative due to their high temporal resolution and low power consumption. In this paper, we present FACET (Fast and Accurate Event-based Eye Tracking), an end…
▽ More
Eye tracking is a key technology for gaze-based interactions in Extended Reality (XR), but traditional frame-based systems struggle to meet XR's demands for high accuracy, low latency, and power efficiency. Event cameras offer a promising alternative due to their high temporal resolution and low power consumption. In this paper, we present FACET (Fast and Accurate Event-based Eye Tracking), an end-to-end neural network that directly outputs pupil ellipse parameters from event data, optimized for real-time XR applications. The ellipse output can be directly used in subsequent ellipse-based pupil trackers. We enhance the EV-Eye dataset by expanding annotated data and converting original mask labels to ellipse-based annotations to train the model. Besides, a novel trigonometric loss is adopted to address angle discontinuities and a fast causal event volume event representation method is put forward. On the enhanced EV-Eye test set, FACET achieves an average pupil center error of 0.20 pixels and an inference time of 0.53 ms, reducing pixel error and inference time by 1.6$\times$ and 1.8$\times$ compared to the prior art, EV-Eye, with 4.4$\times$ and 11.7$\times$ less parameters and arithmetic operations. The code is available at https://github.com/DeanJY/FACET.
△ Less
Submitted 23 September, 2024;
originally announced September 2024.
-
Towards Accountable AI-Assisted Eye Disease Diagnosis: Workflow Design, External Validation, and Continual Learning
Authors:
Qingyu Chen,
Tiarnan D L Keenan,
Elvira Agron,
Alexis Allot,
Emily Guan,
Bryant Duong,
Amr Elsawy,
Benjamin Hou,
Cancan Xue,
Sanjeeb Bhandari,
Geoffrey Broadhead,
Chantal Cousineau-Krieger,
Ellen Davis,
William G Gensheimer,
David Grasic,
Seema Gupta,
Luis Haddock,
Eleni Konstantinou,
Tania Lamba,
Michele Maiberger,
Dimosthenis Mantopoulos,
Mitul C Mehta,
Ayman G Nahri,
Mutaz AL-Nawaflh,
Arnold Oshinsky
, et al. (13 additional authors not shown)
Abstract:
Timely disease diagnosis is challenging due to increasing disease burdens and limited clinician availability. AI shows promise in diagnosis accuracy but faces real-world application issues due to insufficient validation in clinical workflows and diverse populations. This study addresses gaps in medical AI downstream accountability through a case study on age-related macular degeneration (AMD) diag…
▽ More
Timely disease diagnosis is challenging due to increasing disease burdens and limited clinician availability. AI shows promise in diagnosis accuracy but faces real-world application issues due to insufficient validation in clinical workflows and diverse populations. This study addresses gaps in medical AI downstream accountability through a case study on age-related macular degeneration (AMD) diagnosis and severity classification. We designed and implemented an AI-assisted diagnostic workflow for AMD, comparing diagnostic performance with and without AI assistance among 24 clinicians from 12 institutions with real patient data sampled from the Age-Related Eye Disease Study (AREDS). Additionally, we demonstrated continual enhancement of an existing AI model by incorporating approximately 40,000 additional medical images (named AREDS2 dataset). The improved model was then systematically evaluated using both AREDS and AREDS2 test sets, as well as an external test set from Singapore. AI assistance markedly enhanced diagnostic accuracy and classification for 23 out of 24 clinicians, with the average F1-score increasing by 20% from 37.71 (Manual) to 45.52 (Manual + AI) (P-value < 0.0001), achieving an improvement of over 50% in some cases. In terms of efficiency, AI assistance reduced diagnostic times for 17 out of the 19 clinicians tracked, with time savings of up to 40%. Furthermore, a model equipped with continual learning showed robust performance across three independent datasets, recording a 29% increase in accuracy, and elevating the F1-score from 42 to 54 in the Singapore population.
△ Less
Submitted 23 September, 2024;
originally announced September 2024.
-
Towards Ground-truth-free Evaluation of Any Segmentation in Medical Images
Authors:
Ahjol Senbi,
Tianyu Huang,
Fei Lyu,
Qing Li,
Yuhui Tao,
Wei Shao,
Qiang Chen,
Chengyan Wang,
Shuo Wang,
Tao Zhou,
Yizhe Zhang
Abstract:
We explore the feasibility and potential of building a ground-truth-free evaluation model to assess the quality of segmentations generated by the Segment Anything Model (SAM) and its variants in medical imaging. This evaluation model estimates segmentation quality scores by analyzing the coherence and consistency between the input images and their corresponding segmentation predictions. Based on p…
▽ More
We explore the feasibility and potential of building a ground-truth-free evaluation model to assess the quality of segmentations generated by the Segment Anything Model (SAM) and its variants in medical imaging. This evaluation model estimates segmentation quality scores by analyzing the coherence and consistency between the input images and their corresponding segmentation predictions. Based on prior research, we frame the task of training this model as a regression problem within a supervised learning framework, using Dice scores (and optionally other metrics) along with mean squared error to compute the training loss. The model is trained utilizing a large collection of public datasets of medical images with segmentation predictions from SAM and its variants. We name this model EvanySeg (Evaluation of Any Segmentation in Medical Images). Our exploration of convolution-based models (e.g., ResNet) and transformer-based models (e.g., ViT) suggested that ViT yields better performance for this task. EvanySeg can be employed for various tasks, including: (1) identifying poorly segmented samples by detecting low-percentile segmentation quality scores; (2) benchmarking segmentation models without ground truth by averaging quality scores across test samples; (3) alerting human experts to poor-quality segmentation predictions during human-AI collaboration by applying a threshold within the score space; and (4) selecting the best segmentation prediction for each test sample at test time when multiple segmentation models are available, by choosing the prediction with the highest quality score. Models and code will be made available at https://github.com/ahjolsenbics/EvanySeg.
△ Less
Submitted 24 September, 2024; v1 submitted 23 September, 2024;
originally announced September 2024.
-
Federated Graph Learning with Adaptive Importance-based Sampling
Authors:
Anran Li,
Yuanyuan Chen,
Chao Ren,
Wenhan Wang,
Ming Hu,
Tianlin Li,
Han Yu,
Qingyu Chen
Abstract:
For privacy-preserving graph learning tasks involving distributed graph datasets, federated learning (FL)-based GCN (FedGCN) training is required. A key challenge for FedGCN is scaling to large-scale graphs, which typically incurs high computation and communication costs when dealing with the explosively increasing number of neighbors. Existing graph sampling-enhanced FedGCN training approaches ig…
▽ More
For privacy-preserving graph learning tasks involving distributed graph datasets, federated learning (FL)-based GCN (FedGCN) training is required. A key challenge for FedGCN is scaling to large-scale graphs, which typically incurs high computation and communication costs when dealing with the explosively increasing number of neighbors. Existing graph sampling-enhanced FedGCN training approaches ignore graph structural information or dynamics of optimization, resulting in high variance and inaccurate node embeddings. To address this limitation, we propose the Federated Adaptive Importance-based Sampling (FedAIS) approach. It achieves substantial computational cost saving by focusing the limited resources on training important nodes, while reducing communication overhead via adaptive historical embedding synchronization. The proposed adaptive importance-based sampling method jointly considers the graph structural heterogeneity and the optimization dynamics to achieve optimal trade-off between efficiency and accuracy. Extensive evaluations against five state-of-the-art baselines on five real-world graph datasets show that FedAIS achieves comparable or up to 3.23% higher test accuracy, while saving communication and computation costs by 91.77% and 85.59%.
△ Less
Submitted 22 September, 2024;
originally announced September 2024.
-
Accelerated Multi-Contrast MRI Reconstruction via Frequency and Spatial Mutual Learning
Authors:
Qi Chen,
Xiaohan Xing,
Zhen Chen,
Zhiwei Xiong
Abstract:
To accelerate Magnetic Resonance (MR) imaging procedures, Multi-Contrast MR Reconstruction (MCMR) has become a prevalent trend that utilizes an easily obtainable modality as an auxiliary to support high-quality reconstruction of the target modality with under-sampled k-space measurements. The exploration of global dependency and complementary information across different modalities is essential fo…
▽ More
To accelerate Magnetic Resonance (MR) imaging procedures, Multi-Contrast MR Reconstruction (MCMR) has become a prevalent trend that utilizes an easily obtainable modality as an auxiliary to support high-quality reconstruction of the target modality with under-sampled k-space measurements. The exploration of global dependency and complementary information across different modalities is essential for MCMR. However, existing methods either struggle to capture global dependency due to the limited receptive field or suffer from quadratic computational complexity. To tackle this dilemma, we propose a novel Frequency and Spatial Mutual Learning Network (FSMNet), which efficiently explores global dependencies across different modalities. Specifically, the features for each modality are extracted by the Frequency-Spatial Feature Extraction (FSFE) module, featuring a frequency branch and a spatial branch. Benefiting from the global property of the Fourier transform, the frequency branch can efficiently capture global dependency with an image-size receptive field, while the spatial branch can extract local features. To exploit complementary information from the auxiliary modality, we propose a Cross-Modal Selective fusion (CMS-fusion) module that selectively incorporate the frequency and spatial features from the auxiliary modality to enhance the corresponding branch of the target modality. To further integrate the enhanced global features from the frequency branch and the enhanced local features from the spatial branch, we develop a Frequency-Spatial fusion (FS-fusion) module, resulting in a comprehensive feature representation for the target modality. Extensive experiments on the BraTS and fastMRI datasets demonstrate that the proposed FSMNet achieves state-of-the-art performance for the MCMR task with different acceleration factors. The code is available at: https://github.com/qic999/FSMNet.
△ Less
Submitted 21 September, 2024;
originally announced September 2024.
-
Enhancing Large Language Models with Domain-specific Retrieval Augment Generation: A Case Study on Long-form Consumer Health Question Answering in Ophthalmology
Authors:
Aidan Gilson,
Xuguang Ai,
Thilaka Arunachalam,
Ziyou Chen,
Ki Xiong Cheong,
Amisha Dave,
Cameron Duic,
Mercy Kibe,
Annette Kaminaka,
Minali Prasad,
Fares Siddig,
Maxwell Singer,
Wendy Wong,
Qiao Jin,
Tiarnan D. L. Keenan,
Xia Hu,
Emily Y. Chew,
Zhiyong Lu,
Hua Xu,
Ron A. Adelman,
Yih-Chung Tham,
Qingyu Chen
Abstract:
Despite the potential of Large Language Models (LLMs) in medicine, they may generate responses lacking supporting evidence or based on hallucinated evidence. While Retrieval Augment Generation (RAG) is popular to address this issue, few studies implemented and evaluated RAG in downstream domain-specific applications. We developed a RAG pipeline with 70,000 ophthalmology-specific documents that ret…
▽ More
Despite the potential of Large Language Models (LLMs) in medicine, they may generate responses lacking supporting evidence or based on hallucinated evidence. While Retrieval Augment Generation (RAG) is popular to address this issue, few studies implemented and evaluated RAG in downstream domain-specific applications. We developed a RAG pipeline with 70,000 ophthalmology-specific documents that retrieve relevant documents to augment LLMs during inference time. In a case study on long-form consumer health questions, we systematically evaluated the responses including over 500 references of LLMs with and without RAG on 100 questions with 10 healthcare professionals. The evaluation focuses on factuality of evidence, selection and ranking of evidence, attribution of evidence, and answer accuracy and completeness. LLMs without RAG provided 252 references in total. Of which, 45.3% hallucinated, 34.1% consisted of minor errors, and 20.6% were correct. In contrast, LLMs with RAG significantly improved accuracy (54.5% being correct) and reduced error rates (18.8% with minor hallucinations and 26.7% with errors). 62.5% of the top 10 documents retrieved by RAG were selected as the top references in the LLM response, with an average ranking of 4.9. The use of RAG also improved evidence attribution (increasing from 1.85 to 2.49 on a 5-point scale, P<0.001), albeit with slight decreases in accuracy (from 3.52 to 3.23, P=0.03) and completeness (from 3.47 to 3.27, P=0.17). The results demonstrate that LLMs frequently exhibited hallucinated and erroneous evidence in the responses, raising concerns for downstream applications in the medical domain. RAG substantially reduced the proportion of such evidence but encountered challenges.
△ Less
Submitted 20 September, 2024;
originally announced September 2024.
-
FullAnno: A Data Engine for Enhancing Image Comprehension of MLLMs
Authors:
Jing Hao,
Yuxiang Zhao,
Song Chen,
Yanpeng Sun,
Qiang Chen,
Gang Zhang,
Kun Yao,
Errui Ding,
Jingdong Wang
Abstract:
Multimodal Large Language Models (MLLMs) have shown promise in a broad range of vision-language tasks with their strong reasoning and generalization capabilities. However, they heavily depend on high-quality data in the Supervised Fine-Tuning (SFT) phase. The existing approaches aim to curate high-quality data via GPT-4V, but they are not scalable due to the commercial nature of GPT-4V and the sim…
▽ More
Multimodal Large Language Models (MLLMs) have shown promise in a broad range of vision-language tasks with their strong reasoning and generalization capabilities. However, they heavily depend on high-quality data in the Supervised Fine-Tuning (SFT) phase. The existing approaches aim to curate high-quality data via GPT-4V, but they are not scalable due to the commercial nature of GPT-4V and the simplicity of the prompts used to instruct the model. To this end, we devised the FullAnno system, which is a data engine that can generate large-scale, high-quality, and fine-grained image annotations consisting of the category and position of objects, region descriptions, text information, as well as image dense captions. This engine is characterized by its cascade annotation process, which involves multiple expert models and employs rich prompts to instruct LLMs in generating dense image captions. We re-annotated the COCO and Visual Genome datasets using our FullAnno system, tripling the number of object annotations and increasing the length of the original image captions by a factor of 15. Experiments show that the regenerated annotation can significantly enhance the capabilities of LLaVA-v1.5 on several benchmarks. The re-annotated data are available at: https://arcana-project-page.github.io
△ Less
Submitted 20 September, 2024;
originally announced September 2024.
-
Enhancing Semi-Supervised Learning via Representative and Diverse Sample Selection
Authors:
Qian Shao,
Jiangrui Kang,
Qiyuan Chen,
Zepeng Li,
Hongxia Xu,
Yiwen Cao,
Jiajuan Liang,
Jian Wu
Abstract:
Semi-Supervised Learning (SSL) has become a preferred paradigm in many deep learning tasks, which reduces the need for human labor. Previous studies primarily focus on effectively utilising the labelled and unlabeled data to improve performance. However, we observe that how to select samples for labelling also significantly impacts performance, particularly under extremely low-budget settings. The…
▽ More
Semi-Supervised Learning (SSL) has become a preferred paradigm in many deep learning tasks, which reduces the need for human labor. Previous studies primarily focus on effectively utilising the labelled and unlabeled data to improve performance. However, we observe that how to select samples for labelling also significantly impacts performance, particularly under extremely low-budget settings. The sample selection task in SSL has been under-explored for a long time. To fill in this gap, we propose a Representative and Diverse Sample Selection approach (RDSS). By adopting a modified Frank-Wolfe algorithm to minimise a novel criterion $α$-Maximum Mean Discrepancy ($α$-MMD), RDSS samples a representative and diverse subset for annotation from the unlabeled data. We demonstrate that minimizing $α$-MMD enhances the generalization ability of low-budget learning. Experimental results show that RDSS consistently improves the performance of several popular SSL frameworks and outperforms the state-of-the-art sample selection approaches used in Active Learning (AL) and Semi-Supervised Active Learning (SSAL), even with constrained annotation budgets.
△ Less
Submitted 17 September, 2024;
originally announced September 2024.
-
Federated Learning with Integrated Sensing, Communication, and Computation: Frameworks and Performance Analysis
Authors:
Yipeng Liang,
Qimei Chen,
Hao Jiang
Abstract:
With the emergence of integrated sensing, communication, and computation (ISCC) in the upcoming 6G era, federated learning with ISCC (FL-ISCC), integrating sample collection, local training, and parameter exchange and aggregation, has garnered increasing interest for enhancing training efficiency. Currently, FL-ISCC primarily includes two algorithms: FedAVG-ISCC and FedSGD-ISCC. However, the theor…
▽ More
With the emergence of integrated sensing, communication, and computation (ISCC) in the upcoming 6G era, federated learning with ISCC (FL-ISCC), integrating sample collection, local training, and parameter exchange and aggregation, has garnered increasing interest for enhancing training efficiency. Currently, FL-ISCC primarily includes two algorithms: FedAVG-ISCC and FedSGD-ISCC. However, the theoretical understanding of the performance and advantages of these algorithms remains limited. To address this gap, we investigate a general FL-ISCC framework, implementing both FedAVG-ISCC and FedSGD-ISCC. We experimentally demonstrate the substantial potential of the ISCC framework in reducing latency and energy consumption in FL. Furthermore, we provide a theoretical analysis and comparison. The results reveal that:1) Both sample collection and communication errors negatively impact algorithm performance, highlighting the need for careful design to optimize FL-ISCC applications. 2) FedAVG-ISCC performs better than FedSGD-ISCC under IID data due to its advantage with multiple local updates. 3) FedSGD-ISCC is more robust than FedAVG-ISCC under non-IID data, where the multiple local updates in FedAVG-ISCC worsen performance as non-IID data increases. FedSGD-ISCC maintains performance levels similar to IID conditions. 4) FedSGD-ISCC is more resilient to communication errors than FedAVG-ISCC, which suffers from significant performance degradation as communication errors increase.Extensive simulations confirm the effectiveness of the FL-ISCC framework and validate our theoretical analysis.
△ Less
Submitted 17 September, 2024;
originally announced September 2024.
-
Reasoning Graph Enhanced Exemplars Retrieval for In-Context Learning
Authors:
Yukang Lin,
Bingchen Zhong,
Shuoran Jiang,
Joanna Siebert,
Qingcai Chen
Abstract:
Large language models(LLMs) have exhibited remarkable few-shot learning capabilities and unified the paradigm of NLP tasks through the in-context learning(ICL) technique. Despite the success of ICL, the quality of the exemplar demonstrations can significantly influence the LLM's performance. Existing exemplar selection methods mainly focus on the semantic similarity between queries and candidate e…
▽ More
Large language models(LLMs) have exhibited remarkable few-shot learning capabilities and unified the paradigm of NLP tasks through the in-context learning(ICL) technique. Despite the success of ICL, the quality of the exemplar demonstrations can significantly influence the LLM's performance. Existing exemplar selection methods mainly focus on the semantic similarity between queries and candidate exemplars. On the other hand, the logical connections between reasoning steps can be beneficial to depict the problem-solving process as well. In this paper, we proposes a novel method named Reasoning Graph-enhanced Exemplar Retrieval(RGER). RGER first quires LLM to generate an initial response, then expresses intermediate problem-solving steps to a graph structure. After that, it employs graph kernel to select exemplars with semantic and structural similarity. Extensive experiments demonstrate the structural relationship is helpful to the alignment of queries and candidate exemplars. The efficacy of RGER on math and logit reasoning tasks showcases its superiority over state-of-the-art retrieval-based approaches. Our code is released at https://github.com/Yukang-Lin/RGER.
△ Less
Submitted 17 September, 2024;
originally announced September 2024.
-
RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval
Authors:
Di Liu,
Meng Chen,
Baotong Lu,
Huiqiang Jiang,
Zhenhua Han,
Qianxi Zhang,
Qi Chen,
Chengruidong Zhang,
Bailu Ding,
Kai Zhang,
Chen Chen,
Fan Yang,
Yuqing Yang,
Lili Qiu
Abstract:
Transformer-based Large Language Models (LLMs) have become increasingly important. However, due to the quadratic time complexity of attention computation, scaling LLMs to longer contexts incurs extremely slow inference latency and high GPU memory consumption for caching key-value (KV) vectors. This paper proposes RetrievalAttention, a training-free approach to both accelerate attention computation…
▽ More
Transformer-based Large Language Models (LLMs) have become increasingly important. However, due to the quadratic time complexity of attention computation, scaling LLMs to longer contexts incurs extremely slow inference latency and high GPU memory consumption for caching key-value (KV) vectors. This paper proposes RetrievalAttention, a training-free approach to both accelerate attention computation and reduce GPU memory consumption. By leveraging the dynamic sparsity of attention mechanism, RetrievalAttention proposes to use approximate nearest neighbor search (ANNS) indexes for KV vectors in CPU memory and retrieves the most relevant ones with vector search during generation. Unfortunately, we observe that the off-the-shelf ANNS indexes are often ineffective for such retrieval tasks due to the out-of-distribution (OOD) between query vectors and key vectors in attention mechanism. RetrievalAttention addresses the OOD challenge by designing an attention-aware vector search algorithm that can adapt to the distribution of query vectors. Our evaluation shows that RetrievalAttention only needs to access 1--3% of data while maintaining high model accuracy. This leads to significant reduction in the inference cost of long-context LLMs with much lower GPU memory footprint. In particular, RetrievalAttention only needs a single NVIDIA RTX4090 (24GB) for serving 128K tokens in LLMs with 8B parameters, which is capable of generating one token in 0.188 seconds.
△ Less
Submitted 18 September, 2024; v1 submitted 16 September, 2024;
originally announced September 2024.
-
Instigating Cooperation among LLM Agents Using Adaptive Information Modulation
Authors:
Qiliang Chen,
Sepehr Ilami,
Nunzio Lore,
Babak Heydari
Abstract:
This paper introduces a novel framework combining LLM agents as proxies for human strategic behavior with reinforcement learning (RL) to engage these agents in evolving strategic interactions within team environments. Our approach extends traditional agent-based simulations by using strategic LLM agents (SLA) and introducing dynamic and adaptive governance through a pro-social promoting RL agent (…
▽ More
This paper introduces a novel framework combining LLM agents as proxies for human strategic behavior with reinforcement learning (RL) to engage these agents in evolving strategic interactions within team environments. Our approach extends traditional agent-based simulations by using strategic LLM agents (SLA) and introducing dynamic and adaptive governance through a pro-social promoting RL agent (PPA) that modulates information access across agents in a network, optimizing social welfare and promoting pro-social behavior. Through validation in iterative games, including the prisoner dilemma, we demonstrate that SLA agents exhibit nuanced strategic adaptations. The PPA agent effectively learns to adjust information transparency, resulting in enhanced cooperation rates. This framework offers significant insights into AI-mediated social dynamics, contributing to the deployment of AI in real-world team settings.
△ Less
Submitted 19 September, 2024; v1 submitted 16 September, 2024;
originally announced September 2024.
-
Revisiting Physical-World Adversarial Attack on Traffic Sign Recognition: A Commercial Systems Perspective
Authors:
Ningfei Wang,
Shaoyuan Xie,
Takami Sato,
Yunpeng Luo,
Kaidi Xu,
Qi Alfred Chen
Abstract:
Traffic Sign Recognition (TSR) is crucial for safe and correct driving automation. Recent works revealed a general vulnerability of TSR models to physical-world adversarial attacks, which can be low-cost, highly deployable, and capable of causing severe attack effects such as hiding a critical traffic sign or spoofing a fake one. However, so far existing works generally only considered evaluating…
▽ More
Traffic Sign Recognition (TSR) is crucial for safe and correct driving automation. Recent works revealed a general vulnerability of TSR models to physical-world adversarial attacks, which can be low-cost, highly deployable, and capable of causing severe attack effects such as hiding a critical traffic sign or spoofing a fake one. However, so far existing works generally only considered evaluating the attack effects on academic TSR models, leaving the impacts of such attacks on real-world commercial TSR systems largely unclear. In this paper, we conduct the first large-scale measurement of physical-world adversarial attacks against commercial TSR systems. Our testing results reveal that it is possible for existing attack works from academia to have highly reliable (100\%) attack success against certain commercial TSR system functionality, but such attack capabilities are not generalizable, leading to much lower-than-expected attack success rates overall. We find that one potential major factor is a spatial memorization design that commonly exists in today's commercial TSR systems. We design new attack success metrics that can mathematically model the impacts of such design on the TSR system-level attack success, and use them to revisit existing attacks. Through these efforts, we uncover 7 novel observations, some of which directly challenge the observations or claims in prior works due to the introduction of the new metrics.
△ Less
Submitted 15 September, 2024;
originally announced September 2024.
-
The Optimality of (Accelerated) SGD for High-Dimensional Quadratic Optimization
Authors:
Haihan Zhang,
Yuanshi Liu,
Qianwen Chen,
Cong Fang
Abstract:
Stochastic gradient descent (SGD) is a widely used algorithm in machine learning, particularly for neural network training. Recent studies on SGD for canonical quadratic optimization or linear regression show it attains well generalization under suitable high-dimensional settings. However, a fundamental question -- for what kinds of high-dimensional learning problems SGD and its accelerated varian…
▽ More
Stochastic gradient descent (SGD) is a widely used algorithm in machine learning, particularly for neural network training. Recent studies on SGD for canonical quadratic optimization or linear regression show it attains well generalization under suitable high-dimensional settings. However, a fundamental question -- for what kinds of high-dimensional learning problems SGD and its accelerated variants can achieve optimality has yet to be well studied. This paper investigates SGD with two essential components in practice: exponentially decaying step size schedule and momentum. We establish the convergence upper bound for momentum accelerated SGD (ASGD) and propose concrete classes of learning problems under which SGD or ASGD achieves min-max optimal convergence rates. The characterization of the target function is based on standard power-law decays in (functional) linear regression. Our results unveil new insights for understanding the learning bias of SGD: (i) SGD is efficient in learning ``dense'' features where the corresponding weights are subject to an infinity norm constraint; (ii) SGD is efficient for easy problem without suffering from the saturation effect; (iii) momentum can accelerate the convergence rate by order when the learning problem is relatively hard. To our knowledge, this is the first work to clearly identify the optimal boundary of SGD versus ASGD for the problem under mild settings.
△ Less
Submitted 15 September, 2024;
originally announced September 2024.
-
TG-LLaVA: Text Guided LLaVA via Learnable Latent Embeddings
Authors:
Dawei Yan,
Pengcheng Li,
Yang Li,
Hao Chen,
Qingguo Chen,
Weihua Luo,
Wei Dong,
Qingsen Yan,
Haokui Zhang,
Chunhua Shen
Abstract:
Currently, inspired by the success of vision-language models (VLMs), an increasing number of researchers are focusing on improving VLMs and have achieved promising results. However, most existing methods concentrate on optimizing the connector and enhancing the language model component, while neglecting improvements to the vision encoder itself. In contrast, we propose Text Guided LLaVA (TG-LLaVA)…
▽ More
Currently, inspired by the success of vision-language models (VLMs), an increasing number of researchers are focusing on improving VLMs and have achieved promising results. However, most existing methods concentrate on optimizing the connector and enhancing the language model component, while neglecting improvements to the vision encoder itself. In contrast, we propose Text Guided LLaVA (TG-LLaVA) in this paper, which optimizes VLMs by guiding the vision encoder with text, offering a new and orthogonal optimization direction. Specifically, inspired by the purpose-driven logic inherent in human behavior, we use learnable latent embeddings as a bridge to analyze textual instruction and add the analysis results to the vision encoder as guidance, refining it. Subsequently, another set of latent embeddings extracts additional detailed text-guided information from high-resolution local patches as auxiliary information. Finally, with the guidance of text, the vision encoder can extract text-related features, similar to how humans focus on the most relevant parts of an image when considering a question. This results in generating better answers. Experiments on various datasets validate the effectiveness of the proposed method. Remarkably, without the need for additional training data, our propsoed method can bring more benefits to the baseline (LLaVA-1.5) compared with other concurrent methods. Furthermore, the proposed method consistently brings improvement in different settings.
△ Less
Submitted 20 September, 2024; v1 submitted 14 September, 2024;
originally announced September 2024.
-
Adjoint Matching: Fine-tuning Flow and Diffusion Generative Models with Memoryless Stochastic Optimal Control
Authors:
Carles Domingo-Enrich,
Michal Drozdzal,
Brian Karrer,
Ricky T. Q. Chen
Abstract:
Dynamical generative models that produce samples through an iterative process, such as Flow Matching and denoising diffusion models, have seen widespread use, but there has not been many theoretically-sound methods for improving these models with reward fine-tuning. In this work, we cast reward fine-tuning as stochastic optimal control (SOC). Critically, we prove that a very specific memoryless no…
▽ More
Dynamical generative models that produce samples through an iterative process, such as Flow Matching and denoising diffusion models, have seen widespread use, but there has not been many theoretically-sound methods for improving these models with reward fine-tuning. In this work, we cast reward fine-tuning as stochastic optimal control (SOC). Critically, we prove that a very specific memoryless noise schedule must be enforced during fine-tuning, in order to account for the dependency between the noise variable and the generated samples. We also propose a new algorithm named Adjoint Matching which outperforms existing SOC algorithms, by casting SOC problems as a regression problem. We find that our approach significantly improves over existing methods for reward fine-tuning, achieving better consistency, realism, and generalization to unseen human preference reward models, while retaining sample diversity.
△ Less
Submitted 13 September, 2024;
originally announced September 2024.
-
xTED: Cross-Domain Policy Adaptation via Diffusion-Based Trajectory Editing
Authors:
Haoyi Niu,
Qimao Chen,
Tenglong Liu,
Jianxiong Li,
Guyue Zhou,
Yi Zhang,
Jianming Hu,
Xianyuan Zhan
Abstract:
Reusing pre-collected data from different domains is an attractive solution in decision-making tasks where the accessible data is insufficient in the target domain but relatively abundant in other related domains. Existing cross-domain policy transfer methods mostly aim at learning domain correspondences or corrections to facilitate policy learning, which requires learning domain/task-specific mod…
▽ More
Reusing pre-collected data from different domains is an attractive solution in decision-making tasks where the accessible data is insufficient in the target domain but relatively abundant in other related domains. Existing cross-domain policy transfer methods mostly aim at learning domain correspondences or corrections to facilitate policy learning, which requires learning domain/task-specific model components, representations, or policies that are inflexible or not fully reusable to accommodate arbitrary domains and tasks. These issues make us wonder: can we directly bridge the domain gap at the data (trajectory) level, instead of devising complicated, domain-specific policy transfer models? In this study, we propose a Cross-Domain Trajectory EDiting (xTED) framework with a new diffusion transformer model (Decision Diffusion Transformer, DDiT) that captures the trajectory distribution from the target dataset as a prior. The proposed diffusion transformer backbone captures the intricate dependencies among state, action, and reward sequences, as well as the transition dynamics within the target data trajectories. With the above pre-trained diffusion prior, source data trajectories with domain gaps can be transformed into edited trajectories that closely resemble the target data distribution through the diffusion-based editing process, which implicitly corrects the underlying domain gaps, enhancing the state realism and dynamics reliability in source trajectory data, while enabling flexible choices of downstream policy learning methods. Despite its simplicity, xTED demonstrates superior performance against other baselines in extensive simulation and real-robot experiments.
△ Less
Submitted 13 September, 2024;
originally announced September 2024.
-
Policy Prototyping for LLMs: Pluralistic Alignment via Interactive and Collaborative Policymaking
Authors:
K. J. Kevin Feng,
Inyoung Cheong,
Quan Ze Chen,
Amy X. Zhang
Abstract:
Emerging efforts in AI alignment seek to broaden participation in shaping model behavior by eliciting and integrating collective input into a policy for model finetuning. While pluralistic, these processes are often linear and do not allow participating stakeholders to confirm whether potential outcomes of their contributions are indeed consistent with their intentions. Design prototyping has long…
▽ More
Emerging efforts in AI alignment seek to broaden participation in shaping model behavior by eliciting and integrating collective input into a policy for model finetuning. While pluralistic, these processes are often linear and do not allow participating stakeholders to confirm whether potential outcomes of their contributions are indeed consistent with their intentions. Design prototyping has long advocated for rapid iteration using tight feedback loops of ideation, experimentation, and evaluation to mitigate these issues. We thus propose policy prototyping for LLMs, a new process that draws inspiration from prototyping practices to enable stakeholders to collaboratively and interactively draft LLM policies. Through learnings from a real-world LLM policymaking initiative at an industrial AI lab, we motivate our approach and characterize policy prototyping with four guiding principles. Because policy prototyping emphasizes a contrasting set of priorities compared to previous approaches, we envision our approach to be a valuable addition to the methodological repertoire for pluralistic alignment.
△ Less
Submitted 13 September, 2024;
originally announced September 2024.
-
Unified Audio Event Detection
Authors:
Yidi Jiang,
Ruijie Tao,
Wen Huang,
Qian Chen,
Wen Wang
Abstract:
Sound Event Detection (SED) detects regions of sound events, while Speaker Diarization (SD) segments speech conversations attributed to individual speakers. In SED, all speaker segments are classified as a single speech event, while in SD, non-speech sounds are treated merely as background noise. Thus, both tasks provide only partial analysis in complex audio scenarios involving both speech conver…
▽ More
Sound Event Detection (SED) detects regions of sound events, while Speaker Diarization (SD) segments speech conversations attributed to individual speakers. In SED, all speaker segments are classified as a single speech event, while in SD, non-speech sounds are treated merely as background noise. Thus, both tasks provide only partial analysis in complex audio scenarios involving both speech conversation and non-speech sounds. In this paper, we introduce a novel task called Unified Audio Event Detection (UAED) for comprehensive audio analysis. UAED explores the synergy between SED and SD tasks, simultaneously detecting non-speech sound events and fine-grained speech events based on speaker identities. To tackle this task, we propose a Transformer-based UAED (T-UAED) framework and construct the UAED Data derived from the Librispeech dataset and DESED soundbank. Experiments demonstrate that the proposed framework effectively exploits task interactions and substantially outperforms the baseline that simply combines the outputs of SED and SD models. T-UAED also shows its versatility by performing comparably to specialized models for individual SED and SD tasks on DESED and CALLHOME datasets.
△ Less
Submitted 13 September, 2024;
originally announced September 2024.
-
SimMAT: Exploring Transferability from Vision Foundation Models to Any Image Modality
Authors:
Chenyang Lei,
Liyi Chen,
Jun Cen,
Xiao Chen,
Zhen Lei,
Felix Heide,
Ziwei Liu,
Qifeng Chen,
Zhaoxiang Zhang
Abstract:
Foundation models like ChatGPT and Sora that are trained on a huge scale of data have made a revolutionary social impact. However, it is extremely challenging for sensors in many different fields to collect similar scales of natural images to train strong foundation models. To this end, this work presents a simple and effective framework SimMAT to study an open problem: the transferability from vi…
▽ More
Foundation models like ChatGPT and Sora that are trained on a huge scale of data have made a revolutionary social impact. However, it is extremely challenging for sensors in many different fields to collect similar scales of natural images to train strong foundation models. To this end, this work presents a simple and effective framework SimMAT to study an open problem: the transferability from vision foundation models trained on natural RGB images to other image modalities of different physical properties (e.g., polarization). SimMAT consists of a modality-agnostic transfer layer (MAT) and a pretrained foundation model. We apply SimMAT to a representative vision foundation model Segment Anything Model (SAM) to support any evaluated new image modality. Given the absence of relevant benchmarks, we construct a new benchmark to evaluate the transfer learning performance. Our experiments confirm the intriguing potential of transferring vision foundation models in enhancing other sensors' performance. Specifically, SimMAT can improve the segmentation performance (mIoU) from 22.15% to 53.88% on average for evaluated modalities and consistently outperforms other baselines. We hope that SimMAT can raise awareness of cross-modal transfer learning and benefit various fields for better results with vision foundation models.
△ Less
Submitted 12 September, 2024;
originally announced September 2024.
-
Thermal3D-GS: Physics-induced 3D Gaussians for Thermal Infrared Novel-view Synthesis
Authors:
Qian Chen,
Shihao Shu,
Xiangzhi Bai
Abstract:
Novel-view synthesis based on visible light has been extensively studied. In comparison to visible light imaging, thermal infrared imaging offers the advantage of all-weather imaging and strong penetration, providing increased possibilities for reconstruction in nighttime and adverse weather scenarios. However, thermal infrared imaging is influenced by physical characteristics such as atmospheric…
▽ More
Novel-view synthesis based on visible light has been extensively studied. In comparison to visible light imaging, thermal infrared imaging offers the advantage of all-weather imaging and strong penetration, providing increased possibilities for reconstruction in nighttime and adverse weather scenarios. However, thermal infrared imaging is influenced by physical characteristics such as atmospheric transmission effects and thermal conduction, hindering the precise reconstruction of intricate details in thermal infrared scenes, manifesting as issues of floaters and indistinct edge features in synthesized images. To address these limitations, this paper introduces a physics-induced 3D Gaussian splatting method named Thermal3D-GS. Thermal3D-GS begins by modeling atmospheric transmission effects and thermal conduction in three-dimensional media using neural networks. Additionally, a temperature consistency constraint is incorporated into the optimization objective to enhance the reconstruction accuracy of thermal infrared images. Furthermore, to validate the effectiveness of our method, the first large-scale benchmark dataset for this field named Thermal Infrared Novel-view Synthesis Dataset (TI-NSD) is created. This dataset comprises 20 authentic thermal infrared video scenes, covering indoor, outdoor, and UAV(Unmanned Aerial Vehicle) scenarios, totaling 6,664 frames of thermal infrared image data. Based on this dataset, this paper experimentally verifies the effectiveness of Thermal3D-GS. The results indicate that our method outperforms the baseline method with a 3.03 dB improvement in PSNR and significantly addresses the issues of floaters and indistinct edge features present in the baseline method. Our dataset and codebase will be released in \href{https://github.com/mzzcdf/Thermal3DGS}{\textcolor{red}{Thermal3DGS}}.
△ Less
Submitted 12 September, 2024;
originally announced September 2024.
-
DECOLLAGE: 3D Detailization by Controllable, Localized, and Learned Geometry Enhancement
Authors:
Qimin Chen,
Zhiqin Chen,
Vladimir G. Kim,
Noam Aigerman,
Hao Zhang,
Siddhartha Chaudhuri
Abstract:
We present a 3D modeling method which enables end-users to refine or detailize 3D shapes using machine learning, expanding the capabilities of AI-assisted 3D content creation. Given a coarse voxel shape (e.g., one produced with a simple box extrusion tool or via generative modeling), a user can directly "paint" desired target styles representing compelling geometric details, from input exemplar sh…
▽ More
We present a 3D modeling method which enables end-users to refine or detailize 3D shapes using machine learning, expanding the capabilities of AI-assisted 3D content creation. Given a coarse voxel shape (e.g., one produced with a simple box extrusion tool or via generative modeling), a user can directly "paint" desired target styles representing compelling geometric details, from input exemplar shapes, over different regions of the coarse shape. These regions are then up-sampled into high-resolution geometries which adhere with the painted styles. To achieve such controllable and localized 3D detailization, we build on top of a Pyramid GAN by making it masking-aware. We devise novel structural losses and priors to ensure that our method preserves both desired coarse structures and fine-grained features even if the painted styles are borrowed from diverse sources, e.g., different semantic parts and even different shape categories. Through extensive experiments, we show that our ability to localize details enables novel interactive creative workflows and applications. Our experiments further demonstrate that in comparison to prior techniques built on global detailization, our method generates structure-preserving, high-resolution stylized geometries with more coherent shape details and style transitions.
△ Less
Submitted 9 September, 2024;
originally announced September 2024.
-
Analyzing Tumors by Synthesis
Authors:
Qi Chen,
Yuxiang Lai,
Xiaoxi Chen,
Qixin Hu,
Alan Yuille,
Zongwei Zhou
Abstract:
Computer-aided tumor detection has shown great potential in enhancing the interpretation of over 80 million CT scans performed annually in the United States. However, challenges arise due to the rarity of CT scans with tumors, especially early-stage tumors. Developing AI with real tumor data faces issues of scarcity, annotation difficulty, and low prevalence. Tumor synthesis addresses these challe…
▽ More
Computer-aided tumor detection has shown great potential in enhancing the interpretation of over 80 million CT scans performed annually in the United States. However, challenges arise due to the rarity of CT scans with tumors, especially early-stage tumors. Developing AI with real tumor data faces issues of scarcity, annotation difficulty, and low prevalence. Tumor synthesis addresses these challenges by generating numerous tumor examples in medical images, aiding AI training for tumor detection and segmentation. Successful synthesis requires realistic and generalizable synthetic tumors across various organs. This chapter reviews AI development on real and synthetic data and summarizes two key trends in synthetic data for cancer imaging research: modeling-based and learning-based approaches. Modeling-based methods, like Pixel2Cancer, simulate tumor development over time using generic rules, while learning-based methods, like DiffTumor, learn from a few annotated examples in one organ to generate synthetic tumors in others. Reader studies with expert radiologists show that synthetic tumors can be convincingly realistic. We also present case studies in the liver, pancreas, and kidneys reveal that AI trained on synthetic tumors can achieve performance comparable to, or better than, AI only trained on real data. Tumor synthesis holds significant promise for expanding datasets, enhancing AI reliability, improving tumor detection performance, and preserving patient privacy.
△ Less
Submitted 9 September, 2024;
originally announced September 2024.
-
NetDPSyn: Synthesizing Network Traces under Differential Privacy
Authors:
Danyu Sun,
Joann Qiongna Chen,
Chen Gong,
Tianhao Wang,
Zhou Li
Abstract:
As the utilization of network traces for the network measurement research becomes increasingly prevalent, concerns regarding privacy leakage from network traces have garnered the public's attention. To safeguard network traces, researchers have proposed the trace synthesis that retains the essential properties of the raw data. However, previous works also show that synthesis traces with generative…
▽ More
As the utilization of network traces for the network measurement research becomes increasingly prevalent, concerns regarding privacy leakage from network traces have garnered the public's attention. To safeguard network traces, researchers have proposed the trace synthesis that retains the essential properties of the raw data. However, previous works also show that synthesis traces with generative models are vulnerable under linkage attacks.
This paper introduces NetDPSyn, the first system to synthesize high-fidelity network traces under privacy guarantees. NetDPSyn is built with the Differential Privacy (DP) framework as its core, which is significantly different from prior works that apply DP when training the generative model. The experiments conducted on three flow and two packet datasets indicate that NetDPSyn achieves much better data utility in downstream tasks like anomaly detection. NetDPSyn is also 2.5 times faster than the other methods on average in data synthesis.
△ Less
Submitted 8 September, 2024;
originally announced September 2024.
-
End User Authoring of Personalized Content Classifiers: Comparing Example Labeling, Rule Writing, and LLM Prompting
Authors:
Leijie Wang,
Kathryn Yurechko,
Pranati Dani,
Quan Ze Chen,
Amy X. Zhang
Abstract:
Existing tools for laypeople to create personal classifiers often assume a motivated user working uninterrupted in a single, lengthy session. However, users tend to engage with social media casually, with many short sessions on an ongoing, daily basis. To make creating personal classifiers for content curation easier for such users, tools should support rapid initialization and iterative refinemen…
▽ More
Existing tools for laypeople to create personal classifiers often assume a motivated user working uninterrupted in a single, lengthy session. However, users tend to engage with social media casually, with many short sessions on an ongoing, daily basis. To make creating personal classifiers for content curation easier for such users, tools should support rapid initialization and iterative refinement. In this work, we compare three strategies -- (1) example labeling, (2) rule writing, and (3) large language model (LLM) prompting -- for end users to build personal content classifiers. From an experiment with 37 non-programmers tasked with creating personalized comment moderation filters, we found that with LLM prompting, participants reached 95\% of peak performance in 5 minutes, beating other strategies due to higher recall, but all strategies struggled with iterative refinement. Despite LLM prompting's better performance, participants preferred different strategies in different contexts and, even when prompting, provided examples or wrote rule-like prompts, suggesting hybrid approaches.
△ Less
Submitted 5 September, 2024;
originally announced September 2024.
-
Online Scheduling via Gradient Descent for Weighted Flow Time Minimization
Authors:
Qingyun Chen,
Sungjin Im,
Aditya Petety
Abstract:
In this paper, we explore how a natural generalization of Shortest Remaining Processing Time (SRPT) can be a powerful \emph{meta-algorithm} for online scheduling. The meta-algorithm processes jobs to maximally reduce the objective of the corresponding offline scheduling problem of the remaining jobs: minimizing the total weighted completion time of them (the residual optimum). We show that it achi…
▽ More
In this paper, we explore how a natural generalization of Shortest Remaining Processing Time (SRPT) can be a powerful \emph{meta-algorithm} for online scheduling. The meta-algorithm processes jobs to maximally reduce the objective of the corresponding offline scheduling problem of the remaining jobs: minimizing the total weighted completion time of them (the residual optimum). We show that it achieves scalability for minimizing total weighted flow time when the residual optimum exhibits \emph{supermodularity}. Scalability here means it is $O(1)$-competitive with an arbitrarily small speed augmentation advantage over the adversary, representing the best possible outcome achievable for various scheduling problems.
Thanks to this finding, our approach does not require the residual optimum to have a closed mathematical form. Consequently, we can obtain the schedule by solving a linear program, which makes our approach readily applicable to a rich body of applications. Furthermore, by establishing a novel connection to \emph{substitute valuations in Walrasian markets}, we show how to achieve supermodularity, thereby obtaining scalable algorithms for various scheduling problems, such as matroid scheduling, generalized network flow, and generalized arbitrary speed-up curves, etc., and this is the \emph{first} non-trivial or scalable algorithm for many of them.
△ Less
Submitted 4 September, 2024;
originally announced September 2024.
-
HiPrompt: Tuning-free Higher-Resolution Generation with Hierarchical MLLM Prompts
Authors:
Xinyu Liu,
Yingqing He,
Lanqing Guo,
Xiang Li,
Bu Jin,
Peng Li,
Yan Li,
Chi-Min Chan,
Qifeng Chen,
Wei Xue,
Wenhan Luo,
Qifeng Liu,
Yike Guo
Abstract:
The potential for higher-resolution image generation using pretrained diffusion models is immense, yet these models often struggle with issues of object repetition and structural artifacts especially when scaling to 4K resolution and higher. We figure out that the problem is caused by that, a single prompt for the generation of multiple scales provides insufficient efficacy. In response, we propos…
▽ More
The potential for higher-resolution image generation using pretrained diffusion models is immense, yet these models often struggle with issues of object repetition and structural artifacts especially when scaling to 4K resolution and higher. We figure out that the problem is caused by that, a single prompt for the generation of multiple scales provides insufficient efficacy. In response, we propose HiPrompt, a new tuning-free solution that tackles the above problems by introducing hierarchical prompts. The hierarchical prompts offer both global and local guidance. Specifically, the global guidance comes from the user input that describes the overall content, while the local guidance utilizes patch-wise descriptions from MLLMs to elaborately guide the regional structure and texture generation. Furthermore, during the inverse denoising process, the generated noise is decomposed into low- and high-frequency spatial components. These components are conditioned on multiple prompt levels, including detailed patch-wise descriptions and broader image-level prompts, facilitating prompt-guided denoising under hierarchical semantic guidance. It further allows the generation to focus more on local spatial regions and ensures the generated images maintain coherent local and global semantics, structures, and textures with high definition. Extensive experiments demonstrate that HiPrompt outperforms state-of-the-art works in higher-resolution image generation, significantly reducing object repetition and enhancing structural quality.
△ Less
Submitted 9 September, 2024; v1 submitted 4 September, 2024;
originally announced September 2024.
-
CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance the Mathematics Reasoning of Large Multimodal Models
Authors:
Wentao Liu,
Qianjun Pan,
Yi Zhang,
Zhuo Liu,
Ji Wu,
Jie Zhou,
Aimin Zhou,
Qin Chen,
Bo Jiang,
Liang He
Abstract:
Large language models (LLMs) have obtained promising results in mathematical reasoning, which is a foundational skill for human intelligence. Most previous studies focus on improving and measuring the performance of LLMs based on textual math reasoning datasets (e.g., MATH, GSM8K). Recently, a few researchers have released English multimodal math datasets (e.g., MATHVISTA and MATH-V) to evaluate t…
▽ More
Large language models (LLMs) have obtained promising results in mathematical reasoning, which is a foundational skill for human intelligence. Most previous studies focus on improving and measuring the performance of LLMs based on textual math reasoning datasets (e.g., MATH, GSM8K). Recently, a few researchers have released English multimodal math datasets (e.g., MATHVISTA and MATH-V) to evaluate the effectiveness of large multimodal models (LMMs). In this paper, we release a Chinese multimodal math (CMM-Math) dataset, including benchmark and training parts, to evaluate and enhance the mathematical reasoning of LMMs. CMM-Math contains over 28,000 high-quality samples, featuring a variety of problem types (e.g., multiple-choice, fill-in-the-blank, and so on) with detailed solutions across 12 grade levels from elementary to high school in China. Specifically, the visual context may be present in the questions or opinions, which makes this dataset more challenging. Through comprehensive analysis, we discover that state-of-the-art LMMs on the CMM-Math dataset face challenges, emphasizing the necessity for further improvements in LMM development. We also propose a Multimodal Mathematical LMM (Math-LMM) to handle the problems with mixed input of multiple images and text segments. We train our model using three stages, including foundational pre-training, foundational fine-tuning, and mathematical fine-tuning. The extensive experiments indicate that our model effectively improves math reasoning performance by comparing it with the SOTA LMMs over three multimodal mathematical datasets.
△ Less
Submitted 6 September, 2024; v1 submitted 4 September, 2024;
originally announced September 2024.
-
What are the Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets? Insights and Best Practices
Authors:
Zhi Chen,
Qiguang Chen,
Libo Qin,
Qipeng Guo,
Haijun Lv,
Yicheng Zou,
Wanxiang Che,
Hang Yan,
Kai Chen,
Dahua Lin
Abstract:
Recent advancements in large language models (LLMs) with extended context windows have significantly improved tasks such as information extraction, question answering, and complex planning scenarios. In order to achieve success in long context tasks, a large amount of work has been done to enhance the long context capabilities of the model through synthetic data. Existing methods typically utilize…
▽ More
Recent advancements in large language models (LLMs) with extended context windows have significantly improved tasks such as information extraction, question answering, and complex planning scenarios. In order to achieve success in long context tasks, a large amount of work has been done to enhance the long context capabilities of the model through synthetic data. Existing methods typically utilize the Self-Instruct framework to generate instruction tuning data for better long context capability improvement. However, our preliminary experiments indicate that less than 35% of generated samples are multi-hop, and more than 40% exhibit poor quality, limiting comprehensive understanding and further research. To improve the quality of synthetic data, we propose the Multi-agent Interactive Multi-hop Generation (MIMG) framework, incorporating a Quality Verification Agent, a Single-hop Question Generation Agent, a Multiple Question Sampling Strategy, and a Multi-hop Question Merger Agent. This framework improves the data quality, with the proportion of high-quality, multi-hop, and diverse data exceeding 85%. Furthermore, we systematically investigate strategies for document selection, question merging, and validation techniques through extensive experiments across various models. Our findings show that our synthetic high-quality long-context instruction data significantly enhances model performance, even surpassing models trained on larger amounts of human-annotated data. Our code is available at: https://github.com/WowCZ/LongMIT.
△ Less
Submitted 3 September, 2024;
originally announced September 2024.
-
Follow-Your-Canvas: Higher-Resolution Video Outpainting with Extensive Content Generation
Authors:
Qihua Chen,
Yue Ma,
Hongfa Wang,
Junkun Yuan,
Wenzhe Zhao,
Qi Tian,
Hongmei Wang,
Shaobo Min,
Qifeng Chen,
Wei Liu
Abstract:
This paper explores higher-resolution video outpainting with extensive content generation. We point out common issues faced by existing methods when attempting to largely outpaint videos: the generation of low-quality content and limitations imposed by GPU memory. To address these challenges, we propose a diffusion-based method called \textit{Follow-Your-Canvas}. It builds upon two core designs. F…
▽ More
This paper explores higher-resolution video outpainting with extensive content generation. We point out common issues faced by existing methods when attempting to largely outpaint videos: the generation of low-quality content and limitations imposed by GPU memory. To address these challenges, we propose a diffusion-based method called \textit{Follow-Your-Canvas}. It builds upon two core designs. First, instead of employing the common practice of "single-shot" outpainting, we distribute the task across spatial windows and seamlessly merge them. It allows us to outpaint videos of any size and resolution without being constrained by GPU memory. Second, the source video and its relative positional relation are injected into the generation process of each window. It makes the generated spatial layout within each window harmonize with the source video. Coupling with these two designs enables us to generate higher-resolution outpainting videos with rich content while keeping spatial and temporal consistency. Follow-Your-Canvas excels in large-scale video outpainting, e.g., from 512X512 to 1152X2048 (9X), while producing high-quality and aesthetically pleasing results. It achieves the best quantitative results across various resolution and scale setups. The code is released on https://github.com/mayuelala/FollowYourCanvas
△ Less
Submitted 2 September, 2024;
originally announced September 2024.
-
LSMS: Language-guided Scale-aware MedSegmentor for Medical Image Referring Segmentation
Authors:
Shuyi Ouyang,
Jinyang Zhang,
Xiangye Lin,
Xilai Wang,
Qingqing Chen,
Yen-Wei Chen,
Lanfen Lin
Abstract:
Conventional medical image segmentation methods have been found inadequate in facilitating physicians with the identification of specific lesions for diagnosis and treatment. Given the utility of text as an instructional format, we introduce a novel task termed Medical Image Referring Segmentation (MIRS), which requires segmenting specified lesions in images based on the given language expressions…
▽ More
Conventional medical image segmentation methods have been found inadequate in facilitating physicians with the identification of specific lesions for diagnosis and treatment. Given the utility of text as an instructional format, we introduce a novel task termed Medical Image Referring Segmentation (MIRS), which requires segmenting specified lesions in images based on the given language expressions. Due to the varying object scales in medical images, MIRS demands robust vision-language modeling and comprehensive multi-scale interaction for precise localization and segmentation under linguistic guidance. However, existing medical image segmentation methods fall short in meeting these demands, resulting in insufficient segmentation accuracy. In response, we propose an approach named Language-guided Scale-aware MedSegmentor (LSMS), incorporating two appealing designs: (1)~a Scale-aware Vision-Language Attention module that leverages diverse convolutional kernels to acquire rich visual knowledge and interact closely with linguistic features, thereby enhancing lesion localization capability; (2)~a Full-Scale Decoder that globally models multi-modal features across various scales, capturing complementary information between scales to accurately outline lesion boundaries. Addressing the lack of suitable datasets for MIRS, we constructed a vision-language medical dataset called Reference Hepatic Lesion Segmentation (RefHL-Seg). This dataset comprises 2,283 abdominal CT slices from 231 cases, with corresponding textual annotations and segmentation masks for various liver lesions in images. We validated the performance of LSMS for MIRS and conventional medical image segmentation tasks across various datasets. Our LSMS consistently outperforms on all datasets with lower computational costs. The code and datasets will be released.
△ Less
Submitted 2 September, 2024; v1 submitted 30 August, 2024;
originally announced August 2024.
-
Service-Oriented AoI Modeling and Analysis for Non-Terrestrial Networks
Authors:
Zheng Guo,
Qian Chen,
Weixiao Meng
Abstract:
To achieve truly seamless global intelligent connectivity, non-terrestrial networks (NTN) mainly composed of low earth orbit (LEO) satellites and drones are recognized as important components of the future 6G network architecture. Meanwhile, the rapid advancement of the Internet of Things (IoT) has led to the proliferation of numerous applications with stringent requirements for timely information…
▽ More
To achieve truly seamless global intelligent connectivity, non-terrestrial networks (NTN) mainly composed of low earth orbit (LEO) satellites and drones are recognized as important components of the future 6G network architecture. Meanwhile, the rapid advancement of the Internet of Things (IoT) has led to the proliferation of numerous applications with stringent requirements for timely information delivery. The Age of Information (AoI), a critical performance metric for assessing the freshness of data in information update systems, has gained significant importance in this context. However, existing modeling and analysis work on AoI mainly focuses on terrestrial networks, and the distribution characteristics of ground nodes and the high dynamics of satellites have not been fully considered, which poses challenges for more accurate evaluation. Against this background, we model the ground nodes as a hybrid distribution of Poisson point process (PPP) and Poisson cluster process (PCP) to capture the impact of ground node distribution on the AoI of status update packet transmission supported by UAVs and satellites in NTN, and the visibility and cross-traffic characteristics of satellites are additionally considered. We derived the average AoI for the system in these two different situations and examined the impact of various network parameters on AoI performance.
△ Less
Submitted 30 August, 2024;
originally announced August 2024.
-
Video to Music Moment Retrieval
Authors:
Zijie Xin,
Minquan Wang,
Ye Ma,
Bo Wang,
Quan Chen,
Peng Jiang,
Xirong Li
Abstract:
Adding proper background music helps complete a short video to be shared. Towards automating the task, previous research focuses on video-to-music retrieval (VMR), aiming to find amidst a collection of music the one best matching the content of a given video. Since music tracks are typically much longer than short videos, meaning the returned music has to be cut to a shorter moment, there is a cle…
▽ More
Adding proper background music helps complete a short video to be shared. Towards automating the task, previous research focuses on video-to-music retrieval (VMR), aiming to find amidst a collection of music the one best matching the content of a given video. Since music tracks are typically much longer than short videos, meaning the returned music has to be cut to a shorter moment, there is a clear gap between the practical need and VMR. In order to bridge the gap, we propose in this paper video to music moment retrieval (VMMR) as a new task. To tackle the new task, we build a comprehensive dataset Ad-Moment which contains 50K short videos annotated with music moments and develop a two-stage approach. In particular, given a test video, the most similar music is retrieved from a given collection. Then, a Transformer based music moment localization is performed. We term this approach Retrieval and Localization (ReaL). Extensive experiments on real-world datasets verify the effectiveness of the proposed method for VMMR.
△ Less
Submitted 29 August, 2024;
originally announced August 2024.
-
WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling
Authors:
Shengpeng Ji,
Ziyue Jiang,
Xize Cheng,
Yifu Chen,
Minghui Fang,
Jialong Zuo,
Qian Yang,
Ruiqi Li,
Ziang Zhang,
Xiaoda Yang,
Rongjie Huang,
Yidi Jiang,
Qian Chen,
Siqi Zheng,
Wen Wang,
Zhou Zhao
Abstract:
Language models have been effectively applied to modeling natural signals, such as images, video, speech, and audio. A crucial component of these models is the codec tokenizer, which compresses high-dimensional natural signals into lower-dimensional discrete tokens. In this paper, we introduce WavTokenizer, which offers several advantages over previous SOTA acoustic codec models in the audio domai…
▽ More
Language models have been effectively applied to modeling natural signals, such as images, video, speech, and audio. A crucial component of these models is the codec tokenizer, which compresses high-dimensional natural signals into lower-dimensional discrete tokens. In this paper, we introduce WavTokenizer, which offers several advantages over previous SOTA acoustic codec models in the audio domain: 1)extreme compression. By compressing the layers of quantizers and the temporal dimension of the discrete codec, one-second audio of 24kHz sampling rate requires only a single quantizer with 40 or 75 tokens. 2)improved subjective quality. Despite the reduced number of tokens, WavTokenizer achieves state-of-the-art reconstruction quality with outstanding UTMOS scores and inherently contains richer semantic information. Specifically, we achieve these results by designing a broader VQ space, extended contextual windows, and improved attention networks, as well as introducing a powerful multi-scale discriminator and an inverse Fourier transform structure. We conducted extensive reconstruction experiments in the domains of speech, audio, and music. WavTokenizer exhibited strong performance across various objective and subjective metrics compared to state-of-the-art models. We also tested semantic information, VQ utilization, and adaptability to generative models. Comprehensive ablation studies confirm the necessity of each module in WavTokenizer. The related code, demos, and pre-trained models are available at https://github.com/jishengpeng/WavTokenizer.
△ Less
Submitted 29 August, 2024;
originally announced August 2024.
-
ResVG: Enhancing Relation and Semantic Understanding in Multiple Instances for Visual Grounding
Authors:
Minghang Zheng,
Jiahua Zhang,
Qingchao Chen,
Yuxin Peng,
Yang Liu
Abstract:
Visual grounding aims to localize the object referred to in an image based on a natural language query. Although progress has been made recently, accurately localizing target objects within multiple-instance distractions (multiple objects of the same category as the target) remains a significant challenge. Existing methods demonstrate a significant performance drop when there are multiple distract…
▽ More
Visual grounding aims to localize the object referred to in an image based on a natural language query. Although progress has been made recently, accurately localizing target objects within multiple-instance distractions (multiple objects of the same category as the target) remains a significant challenge. Existing methods demonstrate a significant performance drop when there are multiple distractions in an image, indicating an insufficient understanding of the fine-grained semantics and spatial relationships between objects. In this paper, we propose a novel approach, the Relation and Semantic-sensitive Visual Grounding (ResVG) model, to address this issue. Firstly, we enhance the model's understanding of fine-grained semantics by injecting semantic prior information derived from text queries into the model. This is achieved by leveraging text-to-image generation models to produce images representing the semantic attributes of target objects described in queries. Secondly, we tackle the lack of training samples with multiple distractions by introducing a relation-sensitive data augmentation method. This method generates additional training data by synthesizing images containing multiple objects of the same category and pseudo queries based on their spatial relationships. The proposed ReSVG model significantly improves the model's ability to comprehend both object semantics and spatial relations, leading to enhanced performance in visual grounding tasks, particularly in scenarios with multiple-instance distractions. We conduct extensive experiments to validate the effectiveness of our methods on five datasets. Code is available at https://github.com/minghangz/ResVG.
△ Less
Submitted 29 August, 2024;
originally announced August 2024.
-
Training-free Video Temporal Grounding using Large-scale Pre-trained Models
Authors:
Minghang Zheng,
Xinhao Cai,
Qingchao Chen,
Yuxin Peng,
Yang Liu
Abstract:
Video temporal grounding aims to identify video segments within untrimmed videos that are most relevant to a given natural language query. Existing video temporal localization models rely on specific datasets for training and have high data collection costs, but they exhibit poor generalization capability under the across-dataset and out-of-distribution (OOD) settings. In this paper, we propose a…
▽ More
Video temporal grounding aims to identify video segments within untrimmed videos that are most relevant to a given natural language query. Existing video temporal localization models rely on specific datasets for training and have high data collection costs, but they exhibit poor generalization capability under the across-dataset and out-of-distribution (OOD) settings. In this paper, we propose a Training-Free Video Temporal Grounding (TFVTG) approach that leverages the ability of pre-trained large models. A naive baseline is to enumerate proposals in the video and use the pre-trained visual language models (VLMs) to select the best proposal according to the vision-language alignment. However, most existing VLMs are trained on image-text pairs or trimmed video clip-text pairs, making it struggle to (1) grasp the relationship and distinguish the temporal boundaries of multiple events within the same video; (2) comprehend and be sensitive to the dynamic transition of events (the transition from one event to another) in the video. To address these issues, we propose leveraging large language models (LLMs) to analyze multiple sub-events contained in the query text and analyze the temporal order and relationships between these events. Secondly, we split a sub-event into dynamic transition and static status parts and propose the dynamic and static scoring functions using VLMs to better evaluate the relevance between the event and the description. Finally, for each sub-event description, we use VLMs to locate the top-k proposals and leverage the order and relationships between sub-events provided by LLMs to filter and integrate these proposals. Our method achieves the best performance on zero-shot video temporal grounding on Charades-STA and ActivityNet Captions datasets without any training and demonstrates better generalization capabilities in cross-dataset and OOD settings.
△ Less
Submitted 28 August, 2024;
originally announced August 2024.
-
HEAD: A Bandwidth-Efficient Cooperative Perception Approach for Heterogeneous Connected and Autonomous Vehicles
Authors:
Deyuan Qu,
Qi Chen,
Yongqi Zhu,
Yihao Zhu,
Sergei S. Avedisov,
Song Fu,
Qing Yang
Abstract:
In cooperative perception studies, there is often a trade-off between communication bandwidth and perception performance. While current feature fusion solutions are known for their excellent object detection performance, transmitting the entire sets of intermediate feature maps requires substantial bandwidth. Furthermore, these fusion approaches are typically limited to vehicles that use identical…
▽ More
In cooperative perception studies, there is often a trade-off between communication bandwidth and perception performance. While current feature fusion solutions are known for their excellent object detection performance, transmitting the entire sets of intermediate feature maps requires substantial bandwidth. Furthermore, these fusion approaches are typically limited to vehicles that use identical detection models. Our goal is to develop a solution that supports cooperative perception across vehicles equipped with different modalities of sensors. This method aims to deliver improved perception performance compared to late fusion techniques, while achieving precision similar to the state-of-art intermediate fusion, but requires an order of magnitude less bandwidth. We propose HEAD, a method that fuses features from the classification and regression heads in 3D object detection networks. Our method is compatible with heterogeneous detection networks such as LiDAR PointPillars, SECOND, VoxelNet, and camera Bird's-eye View (BEV) Encoder. Given the naturally smaller feature size in the detection heads, we design a self-attention mechanism to fuse the classification head and a complementary feature fusion layer to fuse the regression head. Our experiments, comprehensively evaluated on the V2V4Real and OPV2V datasets, demonstrate that HEAD is a fusion method that effectively balances communication bandwidth and perception performance.
△ Less
Submitted 27 August, 2024;
originally announced August 2024.
-
SkillMimic: Learning Reusable Basketball Skills from Demonstrations
Authors:
Yinhuai Wang,
Qihan Zhao,
Runyi Yu,
Ailing Zeng,
Jing Lin,
Zhengyi Luo,
Hok Wai Tsui,
Jiwen Yu,
Xiu Li,
Qifeng Chen,
Jian Zhang,
Lei Zhang,
Ping Tan
Abstract:
Mastering basketball skills such as diverse layups and dribbling involves complex interactions with the ball and requires real-time adjustments. Traditional reinforcement learning methods for interaction skills rely on labor-intensive, manually designed rewards that do not generalize well across different skills. Inspired by how humans learn from demonstrations, we propose SkillMimic, a data-drive…
▽ More
Mastering basketball skills such as diverse layups and dribbling involves complex interactions with the ball and requires real-time adjustments. Traditional reinforcement learning methods for interaction skills rely on labor-intensive, manually designed rewards that do not generalize well across different skills. Inspired by how humans learn from demonstrations, we propose SkillMimic, a data-driven approach that mimics both human and ball motions to learn a wide variety of basketball skills. SkillMimic employs a unified configuration to learn diverse skills from human-ball motion datasets, with skill diversity and generalization improving as the dataset grows. This approach allows training a single policy to learn multiple skills, enabling smooth skill switching even if these switches are not present in the reference dataset. The skills acquired by SkillMimic can be easily reused by a high-level controller to accomplish complex basketball tasks. To evaluate our approach, we introduce two basketball datasets: one estimated through monocular RGB videos and the other using advanced motion capture equipment, collectively containing about 35 minutes of diverse basketball skills. Experiments show that our method can effectively learn various basketball skills included in the dataset with a unified configuration, including various styles of dribbling, layups, and shooting. Furthermore, by training a high-level controller to reuse the acquired skills, we can achieve complex basketball tasks such as layup scoring, which involves dribbling toward the basket, timing the dribble and layup to score, retrieving the rebound, and repeating the process. The project page and video demonstrations are available at https://ingrid789.github.io/SkillMimic/
△ Less
Submitted 12 August, 2024;
originally announced August 2024.
-
Artificial Intelligence in Landscape Architecture: A Survey
Authors:
Yue Xing,
Wensheng Gan,
Qidi Chen
Abstract:
The development history of landscape architecture (LA) reflects the human pursuit of environmental beautification and ecological balance. With the advancement of artificial intelligence (AI) technologies that simulate and extend human intelligence, immense opportunities have been provided for LA, offering scientific and technological support throughout the entire workflow. In this article, we comp…
▽ More
The development history of landscape architecture (LA) reflects the human pursuit of environmental beautification and ecological balance. With the advancement of artificial intelligence (AI) technologies that simulate and extend human intelligence, immense opportunities have been provided for LA, offering scientific and technological support throughout the entire workflow. In this article, we comprehensively review the applications of AI technology in the field of LA. First, we introduce the many potential benefits that AI brings to the design, planning, and management aspects of LA. Secondly, we discuss how AI can assist the LA field in solving its current development problems, including urbanization, environmental degradation and ecological decline, irrational planning, insufficient management and maintenance, and lack of public participation. Furthermore, we summarize the key technologies and practical cases of applying AI in the LA domain, from design assistance to intelligent management, all of which provide innovative solutions for the planning, design, and maintenance of LA. Finally, we look ahead to the problems and opportunities in LA, emphasizing the need to combine human expertise and judgment for rational decision-making. This article provides both theoretical and practical guidance for LA designers, researchers, and technology developers. The successful integration of AI technology into LA holds great promise for enhancing the field's capabilities and achieving more sustainable, efficient, and user-friendly outcomes.
△ Less
Submitted 26 August, 2024;
originally announced August 2024.
-
Evolvable Psychology Informed Neural Network for Memory Behavior Modeling
Authors:
Xiaoxuan Shen,
Zhihai Hu,
Qirong Chen,
Shengyingjie Liu,
Ruxia Liang,
Jianwen Sun
Abstract:
Memory behavior modeling is a core issue in cognitive psychology and education. Classical psychological theories typically use memory equations to describe memory behavior, which exhibits insufficient accuracy and controversy, while data-driven memory modeling methods often require large amounts of training data and lack interpretability. Knowledge-informed neural network models have shown excelle…
▽ More
Memory behavior modeling is a core issue in cognitive psychology and education. Classical psychological theories typically use memory equations to describe memory behavior, which exhibits insufficient accuracy and controversy, while data-driven memory modeling methods often require large amounts of training data and lack interpretability. Knowledge-informed neural network models have shown excellent performance in fields like physics, but there have been few attempts in the domain of behavior modeling. This paper proposed a psychology theory informed neural networks for memory behavior modeling named PsyINN, where it constructs a framework that combines neural network with differentiating sparse regression, achieving joint optimization. Specifically, to address the controversies and ambiguity of descriptors in memory equations, a descriptor evolution method based on differentiating operators is proposed to achieve precise characterization of descriptors and the evolution of memory theoretical equations. Additionally, a buffering mechanism for the sparse regression and a multi-module alternating iterative optimization method are proposed, effectively mitigating gradient instability and local optima issues. On four large-scale real-world memory behavior datasets, the proposed method surpasses the state-of-the-art methods in prediction accuracy. Ablation study demonstrates the effectiveness of the proposed refinements, and application experiments showcase its potential in inspiring psychological research.
△ Less
Submitted 22 August, 2024;
originally announced August 2024.
-
Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos
Authors:
Qirui Chen,
Shangzhe Di,
Weidi Xie
Abstract:
This paper considers the problem of Multi-Hop Video Question Answering (MH-VidQA) in long-form egocentric videos. This task not only requires to answer visual questions, but also to localize multiple relevant time intervals within the video as visual evidences. We develop an automated pipeline to create multi-hop question-answering pairs with associated temporal evidence, enabling to construct a l…
▽ More
This paper considers the problem of Multi-Hop Video Question Answering (MH-VidQA) in long-form egocentric videos. This task not only requires to answer visual questions, but also to localize multiple relevant time intervals within the video as visual evidences. We develop an automated pipeline to create multi-hop question-answering pairs with associated temporal evidence, enabling to construct a large-scale dataset for instruction-tuning. To monitor the progress of this new task, we further curate a high-quality benchmark, MultiHop-EgoQA, with careful manual verification and refinement. Experimental results reveal that existing multi-modal systems exhibit inadequate multi-hop grounding and reasoning abilities, resulting in unsatisfactory performance. We then propose a novel architecture, termed as Grounding Scattered Evidence with Large Language Model (GeLM), that enhances multi-modal large language models (MLLMs) by incorporating a grounding module to retrieve temporal evidence from videos using flexible grounding tokens. Trained on our visual instruction data, GeLM demonstrates improved multi-hop grounding and reasoning capabilities, setting a new baseline for this challenging task. Furthermore, when trained on third-person view videos, the same architecture also achieves state-of-the-art performance on the single-hop VidQA benchmark, ActivityNet-RTL, demonstrating its effectiveness.
△ Less
Submitted 26 August, 2024;
originally announced August 2024.
-
D&M: Enriching E-commerce Videos with Sound Effects by Key Moment Detection and SFX Matching
Authors:
Jingyu Liu,
Minquan Wang,
Ye Ma,
Bo Wang,
Aozhu Chen,
Quan Chen,
Peng Jiang,
Xirong Li
Abstract:
Videos showcasing specific products are increasingly important for E-commerce. Key moments naturally exist as the first appearance of a specific product, presentation of its distinctive features, the presence of a buying link, etc. Adding proper sound effects (SFX) to these key moments, or video decoration with SFX (VDSFX), is crucial for enhancing the user engaging experience. Previous studies ab…
▽ More
Videos showcasing specific products are increasingly important for E-commerce. Key moments naturally exist as the first appearance of a specific product, presentation of its distinctive features, the presence of a buying link, etc. Adding proper sound effects (SFX) to these key moments, or video decoration with SFX (VDSFX), is crucial for enhancing the user engaging experience. Previous studies about adding SFX to videos perform video to SFX matching at a holistic level, lacking the ability of adding SFX to a specific moment. Meanwhile, previous studies on video highlight detection or video moment retrieval consider only moment localization, leaving moment to SFX matching untouched. By contrast, we propose in this paper D&M, a unified method that accomplishes key moment detection and moment to SFX matching simultaneously. Moreover, for the new VDSFX task we build a large-scale dataset SFX-Moment from an E-commerce platform. For a fair comparison, we build competitive baselines by extending a number of current video moment detection methods to the new task. Extensive experiments on SFX-Moment show the superior performance of the proposed method over the baselines. Code and data will be released.
△ Less
Submitted 23 August, 2024;
originally announced August 2024.
-
Diffusion-Based Visual Art Creation: A Survey and New Perspectives
Authors:
Bingyuan Wang,
Qifeng Chen,
Zeyu Wang
Abstract:
The integration of generative AI in visual art has revolutionized not only how visual content is created but also how AI interacts with and reflects the underlying domain knowledge. This survey explores the emerging realm of diffusion-based visual art creation, examining its development from both artistic and technical perspectives. We structure the survey into three phases, data feature and frame…
▽ More
The integration of generative AI in visual art has revolutionized not only how visual content is created but also how AI interacts with and reflects the underlying domain knowledge. This survey explores the emerging realm of diffusion-based visual art creation, examining its development from both artistic and technical perspectives. We structure the survey into three phases, data feature and framework identification, detailed analyses using a structured coding process, and open-ended prospective outlooks. Our findings reveal how artistic requirements are transformed into technical challenges and highlight the design and application of diffusion-based methods within visual art creation. We also provide insights into future directions from technical and synergistic perspectives, suggesting that the confluence of generative AI and art has shifted the creative paradigm and opened up new possibilities. By summarizing the development and trends of this emerging interdisciplinary area, we aim to shed light on the mechanisms through which AI systems emulate and possibly, enhance human capacities in artistic perception and creativity.
△ Less
Submitted 22 August, 2024;
originally announced August 2024.
-
Integrating Audio, Visual, and Semantic Information for Enhanced Multimodal Speaker Diarization
Authors:
Luyao Cheng,
Hui Wang,
Siqi Zheng,
Yafeng Chen,
Rongjie Huang,
Qinglin Zhang,
Qian Chen,
Xihao Li
Abstract:
Speaker diarization, the process of segmenting an audio stream or transcribed speech content into homogenous partitions based on speaker identity, plays a crucial role in the interpretation and analysis of human speech. Most existing speaker diarization systems rely exclusively on unimodal acoustic information, making the task particularly challenging due to the innate ambiguities of audio signals…
▽ More
Speaker diarization, the process of segmenting an audio stream or transcribed speech content into homogenous partitions based on speaker identity, plays a crucial role in the interpretation and analysis of human speech. Most existing speaker diarization systems rely exclusively on unimodal acoustic information, making the task particularly challenging due to the innate ambiguities of audio signals. Recent studies have made tremendous efforts towards audio-visual or audio-semantic modeling to enhance performance. However, even the incorporation of up to two modalities often falls short in addressing the complexities of spontaneous and unstructured conversations. To exploit more meaningful dialogue patterns, we propose a novel multimodal approach that jointly utilizes audio, visual, and semantic cues to enhance speaker diarization. Our method elegantly formulates the multimodal modeling as a constrained optimization problem. First, we build insights into the visual connections among active speakers and the semantic interactions within spoken content, thereby establishing abundant pairwise constraints. Then we introduce a joint pairwise constraint propagation algorithm to cluster speakers based on these visual and semantic constraints. This integration effectively leverages the complementary strengths of different modalities, refining the affinity estimation between individual speaker embeddings. Extensive experiments conducted on multiple multimodal datasets demonstrate that our approach consistently outperforms state-of-the-art speaker diarization methods.
△ Less
Submitted 21 August, 2024;
originally announced August 2024.
-
CrossFi: A Cross Domain Wi-Fi Sensing Framework Based on Siamese Network
Authors:
Zijian Zhao,
Tingwei Chen,
Zhijie Cai,
Xiaoyang Li,
Hang Li,
Qimei Chen,
Guangxu Zhu
Abstract:
In recent years, Wi-Fi sensing has garnered significant attention due to its numerous benefits, such as privacy protection, low cost, and penetration ability. Extensive research has been conducted in this field, focusing on areas such as gesture recognition, people identification, and fall detection. However, many data-driven methods encounter challenges related to domain shift, where the model fa…
▽ More
In recent years, Wi-Fi sensing has garnered significant attention due to its numerous benefits, such as privacy protection, low cost, and penetration ability. Extensive research has been conducted in this field, focusing on areas such as gesture recognition, people identification, and fall detection. However, many data-driven methods encounter challenges related to domain shift, where the model fails to perform well in environments different from the training data. One major factor contributing to this issue is the limited availability of Wi-Fi sensing datasets, which makes models learn excessive irrelevant information and over-fit to the training set. Unfortunately, collecting large-scale Wi-Fi sensing datasets across diverse scenarios is a challenging task. To address this problem, we propose CrossFi, a siamese network-based approach that excels in both in-domain scenario and cross-domain scenario, including few-shot, zero-shot scenarios, and even works in few-shot new-class scenario where testing set contains new categories. The core component of CrossFi is a sample-similarity calculation network called CSi-Net, which improves the structure of the siamese network by using an attention mechanism to capture similarity information, instead of simply calculating the distance or cosine similarity. Based on it, we develop an extra Weight-Net that can generate a template for each class, so that our CrossFi can work in different scenarios. Experimental results demonstrate that our CrossFi achieves state-of-the-art performance across various scenarios. In gesture recognition task, our CrossFi achieves an accuracy of 98.17% in in-domain scenario, 91.72% in one-shot cross-domain scenario, 64.81% in zero-shot cross-domain scenario, and 84.75% in one-shot new-class scenario. To facilitate future research, we will release the code for our model upon publication.
△ Less
Submitted 20 August, 2024; v1 submitted 20 August, 2024;
originally announced August 2024.
-
DemMamba: Alignment-free Raw Video Demoireing with Frequency-assisted Spatio-Temporal Mamba
Authors:
Shuning Xu,
Xina Liu,
Binbin Song,
Xiangyu Chen,
Qiubo Chen,
Jiantao Zhou
Abstract:
Moire patterns arise when two similar repetitive patterns interfere, a phenomenon frequently observed during the capture of images or videos on screens. The color, shape, and location of moire patterns may differ across video frames, posing a challenge in learning information from adjacent frames and preserving temporal consistency. Previous video demoireing methods heavily rely on well-designed a…
▽ More
Moire patterns arise when two similar repetitive patterns interfere, a phenomenon frequently observed during the capture of images or videos on screens. The color, shape, and location of moire patterns may differ across video frames, posing a challenge in learning information from adjacent frames and preserving temporal consistency. Previous video demoireing methods heavily rely on well-designed alignment modules, resulting in substantial computational burdens. Recently, Mamba, an improved version of the State Space Model (SSM), has demonstrated significant potential for modeling long-range dependencies with linear complexity, enabling efficient temporal modeling in video demoireing without requiring a specific alignment module. In this paper, we propose a novel alignment-free Raw video demoireing network with frequency-assisted spatio-temporal Mamba (DemMamba). The Spatial Mamba Block (SMB) and Temporal Mamba Block (TMB) are sequentially arranged to facilitate effective intra- and inter-relationship modeling in Raw videos with moire patterns. Within SMB, an Adaptive Frequency Block (AFB) is introduced to aid demoireing in the frequency domain. For TMB, a Channel Attention Block (CAB) is embedded to further enhance temporal information interactions by exploiting the inter-channel relationships among features. Extensive experiments demonstrate that our proposed DemMamba surpasses state-of-the-art approaches by 1.3 dB and delivers a superior visual experience.
△ Less
Submitted 20 August, 2024;
originally announced August 2024.