-
AceParse: A Comprehensive Dataset with Diverse Structured Texts for Academic Literature Parsing
Authors:
Huawei Ji,
Cheng Deng,
Bo Xue,
Zhouyang Jin,
Jiaxin Ding,
Xiaoying Gan,
Luoyi Fu,
Xinbing Wang,
Chenghu Zhou
Abstract:
With the development of data-centric AI, the focus has shifted from model-driven approaches to improving data quality. Academic literature, as one of the crucial types, is predominantly stored in PDF formats and needs to be parsed into texts before further processing. However, parsing diverse structured texts in academic literature remains challenging due to the lack of datasets that cover various…
▽ More
With the development of data-centric AI, the focus has shifted from model-driven approaches to improving data quality. Academic literature, as one of the crucial types, is predominantly stored in PDF formats and needs to be parsed into texts before further processing. However, parsing diverse structured texts in academic literature remains challenging due to the lack of datasets that cover various text structures. In this paper, we introduce AceParse, the first comprehensive dataset designed to support the parsing of a wide range of structured texts, including formulas, tables, lists, algorithms, and sentences with embedded mathematical expressions. Based on AceParse, we fine-tuned a multimodal model, named AceParser, which accurately parses various structured texts within academic literature. This model outperforms the previous state-of-the-art by 4.1% in terms of F1 score and by 5% in Jaccard Similarity, demonstrating the potential of multimodal models in academic literature parsing. Our dataset is available at https://github.com/JHW5981/AceParse.
△ Less
Submitted 16 September, 2024;
originally announced September 2024.
-
Scientific and technological knowledge grows linearly over time
Authors:
Huquan Kang,
Luoyi Fu,
Russell J. Funk,
Xinbing Wang,
Jiaxin Ding,
Shiyu Liang,
Jianghao Wang,
Lei Zhou,
Chenghu Zhou
Abstract:
The past few centuries have witnessed a dramatic growth in scientific and technological knowledge. However, the nature of that growth - whether exponential or otherwise - remains controversial, perhaps partly due to the lack of quantitative characterizations. We evaluated knowledge as a collective thinking structure, using citation networks as a representation, by examining extensive datasets that…
▽ More
The past few centuries have witnessed a dramatic growth in scientific and technological knowledge. However, the nature of that growth - whether exponential or otherwise - remains controversial, perhaps partly due to the lack of quantitative characterizations. We evaluated knowledge as a collective thinking structure, using citation networks as a representation, by examining extensive datasets that include 213 million publications (1800-2020) and 7.6 million patents (1976-2020). We found that knowledge - which we conceptualize as the reduction of uncertainty in a knowledge network - grew linearly over time in naturally formed citation networks that themselves expanded exponentially. Moreover, our results revealed inflection points in the growth of knowledge that often corresponded to important developments within fields, such as major breakthroughs, new paradigms, or the emergence of entirely new areas of study. Around these inflection points, knowledge may grow rapidly or exponentially on a local scale, although the overall growth rate remains linear when viewed globally. Previous studies concluding an exponential growth of knowledge may have focused primarily on these local bursts of rapid growth around key developments, leading to the misconception of a global exponential trend. Our findings help to reconcile the discrepancy between the perceived exponential growth and the actual linear growth of knowledge by highlighting the distinction between local and global growth patterns. Overall, our findings reveal major science development trends for policymaking, showing that producing knowledge is far more challenging than producing papers.
△ Less
Submitted 12 September, 2024;
originally announced September 2024.
-
In-Context Imitation Learning via Next-Token Prediction
Authors:
Letian Fu,
Huang Huang,
Gaurav Datta,
Lawrence Yunliang Chen,
William Chung-Ho Panitch,
Fangchen Liu,
Hui Li,
Ken Goldberg
Abstract:
We explore how to enhance next-token prediction models to perform in-context imitation learning on a real robot, where the robot executes new tasks by interpreting contextual information provided during the input phase, without updating its underlying policy parameters. We propose In-Context Robot Transformer (ICRT), a causal transformer that performs autoregressive prediction on sensorimotor traj…
▽ More
We explore how to enhance next-token prediction models to perform in-context imitation learning on a real robot, where the robot executes new tasks by interpreting contextual information provided during the input phase, without updating its underlying policy parameters. We propose In-Context Robot Transformer (ICRT), a causal transformer that performs autoregressive prediction on sensorimotor trajectories without relying on any linguistic data or reward function. This formulation enables flexible and training-free execution of new tasks at test time, achieved by prompting the model with sensorimotor trajectories of the new task composing of image observations, actions and states tuples, collected through human teleoperation. Experiments with a Franka Emika robot demonstrate that the ICRT can adapt to new tasks specified by prompts, even in environment configurations that differ from both the prompt and the training data. In a multitask environment setup, ICRT significantly outperforms current state-of-the-art next-token prediction models in robotics on generalizing to unseen tasks. Code, checkpoints and data are available on https://icrt.dev/
△ Less
Submitted 28 August, 2024;
originally announced August 2024.
-
Unlock the Power of Frozen LLMs in Knowledge Graph Completion
Authors:
Bo Xue,
Yi Xu,
Yunchong Song,
Yiming Pang,
Yuyang Ren,
Jiaxin Ding,
Luoyi Fu,
Xinbing Wang
Abstract:
Classical knowledge graph completion (KGC) methods rely solely on structural information, struggling with the inherent sparsity of knowledge graphs (KGs). Large Language Models (LLMs) learn extensive knowledge from large corpora with powerful context modeling, which is ideal for mitigating the limitations of previous methods. Directly fine-tuning LLMs offers great capability but comes at the cost…
▽ More
Classical knowledge graph completion (KGC) methods rely solely on structural information, struggling with the inherent sparsity of knowledge graphs (KGs). Large Language Models (LLMs) learn extensive knowledge from large corpora with powerful context modeling, which is ideal for mitigating the limitations of previous methods. Directly fine-tuning LLMs offers great capability but comes at the cost of huge time and memory consumption, while utilizing frozen LLMs yields suboptimal results. In this work, we aim to leverage LLMs for KGC effectively and efficiently. We capture the context-aware hidden states of knowledge triples by employing prompts to stimulate the intermediate layers of LLMs. We then train a data-efficient classifier on these hidden states to harness the inherent capabilities of frozen LLMs in KGC. We also generate entity descriptions with subgraph sampling on KGs, reducing the ambiguity of triplets and enriching the knowledge representation. Extensive experiments on standard benchmarks showcase the efficiency and effectiveness of our approach. We outperform classical KGC methods on most datasets and match the performance of fine-tuned LLMs. Additionally, compared to fine-tuned LLMs, we boost GPU memory efficiency by \textbf{$188\times$} and speed up training+inference by \textbf{$13.48\times$}.
△ Less
Submitted 13 August, 2024;
originally announced August 2024.
-
Hybrid SD: Edge-Cloud Collaborative Inference for Stable Diffusion Models
Authors:
Chenqian Yan,
Songwei Liu,
Hongjian Liu,
Xurui Peng,
Xiaojian Wang,
Fangming Chen,
Lean Fu,
Xing Mei
Abstract:
Stable Diffusion Models (SDMs) have shown remarkable proficiency in image synthesis. However, their broad application is impeded by their large model sizes and intensive computational requirements, which typically require expensive cloud servers for deployment. On the flip side, while there are many compact models tailored for edge devices that can reduce these demands, they often compromise on se…
▽ More
Stable Diffusion Models (SDMs) have shown remarkable proficiency in image synthesis. However, their broad application is impeded by their large model sizes and intensive computational requirements, which typically require expensive cloud servers for deployment. On the flip side, while there are many compact models tailored for edge devices that can reduce these demands, they often compromise on semantic integrity and visual quality when compared to full-sized SDMs. To bridge this gap, we introduce Hybrid SD, an innovative, training-free SDMs inference framework designed for edge-cloud collaborative inference. Hybrid SD distributes the early steps of the diffusion process to the large models deployed on cloud servers, enhancing semantic planning. Furthermore, small efficient models deployed on edge devices can be integrated for refining visual details in the later stages. Acknowledging the diversity of edge devices with differing computational and storage capacities, we employ structural pruning to the SDMs U-Net and train a lightweight VAE. Empirical evaluations demonstrate that our compressed models achieve state-of-the-art parameter efficiency (225.8M) on edge devices with competitive image quality. Additionally, Hybrid SD reduces the cloud cost by 66% with edge-cloud collaborative inference.
△ Less
Submitted 13 August, 2024;
originally announced August 2024.
-
AutoFAIR : Automatic Data FAIRification via Machine Reading
Authors:
Tingyan Ma,
Wei Liu,
Bin Lu,
Xiaoying Gan,
Yunqiang Zhu,
Luoyi Fu,
Chenghu Zhou
Abstract:
The explosive growth of data fuels data-driven research, facilitating progress across diverse domains. The FAIR principles emerge as a guiding standard, aiming to enhance the findability, accessibility, interoperability, and reusability of data. However, current efforts primarily focus on manual data FAIRification, which can only handle targeted data and lack efficiency. To address this issue, we…
▽ More
The explosive growth of data fuels data-driven research, facilitating progress across diverse domains. The FAIR principles emerge as a guiding standard, aiming to enhance the findability, accessibility, interoperability, and reusability of data. However, current efforts primarily focus on manual data FAIRification, which can only handle targeted data and lack efficiency. To address this issue, we propose AutoFAIR, an architecture designed to enhance data FAIRness automately. Firstly, We align each data and metadata operation with specific FAIR indicators to guide machine-executable actions. Then, We utilize Web Reader to automatically extract metadata based on language models, even in the absence of structured data webpage schemas. Subsequently, FAIR Alignment is employed to make metadata comply with FAIR principles by ontology guidance and semantic matching. Finally, by applying AutoFAIR to various data, especially in the field of mountain hazards, we observe significant improvements in findability, accessibility, interoperability, and reusability of data. The FAIRness scores before and after applying AutoFAIR indicate enhanced data value.
△ Less
Submitted 7 August, 2024;
originally announced August 2024.
-
LLM Stability: A detailed analysis with some surprises
Authors:
Berk Atil,
Alexa Chittams,
Liseng Fu,
Ferhan Ture,
Lixinyu Xu,
Breck Baldwin
Abstract:
LLM (large language model) practitioners commonly notice that outputs can vary for the same inputs, but we have been unable to find work that evaluates LLM stability as the main objective. In our study of 6 deterministically configured LLMs across 8 common tasks with 5 identical runs, we see accuracy variations up to 10\%. In addition, no LLM consistently delivers repeatable accuracy across all ta…
▽ More
LLM (large language model) practitioners commonly notice that outputs can vary for the same inputs, but we have been unable to find work that evaluates LLM stability as the main objective. In our study of 6 deterministically configured LLMs across 8 common tasks with 5 identical runs, we see accuracy variations up to 10\%. In addition, no LLM consistently delivers repeatable accuracy across all tasks. We also show examples of variation that are not normally distributed and compare configurations with zero-shot/few-shot prompting and fine-tuned examples. To better quantify what is going on, we introduce metrics focused on stability: TARr@N for the total agreement rate at N runs over raw output, and TARa@N for total agreement over parsed-out answers. We suggest that stability metrics be integrated into leader boards and research results going forward.
△ Less
Submitted 12 September, 2024; v1 submitted 6 August, 2024;
originally announced August 2024.
-
A Role-specific Guided Large Language Model for Ophthalmic Consultation Based on Stylistic Differentiation
Authors:
Laiyi Fu,
Binbin Fan,
Hongkai Du,
Yanxiang Feng,
Chunhua Li,
Huping Song
Abstract:
Ophthalmology consultations are crucial for diagnosing, treating, and preventing eye diseases. However, the growing demand for consultations exceeds the availability of ophthalmologists. By leveraging large pre-trained language models, we can design effective dialogues for specific scenarios, aiding in consultations. Traditional fine-tuning strategies for question-answering tasks are impractical d…
▽ More
Ophthalmology consultations are crucial for diagnosing, treating, and preventing eye diseases. However, the growing demand for consultations exceeds the availability of ophthalmologists. By leveraging large pre-trained language models, we can design effective dialogues for specific scenarios, aiding in consultations. Traditional fine-tuning strategies for question-answering tasks are impractical due to increasing model size and often ignoring patient-doctor role function during consultations. In this paper, we propose EyeDoctor, an ophthalmic medical questioning large language model that enhances accuracy through doctor-patient role perception guided and an augmented knowledge base with external disease information. Experimental results show EyeDoctor achieves higher question-answering precision in ophthalmology consultations. Notably, EyeDoctor demonstrated a 7.25% improvement in Rouge-1 scores and a 10.16% improvement in F1 scores on multi-round datasets compared to second best model ChatGPT, highlighting the importance of doctor-patient role differentiation and dynamic knowledge base expansion for intelligent medical consultations. EyeDoc also serves as a free available web based service and souce code is available at https://github.com/sperfu/EyeDoc.
△ Less
Submitted 31 July, 2024; v1 submitted 25 July, 2024;
originally announced July 2024.
-
Exterior Penalty Policy Optimization with Penalty Metric Network under Constraints
Authors:
Shiqing Gao,
Jiaxin Ding,
Luoyi Fu,
Xinbing Wang,
Chenghu Zhou
Abstract:
In Constrained Reinforcement Learning (CRL), agents explore the environment to learn the optimal policy while satisfying constraints. The penalty function method has recently been studied as an effective approach for handling constraints, which imposes constraints penalties on the objective to transform the constrained problem into an unconstrained one. However, it is challenging to choose appropr…
▽ More
In Constrained Reinforcement Learning (CRL), agents explore the environment to learn the optimal policy while satisfying constraints. The penalty function method has recently been studied as an effective approach for handling constraints, which imposes constraints penalties on the objective to transform the constrained problem into an unconstrained one. However, it is challenging to choose appropriate penalties that balance policy performance and constraint satisfaction efficiently. In this paper, we propose a theoretically guaranteed penalty function method, Exterior Penalty Policy Optimization (EPO), with adaptive penalties generated by a Penalty Metric Network (PMN). PMN responds appropriately to varying degrees of constraint violations, enabling efficient constraint satisfaction and safe exploration. We theoretically prove that EPO consistently improves constraint satisfaction with a convergence guarantee. We propose a new surrogate function and provide worst-case constraint violation and approximation error. In practice, we propose an effective smooth penalty function, which can be easily implemented with a first-order optimizer. Extensive experiments are conducted, showing that EPO outperforms the baselines in terms of policy performance and constraint satisfaction with a stable training process, particularly on complex tasks.
△ Less
Submitted 22 July, 2024;
originally announced July 2024.
-
SINKT: A Structure-Aware Inductive Knowledge Tracing Model with Large Language Model
Authors:
Lingyue Fu,
Hao Guan,
Kounianhua Du,
Jianghao Lin,
Wei Xia,
Weinan Zhang,
Ruiming Tang,
Yasheng Wang,
Yong Yu
Abstract:
Knowledge Tracing (KT) aims to determine whether students will respond correctly to the next question, which is a crucial task in intelligent tutoring systems (ITS). In educational KT scenarios, transductive ID-based methods often face severe data sparsity and cold start problems, where interactions between individual students and questions are sparse, and new questions and concepts consistently a…
▽ More
Knowledge Tracing (KT) aims to determine whether students will respond correctly to the next question, which is a crucial task in intelligent tutoring systems (ITS). In educational KT scenarios, transductive ID-based methods often face severe data sparsity and cold start problems, where interactions between individual students and questions are sparse, and new questions and concepts consistently arrive in the database. In addition, existing KT models only implicitly consider the correlation between concepts and questions, lacking direct modeling of the more complex relationships in the heterogeneous graph of concepts and questions. In this paper, we propose a Structure-aware Inductive Knowledge Tracing model with large language model (dubbed SINKT), which, for the first time, introduces large language models (LLMs) and realizes inductive knowledge tracing. Firstly, SINKT utilizes LLMs to introduce structural relationships between concepts and constructs a heterogeneous graph for concepts and questions. Secondly, by encoding concepts and questions with LLMs, SINKT incorporates semantic information to aid prediction. Finally, SINKT predicts the student's response to the target question by interacting with the student's knowledge state and the question representation. Experiments on four real-world datasets demonstrate that SINKT achieves state-of-the-art performance among 12 existing transductive KT models. Additionally, we explore the performance of SINKT on the inductive KT task and provide insights into various modules.
△ Less
Submitted 23 July, 2024; v1 submitted 1 July, 2024;
originally announced July 2024.
-
FoldGPT: Simple and Effective Large Language Model Compression Scheme
Authors:
Songwei Liu,
Chao Zeng,
Lianqiang Li,
Chenqian Yan,
Lean Fu,
Xing Mei,
Fangmin Chen
Abstract:
The demand for deploying large language models(LLMs) on mobile devices continues to increase, driven by escalating data security concerns and cloud costs. However, network bandwidth and memory limitations pose challenges for deploying billion-level models on mobile devices. In this study, we investigate the outputs of different layers across various scales of LLMs and found that the outputs of mos…
▽ More
The demand for deploying large language models(LLMs) on mobile devices continues to increase, driven by escalating data security concerns and cloud costs. However, network bandwidth and memory limitations pose challenges for deploying billion-level models on mobile devices. In this study, we investigate the outputs of different layers across various scales of LLMs and found that the outputs of most layers exhibit significant similarity. Moreover, this similarity becomes more pronounced as the model size increases, indicating substantial redundancy in the depth direction of the LLMs. Based on this observation, we propose an efficient model volume compression strategy, termed FoldGPT, which combines block removal and block parameter sharing.This strategy consists of three parts: (1) Based on the learnable gating parameters, we determine the block importance ranking while modeling the coupling effect between blocks. Then we delete some redundant layers based on the given removal rate. (2) For the retained blocks, we apply a specially designed group parameter sharing strategy, where blocks within the same group share identical weights, significantly compressing the number of parameters and slightly reducing latency overhead. (3) After sharing these Blocks, we "cure" the mismatch caused by sparsity with a minor amount of fine-tuning and introduce a tail-layer distillation strategy to improve the performance. Experiments demonstrate that FoldGPT outperforms previous state-of-the-art(SOTA) methods in efficient model compression, demonstrating the feasibility of achieving model lightweighting through straightforward block removal and parameter sharing.
△ Less
Submitted 30 June, 2024;
originally announced July 2024.
-
A Federated Online Restless Bandit Framework for Cooperative Resource Allocation
Authors:
Jingwen Tong,
Xinran Li,
Liqun Fu,
Jun Zhang,
Khaled B. Letaief
Abstract:
Restless multi-armed bandits (RMABs) have been widely utilized to address resource allocation problems with Markov reward processes (MRPs). Existing works often assume that the dynamics of MRPs are known prior, which makes the RMAB problem solvable from an optimization perspective. Nevertheless, an efficient learning-based solution for RMABs with unknown system dynamics remains an open problem. In…
▽ More
Restless multi-armed bandits (RMABs) have been widely utilized to address resource allocation problems with Markov reward processes (MRPs). Existing works often assume that the dynamics of MRPs are known prior, which makes the RMAB problem solvable from an optimization perspective. Nevertheless, an efficient learning-based solution for RMABs with unknown system dynamics remains an open problem. In this paper, we study the cooperative resource allocation problem with unknown system dynamics of MRPs. This problem can be modeled as a multi-agent online RMAB problem, where multiple agents collaboratively learn the system dynamics while maximizing their accumulated rewards. We devise a federated online RMAB framework to mitigate the communication overhead and data privacy issue by adopting the federated learning paradigm. Based on this framework, we put forth a Federated Thompson Sampling-enabled Whittle Index (FedTSWI) algorithm to solve this multi-agent online RMAB problem. The FedTSWI algorithm enjoys a high communication and computation efficiency, and a privacy guarantee. Moreover, we derive a regret upper bound for the FedTSWI algorithm. Finally, we demonstrate the effectiveness of the proposed algorithm on the case of online multi-user multi-channel access. Numerical results show that the proposed algorithm achieves a fast convergence rate of $\mathcal{O}(\sqrt{T\log(T)})$ and better performance compared with baselines. More importantly, its sample complexity decreases with the number of agents.
△ Less
Submitted 12 June, 2024;
originally announced June 2024.
-
Differentiation of Multi-objective Data-driven Decision Pipeline
Authors:
Peng Li,
Lixia Wu,
Chaoqun Feng,
Haoyuan Hu,
Lei Fu,
Jieping Ye
Abstract:
Real-world scenarios frequently involve multi-objective data-driven optimization problems, characterized by unknown problem coefficients and multiple conflicting objectives. Traditional two-stage methods independently apply a machine learning model to estimate problem coefficients, followed by invoking a solver to tackle the predicted optimization problem. The independent use of optimization solve…
▽ More
Real-world scenarios frequently involve multi-objective data-driven optimization problems, characterized by unknown problem coefficients and multiple conflicting objectives. Traditional two-stage methods independently apply a machine learning model to estimate problem coefficients, followed by invoking a solver to tackle the predicted optimization problem. The independent use of optimization solvers and prediction models may lead to suboptimal performance due to mismatches between their objectives. Recent efforts have focused on end-to-end training of predictive models that use decision loss derived from the downstream optimization problem. However, these methods have primarily focused on single-objective optimization problems, thus limiting their applicability. We aim to propose a multi-objective decision-focused approach to address this gap. In order to better align with the inherent properties of multi-objective optimization problems, we propose a set of novel loss functions. These loss functions are designed to capture the discrepancies between predicted and true decision problems, considering solution space, objective space, and decision quality, named landscape loss, Pareto set loss, and decision loss, respectively. Our experimental results demonstrate that our proposed method significantly outperforms traditional two-stage methods and most current decision-focused methods.
△ Less
Submitted 2 June, 2024;
originally announced June 2024.
-
PatchScaler: An Efficient Patch-Independent Diffusion Model for Super-Resolution
Authors:
Yong Liu,
Hang Dong,
Jinshan Pan,
Qingji Dong,
Kai Chen,
Rongxiang Zhang,
Lean Fu,
Fei Wang
Abstract:
Diffusion models significantly improve the quality of super-resolved images with their impressive content generation capabilities. However, the huge computational costs limit the applications of these methods.Recent efforts have explored reasonable inference acceleration to reduce the number of sampling steps, but the computational cost remains high as each step is performed on the entire image.Th…
▽ More
Diffusion models significantly improve the quality of super-resolved images with their impressive content generation capabilities. However, the huge computational costs limit the applications of these methods.Recent efforts have explored reasonable inference acceleration to reduce the number of sampling steps, but the computational cost remains high as each step is performed on the entire image.This paper introduces PatchScaler, a patch-independent diffusion-based single image super-resolution (SR) method, designed to enhance the efficiency of the inference process.The proposed method is motivated by the observation that not all the image patches within an image need the same sampling steps for reconstructing high-resolution images.Based on this observation, we thus develop a Patch-adaptive Group Sampling (PGS) to divide feature patches into different groups according to the patch-level reconstruction difficulty and dynamically assign an appropriate sampling configuration for each group so that the inference speed can be better accelerated.In addition, to improve the denoising ability at each step of the sampling, we develop a texture prompt to guide the estimations of the diffusion model by retrieving high-quality texture priors from a patch-independent reference texture memory.Experiments show that our PatchScaler achieves favorable performance in both quantitative and qualitative evaluations with fast inference speed.Our code and model are available at \url{https://github.com/yongliuy/PatchScaler}.
△ Less
Submitted 11 June, 2024; v1 submitted 27 May, 2024;
originally announced May 2024.
-
Dataset and Benchmark for Urdu Natural Scenes Text Detection, Recognition and Visual Question Answering
Authors:
Hiba Maryam,
Ling Fu,
Jiajun Song,
Tajrian ABM Shafayet,
Qidi Luo,
Xiang Bai,
Yuliang Liu
Abstract:
The development of Urdu scene text detection, recognition, and Visual Question Answering (VQA) technologies is crucial for advancing accessibility, information retrieval, and linguistic diversity in digital content, facilitating better understanding and interaction with Urdu-language visual data. This initiative seeks to bridge the gap between textual and visual comprehension. We propose a new mul…
▽ More
The development of Urdu scene text detection, recognition, and Visual Question Answering (VQA) technologies is crucial for advancing accessibility, information retrieval, and linguistic diversity in digital content, facilitating better understanding and interaction with Urdu-language visual data. This initiative seeks to bridge the gap between textual and visual comprehension. We propose a new multi-task Urdu scene text dataset comprising over 1000 natural scene images, which can be used for text detection, recognition, and VQA tasks. We provide fine-grained annotations for text instances, addressing the limitations of previous datasets for facing arbitrary-shaped texts. By incorporating additional annotation points, this dataset facilitates the development and assessment of methods that can handle diverse text layouts, intricate shapes, and non-standard orientations commonly encountered in real-world scenarios. Besides, the VQA annotations make it the first benchmark for the Urdu Text VQA method, which can prompt the development of Urdu scene text understanding. The proposed dataset is available at: https://github.com/Hiba-MeiRuan/Urdu-VQA-Dataset-/tree/main
△ Less
Submitted 21 May, 2024;
originally announced May 2024.
-
The First Swahili Language Scene Text Detection and Recognition Dataset
Authors:
Fadila Wendigoundi Douamba,
Jianjun Song,
Ling Fu,
Yuliang Liu,
Xiang Bai
Abstract:
Scene text recognition is essential in many applications, including automated translation, information retrieval, driving assistance, and enhancing accessibility for individuals with visual impairments. Much research has been done to improve the accuracy and performance of scene text detection and recognition models. However, most of this research has been conducted in the most common languages, E…
▽ More
Scene text recognition is essential in many applications, including automated translation, information retrieval, driving assistance, and enhancing accessibility for individuals with visual impairments. Much research has been done to improve the accuracy and performance of scene text detection and recognition models. However, most of this research has been conducted in the most common languages, English and Chinese. There is a significant gap in low-resource languages, especially the Swahili Language. Swahili is widely spoken in East African countries but is still an under-explored language in scene text recognition. No studies have been focused explicitly on Swahili natural scene text detection and recognition, and no dataset for Swahili language scene text detection and recognition is publicly available. We propose a comprehensive dataset of Swahili scene text images and evaluate the dataset on different scene text detection and recognition models. The dataset contains 976 images collected in different places and under various circumstances. Each image has its annotation at the word level. The proposed dataset can also serve as a benchmark dataset specific to the Swahili language for evaluating and comparing different approaches and fostering future research endeavors. The dataset is available on GitHub via this link: https://github.com/FadilaW/Swahili-STR-Dataset
△ Less
Submitted 18 May, 2024;
originally announced May 2024.
-
Progressive enhancement and restoration for mural images under low-light and defected conditions based on multi-receptive field strategy
Authors:
Xiameng Wei,
Binbin Fan,
Ying Wang,
Yanxiang Feng,
Laiyi Fu
Abstract:
Ancient murals are valuable cultural heritage with great archaeological value. They provide insights into ancient religions, ceremonies, folklore, among other things through their content. However, due to long-term oxidation and inadequate protection, ancient murals have suffered continuous damage, including peeling and mold etc. Additionally, since ancient murals were typically painted indoors, t…
▽ More
Ancient murals are valuable cultural heritage with great archaeological value. They provide insights into ancient religions, ceremonies, folklore, among other things through their content. However, due to long-term oxidation and inadequate protection, ancient murals have suffered continuous damage, including peeling and mold etc. Additionally, since ancient murals were typically painted indoors, the light intensity in images captured by digital devices is often low. The poor visibility hampers the further restoration of damaged areas. To address the escalating damage to ancient murals and facilitate batch restoration at archaeological sites, we propose a two-stage restoration model with automatic defect area detection strategy which called MER(Mural Enhancement and Restoration net) for ancient murals that are damaged and have been captured in low light. Our two-stage model not only enhances the visual quality of restored images but also achieves commendable results in relevant metric evaluations compared with other competitors. Furthermore, we have launched a website dedicated to the restoration of ancient mural paintings, utilizing the proposed model. Code is available at https://gitee.com/bbfan2024/MER.git.
△ Less
Submitted 16 July, 2024; v1 submitted 13 May, 2024;
originally announced May 2024.
-
OXYGENERATOR: Reconstructing Global Ocean Deoxygenation Over a Century with Deep Learning
Authors:
Bin Lu,
Ze Zhao,
Luyu Han,
Xiaoying Gan,
Yuntao Zhou,
Lei Zhou,
Luoyi Fu,
Xinbing Wang,
Chenghu Zhou,
Jing Zhang
Abstract:
Accurately reconstructing the global ocean deoxygenation over a century is crucial for assessing and protecting marine ecosystem. Existing expert-dominated numerical simulations fail to catch up with the dynamic variation caused by global warming and human activities. Besides, due to the high-cost data collection, the historical observations are severely sparse, leading to big challenge for precis…
▽ More
Accurately reconstructing the global ocean deoxygenation over a century is crucial for assessing and protecting marine ecosystem. Existing expert-dominated numerical simulations fail to catch up with the dynamic variation caused by global warming and human activities. Besides, due to the high-cost data collection, the historical observations are severely sparse, leading to big challenge for precise reconstruction. In this work, we propose OxyGenerator, the first deep learning based model, to reconstruct the global ocean deoxygenation from 1920 to 2023. Specifically, to address the heterogeneity across large temporal and spatial scales, we propose zoning-varying graph message-passing to capture the complex oceanographic correlations between missing values and sparse observations. Additionally, to further calibrate the uncertainty, we incorporate inductive bias from dissolved oxygen (DO) variations and chemical effects. Compared with in-situ DO observations, OxyGenerator significantly outperforms CMIP6 numerical simulations, reducing MAPE by 38.77%, demonstrating a promising potential to understand the "breathless ocean" in data-driven manner.
△ Less
Submitted 12 May, 2024;
originally announced May 2024.
-
Site-Specific Deployment Optimization of Intelligent Reflecting Surface for Coverage Enhancement
Authors:
Dongsheng Fu,
Xintong Chen,
Jiangbin Lyu,
Liqun Fu
Abstract:
Intelligent Reflecting Surface (IRS) is a promising technology for next generation wireless networks. Despite substantial research in IRS-aided communications, the assumed antenna and channel models are typically simplified without considering site-specific characteristics, which in turn critically affect the IRS deployment and performance in a given environment. In this paper, we first investigat…
▽ More
Intelligent Reflecting Surface (IRS) is a promising technology for next generation wireless networks. Despite substantial research in IRS-aided communications, the assumed antenna and channel models are typically simplified without considering site-specific characteristics, which in turn critically affect the IRS deployment and performance in a given environment. In this paper, we first investigate the link-level performance of active or passive IRS taking into account the IRS element radiation pattern (ERP) as well as the antenna radiation pattern of the access point (AP). Then the network-level coverage performance is evaluated/optimized in site-specific multi-building scenarios, by properly deploying multiple IRSs on candidate building facets to serve a given set of users or Points of Interests (PoIs). The problem is reduced to an integer linear programming (ILP) based on given link-level metrics, which is then solved efficiently under moderate network sizes. Numerical results confirm the impact of AP antenna/IRS element pattern on the link-level performance. In addition, it is found that active IRSs, though associated with higher hardware complexity and cost, significantly improve the site-specific network coverage performance in terms of average ergodic rate and fairness among the PoIs as well as the range of serving area, compared with passive IRSs that have a much larger number of elements.
△ Less
Submitted 5 May, 2024;
originally announced May 2024.
-
AFDM Channel Estimation in Multi-Scale Multi-Lag Channels
Authors:
Rongyou Cao,
Yuheng Zhong,
Jiangbin Lyu,
Deqing Wang,
Liqun Fu
Abstract:
Affine Frequency Division Multiplexing (AFDM) is a brand new chirp-based multi-carrier (MC) waveform for high mobility communications, with promising advantages over Orthogonal Frequency Division Multiplexing (OFDM) and other MC waveforms. Existing AFDM research focuses on wireless communication at high carrier frequency (CF), which typically considers only Doppler frequency shift (DFS) as a resul…
▽ More
Affine Frequency Division Multiplexing (AFDM) is a brand new chirp-based multi-carrier (MC) waveform for high mobility communications, with promising advantages over Orthogonal Frequency Division Multiplexing (OFDM) and other MC waveforms. Existing AFDM research focuses on wireless communication at high carrier frequency (CF), which typically considers only Doppler frequency shift (DFS) as a result of mobility, while ignoring the accompanied Doppler time scaling (DTS) on waveform. However, for underwater acoustic (UWA) communication at much lower CF and propagating at speed of sound, the DTS effect could not be ignored and poses significant challenges for channel estimation. This paper analyzes the channel frequency response (CFR) of AFDM under multi-scale multi-lag (MSML) channels, where each propagating path could have different delay and DFS/DTS. Based on the newly derived input-output formula and its characteristics, two new channel estimation methods are proposed, i.e., AFDM with iterative multi-index (AFDM-IMI) estimation under low to moderate DTS, and AFDM with orthogonal matching pursuit (AFDM-OMP) estimation under high DTS. Numerical results confirm the effectiveness of the proposed methods against the original AFDM channel estimation method. Moreover, the resulted AFDM system outperforms OFDM as well as Orthogonal Chirp Division Multiplexing (OCDM) in terms of channel estimation accuracy and bit error rate (BER), which is consistent with our theoretical analysis based on CFR overlap probability (COP), mutual incoherent property (MIP) and channel diversity gain under MSML channels.
△ Less
Submitted 4 May, 2024;
originally announced May 2024.
-
Fast Online Movement Optimization of Aerial Base Stations Based on Global Connectivity Map
Authors:
Yiling Wang,
Jiangbin Lyu,
Liqun Fu
Abstract:
Unmanned aerial vehicles (UAVs) can serve as aerial base stations (ABSs) to provide wireless connectivity for ground users (GUs) in diverse scenarios. However, it is an NP-hard problem with exponential complexity in $M$ and $N$, in order to maximize the coverage rate (CR) of $M$ GUs by jointly placing $N$ ABSs with limited coverage range. This problem becomes even more intricate when the coverage…
▽ More
Unmanned aerial vehicles (UAVs) can serve as aerial base stations (ABSs) to provide wireless connectivity for ground users (GUs) in diverse scenarios. However, it is an NP-hard problem with exponential complexity in $M$ and $N$, in order to maximize the coverage rate (CR) of $M$ GUs by jointly placing $N$ ABSs with limited coverage range. This problem becomes even more intricate when the coverage range becomes irregular due to site-specific obstructions (e.g., buildings) on the air-ground channel, and/or when the GUs are in motion. To address the above challenges, we study a multi-ABS movement optimization problem to maximize the average coverage rate of mobile GUs within a site-specific environment. We tackle this challenging problem by 1) constructing the global connectivity map (GCM) which contains the connectivity information between given pairs of ABS/GU locations; 2) partitioning the ABS movement problem into ABS placement sub-problems and formulate each sub-problem into a binary integer linear programing (BILP) problem based on GCM; 3) proposing a fast online algorithm to execute (one-pass) projected stochastic subgradient descent within the dual space to rapidly solve the BILP problem with near-optimal performance. Numerical results demonstrate that our proposed algorithm achieves a high CR performance close to that obtained by the open source solver (SCIP), yet with significantly reduced running time. In addition, the algorithm also notably outperforms one of the state-of-the-art deep reinforcement learning (DRL) methods and the K-means initiated evolutionary algorithm in terms of CR performance and/or time efficiency.
△ Less
Submitted 4 May, 2024;
originally announced May 2024.
-
CodeGRAG: Extracting Composed Syntax Graphs for Retrieval Augmented Cross-Lingual Code Generation
Authors:
Kounianhua Du,
Renting Rui,
Huacan Chai,
Lingyue Fu,
Wei Xia,
Yasheng Wang,
Ruiming Tang,
Yong Yu,
Weinan Zhang
Abstract:
Utilizing large language models to generate codes has shown promising meaning in software development revolution. Despite the intelligence shown by the general large language models, their specificity in code generation can still be improved due to the syntactic gap and mismatched vocabulary existing among natural language and different programming languages. In addition, programming languages are…
▽ More
Utilizing large language models to generate codes has shown promising meaning in software development revolution. Despite the intelligence shown by the general large language models, their specificity in code generation can still be improved due to the syntactic gap and mismatched vocabulary existing among natural language and different programming languages. In addition, programming languages are inherently logical and complex, making them hard to be correctly generated. Existing methods rely on multiple prompts to the large language model to explore better solutions, which is expensive. In this paper, we propose Syntax Graph Retrieval Augmented Code Generation (CodeGRAG) to enhance the performance of LLMs in single-round code generation tasks. CodeGRAG extracts and summarizes the control flow and data flow of code blocks to fill the gap between programming languages and natural language. The extracted external structural knowledge models the inherent flows of code blocks, which can facilitate LLMs for better understanding of code syntax and serve as a bridge among different programming languages. CodeGRAG significantly improves the code generation ability of LLMs and can even offer performance gain for cross-lingual code generation, e.g., C++ for Python.
△ Less
Submitted 2 May, 2024;
originally announced May 2024.
-
RepEval: Effective Text Evaluation with LLM Representation
Authors:
Shuqian Sheng,
Yi Xu,
Tianhang Zhang,
Zanwei Shen,
Luoyi Fu,
Jiaxin Ding,
Lei Zhou,
Xinbing Wang,
Chenghu Zhou
Abstract:
Automatic evaluation metrics for generated texts play an important role in the NLG field, especially with the rapid growth of LLMs. However, existing metrics are often limited to specific scenarios, making it challenging to meet the evaluation requirements of expanding LLM applications. Therefore, there is a demand for new, flexible, and effective metrics. In this study, we introduce RepEval, the…
▽ More
Automatic evaluation metrics for generated texts play an important role in the NLG field, especially with the rapid growth of LLMs. However, existing metrics are often limited to specific scenarios, making it challenging to meet the evaluation requirements of expanding LLM applications. Therefore, there is a demand for new, flexible, and effective metrics. In this study, we introduce RepEval, the first metric leveraging the projection of LLM representations for evaluation. RepEval requires minimal sample pairs for training, and through simple prompt modifications, it can easily transition to various tasks. Results on ten datasets from three tasks demonstrate the high effectiveness of our method, which exhibits stronger correlations with human judgments compared to previous metrics, even outperforming GPT-4. Our work underscores the richness of information regarding text quality embedded within LLM representations, offering insights for the development of new metrics.
△ Less
Submitted 30 April, 2024;
originally announced April 2024.
-
Patent Value Characterization -- An Empirical Analysis of Elevator Industry Patents
Authors:
Yuhang Guan,
Runzheng Wang,
Lei Fu,
Huanle Zhang
Abstract:
The global patent application count has steadily increased, achieving eight consecutive years of growth.The global patent industry has shown a general trend of expansion. This is attributed to the increasing innovation activities, particularly in the fields of technology, healthcare, and biotechnology. Some emerging market countries, such as China and India, have experienced significant growth in…
▽ More
The global patent application count has steadily increased, achieving eight consecutive years of growth.The global patent industry has shown a general trend of expansion. This is attributed to the increasing innovation activities, particularly in the fields of technology, healthcare, and biotechnology. Some emerging market countries, such as China and India, have experienced significant growth in the patent domain, becoming important participants in global patent activities.
△ Less
Submitted 20 February, 2024;
originally announced April 2024.
-
Second Edition FRCSyn Challenge at CVPR 2024: Face Recognition Challenge in the Era of Synthetic Data
Authors:
Ivan DeAndres-Tame,
Ruben Tolosana,
Pietro Melzi,
Ruben Vera-Rodriguez,
Minchul Kim,
Christian Rathgeb,
Xiaoming Liu,
Aythami Morales,
Julian Fierrez,
Javier Ortega-Garcia,
Zhizhou Zhong,
Yuge Huang,
Yuxi Mi,
Shouhong Ding,
Shuigeng Zhou,
Shuai He,
Lingzhi Fu,
Heng Cong,
Rongyu Zhang,
Zhihong Xiao,
Evgeny Smirnov,
Anton Pimenov,
Aleksei Grigorev,
Denis Timoshenko,
Kaleb Mesfin Asfaw
, et al. (33 additional authors not shown)
Abstract:
Synthetic data is gaining increasing relevance for training machine learning models. This is mainly motivated due to several factors such as the lack of real data and intra-class variability, time and errors produced in manual labeling, and in some cases privacy concerns, among others. This paper presents an overview of the 2nd edition of the Face Recognition Challenge in the Era of Synthetic Data…
▽ More
Synthetic data is gaining increasing relevance for training machine learning models. This is mainly motivated due to several factors such as the lack of real data and intra-class variability, time and errors produced in manual labeling, and in some cases privacy concerns, among others. This paper presents an overview of the 2nd edition of the Face Recognition Challenge in the Era of Synthetic Data (FRCSyn) organized at CVPR 2024. FRCSyn aims to investigate the use of synthetic data in face recognition to address current technological limitations, including data privacy concerns, demographic biases, generalization to novel scenarios, and performance constraints in challenging situations such as aging, pose variations, and occlusions. Unlike the 1st edition, in which synthetic data from DCFace and GANDiffFace methods was only allowed to train face recognition systems, in this 2nd edition we propose new sub-tasks that allow participants to explore novel face generative methods. The outcomes of the 2nd FRCSyn Challenge, along with the proposed experimental protocol and benchmarking contribute significantly to the application of synthetic data to face recognition.
△ Less
Submitted 16 April, 2024;
originally announced April 2024.
-
Characterizing the Influence of Topology on Graph Learning Tasks
Authors:
Kailong Wu,
Yule Xie,
Jiaxin Ding,
Yuxiang Ren,
Luoyi Fu,
Xinbing Wang,
Chenghu Zhou
Abstract:
Graph neural networks (GNN) have achieved remarkable success in a wide range of tasks by encoding features combined with topology to create effective representations. However, the fundamental problem of understanding and analyzing how graph topology influences the performance of learning models on downstream tasks has not yet been well understood. In this paper, we propose a metric, TopoInf, which…
▽ More
Graph neural networks (GNN) have achieved remarkable success in a wide range of tasks by encoding features combined with topology to create effective representations. However, the fundamental problem of understanding and analyzing how graph topology influences the performance of learning models on downstream tasks has not yet been well understood. In this paper, we propose a metric, TopoInf, which characterizes the influence of graph topology by measuring the level of compatibility between the topological information of graph data and downstream task objectives. We provide analysis based on the decoupled GNNs on the contextual stochastic block model to demonstrate the effectiveness of the metric. Through extensive experiments, we demonstrate that TopoInf is an effective metric for measuring topological influence on corresponding tasks and can be further leveraged to enhance graph learning.
△ Less
Submitted 11 April, 2024;
originally announced April 2024.
-
UniFL: Improve Stable Diffusion via Unified Feedback Learning
Authors:
Jiacheng Zhang,
Jie Wu,
Yuxi Ren,
Xin Xia,
Huafeng Kuang,
Pan Xie,
Jiashi Li,
Xuefeng Xiao,
Min Zheng,
Lean Fu,
Guanbin Li
Abstract:
Diffusion models have revolutionized the field of image generation, leading to the proliferation of high-quality models and diverse downstream applications. However, despite these significant advancements, the current competitive solutions still suffer from several limitations, including inferior visual quality, a lack of aesthetic appeal, and inefficient inference, without a comprehensive solutio…
▽ More
Diffusion models have revolutionized the field of image generation, leading to the proliferation of high-quality models and diverse downstream applications. However, despite these significant advancements, the current competitive solutions still suffer from several limitations, including inferior visual quality, a lack of aesthetic appeal, and inefficient inference, without a comprehensive solution in sight. To address these challenges, we present UniFL, a unified framework that leverages feedback learning to enhance diffusion models comprehensively. UniFL stands out as a universal, effective, and generalizable solution applicable to various diffusion models, such as SD1.5 and SDXL. Notably, UniFL incorporates three key components: perceptual feedback learning, which enhances visual quality; decoupled feedback learning, which improves aesthetic appeal; and adversarial feedback learning, which optimizes inference speed. In-depth experiments and extensive user studies validate the superior performance of our proposed method in enhancing both the quality of generated models and their acceleration. For instance, UniFL surpasses ImageReward by 17% user preference in terms of generation quality and outperforms LCM and SDXL Turbo by 57% and 20% in 4-step inference. Moreover, we have verified the efficacy of our approach in downstream tasks, including Lora, ControlNet, and AnimateDiff.
△ Less
Submitted 22 May, 2024; v1 submitted 8 April, 2024;
originally announced April 2024.
-
ByteEdit: Boost, Comply and Accelerate Generative Image Editing
Authors:
Yuxi Ren,
Jie Wu,
Yanzuo Lu,
Huafeng Kuang,
Xin Xia,
Xionghui Wang,
Qianqian Wang,
Yixing Zhu,
Pan Xie,
Shiyin Wang,
Xuefeng Xiao,
Yitong Wang,
Min Zheng,
Lean Fu
Abstract:
Recent advancements in diffusion-based generative image editing have sparked a profound revolution, reshaping the landscape of image outpainting and inpainting tasks. Despite these strides, the field grapples with inherent challenges, including: i) inferior quality; ii) poor consistency; iii) insufficient instrcution adherence; iv) suboptimal generation efficiency. To address these obstacles, we p…
▽ More
Recent advancements in diffusion-based generative image editing have sparked a profound revolution, reshaping the landscape of image outpainting and inpainting tasks. Despite these strides, the field grapples with inherent challenges, including: i) inferior quality; ii) poor consistency; iii) insufficient instrcution adherence; iv) suboptimal generation efficiency. To address these obstacles, we present ByteEdit, an innovative feedback learning framework meticulously designed to Boost, Comply, and Accelerate Generative Image Editing tasks. ByteEdit seamlessly integrates image reward models dedicated to enhancing aesthetics and image-text alignment, while also introducing a dense, pixel-level reward model tailored to foster coherence in the output. Furthermore, we propose a pioneering adversarial and progressive feedback learning strategy to expedite the model's inference speed. Through extensive large-scale user evaluations, we demonstrate that ByteEdit surpasses leading generative image editing products, including Adobe, Canva, and MeiTu, in both generation quality and consistency. ByteEdit-Outpainting exhibits a remarkable enhancement of 388% and 135% in quality and consistency, respectively, when compared to the baseline model. Experiments also verfied that our acceleration models maintains excellent performance results in terms of quality and consistency.
△ Less
Submitted 7 April, 2024;
originally announced April 2024.
-
From Learning to Analytics: Improving Model Efficacy with Goal-Directed Client Selection
Authors:
Jingwen Tong,
Zhenzhen Chen,
Liqun Fu,
Jun Zhang,
Zhu Han
Abstract:
Federated learning (FL) is an appealing paradigm for learning a global model among distributed clients while preserving data privacy. Driven by the demand for high-quality user experiences, evaluating the well-trained global model after the FL process is crucial. In this paper, we propose a closed-loop model analytics framework that allows for effective evaluation of the trained global model using…
▽ More
Federated learning (FL) is an appealing paradigm for learning a global model among distributed clients while preserving data privacy. Driven by the demand for high-quality user experiences, evaluating the well-trained global model after the FL process is crucial. In this paper, we propose a closed-loop model analytics framework that allows for effective evaluation of the trained global model using clients' local data. To address the challenges posed by system and data heterogeneities in the FL process, we study a goal-directed client selection problem based on the model analytics framework by selecting a subset of clients for the model training. This problem is formulated as a stochastic multi-armed bandit (SMAB) problem. We first put forth a quick initial upper confidence bound (Quick-Init UCB) algorithm to solve this SMAB problem under the federated analytics (FA) framework. Then, we further propose a belief propagation-based UCB (BP-UCB) algorithm under the democratized analytics (DA) framework. Moreover, we derive two regret upper bounds for the proposed algorithms, which increase logarithmically over the time horizon. The numerical results demonstrate that the proposed algorithms achieve nearly optimal performance, with a gap of less than 1.44% and 3.12% under the FA and DA frameworks, respectively.
△ Less
Submitted 30 March, 2024;
originally announced April 2024.
-
Opportunities and challenges in the application of large artificial intelligence models in radiology
Authors:
Liangrui Pan,
Zhenyu Zhao,
Ying Lu,
Kewei Tang,
Liyong Fu,
Qingchun Liang,
Shaoliang Peng
Abstract:
Influenced by ChatGPT, artificial intelligence (AI) large models have witnessed a global upsurge in large model research and development. As people enjoy the convenience by this AI large model, more and more large models in subdivided fields are gradually being proposed, especially large models in radiology imaging field. This article first introduces the development history of large models, techn…
▽ More
Influenced by ChatGPT, artificial intelligence (AI) large models have witnessed a global upsurge in large model research and development. As people enjoy the convenience by this AI large model, more and more large models in subdivided fields are gradually being proposed, especially large models in radiology imaging field. This article first introduces the development history of large models, technical details, workflow, working principles of multimodal large models and working principles of video generation large models. Secondly, we summarize the latest research progress of AI large models in radiology education, radiology report generation, applications of unimodal and multimodal radiology. Finally, this paper also summarizes some of the challenges of large AI models in radiology, with the aim of better promoting the rapid revolution in the field of radiography.
△ Less
Submitted 24 March, 2024;
originally announced March 2024.
-
Is Reference Necessary in the Evaluation of NLG Systems? When and Where?
Authors:
Shuqian Sheng,
Yi Xu,
Luoyi Fu,
Jiaxin Ding,
Lei Zhou,
Xinbing Wang,
Chenghu Zhou
Abstract:
The majority of automatic metrics for evaluating NLG systems are reference-based. However, the challenge of collecting human annotation results in a lack of reliable references in numerous application scenarios. Despite recent advancements in reference-free metrics, it has not been well understood when and where they can be used as an alternative to reference-based metrics. In this study, by emplo…
▽ More
The majority of automatic metrics for evaluating NLG systems are reference-based. However, the challenge of collecting human annotation results in a lack of reliable references in numerous application scenarios. Despite recent advancements in reference-free metrics, it has not been well understood when and where they can be used as an alternative to reference-based metrics. In this study, by employing diverse analytical approaches, we comprehensively assess the performance of both metrics across a wide range of NLG tasks, encompassing eight datasets and eight evaluation models. Based on solid experiments, the results show that reference-free metrics exhibit a higher correlation with human judgment and greater sensitivity to deficiencies in language quality. However, their effectiveness varies across tasks and is influenced by the quality of candidate texts. Therefore, it's important to assess the performance of reference-free metrics before applying them to a new task, especially when inputs are in uncommon form or when the answer space is highly variable. Our study can provide insight into the appropriate application of automatic metrics and the impact of metric choice on evaluation performance.
△ Less
Submitted 21 March, 2024;
originally announced March 2024.
-
Lifelong LERF: Local 3D Semantic Inventory Monitoring Using FogROS2
Authors:
Adam Rashid,
Chung Min Kim,
Justin Kerr,
Letian Fu,
Kush Hari,
Ayah Ahmad,
Kaiyuan Chen,
Huang Huang,
Marcus Gualtieri,
Michael Wang,
Christian Juette,
Nan Tian,
Liu Ren,
Ken Goldberg
Abstract:
Inventory monitoring in homes, factories, and retail stores relies on maintaining data despite objects being swapped, added, removed, or moved. We introduce Lifelong LERF, a method that allows a mobile robot with minimal compute to jointly optimize a dense language and geometric representation of its surroundings. Lifelong LERF maintains this representation over time by detecting semantic changes…
▽ More
Inventory monitoring in homes, factories, and retail stores relies on maintaining data despite objects being swapped, added, removed, or moved. We introduce Lifelong LERF, a method that allows a mobile robot with minimal compute to jointly optimize a dense language and geometric representation of its surroundings. Lifelong LERF maintains this representation over time by detecting semantic changes and selectively updating these regions of the environment, avoiding the need to exhaustively remap. Human users can query inventory by providing natural language queries and receiving a 3D heatmap of potential object locations. To manage the computational load, we use Fog-ROS2, a cloud robotics platform, to offload resource-intensive tasks. Lifelong LERF obtains poses from a monocular RGBD SLAM backend, and uses these poses to progressively optimize a Language Embedded Radiance Field (LERF) for semantic monitoring. Experiments with 3-5 objects arranged on a tabletop and a Turtlebot with a RealSense camera suggest that Lifelong LERF can persistently adapt to changes in objects with up to 91% accuracy.
△ Less
Submitted 15 March, 2024;
originally announced March 2024.
-
MD-Dose: A Diffusion Model based on the Mamba for Radiotherapy Dose Prediction
Authors:
Linjie Fu,
Xia Li,
Xiuding Cai,
Yingkai Wang,
Xueyao Wang,
Yali Shen,
Yu Yao
Abstract:
Radiation therapy is crucial in cancer treatment. Experienced experts typically iteratively generate high-quality dose distribution maps, forming the basis for excellent radiation therapy plans. Therefore, automated prediction of dose distribution maps is significant in expediting the treatment process and providing a better starting point for developing radiation therapy plans. With the remarkabl…
▽ More
Radiation therapy is crucial in cancer treatment. Experienced experts typically iteratively generate high-quality dose distribution maps, forming the basis for excellent radiation therapy plans. Therefore, automated prediction of dose distribution maps is significant in expediting the treatment process and providing a better starting point for developing radiation therapy plans. With the remarkable results of diffusion models in predicting high-frequency regions of dose distribution maps, dose prediction methods based on diffusion models have been extensively studied. However, existing methods mainly utilize CNNs or Transformers as denoising networks. CNNs lack the capture of global receptive fields, resulting in suboptimal prediction performance. Transformers excel in global modeling but face quadratic complexity with image size, resulting in significant computational overhead. To tackle these challenges, we introduce a novel diffusion model, MD-Dose, based on the Mamba architecture for predicting radiation therapy dose distribution in thoracic cancer patients. In the forward process, MD-Dose adds Gaussian noise to dose distribution maps to obtain pure noise images. In the backward process, MD-Dose utilizes a noise predictor based on the Mamba to predict the noise, ultimately outputting the dose distribution maps. Furthermore, We develop a Mamba encoder to extract structural information and integrate it into the noise predictor for localizing dose regions in the planning target volume (PTV) and organs at risk (OARs). Through extensive experiments on a dataset of 300 thoracic tumor patients, we showcase the superiority of MD-Dose in various metrics and time consumption.
△ Less
Submitted 13 March, 2024;
originally announced March 2024.
-
SiLVR: Scalable Lidar-Visual Reconstruction with Neural Radiance Fields for Robotic Inspection
Authors:
Yifu Tao,
Yash Bhalgat,
Lanke Frank Tarimo Fu,
Matias Mattamala,
Nived Chebrolu,
Maurice Fallon
Abstract:
We present a neural-field-based large-scale reconstruction system that fuses lidar and vision data to generate high-quality reconstructions that are geometrically accurate and capture photo-realistic textures. This system adapts the state-of-the-art neural radiance field (NeRF) representation to also incorporate lidar data which adds strong geometric constraints on the depth and surface normals. W…
▽ More
We present a neural-field-based large-scale reconstruction system that fuses lidar and vision data to generate high-quality reconstructions that are geometrically accurate and capture photo-realistic textures. This system adapts the state-of-the-art neural radiance field (NeRF) representation to also incorporate lidar data which adds strong geometric constraints on the depth and surface normals. We exploit the trajectory from a real-time lidar SLAM system to bootstrap a Structure-from-Motion (SfM) procedure to both significantly reduce the computation time and to provide metric scale which is crucial for lidar depth loss. We use submapping to scale the system to large-scale environments captured over long trajectories. We demonstrate the reconstruction system with data from a multi-camera, lidar sensor suite onboard a legged robot, hand-held while scanning building scenes for 600 metres, and onboard an aerial robot surveying a multi-storey mock disaster site-building. Website: https://ori-drs.github.io/projects/silvr/
△ Less
Submitted 11 March, 2024;
originally announced March 2024.
-
AceMap: Knowledge Discovery through Academic Graph
Authors:
Xinbing Wang,
Luoyi Fu,
Xiaoying Gan,
Ying Wen,
Guanjie Zheng,
Jiaxin Ding,
Liyao Xiang,
Nanyang Ye,
Meng Jin,
Shiyu Liang,
Bin Lu,
Haiwen Wang,
Yi Xu,
Cheng Deng,
Shao Zhang,
Huquan Kang,
Xingli Wang,
Qi Li,
Zhixin Guo,
Jiexing Qi,
Pan Liu,
Yuyang Ren,
Lyuwen Wu,
Jungang Yang,
Jianping Zhou
, et al. (1 additional authors not shown)
Abstract:
The exponential growth of scientific literature requires effective management and extraction of valuable insights. While existing scientific search engines excel at delivering search results based on relational databases, they often neglect the analysis of collaborations between scientific entities and the evolution of ideas, as well as the in-depth analysis of content within scientific publicatio…
▽ More
The exponential growth of scientific literature requires effective management and extraction of valuable insights. While existing scientific search engines excel at delivering search results based on relational databases, they often neglect the analysis of collaborations between scientific entities and the evolution of ideas, as well as the in-depth analysis of content within scientific publications. The representation of heterogeneous graphs and the effective measurement, analysis, and mining of such graphs pose significant challenges. To address these challenges, we present AceMap, an academic system designed for knowledge discovery through academic graph. We present advanced database construction techniques to build the comprehensive AceMap database with large-scale academic entities that contain rich visual, textual, and numerical information. AceMap also employs innovative visualization, quantification, and analysis methods to explore associations and logical relationships among academic entities. AceMap introduces large-scale academic network visualization techniques centered on nebular graphs, providing a comprehensive view of academic networks from multiple perspectives. In addition, AceMap proposes a unified metric based on structural entropy to quantitatively measure the knowledge content of different academic entities. Moreover, AceMap provides advanced analysis capabilities, including tracing the evolution of academic ideas through citation relationships and concept co-occurrence, and generating concise summaries informed by this evolutionary process. In addition, AceMap uses machine reading methods to generate potential new ideas at the intersection of different fields. Exploring the integration of large language models and knowledge graphs is a promising direction for future research in idea evolution. Please visit \url{https://www.acemap.info} for further exploration.
△ Less
Submitted 14 April, 2024; v1 submitted 4 March, 2024;
originally announced March 2024.
-
ResAdapter: Domain Consistent Resolution Adapter for Diffusion Models
Authors:
Jiaxiang Cheng,
Pan Xie,
Xin Xia,
Jiashi Li,
Jie Wu,
Yuxi Ren,
Huixia Li,
Xuefeng Xiao,
Min Zheng,
Lean Fu
Abstract:
Recent advancement in text-to-image models (e.g., Stable Diffusion) and corresponding personalized technologies (e.g., DreamBooth and LoRA) enables individuals to generate high-quality and imaginative images. However, they often suffer from limitations when generating images with resolutions outside of their trained domain. To overcome this limitation, we present the Resolution Adapter (ResAdapter…
▽ More
Recent advancement in text-to-image models (e.g., Stable Diffusion) and corresponding personalized technologies (e.g., DreamBooth and LoRA) enables individuals to generate high-quality and imaginative images. However, they often suffer from limitations when generating images with resolutions outside of their trained domain. To overcome this limitation, we present the Resolution Adapter (ResAdapter), a domain-consistent adapter designed for diffusion models to generate images with unrestricted resolutions and aspect ratios. Unlike other multi-resolution generation methods that process images of static resolution with complex post-process operations, ResAdapter directly generates images with the dynamical resolution. Especially, after learning a deep understanding of pure resolution priors, ResAdapter trained on the general dataset, generates resolution-free images with personalized diffusion models while preserving their original style domain. Comprehensive experiments demonstrate that ResAdapter with only 0.5M can process images with flexible resolutions for arbitrary diffusion models. More extended experiments demonstrate that ResAdapter is compatible with other modules (e.g., ControlNet, IP-Adapter and LCM-LoRA) for image generation across a broad range of resolutions, and can be integrated into other multi-resolution model (e.g., ElasticDiffusion) for efficiently generating higher-resolution images. Project link is https://res-adapter.github.io
△ Less
Submitted 4 March, 2024;
originally announced March 2024.
-
CAPT: Category-level Articulation Estimation from a Single Point Cloud Using Transformer
Authors:
Lian Fu,
Ryoichi Ishikawa,
Yoshihiro Sato,
Takeshi Oishi
Abstract:
The ability to estimate joint parameters is essential for various applications in robotics and computer vision. In this paper, we propose CAPT: category-level articulation estimation from a point cloud using Transformer. CAPT uses an end-to-end transformer-based architecture for joint parameter and state estimation of articulated objects from a single point cloud. The proposed CAPT methods accurat…
▽ More
The ability to estimate joint parameters is essential for various applications in robotics and computer vision. In this paper, we propose CAPT: category-level articulation estimation from a point cloud using Transformer. CAPT uses an end-to-end transformer-based architecture for joint parameter and state estimation of articulated objects from a single point cloud. The proposed CAPT methods accurately estimate joint parameters and states for various articulated objects with high precision and robustness. The paper also introduces a motion loss approach, which improves articulation estimation performance by emphasizing the dynamic features of articulated objects. Additionally, the paper presents a double voting strategy to provide the framework with coarse-to-fine parameter estimation. Experimental results on several category datasets demonstrate that our methods outperform existing alternatives for articulation estimation. Our research provides a promising solution for applying Transformer-based architectures in articulated object analysis.
△ Less
Submitted 27 February, 2024;
originally announced February 2024.
-
Enhancing Robotic Manipulation with AI Feedback from Multimodal Large Language Models
Authors:
Jinyi Liu,
Yifu Yuan,
Jianye Hao,
Fei Ni,
Lingzhi Fu,
Yibin Chen,
Yan Zheng
Abstract:
Recently, there has been considerable attention towards leveraging large language models (LLMs) to enhance decision-making processes. However, aligning the natural language text instructions generated by LLMs with the vectorized operations required for execution presents a significant challenge, often necessitating task-specific details. To circumvent the need for such task-specific granularity, i…
▽ More
Recently, there has been considerable attention towards leveraging large language models (LLMs) to enhance decision-making processes. However, aligning the natural language text instructions generated by LLMs with the vectorized operations required for execution presents a significant challenge, often necessitating task-specific details. To circumvent the need for such task-specific granularity, inspired by preference-based policy learning approaches, we investigate the utilization of multimodal LLMs to provide automated preference feedback solely from image inputs to guide decision-making. In this study, we train a multimodal LLM, termed CriticGPT, capable of understanding trajectory videos in robot manipulation tasks, serving as a critic to offer analysis and preference feedback. Subsequently, we validate the effectiveness of preference labels generated by CriticGPT from a reward modeling perspective. Experimental evaluation of the algorithm's preference accuracy demonstrates its effective generalization ability to new tasks. Furthermore, performance on Meta-World tasks reveals that CriticGPT's reward model efficiently guides policy learning, surpassing rewards based on state-of-the-art pre-trained representation models.
△ Less
Submitted 21 February, 2024;
originally announced February 2024.
-
A Touch, Vision, and Language Dataset for Multimodal Alignment
Authors:
Letian Fu,
Gaurav Datta,
Huang Huang,
William Chung-Ho Panitch,
Jaimyn Drake,
Joseph Ortiz,
Mustafa Mukadam,
Mike Lambeta,
Roberto Calandra,
Ken Goldberg
Abstract:
Touch is an important sensing modality for humans, but it has not yet been incorporated into a multimodal generative language model. This is partially due to the difficulty of obtaining natural language labels for tactile data and the complexity of aligning tactile readings with both visual observations and language descriptions. As a step towards bridging that gap, this work introduces a new data…
▽ More
Touch is an important sensing modality for humans, but it has not yet been incorporated into a multimodal generative language model. This is partially due to the difficulty of obtaining natural language labels for tactile data and the complexity of aligning tactile readings with both visual observations and language descriptions. As a step towards bridging that gap, this work introduces a new dataset of 44K in-the-wild vision-touch pairs, with English language labels annotated by humans (10%) and textual pseudo-labels from GPT-4V (90%). We use this dataset to train a vision-language-aligned tactile encoder for open-vocabulary classification and a touch-vision-language (TVL) model for text generation using the trained encoder. Results suggest that by incorporating touch, the TVL model improves (+29% classification accuracy) touch-vision-language alignment over existing models trained on any pair of those modalities. Although only a small fraction of the dataset is human-labeled, the TVL model demonstrates improved visual-tactile understanding over GPT-4V (+12%) and open-source vision-language models (+32%) on a new touch-vision understanding benchmark. Code and data: https://tactile-vlm.github.io.
△ Less
Submitted 20 February, 2024;
originally announced February 2024.
-
Rethinking Patch Dependence for Masked Autoencoders
Authors:
Letian Fu,
Long Lian,
Renhao Wang,
Baifeng Shi,
Xudong Wang,
Adam Yala,
Trevor Darrell,
Alexei A. Efros,
Ken Goldberg
Abstract:
In this work, we re-examine inter-patch dependencies in the decoding mechanism of masked autoencoders (MAE). We decompose this decoding mechanism for masked patch reconstruction in MAE into self-attention and cross-attention. Our investigations suggest that self-attention between mask patches is not essential for learning good representations. To this end, we propose a novel pretraining framework:…
▽ More
In this work, we re-examine inter-patch dependencies in the decoding mechanism of masked autoencoders (MAE). We decompose this decoding mechanism for masked patch reconstruction in MAE into self-attention and cross-attention. Our investigations suggest that self-attention between mask patches is not essential for learning good representations. To this end, we propose a novel pretraining framework: Cross-Attention Masked Autoencoders (CrossMAE). CrossMAE's decoder leverages only cross-attention between masked and visible tokens, with no degradation in downstream performance. This design also enables decoding only a small subset of mask tokens, boosting efficiency. Furthermore, each decoder block can now leverage different encoder features, resulting in improved representation learning. CrossMAE matches MAE in performance with 2.5 to 3.7$\times$ less decoding compute. It also surpasses MAE on ImageNet classification and COCO instance segmentation under the same compute. Code and models: https://crossmae.github.io
△ Less
Submitted 25 January, 2024;
originally announced January 2024.
-
WAL-Net: Weakly supervised auxiliary task learning network for carotid plaques classification
Authors:
Haitao Gan,
Lingchao Fu,
Ran Zhou,
Weiyan Gan,
Furong Wang,
Xiaoyan Wu,
Zhi Yang,
Zhongwei Huang
Abstract:
The classification of carotid artery ultrasound images is a crucial means for diagnosing carotid plaques, holding significant clinical relevance for predicting the risk of stroke. Recent research suggests that utilizing plaque segmentation as an auxiliary task for classification can enhance performance by leveraging the correlation between segmentation and classification tasks. However, this appro…
▽ More
The classification of carotid artery ultrasound images is a crucial means for diagnosing carotid plaques, holding significant clinical relevance for predicting the risk of stroke. Recent research suggests that utilizing plaque segmentation as an auxiliary task for classification can enhance performance by leveraging the correlation between segmentation and classification tasks. However, this approach relies on obtaining a substantial amount of challenging-to-acquire segmentation annotations. This paper proposes a novel weakly supervised auxiliary task learning network model (WAL-Net) to explore the interdependence between carotid plaque classification and segmentation tasks. The plaque classification task is primary task, while the plaque segmentation task serves as an auxiliary task, providing valuable information to enhance the performance of the primary task. Weakly supervised learning is adopted in the auxiliary task to completely break away from the dependence on segmentation annotations. Experiments and evaluations are conducted on a dataset comprising 1270 carotid plaque ultrasound images from Wuhan University Zhongnan Hospital. Results indicate that the proposed method achieved an approximately 1.3% improvement in carotid plaque classification accuracy compared to the baseline network. Specifically, the accuracy of mixed-echoic plaques classification increased by approximately 3.3%, demonstrating the effectiveness of our approach.
△ Less
Submitted 27 January, 2024; v1 submitted 25 January, 2024;
originally announced January 2024.
-
Adapting Large Language Models for Education: Foundational Capabilities, Potentials, and Challenges
Authors:
Qingyao Li,
Lingyue Fu,
Weiming Zhang,
Xianyu Chen,
Jingwei Yu,
Wei Xia,
Weinan Zhang,
Ruiming Tang,
Yong Yu
Abstract:
Online education platforms, leveraging the internet to distribute education resources, seek to provide convenient education but often fall short in real-time communication with students. They often struggle to address the diverse obstacles students encounter throughout their learning journey. Solving the problems encountered by students poses a significant challenge for traditional deep learning m…
▽ More
Online education platforms, leveraging the internet to distribute education resources, seek to provide convenient education but often fall short in real-time communication with students. They often struggle to address the diverse obstacles students encounter throughout their learning journey. Solving the problems encountered by students poses a significant challenge for traditional deep learning models, as it requires not only a broad spectrum of subject knowledge but also the ability to understand what constitutes a student's individual difficulties. It's challenging for traditional machine learning models, as they lack the capacity to comprehend students' personalized needs. Recently, the emergence of large language models (LLMs) offers the possibility for resolving this issue by comprehending individual requests. Although LLMs have been successful in various fields, creating an LLM-based education system is still challenging for the wide range of educational skills required. This paper reviews the recently emerged LLM research related to educational capabilities, including mathematics, writing, programming, reasoning, and knowledge-based question answering, with the aim to explore their potential in constructing the next-generation intelligent education system. Specifically, for each capability, we focus on investigating two aspects. Firstly, we examine the current state of LLMs regarding this capability: how advanced they have become, whether they surpass human abilities, and what deficiencies might exist. Secondly, we evaluate whether the development methods for LLMs in this area are generalizable, that is, whether these methods can be applied to construct a comprehensive educational supermodel with strengths across various capabilities, rather than being effective in only a singular aspect.
△ Less
Submitted 26 April, 2024; v1 submitted 27 December, 2023;
originally announced January 2024.
-
GeoGalactica: A Scientific Large Language Model in Geoscience
Authors:
Zhouhan Lin,
Cheng Deng,
Le Zhou,
Tianhang Zhang,
Yi Xu,
Yutong Xu,
Zhongmou He,
Yuanyuan Shi,
Beiya Dai,
Yunchong Song,
Boyi Zeng,
Qiyuan Chen,
Yuxun Miao,
Bo Xue,
Shu Wang,
Luoyi Fu,
Weinan Zhang,
Junxian He,
Yunqiang Zhu,
Xinbing Wang,
Chenghu Zhou
Abstract:
Large language models (LLMs) have achieved huge success for their general knowledge and ability to solve a wide spectrum of tasks in natural language processing (NLP). Due to their impressive abilities, LLMs have shed light on potential inter-discipline applications to foster scientific discoveries of a specific domain by using artificial intelligence (AI for science, AI4S). In the meantime, utili…
▽ More
Large language models (LLMs) have achieved huge success for their general knowledge and ability to solve a wide spectrum of tasks in natural language processing (NLP). Due to their impressive abilities, LLMs have shed light on potential inter-discipline applications to foster scientific discoveries of a specific domain by using artificial intelligence (AI for science, AI4S). In the meantime, utilizing NLP techniques in geoscience research and practice is wide and convoluted, contributing from knowledge extraction and document classification to question answering and knowledge discovery. In this work, we take the initial step to leverage LLM for science, through a rather straightforward approach. We try to specialize an LLM into geoscience, by further pre-training the model with a vast amount of texts in geoscience, as well as supervised fine-tuning (SFT) the resulting model with our custom collected instruction tuning dataset. These efforts result in a model GeoGalactica consisting of 30 billion parameters. To our best knowledge, it is the largest language model for the geoscience domain. More specifically, GeoGalactica is from further pre-training of Galactica. We train GeoGalactica over a geoscience-related text corpus containing 65 billion tokens, preserving as the largest geoscience-specific text corpus. Then we fine-tune the model with 1 million pairs of instruction-tuning data consisting of questions that demand professional geoscience knowledge to answer. In this technical report, we will illustrate in detail all aspects of GeoGalactica, including data collection, data cleaning, base model selection, pre-training, SFT, and evaluation. We open-source our data curation tools and the checkpoints of GeoGalactica during the first 3/4 of pre-training.
△ Less
Submitted 13 April, 2024; v1 submitted 31 December, 2023;
originally announced January 2024.
-
Spatial Deep Learning for Site-Specific Movement Optimization of Aerial Base Stations
Authors:
Jiangbin Lyu,
Xu Chen,
Jiefeng Zhang,
Liqun Fu
Abstract:
Unmanned aerial vehicles (UAVs) can be utilized as aerial base stations (ABSs) to provide wireless connectivity for ground users (GUs) in various emergency scenarios. However, it is a NP-hard problem with exponential complexity in $M$ and $N$, in order to maximize the coverage rate of $M$ GUs by jointly placing $N$ ABSs with limited coverage range. The problem is further complicated when the cover…
▽ More
Unmanned aerial vehicles (UAVs) can be utilized as aerial base stations (ABSs) to provide wireless connectivity for ground users (GUs) in various emergency scenarios. However, it is a NP-hard problem with exponential complexity in $M$ and $N$, in order to maximize the coverage rate of $M$ GUs by jointly placing $N$ ABSs with limited coverage range. The problem is further complicated when the coverage range becomes irregular due to site-specific blockages (e.g., buildings) on the air-ground channel, and/or when the GUs are moving. To address the above challenges, we study a multi-ABS movement optimization problem to maximize the average coverage rate of mobile GUs in a site-specific environment. The Spatial Deep Learning with Multi-dimensional Archive of Phenotypic Elites (SDL-ME) algorithm is proposed to tackle this challenging problem by 1) partitioning the complicated ABS movement problem into ABS placement sub-problems each spanning finite time horizon; 2) using an encoder-decoder deep neural network (DNN) as the emulator to capture the spatial correlation of ABSs/GUs and thereby reducing the cost of interaction with the actual environment; 3) employing the emulator to speed up a quality-diversity search for the optimal placement solution; and 4) proposing a planning-exploration-serving scheme for multi-ABS movement coordination. Numerical results demonstrate that the proposed approach significantly outperforms the benchmark Deep Reinforcement Learning (DRL)-based method and other two baselines in terms of average coverage rate, training time and/or sample efficiency. Moreover, with one-time training, our proposed method can be applied in scenarios where the number of ABSs/GUs dynamically changes on site and/or with different/varying GU speeds, which is thus more robust and flexible compared with conventional DRL-based methods.
△ Less
Submitted 16 December, 2023;
originally announced December 2023.
-
IRS-Aided Sectorized Base Station Design and 3D Coverage Performance Analysis
Authors:
Xintong Chen,
Jiangbin Lyu,
Liqun Fu
Abstract:
Intelligent reflecting surface (IRS) is regarded as a revolutionary paradigm that can reconfigure the wireless propagation environment for enhancing the desired signal and/or weakening the interference, and thus improving the quality of service (QoS) for communication systems. In this paper, we propose an IRS-aided sectorized BS design where the IRS is mounted in front of a transmitter (TX) and re…
▽ More
Intelligent reflecting surface (IRS) is regarded as a revolutionary paradigm that can reconfigure the wireless propagation environment for enhancing the desired signal and/or weakening the interference, and thus improving the quality of service (QoS) for communication systems. In this paper, we propose an IRS-aided sectorized BS design where the IRS is mounted in front of a transmitter (TX) and reflects/reconfigures signal towards the desired user equipment (UE). Unlike prior works that address link-level analysis/optimization of IRS-aided systems, we focus on the system-level three-dimensional (3D) coverage performance in both single-/multiple-cell scenarios. To this end, a distance/angle-dependent 3D channel model is considered for UEs in the 3D space, as well as the non-isotropic TX beam pattern and IRS element radiation pattern (ERP), both of which affect the average channel power as well as the multi-path fading statistics. Based on the above, a general formula of received signal power in our design is obtained, along with derived power scaling laws and upper/lower bounds on the mean signal/interference power under IRS passive beamforming or random scattering. Numerical results validate our analysis and demonstrate that our proposed design outperforms the benchmark schemes with fixed BS antenna patterns or active 3D beamforming. In particular, for aerial UEs that suffer from strong inter-cell interference, the IRS-aided BS design provides much better QoS in terms of the ergodic throughput performance compared with benchmarks, thanks to the IRS-inherent double pathloss effect that helps weaken the interference.
△ Less
Submitted 16 December, 2023;
originally announced December 2023.
-
SP-DiffDose: A Conditional Diffusion Model for Radiation Dose Prediction Based on Multi-Scale Fusion of Anatomical Structures, Guided by SwinTransformer and Projector
Authors:
Linjie Fu,
Xia Li,
Xiuding Cai,
Yingkai Wang,
Xueyao Wang,
Yu Yao,
Yali Shen
Abstract:
Radiation therapy serves as an effective and standard method for cancer treatment. Excellent radiation therapy plans always rely on high-quality dose distribution maps obtained through repeated trial and error by experienced experts. However, due to individual differences and complex clinical situations, even seasoned expert teams may need help to achieve the best treatment plan every time quickly…
▽ More
Radiation therapy serves as an effective and standard method for cancer treatment. Excellent radiation therapy plans always rely on high-quality dose distribution maps obtained through repeated trial and error by experienced experts. However, due to individual differences and complex clinical situations, even seasoned expert teams may need help to achieve the best treatment plan every time quickly. Many automatic dose distribution prediction methods have been proposed recently to accelerate the radiation therapy planning process and have achieved good results. However, these results suffer from over-smoothing issues, with the obtained dose distribution maps needing more high-frequency details, limiting their clinical application. To address these limitations, we propose a dose prediction diffusion model based on SwinTransformer and a projector, SP-DiffDose. To capture the direct correlation between anatomical structure and dose distribution maps, SP-DiffDose uses a structural encoder to extract features from anatomical images, then employs a conditional diffusion process to blend noise and anatomical images at multiple scales and gradually map them to dose distribution maps. To enhance the dose prediction distribution for organs at risk, SP-DiffDose utilizes SwinTransformer in the deeper layers of the network to capture features at different scales in the image. To learn good representations from the fused features, SP-DiffDose passes the fused features through a designed projector, improving dose prediction accuracy. Finally, we evaluate SP-DiffDose on an internal dataset. The results show that SP-DiffDose outperforms existing methods on multiple evaluation metrics, demonstrating the superiority and generalizability of our method.
△ Less
Submitted 11 December, 2023;
originally announced December 2023.
-
Combating Multi-path Interference to Improve Chirp-based Underwater Acoustic Communication
Authors:
Wenjun Xie,
Enqi Zhang,
Lizhao You,
Deqing Wang,
Zhaorui Wang,
Liqun Fu
Abstract:
Linear chirp-based underwater acoustic communication has been widely used due to its reliability and long-range transmission capability. However, unlike the counterpart chirp technology in wireless -- LoRa, its throughput is severely limited by the number of modulated chirps in a symbol. The fundamental challenge lies in the underwater multi-path channel, where the delayed copied of one symbol may…
▽ More
Linear chirp-based underwater acoustic communication has been widely used due to its reliability and long-range transmission capability. However, unlike the counterpart chirp technology in wireless -- LoRa, its throughput is severely limited by the number of modulated chirps in a symbol. The fundamental challenge lies in the underwater multi-path channel, where the delayed copied of one symbol may cause inter-symbol and intra-symbol interfere. In this paper, we present UWLoRa+, a system that realizes the same chirp modulation as LoRa with higher data rate, and enhances LoRa's design to address the multi-path challenge via the following designs: a) we replace the linear chirp used by LoRa with the non-linear chirp to reduce the signal interference range and the collision probability; b) we design an algorithm that first demodulates each path and then combines the demodulation results of detected paths; and c) we replace the Hamming codes used by LoRa with the non-binary LDPC codes to mitigate the impact of the inevitable collision.Experiment results show that the new designs improve the bit error rate (BER) by 3x, and the packet error rate (PER) significantly, compared with the LoRa's naive design. Compared with an state-of-the-art system for decoding underwater LoRa chirp signal, UWLoRa+ improves the throughput by up to 50 times.
△ Less
Submitted 29 November, 2023;
originally announced November 2023.
-
Enhancing Scene Text Detectors with Realistic Text Image Synthesis Using Diffusion Models
Authors:
Ling Fu,
Zijie Wu,
Yingying Zhu,
Yuliang Liu,
Xiang Bai
Abstract:
Scene text detection techniques have garnered significant attention due to their wide-ranging applications. However, existing methods have a high demand for training data, and obtaining accurate human annotations is labor-intensive and time-consuming. As a solution, researchers have widely adopted synthetic text images as a complementary resource to real text images during pre-training. Yet there…
▽ More
Scene text detection techniques have garnered significant attention due to their wide-ranging applications. However, existing methods have a high demand for training data, and obtaining accurate human annotations is labor-intensive and time-consuming. As a solution, researchers have widely adopted synthetic text images as a complementary resource to real text images during pre-training. Yet there is still room for synthetic datasets to enhance the performance of scene text detectors. We contend that one main limitation of existing generation methods is the insufficient integration of foreground text with the background. To alleviate this problem, we present the Diffusion Model based Text Generator (DiffText), a pipeline that utilizes the diffusion model to seamlessly blend foreground text regions with the background's intrinsic features. Additionally, we propose two strategies to generate visually coherent text with fewer spelling errors. With fewer text instances, our produced text images consistently surpass other synthetic data in aiding text detectors. Extensive experiments on detecting horizontal, rotated, curved, and line-level texts demonstrate the effectiveness of DiffText in producing realistic text images.
△ Less
Submitted 28 November, 2023;
originally announced November 2023.
-
VehicleGAN: Pair-flexible Pose Guided Image Synthesis for Vehicle Re-identification
Authors:
Baolu Li,
Ping Liu,
Lan Fu,
Jinlong Li,
Jianwu Fang,
Zhigang Xu,
Hongkai Yu
Abstract:
Vehicle Re-identification (Re-ID) has been broadly studied in the last decade; however, the different camera view angle leading to confused discrimination in the feature subspace for the vehicles of various poses, is still challenging for the Vehicle Re-ID models in the real world. To promote the Vehicle Re-ID models, this paper proposes to synthesize a large number of vehicle images in the target…
▽ More
Vehicle Re-identification (Re-ID) has been broadly studied in the last decade; however, the different camera view angle leading to confused discrimination in the feature subspace for the vehicles of various poses, is still challenging for the Vehicle Re-ID models in the real world. To promote the Vehicle Re-ID models, this paper proposes to synthesize a large number of vehicle images in the target pose, whose idea is to project the vehicles of diverse poses into the unified target pose so as to enhance feature discrimination. Considering that the paired data of the same vehicles in different traffic surveillance cameras might be not available in the real world, we propose the first Pair-flexible Pose Guided Image Synthesis method for Vehicle Re-ID, named as VehicleGAN in this paper, which works for both supervised and unsupervised settings without the knowledge of geometric 3D models. Because of the feature distribution difference between real and synthetic data, simply training a traditional metric learning based Re-ID model with data-level fusion (i.e., data augmentation) is not satisfactory, therefore we propose a new Joint Metric Learning (JML) via effective feature-level fusion from both real and synthetic data. Intensive experimental results on the public VeRi-776 and VehicleID datasets prove the accuracy and effectiveness of our proposed VehicleGAN and JML.
△ Less
Submitted 17 April, 2024; v1 submitted 27 November, 2023;
originally announced November 2023.
-
Enhancing Uncertainty-Based Hallucination Detection with Stronger Focus
Authors:
Tianhang Zhang,
Lin Qiu,
Qipeng Guo,
Cheng Deng,
Yue Zhang,
Zheng Zhang,
Chenghu Zhou,
Xinbing Wang,
Luoyi Fu
Abstract:
Large Language Models (LLMs) have gained significant popularity for their impressive performance across diverse fields. However, LLMs are prone to hallucinate untruthful or nonsensical outputs that fail to meet user expectations in many real-world applications. Existing works for detecting hallucinations in LLMs either rely on external knowledge for reference retrieval or require sampling multiple…
▽ More
Large Language Models (LLMs) have gained significant popularity for their impressive performance across diverse fields. However, LLMs are prone to hallucinate untruthful or nonsensical outputs that fail to meet user expectations in many real-world applications. Existing works for detecting hallucinations in LLMs either rely on external knowledge for reference retrieval or require sampling multiple responses from the LLM for consistency verification, making these methods costly and inefficient. In this paper, we propose a novel reference-free, uncertainty-based method for detecting hallucinations in LLMs. Our approach imitates human focus in factuality checking from three aspects: 1) focus on the most informative and important keywords in the given text; 2) focus on the unreliable tokens in historical context which may lead to a cascade of hallucinations; and 3) focus on the token properties such as token type and token frequency. Experimental results on relevant datasets demonstrate the effectiveness of our proposed method, which achieves state-of-the-art performance across all the evaluation metrics and eliminates the need for additional information.
△ Less
Submitted 22 November, 2023;
originally announced November 2023.