Search | arXiv e-print repository

Femtosecond switching of strong light-matter interactions in microcavities with two-dimensional semiconductors

Authors: Armando Genco, Charalambos Louca, Cristina Cruciano, Kok Wee Song, Chiara Trovatello, Giuseppe Di Blasio, Giacomo Sansone, Sam Randerson, Peter Claronino, Rahul Jayaprakash, Kenji Watanabe, Takashi Taniguchi, David G. Lidzey, Oleksandr Kyriienko, Stefano Dal Conte, Alexander I. Tartakovskii, Giulio Cerullo

Abstract: Ultrafast all-optical logic devices based on nonlinear light-matter interactions hold the promise to overcome the speed limitations of conventional electronic devices. Strong coupling of excitons and photons inside an optical resonator enhances such interactions and generates new polariton states which give access to unique nonlinear phenomena, such as Bose-Einstein condensation, used for all-optical ultrafast polariton transistors. However, the pulse energies required to pump such devices range from tens to hundreds of pJ, making them not competitive with electronic transistors. Here we introduce a new paradigm for all-optical switching based on the ultrafast transition from the strong to the weak coupling regime in microcavities embedding atomically thin transition metal dichalcogenides. Employing single and double stacks of hBN-encapsulated MoS$_2$ homobilayers with high optical nonlinearities and fast exciton relaxation times, we observe a collapse of the 55-meV polariton gap and its revival in less than one picosecond, lowering the threshold for optical switching below 4 pJ per pulse, while retaining ultrahigh switching frequencies. As an additional degree of freedom, the switching can be triggered pumping either the intra- or the interlayer excitons of the bilayers at different wavelengths, speeding up the polariton dynamics, owing to unique interspecies excitonic interactions. Our approach will enable the development of compact ultrafast all-optical logical circuits and neural networks, showcasing a new platform for polaritonic information processing based on manipulating the light-matter coupling.

Submitted 31 July, 2024; originally announced August 2024.
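
Illustrative note: the "collapse of the polariton gap" described in the abstract above can be pictured with the textbook lossless two-coupled-oscillator model, in which the cavity photon and the exciton hybridize into lower and upper polariton branches. The sketch below is not the authors' model. It assumes the 55 meV gap equals the Rabi splitting 2g at zero detuning, and the exciton energy `E_X` is a hypothetical placeholder.

```python
import math


def polariton_branches(e_cavity, e_exciton, g):
    """Lower/upper polariton energies of a lossless coupled-oscillator model.

    All energies are in meV; g is the coupling strength (half the Rabi
    splitting at zero detuning).
    """
    mean = 0.5 * (e_cavity + e_exciton)
    detuning = e_cavity - e_exciton
    half_gap = math.sqrt(g * g + 0.25 * detuning * detuning)
    return mean - half_gap, mean + half_gap


# Hypothetical exciton energy, chosen only for illustration.
E_X = 1900.0
# Assumption: the 55 meV gap is the Rabi splitting 2g at zero detuning.
G_STRONG = 27.5

lower, upper = polariton_branches(E_X, E_X, G_STRONG)
gap_strong = upper - lower

# Setting g to zero mimics the weak-coupling limit: the branches merge.
lower0, upper0 = polariton_branches(E_X, E_X, 0.0)
gap_collapsed = upper0 - lower0
```

In this picture, the switching reported in the abstract corresponds to the coupling being transiently suppressed by the pump and then recovering.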

arXiv:2407.20502 [pdf, other]

Restoring Real-World Degraded Events Improves Deblurring Quality

Authors: Yeqing Shen, Shang Li, Kun Song

Abstract: Due to its high speed and low latency, DVS is frequently employed in motion deblurring. Ideally, high-quality events would adeptly capture intricate motion information. However, real-world events are generally degraded, thereby introducing significant artifacts into the deblurred results. In response to this challenge, we model the degradation of events and propose RDNet to improve the quality of image deblurring. Specifically, we first analyze the mechanisms underlying degradation and simulate paired events based on that. These paired events are then fed into the first stage of the RDNet for training the restoration model. The events restored in this stage serve as a guide for the second-stage deblurring process. To better assess the deblurring performance of different methods on real-world degraded events, we present a new real-world dataset named DavisMCR. This dataset incorporates events with diverse degradation levels, collected by manipulating environmental brightness and target object contrast. Our experiments are conducted on synthetic datasets (GOPRO), real-world datasets (REBlur), and the proposed dataset (DavisMCR). The results demonstrate that RDNet outperforms classical event denoising methods in event restoration. Furthermore, RDNet exhibits better performance in deblurring tasks compared to state-of-the-art methods. DavisMCR are available at https://github.com/Yeeesir/DVS_RDNet.

Submitted 29 July, 2024; originally announced July 2024.

arXiv:2407.17491 [pdf, other]

Robust Adaptation of Foundation Models with Black-Box Visual Prompting

Authors: Changdae Oh, Gyeongdeok Seo, Geunyoung Jung, Zhi-Qi Cheng, Hosik Choi, Jiyoung Jung, Kyungwoo Song

Abstract: With the surge of large-scale pre-trained models (PTMs), adapting these models to numerous downstream tasks becomes a crucial problem. Consequently, parameter-efficient transfer learning (PETL) of large models has grasped huge attention. While PETL methods show impressive performance, they commonly rely on two optimistic assumptions: 1) the entire parameters of a PTM are available, and 2) a sufficiently large memory capacity is equipped for caching all the intermediate activations to compute gradients. However, in most real-world applications, PTMs are served as black-box APIs or proprietary software without explicit parameter accessibility. Besides, it is hard to meet a large memory requirement for modern PTMs. This work proposes black-box visual prompting (BlackVIP), which efficiently adapts the PTMs without knowledge about model architectures and parameters. BlackVIP has two components; 1) Coordinator and 2) simultaneous perturbation stochastic approximation with gradient correction (SPSA-GC). The Coordinator designs input-dependent visual prompts, which allow the target PTM to adapt in the wild. SPSA-GC efficiently estimates the gradient of PTM to update the Coordinator. Besides, we propose a variant, BlackVIP-SE, which significantly reduces the runtime and computational cost of BlackVIP. Extensive experiments on 19 datasets demonstrate that BlackVIPs enable robust adaptation to diverse domains and tasks with minimal memory requirements. We further provide theoretical analysis on the generalization of visual prompting methods by presenting their connection to the certified robustness of randomized smoothing.

Submitted 3 July, 2024; originally announced July 2024.

Comments: Extended work from the CVPR'23 paper: arxiv:2303.14773; This paper has been submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) for possible publication
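
Illustrative note: the abstract above relies on simultaneous perturbation stochastic approximation (SPSA) to estimate gradients of a black-box model. The sketch below shows only the plain, textbook two-sided SPSA estimator on a toy loss. It does not include the paper's gradient correction (the "GC" in SPSA-GC) or the Coordinator, and every function name here is hypothetical.

```python
import random


def spsa_gradient(loss, theta, c=0.01, rng=random):
    """One two-sided SPSA gradient estimate from two loss evaluations."""
    # Rademacher perturbation: each coordinate is +1 or -1 with equal odds.
    delta = [rng.choice((-1.0, 1.0)) for _ in theta]
    plus = [t + c * d for t, d in zip(theta, delta)]
    minus = [t - c * d for t, d in zip(theta, delta)]
    diff = loss(plus) - loss(minus)
    # The same scalar difference is reused for every coordinate.
    return [diff / (2.0 * c * d) for d in delta]


def quadratic(theta):
    """Toy black-box loss whose true gradient is 2 * theta."""
    return sum(t * t for t in theta)


def averaged_spsa(loss, theta, n_samples, seed=0):
    """Average many SPSA estimates to reduce their variance."""
    rng = random.Random(seed)
    total = [0.0] * len(theta)
    for _ in range(n_samples):
        estimate = spsa_gradient(loss, theta, rng=rng)
        total = [a + b for a, b in zip(total, estimate)]
    return [a / n_samples for a in total]
```

Each estimate costs two loss evaluations regardless of the parameter dimension, which is what makes this family of estimators attractive when only query access to a model is available.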

arXiv:2407.02867 [pdf, other]

Contrast then Memorize: Semantic Neighbor Retrieval-Enhanced Inductive Multimodal Knowledge Graph Completion

Authors: Yu Zhao, Ying Zhang, Baohang Zhou, Xinying Qian, Kehui Song, Xiangrui Cai

Abstract: A large number of studies have emerged for Multimodal Knowledge Graph Completion (MKGC) to predict the missing links in MKGs. However, fewer studies have been proposed to study the inductive MKGC (IMKGC) involving emerging entities unseen during training. Existing inductive approaches focus on learning textual entity representations, which neglect rich semantic information in visual modality. Moreover, they focus on aggregating structural neighbors from existing KGs, which of emerging entities are usually limited. However, the semantic neighbors are decoupled from the topology linkage and usually imply the true target entity. In this paper, we propose the IMKGC task and a semantic neighbor retrieval-enhanced IMKGC framework CMR, where the contrast brings the helpful semantic neighbors close, and then the memorize supports semantic neighbor retrieval to enhance inference. Specifically, we first propose a unified cross-modal contrastive learning to simultaneously capture the textual-visual and textual-textual correlations of query-entity pairs in a unified representation space. The contrastive learning increases the similarity of positive query-entity pairs, therefore making the representations of helpful semantic neighbors close. Then, we explicitly memorize the knowledge representations to support the semantic neighbor retrieval. At test time, we retrieve the nearest semantic neighbors and interpolate them to the query-entity similarity distribution to augment the final prediction. Extensive experiments validate the effectiveness of CMR on three inductive MKGC datasets. Codes are available at https://github.com/OreOZhao/CMR.

Submitted 3 July, 2024; originally announced July 2024.

Comments: Accepted by SIGIR 2024
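
Illustrative note: the test-time step in the abstract above (retrieve nearest semantic neighbors, then interpolate them with the model's query-entity distribution) can be sketched in the style of generic nearest-neighbor interpolation. This is a minimal toy, not the CMR implementation. The memory layout, the dot-product similarity, and the mixing weight `lam` are all assumptions made for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def knn_distribution(query, memory, k, num_entities):
    """Turn the k most similar memorized entries into an entity distribution.

    memory is a list of (key_vector, target_entity_id) pairs (hypothetical
    layout).
    """
    nearest = sorted(memory, key=lambda item: -dot(query, item[0]))[:k]
    weights = softmax([dot(query, key) for key, _ in nearest])
    dist = [0.0] * num_entities
    for weight, (_, entity) in zip(weights, nearest):
        dist[entity] += weight
    return dist


def interpolate(p_model, p_knn, lam):
    """Mix the model distribution with the retrieved-neighbor distribution."""
    return [(1.0 - lam) * a + lam * b for a, b in zip(p_model, p_knn)]


# Toy usage: both nearest neighbors point at entity 1.
memory = [([1.0, 0.0], 1), ([0.9, 0.1], 1), ([0.0, 1.0], 2)]
p_knn = knn_distribution([1.0, 0.0], memory, k=2, num_entities=3)
p_final = interpolate([0.5, 0.3, 0.2], p_knn, lam=0.5)
```

In the toy usage, the retrieved neighbors shift the top-ranked entity from index 0 to index 1, which is the kind of correction the retrieval step is meant to provide.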

arXiv:2407.01853 [pdf, other]

Improving Multilingual Instruction Finetuning via Linguistically Natural and Diverse Datasets

Authors: Sathish Reddy Indurthi, Wenxuan Zhou, Shamil Chollampatt, Ravi Agrawal, Kaiqiang Song, Lingxiao Zhao, Chenguang Zhu

Abstract: Advancements in Large Language Models (LLMs) have significantly enhanced instruction-following capabilities. However, most Instruction Fine-Tuning (IFT) datasets are predominantly in English, limiting model performance in other languages. Traditional methods for creating multilingual IFT datasets such as translating existing English IFT datasets or converting existing NLP datasets into IFT datasets by templating, struggle to capture linguistic nuances and ensure prompt (instruction) diversity. To address this issue, we propose a novel method for collecting multilingual IFT datasets that preserves linguistic naturalness and ensures prompt diversity. This approach leverages English-focused LLMs, monolingual corpora, and a scoring function to create high-quality, diversified IFT datasets in multiple languages. Experiments demonstrate that LLMs finetuned using these IFT datasets show notable improvements in both generative and discriminative tasks, indicating enhanced language comprehension by LLMs in non-English contexts. Specifically, on the multilingual summarization task, LLMs using our IFT dataset achieved 17.57% and 15.23% improvements over LLMs fine-tuned with translation-based and template-based datasets, respectively.

Submitted 1 July, 2024; originally announced July 2024.

arXiv:2407.01145 [pdf]

Machine Learning-Assisted 3D Printing of Thermoelectric Materials of Ultrahigh Performances at Room Temperature

Authors: Kaidong Song, Guoyue Xu, A. N. M. Tanvir, Ke Wang, Md Omarsany Bappy, Haijian Yang, Wenjie Shang, Le Zhou, Alexander Dowling, Tengei Luo, Yanliang Zhang

Abstract: Thermoelectric energy conversion is an attractive technology for generating electricity from waste heat and using electricity for solid-state cooling. However, conventional manufacturing processes for thermoelectric devices are costly and limited to simple device geometries. This work reports an extrusion printing method to fabricate high-performance thermoelectric materials with complex 3D architectures. By integrating high-throughput experimentation and Bayesian optimization (BO), our approach significantly accelerates the simultaneous search for the optimal ink formulation and printing parameters that deliver high thermoelectric performances while maintaining desired shape fidelity. A Gaussian process regression (GPR)-based machine learning model is employed to expeditiously predict thermoelectric power factor as a function of ink formulation and printing parameters. The printed bismuth antimony telluride (BiSbTe)-based thermoelectric materials under the optimized conditions exhibit an ultrahigh room temperature zT of 1.3, which is by far the highest in the printed thermoelectric materials. The machine learning-guided ink-based printing strategy can be highly generalizable to a wide range of functional materials and devices for broad technological applications.

Submitted 1 July, 2024; originally announced July 2024.

arXiv:2407.00914 [pdf, ps, other]

Multifractal analysis of the convergence exponents for the digits in $d$-decaying Gauss like dynamical systems

Authors: Kunkun Song, Mengjie Zhang

Abstract: Let $\{a_n(x)\}_{n\geq1}$ be the sequence of digits of $x\in(0,1)$ in infinite iterated function systems with polynomial decay of the derivative. We first study the multifractal spectrum of the convergence exponent defined by the sequence of the digits $\{a_n(x)\}_{n\geq1}$ and the weighted products of distinct digits with finite numbers respectively, and then calculate the Hausdorff dimensions of the intersection of sets defined by the convergence exponent of the weighted product of distinct digits with finite numbers and sets of points whose digits are non-decreasing in such iterated function systems.

Submitted 30 June, 2024; originally announced July 2024.

Comments: 17 pages

MSC Class: 11K55; 28A80

arXiv:2406.18157 [pdf]

Photosensitive PEEK Ink Enables Digital Light Processing 3D Printed High-performance Small Architected-Plastics

Authors: Ze Zhang, Kewei Song, Rongyi Zhuang, Jianxian He, Yi Yang, Yifan Pan, Takeshi Mino, Kayo Hirose, Shinjiro Umezu

Abstract: Polyetheretherketone (PEEK), as a semi-crystalline high-performance engineering plastic, has demonstrated good application prospects since its introduction. The ability of PEEK to be fabricated in complex architecture is a major limitation due to the inherent shortcomings of material extrusion 3D printing technology in terms of low resolution, low surface quality, and interlayer bonding. We propose a novel PEEK ink processing process based on digital light processing (DLP) 3D printing, which is based on high solid content PEEK ink to achieve green bodies with high accuracy, and one-step sintering to enhance the crystallinity of PEEK. We have investigated the processing mechanism of this process and constructed perfect process parameters in terms of mouldability, printing accuracy, material thermal properties, and PEEK crystallinity. Furthermore, the material and architecture performance of the proposed process was evaluated in terms of comprehensive thermal performance (including heat resistance of the substrate, thermal stability, surface energy after heat treatment, and coefficient of static friction and coefficient of kinetic friction), mechanical performance, and corrosion resistance (20 wt% hydrochloric acid, 20 wt% sodium hydroxide, 99 wt% acetone, and 99.5 wt% chloroform). The process is a bold extension of PEEK processing methods to utilize the properties of PEEK in more flexible and efficient applications.

Submitted 26 June, 2024; originally announced June 2024.

arXiv:2406.17862 [pdf, other]

ESBMC v7.6: Enhanced Model Checking of C++ Programs with Clang AST

Authors: Xianzhiyu Li, Kunjian Song, Mikhail R. Gadelha, Franz Brauße, Rafael S. Menezes, Konstantin Korovin, Lucas C. Cordeiro

Abstract: This paper presents Efficient SMT-Based Context-Bounded Model Checker (ESBMC) v7.6, an extended version based on previous work on ESBMC v7.3 by K. Song et al. The v7.3 introduced a new Clang-based C++ front-end to address the challenges posed by modern C++ programs. Although the new front-end has demonstrated significant potential in previous studies, it remains in the developmental stage and lacks several essential features. ESBMC v7.6 further enhanced this foundation by adding and extending features based on the Clang AST, such as 1) exception handling, 2) extended memory management and memory safety verification, including dangling pointers, duplicate deallocation, memory leaks and rvalue references and 3) new operational models for STL updating the outdated C++ operational models. Our extensive experiments demonstrate that ESBMC v7.6 can handle a significantly broader range of C++ features introduced in recent versions of the C++ standard.

Submitted 25 June, 2024; originally announced June 2024.

Comments: 27 pages, 2 figures. arXiv admin note: substantial text overlap with arXiv:2308.05649

arXiv:2406.15664 [pdf, other]

Flat Posterior Does Matter For Bayesian Transfer Learning

Authors: Sungjun Lim, Jeyoon Yeom, Sooyon Kim, Hoyoon Byun, Jinho Kang, Yohan Jung, Jiyoung Jung, Kyungwoo Song

Abstract: The large-scale pre-trained neural network has achieved notable success in enhancing performance for downstream tasks. Another promising approach for generalization is Bayesian Neural Network (BNN), which integrates Bayesian methods into neural network architectures, offering advantages such as Bayesian Model averaging (BMA) and uncertainty quantification. Despite these benefits, transfer learning for BNNs has not been widely investigated and shows limited improvement. We hypothesize that this issue arises from the inability to find flat minima, which is crucial for generalization performance. To address this, we evaluate the sharpness of BNNs in various settings, revealing their insufficiency in seeking flat minima and the influence of flatness on BMA performance. Therefore, we propose Sharpness-aware Bayesian Model Averaging (SA-BMA), a Bayesian-fitting flat posterior seeking optimizer integrated with Bayesian transfer learning. SA-BMA calculates the divergence between posteriors in the parameter space, aligning with the nature of BNNs, and serves as a generalized version of existing sharpness-aware optimizers. We validate that SA-BMA improves generalization performance in few-shot classification and distribution shift scenarios by ensuring flatness.

Submitted 21 June, 2024; originally announced June 2024.

arXiv:2406.14228 [pdf, other]

EvoAgent: Towards Automatic Multi-Agent Generation via Evolutionary Algorithms

Authors: Siyu Yuan, Kaitao Song, Jiangjie Chen, Xu Tan, Dongsheng Li, Deqing Yang

Abstract: The rise of powerful large language models (LLMs) has spurred a new trend in building LLM-based autonomous agents for solving complex tasks, especially multi-agent systems. Despite the remarkable progress, we notice that existing works are heavily dependent on human-designed frameworks, which greatly limits the functional scope and scalability of agent systems. How to automatically extend the specialized agent to multi-agent systems to improve task-solving capability still remains a significant challenge. In this paper, we introduce EvoAgent, a generic method to automatically extend expert agents to multi-agent systems via the evolutionary algorithm, thereby improving the effectiveness of LLM-based agents in solving tasks. Specifically, we consider the existing agent frameworks as the initial individual and then apply a series of evolutionary operators (e.g., mutation, crossover, selection, etc.) to generate multiple agents with diverse agent settings. EvoAgent can be generalized to any LLM-based agent framework, and can automatically extend the existing agent framework to multi-agent systems without any extra human designs. Experimental results across various tasks have shown that EvoAgent can automatically generate multiple expert agents and significantly enhance the task-solving capabilities of LLM-based agents.

Submitted 11 July, 2024; v1 submitted 20 June, 2024; originally announced June 2024.

Comments: Work in process
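
Illustrative note: the abstract above treats an existing agent as the initial individual and applies mutation, crossover, and selection to produce a set of diverse agents. The toy loop below shows that generic evolutionary pattern over dictionary-valued "agent settings". It is not EvoAgent itself: in the paper the operators act on LLM-based agents, whereas here the settings, the option lists, and the fitness function are hypothetical stand-ins.

```python
import random


def crossover(parent_a, parent_b, rng):
    """Build a child by taking each setting from one of the two parents."""
    return {key: rng.choice((parent_a[key], parent_b[key])) for key in parent_a}


def mutate(agent, options, rng):
    """Resample one randomly chosen setting from its allowed options."""
    child = dict(agent)
    key = rng.choice(sorted(options))
    child[key] = rng.choice(options[key])
    return child


def evolve(initial, options, fitness, generations=5, population=4, seed=0):
    """Grow a pool of agents from a single initial individual."""
    rng = random.Random(seed)
    pool = [initial]
    for _ in range(generations):
        children = []
        for _ in range(population):
            parent_a, parent_b = rng.choice(pool), rng.choice(pool)
            children.append(mutate(crossover(parent_a, parent_b, rng), options, rng))
        # Selection: keep only the fittest individuals (parents included).
        pool = sorted(pool + children, key=fitness, reverse=True)[:population]
    return pool


# Hypothetical settings and fitness, for illustration only.
OPTIONS = {"role": ["planner", "critic", "coder"], "temperature": [0.2, 0.7, 1.0]}
START = {"role": "planner", "temperature": 0.7}


def toy_fitness(agent):
    return (agent["role"] == "critic") + (agent["temperature"] == 0.2)


agents = evolve(START, OPTIONS, toy_fitness)
```

Because selection keeps parents in the candidate pool, the best agent in the returned pool never scores lower than the initial individual.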

arXiv:2406.12084 [pdf, other]

When Reasoning Meets Information Aggregation: A Case Study with Sports Narratives

Authors: Yebowen Hu, Kaiqiang Song, Sangwoo Cho, Xiaoyang Wang, Wenlin Yao, Hassan Foroosh, Dong Yu, Fei Liu

Abstract: Reasoning is most powerful when an LLM accurately aggregates relevant information. We examine the critical role of information aggregation in reasoning by requiring the LLM to analyze sports narratives. To succeed at this task, an LLM must infer points from actions, identify related entities, attribute points accurately to players and teams, and compile key statistics to draw conclusions. We conduct comprehensive experiments with real NBA basketball data and present SportsGen, a new method to synthesize game narratives. By synthesizing data, we can rigorously evaluate LLMs' reasoning capabilities under complex scenarios with varying narrative lengths and density of information. Our findings show that most models, including GPT-4o, often fail to accurately aggregate basketball scores due to frequent scoring patterns. Open-source models like Llama-3 further suffer from significant score hallucinations. Finally, the effectiveness of reasoning is influenced by narrative complexity, information density, and domain-specific terms, highlighting the challenges in analytical reasoning tasks.

Submitted 17 June, 2024; originally announced June 2024.

arXiv:2406.11827 [pdf, other]

WPO: Enhancing RLHF with Weighted Preference Optimization

Authors: Wenxuan Zhou, Ravi Agrawal, Shujian Zhang, Sathish Reddy Indurthi, Sanqiang Zhao, Kaiqiang Song, Silei Xu, Chenguang Zhu

Abstract: Reinforcement learning from human feedback (RLHF) is a promising solution to align large language models (LLMs) more closely with human values. Off-policy preference optimization, where the preference data is obtained from other models, is widely adopted due to its cost efficiency and scalability. However, off-policy preference optimization often suffers from a distributional gap between the policy used for data collection and the target policy, leading to suboptimal optimization. In this paper, we propose a novel strategy to mitigate this problem by simulating on-policy learning with off-policy preference data. Our Weighted Preference Optimization (WPO) method adapts off-policy data to resemble on-policy data more closely by reweighting preference pairs according to their probability under the current policy. This method not only addresses the distributional gap problem but also enhances the optimization process without incurring additional costs. We validate our method on instruction following benchmarks including Alpaca Eval 2 and MT-bench. WPO not only outperforms Direct Preference Optimization (DPO) by up to 5.6% on Alpaca Eval 2 but also establishes a remarkable length-controlled winning rate against GPT-4-turbo of 48.6% based on Llama-3-8B-Instruct, making it the strongest 8B model on the leaderboard. We will release the code and models at https://github.com/wzhouad/WPO.

Submitted 17 June, 2024; originally announced June 2024.
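
Illustrative note: the abstract above describes reweighting preference pairs by their probability under the current policy. The sketch below shows one way such a weighted objective could look on top of the standard DPO loss. It is an assumption-laden toy, not the released WPO code: the weight (joint probability of both responses under the current policy, treated as a constant) and the normalization by the weight sum are choices made here for illustration, and the exact weighting in the paper may differ.

```python
import math


def dpo_loss(pol_w, pol_l, ref_w, ref_l, beta=0.1):
    """Standard DPO loss for one pair, from summed log-probabilities.

    pol_* are log-probs under the current policy, ref_* under the reference
    model; *_w is the preferred response, *_l the rejected one.
    """
    margin = beta * ((pol_w - ref_w) - (pol_l - ref_l))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))


def weighted_preference_loss(batch, beta=0.1):
    """Weighted average of DPO losses over a batch of preference pairs."""
    total = 0.0
    weight_sum = 0.0
    for pair in batch:
        # Assumed weight: probability of both responses under the current
        # policy, so pairs the policy would plausibly generate count more.
        weight = math.exp(pair["pol_w"] + pair["pol_l"])
        total += weight * dpo_loss(
            pair["pol_w"], pair["pol_l"], pair["ref_w"], pair["ref_l"], beta
        )
        weight_sum += weight
    return total / weight_sum


# Toy batch with made-up log-probabilities.
likely_pair = {"pol_w": -1.0, "pol_l": -1.5, "ref_w": -1.2, "ref_l": -1.2}
unlikely_pair = {"pol_w": -6.0, "pol_l": -9.0, "ref_w": -5.0, "ref_l": -5.0}
batch_loss = weighted_preference_loss([likely_pair, unlikely_pair])
```

In a real training loop the weight would be computed without propagating gradients through it, so that it only rescales each pair's contribution.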

arXiv:2406.11602 [pdf, other]

Association between a Failed Prominence Eruption and the Drainage of Mass from Another Prominence

Authors: Jianchao Xue, Li Feng, Hui Li, Ping Zhang, Jun Chen, Guanglu Shi, Kaifan Ji, Ye Qiu, Chuan Li, Lei Lu, Beili Ying, Ying Li, Yu Huang, Youping Li, Jingwei Li, Jie Zhao, Dechao Song, Shuting Li, Zhengyuan Tian, Yingna Su, Qingmin Zhang, Yunyi Ge, Jiahui Shan, Qiao Li, Gen Li, et al. (9 additional authors not shown)

Abstract: Sympathetic eruptions of solar prominences have been studied for decades, however, it is usually difficult to identify their causal links. Here we present two failed prominence eruptions on 26 October 2022 and explore their connections. Using stereoscopic observations, the south prominence (PRO-S) erupts with untwisting motions, flare ribbons occur underneath, and new connections are formed during the eruption. The north prominence (PRO-N) rises up along with PRO-S, and its upper part disappears due to catastrophic mass draining along an elongated structure after PRO-S failed eruption. We suggest that the eruption of PRO-S initiates due to a kink instability, further rises up, and fails to erupt due to reconnection with surrounding fields. The elongated structure connecting PRO-N overlies PRO-S, which causes the rising up of PRO-N along with PRO-S and mass drainage after PRO-S eruption. This study suggests that a prominence may end its life through mass drainage forced by an eruption underneath.

Submitted 20 June, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

Comments: 15 pages, 7 figures, has been accepted by Solar Physics

arXiv:2406.09716 [pdf, ps, other]

Speed-up of Data Analysis with Kernel Trick in Encrypted Domain

Authors: Joon Soo Yoo, Baek Kyung Song, Tae Min Ahn, Ji Won Heo, Ji Won Yoon

Abstract: Homomorphic encryption (HE) is pivotal for secure computation on encrypted data, crucial in privacy-preserving data analysis. However, efficiently processing high-dimensional data in HE, especially for machine learning and statistical (ML/STAT) algorithms, poses a challenge. In this paper, we present an effective acceleration method using the kernel method for HE schemes, enhancing time performance in ML/STAT algorithms within encrypted domains. This technique, independent of underlying HE mechanisms and complementing existing optimizations, notably reduces costly HE multiplications, offering near constant time complexity relative to data dimension. Aimed at accessibility, this method is tailored for data scientists and developers with limited cryptography background, facilitating advanced data analysis in secure environments.

Submitted 14 June, 2024; originally announced June 2024.

Comments: Submitted as a preprint

arXiv:2406.09081 [pdf, ps, other]

Multifractal analysis of the growth rate of digits in Schneider's $p$-adic continued fraction dynamical system

Authors: Kunkun Song, Wanlou Wu, Yueli Yu, Sainan Zeng

Abstract: Let $\mathbb{Z}_p$ be the ring of $p$-adic integers and $a_n(x)$ be the $n$-th digit of Schneider's $p$-adic continued fraction of $x\in p\mathbb{Z}_p$. We study the growth rate of the digits $\{a_n(x)\}_{n\geq1}$ from the viewpoint of multifractal analysis. The Hausdorff dimension of the set \[E_{\sup}(ψ)=\Big\{x\in p\mathbb{Z}_p:\ \limsup\limits_{n\to\infty}\frac{a_n(x)}{ψ(n)}=1\Big\}\] is completely determined for any $ψ:\mathbb{N}\to\mathbb{R}^{+}$ satisfying $ψ(n)\to \infty$ as $n\to\infty$. As an application, we also calculate the Hausdorff dimension of the intersection sets \[E^{\sup}_{\inf}(ψ,α_1,α_2)=\left\{x\in p\mathbb{Z}_p:\liminf_{n\rightarrow\infty}\dfrac{a_n(x)}{ψ(n)}=α_1,~\limsup_{n\rightarrow\infty}\dfrac{a_n(x)}{ψ(n)}=α_2\right\}\] for the above function $ψ$ and $0\leqα_1<α_2\leq\infty$.

Submitted 13 June, 2024; originally announced June 2024.

arXiv:2406.08263 [pdf, other]

Electrically tunable and enhanced nonlinearity of moiré exciton-polaritons in transition metal dichalcogenide bilayers

Authors: Kok Wee Song, Oleksandr Kyriienko

Abstract: We develop a microscopic theory for the nonlinear optical response of moiré exciton-polaritons in bilayers of transition metal dichalcogenides (TMDs). Our theory allows one to study the tunnel-coupled intralayer and interlayer excitonic modes for a wide range of twist angles ($θ$), external electric fields, and light-matter couplings, providing insights into a hybridization regime that was previously inaccessible. Specifically, we account for the Umklapp scattering processes of two exciton-polaritons responsible for enhanced nonlinearity, and show that they are crucial for describing interactions at strong hybridization. We reveal a regime of attractive nonlinearity for moiré polaritons, stemming from the anisotropic Coulomb interactions, which can explain some of the experimental features of the optical response in TMD bilayers. Furthermore, within our theory we demonstrate that the attractive nonlinearity can be tuned to repulsive by applying an external electric field. Our findings show that nonlinear moiré polaritons offer a controllable platform for nonlinear polaritonic devices.

Submitted 21 June, 2024; v1 submitted 12 June, 2024; originally announced June 2024.

Comments: main text: 8 pages and 3 figures, supplemental material: 5 pages and 1 figure

arXiv:2406.07572 [pdf, ps, other]

Domain-specific ReAct for physics-integrated iterative modeling: A case study of LLM agents for gas path analysis of gas turbines

Authors: Tao Song, Yuwei Fan, Chenlong Feng, Keyu Song, Chao Liu, Dongxiang Jiang

Abstract: This study explores the application of large language models (LLMs) with callable tools in the energy and power engineering domain, focusing on gas path analysis of gas turbines. We developed a dual-agent tool-calling process to integrate expert knowledge, predefined tools, and LLM reasoning. We evaluated various LLMs, including LLama3, Qwen1.5 and GPT. Smaller models struggled with tool usage and parameter extraction, while larger models demonstrated favorable capabilities. All models faced challenges with complex, multi-component problems. Based on the test results, we infer that LLMs with nearly 100 billion parameters could meet professional scenario requirements with fine-tuning and advanced prompt design. Continued development is likely to enhance their accuracy and effectiveness, paving the way for more robust AI-driven solutions.

Submitted 1 June, 2024; originally announced June 2024.

arXiv:2406.07471 [pdf, other]

OphNet: A Large-Scale Video Benchmark for Ophthalmic Surgical Workflow Understanding

Authors: Ming Hu, Peng Xia, Lin Wang, Siyuan Yan, Feilong Tang, Zhongxing Xu, Yimin Luo, Kaimin Song, Jurgen Leitner, Xuelian Cheng, Jun Cheng, Chi Liu, Kaijing Zhou, Zongyuan Ge

Abstract: Surgical scene perception via videos is critical for advancing robotic surgery, telesurgery, and AI-assisted surgery, particularly in ophthalmology. However, the scarcity of diverse and richly annotated video datasets has hindered the development of intelligent systems for surgical workflow analysis. Existing datasets face challenges such as small scale, lack of diversity in surgery and phase categories, and absence of time-localized annotations. These limitations impede action understanding and model generalization validation in complex and diverse real-world surgical scenarios. To address this gap, we introduce OphNet, a large-scale, expert-annotated video benchmark for ophthalmic surgical workflow understanding. OphNet features: 1) A diverse collection of 2,278 surgical videos spanning 66 types of cataract, glaucoma, and corneal surgeries, with detailed annotations for 102 unique surgical phases and 150 fine-grained operations. 2) Sequential and hierarchical annotations for each surgery, phase, and operation, enabling comprehensive understanding and improved interpretability. 3) Time-localized annotations, facilitating temporal localization and prediction tasks within surgical workflows. With approximately 285 hours of surgical videos, OphNet is about 20 times larger than the largest existing surgical workflow analysis benchmark. Code and dataset are available at: https://minghu0830.github.io/OphNet-benchmark/.

Submitted 19 July, 2024; v1 submitted 11 June, 2024; originally announced June 2024.

Comments: Accepted by ECCV 2024

arXiv:2406.05763 [pdf, other]

WenetSpeech4TTS: A 12,800-hour Mandarin TTS Corpus for Large Speech Generation Model Benchmark

Authors: Linhan Ma, Dake Guo, Kun Song, Yuepeng Jiang, Shuai Wang, Liumeng Xue, Weiming Xu, Huan Zhao, Binbin Zhang, Lei Xie

Abstract: With the development of large text-to-speech (TTS) models and scale-up of the training data, state-of-the-art TTS systems have achieved impressive performance. In this paper, we present WenetSpeech4TTS, a multi-domain Mandarin corpus derived from the open-sourced WenetSpeech dataset. Tailored for the text-to-speech tasks, we refined WenetSpeech by adjusting segment boundaries, enhancing the audio quality, and eliminating speaker mixing within each segment. Following a more accurate transcription process and quality-based data filtering process, the obtained WenetSpeech4TTS corpus contains $12,800$ hours of paired audio-text data. Furthermore, we have created subsets of varying sizes, categorized by segment quality scores to allow for TTS model training and fine-tuning. VALL-E and NaturalSpeech 2 systems are trained and fine-tuned on these subsets to validate the usability of WenetSpeech4TTS, establishing baselines on benchmark for fair comparison of TTS systems. The corpus and corresponding benchmarks are publicly available on huggingface.

Submitted 19 June, 2024; v1 submitted 9 June, 2024; originally announced June 2024.

Comments: Accepted by INTERSPEECH2024

arXiv:2406.05613 [pdf, other]

Distributed Motion Control of Multiple Mobile Manipulator System with Disturbance and Communication Delay

Authors: Wenhang Liu, Meng Ren, Kun Song, Michael Yu Wang, Zhenhua Xiong

Abstract: In real-world object manipulation scenarios, multiple mobile manipulator systems may suffer from disturbances and asynchrony, leading to excessive interaction forces and causing object damage or emergency stops. This paper presents a novel distributed motion control approach aimed at reducing these unnecessary interaction forces. The control strategy only utilizes force information without the need for global position and velocity information. Disturbances are corrected through compensatory movements of the manipulators. Besides, the asymmetric, non-uniform, and time-varying communication delays between robots are also considered. The stability of the control law is rigorously proven by the Lyapunov theorem. Subsequently, the efficacy of the proposed control law is validated through simulations and experiments of collaborative object transportation by two robots. Experimental results demonstrate the effectiveness of the proposed control law in reducing interaction forces during object manipulation.

Submitted 8 June, 2024; originally announced June 2024.

arXiv:2406.05352 [pdf, other]

1st Place Winner of the 2024 Pixel-level Video Understanding in the Wild (CVPR'24 PVUW) Challenge in Video Panoptic Segmentation and Best Long Video Consistency of Video Semantic Segmentation

Authors: Qingfeng Liu, Mostafa El-Khamy, Kee-Bong Song

Abstract: The third Pixel-level Video Understanding in the Wild (PVUW CVPR 2024) challenge aims to advance the state of the art in video understanding through benchmarking Video Panoptic Segmentation (VPS) and Video Semantic Segmentation (VSS) on challenging videos and scenes introduced in the large-scale Video Panoptic Segmentation in the Wild (VIPSeg) test set and the large-scale Video Scene Parsing in the Wild (VSPW) test set, respectively. This paper details our research work that achieved 1st place in the PVUW'24 VPS challenge, establishing state-of-the-art results in all metrics, including the Video Panoptic Quality (VPQ) and Segmentation and Tracking Quality (STQ). With minor fine-tuning, our approach also achieved 3rd place in the PVUW'24 VSS challenge ranked by the mIoU (mean intersection over union) metric and first place ranked by the VC16 (16-frame video consistency) metric. Our winning solution stands on the shoulders of a giant foundational vision transformer model (DINOv2 ViT-g) and proven multi-stage Decoupled Video Instance Segmentation (DVIS) frameworks for video understanding.

Submitted 8 June, 2024; originally announced June 2024.

arXiv:2406.04941 [pdf, ps, other]

TCMD: A Traditional Chinese Medicine QA Dataset for Evaluating Large Language Models

Authors: Ping Yu, Kaitao Song, Fengchen He, Ming Chen, Jianfeng Lu

Abstract: The recent unprecedented advancements in Large Language Models (LLMs) have propelled the medical community by establishing advanced medical-domain models. However, due to the limited collection of medical datasets, there are only a few comprehensive benchmarks available to gauge progress in this area. In this paper, we introduce a new medical question-answering (QA) dataset that contains massive manual instruction for solving Traditional Chinese Medicine examination tasks, called TCMD. Specifically, our TCMD collects massive questions across diverse domains with their annotated medical subjects and thus supports us in comprehensively assessing the capability of LLMs in the TCM domain. Extensive evaluation of various general LLMs and medical-domain-specific LLMs is conducted. Moreover, we also analyze the robustness of current LLMs in solving TCM QA tasks by introducing randomness. The inconsistency of the experimental results also reveals the shortcomings of current LLMs in solving QA tasks. We also expect that our dataset can further facilitate the development of LLMs in the TCM area.

Submitted 7 June, 2024; originally announced June 2024.

arXiv:2406.03999 [pdf, other]

Unveiling the Dynamics of Information Interplay in Supervised Learning

Authors: Kun Song, Zhiquan Tan, Bochao Zou, Huimin Ma, Weiran Huang

Abstract: In this paper, we use matrix information theory as an analytical tool to analyze the dynamics of the information interplay between data representations and classification head vectors in the supervised learning process. Specifically, inspired by the theory of Neural Collapse, we introduce matrix mutual information ratio (MIR) and matrix entropy difference ratio (HDR) to assess the interactions of data representation and class classification heads in supervised learning, and we determine the theoretical optimal values for MIR and HDR when Neural Collapse happens. Our experiments show that MIR and HDR can effectively explain many phenomena occurring in neural networks, for example, the standard supervised training dynamics, linear mode connectivity, and the performance of label smoothing and pruning. Additionally, we use MIR and HDR to gain insights into the dynamics of grokking, which is an intriguing phenomenon observed in supervised training, where the model demonstrates generalization capabilities long after it has learned to fit the training data. Furthermore, we introduce MIR and HDR as loss terms in supervised and semi-supervised learning to optimize the information interactions among samples and classification heads. The empirical results provide evidence of the method's effectiveness, demonstrating that the utilization of MIR and HDR not only aids in comprehending the dynamics throughout the training process but can also enhance the training procedure itself.

Submitted 6 June, 2024; originally announced June 2024.

Comments: Accepted by ICML 2024

arXiv:2405.20840 [pdf, ps, other]

Convergence rate of the Euler-Maruyama scheme to density dependent SDEs driven by $α$-stable additive noise

Authors: Ke Song, Zimo Hao

Abstract: In this paper, we establish the weak convergence rate of density-dependent stochastic differential equations with bounded drift driven by $α$-stable processes with $α\in(1,2)$. The well-posedness of these equations has been previously obtained in \cite{wu2023well}. We derive an explicit convergence rate in total variation for the Euler-Maruyama scheme, employing a technique rooted in \cite{hao2023}.

Submitted 31 May, 2024; originally announced May 2024.

arXiv:2405.19119 [pdf, other]

Can Graph Learning Improve Task Planning?

Authors: Xixi Wu, Yifei Shen, Caihua Shan, Kaitao Song, Siwei Wang, Bohang Zhang, Jiarui Feng, Hong Cheng, Wei Chen, Yun Xiong, Dongsheng Li

Abstract: Task planning is emerging as an important research topic alongside the development of large language models (LLMs). It aims to break down complex user requests into solvable sub-tasks, thereby fulfilling the original requests. In this context, the sub-tasks can be naturally viewed as a graph, where the nodes represent the sub-tasks, and the edges denote the dependencies among them. Consequently, task planning is a decision-making problem that involves selecting a connected path or subgraph within the corresponding graph and invoking it. In this paper, we explore graph learning-based methods for task planning, a direction that is orthogonal to the prevalent focus on prompt design. Our interest in graph learning stems from a theoretical discovery: the biases of attention and auto-regressive loss impede LLMs' ability to effectively navigate decision-making on graphs, which is adeptly addressed by graph neural networks (GNNs). This theoretical insight led us to integrate GNNs with LLMs to enhance overall performance. Extensive experiments demonstrate that GNN-based methods surpass existing solutions even without training, and minimal training can further enhance their performance. Additionally, our approach complements prompt engineering and fine-tuning techniques, with performance further enhanced by improved prompts or a fine-tuned model.

Submitted 29 May, 2024; originally announced May 2024.

arXiv:2405.11747 [pdf, ps, other]

Wolff potentials and nonlocal equations of Lane-Emden type

Authors: Quoc-Hung Nguyen, Jihoon Ok, Kyeong Song

Abstract: We consider nonlocal equations of the type \[ (-Δ_{p})^{s}u = μ\quad \text{in }Ω, \] where $Ω\subset \mathbb{R}^{n}$ is either a bounded domain or the whole $\mathbb{R}^{n}$, $μ$ is a Radon measure on $Ω$, $0<s<1$ and $1<p<n/s$. In particular, we extend the existence, regularity and Wolff potential estimates for SOLA (Solutions Obtained as Limits of Approximations), established by Kuusi, Mingione, and Sire (Comm. Math. Phys. 337:1317--1368, 2015), to the strongly singular case $1<p\le2-s/n$. Moreover, using Wolff potentials and Orlicz capacities, we present both a sufficient and a necessary condition for the existence of SOLA to nonlocal equations of the type \[ (-Δ_{p})^{s}u = P(u) + μ\quad \text{in }Ω, \] where $P(\cdot)$ is either a power function or an exponential function.

Submitted 19 May, 2024; originally announced May 2024.

arXiv:2405.11726 [pdf, other]

RHAML: Rendezvous-based Hierarchical Architecture for Mutual Localization

Authors: Gaoming Chen, Kun Song, Xiang Xu, Wenhang Liu, Zhenhua Xiong

Abstract: Mutual localization serves as the foundation for collaborative perception and task assignment in multi-robot systems. Effectively utilizing limited onboard sensors for mutual localization between marker-less robots is a worthwhile goal. However, due to inadequate consideration of large scale variations of the observed robot and localization refinement, previous work has shown limited accuracy when robots are equipped only with RGB cameras. To enhance the precision of localization, this paper proposes a novel rendezvous-based hierarchical architecture for mutual localization (RHAML). Firstly, to learn multi-scale robot features, anisotropic convolutions are introduced into the network, yielding initial localization results. Then, the iterative refinement module with rendering is employed to adjust the observed robot poses. Finally, the pose graph is conducted to globally optimize all localization results, which takes into account multi-frame observations. Therefore, a flexible architecture is provided that allows for the selection of appropriate modules based on requirements. Simulations demonstrate that RHAML effectively addresses the problem of multi-robot mutual localization, achieving translation errors below 2 cm and rotation errors below 0.5 degrees when robots exhibit 5 m of depth variation. Moreover, its practical utility is validated by applying it to map fusion when multi-robots explore unknown environments.

Submitted 19 May, 2024; originally announced May 2024.

Comments: 8 pages, 8 figures, submitted to RA-L

arXiv:2405.08345 [pdf, other]

Multi-Robot Rendezvous in Unknown Environment with Limited Communication

Authors: Kun Song, Gaoming Chen, Wenhang Liu, Zhenhua Xiong

Abstract: Rendezvous aims at gathering all robots at a specific location, which is an important collaborative behavior for multirobot systems. However, in an unknown environment, it is challenging to achieve rendezvous. Previous research mainly focuses on special scenarios where communication is not allowed and each robot executes a random searching strategy, which is highly time-consuming, especially in large-scale environments. In this work, we focus on rendezvous in unknown environments where communication is available. We divide this task into two steps: rendezvous-based environment exploration with relative pose (RP) estimation and rendezvous point election. A new strategy called partitioned and incomplete exploration for rendezvous (PIER) is proposed to efficiently explore the unknown environment, where lightweight topological maps are constructed and shared among robots for RP estimation with very few communications. Then, a rendezvous point selection algorithm based on the merged topological map is proposed for efficient rendezvous for multi-robot systems. The effectiveness of the proposed methods is validated in both simulations and real-world experiments.

Submitted 14 May, 2024; originally announced May 2024.

Comments: Submitted to RA-L. 8 pages, 6 figures

arXiv:2404.19205 [pdf, other]

TableVQA-Bench: A Visual Question Answering Benchmark on Multiple Table Domains

Authors: Yoonsik Kim, Moonbin Yim, Ka Yeon Song

Abstract: In this paper, we establish a benchmark for table visual question answering, referred to as the TableVQA-Bench, derived from pre-existing table question-answering (QA) and table structure recognition datasets. It is important to note that existing datasets have not incorporated images or QA pairs, which are two crucial components of TableVQA. As such, the primary objective of this paper is to obtain these necessary components. Specifically, images are sourced either through the application of a stylesheet or by employing the proposed table rendering system. QA pairs are generated by exploiting the large language model (LLM) where the input is a text-formatted table. Ultimately, the completed TableVQA-Bench comprises 1,500 QA pairs. We comprehensively compare the performance of various multi-modal large language models (MLLMs) on TableVQA-Bench. GPT-4V achieves the highest accuracy among commercial and open-sourced MLLMs from our experiments. Moreover, we discover that the number of vision queries plays a significant role in TableVQA performance. To further analyze the capabilities of MLLMs in comparison to their LLM backbones, we investigate by presenting image-formatted tables to MLLMs and text-formatted tables to LLMs, respectively. Our findings suggest that processing visual inputs is more challenging than text inputs, as evidenced by the lower performance of MLLMs, despite generally requiring higher computational costs than LLMs. The proposed TableVQA-Bench and evaluation codes are available at https://github.com/naver-ai/tablevqabench.

Submitted 29 April, 2024; originally announced April 2024.

Comments: Technical Report

arXiv:2404.18252 [pdf, other]

Fisher Information Improved Training-Free Conditional Diffusion Model

Authors: Kaiyu Song, Hanjiang Lai

Abstract: Recently, the diffusion model with the training-free methods has succeeded in conditional image generation tasks. However, there is an efficiency problem because it requires calculating the gradient with high computational cost, and previous methods make strong assumptions to solve it, sacrificing generalization. In this work, we propose the Fisher information guided diffusion model (FIGD). Concretely, we introduce the Fisher information to estimate the gradient without making any additional assumptions to reduce computation cost. Meanwhile, we demonstrate that the Fisher information ensures the generalization of FIGD and provides new insights for training-free methods based on the information theory. The experimental results demonstrate that FIGD could achieve different conditional generations more quickly while maintaining high quality.

Submitted 28 April, 2024; originally announced April 2024.

arXiv:2404.13694 [pdf, other]

Solute segregation in polycrystalline aluminum from hybrid Monte Carlo and molecular dynamics simulations with a unified neuroevolution potential

Authors: Keke Song, Jiahui Liu, Shunda Chen, Zheyong Fan, Yanjing Su, Ping Qian

Abstract: One of the most effective methods to enhance the strength of aluminum alloys involves modifying grain boundaries (GBs) through solute segregation. However, the fundamental mechanisms of solute segregation and their impacts on material properties remain elusive. In this study, we implemented highly efficient hybrid Monte Carlo and molecular dynamics (MCMD) algorithms in the graphics processing units molecular dynamics (GPUMD) package. Using this efficient MCMD approach combined with a general-purpose machine-learning-based neuroevolution potential (NEP) for 16 elemental metals and their alloys, we simulated the segregation of 15 solutes in polycrystalline Al. Our results elucidate the segregation behavior and trends of 15 solutes in polycrystalline Al. Additionally, we investigated the impact of solutes on the strength of polycrystalline Al. The mechanisms underlying solute strengthening and embrittlement were analyzed at the atomistic level, revealing the importance of GB cohesion, as well as the nucleation and movement of Shockley dislocations, in determining the material's strength. We anticipate that our developed methods, along with our insights into solute segregation behavior in polycrystalline Al, will be valuable for the design of Al alloys and other multi-component materials, including medium-entropy materials, high-entropy materials, and complex concentrated alloys.

Submitted 21 April, 2024; originally announced April 2024.

Comments: 10 pages, 6 figures

arXiv:2404.11092 [pdf, ps, other]

Estimation for conditional moment models based on martingale difference divergence

Authors: Kunyang Song, Feiyu Jiang, Ke Zhu

Abstract: We provide a new estimation method for conditional moment models via the martingale difference divergence (MDD). Our MDD-based estimation method is formed in the framework of a continuum of unconditional moment restrictions. Unlike the existing estimation methods in this framework, the MDD-based estimation method adopts a non-integrable weighting function, which could grab more information from unconditional moment restrictions than the integrable weighting function to enhance the estimation efficiency. Due to the nature of shift-invariance in MDD, our MDD-based estimation method cannot identify the intercept parameters. To overcome this identification issue, we further provide a two-step estimation procedure for the model with intercept parameters. Under regularity conditions, we establish the asymptotics of the proposed estimators, which are not only easy to implement with analytic asymptotic variances, but also applicable to time series data with an unspecified form of conditional heteroskedasticity. Finally, we illustrate the usefulness of the proposed estimators by simulations and two real examples.

Submitted 17 April, 2024; originally announced April 2024.

arXiv:2404.09531 [pdf, other]

Oblique-MERF: Revisiting and Improving MERF for Oblique Photography

Authors: Xiaoyi Zeng, Kaiwen Song, Leyuan Yang, Bailin Deng, Juyong Zhang

Abstract: Neural implicit fields have established a new paradigm for scene representation, with subsequent work achieving high-quality real-time rendering. However, reconstructing 3D scenes from oblique aerial photography presents unique challenges, such as varying spatial scale distributions and a constrained range of tilt angles, often resulting in high memory consumption and reduced rendering quality at extrapolated viewpoints. In this paper, we enhance MERF to accommodate these data characteristics by introducing an innovative adaptive occupancy plane optimized during the volume rendering process and a smoothness regularization term for view-dependent color to address these issues. Our approach, termed Oblique-MERF, surpasses state-of-the-art real-time methods by approximately 0.7 dB, reduces VRAM usage by about 40%, and achieves higher rendering frame rates with more realistic rendering outcomes across most viewpoints.

Submitted 15 April, 2024; originally announced April 2024.

arXiv:2404.05674 [pdf, other]

MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation

Authors: Kunpeng Song, Yizhe Zhu, Bingchen Liu, Qing Yan, Ahmed Elgammal, Xiao Yang

Abstract: In this paper, we present MoMA: an open-vocabulary, training-free personalized image model that boasts flexible zero-shot capabilities. As foundational text-to-image models rapidly evolve, the demand for robust image-to-image translation grows. Addressing this need, MoMA specializes in subject-driven personalized image generation. Utilizing an open-source, Multimodal Large Language Model (MLLM), we train MoMA to serve a dual role as both a feature extractor and a generator. This approach effectively synergizes reference image and text prompt information to produce valuable image features, facilitating an image diffusion model. To better leverage the generated features, we further introduce a novel self-attention shortcut method that efficiently transfers image features to an image diffusion model, improving the resemblance of the target object in generated images. Remarkably, as a tuning-free plug-and-play module, our model requires only a single reference image and outperforms existing methods in generating images with high detail fidelity, enhanced identity-preservation and prompt faithfulness. Our work is open-source, thereby providing universal access to these advancements.

Submitted 8 April, 2024; originally announced April 2024.

arXiv:2404.02117 [pdf, other]

Pre-trained Vision and Language Transformers Are Few-Shot Incremental Learners

Authors: Keon-Hee Park, Kyungwoo Song, Gyeong-Moon Park

Abstract: Few-Shot Class Incremental Learning (FSCIL) is a task that requires a model to learn new classes incrementally without forgetting when only a few samples for each class are given. FSCIL encounters two significant challenges: catastrophic forgetting and overfitting, and these challenges have driven prior studies to primarily rely on shallow models, such as ResNet-18. Even though their limited capacity can mitigate both forgetting and overfitting issues, it leads to inadequate knowledge transfer during few-shot incremental sessions. In this paper, we argue that large models such as vision and language transformers pre-trained on large datasets can be excellent few-shot incremental learners. To this end, we propose a novel FSCIL framework called PriViLege, Pre-trained Vision and Language transformers with prompting functions and knowledge distillation. Our framework effectively addresses the challenges of catastrophic forgetting and overfitting in large models through new pre-trained knowledge tuning (PKT) and two losses: entropy-based divergence loss and semantic knowledge distillation loss. Experimental results show that the proposed PriViLege significantly outperforms the existing state-of-the-art methods by a large margin, e.g., +9.38% in CUB200, +20.58% in CIFAR-100, and +13.36% in miniImageNet. Our implementation code is available at https://github.com/KHU-AGI/PriViLege.

Submitted 2 April, 2024; originally announced April 2024.

Comments: Accepted by CVPR 2024

arXiv:2404.01954 [pdf, other]

HyperCLOVA X Technical Report

Authors: Kang Min Yoo, Jaegeun Han, Sookyo In, Heewon Jeon, Jisu Jeong, Jaewook Kang, Hyunwook Kim, Kyung-Min Kim, Munhyong Kim, Sungju Kim, Donghyun Kwak, Hanock Kwak, Se Jung Kwon, Bado Lee, Dongsoo Lee, Gichang Lee, Jooho Lee, Baeseong Park, Seongjin Shin, Joonsang Yu, Seolki Baek, Sumin Byeon, Eungsup Cho, Dooseok Choe, Jeesung Han, et al. (371 additional authors not shown)

Abstract: We introduce HyperCLOVA X, a family of large language models (LLMs) tailored to the Korean language and culture, along with competitive capabilities in English, math, and coding. HyperCLOVA X was trained on a balanced mix of Korean, English, and code data, followed by instruction-tuning with high-quality human-annotated datasets while abiding by strict safety guidelines reflecting our commitment to responsible AI. The model is evaluated across various benchmarks, including comprehensive reasoning, knowledge, commonsense, factuality, coding, math, chatting, instruction-following, and harmlessness, in both Korean and English. HyperCLOVA X exhibits strong reasoning capabilities in Korean backed by a deep understanding of the language and cultural nuances. Further analysis of the inherent bilingual nature and its extension to multilingualism highlights the model's cross-lingual proficiency and strong generalization ability to untargeted languages, including machine translation between several language pairs and cross-lingual inference tasks. We believe that HyperCLOVA X can provide helpful guidance for regions or countries in developing their sovereign LLMs.

Submitted 13 April, 2024; v1 submitted 2 April, 2024; originally announced April 2024.

Comments: 44 pages; updated authors list and fixed author names

arXiv:2404.01706 [pdf, other]

Polarity Calibration for Opinion Summarization

Authors: Yuanyuan Lei, Kaiqiang Song, Sangwoo Cho, Xiaoyang Wang, Ruihong Huang, Dong Yu

Abstract: Opinion summarization is the automatic generation of summaries from a variety of subjective information, such as product reviews or political opinions. The challenge of opinion summarization lies in presenting divergent or even conflicting opinions. We conduct an analysis of previous summarization models, which reveals their inclination to amplify the polarity bias, emphasizing the majority opinions while ignoring the minority opinions. To address this issue and make the summarizer express both sides of opinions, we introduce the concept of polarity calibration, which aims to align the polarity of the output summary with that of the input text. Specifically, we develop a reinforcement training approach for polarity calibration. This approach feeds the polarity distance between the output summary and the input text as a reward into the summarizer, and also balances polarity calibration with content preservation and language naturality. We evaluate our Polarity Calibration model (PoCa) on two types of opinion summarization tasks: summarizing product reviews and political opinion articles. Automatic and human evaluations demonstrate that our approach can mitigate the polarity mismatch between the output summary and the input text, as well as maintain content semantics and language quality.

Submitted 2 April, 2024; originally announced April 2024.

Comments: Accepted to NAACL 2024

arXiv:2403.19833 [pdf, other]

ChatTracer: Large Language Model Powered Real-time Bluetooth Device Tracking System

Authors: Qijun Wang, Shichen Zhang, Kunzhe Song, Huacheng Zeng

Abstract: Large language models (LLMs) have transformed the way we interact with cyber technologies. In this paper, we study the possibility of connecting LLMs with wireless sensor networks (WSN). A successful design will not only extend LLMs' knowledge landscape to the physical world but also revolutionize human interaction with WSN. To this end, we present ChatTracer, an LLM-powered real-time Bluetooth device tracking system. ChatTracer comprises three key components: an array of Bluetooth sniffing nodes, a database, and a fine-tuned LLM. ChatTracer was designed based on our experimental observation that commercial Apple/Android devices always broadcast hundreds of BLE packets per minute even in their idle status. Its novelties lie in two aspects: i) a reliable and efficient BLE packet grouping algorithm; and ii) an LLM fine-tuning strategy that combines both supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF). We have built a prototype of ChatTracer with four sniffing nodes. Experimental results show that ChatTracer not only outperforms existing localization approaches, but also provides an intelligent interface for user interaction.

Submitted 9 July, 2024; v1 submitted 28 March, 2024; originally announced March 2024.

arXiv:2403.15187 [pdf, ps, other]

Spectrum of $S$- and $P$-wave $cc\bar{q}\bar{q}'$ $(\bar{q},\bar{q}' = \bar{u}, \bar{d}, \bar{s})$ systems in a chiral SU(3) quark model

Authors: Du Wang, Ke-Rang Song, Wen-Ling Wang, Fei Huang

Abstract: Inspired by the resonance $T_{cc}^+(3875)$ recently observed by the LHCb Collaboration, we systematically explore the $S$- and $P$-wave $cc\bar{q}\bar{q}'$ $(\bar{q},\bar{q}' = \bar{u}, \bar{d}, \bar{s})$ systems in a chiral SU(3) quark model. The Hamiltonian contains the kinetic energy, the one-gluon-exchange (OGE) potential, the confinement potential, and the one-boson-exchange (OBE) potential stemming from the coupling of quark and chiral fields. The Schrödinger equation is solved by use of the variational method with the spatial trial wave functions chosen as Gaussian functions. It is found that the lowest state has a mass of $3879$ MeV, isospin and spin-parity $IJ^P=01^+$, and quark constituents $cc\bar{u}\bar{d}$, in agreement with the experimentally observed $T_{cc}^+(3875)$. This state is approximately at the calculated $DD^\ast$ threshold, and has a root-mean-square radius of about $0.48$ fm. This demonstrates that the $T_{cc}^+(3875)$ can be accommodated as a stable and compact tetraquark state in the chiral SU(3) quark model. All the other $S$- and $P$-wave $cc\bar{q}\bar{q}'$ $(\bar{q},\bar{q}' = \bar{u}, \bar{d}, \bar{s})$ states lie about one hundred to a few hundred MeV higher than the corresponding meson-meson thresholds, and thus are not suggested to be candidates for stable and compact tetraquark states due to their fall-apart decays to two mesons.

Submitted 22 March, 2024; originally announced March 2024.

Comments: 9 pages, 2 figures

arXiv:2403.10558 [pdf, other]

Adaptive Hybrid Masking Strategy for Privacy-Preserving Face Recognition Against Model Inversion Attack

Authors: Yinggui Wang, Yuanqing Huang, Jianshu Li, Le Yang, Kai Song, Lei Wang

Abstract: The utilization of personal sensitive data in training face recognition (FR) models poses significant privacy concerns, as adversaries can employ model inversion attacks (MIA) to infer the original training data. Existing defense methods, such as data augmentation and differential privacy, have been employed to mitigate this issue. However, these methods often fail to strike an optimal balance between privacy and accuracy. To address this limitation, this paper introduces an adaptive hybrid masking algorithm against MIA. Specifically, face images are masked in the frequency domain using an adaptive MixUp strategy. Unlike the traditional MixUp algorithm, which is predominantly used for data augmentation, our modified approach incorporates frequency domain mixing. Previous studies have shown that increasing the number of images mixed in MixUp can enhance privacy preservation but at the expense of reduced face recognition accuracy. To overcome this trade-off, we develop an enhanced adaptive MixUp strategy based on reinforcement learning, which enables us to mix a larger number of images while maintaining satisfactory recognition accuracy. To optimize privacy protection, we propose maximizing the reward function (i.e., the loss function of the FR system) during the training of the strategy network. In contrast, the loss function of the FR network is minimized during the training of the FR network. The strategy network and the face recognition network can be viewed as antagonistic entities in the training process, ultimately reaching a more balanced trade-off.
Experimental results demonstrate that our proposed hybrid masking scheme outperforms existing defense algorithms in terms of privacy preservation and recognition accuracy against MIA.

Submitted 23 April, 2024; v1 submitted 13 March, 2024; originally announced March 2024.

arXiv:2403.09073 [pdf, other]

Large Language Models are Parallel Multilingual Learners

Authors: Yongyu Mu, Peinan Feng, Zhiquan Cao, Yuzhang Wu, Bei Li, Chenglong Wang, Tong Xiao, Kai Song, Tongran Liu, Chunliang Zhang, Jingbo Zhu

Abstract: In this study, we reveal an in-context learning (ICL) capability of multilingual large language models (LLMs): by translating the input to several languages, we provide Parallel Input in Multiple Languages (PiM) to LLMs, which significantly enhances their comprehension abilities. To test this capability, we design extensive experiments encompassing 8 typical datasets, 7 languages and 8 state-of-the-art multilingual LLMs. Experimental results show that (1) incorporating more languages helps PiM surpass conventional ICL further; (2) even combining with translations that are inferior to baseline performance can help. Moreover, by examining the activated neurons in LLMs, we discover a counterintuitive but interesting phenomenon. Contrary to the common thought that PiM would activate more neurons than monolingual input to leverage knowledge learned from diverse languages, PiM actually inhibits neurons and promotes more precise neuron activation, especially when more languages are added. This phenomenon aligns with the neuroscience insight about synaptic pruning, which removes less used neural connections, strengthens the remainder, and thereby enhances brain intelligence.

Submitted 13 March, 2024; originally announced March 2024.

Comments: Work in progress

arXiv:2403.08964 [pdf]

Hyperelasticity of Blood Clots: Bridging the Gap between Microscopic and Continuum Scales

Authors: Nicholas Filla, Beikang Gu, Jixin Hou, Kenan Song, He Li, Ning Liu, Xianqiao Wang

Abstract: The biomechanical properties of blood clots, which are dictated by their compositions and micro-structures, play a critical role in determining their fates (occlusion, persistence, or embolization) in the human circulatory system. While numerous constitutive models have emerged to describe the biomechanics of blood clots, the majority of these models have primarily focused on the macroscopic deformation of the clots and the resultant strain-stress correlations without depicting the microscopic contributions from their structural components, such as fibrin fibers, the fibrin network and red blood cells. This work addresses the gap in current scientific understanding by quantifying how changes in the microstructure of blood clots affect their mechanical responses under different external stresses. We leverage our previously published work to develop a hyperelastic potential model for blood clots, which incorporates six distinct strain-energy components to describe the alignment of fibers, the entropic and enthalpic stretching of fibrin fibers, the buckling of these fibers, clot densification, and clot jamming. These strain-energy components are represented by a combination of simple harmonic oscillators, one-sided harmonic potentials, and a Gaussian potential. The proposed model, which is C0, C1, and C2 continuous with a total of 13 parameters, has been validated against three data sets: fibrin clot in tension, blood clot in compression, and blood clots in shear, demonstrating its robustness.
Subsequent simulations of a microscopic blood clot model are performed to uncover mechanistic correlations for a majority of the hyperelastic potential's stiffness/strain parameters. Our results show that only one proposed term concerning fiber buckling needs further refinement, while the remaining five strain-energy terms appear to describe precisely what they were intended to.

Submitted 13 March, 2024; originally announced March 2024.

Comments: 13 figures

arXiv:2403.08827 [pdf, other]

Locational Scenario-based Pricing in a Bilateral Distribution Energy Market under Uncertainty

Authors: Hien Thanh Doan, Minsoo Kim, Keunju Song, Hongseok Kim

Abstract: In recent years, there has been a significant focus on advancing the next generation of power systems. Despite these efforts, persistent challenges revolve around addressing the operational impact of uncertainty on predicted data, especially concerning economic dispatch and optimal power flow. To tackle these challenges, we introduce a stochastic day-ahead scheduling approach for a community. This method involves iterative improvements in economic dispatch and optimal power flow, aiming to minimize operational costs by incorporating quantile forecasting. Then, we present a real-time market and payment problem to handle optimization in real-time decision-making and payment calculation. We assess the effectiveness of our proposed method against benchmark results and conduct a test using data from 50 real households to demonstrate its practicality. Furthermore, we compare our method with existing studies in the field across two different seasons of the year. In the summer season, our method decreases the optimality gap by 60% compared to the baseline, and in the winter season, it reduces the optimality gap by 67%. Moreover, our proposed method mitigates the congestion of the distribution network caused by uncertain energy by 16.7% within a day, which is a crucial aspect for implementing energy markets in the real world.

Submitted 11 March, 2024; originally announced March 2024.

arXiv:2403.05952 [pdf]

New Directions for Thermoelectrics: A Roadmap from High-Throughput Materials Discovery to Advanced Device Manufacturing

Authors: Kaidong Song, A. N. M. Tanvir, Md Omarsany Bappy, Yanliang Zhang

Abstract: Thermoelectric materials, which can convert waste heat into electricity or act as solid-state Peltier coolers, are emerging as key technologies to address global energy shortages and environmental sustainability. However, discovering materials with high thermoelectric conversion efficiency is a complex and slow process. The emerging field of high-throughput material discovery demonstrates its potential to accelerate the development of new thermoelectric materials combining high efficiency and low cost. The synergistic integration of high-throughput material processing and characterization techniques with machine learning algorithms can form an efficient closed-loop process to generate and analyze broad data sets to discover new thermoelectric materials with unprecedented performances. Meanwhile, the recent development of advanced manufacturing methods provides exciting opportunities to realize scalable, low-cost, and energy-efficient fabrication of thermoelectric devices. This review provides an overview of recent advances in discovering thermoelectric materials using high-throughput methods, including processing, characterization, and screening. Advanced manufacturing methods of thermoelectric devices are also introduced to realize the broad impacts of thermoelectric materials in power generation and solid-state cooling. In the end, this paper also discusses the future research prospects and directions.

Submitted 9 March, 2024; originally announced March 2024.

arXiv:2403.04031 [pdf, other]

Can Large Language Models do Analytical Reasoning?

Authors: Yebowen Hu, Kaiqiang Song, Sangwoo Cho, Xiaoyang Wang, Hassan Foroosh, Dong Yu, Fei Liu

Abstract: This paper explores the cutting-edge Large Language Model with analytical reasoning on sports. Our analytical reasoning embodies the tasks of letting large language models count how many points each team scores in a quarter in NBA and NFL games. Our major discoveries are twofold. Firstly, we find that among all the models we employed, GPT-4 stands out in effectiveness, followed by Claude-2.1, with GPT-3.5, Gemini-Pro, and Llama-2-70b lagging behind. Specifically, we compare three different prompting techniques and a divide-and-conquer approach, and find that the latter is the most effective. Our divide-and-conquer approach breaks down play-by-play data into smaller, more manageable segments, solves each piece individually, and then aggregates them together. Besides the divide-and-conquer approach, we also explore the Chain of Thought (CoT) strategy, which markedly improves outcomes for certain models, notably GPT-4 and Claude-2.1, with their accuracy rates increasing significantly. However, the CoT strategy has negligible or even detrimental effects on the performance of other models like GPT-3.5 and Gemini-Pro. Secondly, to our surprise, we observe that most models, including GPT-4, struggle to accurately count the total scores for NBA quarters despite showing strong performance in counting NFL quarter scores.
This leads us to further investigate the factors that impact the complexity of analytical reasoning tasks with extensive experiments, through which we conclude that task complexity depends on the length of context, the information density, and the presence of related information. Our research provides valuable insights into the complexity of analytical reasoning tasks and potential directions for developing future large language models.

Submitted 6 March, 2024; originally announced March 2024.

arXiv:2403.03100 [pdf, other]

NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models

Authors: Zeqian Ju, Yuancheng Wang, Kai Shen, Xu Tan, Detai Xin, Dongchao Yang, Yanqing Liu, Yichong Leng, Kaitao Song, Siliang Tang, Zhizheng Wu, Tao Qin, Xiang-Yang Li, Wei Ye, Shikun Zhang, Jiang Bian, Lei He, Jinyu Li, Sheng Zhao

Abstract: While recent large-scale text-to-speech (TTS) models have achieved significant progress, they still fall short in speech quality, similarity, and prosody. Considering that speech intricately encompasses various attributes (e.g., content, prosody, timbre, and acoustic details) that pose significant challenges for generation, a natural idea is to factorize speech into individual subspaces representing different attributes and generate them individually. Motivated by this, we propose NaturalSpeech 3, a TTS system with novel factorized diffusion models to generate natural speech in a zero-shot way. Specifically, 1) we design a neural codec with factorized vector quantization (FVQ) to disentangle the speech waveform into subspaces of content, prosody, timbre, and acoustic details; 2) we propose a factorized diffusion model to generate attributes in each subspace following its corresponding prompt. With this factorization design, NaturalSpeech 3 can effectively and efficiently model intricate speech with disentangled subspaces in a divide-and-conquer way. Experiments show that NaturalSpeech 3 outperforms the state-of-the-art TTS systems on quality, similarity, prosody, and intelligibility, and achieves on-par quality with human recordings. Furthermore, we achieve better performance by scaling to 1B parameters and 200K hours of training data.

Submitted 23 April, 2024; v1 submitted 5 March, 2024; originally announced March 2024.

Comments: Achieving human-level quality and naturalness on multi-speaker datasets (e.g., LibriSpeech) in a zero-shot way

arXiv:2402.14279 [pdf, other]

Mitigating the Linguistic Gap with Phonemic Representations for Robust Multilingual Language Understanding

Authors: Haeji Jung, Changdae Oh, Jooeon Kang, Jimin Sohn, Kyungwoo Song, Jinkyu Kim, David R. Mortensen

Abstract: Approaches to improving multilingual language understanding often require multiple languages during the training phase, rely on complicated training techniques, and -- importantly -- struggle with significant performance gaps between high-resource and low-resource languages. We hypothesize that the performance gaps between languages are affected by linguistic gaps between those languages and provide a novel solution for robust multilingual language modeling by employing phonemic representations (specifically, using phonemes as input tokens to LMs rather than subwords). We present quantitative evidence from three cross-lingual tasks that demonstrates the effectiveness of phonemic representation, which is further justified by a theoretical analysis of the cross-lingual performance gap.

Submitted 21 February, 2024; originally announced February 2024.

arXiv:2402.10979 [pdf, other]

SportsMetrics: Blending Text and Numerical Data to Understand Information Fusion in LLMs

Authors: Yebowen Hu, Kaiqiang Song, Sangwoo Cho, Xiaoyang Wang, Hassan Foroosh, Dong Yu, Fei Liu

Abstract: Large language models hold significant potential for integrating various data types, such as text documents and database records, for advanced analytics. However, blending text and numerical data presents substantial challenges. LLMs need to process and cross-reference entities and numbers, handle data inconsistencies and redundancies, and develop planning capabilities such as building a working memory for managing complex data queries. In this paper, we introduce four novel tasks centered around sports data analytics to evaluate the numerical reasoning and information fusion capabilities of LLMs. These tasks involve providing LLMs with detailed, play-by-play sports game descriptions, then challenging them with adversarial scenarios such as new game rules, longer durations, scrambled narratives, and analyzing key statistics in game summaries. We conduct extensive experiments on NBA and NFL games to assess the performance of LLMs on these tasks. Our benchmark, SportsMetrics, introduces a new mechanism for assessing LLMs' numerical reasoning and fusion skills.

Submitted 16 June, 2024; v1 submitted 15 February, 2024; originally announced February 2024.

Comments: ACL 2024 Long Paper

arXiv:2402.10047 [pdf, other]

OH-Formation Following Vibrationally Induced Reaction Dynamics of H$_2$COO

Authors: Kaisheng Song, Meenu Upadhyay, Markus Meuwly

Abstract: The reaction dynamics of H$_2$COO to form linear HCOOH and dioxirane as first steps for OH-elimination is quantitatively investigated. Using a machine-learned potential energy surface at the CASPT2/aug-cc-pVTZ level of theory, vibrational excitation along the CH-normal mode $\nu_{\rm CH}$ with energies up to 40.0 kcal/mol ($\sim 5 \nu_{\rm CH}$) leads almost exclusively to linear HCOOH, which further decomposes into OH+HCO. Although the barrier to form dioxirane is only 21.4 kcal/mol, the reaction probability to form dioxirane is two orders of magnitude lower if the CH-stretch mode is excited. Following the dioxirane-formation pathway is facile, however, if in addition the COO-bend vibration is excited with energies equivalent to $\sim (2 \nu_{\rm CH} + 4 \nu_{\rm COO})$ or $\sim (3 \nu_{\rm CH} + \nu_{\rm COO})$. For OH-formation in the atmosphere the pathway through linear HCOOH is probably most relevant because the alternative pathways (through dioxirane or formic acid) involve several intermediates that can de-excite through collisions, relax via intramolecular vibrational energy redistribution (IVR), or pass through very loose and vulnerable transition states (formic acid). This work demonstrates how, by selectively exciting particular vibrational modes, it is possible to dial into desired reaction channels with a high degree of specificity for a process relevant to atmospheric chemistry.

Submitted 15 February, 2024; originally announced February 2024.

Showing 1–50 of 401 results for author: Song, K