-
Femtosecond switching of strong light-matter interactions in microcavities with two-dimensional semiconductors
Authors:
Armando Genco,
Charalambos Louca,
Cristina Cruciano,
Kok Wee Song,
Chiara Trovatello,
Giuseppe Di Blasio,
Giacomo Sansone,
Sam Randerson,
Peter Claronino,
Rahul Jayaprakash,
Kenji Watanabe,
Takashi Taniguchi,
David G. Lidzey,
Oleksandr Kyriienko,
Stefano Dal Conte,
Alexander I. Tartakovskii,
Giulio Cerullo
Abstract:
Ultrafast all-optical logic devices based on nonlinear light-matter interactions hold the promise to overcome the speed limitations of conventional electronic devices. Strong coupling of excitons and photons inside an optical resonator enhances such interactions and generates new polariton states which give access to unique nonlinear phenomena, such as Bose-Einstein condensation, used for all-optical ultrafast polariton transistors. However, the pulse energies required to pump such devices range from tens to hundreds of pJ, making them uncompetitive with electronic transistors. Here we introduce a new paradigm for all-optical switching based on the ultrafast transition from the strong to the weak coupling regime in microcavities embedding atomically thin transition metal dichalcogenides. Employing single and double stacks of hBN-encapsulated MoS$_2$ homobilayers with high optical nonlinearities and fast exciton relaxation times, we observe a collapse of the 55-meV polariton gap and its revival in less than one picosecond, lowering the threshold for optical switching below 4 pJ per pulse while retaining ultrahigh switching frequencies. As an additional degree of freedom, the switching can be triggered by pumping either the intra- or the interlayer excitons of the bilayers at different wavelengths, speeding up the polariton dynamics owing to unique interspecies excitonic interactions. Our approach will enable the development of compact ultrafast all-optical logic circuits and neural networks, showcasing a new platform for polaritonic information processing based on manipulating the light-matter coupling.
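In the simplest textbook picture, the polariton gap is the splitting of a two-coupled-oscillator Hamiltonian. A minimal lossless sketch, assuming one cavity mode, one exciton and a coupling g (all names and the exciton energy are hypothetical; this is not the authors' model): the splitting is sqrt(4g^2 + detuning^2), so a 55 meV gap at zero detuning corresponds to g = 27.5 meV, and pump-induced bleaching of the exciton oscillator strength, modelled here as an assumption by shrinking g, closes the gap.

```python
import numpy as np

def polariton_energies(e_cavity, e_exciton, g):
    """Lower and upper polariton energies (eV) of a lossless two-coupled-oscillator model."""
    h = np.array([[e_cavity, g], [g, e_exciton]], dtype=float)
    lower, upper = np.linalg.eigvalsh(h)  # eigvalsh returns eigenvalues in ascending order
    return lower, upper

def rabi_splitting(e_cavity, e_exciton, g):
    """Polariton gap; equals sqrt(4 g^2 + detuning^2)."""
    lower, upper = polariton_energies(e_cavity, e_exciton, g)
    return upper - lower

E_X = 1.90  # hypothetical exciton energy in eV, for illustration only
print(rabi_splitting(E_X, E_X, 0.0275))  # ~0.055 eV at zero detuning
# Hypothetical bleaching: g scales with the square root of the remaining oscillator strength.
for bleached_fraction in (0.0, 0.5, 0.9):
    g = 0.0275 * np.sqrt(1.0 - bleached_fraction)
    print(bleached_fraction, rabi_splitting(E_X, E_X, g))
```

A lossless model always shows some splitting; the actual strong-to-weak transition occurs when 2g becomes comparable to the exciton and cavity linewidths, which this sketch leaves out.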
Submitted 31 July, 2024;
originally announced August 2024.
-
Restoring Real-World Degraded Events Improves Deblurring Quality
Authors:
Yeqing Shen,
Shang Li,
Kun Song
Abstract:
Due to its high speed and low latency, the dynamic vision sensor (DVS) is frequently employed in motion deblurring. Ideally, high-quality events would adeptly capture intricate motion information. However, real-world events are generally degraded, thereby introducing significant artifacts into the deblurred results. In response to this challenge, we model the degradation of events and propose RDNet to improve the quality of image deblurring. Specifically, we first analyze the mechanisms underlying degradation and simulate paired events based on this analysis. These paired events are then fed into the first stage of RDNet for training the restoration model. The events restored in this stage serve as a guide for the second-stage deblurring process. To better assess the deblurring performance of different methods on real-world degraded events, we present a new real-world dataset named DavisMCR. This dataset incorporates events with diverse degradation levels, collected by manipulating environmental brightness and target object contrast. Our experiments are conducted on synthetic datasets (GOPRO), real-world datasets (REBlur), and the proposed dataset (DavisMCR). The results demonstrate that RDNet outperforms classical event denoising methods in event restoration. Furthermore, RDNet exhibits better performance in deblurring tasks compared to state-of-the-art methods. DavisMCR is available at https://github.com/Yeeesir/DVS_RDNet.
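Modelling event degradation starts from how a DVS pixel generates events. A minimal sketch of the idealized generation model (one event of polarity ±1 each time log intensity moves by a contrast threshold away from the pixel's last reference level), plus a hypothetical degradation that randomly drops events and injects background noise. The threshold, function names and degradation scheme are assumptions for illustration, not RDNet's degradation model.

```python
import numpy as np

def simulate_events(frames, threshold=0.2, eps=1e-3):
    """Idealised DVS pixel model.

    frames: array of shape (T, H, W) with linear intensities.
    Returns (t, y, x, polarity) events; a pixel fires once per `threshold`
    step of log-intensity change since its last reference level.
    """
    log_frames = np.log(np.asarray(frames, dtype=float) + eps)
    reference = log_frames[0].copy()
    events = []
    for t in range(1, len(log_frames)):
        diff = log_frames[t] - reference
        n_cross = np.floor(np.abs(diff) / threshold).astype(int)
        for y, x in zip(*np.nonzero(n_cross)):
            polarity = 1 if diff[y, x] > 0 else -1
            count = int(n_cross[y, x])
            events.extend([(t, int(y), int(x), polarity)] * count)
            reference[y, x] += polarity * count * threshold  # move reference by the crossed levels
    return events

def degrade_events(events, shape, n_frames, drop_prob=0.3, n_noise=50, seed=0):
    """Hypothetical degradation: drop true events at random (e.g. low light or
    low contrast) and add uniformly distributed background-noise events."""
    rng = np.random.default_rng(seed)
    kept = [e for e in events if rng.random() >= drop_prob]
    noise = [
        (int(rng.integers(1, n_frames)), int(rng.integers(0, shape[0])),
         int(rng.integers(0, shape[1])), int(rng.choice([-1, 1])))
        for _ in range(n_noise)
    ]
    return sorted(kept + noise)

# Usage: a clean/degraded event pair from random frames.
frames = np.random.default_rng(0).uniform(0.1, 1.0, size=(5, 4, 4))
clean = simulate_events(frames)
degraded = degrade_events(clean, frames.shape[1:], len(frames))
```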
Submitted 29 July, 2024;
originally announced July 2024.
-
Robust Adaptation of Foundation Models with Black-Box Visual Prompting
Authors:
Changdae Oh,
Gyeongdeok Seo,
Geunyoung Jung,
Zhi-Qi Cheng,
Hosik Choi,
Jiyoung Jung,
Kyungwoo Song
Abstract:
With the surge of large-scale pre-trained models (PTMs), adapting these models to numerous downstream tasks has become a crucial problem. Consequently, parameter-efficient transfer learning (PETL) of large models has attracted considerable attention. While PETL methods show impressive performance, they commonly rely on two optimistic assumptions: 1) all parameters of a PTM are accessible, and 2) sufficient memory is available to cache all the intermediate activations needed to compute gradients. However, in most real-world applications, PTMs are served as black-box APIs or proprietary software without explicit parameter accessibility. Moreover, the large memory requirements of modern PTMs are hard to meet. This work proposes black-box visual prompting (BlackVIP), which efficiently adapts PTMs without knowledge of model architectures and parameters. BlackVIP has two components: 1) the Coordinator and 2) simultaneous perturbation stochastic approximation with gradient correction (SPSA-GC). The Coordinator designs input-dependent visual prompts, which allow the target PTM to adapt in the wild. SPSA-GC efficiently estimates the gradient of the PTM to update the Coordinator. In addition, we propose a variant, BlackVIP-SE, which significantly reduces the runtime and computational cost of BlackVIP. Extensive experiments on 19 datasets demonstrate that BlackVIPs enable robust adaptation to diverse domains and tasks with minimal memory requirements. We further provide a theoretical analysis of the generalization of visual prompting methods by presenting their connection to the certified robustness of randomized smoothing.
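SPSA estimates a gradient from just two loss queries by perturbing all parameters at once along a random ±1 direction, which is what makes it usable against a black-box API. A minimal sketch of standard two-sided SPSA descent on a toy quadratic; BlackVIP's gradient-correction (GC) step and the Coordinator are not included, and the step sizes are hypothetical.

```python
import numpy as np

def spsa_gradient(f, theta, c=0.01, rng=None):
    """Two-sided SPSA estimate of grad f(theta) from two function evaluations."""
    rng = np.random.default_rng() if rng is None else rng
    delta = rng.choice([-1.0, 1.0], size=theta.shape)  # Rademacher perturbation
    diff = f(theta + c * delta) - f(theta - c * delta)
    # Dividing by delta_i equals multiplying by delta_i because delta_i is +/-1.
    return diff / (2.0 * c) * delta

def spsa_minimize(f, theta0, lr=0.05, c=0.01, steps=200, seed=0):
    """Plain SPSA descent (no gradient correction)."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(steps):
        theta -= lr * spsa_gradient(f, theta, c=c, rng=rng)
    return theta

# Usage: minimise a black-box quadratic using only loss queries, never its gradient.
loss = lambda v: float(np.sum((v - 3.0) ** 2))
theta = spsa_minimize(loss, np.zeros(5))
print(loss(theta))
```

Each step costs two queries regardless of the number of parameters, which is the property that matters when the model is only reachable through an API.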
Submitted 3 July, 2024;
originally announced July 2024.
-
Contrast then Memorize: Semantic Neighbor Retrieval-Enhanced Inductive Multimodal Knowledge Graph Completion
Authors:
Yu Zhao,
Ying Zhang,
Baohang Zhou,
Xinying Qian,
Kehui Song,
Xiangrui Cai
Abstract:
A large number of studies have emerged for Multimodal Knowledge Graph Completion (MKGC) to predict the missing links in MKGs. However, fewer studies have addressed inductive MKGC (IMKGC), which involves emerging entities unseen during training. Existing inductive approaches focus on learning textual entity representations and neglect the rich semantic information in the visual modality. Moreover, they focus on aggregating structural neighbors from existing KGs, which are usually scarce for emerging entities. In contrast, semantic neighbors are decoupled from the topological linkage and usually imply the true target entity. In this paper, we propose the IMKGC task and a semantic neighbor retrieval-enhanced IMKGC framework, CMR, in which the contrast step brings helpful semantic neighbors close and the memorize step then supports semantic neighbor retrieval to enhance inference. Specifically, we first propose a unified cross-modal contrastive learning to simultaneously capture the textual-visual and textual-textual correlations of query-entity pairs in a unified representation space. The contrastive learning increases the similarity of positive query-entity pairs, thereby making the representations of helpful semantic neighbors close. Then, we explicitly memorize the knowledge representations to support semantic neighbor retrieval. At test time, we retrieve the nearest semantic neighbors and interpolate them into the query-entity similarity distribution to augment the final prediction. Extensive experiments validate the effectiveness of CMR on three inductive MKGC datasets. Code is available at https://github.com/OreOZhao/CMR.
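The test-time step (retrieve nearest neighbours from a memory, then interpolate them into the model's entity distribution) resembles kNN-LM-style interpolation. A minimal sketch under that assumption; the memory layout, `k`, `lam` and `temperature` are hypothetical, and CMR's actual scoring may differ.

```python
import numpy as np

def softmax(x, temperature=1.0):
    z = np.asarray(x, dtype=float) / temperature
    z -= z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def knn_interpolate(query, memory_keys, memory_targets, model_scores,
                    k=4, lam=0.3, temperature=0.1):
    """Blend the model's entity distribution with one built from the k nearest
    memorised neighbours. memory_targets[i] is the entity id stored with key i."""
    p_model = softmax(model_scores)
    sims = memory_keys @ query            # dot product (cosine if rows are L2-normalised)
    top = np.argsort(-sims)[:k]           # indices of the k most similar memory entries
    weights = softmax(sims[top], temperature)
    p_knn = np.zeros_like(p_model)
    np.add.at(p_knn, memory_targets[top], weights)  # neighbours vote for their target entity
    return lam * p_knn + (1.0 - lam) * p_model
```

With `lam=0` the model's prediction is returned unchanged; larger `lam` lets retrieved semantic neighbours shift probability mass toward the entities they point to.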
Submitted 3 July, 2024;
originally announced July 2024.
-
Improving Multilingual Instruction Finetuning via Linguistically Natural and Diverse Datasets
Authors:
Sathish Reddy Indurthi,
Wenxuan Zhou,
Shamil Chollampatt,
Ravi Agrawal,
Kaiqiang Song,
Lingxiao Zhao,
Chenguang Zhu
Abstract:
Advancements in Large Language Models (LLMs) have significantly enhanced instruction-following capabilities. However, most Instruction Fine-Tuning (IFT) datasets are predominantly in English, limiting model performance in other languages. Traditional methods for creating multilingual IFT datasets, such as translating existing English IFT datasets or converting existing NLP datasets into IFT datasets by templating, struggle to capture linguistic nuances and ensure prompt (instruction) diversity. To address this issue, we propose a novel method for collecting multilingual IFT datasets that preserves linguistic naturalness and ensures prompt diversity. This approach leverages English-focused LLMs, monolingual corpora, and a scoring function to create high-quality, diversified IFT datasets in multiple languages. Experiments demonstrate that LLMs fine-tuned using these IFT datasets show notable improvements in both generative and discriminative tasks, indicating enhanced language comprehension by LLMs in non-English contexts. Specifically, on the multilingual summarization task, LLMs using our IFT dataset achieved 17.57% and 15.23% improvements over LLMs fine-tuned with translation-based and template-based datasets, respectively.
Submitted 1 July, 2024;
originally announced July 2024.
-
Machine Learning-Assisted 3D Printing of Thermoelectric Materials of Ultrahigh Performances at Room Temperature
Authors:
Kaidong Song,
Guoyue Xu,
A. N. M. Tanvir,
Ke Wang,
Md Omarsany Bappy,
Haijian Yang,
Wenjie Shang,
Le Zhou,
Alexander Dowling,
Tengei Luo,
Yanliang Zhang
Abstract:
Thermoelectric energy conversion is an attractive technology for generating electricity from waste heat and for using electricity for solid-state cooling. However, conventional manufacturing processes for thermoelectric devices are costly and limited to simple device geometries. This work reports an extrusion printing method to fabricate high-performance thermoelectric materials with complex 3D architectures. By integrating high-throughput experimentation and Bayesian optimization (BO), our approach significantly accelerates the simultaneous search for the optimal ink formulation and printing parameters that deliver high thermoelectric performance while maintaining the desired shape fidelity. A Gaussian process regression (GPR)-based machine learning model is employed to rapidly predict the thermoelectric power factor as a function of ink formulation and printing parameters. The printed bismuth antimony telluride (BiSbTe)-based thermoelectric materials under the optimized conditions exhibit an ultrahigh room-temperature zT of 1.3, by far the highest among printed thermoelectric materials. The machine learning-guided, ink-based printing strategy can be generalized to a wide range of functional materials and devices for broad technological applications.
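The loop described above pairs a GPR surrogate with Bayesian optimization. A minimal 1-D sketch under stated assumptions: a zero-mean GP with a squared-exponential kernel (hypothetical length scale 0.15), standardized targets, and expected improvement (EI) evaluated over a grid of candidates. The real search runs jointly over ink formulation and printing parameters; the single normalized parameter and the toy objective are purely illustrative.

```python
import numpy as np
from math import erf, sqrt, pi

def rbf_kernel(a, b, length_scale=0.15):
    """Squared-exponential kernel between two 1-D input arrays."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * sq_dist / length_scale**2)

def gp_posterior(x_train, y_train, x_query, noise=1e-6, length_scale=0.15):
    """Zero-mean GP regression: posterior mean and standard deviation at x_query."""
    K = rbf_kernel(x_train, x_train, length_scale) + noise * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_query, length_scale)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = np.clip(1.0 - np.sum(v**2, axis=0), 1e-12, None)  # prior variance is 1
    return mean, np.sqrt(var)

_erf = np.vectorize(erf)

def expected_improvement(mean, std, best):
    """EI for maximisation."""
    z = (mean - best) / std
    cdf = 0.5 * (1.0 + _erf(z / sqrt(2.0)))
    pdf = np.exp(-0.5 * z**2) / sqrt(2.0 * pi)
    return (mean - best) * cdf + std * pdf

def bayesian_optimize(objective, n_init=3, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    candidates = np.linspace(0.0, 1.0, 201)  # one normalised process parameter
    x = rng.uniform(0.0, 1.0, n_init)
    y = np.array([objective(v) for v in x])
    for _ in range(n_iter):
        y_mean, y_std = y.mean(), y.std() + 1e-9  # standardise targets for the unit-variance prior
        mean, std = gp_posterior(x, (y - y_mean) / y_std, candidates)
        ei = expected_improvement(mean, std, (y.max() - y_mean) / y_std)
        x_next = candidates[np.argmax(ei)]
        x = np.append(x, x_next)
        y = np.append(y, objective(x_next))
    best = np.argmax(y)
    return x[best], y[best]

# Usage with a toy objective standing in for a measured power factor.
x_best, y_best = bayesian_optimize(lambda v: -(v - 0.37) ** 2)
```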
Submitted 1 July, 2024;
originally announced July 2024.
-
Multifractal analysis of the convergence exponents for the digits in $d$-decaying Gauss like dynamical systems
Authors:
Kunkun Song,
Mengjie Zhang
Abstract:
Let $\{a_n(x)\}_{n\geq1}$ be the sequence of digits of $x\in(0,1)$ in infinite iterated function systems with polynomial decay of the derivative. We first study the multifractal spectrum of the convergence exponent defined by the sequence of the digits $\{a_n(x)\}_{n\geq1}$ and the weighted products of distinct digits with finite numbers respectively, and then calculate the Hausdorff dimensions of the intersection of sets defined by the convergence exponent of the weighted product of distinct digits with finite numbers and sets of points whose digits are non-decreasing in such iterated function systems.
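For reference, the convergence exponent of a sequence of positive numbers is the classical quantity below. We assume, without confirming it against the full paper, that this is the definition applied to the digit sequence $\{a_n(x)\}_{n\geq1}$; the closed form on the right holds when the sequence is non-decreasing and tends to infinity.

\[\tau(x)=\inf\Big\{s\geq0:\ \sum_{n\geq1}a_n(x)^{-s}<\infty\Big\},\qquad \tau(x)=\limsup_{n\to\infty}\frac{\log n}{\log a_n(x)}\ \text{ if } a_n(x)\nearrow\infty.\]

For example, if $a_n(x)=n^2$, then $\sum_{n} n^{-2s}$ converges exactly when $s>1/2$, and $\log n/\log n^2=1/2$, so both expressions give $\tau(x)=1/2$.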
Submitted 30 June, 2024;
originally announced July 2024.
-
Photosensitive PEEK Ink Enables Digital Light Processing 3D Printed High-performance Small Architected-Plastics
Authors:
Ze Zhang,
Kewei Song,
Rongyi Zhuang,
Jianxian He,
Yi Yang,
Yifan Pan,
Takeshi Mino,
Kayo Hirose,
Shinjiro Umezu
Abstract:
Polyetheretherketone (PEEK), a semi-crystalline high-performance engineering plastic, has shown good application prospects since its introduction. However, fabricating PEEK in complex architectures remains a major limitation because of the inherent shortcomings of material extrusion 3D printing, namely low resolution, low surface quality, and poor interlayer bonding. We propose a novel PEEK processing route based on digital light processing (DLP) 3D printing, which uses a high-solid-content PEEK ink to produce high-accuracy green bodies, followed by one-step sintering to enhance the crystallinity of PEEK. We investigated the processing mechanism of this route and optimized the process parameters in terms of mouldability, printing accuracy, material thermal properties, and PEEK crystallinity. Furthermore, the material and architecture performance of the proposed process was evaluated in terms of comprehensive thermal performance (including heat resistance of the substrate, thermal stability, surface energy after heat treatment, and the coefficients of static and kinetic friction), mechanical performance, and corrosion resistance (20 wt% hydrochloric acid, 20 wt% sodium hydroxide, 99 wt% acetone, and 99.5 wt% chloroform). The process extends the available PEEK processing methods, allowing the properties of PEEK to be exploited in more flexible and efficient applications.
Submitted 26 June, 2024;
originally announced June 2024.
-
ESBMC v7.6: Enhanced Model Checking of C++ Programs with Clang AST
Authors:
Xianzhiyu Li,
Kunjian Song,
Mikhail R. Gadelha,
Franz Brauße,
Rafael S. Menezes,
Konstantin Korovin,
Lucas C. Cordeiro
Abstract:
This paper presents Efficient SMT-Based Context-Bounded Model Checker (ESBMC) v7.6, an extended version based on previous work on ESBMC v7.3 by K. Song et al. Version 7.3 introduced a new Clang-based C++ front-end to address the challenges posed by modern C++ programs. Although the new front-end has demonstrated significant potential in previous studies, it remains in the developmental stage and lacks several essential features. ESBMC v7.6 further enhances this foundation by adding and extending features based on the Clang AST, such as 1) exception handling; 2) extended memory management and memory safety verification, including dangling pointers, duplicate deallocation, memory leaks, and rvalue references; and 3) new operational models for the STL that update the outdated C++ operational models. Our extensive experiments demonstrate that ESBMC v7.6 can handle a significantly broader range of C++ features introduced in recent versions of the C++ standard.
Submitted 25 June, 2024;
originally announced June 2024.
-
Flat Posterior Does Matter For Bayesian Transfer Learning
Authors:
Sungjun Lim,
Jeyoon Yeom,
Sooyon Kim,
Hoyoon Byun,
Jinho Kang,
Yohan Jung,
Jiyoung Jung,
Kyungwoo Song
Abstract:
Large-scale pre-trained neural networks have achieved notable success in enhancing performance on downstream tasks. Another promising approach for generalization is the Bayesian Neural Network (BNN), which integrates Bayesian methods into neural network architectures and offers advantages such as Bayesian Model Averaging (BMA) and uncertainty quantification. Despite these benefits, transfer learning for BNNs has not been widely investigated and shows limited improvement. We hypothesize that this issue arises from the inability to find flat minima, which are crucial for generalization performance. To address this, we evaluate the sharpness of BNNs in various settings, revealing their insufficiency in seeking flat minima and the influence of flatness on BMA performance. We therefore propose Sharpness-aware Bayesian Model Averaging (SA-BMA), a Bayesian-fitting, flat-posterior-seeking optimizer integrated with Bayesian transfer learning. SA-BMA calculates the divergence between posteriors in the parameter space, aligning with the nature of BNNs, and serves as a generalized version of existing sharpness-aware optimizers. We validate that SA-BMA improves generalization performance in few-shot classification and distribution shift scenarios by ensuring flatness.
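SA-BMA is described as generalizing existing sharpness-aware optimizers. For orientation, a minimal sketch of the first-order sharpness estimate those optimizers rely on, in its standard SAM form (this is plain SAM on deterministic parameters, not SA-BMA, which measures divergence between posteriors): perturb the parameters by rho * g / ||g||, compare losses, and take the descent step with the gradient evaluated at the perturbed point. The radius and learning rate are hypothetical.

```python
import numpy as np

def sam_perturbation(grad, rho):
    """First-order worst-case perturbation inside an L2 ball of radius rho."""
    norm = np.linalg.norm(grad)
    return np.zeros_like(grad) if norm == 0 else rho * grad / norm

def sharpness(loss, grad_fn, theta, rho=0.05):
    """Approximates max_{||eps|| <= rho} loss(theta + eps) - loss(theta)."""
    eps = sam_perturbation(grad_fn(theta), rho)
    return loss(theta + eps) - loss(theta)

def sam_step(grad_fn, theta, lr=0.1, rho=0.05):
    """One SAM update: descend with the gradient taken at the perturbed point."""
    eps = sam_perturbation(grad_fn(theta), rho)
    return theta - lr * grad_fn(theta + eps)

# Two quadratics with the same minimiser but different curvature.
flat = (lambda t: 0.5 * float(t @ t), lambda t: t)
sharp = (lambda t: 5.0 * float(t @ t), lambda t: 10.0 * t)
theta = np.array([0.3, -0.2])
print(sharpness(*flat, theta), sharpness(*sharp, theta))
```

The sharper basin reports a tenfold larger sharpness at the same point, which is the signal a sharpness-aware optimizer penalizes.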
Submitted 21 June, 2024;
originally announced June 2024.
-
EvoAgent: Towards Automatic Multi-Agent Generation via Evolutionary Algorithms
Authors:
Siyu Yuan,
Kaitao Song,
Jiangjie Chen,
Xu Tan,
Dongsheng Li,
Deqing Yang
Abstract:
The rise of powerful large language models (LLMs) has spurred a new trend in building LLM-based autonomous agents for solving complex tasks, especially multi-agent systems. Despite the remarkable progress, we notice that existing works are heavily dependent on human-designed frameworks, which greatly limits the functional scope and scalability of agent systems. Automatically extending a specialized agent to a multi-agent system to improve task-solving capability remains a significant challenge. In this paper, we introduce EvoAgent, a generic method that automatically extends expert agents to multi-agent systems via evolutionary algorithms, thereby improving the effectiveness of LLM-based agents in solving tasks. Specifically, we treat the existing agent framework as the initial individual and then apply a series of evolutionary operators (e.g., mutation, crossover, and selection) to generate multiple agents with diverse agent settings. EvoAgent can be generalized to any LLM-based agent framework and can automatically extend the existing agent framework to multi-agent systems without any extra human design. Experimental results across various tasks show that EvoAgent can automatically generate multiple expert agents and significantly enhance the task-solving capabilities of LLM-based agents.
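The evolutionary loop described above (start from one existing agent setting, apply crossover and mutation, keep the fittest) can be sketched symbolically. Assumptions: the search space, field names and fitness function are hypothetical; the abstract does not say how the operators are implemented, and here they simply act on a small dict rather than on LLM-generated agent descriptions.

```python
import random

# Hypothetical agent-setting space; real settings would be richer, free-form descriptions.
SPACE = {
    "role": ["planner", "critic", "domain expert", "coder"],
    "style": ["concise", "step-by-step", "skeptical"],
    "tool": ["none", "search", "calculator"],
}

def mutate(agent, rng):
    """Change one randomly chosen setting."""
    child = dict(agent)
    key = rng.choice(sorted(SPACE))
    child[key] = rng.choice(SPACE[key])
    return child

def crossover(a, b, rng):
    """Take each setting from one of the two parents."""
    return {k: (a[k] if rng.random() < 0.5 else b[k]) for k in SPACE}

def evolve(initial, fitness, generations=10, pop_size=6, seed=0):
    rng = random.Random(seed)
    population = [dict(initial)]
    for _ in range(generations):
        children = []
        while len(children) < pop_size:
            p1, p2 = rng.choice(population), rng.choice(population)
            children.append(mutate(crossover(p1, p2, rng), rng))
        # Selection: keep the best distinct settings among parents and children.
        pool = {tuple(sorted(a.items())): a for a in population + children}
        population = sorted(pool.values(), key=fitness, reverse=True)[:pop_size]
    return population
```

Keeping parents in the selection pool makes the best fitness non-decreasing across generations; in practice the fitness would be task performance of the resulting multi-agent system.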
Submitted 11 July, 2024; v1 submitted 20 June, 2024;
originally announced June 2024.
-
When Reasoning Meets Information Aggregation: A Case Study with Sports Narratives
Authors:
Yebowen Hu,
Kaiqiang Song,
Sangwoo Cho,
Xiaoyang Wang,
Wenlin Yao,
Hassan Foroosh,
Dong Yu,
Fei Liu
Abstract:
Reasoning is most powerful when an LLM accurately aggregates relevant information. We examine the critical role of information aggregation in reasoning by requiring the LLM to analyze sports narratives. To succeed at this task, an LLM must infer points from actions, identify related entities, attribute points accurately to players and teams, and compile key statistics to draw conclusions. We conduct comprehensive experiments with real NBA basketball data and present SportsGen, a new method to synthesize game narratives. By synthesizing data, we can rigorously evaluate LLMs' reasoning capabilities under complex scenarios with varying narrative lengths and density of information. Our findings show that most models, including GPT-4o, often fail to accurately aggregate basketball scores due to frequent scoring patterns. Open-source models like Llama-3 further suffer from significant score hallucinations. Finally, the effectiveness of reasoning is influenced by narrative complexity, information density, and domain-specific terms, highlighting the challenges in analytical reasoning tasks.
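The aggregation the task demands (infer points from actions, attribute them to players and teams, sum them) is trivial to do deterministically once a narrative is structured, which is what makes it a clean probe of an LLM's reasoning. A minimal sketch over a hypothetical one-play-per-line format; this is not the SportsGen format, and real narratives are far less regular.

```python
import re
from collections import defaultdict

# Hypothetical play-by-play format: "<Player> (<Team>) <action>"
LINE = re.compile(r"^(?P<player>.+?) \((?P<team>[^)]+)\) (?P<action>.+)$")
POINTS = [
    (re.compile(r"makes (?:a )?three-pointer"), 3),
    (re.compile(r"makes (?:a )?free throw"), 1),
    (re.compile(r"makes (?:a )?(?:layup|dunk|jump shot|two-pointer)"), 2),
]

def aggregate(narrative):
    """Return per-player and per-team point totals."""
    player_pts = defaultdict(int)
    team_pts = defaultdict(int)
    for raw in narrative.strip().splitlines():
        m = LINE.match(raw.strip())
        if not m:
            continue
        for pattern, pts in POINTS:
            if pattern.search(m["action"]):
                player_pts[m["player"]] += pts
                team_pts[m["team"]] += pts
                break  # a play scores at most once; misses match no pattern
    return dict(player_pts), dict(team_pts)

narrative = """
Smith (Hawks) makes a three-pointer
Jones (Owls) misses a layup
Jones (Owls) makes a dunk
Smith (Hawks) makes free throw
Lee (Owls) makes a jump shot
"""
players, teams = aggregate(narrative)
```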
Submitted 17 June, 2024;
originally announced June 2024.
-
WPO: Enhancing RLHF with Weighted Preference Optimization
Authors:
Wenxuan Zhou,
Ravi Agrawal,
Shujian Zhang,
Sathish Reddy Indurthi,
Sanqiang Zhao,
Kaiqiang Song,
Silei Xu,
Chenguang Zhu
Abstract:
Reinforcement learning from human feedback (RLHF) is a promising solution to align large language models (LLMs) more closely with human values. Off-policy preference optimization, where the preference data is obtained from other models, is widely adopted due to its cost efficiency and scalability. However, off-policy preference optimization often suffers from a distributional gap between the policy used for data collection and the target policy, leading to suboptimal optimization. In this paper, we propose a novel strategy to mitigate this problem by simulating on-policy learning with off-policy preference data. Our Weighted Preference Optimization (WPO) method adapts off-policy data to resemble on-policy data more closely by reweighting preference pairs according to their probability under the current policy. This method not only addresses the distributional gap problem but also enhances the optimization process without incurring additional costs. We validate our method on instruction-following benchmarks including Alpaca Eval 2 and MT-bench. WPO outperforms Direct Preference Optimization (DPO) by up to 5.6% on Alpaca Eval 2 and establishes a length-controlled win rate of 48.6% against GPT-4-turbo based on Llama-3-8B-Instruct, making it the strongest 8B model on the leaderboard. We will release the code and models at https://github.com/wzhouad/WPO.
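A minimal sketch of "reweighting preference pairs according to their probability under the current policy" on top of the standard DPO loss. Assumption, loudly: the weight here is the product of the length-normalized sequence probabilities of both responses under the current policy, treated as a constant; this is an illustrative choice, and WPO's exact weighting and normalization may differ.

```python
import numpy as np

def log_sigmoid(z):
    return -np.logaddexp(0.0, -z)

def weighted_dpo_loss(pi_w, pi_l, ref_w, ref_l, len_w, len_l, beta=0.1):
    """pi_*/ref_*: summed token log-probs of the chosen (w) and rejected (l)
    responses under the current policy and the frozen reference model;
    len_*: response lengths in tokens. Returns (mean weighted loss, weights)."""
    pi_w, pi_l, ref_w, ref_l, len_w, len_l = (
        np.asarray(a, dtype=float) for a in (pi_w, pi_l, ref_w, ref_l, len_w, len_l))
    # Standard DPO logit and per-pair loss.
    logits = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    per_pair = -log_sigmoid(logits)
    # Hypothetical weight: how probable the pair is under the current policy.
    # In a real training loop this would be detached (no gradient through it).
    weights = np.exp(pi_w / len_w + pi_l / len_l)
    return float(np.mean(weights * per_pair)), weights
```

Pairs that the current policy would rarely generate get small weights, so off-policy data contributes roughly in proportion to how on-policy it looks.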
Submitted 17 June, 2024;
originally announced June 2024.
-
Association between a Failed Prominence Eruption and the Drainage of Mass from Another Prominence
Authors:
Jianchao Xue,
Li Feng,
Hui Li,
Ping Zhang,
Jun Chen,
Guanglu Shi,
Kaifan Ji,
Ye Qiu,
Chuan Li,
Lei Lu,
Beili Ying,
Ying Li,
Yu Huang,
Youping Li,
Jingwei Li,
Jie Zhao,
Dechao Song,
Shuting Li,
Zhengyuan Tian,
Yingna Su,
Qingmin Zhang,
Yunyi Ge,
Jiahui Shan,
Qiao Li,
Gen Li
, et al. (9 additional authors not shown)
Abstract:
Sympathetic eruptions of solar prominences have been studied for decades; however, their causal links are usually difficult to identify. Here we present two failed prominence eruptions on 26 October 2022 and explore their connections. Stereoscopic observations show that the south prominence (PRO-S) erupts with untwisting motions, that flare ribbons occur underneath, and that new connections form during the eruption. The north prominence (PRO-N) rises along with PRO-S, and its upper part disappears owing to catastrophic mass drainage along an elongated structure after the failed eruption of PRO-S. We suggest that the eruption of PRO-S is initiated by a kink instability, that PRO-S rises further, and that it fails to erupt because of reconnection with the surrounding fields. The elongated structure connected to PRO-N overlies PRO-S, which causes PRO-N to rise along with PRO-S and to drain its mass after the PRO-S eruption. This study suggests that a prominence may end its life through mass drainage forced by an eruption underneath.
Submitted 20 June, 2024; v1 submitted 17 June, 2024;
originally announced June 2024.
-
Speed-up of Data Analysis with Kernel Trick in Encrypted Domain
Authors:
Joon Soo Yoo,
Baek Kyung Song,
Tae Min Ahn,
Ji Won Heo,
Ji Won Yoon
Abstract:
Homomorphic encryption (HE) is pivotal for secure computation on encrypted data, which is crucial in privacy-preserving data analysis. However, efficiently processing high-dimensional data in HE, especially for machine learning and statistical (ML/STAT) algorithms, poses a challenge. In this paper, we present an effective acceleration method using the kernel method for HE schemes, enhancing time performance in ML/STAT algorithms within encrypted domains. This technique, independent of underlying HE mechanisms and complementary to existing optimizations, notably reduces costly HE multiplications, offering near-constant time complexity relative to data dimension. Aimed at accessibility, this method is tailored for data scientists and developers with limited cryptography background, facilitating advanced data analysis in secure environments.
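One way to read the kernel-trick claim: many ML/STAT algorithms can be rewritten so that the data enter only through inner products, i.e. the n × n Gram matrix, so the system that must be solved depends on the number of samples n rather than the dimension d. The plaintext sketch below checks this for ridge regression (primal d × d solve versus dual n × n solve). No encryption is involved, and which HE operations the paper actually saves is not shown here.

```python
import numpy as np

def ridge_primal(X, y, lam):
    """Primal ridge weights: solves a d x d system (grows with data dimension)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def ridge_dual_predict(X, y, X_new, lam):
    """Kernel (dual) ridge: only inner products between samples are needed."""
    K = X @ X.T                                     # n x n Gram matrix
    alpha = np.linalg.solve(K + lam * np.eye(X.shape[0]), y)
    return (X_new @ X.T) @ alpha                    # inner products with training samples

# Usage: few samples, high dimension, where the dual form is much smaller.
rng = np.random.default_rng(1)
X, y, X_new = rng.normal(size=(5, 50)), rng.normal(size=5), rng.normal(size=(3, 50))
primal_pred = X_new @ ridge_primal(X, y, 1.0)
dual_pred = ridge_dual_predict(X, y, X_new, 1.0)
```

Both routes give the same predictions because w = X^T (X X^T + lam I)^{-1} y, so swapping to the dual moves the expensive work from d-dimensional solves to n × n ones.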
Submitted 14 June, 2024;
originally announced June 2024.
-
Multifractal analysis of the growth rate of digits in Schneider's $p$-adic continued fraction dynamical system
Authors:
Kunkun Song,
Wanlou Wu,
Yueli Yu,
Sainan Zeng
Abstract:
Let $\mathbb{Z}_p$ be the ring of $p$-adic integers and $a_n(x)$ be the $n$-th digit of Schneider's $p$-adic continued fraction of $x\in p\mathbb{Z}_p$. We study the growth rate of the digits $\{a_n(x)\}_{n\geq1}$ from the viewpoint of multifractal analysis. The Hausdorff dimension of the set \[E_{\sup}(\psi)=\Big\{x\in p\mathbb{Z}_p:\ \limsup\limits_{n\to\infty}\frac{a_n(x)}{\psi(n)}=1\Big\}\] is completely determined for any $\psi:\mathbb{N}\to\mathbb{R}^{+}$ satisfying $\psi(n)\to \infty$ as $n\to\infty$. As an application, we also calculate the Hausdorff dimension of the intersection sets \[E^{\sup}_{\inf}(\psi,\alpha_1,\alpha_2)=\left\{x\in p\mathbb{Z}_p:\liminf_{n\rightarrow\infty}\dfrac{a_n(x)}{\psi(n)}=\alpha_1,~\limsup_{n\rightarrow\infty}\dfrac{a_n(x)}{\psi(n)}=\alpha_2\right\}\] for the above function $\psi$ and $0\leq\alpha_1<\alpha_2\leq\infty$.
Submitted 13 June, 2024;
originally announced June 2024.
-
Electrically tunable and enhanced nonlinearity of moiré exciton-polaritons in transition metal dichalcogenide bilayers
Authors:
Kok Wee Song,
Oleksandr Kyriienko
Abstract:
We develop a microscopic theory for the nonlinear optical response of moiré exciton-polaritons in bilayers of transition metal dichalcogenides (TMDs). Our theory allows us to study the tunnel-coupled intralayer and interlayer excitonic modes for a wide range of twist angles ($\theta$), external electric fields, and light-matter coupling strengths, providing insight into a previously inaccessible hybridization regime. Specifically, we account for the Umklapp scattering processes of two exciton-polaritons responsible for enhanced nonlinearity, and show that they are crucial for describing interactions at strong hybridization. We reveal a regime of attractive nonlinearity for moiré polaritons, stemming from the anisotropic Coulomb interactions, which can explain some experimental features of the optical response in TMD bilayers. Furthermore, within our theory we demonstrate that the attractive nonlinearity can be tuned into a repulsive one by applying an external electric field. Our findings show that nonlinear moiré polaritons offer a controllable platform for nonlinear polaritonic devices.
Submitted 21 June, 2024; v1 submitted 12 June, 2024;
originally announced June 2024.
-
Domain-specific ReAct for physics-integrated iterative modeling: A case study of LLM agents for gas path analysis of gas turbines
Authors:
Tao Song,
Yuwei Fan,
Chenlong Feng,
Keyu Song,
Chao Liu,
Dongxiang Jiang
Abstract:
This study explores the application of large language models (LLMs) with callable tools in the energy and power engineering domain, focusing on gas path analysis of gas turbines. We developed a dual-agent tool-calling process to integrate expert knowledge, predefined tools, and LLM reasoning. We evaluated various LLMs, including LLama3, Qwen1.5, and GPT. Smaller models struggled with tool usage and parameter extraction, while larger models demonstrated favorable capabilities. All models faced challenges with complex, multi-component problems. Based on the test results, we infer that LLMs with nearly 100 billion parameters could meet the requirements of professional scenarios with fine-tuning and advanced prompt design. Continued development is likely to enhance their accuracy and effectiveness, paving the way for more robust AI-driven solutions.
Submitted 1 June, 2024;
originally announced June 2024.
-
OphNet: A Large-Scale Video Benchmark for Ophthalmic Surgical Workflow Understanding
Authors:
Ming Hu,
Peng Xia,
Lin Wang,
Siyuan Yan,
Feilong Tang,
Zhongxing Xu,
Yimin Luo,
Kaimin Song,
Jurgen Leitner,
Xuelian Cheng,
Jun Cheng,
Chi Liu,
Kaijing Zhou,
Zongyuan Ge
Abstract:
Surgical scene perception via videos is critical for advancing robotic surgery, telesurgery, and AI-assisted surgery, particularly in ophthalmology. However, the scarcity of diverse and richly annotated video datasets has hindered the development of intelligent systems for surgical workflow analysis. Existing datasets face challenges such as small scale, lack of diversity in surgery and phase categories, and absence of time-localized annotations. These limitations impede action understanding and model generalization validation in complex and diverse real-world surgical scenarios. To address this gap, we introduce OphNet, a large-scale, expert-annotated video benchmark for ophthalmic surgical workflow understanding. OphNet features: 1) A diverse collection of 2,278 surgical videos spanning 66 types of cataract, glaucoma, and corneal surgeries, with detailed annotations for 102 unique surgical phases and 150 fine-grained operations. 2) Sequential and hierarchical annotations for each surgery, phase, and operation, enabling comprehensive understanding and improved interpretability. 3) Time-localized annotations, facilitating temporal localization and prediction tasks within surgical workflows. With approximately 285 hours of surgical videos, OphNet is about 20 times larger than the largest existing surgical workflow analysis benchmark. Code and dataset are available at: https://minghu0830.github.io/OphNet-benchmark/.
Submitted 19 July, 2024; v1 submitted 11 June, 2024;
originally announced June 2024.
-
WenetSpeech4TTS: A 12,800-hour Mandarin TTS Corpus for Large Speech Generation Model Benchmark
Authors:
Linhan Ma,
Dake Guo,
Kun Song,
Yuepeng Jiang,
Shuai Wang,
Liumeng Xue,
Weiming Xu,
Huan Zhao,
Binbin Zhang,
Lei Xie
Abstract:
With the development of large text-to-speech (TTS) models and the scale-up of training data, state-of-the-art TTS systems have achieved impressive performance. In this paper, we present WenetSpeech4TTS, a multi-domain Mandarin corpus derived from the open-source WenetSpeech dataset. To tailor it for text-to-speech tasks, we refined WenetSpeech by adjusting segment boundaries, enhancing audio quality, and eliminating speaker mixing within each segment. After a more accurate transcription process and quality-based data filtering, the resulting WenetSpeech4TTS corpus contains $12,800$ hours of paired audio-text data. Furthermore, we have created subsets of varying sizes, categorized by segment quality scores, for TTS model training and fine-tuning. VALL-E and NaturalSpeech 2 systems are trained and fine-tuned on these subsets to validate the usability of WenetSpeech4TTS, establishing baselines on the benchmark for fair comparison of TTS systems. The corpus and corresponding benchmarks are publicly available on Hugging Face.
Submitted 19 June, 2024; v1 submitted 9 June, 2024;
originally announced June 2024.
-
Distributed Motion Control of Multiple Mobile Manipulator System with Disturbance and Communication Delay
Authors:
Wenhang Liu,
Meng Ren,
Kun Song,
Michael Yu Wang,
Zhenhua Xiong
Abstract:
In real-world object manipulation scenarios, multiple mobile manipulator systems may suffer from disturbances and asynchrony, leading to excessive interaction forces and causing object damage or emergency stops. This paper presents a novel distributed motion control approach aimed at reducing these unnecessary interaction forces. The control strategy only utilizes force information without the need for global position and velocity information. Disturbances are corrected through compensatory movements of the manipulators. Besides, the asymmetric, non-uniform, and time-varying communication delays between robots are also considered. The stability of the control law is rigorously proven by the Lyapunov theorem. Subsequently, the efficacy of the proposed control law is validated through simulations and experiments of collaborative object transportation by two robots. Experimental results demonstrate the effectiveness of the proposed control law in reducing interaction forces during object manipulation.
Submitted 8 June, 2024;
originally announced June 2024.
-
1st Place Winner of the 2024 Pixel-level Video Understanding in the Wild (CVPR'24 PVUW) Challenge in Video Panoptic Segmentation and Best Long Video Consistency of Video Semantic Segmentation
Authors:
Qingfeng Liu,
Mostafa El-Khamy,
Kee-Bong Song
Abstract:
The third Pixel-level Video Understanding in the Wild (PVUW CVPR 2024) challenge aims to advance the state of the art in video understanding by benchmarking Video Panoptic Segmentation (VPS) and Video Semantic Segmentation (VSS) on challenging videos and scenes introduced in the large-scale Video Panoptic Segmentation in the Wild (VIPSeg) test set and the large-scale Video Scene Parsing in the Wild (VSPW) test set, respectively. This paper details the work that won 1st place in the PVUW'24 VPS challenge, establishing state-of-the-art results in all metrics, including Video Panoptic Quality (VPQ) and Segmentation and Tracking Quality (STQ). With minor fine-tuning, our approach also achieved 3rd place in the PVUW'24 VSS challenge ranked by the mIoU (mean intersection over union) metric and 1st place ranked by the VC16 (16-frame video consistency) metric. Our winning solution builds on the foundational vision transformer model DINOv2 ViT-g and the proven multi-stage Decoupled Video Instance Segmentation (DVIS) frameworks for video understanding.
Submitted 8 June, 2024;
originally announced June 2024.
-
TCMD: A Traditional Chinese Medicine QA Dataset for Evaluating Large Language Models
Authors:
Ping Yu,
Kaitao Song,
Fengchen He,
Ming Chen,
Jianfeng Lu
Abstract:
Recent unprecedented advances in Large Language Models (LLMs) have propelled the medical community by establishing advanced medical-domain models. However, due to the limited collection of medical datasets, only a few comprehensive benchmarks are available to gauge progress in this area. In this paper, we introduce TCMD, a new medical question-answering (QA) dataset containing a large amount of manual instruction data for solving Traditional Chinese Medicine (TCM) examination tasks. Specifically, TCMD collects a large number of questions across diverse domains, annotated with their medical subjects, which enables a comprehensive assessment of the capabilities of LLMs in the TCM domain. We conduct an extensive evaluation of various general LLMs and medical-domain-specific LLMs. Moreover, we analyze the robustness of current LLMs in solving TCM QA tasks by introducing randomness. The inconsistency of the experimental results reveals the shortcomings of current LLMs in solving QA tasks. We expect that our dataset can further facilitate the development of LLMs in the TCM area.
Submitted 7 June, 2024;
originally announced June 2024.
-
Unveiling the Dynamics of Information Interplay in Supervised Learning
Authors:
Kun Song,
Zhiquan Tan,
Bochao Zou,
Huimin Ma,
Weiran Huang
Abstract:
In this paper, we use matrix information theory as an analytical tool to study the dynamics of the information interplay between data representations and classification head vectors during supervised learning. Specifically, inspired by the theory of Neural Collapse, we introduce the matrix mutual information ratio (MIR) and the matrix entropy difference ratio (HDR) to assess the interactions between data representations and classification heads in supervised learning, and we determine the theoretical optimal values of MIR and HDR when Neural Collapse occurs. Our experiments show that MIR and HDR can effectively explain many phenomena occurring in neural networks, for example, standard supervised training dynamics, linear mode connectivity, and the performance of label smoothing and pruning. Additionally, we use MIR and HDR to gain insights into the dynamics of grokking, an intriguing phenomenon in supervised training in which the model demonstrates generalization capabilities long after it has learned to fit the training data. Furthermore, we introduce MIR and HDR as loss terms in supervised and semi-supervised learning to optimize the information interactions among samples and classification heads. The empirical results provide evidence of the method's effectiveness, demonstrating that MIR and HDR not only aid in understanding the dynamics throughout training but can also enhance the training procedure itself.
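The matrix quantities can be sketched as follows. This is a minimal illustration, not the paper's code: it assumes the matrix entropy of a unit-diagonal Gram matrix $K\in\mathbb{R}^{n\times n}$ is $H(K) = -\operatorname{tr}\big(\tfrac{K}{n}\log\tfrac{K}{n}\big)$, the matrix mutual information is $H(K_1)+H(K_2)-H(K_1\odot K_2)$, $\mathrm{MIR} = I/\min(H_1,H_2)$ and $\mathrm{HDR} = |H_1-H_2|/\max(H_1,H_2)$, and that each representation is paired row by row with the head vector of its label.

```python
import numpy as np


def matrix_entropy(K: np.ndarray) -> float:
    """Matrix (von Neumann) entropy of a PSD Gram matrix with unit diagonal."""
    n = K.shape[0]
    eigvals = np.linalg.eigvalsh(K / n)      # eigenvalues of K/n sum to 1
    eigvals = eigvals[eigvals > 1e-12]       # drop numerical zeros
    return float(-np.sum(eigvals * np.log(eigvals)))


def gram(Z: np.ndarray) -> np.ndarray:
    """Gram matrix of L2-normalized rows, so the diagonal is all ones."""
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    return Zn @ Zn.T


def mir_hdr(Z1: np.ndarray, Z2: np.ndarray) -> tuple[float, float]:
    """Assumed MIR and HDR between two row-aligned sets of vectors."""
    K1, K2 = gram(Z1), gram(Z2)
    h1, h2 = matrix_entropy(K1), matrix_entropy(K2)
    # joint entropy uses the Hadamard (elementwise) product of the Gram matrices
    mutual_info = h1 + h2 - matrix_entropy(K1 * K2)
    mir = mutual_info / min(h1, h2)
    hdr = abs(h1 - h2) / max(h1, h2)
    return mir, hdr
```

When every representation coincides with its class head vector (an idealized collapsed state), the sketch returns $\mathrm{MIR}=1$ and $\mathrm{HDR}=0$; for example, with three orthonormal head vectors and six samples, both Gram matrices have entropy $\log 3$.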
Submitted 6 June, 2024;
originally announced June 2024.
-
Convergence rate of the Euler-Maruyama scheme to density dependent SDEs driven by $α$-stable additive noise
Authors:
Ke Song,
Zimo Hao
Abstract:
In this paper, we establish the weak convergence rate of density-dependent stochastic differential equations with bounded drift driven by $α$-stable processes with $α\in(1,2)$. The well-posedness of these equations has been previously obtained in \cite{wu2023well}. We derive an explicit convergence rate in total variation for the Euler-Maruyama scheme, employing a technique rooted in \cite{hao2023}.
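For intuition about the scheme being analyzed, here is a minimal sketch of an Euler-Maruyama discretization with symmetric $\alpha$-stable additive noise. It makes a simplifying assumption: the drift $b$ depends only on the state, whereas the paper treats density-dependent drifts, so the sketch illustrates only the time stepping and the $h^{1/\alpha}$ scaling of the noise increments. The stable samples are drawn with the Chambers-Mallows-Stuck method; the function names are hypothetical.

```python
from collections.abc import Callable

import numpy as np


def symmetric_stable(alpha: float, size: int, rng: np.random.Generator) -> np.ndarray:
    """Symmetric alpha-stable samples (E exp(i t L_1) = exp(-|t|^alpha)), Chambers-Mallows-Stuck."""
    V = rng.uniform(-np.pi / 2, np.pi / 2, size)
    W = rng.exponential(1.0, size)
    return (np.sin(alpha * V) / np.cos(V) ** (1 / alpha)
            * (np.cos((1 - alpha) * V) / W) ** ((1 - alpha) / alpha))


def euler_maruyama_stable(b: Callable[[np.ndarray], np.ndarray], x0: np.ndarray,
                          alpha: float, T: float, n_steps: int, seed: int = 0) -> np.ndarray:
    """Euler-Maruyama for dX = b(X) dt + dL_t, L a symmetric alpha-stable process.

    Simplification: b depends only on the state, not on the density of X_t,
    so this is NOT the density-dependent scheme studied in the paper.
    """
    rng = np.random.default_rng(seed)
    h = T / n_steps
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        # stable increments over a step of length h scale like h^(1/alpha)
        dL = h ** (1 / alpha) * symmetric_stable(alpha, x.shape[0], rng)
        x = x + b(x) * h + dL
    return x
```

With `alpha=2.0` the sampler reduces to a Gaussian with variance 2, which is a convenient sanity check; for $\alpha\in(1,2)$ the increments are heavy-tailed but the update rule is unchanged.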
Submitted 31 May, 2024;
originally announced May 2024.
-
Can Graph Learning Improve Task Planning?
Authors:
Xixi Wu,
Yifei Shen,
Caihua Shan,
Kaitao Song,
Siwei Wang,
Bohang Zhang,
Jiarui Feng,
Hong Cheng,
Wei Chen,
Yun Xiong,
Dongsheng Li
Abstract:
Task planning is emerging as an important research topic alongside the development of large language models (LLMs). It aims to break down complex user requests into solvable sub-tasks, thereby fulfilling the original requests. In this context, the sub-tasks can be naturally viewed as a graph, where the nodes represent the sub-tasks, and the edges denote the dependencies among them. Consequently, task planning is a decision-making problem that involves selecting a connected path or subgraph within the corresponding graph and invoking it. In this paper, we explore graph learning-based methods for task planning, a direction that is orthogonal to the prevalent focus on prompt design. Our interest in graph learning stems from a theoretical discovery: the biases of attention and auto-regressive loss impede LLMs' ability to effectively navigate decision-making on graphs, which is adeptly addressed by graph neural networks (GNNs). This theoretical insight led us to integrate GNNs with LLMs to enhance overall performance. Extensive experiments demonstrate that GNN-based methods surpass existing solutions even without training, and minimal training can further enhance their performance. Additionally, our approach complements prompt engineering and fine-tuning techniques, with performance further enhanced by improved prompts or a fine-tuned model.
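The invocation step described above can be sketched independently of how the subgraph is chosen. In this minimal example, which is an illustration under stated assumptions rather than the paper's pipeline, sub-tasks and their dependencies form a directed acyclic graph, a subgraph has already been selected (by a GNN, an LLM, or otherwise), and the selected sub-tasks are ordered so that each runs after its prerequisites. Task names are hypothetical; `graphlib` is part of the Python standard library from 3.9.

```python
from graphlib import TopologicalSorter


def invocation_order(dependencies: dict[str, set[str]], selected: set[str]) -> list[str]:
    """Order a selected subgraph of sub-tasks so each runs after its prerequisites.

    dependencies maps each sub-task to the sub-tasks whose outputs it needs.
    """
    # every prerequisite of a selected sub-task must itself be selected
    missing = {d for t in selected for d in dependencies.get(t, set())} - selected
    if missing:
        raise ValueError(f"selected subgraph is missing prerequisites: {sorted(missing)}")
    sub = {t: dependencies.get(t, set()) & selected for t in selected}
    # static_order yields each node only after all of its predecessors
    return list(TopologicalSorter(sub).static_order())
```

For instance, with `detect -> caption -> translate`, selecting all three returns `['detect', 'caption', 'translate']`, while selecting only `caption` raises an error because its prerequisite was not selected.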
Submitted 29 May, 2024;
originally announced May 2024.
-
Wolff potentials and nonlocal equations of Lane-Emden type
Authors:
Quoc-Hung Nguyen,
Jihoon Ok,
Kyeong Song
Abstract:
We consider nonlocal equations of the type \[ (-\Delta_{p})^{s}u = \mu\quad \text{in }\Omega, \] where $\Omega\subset \mathbb{R}^{n}$ is either a bounded domain or the whole $\mathbb{R}^{n}$, $\mu$ is a Radon measure on $\Omega$, $0<s<1$ and $1<p<n/s$. In particular, we extend the existence, regularity and Wolff potential estimates for SOLA (Solutions Obtained as Limits of Approximations), established by Kuusi, Mingione, and Sire (Comm. Math. Phys. 337:1317--1368, 2015), to the strongly singular case $1<p\le2-s/n$. Moreover, using Wolff potentials and Orlicz capacities, we present both a sufficient and a necessary condition for the existence of SOLA to nonlocal equations of the type \[ (-\Delta_{p})^{s}u = P(u) + \mu\quad \text{in }\Omega, \] where $P(\cdot)$ is either a power function or an exponential function.
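For orientation, the two standard objects this abstract relies on are recalled below as a sketch: the fractional $p$-Laplacian, written with one common normalization (constants and kernel conventions vary between papers), and the nonlinear Wolff potential of the type used in the Kuusi-Mingione-Sire estimates. The last line is a worked instance of the parameter ranges for $n=2$, $s=1/2$, chosen only for illustration.

\[ (-\Delta_p)^s u(x) = 2\,\mathrm{P.V.}\int_{\mathbb{R}^n}\frac{|u(x)-u(y)|^{p-2}\,(u(x)-u(y))}{|x-y|^{n+sp}}\,dy, \]
\[ \mathbf{W}^{\mu}_{s,p}(x_0,R) = \int_0^R\left(\frac{|\mu|(B_\rho(x_0))}{\rho^{\,n-sp}}\right)^{\frac{1}{p-1}}\frac{d\rho}{\rho}, \]
\[ n=2,\ s=\tfrac12:\qquad \frac{n}{s}=4,\qquad 2-\frac{s}{n}=\frac74,\qquad\text{so the strongly singular range is } 1<p\le\tfrac74. \]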
Submitted 19 May, 2024;
originally announced May 2024.
-
RHAML: Rendezvous-based Hierarchical Architecture for Mutual Localization
Authors:
Gaoming Chen,
Kun Song,
Xiang Xu,
Wenhang Liu,
Zhenhua Xiong
Abstract:
Mutual localization serves as the foundation for collaborative perception and task assignment in multi-robot systems. Effectively utilizing limited onboard sensors for mutual localization between marker-less robots is a worthwhile goal. However, because large scale variations of the observed robot and localization refinement have not been adequately considered, previous work has shown limited accuracy when robots are equipped only with RGB cameras. To enhance localization precision, this paper proposes a novel rendezvous-based hierarchical architecture for mutual localization (RHAML). First, to learn multi-scale robot features, anisotropic convolutions are introduced into the network, yielding initial localization results. Then, an iterative refinement module with rendering is employed to adjust the observed robot poses. Finally, pose graph optimization is performed to globally optimize all localization results, taking multi-frame observations into account. The result is a flexible architecture that allows appropriate modules to be selected based on requirements. Simulations demonstrate that RHAML effectively addresses the problem of multi-robot mutual localization, achieving translation errors below 2 cm and rotation errors below 0.5 degrees when robots exhibit 5 m of depth variation. Moreover, its practical utility is validated by applying it to map fusion when multiple robots explore unknown environments.
Submitted 19 May, 2024;
originally announced May 2024.
-
Multi-Robot Rendezvous in Unknown Environment with Limited Communication
Authors:
Kun Song,
Gaoming Chen,
Wenhang Liu,
Zhenhua Xiong
Abstract:
Rendezvous, which aims to gather all robots at a specific location, is an important collaborative behavior for multi-robot systems. However, achieving rendezvous in an unknown environment is challenging. Previous research has mainly focused on special scenarios where communication is not allowed and each robot executes a random search strategy, which is highly time-consuming, especially in large-scale environments. In this work, we focus on rendezvous in unknown environments where communication is available. We divide this task into two steps: rendezvous-based environment exploration with relative pose (RP) estimation, and rendezvous point election. A new strategy called partitioned and incomplete exploration for rendezvous (PIER) is proposed to efficiently explore the unknown environment, where lightweight topological maps are constructed and shared among robots for RP estimation with very little communication. Then, a rendezvous point selection algorithm based on the merged topological map is proposed to achieve efficient rendezvous for multi-robot systems. The effectiveness of the proposed methods is validated in both simulations and real-world experiments.
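As an illustration of electing a rendezvous point on a merged topological map, the sketch below picks the node that minimizes the worst-case shortest-path distance from any robot (a minimax graph-center criterion). This criterion is an assumption chosen for clarity: the abstract does not specify PIER's election rule, and the map format (adjacency dictionaries with edge lengths) is hypothetical.

```python
import heapq


def dijkstra(graph: dict[str, dict[str, float]], source: str) -> dict[str, float]:
    """Shortest-path distances from source on a weighted undirected topological map."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue  # stale heap entry
        for nbr, w in graph[node].items():
            nd = d + w
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return dist


def elect_rendezvous(graph: dict[str, dict[str, float]],
                     robot_nodes: list[str]) -> tuple[str, float]:
    """Pick the map node minimizing the worst-case travel distance over all robots."""
    dists = [dijkstra(graph, r) for r in robot_nodes]
    best, best_cost = None, float("inf")
    for node in graph:
        # unreachable nodes get infinite cost and are never chosen
        cost = max(d.get(node, float("inf")) for d in dists)
        if cost < best_cost:
            best, best_cost = node, cost
    return best, best_cost
```

On a path map `A-B-C-D-E` with unit edges and robots at `A` and `E`, this returns `('C', 2.0)`. Minimizing the maximum distance targets the arrival time of the last robot; minimizing the sum of distances would be an equally simple alternative.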
Submitted 14 May, 2024;
originally announced May 2024.
-
TableVQA-Bench: A Visual Question Answering Benchmark on Multiple Table Domains
Authors:
Yoonsik Kim,
Moonbin Yim,
Ka Yeon Song
Abstract:
In this paper, we establish a benchmark for table visual question answering, referred to as TableVQA-Bench, derived from pre-existing table question-answering (QA) and table structure recognition datasets. It is important to note that existing datasets have not incorporated images or QA pairs, which are two crucial components of TableVQA. As such, the primary objective of this paper is to obtain these necessary components. Specifically, images are sourced either by applying a \textit{stylesheet} or by employing the proposed table rendering system. QA pairs are generated by a large language model (LLM) whose input is a text-formatted table. The completed TableVQA-Bench comprises 1,500 QA pairs. We comprehensively compare the performance of various multi-modal large language models (MLLMs) on TableVQA-Bench. In our experiments, GPT-4V achieves the highest accuracy among both commercial and open-source MLLMs. Moreover, we find that the number of vision queries plays a significant role in TableVQA performance. To further analyze the capabilities of MLLMs relative to their LLM backbones, we present image-formatted tables to MLLMs and text-formatted tables to LLMs. Our findings suggest that processing visual inputs is more challenging than processing text inputs, as evidenced by the lower performance of MLLMs despite their generally higher computational costs compared with LLMs. The proposed TableVQA-Bench and evaluation code are available at \href{https://github.com/naver-ai/tablevqabench}{https://github.com/naver-ai/tablevqabench}.
Submitted 29 April, 2024;
originally announced April 2024.
-
Fisher Information Improved Training-Free Conditional Diffusion Model
Authors:
Kaiyu Song,
Hanjiang Lai
Abstract:
Recently, training-free methods for diffusion models have succeeded in conditional image generation tasks. However, they suffer from an efficiency problem because they require computing a gradient at high computational cost, and previous methods make strong assumptions to solve it, sacrificing generalization. In this work, we propose the Fisher information guided diffusion model (FIGD). Concretely, we introduce the Fisher information to estimate the gradient without making any additional assumptions, thereby reducing computational cost. Meanwhile, we demonstrate that the Fisher information ensures the generalization of FIGD and provides new insights, grounded in information theory, into training-free methods. The experimental results demonstrate that FIGD can achieve different conditional generations more quickly while maintaining high quality.
Submitted 28 April, 2024;
originally announced April 2024.
-
Solute segregation in polycrystalline aluminum from hybrid Monte Carlo and molecular dynamics simulations with a unified neuroevolution potential
Authors:
Keke Song,
Jiahui Liu,
Shunda Chen,
Zheyong Fan,
Yanjing Su,
Ping Qian
Abstract:
One of the most effective methods to enhance the strength of aluminum alloys is to modify grain boundaries (GBs) through solute segregation. However, the fundamental mechanisms of solute segregation and their impacts on material properties remain elusive. In this study, we implemented highly efficient hybrid Monte Carlo and molecular dynamics (MCMD) algorithms in the graphics processing units molecular dynamics (GPUMD) package. Using this efficient MCMD approach combined with a general-purpose machine-learning-based neuroevolution potential (NEP) for 16 elemental metals and their alloys, we simulated the segregation of 15 solutes in polycrystalline Al. Our results elucidate the segregation behavior and trends of these 15 solutes. Additionally, we investigated the impact of solutes on the strength of polycrystalline Al. The mechanisms underlying solute strengthening and embrittlement were analyzed at the atomistic level, revealing the importance of GB cohesion, as well as the nucleation and movement of Shockley dislocations, in determining the material's strength. We anticipate that our methods, along with our insights into solute segregation behavior in polycrystalline Al, will be valuable for the design of Al alloys and other multi-component materials, including medium-entropy materials, high-entropy materials, and complex concentrated alloys.
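The Monte Carlo half of a hybrid MCMD loop can be sketched as a Metropolis species-swap step. This toy example assumes a periodic 1D chain with nearest-neighbour pair energies in place of the NEP potential and omits the MD steps that would relax atomic positions between MC sweeps; `pair_energy` and `mc_swap_sweep` are hypothetical names, not GPUMD functions.

```python
import math
import random


def pair_energy(types: list[str], eps: dict[tuple[str, str], float]) -> float:
    """Total nearest-neighbour energy of a periodic 1D chain of atom types."""
    n = len(types)
    return sum(eps[(types[i], types[(i + 1) % n])] for i in range(n))


def mc_swap_sweep(types: list[str], eps: dict[tuple[str, str], float], kT: float,
                  n_trials: int, rng: random.Random) -> tuple[float, int]:
    """Canonical MC: swap two atoms of different species, accept via Metropolis.

    Modifies types in place; returns the final energy and the number of accepted swaps.
    """
    energy = pair_energy(types, eps)
    accepted = 0
    for _ in range(n_trials):
        i, j = rng.randrange(len(types)), rng.randrange(len(types))
        if types[i] == types[j]:
            continue  # swapping identical species changes nothing
        types[i], types[j] = types[j], types[i]
        new_energy = pair_energy(types, eps)
        dE = new_energy - energy
        # Metropolis criterion: accept with probability min(1, exp(-dE / kT))
        if dE <= 0 or rng.random() < math.exp(-dE / kT):
            energy = new_energy
            accepted += 1
        else:
            types[i], types[j] = types[j], types[i]  # reject: undo the swap
    return energy, accepted
```

Swaps conserve the composition, which corresponds to canonical-ensemble MC; a semi-grand-canonical variant would instead propose changing a single atom's species and add a chemical-potential term to the acceptance test.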
Submitted 21 April, 2024;
originally announced April 2024.
-
Estimation for conditional moment models based on martingale difference divergence
Authors:
Kunyang Song,
Feiyu Jiang,
Ke Zhu
Abstract:
We provide a new estimation method for conditional moment models via the martingale difference divergence (MDD). Our MDD-based estimation method is formulated in the framework of a continuum of unconditional moment restrictions. Unlike existing estimation methods in this framework, the MDD-based method adopts a non-integrable weighting function, which can capture more information from the unconditional moment restrictions than an integrable weighting function and thereby enhance estimation efficiency. Because MDD is shift-invariant, our MDD-based estimation method cannot identify intercept parameters. To overcome this identification issue, we further provide a two-step estimation procedure for models with intercept parameters. Under regularity conditions, we establish the asymptotics of the proposed estimators, which are not only easy to implement with analytic asymptotic variances, but also applicable to time series data with an unspecified form of conditional heteroskedasticity. Finally, we illustrate the usefulness of the proposed estimators with simulations and two real examples.
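A minimal sketch of the sample MDD and an MDD-based slope estimate follows. It assumes a scalar linear model $y = \theta x + \varepsilon$ with $\mathbb{E}[\varepsilon\mid x]=0$ and uses the V-statistic $\mathrm{MDD}_n^2(Y\mid X) = -\frac{1}{n^2}\sum_{j,k}(Y_j-\bar Y)(Y_k-\bar Y)\,|X_j-X_k|$; minimizing it over $\theta$ for the residuals $y-\theta x$ illustrates the idea and is not the paper's exact estimator or its two-step procedure.

```python
import numpy as np


def mdd_squared(x: np.ndarray, y: np.ndarray) -> float:
    """Sample MDD^2(Y | X) = -(1/n^2) sum_{j,k} (y_j - ybar)(y_k - ybar) |x_j - x_k|."""
    yc = y - y.mean()
    D = np.abs(x[:, None] - x[None, :])   # pairwise distances |x_j - x_k|
    return float(-(yc @ D @ yc) / len(y) ** 2)


def mdd_slope(x: np.ndarray, y: np.ndarray) -> float:
    """Minimize MDD^2(y - theta * x | x) over a scalar slope theta.

    For non-constant x, xc @ D @ xc < 0, so the objective is a convex
    quadratic in theta and the stationary point below is its minimizer.
    """
    xc, yc = x - x.mean(), y - y.mean()
    D = np.abs(x[:, None] - x[None, :])
    return float((xc @ D @ yc) / (xc @ D @ xc))
```

Because the residuals are centered inside the statistic, adding a constant to $y$ leaves $\mathrm{MDD}_n^2$ unchanged, which is the shift invariance that prevents the intercept from being identified: for `x = [0, 1, 2, 3, 4]` and `y = 2*x + 7`, the sketch recovers the slope `2.0` but carries no information about the `7`.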
Submitted 17 April, 2024;
originally announced April 2024.
-
Oblique-MERF: Revisiting and Improving MERF for Oblique Photography
Authors:
Xiaoyi Zeng,
Kaiwen Song,
Leyuan Yang,
Bailin Deng,
Juyong Zhang
Abstract:
Neural implicit fields have established a new paradigm for scene representation, with subsequent work achieving high-quality real-time rendering. However, reconstructing 3D scenes from oblique aerial photography presents unique challenges, such as varying spatial scale distributions and a constrained range of tilt angles, often resulting in high memory consumption and reduced rendering quality at extrapolated viewpoints. In this paper, we enhance MERF to accommodate these data characteristics by introducing an innovative adaptive occupancy plane optimized during the volume rendering process and a smoothness regularization term for view-dependent color to address these issues. Our approach, termed Oblique-MERF, surpasses state-of-the-art real-time methods by approximately 0.7 dB, reduces VRAM usage by about 40%, and achieves higher rendering frame rates with more realistic rendering outcomes across most viewpoints.
Submitted 15 April, 2024;
originally announced April 2024.
-
MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation
Authors:
Kunpeng Song,
Yizhe Zhu,
Bingchen Liu,
Qing Yan,
Ahmed Elgammal,
Xiao Yang
Abstract:
In this paper, we present MoMA: an open-vocabulary, training-free personalized image model with flexible zero-shot capabilities. As foundational text-to-image models rapidly evolve, the demand for robust image-to-image translation grows. Addressing this need, MoMA specializes in subject-driven personalized image generation. Utilizing an open-source Multimodal Large Language Model (MLLM), we train MoMA to serve a dual role as both a feature extractor and a generator. This approach effectively synergizes reference image and text prompt information to produce valuable image features for an image diffusion model. To better leverage the generated features, we further introduce a novel self-attention shortcut method that efficiently transfers image features to an image diffusion model, improving the resemblance of the target object in generated images. Remarkably, as a tuning-free plug-and-play module, our model requires only a single reference image and outperforms existing methods in generating images with high detail fidelity, enhanced identity preservation, and prompt faithfulness. Our work is open-source, thereby providing universal access to these advancements.
Submitted 8 April, 2024;
originally announced April 2024.
-
Pre-trained Vision and Language Transformers Are Few-Shot Incremental Learners
Authors:
Keon-Hee Park,
Kyungwoo Song,
Gyeong-Moon Park
Abstract:
Few-Shot Class Incremental Learning (FSCIL) is a task that requires a model to learn new classes incrementally without forgetting when only a few samples for each class are given. FSCIL encounters two significant challenges, catastrophic forgetting and overfitting, and these challenges have driven prior studies to rely primarily on shallow models, such as ResNet-18. Even though their limited capacity can mitigate both forgetting and overfitting, it leads to inadequate knowledge transfer during few-shot incremental sessions. In this paper, we argue that large models such as vision and language transformers pre-trained on large datasets can be excellent few-shot incremental learners. To this end, we propose a novel FSCIL framework called PriViLege, Pre-trained Vision and Language transformers with prompting functions and knowledge distillation. Our framework effectively addresses the challenges of catastrophic forgetting and overfitting in large models through a new pre-trained knowledge tuning (PKT) and two losses: an entropy-based divergence loss and a semantic knowledge distillation loss. Experimental results show that the proposed PriViLege significantly outperforms the existing state-of-the-art methods by a large margin, e.g., +9.38% on CUB200, +20.58% on CIFAR-100, and +13.36% on miniImageNet. Our implementation code is available at https://github.com/KHU-AGI/PriViLege.
Submitted 2 April, 2024;
originally announced April 2024.
-
HyperCLOVA X Technical Report
Authors:
Kang Min Yoo,
Jaegeun Han,
Sookyo In,
Heewon Jeon,
Jisu Jeong,
Jaewook Kang,
Hyunwook Kim,
Kyung-Min Kim,
Munhyong Kim,
Sungju Kim,
Donghyun Kwak,
Hanock Kwak,
Se Jung Kwon,
Bado Lee,
Dongsoo Lee,
Gichang Lee,
Jooho Lee,
Baeseong Park,
Seongjin Shin,
Joonsang Yu,
Seolki Baek,
Sumin Byeon,
Eungsup Cho,
Dooseok Choe,
Jeesung Han
, et al. (371 additional authors not shown)
Abstract:
We introduce HyperCLOVA X, a family of large language models (LLMs) tailored to the Korean language and culture, along with competitive capabilities in English, math, and coding. HyperCLOVA X was trained on a balanced mix of Korean, English, and code data, followed by instruction-tuning with high-quality human-annotated datasets while abiding by strict safety guidelines reflecting our commitment to responsible AI. The model is evaluated across various benchmarks, including comprehensive reasoning, knowledge, commonsense, factuality, coding, math, chatting, instruction-following, and harmlessness, in both Korean and English. HyperCLOVA X exhibits strong reasoning capabilities in Korean backed by a deep understanding of the language and cultural nuances. Further analysis of the inherent bilingual nature and its extension to multilingualism highlights the model's cross-lingual proficiency and strong generalization ability to untargeted languages, including machine translation between several language pairs and cross-lingual inference tasks. We believe that HyperCLOVA X can provide helpful guidance for regions or countries in developing their sovereign LLMs.
Submitted 13 April, 2024; v1 submitted 2 April, 2024;
originally announced April 2024.
-
Polarity Calibration for Opinion Summarization
Authors:
Yuanyuan Lei,
Kaiqiang Song,
Sangwoo Cho,
Xiaoyang Wang,
Ruihong Huang,
Dong Yu
Abstract:
Opinion summarization is the task of automatically generating summaries from a variety of subjective information, such as product reviews or political opinions. The challenge of opinion summarization lies in presenting divergent or even conflicting opinions. We conduct an analysis of previous summarization models, which reveals their tendency to amplify polarity bias, emphasizing majority opinions while ignoring minority opinions. To address this issue and make the summarizer express both sides of the opinions, we introduce the concept of polarity calibration, which aims to align the polarity of the output summary with that of the input text. Specifically, we develop a reinforcement training approach for polarity calibration. This approach feeds the polarity distance between the output summary and the input text as a reward into the summarizer, and also balances polarity calibration with content preservation and language naturality. We evaluate our Polarity Calibration model (PoCa) on two types of opinion summarization tasks: summarizing product reviews and political opinion articles. Automatic and human evaluations demonstrate that our approach can mitigate the polarity mismatch between the output summary and the input text while maintaining content semantics and language quality.
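The reward described above can be sketched as follows. This is a minimal illustration, not PoCa's implementation: the polarity scores, the content and naturalness scores, their ranges, and the weights are all hypothetical, and the sketch assumes the polarity distance enters the reward with a negative sign, so that a smaller mismatch with the input earns a higher reward.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical weights: the abstract does not say how the terms are balanced.
    polarity: float = 1.0
    content: float = 1.0
    naturalness: float = 1.0


def polarity_distance(summary_polarity: float, input_polarity: float) -> float:
    """Absolute gap between two polarity scores (assumed to lie in [-1, 1])."""
    return abs(summary_polarity - input_polarity)


def calibration_reward(
    summary_polarity: float,
    input_polarity: float,
    content_score: float,
    naturalness_score: float,
    weights: RewardWeights = RewardWeights(),
) -> float:
    """Reward for one generated summary (higher is better).

    Penalizes polarity mismatch with the input and rewards content
    preservation and naturalness, both assumed to be scores in [0, 1].
    """
    return (
        -weights.polarity * polarity_distance(summary_polarity, input_polarity)
        + weights.content * content_score
        + weights.naturalness * naturalness_score
    )


# A summary that skews positive versus one that matches the input's polarity.
skewed = calibration_reward(0.8, 0.1, content_score=0.7, naturalness_score=0.9)
balanced = calibration_reward(0.1, 0.1, content_score=0.7, naturalness_score=0.9)
print(f"skewed={skewed:.2f}, balanced={balanced:.2f}")  # skewed=0.90, balanced=1.60
```

With equal content and naturalness scores, the summary whose polarity matches the input receives the higher reward, which is the signal that would push a summarizer away from amplifying the majority opinion.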
Submitted 2 April, 2024;
originally announced April 2024.
-
ChatTracer: Large Language Model Powered Real-time Bluetooth Device Tracking System
Authors:
Qijun Wang,
Shichen Zhang,
Kunzhe Song,
Huacheng Zeng
Abstract:
Large language models (LLMs) have transformed the way we interact with cyber technologies. In this paper, we study the possibility of connecting LLMs with wireless sensor networks (WSNs). A successful design will not only extend LLMs' knowledge landscape to the physical world but also revolutionize human interaction with WSNs. To this end, we present ChatTracer, an LLM-powered real-time Bluetooth device tracking system. ChatTracer comprises three key components: an array of Bluetooth sniffing nodes, a database, and a fine-tuned LLM. ChatTracer was designed based on our experimental observation that commercial Apple/Android devices always broadcast hundreds of BLE packets per minute, even when idle. Its novelty lies in two aspects: i) a reliable and efficient BLE packet grouping algorithm; and ii) an LLM fine-tuning strategy that combines supervised fine-tuning (SFT) with reinforcement learning from human feedback (RLHF). We have built a prototype of ChatTracer with four sniffing nodes. Experimental results show that ChatTracer not only outperforms existing localization approaches but also provides an intelligent interface for user interaction.
Submitted 9 July, 2024; v1 submitted 28 March, 2024;
originally announced March 2024.
-
Spectrum of $S$- and $P$-wave $cc\bar{q}\bar{q}'$ $(\bar{q},\bar{q}' = \bar{u}, \bar{d}, \bar{s})$ systems in a chiral SU(3) quark model
Authors:
Du Wang,
Ke-Rang Song,
Wen-Ling Wang,
Fei Huang
Abstract:
Inspired by the resonance $T_{cc}^+(3875)$ recently observed by the LHCb Collaboration, we systematically explore the $S$- and $P$-wave $cc\bar{q}\bar{q}'$ $(\bar{q},\bar{q}' = \bar{u}, \bar{d}, \bar{s})$ systems in a chiral SU(3) quark model. The Hamiltonian contains the kinetic energy, the one-gluon-exchange (OGE) potential, the confinement potential, and the one-boson-exchange (OBE) potential stemming from the coupling of quark and chiral fields. The Schrödinger equation is solved using the variational method, with the spatial trial wave functions chosen as Gaussian functions. The lowest state is found to have a mass of $3879$ MeV, isospin and spin-parity $IJ^P=01^+$, and quark content $cc\bar{u}\bar{d}$, in agreement with the experimentally observed $T_{cc}^+(3875)$. This state lies approximately at the calculated $DD^\ast$ threshold and has a root-mean-square radius of about $0.48$ fm. These results demonstrate that the $T_{cc}^+(3875)$ can be accommodated as a stable and compact tetraquark state in the chiral SU(3) quark model. All the other $S$- and $P$-wave $cc\bar{q}\bar{q}'$ $(\bar{q},\bar{q}' = \bar{u}, \bar{d}, \bar{s})$ states lie about one hundred to a few hundred MeV above the corresponding meson-meson thresholds and are therefore not suggested as candidates for stable and compact tetraquark states, owing to their fall-apart decays into two mesons.
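To illustrate the variational method with a Gaussian trial function, the sketch below applies it to a textbook toy problem, the hydrogen atom in atomic units, rather than the chiral SU(3) quark-model Hamiltonian, whose potentials are not reproduced here. For the normalized trial function proportional to $e^{-\alpha r^2}$, the energy expectation value is $E(\alpha) = 3\alpha/2 - 2\sqrt{2\alpha/\pi}$, and minimizing over $\alpha$ gives an upper bound on the ground-state energy. The single-Gaussian ansatz and the golden-section search are illustrative choices, not the authors' numerical setup.

```python
import math


def trial_energy(alpha: float) -> float:
    """Energy expectation value <H> (hartree) for the normalized Gaussian
    trial function ~ exp(-alpha * r**2) in the hydrogen atom (atomic units).

    Kinetic term: 3*alpha/2; Coulomb term: -2*sqrt(2*alpha/pi).
    """
    return 1.5 * alpha - 2.0 * math.sqrt(2.0 * alpha / math.pi)


def golden_section_min(f, lo: float, hi: float, tol: float = 1e-9):
    """Minimize a unimodal function f on [lo, hi]; return (x_min, f(x_min))."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0  # 1/phi, about 0.618
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) < f(d):
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
    x = 0.5 * (a + b)
    return x, f(x)


# E(alpha) is convex for alpha > 0, so golden-section search finds the variational minimum.
alpha_opt, e_min = golden_section_min(trial_energy, 1e-6, 5.0)
print(f"alpha = {alpha_opt:.4f}, E = {e_min:.4f} hartree")  # alpha = 0.2829, E = -0.4244 hartree
```

The minimum lies at $\alpha = 8/(9\pi) \approx 0.283$ with $E = -4/(3\pi) \approx -0.424$ hartree, above the exact value of $-0.5$ hartree, as the variational principle guarantees; expanding the trial function in several Gaussians is a common way to tighten such a bound.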
Submitted 22 March, 2024;
originally announced March 2024.
-
Adaptive Hybrid Masking Strategy for Privacy-Preserving Face Recognition Against Model Inversion Attack
Authors:
Yinggui Wang,
Yuanqing Huang,
Jianshu Li,
Le Yang,
Kai Song,
Lei Wang
Abstract:
The use of sensitive personal data in training face recognition (FR) models poses significant privacy concerns, as adversaries can employ model inversion attacks (MIA) to infer the original training data. Existing defense methods, such as data augmentation and differential privacy, have been employed to mitigate this issue. However, these methods often fail to strike an optimal balance between privacy and accuracy. To address this limitation, this paper introduces an adaptive hybrid masking algorithm against MIA. Specifically, face images are masked in the frequency domain using an adaptive MixUp strategy. Unlike the traditional MixUp algorithm, which is predominantly used for data augmentation, our modified approach incorporates frequency-domain mixing. Previous studies have shown that increasing the number of images mixed in MixUp can enhance privacy preservation, but at the expense of reduced face recognition accuracy. To overcome this trade-off, we develop an enhanced adaptive MixUp strategy based on reinforcement learning, which enables us to mix a larger number of images while maintaining satisfactory recognition accuracy. To optimize privacy protection, we propose maximizing the reward function (i.e., the loss function of the FR system) during the training of the strategy network, whereas the loss function of the FR network is minimized during the training of the FR network. The strategy network and the face recognition network can thus be viewed as adversaries in the training process, ultimately reaching a more balanced trade-off. Experimental results demonstrate that our proposed hybrid masking scheme outperforms existing defense algorithms against MIA in terms of both privacy preservation and recognition accuracy.
Submitted 23 April, 2024; v1 submitted 13 March, 2024;
originally announced March 2024.
-
Large Language Models are Parallel Multilingual Learners
Authors:
Yongyu Mu,
Peinan Feng,
Zhiquan Cao,
Yuzhang Wu,
Bei Li,
Chenglong Wang,
Tong Xiao,
Kai Song,
Tongran Liu,
Chunliang Zhang,
Jingbo Zhu
Abstract:
In this study, we reveal an in-context learning (ICL) capability of multilingual large language models (LLMs): by translating the input into several languages, we provide Parallel Input in Multiple Languages (PiM) to LLMs, which significantly enhances their comprehension abilities. To test this capability, we design extensive experiments encompassing 8 typical datasets, 7 languages, and 8 state-of-the-art multilingual LLMs. Experimental results show that (1) incorporating more languages helps PiM surpass conventional ICL further; and (2) even combining translations that are inferior to the baseline performance can also help. Moreover, by examining the activated neurons in LLMs, we discover a counterintuitive but interesting phenomenon. Contrary to the common assumption that PiM would activate more neurons than monolingual input to leverage knowledge learned from diverse languages, PiM actually inhibits neurons and promotes more precise neuron activation, especially when more languages are added. This phenomenon aligns with the neuroscience insight about synaptic pruning, which removes less-used neural connections, strengthens the remaining ones, and thereby enhances brain intelligence.
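A PiM-style input, as described above, can be assembled as in the sketch below. The bracketed language tags, the layout, and the function name `build_pim_prompt` are hypothetical: the abstract does not specify the prompt template, and the translations are assumed to be produced beforehand, for example by a machine translation system.

```python
from typing import Dict


def build_pim_prompt(source_text: str, translations: Dict[str, str], task: str) -> str:
    """Assemble a hypothetical PiM-style prompt: the original input, its
    translations into other languages (one line each), then the task line."""
    lines = [f"[original] {source_text}"]
    for lang, text in translations.items():  # dicts keep insertion order (Python 3.7+)
        lines.append(f"[{lang}] {text}")
    lines.append(task)
    return "\n".join(lines)


prompt = build_pim_prompt(
    "The cat sat on the mat.",
    {"de": "Die Katze saß auf der Matte.", "fr": "Le chat était assis sur le tapis."},
    "Translate the sentence into Chinese:",
)
print(prompt)
```

Adding a language in this sketch only means adding one more entry to `translations`, which mirrors the abstract's experiments that vary how many parallel languages accompany the input.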
Submitted 13 March, 2024;
originally announced March 2024.
-
Hyperelasticity of Blood Clots: Bridging the Gap between Microscopic and Continuum Scales
Authors:
Nicholas Filla,
Beikang Gu,
Jixin Hou,
Kenan Song,
He Li,
Ning Liu,
Xianqiao Wang
Abstract:
The biomechanical properties of blood clots, which are dictated by their compositions and microstructures, play a critical role in determining their fate (occlusion, persistence, or embolization) in the human circulatory system. While numerous constitutive models have emerged to describe the biomechanics of blood clots, the majority of these models have focused primarily on the macroscopic deformation of the clots and the resulting stress-strain relationships, without depicting the microscopic contributions of their structural components, such as fibrin fibers, the fibrin network, and red blood cells. This work addresses this gap by quantifying how changes in the microstructure of blood clots affect their mechanical responses under different external stresses. We leverage our previously published work to develop a hyperelastic potential model for blood clots, which incorporates six distinct strain-energy components to describe the alignment of fibers, the entropic and enthalpic stretching of fibrin fibers, the buckling of these fibers, clot densification, and clot jamming. These strain-energy components are represented by a combination of simple harmonic oscillators, one-sided harmonic potentials, and a Gaussian potential. The proposed model, which is $C^0$, $C^1$, and $C^2$ continuous with a total of 13 parameters, has been validated against three data sets (fibrin clots in tension, blood clots in compression, and blood clots in shear), demonstrating its robustness. Subsequent simulations of a microscopic blood clot model are performed to uncover mechanistic correlations for a majority of the hyperelastic potential's stiffness/strain parameters. Our results show that only one proposed term, concerning fiber buckling, needs further refinement, while the remaining five strain-energy terms appear to describe precisely what they were intended to.
Submitted 13 March, 2024;
originally announced March 2024.
-
Locational Scenario-based Pricing in a Bilateral Distribution Energy Market under Uncertainty
Authors:
Hien Thanh Doan,
Minsoo Kim,
Keunju Song,
Hongseok Kim
Abstract:
In recent years, there has been a significant focus on advancing the next generation of power systems. Despite these efforts, persistent challenges remain in addressing the operational impact of uncertainty in predicted data, especially for economic dispatch and optimal power flow. To tackle these challenges, we introduce a stochastic day-ahead scheduling approach for a community. This method iteratively refines economic dispatch and optimal power flow, aiming to minimize operational costs by incorporating quantile forecasting. We then present a real-time market and payment problem to handle real-time decision-making and payment calculation. We assess the effectiveness of our proposed method against benchmark results and conduct a test using data from 50 real households to demonstrate its practicality. Furthermore, we compare our method with existing studies in the field across two different seasons of the year. In summer, our method reduces the optimality gap by 60% compared to the baseline, and in winter, by 67%. Moreover, our proposed method reduces distribution-network congestion caused by uncertain energy by 16.7% within a day, which is crucial for implementing energy markets in the real world.
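Quantile forecasts of the kind mentioned above are commonly trained and evaluated with the pinball (quantile) loss. The sketch below shows that loss as a generic illustration; it is not necessarily the objective used by the authors, whose forecasting model is not described in the abstract, and the household load values are hypothetical.

```python
from typing import Sequence


def pinball_loss(y_true: Sequence[float], y_pred: Sequence[float], tau: float) -> float:
    """Mean pinball (quantile) loss of forecasts y_pred for quantile level tau."""
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    total = 0.0
    for y, q in zip(y_true, y_pred):
        diff = y - q
        # Under-forecasts (diff > 0) cost tau per unit; over-forecasts cost (1 - tau).
        total += max(tau * diff, (tau - 1.0) * diff)
    return total / len(y_true)


# Hypothetical household load (kWh) and a 90th-percentile forecast.
under = pinball_loss([2.0], [1.0], tau=0.9)  # forecast 1 kWh too low
over = pinball_loss([2.0], [3.0], tau=0.9)   # forecast 1 kWh too high
print(f"under={under:.2f}, over={over:.2f}")  # under=0.90, over=0.10
```

At a high quantile such as $\tau = 0.9$, under-forecasting is penalized nine times more heavily than over-forecasting by the same amount, which is what pushes the forecast toward the upper tail of the distribution.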
Submitted 11 March, 2024;
originally announced March 2024.
-
New Directions for Thermoelectrics: A Roadmap from High-Throughput Materials Discovery to Advanced Device Manufacturing
Authors:
Kaidong Song,
A. N. M. Tanvir,
Md Omarsany Bappy,
Yanliang Zhang
Abstract:
Thermoelectric materials, which can convert waste heat into electricity or act as solid-state Peltier coolers, are emerging as key technologies to address global energy shortages and environmental sustainability. However, discovering materials with high thermoelectric conversion efficiency is a complex and slow process. The emerging field of high-throughput materials discovery has the potential to accelerate the development of new thermoelectric materials that combine high efficiency and low cost. The synergistic integration of high-throughput materials processing and characterization techniques with machine learning algorithms can form an efficient closed-loop process that generates and analyzes broad data sets to discover new thermoelectric materials with unprecedented performance. Meanwhile, the recent development of advanced manufacturing methods provides exciting opportunities for scalable, low-cost, and energy-efficient fabrication of thermoelectric devices. This review provides an overview of recent advances in discovering thermoelectric materials using high-throughput methods, including processing, characterization, and screening. Advanced manufacturing methods for thermoelectric devices are also introduced to realize the broad impact of thermoelectric materials in power generation and solid-state cooling. Finally, this paper discusses future research prospects and directions.
Submitted 9 March, 2024;
originally announced March 2024.
-
Can Large Language Models do Analytical Reasoning?
Authors:
Yebowen Hu,
Kaiqiang Song,
Sangwoo Cho,
Xiaoyang Wang,
Hassan Foroosh,
Dong Yu,
Fei Liu
Abstract:
This paper explores the analytical reasoning abilities of cutting-edge large language models (LLMs) on sports data. Our analytical reasoning tasks require LLMs to count how many points each team scores in a quarter of NBA and NFL games. Our major findings are twofold. First, among all the models we employed, GPT-4 stands out in effectiveness, followed by Claude-2.1, with GPT-3.5, Gemini-Pro, and Llama-2-70b lagging behind. Specifically, comparing three different prompting techniques with a divide-and-conquer approach, we find that the latter is the most effective. Our divide-and-conquer approach breaks play-by-play data down into smaller, more manageable segments, solves each piece individually, and then aggregates the results. Besides the divide-and-conquer approach, we also explore the Chain of Thought (CoT) strategy, which markedly improves outcomes for certain models, notably GPT-4 and Claude-2.1, significantly increasing their accuracy. However, the CoT strategy has negligible or even detrimental effects on the performance of other models such as GPT-3.5 and Gemini-Pro. Second, to our surprise, we observe that most models, including GPT-4, struggle to accurately count the total scores for NBA quarters despite showing strong performance in counting NFL quarter scores. This leads us to further investigate the factors that affect the complexity of analytical reasoning tasks through extensive experiments, from which we conclude that task complexity depends on the length of the context, the information density, and the presence of related information. Our research provides valuable insights into the complexity of analytical reasoning tasks and potential directions for developing future large language models.
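The divide-and-conquer procedure described above can be sketched as follows. In the paper the per-segment step is performed by an LLM reading play-by-play text; here a deterministic function stands in for it, and the `(team, points)` play format, the segment size, the team names, and all function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Play = Tuple[str, int]  # hypothetical format: (team, points scored on this play)


def split_into_segments(plays: List[Play], size: int) -> List[List[Play]]:
    """Divide: break the play-by-play sequence into consecutive segments."""
    if size <= 0:
        raise ValueError("segment size must be positive")
    return [plays[i:i + size] for i in range(0, len(plays), size)]


def score_segment(segment: List[Play]) -> Counter:
    """Conquer: points per team within one segment.

    Deterministic stand-in for the step the paper delegates to an LLM.
    """
    totals: Counter = Counter()
    for team, points in segment:
        totals[team] += points
    return totals


def quarter_score(plays: List[Play], size: int = 3) -> Dict[str, int]:
    """Combine: sum the per-segment totals into the quarter score."""
    combined: Counter = Counter()
    for segment in split_into_segments(plays, size):
        combined.update(score_segment(segment))  # Counter.update adds counts
    return dict(combined)


plays = [("Home", 2), ("Away", 3), ("Home", 1), ("Away", 2),
         ("Home", 3), ("Away", 0), ("Home", 2)]
print(quarter_score(plays))  # {'Home': 8, 'Away': 5}
```

Because each segment is short, the per-segment count is an easier subproblem, and the final answer is only a sum of partial totals; the result does not depend on the segment size, which is what makes the decomposition safe.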
Submitted 6 March, 2024;
originally announced March 2024.
-
NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models
Authors:
Zeqian Ju,
Yuancheng Wang,
Kai Shen,
Xu Tan,
Detai Xin,
Dongchao Yang,
Yanqing Liu,
Yichong Leng,
Kaitao Song,
Siliang Tang,
Zhizheng Wu,
Tao Qin,
Xiang-Yang Li,
Wei Ye,
Shikun Zhang,
Jiang Bian,
Lei He,
Jinyu Li,
Sheng Zhao
Abstract:
While recent large-scale text-to-speech (TTS) models have achieved significant progress, they still fall short in speech quality, similarity, and prosody. Since speech encompasses various attributes (e.g., content, prosody, timbre, and acoustic details) that pose significant challenges for generation, a natural idea is to factorize speech into individual subspaces representing different attributes and generate them individually. Motivated by this, we propose NaturalSpeech 3, a TTS system with novel factorized diffusion models to generate natural speech in a zero-shot way. Specifically, 1) we design a neural codec with factorized vector quantization (FVQ) to disentangle the speech waveform into subspaces of content, prosody, timbre, and acoustic details; 2) we propose a factorized diffusion model to generate attributes in each subspace following its corresponding prompt. With this factorization design, NaturalSpeech 3 can effectively and efficiently model intricate speech with disentangled subspaces in a divide-and-conquer way. Experiments show that NaturalSpeech 3 outperforms the state-of-the-art TTS systems in quality, similarity, prosody, and intelligibility, and achieves quality on par with human recordings. Furthermore, we achieve better performance by scaling to 1B parameters and 200K hours of training data.
Submitted 23 April, 2024; v1 submitted 5 March, 2024;
originally announced March 2024.
-
Mitigating the Linguistic Gap with Phonemic Representations for Robust Multilingual Language Understanding
Authors:
Haeji Jung,
Changdae Oh,
Jooeon Kang,
Jimin Sohn,
Kyungwoo Song,
Jinkyu Kim,
David R. Mortensen
Abstract:
Approaches to improving multilingual language understanding often require multiple languages during the training phase, rely on complicated training techniques, and -- importantly -- struggle with significant performance gaps between high-resource and low-resource languages. We hypothesize that the performance gaps between languages are affected by linguistic gaps between those languages and provide a novel solution for robust multilingual language modeling by employing phonemic representations (specifically, using phonemes as input tokens to LMs rather than subwords). We present quantitative evidence from three cross-lingual tasks that demonstrate the effectiveness of phonemic representation, which is further justified by a theoretical analysis of the cross-lingual performance gap.
Submitted 21 February, 2024;
originally announced February 2024.
-
SportsMetrics: Blending Text and Numerical Data to Understand Information Fusion in LLMs
Authors:
Yebowen Hu,
Kaiqiang Song,
Sangwoo Cho,
Xiaoyang Wang,
Hassan Foroosh,
Dong Yu,
Fei Liu
Abstract:
Large language models hold significant potential for integrating various data types, such as text documents and database records, for advanced analytics. However, blending text and numerical data presents substantial challenges. LLMs need to process and cross-reference entities and numbers, handle data inconsistencies and redundancies, and develop planning capabilities such as building a working memory for managing complex data queries. In this paper, we introduce four novel tasks centered around sports data analytics to evaluate the numerical reasoning and information fusion capabilities of LLMs. These tasks involve providing LLMs with detailed, play-by-play sports game descriptions, then challenging them with adversarial scenarios such as new game rules, longer durations, scrambled narratives, and analyzing key statistics in game summaries. We conduct extensive experiments on NBA and NFL games to assess the performance of LLMs on these tasks. Our benchmark, SportsMetrics, introduces a new mechanism for assessing LLMs' numerical reasoning and fusion skills.
Submitted 16 June, 2024; v1 submitted 15 February, 2024;
originally announced February 2024.
-
OH-Formation Following Vibrationally Induced Reaction Dynamics of H$_2$COO
Authors:
Kaisheng Song,
Meenu Upadhyay,
Markus Meuwly
Abstract:
The reaction dynamics of H$_2$COO to form linear HCOOH and dioxirane as first steps toward OH-elimination are quantitatively investigated. Using a machine-learned potential energy surface at the CASPT2/aug-cc-pVTZ level of theory, vibrational excitation along the CH-normal mode $\nu_{\rm CH}$ with energies up to 40.0 kcal/mol ($\sim 5 \nu_{\rm CH}$) leads almost exclusively to linear HCOOH, which further decomposes into OH+HCO. Although the barrier to form dioxirane is only 21.4 kcal/mol, the reaction probability to form dioxirane is two orders of magnitude lower if the CH-stretch mode is excited. The dioxirane-formation pathway becomes facile, however, if the COO-bend vibration is additionally excited with energies equivalent to $\sim (2 \nu_{\rm CH} + 4 \nu_{\rm COO})$ or $\sim (3 \nu_{\rm CH} + \nu_{\rm COO})$. For OH-formation in the atmosphere, the pathway through linear HCOOH is probably the most relevant, because the alternative pathways (through dioxirane or formic acid) involve several intermediates that can de-excite through collisions, relax {\it via} intramolecular vibrational energy redistribution (IVR), or pass through very loose and vulnerable transition states (formic acid). This work demonstrates how, by selectively exciting particular vibrational modes, it is possible to dial into desired reaction channels with a high degree of specificity for a process relevant to atmospheric chemistry.
Submitted 15 February, 2024;
originally announced February 2024.