Search | arXiv e-print repository

VGG-Tex: A Vivid Geometry-Guided Facial Texture Estimation Model for High Fidelity Monocular 3D Face Reconstruction

Authors: Haoyu Wu, Ziqiao Peng, Xukun Zhou, Yunfei Cheng, Jun He, Hongyan Liu, Zhaoxin Fan

Abstract: 3D face reconstruction from monocular images has promoted the development of various applications such as augmented reality. Though existing methods have made remarkable progress, most of them emphasize geometric reconstruction, while overlooking the importance of texture prediction. To address this issue, we propose VGG-Tex, a novel Vivid Geometry-Guided Facial Texture Estimation model designed for High Fidelity Monocular 3D Face Reconstruction. The core of this approach is leveraging 3D parametric priors to enhance the outcomes of 2D UV texture estimation. Specifically, VGG-Tex includes a Facial Attributes Encoding Module, a Geometry-Guided Texture Generator, and a Visibility-Enhanced Texture Completion Module. These components are responsible for extracting parametric priors, generating initial textures, and refining texture details, respectively. Based on the geometry-texture complementarity principle, VGG-Tex also introduces a Texture-guided Geometry Refinement Module to further balance the overall fidelity of the reconstructed 3D faces, along with corresponding losses. Comprehensive experiments demonstrate that our method significantly improves texture reconstruction performance compared to existing state-of-the-art methods.

Submitted 15 September, 2024; originally announced September 2024.

arXiv:2409.09682 [pdf]

A Robust Probability-based Joint Registration Method of Multiple Point Clouds Considering Local Consistency

Authors: Lingjie Su, Wei Xu, Shuyang Zhao, Yuqi Cheng, Wenlong Li

Abstract: In robotic inspection, joint registration of multiple point clouds is an essential technique for estimating the transformation relationships between measured parts, such as multiple blades in a propeller. However, the presence of noise and outliers in the data can significantly impair the registration performance by affecting the correctness of correspondences. To address this issue, we incorporate a local consistency property into the probability-based joint registration method. Specifically, each measured point set is treated as a sample from an unknown Gaussian Mixture Model (GMM), and the registration problem is framed as estimating the probability model. By incorporating local consistency into the optimization process, we enhance the robustness and accuracy of the posterior distributions, which represent the one-to-all correspondences that directly determine the registration results. Effective closed-form solutions for the transformation and probability parameters are derived with the Expectation-Maximization (EM) algorithm. Extensive experiments demonstrate that our method outperforms the existing methods, achieving high accuracy and robustness in the presence of noise and outliers. The code will be available at https://github.com/sulingjie/JPRLC_registration.

Submitted 15 September, 2024; originally announced September 2024.

Comments: Submitted to ICRA 2025

arXiv:2409.08898 [pdf, other]

Kraus is King: High-order Completely Positive and Trace Preserving (CPTP) Low Rank Method for the Lindblad Master Equation

Authors: Daniel Appelo, Yingda Cheng

Abstract: We design high order accurate methods that exploit low rank structure in the density matrix while respecting the essential structure of the Lindblad equation. Our methods preserve complete positivity and are trace preserving.

Submitted 13 September, 2024; originally announced September 2024.

arXiv:2409.08872 [pdf, other]

Exploring the Impact of Data Quantity on ASR in Extremely Low-resource Languages

Authors: Yao-Fei Cheng, Li-Wei Chen, Hung-Shin Lee, Hsin-Min Wang

Abstract: This study investigates the efficacy of data augmentation techniques for low-resource automatic speech recognition (ASR), focusing on two endangered Austronesian languages, Amis and Seediq. Recognizing the potential of self-supervised learning (SSL) in low-resource settings, we explore the impact of data volume on the continued pre-training of SSL models. We propose a novel data-selection scheme leveraging a multilingual corpus to augment the limited target language data. This scheme utilizes a language classifier to extract utterance embeddings and employs one-class classifiers to identify utterances phonetically and phonologically proximate to the target languages. Utterances are ranked and selected based on their decision scores, ensuring the inclusion of highly relevant data in the SSL-ASR pipeline. Our experimental results demonstrate the effectiveness of this approach, yielding substantial improvements in ASR performance for both Amis and Seediq. These findings underscore the feasibility and promise of data augmentation through cross-lingual transfer learning for low-resource language ASR.

Submitted 13 September, 2024; originally announced September 2024.

arXiv:2409.07731 [pdf, other]

Group delay controlled by the decoherence of a single artificial atom

Authors: Y. -T. Cheng, K. -M. Hsieh, B. -Y. Wu, Z. Q. Niu, F. Aziz, Y. -H. Huang, P. Y. Wen, K. -T. Lin, Y. -H. Lin, J. C. Chen, A. F. Kockum, G. -D. Lin, Z. -R. Lin, Y. Lu, I. -C. Hoi

Abstract: The ability to slow down light at the single-photon level has applications in quantum information processing and other quantum technologies. We demonstrate two methods, both using just a single artificial atom, enabling dynamic control over microwave light velocities in waveguide quantum electrodynamics (waveguide QED). Our methods are based on two distinct mechanisms harnessing the balance between radiative and non-radiative decay rates of a superconducting artificial atom in front of a mirror. In the first method, we tune the radiative decay of the atom using interference effects due to the mirror; in the second method, we pump the atom to control its non-radiative decay through the Autler–Townes effect. When half the radiative decay rate exceeds the non-radiative decay rate, we observe positive group delay; conversely, dominance of the non-radiative decay rate results in negative group delay. Our results advance signal-processing capabilities in waveguide QED.

Submitted 11 September, 2024; originally announced September 2024.

arXiv:2409.06772 [pdf, other]

Broad-Line AGN at $3.5<z<6$: The Black Hole Mass Function and a Connection with Little Red Dots

Authors: Anthony J. Taylor, Steven L. Finkelstein, Dale D. Kocevski, Junehyoung Jeon, Volker Bromm, Ricardo O. Amorin, Pablo Arrabal Haro, Bren E. Backhaus, Micaela B. Bagley, Eduardo Bañados, Rachana Bhatawdekar, Madisyn Brooks, Antonello Calabro, Oscar A. Chavez Ortiz, Yingjie Cheng, Nikko J. Cleri, Justin W. Cole, Kelcey Davis, Mark Dickinson, Callum Donnan, James S. Dunlop, Richard S. Ellis, Vital Fernandez, Adriano Fontana, Seiji Fujimoto , et al. (26 additional authors not shown)

Abstract: We present a sample of 50 H-alpha detected broad-line active galactic nuclei (BLAGN) at redshifts 3.5<z<6.8 using data from the CEERS and RUBIES surveys. We select these sources directly from JWST/NIRSpec G395M/F290LP spectra. We use a multi-step pre-selection and a Bayesian fitting procedure to ensure a high-quality sample of sources with broad Balmer lines and narrow forbidden lines. We compute rest-frame ultraviolet and optical spectral slopes for these objects, and determine that 10 BLAGN in our sample are also little red dots (LRDs). These LRD BLAGN, when examined in aggregate, show broader H-alpha line profiles and a higher fraction of broad-to-narrow component H-alpha emission than non-LRD BLAGN. Moreover, we find that ~66% of these objects are intrinsically reddened (beta (optical)>0), independent of the contributions of emission lines to the broadband photometry. We construct the black hole (BH) mass function at 3.5<z<6 after computing robust observational and line detection completeness corrections. This BH mass function shows broad agreement with both recent JWST/NIRSpec and JWST/NIRCam WFSS based BH mass functions, though we extend these earlier results to log(M(BH)/M(sun)) < 7. The derived BH mass function is consistent with a variety of theoretical models, indicating that the observed abundance of black holes in the early universe is not discrepant with physically-motivated predictions. The BH mass function shape resembles a largely featureless power-law, suggesting that any signature from black-hole seeding has been lost by redshift z~5-6. Finally, we compute the BLAGN UV luminosity function and find good agreement with JWST-detected BLAGN samples from recent works, finding that BLAGN hosts constitute <10% of the total observed UV luminosity at all but the brightest luminosities.

Submitted 10 September, 2024; originally announced September 2024.

Comments: 28 pages, 14 figures, 4 tables. Submitted to ApJ

arXiv:2409.06216 [pdf, other]

SubRegWeigh: Effective and Efficient Annotation Weighing with Subword Regularization

Authors: Kohei Tsuji, Tatsuya Hiraoka, Yuchang Cheng, Tomoya Iwakura

Abstract: Many natural language processing (NLP) datasets include annotation errors. Researchers have attempted to develop methods to reduce the adverse effect of errors in datasets automatically. However, an existing method is time-consuming because it requires many trained models to detect errors. We propose a novel method to reduce the time of error detection. Specifically, we use a tokenization technique called subword regularization to create pseudo-multiple models which are used to detect errors. Our proposed method, SubRegWeigh, can perform annotation weighting four to five times faster than the existing method. Additionally, SubRegWeigh improved performance in both document classification and named entity recognition tasks. In experiments with pseudo-incorrect labels, the pseudo-incorrect labels were adequately detected.

Submitted 10 September, 2024; originally announced September 2024.

Comments: 14 pages, 1 figure, 10 tables

arXiv:2409.06100 [pdf, other]

The Abundance and Properties of Barred Galaxies out to $z \sim$ 4 Using $\textit{JWST}$ CEERS Data

Authors: Yuchen Guo, Shardha Jogee, Eden Wise, Keith Pritchett Jr., Elizabeth J. McGrath, Steven L. Finkelstein, Kartheik G. Iyer, Pablo Arrabal Haro, Micaela B. Bagley, Mark Dickinson, Jeyhan S. Kartaltepe, Anton M. Koekemoer, Casey Papovich, Nor Pirzkal, L. Y. Aaron Yung, Bren E. Backhaus, Eric F. Bell, Rachana Bhatawdekar, Yingjie Cheng, Luca Costantin, Alexander de la Vega, Mauro Giavalisco, Nimish P. Hathi, Benne W. Holwerda, Peter Kurczynski , et al. (4 additional authors not shown)

Abstract: We analyze $\textit{JWST}$ CEERS NIRCam images to present the first estimate of the observed fraction and properties of bars out to $z \sim 4$. We analyze a sample of 1770 galaxies with stellar mass $M_\star > 10^{10} M_\odot$ at $0.5 \leq z \leq 4$ and identify barred galaxies via ellipse fits and visual classification of both F200W and F444W images. Our results apply mainly to bars with projected semi-major axis $a_{\rm bar} > 1.5$ kpc ($\sim 2 \times$ PSF in F200W images) that can be robustly traced by ellipse fits. For such bars, the observed bar fraction at $z \sim$ 2-4 is low ($\lesssim 10\%$), and they appear to be emerging at least as early as $z \sim 4$ when the Universe was $\sim$ 13% of its present age. At $z \sim$ 2-4, compared to our results, TNG50 simulations predict a significantly larger bar fraction due to a large population of small bars with $a_{\rm bar} < 1.5$ kpc that we cannot robustly detect. If such a population exists, the true bar fraction may be significantly higher than our results. At $z \ge 1.5$, many barred galaxies show nearby neighbors, suggesting bars may be tidally triggered. From $z \sim 4$ to $z \sim 0.5$, the observed bar fraction, average projected bar length, and projected bar strength rise. Our results highlight the early emergence and evolution of barred galaxies and the rising importance of bar-driven secular evolution from $z \sim 4$ to today.

Submitted 9 September, 2024; originally announced September 2024.

Comments: 25 pages, 16 figures, submitted to ApJ, Comments are welcome

arXiv:2409.05576 [pdf, other]

JavaVFC: Java Vulnerability Fixing Commits from Open-source Software

Authors: Tan Bui, Yan Naing Tun, Yiran Cheng, Ivana Clairine Irsan, Ting Zhang, Hong Jin Kang

Abstract: We present a comprehensive dataset of Java vulnerability-fixing commits (VFCs) to advance research in Java vulnerability analysis. Our dataset, derived from thousands of open-source Java projects on GitHub, comprises two variants: JavaVFC and JavaVFC-extended. The dataset was constructed through a rigorous process involving heuristic rules and multiple rounds of manual labeling. We initially used keywords to filter candidate VFCs based on commit messages, then refined this keyword set through iterative manual labeling. The final labeling round achieved a precision score of 0.7 among three annotators. We applied the refined keyword set to 34,321 open-source Java repositories with over 50 GitHub stars, resulting in JavaVFC with 784 manually verified VFCs and JavaVFC-extended with 16,837 automatically identified VFCs. Both variants are presented in a standardized JSONL format for easy access and analysis. This dataset supports various research endeavors, including VFC identification, fine-grained vulnerability detection, and automated vulnerability repair. JavaVFC and JavaVFC-extended are publicly available at https://zenodo.org/records/13731781.

Submitted 9 September, 2024; originally announced September 2024.

arXiv:2409.04774 [pdf, other]

Untie the Knots: An Efficient Data Augmentation Strategy for Long-Context Pre-Training in Language Models

Authors: Junfeng Tian, Da Zheng, Yang Cheng, Rui Wang, Colin Zhang, Debing Zhang

Abstract: Large language models (LLMs) have prioritized expanding the context window from which models can incorporate more information. However, training models to handle long contexts presents significant challenges. These include the scarcity of high-quality natural long-context data, the potential for performance degradation on short-context tasks, and the reduced training efficiency associated with attention mechanisms. In this paper, we introduce Untie the Knots (UtK), a novel data augmentation strategy employed during the continued pre-training phase, designed to efficiently enable LLMs to gain long-context capabilities without the need to modify the existing data mixture. In particular, we chunk the documents, shuffle the chunks, and create a complex and knotted structure of long texts; LLMs are then trained to untie these knots and identify relevant segments within seemingly chaotic token sequences. This approach greatly improves the model's performance by accurately attending to relevant information in long contexts, and the training efficiency is also largely increased. We conduct extensive experiments on models with 7B and 72B parameters, trained on 20 billion tokens, demonstrating that UtK achieves 75% and 84.5% accuracy on RULER at 128K context length, significantly outperforming other long-context strategies. The trained models will be open-sourced for further research.

Submitted 7 September, 2024; originally announced September 2024.
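
Note: the chunk-and-shuffle step described in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation. The function names (`chunk_tokens`, `knot_documents`, `untie`), the fixed chunk size, and the pooling of chunks from all documents into one shuffled sequence are assumptions made for illustration only; the abstract does not specify these details.

```python
import random
from typing import List, Tuple

# Hypothetical record: (document id, chunk position within document, tokens).
Chunk = Tuple[int, int, List[str]]


def chunk_tokens(tokens: List[str], chunk_size: int) -> List[List[str]]:
    """Split one document's token list into consecutive fixed-size chunks."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def knot_documents(docs: List[List[str]], chunk_size: int, seed: int = 0) -> List[Chunk]:
    """Chunk every document, then shuffle all chunks into one 'knotted' sequence.

    Assumption: chunks from all documents are pooled and shuffled uniformly.
    """
    rng = random.Random(seed)  # seeded so the shuffle is reproducible
    chunks: List[Chunk] = []
    for doc_id, tokens in enumerate(docs):
        for pos, chunk in enumerate(chunk_tokens(tokens, chunk_size)):
            chunks.append((doc_id, pos, chunk))
    rng.shuffle(chunks)
    return chunks


def untie(chunks: List[Chunk], num_docs: int) -> List[List[str]]:
    """Reference 'untying': restore each document by sorting on (doc id, position)."""
    docs: List[List[str]] = [[] for _ in range(num_docs)]
    for doc_id, _pos, chunk in sorted(chunks, key=lambda c: (c[0], c[1])):
        docs[doc_id].extend(chunk)
    return docs
```

In this sketch the `(doc_id, pos)` labels play the role of the ground truth that a model would have to recover implicitly when attending across the shuffled sequence; how the actual training objective is formed is not stated in the abstract.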

arXiv:2409.04761 [pdf]

Transformer Based Tissue Classification in Robotic Needle Biopsy

Authors: Fanxin Wang, Yikun Cheng, Sudipta S Mukherjee, Rohit Bhargava, Thenkurussi Kesavadas

Abstract: Image-guided minimally invasive robotic surgery is commonly employed for tasks such as needle biopsies or localized therapies. However, the nonlinear deformation of various tissue types presents difficulties for surgeons in achieving precise needle tip placement, particularly when relying on low-fidelity biopsy imaging systems. In this paper, we introduce a method to classify needle biopsy interventions and identify tissue types based on a comprehensive needle-tissue contact model that incorporates both position and force parameters. We trained a transformer model using a comprehensive dataset collected from a formerly developed robotics platform, which consists of synthetic and porcine tissue from various locations (liver, kidney, heart, belly, hock) marked with interaction phases (pre-puncture, puncture, post-puncture, neutral). This model achieves a significant classification accuracy of 0.93. Our demonstrated method can assist surgeons in identifying transitions to different tissues, aiding surgeons with tissue awareness.

Submitted 7 September, 2024; originally announced September 2024.

Comments: 8 pages

Journal ref: IEEE SMC 2024

arXiv:2409.03381 [pdf, other]

CogniDual Framework: Self-Training Large Language Models within a Dual-System Theoretical Framework for Improving Cognitive Tasks

Authors: Yongxin Deng, Xihe Qiu, Xiaoyu Tan, Chao Qu, Jing Pan, Yuan Cheng, Yinghui Xu, Wei Chu

Abstract: Cognitive psychology investigates perception, attention, memory, language, problem-solving, decision-making, and reasoning. Kahneman's dual-system theory elucidates the human decision-making process, distinguishing between the rapid, intuitive System 1 and the deliberative, rational System 2. Recent advancements have positioned large language models (LLMs) as formidable tools nearing human-level proficiency in various cognitive tasks. Nonetheless, the presence of a dual-system framework analogous to human cognition in LLMs remains unexplored. This study introduces the CogniDual Framework for LLMs (CFLLMs), designed to assess whether LLMs can, through self-training, evolve from deliberate deduction to intuitive responses, thereby emulating the human process of acquiring and mastering new information. Our findings reveal the cognitive mechanisms behind LLMs' response generation, enhancing our understanding of their capabilities in cognitive psychology. Practically, self-trained models can provide faster responses to certain queries, reducing computational demands during inference.

Submitted 6 September, 2024; v1 submitted 5 September, 2024; originally announced September 2024.

arXiv:2409.01526 [pdf, other]

Directional sources realised by toroidal dipoles

Authors: Junho Jung, Yuqiong Cheng, Wanyue Xiao, Shubo Wang

Abstract: Directional optical sources can give rise to the directional excitation and propagation of light. The directionality of the conventional directional dipole (CDD) sources is attributed to the interference of the electric and/or magnetic dipoles, while the effect of the toroidal dipole on optical directionality remains unexplored. Here, we numerically and analytically investigate the directional properties of the toroidal dipole. We show that the toroidal dipole can replace the electric dipole in the CDD sources to form the pseudo directional dipoles (PDDs), which can be applied to achieve analogous near-field directional coupling with a silicon waveguide. Moreover, the directionality of the PDDs can be flexibly controlled by changing the geometric parameters of the toroidal dipole, leading to tunable asymmetric coupling between the sources and the waveguide. These new types of directional sources provide more degrees of freedom for tailoring the optical directionality compared to the conventional sources. The results open new possibilities for directional light manipulation and can find applications in on-chip optical routing, waveguiding, and nanophotonic communications.

Submitted 2 September, 2024; originally announced September 2024.

Comments: 21 pages, 6 figures

arXiv:2409.01216 [pdf, other]

doi 10.24963/ijcai.2024/131

ESP-PCT: Enhanced VR Semantic Performance through Efficient Compression of Temporal and Spatial Redundancies in Point Cloud Transformers

Authors: Luoyu Mei, Shuai Wang, Yun Cheng, Ruofeng Liu, Zhimeng Yin, Wenchao Jiang, Shuai Wang, Wei Gong

Abstract: Semantic recognition is pivotal in virtual reality (VR) applications, enabling immersive and interactive experiences. A promising approach is utilizing millimeter-wave (mmWave) signals to generate point clouds. However, the high computational and memory demands of current mmWave point cloud models hinder their efficiency and reliability. To address this limitation, our paper introduces ESP-PCT, a novel Enhanced Semantic Performance Point Cloud Transformer with a two-stage semantic recognition framework tailored for VR applications. ESP-PCT takes advantage of the accuracy of sensory point cloud data and optimizes the semantic recognition process, where the localization and focus stages are trained jointly in an end-to-end manner. We evaluate ESP-PCT on various VR semantic recognition conditions, demonstrating substantial enhancements in recognition efficiency. Notably, ESP-PCT achieves a remarkable accuracy of 93.2% while simultaneously reducing the computational requirements (FLOPs) by 76.9% and memory usage by 78.2% compared to the existing Point Transformer model. These results underscore ESP-PCT's potential in VR semantic recognition by achieving high accuracy and reducing redundancy. The code and data of this project are available at https://github.com/lymei-SEU/ESP-PCT.

Submitted 2 September, 2024; originally announced September 2024.

Journal ref: Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI 2024

arXiv:2408.17150 [pdf, other]

Look, Compare, Decide: Alleviating Hallucination in Large Vision-Language Models via Multi-View Multi-Path Reasoning

Authors: Xiaoye Qu, Jiashuo Sun, Wei Wei, Yu Cheng

Abstract: Recently, Large Vision-Language Models (LVLMs) have demonstrated impressive capabilities in multi-modal context comprehension. However, they still suffer from hallucination problems, that is, generating outputs inconsistent with the image content. To mitigate hallucinations, previous studies mainly focus on retraining LVLMs with custom datasets. Although effective, they inherently come with additional computational costs. In this paper, we propose a training-free framework, MVP, that aims to reduce hallucinations by making the most of the innate capabilities of the LVLMs via Multi-View Multi-Path Reasoning. Specifically, we first devise a multi-view information-seeking strategy to thoroughly perceive the comprehensive information in the image, which enriches the general global information captured by the original vision encoder in LVLMs. Furthermore, during the answer decoding, we observe that the occurrence of hallucinations has a strong correlation with the certainty of the answer tokens. Thus, we propose multi-path reasoning for each information view to quantify and aggregate the certainty scores for each potential answer among multiple decoding paths and finally decide the output answer. By fully grasping the information in the image and carefully considering the certainty of the potential answers when decoding, our MVP can effectively reduce hallucinations in LVLMs. The extensive experiments verify that our proposed MVP significantly mitigates the hallucination problem across four well-known LVLMs. The source code is available at: https://github.com/GasolSun36/MVP.

Submitted 30 August, 2024; originally announced August 2024.

Comments: 13 pages, 7 tables, 7 figures

arXiv:2408.16500 [pdf, other]

CogVLM2: Visual Language Models for Image and Video Understanding

Authors: Wenyi Hong, Weihan Wang, Ming Ding, Wenmeng Yu, Qingsong Lv, Yan Wang, Yean Cheng, Shiyu Huang, Junhui Ji, Zhao Xue, Lei Zhao, Zhuoyi Yang, Xiaotao Gu, Xiaohan Zhang, Guanyu Feng, Da Yin, Zihan Wang, Ji Qi, Xixuan Song, Peng Zhang, Debing Liu, Bin Xu, Juanzi Li, Yuxiao Dong, Jie Tang

Abstract: Beginning with VisualGLM and CogVLM, we are continuously exploring VLMs in pursuit of enhanced vision-language fusion, efficient higher-resolution architecture, and broader modalities and applications. Here we propose the CogVLM2 family, a new generation of visual language models for image and video understanding including CogVLM2, CogVLM2-Video and GLM-4V. As an image understanding model, CogVLM2 inherits the visual expert architecture with improved training recipes in both pre-training and post-training stages, supporting input resolution up to $1344 \times 1344$ pixels. As a video understanding model, CogVLM2-Video integrates multi-frame input with timestamps and proposes automated temporal grounding data construction. Notably, the CogVLM2 family has achieved state-of-the-art results on benchmarks like MMBench, MM-Vet, TextVQA, MVBench and VCGBench. All models are open-sourced at https://github.com/THUDM/CogVLM2 and https://github.com/THUDM/GLM-4, contributing to the advancement of the field.

Submitted 29 August, 2024; originally announced August 2024.

arXiv:2408.16030 [pdf]

A Deep Learning Approach to Localizing Multi-level Airway Collapse Based on Snoring Sounds

Authors: Ying-Chieh Hsu, Stanley Yung-Chuan Liu, Chao-Jung Huang, Chi-Wei Wu, Ren-Kai Cheng, Jane Yung-Jen Hsu, Shang-Ran Huang, Yuan-Ren Cheng, Fu-Shun Hsu

Abstract: This study investigates the application of machine/deep learning to classify snoring sounds excited at different levels of the upper airway in patients with obstructive sleep apnea (OSA) using data from drug-induced sleep endoscopy (DISE). The snoring sounds of 39 subjects were analyzed and labeled according to the Velum, Oropharynx, Tongue Base, and Epiglottis (VOTE) classification system. The dataset, comprising 5,173 one-second segments, was used to train and test models, including Support Vector Machine (SVM), Bidirectional Long Short-Term Memory (BiLSTM), and ResNet-50. The ResNet-50, a convolutional neural network (CNN), showed the best overall performance in classifying snoring acoustics, particularly in identifying multi-level obstructions. The study emphasizes the potential of integrating snoring acoustics with deep learning to improve the diagnosis and treatment of OSA. However, challenges such as limited sample size, data imbalance, and differences between pharmacologically induced and natural snoring sounds were noted, suggesting further research to enhance model accuracy and generalizability.

Submitted 28 August, 2024; originally announced August 2024.

arXiv:2408.15632 [pdf, other]

Structural Optimization of Lightweight Bipedal Robot via SERL

Authors: Yi Cheng, Chenxi Han, Yuheng Min, Linqi Ye, Houde Liu, Hang Liu

Abstract: Designing a bipedal robot is a complex and challenging task, especially when dealing with a multitude of structural parameters. Traditional design methods often rely on human intuition and experience. However, such approaches are time-consuming and labor-intensive, lack theoretical guidance, and struggle to reach optimal designs within vast design spaces, thus failing to fully exploit the inherent performance potential of robots. In this context, this paper introduces the SERL (Structure Evolution Reinforcement Learning) algorithm, which combines reinforcement learning for locomotion tasks with evolution algorithms. The aim is to identify the optimal parameter combinations within a given multidimensional design space. Using the SERL algorithm, we designed a bipedal robot named Wow Orin, whose optimal leg lengths are obtained through optimization based on body structure and motor torque. We have experimentally validated the effectiveness of the SERL algorithm, which is capable of finding the best structure within a specified design space and task conditions. Additionally, to assess the performance gap between our designed robot and the current state-of-the-art robots, we compared Wow Orin with the mainstream bipedal robots Cassie and Unitree H1. A series of experimental results demonstrate the outstanding energy efficiency and performance of Wow Orin, further validating the feasibility of applying the SERL algorithm to practical design.

Submitted 28 August, 2024; originally announced August 2024.

arXiv:2408.14757 [pdf, other]

Learning effective pruning at initialization from iterative pruning

Authors: Shengkai Liu, Yaofeng Cheng, Fusheng Zha, Wei Guo, Lining Sun, Zhenshan Bing, Chenguang Yang

Abstract: Pruning at initialization (PaI) reduces training costs by removing weights before training, which becomes increasingly crucial with the growing network size. However, current PaI methods still have a large accuracy gap with iterative pruning, especially at high sparsity levels. This raises an intriguing question: can we get inspiration from iterative pruning to improve the PaI performance? In the lottery ticket hypothesis, iterative rewind pruning (IRP) finds subnetworks retroactively by rewinding the parameters to the original initialization in every pruning iteration, which means all the subnetworks are based on the initial state. Here, we hypothesise that the surviving subnetworks are more important, and we link the initial features to their surviving scores as the PaI criterion. We employ an end-to-end neural network (AutoSparse, abbreviated AutoS) to learn this correlation: it takes the model's initial features as input, outputs their scores, and then the lowest-scoring parameters are pruned before training. To validate the accuracy and generalization of our method, we performed PaI across various models. Results show that our approach outperforms existing methods in high-sparsity settings. Notably, as the underlying logic of model pruning is consistent across different models, only a one-time IRP on one model is needed (e.g., after a single IRP run on ResNet-18/CIFAR-10, AutoS can be generalized to VGG-16/CIFAR-10, ResNet-18/TinyImageNet, etc.). As the first neural network-based PaI method, we conduct extensive experiments to validate the factors influencing this approach. These results reveal the learning tendencies of neural networks and provide new insights into our understanding and research of PaI from a practical perspective. Our code is available at: https://github.com/ChengYaofeng/AutoSparse.git.

Submitted 26 August, 2024; originally announced August 2024.

arXiv:2408.12352 [pdf, other]

GarmentAligner: Text-to-Garment Generation via Retrieval-augmented Multi-level Corrections

Authors: Shiyue Zhang, Zheng Chong, Xujie Zhang, Hanhui Li, Yuhao Cheng, Yiqiang Yan, Xiaodan Liang

Abstract: General text-to-image models bring revolutionary innovation to the fields of arts, design, and media. However, when applied to garment generation, even the state-of-the-art text-to-image models suffer from fine-grained semantic misalignment, particularly concerning the quantity, position, and interrelations of garment components. Addressing this, we propose GarmentAligner, a text-to-garment diffusion model trained with retrieval-augmented multi-level corrections. To achieve semantic alignment at the component level, we introduce an automatic component extraction pipeline to obtain spatial and quantitative information of garment components from corresponding images and captions. Subsequently, to exploit component relationships within the garment images, we construct retrieval subsets for each garment by retrieval augmentation based on component-level similarity ranking and conduct contrastive learning to enhance the model perception of components from positive and negative samples. To further enhance the alignment of components across semantic, spatial, and quantitative granularities, we propose the utilization of multi-level correction losses that leverage detailed component information. The experimental findings demonstrate that GarmentAligner achieves superior fidelity and fine-grained semantic alignment when compared to existing competitors.

Submitted 23 August, 2024; v1 submitted 22 August, 2024; originally announced August 2024.

Comments: Accepted by ECCV 2024

arXiv:2408.12076 [pdf, other]

ConflictBank: A Benchmark for Evaluating the Influence of Knowledge Conflicts in LLM

Authors: Zhaochen Su, Jun Zhang, Xiaoye Qu, Tong Zhu, Yanshu Li, Jiashuo Sun, Juntao Li, Min Zhang, Yu Cheng

Abstract: Large language models (LLMs) have achieved impressive advancements across numerous disciplines, yet the critical issue of knowledge conflicts, a major source of hallucinations, has rarely been studied. Only a few studies have explored the conflicts between the inherent knowledge of LLMs and the retrieved contextual knowledge. However, a thorough assessment of knowledge conflict in LLMs is still missing. Motivated by this research gap, we present ConflictBank, the first comprehensive benchmark developed to systematically evaluate knowledge conflicts from three aspects: (i) conflicts encountered in retrieved knowledge, (ii) conflicts within the models' encoded knowledge, and (iii) the interplay between these conflict forms. Our investigation delves into four model families and twelve LLM instances, meticulously analyzing conflicts stemming from misinformation, temporal discrepancies, and semantic divergences. Based on our proposed novel construction framework, we create 7,453,853 claim-evidence pairs and 553,117 QA pairs. We present numerous findings on model scale, conflict causes, and conflict types. We hope our ConflictBank benchmark will help the community better understand model behavior in conflicts and develop more reliable LLMs.

Submitted 21 August, 2024; originally announced August 2024.

Comments: Under Review

arXiv:2408.11826 [pdf, other]

Generative Organizational Behavior Simulation using Large Language Model based Autonomous Agents: A Holacracy Perspective

Authors: Chen Zhu, Yihang Cheng, Jingshuai Zhang, Yusheng Qiu, Sitao Xia, Hengshu Zhu

Abstract: In this paper, we present the technical details and periodic findings of our project, CareerAgent, which aims to build a generative simulation framework for a Holacracy organization using Large Language Model-based Autonomous Agents. Specifically, the simulation framework includes three phases: construction, execution, and evaluation, and it incorporates basic characteristics of individuals, organizations, tasks, and meetings. Through our simulation, we obtained several interesting findings. At the organizational level, an increase in the average values of management competence and functional competence can reduce members' overall stress levels, but it negatively impacts deeper organizational performance measures such as average task completion. At the individual level, both competences can improve members' work performance. From the analysis of social networks, we found that highly competent members selectively participate in certain tasks and take on more responsibilities. Over time, small sub-communities form around these highly competent members within the holacracy. These findings contribute theoretically to the study of organizational science and provide practical insights for managers to understand organizational dynamics.

Submitted 5 August, 2024; originally announced August 2024.

arXiv:2408.11405 [pdf, other]

DDSP Guitar Amp: Interpretable Guitar Amplifier Modeling

Authors: Yen-Tung Yeh, Yu-Hua Chen, Yuan-Chiao Cheng, Jui-Te Wu, Jun-Jie Fu, Yi-Fan Yeh, Yi-Hsuan Yang

Abstract: Neural network models for guitar amplifier emulation, while effective, often demand high computational cost and lack interpretability. Drawing ideas from physical amplifier design, this paper aims to address these issues with a new differentiable digital signal processing (DDSP)-based model, called "DDSP guitar amp," that models the four components of a guitar amp (i.e., preamp, tone stack, power amp, and output transformer) using specific DSP-inspired designs. With a set of time- and frequency-domain metrics, we demonstrate that DDSP guitar amp achieves performance comparable with that of black-box baselines while requiring less than 10% of the computational operations per audio sample, thereby holding greater potential for use in real-time applications.

Submitted 21 August, 2024; originally announced August 2024.

Comments: Preprint paper

arXiv:2408.11283 [pdf, other]

Inference Plans for Hybrid Particle Filtering

Authors: Ellie Y. Cheng, Eric Atkinson, Guillaume Baudart, Louis Mandel, Michael Carbin

Abstract: Advanced probabilistic programming languages (PPLs) use hybrid inference systems to combine symbolic exact inference and Monte Carlo methods to improve inference performance. These systems use heuristics to partition random variables within the program into variables that are encoded symbolically and variables that are encoded with sampled values, and the heuristics are not necessarily aligned with the performance evaluation metrics used by the developer. In this work, we present inference plans, a programming interface that enables developers to control the partitioning of random variables during hybrid particle filtering. We further present Siren, a new PPL that enables developers to use annotations to specify inference plans the inference system must implement. To assist developers with statically reasoning about whether an inference plan can be implemented, we present an abstract-interpretation-based static analysis for Siren for determining inference plan satisfiability. We prove the analysis is sound with respect to Siren's semantics. Our evaluation applies inference plans to three different hybrid particle filtering algorithms on a suite of benchmarks and shows that the control provided by inference plans enables speedups of 1.76x on average and up to 206x to reach target accuracy, compared to the inference plans implemented by default heuristics; the results also show that inference plans improve accuracy by 1.83x on average and up to 595x with equal or lower runtime, compared to the default inference plans. We further show that the static analysis is precise in practice, identifying all satisfiable inference plans in 27 out of the 33 benchmark-algorithm combinations.

Submitted 20 August, 2024; originally announced August 2024.

arXiv:2408.10513 [pdf, other]

Relativistic and Electron Correlation Effects in Static Dipole Polarizabilities for Main-Group Elements

Authors: YingXing Cheng

Abstract: In this study, I compute the static dipole polarizability of main-group elements using the finite-field method combined with relativistic coupled-cluster and configuration interaction simulations. The computational results closely align with the values recommended in the 2018 table of static dipole polarizabilities of neutral elements [Mol. Phys. 117, 1200 (2019)]. Additionally, I investigate the influence of relativistic effects and electron correlation on atomic dipole polarizabilities. Specifically, three types of relativistic effects impacting dipole polarizabilities are studied: scalar-relativistic, spin-orbit coupling, and fully relativistic Dirac-Coulomb effects. The results indicate that scalar-relativistic effects are predominant for atoms in Groups 1--2, with minimal influence from spin-orbit coupling effects. Conversely, for elements in Groups 13--18, scalar-relativistic effects are less significant, while spin-orbit coupling significantly affects elements starting from the fourth row in Groups 13--14 and from the fifth row in Groups 15--18. In each category of relativistic effects, the impact of electron correlation is evaluated. The results show that electron correlation significantly influences dipole polarizability calculations, particularly for Groups 1--2 and 13--14 atoms, but is less significant for Groups 15--18 atoms. This study provides a comprehensive and consistent dataset of dipole polarizabilities and contributes to a systematic understanding of the roles of relativistic and electron correlation effects in atomic dipole polarizabilities, serving as a valuable reference for future research.

Submitted 14 September, 2024; v1 submitted 19 August, 2024; originally announced August 2024.

arXiv:2408.10199 [pdf, other]

Magnetic Fields in Massive Star-forming Regions (MagMaR) IV: Tracing the Magnetic Fields in the O-type protostellar system IRAS 16547$-$4247

Authors: Luis A. Zapata, Manuel Fernández-López, Patricio Sanhueza, Josep M. Girart, Luis F. Rodríguez, Paulo Cortes, Koch Patrick, María T. Beltrán, Kate Pattle, Henrik Beuther, Piyali Saha, Wenyu Jiao, Fengwei Xu, Xing Walker Lu, Fernando Olguin, Shanghuo Li, Ian W. Stephens, Ji-hyun Kang, Yu Cheng, Spandan Choudhury, Kaho Morii, Eun Jung Chung, Jia-Wei Wang, Jihye Hwang, A-Ran Lyo, et al. (2 additional authors not shown)

Abstract: The formation of massive stars, and in particular the role that magnetic fields play in their early evolutionary phase, is still far from being completely understood. Here, we present Atacama Large Millimeter/Submillimeter Array (ALMA) 1.2 mm full polarized continuum, and H$^{13}$CO$^+$(3$-$2), CS(5$-$4), and HN$^{13}$C(3$-$2) line observations with a high angular resolution ($\sim$0.4$''$ or 1100 au). In the 1.2 mm continuum emission, we reveal a dusty envelope surrounding the massive protostars, IRAS 16547-E and IRAS 16547-W, with dimensions of $\sim$10,000 au. This envelope has a bi-conical structure likely carved by the powerful thermal radio jet present in the region. The magnetic field vectors closely follow the bi-conical envelope. The polarization fraction is $\sim$2.0% in this region. Some of these vectors seem to converge on IRAS 16547-E and IRAS 16547-W, the most massive protostars. Moreover, the velocity fields revealed by the spectral lines H$^{13}$CO$^+$(3$-$2) and HN$^{13}$C(3$-$2) show velocity gradients that correspond well with the magnetic fields, which may be tracing the cavities of molecular outflows or, in some parts, infall. We derive a magnetic field strength in some filamentary regions that ranges from 2 to 6.1 mG. We also find that the CS(5$-$4) molecular line emission reveals multiple outflow cavities or bow-shocks with different orientations, some of which seem to follow the NW-SE radio thermal jet.

Submitted 19 August, 2024; originally announced August 2024.

Comments: Accepted by the Astrophysical Journal, 13 pages

arXiv:2408.07965 [pdf, other]

Efficient simulation of inhomogeneously correlated systems using block interaction product states

Authors: Yifan Cheng, Zhaoxuan Xie, Xiaoyu Xie, Haibo Ma

Abstract: The strength of DMRG lies in its treatment of identical sites that are energetically degenerate and spatially similar. However, this becomes a drawback when applied to quantum chemistry calculations for large systems, as entangled orbitals often span broad ranges in energy and space, with notably inhomogeneous interactions. In this study, we propose addressing strong intra-fragment and weak inter-fragment correlations separately using a multi-configurational block interaction product state (BIPS) framework. The strong correlation is captured in electronic states on fragments, considering entanglement between fragments and their environments. This method has been tested in various chemical systems and shows high accuracy and efficiency in addressing inhomogeneous effects in quantum chemistry.

Submitted 15 August, 2024; originally announced August 2024.

arXiv:2408.07490 [pdf, other]

Attention-Guided Perturbation for Unsupervised Image Anomaly Detection

Authors: Tingfeng Huang, Yuxuan Cheng, Jingbo Xia, Rui Yu, Yuxuan Cai, Jinhai Xiang, Xinwei He, Xiang Bai

Abstract: Reconstruction-based methods have significantly advanced modern unsupervised anomaly detection. However, the strong capacity of neural networks often violates the underlying assumptions by reconstructing abnormal samples well. To alleviate this issue, we present a simple yet effective reconstruction framework named Attention-Guided Perturbation Network (AGPNet), which learns to add perturbation noise with an attention mask, for accurate unsupervised anomaly detection. Specifically, it consists of two branches, i.e., a plain reconstruction branch and an auxiliary attention-based perturbation branch. The reconstruction branch is simply a plain reconstruction network that learns to reconstruct normal samples, while the auxiliary branch aims to produce attention masks to guide the noise perturbation process for normal samples from easy to hard. By doing so, we expect to synthesize hard yet more informative anomalies for training, which enable the reconstruction branch to learn important inherent normal patterns both comprehensively and efficiently. Extensive experiments are conducted on three popular benchmarks covering MVTec-AD, VisA, and MVTec-3D, and show that our framework obtains leading anomaly detection performance under various setups including few-shot, one-class, and multi-class setups.

Submitted 14 August, 2024; originally announced August 2024.

arXiv:2408.07321 [pdf, other]

LLM-Enhanced Static Analysis for Precise Identification of Vulnerable OSS Versions

Authors: Yiran Cheng, Lwin Khin Shar, Ting Zhang, Shouguo Yang, Chaopeng Dong, David Lo, Shichao Lv, Zhiqiang Shi, Limin Sun

Abstract: Open-source software (OSS) has experienced a surge in popularity, attributed to its collaborative development model and cost-effective nature. However, the adoption of specific software versions in development projects may introduce security risks when these versions bring along vulnerabilities. Current methods of identifying vulnerable versions typically analyze and trace the code involved in vulnerability patches using static analysis with pre-defined rules. They then use syntactic-level code clone detection to identify the vulnerable versions. These methods are hindered by imprecision due to (1) the inclusion of vulnerability-irrelevant code in the analysis and (2) the inadequacy of syntactic-level code clone detection. This paper presents Vercation, an approach designed to identify vulnerable versions of OSS written in C/C++. Vercation combines program slicing with a Large Language Model (LLM) to identify vulnerability-relevant code from vulnerability patches. It then backtraces historical commits to gather previous modifications of the identified vulnerability-relevant code. We propose semantic-level code clone detection to compare the differences between pre-modification and post-modification code, thereby locating the vulnerability-introducing commit (vic) and enabling identification of the vulnerable versions between the patch commit and the vic. We curate a dataset linking 74 OSS vulnerabilities and 1013 versions to evaluate Vercation. On this dataset, our approach achieves an F1 score of 92.4%, outperforming current state-of-the-art methods. More importantly, Vercation detected 134 incorrect vulnerable OSS versions in NVD reports.

Submitted 14 August, 2024; originally announced August 2024.

arXiv:2408.07262 [pdf, other]

Ensemble architecture in polyp segmentation

Authors: Hao-Yun Hsu, Yi-Ching Cheng, Guan-Hua Huang

Abstract: In this research, we revisit the architecture of semantic segmentation and evaluate the models excelling in polyp segmentation. We introduce an integrated framework that harnesses the advantages of different models to attain an optimal outcome. More specifically, we fuse the learned features from convolutional and transformer models for prediction, and we view this approach as an ensemble technique to enhance model performance. Our experiments on polyp segmentation reveal that the proposed architecture surpasses other top models, exhibiting improved learning capacity and resilience. The code is available at https://github.com/HuangDLab/EnFormer.

Submitted 13 August, 2024; originally announced August 2024.

arXiv:2408.06072 [pdf, other]

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Authors: Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Xiaotao Gu, Yuxuan Zhang, Weihan Wang, Yean Cheng, Ting Liu, Bin Xu, Yuxiao Dong, Jie Tang

Abstract: We introduce CogVideoX, a large-scale diffusion transformer model designed for generating videos based on text prompts. To efficiently model video data, we propose to leverage a 3D Variational Autoencoder (VAE) to compress videos along both spatial and temporal dimensions. To improve the text-video alignment, we propose an expert transformer with the expert adaptive LayerNorm to facilitate the deep fusion between the two modalities. By employing a progressive training technique, CogVideoX is adept at producing coherent, long-duration videos characterized by significant motions. In addition, we develop an effective text-video data processing pipeline that includes various data preprocessing strategies and a video captioning method. It significantly helps enhance the performance of CogVideoX, improving both generation quality and semantic alignment. Results show that CogVideoX demonstrates state-of-the-art performance across both multiple machine metrics and human evaluations. The model weights of both the 3D Causal VAE and CogVideoX are publicly available at https://github.com/THUDM/CogVideo.

Submitted 12 August, 2024; originally announced August 2024.

arXiv:2408.05907 [pdf]

Cryogenic nonlinear conversion processes in periodically-poled thin-film lithium niobate waveguides

Authors: Yujie Cheng, Xiaoting Li, Lantian Feng, Haochuan Li, Wenzhao Sun, Xinyu Song, Yuyang Ding, Guangcan Guo, Cheng Wang, Xifeng Ren

Abstract: Periodically poled thin-film lithium niobate (TFLN) waveguides, which enable efficient quadratic nonlinear processes, serve as a crucial foundation for classical and quantum signal processing with photonic integrated circuits. To expand their application scope, we provide, to the best of our knowledge, the first investigation of nonlinear conversion processes in periodically poled TFLN waveguides under cryogenic conditions. Through systematic experimental characterization, we find that the periodically poled TFLN waveguide maintains consistent conversion efficiencies at both cryogenic and room temperatures for both classical second-harmonic generation and quantum photon-pair generation processes, demonstrating the significant potential of TFLN wavelength conversion devices for cryogenic applications. This breakthrough will foster future scalable quantum photonic systems and optical interfacing among different cryogenic platforms.

Submitted 11 August, 2024; originally announced August 2024.

arXiv:2408.04808 [pdf, other]

Scaling Deep Learning Computation over the Inter-Core Connected Intelligence Processor

Authors: Yiqi Liu, Yuqi Xue, Yu Cheng, Lingxiao Ma, Ziming Miao, Jilong Xue, Jian Huang

Abstract: As AI chips incorporate numerous parallelized cores to scale deep learning (DL) computing, inter-core communication has recently been enabled by employing high-bandwidth and low-latency interconnect links on the chip (e.g., Graphcore IPU). It allows each core to directly access the fast scratchpad memory in other cores, which enables new parallel computing paradigms. However, without proper support for the scalable inter-core connections in current DL compilers, it is hard for developers to exploit the benefits of this new architecture. We present T10, the first DL compiler to exploit the inter-core communication bandwidth and distributed on-chip memory on AI chips. To formulate the computation and communication patterns of tensor operators in this new architecture, T10 introduces a distributed tensor abstraction, rTensor. T10 maps a DNN model to execution plans with a generalized compute-shift pattern, by partitioning DNN computation into sub-operators and mapping them to cores, so that the cores can exchange data following predictable patterns. T10 makes globally optimized trade-offs between on-chip memory consumption and inter-core communication overhead, selects the best execution plan from a vast optimization space, and alleviates unnecessary inter-core communications. Our evaluation with a real inter-core connected AI chip, the Graphcore IPU, shows up to 3.3$\times$ performance improvement, and scalability support for larger models, compared to state-of-the-art DL compilers and vendor libraries.

Submitted 8 August, 2024; originally announced August 2024.

Comments: This paper is accepted at The 30th ACM Symposium on Operating Systems Principles (SOSP'24)

arXiv:2408.03847 [pdf, other]

GAIA -- A Large Language Model for Advanced Power Dispatch

Authors: Yuheng Cheng, Huan Zhao, Xiyuan Zhou, Junhua Zhao, Yuji Cao, Chao Yang

Abstract: Power dispatch is essential for providing stable, cost-effective, and eco-friendly electricity to society. However, traditional methods falter as power systems grow in scale and complexity, struggling with multitasking, swift problem-solving, and human-machine collaboration. This paper introduces GAIA, the pioneering Large Language Model (LLM) tailored for power dispatch tasks. We have developed a novel dataset construction technique that harnesses a range of data sources to fine-tune GAIA for optimal performance in this domain. This approach streamlines LLM training, allowing for the seamless integration of multidimensional data in power system management. Additionally, we have crafted specialized prompt strategies to boost GAIA's input-output efficiency in dispatch scenarios. When evaluated on the ElecBench benchmark, GAIA surpasses the baseline model LLaMA2 on multiple metrics. In practical applications, GAIA has demonstrated its ability to enhance decision-making processes, improve operational efficiency, and facilitate better human-machine interactions in power dispatch operations. This paper expands the application of LLMs to power dispatch and validates their practical utility, paving the way for future innovations in this field.

Submitted 7 August, 2024; originally announced August 2024.

arXiv:2408.03220 [pdf, other]

Masked Random Noise for Communication Efficient Federated Learning

Authors: Shiwei Li, Yingyi Cheng, Haozhao Wang, Xing Tang, Shijie Xu, Weihong Luo, Yuhua Li, Dugang Liu, Xiuqiang He, and Ruixuan Li

Abstract: Federated learning is a promising distributed training paradigm that effectively safeguards data privacy. However, it may involve significant communication costs, which hinders training efficiency. In this paper, we aim to enhance communication efficiency from a new perspective. Specifically, we request the distributed clients to find optimal model updates relative to global model parameters within predefined random noise. For this purpose, we propose Federated Masked Random Noise (FedMRN), a novel framework that enables clients to learn a 1-bit mask for each model parameter and apply masked random noise (i.e., the Hadamard product of random noise and masks) to represent model updates. To make FedMRN feasible, we propose an advanced mask training strategy, called progressive stochastic masking (PSM). After local training, each client only needs to transmit local masks and a random seed to the server. Additionally, we provide theoretical guarantees for the convergence of FedMRN under both strongly convex and non-convex assumptions. Extensive experiments are conducted on four popular datasets. The results show that FedMRN exhibits superior convergence speed and test accuracy compared to relevant baselines, while attaining a similar level of accuracy as FedAvg.

Submitted 6 August, 2024; originally announced August 2024.

Comments: Accepted by MM 2024
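
The communication scheme described in the FedMRN abstract above (a 1-bit mask per parameter plus a random seed, with the update being the Hadamard product of noise and mask) can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration only: the function names, the uniform noise distribution and its scale, and the randomly drawn placeholder mask. The paper's actual mask training strategy (PSM) is not shown.

```python
import numpy as np

def make_noise(seed: int, n_params: int, scale: float = 0.01) -> np.ndarray:
    """Regenerate the shared random noise from a seed (uniform noise is an assumption)."""
    rng = np.random.default_rng(seed)
    return rng.uniform(-scale, scale, size=n_params)

def client_pack(mask: np.ndarray) -> bytes:
    """Pack a 0/1 mask into bytes, 8 parameters per byte."""
    return np.packbits(mask.astype(np.uint8)).tobytes()

def server_unpack(payload: bytes, seed: int, n_params: int,
                  scale: float = 0.01) -> np.ndarray:
    """Rebuild the model update as noise * mask (element-wise product)."""
    bits = np.unpackbits(np.frombuffer(payload, dtype=np.uint8))[:n_params]
    return make_noise(seed, n_params, scale) * bits

# Hypothetical example: a placeholder mask stands in for a learned one.
n_params, seed = 1000, 42
mask = np.random.default_rng(0).integers(0, 2, size=n_params)
payload = client_pack(mask)          # what the client would transmit, plus the seed
update = server_unpack(payload, seed, n_params)
```

In this sketch the client sends 125 bytes for 1,000 parameters instead of 4,000 bytes of 32-bit floats, which is the kind of saving a 1-bit-per-parameter encoding gives.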

arXiv:2408.00550 [pdf, other]

Mitigating Multilingual Hallucination in Large Vision-Language Models

Authors: Xiaoye Qu, Mingyang Song, Wei Wei, Jianfeng Dong, Yu Cheng

Abstract: While Large Vision-Language Models (LVLMs) have exhibited remarkable capabilities across a wide range of tasks, they suffer from hallucination problems, where models generate plausible yet incorrect answers given the input image-query pair. This hallucination phenomenon is even more severe when querying the image in non-English languages, while existing methods for mitigating hallucinations in LVLMs only consider the English scenarios. In this paper, we make the first attempt to mitigate this important multilingual hallucination in LVLMs. With thorough experiment analysis, we found that multilingual hallucination in LVLMs is a systemic problem that could arise from deficiencies in multilingual capabilities or inadequate multimodal abilities. To this end, we propose a two-stage Multilingual Hallucination Removal (MHR) framework for LVLMs, aiming to improve resistance to hallucination for both high-resource and low-resource languages. Instead of relying on the intricate manual annotations of multilingual resources, we fully leverage the inherent capabilities of the LVLM and propose a novel cross-lingual alignment method, which generates multiple responses for each image-query input and then identifies the hallucination-aware pairs for each language. These data pairs are finally used for direct preference optimization to prompt the LVLMs to favor non-hallucinating responses. Experimental results show that our MHR achieves a substantial reduction in hallucination generation for LVLMs. Notably, on our extended multilingual POPE benchmark, our framework delivers an average increase of 19.0% in accuracy across 13 different languages. Our code and model weights are available at https://github.com/ssmisya/MHR

Submitted 1 August, 2024; originally announced August 2024.

arXiv:2408.00026 [pdf, other]

Study of Wide-Field-of-View X-ray Observations of the Virgo Cluster Using the Lobster Eye Imager for Astronomy

Authors: Wen-Cheng Feng, Shu-Mei Jia, Hai-Hui Zhao, Heng Yu, Hai-Wu Pan, Cheng-Kui Li, Yu-Lin Cheng, Shan-Shan Weng, Yong Chen, Yuan Liu, Zhi-Xing Ling, Chen Zhang

Abstract: The Lobster Eye Imager for Astronomy (LEIA) is the pathfinder of the wide-field X-ray telescope used in the Einstein Probe mission. In this study, we present an image of the Virgo Cluster taken by LEIA in the 0.5-4.5 keV band with an exposure time of $\sim$17.3 ks in the central region. This extended emission is generally consistent with the results obtained by ROSAT. However, the field is affected by bright point sources due to the instrument's Point Spread Function (PSF) effect. Through fitting of the LEIA spectrum of the Virgo Cluster, we obtained a temperature of $2.1^{+0.3}_{-0.1}$ keV, which is consistent with the XMM-Newton results ($\sim$2.3 keV). Above 1.6 keV, the spectrum is dominated by the X-ray background. In summary, this study validates LEIA's extended source imaging and spectral resolution capabilities for the first time.

Submitted 31 July, 2024; originally announced August 2024.

Comments: 9 pages, 6 figures, 1 table

arXiv:2407.20875 [pdf, other]

Localized stem structures in quasi-resonant two-soliton solutions for the asymmetric Nizhnik-Novikov-Veselov system

Authors: Feng Yuan, Jiguang Rao, Jingsong He, Yi Cheng

Abstract: Elastic collisions of solitons generally have a finite phase shift. When the phase shift has a finitely large value, the two vertices of the (2+1)-dimensional 2-soliton are significantly separated due to the phase shift, accompanied by the formation of a local structure connecting the two V-shaped solitons. We define this local structure as the stem structure. This study systematically investigates the localized stem structures between two solitons in the (2+1)-dimensional asymmetric Nizhnik-Novikov-Veselov system. These stem structures, arising from quasi-resonant collisions between the solitons, exhibit distinct features of spatial locality and temporal invariance. We explore two scenarios: one characterized by weakly quasi-resonant collisions (i.e. $a_{12}\approx 0$), and the other by strongly quasi-resonant collisions (i.e. $a_{12}\approx +\infty$). Through mathematical analysis, we extract comprehensive insights into the trajectories, amplitudes, and velocities of the soliton arms. Furthermore, we discuss the characteristics of the stem structures, including their length and extreme points. Our findings shed new light on the interaction between solitons in the (2+1)-dimensional asymmetric Nizhnik-Novikov-Veselov system.

Submitted 30 July, 2024; originally announced July 2024.

Comments: 14 pages, 6 figures; accepted by Journal of Mathematical Physics (July 2024)

arXiv:2407.19941 [pdf, other]

Boosting Graph Foundation Model from Structural Perspective

Authors: Yao Cheng, Yige Zhao, Jianxiang Yu, Xiang Li

Abstract: Graph foundation models have recently attracted significant attention due to their strong generalizability. Although existing methods resort to language models to learn unified semantic representations across domains, they disregard the unique structural characteristics of graphs from different domains. To address the problem, in this paper, we boost the graph foundation model from a structural perspective and propose BooG. The model constructs virtual super nodes to unify structural characteristics of graph data from different domains. Specifically, the super nodes fuse the information of anchor nodes and class labels, where each anchor node captures the information of a node or a graph instance to be classified. Instead of using the raw graph structure, we connect super nodes to all nodes within their neighborhood by virtual edges. This new structure allows for effective information aggregation while unifying cross-domain structural characteristics. Additionally, we propose a novel pre-training objective based on contrastive learning, which learns more expressive representations for graph data and generalizes effectively to different domains and downstream tasks. Experimental results on various datasets and tasks demonstrate the superior performance of BooG. We provide our code and data here: https://anonymous.4open.science/r/BooG-EE42/.

Submitted 29 July, 2024; originally announced July 2024.

arXiv:2407.19467 [pdf, other]

Enhancing Taobao Display Advertising with Multimodal Representations: Challenges, Approaches and Insights

Authors: Xiang-Rong Sheng, Feifan Yang, Litong Gong, Biao Wang, Zhangming Chan, Yujing Zhang, Yueyao Cheng, Yong-Nan Zhu, Tiezheng Ge, Han Zhu, Yuning Jiang, Jian Xu, Bo Zheng

Abstract: Despite the recognized potential of multimodal data to improve model accuracy, many large-scale industrial recommendation systems, including the Taobao display advertising system, predominantly depend on sparse ID features in their models. In this work, we explore approaches to leverage multimodal data to enhance the recommendation accuracy. We start by identifying the key challenges in adopting multimodal data in a manner that is both effective and cost-efficient for industrial systems. To address these challenges, we introduce a two-phase framework, including: 1) the pre-training of multimodal representations to capture semantic similarity, and 2) the integration of these representations with existing ID-based models. Furthermore, we detail the architecture of our production system, which is designed to facilitate the deployment of multimodal representations. Since the integration of multimodal representations in mid-2023, we have observed significant performance improvements in the Taobao display advertising system. We believe that the insights we have gathered will serve as a valuable resource for practitioners seeking to leverage multimodal data in their systems.

Submitted 28 July, 2024; originally announced July 2024.

Comments: Accepted at CIKM 2024

arXiv:2407.19352 [pdf]

Design and Optimization of Big Data and Machine Learning-Based Risk Monitoring System in Financial Markets

Authors: Liyang Wang, Yu Cheng, Xingxin Gu, Zhizhong Wu

Abstract: With the increasing complexity of financial markets and rapid growth in data volume, traditional risk monitoring methods no longer suffice for modern financial institutions. This paper designs and optimizes a risk monitoring system based on big data and machine learning. By constructing a four-layer architecture, it effectively integrates large-scale financial data and advanced machine learning algorithms. Key technologies employed in the system include Long Short-Term Memory (LSTM) networks, Random Forest, Gradient Boosting Trees, and the real-time data processing platform Apache Flink, ensuring that risk monitoring is both real-time and accurate. Research findings demonstrate that the system significantly enhances efficiency and accuracy in risk management, particularly excelling in identifying and warning against market crash risks.

Submitted 27 July, 2024; originally announced July 2024.

arXiv:2407.18357 [pdf, other]

Needle Segmentation Using GAN: Restoring Thin Instrument Visibility in Robotic Ultrasound

Authors: Zhongliang Jiang, Xuesong Li, Xiangyu Chu, Angelos Karlas, Yuan Bi, Yingsheng Cheng, K. W. Samuel Au, Nassir Navab

Abstract: Ultrasound-guided percutaneous needle insertion is a standard procedure employed in both biopsy and ablation in clinical practices. However, due to the complex interaction between tissue and instrument, the needle may deviate from the in-plane view, resulting in a lack of close monitoring of the percutaneous needle. To address this challenge, we introduce a robot-assisted ultrasound (US) imaging system designed to seamlessly monitor the insertion process and autonomously restore the visibility of the inserted instrument when misalignment happens. To this end, the adversarial structure is presented to encourage the generation of segmentation masks that align consistently with the ground truth in high-order space. This study also systematically investigates the effects on segmentation performance by exploring various training loss functions and their combinations. When misalignment between the probe and the percutaneous needle is detected, the robot is triggered to perform transverse searching to optimize the positional and rotational adjustment to restore needle visibility. The experimental results on ex-vivo porcine samples demonstrate that the proposed method can precisely segment the percutaneous needle (with a tip error of $0.37\pm0.29$ mm and an angle error of $1.19\pm 0.29^{\circ}$). Furthermore, the needle appearance can be successfully restored under the repositioned probe pose in all 45 trials, with repositioning errors of $1.51\pm0.95$ mm and $1.25\pm0.79^{\circ}$.

Submitted 25 July, 2024; originally announced July 2024.

Comments: accepted by IEEE TIM. code: https://github.com/noseefood/NeedleSegmentation-GAN; video: https://youtu.be/4WuEP9PACs0

arXiv:2407.17892 [pdf, ps, other]

An Iterative Approach to Topic Modelling

Authors: Albert Wong, Florence Wing Yau Cheng, Ashley Keung, Yamileth Hercules, Mary Alexandra Garcia, Yew-Wei Lim, Lien Pham

Abstract: Topic modelling has become increasingly popular for summarizing text data, such as social media posts and articles. However, topic modelling is usually completed in one shot. Assessing the quality of resulting topics is challenging. No effective methods or measures have been developed for assessing the results or for making further enhancements to the topics. In this research, we propose to use an iterative process to perform topic modelling that gives rise to a sense of completeness of the resulting topics when the process is complete. Using the BERTopic package, a popular method in topic modelling, we demonstrate how the modelling process can be applied iteratively to arrive at a set of topics that could not be further improved upon using one of the three selected measures for clustering comparison as the decision criteria. This demonstration is conducted using a subset of the COVIDSenti-A dataset. The early success leads us to believe that further research using this approach in conjunction with other topic modelling algorithms could be viable.

Submitted 25 July, 2024; originally announced July 2024.

arXiv:2407.16654 [pdf, other]

Magnetic Fields in Massive Star-forming Regions (MagMaR): Unveiling an Hourglass Magnetic Field in G333.46-0.16 using ALMA

Authors: Piyali Saha, Patricio Sanhueza, Marco Padovani, Josep M. Girart, Paulo Cortes, Kaho Morii, Junhao Liu, A. Sanchez-Monge, Daniele Galli, Shantanu Basu, Patrick M. Koch, Maria T. Beltran, Shanghuo Li, Henrik Beuther, Ian W. Stephens, Fumitaka Nakamura, Qizhou Zhang, Wenyu Jiao, M. Fernandez-Lopez, Jihye Hwang, Eun Jung Chung, Kate Pattle, Luis A. Zapata, Fengwei Xu, Fernando A. Olguin, et al. (11 additional authors not shown)

Abstract: The contribution of the magnetic field to the formation of high-mass stars is poorly understood. We report the high-angular resolution ($\sim0.3^{\prime\prime}$, 870 au) map of the magnetic field projected on the plane of the sky (B$_\mathrm{POS}$) towards the high-mass star forming region G333.46$-$0.16 (G333), obtained with the Atacama Large Millimeter/submillimeter Array (ALMA) at 1.2 mm as part of the Magnetic Fields in Massive Star-forming Regions (MagMaR) survey. The B$_\mathrm{POS}$ morphology found in this region is consistent with a canonical "hourglass", which suggests a dynamically important field. This region is fragmented into two protostars separated by $\sim1740$ au. Interestingly, by analysing H$^{13}$CO$^{+}$ ($J=3-2$) line emission, we find no velocity gradient over the extent of the continuum, which is consistent with a strong field. We model the B$_\mathrm{POS}$, obtaining a marginally supercritical mass-to-flux ratio of 1.43, suggesting an initially strongly magnetized environment. Based on the Davis-Chandrasekhar-Fermi method, the magnetic field strength towards G333 is estimated to be 5.7 mG. The absence of strong rotation and outflows towards the central region of G333 suggests strong magnetic braking, consistent with a highly magnetized environment. Our study shows that despite being a strong regulator, the magnetic energy fails to prevent the process of fragmentation, as revealed by the formation of the two protostars in the central region.

Submitted 23 July, 2024; originally announced July 2024.

arXiv:2407.16140 [pdf]

Frequency stabilization based on H13C14N absorption in lithium niobate micro-disk laser

Authors: Zhen Yi, Zhihao Zhang, Jianglin Guan, Guanghui Zhao, Renhong Gao, Botao Fu, Jintian Lin, Jinming Chen, Jian Liu, Yijie Pan, Ya Cheng

Abstract: We demonstrate an on-chip lithium niobate micro-disk laser based on hydrogen cyanide (H13C14N) gas saturation absorption method for frequency stabilization. The laser chip consists of two main components: a micro-disk laser and a combined racetrack ring cavity. By operating on the H13C14N P12 absorption line at 1551.3 nm, the laser frequency can be precisely stabilized. The laser demonstrates remarkable stability, achieving a best stability value of $9\times10^{-9}$. Furthermore, the short-term stability, evaluated over continuous time intervals of 35 seconds, showcases exceptional performance. Additionally, the residual drift remains well below 30 MHz.

Submitted 22 July, 2024; originally announced July 2024.

arXiv:2407.15795 [pdf, other]

AdaCLIP: Adapting CLIP with Hybrid Learnable Prompts for Zero-Shot Anomaly Detection

Authors: Yunkang Cao, Jiangning Zhang, Luca Frittoli, Yuqi Cheng, Weiming Shen, Giacomo Boracchi

Abstract: Zero-shot anomaly detection (ZSAD) targets the identification of anomalies within images from arbitrary novel categories. This study introduces AdaCLIP for the ZSAD task, leveraging a pre-trained vision-language model (VLM), CLIP. AdaCLIP incorporates learnable prompts into CLIP and optimizes them through training on auxiliary annotated anomaly detection data. Two types of learnable prompts are proposed: static and dynamic. Static prompts are shared across all images, serving to preliminarily adapt CLIP for ZSAD. In contrast, dynamic prompts are generated for each test image, providing CLIP with dynamic adaptation capabilities. The combination of static and dynamic prompts is referred to as hybrid prompts, and yields enhanced ZSAD performance. Extensive experiments conducted across 14 real-world anomaly detection datasets from industrial and medical domains indicate that AdaCLIP outperforms other ZSAD methods and can generalize better to different categories and even domains. Finally, our analysis highlights the importance of diverse auxiliary data and optimized prompts for enhanced generalization capacity. Code is available at https://github.com/caoyunkang/AdaCLIP.

Submitted 22 July, 2024; originally announced July 2024.

Comments: Accepted by ECCV 2024

arXiv:2407.15451 [pdf, other]

Domain-Adaptive 2D Human Pose Estimation via Dual Teachers in Extremely Low-Light Conditions

Authors: Yihao Ai, Yifei Qi, Bo Wang, Yu Cheng, Xinchao Wang, Robby T. Tan

Abstract: Existing 2D human pose estimation research predominantly concentrates on well-lit scenarios, with limited exploration of poor lighting conditions, which are a prevalent aspect of daily life. Recent studies on low-light pose estimation require the use of paired well-lit and low-light images with ground truths for training, which are impractical due to the inherent challenges associated with annotation on low-light images. To this end, we introduce a novel approach that eliminates the need for low-light ground truths. Our primary novelty lies in leveraging two complementary-teacher networks to generate more reliable pseudo labels, enabling our model to achieve competitive performance on extremely low-light images without the need for training with low-light ground truths. Our framework consists of two stages. In the first stage, our model is trained on well-lit data with low-light augmentations. In the second stage, we propose a dual-teacher framework to utilize the unlabeled low-light data, where a center-based main teacher produces the pseudo labels for relatively visible cases, while a keypoints-based complementary teacher focuses on producing the pseudo labels for the persons missed by the main teacher. With the pseudo labels from both teachers, we propose a person-specific low-light augmentation to challenge a student model in training to outperform the teachers. Experimental results on a real low-light dataset (ExLPose-OCN) show that our method achieves a 6.8% (2.4 AP) improvement over the state-of-the-art (SOTA) method, even though no low-light ground-truth data is used in our approach, in contrast to the SOTA method. Our code will be available at: https://github.com/ayh015-dev/DA-LLPose.

Submitted 23 July, 2024; v1 submitted 22 July, 2024; originally announced July 2024.

Comments: 18 pages, 3 figures. Accepted by ECCV24

arXiv:2407.15226 [pdf, other]

Variational Bayesian Inference for Multiple Extended Targets or Unresolved Group Targets Tracking

Authors: Yuanhao Cheng, Yunhe Cao, Tat-Soon Yeo, Yulin Zhang, Fu Jie

Abstract: In this work, we propose a tracking method for multiple extended targets or unresolvable group targets based on the Variational Bayesian Inference (VBI). Firstly, based on the most commonly used Random Matrix Model (RMM), the joint states of a single target are modeled as a Gamma Gaussian Inverse Wishart (GGIW) distribution, and the multi-target joint association variables are involved in the estimation together as unknown information with a prior distribution. A shape evolution model and VBI are employed to address the shortcomings of the RMM. Through the VBI, we can derive the approximate variational posterior for the exact multi-target posterior. Furthermore, to demonstrate the applicability of the method in real-world tracking scenarios, we present two potential lightweight schemes. The first is based on clustering, which effectively prunes the joint association events. The second is a simplification of the variational posterior through marginal association probabilities. We demonstrate the effectiveness of the proposed method using simulation experiments, and the proposed method outperforms current state-of-the-art methods in terms of accuracy and adaptability. This manuscript is only a preprint; a more complete and official version will be uploaded as soon as possible.

Submitted 6 August, 2024; v1 submitted 21 July, 2024; originally announced July 2024.

Comments: 21 pages, 15 figures, 3 tables

arXiv:2407.15225 [pdf]

An electro-optically tunable arrayed waveguide grating fabricated on thin film lithium niobate

Authors: Zhe Wang, Zhiwei Fang, Yiran Zhu, Jian Liu, Lang Gao, Jianping Yu, Haisu Zhang, Min Wang, Ya Cheng

Abstract: We design and fabricate an 8-channel thin film lithium niobate (TFLN) arrayed-waveguide grating (AWG) and demonstrate the electro-optical tunability of the device. The monolithically integrated microelectrodes are designed for waveguide phase modulation and wavelength tuning. Experiments show that the fabricated electro-optically controlled TFLN AWG has a channel spacing of 200 GHz and a wavelength tuning efficiency of 10 pm/V.

Submitted 21 July, 2024; originally announced July 2024.

arXiv:2407.14601 [pdf, other]

ANDES, the high resolution spectrograph for the ELT: science goals, project overview and future developments

Authors: A. Marconi, M. Abreu, V. Adibekyan, V. Alberti, S. Albrecht, J. Alcaniz, M. Aliverti, C. Allende Prieto, J. D. Alvarado Gómez, C. S. Alves, P. J. Amado, M. Amate, M. I. Andersen, S. Antoniucci, E. Artigau, C. Bailet, C. Baker, V. Baldini, A. Balestra, S. A. Barnes, F. Baron, S. C. C. Barros, S. M. Bauer, M. Beaulieu, O. Bellido-Tirado, et al. (264 additional authors not shown)

Abstract: The first generation of ELT instruments includes an optical-infrared high-resolution spectrograph, indicated as ELT-HIRES and recently christened ANDES (ArmazoNes high Dispersion Echelle Spectrograph). ANDES consists of three fibre-fed spectrographs ([U]BV, RIZ, YJH) providing a spectral resolution of $\sim$100,000 with a minimum simultaneous wavelength coverage of 0.4-1.8 $\mu$m with the goal of extending it to 0.35-2.4 $\mu$m with the addition of a U arm to the BV spectrograph and a separate K band spectrograph. It operates both in seeing- and diffraction-limited conditions and the fibre feeding allows several, interchangeable observing modes including a single conjugated adaptive optics module and a small diffraction-limited integral field unit in the NIR. Modularity and fibre-feeding allow ANDES to be placed partly on the ELT Nasmyth platform and partly in the Coudé room. ANDES has a wide range of groundbreaking science cases spanning nearly all areas of research in astrophysics and even fundamental physics. Among the top science cases, there are the detection of biosignatures from exoplanet atmospheres, finding the fingerprints of the first generation of stars, tests on the stability of Nature's fundamental couplings, and the direct detection of the cosmic acceleration. The ANDES project is carried forward by a large international consortium, composed of 35 Institutes from 13 countries, forming a team of almost 300 scientists and engineers which include the majority of the scientific and technical expertise in the field that can be found in ESO member states.

Submitted 19 July, 2024; originally announced July 2024.

Comments: SPIE astronomical telescope and instrumentation 2024, in press

Showing 1–50 of 1,714 results for author: Cheng, Y