
Showing 1–50 of 217 results for author: Ni, Z

  1. arXiv:2407.03648 [pdf, other]

    eess.AS cs.SD

    High Fidelity Text-Guided Music Generation and Editing via Single-Stage Flow Matching

    Authors: Gael Le Lan, Bowen Shi, Zhaoheng Ni, Sidd Srinivasan, Anurag Kumar, Brian Ellis, David Kant, Varun Nagaraja, Ernie Chang, Wei-Ning Hsu, Yangyang Shi, Vikas Chandra

    Abstract: We introduce a simple and efficient text-controllable high-fidelity music generation and editing model. It operates on sequences of continuous latent representations from a low-frame-rate 48 kHz stereo variational autoencoder codec that eliminates the information loss drawback of discrete representations. Based on a diffusion transformer architecture trained on a flow-matching objective, the model…

    Submitted 4 July, 2024; originally announced July 2024.

  2. arXiv:2406.12712 [pdf, other]

    cs.CV

    Self-Localized Collaborative Perception

    Authors: Zhenyang Ni, Zixing Lei, Yifan Lu, Dingju Wang, Chen Feng, Yanfeng Wang, Siheng Chen

    Abstract: Collaborative perception has garnered considerable attention due to its capacity to address several inherent challenges in single-agent perception, including occlusion and out-of-range issues. However, existing collaborative perception systems heavily rely on precise localization systems to establish a consistent spatial coordinate system between agents. This reliance makes them susceptible to lar…

    Submitted 18 June, 2024; originally announced June 2024.

  3. arXiv:2406.08377 [pdf, other]

    cs.CV

    DDR: Exploiting Deep Degradation Response as Flexible Image Descriptor

    Authors: Juncheng Wu, Zhangkai Ni, Hanli Wang, Wenhan Yang, Yuyin Zhou, Shiqi Wang

    Abstract: Image deep features extracted by pre-trained networks are known to contain rich and informative representations. In this paper, we present Deep Degradation Response (DDR), a method to quantify changes in image deep features under varying degradation conditions. Specifically, our approach facilitates flexible and adaptive degradation, enabling the controlled synthesis of image degradation through t…

    Submitted 12 June, 2024; originally announced June 2024.

  4. arXiv:2406.05478 [pdf, other]

    cs.CV cs.AI

    Revisiting Non-Autoregressive Transformers for Efficient Image Synthesis

    Authors: Zanlin Ni, Yulin Wang, Renping Zhou, Jiayi Guo, Jinyi Hu, Zhiyuan Liu, Shiji Song, Yuan Yao, Gao Huang

    Abstract: The field of image synthesis is currently flourishing due to the advancements in diffusion models. While diffusion models have been successful, their computational intensity has prompted the pursuit of more efficient alternatives. As a representative work, non-autoregressive Transformers (NATs) have been recognized for their rapid generation. However, a major drawback of these models is their infe…

    Submitted 8 June, 2024; originally announced June 2024.

    Comments: Accepted by CVPR 2024

  5. arXiv:2406.04660 [pdf, other]

    eess.AS cs.SD

    URGENT Challenge: Universality, Robustness, and Generalizability For Speech Enhancement

    Authors: Wangyou Zhang, Robin Scheibler, Kohei Saijo, Samuele Cornell, Chenda Li, Zhaoheng Ni, Anurag Kumar, Jan Pirklbauer, Marvin Sach, Shinji Watanabe, Tim Fingscheidt, Yanmin Qian

    Abstract: The last decade has witnessed significant advancements in deep learning-based speech enhancement (SE). However, most existing SE research has limitations on the coverage of SE sub-tasks, data diversity and amount, and evaluation metrics. To fill this gap and promote research toward universal SE, we establish a new SE challenge, named URGENT, to focus on the universality, robustness, and generaliza…

    Submitted 7 June, 2024; originally announced June 2024.

    Comments: 6 pages, 3 figures, 3 tables. Accepted by Interspeech 2024. An extended version of the accepted manuscript with appendix

  6. arXiv:2406.04295 [pdf, other]

    cs.CV

    Everything to the Synthetic: Diffusion-driven Test-time Adaptation via Synthetic-Domain Alignment

    Authors: Jiayi Guo, Junhao Zhao, Chunjiang Ge, Chaoqun Du, Zanlin Ni, Shiji Song, Humphrey Shi, Gao Huang

    Abstract: Test-time adaptation (TTA) aims to enhance the performance of source-domain pretrained models when tested on unknown shifted target domains. Traditional TTA methods primarily adapt model weights based on target data streams, making model performance sensitive to the amount and order of target data. Recently, diffusion-driven TTA methods have demonstrated strong performance by using an unconditiona…

    Submitted 6 June, 2024; originally announced June 2024.

    Comments: GitHub: https://github.com/SHI-Labs/Diffusion-Driven-Test-Time-Adaptation-via-Synthetic-Domain-Alignment

  7. arXiv:2406.03287 [pdf, other]

    cs.NE cs.CL cs.LG

    SpikeLM: Towards General Spike-Driven Language Modeling via Elastic Bi-Spiking Mechanisms

    Authors: Xingrun Xing, Zheng Zhang, Ziyi Ni, Shitao Xiao, Yiming Ju, Siqi Fan, Yequan Wang, Jiajun Zhang, Guoqi Li

    Abstract: Towards energy-efficient artificial intelligence similar to the human brain, bio-inspired spiking neural networks (SNNs) offer the advantages of biological plausibility, event-driven sparsity, and binary activation. Recently, large-scale language models have exhibited promising generalization capability, making it worthwhile to explore more general spike-driven models. However, the binary spikes in…

    Submitted 5 June, 2024; originally announced June 2024.

  8. arXiv:2406.02560 [pdf, other]

    eess.AS cs.AI cs.CL cs.LG

    Less Peaky and More Accurate CTC Forced Alignment by Label Priors

    Authors: Ruizhe Huang, Xiaohui Zhang, Zhaoheng Ni, Li Sun, Moto Hira, Jeff Hwang, Vimal Manohar, Vineel Pratap, Matthew Wiesner, Shinji Watanabe, Daniel Povey, Sanjeev Khudanpur

    Abstract: Connectionist temporal classification (CTC) models are known to have peaky output distributions. Such behavior is not a problem for automatic speech recognition (ASR), but it can cause inaccurate forced alignments (FA), especially at finer granularity, e.g., phoneme level. This paper aims at alleviating the peaky behavior of CTC and improving its suitability for forced alignment generation, by leve…

    Submitted 15 June, 2024; v1 submitted 22 April, 2024; originally announced June 2024.

    Comments: Accepted by ICASSP 2024. GitHub repo: https://github.com/huangruizhe/audio/tree/aligner_label_priors

  9. arXiv:2406.00627 [pdf, other]

    cs.CL

    Prompt Framework for Role-playing: Generation and Evaluation

    Authors: Xun Liu, Zhengwei Ni

    Abstract: Large language models (LLMs) have demonstrated remarkable abilities in generating natural language, understanding user instructions, and mimicking human language use. These capabilities have garnered considerable interest in applications such as role-playing. However, collecting data on individual role scripts (or profiles) and manually evaluating performance can be costly. We introd…

    Submitted 2 June, 2024; originally announced June 2024.

  10. arXiv:2405.19765 [pdf, other]

    cs.CV cs.AI

    Towards Unified Multi-granularity Text Detection with Interactive Attention

    Authors: Xingyu Wan, Chengquan Zhang, Pengyuan Lyu, Sen Fan, Zihan Ni, Kun Yao, Errui Ding, Jingdong Wang

    Abstract: Existing OCR engines or document image analysis systems typically rely on training separate models for text detection in varying scenarios and granularities, leading to significant computational complexity and resource demands. In this paper, we introduce "Detect Any Text" (DAT), an advanced paradigm that seamlessly unifies scene text detection, layout analysis, and document page detection into a…

    Submitted 30 May, 2024; originally announced May 2024.

    Comments: ICML 2024

  11. arXiv:2405.18790 [pdf, other]

    cs.CV cs.MM eess.IV

    Opinion-Unaware Blind Image Quality Assessment using Multi-Scale Deep Feature Statistics

    Authors: Zhangkai Ni, Yue Liu, Keyan Ding, Wenhan Yang, Hanli Wang, Shiqi Wang

    Abstract: Deep learning-based methods have significantly influenced the blind image quality assessment (BIQA) field; however, these methods often require training using large amounts of human rating data. In contrast, traditional knowledge-based methods are cost-effective for training but face challenges in effectively extracting features aligned with human visual perception. To bridge these gaps, we propos…

    Submitted 29 May, 2024; originally announced May 2024.

    Comments: Accepted to IEEE Transactions on Multimedia 2024

  12. arXiv:2405.06525 [pdf, other]

    cs.CV

    Semantic and Spatial Adaptive Pixel-level Classifier for Semantic Segmentation

    Authors: Xiaowen Ma, Zhenliang Ni, Xinghao Chen

    Abstract: Vanilla pixel-level classifiers for semantic segmentation are based on a certain paradigm, involving the inner product of fixed prototypes obtained from the training set and pixel features in the test image. This approach, however, encounters significant limitations, i.e., feature deviation in the semantic domain and information loss in the spatial domain. The former struggles with large intra-cla…

    Submitted 10 May, 2024; originally announced May 2024.

  13. arXiv:2405.06228 [pdf, other]

    cs.CV

    Context-Guided Spatial Feature Reconstruction for Efficient Semantic Segmentation

    Authors: Zhenliang Ni, Xinghao Chen, Yingjie Zhai, Yehui Tang, Yunhe Wang

    Abstract: Semantic segmentation is an important task for many applications, but it is still quite challenging to achieve advanced performance with limited computational costs. In this paper, we present CGRSeg, an efficient yet competitive segmentation framework based on context-guided spatial feature reconstruction. A Rectangular Self-Calibration Module is carefully designed for spatial feature reconstructio…

    Submitted 9 May, 2024; originally announced May 2024.

  14. arXiv:2405.02965 [pdf, other]

    cs.AI cs.RO

    Robust Collaborative Perception without External Localization and Clock Devices

    Authors: Zixing Lei, Zhenyang Ni, Ruize Han, Shuo Tang, Dingju Wang, Chen Feng, Siheng Chen, Yanfeng Wang

    Abstract: Consistent spatial-temporal coordination across multiple agents is fundamental for collaborative perception, which seeks to improve perception abilities through information exchange among agents. To achieve this spatial-temporal alignment, traditional methods depend on external devices to provide localization and clock signals. However, hardware-generated signals could be vulnerable to noise and…

    Submitted 31 May, 2024; v1 submitted 5 May, 2024; originally announced May 2024.

    Comments: 6 pages, accepted to ICRA 2024

  15. arXiv:2404.12916 [pdf, other]

    cs.CR

    Physical Backdoor Attack can Jeopardize Driving with Vision-Large-Language Models

    Authors: Zhenyang Ni, Rui Ye, Yuxi Wei, Zhen Xiang, Yanfeng Wang, Siheng Chen

    Abstract: Vision-Large-Language Models (VLMs) have great application prospects in autonomous driving. Despite the ability of VLMs to comprehend and make decisions in complex scenarios, their integration into safety-critical autonomous driving systems poses serious security risks. In this paper, we propose BadVLMDriver, the first backdoor attack against VLMs for autonomous driving that can be launched in prac…

    Submitted 22 April, 2024; v1 submitted 19 April, 2024; originally announced April 2024.

  16. arXiv:2404.09574 [pdf, other]

    cs.LG cs.AI

    Predicting and Analyzing Pedestrian Crossing Behavior at Unsignalized Crossings

    Authors: Chi Zhang, Janis Sprenger, Zhongjun Ni, Christian Berger

    Abstract: Understanding and predicting pedestrian crossing behavior is essential for enhancing automated driving and improving driving safety. Predicting gap selection behavior and the use of zebra crossings enables driving systems to proactively respond and prevent potential conflicts. This task is particularly challenging at unsignalized crossings due to the ambiguous right of way, requiring pedestrians to…

    Submitted 15 April, 2024; originally announced April 2024.

    Comments: 8 pages, 10 figures, 4 tables. Accepted in 2024 IEEE Intelligent Vehicles Symposium (IV)

    MSC Class: 68T40; 68T45
    ACM Class: I.2.10

  17. arXiv:2404.06010 [pdf, other]

    cond-mat.mtrl-sci

    Magnetic field control of continuous Néel vector rotation and Néel temperature in a van der Waals antiferromagnet

    Authors: Zhuoliang Ni, Urban Seifert, Amanda V. Haglund, Nan Huang, David G. Mandrus, Leon Balents, Liang Wu

    Abstract: In a collinear antiferromagnet, spins tend to cant towards the direction of an applied magnetic field, thereby decreasing the energy of the system. The canting angle becomes negligible when the magnetic field is small so that the induced anisotropic energy is substantially lower than the exchange energy. However, this tiny anisotropy can play a significant role when the intrinsic anisotropy of the…

    Submitted 9 April, 2024; originally announced April 2024.

  18. arXiv:2403.17898 [pdf, other]

    cs.CV

    Octree-GS: Towards Consistent Real-time Rendering with LOD-Structured 3D Gaussians

    Authors: Kerui Ren, Lihan Jiang, Tao Lu, Mulin Yu, Linning Xu, Zhangkai Ni, Bo Dai

    Abstract: Recently, 3D Gaussian splatting (3D-GS) has shown remarkable rendering fidelity and efficiency compared to NeRF-based neural scene representations. While demonstrating the potential for real-time rendering, 3D-GS encounters rendering bottlenecks in large scenes with complex details due to an excessive number of Gaussian primitives located within the viewing frustum. This limitation is particularl…

    Submitted 26 March, 2024; originally announced March 2024.

    Comments: Project page: https://city-super.github.io/octree-gs/

  19. arXiv:2403.11703 [pdf, other]

    cs.CV cs.AI

    LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images

    Authors: Ruyi Xu, Yuan Yao, Zonghao Guo, Junbo Cui, Zanlin Ni, Chunjiang Ge, Tat-Seng Chua, Zhiyuan Liu, Maosong Sun, Gao Huang

    Abstract: Visual encoding constitutes the basis of large multimodal models (LMMs) in understanding the visual world. Conventional LMMs process images in fixed sizes and limited resolutions, while recent explorations in this direction are limited in adaptivity, efficiency, and even correctness. In this work, we first take GPT-4V and LLaVA-1.5 as representative examples and expose systematic flaws rooted in t…

    Submitted 18 March, 2024; originally announced March 2024.

    Comments: Preprint

  20. arXiv:2403.11084 [pdf]

    physics.optics physics.app-ph

    High Performance Graphene Integrated Photonics Platform Enabled by Gold-assisted Transfer

    Authors: Xiaoxuan Wu, Zhengyi Cao, Tianxiang Zhao, Yun Wu, Zhonghui Li, Spyros Doukas, Elefterios Lidorikis, Yu Xue, Liu Liu, Omid Ghaebi, Giancarlo Soavi, Junpeng Lv, Zhenghua Ni, Junjia Wang

    Abstract: Graphene is promising for nanoscale, efficient, ultra-fast photo- and opto-electronic devices because of its remarkable electrical and optical properties, such as fast electron relaxation and heat dissipation. Here, we realize a high-performance graphene integrated photonics platform enabled by gold-assisted transfer. Thanks to our optimized transfer technique, we fabricate and demonstrate (1) a mic…

    Submitted 17 March, 2024; originally announced March 2024.

  21. arXiv:2403.08203 [pdf, other]

    q-bio.NC cs.LG eess.IV

    Learnable Community-Aware Transformer for Brain Connectome Analysis with Token Clustering

    Authors: Yanting Yang, Beidi Zhao, Zhuohao Ni, Yize Zhao, Xiaoxiao Li

    Abstract: Neuroscientific research has revealed that the complex brain network can be organized into distinct functional communities, each characterized by a cohesive group of regions of interest (ROIs) with strong interconnections. These communities play a crucial role in comprehending the functional organization of the brain and its implications for neurological conditions, including Autism Spectrum Disor…

    Submitted 12 March, 2024; originally announced March 2024.

  22. arXiv:2403.04326 [pdf, other]

    eess.SY cs.AI cs.LG

    Edge-based Parametric Digital Twins for Intelligent Building Indoor Climate Modeling

    Authors: Zhongjun Ni, Chi Zhang, Magnus Karlsson, Shaofang Gong

    Abstract: Digital transformation in the built environment generates vast data for developing data-driven models to optimize building operations. This study presents an integrated solution utilizing edge computing, digital twins, and deep learning to enhance the understanding of climate in buildings. Parametric digital twins, created using an ontology, ensure consistent data representation across diverse ser…

    Submitted 7 March, 2024; originally announced March 2024.

    Comments: 8 pages, 8 figures, accepted in the 20th IEEE International Conference on Factory Communication Systems

    MSC Class: 68T07
    ACM Class: I.5.4

  23. arXiv:2402.18192 [pdf, other]

    cs.CV eess.IV

    Misalignment-Robust Frequency Distribution Loss for Image Transformation

    Authors: Zhangkai Ni, Juncheng Wu, Zian Wang, Wenhan Yang, Hanli Wang, Lin Ma

    Abstract: This paper aims to address a common challenge in deep learning-based image transformation methods, such as image enhancement and super-resolution, which heavily rely on precisely aligned paired datasets with pixel-level alignments. However, creating precisely aligned paired images presents significant challenges and hinders the advancement of methods trained on such data. To overcome this challeng…

    Submitted 28 February, 2024; originally announced February 2024.

    Comments: Accepted to Computer Vision and Pattern Recognition Conference (CVPR) 2024

  24. arXiv:2401.09686 [pdf, other]

    eess.AS cs.SD

    An Empirical Study on the Impact of Positional Encoding in Transformer-based Monaural Speech Enhancement

    Authors: Qiquan Zhang, Meng Ge, Hongxu Zhu, Eliathamby Ambikairajah, Qi Song, Zhaoheng Ni, Haizhou Li

    Abstract: The Transformer architecture has enabled recent progress in speech enhancement. Since Transformers are position-agnostic, positional encoding is the de facto standard component used to enable Transformers to distinguish the order of elements in a sequence. However, it remains unclear how positional encoding exactly impacts speech enhancement based on Transformer architectures. In this paper, we perform…

    Submitted 13 February, 2024; v1 submitted 17 January, 2024; originally announced January 2024.

    Comments: Accepted by ICASSP 2024

  25. arXiv:2401.06411 [pdf, other]

    cs.ET

    An Efficient and Scalable Clocking Assignment Algorithm for Multi-Threaded Multi-Phase Single Flux Quantum Circuits

    Authors: Robert S. Aviles, Xi Li, Lei Lu, Zhaorui Ni, Peter A. Beerel

    Abstract: A key distinguishing feature of single flux quantum (SFQ) circuits is that each logic gate is clocked. This feature forces the introduction of path-balancing flip-flops to ensure proper synchronization of inputs at each gate. This paper proposes a polynomial-time approximation algorithm for clocking assignments that minimizes the insertion of path-balancing buffers for multi-threaded mu…

    Submitted 12 January, 2024; originally announced January 2024.

  26. arXiv:2312.14199 [pdf, other]

    cs.CR

    Report on 2023 CyberTraining PI Meeting, 26-27 September 2023

    Authors: Geoffrey Fox, Mary P Thomas, Sajal Bhatia, Marisa Brazil, Nicole M Gasparini, Venkatesh Mohan Merwade, Henry J. Neeman, Jeff Carver, Henri Casanova, Vipin Chaudhary, Dirk Colbry, Lonnie Crosby, Prasun Dewan, Jessica Eisma, Nicole M Gasparini, Ahmed Irfan, Kate Kaehey, Qianqian Liu, Zhen Ni, Sushil Prasad, Apan Qasem, Erik Saule, Prabha Sundaravadivel, Karen Tomko

    Abstract: This document describes a two-day meeting held for the Principal Investigators (PIs) of NSF CyberTraining grants. The report covers invited talks, panels, and six breakout sessions. The meeting involved over 80 PIs and NSF program managers (PMs). The lessons recorded in detail in the report are a wealth of information that could help current and future PIs, as well as NSF PMs, understand the futur…

    Submitted 28 December, 2023; v1 submitted 20 December, 2023; originally announced December 2023.

    Comments: 38 pages, 3 main sections and 2 Appendix sections, 2 figures, 19 tables; updated version: author corrections

  27. arXiv:2312.09095 [pdf, other]

    cs.CV

    ColNeRF: Collaboration for Generalizable Sparse Input Neural Radiance Field

    Authors: Zhangkai Ni, Peiqi Yang, Wenhan Yang, Hanli Wang, Lin Ma, Sam Kwong

    Abstract: Neural Radiance Fields (NeRF) have demonstrated impressive potential in synthesizing novel views from dense input; however, their effectiveness is challenged when dealing with sparse input. Existing approaches that incorporate additional depth or semantic supervision can alleviate this issue to an extent. However, the process of supervision collection is not only costly but also potentially inaccu…

    Submitted 14 December, 2023; v1 submitted 14 December, 2023; originally announced December 2023.

  28. arXiv:2312.08264 [pdf, other]

    eess.SP cs.LG physics.ao-ph

    Kunyu: A High-Performing Global Weather Model Beyond Regression Losses

    Authors: Zekun Ni

    Abstract: Over the past year, data-driven global weather forecasting has emerged as a new alternative to traditional numerical weather prediction. This innovative approach yields forecasts of comparable accuracy at a tiny fraction of the computational cost. Regrettably, as far as I know, existing models exclusively rely on regression losses, producing forecasts with substantial blurring. Such blurring, althoug…

    Submitted 4 December, 2023; originally announced December 2023.

    Comments: 12 pages, 5 figures

  29. arXiv:2312.06568 [pdf, other]

    cs.LG cs.AI cs.CR

    Sparse but Strong: Crafting Adversarially Robust Graph Lottery Tickets

    Authors: Subhajit Dutta Chowdhury, Zhiyu Ni, Qingyuan Peng, Souvik Kundu, Pierluigi Nuzzo

    Abstract: Graph Lottery Tickets (GLTs), comprising a sparse adjacency matrix and a sparse graph neural network (GNN), can significantly reduce the inference latency and compute footprint compared to their dense counterparts. Despite these benefits, their performance against adversarial structure perturbations remains to be fully explored. In this work, we first investigate the resilience of GLTs against dif…

    Submitted 11 December, 2023; originally announced December 2023.

    Comments: Accepted at NeurIPS 2023 GLFrontiers Workshop

  30. arXiv:2312.05966 [pdf, other]

    cs.LG cs.CV

    Fake It Till Make It: Federated Learning with Consensus-Oriented Generation

    Authors: Rui Ye, Yaxin Du, Zhenyang Ni, Siheng Chen, Yanfeng Wang

    Abstract: In federated learning (FL), data heterogeneity is one key bottleneck that causes model divergence and limits performance. Addressing this, existing methods often regard data heterogeneity as an inherent property and propose to mitigate its adverse effects by correcting models. In this paper, we seek to break this inherent property by generating data to complement the original dataset to fundamenta…

    Submitted 10 December, 2023; originally announced December 2023.

    Comments: 27 pages

  31. arXiv:2312.04410 [pdf, other]

    cs.CV

    Smooth Diffusion: Crafting Smooth Latent Spaces in Diffusion Models

    Authors: Jiayi Guo, Xingqian Xu, Yifan Pu, Zanlin Ni, Chaofei Wang, Manushree Vasu, Shiji Song, Gao Huang, Humphrey Shi

    Abstract: Recently, diffusion models have made remarkable progress in text-to-image (T2I) generation, synthesizing images with high fidelity and diverse contents. Despite this advancement, latent space smoothness within diffusion models remains largely unexplored. Smooth latent spaces ensure that a perturbation on an input latent corresponds to a steady change in the output image. This property proves benef…

    Submitted 7 December, 2023; originally announced December 2023.

    Comments: GitHub: https://github.com/SHI-Labs/Smooth-Diffusion

  32. arXiv:2311.01092 [pdf, other]

    cs.CV

    Learning A Multi-Task Transformer Via Unified And Customized Instruction Tuning For Chest Radiograph Interpretation

    Authors: Lijian Xu, Ziyu Ni, Xinglong Liu, Xiaosong Wang, Hongsheng Li, Shaoting Zhang

    Abstract: The emergence of multi-modal deep learning models has made significant impacts on clinical applications in the last decade. However, the majority of models are limited to single-tasking, without considering that disease diagnosis is indeed a multi-task procedure. Here, we demonstrate a unified transformer model specifically designed for multi-modal clinical tasks by incorporating customized instruction…

    Submitted 3 March, 2024; v1 submitted 2 November, 2023; originally announced November 2023.

  33. arXiv:2311.00897 [pdf, other]

    cs.SD cs.CL eess.AS

    On The Open Prompt Challenge In Conditional Audio Generation

    Authors: Ernie Chang, Sidd Srinivasan, Mahi Luthra, Pin-Jie Lin, Varun Nagaraja, Forrest Iandola, Zechun Liu, Zhaoheng Ni, Changsheng Zhao, Yangyang Shi, Vikas Chandra

    Abstract: Text-to-audio generation (TTA) produces audio from a text description, learning from pairs of audio samples and hand-annotated text. However, commercializing audio generation is challenging as user-input prompts are often under-specified when compared to text descriptions used to train TTA models. In this work, we treat TTA models as a "blackbox" and address the user prompt challenge with two ke…

    Submitted 1 November, 2023; originally announced November 2023.

    Comments: 5 pages, 3 figures, 4 tables

  34. arXiv:2310.20496 [pdf, other]

    cs.LG

    BasisFormer: Attention-based Time Series Forecasting with Learnable and Interpretable Basis

    Authors: Zelin Ni, Hang Yu, Shizhan Liu, Jianguo Li, Weiyao Lin

    Abstract: Bases have become an integral part of modern deep learning-based models for time series forecasting due to their ability to act as feature extractors or future references. To be effective, a basis must be tailored to the specific set of time series data and exhibit distinct correlation with each time series within the set. However, current state-of-the-art methods are limited in their ability to s…

    Submitted 18 January, 2024; v1 submitted 31 October, 2023; originally announced October 2023.

    Comments: NeurIPS 2023 (poster)

  35. arXiv:2310.19069 [pdf, other]

    cs.LG cs.DC

    Efficient Cluster Selection for Personalized Federated Learning: A Multi-Armed Bandit Approach

    Authors: Zhou Ni, Morteza Hashemi

    Abstract: Federated learning (FL) offers a decentralized training approach for machine learning models, prioritizing data privacy. However, the inherent heterogeneity in FL networks, arising from variations in data distribution, size, and device capabilities, poses challenges in user federation. Recognizing this, Personalized Federated Learning (PFL) emphasizes tailoring learning processes to individual dat…

    Submitted 29 October, 2023; originally announced October 2023.

  36. arXiv:2310.17864 [pdf, other]

    eess.AS cs.SD

    TorchAudio 2.1: Advancing speech recognition, self-supervised learning, and audio processing components for PyTorch

    Authors: Jeff Hwang, Moto Hira, Caroline Chen, Xiaohui Zhang, Zhaoheng Ni, Guangzhi Sun, Pingchuan Ma, Ruizhe Huang, Vineel Pratap, Yuekai Zhang, Anurag Kumar, Chin-Yun Yu, Chuang Zhu, Chunxi Liu, Jacob Kahn, Mirco Ravanelli, Peng Sun, Shinji Watanabe, Yangyang Shi, Yumeng Tao, Robin Scheibler, Samuele Cornell, Sean Kim, Stavros Petridis

    Abstract: TorchAudio is an open-source audio and speech processing library built for PyTorch. It aims to accelerate the research and development of audio and speech technologies by providing well-designed, easy-to-use, and performant PyTorch components. Its contributors routinely engage with users to understand their needs and fulfill them by developing impactful features. Here, we survey TorchAudio's devel…

    Submitted 26 October, 2023; originally announced October 2023.

  37. A Digitalization Framework for Smart Maintenance of Historic Buildings

    Authors: Zhongjun Ni

    Abstract: Smart maintenance of historic buildings involves integration of digital technologies and data analysis methods to help maintain functionalities of these buildings and preserve their heritage values. However, the maintenance of historic buildings is a long-term process. During the process, the digital transformation requires overcoming various challenges, such as stable and scalable storage and com…

    Submitted 3 October, 2023; originally announced October 2023.

    Comments: Licentiate Thesis

  38. arXiv:2310.00746 [pdf, other]

    cs.CL cs.AI

    RoleLLM: Benchmarking, Eliciting, and Enhancing Role-Playing Abilities of Large Language Models

    Authors: Zekun Moore Wang, Zhongyuan Peng, Haoran Que, Jiaheng Liu, Wangchunshu Zhou, Yuhan Wu, Hongcheng Guo, Ruitong Gan, Zehao Ni, Jian Yang, Man Zhang, Zhaoxiang Zhang, Wanli Ouyang, Ke Xu, Stephen W. Huang, Jie Fu, Junran Peng

    Abstract: The advent of Large Language Models (LLMs) has paved the way for complex tasks such as role-playing, which enhances user interactions by enabling models to imitate various characters. However, the closed-source nature of state-of-the-art LLMs and their general-purpose training limit role-playing optimization. In this paper, we introduce RoleLLM, a framework to benchmark, elicit, and enhance role-p…

    Submitted 18 June, 2024; v1 submitted 1 October, 2023; originally announced October 2023.

    Comments: 30 pages, repo at https://github.com/InteractiveNLP-Team/RoleLLM-public

  39. arXiv:2309.10795 [pdf, other]

    eess.AS

    Exploring Speech Enhancement for Low-resource Speech Synthesis

    Authors: Zhaoheng Ni, Sravya Popuri, Ning Dong, Kohei Saijo, Xiaohui Zhang, Gael Le Lan, Yangyang Shi, Vikas Chandra, Changhan Wang

    Abstract: High-quality and intelligible speech is essential to text-to-speech (TTS) model training; however, obtaining high-quality data for low-resource languages is challenging and expensive. Applying speech enhancement to Automatic Speech Recognition (ASR) corpora mitigates the issue by augmenting the training data, while how the nonlinear speech distortion brought by speech enhancement models affects TTS…

    Submitted 19 September, 2023; originally announced September 2023.

    Comments: Submitted to ICASSP 2024

  40. arXiv:2309.10537 [pdf, other]

    eess.AS cs.MM cs.SD

    FoleyGen: Visually-Guided Audio Generation

    Authors: Xinhao Mei, Varun Nagaraja, Gael Le Lan, Zhaoheng Ni, Ernie Chang, Yangyang Shi, Vikas Chandra

    Abstract: Recent advancements in audio generation have been spurred by the evolution of large-scale deep learning models and expansive datasets. However, the task of video-to-audio (V2A) generation continues to be a challenge, principally because of the intricate relationship between the high-dimensional visual and auditory data, and the challenges associated with temporal synchronization. In this study, we…

    Submitted 19 September, 2023; originally announced September 2023.

  41. arXiv:2309.08804 [pdf, other]

    eess.AS cs.SD

    Stack-and-Delay: a new codebook pattern for music generation

    Authors: Gael Le Lan, Varun Nagaraja, Ernie Chang, David Kant, Zhaoheng Ni, Yangyang Shi, Forrest Iandola, Vikas Chandra

    Abstract: In language-modeling-based music generation, a generated waveform is represented by a sequence of hierarchical token stacks that can be decoded either in an auto-regressive manner or in parallel, depending on the codebook patterns. In particular, flattening the codebooks represents the highest-quality decoding strategy, while being notoriously slow. To this end, we propose a novel stack-and-delay…

    Submitted 15 September, 2023; originally announced September 2023.

  42. arXiv:2309.08773  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    Enhance audio generation controllability through representation similarity regularization

    Authors: Yangyang Shi, Gael Le Lan, Varun Nagaraja, Zhaoheng Ni, Xinhao Mei, Ernie Chang, Forrest Iandola, Yang Liu, Vikas Chandra

    Abstract: This paper presents an innovative approach to enhance control over audio generation by emphasizing the alignment between audio and text representations during model training. In the context of language model-based audio generation, the model leverages input from both textual and audio token representations to predict subsequent audio tokens. However, the current configuration lacks explicit regula…

    Submitted 15 September, 2023; originally announced September 2023.

    Comments: 5 pages

  43. arXiv:2309.07988  [pdf, other]

    cs.LG cs.AR cs.SD eess.AS

    Folding Attention: Memory and Power Optimization for On-Device Transformer-based Streaming Speech Recognition

    Authors: Yang Li, Liangzhen Lai, Yuan Shangguan, Forrest N. Iandola, Zhaoheng Ni, Ernie Chang, Yangyang Shi, Vikas Chandra

    Abstract: Transformer-based models excel in speech recognition. Existing efforts to optimize Transformer inference, typically for long-context applications, center on simplifying attention score calculations. However, streaming speech recognition models usually process a limited number of tokens each time, making attention score calculation less of a bottleneck. Instead, the bottleneck lies in the linear pr…

    Submitted 18 January, 2024; v1 submitted 14 September, 2023; originally announced September 2023.

  44. arXiv:2309.07726  [pdf, other]

    cs.RO

    GRID: Scene-Graph-based Instruction-driven Robotic Task Planning

    Authors: Zhe Ni, Xiaoxin Deng, Cong Tai, Xinyue Zhu, Qinghongbing Xie, Weihang Huang, Xiang Wu, Long Zeng

    Abstract: Recent works have shown that Large Language Models (LLMs) can facilitate the grounding of instructions for robotic task planning. Despite this progress, most existing works have primarily focused on utilizing raw images to aid LLMs in understanding environmental information. However, this approach not only limits the scope of observation but also typically necessitates extensive multimodal data co…

    Submitted 10 March, 2024; v1 submitted 14 September, 2023; originally announced September 2023.

    Comments: 8 pages, 10 figures

  45. Autonomous Stabilization of Fock States in an Oscillator against Multiphoton Losses

    Authors: Sai Li, Zhongchu Ni, Libo Zhang, Yanyan Cai, Jiasheng Mai, Shengcheng Wen, Pan Zheng, Xiaowei Deng, Song Liu, Yuan Xu, Dapeng Yu

    Abstract: Fock states with a well-defined number of photons in an oscillator have shown a wide range of applications in quantum information science. Nonetheless, their usefulness has been marred by single and multiple photon losses due to unavoidable environment-induced dissipation. Though several dissipation engineering methods have been developed to counteract the leading single-photon loss error, avertin…

    Submitted 16 May, 2024; v1 submitted 16 August, 2023; originally announced August 2023.

    Comments: Main text: 6 pages, 4 figures; Supplementary material: 7 pages, 5 figures, 4 tables

    Journal ref: Phys. Rev. Lett. 132, 203602 (2024)

  46. arXiv:2308.07608  [pdf, ps, other]

    math.CO

    Extremal problems for disjoint graphs

    Authors: Zhenyu Ni, Jing Wang, Liying Kang

    Abstract: For a simple graph $F$, let $\mathrm{EX}(n, F)$ and $\mathrm{EX_{sp}}(n,F)$ be the set of graphs with the maximum number of edges and the set of graphs with the maximum spectral radius in an $n$-vertex graph without any copy of the graph $F$, respectively. Let $F$ be a graph with $\mathrm{ex}(n,F)=e(T_{n,r})+O(1)$. In this paper, we show that $\mathrm{EX_{sp}}(n,kF)\subseteq \mathrm{EX}(n,kF)$ for…

    Submitted 15 August, 2023; originally announced August 2023.

    Comments: 23 pages. arXiv admin note: text overlap with arXiv:2306.16747

    MSC Class: 05C50; 05C35

  47. arXiv:2308.07249  [pdf, other]

    cond-mat.str-el

    Signatures of Z$_3$ Vestigial Potts-nematic order in van der Waals antiferromagnets

    Authors: Zhuoliang Ni, Daniil S. Antonenko, W. Joe Meese, Qi Tian, Nan Huang, Amanda V. Haglund, Matthew Cothrine, David G. Mandrus, Rafael M. Fernandes, Jörn W. F. Venderbos, Liang Wu

    Abstract: Layered van der Waals magnets have attracted much recent attention as a promising and versatile platform for exploring intrinsic two-dimensional magnetism. Within this broader class, the transition metal phosphorous trichalcogenides $M$P$X_3$ stand out as particularly interesting, as they provide a realization of honeycomb lattice magnetism and are known to display a variety of magnetic ordering p…

    Submitted 14 August, 2023; originally announced August 2023.

    Comments: 6 pages, 4 figures + supplementary material

  48. arXiv:2308.02552  [pdf, other]

    cs.CV

    Degeneration-Tuning: Using Scrambled Grid shield Unwanted Concepts from Stable Diffusion

    Authors: Zixuan Ni, Longhui Wei, Jiacheng Li, Siliang Tang, Yueting Zhuang, Qi Tian

    Abstract: Owing to the unrestricted nature of the content in the training data, large text-to-image diffusion models, such as Stable Diffusion (SD), are capable of generating images with potentially copyrighted or dangerous content based on the corresponding textual concept information. This includes specific intellectual property (IP), human faces, and various artistic styles. However, Negative Prompt, a wide…

    Submitted 7 August, 2023; v1 submitted 1 August, 2023; originally announced August 2023.

    Journal ref: ACM MM 2023

  49. A Knowledge-enhanced Two-stage Generative Framework for Medical Dialogue Information Extraction

    Authors: Zefa Hu, Ziyi Ni, Jing Shi, Shuang Xu, Bo Xu

    Abstract: This paper focuses on term-status pair extraction from medical dialogues (MD-TSPE), which is essential in diagnosis dialogue systems and the automatic scribe of electronic medical records (EMRs). In the past few years, works on MD-TSPE have attracted increasing research attention, especially after the remarkable progress made by generative methods. However, these generative methods output a whole…

    Submitted 19 February, 2024; v1 submitted 30 July, 2023; originally announced July 2023.

    Comments: Published in Machine Intelligence Research, https://link.springer.com/article/10.1007/s11633-023-1461-5

  50. arXiv:2307.00484  [pdf, other]

    quant-ph cond-mat.quant-gas

    Quantum Force Sensing by Digital Twinning of Atomic Bose-Einstein Condensates

    Authors: Tangyou Huang, Zhongcheng Yu, Zhongyi Ni, Xiaoji Zhou, Xiaopeng Li

    Abstract: High-sensitivity detection plays a vital role in scientific discoveries and technological applications. While intriguing methods utilizing collective many-body correlations and quantum entanglements have been developed in physics to enhance sensitivity, their practical implementation remains challenging due to rigorous technological requirements. Here, we propose an entirely data-driven approach that…

    Submitted 1 June, 2024; v1 submitted 2 July, 2023; originally announced July 2023.

    Comments: 10 pages, 4 figures