Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
Skip to main content

Showing 1–50 of 341 results for author: yu, K

Searching in archive cs. Search in all archives.
.
  1. arXiv:2407.19639  [pdf, other

    cs.CR

    Segmented Private Data Aggregation in the Multi-message Shuffle Model

    Authors: Shaowei Wang, Ruilin Yang, Sufen Zeng, Kaiqi Yu, Rundong Mei, Shaozheng Huang, Wei Yang

    Abstract: The shuffle model of differential privacy (DP) offers compelling privacy-utility trade-offs in decentralized settings (e.g., internet of things, mobile edge networks). Particularly, the multi-message shuffle model, where each user may contribute multiple messages, has shown that accuracy can approach that of the central model of DP. However, existing studies typically assume a uniform privacy prot… ▽ More

    Submitted 28 July, 2024; originally announced July 2024.

  2. arXiv:2407.13198  [pdf, other

    cs.SD eess.AS

    DiveSound: LLM-Assisted Automatic Taxonomy Construction for Diverse Audio Generation

    Authors: Baihan Li, Zeyu Xie, Xuenan Xu, Yiwei Guo, Ming Yan, Ji Zhang, Kai Yu, Mengyue Wu

    Abstract: Audio generation has attracted significant attention. Despite remarkable enhancement in audio quality, existing models overlook diversity evaluation. This is partially due to the lack of a systematic sound class diversity framework and a matching dataset. To address these issues, we propose DiveSound, a novel framework for constructing multimodal datasets with in-class diversified taxonomy, assist… ▽ More

    Submitted 18 July, 2024; originally announced July 2024.

  3. arXiv:2407.11337  [pdf, other

    cs.CV

    Continuity Preserving Online CenterLine Graph Learning

    Authors: Yunhui Han, Kun Yu, Zhiwei Li

    Abstract: Lane topology, which is usually modeled by a centerline graph, is essential for high-level autonomous driving. For a high-quality graph, both topology connectivity and spatial continuity of centerline segments are critical. However, most of existing approaches pay more attention to connectivity while neglect the continuity. Such kind of centerline graph usually cause problem to planning of autonom… ▽ More

    Submitted 15 July, 2024; originally announced July 2024.

    Comments: ECCV 2024

  4. arXiv:2407.11034  [pdf

    cs.LG

    Bridging Data Gaps in Healthcare: A Scoping Review of Transfer Learning in Biomedical Data Analysis

    Authors: Siqi Li, Xin Li, Kunyu Yu, Di Miao, Mingcheng Zhu, Mengying Yan, Yuhe Ke, Danny D'Agostino, Yilin Ning, Qiming Wu, Ziwen Wang, Yuqing Shang, Molei Liu, Chuan Hong, Nan Liu

    Abstract: Clinical and biomedical research in low-resource settings often faces significant challenges due to the need for high-quality data with sufficient sample sizes to construct effective models. These constraints hinder robust model training and prompt researchers to seek methods for leveraging existing knowledge from related studies to support new research efforts. Transfer learning (TL), a machine l… ▽ More

    Submitted 4 July, 2024; originally announced July 2024.

  5. arXiv:2407.10956  [pdf, other

    cs.AI cs.CL

    Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?

    Authors: Ruisheng Cao, Fangyu Lei, Haoyuan Wu, Jixuan Chen, Yeqiao Fu, Hongcheng Gao, Xinzhuang Xiong, Hanchong Zhang, Yuchen Mao, Wenjing Hu, Tianbao Xie, Hongshen Xu, Danyang Zhang, Sida Wang, Ruoxi Sun, Pengcheng Yin, Caiming Xiong, Ansong Ni, Qian Liu, Victor Zhong, Lu Chen, Kai Yu, Tao Yu

    Abstract: Data science and engineering workflows often span multiple stages, from warehousing to orchestration, using tools like BigQuery, dbt, and Airbyte. As vision language models (VLMs) advance in multimodal understanding and code generation, VLM-based agents could potentially automate these workflows by generating SQL queries, Python code, and GUI operations. This automation can improve the productivit… ▽ More

    Submitted 15 July, 2024; originally announced July 2024.

    Comments: 34 pages, 14 figures, 10 tables

  6. arXiv:2407.08672  [pdf, other

    cs.CV

    NODE-Adapter: Neural Ordinary Differential Equations for Better Vision-Language Reasoning

    Authors: Yi Zhang, Chun-Wun Cheng, Ke Yu, Zhihai He, Carola-Bibiane Schönlieb, Angelica I. Aviles-Rivero

    Abstract: In this paper, we consider the problem of prototype-based vision-language reasoning problem. We observe that existing methods encounter three major challenges: 1) escalating resource demands and prolonging training times, 2) contending with excessive learnable parameters, and 3) fine-tuning based only on a single modality. These challenges will hinder their capability to adapt Vision-Language Mode… ▽ More

    Submitted 11 July, 2024; originally announced July 2024.

  7. arXiv:2407.06334  [pdf, other

    cs.AI q-bio.QM

    Double-Ended Synthesis Planning with Goal-Constrained Bidirectional Search

    Authors: Kevin Yu, Jihye Roh, Ziang Li, Wenhao Gao, Runzhong Wang, Connor W. Coley

    Abstract: Computer-aided synthesis planning (CASP) algorithms have demonstrated expert-level abilities in planning retrosynthetic routes to molecules of low to moderate complexity. However, current search methods assume the sufficiency of reaching arbitrary building blocks, failing to address the common real-world constraint where using specific molecules is desired. To this end, we present a formulation of… ▽ More

    Submitted 8 July, 2024; originally announced July 2024.

    Comments: 10 pages main, 4 figures

  8. arXiv:2407.03892  [pdf, other

    cs.SD cs.AI eess.AS

    On the Effectiveness of Acoustic BPE in Decoder-Only TTS

    Authors: Bohan Li, Feiyu Shen, Yiwei Guo, Shuai Wang, Xie Chen, Kai Yu

    Abstract: Discretizing speech into tokens and generating them by a decoder-only model have been a promising direction for text-to-speech (TTS) and spoken language modeling (SLM). To shorten the sequence length of speech tokens, acoustic byte-pair encoding (BPE) has emerged in SLM that treats speech tokens from self-supervised semantic representations as characters to further compress the token sequence. But… ▽ More

    Submitted 4 July, 2024; originally announced July 2024.

    Comments: 5 pages, 3 tables, 1 figures. accepted to Interspeech 2024

  9. arXiv:2407.02350  [pdf, other

    cs.CV

    Conceptual Codebook Learning for Vision-Language Models

    Authors: Yi Zhang, Ke Yu, Siqi Wu, Zhihai He

    Abstract: In this paper, we propose Conceptual Codebook Learning (CoCoLe), a novel fine-tuning method for vision-language models (VLMs) to address the challenge of improving the generalization capability of VLMs while fine-tuning them on downstream tasks in a few-shot setting. We recognize that visual concepts, such as textures, shapes, and colors are naturally transferable across domains and play a crucial… ▽ More

    Submitted 15 July, 2024; v1 submitted 2 July, 2024; originally announced July 2024.

    Comments: Accepted by ECCV 2024

  10. arXiv:2407.01093  [pdf, other

    cs.CL cs.AI cs.MA

    IBSEN: Director-Actor Agent Collaboration for Controllable and Interactive Drama Script Generation

    Authors: Senyu Han, Lu Chen, Li-Min Lin, Zhengshan Xu, Kai Yu

    Abstract: Large language models have demonstrated their capabilities in storyline creation and human-like character role-playing. Current language model agents mainly focus on reasonable behaviors from the level of individuals, and their behaviors might be hard to constraint on the level of the whole storyline. In this paper we introduce IBSEN, a director-actor coordinate agent framework that generates dram… ▽ More

    Submitted 1 July, 2024; originally announced July 2024.

    Comments: Accepted by ACL 2024 Main

  11. arXiv:2407.00630  [pdf, other

    cs.CR

    BAZAM: A Blockchain-Assisted Zero-Trust Authentication in Multi-UAV Wireless Networks

    Authors: Mingyue Xie, Zheng Chang, Osama Alfarraj, Keping Yu, Tao Chen, Hongwei Li

    Abstract: Unmanned aerial vehicles (UAVs) are vulnerable to interception and attacks when operated remotely without a unified and efficient identity authentication. Meanwhile, the openness of wireless communication environments potentially leads to data leakage and system paralysis. However, conventional authentication schemes in the UAV network are system-centric, failing to adapt to the diversity of UAVs… ▽ More

    Submitted 30 June, 2024; originally announced July 2024.

  12. arXiv:2407.00466  [pdf, other

    cs.CL cs.AI

    BioKGBench: A Knowledge Graph Checking Benchmark of AI Agent for Biomedical Science

    Authors: Xinna Lin, Siqi Ma, Junjie Shan, Xiaojing Zhang, Shell Xu Hu, Tiannan Guo, Stan Z. Li, Kaicheng Yu

    Abstract: Pursuing artificial intelligence for biomedical science, a.k.a. AI Scientist, draws increasing attention, where one common approach is to build a copilot agent driven by Large Language Models (LLMs). However, to evaluate such systems, people either rely on direct Question-Answering (QA) to the LLM itself, or in a biomedical experimental manner. How to precisely benchmark biomedical agents from an… ▽ More

    Submitted 29 June, 2024; originally announced July 2024.

  13. arXiv:2406.19705  [pdf, other

    cs.AI

    DISCO: Efficient Diffusion Solver for Large-Scale Combinatorial Optimization Problems

    Authors: Kexiong Yu, Hang Zhao, Yuhang Huang, Renjiao Yi, Kai Xu, Chenyang Zhu

    Abstract: Combinatorial Optimization (CO) problems are fundamentally crucial in numerous practical applications across diverse industries, characterized by entailing enormous solution space and demanding time-sensitive response. Despite significant advancements made by recent neural solvers, their limited expressiveness does not conform well to the multi-modal nature of CO landscapes. While some research ha… ▽ More

    Submitted 19 July, 2024; v1 submitted 28 June, 2024; originally announced June 2024.

  14. arXiv:2406.14673  [pdf, other

    cs.CL

    Insights into LLM Long-Context Failures: When Transformers Know but Don't Tell

    Authors: Taiming Lu, Muhan Gao, Kuai Yu, Adam Byerly, Daniel Khashabi

    Abstract: Large Language Models (LLMs) exhibit positional bias, struggling to utilize information from the middle or end of long contexts. Our study explores LLMs' long-context reasoning by probing their hidden representations. We find that while LLMs encode the position of target information, they often fail to leverage this in generating accurate responses. This reveals a disconnect between information re… ▽ More

    Submitted 20 June, 2024; originally announced June 2024.

  15. arXiv:2406.11546  [pdf, other

    eess.AS cs.CL cs.SD

    GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement

    Authors: Yifan Yang, Zheshu Song, Jianheng Zhuo, Mingyu Cui, Jinpeng Li, Bo Yang, Yexing Du, Ziyang Ma, Xunying Liu, Ziyuan Wang, Ke Li, Shuai Fan, Kai Yu, Wei-Qiang Zhang, Guoguo Chen, Xie Chen

    Abstract: The evolution of speech technology has been spurred by the rapid increase in dataset sizes. Traditional speech models generally depend on a large amount of labeled training data, which is scarce for low-resource languages. This paper presents GigaSpeech 2, a large-scale, multi-domain, multilingual speech recognition corpus. It is designed for low-resource languages and does not rely on paired spee… ▽ More

    Submitted 17 June, 2024; originally announced June 2024.

    Comments: Under review

  16. arXiv:2406.10358  [pdf, other

    cs.CR eess.SY

    I Still See You: Why Existing IoT Traffic Reshaping Fails

    Authors: Su Wang, Keyang Yu, Qi Li, Dong Chen

    Abstract: The Internet traffic data produced by the Internet of Things (IoT) devices are collected by Internet Service Providers (ISPs) and device manufacturers, and often shared with their third parties to maintain and enhance user services. Unfortunately, on-path adversaries could infer and fingerprint users' sensitive privacy information such as occupancy and user activities by analyzing these network tr… ▽ More

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: EWSN'24 paper accepted, to appear

  17. arXiv:2406.09317  [pdf, other

    eess.IV cs.CV

    Common and Rare Fundus Diseases Identification Using Vision-Language Foundation Model with Knowledge of Over 400 Diseases

    Authors: Meng Wang, Tian Lin, Aidi Lin, Kai Yu, Yuanyuan Peng, Lianyu Wang, Cheng Chen, Ke Zou, Huiyu Liang, Man Chen, Xue Yao, Meiqin Zhang, Binwei Huang, Chaoxin Zheng, Peixin Zhang, Wei Chen, Yilong Luo, Yifan Chen, Honghe Xia, Tingkun Shi, Qi Zhang, Jinming Guo, Xiaolin Chen, Jingcheng Wang, Yih Chung Tham , et al. (24 additional authors not shown)

    Abstract: Previous foundation models for retinal images were pre-trained with limited disease categories and knowledge base. Here we introduce RetiZero, a vision-language foundation model that leverages knowledge from over 400 fundus diseases. To RetiZero's pre-training, we compiled 341,896 fundus images paired with text descriptions, sourced from public datasets, ophthalmic literature, and online resources… ▽ More

    Submitted 30 June, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

  18. arXiv:2406.08052  [pdf, other

    cs.SD eess.AS

    FakeSound: Deepfake General Audio Detection

    Authors: Zeyu Xie, Baihan Li, Xuenan Xu, Zheng Liang, Kai Yu, Mengyue Wu

    Abstract: With the advancement of audio generation, generative models can produce highly realistic audios. However, the proliferation of deepfake general audio can pose negative consequences. Therefore, we propose a new task, deepfake general audio detection, which aims to identify whether audio content is manipulated and to locate deepfake regions. Leveraging an automated manipulation pipeline, a dataset n… ▽ More

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: Accepted by INTERSPEECH 2024

    MSC Class: 68Txx ACM Class: I.2

  19. arXiv:2406.06962  [pdf, other

    cs.CL cs.AI

    Evolving Subnetwork Training for Large Language Models

    Authors: Hanqi Li, Lu Chen, Da Ma, Zijian Wu, Su Zhu, Kai Yu

    Abstract: Large language models have ushered in a new era of artificial intelligence research. However, their substantial training costs hinder further development and widespread adoption. In this paper, inspired by the redundancy in the parameters of large language models, we propose a novel training paradigm: Evolving Subnetwork Training (EST). EST samples subnetworks from the layers of the large language… ▽ More

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: Accepted to ICML 2024

  20. arXiv:2406.05766  [pdf, other

    cs.LG cs.AI cs.CL cs.CV

    Gentle-CLIP: Exploring Aligned Semantic In Low-Quality Multimodal Data With Soft Alignment

    Authors: Zijia Song, Zelin Zang, Yelin Wang, Guozheng Yang, Jiangbin Zheng, Kaicheng yu, Wanyu Chen, Stan Z. Li

    Abstract: Multimodal fusion breaks through the barriers between diverse modalities and has already yielded numerous impressive performances. However, in various specialized fields, it is struggling to obtain sufficient alignment data for the training process, which seriously limits the use of previously elegant models. Thus, semi-supervised learning attempts to achieve multimodal alignment with fewer matche… ▽ More

    Submitted 9 June, 2024; originally announced June 2024.

  21. arXiv:2406.05343  [pdf, other

    cs.AI cs.CL

    M3GIA: A Cognition Inspired Multilingual and Multimodal General Intelligence Ability Benchmark

    Authors: Wei Song, Yadong Li, Jianhua Xu, Guowei Wu, Lingfeng Ming, Kexin Yi, Weihua Luo, Houyi Li, Yi Du, Fangda Guo, Kaicheng Yu

    Abstract: As recent multi-modality large language models (MLLMs) have shown formidable proficiency on various complex tasks, there has been increasing attention on debating whether these models could eventually mirror human intelligence. However, existing benchmarks mainly focus on evaluating solely on task performance, such as the accuracy of identifying the attribute of an object. Combining well-developed… ▽ More

    Submitted 14 June, 2024; v1 submitted 8 June, 2024; originally announced June 2024.

  22. arXiv:2406.01392  [pdf, other

    cs.CL

    Sparsity-Accelerated Training for Large Language Models

    Authors: Da Ma, Lu Chen, Pengyu Wang, Hongshen Xu, Hanqi Li, Liangtai Sun, Su Zhu, Shuai Fan, Kai Yu

    Abstract: Large language models (LLMs) have demonstrated proficiency across various natural language processing (NLP) tasks but often require additional training, such as continual pre-training and supervised fine-tuning. However, the costs associated with this, primarily due to their large parameter count, remain high. This paper proposes leveraging \emph{sparsity} in pre-trained LLMs to expedite this trai… ▽ More

    Submitted 6 June, 2024; v1 submitted 3 June, 2024; originally announced June 2024.

    Comments: Accepted to ACL 2024 Findings

  23. arXiv:2406.01349  [pdf, other

    cs.CV

    Unleashing Generalization of End-to-End Autonomous Driving with Controllable Long Video Generation

    Authors: Enhui Ma, Lijun Zhou, Tao Tang, Zhan Zhang, Dong Han, Junpeng Jiang, Kun Zhan, Peng Jia, Xianpeng Lang, Haiyang Sun, Di Lin, Kaicheng Yu

    Abstract: Using generative models to synthesize new data has become a de-facto standard in autonomous driving to address the data scarcity issue. Though existing approaches are able to boost perception models, we discover that these approaches fail to improve the performance of planning of end-to-end autonomous driving models as the generated videos are usually less than 8 frames and the spatial and tempora… ▽ More

    Submitted 6 June, 2024; v1 submitted 3 June, 2024; originally announced June 2024.

    Comments: Project Page: https://westlake-autolab.github.io/delphi.github.io/, 8 figures

  24. arXiv:2406.00617  [pdf, other

    cs.DB cs.SI

    Maximum $k$-Plex Search: An Alternated Reduction-and-Bound Method

    Authors: Shuohao Gao, Kaiqiang Yu, Shengxin Liu, Cheng Long

    Abstract: $k$-plexes relax cliques by allowing each vertex to disconnect to at most $k$ vertices. Finding a maximum $k… ▽ More

    Submitted 2 June, 2024; originally announced June 2024.

  25. arXiv:2405.20220  [pdf, other

    cs.DC cs.CY

    BeerReview: A Blockchain-enabled Peer Review Platform

    Authors: Guodong Jin, Zihan Zhou, Wenzheng Tang, Kanglei Yu, Hao Xu, Erwu Liu

    Abstract: In an era of increasing concerns over intellectual property rights, traditional peer review systems face challenges including plagiarism, malicious attacks, and unauthorized data access. BeerReview, a blockchain-enabled peer review platform, offers a robust solution, enabling experts and scholars to participate actively in the review process without concerns about plagiarism or security threats. F… ▽ More

    Submitted 30 May, 2024; originally announced May 2024.

  26. arXiv:2405.17670  [pdf, other

    cs.RO

    Deployment of NLP and LLM Techniques to Control Mobile Robots at the Edge: A Case Study Using GPT-4-Turbo and LLaMA 2

    Authors: Pascal Sikorski, Leendert Schrader, Kaleb Yu, Lucy Billadeau, Jinka Meenakshi, Naveena Mutharasan, Flavio Esposito, Hadi AliAkbarpour, Madi Babaiasl

    Abstract: This paper investigates the possibility of intuitive human-robot interaction through the application of Natural Language Processing (NLP) and Large Language Models (LLMs) in mobile robotics. We aim to explore the feasibility of using these technologies for edge-based deployment, where traditional cloud dependencies are eliminated. The study specifically contrasts the performance of GPT-4-Turbo, wh… ▽ More

    Submitted 19 June, 2024; v1 submitted 27 May, 2024; originally announced May 2024.

  27. arXiv:2405.17665  [pdf, other

    cs.RO

    Enhanced Robot Arm at the Edge with NLP and Vision Systems

    Authors: Pascal Sikorski, Kaleb Yu, Lucy Billadeau, Flavio Esposito, Hadi AliAkbarpour, Madi Babaiasl

    Abstract: This paper introduces a "proof of concept" for a new approach to assistive robotics, integrating edge computing with Natural Language Processing (NLP) and computer vision to enhance the interaction between humans and robotic systems. Our "proof of concept" demonstrates the feasibility of using large language models (LLMs) and vision systems in tandem for interpreting and executing complex commands… ▽ More

    Submitted 19 June, 2024; v1 submitted 27 May, 2024; originally announced May 2024.

  28. arXiv:2405.16393  [pdf, other

    cs.CV cs.AI

    Disentangling Foreground and Background Motion for Enhanced Realism in Human Video Generation

    Authors: Jinlin Liu, Kai Yu, Mengyang Feng, Xiefan Guo, Miaomiao Cui

    Abstract: Recent advancements in human video synthesis have enabled the generation of high-quality videos through the application of stable diffusion models. However, existing methods predominantly concentrate on animating solely the human element (the foreground) guided by pose information, while leaving the background entirely static. Contrary to this, in authentic, high-quality videos, backgrounds often… ▽ More

    Submitted 28 May, 2024; v1 submitted 25 May, 2024; originally announced May 2024.

  29. arXiv:2405.16078   

    cs.IT

    An Multi-resources Integration Empowered Task Offloading in Internet of Vehicles: From the Perspective of Wireless Interference

    Authors: Xiaowu Liu, Yun Wang, Kan Yu, Dianxia Chen, Dong Li, Qixun Zhang, Zhiyong Feng

    Abstract: The task offloading technology plays a vital role in the Internet of Vehicles (IoV), by satisfying the diversified demands of the vehicles, such as the energy consumption and processing latency of the computing task. Different from the previous works, on the one hand, they ignored the wireless interference of communications among vehicle-to-vehicle (V2V), as well as between vehicles and roadside u… ▽ More

    Submitted 25 June, 2024; v1 submitted 25 May, 2024; originally announced May 2024.

    Comments: The paper has been rejected by IEEE Transactions on Communications, apart from the Reviewers' comments, we need reconsider that inaccuracies in the data or results were identified post-submission, necessitating a withdrawal for correction. In addition, considering the plausibility of the simulations, one or more of the authors requested the withdrawal of the manuscript

  30. arXiv:2405.16062  [pdf, other

    cs.IT eess.SP

    Movable Antenna Empowered Physical Layer Security Without Eve's CSI: Joint Optimization of Beamforming and Antenna Positions

    Authors: Zhiyong Feng, Yujia Zhao, Kan Yu, Dong Li

    Abstract: Physical layer security (PLS) technology based on the fixed-position antenna (FPA) has {attracted widespread attention}. Due to the fixed feature of the antennas, current FPA-based PLS schemes cannot fully utilize the spatial degree of freedom, and thus a weaken secure gain in the desired/undesired direction may exist. Different from the concept of FPA, mobile antenna (MA) is a novel technology th… ▽ More

    Submitted 25 May, 2024; originally announced May 2024.

  31. arXiv:2405.16060  [pdf, other

    cs.IT

    Delay-Effective Task Offloading Technology in Internet of Vehicles: From the Perspective of the Vehicle Platooning

    Authors: Kan Yu, Fuze Zhu, Xiaowu Liu, Zhiyong Feng, Dong Li

    Abstract: The task offloading technology plays a crucial vital role in the Internet of Vehicle (IoV) with the demands of delay minimum, by jointly optimizing the heterogeneous computing resources supported by the vehicles, roadside units (RSUs), and macro base stations (MBSs). In previous works, on the one hand, they ignored the wireless interference among the exchange and sharing of the task data. On the o… ▽ More

    Submitted 25 May, 2024; originally announced May 2024.

  32. arXiv:2405.10300  [pdf, other

    cs.CV

    Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection

    Authors: Tianhe Ren, Qing Jiang, Shilong Liu, Zhaoyang Zeng, Wenlong Liu, Han Gao, Hongjie Huang, Zhengyu Ma, Xiaoke Jiang, Yihao Chen, Yuda Xiong, Hao Zhang, Feng Li, Peijun Tang, Kent Yu, Lei Zhang

    Abstract: This paper introduces Grounding DINO 1.5, a suite of advanced open-set object detection models developed by IDEA Research, which aims to advance the "Edge" of open-set object detection. The suite encompasses two models: Grounding DINO 1.5 Pro, a high-performance model designed for stronger generalization capability across a wide range of scenarios, and Grounding DINO 1.5 Edge, an efficient model o… ▽ More

    Submitted 31 May, 2024; v1 submitted 16 May, 2024; originally announced May 2024.

    Comments: homepage: https://deepdataspace.com/home

  33. arXiv:2405.09276  [pdf, other

    cs.LG cs.AI cs.DC

    Dual-Segment Clustering Strategy for Federated Learning in Heterogeneous Environments

    Authors: Pengcheng Sun, Erwu Liu, Wei Ni, Kanglei Yu, Rui Wang, Abbas Jamalipour

    Abstract: Federated learning (FL) is a distributed machine learning paradigm with high efficiency and low communication load, only transmitting parameters or gradients of network. However, the non-independent and identically distributed (Non-IID) data characteristic has a negative impact on this paradigm. Furthermore, the heterogeneity of communication quality will significantly affect the accuracy of param… ▽ More

    Submitted 15 May, 2024; originally announced May 2024.

  34. arXiv:2405.03121  [pdf, other

    cs.CV cs.AI

    AniTalker: Animate Vivid and Diverse Talking Faces through Identity-Decoupled Facial Motion Encoding

    Authors: Tao Liu, Feilong Chen, Shuai Fan, Chenpeng Du, Qi Chen, Xie Chen, Kai Yu

    Abstract: The paper introduces AniTalker, an innovative framework designed to generate lifelike talking faces from a single portrait. Unlike existing models that primarily focus on verbal cues such as lip synchronization and fail to capture the complex dynamics of facial expressions and nonverbal cues, AniTalker employs a universal motion representation. This innovative representation effectively captures a… ▽ More

    Submitted 5 May, 2024; originally announced May 2024.

    Comments: 14 pages, 7 figures

  35. arXiv:2405.02712  [pdf, other

    cs.CL

    CoE-SQL: In-Context Learning for Multi-Turn Text-to-SQL with Chain-of-Editions

    Authors: Hanchong Zhang, Ruisheng Cao, Hongshen Xu, Lu Chen, Kai Yu

    Abstract: Recently, Large Language Models (LLMs) have been demonstrated to possess impressive capabilities in a variety of domains and tasks. We investigate the issue of prompt design in the multi-turn text-to-SQL task and attempt to enhance the LLMs' reasoning capacity when generating SQL queries. In the conversational context, the current SQL query can be modified from the preceding SQL query with only a… ▽ More

    Submitted 4 May, 2024; originally announced May 2024.

  36. arXiv:2405.02429  [pdf, other

    cs.IR cs.AI cs.CL cs.LG

    CALRec: Contrastive Alignment of Generative LLMs For Sequential Recommendation

    Authors: Yaoyiran Li, Xiang Zhai, Moustafa Alzantot, Keyi Yu, Ivan Vulić, Anna Korhonen, Mohamed Hammad

    Abstract: Traditional recommender systems such as matrix factorization methods rely on learning a shared dense embedding space to represent both items and user preferences. Sequence models such as RNN, GRUs, and, recently, Transformers have also excelled in the task of sequential recommendation. This task requires understanding the sequential structure present in users' historical interactions to predict th… ▽ More

    Submitted 3 May, 2024; originally announced May 2024.

  37. arXiv:2404.19723  [pdf, other

    eess.AS cs.SD

    Attention-Constrained Inference for Robust Decoder-Only Text-to-Speech

    Authors: Hankun Wang, Chenpeng Du, Yiwei Guo, Shuai Wang, Xie Chen, Kai Yu

    Abstract: Recent popular decoder-only text-to-speech models are known for their ability of generating natural-sounding speech. However, such models sometimes suffer from word skipping and repeating due to the lack of explicit monotonic alignment constraints. In this paper, we notice from the attention maps that some particular attention heads of the decoder-only model indicate the alignments between speech… ▽ More

    Submitted 30 April, 2024; originally announced April 2024.

  38. arXiv:2404.14946  [pdf, other

    cs.SD cs.CL eess.AS

    StoryTTS: A Highly Expressive Text-to-Speech Dataset with Rich Textual Expressiveness Annotations

    Authors: Sen Liu, Yiwei Guo, Xie Chen, Kai Yu

    Abstract: While acoustic expressiveness has long been studied in expressive text-to-speech (ETTS), the inherent expressiveness in text lacks sufficient attention, especially for ETTS of artistic works. In this paper, we introduce StoryTTS, a highly ETTS dataset that contains rich expressiveness both in acoustic and textual perspective, from the recording of a Mandarin storytelling show. A systematic and com… ▽ More

    Submitted 23 April, 2024; originally announced April 2024.

    Comments: Accepted by ICASSP 2024

    Journal ref: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 11521-11525

  39. arXiv:2404.06079  [pdf, other

    eess.AS cs.AI

    The X-LANCE Technical Report for Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge

    Authors: Yiwei Guo, Chenrun Wang, Yifan Yang, Hankun Wang, Ziyang Ma, Chenpeng Du, Shuai Wang, Hanzheng Li, Shuai Fan, Hui Zhang, Xie Chen, Kai Yu

    Abstract: Discrete speech tokens have been more and more popular in multiple speech processing fields, including automatic speech recognition (ASR), text-to-speech (TTS) and singing voice synthesis (SVS). In this paper, we describe the systems developed by the SJTU X-LANCE group for the TTS (acoustic + vocoder), SVS, and ASR tracks in the Interspeech 2024 Speech Processing Using Discrete Speech Unit Challen… ▽ More

    Submitted 9 April, 2024; v1 submitted 9 April, 2024; originally announced April 2024.

    Comments: 5 pages, 3 figures. Report of a challenge

  40. arXiv:2404.05538  [pdf, other

    cs.IT cs.LG eess.SP

    Cell-Free Multi-User MIMO Equalization via In-Context Learning

    Authors: Matteo Zecchin, Kai Yu, Osvaldo Simeone

    Abstract: Large pre-trained sequence models, such as transformers, excel as few-shot learners capable of in-context learning (ICL). In ICL, a model is trained to adapt its operation to a new task based on limited contextual information, typically in the form of a few training examples for the given task. Previous work has explored the use of ICL for channel equalization in single-user multi-input and multip… ▽ More

    Submitted 11 April, 2024; v1 submitted 8 April, 2024; originally announced April 2024.

  41. arXiv:2404.04748  [pdf, other

    cs.CL

    Multilingual Brain Surgeon: Large Language Models Can be Compressed Leaving No Language Behind

    Authors: Hongchuan Zeng, Hongshen Xu, Lu Chen, Kai Yu

    Abstract: Large Language Models (LLMs) have ushered in a new era in Natural Language Processing, but their massive size demands effective compression techniques for practicality. Although numerous model compression techniques have been investigated, they typically rely on a calibration set that overlooks the multilingual context and results in significant accuracy degradation for low-resource languages. Thi… ▽ More

    Submitted 6 April, 2024; originally announced April 2024.

    Comments: 22 pages, 8 figures, 13 tables. Accepted by LREC-COLING 2024

  42. arXiv:2403.18349  [pdf, other

    cs.CL

    Rejection Improves Reliability: Training LLMs to Refuse Unknown Questions Using RL from Knowledge Feedback

    Authors: Hongshen Xu, Zichen Zhu, Situo Zhang, Da Ma, Shuai Fan, Lu Chen, Kai Yu

    Abstract: Large Language Models (LLMs) often generate erroneous outputs, known as hallucinations, due to their limitations in discerning questions beyond their knowledge scope. While addressing hallucination has been a focal point in research, previous efforts primarily concentrate on enhancing correctness without giving due consideration to the significance of rejection mechanisms. In this paper, we conduc… ▽ More

    Submitted 7 April, 2024; v1 submitted 27 March, 2024; originally announced March 2024.

  43. arXiv:2403.16107  [pdf, other

    cs.HC

    Designing Upper-Body Gesture Interaction with and for People with Spinal Muscular Atrophy in VR

    Authors: Jingze Tian, Yingna Wang, Keye Yu, Liyi Xu, Junan Xie, Franklin Mingzhe Li, Yafeng Niu, Mingming Fan

    Abstract: Recent research proposed gaze-assisted gestures to enhance interaction within virtual reality (VR), providing opportunities for people with motor impairments to experience VR. Compared to people with other motor impairments, those with Spinal Muscular Atrophy (SMA) exhibit enhanced distal limb mobility, providing them with more design space. However, it remains unknown what gaze-assisted upper-bod… ▽ More

    Submitted 24 March, 2024; originally announced March 2024.

    Comments: Proceedings of the CHI Conference on Human Factors in Computing Systems (CHI '24), May 11--16, 2024, Honolulu, HI, USA

  44. arXiv:2403.13332  [pdf, other

    eess.AS cs.SD

    TDT-KWS: Fast And Accurate Keyword Spotting Using Token-and-duration Transducer

    Authors: Yu Xi, Hao Li, Baochen Yang, Haoyu Li, Hainan Xu, Kai Yu

    Abstract: Designing an efficient keyword spotting (KWS) system that delivers exceptional performance on resource-constrained edge devices has long been a subject of significant attention. Existing KWS search algorithms typically follow a frame-synchronous approach, where search decisions are made repeatedly at each frame despite the fact that most frames are keyword-irrelevant. In this paper, we propose TDT… ▽ More

    Submitted 20 March, 2024; originally announced March 2024.

    Comments: Accepted by ICASSP2024

  45. arXiv:2403.08295  [pdf, other

    cs.CL cs.AI

    Gemma: Open Models Based on Gemini Research and Technology

    Authors: Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, Pouya Tafti, Léonard Hussenot, Pier Giuseppe Sessa, Aakanksha Chowdhery, Adam Roberts, Aditya Barua, Alex Botev, Alex Castro-Ros, Ambrose Slone, Amélie Héliou, Andrea Tacchetti, Anna Bulanova, Antonia Paterson, Beth Tsai, Bobak Shahriari , et al. (83 additional authors not shown)

    Abstract: This work introduces Gemma, a family of lightweight, state-of-the art open models built from the research and technology used to create Gemini models. Gemma models demonstrate strong performance across academic benchmarks for language understanding, reasoning, and safety. We release two sizes of models (2 billion and 7 billion parameters), and provide both pretrained and fine-tuned checkpoints. Ge… ▽ More

    Submitted 16 April, 2024; v1 submitted 13 March, 2024; originally announced March 2024.

  46. arXiv:2403.06419  [pdf, other

    cs.LG

    Causal Multi-Label Feature Selection in Federated Setting

    Authors: Yukun Song, Dayuan Cao, Jiali Miao, Shuai Yang, Kui Yu

    Abstract: Multi-label feature selection serves as an effective mean for dealing with high-dimensional multi-label data. To achieve satisfactory performance, existing methods for multi-label feature selection often require the centralization of substantial data from multiple sources. However, in Federated setting, centralizing data from all sources and merging them into a single dataset is not feasible. To t… ▽ More

    Submitted 11 March, 2024; originally announced March 2024.

  47. arXiv:2403.04652  [pdf, other

    cs.CL cs.AI

    Yi: Open Foundation Models by 01.AI

    Authors: 01. AI, :, Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, Kaidong Yu, Peng Liu, Qiang Liu, Shawn Yue, Senbin Yang, Shiming Yang, Tao Yu, Wen Xie, Wenhao Huang, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Pengcheng Nie , et al. (7 additional authors not shown)

    Abstract: We introduce the Yi model family, a series of language and multimodal models that demonstrate strong multi-dimensional capabilities. The Yi model family is based on 6B and 34B pretrained language models, then we extend them to chat models, 200K long context models, depth-upscaled models, and vision-language models. Our base models achieve strong performance on a wide range of benchmarks like MMLU,… ▽ More

    Submitted 7 March, 2024; originally announced March 2024.

  48. arXiv:2403.04594  [pdf, other

    cs.SD eess.AS

    A Detailed Audio-Text Data Simulation Pipeline using Single-Event Sounds

    Authors: Xuenan Xu, Xiaohang Xu, Zeyu Xie, Pingyue Zhang, Mengyue Wu, Kai Yu

    Abstract: Recently, there has been an increasing focus on audio-text cross-modal learning. However, most of the existing audio-text datasets contain only simple descriptions of sound events. Compared with classification labels, the advantages of such descriptions are significantly limited. In this paper, we first analyze the detailed information that human descriptions of audio may contain beyond sound even… ▽ More

    Submitted 7 March, 2024; originally announced March 2024.

  49. arXiv:2403.02574  [pdf, other

    cs.IR cs.AI cs.CL

    ChatCite: LLM Agent with Human Workflow Guidance for Comparative Literature Summary

    Authors: Yutong Li, Lu Chen, Aiwei Liu, Kai Yu, Lijie Wen

    Abstract: The literature review is an indispensable step in the research process. It provides the benefit of comprehending the research problem and understanding the current research situation while conducting a comparative analysis of prior works. However, literature summary is challenging and time consuming. The previous LLM-based studies on literature review mainly focused on the complete process, includ… ▽ More

    Submitted 4 March, 2024; originally announced March 2024.

    Comments: 18 pages, 5 figures

    MSC Class: 68T50 ACM Class: I.2.7

  50. arXiv:2403.01517  [pdf, other

    cs.CV

    MatchU: Matching Unseen Objects for 6D Pose Estimation from RGB-D Images

    Authors: Junwen Huang, Hao Yu, Kuan-Ting Yu, Nassir Navab, Slobodan Ilic, Benjamin Busam

    Abstract: Recent learning methods for object pose estimation require resource-intensive training for each individual object instance or category, hampering their scalability in real applications when confronted with previously unseen objects. In this paper, we propose MatchU, a Fuse-Describe-Match strategy for 6D pose estimation from RGB-D images. MatchU is a generic approach that fuses 2D texture and 3D ge… ▽ More

    Submitted 8 May, 2024; v1 submitted 3 March, 2024; originally announced March 2024.