Showing 1–50 of 258 results for author: Lin, K

Searching in archive cs.
  1. arXiv:2407.18982  [pdf, other]

    cs.CR cs.AI cs.DC cs.LG

    Low-Latency Privacy-Preserving Deep Learning Design via Secure MPC

    Authors: Ke Lin, Yasir Glani, Ping Luo

    Abstract: Secure multi-party computation (MPC) facilitates privacy-preserving computation between multiple parties without leaking private information. While most secure deep learning techniques utilize MPC operations to achieve feasible privacy-preserving machine learning on downstream tasks, the overhead of the computation and communication still hampers their practical application. This work proposes a l…

    Submitted 24 July, 2024; originally announced July 2024.

    Comments: 9 pages, accepted at IJCAI'24 AISafety

  2. arXiv:2407.15281  [pdf, other]

    cs.CL

    SynCPKL: Harnessing LLMs to Generate Synthetic Data for Commonsense Persona Knowledge Linking

    Authors: Kuan-Yen Lin

    Abstract: Understanding rich dialogues often requires NLP systems to access relevant commonsense persona knowledge, but retrieving this knowledge is challenging due to complex contexts and the implicit nature of commonsense. This paper presents our approach to the Commonsense Persona Knowledge Linking (CPKL) challenge, addressing the critical need for integrating persona and commonsense knowledge in open-do…

    Submitted 21 July, 2024; originally announced July 2024.

  3. arXiv:2407.10937  [pdf, other]

    cs.CV

    IDOL: Unified Dual-Modal Latent Diffusion for Human-Centric Joint Video-Depth Generation

    Authors: Yuanhao Zhai, Kevin Lin, Linjie Li, Chung-Ching Lin, Jianfeng Wang, Zhengyuan Yang, David Doermann, Junsong Yuan, Zicheng Liu, Lijuan Wang

    Abstract: Significant advances have been made in human-centric video generation, yet the joint video-depth generation problem remains underexplored. Most existing monocular depth estimation methods may not generalize well to synthesized images or videos, and multi-view-based methods have difficulty controlling the human appearance and motion. In this work, we present IDOL (unIfied Dual-mOdal Latent diffusio…

    Submitted 15 July, 2024; originally announced July 2024.

    Comments: ECCV 2024; project page: https://yhzhai.github.io/idol/

  4. arXiv:2407.10860  [pdf, other]

    cs.CV

    Human-Centric Transformer for Domain Adaptive Action Recognition

    Authors: Kun-Yu Lin, Jiaming Zhou, Wei-Shi Zheng

    Abstract: We study the domain adaptation task for action recognition, namely domain adaptive action recognition, which aims to effectively transfer action recognition power from a label-sufficient source domain to a label-free target domain. Since actions are performed by humans, it is crucial to exploit human cues in videos when recognizing actions across domains. However, existing methods are prone to los…

    Submitted 15 July, 2024; originally announced July 2024.

    Comments: Accepted by TPAMI

  5. arXiv:2407.06516  [pdf, other]

    cs.CV

    VQA-Diff: Exploiting VQA and Diffusion for Zero-Shot Image-to-3D Vehicle Asset Generation in Autonomous Driving

    Authors: Yibo Liu, Zheyuan Yang, Guile Wu, Yuan Ren, Kejian Lin, Bingbing Liu, Yang Liu, Jinjun Shan

    Abstract: Generating 3D vehicle assets from in-the-wild observations is crucial to autonomous driving. Existing image-to-3D methods cannot address this problem well because they learn generation merely from image RGB information without a deeper understanding of in-the-wild vehicles (such as car models, manufacturers, etc.). This leads to their poor zero-shot prediction capability to handle real-world obser…

    Submitted 10 July, 2024; v1 submitted 8 July, 2024; originally announced July 2024.

  6. arXiv:2407.05285  [pdf, other]

    cs.LG cs.AI cs.CR

    Gradient Diffusion: A Perturbation-Resilient Gradient Leakage Attack

    Authors: Xuan Liu, Siqi Cai, Qihua Zhou, Song Guo, Ruibin Li, Kaiwei Lin

    Abstract: Recent years have witnessed the vulnerability of Federated Learning (FL) against gradient leakage attacks, where the private training data can be recovered from the exchanged gradients, making gradient protection a critical issue for the FL training process. Existing solutions often resort to perturbation-based mechanisms, such as differential privacy, where each participating client injects a spe…

    Submitted 7 July, 2024; originally announced July 2024.

  7. arXiv:2406.16544  [pdf, other]

    cs.CV

    Hierarchical B-frame Video Coding for Long Group of Pictures

    Authors: Ivan Kirillov, Denis Parkhomenko, Kirill Chernyshev, Alexander Pletnev, Yibo Shi, Kai Lin, Dmitry Babin

    Abstract: Learned video compression methods already outperform VVC in the low-delay (LD) case, but the random-access (RA) scenario remains challenging. Most works on learned RA video compression either use HEVC as an anchor or compare it to VVC in specific test conditions, using RGB-PSNR metric instead of Y-PSNR and avoiding comprehensive evaluation. Here, we present an end-to-end learned video codec for ra…

    Submitted 24 June, 2024; originally announced June 2024.

  8. arXiv:2406.14235  [pdf, other]

    cs.CV cs.RO

    Mitigating the Human-Robot Domain Discrepancy in Visual Pre-training for Robotic Manipulation

    Authors: Jiaming Zhou, Teli Ma, Kun-Yu Lin, Ronghe Qiu, Zifan Wang, Junwei Liang

    Abstract: Learning generalizable visual dynamic representation across different embodied environments is crucial for real-world robotic manipulation. As the scale and diversity of robot demonstration data are limited, recent works have turned to large-scale pre-training using human data. However, the morphological differences between humans and robots introduce a significant human-robot domain discrepancy,…

    Submitted 20 June, 2024; originally announced June 2024.

  9. arXiv:2406.13719  [pdf, other]

    cs.CV

    GUI Action Narrator: Where and When Did That Action Take Place?

    Authors: Qinchen Wu, Difei Gao, Kevin Qinghong Lin, Zhuoyu Wu, Xiangwu Guo, Peiran Li, Weichen Zhang, Hengxu Wang, Mike Zheng Shou

    Abstract: The advent of Multimodal LLMs has significantly enhanced image OCR recognition capabilities, making GUI automation a viable reality for increasing efficiency in digital tasks. One fundamental aspect of developing a GUI automation system is understanding primitive GUI actions. This comprehension is crucial as it enables agents to learn from user demonstrations, an essential element of automation. T…

    Submitted 19 June, 2024; originally announced June 2024.

  10. arXiv:2406.11816  [pdf, other]

    cs.CV

    VideoLLM-online: Online Video Large Language Model for Streaming Video

    Authors: Joya Chen, Zhaoyang Lv, Shiwei Wu, Kevin Qinghong Lin, Chenan Song, Difei Gao, Jia-Wei Liu, Ziteng Gao, Dongxing Mao, Mike Zheng Shou

    Abstract: Recent Large Language Models have been enhanced with vision capabilities, enabling them to comprehend images, videos, and interleaved vision-language content. However, the learning methods of these large multimodal models typically treat videos as predetermined clips, making them less effective and efficient at handling streaming video inputs. In this paper, we propose a novel Learning-In-Video-St…

    Submitted 17 June, 2024; originally announced June 2024.

    Comments: CVPR 2024. This arxiv version is upgraded with Llama-3

  11. arXiv:2406.11781  [pdf, other]

    cs.IR

    DiffMM: Multi-Modal Diffusion Model for Recommendation

    Authors: Yangqin Jiang, Lianghao Xia, Wei Wei, Da Luo, Kangyi Lin, Chao Huang

    Abstract: The rise of online multi-modal sharing platforms like TikTok and YouTube has enabled personalized recommender systems to incorporate multiple modalities (such as visual, textual, and acoustic) into user representations. However, addressing the challenge of data sparsity in these systems remains a key issue. To address this limitation, recent research has introduced self-supervised learning techniq…

    Submitted 17 June, 2024; originally announced June 2024.

  12. arXiv:2406.10227  [pdf, other]

    cs.CV cs.AI

    VideoGUI: A Benchmark for GUI Automation from Instructional Videos

    Authors: Kevin Qinghong Lin, Linjie Li, Difei Gao, Qinchen Wu, Mingyi Yan, Zhengyuan Yang, Lijuan Wang, Mike Zheng Shou

    Abstract: Graphical User Interface (GUI) automation holds significant promise for enhancing human productivity by assisting with computer tasks. Existing task formulations primarily focus on simple tasks that can be specified by a single, language-only instruction, such as "Insert a new slide." In this work, we introduce VideoGUI, a novel multi-modal benchmark designed to evaluate GUI assistants on visual-c…

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: 24 pages, 16 tables, 17 figures

  13. arXiv:2406.09767  [pdf, other]

    cs.RO

    Language-Guided Manipulation with Diffusion Policies and Constrained Inpainting

    Authors: Ce Hao, Kelvin Lin, Siyuan Luo, Harold Soh

    Abstract: Diffusion policies have demonstrated robust performance in generative modeling, prompting their application in robotic manipulation controlled via language descriptions. In this paper, we introduce a zero-shot, open-vocabulary diffusion policy method for robot manipulation. Using Vision-Language Models (VLMs), our method transforms linguistic task descriptions into actionable keyframes in 3D space…

    Submitted 14 June, 2024; originally announced June 2024.

  14. arXiv:2406.08407  [pdf, other]

    cs.CV cs.AI cs.CL

    MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos

    Authors: Xuehai He, Weixi Feng, Kaizhi Zheng, Yujie Lu, Wanrong Zhu, Jiachen Li, Yue Fan, Jianfeng Wang, Linjie Li, Zhengyuan Yang, Kevin Lin, William Yang Wang, Lijuan Wang, Xin Eric Wang

    Abstract: Multimodal Large Language Models (MLLMs) demonstrate the emerging abilities of "world models" -- interpreting and reasoning about complex real-world dynamics. To assess these abilities, we posit videos are the ideal medium, as they encapsulate rich representations of real-world dynamics and causalities. To this end, we introduce MMWorld, a new benchmark for multi-discipline, multi-faceted multi…

    Submitted 29 July, 2024; v1 submitted 12 June, 2024; originally announced June 2024.

  15. arXiv:2406.07540  [pdf, other]

    cs.CV cs.LG

    Ctrl-X: Controlling Structure and Appearance for Text-To-Image Generation Without Guidance

    Authors: Kuan Heng Lin, Sicheng Mo, Ben Klingher, Fangzhou Mu, Bolei Zhou

    Abstract: Recent controllable generation approaches such as FreeControl and Diffusion Self-guidance bring fine-grained spatial and appearance control to text-to-image (T2I) diffusion models without training auxiliary modules. However, these methods optimize the latent embedding for each type of score function with longer diffusion steps, making the generation process time-consuming and limiting their flexib…

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: 18 pages, 11 figures, see project page at https://genforce.github.io/ctrl-x

  16. arXiv:2406.06890  [pdf, other]

    cs.CV

    Motion Consistency Model: Accelerating Video Diffusion with Disentangled Motion-Appearance Distillation

    Authors: Yuanhao Zhai, Kevin Lin, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Chung-Ching Lin, David Doermann, Junsong Yuan, Lijuan Wang

    Abstract: Image diffusion distillation achieves high-fidelity generation with very few sampling steps. However, applying these techniques directly to video diffusion often results in unsatisfactory frame quality due to the limited visual quality in public video datasets. This affects the performance of both teacher and student video diffusion models. Our study aims to improve video diffusion distillation wh…

    Submitted 10 June, 2024; originally announced June 2024.

    Comments: Project page: https://yhzhai.github.io/mcm/

  17. arXiv:2406.03298  [pdf, other]

    cs.CV cs.RO

    L-PR: Exploiting LiDAR Fiducial Marker for Unordered Low Overlap Multiview Point Cloud Registration

    Authors: Yibo Liu, Jinjun Shan, Amaldev Haridevan, Shuo Zhang, Kejian Lin

    Abstract: Point cloud registration is a prerequisite for many applications in computer vision and robotics. Most existing methods focus on pairwise registration of two point clouds with high overlap. Although there have been some methods for low overlap cases, they struggle in degraded scenarios. This paper introduces a novel framework named L-PR, designed to register unordered low overlap multiview point c…

    Submitted 5 June, 2024; originally announced June 2024.

    Comments: 8 pages

  18. Retrieval-Augmented Conversational Recommendation with Prompt-based Semi-Structured Natural Language State Tracking

    Authors: Sara Kemper, Justin Cui, Kai Dicarlantonio, Kathy Lin, Danjie Tang, Anton Korikov, Scott Sanner

    Abstract: Conversational recommendation (ConvRec) systems must understand rich and diverse natural language (NL) expressions of user preferences and intents, often communicated in an indirect manner (e.g., "I'm watching my weight"). Such complex utterances make retrieving relevant items challenging, especially if only using often incomplete or out-of-date metadata. Fortunately, many domains feature rich ite…

    Submitted 25 May, 2024; originally announced June 2024.

  19. arXiv:2405.15784  [pdf, other]

    cs.IR cs.AI cs.CL

    CLARINET: Augmenting Language Models to Ask Clarification Questions for Retrieval

    Authors: Yizhou Chi, Jessy Lin, Kevin Lin, Dan Klein

    Abstract: Users often make ambiguous requests that require clarification. We study the problem of asking clarification questions in an information retrieval setting, where systems often face ambiguous search queries and it is challenging to turn the uncertainty in the retrieval model into a natural language question. We present CLARINET, a system that asks informative clarification questions by choosing que…

    Submitted 28 April, 2024; originally announced May 2024.

  20. arXiv:2405.13860  [pdf, other]

    cs.CV

    MAGIC: Map-Guided Few-Shot Audio-Visual Acoustics Modeling

    Authors: Diwei Huang, Kunyang Lin, Peihao Chen, Qing Du, Mingkui Tan

    Abstract: Few-shot audio-visual acoustics modeling seeks to synthesize the room impulse response in arbitrary locations with few-shot observations. To sufficiently exploit the provided few-shot data for accurate acoustic modeling, we present a *map-guided* framework by constructing acoustic-related visual semantic feature maps of the scenes. Visual features preserve semantic details related to sound and map…

    Submitted 22 May, 2024; originally announced May 2024.

    Comments: 17 pages, 12 pages for main paper, 5 pages for supplementary

  21. arXiv:2405.10925  [pdf]

    stat.ME cs.AI cs.LG

    High-dimensional multiple imputation (HDMI) for partially observed confounders including natural language processing-derived auxiliary covariates

    Authors: Janick Weberpals, Pamela A. Shaw, Kueiyu Joshua Lin, Richard Wyss, Joseph M Plasek, Li Zhou, Kerry Ngan, Thomas DeRamus, Sudha R. Raman, Bradley G. Hammill, Hana Lee, Sengwee Toh, John G. Connolly, Kimberly J. Dandreo, Fang Tian, Wei Liu, Jie Li, José J. Hernández-Muñoz, Sebastian Schneeweiss, Rishi J. Desai

    Abstract: Multiple imputation (MI) models can be improved by including auxiliary covariates (AC), but their performance in high-dimensional data is not well understood. We aimed to develop and compare high-dimensional MI (HDMI) approaches using structured and natural language processing (NLP)-derived AC in studies with partially observed confounders. We conducted a plasmode simulation study using data from…

    Submitted 17 May, 2024; originally announced May 2024.

  22. Beyond Static Calibration: The Impact of User Preference Dynamics on Calibrated Recommendation

    Authors: Kun Lin, Masoud Mansoury, Farzad Eskandanian, Milad Sabouri, Bamshad Mobasher

    Abstract: Calibration in recommender systems is an important performance criterion that ensures consistency between the distribution of user preference categories and that of recommendations generated by the system. Standard methods for mitigating miscalibration typically assume that user preference profiles are static, and they measure calibration relative to the full history of user's interactions, includ…

    Submitted 16 May, 2024; originally announced May 2024.

    Comments: 8 pages, 4 figures, accepted as LBR paper at UMAP '24 -- ACM Conference on User Modeling, Adaptation and Personalization 2024

    MSC Class: 68-06; ACM Class: H.3.4

  23. arXiv:2405.07503  [pdf, other]

    cs.RO cs.AI

    Consistency Policy: Accelerated Visuomotor Policies via Consistency Distillation

    Authors: Aaditya Prasad, Kevin Lin, Jimmy Wu, Linqi Zhou, Jeannette Bohg

    Abstract: Many robotic systems, such as mobile manipulators or quadrotors, cannot be equipped with high-end GPUs due to space, weight, and power constraints. These constraints prevent these systems from leveraging recent developments in visuomotor policy architectures that require high-end GPUs to achieve fast policy inference. In this paper, we propose Consistency Policy, a faster and similarly powerful al…

    Submitted 28 June, 2024; v1 submitted 13 May, 2024; originally announced May 2024.

    Comments: https://consistency-policy.github.io/

  24. arXiv:2405.05962  [pdf, other]

    cs.LG cs.CR cs.DC

    Age Aware Scheduling for Differentially-Private Federated Learning

    Authors: Kuan-Yu Lin, Hsuan-Yin Lin, Yu-Pin Hsu, Yu-Chih Huang

    Abstract: This paper explores differentially-private federated learning (FL) across time-varying databases, delving into a nuanced three-way tradeoff involving age, accuracy, and differential privacy (DP). Emphasizing the potential advantages of scheduling, we propose an optimization problem aimed at meeting DP requirements while minimizing the loss difference between the aggregated model and the model obta…

    Submitted 5 July, 2024; v1 submitted 9 May, 2024; originally announced May 2024.

    Comments: Simulation parameters updated. Paper accepted for presentation at the 2024 IEEE International Symposium on Information Theory (ISIT 2024)

  25. arXiv:2405.02794  [pdf, other]

    cs.RO

    Octopi: Object Property Reasoning with Large Tactile-Language Models

    Authors: Samson Yu, Kelvin Lin, Anxing Xiao, Jiafei Duan, Harold Soh

    Abstract: Physical reasoning is important for effective robot manipulation. Recent work has investigated both vision and language modalities for physical reasoning; vision can reveal information about objects in the environment and language serves as an abstraction and communication medium for additional context. Although these works have demonstrated success on a variety of physical reasoning tasks, they a…

    Submitted 4 June, 2024; v1 submitted 4 May, 2024; originally announced May 2024.

    Comments: Accepted at Robotics: Science and Systems (R:SS 2024)

  26. arXiv:2404.17343  [pdf, other]

    cs.CL cs.FL

    A Bionic Natural Language Parser Equivalent to a Pushdown Automaton

    Authors: Zhenghao Wei, Kehua Lin, Jianlin Feng

    Abstract: Assembly Calculus (AC), proposed by Papadimitriou et al., aims to reproduce advanced cognitive functions through simulating neural activities, with several applications based on AC having been developed, including a natural language parser proposed by Mitropolsky et al. However, this parser lacks the ability to handle Kleene closures, preventing it from parsing all regular languages and rendering…

    Submitted 26 April, 2024; originally announced April 2024.

    Comments: to be published in IJCNN 2024

  27. arXiv:2404.16375  [pdf, other]

    cs.CV cs.AI cs.CL

    List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs

    Authors: An Yan, Zhengyuan Yang, Junda Wu, Wanrong Zhu, Jianwei Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Julian McAuley, Jianfeng Gao, Lijuan Wang

    Abstract: Set-of-Mark (SoM) Prompting unleashes the visual grounding capability of GPT-4V, by enabling the model to associate visual objects with tags inserted on the image. These tags, marked with alphanumerics, can be indexed via text tokens for easy reference. Despite the extraordinary performance from GPT-4V, we observe that other Multimodal Large Language Models (MLLMs) struggle to understand these vis…

    Submitted 25 April, 2024; originally announced April 2024.

    Comments: Preprint

  28. arXiv:2404.15909  [pdf, other]

    cs.CV

    Learning Long-form Video Prior via Generative Pre-Training

    Authors: Jinheng Xie, Jiajun Feng, Zhaoxu Tian, Kevin Qinghong Lin, Yawen Huang, Xi Xia, Nanxu Gong, Xu Zuo, Jiaqi Yang, Yefeng Zheng, Mike Zheng Shou

    Abstract: Concepts involved in long-form videos such as people, objects, and their interactions, can be viewed as following an implicit prior. They are notably complex and continue to pose challenges to be comprehensively learned. In recent years, generative pre-training (GPT) has exhibited versatile capacities in modeling any kind of text content even visual locations. Can this manner work for learning lon…

    Submitted 24 April, 2024; originally announced April 2024.

  29. arXiv:2404.14705  [pdf, other]

    cs.CV

    Think-Program-reCtify: 3D Situated Reasoning with Large Language Models

    Authors: Qingrong He, Kejun Lin, Shizhe Chen, Anwen Hu, Qin Jin

    Abstract: This work addresses the 3D situated reasoning task which aims to answer questions given egocentric observations in a 3D environment. The task remains challenging as it requires comprehensive 3D perception and complex reasoning skills. End-to-end models trained on supervised data for 3D situated reasoning suffer from data scarcity and generalization ability. Inspired by the recent success of levera…

    Submitted 22 April, 2024; originally announced April 2024.

  30. arXiv:2404.06780  [pdf, other]

    cs.CV

    Urban Architect: Steerable 3D Urban Scene Generation with Layout Prior

    Authors: Fan Lu, Kwan-Yee Lin, Yan Xu, Hongsheng Li, Guang Chen, Changjun Jiang

    Abstract: Text-to-3D generation has achieved remarkable success via large-scale text-to-image diffusion models. Nevertheless, there is no paradigm for scaling up the methodology to urban scale. Urban scenes, characterized by numerous elements, intricate arrangement relationships, and vast scale, present a formidable barrier to the interpretability of ambiguous textual descriptions for effective model optimi…

    Submitted 10 April, 2024; originally announced April 2024.

    Comments: Project page: https://urbanarchitect.github.io/

  31. arXiv:2404.01294  [pdf, other]

    cs.CV

    CosmicMan: A Text-to-Image Foundation Model for Humans

    Authors: Shikai Li, Jianglin Fu, Kaiyuan Liu, Wentao Wang, Kwan-Yee Lin, Wayne Wu

    Abstract: We present CosmicMan, a text-to-image foundation model specialized for generating high-fidelity human images. Unlike current general-purpose foundation models that are stuck in the dilemma of inferior quality and text-image misalignment for humans, CosmicMan enables generating photo-realistic human images with meticulous appearance, reasonable structure, and precise text-image alignment with detai…

    Submitted 1 April, 2024; originally announced April 2024.

    Comments: Accepted by CVPR 2024. The supplementary material is included. Project Page: https://cosmicman-cvpr2024.github.io

  32. arXiv:2403.12945  [pdf, other]

    cs.RO

    DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    Authors: Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, Peter David Fagan, Joey Hejna, Masha Itkina, Marion Lepert, Yecheng Jason Ma, Patrick Tree Miller, Jimmy Wu, Suneel Belkhale, Shivin Dass, Huy Ha, Arhan Jain, Abraham Lee, Youngwoon Lee, Marius Memmel, Sungjae Park, et al. (74 additional authors not shown)

    Abstract: The creation of large, diverse, high-quality robot manipulation datasets is an important stepping stone on the path toward more capable and robust robotic manipulation policies. However, creating such datasets is challenging: collecting robot manipulation data in diverse environments poses logistical and safety challenges and requires substantial investments in hardware and human labour. As a resu…

    Submitted 19 March, 2024; originally announced March 2024.

    Comments: Project website: https://droid-dataset.github.io/

  33. arXiv:2403.10856  [pdf, other]

    cs.CL cs.CR

    Zero-shot Generative Linguistic Steganography

    Authors: Ke Lin, Yiyang Luo, Zijian Zhang, Ping Luo

    Abstract: Generative linguistic steganography attempts to hide secret messages into covertext. Previous studies have generally focused on the statistical differences between the covertext and stegotext, however, ill-formed stegotext can readily be identified by humans. In this paper, we propose a novel zero-shot approach based on in-context learning for linguistic steganography to achieve better perceptual…

    Submitted 16 March, 2024; originally announced March 2024.

    Comments: 15 pages, 6 figures. Accepted at NAACL 2024

  34. arXiv:2403.10020  [pdf, other]

    cs.CL cs.MM

    Lost in Overlap: Exploring Watermark Collision in LLMs

    Authors: Yiyang Luo, Ke Lin, Chao Gu

    Abstract: The proliferation of large language models (LLMs) in generating content raises concerns about text copyright. Watermarking methods, particularly logit-based approaches, embed imperceptible identifiers into text to address these challenges. However, the widespread use of watermarking across diverse LLMs has led to an inevitable issue known as watermark collision during common tasks like question an…

    Submitted 15 March, 2024; originally announced March 2024.

    Comments: Short Paper, 4 pages

  35. arXiv:2403.01560  [pdf, other]

    cs.CV

    Rethinking CLIP-based Video Learners in Cross-Domain Open-Vocabulary Action Recognition

    Authors: Kun-Yu Lin, Henghui Ding, Jiaming Zhou, Yu-Ming Tang, Yi-Xing Peng, Zhilin Zhao, Chen Change Loy, Wei-Shi Zheng

    Abstract: Building upon the impressive success of CLIP (Contrastive Language-Image Pretraining), recent pioneer works have proposed to adapt the powerful CLIP to video data, leading to efficient and effective video learners for open-vocabulary action recognition. Inspired by the fact that humans perform actions in diverse environments, our work delves into an intriguing question: Can CLIP-based video learners effect…

    Submitted 24 May, 2024; v1 submitted 3 March, 2024; originally announced March 2024.

  36. arXiv:2402.16075  [pdf, other]

    cs.LG cs.AI cs.RO

    Don't Start from Scratch: Behavioral Refinement via Interpolant-based Policy Diffusion

    Authors: Kaiqi Chen, Eugene Lim, Kelvin Lin, Yiyang Chen, Harold Soh

    Abstract: Imitation learning empowers artificial agents to mimic behavior by learning from demonstrations. Recently, diffusion models, which have the ability to model high-dimensional and multimodal distributions, have shown impressive performance on imitation learning tasks. These models learn to shape a policy by diffusing actions (or states) from standard Gaussian noise. However, the target policy to be…

    Submitted 10 July, 2024; v1 submitted 25 February, 2024; originally announced February 2024.

  37. arXiv:2401.11654  [pdf, other]

    cs.CV

    ActionHub: A Large-scale Action Video Description Dataset for Zero-shot Action Recognition

    Authors: Jiaming Zhou, Junwei Liang, Kun-Yu Lin, Jinrui Yang, Wei-Shi Zheng

    Abstract: Zero-shot action recognition (ZSAR) aims to learn an alignment model between videos and class descriptions of seen actions that is transferable to unseen actions. The text queries (class descriptions) used in existing ZSAR works, however, are often short action names that fail to capture the rich semantics in the videos, leading to misalignment. With the intuition that video content descriptions (…

    Submitted 21 January, 2024; originally announced January 2024.

  38. End-to-End Optimized Image Compression with the Frequency-Oriented Transform

    Authors: Yuefeng Zhang, Kai Lin

    Abstract: Image compression constitutes a significant challenge amidst the era of information explosion. Recent studies employing deep learning methods have demonstrated the superior performance of learning-based image compression methods over traditional codecs. However, an inherent challenge associated with these methods lies in their lack of interpretability. Following an analysis of the varying degrees…

    Submitted 16 January, 2024; originally announced January 2024.

    Comments: 25 pages, accepted by MVAP

    Journal ref: Machine Vision and Applications, Volume 35, article number 27 (2024)

  39. arXiv:2401.05033  [pdf, other]

    cs.CL cs.AI

    Bootstrapping LLM-based Task-Oriented Dialogue Agents via Self-Talk

    Authors: Dennis Ulmer, Elman Mansimov, Kaixiang Lin, Justin Sun, Xibin Gao, Yi Zhang

    Abstract: Large language models (LLMs) are powerful dialogue agents, but specializing them towards fulfilling a specific function can be challenging. Instruction tuning, i.e. tuning models on instructions and sample responses generated by humans (Ouyang et al., 2022), has proven to be an effective method to do so, yet requires a number of data samples that a) might not be available or b) are costly to generate. Fur…

    Submitted 10 January, 2024; originally announced January 2024.

  40. arXiv:2401.00849  [pdf, other]

    cs.CV

    COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training

    Authors: Alex Jinpeng Wang, Linjie Li, Kevin Qinghong Lin, Jianfeng Wang, Kevin Lin, Zhengyuan Yang, Lijuan Wang, Mike Zheng Shou

    Abstract: In the evolution of Vision-Language Pre-training, shifting from short-text comprehension to encompassing extended textual contexts is pivotal. Recent autoregressive vision-language models such as Flamingo and PaLM-E, leveraging the long-context capability of Large Language Models, have excelled in few-shot text generation tasks but face challenges in alignment tasks. Addressing this gap, we introd…

    Submitted 1 January, 2024; originally announced January 2024.

    Comments: 16 pages; Website: http://fingerrec.github.io/cosmo

  41. arXiv:2312.12620  [pdf, ps, other]

    cs.CY

    "It Can Relate to Real Lives": Attitudes and Expectations in Justice-Centered Data Structures & Algorithms for Non-Majors

    Authors: Anna Batra, Iris Zhou, Suh Young Choi, Chongjiu Gao, Yanbing Xiao, Sonia Fereidooni, Kevin Lin

    Abstract: Prior work has argued for a more justice-centered approach to postsecondary computing education by emphasizing ethics, identity, and political vision. In this experience report, we examine how postsecondary students of diverse gender and racial identities experience a justice-centered Data Structures and Algorithms designed for undergraduate non-computer science majors. Through a quantitative and…

    Submitted 15 March, 2024; v1 submitted 19 December, 2023; originally announced December 2023.

    Comments: Experience Reports and Tools paper in the Proceedings of the 55th ACM Technical Symposium on Computer Science Education V. 1 (SIGCSE 2024); 7 pages

    ACM Class: K.3.2

  42. arXiv:2312.10271  [pdf, other]

    eess.IV cs.CV cs.LG

    Robustness of Deep Learning for Accelerated MRI: Benefits of Diverse Training Data

    Authors: Kang Lin, Reinhard Heckel

    Abstract: Deep learning based methods for image reconstruction are state-of-the-art for a variety of imaging tasks. However, neural networks often perform worse if the training data differs significantly from the data they are applied to. For example, a network trained for accelerated magnetic resonance imaging (MRI) on one scanner performs worse on another scanner. In this work, we investigate the impact o…

    Submitted 15 December, 2023; originally announced December 2023.

  43. arXiv:2312.07536  [pdf, other]

    cs.CV

    FreeControl: Training-Free Spatial Control of Any Text-to-Image Diffusion Model with Any Condition

    Authors: Sicheng Mo, Fangzhou Mu, Kuan Heng Lin, Yanli Liu, Bochen Guan, Yin Li, Bolei Zhou

    Abstract: Recent approaches such as ControlNet offer users fine-grained spatial control over text-to-image (T2I) diffusion models. However, auxiliary modules have to be trained for each type of spatial condition, model architecture, and checkpoint, putting them at odds with the diverse intents and preferences a human designer would like to convey to the AI models during the content creation process. In this…

    Submitted 12 December, 2023; originally announced December 2023.

    Comments: Project Page: https://genforce.github.io/freecontrol/

  44. arXiv:2312.05783  [pdf, other]

    cs.LG

    DCIR: Dynamic Consistency Intrinsic Reward for Multi-Agent Reinforcement Learning

    Authors: Kunyang Lin, Yufeng Wang, Peihao Chen, Runhao Zeng, Siyuan Zhou, Mingkui Tan, Chuang Gan

    Abstract: Learning an optimal behavior policy for each agent in multi-agent systems is an essential yet difficult problem. Despite fruitful progress in multi-agent reinforcement learning, the challenge of addressing the dynamics of whether two agents should exhibit consistent behaviors is still under-explored. In this paper, we propose a new approach that enables agents to learn whether their behaviors should…

    Submitted 10 December, 2023; originally announced December 2023.

    Comments: 15 pages, 11 pages for main paper, 4 pages for supplementary

  45. arXiv:2312.02514  [pdf, ps, other]

    cs.CR

    Skipping Scheme for Gate-hiding Garbled Circuits

    Authors: Ke Lin

    Abstract: In classic settings of garbled circuits, each gate type is leaked to improve both space and speed optimization. Zahur et al. have shown in EUROCRYPT 2015 that a typical linear garbling scheme requires at least two $λ$-bit elements per gate with a security parameter of $λ$, which limits their efficiency. In contrast to typical garbled circuits, gate-hiding garbled circuits have the potential to dra…

    Submitted 5 December, 2023; originally announced December 2023.

    Comments: 20 pages, 8 figures

  46. arXiv:2312.01987  [pdf, other]

    cs.CV

    Bootstrapping SparseFormers from Vision Foundation Models

    Authors: Ziteng Gao, Zhan Tong, Kevin Qinghong Lin, Joya Chen, Mike Zheng Shou

    Abstract: The recently proposed SparseFormer architecture provides an alternative approach to visual understanding by utilizing a significantly lower number of visual tokens via adjusting RoIs, greatly reducing computational costs while still achieving promising performance. However, training SparseFormers from scratch is still expensive, and scaling up the number of parameters can be challenging. In this p…

    Submitted 4 April, 2024; v1 submitted 4 December, 2023; originally announced December 2023.

    Comments: CVPR 2024

  47. arXiv:2311.17435  [pdf, other]

    cs.CV cs.AI

    MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning

    Authors: Chaoyi Zhang, Kevin Lin, Zhengyuan Yang, Jianfeng Wang, Linjie Li, Chung-Ching Lin, Zicheng Liu, Lijuan Wang

    Abstract: We present MM-Narrator, a novel system leveraging GPT-4 with multimodal in-context learning for the generation of audio descriptions (AD). Unlike previous methods that primarily focused on downstream fine-tuning with short video clips, MM-Narrator excels in generating precise audio descriptions for videos of extensive lengths, even beyond hours, in an autoregressive manner. This capability is made…

    Submitted 29 November, 2023; originally announced November 2023.

    Comments: Project page at https://mm-narrator.github.io/

  48. arXiv:2311.17118  [pdf, other]

    cs.CV

    Towards Weakly Supervised End-to-end Learning for Long-video Action Recognition

    Authors: Jiaming Zhou, Hanjun Li, Kun-Yu Lin, Junwei Liang

    Abstract: Developing end-to-end action recognition models on long videos is fundamental and crucial for long-video action understanding. Due to the unaffordable cost of end-to-end training on the whole long videos, existing works generally train models on short clips trimmed from long videos. However, this "trimming-then-training" practice requires action interval annotations for clip-level supervision, i…

    Submitted 24 May, 2024; v1 submitted 28 November, 2023; originally announced November 2023.

  49. arXiv:2311.16716  [pdf, other]

    cs.IR cs.AI

    GraphPro: Graph Pre-training and Prompt Learning for Recommendation

    Authors: Yuhao Yang, Lianghao Xia, Da Luo, Kangyi Lin, Chao Huang

    Abstract: GNN-based recommenders have excelled in modeling intricate user-item interactions through multi-hop message passing. However, existing methods often overlook the dynamic nature of evolving user-item interactions, which impedes the adaptation to changing user preferences and distribution shifts in newly arriving data. Thus, their scalability and performance in real-world dynamic environments are lim…

    Submitted 19 February, 2024; v1 submitted 28 November, 2023; originally announced November 2023.

    Comments: Accepted by WWW'2024, full paper

  50. arXiv:2311.16501  [pdf, other]

    cs.CV

    Context-Aware Indoor Point Cloud Object Generation through User Instructions

    Authors: Yiyang Luo, Ke Lin, Chao Gu

    Abstract: Indoor scene modification has emerged as a prominent area within computer vision, particularly for its applications in Augmented Reality (AR) and Virtual Reality (VR). Traditional methods often rely on pre-existing object databases and predetermined object positions, limiting their flexibility and adaptability to new scenarios. In response to this challenge, we present a novel end-to-end multi-mod…

    Submitted 22 July, 2024; v1 submitted 26 November, 2023; originally announced November 2023.