Showing 1–50 of 83 results for author: Yao, K

Searching in archive cs.
  1. arXiv:2406.08814  [pdf, other]

    cs.CV

    Skim then Focus: Integrating Contextual and Fine-grained Views for Repetitive Action Counting

    Authors: Zhengqi Zhao, Xiaohu Huang, Hao Zhou, Kun Yao, Errui Ding, Jingdong Wang, Xinggang Wang, Wenyu Liu, Bin Feng

    Abstract: The key to action counting is accurately locating each video's repetitive actions. Instead of estimating the probability of each frame belonging to an action directly, we propose a dual-branch network, i.e., SkimFocusNet, working in a two-step manner. The model draws inspiration from empirical observations indicating that humans typically engage in coarse skimming of entire sequences to grasp the…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: 13 pages, 9 figures

  2. arXiv:2406.03459  [pdf, other]

    cs.CV

    LW-DETR: A Transformer Replacement to YOLO for Real-Time Detection

    Authors: Qiang Chen, Xiangbo Su, Xinyu Zhang, Jian Wang, Jiahui Chen, Yunpeng Shen, Chuchu Han, Ziliang Chen, Weixiang Xu, Fanrong Li, Shan Zhang, Kun Yao, Errui Ding, Gang Zhang, Jingdong Wang

    Abstract: In this paper, we present a light-weight detection transformer, LW-DETR, which outperforms YOLOs for real-time object detection. The architecture is a simple stack of a ViT encoder, a projector, and a shallow DETR decoder. Our approach leverages recent advanced techniques, such as training-effective techniques, e.g., improved loss and pretraining, and interleaved window and global attentions for r…

    Submitted 5 June, 2024; originally announced June 2024.

  3. arXiv:2405.21013  [pdf, other]

    cs.CV

    StrucTexTv3: An Efficient Vision-Language Model for Text-rich Image Perception, Comprehension, and Beyond

    Authors: Pengyuan Lyu, Yulin Li, Hao Zhou, Weihong Ma, Xingyu Wan, Qunyi Xie, Liang Wu, Chengquan Zhang, Kun Yao, Errui Ding, Jingdong Wang

    Abstract: Text-rich images have significant and extensive value, deeply integrated into various aspects of human life. Notably, both visual cues and linguistic symbols in text-rich images play crucial roles in information transmission but are accompanied by diverse challenges. Therefore, the efficient and effective understanding of text-rich images is a crucial litmus test for the capability of Vision-Langu…

    Submitted 4 June, 2024; v1 submitted 31 May, 2024; originally announced May 2024.

  4. arXiv:2405.19765  [pdf, other]

    cs.CV cs.AI

    Towards Unified Multi-granularity Text Detection with Interactive Attention

    Authors: Xingyu Wan, Chengquan Zhang, Pengyuan Lyu, Sen Fan, Zihan Ni, Kun Yao, Errui Ding, Jingdong Wang

    Abstract: Existing OCR engines or document image analysis systems typically rely on training separate models for text detection in varying scenarios and granularities, leading to significant computational complexity and resource demands. In this paper, we introduce "Detect Any Text" (DAT), an advanced paradigm that seamlessly unifies scene text detection, layout analysis, and document page detection into a…

    Submitted 30 May, 2024; originally announced May 2024.

    Comments: ICML 2024

  5. arXiv:2405.17201  [pdf, other]

    cs.CV

    Diagnosing the Compositional Knowledge of Vision Language Models from a Game-Theoretic View

    Authors: Jin Wang, Shichao Dong, Yapeng Zhu, Kelu Yao, Weidong Zhao, Chao Li, Ping Luo

    Abstract: Compositional reasoning capabilities are usually considered as fundamental skills to characterize human perception. Recent studies show that current Vision Language Models (VLMs) surprisingly lack sufficient knowledge with respect to such capabilities. To this end, we propose to thoroughly diagnose the composition representations encoded by VLMs, systematically revealing the potential cause for th…

    Submitted 27 May, 2024; originally announced May 2024.

    Comments: 21 pages, 8 figures

  6. arXiv:2405.14278  [pdf, other]

    cs.CV

    SCMix: Stochastic Compound Mixing for Open Compound Domain Adaptation in Semantic Segmentation

    Authors: Kai Yao, Zhaorui Tan, Zixian Su, Xi Yang, Jie Sun, Kaizhu Huang

    Abstract: Open compound domain adaptation (OCDA) aims to transfer knowledge from a labeled source domain to a mix of unlabeled homogeneous compound target domains while generalizing to open unseen domains. Existing OCDA methods solve the intra-domain gaps by a divide-and-conquer strategy, which divides the problem into several individual and parallel domain adaptation (DA) tasks. Such approaches often conta…

    Submitted 23 May, 2024; originally announced May 2024.

  7. arXiv:2404.13191  [pdf, other]

    cs.RO

    Action Contextualization: Adaptive Task Planning and Action Tuning using Large Language Models

    Authors: Sthithpragya Gupta, Kunpeng Yao, Loïc Niederhauser, Aude Billard

    Abstract: Large Language Models (LLMs) present a promising frontier in robotic task planning by leveraging extensive human knowledge. Nevertheless, the current literature often overlooks the critical aspects of adaptability and error correction within robotic systems. This work aims to overcome this limitation by enabling robots to modify their motion strategies and select the most suitable task plans based…

    Submitted 19 April, 2024; originally announced April 2024.

  8. arXiv:2404.13067  [pdf, other]

    cs.CL cs.AI cs.LG

    Towards Efficient Resume Understanding: A Multi-Granularity Multi-Modal Pre-Training Approach

    Authors: Feihu Jiang, Chuan Qin, Jingshuai Zhang, Kaichun Yao, Xi Chen, Dazhong Shen, Chen Zhu, Hengshu Zhu, Hui Xiong

    Abstract: In the contemporary era of widespread online recruitment, resume understanding has been widely acknowledged as a fundamental and crucial task, which aims to extract structured information from resume documents automatically. Compared to the traditional rule-based approaches, the utilization of recently proposed pre-trained document understanding models can greatly enhance the effectiveness of resu…

    Submitted 13 April, 2024; originally announced April 2024.

    Comments: ICME 2024 Accepted

  9. arXiv:2404.08695  [pdf, other]

    cs.CL cs.AI cs.IR

    Enhancing Question Answering for Enterprise Knowledge Bases using Large Language Models

    Authors: Feihu Jiang, Chuan Qin, Kaichun Yao, Chuyu Fang, Fuzhen Zhuang, Hengshu Zhu, Hui Xiong

    Abstract: Efficient knowledge management plays a pivotal role in augmenting both the operational efficiency and the innovative capacity of businesses and organizations. By indexing knowledge through vectorization, a variety of knowledge retrieval methods have emerged, significantly enhancing the efficacy of knowledge management systems. Recently, the rapid advancements in generative natural language process…

    Submitted 20 April, 2024; v1 submitted 10 April, 2024; originally announced April 2024.

    Comments: DASFAA 2024 Accepted

  10. arXiv:2403.10759  [pdf, other]

    cs.RO

    Fully Distributed Cooperative Multi-agent Underwater Obstacle Avoidance Under Dog Walking Paradigm

    Authors: Kanzhong Yao, Ognjen Marjanovic, Simon Watson

    Abstract: Navigation in cluttered underwater environments is challenging, especially when there are constraints on communication and self-localisation. Part of the fully distributed underwater navigation problem has been resolved by introducing multi-agent robot teams, however when the environment becomes cluttered, the problem remains unresolved. In this paper, we first studied the connection between every…

    Submitted 15 March, 2024; originally announced March 2024.

    Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

  11. arXiv:2403.10629  [pdf, other]

    cs.RO eess.SY

    Virtual Elastic Tether: a New Approach for Multi-agent Navigation in Confined Aquatic Environments

    Authors: Kanzhong Yao, Xueliang Cheng, Keir Groves, Barry Lennox, Ognjen Marjanovic, Simon Watson

    Abstract: Underwater navigation is a challenging area in the field of mobile robotics due to inherent constraints in self-localisation and communication in underwater environments. Some of these challenges can be mitigated by using collaborative multi-agent teams. However, when applied underwater, the robustness of traditional multi-agent collaborative control approaches is highly limited due to the unavail…

    Submitted 15 March, 2024; originally announced March 2024.

    Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

  12. arXiv:2403.05530  [pdf, other]

    cs.CL cs.AI

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Authors: Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, Soroosh Mariooryad, Yifan Ding, Xinyang Geng, Fred Alcober, Roy Frostig, Mark Omernick, Lexi Walker, Cosmin Paduraru, Christina Sorokin, Andrea Tacchetti, Colin Gaffney, Samira Daruki, Olcan Sercinoglu, Zach Gleicher, Juliette Love, et al. (1092 additional authors not shown)

    Abstract: In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February…

    Submitted 14 June, 2024; v1 submitted 8 March, 2024; originally announced March 2024.

  13. arXiv:2402.03241  [pdf, other]

    cs.CV cs.LG

    FROSTER: Frozen CLIP Is A Strong Teacher for Open-Vocabulary Action Recognition

    Authors: Xiaohu Huang, Hao Zhou, Kun Yao, Kai Han

    Abstract: In this paper, we introduce FROSTER, an effective framework for open-vocabulary action recognition. The CLIP model has achieved remarkable success in a range of image-based tasks, benefiting from its strong generalization capability stemming from pretraining on massive image-text pairs. However, applying CLIP directly to the open-vocabulary action recognition task is challenging due to the absence…

    Submitted 5 February, 2024; originally announced February 2024.

    Comments: Accepted by ICLR 2024

  14. arXiv:2401.16465  [pdf, other]

    cs.CV cs.GR

    DressCode: Autoregressively Sewing and Generating Garments from Text Guidance

    Authors: Kai He, Kaixin Yao, Qixuan Zhang, Jingyi Yu, Lingjie Liu, Lan Xu

    Abstract: Apparel's significant role in human appearance underscores the importance of garment digitalization for digital human creation. Recent advances in 3D content creation are pivotal for digital human creation. Nonetheless, garment generation from text guidance is still nascent. We introduce a text-driven 3D garment generation framework, DressCode, which aims to democratize design for novices and offe…

    Submitted 14 June, 2024; v1 submitted 29 January, 2024; originally announced January 2024.

    Comments: Project page: https://IHe-KaiI.github.io/DressCode/

  15. arXiv:2312.11805  [pdf, other]

    cs.CL cs.AI cs.CV

    Gemini: A Family of Highly Capable Multimodal Models

    Authors: Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul R. Barham, Tom Hennigan, Benjamin Lee, et al. (1325 additional authors not shown)

    Abstract: This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultr…

    Submitted 17 June, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

  16. arXiv:2312.09486  [pdf, other]

    cs.CV cs.LG

    Unraveling Batch Normalization for Realistic Test-Time Adaptation

    Authors: Zixian Su, Jingwei Guo, Kai Yao, Xi Yang, Qiufeng Wang, Kaizhu Huang

    Abstract: While recent test-time adaptations exhibit efficacy by adjusting batch normalization to narrow domain disparities, their effectiveness diminishes with realistic mini-batches due to inaccurate target estimation. As previous attempts merely introduce source statistics to mitigate this issue, the fundamental problem of inaccurate target estimation still persists, leaving the intrinsic test-time domai…

    Submitted 13 April, 2024; v1 submitted 14 December, 2023; originally announced December 2023.

    Comments: Accepted by AAAI 2024

  17. arXiv:2312.05829  [pdf, other]

    cs.IT eess.SP

    EM Based p-norm-like Constraint RLS Algorithm for Sparse System Identification

    Authors: Shuyang Jiang, Kung Yao

    Abstract: In this paper, the recursive least squares (RLS) algorithm is considered in the sparse system identification setting. The cost function of RLS algorithm is regularized by a $p$-norm-like ($0 \leq p \leq 1$) constraint of the estimated system parameters. In order to minimize the regularized cost function, we transform it into a penalized maximum likelihood (ML) problem, which is solved by the expec…

    Submitted 10 December, 2023; originally announced December 2023.

    Comments: 11 pages, 3 figures, journal manuscript

  18. arXiv:2312.01407  [pdf, other]

    cs.CV

    VideoRF: Rendering Dynamic Radiance Fields as 2D Feature Video Streams

    Authors: Liao Wang, Kaixin Yao, Chengcheng Guo, Zhirui Zhang, Qiang Hu, Jingyi Yu, Lan Xu, Minye Wu

    Abstract: Neural Radiance Fields (NeRFs) excel in photorealistically rendering static scenes. However, rendering dynamic, long-duration radiance fields on ubiquitous devices remains challenging, due to data storage and computational constraints. In this paper, we introduce VideoRF, the first approach to enable real-time streaming and rendering of dynamic radiance fields on mobile platforms. At the core is a…

    Submitted 3 December, 2023; originally announced December 2023.

    Comments: Project page, see https://aoliao12138.github.io/VideoRF

  19. arXiv:2312.00963  [pdf, other]

    cs.LG stat.ME

    Spatiotemporal Transformer for Imputing Sparse Data: A Deep Learning Approach

    Authors: Kehui Yao, Jingyi Huang, Jun Zhu

    Abstract: Effective management of environmental resources and agricultural sustainability heavily depends on accurate soil moisture data. However, datasets like the SMAP/Sentinel-1 soil moisture product often contain missing values across their spatiotemporal grid, which poses a significant challenge. This paper introduces a novel Spatiotemporal Transformer model (ST-Transformer) specifically designed to ad…

    Submitted 1 December, 2023; originally announced December 2023.

  20. arXiv:2311.18044  [pdf, other]

    cs.RO cs.LG

    Transfer Learning in Robotics: An Upcoming Breakthrough? A Review of Promises and Challenges

    Authors: Noémie Jaquier, Michael C. Welle, Andrej Gams, Kunpeng Yao, Bernardo Fichera, Aude Billard, Aleš Ude, Tamim Asfour, Danica Kragic

    Abstract: Transfer learning is a conceptually-enticing paradigm in pursuit of truly intelligent embodied agents. The core concept -- reusing prior knowledge to learn in and from novel situations -- is successfully leveraged by humans to handle novel situations. In recent years, transfer learning has received renewed interest from the community from different perspectives, including imitation learning, domai…

    Submitted 2 May, 2024; v1 submitted 29 November, 2023; originally announced November 2023.

    Comments: 21 pages, 7 figures

  21. arXiv:2311.15939  [pdf, other]

    cs.CV

    Unleashing the Power of Prompt-driven Nucleus Instance Segmentation

    Authors: Zhongyi Shui, Yunlong Zhang, Kai Yao, Chenglu Zhu, Sunyi Zheng, Jingxiong Li, Honglin Li, Yuxuan Sun, Ruizhe Guo, Lin Yang

    Abstract: Nucleus instance segmentation in histology images is crucial for a broad spectrum of clinical applications. Current dominant algorithms rely on regression of nuclear proxy maps. Distinguishing nucleus instances from the estimated maps requires carefully curated post-processing, which is error-prone and parameter-sensitive. Recently, the Segment Anything Model (SAM) has earned huge attention in med…

    Submitted 24 January, 2024; v1 submitted 27 November, 2023; originally announced November 2023.

    Comments: under review

  22. arXiv:2311.02172  [pdf, other]

    cs.CG

    Fast and Accurate Approximations of the Optimal Transport in Semi-Discrete and Discrete Settings

    Authors: Pankaj K. Agarwal, Sharath Raghvendra, Pouyan Shirzadian, Keegan Yao

    Abstract: Given a $d$-dimensional continuous (resp. discrete) probability distribution $μ$ and a discrete distribution $ν$, the semi-discrete (resp. discrete) Optimal Transport (OT) problem asks for computing a minimum-cost plan to transport mass from $μ$ to $ν$; we assume $n$ to be the size of the support of the discrete distributions, and we assume we have access to an oracle outputting the mass of $μ$ in…

    Submitted 3 November, 2023; originally announced November 2023.

  23. arXiv:2310.20695  [pdf, other]

    cs.CV cs.AI

    HAP: Structure-Aware Masked Image Modeling for Human-Centric Perception

    Authors: Junkun Yuan, Xinyu Zhang, Hao Zhou, Jian Wang, Zhongwei Qiu, Zhiyin Shao, Shaofeng Zhang, Sifan Long, Kun Kuang, Kun Yao, Junyu Han, Errui Ding, Lanfen Lin, Fei Wu, Jingdong Wang

    Abstract: Model pre-training is essential in human-centric perception. In this paper, we first introduce masked image modeling (MIM) as a pre-training approach for this task. Upon revisiting the MIM training strategy, we reveal that human structure priors offer significant potential. Motivated by this insight, we further incorporate an intuitive human structure prior - human parts - into pre-training. Speci…

    Submitted 31 October, 2023; originally announced October 2023.

    Comments: Accepted by NeurIPS 2023

  24. arXiv:2310.13760  [pdf, other]

    cs.CL

    Enhancing Abstractiveness of Summarization Models through Calibrated Distillation

    Authors: Hwanjun Song, Igor Shalyminov, Hang Su, Siffi Singh, Kaisheng Yao, Saab Mansour

    Abstract: Sequence-level knowledge distillation reduces the size of Seq2Seq models for more efficient abstractive summarization. However, it often leads to a loss of abstractiveness in summarization. In this paper, we propose a novel approach named DisCal to enhance the level of abstractiveness (measured by n-gram overlap) without sacrificing the informativeness (measured by ROUGE) of generated summaries. D…

    Submitted 4 December, 2023; v1 submitted 20 October, 2023; originally announced October 2023.

    Comments: Accepted at EMNLP-Findings 2023

  25. arXiv:2309.16085  [pdf, other]

    cs.RO

    Differentiable Robot Neural Distance Function for Adaptive Grasp Synthesis on a Unified Robotic Arm-Hand System

    Authors: Yiting Chen, Xiao Gao, Kunpeng Yao, Loïc Niederhauser, Yasemin Bekiroglu, Aude Billard

    Abstract: Grasping is a fundamental skill for robots to interact with their environment. While grasp execution requires coordinated movement of the hand and arm to achieve a collision-free and secure grip, many grasp synthesis studies address arm and hand motion planning independently, leading to potentially unreachable grasps in practical settings. The challenge of determining integrated arm-hand configura…

    Submitted 27 September, 2023; originally announced September 2023.

    Comments: Under review

  26. arXiv:2309.14962  [pdf, other]

    cs.CV

    GridFormer: Towards Accurate Table Structure Recognition via Grid Prediction

    Authors: Pengyuan Lyu, Weihong Ma, Hongyi Wang, Yuechen Yu, Chengquan Zhang, Kun Yao, Yang Xue, Jingdong Wang

    Abstract: All tables can be represented as grids. Based on this observation, we propose GridFormer, a novel approach for interpreting unconstrained table structures by predicting the vertex and edge of a grid. First, we propose a flexible table representation in the form of an $M \times N$ grid. In this representation, the vertexes and edges of the grid store the localization and adjacency information of the table.…

    Submitted 26 September, 2023; originally announced September 2023.

    Comments: ACMMM2023

  27. arXiv:2309.13570  [pdf, other]

    cs.CV

    Robust 6DoF Pose Estimation Against Depth Noise and a Comprehensive Evaluation on a Mobile Dataset

    Authors: Zixun Huang, Keling Yao, Seth Z. Zhao, Chuanyu Pan, Chenfeng Xu, Kathy Zhuang, Tianjian Xu, Weiyu Feng, Allen Y. Yang

    Abstract: Robust 6DoF pose estimation with mobile devices is the foundation for applications in robotics, augmented reality, and digital twin localization. In this paper, we extensively investigate the robustness of existing RGBD-based 6DoF pose estimation methods against varying levels of depth sensor noise. We highlight that existing 6DoF pose estimation methods suffer significant performance discrepancie…

    Submitted 17 June, 2024; v1 submitted 24 September, 2023; originally announced September 2023.

  28. arXiv:2309.06955  [pdf, other]

    cs.RO

    Enhancing Dexterity in Confined Spaces: Real-Time Motion Planning for Multi-Fingered In-Hand Manipulation

    Authors: Xiao Gao, Kunpeng Yao, Farshad Khadivar, Aude Billard

    Abstract: Dexterous in-hand manipulation in robotics, particularly with multi-fingered robotic hands, poses significant challenges due to the intricate avoidance of collisions among fingers and the object being manipulated. Collision-free paths for all fingers must be generated in real-time, as the rapid changes in hand and finger positions necessitate instantaneous recalculations to prevent collisions and…

    Submitted 25 June, 2024; v1 submitted 13 September, 2023; originally announced September 2023.

  29. arXiv:2308.13724  [pdf, other]

    cs.RO cs.AI

    ISR-LLM: Iterative Self-Refined Large Language Model for Long-Horizon Sequential Task Planning

    Authors: Zhehua Zhou, Jiayang Song, Kunpeng Yao, Zhan Shu, Lei Ma

    Abstract: Motivated by the substantial achievements observed in Large Language Models (LLMs) in the field of natural language processing, recent research has commenced investigations into the application of LLMs for complex, long-horizon sequential task planning challenges in robotics. LLMs are advantageous in offering the potential to enhance the generalizability as task-agnostic planners and facilitate fl…

    Submitted 25 August, 2023; originally announced August 2023.

  30. arXiv:2308.07313  [pdf, other]

    cs.CV

    Group Pose: A Simple Baseline for End-to-End Multi-person Pose Estimation

    Authors: Huan Liu, Qiang Chen, Zichang Tan, Jiang-Jiang Liu, Jian Wang, Xiangbo Su, Xiaolong Li, Kun Yao, Junyu Han, Errui Ding, Yao Zhao, Jingdong Wang

    Abstract: In this paper, we study the problem of end-to-end multi-person pose estimation. State-of-the-art solutions adopt the DETR-like framework, and mainly develop the complex decoder, e.g., regarding pose estimation as keypoint box detection and combining with human detection in ED-Pose, hierarchically predicting with pose decoder and joint (keypoint) decoder in PETR. We present a simple yet effective t…

    Submitted 14 August, 2023; originally announced August 2023.

    Comments: Accepted by ICCV 2023

  31. arXiv:2308.07202  [pdf, other]

    cs.CV

    Towards Robust Real-Time Scene Text Detection: From Semantic to Instance Representation Learning

    Authors: Xugong Qin, Pengyuan Lyu, Chengquan Zhang, Yu Zhou, Kun Yao, Peng Zhang, Hailun Lin, Weiping Wang

    Abstract: Due to the flexible representation of arbitrary-shaped scene text and simple pipeline, bottom-up segmentation-based methods begin to be mainstream in real-time scene text detection. Despite great progress, these methods show deficiencies in robustness and still suffer from false positives and instance adhesion. Different from existing methods which integrate multiple-granularity features or multip…

    Submitted 14 August, 2023; originally announced August 2023.

    Comments: Accepted by ACM MM 2023

  32. arXiv:2307.12571  [pdf, other]

    cs.CV

    MataDoc: Margin and Text Aware Document Dewarping for Arbitrary Boundary

    Authors: Beiya Dai, Xing li, Qunyi Xie, Yulin Li, Xiameng Qin, Chengquan Zhang, Kun Yao, Junyu Han

    Abstract: Document dewarping from a distorted camera-captured image is of great value for OCR and document understanding. The document boundary plays an important role which is more evident than the inner region in document dewarping. Current learning-based methods mainly focus on complete boundary cases, leading to poor document correction performance of documents with incomplete boundaries. In contrast to…

    Submitted 24 July, 2023; originally announced July 2023.

    Comments: 12 pages

  33. arXiv:2306.17074  [pdf, other]

    cs.CV cs.AI

    Learning Structure-Guided Diffusion Model for 2D Human Pose Estimation

    Authors: Zhongwei Qiu, Qiansheng Yang, Jian Wang, Xiyu Wang, Chang Xu, Dongmei Fu, Kun Yao, Junyu Han, Errui Ding, Jingdong Wang

    Abstract: One of the mainstream schemes for 2D human pose estimation (HPE) is learning keypoints heatmaps by a neural network. Existing methods typically improve the quality of heatmaps by customized architectures, such as high-resolution representation and vision Transformers. In this paper, we propose DiffusionPose, a new scheme that formulates 2D HPE as a keypoints heatmaps generation problem fr…

    Submitted 29 June, 2023; originally announced June 2023.

  34. arXiv:2306.14182  [pdf, other]

    cs.CV cs.AI

    Switch-BERT: Learning to Model Multimodal Interactions by Switching Attention and Input

    Authors: Qingpei Guo, Kaisheng Yao, Wei Chu

    Abstract: The ability to model intra-modal and inter-modal interactions is fundamental in multimodal machine learning. The current state-of-the-art models usually adopt deep learning models with fixed structures. They can achieve exceptional performances on specific tasks, but face a particularly challenging problem of modality mismatch because of diversity of input modalities and their fixed structures. In…

    Submitted 25 June, 2023; originally announced June 2023.

    Comments: Accepted by ECCV2022

  35. arXiv:2306.05716  [pdf, other]

    cs.RO cs.AI

    Transferring Foundation Models for Generalizable Robotic Manipulation

    Authors: Jiange Yang, Wenhui Tan, Chuhao Jin, Keling Yao, Bei Liu, Jianlong Fu, Ruihua Song, Gangshan Wu, Limin Wang

    Abstract: Improving the generalization capabilities of general-purpose robotic manipulation agents in the real world has long been a significant challenge. Existing approaches often rely on collecting large-scale robotic data which is costly and time-consuming, such as the RT-1 dataset. However, due to insufficient diversity of data, these approaches typically suffer from limiting their capability in open-d…

    Submitted 18 March, 2024; v1 submitted 9 June, 2023; originally announced June 2023.

    Comments: 9 pages, 5 figures

  36. arXiv:2306.03287  [pdf, other]

    cs.CV

    ICDAR 2023 Competition on Structured Text Extraction from Visually-Rich Document Images

    Authors: Wenwen Yu, Chengquan Zhang, Haoyu Cao, Wei Hua, Bohan Li, Huang Chen, Mingyu Liu, Mingrui Chen, Jianfeng Kuang, Mengjun Cheng, Yuning Du, Shikun Feng, Xiaoguang Hu, Pengyuan Lyu, Kun Yao, Yuechen Yu, Yuliang Liu, Wanxiang Che, Errui Ding, Cheng-Lin Liu, Jiebo Luo, Shuicheng Yan, Min Zhang, Dimosthenis Karatzas, Xing Sun, et al. (2 additional authors not shown)

    Abstract: Structured text extraction is one of the most valuable and challenging application directions in the field of Document AI. However, the scenarios of past benchmarks are limited, and the corresponding evaluation protocols usually focus on the submodules of the structured text extraction scheme. In order to eliminate these problems, we organized the ICDAR 2023 competition on Structured text extracti…

    Submitted 5 June, 2023; originally announced June 2023.

    Comments: ICDAR 2023 Competition on SVRD report (to appear in ICDAR 2023)

  37. arXiv:2305.12793  [pdf, other]

    eess.AS cs.CL cs.MM cs.SD

    Zero-Shot End-to-End Spoken Language Understanding via Cross-Modal Selective Self-Training

    Authors: Jianfeng He, Julian Salazar, Kaisheng Yao, Haoqi Li, Jinglun Cai

    Abstract: End-to-end (E2E) spoken language understanding (SLU) is constrained by the cost of collecting speech-semantics pairs, especially when label domains change. Hence, we explore zero-shot E2E SLU, which learns E2E SLU without speech-semantics pairs, instead using only speech-text and text-semantics pairs. Previous work achieved zero-shot by pseudolabeling all speech-text transcripts with a na…

    Submitted 2 February, 2024; v1 submitted 22 May, 2023; originally announced May 2023.

    Comments: 18 pages, 7 figures

  38. arXiv:2305.11392  [pdf, other]

    cs.CV cs.CL

    Fast-StrucTexT: An Efficient Hourglass Transformer with Modality-guided Dynamic Token Merge for Document Understanding

    Authors: Mingliang Zhai, Yulin Li, Xiameng Qin, Chen Yi, Qunyi Xie, Chengquan Zhang, Kun Yao, Yuwei Wu, Yunde Jia

    Abstract: Transformers achieve promising performance in document understanding because of their high effectiveness and still suffer from quadratic computational complexity dependency on the sequence length. General efficient transformers are challenging to be directly adapted to model document. They are unable to handle the layout representation in documents, e.g. word, line and paragraph, on different gran…

    Submitted 18 May, 2023; originally announced May 2023.

    Comments: IJCAI 2023

  39. Seq-HGNN: Learning Sequential Node Representation on Heterogeneous Graph

    Authors: Chenguang Du, Kaichun Yao, Hengshu Zhu, Deqing Wang, Fuzhen Zhuang, Hui Xiong

    Abstract: Recent years have witnessed the rapid development of heterogeneous graph neural networks (HGNNs) in information retrieval (IR) applications. Many existing HGNNs design a variety of tailor-made graph convolutions to capture structural and semantic information in heterogeneous graphs. However, existing HGNNs usually represent each node as a single vector in the multi-layer graph convolution calculat…

    Submitted 12 August, 2023; v1 submitted 18 May, 2023; originally announced May 2023.

    Comments: SIGIR 2023

  40. arXiv:2304.03184  [pdf, other]

    cs.CV

    Instant-NVR: Instant Neural Volumetric Rendering for Human-object Interactions from Monocular RGBD Stream

    Authors: Yuheng Jiang, Kaixin Yao, Zhuo Su, Zhehao Shen, Haimin Luo, Lan Xu

    Abstract: Convenient 4D modeling of human-object interactions is essential for numerous applications. However, monocular tracking and rendering of complex interaction scenarios remain challenging. In this paper, we propose Instant-NVR, a neural approach for instant volumetric human-object tracking and rendering using a single RGBD camera. It bridges traditional non-rigid tracking with recent instant radianc…

    Submitted 6 April, 2023; originally announced April 2023.

    Comments: CVPR 2023

  41. arXiv:2303.09956  [pdf, other]

    cs.CV cs.AI

    GNNFormer: A Graph-based Framework for Cytopathology Report Generation

    Authors: Yang-Fan Zhou, Kai-Lang Yao, Wu-Jun Li

    Abstract: Cytopathology report generation is a necessary step for the standardized examination of pathology images. However, manually writing detailed reports brings heavy workloads for pathologists. To improve efficiency, some existing works have studied automatic generation of cytopathology reports, mainly by applying image caption generation frameworks with visual encoders originally proposed for natural…

    Submitted 17 March, 2023; originally announced March 2023.

    Comments: 12 pages, 6 figures

  42. Exploiting Kinematic Redundancy for Robotic Grasping of Multiple Objects

    Authors: Kunpeng Yao, Aude Billard

    Abstract: Humans coordinate the abundant degrees of freedom (DoFs) of hands to dexterously perform tasks in everyday life. We imitate human strategies to advance the dexterity of multi-DoF robotic hands. Specifically, we enable a robot hand to grasp multiple objects by exploiting its kinematic redundancy, referring to all its controllable DoFs. We propose a human-like grasp synthesis algorithm to generate g…

    Submitted 30 March, 2023; v1 submitted 3 March, 2023; originally announced March 2023.

  43. arXiv:2303.00289  [pdf, other]

    cs.CV

    StrucTexTv2: Masked Visual-Textual Prediction for Document Image Pre-training

    Authors: Yuechen Yu, Yulin Li, Chengquan Zhang, Xiaoqiang Zhang, Zengyuan Guo, Xiameng Qin, Kun Yao, Junyu Han, Errui Ding, Jingdong Wang

    Abstract: In this paper, we present StrucTexTv2, an effective document image pre-training framework, by performing masked visual-textual prediction. It consists of two self-supervised pre-training tasks: masked image modeling and masked language modeling, based on text region-level image masking. The proposed method randomly masks some image regions according to the bounding box coordinates of text words. T…

    Submitted 1 March, 2023; originally announced March 2023.

    Comments: ICLR 2023

  44. arXiv:2303.00170  [pdf, other]

    cs.LG cs.SI

    Asymmetric Learning for Graph Neural Network based Link Prediction

    Authors: Kai-Lang Yao, Wu-Jun Li

    Abstract: Link prediction is a fundamental problem in many graph based applications, such as protein-protein interaction prediction. Graph neural network (GNN) has recently been widely used for link prediction. However, existing GNN based link prediction (GNN-LP) methods suffer from scalability problem during training for large-scale graphs, which has received little attention by researchers. In this paper,…

    Submitted 28 February, 2023; originally announced March 2023.

  45. arXiv:2302.06676  [pdf, other]

    cs.LG cs.IR

    Netflix and Forget: Efficient and Exact Machine Unlearning from Bi-linear Recommendations

    Authors: Mimee Xu, Jiankai Sun, Xin Yang, Kevin Yao, Chong Wang

    Abstract: People break up, miscarry, and lose loved ones. Their online streaming and shopping recommendations, however, do not necessarily update, and may serve as unhappy reminders of their loss. When users want to renege on their past actions, they expect the recommender platforms to erase selective data at the model level. Ideally, given any specified user history, the recommender can unwind or "forget",…

    Submitted 13 February, 2023; originally announced February 2023.

    Comments: 8 pages, 8 figures

  46. arXiv:2212.08568  [pdf, other]

    cs.CV cs.LG

    Biomedical image analysis competitions: The state of current participation practice

    Authors: Matthias Eisenmann, Annika Reinke, Vivienn Weru, Minu Dietlinde Tizabi, Fabian Isensee, Tim J. Adler, Patrick Godau, Veronika Cheplygina, Michal Kozubek, Sharib Ali, Anubha Gupta, Jan Kybic, Alison Noble, Carlos Ortiz de Solórzano, Samiksha Pachade, Caroline Petitjean, Daniel Sage, Donglai Wei, Elizabeth Wilden, Deepak Alapatt, Vincent Andrearczyk, Ujjwal Baid, Spyridon Bakas, Niranjan Balu, Sophia Bano, et al. (331 additional authors not shown)

    Abstract: The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about the common practice as well as bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical imaging analysis,…

    Submitted 12 September, 2023; v1 submitted 16 December, 2022; originally announced December 2022.

  47. arXiv:2211.14805  [pdf, other]

    cs.CV

    Rethinking Data Augmentation for Single-source Domain Generalization in Medical Image Segmentation

    Authors: Zixian Su, Kai Yao, Xi Yang, Qiufeng Wang, Jie Sun, Kaizhu Huang

    Abstract: Single-source domain generalization (SDG) in medical image segmentation is a challenging yet essential task as domain shifts are quite common among clinical image datasets. Previous attempts most conduct global-only/random augmentation. Their augmented samples are usually insufficient in diversity and informativeness, thus failing to cover the possible target domain distribution. In this paper, we…

    Submitted 27 November, 2022; originally announced November 2022.

  48. arXiv:2211.09799  [pdf, other]

    cs.CV

    CAE v2: Context Autoencoder with CLIP Target

    Authors: Xinyu Zhang, Jiahui Chen, Junkun Yuan, Qiang Chen, Jian Wang, Xiaodi Wang, Shumin Han, Xiaokang Chen, Jimin Pi, Kun Yao, Junyu Han, Errui Ding, Jingdong Wang

    Abstract: Masked image modeling (MIM) learns visual representation by masking and reconstructing image patches. Applying the reconstruction supervision on the CLIP representation has been proven effective for MIM. However, it is still under-explored how CLIP supervision in MIM influences performance. To investigate strategies for refining the CLIP-targeted MIM, we study two critical elements in MIM, i.e., t…

    Submitted 17 November, 2022; originally announced November 2022.

  49. arXiv:2211.03594  [pdf, ps, other]

    cs.CV

    Group DETR v2: Strong Object Detector with Encoder-Decoder Pretraining

    Authors: Qiang Chen, Jian Wang, Chuchu Han, Shan Zhang, Zexian Li, Xiaokang Chen, Jiahui Chen, Xiaodi Wang, Shuming Han, Gang Zhang, Haocheng Feng, Kun Yao, Junyu Han, Errui Ding, Jingdong Wang

    Abstract: We present a strong object detector with encoder-decoder pretraining and finetuning. Our method, called Group DETR v2, is built upon a vision transformer encoder ViT-Huge~\cite{dosovitskiy2020image}, a DETR variant DINO~\cite{zhang2022dino}, and an efficient DETR training method Group DETR~\cite{chen2022group}. The training process consists of self-supervised pretraining and finetuning a ViT-Huge…

    Submitted 7 November, 2022; originally announced November 2022.

    Comments: Tech report, 3 pages. We establishes a new SoTA (64.5 mAP) on the COCO test-dev

  50. arXiv:2209.08468  [pdf, other]

    cs.GR cs.CV

    Human Performance Modeling and Rendering via Neural Animated Mesh

    Authors: Fuqiang Zhao, Yuheng Jiang, Kaixin Yao, Jiakai Zhang, Liao Wang, Haizhao Dai, Yuhui Zhong, Yingliang Zhang, Minye Wu, Lan Xu, Jingyi Yu

    Abstract: We have recently seen tremendous progress in the neural advances for photo-real human modeling and rendering. However, it's still challenging to integrate them into an existing mesh-based pipeline for downstream applications. In this paper, we present a comprehensive neural approach for high-quality reconstruction, compression, and rendering of human performances from dense multi-view videos. Our…

    Submitted 17 September, 2022; originally announced September 2022.

    Comments: 18 pages, 17 figures