Search | arXiv e-print repository

doi 10.1016/j.measurement.2024.115376

On Flange-based 3D Hand-Eye Calibration for Soft Robotic Tactile Welding

Authors: Xudong Han, Ning Guo, Yu Jie, He Wang, Fang Wan, Chaoyang Song

Abstract: This paper investigates the direct application of standardized designs on the robot for conducting robot hand-eye calibration by employing 3D scanners with collaborative robots. The well-established geometric features of the robot flange are exploited by directly capturing its point cloud data. In particular, an iterative method is proposed to facilitate point cloud processing toward a refined cal… ▽ More This paper investigates the direct application of standardized designs on the robot for conducting robot hand-eye calibration by employing 3D scanners with collaborative robots. The well-established geometric features of the robot flange are exploited by directly capturing its point cloud data. In particular, an iterative method is proposed to facilitate point cloud processing toward a refined calibration outcome. Several extensive experiments are conducted over a range of collaborative robots, including Universal Robots UR5 & UR10 e-series, Franka Emika, and AUBO i5 using an industrial-grade 3D scanner Photoneo Phoxi S & M and a commercial-grade 3D scanner Microsoft Azure Kinect DK. Experimental results show that translational and rotational errors converge efficiently to less than 0.28 mm and 0.25 degrees, respectively, achieving a hand-eye calibration accuracy as high as the camera's resolution, probing the hardware limit. A welding seam tracking system is presented, combining the flange-based calibration method with soft tactile sensing. The experiment results show that the system enables the robot to adjust its motion in real-time, ensuring consistent weld quality and paving the way for more efficient and adaptable manufacturing processes. △ Less

Submitted 27 July, 2024; v1 submitted 22 July, 2024; originally announced July 2024.

Comments: 25 pages, 14 figures, 2 tables, Accepted by Measurement

arXiv:2407.12449 [pdf, other]

Close the Sim2real Gap via Physically-based Structured Light Synthetic Data Simulation

Authors: Kaixin Bai, Lei Zhang, Zhaopeng Chen, Fang Wan, Jianwei Zhang

Abstract: Despite the substantial progress in deep learning, its adoption in industrial robotics projects remains limited, primarily due to challenges in data acquisition and labeling. Previous sim2real approaches using domain randomization require extensive scene and model optimization. To address these issues, we introduce an innovative physically-based structured light simulation system, generating both… ▽ More Despite the substantial progress in deep learning, its adoption in industrial robotics projects remains limited, primarily due to challenges in data acquisition and labeling. Previous sim2real approaches using domain randomization require extensive scene and model optimization. To address these issues, we introduce an innovative physically-based structured light simulation system, generating both RGB and physically realistic depth images, surpassing previous dataset generation tools. We create an RGBD dataset tailored for robotic industrial grasping scenarios and evaluate it across various tasks, including object detection, instance segmentation, and embedding sim2real visual perception in industrial robotic grasping. By reducing the sim2real gap and enhancing deep learning training, we facilitate the application of deep learning models in industrial settings. Project details are available at https://baikaixinpublic.github.io/structured light 3D synthesizer/. △ Less

Submitted 17 July, 2024; originally announced July 2024.

Comments: 7 pages, 2024 IEEE International Conference on Robotics and Automation

arXiv:2407.01094 [pdf, other]

Evaluation of Text-to-Video Generation Models: A Dynamics Perspective

Authors: Mingxiang Liao, Hannan Lu, Xinyu Zhang, Fang Wan, Tianyu Wang, Yuzhong Zhao, Wangmeng Zuo, Qixiang Ye, Jingdong Wang

Abstract: Comprehensive and constructive evaluation protocols play an important role in the development of sophisticated text-to-video (T2V) generation models. Existing evaluation protocols primarily focus on temporal consistency and content continuity, yet largely ignore the dynamics of video content. Dynamics are an essential dimension for measuring the visual vividness and the honesty of video content to… ▽ More Comprehensive and constructive evaluation protocols play an important role in the development of sophisticated text-to-video (T2V) generation models. Existing evaluation protocols primarily focus on temporal consistency and content continuity, yet largely ignore the dynamics of video content. Dynamics are an essential dimension for measuring the visual vividness and the honesty of video content to text prompts. In this study, we propose an effective evaluation protocol, termed DEVIL, which centers on the dynamics dimension to evaluate T2V models. For this purpose, we establish a new benchmark comprising text prompts that fully reflect multiple dynamics grades, and define a set of dynamics scores corresponding to various temporal granularities to comprehensively evaluate the dynamics of each generated video. Based on the new benchmark and the dynamics scores, we assess T2V models with the design of three metrics: dynamics range, dynamics controllability, and dynamics-based quality. Experiments show that DEVIL achieves a Pearson correlation exceeding 90% with human ratings, demonstrating its potential to advance T2V generation models. Code is available at https://github.com/MingXiangL/DEVIL. △ Less

Submitted 1 July, 2024; originally announced July 2024.

arXiv:2407.01050 [pdf, other]

Evolutionary Morphology Towards Overconstrained Locomotion via Large-Scale, Multi-Terrain Deep Reinforcement Learning

Authors: Yenan Chen, Chuye Zhang, Pengxi Gu, Jianuo Qiu, Jiayi Yin, Nuofan Qiu, Guojing Huang, Bangchao Huang, Zishang Zhang, Hui Deng, Wei Zhang, Fang Wan, Chaoyang Song

Abstract: While the animals' Fin-to-Limb evolution has been well-researched in biology, such morphological transformation remains under-adopted in the modern design of advanced robotic limbs. This paper investigates a novel class of overconstrained locomotion from a design and learning perspective inspired by evolutionary morphology, aiming to integrate the concept of `intelligent design under constraints'… ▽ More While the animals' Fin-to-Limb evolution has been well-researched in biology, such morphological transformation remains under-adopted in the modern design of advanced robotic limbs. This paper investigates a novel class of overconstrained locomotion from a design and learning perspective inspired by evolutionary morphology, aiming to integrate the concept of `intelligent design under constraints' - hereafter referred to as constraint-driven design intelligence - in developing modern robotic limbs with superior energy efficiency. We propose a 3D-printable design of robotic limbs parametrically reconfigurable as a classical planar 4-bar linkage, an overconstrained Bennett linkage, and a spherical 4-bar linkage. These limbs adopt a co-axial actuation, identical to the modern legged robot platforms, with the added capability of upgrading into a wheel-legged system. Then, we implemented a large-scale, multi-terrain deep reinforcement learning framework to train these reconfigurable limbs for a comparative analysis of overconstrained locomotion in energy efficiency. Results show that the overconstrained limbs exhibit more efficient locomotion than planar limbs during forward and sideways walking over different terrains, including floors, slopes, and stairs, with or without random noises, by saving at least 22% mechanical energy in completing the traverse task, with the spherical limbs being the least efficient. It also achieves the highest average speed of 0.85 meters per second on flat terrain, which is 20% faster than the planar limbs. This study paves the path for an exciting direction for future research in overconstrained robotics leveraging evolutionary morphology and reconfigurable mechanism intelligence when combined with state-of-the-art methods in deep reinforcement learning. △ Less

Submitted 1 July, 2024; originally announced July 2024.

Comments: 13 pages, 5 figures, Accepted and Presented at ReMAR2024

arXiv:2406.14136 [pdf, other]

One Fling to Goal: Environment-aware Dynamics for Goal-conditioned Fabric Flinging

Authors: Linhan Yang, Lei Yang, Haoran Sun, Zeqing Zhang, Haibin He, Fang Wan, Chaoyang Song, Jia Pan

Abstract: Fabric manipulation dynamically is commonly seen in manufacturing and domestic settings. While dynamically manipulating a fabric piece to reach a target state is highly efficient, this task presents considerable challenges due to the varying properties of different fabrics, complex dynamics when interacting with environments, and meeting required goal conditions. To address these challenges, we pr… ▽ More Fabric manipulation dynamically is commonly seen in manufacturing and domestic settings. While dynamically manipulating a fabric piece to reach a target state is highly efficient, this task presents considerable challenges due to the varying properties of different fabrics, complex dynamics when interacting with environments, and meeting required goal conditions. To address these challenges, we present \textit{One Fling to Goal}, an algorithm capable of handling fabric pieces with diverse shapes and physical properties across various scenarios. Our method learns a graph-based dynamics model equipped with environmental awareness. With this dynamics model, we devise a real-time controller to enable high-speed fabric manipulation in one attempt, requiring less than 3 seconds to finish the goal-conditioned task. We experimentally validate our method on a goal-conditioned manipulation task in five diverse scenarios. Our method significantly improves this goal-conditioned task, achieving an average error of 13.2mm in complex scenarios. Our method can be seamlessly transferred to real-world robotic systems and generalized to unseen scenarios in a zero-shot manner. △ Less

Submitted 20 June, 2024; originally announced June 2024.

arXiv:2406.10813 [pdf, other]

Self-Evolution Fine-Tuning for Policy Optimization

Authors: Ruijun Chen, Jiehao Liang, Shiping Gao, Fanqi Wan, Xiaojun Quan

Abstract: The alignment of large language models (LLMs) is crucial not only for unlocking their potential in specific tasks but also for ensuring that responses meet human expectations and adhere to safety and ethical principles. Current alignment methodologies face considerable challenges. For instance, supervised fine-tuning (SFT) requires extensive, high-quality annotated samples, while reinforcement lea… ▽ More The alignment of large language models (LLMs) is crucial not only for unlocking their potential in specific tasks but also for ensuring that responses meet human expectations and adhere to safety and ethical principles. Current alignment methodologies face considerable challenges. For instance, supervised fine-tuning (SFT) requires extensive, high-quality annotated samples, while reinforcement learning from human feedback (RLHF) is complex and often unstable. In this paper, we introduce self-evolution fine-tuning (SEFT) for policy optimization, with the aim of eliminating the need for annotated samples while retaining the stability and efficiency of SFT. SEFT first trains an adaptive reviser to elevate low-quality responses while maintaining high-quality ones. The reviser then gradually guides the policy's optimization by fine-tuning it with enhanced responses. One of the prominent features of this method is its ability to leverage unlimited amounts of unannotated data for policy optimization through supervised fine-tuning. Our experiments on AlpacaEval 2.0 and MT-Bench demonstrate the effectiveness of SEFT. We also provide a comprehensive analysis of its advantages over existing alignment techniques. △ Less

Submitted 16 June, 2024; originally announced June 2024.

arXiv:2406.10744 [pdf, other]

Technique Report of CVPR 2024 PBDL Challenges

Authors: Ying Fu, Yu Li, Shaodi You, Boxin Shi, Linwei Chen, Yunhao Zou, Zichun Wang, Yichen Li, Yuze Han, Yingkai Zhang, Jianan Wang, Qinglin Liu, Wei Yu, Xiaoqian Lv, Jianing Li, Shengping Zhang, Xiangyang Ji, Yuanpei Chen, Yuhan Zhang, Weihang Peng, Liwen Zhang, Zhe Xu, Dingyong Gou, Cong Li, Senyan Xu , et al. (75 additional authors not shown)

Abstract: The intersection of physics-based vision and deep learning presents an exciting frontier for advancing computer vision technologies. By leveraging the principles of physics to inform and enhance deep learning models, we can develop more robust and accurate vision systems. Physics-based vision aims to invert the processes to recover scene properties such as shape, reflectance, light distribution, a… ▽ More The intersection of physics-based vision and deep learning presents an exciting frontier for advancing computer vision technologies. By leveraging the principles of physics to inform and enhance deep learning models, we can develop more robust and accurate vision systems. Physics-based vision aims to invert the processes to recover scene properties such as shape, reflectance, light distribution, and medium properties from images. In recent years, deep learning has shown promising improvements for various vision tasks, and when combined with physics-based vision, these approaches can enhance the robustness and accuracy of vision systems. This technical report summarizes the outcomes of the Physics-Based Vision Meets Deep Learning (PBDL) 2024 challenge, held in CVPR 2024 workshop. The challenge consisted of eight tracks, focusing on Low-Light Enhancement and Detection as well as High Dynamic Range (HDR) Imaging. This report details the objectives, methodologies, and results of each track, highlighting the top-performing solutions and their innovative approaches. △ Less

Submitted 12 July, 2024; v1 submitted 15 June, 2024; originally announced June 2024.

Comments: CVPR 2024 PBDL Challenges: https://pbdl-ws.github.io/pbdl2024/challenge/index.html

arXiv:2406.10594 [pdf, other]

BlockPruner: Fine-grained Pruning for Large Language Models

Authors: Longguang Zhong, Fanqi Wan, Ruijun Chen, Xiaojun Quan, Liangzhi Li

Abstract: With the rapid growth in the size and complexity of large language models (LLMs), the costs associated with their training and inference have escalated significantly. Research indicates that certain layers in LLMs harbor substantial redundancy, and pruning these layers has minimal impact on the overall performance. While various layer pruning methods have been developed based on this insight, they… ▽ More With the rapid growth in the size and complexity of large language models (LLMs), the costs associated with their training and inference have escalated significantly. Research indicates that certain layers in LLMs harbor substantial redundancy, and pruning these layers has minimal impact on the overall performance. While various layer pruning methods have been developed based on this insight, they generally overlook the finer-grained redundancies within the layers themselves. In this paper, we delve deeper into the architecture of LLMs and demonstrate that finer-grained pruning can be achieved by targeting redundancies in multi-head attention (MHA) and multi-layer perceptron (MLP) blocks. We propose a novel, training-free structured pruning approach called BlockPruner. Unlike existing layer pruning methods, BlockPruner segments each Transformer layer into MHA and MLP blocks. It then assesses the importance of these blocks using perplexity measures and applies a heuristic search for iterative pruning. We applied BlockPruner to LLMs of various sizes and architectures and validated its performance across a wide range of downstream tasks. Experimental results show that BlockPruner achieves more granular and effective pruning compared to state-of-the-art baselines. △ Less

Submitted 20 June, 2024; v1 submitted 15 June, 2024; originally announced June 2024.

arXiv:2405.16071 [pdf, other]

DynRefer: Delving into Region-level Multi-modality Tasks via Dynamic Resolution

Authors: Yuzhong Zhao, Feng Liu, Yue Liu, Mingxiang Liao, Chen Gong, Qixiang Ye, Fang Wan

Abstract: Region-level multi-modality methods can translate referred image regions to human preferred language descriptions. Unfortunately, most of existing methods using fixed visual inputs remain lacking the resolution adaptability to find out precise language descriptions. In this study, we propose a dynamic resolution approach, referred to as DynRefer, to pursue high-accuracy region-level referring thro… ▽ More Region-level multi-modality methods can translate referred image regions to human preferred language descriptions. Unfortunately, most of existing methods using fixed visual inputs remain lacking the resolution adaptability to find out precise language descriptions. In this study, we propose a dynamic resolution approach, referred to as DynRefer, to pursue high-accuracy region-level referring through mimicking the resolution adaptability of human visual cognition. DynRefer first implements stochastic vision-language alignment. It aligns desired language descriptions of multi-modality tasks with images of stochastic resolution, which are constructed by nesting a set of views around the referred region. DynRefer then implements dynamic multi-modality referring, which is realized by selecting views based on image and language priors. This allows the visual information used for referring to better match human preferences, thereby improving the representational adaptability of region-level multi-modality models. Extensive experiments show that DynRefer brings mutual improvement upon tasks including region-level captioning, open-vocabulary region recognition and attribute detection. Last but not least, DynRefer achieves new state-of-the-art on multiple region-level multi-modality tasks using a single model. Code is available at https://github.com/callsys/DynRefer. △ Less

Submitted 25 May, 2024; originally announced May 2024.

Comments: Code is available at https://github.com/callsys/DynRefer

arXiv:2403.09363 [pdf, other]

Sentinel-Guided Zero-Shot Learning: A Collaborative Paradigm without Real Data Exposure

Authors: Fan Wan, Xingyu Miao, Haoran Duan, Jingjing Deng, Rui Gao, Yang Long

Abstract: With increasing concerns over data privacy and model copyrights, especially in the context of collaborations between AI service providers and data owners, an innovative SG-ZSL paradigm is proposed in this work. SG-ZSL is designed to foster efficient collaboration without the need to exchange models or sensitive data. It consists of a teacher model, a student model and a generator that links both m… ▽ More With increasing concerns over data privacy and model copyrights, especially in the context of collaborations between AI service providers and data owners, an innovative SG-ZSL paradigm is proposed in this work. SG-ZSL is designed to foster efficient collaboration without the need to exchange models or sensitive data. It consists of a teacher model, a student model and a generator that links both model entities. The teacher model serves as a sentinel on behalf of the data owner, replacing real data, to guide the student model at the AI service provider's end during training. Considering the disparity of knowledge space between the teacher and student, we introduce two variants of the teacher model: the omniscient and the quasi-omniscient teachers. Under these teachers' guidance, the student model seeks to match the teacher model's performance and explores domains that the teacher has not covered. To trade off between privacy and performance, we further introduce two distinct security-level training protocols: white-box and black-box, enhancing the paradigm's adaptability. Despite the inherent challenges of real data absence in the SG-ZSL paradigm, it consistently outperforms in ZSL and GZSL tasks, notably in the white-box protocol. Our comprehensive evaluation further attests to its robustness and efficiency across various setups, including stringent black-box training protocol. △ Less

Submitted 14 March, 2024; originally announced March 2024.

arXiv:2402.16107 [pdf, other]

Knowledge Fusion of Chat LLMs: A Preliminary Technical Report

Authors: Fanqi Wan, Ziyi Yang, Longguang Zhong, Xiaojun Quan, Xinting Huang, Wei Bi

Abstract: Recently, FuseLLM introduced the concept of knowledge fusion to transfer the collective knowledge of multiple structurally varied LLMs into a target LLM through lightweight continual training. In this report, we extend the scalability and flexibility of the FuseLLM framework to realize the fusion of chat LLMs, resulting in FusionChat. FusionChat comprises two main stages. Firstly, we undertake kno… ▽ More Recently, FuseLLM introduced the concept of knowledge fusion to transfer the collective knowledge of multiple structurally varied LLMs into a target LLM through lightweight continual training. In this report, we extend the scalability and flexibility of the FuseLLM framework to realize the fusion of chat LLMs, resulting in FusionChat. FusionChat comprises two main stages. Firstly, we undertake knowledge fusion for structurally and scale-varied source LLMs to derive multiple target LLMs of identical structure and size via lightweight fine-tuning. Then, these target LLMs are merged within the parameter space, wherein we propose a novel method for determining the merging weights based on the variation ratio of parameter matrices before and after fine-tuning. We validate our approach using three prominent chat LLMs with diverse architectures and scales, namely NH2-Mixtral-8x7B, NH2-Solar-10.7B, and OpenChat-3.5-7B. Experimental results spanning various chat domains demonstrate the superiority of FusionChat-7B across a broad spectrum of chat LLMs at 7B and 34B scales, even surpassing GPT-3.5 (March) and approaching Mixtral-8x7B-Instruct. △ Less

Submitted 28 May, 2024; v1 submitted 25 February, 2024; originally announced February 2024.

Comments: Technical Report, work in progress

arXiv:2402.03634 [pdf, other]

Ray Denoising: Depth-aware Hard Negative Sampling for Multi-view 3D Object Detection

Authors: Feng Liu, Tengteng Huang, Qianjing Zhang, Haotian Yao, Chi Zhang, Fang Wan, Qixiang Ye, Yanzhao Zhou

Abstract: Multi-view 3D object detection systems often struggle with generating precise predictions due to the challenges in estimating depth from images, increasing redundant and incorrect detections. Our paper presents Ray Denoising, an innovative method that enhances detection accuracy by strategically sampling along camera rays to construct hard negative examples. These examples, visually challenging to… ▽ More Multi-view 3D object detection systems often struggle with generating precise predictions due to the challenges in estimating depth from images, increasing redundant and incorrect detections. Our paper presents Ray Denoising, an innovative method that enhances detection accuracy by strategically sampling along camera rays to construct hard negative examples. These examples, visually challenging to differentiate from true positives, compel the model to learn depth-aware features, thereby improving its capacity to distinguish between true and false positives. Ray Denoising is designed as a plug-and-play module, compatible with any DETR-style multi-view 3D detectors, and it only minimally increases training computational costs without affecting inference speed. Our comprehensive experiments, including detailed ablation studies, consistently demonstrate that Ray Denoising outperforms strong baselines across multiple datasets. It achieves a 1.9\% improvement in mean Average Precision (mAP) over the state-of-the-art StreamPETR method on the NuScenes dataset. It shows significant performance gains on the Argoverse 2 dataset, highlighting its generalization capability. The code will be available at https://github.com/LiewFeng/RayDN. △ Less

Submitted 12 March, 2024; v1 submitted 5 February, 2024; originally announced February 2024.

arXiv:2402.01950 [pdf, other]

ConRF: Zero-shot Stylization of 3D Scenes with Conditioned Radiation Fields

Authors: Xingyu Miao, Yang Bai, Haoran Duan, Fan Wan, Yawen Huang, Yang Long, Yefeng Zheng

Abstract: Most of the existing works on arbitrary 3D NeRF style transfer required retraining on each single style condition. This work aims to achieve zero-shot controlled stylization in 3D scenes utilizing text or visual input as conditioning factors. We introduce ConRF, a novel method of zero-shot stylization. Specifically, due to the ambiguity of CLIP features, we employ a conversion process that maps th… ▽ More Most of the existing works on arbitrary 3D NeRF style transfer required retraining on each single style condition. This work aims to achieve zero-shot controlled stylization in 3D scenes utilizing text or visual input as conditioning factors. We introduce ConRF, a novel method of zero-shot stylization. Specifically, due to the ambiguity of CLIP features, we employ a conversion process that maps the CLIP feature space to the style space of a pre-trained VGG network and then refine the CLIP multi-modal knowledge into a style transfer neural radiation field. Additionally, we use a 3D volumetric representation to perform local style transfer. By combining these operations, ConRF offers the capability to utilize either text or images as references, resulting in the generation of sequences with novel views enhanced by global or local stylization. Our experiment demonstrates that ConRF outperforms other existing methods for 3D scene and single-text stylization in terms of visual quality. △ Less

Submitted 6 March, 2024; v1 submitted 2 February, 2024; originally announced February 2024.

arXiv:2401.17910 [pdf, other]

ControlCap: Controllable Region-level Captioning

Authors: Yuzhong Zhao, Yue Liu, Zonghao Guo, Weijia Wu, Chen Gong, Fang Wan, Qixiang Ye

Abstract: Region-level captioning is challenged by the caption degeneration issue, which refers to that pre-trained multimodal models tend to predict the most frequent captions but miss the less frequent ones. In this study, we propose a controllable region-level captioning (ControlCap) approach, which introduces control words to a multimodal model to address the caption degeneration issue. In specific, Con… ▽ More Region-level captioning is challenged by the caption degeneration issue, which refers to that pre-trained multimodal models tend to predict the most frequent captions but miss the less frequent ones. In this study, we propose a controllable region-level captioning (ControlCap) approach, which introduces control words to a multimodal model to address the caption degeneration issue. In specific, ControlCap leverages a discriminative module to generate control words within the caption space to partition it to multiple sub-spaces. The multimodal model is constrained to generate captions within a few sub-spaces containing the control words, which increases the opportunity of hitting less frequent captions, alleviating the caption degeneration issue. Furthermore, interactive control words can be given by either a human or an expert model, which enables captioning beyond the training caption space, enhancing the model's generalization ability. Extensive experiments on Visual Genome and RefCOCOg datasets show that ControlCap respectively improves the CIDEr score by 21.6 and 2.2, outperforming the state-of-the-arts by significant margins. Code is available at https://github.com/callsys/ControlCap. △ Less

Submitted 9 March, 2024; v1 submitted 31 January, 2024; originally announced January 2024.

Comments: https://github.com/callsys/ControlCap

arXiv:2401.10768 [pdf, other]

Knowledge Verification to Nip Hallucination in the Bud

Authors: Fanqi Wan, Xinting Huang, Leyang Cui, Xiaojun Quan, Wei Bi, Shuming Shi

Abstract: While large language models (LLMs) have demonstrated exceptional performance across various tasks following human alignment, they may still generate responses that sound plausible but contradict factual knowledge, a phenomenon known as \emph{hallucination}. In this paper, we demonstrate the feasibility of mitigating hallucinations by verifying and minimizing the inconsistency between external know… ▽ More While large language models (LLMs) have demonstrated exceptional performance across various tasks following human alignment, they may still generate responses that sound plausible but contradict factual knowledge, a phenomenon known as \emph{hallucination}. In this paper, we demonstrate the feasibility of mitigating hallucinations by verifying and minimizing the inconsistency between external knowledge present in the alignment data and the intrinsic knowledge embedded within foundation LLMs. Specifically, we propose a novel approach called Knowledge Consistent Alignment (KCA), which employs a well-aligned LLM to automatically formulate assessments based on external knowledge to evaluate the knowledge boundaries of foundation LLMs. To address knowledge inconsistencies in the alignment data, KCA implements several specific strategies to deal with these data instances. We demonstrate the superior efficacy of KCA in reducing hallucinations across six benchmarks, utilizing foundation LLMs of varying backbones and scales. This confirms the effectiveness of mitigating hallucinations by reducing knowledge inconsistency. Our code, model weights, and data are openly accessible at \url{https://github.com/fanqiwan/KCA}. △ Less

Submitted 16 April, 2024; v1 submitted 19 January, 2024; originally announced January 2024.

Comments: Work in progress

arXiv:2401.10491 [pdf, other]

Knowledge Fusion of Large Language Models

Authors: Fanqi Wan, Xinting Huang, Deng Cai, Xiaojun Quan, Wei Bi, Shuming Shi

Abstract: While training large language models (LLMs) from scratch can generate models with distinct functionalities and strengths, it comes at significant costs and may result in redundant capabilities. Alternatively, a cost-effective and compelling approach is to merge existing pre-trained LLMs into a more potent model. However, due to the varying architectures of these LLMs, directly blending their weigh… ▽ More While training large language models (LLMs) from scratch can generate models with distinct functionalities and strengths, it comes at significant costs and may result in redundant capabilities. Alternatively, a cost-effective and compelling approach is to merge existing pre-trained LLMs into a more potent model. However, due to the varying architectures of these LLMs, directly blending their weights is impractical. In this paper, we introduce the notion of knowledge fusion for LLMs, aimed at combining the capabilities of existing LLMs and transferring them into a single LLM. By leveraging the generative distributions of source LLMs, we externalize their collective knowledge and unique strengths, thereby potentially elevating the capabilities of the target model beyond those of any individual source LLM. We validate our approach using three popular LLMs with different architectures--Llama-2, MPT, and OpenLLaMA--across various benchmarks and tasks. Our findings confirm that the fusion of LLMs can improve the performance of the target model across a range of capabilities such as reasoning, commonsense, and code generation. Our code, model weights, and data are public at \url{https://github.com/fanqiwan/FuseLLM}. △ Less

Submitted 22 January, 2024; v1 submitted 19 January, 2024; originally announced January 2024.

Comments: Accepted to ICLR 2024

arXiv:2401.04861 [pdf, other]

doi 10.1016/j.patcog.2024.110729

CTNeRF: Cross-Time Transformer for Dynamic Neural Radiance Field from Monocular Video

Authors: Xingyu Miao, Yang Bai, Haoran Duan, Yawen Huang, Fan Wan, Yang Long, Yefeng Zheng

Abstract: The goal of our work is to generate high-quality novel views from monocular videos of complex and dynamic scenes. Prior methods, such as DynamicNeRF, have shown impressive performance by leveraging time-varying dynamic radiation fields. However, these methods have limitations when it comes to accurately modeling the motion of complex objects, which can lead to inaccurate and blurry renderings of d… ▽ More The goal of our work is to generate high-quality novel views from monocular videos of complex and dynamic scenes. Prior methods, such as DynamicNeRF, have shown impressive performance by leveraging time-varying dynamic radiation fields. However, these methods have limitations when it comes to accurately modeling the motion of complex objects, which can lead to inaccurate and blurry renderings of details. To address this limitation, we propose a novel approach that builds upon a recent generalization NeRF, which aggregates nearby views onto new viewpoints. However, such methods are typically only effective for static scenes. To overcome this challenge, we introduce a module that operates in both the time and frequency domains to aggregate the features of object motion. This allows us to learn the relationship between frames and generate higher-quality images. Our experiments demonstrate significant improvements over state-of-the-art methods on dynamic scene datasets. Specifically, our approach outperforms existing methods in terms of both the accuracy and visual quality of the synthesized views. Our code is available on https://github.com/xingy038/CTNeRF. △ Less

Submitted 26 June, 2024; v1 submitted 9 January, 2024; originally announced January 2024.

Comments: Accepted by Pattern Recognition

arXiv:2312.12295 [pdf, other]

Describing Robots from Design to Learning: Towards an Interactive Lifecycle Representation of Robots

Authors: Nuofan Qiu, Fang Wan, Chaoyang Song

Abstract: The robot development process is divided into several stages, which create barriers to the exchange of information between these different stages. We advocate for an interactive lifecycle representation, extending from robot morphology design to learning, and introduce the role of robot description formats in facilitating information transfer throughout this pipeline. We analyzed the relationship… ▽ More The robot development process is divided into several stages, which create barriers to the exchange of information between these different stages. We advocate for an interactive lifecycle representation, extending from robot morphology design to learning, and introduce the role of robot description formats in facilitating information transfer throughout this pipeline. We analyzed the relationship between design and simulation, enabling us to employ robot process automation methods for transferring information from the design phase to the learning phase in simulation. As part of this effort, we have developed an open-source plugin called ACDC4Robot for Fusion 360, which automates this process and transforms Fusion 360 into a user-friendly graphical interface for creating and editing robot description formats. Additionally, we offer an out-of-the-box robot model library to streamline and reduce repetitive tasks. All codes are hosted open-source. (\url{https://github.com/bionicdl-sustech/ACDC4Robot}) △ Less

Submitted 19 December, 2023; originally announced December 2023.

Comments: 11 pages, 8 figures, 2 tables, submitted to ICRA2024 for review

arXiv:2312.09863 [pdf, other]

Proprioceptive State Estimation for Amphibious Tactile Sensing

Authors: Ning Guo, Xudong Han, Shuqiao Zhong, Zhiyuan Zhou, Jian Lin, Jian S. Dai, Fang Wan, Chaoyang Song

Abstract: This paper presents a novel vision-based proprioception approach for a soft robotic finger that can estimate and reconstruct tactile interactions in both terrestrial and aquatic environments. The key to this system lies in the finger's unique metamaterial structure, which facilitates omni-directional passive adaptation during grasping, protecting delicate objects across diverse scenarios. A compac… ▽ More This paper presents a novel vision-based proprioception approach for a soft robotic finger that can estimate and reconstruct tactile interactions in both terrestrial and aquatic environments. The key to this system lies in the finger's unique metamaterial structure, which facilitates omni-directional passive adaptation during grasping, protecting delicate objects across diverse scenarios. A compact in-finger camera captures high-framerate images of the finger's deformation during contact, extracting crucial tactile data in real-time. We present a volumetric discretized model of the soft finger and use the geometry constraints captured by the camera to find the optimal estimation of the deformed shape. The approach is benchmarked using a motion capture system with sparse markers and a haptic device with dense measurements. Both results show state-of-the-art accuracies, with a median error of 1.96 mm for overall body deformation, corresponding to 2.1% of the finger's length. More importantly, the state estimation is robust in both on-land and underwater environments as we demonstrate its usage for underwater object shape sensing. This combination of passive adaptation and real-time tactile sensing paves the way for amphibious robotic grasping applications. △ Less

Submitted 21 July, 2024; v1 submitted 15 December, 2023; originally announced December 2023.

Comments: 24 pages, 11 figures, 1 table, Conditionally Accepted for the Special Collection on Tactile Robotics in IEEE Transactions on Robotics

arXiv:2312.09822 [pdf, other]

SeeThruFinger: See and Grasp Anything with a Soft Touch

Authors: Fang Wan, Chaoyang Song

Abstract: We present SeeThruFinger, a soft robotic finger with an in-finger vision for multi-modal perception, including visual perception and tactile sensing, for geometrically adaptive and real-time reactive grasping. Multi-modal perception of intrinsic and extrinsic interactions is critical in building intelligent robots that learn. Instead of adding various sensors for different modalities, a preferred… ▽ More We present SeeThruFinger, a soft robotic finger with an in-finger vision for multi-modal perception, including visual perception and tactile sensing, for geometrically adaptive and real-time reactive grasping. Multi-modal perception of intrinsic and extrinsic interactions is critical in building intelligent robots that learn. Instead of adding various sensors for different modalities, a preferred solution is to integrate them into one elegant and coherent design, which is a challenging task. This study leverages the Soft Polyhedral Network design as a robotic finger, capable of omni-directional adaptation with an unobstructed view of the finger's spatial deformation from the inside. By embedding a miniature camera underneath, we achieve the visual perception of the external environment by inpainting the finger mask using E2FGV, which can be used for object detection in the downstream tasks for grasping. After contacting the objects, we use real-time object segmentation algorithms, such as XMem, to track the soft finger's spatial deformations. We also learned a Supervised Variational Autoencoder to enable tactile sensing of 6D forces and torques for reactive grasp. As a result, we achieved multi-modal perception, including visual perception and tactile sensing, and soft, adaptive object grasping within a single vision-based soft finger design compatible with multi-fingered robotic grippers. △ Less

Submitted 15 December, 2023; originally announced December 2023.

Comments: 10 pages, 5 figures, 1 table, submitted to Soft Robotics under review

arXiv:2311.14974 [pdf, other]

Active Surface with Passive Omni-Directional Adaptation of Soft Polyhedral Fingers for In-Hand Manipulation

Authors: Sen Li, Fang Wan, Chaoyang Song

Abstract: Track systems effectively distribute loads, augmenting traction and maneuverability on unstable terrains, leveraging their expansive contact areas. This tracked locomotion capability also aids in hand manipulation of not only regular objects but also irregular objects. In this study, we present the design of a soft robotic finger with an active surface on an omni-adaptive network structure, which… ▽ More Track systems effectively distribute loads, augmenting traction and maneuverability on unstable terrains, leveraging their expansive contact areas. This tracked locomotion capability also aids in hand manipulation of not only regular objects but also irregular objects. In this study, we present the design of a soft robotic finger with an active surface on an omni-adaptive network structure, which can be easily installed on existing grippers and achieve stability and dexterity for in-hand manipulation. The system's active surfaces initially transfer the object from the fingertip segment with less compliance to the middle segment of the finger with superior adaptability. Despite the omni-directional deformation of the finger, in-hand manipulation can still be executed with controlled active surfaces. We characterized the soft finger's stiffness distribution and simplified models to assess the feasibility of repositioning and reorienting a grasped object. A set of experiments on in-hand manipulation was performed with the proposed fingers, demonstrating the dexterity and robustness of the strategy. △ Less

Submitted 25 November, 2023; originally announced November 2023.

Comments: 10 pages, 6 figures, 2 tables, submitted to ICRA 2024

arXiv:2310.20256 [pdf, other]

PsyCoT: Psychological Questionnaire as Powerful Chain-of-Thought for Personality Detection

Authors: Tao Yang, Tianyuan Shi, Fanqi Wan, Xiaojun Quan, Qifan Wang, Bingzhe Wu, Jiaxiang Wu

Abstract: Recent advances in large language models (LLMs), such as ChatGPT, have showcased remarkable zero-shot performance across various NLP tasks. However, the potential of LLMs in personality detection, which involves identifying an individual's personality from their written texts, remains largely unexplored. Drawing inspiration from Psychological Questionnaires, which are carefully designed by psychol… ▽ More Recent advances in large language models (LLMs), such as ChatGPT, have showcased remarkable zero-shot performance across various NLP tasks. However, the potential of LLMs in personality detection, which involves identifying an individual's personality from their written texts, remains largely unexplored. Drawing inspiration from Psychological Questionnaires, which are carefully designed by psychologists to evaluate individual personality traits through a series of targeted items, we argue that these items can be regarded as a collection of well-structured chain-of-thought (CoT) processes. By incorporating these processes, LLMs can enhance their capabilities to make more reasonable inferences on personality from textual input. In light of this, we propose a novel personality detection method, called PsyCoT, which mimics the way individuals complete psychological questionnaires in a multi-turn dialogue manner. In particular, we employ a LLM as an AI assistant with a specialization in text analysis. We prompt the assistant to rate individual items at each turn and leverage the historical rating results to derive a conclusive personality preference. Our experiments demonstrate that PsyCoT significantly improves the performance and robustness of GPT-3.5 in personality detection, achieving an average F1 score improvement of 4.23/10.63 points on two benchmark datasets compared to the standard prompting method. Our code is available at https://github.com/TaoYang225/PsyCoT. △ Less

Submitted 4 November, 2023; v1 submitted 31 October, 2023; originally announced October 2023.

Comments: Accepted to Findings of EMNLP 2023

arXiv:2310.09824 [pdf, other]

Overconstrained Locomotion

Authors: Haoran Sun, Bangchao Huang, Zishang Zhang, Ronghan Xu, Guojing Huang, Shihao Feng, Guangyi Huang, Jiayi Yin, Nuofan Qiu, Hua Chen, Wei Zhang, Jia Pan, Fang Wan, Chaoyang Song

Abstract: This paper studies the design, control, and learning of a novel robotic limb that produces overconstrained locomotion by employing the Bennett linkage for motion generation, capable of parametric reconfiguration between a reptile- and mammal-inspired morphology within a single quadruped. In contrast to the prevailing focus on planar linkages, this research delves into adopting overconstrained link… ▽ More This paper studies the design, control, and learning of a novel robotic limb that produces overconstrained locomotion by employing the Bennett linkage for motion generation, capable of parametric reconfiguration between a reptile- and mammal-inspired morphology within a single quadruped. In contrast to the prevailing focus on planar linkages, this research delves into adopting overconstrained linkages as the limb mechanism. The overconstrained linkages have solid theoretical foundations in advanced kinematics but are under-explored in robotic applications. This study showcases the morphological superiority of Overconstrained Robotic Limbs (ORLs) that can transform into planar or spherical limbs, exemplified using the simplest case of a Bennett linkage as an ORL. We apply Model Predictive Control (MPC) to simulate a range of overconstrained locomotion tasks, revealing its superiority in energy efficiency against planar limbs when considering foothold distances and speeds. The results are further verified in overconstrained locomotion policies optimized from Reinforcement Learning (RL). From an evolutionary biology perspective, these findings highlight the mechanism distinctions in limb design between reptiles and mammals and represent the first documented instance of ORLs outperforming planar limb designs in dynamic locomotion. Future studies will focus on deploying the model-based and learning-based overconstrained locomotion skills in the robotic hardware to close the Sim2Real gap for developing evolutionary-inspired, energy-efficient control of novel robotic limbs. △ Less

Submitted 30 July, 2024; v1 submitted 15 October, 2023; originally announced October 2023.

Comments: 30 pages, 20 figures, 2 tables

arXiv:2310.09168 [pdf, other]

Explore-Instruct: Enhancing Domain-Specific Instruction Coverage through Active Exploration

Authors: Fanqi Wan, Xinting Huang, Tao Yang, Xiaojun Quan, Wei Bi, Shuming Shi

Abstract: Instruction-tuning can be substantially optimized through enhanced diversity, resulting in models capable of handling a broader spectrum of tasks. However, existing data employed for such tuning often exhibit an inadequate coverage of individual domains, limiting the scope for nuanced comprehension and interactions within these areas. To address this deficiency, we propose Explore-Instruct, a nove… ▽ More Instruction-tuning can be substantially optimized through enhanced diversity, resulting in models capable of handling a broader spectrum of tasks. However, existing data employed for such tuning often exhibit an inadequate coverage of individual domains, limiting the scope for nuanced comprehension and interactions within these areas. To address this deficiency, we propose Explore-Instruct, a novel approach to enhance the data coverage to be used in domain-specific instruction-tuning through active exploration via Large Language Models (LLMs). Built upon representative domain use cases, Explore-Instruct explores a multitude of variations or possibilities by implementing a search algorithm to obtain diversified and domain-focused instruction-tuning data. Our data-centric analysis validates the effectiveness of this proposed approach in improving domain-specific instruction coverage. Moreover, our model's performance demonstrates considerable advancements over multiple baselines, including those utilizing domain-specific data enhancement. Our findings offer a promising opportunity to improve instruction coverage, especially in domain-specific contexts, thereby advancing the development of adaptable language models. Our code, model weights, and data are public at \url{https://github.com/fanqiwan/Explore-Instruct}. △ Less

Submitted 24 October, 2023; v1 submitted 13 October, 2023; originally announced October 2023.

Comments: Accepted to EMNLP 2023 (Main Conference)

arXiv:2310.08877 [pdf, other]

Retrieval-Generation Alignment for End-to-End Task-Oriented Dialogue System

Authors: Weizhou Shen, Yingqi Gao, Canbin Huang, Fanqi Wan, Xiaojun Quan, Wei Bi

Abstract: Developing an efficient retriever to retrieve knowledge from a large-scale knowledge base (KB) is critical for task-oriented dialogue systems to effectively handle localized and specialized tasks. However, widely used generative models such as T5 and ChatGPT often struggle to differentiate subtle differences among the retrieved KB records when generating responses, resulting in suboptimal quality… ▽ More Developing an efficient retriever to retrieve knowledge from a large-scale knowledge base (KB) is critical for task-oriented dialogue systems to effectively handle localized and specialized tasks. However, widely used generative models such as T5 and ChatGPT often struggle to differentiate subtle differences among the retrieved KB records when generating responses, resulting in suboptimal quality of generated responses. In this paper, we propose the application of maximal marginal likelihood to train a perceptive retriever by utilizing signals from response generation for supervision. In addition, our approach goes beyond considering solely retrieved entities and incorporates various meta knowledge to guide the generator, thus improving the utilization of knowledge. We evaluate our approach on three task-oriented dialogue datasets using T5 and ChatGPT as the backbone models. The results demonstrate that when combined with meta knowledge, the response generator can effectively leverage high-quality knowledge records from the retriever and enhance the quality of generated responses. The codes and models of this paper are available at https://github.com/shenwzh3/MK-TOD. △ Less

Submitted 20 October, 2023; v1 submitted 13 October, 2023; originally announced October 2023.

Comments: Accepted to EMNLP 2023 Main Conference

arXiv:2308.08538 [pdf, other]

doi 10.1177/02783649241238765

Proprioceptive Learning with Soft Polyhedral Networks

Authors: Xiaobo Liu, Xudong Han, Wei Hong, Fang Wan, Chaoyang Song

Abstract: Proprioception is the "sixth sense" that detects limb postures with motor neurons. It requires a natural integration between the musculoskeletal systems and sensory receptors, which is challenging among modern robots that aim for lightweight, adaptive, and sensitive designs at a low cost. Here, we present the Soft Polyhedral Network with an embedded vision for physical interactions, capable of ada… ▽ More Proprioception is the "sixth sense" that detects limb postures with motor neurons. It requires a natural integration between the musculoskeletal systems and sensory receptors, which is challenging among modern robots that aim for lightweight, adaptive, and sensitive designs at a low cost. Here, we present the Soft Polyhedral Network with an embedded vision for physical interactions, capable of adaptive kinesthesia and viscoelastic proprioception by learning kinetic features. This design enables passive adaptations to omni-directional interactions, visually captured by a miniature high-speed motion tracking system embedded inside for proprioceptive learning. The results show that the soft network can infer real-time 6D forces and torques with accuracies of 0.25/0.24/0.35 N and 0.025/0.034/0.006 Nm in dynamic interactions. We also incorporate viscoelasticity in proprioception during static adaptation by adding a creep and relaxation modifier to refine the predicted results. The proposed soft network combines simplicity in design, omni-adaptation, and proprioceptive sensing with high accuracy, making it a versatile solution for robotics at a low cost with more than 1 million use cycles for tasks such as sensitive and competitive grasping, and touch-based geometry reconstruction. This study offers new insights into vision-based proprioception for soft robots in adaptive grasping, soft manipulation, and human-robot interaction. △ Less

Submitted 27 July, 2024; v1 submitted 16 August, 2023; originally announced August 2023.

Comments: 20 pages, 10 figures, 2 tables, Published in the International Journal of Robotics Research

arXiv:2308.08510 [pdf, other]

Autoencoding a Soft Touch to Learn Grasping from On-land to Underwater

Authors: Ning Guo, Xudong Han, Xiaobo Liu, Shuqiao Zhong, Zhiyuan Zhou, Jian Lin, Jiansheng Dai, Fang Wan, Chaoyang Song

Abstract: Robots play a critical role as the physical agent of human operators in exploring the ocean. However, it remains challenging to grasp objects reliably while fully submerging under a highly pressurized aquatic environment with little visible light, mainly due to the fluidic interference on the tactile mechanics between the finger and object surfaces. This study investigates the transferability of g… ▽ More Robots play a critical role as the physical agent of human operators in exploring the ocean. However, it remains challenging to grasp objects reliably while fully submerging under a highly pressurized aquatic environment with little visible light, mainly due to the fluidic interference on the tactile mechanics between the finger and object surfaces. This study investigates the transferability of grasping knowledge from on-land to underwater via a vision-based soft robotic finger that learns 6D forces and torques (FT) using a Supervised Variational Autoencoder (SVAE). A high-framerate camera captures the whole-body deformations while a soft robotic finger interacts with physical objects on-land and underwater. Results show that the trained SVAE model learned a series of latent representations of the soft mechanics transferrable from land to water, presenting a superior adaptation to the changing environments against commercial FT sensors. Soft, delicate, and reactive grasping enabled by tactile intelligence enhances the gripper's underwater interaction with improved reliability and robustness at a much-reduced cost, paving the path for learning-based intelligent grasping to support fundamental scientific discoveries in environmental and ocean research. △ Less

Submitted 16 August, 2023; originally announced August 2023.

Comments: 17 pages, 5 figures, 1 table, submitted to Advanced Intelligent Systems for review

arXiv:2308.07225 [pdf, other]

doi 10.1109/TCSVT.2023.3305776

DS-Depth: Dynamic and Static Depth Estimation via a Fusion Cost Volume

Authors: Xingyu Miao, Yang Bai, Haoran Duan, Yawen Huang, Fan Wan, Xinxing Xu, Yang Long, Yefeng Zheng

Abstract: Self-supervised monocular depth estimation methods typically rely on the reprojection error to capture geometric relationships between successive frames in static environments. However, this assumption does not hold in dynamic objects in scenarios, leading to errors during the view synthesis stage, such as feature mismatch and occlusion, which can significantly reduce the accuracy of the generated… ▽ More Self-supervised monocular depth estimation methods typically rely on the reprojection error to capture geometric relationships between successive frames in static environments. However, this assumption does not hold in dynamic objects in scenarios, leading to errors during the view synthesis stage, such as feature mismatch and occlusion, which can significantly reduce the accuracy of the generated depth maps. To address this problem, we propose a novel dynamic cost volume that exploits residual optical flow to describe moving objects, improving incorrectly occluded regions in static cost volumes used in previous work. Nevertheless, the dynamic cost volume inevitably generates extra occlusions and noise, thus we alleviate this by designing a fusion module that makes static and dynamic cost volumes compensate for each other. In other words, occlusion from the static volume is refined by the dynamic volume, and incorrect information from the dynamic volume is eliminated by the static volume. Furthermore, we propose a pyramid distillation loss to reduce photometric error inaccuracy at low resolutions and an adaptive photometric error loss to alleviate the flow direction of the large gradient in the occlusion regions. We conducted extensive experiments on the KITTI and Cityscapes datasets, and the results demonstrate that our model outperforms previously published baselines for self-supervised monocular depth estimation. △ Less

Submitted 14 August, 2023; originally announced August 2023.

arXiv:2307.09756 [pdf, other]

Generative Prompt Model for Weakly Supervised Object Localization

Authors: Yuzhong Zhao, Qixiang Ye, Weijia Wu, Chunhua Shen, Fang Wan

Abstract: Weakly supervised object localization (WSOL) remains challenging when learning object localization models from image category labels. Conventional methods that discriminatively train activation models ignore representative yet less discriminative object parts. In this study, we propose a generative prompt model (GenPromp), defining the first generative pipeline to localize less discriminative obje… ▽ More Weakly supervised object localization (WSOL) remains challenging when learning object localization models from image category labels. Conventional methods that discriminatively train activation models ignore representative yet less discriminative object parts. In this study, we propose a generative prompt model (GenPromp), defining the first generative pipeline to localize less discriminative object parts by formulating WSOL as a conditional image denoising procedure. During training, GenPromp converts image category labels to learnable prompt embeddings which are fed to a generative model to conditionally recover the input image with noise and learn representative embeddings. During inference, enPromp combines the representative embeddings with discriminative embeddings (queried from an off-the-shelf vision-language model) for both representative and discriminative capacity. The combined embeddings are finally used to generate multi-scale high-quality attention maps, which facilitate localizing full object extent. Experiments on CUB-200-2011 and ILSVRC show that GenPromp respectively outperforms the best discriminative models by 5.2% and 5.6% (Top-1 Loc), setting a solid baseline for WSOL with the generative model. Code is available at https://github.com/callsys/GenPromp. △ Less

Submitted 19 July, 2023; originally announced July 2023.

Journal ref: International Conference on Computer Vision Conference (ICCV2023)

arXiv:2306.04932 [pdf, other]

Jigsaw-based Benchmarking for Learning Robotic Manipulation

Authors: Xiaobo Liu, Fang Wan, Sheng Ge, Haokun Wang, Haoran Sun, Chaoyang Song

Abstract: Benchmarking provides experimental evidence of the scientific baseline to enhance the progression of fundamental research, which is also applicable to robotics. In this paper, we propose a method to benchmark metrics of robotic manipulation, which addresses the spatial-temporal reasoning skills for robot learning with the jigsaw game. In particular, our approach exploits a simple set of jigsaw pie… ▽ More Benchmarking provides experimental evidence of the scientific baseline to enhance the progression of fundamental research, which is also applicable to robotics. In this paper, we propose a method to benchmark metrics of robotic manipulation, which addresses the spatial-temporal reasoning skills for robot learning with the jigsaw game. In particular, our approach exploits a simple set of jigsaw pieces by designing a structured protocol, which can be highly customizable according to a wide range of task specifications. Researchers can selectively adopt the proposed protocol to benchmark their research outputs, on a comparable scale in the functional, task, and system-level of details. The purpose is to provide a potential look-up table for learning-based robot manipulation, commonly available in other engineering disciplines, to facilitate the adoption of robotics through calculated, empirical, and systematic experimental evidence. △ Less

Submitted 8 June, 2023; originally announced June 2023.

Comments: 7 pages, 7 figures, accepted to 2023 IEEE International Conference on Advanced Robotics and Mechatronics (ICARM)

arXiv:2306.04928 [pdf, other]

Underwater Intention Recognition using Head Motion and Throat Vibration for Supernumerary Robotic Assistance

Authors: Yuqin Guo, Rongzheng Zhang, Wanghongjie Qiu, Harry Asada, Fang Wan, Chaoyang Song

Abstract: This study presents a multi-modal mechanism for recognizing human intentions while diving underwater, aiming to achieve natural human-robot interactions through an underwater superlimb for diving assistance. The underwater environment severely limits the divers' capabilities in intention expression, which becomes more challenging when they intend to operate tools while keeping control of body post… ▽ More This study presents a multi-modal mechanism for recognizing human intentions while diving underwater, aiming to achieve natural human-robot interactions through an underwater superlimb for diving assistance. The underwater environment severely limits the divers' capabilities in intention expression, which becomes more challenging when they intend to operate tools while keeping control of body postures in 3D with the various diving suits and gears. The current literature is limited in underwater intention recognition, impeding the development of intelligent wearable systems for human-robot interactions underwater. Here, we present a novel solution to simultaneously detect head motion and throat vibrations under the water in a compact, wearable design. Experiment results show that using machine learning algorithms, we achieved high performance in integrating these two modalities to translate human intentions to robot control commands for an underwater superlimb system. This study's results paved the way for future development in underwater intention recognition and underwater human-robot interactions with supernumerary support. △ Less

Submitted 16 August, 2023; v1 submitted 8 June, 2023; originally announced June 2023.

Comments: 6 pages, 9 figures, 3 tables, accepted to IEEE CASE 2023

arXiv:2305.10149 [pdf, other]

Multi-Grained Knowledge Retrieval for End-to-End Task-Oriented Dialog

Authors: Fanqi Wan, Weizhou Shen, Ke Yang, Xiaojun Quan, Wei Bi

Abstract: Retrieving proper domain knowledge from an external database lies at the heart of end-to-end task-oriented dialog systems to generate informative responses. Most existing systems blend knowledge retrieval with response generation and optimize them with direct supervision from reference responses, leading to suboptimal retrieval performance when the knowledge base becomes large-scale. To address th… ▽ More Retrieving proper domain knowledge from an external database lies at the heart of end-to-end task-oriented dialog systems to generate informative responses. Most existing systems blend knowledge retrieval with response generation and optimize them with direct supervision from reference responses, leading to suboptimal retrieval performance when the knowledge base becomes large-scale. To address this, we propose to decouple knowledge retrieval from response generation and introduce a multi-grained knowledge retriever (MAKER) that includes an entity selector to search for relevant entities and an attribute selector to filter out irrelevant attributes. To train the retriever, we propose a novel distillation objective that derives supervision signals from the response generator. Experiments conducted on three standard benchmarks with both small and large-scale knowledge bases demonstrate that our retriever performs knowledge retrieval more effectively than existing methods. Our code has been made publicly available.\footnote{https://github.com/18907305772/MAKER} △ Less

Submitted 17 May, 2023; originally announced May 2023.

Comments: Accepted to ACL 2023 (Main Conference)

arXiv:2305.09892 [pdf, other]

Clustering-Aware Negative Sampling for Unsupervised Sentence Representation

Authors: Jinghao Deng, Fanqi Wan, Tao Yang, Xiaojun Quan, Rui Wang

Abstract: Contrastive learning has been widely studied in sentence representation learning. However, earlier works mainly focus on the construction of positive examples, while in-batch samples are often simply treated as negative examples. This approach overlooks the importance of selecting appropriate negative examples, potentially leading to a scarcity of hard negatives and the inclusion of false negative… ▽ More Contrastive learning has been widely studied in sentence representation learning. However, earlier works mainly focus on the construction of positive examples, while in-batch samples are often simply treated as negative examples. This approach overlooks the importance of selecting appropriate negative examples, potentially leading to a scarcity of hard negatives and the inclusion of false negatives. To address these issues, we propose ClusterNS (Clustering-aware Negative Sampling), a novel method that incorporates cluster information into contrastive learning for unsupervised sentence representation learning. We apply a modified K-means clustering algorithm to supply hard negatives and recognize in-batch false negatives during training, aiming to solve the two issues in one unified framework. Experiments on semantic textual similarity (STS) tasks demonstrate that our proposed ClusterNS compares favorably with baselines in unsupervised sentence representation learning. Our code has been made publicly available. △ Less

Submitted 16 May, 2023; originally announced May 2023.

Comments: accepted to Finding of ACL2023, 16 pages

arXiv:2210.03592 [pdf, other]

Specialized Re-Ranking: A Novel Retrieval-Verification Framework for Cloth Changing Person Re-Identification

Authors: Renjie Zhang, Yu Fang, Huaxin Song, Fangbin Wan, Yanwei Fu, Hirokazu Kato, Yang Wu

Abstract: Cloth changing person re-identification(Re-ID) can work under more complicated scenarios with higher security than normal Re-ID and biometric techniques and is therefore extremely valuable in applications. Meanwhile, higher flexibility in appearance always leads to more similar-looking confusing images, which is the weakness of the widely used retrieval methods. In this work, we shed light on how… ▽ More Cloth changing person re-identification(Re-ID) can work under more complicated scenarios with higher security than normal Re-ID and biometric techniques and is therefore extremely valuable in applications. Meanwhile, higher flexibility in appearance always leads to more similar-looking confusing images, which is the weakness of the widely used retrieval methods. In this work, we shed light on how to handle these similar images. Specifically, we propose a novel retrieval-verification framework. Given an image, the retrieval module can search for similar images quickly. Our proposed verification network will then compare the input image and the candidate images by contrasting those local details and give a similarity score. An innovative ranking strategy is also introduced to take a good balance between retrieval and verification results. Comprehensive experiments are conducted to show the effectiveness of our framework and its capability in improving the state-of-the-art methods remarkably on both synthetic and realistic datasets. △ Less

Submitted 7 October, 2022; originally announced October 2022.

Comments: Accepted by Pattern Recognition

arXiv:2205.09613 [pdf, other]

Integrally Migrating Pre-trained Transformer Encoder-decoders for Visual Object Detection

Authors: Feng Liu, Xiaosong Zhang, Zhiliang Peng, Zonghao Guo, Fang Wan, Xiangyang Ji, Qixiang Ye

Abstract: Modern object detectors have taken the advantages of backbone networks pre-trained on large scale datasets. Except for the backbone networks, however, other components such as the detector head and the feature pyramid network (FPN) remain trained from scratch, which hinders fully tapping the potential of representation models. In this study, we propose to integrally migrate pre-trained transformer… ▽ More Modern object detectors have taken the advantages of backbone networks pre-trained on large scale datasets. Except for the backbone networks, however, other components such as the detector head and the feature pyramid network (FPN) remain trained from scratch, which hinders fully tapping the potential of representation models. In this study, we propose to integrally migrate pre-trained transformer encoder-decoders (imTED) to a detector, constructing a feature extraction path which is ``fully pre-trained" so that detectors' generalization capacity is maximized. The essential differences between imTED with the baseline detector are twofold: (1) migrating the pre-trained transformer decoder to the detector head while removing the randomly initialized FPN from the feature extraction path; and (2) defining a multi-scale feature modulator (MFM) to enhance scale adaptability. Such designs not only reduce randomly initialized parameters significantly but also unify detector training with representation learning intendedly. Experiments on the MS COCO object detection dataset show that imTED consistently outperforms its counterparts by $\sim$2.4 AP. Without bells and whistles, imTED improves the state-of-the-art of few-shot object detection by up to 7.6 AP. Code is available at https://github.com/LiewFeng/imTED. △ Less

Submitted 2 December, 2022; v1 submitted 19 May, 2022; originally announced May 2022.

Comments: 6 figures, 6 tables

arXiv:2202.11319 [pdf, other]

Absolute Zero-Shot Learning

Authors: Rui Gao, Fan Wan, Daniel Organisciak, Jiyao Pu, Junyan Wang, Haoran Duan, Peng Zhang, Xingsong Hou, Yang Long

Abstract: Considering the increasing concerns about data copyright and privacy issues, we present a novel Absolute Zero-Shot Learning (AZSL) paradigm, i.e., training a classifier with zero real data. The key innovation is to involve a teacher model as the data safeguard to guide the AZSL model training without data leaking. The AZSL model consists of a generator and student network, which can achieve date-f… ▽ More Considering the increasing concerns about data copyright and privacy issues, we present a novel Absolute Zero-Shot Learning (AZSL) paradigm, i.e., training a classifier with zero real data. The key innovation is to involve a teacher model as the data safeguard to guide the AZSL model training without data leaking. The AZSL model consists of a generator and student network, which can achieve date-free knowledge transfer while maintaining the performance of the teacher network. We investigate `black-box' and `white-box' scenarios in AZSL task as different levels of model security. Besides, we also provide discussion of teacher model in both inductive and transductive settings. Despite embarrassingly simple implementations and data-missing disadvantages, our AZSL framework can retain state-of-the-art ZSL and GZSL performance under the `white-box' scenario. Extensive qualitative and quantitative analysis also demonstrates promising results when deploying the model under `black-box' scenario. △ Less

Submitted 23 February, 2022; originally announced February 2022.

arXiv:2108.13969 [pdf, other]

Semi-Supervised Crowd Counting from Unlabeled Data

Authors: Haoran Duan, Fan Wan, Rui Sun, Zeyu Wang, Varun Ojha, Yu Guan, Hubert P. H. Shum, Bingzhang Hu, Yang Long

Abstract: Automatic Crowd behavior analysis can be applied to effectively help the daily transportation statistics and planning, which helps the smart city construction. As one of the most important keys, crowd counting has drawn increasing attention. Recent works achieved promising performance but relied on the supervised paradigm with expensive crowd annotations. To alleviate the annotation cost in real-w… ▽ More Automatic Crowd behavior analysis can be applied to effectively help the daily transportation statistics and planning, which helps the smart city construction. As one of the most important keys, crowd counting has drawn increasing attention. Recent works achieved promising performance but relied on the supervised paradigm with expensive crowd annotations. To alleviate the annotation cost in real-world transportation scenarios, in this work we proposed a semi-supervised learning framework $S^{4}\textit{Crowd}$, which can leverage both unlabeled/labeled data for robust crowd counting. In the unsupervised pathway, two \textit{self-supervised losses} were proposed to simulate the crowd variations such as scale, illumination, based on which supervised information pseudo labels were generated and gradually refined. We also proposed a crowd-driven recurrent unit \textit{Gated-Crowd-Recurrent-Unit (GCRU)}, which can preserve discriminant crowd information by extracting second-order statistics, yielding pseudo labels with improved quality. A joint loss including both unsupervised/supervised information was proposed, and a dynamic weighting strategy was employed to balance the importance of the unsupervised loss and supervised loss at different training stages. We conducted extensive experiments on four popular crowd counting datasets in semi-supervised settings. Experimental results supported the effectiveness of each proposed component in our $S^{4}$Crowd framework. Our method achieved competitive performance in semi-supervised learning approaches on these crowd counting datasets. △ Less

Submitted 26 March, 2024; v1 submitted 31 August, 2021; originally announced August 2021.

arXiv:2104.02324 [pdf, other]

Multiple instance active learning for object detection

Authors: Tianning Yuan, Fang Wan, Mengying Fu, Jianzhuang Liu, Songcen Xu, Xiangyang Ji, Qixiang Ye

Abstract: Despite the substantial progress of active learning for image recognition, there still lacks an instance-level active learning method specified for object detection. In this paper, we propose Multiple Instance Active Object Detection (MI-AOD), to select the most informative images for detector training by observing instance-level uncertainty. MI-AOD defines an instance uncertainty learning module,… ▽ More Despite the substantial progress of active learning for image recognition, there still lacks an instance-level active learning method specified for object detection. In this paper, we propose Multiple Instance Active Object Detection (MI-AOD), to select the most informative images for detector training by observing instance-level uncertainty. MI-AOD defines an instance uncertainty learning module, which leverages the discrepancy of two adversarial instance classifiers trained on the labeled set to predict instance uncertainty of the unlabeled set. MI-AOD treats unlabeled images as instance bags and feature anchors in images as instances, and estimates the image uncertainty by re-weighting instances in a multiple instance learning (MIL) fashion. Iterative instance uncertainty learning and re-weighting facilitate suppressing noisy instances, toward bridging the gap between instance uncertainty and image-level uncertainty. Experiments validate that MI-AOD sets a solid baseline for instance-level active learning. On commonly used object detection datasets, MI-AOD outperforms state-of-the-art methods with significant margins, particularly when the labeled sets are small. Code is available at https://github.com/yuantn/MI-AOD. △ Less

Submitted 6 April, 2021; originally announced April 2021.

Comments: 10 pages, 7 figures, 5 tables. Code is available at https://github.com/yuantn/MI-AOD

arXiv:2103.14862 [pdf, other]

TS-CAM: Token Semantic Coupled Attention Map for Weakly Supervised Object Localization

Authors: Wei Gao, Fang Wan, Xingjia Pan, Zhiliang Peng, Qi Tian, Zhenjun Han, Bolei Zhou, Qixiang Ye

Abstract: Weakly supervised object localization (WSOL) is a challenging problem when given image category labels but requires to learn object localization models. Optimizing a convolutional neural network (CNN) for classification tends to activate local discriminative regions while ignoring complete object extent, causing the partial activation issue. In this paper, we argue that partial activation is cause… ▽ More Weakly supervised object localization (WSOL) is a challenging problem when given image category labels but requires to learn object localization models. Optimizing a convolutional neural network (CNN) for classification tends to activate local discriminative regions while ignoring complete object extent, causing the partial activation issue. In this paper, we argue that partial activation is caused by the intrinsic characteristics of CNN, where the convolution operations produce local receptive fields and experience difficulty to capture long-range feature dependency among pixels. We introduce the token semantic coupled attention map (TS-CAM) to take full advantage of the self-attention mechanism in visual transformer for long-range dependency extraction. TS-CAM first splits an image into a sequence of patch tokens for spatial embedding, which produce attention maps of long-range visual dependency to avoid partial activation. TS-CAM then re-allocates category-related semantics for patch tokens, enabling each of them to be aware of object categories. TS-CAM finally couples the patch tokens with the semantic-agnostic attention map to achieve semantic-aware localization. Experiments on the ILSVRC/CUB-200-2011 datasets show that TS-CAM outperforms its CNN-CAM counterparts by 7.1%/27.1% for WSOL, achieving state-of-the-art performance. △ Less

Submitted 3 August, 2021; v1 submitted 27 March, 2021; originally announced March 2021.

Comments: Accepted by ICCV2021 (poster)

arXiv:2101.12379 [pdf, other]

Learning-based Optoelectronically Innervated Tactile Finger for Rigid-Soft Interactive Grasping

Authors: Linhan Yang, Xudong Han, Weijie Guo, Fang Wan, Jia Pan, Chaoyang Song

Abstract: This paper presents a novel design of a soft tactile finger with omni-directional adaptation using multi-channel optical fibers for rigid-soft interactive grasping. Machine learning methods are used to train a model for real-time prediction of force, torque, and contact using the tactile data collected. We further integrated such fingers in a reconfigurable gripper design with three fingers so tha… ▽ More This paper presents a novel design of a soft tactile finger with omni-directional adaptation using multi-channel optical fibers for rigid-soft interactive grasping. Machine learning methods are used to train a model for real-time prediction of force, torque, and contact using the tactile data collected. We further integrated such fingers in a reconfigurable gripper design with three fingers so that the finger arrangement can be actively adjusted in real-time based on the tactile data collected during grasping, achieving the process of rigid-soft interactive grasping. Detailed sensor calibration and experimental results are also included to further validate the proposed design for enhanced grasping robustness. △ Less

Submitted 28 January, 2021; originally announced January 2021.

Comments: 8 pages,9 figures, Submitted to RAL and ICRA2021

arXiv:2012.03168 [pdf, other]

Design of an Optoelectronically Innervated Gripper for Rigid-Soft Interactive Grasping

Authors: Linhan Yang, Xudong Han, Weijie Guo, Zixin Zhang, Fang Wan, Jia Pan, Chaoyang Song

Abstract: Over the past few decades, efforts have been made towards robust robotic grasping, and therefore dexterous manipulation. The soft gripper has shown their potential in robust grasping due to their inherent properties-low, control complexity, and high adaptability. However, the deformation of the soft gripper when interacting with objects bring inaccuracy of grasped objects, which causes instability… ▽ More Over the past few decades, efforts have been made towards robust robotic grasping, and therefore dexterous manipulation. The soft gripper has shown their potential in robust grasping due to their inherent properties-low, control complexity, and high adaptability. However, the deformation of the soft gripper when interacting with objects bring inaccuracy of grasped objects, which causes instability for robust grasping and further manipulation. In this paper, we present an omni-directional adaptive soft finger that can sense deformation based on embedded optical fibers and the application of machine learning methods to interpret transmitted light intensities. Furthermore, to use tactile information provided by a soft finger, we design a low-cost and multi degrees of freedom gripper to conform to the shape of objects actively and optimize grasping policy, which is called Rigid-Soft Interactive Grasping. Two main advantages of this grasping policy are provided: one is that a more robust grasping could be achieved through an active adaptation; the other is that the tactile information collected could be helpful for further manipulation. △ Less

Submitted 5 December, 2020; originally announced December 2020.

Comments: 11 pages, 6 figures, submitted to IEEE ICRA 2021

arXiv:2007.14557 [pdf, other]

Chained-Tracker: Chaining Paired Attentive Regression Results for End-to-End Joint Multiple-Object Detection and Tracking

Authors: Jinlong Peng, Changan Wang, Fangbin Wan, Yang Wu, Yabiao Wang, Ying Tai, Chengjie Wang, Jilin Li, Feiyue Huang, Yanwei Fu

Abstract: Existing Multiple-Object Tracking (MOT) methods either follow the tracking-by-detection paradigm to conduct object detection, feature extraction and data association separately, or have two of the three subtasks integrated to form a partially end-to-end solution. Going beyond these sub-optimal frameworks, we propose a simple online model named Chained-Tracker (CTracker), which naturally integrates… ▽ More Existing Multiple-Object Tracking (MOT) methods either follow the tracking-by-detection paradigm to conduct object detection, feature extraction and data association separately, or have two of the three subtasks integrated to form a partially end-to-end solution. Going beyond these sub-optimal frameworks, we propose a simple online model named Chained-Tracker (CTracker), which naturally integrates all the three subtasks into an end-to-end solution (the first as far as we know). It chains paired bounding boxes regression results estimated from overlapping nodes, of which each node covers two adjacent frames. The paired regression is made attentive by object-attention (brought by a detection module) and identity-attention (ensured by an ID verification module). The two major novelties: chained structure and paired attentive regression, make CTracker simple, fast and effective, setting new MOTA records on MOT16 and MOT17 challenge datasets (67.6 and 66.6, respectively), without relying on any extra training data. The source code of CTracker can be found at: github.com/pjl1995/CTracker. △ Less

Submitted 28 July, 2020; originally announced July 2020.

Comments: European Conference on Computer Vision 2020 (Spotlight)

arXiv:2006.14863 [pdf, other]

Domain Contrast for Domain Adaptive Object Detection

Authors: Feng Liu, Xiaoxong Zhang, Fang Wan, Xiangyang Ji, Qixiang Ye

Abstract: We present Domain Contrast (DC), a simple yet effective approach inspired by contrastive learning for training domain adaptive detectors. DC is deduced from the error bound minimization perspective of a transferred model, and is implemented with cross-domain contrast loss which is plug-and-play. By minimizing cross-domain contrast loss, DC guarantees the transferability of detectors while naturall… ▽ More We present Domain Contrast (DC), a simple yet effective approach inspired by contrastive learning for training domain adaptive detectors. DC is deduced from the error bound minimization perspective of a transferred model, and is implemented with cross-domain contrast loss which is plug-and-play. By minimizing cross-domain contrast loss, DC guarantees the transferability of detectors while naturally alleviating the class imbalance issue in the target domain. DC can be applied at either image level or region level, consistently improving detectors' transferability and discriminability. Extensive experiments on commonly used benchmarks show that DC improves the baseline and state-of-the-art by significant margins, while demonstrating great potential for large domain divergence. △ Less

Submitted 26 June, 2020; originally announced June 2020.

arXiv:2005.02588 [pdf, other]

DeepClaw: A Robotic Hardware Benchmarking Platform for Learning Object Manipulation

Authors: Fang Wan, Haokun Wang, Xiaobo Liu, Linhan Yang, Chaoyang Song

Abstract: We present DeepClaw as a reconfigurable benchmark of robotic hardware and task hierarchy for robot learning. The DeepClaw benchmark aims at a mechatronics perspective of the robot learning problem, which features a minimum design of robot cell that can be easily reconfigured to host robot hardware from various vendors, including manipulators, grippers, cameras, desks, and objects, aiming at a stre… ▽ More We present DeepClaw as a reconfigurable benchmark of robotic hardware and task hierarchy for robot learning. The DeepClaw benchmark aims at a mechatronics perspective of the robot learning problem, which features a minimum design of robot cell that can be easily reconfigured to host robot hardware from various vendors, including manipulators, grippers, cameras, desks, and objects, aiming at a streamlined collection of physical manipulation data and evaluation of the learned skills for hardware benchmarking. We provide a detailed design of the robot cell with readily available parts to build the experiment environment that can host a wide range of robotic hardware commonly adopted for robot learning. We also propose a hierarchical pipeline of software integration, including localization, recognition, grasp planning, and motion planning, to streamline learning-based robot control, data collection, and experiment validation towards shareability and reproducibility. We present benchmarking results of the DeepClaw system for a baseline Tic-Tac-Toe task, a bin-clearing task, and a jigsaw puzzle task using three sets of standard robotic hardware. Our results show that tasks defined in DeepClaw can be easily reproduced on three robot cells. Under the same task setup, the differences in robotic hardware used will present a non-negligible impact on the performance metrics of robot learning. All design layouts and codes are hosted on Github for open access. △ Less

Submitted 6 May, 2020; originally announced May 2020.

Comments: 13 pages, 6 figures, 2 tables, accepted for AIM 2020

arXiv:2004.00163 [pdf, other]

Weakly-Supervised Action Localization with Expectation-Maximization Multi-Instance Learning

Authors: Zhekun Luo, Devin Guillory, Baifeng Shi, Wei Ke, Fang Wan, Trevor Darrell, Huijuan Xu

Abstract: Weakly-supervised action localization requires training a model to localize the action segments in the video given only video level action label. It can be solved under the Multiple Instance Learning (MIL) framework, where a bag (video) contains multiple instances (action segments). Since only the bag's label is known, the main challenge is assigning which key instances within the bag to trigger t… ▽ More Weakly-supervised action localization requires training a model to localize the action segments in the video given only video level action label. It can be solved under the Multiple Instance Learning (MIL) framework, where a bag (video) contains multiple instances (action segments). Since only the bag's label is known, the main challenge is assigning which key instances within the bag to trigger the bag's label. Most previous models use attention-based approaches applying attentions to generate the bag's representation from instances, and then train it via the bag's classification. These models, however, implicitly violate the MIL assumption that instances in negative bags should be uniformly negative. In this work, we explicitly model the key instances assignment as a hidden variable and adopt an Expectation-Maximization (EM) framework. We derive two pseudo-label generation schemes to model the E and M process and iteratively optimize the likelihood lower bound. We show that our EM-MIL approach more accurately models both the learning objective and the MIL assumptions. It achieves state-of-the-art performance on two standard benchmarks, THUMOS14 and ActivityNet1.2. △ Less

Submitted 25 August, 2020; v1 submitted 31 March, 2020; originally announced April 2020.

Comments: Accepted at European Conference on Computer Vision (ECCV), 2020

arXiv:2003.04070 [pdf, other]

When Person Re-identification Meets Changing Clothes

Authors: Fangbin Wan, Yang Wu, Xuelin Qian, Yixiong Chen, Yanwei Fu

Abstract: Person re-identification (ReID) is now an active research topic for AI-based video surveillance applications such as specific person search, but the practical issue that the target person(s) may change clothes (clothes inconsistency problem) has been overlooked for long. For the first time, this paper systematically studies this problem. We first overcome the difficulty of lack of suitable dataset… ▽ More Person re-identification (ReID) is now an active research topic for AI-based video surveillance applications such as specific person search, but the practical issue that the target person(s) may change clothes (clothes inconsistency problem) has been overlooked for long. For the first time, this paper systematically studies this problem. We first overcome the difficulty of lack of suitable dataset, by collecting a small yet representative real dataset for testing whilst building a large realistic synthetic dataset for training and deeper studies. Facilitated by our new datasets, we are able to conduct various interesting new experiments for studying the influence of clothes inconsistency. We find that changing clothes makes ReID a much harder problem in the sense of bringing difficulties to learning effective representations and also challenges the generalization ability of previous ReID models to identify persons with unseen (new) clothes. Representative existing ReID models are adopted to show informative results on such a challenging setting, and we also provide some preliminary efforts on improving the robustness of existing models on handling the clothes inconsistency issue in the data. We believe that this study can be inspiring and helpful for encouraging more researches in this direction. The dataset is available on the project website: https://wanfb.github.io/dataset.html. △ Less

Submitted 24 May, 2020; v1 submitted 9 March, 2020; originally announced March 2020.

Comments: Accepted by CVPRW 2020

arXiv:2003.03586 [pdf, other]

doi 10.1109/ROBIO.2017.8324761

Hybrid Actuator Design for a Gait Augmentation Wearable

Authors: Fang Wan, Zheng Wang, Brooke Franchuk, Xinyao Hu, Zhenglong Sun, Chaoyang Song

Abstract: We describe a fluidic actuator design that replaces the sealed chamber of a hydraulic cylinder using a soft actuator to provide compliant linear compression with a large force ($\geq$100 N) at a low operation pressure ($\leq$50 kPa) for a lower-limb wearable. The external shells constrain the deformation of the soft actuator under fluidic pressurization. This enables us to use latex party balloons… ▽ More We describe a fluidic actuator design that replaces the sealed chamber of a hydraulic cylinder using a soft actuator to provide compliant linear compression with a large force ($\geq$100 N) at a low operation pressure ($\leq$50 kPa) for a lower-limb wearable. The external shells constrain the deformation of the soft actuator under fluidic pressurization. This enables us to use latex party balloons as a quick and cheap alternative for initial design investigation. We found that the forces exerted by the soft material deformation are well-captured by the rigid shells, removing the necessity of explicitly describing the mechanics of the soft material deformation and its interaction with the rigid structure. One can use the classical Force, Pressure and Area formula factored with an efficiency parameter to characterize the actuator performance. Furthermore, we proposed an engineering design of the hybrid actuator using a customized soft actuator placed inside a single shell cavity with an open end for the compression force. Our results show that the proposed design can generate a very high force within a short stroke distance. At a low input pressure of 50 kPa, the exerted block force is approaching only about 3\% less than the classical equation predicted. The actuator is fitted to a new gait augmentation design for correcting knee alignment, which is usually challenging for actuators made from the purely soft material. △ Less

Submitted 7 March, 2020; originally announced March 2020.

Comments: 5 pages, 7 figures, published at IEEE ROBIO 2017

arXiv:2003.02080 [pdf, other]

doi 10.1109/RoboSoft48309.2020.9116028

Robotic Cane as a Soft SuperLimb for Elderly Sit-to-Stand Assistance

Authors: Xia Wu, Haiyuan Liu, Ziqi Liu, Mingdong Chen, Fang Wan, Chenglong Fu, Harry Asada, Zheng Wang, Chaoyang Song

Abstract: Many researchers have identified robotics as a potential solution to the aging population faced by many developed and developing countries. If so, how should we address the cognitive acceptance and ambient control of elderly assistive robots through design? In this paper, we proposed an explorative design of an ambient SuperLimb (Supernumerary Robotic Limb) system that involves a pneumatically-dri… ▽ More Many researchers have identified robotics as a potential solution to the aging population faced by many developed and developing countries. If so, how should we address the cognitive acceptance and ambient control of elderly assistive robots through design? In this paper, we proposed an explorative design of an ambient SuperLimb (Supernumerary Robotic Limb) system that involves a pneumatically-driven robotic cane for at-home motion assistance, an inflatable vest for compliant human-robot interaction, and a depth sensor for ambient intention detection. The proposed system aims at providing active assistance during the sit-to-stand transition for at-home usage by the elderly at the bedside, in the chair, and on the toilet. We proposed a modified biomechanical model with a linear cane robot for closed-loop control implementation. We validated the design feasibility of the proposed ambient SuperLimb system including the biomechanical model, our result showed the advantages in reducing lower limb efforts and elderly fall risks, yet the detection accuracy using depth sensing and adjustments on the model still require further research in the future. Nevertheless, we summarized empirical guidelines to support the ambient design of elderly-assistive SuperLimb systems for lower limb functional augmentation. △ Less

Submitted 29 February, 2020; originally announced March 2020.

Comments: 8 pages, 9 figures, accepted for IEEE RoboSoft 2020

Journal ref: 2020 3rd IEEE International Conference on Soft Robotics (RoboSoft)

arXiv:2003.01584 [pdf, ps, other]

doi 10.1109/LRA.2020.2969932

Rigid-Soft Interactive Learning for Robust Grasping

Authors: Linhan Yang, Fang Wan, Haokun Wang, Xiaobo Liu, Yujia Liu, Jia Pan, Chaoyang Song

Abstract: Inspired by widely used soft fingers on grasping, we propose a method of rigid-soft interactive learning, aiming at reducing the time of data collection. In this paper, we classify the interaction categories into Rigid-Rigid, Rigid-Soft, Soft-Rigid according to the interaction surface between grippers and target objects. We find experimental evidence that the interaction types between grippers and… ▽ More Inspired by widely used soft fingers on grasping, we propose a method of rigid-soft interactive learning, aiming at reducing the time of data collection. In this paper, we classify the interaction categories into Rigid-Rigid, Rigid-Soft, Soft-Rigid according to the interaction surface between grippers and target objects. We find experimental evidence that the interaction types between grippers and target objects play an essential role in the learning methods. We use soft, stuffed toys for training, instead of everyday objects, to reduce the integration complexity and computational burden and exploit such rigid-soft interaction by changing the gripper fingers to the soft ones when dealing with rigid, daily-life items such as the Yale-CMU-Berkeley (YCB) objects. With a small data collection of 5K picking attempts in total, our results suggest that such Rigid-Soft and Soft-Rigid interactions are transferable. Moreover, the combination of different grasp types shows better performance on the grasping test. We achieve the best grasping performance at 97.5\% for easy YCB objects and 81.3\% for difficult YCB objects while using a precise grasp with a two-soft-finger gripper to collect training data and power grasp with a four-soft-finger gripper to test. △ Less

Submitted 29 February, 2020; originally announced March 2020.

Comments: 8 pages, 5 figures, Accepted for IEEE RAL and IEEE ICRA 2020

arXiv:2003.01583 [pdf, ps, other]

doi 10.1109/RoboSoft48309.2020.9116026

Scalable Tactile Sensing for an Omni-adaptive Soft Robot Finger

Authors: Zeyi Yang, Sheng Ge, Fang Wan, Yujia Liu, Chaoyang Song

Abstract: Robotic fingers made of soft material and compliant structures usually lead to superior adaptation when interacting with the unstructured physical environment. In this paper, we present an embedded sensing solution using optical fibers for an omni-adaptive soft robotic finger with exceptional adaptation in all directions. In particular, we managed to insert a pair of optical fibers inside the fing… ▽ More Robotic fingers made of soft material and compliant structures usually lead to superior adaptation when interacting with the unstructured physical environment. In this paper, we present an embedded sensing solution using optical fibers for an omni-adaptive soft robotic finger with exceptional adaptation in all directions. In particular, we managed to insert a pair of optical fibers inside the finger's structural cavity without interfering with its adaptive performance. The resultant integration is scalable as a versatile, low-cost, and moisture-proof solution for physically safe human-robot interaction. In addition, we experimented with our finger design for an object sorting task and identified sectional diameters of 94\% objects within the $\pm$6mm error and measured 80\% of the structural strains within $\pm$0.1mm/mm error. The proposed sensor design opens many doors in future applications of soft robotics for scalable and adaptive physical interactions in the unstructured environment. △ Less

Submitted 29 February, 2020; originally announced March 2020.

Comments: 8 pages, 6 figures, full-length version of a submission to IEEE RoboSoft 2020

Journal ref: 2020 3rd IEEE International Conference on Soft Robotics (RoboSoft)

Showing 1–50 of 61 results for author: Wan, F