Search | arXiv e-print repository

Category-level Object Detection, Pose Estimation and Reconstruction from Stereo Images

Authors: Chuanrui Zhang, Yonggen Ling, Minglei Lu, Minghan Qin, Haoqian Wang

Abstract: We study the 3D object understanding task for manipulating everyday objects with different material properties (diffuse, specular, transparent and mixed). Existing monocular and RGB-D methods suffer from scale ambiguity due to missing or imprecise depth measurements. We present CODERS, a one-stage approach for Category-level Object Detection, pose Estimation and Reconstruction from Stereo images.… ▽ More We study the 3D object understanding task for manipulating everyday objects with different material properties (diffuse, specular, transparent and mixed). Existing monocular and RGB-D methods suffer from scale ambiguity due to missing or imprecise depth measurements. We present CODERS, a one-stage approach for Category-level Object Detection, pose Estimation and Reconstruction from Stereo images. The base of our pipeline is an implicit stereo matching module that combines stereo image features with 3D position information. Concatenating this presented module and the following transform-decoder architecture leads to end-to-end learning of multiple tasks required by robot manipulation. Our approach significantly outperforms all competing methods in the public TOD dataset. Furthermore, trained on simulated data, CODERS generalize well to unseen category-level object instances in real-world robot manipulation experiments. Our dataset, code, and demos will be available on our project page. △ Less

Submitted 17 July, 2024; v1 submitted 9 July, 2024; originally announced July 2024.

arXiv:2407.06196 [pdf, other]

Poetry2Image: An Iterative Correction Framework for Images Generated from Chinese Classical Poetry

Authors: Jing Jiang, Yiran Ling, Binzhu Li, Pengxiang Li, Junming Piao, Yu Zhang

Abstract: Text-to-image generation models often struggle with key element loss or semantic confusion in tasks involving Chinese classical poetry.Addressing this issue through fine-tuning models needs considerable training costs. Additionally, manual prompts for re-diffusion adjustments need professional knowledge. To solve this problem, we propose Poetry2Image, an iterative correction framework for images g… ▽ More Text-to-image generation models often struggle with key element loss or semantic confusion in tasks involving Chinese classical poetry.Addressing this issue through fine-tuning models needs considerable training costs. Additionally, manual prompts for re-diffusion adjustments need professional knowledge. To solve this problem, we propose Poetry2Image, an iterative correction framework for images generated from Chinese classical poetry. Utilizing an external poetry dataset, Poetry2Image establishes an automated feedback and correction loop, which enhances the alignment between poetry and image through image generation models and subsequent re-diffusion modifications suggested by large language models (LLM). Using a test set of 200 sentences of Chinese classical poetry, the proposed method--when integrated with five popular image generation models--achieves an average element completeness of 70.63%, representing an improvement of 25.56% over direct image generation. In tests of semantic correctness, our method attains an average semantic consistency of 80.09%. The study not only promotes the dissemination of ancient poetry culture but also offers a reference for similar non-fine-tuning methods to enhance LLM generation. △ Less

Submitted 15 June, 2024; originally announced July 2024.

Comments: 13 pages, 7 figures

arXiv:2407.05415 [pdf, other]

DIVESPOT: Depth Integrated Volume Estimation of Pile of Things Based on Point Cloud

Authors: Yiran Ling, Rongqiang Zhao, Yixuan Shen, Dongbo Li, Jing Jin, Jie Liu

Abstract: Non-contact volume estimation of pile-type objects has considerable potential in industrial scenarios, including grain, coal, mining, and stone materials. However, using existing method for these scenarios is challenged by unstable measurement poses, significant light interference, the difficulty of training data collection, and the computational burden brought by large piles. To address the above… ▽ More Non-contact volume estimation of pile-type objects has considerable potential in industrial scenarios, including grain, coal, mining, and stone materials. However, using existing method for these scenarios is challenged by unstable measurement poses, significant light interference, the difficulty of training data collection, and the computational burden brought by large piles. To address the above issues, we propose the Depth Integrated Volume EStimation of Pile Of Things (DIVESPOT) based on point cloud technology in this study. For the challenges of unstable measurement poses, the point cloud pose correction and filtering algorithm is designed based on the Random Sample Consensus (RANSAC) and the Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN). To cope with light interference and to avoid the relying on training data, the height-distribution-based ground feature extraction algorithm is proposed to achieve RGB-independent. To reduce the computational burden, the storage space optimizing strategy is developed, such that accurate estimation can be acquired by using compressed voxels. Experimental results demonstrate that the DIVESPOT method enables non-data-driven, RGB-independent segmentation of pile point clouds, maintaining a volume calculation relative error within 2%. Even with 90% compression of the voxel mesh, the average error of the results can be under 3%. △ Less

Submitted 7 July, 2024; originally announced July 2024.

arXiv:2406.10521 [pdf, other]

MALLM-GAN: Multi-Agent Large Language Model as Generative Adversarial Network for Synthesizing Tabular Data

Authors: Yaobin Ling, Xiaoqian Jiang, Yejin Kim

Abstract: In the era of big data, access to abundant data is crucial for driving research forward. However, such data is often inaccessible due to privacy concerns or high costs, particularly in healthcare domain. Generating synthetic (tabular) data can address this, but existing models typically require substantial amounts of data to train effectively, contradicting our objective to solve data scarcity. To… ▽ More In the era of big data, access to abundant data is crucial for driving research forward. However, such data is often inaccessible due to privacy concerns or high costs, particularly in healthcare domain. Generating synthetic (tabular) data can address this, but existing models typically require substantial amounts of data to train effectively, contradicting our objective to solve data scarcity. To address this challenge, we propose a novel framework to generate synthetic tabular data, powered by large language models (LLMs) that emulates the architecture of a Generative Adversarial Network (GAN). By incorporating data generation process as contextual information and utilizing LLM as the optimizer, our approach significantly enhance the quality of synthetic data generation in common scenarios with small sample sizes. Our experimental results on public and private datasets demonstrate that our model outperforms several state-of-art models regarding generating higher quality synthetic data for downstream tasks while keeping privacy of the real data. △ Less

Submitted 29 June, 2024; v1 submitted 15 June, 2024; originally announced June 2024.

arXiv:2404.14934 [pdf, other]

G3R: Generating Rich and Fine-grained mmWave Radar Data from 2D Videos for Generalized Gesture Recognition

Authors: Kaikai Deng, Dong Zhao, Wenxin Zheng, Yue Ling, Kangwen Yin, Huadong Ma

Abstract: Millimeter wave radar is gaining traction recently as a promising modality for enabling pervasive and privacy-preserving gesture recognition. However, the lack of rich and fine-grained radar datasets hinders progress in developing generalized deep learning models for gesture recognition across various user postures (e.g., standing, sitting), positions, and scenes. To remedy this, we resort to desi… ▽ More Millimeter wave radar is gaining traction recently as a promising modality for enabling pervasive and privacy-preserving gesture recognition. However, the lack of rich and fine-grained radar datasets hinders progress in developing generalized deep learning models for gesture recognition across various user postures (e.g., standing, sitting), positions, and scenes. To remedy this, we resort to designing a software pipeline that exploits wealthy 2D videos to generate realistic radar data, but it needs to address the challenge of simulating diversified and fine-grained reflection properties of user gestures. To this end, we design G3R with three key components: (i) a gesture reflection point generator expands the arm's skeleton points to form human reflection points; (ii) a signal simulation model simulates the multipath reflection and attenuation of radar signals to output the human intensity map; (iii) an encoder-decoder model combines a sampling module and a fitting module to address the differences in number and distribution of points between generated and real-world radar data for generating realistic radar data. We implement and evaluate G3R using 2D videos from public data sources and self-collected real-world radar data, demonstrating its superiority over other state-of-the-art approaches for gesture recognition. △ Less

Submitted 23 April, 2024; originally announced April 2024.

Comments: 18 pages, 29 figures

arXiv:2404.10777 [pdf, other]

Divide-Conquer-and-Merge: Memory- and Time-Efficient Holographic Displays

Authors: Zhenxing Dong, Jidong Jia, Yan Li, Yuye Ling

Abstract: Recently, deep learning-based computer-generated holography (CGH) has demonstrated tremendous potential in three-dimensional (3D) displays and yielded impressive display quality. However, most existing deep learning-based CGH techniques can only generate holograms of 1080p resolution, which is far from the ultra-high resolution (16K+) required for practical virtual reality (VR) and augmented reali… ▽ More Recently, deep learning-based computer-generated holography (CGH) has demonstrated tremendous potential in three-dimensional (3D) displays and yielded impressive display quality. However, most existing deep learning-based CGH techniques can only generate holograms of 1080p resolution, which is far from the ultra-high resolution (16K+) required for practical virtual reality (VR) and augmented reality (AR) applications to support a wide field of view and large eye box. One of the major obstacles in current CGH frameworks lies in the limited memory available on consumer-grade GPUs which could not facilitate the generation of higher-definition holograms. To overcome the aforementioned challenge, we proposed a divide-conquer-and-merge strategy to address the memory and computational capacity scarcity in ultra-high-definition CGH generation. This algorithm empowers existing CGH frameworks to synthesize higher-definition holograms at a faster speed while maintaining high-fidelity image display quality. Both simulations and experiments were conducted to demonstrate the capabilities of the proposed framework. By integrating our strategy into HoloNet and CCNNs, we achieved significant reductions in GPU memory usage during the training period by 64.3\% and 12.9\%, respectively. Furthermore, we observed substantial speed improvements in hologram generation, with an acceleration of up to 3$\times$ and 2 $\times$, respectively. Particularly, we successfully trained and inferred 8K definition holograms on an NVIDIA GeForce RTX 3090 GPU for the first time in simulations. Furthermore, we conducted full-color optical experiments to verify the effectiveness of our method. We believe our strategy can provide a novel approach for memory- and time-efficient holographic displays. △ Less

Submitted 25 February, 2024; originally announced April 2024.

Comments: This paper has been accepted as conference paper in IEEE VR 2024

arXiv:2403.01162 [pdf, ps, other]

doi 10.1016/j.orl.2024.107103

Envy-Free House Allocation with Minimum Subsidy

Authors: Davin Choo, Yan Hao Ling, Warut Suksompong, Nicholas Teh, Jian Zhang

Abstract: House allocation refers to the problem where $m$ houses are to be allocated to $n$ agents so that each agent receives one house. Since an envy-free house allocation does not always exist, we consider finding such an allocation in the presence of subsidy. We show that computing an envy-free allocation with minimum subsidy is NP-hard in general, but can be done efficiently if $m$ differs from $n$ by… ▽ More House allocation refers to the problem where $m$ houses are to be allocated to $n$ agents so that each agent receives one house. Since an envy-free house allocation does not always exist, we consider finding such an allocation in the presence of subsidy. We show that computing an envy-free allocation with minimum subsidy is NP-hard in general, but can be done efficiently if $m$ differs from $n$ by an additive constant or if the agents have identical utilities. △ Less

Submitted 2 March, 2024; originally announced March 2024.

Journal ref: Operations Research Letters, 54:107103 (2024)

arXiv:2401.02682 [pdf, other]

Homophily-Related: Adaptive Hybrid Graph Filter for Multi-View Graph Clustering

Authors: Zichen Wen, Yawen Ling, Yazhou Ren, Tianyi Wu, Jianpeng Chen, Xiaorong Pu, Zhifeng Hao, Lifang He

Abstract: Recently there is a growing focus on graph data, and multi-view graph clustering has become a popular area of research interest. Most of the existing methods are only applicable to homophilous graphs, yet the extensive real-world graph data can hardly fulfill the homophily assumption, where the connected nodes tend to belong to the same class. Several studies have pointed out that the poor perform… ▽ More Recently there is a growing focus on graph data, and multi-view graph clustering has become a popular area of research interest. Most of the existing methods are only applicable to homophilous graphs, yet the extensive real-world graph data can hardly fulfill the homophily assumption, where the connected nodes tend to belong to the same class. Several studies have pointed out that the poor performance on heterophilous graphs is actually due to the fact that conventional graph neural networks (GNNs), which are essentially low-pass filters, discard information other than the low-frequency information on the graph. Nevertheless, on certain graphs, particularly heterophilous ones, neglecting high-frequency information and focusing solely on low-frequency information impedes the learning of node representations. To break this limitation, our motivation is to perform graph filtering that is closely related to the homophily degree of the given graph, with the aim of fully leveraging both low-frequency and high-frequency signals to learn distinguishable node embedding. In this work, we propose Adaptive Hybrid Graph Filter for Multi-View Graph Clustering (AHGFC). Specifically, a graph joint process and graph joint aggregation matrix are first designed by using the intrinsic node features and adjacency relationship, which makes the low and high-frequency signals on the graph more distinguishable. Then we design an adaptive hybrid graph filter that is related to the homophily degree, which learns the node embedding based on the graph joint aggregation matrix. After that, the node embedding of each view is weighted and fused into a consensus embedding for the downstream task. Experimental results show that our proposed model performs well on six datasets containing homophilous and heterophilous graphs. △ Less

Submitted 5 January, 2024; originally announced January 2024.

Comments: Accepted by AAAI2024

arXiv:2312.10655 [pdf, other]

Practical Non-Intrusive GUI Exploration Testing with Visual-based Robotic Arms

Authors: Shengcheng Yu, Chunrong Fang, Mingzhe Du, Yuchen Ling, Zhenyu Chen, Zhendong Su

Abstract: GUI testing is significant in the SE community. Most existing frameworks are intrusive and only support some specific platforms. With the development of distinct scenarios, diverse embedded systems or customized operating systems on different devices do not support existing intrusive GUI testing frameworks. Some approaches adopt robotic arms to replace the interface invoking of mobile apps under t… ▽ More GUI testing is significant in the SE community. Most existing frameworks are intrusive and only support some specific platforms. With the development of distinct scenarios, diverse embedded systems or customized operating systems on different devices do not support existing intrusive GUI testing frameworks. Some approaches adopt robotic arms to replace the interface invoking of mobile apps under test and use computer vision technologies to identify GUI elements. However, some challenges are unsolved. First, existing approaches assume that GUI screens are fixed so that they cannot be adapted to diverse systems with different screen conditions. Second, existing approaches use XY-plane robotic arms, which cannot flexibly simulate testing operations. Third, existing approaches ignore compatibility bugs and only focus on crash bugs. A more practical approach is required for the non-intrusive scenario. We propose a practical non-intrusive GUI testing framework with visual robotic arms. RoboTest integrates novel GUI screen and widget detection algorithms, adaptive to detecting screens of different sizes and then to extracting GUI widgets from the detected screens. Then, a set of testing operations is applied with a 4-DOF robotic arm, which effectively and flexibly simulates human testing operations. During app exploration, RoboTest integrates the Principle of Proximity-guided exploration strategy, choosing close widgets of the previous targets to reduce robotic arm movement overhead and improve exploration efficiency. RoboTest can effectively detect some compatibility bugs beyond crash bugs with a GUI comparison on different devices of the same test operations. We evaluate RoboTest with 20 mobile apps, with a case study on an embedded system. The results show that RoboTest can effectively, efficiently, and generally explore AUTs to find bugs and reduce exploration time overhead. △ Less

Submitted 17 December, 2023; originally announced December 2023.

Comments: Accepted by the 46th International Conference on Software Engineering (ICSE 2024)

arXiv:2311.14251 [pdf, ps, other]

Optimal 1-bit Error Exponent for 2-hop Relaying with Binary-Input Channels

Authors: Yan Hao Ling, Jonathan Scarlett

Abstract: In this paper, we study the problem of relaying a single bit over a tandem of binary-input channels, with the goal of attaining the highest possible error exponent in the exponentially decaying error probability. Our previous work gave an exact characterization of the best possible error exponent in various special cases, including when the two channels are identical, but the general case was left… ▽ More In this paper, we study the problem of relaying a single bit over a tandem of binary-input channels, with the goal of attaining the highest possible error exponent in the exponentially decaying error probability. Our previous work gave an exact characterization of the best possible error exponent in various special cases, including when the two channels are identical, but the general case was left as an open problem. We resolve this open problem by deriving a new converse bound that matches our existing achievability bound. △ Less

Submitted 6 June, 2024; v1 submitted 23 November, 2023; originally announced November 2023.

Comments: IEEE Transactions on Information Theory

arXiv:2310.20544 [pdf, other]

Information-theoretic causality and applications to turbulence: energy cascade and inner/outer layer interactions

Authors: Adrián Lozano-Durán, Gonzalo Arranz, Yuenong Ling

Abstract: We introduce an information-theoretic method for quantifying causality in chaotic systems. The approach, referred to as IT-causality, quantifies causality by measuring the information gained about future events conditioned on the knowledge of past events. The causal interactions are classified into redundant, unique, and synergistic contributions depending on their nature. The formulation is non-i… ▽ More We introduce an information-theoretic method for quantifying causality in chaotic systems. The approach, referred to as IT-causality, quantifies causality by measuring the information gained about future events conditioned on the knowledge of past events. The causal interactions are classified into redundant, unique, and synergistic contributions depending on their nature. The formulation is non-intrusive, invariance under invertible transformations of the variables, and provides the missing causality due to unobserved variables. The method only requires pairs of past-future events of the quantities of interest, making it convenient for both computational simulations and experimental investigations. IT-causality is validated in four scenarios representing basic causal interactions among variables: mediator, confounder, redundant collider, and synergistic collider. The approach is leveraged to address two questions relevant to turbulence research: i) the scale locality of the energy cascade in isotropic turbulence, and ii) the interactions between inner and outer layer flow motions in wall-bounded turbulence. In the former case, we demonstrate that causality in the energy cascade flows sequentially from larger to smaller scales without requiring intermediate scales. Conversely, the flow of information from small to large scales is shown to be redundant. In the second problem, we observe a unidirectional causality flow, with causality predominantly originating from the outer layer and propagating towards the inner layer, but not vice versa. The decomposition of IT-causality into intensities also reveals that the causality is primarily associated with high-velocity streaks. △ Less

Submitted 31 October, 2023; originally announced October 2023.

arXiv:2310.15584 [pdf, other]

Accelerating Split Federated Learning over Wireless Communication Networks

Authors: Ce Xu, Jinxuan Li, Yuan Liu, Yushi Ling, Miaowen Wen

Abstract: The development of artificial intelligence (AI) provides opportunities for the promotion of deep neural network (DNN)-based applications. However, the large amount of parameters and computational complexity of DNN makes it difficult to deploy it on edge devices which are resource-constrained. An efficient method to address this challenge is model partition/splitting, in which DNN is divided into t… ▽ More The development of artificial intelligence (AI) provides opportunities for the promotion of deep neural network (DNN)-based applications. However, the large amount of parameters and computational complexity of DNN makes it difficult to deploy it on edge devices which are resource-constrained. An efficient method to address this challenge is model partition/splitting, in which DNN is divided into two parts which are deployed on device and server respectively for co-training or co-inference. In this paper, we consider a split federated learning (SFL) framework that combines the parallel model training mechanism of federated learning (FL) and the model splitting structure of split learning (SL). We consider a practical scenario of heterogeneous devices with individual split points of DNN. We formulate a joint problem of split point selection and bandwidth allocation to minimize the system latency. By using alternating optimization, we decompose the problem into two sub-problems and solve them optimally. Experiment results demonstrate the superiority of our work in latency reduction and accuracy improvement. △ Less

Submitted 24 October, 2023; originally announced October 2023.

arXiv:2310.01361 [pdf, other]

GenSim: Generating Robotic Simulation Tasks via Large Language Models

Authors: Lirui Wang, Yiyang Ling, Zhecheng Yuan, Mohit Shridhar, Chen Bao, Yuzhe Qin, Bailin Wang, Huazhe Xu, Xiaolong Wang

Abstract: Collecting large amounts of real-world interaction data to train general robotic policies is often prohibitively expensive, thus motivating the use of simulation data. However, existing methods for data generation have generally focused on scene-level diversity (e.g., object instances and poses) rather than task-level diversity, due to the human effort required to come up with and verify novel tas… ▽ More Collecting large amounts of real-world interaction data to train general robotic policies is often prohibitively expensive, thus motivating the use of simulation data. However, existing methods for data generation have generally focused on scene-level diversity (e.g., object instances and poses) rather than task-level diversity, due to the human effort required to come up with and verify novel tasks. This has made it challenging for policies trained on simulation data to demonstrate significant task-level generalization. In this paper, we propose to automatically generate rich simulation environments and expert demonstrations by exploiting a large language models' (LLM) grounding and coding ability. Our approach, dubbed GenSim, has two modes: goal-directed generation, wherein a target task is given to the LLM and the LLM proposes a task curriculum to solve the target task, and exploratory generation, wherein the LLM bootstraps from previous tasks and iteratively proposes novel tasks that would be helpful in solving more complex tasks. We use GPT4 to expand the existing benchmark by ten times to over 100 tasks, on which we conduct supervised finetuning and evaluate several LLMs including finetuned GPTs and Code Llama on code generation for robotic simulation tasks. Furthermore, we observe that LLMs-generated simulation programs can enhance task-level generalization significantly when used for multitask policy training. We further find that with minimal sim-to-real adaptation, the multitask policies pretrained on GPT4-generated simulation tasks exhibit stronger transfer to unseen long-horizon tasks in the real world and outperform baselines by 25%. See the project website (https://liruiw.github.io/gensim) for code, demos, and videos. △ Less

Submitted 21 January, 2024; v1 submitted 2 October, 2023; originally announced October 2023.

Comments: See our project website (https://liruiw.github.io/gensim), demo and datasets (https://huggingface.co/spaces/Gen-Sim/Gen-Sim), and code (https://github.com/liruiw/GenSim) for more details

Journal ref: International Conference on Learning Representations (ICLR), 2024

arXiv:2309.13574 [pdf, ps, other]

LLM for Test Script Generation and Migration: Challenges, Capabilities, and Opportunities

Authors: Shengcheng Yu, Chunrong Fang, Yuchen Ling, Chentian Wu, Zhenyu Chen

Abstract: This paper investigates the application of large language models (LLM) in the domain of mobile application test script generation. Test script generation is a vital component of software testing, enabling efficient and reliable automation of repetitive test tasks. However, existing generation approaches often encounter limitations, such as difficulties in accurately capturing and reproducing test… ▽ More This paper investigates the application of large language models (LLM) in the domain of mobile application test script generation. Test script generation is a vital component of software testing, enabling efficient and reliable automation of repetitive test tasks. However, existing generation approaches often encounter limitations, such as difficulties in accurately capturing and reproducing test scripts across diverse devices, platforms, and applications. These challenges arise due to differences in screen sizes, input modalities, platform behaviors, API inconsistencies, and application architectures. Overcoming these limitations is crucial for achieving robust and comprehensive test automation. By leveraging the capabilities of LLMs, we aim to address these challenges and explore its potential as a versatile tool for test automation. We investigate how well LLMs can adapt to diverse devices and systems while accurately capturing and generating test scripts. Additionally, we evaluate its cross-platform generation capabilities by assessing its ability to handle operating system variations and platform-specific behaviors. Furthermore, we explore the application of LLMs in cross-app migration, where it generates test scripts across different applications and software environments based on existing scripts. Throughout the investigation, we analyze its adaptability to various user interfaces, app architectures, and interaction patterns, ensuring accurate script generation and compatibility. The findings of this research contribute to the understanding of LLMs' capabilities in test automation. Ultimately, this research aims to enhance software testing practices, empowering app developers to achieve higher levels of software quality and development efficiency. △ Less

Submitted 24 September, 2023; originally announced September 2023.

Comments: Accepted by the 23rd IEEE International Conference on Software Quality, Reliability, and Security (QRS 2023)

arXiv:2308.03782 [pdf, ps, other]

Bio+Clinical BERT, BERT Base, and CNN Performance Comparison for Predicting Drug-Review Satisfaction

Authors: Yue Ling

Abstract: The objective of this study is to develop natural language processing (NLP) models that can analyze patients' drug reviews and accurately classify their satisfaction levels as positive, neutral, or negative. Such models would reduce the workload of healthcare professionals and provide greater insight into patients' quality of life, which is a critical indicator of treatment effectiveness. To achie… ▽ More The objective of this study is to develop natural language processing (NLP) models that can analyze patients' drug reviews and accurately classify their satisfaction levels as positive, neutral, or negative. Such models would reduce the workload of healthcare professionals and provide greater insight into patients' quality of life, which is a critical indicator of treatment effectiveness. To achieve this, we implemented and evaluated several classification models, including a BERT base model, Bio+Clinical BERT, and a simpler CNN. Results indicate that the medical domain-specific Bio+Clinical BERT model significantly outperformed the general domain base BERT model, achieving macro f1 and recall score improvement of 11%, as shown in Table 2. Future research could explore how to capitalize on the specific strengths of each model. Bio+Clinical BERT excels in overall performance, particularly with medical jargon, while the simpler CNN demonstrates the ability to identify crucial words and accurately classify sentiment in texts with conflicting sentiments. △ Less

Submitted 2 August, 2023; originally announced August 2023.

Comments: KDD 2023 Workshop on Applied Data Science for Healthcare

arXiv:2307.11130 [pdf, other]

Frequency-aware optical coherence tomography image super-resolution via conditional generative adversarial neural network

Authors: Xueshen Li, Zhenxing Dong, Hongshan Liu, Jennifer J. Kang-Mieler, Yuye Ling, Yu Gan

Abstract: Optical coherence tomography (OCT) has stimulated a wide range of medical image-based diagnosis and treatment in fields such as cardiology and ophthalmology. Such applications can be further facilitated by deep learning-based super-resolution technology, which improves the capability of resolving morphological structures. However, existing deep learning-based method only focuses on spatial distrib… ▽ More Optical coherence tomography (OCT) has stimulated a wide range of medical image-based diagnosis and treatment in fields such as cardiology and ophthalmology. Such applications can be further facilitated by deep learning-based super-resolution technology, which improves the capability of resolving morphological structures. However, existing deep learning-based method only focuses on spatial distribution and disregard frequency fidelity in image reconstruction, leading to a frequency bias. To overcome this limitation, we propose a frequency-aware super-resolution framework that integrates three critical frequency-based modules (i.e., frequency transformation, frequency skip connection, and frequency alignment) and frequency-based loss function into a conditional generative adversarial network (cGAN). We conducted a large-scale quantitative study from an existing coronary OCT dataset to demonstrate the superiority of our proposed framework over existing deep learning frameworks. In addition, we confirmed the generalizability of our framework by applying it to fish corneal images and rat retinal images, demonstrating its capability to super-resolve morphological details in eye imaging. △ Less

Submitted 20 July, 2023; originally announced July 2023.

Comments: 13 pages, 7 figures, submitted to Biomedical Optics Express special issue

arXiv:2307.04231 [pdf, other]

Mx2M: Masked Cross-Modality Modeling in Domain Adaptation for 3D Semantic Segmentation

Authors: Boxiang Zhang, Zunran Wang, Yonggen Ling, Yuanyuan Guan, Shenghao Zhang, Wenhui Li

Abstract: Existing methods of cross-modal domain adaptation for 3D semantic segmentation predict results only via 2D-3D complementarity that is obtained by cross-modal feature matching. However, as lacking supervision in the target domain, the complementarity is not always reliable. The results are not ideal when the domain gap is large. To solve the problem of lacking supervision, we introduce masked model… ▽ More Existing methods of cross-modal domain adaptation for 3D semantic segmentation predict results only via 2D-3D complementarity that is obtained by cross-modal feature matching. However, as lacking supervision in the target domain, the complementarity is not always reliable. The results are not ideal when the domain gap is large. To solve the problem of lacking supervision, we introduce masked modeling into this task and propose a method Mx2M, which utilizes masked cross-modality modeling to reduce the large domain gap. Our Mx2M contains two components. One is the core solution, cross-modal removal and prediction (xMRP), which makes the Mx2M adapt to various scenarios and provides cross-modal self-supervision. The other is a new way of cross-modal feature matching, the dynamic cross-modal filter (DxMF) that ensures the whole method dynamically uses more suitable 2D-3D complementarity. Evaluation of the Mx2M on three DA scenarios, including Day/Night, USA/Singapore, and A2D2/SemanticKITTI, brings large improvements over previous methods on many metrics. △ Less

Submitted 9 July, 2023; originally announced July 2023.

arXiv:2306.11025 [pdf, ps, other]

Temporal Data Meets LLM -- Explainable Financial Time Series Forecasting

Authors: Xinli Yu, Zheng Chen, Yuan Ling, Shujing Dong, Zongyi Liu, Yanbin Lu

Abstract: This paper presents a novel study on harnessing Large Language Models' (LLMs) outstanding knowledge and reasoning abilities for explainable financial time series forecasting. The application of machine learning models to financial time series comes with several challenges, including the difficulty in cross-sequence reasoning and inference, the hurdle of incorporating multi-modal signals from histo… ▽ More This paper presents a novel study on harnessing Large Language Models' (LLMs) outstanding knowledge and reasoning abilities for explainable financial time series forecasting. The application of machine learning models to financial time series comes with several challenges, including the difficulty in cross-sequence reasoning and inference, the hurdle of incorporating multi-modal signals from historical news, financial knowledge graphs, etc., and the issue of interpreting and explaining the model results. In this paper, we focus on NASDAQ-100 stocks, making use of publicly accessible historical stock price data, company metadata, and historical economic/financial news. We conduct experiments to illustrate the potential of LLMs in offering a unified solution to the aforementioned challenges. Our experiments include trying zero-shot/few-shot inference with GPT-4 and instruction-based fine-tuning with a public LLM model Open LLaMA. We demonstrate our approach outperforms a few baselines, including the widely applied classic ARMA-GARCH model and a gradient-boosting tree model. Through the performance comparison results and a few examples, we find LLMs can make a well-thought decision by reasoning over information from both textual news and price time series and extracting insights, leveraging cross-sequence information, and utilizing the inherent knowledge embedded within the LLM. Additionally, we show that a publicly available LLM such as Open-LLaMA, after fine-tuning, can comprehend the instruction to generate explainable forecasts and achieve reasonable performance, albeit relatively inferior in comparison to GPT-4. △ Less

Submitted 19 June, 2023; originally announced June 2023.

ACM Class: F.2.2; I.2.7; I.2.1

arXiv:2305.14655 [pdf, other]

Learning Survival Distribution with Implicit Survival Function

Authors: Yu Ling, Weimin Tan, Bo Yan

Abstract: Survival analysis aims at modeling the relationship between covariates and event occurrence with some untracked (censored) samples. In implementation, existing methods model the survival distribution with strong assumptions or in a discrete time space for likelihood estimation with censorship, which leads to weak generalization. In this paper, we propose Implicit Survival Function (ISF) based on I… ▽ More Survival analysis aims at modeling the relationship between covariates and event occurrence with some untracked (censored) samples. In implementation, existing methods model the survival distribution with strong assumptions or in a discrete time space for likelihood estimation with censorship, which leads to weak generalization. In this paper, we propose Implicit Survival Function (ISF) based on Implicit Neural Representation for survival distribution estimation without strong assumptions,and employ numerical integration to approximate the cumulative distribution function for prediction and optimization. Experimental results show that ISF outperforms the state-of-the-art methods in three public datasets and has robustness to the hyperparameter controlling estimation precision. △ Less

Submitted 23 May, 2023; originally announced May 2023.

arXiv:2305.07223 [pdf, other]

Transavs: End-To-End Audio-Visual Segmentation With Transformer

Authors: Yuhang Ling, Yuxi Li, Zhenye Gan, Jiangning Zhang, Mingmin Chi, Yabiao Wang

Abstract: Audio-Visual Segmentation (AVS) is a challenging task, which aims to segment sounding objects in video frames by exploring audio signals. Generally AVS faces two key challenges: (1) Audio signals inherently exhibit a high degree of information density, as sounds produced by multiple objects are entangled within the same audio stream; (2) Objects of the same category tend to produce similar audio s… ▽ More Audio-Visual Segmentation (AVS) is a challenging task, which aims to segment sounding objects in video frames by exploring audio signals. Generally AVS faces two key challenges: (1) Audio signals inherently exhibit a high degree of information density, as sounds produced by multiple objects are entangled within the same audio stream; (2) Objects of the same category tend to produce similar audio signals, making it difficult to distinguish between them and thus leading to unclear segmentation results. Toward this end, we propose TransAVS, the first Transformer-based end-to-end framework for AVS task. Specifically, TransAVS disentangles the audio stream as audio queries, which will interact with images and decode into segmentation masks with full transformer architectures. This scheme not only promotes comprehensive audio-image communication but also explicitly excavates instance cues encapsulated in the scene. Meanwhile, to encourage these audio queries to capture distinctive sounding objects instead of degrading to be homogeneous, we devise two self-supervised loss functions at both query and mask levels, allowing the model to capture distinctive features within similar audio data and achieve more precise segmentation. Our experiments demonstrate that TransAVS achieves state-of-the-art results on the AVSBench dataset, highlighting its effectiveness in bridging the gap between audio and visual modalities. △ Less

Submitted 26 December, 2023; v1 submitted 11 May, 2023; originally announced May 2023.

Comments: 4 pages, 3 figures

arXiv:2304.02226 [pdf, ps, other]

Maxflow-Based Bounds for Low-Rate Information Propagation over Noisy Networks

Authors: Yan Hao Ling, Jonathan Scarlett

Abstract: We study error exponents for the problem of low-rate communication over a directed graph, where each edge in the graph represents a noisy communication channel, and there is a single source and destination. We derive maxflow-based achievability and converse bounds on the error exponent that match when there are two messages and all channels satisfy a symmetry condition called pairwise reversibilit… ▽ More We study error exponents for the problem of low-rate communication over a directed graph, where each edge in the graph represents a noisy communication channel, and there is a single source and destination. We derive maxflow-based achievability and converse bounds on the error exponent that match when there are two messages and all channels satisfy a symmetry condition called pairwise reversibility. More generally, we show that the upper and lower bounds match to within a factor of 4. We also show that with three messages there are cases where the maxflow-based error exponent is strictly suboptimal, thus showing that our tightness result cannot be extended beyond two messages without further assumptions. △ Less

Submitted 5 April, 2023; originally announced April 2023.

arXiv:2210.07011 [pdf, other]

Variational Graph Generator for Multi-View Graph Clustering

Authors: Jianpeng Chen, Yawen Ling, Jie Xu, Yazhou Ren, Shudong Huang, Xiaorong Pu, Zhifeng Hao, Philip S. Yu, Lifang He

Abstract: Multi-view graph clustering (MGC) methods are increasingly being studied due to the explosion of multi-view data with graph structural information. The critical point of MGC is to better utilize the view-specific and view-common information in features and graphs of multiple views. However, existing works have an inherent limitation that they are unable to concurrently utilize the consensus graph… ▽ More Multi-view graph clustering (MGC) methods are increasingly being studied due to the explosion of multi-view data with graph structural information. The critical point of MGC is to better utilize the view-specific and view-common information in features and graphs of multiple views. However, existing works have an inherent limitation that they are unable to concurrently utilize the consensus graph information across multiple graphs and the view-specific feature information. To address this issue, we propose Variational Graph Generator for Multi-View Graph Clustering (VGMGC). Specifically, a novel variational graph generator is proposed to extract common information among multiple graphs. This generator infers a reliable variational consensus graph based on a priori assumption over multiple graphs. Then a simple yet effective graph encoder in conjunction with the multi-view clustering objective is presented to learn the desired graph embeddings for clustering, which embeds the inferred view-common graph and view-specific graphs together with features. Finally, theoretical results illustrate the rationality of VGMGC by analyzing the uncertainty of the inferred consensus graph with information bottleneck principle. Extensive experiments demonstrate the superior performance of our VGMGC over SOTAs. △ Less

Submitted 16 December, 2022; v1 submitted 13 October, 2022; originally announced October 2022.

Comments: submitted to TNNLS

arXiv:2210.01295 [pdf, other]

Max-Quantile Grouped Infinite-Arm Bandits

Authors: Ivan Lau, Yan Hao Ling, Mayank Shrivastava, Jonathan Scarlett

Abstract: In this paper, we consider a bandit problem in which there are a number of groups each consisting of infinitely many arms. Whenever a new arm is requested from a given group, its mean reward is drawn from an unknown reservoir distribution (different for each group), and the uncertainty in the arm's mean reward can only be reduced via subsequent pulls of the arm. The goal is to identify the infinit… ▽ More In this paper, we consider a bandit problem in which there are a number of groups each consisting of infinitely many arms. Whenever a new arm is requested from a given group, its mean reward is drawn from an unknown reservoir distribution (different for each group), and the uncertainty in the arm's mean reward can only be reduced via subsequent pulls of the arm. The goal is to identify the infinite-arm group whose reservoir distribution has the highest $(1-α)$-quantile (e.g., median if $α= \frac{1}{2}$), using as few total arm pulls as possible. We introduce a two-step algorithm that first requests a fixed number of arms from each group and then runs a finite-arm grouped max-quantile bandit algorithm. We characterize both the instance-dependent and worst-case regret, and provide a matching lower bound for the latter, while discussing various strengths, weaknesses, algorithmic improvements, and potential lower bounds associated with our instance-dependent upper bounds. △ Less

Submitted 1 February, 2023; v1 submitted 3 October, 2022; originally announced October 2022.

Comments: ALT 2023

arXiv:2209.08924 [pdf, other]

HVC-Net: Unifying Homography, Visibility, and Confidence Learning for Planar Object Tracking

Authors: Haoxian Zhang, Yonggen Ling

Abstract: Robust and accurate planar tracking over a whole video sequence is vitally important for many vision applications. The key to planar object tracking is to find object correspondences, modeled by homography, between the reference image and the tracked image. Existing methods tend to obtain wrong correspondences with changing appearance variations, camera-object relative motions and occlusions. To a… ▽ More Robust and accurate planar tracking over a whole video sequence is vitally important for many vision applications. The key to planar object tracking is to find object correspondences, modeled by homography, between the reference image and the tracked image. Existing methods tend to obtain wrong correspondences with changing appearance variations, camera-object relative motions and occlusions. To alleviate this problem, we present a unified convolutional neural network (CNN) model that jointly considers homography, visibility, and confidence. First, we introduce correlation blocks that explicitly account for the local appearance changes and camera-object relative motions as the base of our model. Second, we jointly learn the homography and visibility that links camera-object relative motions with occlusions. Third, we propose a confidence module that actively monitors the estimation quality from the pixel correlation distributions obtained in correlation blocks. All these modules are plugged into a Lucas-Kanade (LK) tracking pipeline to obtain both accurate and robust planar object tracking. Our approach outperforms the state-of-the-art methods on public POT and TMT datasets. Its superior performance is also verified on a real-world application, synthesizing high-quality in-video advertisements. △ Less

Submitted 19 September, 2022; originally announced September 2022.

Comments: Accepted to ECCV 2022

arXiv:2208.09116 [pdf, other]

Effective, Platform-Independent GUI Testing via Image Embedding and Reinforcement Learning

Authors: Shengcheng Yu, Chunrong Fang, Xin Li, Yuchen Ling, Zhenyu Chen, Zhendong Su

Abstract: Software applications have been playing an increasingly important role in various aspects of society. In particular, mobile apps and web apps are the most prevalent among all applications and are widely used in various industries as well as in people's daily lives. To help ensure mobile and web app quality, many approaches have been introduced to improve app GUI testing via automated exploration.… ▽ More Software applications have been playing an increasingly important role in various aspects of society. In particular, mobile apps and web apps are the most prevalent among all applications and are widely used in various industries as well as in people's daily lives. To help ensure mobile and web app quality, many approaches have been introduced to improve app GUI testing via automated exploration. Despite the extensive effort, existing approaches are still limited in reaching high code coverage, constructing high-quality models, and being generally applicable. Reinforcement learning-based approaches are faced with difficult challenges, including effective app state abstraction, reward function design, etc. Moreover, they heavily depend on the specific execution platforms, thus leading to poor generalizability and being unable to adapt to different platforms. We propose PIRLTest, an effective platform-independent approach for app testing. It utilizes computer vision and reinforcement learning techniques in a novel, synergistic manner for automated testing. It extracts the GUI widgets from GUI pages and characterizes the corresponding GUI layouts, embedding the GUI pages as states. The app GUI state combines the macroscopic perspective and the microscopic perspective, and attaches the critical semantic information from GUI images. This enables PIRLTest to be platform-independent and makes the testing approach generally applicable on different platforms. PIRLTest explores apps with the guidance of a curiosity-driven strategy, which uses a Q-network to estimate the values of specific state-action pairs to encourage more exploration in uncovered pages without platform dependency. The exploration will be assigned with rewards for all actions, which are designed considering both the app GUI states and the concrete widgets, to help the framework explore more uncovered pages. △ Less

Submitted 12 June, 2024; v1 submitted 18 August, 2022; originally announced August 2022.

Comments: Accepted by ACM Transactions on Software Engineering and Methodology in 2024

arXiv:2208.02003 [pdf, ps, other]

Multi-Bit Relaying over a Tandem of Channels

Authors: Yan Hao Ling, Jonathan Scarlett

Abstract: We study error exponents for the problem of relaying a message over a tandem of two channels sharing the same transition law, in particular moving beyond the 1-bit setting studied in recent related works. Our main results show that the 1-hop and 2-hop exponents coincide in both of the following settings: (i) the number of messages is fixed, and the channel law satisfies a condition called pairwise… ▽ More We study error exponents for the problem of relaying a message over a tandem of two channels sharing the same transition law, in particular moving beyond the 1-bit setting studied in recent related works. Our main results show that the 1-hop and 2-hop exponents coincide in both of the following settings: (i) the number of messages is fixed, and the channel law satisfies a condition called pairwise reversibility, or (ii) the channel is arbitrary, and a zero-rate limit is taken from above. In addition, we provide various extensions of our results that relax the assumptions of pairwise reversibility and/or the two channels having identical transition laws, and we provide an example for which the 2-hop exponent is strictly below the 1-hop exponent. △ Less

Submitted 24 February, 2023; v1 submitted 3 August, 2022; originally announced August 2022.

Comments: IEEE Transactions on Information Theory

arXiv:2207.12002 [pdf, ps, other]

An Optimal Motion Planning Framework for Quadruped Jumping

Authors: Zhitao Song, Linzhu Yue, Guangli Sun, Yihu Ling, Hongshuo Wei, Linhai Gui, Yun-Hui Liu

Abstract: This paper presents an optimal motion planning framework to generate versatile energy-optimal quadrupedal jumping motions automatically (e.g., flips, spin). The jumping motions via the centroidal dynamics are formulated as a 12-dimensional black-box optimization problem subject to the robot kino-dynamic constraints. Gradient-based approaches offer great success in addressing trajectory optimizatio… ▽ More This paper presents an optimal motion planning framework to generate versatile energy-optimal quadrupedal jumping motions automatically (e.g., flips, spin). The jumping motions via the centroidal dynamics are formulated as a 12-dimensional black-box optimization problem subject to the robot kino-dynamic constraints. Gradient-based approaches offer great success in addressing trajectory optimization (TO), yet, prior knowledge (e.g., reference motion, contact schedule) is required and results in sub-optimal solutions. The new proposed framework first employed a heuristics-based optimization method to avoid these problems. Moreover, a prioritization fitness function is created for heuristics-based algorithms in robot ground reaction force (GRF) planning, enhancing convergence and searching performance considerably. Since heuristics-based algorithms often require significant time, motions are planned offline and stored as a pre-motion library. A selector is designed to automatically choose motions with user-specified or perception information as input. The proposed framework has been successfully validated only with a simple continuously tracking PD controller in an open-source Mini-Cheetah by several challenging jumping motions, including jumping over a window-shaped obstacle with 30 cm height and left-flipping over a rectangle obstacle with 27 cm height. △ Less

Submitted 25 July, 2022; originally announced July 2022.

Comments: Accept by IROS 2022

arXiv:2205.03943 [pdf, other]

doi 10.1145/3528233.3530728

Learning to Brachiate via Simplified Model Imitation

Authors: Daniele Reda, Hung Yu Ling, Michiel van de Panne

Abstract: Brachiation is the primary form of locomotion for gibbons and siamangs, in which these primates swing from tree limb to tree limb using only their arms. It is challenging to control because of the limited control authority, the required advance planning, and the precision of the required grasps. We present a novel approach to this problem using reinforcement learning, and as demonstrated on a fing… ▽ More Brachiation is the primary form of locomotion for gibbons and siamangs, in which these primates swing from tree limb to tree limb using only their arms. It is challenging to control because of the limited control authority, the required advance planning, and the precision of the required grasps. We present a novel approach to this problem using reinforcement learning, and as demonstrated on a finger-less 14-link planar model that learns to brachiate across challenging handhold sequences. Key to our method is the use of a simplified model, a point mass with a virtual arm, for which we first learn a policy that can brachiate across handhold sequences with a prescribed order. This facilitates the learning of the policy for the full model, for which it provides guidance by providing an overall center-of-mass trajectory to imitate, as well as for the timing of the holds. Lastly, the simplified model can also readily be used for planning suitable sequences of handholds in a given environment. Our results demonstrate brachiation motions with a variety of durations for the flight and hold phases, as well as emergent extra back-and-forth swings when this proves useful. The system is evaluated with a variety of ablations. The method enables future work towards more general 3D brachiation, as well as using simplified model imitation in other settings. △ Less

Submitted 8 May, 2022; originally announced May 2022.

Comments: 11 pages, 6 figures. Accepted at SIGGRAPH 2022. For videos, supplementary material and code, visit the following URL https://brachiation-rl.github.io/brachiation

arXiv:2204.11769 [pdf, ps, other]

Multi-scale reconstruction of undersampled spectral-spatial OCT data for coronary imaging using deep learning

Authors: Xueshen Li, Shengting Cao, Hongshan Liu, Xinwen Yao, Brigitta C. Brott, Silvio H. Litovsky, Xiaoyu Song, Yuye Ling, Yu Gan

Abstract: Coronary artery disease (CAD) is a cardiovascular condition with high morbidity and mortality. Intravascular optical coherence tomography (IVOCT) has been considered as an optimal imagining system for the diagnosis and treatment of CAD. Constrained by Nyquist theorem, dense sampling in IVOCT attains high resolving power to delineate cellular structures/ features. There is a trade-off between high… ▽ More Coronary artery disease (CAD) is a cardiovascular condition with high morbidity and mortality. Intravascular optical coherence tomography (IVOCT) has been considered as an optimal imagining system for the diagnosis and treatment of CAD. Constrained by Nyquist theorem, dense sampling in IVOCT attains high resolving power to delineate cellular structures/ features. There is a trade-off between high spatial resolution and fast scanning rate for coronary imaging. In this paper, we propose a viable spectral-spatial acquisition method that down-scales the sampling process in both spectral and spatial domain while maintaining high quality in image reconstruction. The down-scaling schedule boosts data acquisition speed without any hardware modifications. Additionally, we propose a unified multi-scale reconstruction framework, namely Multiscale- Spectral-Spatial-Magnification Network (MSSMN), to resolve highly down-scaled (compressed) OCT images with flexible magnification factors. We incorporate the proposed methods into Spectral Domain OCT (SD-OCT) imaging of human coronary samples with clinical features such as stent and calcified lesions. Our experimental results demonstrate that spectral-spatial downscaled data can be better reconstructed than data that is downscaled solely in either spectral or spatial domain. Moreover, we observe better reconstruction performance using MSSMN than using existing reconstruction methods. Our acquisition method and multi-scale reconstruction framework, in combination, may allow faster SD-OCT inspection with high resolution during coronary intervention. △ Less

Submitted 25 April, 2022; originally announced April 2022.

Comments: 11 pages, 8 figures, reviewed by IEEE trans BME

arXiv:2204.07955 [pdf, other]

Vision-Language Pre-Training for Multimodal Aspect-Based Sentiment Analysis

Authors: Yan Ling, Jianfei Yu, Rui Xia

Abstract: As an important task in sentiment analysis, Multimodal Aspect-Based Sentiment Analysis (MABSA) has attracted increasing attention in recent years. However, previous approaches either (i) use separately pre-trained visual and textual models, which ignore the crossmodal alignment or (ii) use vision-language models pre-trained with general pre-training tasks, which are inadequate to identify finegrai… ▽ More As an important task in sentiment analysis, Multimodal Aspect-Based Sentiment Analysis (MABSA) has attracted increasing attention in recent years. However, previous approaches either (i) use separately pre-trained visual and textual models, which ignore the crossmodal alignment or (ii) use vision-language models pre-trained with general pre-training tasks, which are inadequate to identify finegrained aspects, opinions, and their alignments across modalities. To tackle these limitations, we propose a task-specific Vision-Language Pre-training framework for MABSA (VLPMABSA), which is a unified multimodal encoder-decoder architecture for all the pretraining and downstream tasks. We further design three types of task-specific pre-training tasks from the language, vision, and multimodal modalities, respectively. Experimental results show that our approach generally outperforms the state-of-the-art approaches on three MABSA subtasks. Further analysis demonstrates the effectiveness of each pretraining task. The source code is publicly released at https://github.com/NUSTM/VLP-MABSA. △ Less

Submitted 21 April, 2022; v1 submitted 17 April, 2022; originally announced April 2022.

Comments: Accepted by ACL 2022 (long paper)

arXiv:2203.01675 [pdf, other]

Cross-Modality Earth Mover's Distance for Visible Thermal Person Re-Identification

Authors: Yongguo Ling, Zhun Zhong, Donglin Cao, Zhiming Luo, Yaojin Lin, Shaozi Li, Nicu Sebe

Abstract: Visible thermal person re-identification (VT-ReID) suffers from the inter-modality discrepancy and intra-identity variations. Distribution alignment is a popular solution for VT-ReID, which, however, is usually restricted to the influence of the intra-identity variations. In this paper, we propose the Cross-Modality Earth Mover's Distance (CM-EMD) that can alleviate the impact of the intra-identit… ▽ More Visible thermal person re-identification (VT-ReID) suffers from the inter-modality discrepancy and intra-identity variations. Distribution alignment is a popular solution for VT-ReID, which, however, is usually restricted to the influence of the intra-identity variations. In this paper, we propose the Cross-Modality Earth Mover's Distance (CM-EMD) that can alleviate the impact of the intra-identity variations during modality alignment. CM-EMD selects an optimal transport strategy and assigns high weights to pairs that have a smaller intra-identity variation. In this manner, the model will focus on reducing the inter-modality discrepancy while paying less attention to intra-identity variations, leading to a more effective modality alignment. Moreover, we introduce two techniques to improve the advantage of CM-EMD. First, the Cross-Modality Discrimination Learning (CM-DL) is designed to overcome the discrimination degradation problem caused by modality alignment. By reducing the ratio between intra-identity and inter-identity variances, CM-DL leads the model to learn more discriminative representations. Second, we construct the Multi-Granularity Structure (MGS), enabling us to align modalities from both coarse- and fine-grained levels with the proposed CM-EMD. Extensive experiments show the benefits of the proposed CM-EMD and its auxiliary techniques (CM-DL and MGS). Our method achieves state-of-the-art performance on two VT-ReID benchmarks. △ Less

Submitted 3 March, 2022; originally announced March 2022.

Comments: 10 pages, 5 figures

ACM Class: I.4.10

arXiv:2202.11377 [pdf, other]

Multi-scale Sparse Representation-Based Shadow Inpainting for Retinal OCT Images

Authors: Yaoqi Tang, Yufan Li, Hongshan Liu, Jiaxuan Li, Peiyao Jin, Yu Gan, Yuye Ling, Yikai Su

Abstract: Inpainting shadowed regions cast by superficial blood vessels in retinal optical coherence tomography (OCT) images is critical for accurate and robust machine analysis and clinical diagnosis. Traditional sequence-based approaches such as propagating neighboring information to gradually fill in the missing regions are cost-effective. But they generate less satisfactory outcomes when dealing with la… ▽ More Inpainting shadowed regions cast by superficial blood vessels in retinal optical coherence tomography (OCT) images is critical for accurate and robust machine analysis and clinical diagnosis. Traditional sequence-based approaches such as propagating neighboring information to gradually fill in the missing regions are cost-effective. But they generate less satisfactory outcomes when dealing with larger missing regions and texture-rich structures. Emerging deep learning-based methods such as encoder-decoder networks have shown promising results in natural image inpainting tasks. However, they typically need a long computational time for network training in addition to the high demand on the size of datasets, which makes it difficult to be applied on often small medical datasets. To address these challenges, we propose a novel multi-scale shadow inpainting framework for OCT images by synergically applying sparse representation and deep learning: sparse representation is used to extract features from a small amount of training images for further inpainting and to regularize the image after the multi-scale image fusion, while convolutional neural network (CNN) is employed to enhance the image quality. During the image inpainting, we divide preprocessed input images into different branches based on the shadow width to harvest complementary information from different scales. Finally, a sparse representation-based regularizing module is designed to refine the generated contents after multi-scale feature aggregation. Experiments are conducted to compare our proposal versus both traditional and deep learning-based techniques on synthetic and real-world shadows. Results demonstrate that our proposed method achieves favorable image inpainting in terms of visual quality and quantitative metrics, especially when wide shadows are presented. △ Less

Submitted 23 February, 2022; originally announced February 2022.

arXiv:2112.14375 [pdf, ps, other]

Variational Learning for the Inverted Beta-Liouville Mixture Model and Its Application to Text Categorization

Authors: Yongfa Ling, Wenbo Guan, Qiang Ruan, Heping Song, Yuping Lai

Abstract: The finite invert Beta-Liouville mixture model (IBLMM) has recently gained some attention due to its positive data modeling capability. Under the conventional variational inference (VI) framework, the analytically tractable solution to the optimization of the variational posterior distribution cannot be obtained, since the variational object function involves evaluation of intractable moments. Wit… ▽ More The finite invert Beta-Liouville mixture model (IBLMM) has recently gained some attention due to its positive data modeling capability. Under the conventional variational inference (VI) framework, the analytically tractable solution to the optimization of the variational posterior distribution cannot be obtained, since the variational object function involves evaluation of intractable moments. With the recently proposed extended variational inference (EVI) framework, a new function is proposed to replace the original variational object function in order to avoid intractable moment computation, so that the analytically tractable solution of the IBLMM can be derived in an elegant way. The good performance of the proposed approach is demonstrated by experiments with both synthesized data and a real-world application namely text categorization. △ Less

Submitted 28 December, 2021; originally announced December 2021.

arXiv:2112.07120 [pdf, other]

Simple Coding Techniques for Many-Hop Relaying

Authors: Yan Hao Ling, Jonathan Scarlett

Abstract: In this paper, we study the problem of relaying a single bit of information across a series of binary symmetric channels, and the associated trade-off between the number of hops $m$, the transmission time $n$, and the error probability. We introduce a simple, efficient, and deterministic protocol that attains positive information velocity (i.e., a non-vanishing ratio $\frac{m}{n}$ and small error… ▽ More In this paper, we study the problem of relaying a single bit of information across a series of binary symmetric channels, and the associated trade-off between the number of hops $m$, the transmission time $n$, and the error probability. We introduce a simple, efficient, and deterministic protocol that attains positive information velocity (i.e., a non-vanishing ratio $\frac{m}{n}$ and small error probability) and is significantly simpler than existing protocols that do so. In addition, we characterize the optimal low-noise and high-noise scaling laws of the information velocity, and we adapt our 1-bit protocol to transmit $k$ bits over $m$ hops with $O(m+k)$ transmission time. △ Less

Submitted 7 December, 2022; v1 submitted 13 December, 2021; originally announced December 2021.

Comments: IEEE Transactions on Information Theory, Volume 68, Issue 11, pp. 7043-7053, Nov. 2022

arXiv:2111.03301 [pdf, other]

Frequency-Aware Physics-Inspired Degradation Model for Real-World Image Super-Resolution

Authors: Zhenxing Dong, Hong Cao, Wang Shen, Yu Gan, Yuye Ling, Guangtao Zhai, Yikai Su

Abstract: Current learning-based single image super-resolution (SISR) algorithms underperform on real data due to the deviation in the assumed degrada-tion process from that in the real-world scenario. Conventional degradation processes consider applying blur, noise, and downsampling (typicallybicubic downsampling) on high-resolution (HR) images to synthesize low-resolution (LR) counterparts. However, few w… ▽ More Current learning-based single image super-resolution (SISR) algorithms underperform on real data due to the deviation in the assumed degrada-tion process from that in the real-world scenario. Conventional degradation processes consider applying blur, noise, and downsampling (typicallybicubic downsampling) on high-resolution (HR) images to synthesize low-resolution (LR) counterparts. However, few works on degradation modelling have taken the physical aspects of the optical imaging system intoconsideration. In this paper, we analyze the imaging system optically andexploit the characteristics of the real-world LR-HR pairs in the spatial frequency domain. We formulate a real-world physics-inspired degradationmodel by considering bothopticsandsensordegradation; The physical degradation of an imaging system is modelled as a low-pass filter, whose cut-off frequency is dictated by the object distance, the focal length of thelens, and the pixel size of the image sensor. In particular, we propose to use a convolutional neural network (CNN) to learn the cutoff frequency of real-world degradation process. The learned network is then applied to synthesize LR images from unpaired HR images. The synthetic HR-LR image pairs are later used to train an SISR network. We evaluatethe effectiveness and generalization capability of the proposed degradation model on real-world images captured by different imaging systems. Experimental results showcase that the SISR network trained by using our synthetic data performs favorably against the network using the traditional degradation model. Moreover, our results are comparable to that obtained by the same network trained by using real-world LR-HR pairs, which are challenging to obtain in real scenes. △ Less

Submitted 11 February, 2022; v1 submitted 5 November, 2021; originally announced November 2021.

Comments: 22 pages,12 figures

arXiv:2110.00896 [pdf, other]

Disarranged Zone Learning (DZL): An unsupervised and dynamic automatic stenosis recognition methodology based on coronary angiography

Authors: Yanan Dai, Pengxiong Zhu, Bangde Xue, Yun Ling, Xibao Shi, Liang Geng, Qi Zhang, Jun Liu

Abstract: We proposed a novel unsupervised methodology named Disarranged Zone Learning (DZL) to automatically recognize stenosis in coronary angiography. The methodology firstly disarranges the frames in a video, secondly it generates an effective zone and lastly trains an encoder-decoder GRU model to learn the capability to recover disarranged frames. The breakthrough of our study is to discover and valida… ▽ More We proposed a novel unsupervised methodology named Disarranged Zone Learning (DZL) to automatically recognize stenosis in coronary angiography. The methodology firstly disarranges the frames in a video, secondly it generates an effective zone and lastly trains an encoder-decoder GRU model to learn the capability to recover disarranged frames. The breakthrough of our study is to discover and validate the Sequence Intensity (Recover Difficulty) is a measure of Coronary Artery Stenosis Status. Hence, the prediction accuracy of DZL is used as an approximator of coronary stenosis indicator. DZL is an unsupervised methodology and no label engineering effort is needed, the sub GRU model in DZL works as a self-supervised approach. So DZL could theoretically utilize infinitely huge amounts of coronary angiographies to learn and improve performance without laborious data labeling. There is no data preprocessing precondition to run DZL as it dynamically utilizes the whole video, hence it is easy to be implemented and generalized to overcome the data heterogeneity of coronary angiography. The overall average precision score achieves 0.93, AUC achieves 0.8 for this pure methodology. The highest segmented average precision score is 0.98 and the highest segmented AUC is 0.87 for coronary occlusion indicator. Finally, we developed a software demo to implement DZL methodology. △ Less

Submitted 2 October, 2021; originally announced October 2021.

arXiv:2109.12769 [pdf]

Heterogeneous Treatment Effect Estimation using machine learning for Healthcare application: tutorial and benchmark

Authors: Yaobin Ling, Pulakesh Upadhyaya, Luyao Chen, Xiaoqian Jiang, Yejin Kim

Abstract: Developing new drugs for target diseases is a time-consuming and expensive task, drug repurposing has become a popular topic in the drug development field. As much health claim data become available, many studies have been conducted on the data. The real-world data is noisy, sparse, and has many confounding factors. In addition, many studies have shown that drugs effects are heterogeneous among th… ▽ More Developing new drugs for target diseases is a time-consuming and expensive task, drug repurposing has become a popular topic in the drug development field. As much health claim data become available, many studies have been conducted on the data. The real-world data is noisy, sparse, and has many confounding factors. In addition, many studies have shown that drugs effects are heterogeneous among the population. Lots of advanced machine learning models about estimating heterogeneous treatment effects (HTE) have emerged in recent years, and have been applied to in econometrics and machine learning communities. These studies acknowledge medicine and drug development as the main application area, but there has been limited translational research from the HTE methodology to drug development. We aim to introduce the HTE methodology to the healthcare area and provide feasibility consideration when translating the methodology with benchmark experiments on healthcare administrative claim data. Also, we want to use benchmark experiments to show how to interpret and evaluate the model when it is applied to healthcare research. By introducing the recent HTE techniques to a broad readership in biomedical informatics communities, we expect to promote the wide adoption of causal inference using machine learning. We also expect to provide the feasibility of HTE for personalized drug effectiveness. △ Less

Submitted 21 February, 2023; v1 submitted 26 September, 2021; originally announced September 2021.

Comments: 52 pages, 8 figures

Journal ref: Journal of Biomedical Informatics (2022): 104256

arXiv:2109.11765 [pdf, other]

Dimension Reduction for Data with Heterogeneous Missingness

Authors: Yurong Ling, Zijing Liu, Jing-Hao Xue

Abstract: Dimension reduction plays a pivotal role in analysing high-dimensional data. However, observations with missing values present serious difficulties in directly applying standard dimension reduction techniques. As a large number of dimension reduction approaches are based on the Gram matrix, we first investigate the effects of missingness on dimension reduction by studying the statistical propertie… ▽ More Dimension reduction plays a pivotal role in analysing high-dimensional data. However, observations with missing values present serious difficulties in directly applying standard dimension reduction techniques. As a large number of dimension reduction approaches are based on the Gram matrix, we first investigate the effects of missingness on dimension reduction by studying the statistical properties of the Gram matrix with or without missingness, and then we present a bias-corrected Gram matrix with nice statistical properties under heterogeneous missingness. Extensive empirical results, on both simulated and publicly available real datasets, show that the proposed unbiased Gram matrix can significantly improve a broad spectrum of representative dimension reduction approaches. △ Less

Submitted 27 September, 2021; v1 submitted 24 September, 2021; originally announced September 2021.

Comments: Accepted for the 37th Conference on Uncertainty in Artificial Intelligence (UAI 2021)

arXiv:2104.06565 [pdf, other]

Optimal Rates of Teaching and Learning Under Uncertainty

Authors: Yan Hao Ling, Jonathan Scarlett

Abstract: In this paper, we consider a recently-proposed model of teaching and learning under uncertainty, in which a teacher receives independent observations of a single bit corrupted by binary symmetric noise, and sequentially transmits to a student through another binary symmetric channel based on the bits observed so far. After a given number $n$ of transmissions, the student outputs an estimate of the… ▽ More In this paper, we consider a recently-proposed model of teaching and learning under uncertainty, in which a teacher receives independent observations of a single bit corrupted by binary symmetric noise, and sequentially transmits to a student through another binary symmetric channel based on the bits observed so far. After a given number $n$ of transmissions, the student outputs an estimate of the unknown bit, and we are interested in the exponential decay rate of the error probability as $n$ increases. We propose a novel block-structured teaching strategy in which the teacher encodes the number of 1s received in each block, and show that the resulting error exponent is the binary relative entropy $D\big(\frac{1}{2}\|\max(p,q)\big)$, where $p$ and $q$ are the noise parameters. This matches a trivial converse result based on the data processing inequality, and settles two conjectures of [Jog and Loh, 2021] and [Huleihel, Polyanskiy, and Shayevitz, 2019]. In addition, we show that the computation time required by the teacher and student is linear in $n$. We also study a more general setting in which the binary symmetric channels are replaced by general binary-input discrete memoryless channels. We provide an achievability bound and a converse bound, and show that the two coincide in certain cases, including (i) when the two channels are identical, and (ii) when the student-teacher channel is a binary symmetric channel. More generally, we give sufficient conditions under which our learning rate is the best possible for block-structured protocols. △ Less

Submitted 7 December, 2022; v1 submitted 13 April, 2021; originally announced April 2021.

Comments: IEEE Transactions on Information Theory, Volume 67, Issue 11, pp. 7067-7080, Nov. 2021. This version slightly modifies/expands the 'Existing Results' section

arXiv:2103.14274 [pdf, other]

doi 10.1145/3386569.3392422

Character Controllers Using Motion VAEs

Authors: Hung Yu Ling, Fabio Zinno, George Cheng, Michiel van de Panne

Abstract: A fundamental problem in computer animation is that of realizing purposeful and realistic human movement given a sufficiently-rich set of motion capture clips. We learn data-driven generative models of human movement using autoregressive conditional variational autoencoders, or Motion VAEs. The latent variables of the learned autoencoder define the action space for the movement and thereby govern… ▽ More A fundamental problem in computer animation is that of realizing purposeful and realistic human movement given a sufficiently-rich set of motion capture clips. We learn data-driven generative models of human movement using autoregressive conditional variational autoencoders, or Motion VAEs. The latent variables of the learned autoencoder define the action space for the movement and thereby govern its evolution over time. Planning or control algorithms can then use this action space to generate desired motions. In particular, we use deep reinforcement learning to learn controllers that achieve goal-directed movements. We demonstrate the effectiveness of the approach on multiple tasks. We further evaluate system-design choices and describe the current limitations of Motion VAEs. △ Less

Submitted 26 March, 2021; originally announced March 2021.

Comments: Project page: https://www.cs.ubc.ca/~hyuling/projects/mvae/ ; Code: https://github.com/electronicarts/character-motion-vaes

arXiv:2103.13584 [pdf, other]

BERT4SO: Neural Sentence Ordering by Fine-tuning BERT

Authors: Yutao Zhu, Jian-Yun Nie, Kun Zhou, Shengchao Liu, Yabo Ling, Pan Du

Abstract: Sentence ordering aims to arrange the sentences of a given text in the correct order. Recent work frames it as a ranking problem and applies deep neural networks to it. In this work, we propose a new method, named BERT4SO, by fine-tuning BERT for sentence ordering. We concatenate all sentences and compute their representations by using multiple special tokens and carefully designed segment (interv… ▽ More Sentence ordering aims to arrange the sentences of a given text in the correct order. Recent work frames it as a ranking problem and applies deep neural networks to it. In this work, we propose a new method, named BERT4SO, by fine-tuning BERT for sentence ordering. We concatenate all sentences and compute their representations by using multiple special tokens and carefully designed segment (interval) embeddings. The tokens across multiple sentences can attend to each other which greatly enhances their interactions. We also propose a margin-based listwise ranking loss based on ListMLE to facilitate the optimization process. Experimental results on five benchmark datasets demonstrate the effectiveness of our proposed method. △ Less

Submitted 11 May, 2021; v1 submitted 24 March, 2021; originally announced March 2021.

arXiv:2103.01524 [pdf, other]

Feature-Align Network with Knowledge Distillation for Efficient Denoising

Authors: Lucas D. Young, Fitsum A. Reda, Rakesh Ranjan, Jon Morton, Jun Hu, Yazhu Ling, Xiaoyu Xiang, David Liu, Vikas Chandra

Abstract: We propose an efficient neural network for RAW image denoising. Although neural network-based denoising has been extensively studied for image restoration, little attention has been given to efficient denoising for compute limited and power sensitive devices, such as smartphones and smartwatches. In this paper, we present a novel architecture and a suite of training techniques for high quality den… ▽ More We propose an efficient neural network for RAW image denoising. Although neural network-based denoising has been extensively studied for image restoration, little attention has been given to efficient denoising for compute limited and power sensitive devices, such as smartphones and smartwatches. In this paper, we present a novel architecture and a suite of training techniques for high quality denoising in mobile devices. Our work is distinguished by three main contributions. (1) Feature-Align layer that modulates the activations of an encoder-decoder architecture with the input noisy images. The auto modulation layer enforces attention to spatially varying noise that tend to be "washed away" by successive application of convolutions and non-linearity. (2) A novel Feature Matching Loss that allows knowledge distillation from large denoising networks in the form of a perceptual content loss. (3) Empirical analysis of our efficient model trained to specialize on different noise subranges. This opens additional avenue for model size reduction by sacrificing memory for compute. Extensive experimental validation shows that our efficient model produces high quality denoising results that compete with state-of-the-art large networks, while using significantly fewer parameters and MACs. On the Darmstadt Noise Dataset benchmark, we achieve a PSNR of 48.28dB, while using 263 times fewer MACs, and 17.6 times fewer parameters than the state-of-the-art network, which achieves 49.12dB. △ Less

Submitted 17 March, 2021; v1 submitted 2 March, 2021; originally announced March 2021.

MSC Class: 94A08 (Primary) 68T07; 65D19 (Secondary) ACM Class: I.4.5; I.2.6

arXiv:2011.07526 [pdf, ps, other]

Domain Adaptation Gaze Estimation by Embedding with Prediction Consistency

Authors: Zidong Guo, Zejian Yuan, Chong Zhang, Wanchao Chi, Yonggen Ling, Shenghao Zhang

Abstract: Gaze is the essential manifestation of human attention. In recent years, a series of work has achieved high accuracy in gaze estimation. However, the inter-personal difference limits the reduction of the subject-independent gaze estimation error. This paper proposes an unsupervised method for domain adaptation gaze estimation to eliminate the impact of inter-personal diversity. In domain adaption,… ▽ More Gaze is the essential manifestation of human attention. In recent years, a series of work has achieved high accuracy in gaze estimation. However, the inter-personal difference limits the reduction of the subject-independent gaze estimation error. This paper proposes an unsupervised method for domain adaptation gaze estimation to eliminate the impact of inter-personal diversity. In domain adaption, we design an embedding representation with prediction consistency to ensure that the linear relationship between gaze directions in different domains remains consistent on gaze space and embedding space. Specifically, we employ source gaze to form a locally linear representation in the gaze space for each target domain prediction. Then the same linear combinations are applied in the embedding space to generate hypothesis embedding for the target domain sample, remaining prediction consistency. The deviation between the target and source domain is reduced by approximating the predicted and hypothesis embedding for the target domain sample. Guided by the proposed strategy, we design Domain Adaptation Gaze Estimation Network(DAGEN), which learns embedding with prediction consistency and achieves state-of-the-art results on both the MPIIGaze and the EYEDIAP datasets. △ Less

Submitted 15 November, 2020; originally announced November 2020.

Comments: 16 pages, 6 figures, ACCV 2020 (oral)

arXiv:2011.03158 [pdf, other]

Leveraging an Efficient and Semantic Location Embedding to Seek New Ports of Bike Share Services

Authors: Yuan Wang, Chenwei Wang, Yinan Ling, Keita Yokoyama, Hsin-Tai Wu, Yi Fang

Abstract: For short distance traveling in crowded urban areas, bike share services are becoming popular owing to the flexibility and convenience. To expand the service coverage, one of the key tasks is to seek new service ports, which requires to well understand the underlying features of the existing service ports. In this paper, we propose a new model, named for Efficient and Semantic Location Embedding (… ▽ More For short distance traveling in crowded urban areas, bike share services are becoming popular owing to the flexibility and convenience. To expand the service coverage, one of the key tasks is to seek new service ports, which requires to well understand the underlying features of the existing service ports. In this paper, we propose a new model, named for Efficient and Semantic Location Embedding (ESLE), which carries both geospatial and semantic information of the geo-locations. To generate ESLE, we first train a multi-label model with a deep Convolutional Neural Network (CNN) by feeding the static map-tile images and then extract location embedding vectors from the model. Compared to most recent relevant literature, ESLE is not only much cheaper in computation, but also easier to interpret via a systematic semantic analysis. Finally, we apply ESLE to seek new service ports for NTT DOCOMO's bike share services operated in Japan. The initial results demonstrate the effectiveness of ESLE, and provide a few insights that might be difficult to discover by using the conventional approaches. △ Less

Submitted 5 November, 2020; originally announced November 2020.

Comments: 10 pages, 5 figures, 8 tables, to appear in 2020 IEEE International Conference on Big Data

arXiv:2007.08071 [pdf, other]

Learning End-to-End Action Interaction by Paired-Embedding Data Augmentation

Authors: Ziyang Song, Zejian Yuan, Chong Zhang, Wanchao Chi, Yonggen Ling, Shenghao Zhang

Abstract: In recognition-based action interaction, robots' responses to human actions are often pre-designed according to recognized categories and thus stiff. In this paper, we specify a new Interactive Action Translation (IAT) task which aims to learn end-to-end action interaction from unlabeled interactive pairs, removing explicit action recognition. To enable learning on small-scale data, we propose a P… ▽ More In recognition-based action interaction, robots' responses to human actions are often pre-designed according to recognized categories and thus stiff. In this paper, we specify a new Interactive Action Translation (IAT) task which aims to learn end-to-end action interaction from unlabeled interactive pairs, removing explicit action recognition. To enable learning on small-scale data, we propose a Paired-Embedding (PE) method for effective and reliable data augmentation. Specifically, our method first utilizes paired relationships to cluster individual actions in an embedding space. Then two actions originally paired can be replaced with other actions in their respective neighborhood, assembling into new pairs. An Act2Act network based on conditional GAN follows to learn from augmented data. Besides, IAT-test and IAT-train scores are specifically proposed for evaluating methods on our task. Experimental results on two datasets show impressive effects and broad application prospects of our method. △ Less

Submitted 15 July, 2020; originally announced July 2020.

Comments: 16 pages, 7 figures

arXiv:2007.01065 [pdf, other]

Attention-Oriented Action Recognition for Real-Time Human-Robot Interaction

Authors: Ziyang Song, Ziyi Yin, Zejian Yuan, Chong Zhang, Wanchao Chi, Yonggen Ling, Shenghao Zhang

Abstract: Despite the notable progress made in action recognition tasks, not much work has been done in action recognition specifically for human-robot interaction. In this paper, we deeply explore the characteristics of the action recognition task in interaction scenarios and propose an attention-oriented multi-level network framework to meet the need for real-time interaction. Specifically, a Pre-Attentio… ▽ More Despite the notable progress made in action recognition tasks, not much work has been done in action recognition specifically for human-robot interaction. In this paper, we deeply explore the characteristics of the action recognition task in interaction scenarios and propose an attention-oriented multi-level network framework to meet the need for real-time interaction. Specifically, a Pre-Attention network is employed to roughly focus on the interactor in the scene at low resolution firstly and then perform fine-grained pose estimation at high resolution. The other compact CNN receives the extracted skeleton sequence as input for action recognition, utilizing attention-like mechanisms to capture local spatial-temporal patterns and global semantic information effectively. To evaluate our approach, we construct a new action dataset specially for the recognition task in interaction scenarios. Experimental results on our dataset and high efficiency (112 fps at 640 x 480 RGBD) on the mobile computing platform (Nvidia Jetson AGX Xavier) demonstrate excellent applicability of our method on action recognition in real-time human-robot interaction. △ Less

Submitted 2 July, 2020; originally announced July 2020.

Comments: 8 pages, 8 figures

arXiv:2006.07113 [pdf, other]

Large-scale Hybrid Approach for Predicting User Satisfaction with Conversational Agents

Authors: Dookun Park, Hao Yuan, Dongmin Kim, Yinglei Zhang, Matsoukas Spyros, Young-Bum Kim, Ruhi Sarikaya, Edward Guo, Yuan Ling, Kevin Quinn, Pham Hung, Benjamin Yao, Sungjin Lee

Abstract: Measuring user satisfaction level is a challenging task, and a critical component in developing large-scale conversational agent systems serving the needs of real users. An widely used approach to tackle this is to collect human annotation data and use them for evaluation or modeling. Human annotation based approaches are easier to control, but hard to scale. A novel alternative approach is to col… ▽ More Measuring user satisfaction level is a challenging task, and a critical component in developing large-scale conversational agent systems serving the needs of real users. An widely used approach to tackle this is to collect human annotation data and use them for evaluation or modeling. Human annotation based approaches are easier to control, but hard to scale. A novel alternative approach is to collect user's direct feedback via a feedback elicitation system embedded to the conversational agent system, and use the collected user feedback to train a machine-learned model for generalization. User feedback is the best proxy for user satisfaction, but is not available for some ineligible intents and certain situations. Thus, these two types of approaches are complementary to each other. In this work, we tackle the user satisfaction assessment problem with a hybrid approach that fuses explicit user feedback, user satisfaction predictions inferred by two machine-learned models, one trained on user feedback data and the other human annotation data. The hybrid approach is based on a waterfall policy, and the experimental results with Amazon Alexa's large-scale datasets show significant improvements in inferring user satisfaction. A detailed hybrid architecture, an in-depth analysis on user feedback data, and an algorithm that generates data sets to properly simulate the live traffic are presented in this paper. △ Less

Submitted 29 May, 2020; originally announced June 2020.

arXiv:2005.07855 [pdf]

Neural Stochastic Block Model & Scalable Community-Based Graph Learning

Authors: Zheng Chen, Xinli Yu, Yuan Ling, Xiaohua Hu

Abstract: This paper proposes a novel scalable community-based neural framework for graph learning. The framework learns the graph topology through the task of community detection and link prediction by optimizing with our proposed joint SBM loss function, which results from a non-trivial adaptation of the likelihood function of the classic Stochastic Block Model (SBM). Compared with SBM, our framework is f… ▽ More This paper proposes a novel scalable community-based neural framework for graph learning. The framework learns the graph topology through the task of community detection and link prediction by optimizing with our proposed joint SBM loss function, which results from a non-trivial adaptation of the likelihood function of the classic Stochastic Block Model (SBM). Compared with SBM, our framework is flexible, naturally allows soft labels and digestion of complex node attributes. The main goal is efficient valuation of complex graph data, therefore our design carefully aims at accommodating large data, and ensures there is a single forward pass for efficient evaluation. For large graph, it remains an open problem of how to efficiently leverage its underlying structure for various graph learning tasks. Previously it can be heavy work. With our community-based framework, this becomes less difficult and allows the task models to basically plug-in-and-play and perform joint training. We currently look into two particular applications, the graph alignment and the anomalous correlation detection, and discuss how to make use of our framework to tackle both problems. Extensive experiments are conducted to demonstrate the effectiveness of our approach. We also contributed tweaks of classic techniques which we find helpful for performance and scalability. For example, 1) the GAT+, an improved design of GAT (Graph Attention Network), the scaled-cosine similarity, and a unified implementation of the convolution/attention based and the random-walk based neural graph models. △ Less

Submitted 15 May, 2020; originally announced May 2020.

ACM Class: I.2; H.2.8

arXiv:2005.04456 [pdf, other]

Rethinking Item Importance in Session-based Recommendation

Authors: Zhiqiang Pan, Fei Cai, Yanxiang Ling, Maarten de Rijke

Abstract: Session-based recommendation aims to predict users' based on anonymous sessions. Previous work mainly focuses on the transition relationship between items during an ongoing session. They generally fail to pay enough attention to the importance of the items in terms of their relevance to user's main intent. In this paper, we propose a Session-based Recommendation approach with an Importance Extract… ▽ More Session-based recommendation aims to predict users' based on anonymous sessions. Previous work mainly focuses on the transition relationship between items during an ongoing session. They generally fail to pay enough attention to the importance of the items in terms of their relevance to user's main intent. In this paper, we propose a Session-based Recommendation approach with an Importance Extraction Module, i.e., SR-IEM, that considers both a user's long-term and recent behavior in an ongoing session. We employ a modified self-attention mechanism to estimate item importance in a session, which is then used to predict user's long-term preference. Item recommendations are produced by combining the user's long-term preference and current interest as conveyed by the last interacted item. Experiments conducted on two benchmark datasets validate that SR-IEM outperforms the start-of-the-art in terms of Recall and MRR and has a reduced computational complexity. △ Less

Submitted 9 May, 2020; originally announced May 2020.

arXiv:2005.04323 [pdf, other]

ALLSTEPS: Curriculum-driven Learning of Stepping Stone Skills

Authors: Zhaoming Xie, Hung Yu Ling, Nam Hee Kim, Michiel van de Panne

Abstract: Humans are highly adept at walking in environments with foot placement constraints, including stepping-stone scenarios where the footstep locations are fully constrained. Finding good solutions to stepping-stone locomotion is a longstanding and fundamental challenge for animation and robotics. We present fully learned solutions to this difficult problem using reinforcement learning. We demonstrate… ▽ More Humans are highly adept at walking in environments with foot placement constraints, including stepping-stone scenarios where the footstep locations are fully constrained. Finding good solutions to stepping-stone locomotion is a longstanding and fundamental challenge for animation and robotics. We present fully learned solutions to this difficult problem using reinforcement learning. We demonstrate the importance of a curriculum for efficient learning and evaluate four possible curriculum choices compared to a non-curriculum baseline. Results are presented for a simulated human character, a realistic bipedal robot simulation and a monster character, in each case producing robust, plausible motions for challenging stepping stone sequences and terrains. △ Less

Submitted 29 August, 2020; v1 submitted 8 May, 2020; originally announced May 2020.

Showing 1–50 of 68 results for author: Ling, Y