Author: Xiao, Jun

research-article

Multi-view human pose and shape estimation via mesh-aligned voxel interpolation

Information Fusion (INFU), Volume 114, Issue C. https://doi.org/10.1016/j.inffus.2024.102651

Abstract

Although multi-view human pose and shape regression methods can draw on information from other views to complement and correct their predictions, existing methods still do not take full advantage of the multi-view setup. Thus they are far from ...

Highlights

A new network for multi-view human pose and shape regression.
Human body features are well merged through multi-scale volumetric aggregation.
A mesh-aligned voxel selection module is proposed to make effective prediction.
A new ...

research-article

ENCODE: Breaking the Trade-Off Between Performance and Efficiency in Long-Term User Behavior Modeling

IEEE Transactions on Knowledge and Data Engineering (IEEECS_TKDE), Volume 37, Issue 1, Pages 265–277. https://doi.org/10.1109/TKDE.2024.3486445

Long-term user behavior sequences are a goldmine for businesses to explore users’ interests to improve Click-Through Rate (CTR). However, it is very challenging to accurately capture users’ long-term interests from their long-term behavior ...

research-article

Decomposed Prototype Learning for Few-Shot Scene Graph Generation

ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), Volume 21, Issue 1, Article No.: 30, Pages 1–24. https://doi.org/10.1145/3700877

Today's scene graph generation (SGG) models typically require abundant manual annotations to learn new predicate types. Therefore, it is difficult to apply them to real-world applications with massive uncommon predicate categories whose annotations are ...

research-article

IDPro: Flexible Interactive Video Object Segmentation by ID-Queried Concurrent Propagation

IEEE Transactions on Circuits and Systems for Video Technology (IEEETCSVT), Volume 34, Issue 12, Part 1, Pages 12171–12183. https://doi.org/10.1109/TCSVT.2024.3431714

Interactive Video Object Segmentation (iVOS) is inherently demanding, requiring real-time interaction between humans and computers. Enhancing user experience involves considerations such as user input habits, segmentation quality, running time, and memory ...

research-article

Improving Reference-Based Distinctive Image Captioning with Contrastive Rewards

ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), Volume 20, Issue 12, Article No.: 390, Pages 1–24. https://doi.org/10.1145/3694683

Distinctive Image Captioning (DIC)—generating distinctive captions that describe the unique details of a target image—has received considerable attention over the last few years. A recent DIC method proposes to generate distinctive captions by comparing ...

research-article

Knowledge-Guided Causal Intervention for Weakly-Supervised Object Localization

IEEE Transactions on Knowledge and Data Engineering (IEEECS_TKDE), Volume 36, Issue 11, Pages 6477–6489. https://doi.org/10.1109/TKDE.2024.3389668

Previous weakly-supervised object localization (WSOL) methods aim to expand the discriminative areas of activation maps to cover whole objects, yet neglect two inherent challenges when relying solely on image-level labels. First, the “entangled context ...

research-article

Open Access

Point Cloud Densification for 3D Gaussian Splatting from Sparse Input Views

MM '24: Proceedings of the 32nd ACM International Conference on Multimedia, Pages 896–904. https://doi.org/10.1145/3664647.3681454

The technique of 3D Gaussian splatting (3DGS) has demonstrated its effectiveness and efficiency in rendering photo-realistic images for novel view synthesis. However, 3DGS requires a high density of camera coverage, and its performance inevitably ...

research-article

Seeing Beyond Classes: Zero-Shot Grounded Situation Recognition via Language Explainer

MM '24: Proceedings of the 32nd ACM International Conference on Multimedia, Pages 1602–1611. https://doi.org/10.1145/3664647.3681036

Benefiting from strong generalization ability, pre-trained vision-language models (VLMs), e.g., CLIP, have been widely utilized in zero-shot scene understanding. Unlike simple recognition tasks, grounded situation recognition (GSR) requires the model not ...

research-article

Open Access

FC-4DFS: Frequency-controlled Flexible 4D Facial Expression Synthesizing

MM '24: Proceedings of the 32nd ACM International Conference on Multimedia, Pages 10882–10890. https://doi.org/10.1145/3664647.3681455

4D facial expression synthesizing is a critical problem in the fields of computer vision and graphics. Current methods lack flexibility and smoothness when simulating the inter-frame motion of expression sequences. In this paper, we propose a frequency-...

research-article

Neural Interaction Energy for Multi-Agent Trajectory Prediction

MM '24: Proceedings of the 32nd ACM International Conference on Multimedia, Pages 1952–1960. https://doi.org/10.1145/3664647.3680792

Maintaining temporal stability is crucial in multi-agent trajectory prediction. Insufficient regularization to uphold this temporal stability often results in fluctuations in kinematic states, leading to inconsistent predictions and the amplification of ...

research-article

NICEST: Noisy Label Correction and Training for Robust Scene Graph Generation

IEEE Transactions on Pattern Analysis and Machine Intelligence (ITPM), Volume 46, Issue 10, Pages 6873–6888. https://doi.org/10.1109/TPAMI.2024.3387349

Nearly all existing scene graph generation (SGG) models have overlooked the ground-truth annotation qualities of mainstream SGG datasets, i.e., they assume: 1) all the manually annotated positive samples are equally correct; 2) all the un-annotated ...

Article

Learning Equilibrium Transformation for Gamut Expansion and Color Restoration

Computer Vision – ECCV 2024, Pages 415–432. https://doi.org/10.1007/978-3-031-73209-6_24

Abstract

Existing imaging systems support wide-gamut images like ProPhoto RGB, but most images are typically encoded in a narrower gamut space (e.g., sRGB). These images can therefore be enhanced by learning to recover the original color values beyond ...

Article

DECap: Towards Generalized Explicit Caption Editing via Diffusion Mechanism

Computer Vision – ECCV 2024, Pages 365–381. https://doi.org/10.1007/978-3-031-72775-7_21

Abstract

Explicit Caption Editing (ECE), which refines reference image captions through a sequence of explicit edit operations (e.g., KEEP, DELETE), has attracted significant attention due to its explainable and human-like nature. After training with carefully ...

research-article

UN-η: An offline adaptive normalization method for deploying transformers

Knowledge-Based Systems (KNBS), Volume 300, Issue C. https://doi.org/10.1016/j.knosys.2024.112141

Abstract

The Transformer has become the de facto architecture for many natural language and vision tasks thanks to its remarkable performance. As an essential part of transformers, normalization functions struggle to improve the robustness and performance of ...

research-article

Deep multi-scale feature mixture model for image super-resolution with multiple-focal-length degradation

Image Communication (IMAG), Volume 127, Issue C. https://doi.org/10.1016/j.image.2024.117139

Abstract

Single image super-resolution is a classic problem in computer vision. In recent years, deep learning-based models have achieved unprecedented success with this problem. However, most existing deep super-resolution models unavoidably produce ...

Highlights

This paper focuses on a real-world degradation, i.e., the multiple-focal-length degradation, for single image super-resolution.
An efficient multi-scale feature mixture model is proposed for restoring images under the multiple-focal-...

research-article

SPMLD: A skin pathological image dataset for non-melanoma with detailed lesion area annotation

Computers in Biology and Medicine (CBIM), Volume 179, Issue C. https://doi.org/10.1016/j.compbiomed.2024.108793

Abstract

Skin tumors are the most common tumors in humans, and the clinical characteristics of three common non-melanoma tumors (IDN, SK, BCC) are similar, resulting in a high misdiagnosis rate. The accurate differential diagnosis of these tumors needs to ...

Highlights

We develop SPMLD, a publicly available skin pathological image dataset collected and annotated by dermatologists.
We propose a lesion area-based enhanced classification network, which can automatically ...

research-article

From Easy to Hard: Learning Curricular Shape-Aware Features for Robust Panoptic Scene Graph Generation

International Journal of Computer Vision (IJCV), Volume 133, Issue 1, Pages 489–508. https://doi.org/10.1007/s11263-024-02190-9

Abstract

Panoptic Scene Graph Generation (PSG) aims to generate a comprehensive graph-structure representation based on panoptic segmentation masks. Despite remarkable progress in PSG, almost all existing methods neglect the importance of shape-aware ...

Article

One Stage Near-Ground Pothole Object Detection

Advanced Intelligent Computing Technology and Applications, Pages 419–429. https://doi.org/10.1007/978-981-97-5612-4_36

Abstract

This work proposes a single-stage object detection method for real-time, accurate identification and detection of small near-ground pothole objects photographed by UAVs. Through a special copy-paste technique applied to the pothole dataset obtained ...

poster

T2DyVec: Leveraging Sparse Images and Controllable Text for Dynamic SVG

SIGGRAPH '24: ACM SIGGRAPH 2024 Posters, Article No.: 48, Pages 1–2. https://doi.org/10.1145/3641234.3671020

In this work, we introduce a controlled dynamic vector graphic generation method. While existing work mostly focuses on text-based generation of single-frame images, dynamic images, or single-frame vectors, there is a lack of research on generating ...

research-article

Learning Combinatorial Prompts for Universal Controllable Image Captioning

International Journal of Computer Vision (IJCV), Volume 133, Issue 1, Pages 129–150. https://doi.org/10.1007/s11263-024-02179-4

Abstract

Controllable Image Captioning (CIC)—generating natural language descriptions about images under the guidance of given control signals—is one of the most promising directions toward next-generation captioning systems. Till now, various kinds of ...
