Author: Mao, Zhendong : Search

research-article

Document-level Relation Extraction with Progressive Self-distillation

ACM Transactions on Information Systems (TOIS), Volume 42, Issue 6, Article No.: 143, Pages 1–34, https://doi.org/10.1145/3656168

Document-level relation extraction (RE) aims to simultaneously predict relations (including no-relation cases denoted as NA) between all entity pairs in a document. It is typically formulated as a relation classification task with entities pre-detected in ...

research-article

Exploring Visual Relationships via Transformer-based Graphs for Enhanced Image Captioning

ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), Volume 20, Issue 5, Article No.: 133, Pages 1–23, https://doi.org/10.1145/3638558

Image captioning (IC), bringing vision to language, has drawn extensive attention. A crucial aspect of IC is the accurate depiction of visual relations among image objects. Visual relations encompass two primary facets: content relations and structural ...

research-article

Sentiment-Oriented Transformer-Based Variational Autoencoder Network for Live Video Commenting

ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), Volume 20, Issue 4, Article No.: 104, Pages 1–24, https://doi.org/10.1145/3633334

Automatic live video commenting is getting increasing attention due to its significance in narration generation, topic explanation, etc. However, the diverse sentiment consideration of the generated comments is missing from current methods. Sentimental ...

research-article

Unlocking the Power of Cross-Dimensional Semantic Dependency for Image-Text Matching

MM '23: Proceedings of the 31st ACM International Conference on Multimedia, October 2023, Pages 4828–4837, https://doi.org/10.1145/3581783.3611703

Image-text matching, as a fundamental cross-modal task, bridges vision and language. The key challenge lies in accurately learning the semantic similarity of these two heterogeneous modalities. To determine the semantic similarity between visual and ...

Article

Contour-Augmented Concept Prediction Network for Image Captioning

Artificial Neural Networks and Machine Learning – ICANN 2023, Sep 2023, Pages 180–191, https://doi.org/10.1007/978-3-031-44210-0_15

Semantic information in images is essential for image captioning. However, previous works leverage the pre-trained object detector to mine semantics in an image, making the model unable to accurately capture visual semantics, and further making ...

research-article

ABINet++: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Spotting

IEEE Transactions on Pattern Analysis and Machine Intelligence (ITPM), Volume 45, Issue 6, June 2023, Pages 7123–7141, https://doi.org/10.1109/TPAMI.2022.3223908

Scene text spotting is of great importance to the computer vision community due to its wide variety of applications. Recent methods attempt to introduce linguistic knowledge for challenging recognition rather than pure visual classification. However, how ...

research-article

GH-DDM: the generalized hybrid denoising diffusion model for medical image generation

Multimedia Systems (MUME), Volume 29, Issue 3, Jun 2023, Pages 1335–1345, https://doi.org/10.1007/s00530-023-01059-0

Deep-learning-based medical imaging plays a pivotal role in modern healthcare while suffering from the data scarcity bottleneck, since obtaining sufficient high-quality data in the medical imaging area is difficult and expensive. To alleviate this ...

research-article

Unified Adaptive Relevance Distinguishable Attention Network for Image-Text Matching

IEEE Transactions on Multimedia (TOM), Volume 25, 2023, Pages 1320–1332, https://doi.org/10.1109/TMM.2022.3141603

Image-text matching, as a fundamental cross-modal task, bridges the gap between vision and language. The core is to accurately learn semantic alignment to find relevant shared semantics in image and text. Existing methods typically attend to all fragments ...

research-article

Intra-Class Adaptive Augmentation With Neighbor Correction for Deep Metric Learning

IEEE Transactions on Multimedia (TOM), Volume 25, 2023, Pages 7758–7771, https://doi.org/10.1109/TMM.2022.3227414

Deep metric learning aims to learn an embedding space, where semantically similar samples are close together and dissimilar ones are repelled against. To explore more hard and informative training signals for augmentation and generalization, recent ...

research-article

Background Layout Generation and Object Knowledge Transfer for Text-to-Image Generation

MM '22: Proceedings of the 30th ACM International Conference on Multimedia, October 2022, Pages 4327–4335, https://doi.org/10.1145/3503161.3548154

Text-to-Image generation (T2I) aims to generate realistic and semantically consistent images according to the natural language descriptions. Built upon the recent advances in generative adversarial networks (GANs), existing T2I models have made great ...

research-article

Fine-tuning with Multi-modal Entity Prompts for News Image Captioning

MM '22: Proceedings of the 30th ACM International Conference on Multimedia, October 2022, Pages 4365–4373, https://doi.org/10.1145/3503161.3547883

News Image Captioning aims to generate descriptions for images embedded in news articles, including plentiful real-world concepts, especially about named entities. However, existing methods are limited in the entity-level template. Not only is it labor-...

research-article

DSE-GAN: Dynamic Semantic Evolution Generative Adversarial Network for Text-to-Image Generation

MM '22: Proceedings of the 30th ACM International Conference on Multimedia, October 2022, Pages 4345–4354, https://doi.org/10.1145/3503161.3547881

Text-to-image generation aims at generating realistic images which are semantically consistent with the given text. Previous works mainly adopt the multi-stage architecture by stacking generator-discriminator pairs to engage multiple adversarial ...

research-article

Semantically Similarity-Wise Dual-Branch Network for Scene Graph Generation

IEEE Transactions on Circuits and Systems for Video Technology (IEEETCSVT), Volume 32, Issue 7, July 2022, Pages 4573–4583, https://doi.org/10.1109/TCSVT.2021.3130197

Scene graph generation aims to detect visual entities and relationships between them from an image. The object-level visual information is of vital importance for predicting accurate relationships. However, most existing methods essentially encode visual ...

research-article

Self-Supervised Synthesis Ranking for Deep Metric Learning

IEEE Transactions on Circuits and Systems for Video Technology (IEEETCSVT), Volume 32, Issue 7, July 2022, Pages 4736–4750, https://doi.org/10.1109/TCSVT.2021.3124908

The core purpose of deep metric learning is to construct an embedding space, where objects belonging to the same class are gathered together and the ones from different classes are pushed apart. Most existing approaches typically insist to inter-class ...

research-article

Weakly Supervised Pediatric Bone Age Assessment Using Ultrasonic Images via Automatic Anatomical RoI Detection

ICMR '22: Proceedings of the 2022 International Conference on Multimedia Retrieval, June 2022, Pages 647–653, https://doi.org/10.1145/3512527.3531436

Bone age assessment (BAA) is vital in pediatric clinical diagnosis. Existing deep learning methods predict bone age based on Regions of Interest (RoIs) detection or segmentation of hand radiograph, which requires expensive annotations. Limitations of ...

research-article

Multi-task hourglass network for online automatic diagnosis of developmental dysplasia of the hip

World Wide Web (WWWJ), Volume 26, Issue 2, Mar 2023, Pages 539–559, https://doi.org/10.1007/s11280-022-01051-0

Developmental dysplasia of the hip (DDH) is one of the most common diseases in children. Due to the experience-requiring medical image analysis work, online automatic diagnosis of DDH has intrigued the researchers. Traditional implementation of ...

research-article

Joint Local Correlation and Global Contextual Information for Unsupervised 3D Model Retrieval and Classification

IEEE Transactions on Circuits and Systems for Video Technology (IEEETCSVT), Volume 32, Issue 5, May 2022, Pages 3265–3278, https://doi.org/10.1109/TCSVT.2021.3099496

Unsupervised 3D model analysis has attracted tremendous attentions with the increasing growth of 3D model data and the extensive human annotations. Many effective methods have been designed to address the 3D model analysis with labeled information, while ...

research-article

Focus Your Attention: A Focal Attention for Multimodal Learning

IEEE Transactions on Multimedia (TOM), Volume 24, 2022, Pages 103–115, https://doi.org/10.1109/TMM.2020.3046855

The key point in multimodal learning is to learn semantic alignment that finds the correspondence between sub-elements of instances from different modality data. Attention mechanism has shown its power in semantic alignment learning as it enables to ...

research-article

Task-Adaptive Attention for Image Captioning

IEEE Transactions on Circuits and Systems for Video Technology (IEEETCSVT), Volume 32, Issue 1, Jan. 2022, Pages 43–51, https://doi.org/10.1109/TCSVT.2021.3067449

Attention mechanisms are now widely used in image captioning models. However, most attention models only focus on visual features. When generating syntax related words, little visual information is needed. In this case, these attention models could ...

research-article

Mask and Predict: Multi-step Reasoning for Scene Graph Generation

MM '21: Proceedings of the 29th ACM International Conference on Multimedia, October 2021, Pages 4128–4136, https://doi.org/10.1145/3474085.3475545

Scene Graph Generation (SGG) aims to parse the image as a set of semantics, containing objects and their relations. Currently, the SGG methods only stay at presenting the intuitive detection in the image, such as the triplet "logo on board". Intuitively,...
