DOI: 10.1145/3474085.3475286
Research article

Joint-teaching: Learning to Refine Knowledge for Resource-constrained Unsupervised Cross-modal Retrieval

Published: 17 October 2021

Abstract

Cross-modal retrieval has received considerable attention because it enables users to search for desired information across diverse modalities. Existing retrieval methods achieve good performance mainly by relying on complex deep neural networks and high-quality supervision signals, which hinders their development and deployment in real-world, resource-constrained settings. In this paper, we propose an effective unsupervised learning framework named JOint-teachinG (JOG) to pursue a high-performance yet lightweight cross-modal retrieval model. The key idea is to utilize the knowledge of a pre-trained model (a.k.a. the "teacher") to endow the to-be-learned model (a.k.a. the "student") with strong feature learning ability and predictive power. Considering that a teacher model serving the same task as the student is not always available, we resort to a cross-task teacher and leverage its transferable knowledge to guide student learning. To eliminate the noise that inevitably arises in the distilled knowledge because of the task discrepancy, an online knowledge-refinement strategy is designed to progressively improve the quality of the cross-task knowledge in a joint-teaching manner, in which a peer student is engaged. In addition, the proposed JOG learns to represent the original high-dimensional data with compact binary codes to accelerate query processing, further facilitating resource-limited retrieval. Extensive experiments demonstrate that the proposed method yields promising results with various network structures on widely used benchmarks. This work is a pioneering effort toward resource-constrained cross-modal retrieval, with strong potential for on-device deployment, and we hope it paves the way for further study.
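
The abstract describes three interacting components: distillation from a pre-trained cross-task teacher, online refinement of the teacher's noisy knowledge with the help of a peer student, and compact binary hash codes for fast retrieval. The PyTorch sketch below is only a minimal illustration of how one such joint-teaching training step could be wired together; every name here (LightweightHashEncoder, joint_teaching_step, the blending weight alpha, the particular loss terms) is a hypothetical choice made for exposition and is not taken from the paper.

```python
# Illustrative sketch, not the paper's implementation: module names, the
# similarity-blending rule, and the loss weights are assumptions.
import torch
import torch.nn.functional as F
from torch import nn


class LightweightHashEncoder(nn.Module):
    """Small encoder mapping pre-extracted features to K relaxed binary bits."""

    def __init__(self, in_dim: int, code_len: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(),
                                 nn.Linear(512, code_len))

    def forward(self, x):
        # tanh keeps codes in (-1, 1) during training; sign() binarizes at inference.
        return torch.tanh(self.net(x))


def pairwise_cosine(a, b):
    """Cosine-similarity matrix between two batches of codes."""
    return F.normalize(a, dim=1) @ F.normalize(b, dim=1).t()


def joint_teaching_step(img_feat, txt_feat, teacher_sim, student_a, student_b,
                        alpha=0.5):
    """One hypothetical joint-teaching step: the cross-task teacher's noisy
    similarity matrix is blended with the peer student's current estimate
    before it supervises the other student's hash codes."""
    # Peer student B only provides a refinement signal (no gradients through B).
    with torch.no_grad():
        code_img_b = student_b["img"](img_feat)
        code_txt_b = student_b["txt"](txt_feat)
        peer_sim = pairwise_cosine(code_img_b, code_txt_b)
    refined_sim = alpha * teacher_sim + (1.0 - alpha) * peer_sim

    # Student A is trained so its cross-modal code similarity matches the refined target.
    code_img_a = student_a["img"](img_feat)
    code_txt_a = student_a["txt"](txt_feat)
    distill_loss = F.mse_loss(pairwise_cosine(code_img_a, code_txt_a), refined_sim)

    # Quantization penalty pushes the relaxed codes toward {-1, +1}.
    quant_loss = sum(((c.abs() - 1.0) ** 2).mean() for c in (code_img_a, code_txt_a))
    return distill_loss + 0.1 * quant_loss
```

In a full training loop the two students would periodically swap roles, each refining the teacher signal for the other, and alpha could be decayed as the students become more reliable than the noisy cross-task teacher; those scheduling details are omitted from this sketch.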

Supplementary Material

MP4 File (MM21-fp0653.mp4)
Presentation video.




    Information & Contributors

    Information

    Published In

    MM '21: Proceedings of the 29th ACM International Conference on Multimedia
    October 2021
    5796 pages
    ISBN:9781450386517
    DOI:10.1145/3474085
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 17 October 2021


    Author Tags

    1. cross-modal retrieval
    2. knowledge distillation
    3. noise refinery
    4. unsupervised learning

    Qualifiers

    • Research-article

    Conference

MM '21: ACM Multimedia Conference
October 20–24, 2021
Virtual Event, China

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (last 12 months): 37
    • Downloads (last 6 weeks): 5
    Reflects downloads up to 24 Dec 2024

    Citations

Cited By

    • (2024) Privacy-Enhanced Prototype-Based Federated Cross-Modal Hashing for Cross-Modal Retrieval. ACM Transactions on Multimedia Computing, Communications, and Applications 20(9), 1–19. https://doi.org/10.1145/3674507. Online publication date: 25-Jun-2024.
    • (2024) Unsupervised Dual Hashing Coding (UDC) on Semantic Tagging and Sample Content for Cross-Modal Retrieval. IEEE Transactions on Multimedia 26, 9109–9120. https://doi.org/10.1109/TMM.2024.3385986. Online publication date: 2024.
    • (2024) Structures Aware Fine-Grained Contrastive Adversarial Hashing for Cross-Media Retrieval. IEEE Transactions on Knowledge and Data Engineering 36(7), 3514–3528. https://doi.org/10.1109/TKDE.2024.3356258. Online publication date: Jul-2024.
    • (2024) Unsupervised NIR-VIS Face Recognition via Homogeneous-to-Heterogeneous Learning and Residual-Invariant Enhancement. IEEE Transactions on Information Forensics and Security 19, 2112–2126. https://doi.org/10.1109/TIFS.2023.3346176. Online publication date: 2024.
    • (2024) Multi-Layer Probabilistic Association Reasoning Network for Image-Text Retrieval. IEEE Transactions on Circuits and Systems for Video Technology 34(10), 9706–9717. https://doi.org/10.1109/TCSVT.2024.3394551. Online publication date: Oct-2024.
    • (2024) Multiple Information Embedded Hashing for Large-Scale Cross-Modal Retrieval. IEEE Transactions on Circuits and Systems for Video Technology 34(6), 5118–5131. https://doi.org/10.1109/TCSVT.2023.3340102. Online publication date: Jun-2024.
    • (2024) Joint Semantic Preserving Sparse Hashing for Cross-Modal Retrieval. IEEE Transactions on Circuits and Systems for Video Technology 34(4), 2989–3002. https://doi.org/10.1109/TCSVT.2023.3307608. Online publication date: Apr-2024.
    • (2024) A Web Knowledge-Driven Multimodal Retrieval Method in Computational Social Systems: Unsupervised and Robust Graph Convolutional Hashing. IEEE Transactions on Computational Social Systems 11(3), 3146–3156. https://doi.org/10.1109/TCSS.2022.3216621. Online publication date: Jun-2024.
    • (2024) Weighted cross-modal hashing with label enhancement. Knowledge-Based Systems 293(C). https://doi.org/10.1016/j.knosys.2024.111657. Online publication date: 7-Jun-2024.
    • (2024) De-biased knowledge distillation framework based on knowledge infusion and label de-biasing techniques. Journal of Electronic Science and Technology, article 100278. https://doi.org/10.1016/j.jnlest.2024.100278. Online publication date: Aug-2024.
