Cross-modal Recipe Retrieval with Hierarchical Transformers and Pretrained Food Image Encoder

  • Conference paper
  • In: Advanced Intelligent Computing Technology and Applications (ICIC 2024)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14879)

Abstract

Social media platforms have seen a surge in the sharing of food experiences and recipes. To manage the vast amount of data generated by this trend, cross-modal retrieval systems have been developed to retrieve recipes and their corresponding images. Existing work, with the exception of cross-modal pre-trained models, has focused mainly on the textual side of recipes and disregarded the characteristics of the food images themselves. To address this limitation, we comprehensively analyze the characteristics of food images and propose a new cross-modal retrieval framework that combines textual and visual information. Our analysis shows that food images have distinctive features such as color, texture, and shape that can be exploited to enhance cross-modal retrieval, whereas state-of-the-art models that rely solely on textual information are limited in their capacity to retrieve images. In our framework, an image encoder based on a network pre-trained on food images extracts visual features, while a hierarchical Transformer text encoder extracts textual features from recipes. Experiments demonstrate that the enhanced dual encoders significantly outperform existing baseline models on the dataset. By incorporating both textual and visual information, the proposed framework improves the accuracy of cross-modal retrieval systems and enhances the user experience when searching for food-related information.
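
A note on the architecture: the framework described above is, at its core, a dual encoder that maps recipes and food images into a shared embedding space where matching pairs can be compared directly. Below is a minimal, illustrative PyTorch sketch of such a dual encoder; it is not the authors' implementation. The ResNet-50 backbone merely stands in for the food-pretrained network, and the layer sizes, the word-then-sentence text hierarchy, and the symmetric InfoNCE loss are assumptions made for the example.

# Illustrative dual-encoder sketch for recipe-image retrieval (not the paper's code).
# Assumptions: ResNet-50 stand-in for the food-pretrained image encoder,
# a two-level Transformer over recipe text, and an in-batch contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class ImageEncoder(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        # Stand-in backbone; the paper uses a network pre-trained on food images.
        backbone = torchvision.models.resnet50(weights=None)
        backbone.fc = nn.Identity()          # keep the 2048-d pooled features
        self.backbone = backbone
        self.proj = nn.Linear(2048, dim)     # project into the shared space

    def forward(self, images):               # images: (B, 3, H, W)
        return F.normalize(self.proj(self.backbone(images)), dim=-1)

class HierarchicalTextEncoder(nn.Module):
    """Word-level Transformer per sentence, then a sentence-level Transformer."""
    def __init__(self, vocab_size=30000, dim=512, heads=8, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim, padding_idx=0)
        enc_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.word_enc = nn.TransformerEncoder(enc_layer, layers)
        self.sent_enc = nn.TransformerEncoder(enc_layer, layers)
        self.proj = nn.Linear(dim, dim)

    def forward(self, tokens):                # tokens: (B, n_sentences, n_words)
        B, S, W = tokens.shape
        x = self.embed(tokens.view(B * S, W))        # encode each sentence's words
        sent = self.word_enc(x).mean(dim=1)          # (B*S, dim) sentence vectors
        sent = sent.view(B, S, -1)
        recipe = self.sent_enc(sent).mean(dim=1)     # (B, dim) recipe vector
        return F.normalize(self.proj(recipe), dim=-1)

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over matching image-recipe pairs within the batch."""
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(img_emb.size(0))
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

if __name__ == "__main__":
    images = torch.randn(4, 3, 224, 224)
    tokens = torch.randint(1, 30000, (4, 6, 20))     # 6 sentences x 20 words per recipe
    loss = contrastive_loss(ImageEncoder()(images), HierarchicalTextEncoder()(tokens))
    print(loss.item())

In practice the image backbone would load food-domain weights and the training objective may differ (for example, a triplet loss), but the overall flow of encoding, normalizing, and comparing in a shared space is the same.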

Acknowledgments

The authors thank the laboratory for the equipment and configuration that enabled timely analysis of a large amount of data. This work was funded by the National Natural Science Foundation (grant number 62377036).

Author information

Corresponding author

Correspondence to Xiankun Zhang.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Qin, H., Zhang, X., Song, C. (2024). Cross-modal Recipe Retrieval with Hierarchical Transformers and Pretrained Food Image Encoder. In: Huang, DS., Zhang, X., Zhang, C. (eds) Advanced Intelligent Computing Technology and Applications. ICIC 2024. Lecture Notes in Computer Science, vol. 14879. Springer, Singapore. https://doi.org/10.1007/978-981-97-5675-9_36

  • DOI: https://doi.org/10.1007/978-981-97-5675-9_36

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-5674-2

  • Online ISBN: 978-981-97-5675-9

  • eBook Packages: Computer Science, Computer Science (R0)
