Cross-modal Recipe Retrieval with Hierarchical Transformers and Pretrained Food Image Encoder

  • Conference paper
  • In: Advanced Intelligent Computing Technology and Applications (ICIC 2024)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14879)

Abstract

Social media platforms have seen a surge in the sharing of food experiences and recipes. To manage the vast amount of data generated by this trend, cross-modal retrieval systems have been developed to retrieve recipes and their corresponding images. Existing work, with the exception of cross-modal pre-trained models, has focused mainly on the textual side of recipes and disregarded the characteristics of the food images themselves. To address this limitation, we comprehensively analyze the characteristics of food images and propose a new cross-modal retrieval framework that combines textual and visual information. Our analysis shows that food images have distinctive features such as color, texture, and shape that can be exploited to enhance cross-modal retrieval, whereas state-of-the-art models that rely solely on textual information are limited in their capacity to retrieve images. In our framework, an image encoder based on a network pre-trained on food images extracts visual features, while a hierarchical Transformer text encoder extracts textual features from recipes. Experiments demonstrate that the enhanced dual encoders significantly outperform existing baseline models on the dataset. By incorporating both textual and visual information, the proposed framework improves the accuracy of cross-modal retrieval systems and enhances the user experience when searching for food-related information.
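
A note on the architecture: the framework described above is, at its core, a dual encoder that maps recipes and food images into a shared embedding space where matching pairs can be compared directly. Below is a minimal, illustrative PyTorch sketch of such a dual encoder; it is not the authors' implementation. The ResNet-50 backbone merely stands in for the food-pretrained network, and the layer sizes, the word-then-sentence text hierarchy, and the symmetric InfoNCE loss are assumptions made for the example.

# Illustrative dual-encoder sketch for recipe-image retrieval (not the paper's code).
# Assumptions: ResNet-50 stand-in for the food-pretrained image encoder,
# a two-level Transformer over recipe text, and an in-batch contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class ImageEncoder(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        # Stand-in backbone; the paper uses a network pre-trained on food images.
        backbone = torchvision.models.resnet50(weights=None)
        backbone.fc = nn.Identity()          # keep the 2048-d pooled features
        self.backbone = backbone
        self.proj = nn.Linear(2048, dim)     # project into the shared space

    def forward(self, images):               # images: (B, 3, H, W)
        return F.normalize(self.proj(self.backbone(images)), dim=-1)

class HierarchicalTextEncoder(nn.Module):
    """Word-level Transformer per sentence, then a sentence-level Transformer."""
    def __init__(self, vocab_size=30000, dim=512, heads=8, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim, padding_idx=0)
        enc_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.word_enc = nn.TransformerEncoder(enc_layer, layers)
        self.sent_enc = nn.TransformerEncoder(enc_layer, layers)
        self.proj = nn.Linear(dim, dim)

    def forward(self, tokens):                # tokens: (B, n_sentences, n_words)
        B, S, W = tokens.shape
        x = self.embed(tokens.view(B * S, W))        # encode each sentence's words
        sent = self.word_enc(x).mean(dim=1)          # (B*S, dim) sentence vectors
        sent = sent.view(B, S, -1)
        recipe = self.sent_enc(sent).mean(dim=1)     # (B, dim) recipe vector
        return F.normalize(self.proj(recipe), dim=-1)

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over matching image-recipe pairs within the batch."""
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(img_emb.size(0))
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

if __name__ == "__main__":
    images = torch.randn(4, 3, 224, 224)
    tokens = torch.randint(1, 30000, (4, 6, 20))     # 6 sentences x 20 words per recipe
    loss = contrastive_loss(ImageEncoder()(images), HierarchicalTextEncoder()(tokens))
    print(loss.item())

In practice the image backbone would load food-domain weights and the training objective may differ (for example, a triplet loss), but the overall flow of encoding, normalizing, and comparing in a shared space is the same.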

Acknowledgments

The authors thank the laboratory for the equipment and configuration that enabled timely analysis of a large amount of data. This work was funded by the National Natural Science Foundation (grant number 62377036).

Author information

Corresponding author

Correspondence to Xiankun Zhang.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Qin, H., Zhang, X., Song, C. (2024). Cross-modal Recipe Retrieval with Hierarchical Transformers and Pretrained Food Image Encoder. In: Huang, DS., Zhang, X., Zhang, C. (eds) Advanced Intelligent Computing Technology and Applications. ICIC 2024. Lecture Notes in Computer Science, vol. 14879. Springer, Singapore. https://doi.org/10.1007/978-981-97-5675-9_36

  • DOI: https://doi.org/10.1007/978-981-97-5675-9_36

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-5674-2

  • Online ISBN: 978-981-97-5675-9

  • eBook Packages: Computer Science, Computer Science (R0)
