Transfer Learning for the Visual Arts: The Multi-modal Retrieval of Iconclass Codes

Published: 24 June 2023

Abstract

Iconclass is an iconographic thesaurus, widely used in the digital heritage domain to describe the subjects depicted in artworks. Each subject is assigned a unique descriptive code with a corresponding textual definition. The assignment of Iconclass codes is a challenging task for computational systems, because the number of available labels is large in comparison to the limited amount of training data. Transfer learning has become a common strategy to overcome such data shortages. In deep learning, transfer learning consists of fine-tuning the weights of a deep neural network for a downstream task. In this work, we present a deep retrieval framework that can be fully fine-tuned for the task under consideration. Our work builds on a recent approach to this task, which already yielded state-of-the-art performance but could not yet be fully fine-tuned. This approach exploits the multi-linguality and multi-modality that are inherent to digital heritage data. Our framework jointly processes multiple input modalities, namely textual and visual features. We extract the textual features from the artwork titles in multiple languages, whereas the visual features are derived from photographic reproductions of the artworks. The definitions of the Iconclass codes, which contain useful textual information, are used as target labels instead of the codes themselves. As our main contribution, we demonstrate that our approach outperforms the state of the art by a large margin. In addition, our approach is superior to the M3P feature extractor and outperforms the multi-lingual CLIP in most experiments, owing to the better quality of its visual features. Our out-of-domain and zero-shot experiments show poor results and demonstrate that Iconclass retrieval remains a challenging task. We make our source code and models publicly available to support heritage institutions in the further enrichment of their digital collections.
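To make the retrieval setup concrete, the sketch below shows a generic multi-modal retrieval pipeline in PyTorch: pre-extracted visual and textual features are projected into a shared embedding space, and the resulting query embedding is scored by cosine similarity against embeddings of the Iconclass code definitions. All module names, dimensions, and the fusion strategy here are illustrative assumptions for clarity, not the authors' implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MultiModalQueryEncoder(nn.Module):
        """Hypothetical encoder: fuses visual and textual features into one query."""
        def __init__(self, visual_dim=2048, text_dim=768, joint_dim=512):
            super().__init__()
            self.visual_proj = nn.Linear(visual_dim, joint_dim)  # e.g., CNN features
            self.text_proj = nn.Linear(text_dim, joint_dim)      # e.g., multi-lingual title features
            self.fusion = nn.Linear(2 * joint_dim, joint_dim)

        def forward(self, visual_feats, title_feats):
            v = F.relu(self.visual_proj(visual_feats))
            t = F.relu(self.text_proj(title_feats))
            joint = self.fusion(torch.cat([v, t], dim=-1))
            return F.normalize(joint, dim=-1)  # unit norm, so dot product = cosine

    def retrieve(query_emb, definition_embs, k=5):
        # Rank all Iconclass definition embeddings by cosine similarity to the query.
        scores = query_emb @ definition_embs.T   # (batch, n_labels)
        return scores.topk(k, dim=-1).indices   # indices of the top-k candidate codes

    # Toy usage with random features: one artwork, 100 candidate definitions.
    encoder = MultiModalQueryEncoder()
    query = encoder(torch.randn(1, 2048), torch.randn(1, 768))
    definitions = F.normalize(torch.randn(100, 512), dim=-1)
    print(retrieve(query, definitions))

In a real system, the definition embeddings would come from a fine-tuned text encoder and the whole pipeline would be trained end-to-end with a ranking loss; this toy example only fixes the tensor shapes and the scoring step.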


Cited By

  • (2024) Scene Classification on Fine Arts with Style Transfer. In Proceedings of the 6th Workshop on the analySis, Understanding and proMotion of heritAge Contents. https://doi.org/10.1145/3689094.3689468, 18–27. Online publication date: 28 Oct. 2024.
  • (2024) Applications of deep learning to infrared thermography for the automatic classification of thermal pathologies: Review and case study. In Diagnosis of Heritage Buildings by Non-Destructive Techniques. https://doi.org/10.1016/B978-0-443-16001-1.00005-X, 103–132. Online publication date: 2024.



Published In

Journal on Computing and Cultural Heritage, Volume 16, Issue 2
June 2023
312 pages
ISSN: 1556-4673
EISSN: 1556-4711
DOI: 10.1145/3585396

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 24 June 2023
Online AM: 17 March 2023
Accepted: 05 August 2022
Revised: 06 July 2022
Received: 16 November 2021
Published in JOCCH Volume 16, Issue 2


Author Tags

  1. Iconclass
  2. cultural heritage
  3. transfer learning
  4. deep learning
  5. natural language processing
  6. multi-modal retrieval
  7. multi-lingual retrieval

Qualifiers

  • Research-article



