
Fine-Grained Visual Entailment

  • Conference paper
  • Published in: Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13696)


Abstract

Visual entailment is a recently proposed multimodal reasoning task where the goal is to predict the logical relationship of a piece of text to an image. In this paper, we propose an extension of this task, where the goal is to predict the logical relationship of fine-grained knowledge elements within a piece of text to an image. Unlike prior work, our method is inherently explainable and makes logical predictions at different levels of granularity. Because we lack fine-grained labels to train our method, we propose a novel multi-instance learning approach which learns a fine-grained labeling using only sample-level supervision. We also impose novel semantic structural constraints which ensure that fine-grained predictions are internally semantically consistent. We evaluate our method on a new dataset of manually annotated knowledge elements and show that our method achieves 68.18% accuracy at this challenging task while significantly outperforming several strong baselines. Finally, we present extensive qualitative results illustrating our method’s predictions and the visual evidence our method relied on. Our code and annotated dataset can be found here: https://github.com/SkrighYZ/FGVE.
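
To make the multi-instance learning idea in the abstract concrete, below is a minimal, hypothetical PyTorch sketch of how sample-level supervision can train per-element predictions. The aggregation rule (a sentence is contradicted if any knowledge element is contradicted, and entailed only if every element is entailed), along with all names and shapes, is an illustrative assumption, not the paper's actual FGVE model; see the linked repository for the authors' implementation.

import torch
import torch.nn.functional as F

def bag_logits(instance_logits):
    # instance_logits: (n_elements, 3) scores per knowledge element over
    # {entailment, neutral, contradiction}. Shapes and the aggregation
    # rule below are illustrative assumptions.
    probs = F.softmax(instance_logits, dim=-1)
    p_ent, _, p_con = probs.unbind(dim=-1)
    bag_con = 1.0 - torch.prod(1.0 - p_con)   # noisy-OR: any element contradicts
    bag_ent = torch.prod(p_ent)               # product: all elements entailed
    bag_neu = (1.0 - bag_ent - bag_con).clamp(min=1e-6)
    bag = torch.stack([bag_ent, bag_neu, bag_con])
    return torch.log(bag / bag.sum())         # renormalised log-probabilities

# Training touches only the sample-level label y in {0, 1, 2}:
element_scores = torch.randn(4, 3, requires_grad=True)  # 4 hypothetical elements
loss = F.nll_loss(bag_logits(element_scores).unsqueeze(0), torch.tensor([1]))
loss.backward()  # gradients flow back to every element's prediction

The noisy-OR/product combination is just one differentiable way to tie element-level predictions to a sample-level loss; the paper additionally imposes semantic structural constraints across levels of granularity, which this sketch omits.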

C. Thomas and Y. Zhang contributed equally to this work.



Acknowledgements

This research is based upon work supported by DARPA SemaFor Program No. HR001120C0123. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of DARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation.

Author information


Corresponding author

Correspondence to Christopher Thomas.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 1395 KB)


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Thomas, C., Zhang, Y., Chang, S.F. (2022). Fine-Grained Visual Entailment. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13696. Springer, Cham. https://doi.org/10.1007/978-3-031-20059-5_23

  • DOI: https://doi.org/10.1007/978-3-031-20059-5_23

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-20058-8

  • Online ISBN: 978-3-031-20059-5

  • eBook Packages: Computer Science (R0)
