research-article

Recommending Answers to Math Questions Based on KL-Divergence and Approximate XML Tree Matching

Authors:

Yiu-Kai Dennis Ng

SIGIR-AP '23: Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region

Pages 21 - 31

https://doi.org/10.1145/3624918.3625337

Published: 26 November 2023

Abstract

Math is the science and study of quantity, structure, space, and change. It seeks out patterns, formulates new conjectures, and establishes truth by rigorous deduction from appropriately chosen axioms and definitions. Studying math makes a person better at solving problems and provides skills that can be used across other subjects and applied in different job roles. In the modern world, builders use math every day, since construction workers add, subtract, divide, multiply, and work with fractions. Math is clearly a major contributor to many areas of study. For this reason, math information retrieval (Math IR) deserves attention and recognition: a reliable Math IR system helps users find relevant answers to math questions and benefits all math learners whenever they need help solving a math problem, regardless of time and place. Moreover, Math IR systems enhance the learning experience of their users. In this paper, we present MaRec, a recommender system that retrieves and ranks math answers to a math question based on their textual content and embedded formulas. MaRec ranks a potential answer A to a math question Q by computing (i) the KL-divergence score of A and Q using their textual content, and (ii) the subtree matching score of the math formulas in Q and A, represented as XML trees. The design of MaRec is simple and easy to understand, since it relies solely on a probability model and an elegant tree-matching approach in ranking math answers. Empirical studies show that MaRec significantly outperforms (i) three existing state-of-the-art Math IR systems in an offline evaluation, and (ii) two top-of-the-line machine learning systems in an online analysis.
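The two signals named in the abstract can be illustrated with a minimal Python sketch. This is not MaRec's implementation: the abstract does not give the smoothing scheme, the approximate tree-matching algorithm, or how the two scores are combined. The sketch assumes add-alpha smoothed unigram models for the text score and swaps in exact subtree overlap (not approximate matching) for the formula score. All function names and the `alpha` parameter are hypothetical.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def unigram_model(text, vocab, alpha=1.0):
    """Add-alpha smoothed unigram distribution over a shared vocabulary (assumed scheme)."""
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_divergence(question, answer, alpha=1.0):
    """KL(P_Q || P_A) in nats; a smaller value means the wording is closer."""
    vocab = set(question.lower().split()) | set(answer.lower().split())
    p = unigram_model(question, vocab, alpha)
    q = unigram_model(answer, vocab, alpha)
    return sum(p[w] * math.log(p[w] / q[w]) for w in vocab)


def canonical(node):
    """Represent a subtree as a nested tuple of tag, stripped text, and children."""
    text = (node.text or "").strip()
    return (node.tag, text, tuple(canonical(child) for child in node))


def subtrees(root):
    """Multiset of canonical forms, one per node-rooted subtree."""
    return Counter(canonical(node) for node in root.iter())


def subtree_match_score(question_xml, answer_xml):
    """Fraction of the question's subtrees that occur exactly in the answer.

    Exact matching is a simplification; the paper describes approximate matching.
    """
    q_trees = subtrees(ET.fromstring(question_xml))
    a_trees = subtrees(ET.fromstring(answer_xml))
    shared = sum((q_trees & a_trees).values())
    return shared / sum(q_trees.values())


if __name__ == "__main__":
    q_formula = "<mrow><mi>x</mi><mo>+</mo><mn>1</mn></mrow>"
    a_formula = "<mrow><mi>x</mi><mo>+</mo><mn>2</mn></mrow>"
    print(kl_divergence("solve x plus one", "x plus two equals three"))
    print(subtree_match_score(q_formula, a_formula))
```

In this sketch a lower KL-divergence and a higher subtree score both indicate a closer answer; any rule for merging them into a single ranking would be a further assumption.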

Index Terms

Recommending Answers to Math Questions Based on KL-Divergence and Approximate XML Tree Matching
1. Applied computing
2. Information systems
  1. Information retrieval
    1. Retrieval models and ranking
    2. Retrieval tasks and goals

Index terms have been assigned to the content through auto-classification.

Published In

SIGIR-AP '23: Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region

November 2023

324 pages

ISBN:9798400704086

DOI:10.1145/3624918

Editors:
Qingyao Ai (Tsinghua University, China)
Yiqin Liu (Tsinghua University, China)
Alistair Moffat (The University of Melbourne, Australia)
Xuanjing Huang (Fudan University, China)
Tetsuya Sakai (Waseda University, Japan)
Justin Zobel (The University of Melbourne, Australia)

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGIR: ACM Special Interest Group on Information Retrieval

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

SIGIR-AP '23

Sponsor:

SIGIR

SIGIR-AP '23: Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region

November 26 - 28, 2023

Beijing, China
