DOI: 10.1145/3583780.3614767
research-article

A Principled Decomposition of Pointwise Mutual Information for Intention Template Discovery

Published: 21 October 2023

Abstract

With the rise of Artificial Intelligence (AI), question answering systems such as ChatGPT and Siri have become a common way for users to interact with computers. These systems require a substantial amount of labeled data to train their models. However, labeled data is scarce and challenging to construct. The construction process typically involves two stages: discovering potential sample candidates and manually labeling these candidates. To discover high-quality candidate samples, we study the intention paraphrase template discovery task: given some seed questions or templates of an intention, discover new paraphrase templates that describe the intention and are sufficiently different from the seeds in text. As the first exploration of the task, we identify its new quality requirements, i.e., relevance, divergence, and popularity, and its new challenges, i.e., the paradox of divergent yet relevant paraphrases and the conflict of popular yet relevant paraphrases. To untangle the paradox of divergent yet relevant paraphrases, where the traditional bag of words falls short, we develop usage-centric modeling, which represents a question/template/answer as a bag of usages that users engaged in (e.g., up-votes) and uses a usage-flow graph to interrelate templates, questions, and answers. To balance the conflict of popular yet relevant paraphrases, we propose a new and principled decomposition of the well-known Pointwise Mutual Information from the usage perspective (usage-PMI), and then develop a Bayesian inference framework over the usage-flow graph to estimate the usage-PMI. Extensive experiments over three large CQA corpora show a strong performance advantage over baselines adopted from the paraphrase identification task. We release 885,000 high-quality paraphrase templates discovered by our proposed PMI decomposition model; the data is available at https://github.com/Para-Questions/Intention_template_discovery.
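
For readers unfamiliar with the measure the abstract builds on, the standard definition is PMI(x, y) = log( p(x, y) / (p(x) p(y)) ). The sketch below computes this textbook quantity from co-occurrence counts. It is not the usage-PMI decomposition or the Bayesian estimator proposed in the paper, whose details are not given on this page; the function name `pmi`, the template strings, the intention labels, and all counts are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def pmi(pair_counts, x, y):
    """Standard PMI, log(p(x, y) / (p(x) * p(y))), using the natural log.

    pair_counts maps (x, y) pairs to co-occurrence counts.
    Returns -inf when the pair was never observed together.
    """
    total = sum(pair_counts.values())
    count_xy = pair_counts.get((x, y), 0)
    if count_xy == 0:
        return float("-inf")
    # Marginal counts: sum over every pair that contains x (or y).
    count_x = sum(c for (a, _), c in pair_counts.items() if a == x)
    count_y = sum(c for (_, b), c in pair_counts.items() if b == y)
    return math.log(count_xy * total / (count_x * count_y))


# Hypothetical (template, intention) co-occurrence counts, invented for this sketch.
counts = Counter({
    ("how do i reset X", "reset_password"): 8,
    ("how do i reset X", "other"): 2,
    ("what is X", "reset_password"): 2,
    ("what is X", "other"): 88,
})

strong = pmi(counts, "how do i reset X", "reset_password")
weak = pmi(counts, "what is X", "reset_password")
print(f"strong association: {strong:.3f}")
print(f"weak association:   {weak:.3f}")
```

With these made-up counts, the first template co-occurs with the intention far more often than chance would predict, so its PMI is positive, while the second template's PMI is negative.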

    Published In

    CIKM '23: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management
    October 2023
    5508 pages
    ISBN:9798400701245
    DOI:10.1145/3583780

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Bayesian inference
    2. paraphrasing
    3. pointwise mutual information

    Acceptance Rates

    Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
