DOI: 10.1145/3627673.3679944
short-paper

Improving German News Clustering with Contrastive Learning

Published: 21 October 2024

Abstract

Automatic clustering of news articles is one of the most important tasks for news publishers. Traditional unsupervised models rely on generic text representations (e.g., BERT) and typically do not consider the relationships between the paragraphs of a news article. Learning such in-depth structure from news articles is important for clustering full-length articles. Recently, contrastive learning (CL) has become a popular method for representation learning; it uses positive and negative data pairs, generated with data augmentation techniques, to improve representations in the latent space. In this work, we propose text augmentation methods and use contrastive learning to cluster a daily growing collection of full-length German news articles. Our experiments on four German news article datasets (one labeled and three unlabeled) demonstrate that contrastive learning combined with our text augmentation methods significantly improves the representation of news articles compared to generic pre-trained text representations and achieves high performance on clustering tasks.
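
The abstract names neither the loss function nor the augmentation methods, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes an InfoNCE-style objective with in-batch negatives, a common choice in contrastive learning, and a hypothetical paragraph-sampling augmentation. The function names `info_nce_loss` and `paragraph_views`, the `temperature` and `keep` parameters, and their default values are all illustrative assumptions.

```python
import numpy as np


def info_nce_loss(view_a: np.ndarray, view_b: np.ndarray, temperature: float = 0.1) -> float:
    """InfoNCE loss with in-batch negatives (assumed objective, not from the paper).

    Row i of view_a and row i of view_b embed two augmented views of the same
    article (a positive pair); every other row of view_b is a negative for i.
    """
    # L2-normalise so the dot product equals cosine similarity.
    a = view_a / np.linalg.norm(view_a, axis=1, keepdims=True)
    b = view_b / np.linalg.norm(view_b, axis=1, keepdims=True)
    logits = (a @ b.T) / temperature  # shape (batch, batch)
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positive pairs sit on the diagonal.
    return float(-np.mean(np.diag(log_probs)))


def paragraph_views(article: str, rng: np.random.Generator, keep: float = 0.5):
    """Hypothetical augmentation: two random paragraph subsets of one article."""
    paragraphs = [p for p in article.split("\n\n") if p.strip()]
    n_keep = max(1, int(round(keep * len(paragraphs))))
    views = []
    for _ in range(2):
        # Sample paragraph indices without replacement, preserving article order.
        idx = np.sort(rng.choice(len(paragraphs), size=n_keep, replace=False))
        views.append("\n\n".join(paragraphs[i] for i in idx))
    return views[0], views[1]
```

In a training loop, each view would be passed through a text encoder and the resulting embedding matrices fed to `info_nce_loss`; minimizing the loss pulls the two views of an article together and pushes views of different articles apart.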


    Published In

    CIKM '24: Proceedings of the 33rd ACM International Conference on Information and Knowledge Management
    October 2024
    5705 pages
    ISBN:9798400704369
    DOI:10.1145/3627673

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. contrastive learning
    2. news clustering
    3. text augmentation

    Acceptance Rates

    Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

    Article Metrics

    • Total Citations: 0
    • Total Downloads: 57
    • Downloads (Last 12 months): 57
    • Downloads (Last 6 weeks): 7
    Reflects downloads up to 09 Jan 2025
