short-paper

Improving German News Clustering with Contrastive Learning

Authors: Piriyakorn Piriyatamwong, Saikishore Kalloori, Fabio Zünd

CIKM '24: Proceedings of the 33rd ACM International Conference on Information and Knowledge Management

Pages 3979 - 3983

https://doi.org/10.1145/3627673.3679944

Published: 21 October 2024

Abstract

Automatic clustering of news articles is one of the most important tasks for news publishers. Traditional unsupervised models rely on generic text representations (e.g., BERT) and typically do not consider the relationships between the paragraphs of a news article. Learning at this depth is important for clustering full-length articles. Contrastive learning (CL) has recently become a popular method for representation learning; it uses positive and negative data pairs, generated with data augmentation techniques, to improve the representation in the latent space. In this work, we propose text augmentation methods and use contrastive learning to cluster daily growing collections of full-length German news articles. Our experiments on four German news article datasets (one labeled and three unlabeled) demonstrate that contrastive learning combined with our text augmentation methods significantly improves the representation of news articles compared to generic pre-trained text representations and achieves high performance on clustering tasks.
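The abstract names neither the augmentation methods nor the loss function, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a hypothetical paragraph-subset augmentation (`paragraph_views`) to build two views of one article, and an InfoNCE-style loss with in-batch negatives (`info_nce`), which is a common choice in contrastive learning. Both function names, the temperature value, and the use of plain Python lists as embeddings are illustrative assumptions.

```python
import math
import random


def paragraph_views(article: str, rng: random.Random) -> tuple[str, str]:
    """Hypothetical augmentation: build two views of one article,
    each from a random subset of about half of its paragraphs."""
    paragraphs = [p.strip() for p in article.split("\n\n") if p.strip()]
    if not paragraphs:
        raise ValueError("article contains no paragraphs")

    def sample() -> str:
        k = max(1, len(paragraphs) // 2)
        # Sort the sampled indices so paragraphs keep their original order.
        indices = sorted(rng.sample(range(len(paragraphs)), k))
        return "\n\n".join(paragraphs[i] for i in indices)

    return sample(), sample()


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors: list[list[float]],
             positives: list[list[float]],
             temperature: float = 0.1) -> float:
    """Mean InfoNCE loss over a batch of embeddings.

    positives[i] is the positive for anchors[i]; every other
    positives[j] serves as an in-batch negative for anchors[i].
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)
```

In such a setup, the two views returned by `paragraph_views` would be passed through a text encoder, and the resulting embeddings would be passed to `info_nce` as `anchors` and `positives`. The loss is small when each anchor is closest to its own positive and large when it is closer to another article's view.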

Index Terms

1. Information systems
  1. Information systems applications
    1. Data mining
      1. Clustering

Published In

CIKM '24: Proceedings of the 33rd ACM International Conference on Information and Knowledge Management

October 2024

5705 pages

ISBN:9798400704369

DOI:10.1145/3627673

General Chairs: Edoardo Serra (Boise State University, USA), Francesca Spezzano (Boise State University, USA)

Copyright © 2024 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGIR: ACM Special Interest Group on Information Retrieval

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

CIKM '24

Sponsor:

SIGIR

CIKM '24: The 33rd ACM International Conference on Information and Knowledge Management

October 21 - 25, 2024

Boise, ID, USA

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
