Research Article
DOI: 10.1145/3485447.3512084

Dynamic Gaussian Embedding of Authors

Published: 25 April 2022

Abstract

Authors publish documents in a dynamic manner: their topics of interest and writing style may shift over time. Tasks such as author classification, author identification, or link prediction are difficult to solve in such complex data settings. We propose a new representation learning model, DGEA (Dynamic Gaussian Embedding of Authors), that is better suited to these tasks because it captures this temporal evolution. We formulate a general embedding framework: the representation of an author at time t is a Gaussian distribution that leverages pre-trained document vectors and depends on the publications observed up to t. The representations should retain some form of multi-topic information and temporal smoothness. We propose two models that fit into this framework. The first, K-DGEA, uses a first-order Markov model optimized with an Expectation-Maximization algorithm based on Kalman equations. The second, R-DGEA, uses a recurrent neural network to model the time dependence. We evaluate our method on several quantitative tasks (author identification, classification, and co-authorship prediction) on two English-language datasets. Moreover, our model is language-agnostic, since it only requires pre-trained document embeddings. It outperforms existing baselines by up to 18% on an author classification task on a news articles dataset.
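To make the framework concrete, here is a minimal sketch of the filtering view behind K-DGEA: the author's latent state is a Gaussian that follows a first-order Markov (random-walk) transition, and each pre-trained document vector is treated as a noisy observation of that state. The random-walk transition, identity observation model, and fixed noise covariances Q and R are illustrative assumptions; in the paper these quantities are learned with an EM algorithm rather than set by hand.

```python
import numpy as np

def kalman_author_update(mu, Sigma, d, Q, R):
    """One Kalman predict/update step for a random-walk author state.

    mu, Sigma : Gaussian author embedding (mean, covariance) at time t-1
    d         : pre-trained document vector observed at time t
    Q, R      : transition and observation noise covariances (assumed fixed)
    """
    # Predict: x_t = x_{t-1} + w_t, with w_t ~ N(0, Q)
    mu_pred, Sigma_pred = mu, Sigma + Q
    # Update with observation d_t = x_t + v_t, with v_t ~ N(0, R)
    S = Sigma_pred + R                     # innovation covariance
    K = Sigma_pred @ np.linalg.inv(S)      # Kalman gain
    mu_new = mu_pred + K @ (d - mu_pred)
    Sigma_new = (np.eye(len(mu)) - K) @ Sigma_pred
    return mu_new, Sigma_new

# Usage: stream one author's document vectors in temporal order.
dim = 8
rng = np.random.default_rng(0)
mu, Sigma = np.zeros(dim), np.eye(dim)        # uninformative prior
Q, R = 0.01 * np.eye(dim), 0.1 * np.eye(dim)  # illustrative noise levels
for d in rng.normal(size=(5, dim)):           # stand-in document embeddings
    mu, Sigma = kalman_author_update(mu, Sigma, d, Q, R)
```

In R-DGEA, a recurrent network would instead consume the stream of document vectors and output the Gaussian parameters directly, trading the closed-form Kalman equations for a learned, more flexible time dependence.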


Cited By

  • (2024) Building Brownian Bridges to Learn Dynamic Author Representations from Texts. Advances in Intelligent Data Analysis XXII, 230-241. DOI: 10.1007/978-3-031-58547-0_19. Online publication date: 16 April 2024.


Information

Published In

WWW '22: Proceedings of the ACM Web Conference 2022
April 2022, 3764 pages
ISBN: 9781450390965
DOI: 10.1145/3485447

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. Author Embedding
  2. Document Embedding
  3. Dynamic Gaussian Embedding
  4. Representation Learning

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

WWW '22: The ACM Web Conference 2022
April 25-29, 2022
Virtual Event, Lyon, France

Acceptance Rates

Overall acceptance rate: 1,899 of 8,196 submissions (23%)

