Joint dynamic topic model for recognition of lead-lag relationship in two text corpora

Data Mining and Knowledge Discovery

Abstract

Topic evolution modeling has received significant attention in recent decades. Although various topic evolution models have been proposed, most studies focus on a single document corpus. In practice, however, data are readily available from multiple sources, and relationships between those sources can be observed. It is therefore of great interest to recognize the relationship between multiple text corpora and to use this relationship to improve topic modeling. In this work, we focus on a special type of relationship between two text corpora, which we define as the “lead-lag relationship”. This relationship characterizes the phenomenon that one text corpus influences the topics discussed in the other text corpus in the future. To discover the lead-lag relationship, we propose a joint dynamic topic model and also develop an embedding extension to address the problem of modeling large-scale text corpora. With the recognized lead-lag relationship, the similarities between the two text corpora can be identified and the quality of topic learning in both corpora can be improved. We numerically investigate the performance of the joint dynamic topic modeling approach using synthetic data. Finally, we apply the proposed model to two text corpora consisting of statistical papers and graduation theses. The results show that the proposed model recognizes the lead-lag relationship between the two corpora well, and that it also discovers the topic patterns specific to each corpus and those shared by both.
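To make the notion of a lead-lag relationship concrete, the following minimal sketch may help. It is not the joint dynamic topic model proposed in the article; it only illustrates the underlying idea with a plain lagged Pearson correlation between two hypothetical topic-proportion time series. All names (`pearson`, `best_lag`, `lead_series`, `lag_series`) and the synthetic data are assumptions made for illustration.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def best_lag(leader, follower, max_lag):
    """Score each lag k by corr(leader[t], follower[t + k]) and return the best k."""
    scores = {}
    for k in range(max_lag + 1):
        x = leader[: len(leader) - k]  # leader values at time t
        y = follower[k:]               # follower values at time t + k
        scores[k] = pearson(x, y)
    return max(scores, key=scores.get), scores


# Hypothetical data: the share of one topic in the leading corpus over
# 60 time slices, and the same share echoed 3 slices later (plus noise)
# in the lagging corpus.
rng = random.Random(0)
num_slices, true_lag = 60, 3
lead_series = [0.3 + 0.2 * math.sin(t / 4.0) + rng.gauss(0, 0.01)
               for t in range(num_slices)]
lag_series = [lead_series[max(t - true_lag, 0)] + rng.gauss(0, 0.01)
              for t in range(num_slices)]

estimated_lag, scores = best_lag(lead_series, lag_series, max_lag=8)
print("estimated lag:", estimated_lag)
```

In this toy setting the lag with the highest correlation indicates how many time slices the second series trails the first. The article's approach differs in that it learns the topics and the relationship jointly rather than comparing precomputed series.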

Notes

  1. https://epub.cnki.net/kns/brief/result.aspx?dbPrefix=CDMD.

  2. https://fanyi.baidu.com/.

Acknowledgements

This work is supported by the Fundamental Research Funds for the Central Universities, and the Research Funds of Renmin University of China (21XNA026).

Author information

Corresponding author

Correspondence to Feifei Wang.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Responsible editor: Sriraam Natarajan.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Xiaoling Lu is the co-first author.

Appendix: The lists of selected journals and top schools with the highest number of theses

The Ten Selected Statistical Journals

  1. Journal of the American Statistical Association

  2. Econometrica

  3. Journal of the Royal Statistical Society Series B (Statistical Methodology)

  4. Annals of Statistics

  5. Fuzzy Sets and Systems

  6. Computational Statistics & Data Analysis

  7. American Statistician

  8. Journal of Business & Economic Statistics

  9. Stochastic Processes and Their Applications

  10. Statistics and Computing

The Chinese Universities with Top Number of Theses

  1. Dongbei University of Finance and Economics

  2. Zhejiang Gongshang University

  3. East China Normal University

  4. Huazhong University of Science and Technology

  5. Tianjin University of Finance and Economics

  6. Northeast Normal University

  7. Hunan University

  8. Jinan University

  9. Central South University

  10. Shandong University

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhu, Y., Lu, X., Hong, J. et al. Joint dynamic topic model for recognition of lead-lag relationship in two text corpora. Data Min Knowl Disc 36, 2272–2298 (2022). https://doi.org/10.1007/s10618-022-00873-w

  • DOI: https://doi.org/10.1007/s10618-022-00873-w
