DOI: 10.1145/584792.584829

Topic-based document segmentation with probabilistic latent semantic analysis

Published: 04 November 2002

    Abstract

    This paper presents a new method for topic-based document segmentation, i.e., the identification of boundaries between parts of a document that bear on different topics. The method combines the use of the Probabilistic Latent Semantic Analysis (PLSA) model with the method of selecting segmentation points based on the similarity values between pairs of adjacent blocks. The use of PLSA allows for a better representation of sparse information in a text block, such as a sentence or a sequence of sentences. Furthermore, segmentation performance is improved by combining different instantiations of the same model, either using different random initializations or different numbers of latent classes. Results on commonly available data sets are significantly better than those of other state-of-the-art systems.
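The pipeline outlined in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`fit_plsa`, `adjacent_similarities`, `pick_boundaries`, `ensemble_similarities`), the toy count matrix, the use of cosine similarity, the rule of picking the lowest-similarity gaps, and the choice to combine model instantiations by averaging their similarity curves are all hypothetical. The sketch also fits PLSA directly on the blocks being segmented, which the abstract does not specify.

```python
import numpy as np


def fit_plsa(counts, n_topics, n_iter=100, seed=0):
    """Fit PLSA by EM on a (blocks x words) count matrix.

    Returns P(z|d) of shape (blocks, topics) and P(w|z) of shape (topics, words).
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    eps = 1e-12  # guards against division by zero
    for _ in range(n_iter):
        # E-step: posterior P(z|d,w) is proportional to P(z|d) * P(w|z).
        post = p_z_d[:, :, None] * p_w_z[None, :, :]  # (docs, topics, words)
        post /= post.sum(axis=1, keepdims=True) + eps
        # M-step: re-estimate both tables from expected counts n(d,w) * P(z|d,w).
        expected = counts[:, None, :] * post
        p_w_z = expected.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + eps
        p_z_d = expected.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + eps
    return p_z_d, p_w_z


def adjacent_similarities(p_z_d, p_w_z):
    """Cosine similarity between the smoothed word distributions of neighbouring blocks."""
    p_w_d = p_z_d @ p_w_z  # P(w|d) = sum_z P(z|d) * P(w|z)
    unit = p_w_d / np.linalg.norm(p_w_d, axis=1, keepdims=True)
    return np.sum(unit[:-1] * unit[1:], axis=1)


def ensemble_similarities(counts, n_topics, seeds):
    """Average the similarity curve over several random initializations (assumed combination rule)."""
    curves = [adjacent_similarities(*fit_plsa(counts, n_topics, seed=s)) for s in seeds]
    return np.mean(curves, axis=0)


def pick_boundaries(sims, n_boundaries):
    """Return the gaps with the lowest similarity; gap i lies between blocks i and i + 1."""
    return sorted(np.argsort(sims)[:n_boundaries].tolist())


# Toy data (made up): blocks 0-2 use words 0-2, blocks 3-5 use words 3-5.
counts = np.array([
    [4, 2, 1, 0, 0, 0],
    [3, 3, 1, 0, 0, 0],
    [4, 1, 2, 0, 0, 0],
    [0, 0, 0, 2, 4, 1],
    [0, 0, 0, 1, 3, 3],
    [0, 0, 0, 2, 4, 2],
], dtype=float)

sims = ensemble_similarities(counts, n_topics=2, seeds=[0, 1, 2, 3, 4])
boundaries = pick_boundaries(sims, n_boundaries=1)
print("similarities between adjacent blocks:", np.round(sims, 3))
print("boundary after block:", boundaries)
```

Because each similarity is computed on the model's smoothed distribution P(w|d) rather than on raw counts, two blocks can score as similar even when they share few literal words, which is the property the abstract attributes to PLSA. Varying `n_topics` across the ensemble instead of the seed would be the other combination mentioned in the abstract.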


        Published In

        CIKM '02: Proceedings of the eleventh international conference on Information and knowledge management
        November 2002
        704 pages
        ISBN: 1581134924
        DOI: 10.1145/584792
        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Author Tags

        1. PLSA
        2. text segmentation
        3. topic identification

        Qualifiers

        • Article

        Conference

        CIKM02

        Acceptance Rates

        Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

        Article Metrics

        • Downloads (Last 12 months): 31
        • Downloads (Last 6 weeks): 1
        Reflects downloads up to 10 Aug 2024

        Cited By

        • (2024) Revisiting Probabilistic Latent Semantic Analysis: Extensions, Challenges and Insights. Technologies 12:1 (5). DOI: 10.3390/technologies12010005. Online publication date: 3-Jan-2024
        • (2024) Segmented Summarization and Refinement: A Pipeline for Long-Document Analysis on Social Media. Journal of Social Computing 5:2 (132-144). DOI: 10.23919/JSC.2024.0010. Online publication date: Jun-2024
        • (2024) Dynamic Segmentation for Efficient Retrieval of Podcasts: The Repping Algorithm. Proceedings of the 2024 International Conference on Multimedia Retrieval (29-36). DOI: 10.1145/3652583.3658047. Online publication date: 30-May-2024
        • (2024) Negative-Sensitive Framework With Semantic Enhancement for Composed Image Retrieval. IEEE Transactions on Multimedia 26 (7608-7621). DOI: 10.1109/TMM.2024.3369898. Online publication date: 2024
        • (2022) Subtask analysis of process data through a predictive model. British Journal of Mathematical and Statistical Psychology 76:1 (211-235). DOI: 10.1111/bmsp.12290. Online publication date: Nov-2022
        • (2022) Neural Text Segmentation and its Application to Sentiment Analysis. IEEE Transactions on Knowledge and Data Engineering 34:2 (828-842). DOI: 10.1109/TKDE.2020.2983360. Online publication date: 1-Feb-2022
        • (2022) Bayesian Folding-In Using Generalized Dirichlet and Beta-Liouville Kernels for Information Retrieval. 2022 IEEE Symposium Series on Computational Intelligence (SSCI) (1430-1435). DOI: 10.1109/SSCI51031.2022.10022072. Online publication date: 4-Dec-2022
        • (2022) An Analysis of Various Text Segmentation Approaches. Proceedings of International Conference on Intelligent Cyber-Physical Systems (285-302). DOI: 10.1007/978-981-16-7136-4_22. Online publication date: 24-Jan-2022
        • (2021) Paragraph Boundary Recognition in Novels for Story Understanding. Applied Sciences 11:12 (5632). DOI: 10.3390/app11125632. Online publication date: 18-Jun-2021
        • (2021) BATS: A Spectral Biclustering Approach to Single Document Topic Modeling and Segmentation. ACM Transactions on Intelligent Systems and Technology 12:5 (1-29). DOI: 10.1145/3468268. Online publication date: 15-Oct-2021
