DOI: 10.1145/2960811.2967151
short-paper

Combining Taxonomies using Word2vec

Published: 13 September 2016
    Abstract

    Taxonomies are widely used in a variety of fields because they are extensible and well suited to classification and knowledge organization. Of particular interest is the digital document management domain, in which their hierarchical structure can be used to organize documents into content-specific categories. Common or standard taxonomies (e.g., the ACM Computing Classification System) contain concepts that are too general for conceptualizing specific knowledge domains. In this paper we introduce a novel automated approach that combines sub-trees from general taxonomies with specialized seed taxonomies by using Natural Language Processing techniques. We provide an extensible and generalizable model for combining taxonomies in the practical context of two very large European research projects. Because the manual combination of taxonomies by domain experts is highly time-consuming, our model automates the task by measuring the semantic relatedness between concept labels in CBOW or skip-gram Word2vec vector spaces. A greedy algorithm with incremental thresholds is used to match and combine topic labels, and a preliminary quantitative evaluation of the resulting taxonomies is then performed.
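The abstract does not spell out the matching procedure, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that a concept label is embedded by averaging the vectors of its words, that relatedness is cosine similarity, and that "incremental thresholds" means repeating a greedy best-match pass at several cut-off values. The names `TOY_VECTORS`, `label_vector`, `relatedness`, and `greedy_combine` are hypothetical, and the three-dimensional vectors are made up for illustration; in practice they would come from a trained CBOW or skip-gram Word2vec model.

```python
import math

# Hypothetical 3-dimensional vectors standing in for a trained Word2vec model.
TOY_VECTORS = {
    "machine": (0.9, 0.1, 0.0),
    "learning": (0.8, 0.2, 0.1),
    "statistical": (0.7, 0.3, 0.0),
    "inference": (0.8, 0.1, 0.2),
    "game": (0.0, 0.9, 0.1),
    "design": (0.1, 0.8, 0.2),
    "graphics": (0.1, 0.3, 0.9),
}


def label_vector(label, vectors):
    """Average the vectors of the in-vocabulary words of a concept label."""
    words = [w for w in label.lower().split() if w in vectors]
    if not words:
        return None  # no word of the label is known to the model
    dim = len(vectors[words[0]])
    return tuple(sum(vectors[w][i] for w in words) / len(words) for i in range(dim))


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def relatedness(label_a, label_b, vectors):
    """Semantic relatedness of two labels; 0.0 if either cannot be embedded."""
    va, vb = label_vector(label_a, vectors), label_vector(label_b, vectors)
    if va is None or vb is None:
        return 0.0
    return cosine(va, vb)


def greedy_combine(seed_labels, general_labels, thresholds, vectors):
    """Assumed procedure: for each threshold, walk all label pairs from most
    to least related and attach each general label to the first (best) seed
    label whose score is at or above the threshold.

    Returns {threshold: [(seed_label, general_label, score), ...]}.
    """
    pairs = sorted(
        ((relatedness(s, g, vectors), s, g)
         for s in seed_labels for g in general_labels),
        reverse=True,
    )
    results = {}
    for threshold in sorted(thresholds):
        attached = set()
        matches = []
        for score, seed, general in pairs:
            if score < threshold:
                break  # pairs are sorted, so nothing further qualifies
            if general not in attached:
                attached.add(general)
                matches.append((seed, general, round(score, 3)))
        results[threshold] = matches
    return results


if __name__ == "__main__":
    seeds = ["machine learning", "game design"]
    general = ["statistical inference", "graphics", "game"]
    for threshold, matches in greedy_combine(seeds, general, [0.4, 0.9], TOY_VECTORS).items():
        print(threshold, matches)
```

Running the pass at several thresholds makes the trade-off visible: a strict threshold attaches only closely related sub-trees, while a looser one attaches more of the general taxonomy at the cost of weaker matches.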


    Published In

    DocEng '16: Proceedings of the 2016 ACM Symposium on Document Engineering
    September 2016
    222 pages
    ISBN:9781450344388
    DOI:10.1145/2960811

    Sponsors

    In-Cooperation

    • SIGDOC: ACM Special Interest Group on Systems Documentation

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. automated semantic integration
    2. ontology alignment
    3. taxonomy integration
    4. word2vec

    Qualifiers

    • Short-paper

    Funding Sources

    • EC H2020 EDISON
    • EC H2020 RAGE

    Conference

    DocEng '16
    DocEng '16: ACM Symposium on Document Engineering 2016
    September 13 - 16, 2016
    Vienna, Austria

    Acceptance Rates

    DocEng '16 paper acceptance rate: 11 of 35 submissions (31%)
    Overall acceptance rate: 178 of 537 submissions (33%)

    Article Metrics

    • Downloads (last 12 months): 18
    • Downloads (last 6 weeks): 5

    Cited By

    • (2024) Product Space Clustering with Graph Learning for Diversifying Industrial Production. Applied Sciences 14:7 (2833). DOI: 10.3390/app14072833. Online publication date: 27-Mar-2024
    • (2024) Password cracking using chunk similarity. Future Generation Computer Systems 150 (380-394). DOI: 10.1016/j.future.2023.09.013. Online publication date: Jan-2024
    • (2023) Verwaltung von Kultur-Artefakten: Herausforderungen bei der Realisierung typischer Einsatzszenarien in Kulturerbe-Archiven. Wissensbasierte KI-Anwendungen (245-258). DOI: 10.1007/978-3-662-68002-5_15. Online publication date: 1-Dec-2023
    • (2021) Production2Vec: a hybrid recommender system combining semantic and product complexity approach to improve industrial resiliency. 2021 2nd International Conference on Artificial Intelligence and Information Systems (1-6). DOI: 10.1145/3469213.3469218. Online publication date: 28-May-2021
    • (2021) Application of machine learning algorithms in wind power: a review. Energy Sources, Part A: Recovery, Utilization, and Environmental Effects (1-22). DOI: 10.1080/15567036.2020.1869867. Online publication date: 8-Feb-2021
    • (2021) Using Word2Vec-LDA-Cosine Similarity for Discovering News Dissemination Pattern to Support Government–Citizen Engagement. Proceedings of International Conference on Data Science and Applications (703-716). DOI: 10.1007/978-981-16-5120-5_53. Online publication date: 23-Nov-2021
    • (2018) Extracting Semantic Relations for Scholarly Knowledge Base Construction. 2018 IEEE 12th International Conference on Semantic Computing (ICSC) (56-63). DOI: 10.1109/ICSC.2018.00017. Online publication date: Jan-2018
    • (2018) The Edutainment Platform: Interactive Storytelling Relying on Semantic Similarity. Challenges and Solutions in Smart Learning (87-96). DOI: 10.1007/978-981-10-8743-1_13. Online publication date: 10-Mar-2018
    • (2018) Managing Cultural Assets: Challenges for Implementing Typical Cultural Heritage Archive’s Usage Scenarios. Semantic Applications (219-230). DOI: 10.1007/978-3-662-55433-3_15. Online publication date: 14-Apr-2018
