Abstract
Most NLP tasks rely on large, well-organized corpora in the general domain, while little work has been done in specific domains because qualified corpora and evaluation datasets are lacking. Yet domain-specific applications are widely needed today. In this paper, we propose a fast, inexpensive, model-assisted method for training a high-quality distributional model from scattered, unstructured web pages, which captures knowledge of a specific domain. The approach requires neither a pre-organized corpus nor much human effort, and therefore suits specific domains that cannot afford the cost of manually constructed corpora and complex training. We use Word2vec to assist in creating a term set and an evaluation dataset for the embroidery domain. We then train a distributional model on filtered search results for the term set and perform task-specific tuning with two simple but practical evaluation metrics: word-pair similarity and in-domain term coverage. On our task, these much smaller models outperform a word embedding model trained on a large, general corpus. We demonstrate the effectiveness of the method and hope it can serve as a reference for researchers extracting high-quality knowledge in specific domains.
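To make the training and tuning pipeline concrete, the following is a minimal sketch, assuming gensim (>= 4.0) as the Word2vec implementation. The corpus file `embroidery_pages.txt`, the scored word pairs, and the term list are hypothetical placeholders for illustration; the paper's actual search-result filtering and evaluation data are not reproduced here.

```python
# Minimal sketch: train a small domain Word2vec model and score it with the
# two metrics named in the abstract (word-pair similarity, term coverage).
# Assumes gensim >= 4.0 and scipy; all file names and data are hypothetical.
from gensim.models import Word2Vec
from scipy.stats import spearmanr

# 1. Train on tokenized sentences from the filtered domain web pages.
with open("embroidery_pages.txt", encoding="utf-8") as f:  # placeholder corpus
    sentences = [line.split() for line in f]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=5, sg=1, epochs=10)

# 2. Word-pair similarity: Spearman correlation between human similarity
#    scores and the model's cosine similarities on the evaluation dataset.
pairs = [("embroidery", "stitch", 8.5), ("thread", "needle", 7.9)]  # invented scores
human, predicted = [], []
for w1, w2, score in pairs:
    if w1 in model.wv and w2 in model.wv:  # skip pairs with out-of-vocabulary words
        human.append(score)
        predicted.append(model.wv.similarity(w1, w2))
rho, _ = spearmanr(human, predicted)
print("word-pair similarity (Spearman rho):", rho)

# 3. In-domain term coverage: fraction of domain terms in the model vocabulary.
terms = ["embroidery", "stitch", "applique"]  # placeholder term set
coverage = sum(t in model.wv for t in terms) / len(terms)
print("in-domain term coverage:", coverage)
```

In this sketch, tuning would amount to re-training with different hyperparameters (window size, dimensionality, minimum count) and keeping the configuration that maximizes the two metrics.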
Copyright information
© 2018 Springer Nature Switzerland AG
Cite this paper
Lu, Y., Yu, S., Shi, M., Li, C. (2018). Extract Knowledge from Web Pages in a Specific Domain. In: Liu, W., Giunchiglia, F., Yang, B. (eds) Knowledge Science, Engineering and Management. KSEM 2018. Lecture Notes in Computer Science(), vol 11061. Springer, Cham. https://doi.org/10.1007/978-3-319-99365-2_10
DOI: https://doi.org/10.1007/978-3-319-99365-2_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99364-5
Online ISBN: 978-3-319-99365-2
eBook Packages: Computer Science (R0)