Syntactic N-grams as machine learning features for natural language processing

Published: 01 February 2014

Abstract

In this paper we introduce and discuss the concept of syntactic n-grams (sn-grams). Sn-grams differ from traditional n-grams in how they are constructed, i.e., in which elements are considered neighbors. In sn-grams, neighbors are determined by following syntactic relations in syntactic trees rather than by taking words in the order they appear in a text; that is, sn-grams are constructed by following paths in syntactic trees. In this way, sn-grams bring syntactic knowledge into machine learning methods, although the text must be parsed before they can be constructed. Sn-grams can be applied in any natural language processing (NLP) task where traditional n-grams are used. We describe how sn-grams were applied to authorship attribution. As a baseline we used traditional n-grams of words, part-of-speech (POS) tags, and characters; three classifiers were applied: support vector machines (SVM), naive Bayes (NB), and the J48 tree classifier. Sn-grams give better results with the SVM classifier.
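
The difference between the two constructions can be illustrated with a minimal sketch. It is not the authors' implementation: the sentence, the hand-written dependency tree `TREE`, and the helper names `linear_ngrams` and `sn_grams` are all assumptions made for illustration, and the sketch assumes that paths run from a head word down to its dependents and that every word in the sentence is unique.

```python
from typing import Dict, List, Tuple

# Hypothetical example sentence and a hand-written dependency parse of it.
# In practice the tree would come from a parser; here it is an assumption.
TOKENS: List[str] = ["the", "quick", "fox", "eats", "red", "apples"]

# head word -> dependents (words double as node ids, so they must be unique)
TREE: Dict[str, List[str]] = {
    "eats": ["fox", "apples"],
    "fox": ["the", "quick"],
    "apples": ["red"],
}


def linear_ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Traditional n-grams: neighbors are adjacent words in the text."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def sn_grams(tree: Dict[str, List[str]], n: int) -> List[Tuple[str, ...]]:
    """Paths of n words that follow head -> dependent arcs in the tree."""

    def paths_from(node: str, length: int) -> List[Tuple[str, ...]]:
        if length == 1:
            return [(node,)]
        return [
            (node,) + rest
            for child in tree.get(node, [])
            for rest in paths_from(child, length - 1)
        ]

    # Every word in the tree, whether it is a head or only a dependent.
    nodes = set(tree) | {dep for deps in tree.values() for dep in deps}
    return sorted(path for node in nodes for path in paths_from(node, n))


if __name__ == "__main__":
    print(linear_ngrams(TOKENS, 2))
    print(sn_grams(TREE, 2))
    print(sn_grams(TREE, 3))
```

In this toy sentence the linear bigram ("eats", "red") pairs two words that are adjacent in the text but not directly related in the tree, whereas the path-based bigram ("eats", "apples") links the verb to its object even though another word stands between them.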

References

[1]
Applying authorship analysis to extremist-group web forum messages. IEEE Intelligent Systems. v20 i5. 67-75.
[2]
Agarwal, A., Biadsy, F., & McKeown, K. (2009). Contextual phrase-level polarity analysis using lexical affect scoring and syntactic N-grams. In Proceedings of 12th conference of the European chapter of the ACL (EACL) (pp. 24-32).
[3]
Argamon, S., & Juola, P. (2011). Overview of the international authorship identification competition at PAN-2011. In Proceedings of fifth international workshop on uncovering plagiarism, authorship, and social software misuse.
[4]
Outside the cave of shadows: Using syntactic annotation to enhance authorship attribution. Literary and Linguistic Computing. 121-131.
[5]
de Marneffe, M., MacCartney, B., & Manning, C. (2006). Generating typed dependency parses from phrase structure parses. In Proceedings of LREC.
[6]
Authorship attribution with support vector machines. Applied Intelligence. v19 i1. 109-123.
[7]
Escalante, H., Solorio, T., & Montes-y-Gomez, M. (2011). Local histograms of character n-grams for authorship attribution. In Proceedings of 49th annual meeting of the association for computational linguistics (pp. 288-298).
[8]
Finding maximal sequential patterns in text document collections and single documents. Informatica. v34 i1. 93-101.
[9]
Quantitative authorship attribution: An evaluation of techniques. Literary and Linguistic Computing. v22 i3. 251-270.
[10]
Guthrie, D., Allison, B., Liu, W., Guthrie, L., & Wilks, Y. (2006). A closer look at skip-gram modelling. In Proceedings of LREC.
[11]
The use of a structural N-gram language model in generation-heavy hybrid machine translation. LNCS. v3123. 61-69.
[12]
The WEKA data mining software: An update. SIGKDD Explorations. v11 i1.
[13]
The evolution of stylometry in humanities scholarship. Literary and Linguistic Computing. v13 i3. 111-117.
[14]
Juola, P. (2004). Ad-hoc authorship attribution competition. In Proceedings of the joint conference of the association for computers and the humanities and the association for literary and linguistic computing (pp. 175-176).
[15]
Authorship attribution. Foundations and Trends in Information Retrieval. v1 i3. 233-334.
[16]
N-gram-based author profiles for authorship attribution. Computational Linguistics. 225-264.
[17]
Khalilov, M., & Fonollosa, J. (2009). N-gram-based statistical machine translation versus syntax augmented machine translation: Comparison and system combination. In Proceedings of 12th conference of the European chapter of the ACL (pp. 424-432).
[18]
Authorship attribution in the wild. Language Resources and Evaluation. v45 i1. 83-94.
[19]
Measuring differentiability: Unmasking pseudonymous authors. Journal of Machine Learning Research. 1261-1276.
[20]
Luyckx, K. (2010). Scalability Issues in Authorship Attribution. Ph.D. thesis, University of Antwerp.
[21]
Machine learning in automated text categorization. ACM Computing Surveys. v34 i1.
[22]
Sidorov, G., Velasquez, F., Stamatatos, E., Gelbukh, A., & Chanona-Hernandez, L. (2012). Syntactic dependency-based N-grams as classification features. In Proceedings of MICAI: Vol. 7630. LNAI (pp. 1-11).
[23]
Sidorov, G., Velasquez, F., Stamatatos, E., Gelbukh, A., & Chanona-Hernandez, L. (2013). Syntactic dependency-based N-grams: More evidence of usefulness in classification. In Proceedings of CICLing: Vol. 7816. LNCS (pp. 13-24).
[24]
A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology. v60 i3. 538-556.
[25]
Author verification by linguistic profiling: An exploration of the parameter space. ACM Transactions on Speech and Language Processing. v4 i1. 1-17.

      Reviews

      Jun Ping Ng

      Traditionally, n-grams are derived by extracting groups of words as they appear in a text. This paper describes a new way to formulate n-grams, referred to as syntactic n-grams (sn-grams). Sn-grams are made by extracting groups of words not based on how they appear in a text, but rather on how they are presented in grammar parse trees. Derivations from both constituent grammar parses and typed dependency parses are possible. This difference in derivation is interesting and can potentially have an impact on how n-grams are used.

      N-grams have found plenty of uses. In fact, they are the mainstay of many language models. However, they suffer from two main problems. The first is that of data sparseness, especially in cases where insufficient data is available. The other problem involves stop words, or words that do not hold much meaning. Because of the way n-grams are derived, stop words can make their way into n-grams. The use of sn-grams can potentially overcome, or at least alleviate, both problems.

      The authors apply sn-grams to the problem of author attribution, the process of identifying the author of a piece of text. Comparing an approach based on sn-grams to one using traditional n-grams, the paper shows that sn-grams demonstrate better performance.

      It would have been more interesting if the authors had compared sn-grams to other related technologies, such as path features and string kernels, for example. Though not exactly the same, these are common ways to use syntactic information. Furthermore, while sn-grams outperform the use of n-grams for author attribution, the case for the superiority of sn-grams would have been more convincing if either a more state-of-the-art approach to the problem had been used as the comparative baseline [1], or the authors had chosen another problem that better highlights the value of the sn-gram approach. Many advanced approaches to author attribution have been applied with good success (such as the use of topic models).

      I would suggest that the authors compare the performance of sn-grams to that of conditional random fields (CRF) [2] in a future work. Since the typical CRF classifier uses n-grams, it would be quite exciting if they could show that sn-grams can boost the performance of these CRF classifiers.

      I definitely found this paper interesting to read. The idea of syntactic n-grams has the potential to be very useful. The paper is also clearly written, and the approach is adequately explained.

      Online Computing Reviews Service

      Published In

      Expert Systems with Applications: An International Journal, Volume 41, Issue 3
      February 2014
      150 pages

      Publisher

      Pergamon Press, Inc.

      United States

      Author Tags

      1. Authorship attribution
      2. Classification features
      3. J48
      4. NB
      5. Parsing
      6. SVM
      7. Syntactic n-grams
      8. Syntactic paths
      9. sn-Grams

      Qualifiers

      • Article

      Cited By

      • (2024) Multization: Multi-Modal Summarization Enhanced by Multi-Contextually Relevant and Irrelevant Attention Alignment. ACM Transactions on Asian and Low-Resource Language Information Processing 23:5 (1-29). DOI: 10.1145/3651983. Online publication date: 10-May-2024
      • (2023) Game—FloraPark (the Flower Game). INFORMS Transactions on Education 24:1 (105-117). DOI: 10.1287/ited.2022.0035. Online publication date: 1-Sep-2023
      • (2023) A Framework for Actor-Oriented Automated Hate Speech Detection. Proceedings of the 2023 12th International Conference on Software and Computer Applications (283-289). DOI: 10.1145/3587828.3587870. Online publication date: 23-Feb-2023
      • (2023) PolyHope. Expert Systems with Applications: An International Journal 225:C. DOI: 10.1016/j.eswa.2023.120078. Online publication date: 1-Sep-2023
      • (2022) CLSA-CapsNet. Journal of Intelligent & Fuzzy Systems: Applications in Engineering and Technology 43:1 (107-123). DOI: 10.3233/JIFS-211321. Online publication date: 1-Jan-2022
      • (2022) Improving the performance of code vulnerability prediction using abstract syntax tree information. Proceedings of the 18th International Conference on Predictive Models and Data Analytics in Software Engineering (2-11). DOI: 10.1145/3558489.3559066. Online publication date: 7-Nov-2022
      • (2022) A text as unique as a fingerprint. Expert Systems with Applications: An International Journal 203:C. DOI: 10.1016/j.eswa.2022.117280. Online publication date: 1-Oct-2022
      • (2022) Similarity-based approach for inventive design solutions assistance. Journal of Intelligent Manufacturing 33:6 (1681-1698). DOI: 10.1007/s10845-021-01749-4. Online publication date: 1-Aug-2022
      • (2021) Design and Analysis of a Novel Authorship Verification Framework for Hijacked Social Media Accounts Compromised by a Human. Security and Communication Networks 2021. DOI: 10.1155/2021/8869681. Online publication date: 1-Jan-2021
      • (2021) Research on Data Generation Model Based on Improved SeqGAN. Proceedings of the 2021 10th International Conference on Software and Computer Applications (45-50). DOI: 10.1145/3457784.3457791. Online publication date: 23-Feb-2021
