Syntactic N-grams as machine learning features for natural language processing

Authors: Grigori Sidorov, Francisco Velasquez, Efstathios Stamatatos, Alexander Gelbukh, Liliana Chanona-Hernández

Expert Systems with Applications: An International Journal, Volume 41, Issue 3

Pages 853 - 860

https://doi.org/10.1016/j.eswa.2013.08.015

Published: 01 February 2014

Abstract

In this paper we introduce and discuss the concept of syntactic n-grams (sn-grams). Sn-grams differ from traditional n-grams in how they are constructed, i.e., in which elements are considered neighbors. In sn-grams, neighbors are taken by following syntactic relations in syntactic trees rather than by taking words in the order they appear in a text; that is, sn-grams are constructed by following paths in syntactic trees. In this way, sn-grams bring syntactic knowledge into machine learning methods, although the text must be parsed beforehand to construct them. Sn-grams can be applied in any natural language processing (NLP) task where traditional n-grams are used. We describe how sn-grams were applied to authorship attribution. As a baseline we used traditional n-grams of words, part-of-speech (POS) tags, and characters; three classifiers were applied: support vector machines (SVM), naive Bayes (NB), and the J48 tree classifier. Sn-grams give better results with the SVM classifier.
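
The difference between the two kinds of n-grams can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy sentence, the hand-written dependency pairs, and the function names `linear_ngrams` and `syntactic_ngrams` are all assumptions made for illustration, and the head-to-dependent reading of "following paths in syntactic trees" is one plausible interpretation of the abstract. No parser is called; a real pipeline would take the relations from a syntactic parser.

```python
from collections import Counter

# Hypothetical, hand-written dependency parse of the toy sentence
# "the quick fox eats fresh fish", given as (head, dependent) pairs.
# In practice these relations would come from a syntactic parser.
SENTENCE = ["the", "quick", "fox", "eats", "fresh", "fish"]
DEPENDENCIES = [
    ("eats", "fox"),
    ("eats", "fish"),
    ("fox", "the"),
    ("fox", "quick"),
    ("fish", "fresh"),
]


def linear_ngrams(tokens, n):
    """Traditional n-grams: n tokens that are adjacent in the text."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def syntactic_ngrams(dependencies, n):
    """sn-grams: every head-to-dependent path of n words in the tree.

    Simplification: words double as node ids, so each word must occur
    only once per sentence; real code would key nodes on token position.
    """
    children = {}
    for head, dependent in dependencies:
        children.setdefault(head, []).append(dependent)

    def paths_from(node, length):
        # A path of length 1 is the node itself; longer paths extend
        # downward through each child in turn.
        if length == 1:
            return [(node,)]
        return [
            (node,) + rest
            for child in children.get(node, [])
            for rest in paths_from(child, length - 1)
        ]

    nodes = {word for pair in dependencies for word in pair}
    return [path for node in sorted(nodes) for path in paths_from(node, n)]


if __name__ == "__main__":
    print(linear_ngrams(SENTENCE, 2))
    print(syntactic_ngrams(DEPENDENCIES, 2))
    # Counts per sn-gram, i.e. the kind of feature vector a classifier uses.
    print(Counter(syntactic_ngrams(DEPENDENCIES, 3)))
```

In this toy parse, the syntactic bigram ("eats", "fish") links a verb to its object even though the two words are not adjacent in the text, while the linear bigram ("eats", "fresh") has no syntactic counterpart.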

References

[1] Applying authorship analysis to extremist-group web forum messages. IEEE Intelligent Systems. v20 i5. 67-75.

[2] Agarwal, A., Biadsy, F., & McKeown, K. (2009). Contextual phrase-level polarity analysis using lexical affect scoring and syntactic N-grams. In Proceedings of 12th conference of the European chapter of the ACL (EACL) (pp. 24-32).

[3] Argamon, S., & Juola, P. (2011). Overview of the international authorship identification competition at PAN-2011. In Proceedings of fifth international workshop on uncovering plagiarism, authorship, and social software misuse.

[4] Outside the cave of shadows: Using syntactic annotation to enhance authorship attribution. Literary and Linguistic Computing. 121-131.

[5] de Marneffe, M., MacCartney, B., & Manning, C. (2006). Generating typed dependency parses from phrase structure parses. In Proceedings of LREC.

[6] Authorship attribution with support vector machines. Applied Intelligence. v19 i1. 109-123.

[7] Escalante, H., Solorio, T., & Montes-y-Gomez, M. (2011). Local histograms of character n-grams for authorship attribution. In Proceedings of 49th annual meeting of the association for computational linguistics (pp. 288-298).

[8] Finding maximal sequential patterns in text document collections and single documents. Informatica. v34 i1. 93-101.

[9] Quantitative authorship attribution: An evaluation of techniques. Literary and Linguistic Computing. v22 i3. 251-270.

[10] Guthrie, D., Allison, B., Liu, W., Guthrie, L., & Wilks, Y. (2006). A closer look at skip-gram modelling. In Proceedings of LREC.

[11] The use of a structural N-gram language model in generation-heavy hybrid machine translation. LNCS. v3123. 61-69.

[12] The WEKA data mining software: An update. SIGKDD Explorations. v11 i1.

[13] The evolution of stylometry in humanities scholarship. Literary and Linguistic Computing. v13 i3. 111-117.

[14] Juola, P. (2004). Ad-hoc authorship attribution competition. In Proceedings of the joint conference of the association for computers and the humanities and the association for literary and linguistic computing (pp. 175-176).

[15] Authorship attribution. Foundations and Trends in Information Retrieval. v1 i3. 233-334.

[16] N-gram-based author profiles for authorship attribution. Computational Linguistics. 225-264.

[17] Khalilov, M., & Fonollosa, J. (2009). N-gram-based statistical machine translation versus syntax augmented machine translation: Comparison and system combination. In Proceedings of 12th conference of the European chapter of the ACL (pp. 424-432).

[18] Authorship attribution in the wild. Language Resources and Evaluation. v45 i1. 83-94.

[19] Measuring differentiability: Unmasking pseudonymous authors. Journal of Machine Learning Research. 1261-1276.

[20] Luyckx, K. (2010). Scalability Issues in Authorship Attribution. Ph.D. thesis, University of Antwerp.

[21] Machine learning in automated text categorization. ACM Computing Surveys. v34 i1.

[22] Sidorov, G., Velasquez, F., Stamatatos, E., Gelbukh, A., & Chanona-Hernandez, L. (2012). Syntactic dependency-based N-grams as classification features. In Proceedings of MICAI: Vol. 7630. LNAI (pp. 1-11).

[23] Sidorov, G., Velasquez, F., Stamatatos, E., Gelbukh, A., & Chanona-Hernandez, L. (2013). Syntactic dependency-based N-grams: More evidence of usefulness in classification. In Proceedings of CICLing: Vol. 7816. LNCS (pp. 13-24).

[24] A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology. v60 i3. 538-556.

[25] Author verification by linguistic profiling: An exploration of the parameter space. ACM Transactions on Speech and Language Processing. v4 i1. 1-17.

Cited By

Rong H, Chen Z, Lu Z, Xu F, Sheng V (2024). Multization: Multi-Modal Summarization Enhanced by Multi-Contextually Relevant and Irrelevant Attention Alignment. ACM Transactions on Asian and Low-Resource Language Information Processing 23:5 (1-29). DOI: 10.1145/3651983. Online publication date: 10-May-2024.
https://dl.acm.org/doi/10.1145/3651983
Zhao Y, Park A, Rudna O, Song J (2023). Game—FloraPark (the Flower Game). INFORMS Transactions on Education 24:1 (105-117). DOI: 10.1287/ited.2022.0035. Online publication date: 1-Sep-2023.
https://dl.acm.org/doi/10.1287/ited.2022.0035
Cahyana R, Maulidevi N, Surendro K (2023). A Framework for Actor-Oriented Automated Hate Speech Detection. Proceedings of the 2023 12th International Conference on Software and Computer Applications (283-289). DOI: 10.1145/3587828.3587870. Online publication date: 23-Feb-2023.
https://dl.acm.org/doi/10.1145/3587828.3587870

Index Terms

Syntactic N-grams as machine learning features for natural language processing
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
  2. Machine learning

Recommendations

Syntactic dependency-based n-grams as classification features
MICAI'12: Proceedings of the 11th Mexican international conference on Advances in Computational Intelligence - Volume Part II

In this paper we introduce a concept of syntactic n-grams (sn-grams). Sn-grams differ from traditional n-grams in the manner of what elements are considered neighbors. In case of sn-grams, the neighbors are taken by following syntactic relations in ...

Syntactic dependency-based n-grams: more evidence of usefulness in classification
CICLing'13: Proceedings of the 14th international conference on Computational Linguistics and Intelligent Text Processing - Volume Part I

The paper introduces and discusses a concept of syntactic n-grams (sn-grams) that can be applied instead of traditional n-grams in many NLP tasks. Sn-grams are constructed by following paths in syntactic trees, so sn-grams allow bringing syntactic ...

A Unique Word Prediction System for Text Entry in Hindi
ICTCS '16: Proceedings of the Second International Conference on Information and Communication Technology for Competitive Strategies

Word prediction is a very effective technique for improving the efficiency of entering text. Current word prediction systems predict a word if and only if the user has not made a mistake in the first few characters of the word. This is more applicable for ...

Reviews

Reviewer: Jun Ping Ng

Traditionally, n-grams are derived by extracting groups of words as they appear in a text. This paper describes a new way to formulate n-grams, referred to as syntactic n-grams (sn-grams). Sn-grams are made by extracting groups of words not based on how they appear in a text, but rather on how they are presented in grammar parse trees. Derivations from both constituent grammar parses and typed dependency parses are possible. This difference in derivation is interesting and can potentially have an impact on how n-grams are used.

N-grams have found plenty of uses. In fact, they are the mainstay of many language models. However, they suffer from two main problems. The first is that of data sparseness, especially in cases where insufficient data is available. The other problem involves stop words, or words that do not hold much meaning. Because of the way n-grams are derived, stop words can make their way into n-grams. The use of sn-grams can potentially overcome, or at least alleviate, both problems.

The authors apply sn-grams to the problem of author attribution, the process of identifying the author of a piece of text. Comparing an approach based on sn-grams to one using traditional n-grams, the paper shows that sn-grams demonstrate better performance.

It would have been more interesting if the authors had compared sn-grams to other related technologies, such as path features and string kernels, for example. Though not exactly the same, these are common ways to use syntactic information. Furthermore, while sn-grams outperform the use of n-grams for author attribution, the case for the superiority of sn-grams would have been more convincing if either a more state-of-the-art approach to the problem had been used as the comparative baseline [1], or the authors had chosen another problem that better highlights the value of the sn-gram approach. Many advanced approaches to author attribution have been applied with good success (such as the use of topic models).

I would suggest that the authors compare the performance of sn-grams to that of conditional random fields (CRF) [2] in a future work. Since the typical CRF classifier uses n-grams, it would be quite exciting if they could show that sn-grams can boost the performance of these CRF classifiers. I definitely found this paper interesting to read. The idea of syntactic n-grams has the potential to be very useful. The paper is also clearly written, and the approach is adequately explained.

Online Computing Reviews Service

Information & Contributors

Information

Published In

Expert Systems with Applications: An International Journal Volume 41, Issue 3

February, 2014

150 pages

ISSN:0957-4174

Publisher

Pergamon Press, Inc.

United States

Qualifiers

Article

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 51
Total Downloads: 0
Downloads (Last 12 months): 0
Downloads (Last 6 weeks): 0

Reflects downloads up to 30 Aug 2024

Citations

Cited By

Balouchzahi F, Sidorov G, Gelbukh A (2023). PolyHope. Expert Systems with Applications: An International Journal 225:C. DOI: 10.1016/j.eswa.2023.120078. Online publication date: 1-Sep-2023.
https://dl.acm.org/doi/10.1016/j.eswa.2023.120078
Mahendhiran P, Subramanian K (2022). CLSA-CapsNet. Journal of Intelligent & Fuzzy Systems: Applications in Engineering and Technology 43:1 (107-123). DOI: 10.3233/JIFS-211321. Online publication date: 1-Jan-2022.
https://dl.acm.org/doi/10.3233/JIFS-211321
Al Debeyan F, Hall T, Bowes D, McIntosh S, Shang W, Perez G (2022). Improving the performance of code vulnerability prediction using abstract syntax tree information. Proceedings of the 18th International Conference on Predictive Models and Data Analytics in Software Engineering (2-11). DOI: 10.1145/3558489.3559066. Online publication date: 7-Nov-2022.
https://dl.acm.org/doi/10.1145/3558489.3559066
Rocha M, Morais P, Barros D, Santos J, Dias-Trindade S, Valentim R (2022). A text as unique as a fingerprint. Expert Systems with Applications: An International Journal 203:C. DOI: 10.1016/j.eswa.2022.117280. Online publication date: 1-Oct-2022.
https://dl.acm.org/doi/10.1016/j.eswa.2022.117280
Ni X, Samet A, Cavallucci D (2022). Similarity-based approach for inventive design solutions assistance. Journal of Intelligent Manufacturing 33:6 (1681-1698). DOI: 10.1007/s10845-021-01749-4. Online publication date: 1-Aug-2022.
https://dl.acm.org/doi/10.1007/s10845-021-01749-4
Alterkavı S, Erbay H (2021). Design and Analysis of a Novel Authorship Verification Framework for Hijacked Social Media Accounts Compromised by a Human. Security and Communication Networks 2021. DOI: 10.1155/2021/8869681. Online publication date: 1-Jan-2021.
https://dl.acm.org/doi/10.1155/2021/8869681
Dou J, Qie S, Lu J, Ren Y (2021). Research on Data Generation Model Based on Improved SeqGAN. Proceedings of the 2021 10th International Conference on Software and Computer Applications (45-50). DOI: 10.1145/3457784.3457791. Online publication date: 23-Feb-2021.
https://dl.acm.org/doi/10.1145/3457784.3457791