Writeprints: A stylometric approach to identity-level identification and similarity detection in cyberspace

Authors:

Hsinchun Chen

ACM Transactions on Information Systems (TOIS), Volume 26, Issue 2

Article No.: 7, Pages 1 - 29

https://doi.org/10.1145/1344411.1344413

Published: 08 April 2008

Abstract

One of the problems often associated with online anonymity is that it hinders social accountability, as substantiated by the high levels of cybercrime. Although identity cues are scarce in cyberspace, individuals often leave behind textual identity traces. In this study we proposed the use of stylometric analysis techniques to help identify individuals based on writing style. We incorporated a rich set of stylistic features, including lexical, syntactic, structural, content-specific, and idiosyncratic attributes. We also developed the Writeprints technique for identification and similarity detection of anonymous identities. Writeprints is a Karhunen-Loeve transforms-based technique that uses a sliding window and pattern disruption algorithm with individual author-level feature sets. The Writeprints technique and extended feature set were evaluated on a testbed encompassing four online datasets spanning different domains: email, instant messaging, feedback comments, and program code. Writeprints outperformed benchmark techniques, including SVM, Ensemble SVM, PCA, and standard Karhunen-Loeve transforms, on the identification and similarity detection tasks with accuracy as high as 94% when differentiating between 100 authors. The extended feature set also significantly outperformed a baseline set of features commonly used in previous research. Furthermore, individual-author-level feature sets generally outperformed use of a single group of attributes.
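
The abstract names two ingredients, a sliding window over an author's text and a Karhunen-Loeve transform of the resulting feature vectors. The following is a minimal sketch of how those two pieces might fit together, not the authors' implementation. The four toy features, the window and step sizes, and the function names `window_features` and `kl_transform` are all assumptions made for illustration, and the pattern disruption step and author-level feature selection are not modelled.

```python
import numpy as np

def window_features(text, window=30, step=10):
    """Slide a fixed-size word window over `text` and return one feature row
    per window. The four features are illustrative stand-ins (assumed here),
    not the extended feature set described in the abstract."""
    words = text.split()
    rows = []
    for start in range(0, max(len(words) - window, 0) + 1, step):
        chunk = words[start:start + window]
        if not chunk:
            continue
        n = len(chunk)
        rows.append([
            np.mean([len(w) for w in chunk]),          # average word length
            len({w.lower() for w in chunk}) / n,       # type-token ratio
            sum(w[-1] in ",.;:!?" for w in chunk) / n, # trailing punctuation rate
            sum(w[0].isupper() for w in chunk) / n,    # capitalized-word rate
        ])
    return np.array(rows)

def kl_transform(X, k=2):
    """Project the rows of X onto the top-k eigenvectors of their covariance
    matrix (the Karhunen-Loeve basis). Needs at least two rows."""
    mean = X.mean(axis=0)
    centered = X - mean
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]       # indices of the k largest
    basis = eigvecs[:, order]
    return centered @ basis, basis, mean
```

Under these assumptions, each window of an author's text becomes one point in the reduced space, and the cloud of points for an author serves as a rough stand-in for a "writeprint" pattern. An anonymous text could be projected with the same `basis` and `mean` and compared against each author's cloud.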

Index Terms

Writeprints: A stylometric approach to identity-level identification and similarity detection in cyberspace
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Language resources
  2. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        Cluster analysis

Published In

ACM Transactions on Information Systems Volume 26, Issue 2

March 2008

214 pages

ISSN:1046-8188

EISSN:1558-2868

DOI:10.1145/1344411

Copyright © 2008 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 April 2008

Accepted: 01 May 2007

Revised: 01 May 2007

Received: 01 November 2006

Published in TOIS Volume 26, Issue 2

Qualifiers

Research-article
Research
Refereed
