research-article

Writeprints: A stylometric approach to identity-level identification and similarity detection in cyberspace

Published: 08 April 2008

Abstract

One of the problems often associated with online anonymity is that it hinders social accountability, as substantiated by the high levels of cybercrime. Although identity cues are scarce in cyberspace, individuals often leave behind textual identity traces. In this study we proposed the use of stylometric analysis techniques to help identify individuals based on writing style. We incorporated a rich set of stylistic features, including lexical, syntactic, structural, content-specific, and idiosyncratic attributes. We also developed the Writeprints technique for identification and similarity detection of anonymous identities. Writeprints is a Karhunen-Loeve transforms-based technique that uses a sliding window and pattern disruption algorithm with individual author-level feature sets. The Writeprints technique and extended feature set were evaluated on a testbed encompassing four online datasets spanning different domains: email, instant messaging, feedback comments, and program code. Writeprints outperformed benchmark techniques, including SVM, Ensemble SVM, PCA, and standard Karhunen-Loeve transforms, on the identification and similarity detection tasks with accuracy as high as 94% when differentiating between 100 authors. The extended feature set also significantly outperformed a baseline set of features commonly used in previous research. Furthermore, individual-author-level feature sets generally outperformed use of a single group of attributes.
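
The combination of a sliding window with a Karhunen-Loeve transform mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the five toy features, the window and step sizes, and the function names `window_features`, `kl_basis`, and `project` are all assumptions made for illustration, and the pattern disruption step and author-level feature selection are omitted entirely. The sketch only shows how per-window feature vectors could be projected onto the leading eigenvectors of their covariance matrix.

```python
import numpy as np

def window_features(text, window=200, step=100):
    """Compute a toy feature vector for each sliding window of characters.

    The five features below are illustrative stand-ins, not the paper's
    extended feature set.
    """
    rows = []
    for start in range(0, max(len(text) - window, 0) + 1, step):
        chunk = text[start:start + window]
        words = chunk.split()
        n = max(len(chunk), 1)
        rows.append([
            sum(c.isupper() for c in chunk) / n,       # uppercase ratio
            sum(c.isdigit() for c in chunk) / n,       # digit ratio
            sum(c in ".,;:!?" for c in chunk) / n,     # punctuation ratio
            sum(c.isspace() for c in chunk) / n,       # whitespace ratio
            (sum(len(w) for w in words) / len(words)) if words else 0.0,
        ])
    return np.array(rows, dtype=float)

def kl_basis(X, k=2):
    """Return the feature mean and the top-k eigenvectors of the covariance."""
    mean = X.mean(axis=0)
    cov = np.cov(X - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]    # indices of the k largest
    return mean, eigvecs[:, order]

def project(X, mean, basis):
    """Project window feature vectors into the k-dimensional KL space."""
    return (X - mean) @ basis
```

In such a setup, each row of the projected matrix is one window's position in the reduced space, so the windows from one author trace out a pattern that could then be compared against patterns built from other identities.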




      Published In

      ACM Transactions on Information Systems, Volume 26, Issue 2
      March 2008
      214 pages
      ISSN: 1046-8188
      EISSN: 1558-2868
      DOI: 10.1145/1344411
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 08 April 2008
      Accepted: 01 May 2007
      Revised: 01 May 2007
      Received: 01 November 2006
      Published in TOIS Volume 26, Issue 2

      Author Tags

      1. Stylometry
      2. discourse
      3. online text
      4. style classification
      5. text mining

      Qualifiers

      • Research-article
      • Research
      • Refereed

      Bibliometrics

      • Downloads (last 12 months): 98
      • Downloads (last 6 weeks): 3
      Reflects downloads up to 11 Feb 2025

      Cited By

      • (2024) When Automated Assessment Meets Automated Content Generation: Examining Text Quality in the Era of GPTs. ACM Transactions on Information Systems. DOI: 10.1145/3702639. Online publication date: 5-Nov-2024
      • (2024) Stylometry-based Fake News Classification Using Text Mining Techniques. Proceedings of the 2024 11th Multidisciplinary International Social Networks Conference (85-94). DOI: 10.1145/3675669.3675682. Online publication date: 21-Aug-2024
      • (2024) Authorship Identification System Using Word2Vec Word Embedding Model. 2024 IEEE Conference on Computer Applications (ICCA) (1-9). DOI: 10.1109/ICCA62361.2024.10533018. Online publication date: 16-Mar-2024
      • (2024) Towards Performance Improvement of Authorship Attribution. IEEE Access 12 (77054-77064). DOI: 10.1109/ACCESS.2024.3407673. Online publication date: 2024
      • (2024) Authorship analysis of three Jordanian columnists: is there a linguistic fingerprint? Cogent Arts & Humanities 11:1. DOI: 10.1080/23311983.2024.2434345. Online publication date: 3-Dec-2024
      • (2024) Analysing the email data using stylometric method and deep learning to mitigate phishing attack. International Journal of Information Technology. DOI: 10.1007/s41870-024-01839-5. Online publication date: 5-May-2024
      • (2024) Towards Explainable Authorship Verification: An Approach to Minimise Academic Misconduct in Higher Education. Artificial Intelligence in Education. Posters and Late Breaking Results, Workshops and Tutorials, Industry and Innovation Tracks, Practitioners, Doctoral Consortium and Blue Sky (87-100). DOI: 10.1007/978-3-031-64315-6_7. Online publication date: 2-Jul-2024
      • (2024) Authorship Attribution for Assamese Language Documents: Initial Results. Advanced Computing, Machine Learning, Robotics and Internet Technologies (232-242). DOI: 10.1007/978-3-031-47224-4_21. Online publication date: 16-Apr-2024
      • (2024) Navigating misinformation in voice messages: Identification of user‐centered features for digital interventions. Risk, Hazards & Crisis in Public Policy 15:2 (203-235). DOI: 10.1002/rhc3.12296. Online publication date: 25-Mar-2024
      • (2023) Rhetoric Mining: A New Text-Analytics Approach for Quantifying Persuasion. INFORMS Journal on Data Science 2:1 (24-44). DOI: 10.1287/ijds.2022.0024. Online publication date: Apr-2023
