Search Results (14)

Search Parameters:
Keywords = stylometry

30 pages, 1001 KiB  
Article
Genre Classification of Books in Russian with Stylometric Features: A Case Study
by Natalia Vanetik, Margarita Tiamanova, Genady Kogan and Marina Litvak
Information 2024, 15(6), 340; https://doi.org/10.3390/info15060340 - 7 Jun 2024
Viewed by 605
Abstract
Within the literary domain, genres function as fundamental organizing concepts that provide readers, publishers, and academics with a unified framework. Genres are discrete categories that are distinguished by common stylistic, thematic, and structural components. They facilitate the categorization process and improve our understanding of a wide range of literary expressions. In this paper, we introduce a new dataset for genre classification of Russian books, covering 11 literary genres. We also perform dataset evaluation for the tasks of binary and multi-class genre identification. Through extensive experimentation and analysis, we explore the effectiveness of different text representations, including stylometric features, in genre classification. Our findings clarify the challenges present in classifying Russian literature by genre, revealing insights into the performance of different models across various genres. Furthermore, we address several research questions regarding the difficulty of multi-class classification compared to binary classification, and the impact of stylometric features on classification accuracy.
(This article belongs to the Special Issue Text Mining: Challenges, Algorithms, Tools and Applications)

24 pages, 1274 KiB  
Article
Significance of Single-Interval Discrete Attributes: Case Study on Two-Level Discretisation
by Urszula Stańczyk, Beata Zielosko and Grzegorz Baron
Appl. Sci. 2024, 14(10), 4088; https://doi.org/10.3390/app14104088 - 11 May 2024
Viewed by 462
Abstract
Supervised discretisation is widely considered far more advantageous than unsupervised transformation of attributes, because it helps to preserve the informative content of a variable, which is useful in classification. After discretisation, based on employed criteria, some attributes can be found irrelevant, and all their values can be represented in a discrete domain by a single interval. In consequence, such attributes are removed from consideration, and no knowledge is mined from them. The paper presents research focused on extended transformations of attribute values, thus combining supervised with unsupervised discretisation strategies. For all variables with single intervals returned from supervised algorithms, the ranges of values were transformed by unsupervised methods with varying numbers of bins. Resulting variants of the data were subjected to selected data mining techniques, and the performance of a group of classifiers was evaluated and compared. The experiments were performed on a stylometric task of authorship attribution.
(This article belongs to the Section Computing and Artificial Intelligence)

32 pages, 2235 KiB  
Article
Importance of Characteristic Features and Their Form for Data Exploration
by Urszula Stańczyk, Beata Zielosko and Grzegorz Baron
Entropy 2024, 26(5), 404; https://doi.org/10.3390/e26050404 - 6 May 2024
Viewed by 871
Abstract
The nature of the input features is one of the key factors indicating what kind of tools, methods, or approaches can be used in a knowledge discovery process. Depending on the characteristics of the available attributes, some techniques could lead to unsatisfactory performance or even may not proceed at all without additional preprocessing steps. The types of variables and their domains affect performance. Any changes to their form can influence it as well, or even enable some learners. On the other hand, the relevance of features for a task constitutes another element with a noticeable impact on data exploration. The importance of attributes can be estimated through the application of mechanisms belonging to the feature selection and reduction area, such as rankings. In the described research framework, the data form was conditioned on relevance by the proposed procedure of gradual discretisation controlled by a ranking of attributes. Supervised and unsupervised discretisation methods were applied to datasets from the stylometric domain and the task of binary authorship attribution. For the selected classifiers, extensive tests were performed, and they indicated many cases of enhanced prediction for partially discretised datasets.

18 pages, 506 KiB  
Article
Morphosyntactic Annotation in Literary Stylometry
by Robert Gorman
Information 2024, 15(4), 211; https://doi.org/10.3390/info15040211 - 9 Apr 2024
Cited by 1 | Viewed by 735
Abstract
This article investigates the stylometric usefulness of morphosyntactic annotation. Focusing on the style of literary texts, it argues that including morphosyntactic annotation in analyses of style has at least two important advantages: (1) maintaining a topic agnostic approach and (2) providing input variables that are interpretable in traditional grammatical terms. This study demonstrates how widely available Universal Dependency parsers can generate useful morphological and syntactic data for texts in a range of languages. These data can serve as the basis for input features that are strongly informative about the style of individual novels, as indicated by accuracy in classification tests. The interpretability of such features is demonstrated by a discussion of the weakness of an “authorial” signal as opposed to the clear distinction among individual works.
(This article belongs to the Special Issue Computational Linguistics and Natural Language Processing)

16 pages, 487 KiB  
Article
The Language of Deception: Applying Findings on Opinion Spam to Legal and Forensic Discourses
by Alibek Jakupov, Julien Longhi and Besma Zeddini
Languages 2024, 9(1), 10; https://doi.org/10.3390/languages9010010 - 22 Dec 2023
Viewed by 2585
Abstract
Digital forensic investigations are becoming increasingly crucial in criminal investigations and civil litigations, especially in cases of corporate espionage and intellectual property theft as more communication occurs online via e-mail and social media. Deceptive opinion spam analysis is an emerging field of research that aims to detect and identify fraudulent reviews, comments, and other forms of deceptive online content. In this paper, we explore how the findings from this field may be relevant to forensic investigation, particularly the features that capture stylistic patterns and sentiments, which are psychologically relevant aspects of truthful and deceptive language. To assess these features’ utility, we demonstrate the potential of our proposed approach using the real-world dataset from the Enron Email Corpus. Our findings suggest that deceptive opinion spam analysis may be a valuable tool for forensic investigators and legal professionals looking to identify and analyze deceptive behavior in online communication. By incorporating these techniques into their investigative and legal strategies, professionals can improve the accuracy and reliability of their findings, leading to more effective and just outcomes.
(This article belongs to the Special Issue New Challenges in Forensic and Legal Linguistics)

14 pages, 278 KiB  
Article
Authorship Attribution on Short Texts in the Slovenian Language
by Gregor Gabrovšek, Peter Peer, Žiga Emeršič and Borut Batagelj
Appl. Sci. 2023, 13(19), 10965; https://doi.org/10.3390/app131910965 - 4 Oct 2023
Viewed by 1161
Abstract
The study investigates the task of authorship attribution on short texts in Slovenian using the BERT language model. Authorship attribution is the task of attributing a written text to its author, frequently using stylometry or computational techniques. We create five custom datasets for different numbers of included text authors and fine-tune two BERT models, SloBERTa and BERT Multilingual (mBERT), to evaluate their performance in closed-class and open-class problems with varying numbers of authors. Our models achieved an F1 score of approximately 0.95 when using the dataset with the comments of the top five users by the number of written comments. Training on datasets that include comments written by an increasing number of people results in models with a gradually decreasing F1 score. Including out-of-class comments in the evaluation decreases the F1 score by approximately 0.05. The study demonstrates the feasibility of using BERT models for authorship attribution in short texts in the Slovenian language.
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)

19 pages, 5568 KiB  
Article
A Scientometric Study of the Stylometric Research Field
by Panagiotis D. Michailidis
Informatics 2022, 9(3), 60; https://doi.org/10.3390/informatics9030060 - 18 Aug 2022
Cited by 5 | Viewed by 2839
Abstract
Stylometry has gained great popularity in digital humanities and social sciences. Many works on stylometry have recently been reported. However, there is a research gap regarding review studies in this field from a bibliometric and evolutionary perspective. Therefore, in this paper, a bibliometric analysis of publications from the Scopus database in the stylometric research field was proposed. Then, research articles published between 1968 and 2021 were collected and analyzed using the Bibliometrix R package for bibliometric analysis via the Biblioshiny web interface. Empirical results were also presented in terms of the performance analysis and the science mapping analysis. From these results, it is concluded that there has been a strong growth in stylometry research in recent years, while the USA, Poland, and the UK are the most productive countries, and this is due to many strong research partnerships. It was also concluded that the research topics of most articles, based on author keywords, focused on two broad thematic categories: (1) the main tasks in stylometry and (2) methodological approaches (statistics and machine learning methods).
(This article belongs to the Special Issue Digital Humanities and Visualization)

18 pages, 849 KiB  
Article
Privacy Issues in Stylometric Methods
by Antonios Patergianakis and Konstantinos Limniotis
Cryptography 2022, 6(2), 17; https://doi.org/10.3390/cryptography6020017 - 7 Apr 2022
Cited by 2 | Viewed by 3220
Abstract
Stylometry is a well-known field, aiming to identify the author of a text, based only on the way she/he writes. Despite its obvious advantages in several areas, such as in historical research or for copyright purposes, it may also yield privacy and personal data protection issues if it is used in specific contexts, without the users being aware of it. It is, therefore, of importance to assess the potential use of stylometry methods, as well as the implications of their use for online privacy protection. This paper aims to present, through relevant experiments, the possibility of the automated identification of a person using stylometry. The ultimate goal is to analyse the risks regarding privacy and personal data protection stemming from the use of stylometric techniques to evaluate the effectiveness of a specific stylometric identification system, as well as to examine whether proper anonymisation techniques can be applied so as to ensure that the identity of an author of a text (e.g., a user in an anonymous social network) remains hidden, even if stylometric methods are to be applied for possible re-identification.
(This article belongs to the Special Issue Privacy-Preserving Techniques in Cloud/Fog and Internet of Things)

27 pages, 1761 KiB  
Article
Parallel Stylometric Document Embeddings with Deep Learning Based Language Models in Literary Authorship Attribution
by Mihailo Škorić, Ranka Stanković, Milica Ikonić Nešić, Joanna Byszuk and Maciej Eder
Mathematics 2022, 10(5), 838; https://doi.org/10.3390/math10050838 - 7 Mar 2022
Cited by 6 | Viewed by 4006
Abstract
This paper explores the effectiveness of parallel stylometric document embeddings in solving the authorship attribution task by testing a novel approach on literary texts in 7 different languages, totaling 7051 unique 10,000-token chunks from 700 PoS and lemma annotated documents. We used these documents to produce four document embedding models using the Stylo R package (word-based, lemma-based, PoS-trigrams-based, and PoS-mask-based) and one document embedding model using mBERT for each of the seven languages. We created further derivations of these embeddings in the form of the average, product, minimum, maximum, and l2 norm of these document embedding matrices and tested them both including and excluding the mBERT-based document embeddings for each language. Finally, we trained several perceptrons on portions of the dataset in order to procure adequate weights for a weighted combination approach. We tested standalone (two baselines) and composite embeddings for classification accuracy, precision, recall, and weighted-average and macro-averaged F1-score, and compared them with one another. We found that for each language most of our composition methods outperform the baselines (with a couple of methods outperforming all baselines for all languages), with or without mBERT inputs, which are found to have no significant positive impact on the results of our methods.

18 pages, 2095 KiB  
Article
Determination of the Features of the Author’s Style of A.S. Pushkin’s Poems by Machine Learning Methods
by Vladimir Barakhnin, Olga Kozhemyakina and Irina Grigorieva
Appl. Sci. 2022, 12(3), 1674; https://doi.org/10.3390/app12031674 - 6 Feb 2022
Cited by 2 | Viewed by 1667
Abstract
This paper presents the study of the author’s style of A.S. Pushkin based on the comparison of his poetic texts with the texts of contemporary poets. The purpose of this study is to determine the features of the author’s style of A.S. Pushkin using machine learning methods. This paper describes the construction of several classifications based on different groups of features, as well as the classification based on a combined set of features from different groups. The quality of all constructed classifications is also analyzed; special attention is paid to the interpretation of the neural network solution and the identification of features of the author’s style.
(This article belongs to the Special Issue Natural Language Processing: Approaches and Applications)

18 pages, 8411 KiB  
Review
Stylometry and Numerals Usage: Benford’s Law and Beyond
by Andrei V. Zenkov
Stats 2021, 4(4), 1051-1068; https://doi.org/10.3390/stats4040060 - 14 Dec 2021
Cited by 2 | Viewed by 2239
Abstract
We suggest two approaches to the statistical analysis of texts, both based on the study of the occurrence of numerals in literary texts. The first approach is related to Benford’s Law and the analysis of the frequency distribution of various leading digits of numerals contained in the text. In coherent literary texts, the share of the leading digit 1 is even larger than prescribed by Benford’s Law and can reach 50 percent. The frequencies of occurrence of the digit 1, as well as, to a lesser extent, the digits 2 and 3, are usually a characteristic feature of the author’s style, manifested in all (sufficiently long) literary texts of any author. This approach is convenient for testing whether a group of texts has common authorship: the latter is dubious if the frequency distributions are sufficiently different. The second approach is an extension of the first one and requires the study of the frequency distribution of the numerals themselves (not their leading digits). The approach yields non-trivial information about the author and the stylistic and genre peculiarities of the texts, and is suited for advanced stylometric analysis. The proposed approaches are illustrated by examples of computer analysis of literary texts in English and Russian.
(This article belongs to the Special Issue Benford's Law(s) and Applications)
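The leading-digit approach summarized in the abstract above can be sketched in a few lines of Python. This is only an illustrative sketch, not the author's implementation: the function names, the regular expression, and the decision to count only numerals written with ASCII digits (ignoring number words such as "twenty") are all assumptions made here.

```python
import math
import re
from collections import Counter


def benford_expected():
    """Benford's Law probability log10(1 + 1/d) for leading digits 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit_shares(text):
    """Share of each leading digit among numerals found in `text`.

    Assumption (hypothetical tokenization): a numeral is a run of ASCII
    digits, optionally with internal '.' or ',' separators such as
    "1,000" or "3.14". The leading digit is the first non-zero digit;
    numerals made only of zeros are skipped.
    """
    tokens = re.findall(r"[0-9]+(?:[.,][0-9]+)*", text)
    leads = []
    for tok in tokens:
        first = next((c for c in tok if c in "123456789"), None)
        if first is not None:
            leads.append(int(first))
    counts = Counter(leads)
    total = len(leads)
    if total == 0:
        return {d: 0.0 for d in range(1, 10)}
    return {d: counts.get(d, 0) / total for d in range(1, 10)}


def max_deviation(shares_a, shares_b):
    """Largest absolute difference between two leading-digit profiles."""
    return max(abs(shares_a[d] - shares_b[d]) for d in range(1, 10))
```

Calling `leading_digit_shares` on each text of a group and comparing the profiles with `max_deviation` (or against `benford_expected()`) gives a rough indication of how similar the leading-digit distributions are. Any threshold for "sufficiently different" would have to be chosen separately; none is given in the abstract.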

18 pages, 847 KiB  
Article
Language-Independent Fake News Detection: English, Portuguese, and Spanish Mutual Features
by Hugo Queiroz Abonizio, Janaina Ignacio de Morais, Gabriel Marques Tavares and Sylvio Barbon Junior
Future Internet 2020, 12(5), 87; https://doi.org/10.3390/fi12050087 - 11 May 2020
Cited by 62 | Viewed by 9114
Abstract
Online Social Media (OSM) have been substantially transforming the process of spreading news, improving its speed, and reducing barriers toward reaching out to a broad audience. However, OSM are very limited in providing mechanisms to check the credibility of news propagated through their structure. The majority of studies on automatic fake news detection are restricted to English documents, with few works evaluating other languages, and none comparing language-independent characteristics. Moreover, the spreading of deceptive news tends to be a worldwide problem; therefore, this work evaluates textual features that are not tied to a specific language when describing textual data for detecting news. Corpora of news written in American English, Brazilian Portuguese, and Spanish were explored to study complexity, stylometric, and psychological text features. The extracted features support the detection of fake, legitimate, and satirical news. We compared four machine learning algorithms (k-Nearest Neighbors (k-NN), Support Vector Machine (SVM), Random Forest (RF), and Extreme Gradient Boosting (XGB)) to induce the detection model. Results show our proposed language-independent features are successful in describing fake, satirical, and legitimate news across three different languages, with an average detection accuracy of 85.3% with RF.
(This article belongs to the Special Issue Social Web, New Media, Algorithms and Power)

750 KiB  
Article
Hierarchical and Non-Hierarchical Linear and Non-Linear Clustering Methods to “Shakespeare Authorship Question”
by Refat Aljumily
Soc. Sci. 2015, 4(3), 758-799; https://doi.org/10.3390/socsci4030758 - 17 Sep 2015
Cited by 8 | Viewed by 7612
Abstract
A few literary scholars have long claimed that Shakespeare did not write some of his best plays (history plays and tragedies) and have proposed, at one time or another, various alternative authorship candidates. Most modern-day scholars of Shakespeare have rejected this claim, arguing that there is strong evidence that Shakespeare wrote the plays and poems, since his name appears on them as the author. This has led to a long-running scholarly debate. Stylometry is a fast-growing field often used to attribute authorship to anonymous or disputed texts. Stylometric attempts to resolve this literary puzzle have raised interesting questions over the past few years. The following paper contributes to “the Shakespeare authorship question” by using a mathematically-based methodology to examine the hypothesis that Shakespeare wrote all the disputed plays traditionally attributed to him. More specifically, the methodology used here is based on Mean Proximity, as a linear hierarchical clustering method, and on Principal Components Analysis, as a non-hierarchical linear clustering method. It is also based, for the first time in the domain, on Self-Organizing Map U-Matrix and Voronoi Map, as non-linear clustering methods, to cover the possibility that our data contains significant non-linearities. Vector Space Model (VSM) is used to convert texts into vectors in a high-dimensional space. The aim is to compare the degrees of similarity within and between limited samples of text (the disputed plays): the various works and plays assumed to have been written by Shakespeare and by possible authors, notably Sir Francis Bacon, Christopher Marlowe, John Fletcher, and Thomas Kyd. Here “similarity” is defined in terms of a correlation/distance coefficient measure based on the frequency-of-usage profiles of function words, word bi-grams, and character triple-grams.
The claim that Shakespeare authored all the disputed plays traditionally attributed to him is falsified in favor of the alternative authors according to the stylistic criteria and analytic methodology used. The result of this validated analysis is empirically-based, objective, and involves replicable evidence which can be used in conjunction with existing arguments to resolve the question of whether or not Shakespeare of Stratford-upon-Avon wrote all the disputed plays traditionally attributed to him.

683 KiB  
Article
Authorship Attribution Using Principal Component Analysis and Competitive Neural Networks
by Mehmet Can
Math. Comput. Appl. 2014, 19(1), 21-36; https://doi.org/10.3390/mca19010021 - 1 Apr 2014
Cited by 8 | Viewed by 1609
Abstract
Feature extraction is a common problem in statistical pattern recognition. It refers to a process whereby a data space is transformed into a feature space that, in theory, has exactly the same dimension as the original data space. However, the transformation is designed in such a way that the data set may be represented by a reduced number of "effective" features and yet retain most of the intrinsic information content of the data; in other words, the data set undergoes a dimensionality reduction. Principal component analysis is one of these processes. In this paper, the data collected by counting selected syntactic characteristics in around a thousand paragraphs of each of the sample books underwent a principal component analysis. The authors of the texts are then identified by competitive neural networks, which use these effective features.