George Markopoulos is currently Associate Professor of Computational Linguistics and Director of the Phonetics and Computational Linguistics Laboratory at the Department of Linguistics, Faculty of Philology, National and Kapodistrian University of Athens, where he has taught and conducted research since 1998. He specializes in the computational processing of linguistic data, corpus design and annotation, and text mining and information retrieval.
Phone: (+30)-210-7277857
Address: National and Kapodistrian University of Athens, Department of Linguistics, School of Philosophy, University Campus, 157 84 Athens, Greece
We present research on predicting an author's personality using natural language processing techniques applied to essays written in Modern Greek by high school students. Each writer was profiled by completing the Jung Typology Test. Personality prediction is discussed within the general research framework of author profiling, by examining how effectively several stylometric features predict students' personality types. The feature set combined word and sentence length, the most frequent part-of-speech tags, the most frequent character and word bigrams and trigrams, the most frequent words, and hapax and dis legomena. Since personality prediction is a complex multidimensional research problem, we applied various machine learning algorithms to optimize our model's performance after extracting the stylometric features. We compared nine machine learning algorithms and ranked them by cross-validated accuracy. The best results were obtained by the Naive Bayes algorithm. Based on the Jung Typology Test classification, prediction accuracy reached 80.7% on Extraversion, 79.9% on Intuition, 68.8% on Feeling, and 75.7% on Judging. The reported results show a competitive approach to the personality prediction problem. Furthermore, our research revealed new combinations of stylometric features and corresponding computational techniques, giving interesting and satisfactory solutions to the problem of author personality prediction for Modern Greek.
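As a rough illustration of the kind of stylometric features listed in this abstract, the sketch below computes a handful of them for a single text. It is a minimal example under stated assumptions, not the authors' pipeline: the function name `stylometric_features`, the regex tokenizer, and the naive sentence splitter are all hypothetical choices, and part-of-speech and character n-gram features are left out.

```python
import re
from collections import Counter


def stylometric_features(text, top_n=3):
    """Compute a few simple stylometric features for one text (illustrative only)."""
    # Naive sentence split on '.', '!' and '?'; a real system would use a proper splitter.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # \w+ matches Unicode letters, so Greek words are tokenized as well.
    words = re.findall(r"\w+", text.lower())
    if not words or not sentences:
        raise ValueError("text must contain at least one word")
    counts = Counter(words)
    bigrams = Counter(zip(words, words[1:]))
    return {
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "avg_sentence_length": len(words) / len(sentences),
        # Hapax legomena occur once, dis legomena exactly twice.
        "hapax_legomena": sum(1 for c in counts.values() if c == 1),
        "dis_legomena": sum(1 for c in counts.values() if c == 2),
        "top_words": [w for w, _ in counts.most_common(top_n)],
        "top_word_bigrams": [b for b, _ in bigrams.most_common(top_n)],
    }
```

Feature dictionaries of this kind would then be turned into vectors and passed to a classifier such as Naive Bayes; that step is omitted here.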
The computational processing of allomorphy has remained a major challenge since the first systematic attempts to predict allomorphy with machine learning techniques. The MaxEnt model offers a statistical way to build a probabilistic model for SOI that combines different kinds of linguistic evidence. The aim is to predict allomorphic changes in nominal compounding and to show the substantial contribution of various morphological, phonological, and semantic features. To evaluate the effectiveness of our model, we used a test corpus of nominal compounds whose first constituent may belong to any grammatical category. We created ALLOMANTIS, a morphological prediction analyzer for nominal allomorphy. The overall accuracy of the model was above 90%.
This study aims to develop an effective and precise methodology for detecting AI-generated text, leveraging the combination of transformer learning and stylometric features. The research used two datasets provided by the AuTexTification: Automated Text Identification shared task, a component of IberLEF 2023, the 5th Workshop on Iberian Languages Evaluation Forum, held at the SEPLN 2023 Conference. Our team took part in both English-language subtasks: binary classification of texts as either human-written or AI-generated, and multiclass classification to predict which of six AI writing models produced a text. Our main approach was to experiment with multiple Transformer models and, at the same time, to use an extensive stylometric feature engineering workflow. Each method (transformers and stylometric features) was first applied separately, and then we explored various ways to combine them. The most efficient method was based on ensemble learning with majority voting, using the two most accurate transformer models on our training data and a comprehensive concatenation of many different stylometric feature groups. The macro-F1 scores on the test sets for subtasks 1 and 2 were 60.78 and 55.87, respectively, placing our group above the median of the competing teams. This study underscores the potential of combining transformer learning and stylometric features to enhance the accuracy of AI-generated text detection.
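The majority-voting step mentioned in this abstract can be sketched as plain hard voting over the labels predicted by each model. This is a minimal illustration, assuming each model has already produced one label per text; the function name `majority_vote` and the tie-breaking rule are assumptions, not details taken from the paper.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists into one label per text by hard voting.

    `predictions` holds one list of labels per model, all of the same length.
    """
    combined = []
    for votes in zip(*predictions):
        # most_common(1) returns the most frequent label; on a tie, Counter
        # keeps first-seen order, so the earliest model in the list wins.
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined
```

With three voters, for example two transformer classifiers and one classifier trained on stylometric features, a binary decision can never tie, which is one common reason to prefer an odd number of models.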
International Journal on Advances in Life Sciences, 2021
We present a study on predicting an author's personality using natural language processing techniques applied to essays written in Modern Greek by high school students. Each writer was profiled by completing two personality questionnaires, one based on the typology of Carl Jung and the other on the Five Factor Model. Personality prediction is discussed within the general research framework of author profiling, by examining how effectively several stylometric features predict students' personality types. The feature set combined word and sentence length, the most frequent part-of-speech tags, the most frequent character and word bigrams and trigrams, the most frequent words, and hapax and dis legomena. Since personality prediction is a complex multidimensional research problem, we applied various machine learning algorithms to optimize our model's performance after extracting the stylometric features. We compared nine machine learning algorithms and ranked them by cross-validated accuracy. The best results in predicting the Jung typology types were obtained by the Naive Bayes algorithm. In contrast, for the prediction of personality traits based on the Five Factor Model, the Generalized Linear Model (Binomial method) algorithm prevailed. Based on the Jung Typology Test classification, prediction accuracy reached 80.7% on Extraversion, 79.9% on Intuition, 68.8% on Feeling, and 75.7% on Judging. In the Big Five personality classification, prediction accuracy reached 85.9% on Openness, 71.2% on Conscientiousness, 67.6% on Extraversion, 70.2% on Agreeableness, and 65.6% on Neuroticism. The reported results show a competitive approach to the personality prediction problem. Furthermore, our research revealed new combinations of stylometric features and corresponding computational techniques, giving interesting and satisfactory solutions to the problem of author personality prediction for Modern Greek.
Social media offer an unprecedented affordance for political communication, expanding the versatility of personal networks and facilitating the construction of online partisan identities. Within the framework of Critical Discourse Analysis and in the light of ideologies as axiomatic underpinnings of social representations, the aim of this paper is to explore the indexical use of ideological labels as strategies of affiliation and/or otherization, based on a corpus of Twitter data drawn from the profiles of five elected party leaders of the Hellenic Parliament.
This study presents a first standardized encoding for a hierarchically annotated corpus of Greek aphasic speech. Given the need for a deeper understanding of aphasic speech at the different levels of linguistic analysis and the lack of an encoding standard, an encoding scheme has been created for multilevel linguistic annotation. For this purpose, the professional annotation tool ELAN has been used to build hierarchically interconnected tiers of linguistic transcriptions. The aim of this research is to establish this encoding scheme as an annotation guide for aphasic speech corpora and thus facilitate data extraction at the phonological, morphosyntactic, semantic, and narrative levels of linguistic analysis.
In this paper, the process of designing an annotated Greek Corpus of Aphasic Discourse (GREECAD) is presented. Given that resources of this kind are quite limited, a major aim of the GREECAD was to provide a set of specifications that could serve as a methodological basis for the development of other relevant corpora and, therefore, contribute to future research in this area. The GREECAD was developed with the following requirements: a) to include a rather homogeneous sample of Greek as spoken by individuals with aphasia; b) to document speech samples with rich metadata, which include demographic information as well as detailed information on the patients' medical records and neuropsychological evaluation; c) to provide annotated speech samples, which encode information at the micro-linguistic level (words, POS, grammatical errors, clause types, etc.) and the discourse level (narrative structure elements, main events, evaluation devices, etc.). In terms of the design of the GREECAD, the basic requirements regarding data collection, metadata, transcription, and annotation procedures were set. The discourse samples were transcribed and annotated with the ELAN tool. To ensure accurate and consistent annotation, a Transcription and Annotation Guide was compiled, which includes detailed guidelines on all aspects of the transcription and annotation procedure.
Web 2.0 has become a very useful information resource, as people are strongly inclined to express their opinions online in social media, blogs, and review sites. Sentiment analysis aims at classifying documents as positive or negative according to their overall expressed sentiment. In this paper, we create a sentiment classifier by applying Support Vector Machines to hotel reviews written in Modern Greek. Using a unigram language model, we compare two different methodologies, and the emerging results look very promising.
One of the most prominent current paradigms in automatic syntactic analysis is data-driven dependency parsing. In this approach, manually annotated treebanks are used in training and evaluating parsers that create tree representations, where each word depends on a head word and is assigned a label depicting its relation to the head word. This paper describes the Greek Dependency Treebank (GDT), a resource that contains 130+ thousand tokens in 5669 manually annotated sentences from texts including transcripts of parliamentary sessions and several types of web documents. The GDT has been used in experiments for training and evaluating a dependency parser for the Greek language.
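The tree representation described in this abstract, where every word points to a head word and carries a relation label, can be sketched with a small data structure. This is an illustrative sketch only: the `Token` class, the helper names, and the relation labels (`det`, `subj`, `root`) are hypothetical and are not the GDT's actual annotation scheme or file format.

```python
from dataclasses import dataclass


@dataclass
class Token:
    index: int  # 1-based position in the sentence
    form: str   # word form
    head: int   # index of the head word; 0 marks the root
    label: str  # relation to the head (label names here are illustrative)


def dependents(sentence, head_index):
    """Return the tokens whose head is the token at head_index."""
    return [t for t in sentence if t.head == head_index]


def is_tree(sentence):
    """Check for exactly one root and that every token reaches it without cycles."""
    heads = {t.index: t.head for t in sentence}
    if sum(1 for h in heads.values() if h == 0) != 1:
        return False
    for start in heads:
        seen = set()
        node = start
        while node != 0:
            # A repeated node means a cycle; an unknown node means a dangling head.
            if node in seen or node not in heads:
                return False
            seen.add(node)
            node = heads[node]
    return True


# "Η Βουλή συνεδριάζει" ("The Parliament convenes"), annotated by hand for this example.
sentence = [
    Token(1, "Η", 2, "det"),
    Token(2, "Βουλή", 3, "subj"),
    Token(3, "συνεδριάζει", 0, "root"),
]
```

A well-formedness check like `is_tree` is the sort of validation a manually annotated treebank typically needs before it is used to train or evaluate a parser.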
This paper deals with the application of corpus analysis to Greek law texts in order to illustrate their stylometric profile. Because of elements of vagueness or rigidity, these texts are often criticised for constituting an opaque "legalistic" language (Tiersma 1999: 139-41). For the purposes of this study we designed a Greek Legal Corpus, which we then compared with the Hellenic National Corpus. Using criteria such as lexical "richness", word and sentence length, and part-of-speech frequencies, we look for linguistic features which may affect the precision and comprehensibility of legal language (Bhatia 2010).
Most language processing systems are based on competence models of natural language. The problems and limitations these systems run into suggest turning to performance models for language processing, which take into consideration the statistical properties of actual language use. The data-oriented parsing model we describe in this paper uses an annotated corpus. It analyses new input by finding the most probable way to reconstruct this input from fragments that are already contained in the corpus.
In this article we first look at the various stages of the development of CALL, and then we examine the current situation through the application of new technologies in language education.
The aim of this paper is to present empirical results from a corpus-based compilation of a basic vocabulary for the teaching of Modern Greek as a foreign language.
The aim of this paper is to present some preliminary results regarding the extraction of basic vocabulary items from specialized corpora and their application in the teaching of Modern Greek as a Foreign Language. A statistical method based on the Log Likelihood measure has been devised in order to identify the most important words related to the specific thematic unit of "Shopping".
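The abstract does not spell out how the Log Likelihood measure was applied, so the sketch below uses the common two-corpus keyness formulation as an assumption: a word's frequency in a specialized (target) corpus is compared with its frequency in a reference corpus. The function name and parameter names are hypothetical.

```python
import math


def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    """Two-corpus log-likelihood keyness score for one word.

    freq_* are the word's counts and size_* the total token counts
    of the target (specialized) and reference corpora.
    """
    total = size_target + size_ref
    # Expected counts if the word were equally frequent in both corpora.
    expected_target = size_target * (freq_target + freq_ref) / total
    expected_ref = size_ref * (freq_target + freq_ref) / total
    score = 0.0
    # Zero counts contribute nothing, which also avoids log(0).
    if freq_target > 0:
        score += freq_target * math.log(freq_target / expected_target)
    if freq_ref > 0:
        score += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * score
```

Ranking the words of a thematic corpus by this score puts those that are unusually frequent or unusually rare relative to the reference corpus at the top; candidates for a thematic vocabulary list would be the high-scoring words that are overrepresented in the target corpus.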
We present research on predicting an author's personality by applying natural language processing techniques to essays written in Modern Greek by high-school students. Each writer was profiled by completing two personality questionnaires, one based on Carl Jung's typology and the other on the Five Factor Model. Personality prediction is also discussed within the general research framework of author profiling, by examining how effectively several stylometric features predict students' personality types. The feature set combined word and sentence length, the most frequent part-of-speech tags, the most frequent character and word bigrams and trigrams, the most frequent words, and hapax and dis legomena. Since personality prediction is a complex multidimensional research problem, we applied various machine learning algorithms to optimize our model's performance after extracting the stylometric features. We compared nine machine learning algorithms and ranked them according to their cross-validated accuracy. The best results in predicting the Jung typology types were obtained by the Naive Bayes algorithm, whereas for the Five Factor Model the Generalized Linear Model (Binomial method) algorithm prevailed. For the Jung Typology Test classification, prediction accuracy reached 80.7% on Extraversion, 79.9% on Intuition, 68.8% on Feeling, and 75.7% on Judging. For the Big Five classification, prediction accuracy reached 85.9% on Openness, 71.2% on Conscientiousness, 67.6% on Extraversion, 70.2% on Agreeableness, and 65.6% on Neuroticism. The reported results show a competitive approach to the personality prediction problem. Furthermore, our research revealed new combinations of stylometric features and corresponding computational techniques, giving interesting and satisfying solutions to the author's personality prediction problem for Modern Greek.
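To make the stylometric feature set more concrete, here is a minimal sketch that computes several of the listed features for one text. It is an illustration under stated assumptions, not the authors' pipeline: the tokenizer and sentence splitter are naive, part-of-speech features are omitted because they require a tagger, and the names `tokenize`, `split_sentences`, `ngrams` and `stylometric_features` are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    # Unicode-aware word tokens; in Python 3, \w also matches Greek letters.
    return re.findall(r"\w+", text.lower())


def split_sentences(text):
    # Naive splitter (assumption): '.', '!', '?' and ';', the latter
    # because it serves as the question mark in Greek.
    return [s for s in re.split(r"[.!?;]+", text) if s.strip()]


def ngrams(seq, n):
    # Works for both token lists (word n-grams) and strings (character n-grams).
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


def stylometric_features(text, top_k=3):
    words = tokenize(text)
    sentences = split_sentences(text)
    if not words or not sentences:
        raise ValueError("text contains no words")
    freq = Counter(words)
    word_bigrams = Counter(ngrams(words, 2))
    char_trigrams = Counter(ngrams(text.lower(), 3))
    return {
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "avg_sentence_length": len(words) / len(sentences),  # words per sentence
        "top_words": [w for w, _ in freq.most_common(top_k)],
        "top_word_bigrams": [g for g, _ in word_bigrams.most_common(top_k)],
        "top_char_trigrams": ["".join(g) for g, _ in char_trigrams.most_common(top_k)],
        "hapax_legomena": sum(1 for n in freq.values() if n == 1),  # words seen once
        "dis_legomena": sum(1 for n in freq.values() if n == 2),  # words seen twice
    }


features = stylometric_features("The cat sat. The cat ran. A dog barked!", top_k=2)
for name, value in features.items():
    print(name, value)
```

In a classification setting, dictionaries like this one would be turned into numeric vectors, one per essay, before being passed to a classifier and evaluated with cross-validation.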
speech corpora and thus facilitate data extraction at the phonological, morphosyntactic, semantic and narrative level of linguistic analysis.
of this kind are quite limited, a major aim of the GREECAD was to provide a set of specifications which could serve as a methodological basis for the development of other relevant corpora and thereby contribute to future research in this area. The GREECAD was developed with the following requirements: a) to include a rather homogeneous sample of Greek as spoken by individuals with aphasia; b) to document speech samples with rich metadata, including demographic information as well as detailed information on the patients' medical records and neuropsychological evaluation; c) to provide annotated speech samples that encode information at the micro-linguistic level (words, POS, grammatical errors, clause types, etc.) and the discourse level (narrative structure elements, main events, evaluation devices, etc.). In terms of design, the basic requirements regarding data collection, metadata, transcription, and annotation procedures were set. The discourse samples were transcribed and annotated with the ELAN tool. To ensure accurate and consistent annotation, a Transcription and Annotation Guide was compiled, which includes detailed guidelines on all aspects of the transcription and annotation procedure.
people are strongly inclined to express their opinions online in social media, blogs and review sites. Sentiment analysis aims at classifying documents as positive or negative according to their overall expressed sentiment. In this paper, we create a sentiment classifier by applying Support Vector Machines to hotel reviews written in Modern Greek. Using a unigram language model, we compare two different methodologies, and the emerging results look very promising.