Introversion-Extraversion Prediction Using Machine Learning
Abstract—Introversion and extraversion are personality traits that describe how a person interacts with others. Both have their advantages and disadvantages, and people who know their personality can use these to their benefit. This study compares and evaluates several machine learning models and dataset balancing methods for predicting the introversion-extraversion personality based on survey results from the Open-Source Psychometrics Project. The dataset was balanced using three balancing methods, and fifteen questions were chosen as features based on their correlations with the personality self-identification result. The resulting datasets were used to train several supervised machine learning models. The best model for the Synthetic Minority Oversampling Technique (SMOTE), Adaptive Synthetic Sampling (ADASYN), and SMOTE-Edited Nearest Neighbor (SMOTE-ENN) datasets was the Random Forest, with 10-fold cross-validation accuracies of 95.5%, 95.3%, and 71.0%, respectively. On the original dataset, the best model was the Support Vector Machine, with a 10-fold cross-validation accuracy of 73.5%. Based on these results, the best balancing methods for increasing model performance were the oversampling methods, whereas the hybrid oversampling-undersampling method did not significantly increase performance. Furthermore, the tree-like models, Random Forest and Decision Tree, improved substantially from data balancing, while the other models, excluding the SVM, showed no significant rise in performance. These findings imply that further study is needed on hybrid balancing methods and other classification models to improve personality classification performance.
Manuscript received 7 Jul. 2022; revised 26 Dec. 2022; accepted 30 Jan. 2023. Date of publication 31 Dec. 2023.
International Journal on Informatics Visualization is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.
leaders [8], because extroverts are more comfortable working in teams and find it easy to break the ice in a team, whether in a tense or relaxed state [9]. Introverts, in contrast, are at a disadvantage as leaders because they prefer to work alone and to interact less. By knowing their personality and traits, people can turn these advantages and disadvantages to their gain. Consequently, this study aimed to find the best method for predicting the introversion-extraversion personality.
Extraversion-introversion detection can help in a few aspects of life. For example, it can be used to recommend specific treatments for employees based on their personality to boost their work performance. Another application is determining the best learning method for students based on their extraversion-introversion personality.
Extrovert and introvert prediction has been studied multiple times. A previous study compared several prediction models for the introversion-extraversion personality, with the best model reaching 73.81% accuracy [10]. The dataset used in that work was obtained from the Multidimensional Introversion-Extroversion Scales assessment by the Open-Source Psychometrics Project and augmented using oversampling methods; the same dataset is used in this study as well. This study differs, however, in that it adds a hybrid augmentation method combining oversampling and undersampling, SMOTE-ENN, and uses a wider range of supervised machine learning models: Decision Tree, Logistic Regression, k-NN, Linear Discriminant Analysis, Gaussian Naïve Bayes, Random Forest, SVM Linear, SVM Polynomial, and SVM Gaussian.
An experiment to predict extraversion has also been conducted using electroencephalographic (EEG) and NEO-FFI data to train a Random Forest model, which resulted in 60.6% accuracy in distinguishing extroverts from introverts [11]. Another study used data from an unnamed campus to predict its students' introversion or extraversion and obtained an accuracy of 72% using a linear SVM [12]. In addition, another study experimented with predicting introversion and extraversion based on the subjects' interaction with a robot, achieving 70% accuracy for extraversion prediction [13].
This study compares and evaluates machine learning models and dataset balancing methods to improve introversion-extraversion personality classification performance. Several machine learning models, namely Decision Tree, k-Nearest Neighbor, Logistic Regression, Linear Discriminant Analysis, Gaussian Naïve Bayes, Random Forest, and Support Vector Machine, were trained and evaluated on the original dataset, on datasets produced by the oversampling methods SMOTE and ADASYN, and on a dataset produced by the hybrid oversampling-undersampling method SMOTE-ENN.

II. MATERIAL AND METHOD

The researchers' method was to conduct an experiment to identify a person's personality. The Open-Source Psychometrics Project provided the dataset in 2019 [14]. Fig. 1 shows the stages in which the dataset was processed for the experiment.
A. Dataset

The dataset used in this experiment was provided by the Open-Source Psychometrics Project [14]. The data were taken from an online survey called the Multidimensional Introversion-Extroversion Scales on the project's website. The published survey results were last updated on August 19, 2019, with a total of 7,188 responses.
The survey consists of 91 statements that participants answer on a five-point Likert scale, from 1 (disagreement) to 5 (agreement). Participants were also asked to state which category they belonged to. The dataset contains the responses, the time taken, each statement's position, the introversion-extraversion self-identification, and personal data, e.g., country, gender, and age. The data taken for the experiment are the responses to each statement, the introversion-extraversion identification (IE), country, gender, and age.

B. Pre-processing

The first step of pre-processing was to remove invalid data, which includes null values and out-of-scope values. Afterward, responses whose IE value was not 1 (Introvert), 2 (Extrovert), or 3 (Neither) were removed from the dataset.
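As an illustration of this cleaning step, a minimal sketch in Python with pandas, assuming the responses are loaded from the project's published data file (the file name, separator, and column names here are assumptions based on the dataset description above, not the authors' code):

    import pandas as pd

    # Load the survey responses (the project distributes the raw data
    # as a delimited text file; the file name here is illustrative).
    df = pd.read_csv("data.csv", sep="\t")

    # Remove null values.
    df = df.dropna()

    # The 91 statement columns (Q1A..Q91A); keep only in-scope Likert
    # answers in the range 1-5.
    answer_cols = [c for c in df.columns if c.startswith("Q") and c.endswith("A")]
    df = df[((df[answer_cols] >= 1) & (df[answer_cols] <= 5)).all(axis=1)]

    # Keep only rows whose self-identification (IE) is 1 (Introvert),
    # 2 (Extrovert), or 3 (Neither).
    df = df[df["IE"].isin([1, 2, 3])]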
Furthermore, the dataset was balanced using several re-sampling methods. This step was necessary because the original dataset was imbalanced, as seen in Fig. 2: the classes comprised 4,404 introverts, 989 extroverts, and 1,768 neither. An imbalanced dataset can make a machine learning model biased toward the majority class [15], which can result in overfitting and poor performance in classifying the minority classes. Re-sampling can be applied to achieve a balanced dataset, either by oversampling, in which new data is produced to increase the number of minority-class samples, or by undersampling, in which majority-class data is removed to match the minority class [16]. In this study, three re-sampling methods were used to balance the dataset.
The first method, the Synthetic Minority Oversampling Technique (SMOTE) [18], generates synthetic data for all the minority classes to match the amount of the majority-class data. The result was that each of the classes had 4,404 data points, as seen in Fig. 3.

Fig. 3 SMOTE Dataset

The second method is Adaptive Synthetic Sampling (ADASYN). This method helps the classification process by generating more data for the unbalanced minority classes [19]. The amount of generated data for each class is based on a weighted distribution that depends on the level of learning difficulty [19]. The result of the ADASYN dataset can be seen in Fig. 4.

Fig. 4 ADASYN Dataset

The third method, SMOTE-ENN, combines SMOTE with the Edited Nearest Neighbor (ENN) rule. ENN works by finding each sample's k nearest neighbors and applying the k-NN rule to the data [20]: if a minority-class sample has two or more majority-class samples among its nearest neighbors, those majority-class samples are deleted, reducing the overlap between the majority and minority data. The result of SMOTE-ENN can be seen in Fig. 5.
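All three balancing methods are available in the imbalanced-learn library; the following sketch shows how the three balanced datasets could be produced (X and y stand for the cleaned responses and IE labels from the previous step; the random_state values are illustrative):

    from imblearn.combine import SMOTEENN
    from imblearn.over_sampling import ADASYN, SMOTE

    X = df[answer_cols].values
    y = df["IE"].values

    # SMOTE: oversample every minority class up to the majority class size.
    X_smote, y_smote = SMOTE(random_state=42).fit_resample(X, y)

    # ADASYN: oversample adaptively, generating more synthetic samples
    # for minority examples that are harder to learn.
    X_adasyn, y_adasyn = ADASYN(random_state=42).fit_resample(X, y)

    # SMOTE-ENN: oversample with SMOTE, then remove noisy samples with
    # the Edited Nearest Neighbours rule.
    X_senn, y_senn = SMOTEENN(random_state=42).fit_resample(X, y)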
C. Feature Selection

Features were selected by calculating the correlation between each feature and the output label on the original dataset using the Pearson correlation coefficient, one of the most popular methods for measuring the correlation between two variables. The coefficient ranges from -1 to 1: the closer the value is to 1, the higher the correlation between the variables, and the same applies to -1, except that the correlation is in the opposite direction [21].
For this study, the Pearson correlation coefficients were converted to absolute values, and the list was sorted from the highest to the lowest correlation, as seen in Table I. Thereafter, the top 15 features were selected and stored for the next stage of the experiment.

TABLE I
RESULT OF SORTED CORRELATION FEATURES

Question [14]                                                         Correlation Score (Absolute)
Q83A: "I keep in the background."                                     0.412
Q91A: "I talk to a lot of different people at parties."               0.396
Q82A: "I don’t talk a lot."                                           0.394
Q90A: "I start conversations."                                        0.366
Q80A: "I love large parties."                                         0.347
Q89A: "I don’t mind being the center of attention."                   0.340
Q81A: "I am quiet around strangers."                                  0.340
Q84A: "I don’t like to draw attention to myself."                     0.324
Q14A: "I want a huge social circle."                                  0.309
Q13A: "I can keep a conversation going with anyone about anything."   0.295
Q5A: "I mostly listen to people in conversations."                    0.293
Q44A: "I mostly listen to people in conversations."                   0.288
Q16A: "I act wild and crazy."                                         0.269
Q15A: "I talk to people when waiting in lines."                       0.267
Q85A: "I have little to say."                                         0.266
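In pandas, this selection amounts to computing r = cov(x, y) / (σx·σy) between each statement column and the IE label, taking the absolute value, and keeping the 15 strongest; a sketch, assuming the cleaned DataFrame from the pre-processing step:

    # Absolute Pearson correlation of each statement with the IE label.
    correlations = df[answer_cols].corrwith(df["IE"]).abs()

    # Sort from the highest to the lowest correlation, as in Table I.
    correlations = correlations.sort_values(ascending=False)

    # Keep the top 15 statements as the model features.
    selected = correlations.head(15).index.tolist()
    X = df[selected].values
    y = df["IE"].values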
Q83A: "I keep in the background." 0.412 SVM Polynomial 0.725 0.696 0.725 0.704
Q91A: "I talk to a lot of different people at 0.396 SVM Gaussian 0.734 0.709 0.733 0.714
parties."
Q82A: "I don’t talk a lot." 0.394 Despite being the best model for the original dataset, the
Q90A: "I start conversations." 0.366 result for the correct prediction of each class was relatively
Q80A: "I love large parties." 0.347 poor. As seen in the confusion matrix in Fig. 6, the recall score
Q89A: "I don’t mind being the center of 0.340 for class 3 or “Neither” was 35.0%, the score for class 2 or
attention."
“Extrovert” was 62.9%, and class 1 or “Introvert” was 91%,
Q81A: "I am quiet around strangers." 0.340
Q84A: "I don’t like to draw attention to 0.324
resulting in the weighted average recall score of 73.4%.
myself."
Q14A: "I want a huge social circle." 0.309
Q13A: "I can keep a conversation going 0.295
with anyone about anything."
Q5A: "I mostly listen to people in 0.293
conversations."
Q44A: "I mostly listen to people in 0.288
conversations."
Q16A: "I act wild and crazy." 0.269
Q15A: "I talk to people when waiting in 0.267
lines."
Q85A: "I have little to say." 0.266
By applying the same method for the selected features to the original dataset, the mean accuracy obtained was 0.726, which left a significant gap of 0.229, or 22.9%, between the mean accuracy scores of the two settings.

TABLE III
SMOTE DATASET RESULT

Method                          Accuracy   Precision   Recall   F1
Decision Tree                   0.940      0.940       0.939    0.940
Logistic Regression             0.690      0.705       0.689    0.695
k-NN                            0.782      0.797       0.781    0.785
Linear Discriminant Analysis    0.683      0.699       0.683    0.688
Gaussian Naïve Bayes            0.686      0.712       0.685    0.694
Random Forest                   0.955      0.955       0.955    0.955
SVM Linear                      0.690      0.706       0.689    0.695
SVM Polynomial                  0.730      0.742       0.730    0.734
SVM Gaussian                    0.736      0.744       0.736    0.739

The SMOTE and the SMOTE-ENN datasets scored their highest mean accuracy with the same method, the Random Forest. The best result was obtained on the SMOTE dataset, with a mean accuracy of 0.955, while the SMOTE-ENN dataset scored 0.710. Comparing the two datasets, the gap of about 0.245, or 24.5%, was considered significant.
The recall score for the best SMOTE-ENN model was below that of the best original dataset model, a 2.4% decrease. The best model, Random Forest, obtained recall scores of 61.3%, 92.4%, and 81.2% for classes 1, 2, and 3, respectively, and scored 71% on weighted average recall. The confusion matrix for this model can be found in Fig. 8.
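The per-class and weighted-average recall figures quoted here can be read off a confusion matrix; a sketch with scikit-learn, using a single held-out split for illustration (the paper's figures are cross-validated, so exact numbers will differ):

    from imblearn.over_sampling import SMOTE
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report, confusion_matrix
    from sklearn.model_selection import train_test_split

    # X, y: the 15 selected features and IE labels from the earlier steps.
    X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
    X_train, X_test, y_train, y_test = train_test_split(
        X_res, y_res, test_size=0.2, random_state=42, stratify=y_res)

    model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Rows are the true classes (1=Introvert, 2=Extrovert, 3=Neither);
    # columns are the predicted classes.
    print(confusion_matrix(y_test, y_pred))

    # Reports per-class recall and the weighted-average recall.
    print(classification_report(y_test, y_pred, digits=3))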
The best model for the ADASYN dataset was also the Random Forest, with a mean accuracy of 0.953, 0.2% lower than the best SMOTE model and 21.9% higher in contrast to the best original dataset model. The individual recall scores for classes 1, 2, and 3 were 96.7%, 96.6%, and 91.0%. Fig. 9 shows the confusion matrix for the ADASYN Random Forest.

Fig. 9 Random Forest (ADASYN Dataset) Confusion Matrix

B. Discussion

This study aims to boost the performance of introversion-extraversion classification by comparing and evaluating various machine learning models on several balanced datasets. The results showed that certain models achieved a significant performance increase through dataset balancing.
The oversampling methods, SMOTE and ADASYN, raised the F1 score by up to 24.5% on tree-like machine learning models such as Decision Tree and Random Forest. For the other models tested in this study, the increase was less significant, with the biggest boost of 11.3% by k-Nearest Neighbor. Based on this research, SMOTE was superior to ADASYN on every model, though by a thin margin; the biggest gap between the two was on the SVM Linear model, with a 1.3% difference.
In contrast to the oversampling methods, the hybrid of oversampling and undersampling, SMOTE-ENN, did not perform well in this study. The tree-like models exceeded their original dataset scores, with the biggest margin of 8% by Decision Tree, but the rest of the models fell below their original dataset counterparts, with the worst margin of 10.9% by Logistic Regression.
From the obtained results, this study found that among the machine learning models tested, the overall best-performing models on the re-sampled datasets were the tree-like models, Decision Tree and Random Forest. The k-Nearest Neighbor also gained a boost from the balanced datasets, though not as large as the tree-like models, while the rest of the models were largely unaffected. This study also observed that all the tested sampling methods increased the performance of at least some models, although with the hybrid method, SMOTE-ENN, only a few models gained an advantage.

IV. CONCLUSIONS

This study was conducted using an original dataset to predict the extraversion-introversion personality. Fifteen questions were chosen based on their high correlation scores with the output label using the Pearson correlation coefficient. The imbalanced dataset was balanced with oversampling techniques, SMOTE and ADASYN, and a hybrid of oversampling and undersampling, SMOTE-ENN. The study used nine classification models to find the best method to predict personality.
The best method for the SMOTE, ADASYN, and SMOTE-ENN datasets was the Random Forest, with mean accuracies of 0.955, 0.953, and 0.710. For the original dataset, the best method was SVM Linear, with a mean accuracy of 0.735.
The balancing methods using oversampling, SMOTE and ADASYN, increased the accuracy of several models compared to the original dataset, with the biggest jumps of 22.9% and 22.7% on the Random Forest model. These results show that oversampling helps to boost machine learning models, especially tree-like models. SMOTE was better overall than ADASYN, although with only minor performance differences.
In contrast, the hybrid method of oversampling and undersampling, SMOTE-ENN, showed little growth in accuracy, with only 2 out of 9 models surpassing the original dataset. The biggest boost from SMOTE-ENN was on the Decision Tree, with a margin of 7.2% in mean accuracy over the original dataset. These outcomes indicate that the oversampling methods were better than the hybrid method in this case.
Further studies could use other balancing methods, especially hybrid methods, and other classification models to achieve higher accuracy in predicting personality. Additionally, the dataset balancing approach can be used to increase models' classification performance in other research areas.

REFERENCES

[1] R. M. Bergner, "What is personality? Two myths and a definition," New Ideas in Psychology, vol. 57, 2020, doi: 10.1016/j.newideapsych.2019.100759.
[2] P. G. Zimbardo, R. L. Johnson, and V. McCann, Psychology: Core Concepts, 8th ed. NY: Pearson, 2017.
[3] C. D. Nye and B. W. Roberts, A Neo-Socioanalytic Model of Personality Development. Elsevier Inc., 2019.
[4] A. Baumert et al., "Integrating Personality Structure, Personality Process, and Personality Development," European Journal of Personality, vol. 31, no. 5, pp. 503–528, Sep. 2017, doi: 10.1002/per.2115.
[5] D. Petric, "Introvert, Extrovert and Ambivert," Knot Theory of Mind, pp. 1–4, Sep. 2019, doi: 10.13140/RG.2.2.28059.41764/2.
[6] M. C. Shehni and T. Khezrab, "Review of Literature on Learners' Personality in Language Learning: Focusing on Extrovert and Introvert Learners," Theory and Practice in Language Studies, vol. 10, no. 11, p. 1478, Nov. 2020, doi: 10.17507/tpls.1011.20.
[7] Y. Tao, Y. Cai, C. Rana, and Y. Zhong, "The impact of the Extraversion-Introversion personality traits and emotions in a moral decision-making task," Personality and Individual Differences, vol. 158, p. 109840, May 2020, doi: 10.1016/j.paid.2020.109840.
[8] A. M. Grant, F. Gino, and D. A. Hofmann, "Reversing the Extraverted Leadership Advantage: The Role of Employee Proactivity," Academy of Management Journal, vol. 54, no. 3, pp. 528–550, Jun. 2011, doi: 10.5465/amj.2011.61968043.
[9] J. E. Bono and T. A. Judge, "Personality and Transformational and Transactional Leadership: A Meta-Analysis," Journal of Applied Psychology, vol. 89, no. 5, pp. 901–910, 2004, doi: 10.1037/0021-9010.89.5.901.
[10] C. So, "Are You an Introvert or Extrovert? Accurate Classification With Only Ten Predictors," 2020 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), Feb. 2020, doi: 10.1109/icaiic48513.2020.9065069.
[11] H. Baumgartl, S. Bayerlein, and R. Buettner, "Measuring Extraversion Using EEG Data," Lecture Notes in Information Systems and Organisation, pp. 259–265, 2020, doi: 10.1007/978-3-030-60073-0_30.
[12] L. Ge, H. Tang, Q. Zhou, Y. Tang, and J. Lang, "Classification Algorithms to Predict Students' Extraversion-Introversion Traits,"
2016 International Conference on Cyberworlds (CW), Sep. 2016, doi: 10.1109/cw.2016.27.
[13] S. M. Anzalone, G. Varni, S. Ivaldi, and M. Chetouani, "Automated Prediction of Extraversion During Human–Humanoid Interaction," International Journal of Social Robotics, vol. 9, no. 3, pp. 385–399, Feb. 2017, doi: 10.1007/s12369-017-0399-6.
[14] Open-Source Psychometrics Project, "Development of the Multidimensional Introversion-Extraversion Scales," 2019.
[15] J. Tanha, Y. Abdi, N. Samadi, N. Razzaghi, and M. Asadpour, "Boosting methods for multi-class imbalanced data classification: an experimental review," Journal of Big Data, vol. 7, no. 1, Sep. 2020, doi: 10.1186/s40537-020-00349-y.
[16] R. Mohammed, J. Rawashdeh, and M. Abdullah, "Machine Learning with Oversampling and Undersampling Techniques: Overview Study and Experimental Results," 2020 11th International Conference on Information and Communication Systems (ICICS), Apr. 2020, doi: 10.1109/icics49469.2020.239556.
[17] V. S. Spelmen and R. Porkodi, "A Review on Handling Imbalanced Data," 2018 International Conference on Current Trends towards Converging Technologies (ICCTCT), Mar. 2018, doi: 10.1109/icctct.2018.8551020.
[18] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: Synthetic minority over-sampling technique," Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002. [Online]. Available: https://arxiv.org/pdf/1106.1813.pdf
[19] H. He, Y. Bai, E. A. Garcia, and S. Li, "ADASYN: Adaptive synthetic sampling approach for imbalanced learning," 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), pp. 1322–1328, 2008.
[20] T. Lu, Y. Huang, W. Zhao, and J. Zhang, "The Metering Automation System based Intrusion Detection Using Random Forest Classifier with SMOTE+ENN," 2019 IEEE 7th International Conference on Computer Science and Network Technology (ICCSNT), Oct. 2019, doi: 10.1109/iccsnt47585.2019.8962430.
[21] H. Zhu, X. You, and S. Liu, "Multiple Ant Colony Optimization Based on Pearson Correlation Coefficient," IEEE Access, vol. 7, pp. 61628–61638, 2019, doi: 10.1109/access.2019.2915673.