1. Introduction
Language can be a useful window through which we can understand the social and psychological attributes of its users, including personality traits. The utility of language in identifying personality has long been recognized and investigated [1,2,3,4,5,6,7], with many recent studies observing correlations between specific language features and certain personality attributes. For example, Jurafsky et al. (2009) [
8] and Ranganath et al. (2009) [
9] showed that linguistic and interactional features such as speech rate, pitch range, and laughter are closely related to friendliness or flirtatiousness, to the extent that these features can be used to detect the speaker’s intent to be friendly or flirtatious with above 70% accuracy. Weninger et al. (2012) [
10] identified a number of acoustic features such as loudness or voice quality that can be used to recognize speakers that exhibit personality characteristics of leadership (e.g., high-achieving, charismatic, team-playing, etc.). Holtzman et al. (2019) [
11] observed that certain linguistic items, such as sports-related words, swear words, and second-person pronouns, may mirror narcissistic personality.
Studies have also examined how language reflects personality dimensions identified by the Myers–Briggs Type Indicator (MBTI) model. The MBTI model measures the four personality dimensions of introversion–extroversion, sensing–intuiting, thinking–feeling, and judging–perceiving, and attempts to describe an individual’s personality as one of 16 types. The MBTI model utilizes a series of questions that are designed to gauge a person’s inclinations for outwardness, sociability, reliance on senses, orderliness, and so forth. Respondents to the MBTI questionnaire answer these questions on a Likert scale (e.g., very unlikely to very likely). Based on the compiled answers, the model proposes the likelihood of the person’s introvertedness (compared to extrovertedness), the degree of sensing (compared to intuiting), that of thinking (compared to feeling), and that of judging (compared to perceiving). The introverted/extroverted dimension reflects a person’s orientation toward the outer world. The sensing/intuiting dimension differentiates individuals based on how memory and information are processed and stored; a prototypically sensing individual processes information by relying on physical, sensory vehicles, but a prototypically intuiting individual applies a subjectively determined interpretative lens when processing information. The thinking/feeling dimension reflects one’s style of interaction or decision-making; “thinking” individuals are more logical, task-oriented, and less compromising in prioritizing values, whereas “feeling” individuals are more flexible and tactful in making decisions, actively taking into account different contexts and situations. The judging/perceiving dimension groups individuals based on whether they appear more structured and orderly in carrying out daily tasks; the judging types are thought to be more orderly, while the perceiving types are thought to be more open in selecting and prioritizing what is to be done next.
Of these different personality dimensions, the most commonly investigated dimension is extroversion, though there are studies that explore the relationship between language and other personality dimensions in addition to extroversion, such as conscientiousness, agreeableness, neuroticism, or openness [7,12,13,14]. Studies have identified a number of linguistic characteristics that are associated with extroversion; it has been found that extroverted (or “outgoing”) Dutch speakers tend to use linguistic markers of abstraction more frequently than introverted Dutch speakers [15], and also that extroverts tend to be more verbose than introverts [16,17,18,19,20]. Relatedly, extroverts were observed to exhibit higher verbal fluency not only in their own speech [21], but also during interaction as manifested with their shorter response time [22,23].
These linguistic features identified as markers of a particular personality dimension can also be used in predicting personality. This has been shown by Mairesse et al. (2007) [
24], who included a number of language features (e.g., thematic word groups, utterance types, pitch, verbosity measured by voice time compared to silent pause, speech rate measured in words per second, etc.) that are related to certain personality factors in building a predictive model, and achieved up to 73% classification accuracy for predicting certain personality traits such as extroversion or openness to experience. Guidi et al. (2019) [25] found significant correlations between personality traits and speech features, including pitch and voice quality. Focusing particularly on acoustic and vocal features, Mohammadi and Vinciarelli (2012) [26] and Polzehl et al. (2010) [27] similarly showed that personality traits can be predicted from pitch range, intensity, loudness, formants (acoustic energy that reflects resonance depending on the shape and size of the human vocal tract), spectral analysis (the distribution of acoustic energy across frequency), or speech rate.
The current study is an attempt to continue this line of inquiry by exploring the relationship between speech and personality. Drawing on speech produced during pseudo-naturalistic conversations, we extracted linguistic information from four domains of language use: the time taken in responding to a conversational prompt (i.e., response time) from the interactional domain, speech rate from the paralinguistic domain, pitch and intensity from the acoustic domain, and discourse markers from the text domain. Based on previous studies, we hypothesize that these speech characteristics are associated with some aspects of personality traits as identified in the MBTI model, such as extroversion–introversion, sensing–intuiting, thinking–feeling, or judging–perceiving. Thus, the extracted speech characteristics were examined for their associations with the aforementioned personality traits. In addition, we included gender as one of our variables in examining the associations between speech characteristics and personality traits. This is to ensure that we can observe potential relations between speech and personality on variables, such as pitch, that exhibit different ranges depending on the gender of participants. Furthermore, we look at whether men and women, in general, display different patterns for the examined speech characteristics, as part of our effort to search for potential future directions.
2. Methods
2.1. Participants
We collected semi-natural conversations of 30 individuals (12 males and 18 females) from South Texas. These conversations were collected between 2019 and 2021, and all participants were native speakers of English. The average age of the participants was 32.8 years, and ages ranged from 19 to 66. The conversations took place within the framework of the sociolinguistic interview [
28], a speech elicitation technique in which the interviewer’s sole goal is to elicit spontaneous and natural talk in large quantities. To achieve this, a typical sociolinguistic interview is often lengthy in duration, spanning from 30 min to 90 min or so, and loosely structured; most interviewees become more relaxed and less self-conscious as the interview continues, which in turn yields more naturalistic speech. Prior to collecting interviews, the study protocol was approved by the Institutional Review Board (IRB) at Texas A&M University—Corpus Christi.
2.2. Experimental Protocol
Two interviewers were recruited and trained to conduct sociolinguistic interviews. The interviewers were undergraduate students, and were native speakers of English. We recruited volunteers by contacting individuals who would potentially be interested in participating in the study. Some participants were students, but not all. All interviews were conducted one-on-one; in other words, one interviewer and one participant were involved in each interview. The interviews were approximately 40 min long on average, and the topics covered included family/friends, neighborhood change, school years, memorable incidents, etc. Though the interviews were loosely structured, they were standardized across participants. Given the nature of sociolinguistic interviews, in which the main goal of an interviewer is to elicit the most naturalistic speech, the topics are not strictly controlled; whenever possible, the interviewee is given full control of the speech event, and when a particular topic is conducive to eliciting much talk, the interviewer dwells on this topic as long as possible. The interviews were audio-recorded using portable solid-state recorders (Zoom H2n Handy Recorder, Zoom Corporation, Tokyo, Japan). Following the interview, participants were requested to fill out the MBTI test survey.
2.3. Data Segmentation
All recorded voice data were used for analysis, without cutting or manipulating any specific parts. Note that, because we used naturalistic conversation data, the length of the conversation measured for each participant varies depending on the topic of the conversation or the questions from the interviewer. Each interview recording was manually segmented based on the percept of a “breath group”, which often coincides with an intonational phrase that begins with a louder volume and/or higher pitch and ends with a lower volume and/or lower pitch. The segmented chunks were then manually transcribed as heard, reflecting verbatim production of the utterance, which includes false starts, stammers, fillers, errors, curses, etc. Along with the speech of the interviewees, the speech of the interviewers was transcribed as well. The utterance-level segmentation and transcription were performed using the transcription software ELAN (Version 6.0) [
29]. The audio recordings and corresponding transcripts were inputted into FAVE (Forced Alignment and Vowel Extraction) [
30], a computational tool that processes a recording of speech and its word level transcript and generates a file that contains segmentation of not only words, but also units within a word, such as vowels and consonants.
The generated file is in the format of TextGrid, which is essentially a text file object that Praat—a speech analysis software used for this study—can generate in order to label or annotate the audio file [
31]. Once generated, a TextGrid file enables us to work with the segments that are aligned with the corresponding acoustic signals. An illustration of a Praat window where both the acoustic signals (soundwave and spectrogram) and the TextGrid are loaded is provided in
Figure 1.
To sum up, the final dataset included the audio files and the corresponding TextGrid files (containing multi-level annotations including utterance, word, and phone such as vowels and consonants). The data collection and processing procedure is schematically illustrated in
Figure 2.
2.4. Extracted Features
We selected five features that may potentially be associated with some aspect of the personality domains, namely, response time (in milliseconds, or ms), intensity (in dB), pitch (in Hz), speech rate (words per minute, or wpm), and discourse markers. These selections were made either because previous findings on the relationship between language and personality included these features or because we suspected that these features might be utilized when people perceive or judge others’ personality. Because these features are drawn from various domains of language use, reflecting interactional attributes (i.e., response time), acoustic attributes (i.e., pitch and intensity), content or textual attributes (i.e., discourse markers), as well as paralinguistic (linguistic-external) speech attributes (i.e., speech rate), examining them would allow us to consider language in its interaction with personality from a more holistic perspective.
For the current study, six features are identified. Table 1 summarizes the descriptions of these features. They are demographic information (gender), and vocal and speech features (response time, voice frequency, loudness, speech rate, and frequency of discourse markers). First of all, gender is considered as a variable that can potentially explain the differences in phonetic and linguistic characteristics of people who have different personality traits. Second, it is presumed that speakers’ response times during natural conversations could be related to their personality traits based on previous reports [23,24,25,26,27,28,29,30,31,32]. Third, voice frequency was selected to investigate whether there is a significant difference in the frequency of people’s voices among the various personality traits. Fourth, it is assumed that speech loudness could be significantly different between the personality traits. Fifth, we posited that talking speed might be associated with personality traits. Lastly, it is hypothesized that the personality traits could be related to the types of words people use, specifically discourse markers, in natural conversations.
Response time (RT) refers to the amount of time a speaker takes before responding to a prompt, whether the prompt is provided in the form of a question or a statement. A Praat script was written to process the interviews in our dataset, in which all the instances of RT (i.e., the averaged time between the end of an interviewer’s speech and the start of a response) were measured. Each TextGrid file contains four “tiers”, whereby the top two tiers (Tiers 1 and 2) are annotations for the interviewer and the bottom two tiers (Tiers 3 and 4) for the interviewee. Tiers 1 and 3 contain the phone-level annotation, and Tiers 2 and 4 contain the word-level annotation (as illustrated in
Figure 1). For each interview, the script started by scanning Tier 4, determining whether the interviewee was talking or silent. If the interviewee was silent (i.e., the segment was labeled as “sp”), the script then read the annotation in Tier 2 to see whether it was also marked as “sp”. If both parties (the interviewer and the interviewee) were silent, then it was assumed that this silent time constitutes an RT for that particular interactive moment. The script then identified the start time of “sp” by the interviewer as well as the start time of “not sp” by the interviewee following the interviewer’s “sp”, thereby calculating the duration of the RT in milliseconds. Once the script had run through to the end of a recording, all the instances of RT were collected in order to compute the average, minimum, and maximum RT for each participant throughout the interview.
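The response-time procedure described above can be sketched in Python. This is an illustrative reimplementation under stated assumptions, not the Praat script used in the study: each word tier is assumed to be available as a list of `(start_s, end_s, label)` tuples, silent intervals are assumed to carry the label “sp”, and the function names `speech_spans` and `response_times_ms` are hypothetical. The sketch also skips prompts that overlap with interviewee speech, which is a design choice of this example.

```python
SILENCE = "sp"  # assumed label for silent intervals in the aligned word tiers


def speech_spans(tier):
    """Merge adjacent non-silent word intervals into (start, end) stretches of talk."""
    spans = []
    for start, end, label in tier:
        if label == SILENCE:
            continue
        if spans and start - spans[-1][1] < 1e-6:
            spans[-1] = (spans[-1][0], end)  # word continues the current stretch
        else:
            spans.append((start, end))
    return spans


def response_times_ms(interviewer_tier, interviewee_tier):
    """Return one RT (ms) per prompt that the interviewee answers first."""
    prompts = speech_spans(interviewer_tier)
    replies = speech_spans(interviewee_tier)
    rts = []
    for i, (_, prompt_end) in enumerate(prompts):
        next_prompt = prompts[i + 1][0] if i + 1 < len(prompts) else float("inf")
        # Skip prompts that end while the interviewee is already talking (overlap).
        if any(s < prompt_end < e for s, e in replies):
            continue
        onsets = [s for s, _ in replies if s >= prompt_end]
        # Count the gap only if the interviewee, not the interviewer, speaks next.
        if onsets and min(onsets) < next_prompt:
            rts.append((min(onsets) - prompt_end) * 1000.0)
    return rts


# Toy tiers (times in seconds), invented for illustration only.
interviewer = [(0.0, 0.5, "how"), (0.5, 1.0, "are"), (1.0, 3.0, "sp"),
               (3.0, 3.5, "really"), (3.5, 6.0, "sp")]
interviewee = [(0.0, 1.5, "sp"), (1.5, 2.0, "fine"), (2.0, 3.75, "sp"),
               (3.75, 4.5, "yeah"), (4.5, 6.0, "sp")]

rts = response_times_ms(interviewer, interviewee)
summary = {"mean": sum(rts) / len(rts), "min": min(rts), "max": max(rts)}
print(rts, summary)
```

The per-participant average, minimum, and maximum then follow directly from the returned list, as in the last lines of the sketch.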
In order to measure pitch and intensity, another Praat script was written and run, focusing on the speech produced by the interviewees only. For both the pitch and intensity, the measurement was taken on vowels only (only the stressed vowels were measured). In order to measure pitch, the script extracted 100 pitch values per second for a vowel token, using the automated pitch measurement, specifically the unbiased autocorrelation method developed by Boersma (1993) [
32]. For both male and female speakers, the pitch floor was set to 75 Hz and the ceiling to 600 Hz. This range is the standard setting of the automated pitch measurement provided in Praat. While we acknowledge that the pitch floor is often recommended to be set differently depending on the gender of the speaker [33] (often around 70 Hz for men and 100 Hz for women), we did not differentiate the pitch setting by speaker gender. There are two reasons for this. Firstly, the adjustment of the pitch floor is an especially pressing matter when background noise is present; however, all of the interviews conducted for this study have minimal background noise. Secondly, many speakers, especially younger female speakers, employ a creaky voice, which often co-occurs with a very low pitch. It is often advised to set the pitch floor as low as 50 Hz or even 40 Hz if the recording contains predominantly creaky voice. Given that creaky voice is fairly common across the participants in our data, we judged that a pitch floor of 75 Hz for both males and females would capture more pitch points per speaker, which in turn gives us greater confidence in our calculations of the average pitch per speaker.
Per token, the mean fundamental frequency was computed. Vowel tokens with a duration shorter than 50 milliseconds were excluded. To obtain the degree of loudness, or intensity, the intensity of a sound in air was measured from sound pressure (in pascals) and expressed in decibels (dB). All the measurements for intensity were taken at the midpoint of a vowel. Measurements for pitch and intensity were compiled and summarized as the average, minimum, maximum, and variance. In examining the pitch difference between the two opposites within a given personality dimension (e.g., introverted vs. extroverted), we separated male participants from female participants.
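As a rough illustration of the aggregation step, the following Python sketch filters vowel tokens and summarizes pitch and intensity for one speaker. It is not the authors' Praat script. The token layout (the dictionary keys `duration`, `f0_samples`, and `intensity_db`) and the function names are hypothetical, the 50 ms exclusion is assumed to apply to both measures, and the sample variance is used because the text does not say which variance estimator was applied.

```python
from statistics import mean, variance

MIN_DURATION_S = 0.050  # tokens shorter than 50 ms are excluded


def summarize(values):
    """Average, minimum, maximum, and (sample) variance of a list of numbers."""
    return {
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
        "variance": variance(values),  # assumption: sample variance (n - 1)
    }


def speaker_summary(tokens):
    """Summarize per-token mean F0 (Hz) and midpoint intensity (dB) for one speaker."""
    kept = [t for t in tokens if t["duration"] >= MIN_DURATION_S and t["f0_samples"]]
    token_f0 = [mean(t["f0_samples"]) for t in kept]  # one mean F0 per vowel token
    token_db = [t["intensity_db"] for t in kept]      # intensity at vowel midpoint
    return {
        "n_tokens": len(kept),
        "pitch_hz": summarize(token_f0),
        "intensity_db": summarize(token_db),
    }


# Invented stressed-vowel tokens for illustration only.
tokens = [
    {"duration": 0.12, "f0_samples": [200.0, 210.0, 220.0], "intensity_db": 60.0},
    {"duration": 0.04, "f0_samples": [300.0], "intensity_db": 80.0},  # too short
    {"duration": 0.08, "f0_samples": [180.0, 190.0], "intensity_db": 64.0},
]

result = speaker_summary(tokens)
print(result)
```

Note that `statistics.variance` needs at least two retained tokens per speaker, which is easily satisfied by interview-length recordings.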
Speech rate and discourse markers were analyzed using R package “tidytext” [
34]. First, each utterance (i.e., intonational phrase) was parsed into individual words, and the duration of the utterance was also logged. Per utterance, the number of words per minute was calculated as a measure of speech rate, by multiplying the number of words by 60 and dividing by the duration of the utterance in seconds. All the utterances per participant were then compiled and summarized as the average, minimum, maximum, and variance.
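The speech-rate calculation amounts to a one-line formula, shown here as a minimal Python sketch. The study itself used the R package “tidytext”; the whitespace tokenization and the function name `words_per_minute` below are assumptions made for illustration, and utterance durations are taken to be in seconds.

```python
from statistics import mean, variance


def words_per_minute(transcript, duration_s):
    """Speech rate of one utterance: (number of words x 60) / duration in seconds."""
    if duration_s <= 0:
        raise ValueError("utterance duration must be positive")
    n_words = len(transcript.split())  # assumption: whitespace-delimited words
    return n_words * 60 / duration_s


# Invented utterances (text, duration in seconds) for illustration only.
utterances = [
    ("well I grew up in Corpus Christi", 2.0),
    ("yeah", 0.5),
    ("we moved when I was about ten", 3.0),
]

rates = [words_per_minute(text, dur) for text, dur in utterances]
summary = {
    "mean": mean(rates),
    "min": min(rates),
    "max": max(rates),
    "variance": variance(rates),  # assumption: sample variance
}
print(rates, summary)
```

Very short utterances, such as the one-word example here, still produce a rate under this formula, so how such utterances are handled can noticeably affect the per-speaker minimum and variance.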
In order to examine the relationship between discourse markers and personality, a total of nine discourse markers—like, well, uh, um, just, kind of, sort of, I mean, and you know—were identified. “Uh” and “um” were treated as one, as well as “kind of” and “sort of”. This yields a total of seven types of markers. Per participant, the frequency of each of these markers ((the number of occurrences for each/the total number of words in the interview) × 100) was generated. For two-word markers (i.e., kind of, sort of, I mean, and you know), the frequency rate was calculated in reference to the total number of bigrams—instead of the total number of words—in the interview.
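The frequency calculation for the seven marker types can be sketched as follows. This Python example is an assumption-laden illustration rather than the authors' R code: the regular-expression tokenizer, the dictionaries `UNIGRAM_MARKERS` and `BIGRAM_MARKERS`, and the function `marker_rates` are hypothetical, and every occurrence of a form is counted regardless of its function in context (for instance, “like” used as a verb), since the text does not describe any disambiguation step.

```python
import re
from collections import Counter

# Seven marker types: "uh"/"um" and "kind of"/"sort of" are each treated as one.
UNIGRAM_MARKERS = {
    "like": {"like"},
    "well": {"well"},
    "uh/um": {"uh", "um"},
    "just": {"just"},
}
BIGRAM_MARKERS = {
    "kind of/sort of": {("kind", "of"), ("sort", "of")},
    "I mean": {("i", "mean")},
    "you know": {("you", "know")},
}


def marker_rates(text):
    """Percentage rate of each marker type in one participant's transcript."""
    words = re.findall(r"[a-z']+", text.lower())  # assumption: simple tokenizer
    if len(words) < 2:
        raise ValueError("need at least two words to form a bigram")
    bigrams = list(zip(words, words[1:]))
    word_counts = Counter(words)
    bigram_counts = Counter(bigrams)
    rates = {}
    for name, forms in UNIGRAM_MARKERS.items():
        # One-word markers: occurrences per 100 words.
        rates[name] = 100 * sum(word_counts[w] for w in forms) / len(words)
    for name, forms in BIGRAM_MARKERS.items():
        # Two-word markers: occurrences per 100 bigrams.
        rates[name] = 100 * sum(bigram_counts[b] for b in forms) / len(bigrams)
    return rates


rates = marker_rates("well you know I just um like it you know")
print(rates)
```

Because the one-word and two-word markers use different denominators, their rates are not directly comparable with each other, only across participants for the same marker.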
2.5. Statistical Analysis
In this study, based on the central limit theorem, we presumed that the variables extracted from the sound recordings (40–60 min per person, for a total of 30 subjects) follow a normal distribution, especially the non-lexical features. To determine whether the extracted features differ statistically between personality types, two-sample t-tests were performed for each of the four MBTI personality dimensions. Particularly for the personality types in which the number of participants is small, the statistical results are best treated as a reference only. It is also the case that results often fail to reach statistical significance when the sample size is small. For discourse markers, a series of two-sample t-tests was conducted as well in order to see whether the frequency of these markers differed among participants depending on their gender (male vs. female) and on their personality types (e.g., introverted vs. extroverted). Normality tests (the Shapiro–Wilk test) were first performed on the frequency of the markers, after which the type of statistical test was determined. If the hypothesis of normality was rejected, the Wilcoxon rank-sum test was performed instead. For almost all markers (with the exception of “all markers combined” and “like”), the distribution was right-skewed (i.e., positive skew). Nevertheless, an important part of this study is to compare the magnitude of differences in the extracted features between personality types. In the results section, two tables show the average difference in sound and linguistic characteristics by gender and MBTI personality traits. More in-depth interpretations are given in the discussion section.
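For readers unfamiliar with the nonparametric fallback, the Wilcoxon rank-sum test can be sketched in plain Python using the normal approximation with a tie correction. This is an illustrative sketch only and not the analysis code used in the study: the function name `rank_sum_test` is hypothetical, no continuity correction is applied, and the Shapiro–Wilk step is omitted because the Python standard library does not provide it. With groups as small as those in this study, an exact test from a statistics package would usually be preferable to the approximation shown here.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation, tie-corrected).

    Returns (W, z, p), where W is the sum of the ranks of sample x.
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * n
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2  # average of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = midrank
        t = j - i
        tie_term += t ** 3 - t
        i = j
    w = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    mean_w = n1 * (n + 1) / 2
    var_w = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_w == 0:
        raise ValueError("all observations are identical; the test is undefined")
    z = (w - mean_w) / math.sqrt(var_w)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value under N(0, 1)
    return w, z, p


# Invented marker rates (%) for two groups, for illustration only.
group_a = [1.0, 2.0, 3.0]
group_b = [4.0, 5.0, 6.0]
w, z, p = rank_sum_test(group_a, group_b)
print(w, z, p)
```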
4. Discussion
The study observed that various speech characteristics are associated with some personality dimensions as delineated in the Myers–Briggs Type Indicator, thereby generally corroborating the relationship between speech (and, more broadly, language) and personality as noted by existing literature. Specifically, it seems that speech rate, response time, intensity, and discourse markers can be used in conjecturing a speaker’s personality as informed by the MBTI categories, with each of these features associated with a more targeted personality dimension in its predictive sense.
In this section, we discuss the findings presented in the previous section in more detail. Corroborating many previous studies that observed a faster response time among extroverts [22,23,35,36,37], our study also found that extroverts responded 0.329 s faster than introverts in conversation (p = 0.035). Compared to the introverted group, extroverts are oriented more toward the outer world. This tendency arguably seems to manifest in the prompt response during an interaction, which presumably requires less inward reflection on the part of a speaker.
Our study also found that loudness is differentiated between the judging types and the perceiving types, in which the former spoke with a higher volume (approximately 4.1 dB louder) than the latter (p = 0.009). In addition to loudness, speech rate was also differentiated between the judging types and the perceiving types (p = 0.046), in which the former spoke slower than the latter. It is interesting to find that the judging types, who appear in the outside world as more structured and orderly in carrying out a variety of tasks, spoke with a higher volume and spoke slower than the perceiving types, who prefer a more adaptable and spontaneous lifestyle.
As with response time, loudness, and speech rate, discourse markers proved to be useful in reflecting different personality types. It was observed that “well”, identified as “a response marker” to establish discourse coherence [38], was more frequently used by the judging types than the perceiving types (p = 0.013). It could be argued that the judging types (associated with conscientious types per the Big Five model [39]) are relatively more sensitive to the structure of the ongoing discourse in talk, given their tendency to organize their experiences in an orderly way. If thus posited, the judging types might arguably utilize “well” more frequently in order to achieve orderliness, or coherence, than the perceiving types, whatever this orderliness means in their minds. Furthermore, “uh/um” and “just” are more frequently used among the intuiting types than the sensing types. In general, the personality dimension of sensing/intuiting per the MBTI model seems to be the most responsive dimension in differentiating two opposite types based on the use of discourse markers. As mentioned in the previous section, when considering all the discourse markers together, the intuiting group showed a higher frequency rate than the sensing group (p = 0.005). This might be attributed to the characteristics of the intuiting types, who are proposed to process information not just through the sensory vehicle but, more importantly, by layering the initial information obtained via the senses with interpretive meanings and patterns. These discourse markers can shape the informational or interactional structure of the discourse (e.g., indicating old information vs. new information, weakening or strengthening the effect of a statement, etc.), thereby reflecting the speaker’s interpretation not only of their own utterances but also of the context of the ongoing conversation. For example, we downgrade certain information in speech (e.g., “it was just a game”) because we feel the need to take the emphasis away from the conveyed information. This also means that the information is conveyed in an interpretive way, with the meanings already imposed by the speaker at the time of speaking, reflecting the speaker’s cognitive-psychological act of “intuiting” manifested in the ongoing talk.
Although the use of discourse markers appears in our study to be a useful predictor of the sensing/intuiting personality dimension, it remains to be seen whether this would hold when other types of words are included besides discourse markers. In addition, explaining why the use of these discourse markers is associated with sensing/intuiting dimension but not with others, such as extroversion/introversion or thinking/feeling, is not an easy task. As speculated above, we could attempt to account for any differences in the frequency of discourse markers between the sensing group and the intuiting group by turning to the functions of these markers (i.e., approximating, filling the silence, pausing, down-grading, etc.), and propose that such functions—whichever they may be—would more likely reflect “intuiting” individuals’ speech style rather than “sensing” individuals’ style. For example, we might say that “intuiting” types use “just” more frequently (in other words, more likely to be down-grading), because their cognition is pre-occupied with meaning-making, or filling-in-the-lines, to a larger extent than “sensing” types. This logic, while not entirely nonsensical, is not internally supported; according to this reasoning (i.e., “just” appears more when one is cognitively “busy”), it could easily be the case that the use of “just” appears more frequently in the speech of “thinking” individuals than of “feeling” individuals, or vice versa. However, thinking types and feeling types exhibit no difference in the frequency of “like”, as reported above.
At the very least, our study provides evidence for the usefulness of looking at discourse markers in personality research. Many studies have recognized the importance of what people say (i.e., words) in understanding one’s personality, but for the majority of these studies, the focus has been placed on words that are either semantically or functionally substantial, often leaving out the words that are “nonessential” in contributing to the overall meaning, or “habitual” phrases including “you know”, “I guess”, and “you see” [40] (p. 46). These less-substantial words (rendered as “meaningless”) are labeled as “nonfluencies” (in the case of “uh” or “rr”) or as “fillers” (in the case of “like”, “I mean”, “you know”, “blah”, etc.) [
41]. However, these words, or nonfluencies, while lacking in their substantial contribution to the meaning, occur too frequently in everyday speech to be entirely neglected. Furthermore, these discourse markers are not without functions; in the fields of sociolinguistics and discourse analysis, these words are recognized by their particular, often nonoverlapping, functions in interactional contexts. Our study corroborates previous studies in which the use of language that is not consciously controlled is found to lend insights into one’s personality [
42,
43]. As Laserna et al. (2014) [
44] show, filled pauses and discourse markers can be used to mark age groups, genders, or personality (e.g., conscientiousness) aspects. Relatedly, Iacobelli et al. (2011) [
13] (p. 573) noted that looking at bigrams in order to consider “words in context” yields better results in classifying personality than solely relying on thematic categories of words, thus lending us further insights into the importance of considering these discourse markers in better understanding the relationship between language and personality traits.
It should also be noted that, despite the proven effectiveness of sociolinguistic interviews in eliciting naturalistic speech, the speech data used in this study were not drawn from a completely observer-free, and therefore natural, context. Even though the participants were fairly comfortable with the interview setting (especially because the interviews were long in duration and the participants became more and more relaxed as the interview continued), there is no way of knowing whether their interview speech was truly representative of their natural speech as would be found in an unrecorded, unobserved setting. Such environmental contexts (the interview setting, unfamiliar interviewers, etc.) could variably interfere with many aspects of the speech produced by different personality types. Perhaps certain personality types tend to be less nervous when speaking in a less natural setting, or certain personality types might be less affected by the comfort level. More studies are called for in addressing these questions. As such, the findings reported in the current study need to be validated with more participants, keeping in mind the potential impact of the pseudo-naturalistic contexts in which the interviews were conducted.
Despite these limitations and reservations, our study clearly shows that language characteristics including discourse markers, speech rate, intensity, or response time warrant more of our attention in exploring various aspects of personality. The findings in this study might also be of interest for those that seek to extrapolate and utilize linguistic features in predicting personality attributes. One such effort is currently underway by the authors of this paper, in which the aggregate of significant correlations between speech characteristics and personality dimensions is inputted into an artificial neural network, which is designed to predict personality traits only from speech data processing (the citation anonymized for review—disclosed in the cover letter). Given the increased interest in improving the quality of human-computer interactions in recent years, the current study calls for further attention to the applicability of speech data in understanding and predicting personality attributes.
The way we speak can mirror some aspects of ourselves, whether such aspects are social or psychological, including personality. While this idea resonates with us on a generally intuitive level, it is not always clear exactly what speech characteristics or linguistic features bear relevance in displaying certain aspects of personality. It is our belief that our study contributes to this line of inquiry, joining in answering the questions of the detailed relationship between language and personality.