How to Obtain Reliable Labels for MBTI Classification from Texts?
Sanja Štajner
Symanto Research
Nuremberg, Germany
sanja.stajner@symanto.com

Seren Yenikent
Symanto Research
Nuremberg, Germany
seren.yenikent@symanto.com
Proceedings of Recent Advances in Natural Language Processing, pages 1360–1368
Sep 1–3, 2021.
https://doi.org/10.26615/978-954-452-072-4_152

Abstract

Automatic detection of the Myers-Briggs Type Indicator (MBTI) from short posts attracted noticeable attention in the last few years. Recent studies showed that this is quite a difficult task, especially on commonly used Twitter data. Obtaining MBTI labels is also difficult, as human annotation requires trained psychologists, and the automatic way of obtaining them is through long questionnaires of questionable usability for the task. In this paper, we present a method for collecting reliable MBTI labels via only four carefully selected questions that can be applied to any type of textual data.

1 Introduction

The Myers-Briggs Type Indicator (MBTI) model (Briggs-Myers and Myers, 1995) is one of the most widely used non-clinical psychometric models (Štajner and Yenikent, 2020). It classifies people into one of two groups along each of four dimensions: extraversion/introversion (E/I), sensing/intuition (S/N), thinking/feeling (T/F), and judgement/perception (J/P). This leads to a total of 16 personality types. The first three dimensions are based on the theoretical work of Carl Jung (1921), while the fourth dimension was added later by Myers and Briggs-Myers (1995). The MBTI personality framework has already been used for decades in educational and industry settings, e.g. for finding jobs that best resonate with the person’s preferences for information processing (S/N and T/F dimensions), finding work organization types that best resonate with the person’s preferred judgement processes (J/P dimension) thus leading to better job satisfaction, and for better matching work environments with the person’s preferences (E/I dimension) to lower employee turnover (Briggs-Myers and Myers, 1995).

The original MBTI questionnaire contains 93 questions and is not freely available.¹ Due to the popularity of the MBTI framework (it is estimated that more than 2 million US adults complete the inventory every year),² there is a number of freely available alternative MBTI questionnaires on the internet, with the 16personalities test³ being one of the most popular ones. According to the Myers-Briggs Foundation⁴ and the 16personalities test website,⁵ both questionnaires satisfy the accepted standards for test validity and reliability. Nevertheless, the MBTI questionnaires have received noticeable criticism from the academic community (Pittenger, 1993; Boyle, 1995) for not relying on a scientifically proven (i.e. data-driven) background, but rather on qualitative measures such as observation and introspection. The other common criticism is the binary nature of the questionnaire, as it is known that the majority of people usually lie somewhere in the middle of the scales (Pittenger, 1993).

The questionnaire-based personality detection has several weaknesses: it requires trained human assessors; it is prone to social desirability bias (Krumpal, 2011) and reference-group effect (Heine et al., 2002); it is questionable if answering questionnaires is a natural way of showing one’s personality (as opposed to free writing or behaviour “when nobody watches”). To detect MBTI typologies in a more natural way and without the necessity for trained human assessors, many studies have attempted to build systems for automatic detection of MBTI personality types from text in the last several years. Attempts have been made for automatic detection of MBTI personality types from: tweets written in English (Plank and Hovy, 2015), six other Western European languages (Verhoeven et al., 2016), and Japanese (Yamada et al., 2019); English posts collected from Personality Cafe⁶ forum available in Kaggle;⁷ and English Reddit comments (Gjurković and Šnajder, 2018; Gjurković et al., 2020). Despite being trained on large amounts of textual data (over one million), and modelled as four binary classification tasks, the best systems performed only slightly better than the random and majority-class baselines, regardless of the architecture used.

Some studies suggested that tweets might not contain sufficient amounts of MBTI signals (even after concatenating up to 150–200 tweets per user) due to the nature of Twitter posts (Celli and Lepri, 2018; Štajner and Yenikent, 2020, 2021). Another issue with all those studies and obtained results might be that the systems are supervised and were trained with gold labels obtained via MBTI questionnaires that suffer from all earlier mentioned weaknesses. In our recent study (Štajner and Yenikent, 2021), we found a low association between the MBTI types obtained via questionnaires and the MBTI signals found in the short texts written by participants (tweets and free texts on carefully chosen topics). At the same time, the inter-annotator agreement of two expert annotators assigning MBTI types based on those free texts was quite high (Štajner and Yenikent, 2021).

Contributions. To avoid all previously mentioned problems in automatic MBTI detection from texts, in this study, we propose a carefully designed set of four questions with answers on a 1–5 scale (Section 3) that aim to capture the main MBTI characteristics without taking much time from participants, and can be administered together with any open-end questions without need for trained human assessors. The validity of our questionnaire has been assessed via expert human annotation following previously proposed annotation methodology (Štajner and Yenikent, 2021). The agreement between the answers to the newly proposed questions and the expert human annotations was found to be similar to that between two trained annotators (Section 5.2). Another advantage of the proposed method is that it goes beyond binary typology, by offering a 5-point scale for each MBTI dimension. This creates a possibility for filtering out those instances written by people who exhibit a similar amount of signals from both polarities. As it is known that many people have characteristics of both polarities across MBTI dimensions (Pittenger, 1993), such filtering of training datasets might lead to better performances of automatic systems for MBTI detection from texts by removing noise.

¹ https://www.myersbriggs.org/using-type-as-a-professional/versions-of-the-mbti-questionnaire/
² https://www.verywellmind.com/the-myers-briggs-type-indicator-2795583#the-mbti-today
³ https://www.16personalities.com/free-personality-test
⁴ myersbriggs.org
⁵ https://www.16personalities.com/articles/reliability-and-validity
⁶ https://www.personalitycafe.com/
⁷ https://www.kaggle.com/datasnaek/mbti-type

2 Related Work

Plank and Hovy (2015) were the first to explore the use of Twitter data for obtaining a large-scale dataset for open-vocabulary automatic detection of MBTI personality traits. They collected a corpus of 1.2M English tweets automatically labelled for gender and MBTI type. To identify the users for whom an MBTI type can be automatically assigned, the authors relied on mentions of any of the 16 MBTI types plus the word “Briggs”. Additionally, each user was labelled as female or male whenever it was discernible; those users for whom the gender was not discernible were excluded from the study. For each selected Twitter user, the authors collected up to 2000 most recent tweets (to be included, each user had to have at least 100 tweets). Plank and Hovy (2015) found that the distribution of MBTI types across the selected Twitter users significantly differs from the distribution of MBTI types across the general US population. The authors further trained binary classification models (for each MBTI dimension separately) using various features and model architectures. The best systems outperformed majority-class baselines only for I/E and T/F dimensions.

Verhoeven et al. (2016) used a similar strategy for obtaining large-scale MBTI datasets for six other languages: German, Italian, Dutch, French, Portuguese, and Spanish. As opposed to the work of Plank and Hovy (2015), the triggers for identifying users whose MBTI types can be automatically assigned were mentions of one of the 16 personality types and the word “personality” or pronouns and verb forms such as “I am” or “I have”, for each of the six languages. All retrieved contexts were manually checked for whether or not they describe the personality of the writer of the post. For all users whose posts passed this check, the gender was annotated based on the user’s name, handle, description, and profile picture (Verhoeven et al., 2016). Distributions of MBTI types across Twitter users of the six languages were found to be similar, with only a few exceptions (Verhoeven et al., 2016). The authors also trained binary classifiers using the dataset with 200 concatenated tweets for each user and a LinearSVC classifier with binary word and character n-gram features. As for English (Plank and Hovy, 2015), in most of the languages, the best classifiers outperformed the majority-class baselines only for E/I and T/F dimensions.
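To make the feature representation concrete, the following is a minimal sketch of binary (presence/absence) word and character n-gram features computed over a user's concatenated tweets. The n-gram ranges, the `w:`/`c:` prefixes, and all function names are illustrative assumptions; the cited studies' exact settings are not given here. The classifier itself is omitted: in practice such vectors would be passed to a linear SVM implementation.

```python
from typing import Iterable, List, Set


def word_ngrams(text: str, n_max: int = 2) -> Set[str]:
    """Word n-grams (1..n_max) over whitespace tokens; n_max is an assumed setting."""
    tokens = text.lower().split()
    feats = set()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            feats.add("w:" + " ".join(tokens[i:i + n]))
    return feats


def char_ngrams(text: str, n_min: int = 2, n_max: int = 3) -> Set[str]:
    """Character n-grams (n_min..n_max); the range is an assumed setting."""
    s = text.lower()
    feats = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(s) - n + 1):
            feats.add("c:" + s[i:i + n])
    return feats


def binary_features(tweets: Iterable[str]) -> Set[str]:
    """Concatenate a user's tweets and return the set of n-grams that occur.

    A set records only presence, not counts, which is what "binary" means here.
    """
    doc = " ".join(tweets)
    return word_ngrams(doc) | char_ngrams(doc)


def vectorize(features: Set[str], vocabulary: List[str]) -> List[int]:
    """Map a feature set to a 0/1 vector over a fixed vocabulary."""
    return [1 if f in features else 0 for f in vocabulary]


# Hypothetical example user with two tweets.
user_tweets = ["I love quiet evenings", "quiet evenings are great"]
feats = binary_features(user_tweets)
vocab = sorted(feats)
vector = vectorize(feats, vocab)
```

Repeated n-grams (here "quiet evenings") contribute a single 1, so long and repetitive users are not weighted more heavily than short ones.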
Gjurković and Šnajder (2018) compiled a large-scale MBTI dataset from English Reddit comments by relying on flairs (short introductions of users on various subreddits), which, in the case of the MBTI-related subreddits, usually contain the users’ MBTI results. In the subsequent study (Gjurković et al., 2020), the dataset was further enriched with demographic information about the users (age, gender, location, and language), and the labels for two other personality models. The distribution of MBTI types in this dataset also significantly deviated from the general US population (see Figure 3 in Section 6 for comparison of MBTI type distribution among different populations/datasets).
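The mention-based and flair-based labelling used in the studies above amounts to pattern matching on the 16 MBTI type codes. Below is a minimal sketch under stated assumptions: the function names are hypothetical, and the rule of keeping only users with exactly one distinct type is an illustrative choice, not the procedure of any cited study.

```python
import re
from typing import Iterable, List, Optional

# The 16 MBTI codes are all combinations of E/I, S/N, T/F, J/P.
MBTI_PATTERN = re.compile(r"\b([EI][SN][TF][JP])\b", re.IGNORECASE)


def find_mbti_mentions(text: str) -> List[str]:
    """Return every MBTI type code mentioned in the text, upper-cased."""
    return [m.upper() for m in MBTI_PATTERN.findall(text)]


def label_user(texts: Iterable[str]) -> Optional[str]:
    """Assign a type only if exactly one distinct type is mentioned (assumed rule)."""
    types = {t for text in texts for t in find_mbti_mentions(text)}
    return types.pop() if len(types) == 1 else None


# Hypothetical flair and post strings.
flair_label = label_user(["INTP | 25 | she/her"])
ambiguous_label = label_user(["Not sure if I am INTP or INTJ"])
```

A match only shows that a type code occurs in the text; it does not show whose type it is.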
Automatic assignment of MBTI type to each user in all above-mentioned studies is based on automatic extraction of contexts in which a certain MBTI type is mentioned. Without manual inspection of each such mention (which was only reported for the study by Verhoeven et al. (2016)), the assigned labels might not be reliable, as they may refer to someone else mentioned in the tweet and not the writer of the tweet, or they might be a part of a larger phrase, e.g. “I think/believe I am an INTP” or “I expect to get ESFJ as the result if I do personality assessment”.

To the best of our knowledge, the only study in which MBTI labels were obtained by explicitly asking participants to report their MBTI type, if they had done an MBTI personality test in the past, is our recent study (Štajner and Yenikent, 2021). The Amazon Mechanical Turk workers were also asked to describe their favourite type of vacations and preferred hobbies in a minimum of 300 characters each. We found that this type of text (responses to carefully selected open-end questions) contains more MBTI signals than tweets (even if concatenated together for each user). We further proposed detailed guidelines for MBTI personality annotation from textual data, and showed that expert human annotators have a high level of agreement among themselves on obtained textual answers when following provided guidelines. At the same time, we found that the annotators have a low level of agreement with the MBTI types reported by participants (based on their previous MBTI personality testing via popular questionnaires), which might be an indication that MBTI results obtained via questionnaires do not resonate well with the MBTI signals found in more natural textual forms.

The current study aims to overcome previously reported issues by proposing four questions with the answers on a 1–5 scale to obtain MBTI labels that better resonate with the expert human MBTI annotations on short texts.

3 Questionnaire

The whole questionnaire consisted of one optional question “You might have obtained your MBTI type in the past via questionnaires. If you know your MBTI type, please type it here”, four compulsory demographic questions, four compulsory questions with answers on a 1–5 scale that aimed to capture the participants’ MBTI type, and two compulsory open-end questions. Demographic questions encompassed gender, age, whether or not English is their native language, and the highest level of education obtained (Figure 1). The gender question had four possible answers: female, male, other, prefer not to specify. Five age groups were offered to choose from: 18–25, 26–35, 36–45, 46–55, and over 55.

Figure 1: Demographic questions.

After answering demographic questions, participants were provided with four questions that aimed to capture their MBTI type, and were asked to provide an answer on a 1–5 point scale. Those four questions are the central contribution of this study. By following the idea that aspects of leisure time represent the most natural version of personality, as it is directed by high degrees of intrinsic motivation (Štajner and Yenikent, 2021), the questions are focussed on typical leisure time activities: hobbies and vacations. This also gave us the opportunity to utilize the previously proposed open-end questions (Štajner and Yenikent, 2021) in the validation process (Section 5). In deciding the content of the questions for each individual dimension, we followed the main definitions provided by Briggs-Myers and Myers (1995). Although each MBTI dimension corresponds to multiple practical and behavioral characteristics, the core theoretical focus for every dimension is consistent.

The first question (for the E/I dimension) was designed with the idea of capturing whether the person prefers to be surrounded by people and social interactions, on one end of the scale (1 = extraverted), or to spend quiet and calm time by themselves, on the other end of the scale (5 = introverted). The second question (for the S/N dimension) aims to capture the characteristics of the tasks people would prefer to process, concrete or intuitive, by asking whether they prefer technical and hands-on hobbies (1 = sensing) or abstract and imaginative (5 = intuitive). The third MBTI dimension (T/F) is fundamentally about how people make their decisions, whether based on rational or emotional motives. As people do not engage with strict decision-making processes during their free time, which is ultimately based on their personal interests, the question measured the preference for rational (1 = thinking) or emotional (5 = feeling) reasoning for liking a certain hobby. The fourth question aimed to capture the preference for spontaneous and flexible (1 = perceiving), or a well-planned (5 = judging) schedule at vacations.

We initially prepared two questions per each MBTI dimension and performed a pilot study with 30 participants to choose those questions (Figure 2) that better correspond to the MBTI types provided by the participants, and the MBTI annotations by two annotators.

Figure 2: Questions for obtaining MBTI labels.

Finally, participants were asked to answer two open-end questions, which we previously proposed (Štajner and Yenikent, 2021) as the optimal questions for annotating MBTI types from texts:

• Describe which kind of vacations you typically enjoy and why.
• Describe what type of hobbies you enjoy and why.

The two questions were preceded by the following instructions: “The following questions aim to understand your life style preferences. While answering, please write down the first things that come to your mind without much contemplation.” To be accepted, each answer needed to contain a minimum of 300 characters.

4 Challenges in Data Collection

Data was collected via the Amazon Mechanical Turk (AMT) platform. We prepared the questionnaire as Google Forms and provided the link to it in the HIT of the AMT platform. We experimented with various setups in the platform: different values for monetary compensations, allowing only those participants with high scores on previous tasks, different times for validation of the answers and payment. The only variable that noticeably influenced the time needed for obtaining completed HITs was whether or not we restrict the participants according to their performance on the previous HITs. Without any restrictions, we were