
ProZe: Explainable and Prompt-guided Zero-Shot Text Classification

Ismail Harrando*, Alison Reboud*, Thomas Schleider*, Thibault Ehrhart and Raphael Troncy
EURECOM, Sophia Antipolis, France
* Equal contribution

Abstract—As technology accelerates the generation and communication of textual data, the need to automatically understand this content becomes a necessity. In order to classify text, be it for tagging, indexing or curating documents, one often relies on large, opaque models that are trained on pre-annotated datasets, making the process unexplainable, difficult to scale and ill-adapted for niche domains with scarce data. To tackle these challenges, we propose ProZe, a text classification approach that leverages knowledge from two sources: prompting pre-trained language models, as well as querying ConceptNet, a common-sense knowledge base which can be used to add a layer of explainability to the results. We evaluate our approach empirically and show that this combination not only performs on par with state-of-the-art zero-shot classification on several domains, but also offers explainable predictions that can be visualized.

Keywords: Text classification, zero-shot, explainability, common sense knowledge graph, prompting language models

Introduction

The Natural Language Processing (NLP) and Information Extraction (IE) fields have seen many recent breakthroughs, especially since the introduction of Transformer-based approaches and BERT [1], which has become the de-facto family of models to tackle most NLP tasks. Over the last few years, few-shot and zero-shot learning approaches have gained momentum, particularly for cases with little data and where uncommon or specialized vocabularies are used. Fully zero-shot classification approaches do not require any training data and often show respectable performance. An interesting new paradigm is prompt-based learning, which leverages pre-trained language models through prompts (i.e. input queries that are handcrafted to produce the desirable output) instead of training models on annotated datasets. However, a major downside of all these approaches based on Transformer-based language models is that they suffer from a lack of explainability.

Recently, ZeSTE [2] tackled this lack of interpretability in text classification by departing from language models and relying instead on ConceptNet [3] and its explicit relations between words. With every word being a node in ConceptNet, ZeSTE can justify the relatedness between the words in a document and its assigned label. While it shows state-of-the-art results in topic categorization, it does not offer ways to specialize the classifier beyond "common sense knowledge" (domain adaptation), nor does it offer the possibility to disambiguate labels. These challenges are important to solve for text classification in specific domains, especially since zero-shot classification is particularly useful for domain-specific use cases with little data to train a model.
As a consequence, this paper proposes ProZe, a zero-shot classification model which combines latent contextual information from pre-trained language models (via prompting) and explicit knowledge from ConceptNet. This method keeps the explainability property of ZeSTE while still offering a step towards label disambiguation and domain adaptation.

The remainder of this paper is structured as follows. First, we give an overview of the relevant state-of-the-art work. We then detail our proposed method, ProZe. Next, we present our results on common topic categorization datasets as well as on three challenging datasets from diverse domains: screenplay aspects for a crime TV series [4], historical silk textile descriptions [5], and the Situation Typing dataset [6]. We report and analyze the results of several empirical classification experiments, including a comparison to some state-of-the-art zero-shot approaches. Finally, we conclude and outline some future work.

Related Work

Language Models
Since the breakthrough performance of AlexNet on the 2012 ImageNet challenge [7], transfer learning via pre-trained models has become a new standard in many machine learning tasks. With the introduction of the Transformer architecture [8], this paradigm shift made its way to the NLP field as well through the advent of "pre-trained language models".

The most influential Transformer-based model is BERT [1]. Its defining feature is its ability to pre-train deep bidirectional representations. Such pre-trained language models remain part of the most successful approaches for many NLP tasks, such as text classification. Despite the wide availability of these language models, many classification experiments also require annotated and balanced training data to make a model properly associate documents with labels, which can be either expensive or not available at all for niche domains.

Zero-Shot Classification
With the rising popularity of zero-shot classification methods, there are now more attempts to benchmark and evaluate these approaches on text classification. [9] provides a survey of the recent advances in the field, while proposing Entail, a zero-shot classification model based on language models fine-tuned on the task of Natural Language Inference to classify documents. Some zero-shot classification models also take advantage of "prompt-based learning" [10], a new paradigm used for many NLP tasks that allows extracting information out of language models.

Explainability in NLP
One direction within the growing body of work on explainable methods is to generate explanations and to develop evaluations that measure the extent and likelihood that an explanation and its label are associated with each other in the model that generated them [11]. However, none of these techniques fully compensates for the obscurity associated with language models. This is the main reason why the approach presented in this paper relies on ZeSTE (Zero-Shot Topic Extraction) [2], which is not based on a pre-trained language model and provides explainability of its classification results using ConceptNet as a prediction support.

Previous contributions leverage knowledge graphs [12], [13], [14] and common sense [15] to improve the performance of several classification tasks. To the best of our knowledge, our approach is the first to use a common-sense knowledge graph without a learning component, using the KG as is, which allows it to retain explainability.

ConceptNet
A central resource for this work is ConceptNet [3], a semantic network "designed to help computers understand the meanings of words that people use" (https://conceptnet.io). Broadly speaking, ConceptNet is a graph of words (or concepts) connected by edges representing semantic relations that go beyond the lexical relations that can be found in a dictionary, such as "Synonym" or "Hypernym". Most importantly, ConceptNet contains relations of general "relatedness" (/r/RelatedTo in ConceptNet), which imply an undefined semantic relation between two concepts, such as "Business" and "Outsourcing": while both terms are used in similar contexts, one cannot define such a relation as one of containment, usage or typing. It is notable that, unlike semantic similarity between two terms via word embeddings, "relatedness" relations are usually mined from dictionary entries or corresponding Wikipedia articles, thus making them explainable to the user.
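For illustration, such "relatedness" edges can be retrieved from the public ConceptNet 5 REST API. The sketch below is a minimal example, assuming the /query endpoint and edge fields as documented in the ConceptNet 5 wiki; it is not part of the original implementation.

```python
# Minimal sketch: fetch /r/RelatedTo edges for a concept from the public
# ConceptNet 5 API (endpoint and fields per the ConceptNet 5 documentation).
import requests

response = requests.get(
    "http://api.conceptnet.io/query",
    params={"node": "/c/en/business", "rel": "/r/RelatedTo", "limit": 20},
).json()

for edge in response["edges"]:
    # Each edge exposes its two endpoints, the relation and a weight.
    print(edge["start"]["label"], "-RelatedTo->", edge["end"]["label"], edge["weight"])
```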


Other than the knowledge graph, ConceptNet comes with its own set of graph embeddings called "ConceptNet Numberbatch" (https://github.com/commonsense/conceptnet-numberbatch). Computed to reflect both the connectedness of nodes in the ConceptNet graph and the linguistic properties of words via retrofitting to other pre-trained word embeddings [3], these embeddings can better capture semantic relatedness between words, as demonstrated by their performance on the SemEval 2017 challenge (https://alt.qcri.org/semeval2017). We use the semantic graph for generating explanations and the Numberbatch embeddings to prune out excessive and noisy relations in our method.

Method

Our model can be seen as a pipeline comprising several components. In this section, we explain each step of the process in further detail.

Generating Label Neighborhoods
The first step of our approach is to manually create mappings between target class labels and their ConceptNet nodes. For instance, if we want our classifier to recognize documents for the class "sport", we designate the node /c/en/sport as our starting node (in the remainder of this paper, we omit the prefix /c/en/ as all labels in our datasets are in English).

Based on these mappings between target labels and concept nodes, we can then generate a list of candidate words (from ConceptNet) that are related to the respective concept, which we call the "label neighborhood". Each candidate is produced by retrieving every node that is N hops away from the class label node. Afterwards, a score can be calculated for each label based on which words are present in the input text or document to classify. To this end, we score every word in the label neighborhood based on its "similarity" to the class label.

Scoring a Document
Like ZeSTE, we proceed to score each document by first generating a score for each node in a label neighborhood. To do so, multiple approaches exist. In this paper, we present and compare 3 such scoring methods (SM); a minimal sketch of the third one is shown after this list.

1) ConceptNet embeddings similarity (SM1): ConceptNet Numberbatch provides graph embeddings computed for ConceptNet nodes. To quantify similarity, we compute the cosine similarity between the embedding of each node in the label neighborhood and the label node itself.

2) Scoring through Inference (SM2): for this scoring method, we use a model that is pre-trained on the task of Natural Language Inference. In a similar setting to the previous method, we prompt the model with a sentence related to the label or its domain, and then we ask it to score all the words from its neighborhood based on the logical entailment between the prompt (premise) and a template containing the word (hypothesis).

3) Language Modeling Probability (SM3): for this scoring method, we combine the predictive power of language models with the explicit relations that we can find in the label neighborhood. For each label, we supply the language model with a prompt, i.e. a sentence that is likely to guide it towards the specific meaning of the label we target (for example, the definition of the label), and then we ask it to predict the missing word in a Cloze statement (a sentence where one word is removed and replaced by a blank). For example, to score words related to the label "sport", we can give the model a definition of the word, and then ask it to predict the blank word in the following Cloze statement: "Sport is related to [blank].". Given that language models are pre-trained on predicting such blanks, we can use the scores they attribute to that blank to measure the similarity between our label and the candidate words from its neighborhood. For instance, when we give the dictionary definition of sport to the language model, the top predicted words are 'recreation', 'fitness' and 'exercise'. Because the language model outputs a probability for every word in its vocabulary, we score only the words that originally occur in the label neighborhood. If a word in the neighborhood does not appear among the predictions of the model (i.e. it is out of the model's vocabulary), the score from SM1 is used.
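A minimal sketch of this cloze scoring with a generic masked language model from HuggingFace transformers (the paper prompts a pre-trained BART model; roberta-base is substituted here for a simpler single-mask setup, and the prompt and neighborhood words are illustrative):

```python
# Minimal sketch of SM3: score neighborhood words by the probability the
# masked language model assigns to them at the blank of a Cloze statement.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

def sm3_scores(prompt: str, neighborhood: list[str]) -> dict[str, float]:
    # The definition primes the model; the mask token plays the blank.
    text = f"{prompt} {tokenizer.mask_token}."
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    probs = logits[0, mask_pos].softmax(dim=-1)
    scores = {}
    for word in neighborhood:
        # Only words mapping to a single vocabulary token get an LM score;
        # per the method, out-of-vocabulary words fall back to SM1.
        ids = tokenizer(f" {word}", add_special_tokens=False).input_ids
        if len(ids) == 1:
            scores[word] = probs[ids[0]].item()
    return scores

scores = sm3_scores(
    "Sport is an activity involving physical exertion and skill. Sport is related to",
    ["recreation", "fitness", "exercise", "autopsy"],
)
```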
Once the scores are computed by one of these methods, we can proceed to score any document given as input to the model. To score a document, we first tokenize it into separate words. We then take all the nodes from the neighborhood of a label that appear in the tokenized document, and we add up their scores to produce a score for the label. We do so for each label we are targeting, and the final prediction of the model corresponds to the label with the highest score. Because all the nodes in the neighborhood are linked to the label node with explicit relations in ConceptNet, we can explain in the end how each word in the document contributed to the score and how it is related to our label.
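As a minimal illustration of this scoring loop (assuming `neighborhoods` maps each label to the per-word scores produced by one of the scoring methods above; the tokenization shown is deliberately naive):

```python
# Minimal sketch of the document scoring step described above.
import re

def classify(document: str, neighborhoods: dict[str, dict[str, float]]) -> str:
    tokens = re.findall(r"[a-z]+", document.lower())  # naive tokenization
    label_scores = {}
    for label, word_scores in neighborhoods.items():
        # Sum the scores of all neighborhood words occurring in the document.
        label_scores[label] = sum(word_scores.get(token, 0.0) for token in tokens)
    # The prediction is the label with the highest accumulated score; every
    # contributing word can be traced back to ConceptNet for the explanation.
    return max(label_scores, key=label_scores.get)
```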
Prompting Language Models
In this section, we explain how we leverage language models to score the label neighbors extracted from ConceptNet, as per the scoring methods SM2 and SM3 described above.

Both SM2 and SM3 rely on prompting the language model, i.e. feeding it a sentence that functions as a context to "query" its content (also known as probing [16]). As discussed in the related work, prompting language models is an open problem in the literature. In this work, we explore some potential ideas for prompting to serve our objective of measuring word-label relatedness.

The prompting follows the same scheme for both scoring methods. We vary both the premise and hypothesis templates and report the results for some proposals in the Evaluation section. For the premise, we experiment with two approaches:

1) Domain description: we prime the model with the name or description of the domain of the dataset, e.g. "Silk Textile", "Crime series", etc.

2) Label definition: we prime the model with the definition of the label, with the assumption that this will help it disambiguate the meaning of the label and thus come up with better related words. For instance, for the label "space", we provide the language model with the sentence "Space is the expanse that exists beyond Earth and between celestial bodies". The definitions can be taken from Wikipedia or a dictionary, generated using an NLG model, etc.

We observed experimentally that using just the description of the domain as a prompt gives better overall performance. Therefore, we only report results on these prompts in the following sections. As for the hypothesis, we provide the model with a sentence like "[blank] is similar to space" or "Space is about [blank]", which we use in our reported results.

We note that, while the combination of premise and hypothesis can impact the overall performance of the model, the search space for a good prompt is quite wide. Thus, we only report the performance of some combinations, as we intend this paper to point out the use of such a mechanism for this task rather than to fully optimize the process.
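A minimal sketch of this entailment scoring (SM2) with one of the premise/hypothesis templates above, using facebook/bart-large-mnli, the same NLI model as the Entail baseline described later; templates and examples are illustrative:

```python
# Minimal sketch of SM2: score a neighborhood word by the probability that
# the domain/label premise entails a template containing the word.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "facebook/bart-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

def sm2_score(premise: str, word: str, label: str) -> float:
    hypothesis = f"{word} is similar to {label}"
    inputs = tokenizer(premise, hypothesis, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # For this model the label order is contradiction / neutral / entailment,
    # so the last logit corresponds to the 'entailment' class.
    return logits.softmax(dim=-1)[0, -1].item()

score = sm2_score(
    "Space is the expanse that exists beyond Earth and between celestial bodies.",
    "galaxy",
    "space",
)
```

Note that this requires one forward pass per neighborhood word, which is what makes SM2 markedly slower than SM3 (see the Tool Demonstrator section below).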


Tool Demonstrator
To explain the decisions of the model, we follow the same method as ZeSTE [2], i.e. we highlight the words which contribute to the classification decision in a graph that links them with semantic relations to the label node. The difference is that the scores in ProZe also take into account the scoring from the language model. To illustrate the contribution of the language model, we developed an interactive demonstrator enabling a user to test the effect of prompting the language model to improve the results of zero-shot classification (Figure 1). This demonstrator is available at http://proze.tools.eurecom.fr/.

Figure 1: ProZe neighborhoods demo. (1) The user is asked to select a label. (2) The user can input a text to prompt and guide the language model. (3) The user can visualize the label neighborhood, with added and removed nodes highlighted, and is shown a detailed list of all the changes resulting from the prompt.

After choosing a label to study, the user is asked to enter a prompt that can help the model identify words related to the label (e.g. a definition or domain). The user is then shown an abridged version of the prompt-enhanced label neighborhood: the connection between any node and the label node is omitted for clarity but can be trivially retrieved from ConceptNet, and only the top 50 words (based on the scoring used) are shown to represent the new label neighborhood, with the intensity of the color reflecting higher scores.

The user can view in detail the updates happening before and after introducing the new scoring from the language model. For this demonstration, we use the SM3 method to score the nodes, as it requires only one pass through the language model to generate a score for all words in its vocabulary, whereas the SM2 method requires an inference for every word in the label neighborhood. As a consequence, while the SM2 method takes up to 7 minutes per label on our hardware, the SM3 method takes less than a second while still delivering good performance.

Datasets
In this section, we present three widely used topic categorization datasets in the news domain, as well as three other very different and domain-specific datasets making use of fine-grained labels.

News Topics Datasets
Used to benchmark multiple text classification approaches, news datasets are often categorized by topic and are written in simple and common language. In our experiments, we report results on three such commonly-used datasets: AG News, BBC News and 20 Newsgroups.

• 20 Newsgroups [17]: a collection of 18,000 user-generated forum posts arranged into 20 groups seen as topics, such as "Baseball", "Space", "Cryptography", and "Middle East".

• AG News [18]: a news dataset containing 127,600 English news articles from various sources. Articles are fairly evenly distributed among 4 categories: "World", "Sports", "Business" and "Sci/Tech".

• BBC News [19]: a news dataset from the BBC containing 2,225 English news articles classified in 5 categories: "Politics", "Business", "Entertainment", "Sports" and "Tech".

Crisis Situations
The first low-resource classification dataset we use is the Situation Typing dataset [6]. The goal is to predict the type of need (such as the need for water or medical care) required in a specific situation, or to identify issues such as violence. Therefore, this dataset constitutes a real-world, high-consequence domain for which explainability is particularly important. The entire dataset contains 5,956 labeled texts and 11 types of situations: "food supply", "infrastructure", "medical assistance", "search/rescue", "shelter", "utilities, energy, or sanitation", "water supply", "evacuation", "regime change", "terrorism", "crime violence" and a "none" category. In our experiment, we use the test set (2,343 texts), where we only select texts that represent at least one of the situations, and we consider it a success if the model predicts at least one correct label.

Crime Aspects
The Crime Scene Investigation (CSI) dataset (https://github.com/EdinburghNLP/csi-corpus) contains 39 CSI video episodes together with their screenplays segmented into 1,544 scenes. An episode scene contains on average 21 sentences and 335 tokens. Originally, this dataset was used for screenplay summarization, as each scene is annotated with a binary label denoting whether it should be part of a summary episode or not. Additionally, the three annotators had to justify the choice of their selected summary scenes with regard to it being about one, more or none of the following six aspects: i) the victim, ii) the cause of death, iii) an autopsy report, iv) crucial evidence, v) the perpetrator, and vi) the motive/relation between perpetrator and victim.

We define the following labels to evaluate the ProZe system: victim, cause of death, crime scene, evidence, perpetrator, motive. For our classification task, we kept only the scenes which were associated with at least one aspect (449 scenes). In the case where one scene is associated with multiple labels, if the model predicts one of the labels, we consider it a success.
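This "any correct label counts" criterion, used for both the crisis situations and the crime aspects datasets, can be expressed as a small helper; a minimal sketch with illustrative names:

```python
# Minimal sketch of the evaluation criterion for the multi-label datasets:
# a prediction counts as a success if it matches any of the gold labels.
def any_match_accuracy(predictions: list[str], gold_sets: list[set[str]]) -> float:
    hits = sum(1 for pred, gold in zip(predictions, gold_sets) if pred in gold)
    return hits / len(predictions)
```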

Silk Fabric Properties
This dataset is an excerpt from the multilingual knowledge graph of the European H2020 SILKNOW research project (https://silknow.eu/), which aims at improving the understanding, conservation and dissemination of European silk heritage. The SILKNOW knowledge graph consists of metadata about 39,274 unique objects integrated from 19 museums and represented through a CIDOC-CRM-based set of classes and properties. This metadata about silk fabrics usually contains both explicit categorical information, like specific weaving techniques or production years, and rich and detailed textual descriptions. Our goal is to predict categorical values based on these text descriptions.

The SILKNOW knowledge graph dataset can be divided into "material" and "weaving technique" subsets. More precisely, we slightly extend the dataset used in [20], and after removing objects with more than one value per property, we obtain 1,429 object descriptions making use of 7 different labels for silk materials, and 833 object descriptions with 6 unique labels for silk techniques. The chosen labels also have to be mapped to ConceptNet entries to work with this approach. Table 1 shows the final selection of thesaurus concepts and their mapping to ConceptNet nodes.

Table 1: Mapping between the concepts used in the SILKNOW knowledge graph and ConceptNet (ProZe and ZeSTE).

Property   | SILKNOW Concept     | ConceptNet
-----------|---------------------|------------------
Material   | Cotton              | /c/en/cotton
Material   | Wool                | /c/en/wool
Material   | Textile             | /c/en/textile
Material   | Metal thread        | /c/en/metal
Material   | Metal silver thread | /c/en/silver
Material   | Silver thread       | /c/en/silver
Material   | Gold thread         | /c/en/gold
Technique  | Damask              | /c/en/damask
Technique  | Embroidery          | /c/en/embroidery
Technique  | Velvet              | /c/en/velvet
Technique  | Voided Velvet       | /c/en/velvet
Technique  | Tabby (silk weave)  | /c/en/tabby
Technique  | Muslin              | /c/en/tabby
Technique  | Satin (Fabric)      | /c/en/satin
Technique  | Brocaded            | /c/en/brocaded

Evaluation
We evaluate ProZe on these 6 datasets. In this section, we present the results of this evaluation.

Baselines
We compare our model with:

• ZeSTE: this approach solely relies on ConceptNet to perform zero-shot classification;

• Entail: this model was originally proposed in [9]. We use bart-large-mnli as the backend Transformer model, which is a version of BART [21] that has been fine-tuned


on the Multi-genre Natural Language Inference (MNLI) task, as per the implementation we use for our experiments (it can be tested at https://huggingface.co/zero-shot/). Given a text acting as a premise, the task of Natural Language Inference (NLI) aims at predicting the relation it holds with a hypothesis sentence, labelling it either as false (contradiction), true (entailment), or undetermined (neutral). Generally, the labels are injected in a sentence such as "This text is about" + label to form a hypothesis. The confidence that the relation between the text to be labelled and the hypothesis is 'entailment' is taken as the confidence of the label being correct. We use the implementation provided at https://github.com/katanaml/sample-apps/tree/master/01.
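For reference, the same model is also exposed through the HuggingFace zero-shot-classification pipeline, which wraps this NLI formulation; a minimal usage sketch (the example text and labels are illustrative):

```python
# Minimal sketch of the Entail baseline via the HuggingFace pipeline.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
result = classifier(
    "The match went into extra time before the home side scored the winner.",
    candidate_labels=["Politics", "Business", "Entertainment", "Sports", "Tech"],
    hypothesis_template="This text is about {}.",
)
print(result["labels"][0])  # highest-confidence label
```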
Quantitative Analysis
We limit the size of the label neighborhoods to 20k nodes per label for each experiment, except in cases where querying ConceptNet returns fewer nodes than that. Then, we resize all the neighborhoods to be equal in size to the smallest one (by eliminating the nodes with the lowest similarity), as we found that having neighborhoods of different sizes skews the predictions towards the larger ones (by virtue of having more nodes to contribute to the score). This could instead be circumvented by increasing the number of hops (thus boosting the size of smaller neighborhoods before filtering), but according to our observations, this hurts the quality of the kept nodes, as they get less semantically relevant with each further hop. Resizing the neighborhoods eliminates the bias against in-domain labels that may not have as many related words in the first place.
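A minimal sketch of this resizing step (the data layout, a dict of per-label word scores, is an assumption for illustration):

```python
# Minimal sketch: cap every label neighborhood at the size of the smallest
# one, keeping the highest-scored nodes, so that larger neighborhoods do
# not dominate the summed label scores.
def resize_neighborhoods(
    neighborhoods: dict[str, dict[str, float]]
) -> dict[str, dict[str, float]]:
    target_size = min(len(scores) for scores in neighborhoods.values())
    resized = {}
    for label, word_scores in neighborhoods.items():
        top = sorted(word_scores.items(), key=lambda kv: kv[1], reverse=True)
        resized[label] = dict(top[:target_size])
    return resized
```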
Table 2 and Table 3 show a score comparison of the ProZe approaches against the ZeSTE and Entail baselines. ProZe-A refers to scoring the nodes using a combination of SM1 and SM2, whereas ProZe-B uses a combination of SM1 and SM3. We tested several ways to combine the scores from ConceptNet (SM1) and language models (SM2 and SM3), including taking the sum of the two scores, their product, their max, or a weighted average. Empirically, we obtain the best results by multiplying the two scores (both normalized to be between 0 and 1), as sketched below. The main advantage of multiplication is that it penalizes disagreement between the language model and the KG over how close two terms are. This also means that the explainability layer accurately reflects the decisions of the model, as words that are not scored well by the language model will not contribute significantly to the classification score.
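A minimal sketch of this combination rule (the min-max normalization and the fallback to the SM1 score for words the language model did not score are assumptions consistent with the description above):

```python
# Minimal sketch: combine the ConceptNet (SM1) and language-model (SM2/SM3)
# scores by multiplying them after normalizing both to [0, 1]. Multiplication
# penalizes words on which the two knowledge sources disagree.
def min_max_normalize(scores: dict[str, float]) -> dict[str, float]:
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {word: (s - lo) / span for word, s in scores.items()}

def combine(sm1: dict[str, float], sm_lm: dict[str, float]) -> dict[str, float]:
    a, b = min_max_normalize(sm1), min_max_normalize(sm_lm)
    # Fall back to the SM1 score when the LM did not score the word.
    return {word: a[word] * b.get(word, a[word]) for word in a}
```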
Table 2 contains the accuracy and weighted average scores for the 3 news datasets, which consist of general-knowledge texts. ProZe shows similar performance to ZeSTE without beating it, which is in line with our expectations: both approaches are based on the ConceptNet commonsense knowledge graph, and the vocabulary does not need to (or cannot) be guided into a more fitting direction with the prompts. For all three news datasets, however, ProZe performs better than Entail.

Table 2: Prediction scores for the news datasets (the top score in each metric is emboldened).

          | 20 Newsgroups           | AG News                 | BBC News
          | Accuracy | Weighted Avg | Accuracy | Weighted Avg | Accuracy | Weighted Avg
ZeSTE     | 63.1%    | 63.0%        | 69.9%    | 70.3%        | 84.0%    | 84.6%
Entail    | 46.0%    | 43.3%        | 66.0%    | 64.4%        | 71.1%    | 71.5%
ProZe-A   | 62.7%    | 62.8%        | 68.5%    | 69.1%        | 83.2%    | 83.7%
ProZe-B   | 64.6%    | 64.6%        | 69.0%    | 69.6%        | 84.2%    | 84.8%

Table 3 shows the results for the 3 domain-specific datasets. We observe that ProZe consistently outperforms ZeSTE, which we take as a confirmation that the guidance through the prompt is effective for specific domains. For two datasets, silk material and crisis situations, ProZe even beats the non-explainable baseline scores of the Entail approach. This is not the case for the silk technique and the CSI screenplay datasets, as some labels from these datasets have very limited neighborhoods in ConceptNet. Nevertheless, our approach remains close and retains in all cases its higher degree of explainability.

Table 3: Prediction scores for the domain-specific datasets (the top score in each metric is emboldened).

          | Silk Material           | Silk Technique          | Crime aspects           | Crisis situations
          | Accuracy | Weighted Avg | Accuracy | Weighted Avg | Accuracy | Weighted Avg | Accuracy | Weighted Avg
ZeSTE     | 34.3%    | 39.0%        | 46.9%    | 47.2%        | 31.2%    | 32.3%        | 46.3%    | 45.8%
Entail    | 29.0%    | 33.3%        | 64.0%    | 65.8%        | 43.7%    | 43.7%        | 46.7%    | 48.1%
ProZe-A   | 39.0%    | 40.1%        | 50.8%    | 57.6%        | 36.3%    | 37.6%        | 50.1%    | 49.7%
ProZe-B   | 37.4%    | 41.7%        | 48.5%    | 48.7%        | 29.8%    | 31.1%        | 50.1%    | 49.8%

Qualitative Analysis
To illustrate why a re-ranking of related words induced by a domain prompt improves the score, we analyse a concrete example. Taken from the silk technique dataset, the top 10 candidate terms of the ConceptNet label neighborhood for the weaving technique "embroidery" are as follows: "embroidery, overstitch, running stitch, picot, stumpwork, arresene, couture, fancywork, embroider, berlin work". While these words are clearly related to the concept of embroidery, they are not necessarily relevant in the context of silk textile. For example, "picot" is a dimensional embroidery related to crochet. The intuition is then that this neighborhood can be improved by specifying the domain.

In comparison, the top 10 candidate terms from the pre-trained BART language model, guided by a prompt that included the term "silk textile", are: "craft artifact sewn, fabric, embroidery stitch, embroidery, detail, embroider, mending, embellishment, elaboration, filoselle". These terms are more general, even if also related to silk textile. Words such as "detail", "mending", "elaboration" or "embellishment" seem useful for classifying texts that do not only consist of details about different types of embroidery. By combining the scores from ConceptNet and the language model, the ProZe method increases its F1 score by circa 8 points, from 61% to 69%.

Conclusion and Future Work
In this paper, we demonstrated the potential of fusing knowledge about the world from two sources: first, a common-sense knowledge graph (ConceptNet), which explicitly encodes knowledge about words and their meaning; second, pre-trained language models, which latently encode a lot of knowledge about language and word usage. We explored several methods to extract this knowledge and leverage it for the use case of zero-shot classification. We also empirically demonstrated the efficiency of such a combination on several diverse datasets from different domains.

This work is experimental and does not fully explore all possibilities of this setup. As future work, we want to study the effect of prompt choice in more detail, and see how such choice impacts not only the quality of the predictions but also that of the explanations. Different language models can also be tried to measure how such a choice can improve the overall classification, especially for specific domains such as medical documents.

Another potential improvement over this method is to filter out words unrelated to the label using the slot-filling predictions from the language model. From early experiments, this method seems to give good results by restricting the neighborhood nodes to ones that almost exclusively relate to the label in some way.

A natural direction of work is to involve the user in the creation of the label neighborhood (human-in-the-loop) by asking whether some words that the language model, but not ConceptNet, suggests pertain to the target label. This allows injecting the knowledge extracted from the language model back into the zero-shot classifier, and filling in the gaps of knowledge from ConceptNet.

Finally, some existing limitations of the original work can still be improved upon, such as letting the language model inform the label selection and expansion, handling multi-word labels, and integrating more informative concepts from ConceptNet beyond single-word tokenization (e.g. 'crime scene', 'tear gas').


Acknowledgment
This work has been partially supported by the European Union's Horizon 2020 research and innovation programme within the Odeuropa project (grant agreement No. 101004469), by CHIST-ERA within the CIMPLE project (CHIST-ERA-19-XAI-003) and by raisin.ai within the MyLittleEngine project.

REFERENCES
1. J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," in NAACL. Association for Computational Linguistics, 2019, pp. 4171–4186.
2. I. Harrando and R. Troncy, "Explainable Zero-Shot Topic Extraction Using a Common-Sense Knowledge Graph," in 3rd Conference on Language, Data and Knowledge (LDK), Zaragoza, Spain, 2021.
3. R. Speer, J. Chin, and C. Havasi, "ConceptNet 5.5: An open multilingual graph of general knowledge," in 31st AAAI Conference on Artificial Intelligence, 2017.
4. L. Frermann, S. B. Cohen, and M. Lapata, "Whodunnit? Crime drama as a case for natural language understanding," TACL, vol. 6, pp. 1–15, 2018.
5. T. Schleider, T. Ehrhart, P. Lisena, and R. Troncy, "SILKNOW knowledge graph," Nov. 2021.
6. S. Mayhew, T. Tsygankova, F. Marini, Z. Wang, J. Lee, X. Yu, X. Fu, W. Shi, Z. Zhao, and W. Yin, "University of Pennsylvania LoReHLT 2019 submission," Tech. Rep.
7. A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in NeurIPS, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds., vol. 25. Curran Associates, Inc., 2012.
8. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in NeurIPS, vol. 30. Curran Associates, Inc., 2017.
9. W. Yin, J. Hay, and D. Roth, "Benchmarking Zero-shot Text Classification: Datasets, Evaluation and Entailment Approach," in EMNLP, 2019, pp. 3914–3923.
10. P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and G. Neubig, "Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing," arXiv preprint arXiv:2107.13586, 2021.
11. B. Paranjape, J. Michael, M. Ghazvininejad, H. Hajishirzi, and L. Zettlemoyer, "Prompting contrastive explanations for commonsense reasoning tasks," in Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. Online: ACL, Aug. 2021, pp. 4179–4192.
12. Q. Chen, W. Wang, K. Huang, and F. Coenen, "Zero-shot text classification via knowledge graph embedding for social media data," IEEE Internet of Things Journal, pp. 1–1, 2021.
13. T. Liu, Y. Hu, J. Gao, Y. Sun, and B. Yin, "Zero-shot text classification with semantically extended graph convolutional network," in 25th International Conference on Pattern Recognition (ICPR), 2021, pp. 8352–8359.
14. J. Zhang, P. Lertvittayakumjorn, and Y. Guo, "Integrating semantic knowledge to tackle zero-shot text classification," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, Jun. 2019, pp. 1031–1040. [Online]. Available: https://aclanthology.org/N19-1108
15. N. Nayak and S. Bach, "Zero-shot learning with common sense knowledge graphs," 2021. [Online]. Available: https://openreview.net/forum?id=jYkO_0z2TAr
16. A. Conneau, G. Kruszewski, G. Lample, L. Barrault, and M. Baroni, "What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties," in ACL. Melbourne, Australia: Association for Computational Linguistics, Jul. 2018, pp. 2126–2136.
17. K. Lang, "NewsWeeder: Learning to filter netnews," in ICML, 1995, pp. 331–339.
18. A. Gulli, "AG's corpus of news articles," 2005. [Online]. Available: http://groups.di.unipi.it/~gulli/AG_corpus_of_news_articles.html
19. D. Greene and P. Cunningham, "Practical solutions to the problem of diagonal dominance in kernel document clustering," in 23rd International Conference on Machine Learning (ICML), 2006, pp. 377–384.
20. T. Schleider and R. Troncy, "Zero-shot information extraction to enhance a knowledge graph describing silk textiles," in Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature. Punta Cana, Dominican Republic (online): Association for Computational Linguistics, Nov. 2021, pp. 138–146.
21. M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, "BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension," in ACL, 2020.
