Abstract—As technology accelerates the generation and communication of textual data, automatically understanding this content becomes a necessity. To classify text, be it for tagging, indexing or curating documents, one often relies on large, opaque models that are trained on pre-annotated datasets, which makes the process unexplainable, difficult to scale and ill-adapted to niche domains with scarce data. To tackle these challenges, we propose ProZe, a text classification approach that leverages knowledge from two sources: prompting pre-trained language models, and querying ConceptNet, a common-sense knowledge base which can be used to add a layer of explainability to the results. We evaluate our approach empirically and show that this combination not only performs on par with state-of-the-art zero-shot classification on several domains, but also offers explainable predictions that can be visualized.
Keywords: Text classification, zero-shot, explainability, common sense knowledge graph,
prompting language models
As a consequence, this paper proposes ProZe, a zero-shot classification model which combines latent contextual information from pre-trained language models (via prompting) and explicit knowledge from ConceptNet. This method keeps the explainability property of ZeSTE while still offering a step towards label disambiguation and domain adaptation.

The remainder of this paper is structured as follows. First, we give an overview of the relevant state-of-the-art work. We then detail our proposed method, called ProZe. Next, we present our results on common topic categorization datasets as well as on three challenging datasets from diverse domains: screenplay aspects for a crime TV series [4], historical silk textile descriptions [5], and the Situation Typing dataset [6]. We report and analyze the results of several empirical classification experiments, including a comparison to some state-of-the-art zero-shot approaches. Finally, we conclude and outline some future work.

Related Work

Language Models
Since the breakthrough performance of AlexNet on the 2012 ImageNet challenge [7], transfer learning via pre-trained models has become a new standard in many machine learning tasks. With the introduction of the Transformer architecture [8], this paradigm shift made its way to the NLP field as well, through the advent of "pre-trained language models".

The most influential Transformer-based model is BERT [1]. Its defining feature is its ability to pre-train deep bidirectional representations. Such pre-trained language models remain among the most successful approaches for many NLP tasks, such as text classification. Despite the wide availability of these language models, many classification experiments also require annotated and balanced training data to make a model properly associate documents with labels, which can be either expensive or not available at all for niche domains.

Zero-Shot Classification
With the rising popularity of zero-shot classification methods, there are now more attempts to benchmark and evaluate these approaches on text classification. [9] provides a survey of the recent advances in the field, while proposing Entail, a zero-shot classification model based on language models fine-tuned on the task of Natural Language Inference (NLI) to classify documents. Some zero-shot classification models also take advantage of "prompt-based learning" [10], a new paradigm used for many NLP tasks that allows information to be extracted from language models.
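To make the entailment-based setup concrete, the sketch below queries an NLI model through the Hugging Face transformers zero-shot pipeline; the model choice and the example inputs are illustrative assumptions, not the exact configuration of Entail [9].

```python
# Illustrative sketch of NLI-based zero-shot classification in the
# spirit of Entail [9]. The model and inputs are assumptions made
# for demonstration purposes only.
from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

document = "The fabric is a crimson damask woven with silver thread."
labels = ["textile", "sports", "politics"]

# Each label is cast as a hypothesis ("This example is about textile.")
# and the NLI model scores whether the document entails it.
result = classifier(document, candidate_labels=labels)
for label, score in zip(result["labels"], result["scores"]):
    print(f"{label}: {score:.3f}")
```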
Explainability in NLP
One direction in the growing amount of work on explainable methods is to generate explanations and to develop evaluations that measure the extent and likelihood that an explanation and its label are associated with each other in the model that generated them [11]. However, none of these techniques totally compensates for the obscurity associated with language models. This is the main reason why the approach presented in this paper relies on ZeSTE (Zero-Shot Topic Extraction) [2], which is not based on a pre-trained language model, and provides explainability of its classification results using ConceptNet as a prediction support.

Previous contributions leverage knowledge graphs [12], [13], [14] and common sense [15] to improve the performance of several classification tasks. To the best of our knowledge, our approach is the first to use a common-sense knowledge graph without a learning component, using the KG as-is, which allows it to retain explainability.

ConceptNet
A central resource for this work is ConceptNet [3], a semantic network "designed to help computers understand the meanings of words that people use"1. Broadly speaking, ConceptNet is a graph of words (or concepts) connected by edges representing semantic relations that go beyond the lexical relations that can be found in a dictionary, such as "Synonym" or "Hypernym". Most importantly, ConceptNet contains relations of general "relatedness" (/r/RelatedTo in ConceptNet), which imply an undefined semantic relation between two concepts, such as "Business" and "Outsourcing": while both terms are used in similar contexts, one cannot define such a relation as one of containment, usage or typing.

1 https://conceptnet.io
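For illustration, the snippet below shows one way to retrieve such edges programmatically from ConceptNet's public REST API (https://api.conceptnet.io); the helper function is our own sketch, with error handling, paging and language filtering omitted.

```python
# Minimal sketch: fetch the edges around a concept from the public
# ConceptNet REST API. The helper is illustrative; error handling,
# paging and language filtering are omitted for brevity.
import requests

def concept_edges(term: str, limit: int = 10):
    """Return (neighbor, relation, weight) tuples for an English term."""
    url = f"https://api.conceptnet.io/c/en/{term}"
    edges = requests.get(url, params={"limit": limit}).json()["edges"]
    results = []
    for edge in edges:
        # Keep the endpoint of the edge that is not the queried term.
        if edge["start"]["@id"].startswith(f"/c/en/{term}"):
            node = edge["end"]
        else:
            node = edge["start"]
        results.append((node["label"], edge["rel"]["label"], edge["weight"]))
    return results

# e.g. an ('outsourcing', 'RelatedTo', ...) edge may appear for 'business'
print(concept_edges("business"))
```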
Because the language model outputs a probability for every word in its vocabulary, we score only the words that are originally in the label neighborhood. If a word in the neighborhood does not appear among the predictions of the model (i.e., it is out of the model's vocabulary), the score from SM1 is used.

Once the scores are computed by one of these methods, we can proceed to score any document given as input to the model. To score such a document, we first tokenize it into separate words. We then take all the nodes from the neighborhood of a label that appear in the tokenized document, and we add up their scores to produce a score for the label. We do so for each label we are targeting, and the final prediction of the model corresponds to the label with the highest score. Because all the nodes in the neighborhood are linked to the label node with explicit relations in ConceptNet, we can explain in the end how each word in the document contributed to the score and how it is related to our label.
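The following sketch restates this scoring procedure in code; the data structures and names are illustrative assumptions (a real implementation would use a proper tokenizer and neighborhoods built from ConceptNet).

```python
# Sketch of the label-scoring step described above. `neighborhoods`
# maps each label to {word: score}; in ProZe these scores come from
# ConceptNet (SM1) or are re-weighted by the language model (SM2/SM3).
# Names and data are illustrative assumptions.
def classify(document: str, neighborhoods: dict) -> str:
    tokens = document.lower().split()  # stand-in for a real tokenizer
    label_scores = {}
    for label, neighborhood in neighborhoods.items():
        # Sum the scores of the neighborhood words found in the document.
        label_scores[label] = sum(neighborhood.get(tok, 0.0) for tok in tokens)
    # Predict the label with the highest accumulated score; every
    # contributing word is linked to the label node in ConceptNet,
    # which is what makes the prediction explainable.
    return max(label_scores, key=label_scores.get)

neighborhoods = {
    "space": {"orbit": 0.9, "planet": 0.8, "expanse": 0.4},
    "sport": {"match": 0.9, "team": 0.7, "goal": 0.6},
}
print(classify("The probe reached a stable orbit around the planet", neighborhoods))
```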
Prompting Language Models
In this section, we explain how we leverage language models to score the label neighbors extracted from ConceptNet, as per the scoring methods SM2 and SM3 described above.

Both SM2 and SM3 rely on prompting the language model, i.e., feeding it a sentence that functions as a context to "query" its content (also known as probing [16]). As discussed in the related work, prompting language models is an open problem in the literature. In this work, we explore some potential ideas for prompting to serve our objective of measuring word-label relatedness.

The prompting follows the same scheme for both scoring methods. We vary both the premise and hypothesis templates and report the results for some proposals in the Evaluation section. For the premise, we experiment with two approaches:

1) Domain description: we prime the model with the name or description of the domain of the datasets, e.g. "Silk Textile", "Crime series", etc.
2) Label definition: we prime the model with the definition of the label, with the assumption that this will help it disambiguate the meaning of the label and thus come up with better related words. For instance, for the label "space", we provide the language model with the sentence "Space is the expanse that exists beyond Earth and between celestial bodies". We take the definitions from Wikipedia or a dictionary, generate them using an NLG model, etc.

We observed experimentally that using just the description of the domain as a prompt gives better overall performance. Therefore, we only report results on these prompts in the following sections. As for the hypothesis, we provide the model with a sentence like "[blank] is similar to space" or "Space is about [blank]", which we use in our reported results.

We note that, while the combination of premise and hypothesis can impact the overall performance of the model, the search space for a good prompt is quite wide. Thus, we only report the performance of some combinations, as we intend this paper to point out the usefulness of such a mechanism for this task rather than to fully optimize the process.
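As an illustration of how such premise/hypothesis prompts can be turned into word scores, the sketch below queries a masked language model through the transformers fill-mask pipeline; the model, the templates and the neighborhood words are illustrative assumptions rather than the exact experimental setup.

```python
# Minimal sketch of scoring label neighbors with a prompted masked
# language model (in the spirit of SM2/SM3). Model, templates and
# words are illustrative assumptions, not the exact experimental setup.
from transformers import pipeline

fill = pipeline("fill-mask", model="roberta-base")

premise = ("Space is the expanse that exists beyond Earth "
           "and between celestial bodies.")
prompt = f"{premise} <mask> is similar to space."

# Score each neighborhood word as a candidate for the blank; words
# missing from the model's vocabulary would keep their SM1 score.
neighborhood = ["planet", "orbit", "cotton"]
for res in fill(prompt, targets=[f" {w}" for w in neighborhood]):
    print(res["token_str"].strip(), round(res["score"], 4))
```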
Tool Demonstrator
To explain the decisions of the model, we follow the same method as ZeSTE [2], i.e., we highlight the words which contribute to the classification decision, shown in a graph that links them with semantic relations to the label node. The difference is that the scores in ProZe also take into account the scoring from the language model. To illustrate the contribution of the language model, we developed an interactive demonstrator enabling a user to test the effect of prompting the language model to improve the results of zero-shot classification (Figure 1). This demonstrator is available at http://proze.tools.eurecom.fr/.

After choosing a label to study, the user is asked to enter a prompt that can help the model to identify words related to the label (e.g. a definition or a domain). The user is then shown an abridged version of the prompt-enhanced label neighborhood: the connection between any node and the label node is omitted for clarity, but it can be trivially retrieved from ConceptNet, and only the top 50 words (based on the used scoring) are shown to represent the new label neighborhood, with the intensity of the color reflecting higher scores.

Figure 1: ProZe neighborhoods demo. (1) The user is asked to select a label. (2) The user can input a text to prompt and guide the language model. (3) The user can visualize the label neighborhood, with added and removed nodes highlighted, and is shown a detailed list of all the changes resulting from the prompt.
their production years, but also rich and detailed textual descriptions. Our goal is to try to predict categorical values based on these text descriptions.

The SILKNOW Knowledge Graph dataset can be divided into "material" and "weaving technique" subsets. More precisely, we slightly extend the dataset used in [20], and after removing objects with more than one value per property, we obtain 1429 object descriptions making use of 7 different labels for silk materials, and 833 object descriptions with 6 unique labels for silk techniques. The chosen labels also have to be mapped to ConceptNet entries to work with this approach. Table 1 shows the final selection of thesaurus concepts and their mapping to ConceptNet nodes.

Property    SILKNOW Concept       ConceptNet
Material    Cotton                /c/en/cotton
Material    Wool                  /c/en/wool
Material    Textile               /c/en/textile
Material    Metal thread          /c/en/metal
Material    Metal silver thread   /c/en/silver
Material    Silver thread         /c/en/silver
Material    Gold thread           /c/en/gold
Technique   Damask                /c/en/damask
Technique   Embroidery            /c/en/embroidery
Technique   Velvet                /c/en/velvet
Technique   Voided Velvet         /c/en/velvet
Technique   Tabby (silk weave)    /c/en/tabby
Technique   Muslin                /c/en/tabby
Technique   Satin (Fabric)        /c/en/satin
Technique   Brocaded              /c/en/brocaded

Table 1: Mapping between the concepts used in the SILKNOW knowledge graph and ConceptNet (ProZe and ZeSTE).
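In code, this mapping is simply a lookup from thesaurus labels to ConceptNet URIs, as sketched below for the technique subset (the variable name and structure are illustrative; the URIs are those of Table 1).

```python
# Table 1's technique mapping as a lookup table (illustrative structure;
# note that several thesaurus labels collapse onto the same ConceptNet node).
TECHNIQUE_TO_CONCEPTNET = {
    "Damask": "/c/en/damask",
    "Embroidery": "/c/en/embroidery",
    "Velvet": "/c/en/velvet",
    "Voided Velvet": "/c/en/velvet",
    "Tabby (silk weave)": "/c/en/tabby",
    "Muslin": "/c/en/tabby",
    "Satin (Fabric)": "/c/en/satin",
    "Brocaded": "/c/en/brocaded",
}
```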
             20 Newsgroup             AG News                  BBC News
Datasets     Accuracy  Weighted Avg   Accuracy  Weighted Avg   Accuracy  Weighted Avg
ZeSTE        63.1%     63.0%          69.9%     70.3%          84.0%     84.6%
Entail       46.0%     43.3%          66.0%     64.4%          71.1%     71.5%
ProZe-A      62.7%     62.8%          68.5%     69.1%          83.2%     83.7%
ProZe-B      64.6%     64.6%          69.0%     69.6%          84.2%     84.8%

Table 2: Prediction scores for the news datasets (the top score in each metric is emboldened).
             Silk Material            Silk Technique           Crime aspects            Crisis situations
Datasets     Accuracy  Weighted Avg   Accuracy  Weighted Avg   Accuracy  Weighted Avg   Accuracy  Weighted Avg
ZeSTE        34.3%     39.0%          46.9%     47.2%          31.2%     32.3%          46.3%     45.8%
Entail       29.0%     33.3%          64.0%     65.8%          43.7%     43.7%          46.7%     48.1%
ProZe-A      39.0%     40.1%          50.8%     57.6%          36.3%     37.6%          50.1%     49.7%
ProZe-B      37.4%     41.7%          48.5%     48.7%          29.8%     31.1%          50.1%     49.8%

Table 3: Prediction scores for the domain-specific datasets (the top score in each metric is emboldened).
more general even if also related to silk textile. Words such as "detail", "mending", "elaboration" or "embellishment" seem useful for classifying texts that do not only consist of details about different types of embroidery. When combining the scores from ConceptNet and the language model, the ProZe method increases its F1 score by circa 8 points, from 61% to 69%.

Conclusion and Future Work
In this paper, we demonstrated the potential of fusing knowledge about the world from two sources: first, a common-sense knowledge graph (ConceptNet), which explicitly encodes knowledge about words and their meaning; second, pre-trained language models, which contain a lot of knowledge about language and word usage that is latently encoded into them. We explored several methods to extract this knowledge and leverage it for the use case of zero-shot classification. We also empirically demonstrated the effectiveness of such a combination on several diverse datasets from different domains.

This work is experimental and does not fully explore all possibilities of this setup. As future work, we want to study the effect of prompt choice in more detail, and to see how such choice impacts not only the quality of the predictions but also that of the explanations. Different language models can also be tried to measure how such a choice can improve the overall classification, especially for specific domains such as medical documents.

Another potential improvement over this method is to filter out words unrelated to the label using the slot-filling predictions from the language model. From early experiments, this method seems to give good results by restricting the neighborhood nodes to ones that almost exclusively relate to the label in some way.

A natural direction of work is to involve the user in the creation of the label neighborhood (human-in-the-loop) by asking whether some words that only the language model, and not ConceptNet, suggests pertain to the target label. This would allow us to inject the knowledge extracted from the language model back into the zero-shot classifier, and fill in the gaps of knowledge from ConceptNet.

Finally, some existing limitations of the original work can still be improved upon, such as letting the language model inform the label selection and expansion, handling multi-word labels, and integrating more informative concepts from ConceptNet beyond word tokenization (e.g. "crime scene", "tear gas").
References
1. J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," in NAACL. Association for Computational Linguistics, 2019, pp. 4171-4186.
2. I. Harrando and R. Troncy, "Explainable Zero-Shot Topic Extraction Using a Common-Sense Knowledge Graph," in 3rd Conference on Language, Data and Knowledge (LDK), Zaragoza, Spain, 2021.
3. R. Speer, J. Chin, and C. Havasi, "ConceptNet 5.5: An open multilingual graph of general knowledge," in 31st AAAI Conference on Artificial Intelligence, 2017.
4. L. Frermann, S. B. Cohen, and M. Lapata, "Whodunnit? Crime drama as a case for natural language understanding," TACL, vol. 6, pp. 1-15, 2018.
5. T. Schleider, T. Ehrhart, P. Lisena, and R. Troncy, "SILKNOW Knowledge Graph," Nov. 2021.
6. S. Mayhew, T. Tsygankova, F. Marini, Z. Wang, J. Lee, X. Yu, X. Fu, W. Shi, Z. Zhao, and W. Yin, "University of Pennsylvania LoReHLT 2019 submission," Tech. Rep., 2019.
7. A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in NeurIPS, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds., vol. 25. Curran Associates, 2012.
8. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in NeurIPS, vol. 30. Curran Associates, 2017.
9. W. Yin, J. Hay, and D. Roth, "Benchmarking Zero-shot Text Classification: Datasets, Evaluation and Entailment Approach," in EMNLP, 2019, pp. 3914-3923.
10. P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and G. Neubig, "Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing," arXiv preprint arXiv:2107.13586, 2021.
11. B. Paranjape, J. Michael, M. Ghazvininejad, H. Hajishirzi, and L. Zettlemoyer, "Prompting contrastive explanations for commonsense reasoning tasks," in Findings of ACL-IJCNLP, 2021.
14. J. Zhang, P. Lertvittayakumjorn, and Y. Guo, "Integrating semantic knowledge to tackle zero-shot text classification," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, Jun. 2019, pp. 1031-1040.
15. N. Nayak and S. Bach, "Zero-shot learning with common sense knowledge graphs," 2021. [Online].
16. A. Conneau, G. Kruszewski, G. Lample, L. Barrault, and M. Baroni, "What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties," in ACL. Melbourne, Australia: Association for Computational Linguistics, Jul. 2018, pp. 2126-2136.
17. K. Lang, "Newsweeder: Learning to filter netnews," in ICML, 1995, pp. 331-339.
18. A. Gulli, "AG's corpus of news articles," 2005. [Online].
19. D. Greene and P. Cunningham, "Practical solutions to the problem of diagonal dominance in kernel document clustering," in International Conference on Machine Learning (ICML), 2006, pp. 377-384.
20. T. Schleider and R. Troncy, "Zero-shot information extraction to enhance a knowledge graph describing silk textiles," in 5th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature. Punta Cana, Dominican Republic (online): Association for Computational Linguistics, Nov. 2021, pp. 138-146.
21. M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, "BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension," in ACL, 2020, pp. 7871-7880.