Understanding Natural Language
Yong Yang
yongy3@illinois.edu
January 6, 2024
Abstract
In recent years, transformer-based models such as BERT and the GPT family (GPT-3, ChatGPT, GPT-4) have shown
remarkable performance on a variety of natural language understanding tasks. However, while these models exhibit
impressive surface-level language competence, they may not truly grasp the intent and meaning behind the sentences
they process. This paper surveys studies of popular Large Language Models (LLMs) from research and industry,
reviews their abilities to comprehend language the way humans do, and identifies key challenges and limitations
associated with popular LLMs, including BERTology and GPT-style models.
1 Introduction
In this paper, I conduct an extensive literature review to understand the capabilities and boundaries
of popular Large Language Models (LLMs): BERT, GPT, and their sibling variants. The study starts
with the architecture of BERT and its variants (mBERT, RoBERTa, etc.), a line of work known as BERTology.
It examines the kinds of knowledge BERT may possess: syntactic knowledge, semantic knowledge, world and
commonsense knowledge, and reasoning. To measure how far the models' semantic understanding and reasoning
capability have reached, we also examine the definition of meaning. As NLP gains increasingly significant
public exposure, it is crucial to be clear about the distinction between linguistic form and semantic
meaning. Next, we study the reasoning capability of GPT-3 and how it can be improved with Chain-of-Thought
(CoT) prompting, covering both zero-shot and few-shot prompting techniques. Finally, we summarize the
latest studies on the capabilities and limitations these popular LLMs exhibit today.
2 Background
This is essentially a literature review intended to answer one question: to what extent do LLMs
understand natural language? The focus of the study is:
1. How do LLMs understand natural language, do they truly grasp the intent and meaning behind the
surface sentences, and what knowledge and capabilities do LLMs have today?
2. What are the challenges and limitations on the path to truly understanding natural language today?
There are many LLMs today, and new ones appear every day. I do not try to enumerate all the
models; instead, I focus on the popular and well-known Transformer-based models: BERT,
GPT, and their siblings. Understanding natural language with Large Language Models (LLMs)
is also a broad subject. After consulting the literature, it became apparent that going beyond
the minimum of 4-5 papers was necessary to investigate and address the topic thoroughly.
GPT-3 and GPT-4 OpenAI's GPT-3 is an autoregressive language model built on the Transformer
architecture. Note that GPT-3 is not a single model but a family of models with different numbers
of trainable parameters and fine-tuning settings. Unlike BERT, which is open sourced,
GPT-3 is closed and operates as a black box. As this paper is being written, GPT-4 Turbo has just been
released and is claimed to be another significant leap in NLP. This review is based only on known
information and studies collected from public papers and experiments. We start by reviewing BERT and
its variants (a.k.a. BERTology), since both families are based on the Transformer, and then move to the
newer and larger GPT-3 models. We focus on text processing only.
Attention Can Reflect Syntactic Structure The attention mechanism is an innovative part of the
Transformer architecture; it essentially maps a query and a set of key-value pairs to an output:
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
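To make the formula concrete, here is a minimal NumPy sketch of scaled dot-product attention; it is only an illustration of the equation above, not the implementation used inside BERT or GPT.

```python
# Minimal sketch of scaled dot-product attention (illustrative only; not the
# actual implementation inside BERT or GPT).
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v). Returns (seq_len, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # query-key similarities
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V                               # weighted sum of values

# Toy usage: 4 tokens with 8-dimensional queries/keys/values.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)          # shape (4, 8)
```

In the multi-head setting, each head applies this map with its own learned projections of Q, K, and V, which is what allows specialized heads to track individual dependency relations.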
Regarding syntactic structure, Ravishankar et al. (2021) showed that the Transformer's multi-head
attention mechanism allows it to jointly attend to information from different representations (features).
It has been observed that individual dependency relations are often tracked by specialized heads. In that
paper, experiments with a tree-decoding test showed that the attention mechanism can learn to represent
the structural objective of a dependency parser. Surprisingly, the key (K) and query (Q) parameters were
only modestly capable of capturing the dependency structure; the value (V) parameters provided the most
faithful representation of linguistic structure via attention. The experiments focused on the linguistic
structure that an attention-based model can learn, and no test tasks were designed to explore semantically
oriented classification. This leaves an open question: which sets of Transformer parameters, if any, are
suited to learning such semantic information? This leads us to the next paper and the extent to which
Transformer-based models, including BERTology, understand natural language in semantic terms.
GPT-3's semantic knowledge GPT-3 does a better job of using linguistic knowledge to identify
certain semantic information in most cases, but it still fails when certain types of perturbation
are introduced into the sentence. According to existing studies and experiments (Zhang et al., 2022),
GPT-3 does not possess semantic knowledge in the way humans do, but it can generate responses that
appear to understand the “meaning” of the input by recognizing patterns and associations in the data
it was trained on.
4.3 World Knowledge or Commonsense Knowledge
The study by Da and Kasai (2019) shows that BERT lacks world knowledge. It struggles with pragmatic
inference, role-based event knowledge, and abstract attributes of objects that are likely to be
assumed rather than mentioned. For a question like “Does the cake go in the oven?”, which is
common sense to humans, BERT has difficulty answering because of a lack of strong
contextualization.
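As an informal illustration of how such world knowledge can be probed, the sketch below asks BERT to fill in a cloze statement using the Hugging Face transformers pipeline; this is only a quick probe of the kind discussed here, not the MCScript evaluation protocol used by Da and Kasai (2019).

```python
# Informal cloze probe of BERT's world knowledge (illustrative; not the
# MCScript 2.0 evaluation protocol used by Da and Kasai, 2019).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("You bake a cake in the [MASK]."):
    # Each prediction carries the candidate token and its probability.
    print(prediction["token_str"], round(prediction["score"], 3))
```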
Commonsense knowledge, closely related to world knowledge, requires contextual information to learn. In the
paper “Cracking the Contextual Commonsense Code: Understanding Commonsense Reasoning Aptitude of Deep
Contextual Representations” (Da and Kasai, 2019), a method based on attribute classification over semantic
norm datasets was developed to compare contextual models with traditional word embeddings. The contextual
model outperforms word-type embeddings but still misses some commonsense attributes, notably visual and
perceptual properties. To mitigate this deficiency, knowledge-graph embeddings were added to the BERT
features, utilizing CSLB1, a semantic norm dataset; knowledge graphs can encode information that extends
beyond BERT's embedding features. A classifier was also introduced to decide whether an attribute applies
to a candidate object, word, or sentence. The resulting attribute F1 scores are much stronger: the median
F1 score is nearly double that of the GloVe2 baselines. This suggests BERT encodes commonsense traits, but
not perfectly; some traits are captured better than others. Specifically, physical traits such as “is made
of wood” and “has a top” perform far better than abstract traits such as “is creepy” and “is strong”.
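The sketch below illustrates the general recipe of per-attribute classification over concatenated contextual and knowledge-graph features; the embedding functions are hypothetical placeholders, and the details differ from the actual setup in Da and Kasai (2019).

```python
# Sketch of per-attribute classification over concatenated BERT + knowledge-graph
# features. get_bert_embedding / get_kg_embedding are hypothetical placeholders;
# here they return random vectors so the example runs end to end.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

def get_bert_embedding(word):   # placeholder for a contextual (BERT) embedding
    return rng.normal(size=768)

def get_kg_embedding(word):     # placeholder for a ConceptNet-style graph embedding
    return rng.normal(size=100)

# Toy data: does the attribute "is made of wood" apply to each concept?
concepts = ["table", "chair", "river", "camera"]
labels = np.array([1, 1, 0, 0])

X = np.stack([np.concatenate([get_bert_embedding(c), get_kg_embedding(c)])
              for c in concepts])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(f1_score(labels, clf.predict(X)))  # per-attribute F1, the metric reported in the paper
```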
To answer a question such as whether to use a camera flash, which relates to traits like “does have
flash”, “is dark”, and “is light”, the model needs fine-tuning on additional data manually selected
to cover attributes in which BERT is deficient.
System                            Accuracy
Human (Golden)                    97.4
Random Baseline                   48.9
BERT (LARGE)                      82.3
  with ConceptNet                 83.1
  with WebChild                   82.7
  with ATOMIC                     82.5
  with all KB                     83.3
  with all KB + RACE (selected)   85.5
Table 1: Test set results for knowledge base embeddings on MCScript 2.0 (Da and Kasai, 2019)
ConceptNet: An open, multilingual knowledge graph (https://conceptnet.io)
WebChild: Fine-grained commonsense knowledge distillation. Tandon et al. (2017)
ATOMIC: An Atlas of Machine Commonsense for If-Then Reasoning. Sap et al. (2018)
RACE: Large-scale reading comprehension dataset from examinations. Lai et al. (2017)
The results (Table 1) show that with explicit knowledge embeddings, each knowledge base improves
accuracy, with ConceptNet giving the largest performance boost. ATOMIC gives the smallest boost, likely
because ATOMIC edges involve longer phrases, which means fewer matches: the overlap between the ATOMIC
(Sap et al., 2018) text and the text present in the task is not as large as for ConceptNet or WebChild.
The results also show that combining the knowledge base embeddings with implicit RACE fine-tuning yields
the highest accuracy, so fine-tuning is critical for contextual knowledge learning.
1 CSLB, a semantic norm dataset collected by the Cambridge Centre for Speech, Language, and the Brain.
2 GloVe: Global Vectors for Word Representation: https://nlp.stanford.edu/projects/glove/
BERTology's capability we have learned so far Given these varied studies, BERT possesses a
limited amount of syntactic, semantic, and world knowledge. It appears to capture syntactic structure
thanks to the nature of its encoding and embeddings, but it lacks strong semantic and world knowledge,
despite some hype claiming otherwise. Furthermore, BERT has limited reasoning abilities, and much of its
performance can be attributed to pattern recognition. The awkward situation is that there is no single
probing method that can reliably tell to what extent the model possesses such knowledge; a given method
may favor one kind of knowledge over another. This leads us to consider the definition of the “meaning”
of language, since the term “meaning” is so rich and multifaceted.
Bender and Koller (2020) define meaning as
M = E × I
which contains pairs (e, i) of natural language expressions e and the communicative intents i they
can evoke.
Communicative intents are about something outside of language. For example, when a teacher says
“It is cold in the room”, the intent behind the utterance may be “we should close the window” or
“turn up the heater to make the room warmer.” The paper argues that LLMs trained purely on
form will not learn meaning, because there is no sufficient signal for learning the relationship between
form and the non-linguistic intents of human language users.
Octopus test Why can meaning not be learned from linguistic form alone? Because a model trained on
form alone lacks the ability to connect its utterances to the world. The octopus test described in
Bender and Koller (2020) is a thought experiment in which two speakers, A and B, are stranded on
separate islands and can communicate only through an underwater cable. A hyperintelligent octopus, O,
taps the cable and listens in on the conversation between A and B. O is very good at detecting
statistical patterns and can eventually predict with great accuracy how B will respond to each of A's
utterances. This works well until a new situation arises that goes beyond the existing utterances.
Dealing with new situations or new tasks requires the ability to map accurately between words and
real-world entities, as well as reasoning and creative thinking, none of which can be learned from
statistical summaries. When A runs into an emergency, confronted by a bear never seen before, and asks
B for help, the eavesdropping octopus O, who has never had such an experience, has no idea how to
respond.
Hype One argument for believing that language models might be learning meaning is the claim that human
children can acquire language just by listening to it. Studies suggest this is not true: children do not
pick up a language from passive exposure such as TV or radio. The critical ingredient of language
learning is not plain attention but joint attention, where interaction is essential for grounding
understanding. The conclusion is that learning a linguistic system, like human language learning, relies
on joint attention and intersubjectivity: the ability to be aware of what another human is attending to
and to infer and interact with what they intend to communicate. It cannot be acquired through purely
passive “learning”; the key is interaction between learner and teacher.
Does BERTology learn meaning? The conclusion of this paper is that BERTology does not learn
“meaning”; it learns only a reflection of meaning as it appears in linguistic form.
Category     Poor-scoring attributes (fit score < 1.0)   Perfect-scoring attributes (fit score = 1.0)
Visual       is triangular, is long                      has a back, has a top
Perceptual   is wet, is rough, is creepy, is strong      does drive, does bend, does live in river
Taxonomic    is a home, is a garden tool                 is a cat, is a body part
Table 2: Poor-scoring and perfect-scoring attributes by category, using BERT representations (Da and Kasai, 2019)
In the fine-grained comparison of attributes using BERT representations (Table 2), BERT is overall
strong enough to fit many features that are easily expressed in text, such as “does bend”, “does drive”,
or “does live in river”, but it still has difficulty fitting those that pertain to abstract common
sense, such as “is hardy” and “has a strong smell”. This paper (Da and Kasai, 2019) thus shows that BERT
has a strong ability to encode various commonsense features in its embedding space, particularly those
easily represented in text, while facing challenges with abstract commonsense attributes.
Figure 1: Given a Python prompt (top) that swaps two built-in functions, large language
models prefer the incorrect but statistically common continuation (right) over the correct but unusual
one (left) (from Miceli Barone et al. (2023))
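The sketch below is a hedged reconstruction of the kind of prompt Figure 1 refers to; the exact prompt format in Miceli Barone et al. (2023) may differ. After the swap, the statistically common completion `len(items)` is wrong, while the unusual `print(items)` is correct.

```python
# Illustration of an identifier-swap prompt (a reconstruction, not the exact
# prompt used in Miceli Barone et al., 2023).
len, print = print, len   # from here on, `print` computes length and `len` prints

def count_items(items):
    # Correct continuation under the swap: `print(items)` now returns the length.
    # Language models tend to complete this with `len(items)`, which now prints.
    return print(items)

assert count_items([1, 2, 3]) == 3
```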
Figure 2: Question answering by a few-shot reasoner (from Chen (2023))
Few-shot reasoning (Chen, 2023) uses a handful of worked examples in the prompt to support inference
and deduction in natural language processing. Chain-of-Thought (CoT) reasoning is the key idea:
prompting the model step by step empowers the LLM to surface intermediate reasoning over context it may
already have. For instance, you could provide a few examples of how you want the model to answer
questions about a specific topic, and the few-shot reasoner would use that information to generate
responses to new, similar queries.
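A minimal sketch of building such a few-shot chain-of-thought prompt is shown below; the exemplar is invented for illustration and is not taken from Chen (2023).

```python
# Building a few-shot chain-of-thought prompt: worked examples with explicit
# reasoning steps are prepended to the new question. The exemplar is invented
# for illustration.
EXEMPLARS = """\
Q: Roger has 5 balls. He buys 2 cans of 3 balls each. How many balls does he have now?
A: Roger starts with 5 balls. 2 cans of 3 balls is 6 balls. 5 + 6 = 11. The answer is 11.
"""

def build_few_shot_cot_prompt(question: str) -> str:
    return f"{EXEMPLARS}\nQ: {question}\nA:"

prompt = build_few_shot_cot_prompt(
    "A cafeteria had 23 apples. It used 20 and bought 6 more. How many apples are left?")
# `prompt` is then sent to an LLM completion API; the model is expected to imitate
# the step-by-step reasoning before giving its final answer (23 - 20 + 6 = 9).
```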
Advance further: Zero-shot Few-shot prompting still needs one or more example “shots” in the
prompt. This raises a question: can we do better, dropping the examples entirely while still
empowering the model to reason on its own, without user intervention?
Figure 3: Standard zero-shot prompting (left) vs. zero-shot-CoT (right) (from Kojima et al. (2023))
Another paper, “Large Language Models are Zero-Shot Reasoners” (Kojima et al., 2023), shows
zero-shot-CoT prompt examples that demonstrate good reasoning capability. Zero-shot reasoning
refers to the ability of LLMs to perform multi-step reasoning tasks in unseen domains without any
hand-crafted examples; it allows them to generalize knowledge from their training data and apply
it to new, unseen situations. The idea is to trigger the LLM by simply appending a “Let's
think step by step” prompt, which leads the model to generate a reasoning path that decomposes
a complex problem into simpler sub-problems. This looks very simple, and I think the key point is
that we teach the model to explore a reasoning path that breaks complex reasoning into multiple
simpler steps. This style of chain-of-thought prompting has demonstrated good performance on
arithmetic and logical reasoning.
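A minimal sketch of this two-stage zero-shot-CoT prompting pattern follows; `call_llm` is a hypothetical stand-in for whatever completion API is used, and the exact answer-extraction prompt in Kojima et al. (2023) may differ.

```python
# Two-stage zero-shot-CoT prompting (sketch). `call_llm` is a hypothetical
# stand-in for a text-completion API.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model or API client here")

def zero_shot_cot(question: str) -> str:
    # Stage 1: the trigger phrase elicits a step-by-step reasoning path.
    reasoning = call_llm(f"Q: {question}\nA: Let's think step by step.")
    # Stage 2: a second prompt extracts the final answer from the reasoning.
    return call_llm(f"Q: {question}\nA: Let's think step by step. {reasoning}\n"
                    "Therefore, the answer is")
```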
6 Discussion
GPT-4 GPT-4 can correct itself in the middle of writing an answer if you tell it that it is wrong. This
behavior can be prompted by human feedback, guiding the model to choose another path or to fall back on a
secondary answer. In light of the earlier discussion of few-shot and zero-shot reasoning, an open topic is
how to empower the model itself to perform self-reasoning and fact-checking before replying.
Harmful information LLMs may generate instructions for dangerous, potentially harmful, or
illegal activities. An LLM may not be able to tell the difference between good and bad; indeed, even
humans arguably cannot distinguish them reliably without full knowledge. How to improve the robustness
and safety of language models remains a big open question.
7 Conclusion
We started from BERTology and GPT-3 (Section 3) and studied their capabilities in terms of syntactic
knowledge (Section 4.1), semantic knowledge (Section 4.2), world knowledge (Section 4.3), and contextual
information. Surface knowledge, including syntax, is easier to retrieve from the statistical patterns and
attention mechanisms of Transformer-based models. We also examined the difference between linguistic form
and semantic meaning (Section 4.4). Semantic knowledge and reasoning capability are far less obvious and
more challenging to learn. Although recent studies show that few-shot and zero-shot reasoning via
Chain-of-Thought prompting can give LLMs stronger reasoning capability (Section 4.6) to break down complex
problems, how to fully understand natural language as humans do remains an open question.
References
Emily M. Bender and Alexander Koller. 2020. Climbing towards NLU: On meaning, form, and
understanding in the age of data. In Proceedings of the 58th Annual Meeting of the Associa-
tion for Computational Linguistics, pages 5185–5198, Online. Association for Computational
Linguistics.
Wenhu Chen. 2023. Large language models are few(1)-shot table reasoners. In Findings of the
Association for Computational Linguistics: EACL 2023, pages 1120–1130, Dubrovnik, Croatia.
Association for Computational Linguistics.
Jeff Da and Jungo Kasai. 2019. Cracking the contextual commonsense code: Understanding com-
monsense reasoning aptitude of deep contextual representations. In Proceedings of the First
Workshop on Commonsense Inference in Natural Language Processing, pages 1–12, Hong Kong,
China. Association for Computational Linguistics.
Nouha Dziri, Ximing Lu, Melanie Sclar, Xiang Lorraine Li, Liwei Jiang, Bill Yuchen Lin, Peter
West, Chandra Bhagavatula, Ronan Le Bras, Jena D. Hwang, Soumya Sanyal, Sean Welleck,
Xiang Ren, Allyson Ettinger, Zaid Harchaoui, and Yejin Choi. 2023. Faith and fate: Limits of
transformers on compositionality.
Goran Glavaš and Ivan Vulić. 2021. Is supervised syntactic parsing beneficial for language under-
standing tasks? an empirical investigation. In Proceedings of the 16th Conference of the Euro-
pean Chapter of the Association for Computational Linguistics: Main Volume, pages 3090–3104,
Online. Association for Computational Linguistics.
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2023.
Large language models are zero-shot reasoners.
Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale
ReAding comprehension dataset from examinations. In Proceedings of the 2017 Conference on
Empirical Methods in Natural Language Processing, pages 785–794, Copenhagen, Denmark.
Association for Computational Linguistics.
Antonio Valerio Miceli Barone, Fazl Barez, Shay B. Cohen, and Ioannis Konstas. 2023. The larger
they are, the harder they fail: Language models do not recognize identifier swaps in python. In
Findings of the Association for Computational Linguistics: ACL 2023, pages 272–292, Toronto,
Canada. Association for Computational Linguistics.
Simon Ostermann, Ashutosh Modi, Michael Roth, Stefan Thater, and Manfred Pinkal. 2018. MC-
Script: A novel dataset for assessing machine comprehension using script knowledge. In
Proceedings of the Eleventh International Conference on Language Resources and Evaluation
(LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).
Vinit Ravishankar, Artur Kulmizev, Mostafa Abdou, Anders Søgaard, and Joakim Nivre. 2021.
Attention can reflect syntactic structure (if you let it). In Proceedings of the 16th Conference
of the European Chapter of the Association for Computational Linguistics: Main Volume, pages
3031–3045, Online. Association for Computational Linguistics.
Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2020. A primer in BERTology: What we
know about how BERT works. Transactions of the Association for Computational Linguistics,
8:842–866.
Maarten Sap, Ronan Le Bras, Emily Allaway, Chandra Bhagavatula, Nicholas Lourie, Hannah
Rashkin, Brendan Roof, Noah A. Smith, and Yejin Choi. 2018. ATOMIC: an atlas of machine
commonsense for if-then reasoning. CoRR, abs/1811.00146.
Niket Tandon, Gerard de Melo, and Gerhard Weikum. 2017. WebChild 2.0 : Fine-grained com-
monsense knowledge distillation. In Proceedings of ACL 2017, System Demonstrations, pages
115–120, Vancouver, Canada. Association for Computational Linguistics.
Dingjun Wu, Jing Zhang, and Xinmei Huang. 2023. Chain of thought prompting elicits knowledge
augmentation. In Findings of the Association for Computational Linguistics: ACL 2023, pages
6519–6534, Toronto, Canada. Association for Computational Linguistics.
Lining Zhang, Mengchen Wang, Liben Chen, and Wenxin Zhang. 2022. Probing GPT-3’s linguistic
knowledge on semantic tasks. In Proceedings of the Fifth BlackboxNLP Workshop on Analyzing
and Interpreting Neural Networks for NLP, pages 297–304, Abu Dhabi, United Arab Emirates
(Hybrid). Association for Computational Linguistics.