AI Notes
CLASS 10
1. What is a Chatbot?
A chatbot is a computer program that is designed to simulate human conversation
through voice commands, text chats or both. E.g.: Mitsuku Bot, Jabberwacky, etc.
OR
A chatbot is a computer program that can learn over time how best to interact with
humans. It can answer questions and troubleshoot customer problems, evaluate and
qualify prospects, generate sales leads and increase sales on an e-commerce site.
OR
A chatbot is a computer program designed to simulate conversation with human users.
A chatbot is also known as an artificial conversational entity (ACE), chat robot, talk bot,
chatterbot or chatterbox.
OR
A chatbot is a software application used to conduct an online chat conversation via text
or text-to-speech, in lieu of providing direct contact with a live human agent.
1. What are the types of data used for Natural Language Processing applications?
Natural Language Processing takes in data from natural languages in the form of the
written and spoken words that humans use in their daily lives, and operates on this
data.
Script-bot
• A scripted chatbot doesn't carry even a glimpse of AI.
• Script bots are easy to make.
• Script bot functioning is very limited as they are less powerful.
• Script bots work around a script which is programmed in them directly.
• No or little language processing skills are required.
• Limited functionality.
Smart-bot
• Smart bots are built on NLP and ML.
• Smart bots are comparatively difficult to make.
• Smart bots are flexible and powerful.
• Smart bots work on bigger databases and other resources.
• NLP and Machine Learning skills are required.
• Wide functionality.
8. Which words in a corpus have the highest values and which ones have the least?
Stop words like and, this, is, the, etc. have the highest occurrence (frequency) values in a
corpus. But these words do not tell us anything about the corpus. Hence, they are termed
stop words and are mostly removed at the pre-processing stage.
Rare or valuable words occur the least but add the most importance (value) to the corpus.
Hence, when we look at the text, we take both the frequent and the rare words into consideration.
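As a quick illustration (not part of the original notes), word frequencies in a small made-up corpus can be counted in Python with collections.Counter; the sample sentence below is only an assumption for demonstration.

```python
# Minimal sketch: counting word frequencies in a tiny made-up corpus.
from collections import Counter

corpus = "the cat sat on the mat and the dog sat near the cat"
counts = Counter(corpus.split())

# Stop words such as 'the' occur most often but tell us little,
# while rarer words like 'dog' or 'mat' carry more meaning.
print(counts.most_common())
```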
9. Does the vocabulary of a corpus remain the same before and after text
normalization? Why?
No, the vocabulary of a corpus does not remain the same before and after text
normalization. The reasons are:
● In normalization, the text is processed through various steps and reduced to a
minimum vocabulary, since the machine does not require grammatically correct
statements but only the essence of the text.
● In normalization, stop words, special characters and numbers are removed.
● In stemming, the affixes of words are removed and the words are converted to their base
form.
So, after normalization, we get a reduced vocabulary.
10. What is the significance of converting the text into a common case?
In Text Normalization, we undergo several steps to normalize the text to a lower level.
After the removal of stop words, we convert the whole text into the same case,
preferably lower case. This ensures that the machine does not treat the same words as
different just because they appear in different cases.
As shown in the graph, the occurrence and the value of a word are inversely proportional.
The words which occur most (like stop words) have negligible value. As the occurrence of
words drops, the value of such words rises. These words are termed rare or valuable
words. They occur the least but add the most value to the corpus.
16. What are stop words? Explain with the help of examples.
“Stop words” are the most common words in a language like “the”, “a”, “on”, “is”, “all”.
These words do not carry important meaning and are usually removed from texts. It is
possible to remove stop words using Natural Language Toolkit (NLTK), a suite of
libraries and programs for symbolic and statistical natural language processing.
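A small Python sketch of stop word removal with NLTK is given below; it assumes NLTK is installed and that the 'punkt' and 'stopwords' resources have already been downloaded with nltk.download(), and the sample sentence is made up for illustration.

```python
# Sketch: removing stop words with NLTK (assumes 'punkt' and 'stopwords' are downloaded).
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text = "This is a sample sentence showing off the stop words filtration."
stop_words = set(stopwords.words("english"))

tokens = word_tokenize(text.lower())                               # tokenise and lower-case
filtered = [w for w in tokens if w.isalpha() and w not in stop_words]

print(filtered)   # ['sample', 'sentence', 'showing', 'stop', 'words', 'filtration']
```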
2. Classify each of the images according to how well the model’s output matches the
data samples:
Here, the red dashed line is the model's output while the blue crosses are the actual data
samples.
● In the first case, the model's output does not match the true function at all. Hence the
model is said to be underfitting and its accuracy is lower.
● In the second case, the model tries to cover all the data samples, even those that are
out of alignment with the true function. This model is said to be overfitting and it too
has a lower accuracy.
● In the third case, the model's output matches the true function well, which shows that
the model has optimum accuracy; such a model is called a perfect fit. (A small numeric
sketch of these three cases follows below.)
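The sketch below is only an illustration (not part of the notes): it fits polynomials of different degrees to noisy data so that a very low degree underfits, a moderate degree fits well, and a high degree overfits; the data and degrees are assumptions chosen for demonstration.

```python
# Illustrative sketch: underfitting vs a good fit vs overfitting with polynomials.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y_true = np.sin(2 * np.pi * x)             # the "true function"
y = y_true + rng.normal(0, 0.2, x.size)    # noisy data samples

for degree, label in [(1, "underfit"), (3, "good fit"), (9, "overfit")]:
    coeffs = np.polyfit(x, y, degree)          # fit a polynomial of this degree
    y_pred = np.polyval(coeffs, x)
    error = np.mean((y_pred - y_true) ** 2)    # error against the true function
    print(f"degree {degree}: {label}, error vs true function = {error:.3f}")
```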
Sentence Segmentation - Under sentence segmentation, the whole corpus is divided into
sentences. Each sentence is taken as a separate piece of data, so the whole corpus gets
reduced to sentences.
Tokenisation - After segmenting the sentences, each sentence is further divided into
tokens. A token is a term used for any word, number or special character occurring in
a sentence. Under tokenisation, every word, number and special character is considered
separately and each of them is now a separate token.
Removing Stop words, Special Characters and Numbers - In this step, the tokens which
are not necessary are removed from the token list.
Converting text to a common case - After the stop words removal, we convert the whole
text into the same case, preferably lower case. This ensures that the machine does not
treat the same words as different just because they appear in different cases.
Stemming - In this step, the remaining words are reduced to their root words. In other
words, stemming is the process in which the affixes of words are removed and the
words are converted to their base form.
Lemmatization - In lemmatization, the word we get after affix removal (known as the
lemma) is a meaningful one.
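A rough Python sketch of these normalization steps using NLTK is shown below; it assumes NLTK is installed with the 'punkt', 'stopwords' and 'wordnet' resources downloaded, and the sample corpus is chosen only for illustration.

```python
# Sketch of the text normalization pipeline with NLTK.
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

corpus = "Raj likes to play football. Vijay prefers online games!"

sentences = sent_tokenize(corpus)                               # sentence segmentation
tokens = [w for s in sentences for w in word_tokenize(s)]       # tokenisation

stop_words = set(stopwords.words("english"))
tokens = [w.lower() for w in tokens                             # common (lower) case
          if w.isalpha() and w.lower() not in stop_words]       # drop stop words, numbers, special characters

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print([stemmer.stem(w) for w in tokens])          # stemming: e.g. 'likes' -> 'like'
print([lemmatizer.lemmatize(w) for w in tokens])  # lemmatization returns meaningful lemmas
```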
With this, we have normalized our text to tokens, which are the simplest form of words
present in the corpus. Now it is time to convert the tokens into numbers. For this, we
use the Bag of Words algorithm.
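As one possible illustration of the Bag of Words idea (the notes name only the algorithm, not a specific library), scikit-learn's CountVectorizer can build the vocabulary and the document vectors; using it here is an assumption made for demonstration.

```python
# Sketch: Bag of Words with scikit-learn's CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "We are going to Mumbai",
    "Mumbai is a famous place",
]

vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(documents)   # build the vocabulary and count the words

print(vectorizer.get_feature_names_out())      # the vocabulary of the corpus
print(matrix.toarray())                        # document vectors: word counts per document
```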
6. Through a step-by-step process, calculate TFIDF for the given corpus and mention
the word(s) having highest value.
Document 1: We are going to Mumbai
Document 2: Mumbai is a famous place.
Document 3: We are going to a famous place.
Document 4: I am famous in Mumbai.
Term Frequency
Term frequency is the frequency of a word in one document. Term frequency can easily
be found from the document vector table, as that table lists the frequency of each word
of the vocabulary in each document.
Talking about inverse document frequency, we put the document frequency in the
denominator and the total number of documents in the numerator. Here, the total
number of documents is 4, so for each word the inverse document frequency is 4 divided
by the number of documents in which that word appears.
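A short Python sketch of the full calculation for the four documents is given below; it assumes the usual class formulation TFIDF(W) = TF(W) × log10(N / DF(W)), where N is the total number of documents and DF(W) is the number of documents containing W.

```python
# Sketch: TFIDF for the four documents, using TF(W) * log10(N / DF(W)).
import math
from collections import Counter

documents = [
    "We are going to Mumbai",
    "Mumbai is a famous place",
    "We are going to a famous place",
    "I am famous in Mumbai",
]
tokenized = [d.lower().replace(".", "").split() for d in documents]

vocabulary = sorted({w for doc in tokenized for w in doc})
N = len(documents)                                                  # N = 4
df = {w: sum(w in doc for doc in tokenized) for w in vocabulary}    # document frequency

for i, doc in enumerate(tokenized, start=1):
    tf = Counter(doc)                                               # term frequency per document
    tfidf = {w: round(tf[w] * math.log10(N / df[w]), 3) for w in tf}
    print(f"Document {i}:", tfidf)
```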
7. Normalize the given text and comment on the vocabulary before and after the
normalization:
Raj and Vijay are best friends. They play together with other friends. Raj likes to
play football but Vijay prefers to play online games. Raj wants to be a footballer.
Vijay wants to become an online gamer.
During stemming, the affix 's' is removed from these words:
likes → like
prefers → prefer
wants → want
After normalization, the vocabulary is reduced compared to the original text, since stop
words, special characters and numbers are removed and the remaining words are
converted to their base forms.
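A hand-rolled Python sketch (not NLTK; the stop-word list and the simple 's'-stripping rule are assumptions made only for illustration) showing how the vocabulary of this corpus shrinks after normalization:

```python
# Sketch: vocabulary before and after a simple normalization of the Raj/Vijay corpus.
import re

text = ("Raj and Vijay are best friends. They play together with other friends. "
        "Raj likes to play football but Vijay prefers to play online games. "
        "Raj wants to be a footballer. Vijay wants to become an online gamer.")

tokens = re.findall(r"[a-z]+", text.lower())         # tokenise, lower case, drop punctuation
before = sorted(set(tokens))

stop_words = {"and", "are", "they", "with", "other", "to", "but", "be", "a", "an"}
kept = [w for w in tokens if w not in stop_words]    # remove stop words
stemmed = [w[:-1] if w.endswith("s") and len(w) > 3 else w for w in kept]  # likes -> like, wants -> want

after = sorted(set(stemmed))
print(len(before), "distinct words before normalization")
print(len(after), "distinct words after normalization")
```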