ML Questions
Basic ML
3. How will you build models from the tweets for the pharma company?
Advanced ML
11. What should be done with the dataset if the linear regression assumptions are not met?
Intermediate ML
1. If you create a scatter plot of values for x and y and see that there is not a linear relationship
between the two variables, then one can do the following:
- Apply a nonlinear transformation to the independent and/or dependent variable. e.g. log, square
root, or reciprocal of the independent and/or dependent variable
- Add another independent variable to the model.
2. If residuals are not independent then one can do the following:
- For positive serial correlation, consider adding lags of the dependent and/or independent
variable to the model.
- For negative serial correlation, check to make sure that none of your variables is
overdifferenced.
- For seasonal correlation, consider adding seasonal dummy variables to the model
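3. If residuals do not have constant variance (heteroscedasticity), then one can do the following:
- Transform the dependent variable, e.g. take its log or square root
- Use weighted regression, which gives less weight to observations with higher variance
4. If residuals are not normally distributed, then one can do the following: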
- First, verify that any outliers aren’t having a huge impact on the distribution. If there are outliers
present, make sure that they are real values and that they aren’t data entry errors
- Next, you can apply a nonlinear transformation to the independent and/or dependent variable.
e.g. log, square root, or the reciprocal of the independent and/or dependent variable
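As a minimal sketch of the transformation idea, the example below fits a straight line to synthetic exponential data before and after taking the log of the dependent variable. The data, the helper names fit_line and r_squared, and the constants 2.0 and 0.3 are assumptions made up for illustration.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def r_squared(xs, ys, intercept, slope):
    """Share of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Synthetic, noise-free data with an exponential (nonlinear) relationship.
xs = [float(i) for i in range(1, 21)]
ys = [2.0 * math.exp(0.3 * x) for x in xs]

# Fit on the raw scale, then on the log-transformed dependent variable.
raw_fit = fit_line(xs, ys)
log_ys = [math.log(y) for y in ys]
log_fit = fit_line(xs, log_ys)

r2_raw = r_squared(xs, ys, *raw_fit)
r2_log = r_squared(xs, log_ys, *log_fit)
print(f"R^2 raw: {r2_raw:.3f}, R^2 after log transform: {r2_log:.3f}")
```

After the transform the slope of the fitted line estimates the growth rate (0.3 here) and the intercept estimates log(2.0), so the linear model describes the data exactly.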
23. You are given a data set on cancer detection. You’ve built a classification
model and achieved an accuracy of 96%. Why shouldn’t you be happy with
your model performance? What can you do about it?
Intermediate ML
24. You are working on a time series data set. Your manager has asked you to
build a high accuracy model. You start with the decision tree algorithm, since
you know it works fairly well on all kinds of data. Later, you try a time series
regression model and get higher accuracy than the decision tree model. Can this
happen? Why?
Advanced ML
25. You came to know that your model is suffering from low bias and high
variance. Which algorithm should you use to tackle it? Why?
Intermediate ML
27. After analyzing the model, your manager has informed you that your regression
model is suffering from multicollinearity. How would you check if he’s right?
Without losing any information, can you still build a better model?
Intermediate ML
29. While working on a data set, how do you select important variables?
Explain your methods.
Basic ML
31. Both being tree-based algorithms, how is random forest different from
the gradient boosting machine (GBM) algorithm?
Basic ML
32. You’ve got a data set to work on that has p (no. of variables) > n (no. of
observations). Why is Ordinary Least Squares (OLS) a bad option to work with?
Which techniques would be best to use? Why?
Advanced ML
33. We know that one hot encoding increases the dimensionality of a data set,
but label encoding doesn’t. How?
Intermediate ML
34. You are given a data set in which some variables have more than 30%
missing values. Let’s say, out of 50 variables, 8 variables have more than 30%
of their values missing. How will you deal with them?
Basic ML
35. ‘People who bought this also bought…’ recommendations seen on Amazon
are a result of which algorithm?
Intermediate ML
37. You have been asked to evaluate a regression model based on R², adjusted
R² and tolerance. What will be your criteria?
Basic ML
38. Considering the long list of machine learning algorithms, given a data set,
how do you decide which one to use?
Basic ML
41. How can you prove that one improvement you've brought to an algorithm
is really an improvement over not doing anything?
Basic ML
42. Explain what resampling methods are and why they are useful. Also
explain their limitations.
Basic ML
- Resampling means repeatedly drawing samples from a training set and refitting a model of interest
on each sample in order to obtain additional information about the fitted model
- Example: repeatedly draw different samples from the training data, fit a linear regression to each
new sample, and then examine the extent to which the resulting fits differ
- The most common methods are cross-validation and the bootstrap. Cross-validation uses random
sampling without replacement; the bootstrap uses random sampling with replacement
- Cross-validation: used for evaluating model performance and for model selection (selecting the
appropriate level of flexibility)
- Bootstrap: mostly used to quantify the uncertainty associated with a given estimator or statistical
learning method
- Limitation: the model has to be refit many times, so resampling can be computationally expensive
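The two sampling schemes can be sketched with the standard library alone. The helper names k_fold_indices and bootstrap_se, the toy data and the number of bootstrap rounds are assumptions for illustration, not a reference implementation.

```python
import random
import statistics

def k_fold_indices(n, k, seed=0):
    """Split indices 0..n-1 into k disjoint folds (sampling without replacement)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def bootstrap_se(data, stat=statistics.mean, n_boot=1000, seed=0):
    """Estimate the standard error of `stat` by resampling with replacement."""
    rng = random.Random(seed)
    estimates = [stat(rng.choices(data, k=len(data))) for _ in range(n_boot)]
    return statistics.stdev(estimates)

data = [2.1, 3.4, 1.9, 5.6, 4.4, 3.8, 2.7, 4.9, 3.1, 4.0]

# Cross-validation: every observation lands in exactly one held-out fold.
folds = k_fold_indices(len(data), k=5)

# Bootstrap: the spread of the resampled means estimates the uncertainty of the mean.
se = bootstrap_se(data)
print("folds:", folds)
print("bootstrap standard error of the mean:", round(se, 3))
```

In cross-validation each fold would serve once as the validation set while the model is fit on the remaining folds.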
43. Is it better to have too many false positives, or too many false negatives?
Explain.
Basic ML
False positives and false negatives are two kinds of error we have to deal with while evaluating a
model.
In medicine, a false positive can lead to unnecessary treatment, and a false negative can lead to a
missed diagnosis, which is very serious since the disease goes untreated.
We can reduce both errors by collecting more information, considering other variables, or
conducting the test multiple times, and we can trade one for the other by adjusting the
sensitivity (true positive rate) and specificity (true negative rate) of the test.
Even so, it is still hard, since for a fixed model reducing one type of error usually means
increasing the other.
Often one type of error is more acceptable than the other, so data scientists have to evaluate
the consequences of each error and decide accordingly.
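The trade-off between the two errors can be illustrated by moving the decision threshold of a classifier. The labels, scores and the function name sensitivity_specificity below are made-up assumptions for the sketch.

```python
def sensitivity_specificity(y_true, scores, threshold):
    """Return (sensitivity, specificity) when scores >= threshold are called positive."""
    tp = fp = tn = fn = 0
    for label, score in zip(y_true, scores):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical true labels (1 = sick) and model scores for ten patients.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.5, 0.3, 0.2, 0.2, 0.1]

strict = sensitivity_specificity(y_true, scores, threshold=0.5)
lenient = sensitivity_specificity(y_true, scores, threshold=0.3)
print("threshold 0.5 -> sensitivity, specificity:", strict)
print("threshold 0.3 -> sensitivity, specificity:", lenient)
```

Lowering the threshold removes the false negative in this toy data but adds a false positive, which is the choice the answer above describes.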
44. What is selection bias, why is it important and how can you avoid it?
Basic ML
Selection bias occurs if a data set's examples are chosen in a way that is not reflective of their
real-world distribution. It is important because conclusions drawn from such data may not hold
for the real population.
Mechanisms for avoiding selection bias include:
- Using random methods when selecting subgroups from populations.
- Ensuring that the subgroups selected are equivalent to the population at large in terms of their
key characteristics (this method is less of a protection than the first since typically the key
characteristics are not known).
47. Can you cite some examples where both false positive and false negatives
are equally important?
Intermediate ML
Let us take an example from the medical field, where:
A false positive = a person is considered sick but is actually healthy
A false negative = a person is considered healthy but is actually sick
What does it mean?
False positives lead to overspending on unnecessary care and can damage the health of
an otherwise healthy person through unnecessary side effects of the therapy.
False negatives mean that patients get sicker or die.
In this case, both false positives and false negatives are equally important, since either one puts
a person’s health or life at risk.
53. What is part of speech (POS) tagging? What is the simplest approach to
building a POS tagger that you can imagine?
Basic NLP
POS tagging is the process of marking up each word in a corpus with its part-of-speech tag,
based on its context and definition. The simplest approach is lexicon-based: use a lexicon to
assign a tag to each word. The lexicon is constructed from a gold standard annotated corpus,
where each word type is coupled with its most frequent associated tag in the gold standard
corpus.
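A lexicon-based tagger of this kind can be sketched in a few lines. The tiny tagged corpus, the tag names and the NOUN fallback for unseen words are assumptions chosen for illustration.

```python
from collections import Counter, defaultdict

def build_lexicon(tagged_sentences):
    """Map each word type to its most frequent tag in the annotated corpus."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag(words, lexicon, default_tag="NOUN"):
    """Tag each word by lexicon lookup; unseen words get the default tag."""
    return [(w, lexicon.get(w.lower(), default_tag)) for w in words]

# Hypothetical gold standard corpus.
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
    [("the", "DET"), ("run", "NOUN"), ("was", "VERB"), ("long", "ADJ")],
    [("dogs", "NOUN"), ("run", "VERB"), ("fast", "ADV")],
    [("they", "PRON"), ("run", "VERB")],
]

lexicon = build_lexicon(corpus)
tagged = tag(["The", "cat", "run"], lexicon)
print(tagged)
```

The word "run" is tagged VERB because that is its most frequent tag in the corpus, regardless of context, which is the main weakness of this baseline.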
54. How would you build a part of speech (POS) tagger from scratch given a
corpus of annotated sentences? How would you deal with unknown words?
Basic NLP
First, we will create features from words (like the last two or three letters, the previous word, the
next word, etc.). Then we will train a classifier to predict the POS tag. HMMs, CRFs and RNNs can
be used to train the model. Unknown words can still be tagged, because features such as the
position of the word and its suffix can be generated for them as well.
55. How would you train a model that identifies whether the word “Apple” in a
sentence belongs to the fruit or the company?
Basic NLP
This can be treated as a Named Entity Recognition (NER) task: in context, the company is tagged
as an organization entity while the fruit is not tagged as an entity. HMMs, CRFs and RNNs can be
used to train an NER model.
56. How would you find all the occurrences of quoted text in a news article?
Basic NLP
Train a classifier model to look at the constituent parts of a news article and assign a probability
that, taken together, they compose a valid quotation.
57. How would you build a system that auto-corrects text that has been
generated by a speech recognition system?
Basic NLP
It can be done in multiple ways, but the simplest way would be to take the unknown words and
compare them with similar words from our dictionary. Distances can be calculated using
algorithms like Levenshtein, and if the closest dictionary word is near enough, the unknown word
can be replaced with it.
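The dictionary lookup can be sketched as follows. The vocabulary, the max_distance cut-off of 2 and the function names are assumptions for illustration; a real system would also use word frequencies and context.

```python
def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def correct(word, vocabulary, max_distance=2):
    """Replace an unknown word by its nearest vocabulary word if it is close enough."""
    if word in vocabulary:
        return word
    best = min(vocabulary, key=lambda candidate: levenshtein(word, candidate))
    return best if levenshtein(word, best) <= max_distance else word

# Hypothetical dictionary of known words.
vocabulary = ["weather", "whether", "forecast", "tomorrow"]

print(correct("forcast", vocabulary))
print(correct("tomorow", vocabulary))
print(correct("xyz", vocabulary))
```

Words that are too far from every dictionary entry are left unchanged rather than replaced by a poor guess.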
58. Which are some popular models other than word2vec?
Basic NLP
Some popular models other than word2vec are GloVe, AdaGram, FastText, etc.
60. Explain some metrics to test out a Named Entity recognition model.
Basic NLP
When you train an NER system, the most typical evaluation is to compute precision, recall and
F1-score, along with a confusion matrix, at the token level.
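A minimal sketch of token-level scoring is shown below. It assumes one label per token, with "O" marking tokens outside any entity; the gold and predicted sequences and the function name are made up for the example.

```python
def token_level_scores(gold, predicted, outside="O"):
    """Micro-averaged precision, recall and F1 over entity tokens."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == p and g != outside)
    fp = sum(1 for g, p in pairs if p != outside and g != p)
    fn = sum(1 for g, p in pairs if g != outside and g != p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical gold and predicted labels for six tokens.
gold = ["PER", "PER", "O", "ORG", "O", "LOC"]
predicted = ["PER", "O", "O", "ORG", "ORG", "O"]

precision, recall, f1 = token_level_scores(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here two entity tokens are correct, one is a spurious prediction and two are missed, so precision is 2/3 and recall is 2/4.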
61. List out some popular Python libraries that are used for NLP.
Basic NLP
Some popular libraries for NLP are NLTK, Gensim, spaCy, TextBlob, etc.
63. What is the difference between the search function and the match function?
Basic NLP
The re.search() method looks for the pattern anywhere in the string, whereas the re.match()
method looks for it only at the beginning of the string. Both return a match object on success
and None otherwise.
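A short example of the difference, using a made-up input string:

```python
import re

text = "order id: 4821"

# re.match only tries at position 0, so digits later in the string are not found.
at_start = re.match(r"\d+", text)

# re.search scans the whole string and returns the first match.
anywhere = re.search(r"\d+", text)

# re.match succeeds when the pattern does occur at the beginning.
prefix = re.match(r"order", text)

print(at_start)
print(anywhere.group(), anywhere.start())
print(prefix.group())
```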
69. How would you build a system to translate English text to Greek and vice versa?
Basic NLP
One can use Neural Machine Translation to translate English text to Greek and vice-versa. A
sequence to sequence model can be created using RNNs.
70. How would you build a system that automatically groups news articles by
subject?
Basic NLP
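There can be different ways to do this task. If you have annotated data, you can train a classifier
model to assign each article to a subject. If you do not, an unsupervised method such as
clustering or topic modeling can be used to group similar articles.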
71. What are stop words? Describe an application in which stop words should
be removed.
Basic NLP
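Stop words are frequently used words (such as "the", "is" and "of") that do not add much
meaning to a sentence or do not help in prediction. They are often removed while performing
sentiment analysis, although negation words such as "not" should be kept because they change
the polarity of a sentence.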
72. How would you design a model to predict whether a movie review was
positive or negative?
Basic NLP
We will need to perform sentiment analysis on the reviews. It can be done in multiple ways; one
simple way is to train a classifier using classical ML algorithms or RNNs (LSTM or GRU).
73. What is entropy? How would you estimate the entropy of the English
language?
Basic NLP
Entropy is a measure of the randomness (average uncertainty) in information. One possible way of
calculating the entropy of English uses N-grams. One can statistically calculate the entropy of the
next letter when the previous N - 1 letters are known.
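The N-gram estimate can be sketched as below. The function names and the toy strings are assumptions; a meaningful estimate for English would need a large corpus, and the result is in bits per character.

```python
import math
from collections import Counter

def unigram_entropy(text):
    """Entropy in bits per character from single-character frequencies."""
    counts = Counter(text)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def conditional_entropy(text, n=2):
    """Entropy of the next character given the previous n - 1 characters."""
    windows = [text[i:i + n] for i in range(len(text) - n + 1)]
    ngrams = Counter(windows)
    contexts = Counter(w[:-1] for w in windows)
    total = len(windows)
    return -sum((c / total) * math.log2(c / contexts[gram[:-1]])
                for gram, c in ngrams.items())

sample = "abababab"
h1 = unigram_entropy(sample)
h2 = conditional_entropy(sample, n=2)
print("unigram entropy:", h1)
print("entropy given previous letter:", h2)
```

In the toy string each letter is equally likely on its own (1 bit) but fully determined by the previous letter (0 bits), which shows why longer contexts give lower entropy estimates.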
74. What is the TF-IDF score of a word and in what context is this useful?
Basic NLP
TF-IDF is a statistical measure that evaluates how relevant a word is to a document in a
collection of documents. This is done by multiplying two metrics: how many times a word
appears in a document (term frequency), and the inverse document frequency of the word across
the set of documents. TF-IDF is used to convert a text corpus into a numeric matrix on which
machine learning algorithms can be trained.
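A minimal sketch of the calculation follows. It assumes whitespace tokenization, term frequency normalized by document length and an unsmoothed idf of log(N / df); libraries commonly use smoothed variants, so their numbers will differ. The documents are made up.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical mini corpus.
docs = [
    "the drug reduced the pain",
    "the trial tested the drug",
    "the patients reported nausea",
]

weights = tf_idf(docs)
print(weights[0])
```

A word that appears in every document ("the") gets weight 0, while a word unique to one document ("pain") gets the highest weight there.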
76. What are the difficulties in building and using an annotated corpus of text
such as the Brown Corpus and what can be done to mitigate them?
Basic NLP
77. What tools for training NLP models (NLTK, Apache OpenNLP, GATE,
MALLET etc…) have you used?
Basic NLP
To train NLP models, I have used NLTK, Gensim, spaCy and a few others.
78. Are you familiar with WordNet or other related linguistic resources?
Basic NLP
WordNet is a lexical database, i.e. a dictionary, of the English language that is widely used in
NLP. It groups words into synsets, which are sets of synonyms that share one meaning. NLTK
provides a simple interface for looking up words and their synsets in WordNet.
80. What are some of the common problems using fixed window neural
models?
Advanced NLP
The main problem faced while using a fixed window neural model is that the window can be too
small for long sentences, making the model unable to process the complete information.
89. Why is there a specific need for an architecture like GRU or LSTM?
Advanced NLP
RNNs suffer from short-term memory. If a sequence is long enough, they will have a hard time
carrying the information from the earlier timesteps to later ones. This is caused by the vanishing
gradient problem. To solve this issue, GRUs and LSTMs are used.
92. What kind of datasets are RNNs known best to work on?
Advanced NLP
RNNs are good at making predictions when the data is sequential.
93. What are the different possible architectures in RNNs and give examples of
the same?
Advanced NLP
Different possible architectures for RNN are the following:
94. What are some of the ways to address the exploding gradients problem in
RNNs?
Advanced NLP
Some of the ways to address the exploding gradient problem are
1. Gradient Clipping: Limit the size of gradients during the training of your network.
2. Weight Regularization: apply a penalty to the network's loss function for large weight
values.
3. Using LSTM or GRU
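Gradient clipping by norm, the first item above, can be sketched on a plain list of numbers. The function name clip_by_norm and the max_norm value are assumptions; deep learning frameworks ship their own clipping utilities that operate on tensors.

```python
import math

def clip_by_norm(gradients, max_norm):
    """Rescale the gradient vector so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in gradients))
    if norm <= max_norm:
        return list(gradients)      # small gradients pass through unchanged
    scale = max_norm / norm
    return [g * scale for g in gradients]

exploding = [3.0, 4.0]              # L2 norm is 5.0
clipped = clip_by_norm(exploding, max_norm=1.0)
small = clip_by_norm([0.3, 0.4], max_norm=1.0)
print(clipped, small)
```

Clipping keeps the direction of the gradient and only shrinks its length, so the update step stays bounded.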
1. Bahdanau Attention
2. Luong Attention
104. What information is stored in the hidden and cell state of an LSTM?
Advanced NLP
The cell state (also called long-term memory) contains the information from the past. The hidden
state (also called working memory) contains the information from the current state that needs to
be taken to the next state.
106. What are the differences between BERT and ALBERT v2?
Advanced NLP
BERT is an expensive model in terms of memory and time consumed on computations, even
with a GPU. ALBERT v2 is lighter and faster than BERT. Cross-layer parameter sharing is the
most significant change to the BERT architecture that created ALBERT.
1. BERT Base: 12 layers (transformer blocks), 12 attention heads, and 110 million
parameters
2. BERT Large: 24 layers (transformer blocks), 16 attention heads and, 340 million
parameters
1. BERT
2. GPT-3
3. XLNet
109. What are the most challenging NLP problems that researchers/industries
are working on currently?
Advanced NLP
Following are the challenges faced currently in NLP
115. How will you import multiple Excel sheets into a data frame?
Basic Python
123. Python or R – Which one would you prefer for text analytics?
Intermediate Python
126. How is the pandas Series different from a single column dataframe?
Intermediate Python
A pandas Series is the data structure for a single column of a DataFrame, not only conceptually,
but literally, i.e. the data in a DataFrame is organized as a collection of Series.
A Series is a one-dimensional object that can hold any data type such as integers, floats and
strings. It has an index and an optional name but no column header, whereas a single column
DataFrame is two-dimensional and has a column name.
127. Which libraries in SciPy have you worked with in your project?
Intermediate Python
SciPy contains modules for optimization, linear algebra, integration, interpolation, special
functions, FFT, signal and image processing, ODE solvers, etc.
Subpackages include:
scipy.cluster
scipy.constants
scipy.fftpack
scipy.integrate
scipy.interpolate
scipy.linalg
scipy.io
scipy.ndimage
scipy.odr
scipy.optimize
scipy.signal
scipy.sparse
scipy.spatial
scipy.special
scipy.stats
scipy.weave (no longer included in recent SciPy releases)
141. What is the difference between univariate and bivariate analysis? Which
functions can be used in Python?
Basic Python
Univariate statistics summarize only one variable at a time.
Bivariate statistics compare two variables.
Below are a few functions which can be used in univariate and bivariate analysis:
1. To find the proportions of the population with different types of blood disorders:
df.Thal.value_counts(normalize=True)
2. To make a plot of the distribution (sns.histplot replaces the deprecated sns.distplot):
sns.histplot(df.Variable.dropna(), kde=True)
3. Find the minimum, maximum, average, and standard deviation of the data:
df.Variable.describe()
4. Find the mean of the variable:
df.Variable.dropna().mean()
5. Boxplot to observe outliers (pass column names for x, y and hue):
sns.boxplot(x = "", y = "", hue = "", data=df)
6. Correlation matrix:
df.corr()
142. Which methods can be used to standardize (scale) the data using
Python?
Intermediate Python
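The sklearn.preprocessing module of scikit-learn provides the following:
MinMaxScaler (min-max scaling)
StandardScaler (zero mean, unit variance)
MaxAbsScaler (scaling by the maximum absolute value)
RobustScaler (scaling by median and interquartile range)
QuantileTransformer (quantile transformation)
PowerTransformer (power transformation)
Normalizer (unit vector scaling)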
145. Can you plot 3D plots using matplotlib? Name the function.
Intermediate Python
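Yes. Create axes with projection='3d' and then call a plotting method on them:
import numpy as np
import matplotlib.pyplot as plt
fig = plt.figure()
ax = plt.axes(projection='3d')  # 3D axes
z = np.linspace(0, 15, 200)
ax.plot(np.sin(z), np.cos(z), z)  # draw a helix as a 3D line
plt.show()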
151. Can you convert a string into an int? When and how?
Basic Python
Python offers the int() function, which takes a string as an argument and returns an integer.
This works when the string represents a whole number, e.g. int("42") returns 42.
But keep these special cases in mind:
A string with a fractional part, such as "3.7", raises a ValueError; convert it with float() first.
A floating-point number as an argument is truncated toward zero, so int(3.7) returns 3 and
int(-3.7) returns -3.
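A few conversions, with made-up input values, show when int() works and when it fails:

```python
whole = int("42")              # string holding a whole number
padded = int("  -7 ")          # surrounding whitespace is allowed
from_hex = int("ff", 16)       # an optional second argument gives the base

truncated_pos = int(3.9)       # floats are truncated toward zero
truncated_neg = int(-3.9)

via_float = int(float("3.9"))  # a string with a fraction must go through float()

try:
    int("3.9")                 # not a valid integer literal
    failed = False
except ValueError:
    failed = True

print(whole, padded, from_hex, truncated_pos, truncated_neg, via_float, failed)
```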
154. What is the difference between list, array and tuple in Python?
Basic Python
List:
A list is an ordered collection of items.
A list is mutable.
Lists are dynamic and can contain objects of different data types.
List elements can be accessed by index number.
Array:
An array is an ordered collection of items of a single data type.
An array is mutable.
Array elements can be accessed by index number.
Tuple:
A tuple is an ordered collection that can store items of any data type.
It is defined using ().
It is immutable: once created, its elements cannot be changed or replaced.
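The differences can be checked directly. This sketch assumes "array" refers to the standard library array module (a NumPy array behaves similarly with respect to having a single data type); the values are made up.

```python
from array import array

# List: mutable and may mix data types.
items = [1, "two", 3.0]
items.append(4)

# Array: mutable, but every element must match the declared type ("i" = signed int).
numbers = array("i", [1, 2, 3])
numbers.append(4)
try:
    numbers.append("five")
    array_rejected_str = False
except TypeError:
    array_rejected_str = True

# Tuple: elements can be read by index but not reassigned.
point = (1, "two")
try:
    point[0] = 10
    tuple_is_immutable = False
except TypeError:
    tuple_is_immutable = True

print(items, list(numbers), point, array_rejected_str, tuple_is_immutable)
```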