ML Imp Questions-1
What is Logistic Regression?
Logistic Regression is also called the logit model. It is a method used to forecast a binary
outcome from a linear combination of predictor variables.
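A minimal scikit-learn sketch of fitting a logit model (the toy data below are purely illustrative):

from sklearn.linear_model import LogisticRegression

# Illustrative toy data: two predictor variables, one binary outcome
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]]
y = [0, 0, 1, 1]

model = LogisticRegression()
model.fit(X, y)
print(model.predict_proba([[2.5, 2.5]]))  # predicted probability of each class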
Name three types of biases that can occur during sampling
In the sampling process, there are three types of biases, which are:
● Selection bias
● Undercoverage bias
● Survivorship bias
What is a decision tree?
A decision tree is a popular supervised machine learning algorithm, used mainly for
regression and classification. It breaks a dataset down into smaller and smaller subsets, and
it can handle both categorical and numerical data.
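As a minimal sketch, a decision tree can be fit with scikit-learn (the toy data are illustrative):

from sklearn.tree import DecisionTreeClassifier

# Illustrative toy data; the tree recursively splits it into smaller subsets
X = [[0, 0], [1, 1], [0, 1], [1, 0]]
y = [0, 1, 0, 1]

tree = DecisionTreeClassifier(max_depth=2)
tree.fit(X, y)
print(tree.predict([[1, 1]]))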
What are the differences between supervised and unsupervised learning?
Supervised learning trains a model on labeled data to predict a known target (e.g.,
classification and regression), whereas unsupervised learning finds structure in unlabeled
data (e.g., clustering and dimensionality reduction).
How can you avoid overfitting your model?
Overfitting refers to a model that is tuned to a very small amount of data and ignores
the bigger picture, so it fails to generalize. There are three main methods to avoid overfitting:
1. Keep the model simple—take fewer variables into account, thereby removing
some of the noise in the training data
2. Use cross-validation techniques, such as k-fold cross-validation
3. Use regularization techniques, such as LASSO, that penalize certain model
parameters if they're likely to cause overfitting (a sketch of both follows this list)
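A minimal sketch of methods 2 and 3 combined, assuming scikit-learn and synthetic data:

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

# Illustrative synthetic regression data
X, y = make_regression(n_samples=100, n_features=20, noise=0.5, random_state=0)

# LASSO penalizes coefficients, shrinking uninformative ones toward zero
lasso = Lasso(alpha=0.1)

# 5-fold cross-validation estimates out-of-sample performance
scores = cross_val_score(lasso, X, y, cv=5)
print(scores.mean())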
What are univariate, bivariate, and multivariate analyses?
● Univariate: analysis of a single variable
● Bivariate: analysis of two variables, to find relationships between them
● Multivariate: analysis of three or more variables
In each case, the patterns can be studied by drawing conclusions using the mean, median,
and mode, dispersion or range, minimum, maximum, etc.
What are the feature selection methods used to select the right variables?
There are two main methods for feature selection: filter methods and wrapper methods.
Filter Methods
This involves scoring each feature with a statistical measure computed independently of any
model (for example, correlation with the target or a chi-square test) and keeping the
highest-scoring features.
Wrapper Methods
This involves:
● Forward Selection: We test one feature at a time and keep adding them until
we get a good fit
● Backward Selection: We test all the features and start removing them to see
what works better
● Recursive Feature Elimination: Recursively fits a model, removing the weakest
features until the desired number remains (see the sketch below)
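A minimal sketch of Recursive Feature Elimination with scikit-learn (synthetic data, illustrative settings):

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Illustrative synthetic data with 10 candidate features
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Recursively drop the weakest feature until 3 remain
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
selector.fit(X, y)
print(selector.support_)  # boolean mask of the selected features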
You are given a data set consisting of variables with more than 30 percent
missing values. How will you deal with them?
If the data set is large, we can simply remove the rows with missing values. It is
the quickest way, and we use the rest of the data to predict the values.
For smaller data sets, we can substitute missing values with the mean of the
rest of the data using a pandas DataFrame in Python. There are different ways to do
so, such as df.mean() and df.fillna(df.mean()).
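A short pandas sketch of both options (the column name and values are illustrative):

import numpy as np
import pandas as pd

# Illustrative frame with missing values
df = pd.DataFrame({"age": [25, np.nan, 31, np.nan, 40]})

# Option 1: drop rows with missing values (large data sets)
df_dropped = df.dropna()

# Option 2: impute with the column mean (smaller data sets)
df_imputed = df.fillna(df["age"].mean())
print(df_imputed)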
What is dimensionality reduction?
Dimensionality reduction refers to the process of converting a data set with vast
dimensions into data with fewer dimensions (fields) to convey similar information
concisely.
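As one common way to do this (not the only one), principal component analysis (PCA) can be sketched with scikit-learn; the Iris data set here is just an example:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                  # 4 dimensions
pca = PCA(n_components=2)             # reduce to 2 dimensions
X_reduced = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # share of variance each component retains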
How should you maintain a deployed model? The steps are:
● Monitor
● Evaluate
● Compare
● Rebuild
What is a recommender system?
A recommender system predicts what rating a user would give a specific product based on
their preferences. It can be split into two different areas:
● Collaborative Filtering
● Content-based Filtering
We use the elbow method to select k for k-means clustering. The idea of the elbow
method is to run k-means clustering on the data set for a range of values of k (the number of
clusters) and compute the within-cluster sum of squares (WSS) for each; the point where the
WSS curve bends (the "elbow") suggests a good number of clusters.
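A minimal sketch, assuming scikit-learn and illustrative synthetic blobs:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative data drawn from 4 clusters
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Run k-means for a range of k and record the within-cluster sum of squares
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)  # WSS drops sharply until the elbow, here around k=4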
How can outliers be treated?
● Try a different model. Data detected as outliers by linear models can be fit by
nonlinear models. Therefore, be sure you are choosing the correct model.
List out the libraries in Python used for Data Analysis and Scientific
Computations.
● SciPy
● Pandas
● Matplotlib
● NumPy
● Scikit-learn
● Seaborn
Power analysis is an integral part of experimental design. It helps you
determine the sample size required to detect an effect of a given size with a specific level of
assurance.
The Naive Bayes algorithm is based on Bayes' theorem, which describes the
probability of an event based on prior knowledge of conditions related to it.
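In symbols, Bayes' theorem is P(A | B) = P(B | A) × P(A) / P(B): the posterior probability of A given evidence B equals the likelihood of B given A, times the prior probability of A, divided by the probability of B. The "naive" part of the algorithm is the assumption that the features are conditionally independent given the class.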
Linear regression is a statistical method in which the score of a variable 'A' is
predicted from the score of a second variable 'B'. B is referred to as the predictor
variable and A as the criterion variable.
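A minimal scikit-learn sketch (the numbers are illustrative):

from sklearn.linear_model import LinearRegression

# Illustrative: predict A (criterion) from B (predictor)
B = [[1], [2], [3], [4]]
A = [2.1, 3.9, 6.2, 7.8]

reg = LinearRegression().fit(B, A)
print(reg.coef_, reg.intercept_)  # slope and intercept of the fitted line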
State the difference between the expected value and mean value
There is no major mathematical difference between the two; "mean value" is generally used
when discussing a probability distribution, whereas "expected value" is used in the context of
a random variable.
Bagging
The bagging method trains similar learners on small sample populations (bootstrap
samples) and combines their predictions, which helps you make more accurate predictions.
Boosting
Boosting is an iterative method that adjusts the weight of an observation
based on the previous classification: misclassified observations get more weight in the next
round. Boosting decreases the bias error and helps you build strong predictive models.
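A minimal scikit-learn sketch of both ensemble methods (synthetic data, illustrative settings):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

# Bagging: many trees on bootstrap samples; combining them reduces variance
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50).fit(X, y)

# Boosting: each round re-weights the observations the last round misclassified
boost = AdaBoostClassifier(n_estimators=50).fit(X, y)
print(bag.score(X, y), boost.score(X, y))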
What is back-propagation?
Back-propagation is the essence of neural-net training. It is the method of tuning the
weights of a neural net based on the error rate obtained in the previous epoch: the error is
propagated backward through the network to compute how much each weight contributed to it.
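A minimal NumPy sketch of the idea for a single sigmoid neuron (the data and learning rate are illustrative):

import numpy as np

# Toy inputs and targets for one neuron with a sigmoid activation
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([[0.0], [1.0], [1.0]])
w = np.zeros((2, 1))

for epoch in range(1000):
    out = 1 / (1 + np.exp(-X @ w))          # forward pass
    error = out - y                          # error from this epoch
    grad = X.T @ (error * out * (1 - out))   # backward pass: chain rule
    w -= 0.5 * grad                          # adjust weights against the error
print(w.ravel())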
The main difference between the two is that data scientists have more technical
knowledge than business analysts, while business analysts bring the deeper understanding
of the business that data scientists do not necessarily need.
Explain why Data Cleansing is essential and which method you use to
maintain clean data
Dirty data often leads to incorrect insights, which can damage the prospects of any
organization. For example, if you run a targeted marketing campaign using a customer list
full of errors, the campaign will target the wrong people and fail.
What is precision?
Precision is one of the most commonly used error metrics for classification. It measures the
fraction of predicted positives that are actually positive: precision = TP / (TP + FP). Its
range is from 0 to 1, where 1 represents 100%.
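For instance, with illustrative counts: if a classifier makes 50 positive predictions and 40 of them are true positives (so 10 are false positives), precision = 40 / (40 + 10) = 0.8.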
A cluster sampling method is used when it is challenging to study a target population
spread across a wide area and simple random sampling can't be applied.
A validation set is usually considered part of the training set, as it is used for
parameter selection, which helps you avoid overfitting the model being built.
A test set, in contrast, is used for testing or evaluating the performance of the trained
machine learning model.
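A minimal sketch of creating all three splits with scikit-learn (the sizes are illustrative):

import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(100).reshape(50, 2), np.arange(50)

# Carve out the test set first, then split the remainder into train/validation
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)
# Tune parameters on (X_val, y_val); report final performance on (X_test, y_test)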
Cleaning data from multiple sources to transform it into a format that data analysts or
data scientists can work with is a cumbersome process: as the number of data sources
increases, the time taken to clean the data increases exponentially with both the number of
sources and the volume of data they generate.
What is a normal distribution?
In a normal distribution, the random variables are distributed in the form of a symmetrical,
bell-shaped curve.
Quick definitions:
● Linear regression: a statistical method in which one variable is predicted from another
(see the fuller answer above).
● Power analysis: an experimental design technique for determining the sample size
needed to detect an effect of a given size.
● Collaborative filtering: the process of filtering used by most recommender systems to
find patterns or information by combining viewpoints, various data sources, and
multiple agents.
Cluster sampling is a technique used when it becomes difficult to study the target
population spread across a wide area and simple random sampling cannot be applied. A
cluster sample is a probability sample where each sampling unit is a collection or cluster
of elements.
● Understand the problem statement and the data before answering; getting into the
data is important.
● Assign a default value, which can be the mean, minimum, or maximum value.
● If it is a categorical variable, assign the missing value a default category.
● If the data follow a normal distribution, impute the mean value.