Python Learning
Python
1. It has various libraries that support general-purpose and scientific computing.
2. Its package NumPy helps in scientific computing, and its arrays use less memory than
conventional Python lists.
3. Python has packages that can directly use code from other languages like C or Java.
4. Competition – R or MATLAB
5. Its excellent memory management, especially garbage collection, makes it versatile for
managing large volumes of data for slicing, dicing, transformation and visualization.
PYTHONPATH: this variable tells Python where to locate the module files imported into a program.
Windows
Anaconda (from www.continuum.io) is a free Python distribution for the SciPy
stack. It is also available for Linux and Mac.
Python (x,y): a free Python distribution with the SciPy stack and the Spyder
IDE for Windows OS. (Downloadable from www.python-xy.github.io/)
Python summary
1. Under this approach (supervised learning), models are trained using labeled data.
That is, the training data already carries the correct answers, and the values
for the test data are then predicted.
2. The application receives the prediction data along with the correct
values, and the algorithm learns by comparing its actual output with the
correct outputs to find errors.
3. It then modifies the model accordingly.
4. It is used in scenarios where historical data can predict future values
or events.
5. ML Process
a. Data Collection/Acquisition
b. Data Cleaning and Transformation
c. Split the data to Test and Train Set
d. Model Building and Training
e. Model Testing
i. Adjust model parameters to fine-tune based on testing.
f. Model Deployment
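A minimal sketch of these steps with scikit-learn (the file name, the 'target' column, the split sizes and the choice of LogisticRegression are all illustrative, not from these notes):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# a. Data acquisition (illustrative file name)
df = pd.read_csv('data.csv')

# b./c. Assume cleaning is done; split features and labels, then train/test split
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

# d. Model building and training (any estimator could be used here)
model = LogisticRegression()
model.fit(X_train, y_train)

# e. Model testing; adjust parameters and retrain as needed, then deploy
predictions = model.predict(X_test)
print(accuracy_score(y_test, predictions))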
Confusion Matrix

True Condition \ Predicted    Prediction Positive               Prediction Negative
Condition Positive            True Positive                     False Negative (Type 2 Error)
Condition Negative            False Positive (Type 1 Error)     True Negative
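As a sketch, the matrix above can be computed with scikit-learn (y_test and predictions as used in the model sections later in these notes):

from sklearn.metrics import confusion_matrix, classification_report

print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))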
Model Evaluation
The focus is on whether the model should minimize False Negatives or False Positives.
In the case of medical disease diagnosis, it is better to tolerate False Positives
and to minimize False Negatives.
Models Work
# Coefficients of the fitted linear model (lm is a trained LinearRegression instance)
print('Coefficients: \n', lm.coef_)
Learning
a. X_train -> The training features on which the model is trained.
b. y_train -> The labelled values provided to train the model.
c. X_test -> The values used for model prediction.
d. y_test -> The values used to validate the model's predictions.
Key Point -> If the values predicted on X_test and the y_test values fall on
the same straight line, then linear regression is a good fit. This can be
checked with a scatter plot of the predicted values against y_test.
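A minimal sketch of this check (X and y are assumed to be prepared beforehand; the split sizes are illustrative):

import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

lm = LinearRegression()
lm.fit(X_train, y_train)
predictions = lm.predict(X_test)

# If the points fall roughly on a straight line, linear regression is a good fit
plt.scatter(y_test, predictions)
plt.xlabel('Y_Test')
plt.ylabel('Predicted values')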
from sklearn import metrics

metrics.mean_absolute_error(y_test, predictions)           # MAE
metrics.mean_squared_error(y_test, predictions)            # MSE
np.sqrt(metrics.mean_squared_error(y_test, predictions))   # RMSE
o Bias refers to the gap between the predicted values and the actual values
(i.e. y_test).
o Variance is how scattered the predicted values are among each other. The
closer they are to each other, the lower the variance.
Past the bias-variance trade-off point, the model begins to overfit.
Overfitting is when the model tries to fit each and every data point of the
data set; it happens when too many features are used to predict the target.
For example, trying to identify a ball using overly specific features such as
the exact radius.
Underfitting is when the predicted values are far from the actual values; it
happens when too few features are chosen, so the fitted line misses the data.
Link - https://www.youtube.com/watch?v=viV_53s97Nw
1. Linear Regression Theory
This explains the phenomenon that subsequent values generally tend to move towards
the mean.
The goal is to minimize the vertical distance between all the data points and our line.
So in determining the best line, we are attempting to minimize the total distance
between all the points and the line.
One method is the Least Squares Method (the classic regression method), in which the line
is fitted by minimizing the sum of squares of the residuals. The residual for an observation
is the difference between the observation (the y value) and the fitted line.
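In other words, for points (x_i, y_i) and a line y = m*x + b, least squares picks m and b to minimize the sum of (y_i - (m*x_i + b))^2. A small numpy sketch of that idea (the data values are illustrative):

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# np.polyfit with degree 1 fits a straight line by ordinary least squares
m, b = np.polyfit(x, y, 1)
residuals = y - (m * x + b)
print(m, b, (residuals ** 2).sum())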
Columns containing text data are of no use for regression as-is, so we either drop
them or transform them to build categories (dummy variables) out of them.
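A short sketch of that transformation with pandas (the 'Sex' column name is illustrative):

import pandas as pd

# Build dummy (0/1) columns from a text column and drop the original text column
sex_dummies = pd.get_dummies(df['Sex'], drop_first=True)
df = pd.concat([df.drop('Sex', axis=1), sex_dummies], axis=1)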
2. Logistic Regression
Theory
It is a classification algorithm, used to predict categorical value.
We learned -
1. Heatmap to visualize which column data is filled or empty
2. Fill empty values
3. Pass multiple values to a function
4. Drop null values
df.dropna(axis=1, inplace=True)  # drop columns that contain null values
5. Get dummies for data transformation
pd.get_dummies
6. Remove the textual data (drop function)
Practical Commands
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# X (features) and y (labels) are assumed to be prepared beforehand
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

logmodel = LogisticRegression()
logmodel.fit(X_train, y_train)
predictions = logmodel.predict(X_test)
It has 2 types
1. Binary classification – two values to choose from (Yes or No), e.g. whether
an email is spam or not.
2. Multi-class classification – multiple values, e.g. which party will win the
election in India.
Linear regression is not a good fit for binary and multi-class classification
problems, hence we need logistic regression. It uses the sigmoid function,
whose output can then be mapped to class 0 or class 1.
Sigmoid(z) = 1 / (1 + e^(-z))
Since e^(-z) is always positive, the denominator is greater than 1, so dividing 1
by it always gives a value between 0 and 1.
We can set a cut-off point at 0.5: any value above 0.5 results in class 1 and
any value below 0.5 results in class 0.
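A small sketch of the sigmoid and the 0.5 cut-off (the z values are illustrative; z would normally be the linear part of the model):

import numpy as np

def sigmoid(z):
    # Always returns a value between 0 and 1
    return 1 / (1 + np.exp(-z))

z = np.array([-3.0, -0.2, 0.1, 4.0])
probabilities = sigmoid(z)
predicted_class = (probabilities >= 0.5).astype(int)  # cut-off at 0.5
print(probabilities, predicted_class)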
K Nearest Neighbours (KNN)
Theory
It is a classification algorithm. It predicts the class of a new value based
on the classes of the nearest data points. It is the simplest classification
algorithm, but it depends on the chosen value of k (the neighbourhood
considered around the data point).
KNN, as the name implies, tries to associate the new data point with the
classes of the surrounding data points. It can be used to classify a data
point into 2 or more classes.
Pros
1. Easy to use
2. Training is trivial
3. Works with any number of classes
4. Easy to add more data
5. Few Parameters
a. K
b. Distance Metric
Cons
Steps
1. Import Libraries
2. Read data from the file
3. EDA
4. Standardize the variables
5. Train Test Split
6. Using KNN model
7. Predictions and evaluations
8. Predicting K Value
9. Retrain with new K Value
Practical Command
error_rate = []
for i in range(1, 40):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))
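To pick K from these error rates, one can plot them and retrain with the chosen value (a sketch; the K chosen below is illustrative):

import matplotlib.pyplot as plt

plt.plot(range(1, 40), error_rate, marker='o')
plt.xlabel('K')
plt.ylabel('Error Rate')

# Retrain with the K that gave the lowest error rate (value here is illustrative)
knn = KNeighborsClassifier(n_neighbors=17)
knn.fit(X_train, y_train)
pred = knn.predict(X_test)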
Decision Trees
a) Under this, the model builds a decision tree in which different parameters are identified, and
based on the parameters' condition values the model provides predictions. For example, a particular
year's sales could depend on the marketing budget, people employed, and the R&D budget.
In this tree we have nodes and edges. Nodes are the parameters and edges are the decisions or
possibilities at the nodes. The tree has a root node and leaf nodes.
Model Library
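Presumably the decision tree classifier from scikit-learn, along these lines (a sketch using the train/test split conventions from earlier; dtree is the name the tree-visualization commands later in these notes assume):

from sklearn.tree import DecisionTreeClassifier

dtree = DecisionTreeClassifier()
dtree.fit(X_train, y_train)
predictions = dtree.predict(X_test)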
Random Forests
b) To improve performance, one can use many decision trees, each built with a random sample of
features chosen at the splits. What happens in a random forest is that it chooses a new random
sample of features at every single split, and the training samples for each tree are drawn with
replacement (bootstrapping). Hence, in a random forest the model chooses the features randomly and
de-correlates the trees, avoiding the phenomenon of different trees sharing the same dominant
features, because the candidate features are picked randomly at each split of each decision tree.
Model Efficiency – For classification, m (the number of features considered at each split) is
typically chosen as the square root of p (the total number of features).
Model Library
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=200)
rfc.fit(X_train, y_train)
rfc_pred = rfc.predict(X_test)
Its accuracy is generally much higher than that of a single decision tree.
Unbalanced label distributions in the data set play a major role in the model's prediction quality.
Support Vector Machines (SVM)
This is used for classifying data points into separate classes. This is done
by drawing a boundary called a hyperplane which separates the data points
into their respective classes. The points that the margin lines touch are
known as Support Vectors.
This can also be applied to non-linearly separable data through the "kernel
trick", which projects the data into a higher-dimensional space (e.g. 3
dimensions).
If gamma is large, the model has low bias and high variance (it tends to overfit); if gamma is small, the opposite holds.
Need to use grid search for parameter adjustment. Mainly C and gamma.
Model Commands
from sklearn.svm import SVC

model = SVC()
In the SVC model there are 3 main parameters to work with: C, gamma and kernel.
Learning Param Grid (this is done to let the model find the most suitable
parameters)
param_grid = {'C': [0.1, 1, 10, 100, 1000], 'gamma': [1, 0.1, 0.01, 0.001, 0.0001]}
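A sketch of how that grid is typically used with GridSearchCV (the refit and verbose settings are illustrative):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

grid = GridSearchCV(SVC(), param_grid, refit=True, verbose=2)
grid.fit(X_train, y_train)
print(grid.best_params_)
grid_predictions = grid.predict(X_test)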
A. Useful Commands
1. Check isnull (Blank Data)
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')
Get Values
final_df.Country.value_counts().values
6. Pie Plot
plt.pie(country_val, labels=country_names, autopct="%1.2f%%")
dict1 = {0 : 'malignant',
1: 'benign'}
cancer_df['target_name'] = cancer_df['target_class'].map(dict1)
10. Tree Visualization
from IPython.display import Image
from sklearn.externals.six import StringIO  # in newer scikit-learn versions, use: from io import StringIO
from sklearn.tree import export_graphviz
import pydot

features = list(df.columns[1:])
features

dot_data = StringIO()
export_graphviz(dtree, out_file=dot_data, feature_names=features,
                filled=True, rounded=True)
graph = pydot.graph_from_dot_data(dot_data.getvalue())
Image(graph[0].create_png())