Python Learning

https://www.tutorialspoint.com/python/python_pandas.htm

Python

1. It has various libraries that support general computing and scientific computing.
2. Its NumPy package helps in scientific computing, and its arrays use less memory than conventional Python lists.
3. Python has packages that can directly use code from other languages like C or Java.
4. Competition – R or MATLAB.
5. Its excellent memory management capability, especially garbage collection, makes it versatile in managing large volumes of data for slicing, dicing, transformation and visualization.

Python Environment Variables


PYTHONPATH

 This variable tells the interpreter where to locate the module files imported in a program.

Windows
Anaconda (from www.continuum.io) is a free Python distribution for the SciPy stack. It is also available for Linux and Mac.

Canopy (www.enthought.com/products/canopy/) is available as a free as well as a commercial distribution with the full SciPy stack for Windows, Linux and Mac.

Python (x,y): a free Python distribution with the SciPy stack and Spyder IDE for Windows. (Downloadable from www.python-xy.github.io/)

Python summary

1. We have different objects – List, Tuple, Dictionary and Set.
2. Each can be accessed with an index; however, these objects can't be multi-dimensional.
3. To overcome that limitation, we use NumPy, through which we can build multi-dimensional arrays.
4. In NumPy we can define multi-dimensional arrays using np.array.
5. np.arange is used for getting values in a specified range. It can be defined as np.arange(start value, end value, increment step).
6. However, there is a limitation in NumPy that all the values in an array must be of a single data type. To overcome that, we use the Pandas package.
7. Pandas provides Series and DataFrame, which can be built from the other objects like List, Dictionary, Set, Tuple and NumPy arrays.
8. A Series is one column, and a DataFrame is a collection of Series (a short sketch follows).
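A minimal sketch of the objects listed above, assuming NumPy and pandas are installed; the column names and values are illustrative only.

import numpy as np
import pandas as pd

# A multi-dimensional array built with np.array
arr = np.array([[1, 2, 3], [4, 5, 6]])

# np.arange(start, end, step): values from 0 up to (but not including) 10 in steps of 2
steps = np.arange(0, 10, 2)

# A Series is a single labelled column; a DataFrame is a collection of Series
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['x', 'y', 'z']})  # mixed types across columns
print(df.dtypes)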

Tips & Tricks

1. NumPy and pandas can interpret axis numbering differently in some operations – in NumPy, axis=0 runs along rows and axis=1 along columns – so double-check the axis argument when moving between the two.
2. If a command returns output, no changes are made in place (the pop function is an exception – it both returns output and modifies in place).
3. Define null as np.nan.
4. np.where is used for 2 purposes:
   a. To return the indices of values for which a condition is true.
   b. Replacement of values that meet a particular condition.
5. Use del on an object (List, Set, Dict, DataFrame, NumPy array, Series) to drop that object.
6. thresh=1 (in dropna) keeps a row/column only if it has at least one non-null value.
7. axis=0 is for rows and axis=1 is for columns.
8. Sort by value (sort_values) and sort by index (sort_index).
9. Whenever multiple values are to be passed to a function expecting one value, they should be passed as a list.
10. iloc works on the basis of position, whereas loc works on the basis of labels in a DataFrame.
11. To ignore warnings during model creation, import the warnings package and suppress them:
    import warnings
    warnings.filterwarnings('ignore')

A few of these tips are illustrated in the sketch below.
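A short illustrative sketch of a few of the tips above (np.where, dropna with thresh, and loc vs iloc); the DataFrame and values are made up for the example.

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, np.nan, 3], 'b': [np.nan, np.nan, 6]}, index=['r1', 'r2', 'r3'])

# np.where: indices where a condition is true, and conditional replacement
idx = np.where(df['a'] > 1)                          # positions where the condition holds
df['a_clipped'] = np.where(df['a'] > 1, 1, df['a'])  # replace values meeting the condition

# dropna with thresh: keep rows that have at least 2 non-null values
print(df.dropna(thresh=2))

# loc is label-based, iloc is position-based
print(df.loc['r1', 'a'])   # by label
print(df.iloc[0, 0])       # by position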

Ways to work on blank values

1. Replace with mean or median (using np.where or isnull as well).
2. method='bfill' (backward fill), 'ffill' (forward fill) or 'nearest'.
3. dropna (in case the blank values are less than ~5% of the data).
4. The reindex function, along with a range, creates new entries as per the range defined, besides keeping older values at their original index.
5. reindex has a fill_value argument, which fills only those values that were created as nulls due to reindexing.

A sketch of these options follows.
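A minimal sketch of the options above on a simple Series with a gap; the values are illustrative. Newer pandas prefers ffill()/bfill() over fillna(method=...).

import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0], index=[0, 1, 2])

s.fillna(s.mean())    # replace blanks with the mean
s.ffill()             # forward fill (older pandas: s.fillna(method='ffill'))
s.bfill()             # backward fill
s.dropna()            # drop blanks (when they are a small fraction of the data)

# reindex keeps old values at their index and fills only the newly created slots
s.reindex(range(5), fill_value=0)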
Supervised Learning Overview

1. Under this, models are trained using labeled data. It means the model learns from data whose correct values are already known, and then the values for the test data are predicted.
2. The application receives the training data along with the correct values, and the algorithm learns by comparing its actual output with the correct outputs to find errors.
3. It then modifies the model accordingly.
4. It would be used in scenarios where historical data can predict future values or events.

5. ML Process
a. Data Collection/Acquisition
b. Data Cleaning and Transformation
c. Split the data to Test and Train Set
d. Model Building and Training
e. Model Testing
i. Adjust model parameters to fine-tune based on testing.
f. Model Deployment

6. Often data is split into 3 parts (a split sketch follows this list):

   a. Training data set – to train the model features.
   b. Validation data set – to determine which model parameters to adjust.
   c. Test data set – to determine the final performance metric.
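A hedged sketch of a three-way split using train_test_split twice, assuming a feature matrix X and labels y are already defined; the 60/20/20 proportions are an illustrative assumption.

from sklearn.model_selection import train_test_split

# First carve off the test set, then split the remainder into train and validation
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=101)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=101)
# 0.25 of the remaining 80% = 20% of the original data reserved for validation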

7. Evaluating Performance – Classification Error Metrics


a. Accuracy (for balanced data)
   i. Number of correct predictions / number of total predictions.
   ii. It is useful when target classes are well balanced.
b. Recall (for unbalanced data)
   i. Recall = True Positives / (True Positives + False Negatives).
   ii. Finds all relevant cases within a dataset.
c. Precision
   i. Precision = True Positives / (True Positives + False Positives).
   ii. Identifies only the relevant data points.
d. F1-Score
   i. To find an optimal blend of precision and recall, we can combine the two metrics using what is called the F1 score:
      F1 = 2 * (Precision * Recall) / (Precision + Recall)
   ii. The harmonic mean (in the case of F1) is used instead of a simple average because it punishes extreme values.

A small sketch computing these metrics follows.
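A small sketch computing the metrics above with scikit-learn; the toy labels are made up for illustration.

from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]   # actual labels
y_pred = [1, 0, 0, 1, 0, 1]   # model predictions

print('Accuracy :', accuracy_score(y_true, y_pred))   # correct / total
print('Recall   :', recall_score(y_true, y_pred))     # TP / (TP + FN)
print('Precision:', precision_score(y_true, y_pred))  # TP / (TP + FP)
print('F1       :', f1_score(y_true, y_pred))         # harmonic mean of precision and recall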

Confusion Matrix

                               Predicted Condition
                               Prediction Positive               Prediction Negative
True Condition
  Condition Positive           True Positive                     False Negative (Type 2 Error)
  Condition Negative           False Positive (Type 1 Error)     True Negative

Model Evaluation
The question is whether the model should focus on minimizing False Negatives or False Positives.
In the case of medical disease diagnosis, it's better to lean towards False Positives and minimize False Negatives.

8. Evaluating Performance – Regression Error Metrics

Regression Task is done for continuous variables.

9. Accuracy for continuous values can be checked with:

a. Mean Absolute Error
   i. This is the mean of the absolute value of the errors.
   ii. Drawback – this doesn't punish large errors.

b. Mean Squared Error
   i. The mean of the squared errors, so larger errors are punished more.

c. Root Mean Square Error
   i. The root of the mean of the squared errors, which keeps the units of the original target.

Python Machine Learning Commands

1. model.predict_proba() – For classification problems, some estimators also provide this method, which returns the probability that a new observation has each categorical label. In this case, the label with the highest probability is the one returned by model.predict().

2. model.score() – For classification or regression problems, most estimators implement a score method. Scores are between 0 and 1, with a larger score indicating a better fit.

3. model.predict(X_test) – Predict labels in clustering algorithms in unsupervised estimators.

4. model.transform() – Given an unsupervised model, transform new data into the new basis. This accepts one argument X_new, and returns the new representation of the data based on the unsupervised model.

5. model.fit_transform() – Some estimators implement this method, which more efficiently performs a fit and a transform on the same input data, in the case of unsupervised learning.

A short sketch of these methods follows.
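A brief sketch of these estimator methods on made-up data, assuming scikit-learn is available; the dataset and model choices are illustrative only.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[2.5]]))   # probability of each class label
print(clf.predict([[2.5]]))         # label with the highest probability
print(clf.score(X, y))              # mean accuracy between 0 and 1

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # fit and transform in one step (unsupervised)
print(scaler.transform([[2.5]]))    # transform new data into the same basis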

Models Work

Training and Testing Data


from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

Training the Model


from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train, y_train)

Print out the coefficients of the model

# The coefficients
print('Coefficients: \n', lm.coef_)

Predicting Test Data


lm.predict(X_test)

Learning
a. X_train -> The training features on which the model is trained.
b. y_train -> The labelled values provided to train the model.
c. X_test -> These values are used for model prediction.
d. y_test -> These values are used to validate the model's predictions.

Key point -> If the values predicted from X_test and the y_test values fall on the same straight line, then linear regression is a good fit. This can be checked with a scatter plot of the predicted values against y_test.

Another way to validate the model selection is by preparing a distplot of the residuals, which can be done by plotting y_test – predictions. If the result is normally distributed, it reaffirms that the model selection was right. A sketch of both checks follows.
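A small sketch of both checks, assuming y_test and predictions already exist from the code above; seaborn's newer histplot is used here in place of the older distplot.

import matplotlib.pyplot as plt
import seaborn as sns

# Predicted vs actual: points along a straight line suggest a good linear fit
plt.scatter(y_test, predictions)
plt.xlabel('y_test')
plt.ylabel('Predicted values')

# Residuals: roughly normal and centred on zero supports the linear model
sns.histplot(y_test - predictions, kde=True)
plt.show()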

Model Evaluation (Metrics)

from sklearn import metrics

metrics.mean_absolute_error(y_test, predictions)
metrics.mean_squared_error(y_test, predictions)
np.sqrt(metrics.mean_squared_error(y_test, predictions))

print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))

metrics.explained_variance_score(y_test, predictions) – Returns the explained variance score (closer to 1 means the model explains more of the variance).
Boston DataSet Working

from sklearn.datasets import load_boston   # note: removed in recent scikit-learn versions

boston = load_boston()
boston.keys()

Bias Variance Trade Off

o Bias refers to the gap between the predicted value set and the actual value set (i.e. y_test).

o Variance is how scattered the predicted values are among each other. The closer they are to each other, the lower the variance.

The bias-variance trade-off is there to understand model performance. It is the point beyond which we are adding noise by adding model complexity. (To be evaluated.)

The training error goes down as it has to, but the test error starts to go up.

The model, after the bias trade-off point, begins to overfit.

Overfitting and Under fitting

When a model tries to fit each and every data point of the data set, that is called overfitting – it happens when too many features are selected to predict the actual data points.

For example, one tries to identify a ball with too many features, rather than just a simple attribute like its radius.

Underfitting is when the predicted values on the line are far apart from the actual values. It happens when too few features are chosen to predict the line values, hence the predicted values end up far from the line.

The best fit is in between underfitting and overfitting.

Link - https://www.youtube.com/watch?v=viV_53s97Nw
1. Linear Regression Theory

This explains the phenomenon that subsequent values generally tend to move towards the mean.

The goal is to minimize the vertical distance between all the data points and our line. So in determining the best line, we are attempting to minimize the distance between all the points and the line.

The next predicted value is closer to the mean of all the values.

One method is the Least Squares method (the classic regression method), which fits the line by minimizing the sum of squares of the residuals. The residual for an observation is the difference between the observation (the y value) and the fitted line.

Example – prediction of weather or prices.

Columns having text data are of no use for regression as-is, hence we either drop them or transform them to build categories (a get_dummies sketch follows).
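A minimal sketch of turning a text column into categorical dummy columns; the column name and values are made up for illustration.

import pandas as pd

df = pd.DataFrame({'city': ['Delhi', 'Mumbai', 'Delhi'], 'sales': [10, 12, 9]})

# One-hot encode the text column; drop_first avoids a redundant dummy column
df = pd.get_dummies(df, columns=['city'], drop_first=True)
print(df)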

a. It ventures to establish the relationship of the dependent variable with the explanatory independent variables. It assumes that all the data points would be plotted around a straight line (the line of best fit). Also, any change in the independent variables would impact the dependent variable (increase or decrease). An increase is called a positive slope and a decrease is called a negative slope.

b. Linear Regression is used to predict future values based on the historical data set.

c. Linear Regression works only with numerical values. Text or categorical values are of no use as-is.

d. Under Linear Regression, we have independent variables and dependent variables. The independent variables (or features) are combined to produce the value of the dependent variable. For example, sales revenue is a dependent variable, while marketing cost, shipment cost, product cost and number of employees are examples of independent variables.

e. Independent variables can be discrete or continuous, while the dependent variable is continuous.

f. In a graph, independent variables are plotted on the X axis and the dependent variable on the Y axis.

g. It can have a single independent variable (called Simple Linear Regression) or multiple independent variables (Multiple Linear Regression).

h. In the case of multiple linear regression, all the independent variables won't contribute equally in the equation that forms the resultant dependent variable. We have coefficients, which are multiplied with the independent features based on their importance. For example, in temperature prediction, the current temperature and wind flow would be more important (have higher weightage) than a feature like the region's population count.

i. The Linear Regression model derives the coefficient value (weightage) of the independent variables from the training set of past data (see the coefficient sketch below).
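A short sketch of inspecting the learned coefficients, assuming a fitted model lm and a feature DataFrame X as in the earlier commands; the layout is one common way of presenting them.

import pandas as pd

# One row per feature: how much the prediction changes per unit change in that feature
coef_df = pd.DataFrame(lm.coef_, index=X.columns, columns=['Coefficient'])
print(coef_df)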

2. Logistic Regression
Theory
It is a classification algorithm, used to predict categorical values.

a. Logistic Regression is used to predict the probability of a certain event.

We learned –
1. A heatmap can visualize which column data is filled or empty.
2. Fill empty values.
3. Pass multiple values to a function.
4. Drop null values / columns:
   df.dropna(inplace=True)                        # drop rows that contain nulls
   df.drop('Column_Name', axis=1, inplace=True)   # drop a column entirely
5. get_dummies for data transformation
   pd.get_dummies
6. Remove the textual data (drop function).

Practical Commands
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=101)

from sklearn.linear_model import LogisticRegression

logmodel = LogisticRegression()

logmodel.fit(X_train, y_train)

predictions = logmodel.predict(X_test)

from sklearn.metrics import classification_report


print(classification_report(y_test,predictions))

from sklearn.metrics import confusion_matrix


print(confusion_matrix(y_test,predictions))

It has 2 types:
1. Binary classification – two values to choose from (Yes or No). E.g. whether an email is spam or not.
2. Multi-class classification – multiple values. E.g. which party will win the election in India.

Linear regression isn't a good fit for binary and multi-class classification problems, hence we need logistic regression. It uses the sigmoid function, which maps any value to a number between 0 and 1.
σ(z) = 1 / (1 + e^(-z))

Since e^(-z) is always positive, the denominator is always greater than 1, so dividing 1 by it always results in a value between 0 and 1.

We can set a cut-off point at 0.5: any value above 0.5 results in class 1, and any value below 0.5 leads to class 0 (a small sketch follows).
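A tiny sketch of the sigmoid and the 0.5 cut-off, written in plain NumPy for illustration; scikit-learn's LogisticRegression does this internally.

import numpy as np

def sigmoid(z):
    # Always returns a value between 0 and 1
    return 1 / (1 + np.exp(-z))

z_values = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
probs = sigmoid(z_values)
classes = (probs >= 0.5).astype(int)   # cut-off at 0.5
print(probs, classes)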

One can use confusion matrix to evaluate the model performance.


3. KNN (K Nearest Neighbor) Algorithm

Theory
It is a classification algorithm. It predicts the class of a new value based on the classes of the nearest data points. It is the simplest classification algorithm, but it depends on the k value chosen (the number of neighbouring data points considered).

KNN, as the name implies, tries to associate the new data point with the surrounding class data points. It can be used to classify a data point into 2 or more classes.

Pros

1. Easy to use
2. Training is trivial
3. Works with any number of classes
4. Easy to add more data
5. Few Parameters
a. K
b. Distance Metric

Cons

1. High Prediction cost


2. Not good with high dimensionality data
3. Categorical features don’t work well

Steps
1. Import Libraries
2. Read data from the file
3. EDA
4. Standardize the variables
5. Train Test Split
6. Using KNN model
7. Predictions and evaluations
8. Predicting K Value
9. Retrain with new K Value
Practical Command

from sklearn.preprocessing import StandardScaler


scaler = StandardScaler()
scaler.fit(df.drop('TARGET CLASS',axis=1))

scaled_features = scaler.transform(df.drop('TARGET CLASS',axis=1))


df_feat = pd.DataFrame(scaled_features, columns=df.columns[:-1])

from sklearn.neighbors import KNeighborsClassifier


knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)

------------- Import Classification Model


from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_test, pred))
print(confusion_matrix(y_test, pred))

------ Predict Best K Value -------------------------

error_rate = []

for i in range(1, 40):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))

-------------- Plot Error Rate to find out best K Value ----------------


sns.set_style('whitegrid')
plt.figure(figsize=(10,6))
#plt.plot(range(1,40),error_rate)
plt.plot(range(1,40),error_rate,color='b',ls='--',marker='o')
plt.xlabel('K')
plt.ylabel('Error Rate')
plt.title('Error Rate vs. K Value')
4. Decision Trees and Random Forests

Section 19 - Tree Methods

a) Under this, the model builds a decision tree, in which different parameters are identified, and based on the values of those parameter conditions, the model provides predictions. For example, a particular year's sales would depend on marketing budget, people employed and R&D budget.

In this tree, we have nodes and edges. Nodes are the parameters and edges are the decisions or the possibilities of the nodes. A tree has a root node and leaf nodes.

Model Library

from sklearn.tree import DecisionTreeClassifier


dtree= DecisionTreeClassifier()

Random Forests

b) To improve performance, one can use many decision trees with a random sample of features chosen at each split. What happens in a Random Forest is that it chooses a new random sample of features at every single split, and the samples are chosen with replacement. Hence, in a Random Forest of decision trees, the model picks features randomly at each split, which de-correlates the trees and avoids different trees relying on the same dominant features.

Model efficiency – for classification, m (the number of features considered at each split) is typically chosen as the square root of p (the total number of features).

Model Library
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=200)
Its accuracy is generally much higher than a single Decision Tree (a fit/predict sketch follows).
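A brief sketch of fitting and evaluating both tree models, assuming X_train/X_test/y_train/y_test come from an earlier split; the variable names are illustrative.

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

dtree = DecisionTreeClassifier()
dtree.fit(X_train, y_train)
dtree_pred = dtree.predict(X_test)

rfc = RandomForestClassifier(n_estimators=200)
rfc.fit(X_train, y_train)
rfc_pred = rfc.predict(X_test)

# Compare the two models on the same test set
print(classification_report(y_test, rfc_pred))
print(confusion_matrix(y_test, rfc_pred))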

Unbalanced data sets used for labelling play a major role in the model prediction.

5) SVM Theory (Support Vector Machines) – Section 20

This is used for classification of data points into separate classes. This is done by drawing a line called a hyperplane which separates the data points into their respective classes. The data points that the margin lines touch are known as Support Vectors.

This can also be applied to non-linearly separable data through the "kernel trick", which projects the data into a higher-dimensional space (e.g. a 3-D plane).
If gamma is small, the model tends toward high bias and low variance; if gamma is large, toward low bias and high variance.

We need to use grid search for parameter adjustment, mainly C and gamma.

Model Commands

from sklearn.svm import SVC

model = SVC()

In the SVC model there are 3 main parameters: C, gamma and kernel.

C controls the cost of misclassification on the training data. A large C value gives low bias and high variance: low bias because misclassification is heavily penalized.

Kernel values can be –

a. rbf (Radial Basis Function) – the default.

b. Gamma parameter – a large gamma value gives low bias and high variance (and a small gamma the opposite).

Learning Param Grid (this is done to let the model find the parameters that suit it best)

from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.1, 1, 10, 100, 1000], 'gamma': [1, 0.1, 0.01, 0.001, 0.0001]}

grid = GridSearchCV(SVC(), param_grid, verbose=3)
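A short follow-on sketch, assuming the training split from earlier; GridSearchCV refits the best combination by default.

# Fit runs cross-validation over every C/gamma combination and refits the best one
grid.fit(X_train, y_train)

print(grid.best_params_)       # best C and gamma found
grid_predictions = grid.predict(X_test)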


Appendix

A. Useful Commands
1. Check isnull (Blank Data)

sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')

[features for features in df.columns if df[features].isnull().sum()>0]

2. Read Data in pandas with Encoding –


df = pd.read_csv('C:\\Personal\\Learning\\Python\\Udemy Course - Python\\Datasets\\Zomato\\Zomatodataset\\zomato.csv', encoding='ISO-8859-1')

3. Get Columns names who have null values


[features for features in df.columns if df[features].isnull().sum()>0]

4. Merge two datasets


final_df = pd.merge(df, df_country, on='Country Code', how='left')

5. Get Column Names of dataset having value counts


final_df.Country.value_counts().index

Get Values
final_df.Country.value_counts().values

6. Pie Plot
plt.pie(country_val, labels=country_names, autopct="%1.2f%%")

7. Get Group by on specific Column or Columns

final_df.groupby(by=['Aggregate rating', 'Rating color', 'Rating text']).size().reset_index().rename(columns={0: 'Rating Count'})

8. Append one dataset with another


df = df_train.append(df_test)   # in newer pandas: pd.concat([df_train, df_test])
9. Map Column name from another dictionary

dict1 = {0 : 'malignant',
1: 'benign'}

cancer_df['target_name'] = cancer_df['target_class'].map(dict1)

10. Tree Visualization
from IPython.display import Image
from io import StringIO   # older course code imported StringIO from sklearn.externals.six (now removed)
from sklearn.tree import export_graphviz
import pydot

features = list(df.columns[1:])
features

dot_data = StringIO()
export_graphviz(dtree,
out_file=dot_data,feature_names=features,filled=True,rounded=True)

graph = pydot.graph_from_dot_data(dot_data.getvalue())
Image(graph[0].create_png())

11. Import Image URL

from IPython.display import Image


url = 'http://upload.wikimedia.org/wikipedia/commons/4/41/Iris_versicolor_3.jpg'
Image(url,width=300, height=300)
B. Pandas Documentation
https://pandas.pydata.org/docs/
C. Steps for EDA
I. Missing Values
II. Explore about numerical variables
III. Explore categorical variables
IV. Finding relationship between features
V.
