7 Practicals With Python Practice With Data Science Cookbook
1. Introduction
Data science sits at the intersection of several fields, including data mining, machine learning, and
statistics, to name a few. It has penetrated deeply into our connected world, and there is
growing demand in the market for people who not only understand data science algorithms thoroughly but
are also capable of programming them. Treating these algorithms as black boxes and using them
in decision-making systems can lead to counterproductive results. With countless algorithms and innumerable
problems out there, a good grasp of the underlying algorithms is required to choose the best one for
any given problem. Python as a programming language has evolved over the years, and today it is the number
one choice for data scientists.
• Its ability to act as a scripting language for quick prototype building, and
• its sophisticated language constructs for full-fledged software development, combined with its fantastic
library support for numeric computation, have led to its current popularity among data scientists and
the general scientific programming community.
• Beyond that, Python is also popular among web developers, thanks to frameworks such as Django
and Flask.
This book offers carefully crafted recipes that touch upon the different aspects of data science, including data exploration,
data analysis and mining, machine learning, and large-scale machine learning.
The first chapter introduces Python data structures and functional programming concepts. The early chapters
cover the basics of data science, and the later chapters are dedicated to advanced data science algorithms.
State-of-the-art algorithms currently used in practice by leading data scientists across industries,
including ensemble methods, random forests, regression with regularization, and others, are covered in
detail. Some algorithms that are popular in academia but not yet widely adopted in the mainstream,
such as rotation forest, are also covered in detail.
The recipes strike the right mix of the mathematical philosophy behind the data science algorithms and implementation
details. With each recipe, just enough mathematical introduction is provided to understand how the algorithm
works, so that you can take full advantage of these methods in applications.
Part 2 Data Analysis: exploration and more (wrangle and deep dive)
• Covers data preprocessing and transformation routines for performing exploratory data analysis tasks
in order to efficiently build data science algorithms. Introduces the concept of dimensionality
reduction (from simple methods to advanced state-of-the-art techniques) in order to tackle the
curse of dimensionality in data science.
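The dimensionality-reduction idea mentioned above can be sketched by performing a principal component analysis by hand with NumPy; the toy data and the choice of two components are invented for illustration, not taken from the book:

```python
import numpy as np

# Toy data: 6 samples, 3 correlated features (values invented for illustration)
X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.9],
              [2.2, 2.9, 0.4],
              [1.9, 2.2, 0.8],
              [3.1, 3.0, 0.2],
              [2.3, 2.7, 0.6]])

# Center the data, then eigendecompose its covariance matrix
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order

# Keep the two directions of largest variance and project onto them
order = np.argsort(eigvals)[::-1][:2]
components = eigvecs[:, order]
X_reduced = X_centered @ components

print(X_reduced.shape)   # (6, 2): three features reduced to two
```

The same projection is what library implementations such as scikit-learn's PCA compute, with extra numerical care.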
Part 3 Data Mining: Needle in a haystack
• Discusses unsupervised data mining techniques, starting with elaborate discussions of distance
methods and kernel methods, and following up with clustering and outlier detection techniques.
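To make the distance-and-kernel discussion concrete, here is a minimal plain-Python sketch of a Euclidean distance and an RBF (Gaussian) kernel; the points and the `gamma` value are invented for the example:

```python
import math

def euclidean(p, q):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def rbf_kernel(p, q, gamma=0.5):
    """RBF kernel: similarity that decays with squared distance."""
    return math.exp(-gamma * euclidean(p, q) ** 2)

a, b = (0.0, 0.0), (3.0, 4.0)
print(euclidean(a, b))    # 5.0
print(rbf_kernel(a, a))   # 1.0: identical points are maximally similar
```

Distance functions like these underpin both the clustering and the outlier detection recipes: points are grouped, or flagged, by how far they sit from one another.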
Part 4 Machine Learning: supervised, regression, ensemble, tree-based, and perceptron/stochastic gradient
descent
• Covers supervised data mining techniques, including nearest neighbors, Naïve Bayes, and
classification trees. In the beginning, we lay a heavy emphasis on data preparation for
supervised learning.
• Introduces regression problems and follows up with topics on regularization, including LASSO
and ridge. Finally, we discuss cross-validation techniques as a way to choose hyperparameters
for these methods.
• Introduces various ensemble techniques, including bagging, boosting, and gradient boosting. This
chapter shows you a powerful state-of-the-art approach in data science where, instead of
building a single model for a given problem, an ensemble or a bag of models is built.
• Introduces further bagging methods based on tree-based algorithms. Due to their robustness to
noise and universal applicability to a variety of problems, they are very popular in the data science
community.
• Online Learning: covers large-scale machine learning and algorithms suited to tackling such large-scale
problems. This includes algorithms that work with streaming data and data that cannot fit
into memory completely (the perceptron and stochastic gradient descent). Several types of linear
algorithms, including logistic regression, linear regression, and linear SVM, can be accommodated
in this framework.
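The online-learning idea above can be sketched with a plain-Python stochastic gradient descent for a one-feature linear model. The data, learning rate, and epoch count are invented for illustration; the key property is that each example is processed one at a time, as a streaming algorithm would, so the full dataset never needs to fit in memory:

```python
import random

random.seed(0)

# Streaming-style data: y = 2x + 1 plus a little noise (invented)
data = [(x, 2.0 * x + 1.0 + random.gauss(0, 0.1))
        for x in (random.uniform(-1, 1) for _ in range(200))]

w, b = 0.0, 0.0   # model parameters
lr = 0.1          # learning rate (assumed for this sketch)

for epoch in range(20):
    random.shuffle(data)
    for x, y in data:            # one example at a time
        err = (w * x + b) - y    # prediction error on this single example
        w -= lr * err * x        # gradient step for squared loss
        b -= lr * err

print(round(w, 1), round(b, 1))   # close to the true 2.0 and 1.0
```

The same per-example update rule, with a different loss, yields the perceptron, logistic regression, and linear SVM variants mentioned above.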
d. How it works
Let’s start by including the NumPy library:
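In Python this is a single import statement; a tiny array (values invented for illustration) confirms the library is loaded:

```python
import numpy as np

# A small array to confirm NumPy is available (values invented for illustration)
a = np.array([1, 2, 3])
print(a.mean())   # 2.0
```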
e. Additional information
You can refer to the following link for some excellent NumPy documentation:
http://www.numpy.org
f) Connections to other recipes: See also
Plotting with matplotlib recipe in Chapter 3, Analyzing Data - Explore & Wrangle
Machine Learning with Scikit Learn recipe in Chapter 3, Analyzing Data - Explore & Wrangle
7. Imputing the data
8. Performing random sampling
9. Scaling the data
10. Standardizing the data
11. Performing tokenization
12. Removing stop words
13. Stemming the words
14. Performing word lemmatization
15. Representing the text as a bag of words
16. Calculating term frequencies and inverse document frequencies
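Recipes 15 and 16 above can be sketched in plain Python: a minimal bag-of-words plus TF-IDF computation over an invented two-document corpus, using the common log(N / df) weighting (real libraries such as scikit-learn apply additional smoothing):

```python
import math
from collections import Counter

docs = ["the cat sat on the mat",
        "the dog sat on the log"]   # invented corpus for illustration

# Bag of words: per-document term counts
bags = [Counter(doc.split()) for doc in docs]

# Document frequency: number of documents containing each term
vocab = {word for bag in bags for word in bag}
df = {w: sum(1 for bag in bags if w in bag) for w in vocab}

N = len(docs)

def tfidf(word, bag):
    tf = bag[word] / sum(bag.values())   # term frequency within one document
    idf = math.log(N / df[word])         # inverse document frequency
    return tf * idf

# 'cat' appears in only one document, so it gets a positive weight there;
# 'the' appears in every document, so its idf (and hence tf-idf) is zero
print(tfidf("cat", bags[0]) > 0)   # True
print(tfidf("the", bags[0]))       # 0.0
```

This illustrates why TF-IDF downweights ubiquitous words: terms present in every document carry no discriminative information, while rare terms are boosted.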