Hands-on Data Analysis and Visualization with Pandas: Engineer, Analyse and Visualize Data, Using Powerful Python Libraries
About this ebook
This book will help you learn Python data structures and the essential concepts required for data engineering, such as functions, lambdas, list comprehensions, and datetime objects. It also covers Python data science packages in depth: JupyterLab as an IDE for writing, documenting, and executing Python code; NumPy for numerical computation; and Pandas for cleaning and reorganizing data, handling large datasets, and merging dataframes to get meaningful insights. You will use statistics with SciPy to understand the relationships between variables, and build visualization charts with the Matplotlib and Seaborn libraries.
Hands-on Data Analysis and Visualization with Pandas - PURNA CHANDER RAO. KATHULA
CHAPTER 1
Introduction to Data Analysis
Data analysis is both an art and a science: the practice of extracting insights from silos of data. This chapter introduces data and its ecosystem components, the stages of the data analysis process, the reasons Python is useful for data analysis, and the main data science libraries and modules along with their installation process.
Structure
Inspiration for data analysis
What is data science?
Domain expertise
Maths and statistics
Artificial intelligence
Machine learning
Data infrastructure
Data analysis process
Business requirements
Data collection
Data cleansing
Data exploration and visualization
Data modeling
Model validation and testing
Deployment
Why Python for data analysis?
Python libraries for data analysis
Objective
This chapter will guide you through the different processes of data analysis and the concepts, such as maths and statistics, that make up this discipline. The concepts covered here are a preview of the coming chapters, where they will be applied as Python code using different data-related libraries.
Inspiration for data analysis
In this section, we cover the factors and trends that influence data analysis. In today's digital world, a huge amount of data is produced by IoT devices such as sensors, by diagnostic reports from the healthcare and wellness industry, by social network portals such as Facebook, YouTube, LinkedIn, and Instagram, and by e-commerce sites such as Alibaba, Amazon, and Flipkart. You generate data whenever you post audio, video, or a comment, add a like or an emoji, make a bank transaction online, withdraw money at an ATM kiosk, or buy something on an e-commerce site.
Raw data of this kind is not yet useful information. Information is the result of processing a certain set of data to extract conclusions that can be used in different ways. This process of extracting information from raw data is data analysis, and it becomes the foundation for building predictive models or drawing visualization charts around the data.
Without Big data and analytics, companies are blind and deaf, wandering on to the web like deer on a freeway.
- Geoffrey Moore, author and consultant
What is data science?
Data science is the study of data. It is a multidisciplinary field that involves maths, statistics, algorithms, domain expertise, processes, and systems to extract insights from data. This data may be structured, semi-structured, or unstructured. Figure 1.1 shows the different structures of data:
Figure 1.1
Structured data
Tabular rows and columns (Databases)
Data warehouses (DWH, such as Teradata systems) and BI systems
Text files such as comma-separated (.csv), tab-separated (.tsv).
Semi-structured data
Excel, XML, JSON, Logs.
Unstructured data
Audio, Video, Images.
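As a rough illustration of the difference between the first two categories, the sketch below parses a small structured sample (CSV) and a small semi-structured sample (JSON) using only Python's standard library. The sample records and field names are made up for this example.

```python
import csv
import io
import json

# Structured data: every row has the same columns (hypothetical sample).
csv_text = "name,age,city\nAsha,31,Hyderabad\nRavi,45,Pune\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured data: records share a general shape, but fields can be
# nested or missing (hypothetical sample).
json_text = (
    '[{"name": "Asha", "contact": {"email": "asha@example.com"}},'
    ' {"name": "Ravi"}]'
)
records = json.loads(json_text)

for row in rows:
    print(row["name"], row["age"], row["city"])

for record in records:
    # .get() copes with fields that are absent from some records.
    email = record.get("contact", {}).get("email", "unknown")
    print(record["name"], email)
```

Note that the CSV reader returns every value as a string, so the age column would still need converting to a number before any calculation.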
Domain expertise
Domain expertise, or domain knowledge, is expertise in a particular field such as healthcare, insurance, or banking. A domain expert may or may not have a technology background but has in-depth knowledge of a particular industry and of the trends and practices that affect it. Data analysis requires not only good expertise in tools and computational techniques but also a good understanding of the data. In short, the data analyst must know how to search not only for data but also for information, and how to treat that information to get valid insights from it.
For example, suppose you are asked to build an application for the e-commerce, banking, or insurance domain. The application has to fit the industry and its various dimensions. The technical team would not know the industry norms or the features the application needs; this is where the domain expert and domain knowledge come into the picture.
Maths and statistics
This is the study of statistics from a mathematical point of view. Data analysis requires a good amount of maths. Good knowledge of statistics is also required, because statistical methods are applied to the analysis and interpretation of the data. Python provides many libraries for solving these mathematical and statistical problems, but you should have a good idea of how the libraries work.
Artificial intelligence
Artificial intelligence is the intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans. Artificial intelligence can be seen as a superset of data science and is one of the advanced concepts in data analysis. It is the study of training computers to do jobs that are normally done by humans. The term combines two words: artificial, meaning human-made rather than natural, and intelligence, meaning the ability to think or understand.
AI is already widespread, and you interact with it on a daily basis. Here are a few examples of artificial intelligence:
Search engines like Google internally use gigantic algorithms to perform a better search.
Self-driving cars where the vehicles can completely navigate their way from one point to another.
Chatbots help as online messengers to assist customers immediately and effectively.
Voice searches on smartphones use AI to determine the best result for those long-tail keywords and conversational queries.
Online Ads use AI to target specific customers based on past behavior, interest, and search queries.
Machine learning
Machine learning is an algorithm-driven field of study that makes computers capable of learning from their own previous experience and improving their performance on a task. It is a subset of artificial intelligence in which machines learn by themselves without being explicitly programmed. Suppose you are asked to write a program for speech recognition software that converts speech to text while accounting for accent, grammar, pronunciation, and vocabulary. Written by hand, this would be a gigantic task; machine learning makes it far more tractable.
Machine learning is commonly divided into three types, explained as follows:
Supervised learning
In supervised learning, we ask the machine questions, compare its answers with the actual answers, and instruct the machine to minimize the errors. Supervised machine learning can be used for tasks such as:
Weather forecasting.
Detecting online frauds.
Market forecasting.
Image classification.
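The idea of comparing the machine's answers with the actual answers and minimizing the error can be sketched in a few lines. The example below is a minimal illustration, not a method from this book: it fits a single weight by gradient descent on made-up data, and the data, learning rate, and iteration count are all assumptions chosen for the sketch.

```python
# Hypothetical training data: inputs with their known ("actual") answers.
# The underlying relationship here is y = 2 * x.
inputs = [1.0, 2.0, 3.0, 4.0]
actual = [2.0, 4.0, 6.0, 8.0]


def mean_squared_error(weight):
    """Average squared gap between the model's answers and the actual ones."""
    return sum((weight * x - y) ** 2 for x, y in zip(inputs, actual)) / len(inputs)


weight = 0.0          # the model starts out knowing nothing
learning_rate = 0.01  # step size, chosen by hand for this sketch
initial_error = mean_squared_error(weight)

for _ in range(500):
    # Gradient of the mean squared error with respect to the weight.
    gradient = sum(2 * (weight * x - y) * x for x, y in zip(inputs, actual)) / len(inputs)
    # Step against the gradient to reduce the error.
    weight -= learning_rate * gradient

final_error = mean_squared_error(weight)
print(round(weight, 3), final_error < initial_error)
```

Each pass nudges the weight in the direction that shrinks the error, so the learned weight moves toward the value that generated the data.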
Unsupervised learning
In unsupervised learning, you give the machine large amounts of unlabelled data and instruct it to find patterns; based on these patterns, the machine accomplishes certain tasks. Unsupervised machine learning can be used for tasks such as:
Build recommendation engines
Targeted marketing
Customer segmentation
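As a small, hedged illustration of pattern finding without labels, the sketch below groups made-up customer spend figures into two segments using a one-dimensional version of k-means clustering. The figures, the choice of two groups, and the starting centres are assumptions of this example.

```python
# Hypothetical annual spend per customer; no labels are provided.
spend = [12, 15, 14, 90, 95, 88, 13, 92]

# Start with two rough guesses for the group centres (an assumption of this sketch).
centres = [min(spend), max(spend)]

for _ in range(10):
    # Assign each customer to the nearest centre.
    groups = [[], []]
    for value in spend:
        nearest = min(range(len(centres)), key=lambda i: abs(value - centres[i]))
        groups[nearest].append(value)
    # Move each centre to the mean of its group.
    centres = [sum(group) / len(group) for group in groups]

print(groups)
print(centres)
```

The machine is never told which customers are low or high spenders; the two segments emerge from the data alone.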
Reinforcement learning
In reinforcement learning, the machine is placed in an environment where something is happening. It receives a reward when it does what we want and a penalty when it performs incorrectly. We instruct the machine to maximize the reward, and eventually it learns to do what we want. Reinforcement learning is used in areas such as:
Games
Bidding and advertising
Training self-driving cars
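The reward-and-penalty loop can be sketched with a toy environment. The example below is a minimal illustration using an epsilon-greedy strategy, which is one simple choice among many; the two actions, the reward values, and the exploration rate are all hypothetical.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable


def reward_for(action):
    """Hypothetical environment: "right" is the behaviour we want."""
    return 1 if action == "right" else -1  # reward vs. penalty


actions = ["left", "right"]
value = {a: 0.0 for a in actions}  # the machine's running estimate per action
count = {a: 0 for a in actions}
epsilon = 0.1                      # how often to explore at random

for _ in range(200):
    if rng.random() < epsilon:
        action = rng.choice(actions)          # explore
    else:
        action = max(actions, key=value.get)  # exploit the best estimate
    reward = reward_for(action)
    count[action] += 1
    # Incremental average of the rewards seen for this action.
    value[action] += (reward - value[action]) / count[action]

best = max(actions, key=value.get)
print(best, value)
```

Nothing tells the machine which action is correct; it discovers the rewarded action by trying both and keeping track of what each one earns.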
Data infrastructure
Generally, people use the word infrastructure for the things that support what they are doing at work. For example, roads used for transportation, sewage systems, and bridges are all considered infrastructure. The role of data infrastructure is to protect, preserve, process, move, secure, and serve data, as well as the applications that deliver information services. Data infrastructure includes software, hardware, cloud or managed services, servers, storage, and so on.
The Big Data world generates a humongous amount of information that needs to be processed. Sometimes ordinary desktop systems or servers don't have enough computational power to read, process, or analyze it. We need systems with plenty of RAM and enough disk space to store the data. Cloud platforms such as Amazon Web Services (AWS), Google Cloud Platform (GCP), and Azure help us meet these challenges through resource allocation and virtualization.
Data analysis process
Data analysis is a series of steps in which raw data is transformed and processed in order to produce insights and make predictions. The processing includes mathematical and statistical approaches, as well as charts or graphs for data visualization. Data analysis can therefore be represented as a process chain consisting of the sequence of stages shown in Figure 1.2:
Figure 1.2
Let's discuss these processes in detail.
Business requirements
Data analysis starts with a problem to be solved, which needs to be clearly defined: for example, predicting the stock price of a company, identifying fraudulent credit card transactions, or detecting tumors from health data.
Data collection
The data must be chosen with the basic purpose of building a predictive model. This is often the most tedious task, because to analyze anything we first need data. Clients mostly share data as comma-separated, tab-delimited, or pipe-delimited files. Not all data is available in files or databases, however; some exists only as HTML pages, and the process of collecting data from web pages is called web scraping. Python libraries such as Scrapy, Beautiful Soup, and Requests help in scraping data from web pages.
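To give a feel for what scraping involves, the sketch below extracts table cells from an inline HTML string. It deliberately uses Python's built-in html.parser module rather than the scraping libraries named above, so that it runs without extra installation; the page content and the TableCellParser class are made up for this example.

```python
from html.parser import HTMLParser

# Hypothetical page content; in practice this string would come from an
# HTTP response rather than being written inline.
html = """
<table>
  <tr><td>Laptop</td><td>55000</td></tr>
  <tr><td>Phone</td><td>20000</td></tr>
</table>
"""


class TableCellParser(HTMLParser):
    """Collects the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])      # a new row starts
        elif tag == "td":
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._in_cell:
            self.rows[-1].append(data.strip())


parser = TableCellParser()
parser.feed(html)
print(parser.rows)
```

Dedicated scraping libraries wrap this kind of tag-by-tag work behind more convenient search and selection features.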
Data cleansing
This stage seems less problematic but requires more resources and time to complete. The collected data may come from different sources, such as Excel, CSV, JSON, or Parquet files, or from a scraped web page, and each source may represent data differently: a date field might arrive as a string, or an integer might be read as a float. All of this data needs to be cleaned before analysis. Cleansing includes handling invalid data, ambiguous or missing values, and outliers.
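The kinds of fixes described above can be sketched with plain Python. The records below are hypothetical, and the rules applied (ISO date format, empty string meaning missing) are assumptions of this example rather than general conventions.

```python
from datetime import datetime

# Hypothetical raw records: dates as strings, a count read as float,
# and a missing amount.
raw = [
    {"date": "2020-01-15", "items": 3.0, "amount": "250.50"},
    {"date": "2020-01-16", "items": 1.0, "amount": ""},
    {"date": "2020-01-17", "items": 2.0, "amount": "99.00"},
]

cleaned = []
for record in raw:
    cleaned.append({
        # String -> date object.
        "date": datetime.strptime(record["date"], "%Y-%m-%d").date(),
        # Float that is really a whole number -> int.
        "items": int(record["items"]),
        # Empty string -> None so the missing value is explicit.
        "amount": float(record["amount"]) if record["amount"] else None,
    })

# List the dates whose amount is missing, ready for follow-up.
missing = [r["date"].isoformat() for r in cleaned if r["amount"] is None]
print(cleaned[0])
print(missing)
```

Whether a missing value should then be dropped, filled in, or flagged depends on the analysis; the sketch only makes the gap visible.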
Data exploration and visualization
Exploration is the process of using graphical and statistical representations to find patterns, connections, and relationships between variables in the data. Python libraries such as Matplotlib and Seaborn help us visualize the data. Chart types such as heatmaps, box plots, violin plots, and scatter plots help us understand patterns, outliers, and relationships better. Exploration also includes one or more of the following activities:
Grouping the data
Summarizing the data
Construction of regression models to find the deviation of data
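Grouping and summarizing, the first two activities listed above, can be sketched with the standard library alone. The sales figures and region names are made up for this example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sales records as (region, amount) pairs.
sales = [("North", 120), ("South", 80), ("North", 100), ("South", 90), ("East", 50)]

# Grouping: collect the amounts for each region.
by_region = defaultdict(list)
for region, amount in sales:
    by_region[region].append(amount)

# Summarizing: one set of summary statistics per group.
summary = {
    region: {"count": len(amounts), "total": sum(amounts), "mean": mean(amounts)}
    for region, amounts in by_region.items()
}

for region, stats in summary.items():
    print(region, stats)
```

Comparing the per-group summaries is often the quickest way to spot a region or category that behaves differently from the rest.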
Data modeling
Data modeling is the process of choosing a suitable statistical model to predict the result. After data exploration, we need to develop a mathematical model that encodes the relationships in the data. These models are divided according to the result they produce:
Classification: If the result obtained by the model is categorical.
Regression: If the result obtained by the model is numerical.
Clustering: It involves grouping of the data points to gain valuable insights.
Python's scikit-learn library provides methods such as linear regression, logistic regression, classification trees, SVM, AdaBoost, and k-nearest neighbors to generate these models.
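To show what a classification model does at its simplest, the sketch below implements a k-nearest neighbors classifier in plain Python. It is a hand-written illustration of the idea, not scikit-learn's implementation; the training points, labels, and the classify function are hypothetical.

```python
from collections import Counter
from math import dist

# Hypothetical labelled points: (height_cm, weight_kg) -> category.
training = [
    ((150, 50), "small"), ((155, 55), "small"), ((160, 58), "small"),
    ((180, 85), "large"), ((185, 90), "large"), ((178, 82), "large"),
]


def classify(point, k=3):
    """Return the most common label among the k nearest training points."""
    nearest = sorted(training, key=lambda item: dist(point, item[0]))[:k]
    labels = Counter(label for _, label in nearest)
    return labels.most_common(1)[0][0]


print(classify((158, 54)))
print(classify((182, 88)))
```

Because the result is a category rather than a number, this is a classification model in the sense described above.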
Model validation and testing
Model validation is divided into a training phase and a testing phase. The data is randomly split, commonly into 70 percent for training and 30 percent for testing. The model is trained on the 70 percent, and its predictions are then compared against the remaining 30 percent of test data. There are several techniques to validate the effectiveness of the model; the most popular is k-Fold