
21CS73T-DATA SCIENCE AND ANALYTICS

UNIT I INTRODUCTION TO DATA SCIENCE AND BIG DATA

Introduction to Data Science – Applications - Data Science Process – Exploratory Data analysis –
Collection of data – Graphical presentation of data – Classification of data – Storage and retrieval of
data – Big data – Challenges of Conventional Systems - Web Data – Evolution Of Analytic
Scalability - Analytic Processes and Tools - Analysis vs Reporting .

Introduction to Data Science


DATA SCIENCE:
 Data Science is a combination of multiple disciplines that uses statistics, data analysis, and machine
learning to analyze data and to extract knowledge and insights from it.
 Data Science is about data gathering, analysis, and decision-making.
 Data Science is about finding patterns in data through analysis and making future predictions.

Data Science is used in many industries in the world today, e.g. banking, consultancy, healthcare, and
manufacturing.
Data Science can be applied in nearly every part of a business where data is available. Examples are:

 Consumer goods
 Stock markets
 Industry
 Politics
 Logistic companies
 E-commerce
 Data: It refers to the raw information that is collected, stored, and processed. In today’s digital age,
enormous amounts of data are generated from various sources such as sensors, social media,
transactions, and more. This data can come in structured formats (e.g., databases) or unstructured
formats (e.g., text, images, videos).
 Structured Data
Structured data is organized and easier to work with.
 Unstructured Data
Unstructured data is not organized. We must organize the data for analysis purposes.

 Science: It refers to the systematic study and investigation of phenomena using scientific methods and
principles. Science involves forming hypotheses, conducting experiments, analyzing data, and drawing
conclusions based on evidence.
Applications

1. In Search Engines
The most useful application of Data Science is in search engines. When we want to search for something on the internet, we mostly use search engines like Google, Yahoo, DuckDuckGo, Bing, etc., and Data Science is used to make these searches faster and more relevant.
For example, when we search for something such as "Data Structure and algorithm courses", the first result we get is often a link to GeeksforGeeks courses. This happens because the GeeksforGeeks website is visited most often for information on Data Structure courses and related computer subjects. This analysis is done using Data Science, which surfaces the most visited web links at the top.
2. In Transport
Data Science has also entered real-time applications such as the transport field, for example driverless cars. With the help of driverless cars, it becomes easier to reduce the number of accidents.
For example, in driverless cars the training data is fed into the algorithm, and with the help of Data Science techniques the data is analyzed: what the speed limit is on highways, busy streets, narrow roads, etc., and how to handle different situations while driving.
3. In Finance
Data Science plays a key role in financial industries. Financial industries always face issues of fraud and risk of losses. Thus, they need to automate risk-of-loss analysis in order to carry out strategic decisions for the company. Financial industries also use Data Science analytics tools to predict the future, allowing companies to predict customer lifetime value and stock market moves.
For example, in the stock market, Data Science is used to examine past behaviour with past data, with the goal of predicting future outcomes. Data is analyzed in such a way that it becomes possible to predict future stock prices over a set timeframe.
4. In E-Commerce
E-commerce websites like Amazon, Flipkart, etc. use Data Science to create a better user experience with personalized recommendations.
For example, when we search for something on an e-commerce website, we get suggestions similar to our past choices, as well as recommendations based on the most-bought, most-rated, and most-searched products. This is all done with the help of Data Science.
5. In Health Care
In the healthcare industry, Data Science acts as a boon. Data Science is used for:
 Tumor detection.
 Drug discoveries.
 Medical Image Analysis.
 Virtual Medical Bots.
 Genetics and Genomics.
 Predictive Modeling for Diagnosis etc.
6. Image Recognition
Currently, Data Science is also used in image recognition. For example, when we upload a photo with a friend on Facebook, Facebook suggests tagging the people who are in the picture. This is done with the help of machine learning and Data Science. When an image is recognized, data analysis is done on one's Facebook friends, and if a face in the picture matches someone's profile, Facebook suggests auto-tagging.
7. Targeted Recommendation
Targeted recommendation is one of the most important applications of Data Science. Whatever a user searches for on the internet, he/she will then see related posts everywhere. This can be explained with an example: suppose I want a mobile phone, so I search for it on Google and then decide to buy it offline. Data Science helps the companies who pay for advertisements for that phone, so everywhere on the internet, on social media, on websites, and in apps, I will see recommendations for the mobile phone I searched for. This nudges me to buy it online.
8. Airline Route Planning
With the help of Data Science, the airline sector is also growing: for example, it becomes easier to predict flight delays. It also helps decide whether to fly directly to the destination or take a halt in between, e.g. a flight can take a direct route from Delhi to the U.S.A. or it can halt in between before reaching the destination.
9. Data Science in Gaming
In most games where a user plays against a computer opponent, data science concepts are used along with machine learning, so that with the help of past data the computer improves its performance. Many games, such as Chess and EA Sports titles, use Data Science concepts.
10. Medicine and Drug Development
The process of creating a medicine is very difficult and time-consuming and has to be carried out with full discipline, because it is a matter of someone's life. Without Data Science, it takes a lot of time, resources, and finance to develop a new medicine or drug, but with the help of Data Science it becomes easier, because the probability of success can be estimated based on biological data and factors. Algorithms based on data science can forecast how a compound will react in the human body without lab experiments.

DATA SCIENCE PROCESS


1. The first step of this process is setting a research goal. The main purpose here is making sure all the stakeholders understand the what, how, and why of the project. In every serious project this will result in a project charter.

2. The second phase is data retrieval. You want to have data available for analysis, so this step includes finding suitable data and getting access to the data from the data owner. The result is data in its raw form, which probably needs polishing and transformation before it becomes usable.

3. Now that you have the raw data, it's time to prepare it. This includes transforming the data from a raw form into data that's directly usable in your models. To achieve this, you'll detect and correct different kinds of errors in the data, combine data from different data sources, and transform it. If you have successfully completed this step, you can progress to data visualization and modeling.

4. The fourth step is data exploration. The goal of this step is to gain a deep understanding of the data. You'll look for patterns, correlations, and deviations based on visual and descriptive techniques. The insights you gain from this phase will enable you to start modeling.

5. Finally, we get to the sexiest part: model building (often referred to as "data modeling" throughout this book). It is now that you attempt to gain the insights or make the predictions stated in your project charter. Now is the time to bring out the heavy guns, but remember research has taught us that often (but not always) a combination of simple models tends to outperform one complicated model. If you've done this phase right, you're almost done.

6. The last step of the data science model is presenting your results and automating the analysis, if needed. One goal of a project is to change a process and/or make better decisions. You may still need to convince the business that your findings will indeed change the business process as expected. This is where you can shine in your influencer role. The importance of this step is more apparent in projects on a strategic and tactical level. Certain projects require you to perform the business process over and over again, so automating the project will save time.
Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is a process of describing the data by means of statistical and
visualization techniques in order to bring important aspects of that data into focus for further analysis. This
involves inspecting the dataset from many angles, describing & summarizing it without making any
assumptions about its contents.
EDA is a significant step to take before diving into statistical modeling or machine learning, to ensure the
data is really what it is claimed to be and that there are no obvious errors. It should be part of data science
projects in every organization.

Learning Objectives
 Learn what Exploratory Data Analysis (EDA) is and why it’s important in data analytics.
 Understand how to look at and clean data, including dealing with single variables.
 Summarize data using simple statistics and visual tools like bar plots to find patterns.
 Ask and answer questions about the data to uncover deeper insights.
 Use Python libraries like pandas, NumPy, Matplotlib, and Seaborn to explore and visualize data.
What is Exploratory Data Analysis?
Exploratory Data Analysis (EDA) is like exploring a new place. You look around, observe things, and try to
understand what’s going on. Similarly, in EDA data science, you look at a dataset, check out the different
parts, and try to figure out what’s happening in the data. It involves using statistics and visual tools to
understand and summarize data, helping data scientists and data analysts inspect the dataset from various
angles without making assumptions about its contents.
Here’s a Typical Process
 Look at the Data: Gather information about the data, such as the number of rows and columns, and
the type of information each column contains. This includes understanding single variables and their
distributions.
 Clean the Data: Fix issues like missing or incorrect values. Preprocessing is essential to ensure the
data is ready for analysis and predictive modeling.
 Make Summaries: Summarize the data to get a general idea of its contents, such as average values,
common values, or value distributions. Calculating quantiles and checking for skewness can provide
insights into the data’s distribution.
 Visualize the Data: Use interactive charts and graphs to spot trends, patterns, or anomalies. Bar
plots, scatter plots, and other visualizations help in understanding relationships between variables.
Python libraries like pandas, NumPy, Matplotlib, Seaborn, and Plotly are commonly used for this
purpose.
 Ask Questions: Formulate questions based on your observations, such as why certain data points
differ or if there are relationships between different parts of the data.
 Find Answers: Dig deeper into the data to answer these questions, which may involve further
analysis or creating models, including regression or linear regression models.
For example, in Python, you can perform EDA techniques by importing necessary libraries, loading your
dataset, and using functions to display basic information, summary statistics, check for missing values, and
visualize distributions and relationships between variables. Here’s a basic example:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset


data = pd.read_csv('your_dataset.csv')

# Display basic information


print(data.info())
# Display summary statistics
print(data.describe())

# Check for missing values


print(data.isnull().sum())

# Visualize distributions
sns.histplot(data['column_name'])
plt.show()
Why is Exploratory Data Analysis Important?
Exploratory Data Analysis (EDA) is an essential step in the data analysis process. It involves analyzing and
visualizing data to understand its main characteristics, uncover patterns, and identify relationships between
variables. Python offers several libraries that are commonly used for EDA, including pandas, NumPy,
Matplotlib, Seaborn, and Plotly.
EDA is crucial because raw data is usually skewed, may have outliers, or too many missing values. A
model built on such data results in sub-optimal performance. In the hurry to get to the machine learning
stage, some data professionals either entirely skip the EDA process or do a very mediocre job. This is a
mistake with many implications, including:
 Generating Inaccurate Models: Models built on unexamined data can be inaccurate and unreliable.
 Using Wrong Data: Without EDA, you might be analyzing or modeling the wrong data, leading to
false conclusions.
 Inefficient Resource Use: Inefficiently using computational and human resources due to lack of
proper data understanding.
 Improper Data Preparation: EDA helps in creating the right types of variables, which is critical
for effective data preparation.
In this article, we’ll be using Pandas, Seaborn, and Matplotlib libraries of Python to demonstrate various
EDA techniques applied to Haberman’s Breast Cancer Survival Dataset. This will provide a practical
understanding of EDA and highlight its importance in the data analysis workflow.
Types of EDA Techniques
Before diving into the dataset, let's first understand the different types of Exploratory Data Analysis (EDA)
techniques. Here are six key types of EDA techniques:
 Univariate Analysis: Univariate analysis examines individual variables to understand their
distributions and summary statistics. This includes calculating measures such as mean, median,
mode, and standard deviation, and visualizing the data using histograms, bar charts, box plots, and
violin plots.
 Bivariate Analysis: Bivariate analysis explores the relationship between two variables. It uncovers
patterns through techniques like scatter plots, pair plots, and heatmaps. This helps to identify
potential associations or dependencies between variables.
 Multivariate Analysis: Multivariate analysis involves examining more than two variables
simultaneously to understand their relationships and combined effects. Techniques such as contour
plots, and principal component analysis (PCA) are commonly used in multivariate EDA.
 Visualization Techniques: EDA relies heavily on visualization methods to depict data distributions,
trends, and associations. Various charts and graphs, such as bar charts, line charts, scatter plots, and
heatmaps, are used to make data easier to understand and interpret.
 Outlier Detection: EDA involves identifying outliers within the data, i.e. anomalies that deviate
significantly from the rest of the data. Tools such as box plots, z-score analysis, and scatter plots
help in detecting and analyzing outliers; a brief z-score sketch is shown after this list.
 Statistical Tests: EDA often includes performing statistical tests to validate hypotheses or discern
significant differences between groups. Tests such as t-tests, chi-square tests, and ANOVA add
depth to the analysis process by providing a statistical basis for the observed patterns.
By using these EDA techniques, we can gain a comprehensive understanding of the data, identify key
patterns and relationships, and ensure the data’s integrity before proceeding with more complex analyses.
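As referenced in the outlier-detection point above, here is a minimal sketch of z-score based outlier flagging on a toy numeric column; the values and the threshold of 3 standard deviations are illustrative choices, not part of the dataset used later.
import numpy as np
import pandas as pd

# Toy numeric column; in practice this could be a column such as the
# positive axillary node counts explored later in this unit
values = pd.Series([0, 1, 1, 2, 3, 3, 4, 5, 6, 8, 10, 46])

# z-score: how many standard deviations each value lies from the mean
z_scores = (values - values.mean()) / values.std()

# Flag values more than 3 standard deviations away as potential outliers
outliers = values[np.abs(z_scores) > 3]
print(outliers)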
Dataset Description
The dataset used is an open-source dataset and comprises cases from a study conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer. The dataset can be downloaded from here.
[Source: Tjen-Sien Lim (limt@stat.wisc.edu), Date: March 4, 1999]
Importing Libraries and Loading Data
Import all necessary packages:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
Load the dataset in pandas dataframe:
df = pd.read_csv('haberman.csv', header=0)
df.columns = ['Age', 'Year', 'Nodes', 'Survival_Status']
Understanding Data
To understand the dataset, let’s just see the first few rows.
print(df.head())
Output:

Shape of the DataFrame


To understand the size of the dataset, we check its shape.
df.shape
Output
(305, 4)
Class Distribution
Next, let's see how many data points there are for each class label in our dataset. There are 305 rows and 4 columns, but how many data points are present for each class label?
df['Survival_Status'].value_counts()
Output

 The dataset is imbalanced as expected.


 Out of a total of 305 patients, the number of patients who survived over 5 years post-operation is
nearly 3 times the number of patients who died within 5 years.
Checking for Missing Values
Let’s check for any missing values in the dataset.
print("Missing values in each column:\n", df.isnull().sum())
Output
There are no missing values in the dataset.
Data Information
Let’s get a summary of the dataset to understand the data types and further verify the absence of missing
values.
df.info()
Output

 All the columns are of integer type.


 No missing values in the dataset.
By understanding the basic structure, distribution, and completeness of the data, we can proceed with more
detailed exploratory data analysis (EDA) and uncover deeper insights.
Data Preparation
Before proceeding with statistical analysis and visualization, we need to modify the original class labels.
The current labels are 1 (survived 5 years or more) and 2 (died within 5 years), which are not very
descriptive. We’ll map these to more intuitive categorical variables: ‘yes’ for survival and ‘no’ for non-
survival.
# Map survival status values to the categorical labels 'yes' and 'no'
df['Survival_Status'] = df['Survival_Status'].map({1: 'yes', 2: 'no'})

# Display the updated DataFrame to verify changes


print(df.head())
Output

General Statistical Analysis


We will now perform a general statistical analysis to understand the overall distribution and central
tendencies of the data.
# Display summary statistics of the DataFrame
df.describe()
Output
 On average, patients were operated on at the age of 63.
 The average number of positive axillary nodes detected is about 4.
 As indicated by the 50th percentile, the median number of positive axillary nodes is 1.
 As indicated by the 75th percentile, 75% of the patients have fewer than 4 nodes detected.
Notice that there is a significant difference between the mean and the median of the node counts. This is because there are some outliers in our data, and the mean is influenced by the presence of these outliers.
Class-wise Statistical Analysis
To gain deeper insights, we’ll perform a statistical analysis for each class (survived vs. not survived)
separately.
Survived (Yes) Analysis:
survival_yes = df[df['Survival_Status'] == 'yes']
print(survival_yes.describe())
Output

Not Survived (No) Analysis:


survival_no = df[df['Survival_Status'] == 'no']
print(survival_no.describe())
Output:

From the above class-wise analysis, it can be observed that —


 The average age at which the patient is operated on is nearly the same in both cases.
 Patients who died within 5 years on average had about 4 to 5 positive axillary nodes more than the
patients who lived over 5 years post-operation.
Note that, all these observations are solely based on the data at hand.
Uni-variate Data Analysis
“A picture is worth ten thousand words”
– Fred R. Barnard
Uni-variate analysis involves studying one variable at a time. This type of analysis helps in understanding
the distribution and characteristics of each variable individually. Below are different ways to perform uni-
variate analysis along with their outputs and interpretations.
Distribution Plots
Distribution plots, also known as probability density function (PDF) plots, show how values in a dataset are
spread out. They help us see the shape of the data distribution and identify patterns.
Patient’s Age
sns.FacetGrid(df, hue="Survival_Status", height=5).map(sns.histplot, "Age", kde=True).add_legend()
plt.title('Distribution of Age')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
Output

 Patients aged 40-60 years form the largest group.
 There is a high overlap between the class labels, implying that survival status post-operation cannot
be discerned from age alone.
Operation Year
sns.FacetGrid(df, hue="Survival_Status", height=5).map(sns.histplot, "Year", kde=True).add_legend()
plt.title('Distribution of Operation Year')
plt.xlabel('Operation Year')
plt.ylabel('Frequency')
plt.show()
Output

 Similar to the age plot, there is a significant overlap between the class labels, suggesting that
operation year alone is not a distinctive factor for survival status.
Number of Positive Axillary Nodes
sns.FacetGrid(df, hue="Survival_Status", height=5).map(sns.histplot, "Nodes", kde=True).add_legend()
plt.title('Distribution of Positive Axillary Nodes')
plt.xlabel('Number of Positive Axillary Nodes')
plt.ylabel('Frequency')
plt.show()
Output
 Patients with 4 or fewer axillary nodes mostly survived 5 years or longer.
 Patients with more than 4 axillary nodes have a lower likelihood of survival compared to those with
4 or fewer nodes.
But we must back our observations with some quantitative measure. That's where Cumulative Distribution Function (CDF) plots come into the picture.
Cumulative Distribution Function (CDF)
CDF plots show the probability that a variable will take a value less than or equal to a specific value. They
provide a cumulative measure of the distribution.
counts, bin_edges = np.histogram(df[df['Survival_Status'] == 'yes']['Nodes'], density=True)
pdf = counts / sum(counts)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], cdf, label='CDF Survival status = Yes')

counts, bin_edges = np.histogram(df[df['Survival_Status'] == 'no']['Nodes'], density=True)
pdf = counts / sum(counts)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], cdf, label='CDF Survival status = No')

plt.legend()
plt.xlabel("Positive Axillary Nodes")
plt.ylabel("CDF")
plt.title('Cumulative Distribution Function for Positive Axillary Nodes')
plt.grid()
plt.show()
Output

 Patients with 4 or fewer positive axillary nodes have about an 85% chance of surviving 5 years or
longer post-operation.
 The likelihood decreases for patients with more than 4 axillary nodes.
Box Plots
Box plots, also known as box-and-whisker plots, summarize data using five key metrics: minimum, lower
quartile (25th percentile), median (50th percentile), upper quartile (75th percentile), and maximum. They
also highlight outliers.
plt.figure(figsize=(15, 4))
plt.subplot(1, 3, 1)
sns.boxplot(x='Survival_Status', y='Age', data=df)
plt.title('Box Plot of Age')
plt.subplot(1, 3, 2)
sns.boxplot(x='Survival_Status', y='Year', data=df)
plt.title('Box Plot of Operation Year')
plt.subplot(1, 3, 3)
sns.boxplot(x='Survival_Status', y='Nodes', data=df)
plt.title('Box Plot of Positive Axillary Nodes')
plt.show()
Output:

 The patient age and operation year plots show similar statistics.
 The isolated points in the positive axillary nodes box plot are outliers, which is expected in medical
datasets.
Violin Plots
Violin plots combine the features of box plots and density plots. They provide a visual summary of the data
and show the distribution’s shape, density, and variability.
plt.figure(figsize=(15, 4))
plt.subplot(1, 3, 1)
sns.violinplot(x='Survival_Status', y='Age', data=df)
plt.title('Violin Plot of Age')
plt.subplot(1, 3, 2)
sns.violinplot(x='Survival_Status', y='Year', data=df)
plt.title('Violin Plot of Operation Year')
plt.subplot(1, 3, 3)
sns.violinplot(x='Survival_Status', y='Nodes', data=df)
plt.title('Violin Plot of Positive Axillary Nodes')
plt.show()
Output:
 The distribution of positive axillary nodes is highly skewed for the ‘yes’ class label and moderately
skewed for the ‘no’ label.
 The majority of patients, regardless of survival status, have a lower number of positive axillary
nodes, with those having 4 or fewer nodes more likely to survive 5 years post-operation.
These observations align with our previous analyses and provide a deeper understanding of the data.
Bar Charts
Bar charts display the frequency or count of categories within a single variable, making them useful for
comparing different groups.
Survival Status Count
sns.countplot(x='Survival_Status', data=df)
plt.title('Count of Survival Status')
plt.xlabel('Survival Status')
plt.ylabel('Count')
plt.show()
Output

 This bar chart shows the number of patients who survived 5 years or longer versus those who did
not. It helps visualize the class imbalance in the dataset.
Histograms
Histograms show the distribution of numerical data by grouping data points into bins. They help understand
the frequency distribution of a variable.
Age Distribution
df['Age'].plot(kind='hist', bins=20, edgecolor='black')
plt.title('Histogram of Age')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
Output
 The histogram displays how the ages of patients are distributed. Most patients are between 40 and 60
years old.
Bi-variate Data Analysis
Bi-variate data analysis involves studying the relationship between two variables at a time. This helps in
understanding how one variable affects another and can reveal underlying patterns or correlations. Here are
some common methods for bi-variate analysis.
Pair Plot
A pair plot visualizes the pairwise relationships between variables in a dataset. It displays both the
distributions of individual variables and their relationships.
sns.set_style('whitegrid')
sns.pairplot(df, hue='Survival_Status')
plt.show()
Output

 The pair plot shows scatter plots of each pair of variables and histograms of each variable along the
diagonal.
 The scatter plots on the upper and lower halves of the matrix are mirror images, so analyzing one
half is sufficient.
 The histograms on the diagonal show the univariate distribution of each feature.
 There is a high overlap between any two features, indicating no clear distinction between the
survival status class labels based on feature pairs.
While the pair plot provides an overview of the relationships between all pairs of variables, sometimes it is
useful to focus on the relationship between just two specific variables in more detail. This is where the joint
plot comes in.
Joint Plot
A joint plot provides a detailed view of the relationship between two variables along with their individual
distributions.
sns.jointplot(x='Age', y='Nodes', data=df, kind='scatter')
plt.show()
Output
 The scatter plot in the center shows no correlation between the patient’s age and the number of
positive axillary nodes detected.
 The histogram on the top edge shows that patients are more likely to get operated on between the
ages of 40 and 60 years.
 The histogram on the right edge indicates that the majority of patients had fewer than 4 positive
axillary nodes.
While joint plots and pair plots help visualize the relationships between pairs of variables, a heatmap can
provide a broader view of the correlations among all the variables in the dataset simultaneously.
Heatmap
A heatmap visualizes the correlation between different variables. It uses color coding to represent the
strength of the correlations, which can help identify relationships between variables.
sns.heatmap(df.corr(numeric_only=True), cmap='YlGnBu', annot=True)
plt.show()
Output:

 The heatmap displays Pearson’s R values, indicating the correlation between pairs of variables.
 Correlation values close to 0 suggest no linear relationship between the variables.
 In this dataset, there are no strong correlations between any pairs of variables, as most values are
near 0.
These bi-variate analysis techniques provide valuable insights into the relationships between different
features in the dataset, helping to understand how they interact and influence each other. Understanding
these relationships is crucial for building more accurate models and making informed decisions in data
analysis and machine learning tasks.
Multivariate Analysis
Multivariate analysis involves examining more than two variables simultaneously to understand their
relationships and combined effects. This type of analysis is essential for uncovering complex interactions in
data. Let’s explore several multivariate analysis techniques.
Contour Plot
A contour plot is a graphical technique that represents a 3-dimensional surface by plotting constant z slices,
called contours, in a 2-dimensional format. This allows us to visualize complex relationships between three
variables in an easily interpretable 2-D chart.

For example, let’s examine the relationship between patient’s age and operation year, and how these relate
to the number of patients.
sns.jointplot(x='Age', y='Year', data=df, kind='kde', fill=True)
plt.show()
Output

 From the above contour plot, it can be observed that the years 1959–1964 witnessed more patients in
the age group of 45–55 years.
 The contour lines represent the density of data points. Closer contour lines indicate a higher density
of data points.
 The areas with the darkest shading represent the highest density of patients, showing the most
common combinations of age and operation year.
By utilizing contour plots, we can effectively consolidate information from three dimensions into a two-
dimensional format, making it easier to identify patterns and relationships in the data. This approach
enhances our ability to perform comprehensive multivariate analysis and extract valuable insights from
complex datasets.
3D Scatter Plot
A 3D scatter plot is an extension of the traditional scatter plot into three dimensions, which allows us to
visualize the relationship among three variables.
from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

ax.scatter(df['Age'], df['Year'], df['Nodes'])


ax.set_xlabel('Age')
ax.set_ylabel('Year')
ax.set_zlabel('Nodes')
plt.show()
Output

 Most patients are aged between 40 to 70 years, with their surgeries predominantly occurring between
the years 1958 to 1966.
 The majority of patients have fewer than 10 positive axillary lymph nodes, indicating that low node
counts are common in this dataset.
 A few patients have a significantly higher number of positive nodes (up to around 50), suggesting
cases of more advanced cancer.
 There is no strong correlation between the patient’s age or the year of surgery and the number of
positive nodes detected. Positive nodes are spread across various ages and years without a clear
trend.
Conclusion
In this article, we learned some common steps involved in exploratory data analysis. We also saw several types of charts and plots and what information is conveyed by each of them. This is not all; I encourage you to play with the data, come up with different kinds of visualizations, and observe what insights you can extract from them.

COLLECTION OF DATA
Data Collection is the process of collecting information from relevant sources in order to find a solution to the
given statistical enquiry. Collection of Data is the first and foremost step in a statistical investigation.

Methods of Collecting Data


There are two different methods of collecting data: Primary Data Collection and Secondary Data Collection.
There are a number of methods of collecting primary data; some of the common methods are as follows:
1. Direct Personal Investigation : As the name suggests, the method of direct personal investigation involves
collecting data personally from the source of origin. In simple words, the investigator makes direct contact
with the person from whom he/she wants to obtain information. This method can attain success only when the
investigator collecting data is efficient, diligent, tolerant and impartial. For example, direct contact with the
household women to obtain information about their daily routine and schedule.
2. Indirect Oral Investigation : In this method of collecting primary data, the investigator does not make
direct contact with the person from whom he/she needs information, instead they collect the data orally from
some other person who has the necessary required information. For example, collecting data of employees
from their superiors or managers.
3. Information from Local Sources or Correspondents : In this method, the investigator appoints correspondents or local persons at various places; they collect the data and furnish it to the investigator. With the help of correspondents and local persons, the investigator can cover a wide area.
4. Information through Questionnaires and Schedules : In this method of collecting primary data, the
investigator, while keeping in mind the motive of the study, prepares a questionnaire. The investigator can
collect data through the questionnaire in two ways:
 Mailing Method: This method involves mailing the questionnaires to the informants for the
collection of data. The investigator attaches a letter with the questionnaire in the mail to define the
purpose of the study or research. The investigator also assures the informants that their information
would be kept secret, and then the informants note the answers to the questionnaire and return the
completed file.
 Enumerator’s Method: This method involves the preparation of a questionnaire according to the
purpose of the study or research. However, in this case, the enumerator reaches out to the
informants himself with the prepared questionnaire. Enumerators are not the investigators
themselves; they are the people who help the investigator in the collection of data.
Methods of Collecting Secondary Data
Secondary data can be collected through different published and unpublished sources. Some of them are as
follows:
1. Published Sources
 Government Publications: The government publishes different documents that consist of different varieties of information or data compiled by the Ministries and the Central and State Governments in India as part of their routine activity. As the government publishes these statistics, they are fairly reliable for the investigator. Examples of government publications on statistics are the Annual Survey of Industries, the Statistical Abstract of India, etc.
 Semi-Government Publications: Different Semi-Government bodies also publish data related to
health, education, deaths and births. These kinds of data are also reliable and used by different
informants. Some examples of semi-government bodies are Metropolitan Councils, Municipalities,
etc.
 Publications of Trade Associations: Various big trade associations collect and publish data from
their research and statistical divisions of different trading activities and their aspects. For example,
data published by Sugar Mills Association regarding different sugar mills in India.
 Journals and Papers: Different newspapers and magazines provide a variety of statistical data in
their writings, which are used by different investigators for their studies.
 International Publications: Different international organizations like IMF, UNO, ILO, World
Bank, etc., publish a variety of statistical information which are used as secondary data.
 Publications of Research Institutions: Research institutions and universities also publish their research activities and findings, which are used by different investigators as secondary data, for example, the National Council of Applied Economic Research, the Indian Statistical Institute, etc.
2. Unpublished Sources
Another source of collecting secondary data is unpublished sources. The data in unpublished sources is collected by different government and other organizations. These organizations usually collect the data for their own use, and it is not published anywhere. Examples include research work done by professors, professionals, and teachers, and records maintained by business and private enterprises.
Example:
The table below shows the production of rice in India.

The above table contains the production of rice in India in different years. It can be seen that these values vary from one year to another; therefore, they are known as variables. A variable is a quantity or attribute whose value varies from one investigation to another. In general, variables are represented by letters such as X, Y, or Z. In the above example, the years are represented by variable X and the production of rice by variable Y. The values of variables X and Y are the data from which an investigator or enumerator collects information regarding the trends of rice production in India.

GRAPHICAL PRESENTATION OF DATA:


Graphical Representation of Data: In graphical representation of data, numbers and facts become lively pictures and colorful diagrams. Instead of staring at boring lists of numbers, we use charts, graphs, and interesting visuals to understand information better. In this concept of data visualization, we'll learn about different kinds of graphs, charts, and pictures that help us see patterns and stories hidden in data.
Types of Graphical Representations
Line Graphs
A line graph is used to show how the value of a particular variable changes with time. We plot this graph by
connecting the points at different values of the variable. It can be useful for analyzing the trends in the data and
predicting further trends.
Bar Graphs
A bar graph is a type of graphical representation of the data in which bars of uniform width are drawn with
equal spacing between them on one axis (x-axis usually), depicting the variable. The values of the variables are
represented by the height of the bars.

Histograms
This is similar to a bar graph, but it is based on the frequency of numerical values rather than their actual values. The data is organized into intervals, and the bars represent the frequency of the values in each range. That is, it counts how many values of the data lie in a particular range.

Line Plot
It is a plot that displays data as points or check marks above a number line, showing the frequency of each value.

Box and Whisker Plot


These plots divide the data into four parts to show their summary. They are more concerned about the
spread, average, and median of the data.
Pie Chart
It is a type of graph that represents the data in the form of a circular chart. The circle is divided such that each portion represents a proportion of the whole.
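As a quick illustration of two of the chart types described above, here is a minimal matplotlib sketch that draws a bar graph and a pie chart from the same made-up category counts.
import matplotlib.pyplot as plt

# Made-up category counts, purely for illustration
categories = ['Consumer goods', 'E-commerce', 'Logistics', 'Finance']
counts = [40, 25, 20, 15]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Bar graph: the height of each bar represents the value of the variable
ax1.bar(categories, counts)
ax1.set_title('Bar Graph')
ax1.set_ylabel('Count')

# Pie chart: each slice represents a proportion of the whole
ax2.pie(counts, labels=categories, autopct='%1.0f%%')
ax2.set_title('Pie Chart')

plt.tight_layout()
plt.show()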

Advantages
 It gives us a summary of the data which is easier to look at and analyze.
 It saves time.
 We can compare and study more than one variable at a time.
Disadvantages
 It usually shows only one aspect of the data and ignores the others. For example, a bar graph does not represent the mean, median, and other statistics of the data.
 Interpretation of graphs can vary based on individual perspectives, leading to subjective
conclusions.
 Poorly constructed or misleading visuals can distort data interpretation and lead to incorrect
conclusions.
CLASSIFICATION OF DATA :
Data classification is the process of classifying data into relevant categories so that it can be used or applied more efficiently. The classification of data makes it easy for the user to retrieve it. Data classification is important when it comes to data security and compliance, and also for meeting different types of business or personal objectives. It is also a major requirement, as data must be easily retrievable within a specific period of time.
Types of Data Classification :
Data can be broadly classified into 3 types.
1. Structured Data :
Structured data is created using a fixed schema and is maintained in tabular format. The elements in structured
data are addressable for effective analysis. It contains all the data which can be stored in the SQL database in a
tabular format. Today, most of the data is developed and processed in the simplest way to manage
information.
Examples –
Relational data, Geo-location, credit card numbers, addresses, etc.
Consider an example of relational data: you have to maintain a record of students for a university, including the name of the student, the ID of the student, the address, and the email of the student. To store the record of students, the following relational schema and table can be used.

S_ID    S_Name    S_Address    S_Email
1001    A         Delhi        A@gmail.com
1002    B         Mumbai       B@gmail.com
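The same student records can be represented as a structured, tabular dataset in Python, for example with a pandas DataFrame (a minimal sketch using the illustrative values from the table above).
import pandas as pd

# The same student records represented as a structured (tabular) dataset
students = pd.DataFrame({
    'S_ID': [1001, 1002],
    'S_Name': ['A', 'B'],
    'S_Address': ['Delhi', 'Mumbai'],
    'S_Email': ['A@gmail.com', 'B@gmail.com'],
})
print(students)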

2. Unstructured Data :
It is defined as data that does not follow a pre-defined standard, i.e. it does not follow any organized format. This kind of data is also not a good fit for a relational database, because a relational database expects data in a pre-defined, organized form. Unstructured data is very important for the big data domain, and there are many platforms to manage and store it, such as NoSQL databases.
Examples –
Word, PDF, text, media logs, etc.
3. Semi-Structured Data :
Semi-structured data is information that does not reside in a relational database but has some organizational properties that make it easier to analyze. With some processing it can be stored in a relational database, although this is very hard for some kinds of semi-structured data; the semi-structured form exists to ease storage and handling.
Example –
XML data.
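A short sketch of working with semi-structured data in Python: the XML below is made up, but its tags give enough structure for a parser to extract fields even though the data is not tabular.
import xml.etree.ElementTree as ET

# A small, made-up XML document: not tabular, but its tags provide enough
# organization for a parser to extract individual fields
xml_data = """
<students>
    <student id="1001"><name>A</name><city>Delhi</city></student>
    <student id="1002"><name>B</name><city>Mumbai</city></student>
</students>
"""

root = ET.fromstring(xml_data)
for student in root.findall('student'):
    print(student.get('id'), student.find('name').text, student.find('city').text)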
Features of Data Classification :
The main goal of the organization of data is to arrange it in such a form that it becomes readily available to the users. Its basic features are as follows.
 Homogeneity – The data items in a particular group should be similar to each other.
 Clarity – There must be no confusion in the positioning of any data item in a particular group.
 Stability – The data item set must be stable i.e. any investigation should not affect the same set of
classification.
 Elastic – One should be able to change the basis of classification as the purpose of classification
changes.
BIG DATA:
Data science is the study of data analysis using advanced technology (machine learning, artificial intelligence, big data). It processes a huge amount of structured, semi-structured, and unstructured data to extract insights and meaning, from which patterns can be identified that are useful for making decisions, grabbing new business opportunities, improving products or services, and ultimately achieving business growth. The data science process is used to make sense of big data, i.e. the huge amounts of data used in business.

CHALLENGES OF CONVENTIONAL SYSTEMS:

 It cannot work on unstructured data efficiently


 It is built on top of the relational data model
 It is batch oriented and we need to wait for nightly ETL (extract, transform and load) and transformation jobs to complete before the required insight is obtained
 Inadequate support of aggregated summaries of data

WEB DATA:
Data that is sourced and structured from websites is referred to as "web data". Web Data Integration (WDI) is an extension and specialization of data integration that views the web as a collection of heterogeneous databases.
Examples of web data include online product reviews, social media posts, website traffic statistics, and search engine results.
Web content consists of several types of data – text, image, audio, video, etc. Content data is the group of facts that a web page is designed around. It can provide effective and interesting patterns about user needs. Text documents are related to text mining, machine learning, and natural language processing.
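As a small illustration of how web data might be collected programmatically, here is a hedged Python sketch using the requests and BeautifulSoup libraries; the URL and the 'review' CSS class are hypothetical placeholders.
import requests
from bs4 import BeautifulSoup  # assumes the beautifulsoup4 package is installed

# Hypothetical page used only for illustration
url = 'https://example.com/products'

# Fetch the raw HTML of the page
response = requests.get(url, timeout=10)
response.raise_for_status()

# Parse the HTML and pull out text content (e.g. review snippets);
# the 'review' class name is a made-up placeholder
soup = BeautifulSoup(response.text, 'html.parser')
for review in soup.find_all('p', class_='review'):
    print(review.get_text(strip=True))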
EVOLUTION OF ANALYTIC SCALABILITY

 Traditional Analytic Architecture: Traditional analytics collects data from heterogeneous data sources, and all the data had to be pulled together into a separate analytics environment to do the analysis; this environment can be an analytical server or a personal computer with more computing capability.
 The heavy processing occurs in the analytic environment, as shown in the figure.
 In such environments, shipping of data becomes a must, which might result in issues related to the security and confidentiality of the data.

Modern In-Database Architecture:

Data from heterogeneous sources are collected, transformed, and loaded into a data warehouse for final analysis by decision makers.
The processing stays in the database where the data has been consolidated. The data is presented in aggregated form for querying.
Queries from users are submitted to OLAP (online analytical processing) engines for execution. Such in-database architectures are tested for their query throughput rather than transaction throughput as in traditional database environments.
More metadata is required for directing the queries, which helps in reducing the time taken for answering queries and hence increases the query throughput.
Moreover, the data in consolidated form are free from anomalies, since they are pre-processed before loading into warehouses, and so may be used directly for analysis.
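To illustrate the idea of keeping the processing inside the database, the sketch below pushes an aggregate query into an in-memory SQLite database (standing in for a warehouse or OLAP engine), so only the summarized result reaches the analysis environment; the table and column names are made up.
import sqlite3

# In-memory SQLite database standing in for a data warehouse
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE sales (region TEXT, amount REAL)')
conn.executemany('INSERT INTO sales VALUES (?, ?)',
                 [('North', 120.0), ('North', 80.0), ('South', 200.0)])

# The aggregation runs inside the database engine; only the summarized
# result (one row per region) is returned to the analysis environment
for region, total in conn.execute('SELECT region, SUM(amount) FROM sales GROUP BY region'):
    print(region, total)

conn.close()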

Massively Parallel Processing (MPP)
It is a type of computing wherein the processing is done by many CPUs working in parallel to execute a single program. One of the most significant differences between Symmetric Multi-Processing (SMP) and Massively Parallel Processing is that with MPP, each of the many CPUs has its own memory, which helps prevent the possible hold-up that the user may experience with SMP when all of the CPUs attempt to access the memory simultaneously.
The salient features of MPP systems are:
o Loosely coupled nodes
o Nodes linked together by a high-speed connection
o Each node has its own memory
o Disks are not shared; each is attached to only one node (shared-nothing architecture)
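The shared-nothing idea can be illustrated in miniature with Python's multiprocessing module, where each worker process has its own memory and handles its own partition of the data; this is only a sketch of the principle, not an MPP database.
from multiprocessing import Pool

def partial_sum(partition):
    # Each worker process gets its own partition (and its own memory) and
    # computes a local result independently of the other workers
    return sum(partition)

if __name__ == '__main__':
    data = list(range(1, 1001))
    # Split the data into 4 partitions, one per worker
    partitions = [data[i::4] for i in range(4)]
    with Pool(processes=4) as pool:
        local_sums = pool.map(partial_sum, partitions)
    # Combine the local results into the final answer
    print(sum(local_sums))  # 500500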
Cloud Computing:
Cloud computing is the delivery of computing services over the Internet.
Examples of cloud services include online file storage, social networking sites, webmail, and online business applications.
The cloud computing model allows access to information and computer resources from anywhere that a network connection is available.
Cloud computing provides a shared pool of resources, including data storage space, networks, computer processing power, and specialized corporate and user applications.
McKinsey and Company has indicated the following as characteristic features of the cloud:
1. Mask the underlying infrastructure from the user
2. Be elastic to scale on demand
3. Charge on a pay-per-use basis
The National Institute of Standards and Technology (NIST) lists five essential characteristics of cloud computing:
1. On-demand self-service
2. Broad network access
3. Resource pooling
4. Rapid elasticity
5. Measured service
There are two types of cloud environment:
1. Public Cloud:
o The services and infrastructure are provided off-site over the internet
o Less secured and more vulnerable than private clouds
2. Private Cloud:
o Infrastructure operated solely for a single organization
o Offers the greatest level of security and control

Grid Computing:

Grid computing is a form of distributed computing whereby a "super and virtual computer" is composed of a cluster of networked, loosely coupled computers acting in concert to perform very large tasks.
Grid computing (Foster and Kesselman, 1999) is a growing technology that facilitates the execution of large-scale, resource-intensive applications on geographically distributed computing resources.
It facilitates flexible, secure, coordinated large-scale resource sharing among dynamic collections of individuals, institutions, and resources.
Distributed or grid computing in general is a special type of parallel computing that relies on complete computers connected to a network by a conventional network interface, i.e. commodity hardware, in contrast to the lower efficiency of designing and constructing a small number of custom supercomputers.

Disadvantage of Grid Computing:
The various processors and local storage areas do not have high-speed connections.

Hadoop:

Apache Hadoop is an open-source software framework for storage and large-scale processing of data sets on clusters of commodity hardware.
Two main building blocks inside this runtime environment are MapReduce and the Hadoop Distributed File System (HDFS).
MapReduce:
Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

A MapReduce job usually splits the input data set into independent chunks which are processed by the map tasks in a completely parallel manner.
The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically, both the input and the output of the job are stored in a file system.
The framework takes care of scheduling tasks, monitoring them, and re-executing the failed tasks.
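To make the map and reduce phases concrete, here is a plain-Python simulation of the classic word-count example; it only illustrates the programming model and is not actual Hadoop MapReduce code.
from collections import defaultdict

documents = ["big data needs big tools", "hadoop processes big data"]

# Map phase: emit (word, 1) pairs from each input record
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))

# Shuffle/sort phase: group all values belonging to the same key
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: combine the values for each key into a final result
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)  # e.g. {'big': 3, 'data': 2, ...}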
HDFS:

HDFS stands for Hadoop Distributed File System.
HDFS is one of the core components of the Hadoop framework and is responsible for the storage aspect.
Unlike the usual storage available on our computers, HDFS is a Distributed File System, and parts of a single large file can be stored on different nodes across the cluster.
HDFS is a distributed, reliable, and scalable file system.

ANALYTIC PROCESSES AND TOOLS

Big data is the storage and analysis of large data sets. These are complex data sets which can be either structured or unstructured. They are so large that it is not possible to work on them with traditional analytical tools. These days, organizations are realising the value they get out of big data analytics, and hence they are deploying big data tools and processes to bring more efficiency into their work environment.
Big Data Tools

R-Programming
R is a free open source software programming language and a software environment for statistical computing
and graphics. It is used by data miners for developing statistical software and data analysis. It has become a
highly popular tool for big data in recent years.

Tableau Public
Tableau is another popular big data tool. It is simple and very intuitive to use. It communicates the insights of
the data through data visualisation. Through Tableau, an analyst can check a hypothesis and explore the data
before starting to work on it extensively.

Datawrapper
It is an online data visualization tool for making interactive charts. You need to upload your data file in CSV, PDF, or Excel format, or paste the data directly into the field. Datawrapper then generates a visualization in the form of a bar chart, line chart, map, etc. It can be embedded into any other website as well. It is easy to use and produces visually effective charts.

APACHE Hadoop
It's a Java-based open-source platform that is used to store and process big data. It is built on a cluster system that allows the system to process data efficiently and in parallel. It can process both structured and unstructured data across one server or multiple computers. Hadoop also offers cross-platform support for its users. Today, it is one of the most widely used big data analytic tools and is popular with many tech giants such as Amazon, Microsoft, IBM, etc.

Cassandra

APACHE Cassandra is an open-source NoSQL distributed database that is used to fetch large amounts of data. It's one of the most popular tools for data analytics and has been praised by many tech companies due to its high scalability and availability without compromising speed and performance. It is capable of delivering thousands of operations every second and can handle petabytes of data with almost zero downtime. It was created at Facebook in 2008 and was later released publicly.
Qubole

It's an open-source big data tool that helps in fetching data in a value chain using ad-hoc analysis and machine learning. Qubole is a data lake platform that offers end-to-end service with reduced time and effort for moving data pipelines. It is capable of configuring multi-cloud services such as AWS, Azure, and Google Cloud. It also helps in lowering the cost of cloud computing by 50%.

Spark

APACHE Spark is another framework used to process data and perform numerous tasks on a large scale. It is also used to process data across multiple computers with the help of distributed computing tools. It is widely used among data analysts as it offers easy-to-use APIs with simple data-pulling methods, and it is capable of handling multiple petabytes of data as well. Spark set a record by processing 100 terabytes of data in just 23 minutes, breaking Hadoop's previous world record of 71 minutes. This is why big tech giants are now moving towards Spark, and it is highly suitable for ML and AI today.
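As an illustration of how an analyst might interact with Spark from Python, here is a minimal PySpark sketch; it assumes the pyspark package is installed and uses a hypothetical sales.csv file with a region column.
from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session
spark = SparkSession.builder.appName('analytics-sketch').getOrCreate()

# Hypothetical CSV file; header row and schema inference are enabled
df = spark.read.csv('sales.csv', header=True, inferSchema=True)

# A simple distributed aggregation: row counts per region
df.groupBy('region').count().show()

spark.stop()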

Mongo DB

MongoDB, which came into the limelight around 2010, is a free, open-source, document-oriented (NoSQL) database that is used to store high volumes of data. It uses collections and documents for storage, and its documents consist of key-value pairs, which are the basic unit of MongoDB. It is popular among developers because it offers drivers for multiple programming languages such as Python, JavaScript, and Ruby.
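A minimal sketch of how documents might be stored and queried from Python using the pymongo driver; it assumes a MongoDB server running locally, and the database, collection, and field names are made up.
from pymongo import MongoClient  # assumes the pymongo package is installed

# Connect to a locally running MongoDB server (hypothetical setup)
client = MongoClient('mongodb://localhost:27017/')
collection = client['shop']['products']

# Documents are stored as key-value pairs, similar to JSON
collection.insert_one({'name': 'laptop', 'price': 55000, 'tags': ['electronics']})

# Query documents back with a simple filter
for product in collection.find({'price': {'$lt': 60000}}):
    print(product['name'], product['price'])

client.close()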

SAS

Today it is one of the best tools for statistical modeling used by data analysts. Using SAS, a data scientist can mine, manage, extract, or update data in different variants from different sources. Statistical Analysis System (SAS) allows a user to access data in many formats (SAS tables or Excel worksheets). Besides that, it also offers a cloud platform for business analytics called SAS Viya, and to build a strong grip on AI and ML, they have introduced new tools and products.

Data Pine

Datapine is an analytics tool used for BI and was founded back in 2012 in Berlin, Germany. In a short period of time, it has gained much popularity in a number of countries, and it's mainly used for data extraction (for small and medium companies fetching data for close monitoring). With the help of its enhanced UI design, anyone can visit and check the data as per their requirements. It is offered in 4 different price brackets, starting from $249 per month, with dashboards by function, industry, and platform.

Rapid Miner

It's a fully automated visual workflow design tool used for data analytics. It's a no-code platform, so users aren't required to write code to work with data. Today, it is heavily used in many industries such as ed-tech, training, research, etc. Though it's an open-source platform, it has a limitation of 10,000 data rows and a single logical processor. With the help of RapidMiner, one can easily deploy ML models to the web or mobile (once the user interface is ready to collect real-time figures).
ANALYTICS VS REPORTING

ANALYTICS: Analytics is the method of examining and analyzing summarized data to make business decisions.
REPORTING: Reporting is an action that includes all the needed information and data, put together in an organized way.

ANALYTICS: Questioning the data, understanding it, investigating it, and presenting it to the end users are all part of analytics.
REPORTING: Identifying business events, gathering the required information, and organizing, summarizing, and presenting existing data are all part of reporting.

ANALYTICS: The purpose of analytics is to draw conclusions based on data.
REPORTING: The purpose of reporting is to organize the data into meaningful information.

ANALYTICS: Analytics is used by data analysts, scientists, and business people to make effective decisions.
REPORTING: Reporting is provided to the appropriate business leaders to perform effectively and efficiently within a firm.
