Unit 1 Data Science Notes
Introduction to Data Science – Applications - Data Science Process – Exploratory Data analysis –
Collection of data – Graphical presentation of data – Classification of data – Storage and retrieval of
data – Big data – Challenges of Conventional Systems - Web Data – Evolution Of Analytic
Scalability - Analytic Processes and Tools - Analysis vs Reporting.
Data Science is used in many industries in the world today, e.g. banking, consultancy, healthcare, and
manufacturing.
Data Science can be applied in nearly every part of a business where data is available. Examples are:
Consumer goods
Stock markets
Industry
Politics
Logistic companies
E-commerce
Data: It refers to the raw information that is collected, stored, and processed. In today’s digital age,
enormous amounts of data are generated from various sources such as sensors, social media,
transactions, and more. This data can come in structured formats (e.g., databases) or unstructured
formats (e.g., text, images, videos).
Structured Data
Structured data is organized and easier to work with.
Unstructured Data
Unstructured data is not organized. We must organize the data for analysis purposes.
Science: It refers to the systematic study and investigation of phenomena using scientific methods and
principles. Science involves forming hypotheses, conducting experiments, analyzing data, and drawing
conclusions based on evidence.
Applications
1. In Search Engines
The most useful application of Data Science is in search engines. When we want to search for something on the internet, we mostly use search engines like Google, Yahoo, DuckDuckGo, Bing, etc. Data Science is used to return faster and more relevant search results.
For example, when we search for something such as "Data Structure and algorithm courses", one of the first links shown is often GeeksforGeeks courses. This happens because the GeeksforGeeks website is visited most often for information regarding Data Structure courses and computer-related subjects. This analysis is done using Data Science, which ranks the most-visited web links at the top.
2. In Transport
Data Science has also entered real-time applications in the transport field, such as driverless cars. With the help of driverless cars, it becomes possible to reduce the number of accidents.
For example, in driverless cars, training data is fed into the algorithm, and with the help of Data Science techniques the data is analyzed: what the speed limit is on highways, busy streets, narrow roads, etc., and how to handle different situations while driving.
3. In Finance
Data Science plays a key role in the financial industry, which constantly faces issues of fraud and risk of losses. Financial companies therefore need to automate risk-of-loss analysis in order to make strategic decisions. They also use Data Science analytics tools to predict the future, allowing companies to estimate customer lifetime value and anticipate stock market moves.
For example, in the stock market, Data Science is used to examine past behavior with past data, with the goal of estimating future outcomes. The data is analyzed in such a way that it becomes possible to predict future stock prices over a set time frame.
4. In E-Commerce
E-commerce websites like Amazon, Flipkart, etc. use Data Science to create a better user experience with personalized recommendations.
For example, when we search for something on an e-commerce website, we get suggestions similar to our past choices, as well as recommendations based on the most-bought, most-rated, and most-searched products. This is all done with the help of Data Science.
5. In Health Care
In the healthcare industry, Data Science acts as a boon. Data Science is used for:
Detecting tumors.
Drug discovery.
Medical image analysis.
Virtual medical bots.
Genetics and genomics.
Predictive modeling for diagnosis, etc.
6. Image Recognition
Data Science is also used in image recognition. For example, when we upload a picture with a friend on Facebook, Facebook suggests tagging the people in the picture. This is done with the help of machine learning and Data Science: when an image is recognized, the faces in the picture are compared against the profiles of one's Facebook friends, and if a match is found, Facebook suggests auto-tagging.
7. Targeted Recommendations
Targeted recommendation is one of the most important applications of Data Science. Whatever a user searches for on the internet, they will then see related posts almost everywhere. This can be explained with an example: suppose I want a mobile phone, so I search for it on Google, and then change my mind and decide to buy it offline. Data Science helps the companies that pay for advertisements for that phone: everywhere on the internet, on social media, on websites, and in apps, I will see recommendations for the mobile phone I searched for, which nudges me to buy it online after all.
8. Airline Route Planning
With the help of Data Science, the airline sector is also growing; for example, it becomes easier to predict flight delays. It also helps decide whether to fly directly to the destination or take a halt in between; for instance, a flight can take a direct route from Delhi to the U.S.A. or halt somewhere in between before reaching the destination.
9. Data Science in Gaming
In most games where a user plays against a computer opponent, Data Science concepts are used together with machine learning, so that with the help of past data the computer improves its performance. Many games, such as chess and EA Sports titles, use Data Science concepts.
10. Medicine and Drug Development
The process of creating a medicine is very difficult and time-consuming and has to be carried out with full discipline, because someone's life is at stake. Without Data Science, it takes a lot of time, resources, and money to develop a new medicine or drug; with the help of Data Science, it becomes easier because the probability of success can be estimated from biological data and other factors. Data-science-based algorithms can forecast how a drug will react in the human body before lab experiments are run.
Data Science Process
3. Now that you have the raw data, it's time to prepare it. This includes transforming the data from a raw form into data that's directly usable in your models. To achieve this, you'll detect and correct different kinds of errors in the data, combine data from different data sources, and transform it, as sketched below. If you have successfully completed this step, you can progress to data visualization and modeling.
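A minimal sketch of this preparation step using pandas; the file names and column names here are assumptions made only for illustration:

import numpy as np
import pandas as pd

# Hypothetical input files from two different sources.
customers = pd.read_csv('customers.csv')   # e.g. customer_id, age, city
orders = pd.read_csv('orders.csv')         # e.g. customer_id, amount

# Detect and correct errors: drop exact duplicates, fix obviously bad values.
customers = customers.drop_duplicates()
customers.loc[customers['age'] < 0, 'age'] = np.nan        # negative ages are errors
customers['age'] = customers['age'].fillna(customers['age'].median())

# Combine data from different sources.
combined = customers.merge(orders, on='customer_id', how='left')

# Transform: derive a feature that is directly usable in a model.
combined['total_spent'] = combined.groupby('customer_id')['amount'].transform('sum')
print(combined.head())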
5. Finally, we get to the sexiest part: model building (often referred to as "data modeling" throughout this book). It is now that you attempt to gain the insights or make the predictions stated in your project charter. Now is the time to bring out the heavy guns, but remember that research has taught us that often (but not always) a combination of simple models tends to outperform one complicated model, as the sketch below illustrates. If you've done this phase right, you're almost done.
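A minimal sketch of combining simple models, assuming scikit-learn is available; the dataset and model choices here are illustrative, not part of these notes:

from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative dataset; any tabular classification data would do.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two simple models combined by majority (hard) voting.
ensemble = VotingClassifier(estimators=[
    ('logreg', LogisticRegression(max_iter=1000)),
    ('tree', DecisionTreeClassifier(max_depth=3)),
])
ensemble.fit(X_train, y_train)
print('Test accuracy:', ensemble.score(X_test, y_test))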
6. The last step of the data science model is presenting your results and automating the analysis, if needed. One goal of a project is to change a process and/or make better decisions. You may still need to convince the business that your findings will indeed change the business process as expected. This is where you can shine in your influencer role. The importance of this step is more apparent in projects on a strategic and tactical level. Certain projects require you to perform the business process over and over again, so automating the project will save time.
Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is a process of describing the data by means of statistical and
visualization techniques in order to bring important aspects of that data into focus for further analysis. This
involves inspecting the dataset from many angles, describing & summarizing it without making any
assumptions about its contents.
EDA is a significant step to take before diving into statistical modeling or machine learning, to ensure the
data is really what it is claimed to be and that there are no obvious errors. It should be part of data science
projects in every organization.
Learning Objectives
Learn what Exploratory Data Analysis (EDA) is and why it’s important in data analytics.
Understand how to look at and clean data, including dealing with single variables.
Summarize data using simple statistics and visual tools like bar plots to find patterns.
Ask and answer questions about the data to uncover deeper insights.
Use Python libraries like pandas, NumPy, Matplotlib, and Seaborn to explore and visualize data.
What is Exploratory Data Analysis?
Exploratory Data Analysis (EDA) is like exploring a new place. You look around, observe things, and try to
understand what’s going on. Similarly, in EDA data science, you look at a dataset, check out the different
parts, and try to figure out what’s happening in the data. It involves using statistics and visual tools to
understand and summarize data, helping data scientists and data analysts inspect the dataset from various
angles without making assumptions about its contents.
Here’s a Typical Process
Look at the Data: Gather information about the data, such as the number of rows and columns, and
the type of information each column contains. This includes understanding single variables and their
distributions.
Clean the Data: Fix issues like missing or incorrect values. Preprocessing is essential to ensure the
data is ready for analysis and predictive modeling.
Make Summaries: Summarize the data to get a general idea of its contents, such as average values,
common values, or value distributions. Calculating quantiles and checking for skewness can provide
insights into the data’s distribution.
Visualize the Data: Use interactive charts and graphs to spot trends, patterns, or anomalies. Bar
plots, scatter plots, and other visualizations help in understanding relationships between variables.
Python libraries like pandas, NumPy, Matplotlib, Seaborn, and Plotly are commonly used for this
purpose.
Ask Questions: Formulate questions based on your observations, such as why certain data points
differ or if there are relationships between different parts of the data.
Find Answers: Dig deeper into the data to answer these questions, which may involve further
analysis or creating models, including regression or linear regression models.
For example, in Python, you can perform EDA techniques by importing necessary libraries, loading your
dataset, and using functions to display basic information, summary statistics, check for missing values, and
visualize distributions and relationships between variables. Here’s a basic example:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load your dataset (the file name and column name are placeholders)
data = pd.read_csv('your_dataset.csv')
print(data.info())          # basic information
print(data.isnull().sum())  # missing values

# Visualize distributions
sns.histplot(data['column_name'])
plt.show()
Why is Exploratory Data Analysis Important?
Exploratory Data Analysis (EDA) is an essential step in the data analysis process. It involves analyzing and
visualizing data to understand its main characteristics, uncover patterns, and identify relationships between
variables. Python offers several libraries that are commonly used for EDA, including pandas, NumPy,
Matplotlib, Seaborn, and Plotly.
EDA is crucial because raw data is usually skewed, may have outliers, or may contain too many missing values. A model built on such data results in sub-optimal performance. In the hurry to get to the machine learning stage, some data professionals either entirely skip the EDA process or do a very mediocre job. This is a mistake with many implications, including:
Generating Inaccurate Models: Models built on unexamined data can be inaccurate and unreliable.
Using Wrong Data: Without EDA, you might be analyzing or modeling the wrong data, leading to
false conclusions.
Inefficient Resource Use: Inefficiently using computational and human resources due to lack of
proper data understanding.
Improper Data Preparation: EDA helps in creating the right types of variables, which is critical
for effective data preparation.
In this article, we’ll be using Pandas, Seaborn, and Matplotlib libraries of Python to demonstrate various
EDA techniques applied to Haberman’s Breast Cancer Survival Dataset. This will provide a practical
understanding of EDA and highlight its importance in the data analysis workflow.
Types of EDA Techniques
Before diving into the dataset, let’s first understand the different types of Exploratory Data Analysis (EDA)
techniques. Here are five key types of EDA techniques:
Univariate Analysis: Univariate analysis examines individual variables to understand their
distributions and summary statistics. This includes calculating measures such as mean, median,
mode, and standard deviation, and visualizing the data using histograms, bar charts, box plots, and
violin plots.
Bivariate Analysis: Bivariate analysis explores the relationship between two variables. It uncovers
patterns through techniques like scatter plots, pair plots, and heatmaps. This helps to identify
potential associations or dependencies between variables.
Multivariate Analysis: Multivariate analysis involves examining more than two variables
simultaneously to understand their relationships and combined effects. Techniques such as contour
plots, and principal component analysis (PCA) are commonly used in multivariate EDA.
Visualization Techniques: EDA relies heavily on visualization methods to depict data distributions,
trends, and associations. Various charts and graphs, such as bar charts, line charts, scatter plots, and
heatmaps, are used to make data easier to understand and interpret.
Outlier Detection: EDA involves identifying outliers within the data—anomalies that deviate
significantly from the rest of the data. Tools such as box plots, z-score analysis, and scatter plots
help in detecting and analyzing outliers.
Statistical Tests: EDA often includes performing statistical tests to validate hypotheses or discern significant differences between groups. Tests such as t-tests, chi-square tests, and ANOVA add depth to the analysis by providing a statistical basis for the observed patterns; a short sketch of outlier detection and a t-test follows below.
By using these EDA techniques, we can gain a comprehensive understanding of the data, identify key
patterns and relationships, and ensure the data’s integrity before proceeding with more complex analyses.
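A minimal sketch of two of these techniques, z-score outlier detection and a two-sample t-test, assuming NumPy and SciPy are available; the data here is randomly generated purely for illustration:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
values = rng.normal(loc=50, scale=5, size=200)
values = np.append(values, [90, 95])   # inject two obvious outliers

# Outlier detection with z-scores: flag points more than 3 standard deviations from the mean.
z_scores = (values - values.mean()) / values.std()
print("Outliers:", values[np.abs(z_scores) > 3])

# Statistical test: two-sample t-test comparing two illustrative groups.
group_a = rng.normal(loc=50, scale=5, size=100)
group_b = rng.normal(loc=53, scale=5, size=100)
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")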
Dataset Description
The dataset used is an open source dataset and comprises cases from the exploratory data analysis
conducted between 1958 and 1970 at the University of Chicago’s Billings Hospital, focusing on the
survival of patients post-surgery for breast cancer. The dataset can be downloaded from here.
[Source: Tjen-Sien Lim (limt@stat.wisc.edu), Date: March 4, 1999]
Importing Libraries and Loading Data
Import all necessary packages:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
Load the dataset in pandas dataframe:
data = pd.read_csv('haberman.csv', header=0)
data.columns = ['Age', 'Year', 'Nodes', 'Survival_Status']
Understanding Data
To understand the dataset, let’s just see the first few rows.
print(data.head())
Output:
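Patient Age
The observations below describe a histogram of patient age split by survival status. The plot code itself is not in the notes; a plausible sketch, mirroring the FacetGrid pattern used for the other variables below, would be:

sns.FacetGrid(data, hue="Survival_Status", height=5).map(sns.histplot, "Age", kde=True).add_legend()
plt.title('Distribution of Patient Age')
plt.xlabel('Patient Age')
plt.ylabel('Frequency')
plt.show()

Output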
Among all age groups, patients aged 40-60 years are the highest.
There is a high overlap between the class labels, implying that survival status post-operation cannot
be discerned from age alone.
Operation Year
sns.FacetGrid(data, hue="Survival_Status", height=5).map(sns.histplot, "Year", kde=True).add_legend()
plt.title('Distribution of Operation Year')
plt.xlabel('Operation Year')
plt.ylabel('Frequency')
plt.show()
Output
Similar to the age plot, there is a significant overlap between the class labels, suggesting that
operation year alone is not a distinctive factor for survival status.
Number of Positive Axillary Nodes
sns.FacetGrid(data, hue="Survival_Status", height=5).map(sns.histplot, "Nodes", kde=True).add_legend()
plt.title('Distribution of Positive Axillary Nodes')
plt.xlabel('Number of Positive Axillary Nodes')
plt.ylabel('Frequency')
plt.show()
Output
Patients with 4 or fewer axillary nodes mostly survived 5 years or longer.
Patients with more than 4 axillary nodes have a lower likelihood of survival compared to those with
4 or fewer nodes.
But we must back our observations with some quantitative measure. That's where Cumulative Distribution Function (CDF) plots come into the picture.
Cumulative Distribution Function (CDF)
CDF plots show the probability that a variable will take a value less than or equal to a specific value. They
provide a cumulative measure of the distribution.
counts, bin_edges = np.histogram(data[data['Survival_Status'] == 1]['Nodes'], density=True)
pdf = counts / sum(counts)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], cdf, label='CDF Survival status = Yes')
plt.legend()
plt.xlabel("Positive Axillary Nodes")
plt.ylabel("CDF")
plt.title('Cumulative Distribution Function for Positive Axillary Nodes')
plt.grid()
plt.show()
Output
Patients with 4 or fewer positive axillary nodes have about an 85% chance of surviving 5 years or
longer post-operation.
The likelihood decreases for patients with more than 4 axillary nodes.
Box Plots
Box plots, also known as box-and-whisker plots, summarize data using five key metrics: minimum, lower
quartile (25th percentile), median (50th percentile), upper quartile (75th percentile), and maximum. They
also highlight outliers.
plt.figure(figsize=(15, 4))
plt.subplot(1, 3, 1)
sns.boxplot(x='Survival_Status', y='Age', data=data)
plt.title('Box Plot of Age')
plt.subplot(1, 3, 2)
sns.boxplot(x='Survival_Status', y='Year', data=data)
plt.title('Box Plot of Operation Year')
plt.subplot(1, 3, 3)
sns.boxplot(x='Survival_Status', y='Nodes', data=data)
plt.title('Box Plot of Positive Axillary Nodes')
plt.show()
Output:
The patient age and operation year plots show similar statistics.
The isolated points in the positive axillary nodes box plot are outliers, which is expected in medical
datasets.
Violin Plots
Violin plots combine the features of box plots and density plots. They provide a visual summary of the data
and show the distribution’s shape, density, and variability.
plt.figure(figsize=(15, 4))
plt.subplot(1, 3, 1)
sns.violinplot(x='Survival_Status', y='Age', data=data)
plt.title('Violin Plot of Age')
plt.subplot(1, 3, 2)
sns.violinplot(x='Survival_Status', y='Year', data=data)
plt.title('Violin Plot of Operation Year')
plt.subplot(1, 3, 3)
sns.violinplot(x='Survival_Status', y='Nodes', data=data)
plt.title('Violin Plot of Positive Axillary Nodes')
plt.show()
Output:
The distribution of positive axillary nodes is highly skewed for the ‘yes’ class label and moderately
skewed for the ‘no’ label.
The majority of patients, regardless of survival status, have a lower number of positive axillary
nodes, with those having 4 or fewer nodes more likely to survive 5 years post-operation.
These observations align with our previous analyses and provide a deeper understanding of the data.
Bar Charts
Bar charts display the frequency or count of categories within a single variable, making them useful for
comparing different groups.
Survival Status Count
sns.countplot(x='Survival_Status', data=data)
plt.title('Count of Survival Status')
plt.xlabel('Survival Status')
plt.ylabel('Count')
plt.show()
Output
This bar chart shows the number of patients who survived 5 years or longer versus those who did
not. It helps visualize the class imbalance in the dataset.
Histograms
Histograms show the distribution of numerical data by grouping data points into bins. They help understand
the frequency distribution of a variable.
Age Distribution
data['Age'].plot(kind='hist', bins=20, edgecolor='black')
plt.title('Histogram of Age')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
Output
The histogram displays how the ages of patients are distributed. Most patients are between 40 and 60
years old.
Bi-variate Data Analysis
Bi-variate data analysis involves studying the relationship between two variables at a time. This helps in
understanding how one variable affects another and can reveal underlying patterns or correlations. Here are
some common methods for bi-variate analysis.
Pair Plot
A pair plot visualizes the pairwise relationships between variables in a dataset. It displays both the
distributions of individual variables and their relationships.
sns.set_style('whitegrid')
sns.pairplot(data, hue='Survival_Status')
plt.show()
Output
The pair plot shows scatter plots of each pair of variables and histograms of each variable along the
diagonal.
The scatter plots on the upper and lower halves of the matrix are mirror images, so analyzing one
half is sufficient.
The histograms on the diagonal show the univariate distribution of each feature.
There is a high overlap between any two features, indicating no clear distinction between the
survival status class labels based on feature pairs.
While the pair plot provides an overview of the relationships between all pairs of variables, sometimes it is
useful to focus on the relationship between just two specific variables in more detail. This is where the joint
plot comes in.
Joint Plot
A joint plot provides a detailed view of the relationship between two variables along with their individual
distributions.
sns.jointplot(x='Age', y='Nodes', data=data, kind='scatter')
plt.show()
Output
The scatter plot in the center shows no correlation between the patient’s age and the number of
positive axillary nodes detected.
The histogram on the top edge shows that patients are more likely to get operated on between the
ages of 40 and 60 years.
The histogram on the right edge indicates that the majority of patients had fewer than 4 positive
axillary nodes.
While joint plots and pair plots help visualize the relationships between pairs of variables, a heatmap can
provide a broader view of the correlations among all the variables in the dataset simultaneously.
Heatmap
A heatmap visualizes the correlation between different variables. It uses color coding to represent the
strength of the correlations, which can help identify relationships between variables.
sns.heatmap(data.corr(), cmap='YlGnBu', annot=True)
plt.show()
Output:
The heatmap displays Pearson’s R values, indicating the correlation between pairs of variables.
Correlation values close to 0 suggest no linear relationship between the variables.
In this dataset, there are no strong correlations between any pairs of variables, as most values are
near 0.
These bi-variate analysis techniques provide valuable insights into the relationships between different
features in the dataset, helping to understand how they interact and influence each other. Understanding
these relationships is crucial for building more accurate models and making informed decisions in data
analysis and machine learning tasks.
Multivariate Analysis
Multivariate analysis involves examining more than two variables simultaneously to understand their
relationships and combined effects. This type of analysis is essential for uncovering complex interactions in
data. Let’s explore several multivariate analysis techniques.
Contour Plot
A contour plot is a graphical technique that represents a 3-dimensional surface by plotting constant z slices,
called contours, in a 2-dimensional format. This allows us to visualize complex relationships between three
variables in an easily interpretable 2-D chart.
For example, let’s examine the relationship between patient’s age and operation year, and how these relate
to the number of patients.
sns.jointplot(x='Age', y='Year', data=data, kind='kde', fill=True)
plt.show()
Output
From the above contour plot, it can be observed that the years 1959–1964 witnessed more patients in
the age group of 45–55 years.
The contour lines represent the density of data points. Closer contour lines indicate a higher density
of data points.
The areas with the darkest shading represent the highest density of patients, showing the most
common combinations of age and operation year.
By utilizing contour plots, we can effectively consolidate information from three dimensions into a two-
dimensional format, making it easier to identify patterns and relationships in the data. This approach
enhances our ability to perform comprehensive multivariate analysis and extract valuable insights from
complex datasets.
3D Scatter Plot
A 3D scatter plot is an extension of the traditional scatter plot into three dimensions, which allows us to
visualize the relationship among three variables.
from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(data['Age'], data['Year'], data['Nodes'])
plt.show()
Most patients are aged between 40 to 70 years, with their surgeries predominantly occurring between
the years 1958 to 1966.
The majority of patients have fewer than 10 positive axillary lymph nodes, indicating that low node
counts are common in this dataset.
A few patients have a significantly higher number of positive nodes (up to around 50), suggesting
cases of more advanced cancer.
There is no strong correlation between the patient’s age or the year of surgery and the number of
positive nodes detected. Positive nodes are spread across various ages and years without a clear
trend.
Conclusion
In this article, we learned some common steps involved in exploratory data analysis. We also saw several types of charts and plots and what information each of them conveys. This is not all; I encourage you to play with the data, come up with different kinds of visualizations, and observe what insights you can extract from them.
COLLECTION OF DATA
Data Collection is the process of collecting information from relevant sources in order to find a solution to the
given statistical enquiry. Collection of Data is the first and foremost step in a statistical investigation.
The above table contains the production of rice in India in different years. It can be seen that these values vary from one year to another; therefore, they are known as variables. A variable is a quantity or attribute whose value varies from one investigation to another. In general, variables are represented by letters such as X, Y, or Z. In the above example, the years are represented by variable X and the production of rice by variable Y. The values of variables X and Y are the data from which an investigator or enumerator collects information regarding the trends of rice production in India.
Histograms
A histogram is similar to a bar graph, but it is based on the frequency of numerical values rather than their actual values. The data is organized into intervals (bins), and the bars represent the frequency of the values in each range; that is, a histogram counts how many values of the data lie in a particular range, as the short example below shows.
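A minimal sketch of this bin counting with NumPy and Matplotlib; the values are made up purely for illustration:

import numpy as np
import matplotlib.pyplot as plt

# Illustrative values only.
marks = np.array([35, 42, 47, 51, 55, 58, 61, 64, 68, 72, 75, 81, 88, 93])

# Count how many values fall into each 10-wide interval.
counts, bin_edges = np.histogram(marks, bins=range(30, 101, 10))
print(counts)      # frequency of values in each bin
print(bin_edges)   # the interval boundaries

# Draw the histogram with the same bins.
plt.hist(marks, bins=range(30, 101, 10), edgecolor='black')
plt.xlabel('Marks')
plt.ylabel('Frequency')
plt.show()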
Line Plot
A line plot displays data as points or check marks above a number line, showing the frequency of each value.
Advantages
It gives us a summary of the data which is easier to look at and analyze.
It saves time.
We can compare and study more than one variable at a time.
Disadvantages
It usually captures only one aspect of the data and ignores the others. For example, a bar graph does not represent the mean, median, and other statistics of the data.
Interpretation of graphs can vary based on individual perspectives, leading to subjective
conclusions.
Poorly constructed or misleading visuals can distort data interpretation and lead to incorrect
conclusions.
CLASSIFICATION OF DATA :
Classification of data is the process of sorting data into relevant categories so that it can be used or applied more efficiently. Classification makes it easy for the user to retrieve data. Data classification is important for data security and compliance, and for meeting different types of business or personal objectives. It is also a major requirement, as data must be easily retrievable within a specific period of time.
Types of Data Classification :
Data can be broadly classified into 3 types.
1. Structured Data :
Structured data is created using a fixed schema and is maintained in a tabular format. The elements in structured data are addressable, which makes analysis effective. It includes all data that can be stored in an SQL database in tabular form. Today, most such data is developed and processed in the simplest way to manage information.
Examples –
Relational data, Geo-location, credit card numbers, addresses, etc.
Consider an example of relational data: you have to maintain a record of students for a university, including the name of the student, ID of the student, address, and email of the student. To store the student records, the following relational schema and table can be used:
S_ID | S_Name | S_Address | S_Email
2. Unstructured Data :
Unstructured data is data that does not follow a pre-defined schema or any organized format. This kind of data does not fit the relational database model, which requires data in a pre-defined, organized form. Unstructured data is very important in the big data domain, and there are many platforms for managing and storing it, such as NoSQL databases.
Examples –
Word, PDF, text, media logs, etc.
3. Semi-Structured Data :
Semi-structured data is information that does not reside in a relational database but has some organizational properties that make it easier to analyze. With some processing it can be stored in a relational database, though this is hard for some kinds of semi-structured data; semi-structured formats exist to save space.
Example –
XML data.
Features of Data Classification :
The main goal of organizing data is to arrange it in such a form that it becomes readily available to users. Its basic features are as follows:
Homogeneity – The data items in a particular group should be similar to each other.
Clarity – There must be no confusion in the positioning of any data item in a particular group.
Stability – The classification should be stable, i.e. a new investigation should not change the established set of classes.
Elastic – One should be able to change the basis of classification as the purpose of classification
changes.
BIG DATA:
Data science is the study of data analysis using advanced technologies (machine learning, artificial intelligence, big data). It processes huge amounts of structured, semi-structured, and unstructured data to extract meaningful insights, from which patterns can be identified that are useful for making decisions: grabbing new business opportunities, improving a product or service, and ultimately growing the business. The data science process is used to make sense of big data, i.e. the huge amounts of data used in business.
WEB DATA:
Data that is sourced and structured from websites is referred to as "web data". Web Data Integration (WDI) is an extension and specialization of data integration that views the web as a collection of heterogeneous databases.
Examples of web data include online product reviews, social media posts, website traffic statistics, and search engine results.
Web content consists of several types of data: text, image, audio, video, etc. Content data is the set of facts that a web page is designed to convey. It can provide effective and interesting patterns about user needs. Text documents are related to text mining, machine learning, and natural language processing.
EVOLUTION OF ANALYTIC SCALABILITY
Modern In-Database Architecture:
Data from heterogeneous sources are collected, transformed, and loaded into a data warehouse for final analysis by decision makers.
The processing stays in the database where the data has been consolidated.
The data is presented in aggregated form for querying.
Queries from users are submitted to OLAP (online analytical processing) engines for execution.
Such in-database architectures are tested for their query throughput rather than transaction throughput as in traditional database environments.
More metadata is required for directing the queries, which helps in reducing the time taken for answering queries and hence increases the query throughput.
Moreover, the data in consolidated form is free from anomalies, since it is pre-processed before loading into the warehouse, and may be used directly for analysis.
Massively Parallel Processing (MPP)
It is a type of computing wherein the processing is done by many CPUs working in parallel to execute a single program. One of the most significant differences between Symmetric Multi-Processing (SMP) and Massively Parallel Processing is that with MPP, each of the many CPUs has its own memory, which helps prevent the hold-up a user may experience with SMP when all of the CPUs attempt to access the memory simultaneously.
The salient features of MPP systems are:
o Loosely coupled nodes
o Nodes linked together by a high-speed connection
o Each node has its own memory
o Disks are not shared; each is attached to only one node (shared-nothing architecture)
Cloud Computing:
Cloud computing is the delivery of computing services over the Internet.
Examples of cloud services include online file storage, social networking sites, webmail, and online business applications.
The cloud computing model allows access to information and computer resources from anywhere that a network connection is available.
Cloud computing provides a shared pool of resources, including data storage space, networks, computer processing power, and specialized corporate and user applications.
McKinsey and Company has indicated the following as characteristic features of the cloud:
1. Mask the underlying infrastructure from the user
2. Be elastic to scale on demand
3. Operate on a pay-per-use basis
The National Institute of Standards and Technology (NIST) lists five essential characteristics:
1. On-demand self-service
2. Broad network access
3. Resource pooling
4. Rapid elasticity
5. Measured service
There are two types of cloud environment:
1. Public Cloud:
o The services and infrastructure are provided off-site over the internet
o Less secure and more vulnerable than private clouds
2. Private Cloud:
o Infrastructure operated solely for a single organization
o Offers the greatest level of security and control
Grid Computing:
Grid computing is a form of distributed computing whereby a "super and virtual computer" is composed of a cluster of networked, loosely coupled computers acting in concert to perform very large tasks.
Grid computing (Foster and Kesselman, 1999) is a growing technology that facilitates the execution of large-scale, resource-intensive applications on geographically distributed computing resources.
It facilitates flexible, secure, coordinated large-scale resource sharing among dynamic collections of individuals, institutions, and resources.
Distributed or grid computing in general is a special type of parallel computing that relies on complete computers connected to a network by a conventional network interface (commodity hardware), compared to the lower efficiency of designing and constructing a small number of custom supercomputers.
Disadvantage of Grid Computing:
The various processors and local storage areas do not have high-speed connections.
Hadoop:
Apache Hadoop is an open-source software framework for storage and large-scale processing of data-sets on clusters of commodity hardware.
Two main building blocks inside this runtime environment are MapReduce and the Hadoop Distributed File System (HDFS).
MapReduce:
Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner.
The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically, both the input and the output of the job are stored in a file system.
The framework takes care of scheduling tasks, monitoring them, and re-executing the failed tasks. A tiny sketch of the map/shuffle/reduce idea follows below.
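A minimal pure-Python sketch of the map, shuffle/sort, and reduce phases using the classic word-count example; this only illustrates the programming model, not the actual Hadoop framework:

from collections import defaultdict

documents = ["big data needs big tools", "hadoop stores big data"]   # illustrative input chunks

# Map phase: each chunk independently emits (word, 1) pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle/sort phase: group all values by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: combine the values for each key.
reduced = {word: sum(counts) for word, counts in grouped.items()}
print(reduced)   # e.g. {'big': 3, 'data': 2, ...}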
HDFS:
HDFS stands for Hadoop Distributed File System.
HDFS is one of the core components of the Hadoop framework and is responsible for the storage aspect.
Unlike the usual storage available on our computers, HDFS is a distributed file system, and parts of a single large file can be stored on different nodes across the cluster.
HDFS is a distributed, reliable, and scalable file system.
Big data is the storage and analysis of large data sets. These are complex data sets which can be either structured or unstructured. They are so large that it is not possible to work on them with traditional analytical tools. These days, organizations are realising the value they get out of big data analytics, and hence they are deploying big data tools and processes to bring more efficiency to their work environments.
Big Data Tools
R-Programming
R is a free, open-source programming language and software environment for statistical computing and graphics. It is used by data miners for developing statistical software and for data analysis. It has become a highly popular tool for big data in recent years.
Tableau Public
Tableau is another popular big data tool. It is simple and very intuitive to use. It communicates the insights of
the data through data visualisation. Through Tableau, an analyst can check a hypothesis and explore the data
before starting to work on it extensively.
Datawrapper
It is an online data visualization tool for making interactive charts. You upload your data file in CSV, PDF, or Excel format, or paste it directly into the field. Datawrapper then generates a visualization in the form of a bar chart, line chart, map, etc. It can be embedded into any other website as well. It is easy to use and produces visually effective charts.
APACHE Hadoop
It’s a Java-based open-source platform that is used to store and process big data. It is built on a cluster system that allows the data to be processed efficiently and in parallel. It can process both structured and unstructured data distributed from one server across multiple computers. Hadoop also offers cross-platform support for its users. Today, it is one of the most widely used big data analytics tools and is popular among many tech giants such as Amazon, Microsoft, IBM, etc.
Cassandra
APACHE Cassandra is an open-source NoSQL distributed database that is used to handle large amounts of data. It is one of the most popular tools for data analytics and has been praised by many tech companies for its high scalability and availability without compromising speed and performance. It is capable of delivering thousands of operations every second and can handle petabytes of data with almost zero downtime. It was created by Facebook in 2008 and later released publicly.
Qubole
It’s an open-source big data tool that helps in fetching data along the value chain using ad-hoc analysis and machine learning. Qubole is a data lake platform that offers an end-to-end service, reducing the time and effort required to move data pipelines. It is capable of configuring multi-cloud services such as AWS, Azure, and Google Cloud. Besides, it also helps in lowering the cost of cloud computing by up to 50%.
Spark
APACHE Spark is another framework used to process data and perform numerous tasks on a large scale. It is also used to process data across multiple computers with the help of its distributed computing tools. It is widely used among data analysts as it offers easy-to-use APIs with simple data-pulling methods, and it is capable of handling multiple petabytes of data. Recently, Spark set a record by processing 100 terabytes of data in just 23 minutes, breaking Hadoop's previous world record of 71 minutes. This is why big tech giants are now moving towards Spark; it is highly suitable for ML and AI today.
Mongo DB
MongoDB is a free, open-source, document-oriented NoSQL database that stores data in flexible, JSON-like documents, making it well suited to handling large volumes of unstructured data.
SAS
Today it is one of the most widely used tools for statistical modeling among data analysts. Using SAS, a data scientist can mine, manage, extract, or update data in different variants from different sources. Statistical Analysis System (SAS) allows a user to access data in any format (SAS tables or Excel worksheets). Besides that, it also offers a cloud platform for business analytics called SAS Viya, and to strengthen its grip on AI & ML, SAS has introduced new tools and products.
Data Pine
Datapine is an analytical tool used for BI and was founded back in 2012 in Berlin, Germany. In a short period of time, it has gained much popularity in a number of countries, and it is mainly used for data extraction (for small-to-medium companies fetching data for close monitoring). With the help of its enhanced UI design, anyone can visit and check the data as per their requirements. It is offered in 4 different price brackets, starting from $249 per month, and provides dashboards by function, industry, and platform.
Rapid Miner
It’s a fully automated visual workflow design tool used for data analytics. It’s a no-code platform, and users aren’t required to write code to segregate data. Today, it is heavily used in many industries such as ed-tech, training, and research. Though it is an open-source platform, it has a limitation of 10,000 data rows and a single logical processor. With the help of RapidMiner, one can easily deploy ML models to the web or mobile (once the user interface is ready to collect real-time figures).
ANALYTICS VS REPORTING
ANALYTICS | REPORTING