
University of Abuja, Nigeria

Department of Computer Science


CSC 313: Introduction to Data Science

1
Course Information

• Lecturer: Mrs. Fatimah Binta Abdullahi (PhD)
• Contact Info: fatimah.abdullahi@uniabuja.edu.ng
• Office:
• Class Time: 9–11 am
• Venue:
Grading Policy
S/N Assessment Percentage

1 Class Activities (Assignments) 10

2 Python Exercise 10
3 R Exercise 10
4 Final Exam 70
Total 100
Do’s and Don’ts
• Do’s:
o Maintain absolute silence and listen attentively during lectures
o Ask questions

• Don’ts:
o Do not make noise during lectures
o Use of mobile phones for any purpose is not allowed
o Late coming is not allowed
The course covers:
• Introduction
• Fundamentals of data science
• Steps and scope of data science
• Introduction to big data
• Introduction to data mining
• Use cases in various domains
• Introduction to data analytics
• Introduction to R and Python
programming.
Module Objectives

What do you need?


• A general understanding of computer
systems.

What will you learn?


• Definitions and notions of data
science.
• How data science is related to other
disciplines.
• What skills data scientists need.
6
What is Data
Science?
Data Science is the area of study that involves
extracting insights from vast amounts of data
using various scientific methods, techniques,
algorithms, and processes.

 Data Science is an interdisciplinary field that allows


you to extract knowledge from structured or
unstructured data.

7
What is Data Science?
(2)

Data science is about extraction,
preparation, analysis, visualization, and
maintenance of information.

It is a cross-disciplinary field that uses


scientific methods and processes to draw
insights from data.

The data used for analysis comes from
different sources and is presented in
various formats.
8
What is Data Science? (3)

Data scientists must have basic
knowledge of mathematics, computer
programming, machine learning, visualization,
and statistics, which will help them:
• Solve complex data problems efficiently.
• Collect, clean, integrate, analyze, visualize,
and interact with data to create data
products. 9
What is Data Science? (4)

[Figure 1: Venn-style diagram with Data Science at the overlap of Machine Learning, Computer Science, Mathematics and Statistics, Traditional Software, Data Analytics, and Business / Domain Expertise]

Figure 1: Collaboration of data science with other branches of study
10
Road Map to Data Science
• Learn a Programming Language: Commonly used languages are Python and R.
• Mathematics and Statistics: Linear algebra, mean, median, mode, standard deviation, etc.
• Data Visualization: Involves charts, graphs, etc. Two Python libraries for this are Seaborn and Matplotlib.
• Machine Learning (ML): Know about basic ML methods such as linear regression, decision trees, SVM, KNN, and random forests.
• Project: Try to make projects with the help of Kaggle.

11
What is Data Science? (5)
 Data Science is about data, models, and evaluation.
 Data science can solve a variety of problems once we have
the right data and model.
[Figure: problem-solving cycle: Scientific, social or business problem → Collect & understand data → Clean and format data → Use data to create solution → back to the problem]
12
Significance of Data Science?

1. Data is the oil for today's world. With the right


tools, technologies, and algorithms, we can use
data and convert it into a business advantage.
2. Data Science can help you detect fraud using
advanced machine learning algorithms and
prevent any significant monetary losses.
3. Allows us to build intelligent capabilities into machines.
13
Significance of Data Science? (2)

4. It enables you to make better and faster


decisions.
5. Helps you to recommend the right product to
the right customer to enhance your business.

14
Data Science Life Cycle
Data science’s lifecycle consists of five distinct stages, each
with its tasks:

 Capture: This stage involves gathering raw structured


and unstructured data. This includes Data acquisition,
Data entry, Data extraction.

 Maintain: This stage covers taking the raw data and


putting it in a form that can be used. This includes Data
Warehousing, Data Cleansing, Data Processing, and Data
Architecture.

15
Data Science Lifecycle (2)
 Process: In this stage, the data scientists take the prepared
data and examine its patterns, ranges, and biases to
determine how useful it will be in predictive analysis using
Data Mining, Clustering/Classification, Data Modeling and
Data Summarization

 Analyze: This stage involves performing various analyses


of the data. This includes Exploratory/Confirmatory,
Predictive Analysis, Regression, Text Mining, Qualitative
Analysis.

 Communicate: In this final step, analysts prepare the


analyses in easily readable forms such as charts, graphs, and
reports. This includes Data Reporting, Data Visualization,
Business Intelligence, and Decision Making.
16
Evolution of Data Science: Growth & Innovation

 Data science was born from the idea of merging applied


statistics with computer science.

 The resulting field of study would use the extraordinary


power of modern computing.

 Scientists realized they could not only collect data
and solve statistical problems, but also use the data
to solve real-world problems and make reliable,
fact-driven predictions.

17
Evolution of Data Science: Growth & Innovation (2)
 1962: American mathematician John W. Tukey first
articulated the idea of data science.

 In his article “The Future of Data Analysis,” he


foresaw the emergence of a new field nearly two
decades before the first Personal computers.

 While Tukey was ahead of his time, he was not alone


in his early appreciation of what would come to be
known as data science.

18
Evolution of Data Science: Growth & Innovation(3)
 1977: Building on the theories and predictions of early
pioneers like Tukey and Naur, the International
Association for Statistical Computing (IASC) was established,
with the mission of linking traditional statistical methodology,
modern computer technology, and the knowledge of
domain experts to convert data into information and
knowledge.

 1980s and 1990s: Data science began taking more


significant strides with the emergence of the first
Knowledge Discovery in Databases (KDD) workshop
and the founding of the International Federation of
Classification Societies (IFCS).
19
Evolution of Data Science: Growth & Innovation (4)
 1994: Business Week published a story on the new
phenomenon of Database Marketing. It described the
process by which businesses were collecting and
leveraging enormous amounts of data to learn more about
their customers, competition, or advertising techniques.

 1990s and early 2000s: We can see that data science has
emerged as a recognized and specialized field. Several
data science academic journals began to circulate, and
data science proponents like Jeff Wu and William S.
Cleveland continued to help develop and expound upon
the necessity and potential of data science.

20
Evolution of Data Science: Growth & Innovation (5)
2000s: Technology made enormous leaps by providing
universal access to internet connectivity, communication,
and (of course) data collection.

2005: Big data enters the scene. With tech giants such as
Google and Facebook uncovering large amounts of data,
new technologies capable of processing them became
necessary. Hadoop rose to the challenge, and later Spark
and Cassandra made their debuts.

 2014: Due to the increasing importance of data and
organizations’ interest in finding patterns and making better
business decisions, demand for data scientists began to grow
dramatically in different parts of the world.
21
Evolution of Data Science: Growth & Innovation (6)

 2015: Machine learning, deep learning, and Artificial


Intelligence (AI) officially enter the realm of data
science.

 2018: New regulations in the field were one of the


biggest aspects in the evolution of data science.

 2020s: We saw additional breakthroughs in AI, machine


learning, and increasing demand for qualified
professionals in Big Data.

22
Data Science Components

23
Data Science Components(2)
 Data engineering is the process of organizing,
managing, and analyzing large amounts of
data. It is about building systems that will enable
the collection and usage of data.

 Advanced Computing: involves the use of cutting-


edge technologies and techniques to solve complex
problems in fields such as artificial intelligence,
machine learning, data science, cybersecurity, and
more.

24
Data Science Components (3)
 Statistics is the method of collecting and analyzing
numerical data in large quantities to get useful insights.
this is the most critical unit in data science.

 Visualization technique helps you to access huge


amounts of data in an easy-to-understand and visual
format.
 Domain expert is a person who is an authority within
a particular area (e.g. an accountant is an expert in the
domain of accountancy).

25
Data Science Jobs Roles
Most prominent Data Scientist job titles are:

26
Data Science Jobs Roles (2)
Data Scientist:
A data scientist is a professional who manages enormous
amounts of data and comes up with business visions by using
various tools, techniques, methodologies, algorithms, etc.
Role:
 Identifying data collection sources for business needs
 Processing, cleansing, and integrating data
 Automating data collection and management processes
 Using Data Science techniques/tools to improve processes

Languages
R, SAS, Python, SQL, Hive, Matlab, Pig, Spark

27
Data Science Jobs Roles (3)
Data Engineers: Build and test scalable Big Data
ecosystems for businesses so that the data scientists can run
their algorithms on systems that are stable and highly
optimized.
Also, he/she updates the existing systems with newer or
upgraded versions of the current technologies to improve the
efficiency of the databases.
Role:
 Design and maintain data management systems
 Data collection/acquisition and management
 Conducting primary and secondary research
Languages:
SQL, Hive, R, SAS, Matlab, Python, Java, Ruby, C++, and
Perl 28
Data Science Jobs Roles (4)
Data Analysts: are responsible for a variety of tasks
including visualization and processing of massive
amounts of data. They perform queries on the databases
from time to time. The most important skill of a data
analyst is optimization
Role:
 Extracting data from primary and secondary sources using
automated tools
 Developing and maintaining databases
 Performing data analysis and making reports with
recommendations
Languages:
R, Python, HTML, JS, C, C++, SQL
29
Data Science Jobs Roles (5)
Statistician: collects, analyses, and understands qualitative
and quantitative data by using statistical techniques,
theories, and methods.

Role:
 Collecting, analyzing, and interpreting data
 Analyzing data, assessing results, and predicting
trends/relationships using statistical
methodologies/tools
 Designing data collection processes

Languages:
SQL, R, Matlab, Tableau, Python
30
Data Science Jobs Roles (6)
Data Administrators: are responsible for the proper
functioning of all databases of an enterprise and grant or
revoke access to them for the employees of the company
depending on their requirements.

Role:
 Working on database software to store and manage
data
 Working on database design and development
 Implementing security measures for the database
 Preparing reports, documentation, and operating
manuals
Languages:
Ruby on Rails, SQL, Java, C#, and Python 31
Data Science Jobs Roles (7)
Business Analyst: They have a good understanding of how
data-oriented technologies work and how to handle large
volumes of data, they also separate the high-value data from
the low-value data.
Role:
 Understanding the business of the organization
 Conducting detailed business analysis – outlining
problems, opportunities, and solutions
 Working on improving existing business processes

Languages:
SQL, Tableau, Power BI, and Python

32
Data Science Jobs Roles (8)
Data Architects: create the blueprints for data management so
that the databases can easily be integrated, centralized, and protected
with the best security measures.
Role:
 Developing and implementing overall data strategy in line with
business/organization
 Identifying data collection sources in line with the data strategy
 Collaborating with cross-functional teams and stakeholders for
the smooth functioning of database systems
 Planning and managing end-to-end data architecture

Languages:
SQL, Power BI, Java, C#, and Python
33
Data Science Jobs Roles (9)
Data Analytics Manager: coordinates the different tasks
that must be completed by their team for a big data project

Role:
 Researching and creating effective methods to
collect data.
 Analyzing information.
 Recommending solutions to a business.

Languages:
R, Python, HTML, JS, C, C++, SQL

34
Applications of Data Science in Various Fields
1. Search Engines
The most useful application of data science is Search
Engines. As we know, when we want to search for something
on the internet, we mostly use search engines like Google,
Yahoo, Bing, etc. Data science is used to return relevant
search results faster.
2. Transport
Data science also contributes to the Transport field like the
use of Driverless Cars. With the help of driverless cars, it is
easy to reduce the number of accidents.
For example: for driverless cars, the training data is fed into
the algorithm, and with the help of data science techniques,
the algorithm then analyzes the speed limit on highways, busy
streets, narrow roads, etc.
35
Applications of data science in various fields(2)
3. Finance
Data science plays a key role in Financial Industries in the
area of fraud detection and minimizing the risk of losses.

Thus, Financial Industries need to automate risk of loss


analysis to carry out strategic decisions for the company.

These Financial Industries use data science analytics tools to


predict the future.

For Example, In the stock market, data science is used to


examine past behavior using historical data and predict future
outcomes.
36
Applications of data science in various fields (3)

4. E-Commerce: Websites like Amazon, Flipkart, Jumia,


Konga etc. use data science to make a better user
experience with personalized recommendations.

For Example, When we search for something on E-


commerce websites, we get suggestions similar to our
choices based on our past data and we get
recommendations according to most bought product, most
rated, most searched, etc. This is all done with the help of
data science.

37
Applications of data science in various fields (4)
5. Health Care:
In the Healthcare Industry, data science is used for:

 Detecting tumor.
 Drug discoveries.
 Medical Image Analysis.
 Virtual Medical Bots.
 Genetics and Genomics.
 Predictive Modeling for Diagnosis etc.

38
Applications of data science in various fields (5)
6. Image Recognition
Currently, data science is also used in Image Recognition.

For Example, When we upload images with our friends on


Facebook, Facebook gives suggestions for tagging who is in
the picture.
This is done with the help of machine learning and data
science.
When an image is recognized, the data analysis is done on
one’s Facebook friends, and after analysis, if the faces that
are present in the picture match with someone else’s profile,
then Facebook suggests auto-tagging.

39
Applications of data science in various fields (6)
8. Airline Route Planning
With the help of data science, the Airline Sector can now
easily predict flight delays. It helps decide whether to
directly land at a destination or take a halt in between. For
example, a flight from Delhi to the U.S. can either take a direct
route or make a stop in between.

9. Data Science in Gaming


In most games where a user plays with a computer
opponent, data science concepts are used with machine
learning to learn from past data to improve performance.
There are many games like Chess, Sports, etc. that use data
science concepts.
40
Applications of data science in various fields (7)
10. Medicine and Drug Development
The process of developing medicine is very difficult and
time-consuming because it is a matter of someone’s life.
Without data science, it takes lots of time, resources, and
finances to develop new Medicines or drugs. Data science
has made it easy to predict success rates based on biological
data or factors. Data science algorithms can forecast how
a drug will react with the human body without lab experiments.

11. In Delivery Logistics


Various logistics companies like DHL, FedEx, etc. make use
of data science. It helps these companies find the best route
for the shipment of products, time suited for delivery, mode
of transportation to the destination, etc.
41
Data Science Tasks

42
1. Business Requirements

43
2. Data Acquisition

44
3. Data Processing

45
4. Data Exploration

46
5. Modelling

47
6. Deployment

48
How does Data Science Work?
Collect → Analyze Data → Insights

Collect: Raw data is gathered from various sources that explain the business problem.

Analyze Data: Using various statistical analysis and machine learning approaches, data modelling is performed to get the optimum solutions that explain the business problem.

Insights: Actionable insights that will serve as a solution for the business problem, gathered through data science.

49
Consider An Example
 Suppose there is an organization that is working towards
finding out potential leads for the sales team.
 They can follow this approach to get optimal solutions using
data science

Collect Data → Analyze Data → Insights

Collect Data: Gather the previous data on the sales that were closed.

Analyze Data: Use statistical analysis to find out the patterns that were followed by the leads that were closed.

Insights: Use machine learning to get actionable insights for finding out potential leads.

50
Challenges of Data Science Technology

1. High variety of information & data is required


for accurate analysis
2. Inadequate data science talent pool available
3. Management does not provide financial support
for a data science team
4. Unavailability /difficult access to data

51
Challenges of Data Science Technology(2)

5) Data Science results not effectively used by


business decision makers
6) Privacy issues
7) If an organization is very small, it cannot support
a data science team

52
What is a Data Scientist’s salary?

 According to Glassdoor, the average salary of a Data
Scientist in the United States is $117,345 per year,
while it ranges from $86,000 to $157,000 per year
depending upon expertise, experience, skills,
and job nature.

53
What is a Data Scientist’s salary? (2)

 Average yearly salaries of different posts are:


• Senior Data Scientist: $137,000
• Data Analyst: $67,000
• Quantitative Analyst: $116,000
• Data Engineer: $117,000
• Machine Learning Engineer: $121,000

54
Summary
Data Science is the area of study that involves extracting insights from vast amounts
of data by the use of various scientific methods, algorithms, and processes.

Statistics, Visualization, Mathematics, and Machine Learning, are important Data


Science concepts.

Important Data Scientist job roles are: 1) Data Scientist 2) Data Engineer 3) Data
Analyst 4) Statistician 5) Data Architect 6) Data Admin 7) Business Analyst 8)
Data/Analytics Manager

R, SQL, Python, and SAS are essential data science tools

Important applications of Data science are 1) Internet Search 2) Recommendation


Systems 3) Image & Speech Recognition 4) Health 5) Online Price Comparison.

A high variety of information & data is the biggest challenge of Data Science
technology.

55
Exercises
1. What is data science? How does it relate to and
differ from statistics?

2. Identify three areas or domains in which data


science is being used and describe how.

3. Explain the data science life cycle.

4. What are the important roles in data science?

56
Reference
• Introduction to Data Science: A Beginner's Guide

57
Lecture Two: Methodology of Extracting

Knowledge From Big Datasets

58
Methodology of Extracting Knowledge From Big
Datasets
 Extracting knowledge from big datasets involves a variety
of techniques such as data collection, cleaning, analysis,
and visualization.

 Knowledge extraction is the creation of knowledge from


structured and unstructured sources.

 The resulting knowledge needs to be in a machine-


readable and machine-interpretable format and must
represent knowledge in a manner that facilitates
inferencing.

59
Methodology of Extracting Knowledge From Big Datasets(2)
 Although it is similar to information extraction and data warehousing,
the main criterion is that the extraction result goes beyond the
creation of structured information or the transformation into a
relational schema.

 It requires either the reuse of existing formal knowledge or the


generation of a schema based on the source data.

 The process of information extraction (IE) is used to


extract useful structured information from unstructured
data in the form of entities, relations, objects, events, and
many other types.

60
Methodology of Extracting Knowledge From Big Datasets(3)

Big data poses new challenges for IE techniques


with the rapid growth of multidimensional
unstructured data.

Traditional IE systems are inefficient in dealing


with this huge deluge of unstructured big data.

61
Knowledge Extraction Steps
Data selection: where data relevant to the analysis task are
retrieved from the database.

Data cleaning to remove noise and inconsistent data.

Data integration where multiple data sources may be


combined.

Data transformation, where data are transformed into forms
appropriate for mining by performing summary or aggregation
operations, and data reduction to obtain a smaller representation
of the original data without sacrificing its integrity.

62
Knowledge Extraction Steps (2)

Data mining is an essential process where intelligent


methods are applied to extract data patterns.

Pattern evaluation to identify the interesting patterns


representing knowledge based on interestingness
measures.

Knowledge presentation where visualization and


knowledge representation techniques are used to
present mined knowledge to users.

63
Knowledge Discovery In Datasets

64
Tools for Data Science

65
Tools for Data Science

There are many tools and platforms available for data


science. Here are some of the most popular ones.

Apache Spark and Hadoop: These are open-source big


data processing frameworks that can handle large-scale data
processing and analytics.

Microsoft Power BI: This is a business analytics service


that provides interactive visualizations and business
intelligence capabilities with an interface for end-users to
create their reports and dashboards.

66
Tools for Data Science (2)
Jupyter Notebook: This is an open-source web application
that allows you to create and share documents that contain
live code, equations, visualizations, and narrative text.

RapidMiner: This is a data science platform that provides


an integrated environment for data preparation, machine
learning, deep learning, text mining, and predictive
analytics.

Apache Hadoop: This is an open-source software


framework that allows for distributed storage and
processing of large datasets.

67
Tools for Data Science (3)

Alteryx: This is a self-service data analytics platform that


provides a drag-and-drop interface for advanced analytics,
and data science.
Python: This is a popular programming language that is
widely used in data science for data analysis, machine
learning, and visualization.
KNIME: This is an open-source data analytics platform that
provides a visual interface for creating data science
workflows.
R programming is an open-source programming language
that is widely used as a statistical software and data analysis
tool.
68
Hands-On Example: Computational Thinking

Given the following numbers and tasked with finding the largest of
them.
7, 24, 62, 11, 4, 39, 42, 5, 97, 54.

Let’s try doing it systematically by looking at two numbers at a time.


The first two numbers are 7 and 24.
Pick the larger of them, which is 24. Now take that and compare it with
the next number, 62. Is 62 larger than 24? Yes, so 62 is now our largest
number. The next number is 11. Is it larger than the largest number
62? No. If you continue this process, you will end up with 97 as the
largest.
A complex problem (looking through 10 numbers) is broken into a set
of small problems (comparing two numbers at a time). This process is
called decomposition, which is identifying small steps to solve a
large problem.
69
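As a rough Python sketch of the decomposition described above (comparing two numbers at a time and keeping track of the largest value seen so far); the variable names are ours:

    numbers = [7, 24, 62, 11, 4, 39, 42, 5, 97, 54]

    # Start by treating the first number as the provisional largest value.
    largest = numbers[0]

    # Compare two numbers at a time: the current largest and the next number.
    for n in numbers[1:]:
        if n > largest:
            largest = n  # the new number wins the pairwise comparison

    print(largest)  # prints 97, matching the manual walk-through above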
Hands-On Example: Analyzing Data (2)
The dataset is an average height and weight for American women.
The dataset contains a sample of 15 observations.
Observation   Height (Inches)   Weight (lbs)
1             58                115
2             59                117
3             60                120
4             61                123
5             62                126
6             63                129
7             64                132
8             65                135
9             66                139
10            67                142
11            68                146
12            69                150
13            70                154
14            71                159
15            72                164

The height and weight numbers range from small to large, making it
easier to see the boundaries of this dataset: height ranges from 58 to
72 inches, and weight ranges from 115 to 164 pounds.
70
Hands-On Example: Analyzing Data (3)

We can compute average height by adding up the


numbers in the “Height” column and dividing by
15. This yields a value of 65.

We can conclude that the average height of an


American woman is 65 inches, based on the 15
observations.

 Similarly, the average weight is computed as 136


pounds in this case.

71
Hands-On Example: Analyzing Data (4)

[Figure: Visualization of height vs. weight data; x-axis: Height (inches), 58 to 72; y-axis: Weights (lbs), 0 to 180]

72
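A minimal matplotlib sketch (assuming matplotlib is installed) that reproduces a plot like the one described above from the 15 observations:

    import matplotlib.pyplot as plt

    heights = list(range(58, 73))  # 58 to 72 inches, one value per observation
    weights = [115, 117, 120, 123, 126, 129, 132, 135,
               139, 142, 146, 150, 154, 159, 164]

    plt.scatter(heights, weights)
    plt.xlabel("Height (inches)")
    plt.ylabel("Weights (lbs)")
    plt.title("Visualization of height vs. weight data")
    plt.show()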
Hands-On Example: Analyzing Data (5)

Question: On average, how much increase can we expect in


weight with an increase of one inch in height?

A simple method is to compute the differences in height (72


− 58 = 14 inches) and weight (164 − 115 = 49 pounds), then
divide the weight difference by the height difference, that is,
49/14, leading to 3.5

On average, a one-inch difference in height leads to a


difference of 3.5 pounds in weight.

On average, an increase of an inch in height results in an


increase of less than 4 pounds in weight for a height between
58 and 65 inches. 73
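A short Python sketch of the calculations above (the averages and the overall weight change per inch of height), using the same 15 observations:

    heights = list(range(58, 73))
    weights = [115, 117, 120, 123, 126, 129, 132, 135,
               139, 142, 146, 150, 154, 159, 164]

    avg_height = sum(heights) / len(heights)   # 65.0 inches
    avg_weight = sum(weights) / len(weights)   # about 136.7 pounds

    # Overall change in weight per inch of height across the whole range:
    # (164 - 115) / (72 - 58) = 49 / 14 = 3.5 pounds per inch.
    slope = (weights[-1] - weights[0]) / (heights[-1] - heights[0])

    print(avg_height, avg_weight, slope)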
Hands-On Example: Analyzing Data (6)

 For values of height greater than 65 inches, weight increases


more rapidly (by 4 pounds mostly until 70 inches, and 5 pounds
for more than 70 inches).

Question: What would you expect the weight to be of an American


woman who is 57 inches tall?

 We will have to extrapolate the data we have.

 We know from the lower range of height (less than the average of
65 inches), for each inch of height change, weight changes by
about 3 pounds.

74
Hands-On Example: Analyzing Data (7)
 For someone who is 58 inches in height, the
corresponding weight is 115 pounds.

 If we deduct an inch from the height, we should


deduct 3 pounds from the weight to have 112
pounds.

What would you expect the weight of someone who


is 73 inches tall to be?

The correct estimate is 169 pounds.

75
Problem One
# Insurance provider Rating (out of 10)
1 GEICO 4.7
2 GEICO 8.3
3 GEICO 9.2
4 Progressive 7.4
5 Progressive 6.7
6 Progressive 8.9
7 USAA 3.8
8 USAA 6.3
9 USAA 8.1

The table contains an imaginary dataset of auto insurance providers and


their ratings as provided by the latest three customers. Which of the auto
insurance providers would you opt for based on these ratings?

76
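One simple way to compare the providers in Problem One is to average the three ratings for each; a small Python sketch (data copied from the table above):

    ratings = {
        "GEICO":       [4.7, 8.3, 9.2],
        "Progressive": [7.4, 6.7, 8.9],
        "USAA":        [3.8, 6.3, 8.1],
    }

    # Average the latest three customer ratings for each insurance provider.
    for provider, values in ratings.items():
        print(provider, round(sum(values) / len(values), 2))

    # GEICO 7.4, Progressive 7.67, USAA 6.07 -> Progressive has the highest average.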
Problem Two
The table presents a movie review dataset from the past. It consists of three
attributes: the movie name, the leading actor in the movie, and its IMDB
rating. [Note: assume a better rating means a more watchable movie.]

How do you predict which of these actor’s movies you should watch when
a new one is released?

Leading actor Movie name IMDB rating (out of 10)


Irfan Khan Knock Out 6.0
Irfan Khan New York 6.8
Irfan Khan Life in a … metro 7.4
Anupam Kher Striker 7.1
Anupam Kher Dirty Politics 2.6
Anil Kapoor Calcutta Mail 6.0
Anil Kapoor Race 6.6

77
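Similarly, for Problem Two, a hedged pandas sketch (assuming pandas is installed) that averages past IMDB ratings per leading actor as a naive predictor of how watchable a new release might be:

    import pandas as pd

    movies = pd.DataFrame({
        "actor": ["Irfan Khan", "Irfan Khan", "Irfan Khan",
                  "Anupam Kher", "Anupam Kher",
                  "Anil Kapoor", "Anil Kapoor"],
        "movie": ["Knock Out", "New York", "Life in a ... metro",
                  "Striker", "Dirty Politics",
                  "Calcutta Mail", "Race"],
        "imdb":  [6.0, 6.8, 7.4, 7.1, 2.6, 6.0, 6.6],
    })

    # Average past rating per leading actor, highest first.
    print(movies.groupby("actor")["imdb"].mean().sort_values(ascending=False))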
Lecture three: What will you learn?

Definition of Data
Data types
Major data sources
Formats.

78
What is Data?
 Data is a collection of facts, figures, or statistics that can be
analyzed to gain insights and knowledge about a particular
subject.

 It can be in the form of numbers, text, images, audio, or video.

 Data means anything one can measure or record

 Data is a very basic element of data science.

 There are different types of data.

79
What is Data?(2)
Data is the plural form; the singular is datum. Data is divided
into two types:

Categorical or qualitative data is based on


descriptive information.
These types of data are collected through focus
groups, interviews, opened ended questionnaire
etc.

 Numerical or quantitative data is based on


numerical information or terms.
80
What is Data?(3)
 Categorical or qualitative data is divided into 3 types:

 Binomial data are variable data with only two options eg


good or bad, true or false.

 Nominal or Unordered data are variable data with no natural
order between the possible values, for example red, green, man.

 Ordinal data is a variable data with proper order eg short,


medium and long.

81
What is Data? (4)

Numerical or quantitative data is divided into 2
types:

Discrete data is data that is countable eg Number


of children.

Continuous data is data that can be measured eg


height, width, and length, and has an infinite
number of values: real numbers.

82
Why is Data Important?

 Data is important because it can help individuals and


organizations make informed decisions.

 It can solve problems, understand complex phenomena,


improve processes, and understand customers.

 Data can be used to identify patterns, trends, and


relationships that are not immediately apparent.

 Data can be used to predict future outcomes.

83
Why is Data Important?(2)

 In today’s world, data is a valuable asset for businesses


and organizations.

 It is essential to have the skills to collect, analyze and


interpret data to make informed decisions.

84
Data Sources

 Evolution of Technology has generated a lot of data,


where most of these data are not structured.

 Business - Web, E-commerce, Transactions, Stocks.

 Science - Remote Sensing, Bio informatics, Scientific


Simulation.

 Society and Everyone – News, Digital Cameras, You


Tube, Facebook, twitter, emails.

85
Data Sources (2)

 The Internet of Things (IoT) produces a huge amount of data, which is
measured in zettabytes (a unit of information equal to 2^70 bytes).

86
Data Sources (3)

Therefore, we cannot rely on the traditional data processing systems

87
Tabular Representation of Various Memory Sizes

Name Equal To Size (In Bytes)


Byte       8 Bits            1
Kilobyte   1,024 Bytes       1,024
Megabyte   1,024 Kilobytes   1,048,576
Gigabyte   1,024 Megabytes   1,073,741,824
Terabyte   1,024 Gigabytes   1,099,511,627,776
Petabyte   1,024 Terabytes   1,125,899,906,842,624
Exabyte    1,024 Petabytes   1,152,921,504,606,846,976
Zettabyte  1,024 Exabytes    1,180,591,620,717,411,303,424
Yottabyte  1,024 Zettabytes  1,208,925,819,614,629,174,706,176

88
Different Ways of Combining Data

We can perform two operations to combine information from


different data sets.
The first operation is joining: enriching an observation from
one table with information from another table.
The second operation is appending or stacking: adding the
observations of one table to those of another table.

89
Different Ways of Combining Data (2)
Joining tables: allows one to combine the information of
one observation found in one table with the information
found in another table.

For example, the first table contains information about the


purchases of a customer and the other table contains
information about the region where your customer lives.

Joining the tables allows you to combine the information so


that you can use it for modelling.

90
Different Ways of Combining Data (3)

Joining two tables on the Item and Region keys

91
Different Ways of Combining Data(4)

 Appending tables is effectively adding observations from


one table to another table.

 One table contains the observations from the month of


January and the second table contains observations from
the month of February.

 The result of appending these tables is a larger one with


the observations from January as well as February.

92
Different Ways of Combining Data(5)

Appending data from tables is a common operation but requires an equal structure in the tables being
appended.

93
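A hedged pandas sketch of both operations (the tables and column names below are made-up examples, not the ones in the figures):

    import pandas as pd

    # Joining: enrich each purchase observation with region information,
    # matched on a shared key column.
    purchases = pd.DataFrame({"customer_id": [1, 2, 3],
                              "item": ["pens", "paper", "ink"]})
    regions = pd.DataFrame({"customer_id": [1, 2, 3],
                            "region": ["North", "South", "East"]})
    joined = purchases.merge(regions, on="customer_id", how="left")

    # Appending (stacking): add February's observations below January's.
    # Both tables must share the same structure (columns).
    january = pd.DataFrame({"customer_id": [1, 2], "amount": [10.0, 12.5]})
    february = pd.DataFrame({"customer_id": [3, 4], "amount": [9.0, 14.0]})
    stacked = pd.concat([january, february], ignore_index=True)

    print(joined)
    print(stacked)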
Basic Classification of Data
 Structured data are highly organized information that can be
seamlessly included in a database and readily searched via
simple search operations.

 It is concerned with all data stored in a database in the form


of a table with rows and columns.

 Structured data are the most processed and the simplest to


manage.

 It only represents 5 to 10% of all data

94
Basic Classification of Data(2)
 Semi-structured data doesn’t reside in a relational
database.

 It has some properties that make it easier to analyze.

 With some processes they can be stored in a relational


database.

 But like structured data, semi-structured data represents
only a small share of all data, about 5 to 10%.

95
Basic Classification of Data(3)

 Unstructured data is data devoid of any underlying


structure and it represents around 80% of data.
 It often includes text and multimedia content.

 Examples include e-mail messages, word processing


documents, videos, photos, audio files, presentations,
webpages and other kinds of business documents.

 Just as with structured data, unstructured data is either


machine generated or human generated.

96
Basic Classification of Data(4)
Open data: Data that is freely available in a
public domain and can be used by anyone as they
wish, without restrictions from copyright, patents,
or other mechanisms of control.

Outlier: A data point that is different in value


from the other data points in the sample

97
Lecture four: Scope of Data Science

What will you learn?

How to perform basic data cleaning, Pre-


processing, Training and Testing

98
Scope of Data Science
Data collection is one of the most tasking steps in
the process of data science.

It is the process of accumulating data that’s


required to solve a problem statement.

The data collection process can be broken down


into several steps.

99
Data Collection Steps
Defining the aim of your research: This step involves
identifying the problem that needs to be solved and defining the
research question

Choosing your data collection method: This step involves


selecting the appropriate method for collecting data based on the
research question and the type of data required

Planning your data collection procedures: This step involves


creating a plan for collecting data, including identifying the
sources of data, the data collection tools, and the data collection
schedule

Collecting the data: This step involves executing the data


collection plan and collecting the data from various sources.
100
Methods of Data Collection

There are two main methods of data collection in research based on the
information that is required, namely:

  Primary Data Collection


  Secondary Data Collection

Primary Data Collection Methods


Primary data refers to data collected from first-hand experience
directly from the main source.

It refers to data that has never been used in the past.

The data gathered by primary data collection methods are generally


regarded as the best kind of data in research.

101
Methods of Data Collection(2)
The most common primary data collection methods:

1. Interviews
are direct methods of data collection.

It is simply a process in which the interviewer asks


questions, and the interviewee responds to them.

It provides a high degree of flexibility because questions


can be adjusted and changed anytime according to the
situation.

102
Methods of data collection(3)

2. Observations
In this method, researchers observe a situation around
them and record the findings.

It can be used to evaluate the behavior of different


people in controlled (everyone knows they are being
observed).

And uncontrolled (no one knows they are being


observed) situations.

103
Methods of data collection(4)

3. Surveys and Questionnaires

Surveys and questionnaires provide a broad


perspective from large groups of people.

They can be conducted face-to-face, mailed, or even


posted on the Internet to get respondents from
anywhere in the world.

104
Methods of data collection(5)
4. Focus Groups

A focus group is like an interview, but it is conducted with


a group of people who all have something in common.

The data collected is like in-person interviews, but they


offer a better understanding of why a certain group of
people thinks in a particular way.

105
Methods of data collection(6)
5. Oral Histories
Oral histories also involve asking questions like interviews and focus
groups.

It is a technique for generating and preserving original, historically


interesting information.

However, it is defined more precisely, and the data collected is


linked to a single phenomenon.

It involves collecting the opinions and personal experiences of people


in a particular event that they were involved in.

This method of interviewing is used to preserve the voices,


memories and perspectives of people in history.

106
Methods of data collection(7)
Secondary Data Collection Methods
Secondary data refers to data collected by someone else.

 It is much less expensive and easier to collect than primary


data.
Here are some of the most common secondary data collection
methods:
1. Internet

The use of the Internet has become one of the most popular
secondary data collection methods in recent times.

There is a large pool of free and paid research resources that can
be easily accessed on the Internet.
107
Methods of data collection(8)
2. Government Archives

There is lots of data available from government archives that


you can make use of.

The most important advantage is that the data in government


archives are authentic and verifiable.

The challenge, however, is that data is not always readily


available due to a number of factors.

108
Methods of data collection(9)

3. Libraries

Most researchers donate several copies of their academic


research to libraries.

You can collect important and authentic information based


on different research contexts.

109
Data Preparation
 Data preprocessing, a component of data preparation,
describes any type of processing performed on raw data.
 Data preprocessing transforms the data into a format that
is more easily and effectively processed in data mining,
machine learning and other data science tasks.
 The techniques are generally used at the earliest stages of
the machine learning and AI development pipeline to
ensure accurate results.
 There are several different tools and methods used for
preprocessing data, including the following:

110
Data Preparation(2)
The main activities of data preparation are:
 Data cleaning: fixing incomplete or erroneous data
 Data integration: unifying data from different sources
 Data transformation: formatting the data
 Data reduction: reducing data to its simplest form
 Data discretization: reducing the number of values to
make data management easier
 Feature engineering: selecting and transforming
variables to work better with machine learning

111
Data Preparation (3)
 Sampling, which selects a representative subset from a
large population of data;
 Transformation, which manipulates raw data to produce
a single input;
 Denoising, which removes noise from data;
 Imputation, which synthesizes statistically relevant data
for missing values
 Normalization, which organizes data for more efficient
access; and
 Feature extraction, which pulls out a relevant feature
subset that is significant in a particular context.

112
Data Preparation (4)
 Data preparation is a phase that may need to be revisited
multiple times throughout your project.

 Data comes from various sources and is usually unusable


in its raw state, as it often has corrupt and missing
attributes, conflicting values, and outliers.

 Data preparation resolves these issues and improves the


quality of your data, allowing it to be used effectively in
the modelling stage.

113
Data cleansing

 In data science, the first step after data collection is data


cleansing, which involves identifying and cleaning up any
incorrect or incomplete data sets.

 Data cleansing is critical to identify errors and


inconsistencies that can skew your data analysis and lead
to poor business decisions.

 The most important thing about data cleansing is that it’s


an ongoing process.

114
115
Methods of Data Cleaning

The three key methods that describe ways in


which data may be “cleaned,” or better
organized.

Data Munging
The process of converting data to something
more suitable for a computer to understand.

The approaches are manipulating or wrangling


(or munging) the data to turn it into something
that is more convenient or desirable.

This can be done manually, automatically, or, in
many cases, semi-automatically.
116
Methods of Data Cleaning (2)

Consider the following text recipe.

Add two diced tomatoes, three cloves of garlic,


and a pinch of salt in the mix. This can be turned
into a table
This table conveys the same information as the
text, but it is more analysis friendly.
Table 2.2: Wrangled data for a recipe

Ingredient   Quantity   Unit/size
Tomato       2          Diced
Garlic       3          Cloves
Salt         1          Pinch
117
Methods of Data Cleaning(3)

Handling Missing Data


Missing data are data that may be in the right format, but some of
the values are missing

Consider a table containing customer data in which


some home phone numbers are absent.

Strategies to combat missing data include ignoring
that record, using a global constant to fill in all
missing values, or imputation.
118
Methods of Data Cleaning(4)

Smooth Noisy Data


There are times when the data is not missing, but it
is corrupted for some reason

Data corruption may be as a result of faulty data


collection instruments, data entry problems, or
technology limitations.

119
Data Storage and Presentation
Depending on its nature, data is stored in various formats.

When data is structured, storing and presenting it in some


delimited way is common.

This means various fields and values of the data are separated


using delimiters, such as commas or tabs.

The most commonly used formats that store data as simple


text are comma-separated values (CSV) and tab-separated
values (TSV).

CSV: This format is the most common import and export format


for spreadsheets and databases. 120
Data Storage and Presentation (2)
Advantage of CSV

The CSV format is more generic and useful when sharing with almost
anyone.

Any spreadsheet program such as Microsoft Excel or Google Sheets


can readily open a CSV file and display it correctly most of the time.

Disadvantage of CSV
Since the comma is used to separate fields, if the data contains a comma
that could be problematic.

This could be addressed by escaping the comma (typically by adding a
backslash before it).

121
Data Storage and Presentation (3)
treat,before,after,diff
No Treatment,13,16,3
No Treatment,10,18,8
No Treatment,16,16,0
Placebo,16,13,-3
Placebo,14,12,-2
Placebo,19,12,-7
Seroxat (Paxil),17,15,-2
Seroxat (Paxil),14,19,5
Seroxat (Paxil),20,14,-6
Effexor,17,19,2
Effexor,20,12,-8
Effexor,13,10,-3 122
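As an illustration, the CSV rows above could be read in Python with the standard csv module; the file name treatments.csv is ours, assuming the rows have been saved to such a file:

    import csv

    # Read the comma-separated treatment data shown above.
    with open("treatments.csv", newline="") as f:
        for row in csv.DictReader(f):
            print(row["treat"], int(row["before"]), int(row["after"]), int(row["diff"]))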
Data Storage and Presentation (4)
 TSV (Tab Separated Values) files are used for raw data
and can be imported into and exported from spreadsheet
software. Instead of commas, the fields are separated
using tabs.

 Tab-separated values files are essentially text files, and


the raw data can be viewed by text editors.

 An advantage of the TSV format is that the delimiter (tab) rarely
appears in data values themselves; it would only have to be
escaped or removed if a tab character were present in a value.

123
Data Storage and Presentation (5)
On the other hand, TSV is less common than other delimited
formats such as CSV.

Name<TAB>Age<TAB>Address
Ryan<TAB>33<TAB>1115 W Franklin
Paul<TAB>25<TAB>Big Farm Way
Jim<TAB>45<TAB>W Main St
Samantha<TAB>32<TAB>28 George St.

where <TAB> denotes a TAB character

124
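The same csv module reads tab-separated files by changing the delimiter; people.tsv is an assumed file name holding the rows above (the header row is printed first):

    import csv

    # Tab-separated files only differ in the delimiter character.
    with open("people.tsv", newline="") as f:
        for name, age, address in csv.reader(f, delimiter="\t"):
            print(name, age, address)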
Data Preprocessing

 Data in the real world is often dirty; that is, it is in


need of being cleaned up before it can be used for a
desired purpose.

 This is often called data pre-processing. What


makes data “dirty”?

Incomplete. Data is incomplete:


 when some of the attribute values are lacking
 certain attributes of interest are lacking,
 attributes contain only aggregate data.
125
Data Preprocessing (2)
 Noisy data. Data is noisy when data contains errors or
outliers.

 For example, some of the data points in a dataset may


contain extreme values that can affect the dataset’s range.

 Inconsistent. Data is inconsistent if it contains


discrepancies in codes or names.

 For example, if the “Name” column for registration


records of employees contains values other than
alphabetical letters, or if records do not start with a
capital letter, discrepancies are present.
126
Data Integration
 Data from various sources commonly needs to be
integrated for efficient and effective data analyses.

The following steps describe how to integrate


multiple databases or files.

1. Combine data from multiple sources into a


coherent storage place (e.g., a single file or a
database).
2. Engage in schema integration or combining
metadata from different sources.
127
Data Integration (2)
3. Detect and resolve data value conflicts. For
example:

a. A conflict may arise; for instance, such as the


presence of different attributes and values from
various sources for the same real-world entity.

b. Reasons for this conflict could be different


representations or different scales; for example,
metric vs. British units.

128
Data Integration (3)
4. Address redundant data in data integration.
Redundant data is commonly generated in the
process of integrating multiple databases.
For example:

a. The same attribute may have different names in


different databases.
b. One attribute may be a “derived” attribute in
another table.
c. Correlation analysis may detect instances of
redundant data.
129
Data Transformation
 Data must be transformed so that it can be
consistent and readable (by a system).
 The following five processes may be used for
data transformation

1. Smoothing: Remove noise from data.


2. Aggregation: Summarization, data cube
construction.
3. Generalization: Concept hierarchy climbing.

130
Data Transformation (2)
4. Normalization: Scaling values to fall within a small, specified range.
Some of the techniques that are used for accomplishing
normalization are:
a. Min-max normalization (feature scaling): rescales values into a new
range, typically [0, 1].

We have min_A and max_A as the minimum and maximum values observed
for attribute A, and v_i is the value of attribute A that has to be
normalized. Min-max normalization maps v_i to v'_i in a new, smaller
range [new_min_A, new_max_A]. The formula for min-max normalization is:

v'_i = ((v_i - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A

131
Data Transformation (3)
For example, suppose $12,000 and $98,000 are the minimum and
maximum values for the attribute income, and [0.0, 1.0] is the range into
which we map a value of $73,600. The value of $73,600 will be
transformed using min-max normalization as follows:

v' = ((73,600 - 12,000) / (98,000 - 12,000)) * (1.0 - 0.0) + 0.0 = 0.716

b. Z-score normalization (standardization): This method normalizes
the value for attribute A using the mean and standard deviation. The
following formula is used for Z-score normalization:

v'_i = (v_i - Ā) / σ_A

132
Data Transformation (4)
Here Ā and σ_A are the mean and standard deviation for attribute A,
respectively. For example, if the mean and standard deviation for the
income attribute are $54,000 and $16,000, then the value $73,600 is
normalized using z-score normalization to (73,600 - 54,000) / 16,000 = 1.225.

c. Normalization by decimal scaling: This method normalizes the
value of attribute A by moving the decimal point in the value. The
movement of the decimal point depends on the maximum absolute value
of A. The formula for decimal scaling is given below:

v'_i = v_i / 10^j

133
Data Transformation (5)
Here j is the smallest integer such that max(|v'_i|) < 1. For
example, the observed values for attribute A range from -986 to 917,
and the maximum absolute value for attribute A is 986. To normalize
each value of attribute A using decimal scaling, divide each value by
1000, i.e. j = 3. So the value -986 would be normalized to -0.986, and
917 would be normalized to 0.917. (A short Python sketch of these three
normalization methods is given after item 5 below.)

 5. Attribute or feature construction: the process of


creating new features from the existing data attributes to
improve the predictive performance of machine learning
models. These new attributes can better capture the
underlying patterns in the data, making the model more
accurate and interpretable. 134
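A minimal Python sketch of the three normalization methods above, using the worked examples from the text (the function names are ours):

    def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
        # Map v from [min_a, max_a] into the new range [new_min, new_max].
        return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

    def z_score(v, mean_a, std_a):
        # Centre on the mean and scale by the standard deviation of attribute A.
        return (v - mean_a) / std_a

    def decimal_scaling(v, j):
        # Divide by 10**j, where j is the smallest integer with max(|v'|) < 1.
        return v / 10 ** j

    print(min_max(73600, 12000, 98000))     # income example: 0.716...
    print(z_score(73600, 54000, 16000))     # income example: 1.225
    print(decimal_scaling(-986, 3), decimal_scaling(917, 3))   # -0.986, 0.917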
Data Reduction
Data reduction is a key process in which a reduced
representation of a dataset produces the same or similar
analytical results.

One example of a large dataset that could warrant reduction


is a data cube.

Data cubes are multidimensional sets of data that can be


stored in a spreadsheet.

A data cube could be in two, three, or a higher dimension.


Each dimension typically represents an attribute of interest.

135
Data Reduction (2)
 The two most common techniques used for data reduction are:

 Data Cube Aggregation. The lowest level of a data cube is the


aggregated data for an individual entity of interest.

 To do this, use the smallest representation that is sufficient to


address the given task.

In other words, we reduce the data to its more meaningful size and
structure for the task at hand.

136
Data Compression
Data compression involves converting the structure of data
in a way that consumes less space. Data compression
involves building a compact representation of information
by removing redundancy and representing data in binary
form. Data that can be restored successfully from its
compressed form is called lossless compression .

137
Data Compression (2)

Lossless Compression: The encoding technique


allows a simple and minimal data size reduction.
Lossless data compression uses algorithms to
restore the precise original data from the
compressed data.

Lossy Compression: In lossy-data compression,


the decompressed data may differ from the
original data but are useful enough to retrieve
information from them.
138
Data Reduction (3)
 The Dimensionality Reduction method works
according to the nature of the data.
 Here, a dimension or a column in your data
spreadsheet is referred to as a “feature,” and this
process aims to identify the features to remove.
 This requires identifying redundancy in the given
data and/or creating features that could sufficiently
represent raw features.

Strategies for reduction include sampling, clustering,


principal component analysis etc.

139
Data Discretization
 Data Discretization is the process of converting
continuous values into more manageable parts

There are three types of attributes involved in


discretization:

a. Nominal: Values from an unordered set


b. Ordinal: Values from an ordered set
c. Continuous: Real numbers

140
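As a hedged illustration of discretization in Python with pandas (the example values and bin labels below are our own, not from the slides):

    import pandas as pd

    heights = pd.Series([58, 61, 64, 67, 70, 72])

    # Convert the continuous height values into three ordered bins.
    bins = pd.cut(heights, bins=3, labels=["short", "medium", "tall"])
    print(bins)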
141
Hands-On Example 2: Data Pre-processing
Now, let us use a sample dataset and walk through those stages step
by step.

For this example, we will use a modified version of a dataset of the


number of deaths from excessive wine consumption, available in
Table 2, to explain the pre-processing stages.

The dataset consists of the following attributes:


a. Name of the country from which the sample was obtained
b. Alcohol consumption is measured as liters of wine per capita
c. Number of deaths from alcohol consumption, per 100,000 people
d. Number of heart disease deaths, per 100,000 people
e. Number of deaths from liver diseases, also per 100,000 people.

Data Cleaning. In this stage, we will go through the following pre-


processing steps: 142
Table 2: Dataset for the number of deaths from excessive wine consumption
# Country Alcohol Deaths Heart Liver
1 Australia 2.5 785 211 15.30000019
2 Austria 3.000000095 863 167 45.59999847
3 Belg. and Lux. 2.900000095 883 131 20.70000076
4 Canada 2.400000095 793 NA 16.39999962
5 Denmark 2.900000095 971 220 23.89999962
6 Finland 0.800000012 970 297 19
7 France 9.100000381 751 11 37.90000153
8 Iceland -0.800000012 743 211 11.19999981
9 Ireland 0.699999988 1000 300 6.5
10 Israel 0.600000024 -834 183 13.69999981
11 Italy 27.900000095 775 107 42.20000076
12 Japan 1.5 680 36 23.20000076
13 Netherlands 1.799999952 773 167 9.199999809
14 New Zealand 1.899999976 916 266 7.699999809
15 Norway 0.0800000012 806 227 12.19999981
16 Spain 6.5 724 NA NA
17 Sweden 1.600000024 743 207 11.19999981
18 Switzerland 5.800000191 693 115 20.29999924
19 UK 1.299999952 941 285 10.30000019
20 US 1.200000048 926 199 22.10000038
21 West Germany 2.700000048 861 172 36.70000076

143
Hands-On Example 2: Data Pre-processing(2)
Smooth Noisy Data. We can see that the wine consumption value for
Iceland per capita is −0.800000012. However, wine consumption
values per capita cannot be negative. Therefore, it must be a faulty
entry and we should change the alcohol consumption for Iceland to
0.800000012. Using the same logic, the number of deaths in Israel
should be converted from −834 to 834.

• Handling Missing Data. As we can see in the dataset, we have


missing values (represented by NA – not available) of the number of
cases of heart disease in Canada and the number of cases of heart and
liver disease in Spain. A simple workaround for this is to replace all
the NAs with some common values, such as zero or average of all the
values for that attribute. Here, we are going to use the average of the
attribute for handling the missing values. So, for both Canada and
Spain, we will use the value of 185 as the number of heart diseases.
Likewise, the number of liver diseases in Spain is replaced by 20.27.
144
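A possible pandas sketch of these two cleaning steps, assuming the Table 2 data have been saved to a CSV file (the file name wine_deaths.csv is ours):

    import pandas as pd

    df = pd.read_csv("wine_deaths.csv")   # columns: Country, Alcohol, Deaths, Heart, Liver

    # Smooth noisy entries: consumption and death counts cannot be negative,
    # so treat the negative signs (Iceland, Israel) as faulty entries.
    df["Alcohol"] = df["Alcohol"].abs()
    df["Deaths"] = df["Deaths"].abs()

    # Handle missing data: replace NA values with the column average,
    # as done for Canada and Spain in the text.
    for col in ["Heart", "Liver"]:
        df[col] = df[col].fillna(df[col].mean())

    print(df.head())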
Table 3: Excessive wine consumption dataset after smoothing noisy values and handling missing data
# Country Alcohol Deaths Heart Liver
1 Australia 2.5 785 211 15.30000019
2 Austria 3.000000095 863 167 45.59999847
3 Belg. and Lux. 2.900000095 883 131 20.70000076
4 Canada 2.400000095 793 185 16.39999962
5 Denmark 2.900000095 971 220 23.89999962
6 Finland 0.800000012 970 297 19
7 France 9.100000381 751 11 37.90000153
8 Iceland 0.800000012 743 211 11.19999981
9 Ireland 0.699999988 1000 300 6.5
10 Israel 0.600000024 834 183 13.69999981
11 Italy 27.900000095 775 107 42.20000076
12 Japan 1.5 680 36 23.20000076
13 Netherlands 1.799999952 773 167 9.199999809
14 New Zealand 1.899999976 916 266 7.699999809
15 Norway 0.0800000012 806 227 12.19999981
16 Spain 6.5 724 185 20.27
17 Sweden 1.600000024 743 207 11.19999981
18 Switzerland 5.800000191 693 115 20.29999924
19 UK 1.299999952 941 285 10.30000019
20 US 1.200000048 926 199 22.10000038
21 West Germany 2.700000048 861 172 36.70000076
145
Hands-On Example 2: Data Pre-processing (3)
Data Integration. Now let us assume we have another dataset
(fictitious) collected from a different source, which is about alcohol
consumption and the number of related fatalities across various
states of India, as shown in Table 4
Table 4: Indian data sets
#   Name of the State   Alcohol consumption   Heart disease   Fatal alcohol-related accidents
1 Andaman and Nicobar Islands 3.73 20.312 2201
2 Andhra Pradesh 2.05 16,723 29,700
3 Arunachal Pradesh 1.98 13,109 11.251
4 Assam 1.91 8532 211,250
5 Bihar 3.21 12,372 375,000
6 Chhattisgarh 2.03 28,501 183,207
7 Goa 5.79 19,932 307,291
146
Hands-On Example 2: Data Pre-processing (4).

Here is what the dataset contains:

A. Name of the State.


B. Liters of alcohol consumed per capita.
C. Number of fatal heart diseases, measured per 1,000,000 people.
D. Number of fatal accidents related to alcohol per 1,000,000 people.
Now we can use this dataset to integrate the attributes for India into our original
dataset.

To do this, we calculate the total alcohol consumption for the country of India as an
average of alcohol consumption, which is 2.95.

Similarly, we can calculate the heart diseases per 100,000 people in India as 171
(approximated to the nearest integer value).

Since we do not have any source for the number of total deaths or the number of
fatal liver diseases in India, we are going to handle these the same way we
previously addressed any missing values. The resultant dataset is shown in Table 5.
147
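A minimal R sketch of this integration step, assuming the Indian state data has been loaded into a data frame named india_states with an Alcohol column (both names are hypothetical), and using the values derived in the text for the remaining attributes:

india_alcohol <- mean(india_states$Alcohol)   # average over the seven states; the text uses 2.95
# append a row for India; Deaths and Liver are filled in the same way as the
# earlier missing values, Heart comes from the conversion described in the text
india_row <- data.frame(Country = "India", Alcohol = 2.95,
                        Deaths = 750, Heart = 171, Liver = 20.27)
wine <- rbind(wine, india_row)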
Hands-On Example 2: Data Pre-processing (5)

 Note that some of the assumptions we have made here before using
this external dataset are worth spelling out. First, when we are using the
average of the alcohol consumption for these States as the amount of
alcohol consumption for India, we are assuming that:

(a) The populations of these States are the same or at least similar

(b) The sample of these States is similar to the whole population of


India
(c) the wine consumption is roughly equivalent to the total alcohol
consumption value in India; even though in reality, the wine
consumption per capita should be less than the total alcohol
consumption per capita, as there are other kinds of alcoholic
beverages in the market.

148
Hands-On Example 2: Data Pre-processing (6)
Data Transformation: this process involves one or more of smoothing (removing
noise from the data), summarization, generalization, and normalization.

For this example, we will employ smoothing, which is simpler than summarization
and normalization. As we can see, in our data the wine consumption per capita for
Italy is unusually high, whereas the same for Norway is unusually low. So, chances
are these are outliers.

In this case, we will replace the value of wine consumption for Italy with
7.900000095. Similarly, for Norway, we will use the value of 0.800000012 in
place of 0.0800000012. We are treating both of these potential errors as
“equipment error” or “entry error,” which resulted in an extra digit for both of
these countries (an extra “2” in front of Italy and an extra “0” after the decimal
point for Norway).
This is a reasonable assumption given the limited context we have about the dataset.
A more practical approach would be to look at the nearest geolocation for which we
have the values and use that value to make predictions about the countries with
erroneous entries. So, at the end of this step, the dataset will be transformed into
what is shown in Table 5.
149
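A minimal R sketch of this smoothing step (same assumed wine data frame as before):

# treat the suspect entries as data-entry errors and correct them
wine$Alcohol[wine$Country == "Italy"]  <- 7.900000095   # drop the extra leading "2"
wine$Alcohol[wine$Country == "Norway"] <- 0.800000012   # drop the extra "0" after the decimal point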
Table 5: Integration of Indian data sets
# Country Alcohol Deaths Heart Liver
1 Australia 2.5 785 211 15.30000019
2 Austria 3.000000095 863 167 45.59999847
3 Belg. and Lux. 2.900000095 883 131 20.70000076
4 Canada 2.400000095 793 185 16.39999962
5 Denmark 2.900000095 971 220 23.89999962
6 Finland 0.800000012 970 297 19
7 France 9.100000381 751 11 37.90000153
8 Iceland 0.800000012 743 211 11.19999981
9 Ireland 0.699999988 1000 300 6.5
10 Israel 0.600000024 834 183 13.69999981
11 Italy 27.900000095 775 107 42.20000076
12 Japan 1.5 680 36 23.20000076
13 Netherlands 1.799999952 773 167 9.199999809
14 New Zealand 1.899999976 916 266 7.699999809
15 Norway 0.0800000012 806 227 12.19999981
16 Spain 6.5 724 185 20.27
17 Sweden 1.600000024 743 207 11.19999981
18 Switzerland 5.800000191 693 115 20.29999924
19 UK 1.299999952 941 285 10.30000019
20 US 1.200000048 926 199 22.10000038
21 West Germany 2.700000048 861 172 36.70000076
22 India 2.950000000 750 171 20.27 150
Table 6: Data Transformation
# Country Alcohol Deaths Heart Liver
1 Australia 2.5 785 211 15.30000019
2 Austria 3.000000095 863 167 45.59999847
3 Belg. and Lux. 2.900000095 883 131 20.70000076
4 Canada 2.400000095 793 185 16.39999962
5 Denmark 2.900000095 971 220 23.89999962
6 Finland 0.800000012 970 297 19
7 France 9.100000381 751 11 37.90000153
8 Iceland 0.800000012 743 211 11.19999981
9 Ireland 0.699999988 1000 300 6.5
10 Israel 0.600000024 834 183 13.69999981
11 Italy 7.900000095 775 107 42.20000076
12 Japan 1.5 680 36 23.20000076
13 Netherlands 1.799999952 773 167 9.199999809
14 New Zealand 1.899999976 916 266 7.699999809
15 Norway 0.800000012 806 227 12.19999981
16 Spain 6.5 724 185 20.27
17 Sweden 1.600000024 743 207 11.19999981
18 Switzerland 5.800000191 693 115 20.29999924
19 UK 1.299999952 941 285 10.30000019
20 US 1.200000048 926 199 22.10000038
21 West Germany 2.700000048 861 172 36.70000076 151
22 India 2.950000000 750 171 20.27
Data Reduction. The process is aimed at producing a reduced
representation of the dataset that can be used to obtain the same or
similar analytical results.

Thus, reducing the liver column to one decimal place and the alcohol
consumption column to two decimal places would result in the dataset
shown in Table 7.

Note that data reduction does not mean just reducing the size of
attributes – it also may involve removing some attributes, which is
known as feature selection.

For example, if we are interested in the relation between the wine


consumed and the number of casualties from heart disease, we may opt to
remove the attribute “number of liver diseases” if we assume that
there is no relation between the number of heart disease fatalities and the
number of liver disease fatalities.
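A minimal R sketch of this reduction step (same assumed wine data frame):

# reduce the precision of selected attributes
wine$Alcohol <- round(wine$Alcohol, 2)
wine$Liver   <- round(wine$Liver, 1)

# feature selection: drop an attribute assumed to be irrelevant to the question at hand
wine_reduced <- wine[, c("Country", "Alcohol", "Deaths", "Heart")]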
152
Table 7: Data Reduction
# Country Alcohol Deaths Heart Liver
1 Australia 2.50 785 211 15.3
2 Austria 3.00 863 167 45.6
3 Belg. and Lux. 2.90 883 131 20.7
4 Canada 2.40 793 185 16.4
5 Denmark 2.90 971 220 23.9
6 Finland 0.80 970 297 19.0
7 France 9.10 751 11 37.9
8 Iceland 0.80 743 211 11.2
9 Ireland 0.70 1000 300 6.5
10 Israel 0.60 834 183 13.7
11 Italy 7.90 775 107 42.2
12 Japan 1.50 680 36 23.2
13 Netherlands 1.80 773 167 9.2
14 New Zealand 1.90 916 266 7.7
15 Norway 0.80 806 227 12.2
16 Spain 6.50 724 185 20.3
17 Sweden 1.60 743 207 11.2
18 Switzerland 5.80 693 115 20.3
19 UK 1.30 941 285 10.3
20 US 1.20 926 199 22.1
21 West Germany 2.70 861 172 36.7
22 India 2.95 750 171 20.3 153
Hands-On Example 2: Data Pre-processing (7)
Data Discretization. As we can see, all the attributes involved in
our dataset are continuous types (values in real numbers).

However, depending on the model you want to build, you may have
to discretize the attribute values into binary or categorical types.

For example, you may want to discretize the wine consumption per
capita into four categories

– less than or equal to 1.00 per capita (represented by 0)


- more than 1.00 but less than or equal to 2.00 per capita (1)
- more than 2.00 but less than or equal to 5.00 per capita (2)
- more than 5.00 per capita (3). The resultant dataset should look
like that shown in Table 8.
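A minimal R sketch of this discretization on the assumed wine data frame, using cut(); the intervals (a, b] match the “more than a but less than or equal to b” categories above:

wine$AlcoholLevel <- cut(wine$Alcohol,
                         breaks = c(-Inf, 1, 2, 5, Inf),
                         labels = c(0, 1, 2, 3),
                         right = TRUE)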

154
Table 8: Data Discretization
# Country Alcohol Deaths Heart Liver
1 Australia 2 785 211 15.3
2 Austria 2 863 167 45.6
3 Belg. and Lux. 2 883 131 20.7
4 Canada 2 793 185 16.4
5 Denmark 2 971 220 23.9
6 Finland 0 970 297 19.0
7 France 3 751 11 37.9
8 Iceland 0 743 211 11.2
9 Ireland 0 1000 300 6.5
10 Israel 0 834 183 13.7
11 Italy 3 775 107 42.2
12 Japan 1 680 36 23.2
13 Netherlands 1 773 167 9.2
14 New Zealand 1 916 266 7.7
15 Norway 0 806 227 12.2
16 Spain 3 724 185 20.3
17 Sweden 1 743 207 11.2
18 Switzerland 3 693 115 20.3
19 UK 1 941 285 10.3
20 US 1 926 199 22.1
21 West Germany 2 861 172 36.7
22 India 2 750 171 20.3 155
Hands-On Example 2: Data Pre-processing (8)

Data pre-processing and the hands-on exercise we


did here have given some insights into what needs
to occur before you get your hands on nice-looking
data for processing.

156
Prediction and forecasting:
 Prediction and forecasting: The next step in Data
Science is data analysis, prediction, and forecasting.
 One can do this on an individual level or on a larger scale
for the entire customer database.

 Prediction and forecasting help us understand how our


customers behave and what they may do next.

 One can use these insights to create better products,


marketing campaigns, and customer support.

 Normally, the techniques used for prediction and


forecasting include regression, time series analysis, and
artificial neural networks 157
Steps in Data science Process

158
Steps in Data science Process(2)
 The first step of this process is setting a research goal.

 The main purpose here is to make sure all the stakeholders


understand the what, how, and why of the project.

 In every serious project this will result in a project charter.

 The second phase is data retrieval. You want to have data


available for analysis, so this step includes finding suitable
data and getting access to the data from the data owner.

 The result is data in its raw form, which probably needs


polishing and transformation before it becomes usable.

159
Steps in Data science Process(3)
 Now that you have the raw data, it’s time to prepare it.

 This includes transforming the data from a raw form


into data that’s directly usable in your models.

 To achieve this, you’ll detect and correct different kinds


of errors in the data, combine data from different data
sources, and transform it.

 If you have completed this step, you can progress to data


visualization and modeling.

160
Steps in Data science Process(4)
 The goal of this step is to gain a deep understanding of the
data.

 The patterns, correlations, and deviations are based on visual


and descriptive techniques.

 The insights you gain from this phase will enable you to start
modeling.

 Finally, we get to model building (often referred to as “data


modeling”).
 An attempt to gain the insights or make the predictions stated
in your project charter.

 A combination of simple models tends to outperform one
complicated model.

161
Steps in Data science Process(5)
 The last step of the data science model is presenting
your results and automating the analysis, if needed.

 One goal of a project is to change a process and/or


make better decisions.

 You may still need to convince the business that your


findings will indeed change the business process as
expected.

162
Create a project charter
A project charter requires teamwork, and your input covers
at least the following:
 A clear research goal

 The project mission and context

 How you’re going to perform your analysis

 What resources you expect to use

 Proof that it’s an achievable project, or proof of concepts

 Deliverables and a measure of success

 A timeline
163
Summary
 Setting the research goal —Defining the what, the why, and the
how of your project in a project charter.
 Retrieving data —Finding and getting access to data needed in the
project. This data is either found within the company or retrieved
from a third party.
 Data preparation —Checking and removing data errors, enriching
the data with data from other data sources, and transforming it into a
suitable format for your models.
 Data exploration —Diving deeper into the data using descriptive
statistics and visual techniques.
 Data modeling —Using machine learning and statistical techniques
to achieve the project goal.
 Presentation and automation —Presenting the results to the
stakeholders and industrializing your analysis process for repetitive
reuse and integration with other tools.

164
Lecture Five: Rudiments of Data
Visualizations

165
Rudiments of data Visualizations
 Data visualization is the process of creating graphical
representations of information.

 It helps the presenter communicate data in a way that’s


easy for the viewer to interpret and draw conclusions.

 There are many different techniques and tools available for


visualizing data.

 The purpose of data visualization is to present the finished


result in a way that others can easily understand to detect
patterns and trends

166
Data visualization techniques
 Pie Chart: ideal for illustrating proportions, or part-to-
whole comparisons. Pie charts are relatively simple and easy
to read.
 They are best suited for audiences who might be unfamiliar
with the information or are only interested in key takeaways.

167
Data visualization techniques (2)
 Bar Chart: This is a visualization technique where one
axis of the chart shows the categories being compared, and
the other, a measured value.
 The length of the bar indicates how each group measures
according to the value.

168
Data visualization techniques (3)
 Histogram: A histogram is a graphical representation of
the distribution of numerical data.
 It groups data into bins and displays the frequency of the
number of observations in each bin.
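As a small illustration (all numbers here are invented), a bar chart and a histogram can be drawn in base R as follows:

counts <- c(A = 12, B = 30, C = 18)                            # hypothetical category counts
barplot(counts, main = "Bar chart of category counts")

scores <- c(56, 61, 63, 70, 72, 72, 75, 80, 81, 85, 90, 94)    # hypothetical scores
hist(scores, main = "Distribution of scores", xlab = "Score")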

169
Data visualization techniques (4)
 Gantt Chart: A Gantt chart is a type of bar chart that
illustrates a project schedule.

 It shows the start and finish dates of the different tasks


that make up the project.

170
Data visualization techniques (5)
 Heat Map: A heat map is a graphical representation of
data where the individual values contained in a matrix
are represented as colors.
 They are useful for visualizing large amounts of data
and identifying patterns.

171
Data visualization techniques (6)
 Box and Whisker Plot: A box and whisker plot is a
graphical representation of the distribution of numerical
data.
 It displays the median, quartiles, and outliers of the data

172
Data visualization techniques (7)
 Scatter Plot: A scatter plot is a graphical representation
of the relationship between two variables.
 It is used to identify trends and relationships between
variables.

173
Data Visualization Techniques (8)
 Pictogram Chart: A pictogram chart is a type of chart
that uses icons or pictures to represent data.
 It is useful for presenting data in a visually appealing way

174
Data visualization techniques (9)
 Timeline: A timeline is a graphical representation of a
chronological sequence of events.
 It is useful for presenting historical data or project
schedules.

175
Data visualization techniques (10)
 Word Cloud: A word cloud is a visual representation of
text data.
 It displays the most frequently occurring words in a text
document as larger words.

176
Definitions
 Machine Learning is designing algorithms and models to
analyze data based on defined metrics that learn from experience
to ensure the data science process can be automated and
improved without continuously programming it. These
algorithms learn to make predictions about unforeseen/future
data.

 Deep Learning: a subset of machine learning, deep learning
involves employing different learning methods to
train models to detect the right patterns and present results.

 Data mining: This process involves uncovering hidden patterns


and relations in data to identify trends and make predictions
more adequately.

177
 Data distribution refers to the way data is spread out or
distributed over a range of values.

 It is important because it can help us understand the


characteristics of a dataset, such as its central tendency,
variability, and shape.

 Probability is the measure of the likelihood of an event


occurring.
 It is expressed as a number between 0 and 1, where 0
indicates that the event is impossible and 1 indicates that
the event is certain to occur.

 Probability helps us make predictions about the


likelihood of future events based on past observation 178
Probability distributions

 In statistics, a probability distribution is a function that


describes the likelihood of different possible values of a
variable.

 There are two types of probability distributions: discrete


probability distributions and continuous probability
distributions.

 Discrete probability distributions are used when the


variable can only take on a finite or countable number of
values.
 Continuous probability distributions are used when the
variable can take on any value within a certain range.
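A small R illustration of the two types (the numbers here are only examples):

# discrete: probability of exactly 2 heads in 10 fair coin tosses (binomial distribution)
dbinom(2, size = 10, prob = 0.5)    # about 0.044

# continuous: probability that a standard normal variable falls below 1.96
pnorm(1.96)                         # about 0.975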
179
Data Simulation
 Data simulation is the process of generating synthetic
data that closely mimics the properties and
characteristics of real-world data.

 Simulated data has the advantage of not needing to be


collected from surveys or monitoring software.

 It is created via mathematical or computational models.

 Its features can be considered in three main areas:


flexibility, scalability, and replicability
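A minimal R sketch of simulating data from simple models (all values here are invented):

set.seed(42)                                      # makes the simulation replicable
sim_heights <- rnorm(1000, mean = 170, sd = 10)   # 1000 synthetic height measurements (cm)
sim_visits  <- rpois(1000, lambda = 3)            # synthetic daily visit counts
summary(sim_heights)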

180
Predictive Analytics

 Predictive analytics is a branch of advanced analytics that uses
historical data, statistical modeling, data
mining techniques, and machine learning algorithms to
make predictions about future outcomes.

 Predictive analytics models are designed to assess


historical data, discover patterns, observe trends, and use
that information to predict future trends.

 Some popular predictive analytics models include


classification, clustering, and time series models

181
Descriptive Statistics
 Descriptive statistics describe, show, and
summarize the basic features of a dataset found in
a given study.

 It helps analysts to understand the data better.

 Descriptive statistics represent the available data


sample and do not include theories, inferences,
probabilities, or conclusions. That’s a job for
inferential statistics.

182
Types of Descriptive Statistics

 Distribution, which deals with each value’s


frequency.

 Central tendency, which covers the averages of


the values.

 Variability (or dispersion), which shows how


spread out the values are

183
Distribution (also called Frequency Distribution)

 Datasets consist of a distribution of scores or values.

 Statisticians use graphs and tables to summarize the


frequency of every possible value of a variable,
rendered in percentages or numbers.

 Statisticians depict frequency distributions as either a


graph or as a table.

184
Measures of Central Tendency
 Measures of central tendency estimate a dataset's average or
center.
 Finding the result using three methods: mean, mode, and median.

 Mean. The mean is also known as “M” and is the most common
method for finding averages.
 You get the mean by adding all the response values together,
dividing the sum by the number of responses, or “N.”
 For instance, say someone is trying to figure out how many hours
a day they sleep in a week.
 So, the data set would be the hour entries (e.g., 6,8,7,10,8,4,9),
and the sum of those values is 52.
 There are seven responses, so N=7.
 You divide the value sum of 52 by N, or 7, to find M, which in this
instance is approximately 7.4.
185
Measures of Central Tendency (2)
 Mode. The mode is just the most frequent response
value.

 Datasets may have any number of modes, including


“zero.”

 You can find the mode by arranging your dataset's order


from the lowest to highest value and then looking for the
most common response.

 So, in using our sleep study from the last part:


4,6,7,8,8,9,10. As you can see, the mode is eight.
186
Measures of Central Tendency (3)
 Median. Finally, we have the median, defined as the
value in the precise center of the dataset.

 Arrange the values in ascending order (like we did for


the mode).

 Look for the number in the set’s middle. In this case, the
median is eight.
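These three measures can be checked directly in R using the sleep-study data:

sleep_hours <- c(6, 8, 7, 10, 8, 4, 9)
mean(sleep_hours)      # about 7.43
median(sleep_hours)    # 8
# base R has no built-in mode function for data; a common workaround:
as.numeric(names(which.max(table(sleep_hours))))   # 8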

187
Measures of Central Tendency (4)
 Variability (also called Dispersion)
 The measure of variability gives the statistician an idea of
how spread out the responses are.
 The spread has three aspects — range, standard deviation,
and variance.

 Range. Use range to determine how far apart the most


extreme values are.
 Start by subtracting the dataset’s lowest value from its
highest value.
 Using our sleep study: 4,6,7,8,8,9,10. We subtract four (the
lowest) from ten (the highest) and get six.
 Range = Maximum value – Minimum value
10 – 4 = 6 188
Measures of Central Tendency (5)
 Standard Deviation: The standard deviation (s) is your
dataset’s average amount of variability, showing how far
each score lies from the mean.

 The larger your standard deviation, the more variable your
dataset. Follow these six steps:

1. List the scores and their means.


2. Find the deviation by subtracting the mean from each score.
3. Square each deviation.
4. Total up all the squared deviations.
5. Divide the sum of the squared deviations by N-1.
6. Find the result’s square root.

Example: Using our sleep study: 4,6,7,8,8,9,10 189


Measures of Central Tendency (6)
Raw data Deviation from Mean Deviation Squared
4 4 - 7.4 = -3.4 11.56
6 6 - 7.4 = -1.4 1.96
7 7 - 7.4 = -0.4 0.16
8 8 - 7.4 = 0.6 0.36
8 8 - 7.4 = 0.6 0.36
9 9 - 7.4 = 1.6 2.56
10 10 - 7.4 = 2.6 6.76
M = 7.4 Sum = 0.2 Squares sum = 23.72
190
Measures of Central Tendency(7)

Divide the sum of the squared deviations by 6 (N-1):
23.72/6, and we get 3.953; the square root of that
result is 1.988.

 As a result, we now know that each score deviates from the
mean by an average of about 1.99 points.

 Variance: Variance reflects the degree of data spread. The
greater the degree of data spread, the larger the variance
relative to the mean. You can get the variance by just
squaring the standard deviation. Using the above example,
we square 1.988 and arrive at approximately 3.95.
191
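The same results can be obtained directly in R:

sleep_hours <- c(6, 8, 7, 10, 8, 4, 9)
max(sleep_hours) - min(sleep_hours)   # range = 6
sd(sleep_hours)                       # sample standard deviation, about 1.99
var(sleep_hours)                      # sample variance, about 3.95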
Working Example of a Box Plot
Let’s understand the box plot with an example.

Step 1 — take the set of numbers given

14, 19, 100, 27, 54, 52, 93, 50, 61, 87,68, 85, 75, 82, 95

Arrange the data in increasing(ascending) order

14, 19, 27, 50, 52, 54, 61, 68, 75, 82, 85, 87, 93, 95, 100

Step 2 — Find the median of this data set. The median is the middle value in
this ordered data set.

14, 19, 27, 50, 52, 54, 61, 68, 75, 82, 85, 87, 93, 95, 100

Here it is 68. 192


Working Example of a Box Plot (2)
Step 3 — Find the Lower Quartile.

The lower quartile is the median of the values to the left of the median
found in Step 2 (i.e., 68).
(14, 19, 27, 50, 52, 54, 61), 68, 75, 82, 85, 87, 93, 95, 100
Lower Quartile is 50

Step 4 — Find the Upper Quartile.

The upper quartile is the median of the values to the right of the median
found in Step 2 (i.e., 68).
14, 19, 27, 50, 52, 54, 61, 68,( 75, 82, 85, 87, 93, 95, 100)
Upper Quartile is 87
193
Working Example of a Box Plot (3)
Step 5 — Find the Minimum Value

It is the value at the extreme left of the data set, i.e., the first
value in the data set after ordering.
14, 19, 27, 50, 52, 54, 61, 68, 75, 82, 85, 87, 93, 95, 100
Minimum Value is 14

Step 6 — Find the Maximum Value

It is the value at the extreme right of the data set, i.e., the
last value in the data set after ordering.
14, 19, 27, 50, 52, 54, 61, 68, 75, 82, 85, 87, 93, 95, 100
Maximum Value is 100 194
Working Example of a Box Plot (4)
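In R, the five-number summary and the box plot itself can be produced as follows (quantile type 6 reproduces the hand method above; R's default quantile type interpolates slightly differently):

values <- c(14, 19, 100, 27, 54, 52, 93, 50, 61, 87, 68, 85, 75, 82, 95)
quantile(values, probs = c(0, 0.25, 0.5, 0.75, 1), type = 6)   # 14, 50, 68, 87, 100
boxplot(values, main = "Box and whisker plot of the example data")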

195
Lecture Six: Use Cases in
Various Domains

196
Use Cases In Various Domain

 Data science has a wide range of


applications in various domains such as:

- image
- natural language
- audio
- video.

Here are some examples of use cases in each


domain:
197
Use Cases In Various Domain(2)

 Image: Data science can be used to analyze and


classify images.

 For example, it can be used to identify objects in


images.

 Detect faces and recognize handwriting.

 It can also be used to create realistic images using


generative models

198
Use Cases In Various Domain(3)
 Natural Language: Data science can be used to
analyze and understand natural language.

 For example, it can be used to create chatbots


that can understand and respond to human
language

 Analyze sentiment in text

 Translate text from one language to another.


199
Use Cases In Various Domain(4)

 Audio: Data science can be used to analyze


and classify audio data.

 For example, it can be used to recognize


speech.

 Identify music and detect anomalies in sound


patterns.
200
Use Cases In Various Domain(5)
 Video: Data science can be used to analyze and
classify video data.

 For example, it can be used to recognize faces in


videos.

 Detect objects in videos.

 Analyze video content for advertising purposes


201
Basic introduction to Data mining
 Data mining is the process of discovering patterns in
large datasets using statistical and computational
methods.

 It involves extracting useful information from data and


transforming it into an understandable structure for
further use.

 Data mining is used in various fields such as marketing,


finance, healthcare, and education.

 This process involves uncovering hidden patterns and


relations in data to identify trends and make predictions
more adequately. 202
Basic introduction to Data mining (2)
A typical data mining system may have the following major
components.

 Knowledge Base:
This is the domain knowledge that is used to guide the
search or evaluate the interestingness of resulting patterns.

 Data Mining Engine:


This is essential to the data mining system and ideally
consists of a set of functional modules for tasks such as
characterization, association and correlation analysis,
classification, prediction, cluster analysis, outlier analysis,
and evolution analysis. 203
Basic introduction to Data mining (3)
 Pattern Evaluation Module
This component typically employs interestingness measures and
interacts with the data mining modules so as to focus the search on
interesting patterns.

 User interface
This module communicates between users and the data mining
system, allowing the user to interact with the system by specifying
a data mining query or task, providing information to help focus
the search, and performing exploratory data mining based on the
intermediate data mining results.

 In addition, this component allows the user to browse database and


data warehouse schemas or data structures, evaluate mined
patterns, and visualize the patterns in different forms.
204
Basic introduction to Data mining (4)

Architecture of a typical data mining system 205


Data Mining Tasks
Data mining involves six common classes of tasks:
 Anomaly detection (Outlier/change/deviation
detection) – The identification of unusual data records,
that might be interesting or data errors that require further
investigation.

 Association rule learning (Dependency modelling) –


Searches for relationships between variables.
 For example, a supermarket might gather data on
customer purchasing habits.
 Using association rule learning, the supermarket can
determine which products are frequently bought together
and use this information for marketing purposes.
 This is sometimes referred to as basket analysis. 206
A common example of association rule mining is basket analysis. A
shopper puts items from a store into a basket. Once there are some
items in the basket, it is possible to recommend associated items that
are available in the store to the shopper.
207
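A minimal sketch of basket analysis in R, using the third-party arules package (assumed to be installed) and a few invented baskets:

library(arules)
baskets <- list(c("bread", "butter"),
                c("bread", "butter", "milk"),
                c("bread", "milk"),
                c("butter", "jam"))
trans <- as(baskets, "transactions")
rules <- apriori(trans, parameter = list(supp = 0.5, conf = 0.8))
inspect(rules)    # yields rules such as {milk} => {bread}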
Data Mining Tasks (2)
 Clustering – is about discovering groups and structures
in the data that are in some way or another "similar",
without using known structures in the data.

208
Data Mining Tasks (3)
 Classification – is about the assignment of labels to objects
for identification or assigning categories to objects. For
example, an e-mail program might attempt to classify an e-
mail as "legitimate" or as "spam".

209
Data Mining Tasks (4)
 Regression – This is about the relationship between
features to predict numeric values.

We may be interested in the relationship between different properties of cars. Two


important properties are fuel consumption and the top speed of cars. Usually, cars that
can reach a higher top speed have stronger motors that consume more fuel. With
regression, we can model the relationship between the variables using data about a
sample of cars. 210
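A minimal R sketch of such a regression, with invented values for a handful of cars:

top_speed <- c(160, 180, 190, 210, 230, 250)   # km/h
fuel      <- c(5.8, 6.4, 7.0, 7.9, 8.8, 9.9)   # litres per 100 km
model <- lm(fuel ~ top_speed)                  # fit a simple linear regression
summary(model)
predict(model, data.frame(top_speed = 200))    # predicted consumption at 200 km/h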
Data Mining Tasks (5)
 Time-series – the analysis of temporal data, e.g., observations recorded daily,
weekly, or monthly over time.

211
Basic introduction to knowledge extraction(2)
 Business Intelligence (BI) is a set of tools and techniques
used to transform raw data into meaningful and useful
information for business analysis purposes.

 BI helps organizations make informed decisions by


providing insights into their operations, customers, and
competitors.

 BI tools include dashboards, scorecards and reports.

212
Basic introduction to knowledge extraction(3)

 Knowledge management is the process of creating,


sharing, using, and managing the knowledge and
information of an organization.

 It involves identifying, capturing, evaluating, retrieving,


and sharing information assets within an organization.

 It helps organizations improve their performance by


making better use of their knowledge assets.

213
Introduction to Big Data integration and
intelligence
 Big data integration is the practice of using people,
processes, suppliers, and technologies to retrieve,
reconcile, and make better use of data from various
sources for decision support.

 Big data has the following characteristics: volume,


velocity, veracity, variability, value, and visualization.

 A great example of big data is the social media site
Facebook, where hundreds of terabytes of data are added daily
in the form of text, audio, video, images, etc.

214
Brief History of Big Data

Data scales have grown from the Megabyte (MB) in the 1970s to the
Exabyte (EB), which was introduced in 2011.
215
PHASES OF BIG DATA

Three phases of Big data


216
Characteristics of Big Data
Big data has been amply characterized by the well-known 3Vs
(Volume, Velocity and Variety).

However, the businesses have found these three dimensions


less adequate to tackle Big data properly.

 Hence, seven more features have been proposed, giving a more
consolidated depiction of the Big data problem.

 These features are: Veracity, Variability, Value, Validity,
Vulnerability, Volatility, and Visualization.

The following is a brief discussion on the 3Vs of Big data.217


CHARACTERISTICS OF BIG DATA

Volume: refers to the vast increase in the data growth.

This is evident as more than 90% of the data we encounter


was produced recently.

 In fact, more than 2.5 quintillion (10^18) bytes have been created
daily since as early as 2013, from every post, share,
search, click, stream, and many more data producers.

218
Characteristics of Big Data(2)

Velocity: represents the accumulation of data in high


speed, near real-time, and real-time from dissimilar data
sources.

The velocity of data requires advanced solutions that can


store, process, manage, and analyze streams of
heterogeneous data and infer value on motion.

Figure 2.4 shows a depiction of how much data was


generated in a minute from various resources in 2019.

219
How much data is generated in each minute
220
Characteristics of Big Data (3)
Variety: involves collecting data of heterogeneous types
from various sources.

This includes importing data in dissimilar formats, namely:

Structured (tables reside in relational databases – RDBMS,


etc.)

Semi-structured (email, eXtensible Markup Language


(XML), Javascript Object Notation(JSON), and other
markup languages, etc.)
Unstructured (text, pictures, audio files, video, sensor data,
etc.).
221
Introduction to Data Analytics
 Data analytics is the process of examining large and
varied data sets to uncover:
 hidden patterns,
 unknown correlations,
 market trends,
 customer preferences
 other useful information

 That can help organizations make more informed


business decisions.

 It is used in various fields such as finance, healthcare,


marketing, and education 222
Lecture Seven: Introduction to R Programming

 What will you learn?


• Loading structured data into R.
• Using R to do statistical analysis,
including producing models and
visualizations.

223
Introduction to R programming
 R programming is an open-source programming language
that is widely used as a statistical software and data
analysis tool.
 It was designed by Ross Ihaka and Robert Gentleman at
the University of Auckland, New Zealand, and is currently
being developed by the R Development Core Team
 R programming is used as a leading tool for machine
learning, statistics, and data analysis.
 It provides an extensive collection of graphical and
statistical techniques, making it a preferred choice for
statisticians and data analysts.
 R programming is available across platforms like
Windows, Linux and Mac OS
224
Setting Up R
 R is a free, open source, environment for data
analysis.
 It is available as a free binary download for Mac,
Linux and Windows.
 To install R on your computer, go to
https://cran.r-project.org/index.html and
download and install the appropriate binary file.

225
Setting Up R(2)

226
Setting Up Rstudio

 We will use Rstudio to interact with R.


 Rstudio is a very powerful application to
makes data analysis with R easier to do.
 To install go to
https://www.rstudio.com/products/rstudio/dow
nload/
and download the appropriate version of
Rstudio.

227
Setting Up Rstudio(2)

228
Working with R
When you installed R, a few illustrative datasets were
installed as well.
Write the following command in the console: data()
In R, the first attempt to get help with something is to
use the ? operator.

229
Working with R
Let’s see what the swiss data looks like using the
Rstudio data viewer.

230
Working with R
What value is produced when R evaluates the expression swiss?
When we write the expression swiss, we tell R we want
the value referenced by the name swiss, that is, the data itself!

231
Working with R
On your own: Use the Data Viewer to explore another
of the datasets you saw listed before.

Functions are a series of instructions that take some input


value and produce a different value.

The name View refers to the function that takes a data


frame as input, and displays it in the Data viewer.

Functions are called using the parentheses we saw


before: View(swiss), the parentheses say that you are
passing input swiss to the function View.
232
Plotting in RStudio

There’s a couple of things going on here…


 plot is a function that takes two inputs, the data to put
in the x and y axes, evaluates to nothing, but creates a
plot of the data
 swiss$Education is how we refer the Education column
in the swiss data frame.
On your own: Make a plot using other variables in
the swiss dataset.

233
Plotting in RStudio
Let’s make a plot using the swiss dataset:
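The command that produces this plot (shown again in the script example that follows) is:

plot(swiss$Education, swiss$Fertility)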

234
Plotting in RStudio

One of the biggest advantages of using R over Excel or
other similar programs is that
we can write expressions in scripts that are easy to share
with others, making analyses easier to reproduce.
Let’s write a script that we can use to make the same plot
we just made.
In the Rstudio menu select File>New File>R Script
This will open a tab in the File editor in which we can
write expressions:
plot(swiss$Education, swiss$Fertility)
235
Plotting in RStudio

236
Plotting in RStudio
We can then evaluate the expressions in the file one at a
time, or all at the same time.

We can then save these expressions in a script. In the Rstudio


menu select File>Save and save as a text file.

The convention is to use the .R or .r file extension,


e.g., swiss_plot.r

On your own: Add expressions for additional plots to the


script and save again. Run the new expressions.
237
Plotting in RStudio

238
R Basics
> 2+2
[1] 4
> x=2
> y=2
> z=x+y
> print(z)
[1] 4
> 2>3
[1] FALSE
> 2==2
[1] TRUE

Now, let us write a small program on the R console:
> year = 2020
> if (year%%4==0)
+ print (‘Leap year’)
[1] “Leap year”

We assigned a value to variable “year” and checked if it is divisible by 4 using the modulus operator (%%).

Now let us put this code in a file. In RStudio, select File > New File > R Script. This should open an editor where you can type your code. There, write the following:

239
PYTHON

Python is a scripting language

It means programs written in Python do not need to be
compiled, as you would do with a program in C or Java.

 A Python program runs line by line. The language (its syntax
and structure) also provides a very easy learning curve for
the beginner.

240
Download and Install Python

To set up your development environment from scratch,
using the Anaconda distribution is a great choice.

 If you follow the link
(https://www.anaconda.com/download/)
to the Anaconda download page, you can choose
between installers for Windows, macOS, and Linux:

241
A screenshot of Anaconda 242
243
244
245
REFERENCES

• Introduction to Data Science: A Beginner's Guide by Chirag Shah
• Mastering Python for Data Science by Samir Madhavan
• Mastering Machine Learning with R by Cory Lesmeister
• Data Science from Scratch: First Principles with Python by Joel Grus

246