
Chapter Two

Introduction to Data Science


Topics Covered

▪ An Overview of Data Science

▪ Data and information

▪ Data types and representation

▪ Data Processing Cycle

▪ Data Value Chain (Acquisition, Analysis, Curation, Storage, Usage)

▪ Basic concepts of Big data


2.1 Overview of Data Science
▪ What is Data science?
✓ A multi-disciplinary field that uses scientific methods, processes,
algorithms, and systems to extract knowledge and insights from
structured, semi-structured, and unstructured data.
✓ Data science is much more than simply analyzing data.
 Examples of data
 Your notebook
 Prices of items in a supermarket
 Files in a computer
 Barcodes, etc.
Overview…
 Data science continues to evolve as one of the most promising and in-
demand career paths for skilled professionals
 To be a successful data professional in today’s market requires
advancing past the traditional skills of analyzing large amounts of data,
data mining, and programming.
Cont. . . Data Science Experts/Scientists
✓ Data scientists are analytical experts who utilize their skills in both technology and social science to
find trends and manage data.
✓ They use industry knowledge, contextual understanding, and skepticism of existing assumptions to
uncover solutions to business challenges.
✓ They need a strong quantitative background in statistics and linear algebra, as well as
programming knowledge.
✓ They need to be curious and result-oriented.
✓ They must master the full spectrum of the data science life cycle and possess the
flexibility and understanding to maximize returns.
2.2 Data and Information
▪ Data?
✓ A representation of facts, concepts, or instructions in a formalized manner,
which should be suitable for communication, interpretation, or processing by
humans or electronic machines.
▪ Information?
✓ Organized or classified data, which has some meaningful values for the receiver.

✓ Processed data on which decisions and actions are based.

✓ Principles of information - processed data must satisfy the following:

 Timely − Information should be available when required.

 Accuracy − Information should be accurate.

 Completeness − Information should be complete.


Summary: Data vs. Information

▪ Data is described as unprocessed or raw facts and figures; information is described as processed data.
▪ Data cannot help in decision making; information can help in decision making.
▪ Data is the raw material that can be organized, structured, and interpreted to create useful
information; information is interpreted data, created from organized, structured, and processed
data in a particular context.
▪ Data consists of 'groups of non-random symbols' in the form of text, images, and voice
representing quantities, actions, and objects; information is processed data in the form of text,
images, and voice representing quantities, actions, and objects.
2.3 Data Processing Cycle
 The re-structuring or re-ordering of data by people or machines to increase its usefulness and
add value for a particular purpose.
 The set of operations used to transform data into useful information.

Data Processing Cycle

▪ Input - input data is prepared in some convenient form for processing, e.g. for an
electronic computer.
▪ Processing - input data is changed to produce data in a more useful form,
e.g. calculating a CGPA (see the sketch below).
▪ Output - output data is the result of the processing step; the form of the output
depends on the use of the data. Produced information may need to be stored for
future use.
2.4 Data types and their representation
 Data type defines the operations that can be done on the data, the meaning of
the data, and the way values of that type can be stored.
 Data types can be described from two perspectives:

(a) Computer programming perspective (see the Python sketch below)
✓ Integers (int) --- whole numbers
✓ Booleans (bool) --- restricted to one of two values: true or false
✓ Characters (char) --- store a single character (symbol)
✓ Floating-point numbers (float) --- store real numbers
✓ Alphanumeric strings (string) --- a group of characters

(b) Data analytics perspective
✓ Structured data
✓ Semi-structured data
✓ Unstructured data
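
As a minimal sketch, here is how these primitive types look in Python; note that
Python has no separate char type (a single character is just a one-character str).
The variable values below are made up.

# Primitive data types from the computer-programming perspective (Python).
age = 25              # int: a whole number
is_enrolled = True    # bool: restricted to one of two values, True or False
grade = 'A'           # "char": Python stores a single character as a 1-length str
gpa = 3.44            # float: a real (floating-point) number
name = "Alice"        # string (str): a group of characters

print(type(age), type(is_enrolled), type(grade), type(gpa), type(name))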
Cont. . .
▪ Structured Data:
✓ adheres to a pre-defined data model and is straightforward to analyze

✓ typically takes a tabular format, e.g. Excel files or SQL databases

▪ Semi-structured Data:
✓ does not conform to the formal structure of a data model

✓ contains tags or other markers that separate semantic elements and enforce
hierarchies of records and fields within the data
✓ for example: JSON and XML (see the sketch below)

▪ Unstructured Data:
 does not have a predefined data model or is not organized in a pre-defined manner.

 typically text-heavy, but may contain data such as dates, numbers, and facts as well.

 examples: audio and video files, or NoSQL databases.

✓ Metadata - data about data; provides additional information about a specific set of data.
e.g. a photograph's metadata describes when and where the photo was taken.
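
To make the distinction concrete, here is a small Python sketch that parses a
semi-structured JSON record into a structured (tabular) row; the record and its
field names are invented for illustration, and the nested photo block stands in
for the metadata example above.

import json

# A semi-structured record: tags mark the semantic elements, but there
# is no fixed tabular schema. The fields below are invented.
record = '{"id": 1, "name": "Alice", "photo": {"taken": "2021-05-01", "place": "Addis Ababa"}}'

parsed = json.loads(record)   # the tags let us separate semantic elements

# The nested "photo" object acts like metadata: data about data,
# describing when and where the photo was taken.
print(parsed["photo"]["taken"], parsed["photo"]["place"])

# Flattening to a fixed set of columns yields structured (tabular) data.
row = (parsed["id"], parsed["name"], parsed["photo"]["taken"])
print(row)                    # (1, 'Alice', '2021-05-01')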
2.5 Data value chain
 Introduced to describe the information flow within a big data system as a series of
steps needed to generate value and useful insights from data.
▪ Data Acquisition - the acquisition system must:
❑ deliver in low latency,
❑ be predictable in both capturing data and executing queries,
❑ be able to handle very high transaction data volumes, and
❑ be flexible and dynamic in a distributed environment.
▪ Data Analysis - exploring, transforming, and modelling data with the goal of
highlighting relevant data; synthesizing and extracting useful hidden information
with high potential from a business point of view.
▪ Data Curation - content creation, selection, classification, transformation,
validation, and preservation; ensures data trustworthiness, accessibility, and
reusability.
▪ Data Storage - ensures the need for fast access to the data; typical technologies
include RDBMS (with ACID guarantees: Atomicity, Consistency, Isolation, and
Durability) and NoSQL stores.
▪ Data Usage - the data-driven business activities that consume the stored data and
analysis results, supported by the underlying infrastructure (see the sketch below).
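
A toy Python walk-through of the chain, with each step as a function; all names
and data here are invented for illustration, and a plain dict stands in for a
real storage system.

# A toy walk through the data value chain; all names/data are illustrative.

def acquire():
    # Acquisition: gather and filter raw records (here, hard-coded).
    return [" 25 ", "30", None, "28"]

def curate(raw):
    # Curation: validate and transform for trustworthiness and reusability.
    return [int(x) for x in raw if x is not None]

def store(clean, db):
    # Storage: persist curated data (a dict stands in for an RDBMS/NoSQL store).
    db["ages"] = clean
    return db

def analyze(db):
    # Analysis: summarize the data to highlight relevant information.
    ages = db["ages"]
    return sum(ages) / len(ages)

# Usage: the analysis result feeds a (hypothetical) business decision.
result = analyze(store(curate(acquire()), {}))
print(f"average age: {result:.1f}")   # average age: 27.7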
Cont. . . Use case of Data Science
Cont. . . Application domain of Data Science
2.6 Basic concepts of Big data
▪ Big data is a blanket term for the non-traditional strategies and technologies needed
to gather, organize, and process large datasets and to glean insights from them.

▪ Big data refers to:

✓ Datasets so large and complex that they become difficult to process using
on-hand database management tools or traditional data processing applications

✓ The category of computing strategies and technologies that are used to
handle large datasets

▪ Goal of Big data:
✓ To surface insights and connections from large volumes of heterogeneous
data that would not be possible using conventional methods
Cont. . . Characteristics of Big data


• Volume - the amount of data from myriad sources; large amounts of data, on the
order of zettabytes (massive datasets).
• Velocity - the speed at which data are generated; data is live, streaming or in
motion, often in real time.
• Variety - the types of data; data comes in many different forms from diverse sources.
• Veracity - data trustworthiness (the degree to which big data can be trusted);
data accuracy: how accurate is it?
• Variability - the way in which the big data can be used and formatted; to whom
are the data accessible?
• Value - the business value of the data collected; the uses and purpose of the data.
2.7 Hadoop and its Ecosystem
▪ Hadoop is an open-source framework intended to make interaction with big data easier.
▪ It is inspired by a technical document published by Google.
▪ It allows for the distributed processing of large datasets across clusters of computers
using simple programming models.

▪ The four key characteristics of Hadoop


✓ Economical: Its systems are highly economical as ordinary
computers can be used for data processing.
✓ Reliable: It is reliable as it stores copies of the data on different
machines and is resistant to hardware failure.
✓ Scalable: It is easily scalable, both horizontally and vertically. A
few extra nodes help in scaling up the framework.
✓ Flexible: It is flexible; you can store as much structured and
unstructured data as you need and decide how to use it later.
Cont. . . The 4 core components of Hadoop
and its Ecosystem
▪ The 4 core components of
Hadoop are
✓ Data Management,
✓ Data Access,

✓ Data Processing

✓ Data Storage.

The Hadoop Ecosystem


Cont. . . The 4 core components of Hadoop
and its Ecosystem
▪ The Hadoop ecosystem comprises the following components
✓ HDFS: Hadoop Distributed File System
✓ YARN: Yet Another Resource Negotiator
✓ MapReduce: programming-based data processing (see the sketch below)
✓ Spark: in-memory data processing
✓ PIG, HIVE: query-based processing of data services
✓ HBase: NoSQL database
✓ Mahout, Spark MLlib: machine learning algorithm libraries
✓ Solr, Lucene: searching and indexing
✓ Zookeeper: managing clusters
✓ Oozie: job scheduling
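
MapReduce itself is just a pair of functions over key-value pairs. As an
illustrative sketch (not the ecosystem's own code), here is the classic word
count written for Hadoop Streaming, which lets plain Python scripts act as the
mapper and reducer; the file names mapper.py and reducer.py are assumptions.

# mapper.py -- Hadoop Streaming feeds input lines on stdin and collects
# tab-separated (key, value) pairs from stdout.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py (a separate script) -- Hadoop sorts mapper output by key before
# the reducer runs, so equal words arrive consecutively and sum in one pass.
import sys

current, count = None, 0
for line in sys.stdin:
    word, n = line.rstrip("\n").split("\t")
    if word == current:
        count += int(n)
    else:
        if current is not None:
            print(f"{current}\t{count}")
        current, count = word, int(n)
if current is not None:
    print(f"{current}\t{count}")

The same pipeline can be tested locally without a cluster:
cat input.txt | python mapper.py | sort | python reducer.py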
Cont. . . The Big data life cycle with Hadoop

▪ Stage 1 - Ingesting data into the system
✓ The data is ingested or transferred into Hadoop from various sources such as relational
databases, systems, or local files.
▪ Stage 2 - Processing the data in storage (stored and processed)
✓ The data is stored in the distributed file system, HDFS, and in the NoSQL distributed
database, HBase. Spark and MapReduce perform the data processing.
▪ Stage 3 - Computing and analyzing data
✓ The data is analyzed by processing frameworks such as Pig, Hive, and Impala
(see the sketch below).
▪ Stage 4 - Visualizing the results (access)
✓ The results are accessed and visualized by tools such as Hue and Cloudera Search.
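
As a minimal sketch of stages 2-3 in code, a PySpark job can read data already
ingested into HDFS and run a simple analysis; the HDFS path, file format, and
the "region" column below are assumptions for illustration.

# A minimal PySpark sketch of stages 2-3 (processing and analyzing).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("big-data-lifecycle").getOrCreate()

# Stage 2: the data already sits in HDFS; load it into a DataFrame.
df = spark.read.csv("hdfs:///data/sales.csv", header=True, inferSchema=True)

# Stage 3: compute and analyze -- count records per region.
df.groupBy("region").count().show()

spark.stop()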
Cont. . . Big data lifecycle at Google Cloud platform
End of Chapter 2
(Data Science)
Assignment Questions
1. Discuss the difference between Big data and Data Science.
2. Briefly discuss the Big data life cycle.
3. List and explain Big data application domains with example.
4. What is Clustered Computing? Explain its advantages.
Thank you!

