Chapter 2 - Intro. To Data Sciences (Autosaved)
Data Information
▪ Input: input data is prepared in some convenient form for processing, e.g. electronic computers.
▪ Processing: input data is changed to produce data in a more useful form, e.g. calculating CGPA.
▪ Output: output data is the result of the processing step, and the form of the output data depends on the use of the data. Produced information needs to be stored for future usage.
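The CGPA example can be sketched in a few lines of Python. This is only an illustration: the course records, the field names, and the credit-weighted averaging rule are assumptions, not values or rules given in these slides.

```python
# Input: raw course records prepared in a convenient form.
# All values and field names here are hypothetical.
courses = [
    {"course": "Math", "credit_hours": 4, "grade_points": 3.5},
    {"course": "Physics", "credit_hours": 3, "grade_points": 4.0},
    {"course": "English", "credit_hours": 3, "grade_points": 3.0},
]


def compute_cgpa(records):
    """Processing: credit-weighted average of grade points (assumed rule)."""
    total_credits = sum(r["credit_hours"] for r in records)
    weighted_points = sum(r["credit_hours"] * r["grade_points"] for r in records)
    return weighted_points / total_credits


# Output: the processed result, formatted for its intended use.
cgpa = compute_cgpa(courses)
print(f"CGPA: {cgpa:.2f}")
```

The raw records are the data; the single CGPA figure is the information produced from them, which could then be stored for future usage.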
2.4 Data types and their representation
A data type defines the operations that can be done on the data, the meaning of
the data, and the way values of that type can be stored.
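A small Python sketch may make this concrete. The variable names and values are made up for illustration; the point is that the same symbol behaves differently depending on the type, and that the same value is stored differently as an integer and as text.

```python
count = 7        # int: supports arithmetic
name = "data"    # str: supports concatenation and repetition

total = count + 3            # integer addition
label = name + " science"    # string concatenation
repeated = name * 2          # "*" means repetition for str, multiplication for int

# Operations that the types do not define together are rejected.
try:
    count + name
except TypeError as err:
    error_message = str(err)

# The same value is stored differently depending on its type.
int_bytes = count.to_bytes(2, "big")      # two-byte big-endian integer
text_bytes = str(count).encode("utf-8")   # one UTF-8 character

print(total, label, repeated)
print(int_bytes, text_bytes)
```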
Data types can be described from diverse perspectives:
(a) Computer programming perspective
(b) Analytics perspective
▪ Semi-structured Data:
✓ Does not conform to the formal structure of a data model.
✓ Contains tags or other markers to separate semantic elements and enforce hierarchies of records
and fields within the data.
✓ For example: JSON and XML
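As a rough illustration of semi-structured data, the sketch below parses a small JSON document with Python's standard `json` module. The records and field names are hypothetical. The keys act as the tags, the nesting gives the hierarchy, and the two records do not share an identical set of fields.

```python
import json

# Hypothetical JSON document: keys mark the semantic elements,
# nesting expresses the hierarchy of records and fields.
raw = """
[
  {"name": "Abebe", "courses": [{"title": "Data Science", "grade": "A"}]},
  {"name": "Sara", "email": "sara@example.com", "courses": []}
]
"""

records = json.loads(raw)

for record in records:
    # Fields may be missing, so read optional ones defensively.
    email = record.get("email", "n/a")
    titles = [course["title"] for course in record["courses"]]
    print(record["name"], email, titles)
```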
▪ Unstructured Data:
✓ Does not have a predefined data model or is not organized in a pre-defined manner.
✓ Typically text-heavy but may contain data such as dates, numbers, and facts as well.
▪ Metadata:
✓ Data about data that provides additional information about a specific set of data.
✓ e.g. a photograph's metadata describes when and where the photo was taken.
2.5 Data value chain
The data value chain describes the information flow within a big data system as a series of
steps needed to generate value and useful insights from data.
▪ Data curation
• Content creation, selection, classification, transformation, validation, and preservation
• Ensuring data trustworthiness, accessibility, and reusability
▪ Data storage
• Ensuring the needs of fast access to the data
• RDBMS & NoSQL
• ACID (Atomicity, Consistency, Isolation, & Durability)
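The atomicity part of ACID can be sketched with Python's built-in `sqlite3` module. The `accounts` table, the names, and the balances are hypothetical; SQLite is used only because it ships with Python, not because the slides prescribe it. A transfer either applies both updates or neither.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"  # consistency rule: no negative balance
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one all-or-nothing transaction."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "alice", "bob", 30)       # valid transfer
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice, so it is undone
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

In the second call the credit to bob is executed first, but it is rolled back when the debit violates the balance rule, so no partial transfer is left behind.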
▪ Data acquisition
• Infrastructure:
❑ Deliver low latency
❑ Predictable in both capturing data and executing queries
❑ Be able to handle very high transaction data volumes
❑ Flexible and dynamic in a distributed environment
▪ Data analysis
• Exploring, transforming, and modelling data with the goal of highlighting relevant data
• Synthesizing and extracting useful hidden information with high potential from a business point of view
Use case of Data Science
Application domain of Data Science
2.6 Basic concepts of Big data
▪ Big data is a blanket term for the non-traditional strategies and technologies needed
to gather, organize, process, and draw insights from large datasets.
• Volume: the amount of data from myriad sources; large amounts of data, zettabytes (massive datasets)
• Velocity: the speed at which data are generated; data is live streaming or in motion; real time
• Variety: the types of data; data comes in many different forms from diverse sources
• Veracity: data trustworthiness (the degree to which big data can be trusted); data accuracy (how accurate is it?)
• Variability: the way in which the big data can be used and formatted; to whom the data are accessible
• Value: business value of the data collected; uses and purpose of the data
2.7 Hadoop and its Ecosystem
▪ Hadoop is an open-source framework intended to make interaction with big data easier.
▪ It is inspired by a technical document published by Google.
▪ It allows for the distributed processing of large datasets across clusters of computers
using simple programming models.
✓ Data Processing
✓ Data Storage.
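One programming model commonly associated with Hadoop is MapReduce. The sketch below imitates its map, shuffle, and reduce phases in plain Python on a single machine. It does not use any Hadoop API; the input chunks and function names are hypothetical and stand in for data blocks that a cluster would spread across computers.

```python
from collections import defaultdict

# Hypothetical input, split into chunks as if stored on different nodes.
chunks = [
    "big data needs big storage",
    "hadoop stores big data",
]


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk."""
    return [(word, 1) for word in chunk.split()]


def shuffle(pairs):
    """Group all emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Combine the values of each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


# Each chunk is mapped independently, which is what allows the work to be distributed.
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
word_counts = reduce_phase(shuffle(mapped))
print(word_counts)
```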