DV Co1 All PDF
INTRODUCTION
Data visualization
What is data visualization and why is it important?
• Data visualization refers to the techniques used to
communicate data or information by encoding it as
visual objects (e.g., points, lines or bars) contained in
graphics.
• Data visualization is the graphical representation of
information and data.
• The goal is to communicate information clearly and
efficiently to users.
• It is one of the steps in data analysis or data science.
• According to Vitaly Friedman (2008), the "main goal of
data visualization is to communicate information
clearly and effectively through graphical means."
How is data visualization used?
• Data visualization is the act of taking
information (data) and placing it into a visual
context, such as a map or graph. Data
visualizations make big and small data easier for
the human brain to understand,
and visualization also makes it easier to detect
patterns, trends, and outliers in groups of data.
Why is data visualization so important?
• Data Collection
We’re getting better and better at collecting data, but we lag in what
we can do with it. Lots of data are available around us, but it’s not
being used to its greatest potential because it’s not being visualized
as well as it could be.
• Data Never Stays the Same
What happens when things start moving? How do we interact with
“live” data? How do we unravel data as it changes over time?
According to the previous diagram, does data visualization come after data
analytics?
Choice1: Yes
Choice2: No
Example-01
Example-02
Example-03
Poll Question-02:
Choice1: Yes
Choice2: No
Poll Question
Which statement about data and information is correct?
A: Data is input and information is processed input
B: Data is processed input and information is input
C: Data is input and information is input
D: Data and information both are processed input
What Is a Data Model?
Data modeling provides a method and means for describing the real-
world information requirements in a manner understandable to the
stakeholders in an organization. In addition, data modeling enables the
database practitioners to take these information requirements and
implement these as a computer database system to support the
business of the organization.
So, what is a data model?
A data model is a device that:
• helps the users or stakeholders understand clearly the database system
that is being implemented based on the information requirements of an
organization, and
• enables the database practitioners to implement the database system
exactly conforming to the information requirements.
Poll Question
Choose a correct choice:
A student table with 10 attributes and 250 rows of data is called a data
model.
Choice1: Yes
Choice2: No
Conceptual models
• Provide flexible data-structuring capabilities
• Present a “community view”: the logical structure of the entire database
• Contain data stored in the database
• Show relationships among data, including:
Constraints
Semantic information (e.g., business rules)
Security and integrity information
An attribute is a __________ in
a relation.
a) Row
b) Column
c) Value
d) Tuple
Relational Model
The RELATIONAL MODEL (RM) represents the database as a
collection of relations. A relation is a table of
values. Every row in the table represents a collection of
related data values and denotes a real-world entity or
relationship.
Relational Model Concepts
• Attribute: Each column in a Table. Attributes are the properties
which define a relation, e.g., Student_Rollno, NAME, etc.
• Tuple: It is nothing but a single row of a table, which contains a
single record.
• Relation Schema: A relation schema represents the name of the
relation with its attributes.
• Degree: The number of attributes (columns) in a relation is the degree
of that relation.
• Cardinality: This is the numerical relationship between rows of
one table and rows of another table.
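The concepts above can be sketched in a few lines of Python. This is only an illustration: the Student relation, its third attribute (Branch) and all row values are hypothetical, and degree is taken here in the usual relational sense, as the number of attributes in the relation.

```python
# Hypothetical Student relation; attribute names and rows are illustrative only.
schema = ("Student_Rollno", "NAME", "Branch")  # relation schema: the attribute names

# Each tuple is one row (one record) of the relation.
student = [
    (1, "Asha", "CSE"),
    (2, "Ravi", "ECE"),
    (3, "Meena", "CSE"),
]

# Degree, taken here as the number of attributes (columns) in the relation.
degree = len(schema)

# An attribute is a column: collect every value stored under NAME.
name_index = schema.index("NAME")
names = [row[name_index] for row in student]

print(degree)
print(names)
```

Every tuple has exactly as many values as the schema has attributes, which is what makes the table a relation rather than a free-form list.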
Relational Integrity constraints
Domain constraints
Key constraints
Referential integrity constraints
Domain Constraints
Example:
In the given table, CustomerID is a key attribute of the Customer table. Each customer has a
single key value: CustomerID = 1 identifies only the customer with CustomerName = "Google".
CustomerID   CustomerName   Status
1            Google         Active
2            Amazon         Active
3            Apple          Inactive
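A key constraint like the one in this example can be checked mechanically. The sketch below is a minimal illustration, not a database feature: the helper name `satisfies_key_constraint`, the column name "Status" and the extra "Meta" row are assumptions introduced only for the demonstration.

```python
# Rows of the Customer table from the example; the column name "Status" is an assumption.
customers = [
    {"CustomerID": 1, "CustomerName": "Google", "Status": "Active"},
    {"CustomerID": 2, "CustomerName": "Amazon", "Status": "Active"},
    {"CustomerID": 3, "CustomerName": "Apple", "Status": "Inactive"},
]

def satisfies_key_constraint(rows, key):
    """Return True if every row has a distinct, non-null value for `key`."""
    values = [row.get(key) for row in rows]
    return None not in values and len(values) == len(set(values))

ok = satisfies_key_constraint(customers, "CustomerID")

# A hypothetical second customer that reuses CustomerID 1 violates the constraint.
duplicate = {"CustomerID": 1, "CustomerName": "Meta", "Status": "Active"}
bad = satisfies_key_constraint(customers + [duplicate], "CustomerID")

print(ok, bad)
```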
Referential integrity constraints
Referential integrity constraints are based on the concept of
foreign keys. A foreign key is an attribute of a relation that
refers to a key attribute of another (or the same) relation.
A referential integrity constraint requires that every foreign
key value match a key value that actually exists in the
referenced table.
Example:
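One way to see a referential integrity constraint in action is with Python's built-in sqlite3 module. This is a sketch under stated assumptions: the Billing table, its columns and all values are hypothetical, and SQLite is used only because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.execute("CREATE TABLE Customer (CustomerID INTEGER PRIMARY KEY, CustomerName TEXT)")
conn.execute(
    "CREATE TABLE Billing ("
    " InvoiceNo INTEGER PRIMARY KEY,"
    " CustomerID INTEGER REFERENCES Customer(CustomerID),"  # foreign key
    " Amount REAL)"
)
conn.execute("INSERT INTO Customer VALUES (1, 'Google')")

# Allowed: CustomerID 1 exists in Customer.
conn.execute("INSERT INTO Billing VALUES (100, 1, 250.0)")

# Rejected: CustomerID 99 does not exist in Customer.
try:
    conn.execute("INSERT INTO Billing VALUES (101, 99, 75.0)")
    violation_detected = False
except sqlite3.IntegrityError:
    violation_detected = True

billing_rows = conn.execute("SELECT COUNT(*) FROM Billing").fetchone()[0]
print(violation_detected, billing_rows)
```

Only the row whose CustomerID exists in the referenced table is stored; the other insert is refused by the database itself.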
Operations in Relational Model
• Super Key
• Candidate Key
• Primary Key
• Composite Key (A key that has more
than one attribute)
• Secondary or Alternative key
• Non-key Attributes
• Non-prime Attributes
ER Model of a Company
Relational Model of a Company
Session No: CO1-3
Session Topic: Spread sheet models, Relational Data
Models
DATA MODELING AND
VISUALIZATION TECHNIQUES
(Course code: 18CS3262)
Relational Model: Data Blending/Joins
Data Blending: Importance in the visualization context
• In the relational model, data is maintained in the form of rows
and columns, commonly called a relation or table.
• In Database, data is spread over multiple tables.
• Sometimes we may want data from two or more tables.
• Data blending is a method for combining data from multiple
sources.
• Data blending brings in additional information from a
secondary data source and displays it with data from the
primary data source directly in the view.
Step 1: Create and Insert data in Table 1
Step 2: Create and Insert data in Table 2
Step 3: Drag and drop the first table or left table to the field region.
• To describe how to join data in Tableau, we need at least two tables.
• First, drag and drop the first (left) table to the field region.
• In this example, we are using our Employee table as the left table.
Step 4: Drag and drop the second table or right table to the field region.
• Drag and drop the second (right) table to the field region.
• When you drag the Department table, a pop-up window opens in which you
select the join type and joining key, as shown in the adjacent figure.
Different Types of JOINs
• (INNER) JOIN: Returns records that have matching values in both
tables
• LEFT (OUTER) JOIN: Returns all records from the left table, and the
matched records from the right table
• RIGHT (OUTER) JOIN: Returns all records from the right table, and the
matched records from the left table
• FULL (OUTER) JOIN: Returns all records when there is a match in either
the left or the right table
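The four join types can be compared side by side with a small pure-Python sketch. It is not how SQL or Tableau implement joins; the `join` helper, the table contents and the column names are hypothetical and chosen so that one employee and one department have no match.

```python
# Hypothetical Employee (left) and Department (right) tables; all values are illustrative.
employees = [
    {"emp": "Anil", "dept_id": 10},
    {"emp": "Bina", "dept_id": 20},
    {"emp": "Chen", "dept_id": 40},      # no matching department
]
departments = [
    {"dept_id": 10, "dept": "Sales"},
    {"dept_id": 20, "dept": "HR"},
    {"dept_id": 30, "dept": "Finance"},  # no matching employee
]

def join(left, right, key, how="inner"):
    """Join two lists of dicts on `key`; `how` is inner, left, right, or full."""
    left_cols = {c for row in left for c in row}
    right_cols = {c for row in right for c in row}
    result, matched_right = [], set()
    for lrow in left:
        matches = [i for i, rrow in enumerate(right) if rrow[key] == lrow[key]]
        for i in matches:
            matched_right.add(i)
            result.append({**lrow, **right[i]})
        if not matches and how in ("left", "full"):
            # Keep the unmatched left row, padding right-only columns with None.
            result.append({**dict.fromkeys(right_cols), **lrow})
    if how in ("right", "full"):
        for i, rrow in enumerate(right):
            if i not in matched_right:
                # Keep the unmatched right row, padding left-only columns with None.
                result.append({**dict.fromkeys(left_cols), **rrow})
    return result

for how in ("inner", "left", "right", "full"):
    rows = join(employees, departments, "dept_id", how)
    print(how, [(r["emp"], r["dept"]) for r in rows])
```

The inner join keeps only Anil and Bina; the left join adds Chen with no department, the right join adds Finance with no employee, and the full join keeps both unmatched rows.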
INNER JOINS
For the given case study, perform the following join operations in SQL and Tableau:
1) Inner
2) Left
3) Right
4) Full Outer
Activity: Design
A: C
B: C++
C: Java
Object Oriented Modelling
• Object Oriented Analysis
Identifying Classes
Attributes and Operations
• Components of Class Diagrams
Associations
Multiplicity
Aggregation
Composition
Generalization
Object Oriented Analysis
Motivation
Nearly anything can be an object
External Entities
E.g. people, devices, cat, etc.
Things
E.g. reports, displays, etc.
Occurrences or Events
E.g. CricketMatch, MarriageFunction etc.
Roles
E.g. Manager, President, Captain etc.
Structures
E.g. Bridge, Bungalow, Four-wheeled vehicles etc.
Organizational Units
E.g. group, team, etc.
Places
E.g. manufacturing floor, loading dock, etc.
What are classes?
A class describes a group of objects with
similar properties (attributes),
common behaviour (operations),
common relationships to other objects,
and common meaning (“semantics”).
Examples
employee: has a name, employee# and department where he/she
is working.
Object
The instances of a class are called objects.
Objects are represented as:
Fred_Bloggs:Employee
name: Fred Bloggs
Employee #: 234609234
Department: Marketing
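The difference between a class and an object can be sketched in Python as follows. The attribute names mirror the Fred_Bloggs:Employee object above; the `transfer` operation is hypothetical and is included only to show that a class carries behaviour as well as attributes.

```python
from dataclasses import dataclass

@dataclass
class Employee:
    """Class: describes the attributes and operations shared by all employees."""
    name: str
    employee_number: int
    department: str

    def transfer(self, new_department: str) -> None:
        # Hypothetical operation: behaviour common to every Employee object.
        self.department = new_department

# Object: one instance of the class, matching Fred_Bloggs:Employee above.
fred_bloggs = Employee(name="Fred Bloggs", employee_number=234609234, department="Marketing")
print(fred_bloggs)
```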
Which of the following choices is not an object?
Multiplicity
A multiplicity is a factor associated with an attribute. It specifies how many instances of
the attribute are created when a class is initialized. If a multiplicity is not specified,
one is assumed by default.
(Example: There are 100 students in one college. The college can have multiple students.)
Example of Many to Many Multiplicity
Different Meaning of Relating two Classes
Class associations
[Figure: the association "liaises with" between the classes StaffMember and Client]
StaffMember (staffName, staff#, staffStartDate)  1  liaises with  0..*  Client (companyAddress, companyEmail, companyFax, companyName, companyTelephone)
• Name of the association: "liaises with".
• Multiplicity: a client has exactly one staff member as a contact person; a staff member has
zero or more clients on his/her clientList.
• Direction: the "liaises with" association should be read in the direction marked on the diagram.
• Role: the staff member's role in this association is as a contact person; the clients' role
in this association is as a clientList.
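The multiplicities and roles of this association can be mirrored in code. The following is a minimal sketch, assuming simplified attribute lists; the staff and company names are hypothetical, and a real design would likely add more validation.

```python
class StaffMember:
    def __init__(self, staff_name):
        self.staff_name = staff_name
        self.client_list = []                    # role "clientList": 0..* clients

class Client:
    def __init__(self, company_name, contact_person):
        self.company_name = company_name
        self.contact_person = contact_person     # role "contact person": exactly one StaffMember
        contact_person.client_list.append(self)  # keep both ends of the association consistent

# Hypothetical objects for illustration.
asha = StaffMember("Asha")
ravi = StaffMember("Ravi")
acme = Client("Acme", contact_person=asha)
globex = Client("Globex", contact_person=asha)

print([c.company_name for c in asha.client_list])
print(len(ravi.client_list))
```

Requiring `contact_person` in the Client constructor expresses the "exactly one" end, while the initially empty list on StaffMember expresses "zero or more".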
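Class Diagrams
[Figure: class diagram of a patient, annotated to show the class name, attributes, services, multiplicities, aggregation and generalization]
• patient: Name, Date of Birth, Height, Weight
• Generalization: In-patient (Room, Bed, Physician) and Out-patient (Last visit, next visit, physician)
• Aggregation: eye (Colour, Diameter, Correction), kidney (Operational?), heart (Normal bpm)
• organ: Natural/artif., Orig/implant, donor
• Other labels on the diagram: Blood type
• Multiplicities shown on the diagram: 0..2, 0..1, 1..2, 1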
Class Name
(The operations +acquire(property) and +dispose(property) are implemented in the implementing classes, i.e., the Person and Corporation classes.)
Generalization
A generalization connects a subclass to its superclass. A subclass
inherits from its superclass.
Example-01
Example-02
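Generalization maps directly onto inheritance in an object-oriented language. The sketch below reuses the patient classes from the class diagram; the constructor parameters, the `describe` method and the sample values are assumptions made for illustration.

```python
class Patient:
    """Superclass: attributes shared by every kind of patient."""
    def __init__(self, name, date_of_birth):
        self.name = name
        self.date_of_birth = date_of_birth

    def describe(self):
        return f"{self.name} ({type(self).__name__})"

class InPatient(Patient):
    """Subclass: inherits from Patient and adds ward details."""
    def __init__(self, name, date_of_birth, room, bed):
        super().__init__(name, date_of_birth)
        self.room = room
        self.bed = bed

class OutPatient(Patient):
    """Subclass: inherits from Patient and adds visit details."""
    def __init__(self, name, date_of_birth, next_visit):
        super().__init__(name, date_of_birth)
        self.next_visit = next_visit

# Hypothetical patient for illustration.
p = InPatient("Ravi", "2001-04-12", room=12, bed="B")
print(p.describe())
```

The subclass object can be used wherever a Patient is expected, and it reuses `describe` without redefining it.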
Association
An association represents a static relationship between two classes A and
B. For example, an employee works for an organization.
Rule for naming an association: the name is usually a verb or verb
phrase, or a noun or noun phrase.
Association without Direction / Association with Direction
Aggregation
• Aggregation is a special type of association that models a whole-part
relationship between an aggregate and its parts.
• An aggregation is a special case of association denoting a consists-of
hierarchy.
• The aggregate is from bottom class to top class, and the components
are from top class to bottom class.
Composition
• Composition is a special type of aggregation that denotes strong
ownership between two classes, where one class is a part of another class.
• Composition is a form of aggregation, and it portrays the whole-part
relationship. It depicts a dependency between a composite (parent) and its
parts (children): if the composite is discarded, its parts are deleted
as well. It exists between similar objects.
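The contrast between aggregation and composition can be sketched in Python. The class names (Team, Player, Car, Engine) and values are hypothetical; the point is only where the part objects are created and who else holds a reference to them.

```python
class Player:
    def __init__(self, name):
        self.name = name

class Team:
    """Aggregation: the Team refers to Players that exist independently of it."""
    def __init__(self, players):
        self.players = list(players)

class Engine:
    def __init__(self, serial):
        self.serial = serial

class Car:
    """Composition: the Car creates and owns its Engine."""
    def __init__(self, engine_serial):
        self.engine = Engine(engine_serial)  # the part is created inside the whole

# Aggregation: the players are created first and outlive the team.
players = [Player("Asha"), Player("Ravi")]
team = Team(players)
del team
surviving = [p.name for p in players]

# Composition: nothing outside the Car holds its Engine, so discarding the
# Car leaves no way to reach that Engine.
car = Car("E-1001")
engine_serial = car.engine.serial
print(surviving, engine_serial)
```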
A: Yes
B: No
References:
Another way to manage unstructured data is to have it flow into a data lake,
allowing it to be in its raw, unstructured format.
Attentive Test Question
• A student database is an example of a _____________ data model.
A: Structured
B: Unstructured
C: Semi-structured
• Which data models fit 1) keeping an employee attendance record and
2) collecting movie reviews from social media, respectively?
A: structured, unstructured
B: structured, semi-structured
C: unstructured, structured
D: structured, structured
Semi-Structured data
Semi-structured data is information that does not reside in a
relational database but that has some organizational properties
that make it easier to analyze. With some processing, it can be stored
in a relational database (this can be very hard for some kinds
of semi-structured data), but the semi-structured format exists to save space.
Example: XML data.
Books.xml
<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
<book category="cooking">
<title lang="en">Everyday Italian</title>
<author>Giada De Laurentiis</author>
<year>2005</year>
<price>30.00</price>
</book>
<book category="children">
<title lang="en">Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="web">
<title lang="en">XQuery Kick Start</title>
<author>James McGovern</author>
<author>Per Bothner</author>
<author>Kurt Cagle</author>
<author>James Linn</author>
<author>Vaidyanathan Nagarajan</author>
<year>2003</year>
<price>49.99</price>
</book>
<book category="web" cover="paperback">
<title lang="en">Learning XML</title>
<author>Erik T. Ray</author>
<year>2003</year>
<price>39.95</price>
</book>
</bookstore>
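The "organizational properties" of semi-structured data are what make it possible to process Books.xml with a standard parser. The sketch below uses Python's built-in xml.etree.ElementTree on a shortened copy of the file (two books, fewer authors) so that it is self-contained; the dictionary layout of `rows` is an assumption chosen for illustration.

```python
import xml.etree.ElementTree as ET

# A shortened copy of Books.xml so the sketch is self-contained.
BOOKS_XML = """<bookstore>
  <book category="cooking">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="web">
    <title lang="en">XQuery Kick Start</title>
    <author>James McGovern</author>
    <author>Per Bothner</author>
    <year>2003</year>
    <price>49.99</price>
  </book>
</bookstore>"""

root = ET.fromstring(BOOKS_XML)

# Flatten each <book> into a row; repeated <author> tags become a list.
rows = [
    {
        "category": book.get("category"),
        "title": book.findtext("title"),
        "authors": [a.text for a in book.findall("author")],
        "year": int(book.findtext("year")),
        "price": float(book.findtext("price")),
    }
    for book in root.findall("book")
]

for row in rows:
    print(row["title"], len(row["authors"]), row["price"])
```

The variable number of author elements per book is the kind of irregularity that fits awkwardly into a single fixed-column table, which is why such data is often kept in a semi-structured form.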
Big data analytics
Big data analytics is the often complex process of examining large
and varied data sets, or big data, to uncover information -- such as
hidden patterns, unknown correlations, market trends and
customer preferences--that can help organizations make informed
business decisions.
3V Problem of Big Data Analytics
Big data analytics technologies and tools
Unstructured and semi-structured data types typically don't fit well in
traditional data warehouses that are based on relational databases oriented
to structured data sets. Further, data warehouses may not be able to handle
the processing demands posed by sets of big data that need to be updated
frequently or even continually, as in the case of real-time data on stock
trading, the online activities of website visitors or the performance of
mobile applications.
As a result, many of the organizations that collect, process and analyze big
data turn to NoSQL databases, as well as Hadoop and its companion data
analytics tools, including:
• YARN: a cluster management technology and one of the key features in
second-generation Hadoop.
• MapReduce: a software framework that allows developers to write
programs that process massive amounts of unstructured data in parallel
across a distributed cluster of processors or stand-alone computers.
• Spark: an open source, parallel processing framework that enables users to
run large-scale data analytics applications across clustered systems.
• HBase: a column-oriented key/value data store built to run on top of the
Hadoop Distributed File System (HDFS).
• Hive: an open source data warehouse system for querying and analyzing
large data sets stored in Hadoop files.
• Kafka: a distributed publish/subscribe messaging system designed to replace
traditional message brokers.
• Pig: an open source technology that offers a high-level mechanism for the
parallel programming of MapReduce jobs executed on Hadoop clusters.
Overall Architecture of Hadoop MapReduce Framework
Attentive Test Question
• Big data analytics is more relevant to the __________ data model.
A: Structured
B: Unstructured
C: Semi-structured
• I have a text data set of Twitter data. Which tool is more relevant for
visualizing the analytical result?
A: Oracle SQL
B: Microsoft Excel
C: Apache Hadoop
References:
• https://learn.g2.com/structured-vs-unstructured-data
• Book: Big Data For Dummies by Judith Hurwitz, Alan Nugent, Dr. Fern
Halper, and Marcia Kaufman
Activity: Model Preparation of BigData Analytics
Para01: Hadoop MapReduce is a software framework for easily writing applications which process vast
amounts of data
Para02: A MapReduce job usually splits the input data-set into independent chunks which are processed
by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which
are then input to the reduce tasks.
Para03: Typically the compute nodes and the storage nodes are the same, that is, the MapReduce
framework and the Hadoop Distributed File System are running on the same set of nodes. This
configuration allows the framework to effectively schedule tasks on the nodes where data is already
present, resulting in very high aggregate bandwidth across the cluster.
Para04: The MapReduce framework consists of a single master ResourceManager, one worker
NodeManager per cluster-node, and MRAppMaster per application.
Para05: The Hadoop job client then submits the job and configuration to the ResourceManager which
then assumes the responsibility of distributing the software and configuration to the workers, scheduling
tasks and monitoring them, providing status and diagnostic information to the job-client.
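The map, sort and reduce flow described in Para02 can be imitated on a single machine. This is a plain-Python sketch of the idea, not the Hadoop API: the function names, the word-count task and the input chunks are all assumptions chosen for illustration, and nothing here runs in parallel.

```python
from collections import defaultdict

def map_task(chunk):
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(mapped_outputs):
    """Sort and group the map outputs by key, as the framework does between phases."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_task(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

chunks = ["big data big", "data model", "big model"]  # independent input splits
mapped = [map_task(c) for c in chunks]                # these would run in parallel on a cluster
counts = dict(reduce_task(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

Because each chunk is mapped without looking at any other chunk, the map tasks could be scheduled on whichever nodes already hold the data, which is the property Para03 relies on.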
Session No: CO1-5
Session Topic: Unstructured data model, Semi structured
data
DATA MODELING AND
VISUALIZATION TECHNIQUES
(Course code: 18CS3262)
Session Objective
• An ability to represent structured data, unstructured data and
semi structured data.
• An ability to identify and use relevant tools to process, analyze
and visualize structured and unstructured data types.
Poll Question
• What is the output of select * from employee;
A: All the rows of employee table are displayed.
B: All the rows of employee table whose dept no=5 are
displayed.
C: Last 10 rows of employee table are displayed.
D: First 10 rows of employee table are displayed.
Structured Data
Structured data is most often categorized as quantitative data, and it's the
type of data most of us are used to working with. Think of data that fits
neatly within fixed fields and columns in relational databases and
spreadsheets.
Poll Question
• A student database is an example of a _____________ data model.
A: Structured
B: Unstructured
C: Semi-structured
• Find the suitable data models for {Employee Attendance Record, Film
Review from social media}, respectively.
A: structured, unstructured
B: structured, semi-structured
C: unstructured, structured
D: structured, structured
Activity (Quiz)