
DV Co1 All PDF


B.TECH III YEAR SEM-1

ACADEMIC YEAR – JULY, 2020

DATA MODELING & VISUALIZATION TECHNIQUES
COURSE CODE: 18CS3262

INTRODUCTION
Data visualization
What is data visualization and why is it important?
• Data visualization refers to the techniques used to
communicate data or information by encoding it as
visual objects (e.g., points, lines or bars) contained in
graphics.
• Data visualization is the graphical representation of
information and data.
• The goal is to communicate information clearly and
efficiently to users.
• It is one of the steps in data analysis or data science.
• According to Vitaly Friedman (2008), the "main goal of
data visualization is to communicate information
clearly and effectively through graphical means."
How is data Visualization used?
• Data visualization is the act of taking
information (data) and placing it into a visual
context, such as a map or graph. Data
visualizations make big and small data easier for
the human brain to understand,
and visualization also makes it easier to detect
patterns, trends, and outliers in groups of data.
Why is data visualization so important?

We need data visualization because a visual
summary of information makes it easier to
identify patterns and trends than looking
through thousands of rows on a spreadsheet.
It's the way the human brain works. Since the
purpose of data analysis is to gain
insights, data is much more valuable when it is
visualized.
Why do we visualize data?

Because of the way the human brain processes
information, using charts or graphs
to visualize large amounts of complex data is
easier than poring over spreadsheets or reports.
Best data visualization tools

• Commonly cited data visualization tools include
Google Charts, Tableau, Grafana, Chartist.js,
FusionCharts, Datawrapper, Infogram,
ChartBlocks, and D3.js. The best tools offer a
variety of visualization styles, are easy to use,
and can handle large data sets.
Visualization pipeline
Benefits of Data Visualization

• Faster Action. The human brain tends to process
visual information far more easily than written
information.
• Communicate Findings in Constructive Ways.
• Understand Connections Between Operations
and Results.
• Embrace Emerging Trends.
• Interact With Data.
• Create New Discussion.
• Machine Learning: Come One, Come All.
What are the types of visualization?
The 15 Most Common Types of Data Visualization Formats
• Column Chart
• Bar Graph
• Stacked Bar Graph
• Stacked Column Chart
• Area Chart
• Dual Axis Chart
• Line Graph
• Mekko Chart
• Pie Chart
• Waterfall Chart
• Bubble Chart
• Scatter Plot Chart
• Bullet Graph
• Funnel Chart
• Heat Map
Author Stephen Few described eight types of quantitative messages
that users may attempt to understand or communicate from a set of
data and the associated graphs used to help communicate the
message:

• Time series: A single variable is captured over a period of
time, such as the unemployment rate over a 10-year period.
Ex: line chart
• Ranking: Categorical subdivisions are ranked in ascending or
descending order, such as a ranking of sales performance.
Ex: bar chart
• Part-to-whole: Categorical subdivisions are measured as a
ratio to the whole (i.e., a percentage out of 100%). Ex: pie
chart / bar chart
• Deviation: Categorical subdivisions are compared against a
reference, such as a comparison of actual vs. budget.
Ex: bar chart
• Frequency distribution: Shows the number of
observations of a particular variable for given intervals,
such as the number of years in which the stock market
return falls within a given range. Ex: histogram
• Correlation: Comparison between observations
represented by two variables. Ex: scatter plot
• Geographic or geospatial: Comparison of a variable
across a map or layout, such as the unemployment rate
by state or the number of persons on the various floors
of a building.
• Nominal comparison: Comparing categorical
subdivisions in no particular order, such as the sales
volume by product code. Ex: bar chart
Terminology
Data visualization involves specific terminology, some of which is
derived from statistics. For example, author Stephen Few
defines two types of data, which are used in combination to
support a meaningful analysis or visualization:

Categorical: Text labels describing the nature of the data, such
as "Name" or "Age". This term also covers qualitative (non-
numerical) data.

Quantitative: Numerical measures, such as "25" to represent
the age in years.
Two primary types of information
displays are tables and graphs.
• A table contains quantitative data organized into
rows and columns with categorical labels.
• For example, a table might have categorical column labels
representing the name (a qualitative variable)
and age (a quantitative variable).
• A graph is primarily used to show relationships
among data and portrays values encoded
as visual objects (e.g., lines, bars, or points).
• Many graphs are also referred to as charts.
DV LAB - INTRODUCTION
• Plot a histogram using Matplotlib
(take 80 random values).
• To create a scatter plot in Matplotlib we can
use the scatter method (iris data).
• To create a line chart in Pandas we can
call <dataframe>.plot.line().
• Create a histogram and multiple histograms
(iris data).
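A minimal sketch of the first lab task, assuming Python with Matplotlib installed. The helper names (make_sample, bin_counts, plot_histogram), the seed and the choice of 8 bins are illustrative assumptions, not part of the lab sheet. The binning helper shows what a histogram computes; the plotting helper hands the same values to Matplotlib.

```python
import random


def make_sample(n=80, seed=42):
    """Return n pseudo-random values from a standard normal distribution."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def bin_counts(values, bins=8):
    """Count how many values fall into each of `bins` equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The largest value is placed in the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts


def plot_histogram(values, bins=8):
    """Draw the histogram (Matplotlib is imported here so the helpers above run without it)."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.hist(values, bins=bins, edgecolor="black")
    ax.set_xlabel("value")
    ax.set_ylabel("frequency")
    ax.set_title("Histogram of 80 random values")
    return fig


values = make_sample()
counts = bin_counts(values)
```

Calling plot_histogram(values) and then matplotlib.pyplot.show() displays the chart. For the other tasks, ax.scatter(x, y) draws a scatter plot and a Pandas DataFrame offers .plot.line() and .plot.hist(); loading the iris data is left to whichever copy of the data set the lab provides.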
SKILLING - INTRODUCTION
• Campaign Finance Data
• This data set consists of contributions data from the Federal Elections
Commission (FEC) during the 2012 election campaign. We have compiled
two versions of this data set. The small set covers 2011 - Oct 2012, which
is ~200K individual contributions. The big set covers 2005 - Oct 2012,
which is ~1M contributions (reveals several congressional election finance
cycles, and ~2 presidential cycles). Both small and large versions are
available as either CSV (comma separated values) or TDE (Tableau data
engine) files. Each file has been compressed using zip.
• Download: (Small Version) csv, tde; (Large Version) csv, tde
• Info about how to translate various codes throughout the data can be
found in these schema files for: candidates, committees,
and contributions. The overall shape of the data starts with the
contributions schema, with data joined onto each row from the
candidates and committees schemas.
Movie Data

• This dataset contains some important
statistics from a large sample of movies. The
data includes the movie budget and revenue
from different sources as well as ratings
from RottenTomatoes, The
Numbers and IMDB.
• Download: csv file
• Sources: RottenTomatoes, The
Numbers and IMDB.
Flight Data

• FAA data describing every commercial flight
during the month of December 2009. For detailed
descriptions of each data column in the attached
file please see www.transtats.bts.gov. You are
also welcome to download your own version of
the file (which might include columns or time
spans that were left out from this dataset)
directly from www.transtats.bts.gov.
• Download: zipped csv file
• Source: www.transtats.bts.gov
Data Set: Antibiotics
• After World War II, antibiotics were considered
"wonder drugs", since they were an easy remedy for what had
been intractable ailments. To learn which drug worked
most effectively for which bacterial infection, the performance
of the three most popular antibiotics on 16 bacteria was
gathered.
• The values in the table represent the minimum inhibitory
concentration (MIC), a measure of the effectiveness of the
antibiotic, which represents the concentration of antibiotic
required to prevent growth in vitro. The reaction of the
bacteria to Gram staining is described by the covariate
“gram staining”. Bacteria that are stained dark blue or
violet are Gram-positive. Otherwise, they are Gram-negative.
https://courses.cs.washington.edu/courses/cse512/14wi/a1.html
Session No: CO1-2
Session Topic: Introduction to Data Visualisation &
Data Model, Conceptual model
Session Objective
• An ability to know the importance of data visualisation
through different examples.
• An ability to understand the software that plays a vital role
in data visualization.
• An ability to understand the basic purpose of the data
model.
• An ability to know the data model representation with a
suitable example.
Poll Question-01
Is a chart an example of data visualization?
Choice1: Yes
Choice2: No
Why Data Display Requires Planning
• Too Much Information
When you hear the term “information overload,” you probably
know exactly what it means because it’s something you deal with
daily.

• Data Collection
We’re getting better and better at collecting data, but we lag in what
we can do with it. Lots of data are available around us, but it’s not
being used to its greatest potential because it’s not being visualized
as well as it could be.
• Data Never Stays the Same
What happens when things start moving? How do we interact with
“live” data? How do we unravel data as it changes over time?

• What Is the Question?


The most important part of understanding data is identifying the
question that you want to answer. Rather than thinking about the
data that was collected, think about how it will be used and work
backward to what was collected. The more specific you can make
your question, the more specific and clear the visual result will be.
Poll Question-02:

In the previous diagram, does data visualization appear after data
analytics?
Choice1: Yes
Choice2: No
Example-01
Example-02
Example-03
Poll Question-02:

Choose the correct answer


• Through data visualization, has the data been visualized with
intelligence?

Choice1: Yes
Choice2: No
Poll Question
Which statement about data and information is correct?
A: Data is input and information is processed input
B: Data is processed input and information is input
C: Data is input and information is input
D: Data and information both are processed input
What Is a Data Model?
Data modeling provides a method and means for describing the real-
world information requirements in a manner understandable to the
stakeholders in an organization. In addition, data modeling enables the
database practitioners to take these information requirements and
implement these as a computer database system to support the
business of the organization.
So, what is a data model?
A data model is a device that
• helps the users or stakeholders understand clearly the database
system that is being implemented based on the information
requirements of an organization, and
• enables the database practitioners to implement the database system
exactly conforming to the information requirements.
Poll Question
Choose a correct choice:

A student table with 10 attributes and 250 rows of data is called a data
model.

Choice1: Yes
Choice2: No
Conceptual models
Conceptual models:
• Provide flexible data-structuring capabilities
• Present a “community view”: the logical structure of the entire database
• Contain data stored in the database
• Show relationships among data including:
   • Constraints
   • Semantic information (e.g., business rules)
   • Security and integrity information
• Consider a database as a collection of entities (objects) of various kinds
• Are the basis for identification and high-level description of main data
objects; they avoid details
• Are database independent, regardless of the database you will be using
Poll Question
• Is an ER diagram a representation of a data model?
Choice 1: Yes
Choice 2: No

• In the previous figure what is the level of ER diagram description?


Choice1: Conceptual Level
Choice2: External Level
Choice3: Physical Level
Activity (Refer LMS for Submission)
Case Study
Design a data model for the following scenario.
A school system needs to build a database application. The system consists of about 5000 students
taught by 200 teachers. There are 50 classrooms to cater to the needs of the students. There is a library
with around 500 text and reference books. The students must enroll in a course during their study.
The guidelines for developing the data model are given below:
• A student needs to enroll in one course at a time.
• Not every student has a faculty member for teaching, but every student must have a faculty member for
mentoring.
• Every course must have a course faculty member.
• The library shall be used by both students and teachers. Both may have multiple books issued in
their favor.
• Every classroom shall be occupied by a student on a rotational basis.
Session No: CO1-3
Session Topic: Spread sheet models,
Relational Data Models
Session Objective
At the end of this session, the student should be able to
• Understand how data is transformed and maintained in
relational model.
• Understand different ways of merging information
(blending) from multiple sources
• Implement the concept of data blending using Tableau
on a given case study
Poll Question

An attribute is a __________ in
a relation.
a) Row
b) Column
c) Value
d) Tuple
Relational Model
RELATIONAL MODEL (RM) represents the database as a
collection of relations. A relation is nothing but a table of
values. Every row in the table represents a collection of
related data values. These rows in the table denote a real-
world entity or relationship.
Relational Model Concepts
• Attribute: Each column in a table. Attributes are the properties
which define a relation, e.g., Student_Rollno, NAME, etc.
• Tuple: A single row of a table, which contains a
single record.
• Relation Schema: A relation schema represents the name of the
relation with its attributes.
• Degree: The number of entity types connected to a
relationship is the degree of that relationship. For a single
relation, the degree is its number of attributes.
• Cardinality: This is the numerical relationship between rows of
Relational Integrity constraints

Relational integrity constraints are conditions
that must hold for a relation to be valid. These
integrity constraints are derived from the rules in the
mini-world that the database represents.

There are many types of integrity constraints. Constraints
in a relational database management system are mostly
divided into three main categories:

Domain constraints
Key constraints
Referential integrity constraints
Domain Constraints

Domain constraints are violated if an attribute
value does not appear in the corresponding domain or
is not of the appropriate data type.

Domain constraints specify that, within each tuple,
the value of each attribute must be an atomic value from
that attribute's domain. Domains are specified as data types,
which include the standard data types: integers, real numbers,
characters, Booleans, variable-length strings, etc.
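A minimal sketch of a domain constraint, assuming Python's built-in sqlite3 module. The student table, its columns and the age range 16 to 60 are hypothetical choices made for illustration. The CHECK clause declares the domain of age, and values outside it are rejected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE student (
        roll_no INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        -- Domain of age: an integer between 16 and 60 (assumed range).
        age     INTEGER CHECK (typeof(age) = 'integer' AND age BETWEEN 16 AND 60)
    )
    """
)

# A value inside the domain is accepted.
conn.execute("INSERT INTO student VALUES (1, 'Asha', 20)")

# Values outside the domain (wrong range, wrong type) are rejected.
rejected = []
for bad_age in (150, "twenty"):
    try:
        conn.execute("INSERT INTO student VALUES (2, 'Ravi', ?)", (bad_age,))
    except sqlite3.IntegrityError:
        rejected.append(bad_age)

row_count = conn.execute("SELECT COUNT(*) FROM student").fetchone()[0]
```

Only the first row is stored; both out-of-domain inserts raise sqlite3.IntegrityError.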
Key constraints
An attribute that can uniquely identify a tuple in a relation is
called the key of the table. The value of the attribute for
different tuples in the relation has to be unique.

Example:
In the given table, CustomerID is a key attribute of the Customer table. Each customer has a
single key value: CustomerID = 1 belongs only to CustomerName = "Google".

CustomerID   CustomerName   Status
1            Google         Active
2            Amazon         Active
3            Apple          Inactive
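The key constraint on this Customer table can be sketched with Python's built-in sqlite3 module. The snake_case column names and the 'Duplicate Co' row are illustrative assumptions. Declaring CustomerID as the primary key makes the database reject a second tuple with the same key value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer ("
    " customer_id INTEGER PRIMARY KEY,"
    " customer_name TEXT,"
    " status TEXT)"
)
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(1, "Google", "Active"), (2, "Amazon", "Active"), (3, "Apple", "Inactive")],
)

# Re-using CustomerID = 1 violates the key constraint.
try:
    conn.execute("INSERT INTO customer VALUES (1, 'Duplicate Co', 'Active')")
    duplicate_accepted = True
except sqlite3.IntegrityError:
    duplicate_accepted = False

row_count = conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]
```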
Referential integrity constraints
Referential integrity constraints are based on the concept of
foreign keys. A foreign key is an attribute of a relation that
refers to a key attribute of another relation (or of the same relation).
A referential integrity constraint applies wherever a relation
refers to a key attribute in this way:
the referenced key value must exist in the referenced table.
Example:
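A minimal sketch of a referential integrity constraint, assuming Python's built-in sqlite3 module. The customer and billing tables and their columns are hypothetical. SQLite enforces foreign keys only after PRAGMA foreign_keys = ON, so the sketch enables it first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled

conn.execute(
    "CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, customer_name TEXT)"
)
conn.execute(
    "CREATE TABLE billing ("
    " invoice_no INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL REFERENCES customer(customer_id),"
    " amount REAL)"
)

conn.execute("INSERT INTO customer VALUES (1, 'Google')")
conn.execute("INSERT INTO billing VALUES (100, 1, 250.0)")  # customer 1 exists

# Customer 99 does not exist, so this billing row would be an orphan.
try:
    conn.execute("INSERT INTO billing VALUES (101, 99, 80.0)")
    orphan_accepted = True
except sqlite3.IntegrityError:
    orphan_accepted = False

billing_rows = conn.execute("SELECT COUNT(*) FROM billing").fetchone()[0]
```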
Operations in Relational Model

Four basic operations performed on the relational
database model are
insert, modify (update), delete and select. The first three
change the data; select only retrieves it.

• Insert is used to insert data into the relation.
• Delete is used to delete tuples from the table.
• Modify allows you to change the values of some
attributes in existing tuples.
• Select allows you to choose a specific range of data.
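The four operations can be sketched in SQL through Python's built-in sqlite3 module. The customer table reuses the values from the key-constraint example; the column names are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer ("
    " customer_id INTEGER PRIMARY KEY, customer_name TEXT, status TEXT)"
)

# Insert: add tuples to the relation.
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(1, "Google", "Active"), (2, "Amazon", "Active"), (3, "Apple", "Inactive")],
)

# Modify: change attribute values in an existing tuple.
conn.execute("UPDATE customer SET status = 'Active' WHERE customer_id = 3")

# Delete: remove a tuple.
conn.execute("DELETE FROM customer WHERE customer_id = 2")

# Select: retrieve the tuples that satisfy a condition.
active = conn.execute(
    "SELECT customer_name FROM customer WHERE status = 'Active' ORDER BY customer_id"
).fetchall()
```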
ER Model

ENTITY RELATIONSHIP (ER) MODEL is a high-level
conceptual data model diagram. ER modeling helps you
to analyze data requirements systematically to produce a
well-designed database. The Entity-Relationship model
represents real-world entities and the relationships
between them. It is considered a best practice to
complete ER modeling before implementing your
database.
Components of the ER Diagram
This model is based on three basic concepts:
• Entities
• Attributes
• Relationships
ER Model to Relational Model Conversion
Introduction to Keys

• Super Key
• Candidate Key
• Primary Key
• Composite Key (A key that has more
than one attribute)
• Secondary or Alternate Key
• Non-key Attributes
• Non-prime Attributes
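A small sketch of how super keys and candidate keys relate, assuming the standard definitions: a super key is any set of attributes whose values are unique across tuples, and a candidate key is a minimal super key. The student rows are hypothetical sample data. Real keys follow from the meaning of the schema, so a check on one sample of rows can only suggest candidates.

```python
from itertools import combinations

# Hypothetical sample relation.
rows = [
    {"roll_no": 1, "email": "a@example.edu", "name": "Asha", "dept": "CSE"},
    {"roll_no": 2, "email": "b@example.edu", "name": "Ravi", "dept": "CSE"},
    {"roll_no": 3, "email": "c@example.edu", "name": "Asha", "dept": "ECE"},
]


def is_superkey(attrs, rows):
    """True if no two rows share the same values for the given attributes."""
    seen = {tuple(row[a] for a in attrs) for row in rows}
    return len(seen) == len(rows)


def candidate_keys(rows):
    """Return the minimal super keys of this sample, smallest first."""
    attributes = list(rows[0])
    found = []
    for size in range(1, len(attributes) + 1):
        for attrs in combinations(attributes, size):
            # Skip supersets of a key already found: they are not minimal.
            if any(set(key) <= set(attrs) for key in found):
                continue
            if is_superkey(attrs, rows):
                found.append(attrs)
    return found


keys = candidate_keys(rows)
```

In this sample, roll_no and email are single-attribute candidate keys, and (name, dept) is a composite one. Choosing one candidate as the primary key leaves the others as alternate keys.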
ER Model of a Company
Relational Model of a Company
Relational Model: Data Blending/Joins
Data Blending-Importance in visualization
context
• In the relational model, data is maintained in the form of rows
and columns, most commonly called a relation or
table.
• In a database, data is spread over multiple tables.
• Sometimes we may want data from two or more tables.
• Data blending is a method for combining data from multiple
sources.
• Data blending brings in additional information from a
secondary data source and displays it with data from the
primary data source directly in the view.
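A rough sketch of the blending idea in plain Python, under the assumption that blending behaves like aggregating the secondary source on a linking field and then attaching the result to each primary row (similar to a left join). The region sales and target figures are hypothetical, and this is not Tableau's implementation.

```python
from collections import defaultdict

# Primary source (hypothetical): one row per region.
primary = [
    {"region": "North", "sales": 120},
    {"region": "South", "sales": 90},
    {"region": "East", "sales": 75},
]

# Secondary source (hypothetical): several target rows per region.
secondary = [("North", 50), ("North", 60), ("South", 100)]

# Step 1: aggregate the secondary source to the level of the linking field.
targets = defaultdict(int)
for region, target in secondary:
    targets[region] += target

# Step 2: attach the aggregated value to every primary row.
# Regions missing from the secondary source get None (a null).
blended = [{**row, "target": targets.get(row["region"])} for row in primary]
```

Every primary row survives the blend; the East region has no matching secondary data, so its target is None.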
Step 1: Create and Insert data in Table 1
Step 2: Create and Insert data in Table 2
Step 3: Drag and drop the first (left) table
to the field region.
• To describe how to join data in Tableau, we need at least two tables.
• First, drag and drop the first (left) table to the field region.
• In this example, we are using our Employee table as the left table.
Step 4: Drag and drop the second (right) table
to the field region.
• Drag and drop the second (right) table to the field region.
• When you drag the Department table, a pop-up window opens to select
the join type and joining key, as shown in the adjacent figure.
Different Types of JOINs
• (INNER) JOIN: Returns records that have matching values in both
tables
• LEFT (OUTER) JOIN: Returns all records from the left table, and the
matched records from the right table
• RIGHT (OUTER) JOIN: Returns all records from the right table, and the
matched records from the left table
• FULL (OUTER) JOIN: Returns all records from both tables, matched
where possible and padded with nulls where there is no match
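The four join types can be tried out with Python's built-in sqlite3 module. This is a sketch on hypothetical employee and department tables, not the data set used in the Tableau screenshots. To stay compatible with older SQLite versions, the right join is written as a left join with the tables swapped, and the full outer join as the UNION of both (UNION also drops duplicate rows, which is acceptable for this small sample).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE department (id INTEGER PRIMARY KEY, dept_name TEXT)")
conn.execute(
    "CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, name TEXT, dept_id INTEGER)"
)
conn.executemany(
    "INSERT INTO department VALUES (?, ?)", [(10, "Sales"), (20, "HR"), (30, "IT")]
)
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [(1, "Ann", 10), (2, "Bob", 20), (3, "Cy", None)],  # Cy has no department
)

INNER = ("SELECT e.name, d.dept_name FROM employee AS e "
         "JOIN department AS d ON e.dept_id = d.id")
LEFT = ("SELECT e.name, d.dept_name FROM employee AS e "
        "LEFT JOIN department AS d ON e.dept_id = d.id")
# Right join emulated by swapping the tables of a left join.
RIGHT = ("SELECT e.name, d.dept_name FROM department AS d "
         "LEFT JOIN employee AS e ON e.dept_id = d.id")
# Full outer join emulated as the union of both outer joins.
FULL = f"{LEFT} UNION {RIGHT}"

inner_rows = conn.execute(INNER).fetchall()
left_rows = conn.execute(LEFT).fetchall()
right_rows = conn.execute(RIGHT).fetchall()
full_rows = conn.execute(FULL).fetchall()
```

The inner join keeps only Ann and Bob, the left join adds Cy with a null department, the right join adds the IT department with a null employee, and the full outer join contains all four rows.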
INNER JOINS

• An inner join of Tables A and B
gives the result of A intersect B.
• Inner joins use a comparison
operator to match rows from two
tables based on the values in
common columns from each
table.
• Select the Dept ID column
from the Employee table
as shown.
• Select the Id column
from the Department
table as shown.
INNER JOIN

• We created a simple table
report with Occupation,
Last name, Department
name and First name on Rows,
and Sales Amount and
Yearly income on Columns.
• Inner Join produced 10
rows, which include all
the matching records from the
Employee and Department
tables.
LEFT JOIN

• Left Join is producing 14
rows.
• It includes all the records
from the Employee table
and matching records from
the Department table.
• Remember, for the four Employee
records with no match, the
Department columns will be
displayed as Nulls.
RIGHT JOIN

• Right Join is producing 12
rows.
• Output includes all records
from the Department table
and matching records from
the Employee table.
• For the two Department records
with no match, the Employee
columns will display as Nulls.
FULL OUTER JOIN

• FULL OUTER JOIN returns a
result set that includes rows
from both left and right tables.
• When no matching rows exist
for the row in the left table, the
columns of the right table will
have nulls.
• Full Outer Join is producing 16
rows.
Blend your data – Web Reference

The following reference provides stepwise guidance for
data blending:
https://help.tableau.com/current/pro/desktop/en-us/multiple_connections.htm
Poll Question

Q. Which of the following statements are False?


A. RIGHT OUTER JOIN is equivalent to LEFT OUTER JOIN if the
order of the tables is reversed
B. FULL OUTER JOIN is same as CROSS JOIN
C. SELF JOIN is a special type of OUTER JOIN
D. Both B and C
Case Study

For the given case study, perform the following join operations
in SQL and Tableau:
1) Inner
2) Left
3) Right
4) Full Outer
Activity: Design

• Entities with their properties and interactions are given.
Represent these entity relationships in the form of an ER
diagram.

• Transform the ER model into a relational model.


Session No: CO1-4
Session Topic: Object Oriented Model

Session Objective
• An ability to know Object-Oriented Modelling.
• An ability to visualize the interaction of classes through the class
diagram.
POLL QUESTION

Which of the following is not an object-oriented programming language?

A: C

B: C++

C: Java
Object Oriented Modelling
• Object Oriented Analysis
   • Identifying Classes
   • Attributes and Operations
• Components of Class Diagrams
   • Associations
   • Multiplicity
   • Aggregation
   • Composition
   • Generalization
Object Oriented Analysis
Motivation

A model based on objects (rather than functions) will be more stable
over time.
OO emphasizes the importance of well-defined interfaces between objects.
Nearly anything can be an object
External Entities
E.g. people, devices, cat, etc.
Things
E.g. reports, displays, etc.
Occurrences or Events
E.g. CricketMatch, MarriageFunction etc.
Roles
E.g. Manager, President, Captain etc.
Structures
E.g. Bridge, Bungalow, Four-wheeled vehicles etc.
Organizational Units
E.g. group, team, etc.
Places
E.g. manufacturing floor, loading dock, etc.
What are classes?
A class describes a group of objects with
similar properties (attributes),
common behaviour (operations),
common relationships to other objects,
and common meaning (“semantics”).
Examples
Employee: has a name, an employee number and a department where he/she
is working.
Object
The instances of a class are called objects.
Objects are represented as:

Fred_Bloggs:Employee
name: Fred Bloggs
Employee #: 234609234
Department: Marketing
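The class/object distinction can be sketched in Python. The Employee class describes the shared attributes, and fred_bloggs is one instance carrying the values from the slide. The field names are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Employee:
    """Class: describes what every employee object has in common."""
    name: str
    employee_no: int
    department: str


# Object: one instance of the class, with its own attribute values.
fred_bloggs = Employee(
    name="Fred Bloggs", employee_no=234609234, department="Marketing"
)
```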

Which of the following is not an object?

A: Dog (name: Bobby,color: White, length:25.5 In)

B: Cat(name: Miaun, color: Black, length:12 In, weight: 4kg)

C: Car(manufacturer:Tata, model: Tiago, variant: XT, yellow)

D: max_of_twoNos(int n1, int n2)


Associations
Class diagrams show classes and their relationships

Multiplicity
Multiplicity specifies how many instances may take part: for an attribute, how many
values are created when the class is instantiated; for an association end, how many objects
may be linked. If a multiplicity is not specified, one is assumed by default.

Meaning of multiplicity symbols

Symbol   Meaning
0        No instances (rare)
0..1     No instances, or one instance
1        Exactly one instance
1..1     Exactly one instance
0..*     Zero or more instances
*        Zero or more instances
1..*     One or more instances

(Example: There are 100 students in one college. The college can have multiple students.)
Example of Many to Many Multiplicity
Different Meaning of Relating two Classes
Class associations
[Figure: the "liaises with" association between the StaffMember class (staffName, staff#,
staffStartDate) and the Client class (companyAddress, companyEmail, companyFax,
companyName, companyTelephone).]
• Multiplicity: A client has exactly one staff member as a contact person (1).
A staff member has zero or more clients on his/her clientList (0..*).
• Name of the association: "liaises with".
• Direction: The "liaises with" association should be read in the direction marked in the figure.
• Role: The staff member's role in this association is as a contact person.
The clients' role in this association is as a clientList.
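The StaffMember and Client association can be sketched in Python. The multiplicity 1 becomes a single required reference and 0..* becomes a list that may be empty. The attribute names and sample values are illustrative assumptions.

```python
class StaffMember:
    def __init__(self, staff_name):
        self.staff_name = staff_name
        self.client_list = []  # role "clientList": 0..* clients


class Client:
    def __init__(self, company_name, contact_person):
        if contact_person is None:
            raise ValueError("a client needs exactly one contact person")
        self.company_name = company_name
        self.contact_person = contact_person  # role "contact person": exactly 1
        # Keep both ends of the association consistent.
        contact_person.client_list.append(self)


staff = StaffMember("Meena")
idle_staff = StaffMember("Ravi")  # a staff member may have zero clients
clients = [Client("Acme Ltd", staff), Client("Globex", staff)]
```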

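Class Diagrams
[Figure: annotated class diagram. A patient class (attributes: Name, Date of Birth,
Height, Weight) is linked by aggregation, with multiplicities on each link, to eye
(Colour, Diameter, Correction), kidney (Operational?) and heart (Normal bpm, Blood type),
and is specialized by generalization into In-patient (Room, Bed, Physician) and
Out-patient (Last visit, next visit, physician). An organ class (Natural/artif.,
Orig/implant, donor) also appears. Callouts mark the class name, attributes, services,
aggregation, multiplicities and generalization.]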
Class Name

• A class name should always start with a capital letter.


• A class name should always be in the center of the first compartment.
• A class name should always be written in bold format.
• An abstract class name should be written in italics format.
Attributes
• An attribute is named property of a class which describes the object
being modeled. In the class diagram, this component is placed just below
the name-compartment.
• The attributes are generally written along with the visibility factor.
Public, private, protected and package are the four visibilities which are
denoted by +, -, #, or ~ signs respectively.
Dependency
A dependency means the relation between two or more classes in which a
change in one may force changes in the other. However, it will always create
a weaker relationship. Dependency indicates that one class depends on
another.
Realization
• Denotes the implementation of the functionality defined in one class
by another class.
• This is an implementation of interface.

(Implementation of +acquire(property) & +dispose(property) in the implementation class i.e Person and Corporation class)
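Realization can be sketched in Python with an abstract base class standing in for the interface. The interface name Owner is an assumption, since the slide names only the operations acquire and dispose and the implementing classes Person and Corporation.

```python
from abc import ABC, abstractmethod


class Owner(ABC):
    """Interface (hypothetical name): declares operations without implementing them."""

    @abstractmethod
    def acquire(self, prop):
        ...

    @abstractmethod
    def dispose(self, prop):
        ...


class Person(Owner):
    """Realizes Owner by storing properties in a list."""

    def __init__(self):
        self.properties = []

    def acquire(self, prop):
        self.properties.append(prop)

    def dispose(self, prop):
        self.properties.remove(prop)


class Corporation(Owner):
    """Realizes the same interface with a different internal representation."""

    def __init__(self):
        self.assets = set()

    def acquire(self, prop):
        self.assets.add(prop)

    def dispose(self, prop):
        self.assets.discard(prop)


person = Person()
person.acquire("house")
person.acquire("car")
person.dispose("car")

corporation = Corporation()
corporation.acquire("office")
```

The interface itself cannot be instantiated; only classes that implement every declared operation can.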
Generalization
A generalization connects a subclass to its superclass. A subclass
inherits the attributes and operations of its superclass.
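A minimal sketch of generalization in Python, using hypothetical Patient, InPatient and OutPatient classes. Each subclass inherits the superclass attributes and adds its own.

```python
class Patient:
    """Superclass: attributes shared by every kind of patient."""

    def __init__(self, name, date_of_birth):
        self.name = name
        self.date_of_birth = date_of_birth

    def describe(self):
        return f"{self.name} (born {self.date_of_birth})"


class InPatient(Patient):
    """Subclass: inherits from Patient and adds room and bed."""

    def __init__(self, name, date_of_birth, room, bed):
        super().__init__(name, date_of_birth)
        self.room = room
        self.bed = bed

    def describe(self):
        return f"{super().describe()}, room {self.room}, bed {self.bed}"


class OutPatient(Patient):
    """Subclass: inherits from Patient and adds the last visit date."""

    def __init__(self, name, date_of_birth, last_visit):
        super().__init__(name, date_of_birth)
        self.last_visit = last_visit


in_patient = InPatient("Asha", "2001-05-17", room=12, bed=3)
out_patient = OutPatient("Ravi", "1999-11-02", last_visit="2020-07-01")
```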
Example-01
Example-02
Association
This kind of relationship represents a static relationship between classes A and
B. For example, an employee works for an organization.
A rule for naming associations: an association name is usually a verb or verb
phrase, or a noun or noun phrase.
Association without Direction / Association with Direction
Aggregation
• Aggregation is a special type of association that models a whole- part
relationship between aggregate and its parts.
• An aggregation is a special case of association denoting a consists-of
hierarchy
• The aggregate is from bottom class to top class, and the components
are from top class to bottom class.
Composition
• The composition is a special type of aggregation which denotes strong
ownership between two classes when one class is a part of another class.
• The composition is a part of aggregation, and it portrays the whole-part
relationship. It depicts dependency between a composite (parent) and its
parts (children), which means that if the composite is discarded, so will its
parts get deleted. It exists between similar objects.

(Example 01: If the Library class is destroyed, the Book class no longer exists.)

(Example 02: The composition association
relationship connects the Person class with the Brain class, Heart class, and Legs class. If the
person is destroyed, the brain, heart, and legs will also get discarded.)
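The difference between composition and aggregation can be sketched in Python. This is an approximation, because Python objects are reclaimed by garbage collection and not destroyed explicitly. The Library/Book pair follows Example 01; the Course/Student pair is a hypothetical aggregation added for contrast.

```python
class Book:
    def __init__(self, title):
        self.title = title


class Library:
    """Composition: the library creates its books and discards them with itself."""

    def __init__(self, name):
        self.name = name
        self.books = []

    def add_book(self, title):
        book = Book(title)  # the part is created by the whole
        self.books.append(book)
        return book

    def close(self):
        self.books.clear()  # discarding the whole discards its parts


class Student:
    def __init__(self, name):
        self.name = name


class Course:
    """Aggregation: the course refers to students that exist independently."""

    def __init__(self, title, students):
        self.title = title
        self.students = list(students)


library = Library("Central")
library.add_book("Data Modeling Fundamentals")
library.close()

students = [Student("Asha"), Student("Ravi")]
course = Course("Data Modeling", students)
del course  # the students outlive the course
```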
Representation of All the Edges of a Class Diagram
Class Diagram Example: Order System
Active Participation Test Question

Choose the correct choice:

Is a class an instance of an object?

A: Yes

B: No
Activity (Design a class diagram for the following explanation)
Consider the following set of requirements for a university information system that is used to keep track of students'
transcripts. The first requirement RS1 concerns the data to be kept in permanent storage, which is further broken
down into five parts.
• RS1a. The university keeps track of each student's name, student number, social security number, current address
and phone, permanent address and phone, birthdate, sex, class (freshman, sophomore, ..., graduate), major
department, minor department (if any), and degree program (B.A., B.S., ..., Ph.D.).
Both social security number and student number have unique values for each student.
• RS1b. Each department is described by a name, department code, office number, office phone and college. Both
name and code have unique values for each department.
• RS1c. Each course has a course name, description, code number, number of semester hours, level, and offering
department. The value of code number is unique for each course.
• RS1d. Each section has an instructor, semester, year, course, and section number. The section number
distinguishes different sections of the same course that are taught during the same semester/year; its values are 1,
2, 3, ...; up to the number of sections taught during each semester.
• RS1e. A grade report has a student, section, and grade.
• RS2 An administrator can update the courses to be taught by instructors, and enter the list of students taking a
course.
• RS3 An instructor can enter and update the grades of the course(s) taught by this instructor.
• RS4 A student can request a grade report from the information system.
Session No: CO1-5
Session Topic: Unstructured data model, Semi structured
data
Session Objective
• An ability to represent structured data, unstructured data and
semi-structured data.
• An ability to identify and use relevant tools to process, analyze
and visualize structured and unstructured data types.
Poll Question
• What is the output of select * from employee;
A: All the rows of employee table are displayed.
B: All the rows of employee table whose dept no=5 are
displayed.
C: Last 10 rows of employee table are displayed.
D: First 10 rows of employee table are displayed.
Structured Data
Structured data is most often categorized as quantitative data, and it's the
type of data most of us are used to working with. Think of data that fits
neatly within fixed fields and columns in relational databases and
spreadsheets.

Examples of structured data include names, dates, addresses, credit card
numbers, stock information, geolocation, and more.

Structured data is highly organized and easily understood by machine
language. Those working within relational databases can input, search,
and manipulate structured data relatively quickly. This is the most
attractive feature of structured data.
Example of Structured Data
Unstructured Data
Unstructured data is most often categorized as qualitative data, and it
cannot be processed and analyzed using conventional tools and methods.

Examples of unstructured data include text, video, audio, mobile activity,
social media activity, satellite imagery, surveillance imagery – the list goes
on and on.

Unstructured data is difficult to deconstruct because it has no pre-defined
model, meaning it cannot be organized in relational databases. Instead, non-
relational, or NoSQL databases, are best fit for managing unstructured
data.

Another way to manage unstructured data is to have it flow into a data lake,
allowing it to be in its raw, unstructured format.
Attentive Test Question
• A student database is an example of a _____________ data model.
A: Structured
B: Unstructured
C: Semi-structured
• What is the data model for 1) keeping an employee attendance record and
2) collecting movie reviews from social media?
A: structured, unstructured
B: structured, semi-structured
C: unstructured, structured
D: structured, structured
Semi-Structured data
Semi-structured data is information that does not reside in a
relational database but that has some organizational properties
that make it easier to analyze. With some processing, you can store
it in a relational database (this can be very hard for some kinds
of semi-structured data), but the semi-structured form exists to save space.
Example: XML data.
Books.xml
<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
<book category="cooking">
<title lang="en">Everyday Italian</title>
<author>Giada De Laurentiis</author>
<year>2005</year>
<price>30.00</price>
</book>

<book category="children">
<title lang="en">Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="web">
<title lang="en">XQuery Kick Start</title>
<author>James McGovern</author>
<author>Per Bothner</author>
<author>Kurt Cagle</author>
<author>James Linn</author>
<author>Vaidyanathan Nagarajan</author>
<year>2003</year>
<price>49.99</price>
</book>
<book category="web" cover="paperback">
<title lang="en">Learning XML</title>
<author>Erik T. Ray</author>
<year>2003</year>
<price>39.95</price>
</book>
</bookstore>
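Because every book carries its own tags, a program can pull out fields without a fixed table schema. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module and a shortened inline copy of Books.xml (two of the four books); the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Shortened inline copy of Books.xml (two of the four books), for illustration.
BOOKS_XML = """<bookstore>
  <book category="cooking">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="web">
    <title lang="en">XQuery Kick Start</title>
    <author>James McGovern</author>
    <author>Per Bothner</author>
    <year>2003</year>
    <price>49.99</price>
  </book>
</bookstore>"""

root = ET.fromstring(BOOKS_XML)

rows = []
for book in root.findall("book"):
    title = book.findtext("title")
    # A book may have one author or several; a fixed table row cannot
    # express that directly, but the tree structure can.
    authors = [a.text for a in book.findall("author")]
    rows.append((title, authors[0], len(authors)))

for title, first_author, n_authors in rows:
    print(f"{title:<20} {first_author:<22} authors={n_authors}")
```

The varying number of author elements per book is what makes this data semi-structured rather than structured.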
Big Data Analytics
Big data analytics is the often complex process of examining large
and varied data sets, or big data, to uncover information, such as
hidden patterns, unknown correlations, market trends and
customer preferences, that can help organizations make informed
business decisions.
3V Problem of Big Data Analytics
Big data analytics technologies and tools
Unstructured and semi-structured data types typically don't fit well in
traditional data warehouses that are based on relational databases oriented
to structured data sets. Further, data warehouses may not be able to handle
the processing demands posed by sets of big data that need to be updated
frequently or even continually, as in the case of real-time data on stock
trading, the online activities of website visitors or the performance of
mobile applications.

As a result, many of the organizations that collect, process and analyze big
data turn to NoSQL databases, as well as Hadoop and its companion data
analytics tools, including:
• YARN: a cluster management technology and one of the key features in
second-generation Hadoop.
• MapReduce: a software framework that allows developers to write
programs that process massive amounts of unstructured data in parallel
across a distributed cluster of processors or stand-alone computers.
• Spark: an open source, parallel processing framework that enables users to
run large-scale data analytics applications across clustered systems.
• HBase: a column-oriented key/value data store built to run on top of the
Hadoop Distributed File System (HDFS).
• Hive: an open source data warehouse system for querying and analyzing
large data sets stored in Hadoop files.
• Kafka: a distributed publish/subscribe messaging system designed to replace
traditional message brokers.
• Pig: an open source technology that offers a high-level mechanism for the
parallel programming of MapReduce jobs executed on Hadoop clusters.
Overall Architecture of the Hadoop MapReduce Framework
Attentive Test Question
• Big data analytics is more relevant to the __________ data model.
A: Structured
B: Unstructured
C: Semi-structured

• I have a text data set of Twitter data. Which tool is more relevant for
visualizing the analytical results?
A: Oracle SQL
B: Microsoft Excel
C: Apache Hadoop
References:

• https://learn.g2.com/structured-vs-unstructured-data
• Book: Big Data For Dummies by Judith Hurwitz, Alan Nugent, Dr. Fern
Halper, and Marcia Kaufman
Will Resume After 5 minutes........
Activity: Model Preparation of Big Data Analytics

In the next slide, a text of five paragraphs is given.

• Each computer is assigned one paragraph of the text and derives the
frequency of its words.
• Assume there are five computers, named C1, C2, C3, C4 and C5, which
work in parallel to do this job.
• After the frequencies are derived locally by the individual computers, they
communicate them to machine C1. Computer C1 accumulates all the individual
word frequencies and prepares the final word frequencies.
• Present graphically the final report generated by computer C1.
Text Data for Big Data Analysis using the Hadoop Framework

Para01: Hadoop MapReduce is a software framework for easily writing applications which process vast
amounts of data
Para02: A MapReduce job usually splits the input data-set into independent chunks which are processed
by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which
are then input to the reduce tasks.
Para03: Typically the compute nodes and the storage nodes are the same, that is, the MapReduce
framework and the Hadoop Distributed File System are running on the same set of nodes. This
configuration allows the framework to effectively schedule tasks on the nodes where data is already
present, resulting in very high aggregate bandwidth across the cluster.
Para04: The MapReduce framework consists of a single master ResourceManager, one worker
NodeManager per cluster-node, and MRAppMaster per application.
Para05: The Hadoop job client then submits the job and configuration to the ResourceManager which
then assumes the responsibility of distributing the software and configuration to the workers, scheduling
tasks and monitoring them, providing status and diagnostic information to the job-client.
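The activity can be rehearsed on one machine before trying it by hand. The sketch below is a plain Python simulation of the map and reduce steps, not real Hadoop code; the function names are hypothetical, and the five strings are shortened placeholder sentences rather than the full paragraphs above.

```python
from collections import Counter
import re

def map_word_counts(paragraph):
    """Map step: one computer counts the words in its own paragraph."""
    words = re.findall(r"[a-z]+", paragraph.lower())
    return Counter(words)

def reduce_word_counts(partial_counts):
    """Reduce step: C1 adds up the counts reported by every computer."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total

# Shortened placeholder sentences (not the full activity text), one per computer.
paragraphs = {
    "C1": "Hadoop MapReduce is a software framework",
    "C2": "A MapReduce job splits the input data",
    "C3": "The framework sorts the outputs of the maps",
    "C4": "The MapReduce framework has a single master",
    "C5": "The Hadoop job client submits the job",
}

# Each computer works on its own paragraph (sequentially here, for simplicity).
partials = [map_word_counts(text) for text in paragraphs.values()]
final_counts = reduce_word_counts(partials)

# Text bar chart of the five most frequent words in C1's final report.
for word, count in final_counts.most_common(5):
    print(f"{word:<12} {'#' * count} {count}")
```

Replacing the placeholder strings with Para01 to Para05 gives the final frequencies that C1 would report.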
Session No: CO1-5
Session Topic: Unstructured data model, Semi-structured data
Session Objective
• An ability to represent structured data, unstructured data and
semi structured data.
• An ability to identify and use relevant tools to process, analyze
and visualize structured and unstructured data types.
Poll Question
• What is the output of select * from employee;
A: All the rows of employee table are displayed.
B: All the rows of employee table whose dept no=5 are
displayed.
C: Last 10 rows of employee table are displayed.
D: First 10 rows of employee table are displayed.
Structured Data
Structured data is most often categorized as quantitative data, and it's the
type of data most of us are used to working with. Think of data that fits
neatly within fixed fields and columns in relational databases and
spreadsheets.

Examples of structured data include names, dates, addresses, credit card
numbers, stock information, geolocation, and more.

Structured data is highly organized and easily processed by machines.
Those working with relational databases can input, search, and manipulate
structured data relatively quickly. This is the most attractive feature of
structured data.
Example of Structured Data
Unstructured Data
Unstructured data is most often categorized as qualitative data, and it
cannot be processed and analyzed using conventional tools and methods.

Examples of unstructured data include text, video, audio, mobile activity,
social media activity, satellite imagery and surveillance imagery; the list
goes on and on.

Unstructured data is difficult to deconstruct because it has no pre-defined
model, meaning it cannot easily be organized in relational databases.
Instead, non-relational (NoSQL) databases are the best fit for managing
unstructured data.

Another way to manage unstructured data is to have it flow into a data lake,
allowing it to remain in its raw, unstructured format.
Poll Question
• A student database is an example of the _____________ data model.
A: Structured
B: Unstructured
C: Semi-structured
• Choose the suitable data models for {employee attendance record, film
reviews from social media}, in that order.
A: structured, unstructured
B: structured, semi-structured
C: unstructured, structured
D: structured, structured
Semi-Structured data
Semi-structured data is information that does not reside in a
relational database but that has some organizational properties
that make it easier to analyze. With some processing, it can be
stored in a relational database (though this can be very hard for
some kinds of semi-structured data), but the semi-structured form
exists to save space.
Example: XML data.
Books.xml
<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
<book category="cooking">
<title lang="en">Everyday Italian</title>
<author>Giada De Laurentiis</author>
<year>2005</year>
<price>30.00</price>
</book>

<book category="children">
<title lang="en">Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="web">
<title lang="en">XQuery Kick Start</title>
<author>James McGovern</author>
<author>Per Bothner</author>
<author>Kurt Cagle</author>
<author>James Linn</author>
<author>Vaidyanathan Nagarajan</author>
<year>2003</year>
<price>49.99</price>
</book>

<book category="web" cover="paperback">
<title lang="en">Learning XML</title>
<author>Erik T. Ray</author>
<year>2003</year>
<price>39.95</price>
</book>
</bookstore>
Title               Author
Everyday Italian    Giada De Laurentiis
Harry Potter        J K. Rowling
XQuery Kick Start   James McGovern
Learning XML        Erik T. Ray


Properties | Structured data | Semi-structured data | Unstructured data
Technology | It is based on relational database tables | It is based on XML/RDF | It is based on character and binary data
Transaction management | Matured transaction and various concurrency techniques | Transaction is adapted from DBMS, not matured | No transaction management and no concurrency
Version management | Versioning over tuples, rows, tables | Versioning over tuples or graph is possible | Versioned as a whole
Flexibility | It is schema dependent and less flexible | It is more flexible than structured data but less flexible than unstructured data | It is very flexible and there is an absence of schema
Scalability | It is very difficult to scale the DB schema | Its scaling is simpler than structured data | It is very scalable
Robustness | Very robust | New technology, not very widespread | -
Query performance | Structured queries allow complex joins | Queries over anonymous nodes are possible | Only textual queries are possible
Big Data Analytics
Big data analytics is the often complex process of examining large
and varied data sets, or big data, to uncover information, such as
hidden patterns, unknown correlations, market trends and
customer preferences, that can help organizations make informed
business decisions.
3V Problem of Big Data Analytics
Big data analytics technologies and tools
Unstructured and semi-structured data types typically don't fit well in
traditional data warehouses that are based on relational databases oriented
to structured data sets. Further, data warehouses may not be able to handle
the processing demands posed by sets of big data that need to be updated
frequently or even continually, as in the case of real-time data on stock
trading, the online activities of website visitors or the performance of
mobile applications.

As a result, many of the organizations that collect, process and analyze big
data turn to NoSQL databases, as well as Hadoop and its companion data
analytics tools, including:
• YARN: a cluster management technology and one of the key features in
second-generation Hadoop.
• MapReduce: a software framework that allows developers to write
programs that process massive amounts of unstructured data in parallel
across a distributed cluster of processors or stand-alone computers.
• Spark: an open source, parallel processing framework that enables users
to run large-scale data analytics applications across clustered systems.
• HBase: a column-oriented key/value data store built to run on top of the
Hadoop Distributed File System (HDFS).
• Hive: an open source data warehouse system for querying and analyzing
large data sets stored in Hadoop files.
• Kafka: a distributed publish/subscribe messaging system designed to
replace traditional message brokers.
• Pig: an open source technology that offers a high-level mechanism for the
parallel programming of MapReduce jobs executed on Hadoop clusters.
Overall Architecture of the Hadoop MapReduce Framework
Poll Question
• Big data analytics is more relevant to the __________ data model.
A: Structured
B: Unstructured
C: Semi-structured

• I have a text data set of Twitter data. Which tool is more relevant for
visualizing the analytical results?
A: Oracle SQL
B: Microsoft Excel
C: Apache Hadoop
References:

• https://learn.g2.com/structured-vs-unstructured-data
• Book: Big Data For Dummies by Judith Hurwitz, Alan Nugent, Dr. Fern
Halper, and Marcia Kaufman
Will Resume After 5 minutes........
Activity (Quiz)

10 quiz questions to be answered within the next 20 minutes.

Kindly open your LMS for details.
Session No: CO1-2
Session Topic: Introduction to Data Visualisation &
Data Model, Conceptual model
Session Objective
• An ability to know the importance of data visualization
through different examples.
• An ability to understand the software that plays a vital role
in data visualization.
• An ability to understand the basic purpose of the data
model.
• An ability to know the data model representation with a
suitable example.
Poll Question-01
Is a chart an example of data visualization?
Choice1: Yes
Choice2: No
Why Data Display Requires Planning
• Too Much Information
When you hear the term “information overload,” you probably
know exactly what it means because it’s something you deal with
daily.

• Data Collection
We’re getting better and better at collecting data, but we lag in what
we can do with it. Lots of data are available around us, but it’s not
being used to its greatest potential because it’s not being visualized
as well as it could be.
• Data Never Stays the Same
What happens when things start moving? How do we interact with
“live” data? How do we unravel data as it changes over time?

• What Is the Question?
The most important part of understanding data is identifying the
question that you want to answer. Rather than thinking about the
data that was collected, think about how it will be used and work
backward to what was collected. The more specific you can make
your question, the more specific and clear the visual result will be.
Poll Question-02:

In the previous diagram, does data visualization appear after data
analytics?
Choice1: Yes
Choice2: No
Example-01
Example-02
Example-03
Poll Question-02:

Choose the correct answer.

• Through data visualization, has the data been presented with
intelligence?

Choice1: Yes
Choice2: No
Poll Question
Which statement about data and information is correct?
A: Data is input and information is processed input
B: Data is processed input and information is input
C: Data is input and information is input
D: Data and information both are processed input
What Is a Data Model?
Data modeling provides a method and means for describing the real-
world information requirements in a manner understandable to the
stakeholders in an organization. In addition, data modeling enables the
database practitioners to take these information requirements and
implement these as a computer database system to support the
business of the organization.
So, what is a data model?
A data model is a device that
• helps the users or stakeholders understand clearly the database system
that is being implemented based on the information requirements of an
organization, and
• enables the database practitioners to implement the database system
exactly conforming to the information requirements.
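One way to see the difference between a data model and the data itself is to write both down. The sketch below is only an illustration, assuming Python dataclasses; the Student class, its three attributes and the two sample rows are hypothetical.

```python
from dataclasses import dataclass, fields

@dataclass
class Student:
    """Hypothetical model of a student: it names the attributes, not the rows."""
    roll_no: int
    name: str
    branch: str

# The data model: which attributes every student record must have.
attribute_names = [f.name for f in fields(Student)]

# The data: actual rows that conform to the model.
rows = [
    Student(1, "Asha", "CSE"),
    Student(2, "Ravi", "ECE"),
]

print("Model attributes:", attribute_names)
print("Number of rows:", len(rows))
```

The model stays the same whether the table holds two rows or two hundred and fifty; only the data grows.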
Poll Question
Choose the correct choice:

A student table with 10 attributes and 250 rows of data is called a data
model.

Choice1: Yes
Choice2: No
Conceptual models
Conceptual models:
• Provide flexible data-structuring capabilities
• Present a “community view”: the logical structure of the entire database
• Contain data stored in the database
• Show relationships among data, including:
   • Constraints
   • Semantic information (e.g., business rules)
   • Security and integrity information
• Consider a database as a collection of entities (objects) of various kinds
• Are the basis for identification and high-level description of main data
objects; they avoid details
• Are database independent, regardless of the database you will be using.
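A conceptual model can be written down without any reference to tables, indexes or a particular database product. The following is a hedged sketch in Python; the Entity and Relationship classes, the Employee/Department example and the business rule are all hypothetical and serve only to show entities, a relationship and semantic information kept side by side.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entity:
    name: str
    attributes: tuple

@dataclass(frozen=True)
class Relationship:
    name: str
    source: str
    target: str
    cardinality: str  # e.g. "many-to-one"

# Hypothetical conceptual model: no table layouts, indexes or storage details.
entities = [
    Entity("Employee", ("emp_id", "name")),
    Entity("Department", ("dept_no", "dept_name")),
]
relationships = [
    Relationship("works_in", "Employee", "Department", "many-to-one"),
]
# Business rule recorded as semantic information alongside the model.
business_rules = ["Every employee works in exactly one department."]

def undefined_entities(entities, relationships):
    """Return entity names used by a relationship but never declared."""
    declared = {e.name for e in entities}
    used = {r.source for r in relationships} | {r.target for r in relationships}
    return sorted(used - declared)

print(undefined_entities(entities, relationships))
```

The small consistency check at the end is an example of the kind of integrity information a conceptual model can carry before any database is chosen.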
Poll Question
• Is an ER diagram a representation of a data model?
Choice 1: Yes
Choice 2: No

• In the previous figure, at what level is the ER diagram described?
Choice 1: Conceptual Level
Choice 2: External Level
Choice 3: Physical Level
Reference
• Fry, Visualizing Data, O’Reilly Media, 2008, ISBN 0596514557.
• Munzner, Visualization Analysis and Design, 2014, ISBN 1466508914.
• https://www.youtube.com/watch?v=aHaOIvR00So
• Paulraj Ponniah, Data Modeling Fundamentals: A Practical Guide for IT
Professionals.
Will Resume After 5 minutes........
Activity (Refer LMS for Submission)
Case Study
Design a data model for the following scenario.
A school system needs to build a database application. The system consists of about 5000 students
taught by 200 teachers. There are 50 classrooms to cater to the needs of the students. There is a library
with around 500 text and reference books. The students must enroll in a course during their study.
The guidelines for developing the data model are given below.
• A student needs to enroll in one course at a time.
• Not every student has a faculty member for teaching, but every student must have a faculty member
for mentoring.
• Every course must have a course faculty.
• The library shall be used by both students and teachers. Both may be issued multiple books.
• Every classroom shall be occupied by students on a rotational basis.
