Data Management For Analytics Notes
- Functional Dependencies
● Definition: Constraint that determines the relation of one attribute to another attribute in a database
- Denoted by an arrow →
- Example: X → Y
- In the example below, Employee Name, Salary, and City are all functionally dependent on Employee Number. So we can
say Employee Number → Employee Name/Salary/City
● Multivalued Dependency
- Definition: Occurs when there are two or more independent attributes in a table that are each dependent on
another attribute
● Transitive Dependency
- Definition: Occurs between 3 or more attributes. Essentially, it’s an indirect non-trivial dependency: if X → Y and Y → Z, then X → Z holds indirectly
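The functional-dependency idea above can be checked mechanically against actual records. A minimal Python sketch (the helper name `holds_fd` and the sample employee records are my own illustrations, not from the notes):

```python
def holds_fd(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds in rows."""
    seen = {}  # maps each lhs value to the single rhs value it determines
    for row in rows:
        key = row[lhs]
        if key in seen and seen[key] != row[rhs]:
            return False  # same determinant maps to two different values
        seen[key] = row[rhs]
    return True

employees = [
    {"EmpNo": 1, "Name": "Ann", "City": "Oslo"},
    {"EmpNo": 2, "Name": "Bob", "City": "Oslo"},
]

print(holds_fd(employees, "EmpNo", "Name"))  # True: EmpNo -> Name
print(holds_fd(employees, "City", "Name"))   # False: one City, two Names
```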
- Process of Normalization
● From 0NF to 1NF
- MatrixNum is the unique key for all the records
- We can say that the rest of the columns are ‘functionally dependent’ on MatrixNum
- From 0NF to 1NF, we’re removing the nesting/grouping on the Unique Key to associate it 1-1 to the
rest of the columns
- The actual implementation is: (1) You come up with the data model, (2) You write SQL scripts to
represent the data model & create the tables, (3) You run the SQL scripts and insert the actual data
● Conceptual → You think of/come up with the tables/fields you require
Logical → You determine the relationships between the different tables
Physical → You come up with the details for the different fields/tables (E.g. Set character limits/data types, etc…)
- Entity Relationship Diagram (ER Diagram)
● Basically a structural diagram used in database design; contains entity and maps out the relationships
between entities
● Entity Attributes
● Primary Key
- Refers to a special entity attribute that uniquely defines a record in a table
- This means that values in the column/field that is the PK must not repeat in the table
- Example: the id field of a given table
● Foreign Key
- Is essentially a PK of another table
- FK need not be unique in the table where it is not the PK
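The PK/FK rules above can be demonstrated with sqlite3. A sketch using a hypothetical team/player pair of tables (names are my own; note SQLite only enforces FKs when the foreign_keys pragma is on):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks FKs only when this is on

conn.executescript("""
CREATE TABLE team (
    id   INTEGER PRIMARY KEY,            -- PK: unique per record
    name TEXT NOT NULL
);
CREATE TABLE player (
    id      INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,
    team_id INTEGER REFERENCES team(id)  -- FK: team's PK, repeats allowed here
);
INSERT INTO team VALUES (1, 'Reds');
INSERT INTO player VALUES (10, 'Ann', 1), (11, 'Bob', 1);  -- same team_id twice: fine
""")

# A player pointing at a non-existent team violates the FK and is rejected:
try:
    conn.execute("INSERT INTO player VALUES (12, 'Cid', 99)")
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)
```

This is also the one-to-many cardinality from the ER-model section: one team row, many player rows carrying its id.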
- Concept of Cardinality for ER-Model
● ER Model relationships are classified by their cardinality
● Cardinality refers to the possible number of occurrences in one entity which is associated with the
number of occurrences in another
- E.g. ONE team has MANY players, we can then say Team has a one-to-many cardinality with Player
● Notations:
● Reading/Interpreting an ER Diagram/Relation
- Between Customer & Pizza, to establish the relationship Customer → Pizza, we must look at the crow’s
feet notation attached to Pizza
- In this case, it’s 0/Many attached to Pizza → Means a customer can order either zero or many pizzas
- For the converse, it’s also 0/Many attached to Customer → Means a pizza can be ordered by zero or many
customers
- Weak Entity
● Defined as an entity that does not have attributes/columns that can identify its records uniquely
- E.g. No primary key in weak entity
- Example: Intermediate tables → Where the table consists of the PKs of two other tables
- Database Design Misc Example
● In this example, COMPANY & PART tables can’t be joined to each other
- Solution is to create an intermediary table mapping CompanyName & PartNumber to allow both tables
to join
- Lecturer: Would be good to create a PK in COMPANY_PART_INT table for ease of reference
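The COMPANY/PART fix above can be sketched in sqlite3. The three table names come from the notes; the column names and sample rows are assumptions, and the surrogate id follows the lecturer's suggestion:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE company (company_name TEXT PRIMARY KEY);
CREATE TABLE part    (part_number  TEXT PRIMARY KEY);

-- Intermediary (junction) table: each row pairs the PKs of the two tables.
-- Surrogate PK 'id' added for ease of reference, per the lecturer's remark.
CREATE TABLE company_part_int (
    id           INTEGER PRIMARY KEY,
    company_name TEXT REFERENCES company(company_name),
    part_number  TEXT REFERENCES part(part_number)
);

INSERT INTO company VALUES ('Acme'), ('Globex');
INSERT INTO part    VALUES ('P-100'), ('P-200');
INSERT INTO company_part_int (company_name, part_number)
VALUES ('Acme', 'P-100'), ('Acme', 'P-200'), ('Globex', 'P-100');
""")

# COMPANY and PART can now be joined *through* the junction table:
rows = conn.execute("""
    SELECT c.company_name, p.part_number
    FROM company c
    JOIN company_part_int cp ON cp.company_name = c.company_name
    JOIN part p              ON p.part_number   = cp.part_number
    ORDER BY 1, 2
""").fetchall()
print(rows)  # [('Acme', 'P-100'), ('Acme', 'P-200'), ('Globex', 'P-100')]
```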
- Types of Data
● Transactional Data
- Refers to the data that is captured from transactions
- Example: Time of transaction, place, price, payment method employed, etc…
- Usually captured at point of sale through a POS system
● Analytical Data
- Transaction data that is transformed via calculations/analysis
● Master Data
- Refers to the actual critical business objects upon which transactions are performed
- Data Warehouse
● The general idea for data storage in Data Warehouse is to provide information & knowledge to support
decision making in your org
● Having the data in normalized/OLTP form is usually not ideal, as it can be computationally expensive to
join data together to perform analysis
● Therefore, often data in Data Warehouse is stored in a de-normalized form
- OLTP vs OLAP
● Basically, you are de-normalizing the data for high-performance access (fewer joins)
● You can represent the denormalized data via 2 schemas:
○ Star Schema
- Structure that contains a fact table in the center (Fact tables are tables that contain
transactional data; E.g. Sales)
- Fact table is surrounded by dimension tables containing reference data (Dimension tables are
tables that contain reference information; E.g. Store_id to Store Name)
○ Snowflake Schema
- Variant of the star schema where dimension tables do not contain denormalized data
● Terminology
- Dimension Tables: Tables connected to fact table containing reference/static data
- Attribute: Non-key fields in Dimension tables
- Fact Table: Central table in a dimensional model containing facts/transactional data
- Facts: Business measures/metrics
- Grain/Granularity: Level of detail/frequency at which data in Fact table is recorded
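The fact/dimension split above can be sketched in sqlite3. A minimal star schema with one fact table and one dimension (the store_id → store name example from the notes; table names and figures are my own):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension table: reference data (store_id -> store name).
CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, store_name TEXT);

-- Fact table: transactional data, grain = one row per sale.
CREATE TABLE fact_sales (
    sale_id  INTEGER PRIMARY KEY,
    store_id INTEGER REFERENCES dim_store(store_id),
    amount   REAL  -- the fact / business measure
);

INSERT INTO dim_store VALUES (1, 'Downtown'), (2, 'Airport');
INSERT INTO fact_sales VALUES (1, 1, 9.5), (2, 1, 12.0), (3, 2, 7.25);
""")

# A typical OLAP query: one join from fact to dimension, then aggregate.
rows = conn.execute("""
    SELECT d.store_name, SUM(f.amount) AS total_sales
    FROM fact_sales f
    JOIN dim_store d ON d.store_id = f.store_id
    GROUP BY d.store_name
    ORDER BY total_sales DESC
""").fetchall()
print(rows)  # [('Downtown', 21.5), ('Airport', 7.25)]
```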
3. Data Integration and Interoperability
- Data Integration
● Process of bringing data from disparate sources together to provide users with a unified view
● Purpose: To make data more easily available and easier to consume by systems/end-users
● Benefits: Frees up resources, improves data quality, improves operational efficiency, and can yield valuable
insights from data
2. Left/Right Join
3. Inner Join
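The join types listed above differ in how they treat unmatched rows. A sqlite3 sketch (customer/orders tables and rows are my own illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders   (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
INSERT INTO customer VALUES (1, 'Ann'), (2, 'Bob');  -- Bob has no orders
INSERT INTO orders   VALUES (10, 1, 25.0);
""")

inner = conn.execute("""
    SELECT c.name, o.total FROM customer c
    JOIN orders o ON o.customer_id = c.id
    ORDER BY c.id
""").fetchall()
print(inner)  # [('Ann', 25.0)] -- unmatched rows are dropped

left = conn.execute("""
    SELECT c.name, o.total FROM customer c
    LEFT JOIN orders o ON o.customer_id = c.id
    ORDER BY c.id
""").fetchall()
print(left)   # [('Ann', 25.0), ('Bob', None)] -- unmatched rows kept, filled with NULL
```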
○ ETL
○ ELT
● ETL vs ELT
- If you transform first, you are in practice fixing the schema in advance, so the stored data might be too
inflexible for later use
- If you load first, the risk is that the untransformed data piles up and becomes unusable ‘rubbish’
● Data Warehouse
- Structured data is loaded into the data warehouse for analytical use
● Data Lake
- Data lake is the place where all sorts of data is stored
- Structured, textual, unstructured data
● Data Lakehouse
- Similar to Data Lake but with data management architecture baked in to index/cache all forms of data
stored in the data lake
- Data Consolidation
● Basically consolidating data from different silos to a single place
- Data Virtualization
● Bring all data from different sources/places to one platform
● One platform to access/combine/analyze the dataset; reduce access cost
- Data Federation
● A software/platform that allows multiple databases to function as one
● Data from multiple sources are combined into a common model; i.e. can query/join using a common
platform/schema
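As a toy analogy for federation, SQLite can attach several database files to one connection and join across them through a single query. Everything here (file layout, table names) is my own illustration:

```python
import os
import sqlite3
import tempfile

tmp = tempfile.mkdtemp()
sales_path = os.path.join(tmp, "sales.db")
hr_path = os.path.join(tmp, "hr.db")

# Two independent databases standing in for separate source systems.
with sqlite3.connect(sales_path) as db:
    db.execute("CREATE TABLE sale (emp_id INTEGER, amount REAL)")
    db.execute("INSERT INTO sale VALUES (1, 100.0), (1, 50.0), (2, 80.0)")
with sqlite3.connect(hr_path) as db:
    db.execute("CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, name TEXT)")
    db.execute("INSERT INTO employee VALUES (1, 'Ann'), (2, 'Bob')")

# 'Federated' view: one connection attaches both files and joins across
# them with a single query, as if they were one database.
fed = sqlite3.connect(sales_path)
fed.execute("ATTACH DATABASE ? AS hr", (hr_path,))
rows = fed.execute("""
    SELECT e.name, SUM(s.amount)
    FROM sale s JOIN hr.employee e ON e.emp_id = s.emp_id
    GROUP BY e.name ORDER BY e.name
""").fetchall()
print(rows)  # [('Ann', 150.0), ('Bob', 80.0)]
```

Real federation platforms do this across heterogeneous systems and without moving the data; the attach trick only conveys the "query many stores through one schema" idea.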
- Data Replication
● Data is intentionally stored in > 1 site/server
● Purpose is to allow data to be available in case of downtime/heavy traffic; idea of improving data
accessibility/uptime
- Data Harmonization
- Data Pipeline
- Data Engineering
- Data Fabric
3. Data Project Implementation
4. Data Governance - Data Quality, Security & Privacy
- Data Quality
● Quality is assessed with respect to the data’s fitness for the purpose it was intended for
- High quality means it accurately represents the real-world constructs it describes
- Bad data results in low information quality; as it moves up the management hierarchy, it leads
to bad business decisions
1. Data Sampling
(i) Random
(ii) Sampling with Fixed Criteria
2. Data Profiling
- Process where data is examined & analyzed; generate summary statistics
- Purpose: Give an overview of data to ensure any discrepancies/risks/trends are spotted
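Both sampling approaches and basic profiling can be sketched with the standard library alone (the price figures are invented; real profiling tools report many more statistics):

```python
import random
import statistics

prices = [4.5, 5.0, 5.5, 6.0, 120.0, 5.2, 4.9, 5.1]  # made-up transaction data

# (i) Random sampling: inspect a random subset of records.
random.seed(42)  # fixed seed so the sample is reproducible
sample = random.sample(prices, k=4)

# (ii) Sampling with fixed criteria: e.g. only transactions above $10.
flagged = [p for p in prices if p > 10]

# Profiling: summary statistics that surface discrepancies/outliers.
profile = {
    "count": len(prices),
    "min": min(prices),
    "max": max(prices),
    "mean": round(statistics.mean(prices), 2),
    "median": statistics.median(prices),
}
print(profile)  # the 120.0 outlier shows up as a max far above the median
print(flagged)  # [120.0]
```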
- Data Dictionary
- Data Mapping
● Definition: Relationship between 2/more datasets; matching/connecting fields from one dataset to another
● Purpose: Link data fields across areas to create standardized accurate data
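A minimal sketch of field-level data mapping: each source system names the same business field differently, and the mapping renames them to one standardized shape. All source/field names here are hypothetical:

```python
# Mapping from each source's field names to the standardized target names.
FIELD_MAP = {
    "crm":     {"cust_nm": "customer_name", "zip": "postal_code"},
    "billing": {"CustomerName": "customer_name", "postcode": "postal_code"},
}

def to_standard(source, record):
    """Rename a record's fields to the standardized names (unmapped fields kept)."""
    mapping = FIELD_MAP[source]
    return {mapping.get(k, k): v for k, v in record.items()}

a = to_standard("crm",     {"cust_nm": "Ann", "zip": "739099"})
b = to_standard("billing", {"CustomerName": "Ann", "postcode": "739099"})
print(a == b)  # True: both sources now share one standardized shape
```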
3. Data Encryption
- Definition: Translate data into another form → Only people with access to a secret key/password can read it
3. Data Lineage
- Need to understand how changes upstream may affect downstream sources
- E.g. Upstream data source gets an update, downstream data might be impacted
4. Data Encryption
- Whether the data is encrypted at rest or encrypted while in-transit
- Data Classification
● Definition: Process of organizing information/data assets using an agreed-upon categorization logic
- Result usually is a large repository of metadata useful to make further decision/to facilitate use and
governance of data
- E.g. Can make decisions on the value/security/access rights/usage rights/privacy/storage
location/quality/retention period of the data
● Example - GDPR Classification Tags
- Data Lineage