Data Management For Analytics Notes

1. The document provides an introduction to data management concepts including data frameworks, data storage types, data modeling, database design, and database properties. 2. It discusses data modeling techniques like normalization, functional dependencies, and entity relationship diagrams. 3. Key aspects of database design are covered including the phases of design from conceptual to logical to physical models.

1. Introduction to Data Management for Analytics


(1) Data Management Frameworks
Note: Review PDF 1.0, then do a Google search on the frameworks
- DAMA-DMBOK

(2) Types of Data Storage in Cloud Computing


2. Data Modelling & Design
- Data Modelling
● Process of learning about the data; constructing a visual representation of the parts of the data
● Goal is to show relationships between structures, data points, data groupings, and attributes of the data

- Functional Dependencies
● Definition: Constraint that determines the relation of one attribute to another attribute in a database
- Denoted by an arrow →
- Example: X → Y
- In the sketch below, Employee Name/Salary/City are all functionally dependent on Employee Number, so we can say Employee Number → Employee Name/Salary/City
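A minimal SQL sketch of the employee example (the original figure is not included, so table and column names are assumptions); declaring the employee number as the primary key is what enforces EmpNumber → {EmpName, Salary, City}:

    -- Illustrative only: each EmpNumber appears exactly once, so knowing
    -- it determines the name, salary and city on that row
    CREATE TABLE employee (
        EmpNumber  VARCHAR(10) PRIMARY KEY,
        EmpName    VARCHAR(50),
        Salary     DECIMAL(10, 2),
        City       VARCHAR(50)
    );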

● Multivalued Dependency
- Definition: Occurs when there are two or more independent attributes in a table that are both dependent on another attribute
- Example (sketched below): maf_year and color are independent of each other
- However, both are dependent on car_model
- Therefore, both columns/attributes are ‘multivalued dependent’ on car_model
- We can denote the relationships as car_model →→ maf_year | car_model →→ color
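A sketch of the kind of table being described (rows are illustrative, since the original figure is not included); each car_model is paired with every combination of its years and colors, which is the signature of a multivalued dependency:

    car_model | maf_year | color
    ----------+----------+---------
    H001      | 2017     | Metallic
    H001      | 2017     | Green
    H001      | 2018     | Metallic
    H001      | 2018     | Green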
● Trivial Functional Dependency
- Definition: Occurs when the attribute that has the dependency is a subset of the attribute it is dependent on
- Example: X → Y is a trivial functional dependency if Y is a subset of X
- Example: (Emp_id, Emp_name) → Emp_id is a trivial functional dependency, as Emp_id is a subset of (Emp_id, Emp_name)

● Non-Trivial Functional Dependency
- Definition: Occurs when the attribute that has the dependency is not a subset of the attribute it is dependent on
- Example: Company → CEO. CEO is not a subset of Company; we must know the company before we know the CEO
- Similarly, CEO → Age. We must know who the CEO is before we can tell his/her age

● Transitive Dependency
- Definition: Occurs between 3 or more attributes; essentially an indirect non-trivial dependency
- Example: Company → Age is a transitive dependency
- We know Company → CEO and CEO → Age
- Therefore, Company → Age; we must know the company before we know the CEO, and we must know who the CEO is before we know the age
- Normalization
● Refers to the process of removing dependencies from the tables/data
● Purpose is to avoid data redundancy and insertion, update & deletion anomalies

- Process of Normalization
● From 0NF to 1NF
- MatrixNum is the unique key for all the records
- We can say that the rest of the columns are ‘functionally dependent’ on MatrixNum
- From 0NF to 1NF, we’re removing the nesting/grouping on the unique key to associate it 1-1 to the rest of the columns (see the sketch below)
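Since the original figure is not included, here is a minimal sketch of the idea (values illustrative): 0NF keeps a repeating group inside one record, while 1NF gives every value its own row against the key:

    0NF (repeating group):              1NF (one value per row):
    MatrixNum | Course                  MatrixNum | Course
    ----------+----------------         ----------+-------
    M001      | {DB101, ST202}          M001      | DB101
                                        M001      | ST202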

● From 1NF to 2NF
- Essentially you want to remove partial functional dependencies in the table
- A partial functional dependency arises when your data requires more than 1 unique key working together to uniquely identify/make sense of the data
- Example: MatrixNum relates to Name/Programme; EnrollNum relates to Semester, AcadYear, Course, CourseName
- MatrixNum + EnrollNum together relate to Result
- To normalize the data → you need to split it into separate tables (see the sketch below)
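A hedged SQL sketch of the split described above (table names and data types are assumptions; the columns follow the example):

    -- Name/Programme depend on MatrixNum alone
    CREATE TABLE student (
        MatrixNum  VARCHAR(10) PRIMARY KEY,
        Name       VARCHAR(50),
        Programme  VARCHAR(50)
    );

    -- Semester/AcadYear/Course/CourseName depend on EnrollNum alone
    CREATE TABLE enrolment (
        EnrollNum  VARCHAR(10) PRIMARY KEY,
        Semester   VARCHAR(10),
        AcadYear   VARCHAR(9),
        Course     VARCHAR(10),
        CourseName VARCHAR(50)
    );

    -- Result depends on the full key MatrixNum + EnrollNum
    CREATE TABLE result (
        MatrixNum  VARCHAR(10) REFERENCES student (MatrixNum),
        EnrollNum  VARCHAR(10) REFERENCES enrolment (EnrollNum),
        Result     VARCHAR(2),
        PRIMARY KEY (MatrixNum, EnrollNum)
    );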

● From 2NF to 3NF
- Idea here is that you want to remove transitive dependencies from the table
- In the 2NF table, EnrollNum is associated with Course, and Course is associated with CourseName
- So the transitive dependency is EnrollNum → Course → CourseName
- To normalize the data → you need to split Course & CourseName into a separate mapping table (see the sketch below)
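Continuing the sketch above (names and types remain assumptions), the 3NF step pulls the Course → CourseName mapping out of the enrolment table:

    -- Course → CourseName becomes its own mapping table
    CREATE TABLE course (
        Course     VARCHAR(10) PRIMARY KEY,
        CourseName VARCHAR(50)
    );

    -- Enrolment keeps only the Course code as a reference
    CREATE TABLE enrolment_3nf (
        EnrollNum  VARCHAR(10) PRIMARY KEY,
        Semester   VARCHAR(10),
        AcadYear   VARCHAR(9),
        Course     VARCHAR(10) REFERENCES course (Course)
    );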
- Data Model & Data Modelling Notation
● Chen Notation
- Lecturer’s Comments: Very old and overly technical
● Crow’s Foot Notation
- Lecturer’s Comments: This is the industry standard; we will use this for purposes of the course as well
● Unified Modelling Language (UML) Notation
- Lecturer’s Comments: Usually used by software developers

- Designing a Database (Phases of Database Design)
● Data Model: Plan/blueprint for a database design; more generalized and abstract than a database design
● Phases of a Database Design:
- The actual implementation is: (1) you come up with the data model, (2) you write SQL scripts to represent the data model & create the tables, (3) you run the SQL scripts and insert the actual data (see the sketch below)
● Conceptual → You think of/come up with the tables/fields you require
● Logical → You determine the relationships between the different tables
● Physical → You come up with the details for the different fields/tables (e.g. set character limits/data types, etc.)
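A small sketch of steps (2) and (3) for a hypothetical department table (all names and types are assumptions); the physical phase is where character limits and data types get pinned down:

    -- Step (2): SQL script representing the data model
    CREATE TABLE department (
        dept_id   INTEGER PRIMARY KEY,      -- physical detail: integer key
        dept_name VARCHAR(40) NOT NULL      -- physical detail: 40-char limit
    );

    -- Step (3): run the script, then insert the actual data
    INSERT INTO department (dept_id, dept_name) VALUES (1, 'Analytics');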
- Entity Relationship Diagram (ER Diagram)
● Basically a structural diagram used in database design; contains entities and maps out the relationships between entities

- Aspects of the ER Model
● Entity
- Basically the ‘table’ in a database
- Represented by a name and a rectangle, with its attributes listed in the body of the rectangle

● Entity Attributes
- Refers to the property/characteristic of the entity/table
- For databases, it has a name and the data type/size of the attribute

● Primary Key
- Refers to a special entity attribute that uniquely identifies a record in a table
- This means the column/field that is the PK must not be repeated in the table
- Example: the id field of a given table
● Foreign Key
- Is essentially a PK of another table
- An FK need not be unique in the table where it is not the PK (see the sketch below)
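A minimal SQL sketch (table/column names assumed, not from the course): player.team_id is an FK pointing at the PK of team, and unlike a PK it may repeat in player. Many player rows sharing one team_id is exactly the one-to-many Team/Player cardinality discussed next.

    CREATE TABLE team (
        id   INTEGER PRIMARY KEY,            -- PK: unique per record
        name VARCHAR(40)
    );

    CREATE TABLE player (
        id      INTEGER PRIMARY KEY,
        name    VARCHAR(40),
        team_id INTEGER REFERENCES team (id) -- FK: may repeat here
    );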
- Concept of Cardinality for the ER Model
● ER Model relationships are classified by their cardinality
● Cardinality refers to the possible number of occurrences in one entity which is associated with the number of occurrences in another
- E.g. ONE team has MANY players, so we can say Team has a one-to-many cardinality with Player
● Notations:

● Reading/Interpreting an ER Diagram/Relation
- Between Customer & Pizza, to establish the relationship Customer → Pizza, we must look at the crow’s foot notation attached to Pizza
- In this case, it’s 0/Many attached to Pizza → means a customer can order either 0 pizzas or many pizzas
- For the converse, it’s also 0/Many attached to Customer → means a pizza can be ordered by 0 customers or many customers

- Weak Entity
● Defined as an entity that does not have attributes/columns that can identify its records uniquely
- E.g. no primary key in the weak entity
- Example: intermediate tables → where the table consists of the PKs of two other tables
- Database Design Misc Example
● In this example, the COMPANY & PART tables can’t be joined to each other directly
- Solution is to create an intermediary table mapping CompanyName & PartNumber to allow both tables to join
- Lecturer: Would be good to create a PK in the COMPANY_PART_INT table for ease of reference (see the sketch below)
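A hedged sketch of the intermediary table (the COMPANY_PART_INT name follows the example; the column types are assumed, and the surrogate id reflects the lecturer's suggestion):

    CREATE TABLE company (CompanyName VARCHAR(50) PRIMARY KEY);
    CREATE TABLE part    (PartNumber  VARCHAR(20) PRIMARY KEY);

    -- Intermediate/junction table: it holds the PKs of the two tables it
    -- connects, plus its own surrogate PK for ease of reference
    CREATE TABLE company_part_int (
        id          INTEGER PRIMARY KEY,
        CompanyName VARCHAR(50) REFERENCES company (CompanyName),
        PartNumber  VARCHAR(20) REFERENCES part (PartNumber)
    );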

- Database Property → ACID: Atomicity, Consistency, Isolation, Durability
● Properties that all transactions should possess
○ Atomicity
- Relates to the ‘all or nothing’ property
- The transaction must be an indivisible unit that is either performed in its entirety or not performed at all
○ Consistency
- A DB transaction must transform the database from one consistent state to another consistent state
○ Isolation
- Transactions must be able to execute independently from one another
- I.e. one incomplete transaction must not affect another transaction
○ Durability
- Effects of a successful transaction must be permanently recorded in the database and not lost in a subsequent failure
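A small sketch of atomicity in SQL (the account table and amounts are hypothetical): the two updates form one indivisible unit that either commits together or not at all:

    BEGIN TRANSACTION;

    UPDATE account SET balance = balance - 100 WHERE id = 1;
    UPDATE account SET balance = balance + 100 WHERE id = 2;

    -- Make both changes permanent (durability) ...
    COMMIT;
    -- ... or, had anything failed, undo the whole unit (atomicity):
    -- ROLLBACK;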

- Types of Data
● Transactional Data
- Refers to the data that is captured from transactions
- Example: Time of transaction, place, price, payment method employed, etc…
- Usually captured at point of sale through a POS system
● Analytical Data
- Transaction data that is transformed via calculations/analysis
● Master Data
- Refers to the actual critical business objects upon which transactions are performed
- Data Warehouse
● The general idea for data storage in a Data Warehouse is to provide information & knowledge to support decision making in your org
● Having the data in normalized/OLTP form is usually not good, as it can be computationally expensive to join data together to perform analysis
● Therefore, data in a Data Warehouse is often stored in a de-normalized form

- OLTP vs OLAP

- Dimensionality Modelling (Converting data from OLTP to OLAP)
● Basically, you are de-normalizing the data for high performance access (fewer joins)
● You can represent the denormalized data via 2 schemas (a star-schema sketch follows this list):
○ Star Schema
- Structure that contains a fact table in the center (fact tables contain transactional data; e.g. Sales)
- The fact table is surrounded by dimension tables containing reference data (dimension tables contain reference information; e.g. mapping Store_id to Store Name)
○ Snowflake Schema
- Variant of the star schema where dimension tables do not contain denormalized data
● Terminology
- Dimension Tables: Tables connected to the fact table containing reference/static data
- Attribute: Non-key fields in dimension tables
- Fact Table: Central table in a dimensional model containing facts/transactional data
- Facts: Business measures/metrics
- Grain/Granularity: Level of detail/frequency at which data in the fact table is recorded
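A minimal star-schema sketch in SQL (all names are illustrative): a central fact table keyed to the surrounding dimension tables, with the grain being one row per store per day:

    -- Dimension tables: reference/static data
    CREATE TABLE dim_store (
        store_id   INTEGER PRIMARY KEY,
        store_name VARCHAR(50)
    );
    CREATE TABLE dim_date (
        date_id    INTEGER PRIMARY KEY,
        full_date  DATE
    );

    -- Fact table: transactional data at the chosen grain
    CREATE TABLE fact_sales (
        store_id     INTEGER REFERENCES dim_store (store_id),
        date_id      INTEGER REFERENCES dim_date (date_id),
        sales_amount DECIMAL(12, 2),        -- the fact/business measure
        PRIMARY KEY (store_id, date_id)
    );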
3. Data Integration and Interoperability
- Data Integration
● Process of bringing data from disparate sources together to provide users with a unified view
● Purpose: To make data more easily available and easier to consume by systems/end-users
● Benefits: Frees up resources, improves data quality, improves operational efficiency, and can yield valuable insight through the data

- Data Integration Tools

- Set Theory for Data Joins
1. Outer Join
2. Left/Right Join
3. Inner Join
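Hedged SQL sketches of the three join types (the customers/orders tables are hypothetical):

    -- Inner join: only rows with a match on BOTH sides
    SELECT c.name, o.total
    FROM customers c
    INNER JOIN orders o ON o.customer_id = c.id;

    -- Left join: every customer, with NULLs where no order matches
    -- (a right join is the mirror image)
    SELECT c.name, o.total
    FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id;

    -- Full outer join: rows from either side, matched where possible
    -- (not supported by every engine, e.g. MySQL has no FULL JOIN)
    SELECT c.name, o.total
    FROM customers c
    FULL OUTER JOIN orders o ON o.customer_id = c.id;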

- Data Acquisition and Extraction
● Data Acquisition
- Process of capturing, integrating, transforming, aggregating and loading the data to the data warehouse after assuring data quality
- The process is more inclusive/comprehensive than ETL (Extract Transform Load) / ELT (Extract Load Transform)
● ETL/ELT
○ ETL
○ ELT
● ETL vs ELT (see the sketch below)
- If you transform first, you are in practice determining the schema in advance; the data stored might be too inflexible for use
- If you load first, the risk is that the data might become ‘rubbish’
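A rough SQL contrast (staging/table names are assumptions): ETL decides the schema before loading, while ELT lands the raw rows first and transforms inside the warehouse later:

    -- ETL: transform on the way into the pre-defined warehouse table
    INSERT INTO dw_sales (store_id, sale_date, amount)
    SELECT store_id, CAST(sold_at AS DATE), amount
    FROM staging_sales
    WHERE amount IS NOT NULL;

    -- ELT: raw_sales was loaded as-is; transform it in place later
    CREATE TABLE dw_sales_clean AS
    SELECT store_id, CAST(sold_at AS DATE) AS sale_date, amount
    FROM raw_sales
    WHERE amount IS NOT NULL;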
● Data Warehouse
- Structured data is loaded into the data warehouse for analytical use
● Data Lake
- The data lake is the place where all sorts of data are stored
- Structured, textual, and unstructured data
● Data Lakehouse
- Similar to a Data Lake but with a data management architecture baked in to index/cache all forms of data stored in the lake

- Data Consolidation
● Basically consolidating data from different silos to a single place

- Data Virtualization
● Brings all data from different sources/places to one platform
● One platform to access/combine/analyze the datasets; reduces access cost
- Data Federation
● A software/platform that allows multiple databases to function as one
● Data from multiple sources is combined into a common model; i.e. you can query/join using a common platform/schema
- Data Replication
● Data is intentionally stored in more than 1 site/server
● Purpose is to allow data to be available in case of downtime/heavy traffic; the idea is to improve data accessibility/uptime
- Data Harmonization

- Data Pipeline
- Data Engineering
- Data Fabric
4. Data Project Implementation
5. Data Governance - Data Quality, Security & Privacy

- Types of Data for Data Governance

- Data Governance Implementation: Sales Analytics

- Data Quality
● Quality is assessed with respect to the data’s fit for the purpose it was intended for
- High quality means it accurately represents the real-world constructs it describes
- Bad data will result in low information quality; as it moves up the management hierarchy, it leads to bad business decisions

- Measuring/Assessing Data Quality - Data Quality Checks

1. Data Sampling (see the sketch below)
(i) Random
(ii) Sampling with fixed criteria
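Sketches of both sampling styles in SQL (the sales table is hypothetical; the random function varies by engine, e.g. RANDOM() in PostgreSQL/SQLite vs RAND() in MySQL):

    -- (i) Random sample of 100 rows
    SELECT * FROM sales ORDER BY RANDOM() LIMIT 100;

    -- (ii) Sample with fixed criteria
    SELECT * FROM sales
    WHERE region = 'APAC' AND sale_date >= '2023-01-01';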

2. Data Profiling
- Process where data is examined & analyzed to generate summary statistics
- Purpose: Give an overview of the data to ensure any discrepancies/risks/trends are spotted (see the sketch below)
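A minimal profiling query (column names assumed): the kind of summary statistics that surface discrepancies such as missing values or out-of-range amounts:

    SELECT COUNT(*)                    AS row_count,
           COUNT(DISTINCT customer_id) AS distinct_customers,
           SUM(CASE WHEN amount IS NULL
                    THEN 1 ELSE 0 END) AS null_amounts,
           MIN(amount)                 AS min_amount,
           MAX(amount)                 AS max_amount,
           AVG(amount)                 AS avg_amount
    FROM sales;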
- Data Dictionary
● Specification/description of the data structures in a database/data model/data source
● Contains a list of entities/tables/datasets and their fields/columns/data elements
● Information may include: data type, description, relationships, aliases, constraints, sources, etc.
● Data Catalog - distinct from a Data Dictionary - is basically an inventory of the data objects in your organization

- Data Mapping
● Definition: Relationship between 2 or more datasets, matching/connecting fields from one dataset to another
● Purpose: Link data fields across areas to create standardized, accurate data

- Data Privacy: Data Confidentiality, Anonymization, Masking

1. Data Masking
- Definition: Technique that scrambles data to create an inauthentic copy for non-production purposes
- After masking, data retains the characteristics & integrity of production data
- Masked data is usually used for analytics/training/testing
2. Data Redaction
- Definition: Data masking technique that replaces data with chosen redaction characters
- E.g. S9300000J → XXXXX000J (see the sketch below)
- Purpose: Used as a secrecy control/privacy control, usually for hiding personally identifiable information (PII)
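A hedged SQL sketch of the redaction above (the customer table and nric column are assumptions; string functions vary slightly by engine):

    -- Mask the first 5 characters of a 9-character NRIC with 'X',
    -- keeping the trailing characters visible
    SELECT CONCAT('XXXXX', SUBSTR(nric, 6)) AS nric_redacted
    FROM customer;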

3. Data Encryption
- Definition: Translates data into another form → only people with access to a secret key/password can read it

4. Data Masking/Redaction vs Encryption
- Data Masking/Redaction is used more frequently as it allows an organization to maintain the usability of customer data; usually used as the standard solution for pseudonymisation

- Aspects of Data Security

1. Data Access
- Authentication/authorization of access
- Data access is recorded and will be audited
- Data access considerations must necessarily relate to where the data is stored; on-prem vs cloud
2. Data Classification (User Role)

3. Data Lineage
- Need to understand how changes upstream may affect downstream sources
- E.g. an upstream data source gets an update; downstream data might be impacted
4. Data Encryption
- Whether the data is encrypted at rest or encrypted in transit

- Data Classification
● Definition: Process of organizing information/data assets using an agreed-upon categorization logic
- The result is usually a large repository of metadata useful for making further decisions/facilitating the use and governance of data
- E.g. can make decisions on the value/security/access rights/usage rights/privacy/storage location/quality/retention period of the data
● Example - GDPR Classification Tags

- Data Lineage
