Data Management For Analytics Notes
- Functional Dependencies
● Definition: Constraint that determines the relation of one attribute to another attribute in a database
- Denoted by an arrow →
- Example: X → Y
- In the example below, Employee Name, Salary, and City are all functionally dependent on Employee Number. So we can
say Employee Number → Employee Name/Salary/City
● Multivalued Dependency
- Definition: Occurs when there are two or more independent attributes in a table that are each dependent on
another attribute
● Transitive Dependency
- Definition: Occurs between 3 or more attributes. Essentially, it’s an indirect non-trivial dependency: if X → Y and Y → Z, then X → Z holds indirectly
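The functional-dependency idea above can be checked mechanically against actual records. A minimal Python sketch (the helper name `holds_fd` and the sample employee records are my own illustrations, not from the notes):

```python
def holds_fd(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds in rows."""
    seen = {}  # maps each lhs value to the single rhs value it determines
    for row in rows:
        key = row[lhs]
        if key in seen and seen[key] != row[rhs]:
            return False  # same determinant maps to two different values
        seen[key] = row[rhs]
    return True

employees = [
    {"EmpNo": 1, "Name": "Ann", "City": "Oslo"},
    {"EmpNo": 2, "Name": "Bob", "City": "Oslo"},
]

print(holds_fd(employees, "EmpNo", "Name"))  # True: EmpNo -> Name
print(holds_fd(employees, "City", "Name"))   # False: one City, two Names
```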
- Process of Normalization
● From 0NF to 1NF
- MatrixNum is the unique key for all the records
- We can say that the rest of the columns are ‘functionally dependent’ on MatrixNum
- From 0NF to 1NF, we’re removing the nesting/grouping on the Unique Key to associate it 1-1 to the
rest of the columns
- The actual implementation is: (1) You come up with the data model, (2) You write SQL scripts to
represent the data model & create the tables, (3) You run the SQL scripts and insert the actual data
● Conceptual → You think of/come up with the tables/fields you require
Logical → You determine the relationships between the different tables
Physical → You come up with the details for the different fields/tables (E.g. Set character limits/data types, etc…)
- Entity Relationship Diagram (ER Diagram)
● Basically a structural diagram used in database design; contains entity and maps out the relationships
between entities
● Entity Attributes
● Primary Key
- Refers to a special entity attribute that uniquely defines a record in a table
- This means that values in the column/field that is the PK must not repeat in the table
- Example: the id field of a given table
● Foreign Key
- Is essentially a PK of another table
- FK need not be unique in the table where it is not the PK
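The PK/FK rules above can be demonstrated with sqlite3. A sketch using a hypothetical team/player pair of tables (names are my own; note SQLite only enforces FKs when the foreign_keys pragma is on):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks FKs only when this is on

conn.executescript("""
CREATE TABLE team (
    id   INTEGER PRIMARY KEY,            -- PK: unique per record
    name TEXT NOT NULL
);
CREATE TABLE player (
    id      INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,
    team_id INTEGER REFERENCES team(id)  -- FK: team's PK, repeats allowed here
);
INSERT INTO team VALUES (1, 'Reds');
INSERT INTO player VALUES (10, 'Ann', 1), (11, 'Bob', 1);  -- same team_id twice: fine
""")

# A player pointing at a non-existent team violates the FK and is rejected:
try:
    conn.execute("INSERT INTO player VALUES (12, 'Cid', 99)")
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)
```

This is also the one-to-many cardinality from the ER-model section: one team row, many player rows carrying its id.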
- Concept of Cardinality for ER-Model
● ER Model relationships are classified by their cardinality
● Cardinality refers to the possible number of occurrences in one entity which is associated with the
number of occurrences in another
- E.g. ONE team has MANY players, we can then say Team has a one-to-many cardinality with Player
● Notations:
● Reading/Interpreting an ER Diagram/Relation
- Between Customer & Pizza, to establish the relationship Customer → Pizza, we must look at the crow’s
feet notation attached to Pizza
- In this case, it’s 0/Many attached to Pizza → Means a customer can order either zero or many pizzas
- For the converse, it’s also 0/Many attached to Customer → Means a pizza can be ordered by zero or many
customers
- Weak Entity
● Defined as an entity that does not have attributes/columns that can identify its records uniquely
- E.g. No primary key in weak entity
- Example: Intermediate tables → Where the table consists of the PKs of two other tables
- Database Design Misc Example
● In this example, COMPANY & PART tables can’t be joined to each other
- Solution is to create an intermediary table mapping CompanyName & PartNumber to allow both tables
to join
- Lecturer: Would be good to create a PK in COMPANY_PART_INT table for ease of reference
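The COMPANY/PART fix above can be sketched in sqlite3. The three table names come from the notes; the column names and sample rows are assumptions, and the surrogate id follows the lecturer's suggestion:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE company (company_name TEXT PRIMARY KEY);
CREATE TABLE part    (part_number  TEXT PRIMARY KEY);

-- Intermediary (junction) table: each row pairs the PKs of the two tables.
-- Surrogate PK 'id' added for ease of reference, per the lecturer's remark.
CREATE TABLE company_part_int (
    id           INTEGER PRIMARY KEY,
    company_name TEXT REFERENCES company(company_name),
    part_number  TEXT REFERENCES part(part_number)
);

INSERT INTO company VALUES ('Acme'), ('Globex');
INSERT INTO part    VALUES ('P-100'), ('P-200');
INSERT INTO company_part_int (company_name, part_number)
VALUES ('Acme', 'P-100'), ('Acme', 'P-200'), ('Globex', 'P-100');
""")

# COMPANY and PART can now be joined *through* the junction table:
rows = conn.execute("""
    SELECT c.company_name, p.part_number
    FROM company c
    JOIN company_part_int cp ON cp.company_name = c.company_name
    JOIN part p              ON p.part_number   = cp.part_number
    ORDER BY 1, 2
""").fetchall()
print(rows)  # [('Acme', 'P-100'), ('Acme', 'P-200'), ('Globex', 'P-100')]
```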
- Types of Data
● Transactional Data
- Refers to the data that is captured from transactions
- Example: Time of transaction, place, price, payment method employed, etc…
- Usually captured at point of sale through a POS system
● Analytical Data
- Transaction data that is transformed via calculations/analysis
● Master Data
- Refers to the actual critical business objects upon which transactions are performed
- Data Warehouse
● The general idea for data storage in Data Warehouse is to provide information & knowledge to support
decision making in your org
● Having the data in normalized/OLTP form is usually not ideal, as it can be computationally expensive to
join data together to perform analysis
● Therefore, often data in Data Warehouse is stored in a de-normalized form
- OLTP vs OLAP
● Basically, you are de-normalizing the data for high-performance access (fewer joins)
● You can represent the denormalized data via 2 schemas:
○ Star Schema
- Structure that contains a fact table in the center (Fact tables are tables that contain
transactional data; E.g. Sales)
- Fact table is surrounded by dimension tables containing reference data (Dimension tables are
tables that contain reference information; E.g. Store_id to Store Name)
○ Snowflake Schema
- Variant of the star schema where dimension tables do not contain denormalized data
● Terminology
- Dimension Tables: Tables connected to fact table containing reference/static data
- Attribute: Non-key fields in Dimension tables
- Fact Table: Central table in a dimensional model containing facts/transactional data
- Facts: Business measures/metrics
- Grain/Granularity: Level of detail/frequency at which data in Fact table is recorded
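The fact/dimension split above can be sketched in sqlite3. A minimal star schema with one fact table and one dimension (the store_id → store name example from the notes; table names and figures are my own):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension table: reference data (store_id -> store name).
CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, store_name TEXT);

-- Fact table: transactional data, grain = one row per sale.
CREATE TABLE fact_sales (
    sale_id  INTEGER PRIMARY KEY,
    store_id INTEGER REFERENCES dim_store(store_id),
    amount   REAL  -- the fact / business measure
);

INSERT INTO dim_store VALUES (1, 'Downtown'), (2, 'Airport');
INSERT INTO fact_sales VALUES (1, 1, 9.5), (2, 1, 12.0), (3, 2, 7.25);
""")

# A typical OLAP query: one join from fact to dimension, then aggregate.
rows = conn.execute("""
    SELECT d.store_name, SUM(f.amount) AS total_sales
    FROM fact_sales f
    JOIN dim_store d ON d.store_id = f.store_id
    GROUP BY d.store_name
    ORDER BY total_sales DESC
""").fetchall()
print(rows)  # [('Downtown', 21.5), ('Airport', 7.25)]
```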
3. Data Integration and Interoperability
- Data Integration
● Process of bringing data from disparate sources together to provide users with a unified view
● Purpose: To make data more easily available and easier to consume by systems/end-users
● Benefits: Frees up resources, improves data quality, improves operational efficiency, and can yield valuable
insights from data
2. Left/Right Join
3. Inner Join
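The join types listed above differ in how they treat unmatched rows. A sqlite3 sketch (customer/orders tables and rows are my own illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders   (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
INSERT INTO customer VALUES (1, 'Ann'), (2, 'Bob');  -- Bob has no orders
INSERT INTO orders   VALUES (10, 1, 25.0);
""")

inner = conn.execute("""
    SELECT c.name, o.total FROM customer c
    JOIN orders o ON o.customer_id = c.id
    ORDER BY c.id
""").fetchall()
print(inner)  # [('Ann', 25.0)] -- unmatched rows are dropped

left = conn.execute("""
    SELECT c.name, o.total FROM customer c
    LEFT JOIN orders o ON o.customer_id = c.id
    ORDER BY c.id
""").fetchall()
print(left)   # [('Ann', 25.0), ('Bob', None)] -- unmatched rows kept, filled with NULL
```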
○ ETL
○ ELT
● ETL vs ELT
- If you transform first, you are in practice fixing the schema in advance, so the stored data might be too
inflexible for later use
- If you load first, the risk is that the untransformed data piles up and becomes unusable ‘rubbish’
● Data Warehouse
- Structured data is loaded into the data warehouse for analytical use
● Data Lake
- Data lake is the place where all sorts of data is stored
- Structured, textual, unstructured data
● Data Lakehouse
- Similar to Data Lake but with data management architecture baked in to index/cache all forms of data
stored in the data lake
- Data Consolidation
● Basically consolidating data from different silos to a single place
- Data Virtualization
● Bring all data from different sources/places to one platform
● One platform to access/combine/analyze the dataset; reduce access cost
- Data Federation
● A software/platform that allows multiple databases to function as one
● Data from multiple sources are combined into a common model; i.e. can query/join using a common
platform/schema
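As a toy analogy for federation, SQLite can attach several database files to one connection and join across them through a single query. Everything here (file layout, table names) is my own illustration:

```python
import os
import sqlite3
import tempfile

tmp = tempfile.mkdtemp()
sales_path = os.path.join(tmp, "sales.db")
hr_path = os.path.join(tmp, "hr.db")

# Two independent databases standing in for separate source systems.
with sqlite3.connect(sales_path) as db:
    db.execute("CREATE TABLE sale (emp_id INTEGER, amount REAL)")
    db.execute("INSERT INTO sale VALUES (1, 100.0), (1, 50.0), (2, 80.0)")
with sqlite3.connect(hr_path) as db:
    db.execute("CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, name TEXT)")
    db.execute("INSERT INTO employee VALUES (1, 'Ann'), (2, 'Bob')")

# 'Federated' view: one connection attaches both files and joins across
# them with a single query, as if they were one database.
fed = sqlite3.connect(sales_path)
fed.execute("ATTACH DATABASE ? AS hr", (hr_path,))
rows = fed.execute("""
    SELECT e.name, SUM(s.amount)
    FROM sale s JOIN hr.employee e ON e.emp_id = s.emp_id
    GROUP BY e.name ORDER BY e.name
""").fetchall()
print(rows)  # [('Ann', 150.0), ('Bob', 80.0)]
```

Real federation platforms do this across heterogeneous systems and without moving the data; the attach trick only conveys the "query many stores through one schema" idea.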
- Data Replication
● Data is intentionally stored in > 1 site/server
● Purpose is to allow data to be available in case of downtime/heavy traffic; idea of improving data
accessibility/uptime
- Data Harmonization
- Data Pipeline
- Data Engineering
- Data Fabric
3. Data Project Implementation
4. Data Governance - Data Quality, Security & Privacy
- Data Quality
● Quality is assessed with respect to the data’s fitness for the purpose it was intended for
- High quality means it accurately represents the real-world constructs it describes
- Bad data results in low information quality; as it moves up the management hierarchy, it leads
to bad business decisions
1. Data Sampling
(i) Random
(ii) Sampling with Fixed Criteria
2. Data Profiling
- Process where data is examined & analyzed; generate summary statistics
- Purpose: Give an overview of data to ensure any discrepancies/risks/trends are spotted
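Both sampling approaches and basic profiling can be sketched with the standard library alone (the price figures are invented; real profiling tools report many more statistics):

```python
import random
import statistics

prices = [4.5, 5.0, 5.5, 6.0, 120.0, 5.2, 4.9, 5.1]  # made-up transaction data

# (i) Random sampling: inspect a random subset of records.
random.seed(42)  # fixed seed so the sample is reproducible
sample = random.sample(prices, k=4)

# (ii) Sampling with fixed criteria: e.g. only transactions above $10.
flagged = [p for p in prices if p > 10]

# Profiling: summary statistics that surface discrepancies/outliers.
profile = {
    "count": len(prices),
    "min": min(prices),
    "max": max(prices),
    "mean": round(statistics.mean(prices), 2),
    "median": statistics.median(prices),
}
print(profile)  # the 120.0 outlier shows up as a max far above the median
print(flagged)  # [120.0]
```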
- Data Dictionary
- Data Mapping
● Definition: Relationship between 2/more datasets; matching/connecting fields from one dataset to another
● Purpose: Link data fields across areas to create standardized accurate data
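A minimal sketch of field-level data mapping: each source system names the same business field differently, and the mapping renames them to one standardized shape. All source/field names here are hypothetical:

```python
# Mapping from each source's field names to the standardized target names.
FIELD_MAP = {
    "crm":     {"cust_nm": "customer_name", "zip": "postal_code"},
    "billing": {"CustomerName": "customer_name", "postcode": "postal_code"},
}

def to_standard(source, record):
    """Rename a record's fields to the standardized names (unmapped fields kept)."""
    mapping = FIELD_MAP[source]
    return {mapping.get(k, k): v for k, v in record.items()}

a = to_standard("crm",     {"cust_nm": "Ann", "zip": "739099"})
b = to_standard("billing", {"CustomerName": "Ann", "postcode": "739099"})
print(a == b)  # True: both sources now share one standardized shape
```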
3. Data Encryption
- Definition: Translate data into another form → Only people with access to a secret key/password can read it
3. Data Lineage
- Need to understand how changes upstream may affect downstream sources
- E.g. Upstream data source gets an update, downstream data might be impacted
4. Data Encryption
- Whether the data is encrypted at rest or encrypted while in-transit
- Data Classification
● Definition: Process of organizing information/data assets using an agreed-upon categorization logic
- Result usually is a large repository of metadata useful to make further decision/to facilitate use and
governance of data
- E.g. Can make decisions on the value/security/access rights/usage rights/privacy/storage
location/quality/retention period of the data
● Example - GDPR Classification Tags
- Data Lineage