Advanced Data Modeling (2)
Power Query
DAX
Agenda
9:00 - 9:15 Initial remarks and Introduction to the course
Section A
9:15 - 10:15 Intro to Data Preparation
Section B
10:15 - 11:00 Data Model Schemas, Normalization, Calculated Columns and Measures
11:00 - 11:15 Break
11:15 - 11:45 Lab 1
Section C
11:45 - 12:15 Data Storage in Power BI
12:15 - 12:30 Best Practices, Q&A
Section A
Intro to Data Preparation
Why Prepare our data?
• Power BI is powerful enough to compile and analyze data, but...
• If the data is not prepared properly, these compilations will be slower and the report’s analytical efficiency will suffer
• The data needs to suit the compression engine that PBI uses in order to build a robust data model
What is a Data model?
The Technology behind Power BI
The VertiPaq Engine:
Columnar Database Engine - Columns & Segments
Example: to answer "How many distinct products were sold in 2017-Q1?", only the Product and Date columns are used
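The column-only scan described above can be sketched in plain Python. This is only an illustration of the idea, not how VertiPaq is implemented; the table contents and the helper name `distinct_products_in_range` are made up for the example.

```python
# Hypothetical sales data stored column by column (one list per column).
columns = {
    "product": ["Bike", "Helmet", "Bike", "Lock", "Helmet"],
    "date": ["2017-01-15", "2017-02-03", "2017-04-20", "2017-03-30", "2016-12-31"],
    "customer": ["Ann", "Bob", "Cid", "Dee", "Eve"],  # never read by the query below
    "sales": [500.0, 40.0, 520.0, 25.0, 38.0],        # never read by the query below
}


def distinct_products_in_range(cols, start, end):
    """Count distinct products sold between start and end (inclusive).

    Only the 'product' and 'date' columns are scanned. ISO-formatted
    dates compare correctly as plain strings.
    """
    return len({
        product
        for product, date in zip(cols["product"], cols["date"])
        if start <= date <= end
    })


q1_2017 = distinct_products_in_range(columns, "2017-01-01", "2017-03-31")
print(q1_2017)
```

In a row-oriented layout the same question would have to read every field of every row, including the customer and sales values it does not need.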
Entities
Dimension Table:
Contains descriptive information used to slice and dice data from fact tables (e.g. branch_name, branch_type)
Also holds relationship/key fields used to connect the dimension to the fact table (e.g. branch_key)
Wide tables with a small number of rows
Fact Table:
Contains facts/details, which are fields used as values in a visualization (e.g. dollars_sold, units_sold)
Also holds relationship/key fields used to connect the fact table to the dimensions (e.g. time_key, item_key, branch_key, location_key)
Narrow tables with a large number of rows
Golden Rule:
Avoid using a single table that includes everything (both facts and dimensions)
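To make the fact/dimension split concrete, here is a minimal Python sketch. It reuses the field names from the slides (branch_key, branch_name, branch_type, dollars_sold, units_sold); the data values and the helper `dollars_by` are assumptions made for the example.

```python
from collections import defaultdict

# Dimension table: few rows, descriptive attributes, keyed by branch_key.
dim_branch = {
    1: {"branch_name": "Downtown", "branch_type": "Retail"},
    2: {"branch_name": "Airport", "branch_type": "Kiosk"},
}

# Fact table: many narrow rows holding only keys and numeric facts.
fact_sales = [
    {"branch_key": 1, "dollars_sold": 120.0, "units_sold": 3},
    {"branch_key": 2, "dollars_sold": 45.0, "units_sold": 1},
    {"branch_key": 1, "dollars_sold": 80.0, "units_sold": 2},
]


def dollars_by(attribute):
    """Sum dollars_sold, sliced by an attribute looked up in the dimension."""
    totals = defaultdict(float)
    for row in fact_sales:
        label = dim_branch[row["branch_key"]][attribute]
        totals[label] += row["dollars_sold"]
    return dict(totals)


by_type = dollars_by("branch_type")
by_name = dollars_by("branch_name")
print(by_type)
```

The descriptive text for each branch is stored once in the dimension, while the fact rows repeat only the small integer key. A single flat table would repeat branch_name and branch_type on every sales row.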
Relationships
• Connections between two tables (usually fact and dimension tables) that use a column from each are called relationships
Bi-Directional Relationship
‐ Allows you to pass filters in both directions
‐ This is different from Many to Many
‐ There is a significant performance penalty for Bi-Directional filtering
Section B
Data Model Schemas, Normalization, DAX Calculated
Columns and Measures
Phases in Building a Power BI Desktop File
Data Model Brings Facts and Dimensions Together
Data Models
Flat or Denormalized Schema
Star Schema
Snowflake Schema
Flat or Denormalized Schema
• Highly inefficient
Example:
One row per order or per Item
Daily or Monthly date grain
A Calculated Column is evaluated as a new column in the table in which it resides and will not change value until the
underlying data is refreshed.
Measures are calculations which do not have a result until they are used in a visualization.
They may use sums, averages, minimum or maximum values, counts, or more advanced calculations; and they change
value in response to your interaction with your reports.
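The difference between the two can be sketched in Python as a rough analogy (this is not DAX, and the table, the `refresh` step and the `total_sales` function are hypothetical): the calculated column is stored per row when the data is refreshed, while the measure is recomputed for whatever filter the report currently applies.

```python
rows = [
    {"region": "East", "units_sold": 2, "unit_price": 10.0},
    {"region": "West", "units_sold": 1, "unit_price": 30.0},
    {"region": "East", "units_sold": 5, "unit_price": 4.0},
]


def refresh(table):
    """'Calculated column': computed once per row at refresh time and stored."""
    for row in table:
        row["line_total"] = row["units_sold"] * row["unit_price"]


def total_sales(table, region=None):
    """'Measure': nothing is stored; evaluated on demand for the current filter."""
    return sum(
        row["line_total"]
        for row in table
        if region is None or row["region"] == region
    )


refresh(rows)
all_regions = total_sales(rows)               # no slicer selection
east_only = total_sales(rows, region="East")  # slicer set to East
print(all_regions, east_only)
```

The stored `line_total` values take up space in every row, whereas `total_sales` costs nothing until a visual asks for it.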
Calculated Column
What is a Calculated Column?
Best Practices – Calculated Columns
What is a Measure?
(Figure labels: Columns, Values, Slicer, Rows)
Designing good data models
Key takeaways to design a good Power BI Desktop data model
• If a fact table contains an ID field which is unique for each record, remove it unless needed as a connector key
• Ex. Transaction ID
• The DateTime data type is usually not needed, unless you are specifically using the Time component
➢ If you really need Time, try splitting Date & Time into separate columns
Knowledge Check
2. What are some advantages of a star schema over a flat or denormalized model?
• Dimension tables save space by reducing the amount of data that needs to be repeated over and
over in every row
• Relationships between tables can be leveraged for more complex measures
- The connection will ingest/pull all the data from the source and
make it a part of the PBI
Choosing storage mode: Import vs DirectQuery
Best Practices
Data Modeling
An inefficient model can completely slow down a report, even with very small data
volumes
GOALS:
Why is it undesired?
• Calculated columns don’t compress as well as physical columns
Proposed Solution
• Perform calc in Power Query, ideally push down
Remove unused tables and columns
Scenario
• Model contains tables/columns that are not used for reporting/analysis or
calculations
Why is it undesired?
• Increases model size
• Increases time to load into memory
• Increases refresh time
• May affect usability
Avoid high precision/cardinality columns
Scenario
• Model contains columns at a higher precision than needed for analysis e.g. datetime
in milliseconds, weight to 6 decimal places
• Model contains columns that are highly unique
Why is it undesired?
• Less compression with high precision/cardinality
• Increases time to load into memory
• Increases refresh time
Proposed Solution
• Remove if not needed
• Reduce precision
• Split datetime into date and time
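A quick way to see why splitting helps is to count distinct values before and after. The sketch below uses made-up timestamps (one event every 7 minutes for 30 days) purely as an assumption for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical event timestamps: one every 7 minutes over 30 days.
start = datetime(2024, 1, 1)
timestamps = [start + timedelta(minutes=7 * i) for i in range(30 * 24 * 60 // 7)]

# Cardinality of the combined column versus the two split columns.
distinct_datetimes = len(set(timestamps))
distinct_dates = len({ts.date() for ts in timestamps})
distinct_times = len({ts.time() for ts in timestamps})

print(distinct_datetimes, distinct_dates, distinct_times)
```

Every combined datetime is unique, but the date column holds only one value per day and the time column at most one value per minute of the day, so the two split columns together have far fewer distinct values to store.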
Use integers instead of strings
Why is it undesired?
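• Strings are always dictionary (hash) encoded; integers can use value encoding, which skips the dictionary. Run-length encoding is then applied on top of either, so integer columns are usually more efficient to store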
Proposed Solution
• Check data types and set to integer if known to be numerical
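As a rough illustration of dictionary encoding and run-length encoding, here is a small Python sketch. It shows the general techniques only and is not a description of the engine's internals; the function names and sample column are assumptions.

```python
def dictionary_encode(values):
    """Replace each value with a small integer ID plus a lookup dictionary."""
    dictionary = {}
    ids = []
    for value in values:
        # First occurrence gets the next free ID; repeats reuse the stored ID.
        ids.append(dictionary.setdefault(value, len(dictionary)))
    return dictionary, ids


def run_length_encode(values):
    """Collapse consecutive repeats into [value, count] pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs


status = ["Open", "Open", "Open", "Closed", "Closed", "Open"]
dictionary, ids = dictionary_encode(status)
runs = run_length_encode(ids)
print(dictionary, ids, runs)
```

A string column needs both the dictionary and the ID list. A column that is already an integer can be stored without the dictionary step, which is one reason integer keys tend to be cheaper than text keys.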
Be careful with bi-directional relationships
Scenario
• Most relationships in the model are set to bi-directional
Why is it undesired?
• Applying filters/slicers traverses many relationships and can be slower
• Some filter chains are unlikely to add business value
Proposed Solution
• Only use bi-directional relationships where the business scenario requires them
Set Default Summarization
Scenario
• Numeric columns in model that are purely
informational (e.g. Account ID)
• Default summarization is Sum
Why is it undesired?
• Power BI will try to sum the number when
dropped into visuals.
• Detailed tables/matrixes can be slower
Proposed Solution
• Set the default summarization to None
Q&A