The
Date dimension is a well known construct in general
data warehousing. In many cases, the data for a date dimension is generated using a database stored procedure or shell-script.
Another approach to obtain the data for a date dimension is to generate it using an ETL tool like
Pentaho Data Integration, a.k.a.
Kettle. I think this approach makes sense for a number of reasons:
- When you tend to use a particular ETL tool, you will be able to reuse the date dimension generator over an over, and on different database platforms.
- You won't need special database privileges beyond the ones you need already. Privileges for creating tables and to perform DML will usually be available, whereas you might need to convince a DBA that you require extra privileges to create and execute stored procedures.
In addition to these general considerations, you can pull a neat little trick with Kettle to localize the data and format of the date attributes. I wouldn't go as far as to say that this feature is Kettle specific: rather, it relies on the localization support built into the
java platform and the way you can put that to use in Kettle transformations.
Prerequisites
In this tip, the steps to create a date dimension are described using
Kettle 2.5.1 (Generally available Release) and
MySQL 5.1.20 (Beta). You will be able to follow through the example using earlier (and later) versions of both products though - I am not using any functionality that is specific to these particular version of the products. The recipe does not really require that you understand anything about data warehouses or date dimensions, but you will probably appreciate it better if you do ;)
Overview
The transformation to generate the data for the date dimension follows a pretty straightforward design. The graphical representation of the transformation is shown below:
First, the dimension table is created (Prepare). After that, rows are generated to fill it (Input). However, the generated rows are almost empty and barren - we still need to derive and add data to fill the attributes of the date dimension (Transformation). Finally, the data is stored in the date dimension table (Output).
Step-by-Step
The remainder of this article describes in detail how to build this transformation. The majority of steps is probably not very interesting to moderately experienced Kettle users, but may be of use to beginning users.
Note for users that are completely new to Kettle - it is advisable to review the first few chapters of the
Spoon user guide (Spoon is the name of Kettle tool you use to design the ETL process). It explains how to start up the tool, create a new transformation, add and connect steps etc. You can find it in the docs/English directory beneath the Kettle home directory.
MySQL JDBC driver: setting the characterEncoding property to UTF8
You need to create a (JDBC) connection to MySQL in the usual, straightforward way:
In addition, you need to set the
characterEncoding
property of the JDBC driver:
This ensures MySQL will be able to understand the utf8 encoded data that we may produce to generate a date dimension in the, say, Chinese language. Note that you cannot just use a statement like
SET NAMES utf8
to do this. This is not specific to Kettle, but has to do with the way the MySQL JDBC driver (Connector/J) handles character sets. Please refer to the "
Using character sets and unicode" section of the Connector/J documentation for more information on this topic.
Creating the date dimension table
In this particular case, it seemed convenient to create the dimension table as part of the transformation. This is done using the "Execute SQL Script" step shown below:
The "Execute SQL Script" step executes the following script to create the date dimension table:
DROP TABLE IF EXISTS dim_date
;
CREATE TABLE IF NOT EXISTS dim_date (
date_key smallint unsigned NOT NULL,
date date NOT NULL,
date_short char(12) NOT NULL,
date_medium char(16) NOT NULL,
date_long char(24) NOT NULL,
date_full char(32) NOT NULL,
day_in_year smallint unsigned NOT NULL,
day_in_month tinyint unsigned NOT NULL,
is_first_day_in_month char(10) NOT NULL,
is_last_day_in_month char(10) NOT NULL,
day_abbreviation char(3) NOT NULL,
day_name char(12) NOT NULL,
week_in_year tinyint unsigned NOT NULL,
week_in_month tinyint unsigned NOT NULL,
is_first_day_in_week char(10) NOT NULL,
is_last_day_in_week char(10) NOT NULL,
month_number tinyint unsigned NOT NULL,
month_abbreviation char(3) NOT NULL,
month_name char(12) NOT NULL,
year2 char(2) NOT NULL,
year4 year NOT NULL,
quarter_name char(2) NOT NULL,
quarter_number tinyint NOT NULL,
year_quarter char(7) NOT NULL,
year_month_number char(7) NOT NULL,
year_month_abbreviation char(8) NOT NULL,
PRIMARY KEY(date_key),
UNIQUE(date)
)
ENGINE=MyISAM
DEFAULT CHARACTER SET utf8
DEFAULT COLLATE utf8_unicode_ci
This is by no means a complete date dimension. The most important limitation is that it only contains attributes that are immediately derivable from the calendar. So, attributes to denote business specific periods like the fiscal year and holidays are not included.
Generating 10 years worth of days
The grain of the date dimension is days - a row in the date dimension represents a single day. In this case, the "Generate Rows" is configured to generate 3660 rows, which roughly corresponds with enough days to last 10 years:
In the example, this step is also used to provide parameters to generate the date dimension data. As we'll see in a moment, the
inital_date
field effectively specifies the first date that goes into the date dimension. The
language_code
and
country_code
fields are used to localize the textual attributes of the date dimension, and the
local_yes
and
local_no
fields are used for boolean attributes.
There are other ways to get these parameters into our transformation. For example, we could have used an "Add Constants" step with a similar result. Another possibility would be to get this data from the environment using a "Get Variables" step, and this would allow the parameters to be specified at transformation run-time.
Counting the days
Although we certainly generate enough rows, they are all identical. In order to have each row represent a single distinct day, we need a way to 'count' the generated rows. We do this by adding a "Sequence" step:
In this case, we use the "Add sequence" step to generate an incrementing number within the scope of the transformation. As we'll see later on, we can add this to our initial date to get a series of consecutive dates.
Calculating date dimension Attributes
The previous steps form a basis from which we can derive all of the attributes that currently make up our date dimension. To actually calculate the date attributes, we use a "Modified Java Script Value" step:
Kettle comes with an embedded
Rhino javascript engine. The "Modified Java Script Value" step lets you use it to run javascript code to as part of the transformation.
The javascript code is executed for each row that comes out of the previous steps. In the script code, one can reference the values from the input rows, perform some processing on them, and generate new output fields.
One of the fortunate characteristics of the Rhino engine is that it lets us use
java classes inside the javascript code. Let's take a look at the script to see how we can use that to generate the localized data for our data dimension attributes.
Initialization
The first thing we do in the javascript code is to get data from the current input row. In the "Generate Rows" step, we added the
language_code
and
country_code
fields to specify a locale. Here, in the script, we use the following piece of code to turn that into a
java.util.Locale
object:
//Create a Locale according to the specified language code
var locale = new java.util.Locale(
language_code.getString() //get the ISO639 language_code from the input row
, country_code.getString() //get the ISO3166 country_code from the input row
);
The
java.util.Locale
class represents a particular cultural region. It forms a cornerstone of the internationalization support built into the java platform, and provides information to many other classes to generate appropriately localized output.
We will be using the locale on a number of occasions, but first, we use our it to initialize a
java.util.Calendar
object:
//Create a calendar, use the specified locale
var calendar = new java.util.GregorianCalendar(locale);
(Note that the java platform currently only provides one concrete Calender Class: the
java.util.GregorianCalendar
. Unfortunately, java does not seem to provide a built-in recipe for dealing with, for example, Islamic or Hebrew calendars).
We require the calendar object to obtain an instance of the
java.util.Date
Class that represents the date corresponding to the current row. To do that, we first set the calendar's current date using the
initial_date
field that was specified in the "Generate Rows" step:
//Set the initial date
calendar.setTime(initial_date.getDate());
We need this to add the number of days generated by our "Add Sequence" step:
//set the calendar to the current date by adding DaySequence days
calendar.add(calendar.DAY_OF_MONTH,DaySequence.getInteger() - 1);
(Note that we substract
1
from the DaySequence value. This is because our sequence starts at
1
, and we want the specified initial date to be included in our date dimension).
We conclude the initialization of the script by retrieving a
java.util.Date
object that represents the date for the current row.
//get the calendar date
var date = new java.util.Date(calendar.getTimeInMillis());
This
java.util.Date
instance is assigned to the
date
variable in the script, allowing it to be used as an output field of the javascript step. We require this in order to fill the
date
column of the date dimension table. We will also be using the
date
variable later on in this script to derive the value of other date dimension attributes.
Getting Text representations of full dates
Our date dimension has a number of attributes to denote a complete date containing day, month and year parts, in various formats:
date_short
,
date_medium
,
date_long
and
date_full
. These are all generated using the
java.text.DateFormat
class.
To do that, we first need to create an appropriate
DateFormat
instance using the static
getDateInstance()
method, passing our locale object as well as a constant that specifies whether we want to short, medium, long or full format. Then, we can pass the
java.util.Date
object for which we want to obtain the textual representation to the
format
method of the newly created
java.text.DateFormat
instance:
//en-us example: 9/3/07
var date_short = java.text.DateFormat.getDateInstance(
java.text.DateFormat.SHORT
, locale
).format(date);
//en-us example: Sep 3, 2007
var date_medium = java.text.DateFormat.getDateInstance(
java.text.DateFormat.MEDIUM
, locale
).format(date);
//en-us example: September 3, 2007
var date_long = java.text.DateFormat.getDateInstance(
java.text.DateFormat.LONG
, locale
).format(date);
//en-us example: Monday, September 3, 2007
var date_full = java.text.DateFormat.getDateInstance(
java.text.DateFormat.FULL
, locale
).format(date);
Formatting date parts
Extracting and formatting different date parts is most easily done by applying the
format
function on a subclass of
java.text.Dateformat
, the
java.text.SimpleDateFormat
class. The
java.text.SimpleDateFormat
class allows formatting of dates based on date and time
patterns:
//day in year: 1..366
var simpleDateFormat = java.text.SimpleDateFormat("D",locale);
var day_in_year = simpleDateFormat.format(date);
In this example, we pass both the locale and a date pattern to the constructor to create an instance of the
java.text.SimpleDateFormat
class. The pattern is passed as the string
"D"
, specifying a day-in-year format.
Once we created the
java.text.SimpleDateFormat
instance, we can apply a new pattern to it using the
applyPattern()
method. Calling the
format
method again, we obtain the date in the desired format:
//day in month: 1..31
simpleDateFormat.applyPattern("d");
var day_in_month = simpleDateFormat.format(date);
//en-us example: "Monday"
simpleDateFormat.applyPattern("EEEE");
var day_name = simpleDateFormat.format(date);
//en-us example: "Mon"
simpleDateFormat.applyPattern("E");
var day_abbreviation = simpleDateFormat.format(date);
//week in year, 1..53
simpleDateFormat.applyPattern("ww");
var week_in_year = simpleDateFormat.format(date);
//week in month, 1..5
simpleDateFormat.applyPattern("W");
var week_in_month = simpleDateFormat.format(date);
//month number in year, 1..12
simpleDateFormat.applyPattern("MM");
var month_number = simpleDateFormat.format(date);
//en-us example: "September"
simpleDateFormat.applyPattern("MMMM");
var month_name = simpleDateFormat.format(date);
//en-us example: "Sep"
simpleDateFormat.applyPattern("MMM");
var month_abbreviation = simpleDateFormat.format(date);
//2 digit representation of the year, example: "07" for 2007
simpleDateFormat.applyPattern("y");
var year2 = simpleDateFormat.format(date);
//4 digit representation of the year, example: 2007
simpleDateFormat.applyPattern("yyyy");
var year4 = simpleDateFormat.format(date);
Dealing with Quarters
Although the
java.text.SimpleDateFormat
class is useful, it does not provide any functionality for working with quarters. We do want our date dimension to contain attributes to represent the quarter, so we have to reside to computing these manually:
//handling Quarters is a DIY
var quarter_name = "Q";
var quarter_number;
switch(parseInt(month_number)){
case 1: case 2: case 3: quarter_number = "1"; break;
case 4: case 5: case 6: quarter_number = "2"; break;
case 7: case 8: case 9: quarter_number = "3"; break;
case 10: case 11: case 12: quarter_number = "4"; break;
}
quarter_name += quarter_number;
Although this will do for now, this solution doesn't really cut it because it does not produce localized output. Anyway, it is better than nothing so we'll just have to make do with it.
Period demarcation flags
Our date dimension has a few attributes that are used to indicate the start and end of week and month periods. We use simple yes/no type flags, but we allow the actual "yes" and "no" values to be specified by the user in the "Generate Rows" step. We retrieve them with the following piece of code:
//get the local yes/no values
var yes = local_yes.getString();
var no = local_no.getString();
We can now use them these to flag the start and end of week and month periods.
The start (and of course, also the end) of the week are subject to the locale. In order to find out if we are dealing with the first day of a week, we use the
getFirstDayOfWeek()
method of the
java.util.Calendar
class. By comparing its return value with the day of week of the current row, we can see if we happen to be dealing with the first day of the week:
//initialize for week calculations
var first_day_of_week = calendar.getFirstDayOfWeek();
var day_of_week = java.util.Calendar.DAY_OF_WEEK;
//find out if this is the first day of the week
var is_first_day_in_week;
if(first_day_of_week==calendar.get(day_of_week)){
is_first_day_in_week = yes;
} else {
is_first_day_in_week = no;
}
Note that we obtain the current day of the week by passing the value of the
DAY_OF_WEEK
constant to the
get method of the
java.util.Calendar
object that we initialized at the start of the script.
In order to set the value for the
is_last_day_in_week
attribute of the date dimension, we simply find out if the
next day happens to be the first day of the week. If it is, then by definition, the current row represents the last day of the week:
//calculate the next day
calendar.add(calendar.DAY_OF_MONTH,1);
//get the next calendar date
var next_day = new java.util.Date(calendar.getTimeInMillis());
//find out if this is the first day of the week
var is_last_day_in_week;
if(first_day_of_week==calendar.get(day_of_week)){
is_last_day_in_week = yes;
} else {
is_last_day_in_week = no;
}
(Note that we have already used similar code to add a day to a date when we added the day sequence to the initial date.)
We can use similar logic to calculate the values for the
is_first_day_in_month
and
is_last_day_in_month
indicators. This is actually easier, because the first day in the month is not dependant upon the locale (at least - not within one calendar). So, we only need to find out if the day of month is equal to one:
//find out if this is the first day of the month
var is_first_day_of_month;
if(day_in_month == 1){
is_first_day_in_month = yes;
} else {
is_first_day_in_month = no;
}
//find out if this is the last day in the month
var is_last_day_of_month;
if(java.text.SimpleDateFormat("d",locale).format(next_day)==1){
is_last_day_in_month = yes;
} else {
is_last_day_in_month = no;
}
A few more date attributes
We conclude the computation of the date attributes by adding a few more useful labels:
//a few useful labels
var year_quarter = year4 + "-" + quarter_name;
var year_month_number = year4 + "-" + month_number;
var year_month_abbreviation = year4 + "-" + month_abbreviation;
Like when we calculated the quarters, this is actually not a very good method because the results will not be localized. That said, the result will make sense for many locales, and we don't really have a better way to deal with it right now.
Defining the step outputs
We just calculated all the required values to fill the attributes of our date dimension. We just need to get them out of the script and into the outputs of the step.
Every variable declared in the javascript (using the
var
keyword) can be used as an output field of the javascript step. The easiest way to generate the outputs is by hitting the "Get Variables" button at the bottom of the dialog. This simply adds an output field for each variable declared in the script:
By default, the data type for all the outputs added in this way is set to the String type. Although it is good practice to choose a more specific data type, it is almost always unncessary in this case, as all integer type values will be correctly converted implicitly when we insert them into the database. There is one exception in this case, and that is the
date
output. Inside the script, it is an instance of a
java.util.Date
class, and we must set the type to "Date" in the output too. Otherwise, the (java) string representation of the
java.util.Date
object will be sent as output, and this is not automatically recognized as a
date
by MySQL.
Discarding Fields
We are now almost ready to insert the rows into the date dimension table. We only need to discard all fields in the stream that do not correspond with any of the columns in our date dimension table. We use a "Select Values" step to do that:
We use the "Get Fields To Select" button to pull in all available fields, and after that, simply select and delete each field that we do not need. As a final step, we rename the
DaySequence
field to
date_key
to map it to the
date_key
column in our date dimension table.
Inserting data into the table
In the final step, we add the generated data to the
dim_date
table we created in the very first step of the transformation:
We only need to specify the connection and the table name here, and the step will then automatically attempt to map the fields of the incoming rows to table columns.
We could have used the "Insert / Update" step, or even the "Execute SQL Script" step too to write the data to the dimension table, but that would require a little bit of extra work.
Running the transformation
After building the transformation, you can run it by hitting the "running man" icon on the toolbar. This will open a dialog where you can set a number of properties for the transformation. Hit "Launch" button there and after that, the transformation will be executed:
Closing Notes
I hope you enjoyed this tip. If you want to, you can
download the kettle transformation here, and use it as you see fit.
If you are interested in open source data warehousing, register for the
MySQL Enterprise Data Warehousing Seminar, Thursday, September 06, 2007 and hear what Robin Schumacher has to say about that subject. (Note that this is a general MySQL data warehousing seminar - this post and the seminar are unrelated)