Pentaho Data Integration Beginner's Guide
About this ebook
Capturing, manipulating, cleansing, transferring, and loading data effectively are the prime requirements in every IT organization. Accomplishing these tasks requires either people devoted to developing extensive software programs, or an investment in ETL or data integration tools that can simplify this work.
Pentaho Data Integration is a full-featured open source ETL solution that allows you to meet these requirements. It has an intuitive, graphical, drag-and-drop design environment, and its ETL capabilities are powerful. However, getting started with Pentaho Data Integration can be difficult or confusing.
"Pentaho Data Integration Beginner's Guide, Second Edition" provides the guidance needed to overcome that difficulty, covering the key features of Pentaho Data Integration.
"Pentaho Data Integration Beginner's Guide, Second Edition" starts with the installation of the Pentaho Data Integration software and then moves on to cover all the key Pentaho Data Integration concepts. Each chapter introduces new features, allowing you to gradually get involved with the tool. First, you will learn to do all kinds of data manipulation and to work with plain files. Then, the book gives you a primer on databases and teaches you how to work with databases inside Pentaho Data Integration. You will also be introduced to data warehouse concepts and learn how to load data into a data warehouse. After that, you will learn to implement simple and complex processes. Finally, you will have the opportunity to apply and reinforce all the learned concepts through the implementation of a simple datamart.
With "Pentaho Data Integration Beginner's Guide, Second Edition", you will learn everything you need to know in order to meet your data manipulation requirements.
Approach
This book focuses on teaching you by example. It walks you through every aspect of Pentaho Data Integration, giving systematic instructions in a friendly style and allowing you to learn in front of your computer, playing with the tool. The extensive use of drawings and screenshots makes the process of learning Pentaho Data Integration easy. Throughout the book, numerous tips and helpful hints are provided that you will not find anywhere else.
Who this book is for
This book is a must-have for software developers, database administrators, IT students, and everyone involved or interested in developing ETL solutions or, more generally, doing any kind of data manipulation. Those who have never used Pentaho Data Integration will benefit most from the book, but those who have will also find it useful.
This book is also a good starting point for database administrators, data warehouse designers, architects, or anyone who is responsible for data warehouse projects and needs to load data into them.
Book preview
Pentaho Data Integration Beginner's Guide - Maria Carina Roldan
Table of Contents
Pentaho Data Integration Beginner's Guide
Credits
About the Author
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers and more
Why Subscribe?
Free Access for Packt account holders
Preface
How to read this book
What this book covers
What you need for this book
Who this book is for
Conventions
Time for action – heading
What just happened?
Pop quiz – heading
Have a go hero – heading
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Getting Started with Pentaho Data Integration
Pentaho Data Integration and Pentaho BI Suite
Exploring the Pentaho Demo
Pentaho Data Integration
Using PDI in real-world scenarios
Loading data warehouses or datamarts
Integrating data
Data cleansing
Migrating information
Exporting data
Integrating PDI along with other Pentaho tools
Pop quiz – PDI data sources
Installing PDI
Time for action – installing PDI
What just happened?
Pop quiz – PDI prerequisites
Launching the PDI graphical designer – Spoon
Time for action – starting and customizing Spoon
What just happened?
Spoon
Setting preferences in the Options window
Storing transformations and jobs in a repository
Creating your first transformation
Time for action – creating a hello world transformation
What just happened?
Directing Kettle engine with transformations
Exploring the Spoon interface
Designing a transformation
Running and previewing the transformation
Pop quiz – PDI basics
Installing MySQL
Time for action – installing MySQL on Windows
What just happened?
Time for action – installing MySQL on Ubuntu
What just happened?
Have a go hero – installing a visual software for administering and querying MySQL
Summary
2. Getting Started with Transformations
Designing and previewing transformations
Time for action – creating a simple transformation and getting familiar with the design process
What just happened?
Getting familiar with editing features
Using the mouseover assistance toolbar
Working with grids
Understanding the Kettle rowset
Looking at the results in the Execution Results pane
The Logging tab
The Step Metrics tab
Have a go hero – calculating the achieved percentage of work
Have a go hero – calculating the achieved percentage of work (second version)
Running transformations in an interactive fashion
Time for action – generating a range of dates and inspecting the data as it is being created
What just happened?
Adding or modifying fields by using different PDI steps
The Select values step
Getting fields
Date fields
Pop quiz – generating data with PDI
Have a go hero – experiencing different PDI steps
Have a go hero – generating a rowset with dates
Handling errors
Time for action – avoiding errors while converting the estimated time from string to integer
What just happened?
The error handling functionality
Time for action – configuring the error handling to see the description of the errors
What just happened?
Personalizing the error handling
Have a go hero – trying out different ways of handling errors
Summary
3. Manipulating Real-world Data
Reading data from files
Time for action – reading results of football matches from files
What just happened?
Input files
Input steps
Reading several files at once
Time for action – reading all your files at a time using a single text file input step
What just happened?
Time for action – reading all your files at a time using a single text file input step and regular expressions
What just happened?
Regular expressions
Troubleshooting reading files
Have a go hero – exploring your own files
Pop quiz – providing a list of text files using regular expressions
Have a go hero – measuring the performance of input steps
Sending data to files
Time for action – sending the results of matches to a plain file
What just happened?
Output files
Output steps
Have a go hero – extending your transformations by writing output files
Have a go hero – generate your custom matches.txt file
Getting system information
Time for action – reading and writing matches files with flexibility
What just happened?
The Get System Info step
Running transformations from a terminal window
Time for action – running the matches transformation from a terminal window
What just happened?
Have a go hero – finding out system information
XML files
Time for action – getting data from an XML file with information about countries
What just happened?
What is XML?
PDI transformation files
Getting data from XML files
XPath
Configuring the Get data from XML step
Kettle variables
How and when you can use variables
Have a go hero – exploring XML files
Summary
4. Filtering, Searching, and Performing Other Useful Operations with Data
Sorting data
Time for action – sorting information about matches with the Sort rows step
What just happened?
Have a go hero – listing the last match played by each team
Calculations on groups of rows
Time for action – calculating football match statistics by grouping data
What just happened?
Group by Step
Numeric fields
Have a go hero – formatting 99.55
Pop quiz – formatting output fields
Have a go hero – listing the languages spoken by a country
Filtering
Time for action – counting frequent words by filtering
What just happened?
Time for action – refining the counting task by filtering even more
What just happened?
Filtering rows using the Filter rows step
Have a go hero – playing with filters
Looking up data
Time for action – finding out which language people speak
What just happened?
The Stream lookup step
Have a go hero – selecting the most popular of the official languages
Have a go hero – counting words more precisely
Data cleaning
Time for action – fixing words before counting them
What just happened?
Cleansing data with PDI
Have a go hero – counting words by cleaning them first
Summary
5. Controlling the Flow of Data
Splitting streams
Time for action – browsing new features of PDI by copying a dataset
What just happened?
Copying rows
Have a go hero – recalculating statistics
Distributing rows
Time for action – assigning tasks by distributing
What just happened?
Pop quiz – understanding the difference between copying and distributing
Splitting the stream based on conditions
Time for action – assigning tasks by filtering priorities with the Filter rows step
What just happened?
PDI steps for splitting the stream based on conditions
Time for action – assigning tasks by filtering priorities with the Switch/Case step
What just happened?
Have a go hero – listing languages and countries
Pop quiz – deciding between a Number range step and a Switch/Case step
Merging streams
Time for action – gathering progress and merging it all together
What just happened?
PDI options for merging streams
Time for action – giving priority to Bouchard by using the Append Stream
What just happened?
Have a go hero – sorting and merging all tasks
Treating invalid data by splitting and merging streams
Time for action – treating errors in the estimated time to avoid discarding rows
What just happened?
Treating rows with invalid data
Have a go hero – trying to find missing countries
Summary
6. Transforming Your Data by Coding
Doing simple tasks with the JavaScript step
Time for action – counting frequent words by coding in JavaScript
What just happened?
Using the JavaScript language in PDI
Inserting JavaScript code using the Modified JavaScript Value Step
Adding fields
Modifying fields
Using transformation predefined constants
Testing the script using the Test script button
Reading and parsing unstructured files with JavaScript
Time for action – changing a list of house descriptions with JavaScript
What just happened?
Looping over the dataset rows
Have a go hero – enhancing the houses file
Doing simple tasks with the Java Class step
Time for action – counting frequent words by coding in Java
What just happened?
Using the Java language in PDI
Inserting Java code using the User Defined Java Class step
Adding fields
Modifying fields
Sending rows to the next step
Data types equivalence
Testing the Java Class using the Test class button
Have a go hero – parameterizing the Java Class
Transforming the dataset with Java
Time for action – splitting the field to rows using Java
What just happened?
Avoiding coding by using purpose built steps
Pop quiz – choosing a scripting language for coding inside a transformation
Summary
7. Transforming the Rowset
Converting rows to columns
Time for action – enhancing the films file by converting rows to columns
What just happened?
Converting row data to column data by using the Row Denormaliser step
Have a go hero – houses revisited
Aggregating data with a Row Denormaliser step
Time for action – aggregating football matches data with the Row Denormaliser step
What just happened?
Using Row Denormaliser for aggregating data
Have a go hero – calculating statistics by team
Normalizing data
Time for action – enhancing the matches file by normalizing the dataset
What just happened?
Modifying the dataset with a Row Normaliser step
Summarizing the PDI steps that operate on sets of rows
Have a go hero – verifying the benefits of normalizing
Have a go hero – normalizing the Films file
Generating a custom time dimension dataset by using Kettle variables
Time for action – creating the time dimension dataset
What just happened?
Getting variables
Time for action – parameterizing the start and end date of the time dimension dataset
What just happened?
Using the Get Variables step
Have a go hero – enhancing the time dimension
Summary
8. Working with Databases
Introducing the Steel Wheels sample database
Connecting to the Steel Wheels database
Time for action – creating a connection to the Steel Wheels database
What just happened?
Connecting with Relational Database Management Systems
Pop quiz – connecting to a database in several transformations
Have a go hero – connecting to your own databases
Exploring the Steel Wheels database
Time for action – exploring the sample database
What just happened?
A brief word about SQL
Exploring any configured database with the database explorer
Have a go hero – exploring the sample data in depth
Have a go hero – exploring your own databases
Querying a database
Time for action – getting data about shipped orders
What just happened?
Getting data from the database with the Table input step
Using the SELECT statement for generating a new dataset
Making flexible queries using parameters
Time for action – getting orders in a range of dates using parameters
What just happened?
Adding parameters to your queries
Making flexible queries by using Kettle variables
Time for action – getting orders in a range of dates by using Kettle variables
What just happened?
Using Kettle variables in your queries
Pop quiz – interpreting data types coming from a database
Have a go hero – querying the sample data
Sending data to a database
Time for action – loading a table with a list of manufacturers
What just happened?
Inserting new data into a database table with the Table output step
Inserting or updating data by using other PDI steps
Time for action – inserting new products or updating existing ones
What just happened?
Time for action – testing the update of existing products
What just happened?
Inserting or updating with the Insert/Update step
Have a go hero – populating a films database
Have a go hero – populating the products table
Pop quiz – replacing an Insert/Update step with a Table Output step followed by an Update step
Eliminating data from a database
Time for action – deleting data about discontinued items
What just happened?
Deleting records of a database table with the Delete step
Have a go hero – deleting old orders
Have a go hero – creating the time dimension
Summary
9. Performing Advanced Operations with Databases
Preparing the environment
Time for action – populating the Jigsaw database
What just happened?
Exploring the Jigsaw database model
Looking up data in a database
Doing simple lookups
Time for action – using a Database lookup step to create a list of products to buy
What just happened?
Looking up values in a database with the Database lookup step
Have a go hero – preparing the delivery of the products
Have a go hero – refining the transformation
Performing complex lookups
Time for action – using a Database join step to create a list of suggested products to buy
What just happened?
Joining data from the database to the stream data by using a Database join step
Have a go hero – rebuilding the list of customers
Introducing dimensional modeling
Loading dimensions with data
Time for action – loading a region dimension with a Combination lookup/update step
What just happened?
Time for action – testing the transformation that loads the region dimension
What just happened?
Describing data with dimensions
Loading Type I SCD with a Combination lookup/update step
Have a go hero – adding regions to the Region dimension
Have a go hero – loading the manufacturers dimension
Storing a history of changes
Time for action – keeping a history of changes in products by using the Dimension lookup/update step
What just happened?
Time for action – testing the transformation that keeps history of product changes
What just happened?
Keeping an entire history of data with a Type II slowly changing dimension
Loading Type II SCDs with the Dimension lookup/update step
Have a go hero – storing a history just for the theme of a product
Have a go hero – loading the Regions dimension as a Type II SCD
Pop quiz – implementing a Type III SCD in PDI
Have a go hero – loading a mini dimension
Summary
10. Creating Basic Task Flows
Introducing PDI jobs
Time for action – creating a folder with a Kettle job
What just happened?
Executing processes with PDI jobs
Using Spoon to design and run jobs
Pop quiz – defining PDI jobs
Designing and running jobs
Time for action – creating a simple job and getting familiar with the design process
What just happened?
Changing the flow of execution on the basis of conditions
Looking at the results in the Execution results window
The Logging tab
The Job metrics tab
Running transformations from jobs
Time for action – generating a range of dates and inspecting how things are running
What just happened?
Using the Transformation job entry
Have a go hero – loading the dimension tables
Receiving arguments and parameters in a job
Time for action – generating a hello world file by using arguments and parameters
What just happened?
Using named parameters in jobs
Have a go hero – backing up your work
Running jobs from a terminal window
Time for action – executing the hello world job from a terminal window
What just happened?
Have a go hero – experiencing Kitchen
Using named parameters and command-line arguments in transformations
Time for action – calling the hello world transformation with fixed arguments and parameters
What just happened?
Have a go hero – saying hello again and again
Have a go hero – loading the time dimension from a job
Deciding between the use of a command-line argument and a named parameter
Have a go hero – analyzing the use of arguments and named parameters
Summary
11. Creating Advanced Transformations and Jobs
Re-using part of your transformations
Time for action – calculating statistics with the use of a subtransformation
What just happened?
Creating and using subtransformations
Have a go hero – calculating statistics for all subjects
Have a go hero – counting words more precisely (second version)
Creating a job as a process flow
Time for action – generating top average scores by copying and getting rows
What just happened?
Transferring data between transformations by using the copy/get rows mechanism
Have a go hero – modifying the flow
Iterating jobs and transformations
Time for action – generating custom files by executing a transformation for every input row
What just happened?
Executing for each row
Have a go hero – building lists of products to buy
Enhancing your processes with the use of variables
Time for action – generating custom messages by setting a variable with the name of the examination file
What just happened?
Setting variables inside a transformation
Running a job inside another job with a Job job entry
Understanding the scope of variables
Have a go hero – processing several files at once
Have a go hero – enhancing the jigsaw database update process
Have a go hero – executing the proper jigsaw database update process
Pop quiz – deciding the scope of variables
Summary
12. Developing and Implementing a Simple Datamart
Exploring the sales datamart
Deciding the level of granularity
Loading the dimensions
Time for action – loading the dimensions for the sales datamart
What just happened?
Extending the sales datamart model
Have a go hero – loading the dimensions for the puzzle star model
Loading a fact table with aggregated data
Time for action – loading the sales fact table by looking up dimensions
What just happened?
Getting the information from the source with SQL queries
Translating the business keys into surrogate keys
Obtaining the surrogate key for Type I SCD
Obtaining the surrogate key for Type II SCD
Obtaining the surrogate key for the Junk dimension
Obtaining the surrogate key for the Time dimension
Pop quiz – creating a product type dimension
Have a go hero – loading a puzzles fact table
Getting facts and dimensions together
Time for action – loading the fact table using a range of dates obtained from the command line
What just happened?
Time for action – loading the SALES star
What just happened?
Have a go hero – enhancing the loading process of the sales fact table
Have a go hero – loading the puzzle sales star
Have a go hero – loading the facts once a month
Automating the administrative tasks
Time for action – automating the loading of the sales datamart
What just happened?
Have a go hero – creating a backup of your work automatically
Have a go hero – enhancing the automation process by sending an email if an error occurs
Summary
A. Working with Repositories
Creating a database repository
Time for action – creating a PDI repository
What just happened?
Creating a database repository to store your transformations and jobs
Working with the repository storage system
Time for action – logging into a database repository
What just happened?
Logging into a database repository using credentials
Creating transformations and jobs in repository folders
Creating database connections, users, servers, partitions, and clusters
Designing jobs and transformations
Backing up and restoring a repository
Examining and modifying the contents of a repository with the Repository Explorer
Migrating from file-based system to repository-based system and vice versa
Summary
B. Pan and Kitchen – Launching Transformations and Jobs from the Command Line
Running transformations and jobs stored in files
Running transformations and jobs from a repository
Specifying command-line options
Kettle variables and the Kettle home directory
Checking the exit code
Providing options when running Pan and Kitchen
Summary
C. Quick Reference – Steps and Job Entries
Transformation steps
Job entries
Summary
D. Spoon Shortcuts
General shortcuts
Designing transformations and jobs
Grids
Repositories
Database wizards
Summary
E. Introducing PDI 5 Features
Welcome page
Usability
Solutions to commonly occurring situations
Backend
Summary
F. Best Practices
Summary
G. Pop Quiz Answers
Chapter 1, Getting Started with Pentaho Data Integration
Pop quiz – PDI data sources
Pop quiz – PDI prerequisites
Pop quiz – PDI basics
Chapter 2, Getting Started with Transformations
Pop quiz – generating data with PDI
Chapter 3, Manipulating Real-world Data
Pop quiz – providing a list of text files using regular expressions
Chapter 4, Filtering, Searching, and Performing Other Useful Operations with Data
Pop quiz – formatting output fields
Chapter 5, Controlling the Flow of Data
Pop quiz – deciding between a Number range step and a Switch/Case step
Pop quiz – understanding the difference between copying and distributing
Chapter 6, Transforming Your Data by Coding
Pop quiz – choosing a scripting language for coding inside a transformation
Chapter 8, Working with Databases
Pop quiz – connecting to a database in several transformations
Pop quiz – interpreting data types coming from a database
Chapter 9, Performing Advanced Operations with Databases
Pop quiz – implementing a Type III SCD in PDI
Chapter 10, Creating Basic Task Flows
Pop quiz – defining PDI jobs
Chapter 11, Creating Advanced Transformations and Jobs
Pop quiz – deciding the scope of variables
Chapter 12, Developing and Implementing a Simple Datamart
Pop quiz – creating a product type dimension
Index
Pentaho Data Integration Beginner's Guide
Second Edition
Pentaho Data Integration Beginner's Guide
Second Edition
Copyright © 2013 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: April 2010
Second Edition: October 2013
Production Reference: 1171013
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78216-504-0
www.packtpub.com
Cover Image by Suresh Mogre (<suresh.mogre.99@gmail.com>)
Credits
Author
María Carina Roldán
Reviewers
Tomoyuki Hayashi
Gretchen Moran
Acquisition Editors
Usha Iyer
Greg Wild
Lead Technical Editor
Azharuddin Sheikh
Technical Editors
Sharvari H. Baet
Aparna K
Kanhucharan Panda
Vivek Pillai
Project Coordinator
Navu Dhillon
Proofreaders
Simran Bhogal
Ameesha Green
Indexer
Mariammal Chettiyar
Graphics
Ronak Dhruv
Yuvraj Mannari
Production Coordinator
Conidon Miranda
Cover Work
Conidon Miranda
About the Author
María Carina Roldán was born in Esquel, Argentina, and earned her Bachelor's degree in Computer Science at the Universidad Nacional de La Plata (UNLP). She then moved to Buenos Aires, where she has lived since 1994.
She has worked as a BI consultant for almost fifteen years. She started working with Pentaho technology back in 2006. Over the last three and a half years, she has been devoted to working full time for Webdetails—a company acquired by Pentaho in 2013—as an ETL specialist.
Carina is the author of Pentaho 3.2 Data Integration Beginner's Book, Packt Publishing, April 2009, and the co-author of Pentaho Data Integration 4 Cookbook, Packt Publishing, June 2011.
I'd like to thank those who have encouraged me to write this book: firstly, the Pentaho community. They have given me such rewarding feedback after my other two books on PDI; it is because of them that I feel compelled to pass my knowledge on to those willing to learn. I also want to thank my friends! Especially Flavia, Jaqui, and Marce for their encouraging words throughout the writing process; Silvina for clearing up my questions about English; Gonçalo for helping with the use of PDI on Mac systems; and Hernán for helping with ideas and examples for this new edition.
I would also like to thank the technical reviewers—Gretchen, Tomoyuki, Nelson, and Paula—for the time and dedication that they have put in to reviewing the book.
About the Reviewers
Tomoyuki Hayashi is a system engineer who works mainly at the intersection of open source and enterprise software. He has developed a CMIS-compliant, CouchDB-based ECM application named NemakiWare (http://nemakiware.com/).
He currently works with Aegif, Japan, which provides advisory services for content-oriented applications, collaboration improvement, and ECM in general. It is one of the most experienced companies in Japan at supporting the introduction of foreign-made software to the Japanese market.
Gretchen Moran works as an independent Pentaho consultant on a variety of business intelligence and big data projects. She has 15 years of experience in the business intelligence realm, developing software and providing services for a number of companies including Hyperion Solutions and the Pentaho Corporation.
Gretchen continues to contribute to Pentaho Corporation's latest and greatest software initiatives while managing the daily adventures of her two children, Isabella and Jack, with her husband, Doug.
www.PacktPub.com
Support files, eBooks, discount offers and more
You might want to visit www.PacktPub.com for support files and downloads related to your book.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
http://PacktLib.PacktPub.com
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can access, read and search across Packt's entire library of books.
Why Subscribe?
Fully searchable across every book published by Packt
Copy and paste, print and bookmark content
On demand and accessible via web browser
Free Access for Packt account holders
If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view nine entirely free books. Simply use your login credentials for immediate access.
Preface
Pentaho Data Integration (also known as Kettle) is an engine along with a suite of tools responsible for the processes of Extracting, Transforming, and Loading—better known as the ETL processes. PDI not only serves as an ETL tool, but is also used for other purposes such as migrating data between applications or databases, exporting data from databases to flat files, data cleansing, and much more. PDI has an intuitive, graphical, drag-and-drop design environment, and its ETL capabilities are powerful. However, getting started with PDI can be difficult or confusing. This book provides the guidance needed to overcome that difficulty, covering the key features of PDI. Each chapter introduces new features, allowing you to gradually get involved with the tool.
By the end of the book, you will have not only experimented with all kinds of examples, but will have also built a basic but complete datamart with the help of PDI.
How to read this book
Although it is recommended that you read all the chapters, you don't have to. The book allows you to tailor the PDI learning process according to your particular needs.
The first five chapters along with Chapter 10, Creating Basic Task Flows, cover the core concepts. If you don't know PDI and want to learn just the basics, reading those chapters will suffice. If you need to work with databases, you could include Chapter 8, Working with Databases, in the roadmap.
If you already know the basics, you can improve your PDI knowledge by reading Chapter 6, Transforming Your Data by Coding, Chapter 7, Transforming the Rowset, and Chapter 11, Creating Advanced Transformations and Jobs.
If you already know PDI and want to learn how to use it to load or maintain a data warehouse or datamart, you will find all that you need in Chapter 9, Performing Advanced Operations with Databases, and Chapter 12, Developing and Implementing a Simple Datamart.
Finally, all the appendices are valuable resources for anyone reading this book.
What this book covers
Chapter 1, Getting Started with Pentaho Data Integration, serves as the most basic introduction to PDI, presenting the tool. This chapter includes instructions for installing PDI and gives you the opportunity to play with the graphical designer (Spoon). The chapter also includes instructions for installing a MySQL server.
Chapter 2, Getting Started with Transformations, explains the fundamentals of working with transformations, including learning the simplest ways of transforming data and getting familiar with the process of designing, debugging, and testing a transformation.
Chapter 3, Manipulating Real-world Data, explains how to apply the concepts learned in the previous chapter to real-world data that comes from different sources. It also explains how to save the results to different destinations: plain files, Excel files, and more. As real data is very prone to errors, this chapter also explains the basics of handling errors and validating data.
Chapter 4, Filtering, Searching, and Performing Other Useful Operations with Data, expands the set of operations learned in previous chapters by teaching the reader a great variety of essential features such as filtering, sorting, or looking for data.
Chapter 5, Controlling the Flow of Data, explains different options that PDI offers to combine or split flows of data.
Chapter 6, Transforming Your Data by Coding, explains how JavaScript and Java coding can help in the treatment of data. It shows why you may need to code inside PDI, and explains in detail how to do it.
Chapter 7, Transforming the Rowset, explains the ability of PDI to deal with some sophisticated problems—for example, normalizing data from pivoted tables—in a simple fashion.
Chapter 8, Working with Databases, explains how to use PDI to work with databases. The list of topics covered includes connecting to a database, previewing and getting data, and inserting, updating, and deleting data. As database knowledge is not presumed, the chapter also covers fundamental concepts of databases and the SQL language.
Chapter 9, Performing Advanced Operations with Databases, explains how to perform advanced operations with databases, including those especially designed to load data warehouses. A primer on data warehouse concepts is also given in case you are not familiar with the subject.
Chapter 10, Creating Basic Task Flows, serves as an introduction to processes in PDI. Through the creation of simple jobs, you will learn what jobs are and what they are used for.
Chapter 11, Creating Advanced Transformations and Jobs, deals with advanced concepts that will allow you to build complex PDI projects. The list of covered topics includes nesting jobs, iterating on jobs and transformations, and creating subtransformations.
Chapter 12, Developing and Implementing a Simple Datamart, presents a simple datamart project, and guides you to build the datamart by using all the concepts learned throughout the book.
Appendix A, Working with Repositories, is a step-by-step guide to the creation of a PDI database repository, followed by instructions on how to work with it.
Appendix B, Pan and Kitchen – Launching Transformations and Jobs from the Command Line, is a quick reference for running transformations and jobs from the command line.
Appendix C, Quick Reference – Steps and Job Entries, serves as a quick reference to steps and job entries used throughout the book.
Appendix D, Spoon Shortcuts, is an extensive list of Spoon shortcuts useful for saving time when designing and running PDI jobs and transformations.
Appendix E, Introducing PDI 5 Features, quickly introduces you to the architectural and functional features included in Kettle 5—the version that was under development when this book was written.
Appendix F, Best Practices, gives a list of best PDI practices and recommendations.
Appendix G, Pop Quiz Answers, contains answers to pop quiz questions.
What you need for this book
PDI is a multiplatform tool. This means that no matter what your operating system is, you will be able to work with the tool. The only prerequisite is to have JVM 1.6 installed. It is also useful to have a spreadsheet program such as Excel or Calc, along with a good text editor.
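If you are unsure which Java version you have installed, you can check it from a terminal window (the exact output varies by Java vendor and platform):
java -version
If the version reported is 1.6 or higher, you are ready to run PDI.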
Having an Internet connection while reading is extremely useful as well. Several links are provided throughout the book that complement what is explained. Additionally, there is the PDI forum, where you can search for answers or post questions if you get stuck.
Who this book is for
This book is a must-have for software developers, database administrators, IT students, and everyone involved or interested in developing ETL solutions or, more generally, doing any kind of data manipulation. Those who have never used PDI will benefit the most from the book, but those who have will also find it useful.
This book is also a good starting point for database administrators, data warehouse designers, architects, or anyone who is responsible for data warehouse projects and needs to load data into them.
You don't need to have any prior data warehouse or database experience to read this book. Fundamental database and data warehouse technical terms and concepts are explained in easy-to-understand language.
Conventions
In this book, you will find several headings that appear frequently.
To give clear instructions on how to complete a procedure or task, we use:
Time for action – heading
Action 1
Action 2
Action 3
Instructions often need some extra explanation so that they make sense, so they are followed with:
What just happened?
This heading explains the working of tasks or instructions that you have just completed.
You will also find some other learning aids in the book, including:
Pop quiz – heading
These are short multiple-choice questions intended to help you test your own understanding.
Have a go hero – heading
These are practical challenges that give you ideas for experimenting with what you have learned.
You will also find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.
Code words in text are shown as follows: You may notice that we used the Unix command rm to remove the Drush directory rather than the DOS del command.
A block of code is set as follows:
# * Fine Tuning
#
key_buffer = 16M
key_buffer_size = 32M
max_allowed_packet = 16M
thread_stack = 512K
thread_cache_size = 8
max_connections = 300
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
# * Fine Tuning
#
key_buffer = 16M
key_buffer_size = 32M
max_allowed_packet = 16M
thread_stack = 512K
thread_cache_size = 8
max_connections = 300
Any command-line input or output is written as follows:
cd /ProgramData/Propeople
rm -r Drush
git clone --branch master http://git.drupal.org/project/drush.git
New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: On the Select Destination Location screen, click on Next to accept the default destination.
Note
Warnings or important notes appear in a box like this.
Tip
Tips and tricks appear like this.
Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.
To send us general feedback, simply send an e-mail to <feedback@packtpub.com>, and mention the book title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.
Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
Downloading the example code
You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
Errata
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you would report this to us.