Presentation On Talend MDM by Bhushan Maindarkar
Talend
Presentation on Talend MDM
Bhushan Maindarkar.
Table of Contents
1. General Information
1.1. What is ETL
1.2. What is Talend
1.3. What is Talend Open Studio
2. Installation
2.1. Hardware requirement
2.2. Software requirement
2.3. Download
2.4. Configure the memory settings
2.5. Launch the Studio
3. Talend Integration
3.1. Create New Project
3.2. Delete Project
3.3. Getting started with a basic Job: Creating a Job
3.4. Workspace window
3.5. Add components to the Job
3.6. List of Talend components
3.7. Connect the components together
3.8. Connect components using drag and drop method
3.9. Configuring the components
3.10. Execute Job
3.11. Custom code components
3.11.1. tJava component
3.11.2. tJavaRow component
3.11.3. tJavaFlex component
3.11.4. tLibraryLoad component
3.11.5. tSetGlobalVar component
3.12. Connection components
3.12.1. tMysqlInput component
Fidel Technologies Pvt Ltd
1. General Information
Business modeling
Graphical development
Metadata-driven design and execution
Real-time debugging
Robust execution
2. Installation
Before installing your Talend product, make sure the machines you are using meet
the following hardware requirements recommended by Talend.
Memory usage heavily depends on the size and nature of your Talend projects.
However, in summary, if your Jobs include many transformation components, you
should consider upgrading the total amount of memory allocated to your servers,
based on the following recommendations
Memory Usage
Disk usage:
Product   Client/Server   Required disk space for installation   Required disk space for use
Studio    Client          3 GB                                   3+ GB
In order for your Talend product to use the Java environment installed on your
machine, you must set the JAVA_HOME environment variable.
2. Open the Start menu and type Environment variable in the search bar to open
the Environment variable properties.
4. Under System Variables, click New... to create a variable. Name the variable
JAVA_HOME, enter the path to your Java JRE, and click OK.
5. Under System Variables, select the Path variable, click Edit... and add the
following variable at the end of the Path variable value: ;%JAVA_HOME%\bin
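Once the variable is set, it can be checked from Java itself. The following is a minimal sketch, not part of Talend: the class name JavaHomeCheck and the rule "the directory must contain a bin folder" are assumptions made for illustration only.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper (not a Talend class): checks whether a JAVA_HOME value looks usable.
class JavaHomeCheck {

    // Returns true when the value is non-empty and points to a directory containing "bin".
    static boolean looksValid(String javaHome) {
        if (javaHome == null || javaHome.trim().isEmpty()) {
            return false;
        }
        Path bin = Paths.get(javaHome, "bin");
        return Files.isDirectory(bin);
    }

    public static void main(String[] args) {
        // Read the environment variable configured in the steps above.
        String javaHome = System.getenv("JAVA_HOME");
        System.out.println("JAVA_HOME = " + javaHome);
        System.out.println("Looks valid: " + looksValid(javaHome));
    }
}
```

If the check reports false, reopen the Environment variable properties and verify the path that was entered.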
2.3. Download
1. Download the product from the Talend website.
2. Note that the .zip file contains binaries for all platforms (Linux/Unix, Windows
and macOS).
3. Once the download is complete, extract the archive file on your hard drive.
3. Talend Integration:
Fast and cost effective way to connect data
Maximize the value of data to your business with Talend Data Integration software,
a modern data platform based on an open and scalable architecture. Graphical
tools and wizards help you develop and deploy data integration jobs 10 times
faster than hand coding.
Develop and deploy 10 times faster
Synchronize metadata across database platforms.
Let anyone access and cleanse data while governing its use.
1. On the login screen, click Manage Connections, then on the dialog box that
opens click Delete Existing Project(s) to open the [Select Project] dialog box.
1. Enter the search keyword(s) in the search field of the Palette and press
Enter to validate your search.
2. Select the component you want to use and click on the design workspace
where you want to drop the component.
3. Note that you can also drop a note to your Job the same way you drop
components.
4. Each newly added component is shown in a blue box to show that it is an
individual subjob.
11 tMysqlInput Reads a MySQL table and extracts fields based on a MySQL query.
12 tMysqlOutput Inserts or updates rows in a MySQL database.
13 tMysqlConnection Creates a connection to a MySQL database.
Now that the components have been added on the workspace, they have to be
connected together. Components connected together form a subjob. Jobs are
composed of one or several subjobs carrying out various processes. In this
example, as the tLogRow and tFileOutputDelimited components are already
connected, you only need to connect the tFileInputDelimited to the tLogRow
component. To connect the components together, use either of the following
methods:
3. In the contextual menu that opens, select the type of connection you want to
use to link the components, Row > Main in this example.
4. Click the target component to create the link, tLogRow in this example
2. When the O icon appears, click it and drag the cursor to the destination
component, tLogRow in this example. A Row > Main connection is automatically
created between the two components.
2. Browse your system or enter the path to the output file, customers.csv in this
example.
3. Select the Include Header check box.
4. If needed, click the Sync columns button to retrieve the schema from the
input component.
Purpose:
tJava lets you enter custom Java code and integrate it into a Talend program. This code is
executed only once.
Job design:
tRowGenerator_1
tJava Code:
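// Print a label for the values that follow.
System.out.println("Date");
// Current date and time as a java.util.Date.
System.out.println(new Date());
// Current date and time from the Talend system routine TalendDate.
System.out.println(TalendDate.getCurrentDate());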
Output:
Starting job tjava at 18:33 26/05/2017.
Purpose:
The tJavaRow component allows Java logic to be performed for every record within a flow.
Job design:
tRowGenerator_1
tJavaRow Code:
//Code generated according to input schema and output schema
System.out.println("tJavaRow");
output_row.First_Name = StringHandling.UPCASE(input_row.First_Name);
output_row.Last_Name = input_row.Last_Name;
output_row.City = input_row.City;
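The code above only runs inside the Studio, where input_row, output_row and StringHandling are provided by Talend. As a rough illustration of the same per-record logic in plain Java, here is a minimal sketch. The Row class, the method names and the null handling are assumptions made for this example and are not generated by Talend.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-alone sketch of per-row logic similar to the tJavaRow code above.
class JavaRowSketch {

    // Stand-in for the row structure that Talend derives from the schema.
    static class Row {
        final String firstName;
        final String lastName;
        final String city;

        Row(String firstName, String lastName, String city) {
            this.firstName = firstName;
            this.lastName = lastName;
            this.city = city;
        }
    }

    // Applied once per record: upper-case the first name, copy the other columns.
    static Row transform(Row in) {
        String upper = in.firstName == null ? null : in.firstName.toUpperCase(Locale.ROOT);
        return new Row(upper, in.lastName, in.city);
    }

    // Runs the transformation over a whole flow of rows.
    static List<Row> process(List<Row> input) {
        List<Row> output = new ArrayList<>();
        for (Row row : input) {
            output.add(transform(row));
        }
        return output;
    }
}
```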
Output:
3.11.3. tJavaFlex
Purpose:
Job design:
tJavaFlex Code
Schema of tJavaFlex
Output :
Starting job tjava at 16:12 18/05/2017.
[statistics] disconnected
Job tjava ended at 16:12 18/05/2017. [exit code=0]
3.11.4. tLibraryLoad
Purpose:
tLibraryLoad loads third-party libraries into a Talend project.
Job design
tLibraryLoad component
3.11.5. tSetGlobalVar
Purpose:
Job Design :
tJava Code :
Output :
Starting job tjava at 18:04 18/05/2017.
3.12.1. tMysqlInput :
Purpose:
Job Design :
tMysqlInput Schema :
Output :
Starting job tjava at 18:52 18/05/2017.
|=-------------+----------------+-------------+----------------=|
|447 |Bhushan |null |null |
|448 |East |null |null |
|450 |higashi |null |null |
|452 |Bushan |null |null |
|455 |wast |null |null |
|456 |Mouth |null |null |
|458 |Nilesh |null |null |
|459 |higashi |null |null |
|462 |Bushan |null |null |
|465 |Bhushan |null |null |
|471 |East |null |null |
|473 |Bhushan |null |null |
|475 |kigashi |null |null |
|477 |tanaka |null |null |
|480 |Mouth |null |null |
|481 |matama |null |null |
|484 |Bushan |null |null |
|486 |tanaka |null |null |
|487 |East |null |null |
|489 |Mukesh |null |null |
|495 |Nitesh |null |null |
|496 |South |null |null |
'--------------+----------------+-------------+-----------------'
[statistics] disconnected
Job tjava ended at 18:52 18/05/2017. [exit code=0]
3.12.2 tMysqlOutput
Purpose:
Job Design :
Output :
Starting job tjava at 18:52 18/05/2017.
| tLogRow_1 |
|=-------------+----------------+-------------+----------------=|
|customerNumber|contactFirstName|ZenkakuString|MappedPhoneNumber|
|=-------------+----------------+-------------+----------------=|
|447 |Bhushan |null |null |
|448 |East |null |null |
|450 |higashi |null |null |
|452 |Bushan |null |null |
|455 |wast |null |null |
|456 |Mouth |null |null |
|458 |Nilesh |null |null |
|459 |higashi |null |null |
|462 |Bushan |null |null |
|465 |Bhushan |null |null |
|471 |East |null |null |
|473 |Bhushan |null |null |
|475 |kigashi |null |null |
|477 |tanaka |null |null |
|480 |Mouth |null |null |
|481 |matama |null |null |
|484 |Bushan |null |null |
|486 |tanaka |null |null |
|487 |East |null |null |
|489 |Mukesh |null |null |
|495 |Nitesh |null |null |
|496 |South |null |null |
'--------------+----------------+-------------+-----------------'
[statistics] disconnected
Job tjava ended at 18:52 18/05/2017. [exit code=0]
3.12.3. tMysqlConnection :
Purpose:
Job Design :
Output :
3.13.1. tAddCRCRow
Purpose:
Job Design :
Output:
Starting job dataquality at 11:34 19/05/2017.
[statistics] disconnected
Job dataquality ended at 11:34 19/05/2017. [exit code=0]
3.13.2. tReplaceList
Purpose:
Job Design
tRowGenerator1
tRowGenerator2
Output :
Starting job chgfileEncode at 12:17 19/05/2017.
Purpose:
Replace the expression with another one.
Job Design :
tReplace Component:
Output :
Starting job chgfileEncode at 12:41 19/05/2017.
Purpose:
tUniqRow makes a data flow unique based on the schema.
Job Design :
tFixedFlowInput_1:
tJavaRow
tFixedFlowInput_2
tUniqRow_1:
Output:
Starting job tuniqRow at 18:21 22/05/2017.
.-------------------+---.
| Duplicate |
|=------------------+--=|
|ABC |PQR|
|=------------------+--=|
|2.000000000000001 |A |
|-1.8333333333333337|A |
'-------------------+---'
[statistics] disconnected
Job tuniqRow ended at 18:21 22/05/2017. [exit code=0]
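To make the idea behind tUniqRow concrete, here is a minimal plain-Java sketch of splitting a flow into first occurrences and duplicates. It is not Talend's implementation: the class name, the use of String arrays as rows and the single key column are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: separates unique rows from duplicates, comparing one key column.
class UniqRowSketch {

    // Returns two lists: index 0 holds first occurrences, index 1 holds later duplicates.
    static List<List<String[]>> split(List<String[]> rows, int keyIndex) {
        Set<String> seen = new HashSet<>();
        List<String[]> uniques = new ArrayList<>();
        List<String[]> duplicates = new ArrayList<>();
        for (String[] row : rows) {
            // Set.add returns false when the key was already present.
            if (seen.add(row[keyIndex])) {
                uniques.add(row);
            } else {
                duplicates.add(row);
            }
        }
        return Arrays.asList(uniques, duplicates);
    }
}
```

With rows keyed on the second column, a second row carrying the same key value as an earlier one ends up in the duplicates list, which is comparable to the "Duplicate" table printed above.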
3.13.5. tSchemaComplianceCheck
Purpose:
Validates all input rows against a reference schema, or checks the type, nullability and
length of rows against reference values. The validation can be carried out in full or
in part.
Job design:
Dataset(Input):
tSchemaComplianceCheck
Output(Rejected):
tFileOutputDelimited(CSV):
1; label 1 with max length 30;another label with max length 40;2007-10-01;not null;nullable
4; label 4 with max length 30;another label with max length 40;2007-12-13;not null;
5; label 5 with max length 30;another label with max length 40;2007-13-12;not null;
9; label 9 with max length 30;another label with max length 40;2007-10-01;;
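The kind of checks involved (type, length, nullability, date validity) can be sketched in plain Java as follows. This is only an illustration: the four-column layout, the class name and the error messages are assumptions, and the sketch does not reproduce the exact rules or schema used in the Job above.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of schema-style validation on one semicolon-separated row.
// Assumed layout: id (integer); label (max length 30); date (yyyy-MM-dd); required field.
class ComplianceSketch {

    // Returns the list of problems found; an empty list means the row passes.
    static List<String> check(String line) {
        List<String> errors = new ArrayList<>();
        String[] f = line.split(";", -1); // -1 keeps trailing empty fields
        if (f.length != 4) {
            errors.add("expected 4 fields");
            return errors;
        }
        try {
            Integer.parseInt(f[0].trim());
        } catch (NumberFormatException e) {
            errors.add("id is not an integer");
        }
        if (f[1].length() > 30) {
            errors.add("label exceeds length 30");
        }
        try {
            LocalDate.parse(f[2].trim()); // strict ISO date, rejects month 13
        } catch (DateTimeParseException e) {
            errors.add("invalid date");
        }
        if (f[3].isEmpty()) {
            errors.add("required field is empty");
        }
        return errors;
    }
}
```

Under these assumed rules, a date such as 2007-13-12 or an empty required field would send the row to the reject flow.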
Purpose:
Compares a column from the main flow with a reference column from the lookup
flow and outputs the main flow data along with the computed distance.
Helps ensure the data quality of any source data against a reference data source.
Job design:
tFuzzyMatch component
pronunciation. It first loads the phonetics of all entries of the lookup reference and
checks all entries of the main flow against the entries of the reference flow.
Double Metaphone: a new version of the Metaphone phonetic algorithm that
produces more accurate results than the original algorithm. It can return both a
primary and a secondary code for a string. This accounts for some ambiguous
cases as well as for multiple variants of surnames with common ancestry.
Min Distance (Levenshtein only): Set the minimum number of changes allowed to
match the reference. If set to 0, only perfect matches are returned.
Max Distance (Levenshtein only): Set the maximum number of changes allowed to
match the reference.
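For readers unfamiliar with the Levenshtein distance, the following minimal sketch computes it in plain Java and applies a min/max window. It is not the tFuzzyMatch implementation: the class and method names are hypothetical, and treating Min Distance and Max Distance as an inclusive range is an assumption made for this example.

```java
// Hypothetical sketch of Levenshtein matching with a min/max distance window.
class LevenshteinSketch {

    // Classic dynamic-programming edit distance (insert, delete, substitute each cost 1).
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // True when the number of changes lies within [min, max] (assumed inclusive).
    static boolean matches(String value, String reference, int min, int max) {
        int d = distance(value, reference);
        return d >= min && d <= max;
    }
}
```

For example, "Bushan" differs from "Bhushan" by one inserted character, so it matches with a maximum distance of 1 but not with a maximum distance of 0.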
Output:
Purpose:
Reads the access-log file for an Apache HTTP server.
tApacheLogInput helps you manage the Apache HTTP Server effectively. To do so, it is
necessary to get feedback about the activity and performance of the server as well as any
problems that may be occurring.
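As an illustration of what reading an access log involves, the sketch below parses one line in the Apache Common Log Format with a regular expression. It is a hypothetical stand-alone example, not the component's code; the class name and the field order returned are assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: splits one Common Log Format line into its fields.
class AccessLogSketch {

    // host identity user [time] "request" status bytes
    private static final Pattern CLF = Pattern.compile(
            "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+)$");

    // Returns {host, identity, user, time, request, status, bytes}, or null if the line does not match.
    static String[] parse(String line) {
        Matcher m = CLF.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] fields = new String[7];
        for (int i = 0; i < fields.length; i++) {
            fields[i] = m.group(i + 1);
        }
        return fields;
    }
}
```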
Job design:
tApacheLogInput:
Output:
Function:
tCreateTemporaryFile creates and manages temporary files.
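In plain Java, the same lifecycle (create a temporary file, use it, remove it) can be sketched with java.nio.file.Files. The class name, the file prefix and the choice to delete the file immediately after use are assumptions for this example, not a description of how the component works internally.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: write to a temporary file, read it back, then delete it.
class TempFileSketch {

    static String roundTrip(String content) {
        try {
            // Created in the default temporary directory of the system.
            Path tmp = Files.createTempFile("talend_sketch_", ".tmp");
            try {
                Files.write(tmp, content.getBytes(StandardCharsets.UTF_8));
                return new String(Files.readAllBytes(tmp), StandardCharsets.UTF_8);
            } finally {
                // Remove the file once it is no longer needed.
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```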
Job design:
tFileOutputDelimited output:
Schema tRowGenerator:
tLogRow output:
3.14.3. tFileCompare:
Purpose:
Compares two files and provides comparison data (based on a read-only schema)
Job design :
Output:
Purpose
Copies a source file into a target directory and can remove the source file if so defined.
Job design:
Purpose:
Reads the header and content parts of a defined email file and helps to extract standard key
data from emails.
Job design:
tFileInputMail:
Output:
3.14.6. tFileInputProperties:
Purpose:
tFileInputProperties opens a text file, reads it row by row, and separates the fields
according to the model key = value.
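The key = value model is the same one used by java.util.Properties, so the idea can be illustrated with a few lines of plain Java. This is a sketch only; the class name and the sample keys are hypothetical and the component may handle details differently.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical sketch: parse "key = value" text into key/value pairs.
class PropertiesSketch {

    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            // Each non-comment line is split at the first '=' (surrounding spaces are ignored).
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        Properties p = parse("host = localhost\nport = 3306\n");
        for (String key : p.stringPropertyNames()) {
            System.out.println(key + " -> " + p.getProperty(key));
        }
    }
}
```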
Job design:
Output:
Starting job tFileInputProperties at 15:34 01/06/2017.
3.14.7. tRSSInput
Purpose
tRSSInput makes it possible to keep track of blog entries on websites to gather and
organize information for you quickly and easily.
Job design:
Output:
07 tXMLMap Allows joins, column and row filtering, transformation and
multiple outputs.
3.15.1. tAggregateRow:
Purpose:
tAggregateRow receives an input flow and aggregates it based on one or more columns.
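The aggregation idea (group by a key column, then apply an operation such as sum) can be sketched in plain Java as follows. The class name, the two-column row layout and the choice of sum as the operation are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: group rows by a key column and sum a numeric column.
class AggregateRowSketch {

    // Each row is {groupKey, numericValue}; the result keeps keys in first-seen order.
    static Map<String, Double> sumByKey(List<String[]> rows) {
        Map<String, Double> totals = new LinkedHashMap<>();
        for (String[] row : rows) {
            // merge() inserts the value for a new key or adds it to the running total.
            totals.merge(row[0], Double.parseDouble(row[1]), Double::sum);
        }
        return totals;
    }
}
```

Because the totals are kept in a map, the input does not need to be sorted; every key stays in memory until the end of the flow.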
Job Design :
tAggregateRow component
tMap component
[statistics] disconnected
Job tAggregateRow ended at 15:09 19/05/2017. [exit code=0]
Purpose:
tFilterRow component is used to filter input rows by setting conditions on the selected columns.
Job design :
Output :
3.15.3. tSortRow:
Purpose:
tSortRow component sorts input data based on one or several columns, by sort type and order.
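Sorting on several columns, each with its own sort type and order, corresponds to chaining comparators in Java. The sketch below is an assumption-laden illustration: the class name and the {name, age} row layout are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: sort rows by name (alphabetical, ascending), then by age (numeric, descending).
class SortRowSketch {

    // Each row is {name, age}; the input list is left untouched.
    static List<String[]> sort(List<String[]> rows) {
        List<String[]> sorted = new ArrayList<>(rows);
        Comparator<String[]> byName = Comparator.comparing(r -> r[0]);
        Comparator<String[]> byAgeDesc =
                Comparator.comparingInt((String[] r) -> Integer.parseInt(r[1])).reversed();
        // The second criterion only decides the order when the names are equal.
        sorted.sort(byName.thenComparing(byAgeDesc));
        return sorted;
    }
}
```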
Job Design :
tSortRow Component :
Output :
3.15.4. tAggregateSortedRow:
Purpose:
tAggregateSortedRow receives a sorted flow and aggregates it based on one or more columns. Each
output line provides the aggregation key and the relevant result of set operations (min,
max, sum).
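When the input is already sorted on the aggregation key, the result for a key can be emitted as soon as the key changes, without keeping all groups in memory. The following plain-Java sketch illustrates that single-pass idea under stated assumptions: the class name, the {key, value} row layout and the "key=sum" output format are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: single-pass sum over rows that are already sorted by key.
class AggregateSortedSketch {

    // Each row is {key, numericValue}; rows with the same key must be adjacent.
    static List<String> sumSorted(List<String[]> sortedRows) {
        List<String> out = new ArrayList<>();
        String currentKey = null;
        double sum = 0;
        for (String[] row : sortedRows) {
            // A key change closes the current group and emits its total.
            if (currentKey != null && !currentKey.equals(row[0])) {
                out.add(currentKey + "=" + sum);
                sum = 0;
            }
            currentKey = row[0];
            sum += Double.parseDouble(row[1]);
        }
        // Emit the last group, if any rows were read.
        if (currentKey != null) {
            out.add(currentKey + "=" + sum);
        }
        return out;
    }
}
```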
Job Design :
tAggregateSortedRow :
Output :
Starting job tAggregateSorted at 16:00 19/05/2017.
'----------+-------'
[statistics] disconnected
Job tAggregateSorted ended at 16:00 19/05/2017. [exit code=0]
3.15.5. tSampleRow:
Purpose:
Job Design :
tSampleRow Component :
Output :
3.15.6. tXMLMap
Purpose:
tXMLMap allows joins, column and row filtering, transformation and multiple outputs.
Job Design :
tXMLMap :
tFileOutputDelimited :
Output :
Purpose:
The tHttpRequest component is part of the Internet family of components, and makes both POST
and GET requests to the
Job design:
Output:
3.16.2. tREST:
Purpose:
The tREST component serves as a REST Web service client that sends HTTP requests to a REST
Web service provider and gets the responses.
Job design:
tRest Component :
3.16.3. tExtractJSONFields
Purpose:
tExtractJSONFields extracts the data from JSON fields stored in a file, a database table, etc.,
based on the XPath query.
Output:
Purpose:
Merges data from various sources, based on a common schema.
Job design
Schema
Output
Purpose:
Duplicates the incoming schema into two identical output flows.
Job design
Schema of tReplicate:
tFilterRow 1:
tFilterRow2:
Output :
4. Data Profiling
Data profiling is the process of examining the data available in different data
sources and collecting statistics and information about this data. Data profiling
helps to assess the quality level of the data against a defined set of goals.
If data is of a poor quality, or managed in structures that cannot be integrated to
meet the needs of the enterprise, business processes and decision-making suffer.
Compared to manual analysis techniques, data profiling technology improves the
enterprise ability to meet the challenge of managing data quality and to address
the data quality challenges faced during data migrations and data integrations.
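As a small illustration of "collecting statistics" about a column, the sketch below counts rows, blank values and distinct values in plain Java. It is not how the profiling perspective of the Studio is implemented; the class name and the three chosen indicators are assumptions for this example.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: basic profiling indicators for one column of values.
class ColumnProfileSketch {

    // Returns {row count, null-or-blank count, distinct non-blank count}.
    static int[] profile(List<String> values) {
        int blanks = 0;
        Set<String> distinct = new HashSet<>();
        for (String v : values) {
            if (v == null || v.trim().isEmpty()) {
                blanks++;
            } else {
                distinct.add(v);
            }
        }
        return new int[] { values.size(), blanks, distinct.size() };
    }
}
```

A high blank count or an unexpectedly high distinct count (for example, several spellings of the same name) is the kind of signal that profiling is meant to surface.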
4.1. Create a connection
In the DQ Repository tree view, expand Metadata, right-click DB Connections and
select Create DB Connection.
5. Go through the steps in the wizard and modify the database connection settings
as required.
6. Click Finish to validate the modifications. A dialog box opens prompting you to
reload the updated database connection.
7. Select the reload option if you want to reload the new database structure for the
updated database connection.
Output:
Talend MDM Architecture can be broken down into functional blocks that enable
interaction between users and the MDM Hub and their corresponding IT needs. Here
are the main building blocks of Talend MDM
The Clients block includes one or more Talend Studio and Web browsers that
could be on the same or on different machines.
From the Studio, you can set up and operate a centralized repository. You can
build data models that employ the necessary business and data rules to create a
single copy of the master data. This master data will be propagated back to target
and source systems.
From the Web browser, you can search, display or edit master data with tasks
defined by the Studio.
The Server block includes an MDM server - where the master data are governed
and monitored.
The Database block includes the MDM database - where the master data and the
system data are stored
5.2. Modeling:
Before we get started, let's first get a common understanding of the most important
MDM terms:
Term: Description
(business) element: Also referred to as business attribute. The actual name of the data
point.
(business) entity: Describes the actual data (the elements), its nature, its structure and
its relationships. An entity can have one or more business elements.
The Talend MDM jargon for this concept is data model entity.
data model type: This is an element or collection of elements which is globally defined
and can be used across various entities. This makes maintenance of
common elements easier.
data model: Defines the attributes (elements), user access rights and relationships
of entities mastered by the MDM Hub. The data model is the central
component of Talend MDM. A data model maps to one or more
(business) entities that can be explicitly defined. Any concept can be
defined by a data model. A data model can have multiple entities.
(business) domain: A collection of data models that define a particular concept. For
instance, the customer domain may be defined by the organization,
account, contact and opportunity data models. A product domain may
be defined by a product, product family and price list.
Ultimately, the domain is the collection of all data models that relate to
a concept. Talend MDM can model any and many domains within a
single hub. It is a generic multi-domain MDM solution.
data container: Holds data of one or several business entities. Data containers are
typically used to separate master data domains.
In the Studio workspace, an editor opens where you can define the details of your new
data model. The new data model and data container are listed in the MDM Repository
tree view.
5.7. Web GUI:
On successful installation, http://localhost:8080/talendmdm will show:
The open source version is restricted to two user accounts:
standard user
user: user
password: user
admin
user: administrator
password: administrator
Web view:
Web entity