
Presentation On Talend MDM by Bhushan Maindarkar


This user guide explains how to use Talend Open Studio, Talend Integration, Talend
Profiling and Talend MDM. It contains examples of every component.

Talend
Presentation on Talend MDM

Bhushan Maindarkar.

Talend MDM User Guide.


Fidel Technologies Pvt Ltd

Table of Contents
1. General Information
1.1. What is ETL
1.2. What is Talend
1.3. What is Talend Open Studio
2. Installation
2.1. Hardware requirement
2.2. Software requirement
2.3. Download
2.4. Configure the memory settings
2.5. Launch the Studio
3. Talend Integration
3.1. Create New Project
3.2. Delete Project
3.3. Getting started with a basic Job: Creating a Job
3.4. Workspace window
3.5. Add components to the Job
3.6. List of Talend components
3.7. Connect the components together
3.8. Connect components using drag and drop method
3.9. Configuring the components
3.10. Execute Job
3.11. Custom code components
3.11.1. tJava component
3.11.2. tJavaRow component
3.11.3. tJavaFlex component
3.11.4. tLibraryLoad component
3.11.5. tSetGlobalVar component
3.12. Connection components
3.12.1. tMysqlInput component
3.12.2. tMysqlOutput component
3.12.3. tMysqlConnection component
3.13. Data quality components
3.13.1. tAddCRCRow component
3.13.2. tReplaceList component
3.13.3. tReplace component
3.13.4. tUniqRow component
3.13.5. tSchemaComplianceCheck component
3.13.6. tFuzzyMatch component
3.14. File components
3.14.1. tApacheLogInput component
3.14.2. tCreateTemporaryFile component
3.14.3. tFileCompare component
3.14.4. tFileCopy component
3.14.5. tFileInputMail component
3.14.6. tFileInputProperties component
3.14.7. tRSSInput component
3.15. Processing components
3.15.1. tAggregateRow component
3.15.2. tFilterRow component
3.15.3. tSortRow component
3.15.4. tAggregateSortedRow component
3.15.5. tSampleRow component
3.15.6. tXMLMap component
3.15.7. tMap component
3.16. Internet components
3.16.1. tHttpRequest component
3.16.2. tRest component
3.16.3. tExtractJSONField component
3.16.4. tUnite component
3.16.5. tReplicate component
4. Talend Data Profiling
4.1. Create a connection into Profiling
4.2. Create to a delimited file
4.3. Open or edit database connection
4.4. Database Analysis
4.5. Column Analysis
4.5.1. Add patterns to the analyzed columns
4.6. Duplication Analysis
5. Talend MDM - Master Data Management
5.1. Functional architecture of Talend MDM Architecture
5.2.2. Modeling
5.2. Creating a data model
5.2.1. Creating business entities in the data model
5.3. Add server Location
5.4. Data Container
5.5. Create a view
5.6. Deploy a model
5.7. Web GUI


1. General Information

1.1. What is ETL


ETL, which stands for "extract, transform and load," is the set of functions
combined into one tool or solution that enables companies to "extract" data from
numerous databases, applications and systems, "transform" it as appropriate, and
"load" it into another database, a data mart or a data warehouse for analysis, or
send it along to another operational system to support a business process.
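
The three stages can be pictured with a very small Java sketch. It is only an illustration of the idea, not how Talend implements it: the class name EtlSketch, the in-memory lists standing in for a source and a warehouse, and the upper-casing rule are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class EtlSketch {
    // Extract: read raw records from a source (an in-memory list stands in for a file or table).
    static List<String> extract() {
        return Arrays.asList(" alice,pune ", "bob,denver");
    }

    // Transform: trim surrounding whitespace and upper-case each record.
    static List<String> transform(List<String> raw) {
        List<String> cleaned = new ArrayList<>();
        for (String record : raw) {
            cleaned.add(record.trim().toUpperCase(Locale.ROOT));
        }
        return cleaned;
    }

    // Load: append the cleaned records to a target (another list stands in for a warehouse table).
    static void load(List<String> cleaned, List<String> target) {
        target.addAll(cleaned);
    }

    public static void main(String[] args) {
        List<String> warehouse = new ArrayList<>();
        load(transform(extract()), warehouse);
        System.out.println(warehouse);
    }
}
```

In a Talend Job, each of these stages would typically be a component (for example an input component, a transformation component and an output component) rather than a hand-written method.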

1.2. What is Talend


Talend offers an enterprise-class integration solution that allows users to natively
connect databases, flat files, and cloud-based applications. The Talend software
provides graphical drag-and-drop tools, test creation, and code generation in
numerous languages.
Features of Talend

Business modeling
Graphical development
Metadata-driven design and execution
Real-time debugging
Robust execution


1.3. What is Talend Open Studio


Talend Open Studio for Data Integration is an open source data integration product
developed by Talend and designed to combine, convert and update data in various
locations across a business.

2. Installation
Before installing your Talend product, make sure the machines you are using meet
the following hardware requirements recommended by Talend.

Memory usage depends heavily on the size and nature of your Talend projects. If
your Jobs include many transformation components, consider increasing the total
amount of memory allocated to your servers, based on the following
recommendations.

2.1. Hardware requirement

Memory usage:

Product    Client/Server    Recommended allocated memory
Studio     Client           3 GB minimum, 4 GB recommended

Disk usage:

Product    Client/Server    Required disk space for installation    Required disk space for use
Studio     Client           3 GB                                    3+ GB

2.2. Software requirement


Setting up JAVA_HOME

In order for your Talend product to use the Java environment installed on your
machine, you must set the JAVA_HOME environment variable.

To do so, proceed as follows:

1. Find the folder where Java is installed, usually C:\Program Files\Java\JREx.x.x.

2. Open the Start menu and type Environment variable in the search bar to open
the Environment variable properties.


3. Click Environment Variables....

4. Under System Variables, click New... to create a variable. Name the variable
JAVA_HOME, enter the path to your Java JRE, and click OK.

5. Under System Variables, select the Path variable, click Edit... and add the
following variable at the end of the Path variable value: ;%JAVA_HOME%\bin
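
Once the variable is set, a quick check from Java itself can confirm that it is visible to newly started processes. The following is a minimal sketch and not part of the Talend tooling; the class name JavaHomeCheck and the helper describe are made up for this illustration.

```java
import java.io.File;

class JavaHomeCheck {
    // Builds a human-readable status line for a JAVA_HOME value (hypothetical helper).
    static String describe(String javaHome) {
        if (javaHome == null || javaHome.trim().isEmpty()) {
            return "JAVA_HOME is not set";
        }
        // The Path entry added in step 5 points at this bin folder.
        File bin = new File(javaHome, "bin");
        return "JAVA_HOME=" + javaHome + " (bin folder " + (bin.isDirectory() ? "found" : "missing") + ")";
    }

    public static void main(String[] args) {
        // Reads the variable from the environment of the current process.
        System.out.println(describe(System.getenv("JAVA_HOME")));
        System.out.println("Running on Java " + System.getProperty("java.version"));
    }
}
```

Run it from a newly opened command prompt, since consoles that were already open before the change usually keep the old environment.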

2.3. Download
1. Download the product from the Talend website.

2. Note that the .zip file contains binaries for ALL platforms (Linux/Unix, Windows
and MacOS).

3. Once the download is complete, extract the archive file on your hard drive.

2.4. Configure the memory settings


1. If you want to tune the memory allocation for your JVM, you only need to edit
the TOS_DQ-win-x86_64.ini file.

2. The default values are:

-vmargs -Xms40m -Xmx500m -XX:MaxMetaspaceSize=128m
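
To see what the -Xms and -Xmx values mean in practice, the same flags can be passed to a small standalone Java class. This is only a sketch under that assumption: the .ini file above configures the Studio's own JVM, whereas this example (class name MemoryCheck is made up) reports the heap of whichever JVM runs it.

```java
class MemoryCheck {
    // Converts a byte count to whole mebibytes.
    static long toMb(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory() reflects the -Xmx limit; totalMemory() is the heap currently reserved.
        System.out.println("Max heap (MB): " + toMb(rt.maxMemory()));
        System.out.println("Current heap (MB): " + toMb(rt.totalMemory()));
    }
}
```

After compiling with javac, running it as `java -Xms40m -Xmx500m MemoryCheck` should report a maximum close to 500 MB; the exact figure can be slightly lower depending on the JVM and garbage collector.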

2.5. Launch the Studio


1. Double-click the TOS_DQ-win-x86_64.exe executable file to launch your Talend
Studio.

3. Talend Integration
A fast and cost-effective way to connect data.
Maximize the value of data to your business with Talend Data Integration software,
a modern data platform based on an open and scalable architecture. Graphical
tools and wizards help you develop and deploy data integration Jobs 10 times
faster than hand coding.
Synchronize metadata across database platforms.
Let anyone access and cleanse data while governing its use.


3.1. Create New Project

1. Launch Talend Studio


2. On the login window, select the Create a new project option and enter a
project name in the field.
3. Click Finish to create the project and open it in the Studio.

3.2. Delete a project

1. On the login screen, click Manage Connections, then on the dialog box that
opens click Delete Existing Project(s) to open the [Select Project] dialog box.


3.3. Getting started with a basic Job: Creating a Job


3.4. Workspace window

3.5. Add components to the Job

To drop a component from the Palette, proceed as follows:

1. Enter the search keyword(s) in the search field of the Palette and press
Enter to validate your search.

2. Select the component you want to use and click on the design workspace
where you want to drop the component.


3. Note that you can also drop a note onto your Job the same way you drop
components.
4. Each newly added component is shown in a blue box to indicate that it is an
individual subjob.

3.6. List of Talend components

ID  Name of component         Description

1   tAddCRCRow                Adds a CRC column for all rows of the flow.
2   tChangeFileEncoding       Changes the encoding of a file.
3   tReplaceList              Replaces strings using a dynamic replacement list.
4   tUniqRow                  Makes a data flow unique based on the schema.
5   tReplace                  Replaces an expression with another one.
6   tJava                     Lets you enter personalized code to integrate into a Talend program. The code is executed only once.
7   tJavaRow                  Allows Java logic to be performed for every record within a flow.
8   tJavaFlex                 Lets you enter personalized code to integrate into a Talend program, in three Java code parts (start, main and end) that together act as a component dedicated to a desired operation.
9   tLibraryLoad              Loads third-party libraries into a Talend project.
10  tSetGlobalVar             Lets you define and set global variables in the GUI.
11  tMysqlInput               Reads a MySQL table and extracts fields based on a MySQL query.
12  tMysqlOutput              Inserts or updates lines in a MySQL database.
13  tMysqlConnection          Creates a connection to a MySQL database.
14  tAggregateRow             Receives an input flow and aggregates it based on one or more columns.
15  tAggregateSortedRow       Receives a sorted flow and aggregates it based on one or more columns. For each output line, the aggregation key and the relevant result of set operations (min, max, sum) are provided.
16  tExternalSortRow          Sorts input data based on one or several columns using an external sort application.
17  tFilterRow                Filters input rows by setting conditions on the selected columns.
18  tMap                      Allows joins, column or row filtering, transformation and multiple outputs.
19  tSampleRow                Filters rows according to line numbers.
20  tSortRow                  Sorts input data based on one or several columns, by sort type and order.
21  tXMLMap                   Allows joins, column or row filtering, transformation and multiple outputs.
22  tFileInputDelimited       Reads a given file row by row with simple separated fields.
23  tFileInputExcel           Reads an Excel file (.xls or .xlsx) and extracts data line by line.
24  tFileInputFullRow         Opens a file, reads it row by row and sends complete rows, as defined in the schema, to the next Job component via a Row link.
25  tFileInputLDIF            Reads an LDIF file row by row and passes the rows to the next component.
26  tFileInputMail            Reads the header and content parts of an email file.
27  tFileInputMSDelimited     Reads a complex multi-structured delimited file.
28  tFileInputMSPositional    Reads a complex multi-structured positional file.
29  tFileInputXML             Reads an XML structured file and extracts data row by row.
30  tFileInputRegex           Powerful component that can replace a number of other components of the File family. Requires some advanced knowledge of regular expression syntax.
31  tFileOutputDelimited      Writes to a file row by row with simple separated fields.
32  tFileOutputExcel          Writes an MS Excel file with separated data values according to a defined schema.
33  tFileOutputRow            Writes data into a file.
34  tFileOutputLDIF           Writes or modifies an LDIF file with data separated into respective entries based on the defined schema, or else deletes content from an LDIF file.
35  tFileOutputMSDelimited    Writes into a file based on the schema.
36  tFileOutputMSPositional   Writes into a file based on the position of fields in a string.
37  tFileOutputXML            Writes an XML file with separated data values according to a defined schema.
38  tHttpRequest              Part of the Internet family of components; makes both POST and GET requests.
39  tREST                     Serves as a REST Web service client that sends HTTP requests to a REST Web service provider and gets the responses.
40  tExtractJSONFields        Extracts data from JSON fields stored in a file, a database table, etc., based on the XPath query.
41  tMsgBox                   Displays a message box.
42  tUnite                    Merges data from various sources, based on a common schema.
43  tReplicate                Duplicates the incoming schema into two identical output flows.

3.7 Connect the components together

Now that the components have been added on the workspace, they have to be
connected together. Components connected together form a subjob. Jobs are
composed of one or several subjobs carrying out various processes. In this
example, as the tLogRow and tFileOutputDelimited components are already
connected, you only need to connect the tFileInputDelimited to the tLogRow
component. To connect the components together, use either of the following
methods:

Right-click and click again method

1. Right-click the source component, tFileInputDelimited in this example.

2. In the contextual menu that opens, select the type of connection you want to
use to link the components, Row > Main in this example.

3. Click the target component to create the link, tLogRow in this example.

3.8. Drag and drop method

1. Click the input component, tFileInputDelimited in this example.


2. When the O icon appears, click it and drag the cursor to the destination
component, tLogRow in this example. A Row > Main connection is automatically
created between the two components.

3.9. Configuring the components

Example: Configuring the tLogRow component

1. Double-click the tLogRow component to open its Basic settings view.


2. In the Mode area, select Table (print values in cells of a table).

Configuring the tFileOutputDelimited component

1. Double-click the tFileOutputDelimited component to open its Basic settings view.


2. Browse your system or enter the path to the output file, customers.csv in this
example.
3. Select the Include Header check box.
4. If needed, click the Sync columns button to retrieve the schema from the
input component.

3.10. Execute Job

Now that components are configured, the Job can be executed.


To do so, proceed as follows:
1. Press Ctrl+S to save the Job.
2. Go to the Run tab and click Run to execute the Job.
3. Alternatively, press F6 to execute the Job.


3.11. Custom code components

ID  Name of component    Description

1   tJava                Lets you enter personalized code to integrate into a Talend program. The code is executed only once.
2   tJavaRow             Allows Java logic to be performed for every record within a flow.
3   tJavaFlex            Lets you enter personalized code to integrate into a Talend program, in three Java code parts (start, main and end) that together act as a component dedicated to a desired operation.
4   tLibraryLoad         Loads third-party libraries into a Talend project.
5   tSetGlobalVar        Lets you define and set global variables in the GUI.

3.11.1. tJava component

Purpose:


tJava enables you to enter personalized code in order to integrate it into a Talend program. The
code is executed only once.

Job design:

tRowGenerator_1

tJava Code:


// Print a label, then the current date from plain Java and from the TalendDate routine.
System.out.println("Date");
System.out.println(new Date());
System.out.println(TalendDate.getCurrentDate());

Output:
Starting job tjava at 18:33 26/05/2017.

[statistics] connecting to socket on port 3369


[statistics] connected
Date
Fri May 26 18:33:32 IST 2017
Fri May 26 18:33:32 IST 2017
.----------+----------+--------------.
| tLogRow_1 |
|=---------+----------+-------------=|
|first_name|last_name |city |
|=---------+----------+-------------=|
|Gerald |Van Buren |Charleston |
|Warren |Fillmore |Annapolis |
|Rutherford|McKinley |Nashville |
|Herbert |Harding |Saint Paul |
|Rutherford|Roosevelt |Madison |
|Grover |Cleveland |Springfield |
|John |Polk |Albany |
|Jimmy |Carter |Harrisburg |
|Benjamin |Hoover |Juneau |
|James |Taft |Lincoln |
|Benjamin |Hayes |Columbia |
|Harry |Madison |Jackson |
|Andrew |Van Buren |Salem |
|Warren |Madison |Des Moines |
|Andrew |Nixon |Harrisburg |
|Benjamin |Reagan |Providence |
|Theodore |Adams |Bismarck |
|Andrew |Roosevelt |Denver |
|Ulysses |Lincoln |Oklahoma City |
|Richard |Coolidge |Springfield |
|Martin |Polk |Denver |
|Abraham |Fillmore |Richmond |

3.11.2. tJavaRow component

Purpose:

The tJavaRow component allows Java logic to be performed for every record within a flow.

Job design:


tRowGenerator_1

tRowGenerator_1 Schema setting


tJavaRow Code:
//Code generated according to input schema and output schema
System.out.println("tJavaRow");
output_row.First_Name = StringHandling.UPCASE(input_row.First_Name);
output_row.Last_Name = input_row.Last_Name;
output_row.City = input_row.City;

Output:


3.11.3. tJavaFlex component

Purpose:

tJavaFlex enables you to enter personalized code in order to integrate it into a
Talend program. With tJavaFlex, you can enter the three Java code parts (start,
main and end) that together form a kind of component dedicated to a desired
operation.

Job design:


tJavaFlex Code

Schema of tJavaFlex

Output :
Starting job tjava at 16:12 18/05/2017.


[statistics] connecting to socket on port 3519


[statistics] connected
tJavaFlex_1: Start code
tJavaFlex_1: Main code: i=1
tJavaFlex_1: Main code: i=2
tJavaFlex_1: Main code: i=3
tJavaFlex_1: End code
.---------.
|tLogRow_1|
|=-------=|
|newColumn|
|=-------=|
|row 1 |
|row 2 |
|row 3 |
'---------'

[statistics] disconnected
Job tjava ended at 16:12 18/05/2017. [exit code=0]
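
The tJavaFlex code used in this Job is not reproduced in the text. As a rough standalone sketch of how the three parts relate, the class below assumes the common pattern in which the Start code opens a loop, the Main code runs once per row, and the End code closes the loop. The class name JavaFlexSketch and the method run are hypothetical; the log messages and row values are taken from the output above.

```java
import java.util.ArrayList;
import java.util.List;

class JavaFlexSketch {
    // Mimics the three tJavaFlex parts; returns the values written to the output column.
    static List<String> run(int rowCount) {
        List<String> newColumn = new ArrayList<>();

        // Start code: runs once and opens the loop.
        System.out.println("tJavaFlex_1: Start code");
        for (int i = 1; i <= rowCount; i++) {

            // Main code: runs once per row and fills the output column.
            System.out.println("tJavaFlex_1: Main code: i=" + i);
            newColumn.add("row " + i);

        // End code: closes the loop, then runs once after the last row.
        }
        System.out.println("tJavaFlex_1: End code");
        return newColumn;
    }

    public static void main(String[] args) {
        System.out.println(run(3));
    }
}
```

In the real component the three parts are typed into separate fields, and the output column is assigned through the outgoing row rather than collected in a list.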

3.11.4. tLibraryLoad component

Purpose:
tLibraryLoad lets you add and load third-party libraries in a Talend project.

Job design

tLibraryLoad component


3.11.5. tSetGlobalVar component

Purpose:

tSetGlobalVar allows you to define and set global variables in the GUI.

Job Design :

Component setting for tSetGlobalVar


tJava Code :

Output :
Starting job tjava at 18:04 18/05/2017.

[statistics] connecting to socket on port 4013


[statistics] connected
myString=Hello World!
[statistics] disconnected
Job tjava ended at 18:04 18/05/2017. [exit code=0]
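
The tJava code for this Job is not reproduced in the text. The sketch below shows the general idea in standalone Java: a value is stored under a key in a shared map and read back with a cast. The class GlobalVarSketch, its helper methods, and the HashMap standing in for the Job-wide globalMap are assumptions made for this illustration; only the key myString and its value come from the output above.

```java
import java.util.HashMap;
import java.util.Map;

class GlobalVarSketch {
    // Stand-in for the Job-wide globalMap (an assumption for this standalone example).
    static final Map<String, Object> globalMap = new HashMap<>();

    // Roughly what tSetGlobalVar does for one key/value row of its settings table.
    static void setGlobalVar(String key, Object value) {
        globalMap.put(key, value);
    }

    // Roughly what a downstream tJava would do: read the value back and cast it.
    static String readString(String key) {
        return (String) globalMap.get(key);
    }

    public static void main(String[] args) {
        setGlobalVar("myString", "Hello World!");
        System.out.println("myString=" + readString("myString"));
    }
}
```

Because the map stores plain objects, the reader has to know the expected type and cast accordingly; a key that was never set yields null.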

3.12. Connection components

ID  Name of component    Description

1   tMysqlInput          Reads a MySQL table and extracts fields based on a MySQL query.
2   tMysqlOutput         Inserts or updates lines in a MySQL database.
3   tMysqlConnection     Creates a connection to a MySQL database.


3.12.1. tMysqlInput component

Purpose:

Reads a MySQL table and extracts fields based on a MySQL query.

Job Design :

tMysqlInput Schema :

Output :
Starting job tjava at 18:52 18/05/2017.

[statistics] connecting to socket on port 3823


[statistics] connected
.--------------+----------------+-------------+-----------------.
| tLogRow_1 |
|=-------------+----------------+-------------+----------------=|
|customerNumber|contactFirstName|ZenkakuString|MappedPhoneNumber|
|=-------------+----------------+-------------+----------------=|
|447 |Bhushan |null |null |
|448 |East |null |null |
|450 |higashi |null |null |
|452 |Bushan |null |null |
|455 |wast |null |null |
|456 |Mouth |null |null |
|458 |Nilesh |null |null |
|459 |higashi |null |null |
|462 |Bushan |null |null |
|465 |Bhushan |null |null |
|471 |East |null |null |
|473 |Bhushan |null |null |
|475 |kigashi |null |null |
|477 |tanaka |null |null |
|480 |Mouth |null |null |
|481 |matama |null |null |
|484 |Bushan |null |null |
|486 |tanaka |null |null |
|487 |East |null |null |
|489 |Mukesh |null |null |
|495 |Nitesh |null |null |
|496 |South |null |null |
'--------------+----------------+-------------+-----------------'

[statistics] disconnected
Job tjava ended at 18:52 18/05/2017. [exit code=0]

3.12.2. tMysqlOutput component
Purpose:

Inserts or updates lines in a MySQL database.

Job Design :

Output :
Starting job tjava at 18:52 18/05/2017.

[statistics] connecting to socket on port 3823


[statistics] connected
.--------------+----------------+-------------+-----------------.
| tLogRow_1 |
|=-------------+----------------+-------------+----------------=|
|customerNumber|contactFirstName|ZenkakuString|MappedPhoneNumber|
|=-------------+----------------+-------------+----------------=|
|447 |Bhushan |null |null |
|448 |East |null |null |
|450 |higashi |null |null |
|452 |Bushan |null |null |
|455 |wast |null |null |
|456 |Mouth |null |null |
|458 |Nilesh |null |null |
|459 |higashi |null |null |
|462 |Bushan |null |null |
|465 |Bhushan |null |null |
|471 |East |null |null |
|473 |Bhushan |null |null |
|475 |kigashi |null |null |
|477 |tanaka |null |null |
|480 |Mouth |null |null |
|481 |matama |null |null |
|484 |Bushan |null |null |
|486 |tanaka |null |null |
|487 |East |null |null |
|489 |Mukesh |null |null |
|495 |Nitesh |null |null |
|496 |South |null |null |
'--------------+----------------+-------------+-----------------'

[statistics] disconnected
Job tjava ended at 18:52 18/05/2017. [exit code=0]

3.12.3. tMysqlConnection component
Purpose:

Creates a connection to a MySQL database.


Job Design :

tMysqlConnection component setting :


Output :

3.13. Data Quality Components


ID  Name of component         Description
1   tAddCRCRow                Adds a CRC column for all rows of the flow.
2   tChangeFileEncoding       Changes the encoding of a file.
3   tReplaceList              Replaces strings using a dynamic replacement list.
4   tUniqRow                  Makes a data flow unique based on the schema.
5   tReplace                  Replaces an expression with another one.
6   tSchemaComplianceCheck    Validates all input rows against a reference schema, or checks the type, nullability and length of rows against reference values. The validation can be carried out in full or in part.
7   tIntervalMatch            Receives a main flow and aggregates it based on a join to a lookup flow (Java) or a given lookup file (Perl). It then matches a specified value to a range of values and returns related information.
8   tFuzzyMatch               Compares a column from the main flow with a reference column from the lookup flow and outputs the main flow data along with the distance. Helps ensure the data quality of any source data against a reference data source.

3.13.1. tAddCRCRow component
Purpose:

tAddCRCRow adds a CRC column for all rows of the flow.


Job Design :

tMap component setting :

tAddCRCRow component setting

Output:
Starting job dataquality at 11:34 19/05/2017.


[statistics] connecting to socket on port 3883


[statistics] connected
For input string: "411028 "
.--+--------------+--------+-------+--------+----------+----------.
| tLogRow_1 |
|=-+--------------+--------+-------+--------+----------+---------=|
|Id|Street |Town |Country|Postcode|var1 |CRC |
|=-+--------------+--------+-------+--------+----------+---------=|
|1 |north est road|pune |india |413411 |19-05-2017|2848775222|
|3 |manhaton rd |New york|US |284745 |19-05-2017|2774761735|
'--+--------------+--------+-------+--------+----------+----------'

[statistics] disconnected
Job dataquality ended at 11:34 19/05/2017. [exit code=0]
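
To illustrate what a per-row CRC value is, the sketch below computes a CRC32 checksum over a few column values using the standard java.util.zip.CRC32 class. It is not the component's actual implementation: the class name CrcRowSketch, the choice of columns, the UTF-8 encoding and the joining of values without a separator are all assumptions, so the result is not expected to match the CRC column shown above.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

class CrcRowSketch {
    // Computes a CRC32 checksum over the selected column values of one row.
    // Feeding the values one after another without a separator is an assumption of this sketch.
    static long crcOf(String... columns) {
        CRC32 crc = new CRC32();
        for (String column : columns) {
            crc.update(column.getBytes(StandardCharsets.UTF_8));
        }
        return crc.getValue();
    }

    public static void main(String[] args) {
        // Sample values borrowed from the first output row above.
        System.out.println(crcOf("1", "north est road", "pune", "india"));
    }
}
```

Two rows with identical values in the selected columns produce the same checksum, which is what makes such a column useful as a compact change or duplicate indicator.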

3.13.2. tReplaceList component
Purpose:

tReplaceList replaces strings using a dynamic replacement list.

Job Design


tRowGenerator_1

tRowGenerator_2

tReplaceList component setting :

Output :
Starting job chgfileEncode at 12:17 19/05/2017.

[statistics] connecting to socket on port 3931


[statistics] connected
.----------+----------+--------------.
| tLogRow_1 |
|=---------+----------+-------------=|
|Last_Name |First_Name|city |
|=---------+----------+-------------=|
|Garfield |Millard |Little Rock |
|Arthur |Lyndon |Topeka |
|Eisenhower|Richard |Topeka |
|Harding |Andrew |Concord |
|Fillmore |Calvin |Baton Rouge |
|Monroe |William |Raleigh |
|Eisenhower|Woodrow |Baton Rouge |
|Van Buren |Dwight |Denver |
|Truman |Martin |Olympia |
|Carter |Chester |Honolulu |
|Coolidge |Harry |Carson City |
|Coolidge |James |Columbus |
|Pierce |Grover |Frankfort |
|Ford |Ulysses |Charleston |
'----------+----------+--------------'
[statistics] disconnected
Job chgfileEncode ended at 12:17 19/05/2017. [exit code=0

3.13.3. tReplace component

Purpose:
Replaces one expression with another.

Job Design :

tReplace Component:

Output :
Starting job chgfileEncode at 12:41 19/05/2017.

[statistics] connecting to socket on port 3532


[statistics] connected
.---------+----------+------.
|         tLogRow_1         |
|=--------+----------+-----=|
|Last_Name|First_Name|city  |
|=--------+----------+-----=|
|1        |Bhushan   |Samuel|
|2        |Ortega    |lee   |
|3        |Zant      |Thi   |
|4        |Cohen     |John  |
|5        |Park      |Umar  |
|6        |Knipp     |Troy  |
|7        |Lunberg   |Greg  |
|8        |Brown     |Sami  |
|9        |Barnhill  |Pascal|
|10       |Rose      |Aaron |
|11       |          |      |
|12       |          |      |
'---------+----------+------'
[statistics] disconnected
Job chgfileEncode ended at 12:41 19/05/2017. [exit code=0]

3.13.4. tUniqRow component

Purpose:
tUniqRow makes a data flow unique based on the key columns of the schema.
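
A minimal Python sketch of the same idea: the first row seen for each key goes to a "unique" list and every later row with the same key goes to a "duplicate" list. The function name split_unique and the sample values are assumptions for illustration; the column names ABC and PQR mirror the output shown in this section.

```python
def split_unique(rows, key_columns):
    """Split rows into (uniques, duplicates) based on the key columns."""
    seen = set()
    uniques, duplicates = [], []
    for row in rows:
        key = tuple(row[c] for c in key_columns)
        if key in seen:
            duplicates.append(row)
        else:
            seen.add(key)
            uniques.append(row)
    return uniques, duplicates

# Hypothetical rows keyed on the PQR column.
rows = [
    {"ABC": 1.5, "PQR": "A"},
    {"ABC": 2.5, "PQR": "B"},
    {"ABC": 3.5, "PQR": "A"},
]
uniques, duplicates = split_unique(rows, ["PQR"])
print(uniques)
print(duplicates)
```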

Job Design :

tFixedFlowInput_1:

tJavaRow

tFixedFlowInput_2

tUniqRow_1:

Output:
Starting job tuniqRow at 18:21 22/05/2017.

[statistics] connecting to socket on port 3921


[statistics] connected
.-------------------+---.
|        Unique         |
|=------------------+--=|
|ABC                |PQR|
|=------------------+--=|
|-6.333333333333332 |A  |
|18.499999999999996 |B  |
|21.055555555555557 |C  |
|-1.2222222222222219|X  |
|32.666666666666664 |Q  |
'-------------------+---'

.-------------------+---.
|       Duplicate       |
|=------------------+--=|
|ABC                |PQR|
|=------------------+--=|
|2.000000000000001  |A  |
|-1.8333333333333337|A  |
'-------------------+---'

[statistics] disconnected
Job tuniqRow ended at 18:21 22/05/2017. [exit code=0]

3.13.5. tSchemaComplianceCheck

Purpose:
Validates all input rows against a reference schema, or checks the type, nullability and
length of rows against reference values. The validation can be carried out in full or
in part.
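
The three checks named above (type, nullability, length) can be sketched in Python as shown below. The schema, the column names and the error messages are hypothetical and chosen only for this example; values are treated as strings, as they would be when read from a delimited file.

```python
from datetime import datetime

# Hypothetical reference schema: column -> (type, nullable, max_length)
SCHEMA = {
    "id": (int, False, None),
    "label": (str, False, 30),
    "created": ("date", False, None),
    "comment": (str, True, 40),
}

def check_row(row, schema=SCHEMA, date_format="%Y-%m-%d"):
    """Return a list of compliance errors for one row (empty list = compliant)."""
    errors = []
    for column, (kind, nullable, max_length) in schema.items():
        value = row.get(column)
        if value is None or value == "":
            if not nullable:
                errors.append(f"{column}: null not allowed")
            continue
        try:
            if kind is int:
                int(value)
            elif kind == "date":
                datetime.strptime(value, date_format)
        except ValueError:
            errors.append(f"{column}: wrong type")
            continue
        if max_length is not None and len(value) > max_length:
            errors.append(f"{column}: exceeds length {max_length}")
    return errors

# A date with month 13 is rejected by the type check.
print(check_row({"id": "5", "label": "label 5", "created": "2007-13-12", "comment": ""}))
```

Rows with a non-empty error list would go to the reject flow; the others continue on the main flow.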

Job design:

Dataset(Input):

tSchemaComplianceCheck

Output(Rejected):

tFileOutputDelimited(CSV):
1; label 1 with max length 30;another label with max length 40;2007-10-01;not null;nullable
4; label 4 with max length 30;another label with max length 40;2007-12-13;not null;
5; label 5 with max length 30;another label with max length 40;2007-13-12;not null;
9; label 9 with max length 30;another label with max length 40;2007-10-01;;

3.13.6. tFuzzyMatch component

Purpose:
Compares a column from the main flow with a reference column from the lookup
flow and outputs the main flow data, displaying the distance.

Helps ensure the data quality of any source data against a reference data source.

Job design:

tFuzzyMatch component

Select the relevant matching algorithm among:

Levenshtein: Based on the edit distance theory. It calculates the number of
insertions, deletions or substitutions required for an entry to match the reference entry.
Metaphone: Based on a phonetic algorithm for indexing entries by their
pronunciation. It first loads the phonetics of all entries of the lookup reference and
checks all entries of the main flow against the entries of the reference flow.
Double Metaphone: A newer version of the Metaphone phonetic algorithm that
produces more accurate results than the original algorithm. It can return both a
primary and a secondary code for a string. This accounts for some ambiguous
cases as well as for multiple variants of surnames with common ancestry.

Min Distance (Levenshtein only): Set the minimum number of changes allowed to
match the reference. If set to 0, only perfect matches are returned.

Max Distance (Levenshtein only): Set the maximum number of changes allowed to
match the reference.
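
To make the Levenshtein option and the two distance bounds concrete, here is a small Python sketch. It implements the standard edit distance and then keeps the closest reference entry whose distance lies between the minimum and maximum. The function names and the sample city names are assumptions for illustration; this is not Talend's implementation.

```python
def levenshtein(a, b):
    """Number of insertions, deletions or substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def fuzzy_match(value, references, min_distance=0, max_distance=2):
    """Return (reference, distance) for the closest reference within the bounds, or None."""
    best = None
    for ref in references:
        d = levenshtein(value, ref)
        if min_distance <= d <= max_distance and (best is None or d < best[1]):
            best = (ref, d)
    return best

print(fuzzy_match("Topeke", ["Topeka", "Denver"]))
```

With both bounds set to 0, only exact matches are returned by this sketch.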

Output:

3.14. File Components


ID  Name of Components    Description
01  tApacheLogInput       Reads the access-log file for an Apache HTTP server.
02  tCreateTemporaryFile  Creates and manages temporary files.
03  tFileCompare          Compares two files and provides comparison data (based on a
                          read-only schema).
04  tFileCopy             Copies a source file into a target directory and can remove the
                          source file if so defined.
05  tFileInputMail        Reads the header and content parts of a defined email file.
                          Helps to extract standard key data from emails.
06  tFileInputProperties  Reads a text file row by row and separates the fields
                          according to the model key = value.
07  tRSSInput             Reads RSS feeds using URLs. Makes it possible to keep track of
                          blog entries on websites to gather and organize information
                          quickly and easily.

3.14.1. tApacheLogInput component

Purpose:
Reads the access-log file for an Apache HTTP server.
tApacheLogInput helps to manage the Apache HTTP Server effectively, because doing so
requires feedback about the activity and performance of the server as well as any
problems that may be occurring.
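
For readers unfamiliar with the access-log layout, the following Python sketch parses one line in the Apache Common Log Format into named fields. The regular expression, the function name and the sample line are assumptions made for this example; a log written with a custom LogFormat would need a different pattern.

```python
import re

# Common Log Format: host ident user [time] "request" status size
CLF = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_access_log_line(line):
    """Return a dict of fields for one access-log line, or None if it does not match."""
    match = CLF.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A size of "-" means no bytes were returned.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record

line = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
print(parse_access_log_line(line))
```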

Job design:

tApacheLogInput:

Output:

3.14.2. tCreateTemporaryFile component

Function:
tCreateTemporaryFile creates and manages temporary files.

Job design:

tFileOutputDelimited:

Schema tRowGenerator:

tLogRow output:

3.14.3. tFileCompare:

Purpose:
Compares two files and provides comparison data (based on a read-only schema).

Job design :

tFileUnarchive component setting

tFileCompare component setting

Output:

3.14.4. tFileCopy component:

Purpose:
Copies a source file into a target directory and can remove the source file if so defined.

Job design:

tFileCopy component setting:

3.14.5. tFileInputMail component:

Purpose:
Reads the header and content parts of a defined email file. Helps to extract standard key
data from emails.

Job design:

tFileInputMail:

Output:

3.14.6. tFileInputProperties:

Purpose:

tFileInputProperties reads a text file row by row and extracts the fields.

tFileInputProperties opens a text file and reads it row by row then separates the fields
according to the model key = value.
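
The key = value model can be sketched in a few lines of Python. The function name read_properties is an assumption for this example, and the sample content reuses the keys and values visible in the job output of this section. Comment handling (lines starting with # or !) follows the usual .properties convention and is also an assumption here.

```python
def read_properties(text):
    """Parse key = value lines into a dict, skipping blanks and comment lines."""
    properties = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, separator, value = line.partition("=")
        if not separator:
            continue  # not a key = value line
        properties[key.strip()] = value.strip()
    return properties

sample = """
# database settings
user = root
url = jdbc:mysql://localhost:3307/taskmanager
password = root
driver = com.mysql.jdbc.Driver
"""
for key, value in read_properties(sample).items():
    print(key, value, sep="|")
```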

Job design:

tMap component setting:

Output:
Starting job tFileInputProperties at 15:34 01/06/2017.

[statistics] connecting to socket on port 3694


[statistics] connected
user|root|
url|jdbc:mysql://localhost:3307/taskmanager|jdbc:mysql://localhost:3307/taskmanager
password|root|
driver|com.mysql.jdbc.Driver|com.mysql.jdbc.Driver
[statistics] disconnected
Job tFileInputProperties ended at 15:34 01/06/2017. [exit code=0]

3.14.7. tRSSInput

Purpose:

tRSSInput reads RSS feeds using URLs.

tRSSInput makes it possible to keep track of blog entries on websites, so that you can
gather and organize information quickly and easily.

Job design:

Output:

3.15. Processing Components:


ID  Name of Components   Description
01  tAggregateRow        Receives an input flow and aggregates it based on one or more
                         columns.
02  tAggregateSortedRow  Receives a sorted flow and aggregates it based on one or more
                         columns. Each output line provides the aggregation key and the
                         relevant result of set operations (min, max, sum).
03  tFilterRow           Filters input rows by setting conditions on the selected columns.
04  tMap                 Allows joins, column or row filtering, transformation and
                         multiple outputs.
05  tSampleRow           Filters rows according to the line numbers.
06  tSortRow             Sorts input data based on one or several columns, by sort type
                         and order.
07  tXMLMap              Allows joins, column or row filtering, transformation and
                         multiple outputs.

3.15.1. tAggregateRow:

Purpose:
tAggregateRow receives an input flow and aggregates it based on one or more columns.
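
The grouping behaviour can be illustrated with a short Python sketch that groups rows on one or more columns and computes count, sum, min and max for a value column. The function name aggregate_rows and the sample order data are hypothetical and not taken from the job in this section.

```python
from collections import defaultdict

def aggregate_rows(rows, group_by, value_column):
    """Group rows by the given columns and aggregate one numeric column."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[c] for c in group_by)
        groups[key].append(row[value_column])
    result = []
    for key, values in groups.items():
        out = dict(zip(group_by, key))
        out.update(count=len(values), sum=sum(values), min=min(values), max=max(values))
        result.append(out)
    return result

# Hypothetical order amounts per shipper.
orders = [
    {"shipper": 1, "amount": 10},
    {"shipper": 2, "amount": 5},
    {"shipper": 1, "amount": 30},
]
print(aggregate_rows(orders, ["shipper"], "amount"))
```

Unlike the sorted variant described later, this approach holds all groups in memory, so the input does not need to be sorted.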

Job Design :

tAggregateRow component

tMap component

Starting job tAggregateRow at 15:09 19/05/2017.

[statistics] connecting to socket on port 3531


[statistics] connected
.--------+----------+-------------------.
|               tLogRow_1               |
|=-------+----------+------------------=|
|Order_ID|Shipper_ID|Shipper_Name       |
|=-------+----------+------------------=|
|6       |1         |Shiny Shipping     |
|4       |2         |Rose Marry Ship Pvt|
|2       |3         |Nick Ltd           |
|3       |4         |Michle Ltd         |
'--------+----------+-------------------'

[statistics] disconnected
Job tAggregateRow ended at 15:09 19/05/2017. [exit code=0]

3.15.2. tFilterRow component :

Purpose:
tFilterRow component is used to filter input rows by setting conditions on the selected columns.
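
A minimal Python sketch of condition-based filtering: each condition names a column, an operator and a value, and the conditions are combined with a logical AND or OR. The function name filter_rows, the condition format and the sample data are assumptions for illustration only.

```python
import operator

OPERATORS = {
    "==": operator.eq, "!=": operator.ne,
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
}

def filter_rows(rows, conditions, combine="and"):
    """Split rows into (accepted, rejected) using (column, operator, value) conditions."""
    check = all if combine == "and" else any
    accepted, rejected = [], []
    for row in rows:
        ok = check(OPERATORS[op](row[col], value) for col, op, value in conditions)
        (accepted if ok else rejected).append(row)
    return accepted, rejected

# Hypothetical sales rows.
rows = [
    {"Country": "USA", "Sales": 10},
    {"Country": "UK", "Sales": 50},
    {"Country": "USA", "Sales": 70},
]
accepted, rejected = filter_rows(rows, [("Country", "==", "USA"), ("Sales", ">", 20)])
print(accepted)
print(rejected)
```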

Job design :

tFilterRow Component Setting :

Output :

3.15.3. tSortRow:

Purpose:
tSortRow component sorts input data based on one or several columns, by sort type and order.

Job Design :

tSortRow Component :

Output :

3.15.4. tAggregateSortedRow:

Purpose:

tAggregateSortedRow receives a sorted flow and aggregates it based on one or more columns. Each
output line provides the aggregation key and the relevant result of set operations (min,
max, sum).
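
Because the input is already sorted, rows with the same key arrive next to each other and each group can be closed as soon as the key changes. The Python sketch below shows this streaming approach with itertools.groupby; the function name aggregate_sorted and the sample data are assumptions for this example.

```python
from itertools import groupby
from operator import itemgetter

def aggregate_sorted(rows, key, value):
    """Yield one aggregate per run of consecutive rows sharing the same key."""
    for group_key, group in groupby(rows, key=itemgetter(key)):
        values = [row[value] for row in group]
        yield {key: group_key, "min": min(values), "max": max(values), "sum": sum(values)}

# Hypothetical rows, already sorted by country.
rows = [
    {"country": "Japan", "sales": 5},
    {"country": "Japan", "sales": 7},
    {"country": "UK", "sales": 3},
]
print(list(aggregate_sorted(rows, "country", "sales")))
```

If the input is not sorted on the key, the same key produces several output lines, which is why the sorted precondition matters.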

Job Design :

tAggregateSortedRow :

Output :
Starting job tAggregateSorted at 16:00 19/05/2017.

[statistics] connecting to socket on port 4022


[statistics] connected
.----------+-------.
|    tLogRow_1     |
|=---------+------=|
|City      |Country|
|=---------+------=|
|London    |UK     |
|California|USA    |
|Texas     |USA    |
|Tokyo     |Japan  |
|Tokyo     |Japan  |
|Madrid    |Spain  |
|Saitama   |Japan  |
|Texas     |USA    |
|Birmingham|UK     |
'----------+-------'

[statistics] disconnected
Job tAggregateSorted ended at 16:00 19/05/2017. [exit code=0]

3.15.5. tSampleRow:
Purpose:

tSampleRow filters rows according to the line numbers.
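
Selecting rows by line number can be sketched in Python as follows. The range syntax used here ("1,3,5..7", meaning lines 1, 3 and 5 through 7) is an assumption made for this example, as are the function names.

```python
def parse_range(spec):
    """Turn a spec such as "1,3,5..7" into the set of wanted line numbers."""
    wanted = set()
    for part in spec.split(","):
        part = part.strip()
        if ".." in part:
            start, end = part.split("..")
            wanted.update(range(int(start), int(end) + 1))
        elif part:
            wanted.add(int(part))
    return wanted

def sample_rows(rows, spec):
    """Keep only the rows whose 1-based line number is listed in the spec."""
    wanted = parse_range(spec)
    return [row for number, row in enumerate(rows, start=1) if number in wanted]

print(sample_rows(list("abcdefgh"), "1,3,5..7"))
```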

Job Design :

tSampleRow Component :

Output :

3.15.6. tXMLMap

Purpose:
tXMLMap allows joins, column or row filtering, transformation and multiple outputs.

Job Design :

tXMLMap :

tFileOutputDelimited :

Output :

3.16. Internet Component:


ID  Name of Components  Description
01  tHttpRequest        The tHttpRequest component is part of the Internet family of
                        components, and makes both POST and GET requests to the
02  tREST               Serves as a REST Web service client that sends HTTP requests to
                        a REST Web service provider and gets the responses.
04  tExtractJSONFields  Extracts the data from JSON fields stored in a file, a database
                        table, etc., based on the XPath query.
05  tMsgBox             Displays a message box.
06  tUnite              Merges data from various sources, based on a common schema.
07  tReplicate          Duplicates the incoming schema into two identical output flows.

3.16.1. tHttpRequest component

Purpose:

The tHttpRequest component is part of the Internet family of components, and makes both POST
and GET requests to the

Job design:

tHttpRequest component setting:

Output:

3.16.2. tREST:

Purpose:

The tREST component serves as a REST Web service client that sends HTTP requests to a REST
Web service provider and gets the responses.

Job design:

tRest Component :

3.16.3. tExtractJSONFields

Purpose:
tExtractJSONFields extracts the data from JSON fields stored in a file, a database table, etc.,
based on the XPath query.

Output:

3.16.4. tUnite component

Purpose:
Merges data from various sources, based on a common schema.

Job design

Schema

Output

3.16.5. tReplicate component

Purpose:
Duplicates the incoming schema into two identical output flows.

Job design

Schema of tReplicate:

tFilterRow 1:

tFilterRow2:

Output :

4. Data Profiling
Data profiling is the process of examining the data available in different data
sources and collecting statistics and information about this data. Data profiling
helps to assess the quality level of the data against defined goals.
If data is of poor quality, or managed in structures that cannot be integrated to
meet the needs of the enterprise, business processes and decision-making suffer.
Compared to manual analysis techniques, data profiling technology improves the
enterprise's ability to meet the challenge of managing data quality and to address
the data quality challenges faced during data migrations and data integrations.
4.1. Create a connection
In the DQ Repository tree view, expand Metadata, right-click DB Connections and
select Create DB Connection.

4.2. Connect to a delimited file:


Before being able to profile data in a delimited file, you must first set up the
connection to this file.
To create a connection to a delimited file, do the following:
1. Expand the Metadata folder.
2. Right-click FileDelimited connections and then select Create File Delimited
   Connection to open the [New Delimited File] wizard.
3. Follow the steps defined in the wizard to create a connection to a delimited file.

4.3. Open or edit a database connection


You can edit the connection to a specific database and change the connection
metadata and the connection information.

1. In the DQ Repository tree view, expand Metadata > DB Connection.
2. Either:
   double-click the database connection you want to open, or
   right-click the database connection and select Open in the contextual menu.
3. Modify the connection metadata in the Connection Metadata view, as required.
4. Click the Edit... button in the Connection information view to open the [Database
   Connection] wizard.
5. Go through the steps in the wizard and modify the database connection settings
   as required.
6. Click Finish to validate the modifications. A dialog box opens prompting you to
   reload the updated database connection.
7. Select the reload option if you want to reload the new database structure for the
   updated database connection.

4.4. Database Analysis:


A first step in evaluating data is to get a high-level overview of its structure and
content. Talend offers structural analysis jobs at the level of an entire database,
schema, catalog, and tables & views. Drilling down into the results shows row
counts, schema counts, table counts, rows per table, view counts, rows per view,
keys, and indexes.

Output:

4.5. Column Analysis:


For tables & views of interest, a column analysis can be executed to show counts
of rows, nulls, distinct values, unique values, duplicates, and blanks (by default).
Additionally, more complicated indicators can be added per column if needed. Here are the
results for the person.person.firstname column:
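
Separately from the Studio results, the default indicators can be approximated with a few lines of Python. The definitions used in this sketch are assumptions: "distinct" counts different non-null values, "unique" counts values that occur exactly once, "duplicate" counts values that occur more than once, and "blank" counts empty or whitespace-only strings. The function name and the sample values are hypothetical.

```python
from collections import Counter

def profile_column(values):
    """Compute simple profiling indicators for one column."""
    non_null = [v for v in values if v is not None]
    frequencies = Counter(non_null)
    return {
        "row_count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(frequencies),
        "unique_count": sum(1 for n in frequencies.values() if n == 1),
        "duplicate_count": sum(1 for n in frequencies.values() if n > 1),
        "blank_count": sum(1 for v in non_null if isinstance(v, str) and v.strip() == ""),
    }

# Hypothetical first names, including one null and one blank value.
print(profile_column(["Ken", "Rob", "Ken", None, "", "Gail"]))
```

Under these definitions, the unique count plus the duplicate count equals the distinct count.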

4.5.1. Add patterns to the analyzed columns:

You can add patterns to one or more of the analyzed columns to validate the full
record (all columns) against all the patterns, rather than validating each column
against a specific pattern as is the case with the column analysis.
The results chart is a single bar chart for all of the patterns used. This chart
shows the number of rows that match all the patterns.
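
The "match all patterns" rule can be sketched in Python: a row counts as matching only if every column that has a pattern matches it in full. The function name, the regular expressions and the sample rows are assumptions for this example and are not Talend's built-in patterns.

```python
import re

def count_rows_matching_all(rows, column_patterns):
    """Return (matching, not_matching) row counts for the given column patterns."""
    compiled = {col: re.compile(pattern) for col, pattern in column_patterns.items()}
    matching = 0
    for row in rows:
        if all(rx.fullmatch(str(row.get(col, ""))) for col, rx in compiled.items()):
            matching += 1
    return matching, len(rows) - matching

# Hypothetical patterns: a simple email shape and a six-digit postcode.
patterns = {"email": r"[^@\s]+@[^@\s]+\.\w+", "postcode": r"\d{6}"}
rows = [
    {"email": "a@b.com", "postcode": "413411"},
    {"email": "bad", "postcode": "413411"},
    {"email": "c@d.org", "postcode": "4x"},
]
print(count_rows_matching_all(rows, patterns))
```

The two numbers returned correspond to the two parts of a single results bar: rows that satisfy every pattern and rows that fail at least one.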

4.6. Duplication Analysis


Having analyzed the completeness, timeliness, validity, and accuracy of our data, we
can now perform a duplication analysis. Using Talend's Match Analysis job on the
person table, here is the job configuration for duplicates on first and last name.
Two separate match algorithms are available, with configurable confidence and
match thresholds.
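
The role of a match threshold can be illustrated with the Python sketch below. It is a stand-in, not one of Talend's match algorithms: it scores every pair of records with the similarity ratio from Python's difflib on first and last name, takes the lower of the two scores, and reports the pairs at or above the threshold. The function name, the field names, the threshold value and the sample people are all assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations

def find_duplicates(people, threshold=0.85):
    """Return (index_a, index_b, score) for record pairs that look like duplicates."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(people), 2):
        score = min(
            SequenceMatcher(None, a["first"].lower(), b["first"].lower()).ratio(),
            SequenceMatcher(None, a["last"].lower(), b["last"].lower()).ratio(),
        )
        if score >= threshold:
            pairs.append((i, j, round(score, 2)))
    return pairs

# Hypothetical person records.
people = [
    {"first": "John", "last": "Smith"},
    {"first": "john", "last": "smith"},
    {"first": "Mary", "last": "Jones"},
]
print(find_duplicates(people))
```

Comparing every pair is quadratic in the number of records, so this only suits small samples.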

5. MDM-Master Data Management


Talend Open Studio for MDM provides unified development and management tools to
integrate and process all of your data with an easy-to-use visual designer.
Talend Studio provides the key capabilities of Talend MDM for data governance and
data stewardship, which enable users to build data models employing the necessary
business and data rules to create a single copy of the master data to be propagated
back to the source and target systems.

5.1. Functional architecture of Talend MDM:

Talend MDM architecture can be broken down into functional blocks that enable
interaction between users and the MDM Hub and that correspond to their IT needs. Here
are the main building blocks of Talend MDM:
The Clients block includes one or more Talend Studio instances and Web browsers that
could be on the same or on different machines.
From the Studio, you can set up and operate a centralized repository. You can
build data models that employ the necessary business and data rules to create a single
copy of the master data. This master data will be propagated back to target and source
systems.
From the Web browser, you can search, display or edit master data with tasks
defined by the Studio.
The Server block includes an MDM server, where the master data are governed
and monitored.
The Database block includes the MDM database, where the master data and the
system data are stored.

5.2. Modeling:

Before we get started, let's first get a common understanding of the most important
MDM terms:

Term                Description
(business) element  Also referred to as business attribute. The actual name of the data
                    point.
(business) entity   Describes the actual data (the elements), its nature, its structure and
                    its relationships. An entity can have one or more business elements.
                    The Talend MDM jargon for this concept is data model entity.
data model type     An element or collection of elements which is globally defined
                    and can be used across various entities. This makes maintenance of
                    common elements easier.
data model          Defines the attributes (elements), user access rights and relationships
                    of entities mastered by the MDM Hub. The data model is the central
                    component of Talend MDM. A data model maps to one or more
                    (business) entities that can be explicitly defined. Any concept can be
                    defined by a data model. A data model can have multiple entities.
(business) domain   A collection of data models that define a particular concept. For
                    instance, the customer domain may be defined by the organization,
                    account, contact and opportunity data models. A product domain may
                    be defined by a product, product family and price list.
                    Ultimately, the domain is the collection of all data models that relate to
                    a concept. Talend MDM can model any and many domains within a
                    single hub. It is a generic multi-domain MDM solution.
data container      Holds data of one or several business entities. Data containers are
                    typically used to separate master data domains.

5.2. Creating a data model:


The first step at the beginning of any MDM project involves setting up a data model
and creating business entities in this data model. In this example, a Movie data model is
created.

In the Studio workspace, an editor opens where you can define the details of your new
data model. The new data model and data container are listed in the MDM Repository
tree view.

5.2.1. Creating business entities in the data model:


The following procedure shows how to populate the Movie data model created in
Creating a data model with some business entities.
1. In the editor, right-click anywhere in the Data Model Entities panel, and then click
New Entity.
2. In the [New Entity] dialog box that opens, enter a name for your new entity in the
Name field, Movie in this example.
3. Select the Complex Type option.
You use the Simple Type option if you want to define a single element type such as a
phone number or an email address, and the Complex Type option if you want to define
a more complete structure, such as an Address or, in this case, the different attributes
that describe a movie.

Let's have a look at the available types:

Simple type: Used for single, self-contained elements like email addresses.
Complex type: Used for structures like an address, which consists of multiple
elements. A complex type can also inherit elements from another complex type.

5.3. Add Server Location :


1. Right-click Server Explorer and select Add Server Location.
2. Type the name of the server.
3. Type the server location.
4. Type the username and password.
5. Check the connection.

5.4. Data Container:


All the master data is stored in a data container. A data container can hold the data of
various business entities. Note that a business entity stored in one data container is not
visible from another data container.
To create a data container, simply right-click Data Container in the repository
tree and choose New.

5.5. Create a view:


A view is basically what an end user can see via the web interface, which includes the
form and search functionality. There are various views that you can create. We will
only have a look at the simplest view here, which allows end users to
create the business entity online and search for values within certain attributes.

5.6. Deploy a model:


Once you have finalized your data model, you can deploy it to the MDM server. Right-click
the data model in the repository and choose Deploy To...

Select the server location definition.

5.7. Web GUI:
On successful installation, http://localhost:8080/talendmdm will show:
The open source version comes with only two user accounts and is restricted to these
two:

standard user
user: user
password: user

admin
user: administrator
password: administrator

Web view:

Web entity
