ETL Data Sanity
This presentation is proprietary material of Infrasoft Technologies Ltd. Any unauthorized usage or distribution of this material without the prior consent of Infrasoft Technologies Ltd. is prohibited. The content here is highly confidential and is meant for internal circulation only. All viewers of this presentation voluntarily bind themselves to the clauses of the standard Confidentiality & Non-Disclosure Agreement.
Data Sanity
• Identify the entities from which the data tables are to be sourced
• Verify attribute adequacy for each entity; if required attributes are missing, provision such attributes with default values
• Verify the data types and make the data compatible
• Identify missing values and replace them with appropriate values based on the default rule
• Identify junk values and replace them with appropriate values based on the default rule
• Verify data formats (Date, Decimal, etc.) and rectify the data based on the default rule
• Verify data lengths and make them compatible
• Verify lookup codes and make them consistent across all entities (a pandas sketch of these checks follows this list)
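A minimal pandas sketch of these sanity rules, assuming hypothetical banking columns (amount, txn_date, currency, account_id, status, branch_code) and illustrative default rules; the deck does not give the actual schema or rules:

import pandas as pd

def sanitize(df: pd.DataFrame) -> pd.DataFrame:
    """Apply basic data-sanity rules to one sourced entity table."""
    # Provision a missing attribute with a default value (assumed column)
    if "branch_code" not in df.columns:
        df["branch_code"] = "UNKNOWN"
    # Make data types compatible: amounts numeric, dates datetime
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
    df["txn_date"] = pd.to_datetime(df["txn_date"], errors="coerce")
    # Replace missing values per the default rule
    df["amount"] = df["amount"].fillna(0.0)
    # Replace junk values per the default rule (illustrative: 3-letter currency codes)
    valid_ccy = df["currency"].str.match(r"^[A-Z]{3}$", na=False)
    df["currency"] = df["currency"].where(valid_ccy, "INR")
    # Enforce data-length compatibility
    df["account_id"] = df["account_id"].astype(str).str.slice(0, 20)
    # Normalize lookup codes so they are consistent across entities
    df["status"] = df["status"].str.upper().map(
        {"A": "ACTIVE", "ACTIVE": "ACTIVE", "C": "CLOSED", "CLOSED": "CLOSED"})
    return df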
ETL and Data Sanity Process
Business Validations
• Perform business validations on the data
• Verify the availability of mandatory fields
• Perform basic date validations
• Perform validations on numeric fields that store amounts (non-zero, non-negative)
• Perform referential integrity tests on the key data attributes
• If a business validation fails, discard the record and mark/store it as an exception (sketched below)
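A sketch of these validations in the same pandas style; the column names (customer_id, txn_date, amount) and the reject-reason convention are assumptions for illustration:

import pandas as pd

def business_validate(df: pd.DataFrame, customers: pd.DataFrame):
    """Split records into valid rows and exception rows tagged with a reason."""
    reasons = pd.Series("", index=df.index)
    # Mandatory fields availability
    missing = df[["customer_id", "txn_date", "amount"]].isna().any(axis=1)
    reasons[missing] += "missing-mandatory;"
    # Basic date validation: no future-dated transactions
    reasons[df["txn_date"] > pd.Timestamp.now()] += "future-date;"
    # Amount fields must be non-zero and non-negative
    reasons[~(df["amount"] > 0)] += "non-positive-amount;"
    # Referential integrity on the key attribute
    reasons[~df["customer_id"].isin(customers["customer_id"])] += "unknown-customer;"
    # Failed records are discarded from the main flow and stored as exceptions
    failed = reasons != ""
    return df[~failed], df[failed].assign(reject_reason=reasons[failed])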
The graph represents a customer's ATM swipes. The x-axis shows the hours from 0 to 23 and the y-axis shows the number of times the customer swiped at each hour. As per this pattern, the customer is active during the afternoons and not at midnight or in the early mornings, so the system flags a swipe in those quiet hours as potential fraud.
Similarly, a customer may have a pattern of being active at midnight with no activity during the day; there the ML model raises an alert if a daytime swipe is detected.
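A toy threshold rule standing in for the learned pattern, built on a simple hourly histogram; the actual system described here would learn this profile statistically rather than from a fixed cutoff:

import numpy as np

def hourly_profile(swipe_hours):
    """Count swipes per hour (0-23) to form the customer's activity pattern."""
    return np.bincount(swipe_hours, minlength=24)

def is_suspicious(profile, hour, min_history=3):
    """Flag a swipe at an hour where this customer has (almost) no history."""
    return profile[hour] < min_history

profile = hourly_profile([13, 14, 14, 14, 15, 16, 13, 15])  # afternoon-active customer
print(is_suspicious(profile, 2))    # True: a 2 a.m. swipe is off-pattern
print(is_suspicious(profile, 14))   # False: 2 p.m. fits the pattern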
Analytical Graphs
Here, the user uploads the Excel sheet on which he wants to run Machine Learning. If the Excel sheet has an output column, the user needs to mention the column name; otherwise the PA tool will generate a label column via its anomaly detector.
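A small sketch of this upload step using pandas' read_excel; the helper name and its arguments are hypothetical, as the PA tool's actual interface is not shown in the deck:

import pandas as pd

def load_training_sheet(path, output_column=None):
    """Load the uploaded Excel sheet and separate the label column, if named."""
    df = pd.read_excel(path)
    if output_column is not None:
        y = df.pop(output_column)   # user named the output column
        return df, y
    return df, None                 # no label: fall back to anomaly detection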
Predictive Analysis Tool
Data Mapping
Here we map columns to their data types. We need to mark the columns that are IDs or dates, since these columns can't be taken as input to the ML model.
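A one-function sketch of this step; the id_columns and date_columns arguments stand in for whatever the user marks in the mapping screen:

import pandas as pd

def select_model_inputs(df: pd.DataFrame, id_columns, date_columns):
    """Drop ID and date columns, which cannot be fed to the ML model directly."""
    return df.drop(columns=list(id_columns) + list(date_columns))

# Hypothetical usage:
# features = select_model_inputs(df, id_columns=["customer_id"], date_columns=["txn_date"])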
Predictive Analysis Tool
Missing Values
For ML to work, we need to clean the data, e.g. by removing rows or applying default values wherever there are missing values.
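A minimal cleaning sketch with pandas, using an illustrative per-column defaults dictionary; rows that remain incomplete after the defaults are dropped:

import pandas as pd

def clean_missing(df: pd.DataFrame, defaults: dict) -> pd.DataFrame:
    """Apply per-column default values, then drop rows that are still incomplete."""
    return df.fillna(value=defaults).dropna()

# Hypothetical usage:
# cleaned = clean_missing(df, defaults={"amount": 0.0, "channel": "UNKNOWN"})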
Predictive Analysis Tool
Correlation Matrix
If two or more columns have a high correlation or inverse correlation, say 0.85 to 1.00 or -0.85 to -1.00, we can remove one column through logical reasoning about which column is important: feeding both as input to the ML will not help learning and may degrade the model.
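A sketch of the pruning rule; for simplicity it drops the second column of each highly correlated pair mechanically, whereas the deck says the analyst should reason about which column to keep:

import pandas as pd

def drop_highly_correlated(df: pd.DataFrame, threshold: float = 0.85) -> pd.DataFrame:
    """Drop one column from each pair whose absolute correlation >= threshold."""
    corr = df.corr().abs()          # df is assumed to hold only numeric features
    cols = corr.columns
    to_drop = set()
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] >= threshold and cols[i] not in to_drop:
                to_drop.add(cols[j])
    return df.drop(columns=list(to_drop))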
Predictive Analysis Tool
Anomaly Detection
Since we didn't mention any output column for the ML to predict during the Excel input, the tool performs anomaly detection to help the analyst cluster the huge data set into relevant blocks. We first find the optimum number of clusters.
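The deck does not name the criterion used to pick the optimum cluster count; the elbow method over k-means inertia, sketched here with scikit-learn, is one common choice:

from sklearn.cluster import KMeans

def elbow_inertias(X, k_max=10):
    """Fit k-means for k = 1..k_max and collect inertias for an elbow plot."""
    inertias = []
    for k in range(1, k_max + 1):
        model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
        inertias.append(model.inertia_)
    return inertias   # the 'elbow' in this curve suggests the optimum k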
Predictive Analysis Tool
Once the optimum number of clusters is selected, we decide which clusters can be grouped into a single cluster. For this we use plots of different features to determine which classes actually symbolize anomalies and which represent normal data.
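A matplotlib sketch of the per-feature plots used to judge the clusters, with an illustrative merge of the clusters deemed normal into a single class; the cluster-to-class mapping is an assumption, since in practice the analyst decides it from the plots:

import matplotlib.pyplot as plt

def plot_cluster_features(X, labels, fx=0, fy=1):
    """Scatter two features coloured by cluster label to eyeball anomalies."""
    plt.scatter(X[:, fx], X[:, fy], c=labels, cmap="tab10", s=10)
    plt.xlabel(f"feature {fx}")
    plt.ylabel(f"feature {fy}")
    plt.title("Cluster assignments per feature pair")
    plt.show()

# Illustrative grouping decided from the plots: clusters 0, 2, 3 look normal
# normal_clusters = {0, 2, 3}
# merged = [0 if lab in normal_clusters else 1 for lab in labels]  # 1 = anomaly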
Predictive Analysis Tool
Once the grouping of clusters is finalized, we fit the ML model to our data set. According to the percentage of the data split, the accuracy of the model is given as output.
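A fit-and-score sketch with scikit-learn; the deck does not name the model family, so a random forest stands in, and the split percentage is exposed as a parameter:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def fit_and_score(X, y, test_fraction=0.2):
    """Fit a classifier on the chosen split and report hold-out accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_fraction, random_state=42)
    model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
    return model, accuracy_score(y_test, model.predict(X_test))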
Predictive Analysis Tool
Once the model is finalized, the next step is to upload the test data set for predicting the output as per the saved ML model.
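A sketch of scoring against the saved model, assuming joblib persistence and the hypothetical filename pa_model.joblib; the tool's real storage format is not described in the deck:

import joblib
import pandas as pd

# At save time (assumed): joblib.dump(model, "pa_model.joblib")

def predict_from_saved(model_path, test_sheet):
    """Load a saved ML model and score an uploaded test data set."""
    model = joblib.load(model_path)
    X_test = pd.read_excel(test_sheet)
    return model.predict(X_test)

# Hypothetical usage:
# preds = predict_from_saved("pa_model.joblib", "test_data.xlsx")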