ETL Data Sanity
This presentation is proprietary material of Infrasoft Technologies Ltd. Any unauthorized usage or distribution of this material without the prior consent of Infrasoft Technologies Ltd. is prohibited. The content here is highly confidential and is meant for internal circulation only. All viewers of this presentation voluntarily bind themselves to the clauses of the standard Confidentiality & Non-Disclosure Agreement.
Data Sanity
• Identify the entities from which the data tables are to be sourced
• Verify attribute adequacy for each entity; if required attributes are missing, provision such attributes with default values
• Verify the data types and make the data compatible
• Identify missing values and replace them with appropriate values based on the default rule
• Identify junk values and replace them with appropriate values based on the default rule
• Verify data formats (Date, Decimal, etc.) and rectify the data based on the default rule
• Verify data lengths and make them compatible
• Verify lookup codes and make them consistent across all entities (a pandas sketch of these checks follows this list)
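A minimal pandas sketch of these sanity rules, assuming hypothetical banking columns (amount, txn_date, currency, account_id, status, branch_code) and illustrative default rules; the deck does not give the actual schema or rules:

import pandas as pd

def sanitize(df: pd.DataFrame) -> pd.DataFrame:
    """Apply basic data-sanity rules to one sourced entity table."""
    # Provision a missing attribute with a default value (assumed column)
    if "branch_code" not in df.columns:
        df["branch_code"] = "UNKNOWN"
    # Make data types compatible: amounts numeric, dates datetime
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
    df["txn_date"] = pd.to_datetime(df["txn_date"], errors="coerce")
    # Replace missing values per the default rule
    df["amount"] = df["amount"].fillna(0.0)
    # Replace junk values per the default rule (illustrative: 3-letter currency codes)
    valid_ccy = df["currency"].str.match(r"^[A-Z]{3}$", na=False)
    df["currency"] = df["currency"].where(valid_ccy, "INR")
    # Enforce data-length compatibility
    df["account_id"] = df["account_id"].astype(str).str.slice(0, 20)
    # Normalize lookup codes so they are consistent across entities
    df["status"] = df["status"].str.upper().map(
        {"A": "ACTIVE", "ACTIVE": "ACTIVE", "C": "CLOSED", "CLOSED": "CLOSED"})
    return df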
ETL and Data Sanity Process
Business Validations
• Perform business validations on the data
• Verify the availability of mandatory fields
• Perform basic date validations
• Perform validations on numeric fields that store amounts (non-zero, non-negative)
• Perform referential integrity tests on the key data attributes
• If a business validation fails, discard the record and mark/store it as an exception (sketched below)
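A sketch of these validations in the same pandas style; the column names (customer_id, txn_date, amount) and the reject-reason convention are assumptions for illustration:

import pandas as pd

def business_validate(df: pd.DataFrame, customers: pd.DataFrame):
    """Split records into valid rows and exception rows tagged with a reason."""
    reasons = pd.Series("", index=df.index)
    # Mandatory fields availability
    missing = df[["customer_id", "txn_date", "amount"]].isna().any(axis=1)
    reasons[missing] += "missing-mandatory;"
    # Basic date validation: no future-dated transactions
    reasons[df["txn_date"] > pd.Timestamp.now()] += "future-date;"
    # Amount fields must be non-zero and non-negative
    reasons[~(df["amount"] > 0)] += "non-positive-amount;"
    # Referential integrity on the key attribute
    reasons[~df["customer_id"].isin(customers["customer_id"])] += "unknown-customer;"
    # Failed records are discarded from the main flow and stored as exceptions
    failed = reasons != ""
    return df[~failed], df[failed].assign(reject_reason=reasons[failed])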
The graph represents a customer's ATM swipes. The x-axis shows the hours from 0 to 23 and the y-axis shows the number of times the customer swiped at each hour. As per this pattern, the customer is active during the afternoons and not at midnight or in the early mornings, so the system flags a swipe in those quiet hours as potential fraud.
Similarly, a customer may have a pattern of being active at midnight with no activity during the day; there the ML model raises an alert if a daytime swipe is detected.
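A toy threshold rule standing in for the learned pattern, built on a simple hourly histogram; the actual system described here would learn this profile statistically rather than from a fixed cutoff:

import numpy as np

def hourly_profile(swipe_hours):
    """Count swipes per hour (0-23) to form the customer's activity pattern."""
    return np.bincount(swipe_hours, minlength=24)

def is_suspicious(profile, hour, min_history=3):
    """Flag a swipe at an hour where this customer has (almost) no history."""
    return profile[hour] < min_history

profile = hourly_profile([13, 14, 14, 14, 15, 16, 13, 15])  # afternoon-active customer
print(is_suspicious(profile, 2))    # True: a 2 a.m. swipe is off-pattern
print(is_suspicious(profile, 14))   # False: 2 p.m. fits the pattern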
Analytical Graphs
Here, the user uploads the Excel sheet on which he wants to run Machine Learning. If the Excel sheet has an output column, the user needs to mention the column name; otherwise the PA tool will generate a label column via its anomaly detector.
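A small sketch of this upload step using pandas' read_excel; the helper name and its arguments are hypothetical, as the PA tool's actual interface is not shown in the deck:

import pandas as pd

def load_training_sheet(path, output_column=None):
    """Load the uploaded Excel sheet and separate the label column, if named."""
    df = pd.read_excel(path)
    if output_column is not None:
        y = df.pop(output_column)   # user named the output column
        return df, y
    return df, None                 # no label: fall back to anomaly detection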
Predictive Analysis Tool
Data Mapping
Here we map columns to their data types. We need to mark the columns that are IDs or dates, since these columns can't be taken as input to the ML model.
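A one-function sketch of this step; the id_columns and date_columns arguments stand in for whatever the user marks in the mapping screen:

import pandas as pd

def select_model_inputs(df: pd.DataFrame, id_columns, date_columns):
    """Drop ID and date columns, which cannot be fed to the ML model directly."""
    return df.drop(columns=list(id_columns) + list(date_columns))

# Hypothetical usage:
# features = select_model_inputs(df, id_columns=["customer_id"], date_columns=["txn_date"])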
Predictive Analysis Tool
Missing Values
For ML to work, we need to clean the data, e.g. by removing rows or applying default values wherever there are missing values.
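A minimal cleaning sketch with pandas, using an illustrative per-column defaults dictionary; rows that remain incomplete after the defaults are dropped:

import pandas as pd

def clean_missing(df: pd.DataFrame, defaults: dict) -> pd.DataFrame:
    """Apply per-column default values, then drop rows that are still incomplete."""
    return df.fillna(value=defaults).dropna()

# Hypothetical usage:
# cleaned = clean_missing(df, defaults={"amount": 0.0, "channel": "UNKNOWN"})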
Predictive Analysis Tool
Correlation Matrix
If two or more columns have a high correlation or inverse correlation, say 0.85 to 1.00 or -0.85 to -1.00, we can remove one column through logical reasoning about which column is important: feeding both as input to the ML will not help learning and may degrade the model.
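A sketch of the pruning rule; for simplicity it drops the second column of each highly correlated pair mechanically, whereas the deck says the analyst should reason about which column to keep:

import pandas as pd

def drop_highly_correlated(df: pd.DataFrame, threshold: float = 0.85) -> pd.DataFrame:
    """Drop one column from each pair whose absolute correlation >= threshold."""
    corr = df.corr().abs()          # df is assumed to hold only numeric features
    cols = corr.columns
    to_drop = set()
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] >= threshold and cols[i] not in to_drop:
                to_drop.add(cols[j])
    return df.drop(columns=list(to_drop))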
Predictive Analysis Tool
Anomaly Detection
Since we didn't mention any output column for the ML to predict during the Excel input, the tool performs anomaly detection to help the analyst cluster the huge data set into relevant blocks. We first find the optimum number of clusters.
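The deck does not name the criterion used to pick the optimum cluster count; the elbow method over k-means inertia, sketched here with scikit-learn, is one common choice:

from sklearn.cluster import KMeans

def elbow_inertias(X, k_max=10):
    """Fit k-means for k = 1..k_max and collect inertias for an elbow plot."""
    inertias = []
    for k in range(1, k_max + 1):
        model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
        inertias.append(model.inertia_)
    return inertias   # the 'elbow' in this curve suggests the optimum k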
Predictive Analysis Tool
Once the optimum number of clusters is selected, we decide which clusters can be grouped into a single cluster. For this we use plots of different features to determine which classes actually symbolize anomalies and which represent normal data.
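A matplotlib sketch of the per-feature plots used to judge the clusters, with an illustrative merge of the clusters deemed normal into a single class; the cluster-to-class mapping is an assumption, since in practice the analyst decides it from the plots:

import matplotlib.pyplot as plt

def plot_cluster_features(X, labels, fx=0, fy=1):
    """Scatter two features coloured by cluster label to eyeball anomalies."""
    plt.scatter(X[:, fx], X[:, fy], c=labels, cmap="tab10", s=10)
    plt.xlabel(f"feature {fx}")
    plt.ylabel(f"feature {fy}")
    plt.title("Cluster assignments per feature pair")
    plt.show()

# Illustrative grouping decided from the plots: clusters 0, 2, 3 look normal
# normal_clusters = {0, 2, 3}
# merged = [0 if lab in normal_clusters else 1 for lab in labels]  # 1 = anomaly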
Predictive Analysis Tool
Once the grouping of clusters is finalized, we fit the ML model to our data set. According to the percentage of the data split, the accuracy of the model is given as output.
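A fit-and-score sketch with scikit-learn; the deck does not name the model family, so a random forest stands in, and the split percentage is exposed as a parameter:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def fit_and_score(X, y, test_fraction=0.2):
    """Fit a classifier on the chosen split and report hold-out accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_fraction, random_state=42)
    model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
    return model, accuracy_score(y_test, model.predict(X_test))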
Predictive Analysis Tool
Once the model is finalized, the next step is to upload the test data set for predicting the output as per the saved ML model.
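A sketch of scoring against the saved model, assuming joblib persistence and the hypothetical filename pa_model.joblib; the tool's real storage format is not described in the deck:

import joblib
import pandas as pd

# At save time (assumed): joblib.dump(model, "pa_model.joblib")

def predict_from_saved(model_path, test_sheet):
    """Load a saved ML model and score an uploaded test data set."""
    model = joblib.load(model_path)
    X_test = pd.read_excel(test_sheet)
    return model.predict(X_test)

# Hypothetical usage:
# preds = predict_from_saved("pa_model.joblib", "test_data.xlsx")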