Mastering Data Cleaning Techniques with SQL — Explained Examples _ by ? panData _ Level Up Coding
🐼 panData
Published in Level Up Coding · 11 min read · Mar 20
https://levelup.gitconnected.com/mastering-data-cleaning-techniques-with-sql-explained-examples-80980fef2d3a 1/31
9/7/23, 8:30 PM Mastering Data Cleaning Techniques with SQL — Explained Examples | by 🐼 panData | Level Up Coding
Here are some essential SQL functions that can help in the data cleaning
process:
1. TRIM
This function removes leading and trailing spaces from a string.
Using the TRIM function is helpful when you want to clean up text data in
your database by removing unnecessary spaces, which can cause issues
when comparing, searching, or analyzing data. It ensures that your text data
is consistent and free of formatting errors caused by extra spaces.
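As a minimal sketch of TRIM in use, the query below cleans a name column. The sample rows and the inline employees data are hypothetical and only stand in for a real table; the syntax should work in MySQL 8 and PostgreSQL style dialects.

```sql
-- Hypothetical sample rows stand in for a real employees table.
WITH employees (employee_id, employee_name) AS (
    SELECT 1, '  Alice Smith' UNION ALL
    SELECT 2, 'Bob Jones   '
)
SELECT employee_id,
       TRIM(employee_name) AS employee_name  -- 'Alice Smith', 'Bob Jones'
FROM employees;
```

Note that this only cleans the value in the query result. The stored data stays as it is unless you run an UPDATE.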
2. UPPER and LOWER
These functions convert a string to uppercase and lowercase, respectively.
For a query that applies both functions to the input string 'Hello World',
the result is a single row with two columns: the first displays the uppercase
version, 'HELLO WORLD', and the second displays the lowercase version,
'hello world'.
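The uppercase and lowercase result described above can be reproduced with a query along these lines. The column aliases are assumptions chosen for illustration.

```sql
-- Aliases upper_case and lower_case are hypothetical names.
SELECT UPPER('Hello World') AS upper_case,  -- 'HELLO WORLD'
       LOWER('Hello World') AS lower_case;  -- 'hello world'
```

Converting text to a single case before comparing values is a common way to make matches case-insensitive, for example when the same email address was entered with different capitalization.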
3. REPLACE
This function replaces all occurrences of a specified substring with another
substring.
FROM employees;
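A complete query ending in that FROM clause might look like the sketch below. The email column, both domain names, and the sample rows are assumptions made for illustration, not values taken from the article.

```sql
-- Hypothetical sample rows stand in for a real employees table.
WITH employees (employee_id, email) AS (
    SELECT 1, 'alice@oldcompany.com' UNION ALL
    SELECT 2, 'bob@oldcompany.com'
)
SELECT employee_id,
       -- Swap the assumed old domain for the assumed new one.
       REPLACE(email, '@oldcompany.com', '@newcompany.com') AS updated_email
FROM employees;
```

A SELECT like this only previews the change. To persist it, the same REPLACE expression would go into an UPDATE statement.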
This query is useful in cases where you need to update the email addresses of
employees, for example, when a company changes its domain name or
merges with another company, and employee email addresses
need to be updated accordingly.
4. NULLIF
This function returns NULL if two expressions are equal; otherwise, it
returns the first expression.
The result of this query will be a table containing two columns: employee_id
and a salary column in which zero values appear as NULL.
This query is useful in cases where you want to treat zero salaries as
missing data and represent them with NULL values, which can be helpful
for certain calculations or analyses where zero values might be misleading or
inappropriate.
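A query matching this description could be sketched as follows. The salary_or_null alias and the sample rows are hypothetical.

```sql
-- Hypothetical sample rows stand in for a real employees table.
WITH employees (employee_id, salary) AS (
    SELECT 1, 52000 UNION ALL
    SELECT 2, 0 UNION ALL
    SELECT 3, 61000
)
SELECT employee_id,
       NULLIF(salary, 0) AS salary_or_null  -- employee 2 gets NULL
FROM employees;
```

Because aggregate functions such as AVG skip NULL values, turning zeros into NULL keeps placeholder salaries from dragging an average down.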
5. COALESCE
This function returns the first non-NULL expression from a list of
expressions.
This query is useful in cases where you want to handle missing salary data by
providing a default salary value, ensuring that your calculations or analyses
are not affected by NULL values in the salary column.
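As a minimal sketch, the query below fills missing salaries with a default. The default of 0, the salary_filled alias, and the sample rows are assumptions. Pick a default that makes sense for your data.

```sql
-- Hypothetical sample rows stand in for a real employees table.
WITH employees (employee_id, salary) AS (
    SELECT 1, 52000 UNION ALL
    SELECT 2, NULL UNION ALL
    SELECT 3, 61000
)
SELECT employee_id,
       COALESCE(salary, 0) AS salary_filled  -- employee 2 gets 0
FROM employees;
```

Keep in mind that a filled-in default takes part in aggregates such as AVG, while a NULL would be skipped, so the choice of default changes the result of those calculations.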
FROM employees;
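A complete CONCAT query ending in that FROM clause might look like the sketch below. The first_name and last_name column names and the sample rows are assumptions. Only the full_name alias comes from the text.

```sql
-- Hypothetical sample rows stand in for a real employees table.
WITH employees (first_name, last_name) AS (
    SELECT 'Alice', 'Smith' UNION ALL
    SELECT 'Bob', 'Jones'
)
SELECT CONCAT(first_name, ' ', last_name) AS full_name  -- 'Alice Smith', 'Bob Jones'
FROM employees;
```

NULL handling differs between databases: in MySQL, CONCAT returns NULL if any argument is NULL, while PostgreSQL's CONCAT ignores NULL arguments.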
The result of this query will be a table containing a single column: full_name .
The full_name column will display the combined first and last names of the
employees, separated by a space.
Using the CONCAT function is helpful when you want to join separate pieces of
text data together into a single string. In this example, it allows you to create
a full name from separate first and last name columns, making it easier to
display, search, or analyze the employee names.
The result of this query will be a table containing a single column: short_id .
The short_id column will display the extracted substring from the
employee_id column, containing the first three characters of each employee
ID.
Using the SUBSTRING function is helpful when you want to extract specific
portions of text data in your database. In this example, it allows you to create
a shortened version of the employee ID, which could be useful for generating
summary reports, creating unique identifiers, or simplifying the display of
complex strings.
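The short_id extraction described above could be sketched like this. It assumes employee_id is stored as text, and the sample IDs are hypothetical.

```sql
-- Hypothetical sample rows; employee_id is assumed to be a string column.
WITH employees (employee_id) AS (
    SELECT 'EMP-0001' UNION ALL
    SELECT 'ENG-0042'
)
SELECT SUBSTRING(employee_id, 1, 3) AS short_id  -- 'EMP', 'ENG'
FROM employees;
```

The second argument is the 1-based start position and the third is the number of characters to return.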
-- List employees whose names are longer than 10 characters
SELECT employee_name
FROM employees
WHERE CHAR_LENGTH(employee_name) > 10;
Using the CHAR_LENGTH function is helpful when you want to filter or analyze
text data in your database based on its length. In this example, it allows you
to find employees with longer names, which could be useful for formatting
purposes, data analysis, or identifying potential data quality issues.
The result of this query will be a table containing two columns: employee_id
and the rounded salary.
Using the ROUND function is helpful when you want to simplify numeric data
for display, reporting, or analysis purposes. In this example, it allows you to
create a rounded version of the employee salaries, which could be useful for
generating summary reports, aggregating data, or reducing the complexity of
your data for easier analysis.
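A rounding query of this kind could be sketched as follows. The rounded_salary alias, the choice of zero decimal places, and the sample rows are assumptions.

```sql
-- Hypothetical sample rows stand in for a real employees table.
WITH employees (employee_id, salary) AS (
    SELECT 1, 52345.678 UNION ALL
    SELECT 2, 61000.25
)
SELECT employee_id,
       ROUND(salary, 0) AS rounded_salary  -- 52346 and 61000
FROM employees;
```

The second argument is the number of decimal places to keep, so ROUND(salary, 2) would keep cents instead of rounding to whole units.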
The CAST and CONVERT functions are used to change the data type of a value or
column.
The result of this query will be a table containing two columns: employee_id
and hire_date_string . The hire_date_string column will display the hire date
values for each employee as strings.
Using the CAST function is helpful when you need to convert data types for
display, reporting, or data manipulation purposes. In this example, it allows
you to create a string version of the employee hire dates, which could be
useful for text-based reports, string manipulation tasks, or data export to
systems that require a specific data type.
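As a sketch of the CAST example described above, the query below converts a date column to a string. The hire_date column name and the sample rows are assumptions. CHAR(10) is used as the target type because it fits an ISO date (YYYY-MM-DD) and is accepted by CAST in MySQL and PostgreSQL.

```sql
-- Hypothetical sample rows stand in for a real employees table.
WITH employees (employee_id, hire_date) AS (
    SELECT 1, CAST('2021-03-15' AS DATE) UNION ALL
    SELECT 2, CAST('2019-11-01' AS DATE)
)
SELECT employee_id,
       CAST(hire_date AS CHAR(10)) AS hire_date_string
FROM employees;
```

CAST is the portable option. CONVERT exists too, but its argument order differs between databases: MySQL uses CONVERT(expression, type), while SQL Server uses CONVERT(type, expression).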
The result of this query will be a table containing all the columns and only the
rows where the specified column has a NULL value.
Using the IS NULL clause is helpful when you need to identify missing or
incomplete data in your table. In this example, it allows you to retrieve all
rows with a NULL value in a specific column, which could be useful for data
cleaning, data validation, or further analysis.
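A minimal sketch of such a filter is shown below. The choice of salary as the column to check and the sample rows are assumptions.

```sql
-- Hypothetical sample rows stand in for a real employees table.
WITH employees (employee_id, employee_name, salary) AS (
    SELECT 1, 'Alice Smith', 52000 UNION ALL
    SELECT 2, 'Bob Jones', NULL
)
SELECT *
FROM employees
WHERE salary IS NULL;  -- returns only the row for employee 2
```

The IS NULL form matters here: a comparison such as salary = NULL never evaluates to true, so it would return no rows.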
For example, let’s create a table named employees with three columns:
employee_id , employee_name , and employee_status . We want the
employee_status column to have a default value of 'Active':
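A table definition along these lines could look like the sketch below. The column types and the inserted sample row are assumptions. Only the table name, column names, and the 'Active' default come from the text.

```sql
-- Column types are assumed for illustration.
CREATE TABLE employees (
    employee_id     INT PRIMARY KEY,
    employee_name   VARCHAR(100),
    employee_status VARCHAR(20) DEFAULT 'Active'
);

-- No employee_status is supplied, so the default applies.
INSERT INTO employees (employee_id, employee_name)
VALUES (1, 'Alice Smith');

SELECT employee_id, employee_status
FROM employees;  -- employee 1 has status 'Active'
```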
In this example, if you insert a new row into the employees table without
specifying a value for the employee_status column, the default value 'Active'
will be automatically assigned to the employee_status column.
Using the DEFAULT keyword when creating a table is useful when you want to
assign a common or standard value to a column for new records, reducing the
need to explicitly provide a value for every insertion. This can help streamline
data entry and ensure data consistency across the table.
1. SELECT DISTINCT
Duplicates can occur when data is collected from multiple sources or due to
data entry errors. To exclude duplicate rows from a query result, use the
DISTINCT keyword.
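As a minimal sketch, the query below returns each employee_id and department_id pair once. The sample rows are hypothetical and include one deliberate duplicate.

```sql
-- Hypothetical sample rows stand in for a real employees table.
WITH employees (employee_id, department_id) AS (
    SELECT 1, 10 UNION ALL
    SELECT 1, 10 UNION ALL  -- exact duplicate of the row above
    SELECT 2, 20
)
SELECT DISTINCT employee_id, department_id
FROM employees;  -- two rows: (1, 10) and (2, 20)
```

DISTINCT applies to the whole selected row, so two rows count as duplicates only when every selected column matches.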
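1. SELECT DISTINCT employee_id, department_id : This returns only the unique
combinations of the employee_id and department_id columns.
2. FROM employees : This specifies the source table, which is the employees
table.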
The result of this query will be a table containing unique employee_id and
department_id pairs, with no duplicate rows.
The SELECT DISTINCT statement is helpful when you need to retrieve a list of
unique records from a table, especially when dealing with large datasets
where duplicate records might be present. In this example, it allows you to
fetch a list of employees along with their department IDs without any
duplicates, which could be useful for further analysis, reporting, or data
cleaning tasks.
1. CHECK
A CHECK constraint ensures that the data in a column meets a specific
condition. If the condition is not met, the data cannot be inserted or updated.
For example, let's create a table named products with two columns:
product_id and product_price. We want to ensure that the product_price
column only accepts values greater than 0.
In this example, the CHECK constraint ensures that the product_price column
contains a value greater than 0. If an attempt is made to insert or update a
row with a non-positive value for the product_price column, the database will
reject the operation, thereby maintaining data integrity.
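The products table described above could be declared as in the sketch below. The column types and the sample values are assumptions.

```sql
-- Column types are assumed for illustration.
CREATE TABLE products (
    product_id    INT PRIMARY KEY,
    product_price DECIMAL(10, 2) CHECK (product_price > 0)
);

INSERT INTO products (product_id, product_price) VALUES (1, 19.99);  -- accepted
INSERT INTO products (product_id, product_price) VALUES (2, -5.00);  -- rejected by the CHECK constraint
```

Two caveats are worth knowing. A NULL price passes the check, because the condition is then unknown rather than false, so add NOT NULL if a price is mandatory. MySQL versions before 8.0.16 also parse CHECK constraints but do not enforce them.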
Using the CHECK constraint when creating a table is beneficial for enforcing
data validation rules and ensuring that the data stored in the table meets
specific business requirements or constraints. This can help maintain data
quality and consistency across the table.
2. UNIQUE
A UNIQUE constraint ensures that all values in a column are unique. This
helps prevent duplicate data.
For example, let’s create a table named users with two columns: user_id and
email . We want to ensure that the email column contains unique values for
each user:
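A users table of this kind could be sketched as follows. The column types and the sample email address are assumptions.

```sql
-- Column types are assumed for illustration.
CREATE TABLE users (
    user_id INT PRIMARY KEY,
    email   VARCHAR(255) UNIQUE
);

INSERT INTO users (user_id, email) VALUES (1, 'alice@example.com');  -- accepted
INSERT INTO users (user_id, email) VALUES (2, 'alice@example.com');  -- rejected: email already exists
```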
In this example, the UNIQUE constraint ensures that the email column
contains unique values for each row in the table. If an attempt is made to
insert or update a row with an email address that already exists in the table,
the database will reject the operation, thereby maintaining data integrity.
Using the UNIQUE constraint when creating a table is beneficial for enforcing
data uniqueness rules and ensuring that the data stored in the table meets
specific business requirements or constraints. This can help maintain data
quality and consistency across the table.
3. FOREIGN KEY
A FOREIGN KEY constraint is used to maintain referential integrity between
two tables. It ensures that each value in the referencing column matches a
value in the primary key (or another unique key) of the referenced table.
For example, let’s create two tables: orders and order_items . The orders
table contains information about each order, and the order_items table
contains information about the items in each order. We want to ensure that
each order item in the order_items table is associated with a valid order in
the orders table:
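The two tables could be declared as in the sketch below. Apart from the table names and order_id, every column name, type, and sample value is an assumption made for illustration.

```sql
-- Columns other than order_id are hypothetical.
CREATE TABLE orders (
    order_id   INT PRIMARY KEY,
    order_date DATE
);

CREATE TABLE order_items (
    order_item_id INT PRIMARY KEY,
    order_id      INT NOT NULL,
    product_name  VARCHAR(100),
    FOREIGN KEY (order_id) REFERENCES orders (order_id)
);

INSERT INTO orders (order_id, order_date) VALUES (1, '2023-03-20');

INSERT INTO order_items (order_item_id, order_id, product_name)
VALUES (1, 1, 'Keyboard');  -- accepted: order 1 exists

INSERT INTO order_items (order_item_id, order_id, product_name)
VALUES (2, 99, 'Mouse');    -- rejected: there is no order 99
```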
In this example, the FOREIGN KEY constraint ensures that the order_id column
in the order_items table refers to a valid order_id in the orders table. If an
attempt is made to insert or update a row in the order_items table with an
order_id that does not exist in the orders table, the database will reject the
operation, thereby maintaining referential integrity.
Using the FOREIGN KEY constraint when creating a table is beneficial for
enforcing referential integrity between related tables.
Conclusion
Mastering data cleaning techniques in SQL is crucial for ensuring data quality
and accuracy in your database. By using SQL functions and applying
constraints, you can effectively clean and validate your data, leading to better
analysis and decision-making.
operators. Additionally, you can set default values for columns during
table creation, which are used when an INSERT provides no value for the
column.