Cleansing Data With SQL Server 2016 Data Quality Services
Signing In
In this task, you will sign in to the virtual machine.
1. To sign in to the virtual machine, using the portal menu, click Commands, and then select
Ctrl + Alt + Delete.
2. In the password box, enter Pass@word1 (do not enter the period), and then click Submit.
If you are not using a US English keyboard, the password you enter may not be correctly received
by the virtual machine. You must complete the following task to sign in, and then update the virtual
machine language.
Note: For lab users with English keyboards, if the @ symbol is above the 2, then your keyboard is
a US English keyboard, and you should not complete the following task.
2. Use the on-screen keyboard to enter the password Pass@word1. (Do not enter the period.)
Tip: To reveal the input password before submitting, click the following.
6. In the Control Panel window, from inside the Clock, Language, and Region group, click
Change Input Methods.
8. In the Add Languages window, locate and select your language, and then click Add (or Open).
If the selected language has regional variants, you will be directed to the Regional Variants
window, in which case, select a variant, and then click Add.
3. Resize the pane to the minimum width at which the content (lab manual) remains easily
readable.
Tip: Less width occupied by the pane allows for more room for the virtual machine screen.
4. In the virtual machine screen, right-click the desktop, and then select Screen Resolution.
5. In the Screen Resolution window, in the Resolution dropdown list, select a higher resolution.
1024 x 768 is the recommended minimum, but use a higher resolution if this fits your screen size.
6. Click Apply.
Setting Up
In this task, you will set up the lab database required to complete this lab.
1. To open File Explorer, on the taskbar, click the File Explorer shortcut.
4. In the Command window, when prompted to press any key to continue, press any key.
Important: When naming objects in this lab, be sure to enter the names exactly as the lab
describes. Incorrect name values may result in errors later in the lab.
5. In the lower pane, notice that the Domain Management activity is selected.
Creating Domains
In this task, you will create the knowledge base domains.
1. To create a domain, click Create a Domain.
Tip: In Data Quality Client, commands are available either as icons, or right-click context menus.
To determine what an icon does, hover the cursor over it to reveal a tooltip.
2. In the Create Domain window, in the Domain Name box, enter Office.
There are many domain properties that can also be set when creating the domain, and these can
be modified at any time during domain management.
3. Click OK.
• District
• Address1
• Address2
• City
• StateOrProvince
• PostalCode
• Country
• Phone
• ManagerFirstName
• ManagerLastName
• ManagerTitle
• ManagerEmail
5. Verify that you have 13 domains.
6. Select the Address1 domain.
Important: It is a common mistake to configure the wrong domain, which later involves determining
which domain to undo (there is no Ctrl-Z to undo). Always take care to select the correct domain
before configuring it.
Domain            Action
Address2          Enable Speller: Uncheck
StateOrProvince   Format Output to: Upper Case
                  Enable Speller: Uncheck
ManagerEmail      Format Output to: Lower Case
                  Enable Speller: Uncheck
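As a rough illustration of what the Format Output settings above do to cleansed values, here is a minimal Python sketch. It is not how DQS is implemented; the function names and the DOMAIN_FORMATS mapping are hypothetical and only mirror the two settings in the table.

```python
# Hypothetical mapping of domain name -> output format, mirroring the table above.
DOMAIN_FORMATS = {"StateOrProvince": "upper", "ManagerEmail": "lower"}


def format_output(value: str, mode: str) -> str:
    """Apply an output-format setting to a single value."""
    if mode == "upper":
        return value.upper()
    if mode == "lower":
        return value.lower()
    return value  # any other mode: leave the value unchanged


def format_record(record: dict) -> dict:
    """Format each column according to its domain setting, if it has one."""
    return {
        column: format_output(value, DOMAIN_FORMATS.get(column, "none"))
        for column, value in record.items()
    }


if __name__ == "__main__":
    print(format_record({"StateOrProvince": "wa", "ManagerEmail": "Rob@Lab.Microsoft.com"}))
```

Domains without a format setting, such as City, pass through unchanged.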
Configuring Domain Values
In this task, you will configure domain values and define a synonym.
1. Select the Office domain.
2. Select the Domain Values tab.
3. Notice that the domain values already include the DQS_NULL value.
All domains include this value, and it cannot be deleted.
This configuration ensures that missing Office values will result in an invalid record.
5. Repeat the last step to set the DQS_NULL value to Invalid for the following additional domains:
• District
• Address1
• City
• PostalCode
• StateOrProvince
• Country
• Phone
8. In the new row added to the domain value grid, enter Canada.
9. Press Enter.
10. Notice that the domain value is set to type Correct (green check mark).
11. Add two additional domain values:
• United States
• US
12. In the grid, notice that domain values sort alphabetically, and that new domain values added
during the activity are adorned with a yellow star.
13. To define synonyms, first select the United States domain value, and then while pressing the
Control key, select the US domain value.
14. Right-click the selection, and then select Set as Synonyms.
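The effect of a synonym group can be pictured as a lookup from each synonym to its leading value. The Python sketch below is only an illustration. It assumes that United States is the leading value and US is its synonym, and the function names are hypothetical.

```python
# Leading value -> its synonyms.
# Assumption: "United States" is the leading value and "US" is the synonym.
SYNONYMS = {"United States": ["US"]}


def build_lookup(synonyms: dict) -> dict:
    """Map every member of a synonym group to the group's leading value."""
    lookup = {}
    for leading, others in synonyms.items():
        lookup[leading] = leading
        for synonym in others:
            lookup[synonym] = leading
    return lookup


def to_leading(value: str, lookup: dict) -> str:
    """Return the leading value for a synonym; other values pass through."""
    return lookup.get(value, value)


if __name__ == "__main__":
    lookup = build_lookup(SYNONYMS)
    for country in ["US", "United States", "Canada"]:
        print(country, "->", to_leading(country, lookup))
```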
4. In the domain rule grid, in the Name box, enter Not an initial.
5. In the Build a Rule section, modify the operator to Length is Greater Than or Equal to, and then
in the corresponding box, enter 2.
7. In the Test Domain Rule window, click Adds a New Testing Term for the Domain Rule.
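The Not an initial rule configured above amounts to a simple length check. A minimal Python sketch of the same condition follows. The function name is hypothetical, and how DQS treats surrounding whitespace is not covered here.

```python
def not_an_initial(value: str) -> bool:
    """Mirror the rule condition: length is greater than or equal to 2."""
    return len(value) >= 2


if __name__ == "__main__":
    for term in ["R", "Rob"]:
        print(repr(term), not_an_initial(term))
```

A single-character value such as R fails the rule, while Rob passes.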
• 800 123-4567
• (800) 123-4567
21. Note that this domain rule tests only whether the email address is valid, and not the additional
requirement that the email address must belong to a particular domain.
22. To add a new condition, click Add a New Condition to the Selected Clause.
• rob@hotmail.com
• rob@@lab.microsoft.com
• rob@lab.microsoft.com
25. Verify that only the final term is correct.
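The two-condition email rule can be sketched in Python as a validity check combined (AND) with a check on the email domain. Both the simplified pattern and the required suffix below are assumptions: the lab's actual expression is not reproduced here, and the suffix is only inferred from the test terms.

```python
import re

# Simplified email pattern (an assumption, not the expression used in the lab).
EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")

# Hypothetical required domain, inferred from the test terms above.
REQUIRED_SUFFIX = "@lab.microsoft.com"


def is_valid_manager_email(value: str) -> bool:
    """Both conditions must hold: well-formed address AND the required domain."""
    well_formed = EMAIL_RE.match(value) is not None
    in_domain = value.lower().endswith(REQUIRED_SUFFIX)
    return well_formed and in_domain


if __name__ == "__main__":
    for term in ["rob@hotmail.com", "rob@@lab.microsoft.com", "rob@lab.microsoft.com"]:
        print(term, is_valid_manager_email(term))
```

Under these assumptions, only the final test term passes: the first fails the domain check and the second is not well formed.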
The first regular expression validates a US postal code (ZIP Code) allowing also for the Zip+4
Code format. The second regular expression validates a Canadian postal code, requiring a space
at the fourth character.
28. To modify the operator, to the right of the AND operator, click the down-arrow, and then select
OR.
29. Verify that the domain rule looks like the following.
• 1234
• 12345
• 12345-123
• 12345-1234
• A1A1A1
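To see how the OR of the two postal code patterns behaves on the test terms, consider the following Python sketch. The regular expressions are assumptions written to match the description above (US ZIP Code with optional ZIP+4, Canadian postal code with a space as the fourth character), not the exact expressions entered in the lab.

```python
import re

# Assumed patterns matching the description, not the lab's exact expressions.
US_ZIP = re.compile(r"\d{5}(-\d{4})?")                        # 12345 or 12345-1234
CA_POSTAL = re.compile(r"[A-Za-z]\d[A-Za-z] \d[A-Za-z]\d")    # A1A 1A1 (space required)


def is_valid_postal_code(value: str) -> bool:
    """A value is valid if it fully matches either pattern (OR operator)."""
    return bool(US_ZIP.fullmatch(value) or CA_POSTAL.fullmatch(value))


if __name__ == "__main__":
    for term in ["1234", "12345", "12345-123", "12345-1234", "A1A1A1", "A1A 1A1"]:
        print(repr(term), is_valid_postal_code(term))
```

With these patterns, 12345 and 12345-1234 are valid, while 1234, 12345-123 and A1A1A1 (missing the space) are not.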
4. In the term-based relation grid, in the Value box, enter Distr. (include the period).
This relation will ensure all abbreviated instances will be corrected to the full name.
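A term-based relation can be thought of as a whole-term replacement inside a value. The Python sketch below illustrates the idea. It assumes the Correct To value for Distr. is District, and the function name is hypothetical.

```python
# Term -> corrected term. "District" as the corrected form is an assumption.
TERM_RELATIONS = {"Distr.": "District"}


def apply_term_relations(value: str) -> str:
    """Replace whole terms (space-separated) that have a defined relation."""
    return " ".join(TERM_RELATIONS.get(term, term) for term in value.split(" "))


if __name__ == "__main__":
    print(apply_term_relations("Midwest Distr."))
```

Only whole terms are replaced, so a value that merely contains the letters Distr inside a longer word is left alone.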
2. In the Create a Composite Domain window, in the Composite Domain Name box, enter
Address.
5. Add the following domains also to the composite domain, ensuring that they are added in the
order listed.
Tip: You can double-click each domain to add it to the list, and you can also multi-select items in
order by pressing the Control key, and then add them by clicking the right-arrow.
• Address2
• City
• StateOrProvince
• PostalCode
• Country
6. Verify that the Domains in Composite Domain list includes the following six domains, in the
order presented.
7. Click OK.
4. In the cross-domain rules grid, in the Name box, enter Vancouver CA.
The value BC will be added to the StateOrProvince domain values as a result of configuring this
rule.
The value WA will be added to the StateOrProvince domain values as a result of configuring this
rule.
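The rule definitions themselves are not reproduced in this text, so the following Python sketch rests on an assumption: that the Vancouver CA rule sets StateOrProvince to BC when City is Vancouver and Country is Canada, and that a companion rule sets WA when Country is United States. The function name is hypothetical.

```python
def apply_vancouver_rules(record: dict) -> dict:
    """Sketch of two If/Then cross-domain rules (conditions are assumed)."""
    fixed = dict(record)  # do not modify the caller's record
    if fixed.get("City") == "Vancouver":
        if fixed.get("Country") == "Canada":
            fixed["StateOrProvince"] = "BC"
        elif fixed.get("Country") == "United States":
            fixed["StateOrProvince"] = "WA"
    return fixed


if __name__ == "__main__":
    print(apply_vancouver_rules({"City": "Vancouver", "Country": "Canada", "StateOrProvince": ""}))
    print(apply_vancouver_rules({"City": "Vancouver", "Country": "United States", "StateOrProvince": ""}))
```

A cross-domain rule of this kind can only be expressed in a composite domain, because it relates the values of more than one domain in the same record.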
The knowledge base is not yet ready to be applied to a cleansing activity. You will continue to
enhance the knowledge base with knowledge discovery activities in the next exercise.
2. In the grid, notice that the Office knowledge base is locked, and has the state In Work.
The knowledge base cannot be used until it is unlocked. You will unlock the knowledge base
when you publish it in the next exercise.
3. Click Cancel.
2. Notice the first listed activity is the one you just completed.
Every DQS activity undertaken is logged and remains available for review and audit.
3. Click Close.
2. Notice that step 1 of the activity is to map to external data containing knowledge.
7. In the Mappings grid, in the first row, in the Source Column column, select the
ProvinceOrTerritoryCode column.
8. In the corresponding Domain column, select the StateOrProvince domain.
10. Notice that step 2 of the activity is to discover knowledge from the source.
13. Note that 13 unique values were detected, of which 12 are new values for the domain.
In the previous exercise, when you added the cross-domain rules, both BC and WA were added
to the domain values. BC (British Columbia) was included in the source data, but not added to the
domain values as it already exists.
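The counts reported by the discovery step follow from a set difference between the source values and the existing domain values. The Python sketch below works through the arithmetic. The list of source codes is an assumption (the standard two-letter codes for the 13 Canadian provinces and territories), because the lab's source data is not shown here.

```python
# Domain values that already exist, added by the cross-domain rules.
EXISTING_VALUES = {"BC", "WA"}

# Assumed source data: standard codes for the 13 provinces and territories.
SOURCE_CODES = ["AB", "BC", "MB", "NB", "NL", "NS", "NT", "NU", "ON", "PE", "QC", "SK", "YT"]


def discover(source_values, existing_values):
    """Return (number of unique source values, sorted list of new values)."""
    unique = set(source_values)
    new_values = sorted(unique - set(existing_values))
    return len(unique), new_values


if __name__ == "__main__":
    unique_count, new_values = discover(SOURCE_CODES, EXISTING_VALUES)
    print(unique_count, "unique values,", len(new_values), "new")
```

Under this assumption the result is 13 unique values and 12 new ones, because BC is already present in the domain.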
16. Review the list of domain values, and notice that this is a list of what has been added in this
activity.
17. To reveal all domain values, uncheck the Show Only New checkbox.
3. Map only the following four source columns to their respective domains.
The rationale for performing knowledge discovery for the StateOrProvince domain is to detect
and appropriately configure anomalies.
4. Notice that domains that can be cleansed by domain rules (i.e. Phone and ManagerEmail) are
not included in this knowledge discovery activity. Some domains do not need to have possible
values stored as domain values.
5. Proceed to the discovery step, and start the discovery process.
9. For the Country domain, notice the notification icon in the New column.
10. Hover the cursor over the notification icon to reveal a tooltip describing a possible issue.
You can ignore the issue in this lab.
14. Right-click the Ausstin, TX text, and then select the correct spelling suggestion: Austin, TX.
16. Scroll down the list to locate the Lehi, UT office domain value (which is, in fact, correctly spelled).
17. Right-click the Lehi, UT domain value, and then select Add to Dictionary.
18. Notice that the red squiggly has been removed.
19. Show all domain values.
It is useful to show all values when managing synonyms that may involve existing members.
20. Locate the two adjacent domain values for New York.
21. Multi-select the two domain values, right-click the selection, and then select Set as Synonyms.
22. Ensure that New York, NY is the leading value.
24. Use the dictionary to correct the Midwesst Distr. domain value to Midwest Distr.
25. Show all domain values, and notice how the misspelled Midwest domain value corrects to an
existing domain value.
27. In the adjacent Correct to box, enter the correct domain value, Greater Southeast District, and
then press Enter.
28. Notice how the error domain value relates to the correct domain value.
29. Correct also the Mid-Atlantic Dist. value to the Mid Atlantic District value.
32. Show all domain values, and notice how the corrections relate to existing domain values.
33. Select the Country domain.
38. When notified that the knowledge base has been published, click OK.
You will use the knowledge base in the next exercise to cleanse the Office dataset.
39. Review the knowledge base status, and notice that it is no longer locked, and has no state (i.e. it
is open).
40. Review the activity monitoring, and notice the three knowledge discovery activities you have just
completed.
7. Notice that the project consists of a single connection manager, which is used to connect to the
Lab-DQS database.
9. In the Add SSIS Connection Manager window, select the DQS connection manager type.
11. In the Add DQS Cleansing Connection Manager window, in the Server Name dropdown list—
do not click the dropdown arrow—enter localhost.
2. To add a data flow task, click the link located at the center of the designer.
6. Verify that the data flow component looks like the following.
Do not be concerned about the error icon, which will disappear when you complete the next steps.
7. To edit the source component, right-click the component, and then select Edit.
8. In the ADO.NET Source Editor window, in the ADO.NET Connection Manager dropdown list,
notice that the localhost.Lab-DQS connection manager is selected.
9. In the Name of the Table or the View dropdown list, select
"dbo"."MSFTOffice_NorthAmerica".
11. From the SSIS Toolbox, expand Other Transforms, and then drag the DQS Cleansing to the
data flow designer, and drop it directly beneath the source component.
12. Verify that the data flow design looks like the following.
13. To connect the components, first select the Office Dataset source component, and then drag the
standard output (the left, blue arrow) on top of the cleansing component.
15. To edit the cleansing component, right-click the component, and then select Edit.
16. In the DQS Cleansing Transform Editor window, in the Data Quality Connection Manager
dropdown list, select the DQS Cleansing Connection Manager.localhost connection manager.
17. In the Data Quality Knowledge Base dropdown list, select Office.
18. In the Available Domains list, review the knowledge base domains, noticing that the first listed is
the composite domain.
You will not use the composite domain to cleanse the data in this package design.
22. Notice the second grid that defines the mapping between input columns and the knowledge base
domains.
It also defines alias output columns for the source, output and status columns.
24. Map each input column to its respective domain—do not map the Address composite domain.
For your knowledge base, this will mean that StateOrProvince values will be set to upper case,
and ManagerEmail values will be set to lower case.
The reason needs to be output to help explain why values are invalid.
30. From the SSIS Toolbox, from inside the Common group, drag the Conditional Split to the data
flow designer, and drop it directly beneath the cleansing component.
31. Configure the standard output of the cleansing component to connect to the new component, as
follows.
35. Scroll to the bottom of the columns list, and then drag the Record Status column into the
Condition box.
36. In the Condition box, complete the expression as follows (note that the operator is two equals (=)
signs, which tests for equality).
Any record with an invalid record status will be output to the Invalid output.
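The routing performed by the conditional split can be sketched in Python as follows. This is an illustration only: it assumes the status value is the string Invalid and that all remaining rows go to an output named Correct. The function name is hypothetical, and the SSIS expression itself is not reproduced.

```python
def conditional_split(rows):
    """Route each row to the Invalid output or to the Correct output."""
    outputs = {"Invalid": [], "Correct": []}
    for row in rows:
        # Assumption: the status value is the literal string "Invalid".
        target = "Invalid" if row["Record Status"] == "Invalid" else "Correct"
        outputs[target].append(row)
    return outputs


if __name__ == "__main__":
    sample = [
        {"Office": "Austin, TX", "Record Status": "Corrected"},
        {"Office": "", "Record Status": "Invalid"},
        {"Office": "Lehi, UT", "Record Status": "Correct"},
    ]
    result = conditional_split(sample)
    print(len(result["Invalid"]), "invalid,", len(result["Correct"]), "correct")
```

Rows with a Correct or Corrected status both land in the same output under this assumption.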
40. From the SSIS Toolbox, expand Other Destinations (the last group), and then drag the
ADO NET Destination to the data flow designer, and drop it beneath, and to the left of, the
conditional split component.
41. In the Properties pane, set the Name property to DimOffice.
42. Configure the standard output of the conditional split component to connect to the new
component.
43. In the Input Output Selection window, in the Output dropdown list, select Correct.
This page of the editor is used to configure the mappings between the input columns, and the
columns of the DimOffice table.
51. From the Available Input Columns list, drag the Office_Output column to the Office columns of
the Available Destination Columns list.
There is no need to map to the OfficeKey column, as this is an identity column that will
automatically populate a sequence of values when rows are inserted into the table.
The source columns will contain original values, while the output columns will contain
standardized values (e.g. lower case email addresses), so you will map only the output columns.
There is no need to store other column types as the rows passed to this destination are only
correct records. Status columns will only ever be Correct or Corrected.
As this table will be used to analyze data quality issues, all output columns will be stored.
3. To sort the activities by descending order, in the activity grid, click the ID column header twice.
4. Notice the first listed activity is an SSIS Cleansing type.
Every activity undertaken with the Data Quality Server—even when invoked by SSIS—is logged
and remains available for review and audit.
5. Click Close.
2. In the project grid, right-click the SSIS cleansing project, and then select Open.
The SSIS cleansing project is highlighted in red, and is locked.
3. Notice that the project opens at the Manage and View Results step.
It is possible to complete a manual cleansing process.
4. Click Close.
2. In the Connect to Server window, ensure that the Server Type is set to Database Engine, and
that the Server Name is set to SQLSERVER2016BI.
3. Click Connect.
It is very important that you execute the script in the manner intended. Many script files include
multiple batches of statements (completed with the GO keyword), and so you should select the
statements together with the GO keyword, and then execute only that selection.
8. To execute a subset of a script, select the text you intend to execute, and then click Execute (or
press F5).
9. Read the comments in the first batch (line 3).
10. Select and execute the only query in the batch (lines 4-5).
11. Read the commented text, and then execute the query for each of the remaining batches in the
script.
12. To exit SSMS, on the File menu, select Exit.
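The note above about batches can be illustrated with a small Python sketch that splits a script into batches at lines containing only the GO keyword. The sample script and its table name are hypothetical, and the sketch does not handle variations such as GO followed by a repeat count.

```python
def split_batches(script: str) -> list:
    """Split a script into batches at lines that contain only GO."""
    batches, current = [], []
    for line in script.splitlines():
        if line.strip().upper() == "GO":
            text = "\n".join(current).strip()
            if text:
                batches.append(text)
            current = []
        else:
            current.append(line)
    tail = "\n".join(current).strip()
    if tail:
        batches.append(tail)  # statements after the final GO, if any
    return batches


if __name__ == "__main__":
    # Hypothetical script content, for illustration only.
    sample = "-- Count rows\nSELECT COUNT(*) FROM dbo.SomeTable;\nGO\nSELECT 1;\nGO\n"
    for number, batch in enumerate(split_batches(sample), start=1):
        print("Batch", number)
        print(batch)
```

Each batch is a unit that is sent for execution on its own, which is why the lab asks you to select one batch at a time.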
Finishing Up
In this task, you will finish up by undoing the configurations made in this lab, and by closing opened
applications.
1. Close Data Quality Client.
2. In a File Explorer window, navigate to the D:\SQLServer2016BI\Lab09\Assets folder.
3. Right-click the Cleanup.cmd file, and then select Run as Administrator.
4. In the Command window, when prompted to press any key to continue, press any key.
5. Close the File Explorer window.