DP-203 Exam – Free Actual Q&As | ExamTopics
Expert Verified, Online, Free.

Topic 1 - Question Set 1


Question #1 Topic 1

You have a table in an Azure Synapse Analytics dedicated SQL pool. The table was created by using the following Transact-SQL statement.

You need to alter the table to meet the following requirements:

✑ Ensure that users can identify the current manager of employees.

✑ Support creating an employee reporting hierarchy for your entire company.

✑ Provide fast lookup of the managers' attributes such as name and job title.

Which column should you add to the table?

A.
[ManagerEmployeeID] [smallint] NULL

B.
[ManagerEmployeeKey] [smallint] NULL

C.
[ManagerEmployeeKey] [int] NULL

D.
[ManagerName] [varchar](200) NULL

Correct Answer:
C

We need an extra column to identify the manager. Use the same data type as the EmployeeKey column, which is an int column.
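
As a rough illustration (not from the exhibit, whose CREATE TABLE statement is shown only as an image), the change and the lookup it enables might look like the following sketch; the table and attribute names (dbo.DimEmployee, EmployeeName, JobTitle) are assumptions.

-- Hypothetical sketch: add a self-referencing key with the same data type as EmployeeKey.
ALTER TABLE dbo.DimEmployee
ADD [ManagerEmployeeKey] [int] NULL;

-- Fast lookup of a manager's attributes via a self-join on the surrogate key.
SELECT e.EmployeeName,
       m.EmployeeName AS ManagerName,
       m.JobTitle     AS ManagerJobTitle
FROM dbo.DimEmployee AS e
LEFT JOIN dbo.DimEmployee AS m
    ON e.ManagerEmployeeKey = m.EmployeeKey;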

Reference:

https://docs.microsoft.com/en-us/analysis-services/tabular-models/hierarchies-ssas-tabular

Community vote distribution


C (100%)

 
alexleonvalencia
Highly Voted 
5 months, 2 weeks ago
Selected Answer: C
The answer is correct.
upvoted 10 times

 
jskibick
Highly Voted 
3 months, 3 weeks ago
Selected Answer: C
Answer C. Smallint eliminates A and B. But I would name the field [ManagerEmployeeID] [int] NULL since it should reference EmployeeID, not
EmployeeKey since this one is IDENTITY.
upvoted 6 times

 
Dothy
Most Recent 
2 weeks ago
Selected Answer: C
upvoted 1 times

 
Egocentric
1 month, 1 week ago
C is the correct answer
upvoted 1 times

 
boggy011
1 month, 1 week ago
Selected Answer: C
upvoted 1 times

 
temacc
2 months ago
Selected Answer: C
Correct answer is C
upvoted 1 times


 
Guincimund
2 months ago
Selected Answer: C
answer is C.
upvoted 1 times

 
NeerajKumar
2 months, 3 weeks ago
Selected Answer: C
Correct Ans is C
upvoted 2 times

 
KosteK
2 months, 4 weeks ago
Selected Answer: C
correct answer is C
upvoted 1 times

 
samtrion
3 months ago
Selected Answer: C
It is quite obvious C
upvoted 1 times

 
ArunCDE
3 months, 2 weeks ago
Selected Answer: C
This is the correct answer.
upvoted 1 times

 
PallaviPatel
4 months ago
Selected Answer: C
correct is C. Use surrogate key instead of business key as a reference.
upvoted 1 times

 
Aurelkb
4 months, 1 week ago
correct answer is C
upvoted 1 times

 
pozdrotechno
4 months, 1 week ago
Selected Answer: C
C is correct.

The column should be based on the surrogate key (EmployeeKey), including an identical data type.
upvoted 2 times

 
SofiaG
4 months, 1 week ago
Selected Answer: C
Correct.
upvoted 1 times

 
jchen9314
4 months, 2 weeks ago
INT has the best performance to be a key.
upvoted 2 times


Question #2 Topic 1

You have an Azure Synapse workspace named MyWorkspace that contains an Apache Spark database named mytestdb.

You run the following command in an Azure Synapse Analytics Spark pool in MyWorkspace.

CREATE TABLE mytestdb.myParquetTable(

EmployeeID int,

EmployeeName string,

EmployeeStartDate date)

USING Parquet -

You then use Spark to insert a row into mytestdb.myParquetTable. The row contains the following data.

One minute later, you execute the following query from a serverless SQL pool in MyWorkspace.

SELECT EmployeeID -

FROM mytestdb.dbo.myParquetTable

WHERE name = 'Alice';

What will be returned by the query?

A.
24

B.
an error

C.
a null value

Correct Answer:
A

Once a database has been created by a Spark job, you can create tables in it with Spark that use Parquet as the storage format. Table names
will be converted to lower case and need to be queried using the lower case name. These tables will immediately become available for querying
by any of the Azure Synapse workspace Spark pools. They can also be used from any of the Spark jobs subject to permissions.
Note: For external tables, since they are synchronized to serverless SQL pool asynchronously, there will be a delay until they appear.
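
For reference, a hedged sketch of how the synchronized table can be queried from a serverless SQL pool: the Spark table is exposed in lower case under the dbo schema, and the filter column here is EmployeeName rather than the [name] column used in the question, which is the point most of the discussion below turns on. The value 24 for Alice is an assumption taken from option A.

-- Minimal sketch (assumes the row (24, 'Alice', ...) shown in the exhibit was inserted):
SELECT EmployeeID
FROM mytestdb.dbo.myparquettable   -- Spark table names are synchronized in lower case
WHERE EmployeeName = 'Alice';      -- the question's query filters on [name], which is not a column of the table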

Reference:

https://docs.microsoft.com/en-us/azure/synapse-analytics/metadata/table

Community vote distribution


B (100%)

 
kruukp
Highly Voted 
1 year ago
B is the correct answer. The WHERE clause uses a column named 'name', which doesn't exist in the table.
upvoted 103 times

 
knarf
11 months, 1 week ago
I agree B is correct, not because the column 'name' in the query is invalid, but because the table reference itself is invalid as the table was
created as CREATE TABLE mytestdb.myParquetTable and not mytestdb.dbo.myParquetTable
upvoted 14 times

 
EddyRoboto
9 months ago
When we query a Spark table from SQL Serverless we must use the schema, in this case dbo, so this doesn't cause errors.
upvoted 7 times

 
anarvekar
9 months, 3 weeks ago
Isn't dbo the default schema the objects are created in, if the schema name is not explicitly specified in the DDL?
upvoted 2 times

 
AugustineUba
10 months ago
I agree with this.
upvoted 1 times

 
anto69
4 months, 2 weeks ago
Agree there's no column named 'name'
upvoted 2 times

 
baobabko
12 months ago


Even if the column name were correct: when I tried the example, it threw an error that the table doesn't exist (as expected - after all, it is a Spark table, not SQL. There is no external or any other table which could be queried in the SQL pool).
upvoted 4 times

 
EddyRoboto
9 months ago
They share the same metadata; perhaps you forgot to specify the schema in your query in the SQL Serverless pool. You should have tried spark_db.[dbo].spark_table
upvoted 2 times

 
Alekx42
12 months ago
https://docs.microsoft.com/en-us/azure/synapse-analytics/metadata/table

"Once a database has been created by a Spark job, you can create tables in it with Spark that use Parquet as the storage format. Table names
will be converted to lower case and need to be queried using the lower case name. These tables will immediately become available for
querying by any of the Azure Synapse workspace Spark pools. The Spark created, managed, and external tables are also made available as
external tables with the same name in the corresponding synchronized database in serverless SQL pool."

I think the reason you got the error was because the query had to use the lower case names. See the example in the same link, they create a
similar table and use the lowercase letters to query it from the Serverless SQL pool.

Anyway, this confirms that B is the correct answer here.


upvoted 7 times

 
knarf
11 months ago
See my post above and comment?
upvoted 1 times

 
polokkk
Highly Voted 
4 months, 2 weeks ago
A is correct in the real exam; there the filter was EmployeeName, not name, so 24 is the one to select in the real exam.

B is correct for this question, since it isn't worded exactly the same as the exam question; thus B is correct here
upvoted 12 times

 
Dothy
Most Recent 
2 weeks ago
No EmployeeName column in query. So answer B is correct
upvoted 1 times

 
Lizaveta
4 weeks ago
I came across this question on an exam today. It had the correct query, with "WHERE EmployeeName = 'Alice'". So I answered A (24).


upvoted 3 times

 
FelixI
1 month ago
Selected Answer: B
No EmployeeName column in query
upvoted 1 times

 
Egocentric
1 month, 1 week ago
B is correct because there is no column called name
upvoted 1 times

 
xuezhizlv
1 month, 3 weeks ago
Selected Answer: B
B is the correct answer.
upvoted 1 times

 
AlCubeHead
2 months ago
Selected Answer: B
There is no name column in the table. Also, the - at the end of the SELECT is dubious to me
upvoted 1 times

 
Guincimund
2 months ago
Selected Answer: B
SELECT EmployeeID -

FROM mytestdb.dbo.myParquetTable

WHERE name = 'Alice';

As the Where clause is name = 'Alice', the answer is B as there is no column named 'name'.

In the case where the WHERE clause is "WHERE EmployeeName = 'Alice'", the query will return 24, which is answer A.
upvoted 2 times

 
Sakshi_21
2 months, 2 weeks ago
Selected Answer: B
The name column doesn't exist in the table
upvoted 1 times

 
enricobny
2 months, 2 weeks ago


B is the right answer. Column 'name' is not present in the table structure, and also using mytestdb.dbo.myParquetTable will not work - [dbo] is the problem.

The correct syntax is :

SELECT EmployeeID FROM mytestdb.myParquetTable WHERE EmployeeName = 'Alice';


upvoted 1 times

 
Rama22
2 months, 3 weeks ago
The filter is name = 'Alice', not EmployeeName, so it will throw an error
upvoted 1 times

 
jskibick
3 months, 3 weeks ago
Selected Answer: B
B. The serverless SQL query has an error; the name field does not exist in the table
upvoted 1 times

 
PallaviPatel
4 months ago
Selected Answer: B
Wrong table and column names in the query.
upvoted 1 times

 
Fer079
4 months ago
Regarding the lower case... I have tested it on Azure: I created the table in a Spark pool, and it's true that it is converted to lower case automatically; however, we can query from both the Spark pool and the Synapse serverless pool using lower/upper case and it will always find the table... Did anyone else test it?
upvoted 2 times

 
ANath
4 months ago
I am getting 'Bulk load data conversion error (type mismatch or invalid character for the specified codepage)' error.
upvoted 1 times

 
ANath
4 months ago
Sorry, I was doing it in the wrong way. If we specify the table name in lower case and specify the correct column name, the exact result will show.
upvoted 1 times

 
pozdrotechno
4 months, 1 week ago
Selected Answer: B
B is correct.

- incorrect db/schema/table name: mytestdb.myParquetTable vs mytestdb.dbo.myParquetTable


- incorrect column name: EmployeeName vs name

- not using lower case in the query


upvoted 2 times


Question #3 Topic 1

DRAG DROP -

You have a table named SalesFact in an enterprise data warehouse in Azure Synapse Analytics. SalesFact contains sales data from the past 36
months and has the following characteristics:

✑ Is partitioned by month

✑ Contains one billion rows

✑ Has clustered columnstore indexes

At the beginning of each month, you need to remove data from SalesFact that is older than 36 months as quickly as possible.

Which three actions should you perform in sequence in a stored procedure? To answer, move the appropriate actions from the list of actions to the
answer area and arrange them in the correct order.

Select and Place:

Correct Answer:

Step 1: Create an empty table named SalesFact_work that has the same schema as SalesFact.

Step 2: Switch the partition containing the stale data from SalesFact to SalesFact_Work.

SQL Data Warehouse supports partition splitting, merging, and switching. To switch partitions between two tables, you must ensure that the
partitions align on their respective boundaries and that the table definitions match.

Loading data into partitions with partition switching is a convenient way to stage new data in a table that is not visible to users, and then switch in the new data.

Step 3: Drop the SalesFact_Work table.
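
A minimal sketch of the three steps, assuming a distribution column, partitioning column, boundary values, and partition number that are not shown in the text of the question; the key point is that the work table's distribution and partition boundaries must align with SalesFact so that the SWITCH is a metadata-only operation.

-- Step 1: empty work table with the same schema, distribution, and aligned boundaries (placeholder values).
CREATE TABLE dbo.SalesFact_Work
WITH
(
    DISTRIBUTION = HASH([ProductKey]),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION ([TransactionDateId] RANGE RIGHT FOR VALUES (20190601, 20190701))
)
AS
SELECT * FROM dbo.SalesFact WHERE 1 = 2;

-- Step 2: switch out the partition holding the month that is now older than 36 months (partition number is a placeholder).
ALTER TABLE dbo.SalesFact SWITCH PARTITION 2 TO dbo.SalesFact_Work PARTITION 2;

-- Step 3: drop the work table, which discards the stale data.
DROP TABLE dbo.SalesFact_Work;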

Reference:

https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-tables-partition

 
hsetin
Highly Voted 
8 months, 2 weeks ago

Given answer D A C is correct.


upvoted 24 times

 
svik
8 months, 2 weeks ago
Yes. Once the partition is switched with an empty partition it is equivalent to truncating the partition from the original table
upvoted 1 times

 
Dothy
Most Recent 
2 weeks ago
Step 1: Create an empty table named SalesFact_work that has the same schema as SalesFact.

Step 2: Switch the partition containing the stale data from SalesFact to SalesFact_Work.

Step 3: Drop the SalesFact_Work table.


upvoted 1 times

 
JJdeWit
1 month, 1 week ago
D A C is the right option.

For more information, this doc discusses exactly this example: https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-
data-warehouse-tables-partition
upvoted 1 times

 
theezin
1 month, 3 weeks ago
Why isn't deleting the sales data older than 36 months, which is mentioned in the question, included?
upvoted 1 times

 
RamGhase
3 months, 2 weeks ago
I could not understand how the answer handles removing data older than 36 months
upvoted 1 times

 
gerard
3 months, 2 weeks ago
You have to switch out the partition that contains the data older than 36 months
upvoted 2 times

 
PallaviPatel
4 months ago
D A C is correct.
upvoted 1 times

 
indomanish
4 months, 2 weeks ago
Partition switching helps us load large data sets quickly. Not sure if it will help in purging data as well.
upvoted 2 times

 
SabaJamal2010AtGmail
4 months, 4 weeks ago
Given answer is correct
upvoted 2 times

 
covillmail
7 months ago
DAC is correct
upvoted 4 times

 
AvithK
9 months, 2 weeks ago
truncate partition is even quicker, why isn't that the answer, if the data is dropped anyway?
upvoted 3 times

 
yolap31172
7 months, 2 weeks ago
There is no way to truncate partitions in Synapse. Partitions don't even have names and you can't reference them by value.
upvoted 4 times

 
BlackMal
9 months, 2 weeks ago
This, i think it should be the answer
upvoted 1 times

 
poornipv
10 months ago
what is the correct answer for this?
upvoted 2 times

 
AnonAzureDataEngineer
10 months ago
Seems like it should be:

1. E

2. A

3. C
upvoted 1 times

 
dragos_dragos62000
10 months, 4 weeks ago
Correct!
upvoted 1 times

 
Dileepvikram
12 months ago

Copying the data to a backup table is not mentioned in the answer


upvoted 1 times

 
savin
11 months, 1 week ago
The partition switching part covers it. So it's correct, I think
upvoted 1 times

 
wfrf92
1 year ago
Is this correct ????
upvoted 1 times

 
alain2
1 year ago
Yes, it is.

https://www.cathrinewilhelmsen.net/table-partitioning-in-sql-server-partition-switching/
upvoted 5 times

 
YipingRuan
7 months, 2 weeks ago
"Archive data by switching out: Switch from Partition to Non-Partitioned" ?
upvoted 1 times

 
TorbenS
1 year ago
yes, I think so
upvoted 4 times


Question #4 Topic 1

You have files and folders in Azure Data Lake Storage Gen2 for an Azure Synapse workspace as shown in the following exhibit.

You create an external table named ExtTable that has LOCATION='/topfolder/'.

When you query ExtTable by using an Azure Synapse Analytics serverless SQL pool, which files are returned?

A.
File2.csv and File3.csv only

B.
File1.csv and File4.csv only

C.
File1.csv, File2.csv, File3.csv, and File4.csv

D.
File1.csv only

Correct Answer:
C

To run a T-SQL query over a set of files within a folder or set of folders while treating them as a single entity or rowset, provide a path to a folder
or a pattern
(using wildcards) over a set of files or folders.
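
For context, a hedged sketch of the behavior the discussion below turns on; the data source and file format names are hypothetical, and the only point of interest is the effect of a trailing /** on a native external table in a serverless SQL pool.

-- With a native (serverless) external table, LOCATION = '/topfolder/' returns only the files
-- directly under /topfolder/; appending /** is what makes the query traverse subfolders.
CREATE EXTERNAL TABLE ExtTable
(
    [Col1] NVARCHAR(100),
    [Col2] NVARCHAR(100)
)
WITH
(
    LOCATION = '/topfolder/**',        -- '/topfolder/' alone skips files in subfolders (answer B in the discussion)
    DATA_SOURCE = MyDataLakeSource,    -- hypothetical
    FILE_FORMAT = MyCsvFormat          -- hypothetical
);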

Reference:

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/query-data-storage#query-multiple-files-or-folders

Community vote distribution


B (100%)

 
Chillem1900
Highly Voted 
1 year ago
I believe the answer should be B.

In case of a serverless pool a wildcard should be added to the location.

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables?tabs=hadoop#arguments-create-external-table
upvoted 76 times

 
captainpike
7 months, 1 week ago
I tested and proved you right; the answer is B. Remember the question is referring to serverless SQL and not a dedicated SQL pool. "Unlike Hadoop external tables, native external tables don't return subfolders unless you specify /** at the end of path. In this example, if LOCATION='/webdata/', a serverless SQL pool query, will return rows from mydata.txt. It won't return mydata2.txt and mydata3.txt because they're located in a subfolder. Hadoop tables will return all files within any subfolder."
upvoted 12 times

 
alain2
Highly Voted 
1 year ago
"Serverless SQL pool can recursively traverse folders only if you specify /** at the end of path."

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/query-folders-multiple-csv-files
upvoted 17 times

 
Preben
11 months, 3 weeks ago
When you are quoting from Microsoft documentation, do not ADD in words to the sentence. 'Only' is not used.
upvoted 10 times

 
captainpike
7 months, 1 week ago
The answer is B, however. I could not make "/**" work. Anybody?
upvoted 2 times

 
amiral404
Most Recent 
2 days, 21 hours ago
C is correct, as mentioned in the official documentation, which showcases a similar example: https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-table-transact-sql?view=sql-server-ver15&tabs=dedicated#location--folder_or_filepath
upvoted 1 times

 
Backy
1 week, 5 days ago
The question does not show the actual query so this is a problem
upvoted 1 times

 
Dothy
2 weeks ago
I believe the answer should be B.


upvoted 1 times

 
carloalbe
3 weeks ago
In this example, if LOCATION='/webdata/', a PolyBase query will return rows from mydata.txt and mydata2.txt. It won't return mydata3.txt because it's a file in a hidden folder. And it won't return _hidden.txt because it's a hidden file. https://docs.microsoft.com/en-us/sql/t-sql/statements/media/aps-polybase-folder-traversal.png?view=sql-server-ver15b

https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-table-transact-sql?view=sql-server-ver15&tabs=dedicated
upvoted 1 times

 
BJPJowee
4 weeks ago
The answer is correct: C. See the link https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-table-transact-sql?view=sql-server-ver15&tabs=dedicated
upvoted 1 times

 
MS_Nikhil
4 weeks, 1 day ago
Ans is definitely B
upvoted 1 times

 
poundmanluffy
1 month, 4 weeks ago
Selected Answer: B
Option is definitely "B"

Below is the documentation given on MS Docs:

Recursive data for external tables

Unlike Hadoop external tables, native external tables don't return subfolders unless you specify /** at the end of path. In this example, if
LOCATION='/webdata/', a serverless SQL pool query, will return rows from mydata.txt. It won't return mydata2.txt and mydata3.txt because they're
located in a subfolder. Hadoop tables will return all files within any sub-folder.
upvoted 1 times

 
Ozren
2 months, 1 week ago
Selected Answer: B
This is not a recursive pattern like '.../**'. So the answer is B, not C.
upvoted 2 times

 
kamil_k
2 months, 2 weeks ago
this one is tricky, I found information here which would suggest answer C is indeed correct:

https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-table-transact-sql?view=sql-server-ver15&tabs=dedicated#arguments-2
upvoted 1 times

 
kamil_k
2 months, 2 weeks ago
ok I've done the test:

1. created gen 2 storage acct

2. created azure synapse workspace

3. created container myfilesystem, subfolder topfolder and another subfolder topfolder under that

4. created two csv files and dropped one per folder i.e. one in topfolder and the other in topfolder/topfolder

5. executed the following code:

DROP EXTERNAL DATA SOURCE test;

CREATE EXTERNAL DATA SOURCE test
WITH (
    LOCATION = 'https://[storage-account-name].blob.core.windows.net/myfilesystem'
);

CREATE EXTERNAL FILE FORMAT test
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = ',',
        FIRST_ROW = 2
    )
);

CREATE EXTERNAL TABLE test
(id int, value int)
WITH (
    LOCATION = '/topfolder/',
    DATA_SOURCE = test,
    FILE_FORMAT = test
);

SELECT * FROM test;

The result was only the records from File1.csv, which was located in the first "topfolder".
upvoted 2 times


 
kamil_k
2 months, 2 weeks ago
In other words, answer C is incorrect. I forgot to mention I used the built-in serverless SQL pool
upvoted 1 times

 
islamarfh
1 month, 4 weeks ago
This is from the document, which tells that B is indeed correct:

In this example, if LOCATION='/webdata/', a PolyBase query will return rows from mydata.txt and mydata2.txt. It won't return mydata3.txt
because it's a file in a hidden folder. And it won't return _hidden.txt because it's a hidden file.
upvoted 1 times

 
RalphLiang
2 months, 3 weeks ago
Selected Answer: B
I believe the answer should be B.

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables?tabs=hadoop#arguments-create-external-table
upvoted 2 times

 
KosteK
2 months, 4 weeks ago
Selected Answer: B
Tested. Ans: B
upvoted 2 times

 
toms100
3 months, 2 weeks ago
Answer is C

If you specify LOCATION to be a folder, a PolyBase query that selects from the external table will retrieve files from the folder and all of its
subfolders.

Refer https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-table-transact-sql?view=sql-server-ver15&tabs=dedicated
upvoted 2 times

 
PallaviPatel
4 months ago
Selected Answer: B
B is correct answer.
upvoted 2 times

 
Sandip4u
4 months, 2 weeks ago
The answer is B. In the case of a serverless pool, a wildcard should be added to the location; otherwise this will not fetch the files from child folders
upvoted 2 times

 
bharatnhkh10
4 months, 3 weeks ago
Selected Answer: B
as we need ** to pick up all files
upvoted 2 times


Question #5 Topic 1

HOTSPOT -

You are planning the deployment of Azure Data Lake Storage Gen2.

You have the following two reports that will access the data lake:

✑ Report1: Reads three columns from a file that contains 50 columns.

✑ Report2: Queries a single record based on a timestamp.

You need to recommend in which format to store the data in the data lake to support the reports. The solution must minimize read times.

What should you recommend for each report? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.

Hot Area:

Correct Answer:

Report1: CSV -

CSV: The destination writes records as delimited data.

Report2: AVRO -

AVRO supports timestamps.

Not Parquet, TSV: Not options for Azure Data Lake Storage Gen2.

Reference:

https://streamsets.com/documentation/datacollector/latest/help/datacollector/UserGuide/Destinations/ADLS-G2-D.html
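
For context on what Report1's access pattern looks like from a serverless SQL pool, here is a minimal sketch; the storage URL and column names are hypothetical. With a columnar format such as Parquet only the requested columns are read from storage, whereas row-based formats such as CSV or Avro read entire rows, which is the trade-off the discussion below centers on.

-- Report1: read 3 of 50 columns; a columnar format lets the engine skip the other 47.
SELECT [Col1], [Col2], [Col3]
FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/reports/report1/*.parquet',
    FORMAT = 'PARQUET'
) AS [result];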

 
alain2
Highly Voted 
1 year ago
1: Parquet - column-oriented binary file format

2: AVRO - Row based format, and has logical type timestamp

https://youtu.be/UrWthx8T3UY
upvoted 92 times

 
azurestudent1498
1 month, 1 week ago


this is correct.
upvoted 1 times

 
terajuana
11 months, 2 weeks ago
the web is full of old information. timestamp support has been added to parquet
upvoted 5 times

 
vlad888
11 months ago
OK, but in the 1st case we need only 3 of 50 columns, and Parquet is a columnar format. In the 2nd, Avro, because it is ideal for reading a full row
upvoted 12 times

 
Himlo24
Highly Voted 
1 year ago
Shouldn't the answer for Report 1 be Parquet? Because Parquet format is Columnar and should be best for reading a few columns only.
upvoted 18 times

 
main616
Most Recent 
1 week, 6 days ago
1. CSV (or JSON). CSV/JSON support query acceleration by selecting specified rows: https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-query-acceleration#overview

2. Avro
upvoted 1 times

 
Dothy
2 weeks ago
1: Parquet

2: AVRO
upvoted 1 times

 
RalphLiang
2 months, 3 weeks ago
Consider Parquet and ORC file formats when the I/O patterns are more read heavy or when the query patterns are focused on a subset of columns
in the records.

the Avro format works well with a message bus such as Event Hub or Kafka that write multiple events/messages in succession.
upvoted 3 times

 
ragz_87
3 months, 3 weeks ago
1. Parquet

2. Avro

https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-best-practices

"Consider using the Avro file format in cases where your I/O patterns are more write heavy, or the query patterns favor retrieving multiple rows of
records in their entirety.

Consider Parquet and ORC file formats when the I/O patterns are more read heavy or when the query patterns are focused on a subset of columns
in the records."
upvoted 5 times

 
SebK
2 months ago
Thank you.
upvoted 1 times

 
MohammadKhubeb
4 months ago
Why NOT csv in report1 ?
upvoted 1 times

 
Sandip4u
4 months, 2 weeks ago
This has to be parquet and AVRO , got the answer from Udemy
upvoted 4 times

 
Mahesh_mm
5 months ago
1. Parquet

2. AVRO
upvoted 3 times

 
marcin1212
5 months, 2 weeks ago
The goal is: The solution must minimize read times.

I made a small test on Databricks plus Data Lake.

The same file saved as Parquet and Avro,

9 million records.

Parquet ~150 MB

Avro ~700 MB

Reading Parquet is always 10 times faster than Avro.

I checked:

- for all data or small range of data with condition

- all or only one column

So I will select option:


- Parquet

- Parquet
upvoted 2 times

 
dev2dev
4 months, 3 weeks ago
how can be faster read is same as number of reads?
upvoted 1 times

 
Ozzypoppe
6 months ago
Solution says parquet is not supported for adls gen 2 but it actually is: https://docs.microsoft.com/en-us/azure/data-factory/format-parquet
upvoted 3 times

 
noranathalie
7 months, 1 week ago
An interesting and complete article that explains the different uses between parquet/avro/csv and gives answers to the question :
https://medium.com/ssense-tech/csv-vs-parquet-vs-avro-choosing-the-right-tool-for-the-right-job-79c9f56914a8
upvoted 4 times

 
elimey
10 months, 1 week ago
https://luminousmen.com/post/big-data-file-formats
upvoted 1 times

 
elimey
10 months, 1 week ago
Report 1 definitely Parquet
upvoted 1 times

 
noone_a
10 months, 3 weeks ago
report 1 - Parquet as it is columar.

report 2 - avro as it is row based and can be compressed further than csv.
upvoted 1 times

 
bsa_2021
11 months, 1 week ago
The answer provided and the answer from the discussion differ. Which one should I follow for the actual exam?
upvoted 1 times

 
Yaduvanshi
7 months, 2 weeks ago
Follow what feels logical after reading the answer and the discussion forum.
upvoted 2 times

 
bc5468521
12 months ago
1- Parquet

2- Parquet

Since they are all querying: Avro is good for writing (OLTP), Parquet is good for querying/reading
upvoted 5 times


Question #6 Topic 1

You are designing the folder structure for an Azure Data Lake Storage Gen2 container.

Users will query data by using a variety of services including Azure Databricks and Azure Synapse Analytics serverless SQL pools. The data will be
secured by subject area. Most queries will include data from the current year or current month.

Which folder structure should you recommend to support fast queries and simplified folder security?

A.
/{SubjectArea}/{DataSource}/{DD}/{MM}/{YYYY}/{FileData}_{YYYY}_{MM}_{DD}.csv

B.
/{DD}/{MM}/{YYYY}/{SubjectArea}/{DataSource}/{FileData}_{YYYY}_{MM}_{DD}.csv

C.
/{YYYY}/{MM}/{DD}/{SubjectArea}/{DataSource}/{FileData}_{YYYY}_{MM}_{DD}.csv

D.
/{SubjectArea}/{DataSource}/{YYYY}/{MM}/{DD}/{FileData}_{YYYY}_{MM}_{DD}.csv

Correct Answer:
D

There's an important reason to put the date at the end of the directory structure. If you want to lock down certain regions or subject matters to
users/groups, then you can easily do so with the POSIX permissions. Otherwise, if there was a need to restrict a certain security group to
viewing just the UK data or certain planes, with the date structure in front a separate permission would be required for numerous directories
under every hour directory. Additionally, having the date structure in front would exponentially increase the number of directories as time went
on.

Note: In IoT workloads, there can be a great deal of data being landed in the data store that spans across numerous products, devices,
organizations, and customers. It's important to pre-plan the directory layout for organization, security, and efficient processing of the data for
down-stream consumers. A general template to consider might be the following layout:

{Region}/{SubjectMatter(s)}/{yyyy}/{mm}/{dd}/{hh}/

Community vote distribution


D (100%)

 
sagga
Highly Voted 
1 year ago
D is correct

https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-best-practices#batch-jobs-structure
upvoted 41 times

 
Dothy
Most Recent 
2 weeks ago
D is correct
upvoted 1 times

 
Olukunmi
1 month ago
Selected Answer: D
D is correct
upvoted 1 times

 
Egocentric
1 month, 1 week ago
D is correct
upvoted 1 times

 
SebK
2 months ago
Selected Answer: D
D is correct
upvoted 2 times

 
RalphLiang
2 months, 3 weeks ago
Selected Answer: D
D is correct
upvoted 1 times

 
NeerajKumar
2 months, 3 weeks ago
Selected Answer: D
Correct
upvoted 1 times

 
PallaviPatel
4 months ago
Selected Answer: D
Correct
upvoted 1 times


 
Skyrocket
4 months ago
D is correct
upvoted 1 times

 
VeroDon
4 months, 3 weeks ago
Selected Answer: D
Thats correct
upvoted 2 times

 
Mahesh_mm
5 months ago
D is correct
upvoted 1 times

 
alexleonvalencia
5 months, 2 weeks ago
The correct answer is D.
upvoted 1 times

 
rashjan
5 months, 2 weeks ago
Selected Answer: D
D is correct
upvoted 1 times

 
ohana
7 months ago
Took the exam today, this question came out.

Ans: D
upvoted 4 times

 
Sunnyb
11 months, 3 weeks ago
D is absolutely correct
upvoted 2 times


Question #7 Topic 1

HOTSPOT -

You need to output files from Azure Data Factory.

Which file format should you use for each type of output? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.

Hot Area:

Correct Answer:

Box 1: Parquet -

Parquet stores data in columns, while Avro stores data in a row-based format. By their very nature, column-oriented data stores are optimized
for read-heavy analytical workloads, while row-based databases are best for write-heavy transactional workloads.

Box 2: Avro -

An Avro schema is created using JSON format.

AVRO supports timestamps.

Note: Azure Data Factory supports the following file formats (not GZip or TXT).

✑ Avro format

✑ Binary format

✑ Delimited text format

✑ Excel format

✑ JSON format

✑ ORC format

✑ Parquet format


✑ XML format

Reference:

https://www.datanami.com/2018/05/16/big-data-file-formats-demystified

 
Mahesh_mm
Highly Voted 
5 months ago
Parquet and AVRO is correct option
upvoted 17 times

 
Dothy
Most Recent 
2 weeks ago
agree with the answer
upvoted 2 times

 
RalphLiang
2 months, 3 weeks ago
Parquet and AVRO is correct option
upvoted 2 times

 
PallaviPatel
4 months ago
correct
upvoted 1 times

 
Skyrocket
4 months ago
Parquet and AVRO is right.
upvoted 2 times

 
edba
5 months ago
The GZIP file format is one of the binary formats supported by ADF.

https://docs.microsoft.com/en-us/azure/data-factory/connector-file-system?tabs=data-factory#file-system-as-sink
upvoted 1 times

 
bad_atitude
5 months, 1 week ago
agree with the answer
upvoted 2 times

 
alexleonvalencia
5 months, 2 weeks ago
Correct answer: PARQUET & AVRO.
upvoted 1 times


Question #8 Topic 1

HOTSPOT -

You use Azure Data Factory to prepare data to be queried by Azure Synapse Analytics serverless SQL pools.

Files are initially ingested into an Azure Data Lake Storage Gen2 account as 10 small JSON files. Each file contains the same data attributes and
data from a subsidiary of your company.

You need to move the files to a different folder and transform the data to meet the following requirements:

✑ Provide the fastest possible query times.

✑ Automatically infer the schema from the underlying files.

How should you configure the Data Factory copy activity? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.

Hot Area:

Correct Answer:

Box 1: Preserve hierarchy -

Compared to the flat namespace on Blob storage, the hierarchical namespace greatly improves the performance of directory management
operations, which improves overall job performance.

Box 2: Parquet -

Azure Data Factory parquet format is supported for Azure Data Lake Storage Gen2.

Parquet supports the schema property.

Reference:

https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction https://docs.microsoft.com/en-us/azure/data-
factory/format-parquet

 
alain2
Highly Voted 
1 year ago
1. Merge Files

2. Parquet

https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-performance-tuning-guidance
upvoted 87 times


 
edba
5 months ago
just want to add a bit more reference regarding copyBehavior in ADF plus info mentioned in Best Practice doc, so it shall be MergeFile first.

https://docs.microsoft.com/en-us/azure/data-factory/connector-file-system?tabs=data-factory#file-system-as-sink
upvoted 3 times

 
kilowd
7 months, 1 week ago
Larger files lead to better performance and reduced costs.

Typically, analytics engines such as HDInsight have a per-file overhead that involves tasks such as listing, checking access, and performing
various metadata operations. If you store your data as many small files, this can negatively affect performance. In general, organize your data
into larger sized files for better performance (256 MB to 100 GB in size).
upvoted 3 times

 
Ameenymous
1 year ago
The smaller the files, the worse the performance, so Merge and Parquet seem to be the right answer.
upvoted 14 times

 
captainbee
Highly Voted 
10 months, 3 weeks ago
It's frustrating just how many questions ExamTopics get wrong. Can't be helpful
upvoted 26 times

 
RyuHayabusa
10 months, 1 week ago
At least it helps in learning, as you have to research and think for yourself. Another big point is that having these questions in the first place is immensely helpful
upvoted 30 times

 
SebK
1 month, 4 weeks ago
Agree.
upvoted 2 times

 
gssd4scoder
7 months ago
Trying to understand if an answer is correct will help learn more
upvoted 3 times

 
Dothy
Most Recent 
2 weeks ago
1. Merge Files

2. Parquet
upvoted 1 times

 
KashRaynardMorse
1 month, 1 week ago
A requirement was "Automatically infer the schema from the underlying files", meaning Preserve hierarchy is needed.
upvoted 2 times

 
gabdu
3 weeks, 3 days ago
It is possible that all or some schemas are different; in that case we cannot merge
upvoted 1 times

 
imomins
1 month, 3 weeks ago
Another key point is: you need to move the files to a different folder,

so the answer should be Preserve hierarchy.


upvoted 2 times

 
Eyepatch993
2 months ago
1. Preserve hierarchy - ADF is used only for processing and Synapse is the sink. Since Synapse has parallel processing power, it can process the files in different folders and thus improve performance.
2. Parquet
upvoted 1 times

 
kamil_k
2 months, 1 week ago
Are these answers the actual correct answers or guesses? Who highlights the correct answers?
upvoted 2 times

 
srakrn
4 months ago
"In general, we recommend that your system have some sort of process to aggregate small files into larger ones for use by downstream
applications."

https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-best-practices

Therefore I believe the answer should be 'Merge files' and 'Parquet'.


upvoted 3 times

 
Sandip4u
4 months, 2 weeks ago
Merge and parquet will be the right option , also taken reference from Udemy
upvoted 2 times


 
Mahesh_mm
5 months ago
1. As the hierarchical namespace greatly improves the performance of directory management operations, which improves overall job performance, Preserve hierarchy looks correct. Also, there is overhead for merging files.

2. Parquet
upvoted 3 times

 
Boompiee
2 weeks, 3 days ago
The overhead for merging happens once, after that it's faster every time to query the files if they are merged.
upvoted 1 times

 
m2shines
5 months, 1 week ago
Merge Files and Parquet
upvoted 1 times

 
AM1971
6 months, 2 weeks ago
Shouldn't a JSON file be flattened first? So I think the answer is: flatten and Parquet
upvoted 1 times

 
RinkiiiiiV
7 months, 3 weeks ago
1. Preserve hierarchy

2. Parquet
upvoted 1 times

 
noobplayer
7 months, 2 weeks ago
Is this correct?
upvoted 2 times

 
Marcus1612
8 months, 2 weeks ago
The files are copied/transformed from one folder to another inside the same hierarchical account. The hierarchical property is defined during account creation, so the destination folder still has the hierarchical namespace. On the other hand, as mentioned by Microsoft: Typically, analytics engines such as HDInsight and Azure Data Lake Analytics have a per-file overhead. If you store your data as many small files, this can negatively affect performance. In general, organize your data into larger sized files for better performance (256MB to 100GB in size).
upvoted 2 times

 
meetj
9 months ago
1. Merge for sure

https://docs.microsoft.com/en-us/azure/data-factory/connector-file-system clearly defines the available copy behaviors


upvoted 2 times

 
elimey
10 months, 1 week ago
1. Merge Files: because the question says the 10 small JSON files are moved to a different folder

2. Parquet
upvoted 5 times

 
Erte
11 months ago
Box 1: Preserve hierarchy

Compared to the flat namespace on Blob storage, the hierarchical namespace greatly improves the performance of directory management
operations, which

improves overall job performance.

Box 2: Parquet

Azure Data Factory parquet format is supported for Azure Data Lake Storage Gen2. Parquet supports the schema property.

Reference:

https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction https://docs.microsoft.com/en-us/azure/data-
factory/format-parquet
upvoted 2 times


Question #9 Topic 1

HOTSPOT -

You have a data model that you plan to implement in a data warehouse in Azure Synapse Analytics as shown in the following exhibit.

All the dimension tables will be less than 2 GB after compression, and the fact table will be approximately 6 TB. The dimension tables will be
relatively static with very few data inserts and updates.

Which type of table should you use for each table? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.

Hot Area:


Correct Answer:

Box 1: Replicated -

Replicated tables are ideal for small star-schema dimension tables, because the fact table is often distributed on a column that is not
compatible with the connected dimension tables. If this case applies to your schema, consider changing small dimension tables currently
implemented as round-robin to replicated.

Box 2: Replicated -

Box 3: Replicated -

Box 4: Hash-distributed -

For Fact tables use hash-distribution with clustered columnstore index. Performance improves when two hash tables are joined on the same
distribution column.
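
A minimal sketch of the two table patterns, assuming illustrative table and key names (the real ones come from the exhibit).

-- Small (< 2 GB), mostly static dimension: replicate a copy to every Compute node.
CREATE TABLE dbo.DimProduct
(
    [ProductKey]  INT           NOT NULL,
    [ProductName] NVARCHAR(100) NULL
)
WITH (DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX);

-- ~6 TB fact table: hash-distribute on a join key, with a clustered columnstore index.
CREATE TABLE dbo.FactSales
(
    [ProductKey]  INT            NOT NULL,
    [SalesAmount] DECIMAL(18, 2) NOT NULL
)
WITH (DISTRIBUTION = HASH([ProductKey]), CLUSTERED COLUMNSTORE INDEX);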

Reference:

https://azure.microsoft.com/en-us/updates/reduce-data-movement-and-make-your-queries-more-efficient-with-the-general-availability-of-
replicated-tables/ https://azure.microsoft.com/en-us/blog/replicated-tables-now-generally-available-in-azure-sql-data-warehouse/

 
ian_viana
Highly Voted 
8 months, 1 week ago
The answer is correct.

The dims are under 2 GB, so there is no point in using hash.

Common distribution methods for tables:

The table category often determines which option to choose for distributing the table.

Table category Recommended distribution option

Fact -Use hash-distribution with clustered columnstore index. Performance improves when two hash tables are joined on the same distribution
column.

Dimension - Use replicated for smaller tables. If tables are too large to store on each Compute node, use hash-distributed.

Staging - Use round-robin for the staging table. The load with CTAS is fast. Once the data is in the staging table, use INSERT...SELECT to move the
data to production tables.

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-overview#common-distribution-
methods-for-tables
upvoted 30 times

 
GameLift
7 months, 3 weeks ago
Thanks, but where in the question does it indicate that the fact table has a clustered columnstore index?
upvoted 3 times

 
berserksap
6 months, 4 weeks ago


Normally for big tables we use a clustered columnstore index for optimal performance and compression. Since the table mentioned here is in TBs, we can safely assume using this index is the best choice

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-index
upvoted 2 times

 
berserksap
6 months, 4 weeks ago
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-overview
upvoted 1 times

 
ohana
Highly Voted 
7 months ago
Took the exam today, this question came out.

Ans: All the Dim tables --> Replicated

Fact Tables --> Hash Distributed


upvoted 21 times

 
Dothy
Most Recent 
2 weeks ago
The answer is correct.
upvoted 1 times

 
PallaviPatel
4 months ago
correct answer
upvoted 2 times

 
Pritam85
4 months ago
Got this question on 23/12/2021... the answer is correct
upvoted 1 times

 
Mahesh_mm
5 months ago
Ans is correct
upvoted 2 times

 
alfonsodisalvo
6 months, 3 weeks ago
Dimension are Replicated :

"Since the table has multiple copies, replicated tables work best when the table size is less than 2 GB compressed."

"Replicated tables may not yield the best query performance when:

The table has frequent insert, update, and delete operations"

" We recommend using replicated tables instead of round-robin tables in most cases"

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/design-guidance-for-replicated-tables
upvoted 1 times

 
gssd4scoder
7 months ago
Correct: https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-overview#common-distribution-methods-for-tables
upvoted 1 times


Question #10 Topic 1

HOTSPOT -

You have an Azure Data Lake Storage Gen2 container.

Data is ingested into the container, and then transformed by a data integration application. The data is NOT modified after that. Users can read
files in the container but cannot modify the files.

You need to design a data archiving solution that meets the following requirements:

✑ New data is accessed frequently and must be available as quickly as possible.

✑ Data that is older than five years is accessed infrequently but must be available within one second when requested.

✑ Data that is older than seven years is NOT accessed. After seven years, the data must be persisted at the lowest cost possible.

✑ Costs must be minimized while maintaining the required availability.

How should you manage the data? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point

Hot Area:

Correct Answer:

Box 1: Move to cool storage -

Box 2: Move to archive storage -

Archive - Optimized for storing data that is rarely accessed and stored for at least 180 days with flexible latency requirements, on the order of
hours.

The following table shows a comparison of premium performance block blob storage, and the hot, cool, and archive access tiers.


Reference:

https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blob-storage-tiers

 
yobllip
Highly Voted 
11 months, 3 weeks ago
Answer should be

1 - Cool

2 - Archive

The comparison table shows that the access time (time to first byte) for the cool tier is milliseconds

https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blob-storage-tiers#comparing-block-blob-storage-options
upvoted 48 times

 
r00s
1 day, 16 hours ago
Right. #1 is Cool because it's clearly mentioned in the documentation that "Older data sets that are not used frequently, but are expected to be
available for immediate access"

https://docs.microsoft.com/en-us/azure/storage/blobs/access-tiers-overview#comparing-block-blob-storage-options
upvoted 1 times

 
Justbu
Highly Voted 
8 months, 1 week ago
Tricky question, it says data that is OLDER THAN (> 5 years), must be available within one second when requested

But the first question asks for Five-year-old data, which is =5, so it can also be hot storage

Similarly for the seven-year-old.

Not sure, please confirm?


upvoted 8 times

 
Dothy
Most Recent 
2 weeks ago
ans is correct
upvoted 1 times

 
PallaviPatel
4 months ago
ans is correct
upvoted 1 times

 
ANath
4 months, 3 weeks ago
1. Cool Storage

2. Archive Storage
upvoted 1 times

 
Mahesh_mm
5 months ago
Answer is correct
upvoted 1 times

 
ssitb
11 months, 4 weeks ago

Answer should be

1-hot

2-archive

https://www.bmc.com/blogs/cold-vs-hot-data-storage/

Cold storage data retrieval can take much longer than hot storage. It can take minutes to hours to access cold storage data
upvoted 2 times

 
marcin1212
5 months, 1 week ago
https://www.bmc.com/blogs/cold-vs-hot-data-storage/

It isn't about Azure !


upvoted 2 times

 
captainbee
11 months, 3 weeks ago
Cold storage takes milliseconds to retrieve
upvoted 5 times

 
syamkumar
11 months, 3 weeks ago
I also doubt whether it's hot storage and archive, because it's mentioned that 5-year-old data has to be retrieved within a second, which is not possible via cold storage
upvoted 1 times

 
savin
11 months, 1 week ago
But the cost factor is also there. Keeping the data in the hot tier for 5 years vs. the cool tier for 5 years would add a significant amount.
upvoted 1 times

 
DrC
12 months ago
Answer is correct
upvoted 8 times


Question #11 Topic 1

DRAG DROP -

You need to create a partitioned table in an Azure Synapse Analytics dedicated SQL pool.

How should you complete the Transact-SQL statement? To answer, drag the appropriate values to the correct targets. Each value may be used
once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content.

NOTE: Each correct selection is worth one point.

Select and Place:

Correct Answer:

Box 1: DISTRIBUTION -
Table distribution options include DISTRIBUTION = HASH ( distribution_column_name ), assigns each row to one distribution by hashing the
value stored in distribution_column_name.

Box 2: PARTITION -

Table partition options. Syntax:

PARTITION ( partition_column_name RANGE [ LEFT | RIGHT ] FOR VALUES ( [ boundary_value [,...n] ] ))
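
A hedged sketch of the completed statement; the table, column, and boundary values are placeholders (the comments below discuss boundaries of 1, 1000000, and 2000000 with RANGE LEFT).

CREATE TABLE dbo.FactSales
(
    [ID]          INT            NOT NULL,
    [SalesAmount] DECIMAL(18, 2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH([ID]),                                    -- Box 1: DISTRIBUTION
    PARTITION ([ID] RANGE LEFT FOR VALUES (1, 1000000, 2000000))  -- Box 2: PARTITION
);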

Reference:

https://docs.microsoft.com/en-us/sql/t-sql/statements/create-table-azure-sql-data-warehouse

 
Sunnyb
Highly Voted 
11 months, 3 weeks ago
Answer is correct
upvoted 43 times

 
Sasha_in_San_Francisco
Highly Voted 
6 months, 3 weeks ago
Correct answer, but how to remember it? The Distribution option comes before the Partition option because 'D' comes before 'P', or because the system needs to know the distribution algorithm (hash, round-robin, replicate) before it can start to partition or segment the data. (Seem reasonable?)
upvoted 26 times

 
Dothy
Most Recent 
2 weeks ago
Answer is correct
upvoted 1 times

 
Egocentric
1 month, 1 week ago
provided answer is correct
upvoted 1 times

 
vineet1234
1 month, 3 weeks ago
D comes before P as in DP-203
upvoted 3 times

 
PallaviPatel
4 months ago

correct
upvoted 1 times

 
Jaws1990
4 months, 3 weeks ago
Wouldn't VALUES(1, 1000000, 2000000) create a partition for records with ID <= 1, which would mean 1 row?
upvoted 1 times

 
ploer
3 months, 2 weeks ago
Having three boundaries will lead to four partitions:

1. Partition for values < 1

2. Partition for values from 1 to 999999

3. Partition for values from 1000000 to 1999999

4. Partition for values >= 2000000


upvoted 2 times

 
nastyaaa
3 months, 1 week ago
But it would be <= and >, since it is RANGE LEFT FOR VALUES, right?
upvoted 1 times

 
Mahesh_mm
5 months ago
Answer is correct
upvoted 1 times

 
hugoborda
8 months ago
Answer is correct
upvoted 1 times

 
hsetin
8 months, 3 weeks ago
Indeed! Answer is correct
upvoted 1 times


Question #12 Topic 1

You need to design an Azure Synapse Analytics dedicated SQL pool that meets the following requirements:

✑ Can return an employee record from a given point in time.

✑ Maintains the latest employee information.

✑ Minimizes query complexity.

How should you model the employee data?

A.
as a temporal table

B.
as a SQL graph table

C.
as a degenerate dimension table

D.
as a Type 2 slowly changing dimension (SCD) table

Correct Answer:
D

A Type 2 SCD supports versioning of dimension members. Often the source system doesn't store versions, so the data warehouse load process
detects and manages changes in a dimension table. In this case, the dimension table must use a surrogate key to provide a unique reference to
a version of the dimension member. It also includes columns that define the date range validity of the version (for example, StartDate and
EndDate) and possibly a flag column (for example,

IsCurrent) to easily filter by current dimension members.
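
A minimal sketch of a Type 2 SCD employee dimension, with illustrative column names and an illustrative distribution choice (nothing here comes from the question itself).

CREATE TABLE dbo.DimEmployee
(
    [EmployeeKey]  INT           NOT NULL,  -- surrogate key, one value per version (assigned by the load process)
    [EmployeeID]   INT           NOT NULL,  -- business key from the source system
    [EmployeeName] NVARCHAR(100) NOT NULL,
    [JobTitle]     NVARCHAR(100) NULL,
    [StartDate]    DATE          NOT NULL,  -- version validity start
    [EndDate]      DATE          NULL,      -- NULL for the current version
    [IsCurrent]    BIT           NOT NULL   -- quick filter for the latest record
)
WITH (DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX);

-- Point-in-time lookup with low query complexity:
SELECT *
FROM dbo.DimEmployee
WHERE [EmployeeID] = 42
  AND '2021-06-01' >= [StartDate]
  AND ([EndDate] IS NULL OR '2021-06-01' < [EndDate]);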

Reference:

https://docs.microsoft.com/en-us/learn/modules/populate-slowly-changing-dimensions-azure-synapse-analytics-pipelines/3-choose-between-
dimension-types

Community vote distribution


D (100%)

 
bc5468521
Highly Voted 
12 months ago
Answer D; a temporal table is better than SCD2, but it is not supported in Synapse yet
upvoted 47 times

 
sparkchu
2 months, 2 weeks ago
Though this is not something related to this question: temporal tables look similar to Delta tables.
upvoted 1 times

 
Preben
11 months, 3 weeks ago
Here's the documentation for how to implement temporal tables in Synapse from 2019.

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-temporary
upvoted 1 times

 
mbravo
11 months, 2 weeks ago
Temporal tables and Temporary tables are two very distinct concepts. Your link has absolutely nothing to do with this question.
upvoted 11 times

 
Vaishnav
10 months, 2 weeks ago
https://docs.microsoft.com/en-us/azure/azure-sql/temporal-tables

Answer : A Temporal Tables


upvoted 1 times

 
berserksap
6 months, 4 weeks ago
I think synapse doesn't support temporal tables. Please check the below comment by hsetin.
upvoted 1 times

 
rashjan
Highly Voted 
5 months, 2 weeks ago
Selected Answer: D
D is correct (voting in a comment so people don't always have to open the discussion; please upvote to help others)
upvoted 39 times

 
Dothy
Most Recent 
2 weeks ago
Answer is correct
upvoted 2 times

 
Martin_Nbg
1 month ago
Temporal tables are not supported in Synapse so D is correct.


upvoted 2 times

 
sparkchu
1 month, 3 weeks ago
overall, you should use delta table :@
upvoted 1 times

 
PallaviPatel
4 months ago
Selected Answer: D
correct
upvoted 1 times

 
Adelina
4 months, 2 weeks ago
Selected Answer: D
D is correct
upvoted 2 times

 
dev2dev
4 months, 2 weeks ago
Confusing highly voted comment. D is SCD2, but the comment is talking about a temporal table. Either way, SCD2 is the right answer, which is choice D
upvoted 1 times

 
VeroDon
4 months, 3 weeks ago
Selected Answer: D
Dedicated SQL Pools is the key
upvoted 3 times

 
Mahesh_mm
5 months ago
Answer is D
upvoted 1 times

 
hsetin
8 months, 3 weeks ago
Answer is D. Microsoft seems to have confirmed this.

https://docs.microsoft.com/en-us/answers/questions/130561/temporal-table-in-azure-synapse.html#:~:text=Unfortunately%2C%20we%20do%20not%20support,submitted%20by%20another%20Azure%20customer.
upvoted 3 times

 
dd1122
9 months, 2 weeks ago
Answer D is correct. Temporal tables mentioned in the link below are supported in Azure SQL Database (PaaS) and Azure SQL Managed Instance, whereas in this question dedicated SQL pools are mentioned, so no temporal tables can be used. SCD Type 2 is the answer.

https://docs.microsoft.com/en-us/azure/azure-sql/temporal-tables
upvoted 4 times

 
escoins
11 months ago
Definitively answer D
upvoted 3 times

 
[Removed]
11 months, 1 week ago
The answer is A - Temporal tables

"Temporal tables enable you to restore row versions from any point in time."

https://docs.microsoft.com/en-us/azure/azure-sql/database/business-continuity-high-availability-disaster-recover-hadr-overview
upvoted 1 times

 
Dileepvikram
11 months, 3 weeks ago
The requirement says that the table should store latest information, so the answer should be temporal table, right? Because scd type 2 will store
the complete history.
upvoted 1 times

 
captainbee
11 months, 3 weeks ago
Also needs to return employee information from a given point in time? Full history needed for that.
upvoted 12 times


Question #13 Topic 1

You have an enterprise-wide Azure Data Lake Storage Gen2 account. The data lake is accessible only through an Azure virtual network named
VNET1.

You are building a SQL pool in Azure Synapse that will use data from the data lake.

Your company has a sales team. All the members of the sales team are in an Azure Active Directory group named Sales. POSIX controls are used
to assign the

Sales group access to the files in the data lake.

You plan to load data to the SQL pool every hour.

You need to ensure that the SQL pool can load the sales data from the data lake.

Which three actions should you perform? Each correct answer presents part of the solution.

NOTE: Each area selection is worth one point.

A.
Add the managed identity to the Sales group.

B.
Use the managed identity as the credentials for the data load process.

C.
Create a shared access signature (SAS).

D.
Add your Azure Active Directory (Azure AD) account to the Sales group.

E.
Use the shared access signature (SAS) as the credentials for the data load process.

F.
Create a managed identity.

Correct Answer:
ABF

The managed identity grants permissions to the dedicated SQL pools in the workspace.

Note: Managed identity for Azure resources is a feature of Azure Active Directory. The feature provides Azure services with an automatically
managed identity in

Azure AD -

Reference:

https://docs.microsoft.com/en-us/azure/synapse-analytics/security/synapse-workspace-managed-identity
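
For illustration only, a minimal T-SQL sketch of step B, assuming the workspace managed identity already exists and has been added to the Sales group; the storage account (datalake1), container (sales), and staging table (dbo.StageSales) names are hypothetical:

COPY INTO dbo.StageSales
FROM 'https://datalake1.dfs.core.windows.net/sales/*.parquet'
WITH (
    FILE_TYPE  = 'PARQUET',
    CREDENTIAL = (IDENTITY = 'Managed Identity')  -- the SQL pool authenticates to the data lake as its managed identity
);

Because the managed identity is a member of the Sales group, the POSIX ACLs already assigned to that group apply to the hourly load without needing a SAS token.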

Community vote distribution


ABF (100%)

 
Diane
Highly Voted 
1 year ago
correct answer is ABF https://www.examtopics.com/discussions/microsoft/view/41207-exam-dp-200-topic-1-question-56-discussion/
upvoted 61 times

 
AvithK
9 months, 2 weeks ago
yes but the order is different it is FAB
upvoted 24 times

 
gssd4scoder
7 months ago
Exactly, agree with you
upvoted 1 times

 
KingIlo
9 months, 1 week ago
The question didn't specify order or sequence
upvoted 9 times

 
IDKol
Highly Voted 
10 months, 1 week ago
Correct Answer should be

F. Create a managed identity.

A. Add the managed identity to the Sales group.

B. Use the managed identity as the credentials for the data load process.
upvoted 20 times

 
Dothy
Most Recent 
2 weeks ago
correct answer is ABF
upvoted 1 times

 
Egocentric
1 month, 1 week ago
ABF is correct
upvoted 1 times


 
praticewizards
2 months ago
Selected Answer: ABF
FAB - create, add to group, use to load data
upvoted 1 times

 
Backy
4 months, 2 weeks ago
Is answer A properly worded?

"Add the managed identity to the Sales group" should be "Add the Sales group to managed identity"
upvoted 3 times

 
lukeonline
4 months, 3 weeks ago
Selected Answer: ABF
FAB should be correct
upvoted 4 times

 
VeroDon
4 months, 3 weeks ago
Selected Answer: ABF
FAB is correct sequence
upvoted 2 times

 
SabaJamal2010AtGmail
4 months, 4 weeks ago
1. Create a managed identity.

2. Add the managed identity to the Sales group.

3. Use the managed identity as the credentials for the data load process.
upvoted 2 times

 
Mahesh_mm
5 months ago
FAB is correct sequence
upvoted 1 times

 
Lewistrick
5 months ago
Would it even be a good idea to have the data load process be part of the Sales team? They have separate responsibilities, so should be part of
another group. I know that's not possible in the answer list, but I'm trying to think best practices here.
upvoted 2 times

 
Aslam208
6 months ago
Selected Answer: ABF
Correct answer is F, A, B
upvoted 6 times

 
FredNo
6 months, 1 week ago
Selected Answer: ABF
use managed identity
upvoted 5 times

 
ohana
7 months ago
Took the exam today. Similar question came out. Ans: ABF.

Use managed identity!


upvoted 3 times

 
Eniyan
8 months ago
It should be FAB. Please refer to the following reference: https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/quickstart-bulk-load-copy-tsql-examples
upvoted 3 times

 
AvithK
9 months, 2 weeks ago
I don't get why it doesn't start with F. The managed identity should be created first, right?
upvoted 2 times

 
Mazazino
7 months, 1 week ago
There's no mentioning of sequence. The question is just about the right steps
upvoted 2 times

 
MonemSnow
10 months, 3 weeks ago
A, C, F is the correct answer
upvoted 1 times


Question #14 Topic 1

HOTSPOT -

You have an Azure Synapse Analytics dedicated SQL pool that contains the users shown in the following table.

User1 executes a query on the database, and the query returns the results shown in the following exhibit.

User1 is the only user who has access to the unmasked data.

Use the drop-down menus to select the answer choice that completes each statement based on the information presented in the graphic.

NOTE: Each correct selection is worth one point.

Hot Area:


Correct Answer:

Box 1: 0 -

The YearlyIncome column is of the money data type.

The Default masking function: Full masking according to the data types of the designated fields

✑ Use a zero value for numeric data types (bigint, bit, decimal, int, money, numeric, smallint, smallmoney, tinyint, float, real).

Box 2: the values stored in the database

Users with administrator privileges are always excluded from masking, and see the original data without any mask.

Reference:

https://docs.microsoft.com/en-us/azure/azure-sql/database/dynamic-data-masking-overview
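
As a hedged illustration of the behavior above, the default mask is declared per column; the table name dbo.DimEmployee below is a placeholder, and only the YearlyIncome column comes from the exhibit:

-- Non-privileged users such as User2 see 0 for this money column
ALTER TABLE dbo.DimEmployee
ALTER COLUMN YearlyIncome ADD MASKED WITH (FUNCTION = 'default()');

-- Administrators such as User1 bypass the mask and see the values stored in the database;
-- other users would only see unmasked data if explicitly granted, e.g. GRANT UNMASK TO SomeUser;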

 
hsetin
Highly Voted 
8 months, 3 weeks ago
user 1 is admin, so he will see the value stored in dbms.

1. 0

2. Value in database
upvoted 48 times

 
azurearmy
7 months ago
2 is wrong
upvoted 2 times

 
rjile
Highly Voted 
9 months ago
• Use a zero value for numeric data types (bigint, bit, decimal, int, money, numeric, smallint, smallmoney, tinyint, float, real).

• Use 01-01-1900 for date/time data types (date, datetime2, datetime, datetimeoffset, smalldatetime, time).
upvoted 13 times

 
berserksap
6 months, 4 weeks ago
The second question is queried by User 1 who is the admin
upvoted 13 times

 
Dothy
Most Recent 
2 weeks ago
1. 0

2. Value in database
upvoted 1 times

 
Egocentric
1 month, 1 week ago
on this question its just about paying attention to detail
upvoted 3 times

 
manan16
1 month, 2 weeks ago
How can user2 access the data if it is masked?
upvoted 1 times

 
manan16
1 month, 2 weeks ago
Can Someone explain first option as in doc it says 0
upvoted 1 times

 
Mahesh_mm
5 months ago

1. 0 (the default mask for the money data type will be returned when the column is queried by user2)

2. Value in database ( As it is queried by user1 who is admin )


upvoted 6 times

 
Milan1988
7 months ago
CORRECT
upvoted 2 times

 
gssd4scoder
7 months ago
Agree with answer, but I see a typo in the question db_datereader MUST be db_datareader.
upvoted 3 times

 
Jiddu
7 months, 3 weeks ago
0 for money and 1/1/1900 for dates

https://docs.microsoft.com/en-us/sql/relational-databases/security/dynamic-data-masking?view=sql-server-ver15
upvoted 4 times

 
GervasioMontaNelas
8 months, 3 weeks ago
Its correct
upvoted 2 times

 
rjile
9 months ago
correct?
upvoted 2 times

 
Mazazino
7 months, 1 week ago
yes, it's correct
upvoted 2 times


Question #15 Topic 1

You have an enterprise data warehouse in Azure Synapse Analytics.

Using PolyBase, you create an external table named [Ext].[Items] to query Parquet files stored in Azure Data Lake Storage Gen2 without importing
the data to the data warehouse.

The external table has three columns.

You discover that the Parquet files have a fourth column named ItemID.

Which command should you run to add the ItemID column to the external table?

A.

B.

C.

D.

Correct Answer:
C

Incorrect Answers:

A, D: Only these Data Definition Language (DDL) statements are allowed on external tables:

✑ CREATE TABLE and DROP TABLE

✑ CREATE STATISTICS and DROP STATISTICS

✑ CREATE VIEW and DROP VIEW

Reference:

https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-table-transact-sql
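
Because ALTER is not among the allowed DDL statements, option C drops and recreates the external table with the extra column, roughly as sketched below; the existing column names, location, data source, and file format names are placeholders rather than values from the question:

DROP EXTERNAL TABLE [Ext].[Items];

CREATE EXTERNAL TABLE [Ext].[Items]
(
    [Col1]   NVARCHAR(100),   -- placeholders for the three existing columns
    [Col2]   NVARCHAR(100),
    [Col3]   DECIMAL(18, 2),
    [ItemID] INT              -- the newly discovered fourth column
)
WITH
(
    LOCATION    = '/items/',          -- assumed folder in Data Lake Storage Gen2
    DATA_SOURCE = MyAdlsDataSource,   -- assumed existing external data source
    FILE_FORMAT = MyParquetFormat     -- assumed existing Parquet file format
);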

 
Chien_Nguyen_Van
Highly Voted 
8 months, 3 weeks ago
C is correct

https://www.examtopics.com/discussions/microsoft/view/19469-exam-dp-200-topic-1-question-27-discussion/
upvoted 29 times

 
Ozren
Most Recent 
2 months, 1 week ago
Good thing the details are shown here: "The external table has three columns." And the solution yet reveals the column details. This doesn't make
any sense to me. If C is the correct answer (only one that seems acceptable), then the question itself is flawed.
upvoted 2 times

 
PallaviPatel
3 months, 4 weeks ago
c is correct.
upvoted 1 times

 
hugoborda
8 months ago
Answer is correct
upvoted 2 times


Question #16 Topic 1

HOTSPOT -

You have two Azure Storage accounts named Storage1 and Storage2. Each account holds one container and has the hierarchical namespace
enabled. The system has files that contain data stored in the Apache Parquet format.

You need to copy folders and files from Storage1 to Storage2 by using a Data Factory copy activity. The solution must meet the following
requirements:

✑ No transformations must be performed.

✑ The original folder structure must be retained.

✑ Minimize time required to perform the copy activity.

How should you configure the copy activity? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.

Hot Area:

Correct Answer:

Box 1: Parquet -

For Parquet datasets, the type property of the copy activity source must be set to ParquetSource.

Box 2: PreserveHierarchy -

PreserveHierarchy (default): Preserves the file hierarchy in the target folder. The relative path of the source file to the source folder is identical
to the relative path of the target file to the target folder.

Incorrect Answers:

✑ FlattenHierarchy: All files from the source folder are in the first level of the target folder. The target files have autogenerated names.

✑ MergeFiles: Merges all files from the source folder to one file. If the file name is specified, the merged file name is the specified name.
Otherwise, it's an autogenerated file name.

Reference:

https://docs.microsoft.com/en-us/azure/data-factory/format-parquet https://docs.microsoft.com/en-us/azure/data-factory/connector-azure-
data-lake-storage

 
EddyRoboto
Highly Voted 
9 months ago
This could be binary as source and sink, since there are no transformations on files. I tend to believe that would be binary the correct anwer.


upvoted 43 times

 
michalS
8 months, 3 weeks ago
I agree. If it's just copying then binary is fine and would probably be faster
upvoted 6 times

 
iooj
3 months, 1 week ago
Agree. I've checked it. With binary source and sink datasets it works.
upvoted 2 times

 
rav009
8 months ago
agree. When using Binary dataset, the service does not parse file content but treat it as-is.

Not parsing the file will save the time. (https://docs.microsoft.com/en-us/azure/data-factory/format-binary)


So Binary!
upvoted 8 times

 
GameLift
7 months, 1 week ago
But the doc says "When using Binary dataset in copy activity, you can only copy from Binary dataset to Binary dataset." So I guess it's parquet
then?
upvoted 3 times

 
captainpike
7 months, 1 week ago
This note is referring to the fact that, in the template, you have to specify "BinarySink" as the type for the target sink, and that is exactly what the Copy Data tool does (you can check this by editing the created copy pipeline and viewing the code). Choosing Binary and PreserveHierarchy copies all the files exactly as they are.
upvoted 3 times

 
AbhiGola
Highly Voted 
8 months, 3 weeks ago
The answer seems correct: the data is already stored as Parquet and the requirement is to perform no transformation, so the answer is right.
upvoted 34 times

 
NintyFour
1 week, 2 days ago
As question has mentioned, Minimize time required to perform the copy activity.

And binary is faster than Parquet. Hence, Binary is answer


upvoted 1 times

 
NintyFour
Most Recent 
1 week, 2 days ago
As question has mentioned, Minimize time required to perform the copy activity.

And binary is faster than Parquet. Hence, Binary is answer


upvoted 1 times

 
AzureRan
2 weeks ago
Is it binary or parquet?
upvoted 1 times

 
DingDongSingSong
2 months ago
So what is the answer to this question? Binary or Parquet? The file is a ParquetFile. If you're simply copying a file, you just need to define the right
source type (i.e. Parquet) in this instance. Why would you even consider Binary when the file isn't Binary type
upvoted 2 times

 
kamil_k
2 months, 1 week ago
I've just tested it in Azure, created two Gen2 storage accounts, used Binary as source and destination, placed two parquet files in account one.
Created pipeline in ADF, added copy data activity and then defined first binary as source with wildcard path (*.parquet) and the sink as binary, with
linked service for account 2, selected PreserveHierarchy. It worked.
upvoted 6 times

 
AnshulSuryawanshi
2 months, 3 weeks ago
When using Binary dataset in copy activity, you can only copy from Binary dataset to Binary dataset
upvoted 1 times

 
Sandip4u
4 months, 2 weeks ago
this should be binary
upvoted 1 times

 
VeroDon
4 months, 3 weeks ago
The type property of the dataset must be set to Parquet

https://docs.microsoft.com/en-us/azure/data-factory/format-parquet#parquet-as-source
upvoted 1 times

 
Mahesh_mm
5 months ago
I think it is Parquet as When using Binary dataset in copy activity, you can only copy from Binary dataset to Binary dataset.
upvoted 2 times

 
Canary_2021
5 months, 1 week ago
If you only copy over files from one storage to another, don't need to read data inside the file, binary should be selected for better performance.


upvoted 5 times

 
m2shines
5 months, 1 week ago
Binary and Preserve Hierarchy should be the answer
upvoted 4 times

 
Lucky_me
5 months, 1 week ago
The answers are correct! Binary doesn't work; I just tried.
upvoted 5 times

 
kamil_k
2 months, 1 week ago
hmm what did you try? I literally created it the same way as described i.e. two gen2 storage accounts. I chose gen2 as source linked service with
binary as file type and the same for destination. In the copy data activity in ADF pipeline I specified preserve hierarchy and it worked as
expected.
upvoted 1 times

 
Ozzypoppe
5 months, 2 weeks ago
https://docs.microsoft.com/en-us/azure/data-factory/format-parquet#parquet-as-source
upvoted 2 times

 
medsimus
7 months, 2 weeks ago
The correct answer is Binary , I test it
upvoted 8 times


Question #17 Topic 1

You have an Azure Data Lake Storage Gen2 container that contains 100 TB of data.

You need to ensure that the data in the container is available for read workloads in a secondary region if an outage occurs in the primary region.
The solution must minimize costs.

Which type of data redundancy should you use?

A.
geo-redundant storage (GRS)

B.
read-access geo-redundant storage (RA-GRS)

C.
zone-redundant storage (ZRS)

D.
locally-redundant storage (LRS)

Correct Answer:
B

Geo-redundant storage (with GRS or GZRS) replicates your data to another physical location in the secondary region to protect against regional
outages.

However, that data is available to be read only if the customer or Microsoft initiates a failover from the primary to secondary region. When you
enable read access to the secondary region, your data is available to be read at all times, including in a situation where the primary region
becomes unavailable.

Incorrect Answers:

A: While Geo-redundant storage (GRS) is cheaper than Read-Access Geo-Redundant Storage (RA-GRS), GRS does NOT initiate automatic
failover.

C, D: Locally redundant storage (LRS) and Zone-redundant storage (ZRS) provides redundancy within a single region.

Reference:

https://docs.microsoft.com/en-us/azure/storage/common/storage-redundancy

Community vote distribution


A (65%) B (35%)

 
meetj
Highly Voted 
9 months ago
B is right

Geo-redundant storage (with GRS or GZRS) replicates your data to another physical location in the secondary region to protect against regional
outages. However, that data is available to be read only if the customer or Microsoft initiates a failover from the primary to secondary region.
When you enable read access to the secondary region, your data is available to be read at all times, including in a situation where the primary
region becomes unavailable.
upvoted 53 times

 
dev2dev
4 months, 2 weeks ago
A looks like the correct answer. RA-GRS is always available because of its auto failover, but that is not asked in the question; more importantly, the question is about reducing cost, which points to GRS.
upvoted 13 times

 
BK10
3 months, 1 week ago
It should be A because of two reasons:

1. Minimize cost

2. When primary is unavailable.

Hence No need for RA_GRS


upvoted 7 times

 
Sasha_in_San_Francisco
Highly Voted 
6 months, 3 weeks ago
In my opinion, I believe the answer is A, and this is why.

In the question they state "...available for read workloads in a secondary region IF AN OUTAGE OCCURES in the primary...". Well, answer B (RA-GRS)
states in Microsoft documentation that RA-GRS is for when "...your data is available to be read AT ALL TIMES, including in a situation where the
primary region becomes unavailable."

To me, the nature of the question is what is the cheapest solution which allows for failover to read workload, when there is an outage. Answer (A).

Common sense would be 'A' too because that is probably the most often real-life use case.
upvoted 40 times

 
SabaJamal2010AtGmail
4 months, 4 weeks ago
It's not about common sense rather about technology. With GRS, data remains available even if an entire data center becomes unavailable or if
there is a widespread regional failure. There would be a down time when a region becomes unavailable. Alternately, you could implement read-
access geo-redundant storage (RA-GRS), which provides read-access to the data in alternate locations.
upvoted 2 times


 
prathamesh1996
Most Recent 
6 days, 16 hours ago
A is correct: it minimizes cost, and read access is needed only when the primary is unavailable.
upvoted 1 times

 
Andushi
4 weeks, 1 day ago
Selected Answer: A
A because of costs aspect
upvoted 2 times

 
muove
1 month, 1 week ago
A is correct because of cost: RA-GRS will cost $5,910.73, GRS will cost $4,596.12.
upvoted 2 times

 
Egocentric
1 month, 2 weeks ago
GRS is the correct answer,the key in the question is reducing costs
upvoted 1 times

 
Somesh512
1 month, 2 weeks ago
Selected Answer: A
To reduce cost GRS should be right option
upvoted 1 times

 
KosteK
1 month, 4 weeks ago
Selected Answer: A
GRS is cheaper
upvoted 1 times

 
praticewizards
2 months ago
Selected Answer: B
The explanation is right. The given answer is wrong
upvoted 1 times

 
DingDongSingSong
2 months ago
B is incorrect. The answer is A. GRS is cheaper than RA-GRS. GRS read access is available ONLY once primary region failover occurs (therefore lower
cost). The requirement is for read-access availability in secondary region at lower cost WHEN a failover occurs in primary. Therefore, A is the answer
upvoted 2 times

 
phdphd
2 months, 1 week ago
Selected Answer: A
Got this question on the exam. RA-GRS was not an option, so it should be A.
upvoted 10 times

 
vineet1234
2 months, 2 weeks ago
A is right. GRS means secondary is available ONLY when primary is down. And it is cheaper than RA-GRS (where secondary read access is always
available). The question sneaks in the word 'read workloads' just to confuse.
upvoted 2 times

 
Sgarima
2 months, 3 weeks ago
Selected Answer: B
B is correct.

Geo-redundant storage (with GRS or GZRS) replicates your data to another physical location in the secondary region to protect against regional
outages. However, that data is available to be read only if the customer or Microsoft initiates a failover from the primary to secondary region.
When you enable read access to the secondary region, your data is available to be read at all times, including in a situation where the primary
region becomes unavailable. For read access to the secondary region, enable read-access geo-redundant storage (RA-GRS) or read-access geo-
zone-redundant storage (RA-GZRS).
upvoted 1 times

 
NamitSehgal
3 months ago
A should be the answer as we need data in read only secondary only when something happens at region A, not always.
upvoted 2 times

 
MANESH_PAI
3 months, 1 week ago
Selected Answer: A
It is GRS because GRS is cheaper than RA-GRS

https://azure.microsoft.com/en-gb/pricing/details/storage/blobs/
upvoted 3 times

 
PallaviPatel
3 months, 4 weeks ago
Selected Answer: B
correct
upvoted 2 times

 
Tinaaaaaaa
4 months ago

Selected Answer: B
While Geo-redundant storage (GRS) is cheaper than Read-Access Geo-Redundant Storage (RA-GRS), GRS does NOT initiate automatic failover.
upvoted 2 times


Question #18 Topic 1

You plan to implement an Azure Data Lake Gen 2 storage account.

You need to ensure that the data lake will remain available if a data center fails in the primary Azure region. The solution must minimize costs.

Which type of replication should you use for the storage account?

A.
geo-redundant storage (GRS)

B.
geo-zone-redundant storage (GZRS)

C.
locally-redundant storage (LRS)

D.
zone-redundant storage (ZRS)

Correct Answer:
D

Zone-redundant storage (ZRS) copies your data synchronously across three Azure availability zones in the primary region.

Incorrect Answers:

C: Locally redundant storage (LRS) copies your data synchronously three times within a single physical location in the primary region. LRS is the
least expensive replication option, but is not recommended for applications requiring high availability or durability

Reference:

https://docs.microsoft.com/en-us/azure/storage/common/storage-redundancy

Community vote distribution


D (98%)

 
JohnMasipa
Highly Voted 
9 months ago
This can't be correct. Should be D.
upvoted 70 times

 
JayBird
9 months ago
Why, LRS is cheaper?
upvoted 1 times

 
Vitality
8 months, 2 weeks ago
It is cheaper but LRS helps to replicate data in the same data center while ZRS replicates data synchronously across three storage clusters in
one region. So if one data center fails you should go for ZRS.
upvoted 8 times

 
azurearmy
7 months ago
Also, note that the question talks about failure in "a data center". As long as other data centers are running fine(as in ZRS which will have
many), ZRS would be the least expensive option.
upvoted 6 times

 
MadEgg
Highly Voted 
4 months, 3 weeks ago
Selected Answer: D
First, about the Question:

What fails? -> The (complete) DataCenter, not the region and not components inside a DataCenter.

So, what helps us in this situation?

LRS: "..copies your data synchronously three times within a single physical location in the primary region." Important is here the SINGLE PHYSICAL
LOCATION (meaning inside the same Data Center. So in our scenario all copies wouldn't work anymore.)

-> C is wrong.

ZRS: "...copies your data synchronously across three Azure availability zones in the primary region" (meaning, in different Data Centers. In our
scenario this would meet the requirements)

-> D is right

GRS/GZRS: are like LRS/ZRS but with the Data Centers in different azure regions. This works too but is more expensive than ZRS. So ZRS is the right
answer.

https://docs.microsoft.com/en-us/azure/storage/common/storage-redundancy
upvoted 30 times

 
Ozren
2 months, 1 week ago
Yes, well said, that's the correct answer.
upvoted 1 times

 
Narasimhap
3 months, 1 week ago
Well explained!
upvoted 1 times


 
DrTaz
4 months, 3 weeks ago
I agree.

Please give this comment a medal (or a cookie).


upvoted 3 times

 
olavrab8
Most Recent 
2 weeks, 2 days ago
Selected Answer: D
D -> Data is replicated synchronously
upvoted 1 times

 
Egocentric
1 month, 1 week ago
D is correct
upvoted 2 times

 
ravi2931
1 month, 2 weeks ago
it should be D
upvoted 1 times

 
ravi2931
1 month, 2 weeks ago
see this explained clearly -

LRS is the lowest-cost redundancy option and offers the least durability compared to other options. LRS protects your data against server rack
and drive failures. However, if a disaster such as fire or flooding occurs within the data center, all replicas of a storage account using LRS may be
lost or unrecoverable. To mitigate this risk, Microsoft recommends using zone-redundant storage (ZRS), geo-redundant storage (GRS), or geo-
zone-redundant storage (GZRS)
upvoted 1 times

 
ASG1205
1 month, 2 weeks ago
Selected Answer: D
Answer should be D, as LRS won't be helpfull in case of whole datacenter failure.
upvoted 1 times

 
Andy91
2 months ago
Selected Answer: D
This is the correct answer indeed
upvoted 2 times

 
bhanuprasad9331
2 months, 4 weeks ago
Selected Answer: C
Answer is LRS.

From microsoft docs:

LRS replicates data in a single AZ. An AZ can contain one or more data centers. So, even if one data center fails, data can be accessed through
other data centers in the same AZ.

https://docs.microsoft.com/en-us/azure/availability-zones/az-overview#availability-zones

https://docs.microsoft.com/en-us/azure/storage/common/storage-redundancy#redundancy-in-the-primary-region
upvoted 1 times

 
PallaviPatel
3 months, 4 weeks ago
Selected Answer: D
D is correct.
upvoted 3 times

 
vimalnits
4 months ago
Correct answer is D.
upvoted 2 times

 
Tinaaaaaaa
4 months ago
LRS helps to replicate data in the same data center while ZRS replicates data synchronously across three storage clusters in one region
upvoted 1 times

 
Shatheesh
4 months ago
D is the correct answer. The question clearly states that if a data center fails the data should still be available. LRS stores everything in the same data center, so it's not the correct answer; the next cheapest option is ZRS.
upvoted 1 times

 
Jaws1990
4 months, 3 weeks ago
Selected Answer: D
Mentions data centre (Availability Zone) failure, not rack failure, so should be Zone Redundant Storage.
upvoted 3 times

 
DrTaz
4 months, 3 weeks ago
Selected Answer: D


note that the "data centre fails"


upvoted 2 times

 
VeroDon
4 months, 3 weeks ago
After reading all the comments ill go with LRS. it doesn't mention a disaster. "LRS protects your data against server rack and drive failures"
https://docs.microsoft.com/en-us/azure/storage/common/storage-redundancy.
upvoted 1 times

 
ArunMonika
4 months, 4 weeks ago
I will go with D
upvoted 1 times

 
Mahesh_mm
5 months ago
Answer is D
upvoted 1 times


Question #19 Topic 1

HOTSPOT -

You have a SQL pool in Azure Synapse.

You plan to load data from Azure Blob storage to a staging table. Approximately 1 million rows of data will be loaded daily. The table will be
truncated before each daily load.
You need to create the staging table. The solution must minimize how long it takes to load the data to the staging table.

How should you configure the table? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.

Hot Area:

Correct Answer:

Box 1: Hash -

Hash-distributed tables improve query performance on large fact tables. They can have very large numbers of rows and still achieve high
performance.

Incorrect Answers:

Round-robin tables are useful for improving loading speed.

Box 2: Clustered columnstore -

When creating partitions on clustered columnstore tables, it is important to consider how many rows belong to each partition. For optimal
compression and performance of clustered columnstore tables, a minimum of 1 million rows per distribution and partition is needed.

Box 3: Date -

Table partitions enable you to divide your data into smaller groups of data. In most cases, table partitions are created on a date column.

Partition switching can be used to quickly remove or replace a section of a table.


Reference:

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-partition
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute
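
For comparison, the load-optimized staging pattern that several commenters below argue for (round-robin distribution, a heap, and no partitions) would look roughly like this sketch; the table and column definitions are placeholders:

CREATE TABLE dbo.Stage_DailyLoad
(
    LoadDate   DATE,
    CustomerId INT,
    Amount     DECIMAL(18, 2)
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,   -- fastest distribution option for loading
    HEAP                          -- no index maintenance during the daily load
);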

 
A1000
Highly Voted 
9 months ago
Round-Robin

Heap

None
upvoted 156 times

 
Narasimhap
3 months, 1 week ago
Round- Robin

Heap

None.
No brainer for this question.
upvoted 3 times

 
anto69
4 months, 2 weeks ago
I agree too
upvoted 2 times

 
gssd4scoder
7 months ago
Agree 100%.

All in paragraphs under this: https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-overview.


upvoted 5 times

 
DrTaz
4 months, 3 weeks ago
Also agree 100%
upvoted 2 times

 
laszek
Highly Voted 
8 months, 4 weeks ago
Round-robin - this is the simplest distribution model, not great for querying but fast to process

Heap - no brainer when creating staging tables

No partitions - this is a staging table, why add effort to partition, when truncated daily?
upvoted 29 times

 
Vardhan_Brahmanapally
6 months, 3 weeks ago
Can you explain me why should we use heap?
upvoted 1 times

 
DrTaz
4 months, 3 weeks ago
The term heap basically refers to a table without a clustered index. Adding a clustered index to a temp table makes absolutely no sense and
is a waste of compute resources for a table that would be entirely truncated daily.

no clustered index = heap.


upvoted 3 times

 
SQLDev0000
3 months ago
DrTaz is right, in addition, when you populate an indexed table, you are also writing to the index, so this adds an additional overhead in
the write process
upvoted 2 times

 
berserksap
6 months, 4 weeks ago
Had doubts regarding why there is no need for a partition. While what you suggested is true won't it be better if there is a date partition to
truncate the table ?
upvoted 1 times

 
andy_g
3 months, 1 week ago
There is no filter on a truncate statement so no benefit in having a partition
upvoted 1 times

 
SandipSingha
Most Recent 
2 weeks, 3 days ago
Round-Robin

Heap

None
upvoted 1 times

 
Sandip4u
4 months, 2 weeks ago
Round-robin,heap,none
upvoted 2 times

 
Mahesh_mm
4 months, 4 weeks ago
Round-Robin

Heap


None
upvoted 2 times

 
ArunMonika
4 months, 4 weeks ago
Answer: Round-Robin (1), Heap (2), None (3).
upvoted 1 times

 
m2shines
5 months, 1 week ago
Round-robin, Heap and None
upvoted 1 times

 
Sasha_in_San_Francisco
6 months, 3 weeks ago
Answer: Round-Robin (1), Heap (2), None (3).

Within this doc:

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-overview

#1. Search for “Use round-robin for the staging table.”

#2. Search for: “A heap table can be especially useful for loading data, such as a staging table,…”

Within this doc:

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-partition?context=/azure/synapse-
analytics/context/context
#3. Partitioning by date is useful when stage destination has data because you can hide the inserting data’s new partition (to keep users from
hitting it), complete the load and then unhide the new partition.

However, in this question it states, “the table will be truncated before each daily load”, so, it appears it’s a true Staging table and there are no users
with access, no existing data, and I see no reason to have a Date partition. To me, such a partition would do nothing but slow the load.
upvoted 12 times

 
Sasha_in_San_Francisco
6 months, 3 weeks ago
Answer: Round-Robin (1), Heap (2), None (3).
upvoted 1 times

 
Aslam208
6 months, 3 weeks ago
Round-Robin, Heap, None.

A polite request to the moderator, please verify these answers and correct. For some people, wrong answers will be detrimental.
upvoted 6 times

 
Vardhan_Brahmanapally
7 months ago
Many of the answers provided in this website are incorrect
upvoted 5 times

 
dJeePe
3 weeks, 5 days ago
Did MS hack this site to make it give wrong answers ? ;-)
upvoted 1 times

 
itacshish
7 months, 1 week ago
Round-Robin

Heap

None
upvoted 2 times

 
HaliBrickclay
7 months, 1 week ago
as per Microsoft document

Load to a staging table

To achieve the fastest loading speed for moving data into a data warehouse table, load data into a staging table. Define the staging table as a heap
and use round-robin for the distribution option.

Consider that loading is usually a two-step process in which you first load to a staging table and then insert the data into a production data
warehouse table. If the production table uses a hash distribution, the total time to load and insert might be faster if you define the staging table
with the hash distribution. Loading to the staging table takes longer, but the second step of inserting the rows to the production table does not
incur data movement across the distributions.
upvoted 4 times

 
VeroDon
4 months, 3 weeks ago
It doesn't mention the prd table. Only the staging. So, round Robin/Heap is the answer, correct? tricky questions.

:)
upvoted 1 times

 
estrelle2008
7 months, 2 weeks ago
Please correct the answers ExamTopics, as Microsoft itself recently published best practices on data loading in Synapse, and describes staging as
100% FAB answers is correct instead of ADF. https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/data-loading-best-practices
upvoted 2 times

 
RinkiiiiiV
7 months, 3 weeks ago
Round-Robin

Heap

None
upvoted 1 times


 
hugoborda
8 months ago
Round-Robin

Heap

None
upvoted 2 times

 
hsetin
8 months, 2 weeks ago
Why heap and not CCI?
upvoted 1 times


Question #20 Topic 1

You are designing a fact table named FactPurchase in an Azure Synapse Analytics dedicated SQL pool. The table contains purchases from
suppliers for a retail store. FactPurchase will contain the following columns.

FactPurchase will have 1 million rows of data added daily and will contain three years of data.

Transact-SQL queries similar to the following query will be executed daily.

SELECT -

SupplierKey, StockItemKey, IsOrderFinalized, COUNT(*)

FROM FactPurchase -

WHERE DateKey >= 20210101 -

AND DateKey <= 20210131 -

GROUP By SupplierKey, StockItemKey, IsOrderFinalized

Which table distribution will minimize query times?

A.
replicated

B.
hash-distributed on PurchaseKey

C.
round-robin

D.
hash-distributed on IsOrderFinalized

Correct Answer:
B

Hash-distributed tables improve query performance on large fact tables.

To balance the parallel processing, select a distribution column that:

✑ Has many unique values. The column can have duplicate values. All rows with the same value are assigned to the same distribution. Since
there are 60 distributions, some distributions can have > 1 unique values while others may end with zero values.

✑ Does not have NULLs, or has only a few NULLs.

✑ Is not a date column.

Incorrect Answers:

C: Round-robin tables are useful for improving loading speed.

Reference:

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute
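
A minimal sketch of option B; the column data types are assumptions, since the original column list is not reproduced here:

CREATE TABLE dbo.FactPurchase
(
    PurchaseKey      BIGINT NOT NULL,
    DateKey          INT    NOT NULL,
    SupplierKey      INT    NOT NULL,
    StockItemKey     INT    NOT NULL,
    IsOrderFinalized BIT    NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(PurchaseKey),   -- many unique values, no NULLs, not a date column
    CLUSTERED COLUMNSTORE INDEX
);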

Community vote distribution


B (86%) 7%

 
FredNo
Highly Voted 
6 months ago
Selected Answer: B
Correct
upvoted 18 times

 
GameLift
Highly Voted 
8 months, 2 weeks ago


Is it hash-distributed on PurchaseKey and not on IsOrderFinalized because 'IsOrderFinalized' yields less distributions(rows either contain yes,no
values) compared to PurchaseKey?
upvoted 7 times

 
Podavenna
8 months, 2 weeks ago
Yes, your logic is correct!
upvoted 4 times

 
Dothy
Most Recent 
2 weeks ago
B Correct
upvoted 1 times

 
SandipSingha
2 weeks, 3 days ago
B Correct
upvoted 1 times

 
sarapaisley
1 month, 2 weeks ago
Selected Answer: B
B is correct
upvoted 1 times

 
Anshul2910
2 months, 2 weeks ago
Selected Answer: B
CORRECT
upvoted 1 times

 
Istiaque
3 months, 2 weeks ago
Selected Answer: B
A round-robin distributed table distributes table rows evenly across all distributions. The assignment of rows to distributions is random. Unlike
hash-distributed tables, rows with equal values are not guaranteed to be assigned to the same distribution.

As a result, the system sometimes needs to invoke a data movement operation to better organize your data before it can resolve a query. This extra
step can slow down your queries.
upvoted 2 times

 
PallaviPatel
3 months, 4 weeks ago
Selected Answer: C
The options do not have the correct key selected for hash distribution, and query performance will improve only if the correct distribution column is selected. Also, the question says 1 million rows, but how much data that actually translates to in GB is unclear; the data types are mostly int, which aren't bulky. Hence I would go for round-robin instead of hash distribution.
upvoted 2 times

 
vineet1234
2 months, 2 weeks ago
Incorrect.. 1 million rows added per day. And the table has 3 years of data. So it's a large fact table. So Hash distributed. On purchase key (not
on IsOrderFinalized, as it's very low cardinality)
upvoted 2 times

 
Canary_2021
4 months, 3 weeks ago
Selected Answer: D
Hash field should be used in join, group by, having. SupplierKey, StockItemKey, IsOrderFinalized are group by fields. PurchaseKey doesn’t exist in
the query, why select PurchaseKey as hash key?

I select D. IsOrderFinalized may only provide 2 partitions, not as good as suppliekey and stockitemkey, but at least it is a group by column.
upvoted 2 times

 
Canary_2021
4 months, 3 weeks ago
To balance the parallel processing, select a distribution column that:

* Has many unique values.

* Does not have NULLs, or has only a few NULLs.

* Is not a date column.

Based on these descriptions, maybe B is the right answer. But PurchaseKey is not part of the query, so does it still improve the performance of this specific query?
upvoted 4 times

 
Mahesh_mm
4 months, 4 weeks ago
B is correct
upvoted 2 times

 
kahei
5 months, 2 weeks ago
Selected Answer: B
upvoted 1 times

 
alexleonvalencia
5 months, 2 weeks ago
Selected Answer: B


B is the correct answer.
upvoted 2 times

 
stuard
7 months, 2 weeks ago
hash-distributed on PurchaseKey and round-robin are going to provide the same result (in a case PurchaseKey has even distribution) for the query
as this specific query does not use PurchaseKey. However, round-robin is going to provide a slightly faster loading time.
upvoted 6 times

 
RinkiiiiiV
7 months, 3 weeks ago
Yes Agree..
upvoted 1 times

 
Gilvan
8 months, 2 weeks ago
Correct
upvoted 4 times


Question #21 Topic 1

HOTSPOT -

From a website analytics system, you receive data extracts about user interactions such as downloads, link clicks, form submissions, and video
plays.

The data contains the following columns.

You need to design a star schema to support analytical queries of the data. The star schema will contain four tables including a date dimension.

To which table should you add each column? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.

Hot Area:


Correct Answer:

Box 1: DimEvent -

Box 2: DimChannel -

Box 3: FactEvents -

Fact tables store observations or events, and can be sales orders, stock balances, exchange rates, temperatures, etc

Reference:

https://docs.microsoft.com/en-us/power-bi/guidance/star-schema
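
A minimal sketch of the resulting four-table star schema; the surrogate key names, data types, and the extra event attributes are assumptions, while EventCategory, ChannelGrouping, and TotalEvents come from the question:

CREATE TABLE dbo.DimDate    (DateKey INT NOT NULL, CalendarDate DATE NOT NULL);
CREATE TABLE dbo.DimEvent   (EventKey INT NOT NULL, EventCategory NVARCHAR(100), EventAction NVARCHAR(100), EventLabel NVARCHAR(100));
CREATE TABLE dbo.DimChannel (ChannelKey INT NOT NULL, ChannelGrouping NVARCHAR(100));
CREATE TABLE dbo.FactEvents (DateKey INT NOT NULL, EventKey INT NOT NULL, ChannelKey INT NOT NULL, TotalEvents INT NOT NULL);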

 
gssd4scoder
Highly Voted 
7 months ago
It seems to be correct
upvoted 29 times

 
DingDongSingSong
Highly Voted 
2 months ago
What is this question? It is poorly written. I couldn't even understand what's being asked here. It talks about 4 tables, yet the answer shows 3. Then,
the columns mentioned in the question don't match the column/attributes shown in the 3 tables noted in the answer.
upvoted 7 times

 
Dothy
Most Recent 
2 weeks ago
EventCategory -> dimEvent

channelGrouping -> dimChannel

TotalEvents -> factEven


upvoted 1 times

 
JJdeWit
2 weeks, 4 days ago
EventCategory ==> dimEvent

channelGrouping ==> dimChannel

TotalEvents ==> factEvent

Explanation:
A bit of knowledge of Google Analytics Universal helps to understand this question. eventCategory, eventAction and eventLabel all contain
information about the event/action done on the website, and can be logically be grouped together. ChannelGrouping is about how the user came
on the website (through Google, and advertisement, an email link, etc.) and is not related to events at all. It therefore would make sense to put it in
a second dim table.
upvoted 1 times

 
Mahesh_mm
4 months, 4 weeks ago
Answer is correct
upvoted 3 times

 
laszek
8 months, 4 weeks ago
I would add ChannelGrouping to DimEvents table. What would DimChannel table contain? only one column? No sense to me
upvoted 3 times

 
manquak
8 months, 3 weeks ago


It is supposed to contain 4 tables. Date, Event, Fact so the logical conclusion would be to include the channel dimension. If it were up to me though I'd use the channel as a degenerate dimension and store it in fact table if it's the only information that we have provided.
upvoted 3 times

 
Seansmyrke
2 months, 3 weeks ago
I mean if you think about it, ChannelName (facebook, google, youtube), ChannelType (paid media, free posts, ads), ChannleDelivery (chrome, etc etc). Just thinking out loud
upvoted 1 times

Question #22 Topic 1

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.

After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.

You have an Azure Storage account that contains 100 GB of files. The files contain rows of text and numerical values. 75% of the rows contain
description data that has an average length of 1.1 MB.

You plan to copy the data from the storage account to an enterprise data warehouse in Azure Synapse Analytics.

You need to prepare the files to ensure that the data copies quickly.

Solution: You convert the files to compressed delimited text files.

Does this meet the goal?

A.
Yes

B.
No

Correct Answer:
A

All file formats have different performance characteristics. For the fastest load, use compressed delimited text files.

Reference:

https://docs.microsoft.com/en-us/azure/sql-data-warehouse/guidance-for-loading-data
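
If the load goes through PolyBase external tables, a compressed delimited text format like the one referenced above is declared with an external file format; a minimal sketch, with the format name and delimiters chosen here as assumptions:

CREATE EXTERNAL FILE FORMAT CompressedCsvFormat
WITH
(
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = '|', STRING_DELIMITER = '"'),
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec'   -- gzip-compressed delimited text
);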

Community vote distribution


A (63%) B (38%)

 
Fahd92
Highly Voted 
7 months, 3 weeks ago
They said you need to prepare the files to copy; maybe they mean we should make them less than 1 MB? If so, it would be A, otherwise B!
upvoted 12 times

 
ANath
4 months, 1 week ago
The answer should be A.

https://azure.microsoft.com/en-gb/blog/increasing-polybase-row-width-limitation-in-azure-sql-data-warehouse/
upvoted 1 times

 
Thij
7 months, 3 weeks ago
After reading the other questions on this topic I go with A because the relevant part seems to be the compression.
upvoted 4 times

 
Muishkin
Most Recent 
4 weeks, 1 day ago
A text file seems to be too simple an answer, but it is true as per the Microsoft link. I was thinking of Parquet/Avro files.
upvoted 1 times

 
Massy
2 months, 2 weeks ago
Selected Answer: B
From the question: "75% of the rows contain description data that has an average length of 1.1 MB". You can't

From the documentation: "When you put data into the text files in Azure Blob storage or Azure Data Lake Store, they must have fewer than
1,000,000 bytes of data."

So 75% of the rows aren't suitable for delimited text files... why do you say the answer is yes?
upvoted 3 times

 
kamil_k
2 months, 2 weeks ago
I initially thought so too, however isn't this limit only relevant to PolyBase copy? It is not mentioned which method is used to transfer the data
so you could fit more than 1mb into a column in the table if you want to, you just have to use something else e.g. COPY command.
upvoted 2 times

 
PallaviPatel
3 months, 4 weeks ago
Selected Answer: A
correct answer.
upvoted 2 times

 
Mahesh_mm
4 months, 4 weeks ago
A is correct
upvoted 1 times

 
alexleonvalencia
5 months, 2 weeks ago
Selected Answer: A
Correct


upvoted 2 times

 
rashjan
5 months, 2 weeks ago
Selected Answer: A
correct because compression
upvoted 1 times

 
Odoxtoom
7 months, 1 week ago
Consider this sets one question:

What should you do to improve loading times?

What | Yes | No |

compressed | O | O |

columnstore | O | O |

> 1MB | O | O |

So now answers should be clear


upvoted 1 times

 
HaliBrickclay
7 months, 1 week ago
As per Microsoft

Row size and data type limits

PolyBase loads are limited to rows smaller than 1 MB. It cannot be used to load to VARCHR(MAX), NVARCHAR(MAX), or VARBINARY(MAX). For
more information, see Azure Synapse Analytics service capacity limits.

When your source data has rows greater than 1 MB, you might want to vertically split the source tables into several small ones. Make sure that the
largest size of each row doesn't exceed the limit. The smaller tables can then be loaded by using PolyBase and merged together in Azure Synapse
Analytics.
upvoted 2 times

 
jamesraju
7 months, 2 weeks ago
The answer should be 'yes"

All file formats have different performance characteristics. For the fastest load, use compressed delimited text files. The difference between UTF-8
and UTF-16 performance is minimal.
upvoted 1 times

 
RinkiiiiiV
7 months, 3 weeks ago
correct Answer is B
upvoted 1 times

 
gk765
8 months ago
Correct Answer is B. There is limit of 1MB when it comes to the row length. Hence you have to modify the files to ensure the row size is less than
1MB
upvoted 3 times

 
kolakone
8 months, 1 week ago
Answer is correct
upvoted 1 times


Question #23 Topic 1

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that
might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.

After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.

You have an Azure Storage account that contains 100 GB of files. The files contain rows of text and numerical values. 75% of the rows contain
description data that has an average length of 1.1 MB.

You plan to copy the data from the storage account to an enterprise data warehouse in Azure Synapse Analytics.

You need to prepare the files to ensure that the data copies quickly.

Solution: You copy the files to a table that has a columnstore index.

Does this meet the goal?

A.
Yes

B.
No

Correct Answer:
B

Instead convert the files to compressed delimited text files.

Reference:

https://docs.microsoft.com/en-us/azure/sql-data-warehouse/guidance-for-loading-data

Community vote distribution


B (100%)

 
Odoxtoom
Highly Voted 
7 months, 1 week ago
Consider this sets one question:

What should you do to improve loading times?

What | Yes | No |

compressed | O | O |

columnstore | O | O |

> 1MB | O | O |

So now answers should be clear


upvoted 6 times

 
Julius7000
7 months ago
Can You explain this in more details?
upvoted 10 times

 
helly13
5 months, 2 weeks ago
I really didn't understand this , can you explain?
upvoted 5 times

 
Amsterliese
Most Recent 
1 month, 2 weeks ago
Columnstore index would be used for faster reading, but the question is only about faster loading. So for faster loading you want the least possible
overhead. So the answer should be no. Am I right?
upvoted 2 times

 
Muishkin
4 weeks, 1 day ago
Yes load to a table without indexes for faster load right?
upvoted 1 times

 
lionurag
2 months, 3 weeks ago
Selected Answer: B
B is correct
upvoted 2 times

 
bhanuprasad9331
3 months, 1 week ago
From the documentation, loads to heap table are faster than indexed tables. So, better to use heap table than columnstore index table in this case.

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-index#heap-tables
upvoted 4 times

 
PallaviPatel
3 months, 4 weeks ago
Selected Answer: B
B is correct.
upvoted 1 times


 
DE_Sanjay
4 months, 1 week ago
NO is the answer.
upvoted 1 times

 
Mahesh_mm
4 months, 4 weeks ago
B is correct
upvoted 1 times

 
rashjan
5 months, 2 weeks ago
Selected Answer: B
Correct Answer: No.
upvoted 2 times

 
sachabess79
7 months, 3 weeks ago
No. The index will increase the insertion time.
upvoted 3 times

 
michalS
8 months, 3 weeks ago
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/guidance-for-loading-data. "For the fastest load, use compressed
delimited text files."
upvoted 1 times

 
umeshkd05
8 months, 2 weeks ago
But the row size also need to be < 1 MB

So, files need to be modified to make all rows < 1 MB

Answer: NO
upvoted 4 times

 
Julius7000
7 months ago
In other words, I think that 100 GB is much too much for the columnstore index memory-wise. The documentation is unclear in the context of this particular question, but I think the answer is NO, as the given answer is the wrong idea anyway.
upvoted 1 times

 
Julius7000
7 months ago
Not row size; the NUMBER of rows has a maximum of 1,048,576 per rowgroup.

"When there is memory pressure, the columnstore index might not be able to achieve maximum compression rates. This effects query
performance."
upvoted 1 times

 
gk765
8 months ago
Correct answer should be NO
upvoted 2 times


Question #24 Topic 1

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that
might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.

After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.

You have an Azure Storage account that contains 100 GB of files. The files contain rows of text and numerical values. 75% of the rows contain
description data that has an average length of 1.1 MB.

You plan to copy the data from the storage account to an enterprise data warehouse in Azure Synapse Analytics.

You need to prepare the files to ensure that the data copies quickly.

Solution: You modify the files to ensure that each row is more than 1 MB.

Does this meet the goal?

A.
Yes

B.
No

Correct Answer:
B

Instead convert the files to compressed delimited text files.

Reference:

https://docs.microsoft.com/en-us/azure/sql-data-warehouse/guidance-for-loading-data

Community vote distribution


B (100%)

 
Gilvan
Highly Voted 
8 months, 1 week ago
No, rows need to have less than 1 MB. A batch size between 100 K to 1M rows is the recommended baseline for determining optimal batch size
capacity.
upvoted 7 times

 
PallaviPatel
Most Recent 
3 months, 4 weeks ago
Selected Answer: B
B is correct.
upvoted 4 times

 
amarG1996
4 months, 3 weeks ago
PolyBase can't load rows that have more than 1,000,000 bytes of data. When you put data into the text files in Azure Blob storage or Azure Data
Lake Store, they must have fewer than 1,000,000 bytes of data. This byte limitation is true regardless of the table schema.
upvoted 2 times

 
kamil_k
2 months, 2 weeks ago
is it stated anywhere that we have to use PolyBase? What about COPY command?
upvoted 1 times

 
amarG1996
4 months, 3 weeks ago
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/data-loading-best-practices#prepare-data-in-azure-storage
upvoted 1 times

 
Mahesh_mm
4 months, 4 weeks ago
Answer is No
upvoted 1 times

 
rashjan
5 months, 2 weeks ago
Selected Answer: B
Correct Answer: No.
upvoted 2 times

 
Odoxtoom
7 months, 1 week ago
Consider this sets one question:

What should you do to improve loading times?

What | Yes | No |

compressed | O | O |

columnstore | O | O |

> 1MB | O | O |

So now answers should be clear


upvoted 1 times


 
Aslam208
6 months ago
@Odoxtoom, can you please explain your answer and specify based on this matrix which option is correct.
upvoted 3 times

 
Bishtu
5 months ago
Yes

No

No
upvoted 2 times


Question #25 Topic 1

You build a data warehouse in an Azure Synapse Analytics dedicated SQL pool.

Analysts write a complex SELECT query that contains multiple JOIN and CASE statements to transform data for use in inventory reports. The
inventory reports will use the data and additional WHERE parameters depending on the report. The reports will be produced once daily.

You need to implement a solution to make the dataset available for the reports. The solution must minimize query times.

What should you implement?

A.
an ordered clustered columnstore index

B.
a materialized view

C.
result set caching

D.
a replicated table

Correct Answer:
B

Materialized views for dedicated SQL pools in Azure Synapse provide a low maintenance method for complex analytical queries to get fast
performance without any query change.

Incorrect Answers:

C: One daily execution does not make use of result cache caching.

Note: When result set caching is enabled, dedicated SQL pool automatically caches query results in the user database for repetitive use. This
allows subsequent query executions to get results directly from the persisted cache so recomputation is not needed. Result set caching
improves query performance and reduces compute resource usage. In addition, queries using cached results set do not use any concurrency
slots and thus do not count against existing concurrency limits.

Reference:

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/performance-tuning-materialized-views
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/performance-tuning-result-set-caching

Community vote distribution


B (100%)

 
ANath
Highly Voted 
3 months, 2 weeks ago
B is correct.

Materialized view and result set caching

These two features in dedicated SQL pool are used for query performance tuning. Result set caching is used for getting high concurrency and fast
response from repetitive queries against static data.

To use the cached result, the form of the cache requesting query must match with the query that produced the cache. In addition, the cached result
must apply to the entire query.

Materialized views allow data changes in the base tables. Data in materialized views can be applied to a piece of a query. This support allows the
same materialized views to be used by different queries that share some computation for faster performance.
upvoted 6 times

 
SandipSingha
Most Recent 
2 weeks, 3 days ago
B materialized view
upvoted 1 times

 
Egocentric
1 month, 1 week ago
B is correct without a doubt
upvoted 1 times

 
DingDongSingSong
2 months ago
Why isn't the answer "A" when the query may have additional WHERE parameters depending on the report? That means the query isn't static and
will change depending on the report. A clustered columnstore index would provide better query performance in the case of a complex query that
isn't static.
upvoted 1 times

 
PallaviPatel
3 months, 4 weeks ago
Selected Answer: B
B correct.
upvoted 1 times

 
VeroDon
4 months, 3 weeks ago


Selected Answer: B
Correct
upvoted 3 times

 
Mahesh_mm
4 months, 4 weeks ago
B is correct
upvoted 1 times

 
bad_atitude
5 months, 1 week ago
B materialized view
upvoted 2 times

 
Canary_2021
5 months, 1 week ago
Selected Answer: B
B is the correct answer.

A materialized view is a database object that contains the results of a query. A materialized view is not simply a window on the base table. It is
actually a separate object holding data in itself. So query data against a materialized view with different filters should be quick.

Difference Between View and Materialized View:

https://techdifferences.com/difference-between-view-and-materialized-view.html
upvoted 4 times

 
alexleonvalencia
5 months, 2 weeks ago
Correct answer B, a materialized view.
upvoted 4 times


Question #26 Topic 1

You have an Azure Synapse Analytics workspace named WS1 that contains an Apache Spark pool named Pool1.

You plan to create a database named DB1 in Pool1.

You need to ensure that when tables are created in DB1, the tables are available automatically as external tables to the built-in serverless SQL
pool.

Which format should you use for the tables in DB1?

A.
CSV

B.
ORC

C.
JSON

D.
Parquet

Correct Answer:
D

Serverless SQL pool can automatically synchronize metadata from Apache Spark. A serverless SQL pool database will be created for each
database existing in serverless Apache Spark pools.

For each Spark external table based on Parquet or CSV and located in Azure Storage, an external table is created in a serverless SQL pool
database.
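For illustration, once a table is created as Parquet from the Spark pool in DB1, it can be queried from the built-in serverless SQL pool with no extra setup (the table name mytable is hypothetical):

-- Spark side (Spark SQL): CREATE TABLE mytable USING PARQUET AS SELECT ...
-- Serverless SQL pool side: the table is exposed automatically under the dbo schema
SELECT TOP 10 *
FROM DB1.dbo.mytable;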

Reference:

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-storage-files-spark-tables

Community vote distribution


D (100%)

 
KevinSames
Highly Voted 
5 months, 1 week ago
Both A and D are correct

"For each Spark external table based on Parquet or CSV and located in Azure Storage, an external table is created in a serverless SQL pool
database. As such, you can shut down your Spark pools and still query Spark external tables from serverless SQL pool."
upvoted 15 times

 
RehanRajput
Most Recent 
3 days, 18 hours ago
Both A and D
upvoted 1 times

 
RehanRajput
3 days, 18 hours ago
https://docs.microsoft.com/en-us/azure/synapse-analytics/metadata/database
upvoted 1 times

 
MatiCiri
4 weeks ago
Selected Answer: D
Looks correct to me
upvoted 1 times

 
AhmedDaffaie
2 months, 2 weeks ago
JSON is also supported by Serverless SQL Pool but it is kinda complicated. Why is it not selected?
upvoted 2 times

 
Ajitk27
3 months ago
Selected Answer: D
Looks correct to me
upvoted 1 times

 
VijayMore
3 months ago
Selected Answer: D
Correct
upvoted 1 times

 
PallaviPatel
3 months, 4 weeks ago
Selected Answer: D
Both A and D are correct. as CSV and Parquet are correct answers.
upvoted 1 times

 
Mahesh_mm
4 months, 4 weeks ago
Parquet and CSV are correct

upvoted 4 times

 
Nifl91
5 months, 2 weeks ago
I think A and D are both correct answers.
upvoted 3 times

 
alexleonvalencia
5 months, 2 weeks ago
Correct answer: Parquet.
upvoted 1 times


Question #27 Topic 1

You are planning a solution to aggregate streaming data that originates in Apache Kafka and is output to Azure Data Lake Storage Gen2. The
developers who will implement the stream processing solution use Java.

Which service should you recommend using to process the streaming data?

A.
Azure Event Hubs

B.
Azure Data Factory

C.
Azure Stream Analytics

D.
Azure Databricks

Correct Answer:
D

Azure Databricks runs Apache Spark, which provides a full Java API, whereas Azure Stream Analytics uses a SQL-based query language, so Databricks is the better fit for a Java development team. The comparison tables in the referenced documentation (General capabilities and Integration capabilities) summarize the key differences between the Azure stream processing technologies.

Reference:

https://docs.microsoft.com/en-us/azure/architecture/data-guide/technology-choices/stream-processing

Community vote distribution


D (100%)

 
Nifl91
Highly Voted 
5 months, 2 weeks ago
Correct!
upvoted 13 times

 
NewTuanAnh
Most Recent 
1 month, 2 weeks ago
why not C: Azure Stream Analytics?
upvoted 1 times

 
Muishkin
4 weeks, 1 day ago
Yes Azure stream Analytics for streaming data?
upvoted 1 times

 
NewTuanAnh
1 month, 2 weeks ago
I see, Azure Stream Analytics does not associate with Java
upvoted 1 times

 
sdokmak
2 days, 22 hours ago
or databricks
upvoted 1 times

 
sdokmak
2 days, 22 hours ago
kafka*
upvoted 1 times

 
PallaviPatel
3 months, 4 weeks ago
Selected Answer: D
correct.
upvoted 1 times

 
Mahesh_mm
4 months, 4 weeks ago
Answer is correct
upvoted 3 times

 
alexleonvalencia
5 months, 2 weeks ago
Correct answer: Azure Databricks.
upvoted 4 times


Question #28 Topic 1

You plan to implement an Azure Data Lake Storage Gen2 container that will contain CSV files. The size of the files will vary based on the number
of events that occur per hour.

File sizes range from 4 KB to 5 GB.

You need to ensure that the files stored in the container are optimized for batch processing.

What should you do?

A.
Convert the files to JSON

B.
Convert the files to Avro

C.
Compress the files

D.
Merge the files

Correct Answer:
B

Avro supports batch and is very relevant for streaming.

Note: Avro is a framework developed within Apache's Hadoop project. It is a row-based storage format that is widely used for serialization. Avro stores its schema in JSON format, making it easy to read and interpret by any program. The data itself is stored in a binary format, making it compact and efficient.

Reference:

https://www.adaltas.com/en/2020/07/23/benchmark-study-of-different-file-format/

Community vote distribution


D (100%)

 
VeroDon
Highly Voted 
4 months, 3 weeks ago
You can not merge the files if u don't know how many files exist in ADLS2. In this case, you could easily create a file larger than 100 GB in size and
decrease performance. so B is the correct answer. Convert to AVRO
upvoted 25 times

 
Massy
3 weeks, 5 days ago
I can understand why you say not merge, but why avro?
upvoted 2 times

 
Canary_2021
Highly Voted 
4 months, 3 weeks ago
Selected Answer: D
If you store your data as many small files, this can negatively affect performance. In general, organize your data into larger sized files for better
performance (256 MB to 100 GB in size).

https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-best-practices#optimize-for-data-ingest
upvoted 7 times

 
SAYAK7
Most Recent 
4 days, 15 hours ago
Selected Answer: D
Batch can support JSON or AVRO, you should input one file by merging them all.
upvoted 1 times

 
sdokmak
1 day, 9 hours ago
They're CSV so you're saying answer is A
upvoted 1 times

 
sdokmak
1 day, 9 hours ago
B*, AVRO is faster than JSON
upvoted 1 times

 
RehanRajput
1 month ago
Selected Answer: D
You need to make sure that the files in the container are optimized for BATCH PROCESSING. In case of batch processing it makes sense to merge
files as to reduce the amount of IO Listing operations.

B would have been correct if we had to optimize for stream processing.


upvoted 3 times

 
Karthikj18
1 month, 3 weeks ago
Conversion adds extra load, so it is not a good idea to convert to Avro; merging is easier.
upvoted 1 times


 
SebK
2 months ago
Selected Answer: D
merge files
upvoted 2 times

 
adfgasd
3 months, 3 weeks ago
This question makes me very confused.

It says the file size depends on the number of events per hour, so i guess there is a file generated every hour. In the worst case, we have 5GB * 24h,
which is greater than 100GB...
But why is AVRO a good choice??
upvoted 6 times

 
PallaviPatel
3 months, 4 weeks ago
Selected Answer: D
merge files is correct.
upvoted 3 times

 
vincetita
4 months ago
Selected Answer: D
Small-sized files will hurt performance. Optimal file size: 256MB to 100GB
upvoted 1 times

 
Tomi1234
4 months, 3 weeks ago
Selected Answer: D
In my opinion for better batch processing files should be not bigger than 100GB but as big as possible.
upvoted 7 times

 
VeroDon
4 months, 3 weeks ago
One example of batch processing is transforming a large set of flat, semi-structured CSV or JSON files into a schematized and structured format
that is ready for further querying. Typically the data is converted from the raw formats used for ingestion (such as CSV) into binary formats that are
more performant for querying because they store data in a columnar format, and often provide indexes and inline statistics about the data.

https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data/batch-processing
upvoted 1 times

 
edba
4 months, 4 weeks ago
I think it shall be D as well. Please check the link below. https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-best-practices#file-size
upvoted 3 times

 
Mahesh_mm
4 months, 4 weeks ago
B is correct
upvoted 2 times

 
SabaJamal2010AtGmail
4 months, 3 weeks ago
Consider using the Avro file format in cases where your I/O patterns are more write heavy, or the query patterns favor retrieving multiple rows
of records in their entirety. For example, the Avro format works well with a message bus such as Event Hub or Kafka that write multiple
events/messages in succession.
upvoted 1 times

 
didixuecoding
5 months ago
Correct Answer should be D: Merge the files
upvoted 2 times

 
corebit
5 months ago
Please explain why it is D.
upvoted 1 times

 
alexleonvalencia
5 months, 2 weeks ago
Correct answer: Avro
upvoted 1 times


Question #29 Topic 1

HOTSPOT -

You store files in an Azure Data Lake Storage Gen2 container. The container has the storage policy shown in the following exhibit.

Use the drop-down menus to select the answer choice that completes each statement based on the information presented in the graphic.

NOTE: Each correct selection is worth one point.

Hot Area:


Correct Answer:

Box 1: moved to cool storage -

The ManagementPolicyBaseBlob.TierToCool property gets or sets the function to tier blobs to cool storage. It supports blobs currently in the Hot tier.

Box 2: container1/contoso.csv -

As defined by prefixMatch.

prefixMatch: An array of strings for prefixes to be matched. Each rule can define up to 10 case-sensitive prefixes. A prefix string must start with
a container name.

Reference:

https://docs.microsoft.com/en-us/dotnet/api/microsoft.azure.management.storage.fluent.models.managementpolicybaseblob.tiertocool

 
bad_atitude
Highly Voted 
5 months, 1 week ago
correct
upvoted 17 times

 
adfgasd
5 months, 1 week ago
why the .csv?
upvoted 1 times

 
Lewistrick
5 months ago
It matches anything that starts with "container1/contoso" and the csv in the answer is the only one that matches.
upvoted 9 times

 
alexleonvalencia
Highly Voted 
5 months, 2 weeks ago
Answer: Cool tier & container1/contoso.csv
upvoted 5 times

 
AJ01
Most Recent 
4 months, 2 weeks ago
shouldn't the question be greater than 60 days?
upvoted 2 times

 
stunner85_
3 months, 3 weeks ago
The files get deleted after 60 days but after 30 days they are moved to the cool storage.
upvoted 3 times

 
Mahesh_mm
4 months, 4 weeks ago
correct
upvoted 3 times


Question #30 Topic 1

You are designing a financial transactions table in an Azure Synapse Analytics dedicated SQL pool. The table will have a clustered columnstore
index and will include the following columns:

✑ TransactionType: 40 million rows per transaction type

✑ CustomerSegment: 4 million per customer segment

✑ TransactionMonth: 65 million rows per month

AccountType: 500 million per account type

You have the following query requirements:

✑ Analysts will most commonly analyze transactions for a given month.

✑ Transactions analysis will typically summarize transactions by transaction type, customer segment, and/or account type

You need to recommend a partition strategy for the table to minimize query times.

On which column should you recommend partitioning the table?

A.
CustomerSegment

B.
AccountType

C.
TransactionType

D.
TransactionMonth

Correct Answer:
D

For optimal compression and performance of clustered columnstore tables, a minimum of 1 million rows per distribution and partition is
needed. Before partitions are created, dedicated SQL pool already divides each table into 60 distributed databases.

Example: Any partitioning added to a table is in addition to the distributions created behind the scenes. Using this example, if the sales fact
table contained 36 monthly partitions, and given that a dedicated SQL pool has 60 distributions, then the sales fact table should contain 60
million rows per month, or 2.1 billion rows when all months are populated. If a table contains fewer than the recommended minimum number of
rows per partition, consider using fewer partitions in order to increase the number of rows per partition.
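For illustration, a minimal sketch of the table definition (column types and boundary values are hypothetical), partitioning on TransactionMonth while keeping the hash distribution on a different column:

CREATE TABLE dbo.FactTransaction
(
    TransactionKey   bigint NOT NULL,
    TransactionType  int    NOT NULL,
    CustomerSegment  int    NOT NULL,
    AccountType      int    NOT NULL,
    TransactionMonth int    NOT NULL   -- e.g. 201901, 201902, ...
)
WITH
(
    DISTRIBUTION = HASH(TransactionKey),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION ( TransactionMonth RANGE RIGHT FOR VALUES
                (201902, 201903, 201904 /* ...one boundary per month for 2019-2020... */) )
);

With 65 million rows per month spread over 60 distributions, each monthly partition still holds around 1 million rows per distribution, which keeps the columnstore segments efficient.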

Community vote distribution


D (100%)

 
Lewistrick
Highly Voted 
5 months ago
Anyone else thinks this is a very badly explained situation?
upvoted 12 times

 
Canary_2021
Highly Voted 
5 months, 1 week ago
Selected Answer: D
Select D because analysts will most commonly analyze transactions for a given month,
upvoted 9 times

 
Youdaoud
Most Recent 
1 month, 3 weeks ago
Selected Answer: D
correct answer D
upvoted 1 times

 
PallaviPatel
3 months, 4 weeks ago
Selected Answer: D
correct.
upvoted 3 times

 
SabaJamal2010AtGmail
4 months, 3 weeks ago
D is correct because "Transactions analysis will typically summarize transactions by transaction type, customer segment, and/or account type"
implying its part of the WHERE clause. The option of choosing a distribution column is to ensure that it is not used in the WHERE clause.
upvoted 6 times

 
ploer
3 months, 3 weeks ago
D is correct, but those columns will be used for aggregate funtions. TransactionMonth column will be used in the where-clause: "analysts will
most commonly analyze transactions for a given month", so the given month must be in the where clause. Partitioning on the where clause
column significantly reduces the amount of data to be processed which leads to increased performance. Do not confuse with distribution
column on hash partitioned tables. Using TransactionMonth column as distribution column here would be a really bad idea because all queried
data would be on one single node.
upvoted 15 times


 
Mahesh_mm
4 months, 4 weeks ago
D is correct
upvoted 2 times

 
Sonnie01
5 months, 2 weeks ago
Selected Answer: D
correct
upvoted 5 times

 
edba
4 months, 4 weeks ago
check this as well for explanation. https://www.linkedin.com/pulse/partitioning-distribution-azure-synapse-analytics-swapnil-mule
upvoted 2 times

 
gf2tw
5 months, 2 weeks ago
Agree with D, should not be confused with Distribution column for Hash-distributed tables.
upvoted 5 times


Question #31 Topic 1

HOTSPOT -

You have an Azure Data Lake Storage Gen2 account named account1 that stores logs as shown in the following table.

You do not expect that the logs will be accessed during the retention periods.

You need to recommend a solution for account1 that meets the following requirements:

✑ Automatically deletes the logs at the end of each retention period

✑ Minimizes storage costs

What should you include in the recommendation? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.

Hot Area:

Correct Answer:

Box 1: Store the infrastructure logs in the Cool access tier and the application logs in the Archive access tier

For infrastructure logs: Cool tier - An online tier optimized for storing data that is infrequently accessed or modified. Data in the cool tier should
be stored for a minimum of 30 days. The cool tier has lower storage costs and higher access costs compared to the hot tier.

For application logs: Archive tier - An offline tier optimized for storing data that is rarely accessed, and that has flexible latency requirements, on
the order of hours.

Data in the archive tier should be stored for a minimum of 180 days.

Box 2: Azure Blob storage lifecycle management rules

Blob storage lifecycle management offers a rule-based policy that you can use to transition your data to the desired access tier when your
specified conditions are met. You can also use lifecycle management to expire data at the end of its life.

Reference:

https://docs.microsoft.com/en-us/azure/storage/blobs/access-tiers-overview

 
gf2tw
Highly Voted 
5 months, 2 weeks ago
"Data must remain in the Archive tier for at least 180 days or be subject to an early deletion charge. For example, if a blob is moved to the Archive
tier and then deleted or moved to the Hot tier after 45 days, you'll be charged an early deletion fee equivalent to 135 (180 minus 45) days of
storing that blob in the Archive tier." <- from the sourced link.

This explains why we have to use two different access tiers rather than both as archive.
upvoted 26 times


 
Backy
Most Recent 
2 weeks ago
The question says "You do not expect that the logs will be accessed during the retention periods" - so there is no reason to keep any of them as
Cool, so the correct answer should be to put them both in Archive
upvoted 1 times

 
sdokmak
2 days, 21 hours ago
yeah but because the infrastructure logs are <180 days before deleting, there is a considerable fee to delete if in archive, so not the cheapest
option.
upvoted 2 times

 
Muishkin
4 weeks, 1 day ago
But the question says 360 days and 60 days for the 2 logs...whereas archive tier could store only upto 180 days .Also the cool tier has lesser storage
cost /- hour as compared to archive tier.So should'nt the answer be cool tier for both?
upvoted 1 times

 
Mahesh_mm
4 months, 4 weeks ago
Answers are correct
upvoted 2 times

 
ANath
5 months ago
The answers are correct.

Data must remain in the Archive tier for at least 180 days or be subject to an early deletion charge. For example, if a blob is moved to the Archive
tier and then deleted or moved to the Hot tier after 45 days, you'll be charged an early deletion fee equivalent to 135 (180 minus 45) days of
storing that blob in the Archive tier.

A blob in the Cool tier in a general-purpose v2 accounts is subject to an early deletion penalty if it is deleted or moved to a different tier before 30
days has elapsed. This charge is prorated. For example, if a blob is moved to the Cool tier and then deleted after 21 days, you'll be charged an early
deletion fee equivalent to 9 (30 minus 21) days of storing that blob in the Cool tier.

https://docs.microsoft.com/en-us/azure/storage/blobs/access-tiers-overview
upvoted 4 times


Question #32 Topic 1

You plan to ingest streaming social media data by using Azure Stream Analytics. The data will be stored in files in Azure Data Lake Storage, and
then consumed by using Azure Databricks and PolyBase in Azure Synapse Analytics.

You need to recommend a Stream Analytics data output format to ensure that the queries from Databricks and PolyBase against the files
encounter the fewest possible errors. The solution must ensure that the files can be queried quickly and that the data type information is retained.

What should you recommend?

A.
JSON

B.
Parquet

C.
CSV

D.
Avro

Correct Answer:
B

Need Parquet to support both Databricks and PolyBase.
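For illustration, a minimal sketch of the PolyBase objects that read the Parquet output (the data source, table, and column names are hypothetical and assume an existing external data source):

CREATE EXTERNAL FILE FORMAT ParquetFileFormat
WITH (
    FORMAT_TYPE = PARQUET,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);

CREATE EXTERNAL TABLE dbo.SocialMediaEvents
(
    EventId   bigint,
    EventTime datetime2,
    Payload   nvarchar(4000)
)
WITH (
    LOCATION = '/stream-output/',
    DATA_SOURCE = MyDataLake,        -- assumed, previously created external data source
    FILE_FORMAT = ParquetFileFormat
);

Because Parquet stores the schema and column types with the data, both Databricks and PolyBase can read it without type-inference errors.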

Reference:

https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-file-format-transact-sql

Community vote distribution


B (100%)

 
hrastogi7
Highly Voted 
5 months, 1 week ago
Parquet can be quickly retrieved and maintain metadata in itself. Hence Parquet is correct answer.
upvoted 10 times

 
Muishkin
Most Recent 
4 weeks, 1 day ago
Isnt JSON good for batch processing/streaming?
upvoted 1 times

 
RehanRajput
3 days, 18 hours ago
Indeed. However, we also want to query the data using PolyBase. Polybase doesn't support Avro.

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/load-data-overview#polybase-external-file-formats
upvoted 2 times

 
AhmedDaffaie
1 month, 4 weeks ago
I am confused!

Avro has self-describing schema and good for quick loading (patching), why parquet?
upvoted 3 times

 
Boompiee
2 weeks, 1 day ago
Apparently, the deciding factor is the fact that PolyBase doesn't support AVRO, but it does support Parquet.
upvoted 3 times

 
PallaviPatel
3 months, 4 weeks ago
Selected Answer: B
correct.
upvoted 1 times

 
EmmettBrown
4 months ago
Selected Answer: B
Parquet is the correct answer
upvoted 1 times

 
alexleonvalencia
5 months, 2 weeks ago
Correct answer: Parquet
upvoted 1 times


Question #33 Topic 1

You have an Azure Synapse Analytics dedicated SQL pool named Pool1. Pool1 contains a partitioned fact table named dbo.Sales and a staging
table named stg.Sales that has the matching table and partition definitions.

You need to overwrite the content of the first partition in dbo.Sales with the content of the same partition in stg.Sales. The solution must minimize
load times.

What should you do?

A.
Insert the data from stg.Sales into dbo.Sales.

B.
Switch the first partition from dbo.Sales to stg.Sales.

C.
Switch the first partition from stg.Sales to dbo.Sales.

D.
Update dbo.Sales from stg.Sales.

Correct Answer:
B

A way to eliminate rollbacks is to use Metadata Only operations like partition switching for data management. For example, rather than execute
a DELETE statement to delete all rows in a table where the order_date was in October of 2001, you could partition your data monthly. Then you
can switch out the partition with data for an empty partition from another table.

Note: Syntax:

SWITCH [ PARTITION source_partition_number_expression ] TO [ schema_name. ] target_table [ PARTITION target_partition_number_expression ]

Switches a block of data in one of the following ways:

✑ Reassigns all data of a table as a partition to an already-existing partitioned table.

✑ Switches a partition from one partitioned table to another.

✑ Reassigns all data in one partition of a partitioned table to an existing non-partitioned table.
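Applied to the scenario, a minimal sketch that overwrites the first partition of dbo.Sales with the first partition of stg.Sales; TRUNCATE_TARGET empties the target partition so the switch can replace it, and the operation is metadata-only, so load time is minimal:

ALTER TABLE stg.Sales
SWITCH PARTITION 1 TO dbo.Sales PARTITION 1
WITH (TRUNCATE_TARGET = ON);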

Reference:

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/best-practices-dedicated-sql-pool

Community vote distribution


C (96%) 4%

 
Aslam208
Highly Voted 
5 months, 2 weeks ago
Selected Answer: C
The correct answer is C
upvoted 31 times

 
Nifl91
Highly Voted 
5 months, 2 weeks ago
this must be C. since the need is to overwrite dbo.Sales with the content of stg.Sales.

SWITCH source TO target


upvoted 9 times

 
SAYAK7
Most Recent 
4 days, 4 hours ago
Selected Answer: C
Coz we have to impact dbo.Sales
upvoted 1 times

 
kknczny
1 month ago
Selected Answer: B
As partition in stg.Sales is the one we will be using to overwrite the first partition in dbo.Stage, should it not be understood as "B. Switch the first
partition from dbo.Sales to stg.Sales."?

Also look at the query syntax:

"SWITCH [ PARTITION source_partition_number_expression ] TO [ schema_name. ] target_table [ PARTITION target_partition_number_expression ]"

So from stg to dbo


upvoted 2 times

 
kanak01
3 months, 2 weeks ago
Seriously.. Who puts Fact Table data into Dimension table !
upvoted 1 times

 
rockyc05
3 months ago
Its Fact to Stage Table actually acc to the answer provided
upvoted 1 times


 
PallaviPatel
3 months, 4 weeks ago
Selected Answer: C
C is correct as partition switching works from source to target.
upvoted 2 times

 
dev2dev
4 months, 2 weeks ago
Either way works.
upvoted 1 times

 
Sandip4u
4 months, 2 weeks ago
Stg.sales is a temp table which does not have any partition , So C can not be correct
upvoted 2 times

 
ABExams
3 months, 1 week ago
It literally states it has the same partition definition.
upvoted 2 times

 
alex623
4 months, 2 weeks ago
The correct answer is C, because target table is dbo.sales
upvoted 1 times

 
Rickie85
4 months, 2 weeks ago
Selected Answer: C
C correct
upvoted 1 times

 
Jaws1990
4 months, 2 weeks ago
Selected Answer: C
B is the wrong way round.
upvoted 3 times

 
VeroDon
4 months, 3 weeks ago
Selected Answer: C
https://medium.com/@cocci.g/switch-partitions-in-azure-synapse-sql-dw-1e0e32309872
upvoted 4 times

 
Mahesh_mm
4 months, 4 weeks ago
C is correct answer
upvoted 1 times

 
ArunMonika
4 months, 4 weeks ago
I will go with C
upvoted 1 times

 
gitoxam686
5 months ago
Selected Answer: C
C is correct answer because we have to overwrite.
upvoted 2 times

 
adfgasd
5 months ago
Selected Answer: C
C for sure
upvoted 2 times

 
Will_KaiZuo
5 months, 1 week ago
Selected Answer: C
agree with C
upvoted 1 times


Question #34 Topic 1

You are designing a slowly changing dimension (SCD) for supplier data in an Azure Synapse Analytics dedicated SQL pool.

You plan to keep a record of changes to the available fields.

The supplier data contains the following columns.

Which three additional columns should you add to the data to create a Type 2 SCD? Each correct answer presents part of the solution.

NOTE: Each correct selection is worth one point.

A.
surrogate primary key

B.
effective start date

C.
business key

D.
last modified date

E.
effective end date

F.
foreign key

Correct Answer:
BCE

C: The Slowly Changing Dimension transformation requires at least one business key column.

BE: Historical attribute changes create new records instead of updating existing ones. The only change that is permitted in an existing record is
an update to a column that indicates whether the record is current or expired. This kind of change is equivalent to a Type 2 change. The Slowly
Changing Dimension transformation directs these rows to two outputs: Historical Attribute Inserts Output and New Output.
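For illustration, a minimal sketch of a Type 2 supplier dimension (hypothetical column names) combining a surrogate key, the source-system business key, and the effective-date columns:

CREATE TABLE dbo.DimSupplier
(
    SupplierKey        int IDENTITY(1,1) NOT NULL,  -- surrogate key
    SupplierSystemID   int               NOT NULL,  -- business key from the source system
    SupplierName       nvarchar(100)     NOT NULL,
    EffectiveStartDate datetime2         NOT NULL,
    EffectiveEndDate   datetime2         NULL       -- NULL (or 9999-12-31) marks the current version
)
WITH
(
    DISTRIBUTION = REPLICATE,
    CLUSTERED COLUMNSTORE INDEX
);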

Reference:

https://docs.microsoft.com/en-us/sql/integration-services/data-flow/transformations/slowly-changing-dimension-transformation

Community vote distribution


ABE (85%) BCE (15%)

 
ItHYMeRIsh
Highly Voted 
5 months, 2 weeks ago
Selected Answer: ABE
The answer is ABE. A type 2 SCD requires a surrogate key to uniquely identify each record when versioning.

See https://docs.microsoft.com/en-us/learn/modules/populate-slowly-changing-dimensions-azure-synapse-analytics-pipelines/3-choose-between-dimension-types under SCD Type 2: “the dimension table must use a surrogate key to provide a unique reference to a version of the dimension member.”

A business key is already part of this table - SupplierSystemID. The column is derived from the source data.
upvoted 35 times

 
CHOPIN
Highly Voted 
4 months, 2 weeks ago
Selected Answer: BCE
WHAT ARE YOU GUYS TALKING ABOUT??? You are really misleading other people!!! No issue with the provided answer. Should be BCE!!!

Check this out:


https://docs.microsoft.com/en-us/sql/integration-services/data-flow/transformations/slowly-changing-dimension-transformation?view=sql-server-ver15

"The Slowly Changing Dimension transformation requires at least one business key column."

[Surrogate key] is not mentioned in this Microsoft documentation AT ALL!!!


upvoted 9 times

 
dev2dev
4 months, 2 weeks ago
Search for Business Keys in that page. and make sure you wear specs :D
upvoted 2 times

 
assU2
4 months, 2 weeks ago
Yes, because SupplierSystemID is unique. But Microsoft questions are terribly misleading here. People think that SupplierSystemID is business
key because of Supplier in it. Also, there are some really not good and not sufficient examples on Learn. See https://docs.microsoft.com/en-us/learn/modules/populate-slowly-changing-dimensions-azure-synapse-analytics-pipelines/3-choose-between-dimension-types
upvoted 1 times

 
Mad_001
3 months ago
I don't understand.

1) What in your opinion should then be the business key. Can you explain please.

2) SupplierSystemID is unique in the source system. Is there a definition that the column needs to be unique also in the data warehouse? If not,
there is the possibility to use it as the business key. Am I wrong?
upvoted 1 times

 
Onobhas01
1 month, 2 weeks ago
No you're not wrong, the unique identifier form the ERP system is the Business Key
upvoted 1 times

 
shrikantK
Most Recent 
6 days, 20 hours ago
ABE seems correct. Why not business key is already discussed. Why not foreign key? one reason: Foreign key constraint is not supported in
dedicated SQL pool.
upvoted 1 times

 
gabdu
3 weeks, 2 days ago
why is there no mention of flag?
upvoted 2 times

 
necktru
4 weeks ago
Selected Answer: ABE
Please, SupplierSystemID is unique in the ERP; that does not mean it must be unique in our DW. That's why we need a surrogate primary key; without it, SCD
Type 2 can't be implemented
upvoted 1 times

 
Andushi
4 weeks ago
Selected Answer: ABE
https://docs.microsoft.com/en-us/learn/modules/populate-slowly-changing-dimensions-azure-synapse-analytics-pipelines/3-choose-between-dimension-types
upvoted 1 times

 
ladywhiteadder
1 month, 4 weeks ago
Selected Answer: ABE
ABE - very clear answer if you know what a type 2 SCD is. you will need a new surrogate key. the business key is already there - it's
SupplierSystemID - and will stay the same over time = will not be unique when anything changes as we will insert a new row then.
upvoted 4 times

 
kilowd
3 months, 3 weeks ago
Selected Answer: ABE
Type 2

In order to support type 2 changes, we need to add four columns to our table:

· Surrogate Key – the original ID will no longer be sufficient to identify the specific record we require, we therefore need to create a new ID that the
fact records can join to specifically.

· Current Flag – A quick method of returning only the current version of each record

· Start Date – The date from which the specific historical version is active

· End Date – The date to which the specific historical version record is active

https://adatis.co.uk/introduction-to-slowly-changing-dimensions-scd-types/
upvoted 4 times

 
KashRaynardMorse
1 month ago
Good answer! It's worth talking about the business key, since that is the controversial bit.

There needs to be something that functions as a business key, in this case it can be the SupplierSystemID.


The Current Flag is not strictly needed, the solution would function okay without it, but I would include it in real life anyway for performance
and ease of querying (it's also not shown as an option).
upvoted 2 times

 
KashRaynardMorse
1 month ago
And to add, the question is what "additional" columns are needed. So emphasising, although a business key is definitely needed, the column
that serves it's purpose is already present (albeit with a different column name), so does not need adding again.
upvoted 1 times

 
stunner85_
3 months, 3 weeks ago
Here's why the answer is Surrogate Key and not Business key:

In a temporal database, it is necessary to distinguish between the surrogate key and the business key. Every row would have both a business key
and a surrogate key. The surrogate key identifies one unique row in the database, the business key identifies one unique entity of the modeled
world.
upvoted 2 times

 
m0rty
3 months, 4 weeks ago
Selected Answer: ABE
correct
upvoted 1 times

 
PallaviPatel
3 months, 4 weeks ago
Selected Answer: ABE
these are correct answers.
upvoted 1 times

 
Hervedoux
4 months, 1 week ago
Totally agree with Chopin, SCD type 2 tables require at least a Business Key column and a start and end date to capture historical dat, thus BCE is
the correct answer.

https://docs.microsoft.com/en-us/sql/integration-services/data-flow/transformations/slowly-changing-dimension-transformation
upvoted 1 times

 
Mahesh_mm
4 months, 4 weeks ago
ABE is correct
upvoted 2 times

 
gitoxam686
5 months ago
Selected Answer: ABE
A B E.... Surrogate Key s required.
upvoted 3 times

 
KevinSames
5 months ago
Selected Answer: ABE
ez ezez
upvoted 1 times

 
m2shines
5 months, 1 week ago
A, B and E
upvoted 1 times

 
Nifl91
5 months, 2 weeks ago
shouldn't it be ABE? we already have a business key! we need a surrogate to use as a primary key when a supplier with updated attributes is to be
inserted into the table
upvoted 1 times

 
assU2
4 months, 2 weeks ago
It's not a business key, its unique. And business key may not be unique because it's 2 SCD. You can have multiple rows for one entity with
different start/end dates.
upvoted 2 times

 
necktru
4 weeks ago
It's unique in the ERP, in the DW can be duplicated, it's why we need the surrogate pk that must be unique, answers are ABE
upvoted 1 times


Question #35 Topic 1

HOTSPOT -

You have a Microsoft SQL Server database that uses a third normal form schema.

You plan to migrate the data in the database to a star schema in an Azure Synapse Analytics dedicated SQL pool.

You need to design the dimension tables. The solution must optimize read operations.

What should you include in the solution? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.

Hot Area:

Correct Answer:

Box 1: Denormalize to a second normal form

Denormalization is the process of transforming higher normal forms to lower normal forms via storing the join of higher normal form relations
as a base relation.

Denormalization increases performance in data retrieval at the cost of introducing update anomalies into a database.

Box 2: New identity columns -

The collapsing relations strategy can be used in this step to collapse classification entities into component entities to obtain flat dimension
tables with single-part keys that connect directly to the fact table. The single-part key is a surrogate key generated to ensure it remains unique
over time.

Example:
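A flattened dimension of this kind could be defined as follows (a minimal sketch with hypothetical names; IDENTITY generates the single-part surrogate key):

CREATE TABLE dbo.DimCustomer
(
    CustomerKey   int IDENTITY(1,1) NOT NULL,  -- surrogate key generated in the warehouse
    CustomerID    nvarchar(20)      NOT NULL,  -- natural key from the source system
    CustomerName  nvarchar(100)     NOT NULL,
    City          nvarchar(50)      NULL,      -- attributes collapsed from related 3NF tables
    StateProvince nvarchar(50)      NULL,
    CountryRegion nvarchar(50)      NULL
)
WITH
(
    DISTRIBUTION = REPLICATE,
    CLUSTERED COLUMNSTORE INDEX
);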


Note: A surrogate key on a table is a column with a unique identifier for each row. The key is not generated from the table data. Data modelers
like to create surrogate keys on their tables when they design data warehouse models. You can use the IDENTITY property to achieve this goal
simply and effectively without affecting load performance.

Reference:

https://www.mssqltips.com/sqlservertip/5614/explore-the-role-of-normal-forms-in-dimensional-modeling/
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-identity

 
JimZhang4123
1 week, 2 days ago
'The solution must optimize read operations.' means denormalization
upvoted 1 times

 
PallaviPatel
3 months, 4 weeks ago
Answer correct.
upvoted 3 times

 
Mahesh_mm
4 months, 4 weeks ago
Answers are correct
upvoted 1 times

 
PallaviPatel
5 months ago
answer is correct
upvoted 4 times

 
moreinva43
5 months, 2 weeks ago
While denormalizing does require implementing a lower level of normalization, the second normal form ONLY applies when a table has a
composite primary key. https://www.geeksforgeeks.org/second-normal-form-2nf/
upvoted 1 times


Question #36 Topic 1

HOTSPOT -

You plan to develop a dataset named Purchases by using Azure Databricks. Purchases will contain the following columns:

✑ ProductID

✑ ItemPrice

✑ LineTotal

✑ Quantity

✑ StoreID

✑ Minute

✑ Month

✑ Hour

✑ Year

✑ Day
You need to store the data to support hourly incremental load pipelines that will vary for each Store ID. The solution must minimize storage costs.

How should you complete the code? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.

Hot Area:

Correct Answer:

Box 1: partitionBy -

We should overwrite at the partition level.

Example:

df.write.partitionBy("y","m","d")

.mode(SaveMode.Append)

.parquet("/data/hive/warehouse/db_name.db/" + tableName)

Box 2: ("StoreID", "Year", "Month", "Day", "Hour")


Box 3: parquet("/Purchases")

Reference:

https://intellipaat.com/community/11744/how-to-partition-and-write-dataframe-in-spark-without-deleting-partitions-with-no-new-data

 
sparkchu
1 month, 3 weeks ago
ans should be saveAsTable. format is defined by format() method.
upvoted 1 times

 
assU2
4 months, 2 weeks ago
Can anyone explain why it's Partitioning and not Bucketing pls?
upvoted 2 times

 
KashRaynardMorse
1 month ago
Bucketing feature (part of data skipping index) was removed and microsoft recommends using DeltaLake, which uses the partition syntax.

https://docs.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/dataskipping-index
upvoted 1 times

 
bhanuprasad9331
3 months, 1 week ago
There should be a different folder for each store. Partitioning will create separate folder for each storeId. In bucketing, multiple stores having
same hash value can be present in the same file, so multiple storeIds can be present under a single file.
upvoted 3 times

 
assU2
4 months, 2 weeks ago
Is it a question of correct syntax (numBuckets int the number of buckets to save) or is it smth else?
upvoted 1 times

 
Mahesh_mm
4 months, 4 weeks ago
Answers are correct
upvoted 4 times

 
Aslam208
5 months, 2 weeks ago
correct
upvoted 4 times


Question #37 Topic 1

You are designing a partition strategy for a fact table in an Azure Synapse Analytics dedicated SQL pool. The table has the following
specifications:

✑ Contain sales data for 20,000 products.

Use hash distribution on a column named ProductID.

✑ Contain 2.4 billion records for the years 2019 and 2020.

Which number of partition ranges provides optimal compression and performance for the clustered columnstore index?

A.
40

B.
240

C.
400

D.
2,400

Correct Answer:
A

Each partition should have around 1 million records. Dedicated SQL pools already split every table into 60 distributions.

We have the formula: Records/(Partitions*60)= 1 million

Partitions= Records/(1 million * 60)

Partitions= 2.4 x 1,000,000,000/(1,000,000 * 60) = 40

Note: Having too many partitions can reduce the effectiveness of clustered columnstore indexes if each partition has fewer than 1 million rows.
Dedicated SQL pools automatically partition your data into 60 databases. So, if you create a table with 100 partitions, the result will be 6000
partitions.

Reference:

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/best-practices-dedicated-sql-pool

Community vote distribution


A (64%) C (36%)

 
Aslam208
Highly Voted 
5 months, 2 weeks ago
correct
upvoted 12 times

 
sdokmak
Most Recent 
1 day, 9 hours ago
Selected Answer: A
quick maths
upvoted 1 times

 
MS_Nikhil
3 weeks, 6 days ago
Selected Answer: A
A is correct
upvoted 1 times

 
Egocentric
1 month, 2 weeks ago
correct
upvoted 1 times

 
Twom
2 months, 1 week ago
Selected Answer: A
Correct
upvoted 2 times

 
jskibick
3 months ago
Selected Answer: A
I am also confused.

So we have 2.400.000.000 rows that are already split in 60 nodes od SQL DW. That makes

40.000.000 per node.

Now is question how to order partitions to obtain efficiency for CCI.

Next, we know the partitions will be divided into CCI segments ~1mln per each. And here is my problem. Because CCI will autosplit data in
partitions into 1mln row segments. We do not have to do it on our own in partitions. I would split data into monthly partitions i.e. #24 for 2 year,
2019 and 2020. The segments will autosplit partitions.

But there is not such answer.

I will have to go with A = 40


upvoted 3 times

 
Justin_beswick
3 months, 2 weeks ago
Selected Answer: C
The Rule is Partitions= Records/(1 million * 60)

24,000,000,000/60,000,000 = 400
upvoted 4 times

 
helpaws
3 months ago
it's 2.4 billion, not 24 billion
upvoted 9 times

 
AlvaroEPMorais
3 months, 1 week ago
The Rule is Partitions= Records/(1 million * 60)

2,400,000,000/60,000,000 = 40
upvoted 8 times

 
dev2dev
4 months, 2 weeks ago
Are you suggesting create 40 partitions on ProductId? This is confusing. If you create 40 partitions, SQL Pool will create 40*60 partitions which is
2400. And documentation says create fewer partitions. If we want to create paritions by year then we can create 2 partitions for two years which
internally creates 2*60 = 120 paritions, but extra 2 paritions for outer boundaries will make it 4*60 = 240. So 240 paritions for 2.4 billion rows is
ideal. But what is confusing me is we creat only 4 paritions which is not even in options
upvoted 2 times

 
Canary_2021
4 months, 2 weeks ago
A distributed table appears as a single table, but the rows are actually stored across 60 distributions.

60 is for distribution, not partition.


upvoted 2 times

 
Muishkin
4 weeks ago
So then how do we calculate the number of partitions?Is'nt it user driven ?
upvoted 1 times

 
Canary_2021
4 months, 2 weeks ago
If you partition your data, each partition will need to have 1 million rows to benefit from a clustered columnstore index. For a table with 100
partitions, it needs to have at least 6 billion rows to benefit from a clustered columns store (60 distributions 100 partitions 1 million rows).
upvoted 1 times

 
Mahesh_mm
4 months, 4 weeks ago
correct
upvoted 1 times


Question #38 Topic 1

HOTSPOT -

You are creating dimensions for a data warehouse in an Azure Synapse Analytics dedicated SQL pool.

You create a table by using the Transact-SQL statement shown in the following exhibit.

Use the drop-down menus to select the answer choice that completes each statement based on the information presented in the graphic.

NOTE: Each correct selection is worth one point.

Hot Area:

Correct Answer:

Box 1: Type 2 -

A Type 2 SCD supports versioning of dimension members. Often the source system doesn't store versions, so the data warehouse load process
detects and manages changes in a dimension table. In this case, the dimension table must use a surrogate key to provide a unique reference to
a version of the dimension member. It also includes columns that define the date range validity of the version (for example, StartDate and
EndDate) and possibly a flag column (for example, IsCurrent) to easily filter by current dimension members.

Incorrect Answers:

A Type 1 SCD always reflects the latest values, and when changes in source data are detected, the dimension table data is overwritten.

Box 2: a business key -

A business key or natural key is an index which identifies uniqueness of a row based on columns that exist naturally in a table according to
business rules. For example business keys are customer code in a customer table, composite of sales order header number and sales order
item line number within a sales order details table.

Reference:

https://docs.microsoft.com/en-us/learn/modules/populate-slowly-changing-dimensions-azure-synapse-analytics-pipelines/3-choose-between-dimension-types

 
nkav
Highly Voted 
1 year ago
product key is a surrogate key as it is an identity column
upvoted 98 times

 
111222333
1 year ago
Agree on the surrogate key, exactly.

"In data warehousing, IDENTITY functionality is particularly important as it makes easier the creation of surrogate keys."

Why ProductKey is certainly not a business key: "The IDENTITY value in Synapse is not guaranteed to be unique if the user explicitly inserts a
duplicate value with 'SET IDENTITY_INSERT ON' or reseeds IDENTITY". Business key is an index which identifies uniqueness of a row and here
Microsoft says that identity doesn't guarantee uniqueness.

References:

https://azure.microsoft.com/en-us/blog/identity-now-available-with-azure-sql-data-warehouse/

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-identity
upvoted 8 times

 
rikku33
8 months ago
Type 2

In order to support type 2 changes, we need to add four columns to our table:

· Surrogate Key – the original ID will no longer be sufficient to identify the specific record we require, we therefore need to create a new ID
that the fact records can join to specifically.

· Current Flag – A quick method of returning only the current version of each record

· Start Date – The date from which the specific historical version is active

· End Date – The date to which the specific historical version record is active

With these elements in place, our table will now look like:
upvoted 4 times

 
sagga
Highly Voted 
1 year ago
Type2 because there are start and end columns and ProductKey is a surrogate key. ProductNumber seems a business key.
upvoted 29 times

 
DrC
12 months ago
The start and end columns are for when to when the product was being sold, not for metadata purposes. That makes it:

Type 1 – No History

Update record directly, there is no record of historical values, only current state
upvoted 40 times

 
Kyle1
8 months, 1 week ago
When the product is not being sold anymore, it becomes a historical record. Hence Type 2.
upvoted 2 times

 
rockyc05
3 months ago
It is type 1 not 2
upvoted 1 times

 
Yuri1101
5 months, 1 week ago
With type 2, you normally don't update any column of a row other than row start date and end date.
upvoted 1 times

 
captainbee
11 months, 3 weeks ago
Exactly how I saw it
upvoted 1 times


 
SandipSingha
Most Recent 
2 weeks, 3 days ago
product key is definitely a surrogate key
upvoted 1 times

 
dazero
3 weeks ago
Definitely Type 1. There are no columns to indicate the different versions of the same business key. The sell start and end date columns are actual
source columns from when the product was sold. The Insert and Update columns are audit columns that explain when a record was inserted for the
first time and when it was updated. So the insert date remains the same, but the updated column is updated every time a Type 1 update occurs.
upvoted 1 times

 
AlCubeHead
2 months ago
Product Key is surrogate NOT business key as it's a derived IDENTITY

Dimension is type 1 as it does not have a StartDate and EndDate associated with data changes and also does not have an IsCurrent flag
upvoted 5 times

 
Shrek66
3 months, 3 weeks ago
Agree with ploer

SCD Type 1

Surrogate
upvoted 4 times

 
ploer
3 months, 3 weeks ago
Surrogate Key and Type 1 SCD. Type 1 SCD because sellenddate and sellstartdate are attributes of the product and not for versioning. rowupdated
and rowinserted are used for scd. And - as the naming indicates- the fact that both have a not null constraint leads to the conclusion that we have
no possibility to store the information which row is the current one. So it must be scd type 1.
upvoted 3 times

 
skkdp203
3 months, 3 weeks ago
SCD Type 1

Surrogate

https://docs.microsoft.com/en-us/learn/modules/populate-slowly-changing-dimensions-azure-synapse-analytics-pipelines/3-choose-between-
dimension-types

Surrogate keys in dimension tables

It is critical that the primary key’s value of a dimension table remains unchanged. And it is highly recommended that all dimension tables use
surrogate keys as primary keys.

Surrogate keys are key generated and managed inside the data warehouse rather than keys extracted from data source systems.
upvoted 3 times

 
dev2dev
3 months, 3 weeks ago
identity/surrogate key's can be a business key in transition tables but in dw it can be used only as surrogate key.
upvoted 1 times

 
assU2
4 months, 2 weeks ago
Maybe it's type 2 because the logic is: we can have multiple rows with one productID, different surrogate keys and different start/end sale dates.

key | id | start sale | end sale |

1 | 998 | 2021-01-01 | 2021-02-01 |

2 | 998 | 2022-01-01 | 9999-12-31 | <-- current


upvoted 2 times

 
assU2
4 months, 2 weeks ago
Where are these answers from? Why there are so many mistakes? ProductKey is obviously a surrogate key
upvoted 1 times

 
alex623
4 months, 2 weeks ago
It seems like Type 1: There is no flag to inform if the record is the current record, also the date column is just for modified date
upvoted 1 times

 
Boompiee
2 weeks, 1 day ago
The flag is commonly used, but not required.
upvoted 1 times

 
Mahesh_mm
4 months, 4 weeks ago
Type 2 and surrogate key
upvoted 2 times

 
Ayan3B
5 months, 2 weeks ago
when table being created rowinsertdatetime and rowupdatedatetime attribute has been kept along with ETL identifier attribute so no previous
version data would be kept. So type 1 is answer. Type 2 keep the older version information at row level along with start date and end date and type
3 keeps column level restricted old version of data.

Second answer would be surrogate key as product key generated with IDENTITY
upvoted 4 times

 
satyamkishoresingh
5 months, 3 weeks ago
What is type 0 ?

upvoted 1 times

 
DrTaz
4 months, 3 weeks ago
SCD type 0 us a constant value that never changes.
upvoted 1 times

 
jay5518
6 months ago
This was on test today
upvoted 1 times

 
ohana
7 months ago
Took the exam today, this question came out.

Type2, Surrogate Key


upvoted 7 times


Question #39 Topic 1

You are designing a fact table named FactPurchase in an Azure Synapse Analytics dedicated SQL pool. The table contains purchases from
suppliers for a retail store. FactPurchase will contain the following columns.

FactPurchase will have 1 million rows of data added daily and will contain three years of data.

Transact-SQL queries similar to the following query will be executed daily.

SELECT SupplierKey, StockItemKey, COUNT(*)
FROM FactPurchase
WHERE DateKey >= 20210101
  AND DateKey <= 20210131
GROUP BY SupplierKey, StockItemKey

Which table distribution will minimize query times?

A.
replicated

B.
hash-distributed on PurchaseKey

C.
round-robin

D.
hash-distributed on DateKey

Correct Answer:
B

Hash-distributed tables improve query performance on large fact tables, and are the focus of the referenced article. Round-robin tables are useful for
improving loading speed.

Incorrect:

Not D: Do not use a date column. All data for the same date lands in the same distribution. If several users are all filtering on the same date,
then only 1 of the 60 distributions does all the processing work.
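For illustration, a minimal sketch of the implied table definition (only the key columns are shown; the remaining columns are omitted):

CREATE TABLE dbo.FactPurchase
(
    PurchaseKey  bigint NOT NULL,
    DateKey      int    NOT NULL,
    SupplierKey  int    NOT NULL,
    StockItemKey int    NOT NULL
    -- remaining measure columns omitted
)
WITH
(
    DISTRIBUTION = HASH(PurchaseKey),
    CLUSTERED COLUMNSTORE INDEX
);

Because the table is distributed on PurchaseKey rather than DateKey, the monthly filter is processed by all 60 distributions in parallel instead of landing on a single distribution.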

Reference:

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute

Community vote distribution


B (88%) 13%

 
AugustineUba
Highly Voted 
9 months, 3 weeks ago
From the documentation the answer is clear enough. B is the right answer.

When choosing a distribution column, select a distribution column that: "Is not a date column. All data for the same date lands in the same
distribution. If several users are all filtering on the same date, then only 1 of the 60 distributions do all the processing work."
upvoted 33 times

 
YipingRuan
7 months, 1 week ago
To minimize data movement, select a distribution column that:

Is used in JOIN, GROUP BY, DISTINCT, OVER, and HAVING clauses.


"PurchaseKey" is not used in the group by


upvoted 6 times

 
YipingRuan
7 months, 1 week ago
Consider using the round-robin distribution for your table in the following scenarios:

When getting started as a simple starting point since it is the default

If there is no obvious joining key

If there is no good candidate column for hash distributing the table

If the table does not share a common join key with other tables

If the join is less significant than other joins in the query


upvoted 5 times

 
waterbender19
Highly Voted 
9 months, 3 weeks ago
I think the answer should be D for that specific query. If you look at the datatypes, DateKey is an INT datatype not a DATE datatype.
upvoted 13 times

 
kamil_k
2 months, 1 week ago
n.b. if we look at the example query itself the date range is 31 days so we will use 31 distributions out of 60, and only process ~31 million
records
upvoted 1 times

 
waterbender19
9 months, 3 weeks ago
and the statement that the fact table will have 1 million rows added daily means that each DateKey value has an equal amount of rows associated with it.
upvoted 5 times

 
Lucky_me
4 months, 2 weeks ago
But the DateKey is used in the WHERE clause.
upvoted 2 times

 
kamil_k
2 months, 1 week ago
I agree, date key is int, and besides, even if it was a date, when you query a couple days then 1 million rows per distribution is not that
much. So what if you are going to use only a couple distributions to do the job? Isn't it still faster than using all distributions to process all
of the records to get the required date range?
upvoted 1 times

 
AnandEMani
8 months, 3 weeks ago
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute this link says date filed ,
NOT a date Data type. B is correct
upvoted 3 times

 
Ramkrish39
Most Recent 
2 months, 1 week ago
Agree B is the right answer
upvoted 1 times

 
PallaviPatel
3 months, 4 weeks ago
Selected Answer: C
I will go with round robin.

''Consider using the round-robin distribution for your table in the following scenarios:

When getting started as a simple starting point since it is the default

If there is no obvious joining key

If there is no good candidate column for hash distributing the table

If the table does not share a common join key with other tables

If the join is less significant than other joins in the query


upvoted 1 times

 
yovi
4 months, 4 weeks ago
Anyone, when you finish an exam, do they give you the correct answers in the end?
upvoted 1 times

 
dev2dev
4 months, 2 weeks ago
those finished exam will not know the answer. because answers are not reveled
upvoted 1 times

 
Mahesh_mm
4 months, 4 weeks ago
B is correct ans
upvoted 1 times

 
danish456
5 months ago
Selected Answer: B
It's correct
upvoted 1 times


 
trietnv
5 months ago
Selected Answer: B
1. choose distribution b/c "joining a round-robin table usually requires reshuffling the rows, which is a performance hit"

2. Choose PurchaseKey b/c "not used in WHERE"

refer:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute

and

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute
upvoted 2 times

 
Aslam208
5 months, 2 weeks ago
Selected Answer: B
B is correct
upvoted 1 times

 
Hervedoux
6 months ago
Selected Answer: B
Its cleary a hash on purchasekey column
upvoted 3 times

 
ohana
7 months ago
Took the exam today, this question came out.

Ans: B
upvoted 5 times

 
Marcus1612
8 months, 2 weeks ago
To optimize the MPP, data has to be distributed evenly. DateKey is not a good candidate because the data will be distributed evenly, one day per
60 days. In practice, if many users query the fact table to retrieve the data about the week before, only 7 nodes will process the queries instead of
60. According to microsoft documentation:"To balance the parallel processing, select a distribution column that .. Is not a date column. All data for
the same date lands in the same distribution. If several users are all filtering on the same date, then only 1 of the 60 distributions do all the
processing work.
upvoted 4 times

 
Marcus1612
8 months, 2 weeks ago
the good answer is B
upvoted 2 times

 
andimohr
10 months ago
The reference given in the answer is precise: Choose a distribution column with data that a) distributes evenly b) has many unique values c) does
not have NULLs or few NULLs and d) IS NOT A DATE COLUMN... definitely the best choice for the Hash distribution is on the Identity column.
upvoted 4 times

 
noone_a
10 months, 3 weeks ago
Although it's a fact table, replicated is the correct distribution in this case.

Each row is 141 bytes in size x 1,000,000 records = ~135 MB total size.

Microsoft recommends replicated distribution for anything under 2 GB.

We have no further information regarding table growth so this answer is based only on the info provided.
upvoted 1 times

 
noone_a
10 months, 3 weeks ago
edit, this is incorrect as it will have 1 million records added daily for 3 years, putting it over 2GB
upvoted 4 times

 
vlad888
11 months ago
Yes - do not use a date column - there is such a recommendation in the Synapse docs. But here we have a range search - potentially several nodes will be used.
upvoted 1 times

 
vlad888
11 months ago
Actually it is clear that it should be hash-distributed. BUT Product key brings no benefit for this query - it doesn't participate in it at all. So - DateKey. Although it is unusual for Synapse.
upvoted 4 times

 
savin
11 months, 1 week ago
I don't think there is enough information to decide this. Also, we cannot decide it by just looking at one query. Considering only this query, and if we assume no other dimensions are connected to this fact table, a good answer would be D.
upvoted 2 times


Question #40 Topic 1

You are implementing a batch dataset in the Parquet format.

Data files will be produced be using Azure Data Factory and stored in Azure Data Lake Storage Gen2. The files will be consumed by an Azure
Synapse Analytics serverless SQL pool.

You need to minimize storage costs for the solution.

What should you do?

A.
Use Snappy compression for files.

B.
Use OPENROWSET to query the Parquet files.

C.
Create an external table that contains a subset of columns from the Parquet files.

D.
Store all data as string in the Parquet files.

Correct Answer:
C

An external table points to data located in Hadoop, Azure Storage blob, or Azure Data Lake Storage. External tables are used to read data from
files or write data to files in Azure Storage. With Synapse SQL, you can use external tables to read external data using dedicated SQL pool or
serverless SQL pool.

Reference:

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables
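
For reference, option C as described above corresponds to a statement of roughly this shape (a sketch only: the table, data source, file format, and column names are placeholder assumptions, and creating the table does not by itself alter the files already in storage):

CREATE EXTERNAL TABLE dbo.SalesSubset
(
    SaleId INT,
    SaleDate DATE,
    Amount DECIMAL(18, 2)
)
WITH
(
    LOCATION = '/sales/',               -- folder of Parquet files in the data lake
    DATA_SOURCE = MyLakeSource,         -- external data source created beforehand
    FILE_FORMAT = MyParquetFormat       -- external file format created beforehand
);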

Community vote distribution


A (37%) C (32%) B (32%)

 
m2shines
Highly Voted 
5 months, 1 week ago
Answer should be A, because this talks about minimizing storage costs, not querying costs
upvoted 22 times

 
assU2
4 months, 2 weeks ago
Isn't snappy a default compressionCodec for parquet in azure?

https://docs.microsoft.com/en-us/azure/data-factory/format-parquet
upvoted 8 times

 
Aslam208
Highly Voted 
5 months, 2 weeks ago
C is the correct answer, as an external table with a subset of columns with parquet files would be cost-effective.
upvoted 13 times

 
RehanRajput
3 days, 17 hours ago
This is not correct.

1. External tables are not saved in the database. (This is why they're external)

2. You're assuming that the SQL Serverless pools have a local storage. They don't -- > https://docs.microsoft.com/en-us/azure/synapse-
analytics/sql/best-practices-serverless-sql-pool
upvoted 1 times

 
Massy
3 weeks, 5 days ago
In serverless SQL pool you don't create a copy of the data, so how could it be cost effective?
upvoted 1 times

 
sdokmak
Most Recent 
1 day, 8 hours ago
Selected Answer: B
I agree with Canary2021
upvoted 1 times

 
rohitbinnani
1 month, 2 weeks ago
Selected Answer: C
not A - The default compression for a parquet file is SNAPPY. Even in Python as well.

C - because an external table that contains a subset of columns from the Parquet files will not need to re-save them in databases, and that would save storage costs.
upvoted 6 times

 
RehanRajput
3 days, 17 hours ago
This is not correct.

1. External tables are not saved in the database. (This is why they're external)

2. You're assuming that the SQL Serverless pools have a local storage. They don't -- > https://docs.microsoft.com/en-us/azure/synapse-
analytics/sql/best-practices-serverless-sql-pool
upvoted 1 times


 
DingDongSingSong
2 months ago
The answer is NOT A. Snappy compression offers fast compression, but file size at rest is larger which will translate into higher storage cost. The
answer is C where an external table with requisite columns is made available which will reduce the amount of storage
upvoted 4 times

 
cotillion
3 months, 2 weeks ago
Selected Answer: A
Only A has sth to do with the storage
upvoted 1 times

 
PallaviPatel
3 months, 4 weeks ago
A looks to be correct.
upvoted 1 times

 
dev2dev
4 months, 1 week ago
Since this is a batch process, we can delete the files once they are loaded, and we can't avoid some initial/temporary storage cost for loading the data, even with the most optimized Parquet format and compression option. So the best approach would be to store only the required columns, which can save storage. However, we can always use OPENROWSET if we are not interested in persisting the data. Yeah, like someone said, this is a shitty question with shitty options.
upvoted 2 times

 
Ramkrish39
2 months, 1 week ago
OPENROWSET is for JSON files
upvoted 1 times

 
bhushanhegde
4 months, 2 weeks ago
As per the documentation, A is the correct answer

https://docs.microsoft.com/en-us/azure/data-factory/format-parquet#dataset-properties
upvoted 1 times

 
Jaws1990
4 months, 2 weeks ago
Selected Answer: A
creating an external table with fewer columns than the file has no effect on the file itself and will actually fail so in no way helps with storage costs.

See MS documentation "The column definitions, including the data types and number of columns, must match the data in the external files. If
there's a mismatch, the file rows will be rejected when querying the actual data."
upvoted 6 times

 
Canary_2021
4 months, 3 weeks ago
Selected Answer: B
In order to query data from an external table, you need to create these 3 items. I feel that they all cost some storage.

CREATE EXTERNAL DATA SOURCE

CREATE EXTERNAL FILE FORMAT

CREATE EXTERNAL TABLE

If using OPENROWSET, you don't need to create anything, so I select B.
upvoted 5 times

 
ploer
3 months, 3 weeks ago
But this has nothing to do with storage costs. Only some bytes in the data dictionary are added and you are not even charged for this.
upvoted 1 times

 
sdokmak
1 day, 8 hours ago
no storage cost = WIN :)
upvoted 1 times

 
TestMitch
5 months, 1 week ago
This question is garbage.
upvoted 7 times

 
assU2
4 months, 2 weeks ago
Like many others...
upvoted 2 times

 
Jerrylolu
5 months, 1 week ago
That is correct. Looks like whoever put it here didn't remember it clearly.
upvoted 2 times

 
vijju23
5 months, 2 weeks ago
Answer is B, which is best as per storage cost. The reason: we query the Parquet files only when needed, using OPENROWSET.
upvoted 7 times


Question #41 Topic 1

DRAG DROP -

You need to build a solution to ensure that users can query specific files in an Azure Data Lake Storage Gen2 account from an Azure Synapse
Analytics serverless SQL pool.

Which three actions should you perform in sequence? To answer, move the appropriate actions from the list of actions to the answer area and
arrange them in the correct order.

NOTE: More than one order of answer choices is correct. You will receive credit for any of the correct orders you select.

Select and Place:

Correct Answer:

Step 1: Create an external data source

You can create external tables in Synapse SQL pools via the following steps:

1. CREATE EXTERNAL DATA SOURCE to reference an external Azure storage and specify the credential that should be used to access the
storage.

2. CREATE EXTERNAL FILE FORMAT to describe format of CSV or Parquet files.

3. CREATE EXTERNAL TABLE on top of the files placed on the data source with the same file format.

Step 2: Create an external file format object

Creating an external file format is a prerequisite for creating an external table.

Step 3: Create an external table

Reference:

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables
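
Put together, the three steps map to three T-SQL statements along these lines (a condensed sketch: every object name, the storage URL, and the column list are placeholders):

-- Step 1: external data source pointing at the storage account
CREATE EXTERNAL DATA SOURCE MyLakeSource
WITH ( LOCATION = 'https://mystorageaccount.dfs.core.windows.net/mycontainer' );

-- Step 2: external file format describing the files
CREATE EXTERNAL FILE FORMAT MyParquetFormat
WITH ( FORMAT_TYPE = PARQUET );

-- Step 3: external table on top of the files
CREATE EXTERNAL TABLE dbo.MyExternalTable
(
    Id INT,
    Name VARCHAR(100)
)
WITH
(
    LOCATION = '/data/',
    DATA_SOURCE = MyLakeSource,
    FILE_FORMAT = MyParquetFormat
);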

 
avijitd
Highly Voted 
5 months, 2 weeks ago
Looks correct answer
upvoted 12 times

 
SandipSingha
Most Recent 
2 weeks, 3 days ago
correct
upvoted 1 times

 
lotuspetall
2 months, 1 week ago
correct
upvoted 1 times

 
PallaviPatel
3 months, 4 weeks ago

correct
upvoted 2 times

 
ANath
4 months, 3 weeks ago
Correct
upvoted 1 times

 
gf2tw
5 months, 2 weeks ago
Correct
upvoted 1 times


Question #42 Topic 1

You are designing a data mart for the human resources (HR) department at your company. The data mart will contain employee information and
employee transactions.

From a source system, you have a flat extract that has the following fields:

✑ EmployeeID

✑ FirstName

✑ LastName

✑ Recipient

✑ GrossAmount

✑ TransactionID
✑ GovernmentID
✑ NetAmountPaid

✑ TransactionDate

You need to design a star schema data model in an Azure Synapse Analytics dedicated SQL pool for the data mart.

Which two tables should you create? Each correct answer presents part of the solution.

NOTE: Each correct selection is worth one point.

A.
a dimension table for Transaction

B.
a dimension table for EmployeeTransaction

C.
a dimension table for Employee

D.
a fact table for Employee

E.
a fact table for Transaction

Correct Answer:
CE

C: Dimension tables contain attribute data that might change but usually changes infrequently. For example, a customer's name and address
are stored in a dimension table and updated only when the customer's profile changes. To minimize the size of a large fact table, the customer's
name and address don't need to be in every row of a fact table. Instead, the fact table and the dimension table can share a customer ID. A query
can join the two tables to associate a customer's profile and transactions.

E: Fact tables contain quantitative data that are commonly generated in a transactional system, and then loaded into the dedicated SQL pool.
For example, a retail business generates sales transactions every day, and then loads the data into a dedicated SQL pool fact table for analysis.

Reference:

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-overview
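
As a rough illustration of how the flat extract splits across the two tables (the data types and distribution choices are assumptions, not part of the question):

-- Dimension: one row per employee, descriptive attributes
CREATE TABLE dbo.DimEmployee
(
    EmployeeKey INT IDENTITY(1,1) NOT NULL,   -- surrogate key
    EmployeeID INT NOT NULL,                  -- business key from the source
    FirstName VARCHAR(100),
    LastName VARCHAR(100),
    GovernmentID VARCHAR(50)
)
WITH ( DISTRIBUTION = REPLICATE );

-- Fact: one row per transaction, measures plus keys
CREATE TABLE dbo.FactTransaction
(
    TransactionID INT NOT NULL,
    EmployeeKey INT NOT NULL,                 -- joins to DimEmployee
    TransactionDate DATE NOT NULL,
    Recipient VARCHAR(200),
    GrossAmount DECIMAL(18, 2),
    NetAmountPaid DECIMAL(18, 2)
)
WITH ( DISTRIBUTION = HASH(TransactionID) );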

Community vote distribution


CE (100%)

 
avijitd
Highly Voted 
5 months, 2 weeks ago
Correct Answer . Emp info as Dimension & trans table as fact
upvoted 7 times

 
SandipSingha
Most Recent 
2 weeks, 3 days ago
correct
upvoted 1 times

 
tg2707
3 weeks, 2 days ago
why not fact table for employee and dim table for transactions
upvoted 1 times

 
Egocentric
1 month, 1 week ago
CE is correct
upvoted 1 times

 
NewTuanAnh
1 month, 2 weeks ago
Selected Answer: CE
CE is the correct answer
upvoted 2 times


 
SebK
2 months ago
Selected Answer: CE
CE is correct
upvoted 1 times

 
surya610
3 months ago
Selected Answer: CE
Dimension for employee and fact for transactions.
upvoted 1 times

 
PallaviPatel
3 months, 4 weeks ago
Selected Answer: CE
correct
upvoted 1 times

 
gf2tw
5 months, 2 weeks ago
Correct
upvoted 2 times


Question #43 Topic 1

You are designing a dimension table for a data warehouse. The table will track the value of the dimension attributes over time and preserve the
history of the data by adding new rows as the data changes.

Which type of slowly changing dimension (SCD) should you use?

A.
Type 0

B.
Type 1

C.
Type 2

D.
Type 3

Correct Answer:
C

A Type 2 SCD supports versioning of dimension members. Often the source system doesn't store versions, so the data warehouse load process
detects and manages changes in a dimension table. In this case, the dimension table must use a surrogate key to provide a unique reference to
a version of the dimension member. It also includes columns that define the date range validity of the version (for example, StartDate and
EndDate) and possibly a flag column (for example,

IsCurrent) to easily filter by current dimension members.

Incorrect Answers:

B: A Type 1 SCD always reflects the latest values, and when changes in source data are detected, the dimension table data is overwritten.

D: A Type 3 SCD supports storing two versions of a dimension member as separate columns. The table includes a column for the current value
of a member plus either the original or previous value of the member. So Type 3 uses additional columns to track one key instance of history,
rather than storing additional rows to track each change like in a Type 2 SCD.

Reference:

https://docs.microsoft.com/en-us/learn/modules/populate-slowly-changing-dimensions-azure-synapse-analytics-pipelines/3-choose-between-
dimension-types
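
A Type 2 dimension shaped as described above typically looks like this (a sketch: the entity, column names, and the optional IsCurrent flag are illustrative assumptions):

CREATE TABLE dbo.DimCustomer
(
    CustomerKey INT IDENTITY(1,1) NOT NULL,   -- surrogate key, unique per version of the member
    CustomerID INT NOT NULL,                  -- business key, repeated across versions
    CustomerName VARCHAR(200),
    City VARCHAR(100),
    StartDate DATE NOT NULL,                  -- when this version became valid
    EndDate DATE NULL,                        -- NULL (or 9999-12-31) for the current version
    IsCurrent BIT NOT NULL                    -- optional convenience flag
);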

Community vote distribution


C (100%)

 
gf2tw
Highly Voted 
5 months, 2 weeks ago
Correct
upvoted 12 times

 
SandipSingha
Most Recent 
2 weeks, 3 days ago
correct
upvoted 1 times

 
SandipSingha
2 weeks, 3 days ago
correct
upvoted 1 times

 
AZ9997989798979789798979789797
3 weeks, 1 day ago
Correct
upvoted 1 times

 
Onobhas01
1 month, 3 weeks ago
Selected Answer: C
Correct!
upvoted 1 times

 
surya610
3 months ago
Selected Answer: C
Correct
upvoted 1 times

 
PallaviPatel
3 months, 4 weeks ago
Selected Answer: C
correct
upvoted 1 times

 
saupats
4 months, 2 weeks ago
correct


upvoted 1 times

 
ANath
4 months, 3 weeks ago
correct
upvoted 1 times


Question #44 Topic 1

DRAG DROP -

You have data stored in thousands of CSV files in Azure Data Lake Storage Gen2. Each file has a header row followed by a properly formatted carriage return (\r) and line feed (\n).

You are implementing a pattern that batch loads the files daily into an enterprise data warehouse in Azure Synapse Analytics by using PolyBase.

You need to skip the header row when you import the files into the data warehouse. Before building the loading pattern, you need to prepare the
required database objects in Azure Synapse Analytics.

Which three actions should you perform in sequence? To answer, move the appropriate actions from the list of actions to the answer area and
arrange them in the correct order.

NOTE: Each correct selection is worth one point

Select and Place:

Correct Answer:

Step 1: Create an external data source that uses the abfs location

Create External Data Source to reference Azure Data Lake Store Gen 1 or 2

Step 2: Create an external file format and set the First_Row option.

Create External File Format.

Step 3: Use CREATE EXTERNAL TABLE AS SELECT (CETAS) and configure the reject options to specify reject values or percentages

To use PolyBase, you must create external tables to reference your external data.

Use reject options.

Note: REJECT options don't apply at the time this CREATE EXTERNAL TABLE AS SELECT statement is run. Instead, they're specified here so that
the database can use them at a later time when it imports data from the external table. Later, when the CREATE TABLE AS SELECT statement
selects data from the external table, the database will use the reject options to determine the number or percentage of rows that can fail to
import before it stops the import.

Reference:

https://docs.microsoft.com/en-us/sql/relational-databases/polybase/polybase-t-sql-objects https://docs.microsoft.com/en-us/sql/t-
sql/statements/create-external-table-as-select-transact-sql
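
The header-skip and reject settings from steps 2 and 3 would look roughly like this (the object names, location, and reject threshold are placeholder assumptions):

CREATE EXTERNAL FILE FORMAT CsvSkipHeader
WITH
(
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS ( FIELD_TERMINATOR = ',', STRING_DELIMITER = '"', FIRST_ROW = 2 )  -- FIRST_ROW = 2 skips the header row
);

CREATE EXTERNAL TABLE dbo.StagingSales
(
    SaleId INT,
    SaleDate DATE,
    Amount DECIMAL(18, 2)
)
WITH
(
    LOCATION = '/sales/',
    DATA_SOURCE = MyAbfsSource,      -- data source created with the abfss:// location
    FILE_FORMAT = CsvSkipHeader,
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 100               -- tolerate up to 100 bad rows before failing the load
);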


 
avijitd
Highly Voted 
5 months, 2 weeks ago
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables?tabs=hadoop#create-external-data-source

Hadoop external data source in dedicated SQL pool for Azure Data Lake Gen2 pointing

CREATE DATABASE SCOPED CREDENTIAL [ADLS_credential]

WITH IDENTITY='SHARED ACCESS SIGNATURE',

SECRET = 'sv=2018-03-28&ss=bf&srt=sco&sp=rl&st=2019-10-14T12%3A10%3A25Z&se=2061-12-
31T12%3A10%3A00Z&sig=KlSU2ullCscyTS0An0nozEpo4tO5JAgGBvw%2FJX2lguw%3D'

GO

CREATE EXTERNAL DATA SOURCE AzureDataLakeStore


WITH

-- Please note the abfss endpoint when your account has secure transfer enabled

( LOCATION = 'abfss://data@newyorktaxidataset.dfs.core.windows.net' ,

CREDENTIAL = ADLS_credential ,

TYPE = HADOOP

) ;

So I guess 1. DB scoped credential

2.external DS

3.External file as mentioned by @alex


upvoted 27 times

 
Fer079
Highly Voted 
3 months, 3 weeks ago
The right answer should be:

1) Create database scoped credential

2)Create External data source

3) Create External File

"Create external table as select (CETAS)" makes no sense in this case because we would need to include a SELECT to fill the external table; however, this data must come from files and not from other tables. In this case an "external table" is not the same as an "external table as select": for the first one the data comes from files, and for the second one the data comes from a SQL query that exports it into files.
upvoted 11 times

 
ravi2931
1 month, 3 weeks ago
I was thinking same and its obvious
upvoted 1 times

 
Genere
Most Recent 
1 month, 3 weeks ago
"CETAS : Creates an external table and THEN EXPERTS, in parallel, the results of a Transact-SQL SELECT statement to Hadoop or Azure Blob
storage."

We are not looking here to export data but rather to consume data from ADLS.

The right answer should be:

1) Create database scoped credential

2)Create External data source

3) Create External File


upvoted 4 times

 
wwdba
2 months ago
1. Create database scoped credential

2. Create External data source

3. Create External File

You are implementing a pattern that batch loads the files daily...so "Create external table as select" is wrong because it'll load the data into Synapse
only once
upvoted 3 times

 
DingDongSingSong
2 months ago
According to this link, when using Polybase: https://docs.microsoft.com/en-us/sql/relational-databases/polybase/polybase-configure-sql-server?
view=sql-server-ver15

Step 1: CREATE DATABASE SCOPED CREDENTIAL (Transact-SQL)

Step 2: CREATE EXTERNAL DATA SOURCE (Transact-SQL)

Step 3: CREATE EXTERNAL TABLE (Transact-SQL)

Therefore, answer is A,B,C. Correct?


upvoted 1 times

 
ovokpus
3 months ago
Why should you be the one to create the database scoped credential? You ought to have that already
upvoted 1 times

 
VeroDon
4 months, 3 weeks ago
"Azure Synapse Analytics uses a database scoped credential to access non-public Azure blob storage with PolyBase" The question doesnt mention
if the storage is or is not private

https://docs.microsoft.com/en-us/sql/t-sql/statements/create-database-scoped-credential-transact-sql?view=sql-server-ver15
upvoted 3 times

 
ANath
4 months, 1 week ago


Now that clears up the confusion about whether a database scoped credential is needed in this context or not. By going through VeroDon's provided link, it is clear that a database scoped credential is needed for non-public Azure Blob storage.
upvoted 1 times

 
VeroDon
4 months, 3 weeks ago
Correct

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/load-data-overview
upvoted 2 times

 
alexleonvalencia
5 months, 2 weeks ago
Step 1 : Create External data source ...

Step 2 : Create External File ....

Step 3 : Use Create External table ...


upvoted 3 times

 
alexleonvalencia
5 months, 2 weeks ago
Correction:

1) Create database scoped credential

2)Create External data source

3) Create External File


upvoted 7 times


Question #45 Topic 1

HOTSPOT -

You are building an Azure Synapse Analytics dedicated SQL pool that will contain a fact table for transactions from the first half of the year 2020.

You need to ensure that the table meets the following requirements:

✑ Minimizes the processing time to delete data that is older than 10 years

✑ Minimizes the I/O for queries that use year-to-date values

How should you complete the Transact-SQL statement? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.

Hot Area:


Correct Answer:

Box 1: PARTITION -

RANGE RIGHT FOR VALUES is used with PARTITION.

Part 2: [TransactionDateID]

Partition on the date column.

Example: Creating a RANGE RIGHT partition function on a datetime column

The following partition function partitions a table or index into 12 partitions, one for each month of a year's worth of values in a datetime
column.

CREATE PARTITION FUNCTION [myDateRangePF1] (datetime)

AS RANGE RIGHT FOR VALUES ('20030201', '20030301', '20030401',

'20030501', '20030601', '20030701', '20030801',

'20030901', '20031001', '20031101', '20031201');

Reference:

https://docs.microsoft.com/en-us/sql/t-sql/statements/create-partition-function-transact-sql
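
In the CREATE TABLE form the question uses, the same idea is expressed inline (a sketch: the surrounding columns, the distribution choice, and the exact boundary values are assumptions):

CREATE TABLE dbo.FactTransaction
(
    TransactionKey INT NOT NULL,
    TransactionDateID INT NOT NULL,   -- e.g. 20200101 for 1 January 2020
    Amount DECIMAL(18, 2)
)
WITH
(
    DISTRIBUTION = HASH(TransactionKey),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION
    (
        TransactionDateID RANGE RIGHT FOR VALUES
        (20200101, 20200201, 20200301, 20200401, 20200501, 20200601)
    )
);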

 
gf2tw
Highly Voted 
5 months, 2 weeks ago
Correct
upvoted 6 times

 
gabdu
Most Recent 
3 weeks, 2 days ago
How are we ensuring "Minimizes the processing time to delete data that is older than 10 years"?
upvoted 2 times

 
wwdba
2 months, 2 weeks ago
Correct
upvoted 1 times

 
PallaviPatel
3 months, 3 weeks ago
correct
upvoted 1 times

 
saupats
4 months, 2 weeks ago
correct
upvoted 1 times


Question #46 Topic 1

You are performing exploratory analysis of the bus fare data in an Azure Data Lake Storage Gen2 account by using an Azure Synapse Analytics
serverless SQL pool.

You execute the Transact-SQL query shown in the following exhibit.

What do the query results include?

A.
Only CSV files in the tripdata_2020 subfolder.

B.
All files that have file names that begin with "tripdata_2020".

C.
All CSV files that have file names that contain "tripdata_2020".

D.
Only CSV files that have file names that begin with "tripdata_2020".

Correct Answer:
D
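
The exhibit is not reproduced here, but a serverless SQL pool query of the shape the answer describes would look roughly like the following (the storage URL and folder are placeholder assumptions; the key detail is that the 'tripdata_2020*.csv' pattern matches only CSV files whose names begin with tripdata_2020):

SELECT *
FROM OPENROWSET(
    BULK 'https://mystorageaccount.dfs.core.windows.net/fares/tripdata_2020*.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE
) AS [rows];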

Community vote distribution


D (100%)

 
gf2tw
Highly Voted 
5 months, 2 weeks ago
Correct
upvoted 10 times

 
Egocentric
Most Recent 
1 month, 1 week ago
on this one you need to pay attention to wording
upvoted 1 times

 
jskibick
1 month, 2 weeks ago
Selected Answer: D
D all good
upvoted 1 times

 
sarapaisley
1 month, 2 weeks ago
Selected Answer: D
D is correct
upvoted 1 times

 
SebK
2 months ago
Selected Answer: D
Correct
upvoted 1 times

 
DingDongSingSong
2 months ago
Why is option C not correct, when the code has "tripdata_2020*.csv", which means that a wildcard is used with "tripdata_2020" CSV files? So, for example, tripdata_2020A.csv, tripdata_2020B.csv, and tripdata_2020YZ.csv would all 3 be queried. Option D does not make sense, even grammatically.
upvoted 1 times


 
PallaviPatel
3 months, 3 weeks ago
Selected Answer: D
correct
upvoted 2 times

 
anto69
4 months, 2 weeks ago
No doubts is correct, no doubts is ans D
upvoted 1 times

 
duds19
5 months, 2 weeks ago
Why not B?
upvoted 1 times

 
Nifl91
5 months, 1 week ago
Because of the .csv at the end
upvoted 3 times


Question #47 Topic 1

DRAG DROP -

You use PySpark in Azure Databricks to parse the following JSON input.

You need to output the data in the following tabular format.

How should you complete the PySpark code? To answer, drag the appropriate values to the correct targets. Each value may be used once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content.

NOTE: Each correct selection is worth one point.

Select and Place:

Correct Answer:

Box 1: select -

Box 2: explode -


Box 3: alias -

pyspark.sql.Column.alias returns this column aliased with a new name or names (in the case of expressions that return more than one column,
such as explode).

Reference:

https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.Column.alias.html https://docs.microsoft.com/en-
us/azure/databricks/sql/language-manual/functions/explode

 
galacaw
4 weeks ago
Correct
upvoted 2 times


Question #48 Topic 1

HOTSPOT -

You are designing an application that will store petabytes of medical imaging data.

When the data is first created, the data will be accessed frequently during the first week. After one month, the data must be accessible within 30
seconds, but files will be accessed infrequently. After one year, the data will be accessed infrequently but must be accessible within five minutes.

You need to select a storage strategy for the data. The solution must minimize costs.

Which storage tier should you use for each time frame? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.

Hot Area:

Correct Answer:

Box 1: Hot -

Hot tier - An online tier optimized for storing data that is accessed or modified frequently. The Hot tier has the highest storage costs, but the
lowest access costs.


Box 2: Cool -
Cool tier - An online tier optimized for storing data that is infrequently accessed or modified. Data in the Cool tier should be stored for a
minimum of 30 days. The

Cool tier has lower storage costs and higher access costs compared to the Hot tier.

Box 3: Cool -
Not Archive tier - An offline tier optimized for storing data that is rarely accessed, and that has flexible latency requirements, on the order of
hours. Data in the

Archive tier should be stored for a minimum of 180 days.

Reference:

https://docs.microsoft.com/en-us/azure/storage/blobs/access-tiers-overview https://www.altaro.com/hyper-v/azure-archive-storage/

 
nefarious_smalls
2 weeks, 4 days ago
Why would it not be be Hot Cool and Archive
upvoted 1 times

 
SandipSingha
2 weeks, 2 days ago
After one year, the data will be accessed infrequently but must be accessible within five minutes.
upvoted 2 times

 
Guincimund
2 weeks, 3 days ago
"After one year, the data will be accessed infrequently but must be accessible within five minutes"

The latency for the first byte is "hours" for Archive, so because they want to be able to access the data within 5 minutes, you need to place it in "cool".

So the answer is correct.


upvoted 2 times

 
nefarious_smalls
2 weeks, 4 days ago
I dont know
upvoted 1 times

 
Andy91
1 month ago
Correct answer!

Hot, Cool, Cool


upvoted 1 times


Question #49 Topic 1

You have an Azure Synapse Analytics Apache Spark pool named Pool1.

You plan to load JSON files from an Azure Data Lake Storage Gen2 container into the tables in Pool1. The structure and data types vary by file.

You need to load the files into the tables. The solution must maintain the source data types.

What should you do?

A.
Use a Conditional Split transformation in an Azure Synapse data flow.

B.
Use a Get Metadata activity in Azure Data Factory.

C.
Load the data by using the OPENROWSET Transact-SQL command in an Azure Synapse Analytics serverless SQL pool.

D.
Load the data by using PySpark.

Correct Answer:
C

Serverless SQL pool can automatically synchronize metadata from Apache Spark. A serverless SQL pool database will be created for each
database existing in serverless Apache Spark pools.

Serverless SQL pool enables you to query data in your data lake. It offers a T-SQL query surface area that accommodates semi-structured and
unstructured data queries.

To support a smooth experience for in place querying of data that's located in Azure Storage files, serverless SQL pool uses the OPENROWSET
function with additional capabilities.

The easiest way to see the content of your JSON file is to provide the file URL to the OPENROWSET function and specify the CSV FORMAT.

Reference:

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/query-json-files https://docs.microsoft.com/en-us/azure/synapse-
analytics/sql/query-data-storage
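
For reference, the OPENROWSET approach described in the explanation reads each JSON document as a single text column and then parses it with JSON functions, roughly as below (a sketch: the URL and property names are placeholders). Note that the comments below argue for PySpark instead, since the target is a Spark pool.

SELECT
    JSON_VALUE(doc, '$.id')   AS id,
    JSON_VALUE(doc, '$.name') AS name
FROM OPENROWSET(
    BULK 'https://mystorageaccount.dfs.core.windows.net/data/*.json',
    FORMAT = 'CSV',
    FIELDTERMINATOR = '0x0b',   -- unused control characters so each document
    FIELDQUOTE = '0x0b'         -- arrives as a single column
) WITH (doc NVARCHAR(MAX)) AS [rows];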

Community vote distribution


D (100%)

 
Ben_1010
4 days, 18 hours ago
Why PySpark?
upvoted 1 times

 
Andushi
2 weeks, 6 days ago
Selected Answer: D
Should be D, I agree with @galacaw
upvoted 1 times

 
galacaw
4 weeks ago
Should be D, it's about Apache Spark pool, not serverless SQL pool.
upvoted 4 times


Question #50 Topic 1

You have an Azure Databricks workspace named workspace1 in the Standard pricing tier. Workspace1 contains an all-purpose cluster named
cluster1.

You need to reduce the time it takes for cluster1 to start and scale up. The solution must minimize costs.

What should you do first?

A.
Configure a global init script for workspace1.

B.
Create a cluster policy in workspace1.

C.
Upgrade workspace1 to the Premium pricing tier.

D.
Create a pool in workspace1.

Correct Answer:
D

You can use Databricks Pools to Speed up your Data Pipelines and Scale Clusters Quickly.

Databricks Pools are a managed cache of virtual machine instances that enables clusters to start and scale 4 times faster.

Reference:

https://databricks.com/blog/2019/11/11/databricks-pools-speed-up-data-pipelines.html

 
Maggiee
1 week, 5 days ago
Answer should be C
upvoted 2 times

 
sdokmak
1 day, 8 hours ago
Answer is D:

looking at the reference link, pool works for this. Optimized scaling not needed to reduce 'start and scale up' times only.
upvoted 1 times

 
galacaw
4 weeks ago
Correct
upvoted 4 times


Question #51 Topic 1

HOTSPOT -

You are building an Azure Stream Analytics job that queries reference data from a product catalog file. The file is updated daily.

The reference data input details for the file are shown in the Input exhibit. (Click the Input tab.)

The storage account container view is shown in the Refdata exhibit. (Click the Refdata tab.)

You need to configure the Stream Analytics job to pick up the new reference data.

What should you configure? To answer, select the appropriate options in the answer area.


NOTE: Each correct selection is worth one point.

Hot Area:

Correct Answer:

Box 1: {date}/product.csv -

In the 2nd exhibit we see: Location: refdata / 2020-03-20

Note: Path Pattern: This is a required property that is used to locate your blobs within the specified container. Within the path, you may choose
to specify one or more instances of the following 2 variables:

{date}, {time}

Example 1: products/{date}/{time}/product-list.csv

Example 2: products/{date}/product-list.csv

Example 3: product-list.csv -


Box 2: YYYY-MM-DD -

Note: Date Format [optional]: If you have used {date} within the Path Pattern that you specified, then you can select the date format in which
your blobs are organized from the drop-down of supported formats.

Example: YYYY/MM/DD, MM/DD/YYYY, etc.

Reference:

https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-use-reference-data

 
inotbf83
3 weeks, 2 days ago
I would change box 2 to YYYY/MM/DD (as shown in the 1st exhibit). A bit confusing with the time format in box 1.
upvoted 4 times

 
jackttt
4 weeks, 1 day ago
The file is updated daily, so I think `{date}/product.csv` is correct.
upvoted 4 times

 
Lotusss
1 month ago
Wrong! Path Pattern: {date}/{time}/product.csv

Date format: yyyy-mm-dd

https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-use-reference-data
upvoted 2 times

 
KashRaynardMorse
3 weeks, 4 days ago
See that the file is stored under the date folder, and there is no time folder.

Your link does recommend the time part, but the link also says it's optional, and ultimately you need to answer the question, which states the path without the time.
upvoted 4 times


Question #52 Topic 1

HOTSPOT -

You have the following Azure Stream Analytics query.

For each of the following statements, select Yes if the statement is true. Otherwise, select No.

NOTE: Each correct selection is worth one point.

Hot Area:

Correct Answer:

Box 1: No -

Note: You can now use a new extension of Azure Stream Analytics SQL to specify the number of partitions of a stream when reshuffling the
data.

The outcome is a stream that has the same partition scheme. Please see below for an example:

WITH step1 AS (SELECT * FROM [input1] PARTITION BY DeviceID INTO 10), step2 AS (SELECT * FROM [input2] PARTITION BY DeviceID INTO 10)

SELECT * INTO [output] FROM step1 PARTITION BY DeviceID UNION step2 PARTITION BY DeviceID

Note: The new extension of Azure Stream Analytics SQL includes a keyword INTO that allows you to specify the number of partitions for a
stream when performing reshuffling using a PARTITION BY statement.


Box 2: Yes -

When joining two streams of data explicitly repartitioned, these streams must have the same partition key and partition count.

Box 3: Yes -

Streaming Units (SUs) represents the computing resources that are allocated to execute a Stream Analytics job. The higher the number of SUs,
the more CPU and memory resources are allocated for your job.

In general, the best practice is to start with 6 SUs for queries that don't use PARTITION BY.

Here there are 10 partitions, so 6x10 = 60 SUs is good.

Note: Remember, Streaming Unit (SU) count, which is the unit of scale for Azure Stream Analytics, must be adjusted so the number of physical
resources available to the job can fit the partitioned flow. In general, six SUs is a good number to assign to each partition. In case there are
insufficient resources assigned to the job, the system will only apply the repartition if it benefits the job.

Reference:

https://azure.microsoft.com/en-in/blog/maximize-throughput-with-repartitioning-in-azure-stream-analytics/ https://docs.microsoft.com/en-
us/azure/stream-analytics/stream-analytics-streaming-unit-consumption

 
TacoB
2 weeks, 5 days ago
Reading https://docs.microsoft.com/en-us/stream-analytics-query/union-azure-stream-analytics and the second sample given in there I would
expect the first one to be No.
upvoted 1 times

 
Akshay_1995
3 weeks, 3 days ago
Correct
upvoted 1 times


Question #53 Topic 1

HOTSPOT -

You are building a database in an Azure Synapse Analytics serverless SQL pool.

You have data stored in Parquet files in an Azure Data Lake Storage Gen2 container.

Records are structured as shown in the following sample.

"id": 123,

"address_housenumber": "19c",

"address_line": "Memory Lane",

"applicant1_name": "Jane",

"applicant2_name": "Dev"

The records contain two applicants at most.

You need to build a table that includes only the address fields.

How should you complete the Transact-SQL statement? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.

Hot Area:

Correct Answer:

Box 1: CREATE EXTERNAL TABLE -

An external table points to data located in Hadoop, Azure Storage blob, or Azure Data Lake Storage. External tables are used to read data from
files or write data to files in Azure Storage. With Synapse SQL, you can use external tables to read external data using dedicated SQL pool or


serverless SQL pool.

Syntax:

CREATE EXTERNAL TABLE { database_name.schema_name.table_name | schema_name.table_name | table_name }

( <column_definition> [ ,...n ] )

WITH (

LOCATION = 'folder_or_filepath',

DATA_SOURCE = external_data_source_name,

FILE_FORMAT = external_file_format_name

Box 2. OPENROWSET -

When using serverless SQL pool, CETAS is used to create an external table and export query results to Azure Storage Blob or Azure Data Lake
Storage Gen2.

Example:

AS -

SELECT decennialTime, stateName, SUM(population) AS population

FROM -

OPENROWSET(BULK
'https://azureopendatastorage.blob.core.windows.net/censusdatacontainer/release/us_population_county/year=*/*.parquet',

FORMAT='PARQUET') AS [r]

GROUP BY decennialTime, stateName

GO -

Reference:

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables
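
Assembled, the statement takes roughly this shape (the table name, data source, file format, output location, and storage URL are placeholder assumptions; only the address columns are selected):

CREATE EXTERNAL TABLE dbo.Addresses
WITH
(
    LOCATION = 'addresses/',
    DATA_SOURCE = MyLakeSource,
    FILE_FORMAT = MyParquetFormat
)
AS
SELECT address_housenumber, address_line
FROM OPENROWSET(
    BULK 'https://mystorageaccount.dfs.core.windows.net/records/*.parquet',
    FORMAT = 'PARQUET'
) AS [r];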

 
SandipSingha
2 weeks, 2 days ago
correct
upvoted 2 times

 
Feljoud
3 weeks ago
correct
upvoted 3 times


Question #54 Topic 1

HOTSPOT -

You have an Azure Synapse Analytics dedicated SQL pool named Pool1 and an Azure Data Lake Storage Gen2 account named Account1.

You plan to access the files in Account1 by using an external table.

You need to create a data source in Pool1 that you can reference when you create the external table.

How should you complete the Transact-SQL statement? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.

Hot Area:

Correct Answer:

Box 1: blob -

The following example creates an external data source for Azure Data Lake Gen2

CREATE EXTERNAL DATA SOURCE YellowTaxi

WITH ( LOCATION = 'https://azureopendatastorage.blob.core.windows.net/nyctlc/yellow/',

TYPE = HADOOP)

Box 2: HADOOP -

Reference:

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables
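
For comparison with the blob-endpoint example above, the Data Lake Storage Gen2-specific form discussed in the comments below uses the abfss scheme with the dfs host name (the account and container names are placeholders):

CREATE EXTERNAL DATA SOURCE MyLakeSource
WITH
(
    LOCATION = 'abfss://mycontainer@mystorageaccount.dfs.core.windows.net',
    TYPE = HADOOP
);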

 
Jmanuelleon
1 week, 1 day ago
It's confusing... in the definition for LOCATION it says to use DFS, https://docs.microsoft.com/es-es/azure/synapse-analytics/sql/develop-tables-external-tables?tabs=hadoop#location, but in the example that appears further down it uses the opposite, https://docs.microsoft.com/es-es/azure/synapse-analytics/sql/develop-tables-external-tables?tabs=hadoop#example-for-create-external-data-source (The following example creates an external data source for Azure Data Lake Gen2 that points to the publicly available New York dataset: CREATE EXTERNAL DATA SOURCE YellowTaxi

WITH ( LOCATION = 'https://azureopendatastorage.blob.core.windows.net/nyctlc/yellow/',


TYPE = HADOOP))
upvoted 1 times

 
hbad
1 week, 2 days ago

It is hadoop and dfs. For dfs see link below location section:

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables?tabs=hadoop
upvoted 2 times

 
LetsPassExams
2 weeks, 5 days ago
From https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables?tabs=hadoop#example-for-create-external-
data-source

The following example creates an external data source for Azure Data Lake Gen2 pointing to the publicly available New York data set:


CREATE EXTERNAL DATA SOURCE YellowTaxi

WITH ( LOCATION = 'https://azureopendatastorage.blob.core.windows.net/nyctlc/yellow/',


TYPE = HADOOP)
upvoted 2 times

 
LetsPassExams
2 weeks, 5 days ago
I think the answer is correct:

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables?tabs=hadoop#example-for-create-external-data-
source
upvoted 1 times

 
shrikantK
3 weeks, 5 days ago
dfs is the answer, as the question is about Azure Data Lake Storage Gen2. If the question was about Blob storage, then the answer would have been blob.
upvoted 1 times

 
Andushi
3 weeks, 6 days ago
1. is DFS
upvoted 2 times

 
Andushi
4 weeks ago
I agree with galacaw is dfs and type Hadoop
upvoted 2 times

 
galacaw
4 weeks ago
1. dfs (for Azure Data Lake Storage Gen2)
upvoted 4 times


Question #55 Topic 1

You have an Azure subscription that contains an Azure Blob Storage account named storage1 and an Azure Synapse Analytics dedicated SQL pool
named

Pool1.

You need to store data in storage1. The data will be read by Pool1. The solution must meet the following requirements:

✑ Enable Pool1 to skip columns and rows that are unnecessary in a query.

✑ Automatically create column statistics.

✑ Minimize the size of files.

Which type of file should you use?

A.
JSON

B.
Parquet

C.
Avro

D.
CSV

Correct Answer:
B

Automatic creation of statistics is turned on for Parquet files. For CSV files, you need to create statistics manually until automatic creation of
CSV files statistics is supported.

Reference:

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-statistics

Community vote distribution


B (100%)

 
ClassMistress
1 week, 1 day ago
Selected Answer: B
Automatic creation of statistics is turned on for Parquet files. For CSV files, you need to create statistics manually until automatic creation of CSV
files statistics is supported.
upvoted 1 times

 
sdokmak
1 day, 7 hours ago
Good point, also better cost
upvoted 1 times

 
shachar_ash
2 weeks, 1 day ago
Correct
upvoted 2 times


Question #56 Topic 1

DRAG DROP -

You plan to create a table in an Azure Synapse Analytics dedicated SQL pool.

Data in the table will be retained for five years. Once a year, data that is older than five years will be deleted.

You need to ensure that the data is distributed evenly across partitions. The solution must minimize the amount of time required to delete old
data.

How should you complete the Transact-SQL statement? To answer, drag the appropriate values to the correct targets. Each value may be used
once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content.

NOTE: Each correct selection is worth one point.

Select and Place:

Correct Answer:

Box 1: HASH -


Box 2: OrderDateKey -

In most cases, table partitions are created on a date column.

A way to eliminate rollbacks is to use Metadata Only operations like partition switching for data management. For example, rather than execute
a DELETE statement to delete all rows in a table where the order_date was in October of 2001, you could partition your data early. Then you can
switch out the partition with data for an empty partition from another table.

Reference:

https://docs.microsoft.com/en-us/sql/t-sql/statements/create-table-azure-sql-data-warehouse https://docs.microsoft.com/en-
us/azure/synapse-analytics/sql/best-practices-dedicated-sql-pool
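
The partition-switch pattern mentioned above, which turns the yearly delete into a metadata-only operation, looks roughly like this (the table names, boundary values, and partition number are illustrative assumptions):

-- Empty table with the same schema, distribution, and partition boundaries
CREATE TABLE dbo.FactSales_Old
WITH
(
    DISTRIBUTION = HASH(ProductKey),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION ( OrderDateKey RANGE RIGHT FOR VALUES (20180101, 20190101, 20200101, 20210101, 20220101) )
)
AS SELECT * FROM dbo.FactSales WHERE 1 = 2;

-- Move the oldest partition out of the fact table, then drop it with the staging table
ALTER TABLE dbo.FactSales SWITCH PARTITION 1 TO dbo.FactSales_Old PARTITION 1;
DROP TABLE dbo.FactSales_Old;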

 
ClassMistress
1 week, 1 day ago
I think it is Hash because the question refers to a fact table.
upvoted 1 times

 
jebias
1 month ago
I think the first answer should be Round-Robin as it should be distributed evenly.

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute
upvoted 2 times

 
Feljoud
1 month ago
While you are right, that Round-Robin guarantees an even distribution, it is only recommended to use on small tables < 2 GB (see your link).
Using the Hash of the ProductKey will also allow for an even distribution but in a more efficient manner.

Also, the Syntax here would be wrong if you would insert Round-Robin. As in that case it would only say: "DISTRIBUTION = ROUND-ROBIN" (no
ProductKey)
upvoted 10 times

 
nefarious_smalls
2 weeks, 4 days ago
You are exactly right.
upvoted 1 times

 
Muishkin
3 weeks, 6 days ago
yes i think so too
upvoted 1 times

 
Massy
3 weeks, 5 days ago
the syntax is ok only for HASH
upvoted 2 times


Question #57 Topic 1

HOTSPOT -

You have an Azure Data Lake Storage Gen2 service.

You need to design a data archiving solution that meets the following requirements:

✑ Data that is older than five years is accessed infrequently but must be available within one second when requested.

✑ Data that is older than seven years is NOT accessed.

✑ Costs must be minimized while maintaining the required availability.

How should you manage the data? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.

Hot Area:

Correct Answer:

Box 1: Move to cool storage -

Box 2: Move to archive storage -

Archive - Optimized for storing data that is rarely accessed and stored for at least 180 days with flexible latency requirements, on the order of
hours.

The following table shows a comparison of premium performance block blob storage, and the hot, cool, and archive access tiers.


Reference:

https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blob-storage-tiers

 
sagur
Highly Voted 
1 month ago
If "Data that is older than seven years is NOT accessed" then this data can be deleted to minimize the storage costs, right?
upvoted 5 times

 
Feljoud
1 month ago
Would agree, but the question states: "a data archiving solution", so maybe to keep the data was implied with this?
upvoted 2 times

 
noobprogrammer
1 month ago
Makes sense to me
upvoted 1 times

 
Massy
1 month ago
I agree, should be deleted
upvoted 1 times

 
KashRaynardMorse
3 weeks, 4 days ago
Deleting data older than 7 years is not an option available in the answer list. Be careful of the gotcha: 'Delete the blob' is an option, but it would delete all the data, including the data that is e.g. 5 years old. So you can't choose that answer. So the next best thing to do is to put it into archive.
upvoted 5 times

 
Boompiee
2 weeks, 1 day ago
I'm confused by your comment. It clearly does state an option to delete the blob after 7 years.
upvoted 1 times


Topic 2 - Question Set 2


Question #1 Topic 2

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that
might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.

After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.

You plan to create an Azure Databricks workspace that has a tiered structure. The workspace will contain the following three workloads:

✑ A workload for data engineers who will use Python and SQL.

✑ A workload for jobs that will run notebooks that use Python, Scala, and SQL.

✑ A workload that data scientists will use to perform ad hoc analysis in Scala and R.

The enterprise architecture team at your company identifies the following standards for Databricks environments:

✑ The data engineers must share a cluster.

✑ The job cluster will be managed by using a request process whereby data scientists and data engineers provide packaged notebooks for
deployment to the cluster.

✑ All the data scientists must be assigned their own cluster that terminates automatically after 120 minutes of inactivity. Currently, there are
three data scientists.

You need to create the Databricks clusters for the workloads.

Solution: You create a Standard cluster for each data scientist, a High Concurrency cluster for the data engineers, and a Standard cluster for the
jobs.
Does this meet the goal?

A.
Yes

B.
No

Correct Answer:
B

We would need a High Concurrency cluster for the jobs.

Note:

Standard clusters are recommended for a single user. Standard can run workloads developed in any language: Python, R, Scala, and SQL.

A high concurrency cluster is a managed cloud resource. The key benefits of high concurrency clusters are that they provide Apache Spark-
native fine-grained sharing for maximum resource utilization and minimum query latencies.

Reference:

https://docs.azuredatabricks.net/clusters/configure.html

Community vote distribution


A (89%) 11%

 
Amalbenrebai
Highly Voted 
8 months, 4 weeks ago
- data engineers: high concurrency cluster

- jobs: Standard cluster

- data scientists: Standard cluster


upvoted 53 times

 
Egocentric
1 month, 1 week ago
agreed
upvoted 1 times

 
Julius7000
8 months, 1 week ago
Tell me one thing: is this answer (jobs) based on the text:

"A Single Node cluster has no workers and runs Spark jobs on the driver node.

In contrast, a Standard cluster requires at least one Spark worker node in addition to the driver node to execute Spark jobs."?

I don't understand the connection between worker nodes and the requirements given in the question about the jobs workspace.
upvoted 1 times

 
gangstfear
Highly Voted 
8 months, 4 weeks ago
The answer must be A!
upvoted 31 times

 
Eyepatch993
Most Recent 
2 months ago
Selected Answer: B
Standard clusters do not have fault tolerance. Both the data scientist and data engineers will be using the job cluster for processing their
notebooks, so if a standard cluster is chosen and a fault occurs in the notebook of any one user, there is a chance that other notebooks might also
fail. Due to this a high concurrency cluster is recommended for running jobs.
upvoted 2 times


 
Boompiee
2 weeks, 1 day ago
It may not be a best practice, but the question asked is: does the solution meet the stated requirements? And it does.
upvoted 1 times

 
Hanse
2 months, 2 weeks ago
As per Link: https://docs.azuredatabricks.net/clusters/configure.html

Standard and Single Node clusters terminate automatically after 120 minutes by default. --> Data Scientists

High Concurrency clusters do not terminate automatically by default.

A Standard cluster is recommended for a single user. --> Standard for Data Scientists & High Concurrency for Data Engineers

Standard clusters can run workloads developed in any language: Python, SQL, R, and Scala.

High Concurrency clusters can run workloads developed in SQL, Python, and R. The performance and security of High Concurrency clusters is
provided by running user code in separate processes, which is not possible in Scala. --> Jobs needs Standard
upvoted 2 times

 
ovokpus
3 months ago
Selected Answer: A
Yes it seems to be!
upvoted 2 times

 
PallaviPatel
3 months, 3 weeks ago
Selected Answer: A
correct
upvoted 2 times

 
kilowd
4 months ago
Selected Answer: A
Data Engineers - High Concurrency cluster, as it provides for sharing. It also caters for SQL, Python, and R.

Data Scientists - Standard clusters, which automatically terminate after 120 minutes and cater for Scala, SQL, Python, and R.

Jobs - Standard cluster


upvoted 2 times

 
let_88
4 months ago
As per the doc in Microsoft the High Concurrency cluster doesn't support Scala.

High Concurrency clusters can run workloads developed in SQL, Python, and R. The performance and security of High Concurrency clusters is
provided by running user code in separate processes, which is not possible in Scala.

https://docs.microsoft.com/en-us/azure/databricks/clusters/configure#cluster-mode
upvoted 6 times

 
tesen_tolga
4 months, 1 week ago
Selected Answer: A
The answer must be A!
upvoted 2 times

 
SabaJamal2010AtGmail
4 months, 3 weeks ago
The solution does not meet the requirement because: "High Concurrency clusters work only for SQL, Python, and R. The performance and security
of High Concurrency clusters is provided by running user code in separate processes, which is not possible in Scala.
upvoted 1 times

 
FredNo
6 months ago
Selected Answer: A
Data scientists and jobs use Scala so they need standard cluster
upvoted 9 times

 
Aslam208
6 months, 3 weeks ago
Answer is A.
upvoted 4 times

 
gangstfear
9 months ago
Shouldn't the answer be A, as all the requirements are met:

Data Scientist - Standard

Data Engineer - High Concurrency

Jobs - Standard
upvoted 13 times

 
satyamkishoresingh
8 months, 4 weeks ago
Yes , Given solution is correct.
upvoted 6 times

 
echerish
9 months ago
Questions 23 and 24 seem to have been swapped. The key is:

Data Scientist - Standard

Data Engineer - High Concurrency

Jobs - Standard
https://www.examtopics.com/exams/microsoft/dp-203/custom-view/ 134/161
5/26/22, 3:46 PM DP-203 Exam – Free Actual Q&As, Page 1 | ExamTopics

upvoted 7 times

 
MoDar
9 months ago
Answer A

Scala is not supported in High Concurrency cluster --> Jobs & Data scientists --> Standard

Data engineers --> High Concurrency


upvoted 8 times

 
damaldon
11 months, 1 week ago
Answer: B

-Data scientist should have their own cluster and should terminate after 120 mins - STANDARD

-Cluster for Jobs should support scala - STANDARD

https://docs.microsoft.com/en-us/azure/databricks/clusters/configure
upvoted 6 times

 
Sunnyb
11 months, 3 weeks ago
B is the correct answer

Link below:

https://docs.microsoft.com/en-us/azure/databricks/clusters/configure
upvoted 3 times


Question #2 Topic 2

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that
might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.

After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.

You plan to create an Azure Databricks workspace that has a tiered structure. The workspace will contain the following three workloads:

✑ A workload for data engineers who will use Python and SQL.

✑ A workload for jobs that will run notebooks that use Python, Scala, and SQL.

✑ A workload that data scientists will use to perform ad hoc analysis in Scala and R.

The enterprise architecture team at your company identifies the following standards for Databricks environments:

✑ The data engineers must share a cluster.

✑ The job cluster will be managed by using a request process whereby data scientists and data engineers provide packaged notebooks for
deployment to the cluster.

✑ All the data scientists must be assigned their own cluster that terminates automatically after 120 minutes of inactivity. Currently, there are
three data scientists.

You need to create the Databricks clusters for the workloads.

Solution: You create a Standard cluster for each data scientist, a High Concurrency cluster for the data engineers, and a High Concurrency cluster
for the jobs.

Does this meet the goal?

A.
Yes

B.
No

Correct Answer:
A

We need a High Concurrency cluster for the data engineers and the jobs.

Note: Standard clusters are recommended for a single user. Standard can run workloads developed in any language: Python, R, Scala, and SQL.

A high concurrency cluster is a managed cloud resource. The key benefits of high concurrency clusters are that they provide Apache Spark-
native fine-grained sharing for maximum resource utilization and minimum query latencies.

Reference:

https://docs.azuredatabricks.net/clusters/configure.html

Community vote distribution


B (100%)

 
dfdsfdsfsd
Highly Voted 
1 year ago
High-concurrency clusters do not support Scala. So the answer is still 'No' but the reasoning is wrong.

https://docs.microsoft.com/en-us/azure/databricks/clusters/configure
upvoted 31 times

 
Preben
11 months, 3 weeks ago
I agree that High concurrency does not support Scala. But they specified using a Standard cluster for the jobs, which does support Scala. Why is
the answer 'No'?
upvoted 2 times

 
eng1
11 months, 1 week ago
Because the High Concurrency cluster for each data scientist is not correct, it should be standard for a single user!
upvoted 4 times

 
FRAN__CO_HO
Highly Voted 
11 months, 1 week ago
Answer should be NO:

Data scientists: STANDARD, as they need to run Scala

Jobs: STANDARD, as they need to run Scala

Data engineers: High Concurrency clusters, for better resource sharing


upvoted 10 times

 
ClassMistress
Most Recent 
1 week, 1 day ago
Selected Answer: B
High Concurrency clusters is provided by running user code in separate processes, which is not possible in Scala.
upvoted 1 times

 
narendra399
1 month, 3 weeks ago
Questions 1 and 2 are the same, but the answers are different. Why?
upvoted 2 times


 
Hanse
2 months, 2 weeks ago
As per Link: https://docs.azuredatabricks.net/clusters/configure.html

Standard and Single Node clusters terminate automatically after 120 minutes by default. --> Data Scientists

High Concurrency clusters do not terminate automatically by default.

A Standard cluster is recommended for a single user. --> Standard for Data Scientists & High Concurrency for Data Engineers

Standard clusters can run workloads developed in any language: Python, SQL, R, and Scala.

High Concurrency clusters can run workloads developed in SQL, Python, and R. The performance and security of High Concurrency clusters is
provided by running user code in separate processes, which is not possible in Scala. --> Jobs needs Standard
upvoted 2 times

 
lukeonline
4 months, 3 weeks ago
Selected Answer: B
high concurrency does not support scala
upvoted 2 times

 
rashjan
5 months, 2 weeks ago
Selected Answer: B
wrong: no
upvoted 1 times

 
FredNo
6 months ago
Selected Answer: B
Answer is no because high concurrency does not support scala
upvoted 5 times

 
Aslam208
6 months, 3 weeks ago
Answer is No
upvoted 2 times

 
damaldon
11 months, 1 week ago
Answer: NO

-Data scientist should have their own cluster and should terminate after 120 mins - STANDARD

-Cluster for Jobs should support scala - STANDARD

https://docs.microsoft.com/en-us/azure/databricks/clusters/configure
upvoted 2 times

 
nas28
11 months, 3 weeks ago
The answer is correct: No. But the reasoning is wrong: they want the data scientists' clusters to shut down automatically after 120 minutes, so Standard clusters, not High Concurrency.
upvoted 3 times

 
Sunnyb
11 months, 3 weeks ago
Answer is correct - NO
upvoted 2 times


Question #3 Topic 2

HOTSPOT -

You plan to create a real-time monitoring app that alerts users when a device travels more than 200 meters away from a designated location.

You need to design an Azure Stream Analytics job to process the data for the planned app. The solution must minimize the amount of code
developed and the number of technologies used.

What should you include in the Stream Analytics job? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.

Hot Area:

Correct Answer:

Input type: Stream -

You can process real-time IoT data streams with Azure Stream Analytics.

Function: Geospatial -

With built-in geospatial functions, you can use Azure Stream Analytics to build applications for scenarios such as fleet management, ride
sharing, connected cars, and asset tracking.

Note: In a real-world scenario, you could have hundreds of these sensors generating events as a stream. Ideally, a gateway device would run
code to push these events to Azure Event Hubs or Azure IoT Hubs.

Reference:

https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-get-started-with-azure-stream-analytics-to-process-data-from-iot-
devices https://docs.microsoft.com/en-us/azure/stream-analytics/geospatial-scenarios
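For illustration only, here is a minimal Stream Analytics query sketch for this kind of scenario. The input, output, and field names, as well as the reference coordinates, are assumptions and not part of the question; ST_DISTANCE and CreatePoint are the built-in geospatial functions, and ST_DISTANCE returns the distance in meters.

-- Alert when a device reports a position more than 200 meters from the designated location
SELECT DeviceId, Latitude, Longitude, EventTime
INTO AlertOutput
FROM DeviceInput TIMESTAMP BY EventTime
WHERE ST_DISTANCE(
        CreatePoint(Latitude, Longitude),
        CreatePoint(47.6062, -122.3321)) > 200   -- assumed designated location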

 
Podavenna
Highly Voted 
8 months, 1 week ago


Correct solution!
upvoted 22 times

 
ClassMistress
Most Recent 
1 week, 1 day ago
Correct
upvoted 1 times

 
NewTuanAnh
1 month, 2 weeks ago
Correct!
upvoted 1 times

 
PallaviPatel
3 months, 3 weeks ago
Correct
upvoted 1 times


Question #4 Topic 2

A company has a real-time data analysis solution that is hosted on Microsoft Azure. The solution uses Azure Event Hub to ingest data and an
Azure Stream

Analytics cloud job to analyze the data. The cloud job is configured to use 120 Streaming Units (SU).

You need to optimize performance for the Azure Stream Analytics job.

Which two actions should you perform? Each correct answer presents part of the solution.

NOTE: Each correct selection is worth one point.

A.
Implement event ordering.

B.
Implement Azure Stream Analytics user-defined functions (UDF).

C.
Implement query parallelization by partitioning the data output.

D.
Scale the SU count for the job up.

E.
Scale the SU count for the job down.

F.
Implement query parallelization by partitioning the data input.

Correct Answer:
DF

D: Scale out the query by allowing the system to process each input partition separately.

F: A Stream Analytics job definition includes inputs, a query, and output. Inputs are where the job reads the data stream from.

Reference:

https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-parallelization
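For reference, a minimal sketch of a partition-aligned Stream Analytics query (the input, output, and column names are hypothetical). Partitioning the input and the output on the same key, and grouping on PartitionId, lets each partition be processed independently; depending on the job's compatibility level, the explicit PARTITION BY clause may not be required.

SELECT TollBoothId, COUNT(*) AS VehicleCount
INTO PartitionedOutput
FROM PartitionedInput TIMESTAMP BY EntryTime
PARTITION BY PartitionId
GROUP BY TollBoothId, PartitionId, TumblingWindow(minute, 3)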

Community vote distribution


CF (50%) DF (40%) 10%

 
manquak
Highly Voted 
8 months, 3 weeks ago
Partition input and output.

REF: https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-parallelization
upvoted 25 times

 
kolakone
8 months, 1 week ago
Agree. And partitioning input and output with the same number of partitions gives the best performance optimization.
upvoted 5 times

 
Lio95
Highly Voted 
8 months ago
No event consumer was mentioned. Therefore, partitioning output is not relevant. Answer is correct
upvoted 11 times

 
Boompiee
2 weeks, 1 day ago
The stream analytics job is the consumer.
upvoted 1 times

 
nicolas1999
6 months, 1 week ago
Stream analytics ALWAYS has at least one output. There is no need to mention that. So correct answer is input and output
upvoted 2 times

 
Andushi
Most Recent 
2 weeks, 6 days ago
Selected Answer: CF
I agree with @manquak.
upvoted 1 times

 
DingDongSingSong
2 months ago
I think the answer is correct. The two things you do are: 1. scale up the SU count and 2. partition the input. If this doesn't work, THEN you could partition the output as well.
upvoted 1 times

 
Dianova
3 months, 1 week ago
Selected Answer: DF
I think answer is correct, because:

Nothing is mentioned in the question about the output and some type of outputs do not support partitioning (like PowerBI), so it would be risky to
assume that we can partition the output to implement Embarrassingly parallel jobs.

https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-parallelization#outputs

Implementing query parallelization by partitioning the data input would be an optimization but the total number of SUs depends on the number of
partitions, so the SUs would need to be scaled up.

https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-parallelization#calculate-the-maximum-streaming-units-of-a-job
upvoted 7 times

 
PallaviPatel
3 months, 3 weeks ago
Selected Answer: CF
Ignore my previous answer; C and F are correct.
upvoted 2 times

 
PallaviPatel
3 months, 3 weeks ago
Selected Answer: DF
correct
upvoted 1 times

 
assU2
4 months ago
Selected Answer: CF
Partitioning lets you divide data into subsets based on a partition key. If your input (for example Event Hubs) is partitioned by a key, it is highly
recommended to specify this partition key when adding input to your Stream Analytics job. Scaling a Stream Analytics job takes advantage of
partitions in the input and output.

Moreover, scaling is not an optimization.


upvoted 1 times

 
assU2
4 months ago
Is scaling an optimization??
upvoted 1 times

 
DE_Sanjay
4 months, 1 week ago
C & F Should be the right answer.
upvoted 1 times

 
dev2dev
4 months, 1 week ago
Optimization is always about improving performance using existing resources, so definitely not increasing the SKU or SU count.
upvoted 4 times

 
alex623
4 months, 1 week ago
I think the answer is to partition input and output, because the target is to optimize regardless of computing capacity (#SUs).
upvoted 1 times

 
DingDongSingSong
2 months ago
Who says optimization is regardless of computing capacity? In fact, increasing computing capacity is ONE of the ways to optimize performance.
upvoted 1 times

 
Jaws1990
4 months, 2 weeks ago
Selected Answer: CF
Should always aim for Embarrassingly parallel jobs (partitioning input, job and output) https://docs.microsoft.com/en-us/azure/stream-
analytics/stream-analytics-parallelization

Upping the computing power of a resource (SUs in this case) should never be classed as 'optimisation' like the question asks.
upvoted 5 times

 
dev2dev
4 months, 1 week ago
I agree
upvoted 1 times

 
trietnv
5 months ago
Selected Answer: BF
Choosing the number of required SUs for a particular job depends on the partition configuration for "the inputs" and "the query" that's defined
within the job. The Scale page allows you to set the right number of SUs. It is a best practice to allocate more SUs than needed.

ref: https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-streaming-unit-consumption
upvoted 2 times

 
Sayour
5 months, 1 week ago
The answer is correct. You scale up Streaming Units and partition the input so the input events are processed more efficiently.
upvoted 3 times

 
m2shines
5 months, 1 week ago
C and F
upvoted 1 times

 
rashjan
5 months, 2 weeks ago
Selected Answer: CF
Partitioning input and output is the correct answer; even if the output is not mentioned, a Stream Analytics job always has at least one output.
upvoted 1 times


Question #5 Topic 2

You need to trigger an Azure Data Factory pipeline when a file arrives in an Azure Data Lake Storage Gen2 container.

Which resource provider should you enable?

A.
Microsoft.Sql

B.
Microsoft.Automation

C.
Microsoft.EventGrid

D.
Microsoft.EventHub

Correct Answer:
C

Event-driven architecture (EDA) is a common data integration pattern that involves production, detection, consumption, and reaction to events.
Data integration scenarios often require Data Factory customers to trigger pipelines based on events happening in storage account, such as the
arrival or deletion of a file in Azure

Blob Storage account. Data Factory natively integrates with Azure Event Grid, which lets you trigger pipelines on such events.

Reference:

https://docs.microsoft.com/en-us/azure/data-factory/how-to-create-event-trigger https://docs.microsoft.com/en-us/azure/data-
factory/concepts-pipeline-execution-triggers

Community vote distribution


C (100%)

 
jv2120
Highly Voted 
5 months, 2 weeks ago
Correct. C

Azure Event Grids – Event-driven publish-subscribe model (think reactive programming)

Azure Event Hubs – Multiple source big data streaming pipeline (think telemetry data)

In this case Event Grid is more suitable than Event Hubs.


upvoted 12 times

 
medsimus
Highly Voted 
8 months ago
Correct

https://docs.microsoft.com/en-us/azure/data-factory/how-to-create-event-trigger?tabs=data-factory
upvoted 11 times

 
PallaviPatel
Most Recent 
3 months, 3 weeks ago
Selected Answer: C
Correct.
upvoted 2 times

 
romanzdk
4 months, 1 week ago
But EventHub does not support ADLS, only Blob storage
upvoted 1 times

 
romanzdk
4 months, 1 week ago
https://docs.microsoft.com/en-us/azure/event-grid/overview
upvoted 2 times

 
Swagat039
4 months, 2 weeks ago
C. is correct.

You need storage event trigger (for this Microsoft.EventGrid service needs to be enabled).
upvoted 1 times

 
Vardhan_Brahmanapally
6 months, 3 weeks ago
Why not eventhub?
upvoted 3 times

 
wijaz789
8 months, 3 weeks ago
Absolutely correct
upvoted 4 times


Question #6 Topic 2

You plan to perform batch processing in Azure Databricks once daily.

Which type of Databricks cluster should you use?

A.
High Concurrency

B.
automated

C.
interactive

Correct Answer:
B

Automated Databricks clusters are the best for jobs and automated batch processing.

Note: Azure Databricks has two types of clusters: interactive and automated. You use interactive clusters to analyze data collaboratively with
interactive notebooks. You use automated clusters to run fast and robust automated jobs.

Example: Scheduled batch workloads (data engineers running ETL jobs)

This scenario involves running batch job JARs and notebooks on a regular cadence through the Databricks platform.

The suggested best practice is to launch a new cluster for each run of critical jobs. This helps avoid any issues (failures, missing SLA, and so
on) due to an existing workload (noisy neighbor) on a shared cluster.

Reference:

https://docs.microsoft.com/en-us/azure/databricks/clusters/create https://docs.databricks.com/administration-guide/cloud-
configurations/aws/cmbp.html#scenario-3-scheduled-batch-workloads-data-engineers-running-etl-jobs

Community vote distribution


B (100%)

 
Podavenna
Highly Voted 
8 months, 1 week ago
Correct!
upvoted 8 times

 
necktru
Most Recent 
3 weeks ago
Selected Answer: B
correct
upvoted 1 times

 
PallaviPatel
3 months, 3 weeks ago
Selected Answer: B
correct.
upvoted 1 times

 
satyamkishoresingh
8 months, 3 weeks ago
What is an automated cluster?
upvoted 1 times

 
wijaz789
8 months, 3 weeks ago
There are 2 types of databricks clusters:

1) Standard/Interactive - best for querying and processing data by users.

2) Automatic/Jobs - best for jobs and automated batch processing.


upvoted 11 times

 
Swagat039
4 months, 2 weeks ago
Job cluster
upvoted 1 times


Question #7 Topic 2

HOTSPOT -

You are processing streaming data from vehicles that pass through a toll booth.

You need to use Azure Stream Analytics to return the license plate, vehicle make, and hour the last vehicle passed during each 10-minute window.

How should you complete the query? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.

Hot Area:


Correct Answer:

Box 1: MAX -

The first step of the query finds the maximum time stamp in 10-minute windows, that is, the time stamp of the last event for that window. The second step joins the results of the first query with the original stream to find the events that match the last time stamp in each window.

Query:

WITH LastInWindow AS
(
    SELECT MAX(Time) AS LastEventTime
    FROM Input TIMESTAMP BY Time
    GROUP BY TumblingWindow(minute, 10)
)
SELECT
    Input.License_plate,
    Input.Make,
    Input.Time
FROM Input TIMESTAMP BY Time
INNER JOIN LastInWindow
    ON DATEDIFF(minute, Input, LastInWindow) BETWEEN 0 AND 10
    AND Input.Time = LastInWindow.LastEventTime

Box 2: TumblingWindow -

Tumbling windows are a series of fixed-sized, non-overlapping and contiguous time intervals.

Box 3: DATEDIFF -

DATEDIFF is a date-specific function that compares and returns the time difference between two DateTime fields, for more information, refer to
date functions.

Reference:

https://docs.microsoft.com/en-us/stream-analytics-query/tumbling-window-azure-stream-analytics
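As a side-by-side illustration of the window choice discussed in the comments below, here is a simplified query over the same hypothetical Input stream (not the full answer query above):

-- TumblingWindow: fixed, non-overlapping 10-minute windows (what this scenario needs)
SELECT License_plate, MAX(Time) AS LastEventTime
FROM Input TIMESTAMP BY Time
GROUP BY License_plate, TumblingWindow(minute, 10)

-- HoppingWindow(minute, 10, 5) would instead emit a result every 5 minutes covering
-- the previous 10 minutes, so one event could fall into more than one window.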

 
rikku33
Highly Voted 
8 months ago
correct
upvoted 16 times

 
Jerrylolu
5 months, 1 week ago
Why not Hopping Window??
upvoted 1 times

 
Wijn4nd
4 months, 2 weeks ago
Because a hopping window can overlap, and we need the data from 10 minute time frames that DON'T overlap
upvoted 3 times

 
PallaviPatel
Most Recent 
3 months, 3 weeks ago
correct.
upvoted 1 times

 
BusinessApps
3 months, 3 weeks ago
HoppingWindow requires a minimum of three arguments, whereas TumblingWindow takes only two, so considering that the solution has only two arguments, it has to be Tumbling.

https://docs.microsoft.com/en-us/stream-analytics-query/hopping-window-azure-stream-analytics

https://docs.microsoft.com/en-us/stream-analytics-query/tumbling-window-azure-stream-analytics
upvoted 3 times

 
DrTaz
4 months, 3 weeks ago
Answer is 100% correct.
upvoted 2 times

 
bubububox
4 months, 3 weeks ago
Definitely hopping, because the event (last car passing) can be part of more than one window. Thus it can't be tumbling.
upvoted 1 times

 
DrTaz
4 months, 3 weeks ago
The question defines non-overlapping windows, thus tumbling for sure, 100%.
upvoted 1 times

 
durak
5 months ago
Why not SELECT COUNT?
upvoted 1 times

 
DrTaz
4 months, 3 weeks ago
MAX is used to get the "last" event.
upvoted 1 times

 
irantov
8 months, 1 week ago
I think it is correct. Although we could also use HoppingWindow, it would be better to use TumblingWindow as time events are unique.
upvoted 3 times

 
TelixFom
7 months, 2 weeks ago
I was thinking TumblingWindow based on the term "each 10-minute window." This implies that the situation is not looking for a rolling max.

upvoted 2 times

 
elcholo
8 months, 3 weeks ago
WHAAAT!
upvoted 4 times

 
GameLift
8 months, 2 weeks ago
very confusing
upvoted 4 times


Question #8 Topic 2

You have an Azure Data Factory instance that contains two pipelines named Pipeline1 and Pipeline2.
Pipeline1 has the activities shown in the following exhibit.

Pipeline2 has the activities shown in the following exhibit.

You execute Pipeline2, and Stored procedure1 in Pipeline1 fails.

What is the status of the pipeline runs?

A.
Pipeline1 and Pipeline2 succeeded.

B.
Pipeline1 and Pipeline2 failed.

C.
Pipeline1 succeeded and Pipeline2 failed.

D.
Pipeline1 failed and Pipeline2 succeeded.

Correct Answer:
A

Activities are linked together via dependencies. A dependency has a condition of one of the following: Succeeded, Failed, Skipped, or
Completed.

Consider Pipeline1:

If we have a pipeline with two activities where Activity2 has a failure dependency on Activity1, the pipeline will not fail just because Activity1
failed. If Activity1 fails and Activity2 succeeds, the pipeline will succeed. This scenario is treated as a try-catch block by Data Factory.

The failure dependency means this pipeline reports success.

Note:

If we have a pipeline containing Activity1 and Activity2, and Activity2 has a success dependency on Activity1, it will only execute if Activity1 is
successful. In this scenario, if Activity1 fails, the pipeline will fail.

Reference:

https://datasavvy.me/category/azure-data-factory/

Community vote distribution


A (100%)

 
echerish
Highly Voted 
9 months ago
Pipeline 2 executes Pipeline 1 and, on success, sets a variable. Since Pipeline 1 succeeds, Pipeline 2 is a success.

In Pipeline 1, the stored procedure fails and, on failure, sets a variable. Since the failure is the expected, handled outcome, the run completes successfully and sets Variable1.

At least that's how I understand it


upvoted 22 times

 
SaferSephy
Highly Voted 
8 months, 2 weeks ago
Correct answer is A. The trick is the fact that Pipeline 1 only has a Failure dependency between the activities. In this situation this results in a Succeeded pipeline if the stored procedure fails.

If the Success connection were also linked to a follow-up activity and the SP failed, the pipeline would indeed be marked as failed.


So A.
upvoted 21 times

 
BK10
3 months, 1 week ago
well explained! A is right
upvoted 1 times

 
SebK
Most Recent 
2 months ago
Selected Answer: A
Correct
upvoted 1 times

 
AngelJP
2 months, 1 week ago
Selected Answer: A
A correct:

Pipeline 1 is in try catch sentence --> Success

Pipeline 2 --> Success

https://docs.microsoft.com/en-us/azure/data-factory/tutorial-pipeline-failure-error-handling#try-catch-block
upvoted 2 times

 
PallaviPatel
3 months, 3 weeks ago
Selected Answer: A
A correct. I agree with SaferSephy's comments below.
upvoted 2 times

 
dev2dev
4 months, 1 week ago
A is correct. In Pipeline 1, the Set Variable activity is connected to the Failure output. It's like handling exceptions/errors in a programming language. Without the Failure path, it would be treated as failed.
upvoted 1 times

 
VeroDon
4 months, 3 weeks ago
Selected Answer: A
Correct
upvoted 1 times

 
JSSA
5 months ago
Correct answer is A
upvoted 1 times

 
rashjan
5 months, 2 weeks ago
Selected Answer: A
correct
upvoted 1 times

 
medsimus
7 months, 1 week ago
Correct answer. I tested it in Synapse: the first activity failed but the pipeline succeeded.
upvoted 5 times

 
Oleczek
8 months, 3 weeks ago
Just checked it myself on Azure, answer A is correct.
upvoted 4 times

 
wdeleersnyder
8 months, 4 weeks ago
I'm not seeing this... what's not being called out is if Pipeline 2 has a dependency on Pipeline 1. It happens all the time where two pipelines run;
one runs, the other fails.

It should be D in my opinion.
upvoted 4 times

 
gangstfear
9 months ago
The answer must be B
upvoted 2 times

 
JohnMasipa
9 months ago
Can someone please explain why the answer is A?
upvoted 1 times

 
dev2dev
4 months, 1 week ago
If you look at the green and red squares, they are called Success and Failure events. In pseudocode, Pipeline 1 can be read as "On Error Set Variable", whereas Pipeline 2 has "On Success Set Variable".
upvoted 1 times

 
Ayanchakrain
9 months ago


Pipeline1 has the failure dependency


upvoted 2 times


Question #9 Topic 2

HOTSPOT -

A company plans to use Platform-as-a-Service (PaaS) to create the new data pipeline process. The process must meet the following requirements:
Ingest:

✑ Access multiple data sources.

✑ Provide the ability to orchestrate workflow.

✑ Provide the capability to run SQL Server Integration Services packages.

Store:

✑ Optimize storage for big data workloads.

✑ Provide encryption of data at rest.

✑ Operate with no size limits.

Prepare and Train:

✑ Provide a fully-managed and interactive workspace for exploration and visualization.

✑ Provide the ability to program in R, SQL, Python, Scala, and Java.

Provide seamless user authentication with Azure Active Directory.

Model & Serve:

✑ Implement native columnar storage.

✑ Support for the SQL language

✑ Provide support for structured streaming.

You need to build the data integration pipeline.

Which technologies should you use? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.

Hot Area:


Correct Answer:

Ingest: Azure Data Factory -

Azure Data Factory pipelines can execute SSIS packages.

In Azure, the following services and tools will meet the core requirements for pipeline orchestration, control flow, and data movement: Azure
Data Factory, Oozie on HDInsight, and SQL Server Integration Services (SSIS).

Store: Data Lake Storage -

Data Lake Storage Gen1 provides unlimited storage.

Note: Data at rest includes information that resides in persistent storage on physical media, in any digital format. Microsoft Azure offers a
variety of data storage solutions to meet different needs, including file, disk, blob, and table storage. Microsoft also provides encryption to
protect Azure SQL Database, Azure Cosmos

DB, and Azure Data Lake.

Prepare and Train: Azure Databricks

Azure Databricks provides enterprise-grade Azure security, including Azure Active Directory integration.

With Azure Databricks, you can set up your Apache Spark environment in minutes, autoscale and collaborate on shared projects in an
interactive workspace.

Azure Databricks supports Python, Scala, R, Java and SQL, as well as data science frameworks and libraries including TensorFlow, PyTorch and
scikit-learn.

Model and Serve: Azure Synapse Analytics

Azure Synapse Analytics/ SQL Data Warehouse stores data into relational tables with columnar storage.

Azure SQL Data Warehouse connector now offers efficient and scalable structured streaming write support for SQL Data Warehouse. Access
SQL Data

Warehouse from Azure Databricks using the SQL Data Warehouse connector.

Note: As of November 2019, Azure SQL Data Warehouse is now Azure Synapse Analytics.

Reference:

https://docs.microsoft.com/bs-latn-ba/azure/architecture/data-guide/technology-choices/pipeline-orchestration-data-movement
https://docs.microsoft.com/en-us/azure/azure-databricks/what-is-azure-databricks

 
Podavenna
Highly Voted 
8 months, 1 week ago
Correct solution!
upvoted 26 times

 
irantov
Highly Voted 
8 months, 1 week ago
Correct!


upvoted 10 times

 
SebK
Most Recent 
2 months ago
Correct
upvoted 1 times

 
Massy
2 months, 1 week ago
For the store, couldn't we also use Azure Blob Storage? It supports all three requirements.
upvoted 1 times

 
NewTuanAnh
1 month, 2 weeks ago
Because ADLS Gen2 supports big data workloads better.
upvoted 1 times

 
paras_gadhiya
3 months ago
Correct
upvoted 1 times

 
PallaviPatel
3 months, 3 weeks ago
Correct solution.
upvoted 1 times

 
joeljohnrm
4 months, 2 weeks ago
Correct Solution
upvoted 1 times

 
[Removed]
4 months, 3 weeks ago
for model and server, HDI has all of this. Why DataBricks?
upvoted 1 times

 
rockyc05
3 months ago
Also seamless integration with AAD
upvoted 1 times

 
rockyc05
3 months ago
Support for SQL
upvoted 1 times

 
corebit
5 months, 1 week ago
It would be best if people posting answers that go against the popular responses provided some reference instead of blindly saying "false".
upvoted 3 times

 
Akash0105
6 months, 2 weeks ago
Answer is correct.

Azure Databricks supports java: https://azure.microsoft.com/en-us/services/databricks/#overview


upvoted 2 times

 
Pratikh
6 months, 3 weeks ago
Databricks doesn't support Java, so Prepare and Train should be an HDInsight Apache Spark cluster.
upvoted 3 times

 
KOSTA007
6 months, 2 weeks ago
Azure Databricks supports Python, Scala, R, Java, and SQL, as well as data science frameworks and libraries including TensorFlow, PyTorch, and
scikit-learn.
upvoted 9 times

 
Aslam208
6 months, 3 weeks ago
Databricks does not support Java; Prepare and Train should be an Azure HDInsight Apache Spark cluster.
upvoted 1 times

 
Aslam208
5 months, 2 weeks ago
I would like to correct my answer here... java is supported in Azure Databricks, therefore Prepare and Train can be done with Azure Databricks
upvoted 3 times

 
Samanda
7 months, 1 week ago
False. Kafka on HDInsight is the correct option for the last box.
upvoted 1 times

 
datachamp
8 months, 1 week ago
Is this an ad?
upvoted 9 times


Question #10 Topic 2

DRAG DROP -

You have the following table named Employees.

You need to calculate the employee_type value based on the hire_date value.

How should you complete the Transact-SQL statement? To answer, drag the appropriate values to the correct targets. Each value may be used
once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content.

NOTE: Each correct selection is worth one point.

Select and Place:

Correct Answer:

Box 1: CASE -

CASE evaluates a list of conditions and returns one of multiple possible result expressions.

CASE can be used in any statement or clause that allows a valid expression. For example, you can use CASE in statements such as SELECT,
UPDATE,

DELETE and SET, and in clauses such as select_list, IN, WHERE, ORDER BY, and HAVING.

Syntax: Simple CASE expression:

CASE input_expression
    WHEN when_expression THEN result_expression [ ...n ]
    [ ELSE else_result_expression ]
END

Box 2: ELSE -

Reference:

https://docs.microsoft.com/en-us/sql/t-sql/language-elements/case-transact-sql
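As a worked illustration only (the exhibit with the actual hire_date cutoff and labels is not reproduced here, so the column names, the date threshold, and the employee_type values below are assumptions):

SELECT employee_id,
       hire_date,
       CASE
           WHEN hire_date >= '2019-01-01' THEN 'New'   -- assumed cutoff, for illustration
           ELSE 'Standard'                             -- assumed label, for illustration
       END AS employee_type
FROM Employees;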

 
MoDar
Highly Voted 
8 months, 4 weeks ago


Correct
upvoted 24 times

 
NewTuanAnh
Most Recent 
1 month, 2 weeks ago
the answer is correct

CASE ...

WHEN ... THEN...

ELSE ...
upvoted 2 times

 
PallaviPatel
3 months, 3 weeks ago
correct
upvoted 1 times

 
steeee
9 months ago
The answer is correct. But, is this in the scope of this exam?
upvoted 4 times

 
anto69
4 months, 2 weeks ago
it seems
upvoted 1 times

 
mkutts
6 months, 4 weeks ago
Got this question yesterday so yes.
upvoted 5 times

 
parwa
9 months ago
Makes sense to me; a data engineer should be able to write queries.
upvoted 6 times

https://www.examtopics.com/exams/microsoft/dp-203/custom-view/ 155/161
5/26/22, 3:46 PM DP-203 Exam – Free Actual Q&As, Page 1 | ExamTopics

Question #11 Topic 2

DRAG DROP -

You have an Azure Synapse Analytics workspace named WS1.

You have an Azure Data Lake Storage Gen2 container that contains JSON-formatted files in the following format.

You need to use the serverless SQL pool in WS1 to read the files.

How should you complete the Transact-SQL statement? To answer, drag the appropriate values to the correct targets. Each value may be used
once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content.


NOTE: Each correct selection is worth one point.

Select and Place:

Correct Answer:

Box 1: openrowset -

The easiest way to see to the content of your CSV file is to provide file URL to OPENROWSET function, specify csv FORMAT.

Example:

SELECT *
FROM OPENROWSET(
    BULK 'csv/population/population.csv',
    DATA_SOURCE = 'SqlOnDemandDemo',
    FORMAT = 'CSV', PARSER_VERSION = '2.0',
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n'
) AS [r]


Box 2: openjson -
You can access your JSON files from the Azure File Storage share by using the mapped drive, as shown in the following example:

SELECT book.*
FROM OPENROWSET(BULK N't:\books\books.json', SINGLE_CLOB) AS json
CROSS APPLY OPENJSON(BulkColumn)
WITH (id NVARCHAR(100), name NVARCHAR(100), price FLOAT,
      pages_i INT, author NVARCHAR(100)) AS book

Reference:

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/query-single-csv-file https://docs.microsoft.com/en-us/sql/relational-
databases/json/import-json-documents-into-sql-server
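A minimal sketch that combines the two selected keywords for this scenario, assuming a hypothetical ADLS Gen2 path and JSON property names (the real ones are shown in the exhibit, which is not reproduced here). The 0x0b field terminator and field quote make OPENROWSET return each JSON document as a single column, which OPENJSON then parses:

SELECT j.*
FROM OPENROWSET(
        BULK 'https://<storageaccount>.dfs.core.windows.net/<container>/*.json',
        FORMAT = 'CSV',
        FIELDTERMINATOR = '0x0b',
        FIELDQUOTE = '0x0b'
     ) WITH (doc NVARCHAR(MAX)) AS rows
CROSS APPLY OPENJSON(doc)
     WITH (id NVARCHAR(100), name NVARCHAR(100)) AS j;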

 
Maunik
Highly Voted 
8 months, 2 weeks ago
Answer is correct

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/query-json-files
upvoted 25 times

 
Lrng15
8 months ago
answer is correct as per this link
upvoted 1 times

 
gf2tw
Highly Voted 
8 months, 2 weeks ago
The question and answer seem out of place; there was no mention of CSV, and the query in the answer doesn't match up with OPENJSON at all.
upvoted 6 times

 
dev2dev
4 months, 1 week ago
Look at the WITH statement, the csv column can contain json data.
upvoted 1 times

 
anto69
4 months, 2 weeks ago
agree with u
upvoted 1 times

 
dead_SQL_pool
6 months ago
Actually, the csv format is specified if you're using OPENROWSET to read json files in Synapse. The OPENJSON is required if you want to parse
data from every array in the document. See the OPENJSON example in this link:

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/query-json-files#query-json-files-using-openjson
upvoted 8 times

 
gf2tw
5 months, 2 weeks ago
Thanks, you're right:

"The easiest way to see to the content of your JSON file is to provide the file URL to the OPENROWSET function, specify csv FORMAT, and set
values 0x0b for fieldterminator and fieldquote."
upvoted 4 times

 
gssd4scoder
6 months, 3 weeks ago
agree with you, very misleading
upvoted 1 times

 
SebK
Most Recent 
2 months ago
Correct
upvoted 1 times

 
PallaviPatel
3 months, 3 weeks ago
correct
upvoted 1 times


Question #12 Topic 2

DRAG DROP -

You have an Apache Spark DataFrame named temperatures. A sample of the data is shown in the following table.

You need to produce the following table by using a Spark SQL query.

How should you complete the query? To answer, drag the appropriate values to the correct targets. Each value may be used once, more than once,
or not at all.

You may need to drag the split bar between panes or scroll to view content.

NOTE: Each correct selection is worth one point.

Select and Place:


Correct Answer:

Box 1: PIVOT -

PIVOT rotates a table-valued expression by turning the unique values from one column in the expression into multiple columns in the output.
And PIVOT runs aggregations where they're required on any remaining column values that are wanted in the final output.

Incorrect Answers:

UNPIVOT carries out the opposite operation to PIVOT by rotating columns of a table-valued expression into column values.

Box 2: CAST -

If you want to convert an integer value to a DECIMAL data type in SQL Server use the CAST() function.

Example:

SELECT CAST(12 AS DECIMAL(7,2)) AS decimal_value;

Here is the result:

decimal_value

12.00

Reference:

https://learnsql.com/cookbook/how-to-convert-an-integer-to-a-decimal-in-sql-server/ https://docs.microsoft.com/en-us/sql/t-sql/queries/from-
using-pivot-and-unpivot
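For reference, a minimal Spark SQL sketch of the completed query, assuming the temperatures data is registered as a table or view with date and temp columns roughly matching the sample (the exact column names and months are assumptions):

SELECT * FROM (
    SELECT YEAR(date) AS year, MONTH(date) AS month, temp
    FROM temperatures
)
PIVOT (
    CAST(AVG(temp) AS DECIMAL(4, 1))
    FOR month IN (6 AS JUN, 7 AS JUL, 8 AS AUG)
);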

 
SujithaVulchi
Highly Voted 
8 months ago
correct answer, pivot and cast
upvoted 22 times

 
ggggyyyyy
Most Recent 
8 months ago
Correct. CAST, not CONVERT.
upvoted 3 times


Question #13 Topic 2

You have an Azure Data Factory that contains 10 pipelines.

You need to label each pipeline with its main purpose of either ingest, transform, or load. The labels must be available for grouping and filtering
when using the monitoring experience in Data Factory.

What should you add to each pipeline?

A.
a resource tag

B.
a correlation ID

C.
a run group ID

D.
an annotation

Correct Answer:
D

Annotations are additional, informative tags that you can add to specific factory resources: pipelines, datasets, linked services, and triggers. By
adding annotations, you can easily filter and search for specific factory resources.

Reference:

https://www.cathrinewilhelmsen.net/annotations-user-properties-azure-data-factory/

Community vote distribution


D (100%)

 
umeshkd05
Highly Voted 
8 months, 2 weeks ago
Annotation
upvoted 16 times

 
anto69
4 months, 1 week ago
Because ADF pipelines are not first-class Azure resources.
upvoted 1 times

 
AhmedDaffaie
Most Recent 
2 months, 1 week ago
What is the difference between resource tags and annotations?
upvoted 1 times

 
paras_gadhiya
3 months ago
Correct!
upvoted 1 times

 
PallaviPatel
3 months, 3 weeks ago
Selected Answer: D
correct
upvoted 1 times

 
huesazo
4 months, 1 week ago
Selected Answer: D
Annotation
upvoted 1 times

 
aarthy2
7 months, 3 weeks ago
Yes, correct. Annotations provide label functionality that shows up in pipeline monitoring.
upvoted 2 times

