DP-203 Exam – Free Actual Q&As | ExamTopics
Expert Verified, Online, Free.

Topic 1 - Question Set 1


Question #1 Topic 1

You have a table in an Azure Synapse Analytics dedicated SQL pool. The table was created by using the following Transact-SQL statement.

You need to alter the table to meet the following requirements:

✑ Ensure that users can identify the current manager of employees.

✑ Support creating an employee reporting hierarchy for your entire company.

✑ Provide fast lookup of the managers' attributes such as name and job title.

Which column should you add to the table?

A.
[ManagerEmployeeID] [smallint] NULL

B.
[ManagerEmployeeKey] [smallint] NULL

C.
[ManagerEmployeeKey] [int] NULL

D.
[ManagerName] [varchar](200) NULL

Correct Answer:
C

We need an extra column to identify the manager. Use the same data type as the EmployeeKey column, which is an int column.
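
As a rough illustration (not from the exhibit, whose CREATE TABLE statement is shown only as an image), the change and the lookup it enables might look like the following sketch; the table and attribute names (dbo.DimEmployee, EmployeeName, JobTitle) are assumptions.

-- Hypothetical sketch: add a self-referencing key with the same data type as EmployeeKey.
ALTER TABLE dbo.DimEmployee
ADD [ManagerEmployeeKey] [int] NULL;

-- Fast lookup of a manager's attributes via a self-join on the surrogate key.
SELECT e.EmployeeName,
       m.EmployeeName AS ManagerName,
       m.JobTitle     AS ManagerJobTitle
FROM dbo.DimEmployee AS e
LEFT JOIN dbo.DimEmployee AS m
    ON e.ManagerEmployeeKey = m.EmployeeKey;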

Reference:

https://docs.microsoft.com/en-us/analysis-services/tabular-models/hierarchies-ssas-tabular

Community vote distribution


C (100%)

 
alexleonvalencia
Highly Voted 
5 months, 2 weeks ago
Selected Answer: C
The answer is correct.
upvoted 10 times

 
jskibick
Highly Voted 
3 months, 3 weeks ago
Selected Answer: C
Answer C. Smallint eliminates A and B. But I would name the field [ManagerEmployeeID] [int] NULL since it should reference EmployeeID, not
EmployeeKey since this one is IDENTITY.
upvoted 6 times

 
Dothy
Most Recent 
2 weeks ago
Selected Answer: C
upvoted 1 times

 
Egocentric
1 month, 1 week ago
C is the correct answer
upvoted 1 times

 
boggy011
1 month, 1 week ago
Selected Answer: C
upvoted 1 times

 
temacc
2 months ago
Selected Answer: C
Correct answer is C
upvoted 1 times


 
Guincimund
2 months ago
Selected Answer: C
answer is C.
upvoted 1 times

 
NeerajKumar
2 months, 3 weeks ago
Selected Answer: C
Correct Ans is C
upvoted 2 times

 
KosteK
2 months, 4 weeks ago
Selected Answer: C
correct answer is C
upvoted 1 times

 
samtrion
3 months ago
Selected Answer: C
It is quite obvious C
upvoted 1 times

 
ArunCDE
3 months, 2 weeks ago
Selected Answer: C
This is the correct answer.
upvoted 1 times

 
PallaviPatel
4 months ago
Selected Answer: C
correct is C. Use surrogate key instead of business key as a reference.
upvoted 1 times

 
Aurelkb
4 months, 1 week ago
correct answer is C
upvoted 1 times

 
pozdrotechno
4 months, 1 week ago
Selected Answer: C
C is correct.

The column should be based on the surrogate key (EmployeeKey), including an identical data type.
upvoted 2 times

 
SofiaG
4 months, 1 week ago
Selected Answer: C
Correct.
upvoted 1 times

 
jchen9314
4 months, 2 weeks ago
INT has the best performance to be a key.
upvoted 2 times


Question #2 Topic 1

You have an Azure Synapse workspace named MyWorkspace that contains an Apache Spark database named mytestdb.

You run the following command in an Azure Synapse Analytics Spark pool in MyWorkspace.

CREATE TABLE mytestdb.myParquetTable(

EmployeeID int,

EmployeeName string,

EmployeeStartDate date)

USING Parquet -

You then use Spark to insert a row into mytestdb.myParquetTable. The row contains the following data.

One minute later, you execute the following query from a serverless SQL pool in MyWorkspace.

SELECT EmployeeID -

FROM mytestdb.dbo.myParquetTable

WHERE name = 'Alice';

What will be returned by the query?

A.
24

B.
an error

C.
a null value

Correct Answer:
A

Once a database has been created by a Spark job, you can create tables in it with Spark that use Parquet as the storage format. Table names
will be converted to lower case and need to be queried using the lower case name. These tables will immediately become available for querying
by any of the Azure Synapse workspace Spark pools. They can also be used from any of the Spark jobs subject to permissions.
Note: For external tables, since they are synchronized to serverless SQL pool asynchronously, there will be a delay until they appear.
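
For reference, a hedged sketch of how the synchronized table can be queried from a serverless SQL pool: the Spark table is exposed in lower case under the dbo schema, and the filter column here is EmployeeName rather than the [name] column used in the question, which is the point most of the discussion below turns on. The value 24 for Alice is an assumption taken from option A.

-- Minimal sketch (assumes the row (24, 'Alice', ...) shown in the exhibit was inserted):
SELECT EmployeeID
FROM mytestdb.dbo.myparquettable   -- Spark table names are synchronized in lower case
WHERE EmployeeName = 'Alice';      -- the question's query filters on [name], which is not a column of the table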

Reference:

https://docs.microsoft.com/en-us/azure/synapse-analytics/metadata/table

Community vote distribution


B (100%)

 
kruukp
Highly Voted 
1 year ago
B is the correct answer. The WHERE clause uses a column named 'name', which doesn't exist in the table.
upvoted 103 times

 
knarf
11 months, 1 week ago
I agree B is correct, not because the column 'name' in the query is invalid, but because the table reference itself is invalid as the table was
created as CREATE TABLE mytestdb.myParquetTable and not mytestdb.dbo.myParquetTable
upvoted 14 times

 
EddyRoboto
9 months ago
When we query a Spark table from SQL Serverless we must use the schema, in this case dbo, so this doesn't cause errors.
upvoted 7 times

 
anarvekar
9 months, 3 weeks ago
Isn't dbo the default schema the objects are created in, if the schema name is not explicitly specified in the DDL?
upvoted 2 times

 
AugustineUba
10 months ago
I agree with this.
upvoted 1 times

 
anto69
4 months, 2 weeks ago
Agree there's no column named 'name'
upvoted 2 times

 
baobabko
12 months ago


Even if the column name were correct: when I tried the example, it threw an error that the table doesn't exist (as expected - after all, it is a Spark table, not SQL. There is no external or any other table which could be queried in the SQL pool).
upvoted 4 times

 
EddyRoboto
9 months ago
They share the same metadata; perhaps you forgot to specify the schema in your query in the SQL Serverless pool. You should have tried spark_db.[dbo].spark_table
upvoted 2 times

 
Alekx42
12 months ago
https://docs.microsoft.com/en-us/azure/synapse-analytics/metadata/table

"Once a database has been created by a Spark job, you can create tables in it with Spark that use Parquet as the storage format. Table names
will be converted to lower case and need to be queried using the lower case name. These tables will immediately become available for
querying by any of the Azure Synapse workspace Spark pools. The Spark created, managed, and external tables are also made available as
external tables with the same name in the corresponding synchronized database in serverless SQL pool."

I think the reason you got the error was because the query had to use the lower case names. See the example in the same link, they create a
similar table and use the lowercase letters to query it from the Serverless SQL pool.

Anyway, this confirms that B is the correct answer here.


upvoted 7 times

 
knarf
11 months ago
See my post above and comment?
upvoted 1 times

 
polokkk
Highly Voted 
4 months, 2 weeks ago
A is correct in the real exam; there the filter was EmployeeName, not name, so 24 is the one to select in the real exam.

B is correct for this question, since it isn't worded exactly the same as the exam question; thus B is correct here
upvoted 12 times

 
Dothy
Most Recent 
2 weeks ago
No EmployeeName column in query. So answer B is correct
upvoted 1 times

 
Lizaveta
4 weeks ago
I came across this question on an exam today. It had the correct query, with "WHERE EmployeeName = 'Alice'". So I answered A (24).


upvoted 3 times

 
FelixI
1 month ago
Selected Answer: B
No EmployeeName column in query
upvoted 1 times

 
Egocentric
1 month, 1 week ago
B is correct because there is no column called name
upvoted 1 times

 
xuezhizlv
1 month, 3 weeks ago
Selected Answer: B
B is the correct answer.
upvoted 1 times

 
AlCubeHead
2 months ago
Selected Answer: B
There is no name column in the table. Also, the - at the end of the SELECT is dubious to me
upvoted 1 times

 
Guincimund
2 months ago
Selected Answer: B
SELECT EmployeeID -

FROM mytestdb.dbo.myParquetTable

WHERE name = 'Alice';

As the Where clause is name = 'Alice', the answer is B as there is no column named 'name'.

In the case where the WHERE clause is "WHERE EmployeeName = 'Alice'", the query will return 24, which is answer A.
upvoted 2 times

 
Sakshi_21
2 months, 2 weeks ago
Selected Answer: B
The name column doesn't exist in the table
upvoted 1 times

 
enricobny
2 months, 2 weeks ago


B is the right answer. Column 'name' is not present in the table structure, and also using mytestdb.dbo.myParquetTable will not work - [dbo] is the problem.

The correct syntax is :

SELECT EmployeeID FROM mytestdb.myParquetTable WHERE EmployeeName = 'Alice';


upvoted 1 times

 
Rama22
2 months, 3 weeks ago
The filter is name = 'Alice', not EmployeeName, so it will throw an error
upvoted 1 times

 
jskibick
3 months, 3 weeks ago
Selected Answer: B
B. The serverless SQL query has an error; the name field does not exist in the table
upvoted 1 times

 
PallaviPatel
4 months ago
Selected Answer: B
Wrong table and column names in the query.
upvoted 1 times

 
Fer079
4 months ago
Regarding the lower case... I have tested it on Azure: I created the table in a Spark pool, and it's true that it is converted to lower case automatically; however, we can query from both the Spark pool and the Synapse serverless pool using lower/upper case and it will always find the table... Did anyone else test it?
upvoted 2 times

 
ANath
4 months ago
I am getting 'Bulk load data conversion error (type mismatch or invalid character for the specified codepage)' error.
upvoted 1 times

 
ANath
4 months ago
Sorry, I was doing it in the wrong way. If we specify the table name in lower case and specify the correct column name, the exact result will show.
upvoted 1 times

 
pozdrotechno
4 months, 1 week ago
Selected Answer: B
B is correct.

- incorrect db/schema/table name: mytestdb.myParquetTable vs mytestdb.dbo.myParquetTable


- incorrect column name: EmployeeName vs name

- not using lower case in the query


upvoted 2 times


Question #3 Topic 1

DRAG DROP -

You have a table named SalesFact in an enterprise data warehouse in Azure Synapse Analytics. SalesFact contains sales data from the past 36
months and has the following characteristics:

✑ Is partitioned by month

✑ Contains one billion rows

✑ Has clustered columnstore indexes

At the beginning of each month, you need to remove data from SalesFact that is older than 36 months as quickly as possible.

Which three actions should you perform in sequence in a stored procedure? To answer, move the appropriate actions from the list of actions to the
answer area and arrange them in the correct order.

Select and Place:

Correct Answer:

Step 1: Create an empty table named SalesFact_work that has the same schema as SalesFact.

Step 2: Switch the partition containing the stale data from SalesFact to SalesFact_Work.

SQL Data Warehouse supports partition splitting, merging, and switching. To switch partitions between two tables, you must ensure that the
partitions align on their respective boundaries and that the table definitions match.

Loading data into partitions with partition switching is a convenient way to stage new data in a table that is not visible to users, and then switch in the new data.

Step 3: Drop the SalesFact_Work table.
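
A minimal sketch of the three steps, assuming a distribution column, partitioning column, boundary values, and partition number that are not shown in the text of the question; the key point is that the work table's distribution and partition boundaries must align with SalesFact so that the SWITCH is a metadata-only operation.

-- Step 1: empty work table with the same schema, distribution, and aligned boundaries (placeholder values).
CREATE TABLE dbo.SalesFact_Work
WITH
(
    DISTRIBUTION = HASH([ProductKey]),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION ([TransactionDateId] RANGE RIGHT FOR VALUES (20190601, 20190701))
)
AS
SELECT * FROM dbo.SalesFact WHERE 1 = 2;

-- Step 2: switch out the partition holding the month that is now older than 36 months (partition number is a placeholder).
ALTER TABLE dbo.SalesFact SWITCH PARTITION 2 TO dbo.SalesFact_Work PARTITION 2;

-- Step 3: drop the work table, which discards the stale data.
DROP TABLE dbo.SalesFact_Work;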

Reference:

https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-tables-partition

 
hsetin
Highly Voted 
8 months, 2 weeks ago

Given answer D A C is correct.


upvoted 24 times

 
svik
8 months, 2 weeks ago
Yes. Once the partition is switched with an empty partition it is equivalent to truncating the partition from the original table
upvoted 1 times

 
Dothy
Most Recent 
2 weeks ago
Step 1: Create an empty table named SalesFact_work that has the same schema as SalesFact.

Step 2: Switch the partition containing the stale data from SalesFact to SalesFact_Work.

Step 3: Drop the SalesFact_Work table.


upvoted 1 times

 
JJdeWit
1 month, 1 week ago
D A C is the right option.

For more information, this doc discusses exactly this example: https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-
data-warehouse-tables-partition
upvoted 1 times

 
theezin
1 month, 3 weeks ago
Why isn't deleting the sales data older than 36 months, which is mentioned in the question, included?
upvoted 1 times

 
RamGhase
3 months, 2 weeks ago
I could not understand how the answer handles removing data older than 36 months
upvoted 1 times

 
gerard
3 months, 2 weeks ago
You have to switch out the partition that contains the data older than 36 months
upvoted 2 times

 
PallaviPatel
4 months ago
D A C is correct.
upvoted 1 times

 
indomanish
4 months, 2 weeks ago
Partition switching helps us load large data sets quickly. Not sure if it will help in purging data as well.
upvoted 2 times

 
SabaJamal2010AtGmail
4 months, 4 weeks ago
Given answer is correct
upvoted 2 times

 
covillmail
7 months ago
DAC is correct
upvoted 4 times

 
AvithK
9 months, 2 weeks ago
truncate partition is even quicker, why isn't that the answer, if the data is dropped anyway?
upvoted 3 times

 
yolap31172
7 months, 2 weeks ago
There is no way to truncate partitions in Synapse. Partitions don't even have names and you can't reference them by value.
upvoted 4 times

 
BlackMal
9 months, 2 weeks ago
This, i think it should be the answer
upvoted 1 times

 
poornipv
10 months ago
what is the correct answer for this?
upvoted 2 times

 
AnonAzureDataEngineer
10 months ago
Seems like it should be:

1. E

2. A

3. C
upvoted 1 times

 
dragos_dragos62000
10 months, 4 weeks ago
Correct!
upvoted 1 times

 
Dileepvikram
12 months ago

Copying the data to a backup table is not mentioned in the answer


upvoted 1 times

 
savin
11 months, 1 week ago
The partition switching part covers it. So it's correct, I think
upvoted 1 times

 
wfrf92
1 year ago
Is this correct ????
upvoted 1 times

 
alain2
1 year ago
Yes, it is.

https://www.cathrinewilhelmsen.net/table-partitioning-in-sql-server-partition-switching/
upvoted 5 times

 
YipingRuan
7 months, 2 weeks ago
"Archive data by switching out: Switch from Partition to Non-Partitioned" ?
upvoted 1 times

 
TorbenS
1 year ago
yes, I think so
upvoted 4 times


Question #4 Topic 1

You have files and folders in Azure Data Lake Storage Gen2 for an Azure Synapse workspace as shown in the following exhibit.

You create an external table named ExtTable that has LOCATION='/topfolder/'.

When you query ExtTable by using an Azure Synapse Analytics serverless SQL pool, which files are returned?

A.
File2.csv and File3.csv only

B.
File1.csv and File4.csv only

C.
File1.csv, File2.csv, File3.csv, and File4.csv

D.
File1.csv only

Correct Answer:
C

To run a T-SQL query over a set of files within a folder or set of folders while treating them as a single entity or rowset, provide a path to a folder
or a pattern
(using wildcards) over a set of files or folders.
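
For context, a hedged sketch of the behavior the discussion below turns on; the data source and file format names are hypothetical, and the only point of interest is the effect of a trailing /** on a native external table in a serverless SQL pool.

-- With a native (serverless) external table, LOCATION = '/topfolder/' returns only the files
-- directly under /topfolder/; appending /** is what makes the query traverse subfolders.
CREATE EXTERNAL TABLE ExtTable
(
    [Col1] NVARCHAR(100),
    [Col2] NVARCHAR(100)
)
WITH
(
    LOCATION = '/topfolder/**',        -- '/topfolder/' alone skips files in subfolders (answer B in the discussion)
    DATA_SOURCE = MyDataLakeSource,    -- hypothetical
    FILE_FORMAT = MyCsvFormat          -- hypothetical
);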

Reference:

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/query-data-storage#query-multiple-files-or-folders

Community vote distribution


B (100%)

 
Chillem1900
Highly Voted 
1 year ago
I believe the answer should be B.

In case of a serverless pool a wildcard should be added to the location.

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables?tabs=hadoop#arguments-create-external-table
upvoted 76 times

 
captainpike
7 months, 1 week ago
I tested and proved you right; the answer is B. Remember the question is referring to serverless SQL and not a dedicated SQL pool. "Unlike Hadoop external tables, native external tables don't return subfolders unless you specify /** at the end of path. In this example, if LOCATION='/webdata/', a serverless SQL pool query, will return rows from mydata.txt. It won't return mydata2.txt and mydata3.txt because they're located in a subfolder. Hadoop tables will return all files within any subfolder."
upvoted 12 times

 
alain2
Highly Voted 
1 year ago
"Serverless SQL pool can recursively traverse folders only if you specify /** at the end of path."

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/query-folders-multiple-csv-files
upvoted 17 times

 
Preben
11 months, 3 weeks ago
When you are quoting from Microsoft documentation, do not ADD in words to the sentence. 'Only' is not used.
upvoted 10 times

 
captainpike
7 months, 1 week ago
The answer is B, however. I could not make "/**" work. Anybody?
upvoted 2 times

 
amiral404
Most Recent 
2 days, 21 hours ago
C is correct, as mentioned in the official documentation, which showcases a similar example: https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-table-transact-sql?view=sql-server-ver15&tabs=dedicated#location--folder_or_filepath
upvoted 1 times

 
Backy
1 week, 5 days ago
The question does not show the actual query so this is a problem
upvoted 1 times

 
Dothy
2 weeks ago
I believe the answer should be B.


upvoted 1 times

 
carloalbe
3 weeks ago
In this example, if LOCATION='/webdata/', a PolyBase query will return rows from mydata.txt and mydata2.txt. It won't return mydata3.txt because it's a file in a hidden folder. And it won't return _hidden.txt because it's a hidden file. https://docs.microsoft.com/en-us/sql/t-sql/statements/media/aps-polybase-folder-traversal.png?view=sql-server-ver15b

https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-table-transact-sql?view=sql-server-ver15&tabs=dedicated
upvoted 1 times

 
BJPJowee
4 weeks ago
The answer is correct: C. See the link https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-table-transact-sql?view=sql-server-ver15&tabs=dedicated
upvoted 1 times

 
MS_Nikhil
4 weeks, 1 day ago
Ans is definitely B
upvoted 1 times

 
poundmanluffy
1 month, 4 weeks ago
Selected Answer: B
Option is definitely "B"

Below is the documentation given on MS Docs:

Recursive data for external tables

Unlike Hadoop external tables, native external tables don't return subfolders unless you specify /** at the end of path. In this example, if
LOCATION='/webdata/', a serverless SQL pool query, will return rows from mydata.txt. It won't return mydata2.txt and mydata3.txt because they're
located in a subfolder. Hadoop tables will return all files within any sub-folder.
upvoted 1 times

 
Ozren
2 months, 1 week ago
Selected Answer: B
This is not a recursive pattern like '.../**'. So the answer is B, not C.
upvoted 2 times

 
kamil_k
2 months, 2 weeks ago
this one is tricky, I found information here which would suggest answer C is indeed correct:

https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-table-transact-sql?view=sql-server-ver15&tabs=dedicated#arguments-2
upvoted 1 times

 
kamil_k
2 months, 2 weeks ago
ok I've done the test:

1. created gen 2 storage acct

2. created azure synapse workspace

3. created container myfilesystem, subfolder topfolder and another subfolder topfolder under that

4. created two csv files and dropped one per folder i.e. one in topfolder and the other in topfolder/topfolder

5. executed the following code:

DROP EXTERNAL DATA SOURCE test;

CREATE EXTERNAL DATA SOURCE test
WITH (
    LOCATION = 'https://[storage-account-name].blob.core.windows.net/myfilesystem'
);

CREATE EXTERNAL FILE FORMAT test
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = ',',
        FIRST_ROW = 2
    )
);

CREATE EXTERNAL TABLE test
(id int, value int)
WITH (
    LOCATION = '/topfolder/',
    DATA_SOURCE = test,
    FILE_FORMAT = test
);

SELECT * FROM test;

The result was only the records from File1.csv, which was located in the first "topfolder".
upvoted 2 times


 
kamil_k
2 months, 2 weeks ago
In other words, answer C is incorrect. I forgot to mention I used the built-in serverless SQL pool
upvoted 1 times

 
islamarfh
1 month, 4 weeks ago
This is from the document, which tells that B is indeed correct:

In this example, if LOCATION='/webdata/', a PolyBase query will return rows from mydata.txt and mydata2.txt. It won't return mydata3.txt
because it's a file in a hidden folder. And it won't return _hidden.txt because it's a hidden file.
upvoted 1 times

 
RalphLiang
2 months, 3 weeks ago
Selected Answer: B
I believe the answer should be B.

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables?tabs=hadoop#arguments-create-external-table
upvoted 2 times

 
KosteK
2 months, 4 weeks ago
Selected Answer: B
Tested. Ans: B
upvoted 2 times

 
toms100
3 months, 2 weeks ago
Answer is C

If you specify LOCATION to be a folder, a PolyBase query that selects from the external table will retrieve files from the folder and all of its
subfolders.

Refer https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-table-transact-sql?view=sql-server-ver15&tabs=dedicated
upvoted 2 times

 
PallaviPatel
4 months ago
Selected Answer: B
B is correct answer.
upvoted 2 times

 
Sandip4u
4 months, 2 weeks ago
The answer is B. In the case of a serverless pool, a wildcard should be added to the location; otherwise this will not fetch the files from child folders
upvoted 2 times

 
bharatnhkh10
4 months, 3 weeks ago
Selected Answer: B
as we need ** to pick up all files
upvoted 2 times


Question #5 Topic 1

HOTSPOT -

You are planning the deployment of Azure Data Lake Storage Gen2.

You have the following two reports that will access the data lake:

✑ Report1: Reads three columns from a file that contains 50 columns.

✑ Report2: Queries a single record based on a timestamp.

You need to recommend in which format to store the data in the data lake to support the reports. The solution must minimize read times.

What should you recommend for each report? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.

Hot Area:

Correct Answer:

Report1: CSV -

CSV: The destination writes records as delimited data.

Report2: AVRO -

AVRO supports timestamps.

Not Parquet, TSV: Not options for Azure Data Lake Storage Gen2.

Reference:

https://streamsets.com/documentation/datacollector/latest/help/datacollector/UserGuide/Destinations/ADLS-G2-D.html
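
For context on what Report1's access pattern looks like from a serverless SQL pool, here is a minimal sketch; the storage URL and column names are hypothetical. With a columnar format such as Parquet only the requested columns are read from storage, whereas row-based formats such as CSV or Avro read entire rows, which is the trade-off the discussion below centers on.

-- Report1: read 3 of 50 columns; a columnar format lets the engine skip the other 47.
SELECT [Col1], [Col2], [Col3]
FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/reports/report1/*.parquet',
    FORMAT = 'PARQUET'
) AS [result];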

 
alain2
Highly Voted 
1 year ago
1: Parquet - column-oriented binary file format

2: AVRO - Row based format, and has logical type timestamp

https://youtu.be/UrWthx8T3UY
upvoted 92 times

 
azurestudent1498
1 month, 1 week ago


this is correct.
upvoted 1 times

 
terajuana
11 months, 2 weeks ago
the web is full of old information. timestamp support has been added to parquet
upvoted 5 times

 
vlad888
11 months ago
OK, but in the 1st case we need only 3 of 50 columns, and Parquet is a columnar format. In the 2nd, Avro, because it is ideal for reading a full row
upvoted 12 times

 
Himlo24
Highly Voted 
1 year ago
Shouldn't the answer for Report 1 be Parquet? Because Parquet format is Columnar and should be best for reading a few columns only.
upvoted 18 times

 
main616
Most Recent 
1 week, 6 days ago
1. CSV (or JSON). CSV/JSON support query acceleration by selecting specified rows: https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-query-acceleration#overview

2. Avro
upvoted 1 times

 
Dothy
2 weeks ago
1: Parquet

2: AVRO
upvoted 1 times

 
RalphLiang
2 months, 3 weeks ago
Consider Parquet and ORC file formats when the I/O patterns are more read heavy or when the query patterns are focused on a subset of columns
in the records.

the Avro format works well with a message bus such as Event Hub or Kafka that write multiple events/messages in succession.
upvoted 3 times

 
ragz_87
3 months, 3 weeks ago
1. Parquet

2. Avro

https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-best-practices

"Consider using the Avro file format in cases where your I/O patterns are more write heavy, or the query patterns favor retrieving multiple rows of
records in their entirety.

Consider Parquet and ORC file formats when the I/O patterns are more read heavy or when the query patterns are focused on a subset of columns
in the records."
upvoted 5 times

 
SebK
2 months ago
Thank you.
upvoted 1 times

 
MohammadKhubeb
4 months ago
Why NOT csv in report1 ?
upvoted 1 times

 
Sandip4u
4 months, 2 weeks ago
This has to be parquet and AVRO , got the answer from Udemy
upvoted 4 times

 
Mahesh_mm
5 months ago
1. Parquet

2. AVRO
upvoted 3 times

 
marcin1212
5 months, 2 weeks ago
The goal is: The solution must minimize read times.

I made a small test on Databricks plus Data Lake.

The same file saved as Parquet and Avro,

9 million records.

Parquet ~150 MB

Avro ~700 MB

Reading Parquet is always 10 times faster than Avro.

I checked:

- for all data or small range of data with condition

- all or only one column

So I will select option:


- Parquet

- Parquet
upvoted 2 times

 
dev2dev
4 months, 3 weeks ago
how can be faster read is same as number of reads?
upvoted 1 times

 
Ozzypoppe
6 months ago
Solution says parquet is not supported for adls gen 2 but it actually is: https://docs.microsoft.com/en-us/azure/data-factory/format-parquet
upvoted 3 times

 
noranathalie
7 months, 1 week ago
An interesting and complete article that explains the different uses between parquet/avro/csv and gives answers to the question :
https://medium.com/ssense-tech/csv-vs-parquet-vs-avro-choosing-the-right-tool-for-the-right-job-79c9f56914a8
upvoted 4 times

 
elimey
10 months, 1 week ago
https://luminousmen.com/post/big-data-file-formats
upvoted 1 times

 
elimey
10 months, 1 week ago
Report 1 definitely Parquet
upvoted 1 times

 
noone_a
10 months, 3 weeks ago
report 1 - Parquet as it is columar.

report 2 - avro as it is row based and can be compressed further than csv.
upvoted 1 times

 
bsa_2021
11 months, 1 week ago
The answer provided and the answer from the discussion differ. Which one should I follow for the actual exam?
upvoted 1 times

 
Yaduvanshi
7 months, 2 weeks ago
Follow what feels logical after reading the answer and the discussion forum.
upvoted 2 times

 
bc5468521
12 months ago
1- Parquet

2- Parquet

Since they are all querying: Avro is good for writing (OLTP), Parquet is good for querying/reading
upvoted 5 times


Question #6 Topic 1

You are designing the folder structure for an Azure Data Lake Storage Gen2 container.

Users will query data by using a variety of services including Azure Databricks and Azure Synapse Analytics serverless SQL pools. The data will be
secured by subject area. Most queries will include data from the current year or current month.

Which folder structure should you recommend to support fast queries and simplified folder security?

A.
/{SubjectArea}/{DataSource}/{DD}/{MM}/{YYYY}/{FileData}_{YYYY}_{MM}_{DD}.csv

B.
/{DD}/{MM}/{YYYY}/{SubjectArea}/{DataSource}/{FileData}_{YYYY}_{MM}_{DD}.csv

C.
/{YYYY}/{MM}/{DD}/{SubjectArea}/{DataSource}/{FileData}_{YYYY}_{MM}_{DD}.csv

D.
/{SubjectArea}/{DataSource}/{YYYY}/{MM}/{DD}/{FileData}_{YYYY}_{MM}_{DD}.csv

Correct Answer:
D

There's an important reason to put the date at the end of the directory structure. If you want to lock down certain regions or subject matters to
users/groups, then you can easily do so with the POSIX permissions. Otherwise, if there was a need to restrict a certain security group to
viewing just the UK data or certain planes, with the date structure in front a separate permission would be required for numerous directories
under every hour directory. Additionally, having the date structure in front would exponentially increase the number of directories as time went
on.

Note: In IoT workloads, there can be a great deal of data being landed in the data store that spans across numerous products, devices,
organizations, and customers. It's important to pre-plan the directory layout for organization, security, and efficient processing of the data for
down-stream consumers. A general template to consider might be the following layout:

{Region}/{SubjectMatter(s)}/{yyyy}/{mm}/{dd}/{hh}/

Community vote distribution


D (100%)

 
sagga
Highly Voted 
1 year ago
D is correct

https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-best-practices#batch-jobs-structure
upvoted 41 times

 
Dothy
Most Recent 
2 weeks ago
D is correct
upvoted 1 times

 
Olukunmi
1 month ago
Selected Answer: D
D is correct
upvoted 1 times

 
Egocentric
1 month, 1 week ago
D is correct
upvoted 1 times

 
SebK
2 months ago
Selected Answer: D
D is correct
upvoted 2 times

 
RalphLiang
2 months, 3 weeks ago
Selected Answer: D
D is correct
upvoted 1 times

 
NeerajKumar
2 months, 3 weeks ago
Selected Answer: D
Correct
upvoted 1 times

 
PallaviPatel
4 months ago
Selected Answer: D
Correct
upvoted 1 times


 
Skyrocket
4 months ago
D is correct
upvoted 1 times

 
VeroDon
4 months, 3 weeks ago
Selected Answer: D
Thats correct
upvoted 2 times

 
Mahesh_mm
5 months ago
D is correct
upvoted 1 times

 
alexleonvalencia
5 months, 2 weeks ago
The correct answer is D.
upvoted 1 times

 
rashjan
5 months, 2 weeks ago
Selected Answer: D
D is correct
upvoted 1 times

 
ohana
7 months ago
Took the exam today, this question came out.

Ans: D
upvoted 4 times

 
Sunnyb
11 months, 3 weeks ago
D is absolutely correct
upvoted 2 times


Question #7 Topic 1

HOTSPOT -

You need to output files from Azure Data Factory.

Which file format should you use for each type of output? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.

Hot Area:

Correct Answer:

Box 1: Parquet -

Parquet stores data in columns, while Avro stores data in a row-based format. By their very nature, column-oriented data stores are optimized
for read-heavy analytical workloads, while row-based databases are best for write-heavy transactional workloads.

Box 2: Avro -

An Avro schema is created using JSON format.

AVRO supports timestamps.

Note: Azure Data Factory supports the following file formats (not GZip or TXT).

✑ Avro format

✑ Binary format

✑ Delimited text format

✑ Excel format

✑ JSON format

✑ ORC format

✑ Parquet format


✑ XML format

Reference:

https://www.datanami.com/2018/05/16/big-data-file-formats-demystified

 
Mahesh_mm
Highly Voted 
5 months ago
Parquet and AVRO is correct option
upvoted 17 times

 
Dothy
Most Recent 
2 weeks ago
agree with the answer
upvoted 2 times

 
RalphLiang
2 months, 3 weeks ago
Parquet and AVRO is correct option
upvoted 2 times

 
PallaviPatel
4 months ago
correct
upvoted 1 times

 
Skyrocket
4 months ago
Parquet and AVRO is right.
upvoted 2 times

 
edba
5 months ago
The GZIP file format is one of the binary formats supported by ADF.

https://docs.microsoft.com/en-us/azure/data-factory/connector-file-system?tabs=data-factory#file-system-as-sink
upvoted 1 times

 
bad_atitude
5 months, 1 week ago
agree with the answer
upvoted 2 times

 
alexleonvalencia
5 months, 2 weeks ago
Correct answer: PARQUET & AVRO.
upvoted 1 times


Question #8 Topic 1

HOTSPOT -

You use Azure Data Factory to prepare data to be queried by Azure Synapse Analytics serverless SQL pools.

Files are initially ingested into an Azure Data Lake Storage Gen2 account as 10 small JSON files. Each file contains the same data attributes and
data from a subsidiary of your company.

You need to move the files to a different folder and transform the data to meet the following requirements:

✑ Provide the fastest possible query times.

✑ Automatically infer the schema from the underlying files.

How should you configure the Data Factory copy activity? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.

Hot Area:

Correct Answer:

Box 1: Preserve hierarchy -

Compared to the flat namespace on Blob storage, the hierarchical namespace greatly improves the performance of directory management
operations, which improves overall job performance.

Box 2: Parquet -

Azure Data Factory parquet format is supported for Azure Data Lake Storage Gen2.

Parquet supports the schema property.

Reference:

https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction https://docs.microsoft.com/en-us/azure/data-
factory/format-parquet

 
alain2
Highly Voted 
1 year ago
1. Merge Files

2. Parquet

https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-performance-tuning-guidance
upvoted 87 times


 
edba
5 months ago
just want to add a bit more reference regarding copyBehavior in ADF plus info mentioned in Best Practice doc, so it shall be MergeFile first.

https://docs.microsoft.com/en-us/azure/data-factory/connector-file-system?tabs=data-factory#file-system-as-sink
upvoted 3 times

 
kilowd
7 months, 1 week ago
Larger files lead to better performance and reduced costs.

Typically, analytics engines such as HDInsight have a per-file overhead that involves tasks such as listing, checking access, and performing
various metadata operations. If you store your data as many small files, this can negatively affect performance. In general, organize your data
into larger sized files for better performance (256 MB to 100 GB in size).
upvoted 3 times

 
Ameenymous
1 year ago
The smaller the files, the worse the performance, so Merge and Parquet seem to be the right answer.
upvoted 14 times

 
captainbee
Highly Voted 
10 months, 3 weeks ago
It's frustrating just how many questions ExamTopics get wrong. Can't be helpful
upvoted 26 times

 
RyuHayabusa
10 months, 1 week ago
At least it helps in learning, as you have to research and think for yourself. Another big point is that having these questions in the first place is immensely helpful
upvoted 30 times

 
SebK
1 month, 4 weeks ago
Agree.
upvoted 2 times

 
gssd4scoder
7 months ago
Trying to understand if an answer is correct will help learn more
upvoted 3 times

 
Dothy
Most Recent 
2 weeks ago
1. Merge Files

2. Parquet
upvoted 1 times

 
KashRaynardMorse
1 month, 1 week ago
A requirement was "Automatically infer the schema from the underlying files", meaning Preserve hierarchy is needed.
upvoted 2 times

 
gabdu
3 weeks, 3 days ago
It is possible that all or some schemas are different; in that case we cannot merge
upvoted 1 times

 
imomins
1 month, 3 weeks ago
Another key point is: you need to move the files to a different folder,

so the answer should be Preserve hierarchy.


upvoted 2 times

 
Eyepatch993
2 months ago
1. Preserve hierarchy - ADF is used only for processing and Synapse is the sink. Since Synapse has parallel processing power, it can process the files in different folders and thus improve performance.
2. Parquet
upvoted 1 times

 
kamil_k
2 months, 1 week ago
Are these answers the actual correct answers or guesses? Who highlights the correct answers?
upvoted 2 times

 
srakrn
4 months ago
"In general, we recommend that your system have some sort of process to aggregate small files into larger ones for use by downstream
applications."

https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-best-practices

Therefore I believe the answer should be 'Merge files' and 'Parquet'.


upvoted 3 times

 
Sandip4u
4 months, 2 weeks ago
Merge and parquet will be the right option , also taken reference from Udemy
upvoted 2 times


 
Mahesh_mm
5 months ago
1. As the hierarchical namespace greatly improves the performance of directory management operations, which improves overall job performance, Preserve hierarchy looks correct. Also, there is overhead for merging files.

2. Parquet
upvoted 3 times

 
Boompiee
2 weeks, 3 days ago
The overhead for merging happens once, after that it's faster every time to query the files if they are merged.
upvoted 1 times

 
m2shines
5 months, 1 week ago
Merge Files and Parquet
upvoted 1 times

 
AM1971
6 months, 2 weeks ago
Shouldn't a JSON file be flattened first? So I think the answer is: flatten and Parquet
upvoted 1 times

 
RinkiiiiiV
7 months, 3 weeks ago
1. Preserve hierarchy

2. Parquet
upvoted 1 times

 
noobplayer
7 months, 2 weeks ago
Is this correct?
upvoted 2 times

 
Marcus1612
8 months, 2 weeks ago
The files are copied/transformed from one folder to another inside the same hierarchical account. The hierarchical property is defined during account creation, so the destination folder still has the hierarchical namespace. On the other hand, as mentioned by Microsoft: Typically, analytics engines such as HDInsight and Azure Data Lake Analytics have a per-file overhead. If you store your data as many small files, this can negatively affect performance. In general, organize your data into larger sized files for better performance (256MB to 100GB in size).
upvoted 2 times

 
meetj
9 months ago
1. Merge for sure

https://docs.microsoft.com/en-us/azure/data-factory/connector-file-system clearly defines the available copy behaviors


upvoted 2 times

 
elimey
10 months, 1 week ago
1. Merge Files: because the question says the 10 small JSON files are moved to a different folder

2. Parquet
upvoted 5 times

 
Erte
11 months ago
Box 1: Preserve hierarchy

Compared to the flat namespace on Blob storage, the hierarchical namespace greatly improves the performance of directory management
operations, which

improves overall job performance.

Box 2: Parquet

Azure Data Factory parquet format is supported for Azure Data Lake Storage Gen2. Parquet supports the schema property.

Reference:

https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction https://docs.microsoft.com/en-us/azure/data-
factory/format-parquet
upvoted 2 times


Question #9 Topic 1

HOTSPOT -

You have a data model that you plan to implement in a data warehouse in Azure Synapse Analytics as shown in the following exhibit.

All the dimension tables will be less than 2 GB after compression, and the fact table will be approximately 6 TB. The dimension tables will be
relatively static with very few data inserts and updates.

Which type of table should you use for each table? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.

Hot Area:


Correct Answer:

Box 1: Replicated -

Replicated tables are ideal for small star-schema dimension tables, because the fact table is often distributed on a column that is not
compatible with the connected dimension tables. If this case applies to your schema, consider changing small dimension tables currently
implemented as round-robin to replicated.

Box 2: Replicated -

Box 3: Replicated -

Box 4: Hash-distributed -

For Fact tables use hash-distribution with clustered columnstore index. Performance improves when two hash tables are joined on the same
distribution column.
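
A minimal sketch of the two table patterns, assuming illustrative table and key names (the real ones come from the exhibit).

-- Small (< 2 GB), mostly static dimension: replicate a copy to every Compute node.
CREATE TABLE dbo.DimProduct
(
    [ProductKey]  INT           NOT NULL,
    [ProductName] NVARCHAR(100) NULL
)
WITH (DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX);

-- ~6 TB fact table: hash-distribute on a join key, with a clustered columnstore index.
CREATE TABLE dbo.FactSales
(
    [ProductKey]  INT            NOT NULL,
    [SalesAmount] DECIMAL(18, 2) NOT NULL
)
WITH (DISTRIBUTION = HASH([ProductKey]), CLUSTERED COLUMNSTORE INDEX);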

Reference:

https://azure.microsoft.com/en-us/updates/reduce-data-movement-and-make-your-queries-more-efficient-with-the-general-availability-of-
replicated-tables/ https://azure.microsoft.com/en-us/blog/replicated-tables-now-generally-available-in-azure-sql-data-warehouse/

 
ian_viana
Highly Voted 
8 months, 1 week ago
The answer is correct.

The dims are under 2 GB, so there is no point in using hash.

Common distribution methods for tables:

The table category often determines which option to choose for distributing the table.

Table category Recommended distribution option

Fact -Use hash-distribution with clustered columnstore index. Performance improves when two hash tables are joined on the same distribution
column.

Dimension - Use replicated for smaller tables. If tables are too large to store on each Compute node, use hash-distributed.

Staging - Use round-robin for the staging table. The load with CTAS is fast. Once the data is in the staging table, use INSERT...SELECT to move the
data to production tables.

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-overview#common-distribution-
methods-for-tables
upvoted 30 times

 
GameLift
7 months, 3 weeks ago
Thanks, but where in the question does it indicate that the fact table has a clustered columnstore index?
upvoted 3 times

 
berserksap
6 months, 4 weeks ago


Normally for big tables we use a clustered columnstore index for optimal performance and compression. Since the table mentioned here is in TBs, we can safely assume using this index is the best choice

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-index
upvoted 2 times

 
berserksap
6 months, 4 weeks ago
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-overview
upvoted 1 times

 
ohana
Highly Voted 
7 months ago
Took the exam today, this question came out.

Ans: All the Dim tables --> Replicated

Fact Tables --> Hash Distributed


upvoted 21 times

 
Dothy
Most Recent 
2 weeks ago
The answer is correct.
upvoted 1 times

 
PallaviPatel
4 months ago
correct answer
upvoted 2 times

 
Pritam85
4 months ago
Got this question on 23/12/2021... the answer is correct
upvoted 1 times

 
Mahesh_mm
5 months ago
Ans is correct
upvoted 2 times

 
alfonsodisalvo
6 months, 3 weeks ago
Dimension are Replicated :

"Since the table has multiple copies, replicated tables work best when the table size is less than 2 GB compressed."

"Replicated tables may not yield the best query performance when:

The table has frequent insert, update, and delete operations"

" We recommend using replicated tables instead of round-robin tables in most cases"

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/design-guidance-for-replicated-tables
upvoted 1 times

 
gssd4scoder
7 months ago
Correct: https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-overview#common-distribution-methods-for-tables
upvoted 1 times


Question #10 Topic 1

HOTSPOT -

You have an Azure Data Lake Storage Gen2 container.

Data is ingested into the container, and then transformed by a data integration application. The data is NOT modified after that. Users can read
files in the container but cannot modify the files.

You need to design a data archiving solution that meets the following requirements:

✑ New data is accessed frequently and must be available as quickly as possible.

✑ Data that is older than five years is accessed infrequently but must be available within one second when requested.

✑ Data that is older than seven years is NOT accessed. After seven years, the data must be persisted at the lowest cost possible.

✑ Costs must be minimized while maintaining the required availability.

How should you manage the data? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point

Hot Area:

Correct Answer:

Box 1: Move to cool storage -

Box 2: Move to archive storage -

Archive - Optimized for storing data that is rarely accessed and stored for at least 180 days with flexible latency requirements, on the order of
hours.

The following table shows a comparison of premium performance block blob storage, and the hot, cool, and archive access tiers.


Reference:

https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blob-storage-tiers

 
yobllip
Highly Voted 
11 months, 3 weeks ago
Answer should be

1 - Cool

2 - Archive

The comparison table shows that the access time (time to first byte) for the cool tier is milliseconds

https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blob-storage-tiers#comparing-block-blob-storage-options
upvoted 48 times

 
r00s
1 day, 16 hours ago
Right. #1 is Cool because it's clearly mentioned in the documentation that "Older data sets that are not used frequently, but are expected to be
available for immediate access"

https://docs.microsoft.com/en-us/azure/storage/blobs/access-tiers-overview#comparing-block-blob-storage-options
upvoted 1 times

 
Justbu
Highly Voted 
8 months, 1 week ago
Tricky question, it says data that is OLDER THAN (> 5 years), must be available within one second when requested

But the first question asks for Five-year-old data, which is =5, so it can also be hot storage

Similarly for the seven-year-old.

Not sure, please confirm?


upvoted 8 times

 
Dothy
Most Recent 
2 weeks ago
ans is correct
upvoted 1 times

 
PallaviPatel
4 months ago
ans is correct
upvoted 1 times

 
ANath
4 months, 3 weeks ago
1. Cool Storage

2. Archive Storage
upvoted 1 times

 
Mahesh_mm
5 months ago
Answer is correct
upvoted 1 times

 
ssitb
11 months, 4 weeks ago

Answer should be

1-hot

2-archive

https://www.bmc.com/blogs/cold-vs-hot-data-storage/

Cold storage data retrieval can take much longer than hot storage. It can take minutes to hours to access cold storage data
upvoted 2 times

 
marcin1212
5 months, 1 week ago
https://www.bmc.com/blogs/cold-vs-hot-data-storage/

It isn't about Azure !


upvoted 2 times

 
captainbee
11 months, 3 weeks ago
Cold storage takes milliseconds to retrieve
upvoted 5 times

 
syamkumar
11 months, 3 weeks ago
I also doubt whether it's hot storage and archive, because it's mentioned that 5-year-old data has to be retrieved within a second, which is not possible via cold storage
upvoted 1 times

 
savin
11 months, 1 week ago
But the cost factor is also there. Keeping the data in the hot tier for 5 years vs. the cool tier for 5 years would add a significant amount.
upvoted 1 times

 
DrC
12 months ago
Answer is correct
upvoted 8 times


Question #11 Topic 1

DRAG DROP -

You need to create a partitioned table in an Azure Synapse Analytics dedicated SQL pool.

How should you complete the Transact-SQL statement? To answer, drag the appropriate values to the correct targets. Each value may be used
once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content.

NOTE: Each correct selection is worth one point.

Select and Place:

Correct Answer:

Box 1: DISTRIBUTION -
Table distribution options include DISTRIBUTION = HASH ( distribution_column_name ), assigns each row to one distribution by hashing the
value stored in distribution_column_name.

Box 2: PARTITION -

Table partition options. Syntax:

PARTITION ( partition_column_name RANGE [ LEFT | RIGHT ] FOR VALUES ( [ boundary_value [,...n] ] ))
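
A hedged sketch of the completed statement; the table, column, and boundary values are placeholders (the comments below discuss boundaries of 1, 1000000, and 2000000 with RANGE LEFT).

CREATE TABLE dbo.FactSales
(
    [ID]          INT            NOT NULL,
    [SalesAmount] DECIMAL(18, 2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH([ID]),                                    -- Box 1: DISTRIBUTION
    PARTITION ([ID] RANGE LEFT FOR VALUES (1, 1000000, 2000000))  -- Box 2: PARTITION
);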

Reference:

https://docs.microsoft.com/en-us/sql/t-sql/statements/create-table-azure-sql-data-warehouse

 
Sunnyb
Highly Voted 
11 months, 3 weeks ago
Answer is correct
upvoted 43 times

 
Sasha_in_San_Francisco
Highly Voted 
6 months, 3 weeks ago
Correct answer, but how to remember it? The Distribution option comes before the Partition option because 'D' comes before 'P', or because the system needs to know the distribution algorithm (hash, round-robin, replicate) before it can start to partition or segment the data. (Seem reasonable?)
upvoted 26 times

 
Dothy
Most Recent 
2 weeks ago
Answer is correct
upvoted 1 times

 
Egocentric
1 month, 1 week ago
provided answer is correct
upvoted 1 times

 
vineet1234
1 month, 3 weeks ago
D comes before P as in DP-203
upvoted 3 times

 
PallaviPatel
4 months ago

correct
upvoted 1 times

 
Jaws1990
4 months, 3 weeks ago
Wouldn't VALUES(1, 1000000, 2000000) create a partition for records with ID <= 1, which would mean 1 row?
upvoted 1 times

 
ploer
3 months, 2 weeks ago
Having three boundaries will lead to four partitions:

1. Partition for values < 1

2. Partition for values from 1 to 999999

3. Partition for values from 1000000 to 1999999

4. Partition for values >= 2000000


upvoted 2 times

 
nastyaaa
3 months, 1 week ago
But it would be <= and >, since it is RANGE LEFT FOR VALUES, right?
upvoted 1 times

 
Mahesh_mm
5 months ago
Answer is correct
upvoted 1 times

 
hugoborda
8 months ago
Answer is correct
upvoted 1 times

 
hsetin
8 months, 3 weeks ago
Indeed! Answer is correct
upvoted 1 times


Question #12 Topic 1

You need to design an Azure Synapse Analytics dedicated SQL pool that meets the following requirements:

✑ Can return an employee record from a given point in time.

✑ Maintains the latest employee information.

✑ Minimizes query complexity.

How should you model the employee data?

A.
as a temporal table

B.
as a SQL graph table

C.
as a degenerate dimension table

D.
as a Type 2 slowly changing dimension (SCD) table

Correct Answer:
D

A Type 2 SCD supports versioning of dimension members. Often the source system doesn't store versions, so the data warehouse load process
detects and manages changes in a dimension table. In this case, the dimension table must use a surrogate key to provide a unique reference to
a version of the dimension member. It also includes columns that define the date range validity of the version (for example, StartDate and
EndDate) and possibly a flag column (for example,

IsCurrent) to easily filter by current dimension members.
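
A minimal sketch of a Type 2 SCD employee dimension, with illustrative column names and an illustrative distribution choice (nothing here comes from the question itself).

CREATE TABLE dbo.DimEmployee
(
    [EmployeeKey]  INT           NOT NULL,  -- surrogate key, one value per version (assigned by the load process)
    [EmployeeID]   INT           NOT NULL,  -- business key from the source system
    [EmployeeName] NVARCHAR(100) NOT NULL,
    [JobTitle]     NVARCHAR(100) NULL,
    [StartDate]    DATE          NOT NULL,  -- version validity start
    [EndDate]      DATE          NULL,      -- NULL for the current version
    [IsCurrent]    BIT           NOT NULL   -- quick filter for the latest record
)
WITH (DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX);

-- Point-in-time lookup with low query complexity:
SELECT *
FROM dbo.DimEmployee
WHERE [EmployeeID] = 42
  AND '2021-06-01' >= [StartDate]
  AND ([EndDate] IS NULL OR '2021-06-01' < [EndDate]);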

Reference:

https://docs.microsoft.com/en-us/learn/modules/populate-slowly-changing-dimensions-azure-synapse-analytics-pipelines/3-choose-between-
dimension-types

Community vote distribution


D (100%)

 
bc5468521
Highly Voted 
12 months ago
Answer D; a temporal table is better than SCD2, but it is not supported in Synapse yet
upvoted 47 times

 
sparkchu
2 months, 2 weeks ago
Though this is not something related to this question: temporal tables look similar to Delta tables.
upvoted 1 times

 
Preben
11 months, 3 weeks ago
Here's the documentation for how to implement temporal tables in Synapse from 2019.

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-temporary
upvoted 1 times

 
mbravo
11 months, 2 weeks ago
Temporal tables and Temporary tables are two very distinct concepts. Your link has absolutely nothing to do with this question.
upvoted 11 times

 
Vaishnav
10 months, 2 weeks ago
https://docs.microsoft.com/en-us/azure/azure-sql/temporal-tables

Answer : A Temporal Tables


upvoted 1 times

 
berserksap
6 months, 4 weeks ago
I think synapse doesn't support temporal tables. Please check the below comment by hsetin.
upvoted 1 times

 
rashjan
Highly Voted 
5 months, 2 weeks ago
Selected Answer: D
D is correct (voting in a comment so people don't always have to open the discussion; please upvote to help others)
upvoted 39 times

 
Dothy
Most Recent 
2 weeks ago
Answer is correct
upvoted 2 times

 
Martin_Nbg
1 month ago
Temporal tables are not supported in Synapse so D is correct.


upvoted 2 times

 
sparkchu
1 month, 3 weeks ago
overall, you should use delta table :@
upvoted 1 times

 
PallaviPatel
4 months ago
Selected Answer: D
correct
upvoted 1 times

 
Adelina
4 months, 2 weeks ago
Selected Answer: D
D is correct
upvoted 2 times

 
dev2dev
4 months, 2 weeks ago
Confusing highly voted comment. D is SCD2, but the comment is talking about a temporal table. Either way, SCD2 is the right answer, which is choice D
upvoted 1 times

 
VeroDon
4 months, 3 weeks ago
Selected Answer: D
Dedicated SQL Pools is the key
upvoted 3 times

 
Mahesh_mm
5 months ago
Answer is D
upvoted 1 times

 
hsetin
8 months, 3 weeks ago
Answer is D. Microsoft seems to have confirmed this.

https://docs.microsoft.com/en-us/answers/questions/130561/temporal-table-in-azure-synapse.html#:~:text=Unfortunately%2C%20we%20do%20not%20support,submitted%20by%20another%20Azure%20customer.
upvoted 3 times

 
dd1122
9 months, 2 weeks ago
Answer D is correct. Temporal tables mentioned in the link below are supported in Azure SQL Database (PaaS) and Azure SQL Managed Instance, whereas in this question dedicated SQL pools are mentioned, so no temporal tables can be used. SCD Type 2 is the answer.

https://docs.microsoft.com/en-us/azure/azure-sql/temporal-tables
upvoted 4 times

 
escoins
11 months ago
Definitively answer D
upvoted 3 times

 
[Removed]
11 months, 1 week ago
The answer is A - Temporal tables

"Temporal tables enable you to restore row versions from any point in time."

https://docs.microsoft.com/en-us/azure/azure-sql/database/business-continuity-high-availability-disaster-recover-hadr-overview
upvoted 1 times

 
Dileepvikram
11 months, 3 weeks ago
The requirement says that the table should store latest information, so the answer should be temporal table, right? Because scd type 2 will store
the complete history.
upvoted 1 times

 
captainbee
11 months, 3 weeks ago
Also needs to return employee information from a given point in time? Full history needed for that.
upvoted 12 times


Question #13 Topic 1

You have an enterprise-wide Azure Data Lake Storage Gen2 account. The data lake is accessible only through an Azure virtual network named
VNET1.

You are building a SQL pool in Azure Synapse that will use data from the data lake.

Your company has a sales team. All the members of the sales team are in an Azure Active Directory group named Sales. POSIX controls are used
to assign the

Sales group access to the files in the data lake.

You plan to load data to the SQL pool every hour.

You need to ensure that the SQL pool can load the sales data from the data lake.

Which three actions should you perform? Each correct answer presents part of the solution.

NOTE: Each area selection is worth one point.

A.
Add the managed identity to the Sales group.

B.
Use the managed identity as the credentials for the data load process.

C.
Create a shared access signature (SAS).

D.
Add your Azure Active Directory (Azure AD) account to the Sales group.

E.
Use the shared access signature (SAS) as the credentials for the data load process.

F.
Create a managed identity.

Correct Answer:
ABF

The managed identity grants permissions to the dedicated SQL pools in the workspace.

Note: Managed identity for Azure resources is a feature of Azure Active Directory. The feature provides Azure services with an automatically
managed identity in

Azure AD -

Reference:

https://docs.microsoft.com/en-us/azure/synapse-analytics/security/synapse-workspace-managed-identity
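
For illustration only, a minimal T-SQL sketch of step B, assuming the workspace managed identity already exists and has been added to the Sales group; the storage account (datalake1), container (sales), and staging table (dbo.StageSales) names are hypothetical:

COPY INTO dbo.StageSales
FROM 'https://datalake1.dfs.core.windows.net/sales/*.parquet'
WITH (
    FILE_TYPE  = 'PARQUET',
    CREDENTIAL = (IDENTITY = 'Managed Identity')  -- the SQL pool authenticates to the data lake as its managed identity
);

Because the managed identity is a member of the Sales group, the POSIX ACLs already assigned to that group apply to the hourly load without needing a SAS token.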

Community vote distribution


ABF (100%)

 
Diane
Highly Voted 
1 year ago
correct answer is ABF https://www.examtopics.com/discussions/microsoft/view/41207-exam-dp-200-topic-1-question-56-discussion/
upvoted 61 times

 
AvithK
9 months, 2 weeks ago
yes but the order is different it is FAB
upvoted 24 times

 
gssd4scoder
7 months ago
Exactly, agree with you
upvoted 1 times

 
KingIlo
9 months, 1 week ago
The question didn't specify order or sequence
upvoted 9 times

 
IDKol
Highly Voted 
10 months, 1 week ago
Correct Answer should be

F. Create a managed identity.

A. Add the managed identity to the Sales group.

B. Use the managed identity as the credentials for the data load process.
upvoted 20 times

 
Dothy
Most Recent 
2 weeks ago
correct answer is ABF
upvoted 1 times

 
Egocentric
1 month, 1 week ago
ABF is correct
upvoted 1 times


 
praticewizards
2 months ago
Selected Answer: ABF
FAB - create, add to group, use to load data
upvoted 1 times

 
Backy
4 months, 2 weeks ago
Is answer A properly worded?

"Add the managed identity to the Sales group" should be "Add the Sales group to managed identity"
upvoted 3 times

 
lukeonline
4 months, 3 weeks ago
Selected Answer: ABF
FAB should be correct
upvoted 4 times

 
VeroDon
4 months, 3 weeks ago
Selected Answer: ABF
FAB is correct sequence
upvoted 2 times

 
SabaJamal2010AtGmail
4 months, 4 weeks ago
1. Create a managed identity.

2. Add the managed identity to the Sales group.

3. Use the managed identity as the credentials for the data load process.
upvoted 2 times

 
Mahesh_mm
5 months ago
FAB is correct sequence
upvoted 1 times

 
Lewistrick
5 months ago
Would it even be a good idea to have the data load process be part of the Sales team? They have separate responsibilities, so should be part of
another group. I know that's not possible in the answer list, but I'm trying to think best practices here.
upvoted 2 times

 
Aslam208
6 months ago
Selected Answer: ABF
Correct answer is F, A, B
upvoted 6 times

 
FredNo
6 months, 1 week ago
Selected Answer: ABF
use managed identity
upvoted 5 times

 
ohana
7 months ago
Took the exam today. Similar question came out. Ans: ABF.

Use managed identity!


upvoted 3 times

 
Eniyan
8 months ago
It should be FAB. Please refer to the following reference: https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/quickstart-bulk-load-copy-tsql-examples
upvoted 3 times

 
AvithK
9 months, 2 weeks ago
I don't get why it doesn't start with F. The managed identity should be created first, right?
upvoted 2 times

 
Mazazino
7 months, 1 week ago
There's no mentioning of sequence. The question is just about the right steps
upvoted 2 times

 
MonemSnow
10 months, 3 weeks ago
A, C, F is the correct answer
upvoted 1 times


Question #14 Topic 1

HOTSPOT -

You have an Azure Synapse Analytics dedicated SQL pool that contains the users shown in the following table.

User1 executes a query on the database, and the query returns the results shown in the following exhibit.

User1 is the only user who has access to the unmasked data.

Use the drop-down menus to select the answer choice that completes each statement based on the information presented in the graphic.

NOTE: Each correct selection is worth one point.

Hot Area:


Correct Answer:

Box 1: 0 -

The YearlyIncome column is of the money data type.

The Default masking function: Full masking according to the data types of the designated fields

✑ Use a zero value for numeric data types (bigint, bit, decimal, int, money, numeric, smallint, smallmoney, tinyint, float, real).

Box 2: the values stored in the database

Users with administrator privileges are always excluded from masking, and see the original data without any mask.

Reference:

https://docs.microsoft.com/en-us/azure/azure-sql/database/dynamic-data-masking-overview
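
As a hedged illustration of the behavior above, the default mask is declared per column; the table name dbo.DimEmployee below is a placeholder, and only the YearlyIncome column comes from the exhibit:

-- Non-privileged users such as User2 see 0 for this money column
ALTER TABLE dbo.DimEmployee
ALTER COLUMN YearlyIncome ADD MASKED WITH (FUNCTION = 'default()');

-- Administrators such as User1 bypass the mask and see the values stored in the database;
-- other users would only see unmasked data if explicitly granted, e.g. GRANT UNMASK TO SomeUser;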

 
hsetin
Highly Voted 
8 months, 3 weeks ago
user 1 is admin, so he will see the value stored in dbms.

1. 0

2. Value in database
upvoted 48 times

 
azurearmy
7 months ago
2 is wrong
upvoted 2 times

 
rjile
Highly Voted 
9 months ago
• Use a zero value for numeric data types (bigint, bit, decimal, int, money, numeric, smallint, smallmoney, tinyint, float, real).

• Use 01-01-1900 for date/time data types (date, datetime2, datetime, datetimeoffset, smalldatetime, time).
upvoted 13 times

 
berserksap
6 months, 4 weeks ago
The second question is queried by User 1 who is the admin
upvoted 13 times

 
Dothy
Most Recent 
2 weeks ago
1. 0

2. Value in database
upvoted 1 times

 
Egocentric
1 month, 1 week ago
on this question its just about paying attention to detail
upvoted 3 times

 
manan16
1 month, 2 weeks ago
How can user2 access the data if it is masked?
upvoted 1 times

 
manan16
1 month, 2 weeks ago
Can Someone explain first option as in doc it says 0
upvoted 1 times

 
Mahesh_mm
5 months ago

1. 0 (the default mask for the money data type will be returned when the column is queried by user2)

2. Value in database ( As it is queried by user1 who is admin )


upvoted 6 times

 
Milan1988
7 months ago
CORRECT
upvoted 2 times

 
gssd4scoder
7 months ago
Agree with answer, but I see a typo in the question db_datereader MUST be db_datareader.
upvoted 3 times

 
Jiddu
7 months, 3 weeks ago
0 for money and 1/1/1900 for dates

https://docs.microsoft.com/en-us/sql/relational-databases/security/dynamic-data-masking?view=sql-server-ver15
upvoted 4 times

 
GervasioMontaNelas
8 months, 3 weeks ago
Its correct
upvoted 2 times

 
rjile
9 months ago
correct?
upvoted 2 times

 
Mazazino
7 months, 1 week ago
yes, it's correct
upvoted 2 times


Question #15 Topic 1

You have an enterprise data warehouse in Azure Synapse Analytics.

Using PolyBase, you create an external table named [Ext].[Items] to query Parquet files stored in Azure Data Lake Storage Gen2 without importing
the data to the data warehouse.

The external table has three columns.

You discover that the Parquet files have a fourth column named ItemID.

Which command should you run to add the ItemID column to the external table?

A.

B.

C.

D.

Correct Answer:
C

Incorrect Answers:

A, D: Only these Data Definition Language (DDL) statements are allowed on external tables:

✑ CREATE TABLE and DROP TABLE

✑ CREATE STATISTICS and DROP STATISTICS

✑ CREATE VIEW and DROP VIEW

Reference:

https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-table-transact-sql
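
Because ALTER is not among the allowed DDL statements, option C drops and recreates the external table with the extra column, roughly as sketched below; the existing column names, location, data source, and file format names are placeholders rather than values from the question:

DROP EXTERNAL TABLE [Ext].[Items];

CREATE EXTERNAL TABLE [Ext].[Items]
(
    [Col1]   NVARCHAR(100),   -- placeholders for the three existing columns
    [Col2]   NVARCHAR(100),
    [Col3]   DECIMAL(18, 2),
    [ItemID] INT              -- the newly discovered fourth column
)
WITH
(
    LOCATION    = '/items/',          -- assumed folder in Data Lake Storage Gen2
    DATA_SOURCE = MyAdlsDataSource,   -- assumed existing external data source
    FILE_FORMAT = MyParquetFormat     -- assumed existing Parquet file format
);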

 
Chien_Nguyen_Van
Highly Voted 
8 months, 3 weeks ago
C is correct

https://www.examtopics.com/discussions/microsoft/view/19469-exam-dp-200-topic-1-question-27-discussion/
upvoted 29 times

 
Ozren
Most Recent 
2 months, 1 week ago
Good thing the details are shown here: "The external table has three columns." And the solution yet reveals the column details. This doesn't make
any sense to me. If C is the correct answer (only one that seems acceptable), then the question itself is flawed.
upvoted 2 times

 
PallaviPatel
3 months, 4 weeks ago
c is correct.
upvoted 1 times

 
hugoborda
8 months ago
Answer is correct
upvoted 2 times


Question #16 Topic 1

HOTSPOT -

You have two Azure Storage accounts named Storage1 and Storage2. Each account holds one container and has the hierarchical namespace
enabled. The system has files that contain data stored in the Apache Parquet format.

You need to copy folders and files from Storage1 to Storage2 by using a Data Factory copy activity. The solution must meet the following
requirements:

✑ No transformations must be performed.

✑ The original folder structure must be retained.

✑ Minimize time required to perform the copy activity.

How should you configure the copy activity? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.

Hot Area:

Correct Answer:

Box 1: Parquet -

For Parquet datasets, the type property of the copy activity source must be set to ParquetSource.

Box 2: PreserveHierarchy -

PreserveHierarchy (default): Preserves the file hierarchy in the target folder. The relative path of the source file to the source folder is identical
to the relative path of the target file to the target folder.

Incorrect Answers:

✑ FlattenHierarchy: All files from the source folder are in the first level of the target folder. The target files have autogenerated names.

✑ MergeFiles: Merges all files from the source folder to one file. If the file name is specified, the merged file name is the specified name.
Otherwise, it's an autogenerated file name.

Reference:

https://docs.microsoft.com/en-us/azure/data-factory/format-parquet https://docs.microsoft.com/en-us/azure/data-factory/connector-azure-
data-lake-storage

 
EddyRoboto
Highly Voted 
9 months ago
This could be binary as source and sink, since there are no transformations on files. I tend to believe that would be binary the correct anwer.


upvoted 43 times

 
michalS
8 months, 3 weeks ago
I agree. If it's just copying then binary is fine and would probably be faster
upvoted 6 times

 
iooj
3 months, 1 week ago
Agree. I've checked it. With binary source and sink datasets it works.
upvoted 2 times

 
rav009
8 months ago
agree. When using Binary dataset, the service does not parse file content but treat it as-is.

Not parsing the file will save the time. (https://docs.microsoft.com/en-us/azure/data-factory/format-binary)


So Binary!
upvoted 8 times

 
GameLift
7 months, 1 week ago
But the doc says "When using Binary dataset in copy activity, you can only copy from Binary dataset to Binary dataset." So I guess it's parquet
then?
upvoted 3 times

 
captainpike
7 months, 1 week ago
This note is referring to the fact that, in the template, you have to specify "BinarySink" as the type for the target sink, and that is exactly what the Copy Data tool does (you can check this by editing the created copy pipeline and viewing the code). Choosing Binary and PreserveHierarchy copies all the files exactly as they are.
upvoted 3 times

 
AbhiGola
Highly Voted 
8 months, 3 weeks ago
The answer seems correct: the data is already stored as Parquet and the requirement is to perform no transformation, so the answer is right.
upvoted 34 times

 
NintyFour
1 week, 2 days ago
As question has mentioned, Minimize time required to perform the copy activity.

And binary is faster than Parquet. Hence, Binary is answer


upvoted 1 times

 
NintyFour
Most Recent 
1 week, 2 days ago
As question has mentioned, Minimize time required to perform the copy activity.

And binary is faster than Parquet. Hence, Binary is answer


upvoted 1 times

 
AzureRan
2 weeks ago
Is it binary or parquet?
upvoted 1 times

 
DingDongSingSong
2 months ago
So what is the answer to this question? Binary or Parquet? The file is a ParquetFile. If you're simply copying a file, you just need to define the right
source type (i.e. Parquet) in this instance. Why would you even consider Binary when the file isn't Binary type
upvoted 2 times

 
kamil_k
2 months, 1 week ago
I've just tested it in Azure, created two Gen2 storage accounts, used Binary as source and destination, placed two parquet files in account one.
Created pipeline in ADF, added copy data activity and then defined first binary as source with wildcard path (*.parquet) and the sink as binary, with
linked service for account 2, selected PreserveHierarchy. It worked.
upvoted 6 times

 
AnshulSuryawanshi
2 months, 3 weeks ago
When using Binary dataset in copy activity, you can only copy from Binary dataset to Binary dataset
upvoted 1 times

 
Sandip4u
4 months, 2 weeks ago
this should be binary
upvoted 1 times

 
VeroDon
4 months, 3 weeks ago
The type property of the dataset must be set to Parquet

https://docs.microsoft.com/en-us/azure/data-factory/format-parquet#parquet-as-source
upvoted 1 times

 
Mahesh_mm
5 months ago
I think it is Parquet as When using Binary dataset in copy activity, you can only copy from Binary dataset to Binary dataset.
upvoted 2 times

 
Canary_2021
5 months, 1 week ago
If you only copy over files from one storage to another, don't need to read data inside the file, binary should be selected for better performance.


upvoted 5 times

 
m2shines
5 months, 1 week ago
Binary and Preserve Hierarchy should be the answer
upvoted 4 times

 
Lucky_me
5 months, 1 week ago
The answers are correct! Binary doesn't work; I just tried.
upvoted 5 times

 
kamil_k
2 months, 1 week ago
hmm what did you try? I literally created it the same way as described i.e. two gen2 storage accounts. I chose gen2 as source linked service with
binary as file type and the same for destination. In the copy data activity in ADF pipeline I specified preserve hierarchy and it worked as
expected.
upvoted 1 times

 
Ozzypoppe
5 months, 2 weeks ago
https://docs.microsoft.com/en-us/azure/data-factory/format-parquet#parquet-as-source
upvoted 2 times

 
medsimus
7 months, 2 weeks ago
The correct answer is Binary , I test it
upvoted 8 times


Question #17 Topic 1

You have an Azure Data Lake Storage Gen2 container that contains 100 TB of data.

You need to ensure that the data in the container is available for read workloads in a secondary region if an outage occurs in the primary region.
The solution must minimize costs.

Which type of data redundancy should you use?

A.
geo-redundant storage (GRS)

B.
read-access geo-redundant storage (RA-GRS)

C.
zone-redundant storage (ZRS)

D.
locally-redundant storage (LRS)

Correct Answer:
B

Geo-redundant storage (with GRS or GZRS) replicates your data to another physical location in the secondary region to protect against regional
outages.

However, that data is available to be read only if the customer or Microsoft initiates a failover from the primary to secondary region. When you
enable read access to the secondary region, your data is available to be read at all times, including in a situation where the primary region
becomes unavailable.

Incorrect Answers:

A: While Geo-redundant storage (GRS) is cheaper than Read-Access Geo-Redundant Storage (RA-GRS), GRS does NOT initiate automatic
failover.

C, D: Locally redundant storage (LRS) and Zone-redundant storage (ZRS) provides redundancy within a single region.

Reference:

https://docs.microsoft.com/en-us/azure/storage/common/storage-redundancy

Community vote distribution


A (65%) B (35%)

 
meetj
Highly Voted 
9 months ago
B is right

Geo-redundant storage (with GRS or GZRS) replicates your data to another physical location in the secondary region to protect against regional
outages. However, that data is available to be read only if the customer or Microsoft initiates a failover from the primary to secondary region.
When you enable read access to the secondary region, your data is available to be read at all times, including in a situation where the primary
region becomes unavailable.
upvoted 53 times

 
dev2dev
4 months, 2 weeks ago
A looks like the correct answer. RA-GRS is always available because of its auto failover, but that is not asked in the question; more importantly, the question is about reducing cost, which points to GRS.
upvoted 13 times

 
BK10
3 months, 1 week ago
It should be A because of two reasons:

1. Minimize cost

2. When primary is unavailable.

Hence No need for RA_GRS


upvoted 7 times

 
Sasha_in_San_Francisco
Highly Voted 
6 months, 3 weeks ago
In my opinion, I believe the answer is A, and this is why.

In the question they state "...available for read workloads in a secondary region IF AN OUTAGE OCCURES in the primary...". Well, answer B (RA-GRS)
states in Microsoft documentation that RA-GRS is for when "...your data is available to be read AT ALL TIMES, including in a situation where the
primary region becomes unavailable."

To me, the nature of the question is what is the cheapest solution which allows for failover to read workload, when there is an outage. Answer (A).

Common sense would be 'A' too because that is probably the most often real-life use case.
upvoted 40 times

 
SabaJamal2010AtGmail
4 months, 4 weeks ago
It's not about common sense rather about technology. With GRS, data remains available even if an entire data center becomes unavailable or if
there is a widespread regional failure. There would be a down time when a region becomes unavailable. Alternately, you could implement read-
access geo-redundant storage (RA-GRS), which provides read-access to the data in alternate locations.
upvoted 2 times


 
prathamesh1996
Most Recent 
6 days, 16 hours ago
A is correct: it minimizes cost, and read access is needed only when the primary is unavailable.
upvoted 1 times

 
Andushi
4 weeks, 1 day ago
Selected Answer: A
A because of costs aspect
upvoted 2 times

 
muove
1 month, 1 week ago
A is correct because of cost: RA-GRS will cost $5,910.73, GRS will cost $4,596.12.
upvoted 2 times

 
Egocentric
1 month, 2 weeks ago
GRS is the correct answer,the key in the question is reducing costs
upvoted 1 times

 
Somesh512
1 month, 2 weeks ago
Selected Answer: A
To reduce cost GRS should be right option
upvoted 1 times

 
KosteK
1 month, 4 weeks ago
Selected Answer: A
GRS is cheaper
upvoted 1 times

 
praticewizards
2 months ago
Selected Answer: B
The explanation is right. The given answer is wrong
upvoted 1 times

 
DingDongSingSong
2 months ago
B is incorrect. The answer is A. GRS is cheaper than RA-GRS. GRS read access is available ONLY once primary region failover occurs (therefore lower
cost). The requirement is for read-access availability in secondary region at lower cost WHEN a failover occurs in primary. Therefore, A is the answer
upvoted 2 times

 
phdphd
2 months, 1 week ago
Selected Answer: A
Got this question on the exam. RA-GRS was not an option, so it should be A.
upvoted 10 times

 
vineet1234
2 months, 2 weeks ago
A is right. GRS means secondary is available ONLY when primary is down. And it is cheaper than RA-GRS (where secondary read access is always
available). The question sneaks in the word 'read workloads' just to confuse.
upvoted 2 times

 
Sgarima
2 months, 3 weeks ago
Selected Answer: B
B is correct.

Geo-redundant storage (with GRS or GZRS) replicates your data to another physical location in the secondary region to protect against regional
outages. However, that data is available to be read only if the customer or Microsoft initiates a failover from the primary to secondary region.
When you enable read access to the secondary region, your data is available to be read at all times, including in a situation where the primary
region becomes unavailable. For read access to the secondary region, enable read-access geo-redundant storage (RA-GRS) or read-access geo-
zone-redundant storage (RA-GZRS).
upvoted 1 times

 
NamitSehgal
3 months ago
A should be the answer as we need data in read only secondary only when something happens at region A, not always.
upvoted 2 times

 
MANESH_PAI
3 months, 1 week ago
Selected Answer: A
It is GRS because GRS is cheaper than RA-GRS

https://azure.microsoft.com/en-gb/pricing/details/storage/blobs/
upvoted 3 times

 
PallaviPatel
3 months, 4 weeks ago
Selected Answer: B
correct
upvoted 2 times

 
Tinaaaaaaa
4 months ago

Selected Answer: B
While Geo-redundant storage (GRS) is cheaper than Read-Access Geo-Redundant Storage (RA-GRS), GRS does NOT initiate automatic failover.
upvoted 2 times


Question #18 Topic 1

You plan to implement an Azure Data Lake Gen 2 storage account.

You need to ensure that the data lake will remain available if a data center fails in the primary Azure region. The solution must minimize costs.

Which type of replication should you use for the storage account?

A.
geo-redundant storage (GRS)

B.
geo-zone-redundant storage (GZRS)

C.
locally-redundant storage (LRS)

D.
zone-redundant storage (ZRS)

Correct Answer:
D

Zone-redundant storage (ZRS) copies your data synchronously across three Azure availability zones in the primary region.

Incorrect Answers:

C: Locally redundant storage (LRS) copies your data synchronously three times within a single physical location in the primary region. LRS is the
least expensive replication option, but is not recommended for applications requiring high availability or durability

Reference:

https://docs.microsoft.com/en-us/azure/storage/common/storage-redundancy

Community vote distribution


D (98%)

 
JohnMasipa
Highly Voted 
9 months ago
This can't be correct. Should be D.
upvoted 70 times

 
JayBird
9 months ago
Why, LRS is cheaper?
upvoted 1 times

 
Vitality
8 months, 2 weeks ago
It is cheaper but LRS helps to replicate data in the same data center while ZRS replicates data synchronously across three storage clusters in
one region. So if one data center fails you should go for ZRS.
upvoted 8 times

 
azurearmy
7 months ago
Also, note that the question talks about failure in "a data center". As long as other data centers are running fine(as in ZRS which will have
many), ZRS would be the least expensive option.
upvoted 6 times

 
MadEgg
Highly Voted 
4 months, 3 weeks ago
Selected Answer: D
First, about the Question:

What fails? -> The (complete) DataCenter, not the region and not components inside a DataCenter.

So, what helps us in this situation?

LRS: "..copies your data synchronously three times within a single physical location in the primary region." Important is here the SINGLE PHYSICAL
LOCATION (meaning inside the same Data Center. So in our scenario all copies wouldn't work anymore.)

-> C is wrong.

ZRS: "...copies your data synchronously across three Azure availability zones in the primary region" (meaning, in different Data Centers. In our
scenario this would meet the requirements)

-> D is right

GRS/GZRS: are like LRS/ZRS but with the Data Centers in different azure regions. This works too but is more expensive than ZRS. So ZRS is the right
answer.

https://docs.microsoft.com/en-us/azure/storage/common/storage-redundancy
upvoted 30 times

 
Ozren
2 months, 1 week ago
Yes, well said, that's the correct answer.
upvoted 1 times

 
Narasimhap
3 months, 1 week ago
Well explained!
upvoted 1 times


 
DrTaz
4 months, 3 weeks ago
I agree.

Please give this comment a medal (or a cookie).


upvoted 3 times

 
olavrab8
Most Recent 
2 weeks, 2 days ago
Selected Answer: D
D -> Data is replicated synchronously
upvoted 1 times

 
Egocentric
1 month, 1 week ago
D is correct
upvoted 2 times

 
ravi2931
1 month, 2 weeks ago
it should be D
upvoted 1 times

 
ravi2931
1 month, 2 weeks ago
see this explained clearly -

LRS is the lowest-cost redundancy option and offers the least durability compared to other options. LRS protects your data against server rack
and drive failures. However, if a disaster such as fire or flooding occurs within the data center, all replicas of a storage account using LRS may be
lost or unrecoverable. To mitigate this risk, Microsoft recommends using zone-redundant storage (ZRS), geo-redundant storage (GRS), or geo-
zone-redundant storage (GZRS)
upvoted 1 times

 
ASG1205
1 month, 2 weeks ago
Selected Answer: D
Answer should be D, as LRS won't be helpfull in case of whole datacenter failure.
upvoted 1 times

 
Andy91
2 months ago
Selected Answer: D
This is the correct answer indeed
upvoted 2 times

 
bhanuprasad9331
2 months, 4 weeks ago
Selected Answer: C
Answer is LRS.

From microsoft docs:

LRS replicates data in a single AZ. An AZ can contain one or more data centers. So, even if one data center fails, data can be accessed through
other data centers in the same AZ.

https://docs.microsoft.com/en-us/azure/availability-zones/az-overview#availability-zones

https://docs.microsoft.com/en-us/azure/storage/common/storage-redundancy#redundancy-in-the-primary-region
upvoted 1 times

 
PallaviPatel
3 months, 4 weeks ago
Selected Answer: D
D is correct.
upvoted 3 times

 
vimalnits
4 months ago
Correct answer is D.
upvoted 2 times

 
Tinaaaaaaa
4 months ago
LRS helps to replicate data in the same data center while ZRS replicates data synchronously across three storage clusters in one region
upvoted 1 times

 
Shatheesh
4 months ago
D is the correct answer. The question clearly states that if a data center fails the data should still be available. LRS stores everything in the same data center, so it's not the correct answer; the next cheapest option is ZRS.
upvoted 1 times

 
Jaws1990
4 months, 3 weeks ago
Selected Answer: D
Mentions data centre (Availability Zone) failure, not rack failure, so should be Zone Redundant Storage.
upvoted 3 times

 
DrTaz
4 months, 3 weeks ago
Selected Answer: D


note that the "data centre fails"


upvoted 2 times

 
VeroDon
4 months, 3 weeks ago
After reading all the comments ill go with LRS. it doesn't mention a disaster. "LRS protects your data against server rack and drive failures"
https://docs.microsoft.com/en-us/azure/storage/common/storage-redundancy.
upvoted 1 times

 
ArunMonika
4 months, 4 weeks ago
I will go with D
upvoted 1 times

 
Mahesh_mm
5 months ago
Answer is D
upvoted 1 times


Question #19 Topic 1

HOTSPOT -

You have a SQL pool in Azure Synapse.

You plan to load data from Azure Blob storage to a staging table. Approximately 1 million rows of data will be loaded daily. The table will be
truncated before each daily load.
You need to create the staging table. The solution must minimize how long it takes to load the data to the staging table.

How should you configure the table? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.

Hot Area:

Correct Answer:

Box 1: Hash -

Hash-distributed tables improve query performance on large fact tables. They can have very large numbers of rows and still achieve high
performance.

Incorrect Answers:

Round-robin tables are useful for improving loading speed.

Box 2: Clustered columnstore -

When creating partitions on clustered columnstore tables, it is important to consider how many rows belong to each partition. For optimal
compression and performance of clustered columnstore tables, a minimum of 1 million rows per distribution and partition is needed.

Box 3: Date -

Table partitions enable you to divide your data into smaller groups of data. In most cases, table partitions are created on a date column.

Partition switching can be used to quickly remove or replace a section of a table.


Reference:

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-partition
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute
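
For comparison, the load-optimized staging pattern that several commenters below argue for (round-robin distribution, a heap, and no partitions) would look roughly like this sketch; the table and column definitions are placeholders:

CREATE TABLE dbo.Stage_DailyLoad
(
    LoadDate   DATE,
    CustomerId INT,
    Amount     DECIMAL(18, 2)
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,   -- fastest distribution option for loading
    HEAP                          -- no index maintenance during the daily load
);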

 
A1000
Highly Voted 
9 months ago
Round-Robin

Heap

None
upvoted 156 times

 
Narasimhap
3 months, 1 week ago
Round- Robin

Heap

None.
No brainer for this question.
upvoted 3 times

 
anto69
4 months, 2 weeks ago
I agree too
upvoted 2 times

 
gssd4scoder
7 months ago
Agree 100%.

All in paragraphs under this: https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-overview.


upvoted 5 times

 
DrTaz
4 months, 3 weeks ago
Also agree 100%
upvoted 2 times

 
laszek
Highly Voted 
8 months, 4 weeks ago
Round-robin - this is the simplest distribution model, not great for querying but fast to process

Heap - no brainer when creating staging tables

No partitions - this is a staging table, why add effort to partition, when truncated daily?
upvoted 29 times

 
Vardhan_Brahmanapally
6 months, 3 weeks ago
Can you explain me why should we use heap?
upvoted 1 times

 
DrTaz
4 months, 3 weeks ago
The term heap basically refers to a table without a clustered index. Adding a clustered index to a temp table makes absolutely no sense and
is a waste of compute resources for a table that would be entirely truncated daily.

no clustered index = heap.


upvoted 3 times

 
SQLDev0000
3 months ago
DrTaz is right, in addition, when you populate an indexed table, you are also writing to the index, so this adds an additional overhead in
the write process
upvoted 2 times

 
berserksap
6 months, 4 weeks ago
Had doubts regarding why there is no need for a partition. While what you suggested is true won't it be better if there is a date partition to
truncate the table ?
upvoted 1 times

 
andy_g
3 months, 1 week ago
There is no filter on a truncate statement so no benefit in having a partition
upvoted 1 times

 
SandipSingha
Most Recent 
2 weeks, 3 days ago
Round-Robin

Heap

None
upvoted 1 times

 
Sandip4u
4 months, 2 weeks ago
Round-robin,heap,none
upvoted 2 times

 
Mahesh_mm
4 months, 4 weeks ago
Round-Robin

Heap


None
upvoted 2 times

 
ArunMonika
4 months, 4 weeks ago
Answer: Round-Robin (1), Heap (2), None (3).
upvoted 1 times

 
m2shines
5 months, 1 week ago
Round-robin, Heap and None
upvoted 1 times

 
Sasha_in_San_Francisco
6 months, 3 weeks ago
Answer: Round-Robin (1), Heap (2), None (3).

Within this doc:

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-overview

#1. Search for “Use round-robin for the staging table.”

#2. Search for: “A heap table can be especially useful for loading data, such as a staging table,…”

Within this doc:

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-partition?context=/azure/synapse-
analytics/context/context
#3. Partitioning by date is useful when stage destination has data because you can hide the inserting data’s new partition (to keep users from
hitting it), complete the load and then unhide the new partition.

However, in this question it states, “the table will be truncated before each daily load”, so, it appears it’s a true Staging table and there are no users
with access, no existing data, and I see no reason to have a Date partition. To me, such a partition would do nothing but slow the load.
upvoted 12 times

 
Sasha_in_San_Francisco
6 months, 3 weeks ago
Answer: Round-Robin (1), Heap (2), None (3).
upvoted 1 times

 
Aslam208
6 months, 3 weeks ago
Round-Robin, Heap, None.

A polite request to the moderator, please verify these answers and correct. For some people, wrong answers will be detrimental.
upvoted 6 times

 
Vardhan_Brahmanapally
7 months ago
Many of the answers provided in this website are incorrect
upvoted 5 times

 
dJeePe
3 weeks, 5 days ago
Did MS hack this site to make it give wrong answers ? ;-)
upvoted 1 times

 
itacshish
7 months, 1 week ago
Round-Robin

Heap

None
upvoted 2 times

 
HaliBrickclay
7 months, 1 week ago
as per Microsoft document

Load to a staging table

To achieve the fastest loading speed for moving data into a data warehouse table, load data into a staging table. Define the staging table as a heap
and use round-robin for the distribution option.

Consider that loading is usually a two-step process in which you first load to a staging table and then insert the data into a production data
warehouse table. If the production table uses a hash distribution, the total time to load and insert might be faster if you define the staging table
with the hash distribution. Loading to the staging table takes longer, but the second step of inserting the rows to the production table does not
incur data movement across the distributions.
upvoted 4 times

 
VeroDon
4 months, 3 weeks ago
It doesn't mention the prd table. Only the staging. So, round Robin/Heap is the answer, correct? tricky questions.

:)
upvoted 1 times

 
estrelle2008
7 months, 2 weeks ago
Please correct the answers ExamTopics, as Microsoft itself recently published best practices on data loading in Synapse, and describes staging as
100% FAB answers is correct instead of ADF. https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/data-loading-best-practices
upvoted 2 times

 
RinkiiiiiV
7 months, 3 weeks ago
Round-Robin

Heap

None
upvoted 1 times


 
hugoborda
8 months ago
Round-Robin

Heap

None
upvoted 2 times

 
hsetin
8 months, 2 weeks ago
Why heap and not CCI?
upvoted 1 times


Question #20 Topic 1

You are designing a fact table named FactPurchase in an Azure Synapse Analytics dedicated SQL pool. The table contains purchases from
suppliers for a retail store. FactPurchase will contain the following columns.

FactPurchase will have 1 million rows of data added daily and will contain three years of data.

Transact-SQL queries similar to the following query will be executed daily.

SELECT -

SupplierKey, StockItemKey, IsOrderFinalized, COUNT(*)

FROM FactPurchase -

WHERE DateKey >= 20210101 -

AND DateKey <= 20210131 -

GROUP By SupplierKey, StockItemKey, IsOrderFinalized

Which table distribution will minimize query times?

A.
replicated

B.
hash-distributed on PurchaseKey

C.
round-robin

D.
hash-distributed on IsOrderFinalized

Correct Answer:
B

Hash-distributed tables improve query performance on large fact tables.

To balance the parallel processing, select a distribution column that:

✑ Has many unique values. The column can have duplicate values. All rows with the same value are assigned to the same distribution. Since
there are 60 distributions, some distributions can have > 1 unique values while others may end with zero values.

✑ Does not have NULLs, or has only a few NULLs.

✑ Is not a date column.

Incorrect Answers:

C: Round-robin tables are useful for improving loading speed.

Reference:

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute
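
A minimal sketch of option B; the column data types are assumptions, since the original column list is not reproduced here:

CREATE TABLE dbo.FactPurchase
(
    PurchaseKey      BIGINT NOT NULL,
    DateKey          INT    NOT NULL,
    SupplierKey      INT    NOT NULL,
    StockItemKey     INT    NOT NULL,
    IsOrderFinalized BIT    NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(PurchaseKey),   -- many unique values, no NULLs, not a date column
    CLUSTERED COLUMNSTORE INDEX
);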

Community vote distribution


B (86%) 7%

 
FredNo
Highly Voted 
6 months ago
Selected Answer: B
Correct
upvoted 18 times

 
GameLift
Highly Voted 
8 months, 2 weeks ago


Is it hash-distributed on PurchaseKey and not on IsOrderFinalized because 'IsOrderFinalized' yields less distributions(rows either contain yes,no
values) compared to PurchaseKey?
upvoted 7 times

 
Podavenna
8 months, 2 weeks ago
Yes, your logic is correct!
upvoted 4 times

 
Dothy
Most Recent 
2 weeks ago
B Correct
upvoted 1 times

 
SandipSingha
2 weeks, 3 days ago
B Correct
upvoted 1 times

 
sarapaisley
1 month, 2 weeks ago
Selected Answer: B
B is correct
upvoted 1 times

 
Anshul2910
2 months, 2 weeks ago
Selected Answer: B
CORRECT
upvoted 1 times

 
Istiaque
3 months, 2 weeks ago
Selected Answer: B
A round-robin distributed table distributes table rows evenly across all distributions. The assignment of rows to distributions is random. Unlike
hash-distributed tables, rows with equal values are not guaranteed to be assigned to the same distribution.

As a result, the system sometimes needs to invoke a data movement operation to better organize your data before it can resolve a query. This extra
step can slow down your queries.
upvoted 2 times

 
PallaviPatel
3 months, 4 weeks ago
Selected Answer: C
The options do not have the correct key selected for hash distribution, and query performance will improve only if the correct distribution column is selected. Also, the question says 1 million rows, but how much data that actually translates to in GB is unclear; the data types are mostly int, which aren't bulky. Hence I would go for round-robin instead of hash distribution.
upvoted 2 times

 
vineet1234
2 months, 2 weeks ago
Incorrect.. 1 million rows added per day. And the table has 3 years of data. So it's a large fact table. So Hash distributed. On purchase key (not
on IsOrderFinalized, as it's very low cardinality)
upvoted 2 times

 
Canary_2021
4 months, 3 weeks ago
Selected Answer: D
Hash field should be used in join, group by, having. SupplierKey, StockItemKey, IsOrderFinalized are group by fields. PurchaseKey doesn’t exist in
the query, why select PurchaseKey as hash key?

I select D. IsOrderFinalized may only provide 2 partitions, not as good as suppliekey and stockitemkey, but at least it is a group by column.
upvoted 2 times

 
Canary_2021
4 months, 3 weeks ago
To balance the parallel processing, select a distribution column that:

* Has many unique values.

* Does not have NULLs, or has only a few NULLs.

* Is not a date column.

Based on these descriptions, maybe B is the right answer. But PurchaseKey is not part of the query, so does it still improve the performance of this specific query?
upvoted 4 times

 
Mahesh_mm
4 months, 4 weeks ago
B is correct
upvoted 2 times

 
kahei
5 months, 2 weeks ago
Selected Answer: B
upvoted 1 times

 
alexleonvalencia
5 months, 2 weeks ago
Selected Answer: B


B is the correct answer.
upvoted 2 times

 
stuard
7 months, 2 weeks ago
hash-distributed on PurchaseKey and round-robin are going to provide the same result (in a case PurchaseKey has even distribution) for the query
as this specific query does not use PurchaseKey. However, round-robin is going to provide a slightly faster loading time.
upvoted 6 times

 
RinkiiiiiV
7 months, 3 weeks ago
Yes Agree..
upvoted 1 times

 
Gilvan
8 months, 2 weeks ago
Correct
upvoted 4 times


Question #21 Topic 1

HOTSPOT -

From a website analytics system, you receive data extracts about user interactions such as downloads, link clicks, form submissions, and video
plays.

The data contains the following columns.

You need to design a star schema to support analytical queries of the data. The star schema will contain four tables including a date dimension.

To which table should you add each column? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.

Hot Area:


Correct Answer:

Box 1: DimEvent -

Box 2: DimChannel -

Box 3: FactEvents -

Fact tables store observations or events, and can be sales orders, stock balances, exchange rates, temperatures, etc

Reference:

https://docs.microsoft.com/en-us/power-bi/guidance/star-schema
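
A minimal sketch of the resulting four-table star schema; the surrogate key names, data types, and the extra event attributes are assumptions, while EventCategory, ChannelGrouping, and TotalEvents come from the question:

CREATE TABLE dbo.DimDate    (DateKey INT NOT NULL, CalendarDate DATE NOT NULL);
CREATE TABLE dbo.DimEvent   (EventKey INT NOT NULL, EventCategory NVARCHAR(100), EventAction NVARCHAR(100), EventLabel NVARCHAR(100));
CREATE TABLE dbo.DimChannel (ChannelKey INT NOT NULL, ChannelGrouping NVARCHAR(100));
CREATE TABLE dbo.FactEvents (DateKey INT NOT NULL, EventKey INT NOT NULL, ChannelKey INT NOT NULL, TotalEvents INT NOT NULL);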

 
gssd4scoder
Highly Voted 
7 months ago
It seems to be correct
upvoted 29 times

 
DingDongSingSong
Highly Voted 
2 months ago
What is this question? It is poorly written. I couldn't even understand what's being asked here. It talks about 4 tables, yet the answer shows 3. Then,
the columns mentioned in the question don't match the column/attributes shown in the 3 tables noted in the answer.
upvoted 7 times

 
Dothy
Most Recent 
2 weeks ago
EventCategory -> dimEvent

channelGrouping -> dimChannel

TotalEvents -> factEven


upvoted 1 times

 
JJdeWit
2 weeks, 4 days ago
EventCategory ==> dimEvent

channelGrouping ==> dimChannel

TotalEvents ==> factEvent

Explanation:
A bit of knowledge of Google Analytics Universal helps to understand this question. eventCategory, eventAction and eventLabel all contain
information about the event/action done on the website, and can be logically be grouped together. ChannelGrouping is about how the user came
on the website (through Google, and advertisement, an email link, etc.) and is not related to events at all. It therefore would make sense to put it in
a second dim table.
upvoted 1 times

 
Mahesh_mm
4 months, 4 weeks ago
Answer is correct
upvoted 3 times

 
laszek
8 months, 4 weeks ago
I would add ChannelGrouping to DimEvents table. What would DimChannel table contain? only one column? No sense to me
upvoted 3 times

 
manquak
8 months, 3 weeks ago


It is supposed to contain 4 tables. Date, Event, Fact so the logical conclusion would be to include the channel dimension. If it were up to me though I'd use the channel as a degenerate dimension and store it in fact table if it's the only information that we have provided.
upvoted 3 times

 
Seansmyrke
2 months, 3 weeks ago
I mean if you think about it, ChannelName (facebook, google, youtube), ChannelType (paid media, free posts, ads), ChannleDelivery (chrome, etc etc). Just thinking out loud
upvoted 1 times

Question #22 Topic 1

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.

After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.

You have an Azure Storage account that contains 100 GB of files. The files contain rows of text and numerical values. 75% of the rows contain
description data that has an average length of 1.1 MB.

You plan to copy the data from the storage account to an enterprise data warehouse in Azure Synapse Analytics.

You need to prepare the files to ensure that the data copies quickly.

Solution: You convert the files to compressed delimited text files.

Does this meet the goal?

A.
Yes

B.
No

Correct Answer:
A

All file formats have different performance characteristics. For the fastest load, use compressed delimited text files.

Reference:

https://docs.microsoft.com/en-us/azure/sql-data-warehouse/guidance-for-loading-data
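
If the load goes through PolyBase external tables, a compressed delimited text format like the one referenced above is declared with an external file format; a minimal sketch, with the format name and delimiters chosen here as assumptions:

CREATE EXTERNAL FILE FORMAT CompressedCsvFormat
WITH
(
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = '|', STRING_DELIMITER = '"'),
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec'   -- gzip-compressed delimited text
);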

Community vote distribution


A (63%) B (38%)

 
Fahd92
Highly Voted 
7 months, 3 weeks ago
They said you need to prepare the files to copy; maybe they mean we should make them less than 1 MB? If so, it would be A, otherwise B!
upvoted 12 times

 
ANath
4 months, 1 week ago
The answer should be A.

https://azure.microsoft.com/en-gb/blog/increasing-polybase-row-width-limitation-in-azure-sql-data-warehouse/
upvoted 1 times

 
Thij
7 months, 3 weeks ago
After reading the other questions on this topic I go with A because the relevant part seems to be the compression.
upvoted 4 times

 
Muishkin
Most Recent 
4 weeks, 1 day ago
A text file seems to be too simple an answer, but it is true as per the Microsoft link. I was thinking of Parquet/Avro files.
upvoted 1 times

 
Massy
2 months, 2 weeks ago
Selected Answer: B
From the question: "75% of the rows contain description data that has an average length of 1.1 MB". You can't

From the documentation: "When you put data into the text files in Azure Blob storage or Azure Data Lake Store, they must have fewer than
1,000,000 bytes of data."

So 75% of the rows aren't suitable for delimited text files... why do you say the answer is yes?
upvoted 3 times

 
kamil_k
2 months, 2 weeks ago
I initially thought so too, however isn't this limit only relevant to PolyBase copy? It is not mentioned which method is used to transfer the data
so you could fit more than 1mb into a column in the table if you want to, you just have to use something else e.g. COPY command.
upvoted 2 times

 
PallaviPatel
3 months, 4 weeks ago
Selected Answer: A
correct answer.
upvoted 2 times

 
Mahesh_mm
4 months, 4 weeks ago
A is correct
upvoted 1 times

 
alexleonvalencia
5 months, 2 weeks ago
Selected Answer: A
Correct


upvoted 2 times

 
rashjan
5 months, 2 weeks ago
Selected Answer: A
correct because compression
upvoted 1 times

 
Odoxtoom
7 months, 1 week ago
Consider this sets one question:

What should you do to improve loading times?

What | Yes | No |

compressed | O | O |

columnstore | O | O |

> 1MB | O | O |

So now answers should be clear


upvoted 1 times

 
HaliBrickclay
7 months, 1 week ago
As per Microsoft

Row size and data type limits

PolyBase loads are limited to rows smaller than 1 MB. It cannot be used to load to VARCHR(MAX), NVARCHAR(MAX), or VARBINARY(MAX). For
more information, see Azure Synapse Analytics service capacity limits.

When your source data has rows greater than 1 MB, you might want to vertically split the source tables into several small ones. Make sure that the
largest size of each row doesn't exceed the limit. The smaller tables can then be loaded by using PolyBase and merged together in Azure Synapse
Analytics.
upvoted 2 times

 
jamesraju
7 months, 2 weeks ago
The answer should be 'yes"

All file formats have different performance characteristics. For the fastest load, use compressed delimited text files. The difference between UTF-8
and UTF-16 performance is minimal.
upvoted 1 times

 
RinkiiiiiV
7 months, 3 weeks ago
correct Answer is B
upvoted 1 times

 
gk765
8 months ago
Correct Answer is B. There is limit of 1MB when it comes to the row length. Hence you have to modify the files to ensure the row size is less than
1MB
upvoted 3 times

 
kolakone
8 months, 1 week ago
Answer is correct
upvoted 1 times


Question #23 Topic 1

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that
might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.

After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.

You have an Azure Storage account that contains 100 GB of files. The files contain rows of text and numerical values. 75% of the rows contain
description data that has an average length of 1.1 MB.

You plan to copy the data from the storage account to an enterprise data warehouse in Azure Synapse Analytics.

You need to prepare the files to ensure that the data copies quickly.

Solution: You copy the files to a table that has a columnstore index.

Does this meet the goal?

A.
Yes

B.
No

Correct Answer:
B

Instead convert the files to compressed delimited text files.

Reference:

https://docs.microsoft.com/en-us/azure/sql-data-warehouse/guidance-for-loading-data

Community vote distribution


B (100%)

 
Odoxtoom
Highly Voted 
7 months, 1 week ago
Consider this sets one question:

What should you do to improve loading times?

What | Yes | No |

compressed | O | O |

columnstore | O | O |

> 1MB | O | O |

So now answers should be clear


upvoted 6 times

 
Julius7000
7 months ago
Can You explain this in more details?
upvoted 10 times

 
helly13
5 months, 2 weeks ago
I really didn't understand this , can you explain?
upvoted 5 times

 
Amsterliese
Most Recent 
1 month, 2 weeks ago
Columnstore index would be used for faster reading, but the question is only about faster loading. So for faster loading you want the least possible
overhead. So the answer should be no. Am I right?
upvoted 2 times

 
Muishkin
4 weeks, 1 day ago
Yes load to a table without indexes for faster load right?
upvoted 1 times

 
lionurag
2 months, 3 weeks ago
Selected Answer: B
B is correct
upvoted 2 times

 
bhanuprasad9331
3 months, 1 week ago
From the documentation, loads to heap table are faster than indexed tables. So, better to use heap table than columnstore index table in this case.

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-index#heap-tables
upvoted 4 times

 
PallaviPatel
3 months, 4 weeks ago
Selected Answer: B
B is correct.
upvoted 1 times


 
DE_Sanjay
4 months, 1 week ago
NO is the answer.
upvoted 1 times

 
Mahesh_mm
4 months, 4 weeks ago
B is correct
upvoted 1 times

 
rashjan
5 months, 2 weeks ago
Selected Answer: B
Correct Answer: No.
upvoted 2 times

 
sachabess79
7 months, 3 weeks ago
No. The index will increase the insertion time.
upvoted 3 times

 
michalS
8 months, 3 weeks ago
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/guidance-for-loading-data. "For the fastest load, use compressed
delimited text files."
upvoted 1 times

 
umeshkd05
8 months, 2 weeks ago
But the row size also need to be < 1 MB

So, files need to be modified to make all rows < 1 MB

Answer: NO
upvoted 4 times

 
Julius7000
7 months ago
In other words, I think that 100 GB is much too much for the columnstore index memory-wise. The documentation is unclear in the context of this particular question, but I think the answer is NO, as the given answer is the wrong idea anyway.
upvoted 1 times

 
Julius7000
7 months ago
Not row size; the NUMBER of rows has a maximum of 1,048,576 per rowgroup.

"When there is memory pressure, the columnstore index might not be able to achieve maximum compression rates. This effects query
performance."
upvoted 1 times

 
gk765
8 months ago
Correct answer should be NO
upvoted 2 times


Question #24 Topic 1

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that
might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.

After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.

You have an Azure Storage account that contains 100 GB of files. The files contain rows of text and numerical values. 75% of the rows contain
description data that has an average length of 1.1 MB.

You plan to copy the data from the storage account to an enterprise data warehouse in Azure Synapse Analytics.

You need to prepare the files to ensure that the data copies quickly.

Solution: You modify the files to ensure that each row is more than 1 MB.

Does this meet the goal?

A.
Yes

B.
No

Correct Answer:
B

Instead convert the files to compressed delimited text files.

Reference:

https://docs.microsoft.com/en-us/azure/sql-data-warehouse/guidance-for-loading-data

Community vote distribution


B (100%)

 
Gilvan
Highly Voted 
8 months, 1 week ago
No, rows need to have less than 1 MB. A batch size between 100 K to 1M rows is the recommended baseline for determining optimal batch size
capacity.
upvoted 7 times

 
PallaviPatel
Most Recent 
3 months, 4 weeks ago
Selected Answer: B
B is correct.
upvoted 4 times

 
amarG1996
4 months, 3 weeks ago
PolyBase can't load rows that have more than 1,000,000 bytes of data. When you put data into the text files in Azure Blob storage or Azure Data
Lake Store, they must have fewer than 1,000,000 bytes of data. This byte limitation is true regardless of the table schema.
upvoted 2 times

 
kamil_k
2 months, 2 weeks ago
is it stated anywhere that we have to use PolyBase? What about COPY command?
upvoted 1 times

 
amarG1996
4 months, 3 weeks ago
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/data-loading-best-practices#prepare-data-in-azure-storage
upvoted 1 times

 
Mahesh_mm
4 months, 4 weeks ago
Answer is No
upvoted 1 times

 
rashjan
5 months, 2 weeks ago
Selected Answer: B
Correct Answer: No.
upvoted 2 times

 
Odoxtoom
7 months, 1 week ago
Consider this sets one question:

What should you do to improve loading times?

What | Yes | No |

compressed | O | O |

columnstore | O | O |

> 1MB | O | O |

So now answers should be clear


upvoted 1 times


 
Aslam208
6 months ago
@Odoxtoom, can you please explain your answer and specify based on this matrix which option is correct.
upvoted 3 times

 
Bishtu
5 months ago
Yes

No

No
upvoted 2 times


Question #25 Topic 1

You build a data warehouse in an Azure Synapse Analytics dedicated SQL pool.

Analysts write a complex SELECT query that contains multiple JOIN and CASE statements to transform data for use in inventory reports. The
inventory reports will use the data and additional WHERE parameters depending on the report. The reports will be produced once daily.

You need to implement a solution to make the dataset available for the reports. The solution must minimize query times.

What should you implement?

A.
an ordered clustered columnstore index

B.
a materialized view

C.
result set caching

D.
a replicated table

Correct Answer:
B

Materialized views for dedicated SQL pools in Azure Synapse provide a low maintenance method for complex analytical queries to get fast
performance without any query change.

Incorrect Answers:

C: One daily execution does not make use of result cache caching.

Note: When result set caching is enabled, dedicated SQL pool automatically caches query results in the user database for repetitive use. This
allows subsequent query executions to get results directly from the persisted cache so recomputation is not needed. Result set caching
improves query performance and reduces compute resource usage. In addition, queries using cached results set do not use any concurrency
slots and thus do not count against existing concurrency limits.

Reference:

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/performance-tuning-materialized-views
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/performance-tuning-result-set-caching

Community vote distribution


B (100%)

 
ANath
Highly Voted 
3 months, 2 weeks ago
B is correct.

Materialized view and result set caching

These two features in dedicated SQL pool are used for query performance tuning. Result set caching is used for getting high concurrency and fast
response from repetitive queries against static data.

To use the cached result, the form of the cache requesting query must match with the query that produced the cache. In addition, the cached result
must apply to the entire query.

Materialized views allow data changes in the base tables. Data in materialized views can be applied to a piece of a query. This support allows the
same materialized views to be used by different queries that share some computation for faster performance.
upvoted 6 times

 
SandipSingha
Most Recent 
2 weeks, 3 days ago
B materialized view
upvoted 1 times

 
Egocentric
1 month, 1 week ago
B is correct without a doubt
upvoted 1 times

 
DingDongSingSong
2 months ago
Why isn't the answer "A" when the query may have additional WHERE parameters depending on the report? That means the query isn't static and
will change depending on the report. A clustered columnstore index would provide better query performance in the case of a complex query that
isn't static.
upvoted 1 times

 
PallaviPatel
3 months, 4 weeks ago
Selected Answer: B
B correct.
upvoted 1 times

 
VeroDon
4 months, 3 weeks ago


Selected Answer: B
Correct
upvoted 3 times

 
Mahesh_mm
4 months, 4 weeks ago
B is correct
upvoted 1 times

 
bad_atitude
5 months, 1 week ago
B materialized view
upvoted 2 times

 
Canary_2021
5 months, 1 week ago
Selected Answer: B
B is the correct answer.

A materialized view is a database object that contains the results of a query. A materialized view is not simply a window on the base table. It is
actually a separate object holding data in itself. So query data against a materialized view with different filters should be quick.

Difference Between View and Materialized View:

https://techdifferences.com/difference-between-view-and-materialized-view.html
upvoted 4 times

 
alexleonvalencia
5 months, 2 weeks ago
Correct answer B, a materialized view.
upvoted 4 times


Question #26 Topic 1

You have an Azure Synapse Analytics workspace named WS1 that contains an Apache Spark pool named Pool1.

You plan to create a database named DB1 in Pool1.

You need to ensure that when tables are created in DB1, the tables are available automatically as external tables to the built-in serverless SQL
pool.

Which format should you use for the tables in DB1?

A.
CSV

B.
ORC

C.
JSON

D.
Parquet

Correct Answer:
D

Serverless SQL pool can automatically synchronize metadata from Apache Spark. A serverless SQL pool database will be created for each
database existing in serverless Apache Spark pools.

For each Spark external table based on Parquet or CSV and located in Azure Storage, an external table is created in a serverless SQL pool
database.
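For illustration, once a table is created as Parquet from the Spark pool in DB1, it can be queried from the built-in serverless SQL pool with no extra setup (the table name mytable is hypothetical):

-- Spark side (Spark SQL): CREATE TABLE mytable USING PARQUET AS SELECT ...
-- Serverless SQL pool side: the table is exposed automatically under the dbo schema
SELECT TOP 10 *
FROM DB1.dbo.mytable;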

Reference:

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-storage-files-spark-tables

Community vote distribution


D (100%)

 
KevinSames
Highly Voted 
5 months, 1 week ago
Both A and D are correct

"For each Spark external table based on Parquet or CSV and located in Azure Storage, an external table is created in a serverless SQL pool
database. As such, you can shut down your Spark pools and still query Spark external tables from serverless SQL pool."
upvoted 15 times

 
RehanRajput
Most Recent 
3 days, 18 hours ago
Both A and D
upvoted 1 times

 
RehanRajput
3 days, 18 hours ago
https://docs.microsoft.com/en-us/azure/synapse-analytics/metadata/database
upvoted 1 times

 
MatiCiri
4 weeks ago
Selected Answer: D
Looks correct to me
upvoted 1 times

 
AhmedDaffaie
2 months, 2 weeks ago
JSON is also supported by Serverless SQL Pool but it is kinda complicated. Why is it not selected?
upvoted 2 times

 
Ajitk27
3 months ago
Selected Answer: D
Looks correct to me
upvoted 1 times

 
VijayMore
3 months ago
Selected Answer: D
Correct
upvoted 1 times

 
PallaviPatel
3 months, 4 weeks ago
Selected Answer: D
Both A and D are correct. as CSV and Parquet are correct answers.
upvoted 1 times

 
Mahesh_mm
4 months, 4 weeks ago
Parquet and CSV are correct

upvoted 4 times

 
Nifl91
5 months, 2 weeks ago
I think A and D are both correct answers.
upvoted 3 times

 
alexleonvalencia
5 months, 2 weeks ago
Correct answer: Parquet.
upvoted 1 times


Question #27 Topic 1

You are planning a solution to aggregate streaming data that originates in Apache Kafka and is output to Azure Data Lake Storage Gen2. The
developers who will implement the stream processing solution use Java.

Which service should you recommend using to process the streaming data?

A.
Azure Event Hubs

B.
Azure Data Factory

C.
Azure Stream Analytics

D.
Azure Databricks

Correct Answer:
D

Azure Databricks runs Apache Spark, which provides a full Java API, whereas Azure Stream Analytics uses a SQL-based query language, so Databricks is the better fit for a Java development team. The comparison tables in the referenced documentation (General capabilities and Integration capabilities) summarize the key differences between the Azure stream processing technologies.

Reference:

https://docs.microsoft.com/en-us/azure/architecture/data-guide/technology-choices/stream-processing

Community vote distribution


D (100%)

 
Nifl91
Highly Voted 
5 months, 2 weeks ago
Correct!
upvoted 13 times

 
NewTuanAnh
Most Recent 
1 month, 2 weeks ago
why not C: Azure Stream Analytics?
upvoted 1 times

 
Muishkin
4 weeks, 1 day ago
Yes Azure stream Analytics for streaming data?
upvoted 1 times

 
NewTuanAnh
1 month, 2 weeks ago
I see, Azure Stream Analytics does not associate with Java
upvoted 1 times

 
sdokmak
2 days, 22 hours ago
or databricks
upvoted 1 times

 
sdokmak
2 days, 22 hours ago
kafka*
upvoted 1 times

 
PallaviPatel
3 months, 4 weeks ago
Selected Answer: D
correct.
upvoted 1 times

 
Mahesh_mm
4 months, 4 weeks ago
Answer is correct
upvoted 3 times

 
alexleonvalencia
5 months, 2 weeks ago
Correct answer: Azure Databricks.
upvoted 4 times


Question #28 Topic 1

You plan to implement an Azure Data Lake Storage Gen2 container that will contain CSV files. The size of the files will vary based on the number
of events that occur per hour.

File sizes range from 4 KB to 5 GB.

You need to ensure that the files stored in the container are optimized for batch processing.

What should you do?

A.
Convert the files to JSON

B.
Convert the files to Avro

C.
Compress the files

D.
Merge the files

Correct Answer:
B

Avro supports batch and is very relevant for streaming.

Note: Avro is a framework developed within Apache's Hadoop project. It is a row-based storage format that is widely used for serialization. Avro stores its schema in JSON format, making it easy to read and interpret by any program. The data itself is stored in a binary format, making it compact and efficient.

Reference:

https://www.adaltas.com/en/2020/07/23/benchmark-study-of-different-file-format/

Community vote distribution


D (100%)

 
VeroDon
Highly Voted 
4 months, 3 weeks ago
You can not merge the files if u don't know how many files exist in ADLS2. In this case, you could easily create a file larger than 100 GB in size and
decrease performance. so B is the correct answer. Convert to AVRO
upvoted 25 times

 
Massy
3 weeks, 5 days ago
I can understand why you say not merge, but why avro?
upvoted 2 times

 
Canary_2021
Highly Voted 
4 months, 3 weeks ago
Selected Answer: D
If you store your data as many small files, this can negatively affect performance. In general, organize your data into larger sized files for better
performance (256 MB to 100 GB in size).

https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-best-practices#optimize-for-data-ingest
upvoted 7 times

 
SAYAK7
Most Recent 
4 days, 15 hours ago
Selected Answer: D
Batch can support JSON or AVRO, you should input one file by merging them all.
upvoted 1 times

 
sdokmak
1 day, 9 hours ago
They're CSV so you're saying answer is A
upvoted 1 times

 
sdokmak
1 day, 9 hours ago
B*, AVRO is faster than JSON
upvoted 1 times

 
RehanRajput
1 month ago
Selected Answer: D
You need to make sure that the files in the container are optimized for BATCH PROCESSING. In case of batch processing it makes sense to merge
files as to reduce the amount of IO Listing operations.

B would have been correct if we had to optimize for stream processing.


upvoted 3 times

 
Karthikj18
1 month, 3 weeks ago
Conversion adds extra load, so it is not a good idea to convert to Avro; merging is easier.
upvoted 1 times


 
SebK
2 months ago
Selected Answer: D
merge files
upvoted 2 times

 
adfgasd
3 months, 3 weeks ago
This question makes me very confused.

It says the file size depends on the number of events per hour, so i guess there is a file generated every hour. In the worst case, we have 5GB * 24h,
which is greater than 100GB...
But why is AVRO a good choice??
upvoted 6 times

 
PallaviPatel
3 months, 4 weeks ago
Selected Answer: D
merge files is correct.
upvoted 3 times

 
vincetita
4 months ago
Selected Answer: D
Small-sized files will hurt performance. Optimal file size: 256MB to 100GB
upvoted 1 times

 
Tomi1234
4 months, 3 weeks ago
Selected Answer: D
In my opinion for better batch processing files should be not bigger than 100GB but as big as possible.
upvoted 7 times

 
VeroDon
4 months, 3 weeks ago
One example of batch processing is transforming a large set of flat, semi-structured CSV or JSON files into a schematized and structured format
that is ready for further querying. Typically the data is converted from the raw formats used for ingestion (such as CSV) into binary formats that are
more performant for querying because they store data in a columnar format, and often provide indexes and inline statistics about the data.

https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data/batch-processing
upvoted 1 times

 
edba
4 months, 4 weeks ago
I think it shall be D as well. Please check the link below. https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-best-practices#file-size
upvoted 3 times

 
Mahesh_mm
4 months, 4 weeks ago
B is correct
upvoted 2 times

 
SabaJamal2010AtGmail
4 months, 3 weeks ago
Consider using the Avro file format in cases where your I/O patterns are more write heavy, or the query patterns favor retrieving multiple rows
of records in their entirety. For example, the Avro format works well with a message bus such as Event Hub or Kafka that write multiple
events/messages in succession.
upvoted 1 times

 
didixuecoding
5 months ago
Correct Answer should be D: Merge the files
upvoted 2 times

 
corebit
5 months ago
Please explain why it is D.
upvoted 1 times

 
alexleonvalencia
5 months, 2 weeks ago
Correct answer: Avro
upvoted 1 times


Question #29 Topic 1

HOTSPOT -

You store files in an Azure Data Lake Storage Gen2 container. The container has the storage policy shown in the following exhibit.

Use the drop-down menus to select the answer choice that completes each statement based on the information presented in the graphic.

NOTE: Each correct selection is worth one point.

Hot Area:


Correct Answer:

Box 1: moved to cool storage -

The ManagementPolicyBaseBlob.TierToCool property gets or sets the function to tier blobs to cool storage. It supports blobs currently in the Hot tier.

Box 2: container1/contoso.csv -

As defined by prefixMatch.

prefixMatch: An array of strings for prefixes to be matched. Each rule can define up to 10 case-sensitive prefixes. A prefix string must start with
a container name.

Reference:

https://docs.microsoft.com/en-us/dotnet/api/microsoft.azure.management.storage.fluent.models.managementpolicybaseblob.tiertocool

 
bad_atitude
Highly Voted 
5 months, 1 week ago
correct
upvoted 17 times

 
adfgasd
5 months, 1 week ago
why the .csv?
upvoted 1 times

 
Lewistrick
5 months ago
It matches anything that starts with "container1/contoso" and the csv in the answer is the only one that matches.
upvoted 9 times

 
alexleonvalencia
Highly Voted 
5 months, 2 weeks ago
Answer: Cool tier & container1/contoso.csv
upvoted 5 times

 
AJ01
Most Recent 
4 months, 2 weeks ago
shouldn't the question be greater than 60 days?
upvoted 2 times

 
stunner85_
3 months, 3 weeks ago
The files get deleted after 60 days but after 30 days they are moved to the cool storage.
upvoted 3 times

 
Mahesh_mm
4 months, 4 weeks ago
correct
upvoted 3 times


Question #30 Topic 1

You are designing a financial transactions table in an Azure Synapse Analytics dedicated SQL pool. The table will have a clustered columnstore
index and will include the following columns:

✑ TransactionType: 40 million rows per transaction type

✑ CustomerSegment: 4 million per customer segment

✑ TransactionMonth: 65 million rows per month

AccountType: 500 million per account type

You have the following query requirements:

✑ Analysts will most commonly analyze transactions for a given month.

✑ Transactions analysis will typically summarize transactions by transaction type, customer segment, and/or account type

You need to recommend a partition strategy for the table to minimize query times.

On which column should you recommend partitioning the table?

A.
CustomerSegment

B.
AccountType

C.
TransactionType

D.
TransactionMonth

Correct Answer:
D

For optimal compression and performance of clustered columnstore tables, a minimum of 1 million rows per distribution and partition is
needed. Before partitions are created, dedicated SQL pool already divides each table into 60 distributed databases.

Example: Any partitioning added to a table is in addition to the distributions created behind the scenes. Using this example, if the sales fact
table contained 36 monthly partitions, and given that a dedicated SQL pool has 60 distributions, then the sales fact table should contain 60
million rows per month, or 2.1 billion rows when all months are populated. If a table contains fewer than the recommended minimum number of
rows per partition, consider using fewer partitions in order to increase the number of rows per partition.
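For illustration, a minimal sketch of the table definition (column types and boundary values are hypothetical), partitioning on TransactionMonth while keeping the hash distribution on a different column:

CREATE TABLE dbo.FactTransaction
(
    TransactionKey   bigint NOT NULL,
    TransactionType  int    NOT NULL,
    CustomerSegment  int    NOT NULL,
    AccountType      int    NOT NULL,
    TransactionMonth int    NOT NULL   -- e.g. 201901, 201902, ...
)
WITH
(
    DISTRIBUTION = HASH(TransactionKey),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION ( TransactionMonth RANGE RIGHT FOR VALUES
                (201902, 201903, 201904 /* ...one boundary per month for 2019-2020... */) )
);

With 65 million rows per month spread over 60 distributions, each monthly partition still holds around 1 million rows per distribution, which keeps the columnstore segments efficient.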

Community vote distribution


D (100%)

 
Lewistrick
Highly Voted 
5 months ago
Anyone else thinks this is a very badly explained situation?
upvoted 12 times

 
Canary_2021
Highly Voted 
5 months, 1 week ago
Selected Answer: D
Select D because analysts will most commonly analyze transactions for a given month,
upvoted 9 times

 
Youdaoud
Most Recent 
1 month, 3 weeks ago
Selected Answer: D
correct answer D
upvoted 1 times

 
PallaviPatel
3 months, 4 weeks ago
Selected Answer: D
correct.
upvoted 3 times

 
SabaJamal2010AtGmail
4 months, 3 weeks ago
D is correct because "Transactions analysis will typically summarize transactions by transaction type, customer segment, and/or account type"
implying its part of the WHERE clause. The option of choosing a distribution column is to ensure that it is not used in the WHERE clause.
upvoted 6 times

 
ploer
3 months, 3 weeks ago
D is correct, but those columns will be used for aggregate funtions. TransactionMonth column will be used in the where-clause: "analysts will
most commonly analyze transactions for a given month", so the given month must be in the where clause. Partitioning on the where clause
column significantly reduces the amount of data to be processed which leads to increased performance. Do not confuse with distribution
column on hash partitioned tables. Using TransactionMonth column as distribution column here would be a really bad idea because all queried
data would be on one single node.
upvoted 15 times


 
Mahesh_mm
4 months, 4 weeks ago
D is correct
upvoted 2 times

 
Sonnie01
5 months, 2 weeks ago
Selected Answer: D
correct
upvoted 5 times

 
edba
4 months, 4 weeks ago
check this as well for explanation. https://www.linkedin.com/pulse/partitioning-distribution-azure-synapse-analytics-swapnil-mule
upvoted 2 times

 
gf2tw
5 months, 2 weeks ago
Agree with D, should not be confused with Distribution column for Hash-distributed tables.
upvoted 5 times


Question #31 Topic 1

HOTSPOT -

You have an Azure Data Lake Storage Gen2 account named account1 that stores logs as shown in the following table.

You do not expect that the logs will be accessed during the retention periods.

You need to recommend a solution for account1 that meets the following requirements:

✑ Automatically deletes the logs at the end of each retention period

✑ Minimizes storage costs

What should you include in the recommendation? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.

Hot Area:

Correct Answer:

Box 1: Store the infrastructure logs in the Cool access tier and the application logs in the Archive access tier

For infrastructure logs: Cool tier - An online tier optimized for storing data that is infrequently accessed or modified. Data in the cool tier should
be stored for a minimum of 30 days. The cool tier has lower storage costs and higher access costs compared to the hot tier.

For application logs: Archive tier - An offline tier optimized for storing data that is rarely accessed, and that has flexible latency requirements, on
the order of hours.

Data in the archive tier should be stored for a minimum of 180 days.

Box 2: Azure Blob storage lifecycle management rules

Blob storage lifecycle management offers a rule-based policy that you can use to transition your data to the desired access tier when your
specified conditions are met. You can also use lifecycle management to expire data at the end of its life.

Reference:

https://docs.microsoft.com/en-us/azure/storage/blobs/access-tiers-overview

 
gf2tw
Highly Voted 
5 months, 2 weeks ago
"Data must remain in the Archive tier for at least 180 days or be subject to an early deletion charge. For example, if a blob is moved to the Archive
tier and then deleted or moved to the Hot tier after 45 days, you'll be charged an early deletion fee equivalent to 135 (180 minus 45) days of
storing that blob in the Archive tier." <- from the sourced link.

This explains why we have to use two different access tiers rather than both as archive.
upvoted 26 times


 
Backy
Most Recent 
2 weeks ago
The question says "You do not expect that the logs will be accessed during the retention periods" - so there is no reason to keep any of them as
Cool, so the correct answer should be to put them both in Archive
upvoted 1 times

 
sdokmak
2 days, 21 hours ago
yeah but because the infrastructure logs are <180 days before deleting, there is a considerable fee to delete if in archive, so not the cheapest
option.
upvoted 2 times

 
Muishkin
4 weeks, 1 day ago
But the question says 360 days and 60 days for the 2 logs...whereas archive tier could store only upto 180 days .Also the cool tier has lesser storage
cost /- hour as compared to archive tier.So should'nt the answer be cool tier for both?
upvoted 1 times

 
Mahesh_mm
4 months, 4 weeks ago
Answers are correct
upvoted 2 times

 
ANath
5 months ago
The answers are correct.

Data must remain in the Archive tier for at least 180 days or be subject to an early deletion charge. For example, if a blob is moved to the Archive
tier and then deleted or moved to the Hot tier after 45 days, you'll be charged an early deletion fee equivalent to 135 (180 minus 45) days of
storing that blob in the Archive tier.

A blob in the Cool tier in a general-purpose v2 accounts is subject to an early deletion penalty if it is deleted or moved to a different tier before 30
days has elapsed. This charge is prorated. For example, if a blob is moved to the Cool tier and then deleted after 21 days, you'll be charged an early
deletion fee equivalent to 9 (30 minus 21) days of storing that blob in the Cool tier.

https://docs.microsoft.com/en-us/azure/storage/blobs/access-tiers-overview
upvoted 4 times


Question #32 Topic 1

You plan to ingest streaming social media data by using Azure Stream Analytics. The data will be stored in files in Azure Data Lake Storage, and
then consumed by using Azure Databricks and PolyBase in Azure Synapse Analytics.

You need to recommend a Stream Analytics data output format to ensure that the queries from Databricks and PolyBase against the files
encounter the fewest possible errors. The solution must ensure that the files can be queried quickly and that the data type information is retained.

What should you recommend?

A.
JSON

B.
Parquet

C.
CSV

D.
Avro

Correct Answer:
B

Need Parquet to support both Databricks and PolyBase.
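For illustration, a minimal sketch of the PolyBase objects that read the Parquet output (the data source, table, and column names are hypothetical and assume an existing external data source):

CREATE EXTERNAL FILE FORMAT ParquetFileFormat
WITH (
    FORMAT_TYPE = PARQUET,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);

CREATE EXTERNAL TABLE dbo.SocialMediaEvents
(
    EventId   bigint,
    EventTime datetime2,
    Payload   nvarchar(4000)
)
WITH (
    LOCATION = '/stream-output/',
    DATA_SOURCE = MyDataLake,        -- assumed, previously created external data source
    FILE_FORMAT = ParquetFileFormat
);

Because Parquet stores the schema and column types with the data, both Databricks and PolyBase can read it without type-inference errors.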

Reference:

https://docs.microsoft.com/en-us/sql/t-sql/statements/create-external-file-format-transact-sql

Community vote distribution


B (100%)

 
hrastogi7
Highly Voted 
5 months, 1 week ago
Parquet can be quickly retrieved and maintain metadata in itself. Hence Parquet is correct answer.
upvoted 10 times

 
Muishkin
Most Recent 
4 weeks, 1 day ago
Isnt JSON good for batch processing/streaming?
upvoted 1 times

 
RehanRajput
3 days, 18 hours ago
Indeed. However, we also want to query the data using PolyBase. Polybase doesn't support Avro.

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/load-data-overview#polybase-external-file-formats
upvoted 2 times

 
AhmedDaffaie
1 month, 4 weeks ago
I am confused!

Avro has self-describing schema and good for quick loading (patching), why parquet?
upvoted 3 times

 
Boompiee
2 weeks, 1 day ago
Apparently, the deciding factor is the fact that PolyBase doesn't support AVRO, but it does support Parquet.
upvoted 3 times

 
PallaviPatel
3 months, 4 weeks ago
Selected Answer: B
correct.
upvoted 1 times

 
EmmettBrown
4 months ago
Selected Answer: B
Parquet is the correct answer
upvoted 1 times

 
alexleonvalencia
5 months, 2 weeks ago
Correct answer: Parquet
upvoted 1 times


Question #33 Topic 1

You have an Azure Synapse Analytics dedicated SQL pool named Pool1. Pool1 contains a partitioned fact table named dbo.Sales and a staging
table named stg.Sales that has the matching table and partition definitions.

You need to overwrite the content of the first partition in dbo.Sales with the content of the same partition in stg.Sales. The solution must minimize
load times.

What should you do?

A.
Insert the data from stg.Sales into dbo.Sales.

B.
Switch the first partition from dbo.Sales to stg.Sales.

C.
Switch the first partition from stg.Sales to dbo.Sales.

D.
Update dbo.Sales from stg.Sales.

Correct Answer:
B

A way to eliminate rollbacks is to use Metadata Only operations like partition switching for data management. For example, rather than execute
a DELETE statement to delete all rows in a table where the order_date was in October of 2001, you could partition your data monthly. Then you
can switch out the partition with data for an empty partition from another table.

Note: Syntax:

SWITCH [ PARTITION source_partition_number_expression ] TO [ schema_name. ] target_table [ PARTITION target_partition_number_expression ]

Switches a block of data in one of the following ways:

✑ Reassigns all data of a table as a partition to an already-existing partitioned table.

✑ Switches a partition from one partitioned table to another.

✑ Reassigns all data in one partition of a partitioned table to an existing non-partitioned table.
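Applied to the scenario, a minimal sketch that overwrites the first partition of dbo.Sales with the first partition of stg.Sales; TRUNCATE_TARGET empties the target partition so the switch can replace it, and the operation is metadata-only, so load time is minimal:

ALTER TABLE stg.Sales
SWITCH PARTITION 1 TO dbo.Sales PARTITION 1
WITH (TRUNCATE_TARGET = ON);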

Reference:

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/best-practices-dedicated-sql-pool

Community vote distribution


C (96%) 4%

 
Aslam208
Highly Voted 
5 months, 2 weeks ago
Selected Answer: C
The correct answer is C
upvoted 31 times

 
Nifl91
Highly Voted 
5 months, 2 weeks ago
this must be C. since the need is to overwrite dbo.Sales with the content of stg.Sales.

SWITCH source TO target


upvoted 9 times

 
SAYAK7
Most Recent 
4 days, 4 hours ago
Selected Answer: C
Coz we have to impact dbo.Sales
upvoted 1 times

 
kknczny
1 month ago
Selected Answer: B
As partition in stg.Sales is the one we will be using to overwrite the first partition in dbo.Stage, should it not be understood as "B. Switch the first
partition from dbo.Sales to stg.Sales."?

Also look at the query syntax:

"SWITCH [ PARTITION source_partition_number_expression ] TO [ schema_name. ] target_table [ PARTITION target_partition_number_expression ]"

So from stg to dbo


upvoted 2 times

 
kanak01
3 months, 2 weeks ago
Seriously.. Who puts Fact Table data into Dimension table !
upvoted 1 times

 
rockyc05
3 months ago
Its Fact to Stage Table actually acc to the answer provided
upvoted 1 times


 
PallaviPatel
3 months, 4 weeks ago
Selected Answer: C
C is correct as partition switching works from source to target.
upvoted 2 times

 
dev2dev
4 months, 2 weeks ago
Either way works.
upvoted 1 times

 
Sandip4u
4 months, 2 weeks ago
Stg.sales is a temp table which does not have any partition , So C can not be correct
upvoted 2 times

 
ABExams
3 months, 1 week ago
It literally states it has the same partition definition.
upvoted 2 times

 
alex623
4 months, 2 weeks ago
The correct answer is C, because target table is dbo.sales
upvoted 1 times

 
Rickie85
4 months, 2 weeks ago
Selected Answer: C
C correct
upvoted 1 times

 
Jaws1990
4 months, 2 weeks ago
Selected Answer: C
B is the wrong way round.
upvoted 3 times

 
VeroDon
4 months, 3 weeks ago
Selected Answer: C
https://medium.com/@cocci.g/switch-partitions-in-azure-synapse-sql-dw-1e0e32309872
upvoted 4 times

 
Mahesh_mm
4 months, 4 weeks ago
C is correct answer
upvoted 1 times

 
ArunMonika
4 months, 4 weeks ago
I will go with C
upvoted 1 times

 
gitoxam686
5 months ago
Selected Answer: C
C is correct answer because we have to overwrite.
upvoted 2 times

 
adfgasd
5 months ago
Selected Answer: C
C for sure
upvoted 2 times

 
Will_KaiZuo
5 months, 1 week ago
Selected Answer: C
agree with C
upvoted 1 times


Question #34 Topic 1

You are designing a slowly changing dimension (SCD) for supplier data in an Azure Synapse Analytics dedicated SQL pool.

You plan to keep a record of changes to the available fields.

The supplier data contains the following columns.

Which three additional columns should you add to the data to create a Type 2 SCD? Each correct answer presents part of the solution.

NOTE: Each correct selection is worth one point.

A.
surrogate primary key

B.
effective start date

C.
business key

D.
last modified date

E.
effective end date

F.
foreign key

Correct Answer:
BCE

C: The Slowly Changing Dimension transformation requires at least one business key column.

BE: Historical attribute changes create new records instead of updating existing ones. The only change that is permitted in an existing record is
an update to a column that indicates whether the record is current or expired. This kind of change is equivalent to a Type 2 change. The Slowly
Changing Dimension transformation directs these rows to two outputs: Historical Attribute Inserts Output and New Output.
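For illustration, a minimal sketch of a Type 2 supplier dimension (hypothetical column names) combining a surrogate key, the source-system business key, and the effective-date columns:

CREATE TABLE dbo.DimSupplier
(
    SupplierKey        int IDENTITY(1,1) NOT NULL,  -- surrogate key
    SupplierSystemID   int               NOT NULL,  -- business key from the source system
    SupplierName       nvarchar(100)     NOT NULL,
    EffectiveStartDate datetime2         NOT NULL,
    EffectiveEndDate   datetime2         NULL       -- NULL (or 9999-12-31) marks the current version
)
WITH
(
    DISTRIBUTION = REPLICATE,
    CLUSTERED COLUMNSTORE INDEX
);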

Reference:

https://docs.microsoft.com/en-us/sql/integration-services/data-flow/transformations/slowly-changing-dimension-transformation

Community vote distribution


ABE (85%) BCE (15%)

 
ItHYMeRIsh
Highly Voted 
5 months, 2 weeks ago
Selected Answer: ABE
The answer is ABE. A type 2 SCD requires a surrogate key to uniquely identify each record when versioning.

See https://docs.microsoft.com/en-us/learn/modules/populate-slowly-changing-dimensions-azure-synapse-analytics-pipelines/3-choose-between-dimension-types under SCD Type 2: “the dimension table must use a surrogate key to provide a unique reference to a version of the dimension member.”

A business key is already part of this table - SupplierSystemID. The column is derived from the source data.
upvoted 35 times

 
CHOPIN
Highly Voted 
4 months, 2 weeks ago
Selected Answer: BCE
WHAT ARE YOU GUYS TALKING ABOUT??? You are really misleading other people!!! No issue with the provided answer. Should be BCE!!!

Check this out:


https://docs.microsoft.com/en-us/sql/integration-services/data-flow/transformations/slowly-changing-dimension-transformation?view=sql-server-ver15

"The Slowly Changing Dimension transformation requires at least one business key column."

[Surrogate key] is not mentioned in this Microsoft documentation AT ALL!!!


upvoted 9 times

 
dev2dev
4 months, 2 weeks ago
Search for Business Keys in that page. and make sure you wear specs :D
upvoted 2 times

 
assU2
4 months, 2 weeks ago
Yes, because SupplierSystemID is unique. But Microsoft questions are terribly misleading here. People think that SupplierSystemID is business
key because of Supplier in it. Also, there are some really not good and not sufficient examples on Learn. See https://docs.microsoft.com/en-us/learn/modules/populate-slowly-changing-dimensions-azure-synapse-analytics-pipelines/3-choose-between-dimension-types
upvoted 1 times

 
Mad_001
3 months ago
I don't understand.

1) What in your opinion should then be the business key. Can you explain please.

2) SupplierSystemID is unique in the source system. Is there a definition that the column needs to be unique also in the data warehouse? If not,
there is the possibility to use it as the business key. Am I wrong?
upvoted 1 times

 
Onobhas01
1 month, 2 weeks ago
No you're not wrong, the unique identifier form the ERP system is the Business Key
upvoted 1 times

 
shrikantK
Most Recent 
6 days, 20 hours ago
ABE seems correct. Why not business key is already discussed. Why not foreign key? one reason: Foreign key constraint is not supported in
dedicated SQL pool.
upvoted 1 times

 
gabdu
3 weeks, 2 days ago
why is there no mention of flag?
upvoted 2 times

 
necktru
4 weeks ago
Selected Answer: ABE
Please, SupplierSystemID is unique in the ERP; that does not mean it must be unique in our DW. That's why we need a surrogate primary key; without it, SCD
Type 2 can't be implemented
upvoted 1 times

 
Andushi
4 weeks ago
Selected Answer: ABE
https://docs.microsoft.com/en-us/learn/modules/populate-slowly-changing-dimensions-azure-synapse-analytics-pipelines/3-choose-between-dimension-types
upvoted 1 times

 
ladywhiteadder
1 month, 4 weeks ago
Selected Answer: ABE
ABE - very clear answer if you know what a type 2 SCD is. you will need a new surrogate key. the business key is already there - it's
SupplierSystemID - and will stay the same over time = will not be unique when anything changes as we will insert a new row then.
upvoted 4 times

 
kilowd
3 months, 3 weeks ago
Selected Answer: ABE
Type 2

In order to support type 2 changes, we need to add four columns to our table:

· Surrogate Key – the original ID will no longer be sufficient to identify the specific record we require, we therefore need to create a new ID that the
fact records can join to specifically.

· Current Flag – A quick method of returning only the current version of each record

· Start Date – The date from which the specific historical version is active

· End Date – The date to which the specific historical version record is active

https://adatis.co.uk/introduction-to-slowly-changing-dimensions-scd-types/
upvoted 4 times

 
KashRaynardMorse
1 month ago
Good answer! It's worth talking about the business key, since that is the controversial bit.

There needs to be something that functions as a business key, in this case it can be the SupplierSystemID.


The Current Flag is not strictly needed, the solution would function okay without it, but I would include it in real life anyway for performance
and ease of querying (it's also not shown as an option).
upvoted 2 times

 
KashRaynardMorse
1 month ago
And to add, the question is what "additional" columns are needed. So emphasising, although a business key is definitely needed, the column
that serves it's purpose is already present (albeit with a different column name), so does not need adding again.
upvoted 1 times

 
stunner85_
3 months, 3 weeks ago
Here's why the answer is Surrogate Key and not Business key:

In a temporal database, it is necessary to distinguish between the surrogate key and the business key. Every row would have both a business key
and a surrogate key. The surrogate key identifies one unique row in the database, the business key identifies one unique entity of the modeled
world.
upvoted 2 times

 
m0rty
3 months, 4 weeks ago
Selected Answer: ABE
correct
upvoted 1 times

 
PallaviPatel
3 months, 4 weeks ago
Selected Answer: ABE
these are correct answers.
upvoted 1 times

 
Hervedoux
4 months, 1 week ago
Totally agree with Chopin, SCD type 2 tables require at least a Business Key column and a start and end date to capture historical dat, thus BCE is
the correct answer.

https://docs.microsoft.com/en-us/sql/integration-services/data-flow/transformations/slowly-changing-dimension-transformation
upvoted 1 times

 
Mahesh_mm
4 months, 4 weeks ago
ABE is correct
upvoted 2 times

 
gitoxam686
5 months ago
Selected Answer: ABE
A B E.... Surrogate Key s required.
upvoted 3 times

 
KevinSames
5 months ago
Selected Answer: ABE
ez ezez
upvoted 1 times

 
m2shines
5 months, 1 week ago
A, B and E
upvoted 1 times

 
Nifl91
5 months, 2 weeks ago
shouldn't it be ABE? we already have a business key! we need a surrogate to use as a primary key when a supplier with updated attributes is to be
inserted into the table
upvoted 1 times

 
assU2
4 months, 2 weeks ago
It's not a business key, its unique. And business key may not be unique because it's 2 SCD. You can have multiple rows for one entity with
different start/end dates.
upvoted 2 times

 
necktru
4 weeks ago
It's unique in the ERP, in the DW can be duplicated, it's why we need the surrogate pk that must be unique, answers are ABE
upvoted 1 times


Question #35 Topic 1

HOTSPOT -

You have a Microsoft SQL Server database that uses a third normal form schema.

You plan to migrate the data in the database to a star schema in an Azure Synapse Analytics dedicated SQL pool.

You need to design the dimension tables. The solution must optimize read operations.

What should you include in the solution? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.

Hot Area:

Correct Answer:

Box 1: Denormalize to a second normal form

Denormalization is the process of transforming higher normal forms to lower normal forms via storing the join of higher normal form relations
as a base relation.

Denormalization increases performance in data retrieval at the cost of introducing update anomalies into a database.

Box 2: New identity columns -

The collapsing relations strategy can be used in this step to collapse classification entities into component entities to obtain flat dimension
tables with single-part keys that connect directly to the fact table. The single-part key is a surrogate key generated to ensure it remains unique
over time.

Example:
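A flattened dimension of this kind could be defined as follows (a minimal sketch with hypothetical names; IDENTITY generates the single-part surrogate key):

CREATE TABLE dbo.DimCustomer
(
    CustomerKey   int IDENTITY(1,1) NOT NULL,  -- surrogate key generated in the warehouse
    CustomerID    nvarchar(20)      NOT NULL,  -- natural key from the source system
    CustomerName  nvarchar(100)     NOT NULL,
    City          nvarchar(50)      NULL,      -- attributes collapsed from related 3NF tables
    StateProvince nvarchar(50)      NULL,
    CountryRegion nvarchar(50)      NULL
)
WITH
(
    DISTRIBUTION = REPLICATE,
    CLUSTERED COLUMNSTORE INDEX
);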


Note: A surrogate key on a table is a column with a unique identifier for each row. The key is not generated from the table data. Data modelers
like to create surrogate keys on their tables when they design data warehouse models. You can use the IDENTITY property to achieve this goal
simply and effectively without affecting load performance.

Reference:

https://www.mssqltips.com/sqlservertip/5614/explore-the-role-of-normal-forms-in-dimensional-modeling/
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-identity

 
JimZhang4123
1 week, 2 days ago
'The solution must optimize read operations.' means denormalization
upvoted 1 times

 
PallaviPatel
3 months, 4 weeks ago
Answer correct.
upvoted 3 times

 
Mahesh_mm
4 months, 4 weeks ago
Answers are correct
upvoted 1 times

 
PallaviPatel
5 months ago
answer is correct
upvoted 4 times

 
moreinva43
5 months, 2 weeks ago
While denormalizing does require implementing a lower level of normalization, the second normal form ONLY applies when a table has a
composite primary key. https://www.geeksforgeeks.org/second-normal-form-2nf/
upvoted 1 times


Question #36 Topic 1

HOTSPOT -

You plan to develop a dataset named Purchases by using Azure Databricks. Purchases will contain the following columns:

✑ ProductID

✑ ItemPrice

✑ LineTotal

✑ Quantity

✑ StoreID

✑ Minute

✑ Month

✑ Hour

✑ Year

✑ Day
You need to store the data to support hourly incremental load pipelines that will vary for each Store ID. The solution must minimize storage costs.

How should you complete the code? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.

Hot Area:

Correct Answer:

Box 1: partitionBy -

We should overwrite at the partition level.

Example:

df.write.partitionBy("y","m","d")

.mode(SaveMode.Append)

.parquet("/data/hive/warehouse/db_name.db/" + tableName)

Box 2: ("StoreID", "Year", "Month", "Day", "Hour")


Box 3: parquet("/Purchases")

Reference:

https://intellipaat.com/community/11744/how-to-partition-and-write-dataframe-in-spark-without-deleting-partitions-with-no-new-data

 
sparkchu
1 month, 3 weeks ago
ans should be saveAsTable. format is defined by format() method.
upvoted 1 times

 
assU2
4 months, 2 weeks ago
Can anyone explain why it's Partitioning and not Bucketing pls?
upvoted 2 times

 
KashRaynardMorse
1 month ago
Bucketing feature (part of data skipping index) was removed and microsoft recommends using DeltaLake, which uses the partition syntax.

https://docs.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/dataskipping-index
upvoted 1 times

 
bhanuprasad9331
3 months, 1 week ago
There should be a different folder for each store. Partitioning will create separate folder for each storeId. In bucketing, multiple stores having
same hash value can be present in the same file, so multiple storeIds can be present under a single file.
upvoted 3 times

 
assU2
4 months, 2 weeks ago
Is it a question of correct syntax (numBuckets int the number of buckets to save) or is it smth else?
upvoted 1 times

 
Mahesh_mm
4 months, 4 weeks ago
Answers are correct
upvoted 4 times

 
Aslam208
5 months, 2 weeks ago
correct
upvoted 4 times


Question #37 Topic 1

You are designing a partition strategy for a fact table in an Azure Synapse Analytics dedicated SQL pool. The table has the following
specifications:

✑ Contain sales data for 20,000 products.

Use hash distribution on a column named ProductID.

✑ Contain 2.4 billion records for the years 2019 and 2020.

Which number of partition ranges provides optimal compression and performance for the clustered columnstore index?

A.
40

B.
240

C.
400

D.
2,400

Correct Answer:
A

Each partition should have around 1 million records. Dedicated SQL pools already split every table into 60 distributions.

We have the formula: Records/(Partitions*60)= 1 million

Partitions= Records/(1 million * 60)

Partitions= 2.4 x 1,000,000,000/(1,000,000 * 60) = 40

Note: Having too many partitions can reduce the effectiveness of clustered columnstore indexes if each partition has fewer than 1 million rows.
Dedicated SQL pools automatically partition your data into 60 databases. So, if you create a table with 100 partitions, the result will be 6000
partitions.

Reference:

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/best-practices-dedicated-sql-pool

Community vote distribution


A (64%) C (36%)

 
Aslam208
Highly Voted 
5 months, 2 weeks ago
correct
upvoted 12 times

 
sdokmak
Most Recent 
1 day, 9 hours ago
Selected Answer: A
quick maths
upvoted 1 times

 
MS_Nikhil
3 weeks, 6 days ago
Selected Answer: A
A is correct
upvoted 1 times

 
Egocentric
1 month, 2 weeks ago
correct
upvoted 1 times

 
Twom
2 months, 1 week ago
Selected Answer: A
Correct
upvoted 2 times

 
jskibick
3 months ago
Selected Answer: A
I am also confused.

So we have 2.400.000.000 rows that are already split in 60 nodes od SQL DW. That makes

40.000.000 per node.

Now is question how to order partitions to obtain efficiency for CCI.

Next, we know the partitions will be divided into CCI segments ~1mln per each. And here is my problem. Because CCI will autosplit data in
partitions into 1mln row segments. We do not have to do it on our own in partitions. I would split data into monthly partitions i.e. #24 for 2 year,
2019 and 2020. The segments will autosplit partitions.

But there is not such answer.

I will have to go with A = 40


upvoted 3 times

 
Justin_beswick
3 months, 2 weeks ago
Selected Answer: C
The Rule is Partitions= Records/(1 million * 60)

24,000,000,000/60,000,000 = 400
upvoted 4 times

 
helpaws
3 months ago
it's 2.4 billion, not 24 billion
upvoted 9 times

 
AlvaroEPMorais
3 months, 1 week ago
The Rule is Partitions= Records/(1 million * 60)

2,400,000,000/60,000,000 = 40
upvoted 8 times

 
dev2dev
4 months, 2 weeks ago
Are you suggesting create 40 partitions on ProductId? This is confusing. If you create 40 partitions, SQL Pool will create 40*60 partitions which is
2400. And documentation says create fewer partitions. If we want to create paritions by year then we can create 2 partitions for two years which
internally creates 2*60 = 120 paritions, but extra 2 paritions for outer boundaries will make it 4*60 = 240. So 240 paritions for 2.4 billion rows is
ideal. But what is confusing me is we creat only 4 paritions which is not even in options
upvoted 2 times

 
Canary_2021
4 months, 2 weeks ago
A distributed table appears as a single table, but the rows are actually stored across 60 distributions.

60 is for distribution, not partition.


upvoted 2 times

 
Muishkin
4 weeks ago
So then how do we calculate the number of partitions?Is'nt it user driven ?
upvoted 1 times

 
Canary_2021
4 months, 2 weeks ago
If you partition your data, each partition will need to have 1 million rows to benefit from a clustered columnstore index. For a table with 100
partitions, it needs to have at least 6 billion rows to benefit from a clustered columns store (60 distributions 100 partitions 1 million rows).
upvoted 1 times

 
Mahesh_mm
4 months, 4 weeks ago
correct
upvoted 1 times


Question #38 Topic 1

HOTSPOT -

You are creating dimensions for a data warehouse in an Azure Synapse Analytics dedicated SQL pool.

You create a table by using the Transact-SQL statement shown in the following exhibit.

Use the drop-down menus to select the answer choice that completes each statement based on the information presented in the graphic.

NOTE: Each correct selection is worth one point.

Hot Area:

Correct Answer:

Box 1: Type 2 -

A Type 2 SCD supports versioning of dimension members. Often the source system doesn't store versions, so the data warehouse load process
detects and manages changes in a dimension table. In this case, the dimension table must use a surrogate key to provide a unique reference to
a version of the dimension member. It also includes columns that define the date range validity of the version (for example, StartDate and
EndDate) and possibly a flag column (for example, IsCurrent) to easily filter by current dimension members.

Incorrect Answers:

A Type 1 SCD always reflects the latest values, and when changes in source data are detected, the dimension table data is overwritten.

Box 2: a business key -

A business key or natural key is an index which identifies uniqueness of a row based on columns that exist naturally in a table according to
business rules. For example business keys are customer code in a customer table, composite of sales order header number and sales order
item line number within a sales order details table.

Reference:

https://docs.microsoft.com/en-us/learn/modules/populate-slowly-changing-dimensions-azure-synapse-analytics-pipelines/3-choose-between-dimension-types

 
nkav
Highly Voted 
1 year ago
product key is a surrogate key as it is an identity column
upvoted 98 times

 
111222333
1 year ago
Agree on the surrogate key, exactly.

"In data warehousing, IDENTITY functionality is particularly important as it makes easier the creation of surrogate keys."

Why ProductKey is certainly not a business key: "The IDENTITY value in Synapse is not guaranteed to be unique if the user explicitly inserts a
duplicate value with 'SET IDENTITY_INSERT ON' or reseeds IDENTITY". Business key is an index which identifies uniqueness of a row and here
Microsoft says that identity doesn't guarantee uniqueness.

References:

https://azure.microsoft.com/en-us/blog/identity-now-available-with-azure-sql-data-warehouse/

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-identity
upvoted 8 times

 
rikku33
8 months ago
Type 2

In order to support type 2 changes, we need to add four columns to our table:

· Surrogate Key – the original ID will no longer be sufficient to identify the specific record we require, we therefore need to create a new ID
that the fact records can join to specifically.

· Current Flag – A quick method of returning only the current version of each record

· Start Date – The date from which the specific historical version is active

· End Date – The date to which the specific historical version record is active

With these elements in place, our table will now look like:
upvoted 4 times

 
sagga
Highly Voted 
1 year ago
Type2 because there are start and end columns and ProductKey is a surrogate key. ProductNumber seems a business key.
upvoted 29 times

 
DrC
12 months ago
The start and end columns are for when to when the product was being sold, not for metadata purposes. That makes it:

Type 1 – No History

Update record directly, there is no record of historical values, only current state
upvoted 40 times

 
Kyle1
8 months, 1 week ago
When the product is not being sold anymore, it becomes a historical record. Hence Type 2.
upvoted 2 times

 
rockyc05
3 months ago
It is type 1 not 2
upvoted 1 times

 
Yuri1101
5 months, 1 week ago
With type 2, you normally don't update any column of a row other than row start date and end date.
upvoted 1 times

 
captainbee
11 months, 3 weeks ago
Exactly how I saw it
upvoted 1 times


 
SandipSingha
Most Recent 
2 weeks, 3 days ago
product key is definitely a surrogate key
upvoted 1 times

 
dazero
3 weeks ago
Definitely Type 1. There are no columns to indicate the different versions of the same business key. The sell start and end date columns are actual
source columns from when the product was sold. The Insert and Update columns are audit columns that explain when a record was inserted for the
first time and when it was updated. So the insert date remains the same, but the updated column is updated every time a Type 1 update occurs.
upvoted 1 times

 
AlCubeHead
2 months ago
Product Key is surrogate NOT business key as it's a derived IDENTITY

Dimension is type 1 as it does not have a StartDate and EndDate associated with data changes and also does not have an IsCurrent flag
upvoted 5 times

 
Shrek66
3 months, 3 weeks ago
Agree with ploer

SCD Type 1

Surrogate
upvoted 4 times

 
ploer
3 months, 3 weeks ago
Surrogate Key and Type 1 SCD. Type 1 SCD because sellenddate and sellstartdate are attributes of the product and not for versioning. rowupdated
and rowinserted are used for scd. And - as the naming indicates- the fact that both have a not null constraint leads to the conclusion that we have
no possibility to store the information which row is the current one. So it must be scd type 1.
upvoted 3 times

 
skkdp203
3 months, 3 weeks ago
SCD Type 1

Surrogate

https://docs.microsoft.com/en-us/learn/modules/populate-slowly-changing-dimensions-azure-synapse-analytics-pipelines/3-choose-between-
dimension-types

Surrogate keys in dimension tables

It is critical that the primary key’s value of a dimension table remains unchanged. And it is highly recommended that all dimension tables use
surrogate keys as primary keys.

Surrogate keys are key generated and managed inside the data warehouse rather than keys extracted from data source systems.
upvoted 3 times

 
dev2dev
3 months, 3 weeks ago
identity/surrogate key's can be a business key in transition tables but in dw it can be used only as surrogate key.
upvoted 1 times

 
assU2
4 months, 2 weeks ago
Maybe it's type 2 because the logic is: we can have multiple rows with one productID, different surrogate keys and different start/end sale dates.

key | id | start sale | end sale |

1 | 998 | 2021-01-01 | 2021-02-01 |

2 | 998 | 2022-01-01 | 9999-12-31 | <-- current


upvoted 2 times

 
assU2
4 months, 2 weeks ago
Where are these answers from? Why there are so many mistakes? ProductKey is obviously a surrogate key
upvoted 1 times

 
alex623
4 months, 2 weeks ago
It seems like Type 1: There is no flag to inform if the record is the current record, also the date column is just for modified date
upvoted 1 times

 
Boompiee
2 weeks, 1 day ago
The flag is commonly used, but not required.
upvoted 1 times

 
Mahesh_mm
4 months, 4 weeks ago
Type 2 and surrogate key
upvoted 2 times

 
Ayan3B
5 months, 2 weeks ago
when table being created rowinsertdatetime and rowupdatedatetime attribute has been kept along with ETL identifier attribute so no previous
version data would be kept. So type 1 is answer. Type 2 keep the older version information at row level along with start date and end date and type
3 keeps column level restricted old version of data.

Second answer would be surrogate key as product key generated with IDENTITY
upvoted 4 times

 
satyamkishoresingh
5 months, 3 weeks ago
What is type 0 ?

upvoted 1 times

 
DrTaz
4 months, 3 weeks ago
SCD type 0 us a constant value that never changes.
upvoted 1 times

 
jay5518
6 months ago
This was on test today
upvoted 1 times

 
ohana
7 months ago
Took the exam today, this question came out.

Type2, Surrogate Key


upvoted 7 times


Question #39 Topic 1

You are designing a fact table named FactPurchase in an Azure Synapse Analytics dedicated SQL pool. The table contains purchases from
suppliers for a retail store. FactPurchase will contain the following columns.

FactPurchase will have 1 million rows of data added daily and will contain three years of data.

Transact-SQL queries similar to the following query will be executed daily.

SELECT SupplierKey, StockItemKey, COUNT(*)
FROM FactPurchase
WHERE DateKey >= 20210101
  AND DateKey <= 20210131
GROUP BY SupplierKey, StockItemKey

Which table distribution will minimize query times?

A.
replicated

B.
hash-distributed on PurchaseKey

C.
round-robin

D.
hash-distributed on DateKey

Correct Answer:
B

Hash-distributed tables improve query performance on large fact tables, and are the focus of the referenced article. Round-robin tables are useful for
improving loading speed.

Incorrect:

Not D: Do not use a date column. All data for the same date lands in the same distribution. If several users are all filtering on the same date,
then only 1 of the 60 distributions does all the processing work.
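For illustration, a minimal sketch of the implied table definition (only the key columns are shown; the remaining columns are omitted):

CREATE TABLE dbo.FactPurchase
(
    PurchaseKey  bigint NOT NULL,
    DateKey      int    NOT NULL,
    SupplierKey  int    NOT NULL,
    StockItemKey int    NOT NULL
    -- remaining measure columns omitted
)
WITH
(
    DISTRIBUTION = HASH(PurchaseKey),
    CLUSTERED COLUMNSTORE INDEX
);

Because the table is distributed on PurchaseKey rather than DateKey, the monthly filter is processed by all 60 distributions in parallel instead of landing on a single distribution.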

Reference:

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute

Community vote distribution


B (88%) 13%

 
AugustineUba
Highly Voted 
9 months, 3 weeks ago
From the documentation the answer is clear enough. B is the right answer.

When choosing a distribution column, select a distribution column that: "Is not a date column. All data for the same date lands in the same
distribution. If several users are all filtering on the same date, then only 1 of the 60 distributions do all the processing work."
upvoted 33 times

 
YipingRuan
7 months, 1 week ago
To minimize data movement, select a distribution column that:

Is used in JOIN, GROUP BY, DISTINCT, OVER, and HAVING clauses.


"PurchaseKey" is not used in the group by


upvoted 6 times

 
YipingRuan
7 months, 1 week ago
Consider using the round-robin distribution for your table in the following scenarios:

When getting started as a simple starting point since it is the default

If there is no obvious joining key

If there is no good candidate column for hash distributing the table

If the table does not share a common join key with other tables

If the join is less significant than other joins in the query


upvoted 5 times

 
waterbender19
Highly Voted 
9 months, 3 weeks ago
I think the answer should be D for that specific query. If you look at the datatypes, DateKey is an INT datatype not a DATE datatype.
upvoted 13 times

 
kamil_k
2 months, 1 week ago
n.b. if we look at the example query itself the date range is 31 days so we will use 31 distributions out of 60, and only process ~31 million
records
upvoted 1 times

 
waterbender19
9 months, 3 weeks ago
and the statement that the fact table will have 1 million rows added daily means that each DateKey value has an equal amount of rows associated with it.
upvoted 5 times

 
Lucky_me
4 months, 2 weeks ago
But the DateKey is used in the WHERE clause.
upvoted 2 times

 
kamil_k
2 months, 1 week ago
I agree, date key is int, and besides, even if it was a date, when you query a couple days then 1 million rows per distribution is not that
much. So what if you are going to use only a couple distributions to do the job? Isn't it still faster than using all distributions to process all
of the records to get the required date range?
upvoted 1 times

 
AnandEMani
8 months, 3 weeks ago
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute this link says date filed ,
NOT a date Data type. B is correct
upvoted 3 times

 
Ramkrish39
Most Recent 
2 months, 1 week ago
Agree B is the right answer
upvoted 1 times

 
PallaviPatel
3 months, 4 weeks ago
Selected Answer: C
I will go with round robin.

''Consider using the round-robin distribution for your table in the following scenarios:

When getting started as a simple starting point since it is the default

If there is no obvious joining key

If there is no good candidate column for hash distributing the table

If the table does not share a common join key with other tables

If the join is less significant than other joins in the query


upvoted 1 times

 
yovi
4 months, 4 weeks ago
Anyone, when you finish an exam, do they give you the correct answers in the end?
upvoted 1 times

 
dev2dev
4 months, 2 weeks ago
those finished exam will not know the answer. because answers are not reveled
upvoted 1 times

 
Mahesh_mm
4 months, 4 weeks ago
B is correct ans
upvoted 1 times

 
danish456
5 months ago
Selected Answer: B
It's correct
upvoted 1 times


 
trietnv
5 months ago
Selected Answer: B
1. choose distribution b/c "joining a round-robin table usually requires reshuffling the rows, which is a performance hit"

2. Choose PurchaseKey b/c "not used in WHERE"

refer:
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute

and

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute
upvoted 2 times

 
Aslam208
5 months, 2 weeks ago
Selected Answer: B
B is correct
upvoted 1 times

 
Hervedoux
6 months ago
Selected Answer: B
Its cleary a hash on purchasekey column
upvoted 3 times

 
ohana
7 months ago
Took the exam today, this question came out.

Ans: B
upvoted 5 times

 
Marcus1612
8 months, 2 weeks ago
To optimize the MPP, data has to be distributed evenly. DateKey is not a good candidate because the data will be distributed evenly, one day per
60 days. In practice, if many users query the fact table to retrieve the data about the week before, only 7 nodes will process the queries instead of
60. According to microsoft documentation:"To balance the parallel processing, select a distribution column that .. Is not a date column. All data for
the same date lands in the same distribution. If several users are all filtering on the same date, then only 1 of the 60 distributions do all the
processing work.
upvoted 4 times

 
Marcus1612
8 months, 2 weeks ago
the good answer is B
upvoted 2 times

 
andimohr
10 months ago
The reference given in the answer is precise: Choose a distribution column with data that a) distributes evenly b) has many unique values c) does
not have NULLs or few NULLs and d) IS NOT A DATE COLUMN... definitely the best choice for the Hash distribution is on the Identity column.
upvoted 4 times

 
noone_a
10 months, 3 weeks ago
Although it's a fact table, replicated is the correct distribution in this case.

Each row is 141 bytes in size x 1,000,000 records = ~135 MB total size.

Microsoft recommends replicated distribution for anything under 2 GB.

We have no further information regarding table growth so this answer is based only on the info provided.
upvoted 1 times

 
noone_a
10 months, 3 weeks ago
edit, this is incorrect as it will have 1 million records added daily for 3 years, putting it over 2GB
upvoted 4 times

 
vlad888
11 months ago
Yes - do not use a date column - there is such a recommendation in the Synapse docs. But here we have a range search - potentially several nodes will be used.
upvoted 1 times

 
vlad888
11 months ago
Actually it is clear that it should be hash-distributed. BUT Product key brings no benefit for this query - it doesn't participate in it at all. So - DateKey. Although it is unusual for Synapse.
upvoted 4 times

 
savin
11 months, 1 week ago
I don't think there is enough information to decide this. Also, we cannot decide it by just looking at one query. Considering only this query, and if we assume no other dimensions are connected to this fact table, a good answer would be D.
upvoted 2 times


Question #40 Topic 1

You are implementing a batch dataset in the Parquet format.

Data files will be produced be using Azure Data Factory and stored in Azure Data Lake Storage Gen2. The files will be consumed by an Azure
Synapse Analytics serverless SQL pool.

You need to minimize storage costs for the solution.

What should you do?

A.
Use Snappy compression for files.

B.
Use OPENROWSET to query the Parquet files.

C.
Create an external table that contains a subset of columns from the Parquet files.

D.
Store all data as string in the Parquet files.

Correct Answer:
C

An external table points to data located in Hadoop, Azure Storage blob, or Azure Data Lake Storage. External tables are used to read data from
files or write data to files in Azure Storage. With Synapse SQL, you can use external tables to read external data using dedicated SQL pool or
serverless SQL pool.

Reference:

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables
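
For reference, option C as described above corresponds to a statement of roughly this shape (a sketch only: the table, data source, file format, and column names are placeholder assumptions, and creating the table does not by itself alter the files already in storage):

CREATE EXTERNAL TABLE dbo.SalesSubset
(
    SaleId INT,
    SaleDate DATE,
    Amount DECIMAL(18, 2)
)
WITH
(
    LOCATION = '/sales/',               -- folder of Parquet files in the data lake
    DATA_SOURCE = MyLakeSource,         -- external data source created beforehand
    FILE_FORMAT = MyParquetFormat       -- external file format created beforehand
);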

Community vote distribution


A (37%) C (32%) B (32%)

 
m2shines
Highly Voted 
5 months, 1 week ago
Answer should be A, because this talks about minimizing storage costs, not querying costs
upvoted 22 times

 
assU2
4 months, 2 weeks ago
Isn't snappy a default compressionCodec for parquet in azure?

https://docs.microsoft.com/en-us/azure/data-factory/format-parquet
upvoted 8 times

 
Aslam208
Highly Voted 
5 months, 2 weeks ago
C is the correct answer, as an external table with a subset of columns with parquet files would be cost-effective.
upvoted 13 times

 
RehanRajput
3 days, 17 hours ago
This is not correct.

1. External tables are not saved in the database. (This is why they're external)

2. You're assuming that the SQL Serverless pools have a local storage. They don't -- > https://docs.microsoft.com/en-us/azure/synapse-
analytics/sql/best-practices-serverless-sql-pool
upvoted 1 times

 
Massy
3 weeks, 5 days ago
In serverless SQL pool you don't create a copy of the data, so how could it be cost effective?
upvoted 1 times

 
sdokmak
Most Recent 
1 day, 8 hours ago
Selected Answer: B
I agree with Canary2021
upvoted 1 times

 
rohitbinnani
1 month, 2 weeks ago
Selected Answer: C
not A - The default compression for a parquet file is SNAPPY. Even in Python as well.

C - because an external table that contains a subset of columns from the Parquet files will not need to re-save them in databases, and that would save storage costs.
upvoted 6 times

 
RehanRajput
3 days, 17 hours ago
This is not correct.

1. External tables are not saved in the database. (This is why they're external)

2. You're assuming that the SQL Serverless pools have a local storage. They don't -- > https://docs.microsoft.com/en-us/azure/synapse-
analytics/sql/best-practices-serverless-sql-pool
upvoted 1 times


 
DingDongSingSong
2 months ago
The answer is NOT A. Snappy compression offers fast compression, but file size at rest is larger which will translate into higher storage cost. The
answer is C where an external table with requisite columns is made available which will reduce the amount of storage
upvoted 4 times

 
cotillion
3 months, 2 weeks ago
Selected Answer: A
Only A has sth to do with the storage
upvoted 1 times

 
PallaviPatel
3 months, 4 weeks ago
A looks to be correct.
upvoted 1 times

 
dev2dev
4 months, 1 week ago
Since this is a batch process, we can delete the files once they are loaded, and we can't avoid some initial/temporary storage cost for loading the data, even with the most optimized Parquet format and compression option. So the best approach would be to store only the required columns, which can save storage. However, we can always use OPENROWSET if we are not interested in persisting the data. Yeah, like someone said, this is a shitty question with shitty options.
upvoted 2 times

 
Ramkrish39
2 months, 1 week ago
OPENROWSET is for JSON files
upvoted 1 times

 
bhushanhegde
4 months, 2 weeks ago
As per the documentation, A is the correct answer

https://docs.microsoft.com/en-us/azure/data-factory/format-parquet#dataset-properties
upvoted 1 times

 
Jaws1990
4 months, 2 weeks ago
Selected Answer: A
creating an external table with fewer columns than the file has no effect on the file itself and will actually fail so in no way helps with storage costs.

See MS documentation "The column definitions, including the data types and number of columns, must match the data in the external files. If
there's a mismatch, the file rows will be rejected when querying the actual data."
upvoted 6 times

 
Canary_2021
4 months, 3 weeks ago
Selected Answer: B
In order to query data from an external table, you need to create these 3 items. I feel that they all cost some storage.

CREATE EXTERNAL DATA SOURCE

CREATE EXTERNAL FILE FORMAT

CREATE EXTERNAL TABLE

If using OPENROWSET, you don't need to create anything, so I select B.
upvoted 5 times

 
ploer
3 months, 3 weeks ago
But this has nothing to do with storage costs. Only some bytes in the data dictionary are added and you are not even charged for this.
upvoted 1 times

 
sdokmak
1 day, 8 hours ago
no storage cost = WIN :)
upvoted 1 times

 
TestMitch
5 months, 1 week ago
This question is garbage.
upvoted 7 times

 
assU2
4 months, 2 weeks ago
Like many others...
upvoted 2 times

 
Jerrylolu
5 months, 1 week ago
That is correct. Looks like whoever put it here didn't remember it clearly.
upvoted 2 times

 
vijju23
5 months, 2 weeks ago
Answer is B, which is best as per storage cost. The reason: we query the Parquet files only when needed, using OPENROWSET.
upvoted 7 times


Question #41 Topic 1

DRAG DROP -

You need to build a solution to ensure that users can query specific files in an Azure Data Lake Storage Gen2 account from an Azure Synapse
Analytics serverless SQL pool.

Which three actions should you perform in sequence? To answer, move the appropriate actions from the list of actions to the answer area and
arrange them in the correct order.

NOTE: More than one order of answer choices is correct. You will receive credit for any of the correct orders you select.

Select and Place:

Correct Answer:

Step 1: Create an external data source

You can create external tables in Synapse SQL pools via the following steps:

1. CREATE EXTERNAL DATA SOURCE to reference an external Azure storage and specify the credential that should be used to access the
storage.

2. CREATE EXTERNAL FILE FORMAT to describe format of CSV or Parquet files.

3. CREATE EXTERNAL TABLE on top of the files placed on the data source with the same file format.

Step 2: Create an external file format object

Creating an external file format is a prerequisite for creating an external table.

Step 3: Create an external table

Reference:

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables
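
Put together, the three steps map to three T-SQL statements along these lines (a condensed sketch: every object name, the storage URL, and the column list are placeholders):

-- Step 1: external data source pointing at the storage account
CREATE EXTERNAL DATA SOURCE MyLakeSource
WITH ( LOCATION = 'https://mystorageaccount.dfs.core.windows.net/mycontainer' );

-- Step 2: external file format describing the files
CREATE EXTERNAL FILE FORMAT MyParquetFormat
WITH ( FORMAT_TYPE = PARQUET );

-- Step 3: external table on top of the files
CREATE EXTERNAL TABLE dbo.MyExternalTable
(
    Id INT,
    Name VARCHAR(100)
)
WITH
(
    LOCATION = '/data/',
    DATA_SOURCE = MyLakeSource,
    FILE_FORMAT = MyParquetFormat
);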

 
avijitd
Highly Voted 
5 months, 2 weeks ago
Looks correct answer
upvoted 12 times

 
SandipSingha
Most Recent 
2 weeks, 3 days ago
correct
upvoted 1 times

 
lotuspetall
2 months, 1 week ago
correct
upvoted 1 times

 
PallaviPatel
3 months, 4 weeks ago

correct
upvoted 2 times

 
ANath
4 months, 3 weeks ago
Correct
upvoted 1 times

 
gf2tw
5 months, 2 weeks ago
Correct
upvoted 1 times


Question #42 Topic 1

You are designing a data mart for the human resources (HR) department at your company. The data mart will contain employee information and
employee transactions.

From a source system, you have a flat extract that has the following fields:

✑ EmployeeID

✑ FirstName

✑ LastName

✑ Recipient

✑ GrossAmount

✑ TransactionID
✑ GovernmentID
✑ NetAmountPaid

✑ TransactionDate

You need to design a star schema data model in an Azure Synapse Analytics dedicated SQL pool for the data mart.

Which two tables should you create? Each correct answer presents part of the solution.

NOTE: Each correct selection is worth one point.

A.
a dimension table for Transaction

B.
a dimension table for EmployeeTransaction

C.
a dimension table for Employee

D.
a fact table for Employee

E.
a fact table for Transaction

Correct Answer:
CE

C: Dimension tables contain attribute data that might change but usually changes infrequently. For example, a customer's name and address
are stored in a dimension table and updated only when the customer's profile changes. To minimize the size of a large fact table, the customer's
name and address don't need to be in every row of a fact table. Instead, the fact table and the dimension table can share a customer ID. A query
can join the two tables to associate a customer's profile and transactions.

E: Fact tables contain quantitative data that are commonly generated in a transactional system, and then loaded into the dedicated SQL pool.
For example, a retail business generates sales transactions every day, and then loads the data into a dedicated SQL pool fact table for analysis.

Reference:

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-overview
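
As a rough illustration of how the flat extract splits across the two tables (the data types and distribution choices are assumptions, not part of the question):

-- Dimension: one row per employee, descriptive attributes
CREATE TABLE dbo.DimEmployee
(
    EmployeeKey INT IDENTITY(1,1) NOT NULL,   -- surrogate key
    EmployeeID INT NOT NULL,                  -- business key from the source
    FirstName VARCHAR(100),
    LastName VARCHAR(100),
    GovernmentID VARCHAR(50)
)
WITH ( DISTRIBUTION = REPLICATE );

-- Fact: one row per transaction, measures plus keys
CREATE TABLE dbo.FactTransaction
(
    TransactionID INT NOT NULL,
    EmployeeKey INT NOT NULL,                 -- joins to DimEmployee
    TransactionDate DATE NOT NULL,
    Recipient VARCHAR(200),
    GrossAmount DECIMAL(18, 2),
    NetAmountPaid DECIMAL(18, 2)
)
WITH ( DISTRIBUTION = HASH(TransactionID) );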

Community vote distribution


CE (100%)

 
avijitd
Highly Voted 
5 months, 2 weeks ago
Correct Answer . Emp info as Dimension & trans table as fact
upvoted 7 times

 
SandipSingha
Most Recent 
2 weeks, 3 days ago
correct
upvoted 1 times

 
tg2707
3 weeks, 2 days ago
why not fact table for employee and dim table for transactions
upvoted 1 times

 
Egocentric
1 month, 1 week ago
CE is correct
upvoted 1 times

 
NewTuanAnh
1 month, 2 weeks ago
Selected Answer: CE
CE is the correct answer
upvoted 2 times


 
SebK
2 months ago
Selected Answer: CE
CE is correct
upvoted 1 times

 
surya610
3 months ago
Selected Answer: CE
Dimension for employee and fact for transactions.
upvoted 1 times

 
PallaviPatel
3 months, 4 weeks ago
Selected Answer: CE
correct
upvoted 1 times

 
gf2tw
5 months, 2 weeks ago
Correct
upvoted 2 times


Question #43 Topic 1

You are designing a dimension table for a data warehouse. The table will track the value of the dimension attributes over time and preserve the
history of the data by adding new rows as the data changes.

Which type of slowly changing dimension (SCD) should you use?

A.
Type 0

B.
Type 1

C.
Type 2

D.
Type 3

Correct Answer:
C

A Type 2 SCD supports versioning of dimension members. Often the source system doesn't store versions, so the data warehouse load process
detects and manages changes in a dimension table. In this case, the dimension table must use a surrogate key to provide a unique reference to
a version of the dimension member. It also includes columns that define the date range validity of the version (for example, StartDate and
EndDate) and possibly a flag column (for example,

IsCurrent) to easily filter by current dimension members.

Incorrect Answers:

B: A Type 1 SCD always reflects the latest values, and when changes in source data are detected, the dimension table data is overwritten.

D: A Type 3 SCD supports storing two versions of a dimension member as separate columns. The table includes a column for the current value
of a member plus either the original or previous value of the member. So Type 3 uses additional columns to track one key instance of history,
rather than storing additional rows to track each change like in a Type 2 SCD.

Reference:

https://docs.microsoft.com/en-us/learn/modules/populate-slowly-changing-dimensions-azure-synapse-analytics-pipelines/3-choose-between-
dimension-types
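
A Type 2 dimension shaped as described above typically looks like this (a sketch: the entity, column names, and the optional IsCurrent flag are illustrative assumptions):

CREATE TABLE dbo.DimCustomer
(
    CustomerKey INT IDENTITY(1,1) NOT NULL,   -- surrogate key, unique per version of the member
    CustomerID INT NOT NULL,                  -- business key, repeated across versions
    CustomerName VARCHAR(200),
    City VARCHAR(100),
    StartDate DATE NOT NULL,                  -- when this version became valid
    EndDate DATE NULL,                        -- NULL (or 9999-12-31) for the current version
    IsCurrent BIT NOT NULL                    -- optional convenience flag
);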

Community vote distribution


C (100%)

 
gf2tw
Highly Voted 
5 months, 2 weeks ago
Correct
upvoted 12 times

 
SandipSingha
Most Recent 
2 weeks, 3 days ago
correct
upvoted 1 times

 
SandipSingha
2 weeks, 3 days ago
correct
upvoted 1 times

 
AZ9997989798979789798979789797
3 weeks, 1 day ago
Correct
upvoted 1 times

 
Onobhas01
1 month, 3 weeks ago
Selected Answer: C
Correct!
upvoted 1 times

 
surya610
3 months ago
Selected Answer: C
Correct
upvoted 1 times

 
PallaviPatel
3 months, 4 weeks ago
Selected Answer: C
correct
upvoted 1 times

 
saupats
4 months, 2 weeks ago
correct


upvoted 1 times

 
ANath
4 months, 3 weeks ago
correct
upvoted 1 times


Question #44 Topic 1

DRAG DROP -

You have data stored in thousands of CSV files in Azure Data Lake Storage Gen2. Each file has a header row followed by a properly formatted carriage return (\r) and line feed (\n).

You are implementing a pattern that batch loads the files daily into an enterprise data warehouse in Azure Synapse Analytics by using PolyBase.

You need to skip the header row when you import the files into the data warehouse. Before building the loading pattern, you need to prepare the
required database objects in Azure Synapse Analytics.

Which three actions should you perform in sequence? To answer, move the appropriate actions from the list of actions to the answer area and
arrange them in the correct order.

NOTE: Each correct selection is worth one point

Select and Place:

Correct Answer:

Step 1: Create an external data source that uses the abfs location

Create External Data Source to reference Azure Data Lake Store Gen 1 or 2

Step 2: Create an external file format and set the First_Row option.

Create External File Format.

Step 3: Use CREATE EXTERNAL TABLE AS SELECT (CETAS) and configure the reject options to specify reject values or percentages

To use PolyBase, you must create external tables to reference your external data.

Use reject options.

Note: REJECT options don't apply at the time this CREATE EXTERNAL TABLE AS SELECT statement is run. Instead, they're specified here so that
the database can use them at a later time when it imports data from the external table. Later, when the CREATE TABLE AS SELECT statement
selects data from the external table, the database will use the reject options to determine the number or percentage of rows that can fail to
import before it stops the import.

Reference:

https://docs.microsoft.com/en-us/sql/relational-databases/polybase/polybase-t-sql-objects https://docs.microsoft.com/en-us/sql/t-
sql/statements/create-external-table-as-select-transact-sql
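
The header-skip and reject settings from steps 2 and 3 would look roughly like this (the object names, location, and reject threshold are placeholder assumptions):

CREATE EXTERNAL FILE FORMAT CsvSkipHeader
WITH
(
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS ( FIELD_TERMINATOR = ',', STRING_DELIMITER = '"', FIRST_ROW = 2 )  -- FIRST_ROW = 2 skips the header row
);

CREATE EXTERNAL TABLE dbo.StagingSales
(
    SaleId INT,
    SaleDate DATE,
    Amount DECIMAL(18, 2)
)
WITH
(
    LOCATION = '/sales/',
    DATA_SOURCE = MyAbfsSource,      -- data source created with the abfss:// location
    FILE_FORMAT = CsvSkipHeader,
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 100               -- tolerate up to 100 bad rows before failing the load
);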


 
avijitd
Highly Voted 
5 months, 2 weeks ago
https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables?tabs=hadoop#create-external-data-source

Hadoop external data source in dedicated SQL pool for Azure Data Lake Gen2 pointing

CREATE DATABASE SCOPED CREDENTIAL [ADLS_credential]

WITH IDENTITY='SHARED ACCESS SIGNATURE',

SECRET = 'sv=2018-03-28&ss=bf&srt=sco&sp=rl&st=2019-10-14T12%3A10%3A25Z&se=2061-12-
31T12%3A10%3A00Z&sig=KlSU2ullCscyTS0An0nozEpo4tO5JAgGBvw%2FJX2lguw%3D'

GO

CREATE EXTERNAL DATA SOURCE AzureDataLakeStore


WITH

-- Please note the abfss endpoint when your account has secure transfer enabled

( LOCATION = 'abfss://data@newyorktaxidataset.dfs.core.windows.net' ,

CREDENTIAL = ADLS_credential ,

TYPE = HADOOP

) ;

So I guess 1. DB scoped credential

2.external DS

3.External file as mentioned by @alex


upvoted 27 times

 
Fer079
Highly Voted 
3 months, 3 weeks ago
The right answer should be:

1) Create database scoped credential

2)Create External data source

3) Create External File

"Create external table as select (CETAS)" makes no sense in this case because we would need to include a SELECT to fill the external table; however, this data must come from files and not from other tables. In this case an "external table" is not the same as an "external table as select": for the first one the data comes from files, and for the second one the data comes from a SQL query that exports it into files.
upvoted 11 times

 
ravi2931
1 month, 3 weeks ago
I was thinking same and its obvious
upvoted 1 times

 
Genere
Most Recent 
1 month, 3 weeks ago
"CETAS : Creates an external table and THEN EXPERTS, in parallel, the results of a Transact-SQL SELECT statement to Hadoop or Azure Blob
storage."

We are not looking here to export data but rather to consume data from ADLS.

The right answer should be:

1) Create database scoped credential

2)Create External data source

3) Create External File


upvoted 4 times

 
wwdba
2 months ago
1. Create database scoped credential

2. Create External data source

3. Create External File

You are implementing a pattern that batch loads the files daily...so "Create external table as select" is wrong because it'll load the data into Synapse
only once
upvoted 3 times

 
DingDongSingSong
2 months ago
According to this link, when using Polybase: https://docs.microsoft.com/en-us/sql/relational-databases/polybase/polybase-configure-sql-server?
view=sql-server-ver15

Step 1: CREATE DATABASE SCOPED CREDENTIAL (Transact-SQL)

Step 2: CREATE EXTERNAL DATA SOURCE (Transact-SQL)

Step 3: CREATE EXTERNAL TABLE (Transact-SQL)

Therefore, answer is A,B,C. Correct?


upvoted 1 times

 
ovokpus
3 months ago
Why should you be the one to create the database scoped credential? You ought to have that already
upvoted 1 times

 
VeroDon
4 months, 3 weeks ago
"Azure Synapse Analytics uses a database scoped credential to access non-public Azure blob storage with PolyBase" The question doesnt mention
if the storage is or is not private

https://docs.microsoft.com/en-us/sql/t-sql/statements/create-database-scoped-credential-transact-sql?view=sql-server-ver15
upvoted 3 times

 
ANath
4 months, 1 week ago


Now that clears up the confusion about whether a database scoped credential is needed in this context or not. By going through VeroDon's provided link, it is clear that a database scoped credential is needed for non-public Azure Blob storage.
upvoted 1 times

 
VeroDon
4 months, 3 weeks ago
Correct

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/load-data-overview
upvoted 2 times

 
alexleonvalencia
5 months, 2 weeks ago
Step 1 : Create External data source ...

Step 2 : Create External File ....

Step 3 : Use Create External table ...


upvoted 3 times

 
alexleonvalencia
5 months, 2 weeks ago
Correction:

1) Create database scoped credential

2)Create External data source

3) Create External File


upvoted 7 times


Question #45 Topic 1

HOTSPOT -

You are building an Azure Synapse Analytics dedicated SQL pool that will contain a fact table for transactions from the first half of the year 2020.

You need to ensure that the table meets the following requirements:

✑ Minimizes the processing time to delete data that is older than 10 years

✑ Minimizes the I/O for queries that use year-to-date values

How should you complete the Transact-SQL statement? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.

Hot Area:


Correct Answer:

Box 1: PARTITION -

RANGE RIGHT FOR VALUES is used with PARTITION.

Part 2: [TransactionDateID]

Partition on the date column.

Example: Creating a RANGE RIGHT partition function on a datetime column

The following partition function partitions a table or index into 12 partitions, one for each month of a year's worth of values in a datetime
column.

CREATE PARTITION FUNCTION [myDateRangePF1] (datetime)

AS RANGE RIGHT FOR VALUES ('20030201', '20030301', '20030401',

'20030501', '20030601', '20030701', '20030801',

'20030901', '20031001', '20031101', '20031201');

Reference:

https://docs.microsoft.com/en-us/sql/t-sql/statements/create-partition-function-transact-sql
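
In the CREATE TABLE form the question uses, the same idea is expressed inline (a sketch: the surrounding columns, the distribution choice, and the exact boundary values are assumptions):

CREATE TABLE dbo.FactTransaction
(
    TransactionKey INT NOT NULL,
    TransactionDateID INT NOT NULL,   -- e.g. 20200101 for 1 January 2020
    Amount DECIMAL(18, 2)
)
WITH
(
    DISTRIBUTION = HASH(TransactionKey),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION
    (
        TransactionDateID RANGE RIGHT FOR VALUES
        (20200101, 20200201, 20200301, 20200401, 20200501, 20200601)
    )
);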

 
gf2tw
Highly Voted 
5 months, 2 weeks ago
Correct
upvoted 6 times

 
gabdu
Most Recent 
3 weeks, 2 days ago
How are we ensuring "Minimizes the processing time to delete data that is older than 10 years"?
upvoted 2 times

 
wwdba
2 months, 2 weeks ago
Correct
upvoted 1 times

 
PallaviPatel
3 months, 3 weeks ago
correct
upvoted 1 times

 
saupats
4 months, 2 weeks ago
correct
upvoted 1 times


Question #46 Topic 1

You are performing exploratory analysis of the bus fare data in an Azure Data Lake Storage Gen2 account by using an Azure Synapse Analytics
serverless SQL pool.

You execute the Transact-SQL query shown in the following exhibit.

What do the query results include?

A.
Only CSV files in the tripdata_2020 subfolder.

B.
All files that have file names that begin with "tripdata_2020".

C.
All CSV files that have file names that contain "tripdata_2020".

D.
Only CSV files that have file names that begin with "tripdata_2020".

Correct Answer:
D
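
The exhibit is not reproduced here, but a serverless SQL pool query of the shape the answer describes would look roughly like the following (the storage URL and folder are placeholder assumptions; the key detail is that the 'tripdata_2020*.csv' pattern matches only CSV files whose names begin with tripdata_2020):

SELECT *
FROM OPENROWSET(
    BULK 'https://mystorageaccount.dfs.core.windows.net/fares/tripdata_2020*.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE
) AS [rows];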

Community vote distribution


D (100%)

 
gf2tw
Highly Voted 
5 months, 2 weeks ago
Correct
upvoted 10 times

 
Egocentric
Most Recent 
1 month, 1 week ago
on this one you need to pay attention to wording
upvoted 1 times

 
jskibick
1 month, 2 weeks ago
Selected Answer: D
D all good
upvoted 1 times

 
sarapaisley
1 month, 2 weeks ago
Selected Answer: D
D is correct
upvoted 1 times

 
SebK
2 months ago
Selected Answer: D
Correct
upvoted 1 times

 
DingDongSingSong
2 months ago
Why is option C not correct, when the code has "tripdata_2020*.csv", which means that a wildcard is used with "tripdata_2020" CSV files? So, for example, tripdata_2020A.csv, tripdata_2020B.csv, and tripdata_2020YZ.csv would all 3 be queried. Option D does not make sense, even grammatically.
upvoted 1 times


 
PallaviPatel
3 months, 3 weeks ago
Selected Answer: D
correct
upvoted 2 times

 
anto69
4 months, 2 weeks ago
No doubts is correct, no doubts is ans D
upvoted 1 times

 
duds19
5 months, 2 weeks ago
Why not B?
upvoted 1 times

 
Nifl91
5 months, 1 week ago
Because of the .csv at the end
upvoted 3 times


Question #47 Topic 1

DRAG DROP -

You use PySpark in Azure Databricks to parse the following JSON input.

You need to output the data in the following tabular format.

How should you complete the PySpark code? To answer, drag the appropriate values to the correct targets. Each value may be used once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content.

NOTE: Each correct selection is worth one point.

Select and Place:

Correct Answer:

Box 1: select -

Box 2: explode -


Box 3: alias -

pyspark.sql.Column.alias returns this column aliased with a new name or names (in the case of expressions that return more than one column,
such as explode).

Reference:

https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.Column.alias.html https://docs.microsoft.com/en-
us/azure/databricks/sql/language-manual/functions/explode

 
galacaw
4 weeks ago
Correct
upvoted 2 times


Question #48 Topic 1

HOTSPOT -

You are designing an application that will store petabytes of medical imaging data.

When the data is first created, the data will be accessed frequently during the first week. After one month, the data must be accessible within 30
seconds, but files will be accessed infrequently. After one year, the data will be accessed infrequently but must be accessible within five minutes.

You need to select a storage strategy for the data. The solution must minimize costs.

Which storage tier should you use for each time frame? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.

Hot Area:

Correct Answer:

Box 1: Hot -

Hot tier - An online tier optimized for storing data that is accessed or modified frequently. The Hot tier has the highest storage costs, but the
lowest access costs.


Box 2: Cool -
Cool tier - An online tier optimized for storing data that is infrequently accessed or modified. Data in the Cool tier should be stored for a
minimum of 30 days. The

Cool tier has lower storage costs and higher access costs compared to the Hot tier.

Box 3: Cool -
Not Archive tier - An offline tier optimized for storing data that is rarely accessed, and that has flexible latency requirements, on the order of
hours. Data in the

Archive tier should be stored for a minimum of 180 days.

Reference:

https://docs.microsoft.com/en-us/azure/storage/blobs/access-tiers-overview https://www.altaro.com/hyper-v/azure-archive-storage/

 
nefarious_smalls
2 weeks, 4 days ago
Why would it not be be Hot Cool and Archive
upvoted 1 times

 
SandipSingha
2 weeks, 2 days ago
After one year, the data will be accessed infrequently but must be accessible within five minutes.
upvoted 2 times

 
Guincimund
2 weeks, 3 days ago
"After one year, the data will be accessed infrequently but must be accessible within five minutes"

The latency for the first byte is "hours" for Archive, so because they want to be able to access the data within 5 minutes, you need to place it in "cool".

So the answer is correct.


upvoted 2 times

 
nefarious_smalls
2 weeks, 4 days ago
I dont know
upvoted 1 times

 
Andy91
1 month ago
Correct answer!

Hot, Cool, Cool


upvoted 1 times


Question #49 Topic 1

You have an Azure Synapse Analytics Apache Spark pool named Pool1.

You plan to load JSON files from an Azure Data Lake Storage Gen2 container into the tables in Pool1. The structure and data types vary by file.

You need to load the files into the tables. The solution must maintain the source data types.

What should you do?

A.
Use a Conditional Split transformation in an Azure Synapse data flow.

B.
Use a Get Metadata activity in Azure Data Factory.

C.
Load the data by using the OPENROWSET Transact-SQL command in an Azure Synapse Analytics serverless SQL pool.

D.
Load the data by using PySpark.

Correct Answer:
C

Serverless SQL pool can automatically synchronize metadata from Apache Spark. A serverless SQL pool database will be created for each
database existing in serverless Apache Spark pools.

Serverless SQL pool enables you to query data in your data lake. It offers a T-SQL query surface area that accommodates semi-structured and
unstructured data queries.

To support a smooth experience for in place querying of data that's located in Azure Storage files, serverless SQL pool uses the OPENROWSET
function with additional capabilities.

The easiest way to see the content of your JSON file is to provide the file URL to the OPENROWSET function and specify the CSV FORMAT.

Reference:

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/query-json-files https://docs.microsoft.com/en-us/azure/synapse-
analytics/sql/query-data-storage
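
For reference, the OPENROWSET approach described in the explanation reads each JSON document as a single text column and then parses it with JSON functions, roughly as below (a sketch: the URL and property names are placeholders). Note that the comments below argue for PySpark instead, since the target is a Spark pool.

SELECT
    JSON_VALUE(doc, '$.id')   AS id,
    JSON_VALUE(doc, '$.name') AS name
FROM OPENROWSET(
    BULK 'https://mystorageaccount.dfs.core.windows.net/data/*.json',
    FORMAT = 'CSV',
    FIELDTERMINATOR = '0x0b',   -- unused control characters so each document
    FIELDQUOTE = '0x0b'         -- arrives as a single column
) WITH (doc NVARCHAR(MAX)) AS [rows];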

Community vote distribution


D (100%)

 
Ben_1010
4 days, 18 hours ago
Why PySpark?
upvoted 1 times

 
Andushi
2 weeks, 6 days ago
Selected Answer: D
Should be D, I agree with @galacaw
upvoted 1 times

 
galacaw
4 weeks ago
Should be D, it's about Apache Spark pool, not serverless SQL pool.
upvoted 4 times


Question #50 Topic 1

You have an Azure Databricks workspace named workspace1 in the Standard pricing tier. Workspace1 contains an all-purpose cluster named
cluster1.

You need to reduce the time it takes for cluster1 to start and scale up. The solution must minimize costs.

What should you do first?

A.
Configure a global init script for workspace1.

B.
Create a cluster policy in workspace1.

C.
Upgrade workspace1 to the Premium pricing tier.

D.
Create a pool in workspace1.

Correct Answer:
D

You can use Databricks Pools to Speed up your Data Pipelines and Scale Clusters Quickly.

Databricks Pools are a managed cache of virtual machine instances that enables clusters to start and scale 4 times faster.

Reference:

https://databricks.com/blog/2019/11/11/databricks-pools-speed-up-data-pipelines.html

 
Maggiee
1 week, 5 days ago
Answer should be C
upvoted 2 times

 
sdokmak
1 day, 8 hours ago
Answer is D:

looking at the reference link, pool works for this. Optimized scaling not needed to reduce 'start and scale up' times only.
upvoted 1 times

 
galacaw
4 weeks ago
Correct
upvoted 4 times


Question #51 Topic 1

HOTSPOT -

You are building an Azure Stream Analytics job that queries reference data from a product catalog file. The file is updated daily.

The reference data input details for the file are shown in the Input exhibit. (Click the Input tab.)

The storage account container view is shown in the Refdata exhibit. (Click the Refdata tab.)

You need to configure the Stream Analytics job to pick up the new reference data.

What should you configure? To answer, select the appropriate options in the answer area.


NOTE: Each correct selection is worth one point.

Hot Area:

Correct Answer:

Box 1: {date}/product.csv -

In the 2nd exhibit we see: Location: refdata / 2020-03-20

Note: Path Pattern: This is a required property that is used to locate your blobs within the specified container. Within the path, you may choose
to specify one or more instances of the following 2 variables:

{date}, {time}

Example 1: products/{date}/{time}/product-list.csv

Example 2: products/{date}/product-list.csv

Example 3: product-list.csv -


Box 2: YYYY-MM-DD -

Note: Date Format [optional]: If you have used {date} within the Path Pattern that you specified, then you can select the date format in which
your blobs are organized from the drop-down of supported formats.

Example: YYYY/MM/DD, MM/DD/YYYY, etc.

Reference:

https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-use-reference-data

 
inotbf83
3 weeks, 2 days ago
I would change box 2 to YYYY/MM/DD (as shown in the 1st exhibit). A bit confusing with the time format in box 1.
upvoted 4 times

 
jackttt
4 weeks, 1 day ago
The file is updated daily, so I think `{date}/product.csv` is correct.
upvoted 4 times

 
Lotusss
1 month ago
Wrong! Path Pattern: {date}/{time}/product.csv

Date format: yyyy-mm-dd

https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-use-reference-data
upvoted 2 times

 
KashRaynardMorse
3 weeks, 4 days ago
See that the file is stored under the date folder, and there is no time folder.

Your link does recommend the time part, but the link also says it's optional, and ultimately you need to answer the question, which states the path without the time.
upvoted 4 times


Question #52 Topic 1

HOTSPOT -

You have the following Azure Stream Analytics query.

For each of the following statements, select Yes if the statement is true. Otherwise, select No.

NOTE: Each correct selection is worth one point.

Hot Area:

Correct Answer:

Box 1: No -

Note: You can now use a new extension of Azure Stream Analytics SQL to specify the number of partitions of a stream when reshuffling the
data.

The outcome is a stream that has the same partition scheme. Please see below for an example:

WITH step1 AS (SELECT * FROM [input1] PARTITION BY DeviceID INTO 10), step2 AS (SELECT * FROM [input2] PARTITION BY DeviceID INTO 10)

SELECT * INTO [output] FROM step1 PARTITION BY DeviceID UNION step2 PARTITION BY DeviceID

Note: The new extension of Azure Stream Analytics SQL includes a keyword INTO that allows you to specify the number of partitions for a
stream when performing reshuffling using a PARTITION BY statement.


Box 2: Yes -

When joining two streams of data explicitly repartitioned, these streams must have the same partition key and partition count.

Box 3: Yes -

Streaming Units (SUs) represents the computing resources that are allocated to execute a Stream Analytics job. The higher the number of SUs,
the more CPU and memory resources are allocated for your job.

In general, the best practice is to start with 6 SUs for queries that don't use PARTITION BY.

Here there are 10 partitions, so 6x10 = 60 SUs is good.

Note: Remember, Streaming Unit (SU) count, which is the unit of scale for Azure Stream Analytics, must be adjusted so the number of physical
resources available to the job can fit the partitioned flow. In general, six SUs is a good number to assign to each partition. In case there are
insufficient resources assigned to the job, the system will only apply the repartition if it benefits the job.

Reference:

https://azure.microsoft.com/en-in/blog/maximize-throughput-with-repartitioning-in-azure-stream-analytics/ https://docs.microsoft.com/en-
us/azure/stream-analytics/stream-analytics-streaming-unit-consumption

 
TacoB
2 weeks, 5 days ago
Reading https://docs.microsoft.com/en-us/stream-analytics-query/union-azure-stream-analytics and the second sample given in there I would
expect the first one to be No.
upvoted 1 times

 
Akshay_1995
3 weeks, 3 days ago
Correct
upvoted 1 times


Question #53 Topic 1

HOTSPOT -

You are building a database in an Azure Synapse Analytics serverless SQL pool.

You have data stored in Parquet files in an Azure Data Lake Storage Gen2 container.

Records are structured as shown in the following sample.

"id": 123,

"address_housenumber": "19c",

"address_line": "Memory Lane",

"applicant1_name": "Jane",

"applicant2_name": "Dev"

The records contain two applicants at most.

You need to build a table that includes only the address fields.

How should you complete the Transact-SQL statement? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.

Hot Area:

Correct Answer:

Box 1: CREATE EXTERNAL TABLE -

An external table points to data located in Hadoop, Azure Storage blob, or Azure Data Lake Storage. External tables are used to read data from
files or write data to files in Azure Storage. With Synapse SQL, you can use external tables to read external data using dedicated SQL pool or


serverless SQL pool.

Syntax:

CREATE EXTERNAL TABLE { database_name.schema_name.table_name | schema_name.table_name | table_name }

( <column_definition> [ ,...n ] )

WITH (

LOCATION = 'folder_or_filepath',

DATA_SOURCE = external_data_source_name,

FILE_FORMAT = external_file_format_name

Box 2. OPENROWSET -

When using serverless SQL pool, CETAS is used to create an external table and export query results to Azure Storage Blob or Azure Data Lake
Storage Gen2.

Example:

AS -

SELECT decennialTime, stateName, SUM(population) AS population

FROM -

OPENROWSET(BULK
'https://azureopendatastorage.blob.core.windows.net/censusdatacontainer/release/us_population_county/year=*/*.parquet',

FORMAT='PARQUET') AS [r]

GROUP BY decennialTime, stateName

GO -

Reference:

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables
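
Assembled, the statement takes roughly this shape (the table name, data source, file format, output location, and storage URL are placeholder assumptions; only the address columns are selected):

CREATE EXTERNAL TABLE dbo.Addresses
WITH
(
    LOCATION = 'addresses/',
    DATA_SOURCE = MyLakeSource,
    FILE_FORMAT = MyParquetFormat
)
AS
SELECT address_housenumber, address_line
FROM OPENROWSET(
    BULK 'https://mystorageaccount.dfs.core.windows.net/records/*.parquet',
    FORMAT = 'PARQUET'
) AS [r];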

 
SandipSingha
2 weeks, 2 days ago
correct
upvoted 2 times

 
Feljoud
3 weeks ago
correct
upvoted 3 times


Question #54 Topic 1

HOTSPOT -

You have an Azure Synapse Analytics dedicated SQL pool named Pool1 and an Azure Data Lake Storage Gen2 account named Account1.

You plan to access the files in Account1 by using an external table.

You need to create a data source in Pool1 that you can reference when you create the external table.

How should you complete the Transact-SQL statement? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.

Hot Area:

Correct Answer:

Box 1: blob -

The following example creates an external data source for Azure Data Lake Gen2

CREATE EXTERNAL DATA SOURCE YellowTaxi

WITH ( LOCATION = 'https://azureopendatastorage.blob.core.windows.net/nyctlc/yellow/',

TYPE = HADOOP)

Box 2: HADOOP -

Reference:

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables
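
For comparison with the blob-endpoint example above, the Data Lake Storage Gen2-specific form discussed in the comments below uses the abfss scheme with the dfs host name (the account and container names are placeholders):

CREATE EXTERNAL DATA SOURCE MyLakeSource
WITH
(
    LOCATION = 'abfss://mycontainer@mystorageaccount.dfs.core.windows.net',
    TYPE = HADOOP
);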

 
Jmanuelleon
1 week, 1 day ago
It's confusing... in the definition for LOCATION it says to use DFS, https://docs.microsoft.com/es-es/azure/synapse-analytics/sql/develop-tables-external-tables?tabs=hadoop#location, but in the example that appears further down it uses the opposite, https://docs.microsoft.com/es-es/azure/synapse-analytics/sql/develop-tables-external-tables?tabs=hadoop#example-for-create-external-data-source (The following example creates an external data source for Azure Data Lake Gen2 that points to the publicly available New York dataset: CREATE EXTERNAL DATA SOURCE YellowTaxi

WITH ( LOCATION = 'https://azureopendatastorage.blob.core.windows.net/nyctlc/yellow/',


TYPE = HADOOP))
upvoted 1 times

 
hbad
1 week, 2 days ago

It is hadoop and dfs. For dfs see link below location section:

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables?tabs=hadoop
upvoted 2 times

 
LetsPassExams
2 weeks, 5 days ago
From https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables?tabs=hadoop#example-for-create-external-
data-source

The following example creates an external data source for Azure Data Lake Gen2 pointing to the publicly available New York data set:


CREATE EXTERNAL DATA SOURCE YellowTaxi

WITH ( LOCATION = 'https://azureopendatastorage.blob.core.windows.net/nyctlc/yellow/',


TYPE = HADOOP)
upvoted 2 times

 
LetsPassExams
2 weeks, 5 days ago
I think the answer is correct:

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-external-tables?tabs=hadoop#example-for-create-external-data-
source
upvoted 1 times

 
shrikantK
3 weeks, 5 days ago
dfs is the answer, as the question is about Azure Data Lake Storage Gen2. If the question was about Blob storage, then the answer would have been blob.
upvoted 1 times

 
Andushi
3 weeks, 6 days ago
1. is DFS
upvoted 2 times

 
Andushi
4 weeks ago
I agree with galacaw is dfs and type Hadoop
upvoted 2 times

 
galacaw
4 weeks ago
1. dfs (for Azure Data Lake Storage Gen2)
upvoted 4 times


Question #55 Topic 1

You have an Azure subscription that contains an Azure Blob Storage account named storage1 and an Azure Synapse Analytics dedicated SQL pool
named

Pool1.

You need to store data in storage1. The data will be read by Pool1. The solution must meet the following requirements:

✑ Enable Pool1 to skip columns and rows that are unnecessary in a query.

✑ Automatically create column statistics.

✑ Minimize the size of files.

Which type of file should you use?

A.
JSON

B.
Parquet

C.
Avro

D.
CSV

Correct Answer:
B

Automatic creation of statistics is turned on for Parquet files. For CSV files, you need to create statistics manually until automatic creation of
CSV files statistics is supported.

Reference:

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-statistics

Community vote distribution


B (100%)

 
ClassMistress
1 week, 1 day ago
Selected Answer: B
Automatic creation of statistics is turned on for Parquet files. For CSV files, you need to create statistics manually until automatic creation of CSV
files statistics is supported.
upvoted 1 times

 
sdokmak
1 day, 7 hours ago
Good point, also better cost
upvoted 1 times

 
shachar_ash
2 weeks, 1 day ago
Correct
upvoted 2 times


Question #56 Topic 1

DRAG DROP -

You plan to create a table in an Azure Synapse Analytics dedicated SQL pool.

Data in the table will be retained for five years. Once a year, data that is older than five years will be deleted.

You need to ensure that the data is distributed evenly across partitions. The solution must minimize the amount of time required to delete old
data.

How should you complete the Transact-SQL statement? To answer, drag the appropriate values to the correct targets. Each value may be used
once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content.

NOTE: Each correct selection is worth one point.

Select and Place:

Correct Answer:

Box 1: HASH -


Box 2: OrderDateKey -

In most cases, table partitions are created on a date column.

A way to eliminate rollbacks is to use Metadata Only operations like partition switching for data management. For example, rather than execute
a DELETE statement to delete all rows in a table where the order_date was in October of 2001, you could partition your data early. Then you can
switch out the partition with data for an empty partition from another table.

Reference:

https://docs.microsoft.com/en-us/sql/t-sql/statements/create-table-azure-sql-data-warehouse https://docs.microsoft.com/en-
us/azure/synapse-analytics/sql/best-practices-dedicated-sql-pool
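
The partition-switch pattern mentioned above, which turns the yearly delete into a metadata-only operation, looks roughly like this (the table names, boundary values, and partition number are illustrative assumptions):

-- Empty table with the same schema, distribution, and partition boundaries
CREATE TABLE dbo.FactSales_Old
WITH
(
    DISTRIBUTION = HASH(ProductKey),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION ( OrderDateKey RANGE RIGHT FOR VALUES (20180101, 20190101, 20200101, 20210101, 20220101) )
)
AS SELECT * FROM dbo.FactSales WHERE 1 = 2;

-- Move the oldest partition out of the fact table, then drop it with the staging table
ALTER TABLE dbo.FactSales SWITCH PARTITION 1 TO dbo.FactSales_Old PARTITION 1;
DROP TABLE dbo.FactSales_Old;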

 
ClassMistress
1 week, 1 day ago
I think it is Hash because the question refers to a fact table.
upvoted 1 times

 
jebias
1 month ago
I think the first answer should be Round-Robin as it should be distributed evenly.

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute
upvoted 2 times

 
Feljoud
1 month ago
While you are right, that Round-Robin guarantees an even distribution, it is only recommended to use on small tables < 2 GB (see your link).
Using the Hash of the ProductKey will also allow for an even distribution but in a more efficient manner.

Also, the Syntax here would be wrong if you would insert Round-Robin. As in that case it would only say: "DISTRIBUTION = ROUND-ROBIN" (no
ProductKey)
upvoted 10 times

 
nefarious_smalls
2 weeks, 4 days ago
You are exactly right.
upvoted 1 times

 
Muishkin
3 weeks, 6 days ago
yes i think so too
upvoted 1 times

 
Massy
3 weeks, 5 days ago
the syntax is ok only for HASH
upvoted 2 times


Question #57 Topic 1

HOTSPOT -

You have an Azure Data Lake Storage Gen2 service.

You need to design a data archiving solution that meets the following requirements:

✑ Data that is older than five years is accessed infrequently but must be available within one second when requested.

✑ Data that is older than seven years is NOT accessed.

✑ Costs must be minimized while maintaining the required availability.

How should you manage the data? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.

Hot Area:

Correct Answer:

Box 1: Move to cool storage -

Box 2: Move to archive storage -

Archive - Optimized for storing data that is rarely accessed and stored for at least 180 days with flexible latency requirements, on the order of
hours.

The following table shows a comparison of premium performance block blob storage, and the hot, cool, and archive access tiers.


Reference:

https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blob-storage-tiers

 
sagur
Highly Voted 
1 month ago
If "Data that is older than seven years is NOT accessed" then this data can be deleted to minimize the storage costs, right?
upvoted 5 times

 
Feljoud
1 month ago
Would agree, but the question states: "a data archiving solution", so maybe to keep the data was implied with this?
upvoted 2 times

 
noobprogrammer
1 month ago
Makes sense to me
upvoted 1 times

 
Massy
1 month ago
I agree, should be deleted
upvoted 1 times

 
KashRaynardMorse
3 weeks, 4 days ago
Deleting data older than 7 years is not an option available in the answer list. Be careful of the gotcha: 'Delete the blob' is an option, but it would delete all the data, including the data that is e.g. 5 years old. So you can't choose that answer. So the next best thing to do is to put it into archive.
upvoted 5 times

 
Boompiee
2 weeks, 1 day ago
I'm confused by your comment. It clearly does state an option to delete the blob after 7 years.
upvoted 1 times


Topic 2 - Question Set 2


Question #1 Topic 2

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that
might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.

After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.

You plan to create an Azure Databricks workspace that has a tiered structure. The workspace will contain the following three workloads:

✑ A workload for data engineers who will use Python and SQL.

✑ A workload for jobs that will run notebooks that use Python, Scala, and SQL.

✑ A workload that data scientists will use to perform ad hoc analysis in Scala and R.

The enterprise architecture team at your company identifies the following standards for Databricks environments:

✑ The data engineers must share a cluster.

✑ The job cluster will be managed by using a request process whereby data scientists and data engineers provide packaged notebooks for
deployment to the cluster.

✑ All the data scientists must be assigned their own cluster that terminates automatically after 120 minutes of inactivity. Currently, there are
three data scientists.

You need to create the Databricks clusters for the workloads.

Solution: You create a Standard cluster for each data scientist, a High Concurrency cluster for the data engineers, and a Standard cluster for the
jobs.
Does this meet the goal?

A.
Yes

B.
No

Correct Answer:
B

We would need a High Concurrency cluster for the jobs.

Note:

Standard clusters are recommended for a single user. Standard can run workloads developed in any language: Python, R, Scala, and SQL.

A high concurrency cluster is a managed cloud resource. The key benefits of high concurrency clusters are that they provide Apache Spark-
native fine-grained sharing for maximum resource utilization and minimum query latencies.

Reference:

https://docs.azuredatabricks.net/clusters/configure.html

Community vote distribution


A (89%) 11%

 
Amalbenrebai
Highly Voted 
8 months, 4 weeks ago
- data engineers: high concurrency cluster

- jobs: Standard cluster

- data scientists: Standard cluster


upvoted 53 times

 
Egocentric
1 month, 1 week ago
agreed
upvoted 1 times

 
Julius7000
8 months, 1 week ago
Tell me one thing: is this answer (jobs) based on the text:

"A Single Node cluster has no workers and runs Spark jobs on the driver node.

In contrast, a Standard cluster requires at least one Spark worker node in addition to the driver node to execute Spark jobs."?

I don't understand the connection between worker nodes and the requirements given in the question about the jobs workspace.
upvoted 1 times

 
gangstfear
Highly Voted 
8 months, 4 weeks ago
The answer must be A!
upvoted 31 times

 
Eyepatch993
Most Recent 
2 months ago
Selected Answer: B
Standard clusters do not have fault tolerance. Both the data scientist and data engineers will be using the job cluster for processing their
notebooks, so if a standard cluster is chosen and a fault occurs in the notebook of any one user, there is a chance that other notebooks might also
fail. Due to this a high concurrency cluster is recommended for running jobs.
upvoted 2 times


 
Boompiee
2 weeks, 1 day ago
It may not be a best practice, but the question asked is: does the solution meet the stated requirements? And it does.
upvoted 1 times

 
Hanse
2 months, 2 weeks ago
As per Link: https://docs.azuredatabricks.net/clusters/configure.html

Standard and Single Node clusters terminate automatically after 120 minutes by default. --> Data Scientists

High Concurrency clusters do not terminate automatically by default.

A Standard cluster is recommended for a single user. --> Standard for Data Scientists & High Concurrency for Data Engineers

Standard clusters can run workloads developed in any language: Python, SQL, R, and Scala.

High Concurrency clusters can run workloads developed in SQL, Python, and R. The performance and security of High Concurrency clusters is
provided by running user code in separate processes, which is not possible in Scala. --> Jobs needs Standard
upvoted 2 times

 
ovokpus
3 months ago
Selected Answer: A
Yes it seems to be!
upvoted 2 times

 
PallaviPatel
3 months, 3 weeks ago
Selected Answer: A
correct
upvoted 2 times

 
kilowd
4 months ago
Selected Answer: A
Data Engineers - High Concurrency cluster, as it provides for sharing. It also caters for SQL, Python, and R.

Data Scientists - Standard clusters, which automatically terminate after 120 minutes and cater for Scala, SQL, Python, and R.

Jobs - Standard cluster


upvoted 2 times

 
let_88
4 months ago
As per the doc in Microsoft the High Concurrency cluster doesn't support Scala.

High Concurrency clusters can run workloads developed in SQL, Python, and R. The performance and security of High Concurrency clusters is
provided by running user code in separate processes, which is not possible in Scala.

https://docs.microsoft.com/en-us/azure/databricks/clusters/configure#cluster-mode
upvoted 6 times

 
tesen_tolga
4 months, 1 week ago
Selected Answer: A
The answer must be A!
upvoted 2 times

 
SabaJamal2010AtGmail
4 months, 3 weeks ago
The solution does not meet the requirement because: "High Concurrency clusters work only for SQL, Python, and R. The performance and security
of High Concurrency clusters is provided by running user code in separate processes, which is not possible in Scala.
upvoted 1 times

 
FredNo
6 months ago
Selected Answer: A
Data scientists and jobs use Scala so they need standard cluster
upvoted 9 times

 
Aslam208
6 months, 3 weeks ago
Answer is A.
upvoted 4 times

 
gangstfear
9 months ago
Shouldn't the answer be A, as all the requirements are met:

Data Scientist - Standard

Data Engineer - High Concurrency

Jobs - Standard
upvoted 13 times

 
satyamkishoresingh
8 months, 4 weeks ago
Yes , Given solution is correct.
upvoted 6 times

 
echerish
9 months ago
Questions 23 and 24 seem to have been swapped. The key is:

Data Scientist - Standard

Data Engineer - High Concurrency

Jobs - Standard
https://www.examtopics.com/exams/microsoft/dp-203/custom-view/ 134/161
5/26/22, 3:46 PM DP-203 Exam – Free Actual Q&As, Page 1 | ExamTopics

upvoted 7 times

 
MoDar
9 months ago
Answer A

Scala is not supported in High Concurrency cluster --> Jobs & Data scientists --> Standard

Data engineers --> High Concurrency


upvoted 8 times

 
damaldon
11 months, 1 week ago
Answer: B

-Data scientist should have their own cluster and should terminate after 120 mins - STANDARD

-Cluster for Jobs should support scala - STANDARD

https://docs.microsoft.com/en-us/azure/databricks/clusters/configure
upvoted 6 times

 
Sunnyb
11 months, 3 weeks ago
B is the correct answer

Link below:

https://docs.microsoft.com/en-us/azure/databricks/clusters/configure
upvoted 3 times


Question #2 Topic 2

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that
might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.

After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.

You plan to create an Azure Databricks workspace that has a tiered structure. The workspace will contain the following three workloads:

✑ A workload for data engineers who will use Python and SQL.

✑ A workload for jobs that will run notebooks that use Python, Scala, and SQL.

✑ A workload that data scientists will use to perform ad hoc analysis in Scala and R.

The enterprise architecture team at your company identifies the following standards for Databricks environments:

✑ The data engineers must share a cluster.

✑ The job cluster will be managed by using a request process whereby data scientists and data engineers provide packaged notebooks for
deployment to the cluster.

✑ All the data scientists must be assigned their own cluster that terminates automatically after 120 minutes of inactivity. Currently, there are
three data scientists.

You need to create the Databricks clusters for the workloads.

Solution: You create a Standard cluster for each data scientist, a High Concurrency cluster for the data engineers, and a High Concurrency cluster
for the jobs.

Does this meet the goal?

A.
Yes

B.
No

Correct Answer:
A

We need a High Concurrency cluster for the data engineers and the jobs.

Note: Standard clusters are recommended for a single user. Standard can run workloads developed in any language: Python, R, Scala, and SQL.

A high concurrency cluster is a managed cloud resource. The key benefits of high concurrency clusters are that they provide Apache Spark-
native fine-grained sharing for maximum resource utilization and minimum query latencies.

Reference:

https://docs.azuredatabricks.net/clusters/configure.html

Community vote distribution


B (100%)

 
dfdsfdsfsd
Highly Voted 
1 year ago
High-concurrency clusters do not support Scala. So the answer is still 'No' but the reasoning is wrong.

https://docs.microsoft.com/en-us/azure/databricks/clusters/configure
upvoted 31 times

 
Preben
11 months, 3 weeks ago
I agree that High concurrency does not support Scala. But they specified using a Standard cluster for the jobs, which does support Scala. Why is
the answer 'No'?
upvoted 2 times

 
eng1
11 months, 1 week ago
Because the High Concurrency cluster for each data scientist is not correct, it should be standard for a single user!
upvoted 4 times

 
FRAN__CO_HO
Highly Voted 
11 months, 1 week ago
Answer should be NO:

Data scientists: STANDARD, as they need to run Scala

Jobs: STANDARD, as they need to run Scala

Data engineers: High Concurrency clusters, for better resource sharing


upvoted 10 times

 
ClassMistress
Most Recent 
1 week, 1 day ago
Selected Answer: B
High Concurrency clusters is provided by running user code in separate processes, which is not possible in Scala.
upvoted 1 times

 
narendra399
1 month, 3 weeks ago
Questions 1 and 2 are the same, but the answers are different. Why?
upvoted 2 times


 
Hanse
2 months, 2 weeks ago
As per Link: https://docs.azuredatabricks.net/clusters/configure.html

Standard and Single Node clusters terminate automatically after 120 minutes by default. --> Data Scientists

High Concurrency clusters do not terminate automatically by default.

A Standard cluster is recommended for a single user. --> Standard for Data Scientists & High Concurrency for Data Engineers

Standard clusters can run workloads developed in any language: Python, SQL, R, and Scala.

High Concurrency clusters can run workloads developed in SQL, Python, and R. The performance and security of High Concurrency clusters is
provided by running user code in separate processes, which is not possible in Scala. --> Jobs needs Standard
upvoted 2 times

 
lukeonline
4 months, 3 weeks ago
Selected Answer: B
high concurrency does not support scala
upvoted 2 times

 
rashjan
5 months, 2 weeks ago
Selected Answer: B
wrong: no
upvoted 1 times

 
FredNo
6 months ago
Selected Answer: B
Answer is no because high concurrency does not support scala
upvoted 5 times

 
Aslam208
6 months, 3 weeks ago
Answer is No
upvoted 2 times

 
damaldon
11 months, 1 week ago
Answer: NO

-Data scientist should have their own cluster and should terminate after 120 mins - STANDARD

-Cluster for Jobs should support scala - STANDARD

https://docs.microsoft.com/en-us/azure/databricks/clusters/configure
upvoted 2 times

 
nas28
11 months, 3 weeks ago
The answer is correct: No. But the reasoning is wrong: they want the data scientists' clusters to shut down automatically after 120 minutes, so Standard clusters, not High Concurrency.
upvoted 3 times

 
Sunnyb
11 months, 3 weeks ago
Answer is correct - NO
upvoted 2 times


Question #3 Topic 2

HOTSPOT -

You plan to create a real-time monitoring app that alerts users when a device travels more than 200 meters away from a designated location.

You need to design an Azure Stream Analytics job to process the data for the planned app. The solution must minimize the amount of code
developed and the number of technologies used.

What should you include in the Stream Analytics job? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.

Hot Area:

Correct Answer:

Input type: Stream -

You can process real-time IoT data streams with Azure Stream Analytics.

Function: Geospatial -

With built-in geospatial functions, you can use Azure Stream Analytics to build applications for scenarios such as fleet management, ride
sharing, connected cars, and asset tracking.

Note: In a real-world scenario, you could have hundreds of these sensors generating events as a stream. Ideally, a gateway device would run
code to push these events to Azure Event Hubs or Azure IoT Hubs.

Reference:

https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-get-started-with-azure-stream-analytics-to-process-data-from-iot-
devices https://docs.microsoft.com/en-us/azure/stream-analytics/geospatial-scenarios
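For illustration only, here is a minimal Stream Analytics query sketch for this kind of scenario. The input, output, and field names, as well as the reference coordinates, are assumptions and not part of the question; ST_DISTANCE and CreatePoint are the built-in geospatial functions, and ST_DISTANCE returns the distance in meters.

-- Alert when a device reports a position more than 200 meters from the designated location
SELECT DeviceId, Latitude, Longitude, EventTime
INTO AlertOutput
FROM DeviceInput TIMESTAMP BY EventTime
WHERE ST_DISTANCE(
        CreatePoint(Latitude, Longitude),
        CreatePoint(47.6062, -122.3321)) > 200   -- assumed designated location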

 
Podavenna
Highly Voted 
8 months, 1 week ago


Correct solution!
upvoted 22 times

 
ClassMistress
Most Recent 
1 week, 1 day ago
Correct
upvoted 1 times

 
NewTuanAnh
1 month, 2 weeks ago
Correct!
upvoted 1 times

 
PallaviPatel
3 months, 3 weeks ago
Correct
upvoted 1 times


Question #4 Topic 2

A company has a real-time data analysis solution that is hosted on Microsoft Azure. The solution uses Azure Event Hub to ingest data and an
Azure Stream

Analytics cloud job to analyze the data. The cloud job is configured to use 120 Streaming Units (SU).

You need to optimize performance for the Azure Stream Analytics job.

Which two actions should you perform? Each correct answer presents part of the solution.

NOTE: Each correct selection is worth one point.

A.
Implement event ordering.

B.
Implement Azure Stream Analytics user-defined functions (UDF).

C.
Implement query parallelization by partitioning the data output.

D.
Scale the SU count for the job up.

E.
Scale the SU count for the job down.

F.
Implement query parallelization by partitioning the data input.

Correct Answer:
DF

D: Scale out the query by allowing the system to process each input partition separately.

F: A Stream Analytics job definition includes inputs, a query, and output. Inputs are where the job reads the data stream from.

Reference:

https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-parallelization
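For reference, a minimal sketch of a partition-aligned Stream Analytics query (the input, output, and column names are hypothetical). Partitioning the input and the output on the same key, and grouping on PartitionId, lets each partition be processed independently; depending on the job's compatibility level, the explicit PARTITION BY clause may not be required.

SELECT TollBoothId, COUNT(*) AS VehicleCount
INTO PartitionedOutput
FROM PartitionedInput TIMESTAMP BY EntryTime
PARTITION BY PartitionId
GROUP BY TollBoothId, PartitionId, TumblingWindow(minute, 3)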

Community vote distribution


CF (50%) DF (40%) 10%

 
manquak
Highly Voted 
8 months, 3 weeks ago
Partition input and output.

REF: https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-parallelization
upvoted 25 times

 
kolakone
8 months, 1 week ago
Agree. And partitioning input and output with the same number of partitions gives the best performance optimization.
upvoted 5 times

 
Lio95
Highly Voted 
8 months ago
No event consumer was mentioned. Therefore, partitioning output is not relevant. Answer is correct
upvoted 11 times

 
Boompiee
2 weeks, 1 day ago
The stream analytics job is the consumer.
upvoted 1 times

 
nicolas1999
6 months, 1 week ago
Stream analytics ALWAYS has at least one output. There is no need to mention that. So correct answer is input and output
upvoted 2 times

 
Andushi
Most Recent 
2 weeks, 6 days ago
Selected Answer: CF
I agree with @manquak.
upvoted 1 times

 
DingDongSingSong
2 months ago
I think the answer is correct. The two things you do are: 1. scale up the SU count and 2. partition the input. If this doesn't work, THEN you could partition the output as well.
upvoted 1 times

 
Dianova
3 months, 1 week ago
Selected Answer: DF
I think answer is correct, because:

Nothing is mentioned in the question about the output and some type of outputs do not support partitioning (like PowerBI), so it would be risky to
assume that we can partition the output to implement Embarrassingly parallel jobs.

https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-parallelization#outputs

Implementing query parallelization by partitioning the data input would be an optimization but the total number of SUs depends on the number of
partitions, so the SUs would need to be scaled up.

https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-parallelization#calculate-the-maximum-streaming-units-of-a-job
upvoted 7 times

 
PallaviPatel
3 months, 3 weeks ago
Selected Answer: CF
Ignore my previous answer; C and F are correct.
upvoted 2 times

 
PallaviPatel
3 months, 3 weeks ago
Selected Answer: DF
correct
upvoted 1 times

 
assU2
4 months ago
Selected Answer: CF
Partitioning lets you divide data into subsets based on a partition key. If your input (for example Event Hubs) is partitioned by a key, it is highly
recommended to specify this partition key when adding input to your Stream Analytics job. Scaling a Stream Analytics job takes advantage of
partitions in the input and output.

Moreover, scaling is not an optimization.


upvoted 1 times

 
assU2
4 months ago
Is scaling an optimization??
upvoted 1 times

 
DE_Sanjay
4 months, 1 week ago
C & F Should be the right answer.
upvoted 1 times

 
dev2dev
4 months, 1 week ago
Optimization is always about improving performance using existing resources, so definitely not increasing the SKU or SU count.
upvoted 4 times

 
alex623
4 months, 1 week ago
I think the answer is to partition input and output, because the target is to optimize regardless of computing capacity (#SUs).
upvoted 1 times

 
DingDongSingSong
2 months ago
Who says optimization is regardless of computing capacity? In fact, increasing computing capacity is ONE of the ways to optimize performance.
upvoted 1 times

 
Jaws1990
4 months, 2 weeks ago
Selected Answer: CF
Should always aim for Embarrassingly parallel jobs (partitioning input, job and output) https://docs.microsoft.com/en-us/azure/stream-
analytics/stream-analytics-parallelization

Upping the computing power of a resource (SUs in this case) should never be classed as 'optimisation' like the question asks.
upvoted 5 times

 
dev2dev
4 months, 1 week ago
I agree
upvoted 1 times

 
trietnv
5 months ago
Selected Answer: BF
Choosing the number of required SUs for a particular job depends on the partition configuration for "the inputs" and "the query" that's defined
within the job. The Scale page allows you to set the right number of SUs. It is a best practice to allocate more SUs than needed.

ref: https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-streaming-unit-consumption
upvoted 2 times

 
Sayour
5 months, 1 week ago
The answer is correct. You scale up Streaming Units and partition the input so the input events are processed more efficiently.
upvoted 3 times

 
m2shines
5 months, 1 week ago
C and F
upvoted 1 times

 
rashjan
5 months, 2 weeks ago
Selected Answer: CF
Partitioning input and output is the correct answer; even if the output is not mentioned, a Stream Analytics job always has at least one output.
upvoted 1 times


Question #5 Topic 2

You need to trigger an Azure Data Factory pipeline when a file arrives in an Azure Data Lake Storage Gen2 container.

Which resource provider should you enable?

A.
Microsoft.Sql

B.
Microsoft.Automation

C.
Microsoft.EventGrid

D.
Microsoft.EventHub

Correct Answer:
C

Event-driven architecture (EDA) is a common data integration pattern that involves production, detection, consumption, and reaction to events.
Data integration scenarios often require Data Factory customers to trigger pipelines based on events happening in storage account, such as the
arrival or deletion of a file in Azure

Blob Storage account. Data Factory natively integrates with Azure Event Grid, which lets you trigger pipelines on such events.

Reference:

https://docs.microsoft.com/en-us/azure/data-factory/how-to-create-event-trigger https://docs.microsoft.com/en-us/azure/data-
factory/concepts-pipeline-execution-triggers

Community vote distribution


C (100%)

 
jv2120
Highly Voted 
5 months, 2 weeks ago
Correct. C

Azure Event Grids – Event-driven publish-subscribe model (think reactive programming)

Azure Event Hubs – Multiple source big data streaming pipeline (think telemetry data)

In this case Event Grid is more suitable than Event Hubs.


upvoted 12 times

 
medsimus
Highly Voted 
8 months ago
Correct

https://docs.microsoft.com/en-us/azure/data-factory/how-to-create-event-trigger?tabs=data-factory
upvoted 11 times

 
PallaviPatel
Most Recent 
3 months, 3 weeks ago
Selected Answer: C
Correct.
upvoted 2 times

 
romanzdk
4 months, 1 week ago
But EventHub does not support ADLS, only Blob storage
upvoted 1 times

 
romanzdk
4 months, 1 week ago
https://docs.microsoft.com/en-us/azure/event-grid/overview
upvoted 2 times

 
Swagat039
4 months, 2 weeks ago
C. is correct.

You need storage event trigger (for this Microsoft.EventGrid service needs to be enabled).
upvoted 1 times

 
Vardhan_Brahmanapally
6 months, 3 weeks ago
Why not eventhub?
upvoted 3 times

 
wijaz789
8 months, 3 weeks ago
Absolutely correct
upvoted 4 times


Question #6 Topic 2

You plan to perform batch processing in Azure Databricks once daily.

Which type of Databricks cluster should you use?

A.
High Concurrency

B.
automated

C.
interactive

Correct Answer:
B

Automated Databricks clusters are the best for jobs and automated batch processing.

Note: Azure Databricks has two types of clusters: interactive and automated. You use interactive clusters to analyze data collaboratively with
interactive notebooks. You use automated clusters to run fast and robust automated jobs.

Example: Scheduled batch workloads (data engineers running ETL jobs)

This scenario involves running batch job JARs and notebooks on a regular cadence through the Databricks platform.

The suggested best practice is to launch a new cluster for each run of critical jobs. This helps avoid any issues (failures, missing SLA, and so
on) due to an existing workload (noisy neighbor) on a shared cluster.

Reference:

https://docs.microsoft.com/en-us/azure/databricks/clusters/create https://docs.databricks.com/administration-guide/cloud-
configurations/aws/cmbp.html#scenario-3-scheduled-batch-workloads-data-engineers-running-etl-jobs

Community vote distribution


B (100%)

 
Podavenna
Highly Voted 
8 months, 1 week ago
Correct!
upvoted 8 times

 
necktru
Most Recent 
3 weeks ago
Selected Answer: B
correct
upvoted 1 times

 
PallaviPatel
3 months, 3 weeks ago
Selected Answer: B
correct.
upvoted 1 times

 
satyamkishoresingh
8 months, 3 weeks ago
What is an automated cluster?
upvoted 1 times

 
wijaz789
8 months, 3 weeks ago
There are 2 types of databricks clusters:

1) Standard/Interactive - best for querying and processing data by users.

2) Automatic/Jobs - best for jobs and automated batch processing.


upvoted 11 times

 
Swagat039
4 months, 2 weeks ago
Job cluster
upvoted 1 times


Question #7 Topic 2

HOTSPOT -

You are processing streaming data from vehicles that pass through a toll booth.

You need to use Azure Stream Analytics to return the license plate, vehicle make, and hour the last vehicle passed during each 10-minute window.

How should you complete the query? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.

Hot Area:


Correct Answer:

Box 1: MAX -

The first step of the query finds the maximum time stamp in 10-minute windows, that is, the time stamp of the last event for that window. The second step joins the results of the first query with the original stream to find the events that match the last time stamp in each window.

Query:

WITH LastInWindow AS
(
    SELECT MAX(Time) AS LastEventTime
    FROM Input TIMESTAMP BY Time
    GROUP BY TumblingWindow(minute, 10)
)
SELECT
    Input.License_plate,
    Input.Make,
    Input.Time
FROM Input TIMESTAMP BY Time
INNER JOIN LastInWindow
    ON DATEDIFF(minute, Input, LastInWindow) BETWEEN 0 AND 10
    AND Input.Time = LastInWindow.LastEventTime

Box 2: TumblingWindow -

Tumbling windows are a series of fixed-sized, non-overlapping and contiguous time intervals.

Box 3: DATEDIFF -

DATEDIFF is a date-specific function that compares and returns the time difference between two DateTime fields, for more information, refer to
date functions.

Reference:

https://docs.microsoft.com/en-us/stream-analytics-query/tumbling-window-azure-stream-analytics
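As a side-by-side illustration of the window choice discussed in the comments below, here is a simplified query over the same hypothetical Input stream (not the full answer query above):

-- TumblingWindow: fixed, non-overlapping 10-minute windows (what this scenario needs)
SELECT License_plate, MAX(Time) AS LastEventTime
FROM Input TIMESTAMP BY Time
GROUP BY License_plate, TumblingWindow(minute, 10)

-- HoppingWindow(minute, 10, 5) would instead emit a result every 5 minutes covering
-- the previous 10 minutes, so one event could fall into more than one window.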

 
rikku33
Highly Voted 
8 months ago
correct
upvoted 16 times

 
Jerrylolu
5 months, 1 week ago
Why not Hopping Window??
upvoted 1 times

 
Wijn4nd
4 months, 2 weeks ago
Because a hopping window can overlap, and we need the data from 10 minute time frames that DON'T overlap
upvoted 3 times

 
PallaviPatel
Most Recent 
3 months, 3 weeks ago
correct.
upvoted 1 times

 
BusinessApps
3 months, 3 weeks ago
HoppingWindow requires a minimum of three arguments, whereas TumblingWindow takes only two, so considering that the solution has only two arguments, it has to be Tumbling.

https://docs.microsoft.com/en-us/stream-analytics-query/hopping-window-azure-stream-analytics

https://docs.microsoft.com/en-us/stream-analytics-query/tumbling-window-azure-stream-analytics
upvoted 3 times

 
DrTaz
4 months, 3 weeks ago
Answer is 100% correct.
upvoted 2 times

 
bubububox
4 months, 3 weeks ago
Definitely hopping, because the event (last car passing) can be part of more than one window. Thus it can't be tumbling.
upvoted 1 times

 
DrTaz
4 months, 3 weeks ago
The question defines non-overlapping windows, thus tumbling for sure, 100%.
upvoted 1 times

 
durak
5 months ago
Why not SELECT COUNT?
upvoted 1 times

 
DrTaz
4 months, 3 weeks ago
MAX is used to get the "last" event.
upvoted 1 times

 
irantov
8 months, 1 week ago
I think it is correct. Although we could also use HoppingWindow, it would be better to use TumblingWindow as time events are unique.
upvoted 3 times

 
TelixFom
7 months, 2 weeks ago
I was thinking TumblingWindow based on the term "each 10-minute window." This implies that the situation is not looking for a rolling max.

upvoted 2 times

 
elcholo
8 months, 3 weeks ago
WHAAAT!
upvoted 4 times

 
GameLift
8 months, 2 weeks ago
very confusing
upvoted 4 times


Question #8 Topic 2

You have an Azure Data Factory instance that contains two pipelines named Pipeline1 and Pipeline2.
Pipeline1 has the activities shown in the following exhibit.

Pipeline2 has the activities shown in the following exhibit.

You execute Pipeline2, and Stored procedure1 in Pipeline1 fails.

What is the status of the pipeline runs?

A.
Pipeline1 and Pipeline2 succeeded.

B.
Pipeline1 and Pipeline2 failed.

C.
Pipeline1 succeeded and Pipeline2 failed.

D.
Pipeline1 failed and Pipeline2 succeeded.

Correct Answer:
A

Activities are linked together via dependencies. A dependency has a condition of one of the following: Succeeded, Failed, Skipped, or
Completed.

Consider Pipeline1:

If we have a pipeline with two activities where Activity2 has a failure dependency on Activity1, the pipeline will not fail just because Activity1
failed. If Activity1 fails and Activity2 succeeds, the pipeline will succeed. This scenario is treated as a try-catch block by Data Factory.

The failure dependency means this pipeline reports success.

Note:

If we have a pipeline containing Activity1 and Activity2, and Activity2 has a success dependency on Activity1, it will only execute if Activity1 is
successful. In this scenario, if Activity1 fails, the pipeline will fail.

Reference:

https://datasavvy.me/category/azure-data-factory/

Community vote distribution


A (100%)

 
echerish
Highly Voted 
9 months ago
Pipeline 2 executes Pipeline 1 and, on success, sets a variable. Since Pipeline 1 succeeds, Pipeline 2 is a success.

In Pipeline 1, the stored procedure fails and, on failure, sets a variable. Since the failure is the expected, handled outcome, the run completes successfully and sets Variable1.

At least that's how I understand it


upvoted 22 times

 
SaferSephy
Highly Voted 
8 months, 2 weeks ago
Correct answer is A. The trick is the fact that Pipeline 1 only has a Failure dependency between the activities. In this situation this results in a Succeeded pipeline if the stored procedure fails.

If the Success connection were also linked to a follow-up activity and the SP failed, the pipeline would indeed be marked as failed.


So A.
upvoted 21 times

 
BK10
3 months, 1 week ago
well explained! A is right
upvoted 1 times

 
SebK
Most Recent 
2 months ago
Selected Answer: A
Correct
upvoted 1 times

 
AngelJP
2 months, 1 week ago
Selected Answer: A
A correct:

Pipeline 1 is in try catch sentence --> Success

Pipeline 2 --> Success

https://docs.microsoft.com/en-us/azure/data-factory/tutorial-pipeline-failure-error-handling#try-catch-block
upvoted 2 times

 
PallaviPatel
3 months, 3 weeks ago
Selected Answer: A
A correct. I agree with SaferSephy's comments below.
upvoted 2 times

 
dev2dev
4 months, 1 week ago
A is correct. In Pipeline 1, the Set Variable activity is connected to the Failure output. It's like handling exceptions/errors in a programming language. Without the Failure path, it would be treated as failed.
upvoted 1 times

 
VeroDon
4 months, 3 weeks ago
Selected Answer: A
Correct
upvoted 1 times

 
JSSA
5 months ago
Correct answer is A
upvoted 1 times

 
rashjan
5 months, 2 weeks ago
Selected Answer: A
correct
upvoted 1 times

 
medsimus
7 months, 1 week ago
Correct answer. I tested it in Synapse: the first activity failed but the pipeline succeeded.
upvoted 5 times

 
Oleczek
8 months, 3 weeks ago
Just checked it myself on Azure, answer A is correct.
upvoted 4 times

 
wdeleersnyder
8 months, 4 weeks ago
I'm not seeing this... what's not being called out is if Pipeline 2 has a dependency on Pipeline 1. It happens all the time where two pipelines run;
one runs, the other fails.

It should be D in my opinion.
upvoted 4 times

 
gangstfear
9 months ago
The answer must be B
upvoted 2 times

 
JohnMasipa
9 months ago
Can someone please explain why the answer is A?
upvoted 1 times

 
dev2dev
4 months, 1 week ago
If you look at the green and red squares, they are called Success and Failure events. In pseudocode, Pipeline 1 can be read as "On Error Set Variable", whereas Pipeline 2 has "On Success Set Variable".
upvoted 1 times

 
Ayanchakrain
9 months ago


Pipeline1 has the failure dependency


upvoted 2 times


Question #9 Topic 2

HOTSPOT -

A company plans to use Platform-as-a-Service (PaaS) to create the new data pipeline process. The process must meet the following requirements:
Ingest:

✑ Access multiple data sources.

✑ Provide the ability to orchestrate workflow.

✑ Provide the capability to run SQL Server Integration Services packages.

Store:

✑ Optimize storage for big data workloads.

✑ Provide encryption of data at rest.

✑ Operate with no size limits.

Prepare and Train:

✑ Provide a fully-managed and interactive workspace for exploration and visualization.

✑ Provide the ability to program in R, SQL, Python, Scala, and Java.

Provide seamless user authentication with Azure Active Directory.

Model & Serve:

✑ Implement native columnar storage.

✑ Support for the SQL language

✑ Provide support for structured streaming.

You need to build the data integration pipeline.

Which technologies should you use? To answer, select the appropriate options in the answer area.

NOTE: Each correct selection is worth one point.

Hot Area:


Correct Answer:

Ingest: Azure Data Factory -

Azure Data Factory pipelines can execute SSIS packages.

In Azure, the following services and tools will meet the core requirements for pipeline orchestration, control flow, and data movement: Azure
Data Factory, Oozie on HDInsight, and SQL Server Integration Services (SSIS).

Store: Data Lake Storage -

Data Lake Storage Gen1 provides unlimited storage.

Note: Data at rest includes information that resides in persistent storage on physical media, in any digital format. Microsoft Azure offers a
variety of data storage solutions to meet different needs, including file, disk, blob, and table storage. Microsoft also provides encryption to
protect Azure SQL Database, Azure Cosmos

DB, and Azure Data Lake.

Prepare and Train: Azure Databricks

Azure Databricks provides enterprise-grade Azure security, including Azure Active Directory integration.

With Azure Databricks, you can set up your Apache Spark environment in minutes, autoscale and collaborate on shared projects in an
interactive workspace.

Azure Databricks supports Python, Scala, R, Java and SQL, as well as data science frameworks and libraries including TensorFlow, PyTorch and
scikit-learn.

Model and Serve: Azure Synapse Analytics

Azure Synapse Analytics/ SQL Data Warehouse stores data into relational tables with columnar storage.

Azure SQL Data Warehouse connector now offers efficient and scalable structured streaming write support for SQL Data Warehouse. Access
SQL Data

Warehouse from Azure Databricks using the SQL Data Warehouse connector.

Note: As of November 2019, Azure SQL Data Warehouse is now Azure Synapse Analytics.

Reference:

https://docs.microsoft.com/bs-latn-ba/azure/architecture/data-guide/technology-choices/pipeline-orchestration-data-movement
https://docs.microsoft.com/en-us/azure/azure-databricks/what-is-azure-databricks

 
Podavenna
Highly Voted 
8 months, 1 week ago
Correct solution!
upvoted 26 times

 
irantov
Highly Voted 
8 months, 1 week ago
Correct!


upvoted 10 times

 
SebK
Most Recent 
2 months ago
Correct
upvoted 1 times

 
Massy
2 months, 1 week ago
For the store, couldn't we also use Azure Blob Storage? It supports all three requirements.
upvoted 1 times

 
NewTuanAnh
1 month, 2 weeks ago
Because ADLS Gen2 supports big data workloads better.
upvoted 1 times

 
paras_gadhiya
3 months ago
Correct
upvoted 1 times

 
PallaviPatel
3 months, 3 weeks ago
Correct solution.
upvoted 1 times

 
joeljohnrm
4 months, 2 weeks ago
Correct Solution
upvoted 1 times

 
[Removed]
4 months, 3 weeks ago
for model and server, HDI has all of this. Why DataBricks?
upvoted 1 times

 
rockyc05
3 months ago
Also seamless integration with AAD
upvoted 1 times

 
rockyc05
3 months ago
Support for SQL
upvoted 1 times

 
corebit
5 months, 1 week ago
It would be best if people posting answers that go against the popular responses provided some reference instead of blindly saying "false".
upvoted 3 times

 
Akash0105
6 months, 2 weeks ago
Answer is correct.

Azure Databricks supports java: https://azure.microsoft.com/en-us/services/databricks/#overview


upvoted 2 times

 
Pratikh
6 months, 3 weeks ago
Databricks doesn't support Java, so Prepare and Train should be an HDInsight Apache Spark cluster.
upvoted 3 times

 
KOSTA007
6 months, 2 weeks ago
Azure Databricks supports Python, Scala, R, Java, and SQL, as well as data science frameworks and libraries including TensorFlow, PyTorch, and
scikit-learn.
upvoted 9 times

 
Aslam208
6 months, 3 weeks ago
Databricks does not support Java; Prepare and Train should be an Azure HDInsight Apache Spark cluster.
upvoted 1 times

 
Aslam208
5 months, 2 weeks ago
I would like to correct my answer here... java is supported in Azure Databricks, therefore Prepare and Train can be done with Azure Databricks
upvoted 3 times

 
Samanda
7 months, 1 week ago
False. Kafka on HDInsight is the correct option for the last box.
upvoted 1 times

 
datachamp
8 months, 1 week ago
Is this an ad?
upvoted 9 times


Question #10 Topic 2

DRAG DROP -

You have the following table named Employees.

You need to calculate the employee_type value based on the hire_date value.

How should you complete the Transact-SQL statement? To answer, drag the appropriate values to the correct targets. Each value may be used
once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content.

NOTE: Each correct selection is worth one point.

Select and Place:

Correct Answer:

Box 1: CASE -

CASE evaluates a list of conditions and returns one of multiple possible result expressions.

CASE can be used in any statement or clause that allows a valid expression. For example, you can use CASE in statements such as SELECT,
UPDATE,

DELETE and SET, and in clauses such as select_list, IN, WHERE, ORDER BY, and HAVING.

Syntax: Simple CASE expression:

CASE input_expression
    WHEN when_expression THEN result_expression [ ...n ]
    [ ELSE else_result_expression ]
END

Box 2: ELSE -

Reference:

https://docs.microsoft.com/en-us/sql/t-sql/language-elements/case-transact-sql
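As a worked illustration only (the exhibit with the actual hire_date cutoff and labels is not reproduced here, so the column names, the date threshold, and the employee_type values below are assumptions):

SELECT employee_id,
       hire_date,
       CASE
           WHEN hire_date >= '2019-01-01' THEN 'New'   -- assumed cutoff, for illustration
           ELSE 'Standard'                             -- assumed label, for illustration
       END AS employee_type
FROM Employees;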

 
MoDar
Highly Voted 
8 months, 4 weeks ago


Correct
upvoted 24 times

 
NewTuanAnh
Most Recent 
1 month, 2 weeks ago
the answer is correct

CASE ...

WHEN ... THEN...

ELSE ...
upvoted 2 times

 
PallaviPatel
3 months, 3 weeks ago
correct
upvoted 1 times

 
steeee
9 months ago
The answer is correct. But, is this in the scope of this exam?
upvoted 4 times

 
anto69
4 months, 2 weeks ago
it seems
upvoted 1 times

 
mkutts
6 months, 4 weeks ago
Got this question yesterday so yes.
upvoted 5 times

 
parwa
9 months ago
Makes sense to me; a data engineer should be able to write queries.
upvoted 6 times

https://www.examtopics.com/exams/microsoft/dp-203/custom-view/ 155/161
5/26/22, 3:46 PM DP-203 Exam – Free Actual Q&As, Page 1 | ExamTopics

Question #11 Topic 2

DRAG DROP -

You have an Azure Synapse Analytics workspace named WS1.

You have an Azure Data Lake Storage Gen2 container that contains JSON-formatted files in the following format.

You need to use the serverless SQL pool in WS1 to read the files.

How should you complete the Transact-SQL statement? To answer, drag the appropriate values to the correct targets. Each value may be used
once, more than once, or not at all. You may need to drag the split bar between panes or scroll to view content.


NOTE: Each correct selection is worth one point.

Select and Place:

Correct Answer:

Box 1: openrowset -

The easiest way to see to the content of your CSV file is to provide file URL to OPENROWSET function, specify csv FORMAT.

Example:

SELECT *
FROM OPENROWSET(
    BULK 'csv/population/population.csv',
    DATA_SOURCE = 'SqlOnDemandDemo',
    FORMAT = 'CSV', PARSER_VERSION = '2.0',
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n'
) AS [r]


Box 2: openjson -
You can access your JSON files from the Azure File Storage share by using the mapped drive, as shown in the following example:

SELECT book.*
FROM OPENROWSET(BULK N't:\books\books.json', SINGLE_CLOB) AS json
CROSS APPLY OPENJSON(BulkColumn)
WITH (id NVARCHAR(100), name NVARCHAR(100), price FLOAT,
      pages_i INT, author NVARCHAR(100)) AS book

Reference:

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/query-single-csv-file https://docs.microsoft.com/en-us/sql/relational-
databases/json/import-json-documents-into-sql-server
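A minimal sketch that combines the two selected keywords for this scenario, assuming a hypothetical ADLS Gen2 path and JSON property names (the real ones are shown in the exhibit, which is not reproduced here). The 0x0b field terminator and field quote make OPENROWSET return each JSON document as a single column, which OPENJSON then parses:

SELECT j.*
FROM OPENROWSET(
        BULK 'https://<storageaccount>.dfs.core.windows.net/<container>/*.json',
        FORMAT = 'CSV',
        FIELDTERMINATOR = '0x0b',
        FIELDQUOTE = '0x0b'
     ) WITH (doc NVARCHAR(MAX)) AS rows
CROSS APPLY OPENJSON(doc)
     WITH (id NVARCHAR(100), name NVARCHAR(100)) AS j;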

 
Maunik
Highly Voted 
8 months, 2 weeks ago
Answer is correct

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/query-json-files
upvoted 25 times

 
Lrng15
8 months ago
answer is correct as per this link
upvoted 1 times

 
gf2tw
Highly Voted 
8 months, 2 weeks ago
The question and answer seem out of place; there was no mention of CSV, and the query in the answer doesn't match up with OPENJSON at all.
upvoted 6 times

 
dev2dev
4 months, 1 week ago
Look at the WITH statement, the csv column can contain json data.
upvoted 1 times

 
anto69
4 months, 2 weeks ago
agree with u
upvoted 1 times

 
dead_SQL_pool
6 months ago
Actually, the csv format is specified if you're using OPENROWSET to read json files in Synapse. The OPENJSON is required if you want to parse
data from every array in the document. See the OPENJSON example in this link:

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/query-json-files#query-json-files-using-openjson
upvoted 8 times

 
gf2tw
5 months, 2 weeks ago
Thanks, you're right:

"The easiest way to see to the content of your JSON file is to provide the file URL to the OPENROWSET function, specify csv FORMAT, and set
values 0x0b for fieldterminator and fieldquote."
upvoted 4 times

 
gssd4scoder
6 months, 3 weeks ago
agree with you, very misleading
upvoted 1 times

 
SebK
Most Recent 
2 months ago
Correct
upvoted 1 times

 
PallaviPatel
3 months, 3 weeks ago
correct
upvoted 1 times


Question #12 Topic 2

DRAG DROP -

You have an Apache Spark DataFrame named temperatures. A sample of the data is shown in the following table.

You need to produce the following table by using a Spark SQL query.

How should you complete the query? To answer, drag the appropriate values to the correct targets. Each value may be used once, more than once,
or not at all.

You may need to drag the split bar between panes or scroll to view content.

NOTE: Each correct selection is worth one point.

Select and Place:


Correct Answer:

Box 1: PIVOT -

PIVOT rotates a table-valued expression by turning the unique values from one column in the expression into multiple columns in the output.
And PIVOT runs aggregations where they're required on any remaining column values that are wanted in the final output.

Incorrect Answers:

UNPIVOT carries out the opposite operation to PIVOT by rotating columns of a table-valued expression into column values.

Box 2: CAST -

If you want to convert an integer value to a DECIMAL data type in SQL Server use the CAST() function.

Example:

SELECT CAST(12 AS DECIMAL(7,2)) AS decimal_value;

Here is the result:

decimal_value

12.00

Reference:

https://learnsql.com/cookbook/how-to-convert-an-integer-to-a-decimal-in-sql-server/ https://docs.microsoft.com/en-us/sql/t-sql/queries/from-
using-pivot-and-unpivot
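For reference, a minimal Spark SQL sketch of the completed query, assuming the temperatures data is registered as a table or view with date and temp columns roughly matching the sample (the exact column names and months are assumptions):

SELECT * FROM (
    SELECT YEAR(date) AS year, MONTH(date) AS month, temp
    FROM temperatures
)
PIVOT (
    CAST(AVG(temp) AS DECIMAL(4, 1))
    FOR month IN (6 AS JUN, 7 AS JUL, 8 AS AUG)
);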

 
SujithaVulchi
Highly Voted 
8 months ago
correct answer, pivot and cast
upvoted 22 times

 
ggggyyyyy
Most Recent 
8 months ago
Correct. CAST, not CONVERT.
upvoted 3 times


Question #13 Topic 2

You have an Azure Data Factory that contains 10 pipelines.

You need to label each pipeline with its main purpose of either ingest, transform, or load. The labels must be available for grouping and filtering
when using the monitoring experience in Data Factory.

What should you add to each pipeline?

A.
a resource tag

B.
a correlation ID

C.
a run group ID

D.
an annotation

Correct Answer:
D

Annotations are additional, informative tags that you can add to specific factory resources: pipelines, datasets, linked services, and triggers. By
adding annotations, you can easily filter and search for specific factory resources.

Reference:

https://www.cathrinewilhelmsen.net/annotations-user-properties-azure-data-factory/

Community vote distribution


D (100%)

 
umeshkd05
Highly Voted 
8 months, 2 weeks ago
Annotation
upvoted 16 times

 
anto69
4 months, 1 week ago
Because ADF pipelines are not first-class Azure resources.
upvoted 1 times

 
AhmedDaffaie
Most Recent 
2 months, 1 week ago
What is the difference between resource tags and annotations?
upvoted 1 times

 
paras_gadhiya
3 months ago
Correct!
upvoted 1 times

 
PallaviPatel
3 months, 3 weeks ago
Selected Answer: D
correct
upvoted 1 times

 
huesazo
4 months, 1 week ago
Selected Answer: D
Annotation
upvoted 1 times

 
aarthy2
7 months, 3 weeks ago
Yes, correct. Annotations provide label functionality that shows up in pipeline monitoring.
upvoted 2 times

