
by ExamsDigest®

CompTIA Data+ DA0-001 Practice Tests 2022®


Published by: ExamsDigest LLC., Holzmarktstraße 73, Berlin, Germany, www.examsdigest.com Copyright © 2022 by
ExamsDigest LLC.

No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form, electronic, mechanical,
photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States
Copyright Act, without the prior written permission of the Publisher. Requests to the Publisher for permission should be
addressed to the Permissions Department, ExamsDigest, LLC., Holzmarktstraße 73, Berlin, Germany or online at
https://www.examsdigest.com/contact.

Trademarks: ExamsDigest, examsdigest.com and related trade dress are trademarks or registered trademarks of ExamsDigest
LLC. and may not be used without written permission. Amazon is a registered trademark of Amazon, Inc. All other trademarks
are the property of their respective owners. ExamsDigest, LLC. is not associated with any product or vendor mentioned in this
book.

LIMIT OF LIABILITY/DISCLAIMER OF WARRANTY: THE PUBLISHER AND THE AUTHOR MAKE NO


REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE ACCURACY OR COMPLETENESS OF THE
CONTENTS OF THIS WORK AND SPECIFICALLY DISCLAIM ALL WARRANTIES, INCLUDING WITHOUT
LIMITATION WARRANTIES OF FITNESS FOR A PARTICULAR PURPOSE. NO WARRANTY MAY BE
CREATED OR EXTENDED BY SALES OR PROMOTIONAL MATERIALS. THE ADVICE AND STRATEGIES
CONTAINED HEREIN MAY NOT BE SUITABLE FOR EVERY SITUATION. THIS WORK IS SOLD WITH THE
UNDERSTANDING THAT THE PUBLISHER IS NOT ENGAGED IN RENDERING LEGAL, ACCOUNTING, OR
OTHER PROFESSIONAL SERVICES. IF PROFESSIONAL ASSISTANCE IS REQUIRED, THE SERVICES OF A
COMPETENT PROFESSIONAL PERSON SHOULD BE SOUGHT. NEITHER THE PUBLISHER NOR THE
AUTHOR SHALL BE LIABLE FOR DAMAGES ARISING HEREFROM. THE FACT THAT AN ORGANIZATION
OR WEBSITE IS REFERRED TO IN THIS WORK AS A CITATION AND/OR A POTENTIAL SOURCE OF
FURTHER INFORMATION DOES NOT MEAN THAT THE AUTHOR OR THE PUBLISHER ENDORSES THE
INFORMATION THE ORGANIZATION OR WEBSITE MAY PROVIDE OR RECOMMENDATIONS IT MAY
MAKE.

Examsdigest publishes in a variety of print and electronic formats and by print-on-demand. Some material included with standard
print versions of this book may not be included in e-books or in print-on-demand. If this book refers to media such as a CD or
DVD that is not included in the version you purchased, you may find this material at https://examsdigest.com
CONTENTS AT A GLANCE

Introduction
Chapter 1 Data Concepts And Environments

Questions 1-10

Answers 1-10

Chapter 2 Data mining

Questions 11-30

Answers 11-30

Chapter 3 Data Analysis

Questions 31-50

Answers 31-50

Chapter 4 Visualizations
Questions 51-65

Answers 51-65

Chapter 5 Data Governance, Quality, And Controls

Questions 66-80

Answers 66-80
BONUSES & DISCOUNTS
INTRODUCTION
CompTIA Data+ is an early-career data analytics certification for professionals tasked with
developing and promoting data-driven business decision-making.

About This Book

CompTIA Data+ DA0-001 Practice Tests 2022 by ExamsDigest is designed to be a practical practice-exam guide that will help you prepare for the CompTIA Data+ DA0-001 exam. As the book title says, it includes 100+ questions, organized by exam domain so that you can prepare for the final exam.

This book has been designed to help you prepare for the style of questions you will receive on the CompTIA Data+ DA0-001 exam. It also helps you understand the topics you can expect to be tested on in each domain.

In order to properly prepare for the CompTIA Data+ DA0-001, I recommend that you:

✓ Review a reference book: CompTIA Data+ DA0-001 by ExamsDigest is designed to give you sample questions to help you prepare for the style of questions you will receive on the real certification exam. However, it is not a reference book that teaches the concepts in detail. That said, I recommend that you review a reference book before attacking these questions so that the theory is fresh in your mind.

✓ Do practice test questions: After you review a reference book and perform some hands-on work, attack the questions in this book to get you “exam ready”! Also claim your free 1-month access on our platform to dive into more questions, flashcards and much, much more.

Beyond The Book


This book gives you plenty of CompTIA Data+ DA0-001 questions to work on, but maybe you
want to track your progress as you tackle the questions, or maybe you’re having trouble with
certain types of questions and wish they were all presented in one place where you could
methodically make your way through them. You’re in luck. Your book purchase comes with a
free one-month subscription to all practice questions online and more. You get on-the-go access
any way you want it — from your computer, smartphone, or tablet. Track your progress and
view personalized reports that show where you need to study the most. Study what, where, when,
and how you want!

What you’ll find online

The online practice that comes free with this book offers you the same questions and answers
that are available here and more.

The beauty of the online questions is that you can customize your online practice to focus on the
topic areas that give you the most trouble.

So if you need help with the domain Data Mining, then select questions related to this topic online and start practicing.

Whether you practice a few hundred problems in one sitting or a couple dozen, and whether you
focus on a few types of problems or practice every type, the online program keeps track of the
questions you get right and wrong so that you can monitor your progress and spend time
studying exactly what you need.

You can access these online tools by sending an email to info@examsdigest.com to claim access on our platform. Once we confirm the purchase, you can enjoy your free access.

CompTIA Data+ DA0-001 Exam Details



✓ Format - Multiple choice and performance-based


✓ Type - Associate
✓ Delivery Method - Testing center or online proctored exam
✓ Time - 90 minutes to complete the exam
✓ Cost - $349
✓ Language - Available in English

Exam Content

Content Outline
Candidates are encouraged to use this document to help prepare for the CompTIA Data+ (DA0 -
001) certification exam. This exam will certify the successful candidate has the knowledge and
skills required to transform business requirements in support of data-driven decisions by:

• Mining data
• Manipulating data
• Applying basic statistical methods
• Analyzing complex datasets while adhering to governance and quality standards throughout the entire data life cycle

This is equivalent to 18–24 months of hands-on experience working in a business intelligence report/data analyst job role. These content examples are meant to clarify the test objectives and should not be construed as a comprehensive listing of all the content of this examination.
The table below lists the domains measured by this examination and the extent to which they are
represented:

1.0: Data Concepts and Environments (15%)


2.0: Data Mining (25%)
3.0: Data Analysis (23%)
4.0: Visualization (23%)
5.0: Data Governance, Quality, and Controls (14%)
CHAPTER 1
DATA CONCEPTS
AND ENVIRONMENTS
Questions 1-10
Question 1. Drag and drop the data file formats into their respective use cases.
(A) XML
(B) Flat
(C) JSON
(D) HTML

Use cases:
• Transmit data in web applications
• Provide a standard method to access information
• Create web pages and web applications
• Import data in data warehousing projects

Question 2. A web developer is developing a new application using React.js for the front-end
and Python for the backend. He wants to store data in JavaScript Object Notation (JSON) format
in order to transmit the data between the web application and the server.

Which of the following formats represents a JSON file?

(A) [Company] – [ExamsDigest] – [email] – [info@examsdigest.com] – [Country] – [United Kingdom]
(B) {"Company":"ExamsDigest", "email":"info@examsdigest.com", "Country":"United
Kingdom"}
(C) <div>Company ExamsDigest</div> <div>Email info@examsdigest.com</div>
<div>Country United Kingdom</div>
(D) (Company- ExamsDigest – email – info@examsdigest.com – Country – United
Kingdom)

Question 3. You have been tasked to create the login form for your client’s website. You design the database for the login form, but you forgot to add the column that keeps a boolean type (0/1) showing whether a user has an account or not. If a user has an account, the database should store the value 1; otherwise, the value 0.

Which of the following data types should you use to complete the database?

(A) Date
(B) Numeric
(C) Alphanumeric
(D) Currency

Question 4. A web developer who is developing an e-commerce marketplace designs its database to capture, store, and process data from transactions in real time. The e-commerce platform will deal with many standard and straightforward queries, such as insert, delete, and update, and the data will be stored in 3NF (third normal form).

In which of the following databases would the developer store the data?

(A) Online analytical processing


(B) Online query processing
(C) Online standard processing
(D) Online transactional processing

Question 5. Maria, a senior database manager, wants to create a new DB for the sales department. She needs to create 2 tables, Employees and Sales respectively. The Employees table contains a single row for each employee, with each employee assigned a unique id (primary key). The second table, Sales, contains individual sales records associated with the employee who made the sale.

Which of the following database types will Maria implement?


(A) Non-relational
(B) Snowflake
(C) Relational
(D) Star

Question 6. Which of the following are common examples of structured data? (Choose TWO)

(A) Excel files


(B) Audio files
(C) Video files
(D) SQL databases
(E) No-SQL databases

Question 7. Which of the following are common examples of unstructured data? (Choose TWO)

(A) No-SQL databases


(B) JSON
(C) XML
(D) Audio files
(E) SQL databases

Question 8. A junior web developer is developing a new application where users can upload
short videos. The first task is to create a homepage that shows the headline “Upload Your Short
Videos” and a clickable button that says “upload now”.

Which of the following HTML commands would help the developer to complete the task
successfully?

(A) <p>Upload Your Short Videos</p> <p>upload now</p>


(B) <h1>Upload Your Short Videos</h1> <button>upload now</button>
(C) <span>Upload Your Short Videos</span> <button>upload now</button>
(D) <h1>Upload Your Short Videos</h1> <h1>upload now</h1>

Question 9. Which of the following statements BEST describes the difference between discrete
and continuous data types?

(A) Continuous data is a numerical type of data that includes numbers with fixed data values
determined by counting. Discrete data includes complex numbers and varying data values that
are measured over a specific time interval
(B) Discrete data is a numerical type of data that includes numbers with fixed data values
determined by counting. Continuous data includes complex numbers and varying data values that
are measured over a specific time interval
(C) Continuous data is an alphanumeric type of data that includes characters with fixed data values determined by counting. Discrete data includes complex characters and varying data values that are measured over a specific time interval
(D) Discrete data is an alphanumeric type of data that includes characters with fixed data values determined by counting. Continuous data includes complex characters and varying data values that are measured over a specific time interval
Question 10. A business analyst requests an analysis of data to display a table of all of a company’s clothing products that were sold in the UK in June, compare the sales figures with those in September, and then compare them with other product sales in the UK over the same period.

Which of the following methods enables the data scientist to extract and query data in order to
analyze it from different angles?

(A) Online transactional processing


(B) Online query processing
(C) Online analytical processing
(D) Online standard processing
Answers 1-10

Question 1. Drag and drop the data file formats into their respective use cases.
(A) XML
(B) Flat
(C) JSON
(D) HTML

Transmit data in web applications → JSON
Provide a standard method to access information → XML
Create web pages and web applications → HTML
Import data in data warehousing projects → Flat

Explanation 1.
JSON transmits data in web applications.
XML provides a standard method to access information.
HTML creates web pages and web applications.
Flat files import data in data warehousing projects.

Question 2. A web developer is developing a new application using React.js for the front-end
and Python for the backend. He wants to store data in JavaScript Object Notation (JSON) format
in order to transmit the data between the web application and the server.

Which of the following formats represents a JSON file?

(A) [Company] – [ExamsDigest] – [email] – [info@examsdigest.com] – [Country] – [United Kingdom]
(B) {"Company":"ExamsDigest", "email":"info@examsdigest.com", "Country":"United
Kingdom"}
(C) <div>Company ExamsDigest</div> <div>Email info@examsdigest.com</div>
<div>Country United Kingdom</div>
(D) (Company- ExamsDigest – email – info@examsdigest.com – Country – United
Kingdom)

Explanation 2. The correct answer is: {"Company":"ExamsDigest", "email":"info@examsdigest.com", "Country":"United Kingdom"}

A JSON file is a file that stores simple data structures and objects in JavaScript Object Notation
(JSON) format, which is a standard data interchange format. It is primarily used for transmitting
data between a web application and a server.

This JSON is an array of objects; each object contains some information about a person.

[
{"name":"Nick", "age":29, "email":"info@examsdigest.com"},
{"name":"Milanie", "age":25, "email":"info@examsdigest.com"}
]
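As a quick illustration, here is a minimal Python sketch (the names and values are made up) showing how such a structure is serialized for transmission and parsed back with the standard json module:

import json

payload = {"Company": "ExamsDigest",
           "email": "info@examsdigest.com",
           "Country": "United Kingdom"}

text = json.dumps(payload)      # Python dict -> JSON string for transmission
restored = json.loads(text)     # JSON string -> Python dict on the receiving side
print(restored["Company"])      # ExamsDigest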

Question 3. You have been tasked to create the login form for your client’s website. You design the database for the login form, but you forgot to add the column that keeps a boolean type (0/1) showing whether a user has an account or not. If a user has an account, the database should store the value 1; otherwise, the value 0.

Which of the following data types should you use to complete the database?

(A) Date
(B) Numeric
(C) Alphanumeric
(D) Currency

Explanation 3. The correct answer is: Numeric. Numeric is any intrinsic data type (Byte, Boolean, Integer, Long, Currency, Single, Double). Numeric data types are numbers stored in database columns. These data types are typically grouped into exact numeric types (values where the precision and scale need to be preserved) and approximate numeric types.

Date is incorrect. Date is used for storing a date or a date/time value in the database.

Alphanumeric is incorrect. Alphanumeric is a description of data that is both letters and


numbers. For example, “1a2b3c” is a short string of alphanumeric characters. Alphanumeric is
commonly used to help explain the availability of text that can be entered or used in a field, such
as an alphanumeric password field.

Currency is incorrect. Currency is a data type with a range of -922,337,203,685,477.5808 to 922,337,203,685,477.5807. Currency is used for calculations involving money and for fixed-point calculations where accuracy is particularly important.
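For instance, here is a minimal sketch using Python's built-in sqlite3 module (the table and column names are illustrative): the account flag is stored as a numeric 0/1 value.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT, has_account INTEGER)")
conn.execute("INSERT INTO users VALUES (?, ?)", ("maria", 1))  # has an account
conn.execute("INSERT INTO users VALUES (?, ?)", ("guest", 0))  # no account
for row in conn.execute("SELECT * FROM users"):
    print(row)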

Question 4. A web developer who is developing an e-commerce marketplace designs its database to capture, store, and process data from transactions in real time. The e-commerce platform will deal with many standard and straightforward queries, such as insert, delete, and update, and the data will be stored in 3NF (third normal form).
In which of the following databases would the developer store the data?
(A) Online analytical processing
(B) Online query processing
(C) Online standard processing
(D) Online transactional processing

Explanation 4. The correct answer is: Online transactional processing

Online transaction processing (OLTP) captures, stores, and processes data from transactions in
real-time. An OLTP database stores and manages data related to everyday operations within a
system or a company. However, OLTP is focused on transaction-oriented tasks.

OLTP typically deals with query processing (inserting, updating, deleting data in a database),
and maintaining data integrity and effectiveness when dealing with numerous transactions
simultaneously. Each transaction involves individual database records made up of multiple fields
or columns. Examples include banking and credit card activity or retail checkout scanning.

Online analytical processing is incorrect. Online analytical processing (OLAP) uses complex
queries to analyze aggregated historical data from OLTP systems. OLTP and OLAP are two
systems that complement each other. While OLTP deals with processing day-to-day transactions,
OLAP helps analyze the processed data.

Online query processing and Online standard processing are incorrect as these are
imaginary terms.

Question 5. Maria, a senior database manager, wants to create a new DB for the sales department. She needs to create 2 tables, Employees and Sales respectively. The Employees table contains a single row for each employee, with each employee assigned a unique id (primary key). The second table, Sales, contains individual sales records associated with the employee who made the sale.

Which of the following database types will Maria implement?


(A) Non-relational
(B) Snowflake
(C) Relational
(D) Star

Explanation 5. The correct answer is: Relational

A relational database, also called a Relational Database Management System (RDBMS) or SQL database, stores data in tables made up of rows, also referred to as records. A relational database works by linking information from multiple tables through the use of “keys.”

A key is a unique identifier that can be assigned to a row of data contained within a table. This unique identifier, called a “primary key,” can then be included in a record located in another table when that record has a relationship to the primary record in the main table. When this unique primary key is added to a record in another table, it is called a “foreign key” in the associated table. The connection between the primary and foreign keys then creates the “relationship” between records contained across multiple tables.

One significant advantage to using an RDBMS is “referential integrity.” Referential integrity


refers to the accuracy and consistency of data. This data integrity is achieved by using these
primary and foreign keys.
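A minimal sketch of Maria's two tables with Python's sqlite3 (all names are illustrative) shows the key linkage in practice:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.execute("CREATE TABLE Employees (employee_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""CREATE TABLE Sales (
                    sale_id INTEGER PRIMARY KEY,
                    amount REAL,
                    employee_id INTEGER REFERENCES Employees(employee_id))""")
conn.execute("INSERT INTO Employees VALUES (1, 'Maria')")
conn.execute("INSERT INTO Sales VALUES (100, 250.0, 1)")  # linked to Maria via the foreign key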

Non-relational is incorrect. The non-relational database, or NoSQL database, stores data.


However, unlike the relational database, there are no tables, rows, primary keys, or foreign keys.
Instead, the non-relational database uses a storage model optimized for specific requirements of
the type of data being stored.

Some of the more popular NoSQL databases are MongoDB, Apache Cassandra, Redis,
Couchbase, and Apache HBase.

There are four popular non-relational types: document data store, column-oriented database, key-
value store, and graph database. Often combinations of these types are used for a single
application.

Snowflake is incorrect. A Snowflake Schema in the data warehouse is a logical arrangement of tables in a multidimensional database such that the ER diagram resembles a snowflake shape. A Snowflake Schema is an extension of a Star Schema, and it adds additional dimensions. The dimension tables are normalized, which splits data into additional tables.

Star is incorrect. A Star Schema in the data warehouse has at its center one fact table and a number of associated dimension tables. It is known as a star schema because its structure resembles a star. The Star Schema data model is the simplest type of Data Warehouse schema.

Question 6. Which of the following are common examples of structured data? (Choose TWO)

(A) Excel files


(B) Audio files
(C) Video files
(D) SQL databases
(E) No-SQL databases

Explanation 6. The correct answers are:


1. Excel files
2. SQL databases

Structured data is data that adheres to a pre-defined data model and is therefore straightforward
to analyze. Structured data conforms to a tabular format with relationships between the different
rows and columns. Common examples of structured data are Excel files or SQL databases. Each
of these has structured rows and columns that can be sorted.

Structured data depends on the existence of a data model – a model of how data can be stored,
processed, and accessed. Because of a data model, each field is discrete and can be accessed
separately or jointly along with data from other fields. This makes structured data extremely
powerful: it is possible to quickly aggregate data from various locations in the database.

Structured data is considered the most traditional form of data storage since the earliest versions
of database management systems (DBMS) were able to store, process and access structured data.

Question 7. Which of the following are common examples of unstructured data? (Choose TWO)

(A) No-SQL databases


(B) JSON
(C) XML
(D) Audio files
(E) SQL databases

Explanation 7. The correct answers are:


1. No-SQL databases
2. Audio files

Unstructured data is more or less all the data that is not structured. Even though unstructured data may have a native, internal structure, it’s not structured in a predefined way. There is no data model; the data is stored in its native format. Common examples of unstructured data include audio files, video files, and No-SQL databases.
Question 8. A junior web developer is developing a new application where users can upload
short videos. The first task is to create a homepage that shows the headline “Upload Your Short
Videos” and a clickable button that says “upload now”.

Which of the following HTML commands would help the developer to complete the task
successfully?

(A) <p>Upload Your Short Videos</p> <p>upload now</p>


(B) <h1>Upload Your Short Videos</h1> <button>upload now</button>
(C) <span>Upload Your Short Videos</span> <button>upload now</button>
(D) <h1>Upload Your Short Videos</h1> <h1>upload now</h1>

Explanation 8. The correct answer is:


<h1>Upload Your Short Videos</h1> <button>upload now</button>

The <h1> to <h6> tags are used to define HTML headings. <h1> defines the most important
heading. <h6> defines the least important heading.
Note: Only use one <h1> per page – this should represent the main heading/subject for the whole
page.

The <button> tag defines a clickable button.

Question 9. Which of the following statements BEST describes the difference between discrete
and continuous data types?

(A) Continuous data is a numerical type of data that includes numbers with fixed data values
determined by counting. Discrete data includes complex numbers and varying data values that
are measured over a specific time interval
(B) Discrete data is a numerical type of data that includes numbers with fixed data values
determined by counting. Continuous data includes complex numbers and varying data values that
are measured over a specific time interval
(C) Continuous data is an alphanumeric type of data that includes characters with fixed data values determined by counting. Discrete data includes complex characters and varying data values that are measured over a specific time interval
(D) Discrete data is an alphanumeric type of data that includes characters with fixed data values determined by counting. Continuous data includes complex characters and varying data values that are measured over a specific time interval

Explanation 9. The correct answer is: Discrete data is a numerical type of data that
includes numbers with fixed data values determined by counting. Continuous data includes
complex numbers and varying data values that are measured over a specific time interval.

Question 10. A business analyst requests an analysis of data to display a table of all of a company’s clothing products that were sold in the UK in June, compare the sales figures with those in September, and then compare them with other product sales in the UK over the same period.

Which of the following methods enables the data scientist to extract and query data in order to
analyze it from different angles?

(A) Online transactional processing


(B) Online query processing
(C) Online analytical processing
(D) Online standard processing

Explanation 10. The correct answer is: Online analytical processing

Online Analytical Processing (OLAP) is a method that enables users to easily and selectively
extract and query data in order to analyze it from different angles. OLAP queries help with trend
analysis, financial reporting, sales forecasting, budgeting, and other planning purposes, among
other things.

OLAP is used for business analyses, including planning, budgeting, forecasting, and data mining. It deals with few queries, but they are complex and involve a lot of data (for example, aggregate queries). It mainly uses the SELECT statement.

Online transaction processing is incorrect. Online transaction processing (OLTP) captures,


stores, and processes data from transactions in real-time. An OLTP database stores and manages
data related to everyday operations within a system or a company. However, OLTP is focused on
transaction-oriented tasks.

OLTP typically deals with query processing (inserting, updating, deleting data in a database),
and maintaining data integrity and effectiveness when dealing with numerous transactions
simultaneously. Each transaction involves individual database records made up of multiple fields
or columns. Examples include banking and credit card activity or retail checkout scanning.

Online query processing and Online standard processing are incorrect as these are
imaginary terms.
CHAPTER 2
DATA MINING
Questions 11-30
Question 11. A database administrator designs a new database with two tables.

1. Employee
2. Department
The Employee table has the following columns:

1. Employee_birthdate
2. Employee_Year_Of_Registration_In_Company
3. Employee_Total_Years_Of_Registration

Which of the following refers to the situation in which there is data in the database that can be removed without losing information?
(A) Non-parametric data
(B) Redundant data
(C) Duplicate data
(D) Invalid data

Question 12. An online shop sends orders based on the zip code customers type during the checkout process. Many orders never reach the customers because customers mistype the zip code in the input field.

Which of the following solutions SHOULD the online shop implement to solve this issue?
(A) Non-parametric data
(B) Data outliers
(C) Data type validation
(D) Specification mismatch

Question 13. A web analyst wants to use automated bots to crawl through the internet and
extract data from targeted sites. He wants the collected data to be delivered in a readable form
such as CSV for further analysis.

Which of the following data collection methods is the MOST appropriate method to employ?

(A) Public databases


(B) Application programming interface
(C) Web scraping
(D) Survey

Question 14. A web developer wants to ensure that malicious users can’t type SQL statements when they are asked for input, like their username/userid. Which of the following query optimization techniques would effectively prevent SQL Injection attacks?

(A) Indexing
(B) Temporary table in the query set
(C) Parametrization
(D) Subset of records
Question 15. Which of the following SQL commands will be used to prevent SQL injection
attacks?

(A) SELECT ? FROM ExamsDigest_Courses WHERE certName= * ORDER BY date


(B) SELECT * FROM ExamsDigest_Courses WHERE certName= ? ORDER BY date
(C) SELECT ? FROM ExamsDigest_Courses WHERE certName="param" ORDER BY date
(D) SELECT * FROM ExamsDigest_Courses WHERE certName="param" ORDER BY date

Question 16. Consider the following dataset which contains information about houses that are
for sale.

Which of the following string manipulation commands will combine the address and region
name columns to create a full address?

full_address

-----------------------------------------
85 Turner St, Northern Metropolitan
25 Bloomburg St, Northern Metropolitan
5 Charles St, Northern Metropolitan
40 Federation La, Northern Metropolitan
55a Park St, Northern Metropolitan

(A) SELECT CONCAT(address, ', ', regionname) AS full_address
FROM melb
LIMIT 5;
(B) SELECT CONCAT(address, '- ', regionname) AS full_address
FROM melb
LIMIT 5;
(C) SELECT CONCAT(regionname, '- ', address) AS full_address
FROM melb
LIMIT 5;
(D) SELECT CONCAT(regionname, ', ', address) AS full_address
FROM melb
LIMIT 5;

Question 17. An online shop wants to expand its customer base using SMS marketing campaigns. The online shop already has a customer database with the following columns.
1. First_Name
2. Last_Name
3. Email

The head of marketing wants to add one more column titled “Phone_Numbers”. Also, he wants to match the customers’ names with the respective phone numbers.

Which of the following data manipulation techniques would the head of marketing use to add the
phone numbers into the database without affecting the existing data?

(A) Imputation
(B) Data append
(C) Transpose
(D) Normalize data
Question 18. Consider the following data sets.

Which of the following Logical functions would return TRUE?

(A) =AND(A2="Bananas", B3>C3)


(B) =AND(A3="Apples", B4<=C3)
(C) =AND(A2="Bananas", B2>C2)
(D) =AND(A5="Oranges", B5=C5)

Question 19. Consider the following dataset which contains information about houses that are
for sale.
Which of the following string manipulation commands will replace the “St” characters in the
address column with the word “Street”?

(A) SELECT address, REPLACE(address, 'St', 'Street') AS new_address
FROM melb
LIMIT 5;
(B) SELECT address, REPLACE('St', 'Street') AS new_address
FROM melb
LIMIT 5;
(C) SELECT address, REPLACE('Street', 'St') AS new_address
FROM melb
LIMIT 5;
(D) SELECT address, REPLACE(address, 'Street', 'St') AS new_address
FROM melb
LIMIT 5;

Question 20. Which of the following data integration processes combines data from multiple
data sources into a single, consistent data store that is loaded into a data warehouse?

(A) Extract, transform, load (ETL)


(B) Extract, load, transform (ELT)
(C) Delta load
(D) Application programming interfaces (APIs)
Question 21. A data analyst wants to create “Income Categories” that would be calculated based
on the existing variable “Income”.

The “Income Categories” would be as follows:


1. Income category 1: less than $1
2. Income category 2: more than $1 and less than $20,000
3. Income category 3: more than $20,001 and less than $40,000
4. Income category 4: more than $40,001

Which of the following data manipulation techniques should the data analyst use to create
“Income Categories”?

(A) Derived variables


(B) Data merge
(C) Data blending
(D) Data append

Question 22. Which of the following Date function commands would effectively calculate the
number of days, months, or years between two dates?

(A) MONTH()
(B) DATEDIF()
(C) DATEVALUE()
(D) DAYS()

Question 23. Consider the following data set.


Which of the following data manipulation techniques would arrange the data set in decreasing order of total marks, so that the student who scored the highest marks is in the top row and the student who scored the lowest marks is in the last row?

(A) CURDATE()
(B) Filtering
(C) Sorting
(D) IsEmpty()

Question 24. A data analyst wants to show and hide information on his sheet based on selected
criteria. Which of the following data manipulation techniques is most appropriate in this case?

(A) Sorting
(B) Parametrization
(C) Filtering
(D) Indexing

Question 25. A web analyst wants to display a chart of daily hits on the keyword “it certifications” for four different states with the help of Google Trends and Python.

Which of the following data collection methods will effectively request data from Google Trends and display it in a chart?

(A) Application programming interface (API)


(B) Survey
(C) Web scraping
(D) Public databases
Question 26. The human resources (HR) department of the ACME Corporation needs to monitor
changes in employee satisfaction over time. Which of the following data collection methods will
BEST help the HR department to monitor the changes?

(A) Web scraping


(B) Application programming interface
(C) Web services
(D) Survey

Question 27. A SQL database administrator wants to display the total number of employees.
Which of the following SQL commands should the administrator use to display the total
number?

(A) SELECT COUNT(+) FROM HumanResources.Employee; GO


(B) SELECT COUNT(NUM) FROM HumanResources.Employee; GO
(C) SELECT COUNT(*) FROM HumanResources.Employee; GO
(D) SELECT COUNT(NUM *) FROM HumanResources.Employee; GO

Question 28. A data scientist wants to see which products make the most money and which
products attract the most customer purchasing interest in their company.

Which of the following data manipulation techniques would he use to obtain this information?

(A) Data append


(B) Normalize data
(C) Data merge
(D) Data blending
Question 29. Which of the following data manipulation techniques improves the accuracy and
integrity of your data while ensuring that your database is easier to navigate?

(A) Data append


(B) Data blending
(C) Normalize data
(D) Data merge

Question 30. A database administrator is responsible for optimizing data queries from 800ms to 300ms by using indexing.

Which of the following techniques does indexing use to make columns faster to query?

(A) Duplicate the table where data is stored with less data
(B) Sort the data alphabetically
(C) Create pointers where data is stored within a database
(D) Filter the data using logical functions
Answers 11-30
Question 11. A database administrator designs a new database with two tables.

1. Employee
2. Department
The Employee table has the following columns:

1. Employee_birthdate
2. Employee_Year_Of_Registration_In_Company
3. Employee_Total_Years_Of_Registration

Which of the following refers to the situation in which there is data in the database that can be removed without losing information?
(A) Non-parametric data
(B) Redundant data
(C) Duplicate data
(D) Invalid data

Explanation 11. The correct answer is: Redundant data

In relational database design, the term data redundancy refers to the situation in which there is data in the database that can be removed without losing information. Consider our case: for each employee there is a birthdate, a year of registration in the company, and the total number of years in the company. The last piece of information can be derived from the other two, and so is redundant.
Question 12. An online shop sends orders based on the zip code customers type during the checkout process. Many orders never reach the customers because customers mistype the zip code in the input field.

Which of the following solutions SHOULD the online shop implement to solve this issue?
(A) Non-parametric data
(B) Data outliers
(C) Data type validation
(D) Specification mismatch

Explanation 12. The correct answer is: Data type validation

Data validation refers to the process of ensuring the accuracy and quality of data. It is
implemented by building several checks into a system or report to ensure the logical consistency
of input and stored data. In automated systems, data is entered with minimal or no human
supervision. Therefore, it is necessary to ensure that the data that enters the system is correct and
meets the desired quality standards. The data will be of little use if it is not entered properly and
can create bigger downstream reporting issues.
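As a minimal Python sketch of this kind of check (a five-digit zip format is assumed purely for illustration), the input is rejected before it ever reaches the database:

import re

def is_valid_zip(value: str) -> bool:
    """Accept only exactly five digits."""
    return re.fullmatch(r"\d{5}", value.strip()) is not None

print(is_valid_zip("10115"))  # True
print(is_valid_zip("1O115"))  # False - letter O instead of zero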

Question 13. A web analyst wants to use automated bots to crawl through the internet and
extract data from targeted sites. He wants the collected data to be delivered in a readable form
such as CSV for further analysis.

Which of the following data collection methods is the MOST appropriate method to employ?

(A) Public databases


(B) Application programming interface
(C) Web scraping
(D) Survey
Explanation 13. The correct answer is: Web scraping

Web scraping is a process of using automated bots to crawl through the internet and extract data.
The bots collect information by first breaking down the targeted site to its most basic form,
HTML text, then scan through to gather data according to some preset parameters. After that, the
collected data is delivered in CSV or Excel format, so it is readable for whoever wants to use it.
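A hedged Python sketch of that flow is below; the URL and the CSS selector are hypothetical, and it assumes the third-party requests and beautifulsoup4 packages are installed:

import csv
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/listings").text  # fetch the raw HTML
soup = BeautifulSoup(html, "html.parser")                 # break the page down
rows = [[item.get_text(strip=True)] for item in soup.select("h2")]

with open("listings.csv", "w", newline="") as f:          # deliver as CSV
    csv.writer(f).writerows([["title"], *rows])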

Question 14. A web developer wants to ensure that malicious users can’t type SQL statements when they are asked for input, like their username/userid. Which of the following query optimization techniques would effectively prevent SQL Injection attacks?

(A) Indexing
(B) Temporary table in the query set
(C) Parametrization
(D) Subset of records

Explanation 14. The correct answer is: Parametrization

Parameterized SQL queries allow you to place parameters in an SQL query instead of a constant
value. A parameter takes a value only when the query is executed, allowing the query to be
reused with different values and purposes. Parameterized SQL statements are available in some
analysis clients, and are also available through the Historian SDK.

For example, you could create the following conditional SQL query, which contains a parameter for the course name:

SELECT * FROM ExamsDigest WHERE coursename = ? ORDER BY tagname

SQL Injection is best prevented through the use of parameterized queries.
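Here is a minimal sketch with Python's built-in sqlite3 module (the table is illustrative): the user-supplied value is passed as a parameter rather than spliced into the SQL string, so an injection payload is treated as plain data.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE courses (certName TEXT, date TEXT)")
conn.execute("INSERT INTO courses VALUES ('CompTIA Data+', '2022-01-01')")

user_input = "x'; DROP TABLE courses; --"  # a malicious attempt
cur = conn.execute(
    "SELECT * FROM courses WHERE certName = ? ORDER BY date", (user_input,))
print(cur.fetchall())  # [] - the payload did not execute as SQL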


Question 15. Which of the following SQL commands will be used to prevent SQL injection
attacks?

(A) SELECT ? FROM ExamsDigest_Courses WHERE certName= * ORDER BY date


(B) SELECT * FROM ExamsDigest_Courses WHERE certName= ? ORDER BY date
(C) SELECT ? FROM ExamsDigest_Courses WHERE certName="param" ORDER BY date
(D) SELECT * FROM ExamsDigest_Courses WHERE certName="param" ORDER BY date

Explanation 15. The correct answer is: SELECT * FROM ExamsDigest_Courses WHERE
certName= ? ORDER BY date

Parameterized SQL queries allow you to place parameters in an SQL query instead of a constant
value. A parameter takes a value only when the query is executed, which allows the query to be
reused with different values and for different purposes. Parameterized SQL statements are
available in some analysis clients, and are also available through the Historian SDK.

For example, you could create the following conditional SQL query, which contains a parameter for the certification name:

SELECT * FROM ExamsDigest_Courses WHERE certName= ? ORDER BY date

If your analysis client were to pass the parameter CompTIA Data+ along with the query, it would look like this when executed:

SELECT * FROM ExamsDigest_Courses WHERE certName='CompTIA Data+' ORDER BY date

The benefit of parameterized SQL queries is that you can prepare them ahead of time and reuse
them for similar applications without having to create distinct SQL queries for each case.
SQL Injection is best prevented through the use of parameterized queries as well.

Question 16. Consider the following dataset which contains information about houses that are
for sale.

Which of the following string manipulation commands will combine the address and region
name columns to create a full address?

full_address

-----------------------------------------
85 Turner St, Northern Metropolitan
25 Bloomburg St, Northern Metropolitan
5 Charles St, Northern Metropolitan
40 Federation La, Northern Metropolitan
55a Park St, Northern Metropolitan

(A) SELECT CONCAT(address, ', ', regionname) AS full_address
FROM melb
LIMIT 5;
(B) SELECT CONCAT(address, '- ', regionname) AS full_address
FROM melb
LIMIT 5;
(C) SELECT CONCAT(regionname, '- ', address) AS full_address
FROM melb
LIMIT 5;
(D) SELECT CONCAT(regionname, ', ', address) AS full_address
FROM melb
LIMIT 5;

Explanation 16. The correct answer is:

SELECT CONCAT(address, ', ', regionname) AS full_address
FROM melb
LIMIT 5;

String manipulation (or string handling) is the process of changing, parsing, splicing, pasting, or
analyzing strings. SQL is used for managing data in a relational database.

The CONCAT() function adds two or more strings together.

Syntax:
CONCAT(string1, string2, ..., string_n)

Parameter Values:
string1, string2, ..., string_n - Required. The strings to add together
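For comparison, a minimal pandas sketch of the same concatenation (column names follow the question; the rows are sample data):

import pandas as pd

melb = pd.DataFrame({"address": ["85 Turner St", "25 Bloomburg St"],
                     "regionname": ["Northern Metropolitan"] * 2})
melb["full_address"] = melb["address"] + ", " + melb["regionname"]
print(melb["full_address"])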

Question 17. An online shop wants to expand its customer base using SMS marketing campaigns. The online shop already has a customer database with the following columns.
1. First_Name
2. Last_Name
3. Email

The head of marketing wants to add one more column titled “Phone_Numbers”. Also, he wants to match the customers’ names with the respective phone numbers.

Which of the following data manipulation techniques would the head of marketing use to add the
phone numbers into the database without affecting the existing data?

(A) Imputation
(B) Data append
(C) Transpose
(D) Normalize data

Explanation 17. The correct answer is: Data append

Data append is a process that involves adding new data elements to an existing database. An example of a common data append would be the enhancement of a company’s customer files. A data append takes the information a company has and matches it against a larger database of business data, allowing the desired missing data fields to be added.

Imputation is incorrect. Imputation is the process of replacing missing data with substituted
values.

Transpose is incorrect. Transposing data is where the data in the rows are turned into columns,
and the data in the columns is turned into rows.

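For example, a minimal pandas sketch with made-up data: DataFrame.T swaps rows and columns.

import pandas as pd

df = pd.DataFrame({"Q1": [100, 75], "Q2": [120, 90]},
                  index=["shirts", "hoodies"])
print(df.T)  # quarters become the rows, products become the columns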
Normalize data is incorrect. Data normalization is the process of structuring your relational
customer database, following a series of normal forms. This improves the accuracy and integrity
of your data while ensuring that your database is easier to navigate.

Question 18. Consider the following data sets.

Which of the following Logical functions would return TRUE?

(A) =AND(A2="Bananas", B3>C3)


(B) =AND(A3="Apples", B4<=C3)
(C) =AND(A2="Bananas", B2>C2)
(D) =AND(A5="Oranges", B5=C5)
Explanation 18. The correct answer is: =AND(A2="Bananas", B2>C2)

Microsoft Excel provides 4 logical functions to work with the logical values. The functions are
AND, OR, XOR, and NOT. You use these functions when you want to carry out more than one
comparison in your formula or test multiple conditions instead of just one. As well as logical
operators, Excel logical functions return either TRUE or FALSE when their arguments are
evaluated.

The following table provides a short summary of what each logical function does to help you
choose the right formula for a specific task.

Function: AND. Returns TRUE if all of the arguments evaluate to TRUE. Formula example: =AND(A2>=10, B2<5). The formula returns TRUE if a value in cell A2 is greater than or equal to 10 and a value in B2 is less than 5; FALSE otherwise.

Function: OR. Returns TRUE if any argument evaluates to TRUE. Formula example: =OR(A2>=10, B2<5). The formula returns TRUE if A2 is greater than or equal to 10 or B2 is less than 5, or both conditions are met. If neither of the conditions is met, the formula returns FALSE.

Function: XOR. Returns a logical Exclusive Or of all arguments. Formula example: =XOR(A2>=10, B2<5). The formula returns TRUE if either A2 is greater than or equal to 10 or B2 is less than 5. If neither of the conditions is met or both conditions are met, the formula returns FALSE.

Function: NOT. Returns the reversed logical value of its argument, i.e. if the argument is FALSE, then TRUE is returned and vice versa. Formula example: =NOT(A2>=10). The formula returns FALSE if a value in cell A2 is greater than or equal to 10; TRUE otherwise.
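The same four tests can be mirrored in a short Python sketch (the cell values are made up):

a2, b2 = 12, 7

print(a2 >= 10 and b2 < 5)     # AND -> False (the second condition fails)
print(a2 >= 10 or b2 < 5)      # OR  -> True
print((a2 >= 10) != (b2 < 5))  # XOR -> True (exactly one condition holds)
print(not a2 >= 10)            # NOT -> False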

Question 19. Consider the following dataset which contains information about houses that are
for sale.
Which of the following string manipulation commands will replace the “St” characters in the
address column with the word “Street”?

(A) SELECT address, REPLACE(address, 'St', 'Street') AS new_address
FROM melb
LIMIT 5;
(B) SELECT address, REPLACE('St', 'Street') AS new_address
FROM melb
LIMIT 5;
(C) SELECT address, REPLACE('Street', 'St') AS new_address
FROM melb
LIMIT 5;
(D) SELECT address, REPLACE(address, 'Street', 'St') AS new_address
FROM melb
LIMIT 5;

Explanation 19. The correct answer is:

SELECT address, REPLACE(address, 'St', 'Street') AS new_address
FROM melb
LIMIT 5;

String manipulation (or string handling) is the process of changing, parsing, splicing, pasting, or
analyzing strings. SQL is used for managing data in a relational database.

The REPLACE() function replaces all occurrences of a substring within a string, with a new
substring.

Note: The search is case-insensitive.


Syntax:
REPLACE(string, old_string, new_string)

Parameter Values:
string - Required. The original string
old_string - Required. The string to be replaced
new_string - Required. The new replacement string
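A minimal pandas sketch of the same replacement (sample rows; note that str.replace here is case-sensitive, unlike the SQL function described above):

import pandas as pd

melb = pd.DataFrame({"address": ["85 Turner St", "40 Federation La"]})
melb["new_address"] = melb["address"].str.replace("St", "Street", regex=False)
print(melb)  # only the first row contains "St", so only it changes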

Question 20. Which of the following data integration processes combines data from multiple
data sources into a single, consistent data store that is loaded into a data warehouse?

(A) Extract, transform, load (ETL)


(B) Extract, load, transform (ELT)
(C) Delta load
(D) Application programming interfaces (APIs)

Explanation 20. The correct answer is: Extract, transform, load (ETL).

ETL, which stands for extract, transform and load, is a data integration process that combines
data from multiple data sources into a single, consistent data store that is loaded into a data
warehouse or other target system.
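A toy Python sketch of the three steps (file, column, and table names are illustrative, and it assumes a sales.csv file with region and amount columns exists):

import csv
import sqlite3

with open("sales.csv") as f:                 # Extract from the source
    rows = list(csv.DictReader(f))

cleaned = [(r["region"].strip().title(),     # Transform: tidy the text,
            float(r["amount"]))              # cast amounts to numbers
           for r in rows]

warehouse = sqlite3.connect("warehouse.db")  # Load into the target store
warehouse.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, amount REAL)")
warehouse.executemany("INSERT INTO sales VALUES (?, ?)", cleaned)
warehouse.commit()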

Extract, load, transform (ELT) is incorrect. ELT stands for “Extract, Load, and Transform.”
In this process, data gets leveraged via a data warehouse in order to do basic transformations.
That means there’s no need for data staging. ELT uses cloud-based data warehousing solutions
for all different types of data – including structured, unstructured, semi-structured, and even raw
data types.

Delta load is incorrect. Delta is the incremental load between the last data load and now.

For example, if yesterday’s load inserted 50 records into your target table and today 80 new records have come to your source system, you insert only the latest 80 records into the target after checking against the target table. These 80 records are the delta records.

Application programming interfaces (APIs) is incorrect. API is the acronym for Application
Programming Interface, which is a software intermediary that allows two applications to talk to
each other. Each time you use an app like Facebook, send an instant message, or check the
weather on your phone, you’re using an API.

When you use an application on your mobile phone, the application connects to the Internet and
sends data to a server. The server then retrieves that data, interprets it, performs the necessary
actions and sends it back to your phone. The application then interprets that data and presents
you with the information you wanted in a readable way. This is what an API is – all of this
happens via API.

Question 21. A data analyst wants to create “Income Categories” that would be calculated based
on the existing variable “Income”.

The “Income Categories” would be as follows:


1. Income category 1: less than $1
2. Income category 2: more than $1 and less than $20,000
3. Income category 3: more than $20,001 and less than $40,000
4. Income category 4: more than $40,001

Which of the following data manipulation techniques should the data analyst use to create
“Income Categories”?

(A) Derived variables


(B) Data merge
(C) Data blending
(D) Data append

Explanation 21. The correct answer is: Derived variables

Derived variables are variables that you create by calculating or categorizing variables that
already exist in your data set.
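A minimal pandas sketch deriving the four income categories above from an existing Income column (the income values are made up):

import pandas as pd

df = pd.DataFrame({"Income": [0, 15000, 25000, 52000]})
df["Income_Category"] = pd.cut(
    df["Income"],
    bins=[float("-inf"), 1, 20000, 40000, float("inf")],
    labels=[1, 2, 3, 4])
print(df)  # each row is tagged with its derived category 1-4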

Data merge is incorrect. Data merging is the process of combining two or more data sets into a
single data set.

Data blending is incorrect. Data blending involves pulling data from different sources and
creating a single, unique, dataset for visualization and analysis.

Data append is incorrect. A data append is a process that involves adding new data elements to
an existing database.

Question 22. Which of the following Date function commands would effectively calculate the
number of days, months, or years between two dates?

(A) MONTH()
(B) DATEDIF()
(C) DATEVALUE()
(D) DAYS()
Explanation 22. The correct answer is: DATEDIF()

Calculates the number of days, months, or years between two dates.

Examples:

Start_date: 1/1/2001; End_date: 1/1/2003; Formula: DATEDIF(Start_date,End_date,"Y") - two complete years in the period (2)

Start_date: 6/1/2001; End_date: 8/15/2002; Formula: DATEDIF(Start_date,End_date,"D") - 440 days between June 1, 2001, and August 15, 2002 (440)

Start_date: 6/1/2001; End_date: 8/15/2002; Formula: DATEDIF(Start_date,End_date,"YD") - 75 days between June 1 and August 15, ignoring the years of the dates (75)
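The "D" case can be checked with a minimal sketch using Python's standard library (dates taken from the second example row):

from datetime import date

start, end = date(2001, 6, 1), date(2002, 8, 15)
print((end - start).days)  # 440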
Question 23. Consider the following data set.

Which of the following data manipulation techniques would arrange the data set in decreasing order of total marks, so that the student who scored the highest marks is in the top row and the student who scored the lowest marks is in the last row?

(A) CURDATE()
(B) Filtering
(C) Sorting
(D) IsEmpty()

Explanation 23. The correct answer is: Sorting

Sorting lets you organize all or part of your data in ascending or descending order. Note that you
cannot undo a sort after it has been saved so you’ll want to make sure that all of your rows in
your sheet, including parent rows in a hierarchy, are ordered the way you want before saving.
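A minimal pandas sketch (the columns and scores are made up): sorting by total marks in descending order puts the highest scorer in the top row.

import pandas as pd

marks = pd.DataFrame({"student": ["Ana", "Ben", "Cleo"],
                      "total_marks": [71, 88, 64]})
print(marks.sort_values("total_marks", ascending=False))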

Question 24. A data analyst wants to show and hide information on his sheet based on selected
criteria. Which of the following data manipulation techniques is most appropriate in this case?

(A) Sorting
(B) Parametrization
(C) Filtering
(D) Indexing

Explanation 24. The correct answer is: Filtering


Filters allow you to show or hide information on your sheet based on selected criteria. They’re
useful because they don’t change the overall layout of your sheet. You can also save filters and
share them with anyone who is shared to the sheet. You can even set default filters on your sheet
so that when shared users open that sheet, they see the same view.
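A minimal pandas sketch of filtering (same made-up data as above): rows that fail the selected criterion are hidden from the view without changing the underlying data.

import pandas as pd

marks = pd.DataFrame({"student": ["Ana", "Ben", "Cleo"],
                      "total_marks": [71, 88, 64]})
print(marks[marks["total_marks"] >= 70])  # Cleo's row is filtered out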

Question 25. A web analyst wants to display a chart of daily hits on the keyword “it certifications” for four different states with the help of Google Trends and Python.

Which of the following data collection methods will effectively request data from Google Trends and display it in a chart?

(A) Application programming interface (API)


(B) Survey
(C) Web scraping
(D) Public databases

Explanation 25. The correct answer is: Application programming interface (API)


An increasingly popular method for collecting data online is via a Representational State Transfer Application Programming Interface (REST API), or simply API. Google makes many APIs available for public use. One such API is Google Trends, which provides data on the query activity of any keyword you can think of. Basically, this data tells you what topics people are interested in getting information on.

Python can be used for API calls. For example, to access the Google Trends API we can use the Python library pytrends. Here we plot trends for the keywords “CompTIA” and “ExamsDigest”:

from pytrends.request import TrendReq
import pandas as pd
from matplotlib import pyplot
import time

startTime = time.time()
pytrend = TrendReq(hl='en-US', tz=360)

keywords = ['CompTIA','ExamsDigest']
pytrend.build_payload(
kw_list=keywords,
cat=0,
timeframe='2020-02-01 2020-07-01',
geo='US',
gprop='')

data = pytrend.interest_over_time()
data.plot(title="Google Search Trends")
pyplot.show()

Question 26. The human resources (HR) department of the ACME Corporation needs to monitor
changes in employee satisfaction over time. Which of the following data collection methods will
BEST help the HR department to monitor the changes?

(A) Web scraping


(B) Application programming interface
(C) Web services
(D) Survey

Explanation 26. The correct answer is: Survey

A survey is defined as the act of examining a process or questioning a selected sample of


individuals to obtain data about a service, product, or process. Data collection surveys collect
information from a targeted group of people about their opinions, behavior, or knowledge.
Common types of example surveys are written questionnaires, face-to-face or telephone
interviews, focus groups, and electronic (e-mail or website) surveys.

It is helpful to use surveys when:


1. Identifying customer requirements or preferences
2. Assessing customer or employee satisfaction, such as identifying or prioritizing problems to
address
3. Evaluating proposed changes
4. Assessing whether a change was successful
5. Monitoring changes in customer or employee satisfaction over time

Question 27. A SQL database administrator wants to display the total number of employees.
Which of the following SQL commands should the administrator use to display the total
number?

(A) SELECT COUNT(+) FROM HumanResources.Employee; GO


(B) SELECT COUNT(NUM) FROM HumanResources.Employee; GO
(C) SELECT COUNT(*) FROM HumanResources.Employee; GO
(D) SELECT COUNT(NUM *) FROM HumanResources.Employee; GO

Explanation 27. The correct answer is: SELECT COUNT(*) FROM


HumanResources.Employee; GO

An aggregate function performs a calculation on a set of values and returns a single value.

Transact-SQL provides the following aggregate functions:

APPROX_COUNT_DISTINCT
AVG
CHECKSUM_AGG
COUNT
COUNT_BIG
GROUPING
GROUPING_ID
MAX
MIN
STDEV
STDEVP
STRING_AGG
SUM
VAR
VARP

The rest of the commands are incorrect, as they have invalid syntax.
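A minimal sketch with Python's sqlite3 (the table name is shortened from the question, and SQLite has no GO batch separator, so it is omitted):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (name TEXT)")
conn.executemany("INSERT INTO Employee VALUES (?)",
                 [("Ana",), ("Ben",), ("Cleo",)])
total = conn.execute("SELECT COUNT(*) FROM Employee").fetchone()[0]
print(total)  # 3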

Question 28. A data scientist wants to see which products make the most money and which
products attract the most customer purchasing interest in their company.

Which of the following data manipulation techniques would he use to obtain this information?

(A) Data append


(B) Normalize data
(C) Data merge
(D) Data blending

Explanation 28. The correct answer is: Data blending

Data blending is combining multiple data sources to create a single, new dataset, which can be
presented visually in a dashboard or other visualization and can then be processed or analyzed.
Enterprises get their data from a variety of sources, and users may want to temporarily bring
together different datasets to compare data relationships or answer a specific question.

Data append is incorrect. Data append is a process that involves adding new data elements to an existing database. An example of a common data append would be the enhancement of a company’s customer files. A data append takes the information a company has and matches it against a larger database of business data, allowing the desired missing data fields to be added.

Normalize data is incorrect. Data normalization is the process of structuring your relational
customer database, following a series of normal forms. This improves the accuracy and integrity
of your data while ensuring that your database is easier to navigate.

Data merge is incorrect. Data merging is the process of combining two or more data sets into a
single data set.

Question 29. Which of the following data manipulation techniques improves the accuracy and
integrity of your data while ensuring that your database is easier to navigate?

(A) Data append


(B) Data blending
(C) Normalize data
(D) Data merge

Explanation 29. The correct answer is: Normalize data

Data normalization is the process of structuring your relational customer database, following a
series of normal forms. This improves the accuracy and integrity of your data while ensuring that
your database is easier to navigate.
Data append is incorrect. Data append is a process that involves adding new data elements to an existing database. An example of a common data append would be the enhancement of a company’s customer files. A data append takes the information a company has and matches it against a larger database of business data, allowing the desired missing data fields to be added.

Data blending is incorrect. Data blending is combining multiple data sources to create a single,
new dataset, which can be presented visually in a dashboard or other visualization and can then
be processed or analyzed.

Enterprises get their data from a variety of sources, and users may want to temporarily bring
together different datasets to compare data relationships or answer a specific question.

Data merge is incorrect. Data merging is the process of combining two or more data sets into a
single data set.

Question 30. A database administrator is responsible for optimizing data queries from 800ms to
300ms by using indexing.

Which of the following techniques does indexing use to make columns faster to query?

(A) Duplicate the table where data is stored with less data
(B) Sort the data alphabetically
(C) Create pointers where data is stored within a database
(D) Filter the data using logical functions
Explanation 30. The correct answer is: Create pointers where data is stored within a
database

Indexing makes columns faster to query by creating pointers to where data is stored within a
database.

Indexes allow us to create sorted lists without having to create all new sorted tables, which
would take up a lot of storage space.

An index is a structure that holds the field the index is sorting and a pointer from each record to
their corresponding record in the original table where the data is actually stored. Indexes are used
in things like a contact list where the data may be physically stored in the order you add people’s
contact information but it is easier to find people when listed out in alphabetical order.
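
As a minimal sketch of this idea (the customers table below is hypothetical), an index on a
frequently queried column can be created in SQLite from Python:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, last_name TEXT)")

# The index keeps last_name in sorted order with pointers back to the rows,
# so lookups by last_name no longer require a full table scan.
conn.execute("CREATE INDEX idx_customers_last_name ON customers (last_name)")
rows = conn.execute("SELECT id FROM customers WHERE last_name = 'Smith'").fetchall()
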
CHAPTER 3
DATA ANALYSIS
Questions 31-50

Question 31. A junior web developer wants to develop a new eCommerce app using Python for
the back end, MongoDB for the database, and Vue.js for the front end. He wants to create a
simple list called apparel with three list items ("t-shirts", "hoodies", "jackets").

Which of the following commands does he need to type to create the list in Python?
(A) apparel = ["t-shirts", "hoodies", "jackets"]
(B) apparel = {"t-shirts", "hoodies", "jackets"}
(C) apparel = "t-shirts", "hoodies", "jackets"
(D) apparel = ("t-shirts", "hoodies", "jackets")

Question 32. A data analyst wants to make a rough comparison of two graphs of variability,
considering only the most extreme cases.

Which of the following measures of dispersion would effectively make the comparison?
(A) Distribution
(B) Range
(C) Variance
(D) Standard deviation

Question 33. Consider this dataset showing the retirement age of 11 people, in whole years:

54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
Which of the following values is the measure of central tendency “median”?
(A) 54
(B) 55
(C) 56
(D) 57

Question 34. Which of the following measures of dispersion would help analysts to measure
market and security volatility — and predict performance trends?
(A) Range
(B) Distribution
(C) Variance
(D) Standard deviation

Question 35. Which of the following inferential statistical methods determines whether there is a
significant difference between the means of two groups?
(A) t-tests
(B) Z-score
(C) p-values
(D) Chi-squared

Question 36. A forecasting analyst wants to predict sales for a company based on weather,
previous sales, and GDP growth.

Which of the following inferential statistical methods is MOST suitable for this case?
(A) Regression
(B) t-tests
(C) Correlation
(D) Chi-squared

Question 37. Consider this dataset showing the retirement age of 11 people, in whole years:

54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60

This table shows a simple frequency distribution of the retirement age data.

Age Frequency

54 3

55 1

56 1

57 2

58 2
60 2

Which of the following values is the measure of central tendency “mode”?
(A) 54
(B) 55
(C) 56
(D) 57

Question 38. Which of the following inferential statistical methods is a statistical test used to
compare observed results with expected results?
(A) t-tests
(B) Z-score
(C) p-values
(D) Chi-squared

Question 39. Suppose that we have observed the following n = 5 resting pulse rates: 64, 68, 74,
76, 78

Which of the following values is the variance of the resting pulse rates?
(A) 27.1
(B) 27.2
(C) 27.3
(D) 27.4

Question 40. An online payment company wants to develop a new fraud detection system to
protect customers against disputes and fraudulent payments.

Which of the following data analytics tools is MOST suitable for developing the fraud detection
system?

(A) SPSS Modeler
(B) Tableau
(C) Qlik
(D) Dataroma

Question 41. Which of the following inferential statistical methods is a number describing how
likely it is that your data would have occurred by random chance?
(A) t-tests
(B) Z-score
(C) p-values
(D) Chi-squared

Question 42. A data analyst is looking for software that includes accounting and budgeting
templates for easy use and built-in calculating and formula features to organize and synthesize
results.

Which of the following data analytics tools is MOST suitable in this case?
(A) R
(B) Microsoft Excel
(C) IBM Cognos
(D) Minitab
Question 43. One of the benefits of using Amazon QuickSight is:
(A) QuickSight dashboards can be accessed only from mobile devices
(B) Scale from one to thousands of users
(C) Subscription-based charges
(D) Embed BI dashboards in your applications

Question 44. Type I error is the mistaken rejection of the null hypothesis, also known as a “false
positive”, while a type II error is the mistaken acceptance of the null hypothesis, also known as a
“false negative”. (TRUE/FALSE)
(A) TRUE
(B) FALSE

Question 45. Consider this dataset showing the retirement age of 11 people, in whole years:

54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60

Which of the following values is the measure of central tendency “mean”?
(A) 56.3
(B) 56.6
(C) 56
(D) 56.9

Question 46. Which of the following data analytics tools is a cloud-based platform designed to
provide direct, simplified, real-time access to business data for decision-makers across the
company with minimal IT involvement?
(A) Snowflake
(B) OLTP
(C) Domo
(D) Delta load

Question 47. Which of the following values is the measure of dispersion “range” between the
scores of ten students in a test?

The scores of ten students in a test are 17, 23, 30, 36, 45, 51, 58, 66, 72, 77.
(A) 60
(B) 70
(C) 80
(D) 90

Question 48. Which of the following inferential statistical methods is a numerical measurement
that describes a value’s relationship to the mean of a group of values?
(A) t-tests
(B) Z-score
(C) p-values
(D) Chi-squared
Question 49. Which of the following statistical methods refers to the probability that a
population parameter will fall between a set of values for a certain proportion of times?
(A) Confidence intervals
(B) Percent difference
(C) Percent change
(D) Frequencies/percentages

Question 50. Which of the following data analytics tools can execute the following command?

INSERT INTO Companies (CompanyName, ContactName, Address, City, PostalCode, Country)
VALUES ('ExamsDigest', 'Nick G.', 'London St. 13', 'London', '45533', 'United Kingdom');

(A) Python
(B) Microsoft Excel
(C) Structured Query Language (SQL)
(D) R
Answers 31-50
Question 31. A junior web developer wants to develop a new eCommerce app using Python for
the back end, MongoDB for the database, and Vue.js for the front end. He wants to create a
simple list called apparel with three list items ("t-shirts", "hoodies", "jackets").

Which of the following commands does he need to type to create the list in Python?
(A) apparel = ["t-shirts", "hoodies", "jackets"]
(B) apparel = {"t-shirts", "hoodies", "jackets"}
(C) apparel = "t-shirts", "hoodies", "jackets"
(D) apparel = ("t-shirts", "hoodies", "jackets")

Explanation 31. The correct answer is: apparel = ["t-shirts", "hoodies", "jackets"]

Lists are used to store multiple items in a single variable.

Lists are one of 4 built-in data types in Python used to store collections of data, the other 3 are
Tuple, Set, and Dictionary, all with different qualities and usage.

Lists are created using square brackets:

apparel = ["t-shirts", "hoodies", "jackets"]
print(apparel)

List items are ordered, changeable, and allow duplicate values.

List items are indexed, the first item has index [0], the second item has index [1] etc.
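
For instance, continuing the same example:

apparel = ["t-shirts", "hoodies", "jackets"]
print(apparel[0])  # t-shirts (index 0 is the first item)
print(apparel[2])  # jackets (index 2 is the third item)
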
Question 32. A data analyst wants to make a rough comparison of two graphs of variability,
considering only the most extreme cases.

Which of the following measures of dispersion would effectively make the comparison?
(A) Distribution
(B) Range
(C) Variance
(D) Standard deviation

Explanation 32. The correct answer is: Range

Range is the interval between the highest and the lowest score. Range is a measure of variability
or scatteredness of the varieties or observations among themselves and does not give an idea
about the spread of the observations around some central value.

Symbolically R = Hs – Ls. Where R = Range;

Hs is the ‘Highest score’ and Ls is the Lowest Score.

Computation of Range:

Example 1:
The scores of seven students in a test are:
10, 20, 25, 50, 80, 85, 90

In the example, the highest score is 90 and the lowest score is 10.
So the range is the difference between these two scores:
Range = 90 – 10 = 80
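
The same computation can be sketched in Python with the built-in max() and min() functions:

scores = [10, 20, 25, 50, 80, 85, 90]
print(max(scores) - min(scores))  # prints 80
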
Question 33. Consider this dataset showing the retirement age of 11 people, in whole years:

54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60

Which of the following values is the measure of central tendency “median”?
(A) 54
(B) 55
(C) 56
(D) 57

Explanation 33. The correct answer is: 57

A measure of central tendency (also referred to as measures of centre or central location) is a
summary measure that attempts to describe a whole set of data with a single value that represents
the middle or centre of its distribution.

There are three main measures of central tendency: the mode, the median and the mean. Each
of these measures describes a different indication of the typical or central value in the
distribution.

What is the median?


The median is the middle value in distribution when the values are arranged in ascending or
descending order.

The median divides the distribution in half (there are 50% of observations on either side of the
median value). In a distribution with an odd number of observations, the median value is the
middle value.

Looking at the retirement age distribution (which has 11 observations), the median is the middle
value, which is 57 years:
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
When the distribution has an even number of observations, the median value is the mean of the
two middle values. In the following distribution, the two middle values are 56 and 57, therefore
the median equals 56.5 years:
52, 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
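
As a quick check, Python's statistics module returns the same values:

import statistics

ages = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]
print(statistics.median(ages))         # 57 (odd number of observations)
print(statistics.median([52] + ages))  # 56.5 (mean of the two middle values)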

Question 34. Which of the following measures of dispersion would help analysts to measure
market and security volatility — and predict performance trends?
(A) Range
(B) Distribution
(C) Variance
(D) Standard deviation

Explanation 34. The correct answer is: Standard deviation

Standard deviation is a statistical measurement in finance that, when applied to the annual rate of
return of an investment, sheds light on that investment’s historical volatility. The greater the
standard deviation of securities, the greater the variance between each price and the mean, which
shows a larger price range. For example, a volatile stock has a high standard deviation, while the
deviation of a stable blue-chip stock is usually rather low.

Standard deviation is an especially useful tool in investing and trading strategies as it helps
measure market and security volatility—and predict performance trends.
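
As a brief illustration (the return figures below are hypothetical), Python's statistics module
computes a standard deviation directly:

import statistics

# Hypothetical annual returns (%) for a volatile stock and a stable blue-chip stock
volatile_stock = [12.0, -8.0, 25.0, -15.0, 30.0]
stable_stock = [4.0, 5.0, 3.5, 4.5, 5.5]

print(statistics.pstdev(volatile_stock))  # larger value: higher volatility
print(statistics.pstdev(stable_stock))    # smaller value: lower volatility
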
Question 35. Which of the following inferential statistical methods determines whether there is a
significant difference between the means of two groups?
(A) t-tests
(B) Z-score
(C) p-values
(D) Chi-squared

Explanation 35. The correct answer is: t-tests

A t-test is a type of inferential statistic used to determine if there is a significant difference
between the means of two groups, which may be related to certain features. It is mostly used
when the data sets, like the data set recorded as the outcome from flipping a coin 100 times,
would follow a normal distribution and may have unknown variances. A t-test is used as a
hypothesis testing tool, which allows testing of an assumption applicable to a population.

Z-score is incorrect. A Z-score is a numerical measurement that describes a value’s relationship
to the mean of a group of values. Z-score is measured in terms of standard deviations from the
mean. If a Z-score is 0, it indicates that the data point’s score is identical to the mean score. A Z-
score of 1.0 would indicate a value that is one standard deviation from the mean. Z-scores may
be positive or negative, with a positive value indicating the score is above the mean and a
negative score indicating it is below the mean.

p-value is incorrect. A p-value, or probability value, is a number describing how likely it is that
your data would have occurred by random chance (i.e. that the null hypothesis is true). The level
of statistical significance is often expressed as a p-value between 0 and 1. The smaller the p-
value, the stronger the evidence that you should reject the null hypothesis.

Chi-squared is incorrect. A chi-square test is a statistical test used to compare observed results
with expected results. The purpose of this test is to determine if a difference between observed
data and expected data is due to chance, or if it is due to a relationship between the variables you
are studying. Therefore, a chi-square test is an excellent choice to help us better understand and
interpret the relationship between our two categorical variables.

Question 36. A forecasting analyst wants to predict sales for a company based on weather,
previous sales, and GDP growth.

Which of the following inferential statistical methods is MOST suitable for this case?
(A) Regression
(B) t-tests
(C) Correlation
(D) Chi-squared

Explanation 36. The correct answer is: Regression

Regression is a statistical method used in finance, investing, and other disciplines that attempt to
determine the strength and character of the relationship between one dependent variable (usually
denoted by Y) and a series of other variables (known as independent variables).

Regression helps investment and financial managers to value assets and understand the
relationships between variables, such as commodity prices and the stocks of businesses dealing
in those commodities.

Regression can help finance and investment professionals as well as professionals in other
businesses. Regression can also help predict sales for a company based on weather, previous
sales, GDP growth, or other types of conditions. The capital asset pricing model (CAPM) is an
often-used regression model in finance for pricing assets and discovering costs of capital.

t-test is incorrect. A t-test is a type of inferential statistic used to determine if there is a
significant difference between the means of two groups, which may be related to certain features.
It is mostly used when the data sets, like the data set recorded as the outcome from flipping a
coin 100 times, would follow a normal distribution and may have unknown variances. A t-test is
used as a hypothesis testing tool to test an assumption applicable to a population.

Correlation is incorrect. Correlation, in the finance and investment industries, is a statistic that
measures the degree to which two securities move in relation to each other. Correlations are used
in advanced portfolio management, computed as the correlation coefficient, which has a value
that must fall between -1.0 and +1.0.

Chi-squared is incorrect. A chi-square test is a statistical test used to compare observed results
with expected results. The purpose of this test is to determine if a difference between observed
data and expected data is due to chance, or if it is due to a relationship between the variables you
are studying. Therefore, a chi-square test is an excellent choice to help us better understand and
interpret the relationship between our two categorical variables.

Question 37. Consider this dataset showing the retirement age of 11 people, in whole years:

54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60

This table shows a simple frequency distribution of the retirement age data.

Age Frequency

54 3

55 1

56 1
57 2

58 2

60 2
Which of the following values is the measure of central tendency “mode”?
(A) 54
(B) 55
(C) 56
(D) 57

Explanation 37. The correct answer is: 54

A measure of central tendency (also referred to as measures of centre or central location) is a
summary measure that attempts to describe a whole set of data with a single value that represents
the middle or centre of its distribution.

There are three main measures of central tendency: the mode, the median and the mean. Each
of these measures describes a different indication of the typical or central value in the
distribution.

What is the mode?


The mode is the most commonly occurring value in a distribution.

The most commonly occurring value is 54, therefore the mode of this distribution is 54 years.
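
As a quick check, Python's statistics module returns the same value:

import statistics

ages = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]
print(statistics.mode(ages))  # prints 54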

Question 38. Which of the following inferential statistical methods is a statistical test used to
compare observed results with expected results?
(A) t-tests
(B) Z-score
(C) p-values
(D) Chi-squared

Explanation 38. The correct answer is: Chi-squared

A chi-square test is a statistical test used to compare observed results with expected results. The
purpose of this test is to determine if a difference between observed data and expected data is due
to chance, or if it is due to a relationship between the variables you are studying. Therefore, a
chi-square test is an excellent choice to help us better understand and interpret the relationship
between our two categorical variables.

t-test is incorrect. A t-test is a type of inferential statistic used to determine if there is a
significant difference between the means of two groups, which may be related to certain features.
It is mostly used when the data sets, like the data set recorded as the outcome from flipping a
coin 100 times, would follow a normal distribution and may have unknown variances. A t-test is
used as a hypothesis testing tool to test an assumption applicable to a population.

Z-score is incorrect. A Z-score is a numerical measurement that describes a value’s relationship
to the mean of a group of values. Z-score is measured in terms of standard deviations from the
mean. If a Z-score is 0, it indicates that the data point’s score is identical to the mean score. A Z-
score of 1.0 would indicate a value that is one standard deviation from the mean. Z-scores may
be positive or negative, with a positive value indicating the score is above the mean and a
negative score indicating it is below the mean.

p-value is incorrect. A p-value, or probability value, is a number describing how likely it is that
your data would have occurred by random chance (i.e. that the null hypothesis is true). The level
of statistical significance is often expressed as a p-value between 0 and 1. The smaller the p-
value, the stronger the evidence that you should reject the null hypothesis.

Question 39. Suppose that we have observed the following n = 5 resting pulse rates: 64, 68, 74,
76, 78

Which of the following values is the variance of the resting pulse rates?
(A) 27.1
(B) 27.2
(C) 27.3
(D) 27.4

Explanation 39. The correct answer is: 27.2

The variance is a measure of variability. It is calculated by taking the average of squared
deviations from the mean.

Variance tells you the degree of spread in your data set. The more spread the data, the larger the
variance is in relation to the mean.

In our example the sample mean is calculated as:

Mean = (64 + 68 + 74 + 76 + 78) / 5 = 360 / 5 = 72

so the variance is calculated as:

Variance = [(64 - 72)² + (68 - 72)² + (74 - 72)² + (76 - 72)² + (78 - 72)²] / 5
Variance = (64 + 16 + 4 + 16 + 36) / 5
Variance = 136 / 5
Variance = 27.2
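
As a quick check, Python's statistics module computes the same population variance:

import statistics

pulse_rates = [64, 68, 74, 76, 78]
print(statistics.pvariance(pulse_rates))  # prints 27.2
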
Question 40. An online payment company wants to develop a new fraud detection system to
protect customers against disputes and fraudulent payments.

Which of the following data analytics tools is MOST suitable for developing the fraud detection
system?

(A) SPSS Modeler
(B) Tableau
(C) Qlik
(D) Dataroma

Explanation 40. The correct answer is: SPSS Modeler

IBM SPSS Modeler is a data mining and text analytics software application from IBM. It is used
to build predictive models and conduct other analytic tasks.

SPSS Modeler has been used in these and other industries:

Customer analytics and Customer relationship management (CRM)
Fraud detection and prevention
Optimizing insurance claims
Risk
Manufacturing quality improvement
Healthcare quality improvement
Forecasting demand or sales
Law enforcement and border security
Education
Telecommunications

Question 41. Which of the following inferential statistical methods is a number describing how
likely it is that your data would have occurred by random chance?
(A) t-tests
(B) Z-score
(C) p-values
(D) Chi-squared

Explanation 41. The correct answer is: p-values

A p-value, or probability value, is a number describing how likely it is that your data would have
occurred by random chance (i.e. that the null hypothesis is true). The level of statistical
significance is often expressed as a p-value between 0 and 1. The smaller the p-value, the
stronger the evidence that you should reject the null hypothesis.

t-test is incorrect. A t-test is a type of inferential statistic used to determine if there is a
significant difference between the means of two groups, which may be related to certain features.
It is mostly used when the data sets, like the data set recorded as the outcome from flipping a
coin 100 times, would follow a normal distribution and may have unknown variances. A t-test is
used as a hypothesis testing tool to test an assumption applicable to a population.

Z-score is incorrect. A Z-score is a numerical measurement that describes a value’s relationship
to the mean of a group of values. Z-score is measured in terms of standard deviations from the
mean. If a Z-score is 0, it indicates that the data point’s score is identical to the mean score. A Z-
score of 1.0 would indicate a value that is one standard deviation from the mean. Z-scores may
be positive or negative, with a positive value indicating the score is above the mean and a
negative score indicating it is below the mean.
Chi-squared is incorrect. A chi-square test is a statistical test used to compare observed results
with expected results. The purpose of this test is to determine if a difference between observed
data and expected data is due to chance, or if it is due to a relationship between the variables you
are studying. Therefore, a chi-square test is an excellent choice to help us better understand and
interpret the relationship between our two categorical variables.

Question 42. A data analyst is looking for software that includes accounting and budgeting
templates for easy use and built-in calculating and formula features to organize and synthesize
results.

Which of the following data analytics tools is MOST suitable in this case?
(A) R
(B) Microsoft Excel
(C) IBM Cognos
(D) Minitab

Explanation 42. The correct answer is: Microsoft Excel

Microsoft Excel gives businesses the tools they need to make the most of their data. At its most
basic level, Excel is an excellent tool for both data entry and storage. Excel even includes
accounting and budgeting templates for easy use. From there the software’s built-in calculating
and formula features are available to help you organize and synthesize results. Businesses often
employ multiple systems (e.g., CRM, inventory), each with its own database and logs.

R is incorrect. R is a language and environment for statistical computing and graphics. R
provides a wide variety of statistical (linear and nonlinear modeling, classical statistical tests,
time-series analysis, classification, clustering, …) and graphical techniques, and is highly
extensible. The S language is often the vehicle of choice for research in statistical methodology,
and R provides an Open Source route to participation in that activity.
IBM Cognos is incorrect. Everyone in your organization can use IBM Cognos BI to view or
create business reports, analyze data, and monitor events and metrics so that they can make
effective business decisions.

Minitab is incorrect. Minitab empowers all parts of an organization to predict better outcomes,
design better products, and improve processes to generate higher revenues and reduce costs.

Question 43. One of the benefits of using Amazon QuickSight is:

(A) QuickSight dashboards can be accessed only from mobile devices
(B) Scale from one to thousands of users
(C) Subscription-based charges
(D) Embed BI dashboards in your applications

Explanation 43. The correct answer is: Embed BI dashboards in your applications

Amazon QuickSight is a scalable, serverless, embeddable, machine learning-powered business
intelligence (BI) service built for the cloud. QuickSight lets you easily create and publish
interactive BI dashboards that include Machine Learning-powered insights. QuickSight
dashboards can be accessed from any device, and seamlessly embedded into your applications,
portals, and websites.

QuickSight is serverless and can automatically scale to tens of thousands of users without any
infrastructure to manage or capacity to plan for. It is also the first BI service to offer pay-per-
session pricing, where you only pay when your users access their dashboards or reports, making
it cost-effective for large-scale deployments.

Benefits
Scale from tens to tens of thousands of users
Embed BI dashboards in your applications
Access deeper insights with Machine Learning
Ask questions of your data, receive answers

Question 44. Type I error is the mistaken rejection of the null hypothesis, also known as a “false
positive”, while a type II error is the mistaken acceptance of the null hypothesis, also known as a
“false negative”. (TRUE/FALSE)
(A) TRUE
(B) FALSE

Explanation 44. The correct answer is: TRUE

Just like a judge’s conclusion, an investigator’s conclusion may be wrong. Sometimes, by chance
alone, a sample is not representative of the population. Thus the results in the sample do not
reflect reality in the population, and the random error leads to an erroneous inference.

A type I error (false-positive) occurs if an investigator rejects a null hypothesis that is actually
true in the population; a type II error (false-negative) occurs if the investigator fails to reject a
null hypothesis that is actually false in the population. Although type I and type II errors can
never be avoided entirely, the investigator can reduce their likelihood by increasing the sample
size (the larger the sample, the smaller the likelihood that it will differ substantially from the
population).

False-positive and false-negative results can also occur because of bias (observer, instrument,
recall, etc.). (Errors due to bias, however, are not referred to as type I and type II errors.) Such
errors are troublesome, since they may be difficult to detect and cannot usually be quantified.

Question 45. Consider this dataset showing the retirement age of 11 people, in whole years:

54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60

Which of the following values is the measure of central tendency “mean”?
(A) 56.3
(B) 56.6
(C) 56
(D) 56.9

Explanation 45. The correct answer is: 56.6

A measure of central tendency (also referred to as measures of centre or central location) is a
summary measure that attempts to describe a whole set of data with a single value that represents
the middle or centre of its distribution.

There are three main measures of central tendency: the mode, the median and the mean. Each
of these measures describes a different indication of the typical or central value in the
distribution.

What is the mean?

The mean is the sum of the value of each observation in a dataset divided by the number of
observations. This is also known as the arithmetic average.

Looking at the retirement age distribution again:

54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60

The mean is calculated by adding together all the values
(54+54+54+55+56+57+57+58+58+60+60 = 623) and dividing by the number of observations
(11), which equals 56.6 years.
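
As a quick check in Python:

import statistics

ages = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]
print(round(statistics.mean(ages), 1))  # prints 56.6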

Question 46. Which of the following data analytics tools is a cloud-based platform designed to
provide direct, simplified, real-time access to business data for decision-makers across the
company with minimal IT involvement?
(A) Snowflake
(B) OLTP
(C) Domo
(D) Delta load

Explanation 46. The correct answer is: Domo

Domo is a cloud-based platform designed to provide direct, simplified, real-time access to
business data for decision-makers across the company with minimal IT involvement. It
specializes in business intelligence tools and data visualization.

Why do data innovators choose Domo?


1. Connect to any data source to bring your data together into one unified view, then make
analytics available to drive insight-based actions—all while maintaining security and control.

2. Enhance your existing data warehouse and BI tools or build custom apps, automate data
pipelines, and make data science accessible with automated insights and augmented analytics.

3. Publish your data and analytics content dynamically to customers and partners, enable them to
integrate their data with yours, and build customized data experiences to commercialize data.

Question 47. Which of the following values is the measure of dispersion “range” between the
scores of ten students in a test?

The scores of ten students in a test are 17, 23, 30, 36, 45, 51, 58, 66, 72, 77.
(A) 60
(B) 70
(C) 80
(D) 90

Explanation 47. The correct answer is: 60

Range is the interval between the highest and the lowest score. Range is a measure of variability
or scatteredness of the varieties or observations among themselves and does not give an idea
about the spread of the observations around some central value.

Symbolically R = Hs – Ls. Where R = Range;

Hs is the ‘Highest score’ and Ls is the Lowest Score.

The scores of ten students in a test are:


17, 23, 30, 36, 45, 51, 58, 66, 72, 77.

The highest score is 77 and the lowest score is 17.

So the range is the difference between these two scores:

Range = 77 – 17 = 60

Question 48. Which of the following inferential statistical methods is a numerical measurement
that describes a value’s relationship to the mean of a group of values?
(A) t-tests
(B) Z-score
(C) p-values
(D) Chi-squared
Explanation 48. The correct answer is: Z-score

A Z-score is a numerical measurement that describes a value’s relationship to the mean of a
group of values. Z-score is measured in terms of standard deviations from the mean. If a Z-score
is 0, it indicates that the data point’s score is identical to the mean score. A Z-score of 1.0 would
indicate a value that is one standard deviation from the mean. Z-scores may be positive or
negative, with a positive value indicating the score is above the mean and a negative score
indicating it is below the mean.
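
A minimal sketch of the formula z = (x - mean) / standard deviation, using hypothetical test
scores:

import statistics

scores = [60, 70, 80, 90, 100]  # hypothetical test scores
mu = statistics.mean(scores)
sigma = statistics.pstdev(scores)  # population standard deviation

z = (90 - mu) / sigma
print(round(z, 2))  # 0.71: the value 90 lies about 0.71 standard deviations above the mean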

t-test is incorrect. A t-test is a type of inferential statistic used to determine if there is a
significant difference between the means of two groups, which may be related to certain features.
It is mostly used when the data sets, like the data set recorded as the outcome from flipping a
coin 100 times, would follow a normal distribution and may have unknown variances. A t-test is
used as a hypothesis testing tool to test an assumption applicable to a population.

p-value is incorrect. A p-value, or probability value, is a number describing how likely it is that
your data would have occurred by random chance (i.e. that the null hypothesis is true). The level
of statistical significance is often expressed as a p-value between 0 and 1. The smaller the p-
value, the stronger the evidence that you should reject the null hypothesis.

Chi-squared is incorrect. A chi-square test is a statistical test used to compare observed results
with expected results. The purpose of this test is to determine if a difference between observed
data and expected data is due to chance, or if it is due to a relationship between the variables you
are studying. Therefore, a chi-square test is an excellent choice to help us better understand and
interpret the relationship between our two categorical variables.

Question 49. Which of the following statistical methods refers to the probability that a
population parameter will fall between a set of values for a certain proportion of times?
(A) Confidence intervals
(B) Percent difference
(C) Percent change
(D) Frequencies/percentages

Explanation 49. The correct answer is: Confidence intervals

The confidence interval (CI) is a range of values that’s likely to include a population value with a
certain degree of confidence. It is often expressed as a % whereby a population mean lies
between an upper and lower interval.
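
For example, here is a rough sketch of a 95% confidence interval for a mean under a normal
approximation (the sample values are hypothetical):

import math
import statistics

sample = [64, 68, 74, 76, 78]
mean = statistics.mean(sample)
sem = statistics.stdev(sample) / math.sqrt(len(sample))  # standard error of the mean

# 1.96 is the z-value that covers roughly 95% of a normal distribution
lower, upper = mean - 1.96 * sem, mean + 1.96 * sem
print(round(lower, 1), round(upper, 1))  # about 66.9 to 77.1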

Question 50. Which of the following data analytics tools can execute the following command?

INSERT INTO Companies (CompanyName, ContactName, Address, City, PostalCode, Country)
VALUES ('ExamsDigest', 'Nick G.', 'London St. 13', 'London', '45533', 'United Kingdom');

(A) Python
(B) Microsoft Excel
(C) Structured Query Language (SQL)
(D) R
Explanation 50. The correct answer is: Structured Query Language (SQL)

SQL is a standard language for storing, manipulating and retrieving data in databases.

What is SQL?
1. SQL stands for Structured Query Language
2. SQL lets you access and manipulate databases
3. SQL became a standard of the American National Standards Institute (ANSI) in 1986, and of
the International Organization for Standardization (ISO) in 1987

What can SQL do?


1. SQL can execute queries against a database
2. SQL can retrieve data from a database
3. SQL can insert records in a database
4. SQL can update records in a database
5. SQL can delete records from a database
6. SQL can create new databases
7. SQL can create new tables in a database
8. SQL can create stored procedures in a database
9. SQL can create views in a database
10. SQL can set permissions on tables, procedures, and views

Although SQL is an ANSI/ISO standard, there are different versions of the SQL language.

However, to be compliant with the ANSI standard, they all support at least the major commands
(such as SELECT, UPDATE, DELETE, INSERT, WHERE) in a similar manner.
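
As a runnable sketch, a statement in this style can be executed from Python with the built-in
sqlite3 module (the Companies schema below is assumed for illustration):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Companies (
    CompanyName TEXT, ContactName TEXT, Address TEXT,
    City TEXT, PostalCode TEXT, Country TEXT)""")

conn.execute("""INSERT INTO Companies
    (CompanyName, ContactName, Address, City, PostalCode, Country)
    VALUES ('ExamsDigest', 'Nick G.', 'London St. 13',
            'London', '45533', 'United Kingdom')""")

for row in conn.execute("SELECT * FROM Companies"):
    print(row)
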
CHAPTER 4
VISUALIZATIONS
Questions 51-65

Question 51. An Amazon seller wants to generate a revenue report that includes the number of
sales and the number of refunds between June and August.

Which of the following should the seller use to generate the report?

(A) Data content
(B) Date range
(C) Views
(D) Frequency

Question 52. A web analyst has a Microsoft Access report, listing all employees. He wants to
limit the report to employees whose first names start with “N” and print the report with just that
data.

Which of the following SHOULD the analyst use?

(A) Date range
(B) Views
(C) Filtering
(D) Data content

Question 53. Which of the following recurring reports helps companies maintain and prove
compliance with the General Data Protection Regulation (GDPR)?

(A) Compliance reports
(B) Risk and regulatory reports
(C) Operational reports
(D) GDPR reports

Question 54. The ACME Corporation is using 2 different font families and 2 different color
codes for its logo, as shown below.

Font families:
1. Roboto Regular
2. Lato

Colors codes in hex:


1. #fa1d53
2. #1dfac4
The Corporation wants to develop a new dashboard for visualizing information and KPIs.

Which of the following best practices they SHOULD follow for clean dashboard development?

(A) The dashboard should be designed with different font families and colors
(B) The dashboard should be designed with different font families but with the same colors
(C) The dashboard should be designed with the same font families and the same colors
(D) The dashboard should be designed with the same font families but with different colors

Question 55. A company wants to create a chart that presents the growth in online sales broken
down by customer type, based on the mix of channels they used similar to the one you see below.

Which of the following types of visualization is MOST suitable?

(A) Heat map
(B) Waterfall
(C) Infographic
(D) Word cloud
Question 56. Which of the following is one of the most important principles you have to take
into account for dashboard development?

(A) Consider the right type of dashboard
(B) Consider your audience
(C) Consider your budget
(D) Consider the coding language

Question 57. Static reporting requires pulling up various reports from different sources and
analyzing insights from a longer time period in order to provide a snapshot of data while
Dynamic reports provide deep insights, and allow users to interact with the data rather than just
view it. (TRUE/FALSE)

(A) TRUE
(B) FALSE

Question 58. What’s the difference between ad-hoc and structured reports? (Select TWO)
(A) Structured reports use a large volume of data and are produced using a formalized
reporting template
(B) Ad hoc reports are generated as needed for a one-time-use, in a visual format relevant to
the audience
(C) Ad hoc reports use a large volume of data and are produced using a formalized reporting
template
(D) Structured reports are generated as needed for a one-time-use, in a visual format relevant
to the audience

Question 59. Sort the steps to develop a clean and well-structured dashboard according to
CompTIA dashboard development process. Starting from the top (first step) to the bottom (last
step).

1. Mockup/wireframe design

2. Develop dashboard

3. Deploy to production

4. Approval granted

Question 60. A developer wants to develop a new dashboard. On the main page, he wants to
show data variables and trends to help companies to make predictions about the results of data
not yet recorded.

Which of the following types of visualization should the developer apply?


(A) Pie chart
(B) Bubble chart
(C) Line chart
(D) Prediction chart

Question 61. Which of the following types of visualization represents the chart below?

(A) Infographic
(B) Heat map
(C) Histogram
(D) Waterfall

Question 62. Which of the following recurring reports helps companies reach business goals and
identify strengths, weaknesses, and trends?
(A) Compliance reports
(B) Risk and regulatory reports
(C) Operational reports
(D) Business goal reports
Question 63. Which of the following types of visualization is a graphical representation of word
frequency that gives greater prominence to words that appear more frequently in a source text?
(A) Word counter
(B) Word frequency
(C) Word rate
(D) Word cloud

Question 64. The purpose of the Geographic map is to:


(A) Support the analysis of geospatial data through the use of interactive visualization
(B) Display a large amount of hierarchical data using nested rectangles of varying sizes and
colors
(C) Allow part-to-whole comparisons over time
(D) Give you a snapshot of how a group is broken down into smaller pieces

Question 65. Which of the following is the MOST appropriate Report Cover Page?

(A)
(B)

(C)

(D)
Answers 51-65

Question 51. An Amazon seller wants to generate a revenue report that includes the number of
sales and the number of refunds between June and August.

Which of the following should the seller use to generate the report?

(A) Data content
(B) Date range
(C) Views
(D) Frequency

Explanation 51. The correct answer is: Date range

A date range report is a custom report that allows you either to select a month to include in the
report or to choose specific start and end dates for the data included in the report.

Question 52. A web analyst has a Microsoft Access report, listing all employees. He wants to
limit the report to employees whose first names start with “N” and print the report with just that
data.

Which of the following SHOULD the analyst use?

(A) Date range
(B) Views
(C) Filtering
(D) Data content
Explanation 52. The correct answer is: Filtering

When you view an Access report on the screen, you can apply filters to zero in on the data you
want to see. And then you can print the report with just that data.

To filter data in a report, open it in Report view (right-click it in the Navigation pane and
click Report View). Then, right-click the data you want to filter.

For example, in a report listing all employees, you might want to limit the report to employees
whose first names start with “N”:

1. Right-click any first name, and click Text Filters > Begins With.
2. Enter “N” in the box that appears, and click OK.

Question 53. Which of the following recurring reports helps companies maintain and prove
compliance with the General Data Protection Regulation (GDPR)?

(A) Compliance reports
(B) Risk and regulatory reports
(C) Operational reports
(D) GDPR reports

Explanation 53. The correct answer is: Compliance reports

Compliance reporting is the process of presenting information to auditors that show that your
company is adhering to all the requirements set by the government and regulatory agency under
a particular standard. It is often the IT department’s responsibility to generate these reports.

Compliance reports typically include information on how customer/company data is dealt with –
how it is controlled or protected, obtained and stored, and how it is secured and distributed
internally and externally.

Some regulations and the industries to which they apply are as follows:

1. Health Insurance Portability and Accountability Act (HIPAA)
Industry: Healthcare
Regulation: The HIPAA Privacy Rule establishes national standards to protect individuals’
medical records and other personal health information. It applies to health plans, healthcare
clearinghouses, and those health care providers that conduct certain healthcare transactions
electronically. The HIPAA Security Rule requires appropriate administrative, physical and
technical safeguards to ensure the confidentiality, integrity and security of electronically
protected health information.

2. Payment Card Industry Data Security Standard (PCI DSS)
Industry: Retail, financial institutions, and any business or organization that processes, stores
or transmits credit card information
Regulation: The PCI Data Security Standards set the operational and technical requirements
for organizations accepting or processing payment transactions, and for software developers
and manufacturers of applications and devices used in those transactions.

3. General Data Protection Regulation (GDPR)
Industry: Any business that has customers in the European Union (EU)
Regulation: Europe’s data privacy and security law imposes regulations on organizations
regardless of where they are based, as long as they target or collect data related to people in
the EU.

4. National Institute of Standards and Technology (NIST)
Industry: Communications technology and cybersecurity
Regulation: The NIST Cybersecurity Framework integrates industry standards and best
practices to help organizations manage their cybersecurity risks.

5. California Consumer Privacy Act (CCPA)
Industry: Any business with customers in the state of California
Regulation: The California Consumer Privacy Act of 2018 (CCPA) gives consumers more
control over the personal information that businesses collect about them.

Question 54. The ACME Corporation is using 2 different font families and 2 different color
codes for its logo, as shown below.

Font families:
1. Roboto Regular
2. Lato
Colors codes in hex:
1. #fa1d53
2. #1dfac4

The Corporation wants to develop a new dashboard for visualizing information and KPIs.

Which of the following best practices they SHOULD follow for clean dashboard development?

(A) The dashboard should be designed with different font families and colors
(B) The dashboard should be designed with different font families but with the same colors
(C) The dashboard should be designed with the same font families and the same colors
(D) The dashboard should be designed with the same font families but with different colors

Explanation 54. The correct answer is: The dashboard should be designed with the same
font families and the same colors

When it comes to color, you can choose to stay true to your company identity (same colors,
fonts). The important thing here is to stay consistent and not use too many different colors – an
essential consideration when learning how to design a dashboard. This is one of the most
important of all dashboard design best practices.

Question 55. A company wants to create a chart that presents the growth in online sales broken
down by customer type, based on the mix of channels they used similar to the one you see below.
Which of the following types of visualization is MOST suitable?

(A) Heat map
(B) Waterfall
(C) Infographic
(D) Word cloud

Explanation 55. The correct answer is: Waterfall

A waterfall visualization shows how an initial value is increased and decreased by a series of
intermediate values, leading to a final cumulative value shown in the far right column. The
intermediate values can either be time-based or category-based.

A waterfall chart is a specific type of bar chart that reveals the story behind the net change in
something’s value between two points. Instead of just showing a beginning value in one bar and
an ending value in a second bar, a waterfall chart disaggregates all the unique components that
contributed to that net change and visualizes them individually.

Some examples of waterfall visualizations are:


1. Viewing the net income after you add the increases and decreases of revenue and costs for an
enterprise over a quarter.
2. Cumulative sales for products across a year with an annual total.
Question 56. Which of the following is one of the most important principles you have to take
into account for dashboard development?

(A) Consider the right type of dashboard
(B) Consider your audience
(C) Consider your budget
(D) Consider the coding language

Explanation 56. The correct answer is: Consider your audience

Concerning dashboard best practices in design, your audience is one of the most important
principles you have to take into account. You need to know who’s going to use the dashboard.

To do so successfully, you need to put yourself in your audience’s shoes. The context and device
on which users will regularly access their dashboards will have direct consequences on the style
in which the information is displayed.

Question 57. Static reporting requires pulling up various reports from different sources and
analyzing insights from a longer time period in order to provide a snapshot of data while
Dynamic reports provide deep insights, and allow users to interact with the data rather than just
view it. (TRUE/FALSE)

(A) TRUE
(B) FALSE

Explanation 57. The correct answer is: TRUE


Static reporting requires pulling up various reports from different sources and analyzing insights
from a longer time period to provide a snapshot of data while Dynamic reports provide deep
insights, and allow users to interact with the data rather than just view it.

Static reporting works on data that only has significance for a specific period of time. Static
reports include data about inventories such as resources and data that is generated periodically.
Static data are generated in Excel, Word or PowerPoint and exported in HTML or PDF format.
Static reporting works for data that have a very short life span. This means that this source of
information cannot drill down to future insights. Static reports are easy to use as tools for
reviewing behavior, patterns, and outcomes.

Dynamic reporting is also called live or real-time reporting. Dynamic reporting is a real-time,
web-based reporting application that can be accessed from anywhere and from any device with
internet connectivity. Dynamic reports provide up-to-date information at all times and give users
real-time interaction with the dashboard according to their needs. A dynamic reporting approach
can support business meetings where executive-level members gather to review business goals.
The dynamic dashboard provides a clear picture of the business and allows executives to think
better and make decisions quickly.

Question 58. What’s the difference between ad-hoc and structured reports? (Select TWO)

(A) Structured reports use a large volume of data and are produced using a formalized
reporting template
(B) Ad hoc reports are generated as needed for a one-time-use, in a visual format relevant to
the audience
(C) Ad hoc reports use a large volume of data and are produced using a formalized reporting
template
(D) Structured reports are generated as needed for a one-time-use, in a visual format relevant
to the audience

Explanation 58. The correct answers are:


1. Structured reports use a large volume of data and are produced using a formalized
reporting template
2. Ad hoc reports are generated as needed for a one-time-use, in a visual format relevant to
the audience

Ad hoc reporting is a report created for one-time use. A BI tool can make it possible for anyone
in an organization to answer a specific business question and present that data in a visual format,
without burdening IT staff.

Ad hoc reporting differs from structured reporting in many ways. Structured reports use a large
volume of data and are produced using a formalized reporting template. Ad hoc reports are
generated as needed, in a visual format relevant to the audience.

Structured reports are produced by people who have a high degree of technical experience
working with business intelligence tools to mine and aggregate large amounts of data. Ad hoc
reporting relies on much smaller amounts of data. This makes it easier for people in an enterprise
to report on a specific data point that answers a specific business question.

Question 59. Sort the steps to develop a clean and well-structured dashboard according to
CompTIA dashboard development process. Starting from the top (first step) to the bottom (last
step).

1. Mockup/wireframe design

2. Develop dashboard

3. Deploy to production
4. Approval granted

Explanation 59. The correct answer is:

The steps to develop a clean and well-structured dashboard according to CompTIA dashboard
development process are:

1. Mockup/wireframe design

2. Approval granted

3. Develop dashboard

4. Deploy to production

Question 60. A developer wants to develop a new dashboard. On the main page, he wants to
show data variables and trends to help companies to make predictions about the results of data
not yet recorded.

Which of the following types of visualization should the developer apply?


(A) Pie chart
(B) Bubble chart
(C) Line chart
(D) Prediction chart

Explanation 60. The correct answer is: Line chart

Line graphs are useful in that they show data variables and trends very clearly and can help to
make predictions about the results of data not yet recorded. If seeing the trend of your data is the
goal, then this is the chart to use.

Line charts show time-series relationships using continuous data. They allow a quick assessment
of acceleration (lines curving upward), deceleration (lines curving downward), and volatility
(up/down frequency). They are excellent for tracking multiple data sets on the same chart to see
any correlation in trends.

They can also be used to display several dependent variables against one independent variable.

Pie chart is incorrect. Pie charts can be used to show percentages of a whole, and they represent
percentages at a set point in time. Unlike bar graphs and line graphs, pie charts do not show
changes over time.

Bubble chart is incorrect. A bubble chart is a variation of a scatter chart in which the data
points are replaced with bubbles, and an additional dimension of the data is represented in the
size of the bubbles. Just like a scatter chart, a bubble chart does not use a category axis — both
horizontal and vertical axes are value axes. In addition to the x values and y values that are
plotted in a scatter chart, a bubble chart plots x values, y values, and z (size) values.

Prediction chart is incorrect because it’s a fictitious term.

Question 61. Which of the following types of visualization represents the chart below?
(A) Infographic
(B) Heat map
(C) Histogram
(D) Waterfall

Explanation 61. The correct answer is: Histogram

A histogram is a chart that shows frequencies for intervals of values of a metric variable.

Why are histograms so useful? Well, first of all, charts are much more visual than tables; after
looking at a chart for 10 seconds, you can tell much more about your data than after inspecting
the corresponding table for 10 seconds. Generally, charts convey information about our data
faster than tables, albeit less accurately.
On top of that, histograms also give us much more complete information about our data. Keep in
mind that you can reasonably estimate a variable’s mean, standard deviation, skewness, and
kurtosis from a histogram. However, you can’t estimate a variable’s histogram from the
aforementioned statistics.

Infographic is incorrect. Infographics are graphic visual representations of information, data, or
knowledge intended to present information quickly and clearly.

Heat map is incorrect. Heat Maps are graphical representations of data that utilize color-coded
systems. The primary purpose of Heat Maps is to better visualize the volume of locations/events
within a dataset and assist in directing viewers towards areas on data visualizations that matter
most.

Waterfall is incorrect. A waterfall visualization shows how an initial value is increased and
decreased by a series of intermediate values, leading to a final cumulative value shown in the far
right column. The intermediate values can either be time-based or category-based.

Some examples of waterfall visualizations are as follows:


1. Viewing the net income after you add the increases and decreases of revenue and costs for an
enterprise over a quarter.
2. Cumulative sales for products across a year with an annual total.

Question 62. Which of the following recurring reports helps companies reach business goals and
identify strengths, weaknesses, and trends?
(A) Compliance reports
(B) Risk and regulatory reports
(C) Operational reports
(D) Business goal reports

Explanation 62. The correct answer is: Operational reports

Operational reporting is an effective, results-driven means of tracking, measuring, and analyzing
a business’s regular deliverables and metrics, usually on a daily, weekly, and sometimes monthly
basis with the help of modern and professional BI reporting tools.

A KPI report, which is a type of operational report, is a management tool that facilitates the
measurement, organization, and analysis of the most important business key performance
indicators. These reports help companies to reach business goals and identify strengths,
weaknesses, and trends.

Typically presented in the form of an interactive dashboard, this kind of report provides a visual
representation of the data associated with your predetermined set of key performance indicators.

Question 63. Which of the following types of visualization is a graphical representation of word
frequency that gives greater prominence to words that appear more frequently in a source text?
(A) Word counter
(B) Word frequency
(C) Word rate
(D) Word cloud

Explanation 63. The correct answer is: Word cloud

Word clouds or tag clouds are graphical representations of word frequency that give greater
prominence to words that appear more frequently in a source text. The larger the word in the
visual the more common the word was in the document(s).

This type of visualization can assist evaluators with exploratory textual analysis by identifying
words that frequently appear in a set of interviews, documents, or other text. It can also be used
for communicating the most salient points or themes in the reporting stage.

The remaining options are incorrect because they are fictitious terms.

Question 64. The purpose of the Geographic map is to:


(A) Support the analysis of geospatial data through the use of interactive visualization
(B) Display a large amount of hierarchical data using nested rectangles of varying sizes and
colors
(C) Allow part-to-whole comparisons over time
(D) Give you a snapshot of how a group is broken down into smaller pieces

Explanation 64. The correct answer is: Support the analysis of geospatial data through the
use of interactive visualization

The purpose of the Geographic map is to support the analysis of geospatial data through the use
of interactive visualization. Map visualization is used to analyze and display geographically
related data and present it in the form of maps. This kind of data expression is clearer and more
intuitive. We can visually see the distribution or proportion of data in each region. It is
convenient for everyone to mine deeper information and make better decisions.

There are many types in map visualization, such as administrative maps, heatmaps, statistical
maps, trajectory maps, bubble maps, etc. And maps can be divided into 2D maps, 3D maps or
static maps, dynamic maps, interactive maps… They are often used in combination with points,
lines, bubbles, and more.
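As a sketch of interactive geospatial visualization in Python, the folium library (an assumption: it is installed) renders data on a pannable, zoomable map; the coordinates and values below are hypothetical:

# A minimal interactive map sketch using folium (assumed installed);
# coordinates and regional values are hypothetical.
import folium

m = folium.Map(location=[52.52, 13.405], zoom_start=5)  # centered on Berlin

# Bubble-style markers sized by a hypothetical regional value
for lat, lon, value in [(52.52, 13.405, 30), (48.14, 11.58, 18)]:
    folium.CircleMarker(location=[lat, lon], radius=value / 2,
                        fill=True).add_to(m)

m.save("regional_map.html")  # open in a browser to pan and zoom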

Question 65. Which of the following is the MOST appropriate Report Cover Page?

(A) [cover page image A]
(B) [cover page image B]
(C) [cover page image C]
(D) [cover page image D]

Explanation 65. The correct answer is: A


The cover page, also known as the title page, is the first and front page of a book, report, business proposal, magazine, or any other document. It is an important part of the document, as it gives introductory information regarding what the document is about as well as who has written it.

It basically gives a reflection of the whole document and what is contained in it. The cover page helps the reader decide whether the document is of interest to them. In addition, the cover page is also important because it sets the first impression on whoever glances at the document.

The cover page of the report gives the ‘Big Idea’ of what the report is about as it states the
report’s title. It should be clear, professional, formal and appropriate for the topic or area covered
in the report. The cover page of the report varies slightly based on the formatting style (such as
APA, MLA, Harvard, etc.) that is being used by the report. However, the main information
included is:

1. Title of the report.
2. Subtitle, if any.
3. Author and co-authors.
4. Details of the authors such as title, email, contact, etc.
5. Submission place such as the name of the institute, organization, journal, publisher, etc.
6. Company logo or any other image, if any.
7. Date of the report.
8. Header, if any.
9. A brief summary of the report.
CHAPTER 5
DATA GOVERNANCE,
QUALITY, AND CONTROLS
Questions 66-80

Question 66. A company uses AWS cloud technologies for its scalability and high-performance features. The company wants to give software engineers access to control the AWS infrastructure and deny access to the rest of the employees.

Which of the following access control methods SHOULD the company use to fulfill the requirement?

(A) Developer-based
(B) AWS-based
(C) Role-based
(D) Software-based

Question 67. Which of the following policies is shown in the given image?
(A) Privacy Policy
(B) Acceptable use policy
(C) Terms of Services
(D) Cookies Policy

Question 68. In which of the following data quality dimensions does the data follow a set of standard data definitions such as data type, size, and format (e.g. the customer’s date of birth is in the format “mm/dd/yyyy”)?

(A) Completeness
(B) Consistency
(C) Integrity
(D) Conformity

Question 69. Cardinality refers to the maximum number of times an instance in one entity can
relate to instances of another entity while ordinality is the minimum number of times an instance
in one entity can be associated with an instance in the related entity. (TRUE/FALSE)

(A) TRUE
(B) FALSE

Question 70. Which of the following categories would contain information about an individual’s
biometric data, genetic information, and sexual orientation?

(A) Personally identifiable information
(B) Personal health information
(C) Sensitive personal information
(D) Intellectual property

Question 71. Which of the following types of vulnerability scans should an organization perform if it stores credit card information governed by the Payment Card Industry Data Security Standard (PCI DSS)?

(A) Discovery scan
(B) Full scan
(C) Stealth scan
(D) Compliance scan

Question 72. A financial analyst gathers information, assembles spreadsheets, and writes reports. He wants his files to be synced and updated across all of his devices every time he makes changes to them.

In which of the following storage environments should the analyst save his files?
(A) Share drive
(B) Local storage
(C) Cloud storage
(D) Sync storage

Question 73. The Acme Corporation is working on a new data warehouse and business
intelligence (DW/BI) project. They need to uncover data quality issues in data sources, and what
needs to be corrected in Extract, transform, load (ETL).

Which of the following methods should they use to validate the data?
(A) Cross-validation
(B) Data auditing
(C) Data profiling
(D) Data correction

Question 74. A data analyst wants to measure how well a piece of information reflects reality.

Which of the following data quality dimensions does the data analyst need to assess?
(A) Data consistency
(B) Data accuracy
(C) Data completeness
(D) Data integrity
Question 75. Records from governmental agencies, student records information, and existing
human research subjects’ data are examples of:
(A) Release approvals
(B) Data transmission
(C) Data use agreements
(D) Data constraints

Question 76. Acme Corporation wants to reduce the probability of a data breach in order to
reduce the risk of fines in the future.

Which of the following security requirements SHOULD they use?


(A) Data masking
(B) Data transmission
(C) Data encryption
(D) De-identify data

Question 77. Which of the following ways can be used to achieve the desired quality output for
standardized names in a master data management (MDM) architecture? (Select TWO)
(A) Use different locales to standardize names properly
(B) Define and apply customized schemes to standardize differently spelled words to
common words (e.g. Assoc, Assocn. and Assn. to Association)
(C) Don't define and apply customized schemes to standardize differently spelled words to
common words (e.g. Assoc, Assocn. and Assn. to Association)
(D) Use the same locales to standardize names properly

Question 78. The ACME Corporation hired an analyst to detect data quality issues in their Excel documents. Which of the following are the most common issues? (Select TWO)
(A) Duplicates
(B) Commas
(C) Symbols
(D) Misspellings
(E) Apostrophe

Question 79. Which of the following policies is a set of guidelines that helps organizations keep
track of how long information must be kept and how to dispose of the information when it’s no
longer needed?
(A) Data processing policy
(B) Data deletion policy
(C) Data retention policy
(D) Acceptable use policy
Question 80. Which of the following categories would contain information about an individual’s
demographic information, medical histories, laboratory results, and mental health conditions?
(A) Personally identifiable information
(B) Personal health information
(C) Sensitive personal information
(D) Intellectual property
Answers 66-80

Question 66. A company uses AWS cloud technologies for its scalability and high-performance features. The company wants to give software engineers access to control the AWS infrastructure and deny access to the rest of the employees.

Which of the following access control methods SHOULD the company use to fulfill the requirement?

(A) Developer-based
(B) AWS-based
(C) Role-based
(D) Software-based

Explanation 66. The correct answer is: Role-based

Role-based access control (RBAC), also known as role-based security, is an access control
method that assigns permissions to end-users based on their role within your organization.
RBAC provides fine-grained control, offering a simple, manageable approach to access
management that is less error-prone than individually assigning permissions.

By adding a user to a role group, the user has access to all the permissions of that group. If they
are removed, access becomes restricted. Users can also be assigned temporary access to certain
data or programs to complete a task and be removed after.

Common examples of RBAC include:

Software engineering role: Has access to GCP, AWS, and GitHub
Marketing role: Has access to HubSpot, Google Analytics, Facebook Ads, and Google Ads
Finance role: Has access to Xero and ADP
Human resources role: Has access to Lever and BambooHR

In each of these roles, there may be a management tier and an individual contributor tier that has
different levels of permission inside the individual applications granted to each role.
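To make the idea concrete, here is a minimal, hypothetical sketch of a role-based permission check in Python; the role names, users, and permission strings are illustrative, not an actual AWS configuration:

# A minimal, hypothetical RBAC sketch; roles and permissions are illustrative.
ROLE_PERMISSIONS = {
    "software_engineer": {"aws:manage_infrastructure", "github:push"},
    "marketing": {"hubspot:read", "google_analytics:read"},
}

USER_ROLES = {
    "alice": "software_engineer",
    "bob": "marketing",
}

def is_allowed(user: str, permission: str) -> bool:
    """Return True if the user's role grants the requested permission."""
    role = USER_ROLES.get(user)
    return permission in ROLE_PERMISSIONS.get(role, set())

print(is_allowed("alice", "aws:manage_infrastructure"))  # True
print(is_allowed("bob", "aws:manage_infrastructure"))    # False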

Question 67. Which of the following policies is shown in the given image?

(A) Privacy Policy
(B) Acceptable use policy
(C) Terms of Services
(D) Cookies Policy

Explanation 67. The correct answer is: Acceptable use policy

An acceptable use policy (AUP) is a document stipulating constraints and practices that a user
must agree to for access to a corporate network or the Internet.

Many businesses and educational facilities require that employees or students sign an acceptable
use policy before being granted a network ID.

When you sign up with an Internet service provider (ISP), you will usually be presented with an
AUP, which states that you agree to adhere to stipulations such as:
1. Not using the service as part of violating any law
2. Not attempting to break the security of any computer network or user
3. Not posting commercial messages to Usenet groups without prior permission
4. Not attempting to send junk e-mail or spam to anyone who doesn’t want to receive it
5. Not attempting to mail bomb a site with mass amounts of e-mail in order to flood their server

Users also typically agree to report any attempt to break into their accounts.
Question 68. In which of the following data quality dimensions does the data follow a set of standard data definitions such as data type, size, and format (e.g. the customer’s date of birth is in the format “mm/dd/yyyy”)?

(A) Completeness
(B) Consistency
(C) Integrity
(D) Conformity

Explanation 68. The correct answer is: Conformity

The 6 dimensions of data quality are: Completeness, Consistency, Conformity, Accuracy, Integrity, and Timeliness.

Data Quality Dimension #1: Completeness

Completeness is defined as expected comprehensiveness. Data can be complete even if optional data is missing. As long as the data meets the expectations, then the data is considered complete.

For example, a customer’s first name and last name are mandatory but middle name is optional;
so a record can be considered complete even if a middle name is not available.

Data Quality Dimension #2: Consistency

Consistency means data across all systems reflects the same information and is in sync across the enterprise. Examples:
1. A business unit status is closed but there are sales for that business unit.
2. Employee status is terminated but pay status is active.

Data Quality Dimension #3: Conformity


Conformity means the data follows a set of standard data definitions such as data type, size, and format. For example, the customer’s date of birth is in the format “mm/dd/yyyy”.
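A conformity check for this date format can be sketched in a few lines of Python; the helper name below is ours, not a standard API:

# A minimal conformity-check sketch, assuming dates should match "mm/dd/yyyy".
from datetime import datetime

def conforms_to_date_format(value: str) -> bool:
    """Return True if the value parses as a valid mm/dd/yyyy date."""
    try:
        datetime.strptime(value, "%m/%d/%Y")
        return True
    except ValueError:
        return False

print(conforms_to_date_format("04/23/1990"))  # True
print(conforms_to_date_format("1990-04-23"))  # False: wrong format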

Data Quality Dimension #4: Accuracy


Accuracy is the degree to which data correctly reflects the real-world object OR an event being
described.

Data Quality Dimension #5: Integrity


Integrity means validity of data across the relationships and ensures that all data in a database
can be traced and connected to other data.

For example, in a customer database, there should be a valid customer, addresses and
relationship between them. If there is an address relationship data without a customer then that
data is not valid and is considered an orphaned record.

Data Quality Dimension #6: Timeliness


Timeliness references whether information is available when it is expected and needed.
Timeliness of data is very important. This is reflected in:
1. Companies that are required to publish their quarterly results within a given frame of time
2. Customer service providing up-to date information to the customers
3. Credit system checking in real-time on the credit card account activity

Question 69. Cardinality refers to the maximum number of times an instance in one entity can
relate to instances of another entity while ordinality is the minimum number of times an instance
in one entity can be associated with an instance in the related entity. (TRUE/FALSE)

(A) TRUE
(B) FALSE
Explanation 69. The correct answer is: TRUE

Cardinality refers to the maximum number of times an instance in one entity can relate to
instances of another entity while ordinality is the minimum number of times an instance in one
entity can be associated with an instance in the related entity.

Cardinality specifies how many instances of an entity relate to one instance of another entity.
Ordinality is also closely linked to cardinality. While cardinality specifies the occurrences of a
relationship, ordinality describes the relationship as either mandatory or optional. In other words,
cardinality specifies the maximum number of relationships and ordinality specifies the absolute
minimum number of relationships.

Question 70. Which of the following categories would contain information about an individual’s
biometric data, genetic information, and sexual orientation?

(A) Personally identifiable information
(B) Personal health information
(C) Sensitive personal information
(D) Intellectual property

Explanation 70. The correct answer is: Sensitive personal information

Sensitive Personal Information (SPI) refers to information that does not identify an individual,
but is related to an individual, and communicates information that is private or could potentially
harm an individual should it be made public. This includes things like biometric data, genetic
information, sex, trade union membership, sexual orientation, etc.

Personally identifiable information is incorrect. Personally identifiable information (PII) is any data that can be used to identify a specific individual. Social Security numbers, mailing or email addresses, and phone numbers have most commonly been considered PII, but technology has expanded the scope of PII considerably. It can include an IP address, login IDs, social media posts, or digital images. Geolocation, biometric, and behavioral data can also be classified as PII.

Protected health information is incorrect. Protected health information (PHI), also referred to
as personal health information, generally refers to demographic information, medical histories,
test and laboratory results, mental health conditions, insurance information, and other data that a
healthcare professional collects to identify an individual and determine appropriate care.

Intellectual property is incorrect. Intellectual property (IP) is a term for any intangible asset —
something proprietary that doesn’t exist as a physical object but has value. Examples of
intellectual property include designs, concepts, software, inventions, trade secrets, formulas, and
brand names, as well as works of art. Intellectual property can be protected by copyright,
trademark, patent, or other legal measures.

Question 71. Which of the following types of vulnerability scans should an organization perform if it stores credit card information governed by the Payment Card Industry Data Security Standard (PCI DSS)?

(A) Discovery scan
(B) Full scan
(C) Stealth scan
(D) Compliance scan

Explanation 71. The correct answer is: Compliance scan


If you are an organization that is governed by regulations due to the industry you are in or your
business practices, you may have to perform vulnerability scans on a regular basis to show
compliance with those regulations. For example, any organization storing credit card information
must follow the Payment Card Industry Data Security Standard (PCI DSS) requirements for
vulnerability scans.

Discovery scan is incorrect. A discovery scan as its name implies is a type of vulnerability scan
that is used to discover systems on the network by performing a ping scan and then a port scan
on those targets to discover ports that are open. A discovery scan is not a full vulnerability scan
that looks for vulnerabilities; it is used to find systems on the network.

Full scan is incorrect. A full scan performs many different tests to identify vulnerabilities in the
system.

Stealth scan is incorrect. A stealth scan (sometimes known as a half-open scan) is much like a full open scan, with a minor difference that makes it less suspicious on the victim’s device. The primary difference is that a full TCP three-way handshake does not occur. The initiator (device A) sends a TCP SYN packet to device B to determine whether a port is open. If the port is open, device B responds with a SYN/ACK packet, and device A then sends an RST to terminate the connection. If the port is closed, device B responds with an RST packet instead. The benefit of using this type of scan is that it reduces the chances of being detected.
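A rough sketch of a single-port half-open probe can be written in Python with the scapy library; assumptions: scapy is installed, the script runs with root privileges, and you have permission to scan the (hypothetical) target:

# A rough half-open (SYN) probe sketch using scapy (assumed installed).
# Run with root privileges and only against hosts you are allowed to scan;
# the target address is from the documentation range and is hypothetical.
from scapy.all import IP, TCP, sr1, send

target, port = "192.0.2.10", 80

syn = IP(dst=target) / TCP(dport=port, flags="S")
reply = sr1(syn, timeout=2, verbose=0)

if reply is not None and reply.haslayer(TCP) and (reply[TCP].flags & 0x12) == 0x12:
    print(f"Port {port} appears open (SYN/ACK received)")
    # Send RST instead of completing the handshake -- the "half-open" part
    send(IP(dst=target) / TCP(dport=port, flags="R"), verbose=0)
else:
    print(f"Port {port} appears closed or filtered")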

Question 72. A financial analyst gathers information, assembles spreadsheets, and writes reports. He wants his files to be synced and updated across all of his devices every time he makes changes to them.

In which of the following storage environments should the analyst save his files?
(A) Share drive
(B) Local storage
(C) Cloud storage
(D) Sync storage

Explanation 72. The correct answer is: Cloud storage

Cloud services are popular because they enable many businesses to access application software without the need to invest in computer software and hardware. Other benefits include scalability, reliability, and efficiency. All these advantages allow organizations to focus on other relevant aspects such as product development and innovation.
Pros of cloud storage:
Cost
Buying physical storage or hardware can be expensive. Cloud storage is cheaper per GB than
using external drives.

Accessibility
Cloud storage gives you access to your files from anywhere. All you need is an internet
connection.

Security
Cloud storage is safer than local storage because providers have added additional layers of security to their services. Thanks to the use of encryption algorithms, only authorized personnel such as you and your employees have access to the documents and files stored in the cloud.

Syncing and updating
When you are working with cloud storage, every time you make changes to a file, these will be synced and updated across all of your devices. That will save you a lot of time and simplify your job.

Recovery
In case of a hard drive failure or other hardware malfunction, you can access your files on the
cloud, which acts as a backup solution for your local storage on physical drives.
Question 73. The Acme Corporation is working on a new data warehouse and business
intelligence (DW/BI) project. They need to uncover data quality issues in data sources, and what
needs to be corrected in Extract, transform, load (ETL).

Which of the following methods should they use to validate the data?
(A) Cross-validation
(B) Data auditing
(C) Data profiling
(D) Data correction

Explanation 73. The correct answer is: Data profiling

Data profiling is the process of reviewing source data, understanding structure, content and
interrelationships, and identifying potential for data projects.

Data profiling is a crucial part of:

1. Data warehouse and business intelligence (DW/BI) projects: data profiling can uncover data quality issues in data sources, and what needs to be corrected in ETL.
2. Data conversion and migration projects: data profiling can identify data quality issues, which you can handle in scripts and data integration tools copying data from source to target. It can also uncover new requirements for the target system.
3. Source system data quality projects: data profiling can highlight data which suffers from serious or numerous quality issues, and the source of the issues (e.g. user inputs, errors in interfaces, data corruption).
Data profiling involves:
1. Collecting descriptive statistics like min, max, count and sum.
2. Collecting data types, length and recurring patterns.
3. Tagging data with keywords, descriptions or categories.
4. Performing data quality assessment, risk of performing joins on the data.
5. Discovering metadata and assessing its accuracy.
6. Identifying distributions, key candidates, foreign-key candidates, functional dependencies,
embedded value dependencies, and performing inter-table analysis.
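As a rough illustration, a first pass at several of these steps can be sketched in Python with pandas (an assumption: pandas is installed; the CSV path and data are hypothetical, and dedicated profiling tools go much further):

# A minimal data profiling sketch using pandas (assumed installed);
# the CSV path is hypothetical.
import pandas as pd

df = pd.read_csv("source_extract.csv")

print(df.dtypes)                   # data types per column
print(df.describe(include="all")) # min, max, count, and other statistics
print(df.isnull().sum())          # completeness: missing values per column
print(df.nunique())               # key-candidate hint: distinct values per column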

Question 74. A data analyst wants to measure how well a piece of information reflects reality.

Which of the following data quality dimensions does the data analyst need to assess?
(A) Data consistency
(B) Data accuracy
(C) Data completeness
(D) Data integrity

Explanation 74. The correct answer is: Data accuracy

How can you assess your data quality? Data quality is assessed along six dimensions: accuracy, completeness, consistency, timeliness, validity, and uniqueness. Read on to learn the definitions of these data quality dimensions.

Six data quality dimensions to assess:

Accuracy: How well does a piece of information reflect reality?
Completeness: Does it fulfill your expectations of what’s comprehensive?
Consistency: Does information stored in one place match relevant data stored elsewhere?
Timeliness: Is your information available when you need it?
Validity: Is information in a specific format, does it follow business rules, or is it in an unusable format?
Uniqueness: Is this the only instance in which this information appears in the database?
The term “accuracy” refers to the degree to which information accurately reflects an event or
object described. For example, if a customer’s age is 32, but the system says she’s 34, that
information is inaccurate.

Question 75. Records from governmental agencies, student records information, and existing
human research subjects’ data are examples of:
(A) Release approvals
(B) Data transmission
(C) Data use agreements
(D) Data constraints

Explanation 75. The correct answer is: Data use agreements

A Data Use Agreement (DUA) is a contractual document used for transferring non-public or restricted-use data. Examples include records from governmental agencies, institutions or corporations, student records information, and existing human research subjects’ data.

A data use agreement (DUA) is an agreement that is required under the Privacy Rule and must
be entered into before there is any use or disclosure of a limited data set (defined below) to an
outside institution or party. A limited data set is still protected health information (PHI), and for
that reason, covered entities like Stanford must enter into a data use agreement with any recipient
of a limited data set from Stanford.

At a minimum, any DUA must contain provisions that address the following:

1. Establish the permitted uses and disclosures of the limited data set;
2. Identify who may use or receive the information;
3. Prohibit the recipient from using or further disclosing the information, except as permitted by
the agreement or as otherwise permitted by law;
4. Require the recipient to use appropriate safeguards to prevent an unauthorized use or
disclosure not contemplated by the agreement;
5. Require the recipient to report to the covered entity any use or disclosure of which it becomes aware;
6. Require the recipients to ensure that any agents (including any subcontractors) to whom it
discloses the information will agree to the same restrictions as provided in the agreement; and
7. Prohibit the recipient from identifying the information or contacting the individuals.
Question 76. Acme Corporation wants to reduce the probability of a data breach in order to
reduce the risk of fines in the future.

Which of the following security requirements SHOULD they use?


(A) Data masking
(B) Data transmission
(C) Data encryption
(D) De-identify data

Explanation 76. The correct answer is: Data encryption

Companies can reduce the probability of a data breach, and thus reduce the risk of fines in the future, if they choose to use encryption of personal data. The processing of personal data is naturally associated with a certain degree of risk, especially nowadays, when cyber-attacks are nearly unavoidable for companies above a given size. Therefore, risk management plays an ever-larger role in IT security, and data encryption is suited, among other means, for these companies.

In general, encryption refers to the procedure that converts clear text (plaintext) into unreadable ciphertext using a key, where the outgoing information only becomes readable again by using the correct key. This minimizes the risk of an incident during data processing, as encrypted contents are basically unreadable for third parties who do not have the correct key. Encryption is the best way to protect data during transfer and one way to secure stored personal data. It also reduces the risk of abuse within a company, as access is limited only to authorized people with the right key.
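As a rough illustration, symmetric encryption of a record can be sketched in Python with the cryptography package (an assumption: the package is installed; real deployments also need careful key management):

# A minimal symmetric-encryption sketch using the cryptography package
# (assumed installed). Key management is out of scope here.
from cryptography.fernet import Fernet

key = Fernet.generate_key()      # in practice, store this in a key vault
cipher = Fernet(key)

token = cipher.encrypt(b"customer_email=jane@example.com")
print(token)                     # unreadable without the key

plaintext = cipher.decrypt(token)
print(plaintext.decode())        # readable again only with the correct key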
Question 77. Which of the following ways can be used to achieve the desired quality output for
standardized names in a master data management (MDM) architecture? (Select TWO)
(A) Use different locales to standardize names properly
(B) Define and apply customized schemes to standardize differently spelled words to
common words (e.g. Assoc, Assocn. and Assn. to Association)
(C) Don't define and apply customized schemes to standardize differently spelled words to
common words (e.g. Assoc, Assocn. and Assn. to Association)
(D) Use the same locales to standardize names properly

Explanation 77. The correct answers are:


1. Use different locales to standardize names properly
2. Define and apply customized schemes to standardize differently spelled words to
common words (e.g. Assoc, Assocn. and Assn. to Association)

Master data management (MDM) arose out of the necessity for businesses to improve the
consistency and quality of their key data assets, such as product data, asset data, customer data,
location data, etc.

There can be many problems while standardizing name and address information. Some of the
most common problems and their respective approaches are described below.

Business Scenarios for Name Standardization:


1. Individual and organization names combined and incorrectly classified
2. Presence of abbreviations, numbers, apostrophes, special characters, improper casing
3. Differently typed name prefixes (Mister vs. Mr. etc.)
4. Names of different countries of origin, mixed locales
5. Non-name words (e.g. of, the, dated, from, company-specific words)
6. Multiple middle and last names

Though there are many ways to achieve the desired quality output for standardized names, some proven ones are mentioned below:
1. Use identification analysis along with other customized approaches to correctly classify an entity as an organization or an individual. This helps standardize names differently.
2. Use multiple standardization definitions to separate unnecessary words (special characters,
numbers, company-specific words, filler words etc.) from names which can cause improper
standardization.
3. Define and apply customized schemes to standardize differently spelled words to common
words (e.g. Assoc, Assocn. and Assn. to Association).
4. Use different locales to standardize names properly.
5. Use single and multiple name parsing steps to separate parts from the names (prefix, first,
middle, last, suffix and title).
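A customized standardization scheme like the one in point 3 can be sketched in a few lines of Python; the mapping and sample names below are hypothetical:

# A minimal sketch of a customized standardization scheme; the mapping
# and sample names are hypothetical.
STANDARDIZATION_SCHEME = {
    "assoc": "Association",
    "assocn": "Association",
    "assn": "Association",
    "co": "Company",
    "inc": "Incorporated",
}

def standardize_name(name: str) -> str:
    """Replace differently spelled words with their standard form."""
    words = name.replace(".", "").split()
    return " ".join(STANDARDIZATION_SCHEME.get(w.lower(), w) for w in words)

print(standardize_name("Teachers Assn."))  # Teachers Association
print(standardize_name("Acme Co."))        # Acme Company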

Question 78. The ACME Corporation hired an analyst to detect data quality issues in their Excel documents. Which of the following are the most common issues? (Select TWO)
(A) Duplicates
(B) Commas
(C) Symbols
(D) Misspellings
(E) Apostrophe

Explanation 78. The correct answers are:


1. Duplicates
2. Misspellings

The most common data quality issues are difficult to resolve in Excel because of its rigidity. It forces analysts to do a ton of manual work, which results in a high probability of an error being introduced to the data set.
Those common issues include:
1. Blanks
2. Nulls
3. Outliers
4. Duplicates
5. Extra spaces
6. Misspellings
7. Abbreviations and domain-specific variations
8. Formula error codes

When introduced, these errors can skew or even invalidate the resulting analysis. A smart tool
would minimize the possibility of error by automating the manual work.

In Excel, you might look for data quality issues in one of two ways. First, you might use auto
filters on specific columns to scan for anomalies and blanks or you might use a pivot table to find
gaps and discrepancies.

In either case, you’re scanning for the anomalies yourself. Suffice it to say that’s not a very
efficient process. It also means accuracy is only as good as the analyst’s eye, so the probability
of error varies throughout the day.
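By contrast, a couple of these checks can be automated in a few lines of Python with pandas (assumptions: pandas and an Excel reader such as openpyxl are installed; the file and column names are hypothetical):

# A minimal sketch of automated checks for duplicates and spelling variants,
# using pandas (assumed installed); the file and column names are hypothetical.
import pandas as pd

df = pd.read_excel("customers.xlsx")

# Flag fully duplicated rows
print(df[df.duplicated(keep=False)])

# Surface likely misspellings/variants: rare values of a categorical column
counts = df["city"].str.strip().str.title().value_counts()
print(counts[counts == 1])  # one-off values are candidates for review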

Question 79. Which of the following policies is a set of guidelines that helps organizations keep
track of how long information must be kept and how to dispose of the information when it’s no
longer needed?
(A) Data processing policy
(B) Data deletion policy
(C) Data retention policy
(D) Acceptable use policy
Explanation 79. The correct answer is: Data retention policy

A data retention policy is a set of guidelines that helps organizations keep track of how long
information must be kept and how to dispose of the information when it’s no longer needed.

The policy should also outline the purpose of processing personal data. This ensures that you
have documented proof that justifies your data retention and disposal periods.
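One small, hypothetical way to enforce such a policy on exported files can be sketched in Python; the directory and the seven-year period are illustrative only, not a legal recommendation:

# A hypothetical sketch of enforcing a retention period on exported files;
# the directory and seven-year period are illustrative.
import time
from pathlib import Path

RETENTION_SECONDS = 7 * 365 * 24 * 3600  # e.g. a seven-year retention period
cutoff = time.time() - RETENTION_SECONDS

for f in Path("exports").glob("*.csv"):
    if f.stat().st_mtime < cutoff:
        f.unlink()  # dispose of data that has exceeded its retention period
        print(f"Deleted {f}")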

Question 80. Which of the following categories would contain information about an individual’s
demographic information, medical histories, laboratory results, and mental health conditions?
(A) Personally identifiable information
(B) Personal health information
(C) Sensitive personal information
(D) Intellectual property

Explanation 80. The correct answer is: Personal health information

Protected health information (PHI), also referred to as personal health information, generally
refers to demographic information, medical histories, test and laboratory results, mental health
conditions, insurance information, and other data that a healthcare professional collects to
identify an individual and determine appropriate care.

Personally identifiable information is incorrect. Personally identifiable information (PII) is any data that can be used to identify a specific individual. Social Security numbers, mailing or email addresses, and phone numbers have most commonly been considered PII, but technology has expanded the scope of PII considerably. It can include an IP address, login IDs, social media posts, or digital images. Geolocation, biometric, and behavioral data can also be classified as PII.

Sensitive Personal Information is incorrect. Sensitive Personal Information (SPI) refers to information that does not identify an individual, but is related to an individual, and communicates information that is private or could potentially harm an individual should it be made public. This includes things like biometric data, genetic information, sex, trade union membership, sexual orientation, etc.

Intellectual property is incorrect. Intellectual property (IP) is a term for any intangible asset —
something proprietary that doesn’t exist as a physical object but has value. Examples of
intellectual property include designs, concepts, software, inventions, trade secrets, formulas, and
brand names, as well as works of art. Intellectual property can be protected by copyright,
trademark, patent, or other legal measures.
BONUSES
&
DISCOUNTS

Enrich your online experience with ExamsDigest.


Your purchase of this product includes free access to online practice exam simulators on
examsdigest.com. You will have access for one (1) month. You may also access our full library
of Practice exams and share with other learners. Send us an email to info@examsdigest.com
now and start your online practice experience!

Your purchase includes:


✓ Access to online simulators for 1 month
✓ Up to 20% discount on Exam Voucher
✓ $10 OFF on ExamsDigest Marketplace

About ExamsDigest.
ExamsDigest started in 2019 and hasn’t stopped smashing it since. ExamsDigest is a global, education-tech-oriented company that doesn’t sleep. Its mission is to be a part of your life transformation by providing you the necessary training to hit your career goals.

You might also like