Databricks interview questions for data engineering professionals
Check out these 20 common Databricks interview questions to help you hire a data
engineering professional for your company.
1. Explain the basic concepts in Databricks.
2. What does the caching process involve?
3. What are the different types of caching?
4. Should you ever remove and clean up leftover data frames in Databricks?
5. How do you create a Databricks personal access token?
6. What steps should you take to revoke a private access token?
7. What are the benefits of using Databricks?
8. Can you use Databricks along with Azure Notebooks?
9. Do you need to store an action’s outcome in a different variable?
10. What is autoscaling?
11. Can you run Databricks on private cloud infrastructure?
12. What are some issues you can face in Databricks?
13. Why is it necessary for us to use the DBU framework?
14. Explain what workspaces are in Databricks.
15. Is it possible to manage Databricks using PowerShell?
16. What is Kafka for?
17. What is a Delta table?
18. Which cloud service category does Databricks belong to: SaaS, PaaS, or IaaS?
19. Explain the differences between a control plane and a data plane.
20. What are widgets used for in Databricks?
To quickly assess your candidates’ responses, review these sample answers to common
Databricks interview questions.
1. Explain the basic concepts in Databricks.
Databricks is a set of cloud-based data engineering tools that help process and convert large
amounts of information. Programmers and developers can use these tools to enhance machine
learning or stream data analytics.
With spending on cloud services expected to grow by 23% in 2023, candidates must
understand what Databricks is and how it works.
Below are some of the main concepts in Databricks:
Accounts and workspaces
Databricks units (DBUs)
Data science and engineering
Dashboards and visualizations
Databricks interfaces
Authentication and authorization
Computation management
Machine learning
Data management
Send candidates a Data Science test to see what they know about machine learning, neural
networks, and programming. Their test results will provide you with valuable insight into
their knowledge of data engineering tools.
2. Which cloud service category does Databricks belong to: SaaS, PaaS, or
IaaS?
Since a workspace in Databricks falls under the category of software, this programming
environment is software-as-a-service (SaaS). Users connect to and navigate the cloud-based
application over the internet through a web browser.
Coding professionals will have to manage their storage and deploy applications after
adjusting their designs in Databricks. Therefore, it’s essential to hire a candidate who
understands cloud computing.
3. Should you ever remove and clean up leftover data frames in Databricks?
The simple answer is no – unless the frames are cached. Cached data can eat up a large
amount of memory and network bandwidth, so it's better to eliminate cached datasets that
are no longer used in Databricks.
Your top candidates might also mention that deleting unused frames could reduce cloud
storage costs and enhance the efficiency of data engineering tools.
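For reference, here is a minimal PySpark sketch, assuming a running Databricks or Spark session named spark, showing how a cached DataFrame can be released once it is no longer needed:

```python
# Minimal sketch: caching and releasing a DataFrame in PySpark.
# Assumes an existing SparkSession named `spark` (provided automatically in Databricks).
df = spark.range(1_000_000)      # hypothetical example dataset

df.cache()                       # keep the data in memory for repeated use
df.count()                       # the first action materializes the cache

# ... repeated work against the cached DataFrame ...

df.unpersist()                   # release the cached blocks when done
spark.catalog.clearCache()       # or drop everything cached in the session
```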
4. How do you create a Databricks personal access token?
A personal access token is a string of characters that authenticate users who try to access a
system. This type of authentication is scalable and efficient because websites can verify users
without slowing down.
Candidates should have some experience with creating access tokens. Look for skilled
applicants with strong programming skills who can describe the following steps:
Click the user profile icon from the Databricks desktop
Choose “User Settings” and click the “Access Tokens” tab
A button labeled “Generate New Token” should appear
Click the button to generate the token, then copy and store it securely
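Once generated, the token is typically used as a bearer credential for the Databricks REST API. A minimal sketch, assuming a hypothetical workspace URL and a token stored in an environment variable:

```python
# Minimal sketch: authenticating to the Databricks REST API with a personal access token.
# The workspace URL and the environment variable name are assumptions for illustration.
import os
import requests

host = "https://adb-1234567890123456.7.azuredatabricks.net"   # hypothetical workspace URL
token = os.environ["DATABRICKS_TOKEN"]                         # token created in the UI

resp = requests.get(
    f"{host}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {token}"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())   # lists the clusters visible to the token's owner
```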
Use a Software Engineer test to determine whether candidates can use a programming
language and understand the fundamental concepts of computer science.
5. What are the benefits of using Databricks?
Candidates who have experience with Databricks should know about its many uses and
benefits. Since it has flexible and powerful data engineering tools, it can help programmers
and developers create the best processing frameworks.
Some top benefits include the following:
Familiar languages and environment:
Databricks integrates with programming languages such as Python, R, and SQL, making it
versatile software for all programmers.
Extensive documentation:
This powerful software provides detailed instructions on how to reference information and
connect to third-party applications. Its extensive support and documentation mean users
won’t struggle to navigate the data engineering tools.
Advanced modeling and machine learning:
One reason for using Databricks is its ability to enhance machine learning models. This
enables programmers and developers to focus on generating high-quality data and algorithms.
Big data processing:
The data engineering tools can handle large amounts of data, meaning users don’t have to
worry about slow processing.
Spark cluster creation process:
Programmers can use Spark clusters to manage processes and complete tasks in Databricks. A
Spark cluster usually comprises a driver program, worker nodes, and a cluster manager.
Send candidates a Microsoft SQL Server test to determine whether they can navigate a
database management system when using Databricks.
6. What does the caching process involve?
Caching is a process that stores copies of important data in temporary storage. This lets users
access this data quickly and efficiently on a website or platform. The high-speed data storage
layer enables web browsers to cache HTML files, JavaScript, and images to load content
faster.
Candidates should understand the functions of caching. This process is common in
Databricks, so look out for applicants who can store data and copy files.
Compare your candidates’ responses with these sample answers to gauge their level of
expertise using Databricks.
1. What is a job in Databricks?
A job in Databricks is a way to manage your data processing and applications in a workspace.
It can consist of one task or be a multi-task workflow that relies on complex dependencies.
Databricks does most of the work by monitoring clusters, reporting errors, and completing
task orchestration. The easy-to-use scheduling system enables programmers to keep jobs
running without having to move data to different locations.
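As one illustration (not the only way), a multi-task job can also be defined programmatically through the Jobs REST API. The sketch below is a rough outline; the host, token handling, notebook paths, and cluster ID are assumptions, and the exact payload fields may vary by API version:

```python
# Rough sketch: creating a simple two-task job via the Databricks Jobs API 2.1.
import os
import requests

host = "https://adb-1234567890123456.7.azuredatabricks.net"   # hypothetical workspace URL
headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

job_spec = {
    "name": "nightly-etl",                                    # hypothetical job name
    "tasks": [
        {
            "task_key": "ingest",
            "existing_cluster_id": "0101-123456-abcd123",     # hypothetical cluster
            "notebook_task": {"notebook_path": "/Jobs/ingest"},
        },
        {
            "task_key": "transform",
            "depends_on": [{"task_key": "ingest"}],           # dependency between tasks
            "existing_cluster_id": "0101-123456-abcd123",
            "notebook_task": {"notebook_path": "/Jobs/transform"},
        },
    ],
}

resp = requests.post(f"{host}/api/2.1/jobs/create", headers=headers, json=job_spec, timeout=30)
resp.raise_for_status()
print(resp.json())   # returns the new job_id
```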
2. What is the difference between an instance and a cluster?
An instance represents a single virtual machine used to run an application or service. A
cluster refers to a set of instances that work together to provide a higher level of performance
or scalability for an application or service.
Checking if candidates have this knowledge isn’t complicated when you use the right
assessment methods. Use a Machine Learning test to find out more about candidates’
experience using software applications and networking resources. This also gives your job
applicants a chance to show how they would manage large amounts of data.
3. How would you ensure the security of sensitive data in a Databricks
environment?
Databricks has network protections that help users secure information in a workspace
environment. This process prevents sensitive data from getting lost or ending up in the wrong
storage system.
To ensure proper security, administrators can configure IP access lists so that the workspace
only accepts connections from approved network locations. They should also restrict outbound
network access by deploying Databricks inside their own virtual network or virtual private cloud.
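As a rough illustration, IP access lists can be managed through the workspace REST API. The endpoint and fields below reflect my understanding of the API and should be checked against current documentation; the host, token, label, and CIDR range are assumptions:

```python
# Rough sketch: adding an IP access list that only allows traffic from an approved range.
# Note: the IP access list feature must be enabled for the workspace beforehand.
import os
import requests

host = "https://adb-1234567890123456.7.azuredatabricks.net"   # hypothetical workspace URL
headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

payload = {
    "label": "office-network",           # hypothetical label
    "list_type": "ALLOW",                # ALLOW or BLOCK
    "ip_addresses": ["203.0.113.0/24"],  # documentation-range CIDR used as a placeholder
}

resp = requests.post(f"{host}/api/2.0/ip-access-lists", headers=headers, json=payload, timeout=30)
resp.raise_for_status()
print(resp.json())
```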
4. What is the management plane in Databricks?
The management plane is a set of tools and services used to manage and control the
Databricks environment. It includes the Databricks workspace, which provides a web-based
interface for managing data, notebooks, and clusters. It also offers security, compliance, and
governance features.
Send candidates a Cloud System Administration test to assess their networking capabilities.
You can also use this test to learn more about their knowledge of computer infrastructure.
5. Define data redundancy.
Data redundancy occurs when the same data is stored in multiple locations in the same
database or dataset. Redundancy should be minimized since it is usually unnecessary and can
lead to inconsistencies and inefficiencies. Therefore, it’s usually best to identify and remove
redundancies to avoid using up storage space.
15 challenging Databricks interview questions to ask
experienced coders
Below is a list of 15 challenging Databricks interview questions to ask expert candidates.
Choose questions that will help you learn more about their programming knowledge and
experience using data analytics.
1. What is a Databricks cluster?
2. Describe a dataflow map.
3. List the stages of a CI/CD pipeline.
4. What are the different applications for Databricks table storage?
5. Define serverless data processing.
6. How will you handle Databricks code while you work with Git or TFS in a team?
7. Write the syntax to connect the Azure storage account and Databricks.
8. Explain the difference between data analytics workloads and data engineering workloads.
9. What do you know about SQL pools?
10. What is a Recovery Services Vault?
11. Can you cancel an ongoing job in Databricks?
12. Name some rules of a secret scope.
13. Write the syntax to delete the IP access list.
14. How do you set up a DEV environment in Databricks?
15. What can you accomplish using APIs?
Revisit these sample answers to challenging Databricks interview questions when choosing a
candidate to fill your open position.
1. Define serverless data processing.
Serverless data processing is a way to process data without needing to worry about the
underlying infrastructure. You can save time and reduce costs by having a service like
Databricks manage the infrastructure and allocate resources as needed.
Databricks can provide the necessary resources on demand and scale them as needed to
simplify the management of data processing infrastructure.
2. How would you handle Databricks code while working with Git or TFS in a
team?
Git and Team Foundation Server (TFS) are version control systems that help programmers
manage code. TFS cannot be used in Databricks because the software doesn't support it.
Therefore, programmers can only use Git when working on a repository system.
Candidates should also know that Git is an open-source, distributed version control
system, whereas TFS is a centralized version control system offered by Microsoft.
Since Databricks integrates with Git, data engineers and programmers can easily manage
code without constantly updating the software or reducing storage because of low capacity.
The Git skills test can help you choose candidates who are well versed in this open-source
tool. It also gives them an opportunity to prove their ability to manage data analytics projects
and source code.
3. Explain the difference between data analytics workloads and data
engineering workloads.
Data analytics workloads involve obtaining insights, trends, and patterns from data.
Meanwhile, data engineering workloads involve building and maintaining the infrastructure
needed to store, process, and manage data.
4. Name some rules of a secret scope in Databricks.
A secret scope is a collection of secrets identified by a name. Programmers and developers
can use this feature to store and manage sensitive information, including secret identities or
application programming interface (API) authentication information, while protecting it from
unauthorized access.
One rule candidates could mention is that a Databricks workspace can only hold a maximum
of 100 secret scopes.
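To make this concrete, here is a minimal sketch of reading a secret from a scope inside a Databricks notebook. The scope, key, and connection details are made up for illustration; the scope itself would be created beforehand with the Databricks CLI or REST API:

```python
# Minimal sketch: reading a secret inside a Databricks notebook.
# `dbutils` and `spark` are available automatically in Databricks notebooks.
jdbc_password = dbutils.secrets.get(scope="my-scope", key="jdbc-password")  # hypothetical names

# The value is redacted if printed, but it can be passed to connectors:
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://example.database.windows.net:1433;database=sales")  # hypothetical
    .option("user", "etl_user")
    .option("password", jdbc_password)
    .option("dbtable", "dbo.orders")
    .load()
)
```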
You can send candidates a REST API test to see how they manage data and create scopes
for an API. This test also determines whether candidates can deal with errors and security
considerations.
5. What is a Recovery Services vault?
A recovery services vault is an Azure management function that performs backup-related
operations. It enables users to restore important information and copy data to adhere to
backup regulations. The service can also help users arrange data in a more organized and
manageable way.
1. Define Databricks
Databricks is a cloud-based solution to help process and transform large amounts of
data, offered by Azure.
3. What is DBU?
DBU stands for Databricks Unit, a unit of processing capability per hour that Databricks
uses to measure resource consumption and calculate pricing.
8. What is caching?
The cache refers to the practice of storing information temporarily. When you go to a
website that you visit frequently, your browser takes the information from the cache
instead of the server. This helps save time and reduce the server’s load.
13. What are some issues you can face with Azure Databricks?
You might face cluster creation failures if you don’t have enough credits to create
more clusters. Spark errors are seen if your code is not compatible with the
Databricks runtime. You can also come across network errors if your network is not configured
properly or if you're trying to access Databricks from an unsupported location.
14. What is Kafka used for?
When Azure Databricks gathers or streams data, it establishes connections to event hubs and
data sources like Kafka.
18. How do you handle Databricks code while working in a team using TFS
or Git?
It's not possible to work with TFS, as it is not supported. You can only work
with Git or distributed Git repository systems. Although it would be fantastic to attach
Databricks directly to your Git directory of notebooks, in practice Databricks works like
another clone of the project. You should start by creating a notebook, committing it to
version control, and then updating it as you go.
They operate similarly, but data transfer to the cluster requires manual coding. This
integration is now easily possible thanks to Databricks Connect. Compared with Jupyter,
Databricks notebooks add a number of improvements that are specific to Databricks.
Information caching
Web page caching
Distributed caching
Output or application caching.
4. Should you ever remove and clean up any leftover Data Frames?
Cleaning up DataFrames is not necessary unless you use cache(), which can consume a lot of
memory and network bandwidth. You should clean up any large datasets that are being
cached but aren't being used.
5. What different ETL operations does Azure Databricks carry out on data?
The various ETL processes carried out on data in Azure Databricks are listed below:
That is certainly doable, although some setup is necessary. The preferred approach is to
create an Azure Key Vault-backed secret scope; if a secret's value needs to be changed, you
update it in Key Vault rather than changing the scoped secret itself.
7. How should Databricks code be handled when using TFS or Git for collaborative
projects?
TFS is not supported, to start. Your only choices are Git and distributed Git repository
systems. Although it would be ideal to integrate Databricks with a Git directory of
notebooks, Databricks works much like a separate clone of the project. The first steps are to
create a notebook, commit it to version control, and then update it.
8. Does Databricks have to be run on a public cloud like AWS or Azure, or can it also
run on private cloud infrastructure?
No. The only options you have right now are AWS and Azure. However, Databricks uses
Spark, which is open source. Although you could build your own Spark cluster and run it in
a private cloud, you'd be giving up access to Databricks' robust features and
administration.
On the Databricks desktop, click the "user profile" icon in the top right corner.
Choose "User Settings."
Open the "Access Tokens" tab.
A "Generate New Token" button will then show up. Just click it.
Azure Databricks connects to event hubs and data sources like Kafka when it gathers or
streams data.
Blob storage enables redundancy, but it might not be able to handle application failures
that could bring down the entire database. We have to continue using secondary Azure
blob storage as a result.
Undoubtedly, Spark Streaming is an essential part of Spark, and multiple streaming
workloads are supported: you can write streams out to files, read from streams, and stream
to and from Delta tables.
To reuse code from an Azure notebook, we first import it into our own notebook. There are
two ways we can import it.
A Databricks cluster is the set of configurations and compute resources on which we run
data science, big data, and heavy analytic workloads such as production ETL, workflows,
deep learning, and stream processing.
17. Is it possible to use Databricks to load data into ADLS from on-premises sources?
ADF is a fantastic tool for putting data into lakes, but if the data sources are on-premises,
you will also need a "self-hosted integration runtime" to give ADF access to
the data.
The majority of the structured data in data warehouses has been processed and is
managed locally with in-house expertise, and its structure cannot be changed easily. Data
lakes hold all types of data, including unstructured data such as raw and historical data.
They can be scaled up easily, and the data model can be modified quickly. They use
parallel processing to crunch the data and are typically accessed with third-party tools,
ideally in the cloud.
20. Is Databricks only available in the cloud, with no on-premises option?
Yes, Databricks is a cloud-native application. Its foundational technology, Apache Spark, is
available as an on-premises solution, so internal engineers can manage both the data and the
application locally with Spark alone. However, users who access Databricks with data on
local servers will encounter network problems. The on-premises choices must also be
weighed against workflow inefficiencies and inconsistent data.
22. What type of cloud services does Databricks provide? Do you mean SaaS, PaaS,
or IaaS?
Databricks is offered as Software as a Service (SaaS): the service uses the capabilities of
Spark clusters to manage storage. Users only need to deploy new applications after making
changes to their configurations.
23. What type of cloud service does Azure Databricks provide? Do you mean SaaS,
PaaS, or IaaS?
Platform as a Service (PaaS) is the category in which the Azure Databricks service falls.
It offers a platform for application development with features based on Azure and
Databricks. Users create and build the data life cycle and develop applications using the
services that Azure Databricks provides.
Azure Databricks is the product of effectively integrating Azure and Databricks features; it
is not simply Databricks hosted on the Azure platform. Azure Databricks is a superior
product thanks to Microsoft features like Active Directory authentication and its
integration with many Azure services. AWS Databricks, by contrast, merely serves as an
AWS cloud server for Databricks.
Java, R, Python, Scala, and standard SQL. It also supports a number of language APIs,
including PySpark, Spark SQL, spark.api.java, SparkR, and sparklyr.
Azure provides Databricks, a cloud-based tool for processing and transforming large
amounts of data.
3. Describe DBU.
A Databricks Unit (DBU) is a normalized unit of processing capability per hour, used for
managing resources and determining prices.
Among the many advantages of Azure Databricks are its lower costs, higher productivity,
and enhanced security.
Although they can be carried out similarly, data transmission to the cluster must be
manually coded. This integration can be completed without any issues thanks to
Databricks Connect.
There are four different cluster types in Azure Databricks, including interactive, job, low-
priority, and high-priority clusters.
8. Describe caching.
The act of temporarily storing information is referred to as caching. Your browser uses
the data from the cache rather than the server when you visit a website that you
frequent. Time is saved, and the load on the server is decreased.
9. Would it be OK to clear the cache?
Yes, it is OK to clear the cache, because the cached information is not essential for any
program to run correctly; it only speeds up access.
Azure Databricks is a powerful platform that is built on top of Apache Spark and is
designed specifically for big data analytics. Setting it up and deploying it to Azure
takes just a few minutes, and once it's there, it is quite easy to use. Because of its
seamless connectivity with other Azure services, Databricks is an excellent choice
for data engineers who want to work with large amounts of data in the cloud.
Utilizing Azure Databricks comes with a variety of benefits, some of which are as
follows:
Using the managed clusters provided by Databricks can cut your costs associated
with cloud computing by up to 80%.
The straightforward user experience provided by Databricks, which simplifies the
building and management of extensive data pipelines, contributes to an increase in
productivity.
Your data is protected by a multitude of security measures provided by Databricks,
including role-based access control and encrypted communication, to name just
two examples.
5. What actions should I take to resolve the issues I'm having with
Azure Databricks?
If you are having trouble using Azure Databricks, you should begin by looking over
the Databricks documentation. The documentation includes a collated list of
common issues and the remedies to those issues, as well as any other relevant
information. You can also get in touch with the support team for Databricks if you
find that you require assistance.
The data saved in Databricks is stored in the Databricks File System (DBFS), a
distributed file system that is an ideal fit for workloads involving large amounts of data.
DBFS is compatible with the Hadoop Distributed File System (HDFS).
A few examples of languages that can be used in conjunction with the Apache
Spark framework include Python, Scala, and R. Additionally, the SQL database
language is supported by Azure Databricks.
To put it another way, an instance is a virtual machine (VM) that has the Databricks
runtime installed on it and is used to execute commands. Spark applications are
typically installed on what is known as a cluster, which is just a collection of
servers.
The management plane is what keeps your Databricks deployment running smoothly. It
includes the Databricks REST API, the Azure Command Line Interface (CLI), and the Azure
portal.
11. Where can I find more information about the control plane that is
used by Azure Databricks?
The control plane is used to manage the various Spark applications. It includes the Spark
user interface and the Spark history server.
12. What is meant by the term "data plane" when referring to Azure
Databricks?
The data plane is the portion of the platform responsible for storing and processing data.
It includes the Apache Hive metastore and the Databricks filesystem.
You can stop a job that is currently running in Databricks by going to the Jobs page,
selecting the job's active run, and then choosing the Cancel option.
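The same cancellation can be done programmatically. A rough sketch against the Jobs API; the host, token handling, and run ID are illustrative assumptions:

```python
# Rough sketch: cancelling an active job run via the Databricks Jobs API.
import os
import requests

host = "https://adb-1234567890123456.7.azuredatabricks.net"   # hypothetical workspace URL
headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

resp = requests.post(
    f"{host}/api/2.1/jobs/runs/cancel",
    headers=headers,
    json={"run_id": 123456},   # hypothetical run id of the active run
    timeout=30,
)
resp.raise_for_status()
```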
Any information stored in the Databricks Delta format is kept in what is referred to as a
Delta table. Delta tables are fully compliant with ACID transactions and are designed for
fast reads and writes.
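A minimal PySpark sketch of writing and reading a Delta table; the table name, schema, and data are made up for illustration, and `spark` is the session Databricks provides:

```python
# Minimal sketch: creating and querying a Delta table from PySpark.
from pyspark.sql import functions as F

events = spark.range(1000).withColumn("event_type", F.lit("click"))

# Write the DataFrame out as a managed Delta table (assumes the `demo` schema exists).
events.write.format("delta").mode("overwrite").saveAsTable("demo.events")

# Reads get a consistent, ACID-compliant snapshot of the table.
recent = spark.read.table("demo.events").where(F.col("id") > 990)
recent.show()
```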
15. What is the name of the platform that enables the execution of
Databricks applications?
Databricks Spark was built as a fork of Apache Spark, and it has undergone development
and received upgrades that make its integration with Databricks more streamlined.
A DataFrame is a particular form of table used for storing data within the Databricks
runtime: a distributed collection of data organized into named columns and designed for
fast reads and writes. When a DataFrame is persisted as a Delta table, it also gains full
ACID transaction support.
19. Within the context of Azure Databricks, what role does Kafka
play?
When working with the streaming features of Azure Databricks, Kafka is the tool
that is recommended to use. This approach allows for the ingestion of a wide
variety of data, including but not limited to sensor readings, logs, and financial
transactions. Processing and analysis of streaming data may also be done in real-
time with Kafka, another area in which it excels.
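For context, here is a minimal Structured Streaming sketch that reads from a Kafka topic in Databricks; the broker address, topic, checkpoint path, and target table are assumptions:

```python
# Minimal sketch: reading a Kafka topic with Structured Streaming in Databricks.
stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1.example.com:9092")  # hypothetical broker
    .option("subscribe", "sensor-readings")                         # hypothetical topic
    .load()
)

# Kafka delivers key/value as binary; cast the payload to strings for downstream parsing.
decoded = stream.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")

query = (
    decoded.writeStream.format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/sensor-readings")
    .toTable("demo.sensor_readings")   # toTable requires Spark 3.1+ / a recent DBR version
)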
Yes. Apache Spark, the technology underlying Databricks, is available as an on-premises
solution, so engineers working within a company can manage the application and the data
locally with Spark alone. However, because Databricks was developed specifically for the
cloud, users may run into connectivity issues when attempting to use the service with data
that is kept on local servers. The on-premises alternatives are also hampered by
discrepancies in the data as well as workflows that are wasteful.
No. Apache Spark, which is an open-source project, serves as the foundation for Databricks,
but Databricks itself is not open source. Microsoft took part in a $250 million investment in
Databricks in 2019, and in 2017 it announced Azure Databricks, a cloud platform that
includes Databricks. Google Cloud Platform and Amazon Web Services have formed
analogous agreements.
22. Could you please explain the many types of cloud services that
Databricks offers?
PaaS stands for platform as a service, and Databricks in Azure is a PaaS offering. It is
an application development platform built on top of Microsoft Azure and Databricks.
Users are accountable for utilizing the capabilities offered by Azure Databricks to design
and develop the data life cycle as well as build applications.
24. Differences between Microsoft Azure Databricks and Amazon
Web Services Databricks.
Azure Databricks is a product that combines the features of both Azure and
Databricks in an effortless manner. Using Microsoft Azure as a cloud provider for
Databricks entails more than just utilizing a hosting service. Because it includes
Microsoft features such as Active Directory authentication and the ability to
communicate with a wide variety of Azure services, Azure Databricks is the more
advantageous product. To put it another way, AWS Databricks is simply Databricks
hosted on the AWS cloud.
Microsoft provides a reserved capacity option for customers who are interested in
achieving the greatest possible cost savings with Azure Storage. During the period that
they have reserved, customers are assured access to a predetermined amount of storage
space on the Azure cloud. Reserved capacity can be used for Block Blob and Azure Data
Lake Storage Gen2 data kept in a standard storage account.
26. Outline the individual parts that come together to form Azure
Synapse Analytics.
Applications connect to the Synapse Analytics MPP engine via a control node in
order to perform their tasks. The Synapse SQL query is delivered to the control
node, which then performs the necessary conversions to make it compatible with
MPP. The individual operations are sent to the compute nodes that can carry them
out in parallel, which allows improved query performance to be achieved.
28. How can you capture and process live (streaming) data in Azure?
The Stream Analytics Query Language is a simplified, SQL-based query language offered as
part of the Azure Stream Analytics service. Its capabilities can be extended by defining new
ML (machine learning) functions. Azure Stream Analytics makes it possible to process more
than a million events per second, and the findings can be delivered with very little delay.
29. What do you need to know to use Azure Storage Explorer?
It is a handy standalone tool that gives you the ability to manage Azure Storage
from any computer running Windows, macOS, or Linux. Users can download it from
Microsoft. Access to several Azure data stores, such as ADLS Gen2, Cosmos DB, Blobs,
Queues, and Tables, is available through its intuitive graphical user interface.
One of the most compelling features of Azure Storage Explorer is that it can keep working
in environments where users are unable to access the Azure cloud service.
30. What is Azure Databricks, and how is it distinct from regular Databricks?
Azure Databricks provides an implementation of Apache Spark, an open-source big data
processing platform, in Azure. Azure Databricks operates in the data preparation or
processing stage of the data lifecycle.
First, the Data Factory is used to import data into Azure, where it is then saved to
permanent storage (such as ADLS Gen2 or Blob Storage).
Next, the data is analyzed using machine learning (ML) in Databricks, and once the insights
have been retrieved, they are loaded into the analysis services in Azure, such as Azure
Synapse Analytics or Cosmos DB.
In the end, insights are visualized with analytical reporting tools like Power BI and
delivered to end users.
The entity's partition key is stored in the PartitionKey property of the table whenever it is
needed.
The RowKey property of an entity serves as its unique identifier within the partition.
The Timestamp property records the date and time that an entity in a table was
last modified.
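As a small illustration, here is how an entity with those system properties might be written with the Azure Tables SDK for Python. The connection string, table name, and values are assumptions; the Timestamp property is maintained by the service itself:

```python
# Minimal sketch: inserting an entity keyed by PartitionKey and RowKey into Azure Table storage.
import os
from azure.data.tables import TableServiceClient

service = TableServiceClient.from_connection_string(os.environ["AZURE_TABLES_CONN_STR"])
table = service.get_table_client("orders")   # hypothetical table

entity = {
    "PartitionKey": "customer-42",   # groups related rows in the same partition
    "RowKey": "order-0001",          # unique within the partition
    "total": 19.99,
}
table.create_entity(entity=entity)
# The service fills in the Timestamp property automatically on insert/update.
```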
Depending on how the application is set up, its code can run either on the server or on the
client side. Serverless computing, on the other hand, adheres to the principle of stateless
code, in which the code functions independently of any specific physical servers.
The user pays for the computing resources that the program uses while it is being executed,
even if this only lasts for a limited period of time. Because users pay only for the resources
they actually use, this is a very cost-effective model.
1. SQL Server firewall rules in Azure have two tiers of security. The first is a set of
server-level firewall rules for the Azure database server, which are kept in the SQL
master database. The second is database-level firewall rules, security measures used to
prevent unauthorized access to the data in individual databases.
2. Credit card numbers and other personal information saved in Azure SQL databases
are safe from prying eyes thanks to Azure SQL Always Encrypted.
3. Data in an Azure SQL Database is encrypted using Transparent Data Encryption
(TDE). Database and log file backups and transactions are encrypted and decrypted
in real time using TDE.
4. Auditing for Azure SQL Databases: Azure's SQL Database service includes built-in
auditing features. The audit policy can be set for the entire database server or for
specific databases.
1. Locally redundant storage (LRS): the data is replicated across a number of different
storage units within the same data centre, which makes it highly available. It is the most
cost-effective method for ensuring that at least three independent copies of your data are
kept.
2. Zone-redundant storage (ZRS) ensures that a copy of the data is kept in each of the
primary region's three zones. In the event that one or more of your zones becomes
unavailable, Azure will promptly repoint your DNS servers. Following the repointing of
the DNS, the network settings of any programmes that depend on data access may need
to be updated.
3. Geo-redundant storage (GRS) stores a copy of the data in a second, distinct region in
case the primary site becomes unavailable. It is possible that the secondary region's data
will not be accessible until the geo-failover process is finished.
4. Read-access geo-redundant storage (RA-GRS) allows the data stored in the secondary
region to be read in the event that a failure occurs in the primary region.
5. What are some of the methods that data can be transferred from storage
located on-premises to Microsoft Azure?
When selecting a method for the transfer of data, the following are the most important
considerations to make:
1. Data Size
2. Data Transfer Frequency (One-time or Periodic)
3. The bandwidth of the Network
Solutions for the transportation of data can take the following forms, depending on the
aforementioned factors:
1. Offline transfer: This is used for transferring large amounts of data in a single
session. Microsoft can supply customers with discs or other secure storage devices, or
customers can send Microsoft their own discs. The offline transfer options available are
Data Box, Data Box Disk, Data Box Heavy, and Import/Export (using the customer's
own drives).
2. Transfer over a network: the following methods of data transfer can be carried out
through a network connection:
Graphical Interface: This is the best option when only a few files need to be
transferred and there is no requirement for the data transfer to be automated. Azure
Storage Explorer and Azure Portal are both graphical interface choices that are
available.
Programmatic transfer: AzCopy, Azure PowerShell, and the Azure CLI are examples of
scriptable data transfer tools that are now accessible. SDKs for a number of
programming languages are also available (see the sketch after this list).
On-premises devices: A physical device known as the Data Box Edge and a virtual
device known as the Data Box Gateway are deployed at the customer's location in
order to maximize the efficiency of the data transmission to Azure.
Managed Data Factory pipelines: Azure Data Factory pipelines can move, transform, and
automate regular data transfers from on-premises data repositories to Azure.
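To illustrate the programmatic option referenced above, here is a minimal sketch that uploads a local file to Blob storage with the Python SDK; the connection string, container, and file paths are assumptions:

```python
# Minimal sketch: programmatic transfer of a local file to Azure Blob storage.
import os
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string(os.environ["AZURE_STORAGE_CONN_STR"])
blob = service.get_blob_client(container="landing", blob="exports/orders.csv")  # hypothetical names

with open("orders.csv", "rb") as data:
    blob.upload_blob(data, overwrite=True)   # overwrite any existing blob with the same name
```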
6. What is the most efficient way to move information from a database that is
hosted on-premises to one that is hosted on Microsoft Azure?
The following procedures are available through Azure for moving data from a SQL Server
that is hosted on-premises to a database hosted in Azure SQL:
With the help of the Stretch Database functionality found in SQL Server, it is possible to
move data from SQL Server 2016 to Azure.
It is able to identify idle rows, also known as "cold rows," which are rows in a database that
are rarely visited by end users and migrate those rows to the cloud. There is a reduction in
the amount of time spent backing up databases that are located on premises.
With Azure SQL Database, organizations are able to continue with a cloud-only approach
and migrate their whole database to the cloud without interrupting their operations.
Azure SQL Database Managed Instance: a database-as-a-service (DBaaS) option for SQL
Server that is compatible with a diverse range of configurations. Microsoft takes care of
database administration, and the system is nearly 100 per cent compatible with SQL Server
installed locally.
Customers that want complete control over how their databases are managed should
consider installing SQL Server in a virtual machine. This is the optimal solution. It ensures
that your on-premises instance will function faultlessly with no modifications required on
your part.
In addition, Microsoft provides clients with a tool known as Data Migration Assistant,
which is designed to aid customers in determining the most suitable migration path by
taking into account the on-premises SQL Server architecture they are already using.
7. What is a multi-model database?
The flagship NoSQL service that Microsoft offers is called Azure Cosmos DB. It was the
first globally distributed, multi-model database supplied in the cloud, and it is available
across many regions. Because multiple consistency models and consistency levels are
available, developers no longer have to choose between high availability and increased
performance as their top priority.
The following is a list of the consistency models that Cosmos DB supports:
1. Strong: whenever a read operation is carried out, the most recent version of the data
is retrieved automatically. This consistency level has a higher read-operation cost
compared with the other models.
2. Bounded staleness: you can set a limit on how far reads may lag behind writes, in
terms of time or number of versions. It works well when availability and strict
consistency are not of the first importance.
3. Session: the default consistency level for Cosmos DB, and the one used the most
across all regions. When a user reads in the same session (region) where a write was
executed, the most recent information is returned. It has the highest read and write
throughput of any consistency level.
4. Consistent prefix: users will never observe out-of-order writes; however, data is not
replicated across regions at a predetermined frequency.
5. Eventual: there is no assurance that replication will take place within a predetermined
amount of time or number of versions, but read latency and availability are the best of
any level.
9. How does the ADLS Gen2 manage the encryption of data exactly?
In contrast to its predecessor, ADLS Gen2 makes use of a comprehensive and intricate
security mechanism. The following are some of the various layers of data protection offered
by ADLS Gen2:
Azure Active Directory (AAD), shared keys, and shared access signatures (SAS) are the
three different methods of authentication it provides to ensure that user accounts are kept
secure.
Granular control over who can access which folders and files can be achieved through the
use of access control lists (ACLs) and roles.
Administrators have the ability to allow or refuse traffic from specific VPNs or IP
Addresses, which results in the isolation of networks.
Encrypts data while it is being transmitted via HTTPS, providing protection for sensitive
information.
Advanced threat protection: monitors any attempts that are made to break into your
storage account.
Every activity that is done in the account management interface is logged by the auditing
capabilities of ADLS Gen2, which serve as the system's final line of defence.
10. In what ways does Microsoft Azure Data Factory take advantage of the
trigger execution feature?
Pipelines created in Azure Data Factory can be programmed to run on their own or to react
to external events.
The following is a list of several instances that illustrate how Azure Data Factory Pipelines
can be automatically triggered or executed:
Mapping Data Flows is a code-free data integration experience offered by Microsoft, in
contrast to Data Factory pipelines, which offer a more involved data integration experience.
Data transformation flows can be designed visually, and Azure Data Factory (ADF)
activities are built from those data flows and operate as part of ADF pipelines.
12. When working in a team environment with TFS or Git, how do you
manage the code for Databricks?
The first issue is that Team Foundation Server (TFS) is not supported. You are only able to
use Git or a repository system based on Git's distributed format. Although it would be
preferable to link Databricks directly to your Git directory of notebooks, that is not currently
possible, so you can treat Databricks as a duplicate of your project. The first thing you do is
create a notebook, commit it to version control, and then update it as you work.
13. Does the deployment of Databricks necessitate the use of a public cloud
service such as Amazon Web Services or Microsoft Azure, or can it be done
on an organization's own private cloud?
No, a private cloud is not an option. AWS and Azure are the only two choices available to
you right now. On the other hand, Databricks makes use of open-source, free Spark, so you
could construct and run your own Spark cluster in a private cloud; by doing so, however,
you would give up the extensive capabilities and management that Databricks provides.
When reading a zipped CSV file or another type of serialized dataset, single-threaded
behaviour is what you get by default, because most compressed formats are not splittable.
After the dataset has been read from disc, it is maintained in memory as a distributed
dataset, despite the fact that the first read does not use a distributed format.
With Azure Data Lake or another Hadoop-based file system, a readable, chunkable
(splittable) file can be divided into a number of different extents and read by multiple
threads. If you instead split the data into numerous compressed files, you'll have one thread
for each file, which could rapidly create a bottleneck depending on how many files you
have.
16. Is the implementation of PySpark DataFrames entirely unique when
compared to that of other Python DataFrames, such as Pandas, or are there
similarities?
Spark DataFrames are not the same as Pandas, despite the fact that they take inspiration
from Pandas and perform in a similar manner. Many Python experts place perhaps too much
faith in Pandas; in Spark, it is recommended that you use Spark DataFrames rather than
Pandas at this time.
That said, Databricks is actively working to improve Pandas support. Users who move
between Pandas and Spark DataFrames should think about adopting Apache Arrow to
reduce the performance impact of converting between the two frameworks. Bear in mind
that the Catalyst engine will, at some point, convert your Spark DataFrame operations into
RDD expressions.
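A minimal sketch of the Arrow-based conversion mentioned above; the configuration key shown is the one used by recent PySpark versions (older versions use a slightly different name), and `spark` is the existing session:

```python
# Minimal sketch: Arrow-accelerated conversion between a Spark DataFrame and Pandas.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

sdf = spark.range(1_000_000).withColumnRenamed("id", "value")

pdf = sdf.toPandas()               # Arrow makes this transfer much cheaper
sdf2 = spark.createDataFrame(pdf)  # and the reverse direction as well
print(type(pdf), sdf2.count())
```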
18. Explain the types of clusters that are accessible through Azure
Databricks as well as the functions that they serve.
By asking you questions of this nature, the interviewer will be able to determine how well
you comprehend the concepts on which they are assessing your competence. Make sure that
your response to this question includes an explanation of the four categories that are
considered to be the most important. Azure Databricks provides users with a total of four
cluster types: interactive, job, low-priority, and high-priority clusters.
Interactive clusters, which give users the ability to work with the data directly, are valuable
for ad hoc analysis and discovery; they are distinguished by their high concurrency as well
as their low latency. Job clusters are what we make use of while executing jobs in batches,
and a job cluster can automatically scale up or down to accommodate fluctuating demand.
Although low-priority clusters are the most cost-effective choice, their performance is not
as good as that of other types of clusters; their low resource requirements make them an
excellent choice for low-demand applications and processes such as development and
testing. High-priority clusters offer the best performance, but at a cost that is significantly
higher than other cluster types. On these clusters, production-level workloads are able to be
processed and run.
19. How do you handle the Databricks code when working with a
collaborative version control system such as Git or the team foundation
server (TFS)?
Both TFS and Git are well-known version control and collaboration technologies that
simplify the management of huge volumes of code across several teams. The questions that
are asked of you allow the person in charge of hiring to determine whether or not you have
previous experience working with Databricks and to evaluate your capability of managing a
code base. Please provide an overview of the core methods you use to maintain the
Databricks code and highlight the most significant features of TFS and Git in your
response. In addition, please highlight the most important aspects of TFS and Git.
Git is free and open-source software that is reported to handle codebases of over 15 million
lines of code, while Microsoft's Team Foundation Server (TFS) is reported to handle over
5 million. Git is considered less secure than TFS, which allows users to grant granular
rights such as read/write access.
Notebooks created with Azure Databricks can easily be connected with the version control
systems Git, Bitbucket Cloud, and TFS, although the particular steps for integrating a given
service may vary. Once integrated, the code for Databricks works much like a second clone
of the project. To manage the Databricks code, I first build a notebook, then commit it to
the repository, and last, I update it as necessary.
20. What would you say were the most significant challenges you had to
overcome when you were in your former position?
When it comes to a question like this, the only thing that should guide a person's response
is their professional history. The person in charge of hiring wants to know all about the
difficulties you have faced and how you have managed to prevail over them. In the event
that you have past experience working with Azure Databricks, it is possible that you have
encountered difficulties with the data or server management that hampered the efficiency of
the workflow.
Due to the fact that it was my first job, I ran into several problems in my former role as a
data engineer. Improving the overall quality of the information that was gathered
constituted a considerable challenge. I initially had some trouble, but after a few weeks of
studying and developing efficient algorithms, I was able to automatically delete 80–90% of
the data.
Another significant issue was the ineffectiveness of the team's ability to work together. In
the past, the company would process its data by first separating it across various servers,
and then going offline to do so. The data-driven procedures as a whole saw a significant
amount of slowdown, and a great number of errors were created. I was able to help
centralize all the data collection on a single Azure server and connect Databricks, which
streamlined the majority of the process and allowed us to receive real-time insights, despite
the fact that it took me around two months to do so.
If the interviewer asks you a question that tests your technical knowledge, they will be able
to evaluate how well you know this particular field of expertise. Your response to this
inquiry will serve as evidence that you have a solid grasp of the fundamental principles
behind Databricks. Kindly offer a concise explanation of the benefits that the workflow
process gains from having data flow mapping implemented.
In contrast to Data Factory pipelines, Mapping Data Flows are available through Microsoft
and can be utilized for data integration without the requirement of any scripting. It is a
graphical tool for constructing procedures that transform data. The resulting data flows then
become ADF activities that can be carried out as components of ADF pipelines, which is
beneficial when changing the flow of data.
This kind of question could be asked of you during the interview if the interviewer wants to
evaluate how adaptable you are with Databricks. This is a fantastic opportunity for you to
demonstrate your capacity for analysis and attention to detail. Include in your response a
concise explanation of how to deploy it to a private cloud as well as a list of cloud server
options.
Amazon Web Services (AWS) and Microsoft Azure are the only two cloud computing
platforms that can currently be accessed. Databricks makes use of open-source Spark
technology, which is readily available. We could create our own cluster and host it in a
private cloud, but if we did so, we wouldn't have access to the extensive administration
tools that Databricks provides.
23. What are the Benefits of Using Kafka with Azure Databricks?
Apache Kafka is a distributed streaming platform that may be utilized for the construction
of real-time streaming data pipelines as well as streaming applications.
You will have the opportunity to demonstrate your acquaintance with the Databricks
compatible third-party tools and connectors if the query is of this sort. If you are going to
react, you ought to discuss the benefits of utilizing Kafka in conjunction with Azure
Databricks for the workflow.
Azure Databricks makes use of Kafka as its platform of choice for data streaming. It is
helpful for obtaining information from a wide variety of different sensors, logs, and
monetary transactions. Kafka makes it possible to perform processing and analysis on the
streaming data in real-time.
Notebooks today are often written in a mixture of languages. Ideally we would all use the
same one, but there is a catch: when creating a notebook that contains code written in
several languages, it is important to be considerate of the developer who will come after
you and have to debug your code.
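As a small illustration of keeping mixed-language notebooks manageable, SQL can be called from a Python cell through the Spark session instead of switching languages cell by cell; the table and column names here are hypothetical:

```python
# Minimal sketch: mixing SQL and Python in one notebook cell via the Spark session.
daily_counts = spark.sql("""
    SELECT event_date, COUNT(*) AS events
    FROM demo.web_events
    GROUP BY event_date
    ORDER BY event_date
""")

# Continue in Python with the result, so a later reader only has to follow one language.
daily_counts.show(10)
```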
25. Is it possible to write code with VS Code and take advantage of all of its
features, such as good syntax highlighting and intellisense?
Sure, VS Code includes a smattering of IntelliSense, and you can use it to write some
Python or Scala code, even if you would be doing so in the form of a script rather
than a notebook. One of the other responses also mentioned Databricks Connect, which is
acceptable in either scenario. I would suggest starting a new Scala project using
DBConnect; that way you can carry out important activities that we have been putting off,
such as running unit tests.
26. To run Databricks, do you need a public cloud provider such as Amazon
Web Services or Microsoft Azure, or is it possible to install it on a private
cloud?
If this is the case, how does it compare to the PaaS solution that we are
presently utilizing, such as Microsoft Azure?
The answer is no. At this time, your only real options are Amazon Web Services (AWS) or
Microsoft Azure. Databricks, on the other hand, makes use of open-source, cost-free Spark.
Even if it is feasible to set up your own cluster and run it locally or in a private cloud, you
will not have access to the more advanced capabilities and levels of control that are
provided by Databricks.
You have the ability to select that alternative; however, it does require a little bit of time
and work to set up. We suggest beginning here: create a Key Vault-backed secret scope so
that the secret values are stored in Azure Key Vault. If the value of a secret needs to be
changed, you do not need to update the scoped secret itself. There are a lot of benefits
associated with doing so, the most crucial one being that it can be a headache to keep track
of secrets across numerous different workspaces at the same time.
28. Is there any way we can stop Databricks from establishing a connection
to the internet?
You should be able to peer the parent virtual network with your own virtual network
(VNet) and define the necessary policies for incoming and outgoing traffic, but this will
depend on the policies of the parent virtual network. The workspace is always online, but
you can adjust the degree to which separate clusters are connected.
And in the same way that there is no mechanism to force a connection with the Azure
portal, I do not believe there is a means to force a connection with the Databricks portal
when using ExpressRoute. However, you may control what data each cluster reaches by
erecting a firewall around the code that is being executed, which gives you more control
over the situation. VNet injection gives you the ability to restrict access to your storage
accounts and data lakes, making them available only to users within your virtual network
(VNet) via service endpoints. This is an excellent security feature.
The question that needs to be answered is how one would go about first loading a
warehouse with twenty or more dimensions, and then populating the fact table.
When an action is invoked on a DataFrame, the DataFrame will determine the most time-
and resource-effective sequence in which to apply the transformations that you have queued
up; hence, the actions themselves are sequential. In most cases, I'll start by making a new
notebook for each data entity, as well as one for each dimension, and then I'll use an
external application to execute those notebooks concurrently.
You could, for instance, set up a Data Factory pipeline that queries a collection of
notebooks and executes all of them simultaneously. To manage orchestration and
parallelism, I would much rather utilize an external tool, because it is more visible and
flexible than embedding "parent notebooks" that handle all of the other logic.
30. In what ways can Databricks and Data Lake make new opportunities for
the parallel processing of datasets available?
After you have aligned the data and called an action to write it out to the database, the
Catalyst engine will figure out the best way to manage the data and perform the
transformations once the work is submitted. If a large number of operations involve narrow
transformations that use the same partitioning, the engine will make an effort to finish
them all at the same time.
Apache Spark is an open-source framework engine known for its speed and ease of use in
big data processing and analysis. It also has built-in modules for graph processing, machine
learning, streaming, SQL, and more. The Spark execution engine supports in-memory
computation and cyclic data flow; it can run in either cluster mode or standalone mode and
can access diverse data sources such as HBase, HDFS, and Cassandra.
High Processing Speed: Apache Spark helps in the achievement of a very high processing
speed of data by reducing read-write operations to disk. The speed is almost 100x faster
while performing in-memory computation and 10x faster while performing disk
computation.
Dynamic Nature: Spark provides 80 high-level operators which help in the easy
development of parallel applications.
In-Memory Computation: The in-memory computation feature of Spark due to its DAG
execution engine increases the speed of data processing. This also supports data caching
and reduces the time required to fetch data from the disk.
Reusability: Spark code can be reused for batch processing, data streaming, running ad hoc
queries, etc.
Fault Tolerance: Spark supports fault tolerance using RDD. Spark RDDs are the
abstractions designed to handle failures of worker nodes which ensures zero data loss.
Stream Processing: Spark supports stream processing in real-time. The problem in the
earlier MapReduce framework was that it could process only already existing data.
Lazy Evaluation: Transformations on Spark RDDs are lazy; they do not produce results right away but only create new RDDs from existing ones. The work is deferred until an action is called, which increases system efficiency (see the short sketch after this list).
Support for Multiple Languages: Spark supports multiple languages such as R, Scala, Python, and Java, which adds flexibility and overcomes Hadoop's limitation of supporting application development only in Java.
Hadoop Integration: Spark also supports the Hadoop YARN cluster manager thereby
making it flexible.
Rich Libraries: Spark ships with GraphX for graph-parallel execution, Spark SQL, libraries for machine learning, and more.
Cost Efficiency: Apache Spark is considered a more cost-efficient solution than Hadoop, since Hadoop requires large storage and data centers for data processing and replication.
Active Developer Community: Apache Spark has a large developer base involved in continuous development and is one of the most active projects in the Apache community.
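To illustrate the lazy-evaluation point above, here is a minimal sketch (assuming an existing SparkContext named sc): the map and filter calls only build up a lineage, and nothing executes until the action is invoked.

nums = sc.parallelize(range(1, 1000001))
evens = nums.filter(lambda x: x % 2 == 0)   # transformation: returns a new RDD, no work yet
squares = evens.map(lambda x: x * x)        # transformation: still no work
print(squares.count())                      # action: the whole pipeline executes here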
3. What is a DAG in Spark?
DAG stands for Directed Acyclic Graph, a graph with a finite number of vertices and edges and no directed cycles. Each edge is directed from one vertex to another in a sequential manner. The vertices represent Spark RDDs and the edges represent the operations to be performed on those RDDs.
Client Mode: The deploy mode is said to be client mode when the Spark driver component runs on the machine node from which the Spark job is submitted.
o The main disadvantage of this mode is that if the machine node fails, the entire job fails.
o This mode supports both the interactive shell and job-submission commands.
o Its performance is the worst of the deploy modes, so it is not preferred in production environments.
Cluster Mode: If the Spark driver component does not run on the machine from which the Spark job was submitted, the deploy mode is said to be cluster mode.
o The Spark job launches the driver component within the cluster as a sub-process of the ApplicationMaster.
o This mode supports deployment only through the spark-submit command (interactive shell mode is not supported).
o Here, since the driver program runs inside the ApplicationMaster, the driver is re-instantiated if the program fails.
o In this mode, a dedicated cluster manager (such as standalone, YARN, Apache Mesos, or Kubernetes) allocates the resources required for the job to run.
Apart from the above two modes, when the application has to run on our local machines for unit testing and development, the deployment mode is called "Local Mode". Here, the jobs run in a single JVM on a single machine, which makes it highly inefficient: sooner or later there will be a shortage of resources, causing jobs to fail. It is also not possible to scale up resources in this mode because of the restricted memory and space.
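The deploy mode is usually chosen at submit time. The spark-submit invocations below are illustrative only (the application file name is a placeholder); the flags shown are standard Spark options.

# Client mode: the driver runs on the machine that submits the job.
spark-submit --master yarn --deploy-mode client my_app.py

# Cluster mode: the driver runs inside the cluster under the ApplicationMaster.
spark-submit --master yarn --deploy-mode cluster my_app.py

# Local mode: everything runs in a single JVM, useful only for development and testing.
spark-submit --master "local[4]" my_app.py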
Receivers are the entities that consume data from different data sources and then move it into Spark for processing. They are created using the streaming context in the form of long-running tasks that are scheduled to operate in a round-robin fashion, with each receiver configured to use a single core. The receivers run on the various executors to accomplish the task of data streaming. There are two types of receivers, depending on how the data is sent to Spark:
Reliable receivers: Here, the receiver sends an acknowledgement to the data source after the data has been received and replicated successfully in Spark storage.
Unreliable receivers: Here, no acknowledgement is sent to the data source.
Spark supports both raw files and structured file formats for efficient reading and processing. File formats such as Parquet, JSON, XML, CSV, RC, Avro, and TSV are supported by Spark.
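As a quick sketch (assuming a SparkSession named spark and placeholder paths), several of these formats can be read directly through the DataFrame reader; the Avro line assumes the spark-avro package is on the classpath.

df_csv = spark.read.option("header", "true").csv("/data/sales.csv")
df_json = spark.read.json("/data/events.json")
df_parquet = spark.read.parquet("/data/facts.parquet")
df_avro = spark.read.format("avro").load("/data/logs.avro")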
The process of redistribution of data across different partitions which might or might not
cause data movement across the JVM processes or the executors on the separate machines
is known as shuffling/repartitioning. Partition is nothing but a smaller logical division of
data.
It is to be noted that Spark has no control over what partition the data gets distributed
across.
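A minimal sketch of explicit repartitioning (assuming an existing DataFrame named df with a customer_id column; both names are placeholders):

df_by_key = df.repartition(200, "customer_id")   # full shuffle, hash-partitions by customer_id
df_fewer = df.coalesce(10)                       # reduces the partition count without a full shuffle
print(df_by_key.rdd.getNumPartitions())          # 200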
YARN is a cluster-management technology that Spark can run on; it provides a central resource-management platform for delivering scalable operations across the cluster. YARN is a resource manager, whereas Spark is a data-processing tool.
Spark applications run as independent processes that are coordinated by the driver program through a SparkSession object. The cluster manager (resource manager) assigns the work of running Spark jobs to the worker nodes, following the principle of one task per partition. Iterative algorithms can be applied to the data repeatedly, caching datasets across iterations. Each task applies its unit of work to the dataset within its partition and produces a new partitioned dataset. These results are sent back to the driver application for further processing or stored on disk.
DAG stands for Directed Acyclic Graph, which has a set of finite vertices and edges. The vertices represent RDDs and the edges represent the operations to be performed on the RDDs sequentially. The DAG that is created is submitted to the DAG Scheduler, which splits the graph into stages of tasks based on the transformations applied to the data. The stage view shows the details of the RDDs belonging to that stage.
The DAG workflow in Spark proceeds as follows:
The first step is to interpret the code with the help of an interpreter; if you use Scala code, the Scala interpreter interprets it.
Spark then creates an operator graph as the code is entered in the Spark console.
When an action is called on a Spark RDD, the operator graph is submitted to the DAG Scheduler.
The DAG Scheduler divides the operators into stages of tasks. A stage consists of detailed step-by-step operations on the input data, and the operators within a stage are pipelined together.
The stages are then passed to the Task Scheduler, which launches the tasks via the cluster manager so they can run independently, without dependencies between stages.
The worker nodes then execute the tasks.
Each RDD keeps a pointer to one or more parent RDDs along with metadata about its relationship to them. For example, for the operation val childB = parentA.map(...) on an RDD, the resulting RDD childB keeps track of its parent parentA; this chain of dependencies is called RDD lineage.
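The lineage can be inspected directly. Here is a small PySpark sketch (assuming an existing SparkContext named sc); toDebugString prints the chain of parent RDDs.

parentA = sc.parallelize([1, 2, 3, 4])
childB = parentA.map(lambda x: x + 1)   # childB records parentA as its parent
print(childB.toDebugString())           # shows the lineage graph of childB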
4. Under what scenarios do you use Client and Cluster modes for
deployment?
If the client machines are not close to the cluster, Cluster mode should be used for deployment. This avoids the network latency of the communication between the driver and the executors that would occur in Client mode. Also, in Client mode, the entire job is lost if the client machine goes offline.
If the client machine is inside the cluster, Client mode can be used for deployment. Since the machine is inside the cluster, there are no network-latency issues, and since maintenance of the cluster is already handled, there is no cause for worry in case of failure.
Spark Streaming is one of the most important features provided by Spark. It is nothing but a
Spark API extension for supporting stream processing of data from different sources.
Data from sources like Kafka, Kinesis, Flume, etc are processed and pushed to various
destinations like databases, dashboards, machine learning APIs, or as simple as file
systems. The data is divided into various streams (similar to batches) and is processed
accordingly.
Spark streaming supports highly scalable, fault-tolerant continuous stream processing
which is mostly used in cases like fraud detection, website monitoring, website click baits,
IoT (Internet of Things) sensors, etc.
Spark Streaming first divides the data from the data stream into batches of X seconds, called DStreams (Discretized Streams). Internally, these are nothing but a sequence of RDDs. The Spark application processes these RDDs using various Spark APIs, and the results of this processing are again returned in batches.
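The classic DStream word count below sketches this workflow; it assumes text arriving on a local socket at port 9999 and uses 5-second micro-batches, both of which are illustrative choices.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, batchDuration=5)        # each 5-second batch becomes one RDD in the DStream

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                    # print each batch's counts

ssc.start()
ssc.awaitTermination()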
6. Write a Spark program to check whether a given keyword exists in a huge text file.
def keywordExists(line):
    # 1 if the keyword occurs in this line, 0 otherwise
    return 1 if line.find("my_keyword") > -1 else 0

lines = sparkContext.textFile("test_file.txt")
isExist = lines.map(keywordExists)
total = isExist.reduce(lambda a, b: a + b)
print("Found" if total > 0 else "Not Found")
Spark Datasets are the data structures of Spark SQL that provide JVM objects with all the benefits of RDDs (such as data manipulation using lambda functions) alongside the Spark SQL-optimised execution engine. Datasets were introduced in Spark 1.6.
Spark Datasets are strongly typed structures that represent structured queries along with their encoders.
They provide type safety to the data and also give an object-oriented programming
interface.
The datasets are more structured and have the lazy query expression which helps in
triggering the action. Datasets have the combined powers of both RDD and Dataframes.
Internally, each dataset symbolizes a logical plan which informs the computational query
about the need for data production. Once the logical plan is analyzed and resolved, then the
physical query plan is formed that does the actual query execution.
Optimized Query feature: Spark Datasets provide optimized queries using the Tungsten and Catalyst query optimizer frameworks. The Catalyst query optimizer represents and manipulates a data-flow graph (a graph of expressions and relational operators), while Tungsten improves the execution speed of Spark jobs by optimizing for the hardware architecture of the Spark execution platform.
Compile-Time Analysis: Datasets have the flexibility of analyzing and checking the
syntaxes at the compile-time which is not technically possible in RDDs or Dataframes or
the regular SQL queries.
Interconvertible: Type-safe Datasets can be converted to "untyped" DataFrames using the following methods provided by DatasetHolder:
o toDS(): Dataset[T]
o toDF(): DataFrame
o toDF(colNames: String*): DataFrame
Faster Computation: The Dataset implementation is much faster than RDDs, which helps increase system performance.
Persistent storage qualified: Since Datasets are both queryable and serializable, they can easily be stored in any persistent storage.
Less Memory Consumed: Spark uses caching to create a more optimal data layout, so less memory is consumed.
Single Interface, Multiple Languages: A single API is provided for both Java and Scala, the languages most widely used with Apache Spark. This reduces the burden of using different libraries for different types of input.
Spark DataFrames are distributed collections of data organized into named columns, similar to a SQL table. A DataFrame is equivalent to a table in a relational database and is mainly optimized for big data operations.
DataFrames can be created from an array of data sources such as external databases, existing RDDs, and Hive tables. The following are the features of Spark DataFrames:
Spark DataFrames can process data ranging in size from kilobytes to petabytes, on anything from a single node to large clusters.
They support different data formats such as CSV, Avro, and Elasticsearch, and various storage systems such as HDFS, Cassandra, and MySQL.
By making use of the Spark SQL Catalyst optimizer, state-of-the-art optimization is achieved.
Spark DataFrames can easily be integrated with major big data tools through Spark Core.
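A short sketch of creating DataFrames from different sources (assuming a SparkSession named spark; the file path is a placeholder):

from pyspark.sql import Row

# From an in-memory collection
people = spark.createDataFrame([Row(name="Ana", age=34), Row(name="Bo", age=28)])

# From an existing RDD of Rows
rdd = spark.sparkContext.parallelize([Row(name="Cy", age=41)])
from_rdd = spark.createDataFrame(rdd)

# From external storage
from_parquet = spark.read.parquet("/data/customers.parquet")

people.select("name").where(people.age > 30).show()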
Applications developed in Spark have a fixed core count and fixed heap size defined for the Spark executors. The heap size refers to the memory of the Spark executor, which is controlled by the spark.executor.memory property corresponding to the --executor-memory flag. Every Spark application has one executor allocated on each worker node it runs on. The executor memory is a measure of the memory consumed on the worker node by the application.
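Two common ways of setting the executor memory, sketched below with illustrative values (4g is not a recommendation):

# 1. At submit time:
#    spark-submit --executor-memory 4g my_app.py

# 2. Programmatically, before the session is created:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("memory-sizing-sketch")
         .config("spark.executor.memory", "4g")
         .getOrCreate())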
SparkCore is the main engine that is meant for large-scale distributed and parallel data
processing. The Spark core consists of the distributed execution engine that offers various
APIs in Java, Python, and Scala for developing distributed ETL applications.
Spark Core does important functions such as memory management, job monitoring, fault-
tolerance, storage system interactions, job scheduling, and providing support for all the
basic I/O functionalities. There are various additional libraries built on top of Spark Core
which allows diverse workloads for SQL, streaming, and machine learning. They are
responsible for:
Fault recovery
Memory management and Storage system interactions
Job monitoring, scheduling, and distribution
Basic I/O functions
Worker nodes are the nodes that run the Spark application in a cluster. The Spark driver program listens for incoming connections from the executors and assigns work to the worker nodes for execution. A worker node receives work from its master node and actually executes it. The worker nodes process the data and report the resources they use back to the master. The master decides how many resources need to be allocated and then, based on their availability, schedules the tasks for the worker nodes.
12. What are some of the demerits of using Spark in applications?
Although Spark is a powerful data processing engine, there are certain demerits to using Apache Spark in applications. Some of them are:
Spark makes use of more storage space when compared to MapReduce or Hadoop which
may lead to certain memory-based problems.
Care must be taken by the developers while running the applications. The work should be
distributed across multiple clusters instead of running everything on a single node.
Since Spark relies on in-memory computation, it can become a bottleneck for cost-efficient big data processing.
While using files present on the path of the local filesystem, the files must be accessible at
the same location on all the worker nodes when working on cluster mode as the task
execution shuffles between various worker nodes based on the resource availabilities. The
files need to be copied on all worker nodes or a separate network-mounted file-sharing
system needs to be in place.
One of the biggest problems with Spark is handling a large number of small files. When Spark is used with Hadoop, HDFS works best with a limited number of large files rather than a large number of small files. When there is a large number of small gzipped files, Spark must decompress them, keeping them in memory and moving them over the network. A large amount of time and core capacity is then spent unzipping the files in sequence and repartitioning the resulting RDDs into a manageable format, which requires extensive shuffling overall. This hurts Spark's performance, because much of the time is spent preparing the data instead of processing it.
Spark doesn’t work well in multi-user environments as it is not capable of handling many
users concurrently.
13. How can the data transfers be minimized while working with Spark?
Data transfers correspond to the process of shuffling. Minimizing these transfers results in faster and more reliable Spark applications. There are various ways in which they can be minimized:
Usage of Broadcast Variables: Broadcast variables increase the efficiency of joins between large and small RDDs.
Usage of Accumulators: Accumulators help update variable values in parallel during execution.
Another common way is to avoid the operations that trigger these reshuffles.
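A small sketch of the first two techniques (assuming an existing SparkContext named sc; the lookup table and records are made-up examples):

# Broadcast a small lookup table instead of shuffling it with a join.
country_lookup = sc.broadcast({"US": "United States", "IN": "India"})
orders = sc.parallelize([("US", 120.0), ("IN", 80.0), ("US", -1.0)])
named = orders.map(lambda kv: (country_lookup.value.get(kv[0], "Unknown"), kv[1]))

# Use an accumulator to count bad records without an extra pass or shuffle.
bad_records = sc.accumulator(0)

def flag_bad(kv):
    if kv[1] < 0:
        bad_records.add(1)
    return kv

named.map(flag_bad).count()
print(bad_records.value)   # 1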
A SchemaRDD is an RDD of row objects (wrappers around arrays of integers or strings) that carries schema information describing the data type of each column. SchemaRDDs were designed to ease the lives of developers when debugging code and running unit tests on the SparkSQL modules. They represent a description of the RDD that is similar to the schema of a relational database. A SchemaRDD also provides the basic functionality of common RDDs along with some of SparkSQL's relational query interfaces.
Consider an example: if you have an RDD named Person that represents a person's data, then the SchemaRDD describes what data each row of the Person RDD contains. If Person has attributes such as name and age, they are represented in the SchemaRDD.
Spark provides a powerful module called SparkSQL, which performs relational data processing combined with the power of Spark's functional programming features. The module supports querying data either through SQL or the Hive Query Language. It also provides support for different data sources and helps developers write powerful SQL queries using code transformations.
The four major libraries of SparkSQL are the Data Source API, the DataFrame API, the Interpreter & Optimizer, and the SQL Service.
Spark SQL supports the usage of structured and semi-structured data in the following ways:
Spark supports DataFrame abstraction in various languages like Python, Scala, and Java
along with providing good optimization techniques.
SparkSQL supports data read and write operations in various structured formats such as JSON, Hive, and Parquet.
SparkSQL allows data querying inside the Spark program and via external tools that do the
JDBC/ODBC connections.
It is recommended to use SparkSQL inside the Spark applications as it empowers the
developers to load the data, query the data from databases and write the results to the
destination.
16. What are the different persistence levels in Apache Spark?
Spark automatically persists intermediate data from some shuffle operations, but it is recommended to call the persist() method on any RDD that will be reused. There are different persistence levels for storing RDDs in memory, on disk, or both, with different levels of replication.
The persistence levels available in Spark are:
MEMORY_ONLY: This is the default persistence level and is used for storing the RDDs
as the deserialized version of Java objects on the JVM. In case the RDDs are huge and do
not fit in the memory, then the partitions are not cached and they will be recomputed as and
when needed.
MEMORY_AND_DISK: The RDDs are stored again as deserialized Java objects on JVM.
In case the memory is insufficient, then partitions not fitting on the memory will be stored
on disk and the data will be read from the disk as and when needed.
MEMORY_ONLY_SER: The RDD is stored as serialized Java objects, with one byte array per partition.
MEMORY_AND_DISK_SER: This level is similar to MEMORY_ONLY_SER but the
difference is that the partitions not fitting in the memory are saved on the disk to avoid
recomputations on the fly.
DISK_ONLY: The RDD partitions are stored only on the disk.
OFF_HEAP: This level is the same as the MEMORY_ONLY_SER but here the data is stored in
the off-heap memory.
The syntax for using persistence levels in the persist() method is:
df.persist(StorageLevel.<level_value>)
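For example (assuming an existing DataFrame named df), a sketch of persisting and releasing it:

from pyspark import StorageLevel

df.persist(StorageLevel.MEMORY_AND_DISK)   # keep partitions in memory, spill the rest to disk
df.count()                                 # the first action materializes the cache
df.unpersist()                             # release the storage when it is no longer needed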
Broadcast variables let developers keep a read-only variable cached on each machine instead of shipping a copy of it with every task. They are used to give every node a copy of a large input dataset efficiently. These variables are broadcast to the nodes using efficient broadcast algorithms to reduce the cost of communication.
20. Can Apache Spark be used along with Hadoop? If yes, then how?
Yes! The main feature of Spark is its compatibility with Hadoop. This makes it a powerful
framework as using the combination of these two helps to leverage the processing capacity
of Spark by making use of the best of Hadoop’s YARN and HDFS features.
21. What are Sparse Vectors? How are they different from dense vectors?
Sparse vectors consist of two parallel arrays where one array is for storing indices and the
other for storing values. These vectors are used to store non-zero values for saving space.
val sparseVec: Vector = Vectors.sparse(5, Array(0, 4), Array(1.0, 2.0))
In the above example, we have the vector of size 5, but the non-zero values are there only at
indices 0 and 4.
Sparse vectors are particularly useful when there are very few non-zero values. If there are
cases that have only a few zero values, then it is recommended to use dense vectors as
usage of sparse vectors would introduce the overhead of indices which could impact the
performance.
Dense vectors can be defined as follows:
val denseVec =
Vectors.dense(4405d,260100d,400d,5.0,4.0,198.0,9070d,1.0,1.0,2.0,0.0)
Usage of sparse or dense vectors does not impact the results of calculations but when used
inappropriately, they impact the memory consumed and the speed of calculation.
22. What is the role of caching in Spark Streaming?
Spark Streaming involves the division of data stream’s data into batches of X seconds
called DStreams. These DStreams let the developers cache the data into the memory which
can be very useful in case the data of DStream is used for multiple computations. The
caching of data can be done using the cache() method or using persist() method by using
appropriate persistence levels. The default persistence level value for input streams
receiving data over the networks such as Kafka, Flume, etc is set to achieve data replication
on 2 nodes to accomplish fault tolerance.
Cost efficiency: Since Spark computations are expensive, caching enables data reuse, which leads to reused computations and lowers the cost of operations.
Time efficiency: Reusing computations saves a lot of time.
More jobs achieved: By saving computation time, the worker nodes can execute more jobs.
Apache Spark provides the pipe() method on RDDs, which makes it possible to compose parts of a job in any language that can work with UNIX standard streams. Using the pipe() method, you can write an RDD transformation that reads each element of the RDD as a string, manipulates it as required, and emits the result as a string.
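A tiny sketch of pipe() (assuming an existing SparkContext named sc, and that the grep command is available on every worker node):

lines = sc.parallelize(["spark streaming", "hadoop yarn", "spark sql"])
spark_lines = lines.pipe("grep spark")   # each element is fed to grep's stdin as a line of text
print(spark_lines.collect())             # ['spark streaming', 'spark sql']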
Spark provides a powerful API called GraphX that extends Spark RDDs to support graphs and graph-based computations. The extended property of the Spark RDD is called the Resilient Distributed Property Graph, a directed multigraph that can have multiple parallel edges. Each edge and vertex has associated user-defined properties. The presence of parallel edges indicates multiple relationships between the same pair of vertices.
GraphX has a set of operators such as subgraph, mapReduceTriplets, joinVertices, etc that
can support graph computation. It also includes a large collection of graph builders and
algorithms for simplifying tasks related to graph analytics.
Spark provides a robust, scalable machine learning library called MLlib. This library aims at making common ML algorithms easy and scalable and includes features such as classification, clustering, dimensionality reduction, regression, and collaborative filtering. More information about this library is available in Spark's official documentation: https://spark.apache.org/docs/latest/ml-guide.html
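A minimal MLlib sketch using the DataFrame-based API (assuming a SparkSession named spark; the tiny training set and hyperparameters are made up for illustration):

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

training = spark.createDataFrame([
    (1.0, Vectors.dense([0.0, 1.1, 0.1])),
    (0.0, Vectors.dense([2.0, 1.0, -1.0])),
    (1.0, Vectors.dense([0.0, 1.2, -0.5])),
], ["label", "features"])

lr = LogisticRegression(maxIter=10, regParam=0.01)   # hyperparameters are illustrative
model = lr.fit(training)
print(model.coefficients)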