Databricks Interview Questions
deltaTable.alias("oldData") \
.merge(updatesDF.alias("newData"), "oldData.ID = newData.ID") \
.whenMatchedUpdate(set={"Name": "newData.Name"}) \
.whenNotMatchedInsert(values={"ID": "newData.ID", "Name":
"newData.Name"}) \
.execute()
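The snippet assumes deltaTable and updatesDF already exist. A minimal setup sketch, assuming a hypothetical Delta table path of /mnt/delta/my_table and illustrative sample rows:

from delta.tables import DeltaTable

# Handle to the existing Delta table (path is an assumed example)
deltaTable = DeltaTable.forPath(spark, "/mnt/delta/my_table")

# Incoming updates as a DataFrame (rows are illustrative only)
updatesDF = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["ID", "Name"])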
4. Write code to visualize data using the display() function within
Databricks notebooks:
display(delta_df)
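display() expects an existing DataFrame; a minimal sketch for producing delta_df, assuming the Delta table sits at the same example path used above:

# Load the Delta table into a DataFrame before calling display()
delta_df = spark.read.format("delta").load("/mnt/delta/my_table")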
5. Create SQL queries within Databricks notebooks to count records in
a Delta table:
SELECT COUNT(*) FROM my_table;
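The same query can also be run from a Python cell via spark.sql; my_table is the assumed table name from the SQL example:

# Run the count from Python and print the result
row_count = spark.sql("SELECT COUNT(*) AS cnt FROM my_table").collect()[0]["cnt"]
print(f"my_table contains {row_count} rows")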
6. Show how to create and manage clusters programmatically using
Databricks REST API (pseudo-code):
import requests

# <databricks-instance> and token are placeholders for your workspace URL
# and a personal access token.
url = 'https://<databricks-instance>/api/2.0/clusters/create'
payload = {
    "cluster_name": "My Cluster",
    "spark_version": "7.x-scala2.x",
    "node_type_id": "i3.xlarge",
    "num_workers": 2,
    # Additional configuration...
}
response = requests.post(url, json=payload,
                         headers={"Authorization": f"Bearer {token}"})
print(response.json())
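A follow-up sketch, assuming the create call succeeded and returned a cluster_id, polls the cluster state through the Clusters API get endpoint (same placeholder workspace URL and token):

cluster_id = response.json().get("cluster_id")

# Check the cluster's current state (e.g., PENDING, RUNNING, TERMINATED)
status = requests.get(
    'https://<databricks-instance>/api/2.0/clusters/get',
    params={"cluster_id": cluster_id},
    headers={"Authorization": f"Bearer {token}"})
print(status.json().get("state"))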
7. Implement code to optimize Delta tables by vacuuming old files:
spark.sql("VACUUM '/mnt/delta/my_table' RETAIN 168 HOURS")
8. Write code that uses Databricks Utilities (dbutils) to list files in
DBFS:
files = dbutils.fs.ls("/mnt/delta/")
for file in files:
    print(file.name)
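Each entry returned by dbutils.fs.ls is a FileInfo object, so a sketch like the following can also report full paths and sizes (same example /mnt/delta/ path):

# FileInfo exposes path, name, and size (in bytes)
for file in dbutils.fs.ls("/mnt/delta/"):
    print(f"{file.path} ({file.size} bytes)")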
9. Show how to create and use widgets in Databricks notebooks for
parameterized queries:
dbutils.widgets.text("input_text", "")
input_value = dbutils.widgets.get("input_text")
print(f"Input value is: {input_value}")
10. Write code that demonstrates error handling when reading from
an external source (e.g., S3 bucket):
try:
    s3_df = spark.read.csv("s3a://my-bucket/data.csv")
    s3_df.show()
except Exception as e:
    print(f"Error reading data: {e}")