
CHENCHU’S

🚀 Mastering PySpark
and Databricks 🚀

Databricks Unity Catalog


Part-01

C .R. Anil Kumar Reddy


Associate Developer for Apache Spark 3.0

Anil Reddy Chenchu


Follow me on Linkedin
www.linkedin.com/in/chenchuanil

What is Unity Catalog?

Unity Catalog provides centralized access control, auditing, lineage, and data discovery capabilities across Azure Databricks workspaces.

Pic Credit : Shanmuk Sattiraju


Without Unity Catalog:

Each Azure Databricks workspace operates independently.

User Management: Handled separately for each workspace.
Hive Metastore: Each workspace has its own Hive Metastore, leading to siloed metadata storage.
Access Controls: Must be managed separately for each workspace, increasing administrative overhead.
Clusters and SQL Warehouses: Operate in isolation within each workspace.

This setup results in fragmented management and a lack of consistency across multiple workspaces.

With Unity Catalog:

Unity Catalog centralizes critical components for multiple Azure Databricks workspaces:

User Management: Unified across all workspaces.
Metastore: A single centralized metastore, providing consistent metadata across workspaces.
Access Controls: Centralized access policies, ensuring uniform governance.
Clusters and SQL Warehouses: Workspaces remain distinct but leverage centralized management.

This configuration improves governance, simplifies management, and ensures better collaboration across workspaces by providing a unified platform for user, metadata, and access management.


Pic Credit : Shanmuk Sattiraju

🚀 Key Features of Unity Catalog


1. Centralized Data Governance
·Manage permissions and access controls from a single place across all Databricks workspaces.
·Unified view of tables, files, and machine learning models.

2. Fine-Grained Access Control
·Granular access control at the catalog, schema, table, row, and column levels.
·Supports attribute-based access control (ABAC) and role-based access control (RBAC).

3. Data Lineage
·Track end-to-end lineage from raw data ingestion to analytics and AI/ML outputs.
·Understand how data flows through transformations and usage.

4. Auditability
·Maintain audit logs to track who accessed or modified data and when.
·Ensure compliance with regulatory standards.

5. Metadata Management
·Centralized metadata catalog for datasets and assets.
·Easily search and discover data using metadata search.

6. Cross-Workspace Collaboration
·Share data and assets seamlessly across different Databricks workspaces.
·Support for multi-cloud environments (AWS, Azure, GCP).


🛠️ Core Components of Unity Catalog


1. Catalog
Top-level container for organizing schemas (databases).
Example: Sales_Catalog

2. Schema (Database)
Organizes tables and views within a catalog.
Example: Sales_Catalog.Marketing_Schema

3. Tables and Views
Store data in structured formats.
Example: Sales_Catalog.Marketing_Schema.Customer_Table

4. Storage Credentials
Secure credentials for accessing external storage systems.

5. External Locations
Define paths in external storage systems (e.g., S3, ADLS).
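The three-level namespace above (catalog.schema.table) can be sketched as a small helper. The catalog, schema, and table names are this post's own examples, not real objects:

```python
# Hypothetical helper illustrating Unity Catalog's three-level namespace:
# catalog.schema.table. All names are illustrative assumptions.

def fully_qualified_name(catalog: str, schema: str, table: str) -> str:
    """Compose the three-level identifier that Unity Catalog expects."""
    return f"{catalog}.{schema}.{table}"

# In a Databricks notebook, you would pass this name to spark.sql(), e.g.
# spark.sql(f"SELECT * FROM {name} LIMIT 10")
name = fully_qualified_name("Sales_Catalog", "Marketing_Schema", "Customer_Table")
print(name)  # Sales_Catalog.Marketing_Schema.Customer_Table
```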


🧩 How Unity Catalog Integrates with Databricks Workspaces


·Workspaces connect to the Unity Catalog Metastore.
·Administrators configure access policies via Unity Catalog.
·Data scientists and engineers access the same datasets consistently across different
workspaces.

🌐 Supported Data Sources


·Delta Lake Tables
·Parquet Files
·External Tables (S3, ADLS Gen2)
·Databases

📊 Benefits of Unity Catalog


·Simplified data governance.
·Enhanced security and compliance.
·Improved collaboration across teams and cloud platforms.
·Efficient data discovery and lineage tracking.

📚 Use Cases
1.Enterprise Data Governance
2.Data Sharing Across Teams
3.Audit and Compliance
4.Unified Data Discovery


🧩 How Unity Catalog Integrates with ADLS Gen2

Pic Credit : Shanmuk Sattiraju

How it Works

1. Data stored in Azure Data Lake Gen2 is accessed by Azure Databricks through the Databricks Access Connector.
2. Unity Catalog centrally governs data access, ensuring:
Secure connectivity between Databricks and storage.
Unified permissions across all Databricks workspaces.
3. Azure Databricks Workspaces interact with Unity Catalog to:
Manage metadata.
Define and enforce access policies.
Run workloads on data residing in Azure Data Lake Gen2.
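As a rough sketch of step 1, the SQL below registers an external location that points at an ADLS Gen2 container through a storage credential. The credential name hr_adls_cred, the location name hr_landing, and the storage account are hypothetical; the credential is assumed to have been created already (for example in Catalog Explorer, backed by the Access Connector's managed identity). In a notebook you would run the statement with spark.sql().

```python
# Hypothetical names throughout: hr_landing, hr_adls_cred, and the storage
# account are illustrations, not real resources.
container_url = (
    "abfss://mycompany-hr-prod@storage-account.dfs.core.windows.net/landing"
)

external_location_sql = (
    "CREATE EXTERNAL LOCATION IF NOT EXISTS hr_landing "
    f"URL '{container_url}' "
    "WITH (STORAGE CREDENTIAL hr_adls_cred)"
)
print(external_location_sql)  # in Databricks: spark.sql(external_location_sql)
```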


Pic Credit : Shanmuk Sattiraju

Metastores

The metastore is the top-level container for metadata in Unity Catalog. It registers metadata about data and AI assets and the permissions that govern access to them. For a workspace to use Unity Catalog, it must have a Unity Catalog metastore attached. You can have only one metastore for each region in which you have workspaces.

·In a Unity Catalog metastore, the three-level database object hierarchy consists of catalogs that contain schemas, which in turn contain data and AI objects, like tables and models.

Level one:
·Catalogs are used to organize your data assets and are typically used as the top level in your data isolation scheme. Catalogs often mirror organizational units or software development lifecycle scopes. Non-data securable objects, such as storage credentials and external locations, are used to manage your data governance model in Unity Catalog. These also live directly under the metastore.

Level two:

·Schemas (also known as databases) contain tables, views, volumes, AI models, and functions. Schemas organize data and AI assets into logical categories that are more granular than catalogs. Typically a schema represents a single use case, project, or team sandbox.

Level three:

·Volumes are logical volumes of unstructured, non-tabular data in cloud object storage. Volumes can be either managed, with Unity Catalog managing the full lifecycle and layout of the data in storage, or external, with Unity Catalog managing access to the data from within Azure Databricks, but not managing access to the data in cloud storage from other clients.

·Tables are collections of data organized by rows and columns. Tables can be either managed, with Unity Catalog managing the full lifecycle of the table, or external, with Unity Catalog managing access to the data from within Azure Databricks, but not managing access to the data in cloud storage from other clients.

·Views are saved queries against one or more tables.

·Functions are units of saved logic that return a scalar value or set of rows.

·Models are AI models packaged with MLflow and registered in Unity Catalog as functions.
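The three levels above map directly onto the DDL you would run (via spark.sql()) to build out the hierarchy. This is a minimal sketch; the sales/marketing names are assumptions:

```python
# Each statement targets one level of the hierarchy; all object names are
# hypothetical examples, not real assets.
ddl_statements = [
    "CREATE CATALOG IF NOT EXISTS sales",                     # level one
    "CREATE SCHEMA IF NOT EXISTS sales.marketing",            # level two
    "CREATE TABLE IF NOT EXISTS sales.marketing.customers "   # level three: table
    "(id INT, name STRING)",
    "CREATE VOLUME IF NOT EXISTS sales.marketing.raw_files",  # level three: volume
]
for stmt in ddl_statements:
    print(stmt)  # in Databricks: spark.sql(stmt)
```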

Other securable objects

In addition to the database objects and AI assets that are contained in schemas, Unity Catalog also governs access to data using the following securable objects:

·Service credentials, which encapsulate a long-term cloud credential that provides access to an external service.

·Storage credentials, which encapsulate a long-term cloud credential that provides access to cloud storage.

·External locations, which contain a reference to a storage credential and a cloud storage path. External locations can be used to create external tables or to assign a managed storage location for managed tables and volumes.

·Connections, which represent credentials that give read-only access to an external database in a database system like MySQL using Lakehouse Federation.

·Clean rooms, which represent a Databricks-managed environment where multiple participants can collaborate on projects without sharing underlying data with each other.

·Shares, which are Delta Sharing objects that represent a read-only collection of data and AI assets that a data provider shares with one or more recipients.

·Recipients, which are Delta Sharing objects that represent an entity that receives shares from a data provider.

·Providers, which are Delta Sharing objects that represent an entity that shares data with a recipient.

Plan your data isolation model

When an organization uses a data platform like Azure Databricks, there is often a need to have data isolation boundaries between environments (such as development and production) or between organizational operating units. Isolation standards might vary for your organization, but typically they include the following expectations:

Users can only gain access to data based on specified access rules.
Data can be managed only by designated people or teams.
Data is physically separated in storage.
Data can be accessed only in designated environments.

The need for data isolation can lead to siloed environments that can make both data governance and collaboration unnecessarily difficult. Azure Databricks solves this problem using Unity Catalog, which provides a number of data isolation options while maintaining a unified data governance platform.

Users can only gain access to data based on specified access rules

Most organizations have strict requirements around data access based on internal or regulatory requirements. Typical examples of data that must be kept secure include employee salary information or credit card payment information. Access to this type of information is typically tightly controlled and audited periodically. Unity Catalog provides you with granular control over data assets within the catalog to meet these industry standards. With the controls that Unity Catalog provides, users can see and query only the data that they are entitled to see and query.

Data can be managed only by designated people or teams

Unity Catalog gives you the ability to choose between centralized and distributed governance models.

In the centralized governance model, your governance administrators are owners of the metastore and can take ownership of any object and grant and revoke permissions.
In a distributed governance model, the catalog or a set of catalogs is the data domain. The owner of that catalog can create and own all assets and manage governance within that domain. The owners of any given domain can operate independently of the owners of other domains.

Regardless of whether you choose the metastore or catalogs as your data domain, Databricks strongly recommends that you set a group as the metastore admin or catalog owner.

Owners can grant users, service principals, and groups the MANAGE permission to allow them to grant and revoke permissions on objects.
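As a sketch of how an owner might delegate governance, the grants below combine the MANAGE permission with ordinary privilege management. The group names are hypothetical, and each statement would be run with spark.sql() (or in a SQL editor) by someone with sufficient privileges:

```python
# MANAGE lets the grantee themselves grant and revoke permissions on the
# object; the other statements show ordinary privilege management for
# contrast. `data_stewards`, `analysts`, and `interns` are hypothetical
# groups, and hr_prod is the example catalog used elsewhere in this post.
statements = [
    "GRANT MANAGE ON CATALOG hr_prod TO `data_stewards`",
    "GRANT USE CATALOG ON CATALOG hr_prod TO `analysts`",
    "REVOKE SELECT ON TABLE hr_prod.default.salaries FROM `interns`",
]
for s in statements:
    print(s)  # in Databricks: spark.sql(s)
```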


Data is physically separated in storage

An organization can require that data of certain types be stored within specific
accounts or buckets in their cloud tenant.

Unity Catalog gives you the ability to configure storage locations at the metastore, catalog, or schema level to satisfy such requirements.

For example, let’s say your organization has a company compliance policy that requires production data relating to human resources to reside in the container abfss://mycompany-hr-prod@storage-account.dfs.core.windows.net. In Unity Catalog, you can achieve this requirement by setting a location at the catalog level: create a catalog called, for example, hr_prod, and assign the location abfss://mycompany-hr-prod@storage-account.dfs.core.windows.net/unity-catalog to it. This means that managed tables or volumes created in the hr_prod catalog (for example, using CREATE TABLE hr_prod.default.table …) store their data in abfss://mycompany-hr-prod@storage-account.dfs.core.windows.net/unity-catalog. Optionally, you can choose to provide schema-level locations to organize data within the hr_prod catalog at a more granular level.
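The hr_prod example can be expressed as DDL using a catalog-level managed location. This is a sketch based on the path in the example above; a metastore admin would run it via spark.sql():

```python
# The catalog name and storage path come from the hr_prod example in this
# post; substitute your own container and account.
create_catalog_sql = (
    "CREATE CATALOG IF NOT EXISTS hr_prod "
    "MANAGED LOCATION 'abfss://mycompany-hr-prod"
    "@storage-account.dfs.core.windows.net/unity-catalog'"
)
print(create_catalog_sql)  # in Databricks: spark.sql(create_catalog_sql)
```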

If such storage isolation is not required, you can set a storage location at the metastore level. The result is that this location serves as the default location for storing managed tables and volumes across catalogs and schemas in the metastore.

The system evaluates the hierarchy of storage locations from schema to catalog to
metastore.

For example, if a table myCatalog.mySchema.myTable is created in my-region-metastore, the table's storage location is determined according to the following rule:

1. If a location has been provided for mySchema, the table is stored there.
2. If not, and a location has been provided on myCatalog, the table is stored there.
3. Finally, if no location has been provided on myCatalog, the table is stored in the location associated with my-region-metastore.
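The fallback rule can be sketched in plain Python (illustrative only; Unity Catalog applies this resolution internally):

```python
from typing import Optional

def resolve_storage_location(
    schema_loc: Optional[str],
    catalog_loc: Optional[str],
    metastore_loc: str,
) -> str:
    """Schema location wins, then catalog, then the metastore default."""
    if schema_loc is not None:
        return schema_loc
    if catalog_loc is not None:
        return catalog_loc
    return metastore_loc

# mySchema has no location, myCatalog does -> the catalog location is used.
print(resolve_storage_location(
    None,
    "abfss://cat@acct.dfs.core.windows.net",
    "abfss://meta@acct.dfs.core.windows.net",
))
```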

Data can be accessed only in designated environments

Organizational and compliance requirements often specify that you keep certain data, like personal data, accessible only in certain environments. You may also want to keep production data isolated from development environments or ensure that certain data sets and domains are never joined together.

In Databricks, the workspace is the primary data processing environment, and catalogs are the primary data domain. Unity Catalog lets metastore admins, catalog owners, and users with the MANAGE permission assign, or “bind,” catalogs to specific workspaces. These environment-aware bindings give you the ability to ensure that only certain catalogs are available within a workspace, regardless of the specific privileges on data objects granted to a user.

ANIL REDDY CHENCHU


Torture the data, and it will confess to anything

DATA ANALYTICS

Happy Learning

SHARE IF YOU LIKE THE POST

Lets Connect to discuss more on Data

www.linkedin.com/in/chenchuanil
