Unity Catalog
🚀 Mastering PySpark
and Databricks 🚀
3. Data Lineage
·Track end-to-end lineage from raw data ingestion to analytics and AI/ML outputs.
·Understand how data flows through transformations and usage.
4. Auditability
·Maintain audit logs to track who accessed or modified data and when.
·Ensure compliance with regulatory standards.
5. Metadata Management
·Centralized metadata catalog for datasets and assets.
·Easily search and discover data using metadata search.
6. Cross-Workspace Collaboration
·Share data and assets seamlessly across different Databricks workspaces.
·Support for multi-cloud environments (AWS, Azure, GCP).
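The lineage and audit capabilities above are queryable directly when system tables are enabled on the account; a sketch using Databricks' `system.access` schemas (the table name in the lineage filter is hypothetical):

```sql
-- Who accessed or modified data, and when (audit log system table)
SELECT event_time, user_identity.email, action_name, request_params
FROM system.access.audit
WHERE event_date >= current_date() - INTERVAL 7 DAYS
ORDER BY event_time DESC;

-- End-to-end table lineage: which downstream tables read from a source table
SELECT source_table_full_name, target_table_full_name, event_time
FROM system.access.table_lineage
WHERE source_table_full_name = 'sales.raw.orders';  -- hypothetical table name
```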
1.Catalog
Top-level container for organizing data assets.
Example: Sales_Catalog
2.Schema (Database)
Organizes tables and views within a catalog.
Example: Sales_Catalog.Marketing_Schema
3.Tables and Views
Hold the actual data, organized in rows and columns.
4.Storage Credentials
Secure credentials for accessing external storage systems.
5.External Locations
Define paths in external storage systems (e.g., S3, ADLS).
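An external location binds a cloud storage path to a storage credential so Unity Catalog can govern access to it. A minimal sketch in Databricks SQL; the path and credential name are placeholders:

```sql
-- External location: a governed path in cloud storage,
-- accessed through a pre-registered storage credential
CREATE EXTERNAL LOCATION IF NOT EXISTS sales_landing
URL 'abfss://landing@mystorageacct.dfs.core.windows.net/sales'  -- placeholder path
WITH (STORAGE CREDENTIAL my_azure_credential);                  -- placeholder credential
```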
📚 Use Cases
1.Enterprise Data Governance
2.Data Sharing Across Teams
3.Audit and Compliance
4.Unified Data Discovery
How it Works
Metastores
The metastore is the top-level container for metadata in Unity Catalog. It registers
metadata about data and AI assets and the permissions that govern access to
them. For a workspace to use Unity Catalog, it must have a Unity Catalog
metastore attached.
You can have only one metastore per region in which you have workspaces.
·In a Unity Catalog metastore, the three-level database object hierarchy consists
of catalogs that contain schemas, which in turn contain data and AI objects, like
tables and models.
Level one:
·Catalogs are used to organize your data assets and are typically used as the top
level in your data isolation scheme. Catalogs often mirror organizational units or
software development lifecycle scopes.
·Non-data securable objects, such as storage credentials and external locations,
are used to manage your data governance model in Unity Catalog. These also live
directly under the metastore.
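The three-level hierarchy shows up directly in how objects are addressed; a sketch with hypothetical catalog, schema, and table names:

```sql
-- catalog.schema.table: the three-level namespace
CREATE CATALOG IF NOT EXISTS finance;          -- level one
CREATE SCHEMA IF NOT EXISTS finance.reporting; -- level two
CREATE TABLE IF NOT EXISTS finance.reporting.invoices (  -- level three
  invoice_id BIGINT,
  amount DECIMAL(18, 2)
);

SELECT * FROM finance.reporting.invoices;
```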
Anil Reddy Chenchu
Follow me on Linkedin
www.linkedin.com/in/chenchuanil
CHENCHU’S
Level two:
·Schemas (also called databases) organize the data and AI objects within a
catalog, such as tables, views, volumes, models, and functions.
Level three:
·Tables are collections of data organized by rows and columns. Tables can
be either managed, with Unity Catalog managing the full lifecycle of the
table, or external, with Unity Catalog managing access to the data from
within Azure Databricks, but not managing access to the data in cloud
storage from other clients.
·Functions are units of saved logic that return a scalar value or set of rows.
·Shares, which are Delta Sharing objects that represent a read-only collection
of data and AI assets that a data provider shares with one or more recipients.
·Recipients, which are Delta Sharing objects that represent an entity that
receives shares from a data provider.
·Providers, which are Delta Sharing objects that represent an entity that
shares data with a recipient.
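The managed/external distinction described for tables above maps to whether a LOCATION clause is supplied at creation time; the names and path below are placeholders:

```sql
-- Managed table: Unity Catalog manages the full lifecycle, including the files
CREATE TABLE finance.reporting.events (
  event_id BIGINT,
  payload  STRING
);

-- External table: Unity Catalog governs access from Databricks,
-- but the files live at a path you control in cloud storage
CREATE TABLE finance.reporting.raw_events (
  event_id BIGINT,
  payload  STRING
)
LOCATION 'abfss://raw@mystorageacct.dfs.core.windows.net/events';  -- placeholder path
```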
·Users can only gain access to data based on specified access rules.
·Data can be managed only by designated people or teams.
·Data is physically separated in storage.
·Data can be accessed only in designated environments.
The need for data isolation can lead to siloed environments that can make
both data governance and collaboration unnecessarily difficult. Azure
Databricks solves this problem using Unity Catalog, which provides a
number of data isolation options while maintaining a unified data
governance platform.
Unity Catalog gives you the ability to choose between centralized and distributed
governance models.
Regardless of whether you choose the metastore or catalogs as your data domain,
Databricks strongly recommends that you set a group as the metastore admin or
catalog owner.
Owners can grant users, service principals, and groups the MANAGE permission to
allow them to grant and revoke permissions on objects.
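Assigning ownership and the MANAGE permission as recommended above might look like this in Databricks SQL; the group names are placeholders:

```sql
-- Make a group the owner of a catalog (distributed governance model)
ALTER CATALOG finance OWNER TO `data-stewards`;

-- MANAGE lets a group grant and revoke permissions on the object
GRANT MANAGE ON CATALOG finance TO `data-stewards`;

-- Ordinary read access for consumers
GRANT USE CATALOG ON CATALOG finance TO `analysts`;
GRANT USE SCHEMA, SELECT ON SCHEMA finance.reporting TO `analysts`;
```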
An organization can require that data of certain types be stored within specific
accounts or buckets in their cloud tenant.
Unity Catalog gives the ability to configure storage locations at the metastore, catalog,
or schema level to satisfy such requirements.
For example, let’s say your organization has a company compliance policy that
requires production data relating to human resources to reside in the container
abfss://mycompany-hr-prod@storage-account.dfs.core.windows.net. In Unity Catalog,
you can achieve this requirement by setting a location on a catalog level, creating a
catalog called, for example hr_prod, and assigning the location abfss://mycompany-hr-
prod@storage-account.dfs.core.windows.net/unity-catalog to it. This means that
managed tables or volumes created in the hr_prod catalog (for example, using CREATE
TABLE hr_prod.default.table …) store their data in abfss://mycompany-hr-
prod@storage-account.dfs.core.windows.net/unity-catalog. Optionally, you can
choose to provide schema-level locations to organize data within the hr_prod catalog
at a more granular level.
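The compliance example above can be expressed in DDL. The catalog name and container come from the text; the schema name is hypothetical:

```sql
-- Catalog-level managed location: managed tables and volumes created in
-- hr_prod store their data under this container
CREATE CATALOG hr_prod
MANAGED LOCATION 'abfss://mycompany-hr-prod@storage-account.dfs.core.windows.net/unity-catalog';

-- Optional schema-level location to organize data at a more granular level
CREATE SCHEMA hr_prod.payroll
MANAGED LOCATION 'abfss://mycompany-hr-prod@storage-account.dfs.core.windows.net/unity-catalog/payroll';
```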
If such a storage isolation is not required, you can set a storage location at the
metastore level. The result is that this location serves as a default location for storing
managed tables and volumes across catalogs and schemas in the metastore.
The system evaluates the hierarchy of storage locations from schema to catalog to
metastore.
Happy Learning