Chapter 1: Setting Up Your Databricks Lakehouse Environment

Welcome to the first chapter of our comprehensive guide to building a real-time supply chain analytics platform! In this chapter, we’ll lay the foundational groundwork for our project by setting up a robust, secure, and scalable Databricks Lakehouse environment. This initial setup is critical, as it dictates the security, governance, and operational efficiency of all subsequent data pipelines and analytics.

Our focus in this chapter will be on configuring the core components of the Databricks Data Intelligence Platform, specifically enabling Unity Catalog for centralized data governance, establishing secure authentication mechanisms, defining cluster policies for cost control and consistency, and integrating with Git for version control. By the end of this chapter, you will have a production-ready Databricks workspace capable of securely hosting and processing sensitive supply chain data, ready for the real-time ingestion pipelines we’ll build next.

### Planning & Design

Before diving into implementation, let’s visualize the key components and their interactions within our Databricks Lakehouse environment. This structured approach ensures we build a maintainable and scalable system from the outset.

#### Component Architecture for this Feature

The diagram below illustrates the foundational elements we’ll configure:

```mermaid
graph TD
    CloudProviderAccount[Cloud Provider Account] --> DatabricksAccountConsole[Databricks Account Console]
    DatabricksAccountConsole --> DatabricksWorkspace[Databricks Workspace]
    DatabricksAccountConsole --> UnityCatalogMetastore[Unity Catalog Metastore]
    UnityCatalogMetastore -->|Assigned to| DatabricksWorkspace
    DatabricksWorkspace -->|Develops and deploys| DatabricksNotebooksJobs[Databricks Notebooks and Jobs]
    DatabricksWorkspace -->|Authenticates via| ServicePrincipalAPIToken[Service Principal / API Token]
    DatabricksNotebooksJobs -->|Uses| ClusterPolicy[Cluster Policy]
    DatabricksNotebooksJobs -->|Reads and writes| UnityCatalogTables[Unity Catalog Tables]
    UnityCatalogTables -->|Organized by| CatalogsSchemas["Catalogs and Schemas: raw, bronze, silver, gold"]
    GitRepositoryGitHub["Git Repository (GitHub): DLT pipelines, streaming jobs, tests"] -->|Stores project code| DatabricksWorkspace

    subgraph DatabricksLakehouse
        DatabricksWorkspace
        DatabricksNotebooksJobs
        ServicePrincipalAPIToken
        ClusterPolicy
        UnityCatalogTables
        CatalogsSchemas
    end
```

#### Databricks Lakehouse Hierarchy

We will adopt the Medallion Architecture (Bronze, Silver, Gold) for our data organization within Unity Catalog. This provides a clear path for data refinement and quality.

*   **Catalog:** `supply_chain_analytics_catalog` (Top-level container for our project)
    *   **Schemas (Databases):**
        *   `raw`: For raw, immutable source data (e.g., Kafka ingest).
        *   `bronze`: Cleaned, de-duplicated, schema-enforced raw data.
        *   `silver`: Enriched, transformed, and joined data.
        *   `gold`: Highly aggregated, business-ready data for analytics and reporting.
        *   `config`: For lookup tables, configuration data.
        *   `audit`: For pipeline logs, data quality metrics.

#### Project File Structure (within Git Repository)

Our project will follow a modular structure, promoting code reusability, testability, and clear separation of concerns. This structure will reside in a Git repository, which we'll connect to Databricks.

```
.
├── notebooks/                    # Ad-hoc development and exploration notebooks
│   ├── 01_environment_setup.ipynb
│   └── …
├── src/                          # Source code for DLT pipelines, Structured Streaming jobs, etc.
│   ├── dlt_pipelines/
│   │   ├── event_ingestion.py
│   │   └── tariff_analysis.py
│   ├── streaming_jobs/
│   │   ├── logistics_monitoring.py
│   │   └── …
│   ├── data_quality/
│   │   └── hs_code_validation.py
│   └── utils/                    # Reusable utility functions
│       └── common_functions.py
├── tests/                        # Unit and integration tests for src code
│   ├── dlt_pipelines/
│   │   ├── test_event_ingestion.py
│   │   └── …
│   └── …
├── conf/                         # Configuration files (e.g., pipeline parameters)
│   └── pipeline_config.json
├── deploy/                       # Deployment artifacts (e.g., Databricks Asset Bundles)
│   ├── bundle.yml
│   └── …
├── README.md
└── requirements.txt              # Python dependencies
```
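If you prefer to bootstrap this layout from a script rather than by hand, a short helper can create it. This is a convenience sketch, not a required step; the directory names mirror the tree above, and the function name is ours.

```python
from pathlib import Path

# Directories mirroring the project tree shown above.
PROJECT_DIRS = [
    "notebooks",
    "src/dlt_pipelines",
    "src/streaming_jobs",
    "src/data_quality",
    "src/utils",
    "tests/dlt_pipelines",
    "conf",
    "deploy",
]


def scaffold_project(root: str) -> Path:
    """Create the project skeleton under `root` and return its path."""
    root_path = Path(root)
    for d in PROJECT_DIRS:
        (root_path / d).mkdir(parents=True, exist_ok=True)
    for f in ("README.md", "requirements.txt"):
        (root_path / f).touch()
    return root_path
```

Running `scaffold_project("databricks-supply-chain-analytics")` followed by `git init` inside the new directory gives you the structure to commit when we wire up Git in Step 1.5.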


### Step-by-Step Implementation

Let's begin setting up our Databricks environment.

#### a) Setup/Configuration

This section covers the initial configuration steps on the Databricks platform.

##### Step 1.1: Create a Databricks Workspace and Enable Unity Catalog

**Why:** Unity Catalog is the cornerstone of a modern Databricks Lakehouse, providing a unified governance layer across data and AI assets. It offers fine-grained access control, auditing, and data lineage, which are crucial for compliance and security in a real-time supply chain system. Without Unity Catalog, managing permissions and discovering data becomes significantly more complex.

**Action:**
1.  **Databricks Account:** Ensure you have access to a Databricks account (on AWS, Azure, or GCP). If not, you'll need to create one through your cloud provider's marketplace or directly via Databricks.
2.  **Create Workspace:** From the Databricks Account Console, create a new workspace if you don't have one, or select an existing one.
3.  **Enable Unity Catalog:**
    *   Navigate to the **Account Console** (`accounts.cloud.databricks.com` on AWS; Azure and GCP use their own account-console URLs).
    *   Go to **Metastore** and create a new Metastore if one doesn't exist in your region.
    *   Once the Metastore is created, assign it to your Databricks Workspace. This links your workspace to the centralized governance capabilities of Unity Catalog.
    *   Ensure your workspace has a compatible Databricks Runtime version (11.3 LTS or higher) that supports Unity Catalog.

##### Step 1.2: Create a Service Principal and Generate an API Token

**Why:** For automated processes like CI/CD pipelines, scheduled jobs, or external applications interacting with Databricks, using a Service Principal (or a non-interactive user) is a security best practice. It provides an auditable identity separate from individual users, allowing for granular permission management and avoiding reliance on personal credentials. The API token acts as the credential for this Service Principal.

**Action:**
1.  **Create Service Principal in your Cloud Provider (e.g., Azure AD, AWS IAM, GCP IAM):**
    *   **Azure:** Create an App Registration in Azure Active Directory. Grant it the `Contributor` role (or a more specific role for production) on the resource group where your Databricks workspace resides.
    *   **AWS:** Create an IAM role and attach a policy granting the permissions Databricks needs (e.g., `AmazonS3FullAccess` for your S3 buckets; in production, prefer a custom policy scoped to the specific buckets and actions).
    *   **GCP:** Create a Service Account and grant it appropriate roles for GCP resources.
2.  **Add Service Principal to Databricks Workspace:**
    *   In your Databricks Workspace, navigate to **Admin Settings > Identity and access > Service principals**.
    *   Click **Add Service Principal**.
    *   Enter the Application ID (Azure) or Service Principal Name (AWS/GCP).
    *   Grant the Service Principal **Workspace Admin** role initially for setup, then restrict it to specific permissions later.
3.  **Generate a Databricks API Token for the Service Principal:**
    *   Still in **Admin Settings > Identity and access > Service principals**, click on the Service Principal you just created.
    *   Go to the **Tokens** tab and click **Generate new token**.
    *   Provide a comment (e.g., "CI/CD Token") and set a lifetime.
    *   **IMPORTANT:** Copy the generated token immediately. It will not be shown again. Store this token securely in a secret management service (e.g., Azure Key Vault, AWS Secrets Manager, HashiCorp Vault) and **never hardcode it in your code or commit it to Git.**
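One way to honor that rule in automation code is to never reference the token literal at all and read it from the environment at runtime, with your secret manager injecting the variable in CI. A minimal sketch: `DATABRICKS_TOKEN` is the variable name conventionally read by Databricks tooling, while the helper function itself is ours.

```python
import os


def databricks_auth_headers() -> dict:
    """Build REST API auth headers from the DATABRICKS_TOKEN environment
    variable, so the token never appears in source code or Git history."""
    token = os.environ.get("DATABRICKS_TOKEN")
    if not token:
        raise RuntimeError(
            "DATABRICKS_TOKEN is not set; load it from your secret manager "
            "(e.g., Azure Key Vault, AWS Secrets Manager) before running."
        )
    return {"Authorization": f"Bearer {token}"}
```

In a CI/CD pipeline, the secret manager (or the CI system's secret store) exports `DATABRICKS_TOKEN` into the job environment, and every API call builds its headers through this single choke point.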

#### b) Core Implementation

Now, let's configure Unity Catalog and define a cluster policy.

##### Step 1.3: Define Unity Catalog Hierarchy (Catalogs, Schemas)

**Why:** Establishing a clear catalog and schema structure is fundamental for data organization, discoverability, and applying fine-grained access control with Unity Catalog. The Medallion Architecture (raw, bronze, silver, gold) provides a standardized approach to data quality and refinement, crucial for a reliable supply chain analytics platform.

**Action:** We will use SQL commands executed within a Databricks Notebook. Create a new notebook in your Databricks workspace (e.g., `notebooks/01_environment_setup.ipynb`).

```sql
-- notebooks/01_environment_setup.ipynb
-- Language: SQL (or Python via spark.sql())

-- Set the default catalog for the session (optional, but good for context)
USE CATALOG hive_metastore; -- Temporarily use hive_metastore to create the new catalog

-- 1. Create the main project catalog
-- This will be the root container for all our supply chain data.
-- Replace 'your_cloud_storage_bucket_path' with an actual cloud storage path (e.g., s3://my-supply-chain-data-lake/, abfss://container@storageaccount.dfs.core.windows.net/)
CREATE CATALOG IF NOT EXISTS supply_chain_analytics_catalog
MANAGED LOCATION 'your_cloud_storage_bucket_path/unity_catalog/supply_chain_analytics_catalog';

-- Switch to our new catalog
USE CATALOG supply_chain_analytics_catalog;

-- 2. Create schemas (databases) based on Medallion Architecture and functional areas
-- 'raw' for unvalidated, raw ingested data
CREATE SCHEMA IF NOT EXISTS raw
COMMENT 'Raw, immutable data ingested directly from source systems.';

-- 'bronze' for cleaned, de-duplicated, schema-enforced raw data
CREATE SCHEMA IF NOT EXISTS bronze
COMMENT 'Cleaned, de-duplicated, and schema-enforced data from the raw layer.';

-- 'silver' for enriched, transformed, and joined data
CREATE SCHEMA IF NOT EXISTS silver
COMMENT 'Enriched, transformed, and joined data ready for detailed analysis.';

-- 'gold' for highly aggregated, business-ready data
CREATE SCHEMA IF NOT EXISTS gold
COMMENT 'Highly aggregated, business-ready data for reporting and dashboards.';

-- 'config' for lookup tables and pipeline configurations
CREATE SCHEMA IF NOT EXISTS config
COMMENT 'Configuration and lookup tables for various pipelines.';

-- 'audit' for logging, data quality metrics, and operational insights
CREATE SCHEMA IF NOT EXISTS audit
COMMENT 'Audit logs, data quality metrics, and operational insights.';

-- 3. Grant initial permissions to the Service Principal and developers
-- Replace 'your_service_principal_id' with the Application ID of your Service Principal
-- Replace 'your_developer_group' with your actual Databricks group for developers
-- We grant USAGE on the catalog and CREATE SCHEMA on the catalog to allow SP to create tables in schemas
-- and SELECT on all schemas for general data access. Adjust as per least privilege principle.

-- Grant usage on the catalog to allow access to its schemas and tables
GRANT USAGE ON CATALOG supply_chain_analytics_catalog TO `your_service_principal_id`;
GRANT USAGE ON CATALOG supply_chain_analytics_catalog TO `your_developer_group`;

-- Grant create schema on the catalog to allow creating new schemas (if needed by SP)
GRANT CREATE SCHEMA ON CATALOG supply_chain_analytics_catalog TO `your_service_principal_id`;

-- Grant create table on specific schemas where SP will write data (e.g., bronze, raw)
GRANT CREATE TABLE ON SCHEMA supply_chain_analytics_catalog.raw TO `your_service_principal_id`;
GRANT CREATE TABLE ON SCHEMA supply_chain_analytics_catalog.bronze TO `your_service_principal_id`;
GRANT CREATE TABLE ON SCHEMA supply_chain_analytics_catalog.silver TO `your_service_principal_id`;
GRANT CREATE TABLE ON SCHEMA supply_chain_analytics_catalog.gold TO `your_service_principal_id`;
GRANT CREATE TABLE ON SCHEMA supply_chain_analytics_catalog.config TO `your_service_principal_id`;
GRANT CREATE TABLE ON SCHEMA supply_chain_analytics_catalog.audit TO `your_service_principal_id`;


-- Grant SELECT on all schemas to developers for reading data
GRANT SELECT ON SCHEMA supply_chain_analytics_catalog.raw TO `your_developer_group`;
GRANT SELECT ON SCHEMA supply_chain_analytics_catalog.bronze TO `your_developer_group`;
GRANT SELECT ON SCHEMA supply_chain_analytics_catalog.silver TO `your_developer_group`;
GRANT SELECT ON SCHEMA supply_chain_analytics_catalog.gold TO `your_developer_group`;
GRANT SELECT ON SCHEMA supply_chain_analytics_catalog.config TO `your_developer_group`;
GRANT SELECT ON SCHEMA supply_chain_analytics_catalog.audit TO `your_developer_group`;

-- Example of granting ALL PRIVILEGES (use with caution; adhere to least privilege)
-- GRANT ALL PRIVILEGES ON CATALOG supply_chain_analytics_catalog TO `your_service_principal_id`;
-- GRANT ALL PRIVILEGES ON SCHEMA supply_chain_analytics_catalog.raw TO `your_service_principal_id`;
```

**Explanation:**

*   We first create a top-level catalog, `supply_chain_analytics_catalog`, and assign it a `MANAGED LOCATION` in your cloud storage. This is where Unity Catalog stores the managed tables for this catalog.
*   We then define several schemas within the catalog, aligning with the Medallion Architecture and specific functional needs (`config`, `audit`).
*   Finally, we grant permissions. The `USAGE` privilege is required for any principal to interact with objects in a catalog or schema; `CREATE SCHEMA` lets the service principal create new schemas if needed, and `CREATE TABLE` lets it write data into specific schemas. For developers, `SELECT` on schemas is often sufficient. Always follow the principle of least privilege.
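The six near-identical `GRANT` blocks above invite a small helper. As a sketch (catalog and schema names come from this chapter; executing the statements with `spark.sql` is assumed to happen inside a Databricks notebook, and the function name is ours), the statements can be generated programmatically:

```python
CATALOG = "supply_chain_analytics_catalog"
SCHEMAS = ["raw", "bronze", "silver", "gold", "config", "audit"]


def grant_statements(principal: str, privilege: str) -> list:
    """Generate one GRANT statement per schema for a principal/privilege pair."""
    return [
        f"GRANT {privilege} ON SCHEMA {CATALOG}.{schema} TO `{principal}`;"
        for schema in SCHEMAS
    ]


# Example: the CREATE TABLE grants for the service principal.
for stmt in grant_statements("your_service_principal_id", "CREATE TABLE"):
    print(stmt)
```

In a notebook you would run each statement with `spark.sql(stmt)` (where `spark` is the session Databricks provides); keeping the schema list in one place also makes it harder for a new schema to silently miss its grants.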
##### Step 1.4: Implement a Cluster Policy

**Why:** Cluster policies are vital in production to enforce best practices, control costs, and ensure security. They standardize cluster configurations, preventing users from provisioning overly expensive or insecure clusters. For DLT pipelines and Structured Streaming jobs, consistent cluster configurations are critical for predictable performance and cost management.

**Action:**

1.  In your Databricks Workspace, navigate to **Admin Settings > Cluster Policies**.
2.  Click **Create Cluster Policy**.
3.  Name the policy `SupplyChainAnalytics_Job_Policy` and add the following JSON definition. This policy is designed for automated jobs and DLT pipelines, enforcing specific instance types, autoscaling limits, and Unity Catalog compatibility.
```json
{
  "cluster_type": {
    "type": "fixed",
    "value": "job"
  },
  "instance_pool_id": {
    "type": "forbidden"
  },
  "spark_version": {
    "type": "regex",
    "pattern": "11\\.[0-9]+\\.x-scala[0-9]+\\.[0-9]+|12\\.[0-9]+\\.x-scala[0-9]+\\.[0-9]+|13\\.[0-9]+\\.x-scala[0-9]+\\.[0-9]+|14\\.[0-9]+\\.x-scala[0-9]+\\.[0-9]+",
    "defaultValue": "14.3.x-scala2.12"
  },
  "node_type_id": {
    "type": "allowlist",
    "values": [
      "Standard_DS3_v2",
      "Standard_DS4_v2",
      "Standard_E4s_v3",
      "Standard_E8s_v3",
      "i3.xlarge",
      "i3.2xlarge",
      "m5.xlarge",
      "m5.2xlarge",
      "e2-standard-4",
      "e2-standard-8"
    ],
    "defaultValue": "Standard_E4s_v3"
  },
  "driver_node_type_id": {
    "type": "allowlist",
    "values": [
      "Standard_DS3_v2",
      "Standard_DS4_v2",
      "Standard_E4s_v3",
      "Standard_E8s_v3",
      "i3.xlarge",
      "i3.2xlarge",
      "m5.xlarge",
      "m5.2xlarge",
      "e2-standard-4",
      "e2-standard-8"
    ],
    "defaultValue": "Standard_E4s_v3"
  },
  "autoscale": {
    "type": "unlimited",
    "defaultValue": {
      "min_workers": 1,
      "max_workers": 5
    }
  },
  "autotermination_minutes": {
    "type": "forbidden"
  },
  "custom_tags.project": {
    "type": "fixed",
    "value": "supply_chain_analytics",
    "hidden": true
  },
  "data_security_mode": {
    "type": "fixed",
    "value": "USER_ISOLATION",
    "hidden": true
  },
  "single_user_name": {
    "type": "forbidden"
  },
  "spark_conf.spark.databricks.cluster.profile": {
    "type": "fixed",
    "value": "serverless",
    "hidden": true
  },
  "spark_conf.spark.databricks.delta.properties.defaults.autoOptimize.optimizeWrite": {
    "type": "fixed",
    "value": "true",
    "hidden": true
  },
  "spark_conf.spark.databricks.delta.properties.defaults.autoOptimize.autoCompact": {
    "type": "fixed",
    "value": "true",
    "hidden": true
  },
  "num_workers": {
    "type": "forbidden"
  }
}
```

**Explanation:**

*   `cluster_type: "job"`: ensures this policy is used only for automated jobs, not interactive development.
*   `spark_version`: allows modern Spark runtimes compatible with Unity Catalog (e.g., 14.3.x).
*   `node_type_id`, `driver_node_type_id`: restrict the available instance types to cost-effective, performant options. Adjust these for your cloud provider and region.
*   `autoscale`: enforces autoscaling with a min/max worker range, preventing over-provisioning.
*   `autotermination_minutes`: forbidden for job clusters, as jobs manage their own lifecycle. For interactive clusters, this would be set to a low value (e.g., 60 minutes).
*   `custom_tags.project`: automatically tags clusters with `project: supply_chain_analytics` for cost allocation and monitoring.
*   `data_security_mode: "USER_ISOLATION"`: crucial for Unity Catalog; it ensures clusters run in a secure access mode that enforces Unity Catalog permissions.
*   `spark_conf.spark.databricks.cluster.profile: "serverless"`: where supported, this leverages serverless compute, abstracting cluster management and further optimizing costs.
*   `spark_conf.spark.databricks.delta.properties.defaults.autoOptimize.optimizeWrite` and `autoCompact`: Delta Lake best practices applied automatically, improving performance and reducing small-file overhead.
*   **Permissions:** after creating the policy, grant the Service Principal and the developer group permission to use it. This ensures they can only create clusters that comply with the policy.
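To make the rule semantics concrete, here is a simplified, illustrative validator for the four rule types used above (`fixed`, `allowlist`, `regex`, `forbidden`). This is not Databricks' actual enforcement engine — real policies also apply defaults, support more rule types, and merge configurations — just a sketch of the checking logic:

```python
import re


def violates_policy(policy: dict, config: dict) -> list:
    """Return a list of violation messages for `config` against a
    simplified cluster-policy dict (fixed / allowlist / regex / forbidden)."""
    errors = []
    for attr, rule in policy.items():
        kind, value = rule["type"], config.get(attr)
        if kind == "forbidden":
            if attr in config:
                errors.append(f"{attr}: attribute is forbidden")
        elif kind == "fixed" and value != rule["value"]:
            errors.append(f"{attr}: must equal {rule['value']!r}")
        elif kind == "allowlist" and value not in rule["values"]:
            errors.append(f"{attr}: {value!r} not in allowlist")
        elif kind == "regex" and not re.fullmatch(rule["pattern"], value or ""):
            errors.append(f"{attr}: {value!r} does not match pattern")
    return errors


# A trimmed-down version of the policy above, plus two candidate configs.
policy = {
    "node_type_id": {"type": "allowlist", "values": ["Standard_E4s_v3", "i3.xlarge"]},
    "spark_version": {"type": "regex", "pattern": r"14\.[0-9]+\.x-scala[0-9.]+"},
    "num_workers": {"type": "forbidden"},
}
ok = {"node_type_id": "i3.xlarge", "spark_version": "14.3.x-scala2.12"}
bad = {"node_type_id": "m5.24xlarge", "spark_version": "14.3.x-scala2.12", "num_workers": 100}

print(violates_policy(policy, ok))   # no violations
print(violates_policy(policy, bad))  # allowlist and forbidden violations
```

The same shape of check is useful in CI: lint job definitions against the policy before deployment, so a non-compliant cluster spec fails the pipeline instead of failing at provision time.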
##### Step 1.5: Git Integration

**Why:** Integrating with a Git repository (GitHub, GitLab, Azure DevOps Repos, etc.) is a fundamental software engineering best practice. It enables version control, collaborative development, and code review workflows, and serves as the source of truth for your Databricks notebooks and Python modules, which is essential for CI/CD.

**Action:**

1.  **Create a new Git repository:** On your preferred Git provider (e.g., GitHub), create a new private repository named `databricks-supply-chain-analytics`.
2.  **Initialize the local repository:**

    ```bash
    mkdir databricks-supply-chain-analytics
    cd databricks-supply-chain-analytics
    git init
    # Create an initial README.md
    echo "# Databricks Supply Chain Analytics" > README.md
    git add .
    git commit -m "Initial project setup"
    git branch -M main
    git remote add origin https://github.com/<your-username>/databricks-supply-chain-analytics.git
    git push -u origin main
    ```

3.  **Connect the Databricks workspace to Git:**
    *   In your Databricks Workspace, navigate to **User Settings > Git integration**.
    *   Select your Git provider (e.g., GitHub).
    *   Provide your Git username/email.
    *   Generate a Personal Access Token (PAT) from your Git provider (e.g., GitHub > Settings > Developer settings > Personal access tokens). Ensure it has the `repo` scope.
    *   Paste the PAT into Databricks and click **Save**.
4.  **Clone the repository in Databricks:**
    *   In the Databricks Workspace sidebar, click **Workspace > Repos**.
    *   Click **Add Repo**.
    *   Select **Clone remote Git repository**.
    *   Enter the Git URL (`https://github.com/<your-username>/databricks-supply-chain-analytics.git`).
    *   Set the "Repo name" to `databricks-supply-chain-analytics`.
    *   Click **Create Repo**.

Now you will see the `databricks-supply-chain-analytics` folder under **Repos** in your workspace. You can create notebooks and files directly within this repo, and commit and push changes to your Git provider.

#### c) Testing This Component

It’s crucial to verify each setup step before moving forward to ensure a stable foundation.

##### Verify Unity Catalog Setup

1.  Open a new Databricks notebook.
2.  Run the following SQL commands:

    ```sql
    SHOW CATALOGS;
    USE CATALOG supply_chain_analytics_catalog;
    SHOW SCHEMAS;
    ```

3.  **Expected outcome:** `supply_chain_analytics_catalog` appears in the `SHOW CATALOGS` output, and after switching to the new catalog, `SHOW SCHEMAS` lists `raw`, `bronze`, `silver`, `gold`, `config`, and `audit`.
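That expected outcome can also be checked programmatically in the same notebook. A minimal sketch, assuming the schema names have already been collected from `SHOW SCHEMAS` (the exact result column name varies by runtime, so the collection step below is shown only as a comment):

```python
EXPECTED_SCHEMAS = {"raw", "bronze", "silver", "gold", "config", "audit"}


def missing_schemas(found) -> set:
    """Return the expected schemas absent from a SHOW SCHEMAS result."""
    return EXPECTED_SCHEMAS - {s.lower() for s in found}


# In a Databricks notebook the list would come from something like:
#   found = [row[0] for row in spark.sql("SHOW SCHEMAS").collect()]
# Here we simulate a result for illustration.
found = ["default", "raw", "bronze", "silver", "gold", "config", "audit"]
assert not missing_schemas(found), f"Missing schemas: {missing_schemas(found)}"
```

Turning the manual check into an assertion like this lets the environment-setup notebook fail loudly in CI if a schema was dropped or never created.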
##### Verify the Service Principal and Permissions

1.  **Manual check:** As an admin, confirm the Service Principal is listed under **Admin Settings > Identity and access > Service principals**.
2.  **Permission check** (via a temporary notebook or API call):
    *   Use the Databricks CLI or SDK with the Service Principal's token to perform an action (e.g., list catalogs). This verifies the token and basic permissions.
    *   Alternatively, create a temporary Databricks Job, configure it to run as the Service Principal, and execute a simple SQL command like `SHOW CATALOGS;`. If it succeeds, the Service Principal has basic access.
##### Verify the Cluster Policy

1.  Navigate to **Compute > Create Cluster**.
2.  In the **Policy** dropdown, you should see `SupplyChainAnalytics_Job_Policy`.
3.  Select the policy. Notice that many configuration options (instance types, Spark version, etc.) are now restricted or pre-filled according to your policy.
4.  Try to create a cluster that violates the policy (e.g., by selecting an instance type not in the allowlist); the UI should prevent it.
5.  **Expected outcome:** the policy is selectable and restricts cluster configuration options as defined.
##### Verify Git Integration

1.  In your Databricks Repo (`databricks-supply-chain-analytics`), create a new notebook (e.g., `test_git.ipynb`).
2.  Add some content to it.
3.  Click the Git icon in the notebook's top-right corner, or the **Git** button in the Repos sidebar.
4.  **Commit and push:** stage your changes, write a commit message, and push to the remote repository.
5.  **Verify on the Git provider:** check your GitHub (or other provider) repository; `test_git.ipynb` should be there.
6.  **Pull (optional):** make a change directly on GitHub (e.g., edit `README.md`), then pull those changes from Databricks to verify bi-directional sync.
7.  **Expected outcome:** you can seamlessly commit and push changes from Databricks to your Git repository, and pull changes back.

### Production Considerations

Building a production-ready environment requires careful thought about scalability, security, and maintainability from day one.

#### Security Considerations

*   **Unity Catalog access controls:** Always adhere to the principle of least privilege. Grant only the necessary permissions to users and service principals on specific catalogs, schemas, and tables, and regularly review and audit these permissions.
*   **Secret management:** **Never** hardcode API tokens, database credentials, or other sensitive information. Use Databricks Secrets (backed by Azure Key Vault, AWS Secrets Manager, or GCP Secret Manager) to store and retrieve credentials securely.
*   **Network isolation:** For highly sensitive data, configure Databricks workspaces with Azure Private Link or AWS PrivateLink so that all traffic between your VNet/VPC and the Databricks backend infrastructure traverses a private network.
*   **Audit logging:** Databricks audit logs, integrated with Unity Catalog, provide a comprehensive record of actions performed on your data. Forward them to your SIEM (Security Information and Event Management) system for monitoring and anomaly detection.

#### Performance Optimization

*   **Cluster policies:** As implemented, cluster policies ensure jobs run on appropriately sized and configured clusters, preventing performance bottlenecks from undersized clusters and cost overruns from oversized ones.
*   **Serverless DLT:** For Delta Live Tables, leverage serverless compute where available. It automatically scales resources up and down, optimizing performance and cost without manual cluster management; the cluster policy's `spark_conf.spark.databricks.cluster.profile: "serverless"` setting encourages this.
*   **Delta Lake optimizations:** The cluster policy enables `autoOptimize.optimizeWrite` and `autoOptimize.autoCompact`, which are crucial for Delta Lake performance, reducing small-file issues and improving query speed.

#### Logging and Monitoring

*   **Databricks audit logs:** Monitor administrative and data-access events.
*   **Unity Catalog lineage:** Automatically tracks data lineage, showing how data transforms from source to destination — vital for debugging and compliance.
*   **DLT event logs:** Delta Live Tables provides detailed event logs that can be queried to monitor pipeline health, data quality, and performance.
*   **Structured Streaming metrics:** Spark Structured Streaming exposes metrics that can be integrated with monitoring tools (e.g., Prometheus, Grafana) to observe throughput, latency, and resource utilization.

### Code Review Checkpoint

At this point, you have successfully set up the core components of your Databricks Lakehouse environment.

**Summary of accomplishments:**

*   Confirmed Unity Catalog is enabled and assigned to your workspace.
*   Created a Service Principal and securely obtained an API token for automation.
*   Established a logical data hierarchy within Unity Catalog using catalogs and schemas (`raw`, `bronze`, `silver`, `gold`, `config`, `audit`).
*   Defined initial access controls for your Service Principal and developer group.
*   Implemented a robust cluster policy (`SupplyChainAnalytics_Job_Policy`) to standardize job cluster configurations, optimize costs, and enforce security (Unity Catalog compatibility, autoscaling, auto-optimization).
*   Integrated your Databricks workspace with a Git repository for version control.

**Files created/modified:**

*   `notebooks/01_environment_setup.ipynb`: contains the SQL commands for the Unity Catalog setup.
*   `deploy/bundle.yml`: conceptual for now — not explicitly created in this chapter, but the eventual home for the policy definition in a production setup (e.g., as a Databricks Asset Bundle).
*   Your Git repository (e.g., `databricks-supply-chain-analytics`) is initialized and connected.

This robust foundation adheres to modern data engineering best practices and prepares us for building the real-time data pipelines.

### Common Issues & Solutions

1.  **Issue:** `CATALOG_NOT_FOUND` or `SCHEMA_NOT_FOUND` errors when creating schemas or tables.
    *   **Cause:** The Unity Catalog metastore is not correctly assigned to your workspace, or you are creating objects in a catalog/schema that doesn't exist or isn't in scope.
    *   **Solution:**
        *   Verify the metastore assignment in the Databricks Account Console.
        *   Ensure you ran `USE CATALOG <your_catalog>;` before creating schemas or tables within it.
        *   Double-check the spelling of catalog and schema names.
2.  **Issue:** `PERMISSION_DENIED` when a Service Principal or user tries to create a table or run a job.
    *   **Cause:** The principal lacks the necessary `CREATE TABLE`, `USAGE`, or `SELECT` privileges on the target catalog or schema.
    *   **Solution:**
        *   Review the `GRANT` statements in `notebooks/01_environment_setup.ipynb`.
        *   In Catalog Explorer, open the **Permissions** tab and verify the grants for the specific principal.
        *   Ensure the Service Principal is correctly added to the Databricks workspace.
3.  **Issue:** Cannot create a cluster, or cluster creation fails with "Policy violation" errors.
    *   **Cause:** The cluster configuration (instance type, Spark version, etc.) does not comply with the active cluster policy, or the user/Service Principal lacks permission to use the intended policy.
    *   **Solution:**
        *   Review the JSON definition of `SupplyChainAnalytics_Job_Policy` against your requirements and the resource types available in your cloud region.
        *   Verify that the Service Principal or user has been granted permission to use `SupplyChainAnalytics_Job_Policy`.
        *   If testing manually, explicitly select `SupplyChainAnalytics_Job_Policy` when creating a new cluster.

### Testing & Verification

To confirm that our Databricks Lakehouse environment is correctly set up and ready for development:

1.  **Unity Catalog structure:** Run `SHOW CATALOGS;` and `USE CATALOG supply_chain_analytics_catalog; SHOW SCHEMAS;` in a Databricks notebook. All defined catalogs and schemas should be visible.
2.  **Service Principal functionality:** Run a simple Databricks Job configured to use your Service Principal's identity. It should be able to list catalogs or perform a basic data operation without permission errors.
3.  **Cluster policy enforcement:**
    *   Create a new job cluster using `SupplyChainAnalytics_Job_Policy`; it should provision successfully within the defined constraints.
    *   With the policy selected, attempt to configure an instance type it disallows; cluster creation should fail with a policy violation error. (Note that a cluster created without any policy is not subject to these constraints.)
4.  **Git integration:**
    *   Create a new notebook or modify an existing file within your `databricks-supply-chain-analytics` repo in Databricks.
    *   Commit and push these changes to your remote Git repository.
    *   Confirm the changes appear on your Git provider's website.

Everything should now be correctly configured, providing a secure, governed, and efficient environment for our real-time supply chain analytics project.

### Summary & Next Steps

In this foundational chapter, we meticulously set up our Databricks Lakehouse environment, focusing on best practices for security, governance, and operational efficiency. We enabled Unity Catalog, defined a structured data hierarchy, implemented a robust cluster policy, and integrated with Git for seamless development workflows. This robust setup is the bedrock upon which all our real-time supply chain data pipelines will be built, ensuring data integrity, security, and cost-effectiveness.

With our environment ready, we can now turn our attention to the core data processing tasks. In the next chapter, we will dive into Chapter 2: Ingesting Raw Supply Chain Event Data with Delta Live Tables. We’ll focus on bringing real-time event data from various sources into our raw and bronze layers using Databricks Delta Live Tables, establishing the initial flow of information into our Lakehouse.