Welcome back, future data wizard! In the previous chapters, you’ve taken your first steps into the Databricks world, understanding its core components like workspaces and clusters. You’ve even run some basic commands, which is fantastic! Now that your Databricks environment is purring like a happy kitten, it’s time for a crucial next step: getting data into it.

This chapter is all about data ingestion. Think of it as opening the doors to your Databricks data factory and letting the raw materials pour in. We’ll explore various ways to load data, from simple files to more robust, production-ready methods. By the end, you’ll not only know how to ingest data but also why certain methods are preferred for different scenarios, setting you up for success in handling real-world datasets.

What is Data Ingestion and Why Does It Matter?

Data ingestion is simply the process of bringing data from various sources into a storage system where it can be processed, analyzed, and transformed. In the context of Databricks, this means loading data into your Lakehouse environment, typically into Delta Lake tables.

Why is this so important? Well, without data, your powerful Databricks clusters and Spark engines are like race cars with no fuel! All the fancy analytics, machine learning, and reporting you want to do depend entirely on having the right data available, in the right format, and in an efficient manner. Mastering ingestion is the foundational step for any data project.

Core Concepts of Data Ingestion in Databricks

Before we dive into code, let’s understand the landscape of data ingestion within Databricks.

Where Does Data Live? The Databricks File System (DBFS)

When you work with Databricks, your data often resides in cloud object storage (like AWS S3, Azure Data Lake Storage Gen2, or Google Cloud Storage). Databricks provides an abstraction layer over this storage called the Databricks File System (DBFS).

DBFS acts like a local file system, making it easy to interact with your cloud storage using familiar file path syntax (e.g., /FileStore/tables/my_data.csv). It’s a great place to stage files before processing them, or to store intermediate results. For simpler use cases and initial exploration, DBFS is very handy. For production environments, Databricks recommends using Unity Catalog with external locations for better governance and security, but for our learning journey, DBFS provides an excellent starting point.

Common Data Formats

Data comes in many shapes and sizes. Here are some common formats you’ll encounter:

  • CSV (Comma Separated Values): Simple, human-readable, widely used. Each line is a data record, with fields separated by commas.
  • JSON (JavaScript Object Notation): Flexible, hierarchical, often used for semi-structured data from web APIs.
  • Parquet: A columnar storage format optimized for analytical queries. It’s highly efficient for large datasets and is often the preferred format in a data lakehouse.
  • Delta Lake: This isn’t just a format; it’s an open-source storage layer that brings ACID transactions, schema enforcement, and other data warehousing capabilities to data lakes, typically built on top of Parquet files. It’s a cornerstone of the Databricks Lakehouse Platform.
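
To make the row-versus-record distinction concrete, here is a small pure-Python sketch (standard library only, no Spark required) that writes the same two records as CSV and as newline-delimited JSON — the latter is the line-per-object layout that Spark’s JSON reader expects by default:

```python
import csv
import io
import json

records = [
    {"id": 1, "name": "Alice"},
    {"id": 2, "name": "Bob"},
]

# CSV: one header line, then positional fields -- column names appear exactly once.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "name"], lineterminator="\n")
writer.writeheader()
writer.writerows(records)
csv_text = buf.getvalue()
print(csv_text)
# id,name
# 1,Alice
# 2,Bob

# JSON Lines: one self-describing object per line -- field names repeat on
# every record, which costs space but tolerates missing or extra fields.
json_text = "\n".join(json.dumps(r) for r in records)
print(json_text)
```

This is why CSV needs the "first row is a header" hint while JSON does not: in CSV the column names live in one special row, whereas every JSON record carries its own.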

Key Ingestion Methods in Databricks

Databricks offers powerful tools for ingestion. We’ll focus on these three for now:

  1. Spark read API: Your go-to for reading data from various file formats and sources programmatically. It’s highly flexible and forms the backbone of many data loading operations.
  2. COPY INTO Command: A modern, robust, and idempotent SQL command specifically designed for efficiently loading data from cloud storage into Delta Lake tables. It’s fantastic for production pipelines.
  3. DBFS Utilities (dbutils.fs): Useful for basic file system operations like listing files, creating directories, or even “uploading” small files directly from your notebook.

Step-by-Step Implementation: Loading Our First Data

Let’s get our hands dirty! We’ll start by exploring some sample data that Databricks provides, then simulate uploading our own file, and finally use the powerful COPY INTO command.

First, ensure you have a cluster running in your Databricks workspace. If not, refer to Chapter 3 for a quick refresher!

Step 1: Exploring Built-in Databricks Sample Datasets

Databricks comes with many sample datasets that are perfect for learning. Let’s read a simple CSV file from these samples.

Open a new notebook in your Databricks workspace.

First, we’ll use the Spark read API to load a CSV file. We’ll specify the format and some options.

# Step 1.1: Read a CSV file using Spark's DataFrameReader
print("--- Reading a sample CSV dataset ---")

# Define the path to a sample CSV file provided by Databricks
# This path is typically available across all Databricks environments.
csv_file_path = "/databricks-datasets/COVID/time_series_covid19_confirmed_global.csv"

# Use spark.read to load the CSV file
# .format("csv") specifies the file type
# .option("header", "true") tells Spark that the first row is a header
# .option("inferSchema", "true") asks Spark to guess the data types for each column
df_covid_confirmed = (spark.read
                      .format("csv")
                      .option("header", "true")
                      .option("inferSchema", "true")
                      .load(csv_file_path))

# Let's see the first few rows and the schema of our DataFrame
print("\n--- Displaying the first 5 rows of the DataFrame ---")
display(df_covid_confirmed.limit(5))

print("\n--- Displaying the schema of the DataFrame ---")
df_covid_confirmed.printSchema()

What just happened?

  • csv_file_path: We defined a variable pointing to a public CSV file within the Databricks environment. Databricks makes many such datasets available under /databricks-datasets/.
  • spark.read: This is your entry point to Spark’s DataFrameReader API. It’s how you tell Spark you want to read data.
  • .format("csv"): We’re explicitly telling Spark that the file is in CSV format. Spark supports many formats like JSON, Parquet, ORC, JDBC, etc.
  • .option("header", "true"): This is crucial! It tells Spark that the first line of our CSV file contains the column names, not actual data. Without this, Spark would treat your column headers as data.
  • .option("inferSchema", "true"): This is a convenience option for exploration. Spark will read a sample of the data and try to automatically determine the data type for each column (e.g., string, integer, double). While useful for quick analysis, for production, it’s generally better to define your schema explicitly to avoid unexpected type mismatches.
  • .load(csv_file_path): This command executes the read operation, loading the data into a Spark DataFrame named df_covid_confirmed.
  • display(df_covid_confirmed.limit(5)): The display() function is a Databricks-specific command that provides a rich, interactive table view of your DataFrame in the notebook. .limit(5) just shows the first 5 rows.
  • df_covid_confirmed.printSchema(): This shows you the inferred schema (column names and their data types) of your DataFrame.
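
One concrete reason to be wary of inferSchema: columns that look numeric but are really identifiers. Here is a tiny pure-Python illustration (no Spark needed) of what happens when a ZIP-code-like field gets “helpfully” inferred as an integer:

```python
# If an inference pass decides "02134" is an integer, the value that survives
# is 2134 -- the leading zero is gone for good. The same data declared as a
# string column round-trips unchanged. This is why identifier-like columns
# (ZIP codes, account numbers, phone numbers) belong in an explicit schema
# as strings, even when every character happens to be a digit.
raw_value = "02134"                      # a Boston ZIP code as it appears in the CSV

inferred_as_int = str(int(raw_value))    # what survives integer inference
kept_as_string = raw_value               # what an explicit STRING column preserves

print(inferred_as_int)   # 2134  -- leading zero lost
print(kept_as_string)    # 02134 -- intact
```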

You’ve successfully ingested your first dataset! How cool is that?

Step 2: Simulating a Custom File Upload to DBFS

Now, let’s imagine you have your own data file. In a real scenario, you’d upload this to your cloud storage (e.g., Azure Data Lake Storage, AWS S3) and then mount it or access it via Unity Catalog external locations. For our learning purposes, we can simulate placing a file directly into DBFS using dbutils.fs.put().

# Step 2.1: Simulate creating a custom CSV file in DBFS
print("\n--- Simulating a custom CSV file creation in DBFS ---")

# Define a path in DBFS where we'll "upload" our file
custom_file_path = "/FileStore/my_custom_data/customers.csv"

# Define the content of our simulated CSV file
csv_content = """CustomerID,Name,City,Age
1,Alice,New York,30
2,Bob,London,24
3,Charlie,Paris,35
4,David,Berlin,29
"""

# Use dbutils.fs.put to write this content to DBFS
# The 'True' argument means overwrite if the file already exists
dbutils.fs.put(custom_file_path, csv_content, True)

print(f"Simulated file created at: {custom_file_path}")

# Step 2.2: Verify the file exists (optional, but good practice!)
print("\n--- Listing files in the directory to verify ---")
display(dbutils.fs.ls("/FileStore/my_custom_data/"))

Explanation:

  • custom_file_path: We chose a path within /FileStore/ which is a common location for user-specific files in DBFS.
  • csv_content: This is a multi-line string representing the data we want to put into our CSV file.
  • dbutils.fs.put(custom_file_path, csv_content, True): This command from dbutils.fs (Databricks Utilities File System) writes the csv_content to the specified custom_file_path. The True flag ensures that if you run this cell multiple times, it will overwrite the existing file.
  • dbutils.fs.ls("/FileStore/my_custom_data/"): This command lists the contents of a directory in DBFS, allowing us to verify our file was created.
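
If you want a feel for the overwrite flag’s semantics outside Databricks, here is a local pure-Python stand-in for dbutils.fs.put (a sketch only — it writes to a temp directory instead of DBFS, and the real utility raises its own exception type when the file exists and overwrite is false):

```python
import tempfile
from pathlib import Path

def put(path: Path, contents: str, overwrite: bool = False) -> None:
    """Local stand-in for dbutils.fs.put: write text, refusing to clobber
    an existing file unless overwrite=True."""
    if path.exists() and not overwrite:
        raise FileExistsError(f"{path} exists and overwrite is False")
    path.parent.mkdir(parents=True, exist_ok=True)  # create parent dirs as needed
    path.write_text(contents)

tmp = Path(tempfile.mkdtemp())
target = tmp / "my_custom_data" / "customers.csv"

put(target, "CustomerID,Name\n1,Alice\n")          # first write succeeds
try:
    put(target, "CustomerID,Name\n2,Bob\n")        # re-run without overwrite fails
    clobbered = True
except FileExistsError:
    clobbered = False

put(target, "CustomerID,Name\n2,Bob\n", True)      # overwrite=True replaces the file
print(clobbered, target.read_text())
```

This mirrors why we pass True as the third argument above: it makes the notebook cell safe to re-run.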

Now that our custom file is “uploaded”, let’s read it using the Spark read API, just like we did with the sample dataset.

# Step 2.3: Read our custom CSV file from DBFS
print("\n--- Reading our custom CSV file from DBFS ---")

df_customers = (spark.read
                .format("csv")
                .option("header", "true")
                .option("inferSchema", "true")
                .load(custom_file_path))

# Display the data
print("\n--- Displaying our custom customer data ---")
display(df_customers)

# Print the schema
print("\n--- Displaying the schema of our custom customer data ---")
df_customers.printSchema()

Fantastic! You’ve now successfully ingested a custom file that you “placed” into DBFS. This workflow is very common for initial data exploration or small, ad-hoc loads.

Step 3: Using COPY INTO for Robust Delta Lake Ingestion

While spark.read is excellent for many tasks, when you want to load data into a Delta Lake table in a production-ready, fault-tolerant, and idempotent way, COPY INTO is your best friend. It handles re-runs gracefully, prevents duplicate data, and integrates cleanly with Delta Lake’s features.
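
The idempotency guarantee is easier to trust once you see the mechanism. COPY INTO remembers which source files it has already loaded into the target table and skips them on re-runs. Below is a deliberately simplified pure-Python model of that bookkeeping — not Databricks code, just an illustration, with a force flag mirroring COPY INTO’s COPY_OPTIONS ('force' = 'true'):

```python
def copy_into(target: list, loaded_files: set, source_files: dict, force: bool = False) -> int:
    """Toy model of COPY INTO's file-level idempotency.

    `loaded_files` plays the role of the load history kept alongside the
    target table. Returns the number of files actually processed.
    """
    processed = 0
    for path, rows in source_files.items():
        if path in loaded_files and not force:
            continue  # already ingested -- skip, so re-runs are safe
        target.extend(rows)
        loaded_files.add(path)
        processed += 1
    return processed


table, history = [], set()
source = {"/landing/products_batch_1.csv": [("Laptop", 1200.0), ("Mouse", 25.5)]}

first = copy_into(table, history, source)    # loads the file: 1 file, 2 rows
second = copy_into(table, history, source)   # re-run: file skipped, nothing added
print(first, second, len(table))             # 1 0 2
```

The real command keeps this history durably, so even a pipeline restart days later still skips already-loaded files.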

First, let’s create a new folder for our COPY INTO source and place another simulated CSV file there.

# Step 3.1: Prepare a new source file for COPY INTO
print("\n--- Preparing a new source file for COPY INTO ---")

copy_into_source_path = "/FileStore/copy_into_sources/products.csv"
product_csv_content = """ProductID,Name,Category,Price
101,Laptop,Electronics,1200.00
102,Mouse,Electronics,25.50
103,Keyboard,Electronics,75.00
104,Monitor,Electronics,300.00
"""

dbutils.fs.put(copy_into_source_path, product_csv_content, True)
print(f"Source file for COPY INTO created at: {copy_into_source_path}")

Now, let’s use the COPY INTO command. This is a SQL command, so we’ll use the %sql magic command in our notebook.

-- Step 3.2: Create a Delta Lake table and use COPY INTO
-- First, drop the table if it exists to ensure a clean run
DROP TABLE IF EXISTS products_delta;

-- Create an empty Delta Lake table where our data will land
-- We explicitly define the schema here, which is a best practice for production
CREATE TABLE products_delta (
  ProductID INT,
  Name STRING,
  Category STRING,
  Price DECIMAL(10, 2)
)
USING DELTA
LOCATION '/FileStore/delta_tables/products'; -- Specify where the Delta table data files will be stored

-- Now, use COPY INTO to load data from our CSV source into the Delta table
COPY INTO products_delta
FROM '/FileStore/copy_into_sources/products.csv'
FILEFORMAT = CSV
FORMAT_OPTIONS (
  'header' = 'true',
  'inferSchema' = 'false' -- We already defined the schema in CREATE TABLE
)
COPY_OPTIONS (
  'mergeSchema' = 'false', -- Do not allow schema evolution for this initial load
  'force' = 'false' -- Skip files that were already loaded (the default), making re-runs idempotent
);

-- Step 3.3: Query the newly ingested Delta table
SELECT * FROM products_delta;

Breaking Down COPY INTO:

  • DROP TABLE IF EXISTS products_delta;: Good practice to ensure we start fresh.
  • CREATE TABLE products_delta (...) USING DELTA LOCATION ...;:
    • We’re creating a new table named products_delta.
    • We define its schema explicitly (ProductID INT, Name STRING, etc.). This is crucial for data quality and consistency, especially in a Lakehouse.
    • USING DELTA: Specifies that this is a Delta Lake table.
    • LOCATION '/FileStore/delta_tables/products': This tells Databricks where to store the underlying Parquet files and transaction logs for this Delta table. This creates an “external” table, meaning the data files are managed at this location, not directly within the Hive metastore’s default location.
  • COPY INTO products_delta FROM ... FILEFORMAT = CSV ...:
    • products_delta: The target Delta Lake table.
    • FROM '/FileStore/copy_into_sources/products.csv': The source path of our data.
    • FILEFORMAT = CSV: Specifies the format of the source files.
    • FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'false'): These options are specific to the FILEFORMAT. We indicate a header exists. We set inferSchema to false because we already provided an explicit schema in the CREATE TABLE statement.
    • COPY_OPTIONS ('mergeSchema' = 'false', 'force' = 'false'): These options control COPY INTO’s behavior, especially regarding schema evolution and idempotency.
      • 'mergeSchema' = 'false': For this first load, we want strict schema adherence. Later, you might set it to 'true' to allow schema evolution.
      • 'force' = 'false': This is key for idempotency. COPY INTO keeps track of the files it has already loaded into the target table. With 'force' = 'false' (the default), rerunning the command skips those files, so a re-run never creates duplicates. Setting 'force' = 'true' reprocesses all matching files, which is occasionally useful for backfills or testing but will duplicate rows if the target isn’t cleared first.
  • SELECT * FROM products_delta;: After the COPY INTO command runs, we can query our new Delta table just like any other SQL table!

You’ve just performed a robust data ingestion into a Delta Lake table using COPY INTO! This method is highly recommended for building reliable data pipelines.

Mini-Challenge: Ingesting a JSON File

You’ve tackled CSV. Now, let’s try a different format!

Challenge: Create a new simulated JSON file in DBFS and then ingest it into a new Delta Lake table using either the Spark read API or the COPY INTO command.

  • File Content:
    {"order_id": 1, "item": "Laptop", "quantity": 1, "total_price": 1200.00}
    {"order_id": 2, "item": "Mouse", "quantity": 2, "total_price": 51.00}
    {"order_id": 3, "item": "Keyboard", "quantity": 1, "total_price": 75.00}
    
  • Target Table Name: orders_delta
  • Location: /FileStore/delta_tables/orders

Hint:

  • For JSON files, the format option is simply "json".
  • JSON files usually don’t have a header, so that option isn’t needed.
  • Consider using inferSchema = true for a quick ingestion with spark.read, or define an explicit schema for COPY INTO.

What to observe/learn: How easy it is to switch between different file formats using the format option, and how schema inference (or explicit schema definition) adapts to the nested nature of JSON data (if your JSON was more complex).

(Pause here and try it yourself in your notebook!)

Solution (one possible approach using COPY INTO):

# Create the JSON file in DBFS
print("\n--- Creating JSON source file for Mini-Challenge ---")
json_source_path = "/FileStore/copy_into_sources/orders.json"
json_content = """{"order_id": 1, "item": "Laptop", "quantity": 1, "total_price": 1200.00}
{"order_id": 2, "item": "Mouse", "quantity": 2, "total_price": 51.00}
{"order_id": 3, "item": "Keyboard", "quantity": 1, "total_price": 75.00}
"""
dbutils.fs.put(json_source_path, json_content, True)
print(f"JSON source file created at: {json_source_path}")

Then, in a new SQL cell:

-- Ingest the JSON file into a Delta table using COPY INTO
DROP TABLE IF EXISTS orders_delta;

CREATE TABLE orders_delta (
  order_id INT,
  item STRING,
  quantity INT,
  total_price DECIMAL(10, 2)
)
USING DELTA
LOCATION '/FileStore/delta_tables/orders';

COPY INTO orders_delta
FROM '/FileStore/copy_into_sources/orders.json'
FILEFORMAT = JSON
FORMAT_OPTIONS (
  'multiLine' = 'false' -- Each JSON object is on a single line (JSON Lines)
)
COPY_OPTIONS (
  'mergeSchema' = 'false',
  'force' = 'false' -- Skip already-loaded files on re-runs (the default)
);

SELECT * FROM orders_delta;

Great job! You’re getting the hang of different ingestion methods and file formats.

Common Pitfalls & Troubleshooting

Even the best data engineers run into issues. Here are some common problems during ingestion and how to debug them:

  1. File Not Found Errors (Path does not exist):

    • Problem: You’ve specified an incorrect path to your file.
    • Solution: Double-check the path. Is it /FileStore/ or /mnt/? Did you spell the directory and file names correctly? Use dbutils.fs.ls("your/directory/path") to list contents and verify.
    • Permissions: Ensure your cluster or user has the necessary permissions to access the underlying cloud storage location.
  2. Schema Mismatch / Data Type Errors:

    • Problem: Spark is trying to read a column as an integer, but it encounters a string, or a date format isn’t recognized. This is common when inferSchema is true but the inference goes wrong, or when your explicit CREATE TABLE schema doesn’t match the source data.
    • Solution:
      • If using inferSchema=true: Examine the printSchema() output carefully. If a type is wrong, you might need to read it as a string initially and then cast it to the correct type later using Spark DataFrame operations.
      • If using COPY INTO with an explicit schema: Ensure your CREATE TABLE statement’s data types exactly match the incoming data’s types. For example, if a column in your CSV has decimals, make sure your Delta table schema uses DECIMAL or DOUBLE, not INT.
  3. Corrupted File / Malformed Records:

    • Problem: Some rows in your CSV might have too many or too few columns, or your JSON might be improperly formatted.
    • Solution: Spark read and COPY INTO have options to handle this. For example, for CSV, you can use .option("mode", "FAILFAST") (default is PERMISSIVE) to immediately fail on errors, or .option("mode", "DROPMALFORMED") to silently drop bad records. For production, you often want to quarantine bad records for later inspection.
  4. Idempotency and Duplicates with COPY INTO:

    • Problem: You rerun COPY INTO and get duplicate data, even though you thought it was idempotent.
    • Solution: COPY INTO tracks the files it has already loaded into the target table. With the default 'force' = 'false', rerunning against the same source path skips files that were already ingested. If you set 'force' = 'true', every matching file is reprocessed, which will duplicate rows. Also note that the tracking is file-based: overwriting an already-loaded file with new content under the same name can confuse your pipeline, so prefer landing each batch as a new, uniquely named file in the source path, and adjust COPY_OPTIONS accordingly.
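
For pitfall 3, the “quarantine” idea is worth sketching. This is plain Python, not Spark — a hand-rolled analogue of permissive-with-quarantine versus fail-fast handling for rows with the wrong column count:

```python
import csv
import io

raw = """ProductID,Name,Price
101,Laptop,1200.00
102,Mouse
103,Keyboard,75.00,EXTRA
104,Monitor,300.00
"""

def parse_rows(text: str, mode: str = "PERMISSIVE"):
    """Split rows into (good, quarantined) by column count.

    mode="FAILFAST" raises on the first malformed row, analogous to Spark's
    .option("mode", "FAILFAST"); the default keeps going and sets bad rows
    aside for later inspection instead of silently dropping them.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    good, quarantine = [], []
    for row in reader:
        if len(row) != len(header):
            if mode == "FAILFAST":
                raise ValueError(f"Malformed row: {row!r}")
            quarantine.append(row)  # keep the evidence for debugging
        else:
            good.append(dict(zip(header, row)))
    return good, quarantine

good, bad = parse_rows(raw)
print(len(good), len(bad))  # 2 2
```

In a real pipeline you would write the quarantined rows to their own table or path so the load succeeds while nothing is lost.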

Summary

Phew! You’ve covered a lot in this chapter. Data ingestion is the gateway to all your data endeavors in Databricks, and you’ve learned several powerful ways to do it.

Here are the key takeaways:

  • Data Ingestion is Critical: It’s the first step to making your data usable in Databricks.
  • DBFS for Staging: The Databricks File System (DBFS) provides a convenient layer over cloud storage for file operations.
  • Spark read API: Your versatile tool for loading data from various file formats (CSV, JSON, Parquet) into DataFrames, with options for headers, schema inference, and more.
  • COPY INTO for Delta Lake: The modern, robust, and idempotent SQL command specifically designed for loading data from cloud storage into Delta Lake tables, ensuring data quality and handling re-runs gracefully. It’s ideal for production pipelines.
  • Explicit Schemas are Best: While inferSchema is handy for exploration, defining explicit schemas (especially with CREATE TABLE for COPY INTO) is a best practice for production data pipelines.
  • Troubleshooting is Part of the Job: Be prepared to debug file paths, schema mismatches, and other common ingestion issues.

You’ve successfully brought data into your Databricks Lakehouse! In the next chapter, we’ll take this raw data and begin to clean, transform, and prepare it for analysis, diving deeper into Spark DataFrame operations. Get ready to sculpt your data!
