Introduction

Welcome to Chapter 4! If you’ve made it this far, you’ve successfully set up your LangExtract environment and connected it to a Large Language Model (LLM) provider. That’s a huge step! Now, it’s time to put all that preparation to good use and perform your very first structured data extraction.

This chapter is all about taking those initial, exciting “baby steps” into the world of LangExtract. We’ll focus on the core extract function, learn how to define a simple schema to guide our LLM, and most importantly, understand how to interpret the results LangExtract provides. By the end of this chapter, you’ll be able to confidently extract specific pieces of information from text and inspect the quality of your extractions.

Why does this matter? Being able to extract structured data from unstructured text is a superpower in today’s data-rich world. Imagine sifting through thousands of customer reviews to find specific feedback, or parsing contracts for key clauses. LangExtract makes this process efficient and reliable. Let’s dive in and unlock this power!

Core Concepts: Your First Extraction

At the heart of LangExtract is its powerful extract function. This function acts as the orchestrator, taking your unstructured text, your desired output structure (the schema), and your chosen LLM, and then returning the extracted, structured data.

The extract Function: Your Data Magician

The lx.extract function is where the magic happens. It’s the primary entry point for telling LangExtract what you want to pull out of your text.

What it is: A function that takes your input text, a definition of the data you want to extract, and an LLM, then returns an object containing the extracted data.

Why it’s important: It’s the central command for all your extraction tasks. Without it, LangExtract wouldn’t know what to do!

How it functions: It sends your text and schema to the LLM, intelligently chunks the text if it’s too long, processes the LLM’s response, and then validates and formats the output according to your schema.

The basic signature looks something like this:

result = lx.extract(text_or_document, schema, llm)

Where:

  • text_or_document: The raw text string or a document object you want to extract from.
  • schema: A Pydantic-compatible class that defines the structure of the data you want to extract.
  • llm: The configured LLM client object you set up in the previous chapter.
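To make that flow concrete, here is a minimal conceptual sketch of what an extract-style function does, based purely on the description above. This is illustrative only, not LangExtract’s actual implementation; it assumes Pydantic v2 and an LLM client that can be called with a prompt string:

```python
import json
from pydantic import BaseModel

class Person(BaseModel):
    name: str
    age: int

def conceptual_extract(text: str, schema: type[BaseModel], llm):
    # 1. Turn the schema into an instruction embedded in the prompt.
    prompt = (
        "Extract the following fields and reply with JSON matching this schema:\n"
        f"{json.dumps(schema.model_json_schema())}\n\nText:\n{text}"
    )
    # 2. Ask the LLM for a response.
    raw_json = llm(prompt)
    # 3. Validate and coerce the response into the schema.
    return schema.model_validate_json(raw_json)

# A stand-in "LLM" that returns a fixed JSON answer, so the sketch runs offline:
fake_llm = lambda prompt: '{"name": "Alice Smith", "age": 30}'

person = conceptual_extract("Alice Smith is 30 years old.", Person, fake_llm)
print(person.name, person.age)  # Alice Smith 30
```

The real function adds a lot on top of this (chunking long documents, retries, provenance tracking), but the prompt-call-validate loop is the core idea.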

Defining an Extraction Schema: Telling LangExtract What to Look For

This is arguably the most crucial part of any LangExtract task. The “schema” is your blueprint, your instruction manual for the LLM. It tells the LLM exactly what kind of information to extract and how to structure it.

LangExtract leverages Pydantic for defining schemas. Pydantic is a fantastic Python library that provides data validation and settings management using Python type hints. This means you define your desired data structure using standard Python classes and type annotations, and Pydantic (and by extension, LangExtract) handles the rest.

What it is: A Python class, typically inheriting from pydantic.BaseModel, that specifies the names and types of the fields you want to extract.

Why it’s important: It ensures consistency and accuracy. Without a schema, the LLM might return data in an unpredictable format, or miss key pieces of information. It acts as a clear, unambiguous instruction to the LLM.

How it functions: When you pass a schema to lx.extract, LangExtract converts this Python class into a clear instruction for the LLM, often as part of the prompt. The LLM then attempts to fill in the fields defined in your schema based on the input text. LangExtract then validates the LLM’s output against your schema, ensuring type correctness and structure.

For example, if you want to extract a person’s name and age, your schema would look like this:

from pydantic import BaseModel

class Person(BaseModel):
    name: str
    age: int

Notice the type hints (str, int) – these are vital for Pydantic and LangExtract to understand what kind of data to expect for each field.
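Those type hints also drive automatic coercion. The snippet below is plain Pydantic (v2 shown), independent of LangExtract, demonstrating how a compatible value is converted to the declared type:

```python
from pydantic import BaseModel

class Person(BaseModel):
    name: str
    age: int

# Pydantic coerces the numeric string "42" to the declared int type.
p = Person(name="Alice", age="42")
print(type(p.age).__name__, p.age)  # int 42
```

This is exactly the safety net that catches malformed LLM output before it reaches your downstream code.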

Understanding the ExtractionResult: What You Get Back

When lx.extract completes its work, it doesn’t just return raw data. It returns an ExtractionResult object. This object is incredibly useful because it provides not only the extracted data but also valuable metadata about the extraction process.

What it is: An object returned by lx.extract containing the extracted data and provenance information.

Why it’s important: It gives you the structured output you asked for, plus context about where that data came from in the original text. This “provenance” is crucial for debugging, auditing, and building trust in your extractions.

How it functions:

  • result.extracted_data: This is where your structured data lives. If your schema defined a single object, this will be an instance of that object. If your schema was designed to extract multiple items (e.g., a list of people), this would be a list of those objects.
  • result.provenance: This provides information about where in the original text each piece of extracted data was found. It includes details like the start and end character offsets, allowing you to highlight or verify the source of each extracted value.

Let’s think of it like this: You ask a detective (the LLM) to find specific clues (data) from a crime scene (text) and organize them into a report (schema). The ExtractionResult is the detective’s final report, complete with the organized clues (extracted_data) and notes on exactly where each clue was found at the scene (provenance). Pretty neat, right?

Step-by-Step Implementation: Your First Extraction!

Let’s put these concepts into practice. We’ll perform a simple extraction of a person’s name and profession from a short piece of text.

Step 1: Import LangExtract and Your LLM

First, make sure you have your LLM client ready from the previous chapter. We’ll use a placeholder my_llm for now; in a real setup you would initialize it as shown in Chapter 3 (e.g., my_llm = lx.llms.OpenAIChat(model="gpt-4o", api_key=os.environ["OPENAI_API_KEY"])).

import langextract as lx
from pydantic import BaseModel
import os

# Assume my_llm is configured from Chapter 3
# For demonstration, let's use a dummy LLM or an actual one if you have keys configured
# For example, using OpenAI:
# from langextract.llms import OpenAIChat
# my_llm = OpenAIChat(model="gpt-4o", api_key=os.environ.get("OPENAI_API_KEY"))
# if not my_llm.api_key:
#     raise ValueError("OPENAI_API_KEY environment variable not set.")

# For this example, we'll use a placeholder for my_llm
# In a real scenario, replace this with your actual LLM client instance
class DummyLLM:
    def __call__(self, prompt: str, **kwargs):
        # Simulate an LLM response based on a simple prompt
        if "name" in prompt and "profession" in prompt:
            return '{"name": "Alice Smith", "profession": "Software Engineer"}'
        return "{}"

my_llm = DummyLLM() # Replace with your actual LLM instance!

print("LangExtract and LLM placeholder imported successfully.")

Explanation:

  • import langextract as lx: This line imports the main LangExtract library, making it accessible via the alias lx.
  • from pydantic import BaseModel: We import BaseModel from Pydantic, which will be the foundation for our extraction schema.
  • import os: Not used by the dummy LLM itself, but commonly needed to fetch API keys when configuring a real client.
  • DummyLLM: This is a placeholder. In your actual code, you would use an LLM client like OpenAIChat or GeminiChat that you set up in Chapter 3. We’re using a dummy here to ensure the example runs without needing an active API key, but remember its output is simulated.

Step 2: Define a Simple Schema

Next, let’s define the structure of the data we want to extract. We’ll create a PersonInfo schema.

# Continue from Step 1

class PersonInfo(BaseModel):
    name: str
    profession: str

print("PersonInfo schema defined.")

Explanation:

  • We’ve created a Python class PersonInfo that inherits from BaseModel.
  • Inside PersonInfo, we define two fields: name and profession.
  • Crucially, we’ve given them type hints: str for both. This tells LangExtract (and Pydantic) that we expect these fields to contain text strings. If the LLM returns something else, LangExtract will try to convert it or flag an error.

Step 3: Prepare Your Input Text

Now, let’s craft a short piece of text from which we want to extract information.

# Continue from Step 2

input_text = "Alice Smith works as a Software Engineer at TechCorp. She enjoys coding and hiking."

print("Input text prepared.")

Explanation:

  • input_text: This is a simple string containing the unstructured information we want to process.

Step 4: Perform the Extraction

With our LLM, schema, and text ready, we can now call lx.extract().

# Continue from Step 3

# Note: the DummyLLM doesn't actually process the text; it returns fixed JSON.
# With a real LLM instance, LangExtract would send input_text and the
# PersonInfo schema to my_llm and parse its genuine response.
extraction_result = lx.extract(text_or_document=input_text, schema=PersonInfo, llm=my_llm)

print("\nExtraction performed!")

Explanation:

  • lx.extract(): We call the core extraction function.
  • text_or_document=input_text: We pass our prepared text.
  • schema=PersonInfo: We provide our defined PersonInfo schema.
  • llm=my_llm: We pass our configured LLM client. LangExtract will use this LLM to perform the actual text processing.
  • The return value is stored in extraction_result, which is an ExtractionResult object.

Step 5: Inspect the Results

Finally, let’s look at what we got! We’ll access the extracted_data and provenance from our extraction_result object.

# Continue from Step 4

# Access the extracted data
extracted_person = extraction_result.extracted_data

print("\n--- Extracted Data ---")
print(f"Type of extracted_person: {type(extracted_person)}")
print(f"Name: {extracted_person.name}")
print(f"Profession: {extracted_person.profession}")

# Access the provenance information
print("\n--- Provenance Information ---")
if extraction_result.provenance:
    # provenance is a list of Provenance objects
    for field_provenance in extraction_result.provenance:
        print(f"  Field: {field_provenance.field_path}")
        print(f"  Extracted Value: '{field_provenance.extracted_value}'")
        # Source spans tell you where in the original text the value was found
        if field_provenance.source_spans:
            for span in field_provenance.source_spans:
                print(f"    Source Text: '{input_text[span.start:span.end]}'")
                print(f"    Start: {span.start}, End: {span.end}")
        else:
            print("    No source spans found for this field (may vary by LLM/extraction strategy).")
else:
    print("No provenance information available.")

Explanation:

  • extracted_person = extraction_result.extracted_data: We retrieve the actual structured data. Because our schema defined a single PersonInfo object, extracted_data will be an instance of PersonInfo.
  • extracted_person.name and extracted_person.profession: We can access the extracted fields directly as attributes of our PersonInfo object, thanks to Pydantic.
  • extraction_result.provenance: This gives us a list of Provenance objects. Each Provenance object tells us about a specific field that was extracted.
  • field_provenance.field_path: The name of the field (e.g., name, profession).
  • field_provenance.extracted_value: The value that was extracted for that field.
  • field_provenance.source_spans: A list of SourceSpan objects. Each SourceSpan contains start and end character offsets, pinpointing exactly where the extracted value (or information contributing to it) was found in the original input_text. This is incredibly powerful for verifying your results!

You’ve just performed your first structured extraction with LangExtract! Give yourself a pat on the back. You defined a clear goal with a schema, fed it text, and got back beautifully organized data, complete with its origin story.

Mini-Challenge: Extracting Book Information

Now it’s your turn! I want you to apply what you’ve learned to a new scenario.

Challenge: You have the following text about a book. Your task is to extract the book title, author, and publication year.

"The Hitchhiker's Guide to the Galaxy, a science fiction comedy classic, was written by Douglas Adams and first published in 1979."
  1. Define a new Pydantic schema called BookDetails with fields for title (string), author (string), and publication_year (integer).
  2. Use lx.extract with your new schema, the provided text, and your my_llm instance.
  3. Print the extracted title, author, and publication_year from the extracted_data.
  4. (Bonus) Try to print the provenance for each field, similar to the example above.

Hint: Pay close attention to the data types you define in your BookDetails schema. The LLM is smart, but giving it the correct target type helps a lot! Remember to use int for the year.

Solution (but try it yourself first!):
import langextract as lx
from pydantic import BaseModel
import os

# Re-using the DummyLLM for demonstration purposes.
# In a real scenario, use your actual LLM client.
class DummyLLM:
    def __call__(self, prompt: str, **kwargs):
        if "title" in prompt and "author" in prompt and "publication_year" in prompt:
            # Simulate a more complex LLM response for the book details
            return '{"title": "The Hitchhiker\'s Guide to the Galaxy", "author": "Douglas Adams", "publication_year": 1979}'
        return "{}"

my_llm = DummyLLM()

# 1. Define the new Pydantic schema
class BookDetails(BaseModel):
    title: str
    author: str
    publication_year: int

# Prepare the input text
book_text = "The Hitchhiker's Guide to the Galaxy, a science fiction comedy classic, was written by Douglas Adams and first published in 1979."

print("Attempting to extract book details...")

# 2. Use lx.extract
book_extraction_result = lx.extract(text_or_document=book_text, schema=BookDetails, llm=my_llm)

# Access the extracted data
extracted_book = book_extraction_result.extracted_data

print("\n--- Extracted Book Details ---")
print(f"Title: {extracted_book.title}")
print(f"Author: {extracted_book.author}")
print(f"Publication Year: {extracted_book.publication_year}")

# 4. (Bonus) Print provenance
print("\n--- Book Provenance Information ---")
if book_extraction_result.provenance:
    for field_provenance in book_extraction_result.provenance:
        print(f"  Field: {field_provenance.field_path}")
        print(f"  Extracted Value: '{field_provenance.extracted_value}'")
        if field_provenance.source_spans:
            for span in field_provenance.source_spans:
                print(f"    Source Text: '{book_text[span.start:span.end]}'")
                print(f"    Start: {span.start}, End: {span.end}")
        else:
            print("    No source spans found for this field.")
else:
    print("No provenance information available.")

What to observe/learn:

  • Did your LLM correctly identify the integer for the publication year?
  • How adaptable is LangExtract to different information extraction tasks just by changing the schema?
  • Does the provenance correctly point to the original text segments? (Remember, with the DummyLLM, provenance will be limited or absent as it doesn’t actually process text.)

Common Pitfalls & Troubleshooting

Even with simple extractions, you might encounter a few hiccups. Here are some common ones and how to approach them:

  1. Incorrect Schema Definition (Type Mismatch):

    • Pitfall: You define a field as int, but the LLM returns text that can’t be converted to an integer (e.g., “nineteen seventy-nine”). Or you expect a list, but define a single string.
    • Troubleshooting: LangExtract, powered by Pydantic, will often raise a ValidationError if the LLM’s output doesn’t conform to your schema. Read the error message carefully! It will tell you exactly which field caused the issue and why (e.g., “value is not a valid integer”). Adjust your schema’s type hint, or refine your input text/LLM prompt (which we’ll cover in future chapters) to encourage the correct output.
  2. LLM Hallucination or Poor Extraction Quality:

    • Pitfall: The LLM extracts incorrect information, or completely misses a field, even though it’s present in the text.
    • Troubleshooting: This is a common challenge with LLMs. For basic cases, double-check your text for clarity. Is the information unambiguous? For more complex cases, we’ll explore techniques like prompt engineering, providing examples, and using different LLM models in later chapters. For now, understand that the LLM is doing its best guess based on your schema and the text.
  3. API Key or LLM Configuration Issues:

    • Pitfall: You get errors related to authentication, invalid models, or network connectivity when lx.extract tries to interact with your LLM.
    • Troubleshooting: Revisit Chapter 3. Ensure your API keys are correctly set as environment variables, your LLM client is initialized with the correct model name, and you have an active internet connection. Check the specific error message from your LLM provider (e.g., OpenAI’s AuthenticationError).
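When pitfall 1 strikes, the ValidationError itself tells you exactly what went wrong. The structure of that error is standard Pydantic (v2 API shown), so you can experiment with it directly, without any LLM involved:

```python
from pydantic import BaseModel, ValidationError

class BookDetails(BaseModel):
    title: str
    publication_year: int

try:
    # A spelled-out year cannot be parsed as an integer.
    BookDetails(title="HHGTTG", publication_year="nineteen seventy-nine")
except ValidationError as e:
    for err in e.errors():
        # Each error names the offending field and explains the failure.
        print(err["loc"], "->", err["msg"])
```

Reading err["loc"] first usually pinpoints which schema field needs a different type hint or a clearer prompt.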

Summary

In this chapter, you’ve taken a significant leap forward in using LangExtract!

Here are the key takeaways:

  • The lx.extract() function is the core of LangExtract, taking text, a schema, and an LLM to perform structured data extraction.
  • Schemas, defined using Pydantic’s BaseModel, are crucial for guiding the LLM to extract specific data in a consistent, structured format.
  • The ExtractionResult object provides not only the extracted_data but also valuable provenance information, showing where each piece of data originated in the source text.
  • You successfully performed your first hands-on extraction and tackled a mini-challenge, demonstrating the power of schema-driven extraction.
  • We’ve touched on common pitfalls like schema mismatches and LLM quality, providing initial troubleshooting steps.

You’re now equipped to perform basic extractions and understand the output. In the next chapter, we’ll dive deeper into defining more complex schemas, including nested objects and lists, to handle richer and more varied data extraction tasks. Get ready to expand your LangExtract toolkit!
