Welcome back, aspiring data whisperer! In our journey through LangExtract, we’ve learned how to define schemas, set up LLM providers, and perform basic extractions. But what happens when the extraction isn’t quite right? How do you peek “under the hood” of the LLM to understand why it made certain decisions?
This chapter is your toolkit for answering those critical questions. We’ll dive into the indispensable world of interactive visualization and systematic debugging for your LangExtract workflows. By the end, you’ll not only be able to identify extraction errors but also understand their root causes and confidently iterate towards accurate results. This ability to visualize and debug is paramount for building robust and reliable information extraction systems.
Understanding the Importance of Visualization
Imagine trying to fix a complex machine without being able to see its internal workings. Frustrating, right? The same applies to LLM-powered data extraction. Large Language Models are powerful, but they can be opaque. When an extraction fails or produces incorrect data, simply re-running it often won’t help. You need to understand:
- Which part of the input text led to a specific extraction?
- Why was certain information missed?
- What was the LLM’s confidence level for each extracted piece?
LangExtract provides mechanisms to gain this insight, allowing you to “ground” the extracted data back to its source text. This is crucial for refining your prompts and schemas.
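To make the idea concrete before we touch the library, here is a minimal, library-free sketch (plain Python, not LangExtract's API) of what source grounding amounts to: an extracted value paired with the character offsets that locate it in the original text.

```python
# Illustration only: source grounding pairs an extracted value
# with the character span it came from in the original text.
text = "Invoice issued to Acme Corp on 2025-01-15."

# Pretend an extractor returned this grounded result
# (field -> value plus start/end character offsets).
grounded = {"client_name": {"value": "Acme Corp", "start": 18, "end": 27}}

span = grounded["client_name"]
# Verify the span really points at the claimed value.
assert text[span["start"]:span["end"]] == span["value"]
print(f"'{span['value']}' found at chars {span['start']}-{span['end']}")
```

If the assertion fails, the offsets and the value disagree, which is exactly the kind of inconsistency source grounding lets you catch.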
Core Concepts: The Debugging Loop
Effective debugging of LLM extraction isn’t a one-off fix; it’s an iterative process. You’ll run an extraction, inspect the results, identify discrepancies, refine your instructions (prompt or schema), and then repeat.
Here’s a high-level view of this debugging loop:
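One way to sketch it as a Mermaid flowchart (the node labels are illustrative):

```mermaid
flowchart TD
    A[Define schema & prompt] --> B[Run extraction]
    B --> C[Get clean output]
    C --> D{Output correct?}
    D -- Yes --> G[Done]
    D -- No --> E[Inspect Detailed Output & Visualization]
    E --> F[Identify Issues]
    F --> A
```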
Each step in this loop is vital. Let's break down how LangExtract helps with the two inspection steps: examining detailed output (visualization) and identifying issues.
Source Grounding: Connecting Extractions to Text
One of LangExtract’s powerful features is source grounding. This means that for every piece of data it extracts, it can tell you exactly where in the original document that information came from. This is often presented as character offsets or highlighted text spans.
Why is this important?
- Verification: You can visually confirm if the LLM correctly identified the relevant text for an extracted field.
- Error Detection: If an entity is extracted incorrectly, source grounding helps you see which text led to the mistake, making it easier to adjust your prompt.
- Confidence Building: It adds transparency to the LLM’s “reasoning,” which is invaluable in production environments.
Inspecting Detailed Extraction Results
When you call lx.extract(), the returned ExtractionResult object (or similar structure) contains more than just the final, clean JSON data. It typically includes:
- Raw LLM Output: The direct, unparsed text response from the LLM. This can reveal if the LLM understood the instruction but formatted the output incorrectly, or if it hallucinated entirely.
- Intermediate Processing Steps: Details about how the document was chunked (if applicable) and results from individual chunks.
- Confidence Scores: Some LLM providers or LangExtract configurations might provide a confidence score for extracted entities.
- Source Spans/Offsets: Character start and end positions in the original text for each extracted value, enabling source grounding.
By examining these details, you gain a much deeper understanding of the extraction process than just looking at the final output.
Step-by-Step: Visualizing and Debugging an Extraction
Let’s set up a scenario where we want to extract information from a short text, and then intentionally introduce an error to demonstrate debugging.
First, ensure you have LangExtract installed and an LLM provider configured (as covered in Chapter 3). We’ll use a simple text and schema.
```python
# Make sure you have LangExtract installed:
#   pip install langextract
# and that your LLM provider is set up
# (e.g., the GOOGLE_API_KEY environment variable).
import os

import langextract as lx

# Set your LLM API key before running. For this example we use Google's models.
# os.environ["GOOGLE_API_KEY"] = "YOUR_API_KEY_HERE"  # Uncomment and replace if not set globally

print(f"LangExtract version: {lx.__version__}")

# 1. Define our document
document_text = """
Contract Agreement for Services
This agreement is made between Acme Corp (Client) and Innovate Solutions LLC (Service Provider) on January 15, 2025.
The service provider will deliver software development services.
The total compensation for these services is $15,000, payable upon completion.
"""

# 2. Define our extraction schema
# We'll start with a schema that's likely to work well.
schema = {
    "type": "object",
    "properties": {
        "client_name": {"type": "string", "description": "The full name of the client company."},
        "service_provider_name": {"type": "string", "description": "The full name of the service provider company."},
        "agreement_date": {"type": "string", "description": "The date the agreement was made, in YYYY-MM-DD format."},
        "service_description": {"type": "string", "description": "A brief description of the services to be rendered."},
        "compensation_amount": {"type": "number", "description": "The total monetary compensation for the services."},
    },
    "required": ["client_name", "service_provider_name", "agreement_date", "service_description", "compensation_amount"],
}

# 3. Perform the extraction.
# LangExtract's 'extract' function returns a detailed result object.
print("\n--- Initial Extraction ---")
try:
    extraction_result = lx.extract(
        text_or_document=document_text,
        schema=schema,
        # The exact shape of the result object (and how to access
        # visualization data) may vary by LangExtract version.
    )

    # The 'value' property usually holds the clean, extracted JSON.
    print("Extracted Data (Clean):")
    print(extraction_result.value)

    # For debugging and visualization, inspect the result object itself.
    # The 'spans' attribute is key for source grounding.
    print("\n--- Detailed Extraction Results (for debugging) ---")
    if hasattr(extraction_result, 'spans') and extraction_result.spans:
        print("Extracted Spans (showing source grounding):")
        for field, span_list in extraction_result.spans.items():
            print(f"- {field}:")
            for span in span_list:
                # A span typically has 'text', 'start_offset', 'end_offset'.
                print(f"  -> '{span.text}' (from char {span.start_offset} to {span.end_offset})")
    else:
        print("No detailed spans available in the result object. "
              "This may depend on the LLM provider or LangExtract version.")

    if hasattr(extraction_result, 'raw_output') and extraction_result.raw_output:
        print("\nRaw LLM Output (for deeper inspection):")
        print(extraction_result.raw_output[:500] + "...")  # First 500 chars
except Exception as e:
    print(f"An error occurred during extraction: {e}")
    print("Please ensure your LLM API key is correctly set and network connectivity is stable.")
```
Explanation:
- We import `langextract` and `os`.
- We define `document_text` and a `schema` for a contract.
- We call `lx.extract()`. The key here is that `extraction_result` is an object, not just the final JSON.
- We first print `extraction_result.value`, which gives us the clean, parsed output.
- Then we inspect `extraction_result.spans`. This attribute (if available and populated by the LLM provider) is crucial for source grounding: it maps each extracted field back to the exact text segment in the original document. This is your "visualization", seeing what the LLM thought was relevant.
- We also check for `raw_output`, the direct text response from the LLM before LangExtract parses it into the schema. This is invaluable if parsing fails or if the LLM's response format is unexpected.
Let’s run this code. You should see output similar to this (actual extracted values may vary slightly based on LLM):
LangExtract version: X.Y.Z # (e.g., 0.1.0 or higher)
--- Initial Extraction ---
Extracted Data (Clean):
{'client_name': 'Acme Corp', 'service_provider_name': 'Innovate Solutions LLC', 'agreement_date': '2025-01-15', 'service_description': 'software development services', 'compensation_amount': 15000}
--- Detailed Extraction Results (for debugging) ---
Extracted Spans (showing source grounding):
- client_name:
-> 'Acme Corp' (from char 45 to 54)
- service_provider_name:
-> 'Innovate Solutions LLC' (from char 65 to 87)
- agreement_date:
-> 'January 15, 2025' (from char 91 to 107)
- service_description:
-> 'software development services' (from char 136 to 165)
- compensation_amount:
-> '15,000' (from char 201 to 207)
Raw LLM Output (for deeper inspection):
```json
{
"client_name": "Acme Corp",
"service_provider_name": "Innovate Solutions LLC",
"agreement_date": "2025-01-15",
"service_description": "software development services",
"compensation_amount": 15000
}
```...
Notice how spans clearly shows the start and end character offsets for each piece of extracted data. This is the “interactive visualization” in a programmatic sense – you can highlight these spans in a UI or mentally map them back to the source text.
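To show what "highlighting" can look like programmatically, here is a small helper (plain Python, independent of LangExtract) that wraps each grounded span in brackets so you can eyeball an extraction against its source text:

```python
def highlight_spans(text, spans):
    """Return `text` with each (start, end) span wrapped in [[...]].

    `spans` is a list of (start, end) character-offset pairs, assumed
    non-overlapping. Markers are inserted right-to-left so earlier
    offsets stay valid as the string grows.
    """
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + "[[" + text[start:end] + "]]" + text[end:]
    return text

sample = "Payment of $15,000 is due to Acme Corp."
print(highlight_spans(sample, [(11, 18), (29, 38)]))
# -> Payment of [[$15,000]] is due to [[Acme Corp]].
```

The same idea scales up to rendering HTML `<mark>` tags in a review UI; the offsets from source grounding are all you need.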
Introducing and Debugging an Error
Now, let’s intentionally break something to see the debugging in action. Suppose we want to extract the currency symbol along with the amount, but we forget to update the schema’s type.
```python
# ... (previous code for imports, document_text) ...

# 2. Define our extraction schema - with an intentional error!
# We'll try to extract "compensation_amount_with_currency" as a number,
# but the LLM will likely include the '$' symbol, causing a type mismatch.
schema_with_error = {
    "type": "object",
    "properties": {
        "client_name": {"type": "string", "description": "The full name of the client company."},
        "service_provider_name": {"type": "string", "description": "The full name of the service provider company."},
        "agreement_date": {"type": "string", "description": "The date the agreement was made, in YYYY-MM-DD format."},
        "service_description": {"type": "string", "description": "A brief description of the services to be rendered."},
        "compensation_amount_with_currency": {"type": "number", "description": "The total monetary compensation including its currency symbol."},  # ERROR HERE!
    },
    "required": ["client_name", "service_provider_name", "agreement_date", "service_description", "compensation_amount_with_currency"],
}

print("\n--- Extraction with Intentional Error ---")
try:
    error_extraction_result = lx.extract(
        text_or_document=document_text,
        schema=schema_with_error,
    )

    print("Extracted Data (Clean - might be missing a field due to the error):")
    print(error_extraction_result.value)

    print("\n--- Detailed Error Extraction Results ---")
    if hasattr(error_extraction_result, 'spans') and error_extraction_result.spans:
        print("Extracted Spans:")
        for field, span_list in error_extraction_result.spans.items():
            print(f"- {field}:")
            for span in span_list:
                print(f"  -> '{span.text}' (from char {span.start_offset} to {span.end_offset})")

    if hasattr(error_extraction_result, 'raw_output') and error_extraction_result.raw_output:
        print("\nRaw LLM Output (crucial for debugging parsing errors):")
        print(error_extraction_result.raw_output)
except Exception as e:
    print(f"An error occurred during extraction: {e}")
    # LangExtract often raises a ValidationError (or similar) if schema parsing fails.
    print("This usually means the LLM's raw output couldn't be parsed into the defined schema type.")
```
When you run this, you'll likely see one of two things:
- A `ValidationError` (or similar Pydantic error): LangExtract catching that the LLM returned `"$15,000"` but the schema expected a `number`.
- The `compensation_amount_with_currency` field is `None` or missing: the LLM returned `"$15,000"`, and LangExtract's internal parsing logic for `number` silently failed, dropping the field.
In either case, the raw_output is your best friend!
Looking at error_extraction_result.raw_output, you’d see something like:
```json
{
  "client_name": "Acme Corp",
  "service_provider_name": "Innovate Solutions LLC",
  "agreement_date": "2025-01-15",
  "service_description": "software development services",
  "compensation_amount_with_currency": "$15,000"
}
```
Aha! The LLM did return "$15,000", but our schema expected a number. This is a type mismatch.
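You can reproduce this diagnosis without any LLM call. The sketch below (plain Python, stdlib only; not LangExtract's validator) parses a raw-output string and checks each top-level value's Python type against the schema's declared JSON type, flagging exactly this kind of mismatch:

```python
import json

# Map JSON Schema type names to the Python types they accept.
JSON_TYPES = {"string": str, "number": (int, float), "boolean": bool}

def find_type_mismatches(raw_output, schema):
    """Return {field: (expected_type, actual_value)} for every
    top-level property whose value doesn't match the schema type."""
    data = json.loads(raw_output)
    mismatches = {}
    for field, spec in schema["properties"].items():
        expected = JSON_TYPES.get(spec["type"])
        if field in data and expected and not isinstance(data[field], expected):
            mismatches[field] = (spec["type"], data[field])
    return mismatches

raw = '{"compensation_amount_with_currency": "$15,000"}'
schema_fragment = {"properties": {
    "compensation_amount_with_currency": {"type": "number"}}}
print(find_type_mismatches(raw, schema_fragment))
# -> {'compensation_amount_with_currency': ('number', '$15,000')}
```

A full JSON Schema validator (e.g. the `jsonschema` package) does this more thoroughly, but even this crude check pinpoints which field and which type are in conflict.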
The Fix: Update the schema’s type for compensation_amount_with_currency to string.
```python
# ... (previous code for imports, document_text) ...

# 2. Define our extraction schema - FIX APPLIED
fixed_schema = {
    "type": "object",
    "properties": {
        "client_name": {"type": "string", "description": "The full name of the client company."},
        "service_provider_name": {"type": "string", "description": "The full name of the service provider company."},
        "agreement_date": {"type": "string", "description": "The date the agreement was made, in YYYY-MM-DD format."},
        "service_description": {"type": "string", "description": "A brief description of the services to be rendered."},
        "compensation_amount_with_currency": {"type": "string", "description": "The total monetary compensation including its currency symbol."},  # FIXED!
    },
    "required": ["client_name", "service_provider_name", "agreement_date", "service_description", "compensation_amount_with_currency"],
}

print("\n--- Extraction with Fixed Schema ---")
try:
    fixed_extraction_result = lx.extract(
        text_or_document=document_text,
        schema=fixed_schema,
    )

    print("Extracted Data (Clean - now correct):")
    print(fixed_extraction_result.value)

    if hasattr(fixed_extraction_result, 'spans') and fixed_extraction_result.spans:
        print("\nExtracted Spans (showing source grounding):")
        for field, span_list in fixed_extraction_result.spans.items():
            print(f"- {field}:")
            for span in span_list:
                print(f"  -> '{span.text}' (from char {span.start_offset} to {span.end_offset})")
except Exception as e:
    print(f"An error occurred during extraction: {e}")
```
Now, the extraction should succeed, and compensation_amount_with_currency will correctly hold "$15,000".
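An alternative to loosening the schema type is to keep the string and normalize it yourself in a post-processing step. A small helper (plain Python, not part of LangExtract) might look like:

```python
import re

def parse_currency(value):
    """Convert a currency string like '$15,000' or '15,000.50 USD'
    to a float. Raises ValueError if no numeric part is found."""
    match = re.search(r"[-+]?\d[\d,]*(?:\.\d+)?", value)
    if match is None:
        raise ValueError(f"No numeric value in {value!r}")
    return float(match.group().replace(",", ""))

print(parse_currency("$15,000"))        # -> 15000.0
print(parse_currency("15,000.50 USD"))  # -> 15000.5
```

This keeps the schema tolerant of however the LLM formats money while still giving downstream code a clean number.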
Mini-Challenge: Debugging a Missing Field
You’ve got a document, and you want to extract a specific piece of information, but it’s consistently missing from your LangExtract output. Use the debugging techniques you’ve learned to identify and fix the issue.
Challenge:
- Document:

```
Meeting Minutes - Project Phoenix
Date: 2025-11-20
Attendees: Alice, Bob, Charlie
Topic: Phase 1 Review
Decision: Proceed with Phase 2, target completion by Q1 2026.
Next Steps: Alice to prepare budget, Bob to update timeline.
```

- Initial Schema (with a subtle error):

```python
challenge_schema = {
    "type": "object",
    "properties": {
        "meeting_topic": {"type": "string", "description": "The main topic discussed in the meeting."},
        "decision_made": {"type": "string", "description": "The primary decision or outcome of the meeting."},
        "lead_for_budget": {"type": "string", "description": "The person responsible for preparing the budget."},
    },
    "required": ["meeting_topic", "decision_made", "lead_for_budget"],
}
```

- Task: Perform the extraction using the document above and `challenge_schema`. Observe that `lead_for_budget` is missing or incorrect in the final `extraction_result.value`.
- Debug: Use `extraction_result.raw_output` and `extraction_result.spans` to understand why `lead_for_budget` is not being extracted correctly.
- Fix: Modify `challenge_schema` (specifically the description for `lead_for_budget`) to guide the LLM more effectively.
Hint: Sometimes, the LLM needs a more precise instruction in the description field of your schema property to find the exact information you’re looking for, especially when it’s implied rather than explicitly stated.
What to observe/learn: You should see that the LLM might struggle with inferring “person responsible for preparing the budget” from “Alice to prepare budget” without a clearer hint. The raw_output might show the LLM trying to extract it but failing, or making an incorrect guess. By refining the schema description, you’re giving the LLM a better “search query” for that specific piece of information.
Common Pitfalls & Troubleshooting
- Vague Schema Descriptions:
  - Pitfall: Using descriptions like `"data": {"type": "string", "description": "some data"}`. The LLM won't know what "some data" refers to.
  - Troubleshooting: Make your `description` fields as clear, specific, and unambiguous as possible. Use examples if the `description` alone isn't enough (LangExtract supports examples in schemas, though not covered in detail here).
- Type Mismatches:
  - Pitfall: Expecting a `number` when the LLM outputs text with units (e.g., "100 USD", "50%").
  - Troubleshooting: Check `raw_output`. If the LLM provides data in a format that doesn't match your schema's `type`, either change the schema `type` (e.g., from `number` to `string`) or refine the prompt to instruct the LLM to output only the numeric value.
- Missing Fields due to Ambiguity:
  - Pitfall: A field is consistently `None` or missing from the output. This often happens with information that is implicitly stated or spread across the document.
  - Troubleshooting:
    - Check `raw_output`: Did the LLM attempt to extract it but fail, or did it ignore it entirely?
    - Refine the `description`: Provide more context or specific keywords to look for.
    - Consider your chunking strategy (for long documents): If the relevant information is in a different chunk than the prompt context, it might be missed. Ensure your chunking strategy (from Chapter 7) keeps related information together.
- LLM Provider Errors:
  - Pitfall: API key issues, rate limits, network errors, or invalid model names.
  - Troubleshooting: Ensure your API key environment variables are set correctly. Check the specific error message from the `try`/`except` block; it usually points directly to the provider's issue. Consult the LLM provider's documentation for status and rate limits.
Summary
Congratulations! You’ve just gained critical skills in visualizing and debugging your LangExtract workflows.
Here are the key takeaways:
- Visualization is Key: Don't just look at the final output; use detailed results to understand the LLM's reasoning.
- Source Grounding: LangExtract's `spans` attribute (or similar) lets you map extracted data back to its exact location in the original text, providing transparency.
- The Debugging Loop: Extract, inspect, identify issues, refine, and re-extract. This iterative process is fundamental.
- `raw_output` is Your Friend: Always inspect the raw LLM response when you hit parsing errors or unexpected values. It shows what the LLM actually returned.
- Schema Descriptions Matter: Clear, specific descriptions in your schema properties are powerful instructions for the LLM.
With these debugging superpowers, you’re now much better equipped to handle the complexities of real-world information extraction.
In our next chapter, we’ll shift our focus to performance tuning and handling large volumes of documents, ensuring your LangExtract solutions are not only accurate but also efficient and scalable.