Welcome back, future data architects! In our previous chapters, we laid the groundwork for understanding LangExtract, setting up our environment, and performing basic extractions. You’ve seen how powerful Large Language Models (LLMs) can be when guided by a structured schema.
In this chapter, we’re going to put all that knowledge to the test with a practical, high-value project: extracting key information from legal contracts. Legal documents are notoriously complex, filled with jargon, and often lengthy, making them a perfect challenge for LangExtract’s capabilities. By the end of this chapter, you’ll have built a system to automatically pull out crucial details like parties involved, effective dates, and contract values from sample legal text. This isn’t just about coding; it’s about building confidence in tackling real-world, complex data extraction problems.
Core Concepts: Navigating the Legal Labyrinth with LangExtract
Extracting information from legal contracts requires both precision and robustness. A single missed detail or an incorrect interpretation can have significant consequences. LangExtract, with its schema-driven approach and LLM orchestration, is well-suited for this task.
Understanding the Structure of Legal Data
Before we jump into code, let’s think about what kind of information is typically important in a contract:
- Parties: Who are the entities entering into the agreement? (e.g., “The Company,” “The Client”)
- Effective Date: When does the contract officially begin?
- Contract Value: If applicable, what is the monetary amount associated with the agreement?
- Governing Law: Which jurisdiction’s laws apply to the contract?
- Specific Clauses: Details about termination, intellectual property, confidentiality, etc.
Our goal is to define a schema that captures these elements accurately.
The Power of Pydantic for Schema Definition
As we learned, LangExtract heavily leverages Pydantic for defining the structure of the data we want to extract. Pydantic allows us to define Python classes with type hints, which it then uses to validate data and, in LangExtract’s case, to instruct the LLM on the desired output format.
For legal documents, Pydantic’s ability to add descriptions to fields becomes incredibly valuable. These descriptions act as explicit instructions for the LLM, guiding it to extract precisely what we intend, even from ambiguous legal phrasing.
The LangExtract Workflow for Complex Documents
The process we’ll follow looks like this: raw legal text goes into LangExtract, our Pydantic schema guides the LLM, and the result is validated, structured data that we then review and refine. The “Review & Refine” step is particularly critical for legal use cases, where accuracy is paramount.
Step-by-Step Implementation: Building Our Legal Extractor
Let’s get our hands dirty and start building!
Step 1: Setting Up Your Environment (Quick Recap)
First, ensure you have langextract and pydantic installed. If you haven’t installed them yet, or you want to move to the latest stable versions as of early 2026, run:
pip install langextract pydantic~=2.0
Note: langextract is an actively developed library. For the absolute latest features and bug fixes, always refer to the official GitHub repository. Pydantic version 2.x is the current stable release, offering significant performance improvements.
Next, make sure your LLM provider’s API key is configured. For this example, we’ll assume you’re using a Google model (like Gemini Pro) and have your GOOGLE_API_KEY set as an environment variable.
# On Linux/macOS
export GOOGLE_API_KEY="YOUR_API_KEY_HERE"
# On Windows (Command Prompt)
set GOOGLE_API_KEY=YOUR_API_KEY_HERE
# On Windows (PowerShell)
$env:GOOGLE_API_KEY="YOUR_API_KEY_HERE"
Replace "YOUR_API_KEY_HERE" with your actual key.
Step 2: Defining the Legal Contract Schema
Now, let’s define the Pydantic model that will guide our extraction. We’ll specify the types and add clear descriptions for each field.
Create a new Python file, say contract_extractor.py, and add the following:
# contract_extractor.py
from pydantic import BaseModel, Field
from typing import List, Optional


class LegalContractDetails(BaseModel):
    """
    Schema for extracting key details from a legal contract.
    """

    contract_id: str = Field(
        description="A unique identifier for the contract, typically a reference number or code."
    )
    parties: List[str] = Field(
        description="A list of the names of all parties involved in the contract."
    )
    effective_date: str = Field(
        description="The date when the contract officially comes into effect, in YYYY-MM-DD format if possible."
    )
    contract_value: Optional[str] = Field(
        default=None,
        description="The total monetary value or consideration specified in the contract, including currency."
    )
    governing_law: Optional[str] = Field(
        default=None,
        description="The jurisdiction whose laws govern the contract, e.g., 'State of California' or 'England and Wales'."
    )
Let’s break down this code:
- from pydantic import BaseModel, Field: We import the necessary components from Pydantic. BaseModel is the base class for our schema, and Field allows us to add metadata like descriptions and default values.
- from typing import List, Optional: These standard Python type hints declare that parties will be a list of strings and that contract_value and governing_law are optional fields that might not always be present.
- class LegalContractDetails(BaseModel): This declares our Pydantic schema class.
- contract_id: str = Field(...): This defines a required field contract_id of type str. The description parameter is crucial here, giving the LLM explicit instructions on what to look for.
- parties: List[str] = Field(...): This defines a field parties that expects a list of strings.
- effective_date: str = Field(...): Another required string field for the date. We explicitly ask for a YYYY-MM-DD format to help standardize the output.
- contract_value: Optional[str] = Field(default=None, ...): This field is Optional, meaning it might not always be found in the text. default=None explicitly states its default absence. We also provide a clear description for the LLM.
- governing_law: Optional[str] = Field(default=None, ...): Similar to contract_value, this is an optional field with a clear description.
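To see how required and optional fields behave before any LLM is involved, here is a minimal sketch that assumes Pydantic v2. ContractStub is a hypothetical, trimmed two-field copy of the schema used only for illustration; it is not part of LangExtract.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class ContractStub(BaseModel):
    """Trimmed, illustrative copy of the schema: one required, one optional field."""

    contract_id: str = Field(description="A unique identifier for the contract.")
    governing_law: Optional[str] = Field(
        default=None,
        description="The jurisdiction whose laws govern the contract.",
    )


# Optional fields fall back to their default when the data omits them.
stub = ContractStub(contract_id="TI-GS-2025-001")
print(stub.governing_law)  # None

# Required fields raise a ValidationError when they are missing.
try:
    ContractStub(governing_law="State of Delaware")
    missing = []
except ValidationError as exc:
    missing = [err["loc"][0] for err in exc.errors()]
print(missing)  # ['contract_id']

# The descriptions travel with the JSON schema that Pydantic generates.
schema = ContractStub.model_json_schema()
print(schema["properties"]["contract_id"]["description"])
print(schema["required"])  # ['contract_id']
```

Running a check like this on your own schema is a quick way to confirm which fields will fail validation when the model returns nothing for them.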
Step 3: Preparing the Sample Contract Text
Now, let’s create a simplified, simulated legal contract snippet. Remember, in a real scenario, this would be the content of a PDF, Word document, or a scanned image that has been OCR’d into text.
Add the following text to your contract_extractor.py file, after the schema definition:
# contract_extractor.py (continued)
sample_contract_text = """
CONTRACT AGREEMENT
This Contract Agreement ("Agreement") is made and entered into as of 2025-10-26 (the "Effective Date"),
by and between Tech Innovations Inc., a company registered in Delaware ("The Company"),
and Global Solutions LLC, a company registered in New York ("The Client").
WHEREAS, The Company desires to provide software development services to The Client, and The Client desires
to procure such services from The Company;
NOW, THEREFORE, in consideration of the mutual covenants and agreements hereinafter set forth, the parties hereto agree as follows:
1. **Services.** The Company shall provide custom software development services as detailed in Schedule A.
2. **Compensation.** The Client shall pay The Company a total sum of $150,000 (One Hundred Fifty Thousand US Dollars)
for the services rendered under this Agreement. Payment terms are net 30 days.
3. **Governing Law.** This Agreement shall be governed by and construed in accordance with the laws of the State of Delaware,
without regard to its conflict of laws principles.
4. **Contract ID.** The unique identifier for this agreement is TI-GS-2025-001.
"""
This text contains all the information our schema is looking for, presented in a typical legal document style.
Step 4: Performing the Extraction
Finally, let’s use langextract to extract the structured data from our sample contract text.
Add this to the end of your contract_extractor.py file:
# contract_extractor.py (continued)
import langextract as lx

if __name__ == "__main__":
    print("Attempting to extract legal contract details...")
    try:
        # Initialize LangExtract with your chosen LLM.
        # 'gemini-pro' is a good general-purpose model from Google.
        extractor = lx.Extractor(model_name="gemini-pro")

        # Perform the extraction
        result = extractor.extract(
            text=sample_contract_text,
            schema=LegalContractDetails,
            max_workers=1  # For short texts, 1 worker is sufficient
        )

        # Print the extracted data
        if result.parsed_object:
            print("\nExtraction Successful!")
            print(result.parsed_object.model_dump_json(indent=2))  # Use model_dump_json for Pydantic v2
        else:
            print("\nExtraction failed or returned no data.")
            if result.errors:
                print("Errors encountered:", result.errors)
            if result.raw_response:
                print("Raw LLM response (partial):", result.raw_response[:500])  # Print first 500 chars
    except Exception as e:
        print(f"\nAn error occurred during extraction: {e}")
        print("Please ensure your GOOGLE_API_KEY is set and valid, and you have network access.")
Explanation of the new code:
- import langextract as lx: Imports the LangExtract library.
- if __name__ == "__main__":: Ensures the extraction code runs only when the script is executed directly.
- extractor = lx.Extractor(model_name="gemini-pro"): This creates an instance of the Extractor class. We specify model_name="gemini-pro" to use Google’s Gemini Pro model. LangExtract automatically uses your GOOGLE_API_KEY environment variable.
- result = extractor.extract(...): This is the core function call.
  - text=sample_contract_text: Our input document.
  - schema=LegalContractDetails: The Pydantic schema we defined. LangExtract will instruct the LLM to output data conforming to this schema.
  - max_workers=1: For short texts, a single worker is fine. For very long documents, max_workers (e.g., up to 10, as per common recommendations) can process chunks in parallel, speeding up extraction.
- result.parsed_object.model_dump_json(indent=2): If the extraction is successful, result.parsed_object will contain an instance of our LegalContractDetails Pydantic model. model_dump_json() (for Pydantic v2) converts this object into a nicely formatted JSON string.
- Error Handling: We include a try-except block to catch potential API errors or issues with the extraction process, providing helpful messages.
Now, run your script from the terminal:
python contract_extractor.py
You should see output similar to this (actual content may vary slightly due to LLM non-determinism):
Extraction Successful!
{
  "contract_id": "TI-GS-2025-001",
  "parties": [
    "Tech Innovations Inc.",
    "Global Solutions LLC"
  ],
  "effective_date": "2025-10-26",
  "contract_value": "$150,000 (One Hundred Fifty Thousand US Dollars)",
  "governing_law": "State of Delaware"
}
Congratulations! You’ve successfully extracted structured data from a simulated legal contract using LangExtract and Pydantic.
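Because the schema stores effective_date and contract_value as strings, downstream code usually needs to turn them into typed values. The following is a minimal sketch using only the standard library; parse_effective_date and parse_contract_value are hypothetical helper names, and the number pattern assumes amounts written with comma thousands separators, as in the sample contract.

```python
import re
from datetime import date
from decimal import Decimal
from typing import Optional


def parse_effective_date(raw: str) -> Optional[date]:
    """Return a date if the string is an ISO date such as YYYY-MM-DD, else None."""
    try:
        return date.fromisoformat(raw.strip())
    except ValueError:
        return None


def parse_contract_value(raw: str) -> Optional[Decimal]:
    """Pull the first number such as '150,000' or '150000.50' out of the string."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", raw)
    if match is None:
        return None
    return Decimal(match.group(0).replace(",", ""))


print(parse_effective_date("2025-10-26"))  # 2025-10-26
print(parse_effective_date("October 26, 2025"))  # None
print(parse_contract_value("$150,000 (One Hundred Fifty Thousand US Dollars)"))  # 150000
```

A None result is a useful signal: it marks a value the LLM returned in an unexpected format and that deserves a second look.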
Step 5: Reviewing and Refining with Interactive Visualization (Brief Mention)
For more complex or longer documents, simply printing the JSON isn’t enough. LangExtract offers powerful interactive visualization tools to help you review the extraction and debug issues. While beyond the scope of this simple example, remember that result.visualize() can be called to launch a local web interface where you can see:
- The original text, with extracted entities highlighted.
- Which chunks of text contributed to which extracted fields.
- The raw LLM responses.
This tool is invaluable for understanding why an LLM extracted certain information or failed to extract others, allowing you to refine your schema or prompt instructions.
Mini-Challenge: Expanding Our Contract Schema
You’ve done a fantastic job with the initial extraction! Now, let’s make it a bit more complex.
Challenge: Imagine our legal team also needs to know the term of the contract – how long it’s valid for.
- Modify the LegalContractDetails schema: Add a new Optional[str] field called contract_term. Give it a clear description that explains what “contract term” means (e.g., “The duration for which the contract is valid, e.g., ‘1 year’ or ‘until December 31, 2026’”).
- Update the sample_contract_text: Add a new clause to the contract text that specifies a contract term, for example: “5. Term. This Agreement shall commence on the Effective Date and continue for a period of one (1) year.”
- Re-run the extraction: Observe if LangExtract successfully identifies and extracts the new contract_term.
Hint: Pay close attention to the description you provide for the contract_term field in your Pydantic schema. Clear instructions lead to better extraction!
Solution Hint: Make sure your new clause in sample_contract_text clearly states the duration. For the schema, define contract_term: Optional[str] = Field(default=None, description="...").
Common Pitfalls & Troubleshooting
Working with LLMs for extraction, especially in sensitive domains like legal, can present a few challenges.
Schema Mismatch or Missing Data:
- Problem: The LLM either returns an empty field, incorrect data, or fails to conform to your schema.
- Solution:
  - Refine Field descriptions: Make your descriptions in the Pydantic schema as explicit and unambiguous as possible. Think about how you’d explain it to a human.
  - Check text quality: Is the information actually present in the input text? Is it clear enough for an LLM to understand?
  - Add examples (Advanced): For very tricky fields, you can sometimes include examples directly in the prompt or use LangExtract’s advanced features for few-shot prompting, though for simple cases, schema descriptions are usually sufficient.
  - Use result.visualize(): This is your best friend for debugging. It helps you see what the LLM saw and thought.
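A quick way to spot missing data is to list the fields that came back empty. The sketch below assumes the extracted result has already been converted to a plain dictionary (for a Pydantic v2 model, model_dump() does this); empty_fields is a hypothetical helper, and the sample dictionary is made up for illustration.

```python
from typing import Any, Dict, List


def empty_fields(extracted: Dict[str, Any]) -> List[str]:
    """Return the names of fields that are None, blank strings, or empty lists."""
    missing = []
    for name, value in extracted.items():
        if value is None:
            missing.append(name)
        elif isinstance(value, str) and not value.strip():
            missing.append(name)
        elif isinstance(value, (list, tuple)) and len(value) == 0:
            missing.append(name)
    return missing


# Hypothetical result in which two fields came back empty.
extracted = {
    "contract_id": "TI-GS-2025-001",
    "parties": [],
    "effective_date": "2025-10-26",
    "contract_value": None,
    "governing_law": "State of Delaware",
}
print(empty_fields(extracted))  # ['parties', 'contract_value']
```

Fields that show up here repeatedly across documents are good candidates for a sharper description.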
API Key or Network Issues:
- Problem: The script fails with connection errors or authentication failures.
- Solution: Double-check that your GOOGLE_API_KEY (or equivalent for your chosen LLM) environment variable is correctly set and hasn’t expired. Ensure you have a stable internet connection.
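Before calling the LLM, it can help to confirm that the variable is actually visible to Python. This is a minimal standard-library sketch; check_api_key is a hypothetical helper, and it only checks that the variable is present and well formed, not that the key is valid or unexpired.

```python
import os
from typing import Mapping, Optional


def check_api_key(env: Mapping[str, str], name: str = "GOOGLE_API_KEY") -> Optional[str]:
    """Return a problem description, or None if the variable looks usable."""
    value = env.get(name)
    if value is None:
        return f"{name} is not set in this environment."
    if not value.strip():
        return f"{name} is set but empty."
    if value != value.strip() or value.startswith(('"', "'")):
        return f"{name} has stray quotes or whitespace around it."
    if value == "YOUR_API_KEY_HERE":
        return f"{name} still holds the placeholder value."
    return None


problem = check_api_key(os.environ)
print(problem or "GOOGLE_API_KEY looks set.")
```

Passing the environment in as an argument keeps the helper easy to test with a plain dictionary.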
LLM Hallucinations or Inaccuracies (Critical for Legal):
- Problem: The LLM confidently extracts information that is not present in the document, or extracts incorrect details. This is particularly dangerous in legal contexts.
- Solution:
  - Human-in-the-Loop: For high-stakes applications, always involve human review of extracted legal data. LangExtract is an accelerator, not a fully autonomous legal agent.
  - Grounding (Advanced): LangExtract has features for “grounding,” which means tracing the extracted information back to its source in the original document. This helps verify accuracy.
  - Prompt Engineering: Experiment with your schema descriptions and potentially add overall instructions to the Extractor to emphasize factual accuracy and adherence to the document.
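As a first line of defense against hallucinated values, you can check whether each extracted string literally occurs in the source text. The sketch below is a plain substring check written with the standard library, not LangExtract’s own grounding feature; locate is a hypothetical helper, the sample values are made up, and the case-insensitive match assumes text where lowercasing does not change string length (true for typical English contracts).

```python
from typing import Dict, Optional, Tuple


def locate(value: str, source: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) offsets of value in source, ignoring case, or None."""
    start = source.lower().find(value.lower())
    if start == -1:
        return None
    return start, start + len(value)


source_text = (
    "This Agreement is made as of 2025-10-26 by and between "
    "Tech Innovations Inc. and Global Solutions LLC."
)

# Hypothetical extracted values; the last one does not occur in the text.
extracted: Dict[str, str] = {
    "effective_date": "2025-10-26",
    "party": "Global Solutions LLC",
    "governing_law": "State of Texas",
}

for field, value in extracted.items():
    span = locate(value, source_text)
    status = f"found at {span}" if span else "NOT FOUND, needs human review"
    print(f"{field}: {status}")
```

A value that is not found is not necessarily wrong, since the model may have normalized a date or an amount, but it is exactly the kind of field a human reviewer should look at first.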
Summary
In this chapter, you’ve taken a significant step forward, applying LangExtract to a real-world project: extracting structured information from legal contracts.
Here are the key takeaways:
- Schema is King: A well-defined Pydantic schema with clear Field descriptions is crucial for precise extraction from complex documents.
- Practical Application: LangExtract shines in high-value scenarios like legal document processing, turning unstructured text into actionable data.
- Incremental Building: We built our solution step-by-step, from schema definition to execution, explaining each part.
- Debugging Tools: result.visualize() is an essential tool for understanding and refining extraction results, especially for complex texts.
- Accuracy is Paramount: For legal data, always prioritize accuracy, leveraging human review and advanced grounding techniques when necessary.
You’re now equipped to tackle more intricate extraction tasks. In the next chapter, we’ll delve deeper into handling very long documents, exploring advanced chunking strategies and multi-pass extraction to maintain accuracy and efficiency.
References
- LangExtract GitHub Repository: The official source for the library, including documentation and examples. https://github.com/google/langextract
- Pydantic Documentation (v2): Comprehensive guide to defining data schemas in Python. https://docs.pydantic.dev/latest/
- Google AI Studio Documentation: Information on obtaining API keys and using Google’s Gemini models. https://ai.google.dev/
- Towards Data Science - Extracting Structured Data with LangExtract: An article discussing LangExtract’s workflow and capabilities. https://towardsdatascience.com/extracting-structured-data-with-langextract-a-deep-dive-into-llm-orchestrated-workflows/