Welcome back, intrepid data explorer! In our previous chapters, you’ve mastered the fundamentals of LangExtract, from setting up your environment to crafting precise extraction schemas and understanding the nuances of prompt engineering. Now, it’s time to put those skills to the test with a real-world, highly valuable application: extracting structured information from financial reports.
Financial reports, such as earnings call transcripts, annual reports, or quarterly statements, are treasure troves of critical business data. However, sifting through pages of unstructured text, tables, and disclosures to find specific metrics or key highlights can be incredibly time-consuming. This chapter will guide you through building a LangExtract solution to automate this process, allowing you to quickly pull out crucial financial data points and summarize key sections.
By the end of this project, you’ll not only have a working example of LangExtract in action but also a deeper understanding of how to tackle complex, multi-faceted extraction tasks on longer documents. We’ll focus on designing robust schemas, fine-tuning extraction instructions for financial jargon, and considering strategies for handling the often-extensive length of these reports. Get ready to transform unstructured financial text into actionable, structured data!
Understanding Financial Report Structure
Before we jump into coding, let’s briefly consider the common structure of financial reports. While every report is unique, they often contain recurring elements:
- Executive Summary/Highlights: A brief overview of performance and key achievements.
- Key Financial Metrics: Revenue, Net Income, Earnings Per Share (EPS), Operating Expenses, Cash Flow, etc.
- Segmented Data: Breakdown of revenue or profit by business unit, geography, or product line.
- Forward-Looking Statements/Outlook: Management’s projections and expectations for future performance.
- Risk Factors: Potential challenges or uncertainties.
Our goal is to design an extraction schema that can capture these diverse pieces of information, even when they are scattered across different sections of a document.
Designing the Extraction Schema for Financial Data
The heart of any LangExtract project is your schema. For financial reports, we need a schema that is flexible enough to capture various data points but also specific enough to guide the LLM effectively. We’ll use Pydantic to define our desired output structure.
Let’s imagine we want to extract the company name, fiscal year, total revenue, net income, and a summary of key highlights from an earnings report.
Think about it: What data types would be appropriate for each of these fields? Revenue and net income are usually numbers, the year is an integer, and the company name and highlights are strings.
Here’s how we might define such a schema using Pydantic:
```python
# Save this as financial_schema.py
from pydantic import BaseModel, Field
from typing import List, Optional


class FinancialReportSummary(BaseModel):
    """
    Structured summary of a financial report.
    """

    company_name: str = Field(description="The full official name of the company.")
    fiscal_year: int = Field(description="The fiscal year the report pertains to.")
    total_revenue: float = Field(description="The total revenue reported, in millions or billions, as a numeric value.")
    net_income: float = Field(description="The net income (profit) reported, in millions or billions, as a numeric value.")
    key_highlights: List[str] = Field(description="A list of the 3-5 most important strategic or financial highlights mentioned in the report.")
    outlook_sentiment: Optional[str] = Field(
        default=None,
        description="Overall sentiment regarding the future outlook (e.g., 'Positive', 'Cautious', 'Neutral').",
    )

    class Config:
        json_schema_extra = {
            "example": {
                "company_name": "Tech Innovators Inc.",
                "fiscal_year": 2025,
                "total_revenue": 1250.5,  # in millions
                "net_income": 320.1,  # in millions
                "key_highlights": [
                    "Achieved record revenue growth of 15% year-over-year.",
                    "Launched new flagship product line.",
                    "Expanded into two new international markets.",
                ],
                "outlook_sentiment": "Positive",
            }
        }
```
Explanation of the Schema:
- `FinancialReportSummary(BaseModel)`: This class inherits from Pydantic's `BaseModel`, making it a data validation and serialization powerhouse.
- `company_name: str`, `fiscal_year: int`: These are standard type hints. Pydantic will ensure the extracted data matches these types.
- `total_revenue: float`, `net_income: float`: We're using `float` for financial figures, as they often include decimals. The `description` in `Field` is crucial: it tells the LLM exactly what kind of number to look for and in what units (e.g., "in millions or billions").
- `key_highlights: List[str]`: This tells the LLM to extract multiple distinct strings as a list, rather than a single paragraph. We also specify a desired count (3-5) to guide the summary.
- `outlook_sentiment: Optional[str]`: This field is `Optional`, meaning it might not always be present, and has a `default` value of `None`. We provide a clear description and example values for the LLM.
- `class Config`: The `json_schema_extra` provides an example of what the output should look like, which is incredibly helpful for the LLM to understand the desired format.
By defining this schema, we give LangExtract a clear blueprint for the data we want, improving accuracy and consistency.
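Before any LLM is involved, you can see the schema's type enforcement in action. The snippet below is a quick sanity check using a trimmed-down version of the model above (it assumes Pydantic v2): well-formed data passes, and data in the wrong format raises a `ValidationError`, which is exactly how a malformed LLM response gets caught.

```python
from pydantic import BaseModel, Field, ValidationError
from typing import List, Optional


class FinancialReportSummary(BaseModel):
    """Trimmed-down version of the chapter's schema, for a validation demo."""

    company_name: str
    fiscal_year: int
    total_revenue: float
    key_highlights: List[str]
    outlook_sentiment: Optional[str] = None


# Well-formed data passes validation.
ok = FinancialReportSummary(
    company_name="Global Dynamics Corp.",
    fiscal_year=2025,
    total_revenue=1250.5,
    key_highlights=["Record revenue growth of 15% year-over-year."],
)
print(ok.fiscal_year)

# Malformed data is rejected before it can pollute downstream analysis.
try:
    FinancialReportSummary(
        company_name="Global Dynamics Corp.",
        fiscal_year=2025,
        total_revenue="one point two billion",  # not coercible to float
        key_highlights=[],
    )
except ValidationError as e:
    print(f"Rejected: {e.error_count()} validation error(s)")
```

This is the safety net the schema gives you for free: any extraction that doesn't match the declared types fails loudly instead of silently.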
Step-by-Step Implementation: Extracting from a Sample Report
Now, let’s put our schema to work. We’ll use a short, simulated financial report excerpt for our first extraction. Remember, in a real scenario, you’d feed in the actual text of a report.
Prerequisites: Ensure you have LangExtract installed and your LLM provider configured as discussed in Chapter 2. For example, if using OpenAI:
```python
import os

# Set your OpenAI API key (replace with your actual key or an environment variable)
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"
```
Or for Google’s models:
```python
# For Google models, ensure you have authenticated, e.g., via `gcloud auth application-default login`,
# or set GOOGLE_API_KEY if using specific API keys.
```
Let’s begin by creating our Python script.
1. Initialize Your Project and Import Necessary Libraries
Create a new Python file, say extract_financial_data.py.
```python
# extract_financial_data.py
import langextract as lx
import os

from financial_schema import FinancialReportSummary  # Import our Pydantic schema

# Ensure your LLM provider API key is set.
# For OpenAI:
# os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"
# For Google's models (e.g., Gemini):
# You might need to set GOOGLE_API_KEY or ensure gcloud authentication is active.
# For this example, we'll assume an LLM is configured and accessible.

print("LangExtract and schema imported successfully!")
```
Explanation:
- `import langextract as lx`: This imports the LangExtract library, making its functions available under the alias `lx`.
- `import os`: Used to interact with environment variables, typically for API keys.
- `from financial_schema import FinancialReportSummary`: We import the Pydantic schema we defined earlier, so LangExtract knows the target structure.
2. Prepare Sample Financial Report Text
We’ll use a simplified piece of text for demonstration. In a real application, this would be the content read from a PDF, HTML, or plain text file.
```python
# ... (previous code) ...

sample_report_text = """
**FOR IMMEDIATE RELEASE**
**Global Dynamics Corp. Announces Strong Q4 and Fiscal Year 2025 Results**
**CITY, STATE – January 15, 2026** – Global Dynamics Corp. (NYSE: GDC) today announced its financial results for the fourth quarter and full fiscal year ended December 31, 2025. The company delivered robust performance across all segments, driven by strong demand for its innovative software solutions.
**Fiscal Year 2025 Highlights:**
* Total revenue reached an all-time high of $1,250.5 million, marking a 15% increase from the prior fiscal year.
* Net income for the fiscal year was $320.1 million, demonstrating strong profitability and efficient operations.
* Successfully launched 'NexusOS 2.0', our next-generation operating system, which has seen rapid adoption.
* Expanded strategic partnerships in key emerging markets, strengthening our global footprint.
* Invested heavily in AI research and development, positioning us for future growth.
**Outlook:**
Looking ahead to fiscal year 2026, Global Dynamics Corp. anticipates continued momentum. We project revenue growth in the range of 10-12% and expect to further enhance our market leadership. The overall sentiment remains positive, with a focus on sustainable innovation and customer success.
"""

print("\nSample report text loaded.")
```
Explanation:
- `sample_report_text`: A multi-line string holding our simulated financial data. Notice it contains all the information our schema is looking for.
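In a real pipeline you would read this text from disk rather than hard-coding it. A minimal sketch of loading a plain-text report follows; the filename `q4_report.txt` is purely illustrative, and PDF or HTML sources would first need a parsing library (e.g., pypdf or BeautifulSoup), which is beyond this chapter's scope.

```python
from pathlib import Path

# Hypothetical setup: write a report excerpt to disk, then read it back the
# way a real pipeline would read a plain-text filing.
report_path = Path("q4_report.txt")  # illustrative filename
report_path.write_text(
    "Total revenue reached an all-time high of $1,250.5 million.",
    encoding="utf-8",
)

sample_report_text = report_path.read_text(encoding="utf-8")
print(len(sample_report_text), "characters loaded")
```

From here, `sample_report_text` is used exactly as in the hard-coded example.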
3. Craft the Extraction Instruction and Perform Extraction
Now, we’ll tell LangExtract what to do. The lx.extract function is our workhorse.
```python
# ... (previous code) ...

# Define the extraction instruction
extraction_instruction = "Extract the key financial metrics and highlights from this report."

print("\nAttempting to extract data...")

# Perform the extraction
try:
    result: FinancialReportSummary = lx.extract(
        text_or_document=sample_report_text,
        instruction=extraction_instruction,
        schema=FinancialReportSummary,
        llm_config={"model": "gpt-4-turbo-preview"},  # Or "gemini-pro", etc.
    )

    print("\nExtraction successful! Here's the structured data:")
    print(result.model_dump_json(indent=2))  # .model_dump_json for Pydantic v2.x

    print(f"\nCompany Name: {result.company_name}")
    print(f"Fiscal Year: {result.fiscal_year}")
    print(f"Total Revenue: ${result.total_revenue} million")
    print(f"Net Income: ${result.net_income} million")
    print("\nKey Highlights:")
    for i, highlight in enumerate(result.key_highlights):
        print(f"  {i + 1}. {highlight}")
    print(f"Outlook Sentiment: {result.outlook_sentiment}")

except Exception as e:
    print(f"\nAn error occurred during extraction: {e}")
    print("Please ensure your LLM configuration is correct and your API key is valid.")
```
Explanation:
- `extraction_instruction`: A natural-language prompt telling the LLM what kind of information to find. Keep it clear and concise.
- `lx.extract(...)`:
  - `text_or_document`: Our `sample_report_text` is passed here. For longer documents, you could pass a file path or a `Document` object.
  - `instruction`: Our natural-language instruction.
  - `schema`: Crucially, we pass our `FinancialReportSummary` Pydantic model. This tells LangExtract and the underlying LLM the exact structure and types we expect in the output.
  - `llm_config`: This dictionary specifies which LLM to use. As of early 2026, `gpt-4-turbo-preview` (OpenAI) and `gemini-pro` (Google) are strong choices. Make sure this matches your configured LLM provider.
- `result.model_dump_json(indent=2)`: For Pydantic v2.x, `model_dump_json` is the method that serializes the model instance to a JSON string; `indent=2` pretty-prints it.
- The `try...except` block is good practice for error handling, especially around external API calls.
Run the script. You should see the extracted data printed in a structured JSON format and then field by field:

```bash
python extract_financial_data.py
```
Expected Output (may vary slightly based on LLM):
```
LangExtract and schema imported successfully!

Sample report text loaded.

Attempting to extract data...

Extraction successful! Here's the structured data:
{
  "company_name": "Global Dynamics Corp.",
  "fiscal_year": 2025,
  "total_revenue": 1250.5,
  "net_income": 320.1,
  "key_highlights": [
    "Total revenue reached an all-time high of $1,250.5 million, marking a 15% increase from the prior fiscal year.",
    "Net income for the fiscal year was $320.1 million, demonstrating strong profitability and efficient operations.",
    "Successfully launched 'NexusOS 2.0', our next-generation operating system, which has seen rapid adoption.",
    "Expanded strategic partnerships in key emerging markets, strengthening our global footprint.",
    "Invested heavily in AI research and development, positioning us for future growth."
  ],
  "outlook_sentiment": "Positive"
}

Company Name: Global Dynamics Corp.
Fiscal Year: 2025
Total Revenue: $1250.5 million
Net Income: $320.1 million

Key Highlights:
  1. Total revenue reached an all-time high of $1,250.5 million, marking a 15% increase from the prior fiscal year.
  2. Net income for the fiscal year was $320.1 million, demonstrating strong profitability and efficient operations.
  3. Successfully launched 'NexusOS 2.0', our next-generation operating system, which has seen rapid adoption.
  4. Expanded strategic partnerships in key emerging markets, strengthening our global footprint.
  5. Invested heavily in AI research and development, positioning us for future growth.
Outlook Sentiment: Positive
```
Congratulations! You’ve successfully extracted structured data from a simulated financial report.
4. Handling Longer Documents with Chunking
Financial reports are rarely as short as our example. They can be dozens or even hundreds of pages long. Sending an entire massive document to an LLM in one go is problematic for several reasons:
- Token Limits: LLMs have a maximum context window (token limit). Large documents will exceed this.
- Cost: Processing more tokens costs more.
- Accuracy: Very long inputs can sometimes dilute the LLM’s focus, leading to less accurate extractions.
LangExtract is designed to handle this by intelligently chunking the document and processing these chunks, potentially in parallel, and then merging the results. This is often done automatically when you pass a Document object.
Let’s illustrate the concept of chunking and how LangExtract orchestrates it. While lx.extract can often handle this internally, understanding the process helps with advanced tuning.
Conceptual Flow for Long Documents:
1. Large Financial Report Text: Your full, lengthy document.
2. LangExtract's Internal Chunking: LangExtract automatically splits the text into manageable pieces.
3. Chunks 1 through N: Each chunk is a smaller section of the original text.
4. LLM Extraction per Chunk: The LLM processes each chunk individually, extracting relevant data. Chunks can be processed in parallel to improve performance (a `max_workers` parameter can be used).
5. Aggregate and Merge Results: LangExtract combines the partial extractions from each chunk into a single, coherent result, handling potential overlaps or conflicts. This is where its orchestration capabilities shine.
6. Final Structured Output: The complete, structured data for the entire document.
For most use cases, simply passing the entire document text or a `langextract.Document` object to `lx.extract` will trigger this chunking and merging. You can often control chunking parameters via `llm_config` or other `lx.extract` arguments for fine-tuning, though the defaults are usually a good starting point. The Towards Data Science article listed in the references highlights "smart chunking strategies" as a key benefit.
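LangExtract handles chunking for you, but the core idea is easy to sketch by hand. The splitter below is a minimal illustration, not LangExtract's actual implementation: fixed-size character windows with a small overlap, so a sentence cut at one chunk boundary still appears whole at the start of the next chunk.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into fixed-size windows that overlap by `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # final chunk reached the end of the text
        start += chunk_size - overlap  # step back by `overlap` for continuity
    return chunks


report = "A" * 2500  # stand-in for a long report
chunks = chunk_text(report, chunk_size=1000, overlap=100)
print(len(chunks), [len(c) for c in chunks])  # 3 [1000, 1000, 700]
```

Production chunkers typically split on sentence or section boundaries rather than raw character counts, but the overlap principle is the same.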
Mini-Challenge: Expanding the Schema with Expense Categories
You’ve successfully extracted basic financial data. Now, let’s make our schema a bit more detailed.
Challenge:
Modify your FinancialReportSummary schema to include a new field: operating_expenses. This field should be a float representing the total operating expenses reported for the fiscal year. Additionally, add a breakdown_by_segment field, which should be a List of dictionaries, where each dictionary has a segment_name (string) and segment_revenue (float).
Hint:
- Remember to update your `sample_report_text` to include some dummy operating expense and segment revenue data so the LLM has something to find.
- You'll need to define a new Pydantic `BaseModel` for the `breakdown_by_segment` items.
What to Observe/Learn:
- How adding more detailed fields impacts the LLM's ability to extract specific information.
- The importance of providing clear descriptions in your `Field` definitions.
- How to define nested Pydantic models for complex, structured data.
Take a few minutes to modify your financial_schema.py and extract_financial_data.py files.
Click for Solution (after you've tried it!)
Updated financial_schema.py:
```python
# financial_schema.py
from pydantic import BaseModel, Field
from typing import List, Optional


class SegmentRevenue(BaseModel):
    """
    Revenue breakdown for a specific business segment.
    """

    segment_name: str = Field(description="The name of the business segment.")
    segment_revenue: float = Field(description="The revenue for this segment, in millions or billions.")


class FinancialReportSummary(BaseModel):
    """
    Structured summary of a financial report.
    """

    company_name: str = Field(description="The full official name of the company.")
    fiscal_year: int = Field(description="The fiscal year the report pertains to.")
    total_revenue: float = Field(description="The total revenue reported, in millions or billions, as a numeric value.")
    operating_expenses: float = Field(description="The total operating expenses reported, in millions or billions, as a numeric value.")
    net_income: float = Field(description="The net income (profit) reported, in millions or billions, as a numeric value.")
    key_highlights: List[str] = Field(description="A list of the 3-5 most important strategic or financial highlights mentioned in the report.")
    breakdown_by_segment: List[SegmentRevenue] = Field(
        default_factory=list,
        description="A list of revenue figures broken down by business segment.",
    )
    outlook_sentiment: Optional[str] = Field(
        default=None,
        description="Overall sentiment regarding the future outlook (e.g., 'Positive', 'Cautious', 'Neutral').",
    )

    class Config:
        json_schema_extra = {
            "example": {
                "company_name": "Tech Innovators Inc.",
                "fiscal_year": 2025,
                "total_revenue": 1250.5,
                "operating_expenses": 750.0,
                "net_income": 320.1,
                "key_highlights": [
                    "Achieved record revenue growth of 15% year-over-year.",
                    "Launched new flagship product line.",
                ],
                "breakdown_by_segment": [
                    {"segment_name": "Software Solutions", "segment_revenue": 800.0},
                    {"segment_name": "Hardware Devices", "segment_revenue": 450.5},
                ],
                "outlook_sentiment": "Positive",
            }
        }
```
Updated extract_financial_data.py (with modified sample_report_text and output printing):
```python
# extract_financial_data.py
import langextract as lx
import os

from financial_schema import FinancialReportSummary, SegmentRevenue  # Import the new SegmentRevenue schema

# Ensure your LLM provider API key is set.
# For OpenAI:
# os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"
# For Google's models (e.g., Gemini):
# You might need to set GOOGLE_API_KEY or ensure gcloud authentication is active.

print("LangExtract and schema imported successfully!")

sample_report_text = """
**FOR IMMEDIATE RELEASE**
**Global Dynamics Corp. Announces Strong Q4 and Fiscal Year 2025 Results**
**CITY, STATE – January 15, 2026** – Global Dynamics Corp. (NYSE: GDC) today announced its financial results for the fourth quarter and full fiscal year ended December 31, 2025. The company delivered robust performance across all segments, driven by strong demand for its innovative software solutions.
**Fiscal Year 2025 Highlights:**
* Total revenue reached an all-time high of $1,250.5 million, marking a 15% increase from the prior fiscal year.
* Operating expenses for the fiscal year amounted to $750.0 million, reflecting strategic investments.
* Net income for the fiscal year was $320.1 million, demonstrating strong profitability and efficient operations.
* Successfully launched 'NexusOS 2.0', our next-generation operating system, which has seen rapid adoption.
* Expanded strategic partnerships in key emerging markets, strengthening our global footprint.
* Invested heavily in AI research and development, positioning us for future growth.
**Revenue Breakdown by Segment:**
* Software Solutions: $800.0 million
* Hardware Devices: $450.5 million
**Outlook:**
Looking ahead to fiscal year 2026, Global Dynamics Corp. anticipates continued momentum. We project revenue growth in the range of 10-12% and expect to further enhance our market leadership. The overall sentiment remains positive, with a focus on sustainable innovation and customer success.
"""

print("\nSample report text loaded.")

extraction_instruction = "Extract the key financial metrics, operating expenses, revenue breakdown by segment, and highlights from this report."

print("\nAttempting to extract data with expanded schema...")

try:
    result: FinancialReportSummary = lx.extract(
        text_or_document=sample_report_text,
        instruction=extraction_instruction,
        schema=FinancialReportSummary,
        llm_config={"model": "gpt-4-turbo-preview"},  # Or "gemini-pro", etc.
    )

    print("\nExtraction successful! Here's the structured data:")
    print(result.model_dump_json(indent=2))

    print(f"\nCompany Name: {result.company_name}")
    print(f"Fiscal Year: {result.fiscal_year}")
    print(f"Total Revenue: ${result.total_revenue} million")
    print(f"Operating Expenses: ${result.operating_expenses} million")
    print(f"Net Income: ${result.net_income} million")
    print("\nKey Highlights:")
    for i, highlight in enumerate(result.key_highlights):
        print(f"  {i + 1}. {highlight}")
    print("\nRevenue Breakdown by Segment:")
    for segment in result.breakdown_by_segment:
        print(f"  - {segment.segment_name}: ${segment.segment_revenue} million")
    print(f"Outlook Sentiment: {result.outlook_sentiment}")

except Exception as e:
    print(f"\nAn error occurred during extraction: {e}")
    print("Please ensure your LLM configuration is correct and your API key is valid.")
```
Common Pitfalls & Troubleshooting in Financial Data Extraction
Working with financial reports and LLMs can present unique challenges. Here are a few common pitfalls and how to approach them:
Numerical Inaccuracies or Hallucinations:
- Pitfall: The LLM might sometimes extract a number incorrectly (e.g., $1.25 billion instead of $1.25 million, or just a wrong digit) or even invent a number if it can’t find it.
- Troubleshooting:
  - Be Specific in Schema Descriptions: Emphasize units (e.g., "in millions of USD," "as a numeric value, without currency symbols").
  - Provide Examples in `json_schema_extra`: A clear `example` object significantly guides the LLM on expected numerical formats.
  - Post-Processing & Validation: For critical financial data, always implement programmatic checks. Does `net_income` make sense relative to `total_revenue` and `operating_expenses`? (For example, `net_income` should typically be less than `total_revenue`.)
  - Grounding (Advanced): LangExtract offers features for "source grounding," which can link extracted data back to its original text location, allowing for human verification.
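To make the post-processing idea concrete, here is a small sketch of such programmatic checks. The thresholds are illustrative heuristics, not accounting rules; the goal is only to flag suspicious extractions for human review, not to prove them correct.

```python
def sanity_check(total_revenue: float, operating_expenses: float, net_income: float) -> list[str]:
    """Return a list of warnings for extracted figures that look inconsistent."""
    warnings = []
    if net_income > total_revenue:
        warnings.append("net_income exceeds total_revenue: possible unit mix-up (millions vs. billions?)")
    if operating_expenses > total_revenue:
        warnings.append("operating_expenses exceed total_revenue: verify against the source text")
    if any(v < 0 for v in (total_revenue, operating_expenses)):
        warnings.append("negative revenue or expenses: likely a sign or parsing error")
    return warnings


# Figures from the sample report pass cleanly...
print(sanity_check(1250.5, 750.0, 320.1))  # []
# ...while a millions/billions mix-up on revenue is flagged twice.
print(sanity_check(1.2505, 750.0, 320.1))
```

Checks like these are cheap to run on every extraction and catch exactly the class of error (unit confusion, sign flips) that LLMs are most prone to.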
Schema Mismatch or Incomplete Extraction:
- Pitfall: The LLM fails to populate all fields in your schema, or it provides data in a format that Pydantic rejects (e.g., a string where a float is expected).
- Troubleshooting:
  - Review Your Prompt (`instruction`): Is your instruction clear and comprehensive? Does it explicitly ask for all the fields you want?
  - Check Schema Descriptions: Are your `Field` descriptions precise? The LLM relies heavily on these.
  - Ensure Data Presence: Is the information actually in the source text? If a field may be absent, mark it `Optional` in your schema.
  - Iterative Refinement: Start with a simpler schema, get it working, then gradually add complexity.
  - Debugging LLM Output: If LangExtract returns an error, the underlying LLM might be producing invalid JSON. Check the raw LLM output if possible (some LangExtract configurations or debug modes may expose this).
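When debugging, it helps to validate a raw JSON string against your schema yourself, since Pydantic's error report pinpoints exactly which field the LLM got wrong. A minimal sketch, assuming Pydantic v2's `model_validate_json` and a deliberately broken response:

```python
from pydantic import BaseModel, ValidationError


class Mini(BaseModel):
    """Tiny stand-in schema for the debugging demo."""

    company_name: str
    total_revenue: float


# Simulated raw LLM output with a bad value for total_revenue.
raw = '{"company_name": "Global Dynamics Corp.", "total_revenue": "N/A"}'

errors = []
try:
    Mini.model_validate_json(raw)
except ValidationError as e:
    errors = e.errors()  # structured error records, one per bad field

print([err["loc"] for err in errors])
```

Each error record names the offending field (`loc`) and the reason, which tells you whether to fix the prompt, the schema description, or the source text.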
Performance and Token Limits with Very Long Reports:
- Pitfall: Processing extremely long annual reports can be slow, costly, or hit LLM token limits even with LangExtract’s internal chunking.
- Troubleshooting:
  - Pre-process Documents: If possible, segment the document before feeding it to LangExtract. For example, extract only the "Financial Highlights" or "Management Discussion & Analysis" sections if those are your target areas.
  - Optimize `llm_config`: Experiment with different LLM models. Smaller, faster models might be sufficient for certain tasks, while higher-context-window models (like `gpt-4-turbo-preview` with its 128k-token window) might be necessary for specific details.
  - Batch Processing: If you have many reports, process them in batches, managing API call rates.
  - Review LangExtract's Chunking Parameters: While chunking is usually automatic, advanced usage may expose parameters to control chunk size or overlap, which can be tuned.
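One simple way to pre-process is to slice out just the section you care about before extraction. The sketch below pulls the text between one bold markdown heading and the next, matching the formatting of this chapter's sample report; a real filing would need heading patterns tuned to its own layout.

```python
import re
from typing import Optional


def extract_section(text: str, heading: str) -> Optional[str]:
    """Return the text between `**heading**` and the next bold heading (or end of text)."""
    pattern = re.compile(
        r"\*\*" + re.escape(heading) + r"\*\*\s*(.*?)(?=\n\*\*|\Z)",
        re.DOTALL,
    )
    match = pattern.search(text)
    return match.group(1).strip() if match else None


report = """\
**Fiscal Year 2025 Highlights:**
* Total revenue reached an all-time high of $1,250.5 million.

**Outlook:**
We project revenue growth in the range of 10-12%.
"""

outlook = extract_section(report, "Outlook:")
print(outlook)  # We project revenue growth in the range of 10-12%.
```

Feeding only the relevant section to `lx.extract` cuts token usage and keeps the LLM focused on the data you actually want.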
Remember, effective LLM-based extraction is an iterative process. You’ll likely refine your schema, instructions, and possibly even your pre-processing steps as you gain more experience with specific document types.
Summary
Phew! You’ve tackled a significant project in this chapter. Here’s a quick recap of what we covered:
- Financial Data Extraction: We explored the practical application of LangExtract for structuring data from financial reports, a common and valuable use case.
- Robust Schema Design: You learned how to craft a detailed Pydantic schema (`FinancialReportSummary`) to capture various financial metrics, highlights, and even sentiment, including nested models for complex data like `SegmentRevenue`.
- Incremental Implementation: We walked through setting up the environment, preparing sample text, defining a clear extraction instruction, and executing the `lx.extract` function.
- Handling Long Documents: We discussed the challenges of large documents and how LangExtract's intelligent chunking and aggregation mechanisms address token limits and improve accuracy.
- Mini-Challenge: You honed your skills by expanding the schema to include more granular financial details, reinforcing the importance of precise schema definitions.
- Troubleshooting: We covered common issues like numerical inaccuracies, schema mismatches, and performance concerns with large documents, along with strategies for debugging and refinement.
You’ve now got a powerful tool in your belt for automating the tedious task of financial data analysis. This project demonstrates how LangExtract can be a game-changer for transforming unstructured text into structured, actionable insights across various domains.
In the next chapter, we’ll delve deeper into advanced features of LangExtract, such as custom pre-processors, more sophisticated error handling, and integrating with external data sources. Keep up the great work!
References
- LangExtract GitHub Repository
- Pydantic v2 Documentation
- Towards Data Science: Extracting Structured Data with LangExtract
- Towards Data Science: Using Google’s LangExtract and Gemma for Structured Data Extraction