Introduction: Beyond Plain Text – Embracing Diverse Documents

Welcome back, future data alchemist! In our previous chapters, you’ve mastered the fundamentals of setting up LangExtract, defining extraction schemas, and pulling structured data from plain text. That’s a fantastic start, but let’s be honest: the real world isn’t always neatly packaged in plain .txt files.

Imagine needing to extract key clauses from a legal contract (often a PDF), product details from an e-commerce webpage (HTML), or specific figures from a research report. These diverse document types present unique challenges.

In this chapter, we’ll expand your LangExtract toolkit to confidently tackle information extraction from:

  • Plain Text: A quick refresher on the simplest form.
  • HTML Documents: Navigating the web’s rich, but often messy, structure.
  • PDF Files: Unlocking the data hidden within these ubiquitous document formats.

By the end of this chapter, you’ll understand how LangExtract (and a little help from Python’s ecosystem) can preprocess and extract valuable insights from almost any document you throw at it. Your ability to handle different data sources will make your extraction workflows far more robust and versatile.

Core Concepts: Preparing Your Documents for Extraction

LangExtract’s strength lies in its ability to take text and, guided by your schema, extract structured information. This means that for any document type, the core challenge often boils down to: How do we reliably get clean, meaningful text from this document, and then feed it to LangExtract?

Let’s explore this idea for our target document types.

The Universal Input: Text

At its heart, LangExtract’s extract function is designed to work with textual content. Whether it’s a short sentence, a long paragraph, or an entire book, if you can represent it as a Python string, LangExtract can process it.

# A simple reminder of what we've done before
import langextract as lx

# Our basic schema
schema = {
    "type": "object",
    "properties": {
        "product_name": {"type": "string", "description": "The name of the product"},
        "price": {"type": "number", "description": "The price of the product"}
    }
}

text_data = "The new SuperWidget 5000 is available for just $99.99!"
# We'll use this later, but you get the idea!
# result = lx.extract(text_data, schema=schema, llm_provider='your_llm_provider_config')

This fundamental principle — feeding text to lx.extract — remains constant. The differences arise in how we obtain that text from more complex document formats.

Decoding HTML: From Webpage to Meaningful Text

HTML documents, the building blocks of the internet, are a treasure trove of information. However, they’re also full of tags, scripts, and styling information that isn’t directly relevant to the content we want to extract.

When you point LangExtract to an HTML string, it intelligently parses the HTML to focus on the human-readable text. It tries to strip away the “noise” of the HTML tags (<div>, <p>, <a>, etc.) to get to the core content.

Why is this important?

  • Focus on Content: We usually want to extract facts and figures, not HTML structure.
  • Reduced Noise: Fewer tokens for the LLM to process, potentially leading to better accuracy and lower costs.
  • Simplified Schema: Your schema can focus purely on the information, not on where it appears in the HTML structure.
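
To build intuition for what this tag-stripping step yields, here’s a minimal sketch using Python’s standard-library html.parser. It illustrates the general idea only; it is not LangExtract’s actual internal parser:

```python
from html.parser import HTMLParser

class TextOnlyParser(HTMLParser):
    """Collect human-readable text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # depth inside script/style elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def strip_tags(html: str) -> str:
    parser = TextOnlyParser()
    parser.feed(html)
    return " ".join(parser.chunks)

print(strip_tags("<div><h2>Awesome Gadget X</h2><script>track();</script><p>Price: $199.99</p></div>"))
# → Awesome Gadget X Price: $199.99
```

Note that this naive sketch drops attributes like aria-label entirely, which is one reason passing the raw HTML to LangExtract (or doing smarter preprocessing) can matter.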

While LangExtract does a good job, for very complex or poorly structured HTML, you might sometimes want to preprocess it yourself using libraries like BeautifulSoup to extract specific sections or clean it further before passing it to LangExtract. We’ll explore a basic direct approach first.

Unlocking PDFs: A Multi-Step Process

PDF (Portable Document Format) files are designed for fixed-layout presentation, making them excellent for sharing documents that look the same everywhere. However, this fixed layout often makes extracting raw text a bit more challenging than with plain text or even HTML.

The Challenge with PDFs:

  • PDFs are not plain text files. They can contain text, images, vectors, fonts, and layout instructions all bundled together.
  • Text extraction from PDFs often involves interpreting character positions, font information, and page flows, which can be tricky, especially with scanned documents (images of text) or complex multi-column layouts.

How LangExtract Handles PDFs: LangExtract itself does not directly read PDF files. Instead, the standard workflow is:

  1. Extract Text from PDF: Use a dedicated Python library (like pypdf or pdfminer.six) to read the PDF file and extract its textual content into a plain string.
  2. Feed Text to LangExtract: Once you have the raw text, you pass it to LangExtract’s extract function, just like any other plain text.

This two-step process means you have control over the initial text extraction, allowing you to choose the best tool for your specific PDF types and handle any preprocessing (like OCR for scanned PDFs) before LangExtract steps in.
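
Between steps 1 and 2, a small cleanup pass often pays off, because PDF text extraction tends to leave hyphenated line breaks and ragged whitespace behind. Here is a sketch of such a pass; normalize_pdf_text is our own helper name, not part of LangExtract or pypdf:

```python
import re

def normalize_pdf_text(raw: str) -> str:
    """Tidy the ragged text that PDF extraction commonly produces."""
    # Re-join words hyphenated across line breaks ("exam-\nple" -> "example")
    text = re.sub(r"-\n(\w)", r"\1", raw)
    # Collapse runs of spaces and tabs, but keep the line structure
    text = re.sub(r"[ \t]+", " ", text)
    # Squeeze long runs of blank lines down to a single paragraph break
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

You would run this on the string produced in step 1 before handing it to lx.extract in step 2.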

Let’s get hands-on and see how this works!

Step-by-Step Implementation: Document Extraction in Action

Before we dive into code, let’s make sure our environment is ready.

1. Environment Setup

If you haven’t already, install langextract. For PDF processing, we’ll add pypdf, a widely used pure-Python library for reading PDF files.

# Ensure you have LangExtract installed
pip install langextract

# Install pypdf for PDF text extraction
pip install pypdf

2. Basic Text Extraction (Refresher)

Let’s quickly set up a basic extraction task. We’ll use the same schema as before for consistency.

# 06-document-types.py
import langextract as lx
import os # To access environment variables for API keys

# IMPORTANT: Replace the placeholder values below with your actual LLM setup.
# For example, with Google's Gemini:
#   llm_provider_config = {'api_key': os.getenv('GEMINI_API_KEY'), 'model': 'gemini-pro'}
# Or with OpenAI:
#   llm_provider_config = {'api_key': os.getenv('OPENAI_API_KEY'), 'model': 'gpt-4'}
# Always load API keys from environment variables rather than hard-coding them.
llm_provider_config = {
    'api_key': os.getenv('LANGEXTRACT_LLM_API_KEY', 'sk-YOUR_API_KEY'),
    'model': os.getenv('LANGEXTRACT_LLM_MODEL', 'gemini-pro') # Or 'gpt-4', etc.
}


# Define a simple schema for extracting product information
product_schema = {
    "type": "object",
    "properties": {
        "product_name": {"type": "string", "description": "The name of the product"},
        "price": {"type": "number", "description": "The price of the product"},
        "currency": {"type": "string", "description": "The currency of the price"}
    },
    "required": ["product_name", "price"]
}

print("--- Plain Text Extraction ---")
plain_text = "The new SuperWidget Pro costs $129.50. It's an amazing device!"
print(f"Input text: '{plain_text}'")

try:
    plain_text_result = lx.extract(
        plain_text,
        schema=product_schema,
        llm_provider=llm_provider_config
    )
    print("Extracted from plain text:")
    print(plain_text_result)
except Exception as e:
    print(f"Error during plain text extraction: {e}")

print("\n" + "="*40 + "\n")

When you run this, LangExtract will process the plain_text string and return the product_name, price, and currency as defined by your schema.

3. Extracting from HTML

Now, let’s try an HTML snippet. We’ll simulate a simple product listing.

# Continue in 06-document-types.py

print("--- HTML Extraction ---")
html_content = """
<!DOCTYPE html>
<html>
<head><title>Product Page</title></head>
<body>
    <h1>Welcome to Our Store!</h1>
    <div id="product-details">
        <h2>Awesome Gadget X</h2>
        <p>This is the latest and greatest gadget.</p>
        <p class="price">Price: <span data-currency="USD">$199.99</span></p>
        <button>Add to Cart</button>
    </div>
    <div id="related-products">
        <h3>Related Items</h3>
        <ul>
            <li>Old Gadget Y - $150</li>
        </ul>
    </div>
</body>
</html>
"""
print("Input HTML (snippet):")
print(html_content[:200] + "...") # Print a snippet for brevity

try:
    html_result = lx.extract(
        html_content,
        schema=product_schema,
        llm_provider=llm_provider_config
    )
    print("Extracted from HTML:")
    print(html_result)
except Exception as e:
    print(f"Error during HTML extraction: {e}")

print("\n" + "="*40 + "\n")

Explanation: Notice that we pass the html_content directly to lx.extract. LangExtract is smart enough to parse this HTML, strip out the tags, and present the underlying text to the LLM for extraction. It often does a surprisingly good job of understanding the semantic content even amidst HTML noise.

4. Extracting from PDF Documents

This is where pypdf comes in. In a real scenario, you’d load an existing PDF file and extract its text with pypdf; to keep this chapter self-contained, we’ll simulate the text that such an extraction would produce.

Step 4.1: Simulating Extracted PDF Text (for testing)

To make this example fully runnable without shipping a PDF file alongside it, we’ll use a string that stands in for text already extracted from a PDF.

# Continue in 06-document-types.py

print("--- PDF Extraction ---")

# pypdf's job is extracting text from existing PDF files; it isn't designed
# for authoring PDFs with arbitrary text content (a library like ReportLab is
# the usual choice for that). To keep this example self-contained, the string
# below stands in for the text pypdf would have extracted from a real report.
# The real-file workflow is shown further down.

pdf_text_content = """
Official Company Report Q4 2025

Product Performance Summary:
The Flagship Product v2.0 achieved sales of 15,000 units.
The new budget-friendly EcoModel is priced at 75.25 EUR and has shown promising early adoption.
Market analysis suggests a strong outlook for next quarter.
"""

# In a real application, you'd open a PDF file:
# with open("path/to/your/document.pdf", "rb") as file:
#     reader = PdfReader(file)
#     pdf_raw_text = ""
#     for page in reader.pages:
#         pdf_raw_text += page.extract_text() + "\n" # Extract text page by page

# For this guided example, we'll directly use our 'pdf_text_content'
# as if we extracted it from a PDF using pypdf.
# If you have a real PDF, you'd replace 'pdf_text_content' with 'pdf_raw_text'.

print("Simulated PDF Text Content:")
print(pdf_text_content)

# Define a schema suitable for our PDF content
report_schema = {
    "type": "object",
    "properties": {
        "report_title": {"type": "string", "description": "The main title of the report"},
        "quarter": {"type": "string", "description": "The quarter the report covers (e.g., Q4 2025)"},
        "product_sales": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "product_name": {"type": "string", "description": "Name of the product"},
                    "units_sold": {"type": "integer", "description": "Number of units sold"},
                    "price": {"type": "number", "description": "Price of the product, if mentioned"},
                    "currency": {"type": "string", "description": "Currency of the price, if mentioned"}
                }
            }
        }
    },
    "required": ["report_title", "quarter", "product_sales"]
}

try:
    pdf_result = lx.extract(
        pdf_text_content, # Pass the extracted text
        schema=report_schema,
        llm_provider=llm_provider_config
    )
    print("Extracted from simulated PDF text:")
    print(pdf_result)
except Exception as e:
    print(f"Error during PDF extraction: {e}")

print("\n" + "="*40 + "\n")

Explanation:

  1. We define pdf_text_content which represents the text that would have been extracted from a PDF.
  2. We define a report_schema that is tailored to the content found within our simulated PDF. This is crucial: your schema should always reflect the information you expect to find in the text of the document.
  3. We pass this pdf_text_content directly to lx.extract. LangExtract treats it as any other plain text, and the LLM works its magic.

Real-world PDF Text Extraction with pypdf:

If you had an actual PDF file, say my_report.pdf, the process to get the pdf_raw_text would look like this:

# This code block shows how to use pypdf if you have an actual PDF file.
# You would typically run this part first to get the text, then feed it to LangExtract.
from pypdf import PdfReader

def extract_text_from_pdf(pdf_path):
    """
    Extracts all text from a PDF file using pypdf.
    """
    full_text = ""
    try:
        reader = PdfReader(pdf_path)
        for page in reader.pages:
            full_text += page.extract_text() or "" # Use .extract_text() and handle None
            full_text += "\n" # Add a newline between pages for better readability
    except Exception as e:
        print(f"Error reading PDF '{pdf_path}': {e}")
    return full_text

# Example usage (assuming 'my_report.pdf' exists in the same directory)
# For this guide, you would run this separately or create a dummy PDF file.
# You would then replace `pdf_text_content` with the result of this function.
# actual_pdf_path = "my_report.pdf" # Replace with your actual PDF path
# extracted_pdf_text = extract_text_from_pdf(actual_pdf_path)
# print(f"Text extracted from '{actual_pdf_path}':\n{extracted_pdf_text[:500]}...") # Print first 500 chars

# Then, you would pass `extracted_pdf_text` to lx.extract
# pdf_result = lx.extract(extracted_pdf_text, schema=report_schema, llm_provider=llm_provider_config)
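
Before handing the result to LangExtract, it can be worth sanity-checking what extract_text() returned, since scanned pages usually yield little or no text. The helper below is a rough heuristic of our own (the name needs_ocr and both thresholds are illustrative, not part of pypdf):

```python
def needs_ocr(page_text: str, min_chars: int = 25, min_textlike_ratio: float = 0.5) -> bool:
    """Flag a page whose extracted text is too short or mostly non-text garbage.

    Thresholds are rough, illustrative defaults; tune them for your documents.
    """
    stripped = page_text.strip()
    if len(stripped) < min_chars:
        return True  # little or no text: likely a scanned (image-only) page
    textlike = sum(ch.isalnum() or ch.isspace() for ch in stripped)
    return textlike / len(stripped) < min_textlike_ratio
```

You might call this per page inside the extraction loop and route flagged pages to an OCR tool instead.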

Diagram: The Document Processing Workflow

Let’s visualize the flow for different document types.

graph TD
    A[Start] --> B{Document Type?};
    B -->|Plain Text| C[Feed Text to LangExtract];
    B -->|HTML| D[LangExtract Internally Parses HTML];
    D --> C;
    B -->|PDF| E[Use pypdf to Extract Text];
    E --> C;
    C --> F[LLM Processes Text + Schema];
    F --> G[Structured Data Output];
    G --> H[End];

Explanation of the Diagram:

  • The workflow starts with different document types.
  • For Plain Text, it’s a direct path to LangExtract.
  • For HTML, LangExtract performs an internal parsing step to get the relevant text.
  • For PDFs, an external tool like pypdf is used first to convert the PDF into a plain text string.
  • All paths converge to feeding plain text into LangExtract’s extract function, which then uses the LLM and your schema to produce the final structured data.
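
The converging paths in the diagram can be sketched as one dispatcher function. This is an illustrative sketch: load_document_text is our own name, and the PDF branch assumes pypdf is installed as shown earlier.

```python
from pathlib import Path

def load_document_text(path: str) -> str:
    """Dispatch on file extension to get text suitable for lx.extract()."""
    p = Path(path)
    suffix = p.suffix.lower()
    if suffix in {".txt", ".md"}:
        return p.read_text(encoding="utf-8")
    if suffix in {".html", ".htm"}:
        # LangExtract can take the raw HTML string directly, so no stripping here
        return p.read_text(encoding="utf-8")
    if suffix == ".pdf":
        from pypdf import PdfReader  # deferred import: only needed for PDFs
        reader = PdfReader(str(p))
        return "\n".join((page.extract_text() or "") for page in reader.pages)
    raise ValueError(f"Unsupported document type: {suffix!r}")
```

The returned string would then feed straight into lx.extract, exactly as in the earlier examples.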

Mini-Challenge: Extracting from a Product Review HTML Snippet

You’ve seen how LangExtract handles different inputs. Now, it’s your turn!

Challenge: You have an HTML snippet representing a user review for a product. Your task is to define a schema and use LangExtract to extract the reviewer’s name, their rating (as a number), and the main review comment.

<!-- review_snippet.html -->
<div class="review-card">
    <span class="reviewer-name">Jane Doe</span>
    <div class="rating" aria-label="Rated 4.5 out of 5 stars"></div>
    <p class="review-comment">This gadget is fantastic! I love its features and ease of use. Highly recommend.</p>
</div>

Your Task:

  1. Define a Python dictionary review_schema to capture reviewer_name (string), rating (number, e.g., 4.5), and review_comment (string).
  2. Store the HTML snippet in a multi-line Python string variable.
  3. Call lx.extract with the HTML content and your schema.
  4. Print the extracted result.

Hint: Pay close attention to the aria-label attribute in the div for the rating. LangExtract’s LLM is often good at picking up information from attributes as well as visible text.

What to Observe/Learn:

  • How well LangExtract can parse specific pieces of information from within HTML attributes.
  • The importance of a well-defined schema that matches the expected output types (e.g., number for rating).
Click for Solution (Optional)
# Solution for Mini-Challenge

import langextract as lx
import os # For LLM provider config

llm_provider_config = {
    'api_key': os.getenv('LANGEXTRACT_LLM_API_KEY', 'sk-YOUR_API_KEY'),
    'model': os.getenv('LANGEXTRACT_LLM_MODEL', 'gemini-pro')
}

review_html_content = """
<div class="review-card">
    <span class="reviewer-name">Jane Doe</span>
    <div class="rating" aria-label="Rated 4.5 out of 5 stars"></div>
    <p class="review-comment">This gadget is fantastic! I love its features and ease of use. Highly recommend.</p>
</div>
"""

review_schema = {
    "type": "object",
    "properties": {
        "reviewer_name": {"type": "string", "description": "The name of the reviewer"},
        "rating": {"type": "number", "description": "The numerical rating given by the reviewer (e.g., 4.5)"},
        "review_comment": {"type": "string", "description": "The main text of the review"}
    },
    "required": ["reviewer_name", "rating", "review_comment"]
}

print("\n--- Mini-Challenge Solution ---")
print("Input Review HTML (snippet):")
print(review_html_content[:200] + "...")

try:
    review_result = lx.extract(
        review_html_content,
        schema=review_schema,
        llm_provider=llm_provider_config
    )
    print("Extracted Review Data:")
    print(review_result)
except Exception as e:
    print(f"Error during mini-challenge extraction: {e}")

Common Pitfalls & Troubleshooting

Working with diverse document types can introduce new challenges. Here are a few common pitfalls and how to approach them:

  1. Garbled Text from PDFs (Especially Scanned Documents)

    • Problem: Text extracted from PDFs looks like gibberish, contains strange characters, or is missing entirely. This is common with scanned PDFs (which are images of text) or PDFs with complex layouts.
    • Solution:
      • OCR (Optical Character Recognition): For scanned PDFs, pypdf alone won’t work. You’ll need an OCR library like Tesseract (via pytesseract) or a cloud-based OCR service (Google Cloud Vision, AWS Textract) to convert the page images into machine-readable text, which you then pass to LangExtract.
      • Advanced PDF Parsers: For very complex, multi-column, or table-heavy PDFs, libraries like pdfplumber or camelot might offer more robust text and table extraction capabilities than pypdf.
      • Manual Inspection: Sometimes, there’s no substitute for opening the PDF and seeing if the text is even selectable. If you can’t select it, it’s likely an image.
  2. Over-extraction or Under-extraction from HTML due to Noise

    • Problem: LangExtract extracts too much irrelevant text from HTML (e.g., navigation, ads) or misses crucial information because it’s buried in complex tags.
    • Solution:
      • Refine Schema/Prompt: Make your schema descriptions and prompt instructions more specific.
      • Pre-process HTML with BeautifulSoup: For more control, use BeautifulSoup4 to parse the HTML, navigate to specific elements (e.g., a div with id="main-content"), and extract only their text. You then pass that cleaned text to LangExtract. This gives you precise control over what the LLM sees.
      • Example of BeautifulSoup cleaning:
        from bs4 import BeautifulSoup
        
        html_doc = """
        <nav>Menu</nav>
        <div id="main-content">
            <p>Important text here.</p>
            <div class="ad">Buy now!</div>
        </div>
        """
        soup = BeautifulSoup(html_doc, 'html.parser')
        main_content_div = soup.find(id="main-content")
        if main_content_div:
            # Drop unwanted children (like ads) before extracting the text
            for ad in main_content_div.find_all(class_="ad"):
                ad.decompose()
            cleaned_text = main_content_div.get_text(separator=' ', strip=True)
            # Now pass cleaned_text to LangExtract
            # print(cleaned_text) # Output: Important text here.
        
  3. Encoding Issues

    • Problem: Strange characters appear in your extracted text (e.g., ’ instead of apostrophes, \ufffd replacement characters).
    • Solution:
      • Specify Encoding: When reading files, always try to specify the encoding, typically utf-8.
      • errors='ignore' or errors='replace': When decoding bytes to a string, you can pass these parameters to decode() to handle unreadable bytes gracefully. Both lose information: ignore drops the offending bytes silently, while replace at least marks them with the U+FFFD replacement character.
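
A quick standard-library demonstration of both failure modes:

```python
raw = "café’s price".encode("utf-8")

# Decoding UTF-8 bytes with the wrong codec yields mojibake such as "â€™"
mojibake = raw.decode("cp1252")            # "cafÃ©â€™s price"

# Decoding with the codec the bytes were actually written in recovers the text
correct = raw.decode("utf-8")              # "café’s price"

# errors="replace" marks undecodable bytes with U+FFFD instead of raising;
# errors="ignore" would drop them silently
latin1_bytes = b"caf\xe9"                  # "café" encoded as Latin-1, invalid as UTF-8
lossy = latin1_bytes.decode("utf-8", errors="replace")   # "caf\ufffd"
```

If you see “â€™” in your extracted text, suspect a cp1252/UTF-8 mix-up rather than a problem with the document itself.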

Summary: Your Document Extraction Superpowers

You’ve made significant strides in this chapter, transforming from a plain text extractor to a versatile document processor! Here are the key takeaways:

  • LangExtract’s Core: At its heart, LangExtract processes text. The challenge with other document types is converting them into a clean, coherent text string.
  • HTML Handling: LangExtract can often directly process HTML strings, intelligently stripping tags to focus on content. For advanced control or noisy HTML, preprocessing with libraries like BeautifulSoup is a powerful option.
  • PDF Processing: PDFs require a preliminary step to extract text. Libraries like pypdf are essential for this. For scanned PDFs, OCR tools are necessary.
  • Schema is King: Regardless of the document type, a precise and well-defined schema is crucial for guiding the LLM to extract exactly what you need.
  • Pre-processing Power: Don’t hesitate to use other Python libraries to clean, filter, or enhance your document’s text before handing it over to LangExtract. This often leads to better and more reliable extraction results.

You now have a robust understanding of how to prepare and process various document formats for structured data extraction with LangExtract. This opens up a vast array of real-world applications, from automating data entry to analyzing large corpuses of documents.

In the next chapter, we’ll dive into handling long documents and explore advanced techniques like chunking and multi-pass extraction to overcome LLM token limits and improve accuracy on extensive texts.
