Introduction: Guiding Your LLM with Precision

Welcome to Chapter 17! So far, you’ve learned how to install LangExtract, set up your LLM provider, define extraction schemas, and perform basic data extraction. But what truly separates good extraction from great extraction? It’s all about prompt engineering.

In this chapter, we’ll dive deep into the art and science of crafting effective prompts for LangExtract. While LangExtract handles much of the complexity of interacting with Large Language Models (LLMs) under the hood, your schema definitions and any explicit instructions you provide are essentially the “prompts” that guide the LLM. Understanding how to optimize these inputs is crucial for achieving accurate, reliable, and consistent results. We’ll explore core principles, practical techniques, and iterative refinement strategies to make your extractions shine.

To get the most out of this chapter, you should be comfortable with:

  • Setting up LangExtract and an LLM provider (like OpenAI or Google’s models).
  • Defining basic Pydantic-based schemas for data extraction.
  • Running langextract.extract calls.

Ready to become a prompt engineering maestro? Let’s begin!

The Heart of Extraction: How Prompts and Schemas Work Together

At its core, LangExtract translates your desired output structure (your schema) and the text you provide into a prompt that an LLM can understand and process. The LLM then uses this prompt to identify and extract the relevant pieces of information from your document, structuring them according to your defined schema.

Think of it like this: You’re giving the LLM a highly specialized set of instructions and a template. The better your instructions and template, the more accurately the LLM can fill it out.

Understanding LangExtract’s Internal Prompt Generation

When you define a Pydantic model for your schema, LangExtract automatically converts this into a structured instruction for the LLM. For instance, if you have a field named product_name with a description, both the name and the description become part of the prompt, telling the LLM precisely what kind of information to look for.

This means that while you don’t write the entire prompt, every detail in your schema—field names, types, descriptions, and even examples—directly influences the prompt that the LLM receives.
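To build intuition for how schema details reach the model, here is a deliberately simplified sketch of how field names, types, and descriptions might be rendered into instruction text. This is an illustrative assumption, not LangExtract's actual internal prompt format, which may differ:

```python
# A conceptual sketch (NOT LangExtract's real internals): field names, types,
# and descriptions become explicit instructions in the prompt the LLM receives.
FIELD_SPECS = {
    "product_name": ("str", "The name of the product being reviewed."),
    "rating": ("int", "The numerical rating, on a scale of 1 to 5."),
}

def render_instructions(specs: dict[str, tuple[str, str]]) -> str:
    """Render field specs into the kind of instruction text an LLM might see."""
    lines = ["Extract the following fields from the text:"]
    for name, (type_name, description) in specs.items():
        lines.append(f'- "{name}" ({type_name}): {description}')
    return "\n".join(lines)

print(render_instructions(FIELD_SPECS))
```

Even in this toy rendering, you can see why a vague field name or an empty description starves the model of guidance.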

Key Principles of Effective Prompt Engineering for LangExtract

Let’s lay down the foundational principles for crafting prompts that yield superior extraction results.

1. Clarity and Specificity: Be Unambiguous!

LLMs are powerful, but they are also literal. Ambiguous instructions lead to ambiguous results.

  • Be explicit: Instead of “Get the name,” say “Extract the full name of the product being reviewed.”
  • Define boundaries: Specify what to extract and what not to extract.
  • Handle edge cases: What should happen if the information isn’t present? Should it be None, an empty string, or a default value?

2. Schema-Driven Guidance: Your Schema IS Your Prompt

Your Pydantic schema isn’t just a data structure; it’s a powerful prompting tool.

  • Descriptive Field Names: Use names that clearly indicate the content (e.g., customer_email, not email).
  • Rich description Attributes: This is where you provide detailed instructions for each field. This is arguably the most critical part of prompt engineering with LangExtract.
  • Type Hinting: Python type hints (str, int, list[str], Optional[str]) guide the LLM on the expected data format. LangExtract leverages this to enforce types.

3. Few-Shot Examples (When Applicable): Show, Don’t Just Tell

For complex or nuanced extractions, providing one or more “few-shot” examples can significantly improve accuracy. These examples demonstrate the desired input-output mapping. In LangExtract, you can embed examples in your schema’s Field descriptions or, where your version supports it, pass them directly to the extract call.
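One version-agnostic approach is to bake the examples into the description string itself. The sketch below shows a small helper for doing that; the helper and its names are our own illustration, not part of the LangExtract API:

```python
# A sketch of embedding few-shot examples directly in a field description.
# Whether your LangExtract version exposes a dedicated `examples` parameter may
# vary; appending input/output pairs to the description text always works.
FEW_SHOT_PAIRS = [
    ("a bit pricey at $129.99", 129.99),
    ("costs twenty dollars", 20.0),
]

def description_with_examples(base: str, pairs: list[tuple[str, float]]) -> str:
    """Append worked input -> output examples to a base description string."""
    lines = [base, "Examples:"]
    for source_text, expected in pairs:
        lines.append(f'  Input: "{source_text}" -> Output: {expected}')
    return "\n".join(lines)

price_description = description_with_examples(
    "The price of the product, as a floating-point number.", FEW_SHOT_PAIRS
)
print(price_description)
```

The resulting string can then be passed as the description of a Field, so the LLM sees the examples alongside the instruction.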

4. Handling Missing Information: Graceful Degradation

Documents are messy. Information might be missing.

  • Use Optional[Type] for fields that might not always be present.
  • Explicitly instruct the LLM what to do if a field is not found (e.g., “If the author’s name is not explicitly mentioned, return None.”).

5. Iterative Refinement: The Engineering Cycle

Prompt engineering is rarely a one-shot process. It’s an iterative cycle:

  1. Define: Start with a clear objective and initial schema.
  2. Test: Run extraction on diverse examples.
  3. Analyze: Review the extracted output. What went wrong? What was ambiguous?
  4. Refine: Adjust your schema, descriptions, or instructions.
  5. Repeat!

This cycle is where LangExtract’s interactive visualization and debugging tools (which we’ll touch on later) become invaluable.

Step-by-Step Implementation: Refining a Product Review Extractor

Let’s put these principles into practice by building and refining an extractor for product reviews. We’ll start with a basic schema and then enhance it using prompt engineering best practices.

First, ensure you have LangExtract and Pydantic installed. If not:

pip install "langextract[openai]" pydantic==2.* python-dotenv

(As of 2026-01-05, Pydantic v2.x is the stable release and python-dotenv is useful for managing API keys.)

Scenario: We want to extract key information from a short product review.

Example Review Text:

"I absolutely love this new smart toaster! It toasts perfectly every time and the smart features are surprisingly useful. My only minor gripe is that it's a bit pricey at $129.99. John Doe, purchased on 2025-12-20."

Step 1: Initial Schema Definition (The Basic Prompt)

Let’s define a basic schema using Pydantic. This forms our initial, simple prompt.

Create a file named review_extractor.py:

# review_extractor.py
import os
from dotenv import load_dotenv
from pydantic import BaseModel, Field
import langextract as lx

# Load environment variables (e.g., OPENAI_API_KEY)
load_dotenv()

# --- Our initial, basic schema ---
class ProductReview(BaseModel):
    product_name: str
    rating: int
    comment_summary: str
    price: float

# --- Sample review text ---
review_text = "I absolutely love this new smart toaster! It toasts perfectly every time and the smart features are surprisingly useful. My only minor gripe is that it's a bit pricey at $129.99. John Doe, purchased on 2025-12-20."

# --- Extraction logic ---
def extract_review_basic(text: str) -> ProductReview | None:
    try:
        result = lx.extract(
            text_or_document=text,
            schema=ProductReview,
            llm_provider=lx.OpenAI(model="gpt-4o"), # Or your preferred LLM
        )
        return result
    except Exception as e:
        print(f"An error occurred during extraction: {e}")
        return None

if __name__ == "__main__":
    print("--- Basic Extraction ---")
    extracted_data = extract_review_basic(review_text)
    if extracted_data:
        print(extracted_data.model_dump_json(indent=2))
    else:
        print("Extraction failed.")

Explanation:

  • We import necessary libraries, including os, dotenv, pydantic (BaseModel, Field), and langextract.
  • load_dotenv() helps load your LLM API key from a .env file (e.g., OPENAI_API_KEY=sk-...).
  • ProductReview is our Pydantic schema. Notice how product_name, rating, comment_summary, and price are defined with their types.
  • The extract_review_basic function uses lx.extract with our schema. We’re using gpt-4o as a powerful general-purpose LLM, but you can swap this for others like gemini-pro or claude-3-opus-20240229 depending on your setup.

Run this script. You might get something like this:

{
  "product_name": "smart toaster",
  "rating": 5,
  "comment_summary": "The new smart toaster toasts perfectly and has surprisingly useful smart features.",
  "price": 129.99
}

This is decent, but the rating of 5 was inferred (the review never states a numeric rating), and comment_summary could be more precise. What if we want sentiment? Or the reviewer’s name?

Step 2: Refining the Schema with description Attributes (Better Prompts!)

Now, let’s apply our prompt engineering principles. We’ll add description attributes to our fields to give the LLM much clearer instructions. We’ll also add new fields like sentiment and reviewer_name and review_date.

Modify your review_extractor.py file:

# review_extractor.py (continued)
# ... (imports and load_dotenv are the same)

# --- Our refined schema with detailed descriptions ---
class ProductReviewRefined(BaseModel):
    product_name: str = Field(
        ..., description="The name of the product being reviewed. Be specific."
    )
    rating: int = Field(
        ..., 
        description=(
            "The numerical rating given to the product, typically on a scale of 1 to 5, "
            "where 5 is excellent and 1 is poor. Infer from positive/negative language if not explicit. "
            "Return an integer."
        )
    )
    comment_summary: str = Field(
        ...,
        description="A concise summary (1-2 sentences) of the main points of the review, focusing on pros and cons."
    )
    price: float = Field(
        ...,
        description="The price of the product mentioned in the review, extracted as a floating-point number without the currency symbol (e.g., 129.99)."
    )
    sentiment: str = Field(
        ..., 
        description="The overall sentiment of the review, classified as 'Positive', 'Negative', or 'Neutral'."
    )
    reviewer_name: str | None = Field(
        None, 
        description="The full name of the reviewer, if explicitly mentioned in the text. Return None if not found."
    )
    review_date: str | None = Field(
        None,
        description="The date the review was published or the product was purchased, in 'YYYY-MM-DD' format. Return None if not found."
    )

# ... (review_text is the same)

# --- Extraction logic for refined schema ---
def extract_review_refined(text: str) -> ProductReviewRefined | None:
    try:
        result = lx.extract(
            text_or_document=text,
            schema=ProductReviewRefined, # Use the refined schema
            llm_provider=lx.OpenAI(model="gpt-4o"),
        )
        return result
    except Exception as e:
        print(f"An error occurred during extraction: {e}")
        return None

if __name__ == "__main__":
    # ... (basic extraction block is the same)

    print("\n--- Refined Extraction ---")
    extracted_data_refined = extract_review_refined(review_text)
    if extracted_data_refined:
        print(extracted_data_refined.model_dump_json(indent=2))
    else:
        print("Refined extraction failed.")

Explanation of Changes:

  • We created ProductReviewRefined which extends our schema with sentiment, reviewer_name, and review_date.
  • Crucially, every Field now has a detailed description string. These descriptions are passed directly to the LLM as part of the prompt, guiding it on what to extract and how to format it.
    • For rating, we explicitly tell it to infer from language and return an integer.
    • For comment_summary, we specify the length and focus.
    • For sentiment, we provide the exact categories (‘Positive’, ‘Negative’, ‘Neutral’).
    • For reviewer_name and review_date, we use str | None (the modern equivalent of Optional[str]) and explicitly instruct the LLM to return None if the information isn’t found, preventing hallucination.

Run the updated script. You should now see a much more comprehensive and accurate extraction:

{
  "product_name": "smart toaster",
  "rating": 5,
  "comment_summary": "The new smart toaster delivers perfect toast consistently with surprisingly useful smart features, although its price of $129.99 is a minor drawback.",
  "price": 129.99,
  "sentiment": "Positive",
  "reviewer_name": "John Doe",
  "review_date": "2025-12-20"
}

Notice how the LLM now correctly identifies the sentiment, the reviewer’s name, and the purchase date, all thanks to the clearer instructions in the description fields!

Mini-Challenge: Extracting a List of Features

Let’s make our extraction even more granular. Imagine our reviews also mention specific features.

Challenge: Modify the ProductReviewRefined schema to include a new field, features_mentioned, which should be a list of strings (list[str]). This field should capture any specific product features explicitly mentioned in the review. Remember to provide a clear description for this new field in your schema.

Hint: Think about how you’d instruct someone to list items. What if there are no features mentioned? Consider using list[str] | None and a description that tells the LLM to return an empty list or None if no features are found.

What to Observe/Learn: How effectively the LLM can identify and list distinct items based on your instructions, and how to handle scenarios where such a list might be empty.

Click for Solution (after you've tried it!)
# Add this to your ProductReviewRefined class
    features_mentioned: list[str] | None = Field(
        None,
        description=(
            "A list of specific product features explicitly mentioned and discussed in the review. "
            "For example: ['smart features', 'perfect toast']. Return an empty list if no specific features are mentioned."
        )
    )

# Example output might look like:
# {
#   "product_name": "smart toaster",
#   "rating": 5,
#   "comment_summary": "The new smart toaster delivers perfect toast consistently with surprisingly useful smart features, although its price of $129.99 is a minor drawback.",
#   "price": 129.99,
#   "sentiment": "Positive",
#   "reviewer_name": "John Doe",
#   "review_date": "2025-12-20",
#   "features_mentioned": [
#     "perfect toast",
#     "smart features"
#   ]
# }

Common Pitfalls & Troubleshooting in Prompt Engineering

Even with best practices, you might encounter issues. Here are some common pitfalls and how to debug them:

  1. Ambiguous Instructions / Vague Descriptions:

    • Pitfall: The LLM consistently returns incorrect or inconsistent data for a field, or hallucinates values. This often means your description isn’t clear enough.
    • Example: description="Get details." (Too vague)
    • Troubleshooting: Reread your description as if you’re explaining it to a child. Is there any room for misinterpretation? Add specific examples within the description, define data formats, and clarify boundaries. Use LangExtract’s interactive visualization to see what the LLM thought it was extracting and why.
  2. Schema Mismatch / Type Errors:

    • Pitfall: The LLM returns a string when you expect an integer, or an object when you expect a list. Pydantic validation will often catch this.
    • Example: Expecting int for rating, but LLM returns “five stars”.
    • Troubleshooting: Double-check your Pydantic type hints. Ensure your description explicitly tells the LLM the exact format it should use (e.g., “Return the rating as an integer, e.g., 4”).
  3. Over-constraining the LLM:

    • Pitfall: You’ve added so many rules and conditions to a field’s description that the LLM struggles to find any valid output, or it misses obvious information. This can happen with very complex regex-like instructions in natural language.
    • Example: “Extract a 5-digit number that starts with ‘1’, ends with ‘9’, and is divisible by 7, but only if it appears after a date in ‘MM/DD/YYYY’ format.”
    • Troubleshooting: Simplify your instructions. Can you break down a complex field into multiple simpler fields, or use post-processing in Python for some of the more rigid validation? Sometimes, less is more. Let the LLM do what it’s good at (understanding natural language), and use Python for hard-coded logic.
  4. Hallucination / Fabrication:

    • Pitfall: The LLM returns information that is not present in the original text, especially for Optional fields.
    • Troubleshooting: For Optional fields, explicitly instruct the LLM to return None or an empty string if the information is not present. For example: description="Extract the author's name. If no author is explicitly stated, return None."

Summary: Mastering the Art of Prompt Engineering

You’ve now taken a significant step in becoming a proficient LangExtract user! Prompt engineering, primarily through careful schema definition and detailed description attributes, is the cornerstone of effective structured data extraction with LLMs.

Here are the key takeaways from this chapter:

  • Your schema is your prompt: Every field name, type, and description directly guides the underlying LLM.
  • Clarity is paramount: Be specific, unambiguous, and define boundaries for what to extract.
  • Leverage Field(..., description=...): This is your primary tool for instructing the LLM precisely what to extract and how to format it.
  • Handle missing data gracefully: Use Optional types and explicit instructions ("Return None if not found.") to prevent hallucination.
  • Embrace iterative refinement: Expect to test, analyze, and refine your schemas and descriptions to achieve optimal results.
  • Beware of common pitfalls: Avoid ambiguity, ensure type consistency, and don’t over-constrain the LLM.

In the next chapter, we’ll delve into handling even larger documents and more complex scenarios using LangExtract’s advanced features like chunking and multi-pass extraction, which further enhance the power of your well-engineered prompts.

