Summarize Refine

A LangGraph-based document summarization agent using iterative refinement. This approach processes documents in chunks, generating an initial summary and then iteratively refining it with additional context for comprehensive, coherent summaries.


📋 Overview

The Summarize Refine agent implements a "refine" summarization strategy:

  1. Load Documents: Accepts URLs, file paths, or raw text and chunks them appropriately
  2. Initial Summary: Generates a summary from the first chunk
  3. Iterative Refinement: Refines the summary by incorporating each subsequent chunk
  4. Final Output: Produces a comprehensive summary that covers the entire document

This approach is particularly effective for:

  • Long documents that exceed context windows
  • Maintaining coherence across large texts
  • Extracting key information while preserving context
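Conceptually, steps 2–4 reduce to a fold over the chunk list. Here is a minimal sketch of that loop, where the `llm` callable is a stand-in for the model call that the graph nodes actually make (the prompt wording is illustrative, not the agent's real prompts):

```python
from typing import Callable, List

def refine_summarize(chunks: List[str], llm: Callable[[str], str]) -> str:
    """Summarize the first chunk, then fold each remaining chunk into the summary."""
    summary = llm(f"Summarize the following text:\n{chunks[0]}")
    for chunk in chunks[1:]:
        summary = llm(
            "Refine the existing summary with the new context.\n"
            f"Existing summary:\n{summary}\n"
            f"New context:\n{chunk}"
        )
    return summary
```

Each iteration carries the running summary forward, which is what keeps the final output coherent across chunks that never fit in one context window.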

πŸ—οΈ Architecture

┌─────────────────┐
│      START      │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    load_docs    │  ← Load & chunk documents
└────────┬────────┘
         │
         ▼
┌─────────────────────────┐
│ generate_initial_summary│  ← Summarize first chunk
└────────┬────────────────┘
         │
         ▼
    ┌──────────────┐
    │ More chunks? │
    └─────┬────────┘
          │
    Yes   │   No
    ┌─────┴─────┐
    │           │
    ▼           ▼
┌──────────┐  ┌─────┐
│  refine  │  │ END │
│ summary  │  └─────┘
└────┬─────┘
     │
     └──────────┐
                │
         (loop back)

🚀 Quick Start

Installation

  1. Clone the repository:

    git clone https://github.com/felipemeres/summarize-refine-langgraph.git
    cd summarize-refine-langgraph
  2. Create a virtual environment:

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:

    # Basic installation
    pip install -e .
    
    # With all optional providers
    pip install -e ".[all]"
    
    # Or using requirements.txt
    pip install -r requirements.txt
  4. Configure environment:

    cp .env.example .env
    # Edit .env with your API keys

Basic Usage

import asyncio
from summarize_refine import graph

async def main():
    # Summarize raw text
    result = await graph.ainvoke({
        "input": "Your long document text here...",
        "instructions": None,  # Optional custom instructions
        "parent_configurable": None
    })
    print(result["summary"])

asyncio.run(main())

Summarize a URL

result = await graph.ainvoke({
    "input": "https://en.wikipedia.org/wiki/Artificial_intelligence",
    "instructions": None,
    "parent_configurable": None
})

Summarize with Custom Instructions

result = await graph.ainvoke({
    "input": "Your document content...",
    "instructions": "Focus on the financial aspects and provide bullet points for key metrics.",
    "parent_configurable": None
})

Using a Different LLM

result = await graph.ainvoke(
    {
        "input": "Your document...",
        "instructions": None,
        "parent_configurable": None
    },
    config={
        "configurable": {
            "llm": "anthropic/claude-3-5-sonnet-20240620"
        }
    }
)

βš™οΈ Configuration

Environment Variables

Set these in your .env file:

| Variable | Description | Required |
|----------|-------------|----------|
| OPENAI_API_KEY | OpenAI API key | Yes (if using OpenAI) |
| ANTHROPIC_API_KEY | Anthropic API key | If using Anthropic |
| GOOGLE_AISTUDIO_API_KEY | Google AI Studio key | If using Google |
| GROQ_API_KEY | Groq API key | If using Groq |
| FIRECRAWL_API_KEY | Firecrawl API key | For better web scraping |

Configurable Parameters

Pass these via config["configurable"]:

| Parameter | Default | Description |
|-----------|---------|-------------|
| llm | openai/gpt-4o-mini | Model to use (format: provider/model) |
| chunk_size | 20000 | Size of text chunks |
| chunk_overlap | 0 | Overlap between chunks |
| max_source_length | 140000 | Maximum source content length |
| web_timeout | 20 | Web request timeout (seconds) |
| loader_type | SimpleWebLoader | Web loader type |
| debug | false | Enable debug output |
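To see how `chunk_size` and `chunk_overlap` interact, here is a naive character-based splitter (illustrative only; the agent's actual splitter may handle boundaries more carefully):

```python
from typing import List

def split_text(text: str, chunk_size: int, chunk_overlap: int = 0) -> List[str]:
    """Naive character splitter: windows of chunk_size characters, each
    starting chunk_size - chunk_overlap after the previous window."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

print(split_text("abcdefghij", chunk_size=4, chunk_overlap=1))
# → ['abcd', 'defg', 'ghij', 'j']
```

A non-zero overlap repeats a little context at each chunk boundary, which can help the refine step connect ideas that straddle two chunks.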

Supported LLM Providers

| Provider | Model Format Example |
|----------|----------------------|
| OpenAI | openai/gpt-4o, openai/gpt-4o-mini |
| Anthropic | anthropic/claude-3-5-sonnet-20240620 |
| Google AI Studio | google_aistudio/gemini-2.0-flash-exp |
| Groq | groq/llama-3.1-70b-versatile |
| Ollama (local) | ollama/qwen2.5:32b |
| OpenRouter | openrouter/anthropic/claude-3-opus |
| Together | together/meta-llama/Llama-3.3-70B-Instruct-Turbo |
| DeepSeek | deepseek/deepseek-reasoner |
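Note that some model ids (e.g. OpenRouter's) contain slashes of their own, so a `provider/model` string has to be split on the first slash only. A sketch of such parsing (an assumption about the format, not the agent's actual code):

```python
def parse_llm_id(llm_id: str) -> tuple:
    """Split a 'provider/model' id on the FIRST slash only, so model names
    that themselves contain slashes survive intact."""
    provider, _, model = llm_id.partition("/")
    return provider, model

print(parse_llm_id("openrouter/anthropic/claude-3-opus"))
# → ('openrouter', 'anthropic/claude-3-opus')
```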

πŸ“ Supported Input Types

Raw Text

{"input": "Your text content here..."}

URL

{"input": "https://example.com/article"}

File Path

{"input": "/path/to/document.pdf"}

Source Object

from summarize_refine.utils.schemas import Source

{"input": Source(content="...", metadata={"title": "My Doc"})}

Dictionary

{"input": {"content": "...", "metadata": {"source": "manual"}}}

📄 Supported File Types

  • Text: .txt, .md, .log
  • PDF: .pdf
  • Office: .doc, .docx, .ppt, .pptx, .xls, .xlsx
  • Data: .csv, .json
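Putting the input shapes and file extensions together, a rough dispatch might look like the following (a hypothetical sketch; the agent's actual detection logic may differ):

```python
from pathlib import Path

# Extensions taken from the supported file types listed above
FILE_EXTS = {".txt", ".md", ".log", ".pdf", ".doc", ".docx",
             ".ppt", ".pptx", ".xls", ".xlsx", ".csv", ".json"}

def classify_input(value) -> str:
    """Rough dispatch over the supported input shapes (illustrative only)."""
    if isinstance(value, dict):
        return "dict"
    if isinstance(value, str):
        if value.startswith(("http://", "https://")):
            return "url"
        if Path(value).suffix.lower() in FILE_EXTS:
            return "file_path"
        return "raw_text"
    return "source_object"  # e.g. a Source instance

print(classify_input("/path/to/document.pdf"))  # → file_path
```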

🔧 LangGraph Studio

This project is configured for LangGraph Studio:

  1. Open LangGraph Studio
  2. Point it to this directory
  3. The langgraph.json configuration will be automatically detected
  4. Run and visualize the graph

🧪 Development

Running Tests

pip install -e ".[dev]"
pytest

Code Formatting

black summarize_refine/
ruff check summarize_refine/

📖 API Reference

Input State

class InputState(TypedDict):
    input: Union[Source, dict, str, List[dict]]  # Content to summarize
    instructions: Optional[str]  # Custom summarization instructions
    parent_configurable: Optional[dict]  # Config from parent graphs

Output State

class OutputState(TypedDict):
    summarized_source: List[Source]  # Source with summary
    summary: str  # Final summary text

Source Schema

class Source(BaseModel):
    content: str  # Original content
    summary: Optional[str]  # Generated summary
    metadata: Optional[dict]  # Metadata

🔄 Integration with Parent Graphs

This agent can be used as a subgraph in larger LangGraph applications:

from langgraph.graph import StateGraph
from summarize_refine import graph as summarize_graph

parent_builder = StateGraph(ParentState)  # ParentState: your parent graph's state schema

# Add the summarizer as a subgraph node
parent_builder.add_node("summarize", summarize_graph)

Pass configuration via parent_configurable:

{
    "input": "...",
    "parent_configurable": {
        "llm_summarizer": "anthropic/claude-3-5-sonnet-20240620",
        "chunk_size": 10000
    }
}
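If parent-supplied values simply override the defaults from the Configuration section (an assumption about precedence; note the example above also uses a differently named key, `llm_summarizer`, which the agent may map internally), the merge could be sketched as:

```python
from typing import Optional

# Defaults copied from the Configuration section above
DEFAULTS = {
    "llm": "openai/gpt-4o-mini",
    "chunk_size": 20000,
    "chunk_overlap": 0,
}

def resolve_config(parent_configurable: Optional[dict]) -> dict:
    """Defaults, overridden by whatever the parent graph passes down."""
    return {**DEFAULTS, **(parent_configurable or {})}

print(resolve_config({"chunk_size": 10000})["chunk_size"])  # → 10000
```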

πŸ“ License

MIT License - see LICENSE for details.

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request
