A LangGraph-based document summarization agent using iterative refinement. This approach processes documents in chunks, generating an initial summary and then iteratively refining it with additional context for comprehensive, coherent summaries.
The Summarize Refine agent implements a "refine" summarization strategy:
- Load Documents: Accepts URLs, file paths, or raw text and chunks them appropriately
- Initial Summary: Generates a summary from the first chunk
- Iterative Refinement: Refines the summary by incorporating each subsequent chunk
- Final Output: Produces a comprehensive summary that covers the entire document
This approach is particularly effective for:
- Long documents that exceed context windows
- Maintaining coherence across large texts
- Extracting key information while preserving context
```
          ┌───────────────────────────┐
          │           START           │
          └─────────────┬─────────────┘
                        │
                        ▼
          ┌───────────────────────────┐
          │         load_docs         │  ← Load & chunk documents
          └─────────────┬─────────────┘
                        │
                        ▼
          ┌───────────────────────────┐
          │ generate_initial_summary │  ← Summarize first chunk
          └─────────────┬─────────────┘
                        │
                        ▼
                ┌──────────────┐
     ┌─────────▶│ More chunks? │
     │          └──────┬───────┘
     │                 │
     │           Yes   │   No
     │          ┌──────┴──────┐
     │          │             │
     │          ▼             ▼
     │    ┌───────────┐   ┌───────┐
     │    │  refine   │   │  END  │
     │    │  summary  │   └───────┘
     │    └─────┬─────┘
     │          │
     └──────────┘
      (loop back)
```
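The packaged graph wires this flow up for you. Purely as an illustration, a minimal refine loop along the lines of the diagram could be built with LangGraph roughly like this (a sketch: the state fields, prompts, and `init_chat_model` usage are assumptions, not the package's internals, and the `load_docs` step is omitted):

```python
from typing import List, TypedDict

from langchain.chat_models import init_chat_model
from langgraph.graph import END, START, StateGraph

llm = init_chat_model("gpt-4o-mini", model_provider="openai")  # any chat model works


class RefineState(TypedDict):
    chunks: List[str]  # document chunks (produced by a load_docs step, omitted here)
    index: int         # next chunk to fold into the summary
    summary: str       # running summary


def generate_initial_summary(state: RefineState) -> dict:
    # Summarize the first chunk to seed the summary.
    first = state["chunks"][0]
    summary = llm.invoke(f"Summarize the following text:\n\n{first}").content
    return {"summary": summary, "index": 1}


def refine_summary(state: RefineState) -> dict:
    # Fold the next chunk into the existing summary.
    chunk = state["chunks"][state["index"]]
    prompt = (
        f"Here is an existing summary:\n{state['summary']}\n\n"
        f"Refine it using this additional context:\n{chunk}"
    )
    return {"summary": llm.invoke(prompt).content, "index": state["index"] + 1}


def more_chunks(state: RefineState) -> str:
    # Loop back while chunks remain, otherwise finish.
    return "refine_summary" if state["index"] < len(state["chunks"]) else END


builder = StateGraph(RefineState)
builder.add_node("generate_initial_summary", generate_initial_summary)
builder.add_node("refine_summary", refine_summary)
builder.add_edge(START, "generate_initial_summary")
builder.add_conditional_edges("generate_initial_summary", more_chunks)
builder.add_conditional_edges("refine_summary", more_chunks)
refine_graph = builder.compile()
```

Invoking `refine_graph` with `{"chunks": [...], "index": 0, "summary": ""}` then produces a summary that has seen every chunk, which is the behavior the packaged agent provides out of the box.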
To install and set up the project:

- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/summarize-refine.git
  cd summarize-refine
  ```

- Create a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  # Basic installation
  pip install -e .

  # With all optional providers
  pip install -e ".[all]"

  # Or using requirements.txt
  pip install -r requirements.txt
  ```

- Configure environment:

  ```bash
  cp .env.example .env
  # Edit .env with your API keys
  ```
Basic usage:

```python
import asyncio

from summarize_refine import graph


async def main():
    # Summarize raw text
    result = await graph.ainvoke({
        "input": "Your long document text here...",
        "instructions": None,  # Optional custom instructions
        "parent_configurable": None
    })
    print(result["summary"])


asyncio.run(main())
```

Summarize a web page by passing a URL:

```python
result = await graph.ainvoke({
    "input": "https://en.wikipedia.org/wiki/Artificial_intelligence",
    "instructions": None,
    "parent_configurable": None
})
```

Guide the summary with custom instructions:

```python
result = await graph.ainvoke({
    "input": "Your document content...",
    "instructions": "Focus on the financial aspects and provide bullet points for key metrics.",
    "parent_configurable": None
})
```

Choose a different model via the run config:

```python
result = await graph.ainvoke(
    {
        "input": "Your document...",
        "instructions": None,
        "parent_configurable": None
    },
    config={
        "configurable": {
            "llm": "anthropic/claude-3-5-sonnet-20240620"
        }
    }
)
```

Set these in your `.env` file:
| Variable | Description | Required |
|---|---|---|
| `OPENAI_API_KEY` | OpenAI API key | Yes (if using OpenAI) |
| `ANTHROPIC_API_KEY` | Anthropic API key | If using Anthropic |
| `GOOGLE_AISTUDIO_API_KEY` | Google AI Studio key | If using Google |
| `GROQ_API_KEY` | Groq API key | If using Groq |
| `FIRECRAWL_API_KEY` | Firecrawl API key | For better web scraping |
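For example, a minimal `.env` using OpenAI for summarization and Firecrawl for web scraping might contain (placeholder values, not real keys):

```bash
OPENAI_API_KEY=your-openai-api-key
FIRECRAWL_API_KEY=your-firecrawl-api-key
```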
Pass these via `config["configurable"]`:

| Parameter | Default | Description |
|---|---|---|
| `llm` | `openai/gpt-4o-mini` | Model to use (format: `provider/model`) |
| `chunk_size` | `20000` | Size of text chunks |
| `chunk_overlap` | `0` | Overlap between chunks |
| `max_source_length` | `140000` | Maximum source content length |
| `web_timeout` | `20` | Web request timeout (seconds) |
| `loader_type` | `SimpleWebLoader` | Web loader type |
| `debug` | `false` | Enable debug output |
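Several of these can be combined in one call; for instance (a sketch mirroring the usage examples above, with illustrative values):

```python
result = await graph.ainvoke(
    {
        "input": "Your document...",
        "instructions": None,
        "parent_configurable": None
    },
    config={
        "configurable": {
            "llm": "openai/gpt-4o-mini",
            "chunk_size": 10000,
            "chunk_overlap": 200,
            "web_timeout": 30,
            "debug": True
        }
    }
)
```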
| Provider | Model Format Example |
|---|---|
| OpenAI | openai/gpt-4o, openai/gpt-4o-mini |
| Anthropic | anthropic/claude-3-5-sonnet-20240620 |
| Google AI Studio | google_aistudio/gemini-2.0-flash-exp |
| Groq | groq/llama-3.1-70b-versatile |
| Ollama (local) | ollama/qwen2.5:32b |
| OpenRouter | openrouter/anthropic/claude-3-opus |
| Together | together/meta-llama/Llama-3.3-70B-Instruct-Turbo |
| DeepSeek | deepseek/deepseek-reasoner |
{"input": "Your text content here..."}{"input": "https://example.com/article"}{"input": "/path/to/document.pdf"}from summarize_refine.utils.schemas import Source
{"input": Source(content="...", metadata={"title": "My Doc"})}{"input": {"content": "...", "metadata": {"source": "manual"}}}- Text:
.txt,.md,.log - PDF:
.pdf - Office:
.doc,.docx,.ppt,.pptx,.xls,.xlsx - Data:
.csv,.json
This project is configured for LangGraph Studio:
- Open LangGraph Studio
- Point it to this directory
- The `langgraph.json` configuration will be automatically detected
- Run and visualize the graph
Install development dependencies:

```bash
pip install -e ".[dev]"
```

Run tests:

```bash
pytest
```

Format and lint:

```bash
black summarize_refine/
ruff check summarize_refine/
```

Input state:

```python
class InputState(TypedDict):
    input: Union[Source, dict, str, List[dict]]  # Content to summarize
    instructions: Optional[str]                  # Custom summarization instructions
    parent_configurable: Optional[dict]          # Config from parent graphs
```

Output state:

```python
class OutputState(TypedDict):
    summarized_source: List[Source]  # Source with summary
    summary: str                     # Final summary text
```

Source model:

```python
class Source(BaseModel):
    content: str              # Original content
    summary: Optional[str]    # Generated summary
    metadata: Optional[dict]  # Metadata
```
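For reference, the `OutputState` fields can be read off the result of `graph.ainvoke` like this (a short sketch following the usage examples above):

```python
import asyncio

from summarize_refine import graph


async def main():
    result = await graph.ainvoke({
        "input": "Your document...",
        "instructions": None,
        "parent_configurable": None
    })
    print(result["summary"])                 # final summary text
    for src in result["summarized_source"]:  # Source objects with content, summary, metadata
        print(src.metadata, src.summary)


asyncio.run(main())
```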
This agent can be used as a subgraph in larger LangGraph applications:

```python
from langgraph.graph import StateGraph
from summarize_refine import graph as summarize_graph

# Add as a subgraph
parent_builder.add_node("summarize", summarize_graph)
```
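The snippet above assumes an existing `parent_builder`. A fuller sketch of embedding the summarizer in a parent graph might look like this (the parent state and wiring are illustrative assumptions, not part of the package):

```python
from typing import List, Optional, TypedDict, Union

from langgraph.graph import END, START, StateGraph

from summarize_refine import graph as summarize_graph
from summarize_refine.utils.schemas import Source


class ParentState(TypedDict):
    # Keys shared with the subgraph's input/output states are passed through by name.
    input: Union[Source, dict, str, List[dict]]
    instructions: Optional[str]
    parent_configurable: Optional[dict]
    summary: str


parent_builder = StateGraph(ParentState)
parent_builder.add_node("summarize", summarize_graph)  # compiled subgraph as a node
parent_builder.add_edge(START, "summarize")
parent_builder.add_edge("summarize", END)
parent_graph = parent_builder.compile()
```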
Pass configuration via `parent_configurable`:

```python
{
    "input": "...",
    "parent_configurable": {
        "llm_summarizer": "anthropic/claude-3-5-sonnet-20240620",
        "chunk_size": 10000
    }
}
```

MIT License - see LICENSE for details.
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request