A LangGraph-based document summarization agent using iterative refinement. This approach processes documents in chunks, generating an initial summary and then iteratively refining it with additional context for comprehensive, coherent summaries.
The Summarize Refine agent implements a "refine" summarization strategy:
- Load Documents: Accepts URLs, file paths, or raw text and chunks them appropriately
- Initial Summary: Generates a summary from the first chunk
- Iterative Refinement: Refines the summary by incorporating each subsequent chunk
- Final Output: Produces a comprehensive summary that covers the entire document
This approach is particularly effective for:
- Long documents that exceed context windows
- Maintaining coherence across large texts
- Extracting key information while preserving context
```
          ┌───────────────────────────┐
          │           START           │
          └─────────────┬─────────────┘
                        │
                        ▼
          ┌───────────────────────────┐
          │         load_docs         │  ← Load & chunk documents
          └─────────────┬─────────────┘
                        │
                        ▼
          ┌───────────────────────────┐
          │ generate_initial_summary │  ← Summarize first chunk
          └─────────────┬─────────────┘
                        │
                        ▼
                ┌──────────────┐
     ┌─────────▶│ More chunks? │
     │          └──────┬───────┘
     │                 │
     │           Yes   │   No
     │          ┌──────┴──────┐
     │          │             │
     │          ▼             ▼
     │    ┌───────────┐   ┌───────┐
     │    │  refine   │   │  END  │
     │    │  summary  │   └───────┘
     │    └─────┬─────┘
     │          │
     └──────────┘
      (loop back)
```
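The packaged graph wires this flow up for you. Purely as an illustration, a minimal refine loop along the lines of the diagram could be built with LangGraph roughly like this (a sketch: the state fields, prompts, and `init_chat_model` usage are assumptions, not the package's internals, and the `load_docs` step is omitted):

```python
from typing import List, TypedDict

from langchain.chat_models import init_chat_model
from langgraph.graph import END, START, StateGraph

llm = init_chat_model("gpt-4o-mini", model_provider="openai")  # any chat model works


class RefineState(TypedDict):
    chunks: List[str]  # document chunks (produced by a load_docs step, omitted here)
    index: int         # next chunk to fold into the summary
    summary: str       # running summary


def generate_initial_summary(state: RefineState) -> dict:
    # Summarize the first chunk to seed the summary.
    first = state["chunks"][0]
    summary = llm.invoke(f"Summarize the following text:\n\n{first}").content
    return {"summary": summary, "index": 1}


def refine_summary(state: RefineState) -> dict:
    # Fold the next chunk into the existing summary.
    chunk = state["chunks"][state["index"]]
    prompt = (
        f"Here is an existing summary:\n{state['summary']}\n\n"
        f"Refine it using this additional context:\n{chunk}"
    )
    return {"summary": llm.invoke(prompt).content, "index": state["index"] + 1}


def more_chunks(state: RefineState) -> str:
    # Loop back while chunks remain, otherwise finish.
    return "refine_summary" if state["index"] < len(state["chunks"]) else END


builder = StateGraph(RefineState)
builder.add_node("generate_initial_summary", generate_initial_summary)
builder.add_node("refine_summary", refine_summary)
builder.add_edge(START, "generate_initial_summary")
builder.add_conditional_edges("generate_initial_summary", more_chunks)
builder.add_conditional_edges("refine_summary", more_chunks)
refine_graph = builder.compile()
```

Invoking `refine_graph` with `{"chunks": [...], "index": 0, "summary": ""}` then produces a summary that has seen every chunk, which is the behavior the packaged agent provides out of the box.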
To install and set up the project:

- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/summarize-refine.git
  cd summarize-refine
  ```

- Create a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  # Basic installation
  pip install -e .

  # With all optional providers
  pip install -e ".[all]"

  # Or using requirements.txt
  pip install -r requirements.txt
  ```

- Configure environment:

  ```bash
  cp .env.example .env
  # Edit .env with your API keys
  ```
Basic usage:

```python
import asyncio

from summarize_refine import graph


async def main():
    # Summarize raw text
    result = await graph.ainvoke({
        "input": "Your long document text here...",
        "instructions": None,  # Optional custom instructions
        "parent_configurable": None
    })
    print(result["summary"])


asyncio.run(main())
```

Summarize a web page by passing a URL:

```python
result = await graph.ainvoke({
    "input": "https://en.wikipedia.org/wiki/Artificial_intelligence",
    "instructions": None,
    "parent_configurable": None
})
```

Guide the summary with custom instructions:

```python
result = await graph.ainvoke({
    "input": "Your document content...",
    "instructions": "Focus on the financial aspects and provide bullet points for key metrics.",
    "parent_configurable": None
})
```

Choose a different model via the run config:

```python
result = await graph.ainvoke(
    {
        "input": "Your document...",
        "instructions": None,
        "parent_configurable": None
    },
    config={
        "configurable": {
            "llm": "anthropic/claude-3-5-sonnet-20240620"
        }
    }
)
```

Set these in your `.env` file:
| Variable | Description | Required |
|---|---|---|
| `OPENAI_API_KEY` | OpenAI API key | Yes (if using OpenAI) |
| `ANTHROPIC_API_KEY` | Anthropic API key | If using Anthropic |
| `GOOGLE_AISTUDIO_API_KEY` | Google AI Studio key | If using Google |
| `GROQ_API_KEY` | Groq API key | If using Groq |
| `FIRECRAWL_API_KEY` | Firecrawl API key | For better web scraping |
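For example, a minimal `.env` using OpenAI for summarization and Firecrawl for web scraping might contain (placeholder values, not real keys):

```bash
OPENAI_API_KEY=your-openai-api-key
FIRECRAWL_API_KEY=your-firecrawl-api-key
```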
Pass these via `config["configurable"]`:

| Parameter | Default | Description |
|---|---|---|
| `llm` | `openai/gpt-4o-mini` | Model to use (format: `provider/model`) |
| `chunk_size` | `20000` | Size of text chunks |
| `chunk_overlap` | `0` | Overlap between chunks |
| `max_source_length` | `140000` | Maximum source content length |
| `web_timeout` | `20` | Web request timeout (seconds) |
| `loader_type` | `SimpleWebLoader` | Web loader type |
| `debug` | `false` | Enable debug output |
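Several of these can be combined in one call; for instance (a sketch mirroring the usage examples above, with illustrative values):

```python
result = await graph.ainvoke(
    {
        "input": "Your document...",
        "instructions": None,
        "parent_configurable": None
    },
    config={
        "configurable": {
            "llm": "openai/gpt-4o-mini",
            "chunk_size": 10000,
            "chunk_overlap": 200,
            "web_timeout": 30,
            "debug": True
        }
    }
)
```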
| Provider | Model Format Example |
|---|---|
| OpenAI | openai/gpt-4o, openai/gpt-4o-mini |
| Anthropic | anthropic/claude-3-5-sonnet-20240620 |
| Google AI Studio | google_aistudio/gemini-2.0-flash-exp |
| Groq | groq/llama-3.1-70b-versatile |
| Ollama (local) | ollama/qwen2.5:32b |
| OpenRouter | openrouter/anthropic/claude-3-opus |
| Together | together/meta-llama/Llama-3.3-70B-Instruct-Turbo |
| DeepSeek | deepseek/deepseek-reasoner |
{"input": "Your text content here..."}{"input": "https://example.com/article"}{"input": "/path/to/document.pdf"}from summarize_refine.utils.schemas import Source
{"input": Source(content="...", metadata={"title": "My Doc"})}{"input": {"content": "...", "metadata": {"source": "manual"}}}- Text:
.txt,.md,.log - PDF:
.pdf - Office:
.doc,.docx,.ppt,.pptx,.xls,.xlsx - Data:
.csv,.json
This project is configured for LangGraph Studio:
- Open LangGraph Studio
- Point it to this directory
- The `langgraph.json` configuration will be automatically detected
- Run and visualize the graph
Install development dependencies:

```bash
pip install -e ".[dev]"
```

Run tests:

```bash
pytest
```

Format and lint:

```bash
black summarize_refine/
ruff check summarize_refine/
```

Input state:

```python
class InputState(TypedDict):
    input: Union[Source, dict, str, List[dict]]  # Content to summarize
    instructions: Optional[str]                  # Custom summarization instructions
    parent_configurable: Optional[dict]          # Config from parent graphs
```

Output state:

```python
class OutputState(TypedDict):
    summarized_source: List[Source]  # Source with summary
    summary: str                     # Final summary text
```

Source model:

```python
class Source(BaseModel):
    content: str              # Original content
    summary: Optional[str]    # Generated summary
    metadata: Optional[dict]  # Metadata
```
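For reference, the `OutputState` fields can be read off the result of `graph.ainvoke` like this (a short sketch following the usage examples above):

```python
import asyncio

from summarize_refine import graph


async def main():
    result = await graph.ainvoke({
        "input": "Your document...",
        "instructions": None,
        "parent_configurable": None
    })
    print(result["summary"])                 # final summary text
    for src in result["summarized_source"]:  # Source objects with content, summary, metadata
        print(src.metadata, src.summary)


asyncio.run(main())
```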
This agent can be used as a subgraph in larger LangGraph applications:

```python
from langgraph.graph import StateGraph
from summarize_refine import graph as summarize_graph

# Add as a subgraph
parent_builder.add_node("summarize", summarize_graph)
```
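The snippet above assumes an existing `parent_builder`. A fuller sketch of embedding the summarizer in a parent graph might look like this (the parent state and wiring are illustrative assumptions, not part of the package):

```python
from typing import List, Optional, TypedDict, Union

from langgraph.graph import END, START, StateGraph

from summarize_refine import graph as summarize_graph
from summarize_refine.utils.schemas import Source


class ParentState(TypedDict):
    # Keys shared with the subgraph's input/output states are passed through by name.
    input: Union[Source, dict, str, List[dict]]
    instructions: Optional[str]
    parent_configurable: Optional[dict]
    summary: str


parent_builder = StateGraph(ParentState)
parent_builder.add_node("summarize", summarize_graph)  # compiled subgraph as a node
parent_builder.add_edge(START, "summarize")
parent_builder.add_edge("summarize", END)
parent_graph = parent_builder.compile()
```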
Pass configuration via `parent_configurable`:

```python
{
    "input": "...",
    "parent_configurable": {
        "llm_summarizer": "anthropic/claude-3-5-sonnet-20240620",
        "chunk_size": 10000
    }
}
```

MIT License - see LICENSE for details.
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request