95 changes: 95 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,95 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

Reader3 is a lightweight, self-hosted EPUB reader web application. The core workflow is:
1. Process EPUB files into structured data using `reader3.py`
2. Serve the processed books via a FastAPI web server using `server.py`
3. Read books chapter-by-chapter with a clean web interface optimized for copying content to LLMs

## Development Commands

### Setup and Dependencies
The project uses [uv](https://docs.astral.sh/uv/) for dependency management. Python 3.10+ is required.

```bash
# Process an EPUB file (creates a {book_name}_data directory)
uv run reader3.py <path_to_book.epub>

# Start the web server (runs at http://127.0.0.1:8123)
uv run server.py
```

### Library Management
- Books are stored as `{book_name}_data/` directories containing:
- `book.pkl` - Pickled Book object with metadata, spine, TOC, and content
- `images/` - Extracted images from the EPUB
- To remove a book: delete its `_data` directory
- Server auto-discovers all `*_data` directories in the root folder
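The auto-discovery rule can be sketched in a few lines (illustrative only — a hypothetical `discover_books` helper, not the exact code in `server.py`):

```python
import os

def discover_books(root: str) -> list[str]:
    """Return every processed-book directory (anything ending in `_data`)."""
    return sorted(
        name for name in os.listdir(root)
        if name.endswith("_data") and os.path.isdir(os.path.join(root, name))
    )
```

Because discovery is purely directory-based, removing a book really is just deleting its `_data` directory — there is no index or database to update.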

## Architecture

### Core Data Model (reader3.py)

**Book Processing Pipeline:**
1. `process_epub()` - Main entry point that orchestrates EPUB parsing
2. EPUB parsing via ebooklib → extracts metadata, spine (linear reading order), TOC (navigation tree), and images
3. HTML cleaning → removes scripts, styles, forms, dangerous elements
4. Image path rewriting → converts EPUB-internal paths to local `images/` paths
5. Serialization → entire Book object pickled to `book.pkl`
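Step 4 amounts to flattening whatever internal directory layout the EPUB uses into the single `images/` folder. A minimal sketch (hypothetical helper name, not the exact code in `reader3.py`):

```python
import posixpath

def rewrite_image_src(src: str) -> str:
    # "OEBPS/images/cover.jpg" -> "images/cover.jpg": keep only the basename,
    # since all extracted images land flat in the book's images/ directory.
    return "images/" + posixpath.basename(src)
```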

**Key Data Structures:**
- `Book` - Master container with metadata, spine, toc, and image map
- `ChapterContent` - Represents a physical file in the EPUB spine (linear reading order). Contains cleaned HTML content and extracted plain text
- `TOCEntry` - Logical navigation entry (may have nested children). Maps to spine files via href matching
- `BookMetadata` - Standard DC metadata (title, authors, publisher, etc.)
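Their rough shape, inferred from the constructor calls visible in `reader3.py` (abridged — the real dataclasses may carry more fields):

```python
from dataclasses import dataclass, field

@dataclass
class TOCEntry:
    title: str
    href: str       # full href, possibly including an anchor
    file_href: str  # href with the anchor stripped (used for spine matching)
    anchor: str
    children: list["TOCEntry"] = field(default_factory=list)

@dataclass
class ChapterContent:
    id: str
    href: str
    title: str
    content: str  # cleaned HTML
    text: str     # extracted plain text
    order: int    # position in the spine
```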

**Critical Distinction:**
- **Spine** = Physical reading order (files as they appear in EPUB)
- **TOC** = Logical navigation tree (may reference multiple positions in the same file via anchors)
- Server routes use spine indices (`/read/{book_id}/{chapter_index}`) for linear navigation
- TOC entries map to spine via filename matching in JavaScript (see reader.html:124-151)

### Web Server (server.py)

**FastAPI Routes:**
- `GET /` - Library view listing all processed books
- `GET /read/{book_id}` - Redirects to first chapter (index 0)
- `GET /read/{book_id}/{chapter_index}` - Main reader interface with sidebar TOC
- `GET /read/{book_id}/images/{image_name}` - Serves extracted images

**Book Loading:**
- `load_book_cached()` uses `@lru_cache(maxsize=10)` to avoid repeated disk reads
- Books are loaded from pickle files on-demand
- Cache key is the folder name (e.g., "dracula_data")
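The caching pattern in miniature, with the pickle load stubbed out (returning a plain dict rather than a `Book`) so the sketch runs standalone:

```python
from functools import lru_cache

disk_reads = []

@lru_cache(maxsize=10)
def load_book_cached(folder_name: str) -> dict:
    disk_reads.append(folder_name)   # stands in for unpickling book.pkl
    return {"folder": folder_name}

load_book_cached("dracula_data")
load_book_cached("dracula_data")     # second call is served from the cache
```

Because the cache key is the folder name, reprocessing a book under the same name requires clearing the cache — which is why the upload route calls `load_book_cached.cache_clear()`.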

### Frontend (templates/)

**library.html** - Grid view of available books with basic metadata

**reader.html** - Two-column layout:
- Left sidebar: Nested TOC navigation tree (rendered via Jinja2 recursive macro)
- Right panel: Current chapter content with Previous/Next navigation
- JavaScript spine map (lines 127-131) enables TOC → chapter index lookup
- `findAndGo()` function (lines 133-151) handles TOC link clicks by mapping filenames to spine indices

## Dependencies

From pyproject.toml:
- `ebooklib` - EPUB parsing and manipulation
- `beautifulsoup4` - HTML parsing and cleaning
- `fastapi` - Web framework
- `jinja2` - Template engine
- `uvicorn` - ASGI server

## Project Philosophy

This is a minimal, "vibe-coded" project (per README) designed to illustrate reading books with LLMs. It intentionally avoids complexity:
- No database - just pickle files and directories
- No user accounts or authentication
- No advanced features (bookmarks, annotations, etc.)
- Simple file-based library management

When making changes, preserve this simplicity and avoid adding unnecessary abstractions or features.
2 changes: 2 additions & 0 deletions pyproject.toml
@@ -10,4 +10,6 @@ dependencies = [
    "fastapi>=0.121.2",
    "jinja2>=3.1.6",
    "uvicorn>=0.38.0",
    "pymupdf>=1.25.1",
    "python-multipart>=0.0.9",
]
177 changes: 172 additions & 5 deletions reader3.py
@@ -13,6 +13,7 @@
import ebooklib
from ebooklib import epub
from bs4 import BeautifulSoup, Comment
import fitz # PyMuPDF

# --- Data structures ---

@@ -283,6 +284,160 @@ def process_epub(epub_path: str, output_dir: str) -> Book:
    return final_book


def process_pdf(pdf_path: str, output_dir: str) -> Book:
    """
    Process a PDF file into a Book object.
    Attempts to extract TOC/bookmarks; falls back to page-based chunking if unavailable.
    """
    print(f"Loading {pdf_path}...")
    doc = fitz.open(pdf_path)

    # 1. Extract Metadata
    metadata_dict = doc.metadata
    title = metadata_dict.get('title', os.path.splitext(os.path.basename(pdf_path))[0])
    if not title or title.strip() == '':
        title = os.path.splitext(os.path.basename(pdf_path))[0]

    author = metadata_dict.get('author', 'Unknown')
    authors = [author] if author else []

    metadata = BookMetadata(
        title=title,
        language="en",
        authors=authors,
        description=metadata_dict.get('subject'),
        publisher=metadata_dict.get('producer'),
        date=metadata_dict.get('creationDate')
    )

    # 2. Prepare Output Directory
    if os.path.exists(output_dir):
        shutil.rmtree(output_dir)
    images_dir = os.path.join(output_dir, 'images')
    os.makedirs(images_dir, exist_ok=True)

    # 3. Try to Extract TOC/Outline
    print("Extracting Table of Contents...")
    toc_outline = doc.get_toc(simple=False)

    spine_chapters = []
    toc_structure = []

    if toc_outline and len(toc_outline) > 0:
        # TOC exists - use it to create chapters
        print(f"Found {len(toc_outline)} TOC entries")

        # Build chapter ranges from the TOC. With simple=False each entry is
        # [level, title, page_num, dest], so slice off the first three fields.
        chapter_ranges = []
        for i, entry in enumerate(toc_outline):
            level, entry_title, page_num = entry[:3]
            start_page = page_num - 1  # fitz uses 0-based indexing

            # Find end page (start of next entry at same or higher level)
            end_page = len(doc) - 1
            for j in range(i + 1, len(toc_outline)):
                next_level, next_page = toc_outline[j][0], toc_outline[j][2]
                if next_level <= level:
                    end_page = next_page - 2
                    break

            chapter_ranges.append({
                'level': level,
                'title': entry_title,
                'start': start_page,
                'end': end_page,
                'order': i
            })

        # Create ChapterContent objects from TOC entries
        for i, chapter_info in enumerate(chapter_ranges):
            text_parts = []
            html_parts = ["<div>"]

            for page_num in range(chapter_info['start'], chapter_info['end'] + 1):
                if page_num < 0 or page_num >= len(doc):
                    continue
                page = doc[page_num]
                page_text = page.get_text()
                text_parts.append(page_text)
                # Substitution is pre-computed because backslashes are not
                # allowed inside f-string expressions before Python 3.12.
                page_html = page_text.replace("\n", "<br>")
                html_parts.append(f"<p>{page_html}</p>")

            html_parts.append("</div>")

            chapter = ChapterContent(
                id=f"chapter_{i}",
                href=f"chapter_{i}.html",
                title=chapter_info['title'],
                content="".join(html_parts),
                text=" ".join(text_parts),
                order=i
            )
            spine_chapters.append(chapter)

            # Build TOC structure (flat for now - a nested TOC would be more complex)
            toc_entry = TOCEntry(
                title=chapter_info['title'],
                href=f"chapter_{i}.html",
                file_href=f"chapter_{i}.html",
                anchor=""
            )
            toc_structure.append(toc_entry)

    else:
        # No TOC - fall back to page-based chunking
        print("No TOC found, using page-based chunking...")
        pages_per_chapter = 10
        total_pages = len(doc)

        for chunk_start in range(0, total_pages, pages_per_chapter):
            chunk_end = min(chunk_start + pages_per_chapter, total_pages)
            chapter_num = chunk_start // pages_per_chapter

            text_parts = []
            html_parts = ["<div>"]

            for page_num in range(chunk_start, chunk_end):
                page = doc[page_num]
                page_text = page.get_text()
                text_parts.append(page_text)
                page_html = page_text.replace("\n", "<br>")
                html_parts.append(f"<p>{page_html}</p>")

            html_parts.append("</div>")

            # Named chunk_title to avoid shadowing the book title above
            chunk_title = f"Pages {chunk_start + 1}-{chunk_end}"
            chapter = ChapterContent(
                id=f"chapter_{chapter_num}",
                href=f"chapter_{chapter_num}.html",
                title=chunk_title,
                content="".join(html_parts),
                text=" ".join(text_parts),
                order=chapter_num
            )
            spine_chapters.append(chapter)

            toc_entry = TOCEntry(
                title=chunk_title,
                href=f"chapter_{chapter_num}.html",
                file_href=f"chapter_{chapter_num}.html",
                anchor=""
            )
            toc_structure.append(toc_entry)

    doc.close()

    # 4. Create Book object
    final_book = Book(
        metadata=metadata,
        spine=spine_chapters,
        toc=toc_structure,
        images={},  # PDF image extraction can be added later if needed
        source_file=os.path.basename(pdf_path),
        processed_at=datetime.now().isoformat()
    )

    return final_book


def save_to_pickle(book: Book, output_dir: str):
    p_path = os.path.join(output_dir, 'book.pkl')
    with open(p_path, 'wb') as f:
@@ -296,14 +451,26 @@ def save_to_pickle(book: Book, output_dir: str):

    import sys
    if len(sys.argv) < 2:
        print("Usage: python reader3.py <file.epub|file.pdf>")
        sys.exit(1)

    input_file = sys.argv[1]
    assert os.path.exists(input_file), "File not found."

    # Detect file type
    file_ext = os.path.splitext(input_file)[1].lower()
    out_dir = os.path.splitext(input_file)[0] + "_data"

    # Process based on file type
    if file_ext == '.epub':
        book_obj = process_epub(input_file, out_dir)
    elif file_ext == '.pdf':
        book_obj = process_pdf(input_file, out_dir)
    else:
        print(f"Unsupported file type: {file_ext}")
        print("Supported formats: .epub, .pdf")
        sys.exit(1)

    save_to_pickle(book_obj, out_dir)
    print("\n--- Summary ---")
    print(f"Title: {book_obj.metadata.title}")
64 changes: 61 additions & 3 deletions server.py
@@ -1,14 +1,16 @@
import os
import pickle
import tempfile
import shutil
from functools import lru_cache
from typing import Optional

from fastapi import FastAPI, Request, HTTPException, UploadFile, File
from fastapi.responses import HTMLResponse, FileResponse, RedirectResponse
from fastapi.staticfiles import StaticFiles
from fastapi.templating import Jinja2Templates

from reader3 import Book, BookMetadata, ChapterContent, TOCEntry, process_epub, process_pdf, save_to_pickle

app = FastAPI()
templates = Jinja2Templates(directory="templates")
@@ -104,6 +106,62 @@ async def serve_image(book_id: str, image_name: str):

return FileResponse(img_path)

@app.get("/upload", response_class=HTMLResponse)
async def upload_page(request: Request):
    """Display the upload form."""
    return templates.TemplateResponse("upload.html", {"request": request})

@app.post("/upload")
async def upload_book(file: UploadFile = File(...)):
    """
    Handle book upload and processing.
    Accepts EPUB or PDF files, processes them, and redirects to the library.
    """
    # Validate file type (filename may be None for a malformed upload)
    filename = file.filename or ""
    file_ext = os.path.splitext(filename)[1].lower()

    if file_ext not in ['.epub', '.pdf']:
        raise HTTPException(status_code=400, detail="Only EPUB and PDF files are supported")

    tmp_path = None
    try:
        # Save the upload to a temporary file
        with tempfile.NamedTemporaryFile(delete=False, suffix=file_ext) as tmp_file:
            content = await file.read()
            tmp_file.write(content)
            tmp_path = tmp_file.name

        # Determine the output directory, sanitizing the name for the filesystem
        base_name = os.path.splitext(filename)[0]
        safe_base_name = "".join(c for c in base_name if c.isalnum() or c in (' ', '-', '_')).strip()
        out_dir = os.path.join(BOOKS_DIR, f"{safe_base_name}_data")

        # Process based on file type
        if file_ext == '.epub':
            book_obj = process_epub(tmp_path, out_dir)
        else:  # '.pdf' - the only other extension that passes validation
            book_obj = process_pdf(tmp_path, out_dir)

        # Save to pickle
        save_to_pickle(book_obj, out_dir)

        # Clean up temporary file
        os.unlink(tmp_path)

        # Clear the cache so the new book appears
        load_book_cached.cache_clear()

        # Redirect to library
        return RedirectResponse(url="/", status_code=303)

    except Exception as e:
        # Clean up on error
        if tmp_path and os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise HTTPException(status_code=500, detail=f"Error processing book: {str(e)}")

if __name__ == "__main__":
    import uvicorn
    print("Starting server at http://127.0.0.1:8123")