95 changes: 95 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,95 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

Reader3 is a lightweight, self-hosted EPUB reader web application. The core workflow is:
1. Process EPUB files into structured data using `reader3.py`
2. Serve the processed books via a FastAPI web server using `server.py`
3. Read books chapter-by-chapter with a clean web interface optimized for copying content to LLMs

## Development Commands

### Setup and Dependencies
The project uses [uv](https://docs.astral.sh/uv/) for dependency management. Python 3.10+ is required.

```bash
# Process an EPUB file (creates a {book_name}_data directory)
uv run reader3.py <path_to_book.epub>

# Start the web server (runs at http://127.0.0.1:8123)
uv run server.py
```

### Library Management
- Books are stored as `{book_name}_data/` directories containing:
- `book.pkl` - Pickled Book object with metadata, spine, TOC, and content
- `images/` - Extracted images from the EPUB
- To remove a book: delete its `_data` directory
- Server auto-discovers all `*_data` directories in the root folder
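The auto-discovery rule can be sketched in a few lines (illustrative only — a hypothetical `discover_books` helper, not the exact code in `server.py`):

```python
import os

def discover_books(root: str) -> list[str]:
    """Return every processed-book directory (anything ending in `_data`)."""
    return sorted(
        name for name in os.listdir(root)
        if name.endswith("_data") and os.path.isdir(os.path.join(root, name))
    )
```

Because discovery is purely directory-based, removing a book really is just deleting its `_data` directory — there is no index or database to update.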

## Architecture

### Core Data Model (reader3.py)

**Book Processing Pipeline:**
1. `process_epub()` - Main entry point that orchestrates EPUB parsing
2. EPUB parsing via ebooklib → extracts metadata, spine (linear reading order), TOC (navigation tree), and images
3. HTML cleaning → removes scripts, styles, forms, dangerous elements
4. Image path rewriting → converts EPUB-internal paths to local `images/` paths
5. Serialization → entire Book object pickled to `book.pkl`
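Step 4 amounts to flattening whatever internal directory layout the EPUB uses into the single `images/` folder. A minimal sketch (hypothetical helper name, not the exact code in `reader3.py`):

```python
import posixpath

def rewrite_image_src(src: str) -> str:
    # "OEBPS/images/cover.jpg" -> "images/cover.jpg": keep only the basename,
    # since all extracted images land flat in the book's images/ directory.
    return "images/" + posixpath.basename(src)
```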

**Key Data Structures:**
- `Book` - Master container with metadata, spine, toc, and image map
- `ChapterContent` - Represents a physical file in the EPUB spine (linear reading order). Contains cleaned HTML content and extracted plain text
- `TOCEntry` - Logical navigation entry (may have nested children). Maps to spine files via href matching
- `BookMetadata` - Standard DC metadata (title, authors, publisher, etc.)
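Their rough shape, inferred from the constructor calls visible in `reader3.py` (abridged — the real dataclasses may carry more fields):

```python
from dataclasses import dataclass, field

@dataclass
class TOCEntry:
    title: str
    href: str       # full href, possibly including an anchor
    file_href: str  # href with the anchor stripped (used for spine matching)
    anchor: str
    children: list["TOCEntry"] = field(default_factory=list)

@dataclass
class ChapterContent:
    id: str
    href: str
    title: str
    content: str  # cleaned HTML
    text: str     # extracted plain text
    order: int    # position in the spine
```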

**Critical Distinction:**
- **Spine** = Physical reading order (files as they appear in EPUB)
- **TOC** = Logical navigation tree (may reference multiple positions in the same file via anchors)
- Server routes use spine indices (`/read/{book_id}/{chapter_index}`) for linear navigation
- TOC entries map to spine via filename matching in JavaScript (see reader.html:124-151)

### Web Server (server.py)

**FastAPI Routes:**
- `GET /` - Library view listing all processed books
- `GET /read/{book_id}` - Redirects to first chapter (index 0)
- `GET /read/{book_id}/{chapter_index}` - Main reader interface with sidebar TOC
- `GET /read/{book_id}/images/{image_name}` - Serves extracted images

**Book Loading:**
- `load_book_cached()` uses `@lru_cache(maxsize=10)` to avoid repeated disk reads
- Books are loaded from pickle files on-demand
- Cache key is the folder name (e.g., "dracula_data")
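The caching pattern in miniature, with the pickle load stubbed out (returning a plain dict rather than a `Book`) so the sketch runs standalone:

```python
from functools import lru_cache

disk_reads = []

@lru_cache(maxsize=10)
def load_book_cached(folder_name: str) -> dict:
    disk_reads.append(folder_name)   # stands in for unpickling book.pkl
    return {"folder": folder_name}

load_book_cached("dracula_data")
load_book_cached("dracula_data")     # second call is served from the cache
```

Because the cache key is the folder name, reprocessing a book under the same name requires clearing the cache — which is why the upload route calls `load_book_cached.cache_clear()`.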

### Frontend (templates/)

**library.html** - Grid view of available books with basic metadata

**reader.html** - Two-column layout:
- Left sidebar: Nested TOC navigation tree (rendered via Jinja2 recursive macro)
- Right panel: Current chapter content with Previous/Next navigation
- JavaScript spine map (lines 127-131) enables TOC → chapter index lookup
- `findAndGo()` function (lines 133-151) handles TOC link clicks by mapping filenames to spine indices

## Dependencies

From pyproject.toml:
- `ebooklib` - EPUB parsing and manipulation
- `beautifulsoup4` - HTML parsing and cleaning
- `fastapi` - Web framework
- `jinja2` - Template engine
- `uvicorn` - ASGI server

## Project Philosophy

This is a minimal, "vibe-coded" project (per README) designed to illustrate reading books with LLMs. It intentionally avoids complexity:
- No database - just pickle files and directories
- No user accounts or authentication
- No advanced features (bookmarks, annotations, etc.)
- Simple file-based library management

When making changes, preserve this simplicity and avoid adding unnecessary abstractions or features.
2 changes: 2 additions & 0 deletions pyproject.toml
@@ -10,4 +10,6 @@ dependencies = [
    "fastapi>=0.121.2",
    "jinja2>=3.1.6",
    "uvicorn>=0.38.0",
    "pymupdf>=1.25.1",
    "python-multipart>=0.0.9",
]
177 changes: 172 additions & 5 deletions reader3.py
@@ -13,6 +13,7 @@
import ebooklib
from ebooklib import epub
from bs4 import BeautifulSoup, Comment
import fitz # PyMuPDF

# --- Data structures ---

@@ -283,6 +284,160 @@ def process_epub(epub_path: str, output_dir: str) -> Book:
    return final_book


def process_pdf(pdf_path: str, output_dir: str) -> Book:
    """
    Process a PDF file into a Book object.
    Attempts to extract TOC/bookmarks; falls back to page-based chunking if unavailable.
    """
    print(f"Loading {pdf_path}...")
    doc = fitz.open(pdf_path)

    # 1. Extract Metadata
    metadata_dict = doc.metadata
    title = metadata_dict.get('title', os.path.splitext(os.path.basename(pdf_path))[0])
    if not title or title.strip() == '':
        title = os.path.splitext(os.path.basename(pdf_path))[0]

    author = metadata_dict.get('author', 'Unknown')
    authors = [author] if author else []

    metadata = BookMetadata(
        title=title,
        language="en",
        authors=authors,
        description=metadata_dict.get('subject'),
        publisher=metadata_dict.get('producer'),
        date=metadata_dict.get('creationDate')
    )

    # 2. Prepare Output Directory
    if os.path.exists(output_dir):
        shutil.rmtree(output_dir)
    images_dir = os.path.join(output_dir, 'images')
    os.makedirs(images_dir, exist_ok=True)

    # 3. Try to Extract TOC/Outline
    print("Extracting Table of Contents...")
    toc_outline = doc.get_toc(simple=False)

    spine_chapters = []
    toc_structure = []

    if toc_outline and len(toc_outline) > 0:
        # TOC exists - use it to create chapters
        print(f"Found {len(toc_outline)} TOC entries")

        # Build chapter ranges from the TOC. With simple=False each entry is
        # [level, title, page_num, dest], so slice off the first three fields.
        chapter_ranges = []
        for i, entry in enumerate(toc_outline):
            level, entry_title, page_num = entry[:3]
            start_page = page_num - 1  # fitz uses 0-based indexing

            # Find end page (start of next entry at same or higher level)
            end_page = len(doc) - 1
            for j in range(i + 1, len(toc_outline)):
                next_level, next_page = toc_outline[j][0], toc_outline[j][2]
                if next_level <= level:
                    end_page = next_page - 2
                    break

            chapter_ranges.append({
                'level': level,
                'title': entry_title,
                'start': start_page,
                'end': end_page,
                'order': i
            })

        # Create ChapterContent objects from TOC entries
        for i, chapter_info in enumerate(chapter_ranges):
            text_parts = []
            html_parts = ["<div>"]

            for page_num in range(chapter_info['start'], chapter_info['end'] + 1):
                if page_num < 0 or page_num >= len(doc):
                    continue
                page = doc[page_num]
                page_text = page.get_text()
                text_parts.append(page_text)
                # Substitution is pre-computed because backslashes are not
                # allowed inside f-string expressions before Python 3.12.
                page_html = page_text.replace("\n", "<br>")
                html_parts.append(f"<p>{page_html}</p>")

            html_parts.append("</div>")

            chapter = ChapterContent(
                id=f"chapter_{i}",
                href=f"chapter_{i}.html",
                title=chapter_info['title'],
                content="".join(html_parts),
                text=" ".join(text_parts),
                order=i
            )
            spine_chapters.append(chapter)

            # Build TOC structure (flat for now - a nested TOC would be more complex)
            toc_entry = TOCEntry(
                title=chapter_info['title'],
                href=f"chapter_{i}.html",
                file_href=f"chapter_{i}.html",
                anchor=""
            )
            toc_structure.append(toc_entry)

    else:
        # No TOC - fall back to page-based chunking
        print("No TOC found, using page-based chunking...")
        pages_per_chapter = 10
        total_pages = len(doc)

        for chunk_start in range(0, total_pages, pages_per_chapter):
            chunk_end = min(chunk_start + pages_per_chapter, total_pages)
            chapter_num = chunk_start // pages_per_chapter

            text_parts = []
            html_parts = ["<div>"]

            for page_num in range(chunk_start, chunk_end):
                page = doc[page_num]
                page_text = page.get_text()
                text_parts.append(page_text)
                page_html = page_text.replace("\n", "<br>")
                html_parts.append(f"<p>{page_html}</p>")

            html_parts.append("</div>")

            # Named chunk_title to avoid shadowing the book title above
            chunk_title = f"Pages {chunk_start + 1}-{chunk_end}"
            chapter = ChapterContent(
                id=f"chapter_{chapter_num}",
                href=f"chapter_{chapter_num}.html",
                title=chunk_title,
                content="".join(html_parts),
                text=" ".join(text_parts),
                order=chapter_num
            )
            spine_chapters.append(chapter)

            toc_entry = TOCEntry(
                title=chunk_title,
                href=f"chapter_{chapter_num}.html",
                file_href=f"chapter_{chapter_num}.html",
                anchor=""
            )
            toc_structure.append(toc_entry)

    doc.close()

    # 4. Create Book object
    final_book = Book(
        metadata=metadata,
        spine=spine_chapters,
        toc=toc_structure,
        images={},  # PDF image extraction can be added later if needed
        source_file=os.path.basename(pdf_path),
        processed_at=datetime.now().isoformat()
    )

    return final_book


def save_to_pickle(book: Book, output_dir: str):
    p_path = os.path.join(output_dir, 'book.pkl')
    with open(p_path, 'wb') as f:
@@ -296,14 +451,26 @@ def save_to_pickle(book: Book, output_dir: str):

    import sys
    if len(sys.argv) < 2:
        print("Usage: python reader3.py <file.epub|file.pdf>")
        sys.exit(1)

    input_file = sys.argv[1]
    assert os.path.exists(input_file), "File not found."

    # Detect file type
    file_ext = os.path.splitext(input_file)[1].lower()
    out_dir = os.path.splitext(input_file)[0] + "_data"

    # Process based on file type
    if file_ext == '.epub':
        book_obj = process_epub(input_file, out_dir)
    elif file_ext == '.pdf':
        book_obj = process_pdf(input_file, out_dir)
    else:
        print(f"Unsupported file type: {file_ext}")
        print("Supported formats: .epub, .pdf")
        sys.exit(1)

    save_to_pickle(book_obj, out_dir)
    print("\n--- Summary ---")
    print(f"Title: {book_obj.metadata.title}")
64 changes: 61 additions & 3 deletions server.py
@@ -1,14 +1,16 @@
import os
import pickle
import tempfile
import shutil
from functools import lru_cache
from typing import Optional

from fastapi import FastAPI, Request, HTTPException, UploadFile, File
from fastapi.responses import HTMLResponse, FileResponse, RedirectResponse
from fastapi.staticfiles import StaticFiles
from fastapi.templating import Jinja2Templates

from reader3 import Book, BookMetadata, ChapterContent, TOCEntry, process_epub, process_pdf, save_to_pickle

app = FastAPI()
templates = Jinja2Templates(directory="templates")
@@ -104,6 +106,62 @@ async def serve_image(book_id: str, image_name: str):

return FileResponse(img_path)

@app.get("/upload", response_class=HTMLResponse)
async def upload_page(request: Request):
    """Display the upload form."""
    return templates.TemplateResponse("upload.html", {"request": request})

@app.post("/upload")
async def upload_book(file: UploadFile = File(...)):
    """
    Handle book upload and processing.
    Accepts EPUB or PDF files, processes them, and redirects to the library.
    """
    # Validate file type (filename may be None for a malformed upload)
    filename = file.filename or ""
    file_ext = os.path.splitext(filename)[1].lower()

    if file_ext not in ['.epub', '.pdf']:
        raise HTTPException(status_code=400, detail="Only EPUB and PDF files are supported")

    tmp_path = None
    try:
        # Save the upload to a temporary file
        with tempfile.NamedTemporaryFile(delete=False, suffix=file_ext) as tmp_file:
            content = await file.read()
            tmp_file.write(content)
            tmp_path = tmp_file.name

        # Determine the output directory, sanitizing the name for the filesystem
        base_name = os.path.splitext(filename)[0]
        safe_base_name = "".join(c for c in base_name if c.isalnum() or c in (' ', '-', '_')).strip()
        out_dir = os.path.join(BOOKS_DIR, f"{safe_base_name}_data")

        # Process based on file type
        if file_ext == '.epub':
            book_obj = process_epub(tmp_path, out_dir)
        else:  # '.pdf' - the only other extension that passes validation
            book_obj = process_pdf(tmp_path, out_dir)

        # Save to pickle
        save_to_pickle(book_obj, out_dir)

        # Clean up temporary file
        os.unlink(tmp_path)

        # Clear the cache so the new book appears
        load_book_cached.cache_clear()

        # Redirect to library
        return RedirectResponse(url="/", status_code=303)

    except Exception as e:
        # Clean up on error
        if tmp_path and os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise HTTPException(status_code=500, detail=f"Error processing book: {str(e)}")

if __name__ == "__main__":
    import uvicorn
    print("Starting server at http://127.0.0.1:8123")