SomeSunlight/confluenceDumpWithPython

 
 

Confluence Dump with Python

This toolbox exports content from a Confluence instance (Cloud or Data Center) into a static, navigable HTML archive and converts it into professional, hierarchical PDF documents.

Key Features:

  • Visual Fidelity: Fetches rendered HTML (export_view) to preserve macros, layouts, and formatting.
  • Navigation: Injects a fully functional, static navigation sidebar into every HTML page.
  • Offline Browsing: Localizes images and links, and downloads all attachments (PDFs, Office docs, etc.) for complete offline access.
  • Sort Order: Recursively scans the tree to ensure the manual sort order from Confluence is preserved.
  • Metadata Injection: Automatically adds Page Title, Author, and Modification Date to the top of every page.
  • Versioning: Creates timestamped output folders (e.g., 2025-11-21 1400 Space IT) for clean history management.
  • Professional PDF: Merges the content into a single PDF with TOC, Bookmarks, and mixed Portrait/Landscape orientation.
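
To illustrate the versioning scheme, here is a minimal sketch of how a folder name such as 2025-11-21 1400 Space IT could be built. The helper name export_folder_name and the character-replacement rule are assumptions made for this example, not the toolbox's actual code.

```python
import re
from datetime import datetime


def export_folder_name(title: str, when: datetime) -> str:
    """Build a name like '2025-11-21 1400 Space IT' (hypothetical helper)."""
    # Assumed rule: replace characters that are not allowed in Windows folder names.
    safe_title = re.sub(r'[\\/:*?"<>|]', "_", title).strip()
    return f"{when.strftime('%Y-%m-%d %H%M')} {safe_title}"


print(export_folder_name("Space IT", datetime(2025, 11, 21, 14, 0)))
# → 2025-11-21 1400 Space IT
```

Because the timestamp comes first and is zero-padded, sorting the folders by name also sorts them chronologically.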

Toolbox Overview (Key Files)

  • confluenceDumpToHTML.py: The main downloader. Connects to Confluence, scrapes content, and creates the folder structure.
  • htmlToDoc.py: The publisher. Converts the downloaded HTML folder into a single PDF or a Master-HTML file for LLMs.
  • confluence_products.ini: Configuration file for API URLs (Cloud vs. Data Center).
  • styles/: Contains CSS files. site.css (if present) is applied automatically. pdf_settings.css configures the PDF layout (A4/Letter, Margins).

Quick Start Guide

Follow these steps to create your first PDF export of a single page tree.

1. Setup

Install requirements and set your credentials.

pip install -r requirements.txt
# Windows Users: Install GTK3 Runtime for PDF generation!

Linux/Mac:

export CONFLUENCE_TOKEN="YourPersonalAccessToken"

Windows (Powershell):

$env:CONFLUENCE_TOKEN="YourPersonalAccessToken"

2. The Dump (Download)

Run the dumper for a specific page tree. This will create a new folder in output/.

# Example for Data Center
python3 confluenceDumpToHTML.py --base-url "https://confluence.corp.com" --profile dc --context-path "/wiki" -o "./output" tree -p "123456"

3. The Publication (PDF)

Look in the output folder. It now contains a new folder such as 2025-01-27 0900 My Page Title. Pass this path to the PDF generator.

python3 htmlToDoc.py --site-dir "./output/2025-01-27 0900 My Page Title" --pdf

Result: You now have a ... .pdf inside that folder.
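
If you script both steps, the newest export folder can be located by name. The following is a small sketch, assuming every folder under output/ uses the timestamp prefix described above. The helper latest_export_dir is hypothetical and not part of the toolbox.

```python
from pathlib import Path
from typing import Optional


def latest_export_dir(output_root: str) -> Optional[Path]:
    """Return the newest export folder under output_root, or None if there is none."""
    # Names start with 'YYYY-MM-DD HHMM', so sorting by name is also chronological.
    folders = sorted(p for p in Path(output_root).iterdir() if p.is_dir())
    return folders[-1] if folders else None
```

The returned path can then be passed to htmlToDoc.py as the --site-dir value.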

Platform Support & Authentication

This script supports both Confluence Cloud and Confluence Data Center.

⚠️ Note on Cloud Verification: Cloud support has been ported to the new architecture, but the tool was primarily developed and tested against a Confluence Data Center environment.

Configuration

Define API paths in confluence_products.ini. Authentication is handled via Environment Variables:

  • Cloud: CONFLUENCE_USER (Email) and CONFLUENCE_TOKEN (API Token).
  • Data Center: CONFLUENCE_TOKEN (Personal Access Token).

⚠️ Troubleshooting Note for Data Center: If authentication fails, ensure you are connected to the VPN and that your admin allows Personal Access Tokens (PAT).
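
For orientation, here is a hedged sketch of how these variables typically map to HTTP headers: Confluence Cloud accepts Basic authentication with email and API token, while Data Center Personal Access Tokens are sent as a Bearer token. The function build_auth_headers and the profile name "cloud" are assumptions for this example (only "dc" appears in the commands above), and the toolbox's own code may differ.

```python
import base64
import os
from typing import Dict, Mapping


def build_auth_headers(profile: str, env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return an Authorization header for the given profile (hypothetical helper)."""
    token = env.get("CONFLUENCE_TOKEN")
    if not token:
        raise RuntimeError("CONFLUENCE_TOKEN is not set")
    if profile == "cloud":  # assumed profile name
        user = env.get("CONFLUENCE_USER")
        if not user:
            raise RuntimeError("CONFLUENCE_USER is not set (required for Cloud)")
        # Cloud: Basic auth with 'email:api_token', base64-encoded.
        credentials = base64.b64encode(f"{user}:{token}".encode("utf-8")).decode("ascii")
        return {"Authorization": f"Basic {credentials}"}
    # Data Center: the Personal Access Token is sent as a Bearer token.
    return {"Authorization": f"Bearer {token}"}
```

Failing early on a missing variable gives a clearer message than a 401 response from the server.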

Detailed Usage: Stage 1 (HTML Export)

Downloads pages, builds the index, and creates a clean HTML base.

python3 confluenceDumpToHTML.py [OPTIONS] <COMMAND> [ARGS]

Commands

  • space: Dumps an entire space. (-sp SPACEKEY)
  • tree: Dumps a specific page and its descendants. (-p PAGEID)
  • single: Dumps a single page. (-p PAGEID)
  • label: "Forest Mode". Dumps all pages with a specific label as root trees. (-l LABEL)
    • Use --exclude-label to prune specific subtrees (e.g. 'archived').
  • all-spaces: Dumps all visible spaces.
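
The label command with --exclude-label can be pictured as a tree walk: every page carrying the label becomes a root, and any subtree whose top page carries the excluded label is skipped. The sketch below uses a hypothetical in-memory page tree (PAGES) and a hypothetical function collect_forest. It shows the idea only, not the dumper's implementation.

```python
from typing import Dict, List, Optional, Set

# Hypothetical page tree: page id -> labels and ordered child ids.
PAGES: Dict[str, dict] = {
    "1": {"labels": {"handbook"}, "children": ["2", "3"]},
    "2": {"labels": set(), "children": ["4"]},
    "3": {"labels": {"archived"}, "children": ["5"]},
    "4": {"labels": set(), "children": []},
    "5": {"labels": set(), "children": []},
    "6": {"labels": {"handbook"}, "children": []},
}


def collect_forest(pages: Dict[str, dict], label: str,
                   exclude_label: Optional[str] = None) -> List[str]:
    """Return the ids of all pages to dump, in tree order."""
    selected: List[str] = []
    seen: Set[str] = set()

    def visit(page_id: str) -> None:
        page = pages[page_id]
        # Skip pages already collected and prune excluded subtrees.
        if page_id in seen or (exclude_label and exclude_label in page["labels"]):
            return
        seen.add(page_id)
        selected.append(page_id)
        for child_id in page["children"]:
            visit(child_id)

    for page_id, page in pages.items():
        if label in page["labels"]:
            visit(page_id)
    return selected


print(collect_forest(PAGES, "handbook", exclude_label="archived"))
# → ['1', '2', '4', '6']
```

Page 3 is labeled archived, so it and its child 5 are left out, while page 6 is picked up as a second root.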

Common Options

  • -o, --outdir: Base output directory.
  • -t, --threads: Number of download threads (e.g., -t 8).
  • --css-file: Path to custom CSS (applied after standard styles).
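
To show what a thread count such as -t 8 means in practice, here is a minimal sketch of parallel downloads with a thread pool. The names download_all and fetch are placeholders for this example, and the dumper's internal design is not documented here.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def download_all(page_ids: Iterable[str], fetch: Callable[[str], str],
                 threads: int = 8) -> Dict[str, str]:
    """Fetch all pages concurrently and return a mapping of page id to content."""
    ids = list(page_ids)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # map() yields results in input order, even if downloads finish out of order.
        return dict(zip(ids, pool.map(fetch, ids)))


# Stand-in for a real HTTP request, so the example runs without a server.
pages = download_all(["101", "102", "103"], lambda pid: f"<h1>Page {pid}</h1>", threads=2)
print(pages["102"])
# → <h1>Page 102</h1>
```

Downloads are I/O-bound, so threads help despite Python's global interpreter lock. A higher thread count also puts more load on the Confluence server.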

Handling Complex Macros (Manual Overrides)

The Problem: Some Confluence pages (e.g. complex Table Filters) fail to render via the API because of server-side timeouts or heavy client-side JavaScript.

The Solution:

  1. Open the page in Chrome/Edge.
  2. Choose "Save as" and select the type "Webpage, Single File (*.mhtml)".
  3. Name the file manual_overrides/[PageID].mhtml.
  4. Run the dumper with --manual-overrides-dir "./manual_overrides". The script will extract the rendered state from the MHTML, clean it, and inject it into the pipeline.
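
An MHTML file is a MIME multipart message, so the rendered HTML can be read with Python's standard email package. The sketch below shows that idea only. The function extract_html_from_mhtml is hypothetical, and the script's own extraction and cleaning steps may work differently.

```python
import email
from email import policy


def extract_html_from_mhtml(raw: bytes) -> str:
    """Return the first text/html part of an MHTML document as a string."""
    message = email.message_from_bytes(raw, policy=policy.default)
    for part in message.walk():
        if part.get_content_type() == "text/html":
            # get_content() undoes quoted-printable/base64 and decodes the charset.
            return part.get_content()
    raise ValueError("no text/html part found")


# Tiny hand-made MHTML document for demonstration purposes.
sample = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/related; boundary="BOUNDARY"\r\n'
    b"\r\n"
    b"--BOUNDARY\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"<html><body><p>caf=C3=A9</p></body></html>\r\n"
    b"--BOUNDARY--\r\n"
)
html = extract_html_from_mhtml(sample)
```

For a real file, read it with Path("manual_overrides/123456.mhtml").read_bytes(), where 123456 stands for the page ID.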

Detailed Usage: Stage 2 (Architecture Sandbox)

Allows re-organizing the structure (Index) locally without touching Confluence.

  1. Generate Editor:
    python3 create_editor.py --site-dir "./output/2025-01-01 Space IT"
    
  2. Edit: Open editor_sidebar.html. Use Drag & Drop to move pages/folders.
  3. Save: Click "Copy Markdown", paste into sidebar_edit.md.
  4. Apply:
    python3 patch_sidebar.py --site-dir "./output/2025-01-01 Space IT"
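
The exact format of sidebar_edit.md is not described in this README. Purely as an illustration, and assuming an indented Markdown bullet list with two spaces per level, an outline could be turned back into a hierarchy like this. The function parse_outline is hypothetical and is not what patch_sidebar.py necessarily does.

```python
from typing import List, Tuple


def parse_outline(markdown: str, indent: int = 2) -> List[dict]:
    """Parse an indented '- Title' list into nested {'title', 'children'} nodes."""
    roots: List[dict] = []
    stack: List[Tuple[int, dict]] = []  # chain of (depth, node) ancestors
    for line in markdown.splitlines():
        stripped = line.lstrip(" ")
        if not stripped.startswith("- "):
            continue  # ignore blank lines and anything that is not a list item
        depth = (len(line) - len(stripped)) // indent
        node = {"title": stripped[2:].strip(), "children": []}
        # Drop ancestors that sit at the same depth or deeper than this item.
        while stack and stack[-1][0] >= depth:
            stack.pop()
        (stack[-1][1]["children"] if stack else roots).append(node)
        stack.append((depth, node))
    return roots


outline = "- Home\n  - Setup\n    - Windows\n  - Usage\n- Archive\n"
tree = parse_outline(outline)
print([node["title"] for node in tree])
# → ['Home', 'Archive']
```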
    

Detailed Usage: Stage 3 (Document Generation)

Converts the dumped pages into a single document.

python3 htmlToDoc.py --site-dir "./output/2025-01-01 Space IT" --pdf

Options

  • --pdf: Generate PDF (via WeasyPrint).
  • --html: Generate single-file Master HTML (for LLM context windows).
  • --preview: Generate debug HTML (linked to local CSS).

Customizing the PDF

The layout is controlled by CSS files in the styles/ folder of your export:

  • pdf_settings.css: Configure Page Size (A4/Letter), Orientation, and Margins.
  • site.css: General styles (detected automatically).
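
Page size, orientation, and margins are expressed in CSS paged media with an @page rule, which WeasyPrint supports. The contents of the shipped pdf_settings.css are not shown here, so the following sketch only demonstrates the kind of rule involved. The helper page_rule and its default values are assumptions.

```python
def page_rule(size: str = "A4", orientation: str = "portrait", margin: str = "2cm") -> str:
    """Return a CSS @page rule for the given page size, orientation and margin."""
    if orientation not in ("portrait", "landscape"):
        raise ValueError(f"unknown orientation: {orientation!r}")
    return f"@page {{ size: {size} {orientation}; margin: {margin}; }}"


print(page_rule("Letter", "landscape", "1.5cm"))
# → @page { size: Letter landscape; margin: 1.5cm; }
```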
