Confluence Dump with Python

This toolbox exports content from a Confluence instance (Cloud or Data Center) into a static, navigable HTML archive and converts it into professional, hierarchical PDF documents.

Key Features

Visual Fidelity: Fetches rendered HTML (export_view) to preserve macros, layouts, and formatting.
Navigation: Injects a fully functional, static navigation sidebar into every HTML page.
Offline Browsing: Localizes images and links, and downloads all attachments (PDFs, Office docs, etc.) for complete offline access.
Sort Order: Recursively scans the tree to ensure the manual sort order from Confluence is preserved.
Metadata Injection: Automatically adds Page Title, Author, and Modification Date to the top of every page.
Versioning: Creates timestamped output folders (e.g., 2025-11-21 1400 Space IT) for clean history management.
Professional PDF: Merges the content into a single PDF with TOC, Bookmarks, and mixed Portrait/Landscape orientation.
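
As an illustration of the Versioning feature, the timestamped folder name can be derived from the export time and a label. This is a minimal sketch, not the tool's actual code: the helper name `export_dir_name` is hypothetical, and the format string is inferred from the example `2025-11-21 1400 Space IT`.

```python
from datetime import datetime


def export_dir_name(label: str, now: datetime) -> str:
    """Build a folder name such as '2025-11-21 1400 Space IT'.

    Hypothetical helper: the date/time layout is inferred from the
    README example, not taken from the tool's source.
    """
    return f"{now.strftime('%Y-%m-%d %H%M')} {label}"


if __name__ == "__main__":
    print(export_dir_name("Space IT", datetime(2025, 11, 21, 14, 0)))
```

Because the name starts with a zero-padded date and time, sorting folder names alphabetically also sorts them chronologically.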

Toolbox Overview (Key Files)

confluenceDumpToHTML.py: The main downloader. Connects to Confluence, scrapes content, and creates the folder structure.
htmlToDoc.py: The publisher. Converts the downloaded HTML folder into a single PDF or a single-file Master HTML for LLMs.
confluence_products.ini: Configuration file for API URLs (Cloud vs. Data Center).
styles/: Contains CSS files. site.css (if present) is applied automatically. pdf_settings.css configures the PDF layout (A4/Letter, Margins).
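
To make the role of confluence_products.ini more concrete, here is a rough sketch of how a per-product API path could be looked up with Python's `configparser`. The section names (`cloud`, `dc`) and the key `api_prefix` are assumptions made for this example; check the shipped confluence_products.ini for the real names.

```python
import configparser

# Hypothetical file content: the real section and key names may differ.
SAMPLE_INI = """
[cloud]
api_prefix = /wiki/rest/api

[dc]
api_prefix = /rest/api
"""


def api_prefix_for(profile: str, ini_text: str = SAMPLE_INI) -> str:
    """Return the REST API prefix configured for the given profile."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    if not parser.has_section(profile):
        raise KeyError(f"unknown profile: {profile}")
    return parser[profile]["api_prefix"]


if __name__ == "__main__":
    print(api_prefix_for("dc"))
```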

Quick Start Guide

Follow these steps to create your first PDF export of a single page tree.

1. Setup

Install requirements and set your credentials.

pip install -r requirements.txt
# Windows Users: Install GTK3 Runtime for PDF generation!

Linux/Mac:

export CONFLUENCE_TOKEN="YourPersonalAccessToken"

Windows (PowerShell):

$env:CONFLUENCE_TOKEN="YourPersonalAccessToken"

2. The Dump (Download)

Run the dumper for a specific page tree. This will create a new folder in output/.

# Example for Data Center
python3 confluenceDumpToHTML.py --base-url "https://confluence.corp.com" --profile dc --context-path "/wiki" -o "./output" tree -p "123456"

3. The Publication (PDF)

Open the output folder. It now contains a new folder such as 2025-01-27 0900 My Page Title. Pass this path to the PDF generator.

python3 htmlToDoc.py --site-dir "./output/2025-01-27 0900 My Page Title" --pdf
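
If you script steps 2 and 3 together, the newest export folder can be picked automatically instead of being copied by hand. The sketch below assumes that every folder in the output directory follows the timestamped naming shown above, so the alphabetically last name is the most recent export. The helper names `latest_export_dir` and `pdf_command` are hypothetical and not part of the toolbox.

```python
from pathlib import Path


def latest_export_dir(outdir: Path) -> Path:
    """Return the most recent export folder inside outdir.

    Assumes folder names start with 'YYYY-MM-DD HHMM', so a plain
    alphabetical sort is also a chronological sort.
    """
    candidates = sorted(p for p in outdir.iterdir() if p.is_dir())
    if not candidates:
        raise FileNotFoundError(f"no export folders found in {outdir}")
    return candidates[-1]


def pdf_command(site_dir: Path) -> list:
    """Build the htmlToDoc.py command line from the Quick Start."""
    return ["python3", "htmlToDoc.py", "--site-dir", str(site_dir), "--pdf"]


if __name__ == "__main__":
    # Print the command; pass the list to subprocess.run() to execute it.
    print(pdf_command(latest_export_dir(Path("./output"))))
```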

Result: You now have a ... .pdf inside that folder.

Platform Support & Authentication

This script supports both Confluence Cloud and Confluence Data Center.

⚠️ Note on Cloud Verification: Cloud support has been ported to the new architecture, but the tool was primarily developed and tested against a Confluence Data Center environment.

Configuration

Define API paths in confluence_products.ini. Authentication is handled via Environment Variables:

Cloud: CONFLUENCE_USER (Email) and CONFLUENCE_TOKEN (API Token).
Data Center: CONFLUENCE_TOKEN (Personal Access Token).

⚠️ Troubleshooting Note for Data Center: If authentication fails, ensure you are connected to the VPN and that your admin allows Personal Access Tokens (PAT).
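
The two authentication modes above map to different HTTP headers. The following sketch shows one plausible way to build them from the environment variables: Confluence Cloud accepts HTTP Basic auth with email and API token, while a Data Center Personal Access Token is sent as a Bearer token. The function name `build_auth_headers` and the profile value `cloud` are assumptions for this example; the script's own implementation may differ.

```python
import base64
import os
from typing import Dict, Mapping


def build_auth_headers(profile: str, env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Build an Authorization header for the given profile.

    Hypothetical helper: 'cloud' uses Basic auth (email + API token),
    anything else is treated as Data Center and uses a Bearer PAT.
    """
    token = env.get("CONFLUENCE_TOKEN")
    if not token:
        raise RuntimeError("CONFLUENCE_TOKEN is not set")
    if profile == "cloud":
        user = env.get("CONFLUENCE_USER")
        if not user:
            raise RuntimeError("CONFLUENCE_USER is not set")
        raw = f"{user}:{token}".encode("utf-8")
        return {"Authorization": "Basic " + base64.b64encode(raw).decode("ascii")}
    return {"Authorization": f"Bearer {token}"}


if __name__ == "__main__":
    # Uses a stand-in environment so the example runs without real credentials.
    print(build_auth_headers("dc", {"CONFLUENCE_TOKEN": "YourPersonalAccessToken"}))
```

Passing the environment as a parameter keeps the helper easy to test without touching real credentials.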

Detailed Usage: Stage 1 (HTML Export)

Downloads pages, builds the index, and creates a clean HTML base.

python3 confluenceDumpToHTML.py [OPTIONS] <COMMAND> [ARGS]

Commands

space: Dumps an entire space. (-sp SPACEKEY)
tree: Dumps a specific page and its descendants. (-p PAGEID)
single: Dumps a single page. (-p PAGEID)
label: "Forest Mode". Dumps all pages with a specific label as root trees. (-l LABEL)
- Use --exclude-label to prune specific subtrees (e.g. 'archived').
all-spaces: Dumps all visible spaces.
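
The command layout above (global options first, then a command with its own arguments) corresponds to an `argparse` parser with subcommands. The sketch below only mirrors the flags documented in this README. The `dest` names, and the placement of --exclude-label on the label command, are assumptions, and the real script may define more options.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Sketch of the documented CLI shape; not the script's actual parser."""
    parser = argparse.ArgumentParser(prog="confluenceDumpToHTML.py")
    # Global options come before the command, as in the Quick Start example.
    parser.add_argument("--base-url")
    parser.add_argument("--profile")
    parser.add_argument("--context-path")
    parser.add_argument("-o", "--outdir")

    commands = parser.add_subparsers(dest="command", required=True)
    commands.add_parser("space").add_argument("-sp", dest="space_key", required=True)
    commands.add_parser("tree").add_argument("-p", dest="page_id", required=True)
    commands.add_parser("single").add_argument("-p", dest="page_id", required=True)
    label = commands.add_parser("label")
    label.add_argument("-l", dest="label", required=True)
    label.add_argument("--exclude-label")  # assumed to belong to 'label'
    commands.add_parser("all-spaces")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["-o", "./output", "tree", "-p", "123456"])
    print(args.command, args.page_id, args.outdir)
```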

Common Options

-o, --outdir: Base output directory.
-t, --threads: Number of download threads (e.g., -t 8).
--css-file: Path to custom CSS (applied after standard styles).
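
To illustrate what the -t option controls, the sketch below downloads a list of pages with a fixed-size thread pool. It is a generic pattern using `concurrent.futures`, not the dumper's actual code: `fetch_all` and the `fetch` callback are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def fetch_all(page_ids: Sequence[str], fetch: Callable[[str], str], threads: int = 8) -> List[str]:
    """Fetch every page using at most `threads` worker threads.

    pool.map returns results in input order, so the page order from
    Confluence is kept even though downloads finish out of order.
    """
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(fetch, page_ids))


if __name__ == "__main__":
    # Stand-in for a real HTTP download.
    def fake_fetch(page_id: str) -> str:
        return f"<h1>Page {page_id}</h1>"

    print(fetch_all(["101", "102", "103"], fake_fetch, threads=2))
```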

Handling Complex Macros (Manual Overrides)

The Problem: Some Confluence pages (e.g. complex Table Filters) fail to render via API due to server-side timeouts or heavy client-side JavaScript.

The Solution:

Open the page in Chrome/Edge.
Choose "Save as" with the type "Webpage, Single File (*.mhtml)".
Name the file [PageID].mhtml and place it in the manual_overrides/ folder.
Run the dumper with --manual-overrides-dir "./manual_overrides". The script will extract the rendered state from the MHTML, clean it, and inject it into the pipeline.
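
An MHTML file is a MIME multipart message, so the rendered HTML can be pulled out with Python's standard `email` package. The sketch below shows only that extraction step, under the assumption that the first `text/html` part is the page itself. The function name `extract_html_from_mhtml` is hypothetical, and the script's own cleaning logic is not reproduced here.

```python
import email
from email import policy


def extract_html_from_mhtml(raw: bytes) -> str:
    """Return the first text/html part of an MHTML document as a string."""
    message = email.message_from_bytes(raw, policy=policy.default)
    for part in message.walk():
        if part.get_content_type() == "text/html":
            # decode=True undoes quoted-printable or base64 transfer encoding.
            payload = part.get_payload(decode=True)
            charset = part.get_content_charset() or "utf-8"
            return payload.decode(charset, errors="replace")
    raise ValueError("no text/html part found")


if __name__ == "__main__":
    sample = (
        b"MIME-Version: 1.0\r\n"
        b'Content-Type: multipart/related; boundary="BOUND"\r\n'
        b"\r\n"
        b"--BOUND\r\n"
        b"Content-Type: text/html; charset=utf-8\r\n"
        b"Content-Transfer-Encoding: quoted-printable\r\n"
        b"\r\n"
        b"<html><body><p>Hello=20World</p></body></html>\r\n"
        b"--BOUND--\r\n"
    )
    print(extract_html_from_mhtml(sample))
```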

Detailed Usage: Stage 2 (Architecture Sandbox)

Allows re-organizing the structure (Index) locally without touching Confluence.

Generate Editor:

python3 create_editor.py --site-dir "./output/2025-01-01 Space IT"

Edit: Open editor_sidebar.html. Use Drag & Drop to move pages/folders.
Save: Click "Copy Markdown", paste into sidebar_edit.md.

Apply:

python3 patch_sidebar.py --site-dir "./output/2025-01-01 Space IT"

Detailed Usage: Stage 3 (Document Generation)

Converts the dumped pages into a single document.

python3 htmlToDoc.py --site-dir "./output/2025-01-01 Space IT" --pdf

Options

--pdf: Generate PDF (via WeasyPrint).
--html: Generate single-file Master HTML (for LLM context windows).
--preview: Generate debug HTML (linked to local CSS).

Customizing the PDF

The layout is controlled by CSS files in the styles/ folder of your export:

pdf_settings.css: Configure Page Size (A4/Letter), Orientation, and Margins.
site.css: General styles (detected automatically).
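
Page size, orientation, and margins are standard CSS paged-media settings (the `@page` rule), which WeasyPrint honours. As a rough illustration of what pdf_settings.css may contain, the sketch below generates such rules from Python. The function names, the page name `wide`, and the `.landscape` selector are invented for this example; the shipped stylesheet may use different names.

```python
def page_css(size: str = "A4", orientation: str = "portrait", margin: str = "20mm") -> str:
    """Render a default @page rule with size, orientation, and margins."""
    if orientation not in ("portrait", "landscape"):
        raise ValueError(f"unsupported orientation: {orientation}")
    return (
        "@page {\n"
        f"  size: {size} {orientation};\n"
        f"  margin: {margin};\n"
        "}\n"
    )


def landscape_section_css(size: str = "A4", selector: str = ".landscape") -> str:
    """Render a named page so selected elements print in landscape.

    The page name 'wide' and the default selector are hypothetical.
    """
    return (
        f"@page wide {{ size: {size} landscape; }}\n"
        f"{selector} {{ page: wide; }}\n"
    )


if __name__ == "__main__":
    print(page_css("Letter", "portrait", "15mm"))
    print(landscape_section_css("Letter"))
```

Named pages are what make mixed portrait and landscape output possible within one PDF: elements matching the selector are placed on pages that use the landscape rule.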


SomeSunlight/confluenceDumpWithPython
