Update mediawiki database schema + python version #314
Conversation
Walkthrough

This pull request refactors the database build pipeline to support language-configurable Wikipedia dumps and progressive intermediate file cleanup. The main build script (`buildDatabase.sh`) drives the steps shown in the sequence diagram below.

Sequence Diagram(s)

```mermaid
sequenceDiagram
participant User
participant buildDatabase.sh
participant DownloadTools as Download/SHA1
participant TrimScripts as Trim Scripts
participant ReplaceScripts as Replace Scripts
participant DBImport as SQLite Import
participant FileCleanup as Cleanup
User->>buildDatabase.sh: Invoke with WLANG, OUT_DIR, DELETE_PROGRESSIVELY
buildDatabase.sh->>DownloadTools: Download WLANG dump (redirects, pages, links, targets, SHA1SUM)
DownloadTools-->>buildDatabase.sh: Downloaded files
buildDatabase.sh->>TrimScripts: Trim redirects, pages, links, targets
TrimScripts-->>buildDatabase.sh: Trimmed files
alt DELETE_PROGRESSIVELY enabled
buildDatabase.sh->>FileCleanup: Remove source .gz files
end
buildDatabase.sh->>ReplaceScripts: Replace titles in redirects
ReplaceScripts-->>buildDatabase.sh: redirects.with_ids.txt
buildDatabase.sh->>ReplaceScripts: Replace titles in targets
ReplaceScripts-->>buildDatabase.sh: targets.with_ids.txt
buildDatabase.sh->>TrimScripts: Sort/group/count links
TrimScripts-->>buildDatabase.sh: links.grouped_by_*.txt
buildDatabase.sh->>ReplaceScripts: Replace titles in links
ReplaceScripts-->>buildDatabase.sh: links.with_ids.txt
buildDatabase.sh->>DBImport: Load all processed files to SQLite
DBImport-->>buildDatabase.sh: Database populated
alt DELETE_PROGRESSIVELY enabled
buildDatabase.sh->>FileCleanup: Remove intermediate files
end
```

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
Actionable comments posted: 5
🧹 Nitpick comments (2)
scripts/replace_titles_and_redirects_in_links_file.py (1)
67-73: Remove unnecessary `else` clause. The `else: pass` block on lines 71-73 is unnecessary and can be removed for cleaner code. Apply this diff:

```diff
     target_page_id = TARGETS.get(target_id)
     if target_page_id is not None and source_page_id != target_page_id:
         target_page_id = REDIRECTS.get(target_page_id, target_page_id)
         print(b'\t'.join([source_page_id, target_page_id]).decode())
-    else:
-        pass
```

scripts/replace_titles_and_redirects_in_targets_file.py (1)

58-60: Remove unnecessary `else` clause. The `else: pass` block on lines 58-59 is unnecessary and can be removed for cleaner code. Apply this diff:

```diff
     if target_page_id is not None:
         target_page_id = REDIRECTS.get(target_page_id, target_page_id)
         print(b'\t'.join([target_id, target_page_id]).decode())
-    else:
-        pass
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to data retention organization setting
📒 Files selected for processing (8)
- scripts/README.md (1 hunks)
- scripts/buildDatabase.sh (13 hunks)
- scripts/combine_grouped_links_files.py (1 hunks)
- scripts/prune_pages_file.py (1 hunks)
- scripts/replace_titles_and_redirects_in_links_file.py (2 hunks)
- scripts/replace_titles_and_redirects_in_targets_file.py (1 hunks)
- scripts/replace_titles_in_redirects_file.py (3 hunks)
- scripts/replace_titles_in_redirects_file.py.dis (1 hunks)
🧰 Additional context used
🪛 LanguageTool
scripts/README.md
[grammar] ~19-~19: Ensure spelling is correct
Context: ...is page - page_is_redirect -> Boolean wether this page is a redirect - Ignore the ei...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[grammar] ~31-~31: Ensure spelling is correct
Context: ...r each redirection, rd_title with the targetted page_id by matching on page_title. ...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[grammar] ~32-~32: Ensure spelling is correct
Context: ...ge_idby matching onpage_title`. The targetted page_id is then computed as a redirect ...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[grammar] ~37-~37: Ensure spelling is correct
Context: ...or each linktarget, lt_title with the targetted page_id by matching on page_title. ...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[grammar] ~43-~43: Ensure spelling is correct
Context: ...es, for each pagelink, lt_id with the targetted page_id by joining with `links.with_i...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
🪛 Pylint (4.0.2)
scripts/prune_pages_file.py
[refactor] 40-40: Boolean condition 'True or is_redirect == '0' or page_id in REDIRECTS' will always evaluate to 'True'
(R1727)
🪛 Ruff (0.14.3)
scripts/replace_titles_and_redirects_in_targets_file.py
1-1: The file is executable but no shebang is present
(EXE002)
🔇 Additional comments (12)
scripts/prune_pages_file.py (1)
31-33: Binary mode migration looks correct. The conversion to binary mode with byte literals for reading redirects is implemented correctly.
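For context, a minimal sketch of the binary-mode reading pattern these scripts now share; the file name and loading loop below are illustrative assumptions, not code quoted from the PR:

```python
import gzip
import io

# Build a redirects map from a gzipped TSV, keeping keys and values as bytes
# so large dump files never pay per-line decoding costs.
REDIRECTS = {}
with gzip.open('redirects.with_ids.txt.gz', 'rb') as raw:  # hypothetical path
    for line in io.BufferedReader(raw):
        [source_id, target_id] = line.rstrip(b'\n').split(b'\t')
        REDIRECTS[source_id] = target_id
```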
scripts/combine_grouped_links_files.py (1)
29-54: Binary mode migration and numeric keys are well implemented. The conversion to binary mode processing with numeric keys (0 for outgoing, 1 for incoming) is clean and efficient. The inline comment on line 31 documenting the key convention is helpful.
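To illustrate the key convention only (the dictionary shape here is an assumption for the sketch, not the script's actual structure):

```python
# LINKS maps a page id (bytes) to its link counts, keyed by direction:
# 0 = outgoing links, 1 = incoming links.
LINKS = {}

def record_count(page_id, direction, count):
    # direction is the numeric key: 0 for outgoing, 1 for incoming
    LINKS.setdefault(page_id, {0: 0, 1: 0})[direction] = count

record_count(b'1234', 0, 17)  # page 1234 links out to 17 pages
record_count(b'1234', 1, 5)   # 5 pages link in to page 1234
```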
scripts/replace_titles_in_redirects_file.py (1)
28-65: Binary mode migration and redirect chain resolution are correct. The conversion to binary mode with bytes-based data structures is well executed. The redirect chain resolution logic correctly handles circular references and deep chains with a 100-step limit.
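A minimal sketch of bounded redirect-chain resolution as described above; the function and variable names are hypothetical, not taken from the script:

```python
def resolve_redirect(page_id, redirects, max_steps=100):
    """Follow redirects until a non-redirect page is reached.

    Returns None if the chain is circular or deeper than max_steps,
    i.e. the redirects only point to other redirects, never a real page.
    """
    for _ in range(max_steps):
        target = redirects.get(page_id)
        if target is None:
            return page_id  # landed on an actual page
        page_id = target
    return None  # circular or excessively deep chain; caller should skip it
```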
scripts/replace_titles_and_redirects_in_links_file.py (1)
13-55: TARGETS_FILE integration and binary mode migration are correct. The addition of the TARGETS_FILE argument and corresponding TARGETS dictionary aligns with the new MediaWiki schema that uses `linktarget` tables. The binary mode migration is consistently applied.
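As a rough illustration of what the TARGETS dictionary holds; the file name and column order below are assumptions based on the README description, not quoted code:

```python
import gzip
import io

# Map lt_id (bytes) -> resolved page_id (bytes), one pair per line of the
# targets file produced earlier in the pipeline.
TARGETS = {}
with gzip.open('targets.with_ids.txt.gz', 'rb') as raw:  # hypothetical path
    for line in io.BufferedReader(raw):
        [lt_id, page_id] = line.rstrip(b'\n').split(b'\t')
        TARGETS[lt_id] = page_id
```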
scripts/replace_titles_in_redirects_file.py.dis (1)

1-66: Clarify the purpose of this .dis file. The `.dis` file extension and the presence of debug output suggest this might be a development/debug artifact. If this file is intended for production use, it needs cleanup. If it's a debug script, consider whether it should be committed to the repository. Should this file be included in the PR, or is it a debug artifact that was accidentally committed?
scripts/replace_titles_and_redirects_in_targets_file.py (1)
34-57: Target resolution logic is correct. The logic to map target titles to page IDs and resolve through redirects is correctly implemented. The binary mode migration is consistent with other scripts in this PR.
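A compressed sketch of that resolution step, assuming bytes-keyed dicts; the names `titles_to_ids` and `redirects` are illustrative, not the script's own:

```python
def resolve_target(lt_id, lt_title, titles_to_ids, redirects):
    # Map the link target's title to a page id, then follow redirects
    # (already flattened to final pages) to get the "final" page id.
    page_id = titles_to_ids.get(lt_title)
    if page_id is None:
        return None  # title has no matching namespace-0 page; drop the target
    return (lt_id, redirects.get(page_id, page_id))
```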
scripts/buildDatabase.sh (6)
8-11: Language-aware configuration is well designed. The introduction of `WLANG`, `OUT_DIR`, and `DELETE_PROGRESSIVELY` environment variables with sensible defaults provides good flexibility for processing different Wikipedia languages and managing disk space.
26-36: ROOT_DIR improvement and consistent language-aware filenames. Changing `ROOT_DIR` from `pwd` to `$(dirname "$0")` makes the script more robust when called from different directories. The consistent use of the `${WLANG}wiki` prefix in all filenames supports multi-language processing cleanly.
86-86: Targets file support aligns with new MediaWiki schema. The addition of targets file download and processing correctly supports the new MediaWiki `linktarget` table introduced in recent schema versions. This is essential for addressing issue #149.

Also applies to: 160-182
113-113: Progressive deletion feature is well implemented. The `DELETE_PROGRESSIVELY` feature systematically cleans up intermediate files after each processing step when enabled. This addresses disk space concerns for large dumps while remaining opt-in for debugging scenarios.

Also applies to: 135-135, 158-158, 182-182, 198-198, 209-209, 220-220, 230-230, 258-258, 274-274, 285-285, 300-300, 310-310, 315-315, 320-320
151-152: Regex patterns are correct and properly aligned with the schema. The regex on line 151 (`^[0-9]+,0,[0-9]+$`) correctly matches the pagelinks format (`pl_from,pl_namespace,pl_target_id`), filtering for namespace 0 only. The sed command on line 152 (`s/,0,/\t/g`) properly extracts the required fields by replacing the namespace separator with a tab, resulting in the output format `source_page_id[TAB]target_id`. The patterns align with the documented schema in scripts/README.md and work as intended.
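To make the extraction concrete, here is a hypothetical Python equivalent of the egrep + sed pair; the sample rows are invented for illustration:

```python
import re

rows = ["12345,0,67890", "12345,1,67890"]  # pl_from,pl_namespace,pl_target_id
for row in rows:
    if re.fullmatch(r"[0-9]+,0,[0-9]+", row):  # keep namespace-0 rows only
        # replace the ",0," separator with a tab -> "12345\t67890"
        print(row.replace(",0,", "\t"))
```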
126-129: Verify regex patterns with actual MediaWiki dump data for your version. The patterns on lines 126-129 appear logically sound and align with the MediaWiki schema (extracting page_id, filtering namespace=0, capturing title and is_redirect). However, since no dump files exist in the sandbox environment, I cannot confirm the regex patterns handle all edge cases correctly, particularly the escaped quote handling in line 127's pattern `([^']*(\\\\')?)+'`, which depends on your MediaWiki version's exact escaping format.

Manually test these patterns with a small sample of your actual dump file:

```sh
pigz -dc your-dump-page.sql.gz | head -20 | sed -n 's/^INSERT INTO `page` VALUES //p' | head -3
```

Then verify the extracted records match your expectations with the full pipeline.
```diff
 for line in io.BufferedReader(gzip.open(PAGES_FILE, 'rb')):
     [page_id, page_title, is_redirect] = line.rstrip(b'\n').split(b'\t')

-    if is_redirect == '0' or page_id in REDIRECTS:
-        print('\t'.join([page_id, page_title, is_redirect]))
+    if True or is_redirect == '0' or page_id in REDIRECTS:
+        print(b'\t'.join([page_id, page_title, is_redirect]).decode())
```
Critical: Filtering logic is completely disabled.

The condition on line 40 will always evaluate to `True` due to the leading `True or`, which completely disables the pruning logic. This means ALL pages are output regardless of whether they should be filtered. Additionally, the comparison `is_redirect == '0'` will never match because `is_redirect` is now bytes (from line 38's split on `b'\t'`), but it's being compared to the string `'0'`.

This bug could directly contribute to issue #149, as the pruning step would output incorrect data.

Apply this diff to fix the logic:

```diff
-    if True or is_redirect == '0' or page_id in REDIRECTS:
+    if is_redirect == b'0' or page_id in REDIRECTS:
         print(b'\t'.join([page_id, page_title, is_redirect]).decode())
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
for line in io.BufferedReader(gzip.open(PAGES_FILE, 'rb')):
    [page_id, page_title, is_redirect] = line.rstrip(b'\n').split(b'\t')
    if is_redirect == b'0' or page_id in REDIRECTS:
        print(b'\t'.join([page_id, page_title, is_redirect]).decode())
```
🤖 Prompt for AI Agents
In scripts/prune_pages_file.py around lines 37 to 41, the pruning condition is
disabled by a stray "True or" and also compares bytes to a string; remove the
"True or" so the conditional actually filters, compare is_redirect against bytes
(b'0'), and make the page_id membership check use the same type as REDIRECTS
(either decode page_id to str before the check or convert REDIRECTS to bytes).
Concretely: remove "True or", change is_redirect == '0' to is_redirect == b'0',
and either call pid = page_id.decode() then use "or pid in REDIRECTS" or ensure
REDIRECTS contains byte values so you can use "or page_id in REDIRECTS"; leave
the print as-is (it already decodes the joined bytes).
- `page_id` -> Id of the page
- (`page_namespace`) -> We keep only if equals 0 (= namespace of this page)
- `page_title` -> Title of this page
- `page_is_redirect` -> Boolean wether this page is a redirect
Fix spelling error.

Line 19: "wether" should be "whether". Apply this diff:

```diff
-- `page_is_redirect` -> Boolean wether this page is a redirect
+- `page_is_redirect` -> Boolean whether this page is a redirect
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```markdown
- `page_is_redirect` -> Boolean whether this page is a redirect
```
🤖 Prompt for AI Agents
In scripts/README.md around line 19, the word "wether" is misspelled; replace
"wether" with the correct spelling "whether" so the line reads
"`page_is_redirect` -> Boolean whether this page is a redirect".
Replaces for each redirection, `rd_title` with the targetted `page_id` by matching on `page_title`.
The targetted page_id is then computed as a redirect recursively, until we get on a "final" page.
- `rd_from` -> The id of the page we are redirected from
- `page_id` -> The id of the page we get to following redirections recursively

### targets.with_ids.txt (replace_titles_and_redirects_in_targets_file.py)
Replaces, for each linktarget, `lt_title` with the targetted `page_id` by matching on `page_title`.
We then compute the "final" page obtained from this page following redirection, with the file `redirects.with_ids.txt`.
- `lt_id` -> Id of this link
- `page_id` -> The id of the page this link is pointing to, after having followed all redirections

### links.with_ids.txt (replace_titles_and_redirects_in_links_file.py)
Replaces, for each pagelink, `lt_id` with the targetted `page_id` by joining with `links.with_ids.txt`.
Fix spelling errors.

Lines 31, 32, 37, and 43: "targetted" should be "targeted". Apply this diff:

```diff
-Replaces for each redirection, `rd_title` with the targetted `page_id` by matching on `page_title`.
-The targetted page_id is then computed as a redirect recursively, until we get on a "final" page.
+Replaces for each redirection, `rd_title` with the targeted `page_id` by matching on `page_title`.
+The targeted page_id is then computed as a redirect recursively, until we get on a "final" page.
 - `rd_from` -> The id of the page we are redirected from
 - `page_id` -> The id of the page we get to following redirections recursively

 ### targets.with_ids.txt (replace_titles_and_redirects_in_targets_file.py)
-Replaces, for each linktarget, `lt_title` with the targetted `page_id` by matching on `page_title`.
+Replaces, for each linktarget, `lt_title` with the targeted `page_id` by matching on `page_title`.
```

And on line 43:

```diff
-Replaces, for each pagelink, `lt_id` with the targetted `page_id` by joining with `links.with_ids.txt`.
+Replaces, for each pagelink, `lt_id` with the targeted `page_id` by joining with `links.with_ids.txt`.
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```markdown
Replaces for each redirection, `rd_title` with the targeted `page_id` by matching on `page_title`.
The targeted page_id is then computed as a redirect recursively, until we get on a "final" page.
- `rd_from` -> The id of the page we are redirected from
- `page_id` -> The id of the page we get to following redirections recursively

### targets.with_ids.txt (replace_titles_and_redirects_in_targets_file.py)
Replaces, for each linktarget, `lt_title` with the targeted `page_id` by matching on `page_title`.
We then compute the "final" page obtained from this page following redirection, with the file `redirects.with_ids.txt`.
- `lt_id` -> Id of this link
- `page_id` -> The id of the page this link is pointing to, after having followed all redirections

### links.with_ids.txt (replace_titles_and_redirects_in_links_file.py)
Replaces, for each pagelink, `lt_id` with the targeted `page_id` by joining with `links.with_ids.txt`.
```
🤖 Prompt for AI Agents
scripts/README.md around lines 31 to 43: fix the spelling mistake "targetted" ->
"targeted" on lines 31, 32, 37, and 43 by replacing each occurrence with
"targeted" so the README uses correct English for describing targeted page ids
and links.
```python
ALL_TARGET_IDS = set()
TARGET_TITLES_TO_IDS = {}
for line in io.BufferedReader(gzip.open(TARGETS_FILE, 'rb')):
    print("LALIGNE",line.rstrip(b'\n').split(b'\t'))
```
Remove debug print statement.

The debug print `print("LALIGNE", line.rstrip(b'\n').split(b'\t'))` on line 32 will output diagnostic information to stdout for every line in the targets file. This will corrupt the intended output and break downstream processing that expects only tab-separated ID pairs.

Apply this diff to remove the debug print:

```diff
 for line in io.BufferedReader(gzip.open(TARGETS_FILE, 'rb')):
-    print("LALIGNE",line.rstrip(b'\n').split(b'\t'))
     [page_id, page_title,_] = line.rstrip(b'\n').split(b'\t')
```

🤖 Prompt for AI Agents
In scripts/replace_titles_in_redirects_file.py.dis around line 32, remove the
debug print statement that prints "LALIGNE" and the parsed line; this debug
output pollutes stdout and breaks downstream processing expecting only
tab-separated ID pairs. Edit the file to delete that print(...) call so the
script only writes the intended tab-separated output and no extraneous
diagnostic prints.
```python
            redirected_count += 1

            # Break out if there is a circular path, meaning the redirects only point to other redirects,
            # not an acutal page.
```
Fix typo in comment.

Line 61: "acutal" should be "actual". Apply this diff:

```diff
-            # Break out if there is a circular path, meaning the redirects only point to other redirects,
-            # not an acutal page.
+            # Break out if there is a circular path, meaning the redirects only point to other redirects,
+            # not an actual page.
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
            # Break out if there is a circular path, meaning the redirects only point to other redirects,
            # not an actual page.
```
🤖 Prompt for AI Agents
In scripts/replace_titles_in_redirects_file.py.dis around line 61, there's a
typo in the inline comment: "acutal" should be corrected to "actual"; update the
comment text to read "not an actual page." preserving punctuation and spacing.
MediaWiki updated its database schema.
I also added an env variable, WLANG, to set the Wikipedia language (I personally use the French Wikipedia).
I needed to move some Python code to a more recent version (using binary streams).
Generation worked on my computer for the English language.
Closes #149
Summary by CodeRabbit
Documentation
Chores