
Conversation

@MysaaJava

@MysaaJava MysaaJava commented Nov 5, 2025

MediaWiki updated its database schema.

I also added an environment variable, WLANG, to select the Wikipedia language (I personally use the French Wikipedia).

I also needed to update some of the Python code for a more recent version (using binary streams).

Generation worked on my computer for the English language.

Closes #149

Summary by CodeRabbit

  • Documentation

    • Added comprehensive documentation detailing the data processing workflow and file transformation steps.
  • Chores

    • Enhanced build process with language selection support and customizable output directory.
    • Added option to progressively delete intermediate files during processing to conserve disk space.
    • Refined data processing pipeline scripts for improved efficiency and robustness.

@coderabbitai

coderabbitai bot commented Nov 5, 2025

Walkthrough

This pull request refactors the database build pipeline to support language-configurable Wikipedia dumps and progressive intermediate file cleanup. The main build script (buildDatabase.sh) is enhanced with environment variables for target language (WLANG), output directory (OUT_DIR), and progressive deletion mode (DELETE_PROGRESSIVELY). Multiple Python helper scripts are introduced or modified to process targets files and handle binary-mode file I/O with bytes-based operations. The workflow is restructured to include a new targets file processing stage, additional sorting/grouping/counting steps for links, and systematic cleanup of intermediate artifacts. A new script for resolving redirect chains in targets files is also added.
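The three environment variables described above might be exercised as follows. Only the names WLANG, OUT_DIR, and DELETE_PROGRESSIVELY come from the PR; the default values shown here are assumptions for illustration:

```shell
# Hypothetical defaults mirroring how buildDatabase.sh might read its
# configuration -- the fallback values are assumptions, not the script's.
WLANG="${WLANG:-en}"                                   # Wikipedia language code, e.g. fr
OUT_DIR="${OUT_DIR:-dump}"                             # where intermediate files are written
DELETE_PROGRESSIVELY="${DELETE_PROGRESSIVELY:-false}"  # opt-in progressive cleanup

echo "building ${WLANG}wiki into ${OUT_DIR} (progressive delete: ${DELETE_PROGRESSIVELY})"

# A typical call for the French Wikipedia might look like:
# WLANG=fr OUT_DIR=out DELETE_PROGRESSIVELY=true ./buildDatabase.sh
```

The `${VAR:-default}` expansion keeps the script usable with no configuration while letting callers override each knob independently.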

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant buildDatabase.sh
    participant DownloadTools as Download/SHA1
    participant TrimScripts as Trim Scripts
    participant ReplaceScripts as Replace Scripts
    participant DBImport as SQLite Import
    participant FileCleanup as Cleanup

    User->>buildDatabase.sh: Invoke with WLANG, OUT_DIR, DELETE_PROGRESSIVELY
    buildDatabase.sh->>DownloadTools: Download WLANG dump (redirects, pages, links, targets, SHA1SUM)
    DownloadTools-->>buildDatabase.sh: Downloaded files
    buildDatabase.sh->>TrimScripts: Trim redirects, pages, links, targets
    TrimScripts-->>buildDatabase.sh: Trimmed files
    alt DELETE_PROGRESSIVELY enabled
        buildDatabase.sh->>FileCleanup: Remove source .gz files
    end
    buildDatabase.sh->>ReplaceScripts: Replace titles in redirects
    ReplaceScripts-->>buildDatabase.sh: redirects.with_ids.txt
    buildDatabase.sh->>ReplaceScripts: Replace titles in targets
    ReplaceScripts-->>buildDatabase.sh: targets.with_ids.txt
    buildDatabase.sh->>TrimScripts: Sort/group/count links
    TrimScripts-->>buildDatabase.sh: links.grouped_by_*.txt
    buildDatabase.sh->>ReplaceScripts: Replace titles in links
    ReplaceScripts-->>buildDatabase.sh: links.with_ids.txt
    buildDatabase.sh->>DBImport: Load all processed files to SQLite
    DBImport-->>buildDatabase.sh: Database populated
    alt DELETE_PROGRESSIVELY enabled
        buildDatabase.sh->>FileCleanup: Remove intermediate files
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Areas requiring extra attention:

  • buildDatabase.sh: Substantial refactoring with new language-aware configuration, revised filename generation patterns, and progressive deletion hooks throughout the pipeline. Critical to verify that the new targets file handling integrates correctly with existing redirect/pages/links workflows and that DELETE_PROGRESSIVELY cleanup doesn't prematurely remove files still needed downstream.
  • Binary mode conversions in Python scripts: Multiple helper scripts (combine_grouped_links_files.py, prune_pages_file.py, replace_titles_*.py) transition from text to binary mode file I/O. Verify that bytes-to-string conversions are applied correctly at output boundaries and that delimiter handling (b'\t', b'\n') is consistent throughout.
  • prune_pages_file.py: The page-filter guard is converted to always-true (if True or ...), causing all pages to be output regardless of redirect status. Confirm whether this is intentional or a logic error that breaks the pruning mechanism.
  • New scripts (replace_titles_and_redirects_in_targets_file.py, replace_titles_in_redirects_file.py.dis): Verify correctness of title-to-ID lookups, redirect resolution logic, and handling of edge cases (missing pages, redirect cycles in .dis variant with 100-step limit).
  • Integration testing: Confirm the entire pipeline produces expected outputs for different WLANG values and that intermediate file deletions don't cause subsequent stages to fail due to missing dependencies.
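The bytes-versus-str pitfall flagged for prune_pages_file.py above is easy to reproduce in isolation. A minimal sketch (sample line made up, not the project's actual code):

```python
# In binary mode, fields split on b'\t' are bytes objects, so comparing
# them to str literals silently evaluates to False in Python 3.
line = b"12\tParis\t0\n"
page_id, page_title, is_redirect = line.rstrip(b"\n").split(b"\t")

assert is_redirect == b"0"       # correct: bytes compared to bytes
assert not (is_redirect == "0")  # type mismatch: never True, no error raised
```

Because the mismatched comparison raises no exception, this class of bug only shows up as silently wrong filtering, which is exactly why the always-true guard went unnoticed.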

Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)
Check name Status Explanation Resolution
Title check ❓ Inconclusive The title is vague and overly broad, mentioning 'schema' and 'python version' without specifically indicating the main change is fixing a links trimming bug for MediaWiki dumps. Revise title to be more specific, such as 'Fix empty gz file during links trimming for large wiki dumps' or 'Fix links file trimming with MediaWiki schema updates'.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check ✅ Passed The PR addresses issue #149 by refactoring scripts to use binary file I/O and introducing language-aware processing via WLANG environment variable, which resolves the empty file generation issue during links trimming.
Out of Scope Changes check ✅ Passed All changes are directly related to fixing the links trimming issue or supporting the schema/language updates: README documentation, buildDatabase.sh script updates with language support and progressive deletion, and Python script modifications for binary I/O handling.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (2)
scripts/replace_titles_and_redirects_in_links_file.py (1)

67-73: Remove unnecessary else clause.

The else: pass block on lines 71-73 is unnecessary and can be removed for cleaner code.

Apply this diff:

     target_page_id = TARGETS.get(target_id)
     if target_page_id is not None and source_page_id != target_page_id:
       target_page_id = REDIRECTS.get(target_page_id, target_page_id)
       print(b'\t'.join([source_page_id, target_page_id]).decode())
-    else:
-      pass
-
scripts/replace_titles_and_redirects_in_targets_file.py (1)

58-60: Remove unnecessary else clause.

The else: pass block on lines 58-59 is unnecessary and can be removed for cleaner code.

Apply this diff:

   if target_page_id is not None:
     target_page_id = REDIRECTS.get(target_page_id, target_page_id)
     print(b'\t'.join([target_id, target_page_id]).decode())
-  else:
-    pass
-
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between b8e99e8 and f19f1b5.

📒 Files selected for processing (8)
  • scripts/README.md (1 hunks)
  • scripts/buildDatabase.sh (13 hunks)
  • scripts/combine_grouped_links_files.py (1 hunks)
  • scripts/prune_pages_file.py (1 hunks)
  • scripts/replace_titles_and_redirects_in_links_file.py (2 hunks)
  • scripts/replace_titles_and_redirects_in_targets_file.py (1 hunks)
  • scripts/replace_titles_in_redirects_file.py (3 hunks)
  • scripts/replace_titles_in_redirects_file.py.dis (1 hunks)
🧰 Additional context used
🪛 LanguageTool
scripts/README.md

[grammar] ~19-~19: Ensure spelling is correct
Context: ...is page - page_is_redirect -> Boolean wether this page is a redirect - Ignore the ei...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[grammar] ~31-~31: Ensure spelling is correct
Context: ...r each redirection, rd_title with the targetted page_id by matching on page_title. ...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[grammar] ~32-~32: Ensure spelling is correct
Context: ...ge_idby matching onpage_title`. The targetted page_id is then computed as a redirect ...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[grammar] ~37-~37: Ensure spelling is correct
Context: ...or each linktarget, lt_title with the targetted page_id by matching on page_title. ...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[grammar] ~43-~43: Ensure spelling is correct
Context: ...es, for each pagelink, lt_id with the targetted page_id by joining with `links.with_i...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🪛 Pylint (4.0.2)
scripts/prune_pages_file.py

[refactor] 40-40: Boolean condition 'True or is_redirect == '0' or page_id in REDIRECTS' will always evaluate to 'True'

(R1727)

🪛 Ruff (0.14.3)
scripts/replace_titles_and_redirects_in_targets_file.py

1-1: The file is executable but no shebang is present

(EXE002)

🔇 Additional comments (12)
scripts/prune_pages_file.py (1)

31-33: Binary mode migration looks correct.

The conversion to binary mode with byte literals for reading redirects is implemented correctly.

scripts/combine_grouped_links_files.py (1)

29-54: Binary mode migration and numeric keys are well implemented.

The conversion to binary mode processing with numeric keys (0 for outgoing, 1 for incoming) is clean and efficient. The inline comment on line 31 documenting the key convention is helpful.

scripts/replace_titles_in_redirects_file.py (1)

28-65: Binary mode migration and redirect chain resolution are correct.

The conversion to binary mode with bytes-based data structures is well executed. The redirect chain resolution logic correctly handles circular references and deep chains with a 100-step limit.

scripts/replace_titles_and_redirects_in_links_file.py (1)

13-55: TARGETS_FILE integration and binary mode migration are correct.

The addition of the TARGETS_FILE argument and corresponding TARGETS dictionary aligns with the new MediaWiki schema that uses linktarget tables. The binary mode migration is consistently applied.

scripts/replace_titles_in_redirects_file.py.dis (1)

1-66: Clarify the purpose of this .dis file.

The .dis file extension and the presence of debug output suggest this might be a development/debug artifact. If this file is intended for production use, it needs cleanup. If it's a debug script, consider whether it should be committed to the repository.

Should this file be included in the PR, or is it a debug artifact that was accidentally committed?

scripts/replace_titles_and_redirects_in_targets_file.py (1)

34-57: Target resolution logic is correct.

The logic to map target titles to page IDs and resolve through redirects is correctly implemented. The binary mode migration is consistent with other scripts in this PR.

scripts/buildDatabase.sh (6)

8-11: Language-aware configuration is well designed.

The introduction of WLANG, OUT_DIR, and DELETE_PROGRESSIVELY environment variables with sensible defaults provides good flexibility for processing different Wikipedia languages and managing disk space.


26-36: ROOT_DIR improvement and consistent language-aware filenames.

Changing ROOT_DIR from pwd to $(dirname "$0") makes the script more robust when called from different directories. The consistent use of ${WLANG}wiki prefix in all filenames supports multi-language processing cleanly.


86-86: Targets file support aligns with new MediaWiki schema.

The addition of targets file download and processing correctly supports the new MediaWiki linktarget table introduced in recent schema versions. This is essential for addressing issue #149.

Also applies to: 160-182


113-113: Progressive deletion feature is well implemented.

The DELETE_PROGRESSIVELY feature systematically cleans up intermediate files after each processing step when enabled. This addresses disk space concerns for large dumps while remaining opt-in for debugging scenarios.

Also applies to: 135-135, 158-158, 182-182, 198-198, 209-209, 220-220, 230-230, 258-258, 274-274, 285-285, 300-300, 310-310, 315-315, 320-320


151-152: Regex patterns are correct and properly aligned with the schema.

The regex on line 151 (^[0-9]+,0,[0-9]+$) correctly matches the pagelinks format (pl_from,pl_namespace,pl_target_id), filtering for namespace 0 only. The sed command on line 152 (s/,0,/\t/g) properly extracts the required fields by replacing the namespace separator with a tab, resulting in the output format: source_page_id[TAB]target_id. The patterns align with the documented schema in scripts/README.md and work as intended.
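The filter-then-split claim above can be checked with a quick self-contained snippet. The regex and the `sed` replacement are copied from the review; the sample rows are made up:

```python
import re

# pl_from,pl_namespace,pl_target_id -- keep only namespace 0.
NS0_ROW = re.compile(r"^[0-9]+,0,[0-9]+$")

rows = ["12,0,345", "12,1,345", "700,0,8"]
kept = [r for r in rows if NS0_ROW.match(r)]

# Equivalent of `sed 's/,0,/\t/g'`: swap the namespace separator for a tab,
# yielding source_page_id[TAB]target_id.
pairs = [r.replace(",0,", "\t") for r in kept]
```

Running this keeps only the namespace-0 rows and produces the tab-separated id pairs the downstream sort/group steps expect.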


126-129: Verify regex patterns with actual MediaWiki dump data for your version.

The patterns on lines 126-129 appear logically sound and align with MediaWiki schema (extracting page_id, filtering namespace=0, capturing title and is_redirect). However, since no dump files exist in the sandbox environment, I cannot confirm the regex patterns handle all edge cases correctly—particularly the escaped quote handling in line 127's pattern ([^']*(\\\\')?)+', which depends on your MediaWiki version's exact escaping format.

Manually test these patterns with a small sample of your actual dump file:

pigz -dc your-dump-page.sql.gz | head -20 | sed -n 's/^INSERT INTO `page` VALUES //p' | head -3

Then verify the extracted records match your expectations with the full pipeline.

Comment on lines +37 to +41
for line in io.BufferedReader(gzip.open(PAGES_FILE, 'rb')):
  [page_id, page_title, is_redirect] = line.rstrip(b'\n').split(b'\t')

-  if is_redirect == '0' or page_id in REDIRECTS:
-    print('\t'.join([page_id, page_title, is_redirect]))
+  if True or is_redirect == '0' or page_id in REDIRECTS:
+    print(b'\t'.join([page_id, page_title, is_redirect]).decode())

⚠️ Potential issue | 🔴 Critical

Critical: Filtering logic is completely disabled.

The condition on line 40 will always evaluate to True due to the leading True or, which completely disables the pruning logic. This means ALL pages are output regardless of whether they should be filtered. Additionally, the comparison is_redirect == '0' will never match because is_redirect is now bytes (from line 38's split on b'\t'), but it's being compared to the string '0'.

This bug could directly contribute to issue #149, as the pruning step would output incorrect data.

Apply this diff to fix the logic:

-  if True or is_redirect == '0' or page_id in REDIRECTS:
+  if is_redirect == b'0' or page_id in REDIRECTS:
     print(b'\t'.join([page_id, page_title, is_redirect]).decode())
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-  if True or is_redirect == '0' or page_id in REDIRECTS:
+  if is_redirect == b'0' or page_id in REDIRECTS:
     print(b'\t'.join([page_id, page_title, is_redirect]).decode())
🧰 Tools
🪛 Pylint (4.0.2)

[refactor] 40-40: Boolean condition 'True or is_redirect == '0' or page_id in REDIRECTS' will always evaluate to 'True'

(R1727)

🤖 Prompt for AI Agents
In scripts/prune_pages_file.py around lines 37 to 41, the pruning condition is
disabled by a stray "True or" and also compares bytes to a string; remove the
"True or" so the conditional actually filters, compare is_redirect against bytes
(b'0'), and make the page_id membership check use the same type as REDIRECTS
(either decode page_id to str before the check or convert REDIRECTS to bytes).
Concretely: remove "True or", change is_redirect == '0' to is_redirect == b'0',
and either call pid = page_id.decode() then use "or pid in REDIRECTS" or ensure
REDIRECTS contains byte values so you can use "or page_id in REDIRECTS"; leave
the print as-is (it already decodes the joined bytes).

- `page_id` -> Id of the page
- (`page_namespace`) -> We keep only if equals 0 (= namespace of this page)
- `page_title` -> Title of this page
- `page_is_redirect` -> Boolean wether this page is a redirect

⚠️ Potential issue | 🟡 Minor

Fix spelling error.

Line 19: "wether" should be "whether".

Apply this diff:

-- `page_is_redirect` -> Boolean wether this page is a redirect
+- `page_is_redirect` -> Boolean whether this page is a redirect
🧰 Tools
🪛 LanguageTool

[grammar] ~19-~19: Ensure spelling is correct
Context: ...is page - page_is_redirect -> Boolean wether this page is a redirect - Ignore the ei...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🤖 Prompt for AI Agents
In scripts/README.md around line 19, the word "wether" is misspelled; replace
"wether" with the correct spelling "whether" so the line reads
"`page_is_redirect` -> Boolean whether this page is a redirect".

Comment on lines +31 to +43
Replaces for each redirection, `rd_title` with the targetted `page_id` by matching on `page_title`.
The targetted page_id is then computed as a redirect recursively, until we get on a "final" page.
- `rd_from` -> The id of the page we are redirected from
- `page_id` -> The id of the page we get to following redirections recursively

### targets.with_ids.txt (replace_titles_and_redirects_in_targets_file.py)
Replaces, for each linktarget, `lt_title` with the targetted `page_id` by matching on `page_title`.
We then compute the "final" page obtained from this page following redirection, with the file `redirects.with_ids.txt`.
- `lt_id` -> Id of this link
- `page_id` -> The id of the page this link is pointing to, after having followed all redirections

### links.with_ids.txt (replace_titles_and_redirects_in_links_file.py)
Replaces, for each pagelink, `lt_id` with the targetted `page_id` by joining with `links.with_ids.txt`.

⚠️ Potential issue | 🟡 Minor

Fix spelling errors.

Lines 31, 32, 37, and 43: "targetted" should be "targeted".

Apply this diff:

-Replaces for each redirection, `rd_title` with the targetted `page_id` by matching on `page_title`.
-The targetted page_id is then computed as a redirect recursively, until we get on a "final" page.
+Replaces for each redirection, `rd_title` with the targeted `page_id` by matching on `page_title`.
+The targeted page_id is then computed as a redirect recursively, until we get on a "final" page.
 - `rd_from` -> The id of the page we are redirected from
 - `page_id` -> The id of the page we get to following redirections recursively
 
 ### targets.with_ids.txt (replace_titles_and_redirects_in_targets_file.py)
-Replaces, for each linktarget, `lt_title` with the targetted `page_id` by matching on `page_title`.
+Replaces, for each linktarget, `lt_title` with the targeted `page_id` by matching on `page_title`.

And on line 43:

-Replaces, for each pagelink, `lt_id` with the targetted `page_id` by joining with `links.with_ids.txt`.
+Replaces, for each pagelink, `lt_id` with the targeted `page_id` by joining with `links.with_ids.txt`.
🧰 Tools
🪛 LanguageTool

[grammar] ~31-~31: Ensure spelling is correct
Context: ...r each redirection, rd_title with the targetted page_id by matching on page_title. ...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[grammar] ~32-~32: Ensure spelling is correct
Context: ...ge_idby matching onpage_title`. The targetted page_id is then computed as a redirect ...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[grammar] ~37-~37: Ensure spelling is correct
Context: ...or each linktarget, lt_title with the targetted page_id by matching on page_title. ...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[grammar] ~43-~43: Ensure spelling is correct
Context: ...es, for each pagelink, lt_id with the targetted page_id by joining with `links.with_i...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🤖 Prompt for AI Agents
scripts/README.md around lines 31 to 43: fix the spelling mistake "targetted" ->
"targeted" on lines 31, 32, 37, and 43 by replacing each occurrence with
"targeted" so the README uses correct English for describing targeted page ids
and links.

ALL_TARGET_IDS = set()
TARGET_TITLES_TO_IDS = {}
for line in io.BufferedReader(gzip.open(TARGETS_FILE, 'rb')):
print("LALIGNE",line.rstrip(b'\n').split(b'\t'))

⚠️ Potential issue | 🔴 Critical

Remove debug print statement.

The debug print print("LALIGNE",line.rstrip(b'\n').split(b'\t')) on line 32 will output diagnostic information to stdout for every line in the targets file. This will corrupt the intended output and break downstream processing that expects only tab-separated ID pairs.

Apply this diff to remove the debug print:

 for line in io.BufferedReader(gzip.open(TARGETS_FILE, 'rb')):
-  print("LALIGNE",line.rstrip(b'\n').split(b'\t'))
   [page_id, page_title,_] = line.rstrip(b'\n').split(b'\t')
🤖 Prompt for AI Agents
In scripts/replace_titles_in_redirects_file.py.dis around line 32, remove the
debug print statement that prints "LALIGNE" and the parsed line; this debug
output pollutes stdout and breaks downstream processing expecting only
tab-separated ID pairs. Edit the file to delete that print(...) call so the
script only writes the intended tab-separated output and no extraneous
diagnostic prints.

redirected_count += 1

# Break out if there is a circular path, meaning the redirects only point to other redirects,
# not an acutal page.

⚠️ Potential issue | 🟡 Minor

Fix typo in comment.

Line 61: "acutal" should be "actual".

Apply this diff:

-    # Break out if there is a circular path, meaning the redirects only point to other redirects,
-    # not an acutal page.
+    # Break out if there is a circular path, meaning the redirects only point to other redirects,
+    # not an actual page.
🤖 Prompt for AI Agents
In scripts/replace_titles_in_redirects_file.py.dis around line 61, there's a
typo in the inline comment: "acutal" should be corrected to "actual"; update the
comment text to read "not an actual page." preserving punctuation and spacing.

@jwngr jwngr self-assigned this Nov 16, 2025
@jwngr jwngr self-requested a review November 16, 2025 23:03

Development

Successfully merging this pull request may close these issues.

Failed Trimming Links for 6-1-2024 wiki dump

2 participants