Update mediawiki database schema + python version #314
Conversation
Walkthrough

This pull request refactors the database build pipeline to support language-configurable Wikipedia dumps and progressive intermediate file cleanup. The main build script (`buildDatabase.sh`) drives the steps shown in the sequence diagram below.

Sequence Diagram(s)

```mermaid
sequenceDiagram
participant User
participant buildDatabase.sh
participant DownloadTools as Download/SHA1
participant TrimScripts as Trim Scripts
participant ReplaceScripts as Replace Scripts
participant DBImport as SQLite Import
participant FileCleanup as Cleanup
User->>buildDatabase.sh: Invoke with WLANG, OUT_DIR, DELETE_PROGRESSIVELY
buildDatabase.sh->>DownloadTools: Download WLANG dump (redirects, pages, links, targets, SHA1SUM)
DownloadTools-->>buildDatabase.sh: Downloaded files
buildDatabase.sh->>TrimScripts: Trim redirects, pages, links, targets
TrimScripts-->>buildDatabase.sh: Trimmed files
alt DELETE_PROGRESSIVELY enabled
buildDatabase.sh->>FileCleanup: Remove source .gz files
end
buildDatabase.sh->>ReplaceScripts: Replace titles in redirects
ReplaceScripts-->>buildDatabase.sh: redirects.with_ids.txt
buildDatabase.sh->>ReplaceScripts: Replace titles in targets
ReplaceScripts-->>buildDatabase.sh: targets.with_ids.txt
buildDatabase.sh->>TrimScripts: Sort/group/count links
TrimScripts-->>buildDatabase.sh: links.grouped_by_*.txt
buildDatabase.sh->>ReplaceScripts: Replace titles in links
ReplaceScripts-->>buildDatabase.sh: links.with_ids.txt
buildDatabase.sh->>DBImport: Load all processed files to SQLite
DBImport-->>buildDatabase.sh: Database populated
alt DELETE_PROGRESSIVELY enabled
buildDatabase.sh->>FileCleanup: Remove intermediate files
end
```

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
Actionable comments posted: 5
🧹 Nitpick comments (2)
scripts/replace_titles_and_redirects_in_links_file.py (1)
67-73: Remove unnecessary `else` clause. The `else: pass` block on lines 71-73 is unnecessary and can be removed for cleaner code. Apply this diff:

```diff
     target_page_id = TARGETS.get(target_id)
     if target_page_id is not None and source_page_id != target_page_id:
         target_page_id = REDIRECTS.get(target_page_id, target_page_id)
         print(b'\t'.join([source_page_id, target_page_id]).decode())
-    else:
-        pass
```

scripts/replace_titles_and_redirects_in_targets_file.py (1)

58-60: Remove unnecessary `else` clause. The `else: pass` block on lines 58-59 is unnecessary and can be removed for cleaner code. Apply this diff:

```diff
     if target_page_id is not None:
         target_page_id = REDIRECTS.get(target_page_id, target_page_id)
         print(b'\t'.join([target_id, target_page_id]).decode())
-    else:
-        pass
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to data retention organization setting
📒 Files selected for processing (8)
- scripts/README.md (1 hunks)
- scripts/buildDatabase.sh (13 hunks)
- scripts/combine_grouped_links_files.py (1 hunks)
- scripts/prune_pages_file.py (1 hunks)
- scripts/replace_titles_and_redirects_in_links_file.py (2 hunks)
- scripts/replace_titles_and_redirects_in_targets_file.py (1 hunks)
- scripts/replace_titles_in_redirects_file.py (3 hunks)
- scripts/replace_titles_in_redirects_file.py.dis (1 hunks)
🧰 Additional context used
🪛 LanguageTool
scripts/README.md
[grammar] ~19-~19: Ensure spelling is correct
Context: ...is page - page_is_redirect -> Boolean wether this page is a redirect - Ignore the ei...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[grammar] ~31-~31: Ensure spelling is correct
Context: ...r each redirection, rd_title with the targetted page_id by matching on page_title. ...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[grammar] ~32-~32: Ensure spelling is correct
Context: ...ge_idby matching onpage_title`. The targetted page_id is then computed as a redirect ...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[grammar] ~37-~37: Ensure spelling is correct
Context: ...or each linktarget, lt_title with the targetted page_id by matching on page_title. ...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[grammar] ~43-~43: Ensure spelling is correct
Context: ...es, for each pagelink, lt_id with the targetted page_id by joining with `links.with_i...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
🪛 Pylint (4.0.2)
scripts/prune_pages_file.py
[refactor] 40-40: Boolean condition 'True or is_redirect == '0' or page_id in REDIRECTS' will always evaluate to 'True'
(R1727)
🪛 Ruff (0.14.3)
scripts/replace_titles_and_redirects_in_targets_file.py
1-1: The file is executable but no shebang is present
(EXE002)
🔇 Additional comments (12)
scripts/prune_pages_file.py (1)
31-33: Binary mode migration looks correct. The conversion to binary mode with byte literals for reading redirects is implemented correctly.
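For context, a minimal sketch of the binary-mode reading pattern these scripts now share; the file name and loading loop below are illustrative assumptions, not code quoted from the PR:

```python
import gzip
import io

# Build a redirects map from a gzipped TSV, keeping keys and values as bytes
# so large dump files never pay per-line decoding costs.
REDIRECTS = {}
with gzip.open('redirects.with_ids.txt.gz', 'rb') as raw:  # hypothetical path
    for line in io.BufferedReader(raw):
        [source_id, target_id] = line.rstrip(b'\n').split(b'\t')
        REDIRECTS[source_id] = target_id
```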
scripts/combine_grouped_links_files.py (1)
29-54: Binary mode migration and numeric keys are well implemented. The conversion to binary mode processing with numeric keys (0 for outgoing, 1 for incoming) is clean and efficient. The inline comment on line 31 documenting the key convention is helpful.
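To illustrate the key convention only (the dictionary shape here is an assumption for the sketch, not the script's actual structure):

```python
# LINKS maps a page id (bytes) to its link counts, keyed by direction:
# 0 = outgoing links, 1 = incoming links.
LINKS = {}

def record_count(page_id, direction, count):
    # direction is the numeric key: 0 for outgoing, 1 for incoming
    LINKS.setdefault(page_id, {0: 0, 1: 0})[direction] = count

record_count(b'1234', 0, 17)  # page 1234 links out to 17 pages
record_count(b'1234', 1, 5)   # 5 pages link in to page 1234
```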
scripts/replace_titles_in_redirects_file.py (1)
28-65: Binary mode migration and redirect chain resolution are correct. The conversion to binary mode with bytes-based data structures is well executed. The redirect chain resolution logic correctly handles circular references and deep chains with a 100-step limit.
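A minimal sketch of bounded redirect-chain resolution as described above; the function and variable names are hypothetical, not taken from the script:

```python
def resolve_redirect(page_id, redirects, max_steps=100):
    """Follow redirects until a non-redirect page is reached.

    Returns None if the chain is circular or deeper than max_steps,
    i.e. the redirects only point to other redirects, never a real page.
    """
    for _ in range(max_steps):
        target = redirects.get(page_id)
        if target is None:
            return page_id  # landed on an actual page
        page_id = target
    return None  # circular or excessively deep chain; caller should skip it
```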
scripts/replace_titles_and_redirects_in_links_file.py (1)
13-55: TARGETS_FILE integration and binary mode migration are correct. The addition of the TARGETS_FILE argument and corresponding TARGETS dictionary aligns with the new MediaWiki schema that uses `linktarget` tables. The binary mode migration is consistently applied.
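As a rough illustration of what the TARGETS dictionary holds; the file name and column order below are assumptions based on the README description, not quoted code:

```python
import gzip
import io

# Map lt_id (bytes) -> resolved page_id (bytes), one pair per line of the
# targets file produced earlier in the pipeline.
TARGETS = {}
with gzip.open('targets.with_ids.txt.gz', 'rb') as raw:  # hypothetical path
    for line in io.BufferedReader(raw):
        [lt_id, page_id] = line.rstrip(b'\n').split(b'\t')
        TARGETS[lt_id] = page_id
```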
scripts/replace_titles_in_redirects_file.py.dis (1)

1-66: Clarify the purpose of this .dis file. The `.dis` file extension and the presence of debug output suggest this might be a development/debug artifact. If this file is intended for production use, it needs cleanup. If it's a debug script, consider whether it should be committed to the repository. Should this file be included in the PR, or is it a debug artifact that was accidentally committed?
scripts/replace_titles_and_redirects_in_targets_file.py (1)
34-57: Target resolution logic is correct. The logic to map target titles to page IDs and resolve through redirects is correctly implemented. The binary mode migration is consistent with other scripts in this PR.
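A compressed sketch of that resolution step, assuming bytes-keyed dicts; the names `titles_to_ids` and `redirects` are illustrative, not the script's own:

```python
def resolve_target(lt_id, lt_title, titles_to_ids, redirects):
    # Map the link target's title to a page id, then follow redirects
    # (already flattened to final pages) to get the "final" page id.
    page_id = titles_to_ids.get(lt_title)
    if page_id is None:
        return None  # title has no matching namespace-0 page; drop the target
    return (lt_id, redirects.get(page_id, page_id))
```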
scripts/buildDatabase.sh (6)
8-11: Language-aware configuration is well designed. The introduction of `WLANG`, `OUT_DIR`, and `DELETE_PROGRESSIVELY` environment variables with sensible defaults provides good flexibility for processing different Wikipedia languages and managing disk space.
26-36: ROOT_DIR improvement and consistent language-aware filenames. Changing `ROOT_DIR` from `pwd` to `$(dirname "$0")` makes the script more robust when called from different directories. The consistent use of the `${WLANG}wiki` prefix in all filenames supports multi-language processing cleanly.
86-86: Targets file support aligns with new MediaWiki schema. The addition of targets file download and processing correctly supports the new MediaWiki `linktarget` table introduced in recent schema versions. This is essential for addressing issue #149.

Also applies to: 160-182
113-113: Progressive deletion feature is well implemented. The `DELETE_PROGRESSIVELY` feature systematically cleans up intermediate files after each processing step when enabled. This addresses disk space concerns for large dumps while remaining opt-in for debugging scenarios.

Also applies to: 135-135, 158-158, 182-182, 198-198, 209-209, 220-220, 230-230, 258-258, 274-274, 285-285, 300-300, 310-310, 315-315, 320-320
151-152: Regex patterns are correct and properly aligned with the schema. The regex on line 151 (`^[0-9]+,0,[0-9]+$`) correctly matches the pagelinks format (`pl_from,pl_namespace,pl_target_id`), filtering for namespace 0 only. The sed command on line 152 (`s/,0,/\t/g`) properly extracts the required fields by replacing the namespace separator with a tab, resulting in the output format `source_page_id[TAB]target_id`. The patterns align with the documented schema in scripts/README.md and work as intended.
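To make the extraction concrete, here is a hypothetical Python equivalent of the egrep + sed pair; the sample rows are invented for illustration:

```python
import re

rows = ["12345,0,67890", "12345,1,67890"]  # pl_from,pl_namespace,pl_target_id
for row in rows:
    if re.fullmatch(r"[0-9]+,0,[0-9]+", row):  # keep namespace-0 rows only
        # replace the ",0," separator with a tab -> "12345\t67890"
        print(row.replace(",0,", "\t"))
```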
126-129: Verify regex patterns with actual MediaWiki dump data for your version. The patterns on lines 126-129 appear logically sound and align with the MediaWiki schema (extracting page_id, filtering namespace=0, capturing title and is_redirect). However, since no dump files exist in the sandbox environment, I cannot confirm the regex patterns handle all edge cases correctly, particularly the escaped quote handling in line 127's pattern `([^']*(\\\\')?)+'`, which depends on your MediaWiki version's exact escaping format.

Manually test these patterns with a small sample of your actual dump file:

```sh
pigz -dc your-dump-page.sql.gz | head -20 | sed -n 's/^INSERT INTO `page` VALUES //p' | head -3
```

Then verify the extracted records match your expectations with the full pipeline.
```diff
 for line in io.BufferedReader(gzip.open(PAGES_FILE, 'rb')):
     [page_id, page_title, is_redirect] = line.rstrip(b'\n').split(b'\t')

-    if is_redirect == '0' or page_id in REDIRECTS:
-        print('\t'.join([page_id, page_title, is_redirect]))
+    if True or is_redirect == '0' or page_id in REDIRECTS:
+        print(b'\t'.join([page_id, page_title, is_redirect]).decode())
```
Critical: Filtering logic is completely disabled.

The condition on line 40 will always evaluate to `True` due to the leading `True or`, which completely disables the pruning logic. This means ALL pages are output regardless of whether they should be filtered. Additionally, the comparison `is_redirect == '0'` will never match because `is_redirect` is now bytes (from line 38's split on `b'\t'`), but it's being compared to the string `'0'`.

This bug could directly contribute to issue #149, as the pruning step would output incorrect data.

Apply this diff to fix the logic:

```diff
-    if True or is_redirect == '0' or page_id in REDIRECTS:
+    if is_redirect == b'0' or page_id in REDIRECTS:
         print(b'\t'.join([page_id, page_title, is_redirect]).decode())
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
for line in io.BufferedReader(gzip.open(PAGES_FILE, 'rb')):
    [page_id, page_title, is_redirect] = line.rstrip(b'\n').split(b'\t')
    if is_redirect == b'0' or page_id in REDIRECTS:
        print(b'\t'.join([page_id, page_title, is_redirect]).decode())
```
🤖 Prompt for AI Agents
In scripts/prune_pages_file.py around lines 37 to 41, the pruning condition is
disabled by a stray "True or" and also compares bytes to a string; remove the
"True or" so the conditional actually filters, compare is_redirect against bytes
(b'0'), and make the page_id membership check use the same type as REDIRECTS
(either decode page_id to str before the check or convert REDIRECTS to bytes).
Concretely: remove "True or", change is_redirect == '0' to is_redirect == b'0',
and either call pid = page_id.decode() then use "or pid in REDIRECTS" or ensure
REDIRECTS contains byte values so you can use "or page_id in REDIRECTS"; leave
the print as-is (it already decodes the joined bytes).
- `page_id` -> Id of the page
- (`page_namespace`) -> We keep only if equals 0 (= namespace of this page)
- `page_title` -> Title of this page
- `page_is_redirect` -> Boolean wether this page is a redirect
Fix spelling error.

Line 19: "wether" should be "whether". Apply this diff:

```diff
-- `page_is_redirect` -> Boolean wether this page is a redirect
+- `page_is_redirect` -> Boolean whether this page is a redirect
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```markdown
- `page_is_redirect` -> Boolean whether this page is a redirect
```
🤖 Prompt for AI Agents
In scripts/README.md around line 19, the word "wether" is misspelled; replace
"wether" with the correct spelling "whether" so the line reads
"`page_is_redirect` -> Boolean whether this page is a redirect".
Replaces for each redirection, `rd_title` with the targetted `page_id` by matching on `page_title`.
The targetted page_id is then computed as a redirect recursively, until we get on a "final" page.
- `rd_from` -> The id of the page we are redirected from
- `page_id` -> The id of the page we get to following redirections recursively

### targets.with_ids.txt (replace_titles_and_redirects_in_targets_file.py)
Replaces, for each linktarget, `lt_title` with the targetted `page_id` by matching on `page_title`.
We then compute the "final" page obtained from this page following redirection, with the file `redirects.with_ids.txt`.
- `lt_id` -> Id of this link
- `page_id` -> The id of the page this link is pointing to, after having followed all redirections

### links.with_ids.txt (replace_titles_and_redirects_in_links_file.py)
Replaces, for each pagelink, `lt_id` with the targetted `page_id` by joining with `links.with_ids.txt`.
Fix spelling errors.

Lines 31, 32, 37, and 43: "targetted" should be "targeted". Apply this diff:

```diff
-Replaces for each redirection, `rd_title` with the targetted `page_id` by matching on `page_title`.
-The targetted page_id is then computed as a redirect recursively, until we get on a "final" page.
+Replaces for each redirection, `rd_title` with the targeted `page_id` by matching on `page_title`.
+The targeted page_id is then computed as a redirect recursively, until we get on a "final" page.
 - `rd_from` -> The id of the page we are redirected from
 - `page_id` -> The id of the page we get to following redirections recursively

 ### targets.with_ids.txt (replace_titles_and_redirects_in_targets_file.py)
-Replaces, for each linktarget, `lt_title` with the targetted `page_id` by matching on `page_title`.
+Replaces, for each linktarget, `lt_title` with the targeted `page_id` by matching on `page_title`.
```

And on line 43:

```diff
-Replaces, for each pagelink, `lt_id` with the targetted `page_id` by joining with `links.with_ids.txt`.
+Replaces, for each pagelink, `lt_id` with the targeted `page_id` by joining with `links.with_ids.txt`.
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```markdown
Replaces for each redirection, `rd_title` with the targeted `page_id` by matching on `page_title`.
The targeted page_id is then computed as a redirect recursively, until we get on a "final" page.
- `rd_from` -> The id of the page we are redirected from
- `page_id` -> The id of the page we get to following redirections recursively

### targets.with_ids.txt (replace_titles_and_redirects_in_targets_file.py)
Replaces, for each linktarget, `lt_title` with the targeted `page_id` by matching on `page_title`.
We then compute the "final" page obtained from this page following redirection, with the file `redirects.with_ids.txt`.
- `lt_id` -> Id of this link
- `page_id` -> The id of the page this link is pointing to, after having followed all redirections

### links.with_ids.txt (replace_titles_and_redirects_in_links_file.py)
Replaces, for each pagelink, `lt_id` with the targeted `page_id` by joining with `links.with_ids.txt`.
```
🤖 Prompt for AI Agents
scripts/README.md around lines 31 to 43: fix the spelling mistake "targetted" ->
"targeted" on lines 31, 32, 37, and 43 by replacing each occurrence with
"targeted" so the README uses correct English for describing targeted page ids
and links.
```python
ALL_TARGET_IDS = set()
TARGET_TITLES_TO_IDS = {}
for line in io.BufferedReader(gzip.open(TARGETS_FILE, 'rb')):
    print("LALIGNE",line.rstrip(b'\n').split(b'\t'))
```
Remove debug print statement.

The debug print `print("LALIGNE", line.rstrip(b'\n').split(b'\t'))` on line 32 will output diagnostic information to stdout for every line in the targets file. This will corrupt the intended output and break downstream processing that expects only tab-separated ID pairs.

Apply this diff to remove the debug print:

```diff
 for line in io.BufferedReader(gzip.open(TARGETS_FILE, 'rb')):
-    print("LALIGNE",line.rstrip(b'\n').split(b'\t'))
     [page_id, page_title,_] = line.rstrip(b'\n').split(b'\t')
```

🤖 Prompt for AI Agents
In scripts/replace_titles_in_redirects_file.py.dis around line 32, remove the
debug print statement that prints "LALIGNE" and the parsed line; this debug
output pollutes stdout and breaks downstream processing expecting only
tab-separated ID pairs. Edit the file to delete that print(...) call so the
script only writes the intended tab-separated output and no extraneous
diagnostic prints.
```python
            redirected_count += 1

            # Break out if there is a circular path, meaning the redirects only point to other redirects,
            # not an acutal page.
```
Fix typo in comment.

Line 61: "acutal" should be "actual". Apply this diff:

```diff
-            # Break out if there is a circular path, meaning the redirects only point to other redirects,
-            # not an acutal page.
+            # Break out if there is a circular path, meaning the redirects only point to other redirects,
+            # not an actual page.
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
            # Break out if there is a circular path, meaning the redirects only point to other redirects,
            # not an actual page.
```
🤖 Prompt for AI Agents
In scripts/replace_titles_in_redirects_file.py.dis around line 61, there's a
typo in the inline comment: "acutal" should be corrected to "actual"; update the
comment text to read "not an actual page." preserving punctuation and spacing.
MediaWiki updated its database schema.
I also added an env variable, WLANG, to set the Wikipedia language (I personally use the French Wikipedia).
I needed to move some Python code to a more recent version (using binary streams).
Generation worked on my computer for the English language.
Closes #149
Summary by CodeRabbit
Documentation
Chores