This repository was archived by the owner on Apr 23, 2026. It is now read-only.

#78 : Research all the extensions related to Boost Library documentat…#83

Open
whisper67265 wants to merge 1 commit into CppDigest:main from whisper67265:dev-02

Conversation

whisper67265 (Collaborator) commented Feb 9, 2026

…ion.

Summary by CodeRabbit

Release Notes

  • Documentation

    • Enhanced Boost library metadata with file extension information and root path designations.
    • Added comprehensive extension summaries for all library files and documentation-specific file types.
    • Improved library registry organization with detailed submodule information and supported file formats.
  • Configuration

    • Added environment configuration support for API token management.

wpak-ai (Collaborator) commented Feb 18, 2026

@coderabbitai review

coderabbitai Bot commented Feb 18, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai Bot commented Feb 18, 2026

📝 Walkthrough

Walkthrough

The pull request introduces updated tooling for collecting and enriching Boost 1.90.0 library metadata. It removes an older library collection script and replaces it with a new script that fetches library information, Git roots, and file extensions from GitHub. Additionally, it adds extension summary utilities, updated library metadata files with root paths and extensions, and environment configuration support.

Changes

Cohort / File(s) Summary
Configuration
.env
New environment configuration file for GitHub API token storage.
Library Data Files
boost-1.90.0_libraries_list.txt, boost-1.90.0_libraries_list_submodules.txt
Extended library metadata now includes root path categories and associated file extensions for each library. New submodules file provides centralized registry with repository URLs, root paths, and extension mappings.
Extension Summary Data
boost-1.90.0_all_extensions_summary.txt, boost-1.90.0_doc_extensions_summary.txt
New aggregated summaries reporting file extension frequencies across Boost libraries; one covers all extensions (65 unique), the other focuses on documentation types (13 unique).
Library Collection Scripts
collect_boost_libraries.py (removed), collect_boost_libraries_extensions.py (added)
Old script replaced with enhanced version that fetches .gitmodules and libraries.json from GitHub, parses submodules and libraries, retrieves file extensions from doc folders via GitHub API, and outputs two formats (per-library and per-submodule). Supports CLI options for version, output path, token, and extension filtering.
Extension Summary Utility
sum_extensions.py
New utility script that processes library lists to aggregate and report file extension frequencies, generating both all-extensions and documentation-only summaries with counts and statistics.
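The walkthrough describes sum_extensions.py only at a high level. A minimal sketch of the kind of aggregation it performs, using collections.Counter; the input format assumed here (a comma-separated "extensions:" field per line) is an illustration, not the script's actual file format:

```python
from collections import Counter


def summarize_extensions(lines):
    """Count how many library entries mention each file extension.

    Assumes each line lists extensions after an 'extensions:' marker,
    comma-separated -- a guess at the real format, for illustration only.
    """
    counts = Counter()
    marker = "extensions:"
    for line in lines:
        if marker not in line:
            continue
        ext_field = line.split(marker, 1)[1]
        for ext in ext_field.split(","):
            ext = ext.strip()
            if ext:
                counts[ext] += 1
    return counts


sample = [
    "accumulators | root: doc | extensions: .html, .qbk",
    "algorithm | root: doc | extensions: .qbk, .xml",
]
print(summarize_extensions(sample).most_common())
```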

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant Script as collect_boost_libraries_extensions.py
    participant Env as .env
    participant GitHub as GitHub API
    participant FileSystem as File System
    
    User->>Script: Run with options
    Script->>Env: Load GITHUB_TOKEN
    Env-->>Script: Token (if present)
    Script->>GitHub: Fetch .gitmodules
    GitHub-->>Script: Gitmodules content
    Script->>Script: Parse libs submodules
    Script->>GitHub: Fetch libraries.json for each submodule
    GitHub-->>Script: Libraries JSON
    Script->>Script: Parse libraries and deduplicate
    Note over Script: If extensions enabled:
    Script->>GitHub: Fetch doc tree for each library
    GitHub-->>Script: File tree
    Script->>Script: Extract file extensions
    Script->>FileSystem: Write per-library output
    Script->>FileSystem: Write per-submodule summary
    FileSystem-->>Script: Write complete
    Script-->>User: Output files generated
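The authenticated fetch steps in the diagram can be sketched with the standard library alone. All names here (build_request, fetch_json, GITHUB_API_BASE) are illustrative, not necessarily the script's actual helpers:

```python
import json
import urllib.request

GITHUB_API_BASE = "https://api.github.com"  # assumed constant name from the review


def build_request(url, token=None):
    """Build a GitHub API request, attaching the token only when one is available."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(url, headers=headers)


def fetch_json(url, token=None):
    """Fetch and decode a JSON response (sketch; no retry or rate-limit handling)."""
    with urllib.request.urlopen(build_request(url, token)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Unauthenticated requests are capped at 60 per hour by GitHub, which is why the review below flags calls that drop the token.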

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐰 A new collection begins,
Extensions and roots now tracked,
GitHub tokens whisper their secrets,
Data flows from API to file,
Boost libraries shine brighter! 🌟

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled.
Title Check | ✅ Passed | The title references researching Boost Library documentation extensions, which aligns with the PR's core objective of identifying and cataloging file extensions across Boost libraries, particularly documentation-related ones.
Docstring Coverage | ✅ Passed | Docstring coverage is 90.00%, which meets the required threshold of 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (3)
BoostLibraryList/collect_boost_libraries_extensions.py (2)

200-215: Token is not forwarded in get_libraries_from_repo — unauthenticated rate-limit exposure for 150+ requests.

fetch_url(url) on line 209 omits the token, so every raw.githubusercontent.com call for meta/libraries.json is unauthenticated. With 157 submodules this adds 157 anonymous requests alongside the 300+ authenticated GitHub API calls, risking secondary rate-limit errors on shared CI runners. The function should accept and forward a token parameter.

♻️ Proposed fix
-def get_libraries_from_repo(submodule_name: str, ref: str) -> List[Tuple[str, str, str]]:
+def get_libraries_from_repo(
+    submodule_name: str, ref: str, token: Optional[str] = None
+) -> List[Tuple[str, str, str]]:
     url = LIBS_JSON_TEMPLATE.format(repo=submodule_name, ref=ref)
     try:
-        content = fetch_url(url)
+        content = fetch_url(url, token=token)

Then in collect_all_libraries, thread the token through:

-def collect_all_libraries(
-    lib_submodules: List[Tuple[str, str]], ref: str
-) -> List[Tuple[str, str, str]]:
+def collect_all_libraries(
+    lib_submodules: List[Tuple[str, str]], ref: str, token: Optional[str] = None
+) -> List[Tuple[str, str, str]]:
         ...
-        libs = get_libraries_from_repo(submodule_name, ref)
+        libs = get_libraries_from_repo(submodule_name, ref, token=token)

And in main:

-    all_libraries = collect_all_libraries(lib_submodules, libs_ref)
+    all_libraries = collect_all_libraries(lib_submodules, libs_ref, token=args.token)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@BoostLibraryList/collect_boost_libraries_extensions.py` around lines 200 -
215, get_libraries_from_repo is calling fetch_url(url) without forwarding the
GitHub token, causing unauthenticated requests to raw.githubusercontent.com;
modify get_libraries_from_repo to accept a token parameter (e.g., token:
Optional[str]) and pass it through to fetch_url (fetch_url(url, token=token) or
whatever fetch_url signature uses), update callers—specifically
collect_all_libraries (thread the token through its calls to
get_libraries_from_repo) and main (obtain the token and pass it into
collect_all_libraries) so every invocation uses the token; ensure
LIBS_JSON_TEMPLATE usage remains unchanged and preserve the existing
404/URLError handling.

409-411: Fragile .txt-suffix substitution for the submodule output path.

str.replace(".txt", "_submodules.txt") replaces only the first occurrence, which can mangle paths that have .txt appearing in a directory component (e.g., /txt_reports/boost.txt/txt_reports/boost_submodules.txt is correct by luck, but /txt_data/boost_output → unchanged, falling back to the + "_submodules" branch unexpectedly). Using pathlib.Path makes the intent explicit and handles edge cases safely.

♻️ Proposed fix
+from pathlib import Path
...
-    submodule_out_path = out_path.replace(".txt", "_submodules.txt")
-    if submodule_out_path == out_path:
-        submodule_out_path = out_path + "_submodules"
+    _p = Path(out_path)
+    submodule_out_path = str(_p.with_stem(_p.stem + "_submodules"))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@BoostLibraryList/collect_boost_libraries_extensions.py` around lines 409 -
411, Replace the fragile string replace logic for submodule_out_path with
pathlib path operations: create a Path from out_path (e.g., p = Path(out_path))
and set submodule_out_path to p.with_name(p.stem + "_submodules" + p.suffix).
Convert back to str if needed; this preserves directory components and correctly
handles filenames with or without a .txt suffix instead of using
out_path.replace or the fallback branch.
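As a quick check of the pathlib approach suggested above (submodule_path is a hypothetical helper name; note that with_name(stem + suffix) also works on Python versions before 3.9, unlike with_stem):

```python
from pathlib import Path


def submodule_path(out_path: str) -> str:
    """Derive the submodule output path without fragile string replacement."""
    p = Path(out_path)
    # with_name(stem + "_submodules" + suffix) keeps directory components
    # intact and degrades gracefully when there is no .txt suffix.
    return str(p.with_name(p.stem + "_submodules" + p.suffix))


print(submodule_path("boost-1.90.0_libraries_list.txt"))
print(submodule_path("data/boost_output"))
```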
BoostLibraryList/sum_extensions.py (1)

18-20: Hardcoded output paths are CWD-relative, unlike the sibling script.

DEFAULT_INPUT_PATH, OUTPUT_ALL, and OUTPUT_DOC are bare filenames, so they resolve relative to wherever the script is invoked. collect_boost_libraries_extensions.py anchors its paths to os.path.dirname(os.path.abspath(__file__)). For consistency and to allow running the script from any directory, the same approach can be applied here.

♻️ Proposed fix
+import os as _os
+
+_SCRIPT_DIR = _os.path.dirname(_os.path.abspath(__file__))
+
-DEFAULT_INPUT_PATH = "boost-1.90.0_libraries_list.txt"
-OUTPUT_ALL = "boost-1.90.0_all_extensions_summary.txt"
-OUTPUT_DOC = "boost-1.90.0_doc_extensions_summary.txt"
+DEFAULT_INPUT_PATH = _os.path.join(_SCRIPT_DIR, "boost-1.90.0_libraries_list.txt")
+OUTPUT_ALL = _os.path.join(_SCRIPT_DIR, "boost-1.90.0_all_extensions_summary.txt")
+OUTPUT_DOC = _os.path.join(_SCRIPT_DIR, "boost-1.90.0_doc_extensions_summary.txt")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@BoostLibraryList/sum_extensions.py` around lines 18 - 20, DEFAULT_INPUT_PATH,
OUTPUT_ALL, and OUTPUT_DOC are defined as CWD-relative bare filenames; change
them to be anchored to the script directory by building them with
os.path.join(os.path.dirname(os.path.abspath(__file__)), <filename>). Update
sum_extensions.py to import os if needed and replace the three constants with
platform-safe joined paths using that base directory (refer to
DEFAULT_INPUT_PATH, OUTPUT_ALL, OUTPUT_DOC and mirror the anchoring approach
used in collect_boost_libraries_extensions.py).

Comment thread BoostLibraryList/.env
@@ -0,0 +1 @@
GITHUB_TOKEN=<github_token here...> No newline at end of file

⚠️ Potential issue | 🔴 Critical

.env must not be committed to version control.

Even as a placeholder, tracking .env in git is dangerous: a developer may replace the placeholder with a real token and commit it, exposing it in history permanently. The standard convention is to commit a .env.example (or .env.template) containing only the key names with placeholder values, and add .env to .gitignore.

🔒 Recommended remediation
  1. Add .env to .gitignore:
+BoostLibraryList/.env
  1. Rename (or add) .env.example for documentation:
+GITHUB_TOKEN="<your_github_token_here>"
  1. Remove .env from tracking:
git rm --cached BoostLibraryList/.env

Also note the [ValueWithoutQuotes] lint warning from dotenv-linter: the value contains spaces and special characters (<, >, ...) and should be wrapped in double quotes per the dotenv spec.
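At runtime the script still needs to read the token from the untracked .env. A stdlib-only sketch of that lookup (load_env_token is a hypothetical name; the python-dotenv package's load_dotenv does the same more robustly):

```python
import os


def load_env_token(path=".env", key="GITHUB_TOKEN"):
    """Read KEY=value pairs from a .env-style file; the environment wins over the file."""
    value = os.environ.get(key)
    if value:
        return value
    try:
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if not line or line.startswith("#") or "=" not in line:
                    continue
                k, _, v = line.partition("=")
                if k.strip() == key:
                    return v.strip().strip('"')  # tolerate quoted values per the lint advice
    except FileNotFoundError:
        pass
    return None
```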

🧰 Tools
🪛 dotenv-linter (4.0.0)

[warning] 1-1: [EndingBlankLine] No blank line at the end of the file

(EndingBlankLine)


[warning] 1-1: [ValueWithoutQuotes] This value needs to be surrounded in quotes

(ValueWithoutQuotes)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@BoostLibraryList/.env` at line 1, Remove the tracked .env file from git and
stop committing secrets: add ".env" to .gitignore, rename or add a
".env.example" (containing only the GITHUB_TOKEN key with a safe placeholder)
for documentation, and run "git rm --cached" on the tracked .env to remove it
from history; also update the placeholder value in the example to be quoted
(e.g., "GITHUB_TOKEN=\"<placeholder>\"") to satisfy dotenv-linter's
ValueWithoutQuotes warning.

Comment on lines +155 to +167
tree_url = f"{GITHUB_API_BASE}/repos/{owner}/{repo}/git/trees/{tree_sha}?recursive=1"
tree_data = fetch_json(tree_url, token=token)
tree_entries = tree_data.get("tree") or []
prefix = doc_path.rstrip("/") + "/"
for entry in tree_entries:
    if entry.get("type") != "blob":
        continue
    path = entry.get("path", "")
    if not path.startswith(prefix):
        continue
    _, ext = os.path.splitext(path)
    if ext:
        extensions.add(ext)

⚠️ Potential issue | 🟠 Major

truncated flag from GitHub Git Trees API is not checked — silent data loss possible.

When a repository tree exceeds 100,000 entries the GitHub API sets "truncated": true and omits the rest. The code doesn't check this field, so extensions in the tail of a large tree are silently dropped without any warning.

🛡️ Proposed fix
         tree_data = fetch_json(tree_url, token=token)
         tree_entries = tree_data.get("tree") or []
+        if tree_data.get("truncated"):
+            print(
+                f"Warning: tree response truncated for {owner}/{repo}; "
+                "some extensions may be missing.",
+                file=sys.stderr,
+            )
         prefix = doc_path.rstrip("/") + "/"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@BoostLibraryList/collect_boost_libraries_extensions.py` around lines 155 -
167, The code that fetches a git tree (tree_url/tree_data/tree_entries) must
detect GitHub's "truncated" flag to avoid silent data loss: after calling
fetch_json and assigning tree_data, check tree_data.get("truncated") (or
"truncated" in tree_data) and if true, fail fast (raise a RuntimeError or call
processLogger.error and exit) with a clear message including the
tree_url/owner/repo/tree_sha so callers know the tree was incomplete; do not
silently continue processing tree_entries. Update the logic around
tree_data/tree_entries in collect_boost_libraries_extensions.py to perform this
check before iterating and ensure the error path surfaces to the caller instead
of returning partial extensions from extensions.add.
