This repository was archived by the owner on Apr 23, 2026. It is now read-only.

#78 : Research all the extensions related to Boost Library documentat…#83

Open
whisper67265 wants to merge 1 commit into CppDigest:main from whisper67265:dev-02

Conversation

whisper67265 (Collaborator) commented Feb 9, 2026

…ion.

Summary by CodeRabbit

Release Notes

  • Documentation

    • Enhanced Boost library metadata with file extension information and root path designations.
    • Added comprehensive extension summaries for all library files and documentation-specific file types.
    • Improved library registry organization with detailed submodule information and supported file formats.
  • Configuration

    • Added environment configuration support for API token management.

wpak-ai (Collaborator) commented Feb 18, 2026

@coderabbitai review

coderabbitai Bot commented Feb 18, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai Bot commented Feb 18, 2026

📝 Walkthrough

Walkthrough

The pull request introduces updated tooling for collecting and enriching Boost 1.90.0 library metadata. It removes an older library collection script and replaces it with a new script that fetches library information, Git roots, and file extensions from GitHub. Additionally, it adds extension summary utilities, updated library metadata files with root paths and extensions, and environment configuration support.

Changes

Cohort / File(s) Summary
Configuration
.env
New environment configuration file for GitHub API token storage.
Library Data Files
boost-1.90.0_libraries_list.txt, boost-1.90.0_libraries_list_submodules.txt
Extended library metadata now includes root path categories and associated file extensions for each library. New submodules file provides centralized registry with repository URLs, root paths, and extension mappings.
Extension Summary Data
boost-1.90.0_all_extensions_summary.txt, boost-1.90.0_doc_extensions_summary.txt
New aggregated summaries reporting file extension frequencies across Boost libraries; one covers all extensions (65 unique), the other focuses on documentation types (13 unique).
Library Collection Scripts
collect_boost_libraries.py (removed), collect_boost_libraries_extensions.py (added)
Old script replaced with enhanced version that fetches .gitmodules and libraries.json from GitHub, parses submodules and libraries, retrieves file extensions from doc folders via GitHub API, and outputs two formats (per-library and per-submodule). Supports CLI options for version, output path, token, and extension filtering.
Extension Summary Utility
sum_extensions.py
New utility script that processes library lists to aggregate and report file extension frequencies, generating both all-extensions and documentation-only summaries with counts and statistics.
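The walkthrough describes sum_extensions.py only at a high level. A minimal sketch of the kind of aggregation it performs, using collections.Counter; the input format assumed here (a comma-separated "extensions:" field per line) is an illustration, not the script's actual file format:

```python
from collections import Counter


def summarize_extensions(lines):
    """Count how many library entries mention each file extension.

    Assumes each line lists extensions after an 'extensions:' marker,
    comma-separated -- a guess at the real format, for illustration only.
    """
    counts = Counter()
    marker = "extensions:"
    for line in lines:
        if marker not in line:
            continue
        ext_field = line.split(marker, 1)[1]
        for ext in ext_field.split(","):
            ext = ext.strip()
            if ext:
                counts[ext] += 1
    return counts


sample = [
    "accumulators | root: doc | extensions: .html, .qbk",
    "algorithm | root: doc | extensions: .qbk, .xml",
]
print(summarize_extensions(sample).most_common())
```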

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant Script as collect_boost_libraries_extensions.py
    participant Env as .env
    participant GitHub as GitHub API
    participant FileSystem as File System
    
    User->>Script: Run with options
    Script->>Env: Load GITHUB_TOKEN
    Env-->>Script: Token (if present)
    Script->>GitHub: Fetch .gitmodules
    GitHub-->>Script: Gitmodules content
    Script->>Script: Parse libs submodules
    Script->>GitHub: Fetch libraries.json for each submodule
    GitHub-->>Script: Libraries JSON
    Script->>Script: Parse libraries and deduplicate
    Note over Script: If extensions enabled:
    Script->>GitHub: Fetch doc tree for each library
    GitHub-->>Script: File tree
    Script->>Script: Extract file extensions
    Script->>FileSystem: Write per-library output
    Script->>FileSystem: Write per-submodule summary
    FileSystem-->>Script: Write complete
    Script-->>User: Output files generated
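The authenticated fetch steps in the diagram can be sketched with the standard library alone. All names here (build_request, fetch_json, GITHUB_API_BASE) are illustrative, not necessarily the script's actual helpers:

```python
import json
import urllib.request

GITHUB_API_BASE = "https://api.github.com"  # assumed constant name from the review


def build_request(url, token=None):
    """Build a GitHub API request, attaching the token only when one is available."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(url, headers=headers)


def fetch_json(url, token=None):
    """Fetch and decode a JSON response (sketch; no retry or rate-limit handling)."""
    with urllib.request.urlopen(build_request(url, token)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Unauthenticated requests are capped at 60 per hour by GitHub, which is why the review below flags calls that drop the token.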

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐰 A new collection begins,
Extensions and roots now tracked,
GitHub tokens whisper their secrets,
Data flows from API to file,
Boost libraries shine brighter! 🌟

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled.
Title Check | ✅ Passed | The title references researching Boost Library documentation extensions, which aligns with the PR's core objective of identifying and cataloging file extensions across Boost libraries, particularly documentation-related ones.
Docstring Coverage | ✅ Passed | Docstring coverage is 90.00%, which meets the required threshold of 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (3)
BoostLibraryList/collect_boost_libraries_extensions.py (2)

200-215: Token is not forwarded in get_libraries_from_repo — unauthenticated rate-limit exposure for 150+ requests.

fetch_url(url) on line 209 omits the token, so every raw.githubusercontent.com call for meta/libraries.json is unauthenticated. With 157 submodules this adds 157 anonymous requests alongside the 300+ authenticated GitHub API calls, risking secondary rate-limit errors on shared CI runners. The function should accept and forward a token parameter.

♻️ Proposed fix
-def get_libraries_from_repo(submodule_name: str, ref: str) -> List[Tuple[str, str, str]]:
+def get_libraries_from_repo(
+    submodule_name: str, ref: str, token: Optional[str] = None
+) -> List[Tuple[str, str, str]]:
     url = LIBS_JSON_TEMPLATE.format(repo=submodule_name, ref=ref)
     try:
-        content = fetch_url(url)
+        content = fetch_url(url, token=token)

Then in collect_all_libraries, thread the token through:

-def collect_all_libraries(
-    lib_submodules: List[Tuple[str, str]], ref: str
-) -> List[Tuple[str, str, str]]:
+def collect_all_libraries(
+    lib_submodules: List[Tuple[str, str]], ref: str, token: Optional[str] = None
+) -> List[Tuple[str, str, str]]:
         ...
-        libs = get_libraries_from_repo(submodule_name, ref)
+        libs = get_libraries_from_repo(submodule_name, ref, token=token)

And in main:

-    all_libraries = collect_all_libraries(lib_submodules, libs_ref)
+    all_libraries = collect_all_libraries(lib_submodules, libs_ref, token=args.token)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@BoostLibraryList/collect_boost_libraries_extensions.py` around lines 200 -
215, get_libraries_from_repo is calling fetch_url(url) without forwarding the
GitHub token, causing unauthenticated requests to raw.githubusercontent.com;
modify get_libraries_from_repo to accept a token parameter (e.g., token:
Optional[str]) and pass it through to fetch_url (fetch_url(url, token=token) or
whatever fetch_url signature uses), update callers—specifically
collect_all_libraries (thread the token through its calls to
get_libraries_from_repo) and main (obtain the token and pass it into
collect_all_libraries) so every invocation uses the token; ensure
LIBS_JSON_TEMPLATE usage remains unchanged and preserve the existing
404/URLError handling.

409-411: Fragile .txt-suffix substitution for the submodule output path.

str.replace(".txt", "_submodules.txt") replaces only the first occurrence, which can mangle paths that have .txt appearing in a directory component (e.g., /txt_reports/boost.txt/txt_reports/boost_submodules.txt is correct by luck, but /txt_data/boost_output → unchanged, falling back to the + "_submodules" branch unexpectedly). Using pathlib.Path makes the intent explicit and handles edge cases safely.

♻️ Proposed fix
+from pathlib import Path
...
-    submodule_out_path = out_path.replace(".txt", "_submodules.txt")
-    if submodule_out_path == out_path:
-        submodule_out_path = out_path + "_submodules"
+    _p = Path(out_path)
+    submodule_out_path = str(_p.with_stem(_p.stem + "_submodules"))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@BoostLibraryList/collect_boost_libraries_extensions.py` around lines 409 -
411, Replace the fragile string replace logic for submodule_out_path with
pathlib path operations: create a Path from out_path (e.g., p = Path(out_path))
and set submodule_out_path to p.with_name(p.stem + "_submodules" + p.suffix).
Convert back to str if needed; this preserves directory components and correctly
handles filenames with or without a .txt suffix instead of using
out_path.replace or the fallback branch.
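As a quick check of the pathlib approach suggested above (submodule_path is a hypothetical helper name; note that with_name(stem + suffix) also works on Python versions before 3.9, unlike with_stem):

```python
from pathlib import Path


def submodule_path(out_path: str) -> str:
    """Derive the submodule output path without fragile string replacement."""
    p = Path(out_path)
    # with_name(stem + "_submodules" + suffix) keeps directory components
    # intact and degrades gracefully when there is no .txt suffix.
    return str(p.with_name(p.stem + "_submodules" + p.suffix))


print(submodule_path("boost-1.90.0_libraries_list.txt"))
print(submodule_path("data/boost_output"))
```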
BoostLibraryList/sum_extensions.py (1)

18-20: Hardcoded output paths are CWD-relative, unlike the sibling script.

DEFAULT_INPUT_PATH, OUTPUT_ALL, and OUTPUT_DOC are bare filenames, so they resolve relative to wherever the script is invoked. collect_boost_libraries_extensions.py anchors its paths to os.path.dirname(os.path.abspath(__file__)). For consistency and to allow running the script from any directory, the same approach can be applied here.

♻️ Proposed fix
+import os as _os
+
+_SCRIPT_DIR = _os.path.dirname(_os.path.abspath(__file__))
+
-DEFAULT_INPUT_PATH = "boost-1.90.0_libraries_list.txt"
-OUTPUT_ALL = "boost-1.90.0_all_extensions_summary.txt"
-OUTPUT_DOC = "boost-1.90.0_doc_extensions_summary.txt"
+DEFAULT_INPUT_PATH = _os.path.join(_SCRIPT_DIR, "boost-1.90.0_libraries_list.txt")
+OUTPUT_ALL = _os.path.join(_SCRIPT_DIR, "boost-1.90.0_all_extensions_summary.txt")
+OUTPUT_DOC = _os.path.join(_SCRIPT_DIR, "boost-1.90.0_doc_extensions_summary.txt")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@BoostLibraryList/sum_extensions.py` around lines 18 - 20, DEFAULT_INPUT_PATH,
OUTPUT_ALL, and OUTPUT_DOC are defined as CWD-relative bare filenames; change
them to be anchored to the script directory by building them with
os.path.join(os.path.dirname(os.path.abspath(__file__)), <filename>). Update
sum_extensions.py to import os if needed and replace the three constants with
platform-safe joined paths using that base directory (refer to
DEFAULT_INPUT_PATH, OUTPUT_ALL, OUTPUT_DOC and mirror the anchoring approach
used in collect_boost_libraries_extensions.py).

Comment thread BoostLibraryList/.env
@@ -0,0 +1 @@
GITHUB_TOKEN=<github_token here...> No newline at end of file

⚠️ Potential issue | 🔴 Critical

.env must not be committed to version control.

Even as a placeholder, tracking .env in git is dangerous: a developer may replace the placeholder with a real token and commit it, exposing it in history permanently. The standard convention is to commit a .env.example (or .env.template) containing only the key names with placeholder values, and add .env to .gitignore.

🔒 Recommended remediation
  1. Add .env to .gitignore:
+BoostLibraryList/.env
  1. Rename (or add) .env.example for documentation:
+GITHUB_TOKEN="<your_github_token_here>"
  1. Remove .env from tracking:
git rm --cached BoostLibraryList/.env

Also note the [ValueWithoutQuotes] lint warning from dotenv-linter: the value contains spaces and special characters (<, >, ...) and should be wrapped in double quotes per the dotenv spec.
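At runtime the script still needs to read the token from the untracked .env. A stdlib-only sketch of that lookup (load_env_token is a hypothetical name; the python-dotenv package's load_dotenv does the same more robustly):

```python
import os


def load_env_token(path=".env", key="GITHUB_TOKEN"):
    """Read KEY=value pairs from a .env-style file; the environment wins over the file."""
    value = os.environ.get(key)
    if value:
        return value
    try:
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if not line or line.startswith("#") or "=" not in line:
                    continue
                k, _, v = line.partition("=")
                if k.strip() == key:
                    return v.strip().strip('"')  # tolerate quoted values per the lint advice
    except FileNotFoundError:
        pass
    return None
```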

🧰 Tools
🪛 dotenv-linter (4.0.0)

[warning] 1-1: [EndingBlankLine] No blank line at the end of the file

(EndingBlankLine)


[warning] 1-1: [ValueWithoutQuotes] This value needs to be surrounded in quotes

(ValueWithoutQuotes)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@BoostLibraryList/.env` at line 1, Remove the tracked .env file from git and
stop committing secrets: add ".env" to .gitignore, rename or add a
".env.example" (containing only the GITHUB_TOKEN key with a safe placeholder)
for documentation, and run "git rm --cached" on the tracked .env to remove it
from history; also update the placeholder value in the example to be quoted
(e.g., "GITHUB_TOKEN=\"<placeholder>\"") to satisfy dotenv-linter's
ValueWithoutQuotes warning.

Comment on lines +155 to +167
tree_url = f"{GITHUB_API_BASE}/repos/{owner}/{repo}/git/trees/{tree_sha}?recursive=1"
tree_data = fetch_json(tree_url, token=token)
tree_entries = tree_data.get("tree") or []
prefix = doc_path.rstrip("/") + "/"
for entry in tree_entries:
    if entry.get("type") != "blob":
        continue
    path = entry.get("path", "")
    if not path.startswith(prefix):
        continue
    _, ext = os.path.splitext(path)
    if ext:
        extensions.add(ext)

⚠️ Potential issue | 🟠 Major

truncated flag from GitHub Git Trees API is not checked — silent data loss possible.

When a repository tree exceeds 100,000 entries the GitHub API sets "truncated": true and omits the rest. The code doesn't check this field, so extensions in the tail of a large tree are silently dropped without any warning.

🛡️ Proposed fix
         tree_data = fetch_json(tree_url, token=token)
         tree_entries = tree_data.get("tree") or []
+        if tree_data.get("truncated"):
+            print(
+                f"Warning: tree response truncated for {owner}/{repo}; "
+                "some extensions may be missing.",
+                file=sys.stderr,
+            )
         prefix = doc_path.rstrip("/") + "/"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@BoostLibraryList/collect_boost_libraries_extensions.py` around lines 155 -
167, The code that fetches a git tree (tree_url/tree_data/tree_entries) must
detect GitHub's "truncated" flag to avoid silent data loss: after calling
fetch_json and assigning tree_data, check tree_data.get("truncated") (or
"truncated" in tree_data) and if true, fail fast (raise a RuntimeError or call
processLogger.error and exit) with a clear message including the
tree_url/owner/repo/tree_sha so callers know the tree was incomplete; do not
silently continue processing tree_entries. Update the logic around
tree_data/tree_entries in collect_boost_libraries_extensions.py to perform this
check before iterating and ensure the error path surfaces to the caller instead
of returning partial extensions from extensions.add.
