
Conversation


@shivansh31414 shivansh31414 commented Nov 1, 2025

Title: Feat: Add SHA256 validation for downloads with --validate-sha flag
Closes #28

This PR adds a --validate-sha flag to validate file checksums after download, as discussed in the issue.

The flag supports three modes:

--validate-sha off: No validation.

--validate-sha warning: (Default) Prints a warning on mismatch.

--validate-sha error: Raises a ValueError on mismatch and stops execution.

This is implemented by passing the validation mode down to download_file, which now computes and compares the hash.

New tests are added in test_client.py to cover all three modes and a successful validation.
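
Roughly, the new tests follow this request-mock pattern (a self-contained illustration with a stand-in validator; the real tests exercise the client's download path and also mock the Databus metadata responses):

import hashlib

import pytest
import requests

FILE_CONTENT = b"hello databus"
GOOD_SHA = hashlib.sha256(FILE_CONTENT).hexdigest()


def fetch_and_validate(url: str, expected_sha256: str) -> bytes:
    # Stand-in for the client's download path: fetch, hash, compare.
    body = requests.get(url, timeout=30).content
    if hashlib.sha256(body).hexdigest() != expected_sha256:
        raise ValueError(f"SHA256 mismatch for {url}")
    return body


def test_mismatch_raises(requests_mock):
    # `requests_mock` is the pytest fixture provided by the requests-mock package.
    url = "https://databus.example.org/file.ttl"  # hypothetical URL
    requests_mock.get(url, content=FILE_CONTENT)
    with pytest.raises(ValueError):
        fetch_and_validate(url, "0" * 64)


def test_match_passes(requests_mock):
    url = "https://databus.example.org/file.ttl"
    requests_mock.get(url, content=FILE_CONTENT)
    assert fetch_and_validate(url, GOOD_SHA) == FILE_CONTENT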

Summary by CodeRabbit

  • New Features

    • Optional SHA256 validation for downloads with modes: OFF (skip), WARNING (notify), ERROR (fail); mode honored across all download paths.
    • Automated bearer token retrieval and retry for protected resources, improving access to authenticated downloads.
  • Bug Fixes / Improvements

    • More deterministic SHA256 computation, improved redirect handling and retry resilience, and better pairing of files with expected checksums.
  • Tests

    • New end-to-end style tests covering all SHA modes and download scenarios.


coderabbitai bot commented Nov 1, 2025

📝 Walkthrough

Adds ShaValidationMode and threads SHA256 validation through download flows; refactors handlers to return (url, sha) tuples; implements deterministic SHA256 computation, redirect and Vault-token retry logic in downloads; and replaces tests with end-to-end style cases covering ERROR/WARNING/OFF and successful validation.

Changes

  • Download core & validation (databusclient/client.py): Added ShaValidationMode enum and __compute_file_sha256; extended __download_file__ to accept expected_sha256 and validation_mode; implemented redirect handling, Vault token exchange retry on 401/Bearer, and SHA256 validation behavior (OFF/WARNING/ERROR). (A minimal sketch of the enum and hashing helper follows after this list.)
  • Metadata, queries & handlers (databusclient/client.py): Updated __query_sparql__ to use POST and improved formatting; refactored __handle_databus_file_query__ and __handle_databus_artifact_version__ to return List[Tuple[str, Optional[str]]] (download_url, sha); adjusted __get_databus_id_parts__ to return a 6-tuple.
  • Auth / token exchange (databusclient/client.py): Reworked __get_vault_access__ signature and logic to load the refresh token (env/file), derive the audience from the download URL, perform token refresh/exchange, and return the Vault token used for Bearer-auth retries.
  • Download orchestration & public API (databusclient/client.py): Changed __download_list__ and the public download() signatures to accept validation_mode and List[Tuple[str, Optional[str]]], and propagated validation_mode through all download call sites (artifact, collection, query paths).
  • Tests: download & SHA validation (tests/test_databusclient.py): Replaced previous tests with request-mock based end-to-end tests exercising download + SHA validation; new fixtures/constants for content and hashes; tests for ERROR/WARNING/OFF modes and a successful hash match.
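
For orientation, a minimal sketch of the enum and chunked hashing helper summarized above (the member values and chunk size are assumptions, not taken from the diff):

import hashlib
from enum import Enum


class ShaValidationMode(Enum):
    OFF = "off"          # skip validation entirely
    WARNING = "warning"  # print a warning on mismatch (default)
    ERROR = "error"      # raise on mismatch and abort the download


def __compute_file_sha256(path: str, chunk_size: int = 8192) -> str:
    # Hash in fixed-size chunks so large files never need to fit in memory,
    # which also keeps the result deterministic across runs.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()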

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant User
    participant Client as download()
    participant SPARQL as __query_sparql__/Metadata
    participant DLList as __download_list__
    participant DLFile as __download_file__
    participant Vault as __get_vault_access__
    participant HTTP as HTTP Server
    participant FS as Filesystem

    User->>Client: download(localDir, URIs, validation_mode=MODE)
    Client->>SPARQL: fetch metadata (JSON-LD / SPARQL)
    SPARQL-->>Client: List[(url, optional sha)]
    Client->>DLList: __download_list__(files, MODE)

    loop for each (url, sha)
        DLList->>DLFile: __download_file__(url, expected_sha=sha, MODE)
        DLFile->>HTTP: HTTP GET url
        alt 2xx
            HTTP-->>DLFile: file bytes
        else 3xx redirect or 401/Bearer
            HTTP-->>DLFile: redirect / 401
            DLFile->>Vault: __get_vault_access__(download_url, token_file, auth_url, client_id)
            Vault-->>DLFile: vault_token
            DLFile->>HTTP: HTTP GET url with Authorization: Bearer token
            HTTP-->>DLFile: file bytes
        end
        DLFile->>DLFile: compute SHA256 (if expected provided)
        alt expected SHA present and MODE == ERROR and mismatch
            DLFile-->>Client: raise Exception (abort)
        else expected SHA present and MODE == WARNING and mismatch
            DLFile-->>Client: emit warning, continue
        else MODE == OFF or hashes match
            DLFile->>FS: write file
        end
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Pay attention to:
    • SHA computation and branching in __download_file__ (ERROR vs WARNING vs OFF).
    • Correct propagation of (url, sha) tuples and validation_mode across all call sites.
    • Vault token-exchange flow and audience extraction in __get_vault_access__.
    • SPARQL/JSON-LD parsing changes in __handle_databus_file_query__ / __handle_databus_artifact_version__.
    • New tests (tests/test_databusclient.py) and their mocked responses matching production parsing.

Possibly related PRs

  • Download capabilities and docker image #12 — Modifies the same download and auth codepaths (__download_file__, __get_vault_access__, __download_list__, download() signatures); likely related refactor or precursor to SHA validation additions.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 68.18%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
  • Out of Scope Changes check (❓ Inconclusive): All changes are scoped to the SHA256 validation feature; however, the formatting/refactoring changes mentioned by the reviewer should be separated into different PRs. Resolution: clarify whether the formatting and line-ending changes are intentional or should be separated, and ensure only validation-related code changes remain in this PR.
✅ Passed checks (3 passed)
  • Description check (✅ Passed): Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check (✅ Passed): The PR title accurately describes the main change: adding SHA256 validation with multiple CLI modes (off, warning, error).
  • Linked Issues check (✅ Passed): The PR fully addresses issue #28: it validates SHA checksums against metadata and skips validation when the SHA is unavailable.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cfdca3b and 77cc01e.

📒 Files selected for processing (2)
  • databusclient/client.py (14 hunks)
  • tests/test_databusclient.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_databusclient.py (2)
databusclient/client.py (2)
  • download (739-865)
  • ShaValidationMode (31-35)
databusclient/cli.py (1)
  • download (46-57)
🪛 Ruff (0.14.2)
databusclient/client.py

443-443: Probable use of requests call without timeout

(S113)


505-505: Avoid specifying long messages outside the exception class

(TRY003)


525-525: Probable use of requests call without timeout

(S113)


547-547: Probable use of requests call without timeout

(S113)


699-699: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


700-700: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


701-701: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


705-705: Unpacked variable host is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

tests/test_databusclient.py

44-44: Redefinition of unused requests_mock from line 2

(F811)


60-60: Unused function argument: mock_print

(ARG001)


60-60: Redefinition of unused requests_mock from line 2

(F811)


88-88: Redefinition of unused requests_mock from line 2

(F811)


119-119: Redefinition of unused requests_mock from line 2

(F811)


151-151: Redefinition of unused requests_mock from line 2

(F811)

@shivansh31414 (Author)

Thanks for the detailed review! I’ll work on the following updates shortly:

Fixing the redirect handling in download_file to ensure proper SHA validation.

Separating unrelated formatting and refactoring changes into a separate PR to keep this one focused.

Improving docstring coverage to meet the required threshold.

Appreciate the heads-up on these — will follow up soon with the fixes!


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (4)
databusclient/client.py (4)

433-469: Redirect handling looks good, but add timeouts to prevent hanging.

The redirect handling correctly addresses the previous review concern by following redirects for both authenticated and non-authenticated requests.

However, the requests.get() calls at lines 444, 462, and 469 lack timeout parameters, which can cause the download to hang indefinitely if the server is unresponsive.

Apply this diff to add timeouts:

     response = requests.get(
-        url, stream=True, allow_redirects=False
+        url, stream=True, allow_redirects=False, timeout=30
     )  # no redirects here, we want to see if auth is required
         # --- 4. Retry with token ---
         # This request correctly allows redirects (default)
-        response = requests.get(url, headers=headers, stream=True)
+        response = requests.get(url, headers=headers, stream=True, timeout=30)
         print(f"Following redirect to {redirect_url}")
         # Make a new request that *does* follow any further redirects
-        response = requests.get(redirect_url, stream=True, allow_redirects=True)
+        response = requests.get(redirect_url, stream=True, allow_redirects=True, timeout=30)

534-564: Add timeouts to token exchange requests.

The token exchange POST requests at lines 534 and 556 lack timeout parameters, which can cause authentication to hang if the auth server is unresponsive.

Apply this diff:

     # 2. Refresh token -> access token
     resp = requests.post(
         auth_url,
         data={
             "client_id": client_id,
             "grant_type": "refresh_token",
             "refresh_token": refresh_token,
         },
+        timeout=30,
     )
     # 4. Access token -> Vault token
     resp = requests.post(
         auth_url,
         data={
             "client_id": client_id,
             "grant_type": "urn:ietf:params:oauth:grant-type:token-exchange",
             "subject_token": access_token,
             "audience": audience,
         },
+        timeout=30,
     )

591-617: Fix return type annotation for generator function.

The function uses yield (line 615), making it a generator, but the return type annotation declares List[Tuple[str, Optional[str]]]. While the code works at runtime (generators are iterable), the type annotation is misleading.

Apply this diff to correct the annotation:

+from typing import List, Dict, Tuple, Optional, Union, Iterator
+
 def __handle_databus_file_query__(
     endpoint_url, query
-) -> List[Tuple[str, Optional[str]]]:
+) -> Iterator[Tuple[str, Optional[str]]]:

704-738: Fix type annotations and unused variable.

Two issues per static analysis:

  1. Lines 707-709: Parameters vault_token_file, auth_url, and client_id have default None but their type annotations don't specify Optional. Modern Python style prefers str | None.
  2. Line 714: The unpacked variable host is never used. Use _ to indicate it's intentionally ignored.

Apply this diff:

 def __download_list__(
     files_to_download: List[Tuple[str, Optional[str]]],
     localDir: str,
-    vault_token_file: str = None,
-    auth_url: str = None,
-    client_id: str = None,
+    vault_token_file: Optional[str] = None,
+    auth_url: Optional[str] = None,
+    client_id: Optional[str] = None,
     validation_mode: ShaValidationMode = ShaValidationMode.WARNING,
 ) -> None:
     for url, expected_sha in files_to_download:
         if localDir is None:
-            host, account, group, artifact, version, file = __get_databus_id_parts__(
+            _, account, group, artifact, version, file = __get_databus_id_parts__(
                 url
             )
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 77cc01e and d73903e.

📒 Files selected for processing (1)
  • databusclient/client.py (16 hunks)
🧰 Additional context used
🪛 Ruff (0.14.2)
databusclient/client.py

444-444: Probable use of requests call without timeout

(S113)


462-462: Probable use of requests call without timeout

(S113)


469-469: Probable use of requests call without timeout

(S113)


514-514: Avoid specifying long messages outside the exception class

(TRY003)


534-534: Probable use of requests call without timeout

(S113)


556-556: Probable use of requests call without timeout

(S113)


707-707: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


708-708: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


709-709: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


714-714: Unpacked variable host is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

🔇 Additional comments (6)
databusclient/client.py (6)

31-37: LGTM!

The ShaValidationMode enum is well-designed with clear semantics for the three validation behaviors.


572-588: LGTM!

The explicit POST method setting is a good practice for SPARQL endpoints.


620-645: LGTM!

The function correctly extracts both download URLs and SHA256 sums from the artifact metadata, and the return type matches the implementation.


690-691: LGTM!

Explicitly specifying UTF-8 encoding is good practice for consistent hashing behavior.


741-747: LGTM!

The explicit return type annotation clarifies that the function always returns exactly six components, with None for missing parts.


750-876: LGTM!

The validation_mode parameter is consistently threaded through all download paths (collections, files, artifact versions, artifacts, and groups), ensuring uniform SHA validation behavior across the entire download flow.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🧹 Nitpick comments (1)
databusclient/client.py (1)

1-11: Remove redundant import.

Line 8 imports sha256 from hashlib, but hashlib is already imported at line 4. The sha256 import is only used once in the wsha256 function at line 690, where you can reference it as hashlib.sha256 instead.

Apply this diff:

 from enum import Enum
 from typing import List, Dict, Tuple, Optional, Union
 import requests
 import hashlib
 import json
 from tqdm import tqdm
 from SPARQLWrapper import SPARQLWrapper, JSON
-from hashlib import sha256
 import os
 import re

And update line 690:

 def wsha256(raw: str):
-    return sha256(raw.encode("utf-8")).hexdigest()
+    return hashlib.sha256(raw.encode("utf-8")).hexdigest()
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d73903e and 78c5de0.

📒 Files selected for processing (1)
  • databusclient/client.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
databusclient/client.py (1)
databusclient/cli.py (2)
  • deploy (30-36)
  • download (46-57)
🪛 Ruff (0.14.2)
databusclient/client.py

153-156: Avoid specifying long messages outside the exception class

(TRY003)


165-165: Probable use of requests call without timeout

(S113)


178-178: Local variable content_variant_part is assigned to but never used

Remove assignment to unused variable content_variant_part

(F841)


195-195: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


196-196: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


197-197: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


233-233: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


234-234: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


235-235: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


236-236: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


237-237: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


271-271: Unpacked variable account_name is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


271-271: Unpacked variable group_name is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


271-271: Unpacked variable artifact_name is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


291-294: Avoid specifying long messages outside the exception class

(TRY003)


398-398: Probable use of requests call without timeout

(S113)


406-406: Avoid specifying long messages outside the exception class

(TRY003)


440-440: Probable use of requests call without timeout

(S113)


453-453: Probable use of requests call without timeout

(S113)


463-463: Avoid specifying long messages outside the exception class

(TRY003)


471-471: Probable use of requests call without timeout

(S113)


478-478: Probable use of requests call without timeout

(S113)


487-487: Use raise without specifying exception name

Remove exception name

(TRY201)


513-513: Avoid specifying long messages outside the exception class

(TRY003)


526-526: Avoid specifying long messages outside the exception class

(TRY003)


533-533: Probable use of requests call without timeout

(S113)


555-555: Probable use of requests call without timeout

(S113)


663-663: Avoid specifying long messages outside the exception class

(TRY003)


695-695: Probable use of requests call without timeout

(S113)


700-700: Probable use of requests call without timeout

(S113)


706-706: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


707-707: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


708-708: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


713-713: Unpacked variable host is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


866-866: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (9)
databusclient/client.py (9)

15-21: LGTM!

The helper function correctly computes SHA256 in chunks, which is memory-efficient for large files. Good refactoring from the previous inline implementation.


40-45: LGTM!

The enum clearly defines the three validation modes as specified in the PR objectives.


489-497: LGTM!

The download logic with progress bar is correctly implemented.


499-510: LGTM!

The validation logic correctly implements all three modes (OFF, WARNING, ERROR) as specified in the PR objectives. The fix from the previous review (removing unreachable branch) is properly applied.
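
For reference, a minimal sketch of this branching; the names follow the walkthrough, but the exact message and surrounding variables (e.g. local_path) are assumptions:

    # Inside the download routine, after the file has been fetched (sketch).
    if expected_sha256 and validation_mode is not ShaValidationMode.OFF:
        actual_sha256 = __compute_file_sha256(local_path)
        if actual_sha256 != expected_sha256:
            msg = f"SHA256 mismatch for {url}: expected {expected_sha256}, got {actual_sha256}"
            if validation_mode is ShaValidationMode.ERROR:
                raise ValueError(msg)
            print(f"WARNING: {msg}")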


512-513: LGTM!

The size check provides an additional sanity check after SHA validation.


590-617: LGTM!

The function properly handles multiple variable name conventions for file URLs and checksums, with appropriate fallback behavior and helpful warnings.


619-644: LGTM!

The function correctly extracts both download URLs and SHA256 checksums from the JSON-LD metadata, properly returning tuples to support validation.


703-738: LGTM!

The function correctly propagates validation mode and SHA checksums through the download pipeline. The unused host variable at line 713 is a minor style issue but doesn't affect functionality.


749-875: LGTM!

The download() function correctly propagates the validation mode through all download paths (collections, files, artifact versions, artifacts, groups, and queries). The auto-detection of the SPARQL endpoint is a helpful UX improvement.

Comment on lines 164 to 171
def __load_file_stats(url: str) -> Tuple[str, int]:
    resp = requests.get(url)
    if resp.status_code > 400:
        raise requests.exceptions.RequestException(response=resp)

    sha256sum = hashlib.sha256(bytes(resp.content)).hexdigest()
    content_length = len(resp.content)
    return sha256sum, content_length

⚠️ Potential issue | 🟠 Major

Add timeout parameter.

The GET request in __load_file_stats lacks a timeout parameter.

Apply this diff:

 def __load_file_stats(url: str) -> Tuple[str, int]:
-    resp = requests.get(url)
+    resp = requests.get(url, timeout=30)
     if resp.status_code > 400:
         raise requests.exceptions.RequestException(response=resp)
🧰 Tools
🪛 Ruff (0.14.2)

165-165: Probable use of requests call without timeout

(S113)

🤖 Prompt for AI Agents
In databusclient/client.py around lines 164 to 171, the requests.get call in
__load_file_stats lacks a timeout; add a timeout parameter to the function
(e.g., def __load_file_stats(url: str, timeout: float = 10):) and pass it to
requests.get as requests.get(url, timeout=timeout), keeping the existing
status_code check and exception behavior so callers can override the timeout if
needed.

Comment on lines 440 to 488
    response = requests.head(url, stream=True)
    # Check for redirect and update URL if necessary
    if response.headers.get("Location") and response.status_code in [
        301,
        302,
        303,
        307,
        308,
    ]:
        url = response.headers.get("Location")
        print("Redirects url: ", url)

    # --- 2. Try direct GET ---
    response = requests.get(
        url, stream=True, allow_redirects=False
    )  # no redirects here, we want to see if auth is required
    www = response.headers.get(
        "WWW-Authenticate", ""
    )  # get WWW-Authenticate header if present to check for Bearer auth

    if response.status_code == 401 or "bearer" in www.lower():
        print(f"Authentication required for {url}")
        if not (vault_token_file):
            raise ValueError("Vault token file not given for protected download")

        # --- 3. Fetch Vault token ---
        vault_token = __get_vault_access__(url, vault_token_file, auth_url, client_id)
        headers = {"Authorization": f"Bearer {vault_token}"}

        # --- 4. Retry with token ---
        # This request correctly allows redirects (default)
        response = requests.get(url, headers=headers, stream=True)

    # Handle 3xx redirects for non-authed requests (e.g., S3 presigned URLs)
    elif response.is_redirect:
        redirect_url = response.headers.get("Location")
        print(f"Following redirect to {redirect_url}")
        # Make a new request that *does* follow any further redirects
        response = requests.get(redirect_url, stream=True, allow_redirects=True)

    try:
        response.raise_for_status()  # Raise if still failing
    except requests.exceptions.HTTPError as e:
        if response.status_code == 404:
            print(f"WARNING: Skipping file {url} because it was not found (404).")
            return
        else:
            raise e


⚠️ Potential issue | 🟠 Major

Add timeout parameter to all HTTP requests.

All requests calls in this function lack a timeout parameter, which can cause the application to hang indefinitely if the server doesn't respond. This is especially problematic for HEAD and GET requests that may be waiting on external services.

Apply this diff to add appropriate timeouts:

     # --- 1. Get redirect URL by requesting HEAD ---
-    response = requests.head(url, stream=True)
+    response = requests.head(url, stream=True, timeout=30)
     # Check for redirect and update URL if necessary
     if response.headers.get("Location") and response.status_code in [
         301,
         302,
         303,
         307,
         308,
     ]:
         url = response.headers.get("Location")
         print("Redirects url: ", url)
 
     # --- 2. Try direct GET ---
     response = requests.get(
-        url, stream=True, allow_redirects=False
+        url, stream=True, allow_redirects=False, timeout=30
     )  # no redirects here, we want to see if auth is required
     www = response.headers.get(
         "WWW-Authenticate", ""
     )  # get WWW-Authenticate header if present to check for Bearer auth
 
     if response.status_code == 401 or "bearer" in www.lower():
         print(f"Authentication required for {url}")
         if not (vault_token_file):
             raise ValueError("Vault token file not given for protected download")
 
         # --- 3. Fetch Vault token ---
         vault_token = __get_vault_access__(url, vault_token_file, auth_url, client_id)
         headers = {"Authorization": f"Bearer {vault_token}"}
 
         # --- 4. Retry with token ---
         # This request correctly allows redirects (default)
-        response = requests.get(url, headers=headers, stream=True)
+        response = requests.get(url, headers=headers, stream=True, timeout=30)
 
     # Handle 3xx redirects for non-authed requests (e.g., S3 presigned URLs)
     elif response.is_redirect:
         redirect_url = response.headers.get("Location")
         print(f"Following redirect to {redirect_url}")
         # Make a new request that *does* follow any further redirects
-        response = requests.get(redirect_url, stream=True, allow_redirects=True)
+        response = requests.get(redirect_url, stream=True, allow_redirects=True, timeout=30)
🧰 Tools
🪛 Ruff (0.14.2)

440-440: Probable use of requests call without timeout

(S113)


453-453: Probable use of requests call without timeout

(S113)


463-463: Avoid specifying long messages outside the exception class

(TRY003)


471-471: Probable use of requests call without timeout

(S113)


478-478: Probable use of requests call without timeout

(S113)


487-487: Use raise without specifying exception name

Remove exception name

(TRY201)

🤖 Prompt for AI Agents
In databusclient/client.py around lines 440 to 488, all requests calls (the
initial requests.head and every requests.get in this block) are missing a
timeout and can hang indefinitely; add a timeout parameter to each call (e.g.,
timeout=DEFAULT_TIMEOUT) and define DEFAULT_TIMEOUT at module level or accept a
timeout argument to the function (suggest 10 seconds as a sensible default),
making sure to pass the same timeout to requests.head(..., timeout=...) and to
all requests.get(...) calls (including the retry with headers and the
follow-redirect request). Optionally import requests.exceptions.Timeout where
used if you want to handle timeouts specially, but the immediate fix is to
supply the timeout argument consistently on every request in this block.

Comment on lines 516 to 568
def __get_vault_access__(
    download_url: str, token_file: str, auth_url: str, client_id: str
) -> str:
    """
    Get Vault access token for a protected databus download.
    """
    # 1. Load refresh token
    refresh_token = os.environ.get("REFRESH_TOKEN")
    if not refresh_token:
        if not os.path.exists(token_file):
            raise FileNotFoundError(f"Vault token file not found: {token_file}")
        with open(token_file, "r") as f:
            refresh_token = f.read().strip()
        if len(refresh_token) < 80:
            print(f"Warning: token from {token_file} is short (<80 chars)")

    # 2. Refresh token -> access token
    resp = requests.post(
        auth_url,
        data={
            "client_id": client_id,
            "grant_type": "refresh_token",
            "refresh_token": refresh_token,
        },
    )
    resp.raise_for_status()
    access_token = resp.json()["access_token"]

    # 3. Extract host as audience
    # Remove protocol prefix
    if download_url.startswith("https://"):
        host_part = download_url[len("https://") :]
    elif download_url.startswith("http://"):
        host_part = download_url[len("http://") :]
    else:
        host_part = download_url
    audience = host_part.split("/")[0]  # host is before first "/"

    # 4. Access token -> Vault token
    resp = requests.post(
        auth_url,
        data={
            "client_id": client_id,
            "grant_type": "urn:ietf:params:oauth:grant-type:token-exchange",
            "subject_token": access_token,
            "audience": audience,
        },
    )
    resp.raise_for_status()
    vault_token = resp.json()["access_token"]

    print(f"Using Vault access token for {download_url}")
    return vault_token

⚠️ Potential issue | 🟠 Major

Add timeout parameter to token exchange requests.

The POST requests for token exchange (lines 533 and 555) lack timeout parameters, which could cause the application to hang if the authentication server is unresponsive.

Apply this diff:

     # 2. Refresh token -> access token
     resp = requests.post(
         auth_url,
         data={
             "client_id": client_id,
             "grant_type": "refresh_token",
             "refresh_token": refresh_token,
         },
+        timeout=30,
     )
     resp.raise_for_status()
     access_token = resp.json()["access_token"]
 
     # 3. Extract host as audience
     # Remove protocol prefix
     if download_url.startswith("https://"):
         host_part = download_url[len("https://") :]
     elif download_url.startswith("http://"):
         host_part = download_url[len("http://") :]
     else:
         host_part = download_url
     audience = host_part.split("/")[0]  # host is before first "/"
 
     # 4. Access token -> Vault token
     resp = requests.post(
         auth_url,
         data={
             "client_id": client_id,
             "grant_type": "urn:ietf:params:oauth:grant-type:token-exchange",
             "subject_token": access_token,
             "audience": audience,
         },
+        timeout=30,
     )
🧰 Tools
🪛 Ruff (0.14.2)

526-526: Avoid specifying long messages outside the exception class

(TRY003)


533-533: Probable use of requests call without timeout

(S113)


555-555: Probable use of requests call without timeout

(S113)

🤖 Prompt for AI Agents
In databusclient/client.py around lines 516 to 568, the two requests.post calls
used for token refresh/exchange don't set a timeout and can hang; add a timeout
parameter (e.g. timeout=10) to both requests.post invocations (the one that
exchanges refresh_token -> access_token and the one that exchanges access_token
-> vault_token), and ensure any calling code can surface or handle
requests.Timeout by letting exceptions propagate or wrapping them where
appropriate.

Comment on lines 693 to 695
def __handle_databus_collection__(uri: str) -> str:
    headers = {"Accept": "text/sparql"}
    return requests.get(uri, headers=headers).text

⚠️ Potential issue | 🟠 Major

Add timeout parameter.

The GET request lacks a timeout parameter.

Apply this diff:

 def __handle_databus_collection__(uri: str) -> str:
     headers = {"Accept": "text/sparql"}
-    return requests.get(uri, headers=headers).text
+    return requests.get(uri, headers=headers, timeout=30).text
🧰 Tools
🪛 Ruff (0.14.2)

695-695: Probable use of requests call without timeout

(S113)

🤖 Prompt for AI Agents
In databusclient/client.py around lines 693 to 695, the GET request has no
timeout; modify the requests.get call to include a timeout (e.g.,
requests.get(uri, headers=headers, timeout=10).text), or use an
existing/configurable DEFAULT_TIMEOUT constant, and ensure the call passes the
timeout keyword so the request won't hang indefinitely.

Comment on lines 698 to 700
def __get_json_ld_from_databus__(uri: str) -> str:
    headers = {"Accept": "application/ld+json"}
    return requests.get(uri, headers=headers).text

⚠️ Potential issue | 🟠 Major

Add timeout parameter.

The GET request lacks a timeout parameter, which could cause hangs when fetching metadata from Databus.

Apply this diff:

 def __get_json_ld_from_databus__(uri: str) -> str:
     headers = {"Accept": "application/ld+json"}
-    return requests.get(uri, headers=headers).text
+    return requests.get(uri, headers=headers, timeout=30).text
🧰 Tools
🪛 Ruff (0.14.2)

700-700: Probable use of requests call without timeout

(S113)

🤖 Prompt for AI Agents
In databusclient/client.py around lines 698-700, the requests.get call in
__get_json_ld_from_databus__ lacks a timeout which can cause the process to
hang; modify the call to include a timeout parameter (e.g., timeout=10) or,
better, read a module-level or config value (e.g., DATABUS_REQUEST_TIMEOUT) and
pass it as timeout=request_timeout to requests.get, and ensure the function lets
request exceptions propagate or document that callers should handle timeouts.

@shivansh31414 (Author)

Just pushed a commit addressing all of CodeRabbit’s suggestions — moved the SHA256 helper to module level, removed the unreachable branch, and cleaned up the redundant import. Let me know if anything else needs tweaking. Ready for re-review!

@Integer-Ctrl (Contributor)

Hey! It seems the line endings changed (possibly to CRLF), so the diff shows every line as updated. Could you fix the line endings so I can review the actual changes? See diff
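
A common fix is to commit a .gitattributes file that normalizes line endings, roughly like this (a sketch; adjust the patterns to the repo's needs):

# .gitattributes: let git normalize text files and force LF for Python sources
* text=auto
*.py text eol=lf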


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
databusclient/client.py (2)

516-568: Token exchange requests still lack timeout parameters.

Past reviews flagged this as a major issue. If the authentication server is unresponsive, the application will hang indefinitely.

Apply this diff to add timeouts:

     # 2. Refresh token -> access token
     resp = requests.post(
         auth_url,
         data={
             "client_id": client_id,
             "grant_type": "refresh_token",
             "refresh_token": refresh_token,
         },
+        timeout=30,
     )
     resp.raise_for_status()
     access_token = resp.json()["access_token"]
 
     # 3. Extract host as audience
     # Remove protocol prefix
     if download_url.startswith("https://"):
         host_part = download_url[len("https://") :]
     elif download_url.startswith("http://"):
         host_part = download_url[len("http://") :]
     else:
         host_part = download_url
     audience = host_part.split("/")[0]  # host is before first "/"
 
     # 4. Access token -> Vault token
     resp = requests.post(
         auth_url,
         data={
             "client_id": client_id,
             "grant_type": "urn:ietf:params:oauth:grant-type:token-exchange",
             "subject_token": access_token,
             "audience": audience,
         },
+        timeout=30,
     )

440-488: All HTTP requests in this function still lack timeout parameters.

Without timeouts, the application can hang indefinitely on unresponsive servers—a critical production reliability issue.

Apply the provided diff to add timeout=30 to all four requests calls:

  • Line 440: requests.head(url, stream=True, timeout=30)
  • Line 453: requests.get(url, stream=True, allow_redirects=False, timeout=30)
  • Line 471: requests.get(url, headers=headers, stream=True, timeout=30)
  • Line 478: requests.get(redirect_url, stream=True, allow_redirects=True, timeout=30)
🧹 Nitpick comments (1)
databusclient/client.py (1)

703-737: Clean up unused variable and type hints.

The function correctly propagates validation parameters, but has two minor issues:

  1. Line 713: The host variable from __get_databus_id_parts__ is unpacked but never used.
  2. Lines 706-708: Type hints use implicit Optional which PEP 484 discourages.

Apply this diff:

 def __download_list__(
     files_to_download: List[Tuple[str, Optional[str]]],
     localDir: str,
-    vault_token_file: str = None,
-    auth_url: str = None,
-    client_id: str = None,
+    vault_token_file: Optional[str] = None,
+    auth_url: Optional[str] = None,
+    client_id: Optional[str] = None,
     validation_mode: ShaValidationMode = ShaValidationMode.WARNING,
 ) -> None:
     for url, expected_sha in files_to_download:
         if localDir is None:
-            host, account, group, artifact, version, file = __get_databus_id_parts__(
+            _, account, group, artifact, version, _ = __get_databus_id_parts__(
                 url
             )
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 78c5de0 and fe69957.

📒 Files selected for processing (1)
  • databusclient/client.py (17 hunks)
🧰 Additional context used
🪛 Ruff (0.14.4)
databusclient/client.py

453-453: Probable use of requests call without timeout

(S113)


471-471: Probable use of requests call without timeout

(S113)


478-478: Probable use of requests call without timeout

(S113)


513-513: Avoid specifying long messages outside the exception class

(TRY003)


533-533: Probable use of requests call without timeout

(S113)


555-555: Probable use of requests call without timeout

(S113)


706-706: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


707-707: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


708-708: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


713-713: Unpacked variable host is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

🔇 Additional comments (9)
databusclient/client.py (9)

15-21: LGTM! Clean SHA256 helper.

The module-level helper is well-structured, uses an appropriate chunk size, and will be deterministic for validation.


40-46: LGTM! Clear validation modes.

The enum provides the three requested modes with clear semantics.


413-433: LGTM! Signature is backwards compatible.

The new parameters have sensible defaults and are properly documented.


499-511: LGTM! Validation logic correctly implements the three modes.

The validation flow properly handles ERROR (raises), WARNING (prints and continues), and OFF (skips) modes as specified in the PR objectives.


571-587: LGTM! Using POST for SPARQL queries is best practice.

POST supports larger queries and is more semantic per the SPARQL protocol.


590-617: LGTM! Robust extraction of both URL and SHA from query results.

The function defensively tries multiple common variable names and correctly handles cases where SHA may not be present in the query results.


619-644: LGTM! Properly extracts both URL and SHA256 from artifact metadata.

The function correctly returns tuples of (url, sha) to enable downstream validation. The use of node.get("file") aligns with the comment about tracking access accurately.


740-746: LGTM! Return type now explicitly documents the 6-tuple structure.

The type hint improves clarity and type safety.


749-875: LGTM! Validation mode is correctly propagated through all download paths.

The dispatcher function properly routes different URI types (collection, file, artifact version, artifact, group, query) and consistently passes validation_mode to all downstream downloads.

@shivansh31414 (Author)

Thanks for following along on this PR. I’ve incorporated the SHA256 validation feature with flexible CLI modes, refactored the helper to module level, cleaned up unreachable branches, and normalized line endings with .gitattributes so the diffs are now clean. CodeRabbit’s latest review shows no actionable issues left — only minor nitpicks outside the diff range. Could you take a look and share your feedback on whether the changes meet our intended goals and if the flow feels clear from a reviewer’s perspective? If you spot anything that could be improved for readability or maintainability, I’d be happy to address it before merge.
Got a bit busy with my university exams; thanks for being patient.

@shivansh31414 (Author)

Do the current changes align with what you had in mind for the SHA validation feature? I want to be sure the flow feels clear from your perspective before I finalize.


@Integer-Ctrl Integer-Ctrl left a comment


Sorry for the late feedback. I tested it and it seems to work, but I've added some comments. The most important points are the formatting-related comments and the fact that there is currently no CLI option to toggle the SHA validation mode.

Comment on lines -543 to +644
-def __handle_databus_artifact_version__(json_str: str) -> List[str]:
+def __handle_databus_artifact_version__(
+    json_str: str,
+) -> List[Tuple[str, Optional[str]]]:
     """
-    Parse the JSON-LD of a databus artifact version to extract download URLs.
+    Parse the JSON-LD of a databus artifact version to extract download URLs and SHA256 sums.
     Don't get downloadURLs directly from the JSON-LD, but follow the "file" links to count access to databus accurately.
-    Returns a list of download URLs.
+    Returns a list of (download_url, sha256sum) tuples.
     """
 
-    databusIdUrl = []
+    databus_files = []
     json_dict = json.loads(json_str)
     graph = json_dict.get("@graph", [])
     for node in graph:
         if node.get("@type") == "Part":
-            id = node.get("file")
-            databusIdUrl.append(id)
-    return databusIdUrl
+            # Use the 'file' link as per the original comment
+            url = node.get("file")
+            if not url:
+                continue
+
+            # Extract the sha256sum from the same node
+            # This key is used in your create_dataset function
+            sha = node.get("sha256sum")
+
+            databus_files.append((url, sha))
+    return databus_files
@Integer-Ctrl (Contributor)

You should handle resolving the sha256 sum from databus inside the __download_file method. Otherwise, if only a single file is downloaded, its SHA sum won’t be verified.
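
For instance, a hypothetical helper along these lines could be called from __download_file when no expected checksum was passed in (a sketch; the helper name and lookup are assumptions modeled on the JSON-LD handling elsewhere in this PR):

from typing import Optional
import json

import requests


def resolve_expected_sha(file_uri: str) -> Optional[str]:
    # Hypothetical: fetch the file's JSON-LD metadata from Databus and pull
    # its sha256sum, mirroring the node.get("sha256sum") lookup above.
    resp = requests.get(file_uri, headers={"Accept": "application/ld+json"}, timeout=30)
    if not resp.ok:
        return None
    for node in json.loads(resp.text).get("@graph", []):
        if node.get("sha256sum"):
            return node["sha256sum"]
    return None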

@Integer-Ctrl (Contributor)

You touched many lines of code just by changing the formatting. Please avoid reformatting code sections where you didn't change the actual logic; it makes the review process unnecessarily long. Also, changes like replacing " with ' (or vice versa) shouldn't be included in a feature PR.

@Integer-Ctrl (Contributor)

You touched many lines of code just by deleting them. Please avoid removing unused code in a feature PR. Yes, this repo/file is messy and a lot of the code may be unused or outdated, but cleanup shouldn't be part of this PR (unless it is a refactored method or similar); it makes the review process unnecessarily long.

    vault_token_file: str = None,
    auth_url: str = None,
    client_id: str = None,
    validation_mode: ShaValidationMode = ShaValidationMode.WARNING,
@Integer-Ctrl (Contributor)

There is currently no CLI option to specify the ShaValidationMode.
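
For example, assuming the CLI is built with click (databusclient/cli.py defines a download command) and that the enum members carry the string values off/warning/error, the flag could be wired up roughly like this (a sketch, not the actual code):

import click

from databusclient.client import ShaValidationMode


@click.command()
@click.argument("localdir")
@click.option(
    "--validate-sha",
    type=click.Choice([m.value for m in ShaValidationMode]),
    default=ShaValidationMode.WARNING.value,
    show_default=True,
    help="SHA256 validation mode: off, warning, or error.",
)
def download(localdir, validate_sha):
    # Convert the string flag back to the enum before calling into the client.
    mode = ShaValidationMode(validate_sha)
    # ... pass mode (and the other options) to client.download(...)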

    auth_url=auth_url,
    client_id=client_id,
    expected_sha256=expected_sha,  # <-- Pass the SHA hash here
    validation_mode=validation_mode,  # <-- Pass the validation mode here
@Integer-Ctrl (Contributor)

Use a global variable for the validation mode instead of passing the option into every function call.
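
Something like this module-level default in client.py would do it (a sketch):

# One module-level setting instead of threading validation_mode through
# every signature (sketch; lives next to the ShaValidationMode enum).
VALIDATION_MODE = ShaValidationMode.WARNING


def set_validation_mode(mode: ShaValidationMode) -> None:
    global VALIDATION_MODE
    VALIDATION_MODE = mode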

@Integer-Ctrl Integer-Ctrl self-assigned this Nov 24, 2025
@Integer-Ctrl (Contributor)

@shivansh31414 Do you want to continue working on this?

@shivansh31414 (Author)

I think you can assign it to someone else; I am occupied with my college work.

Development: successfully merging this pull request may close issue #28 (SHA checksum validation).