
Conversation


@shivansh31414 shivansh31414 commented Nov 1, 2025

Title: Feat: Add SHA256 validation for downloads with --validate-sha flag
Closes #28

This PR adds a --validate-sha flag to validate file checksums after download, as discussed in the issue.

The flag supports three modes:

--validate-sha off: No validation.

--validate-sha warning: (Default) Prints a warning on mismatch.

--validate-sha error: Raises a ValueError on mismatch and stops execution.

This is implemented by passing the validation mode down to download_file, which now computes and compares the hash.

New tests are added in test_client.py to cover all three modes and a successful validation.
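
Roughly, the new tests follow this request-mock pattern (a self-contained illustration with a stand-in validator; the real tests exercise the client's download path and also mock the Databus metadata responses):

import hashlib

import pytest
import requests

FILE_CONTENT = b"hello databus"
GOOD_SHA = hashlib.sha256(FILE_CONTENT).hexdigest()


def fetch_and_validate(url: str, expected_sha256: str) -> bytes:
    # Stand-in for the client's download path: fetch, hash, compare.
    body = requests.get(url, timeout=30).content
    if hashlib.sha256(body).hexdigest() != expected_sha256:
        raise ValueError(f"SHA256 mismatch for {url}")
    return body


def test_mismatch_raises(requests_mock):
    # `requests_mock` is the pytest fixture provided by the requests-mock package.
    url = "https://databus.example.org/file.ttl"  # hypothetical URL
    requests_mock.get(url, content=FILE_CONTENT)
    with pytest.raises(ValueError):
        fetch_and_validate(url, "0" * 64)


def test_match_passes(requests_mock):
    url = "https://databus.example.org/file.ttl"
    requests_mock.get(url, content=FILE_CONTENT)
    assert fetch_and_validate(url, GOOD_SHA) == FILE_CONTENT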

Summary by CodeRabbit

  • New Features

    • Optional SHA256 validation for downloads with modes: OFF (skip), WARNING (notify), ERROR (fail); mode honored across all download paths.
    • Automated bearer token retrieval and retry for protected resources, improving access to authenticated downloads.
  • Bug Fixes / Improvements

    • More deterministic SHA256 computation, improved redirect handling and retry resilience, and better pairing of files with expected checksums.
  • Tests

    • New end-to-end style tests covering all SHA modes and download scenarios.


coderabbitai bot commented Nov 1, 2025

📝 Walkthrough

Adds ShaValidationMode and threads SHA256 validation through download flows; refactors handlers to return (url, sha) tuples; implements deterministic SHA256 computation, redirect and Vault-token retry logic in downloads; and replaces tests with end-to-end style cases covering ERROR/WARNING/OFF and successful validation.

Changes

  • Download core & validation (databusclient/client.py): Added ShaValidationMode enum and __compute_file_sha256; extended __download_file__ to accept expected_sha256 and validation_mode; implemented redirect handling, Vault token exchange retry on 401/Bearer, and SHA256 validation behavior (OFF/WARNING/ERROR). (A minimal sketch of the enum and hashing helper follows after this list.)
  • Metadata, queries & handlers (databusclient/client.py): Updated __query_sparql__ to use POST and improved formatting; refactored __handle_databus_file_query__ and __handle_databus_artifact_version__ to return List[Tuple[str, Optional[str]]] (download_url, sha); adjusted __get_databus_id_parts__ to return a 6-tuple.
  • Auth / token exchange (databusclient/client.py): Reworked __get_vault_access__ signature and logic to load the refresh token (env/file), derive the audience from the download URL, perform token refresh/exchange, and return the Vault token used for Bearer-auth retries.
  • Download orchestration & public API (databusclient/client.py): Changed __download_list__ and the public download() signatures to accept validation_mode and List[Tuple[str, Optional[str]]], and propagated validation_mode through all download call sites (artifact, collection, query paths).
  • Tests: download & SHA validation (tests/test_databusclient.py): Replaced previous tests with request-mock based end-to-end tests exercising download + SHA validation; new fixtures/constants for content and hashes; tests for ERROR/WARNING/OFF modes and a successful hash match.
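
For orientation, a minimal sketch of the enum and chunked hashing helper summarized above (the member values and chunk size are assumptions, not taken from the diff):

import hashlib
from enum import Enum


class ShaValidationMode(Enum):
    OFF = "off"          # skip validation entirely
    WARNING = "warning"  # print a warning on mismatch (default)
    ERROR = "error"      # raise on mismatch and abort the download


def __compute_file_sha256(path: str, chunk_size: int = 8192) -> str:
    # Hash in fixed-size chunks so large files never need to fit in memory,
    # which also keeps the result deterministic across runs.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()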

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant User
    participant Client as download()
    participant SPARQL as __query_sparql__/Metadata
    participant DLList as __download_list__
    participant DLFile as __download_file__
    participant Vault as __get_vault_access__
    participant HTTP as HTTP Server
    participant FS as Filesystem

    User->>Client: download(localDir, URIs, validation_mode=MODE)
    Client->>SPARQL: fetch metadata (JSON-LD / SPARQL)
    SPARQL-->>Client: List[(url, optional sha)]
    Client->>DLList: __download_list__(files, MODE)

    loop for each (url, sha)
        DLList->>DLFile: __download_file__(url, expected_sha=sha, MODE)
        DLFile->>HTTP: HTTP GET url
        alt 2xx
            HTTP-->>DLFile: file bytes
        else 3xx redirect or 401/Bearer
            HTTP-->>DLFile: redirect / 401
            DLFile->>Vault: __get_vault_access__(download_url, token_file, auth_url, client_id)
            Vault-->>DLFile: vault_token
            DLFile->>HTTP: HTTP GET url with Authorization: Bearer token
            HTTP-->>DLFile: file bytes
        end
        DLFile->>DLFile: compute SHA256 (if expected provided)
        alt expected SHA present and MODE == ERROR and mismatch
            DLFile-->>Client: raise Exception (abort)
        else expected SHA present and MODE == WARNING and mismatch
            DLFile-->>Client: emit warning, continue
        else MODE == OFF or hashes match
            DLFile->>FS: write file
        end
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Pay attention to:
    • SHA computation and branching in __download_file__ (ERROR vs WARNING vs OFF).
    • Correct propagation of (url, sha) tuples and validation_mode across all call sites.
    • Vault token-exchange flow and audience extraction in __get_vault_access__.
    • SPARQL/JSON-LD parsing changes in __handle_databus_file_query__ / __handle_databus_artifact_version__.
    • New tests (tests/test_databusclient.py) and their mocked responses matching production parsing.

Possibly related PRs

  • Download capabilities and docker image #12 — Modifies the same download and auth codepaths (__download_file__, __get_vault_access__, __download_list__, download() signatures); likely related refactor or precursor to SHA validation additions.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 68.18%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
  • Out of Scope Changes check (❓ Inconclusive): All changes are scoped to the SHA256 validation feature; however, the formatting/refactoring changes mentioned by the reviewer should be separated into different PRs. Resolution: clarify whether the formatting and line-ending changes are intentional or should be separated, and ensure only validation-related code changes remain in this PR.
✅ Passed checks (3 passed)
  • Description check (✅ Passed): Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check (✅ Passed): The PR title accurately describes the main change: adding SHA256 validation with multiple CLI modes (off, warning, error).
  • Linked Issues check (✅ Passed): The PR fully addresses issue #28: it validates SHA checksums against metadata and skips validation when the SHA is unavailable.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cfdca3b and 77cc01e.

📒 Files selected for processing (2)
  • databusclient/client.py (14 hunks)
  • tests/test_databusclient.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_databusclient.py (2)
databusclient/client.py (2)
  • download (739-865)
  • ShaValidationMode (31-35)
databusclient/cli.py (1)
  • download (46-57)
🪛 Ruff (0.14.2)
databusclient/client.py

443-443: Probable use of requests call without timeout

(S113)


505-505: Avoid specifying long messages outside the exception class

(TRY003)


525-525: Probable use of requests call without timeout

(S113)


547-547: Probable use of requests call without timeout

(S113)


699-699: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


700-700: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


701-701: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


705-705: Unpacked variable host is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

tests/test_databusclient.py

44-44: Redefinition of unused requests_mock from line 2

(F811)


60-60: Unused function argument: mock_print

(ARG001)


60-60: Redefinition of unused requests_mock from line 2

(F811)


88-88: Redefinition of unused requests_mock from line 2

(F811)


119-119: Redefinition of unused requests_mock from line 2

(F811)


151-151: Redefinition of unused requests_mock from line 2

(F811)

@shivansh31414 (Author)

Thanks for the detailed review! I’ll work on the following updates shortly:

Fixing the redirect handling in download_file to ensure proper SHA validation.

Separating unrelated formatting and refactoring changes into a separate PR to keep this one focused.

Improving docstring coverage to meet the required threshold.

Appreciate the heads-up on these — will follow up soon with the fixes!


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (4)
databusclient/client.py (4)

433-469: Redirect handling looks good, but add timeouts to prevent hanging.

The redirect handling correctly addresses the previous review concern by following redirects for both authenticated and non-authenticated requests.

However, the requests.get() calls at lines 444, 462, and 469 lack timeout parameters, which can cause the download to hang indefinitely if the server is unresponsive.

Apply this diff to add timeouts:

     response = requests.get(
-        url, stream=True, allow_redirects=False
+        url, stream=True, allow_redirects=False, timeout=30
     )  # no redirects here, we want to see if auth is required
         # --- 4. Retry with token ---
         # This request correctly allows redirects (default)
-        response = requests.get(url, headers=headers, stream=True)
+        response = requests.get(url, headers=headers, stream=True, timeout=30)
         print(f"Following redirect to {redirect_url}")
         # Make a new request that *does* follow any further redirects
-        response = requests.get(redirect_url, stream=True, allow_redirects=True)
+        response = requests.get(redirect_url, stream=True, allow_redirects=True, timeout=30)

534-564: Add timeouts to token exchange requests.

The token exchange POST requests at lines 534 and 556 lack timeout parameters, which can cause authentication to hang if the auth server is unresponsive.

Apply this diff:

     # 2. Refresh token -> access token
     resp = requests.post(
         auth_url,
         data={
             "client_id": client_id,
             "grant_type": "refresh_token",
             "refresh_token": refresh_token,
         },
+        timeout=30,
     )
     # 4. Access token -> Vault token
     resp = requests.post(
         auth_url,
         data={
             "client_id": client_id,
             "grant_type": "urn:ietf:params:oauth:grant-type:token-exchange",
             "subject_token": access_token,
             "audience": audience,
         },
+        timeout=30,
     )

591-617: Fix return type annotation for generator function.

The function uses yield (line 615), making it a generator, but the return type annotation declares List[Tuple[str, Optional[str]]]. While the code works at runtime (generators are iterable), the type annotation is misleading.

Apply this diff to correct the annotation:

+from typing import List, Dict, Tuple, Optional, Union, Iterator
+
 def __handle_databus_file_query__(
     endpoint_url, query
-) -> List[Tuple[str, Optional[str]]]:
+) -> Iterator[Tuple[str, Optional[str]]]:

704-738: Fix type annotations and unused variable.

Two issues per static analysis:

  1. Lines 707-709: Parameters vault_token_file, auth_url, and client_id have default None but their type annotations don't specify Optional. Modern Python style prefers str | None.
  2. Line 714: The unpacked variable host is never used. Use _ to indicate it's intentionally ignored.

Apply this diff:

 def __download_list__(
     files_to_download: List[Tuple[str, Optional[str]]],
     localDir: str,
-    vault_token_file: str = None,
-    auth_url: str = None,
-    client_id: str = None,
+    vault_token_file: Optional[str] = None,
+    auth_url: Optional[str] = None,
+    client_id: Optional[str] = None,
     validation_mode: ShaValidationMode = ShaValidationMode.WARNING,
 ) -> None:
     for url, expected_sha in files_to_download:
         if localDir is None:
-            host, account, group, artifact, version, file = __get_databus_id_parts__(
+            _, account, group, artifact, version, file = __get_databus_id_parts__(
                 url
             )
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 77cc01e and d73903e.

📒 Files selected for processing (1)
  • databusclient/client.py (16 hunks)
🧰 Additional context used
🪛 Ruff (0.14.2)
databusclient/client.py

444-444: Probable use of requests call without timeout

(S113)


462-462: Probable use of requests call without timeout

(S113)


469-469: Probable use of requests call without timeout

(S113)


514-514: Avoid specifying long messages outside the exception class

(TRY003)


534-534: Probable use of requests call without timeout

(S113)


556-556: Probable use of requests call without timeout

(S113)


707-707: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


708-708: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


709-709: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


714-714: Unpacked variable host is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

🔇 Additional comments (6)
databusclient/client.py (6)

31-37: LGTM!

The ShaValidationMode enum is well-designed with clear semantics for the three validation behaviors.


572-588: LGTM!

The explicit POST method setting is a good practice for SPARQL endpoints.


620-645: LGTM!

The function correctly extracts both download URLs and SHA256 sums from the artifact metadata, and the return type matches the implementation.


690-691: LGTM!

Explicitly specifying UTF-8 encoding is good practice for consistent hashing behavior.


741-747: LGTM!

The explicit return type annotation clarifies that the function always returns exactly six components, with None for missing parts.


750-876: LGTM!

The validation_mode parameter is consistently threaded through all download paths (collections, files, artifact versions, artifacts, and groups), ensuring uniform SHA validation behavior across the entire download flow.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🧹 Nitpick comments (1)
databusclient/client.py (1)

1-11: Remove redundant import.

Line 8 imports sha256 from hashlib, but hashlib is already imported at line 4. The sha256 import is only used once in the wsha256 function at line 690, where you can reference it as hashlib.sha256 instead.

Apply this diff:

 from enum import Enum
 from typing import List, Dict, Tuple, Optional, Union
 import requests
 import hashlib
 import json
 from tqdm import tqdm
 from SPARQLWrapper import SPARQLWrapper, JSON
-from hashlib import sha256
 import os
 import re

And update line 690:

 def wsha256(raw: str):
-    return sha256(raw.encode("utf-8")).hexdigest()
+    return hashlib.sha256(raw.encode("utf-8")).hexdigest()
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d73903e and 78c5de0.

📒 Files selected for processing (1)
  • databusclient/client.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
databusclient/client.py (1)
databusclient/cli.py (2)
  • deploy (30-36)
  • download (46-57)
🪛 Ruff (0.14.2)
databusclient/client.py

153-156: Avoid specifying long messages outside the exception class

(TRY003)


165-165: Probable use of requests call without timeout

(S113)


178-178: Local variable content_variant_part is assigned to but never used

Remove assignment to unused variable content_variant_part

(F841)


195-195: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


196-196: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


197-197: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


233-233: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


234-234: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


235-235: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


236-236: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


237-237: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


271-271: Unpacked variable account_name is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


271-271: Unpacked variable group_name is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


271-271: Unpacked variable artifact_name is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


291-294: Avoid specifying long messages outside the exception class

(TRY003)


398-398: Probable use of requests call without timeout

(S113)


406-406: Avoid specifying long messages outside the exception class

(TRY003)


440-440: Probable use of requests call without timeout

(S113)


453-453: Probable use of requests call without timeout

(S113)


463-463: Avoid specifying long messages outside the exception class

(TRY003)


471-471: Probable use of requests call without timeout

(S113)


478-478: Probable use of requests call without timeout

(S113)


487-487: Use raise without specifying exception name

Remove exception name

(TRY201)


513-513: Avoid specifying long messages outside the exception class

(TRY003)


526-526: Avoid specifying long messages outside the exception class

(TRY003)


533-533: Probable use of requests call without timeout

(S113)


555-555: Probable use of requests call without timeout

(S113)


663-663: Avoid specifying long messages outside the exception class

(TRY003)


695-695: Probable use of requests call without timeout

(S113)


700-700: Probable use of requests call without timeout

(S113)


706-706: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


707-707: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


708-708: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


713-713: Unpacked variable host is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


866-866: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (9)
databusclient/client.py (9)

15-21: LGTM!

The helper function correctly computes SHA256 in chunks, which is memory-efficient for large files. Good refactoring from the previous inline implementation.


40-45: LGTM!

The enum clearly defines the three validation modes as specified in the PR objectives.


489-497: LGTM!

The download logic with progress bar is correctly implemented.


499-510: LGTM!

The validation logic correctly implements all three modes (OFF, WARNING, ERROR) as specified in the PR objectives. The fix from the previous review (removing unreachable branch) is properly applied.
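
For reference, a minimal sketch of this branching; the names follow the walkthrough, but the exact message and surrounding variables (e.g. local_path) are assumptions:

    # Inside the download routine, after the file has been fetched (sketch).
    if expected_sha256 and validation_mode is not ShaValidationMode.OFF:
        actual_sha256 = __compute_file_sha256(local_path)
        if actual_sha256 != expected_sha256:
            msg = f"SHA256 mismatch for {url}: expected {expected_sha256}, got {actual_sha256}"
            if validation_mode is ShaValidationMode.ERROR:
                raise ValueError(msg)
            print(f"WARNING: {msg}")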


512-513: LGTM!

The size check provides an additional sanity check after SHA validation.


590-617: LGTM!

The function properly handles multiple variable name conventions for file URLs and checksums, with appropriate fallback behavior and helpful warnings.


619-644: LGTM!

The function correctly extracts both download URLs and SHA256 checksums from the JSON-LD metadata, properly returning tuples to support validation.


703-738: LGTM!

The function correctly propagates validation mode and SHA checksums through the download pipeline. The unused host variable at line 713 is a minor style issue but doesn't affect functionality.


749-875: LGTM!

The download() function correctly propagates the validation mode through all download paths (collections, files, artifact versions, artifacts, groups, and queries). The auto-detection of the SPARQL endpoint is a helpful UX improvement.

Comment on lines 164 to 171
def __load_file_stats(url: str) -> Tuple[str, int]:
    resp = requests.get(url)
    if resp.status_code > 400:
        raise requests.exceptions.RequestException(response=resp)

    sha256sum = hashlib.sha256(bytes(resp.content)).hexdigest()
    content_length = len(resp.content)
    return sha256sum, content_length

⚠️ Potential issue | 🟠 Major

Add timeout parameter.

The GET request in __load_file_stats lacks a timeout parameter.

Apply this diff:

 def __load_file_stats(url: str) -> Tuple[str, int]:
-    resp = requests.get(url)
+    resp = requests.get(url, timeout=30)
     if resp.status_code > 400:
         raise requests.exceptions.RequestException(response=resp)
🧰 Tools
🪛 Ruff (0.14.2)

165-165: Probable use of requests call without timeout

(S113)

🤖 Prompt for AI Agents
In databusclient/client.py around lines 164 to 171, the requests.get call in
__load_file_stats lacks a timeout; add a timeout parameter to the function
(e.g., def __load_file_stats(url: str, timeout: float = 10):) and pass it to
requests.get as requests.get(url, timeout=timeout), keeping the existing
status_code check and exception behavior so callers can override the timeout if
needed.

Comment on lines 440 to 488
    response = requests.head(url, stream=True)
    # Check for redirect and update URL if necessary
    if response.headers.get("Location") and response.status_code in [
        301,
        302,
        303,
        307,
        308,
    ]:
        url = response.headers.get("Location")
        print("Redirects url: ", url)

    # --- 2. Try direct GET ---
    response = requests.get(
        url, stream=True, allow_redirects=False
    )  # no redirects here, we want to see if auth is required
    www = response.headers.get(
        "WWW-Authenticate", ""
    )  # get WWW-Authenticate header if present to check for Bearer auth

    if response.status_code == 401 or "bearer" in www.lower():
        print(f"Authentication required for {url}")
        if not (vault_token_file):
            raise ValueError("Vault token file not given for protected download")

        # --- 3. Fetch Vault token ---
        vault_token = __get_vault_access__(url, vault_token_file, auth_url, client_id)
        headers = {"Authorization": f"Bearer {vault_token}"}

        # --- 4. Retry with token ---
        # This request correctly allows redirects (default)
        response = requests.get(url, headers=headers, stream=True)

    # Handle 3xx redirects for non-authed requests (e.g., S3 presigned URLs)
    elif response.is_redirect:
        redirect_url = response.headers.get("Location")
        print(f"Following redirect to {redirect_url}")
        # Make a new request that *does* follow any further redirects
        response = requests.get(redirect_url, stream=True, allow_redirects=True)

    try:
        response.raise_for_status()  # Raise if still failing
    except requests.exceptions.HTTPError as e:
        if response.status_code == 404:
            print(f"WARNING: Skipping file {url} because it was not found (404).")
            return
        else:
            raise e


⚠️ Potential issue | 🟠 Major

Add timeout parameter to all HTTP requests.

All requests calls in this function lack a timeout parameter, which can cause the application to hang indefinitely if the server doesn't respond. This is especially problematic for HEAD and GET requests that may be waiting on external services.

Apply this diff to add appropriate timeouts:

     # --- 1. Get redirect URL by requesting HEAD ---
-    response = requests.head(url, stream=True)
+    response = requests.head(url, stream=True, timeout=30)
     # Check for redirect and update URL if necessary
     if response.headers.get("Location") and response.status_code in [
         301,
         302,
         303,
         307,
         308,
     ]:
         url = response.headers.get("Location")
         print("Redirects url: ", url)
 
     # --- 2. Try direct GET ---
     response = requests.get(
-        url, stream=True, allow_redirects=False
+        url, stream=True, allow_redirects=False, timeout=30
     )  # no redirects here, we want to see if auth is required
     www = response.headers.get(
         "WWW-Authenticate", ""
     )  # get WWW-Authenticate header if present to check for Bearer auth
 
     if response.status_code == 401 or "bearer" in www.lower():
         print(f"Authentication required for {url}")
         if not (vault_token_file):
             raise ValueError("Vault token file not given for protected download")
 
         # --- 3. Fetch Vault token ---
         vault_token = __get_vault_access__(url, vault_token_file, auth_url, client_id)
         headers = {"Authorization": f"Bearer {vault_token}"}
 
         # --- 4. Retry with token ---
         # This request correctly allows redirects (default)
-        response = requests.get(url, headers=headers, stream=True)
+        response = requests.get(url, headers=headers, stream=True, timeout=30)
 
     # Handle 3xx redirects for non-authed requests (e.g., S3 presigned URLs)
     elif response.is_redirect:
         redirect_url = response.headers.get("Location")
         print(f"Following redirect to {redirect_url}")
         # Make a new request that *does* follow any further redirects
-        response = requests.get(redirect_url, stream=True, allow_redirects=True)
+        response = requests.get(redirect_url, stream=True, allow_redirects=True, timeout=30)
🧰 Tools
🪛 Ruff (0.14.2)

440-440: Probable use of requests call without timeout

(S113)


453-453: Probable use of requests call without timeout

(S113)


463-463: Avoid specifying long messages outside the exception class

(TRY003)


471-471: Probable use of requests call without timeout

(S113)


478-478: Probable use of requests call without timeout

(S113)


487-487: Use raise without specifying exception name

Remove exception name

(TRY201)

🤖 Prompt for AI Agents
In databusclient/client.py around lines 440 to 488, all requests calls (the
initial requests.head and every requests.get in this block) are missing a
timeout and can hang indefinitely; add a timeout parameter to each call (e.g.,
timeout=DEFAULT_TIMEOUT) and define DEFAULT_TIMEOUT at module level or accept a
timeout argument to the function (suggest 10 seconds as a sensible default),
making sure to pass the same timeout to requests.head(..., timeout=...) and to
all requests.get(...) calls (including the retry with headers and the
follow-redirect request). Optionally import requests.exceptions.Timeout where
used if you want to handle timeouts specially, but the immediate fix is to
supply the timeout argument consistently on every request in this block.

Comment on lines 516 to 568
def __get_vault_access__(
    download_url: str, token_file: str, auth_url: str, client_id: str
) -> str:
    """
    Get Vault access token for a protected databus download.
    """
    # 1. Load refresh token
    refresh_token = os.environ.get("REFRESH_TOKEN")
    if not refresh_token:
        if not os.path.exists(token_file):
            raise FileNotFoundError(f"Vault token file not found: {token_file}")
        with open(token_file, "r") as f:
            refresh_token = f.read().strip()
        if len(refresh_token) < 80:
            print(f"Warning: token from {token_file} is short (<80 chars)")

    # 2. Refresh token -> access token
    resp = requests.post(
        auth_url,
        data={
            "client_id": client_id,
            "grant_type": "refresh_token",
            "refresh_token": refresh_token,
        },
    )
    resp.raise_for_status()
    access_token = resp.json()["access_token"]

    # 3. Extract host as audience
    # Remove protocol prefix
    if download_url.startswith("https://"):
        host_part = download_url[len("https://") :]
    elif download_url.startswith("http://"):
        host_part = download_url[len("http://") :]
    else:
        host_part = download_url
    audience = host_part.split("/")[0]  # host is before first "/"

    # 4. Access token -> Vault token
    resp = requests.post(
        auth_url,
        data={
            "client_id": client_id,
            "grant_type": "urn:ietf:params:oauth:grant-type:token-exchange",
            "subject_token": access_token,
            "audience": audience,
        },
    )
    resp.raise_for_status()
    vault_token = resp.json()["access_token"]

    print(f"Using Vault access token for {download_url}")
    return vault_token

⚠️ Potential issue | 🟠 Major

Add timeout parameter to token exchange requests.

The POST requests for token exchange (lines 533 and 555) lack timeout parameters, which could cause the application to hang if the authentication server is unresponsive.

Apply this diff:

     # 2. Refresh token -> access token
     resp = requests.post(
         auth_url,
         data={
             "client_id": client_id,
             "grant_type": "refresh_token",
             "refresh_token": refresh_token,
         },
+        timeout=30,
     )
     resp.raise_for_status()
     access_token = resp.json()["access_token"]
 
     # 3. Extract host as audience
     # Remove protocol prefix
     if download_url.startswith("https://"):
         host_part = download_url[len("https://") :]
     elif download_url.startswith("http://"):
         host_part = download_url[len("http://") :]
     else:
         host_part = download_url
     audience = host_part.split("/")[0]  # host is before first "/"
 
     # 4. Access token -> Vault token
     resp = requests.post(
         auth_url,
         data={
             "client_id": client_id,
             "grant_type": "urn:ietf:params:oauth:grant-type:token-exchange",
             "subject_token": access_token,
             "audience": audience,
         },
+        timeout=30,
     )
🧰 Tools
🪛 Ruff (0.14.2)

526-526: Avoid specifying long messages outside the exception class

(TRY003)


533-533: Probable use of requests call without timeout

(S113)


555-555: Probable use of requests call without timeout

(S113)

🤖 Prompt for AI Agents
In databusclient/client.py around lines 516 to 568, the two requests.post calls
used for token refresh/exchange don't set a timeout and can hang; add a timeout
parameter (e.g. timeout=10) to both requests.post invocations (the one that
exchanges refresh_token -> access_token and the one that exchanges access_token
-> vault_token), and ensure any calling code can surface or handle
requests.Timeout by letting exceptions propagate or wrapping them where
appropriate.

Comment on lines 693 to 695
def __handle_databus_collection__(uri: str) -> str:
    headers = {"Accept": "text/sparql"}
    return requests.get(uri, headers=headers).text

⚠️ Potential issue | 🟠 Major

Add timeout parameter.

The GET request lacks a timeout parameter.

Apply this diff:

 def __handle_databus_collection__(uri: str) -> str:
     headers = {"Accept": "text/sparql"}
-    return requests.get(uri, headers=headers).text
+    return requests.get(uri, headers=headers, timeout=30).text
🧰 Tools
🪛 Ruff (0.14.2)

695-695: Probable use of requests call without timeout

(S113)

🤖 Prompt for AI Agents
In databusclient/client.py around lines 693 to 695, the GET request has no
timeout; modify the requests.get call to include a timeout (e.g.,
requests.get(uri, headers=headers, timeout=10).text), or use an
existing/configurable DEFAULT_TIMEOUT constant, and ensure the call passes the
timeout keyword so the request won't hang indefinitely.

Comment on lines 698 to 700
def __get_json_ld_from_databus__(uri: str) -> str:
    headers = {"Accept": "application/ld+json"}
    return requests.get(uri, headers=headers).text

⚠️ Potential issue | 🟠 Major

Add timeout parameter.

The GET request lacks a timeout parameter, which could cause hangs when fetching metadata from Databus.

Apply this diff:

 def __get_json_ld_from_databus__(uri: str) -> str:
     headers = {"Accept": "application/ld+json"}
-    return requests.get(uri, headers=headers).text
+    return requests.get(uri, headers=headers, timeout=30).text
🧰 Tools
🪛 Ruff (0.14.2)

700-700: Probable use of requests call without timeout

(S113)

🤖 Prompt for AI Agents
In databusclient/client.py around lines 698-700, the requests.get call in
__get_json_ld_from_databus__ lacks a timeout which can cause the process to
hang; modify the call to include a timeout parameter (e.g., timeout=10) or,
better, read a module-level or config value (e.g., DATABUS_REQUEST_TIMEOUT) and
pass it as timeout=request_timeout to requests.get, and ensure the function lets
request exceptions propagate or document that callers should handle timeouts.

@shivansh31414 (Author)

Just pushed a commit addressing all of CodeRabbit’s suggestions — moved the SHA256 helper to module level, removed the unreachable branch, and cleaned up the redundant import. Let me know if anything else needs tweaking. Ready for re-review!

@Integer-Ctrl (Contributor)

Hey! It seems the line endings changed (possibly to CRLF), so the diff shows every line as updated. Could you fix the line endings so I can review the actual changes? See diff
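
A common fix is to commit a .gitattributes file that normalizes line endings, roughly like this (a sketch; adjust the patterns to the repo's needs):

# .gitattributes: let git normalize text files and force LF for Python sources
* text=auto
*.py text eol=lf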


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
databusclient/client.py (2)

516-568: Token exchange requests still lack timeout parameters.

Past reviews flagged this as a major issue. If the authentication server is unresponsive, the application will hang indefinitely.

Apply this diff to add timeouts:

     # 2. Refresh token -> access token
     resp = requests.post(
         auth_url,
         data={
             "client_id": client_id,
             "grant_type": "refresh_token",
             "refresh_token": refresh_token,
         },
+        timeout=30,
     )
     resp.raise_for_status()
     access_token = resp.json()["access_token"]
 
     # 3. Extract host as audience
     # Remove protocol prefix
     if download_url.startswith("https://"):
         host_part = download_url[len("https://") :]
     elif download_url.startswith("http://"):
         host_part = download_url[len("http://") :]
     else:
         host_part = download_url
     audience = host_part.split("/")[0]  # host is before first "/"
 
     # 4. Access token -> Vault token
     resp = requests.post(
         auth_url,
         data={
             "client_id": client_id,
             "grant_type": "urn:ietf:params:oauth:grant-type:token-exchange",
             "subject_token": access_token,
             "audience": audience,
         },
+        timeout=30,
     )

440-488: All HTTP requests in this function still lack timeout parameters.

Without timeouts, the application can hang indefinitely on unresponsive servers—a critical production reliability issue.

Apply the provided diff to add timeout=30 to all four requests calls:

  • Line 440: requests.head(url, stream=True, timeout=30)
  • Line 453: requests.get(url, stream=True, allow_redirects=False, timeout=30)
  • Line 471: requests.get(url, headers=headers, stream=True, timeout=30)
  • Line 478: requests.get(redirect_url, stream=True, allow_redirects=True, timeout=30)
🧹 Nitpick comments (1)
databusclient/client.py (1)

703-737: Clean up unused variable and type hints.

The function correctly propagates validation parameters, but has two minor issues:

  1. Line 713: The host variable from __get_databus_id_parts__ is unpacked but never used.
  2. Lines 706-708: Type hints use implicit Optional which PEP 484 discourages.

Apply this diff:

 def __download_list__(
     files_to_download: List[Tuple[str, Optional[str]]],
     localDir: str,
-    vault_token_file: str = None,
-    auth_url: str = None,
-    client_id: str = None,
+    vault_token_file: Optional[str] = None,
+    auth_url: Optional[str] = None,
+    client_id: Optional[str] = None,
     validation_mode: ShaValidationMode = ShaValidationMode.WARNING,
 ) -> None:
     for url, expected_sha in files_to_download:
         if localDir is None:
-            host, account, group, artifact, version, file = __get_databus_id_parts__(
+            _, account, group, artifact, version, _ = __get_databus_id_parts__(
                 url
             )
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 78c5de0 and fe69957.

📒 Files selected for processing (1)
  • databusclient/client.py (17 hunks)
🧰 Additional context used
🪛 Ruff (0.14.4)
databusclient/client.py

453-453: Probable use of requests call without timeout

(S113)


471-471: Probable use of requests call without timeout

(S113)


478-478: Probable use of requests call without timeout

(S113)


513-513: Avoid specifying long messages outside the exception class

(TRY003)


533-533: Probable use of requests call without timeout

(S113)


555-555: Probable use of requests call without timeout

(S113)


706-706: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


707-707: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


708-708: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


713-713: Unpacked variable host is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

🔇 Additional comments (9)
databusclient/client.py (9)

15-21: LGTM! Clean SHA256 helper.

The module-level helper is well-structured, uses an appropriate chunk size, and will be deterministic for validation.


40-46: LGTM! Clear validation modes.

The enum provides the three requested modes with clear semantics.


413-433: LGTM! Signature is backwards compatible.

The new parameters have sensible defaults and are properly documented.


499-511: LGTM! Validation logic correctly implements the three modes.

The validation flow properly handles ERROR (raises), WARNING (prints and continues), and OFF (skips) modes as specified in the PR objectives.


571-587: LGTM! Using POST for SPARQL queries is best practice.

POST supports larger queries and is more semantic per the SPARQL protocol.


590-617: LGTM! Robust extraction of both URL and SHA from query results.

The function defensively tries multiple common variable names and correctly handles cases where SHA may not be present in the query results.


619-644: LGTM! Properly extracts both URL and SHA256 from artifact metadata.

The function correctly returns tuples of (url, sha) to enable downstream validation. The use of node.get("file") aligns with the comment about tracking access accurately.


740-746: LGTM! Return type now explicitly documents the 6-tuple structure.

The type hint improves clarity and type safety.


749-875: LGTM! Validation mode is correctly propagated through all download paths.

The dispatcher function properly routes different URI types (collection, file, artifact version, artifact, group, query) and consistently passes validation_mode to all downstream downloads.

@shivansh31414 (Author)

Thanks for following along on this PR. I’ve incorporated the SHA256 validation feature with flexible CLI modes, refactored the helper to module level, cleaned up unreachable branches, and normalized line endings with .gitattributes so the diffs are now clean. CodeRabbit’s latest review shows no actionable issues left — only minor nitpicks outside the diff range. Could you take a look and share your feedback on whether the changes meet our intended goals and if the flow feels clear from a reviewer’s perspective? If you spot anything that could be improved for readability or maintainability, I’d be happy to address it before merge.
Got a bit busy with my university exams; thanks for being patient.

@shivansh31414 (Author)

Do the current changes align with what you had in mind for the SHA validation feature? I want to be sure the flow feels clear from your perspective before I finalize.


@Integer-Ctrl Integer-Ctrl left a comment


Sorry for the late feedback. I tested it and it seems to work, but I've added some comments. The most important points are the formatting-related comments and the fact that there is currently no CLI option to toggle the SHA validation mode.

Comment on lines -543 to +644
-def __handle_databus_artifact_version__(json_str: str) -> List[str]:
+def __handle_databus_artifact_version__(
+    json_str: str,
+) -> List[Tuple[str, Optional[str]]]:
     """
-    Parse the JSON-LD of a databus artifact version to extract download URLs.
+    Parse the JSON-LD of a databus artifact version to extract download URLs and SHA256 sums.
     Don't get downloadURLs directly from the JSON-LD, but follow the "file" links to count access to databus accurately.
-    Returns a list of download URLs.
+    Returns a list of (download_url, sha256sum) tuples.
     """
 
-    databusIdUrl = []
+    databus_files = []
     json_dict = json.loads(json_str)
     graph = json_dict.get("@graph", [])
     for node in graph:
         if node.get("@type") == "Part":
-            id = node.get("file")
-            databusIdUrl.append(id)
-    return databusIdUrl
+            # Use the 'file' link as per the original comment
+            url = node.get("file")
+            if not url:
+                continue
+
+            # Extract the sha256sum from the same node
+            # This key is used in your create_dataset function
+            sha = node.get("sha256sum")
+
+            databus_files.append((url, sha))
+    return databus_files
@Integer-Ctrl (Contributor)

You should handle resolving the sha256 sum from databus inside the __download_file method. Otherwise, if only a single file is downloaded, its SHA sum won’t be verified.
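
For instance, a hypothetical helper along these lines could be called from __download_file when no expected checksum was passed in (a sketch; the helper name and lookup are assumptions modeled on the JSON-LD handling elsewhere in this PR):

from typing import Optional
import json

import requests


def resolve_expected_sha(file_uri: str) -> Optional[str]:
    # Hypothetical: fetch the file's JSON-LD metadata from Databus and pull
    # its sha256sum, mirroring the node.get("sha256sum") lookup above.
    resp = requests.get(file_uri, headers={"Accept": "application/ld+json"}, timeout=30)
    if not resp.ok:
        return None
    for node in json.loads(resp.text).get("@graph", []):
        if node.get("sha256sum"):
            return node["sha256sum"]
    return None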

@Integer-Ctrl (Contributor)

You touched many lines of code just by changing the formatting. Please avoid reformatting code sections where you didn't change the actual logic; it makes the review process unnecessarily long. Also, changes like replacing " with ' (or vice versa) shouldn't be included in a feature PR.

@Integer-Ctrl (Contributor)

You touched many lines of code just by deleting them. Please avoid removing unused code in a feature PR. Yes, this repo/file is messy and a lot of the code may be unused or outdated, but cleanup shouldn't be part of this PR (unless it is a refactored method or similar); it makes the review process unnecessarily long.

    vault_token_file: str = None,
    auth_url: str = None,
    client_id: str = None,
    validation_mode: ShaValidationMode = ShaValidationMode.WARNING,
@Integer-Ctrl (Contributor)

There is currently no CLI option to specify the ShaValidationMode.
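
For example, assuming the CLI is built with click (databusclient/cli.py defines a download command) and that the enum members carry the string values off/warning/error, the flag could be wired up roughly like this (a sketch, not the actual code):

import click

from databusclient.client import ShaValidationMode


@click.command()
@click.argument("localdir")
@click.option(
    "--validate-sha",
    type=click.Choice([m.value for m in ShaValidationMode]),
    default=ShaValidationMode.WARNING.value,
    show_default=True,
    help="SHA256 validation mode: off, warning, or error.",
)
def download(localdir, validate_sha):
    # Convert the string flag back to the enum before calling into the client.
    mode = ShaValidationMode(validate_sha)
    # ... pass mode (and the other options) to client.download(...)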

    auth_url=auth_url,
    client_id=client_id,
    expected_sha256=expected_sha,  # <-- Pass the SHA hash here
    validation_mode=validation_mode,  # <-- Pass the validation mode here
@Integer-Ctrl (Contributor)

Use a global variable for the validation mode instead of passing the option into every function call.
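
Something like this module-level default in client.py would do it (a sketch):

# One module-level setting instead of threading validation_mode through
# every signature (sketch; lives next to the ShaValidationMode enum).
VALIDATION_MODE = ShaValidationMode.WARNING


def set_validation_mode(mode: ShaValidationMode) -> None:
    global VALIDATION_MODE
    VALIDATION_MODE = mode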

@Integer-Ctrl Integer-Ctrl self-assigned this Nov 24, 2025
@Integer-Ctrl (Contributor)

@shivansh31414 Do you want to continue working on this?

@shivansh31414 (Author)

I think you can assign it to someone else; I am occupied with my college work.

Development: successfully merging this pull request may close issue #28 (SHA checksum validation).