Add SHA256 checksum validation with flexible CLI modes #31
Conversation
📝 Walkthrough

Adds ShaValidationMode and threads SHA256 validation through download flows; refactors handlers to return (url, sha) tuples; implements deterministic SHA256 computation, redirect and Vault-token retry logic in downloads; and replaces tests with end-to-end style cases covering ERROR/WARNING/OFF and successful validation.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant User
    participant Client as download()
    participant SPARQL as __query_sparql__/Metadata
    participant DLList as __download_list__
    participant DLFile as __download_file__
    participant Vault as __get_vault_access__
    participant HTTP as HTTP Server
    participant FS as Filesystem
    User->>Client: download(localDir, URIs, validation_mode=MODE)
    Client->>SPARQL: fetch metadata (JSON-LD / SPARQL)
    SPARQL-->>Client: List[(url, optional sha)]
    Client->>DLList: __download_list__(files, MODE)
    loop for each (url, sha)
        DLList->>DLFile: __download_file__(url, expected_sha=sha, MODE)
        DLFile->>HTTP: HTTP GET url
        alt 2xx
            HTTP-->>DLFile: file bytes
        else 3xx redirect or 401/Bearer
            HTTP-->>DLFile: redirect / 401
            DLFile->>Vault: __get_vault_access__(download_url, token_file, auth_url, client_id)
            Vault-->>DLFile: vault_token
            DLFile->>HTTP: HTTP GET url with Authorization: Bearer token
            HTTP-->>DLFile: file bytes
        end
        DLFile->>DLFile: compute SHA256 (if expected provided)
        alt expected SHA present and MODE == ERROR and mismatch
            DLFile-->>Client: raise Exception (abort)
        else expected SHA present and MODE == WARNING and mismatch
            DLFile-->>Client: emit warning, continue
        else MODE == OFF or hashes match
            DLFile->>FS: write file
        end
    end
```
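The validation branches in the diagram can be sketched as a small standalone helper (illustrative names only; the actual client code may differ):

```python
import hashlib
from enum import Enum


class ShaValidationMode(Enum):
    OFF = "off"
    WARNING = "warning"
    ERROR = "error"


def check_sha256(data: bytes, expected_sha, mode: ShaValidationMode) -> bool:
    """Return True if the file may be written; raise on ERROR-mode mismatch."""
    if mode is ShaValidationMode.OFF or expected_sha is None:
        return True  # nothing to validate
    actual = hashlib.sha256(data).hexdigest()
    if actual == expected_sha:
        return True
    if mode is ShaValidationMode.ERROR:
        raise ValueError(f"SHA256 mismatch: expected {expected_sha}, got {actual}")
    print(f"WARNING: SHA256 mismatch for download: expected {expected_sha}, got {actual}")
    return True  # WARNING mode: warn but continue
```

Note that in WARNING mode a mismatching file is still written, matching the "emit warning, continue" branch above.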
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Possibly related PRs
Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
databusclient/client.py (14 hunks)
tests/test_databusclient.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_databusclient.py (2)
databusclient/client.py (2)
download (739-865)
ShaValidationMode (31-35)
databusclient/cli.py (1)
download (46-57)
🪛 Ruff (0.14.2)
databusclient/client.py
443-443: Probable use of requests call without timeout
(S113)
505-505: Avoid specifying long messages outside the exception class
(TRY003)
525-525: Probable use of requests call without timeout
(S113)
547-547: Probable use of requests call without timeout
(S113)
699-699: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
700-700: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
701-701: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
705-705: Unpacked variable host is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
tests/test_databusclient.py
44-44: Redefinition of unused requests_mock from line 2
(F811)
60-60: Unused function argument: mock_print
(ARG001)
60-60: Redefinition of unused requests_mock from line 2
(F811)
88-88: Redefinition of unused requests_mock from line 2
(F811)
119-119: Redefinition of unused requests_mock from line 2
(F811)
151-151: Redefinition of unused requests_mock from line 2
(F811)
Thanks for the detailed review! I'll work on the following updates shortly:

- Fixing the redirect handling in download_file to ensure proper SHA validation.
- Separating unrelated formatting and refactoring changes into a separate PR to keep this one focused.
- Improving docstring coverage to meet the required threshold.

Appreciate the heads-up on these — will follow up soon with the fixes!
Actionable comments posted: 1
🧹 Nitpick comments (4)
databusclient/client.py (4)
433-469: Redirect handling looks good, but add timeouts to prevent hanging.

The redirect handling correctly addresses the previous review concern by following redirects for both authenticated and non-authenticated requests.

However, the requests.get() calls at lines 444, 462, and 469 lack timeout parameters, which can cause the download to hang indefinitely if the server is unresponsive.

Apply this diff to add timeouts:

```diff
         response = requests.get(
-            url, stream=True, allow_redirects=False
+            url, stream=True, allow_redirects=False, timeout=30
         )  # no redirects here, we want to see if auth is required
```

```diff
             # --- 4. Retry with token ---
             # This request correctly allows redirects (default)
-            response = requests.get(url, headers=headers, stream=True)
+            response = requests.get(url, headers=headers, stream=True, timeout=30)
```

```diff
             print(f"Following redirect to {redirect_url}")
             # Make a new request that *does* follow any further redirects
-            response = requests.get(redirect_url, stream=True, allow_redirects=True)
+            response = requests.get(redirect_url, stream=True, allow_redirects=True, timeout=30)
```
534-564: Add timeouts to token exchange requests.

The token exchange POST requests at lines 534 and 556 lack timeout parameters, which can cause authentication to hang if the auth server is unresponsive.

Apply this diff:

```diff
     # 2. Refresh token -> access token
     resp = requests.post(
         auth_url,
         data={
             "client_id": client_id,
             "grant_type": "refresh_token",
             "refresh_token": refresh_token,
         },
+        timeout=30,
     )
```

```diff
     # 4. Access token -> Vault token
     resp = requests.post(
         auth_url,
         data={
             "client_id": client_id,
             "grant_type": "urn:ietf:params:oauth:grant-type:token-exchange",
             "subject_token": access_token,
             "audience": audience,
         },
+        timeout=30,
     )
```
591-617: Fix return type annotation for generator function.

The function uses yield (line 615), making it a generator, but the return type annotation declares List[Tuple[str, Optional[str]]]. While the code works at runtime (generators are iterable), the type annotation is misleading.

Apply this diff to correct the annotation:

```diff
+from typing import List, Dict, Tuple, Optional, Union, Iterator
+
 def __handle_databus_file_query__(
     endpoint_url, query
-) -> List[Tuple[str, Optional[str]]]:
+) -> Iterator[Tuple[str, Optional[str]]]:
```
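A tiny self-contained example of the distinction (not the client's actual code): a `yield`-based function returns a generator object, so callers never receive a list, and `Iterator` is the honest annotation.

```python
import types
from typing import Iterator, Optional, Tuple


def file_pairs() -> Iterator[Tuple[str, Optional[str]]]:
    # A generator: tuples are produced lazily, one at a time.
    yield ("https://example.org/a.ttl", "ab12")
    yield ("https://example.org/b.ttl", None)


gen = file_pairs()
# The object is a generator, not a list; list methods like .append are absent.
pairs = list(gen)
```

Callers that need a concrete list can always materialize the generator with `list(...)`, as shown.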
704-738: Fix type annotations and unused variable.

Two issues per static analysis:

- Lines 707-709: Parameters vault_token_file, auth_url, and client_id have default None but their type annotations don't specify Optional. Modern Python style prefers str | None.
- Line 714: The unpacked variable host is never used. Use _ to indicate it's intentionally ignored.

Apply this diff:

```diff
 def __download_list__(
     files_to_download: List[Tuple[str, Optional[str]]],
     localDir: str,
-    vault_token_file: str = None,
-    auth_url: str = None,
-    client_id: str = None,
+    vault_token_file: Optional[str] = None,
+    auth_url: Optional[str] = None,
+    client_id: Optional[str] = None,
     validation_mode: ShaValidationMode = ShaValidationMode.WARNING,
 ) -> None:
     for url, expected_sha in files_to_download:
         if localDir is None:
-            host, account, group, artifact, version, file = __get_databus_id_parts__(
+            _, account, group, artifact, version, file = __get_databus_id_parts__(
                 url
             )
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
databusclient/client.py(16 hunks)
🧰 Additional context used
🪛 Ruff (0.14.2)
databusclient/client.py
444-444: Probable use of requests call without timeout
(S113)
462-462: Probable use of requests call without timeout
(S113)
469-469: Probable use of requests call without timeout
(S113)
514-514: Avoid specifying long messages outside the exception class
(TRY003)
534-534: Probable use of requests call without timeout
(S113)
556-556: Probable use of requests call without timeout
(S113)
707-707: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
708-708: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
709-709: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
714-714: Unpacked variable host is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
🔇 Additional comments (6)
databusclient/client.py (6)
31-37: LGTM! The ShaValidationMode enum is well-designed with clear semantics for the three validation behaviors.

572-588: LGTM! The explicit POST method setting is a good practice for SPARQL endpoints.

620-645: LGTM! The function correctly extracts both download URLs and SHA256 sums from the artifact metadata, and the return type matches the implementation.

690-691: LGTM! Explicitly specifying UTF-8 encoding is good practice for consistent hashing behavior.

741-747: LGTM! The explicit return type annotation clarifies that the function always returns exactly six components, with None for missing parts.

750-876: LGTM! The validation_mode parameter is consistently threaded through all download paths (collections, files, artifact versions, artifacts, and groups), ensuring uniform SHA validation behavior across the entire download flow.
Actionable comments posted: 6
🧹 Nitpick comments (1)
databusclient/client.py (1)
1-11: Remove redundant import.

Line 8 imports sha256 from hashlib, but hashlib is already imported at line 4. The sha256 import is only used once in the wsha256 function at line 690, where you can reference it as hashlib.sha256 instead.

Apply this diff:

```diff
 from enum import Enum
 from typing import List, Dict, Tuple, Optional, Union
 import requests
 import hashlib
 import json
 from tqdm import tqdm
 from SPARQLWrapper import SPARQLWrapper, JSON
-from hashlib import sha256
 import os
 import re
```

And update line 690:

```diff
 def wsha256(raw: str):
-    return sha256(raw.encode("utf-8")).hexdigest()
+    return hashlib.sha256(raw.encode("utf-8")).hexdigest()
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
databusclient/client.py(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
databusclient/client.py (1)
databusclient/cli.py (2)
deploy(30-36)download(46-57)
🪛 Ruff (0.14.2)
databusclient/client.py
153-156: Avoid specifying long messages outside the exception class
(TRY003)
165-165: Probable use of requests call without timeout
(S113)
178-178: Local variable content_variant_part is assigned to but never used
Remove assignment to unused variable content_variant_part
(F841)
195-195: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
196-196: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
197-197: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
233-233: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
234-234: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
235-235: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
236-236: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
237-237: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
271-271: Unpacked variable account_name is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
271-271: Unpacked variable group_name is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
271-271: Unpacked variable artifact_name is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
291-294: Avoid specifying long messages outside the exception class
(TRY003)
398-398: Probable use of requests call without timeout
(S113)
406-406: Avoid specifying long messages outside the exception class
(TRY003)
440-440: Probable use of requests call without timeout
(S113)
453-453: Probable use of requests call without timeout
(S113)
463-463: Avoid specifying long messages outside the exception class
(TRY003)
471-471: Probable use of requests call without timeout
(S113)
478-478: Probable use of requests call without timeout
(S113)
487-487: Use raise without specifying exception name
Remove exception name
(TRY201)
513-513: Avoid specifying long messages outside the exception class
(TRY003)
526-526: Avoid specifying long messages outside the exception class
(TRY003)
533-533: Probable use of requests call without timeout
(S113)
555-555: Probable use of requests call without timeout
(S113)
663-663: Avoid specifying long messages outside the exception class
(TRY003)
695-695: Probable use of requests call without timeout
(S113)
700-700: Probable use of requests call without timeout
(S113)
706-706: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
707-707: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
708-708: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
713-713: Unpacked variable host is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
866-866: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (9)
databusclient/client.py (9)
15-21: LGTM! The helper function correctly computes SHA256 in chunks, which is memory-efficient for large files. Good refactoring from the previous inline implementation.
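A chunked SHA256 helper of the kind this comment describes might look like the following sketch (assumed names, not the repository's exact code):

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 8192) -> str:
    """Hash a file incrementally so large downloads are never loaded fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

`hashlib.sha256` accepts incremental `update()` calls, so the digest of the concatenated chunks equals the digest of the whole file.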
40-45: LGTM! The enum clearly defines the three validation modes as specified in the PR objectives.

489-497: LGTM! The download logic with progress bar is correctly implemented.

499-510: LGTM! The validation logic correctly implements all three modes (OFF, WARNING, ERROR) as specified in the PR objectives. The fix from the previous review (removing unreachable branch) is properly applied.

512-513: LGTM! The size check provides an additional sanity check after SHA validation.

590-617: LGTM! The function properly handles multiple variable name conventions for file URLs and checksums, with appropriate fallback behavior and helpful warnings.

619-644: LGTM! The function correctly extracts both download URLs and SHA256 checksums from the JSON-LD metadata, properly returning tuples to support validation.

703-738: LGTM! The function correctly propagates validation mode and SHA checksums through the download pipeline. The unused host variable at line 713 is a minor style issue but doesn't affect functionality.

749-875: LGTM! The download() function correctly propagates the validation mode through all download paths (collections, files, artifact versions, artifacts, groups, and queries). The auto-detection of the SPARQL endpoint is a helpful UX improvement.
databusclient/client.py
Outdated
```python
def __load_file_stats(url: str) -> Tuple[str, int]:
    resp = requests.get(url)
    if resp.status_code > 400:
        raise requests.exceptions.RequestException(response=resp)

    sha256sum = hashlib.sha256(bytes(resp.content)).hexdigest()
    content_length = len(resp.content)
    return sha256sum, content_length
```
Add timeout parameter.
The GET request in __load_file_stats lacks a timeout parameter.
Apply this diff:

```diff
 def __load_file_stats(url: str) -> Tuple[str, int]:
-    resp = requests.get(url)
+    resp = requests.get(url, timeout=30)
     if resp.status_code > 400:
         raise requests.exceptions.RequestException(response=resp)
```

🧰 Tools
🪛 Ruff (0.14.2)
165-165: Probable use of requests call without timeout
(S113)
🤖 Prompt for AI Agents
In databusclient/client.py around lines 164 to 171, the requests.get call in
__load_file_stats lacks a timeout; add a timeout parameter to the function
(e.g., def __load_file_stats(url: str, timeout: float = 10):) and pass it to
requests.get as requests.get(url, timeout=timeout), keeping the existing
status_code check and exception behavior so callers can override the timeout if
needed.
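Beyond adding a timeout, the same stats can be computed while streaming, so the whole response never sits in memory. A sketch under assumed names (`stream_stats` is hypothetical, not the suggested fix):

```python
import hashlib
from typing import Iterable, Tuple


def stream_stats(chunks: Iterable[bytes]) -> Tuple[str, int]:
    """Compute (sha256 hex digest, byte length) from an iterable of byte chunks."""
    digest = hashlib.sha256()
    total = 0
    for chunk in chunks:
        digest.update(chunk)
        total += len(chunk)
    return digest.hexdigest(), total


# Usage with requests (assumption; requires the `requests` package):
# resp = requests.get(url, stream=True, timeout=30)
# sha256sum, content_length = stream_stats(resp.iter_content(chunk_size=8192))
```

Because the digest and length are accumulated per chunk, this variant works for files larger than available memory.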
databusclient/client.py
Outdated
```python
        response = requests.head(url, stream=True)
        # Check for redirect and update URL if necessary
        if response.headers.get("Location") and response.status_code in [
            301,
            302,
            303,
            307,
            308,
        ]:
            url = response.headers.get("Location")
            print("Redirects url: ", url)

        # --- 2. Try direct GET ---
        response = requests.get(
            url, stream=True, allow_redirects=False
        )  # no redirects here, we want to see if auth is required
        www = response.headers.get(
            "WWW-Authenticate", ""
        )  # get WWW-Authenticate header if present to check for Bearer auth

        if response.status_code == 401 or "bearer" in www.lower():
            print(f"Authentication required for {url}")
            if not (vault_token_file):
                raise ValueError("Vault token file not given for protected download")

            # --- 3. Fetch Vault token ---
            vault_token = __get_vault_access__(url, vault_token_file, auth_url, client_id)
            headers = {"Authorization": f"Bearer {vault_token}"}

            # --- 4. Retry with token ---
            # This request correctly allows redirects (default)
            response = requests.get(url, headers=headers, stream=True)

        # Handle 3xx redirects for non-authed requests (e.g., S3 presigned URLs)
        elif response.is_redirect:
            redirect_url = response.headers.get("Location")
            print(f"Following redirect to {redirect_url}")
            # Make a new request that *does* follow any further redirects
            response = requests.get(redirect_url, stream=True, allow_redirects=True)

        try:
            response.raise_for_status()  # Raise if still failing
        except requests.exceptions.HTTPError as e:
            if response.status_code == 404:
                print(f"WARNING: Skipping file {url} because it was not found (404).")
                return
            else:
                raise e
```
Add timeout parameter to all HTTP requests.
All requests calls in this function lack a timeout parameter, which can cause the application to hang indefinitely if the server doesn't respond. This is especially problematic for HEAD and GET requests that may be waiting on external services.
Apply this diff to add appropriate timeouts:
# --- 1. Get redirect URL by requesting HEAD ---
- response = requests.head(url, stream=True)
+ response = requests.head(url, stream=True, timeout=30)
# Check for redirect and update URL if necessary
if response.headers.get("Location") and response.status_code in [
301,
302,
303,
307,
308,
]:
url = response.headers.get("Location")
print("Redirects url: ", url)
# --- 2. Try direct GET ---
response = requests.get(
- url, stream=True, allow_redirects=False
+ url, stream=True, allow_redirects=False, timeout=30
) # no redirects here, we want to see if auth is required
www = response.headers.get(
"WWW-Authenticate", ""
) # get WWW-Authenticate header if present to check for Bearer auth
if response.status_code == 401 or "bearer" in www.lower():
print(f"Authentication required for {url}")
if not (vault_token_file):
raise ValueError("Vault token file not given for protected download")
# --- 3. Fetch Vault token ---
vault_token = __get_vault_access__(url, vault_token_file, auth_url, client_id)
headers = {"Authorization": f"Bearer {vault_token}"}
# --- 4. Retry with token ---
# This request correctly allows redirects (default)
- response = requests.get(url, headers=headers, stream=True)
+ response = requests.get(url, headers=headers, stream=True, timeout=30)
# Handle 3xx redirects for non-authed requests (e.g., S3 presigned URLs)
elif response.is_redirect:
redirect_url = response.headers.get("Location")
print(f"Following redirect to {redirect_url}")
# Make a new request that *does* follow any further redirects
- response = requests.get(redirect_url, stream=True, allow_redirects=True)
+ response = requests.get(redirect_url, stream=True, allow_redirects=True, timeout=30)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
        response = requests.head(url, stream=True, timeout=30)
        # Check for redirect and update URL if necessary
        if response.headers.get("Location") and response.status_code in [
            301,
            302,
            303,
            307,
            308,
        ]:
            url = response.headers.get("Location")
            print("Redirects url: ", url)

        # --- 2. Try direct GET ---
        response = requests.get(
            url, stream=True, allow_redirects=False, timeout=30
        )  # no redirects here, we want to see if auth is required
        www = response.headers.get(
            "WWW-Authenticate", ""
        )  # get WWW-Authenticate header if present to check for Bearer auth

        if response.status_code == 401 or "bearer" in www.lower():
            print(f"Authentication required for {url}")
            if not (vault_token_file):
                raise ValueError("Vault token file not given for protected download")

            # --- 3. Fetch Vault token ---
            vault_token = __get_vault_access__(url, vault_token_file, auth_url, client_id)
            headers = {"Authorization": f"Bearer {vault_token}"}

            # --- 4. Retry with token ---
            # This request correctly allows redirects (default)
            response = requests.get(url, headers=headers, stream=True, timeout=30)

        # Handle 3xx redirects for non-authed requests (e.g., S3 presigned URLs)
        elif response.is_redirect:
            redirect_url = response.headers.get("Location")
            print(f"Following redirect to {redirect_url}")
            # Make a new request that *does* follow any further redirects
            response = requests.get(redirect_url, stream=True, allow_redirects=True, timeout=30)

        try:
            response.raise_for_status()  # Raise if still failing
        except requests.exceptions.HTTPError as e:
            if response.status_code == 404:
                print(f"WARNING: Skipping file {url} because it was not found (404).")
                return
            else:
                raise e
```
🧰 Tools
🪛 Ruff (0.14.2)
440-440: Probable use of requests call without timeout
(S113)
453-453: Probable use of requests call without timeout
(S113)
463-463: Avoid specifying long messages outside the exception class
(TRY003)
471-471: Probable use of requests call without timeout
(S113)
478-478: Probable use of requests call without timeout
(S113)
487-487: Use raise without specifying exception name
Remove exception name
(TRY201)
🤖 Prompt for AI Agents
In databusclient/client.py around lines 440 to 488, all requests calls (the
initial requests.head and every requests.get in this block) are missing a
timeout and can hang indefinitely; add a timeout parameter to each call (e.g.,
timeout=DEFAULT_TIMEOUT) and define DEFAULT_TIMEOUT at module level or accept a
timeout argument to the function (suggest 10 seconds as a sensible default),
making sure to pass the same timeout to requests.head(..., timeout=...) and to
all requests.get(...) calls (including the retry with headers and the
follow-redirect request). Optionally import requests.exceptions.Timeout where
used if you want to handle timeouts specially, but the immediate fix is to
supply the timeout argument consistently on every request in this block.
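Rather than repeating `timeout=30` at every call site, one option along the lines of the prompt's DEFAULT_TIMEOUT suggestion is a small wrapper that injects a default. This is a sketch, not part of the PR; `with_default_timeout` is a hypothetical name, and it works with any requests-style callable:

```python
import functools

DEFAULT_TIMEOUT = 30  # seconds; an assumed project-wide default


def with_default_timeout(http_call, timeout: float = DEFAULT_TIMEOUT):
    """Wrap a requests-style function so every call carries a timeout unless one is given."""
    @functools.wraps(http_call)
    def wrapper(url, **kwargs):
        kwargs.setdefault("timeout", timeout)
        return http_call(url, **kwargs)
    return wrapper


# Usage (assumption; requires the `requests` package):
# import requests
# get = with_default_timeout(requests.get)
# resp = get("https://databus.example.org/file.ttl")  # sends timeout=30
```

Callers can still override the default per call by passing an explicit `timeout=` keyword.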
databusclient/client.py
Outdated
```python
def __get_vault_access__(
    download_url: str, token_file: str, auth_url: str, client_id: str
) -> str:
    """
    Get Vault access token for a protected databus download.
    """
    # 1. Load refresh token
    refresh_token = os.environ.get("REFRESH_TOKEN")
    if not refresh_token:
        if not os.path.exists(token_file):
            raise FileNotFoundError(f"Vault token file not found: {token_file}")
        with open(token_file, "r") as f:
            refresh_token = f.read().strip()
            if len(refresh_token) < 80:
                print(f"Warning: token from {token_file} is short (<80 chars)")

    # 2. Refresh token -> access token
    resp = requests.post(
        auth_url,
        data={
            "client_id": client_id,
            "grant_type": "refresh_token",
            "refresh_token": refresh_token,
        },
    )
    resp.raise_for_status()
    access_token = resp.json()["access_token"]

    # 3. Extract host as audience
    # Remove protocol prefix
    if download_url.startswith("https://"):
        host_part = download_url[len("https://") :]
    elif download_url.startswith("http://"):
        host_part = download_url[len("http://") :]
    else:
        host_part = download_url
    audience = host_part.split("/")[0]  # host is before first "/"

    # 4. Access token -> Vault token
    resp = requests.post(
        auth_url,
        data={
            "client_id": client_id,
            "grant_type": "urn:ietf:params:oauth:grant-type:token-exchange",
            "subject_token": access_token,
            "audience": audience,
        },
    )
    resp.raise_for_status()
    vault_token = resp.json()["access_token"]

    print(f"Using Vault access token for {download_url}")
    return vault_token
```
Add timeout parameter to token exchange requests.
The POST requests for token exchange (lines 533 and 555) lack timeout parameters, which could cause the application to hang if the authentication server is unresponsive.
Apply this diff:
# 2. Refresh token -> access token
resp = requests.post(
auth_url,
data={
"client_id": client_id,
"grant_type": "refresh_token",
"refresh_token": refresh_token,
},
+ timeout=30,
)
resp.raise_for_status()
access_token = resp.json()["access_token"]
# 3. Extract host as audience
# Remove protocol prefix
if download_url.startswith("https://"):
host_part = download_url[len("https://") :]
elif download_url.startswith("http://"):
host_part = download_url[len("http://") :]
else:
host_part = download_url
audience = host_part.split("/")[0] # host is before first "/"
# 4. Access token -> Vault token
resp = requests.post(
auth_url,
data={
"client_id": client_id,
"grant_type": "urn:ietf:params:oauth:grant-type:token-exchange",
"subject_token": access_token,
"audience": audience,
},
+ timeout=30,
)🧰 Tools
🪛 Ruff (0.14.2)
526-526: Avoid specifying long messages outside the exception class
(TRY003)
533-533: Probable use of requests call without timeout
(S113)
555-555: Probable use of requests call without timeout
(S113)
🤖 Prompt for AI Agents
In databusclient/client.py around lines 516 to 568, the two requests.post calls
used for token refresh/exchange don't set a timeout and can hang; add a timeout
parameter (e.g. timeout=10) to both requests.post invocations (the one that
exchanges refresh_token -> access_token and the one that exchanges access_token
-> vault_token), and ensure any calling code can surface or handle
requests.Timeout by letting exceptions propagate or wrapping them where
appropriate.
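As a side note on step 3 of this function, the manual protocol-prefix stripping can be expressed with the standard library, which also copes with ports and userinfo. A sketch (hypothetical helper name, not the PR's code):

```python
from urllib.parse import urlparse


def audience_from_url(download_url: str) -> str:
    """Extract the host part of a URL to use as the token-exchange audience."""
    parsed = urlparse(download_url)
    if parsed.netloc:  # present whenever a scheme like https:// was given
        return parsed.netloc
    # Bare host without scheme: host is everything before the first "/"
    return download_url.split("/")[0]
```

Note that `urlparse` only populates `netloc` when a scheme is present, hence the fallback branch for scheme-less inputs.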
databusclient/client.py
Outdated
```python
def __handle_databus_collection__(uri: str) -> str:
    headers = {"Accept": "text/sparql"}
    return requests.get(uri, headers=headers).text
```
Add timeout parameter.
The GET request lacks a timeout parameter.
Apply this diff:

```diff
 def __handle_databus_collection__(uri: str) -> str:
     headers = {"Accept": "text/sparql"}
-    return requests.get(uri, headers=headers).text
+    return requests.get(uri, headers=headers, timeout=30).text
```

🧰 Tools
🪛 Ruff (0.14.2)
695-695: Probable use of requests call without timeout
(S113)
🤖 Prompt for AI Agents
In databusclient/client.py around lines 693 to 695, the GET request has no
timeout; modify the requests.get call to include a timeout (e.g.,
requests.get(uri, headers=headers, timeout=10).text), or use an
existing/configurable DEFAULT_TIMEOUT constant, and ensure the call passes the
timeout keyword so the request won't hang indefinitely.
databusclient/client.py
Outdated
```python
def __get_json_ld_from_databus__(uri: str) -> str:
    headers = {"Accept": "application/ld+json"}
    return requests.get(uri, headers=headers).text
```
Add timeout parameter.
The GET request lacks a timeout parameter, which could cause hangs when fetching metadata from Databus.
Apply this diff:

```diff
 def __get_json_ld_from_databus__(uri: str) -> str:
     headers = {"Accept": "application/ld+json"}
-    return requests.get(uri, headers=headers).text
+    return requests.get(uri, headers=headers, timeout=30).text
```

🧰 Tools
🪛 Ruff (0.14.2)
700-700: Probable use of requests call without timeout
(S113)
🤖 Prompt for AI Agents
In databusclient/client.py around lines 698-700, the requests.get call in
__get_json_ld_from_databus__ lacks a timeout which can cause the process to
hang; modify the call to include a timeout parameter (e.g., timeout=10) or,
better, read a module-level or config value (e.g., DATABUS_REQUEST_TIMEOUT) and
pass it as timeout=request_timeout to requests.get, and ensure the function lets
request exceptions propagate or document that callers should handle timeouts.
Just pushed a commit addressing all of CodeRabbit’s suggestions — moved the SHA256 helper to module level, removed the unreachable branch, and cleaned up the redundant import. Let me know if anything else needs tweaking. Ready for re-review!
Hey! It seems the line endings changed (possibly to CRLF), so the diff shows every line as updated. Could you fix the line endings so I can review the actual changes? See diff
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
databusclient/client.py (2)
516-568: Token exchange requests still lack timeout parameters. Past reviews flagged this as a major issue. If the authentication server is unresponsive, the application will hang indefinitely.
Apply this diff to add timeouts:

```diff
     # 2. Refresh token -> access token
     resp = requests.post(
         auth_url,
         data={
             "client_id": client_id,
             "grant_type": "refresh_token",
             "refresh_token": refresh_token,
         },
+        timeout=30,
     )
     resp.raise_for_status()
     access_token = resp.json()["access_token"]

     # 3. Extract host as audience
     # Remove protocol prefix
     if download_url.startswith("https://"):
         host_part = download_url[len("https://") :]
     elif download_url.startswith("http://"):
         host_part = download_url[len("http://") :]
     else:
         host_part = download_url
     audience = host_part.split("/")[0]  # host is before first "/"

     # 4. Access token -> Vault token
     resp = requests.post(
         auth_url,
         data={
             "client_id": client_id,
             "grant_type": "urn:ietf:params:oauth:grant-type:token-exchange",
             "subject_token": access_token,
             "audience": audience,
         },
+        timeout=30,
     )
```
440-488: All HTTP requests in this function still lack timeout parameters. Without timeouts, the application can hang indefinitely on unresponsive servers, a critical production reliability issue.
Apply the provided diff to add `timeout=30` to all four requests calls:
- Line 440: `requests.head(url, stream=True, timeout=30)`
- Line 453: `requests.get(url, stream=True, allow_redirects=False, timeout=30)`
- Line 471: `requests.get(url, headers=headers, stream=True, timeout=30)`
- Line 478: `requests.get(redirect_url, stream=True, allow_redirects=True, timeout=30)`
🧹 Nitpick comments (1)
databusclient/client.py (1)
703-737: Clean up unused variable and type hints. The function correctly propagates validation parameters, but has two minor issues:
- Line 713: The `host` variable from `__get_databus_id_parts__` is unpacked but never used.
- Lines 706-708: Type hints use implicit `Optional`, which PEP 484 discourages.

Apply this diff:

```diff
 def __download_list__(
     files_to_download: List[Tuple[str, Optional[str]]],
     localDir: str,
-    vault_token_file: str = None,
-    auth_url: str = None,
-    client_id: str = None,
+    vault_token_file: Optional[str] = None,
+    auth_url: Optional[str] = None,
+    client_id: Optional[str] = None,
     validation_mode: ShaValidationMode = ShaValidationMode.WARNING,
 ) -> None:
     for url, expected_sha in files_to_download:
         if localDir is None:
-            host, account, group, artifact, version, file = __get_databus_id_parts__(
+            _, account, group, artifact, version, _ = __get_databus_id_parts__(
                 url
             )
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
databusclient/client.py (17 hunks)
🧰 Additional context used
🪛 Ruff (0.14.4)
databusclient/client.py
453-453: Probable use of requests call without timeout
(S113)
471-471: Probable use of requests call without timeout
(S113)
478-478: Probable use of requests call without timeout
(S113)
513-513: Avoid specifying long messages outside the exception class
(TRY003)
533-533: Probable use of requests call without timeout
(S113)
555-555: Probable use of requests call without timeout
(S113)
706-706: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
707-707: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
708-708: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
713-713: Unpacked variable host is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
🔇 Additional comments (9)
databusclient/client.py (9)
15-21: LGTM! Clean SHA256 helper. The module-level helper is well-structured, uses an appropriate chunk size, and will be deterministic for validation.
40-46: LGTM! Clear validation modes. The enum provides the three requested modes with clear semantics.
413-433: LGTM! Signature is backwards compatible. The new parameters have sensible defaults and are properly documented.
499-511: LGTM! Validation logic correctly implements the three modes. The validation flow properly handles ERROR (raises), WARNING (prints and continues), and OFF (skips) modes as specified in the PR objectives.
571-587: LGTM! Using POST for SPARQL queries is best practice. POST supports larger queries and is more semantic per the SPARQL protocol.
590-617: LGTM! Robust extraction of both URL and SHA from query results. The function defensively tries multiple common variable names and correctly handles cases where SHA may not be present in the query results.
619-644: LGTM! Properly extracts both URL and SHA256 from artifact metadata. The function correctly returns tuples of (url, sha) to enable downstream validation. The use of `node.get("file")` aligns with the comment about tracking access accurately.
740-746: LGTM! Return type now explicitly documents the 6-tuple structure. The type hint improves clarity and type safety.
749-875: LGTM! Validation mode is correctly propagated through all download paths. The dispatcher function properly routes different URI types (collection, file, artifact version, artifact, group, query) and consistently passes validation_mode to all downstream downloads.
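The "clean SHA256 helper" praised in the first item above could plausibly look like this minimal sketch; the exact name and chunk size in `client.py` may differ:

```python
import hashlib

def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Compute a file's SHA256 in fixed-size chunks so arbitrarily
    large downloads never need to fit in memory; the result is the
    deterministic lowercase hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```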
Thanks for following along on this PR. I've incorporated the SHA256 validation feature with flexible CLI modes, refactored the helper to module level, cleaned up unreachable branches, and normalized line endings with .gitattributes so the diffs are now clean. CodeRabbit's latest review shows no actionable issues left; only minor nitpicks outside the diff range remain. Could you take a look and share your feedback on whether the changes meet our intended goals and whether the flow feels clear from a reviewer's perspective? If you spot anything that could be improved for readability or maintainability, I'd be happy to address it before merge.
Do the current changes align with what you had in mind for the SHA validation feature? I want to be sure the flow feels clear from your perspective before I finalize.
Integer-Ctrl
left a comment
Sorry for the late feedback. I tested it and it seems to work, but I've added some comments. The most important points are the formatting-related comments and the fact that there is currently no CLI option to toggle the SHA validation mode.
```diff
-def __handle_databus_artifact_version__(json_str: str) -> List[str]:
+def __handle_databus_artifact_version__(
+    json_str: str,
+) -> List[Tuple[str, Optional[str]]]:
     """
-    Parse the JSON-LD of a databus artifact version to extract download URLs.
+    Parse the JSON-LD of a databus artifact version to extract download URLs and SHA256 sums.
     Don't get downloadURLs directly from the JSON-LD, but follow the "file" links to count access to databus accurately.
-    Returns a list of download URLs.
+    Returns a list of (download_url, sha256sum) tuples.
     """
-    databusIdUrl = []
+    databus_files = []
     json_dict = json.loads(json_str)
     graph = json_dict.get("@graph", [])
     for node in graph:
         if node.get("@type") == "Part":
-            id = node.get("file")
-            databusIdUrl.append(id)
-    return databusIdUrl
+            # Use the 'file' link as per the original comment
+            url = node.get("file")
+            if not url:
+                continue
+
+            # Extract the sha256sum from the same node
+            # This key is used in your create_dataset function
+            sha = node.get("sha256sum")
+
+            databus_files.append((url, sha))
+    return databus_files
```
You should handle resolving the sha256 sum from databus inside the __download_file method. Otherwise, if only a single file is downloaded, its SHA sum won’t be verified.
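One way to follow this suggestion is to hash the response stream inside `__download_file__` itself, so even a single-file download gets checked. A rough sketch, with a hypothetical helper name and plain strings standing in for the PR's `ShaValidationMode` enum:

```python
import hashlib

def check_downloaded_bytes(chunks, expected_sha256, mode="warning"):
    """Hash chunks as they would be written to disk and apply the
    validation mode on mismatch. Returns the actual hex digest."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    actual = digest.hexdigest()
    if expected_sha256 is not None and mode != "off" and actual != expected_sha256:
        msg = f"SHA256 mismatch: expected {expected_sha256}, got {actual}"
        if mode == "error":
            raise ValueError(msg)  # abort the download run
        print(f"WARNING: {msg}")   # warn and keep going
    return actual
```

Inside the real download loop the chunks would come from `response.iter_content(...)`, so the hash falls out of the same pass that writes the file, with no second read.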
You touched many lines of code just by changing the formatting. Please avoid reformatting code sections where you didn't change the actual logic. It makes the review process unnecessarily longer. Also, changes like replacing " with ' (or vice versa) shouldn’t be included in a feature PR.
You touched many lines of code just by deleting them. Please avoid removing unused code in a feature PR. Yes, this repo/file is messy and a lot of the code may be unused or outdated, but cleanup shouldn't be part of this PR (unless it is a refactored method or something like this). It makes the review process unnecessarily longer.
```python
    vault_token_file: str = None,
    auth_url: str = None,
    client_id: str = None,
    validation_mode: ShaValidationMode = ShaValidationMode.WARNING,
```
There is currently no CLI option to specify the ShaValidationMode.
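The missing flag could be wired up roughly like this; argparse is used here purely for illustration, and the actual databusclient CLI wiring may differ:

```python
import argparse

parser = argparse.ArgumentParser(prog="databusclient")
parser.add_argument(
    "--validate-sha",
    choices=["off", "warning", "error"],
    default="warning",
    help="SHA256 validation mode for downloaded files",
)

# e.g. `databusclient download --validate-sha error ...`
args = parser.parse_args(["--validate-sha", "error"])
# args.validate_sha can then be mapped to ShaValidationMode(args.validate_sha)
```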
```python
    auth_url=auth_url,
    client_id=client_id,
    expected_sha256=expected_sha,  # <-- Pass the SHA hash here
    validation_mode=validation_mode,  # <-- Pass the validation mode here
```
Use a global variable for the validation mode instead of passing the option into every function call
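The reviewer's suggestion could be sketched as a module-level setting with a setter, so helpers read the mode instead of receiving it as a parameter; names here are illustrative, and plain strings again stand in for the enum:

```python
_VALIDATION_MODE = "warning"  # process-wide default

def set_validation_mode(mode: str) -> None:
    """Set the SHA validation mode once, e.g. from the CLI entry point."""
    global _VALIDATION_MODE
    if mode not in ("off", "warning", "error"):
        raise ValueError(f"unknown validation mode: {mode}")
    _VALIDATION_MODE = mode

def get_validation_mode() -> str:
    """Read the current mode from anywhere in the download path."""
    return _VALIDATION_MODE
```

The trade-off: less parameter threading through `download` and `__download_list__`, at the cost of implicit state that tests must reset between cases.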
@shivansh31414 Do you want to continue work? |
I think you can assign it to someone else; I am occupied with my college work.
Title: Feat: Add SHA256 validation for downloads with --validate-sha flag
Closes #28
This PR adds a --validate-sha flag to validate file checksums after download, as discussed in the issue.
The flag supports three modes:
--validate-sha off: No validation.
--validate-sha warning: (Default) Prints a warning on mismatch.
--validate-sha error: Raises a ValueError on mismatch and stops execution.
This is implemented by passing the validation mode down to download_file, which now computes and compares the hash.
New tests are added in test_client.py to cover all three modes and a successful validation.
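The three modes described above map naturally onto an enum; this sketch mirrors the semantics stated in the description, though the exact member definitions in client.py may differ:

```python
from enum import Enum

class ShaValidationMode(Enum):
    OFF = "off"          # no validation
    WARNING = "warning"  # default: print a warning on mismatch
    ERROR = "error"      # raise ValueError on mismatch and stop
```

Because each member's value matches the CLI spelling, `ShaValidationMode(flag_value)` converts the flag string directly to the enum.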