-
Notifications
You must be signed in to change notification settings - Fork 10
Download capabilities and docker image #12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…urate statistics and check if auth is needed before sending header
…l artifacts and their latest version)
WalkthroughAdds a Dockerfile; replaces a Typer-based CLI with Click and updates the package entry point; extends client download logic to handle Vault-protected/redirected Databus resources and multiple Databus URI types with new helpers; updates dependencies and expands README examples and tests. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
actor U as User
participant CLI as Click CLI (app)
participant Client as databusclient.client
participant DB as Databus / SPARQL
participant Auth as Auth Server (OIDC)
participant Vault as Vault Token Service
participant Net as HTTP
U->>CLI: databusclient download [URIs] --localdir --databus [--token --authurl --clientid]
CLI->>Client: download(localDir, endpoint, databusURIs, token, auth_url, client_id)
rect rgb(247,250,245)
note right of Client: Resolve input URIs to file URLs
Client->>Client: Parse URIs (collection/artifact/version/group/account)
Client->>DB: Fetch JSON‑LD / run SPARQL queries
DB-->>Client: Metadata / file URLs
Client->>Client: Build list of file download URLs
end
loop For each file URL
Client->>Net: HEAD (follow redirects)
Net-->>Client: 200 / 3xx (+final URL)
Client->>Net: GET file (no auth)
alt 401 or auth required
note right of Client: Protected flow — obtain tokens
Client->>Auth: Exchange refresh token (auth_url, client_id)
Auth-->>Client: Access token
Client->>Vault: Exchange for file-scoped token
Vault-->>Client: Vault download token
Client->>Net: GET file with Authorization: Bearer <vault token>
Net-->>Client: 200 + content
else Public
Net-->>Client: 200 + content
end
Client->>Client: Save file to localDir
end
Client-->>CLI: Completed
CLI-->>U: Exit status / logs
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 9
🧹 Nitpick comments (6)
Dockerfile (1)
8-11: Harden and slim the image
- Avoid pip cache; consider non-root user for security.
Apply:
-RUN pip install . +RUN pip install --no-cache-dir .Optionally add:
# Reduce Python overhead ENV PYTHONDONTWRITEBYTECODE=1 PYTHONUNBUFFERED=1 # Create non-root user RUN useradd -m app && chown -R app:app /data USER appREADME.md (1)
71-99: Add languages to fenced code blocks and align invocation
- Add bash/man languages to code fences to satisfy MD040 and improve rendering.
- Prefer the installed script
databusclient ...overpython3 -m databusclient ...(there’s no package-level main here).Examples:
-``` +```bash -databusclient deploy --help +databusclient deploy --help-``` +```man Usage: python3 -m databusclient download [OPTIONS] DATABUSURIS... -``` +``` ```diff -``` +```bash -docker run --rm -v $(pwd):/data dbpedia/databus-python-client download ... +docker run --rm -v "$(pwd)":/data dbpedia/databus-python-client download ...Also applies to: 103-122, 125-134, 139-143, 150-156
databusclient/cli.py (1)
39-46: Type safety and UX polish for options
- Consider
type=click.Path(file_okay=False, dir_okay=True, writable=True)for--localdir.- Provide
type=strand meaningful metavar defaults to improve--help.-@click.option("--localdir", help="Local databus folder (if not given, databus folder structure is created in current working directory)") +@click.option("--localdir", type=click.Path(file_okay=False), help="Local databus folder (default: create Databus structure under CWD)")databusclient/client.py (3)
525-534: Return type doesn’t match behavior (generator vs list)The function yields values; annotate as an iterable or convert to list.
Option A (annotation):
-def __handle_databus_file_query__(endpoint_url, query) -> List[str]: +from typing import Iterable +def __handle_databus_file_query__(endpoint_url, query) -> Iterable[str]:Option B (materialize list):
- for binding in result_dict['results']['bindings']: + values = [] + for binding in result_dict['results']['bindings']: ... - yield value + values.append(value) + return values
610-615: Type hints: use Optional for optional paramsAlign with PEP 484 and silence tooling warnings.
-def __download_list__(urls: List[str], - localDir: str, - vault_token_file: str = None, - auth_url: str = None, - client_id: str = None) -> None: +from typing import Optional +def __download_list__(urls: List[str], + localDir: Optional[str], + vault_token_file: Optional[str] = None, + auth_url: Optional[str] = None, + client_id: Optional[str] = None) -> None:
710-713: Minor: fix print formattingBraces are printed literally.
- print("QUERY {}", databusURI.replace("\n", " ")) + print(f"QUERY {databusURI.replace('\n', ' ')}")
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
poetry.lockis excluded by!**/*.lock
📒 Files selected for processing (5)
Dockerfile(1 hunks)README.md(3 hunks)databusclient/cli.py(1 hunks)databusclient/client.py(4 hunks)pyproject.toml(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
databusclient/client.py (1)
databusclient/cli.py (1)
download(46-57)
databusclient/cli.py (1)
databusclient/client.py (3)
deploy(352-393)create_dataset(209-349)download(635-714)
🪛 markdownlint-cli2 (0.18.1)
README.md
72-72: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
76-76: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
104-104: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
109-109: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
114-114: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
119-119: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
126-126: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
131-131: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
140-140: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
150-150: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
154-154: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🪛 Ruff (0.13.3)
databusclient/client.py
417-417: Probable use of requests call without timeout
(S113)
425-425: Probable use of requests call without timeout
(S113)
431-431: Avoid specifying long messages outside the exception class
(TRY003)
438-438: Probable use of requests call without timeout
(S113)
467-467: Avoid specifying long messages outside the exception class
(TRY003)
474-474: Probable use of requests call without timeout
(S113)
493-493: Probable use of requests call without timeout
(S113)
570-570: Avoid specifying long messages outside the exception class
(TRY003)
602-602: Probable use of requests call without timeout
(S113)
607-607: Probable use of requests call without timeout
(S113)
612-612: PEP 484 prohibits implicit Optional
Convert to Optional[T]
(RUF013)
613-613: PEP 484 prohibits implicit Optional
Convert to Optional[T]
(RUF013)
614-614: PEP 484 prohibits implicit Optional
Convert to Optional[T]
(RUF013)
617-617: Unpacked variable host is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
712-712: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build
- GitHub Check: build
🔇 Additional comments (1)
pyproject.toml (1)
11-11: Packaging/entrypoint alignment looks good
- Click pinned to 8.0.x fits Python 3.9 support; entry point matches
databusclient.cli:app.- Verify rdflib is intended as a runtime dep (not only dev).
Also applies to: 15-15, 18-18, 22-22
| response = requests.head(url, stream=True) | ||
|
|
||
| # Check for redirect and update URL if necessary | ||
| if response.headers.get("Location") and response.status_code in [301, 302, 303, 307, 308]: | ||
| url = response.headers.get("Location") | ||
| print("Redirects url: ", url) | ||
|
|
||
| # --- 2. Try direct GET --- | ||
| response = requests.get(url, stream=True, allow_redirects=False) # no redirects here, we want to see if auth is required | ||
| www = response.headers.get('WWW-Authenticate', '') # get WWW-Authenticate header if present to check for Bearer auth | ||
|
|
||
| if (response.status_code == 401 or "bearer" in www.lower()): | ||
| print(f"Authentication required for {url}") | ||
| if not (vault_token_file): | ||
| raise ValueError("Vault token file not given for protected download") | ||
|
|
||
| # --- 3. Fetch Vault token --- | ||
| vault_token = __get_vault_access__(url, vault_token_file, auth_url, client_id) | ||
| headers = {"Authorization": f"Bearer {vault_token}"} | ||
|
|
||
| # --- 4. Retry with token --- | ||
| response = requests.get(url, headers=headers, stream=True) | ||
|
|
||
| response.raise_for_status() # Raise if still failing | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add timeouts and handle auth detection robustly
Network calls without timeouts can hang indefinitely. Add reasonable timeouts.
-response = requests.head(url, stream=True)
+response = requests.head(url, stream=True, timeout=30)
...
-response = requests.get(url, stream=True, allow_redirects=False) # no redirects here, we want to see if auth is required
+response = requests.get(url, stream=True, allow_redirects=False, timeout=30) # no redirects here, we want to see if auth is required
...
- response = requests.get(url, headers=headers, stream=True)
+ response = requests.get(url, headers=headers, stream=True, timeout=60)Optionally extract a module-level TIMEOUT constant for reuse.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| response = requests.head(url, stream=True) | |
| # Check for redirect and update URL if necessary | |
| if response.headers.get("Location") and response.status_code in [301, 302, 303, 307, 308]: | |
| url = response.headers.get("Location") | |
| print("Redirects url: ", url) | |
| # --- 2. Try direct GET --- | |
| response = requests.get(url, stream=True, allow_redirects=False) # no redirects here, we want to see if auth is required | |
| www = response.headers.get('WWW-Authenticate', '') # get WWW-Authenticate header if present to check for Bearer auth | |
| if (response.status_code == 401 or "bearer" in www.lower()): | |
| print(f"Authentication required for {url}") | |
| if not (vault_token_file): | |
| raise ValueError("Vault token file not given for protected download") | |
| # --- 3. Fetch Vault token --- | |
| vault_token = __get_vault_access__(url, vault_token_file, auth_url, client_id) | |
| headers = {"Authorization": f"Bearer {vault_token}"} | |
| # --- 4. Retry with token --- | |
| response = requests.get(url, headers=headers, stream=True) | |
| response.raise_for_status() # Raise if still failing | |
| response = requests.head(url, stream=True, timeout=30) | |
| # Check for redirect and update URL if necessary | |
| if response.headers.get("Location") and response.status_code in [301, 302, 303, 307, 308]: | |
| url = response.headers.get("Location") | |
| print("Redirects url: ", url) | |
| # --- 2. Try direct GET --- | |
| response = requests.get( | |
| url, | |
| stream=True, | |
| allow_redirects=False, # no redirects here, we want to see if auth is required | |
| timeout=30, | |
| ) | |
| www = response.headers.get('WWW-Authenticate', '') # get WWW-Authenticate header if present to check for Bearer auth | |
| if response.status_code == 401 or "bearer" in www.lower(): | |
| print(f"Authentication required for {url}") | |
| if not vault_token_file: | |
| raise ValueError("Vault token file not given for protected download") | |
| # --- 3. Fetch Vault token --- | |
| vault_token = __get_vault_access__(url, vault_token_file, auth_url, client_id) | |
| headers = {"Authorization": f"Bearer {vault_token}"} | |
| # --- 4. Retry with token --- | |
| response = requests.get( | |
| url, | |
| headers=headers, | |
| stream=True, | |
| timeout=60, | |
| ) | |
| response.raise_for_status() # Raise if still failing |
🧰 Tools
🪛 Ruff (0.13.3)
417-417: Probable use of requests call without timeout
(S113)
425-425: Probable use of requests call without timeout
(S113)
431-431: Avoid specifying long messages outside the exception class
(TRY003)
438-438: Probable use of requests call without timeout
(S113)
| resp = requests.post(auth_url, data={ | ||
| "client_id": client_id, | ||
| "grant_type": "refresh_token", | ||
| "refresh_token": refresh_token | ||
| }) | ||
| resp.raise_for_status() | ||
| access_token = resp.json()["access_token"] | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion | 🟠 Major
Add timeouts to token exchange requests
Avoid indefinite hangs during auth flows.
-resp = requests.post(auth_url, data={
+resp = requests.post(auth_url, data={
"client_id": client_id,
"grant_type": "refresh_token",
"refresh_token": refresh_token
-})
+}, timeout=30)
...
-resp = requests.post(auth_url, data={
+resp = requests.post(auth_url, data={
"client_id": client_id,
"grant_type": "urn:ietf:params:oauth:grant-type:token-exchange",
"subject_token": access_token,
"audience": audience
-})
+}, timeout=30)Also applies to: 493-501
🧰 Tools
🪛 Ruff (0.13.3)
474-474: Probable use of requests call without timeout
(S113)
🤖 Prompt for AI Agents
In databusclient/client.py around lines 474-481 (and similarly for lines
493-501), the requests.post calls performing token exchanges lack timeouts which
can cause indefinite hangs; modify these calls to pass a sensible timeout
parameter (e.g., timeout=10 or a configurable constant) and handle
timeout-related exceptions (requests.Timeout and requests.RequestException) as
appropriate (log and raise or retry) so the auth flow fails fast instead of
blocking indefinitely.
| def __handle_databus_collection__(uri: str) -> str: | ||
| headers = {"Accept": "text/sparql"} | ||
| return requests.get(uri, headers=headers).text | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion | 🟠 Major
Add timeout to collection fetch
- return requests.get(uri, headers=headers).text
+ return requests.get(uri, headers=headers, timeout=30).text📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def __handle_databus_collection__(uri: str) -> str: | |
| headers = {"Accept": "text/sparql"} | |
| return requests.get(uri, headers=headers).text | |
| def __handle_databus_collection__(uri: str) -> str: | |
| headers = {"Accept": "text/sparql"} | |
| return requests.get(uri, headers=headers, timeout=30).text |
🧰 Tools
🪛 Ruff (0.13.3)
602-602: Probable use of requests call without timeout
(S113)
🤖 Prompt for AI Agents
In databusclient/client.py around lines 600 to 603, the GET call to fetch the
collection has no timeout and can hang; update requests.get to include a
sensible timeout parameter (e.g., timeout=10) and handle network errors: call
requests.get(uri, headers=headers, timeout=<seconds>) then call
response.raise_for_status() and return response.text; catch
requests.exceptions.Timeout to raise a clear timeout error and catch
requests.exceptions.RequestException to surface other network/HTTP errors (or
re-raise) so callers don't block indefinitely.
| def __get_json_ld_from_databus__(uri: str) -> str: | ||
| headers = {"Accept": "application/ld+json"} | ||
| return requests.get(uri, headers=headers).text | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion | 🟠 Major
Add timeout to JSON-LD fetch
- return requests.get(uri, headers=headers).text
+ return requests.get(uri, headers=headers, timeout=30).text📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def __get_json_ld_from_databus__(uri: str) -> str: | |
| headers = {"Accept": "application/ld+json"} | |
| return requests.get(uri, headers=headers).text | |
| def __get_json_ld_from_databus__(uri: str) -> str: | |
| headers = {"Accept": "application/ld+json"} | |
| return requests.get(uri, headers=headers, timeout=30).text |
🧰 Tools
🪛 Ruff (0.13.3)
607-607: Probable use of requests call without timeout
(S113)
🤖 Prompt for AI Agents
In databusclient/client.py around lines 605-608, the JSON-LD fetch uses
requests.get without a timeout which can hang; update the function to call
requests.get with a reasonable timeout (e.g. timeout=10 or a
module-level/configurable constant) and propagate or convert
requests.exceptions.Timeout/RequestException into a clear error (or re-raise) so
callers don’t block indefinitely; also check response status (raise_for_status)
before returning response.text.
|
Note Unit test generation is an Early Access feature. Expect some limitations and changes as we gather feedback and continue to improve it. Generating unit tests... This may take up to 20 minutes. |
|
Note Unit test generation is an Early Access feature. Expect some limitations and changes as we gather feedback and continue to improve it. Generating unit tests... This may take up to 20 minutes. |
|
✅ UTG Post-Process Complete No new issues were detected in the generated code and all check runs have completed. The unit test generation process has completed successfully. |
|
Creating a PR to put the unit tests in... The changes have been created in this pull request: View PR |
|
Note Unit test generation is an Early Access feature. Expect some limitations and changes as we gather feedback and continue to improve it. Generating unit tests... This may take up to 20 minutes. |
|
✅ UTG Post-Process Complete No new issues were detected in the generated code and all check runs have completed. The unit test generation process has completed successfully. |
|
Creating a PR to put the unit tests in... The changes have been created in this pull request: View PR |
|
🎯 Unit Test Generation Post-Process Complete The unit test generation and CI/CD fixing process has completed after reaching the maximum number of iterations. Please review the generated tests and any remaining issues manually. |
|
Creating a PR to put the unit tests in... The changes have been created in this pull request: View PR |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
♻️ Duplicate comments (3)
databusclient/client.py (3)
418-438: Add timeouts to prevent indefinite hangs.Network calls without timeouts can block indefinitely, degrading reliability. This was flagged in past reviews and remains unresolved.
Apply this diff:
- response = requests.head(url, stream=True) + response = requests.head(url, stream=True, timeout=30) # Check for redirect and update URL if necessary if response.headers.get("Location") and response.status_code in [301, 302, 303, 307, 308]: url = response.headers.get("Location") print("Redirects url: ", url) # --- 2. Try direct GET --- - response = requests.get(url, stream=True, allow_redirects=False) + response = requests.get(url, stream=True, allow_redirects=False, timeout=30) www = response.headers.get('WWW-Authenticate', '') if (response.status_code == 401 or "bearer" in www.lower()): print(f"Authentication required for {url}") if not (vault_token_file): raise ValueError("Vault token file not given for protected download") # --- 3. Fetch Vault token --- vault_token = __get_vault_access__(url, vault_token_file, auth_url, client_id) headers = {"Authorization": f"Bearer {vault_token}"} # --- 4. Retry with token --- - response = requests.get(url, headers=headers, stream=True) + response = requests.get(url, headers=headers, stream=True, timeout=60)As per static analysis hints and past review comments.
474-501: Add timeouts to token exchange requests.Authentication flows without timeouts can hang indefinitely, preventing graceful failure.
Apply this diff:
# 2. Refresh token -> access token - resp = requests.post(auth_url, data={ + resp = requests.post(auth_url, data={ "client_id": client_id, "grant_type": "refresh_token", "refresh_token": refresh_token - }) + }, timeout=30) resp.raise_for_status() access_token = resp.json()["access_token"] # 3. Extract host as audience # Remove protocol prefix if download_url.startswith("https://"): host_part = download_url[len("https://"):] elif download_url.startswith("http://"): host_part = download_url[len("http://"):] else: host_part = download_url audience = host_part.split("/")[0] # 4. Access token -> Vault token - resp = requests.post(auth_url, data={ + resp = requests.post(auth_url, data={ "client_id": client_id, "grant_type": "urn:ietf:params:oauth:grant-type:token-exchange", "subject_token": access_token, "audience": audience - }) + }, timeout=30)As per static analysis hints and past review comments.
600-607: Add timeouts to metadata fetch operations.Metadata retrieval calls can hang without timeouts, blocking the entire download workflow.
Apply this diff:
def __handle_databus_collection__(uri: str) -> str: headers = {"Accept": "text/sparql"} - return requests.get(uri, headers=headers).text + return requests.get(uri, headers=headers, timeout=30).text def __get_json_ld_from_databus__(uri: str) -> str: headers = {"Accept": "application/ld+json"} - return requests.get(uri, headers=headers).text + return requests.get(uri, headers=headers, timeout=30).textAs per static analysis hints and past review comments.
🧹 Nitpick comments (3)
README.md (1)
57-157: Consider adding language specifiers to fenced code blocks.Markdown linters recommend specifying the language for fenced code blocks to improve syntax highlighting and readability. The following blocks could benefit from language specifiers: lines 57, 61, 72, 76, 104, 109, 114, 119, 126, 131, 140, 150, 154.
Example:
-``` +```bash databusclient deploy --version-id https://databus.dbpedia.org/user1/group1/artifact1/2022-05-18 ...As per static analysis hints. </blockquote></details> <details> <summary>databusclient/client.py (2)</summary><blockquote> `610-614`: **Add explicit `Optional` type hints for nullable parameters.** PEP 484 prohibits implicit `Optional`. Make null-safety explicit in the signature. Apply this diff: ```diff def __download_list__(urls: List[str], localDir: str, - vault_token_file: str = None, - auth_url: str = None, - client_id: str = None) -> None: + vault_token_file: Optional[str] = None, + auth_url: Optional[str] = None, + client_id: Optional[str] = None) -> None:As per static analysis hints.
617-617: Prefix unused unpacked variable with underscore.The
hostvariable is unpacked but never used. Follow Python convention for dummy variables.Apply this diff:
- host, account, group, artifact, version, file = __get_databus_id_parts__(url) + _host, account, group, artifact, version, file = __get_databus_id_parts__(url)As per static analysis hints.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
README.md(3 hunks)databusclient/cli.py(1 hunks)databusclient/client.py(4 hunks)test.sh(1 hunks)tests/test_download.py(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
tests/test_download.py (2)
databusclient/cli.py (1)
download(46-57)databusclient/client.py (1)
download(635-714)
databusclient/cli.py (1)
databusclient/client.py (3)
deploy(352-393)create_dataset(209-349)download(635-714)
databusclient/client.py (1)
databusclient/cli.py (1)
download(46-57)
🪛 markdownlint-cli2 (0.18.1)
README.md
57-57: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
61-61: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
72-72: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
76-76: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
104-104: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
109-109: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
114-114: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
119-119: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
126-126: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
131-131: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
140-140: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
150-150: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
154-154: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🪛 Ruff (0.13.3)
databusclient/client.py
418-418: Probable use of requests call without timeout
(S113)
425-425: Probable use of requests call without timeout
(S113)
431-431: Avoid specifying long messages outside the exception class
(TRY003)
438-438: Probable use of requests call without timeout
(S113)
453-453: Avoid specifying long messages outside the exception class
(TRY003)
467-467: Avoid specifying long messages outside the exception class
(TRY003)
474-474: Probable use of requests call without timeout
(S113)
493-493: Probable use of requests call without timeout
(S113)
570-570: Avoid specifying long messages outside the exception class
(TRY003)
602-602: Probable use of requests call without timeout
(S113)
607-607: Probable use of requests call without timeout
(S113)
612-612: PEP 484 prohibits implicit Optional
Convert to Optional[T]
(RUF013)
613-613: PEP 484 prohibits implicit Optional
Convert to Optional[T]
(RUF013)
614-614: PEP 484 prohibits implicit Optional
Convert to Optional[T]
(RUF013)
617-617: Unpacked variable host is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
712-712: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build
- GitHub Check: build
🔇 Additional comments (6)
test.sh (1)
4-4: LGTM! Parameter name updated correctly.The change from
--versionidto--version-idaligns with the CLI migration from Typer to Click indatabusclient/cli.py.tests/test_download.py (1)
6-16: LGTM! Enhanced query with sensible constraints.The updated SPARQL query adds appropriate filters:
- URL prefix filter ensures results are from the expected databus instance
- Size constraint (< 100MB) improves test execution time
databusclient/cli.py (1)
2-61: LGTM! Clean migration from Typer to Click.The CLI migration correctly:
- Maps
--version-idtoversion_idparameter (line 15)- Maps
--licensetolicense_urlparameter (line 23)- Maintains consistent parameter naming throughout
- Adds proper entry point for CLI invocation
Past review concerns about option-parameter mismatches have been resolved.
databusclient/client.py (3)
414-416: LGTM! Directory creation is now properly guarded.Past review concern about
os.makedirswith empty dirname has been resolved.
453-453: LGTM! Proper exception with descriptive message.Past review concern about bare
raisestatement has been resolved.
669-714: LGTM! Vault authentication options properly propagated.All download paths now correctly pass
vault_token_file,auth_url, andclient_idto__download_list__, enabling protected downloads throughout. Past review concern has been resolved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
databusclient/client.py (1)
532-541: Fix return type annotation mismatch.The function signature declares
List[str]as the return type, but the implementation usesyield, making it a generator function. This type mismatch can cause confusion and potential runtime issues with type checkers.Apply this diff to fix the return type:
-def __handle_databus_file_query__(endpoint_url, query) -> List[str]: +def __handle_databus_file_query__(endpoint_url, query): result_dict = __query_sparql__(endpoint_url, query) for binding in result_dict['results']['bindings']:Alternatively, if you want to maintain type hints, import
Generatorfromtypingand use:from typing import Generator def __handle_databus_file_query__(endpoint_url, query) -> Generator[str, None, None]:
♻️ Duplicate comments (3)
databusclient/client.py (3)
607-615: Add timeouts to metadata fetch operations.Both
__handle_databus_collection__and__get_json_ld_from_databus__lack timeout parameters on their GET requests, which can cause indefinite hangs when fetching collection queries or JSON-LD metadata.Apply this diff:
def __handle_databus_collection__(uri: str) -> str: headers = {"Accept": "text/sparql"} - return requests.get(uri, headers=headers).text + return requests.get(uri, headers=headers, timeout=30).text def __get_json_ld_from_databus__(uri: str) -> str: headers = {"Accept": "application/ld+json"} - return requests.get(uri, headers=headers).text + return requests.get(uri, headers=headers, timeout=30).textBased on learnings.
418-448: Add timeouts to prevent indefinite hangs.The network calls lack timeout parameters, which can cause the client to hang indefinitely if the server is unresponsive. This is especially problematic for HEAD and GET requests that may encounter slow or stalled connections.
Apply this diff to add timeouts:
- response = requests.head(url, stream=True) + response = requests.head(url, stream=True, timeout=30) # Check for redirect and update URL if necessary if response.headers.get("Location") and response.status_code in [301, 302, 303, 307, 308]: url = response.headers.get("Location") print("Redirects url: ", url) # --- 2. Try direct GET --- - response = requests.get(url, stream=True, allow_redirects=False) # no redirects here, we want to see if auth is required + response = requests.get(url, stream=True, allow_redirects=False, timeout=30) # no redirects here, we want to see if auth is required www = response.headers.get('WWW-Authenticate', '') # get WWW-Authenticate header if present to check for Bearer auth if (response.status_code == 401 or "bearer" in www.lower()): print(f"Authentication required for {url}") if not (vault_token_file): raise ValueError("Vault token file not given for protected download") # --- 3. Fetch Vault token --- vault_token = __get_vault_access__(url, vault_token_file, auth_url, client_id) headers = {"Authorization": f"Bearer {vault_token}"} # --- 4. Retry with token --- - response = requests.get(url, headers=headers, stream=True) + response = requests.get(url, headers=headers, stream=True, timeout=60)Consider defining a module-level constant (e.g.,
DEFAULT_TIMEOUT = 30) for consistency across all network calls.Based on learnings.
481-506: Add timeouts to token exchange requests.The token exchange POST requests lack timeout parameters, which can cause indefinite hangs during authentication flows.
Apply this diff:
# 2. Refresh token -> access token resp = requests.post(auth_url, data={ "client_id": client_id, "grant_type": "refresh_token", "refresh_token": refresh_token - }) + }, timeout=30) resp.raise_for_status() access_token = resp.json()["access_token"] # 3. Extract host as audience # Remove protocol prefix if download_url.startswith("https://"): host_part = download_url[len("https://"):] elif download_url.startswith("http://"): host_part = download_url[len("http://"):] else: host_part = download_url audience = host_part.split("/")[0] # host is before first "/" # 4. Access token -> Vault token resp = requests.post(auth_url, data={ "client_id": client_id, "grant_type": "urn:ietf:params:oauth:grant-type:token-exchange", "subject_token": access_token, "audience": audience - }) + }, timeout=30) resp.raise_for_status() vault_token = resp.json()["access_token"]Based on learnings.
🧹 Nitpick comments (3)
README.md (1)
104-156: Add language identifiers to fenced code blocks.Multiple fenced code blocks throughout the download examples lack language specifiers, which prevents proper syntax highlighting. Add
bashorshellidentifiers to these code blocks.Apply language identifiers to code blocks at lines 104, 109, 114, 119, 126, 131, 140, 150, and 154:
-``` +```bash databusclient download https://databus.dbpedia.org/dbpedia/mappings/mappingbased-literals/2022.12.01/mappingbased-literals_lang=az.ttl.bz2Repeat this pattern for all affected code blocks. This also applies to the deploy command examples at lines 57 and 61, and the help output blocks at lines 72 and 76. </blockquote></details> <details> <summary>databusclient/client.py (2)</summary><blockquote> `617-633`: **Use explicit `Optional` type hints.** The function parameters `vault_token_file`, `auth_url`, and `client_id` default to `None` but lack explicit `Optional[]` type hints, which violates PEP 484. Apply this diff: ```diff def __download_list__(urls: List[str], localDir: str, - vault_token_file: str = None, - auth_url: str = None, - client_id: str = None) -> None: + vault_token_file: Optional[str] = None, + auth_url: Optional[str] = None, + client_id: Optional[str] = None) -> None:The same applies to the
downloadfunction signature at lines 642-649.
623-626: ClarifylocalDirreassignment logic.The function reassigns
localDirinside the loop when it'sNone, which means the directory path is recalculated for every URL even though it should remain consistent across iterations. Consider moving this logic outside the loop or clarifying the intended behavior.If
localDirshould be computed once when not provided, move the logic before the loop:+ if localDir is None: + # Use first URL to determine base directory + first_url = urls[0] if urls else None + if first_url: + host, account, group, artifact, version, _ = __get_databus_id_parts__(first_url) + localDir = os.path.join(os.getcwd(), account, group, artifact, version if version is not None else "latest") + print(f"Local directory not given, using {localDir}") + for url in urls: - if localDir is None: - host, account, group, artifact, version, file = __get_databus_id_parts__(url) - localDir = os.path.join(os.getcwd(), account, group, artifact, version if version is not None else "latest") - print(f"Local directory not given, using {localDir}") - file = url.split("/")[-1]Also note that the unpacked
hostvariable is never used (line 624).
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
README.md(3 hunks)databusclient/client.py(4 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
databusclient/client.py (1)
databusclient/cli.py (1)
download(46-57)
🪛 markdownlint-cli2 (0.18.1)
README.md
57-57: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
61-61: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
72-72: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
76-76: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
104-104: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
109-109: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
114-114: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
119-119: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
126-126: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
131-131: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
140-140: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
150-150: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
154-154: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🪛 Ruff (0.13.3)
databusclient/client.py
418-418: Probable use of requests call without timeout
(S113)
425-425: Probable use of requests call without timeout
(S113)
431-431: Avoid specifying long messages outside the exception class
(TRY003)
438-438: Probable use of requests call without timeout
(S113)
447-447: Use raise without specifying exception name
Remove exception name
(TRY201)
460-460: Avoid specifying long messages outside the exception class
(TRY003)
474-474: Avoid specifying long messages outside the exception class
(TRY003)
481-481: Probable use of requests call without timeout
(S113)
500-500: Probable use of requests call without timeout
(S113)
577-577: Avoid specifying long messages outside the exception class
(TRY003)
609-609: Probable use of requests call without timeout
(S113)
614-614: Probable use of requests call without timeout
(S113)
619-619: PEP 484 prohibits implicit Optional
Convert to Optional[T]
(RUF013)
620-620: PEP 484 prohibits implicit Optional
Convert to Optional[T]
(RUF013)
621-621: PEP 484 prohibits implicit Optional
Convert to Optional[T]
(RUF013)
624-624: Unpacked variable host is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
719-719: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build
- GitHub Check: build
🔇 Additional comments (3)
databusclient/client.py (3)
561-580: LGTM!The version sorting logic correctly handles both single version (dict) and multiple versions (list) cases, and the descending sort ensures the latest version is returned first.
543-558: LGTM!The JSON-LD parsing correctly extracts file links from Part nodes for accurate access tracking rather than directly using downloadURLs.
583-600: LGTM!The artifact filtering correctly excludes versioned URIs by checking if the version component is None.
Summary by CodeRabbit
New Features
Changes
Documentation
Tests