
Conversation

@Integer-Ctrl (Contributor) commented Oct 8, 2025

Summary by CodeRabbit

  • New Features

    • Official Docker image added for containerized use.
    • Downloads now support vault token authentication for protected resources.
    • Broader download support for collections, artifacts, versions, groups, and accounts with improved batch handling.
  • Changes

    • CLI rewritten with a new command framework and renamed flags (e.g., --version-id); package entry point updated.
    • Updated dependencies and build configuration.
  • Documentation

    • Expanded README with Docker usage, download examples, and authentication guidance.
  • Tests

    • Adjusted download-related test query and invocation formatting.

coderabbitai bot commented Oct 8, 2025

Walkthrough

Adds a Dockerfile; replaces a Typer-based CLI with Click and updates the package entry point; extends client download logic to handle Vault-protected/redirected Databus resources and multiple Databus URI types with new helpers; updates dependencies and expands README examples and tests.

Changes

Cohort / File(s) Summary of edits
Containerization
Dockerfile
New Dockerfile using python:3.10-slim, sets WORKDIR /data, copies repo, runs pip install ., and sets ENTRYPOINT to the CLI.
Documentation
README.md
Expanded Docker image usage and Download command docs (file/version/artifact/group/collection/query), added vault authentication and docker run examples, many typo/formatting fixes, and notes about deploy option rename.
CLI framework migration
databusclient/cli.py
Replaced Typer with Click; added Click group app, converted deploy and download to Click commands with renamed options/arguments, added module entry point.
Download / auth flow enhancements
databusclient/client.py
Added Vault token exchange and auth_url/client_id support, HEAD redirect handling, Bearer-token retry path; new helpers to parse Databus URIs, fetch JSON-LD, resolve collection/group/artifact/version/account resources, batch-download lists; updated function signatures to accept token/auth params and to return lists where appropriate.
Build and entry point updates
pyproject.toml
Replaced dependency typer → click; added rdflib and black to dev deps; changed script entry point to databusclient.cli:app; added build-backend spec.
Tests / scripts
test.sh, tests/test_download.py
Updated CLI flag --versionid → --version-id in test script; adjusted SPARQL test query to select file details with URL prefix and size filter; formatting tweaks in tests.
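The download changes above hinge on splitting a Databus URI into its identifier parts before resolving it. A minimal stdlib sketch (hypothetical helper name, assuming the `https://<host>/<account>/<group>/<artifact>/<version>/<file>` layout) could look like:

```python
from urllib.parse import urlparse

def get_databus_id_parts(uri):
    """Split a Databus URI into (host, account, group, artifact, version, file).

    Shorter URIs (account, group, artifact, or version level) parse with the
    same helper; missing trailing parts come back as None.
    """
    parsed = urlparse(uri)
    segments = [s for s in parsed.path.split("/") if s]
    segments += [None] * (5 - len(segments))  # pad to the five path parts
    account, group, artifact, version, file = segments[:5]
    return parsed.netloc, account, group, artifact, version, file
```

Dispatching on which parts are None is then enough to decide whether a URI names a file, a version, an artifact, a group, or an account.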

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor U as User
  participant CLI as Click CLI (app)
  participant Client as databusclient.client
  participant DB as Databus / SPARQL
  participant Auth as Auth Server (OIDC)
  participant Vault as Vault Token Service
  participant Net as HTTP

  U->>CLI: databusclient download [URIs] --localdir --databus [--token --authurl --clientid]
  CLI->>Client: download(localDir, endpoint, databusURIs, token, auth_url, client_id)

  rect rgb(247,250,245)
    note right of Client: Resolve input URIs to file URLs
    Client->>Client: Parse URIs (collection/artifact/version/group/account)
    Client->>DB: Fetch JSON‑LD / run SPARQL queries
    DB-->>Client: Metadata / file URLs
    Client->>Client: Build list of file download URLs
  end

  loop For each file URL
    Client->>Net: HEAD (follow redirects)
    Net-->>Client: 200 / 3xx (+final URL)
    Client->>Net: GET file (no auth)
    alt 401 or auth required
      note right of Client: Protected flow — obtain tokens
      Client->>Auth: Exchange refresh token (auth_url, client_id)
      Auth-->>Client: Access token
      Client->>Vault: Exchange for file-scoped token
      Vault-->>Client: Vault download token
      Client->>Net: GET file with Authorization: Bearer <vault token>
      Net-->>Client: 200 + content
    else Public
      Net-->>Client: 200 + content
    end
    Client->>Client: Save file to localDir
  end

  Client-->>CLI: Completed
  CLI-->>U: Exit status / logs
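The protected branch of the diagram (steps 89–93) can be sketched as a plain function. Names are illustrative, and `post` stands in for `requests.post` so the sketch stays testable without a network; the second exchange follows the RFC 8693 token-exchange grant the diff introduces:

```python
from urllib.parse import urlparse

def fetch_vault_token(post, auth_url, client_id, refresh_token, download_url):
    """Two-step exchange: refresh token -> access token, then
    access token -> file-scoped Vault token (RFC 8693 token exchange).
    """
    # Step 1: exchange the long-lived refresh token for an access token
    resp = post(auth_url, data={
        "client_id": client_id,
        "grant_type": "refresh_token",
        "refresh_token": refresh_token,
    }, timeout=30)
    resp.raise_for_status()
    access_token = resp.json()["access_token"]

    # Step 2: the host serving the file is the audience of the scoped token
    audience = urlparse(download_url).netloc
    resp = post(auth_url, data={
        "client_id": client_id,
        "grant_type": "urn:ietf:params:oauth:grant-type:token-exchange",
        "subject_token": access_token,
        "audience": audience,
    }, timeout=30)
    resp.raise_for_status()
    return resp.json()["access_token"]
```

As in the diagram, the file GET only retries with `Authorization: Bearer <vault token>` after a 401 or a Bearer challenge, so public files never trigger the exchange.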

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

I nibble bytes in Docker hay,
Click hops in where Typer lay—
Vaulted doors yield token treats,
Files tumble down like crunchy eats.
Rabbit logs: "Downloaded — repeat!"

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 57.14% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The pull request title succinctly captures the two primary changes by referencing both download functionality and the addition of a Docker image, reflecting the main objectives without extraneous detail.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch download-capabilities

Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai bot left a comment


Actionable comments posted: 9

🧹 Nitpick comments (6)
Dockerfile (1)

8-11: Harden and slim the image

  • Avoid pip cache; consider non-root user for security.

Apply:

-RUN pip install .
+RUN pip install --no-cache-dir .

Optionally add:

# Reduce Python overhead
ENV PYTHONDONTWRITEBYTECODE=1 PYTHONUNBUFFERED=1
# Create non-root user
RUN useradd -m app && chown -R app:app /data
USER app
README.md (1)

71-99: Add languages to fenced code blocks and align invocation

  • Add bash/man languages to code fences to satisfy MD040 and improve rendering.
  • Prefer the installed script databusclient ... over python3 -m databusclient ... (there’s no package-level __main__ here).

Examples:

-```
+```bash
 databusclient deploy --help
 ```

-```
+```man
 Usage: python3 -m databusclient download [OPTIONS] DATABUSURIS...
 ```

Also quote the shell substitution in the docker example:

-docker run --rm -v $(pwd):/data dbpedia/databus-python-client download ...
+docker run --rm -v "$(pwd)":/data dbpedia/databus-python-client download ...

Also applies to: 103-122, 125-134, 139-143, 150-156

databusclient/cli.py (1)

39-46: Type safety and UX polish for options

  • Consider type=click.Path(file_okay=False, dir_okay=True, writable=True) for --localdir.
  • Provide type=str and meaningful metavar defaults to improve --help.
-@click.option("--localdir", help="Local databus folder (if not given, databus folder structure is created in current working directory)")
+@click.option("--localdir", type=click.Path(file_okay=False), help="Local databus folder (default: create Databus structure under CWD)")
databusclient/client.py (3)

525-534: Return type doesn’t match behavior (generator vs list)

The function yields values; annotate as an iterable or convert to list.

Option A (annotation):

-def __handle_databus_file_query__(endpoint_url, query) -> List[str]:
+from typing import Iterable
+def __handle_databus_file_query__(endpoint_url, query) -> Iterable[str]:

Option B (materialize list):

-    for binding in result_dict['results']['bindings']:
+    values = []
+    for binding in result_dict['results']['bindings']:
         ...
-        yield value
+        values.append(value)
+    return values

610-615: Type hints: use Optional for optional params

Align with PEP 484 and silence tooling warnings.

-def __download_list__(urls: List[str],
-                      localDir: str,
-                      vault_token_file: str = None,
-                      auth_url: str = None,
-                      client_id: str = None) -> None:
+from typing import Optional
+def __download_list__(urls: List[str],
+                      localDir: Optional[str],
+                      vault_token_file: Optional[str] = None,
+                      auth_url: Optional[str] = None,
+                      client_id: Optional[str] = None) -> None:

710-713: Minor: fix print formatting

Braces are printed literally, because print receives the template and the value as two separate arguments rather than formatting them.

-            print("QUERY {}", databusURI.replace("\n", " "))
+            query_one_line = databusURI.replace("\n", " ")
+            print(f"QUERY {query_one_line}")

(The replacement is done outside the f-string because backslashes inside f-string expressions are a syntax error before Python 3.12.)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e71a886 and 9e0776a.

⛔ Files ignored due to path filters (1)
  • poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (5)
  • Dockerfile (1 hunks)
  • README.md (3 hunks)
  • databusclient/cli.py (1 hunks)
  • databusclient/client.py (4 hunks)
  • pyproject.toml (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
databusclient/client.py (1)
databusclient/cli.py (1)
  • download (46-57)
databusclient/cli.py (1)
databusclient/client.py (3)
  • deploy (352-393)
  • create_dataset (209-349)
  • download (635-714)
🪛 markdownlint-cli2 (0.18.1)
README.md

72-72: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


76-76: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


104-104: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


109-109: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


114-114: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


119-119: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


126-126: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


131-131: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


140-140: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


150-150: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


154-154: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Ruff (0.13.3)
databusclient/client.py

417-417: Probable use of requests call without timeout

(S113)


425-425: Probable use of requests call without timeout

(S113)


431-431: Avoid specifying long messages outside the exception class

(TRY003)


438-438: Probable use of requests call without timeout

(S113)


467-467: Avoid specifying long messages outside the exception class

(TRY003)


474-474: Probable use of requests call without timeout

(S113)


493-493: Probable use of requests call without timeout

(S113)


570-570: Avoid specifying long messages outside the exception class

(TRY003)


602-602: Probable use of requests call without timeout

(S113)


607-607: Probable use of requests call without timeout

(S113)


612-612: PEP 484 prohibits implicit Optional

Convert to Optional[T]

(RUF013)


613-613: PEP 484 prohibits implicit Optional

Convert to Optional[T]

(RUF013)


614-614: PEP 484 prohibits implicit Optional

Convert to Optional[T]

(RUF013)


617-617: Unpacked variable host is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


712-712: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build
  • GitHub Check: build
🔇 Additional comments (1)
pyproject.toml (1)

11-11: Packaging/entrypoint alignment looks good

  • Click pinned to 8.0.x fits Python 3.9 support; entry point matches databusclient.cli:app.
  • Verify rdflib is intended as a runtime dep (not only dev).

Also applies to: 15-15, 18-18, 22-22

Comment on lines 417 to 441
    response = requests.head(url, stream=True)

    # Check for redirect and update URL if necessary
    if response.headers.get("Location") and response.status_code in [301, 302, 303, 307, 308]:
        url = response.headers.get("Location")
        print("Redirects url: ", url)

    # --- 2. Try direct GET ---
    response = requests.get(url, stream=True, allow_redirects=False)  # no redirects here, we want to see if auth is required
    www = response.headers.get('WWW-Authenticate', '')  # get WWW-Authenticate header if present to check for Bearer auth

    if (response.status_code == 401 or "bearer" in www.lower()):
        print(f"Authentication required for {url}")
        if not (vault_token_file):
            raise ValueError("Vault token file not given for protected download")

        # --- 3. Fetch Vault token ---
        vault_token = __get_vault_access__(url, vault_token_file, auth_url, client_id)
        headers = {"Authorization": f"Bearer {vault_token}"}

        # --- 4. Retry with token ---
        response = requests.get(url, headers=headers, stream=True)

    response.raise_for_status()  # Raise if still failing


⚠️ Potential issue | 🟠 Major

Add timeouts and handle auth detection robustly

Network calls without timeouts can hang indefinitely. Add reasonable timeouts.

-response = requests.head(url, stream=True)
+response = requests.head(url, stream=True, timeout=30)
 ...
-response = requests.get(url, stream=True, allow_redirects=False)  # no redirects here, we want to see if auth is required
+response = requests.get(url, stream=True, allow_redirects=False, timeout=30)  # no redirects here, we want to see if auth is required
 ...
-        response = requests.get(url, headers=headers, stream=True)
+        response = requests.get(url, headers=headers, stream=True, timeout=60)

Optionally extract a module-level TIMEOUT constant for reuse.
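A sketch of that optional constant (names are illustrative): centralizing the default keeps every call site consistent while still letting callers override it.

```python
# Hypothetical module-level default so every request in client.py agrees
DEFAULT_TIMEOUT = 30  # seconds

def with_default_timeout(kwargs, default=DEFAULT_TIMEOUT):
    """Fill in a timeout unless the caller already chose one."""
    kwargs.setdefault("timeout", default)
    return kwargs
```

Each call site then becomes, e.g., `requests.get(url, **with_default_timeout(opts))`, and a single constant governs the whole module.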

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Before:

    response = requests.head(url, stream=True)
    # Check for redirect and update URL if necessary
    if response.headers.get("Location") and response.status_code in [301, 302, 303, 307, 308]:
        url = response.headers.get("Location")
        print("Redirects url: ", url)
    # --- 2. Try direct GET ---
    response = requests.get(url, stream=True, allow_redirects=False)  # no redirects here, we want to see if auth is required
    www = response.headers.get('WWW-Authenticate', '')  # get WWW-Authenticate header if present to check for Bearer auth
    if (response.status_code == 401 or "bearer" in www.lower()):
        print(f"Authentication required for {url}")
        if not (vault_token_file):
            raise ValueError("Vault token file not given for protected download")
        # --- 3. Fetch Vault token ---
        vault_token = __get_vault_access__(url, vault_token_file, auth_url, client_id)
        headers = {"Authorization": f"Bearer {vault_token}"}
        # --- 4. Retry with token ---
        response = requests.get(url, headers=headers, stream=True)
    response.raise_for_status()  # Raise if still failing

After:

    response = requests.head(url, stream=True, timeout=30)
    # Check for redirect and update URL if necessary
    if response.headers.get("Location") and response.status_code in [301, 302, 303, 307, 308]:
        url = response.headers.get("Location")
        print("Redirects url: ", url)
    # --- 2. Try direct GET ---
    response = requests.get(
        url,
        stream=True,
        allow_redirects=False,  # no redirects here, we want to see if auth is required
        timeout=30,
    )
    www = response.headers.get('WWW-Authenticate', '')  # get WWW-Authenticate header if present to check for Bearer auth
    if response.status_code == 401 or "bearer" in www.lower():
        print(f"Authentication required for {url}")
        if not vault_token_file:
            raise ValueError("Vault token file not given for protected download")
        # --- 3. Fetch Vault token ---
        vault_token = __get_vault_access__(url, vault_token_file, auth_url, client_id)
        headers = {"Authorization": f"Bearer {vault_token}"}
        # --- 4. Retry with token ---
        response = requests.get(
            url,
            headers=headers,
            stream=True,
            timeout=60,
        )
    response.raise_for_status()  # Raise if still failing
🧰 Tools
🪛 Ruff (0.13.3)

417-417: Probable use of requests call without timeout

(S113)


425-425: Probable use of requests call without timeout

(S113)


431-431: Avoid specifying long messages outside the exception class

(TRY003)


438-438: Probable use of requests call without timeout

(S113)

Comment on lines +474 to +481
    resp = requests.post(auth_url, data={
        "client_id": client_id,
        "grant_type": "refresh_token",
        "refresh_token": refresh_token
    })
    resp.raise_for_status()
    access_token = resp.json()["access_token"]


🛠️ Refactor suggestion | 🟠 Major

Add timeouts to token exchange requests

Avoid indefinite hangs during auth flows.

-resp = requests.post(auth_url, data={
+resp = requests.post(auth_url, data={
     "client_id": client_id,
     "grant_type": "refresh_token",
     "refresh_token": refresh_token
-})
+}, timeout=30)
 ...
-resp = requests.post(auth_url, data={
+resp = requests.post(auth_url, data={
     "client_id": client_id,
     "grant_type": "urn:ietf:params:oauth:grant-type:token-exchange",
     "subject_token": access_token,
     "audience": audience
-})
+}, timeout=30)

Also applies to: 493-501

🧰 Tools
🪛 Ruff (0.13.3)

474-474: Probable use of requests call without timeout

(S113)

🤖 Prompt for AI Agents
In databusclient/client.py around lines 474-481 (and similarly for lines
493-501), the requests.post calls performing token exchanges lack timeouts which
can cause indefinite hangs; modify these calls to pass a sensible timeout
parameter (e.g., timeout=10 or a configurable constant) and handle
timeout-related exceptions (requests.Timeout and requests.RequestException) as
appropriate (log and raise or retry) so the auth flow fails fast instead of
blocking indefinitely.

Comment on lines +600 to 603
def __handle_databus_collection__(uri: str) -> str:
    headers = {"Accept": "text/sparql"}
    return requests.get(uri, headers=headers).text


🛠️ Refactor suggestion | 🟠 Major

Add timeout to collection fetch

-    return requests.get(uri, headers=headers).text
+    return requests.get(uri, headers=headers, timeout=30).text
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Before:

def __handle_databus_collection__(uri: str) -> str:
    headers = {"Accept": "text/sparql"}
    return requests.get(uri, headers=headers).text

After:

def __handle_databus_collection__(uri: str) -> str:
    headers = {"Accept": "text/sparql"}
    return requests.get(uri, headers=headers, timeout=30).text
🧰 Tools
🪛 Ruff (0.13.3)

602-602: Probable use of requests call without timeout

(S113)

🤖 Prompt for AI Agents
In databusclient/client.py around lines 600 to 603, the GET call to fetch the
collection has no timeout and can hang; update requests.get to include a
sensible timeout parameter (e.g., timeout=10) and handle network errors: call
requests.get(uri, headers=headers, timeout=<seconds>) then call
response.raise_for_status() and return response.text; catch
requests.exceptions.Timeout to raise a clear timeout error and catch
requests.exceptions.RequestException to surface other network/HTTP errors (or
re-raise) so callers don't block indefinitely.

Comment on lines +605 to +608
def __get_json_ld_from_databus__(uri: str) -> str:
    headers = {"Accept": "application/ld+json"}
    return requests.get(uri, headers=headers).text


🛠️ Refactor suggestion | 🟠 Major

Add timeout to JSON-LD fetch

-    return requests.get(uri, headers=headers).text
+    return requests.get(uri, headers=headers, timeout=30).text
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Before:

def __get_json_ld_from_databus__(uri: str) -> str:
    headers = {"Accept": "application/ld+json"}
    return requests.get(uri, headers=headers).text

After:

def __get_json_ld_from_databus__(uri: str) -> str:
    headers = {"Accept": "application/ld+json"}
    return requests.get(uri, headers=headers, timeout=30).text
🧰 Tools
🪛 Ruff (0.13.3)

607-607: Probable use of requests call without timeout

(S113)

🤖 Prompt for AI Agents
In databusclient/client.py around lines 605-608, the JSON-LD fetch uses
requests.get without a timeout which can hang; update the function to call
requests.get with a reasonable timeout (e.g. timeout=10 or a
module-level/configurable constant) and propagate or convert
requests.exceptions.Timeout/RequestException into a clear error (or re-raise) so
callers don’t block indefinitely; also check response status (raise_for_status)
before returning response.text.

coderabbitai bot commented Oct 8, 2025

Note

Unit test generation is an Early Access feature. Expect some limitations and changes as we gather feedback and continue to improve it.


Generating unit tests... This may take up to 20 minutes.

coderabbitai bot commented Oct 8, 2025

UTG Post-Process Complete

No new issues were detected in the generated code and all check runs have completed. The unit test generation process has completed successfully.

coderabbitai bot commented Oct 8, 2025

Creating a PR to put the unit tests in...

The changes have been created in this pull request: View PR

coderabbitai bot commented Oct 8, 2025

🎯 Unit Test Generation Post-Process Complete

The unit test generation and CI/CD fixing process has completed after reaching the maximum number of iterations. Please review the generated tests and any remaining issues manually.

coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (3)
databusclient/client.py (3)

418-438: Add timeouts to prevent indefinite hangs.

Network calls without timeouts can block indefinitely, degrading reliability. This was flagged in past reviews and remains unresolved.

Apply this diff:

-    response = requests.head(url, stream=True)
+    response = requests.head(url, stream=True, timeout=30)
     # Check for redirect and update URL if necessary
     if response.headers.get("Location") and response.status_code in [301, 302, 303, 307, 308]:
         url = response.headers.get("Location")
         print("Redirects url: ", url)
 
     # --- 2. Try direct GET ---
-    response = requests.get(url, stream=True, allow_redirects=False)
+    response = requests.get(url, stream=True, allow_redirects=False, timeout=30)
     www = response.headers.get('WWW-Authenticate', '')
 
     if (response.status_code == 401 or "bearer" in www.lower()):
         print(f"Authentication required for {url}")
         if not (vault_token_file):
             raise ValueError("Vault token file not given for protected download")
 
         # --- 3. Fetch Vault token ---
         vault_token = __get_vault_access__(url, vault_token_file, auth_url, client_id)
         headers = {"Authorization": f"Bearer {vault_token}"}
 
         # --- 4. Retry with token ---
-        response = requests.get(url, headers=headers, stream=True)
+        response = requests.get(url, headers=headers, stream=True, timeout=60)

As per static analysis hints and past review comments.


474-501: Add timeouts to token exchange requests.

Authentication flows without timeouts can hang indefinitely, preventing graceful failure.

Apply this diff:

     # 2. Refresh token -> access token
-    resp = requests.post(auth_url, data={
+    resp = requests.post(auth_url, data={
         "client_id": client_id,
         "grant_type": "refresh_token",
         "refresh_token": refresh_token
-    })
+    }, timeout=30)
     resp.raise_for_status()
     access_token = resp.json()["access_token"]
 
     # 3. Extract host as audience
     # Remove protocol prefix
     if download_url.startswith("https://"):
         host_part = download_url[len("https://"):]
     elif download_url.startswith("http://"):
         host_part = download_url[len("http://"):]
     else:
         host_part = download_url
     audience = host_part.split("/")[0]
 
     # 4. Access token -> Vault token
-    resp = requests.post(auth_url, data={
+    resp = requests.post(auth_url, data={
         "client_id": client_id,
         "grant_type": "urn:ietf:params:oauth:grant-type:token-exchange",
         "subject_token": access_token,
         "audience": audience
-    })
+    }, timeout=30)

As per static analysis hints and past review comments.


600-607: Add timeouts to metadata fetch operations.

Metadata retrieval calls can hang without timeouts, blocking the entire download workflow.

Apply this diff:

 def __handle_databus_collection__(uri: str) -> str:
     headers = {"Accept": "text/sparql"}
-    return requests.get(uri, headers=headers).text
+    return requests.get(uri, headers=headers, timeout=30).text
 
 
 def __get_json_ld_from_databus__(uri: str) -> str:
     headers = {"Accept": "application/ld+json"}
-    return requests.get(uri, headers=headers).text
+    return requests.get(uri, headers=headers, timeout=30).text

As per static analysis hints and past review comments.

🧹 Nitpick comments (3)
README.md (1)

57-157: Consider adding language specifiers to fenced code blocks.

Markdown linters recommend specifying the language for fenced code blocks to improve syntax highlighting and readability. The following blocks could benefit from language specifiers: lines 57, 61, 72, 76, 104, 109, 114, 119, 126, 131, 140, 150, 154.

Example:

-```
+```bash
 databusclient deploy --version-id https://databus.dbpedia.org/user1/group1/artifact1/2022-05-18 ...

As per static analysis hints.

databusclient/client.py (2)

610-614: Add explicit Optional type hints for nullable parameters.

PEP 484 prohibits implicit `Optional`. Make null-safety explicit in the signature.

Apply this diff:

 def __download_list__(urls: List[str],
                       localDir: str,
-                      vault_token_file: str = None,
-                      auth_url: str = None,
-                      client_id: str = None) -> None:
+                      vault_token_file: Optional[str] = None,
+                      auth_url: Optional[str] = None,
+                      client_id: Optional[str] = None) -> None:

As per static analysis hints.


617-617: Prefix unused unpacked variable with underscore.

The host variable is unpacked but never used. Follow Python convention for dummy variables.

Apply this diff:

-        host, account, group, artifact, version, file = __get_databus_id_parts__(url)
+        _host, account, group, artifact, version, file = __get_databus_id_parts__(url)

As per static analysis hints.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9e0776a and 9806d04.

📒 Files selected for processing (5)
  • README.md (3 hunks)
  • databusclient/cli.py (1 hunks)
  • databusclient/client.py (4 hunks)
  • test.sh (1 hunks)
  • tests/test_download.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
tests/test_download.py (2)
databusclient/cli.py (1)
  • download (46-57)
databusclient/client.py (1)
  • download (635-714)
databusclient/cli.py (1)
databusclient/client.py (3)
  • deploy (352-393)
  • create_dataset (209-349)
  • download (635-714)
databusclient/client.py (1)
databusclient/cli.py (1)
  • download (46-57)
🪛 markdownlint-cli2 (0.18.1)
README.md

57-57: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


61-61: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


72-72: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


76-76: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


104-104: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


109-109: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


114-114: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


119-119: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


126-126: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


131-131: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


140-140: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


150-150: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


154-154: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Ruff (0.13.3)
databusclient/client.py

418-418: Probable use of requests call without timeout

(S113)


425-425: Probable use of requests call without timeout

(S113)


431-431: Avoid specifying long messages outside the exception class

(TRY003)


438-438: Probable use of requests call without timeout

(S113)


453-453: Avoid specifying long messages outside the exception class

(TRY003)


467-467: Avoid specifying long messages outside the exception class

(TRY003)


474-474: Probable use of requests call without timeout

(S113)


493-493: Probable use of requests call without timeout

(S113)


570-570: Avoid specifying long messages outside the exception class

(TRY003)


602-602: Probable use of requests call without timeout

(S113)


607-607: Probable use of requests call without timeout

(S113)


612-612: PEP 484 prohibits implicit Optional

Convert to Optional[T]

(RUF013)


613-613: PEP 484 prohibits implicit Optional

Convert to Optional[T]

(RUF013)


614-614: PEP 484 prohibits implicit Optional

Convert to Optional[T]

(RUF013)


617-617: Unpacked variable host is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


712-712: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build
  • GitHub Check: build
🔇 Additional comments (6)
test.sh (1)

4-4: LGTM! Parameter name updated correctly.

The change from --versionid to --version-id aligns with the CLI migration from Typer to Click in databusclient/cli.py.

tests/test_download.py (1)

6-16: LGTM! Enhanced query with sensible constraints.

The updated SPARQL query adds appropriate filters:

  • URL prefix filter ensures results are from the expected databus instance
  • Size constraint (< 100MB) improves test execution time
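A query with those constraints might look roughly like this — a sketch only; the actual prefixes and graph patterns in `tests/test_download.py` may differ:

```python
# Illustrative SPARQL for the test: restrict results to one Databus instance
# and cap file size. The property paths and prefixes here are assumptions,
# not the repository's actual query.
DATABUS_BASE = "https://databus.dbpedia.org"

QUERY = f"""
PREFIX dcat: <http://www.w3.org/ns/dcat#>
SELECT ?file WHERE {{
    ?dist dcat:downloadURL ?file ;
          dcat:byteSize ?size .
    FILTER(STRSTARTS(STR(?file), "{DATABUS_BASE}"))
    FILTER(?size < 104857600)   # < 100 MB keeps the test fast
}}
LIMIT 10
"""
```

Both filters run server-side, so the test never has to download an oversized file just to discard it.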
databusclient/cli.py (1)

2-61: LGTM! Clean migration from Typer to Click.

The CLI migration correctly:

  • Maps --version-id to version_id parameter (line 15)
  • Maps --license to license_url parameter (line 23)
  • Maintains consistent parameter naming throughout
  • Adds proper entry point for CLI invocation

Past review concerns about option-parameter mismatches have been resolved.
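The dashed-option-to-snake_case mapping works because Click lets you name the parameter explicitly as the second positional argument to `@click.option`. A minimal standalone sketch (not the client's actual command):

```python
import click
from click.testing import CliRunner

@click.command()
@click.option("--version-id", "version_id", required=True,
              help="Dashed option on the CLI, snake_case parameter in Python.")
def deploy(version_id: str) -> None:
    # Echo the received value so the mapping is observable.
    click.echo(version_id)

# CliRunner exercises the mapping without touching a real shell.
result = CliRunner().invoke(deploy, ["--version-id", "https://example.org/v/1"])
print(result.output)  # https://example.org/v/1
```

Without the explicit `"version_id"` name, Click would derive the same identifier from `--version-id` anyway, but spelling it out documents the contract.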

databusclient/client.py (3)

414-416: LGTM! Directory creation is now properly guarded.

Past review concern about os.makedirs with empty dirname has been resolved.


453-453: LGTM! Proper exception with descriptive message.

Past review concern about bare raise statement has been resolved.


669-714: LGTM! Vault authentication options properly propagated.

All download paths now correctly pass vault_token_file, auth_url, and client_id to __download_list__, enabling protected downloads throughout. Past review concern has been resolved.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
databusclient/client.py (1)

532-541: Fix return type annotation mismatch.

The function signature declares List[str] as the return type, but the implementation uses yield, making it a generator function. This type mismatch can cause confusion and potential runtime issues with type checkers.

Apply this diff to fix the return type:

-def __handle_databus_file_query__(endpoint_url, query) -> List[str]:
+def __handle_databus_file_query__(endpoint_url, query):
     result_dict = __query_sparql__(endpoint_url, query)
     for binding in result_dict['results']['bindings']:

Alternatively, if you want to maintain type hints, import Generator from typing and use:

from typing import Generator

def __handle_databus_file_query__(endpoint_url, query) -> Generator[str, None, None]:
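The distinction matters because any `def` containing `yield` returns a generator no matter what the annotation claims — a quick standalone check:

```python
import inspect
from typing import Generator

def numbers() -> Generator[int, None, None]:
    """A yield anywhere in the body makes this a generator function."""
    yield from range(3)

# The call itself runs no body code; it just builds the generator object.
gen = numbers()
print(inspect.isgeneratorfunction(numbers))  # True
print(list(gen))                             # [0, 1, 2]
```

So a `List[str]` annotation on such a function is simply wrong for type checkers, even though callers that only iterate would never notice.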
♻️ Duplicate comments (3)
databusclient/client.py (3)

607-615: Add timeouts to metadata fetch operations.

Both __handle_databus_collection__ and __get_json_ld_from_databus__ lack timeout parameters on their GET requests, which can cause indefinite hangs when fetching collection queries or JSON-LD metadata.

Apply this diff:

 def __handle_databus_collection__(uri: str) -> str:
     headers = {"Accept": "text/sparql"}
-    return requests.get(uri, headers=headers).text
+    return requests.get(uri, headers=headers, timeout=30).text
 
 
 def __get_json_ld_from_databus__(uri: str) -> str:
     headers = {"Accept": "application/ld+json"}
-    return requests.get(uri, headers=headers).text
+    return requests.get(uri, headers=headers, timeout=30).text

Based on learnings.


418-448: Add timeouts to prevent indefinite hangs.

The network calls lack timeout parameters, which can cause the client to hang indefinitely if the server is unresponsive. This is especially problematic for HEAD and GET requests that may encounter slow or stalled connections.

Apply this diff to add timeouts:

-    response = requests.head(url, stream=True)
+    response = requests.head(url, stream=True, timeout=30)
     # Check for redirect and update URL if necessary
     if response.headers.get("Location") and response.status_code in [301, 302, 303, 307, 308]:
         url = response.headers.get("Location")
         print("Redirects url: ", url)
 
     # --- 2. Try direct GET ---
-    response = requests.get(url, stream=True, allow_redirects=False)  # no redirects here, we want to see if auth is required
+    response = requests.get(url, stream=True, allow_redirects=False, timeout=30)  # no redirects here, we want to see if auth is required
     www = response.headers.get('WWW-Authenticate', '')  # get WWW-Authenticate header if present to check for Bearer auth
 
     if (response.status_code == 401 or "bearer" in www.lower()):
         print(f"Authentication required for {url}")
         if not (vault_token_file):
             raise ValueError("Vault token file not given for protected download")
 
         # --- 3. Fetch Vault token ---
         vault_token = __get_vault_access__(url, vault_token_file, auth_url, client_id)
         headers = {"Authorization": f"Bearer {vault_token}"}
 
         # --- 4. Retry with token ---
-        response = requests.get(url, headers=headers, stream=True)
+        response = requests.get(url, headers=headers, stream=True, timeout=60)

Consider defining a module-level constant (e.g., DEFAULT_TIMEOUT = 30) for consistency across all network calls.

Based on learnings.
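One lightweight way to centralize that default — the names here are illustrative, not part of the client's API:

```python
DEFAULT_TIMEOUT = 30  # seconds; tune to the deployment's expectations

def with_timeout(**kwargs):
    """Merge a default timeout into requests-style keyword arguments,
    letting an explicit caller-supplied timeout win."""
    return {"timeout": DEFAULT_TIMEOUT, **kwargs}

# Usage sketch: requests.get(url, **with_timeout(stream=True))
print(with_timeout(stream=True))  # {'timeout': 30, 'stream': True}
```

Every call site then gets a timeout by construction, and a single constant controls the policy.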


481-506: Add timeouts to token exchange requests.

The token exchange POST requests lack timeout parameters, which can cause indefinite hangs during authentication flows.

Apply this diff:

     # 2. Refresh token -> access token
     resp = requests.post(auth_url, data={
         "client_id": client_id,
         "grant_type": "refresh_token",
         "refresh_token": refresh_token
-    })
+    }, timeout=30)
     resp.raise_for_status()
     access_token = resp.json()["access_token"]
 
     # 3. Extract host as audience
     # Remove protocol prefix
     if download_url.startswith("https://"):
         host_part = download_url[len("https://"):]
     elif download_url.startswith("http://"):
         host_part = download_url[len("http://"):]
     else:
         host_part = download_url
     audience = host_part.split("/")[0]  # host is before first "/"
 
     # 4. Access token -> Vault token
     resp = requests.post(auth_url, data={
         "client_id": client_id,
         "grant_type": "urn:ietf:params:oauth:grant-type:token-exchange",
         "subject_token": access_token,
         "audience": audience
-    })
+    }, timeout=30)
     resp.raise_for_status()
     vault_token = resp.json()["access_token"]

Based on learnings.

🧹 Nitpick comments (3)
README.md (1)

104-156: Add language identifiers to fenced code blocks.

Multiple fenced code blocks throughout the download examples lack language specifiers, which prevents proper syntax highlighting. Add bash or shell identifiers to these code blocks.

Apply language identifiers to code blocks at lines 104, 109, 114, 119, 126, 131, 140, 150, and 154:

-```
+```bash
 databusclient download https://databus.dbpedia.org/dbpedia/mappings/mappingbased-literals/2022.12.01/mappingbased-literals_lang=az.ttl.bz2

Repeat this pattern for all affected code blocks. This also applies to the deploy command examples at lines 57 and 61, and the help output blocks at lines 72 and 76.

databusclient/client.py (2)

617-633: Use explicit Optional type hints.

The function parameters `vault_token_file`, `auth_url`, and `client_id` default to `None` but lack explicit `Optional[]` type hints, which violates PEP 484.



Apply this diff:

 def __download_list__(urls: List[str],
                       localDir: str,
-                      vault_token_file: str = None,
-                      auth_url: str = None,
-                      client_id: str = None) -> None:
+                      vault_token_file: Optional[str] = None,
+                      auth_url: Optional[str] = None,
+                      client_id: Optional[str] = None) -> None:

The same applies to the download function signature at lines 642-649.


623-626: Clarify localDir reassignment logic.

The function assigns localDir inside the loop when it is None: the first iteration derives the path from that URL's Databus ID parts, and every later iteration re-evaluates the check needlessly. Deriving the directory once, before the loop, makes the intended behavior explicit.

If localDir should be computed once when not provided, move the logic before the loop:

+    if localDir is None:
+        # Use first URL to determine base directory
+        first_url = urls[0] if urls else None
+        if first_url:
+            host, account, group, artifact, version, _ = __get_databus_id_parts__(first_url)
+            localDir = os.path.join(os.getcwd(), account, group, artifact, version if version is not None else "latest")
+            print(f"Local directory not given, using {localDir}")
+
     for url in urls:
-        if localDir is None:
-            host, account, group, artifact, version, file = __get_databus_id_parts__(url)
-            localDir = os.path.join(os.getcwd(), account, group, artifact, version if version is not None else "latest")
-            print(f"Local directory not given, using {localDir}")
-
         file = url.split("/")[-1]

Also note that the unpacked host variable is never used (line 624).

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9806d04 and faf7f65.

📒 Files selected for processing (2)
  • README.md (3 hunks)
  • databusclient/client.py (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
databusclient/client.py (1)
databusclient/cli.py (1)
  • download (46-57)
🪛 markdownlint-cli2 (0.18.1)
README.md

57-57: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


61-61: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


72-72: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


76-76: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


104-104: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


109-109: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


114-114: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


119-119: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


126-126: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


131-131: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


140-140: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


150-150: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


154-154: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Ruff (0.13.3)
databusclient/client.py

418-418: Probable use of requests call without timeout

(S113)


425-425: Probable use of requests call without timeout

(S113)


431-431: Avoid specifying long messages outside the exception class

(TRY003)


438-438: Probable use of requests call without timeout

(S113)


447-447: Use raise without specifying exception name

Remove exception name

(TRY201)


460-460: Avoid specifying long messages outside the exception class

(TRY003)


474-474: Avoid specifying long messages outside the exception class

(TRY003)


481-481: Probable use of requests call without timeout

(S113)


500-500: Probable use of requests call without timeout

(S113)


577-577: Avoid specifying long messages outside the exception class

(TRY003)


609-609: Probable use of requests call without timeout

(S113)


614-614: Probable use of requests call without timeout

(S113)


619-619: PEP 484 prohibits implicit Optional

Convert to Optional[T]

(RUF013)


620-620: PEP 484 prohibits implicit Optional

Convert to Optional[T]

(RUF013)


621-621: PEP 484 prohibits implicit Optional

Convert to Optional[T]

(RUF013)


624-624: Unpacked variable host is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


719-719: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build
  • GitHub Check: build
🔇 Additional comments (3)
databusclient/client.py (3)

561-580: LGTM!

The version sorting logic correctly handles both single version (dict) and multiple versions (list) cases, and the descending sort ensures the latest version is returned first.
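The dict-or-list normalization it performs can be sketched as follows — the `@id` key is an assumption about the JSON-LD shape, and the lexicographic sort relies on zero-padded `YYYY.MM.DD` version names:

```python
def latest_version(versions):
    """Accept a single version node (dict) or a list of them and return the
    lexicographically greatest @id, i.e. the newest YYYY.MM.DD version."""
    if isinstance(versions, dict):      # JSON-LD collapses one-element lists
        versions = [versions]
    return sorted((v["@id"] for v in versions), reverse=True)[0]

print(latest_version([
    {"@id": "https://example.org/art/2022.12.01"},
    {"@id": "https://example.org/art/2023.01.15"},
]))  # https://example.org/art/2023.01.15
```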


543-558: LGTM!

The JSON-LD parsing correctly extracts file links from Part nodes for accurate access tracking rather than directly using downloadURLs.


583-600: LGTM!

The artifact filtering correctly excludes versioned URIs by checking if the version component is None.
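That filter amounts to counting the path segments after the host — a self-contained sketch, not the client's actual `__get_databus_id_parts__`:

```python
def is_unversioned_artifact(uri: str) -> bool:
    """True when the URI path is exactly account/group/artifact (no version).
    Assumes the Databus layout host/account/group/artifact[/version[/file]]."""
    path = uri.split("://", 1)[-1].rstrip("/").split("/")
    return len(path) == 4  # host plus three path segments

print(is_unversioned_artifact(
    "https://databus.dbpedia.org/user/group/artifact"))             # True
print(is_unversioned_artifact(
    "https://databus.dbpedia.org/user/group/artifact/2022.12.01"))  # False
```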

@Integer-Ctrl Integer-Ctrl merged commit d78f129 into main Oct 13, 2025
3 checks passed
@coderabbitai coderabbitai bot mentioned this pull request Nov 25, 2025
@coderabbitai coderabbitai bot mentioned this pull request Dec 9, 2025
@Integer-Ctrl Integer-Ctrl deleted the download-capabilities branch December 9, 2025 14:00
@coderabbitai coderabbitai bot mentioned this pull request Dec 18, 2025
10 tasks