feat: AE-1225: poll request logs for qb workers#276

Open
jhcipar wants to merge 6 commits into main from jhcipar/AE-1225/qb-request-log-polling

Conversation

@jhcipar
Contributor

@jhcipar jhcipar commented Mar 17, 2026

Summary

  • Adds queue-based (QB) pod log polling during async job execution by introducing a new QBRequestLogFetcher and wiring it into ServerlessResource.run(...) to surface worker assignment, initialization, and streaming phases in real time.
  • Improves “no workers available” diagnostics with a new WorkerAvailabilityDiagnostic flow that checks GPU/CPU availability by location and emits clearer waiting/throttling guidance while jobs are queued.
  • Enhances deployment manifest reconciliation so endpoint metadata is carried forward more reliably (including endpoint ID handling and endpoint URL normalization), with accompanying coverage updates.

Some examples of the behavior:

  • throttled worker
2026-04-07 15:46:53,054 | INFO  | LiveServerless:unkkp3jdfk406c | API /run
2026-04-07 15:46:53,253 | INFO  | LiveServerless:unkkp3jdfk406c | Started Job:b7475d47-8277-4329-8855-6b10025ea951-u2
2026-04-07 15:46:53,316 | INFO  | Job:b7475d47-8277-4329-8855-6b10025ea951-u2 | Status: IN_QUEUE
2026-04-07 15:46:54,085 | INFO  | Job:b7475d47-8277-4329-8855-6b10025ea951-u2 | Workers are currently throttled on endpoint for selected gpu AMPERE_80. Consider raising max workers or changing gpu type.
2026-04-07 15:46:54,085 | INFO  | Job:b7475d47-8277-4329-8855-6b10025ea951-u2 | Waiting for request: endpoint metrics: worker=None, assignment=unassigned, status=IN_QUEUE, workers={ready:0, running:0, idle:0, initializing:0, throttled:1, unhealthy:0}, readyWorkers=[]
2026-04-07 15:46:55,098 | INFO  | Job:b7475d47-8277-4329-8855-6b10025ea951-u2 | .
2026-04-07 15:46:57,299 | INFO  | Job:b7475d47-8277-4329-8855-6b10025ea951-u2 | ..
2026-04-07 15:47:03,703 | INFO  | Job:b7475d47-8277-4329-8855-6b10025ea951-u2 | ...
2026-04-07 15:47:15,056 | INFO  | Job:b7475d47-8277-4329-8855-6b10025ea951-u2 | ....
2026-04-07 15:47:26,169 | INFO  | Job:b7475d47-8277-4329-8855-6b10025ea951-u2 | .....
  • happy path/success
worker log: 2026-04-07T19:39:34Z worker is ready
2026-04-07 15:40:17,629 | INFO  | Job:360a4954-7874-4032-b29c-20967fbf71f6-u1 | Streaming endpoint startup logs while waiting for request assignment
2026-04-07 15:40:22,360 | INFO  | Job:360a4954-7874-4032-b29c-20967fbf71f6-u1 | .....................
worker log: 2026-04-07T19:40:19Z create container runpod/flash:py3.12-latest
worker log: 2026-04-07T19:40:20Z py3.12-latest Pulling from runpod/flash
worker log: 2026-04-07T19:40:20Z Digest: sha256:d1a9f4a7a1e7b271654e7c74d004933cb4c8861ab41781b232d0f3c8f62c4960
worker log: 2026-04-07T19:40:20Z Status: Image is up to date for runpod/flash:py3.12-latest
worker log: 2026-04-07T19:40:20Z worker is ready
worker log: 2026-04-07T19:40:21Z start container for runpod/flash:py3.12-latest: begin
worker log: 2026-04-07T15:40:22.686070386-04:00
worker log: 2026-04-07T15:40:22.686123716-04:00 ==========
worker log: 2026-04-07T15:40:22.686127786-04:00 == CUDA ==
worker log: 2026-04-07T15:40:22.686176636-04:00 ==========
worker log: 2026-04-07T15:40:22.694529314-04:00
worker log: 2026-04-07T15:40:22.694569034-04:00 CUDA Version 12.8.1
worker log: 2026-04-07T15:40:22.696022905-04:00
worker log: 2026-04-07T15:40:22.696036395-04:00 Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
worker log: 2026-04-07T15:40:22.697359677-04:00
worker log: 2026-04-07T15:40:22.697386677-04:00 This container image and its contents are governed by the NVIDIA Deep Learning Container License.
worker log: 2026-04-07T15:40:22.697390747-04:00 By pulling and using the container, you accept the terms and conditions of this license:
worker log: 2026-04-07T15:40:22.697393957-04:00 https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
worker log: 2026-04-07T15:40:22.697398707-04:00
worker log: 2026-04-07T15:40:22.697401167-04:00 A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
worker log: 2026-04-07T15:40:22.707401427-04:00
worker log: 2026-04-07T15:40:22.784503891-04:00 2026-04-07 19:40:22,784 | INFO  | Starting Flash Worker 1.4.0 | Python 3.12.12 | runpod-flash 1.8.0 | runpod 1.8.1
2026-04-07 15:40:28,202 | INFO  | Job:360a4954-7874-4032-b29c-20967fbf71f6-u1 | Status: IN_PROGRESS
worker log: 2026-04-07T15:40:25.247214255-04:00 --- Starting Serverless Worker |  Version 1.8.1 ---
worker log: 2026-04-07T15:40:25.798294020-04:00 {"requestId": null, "message": "Jobs in queue: 1", "level": "INFO"}
worker log: 2026-04-07T15:40:25.798328651-04:00 {"requestId": null, "message": "Jobs in progress: 1", "level": "INFO"}
worker log: 2026-04-07T15:40:25.798468601-04:00 {"requestId": "360a4954-7874-4032-b29c-20967fbf71f6-u1", "message": "Started.", "level": "INFO"}
2026-04-07 15:40:29,520 | INFO  | Job:360a4954-7874-4032-b29c-20967fbf71f6-u1 | Request assigned to worker 45sgtts0agtx6e, streaming pod logs
2026-04-07 15:40:37,473 | INFO  | Job:360a4954-7874-4032-b29c-20967fbf71f6-u1 | Status: COMPLETED
worker log: 2026-04-07T15:40:35.988394624-04:00 {"requestId": "360a4954-7874-4032-b29c-20967fbf71f6-u1", "message": "Finished.", "level": "INFO"}
2026-04-07 15:40:40,280 | INFO  | Worker:45sgtts0agtx6e | Delay Time: 221339 ms
2026-04-07 15:40:40,280 | INFO  | Worker:45sgtts0agtx6e | Execution Time: 10038 ms

@jhcipar jhcipar requested review from KAJdev and deanq March 18, 2026 00:19
Contributor

@runpod-Henrik runpod-Henrik left a comment


1. Bug: start_time default is evaluated at import time, not instantiation

# request_logs.py:35
def __init__(
    self,
    ...
    start_time: datetime = datetime.now(timezone.utc),  # evaluated ONCE at module load
):

datetime.now() in a default argument is computed when the module is first imported, not when QBRequestLogFetcher() is called. Every instance created without an explicit start_time — including the one created in run() — gets the same stale timestamp. On a long-running process that calls run() hours after startup, the first poll will request logs from the entire lifetime of the process.

Fix:

start_time: Optional[datetime] = None
# in body:
self.start_time = start_time if start_time is not None else datetime.now(timezone.utc)

2. Issue: print() for log output bypasses the logging system

# serverless.py:240
print(f"worker log: {line}")

Flash uses log = logging.getLogger(__name__) throughout. print() bypasses LOG_LEVEL filtering and SensitiveDataFilter. If a worker log line contains an API key or sensitive value, it will not be filtered.

Should be log.info("worker log: %s", line) (or rich.print if the intent is always-visible user output — but then it should be consistent with how the rest of the CLI surfaces output to users).
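To illustrate why routing through the logger matters, here is a minimal stand-in for a redaction filter (the real SensitiveDataFilter lives in src/runpod_flash/logger.py; the key pattern and names below are made up for the example):

```python
import io
import logging
import re

# Hypothetical redaction filter standing in for the project's
# SensitiveDataFilter; the token pattern is illustrative only.
class RedactFilter(logging.Filter):
    PATTERN = re.compile(r"rpa_[A-Za-z0-9]+")

    def filter(self, record: logging.LogRecord) -> bool:
        # Format the message first so %s-style args are covered too.
        record.msg = self.PATTERN.sub("[REDACTED]", record.getMessage())
        record.args = None
        return True

stream = io.StringIO()
log = logging.getLogger("worker_logs_demo")
log.addHandler(logging.StreamHandler(stream))
log.addFilter(RedactFilter())
log.setLevel(logging.INFO)

log.info("worker log: %s", "token=rpa_abc123")  # redacted on the way out
# print(f"worker log: token=rpa_abc123") would reach stdout unfiltered.

assert "rpa_abc123" not in stream.getvalue()
assert "[REDACTED]" in stream.getvalue()
```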


3. Issue: fetched_until stalls when logs have no timestamps

# request_logs.py:165
else:
    # not all logs have a timestamp, assume we should refetch
    self.fetched_until = self.start_time

When records have no dt field, fetched_until is set to start_time (unchanged). The next call to fetch_logs() then sets self.start_time = self.fetched_until = self.start_time — the window never advances. The deduplication seen set prevents duplicate output, but the API is called with the same time range on every poll for the entire duration of the job.

Should fall back to end_utc (the timestamp of the current fetch) rather than start_time.
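A sketch of the suggested fix as a pure function — the function name and record shape are assumptions based on the snippets above, not the actual implementation:

```python
from datetime import datetime, timezone, timedelta

# Hypothetical window bookkeeping: choose the next fetched_until value
# from the records returned by one poll.
def next_fetched_until(records: list[dict], end_utc: datetime) -> datetime:
    timestamps = [r["dt"] for r in records if r.get("dt") is not None]
    if timestamps:
        return max(timestamps)
    # No timestamped records: treat everything up to the fetch time as
    # seen, so the next poll's window starts at end_utc rather than
    # re-requesting the same range from start_time forever.
    return end_utc

now = datetime.now(timezone.utc)
earlier = now - timedelta(seconds=30)
assert next_fetched_until([{"dt": earlier}], now) == earlier
assert next_fetched_until([{"msg": "no timestamp"}], now) == now
```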


4. Issue: _fetch_worker_id is dead code

# request_logs.py:67
async def _fetch_worker_id(self, endpoint_id, request_id, runpod_api_key):
    ...

This method is never called anywhere in the PR. fetch_logs() doesn't pass a request_id, matched_by_request_id is always False, and worker_id is always None. If this is scaffolding for future work, a comment would help. If it's not needed, removing it keeps the surface area clean.


5. Question: stdout deduplication path has no test coverage

The tests for run() patch out _emit_endpoint_logs entirely with AsyncMock, so fetcher.seen is never populated. The stdout deduplication block:

# serverless.py:1157-1166
if raw in fetcher.seen:
    continue

has no test. A test that lets _emit_endpoint_logs populate fetcher.seen and then verifies that matching stdout lines are stripped from the final output would close this gap.
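The behavior under test can be isolated as a pure function (a sketch only; in the PR this logic runs inline in run()):

```python
# Hypothetical extraction of the stdout de-duplication: lines already
# streamed as pod logs (fetcher.seen) are dropped from the replayed stdout.
def dedupe_stdout(stdout: str, seen: set[str]) -> str:
    return "\n".join(line for line in stdout.splitlines() if line not in seen)

# A test along the lines suggested above:
seen = {"jello: 0", "jello: 1"}
result = dedupe_stdout("jello: 0\njello: 1\njello: 2", seen)
assert result == "jello: 2"
```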


6. Minor: _emit_endpoint_logs has inconsistent return

if not batch:
    return False   # explicit False
# other paths:
return             # implicit None

The return value is unused by the caller, so no functional impact — but the mixed False/None is confusing. Either make it -> None throughout (remove the return False) or commit to returning a bool.


Verdict

The mutable default argument is a real bug that will cause missed logs on any process that imports flash before calling run(). The other three issues (print vs logger, fetched_until stall, dead method) are worth fixing before merge. Tests are solid for the happy path — the one gap is the stdout deduplication path.

@KAJdev
Contributor

KAJdev commented Mar 19, 2026

I'm seeing double logs

2026-03-18 19:31:39,696 | INFO  | CpuLiveServerless:fyr7izk2pspcy6 | API /run
2026-03-18 19:31:39,901 | INFO  | CpuLiveServerless:fyr7izk2pspcy6 | Started Job:f510b3cf-6e83-44b0-9009-3bff3ac92a31-u1
2026-03-18 19:31:40,101 | INFO  | Job:f510b3cf-6e83-44b0-9009-3bff3ac92a31-u1 | Status: IN_QUEUE
2026-03-18 19:31:41,325 | INFO  | Job:f510b3cf-6e83-44b0-9009-3bff3ac92a31-u1 | .
2026-03-18 19:31:43,245 | INFO  | Job:f510b3cf-6e83-44b0-9009-3bff3ac92a31-u1 | ..
2026-03-18 19:31:45,342 | INFO  | Job:f510b3cf-6e83-44b0-9009-3bff3ac92a31-u1 | Status: COMPLETED
2026-03-18 19:31:45,626 | INFO  | Worker:qri408vqei7vdk | Delay Time: 3614 ms
2026-03-18 19:31:45,626 | INFO  | Worker:qri408vqei7vdk | Execution Time: 37 ms
2026-03-19 02:31:43,462 | INFO  | this is an info message
2026-03-19 02:31:43,462 | WARNING | this is a warning message
2026-03-19 02:31:43,462 | ERROR | this is an error message
2026-03-19 02:31:43,462 | INFO  | processing request for world
this is an info message
this is a warning message
this is an error message
processing request for world

@jhcipar
Contributor Author

jhcipar commented Mar 19, 2026

@KAJdev yeah, the double logs thing happens in main. At the end of the request, flash by default replays stdout from the completed job request into your terminal, and currently the logging handler puts both formatted logs and raw log lines into stdout. For example:

{
  "delayTime": 10342,
  "executionTime": 5078,
  "id": "3655fecf-ee84-45db-832d-6bde2b7031d9-u1",
  "output": {
    "instance_id": null,
    "instance_info": null,
    "json_result": null,
    "result": "gAWVFQAAAAAAAAB9lIwGcmVzdWx0lIwFaG93ZHmUcy4=",
    "stdout": "2026-03-19 13:55:24,435 | INFO  | 3655fecf-ee84-45db-832d-6bde2b7031d9-u1 | jello: 0\n2026-03-19 13:55:25,435 | INFO  | 3655fecf-ee84-45db-832d-6bde2b7031d9-u1 | jello: 1\n2026-03-19 13:55:26,436 | INFO  | 3655fecf-ee84-45db-832d-6bde2b7031d9-u1 | jello: 2\n2026-03-19 13:55:27,436 | INFO  | 3655fecf-ee84-45db-832d-6bde2b7031d9-u1 | jello: 3\n2026-03-19 13:55:28,436 | INFO  | 3655fecf-ee84-45db-832d-6bde2b7031d9-u1 | jello: 4\n\njello: 0\njello: 1\njello: 2\njello: 3\njello: 4\n",
    "success": true
  },
  "status": "COMPLETED",
  "workerId": "k7zjko9p3g579t"
}

I'm going to try to figure this out as a short-term fix, but also, in your case you didn't actually get any polled log lines 🫠 the latency before they're actually available as endpoint logs is pretty high

@KAJdev
Contributor

KAJdev commented Mar 19, 2026

yeah i even tried 60s requests and would only get about 1 log out of 30 at runtime. I think we probably should think of a different method since I'm not sure this is going to be that useful

Contributor

Copilot AI left a comment


Pull request overview

This PR adds real-time log/diagnostic visibility for queue-based (QB) serverless runs, plus improves deployment manifest reconciliation so runtime endpoint metadata (IDs/URLs/aiKey) is preserved more reliably across deploys.

Changes:

  • Introduces QB pod log polling during async ServerlessResource.run(...) (including phase/status messaging and stdout de-dupe against streamed logs).
  • Adds a WorkerAvailabilityDiagnostic flow to improve “no workers available” diagnostics (GPU/CPU availability + throttling guidance).
  • Updates deployment reconciliation to carry forward endpoint metadata (endpoint_id/url normalization/aiKey) and sanitizes local manifest writes.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
src/runpod_flash/core/resources/serverless.py Polls QB request logs during async runs; emits phase/status messages; attempts stdout de-dupe vs streamed logs.
src/runpod_flash/core/resources/request_logs.py Adds QBRequestLogFetcher + models to fetch status/metrics/pod logs and return phased batches.
src/runpod_flash/core/resources/worker_availability_diagnostic.py Adds availability diagnostics via GraphQL (GPU/CPU + throttled handling).
src/runpod_flash/core/api/runpod.py Adds GraphQL helpers to query GPU/CPU stock status.
src/runpod_flash/cli/utils/deployment.py Persists endpoint_id/aiKey into state manifest; normalizes endpoint_url; sanitizes local manifest to avoid writing aiKey.
tests/unit/resources/test_serverless.py Adds tests for QB log polling behaviors + stdout de-dupe + repeated diagnostics.
tests/unit/resources/test_request_logs.py Adds tests for fetcher phases, priming/streaming, and auth fallback behavior.
tests/unit/resources/test_worker_availability_diagnostic.py Adds tests for diagnostics messages/reasons (max=0, gpu/cpu availability, throttled, out-of-stock).
tests/unit/cli/utils/test_deployment.py Adds tests for persisting endpoint_id/aiKey to state manifest and sanitizing local manifest disk writes.
src/runpod_flash/cli/docs/flash-logging.md Documents QB request log polling during async run(...).
src/runpod_flash/cli/docs/flash-deploy.md Documents manifest credential handling and local manifest sanitization.
docs/Deployment_Architecture.md Documents how aiKey is handled between state manifest vs local manifest.


Comment on lines +1308 to +1313
output = response.get("output")
if isinstance(output, dict):
    stdout = output.get("stdout")
    if isinstance(stdout, str):
        seen_normalized = {
            normalized

Copilot AI Apr 8, 2026


The stdout post-processing at completion rewrites output["stdout"] unconditionally (it strips lines, drops blanks, and de-duplicates within stdout). This will also affect non-QB endpoints or cases where no pod logs were streamed, potentially removing legitimate repeated output lines and changing formatting. Consider only applying this normalization/de-dupe when QB log streaming actually occurred (e.g., self.type == ServerlessType.QB and fetcher.seen is non-empty / streaming phase was reached), otherwise leave stdout untouched.

Comment on lines +265 to +269
if batch.lines:
    for line in batch.lines:
        print(f"worker log: {line}")

return batch

Copilot AI Apr 8, 2026


Worker log lines are emitted via print(...), which bypasses the project's logging stack (formatting, file handler, and the SensitiveDataFilter redaction in src/runpod_flash/logger.py). This can result in missing worker logs in .flash/logs/activity.log and potentially leaking sensitive data in raw pod logs. Prefer logging through log.info(...) (or a dedicated logger) so the same handlers/filters apply, while still streaming to stdout.

Comment on lines +315 to +318
if "aiKey" in state_config:
    local_manifest["resources"][resource_name]["aiKey"] = state_config[
        "aiKey"
    ]

Copilot AI Apr 8, 2026


Persisting aiKey into the State Manager manifest while explicitly stripping it from the local .flash/flash_manifest.json makes the reconciliation hash/JSON comparison unstable: on the next deploy, local_config (no aiKey) will differ from state_config (has aiKey), causing resources to be treated as changed and re-provisioned unnecessarily. Fix by excluding runtime-only keys like aiKey (and other API-returned metadata) from the config comparison, e.g., strip them from both dicts before json.dumps(...) or maintain a separate runtime-metadata section for these fields.
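One way to implement the suggestion — only aiKey is confirmed by the review; the other excluded keys and names below are illustrative assumptions:

```python
import json

# Hypothetical set of runtime-only keys excluded from the change
# comparison so their presence in one manifest cannot mark the
# resource as changed.
RUNTIME_ONLY_KEYS = {"aiKey", "endpoint_id", "endpoint_url"}

def comparable(config: dict) -> str:
    # Strip runtime metadata, then serialize deterministically.
    return json.dumps(
        {k: v for k, v in config.items() if k not in RUNTIME_ONLY_KEYS},
        sort_keys=True,
    )

local_config = {"image": "runpod/flash:py3.12-latest"}
state_config = {"image": "runpod/flash:py3.12-latest", "aiKey": "secret"}
assert comparable(local_config) == comparable(state_config)
```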


4 participants