[HWORKS-2682] Dedicated WebSocket proxy executor pool with saturation handling and observability #593

Merged
o-alex merged 2 commits into logicalclocks:main from o-alex:HWORKS-2682
Jun 1, 2026

@o-alex o-alex commented May 29, 2026

Summary

Documentation side of the HWORKS-2682 ticket.

  • New admin page docs/setup_installation/admin/monitoring/websocket-pool.md: pool model (two threads per WebSocket connection, single-owner pool, zero-length task queue), the five Grafana panels in the Hopsworks dashboard, the MP-Metrics gauges and rejection counter, the Helm values that govern sizing, and the Grizzly idle timeouts.
  • New user guide docs/user_guides/projects/jupyter/session_capacity_warnings.md: two-badge state matrix (instance + cluster × orange/red), what happens when each turns red, recovery steps, three screenshots from the Jupyter / Terminal / Apps pages.
  • mkdocs.yml nav entries under Setup → Administration → Monitoring and Projects → Jupyter.

Auto-sizing documentation lives on HWORKS-2829 (stacked PR #594).

Test plan

  • touch docs/javadoc && uv run mkdocs build -s; rm docs/javadoc completes cleanly (the three pre-existing databricks/integrations nav warnings are unrelated).
  • Page rendering verified locally via mkdocs serve; cross-links between admin and user pages resolve.

Companion PRs

  • hopsworks-ee#2875 — backend pool rewrite + metrics + REST status endpoint
  • hopsworks-helm#1952 — chart-side rename + Grafana panels
  • hopsworks-front#1943 — badge UI + button-disable

Stacked PR

  • logicalclocks.github.io#594 — HWORKS-2829 auto-sizing docs (stacks on this PR; will rebase to main after this lands)

Copilot AI left a comment

Pull request overview

Adds documentation for the new scalable WebSocket executor pool feature: an admin monitoring/tuning page and a user-facing page explaining the capacity badges that surface in the UI.

Changes:

  • New admin page docs/setup_installation/admin/monitoring/websocket-pool.md covering the pool model, Grafana panels, MP-Metrics, Helm tuning (autoSize, threadsPerCore, etc.), and Grizzly timeouts.
  • New user page docs/user_guides/projects/jupyter/session_capacity_warnings.md describing the instance/cluster × orange/red badge matrix, recovery steps, and three screenshots.
  • mkdocs.yml nav entries added under Projects → Jupyter and Setup → Administration → Monitoring.

Reviewed changes

Copilot reviewed 3 out of 6 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| mkdocs.yml | Registers the two new pages in the navigation tree. |
| docs/setup_installation/admin/monitoring/websocket-pool.md | New admin/monitoring reference for the WebSocket proxy pool. |
| docs/user_guides/projects/jupyter/session_capacity_warnings.md | New user guide for the capacity badges shown in Jupyter, Terminal, and Apps. |

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 6 changed files in this pull request and generated 1 comment.

Comment thread docs/user_guides/projects/jupyter/session_capacity_warnings.md Outdated
Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 6 changed files in this pull request and generated 1 comment.

Comment on lines +53 to +54
- **Red on the instance badge, orange or green on the cluster badge**: a different pod has capacity.
Refresh the page or sign out and back in to land on a different instance pod.
… handling and observability

https://hopsworks.atlassian.net/browse/HWORKS-2682

The WebSocket proxy in the Hopsworks Payara backend used to run its
forward stream direction inline on the calling HTTP thread. Under
load the HTTP request pool filled with pinned pumps and all REST
traffic stalled. The fix in hopsworks-ee moves both pump directions
to a dedicated managed executor, gates new sessions when that pool
saturates, and surfaces capacity state in the UI; this site
documents the new admin observability surface and explains the
user-facing capacity badges.

A new admin page under
setup_installation/admin/monitoring/websocket-pool.md documents the
pool model (two threads per WebSocket connection, the single-owner
pool, the zero-length task queue), the Grafana panels (sessions,
duration percentiles, rejection rate, pool CPU, pool allocation
rate), the MP-Metrics gauges and the rejection counter, the
relevant Helm values (corePoolSize, maximumPoolSize, taskQueueCapacity,
threadPriority), and the Grizzly idle timeouts.
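
The pool model described above (two threads per WebSocket connection, no task queue, a rejection counter) can be sketched roughly as follows. This is an illustrative Python model only, not the hopsworks-ee implementation, which runs on a managed executor in Payara. The names PumpPool, SaturatedError, open_session and demo are hypothetical and chosen for this sketch.

```python
import threading

THREADS_PER_SESSION = 2  # one pump thread per stream direction, per the pool model


class SaturatedError(RuntimeError):
    """Raised when the pool cannot host another session (hypothetical name)."""


class PumpPool:
    """Toy model of a bounded pool with a zero-length task queue."""

    def __init__(self, max_threads: int) -> None:
        self.max_threads = max_threads
        self.active_threads = 0
        self.rejected = 0  # stand-in for the rejection counter
        self._lock = threading.Lock()

    def open_session(self, forward, backward) -> list:
        """Start both pump directions now, or reject; nothing is ever queued."""
        with self._lock:
            if self.active_threads + THREADS_PER_SESSION > self.max_threads:
                self.rejected += 1
                raise SaturatedError("WebSocket pump pool is saturated")
            # Reserve both threads up front so a session is never half-admitted.
            self.active_threads += THREADS_PER_SESSION
        threads = [
            threading.Thread(target=self._run, args=(pump,), daemon=True)
            for pump in (forward, backward)
        ]
        for thread in threads:
            thread.start()
        return threads

    def _run(self, pump) -> None:
        try:
            pump()
        finally:
            # Release the slot when the pump direction finishes.
            with self._lock:
                self.active_threads -= 1


def demo():
    """Fill a 4-thread pool with two sessions, then try to open a third."""
    pool = PumpPool(max_threads=4)
    release = threading.Event()
    started = []
    for _ in range(2):
        started += pool.open_session(release.wait, release.wait)
    try:
        pool.open_session(release.wait, release.wait)
    except SaturatedError:
        pass  # the third session is gated instead of waiting in a queue
    peak = pool.active_threads
    release.set()
    for thread in started:
        thread.join()
    return peak, pool.rejected
```

In this model a pool of 4 threads hosts at most 2 concurrent sessions, and the third attempt increments the rejection counter rather than pinning a caller thread while it waits.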

A new user guide under
user_guides/projects/jupyter/session_capacity_warnings.md describes
the instance and cluster badge matrix (orange WARNING, red CRITICAL,
no badge OK), explains where each badge appears (Jupyter server
card, terminal panel, apps list), and lists the recovery steps when
a badge turns red. mkdocs.yml gets nav entries under Setup ->
Administration -> Monitoring and Projects -> Jupyter.
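
As a rough illustration of the badge matrix, the sketch below maps an (instance, cluster) badge pair to a recovery hint. Only the "red instance, non-red cluster" hint comes from the review excerpt quoted earlier in this thread, and that excerpt is marked outdated, so the final wording may differ. The other two hints, and the names Level and recovery_hint, are assumptions made for this example.

```python
from enum import Enum


class Level(Enum):
    OK = "no badge"
    WARNING = "orange"
    CRITICAL = "red"


def recovery_hint(instance: Level, cluster: Level) -> str:
    """Map the (instance, cluster) badge pair to a hint for the user."""
    if instance is not Level.CRITICAL and cluster is not Level.CRITICAL:
        # Assumption: orange is informational and needs no action.
        return "no action needed"
    if instance is Level.CRITICAL and cluster is not Level.CRITICAL:
        # From the review excerpt: a different pod still has capacity.
        return "refresh, or sign out and back in, to land on a different instance pod"
    # Assumption: the source does not state what to do when the cluster badge is red.
    return "close idle sessions or contact an administrator"
```

For example, recovery_hint(Level.CRITICAL, Level.WARNING) returns the "different instance pod" hint, matching the excerpt's case of a red instance badge with an orange cluster badge.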

Auto-sizing of the pool from worker CPU and memory budget is
tracked separately in HWORKS-2829.

Reviewed-by: Copilot
Signed-off-by: Alex Ormenisan <alex@logicalclocks.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@o-alex o-alex marked this pull request as ready for review June 1, 2026 07:23
@o-alex o-alex changed the title from "[HWORKS-2682] websocket executor pools improvements and scalable executor pools" to "[HWORKS-2682] Dedicated WebSocket proxy executor pool with saturation handling and observability" Jun 1, 2026
https://hopsworks.atlassian.net/browse/HWORKS-2682

Follow-up on HWORKS-2682 (logicalclocks#593). Trim the admin WebSocket Pool guide
to match: the chart no longer overrides Grizzly's request-timeout-seconds
or websockets-timeout-seconds, and the page reflects Payara defaults.

Empirical evidence that request-timeout-seconds applies to established
WebSocket sessions after the 101 handshake is weak; the framing layer's
separate websockets-timeout-seconds is what governs the established
WebSocket's idle window. The guide now says so and points at
websockets-timeout-seconds as the override knob if a longer idle
window is needed.

Companion changes: hopsworks-helm#<TBD> drops the chart values + asadmin
boot lines; hopsworks-ee#2875 drops the matching baked-in asadmin set
commands in docker/payara-server/Dockerfile.

Signed-off-by: Alex Ormenisan <alex@logicalclocks.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@o-alex o-alex merged commit dafa00a into logicalclocks:main Jun 1, 2026
1 check passed