[HWORKS-2682] Dedicated WebSocket proxy executor pool with saturation handling and observability #593
Merged
Pull request overview
Adds documentation for the new scalable WebSocket executor pool feature: an admin monitoring/tuning page and a user-facing page explaining the capacity badges that surface in the UI.
Changes:
- New admin page `docs/setup_installation/admin/monitoring/websocket-pool.md` covering the pool model, Grafana panels, MP-Metrics, Helm tuning (`autoSize`, `threadsPerCore`, etc.), and Grizzly timeouts.
- New user page `docs/user_guides/projects/jupyter/session_capacity_warnings.md` describing the instance/cluster × orange/red badge matrix, recovery steps, and three screenshots.
- `mkdocs.yml` nav entries added under Projects → Jupyter and Setup → Administration → Monitoring.
Reviewed changes
Copilot reviewed 3 out of 6 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| `mkdocs.yml` | Registers the two new pages in the navigation tree. |
| `docs/setup_installation/admin/monitoring/websocket-pool.md` | New admin/monitoring reference for the WebSocket proxy pool. |
| `docs/user_guides/projects/jupyter/session_capacity_warnings.md` | New user guide for the capacity badges shown in Jupyter, Terminal, and Apps. |
Comment on lines +53 to +54
- **Red on the instance badge, orange or green on the cluster badge**: a different pod has capacity.
  Refresh the page or sign out and back in to land on a different instance pod.
cf317b1 to 405cb1a
… handling and observability

https://hopsworks.atlassian.net/browse/HWORKS-2682

The WebSocket proxy in the Hopsworks Payara backend used to run its forward stream direction inline on the calling HTTP thread. Under load the HTTP request pool filled with pinned pumps and all REST traffic stalled. The fix in hopsworks-ee moves both pump directions to a dedicated managed executor, gates new sessions when that pool saturates, and surfaces capacity state in the UI; this site documents the new admin observability surface and explains the user-facing capacity badges.

A new admin page under setup_installation/admin/monitoring/websocket-pool.md documents the pool model (two threads per WebSocket connection, the single-owner pool, the zero-length task queue), the Grafana panels (sessions, duration percentiles, rejection rate, pool CPU, pool allocation rate), the MP-Metrics gauges and the rejection counter, the relevant Helm values (corePoolSize, maximumPoolSize, taskQueueCapacity, threadPriority), and the Grizzly idle timeouts.

A new user guide under user_guides/projects/jupyter/session_capacity_warnings.md describes the instance and cluster badge matrix (orange WARNING, red CRITICAL, no badge OK), explains where each badge appears (Jupyter server card, terminal panel, apps list), and lists the recovery steps when a badge turns red.

mkdocs.yml gets nav entries under Setup -> Administration -> Monitoring and Projects -> Jupyter.

Auto-sizing of the pool from worker CPU and memory budget is tracked separately in HWORKS-2829.

Reviewed-by: Copilot
Signed-off-by: Alex Ormenisan <alex@logicalclocks.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
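The pool model in the commit message above (two threads per WebSocket connection, a zero-length task queue, and a rejection counter) amounts to all-or-nothing admission: a new session either gets both pump threads immediately or is rejected, because nothing can wait in a queue. The following is a minimal Python sketch of that idea only. It is not the hopsworks-ee code, which is Java running on a Payara managed executor. `SessionGate`, `try_open`, `close`, `rejected`, and `active_sessions` are hypothetical names introduced for illustration; `maximum_pool_size` merely echoes the `maximumPoolSize` Helm value.

```python
import threading


class SessionGate:
    """Illustrative admission gate; all names here are hypothetical."""

    # One pump thread per stream direction, as stated in the commit message.
    THREADS_PER_SESSION = 2

    def __init__(self, maximum_pool_size: int) -> None:
        self._max = maximum_pool_size
        self._busy = 0       # threads currently held by open sessions
        self._rejected = 0   # stands in for the rejection counter metric
        self._lock = threading.Lock()

    def try_open(self) -> bool:
        """Admit a session only if both pump threads are free right now."""
        with self._lock:
            if self._busy + self.THREADS_PER_SESSION > self._max:
                # Zero-length queue: no waiting, the session is rejected.
                self._rejected += 1
                return False
            self._busy += self.THREADS_PER_SESSION
            return True

    def close(self) -> None:
        """Release the two threads held by one session."""
        with self._lock:
            self._busy = max(0, self._busy - self.THREADS_PER_SESSION)

    @property
    def rejected(self) -> int:
        with self._lock:
            return self._rejected

    @property
    def active_sessions(self) -> int:
        with self._lock:
            return self._busy // self.THREADS_PER_SESSION
```

Under this sketch a pool of 4 threads admits at most 2 concurrent sessions, and a third attempt increments the rejection count instead of waiting.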
https://hopsworks.atlassian.net/browse/HWORKS-2682

Follow-up on HWORKS-2682 (logicalclocks#593). Trim the admin WebSocket Pool guide to match: the chart no longer overrides Grizzly's request-timeout-seconds or websockets-timeout-seconds, and the page reflects Payara defaults.

Empirical evidence that request-timeout-seconds applies to established WebSocket sessions after the 101 handshake is weak; the framing layer's separate websockets-timeout-seconds is what governs the established WebSocket's idle window. The guide now says so and points at the plural attribute as the override knob if a longer idle window is needed.

Companion changes: hopsworks-helm#<TBD> drops the chart values + asadmin boot lines; hopsworks-ee#2875 drops the matching baked-in asadmin set commands in docker/payara-server/Dockerfile.

Signed-off-by: Alex Ormenisan <alex@logicalclocks.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
robzor92
approved these changes
Jun 1, 2026
Summary
Documentation side of the HWORKS-2682 ticket.
- `docs/setup_installation/admin/monitoring/websocket-pool.md`: pool model (two threads per WebSocket connection, single-owner pool, zero-length task queue), the five Grafana panels in the Hopsworks dashboard, the MP-Metrics gauges and rejection counter, the Helm values that govern sizing, and the Grizzly idle timeouts.
- `docs/user_guides/projects/jupyter/session_capacity_warnings.md`: two-badge state matrix (instance + cluster × orange/red), what happens when each turns red, recovery steps, three screenshots from the Jupyter / Terminal / Apps pages.
- `mkdocs.yml` nav entries under Setup → Administration → Monitoring and Projects → Jupyter.

Auto-sizing documentation lives on HWORKS-2829 (stacked PR #594).
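The two-badge matrix can be read as a small lookup: each scope (instance, cluster) is in one of three states, and the pair decides what the user should do. The sketch below is a hedged Python illustration. The state names (orange WARNING, red CRITICAL, no badge OK) and the one recovery hint come from this PR; the utilization thresholds `warn_at` and `critical_at` and the function names are assumptions made up for the example, since the PR text does not state how the states are computed.

```python
from enum import Enum
from typing import Optional


class Badge(Enum):
    OK = "no badge"
    WARNING = "orange"
    CRITICAL = "red"


def badge_for(utilization: float,
              warn_at: float = 0.8,
              critical_at: float = 1.0) -> Badge:
    """Map pool utilization (0.0 to 1.0) to a badge.

    The thresholds are hypothetical defaults, not values from the docs.
    """
    if utilization >= critical_at:
        return Badge.CRITICAL
    if utilization >= warn_at:
        return Badge.WARNING
    return Badge.OK


def recovery_hint(instance: Badge, cluster: Badge) -> Optional[str]:
    """Return the recovery step quoted in this PR, if one applies.

    Only the 'red instance, non-red cluster' case is spelled out in the
    excerpt, so every other combination returns None here.
    """
    if instance is Badge.CRITICAL and cluster is not Badge.CRITICAL:
        return ("Refresh the page or sign out and back in "
                "to land on a different instance pod.")
    return None
```

For example, `recovery_hint(badge_for(1.0), badge_for(0.5))` returns the refresh hint, because a different pod still has capacity.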
Test plan
- `touch docs/javadoc && uv run mkdocs build -s; rm docs/javadoc` clean (the three pre-existing databricks/integrations nav warnings are unrelated).
- `mkdocs serve`; cross-links between admin and user pages resolve.

Companion PRs
Stacked PR
main after this lands)