[SYSTEMDS-2651] Replace fixed-sleep federated worker startup with Poll#2468
Open
Baunsgaard wants to merge 3 commits into
Open
[SYSTEMDS-2651] Replace fixed-sleep federated worker startup with Poll#2468Baunsgaard wants to merge 3 commits into
Baunsgaard wants to merge 3 commits into
Conversation
… port polling Federated tests previously slept for FED_WORKER_WAIT (or FED_WORKER_WAIT_S) after launching each federated worker, racing against the worker's actual bind and wasting time once it was ready. With many tests starting multiple workers serially, this added significant fixed overhead per test. Replace with a TCP connect-poll on the worker's port, which becomes connectable only after Netty's bind().sync() has fully completed. Tests now return as soon as the worker is genuinely ready and fail fast if it never comes up. - Add FederatedTestUtils.waitForWorker(...) overloads that fail fast if the worker process/thread dies during startup. - Use it from startLocalFedWorker, startLocalFedWorkerThread, and startLocalFedWorkerWithArgs in AutomatedTestBase. - Keep the legacy int parameter on the public methods for source compatibility; reinterpret it as an upper-bound timeout with a sane floor so tiny historical values (e.g. 50 ms) don't time out on slow CI. Production code is untouched; the readiness check lives entirely in the test tree.
…worker waits Many federated tests start 2-4 workers and then wait for each one before spawning the next. Even with TCP port-polling readiness, that serialises the JVM warm-up and dominates wall-clock time per test. Switch to a bulk pattern: spawn all workers back-to-back, then wait for them all in one shared poll loop, so total startup is bounded by the slowest worker rather than the sum of all of them. - AutomatedTestBase: factor out spawnLocalFedWorker / spawnLocalFedWorkerThread primitives, add startLocalFedWorkers / startLocalFedWorkerThreads bulk helpers returning Process[] / Thread[]. - FederatedTestUtils: add waitForWorkers(int[], ...) variants that poll all ports in a single loop and fail fast if any worker dies during startup. - Add a small (25 ms) stagger between in-JVM thread spawns to avoid races on shared static state during DMLScript / FederatedWorker init that the previous per-thread sleep was implicitly providing. - Migrate ~70 federated test files (primitives, algorithms, codegen, fedplanning, paramserv-base, lineage, transform, io) and MultiTenantTestBase.startFedWorkers to the bulk helper, including their isAlive(...) and TestUtils.shutdownThreads(...) varargs sites. Files with conditional / interleaved / single-worker startup are left alone. No production code is touched.
…timeout clamp Rename org.apache.sysds.test.FederatedTestUtils to FederatedWorkerUtils to avoid the simple-name collision with the existing org.apache.sysds.test.component.federated.FederatedTestUtils, which covers unrelated federation RPC/data helpers. Lower the readiness-wait minimum clamp from 30s to 3s so a worker that fails to bind is reported within a few seconds of FED_WORKER_WAIT, instead of being held to a 30s floor before the timeout exception is raised. Also rewrite the Javadoc and inline comments on the new helpers in FederatedTestUtils -> FederatedWorkerUtils and on the federated worker helpers in AutomatedTestBase to a neutral, descriptive tone: - Document the MIN_TIMEOUT_MS clamp explicitly in every public overload rather than only inside the helper. - Split POLL_INTERVAL_MS into POLL_INTERVAL_MS (round sleep) and CONNECT_TIMEOUT_MS (per-attempt Socket.connect timeout) so the two uses are independently named. - Reword "start ... in parallel" to "start ... back to back, then wait for all of them in one shared poll loop" to match the actual sequential start + parallel readiness pattern. - Replace narration of the previous fixed-sleep API (and adjectives such as "tiny", "cheap", "sane floor", "right helper") with present-tense behavior contracts referencing the named constants. - Drop two stray "shared.!" typos in pre-existing thread-worker Javadoc. No behavior change beyond the 30s -> 3s clamp; mvn test-compile is clean and FederatedL2SVMTest (in-JVM threads) and FederatedTriTest (4 separate worker JVMs) both pass.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
WIP
This PR replace the thread.sleep with a poll based startup of federated workers in testing. The change helps our test suites to not have timeouts, or failures because of inconsistent launches of federated workers.