fix(workspace-engine): Argo race condition #735

zacharyblasczyk · 2026-01-05T21:12:22Z

Add failing test to reproduce ArgoCD deployment dependency race condition

Summary

This PR adds a failing test that reproduces a production bug where ArgoCD applications fail with "unable to find destination server" when deployed immediately after their dependent destination completes.

The race condition: The deployment dependency policy allows downstream deployments as soon as the upstream job succeeds, but ArgoCD destinations require additional time to sync asynchronously.

sequenceDiagram
    participant Policy as Dependency Policy
    participant Engine as Workspace Engine
    participant ArgoCD as ArgoCD API
    
    Note over Engine,ArgoCD: 1. Deploy Destination
    Engine->>ArgoCD: Create Destination App
    ArgoCD-->>Engine: Job Status: Successful ✅
    
    Note over Policy: Dependency satisfied!<br/>Allow downstream deploy
    
    Note over Engine,ArgoCD: 2. Deploy App (IMMEDIATE)
    Policy->>Engine: Unblock downstream deployment
    Engine->>ArgoCD: Create App using destination
    ArgoCD-->>Engine: ❌ Error: unable to find destination server
    
    Note over ArgoCD: Destination still syncing...<br/>(async background process)
    
    Note over Engine: Job marked InvalidJobAgent ❌
    
    Note over ArgoCD: Destination now available<br/>(too late)

Changes

TestEngine_PolicyDeploymentDependency_PolicyTimingIssue - Failing test that exposes the bug (expects 0 jobs, gets 1 job created immediately)
Error message capture in executor.go - Store dispatch errors in job.Message for better debugging
Additional test coverage - AutoDeploy and ArgoCDRetryBehavior tests documenting expected behavior
Unit tests - 30 test cases for error classification logic

Test Output

$ go test -v ./test/e2e -run TestEngine_PolicyDeploymentDependency

=== RUN   TestEngine_PolicyDeploymentDependency_PolicyTimingIssue
2026/01/06 10:40:23 INFO Processing event eventType=job-agent.created
2026/01/06 10:40:23 INFO Processing event eventType=job-agent.created
2026/01/06 10:40:23 INFO Processing event eventType=system.created
2026/01/06 10:40:23 INFO Processing event eventType=deployment.created
2026/01/06 10:40:23 INFO Processing event eventType=deployment.created
2026/01/06 10:40:23 INFO Processing event eventType=environment.created
2026/01/06 10:40:23 INFO Processing event eventType=resource.created
2026/01/06 10:40:23 INFO Processing event eventType=policy.created
2026/01/06 10:40:23 INFO Policy created - reconciling affected targets policy_id=app-depends-on-destination affected_targets_count=1
2026/01/06 10:40:23 INFO Processing event eventType=deployment-version.created
2026/01/06 10:40:23 INFO Processing event eventType=deployment-version.created
2026/01/06 10:40:23 INFO Processing event eventType=job.updated
    engine_policy_deployment_dependency_test.go:481: 
                Error Trace:    /Users/zb/github/ctrlplanedev/ctrlplane/apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go:481
                Error:          Not equal: 
                                expected: 0
                                actual  : 1
                Test:           TestEngine_PolicyDeploymentDependency_PolicyTimingIssue
                Messages:       EXPECTED BEHAVIOR: App should remain blocked immediately after destination job success to allow time for ArgoCD destination to sync. ACTUAL BUG: Policy allows deployment immediately, causing 'unable to find destination server' errors
--- FAIL: TestEngine_PolicyDeploymentDependency_PolicyTimingIssue (0.01s)
FAIL
FAIL    workspace-engine/test/e2e       1.303s
FAIL

This failure is intentional - the test will pass once we implement a fix.

Summary by CodeRabbit

Bug Fixes
- Improved error handling for failed job dispatches: logs include captured error text and jobs now record updated status, message, and timestamp.
Tests
- Added unit tests and a benchmark for error classification and retry logic.
- Added end-to-end tests covering policy-driven deployment dependencies, automatic downstream deployment, retry behavior for transient destination errors, and a documented timing/race scenario.

_{✏️ Tip: You can customize this high-level summary in your review settings.}

coderabbitai · 2026-01-05T21:12:34Z

📝 Walkthrough

Walkthrough

Adds unit and E2E tests for ArgoCD error classification and deployment-dependency behavior, and updates async job dispatch error handling to set job status, message, and updated timestamp when dispatch fails.

Changes

Cohort / File(s)	Summary
ArgoCD Error Classification Tests `apps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.go`	New table-driven unit test `TestIsRetryableError` covering nil, HTTP/transient, ArgoCD-specific, and non-retryable errors; adds `BenchmarkIsRetryableError`.
Async Job Dispatch Error Handling `apps/workspace-engine/pkg/workspace/releasemanager/deployment/executor.go`	On async dispatch error: capture error message, set `newJob.Status = InvalidJobAgent`, assign `newJob.Message`, update `newJob.UpdatedAt`, and persist the job (previously only logged the error).
Deployment Dependency E2E Tests `apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go`	Added three E2E tests—`TestEngine_PolicyDeploymentDependency_AutoDeploy`, `TestEngine_PolicyDeploymentDependency_ArgoCDRetryBehavior`, and `TestEngine_PolicyDeploymentDependency_PolicyTimingIssue`—exercising policy-driven deployment ordering, ArgoCD retry behavior, and a timing/race scenario.

Sequence Diagram(s)

(Skipped — changes are test additions and a small error-handling update that do not introduce a new multi-component control flow requiring visualization.)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

fix: only retrigger latest invalid agent jobs #719: Related changes around InvalidJobAgent state and retry/retrigger behavior affecting job dispatch error handling.
chore: init argocd integration #716: Related ArgoCD integration changes that affect error classification used by the new isRetryableError tests.
chore: dependencies trigger downstream target reconciles #731: Related deployment-dependency and trigger-based downstream reconcile work similar to the new E2E tests and executor adjustments.

Poem

🐰 I hopped through logs both day and night,
Caught retryable errors in soft moonlight.
Dependencies queued in tidy rows,
Messages set where failure goes—
Now jobs tumble forward, futures bright. ✨

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'fix(workspace-engine): Argo race condition' accurately summarizes the main objective of the PR, which is to fix a production race condition related to ArgoCD timing during policy-driven deployment dependencies.
Docstring Coverage	✅ Passed	Docstring coverage is 85.71% which is sufficient. The required threshold is 80.00%.

✨ Finishing touches

📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (2)

apps/workspace-engine/pkg/workspace/jobdispatch/argocd.go (2)
83-99: Optimize by moving retryablePatterns to package level.

The retryablePatterns slice is allocated on every call to isRetryableError. Since these patterns are static, consider declaring it as a package-level variable to avoid repeated allocations, especially given that this function is called in a retry loop under load.
🔎 Proposed refactor
+// retryablePatterns defines error patterns indicating transient failures worth retrying
+var retryablePatterns = []string{
+	// HTTP status codes indicating transient failures
+	"502", "503", "504",
+	"connection refused",
+	"connection reset",
+	"timeout",
+	"temporarily unavailable",
+	// ArgoCD destination/cluster errors (race condition when destination is being synced)
+	"cluster not found",
+	"destination not found",
+	"cluster does not exist",
+	"destination does not exist",
+	"destination server not found",
+	"server not found",
+	"invalid destination",
+	"unknown cluster",
+}
+
 // isRetryableError checks if an error is a transient error that should be retried
 func isRetryableError(err error) bool {
 	if err == nil {
 		return false
 	}
 	
 	errStr := strings.ToLower(err.Error())
 	
-	// Patterns that indicate transient failures worth retrying
-	retryablePatterns := []string{
-		// HTTP status codes indicating transient failures
-		"502", "503", "504",
-		"connection refused",
-		"connection reset",
-		"timeout",
-		"temporarily unavailable",
-		// ArgoCD destination/cluster errors (race condition when destination is being synced)
-		"cluster not found",
-		"destination not found",
-		"cluster does not exist",
-		"destination does not exist",
-		"destination server not found",
-		"server not found",
-		"invalid destination",
-		"unknown cluster",
-	}
-	
 	for _, pattern := range retryablePatterns {
 		if strings.Contains(errStr, pattern) {
 			return true
 		}
 	}
 	
 	return false
 }
96-97: Broad patterns may match unintended errors.

The patterns "server not found" and "invalid destination" are relatively generic and could theoretically match non-ArgoCD errors (e.g., "DNS server not found" or file system destination errors). However, given that this function is only called within the ArgoCD dispatcher context, the risk of false positives is low. Consider adding a comment clarifying that these patterns are scoped to ArgoCD errors.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 37a5856 and 6a3e5f1.

📒 Files selected for processing (3)

apps/workspace-engine/pkg/workspace/jobdispatch/argocd.go
apps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.go
apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go

🧰 Additional context used

📓 Path-based instructions (2)

apps/workspace-engine/**/*.go

📄 CodeRabbit inference engine (apps/workspace-engine/CLAUDE.md)

apps/workspace-engine/**/*.go: Do not add extraneous inline comments that state the obvious
Do not add comments that simply restate what the code does
Do not add comments for standard Go patterns (e.g., noting WaitGroup or semaphore usage)
Write comments that explain why, document complex logic/algorithms, provide non-obvious context, include TODO/FIXME, and document exported functions/types/methods

Files:

apps/workspace-engine/pkg/workspace/jobdispatch/argocd.go
apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go
apps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.go

apps/workspace-engine/**/*_test.go

📄 CodeRabbit inference engine (apps/workspace-engine/CLAUDE.md)

Follow the existing test structure used in *_test.go files

Files:

apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go
apps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.go

🧠 Learnings (6)

📚 Learning: 2025-08-12T18:13:54.630Z

Learnt from: CR
Repo: ctrlplanedev/ctrlplane PR: 0
File: apps/workspace-engine/CLAUDE.md:0-0
Timestamp: 2025-08-12T18:13:54.630Z
Learning: Applies to apps/workspace-engine/**/*_test.go : Follow the existing test structure used in *_test.go files

Applied to files:

apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go
apps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.go

📚 Learning: 2025-08-12T18:13:54.630Z

Learnt from: CR
Repo: ctrlplanedev/ctrlplane PR: 0
File: apps/workspace-engine/CLAUDE.md:0-0
Timestamp: 2025-08-12T18:13:54.630Z
Learning: Applies to apps/workspace-engine/pkg/model/selector/**/*_test.go : Write comprehensive, data-driven tests for new condition types

Applied to files:

apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go
apps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.go

📚 Learning: 2025-08-12T18:13:54.630Z

Learnt from: CR
Repo: ctrlplanedev/ctrlplane PR: 0
File: apps/workspace-engine/CLAUDE.md:0-0
Timestamp: 2025-08-12T18:13:54.630Z
Learning: Applies to apps/workspace-engine/pkg/model/selector/**/*_test.go : Test validation and matching logic separately for condition types

Applied to files:

apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go

📚 Learning: 2025-10-24T00:02:29.723Z

Learnt from: adityachoudhari26
Repo: ctrlplanedev/ctrlplane PR: 696
File: apps/workspace-engine/pkg/workspace/releasemanager/deployment/executor.go:63-67
Timestamp: 2025-10-24T00:02:29.723Z
Learning: In workspace-engine replay mode (apps/workspace-engine/pkg/workspace/releasemanager/deployment/executor.go), jobs are created with Pending status even during replay because job updates are still accepted and processed. If a job remains Pending after replay completes, it genuinely is still pending because no updates occurred during replay.

Applied to files:

apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go

📚 Learning: 2025-08-12T18:13:54.630Z

Learnt from: CR
Repo: ctrlplanedev/ctrlplane PR: 0
File: apps/workspace-engine/CLAUDE.md:0-0
Timestamp: 2025-08-12T18:13:54.630Z
Learning: Applies to apps/workspace-engine/pkg/model/selector/**/*_test.go : Use table-driven tests for all condition types

Applied to files:

apps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.go

📚 Learning: 2025-08-12T18:13:54.630Z

Learnt from: CR
Repo: ctrlplanedev/ctrlplane PR: 0
File: apps/workspace-engine/CLAUDE.md:0-0
Timestamp: 2025-08-12T18:13:54.630Z
Learning: Applies to apps/workspace-engine/**/*.go : Write comments that explain why, document complex logic/algorithms, provide non-obvious context, include TODO/FIXME, and document exported functions/types/methods

Applied to files:

apps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.go

🧬 Code graph analysis (1)

apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go (4)

apps/workspace-engine/test/integration/workspace.go (1)

NewTestWorkspace (38-66)

apps/workspace-engine/test/integration/creators/deployment_version.go (1)

NewDeploymentVersion (15-36)

apps/workspace-engine/pkg/events/handler/handler.go (2)

DeploymentVersionCreate (44-44)

JobUpdate (68-68)

apps/workspace-engine/pkg/oapi/oapi.gen.go (3)

JobStatusSuccessful (135-135)

Job (657-671)

JobStatusInvalidJobAgent (132-132)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)

GitHub Check: workspace-engine-tests
GitHub Check: Lint
GitHub Check: Typecheck
GitHub Check: build (linux/amd64)
GitHub Check: tests

🔇 Additional comments (2)

apps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.go (1)

12-181: Excellent test coverage with proper table-driven structure.

The test comprehensively validates error classification across all retryable and non-retryable patterns, including important edge cases like:

Case-insensitive matching (lines 68-71, 120-123)

Specificity validation (lines 162-170) ensuring "cluster" and "destination" in different contexts aren't false positives

Realistic ArgoCD error messages (lines 108-123)

The table-driven approach aligns with project conventions and makes the test easy to extend.

Based on learnings, this follows the existing test structure used in *_test.go files.

apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go (1)

120-225: Verify that the E2E test exercises the retry logic.

The test validates that downstream jobs don't immediately fail with InvalidJobAgent, but it doesn't appear to:

Simulate ArgoCD returning "cluster not found" or similar transient errors

Verify that retries actually occur with exponential backoff

Use a mock ArgoCD client that returns errors on initial attempts then succeeds

The test primarily validates the deployment dependency flow rather than the ArgoCD retry mechanism introduced in this PR. Consider verifying whether:

A mock ArgoCD client is configured elsewhere in the test setup that simulates the race condition

The retry behavior is actually exercised during test execution

Additional assertions could confirm retry attempts occurred (e.g., checking logs, retry counts, or timing)

If the retry logic isn't being exercised, you might want to enhance the test to use a mock ArgoCD client that fails with retryable errors on the first N attempts, then succeeds, to validate the fix end-to-end.

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

apps/workspace-engine/pkg/workspace/jobdispatch/argocd.go (1)
74-108: Solid implementation with comprehensive test coverage.

The refactoring correctly handles ArgoCD sync race conditions with proper case-insensitive pattern matching. All patterns—including HTTP status codes and ArgoCD-specific errors—are validated by comprehensive test cases in argocd_test.go, confirming they match realistic error messages from the ArgoCD library.

The HTTP patterns ("502", "503", "504") are tested within full error messages like "HTTP 502 Bad Gateway", and false positives are further mitigated by the limited retry policy (5 attempts with exponential backoff). However, if increased specificity is desired, consider refining to patterns like "status 502", "http 502", or "502 bad gateway" for additional precision.
Optional: More specific HTTP patterns
 	retryablePatterns := []string{
-		// HTTP status codes indicating transient failures
-		"502", "503", "504",
+		// HTTP status codes indicating transient failures
+		"status 502", "status 503", "status 504",
+		"http 502", "http 503", "http 504",
 		"connection refused",

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 6a3e5f1 and 8ff5556.

📒 Files selected for processing (1)

apps/workspace-engine/pkg/workspace/jobdispatch/argocd.go

🧰 Additional context used

📓 Path-based instructions (1)

apps/workspace-engine/**/*.go

📄 CodeRabbit inference engine (apps/workspace-engine/CLAUDE.md)

apps/workspace-engine/**/*.go: Do not add extraneous inline comments that state the obvious
Do not add comments that simply restate what the code does
Do not add comments for standard Go patterns (e.g., noting WaitGroup or semaphore usage)
Write comments that explain why, document complex logic/algorithms, provide non-obvious context, include TODO/FIXME, and document exported functions/types/methods

Files:

apps/workspace-engine/pkg/workspace/jobdispatch/argocd.go

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)

GitHub Check: Lint
GitHub Check: tests
GitHub Check: build (linux/amd64)

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

apps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.go (1)

142-158: Benchmark uses error messages not covered by unit tests.

The benchmark includes "cluster not found" and "destination does not exist" which aren't present in the unit test cases. Consider either adding these to the unit test table or using the same error set for consistency.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 8ff5556 and 5df4b69.

📒 Files selected for processing (3)

apps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.go
apps/workspace-engine/pkg/workspace/releasemanager/deployment/executor.go
apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go

🧰 Additional context used

📓 Path-based instructions (2)

apps/workspace-engine/**/*.go

📄 CodeRabbit inference engine (apps/workspace-engine/CLAUDE.md)

apps/workspace-engine/**/*.go: Do not add extraneous inline comments that state the obvious
Do not add comments that simply restate what the code does
Do not add comments for standard Go patterns (e.g., noting WaitGroup or semaphore usage)
Write comments that explain why, document complex logic/algorithms, provide non-obvious context, include TODO/FIXME, and document exported functions/types/methods

Files:

apps/workspace-engine/pkg/workspace/releasemanager/deployment/executor.go
apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go
apps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.go

apps/workspace-engine/**/*_test.go

📄 CodeRabbit inference engine (apps/workspace-engine/CLAUDE.md)

Follow the existing test structure used in *_test.go files

Files:

apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go
apps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.go

🧠 Learnings (6)

📚 Learning: 2025-10-24T00:02:29.723Z

Learnt from: adityachoudhari26
Repo: ctrlplanedev/ctrlplane PR: 696
File: apps/workspace-engine/pkg/workspace/releasemanager/deployment/executor.go:63-67
Timestamp: 2025-10-24T00:02:29.723Z
Learning: In workspace-engine replay mode (apps/workspace-engine/pkg/workspace/releasemanager/deployment/executor.go), jobs are created with Pending status even during replay because job updates are still accepted and processed. If a job remains Pending after replay completes, it genuinely is still pending because no updates occurred during replay.

Applied to files:

apps/workspace-engine/pkg/workspace/releasemanager/deployment/executor.go
apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go

📚 Learning: 2025-08-12T18:13:54.630Z

Learnt from: CR
Repo: ctrlplanedev/ctrlplane PR: 0
File: apps/workspace-engine/CLAUDE.md:0-0
Timestamp: 2025-08-12T18:13:54.630Z
Learning: Applies to apps/workspace-engine/**/*.go : Write comments that explain why, document complex logic/algorithms, provide non-obvious context, include TODO/FIXME, and document exported functions/types/methods

Applied to files:

apps/workspace-engine/pkg/workspace/releasemanager/deployment/executor.go
apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go
apps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.go

📚 Learning: 2025-08-12T18:13:54.630Z

Learnt from: CR
Repo: ctrlplanedev/ctrlplane PR: 0
File: apps/workspace-engine/CLAUDE.md:0-0
Timestamp: 2025-08-12T18:13:54.630Z
Learning: Applies to apps/workspace-engine/**/*_test.go : Follow the existing test structure used in *_test.go files

Applied to files:

apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go
apps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.go

📚 Learning: 2025-08-12T18:13:54.630Z

Learnt from: CR
Repo: ctrlplanedev/ctrlplane PR: 0
File: apps/workspace-engine/CLAUDE.md:0-0
Timestamp: 2025-08-12T18:13:54.630Z
Learning: Applies to apps/workspace-engine/pkg/model/selector/**/*_test.go : Write comprehensive, data-driven tests for new condition types

Applied to files:

apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go
apps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.go

📚 Learning: 2025-08-12T20:49:05.086Z

Learnt from: adityachoudhari26
Repo: ctrlplanedev/ctrlplane PR: 655
File: apps/workspace-engine/pkg/engine/workspace/fluent.go:166-171
Timestamp: 2025-08-12T20:49:05.086Z
Learning: The UpdateDeploymentVersions() method with OperationCreate case in apps/workspace-engine/pkg/engine/workspace/fluent.go is specifically designed only for creating new deployment versions, not for handling potential duplicates or existing versions.

Applied to files:

apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go

📚 Learning: 2025-08-12T18:13:54.630Z

Learnt from: CR
Repo: ctrlplanedev/ctrlplane PR: 0
File: apps/workspace-engine/CLAUDE.md:0-0
Timestamp: 2025-08-12T18:13:54.630Z
Learning: Applies to apps/workspace-engine/pkg/model/selector/**/*_test.go : Use table-driven tests for all condition types

Applied to files:

apps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.go

🧬 Code graph analysis (1)

apps/workspace-engine/pkg/workspace/releasemanager/deployment/executor.go (2)

apps/workspace-engine/pkg/workspace/releasemanager/trace/types.go (1)

Status (68-68)

apps/workspace-engine/pkg/oapi/oapi.gen.go (1)

JobStatusInvalidJobAgent (132-132)

🪛 GitHub Actions: Apps / Workspace Engine

apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go

[error] 481-481: Engine policy timing issue: expected deployment to remain blocked (timing 0), but was allowed to proceed (1). This causes 'unable to find destination server' errors

[error] 1-1: End-to-end test suite failed for workspace-engine/test/e2e (non-zero exit)

apps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.go

[error] 135-135: TestIsRetryableError/unable_to_find_destination_server: expected isRetryableError(...) = true, got false

[error] 135-135: TestIsRetryableError/application_destination_spec_is_invalid: expected isRetryableError(...) = true, got false

[error] 135-135: TestIsRetryableError/argocd_rpc_error_-_destination_spec_invalid: expected isRetryableError(...) = true, got false

[error] 135-135: TestIsRetryableError/mixed_case_-_unable_to_find_destination: expected isRetryableError(...) = true, got false

🔇 Additional comments (5)

apps/workspace-engine/pkg/workspace/releasemanager/deployment/executor.go (1)

128-135: LGTM - Error message capture improves debugging.

Storing the dispatch error in job.Message provides valuable debugging context for InvalidJobAgent failures, which aligns with the PR objective of better surfacing ArgoCD race conditions.

apps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.go (1)

10-140: Well-structured test-first approach for documenting expected behavior.

The table-driven tests correctly document the expected retry classification. The failing test cases (lines 63-82) for ArgoCD destination errors are intentionally failing per the PR description—they will pass once isRetryableError is updated to classify "unable to find destination server" errors as retryable.

apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go (3)

120-248: LGTM - Comprehensive test for auto-deploy behavior.

The test clearly documents the expected dependency flow with well-structured steps and assertions. The step-by-step comments make the test intent clear.

250-366: LGTM - Documents expected retry behavior for ArgoCD errors.

The test correctly asserts that jobs should not immediately fail with InvalidJobAgent status when ArgoCD returns transient destination errors. The critical assertion at line 355 validates the fix once implemented.

368-495: Intentionally failing test effectively documents the race condition.

This test reproduces the production bug where the policy evaluator allows downstream deployment immediately after upstream job success, without waiting for ArgoCD destination sync. The extensive comments (lines 486-494) outlining potential fix approaches are helpful for implementation guidance.

The failing assertion at line 481 will pass once a timing/sync mechanism is added to the policy evaluator.

coderabbitai

Actionable comments posted: 2

🤖 Fix all issues with AI Agents

In @apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go:
- Around line 368-494: The test
TestEngine_PolicyDeploymentDependency_PolicyTimingIssue is intentionally failing
and should be excluded from default CI; either mark it skipped by adding an
early t.Skip("known failing: race condition - see PR") at the top of
TestEngine_PolicyDeploymentDependency_PolicyTimingIssue, or move the test behind
a build tag by adding a file-level build constraint (e.g., //go:build failing)
and document that it runs with `go test -tags failing`; update the test comment
to explain which approach was used and reference the test name so future
maintainers can find and re-enable it when the policy fix is implemented.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 5df4b69 and e959e64.

📒 Files selected for processing (2)

apps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.go
apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go

🚧 Files skipped from review as they are similar to previous changes (1)

apps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.go

🧰 Additional context used

📓 Path-based instructions (2)

apps/workspace-engine/**/*.go

📄 CodeRabbit inference engine (apps/workspace-engine/CLAUDE.md)

apps/workspace-engine/**/*.go: Do not add extraneous inline comments that state the obvious
Do not add comments that simply restate what the code does
Do not add comments for standard Go patterns (e.g., noting WaitGroup or semaphore usage)
Write comments that explain why, document complex logic/algorithms, provide non-obvious context, include TODO/FIXME, and document exported functions/types/methods

Files:

apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go

apps/workspace-engine/**/*_test.go

📄 CodeRabbit inference engine (apps/workspace-engine/CLAUDE.md)

Follow the existing test structure used in *_test.go files

Files:

apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go

🧠 Learnings (6)

📓 Common learnings

Learnt from: adityachoudhari26
Repo: ctrlplanedev/ctrlplane PR: 696
File: apps/workspace-engine/pkg/workspace/releasemanager/deployment/executor.go:63-67
Timestamp: 2025-10-24T00:02:29.723Z
Learning: In workspace-engine replay mode (apps/workspace-engine/pkg/workspace/releasemanager/deployment/executor.go), jobs are created with Pending status even during replay because job updates are still accepted and processed. If a job remains Pending after replay completes, it genuinely is still pending because no updates occurred during replay.

📚 Learning: 2025-08-12T18:13:54.630Z

Learnt from: CR
Repo: ctrlplanedev/ctrlplane PR: 0
File: apps/workspace-engine/CLAUDE.md:0-0
Timestamp: 2025-08-12T18:13:54.630Z
Learning: Applies to apps/workspace-engine/**/*_test.go : Follow the existing test structure used in *_test.go files

Applied to files:

apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go

📚 Learning: 2025-08-12T18:13:54.630Z

Learnt from: CR
Repo: ctrlplanedev/ctrlplane PR: 0
File: apps/workspace-engine/CLAUDE.md:0-0
Timestamp: 2025-08-12T18:13:54.630Z
Learning: Applies to apps/workspace-engine/pkg/model/selector/**/*_test.go : Write comprehensive, data-driven tests for new condition types

Applied to files:

apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go

📚 Learning: 2025-08-12T18:13:54.630Z

Learnt from: CR
Repo: ctrlplanedev/ctrlplane PR: 0
File: apps/workspace-engine/CLAUDE.md:0-0
Timestamp: 2025-08-12T18:13:54.630Z
Learning: Applies to apps/workspace-engine/pkg/model/selector/**/*_test.go : Test validation and matching logic separately for condition types

Applied to files:

apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go

📚 Learning: 2025-08-12T20:49:05.086Z

Learnt from: adityachoudhari26
Repo: ctrlplanedev/ctrlplane PR: 655
File: apps/workspace-engine/pkg/engine/workspace/fluent.go:166-171
Timestamp: 2025-08-12T20:49:05.086Z
Learning: The UpdateDeploymentVersions() method with OperationCreate case in apps/workspace-engine/pkg/engine/workspace/fluent.go is specifically designed only for creating new deployment versions, not for handling potential duplicates or existing versions.

Applied to files:

apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go

📚 Learning: 2025-10-24T00:02:29.723Z

Learnt from: adityachoudhari26
Repo: ctrlplanedev/ctrlplane PR: 696
File: apps/workspace-engine/pkg/workspace/releasemanager/deployment/executor.go:63-67
Timestamp: 2025-10-24T00:02:29.723Z
Learning: In workspace-engine replay mode (apps/workspace-engine/pkg/workspace/releasemanager/deployment/executor.go), jobs are created with Pending status even during replay because job updates are still accepted and processed. If a job remains Pending after replay completes, it genuinely is still pending because no updates occurred during replay.

Applied to files:

apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)

GitHub Check: Typecheck
GitHub Check: tests
GitHub Check: Lint
GitHub Check: build (linux/amd64)

🔇 Additional comments (1)

apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go (1)

120-248: LGTM! Well-structured test for auto-deploy behavior.

The test comprehensively validates the deployment dependency workflow: blocking behavior, dependency satisfaction, and automatic unblocking. The step-by-step comments effectively document the expected behavior without being redundant.

coderabbitai · 2026-01-06T16:54:38Z

apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go

+// TestEngine_PolicyDeploymentDependency_ArgoCDRetryBehavior validates that:
+// When ArgoCD returns "unable to find destination server" errors, jobs are retried
+// instead of immediately being marked as InvalidJobAgent.
+// This tests the fix for the race condition where ArgoCD hasn't synced the destination yet.
+func TestEngine_PolicyDeploymentDependency_ArgoCDRetryBehavior(t *testing.T) {
+	destinationJobAgentID := "destination-agent"
+	appJobAgentID := "app-agent"
+
+	destinationDeploymentID := "argo-destination"
+	appDeploymentID := "argo-app"
+
+	environmentID := "prod"
+	resourceID := "k8s-cluster"
+	policyID := "app-depends-on-destination"
+
+	engine := integration.NewTestWorkspace(t,
+		integration.WithJobAgent(
+			integration.JobAgentID(destinationJobAgentID),
+		),
+		integration.WithJobAgent(
+			integration.JobAgentID(appJobAgentID),
+		),
+		integration.WithSystem(
+			integration.WithDeployment(
+				integration.DeploymentID(destinationDeploymentID),
+				integration.DeploymentName("argo-destination"),
+				integration.DeploymentJobAgent(destinationJobAgentID),
+				integration.DeploymentCelResourceSelector("true"),
+			),
+			integration.WithDeployment(
+				integration.DeploymentID(appDeploymentID),
+				integration.DeploymentName("argo-app"),
+				integration.DeploymentJobAgent(appJobAgentID),
+				integration.DeploymentCelResourceSelector("true"),
+			),
+			integration.WithEnvironment(
+				integration.EnvironmentID(environmentID),
+				integration.EnvironmentName("prod"),
+				integration.EnvironmentCelResourceSelector("true"),
+			),
+		),
+		integration.WithResource(
+			integration.ResourceID(resourceID),
+		),
+		integration.WithPolicy(
+			integration.PolicyID(policyID),
+			integration.PolicyName("App depends on ArgoCD destination"),
+			integration.WithPolicyTargetSelector(
+				integration.PolicyTargetCelEnvironmentSelector("true"),
+				integration.PolicyTargetCelDeploymentSelector("deployment.id == '"+appDeploymentID+"'"),
+				integration.PolicyTargetCelResourceSelector("true"),
+			),
+			integration.WithPolicyRule(
+				integration.WithRuleDeploymentDependency(
+					integration.DeploymentDependencyRuleDependsOnDeploymentSelector("deployment.id == '"+destinationDeploymentID+"'"),
+				),
+			),
+		),
+	)
+
+	ctx := context.Background()
+
+	// Step 1: Create app release (blocked - destination never succeeded)
+	appVersion := c.NewDeploymentVersion()
+	appVersion.DeploymentId = appDeploymentID
+	engine.PushEvent(ctx, handler.DeploymentVersionCreate, appVersion)
+
+	appJobs := getAgentJobsSortedByNewest(engine, appJobAgentID)
+	assert.Equal(t, 0, len(appJobs), "app should be blocked by dependency")
+
+	// Step 2: Create destination release
+	destVersion := c.NewDeploymentVersion()
+	destVersion.DeploymentId = destinationDeploymentID
+	engine.PushEvent(ctx, handler.DeploymentVersionCreate, destVersion)
+
+	destJobs := getAgentJobsSortedByNewest(engine, destinationJobAgentID)
+	assert.Equal(t, 1, len(destJobs), "destination should create job")
+
+	// Step 3: Mark destination as successful
+	destJob := destJobs[0]
+	destJobCopy := *destJob
+	destJobCopy.Status = oapi.JobStatusSuccessful
+	completedAt := time.Now()
+	destJobCopy.CompletedAt = &completedAt
+	engine.PushEvent(ctx, handler.JobUpdate, &oapi.JobUpdateEvent{
+		Id:  &destJobCopy.Id,
+		Job: destJobCopy,
+		FieldsToUpdate: &[]oapi.JobUpdateEventFieldsToUpdate{
+			oapi.JobUpdateEventFieldsToUpdateStatus,
+			oapi.JobUpdateEventFieldsToUpdateCompletedAt,
+		},
+	})
+
+	// Step 4: App should now deploy
+	appJobs = getAgentJobsSortedByNewest(engine, appJobAgentID)
+	assert.Equal(t, 1, len(appJobs), "app should deploy after destination succeeds")
+
+	appJob := appJobs[0]
+
+	// CRITICAL ASSERTION: The fix ensures that ArgoCD destination errors
+	// (like "unable to find destination server") are retried instead of
+	// immediately marking the job as InvalidJobAgent.
+	//
+	// Without the fix: Job would be InvalidJobAgent immediately
+	// With the fix: Job will be Pending (and retries happen in background)
+	assert.NotEqual(t, oapi.JobStatusInvalidJobAgent, appJob.Status,
+		"Job should NOT be InvalidJobAgent - ArgoCD destination errors should be retried, not fail immediately")
+
+	// The job should be in a valid state (Pending, InProgress, or eventually Successful)
+	validStatuses := []oapi.JobStatus{
+		oapi.JobStatusPending,
+		oapi.JobStatusInProgress,
+		oapi.JobStatusSuccessful,
+	}
+	assert.Contains(t, validStatuses, appJob.Status,
+		"Job should be in a valid state: %v, got: %v", validStatuses, appJob.Status)
+}


⚠️ Potential issue | 🟡 Minor

Test name doesn't match what is actually being tested.

The test is named ArgoCDRetryBehavior and the comments claim it validates retry behavior when ArgoCD returns "unable to find destination server" errors. However, the test does not:

Simulate an ArgoCD error occurring

Verify that retry logic is triggered

Check retry attempts or retry intervals

The test only validates that after a successful destination deployment, the dependent app job is created in a valid state (not InvalidJobAgent). This validates the outcome but not the retry mechanism itself.

Consider either:

Renaming the test to better reflect what it validates (e.g., TestEngine_PolicyDeploymentDependency_ValidJobStateAfterDependencySuccess)

Enhancing the test to actually simulate ArgoCD errors and verify retry behavior

coderabbitai · 2026-01-06T16:54:38Z

apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go

+// TestEngine_PolicyDeploymentDependency_PolicyTimingIssue demonstrates the race condition:
+//  1. Upstream (destination) job completes successfully
+//  2. Policy evaluator IMMEDIATELY allows downstream deployment
+//     (only checks job.Status == Successful, doesn't verify ArgoCD resource is synced)
+//  3. In production: Downstream dispatch to ArgoCD would fail because
+//     the destination hasn't finished syncing yet
+//
+// This test validates that the policy allows deployment too early,
+// which is the root cause of the "unable to find destination server" errors.
+func TestEngine_PolicyDeploymentDependency_PolicyTimingIssue(t *testing.T) {
+	destinationJobAgentID := "destination-agent"
+	appJobAgentID := "app-agent"
+
+	destinationDeploymentID := "argo-destination"
+	appDeploymentID := "argo-app"
+
+	environmentID := "prod"
+	resourceID := "k8s-cluster"
+	policyID := "app-depends-on-destination"
+
+	engine := integration.NewTestWorkspace(t,
+		integration.WithJobAgent(
+			integration.JobAgentID(destinationJobAgentID),
+		),
+		integration.WithJobAgent(
+			integration.JobAgentID(appJobAgentID),
+		),
+		integration.WithSystem(
+			integration.WithDeployment(
+				integration.DeploymentID(destinationDeploymentID),
+				integration.DeploymentName("argo-destination"),
+				integration.DeploymentJobAgent(destinationJobAgentID),
+				integration.DeploymentCelResourceSelector("true"),
+			),
+			integration.WithDeployment(
+				integration.DeploymentID(appDeploymentID),
+				integration.DeploymentName("argo-app"),
+				integration.DeploymentJobAgent(appJobAgentID),
+				integration.DeploymentCelResourceSelector("true"),
+			),
+			integration.WithEnvironment(
+				integration.EnvironmentID(environmentID),
+				integration.EnvironmentName("prod"),
+				integration.EnvironmentCelResourceSelector("true"),
+			),
+		),
+		integration.WithResource(
+			integration.ResourceID(resourceID),
+		),
+		integration.WithPolicy(
+			integration.PolicyID(policyID),
+			integration.PolicyName("App depends on ArgoCD destination"),
+			integration.WithPolicyTargetSelector(
+				integration.PolicyTargetCelEnvironmentSelector("true"),
+				integration.PolicyTargetCelDeploymentSelector("deployment.id == '"+appDeploymentID+"'"),
+				integration.PolicyTargetCelResourceSelector("true"),
+			),
+			integration.WithPolicyRule(
+				integration.WithRuleDeploymentDependency(
+					integration.DeploymentDependencyRuleDependsOnDeploymentSelector("deployment.id == '"+destinationDeploymentID+"'"),
+				),
+			),
+		),
+	)
+
+	ctx := context.Background()
+
+	// Step 1: Create app version (should be blocked - destination never succeeded)
+	appVersion := c.NewDeploymentVersion()
+	appVersion.DeploymentId = appDeploymentID
+	engine.PushEvent(ctx, handler.DeploymentVersionCreate, appVersion)
+
+	appJobs := getAgentJobsSortedByNewest(engine, appJobAgentID)
+	assert.Equal(t, 0, len(appJobs), "app blocked: destination never succeeded")
+
+	// Step 2: Create destination version
+	destVersion := c.NewDeploymentVersion()
+	destVersion.DeploymentId = destinationDeploymentID
+	engine.PushEvent(ctx, handler.DeploymentVersionCreate, destVersion)
+
+	destJobs := getAgentJobsSortedByNewest(engine, destinationJobAgentID)
+	assert.Equal(t, 1, len(destJobs), "destination job created")
+
+	// Step 3: Mark destination job as SUCCESSFUL
+	// ⚠️ CRITICAL POINT: At this moment, the job succeeded but in production:
+	//    - ArgoCD application was created
+	//    - But ArgoCD destination might not be synced yet (async process)
+	destJob := destJobs[0]
+	destJobCopy := *destJob
+	destJobCopy.Status = oapi.JobStatusSuccessful
+	completedAt := time.Now()
+	destJobCopy.CompletedAt = &completedAt
+	engine.PushEvent(ctx, handler.JobUpdate, &oapi.JobUpdateEvent{
+		Id:  &destJobCopy.Id,
+		Job: destJobCopy,
+		FieldsToUpdate: &[]oapi.JobUpdateEventFieldsToUpdate{
+			oapi.JobUpdateEventFieldsToUpdateStatus,
+			oapi.JobUpdateEventFieldsToUpdateCompletedAt,
+		},
+	})
+
+	// Step 4: ⚠️ RACE CONDITION - This is where the bug manifests ⚠️
+	// Current buggy behavior: Policy evaluator sees job.Status == Successful
+	// and IMMEDIATELY allows downstream deployment without checking if
+	// ArgoCD destination is actually synced.
+
+	// Check immediately after destination success - app should still be blocked
+	appJobs = getAgentJobsSortedByNewest(engine, appJobAgentID)
+
+	// ❌ THIS ASSERTION WILL FAIL - exposing the bug
+	// The policy should NOT allow deployment immediately because ArgoCD
+	// destinations need time to sync. The app job should still be blocked.
+	assert.Equal(t, 0, len(appJobs),
+		"EXPECTED BEHAVIOR: App should remain blocked immediately after destination job success "+
+			"to allow time for ArgoCD destination to sync. "+
+			"ACTUAL BUG: Policy allows deployment immediately, causing 'unable to find destination server' errors")
+
+	// Expected fix: The policy evaluator should either:
+	// 1. Add a delay/grace period after ArgoCD destination jobs succeed, OR
+	// 2. Check ArgoCD sync status before allowing dependent deployments, OR
+	// 3. Mark downstream jobs with a flag to wait for ArgoCD sync
+	//
+	// When the fix is implemented, the above assertion will pass because:
+	// - The policy will add a delay/wait mechanism
+	// - The app job won't be created immediately
+	// - Only after ArgoCD sync is verified will the app job be created
+}


⚠️ Potential issue | 🟠 Major

Intentionally failing test should be skipped or guarded.

The PR description acknowledges this test is expected to fail until a fix is implemented. While documenting the race condition is valuable, committing a failing test directly causes several issues:

CI/CD pipeline failures

Misleading test suite health metrics

Confusion for other developers about whether failures are expected

Violates the principle of keeping the main branch "green"

Recommend one of these approaches:

Option 1: Skip the test with clear documentation

func TestEngine_PolicyDeploymentDependency_PolicyTimingIssue(t *testing.T) { + t.Skip("TODO: Enable after implementing fix for ArgoCD destination sync race condition (Issue #XXX)") destinationJobAgentID := "destination-agent"

Option 2: Use build tags to exclude from default test runs

+//go:build failing +// +build failing + func TestEngine_PolicyDeploymentDependency_PolicyTimingIssue(t *testing.T) {

Then document in the PR or README that go test -tags failing runs known-failing tests.

Do you want me to generate a complete implementation for either option, or would you prefer to enable this test only after the fix is merged?

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

// TestEngine_PolicyDeploymentDependency_PolicyTimingIssue demonstrates the race condition:

// 1. Upstream (destination) job completes successfully

// 2. Policy evaluator IMMEDIATELY allows downstream deployment

// (only checks job.Status == Successful, doesn't verify ArgoCD resource is synced)

// 3. In production: Downstream dispatch to ArgoCD would fail because

// the destination hasn't finished syncing yet

//

// This test validates that the policy allows deployment too early,

// which is the root cause of the "unable to find destination server" errors.

func TestEngine_PolicyDeploymentDependency_PolicyTimingIssue(t *testing.T) {

destinationJobAgentID := "destination-agent"

appJobAgentID := "app-agent"

destinationDeploymentID := "argo-destination"

appDeploymentID := "argo-app"

environmentID := "prod"

resourceID := "k8s-cluster"

policyID := "app-depends-on-destination"

engine := integration.NewTestWorkspace(t,

integration.WithJobAgent(

integration.JobAgentID(destinationJobAgentID),

),

integration.WithJobAgent(

integration.JobAgentID(appJobAgentID),

),

integration.WithSystem(

integration.WithDeployment(

integration.DeploymentID(destinationDeploymentID),

integration.DeploymentName("argo-destination"),

integration.DeploymentJobAgent(destinationJobAgentID),

integration.DeploymentCelResourceSelector("true"),

),

integration.WithDeployment(

integration.DeploymentID(appDeploymentID),

integration.DeploymentName("argo-app"),

integration.DeploymentJobAgent(appJobAgentID),

integration.DeploymentCelResourceSelector("true"),

),

integration.WithEnvironment(

integration.EnvironmentID(environmentID),

integration.EnvironmentName("prod"),

integration.EnvironmentCelResourceSelector("true"),

),

),

integration.WithResource(

integration.ResourceID(resourceID),

),

integration.WithPolicy(

integration.PolicyID(policyID),

integration.PolicyName("App depends on ArgoCD destination"),

integration.WithPolicyTargetSelector(

integration.PolicyTargetCelEnvironmentSelector("true"),

integration.PolicyTargetCelDeploymentSelector("deployment.id == '"+appDeploymentID+"'"),

integration.PolicyTargetCelResourceSelector("true"),

),

integration.WithPolicyRule(

integration.WithRuleDeploymentDependency(

integration.DeploymentDependencyRuleDependsOnDeploymentSelector("deployment.id == '"+destinationDeploymentID+"'"),

),

),

),

)

ctx := context.Background()

// Step 1: Create app version (should be blocked - destination never succeeded)

appVersion := c.NewDeploymentVersion()

appVersion.DeploymentId = appDeploymentID

engine.PushEvent(ctx, handler.DeploymentVersionCreate, appVersion)

appJobs := getAgentJobsSortedByNewest(engine, appJobAgentID)

assert.Equal(t, 0, len(appJobs), "app blocked: destination never succeeded")

// Step 2: Create destination version

destVersion := c.NewDeploymentVersion()

destVersion.DeploymentId = destinationDeploymentID

engine.PushEvent(ctx, handler.DeploymentVersionCreate, destVersion)

destJobs := getAgentJobsSortedByNewest(engine, destinationJobAgentID)

assert.Equal(t, 1, len(destJobs), "destination job created")

// Step 3: Mark destination job as SUCCESSFUL

// ⚠️ CRITICAL POINT: At this moment, the job succeeded but in production:

// - ArgoCD application was created

// - But ArgoCD destination might not be synced yet (async process)

destJob := destJobs[0]

destJobCopy := *destJob

destJobCopy.Status = oapi.JobStatusSuccessful

completedAt := time.Now()

destJobCopy.CompletedAt = &completedAt

engine.PushEvent(ctx, handler.JobUpdate, &oapi.JobUpdateEvent{

Id: &destJobCopy.Id,

Job: destJobCopy,

FieldsToUpdate: &[]oapi.JobUpdateEventFieldsToUpdate{

oapi.JobUpdateEventFieldsToUpdateStatus,

oapi.JobUpdateEventFieldsToUpdateCompletedAt,

},

})

// Step 4: ⚠️ RACE CONDITION - This is where the bug manifests ⚠️

// Current buggy behavior: Policy evaluator sees job.Status == Successful

// and IMMEDIATELY allows downstream deployment without checking if

// ArgoCD destination is actually synced.

// Check immediately after destination success - app should still be blocked

appJobs = getAgentJobsSortedByNewest(engine, appJobAgentID)

// ❌ THIS ASSERTION WILL FAIL - exposing the bug

// The policy should NOT allow deployment immediately because ArgoCD

// destinations need time to sync. The app job should still be blocked.

assert.Equal(t, 0, len(appJobs),

"EXPECTED BEHAVIOR: App should remain blocked immediately after destination job success "+

"to allow time for ArgoCD destination to sync. "+

"ACTUAL BUG: Policy allows deployment immediately, causing 'unable to find destination server' errors")

// Expected fix: The policy evaluator should either:

// 1. Add a delay/grace period after ArgoCD destination jobs succeed, OR

// 2. Check ArgoCD sync status before allowing dependent deployments, OR

// 3. Mark downstream jobs with a flag to wait for ArgoCD sync

//

// When the fix is implemented, the above assertion will pass because:

// - The policy will add a delay/wait mechanism

// - The app job won't be created immediately

// - Only after ArgoCD sync is verified will the app job be created

}

// TestEngine_PolicyDeploymentDependency_PolicyTimingIssue demonstrates the race condition:

// 1. Upstream (destination) job completes successfully

// 2. Policy evaluator IMMEDIATELY allows downstream deployment

// (only checks job.Status == Successful, doesn't verify ArgoCD resource is synced)

// 3. In production: Downstream dispatch to ArgoCD would fail because

// the destination hasn't finished syncing yet

//

// This test validates that the policy allows deployment too early,

// which is the root cause of the "unable to find destination server" errors.

func TestEngine_PolicyDeploymentDependency_PolicyTimingIssue(t *testing.T) {

t.Skip("TODO: Enable after implementing fix for ArgoCD destination sync race condition")

destinationJobAgentID := "destination-agent"

appJobAgentID := "app-agent"

destinationDeploymentID := "argo-destination"

appDeploymentID := "argo-app"

environmentID := "prod"

resourceID := "k8s-cluster"

policyID := "app-depends-on-destination"

engine := integration.NewTestWorkspace(t,

integration.WithJobAgent(

integration.JobAgentID(destinationJobAgentID),

),

integration.WithJobAgent(

integration.JobAgentID(appJobAgentID),

),

integration.WithSystem(

integration.WithDeployment(

integration.DeploymentID(destinationDeploymentID),

integration.DeploymentName("argo-destination"),

integration.DeploymentJobAgent(destinationJobAgentID),

integration.DeploymentCelResourceSelector("true"),

),

integration.WithDeployment(

integration.DeploymentID(appDeploymentID),

integration.DeploymentName("argo-app"),

integration.DeploymentJobAgent(appJobAgentID),

integration.DeploymentCelResourceSelector("true"),

),

integration.WithEnvironment(

integration.EnvironmentID(environmentID),

integration.EnvironmentName("prod"),

integration.EnvironmentCelResourceSelector("true"),

),

),

integration.WithResource(

integration.ResourceID(resourceID),

),

integration.WithPolicy(

integration.PolicyID(policyID),

integration.PolicyName("App depends on ArgoCD destination"),

integration.WithPolicyTargetSelector(

integration.PolicyTargetCelEnvironmentSelector("true"),

integration.PolicyTargetCelDeploymentSelector("deployment.id == '"+appDeploymentID+"'"),

integration.PolicyTargetCelResourceSelector("true"),

),

integration.WithPolicyRule(

integration.WithRuleDeploymentDependency(

integration.DeploymentDependencyRuleDependsOnDeploymentSelector("deployment.id == '"+destinationDeploymentID+"'"),

),

),

),

)

ctx := context.Background()

// Step 1: Create app version (should be blocked - destination never succeeded)

appVersion := c.NewDeploymentVersion()

appVersion.DeploymentId = appDeploymentID

engine.PushEvent(ctx, handler.DeploymentVersionCreate, appVersion)

appJobs := getAgentJobsSortedByNewest(engine, appJobAgentID)

assert.Equal(t, 0, len(appJobs), "app blocked: destination never succeeded")

// Step 2: Create destination version

destVersion := c.NewDeploymentVersion()

destVersion.DeploymentId = destinationDeploymentID

engine.PushEvent(ctx, handler.DeploymentVersionCreate, destVersion)

destJobs := getAgentJobsSortedByNewest(engine, destinationJobAgentID)

assert.Equal(t, 1, len(destJobs), "destination job created")

// Step 3: Mark destination job as SUCCESSFUL

// ⚠️ CRITICAL POINT: At this moment, the job succeeded but in production:

// - ArgoCD application was created

// - But ArgoCD destination might not be synced yet (async process)

destJob := destJobs[0]

destJobCopy := *destJob

destJobCopy.Status = oapi.JobStatusSuccessful

completedAt := time.Now()

destJobCopy.CompletedAt = &completedAt

engine.PushEvent(ctx, handler.JobUpdate, &oapi.JobUpdateEvent{

Id: &destJobCopy.Id,

Job: destJobCopy,

FieldsToUpdate: &[]oapi.JobUpdateEventFieldsToUpdate{

oapi.JobUpdateEventFieldsToUpdateStatus,

oapi.JobUpdateEventFieldsToUpdateCompletedAt,

},

})

// Step 4: ⚠️ RACE CONDITION - This is where the bug manifests ⚠️

// Current buggy behavior: Policy evaluator sees job.Status == Successful

// and IMMEDIATELY allows downstream deployment without checking if

// ArgoCD destination is actually synced.

// Check immediately after destination success - app should still be blocked

appJobs = getAgentJobsSortedByNewest(engine, appJobAgentID)

// ❌ THIS ASSERTION WILL FAIL - exposing the bug

// The policy should NOT allow deployment immediately because ArgoCD

// destinations need time to sync. The app job should still be blocked.

assert.Equal(t, 0, len(appJobs),

"EXPECTED BEHAVIOR: App should remain blocked immediately after destination job success "+

"to allow time for ArgoCD destination to sync. "+

"ACTUAL BUG: Policy allows deployment immediately, causing 'unable to find destination server' errors")

// Expected fix: The policy evaluator should either:

// 1. Add a delay/grace period after ArgoCD destination jobs succeed, OR

// 2. Check ArgoCD sync status before allowing dependent deployments, OR

// 3. Mark downstream jobs with a flag to wait for ArgoCD sync

//

// When the fix is implemented, the above assertion will pass because:

// - The policy will add a delay/wait mechanism

// - The app job won't be created immediately

// - Only after ArgoCD sync is verified will the app job be created

}

🤖 Prompt for AI Agents

In @apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go around lines 368 - 494, The test TestEngine_PolicyDeploymentDependency_PolicyTimingIssue is intentionally failing and should be excluded from default CI; either mark it skipped by adding an early t.Skip("known failing: race condition - see PR") at the top of TestEngine_PolicyDeploymentDependency_PolicyTimingIssue, or move the test behind a build tag by adding a file-level build constraint (e.g., //go:build failing) and document that it runs with `go test -tags failing`; update the test comment to explain which approach was used and reference the test name so future maintainers can find and re-enable it when the policy fix is implemented.

fix(workspace-engine): Argo race condition

6a3e5f1

zacharyblasczyk requested a review from jsbroks January 5, 2026 21:12

format fix

8ff5556

coderabbitai bot reviewed Jan 5, 2026

View reviewed changes

zacharyblasczyk marked this pull request as draft January 5, 2026 21:36

zacharyblasczyk added 2 commits January 6, 2026 10:25

document the bug, test passes while bug exsists

66ea511

a failing test

5df4b69

zacharyblasczyk marked this pull request as ready for review January 6, 2026 16:47

format fix

e959e64

coderabbitai bot reviewed Jan 6, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(workspace-engine): Argo race condition #735

fix(workspace-engine): Argo race condition #735

Uh oh!

zacharyblasczyk commented Jan 5, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Jan 5, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Poem

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Jan 6, 2026

Uh oh!

coderabbitai bot Jan 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

fix(workspace-engine): Argo race condition #735

Are you sure you want to change the base?

fix(workspace-engine): Argo race condition #735

Uh oh!

Conversation

zacharyblasczyk commented Jan 5, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Add failing test to reproduce ArgoCD deployment dependency race condition

Summary

Changes

Test Output

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Jan 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Poem

Pre-merge checks and finishing touches

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jan 6, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jan 6, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

zacharyblasczyk commented Jan 5, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Jan 5, 2026 •

edited

Loading