Skip to content

Conversation

@zacharyblasczyk
Copy link
Member

@zacharyblasczyk zacharyblasczyk commented Jan 5, 2026

Add failing test to reproduce ArgoCD deployment dependency race condition

Summary

This PR adds a failing test that reproduces a production bug where ArgoCD applications fail with "unable to find destination server" when deployed immediately after their dependent destination completes.

The race condition: The deployment dependency policy allows downstream deployments as soon as the upstream job succeeds, but ArgoCD destinations require additional time to sync asynchronously.

sequenceDiagram
    participant Policy as Dependency Policy
    participant Engine as Workspace Engine
    participant ArgoCD as ArgoCD API
    
    Note over Engine,ArgoCD: 1. Deploy Destination
    Engine->>ArgoCD: Create Destination App
    ArgoCD-->>Engine: Job Status: Successful ✅
    
    Note over Policy: Dependency satisfied!<br/>Allow downstream deploy
    
    Note over Engine,ArgoCD: 2. Deploy App (IMMEDIATE)
    Policy->>Engine: Unblock downstream deployment
    Engine->>ArgoCD: Create App using destination
    ArgoCD-->>Engine: ❌ Error: unable to find destination server
    
    Note over ArgoCD: Destination still syncing...<br/>(async background process)
    
    Note over Engine: Job marked InvalidJobAgent ❌
    
    Note over ArgoCD: Destination now available<br/>(too late)
Loading

Changes

  1. TestEngine_PolicyDeploymentDependency_PolicyTimingIssue - Failing test that exposes the bug (expects 0 jobs, gets 1 job created immediately)
  2. Error message capture in executor.go - Store dispatch errors in job.Message for better debugging
  3. Additional test coverage - AutoDeploy and ArgoCDRetryBehavior tests documenting expected behavior
  4. Unit tests - 30 test cases for error classification logic

Test Output

$ go test -v ./test/e2e -run TestEngine_PolicyDeploymentDependency

=== RUN   TestEngine_PolicyDeploymentDependency_PolicyTimingIssue
2026/01/06 10:40:23 INFO Processing event eventType=job-agent.created
2026/01/06 10:40:23 INFO Processing event eventType=job-agent.created
2026/01/06 10:40:23 INFO Processing event eventType=system.created
2026/01/06 10:40:23 INFO Processing event eventType=deployment.created
2026/01/06 10:40:23 INFO Processing event eventType=deployment.created
2026/01/06 10:40:23 INFO Processing event eventType=environment.created
2026/01/06 10:40:23 INFO Processing event eventType=resource.created
2026/01/06 10:40:23 INFO Processing event eventType=policy.created
2026/01/06 10:40:23 INFO Policy created - reconciling affected targets policy_id=app-depends-on-destination affected_targets_count=1
2026/01/06 10:40:23 INFO Processing event eventType=deployment-version.created
2026/01/06 10:40:23 INFO Processing event eventType=deployment-version.created
2026/01/06 10:40:23 INFO Processing event eventType=job.updated
    engine_policy_deployment_dependency_test.go:481: 
                Error Trace:    /Users/zb/github/ctrlplanedev/ctrlplane/apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go:481
                Error:          Not equal: 
                                expected: 0
                                actual  : 1
                Test:           TestEngine_PolicyDeploymentDependency_PolicyTimingIssue
                Messages:       EXPECTED BEHAVIOR: App should remain blocked immediately after destination job success to allow time for ArgoCD destination to sync. ACTUAL BUG: Policy allows deployment immediately, causing 'unable to find destination server' errors
--- FAIL: TestEngine_PolicyDeploymentDependency_PolicyTimingIssue (0.01s)
FAIL
FAIL    workspace-engine/test/e2e       1.303s
FAIL

This failure is intentional - the test will pass once we implement a fix.

Summary by CodeRabbit

  • Bug Fixes

    • Improved error handling for failed job dispatches: logs include captured error text and jobs now record updated status, message, and timestamp.
  • Tests

    • Added unit tests and a benchmark for error classification and retry logic.
    • Added end-to-end tests covering policy-driven deployment dependencies, automatic downstream deployment, retry behavior for transient destination errors, and a documented timing/race scenario.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Jan 5, 2026

📝 Walkthrough

Walkthrough

Adds unit and E2E tests for ArgoCD error classification and deployment-dependency behavior, and updates async job dispatch error handling to set job status, message, and updated timestamp when dispatch fails.

Changes

Cohort / File(s) Summary
ArgoCD Error Classification Tests
apps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.go
New table-driven unit test TestIsRetryableError covering nil, HTTP/transient, ArgoCD-specific, and non-retryable errors; adds BenchmarkIsRetryableError.
Async Job Dispatch Error Handling
apps/workspace-engine/pkg/workspace/releasemanager/deployment/executor.go
On async dispatch error: capture error message, set newJob.Status = InvalidJobAgent, assign newJob.Message, update newJob.UpdatedAt, and persist the job (previously only logged the error).
Deployment Dependency E2E Tests
apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go
Added three E2E tests—TestEngine_PolicyDeploymentDependency_AutoDeploy, TestEngine_PolicyDeploymentDependency_ArgoCDRetryBehavior, and TestEngine_PolicyDeploymentDependency_PolicyTimingIssue—exercising policy-driven deployment ordering, ArgoCD retry behavior, and a timing/race scenario.

Sequence Diagram(s)

(Skipped — changes are test additions and a small error-handling update that do not introduce a new multi-component control flow requiring visualization.)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Poem

🐰 I hopped through logs both day and night,
Caught retryable errors in soft moonlight.
Dependencies queued in tidy rows,
Messages set where failure goes—
Now jobs tumble forward, futures bright. ✨

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'fix(workspace-engine): Argo race condition' accurately summarizes the main objective of the PR, which is to fix a production race condition related to ArgoCD timing during policy-driven deployment dependencies.
Docstring Coverage ✅ Passed Docstring coverage is 85.71% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
apps/workspace-engine/pkg/workspace/jobdispatch/argocd.go (2)

83-99: Optimize by moving retryablePatterns to package level.

The retryablePatterns slice is allocated on every call to isRetryableError. Since these patterns are static, consider declaring it as a package-level variable to avoid repeated allocations, especially given that this function is called in a retry loop under load.

🔎 Proposed refactor
+// retryablePatterns defines error patterns indicating transient failures worth retrying
+var retryablePatterns = []string{
+	// HTTP status codes indicating transient failures
+	"502", "503", "504",
+	"connection refused",
+	"connection reset",
+	"timeout",
+	"temporarily unavailable",
+	// ArgoCD destination/cluster errors (race condition when destination is being synced)
+	"cluster not found",
+	"destination not found",
+	"cluster does not exist",
+	"destination does not exist",
+	"destination server not found",
+	"server not found",
+	"invalid destination",
+	"unknown cluster",
+}
+
 // isRetryableError checks if an error is a transient error that should be retried
 func isRetryableError(err error) bool {
 	if err == nil {
 		return false
 	}
 	
 	errStr := strings.ToLower(err.Error())
 	
-	// Patterns that indicate transient failures worth retrying
-	retryablePatterns := []string{
-		// HTTP status codes indicating transient failures
-		"502", "503", "504",
-		"connection refused",
-		"connection reset",
-		"timeout",
-		"temporarily unavailable",
-		// ArgoCD destination/cluster errors (race condition when destination is being synced)
-		"cluster not found",
-		"destination not found",
-		"cluster does not exist",
-		"destination does not exist",
-		"destination server not found",
-		"server not found",
-		"invalid destination",
-		"unknown cluster",
-	}
-	
 	for _, pattern := range retryablePatterns {
 		if strings.Contains(errStr, pattern) {
 			return true
 		}
 	}
 	
 	return false
 }

96-97: Broad patterns may match unintended errors.

The patterns "server not found" and "invalid destination" are relatively generic and could theoretically match non-ArgoCD errors (e.g., "DNS server not found" or file system destination errors). However, given that this function is only called within the ArgoCD dispatcher context, the risk of false positives is low. Consider adding a comment clarifying that these patterns are scoped to ArgoCD errors.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 37a5856 and 6a3e5f1.

📒 Files selected for processing (3)
  • apps/workspace-engine/pkg/workspace/jobdispatch/argocd.go
  • apps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.go
  • apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go
🧰 Additional context used
📓 Path-based instructions (2)
apps/workspace-engine/**/*.go

📄 CodeRabbit inference engine (apps/workspace-engine/CLAUDE.md)

apps/workspace-engine/**/*.go: Do not add extraneous inline comments that state the obvious
Do not add comments that simply restate what the code does
Do not add comments for standard Go patterns (e.g., noting WaitGroup or semaphore usage)
Write comments that explain why, document complex logic/algorithms, provide non-obvious context, include TODO/FIXME, and document exported functions/types/methods

Files:

  • apps/workspace-engine/pkg/workspace/jobdispatch/argocd.go
  • apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go
  • apps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.go
apps/workspace-engine/**/*_test.go

📄 CodeRabbit inference engine (apps/workspace-engine/CLAUDE.md)

Follow the existing test structure used in *_test.go files

Files:

  • apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go
  • apps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.go
🧠 Learnings (6)
📚 Learning: 2025-08-12T18:13:54.630Z
Learnt from: CR
Repo: ctrlplanedev/ctrlplane PR: 0
File: apps/workspace-engine/CLAUDE.md:0-0
Timestamp: 2025-08-12T18:13:54.630Z
Learning: Applies to apps/workspace-engine/**/*_test.go : Follow the existing test structure used in *_test.go files

Applied to files:

  • apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go
  • apps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.go
📚 Learning: 2025-08-12T18:13:54.630Z
Learnt from: CR
Repo: ctrlplanedev/ctrlplane PR: 0
File: apps/workspace-engine/CLAUDE.md:0-0
Timestamp: 2025-08-12T18:13:54.630Z
Learning: Applies to apps/workspace-engine/pkg/model/selector/**/*_test.go : Write comprehensive, data-driven tests for new condition types

Applied to files:

  • apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go
  • apps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.go
📚 Learning: 2025-08-12T18:13:54.630Z
Learnt from: CR
Repo: ctrlplanedev/ctrlplane PR: 0
File: apps/workspace-engine/CLAUDE.md:0-0
Timestamp: 2025-08-12T18:13:54.630Z
Learning: Applies to apps/workspace-engine/pkg/model/selector/**/*_test.go : Test validation and matching logic separately for condition types

Applied to files:

  • apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go
📚 Learning: 2025-10-24T00:02:29.723Z
Learnt from: adityachoudhari26
Repo: ctrlplanedev/ctrlplane PR: 696
File: apps/workspace-engine/pkg/workspace/releasemanager/deployment/executor.go:63-67
Timestamp: 2025-10-24T00:02:29.723Z
Learning: In workspace-engine replay mode (apps/workspace-engine/pkg/workspace/releasemanager/deployment/executor.go), jobs are created with Pending status even during replay because job updates are still accepted and processed. If a job remains Pending after replay completes, it genuinely is still pending because no updates occurred during replay.

Applied to files:

  • apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go
📚 Learning: 2025-08-12T18:13:54.630Z
Learnt from: CR
Repo: ctrlplanedev/ctrlplane PR: 0
File: apps/workspace-engine/CLAUDE.md:0-0
Timestamp: 2025-08-12T18:13:54.630Z
Learning: Applies to apps/workspace-engine/pkg/model/selector/**/*_test.go : Use table-driven tests for all condition types

Applied to files:

  • apps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.go
📚 Learning: 2025-08-12T18:13:54.630Z
Learnt from: CR
Repo: ctrlplanedev/ctrlplane PR: 0
File: apps/workspace-engine/CLAUDE.md:0-0
Timestamp: 2025-08-12T18:13:54.630Z
Learning: Applies to apps/workspace-engine/**/*.go : Write comments that explain why, document complex logic/algorithms, provide non-obvious context, include TODO/FIXME, and document exported functions/types/methods

Applied to files:

  • apps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.go
🧬 Code graph analysis (1)
apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go (4)
apps/workspace-engine/test/integration/workspace.go (1)
  • NewTestWorkspace (38-66)
apps/workspace-engine/test/integration/creators/deployment_version.go (1)
  • NewDeploymentVersion (15-36)
apps/workspace-engine/pkg/events/handler/handler.go (2)
  • DeploymentVersionCreate (44-44)
  • JobUpdate (68-68)
apps/workspace-engine/pkg/oapi/oapi.gen.go (3)
  • JobStatusSuccessful (135-135)
  • Job (657-671)
  • JobStatusInvalidJobAgent (132-132)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: workspace-engine-tests
  • GitHub Check: Lint
  • GitHub Check: Typecheck
  • GitHub Check: build (linux/amd64)
  • GitHub Check: tests
🔇 Additional comments (2)
apps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.go (1)

12-181: Excellent test coverage with proper table-driven structure.

The test comprehensively validates error classification across all retryable and non-retryable patterns, including important edge cases like:

  • Case-insensitive matching (lines 68-71, 120-123)
  • Specificity validation (lines 162-170) ensuring "cluster" and "destination" in different contexts aren't false positives
  • Realistic ArgoCD error messages (lines 108-123)

The table-driven approach aligns with project conventions and makes the test easy to extend.

Based on learnings, this follows the existing test structure used in *_test.go files.

apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go (1)

120-225: Verify that the E2E test exercises the retry logic.

The test validates that downstream jobs don't immediately fail with InvalidJobAgent, but it doesn't appear to:

  1. Simulate ArgoCD returning "cluster not found" or similar transient errors
  2. Verify that retries actually occur with exponential backoff
  3. Use a mock ArgoCD client that returns errors on initial attempts then succeeds

The test primarily validates the deployment dependency flow rather than the ArgoCD retry mechanism introduced in this PR. Consider verifying whether:

  • A mock ArgoCD client is configured elsewhere in the test setup that simulates the race condition
  • The retry behavior is actually exercised during test execution
  • Additional assertions could confirm retry attempts occurred (e.g., checking logs, retry counts, or timing)

If the retry logic isn't being exercised, you might want to enhance the test to use a mock ArgoCD client that fails with retryable errors on the first N attempts, then succeeds, to validate the fix end-to-end.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
apps/workspace-engine/pkg/workspace/jobdispatch/argocd.go (1)

74-108: Solid implementation with comprehensive test coverage.

The refactoring correctly handles ArgoCD sync race conditions with proper case-insensitive pattern matching. All patterns—including HTTP status codes and ArgoCD-specific errors—are validated by comprehensive test cases in argocd_test.go, confirming they match realistic error messages from the ArgoCD library.

The HTTP patterns ("502", "503", "504") are tested within full error messages like "HTTP 502 Bad Gateway", and false positives are further mitigated by the limited retry policy (5 attempts with exponential backoff). However, if increased specificity is desired, consider refining to patterns like "status 502", "http 502", or "502 bad gateway" for additional precision.

Optional: More specific HTTP patterns
 	retryablePatterns := []string{
-		// HTTP status codes indicating transient failures
-		"502", "503", "504",
+		// HTTP status codes indicating transient failures
+		"status 502", "status 503", "status 504",
+		"http 502", "http 503", "http 504",
 		"connection refused",
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 6a3e5f1 and 8ff5556.

📒 Files selected for processing (1)
  • apps/workspace-engine/pkg/workspace/jobdispatch/argocd.go
🧰 Additional context used
📓 Path-based instructions (1)
apps/workspace-engine/**/*.go

📄 CodeRabbit inference engine (apps/workspace-engine/CLAUDE.md)

apps/workspace-engine/**/*.go: Do not add extraneous inline comments that state the obvious
Do not add comments that simply restate what the code does
Do not add comments for standard Go patterns (e.g., noting WaitGroup or semaphore usage)
Write comments that explain why, document complex logic/algorithms, provide non-obvious context, include TODO/FIXME, and document exported functions/types/methods

Files:

  • apps/workspace-engine/pkg/workspace/jobdispatch/argocd.go
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Lint
  • GitHub Check: tests
  • GitHub Check: build (linux/amd64)

@zacharyblasczyk zacharyblasczyk marked this pull request as draft January 5, 2026 21:36
@zacharyblasczyk zacharyblasczyk marked this pull request as ready for review January 6, 2026 16:47
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
apps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.go (1)

142-158: Benchmark uses error messages not covered by unit tests.

The benchmark includes "cluster not found" and "destination does not exist" which aren't present in the unit test cases. Consider either adding these to the unit test table or using the same error set for consistency.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 8ff5556 and 5df4b69.

📒 Files selected for processing (3)
  • apps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.go
  • apps/workspace-engine/pkg/workspace/releasemanager/deployment/executor.go
  • apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go
🧰 Additional context used
📓 Path-based instructions (2)
apps/workspace-engine/**/*.go

📄 CodeRabbit inference engine (apps/workspace-engine/CLAUDE.md)

apps/workspace-engine/**/*.go: Do not add extraneous inline comments that state the obvious
Do not add comments that simply restate what the code does
Do not add comments for standard Go patterns (e.g., noting WaitGroup or semaphore usage)
Write comments that explain why, document complex logic/algorithms, provide non-obvious context, include TODO/FIXME, and document exported functions/types/methods

Files:

  • apps/workspace-engine/pkg/workspace/releasemanager/deployment/executor.go
  • apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go
  • apps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.go
apps/workspace-engine/**/*_test.go

📄 CodeRabbit inference engine (apps/workspace-engine/CLAUDE.md)

Follow the existing test structure used in *_test.go files

Files:

  • apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go
  • apps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.go
🧠 Learnings (6)
📚 Learning: 2025-10-24T00:02:29.723Z
Learnt from: adityachoudhari26
Repo: ctrlplanedev/ctrlplane PR: 696
File: apps/workspace-engine/pkg/workspace/releasemanager/deployment/executor.go:63-67
Timestamp: 2025-10-24T00:02:29.723Z
Learning: In workspace-engine replay mode (apps/workspace-engine/pkg/workspace/releasemanager/deployment/executor.go), jobs are created with Pending status even during replay because job updates are still accepted and processed. If a job remains Pending after replay completes, it genuinely is still pending because no updates occurred during replay.

Applied to files:

  • apps/workspace-engine/pkg/workspace/releasemanager/deployment/executor.go
  • apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go
📚 Learning: 2025-08-12T18:13:54.630Z
Learnt from: CR
Repo: ctrlplanedev/ctrlplane PR: 0
File: apps/workspace-engine/CLAUDE.md:0-0
Timestamp: 2025-08-12T18:13:54.630Z
Learning: Applies to apps/workspace-engine/**/*.go : Write comments that explain why, document complex logic/algorithms, provide non-obvious context, include TODO/FIXME, and document exported functions/types/methods

Applied to files:

  • apps/workspace-engine/pkg/workspace/releasemanager/deployment/executor.go
  • apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go
  • apps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.go
📚 Learning: 2025-08-12T18:13:54.630Z
Learnt from: CR
Repo: ctrlplanedev/ctrlplane PR: 0
File: apps/workspace-engine/CLAUDE.md:0-0
Timestamp: 2025-08-12T18:13:54.630Z
Learning: Applies to apps/workspace-engine/**/*_test.go : Follow the existing test structure used in *_test.go files

Applied to files:

  • apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go
  • apps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.go
📚 Learning: 2025-08-12T18:13:54.630Z
Learnt from: CR
Repo: ctrlplanedev/ctrlplane PR: 0
File: apps/workspace-engine/CLAUDE.md:0-0
Timestamp: 2025-08-12T18:13:54.630Z
Learning: Applies to apps/workspace-engine/pkg/model/selector/**/*_test.go : Write comprehensive, data-driven tests for new condition types

Applied to files:

  • apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go
  • apps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.go
📚 Learning: 2025-08-12T20:49:05.086Z
Learnt from: adityachoudhari26
Repo: ctrlplanedev/ctrlplane PR: 655
File: apps/workspace-engine/pkg/engine/workspace/fluent.go:166-171
Timestamp: 2025-08-12T20:49:05.086Z
Learning: The UpdateDeploymentVersions() method with OperationCreate case in apps/workspace-engine/pkg/engine/workspace/fluent.go is specifically designed only for creating new deployment versions, not for handling potential duplicates or existing versions.

Applied to files:

  • apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go
📚 Learning: 2025-08-12T18:13:54.630Z
Learnt from: CR
Repo: ctrlplanedev/ctrlplane PR: 0
File: apps/workspace-engine/CLAUDE.md:0-0
Timestamp: 2025-08-12T18:13:54.630Z
Learning: Applies to apps/workspace-engine/pkg/model/selector/**/*_test.go : Use table-driven tests for all condition types

Applied to files:

  • apps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.go
🧬 Code graph analysis (1)
apps/workspace-engine/pkg/workspace/releasemanager/deployment/executor.go (2)
apps/workspace-engine/pkg/workspace/releasemanager/trace/types.go (1)
  • Status (68-68)
apps/workspace-engine/pkg/oapi/oapi.gen.go (1)
  • JobStatusInvalidJobAgent (132-132)
🪛 GitHub Actions: Apps / Workspace Engine
apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go

[error] 481-481: Engine policy timing issue: expected deployment to remain blocked (timing 0), but was allowed to proceed (1). This causes 'unable to find destination server' errors


[error] 1-1: End-to-end test suite failed for workspace-engine/test/e2e (non-zero exit)

apps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.go

[error] 135-135: TestIsRetryableError/unable_to_find_destination_server: expected isRetryableError(...) = true, got false


[error] 135-135: TestIsRetryableError/application_destination_spec_is_invalid: expected isRetryableError(...) = true, got false


[error] 135-135: TestIsRetryableError/argocd_rpc_error_-_destination_spec_invalid: expected isRetryableError(...) = true, got false


[error] 135-135: TestIsRetryableError/mixed_case_-_unable_to_find_destination: expected isRetryableError(...) = true, got false

🔇 Additional comments (5)
apps/workspace-engine/pkg/workspace/releasemanager/deployment/executor.go (1)

128-135: LGTM - Error message capture improves debugging.

Storing the dispatch error in job.Message provides valuable debugging context for InvalidJobAgent failures, which aligns with the PR objective of better surfacing ArgoCD race conditions.

apps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.go (1)

10-140: Well-structured test-first approach for documenting expected behavior.

The table-driven tests correctly document the expected retry classification. The failing test cases (lines 63-82) for ArgoCD destination errors are intentionally failing per the PR description—they will pass once isRetryableError is updated to classify "unable to find destination server" errors as retryable.

apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go (3)

120-248: LGTM - Comprehensive test for auto-deploy behavior.

The test clearly documents the expected dependency flow with well-structured steps and assertions. The step-by-step comments make the test intent clear.


250-366: LGTM - Documents expected retry behavior for ArgoCD errors.

The test correctly asserts that jobs should not immediately fail with InvalidJobAgent status when ArgoCD returns transient destination errors. The critical assertion at line 355 validates the fix once implemented.


368-495: Intentionally failing test effectively documents the race condition.

This test reproduces the production bug where the policy evaluator allows downstream deployment immediately after upstream job success, without waiting for ArgoCD destination sync. The extensive comments (lines 486-494) outlining potential fix approaches are helpful for implementation guidance.

The failing assertion at line 481 will pass once a timing/sync mechanism is added to the policy evaluator.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI Agents
In @apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go:
- Around line 368-494: The test
TestEngine_PolicyDeploymentDependency_PolicyTimingIssue is intentionally failing
and should be excluded from default CI; either mark it skipped by adding an
early t.Skip("known failing: race condition - see PR") at the top of
TestEngine_PolicyDeploymentDependency_PolicyTimingIssue, or move the test behind
a build tag by adding a file-level build constraint (e.g., //go:build failing)
and document that it runs with `go test -tags failing`; update the test comment
to explain which approach was used and reference the test name so future
maintainers can find and re-enable it when the policy fix is implemented.
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 5df4b69 and e959e64.

📒 Files selected for processing (2)
  • apps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.go
  • apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • apps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.go
🧰 Additional context used
📓 Path-based instructions (2)
apps/workspace-engine/**/*.go

📄 CodeRabbit inference engine (apps/workspace-engine/CLAUDE.md)

apps/workspace-engine/**/*.go: Do not add extraneous inline comments that state the obvious
Do not add comments that simply restate what the code does
Do not add comments for standard Go patterns (e.g., noting WaitGroup or semaphore usage)
Write comments that explain why, document complex logic/algorithms, provide non-obvious context, include TODO/FIXME, and document exported functions/types/methods

Files:

  • apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go
apps/workspace-engine/**/*_test.go

📄 CodeRabbit inference engine (apps/workspace-engine/CLAUDE.md)

Follow the existing test structure used in *_test.go files

Files:

  • apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go
🧠 Learnings (6)
📓 Common learnings
Learnt from: adityachoudhari26
Repo: ctrlplanedev/ctrlplane PR: 696
File: apps/workspace-engine/pkg/workspace/releasemanager/deployment/executor.go:63-67
Timestamp: 2025-10-24T00:02:29.723Z
Learning: In workspace-engine replay mode (apps/workspace-engine/pkg/workspace/releasemanager/deployment/executor.go), jobs are created with Pending status even during replay because job updates are still accepted and processed. If a job remains Pending after replay completes, it genuinely is still pending because no updates occurred during replay.
📚 Learning: 2025-08-12T18:13:54.630Z
Learnt from: CR
Repo: ctrlplanedev/ctrlplane PR: 0
File: apps/workspace-engine/CLAUDE.md:0-0
Timestamp: 2025-08-12T18:13:54.630Z
Learning: Applies to apps/workspace-engine/**/*_test.go : Follow the existing test structure used in *_test.go files

Applied to files:

  • apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go
📚 Learning: 2025-08-12T18:13:54.630Z
Learnt from: CR
Repo: ctrlplanedev/ctrlplane PR: 0
File: apps/workspace-engine/CLAUDE.md:0-0
Timestamp: 2025-08-12T18:13:54.630Z
Learning: Applies to apps/workspace-engine/pkg/model/selector/**/*_test.go : Write comprehensive, data-driven tests for new condition types

Applied to files:

  • apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go
📚 Learning: 2025-08-12T18:13:54.630Z
Learnt from: CR
Repo: ctrlplanedev/ctrlplane PR: 0
File: apps/workspace-engine/CLAUDE.md:0-0
Timestamp: 2025-08-12T18:13:54.630Z
Learning: Applies to apps/workspace-engine/pkg/model/selector/**/*_test.go : Test validation and matching logic separately for condition types

Applied to files:

  • apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go
📚 Learning: 2025-08-12T20:49:05.086Z
Learnt from: adityachoudhari26
Repo: ctrlplanedev/ctrlplane PR: 655
File: apps/workspace-engine/pkg/engine/workspace/fluent.go:166-171
Timestamp: 2025-08-12T20:49:05.086Z
Learning: The UpdateDeploymentVersions() method with OperationCreate case in apps/workspace-engine/pkg/engine/workspace/fluent.go is specifically designed only for creating new deployment versions, not for handling potential duplicates or existing versions.

Applied to files:

  • apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go
📚 Learning: 2025-10-24T00:02:29.723Z
Learnt from: adityachoudhari26
Repo: ctrlplanedev/ctrlplane PR: 696
File: apps/workspace-engine/pkg/workspace/releasemanager/deployment/executor.go:63-67
Timestamp: 2025-10-24T00:02:29.723Z
Learning: In workspace-engine replay mode (apps/workspace-engine/pkg/workspace/releasemanager/deployment/executor.go), jobs are created with Pending status even during replay because job updates are still accepted and processed. If a job remains Pending after replay completes, it genuinely is still pending because no updates occurred during replay.

Applied to files:

  • apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Typecheck
  • GitHub Check: tests
  • GitHub Check: Lint
  • GitHub Check: build (linux/amd64)
🔇 Additional comments (1)
apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go (1)

120-248: LGTM! Well-structured test for auto-deploy behavior.

The test comprehensively validates the deployment dependency workflow: blocking behavior, dependency satisfaction, and automatic unblocking. The step-by-step comments effectively document the expected behavior without being redundant.

Comment on lines +250 to +366
// TestEngine_PolicyDeploymentDependency_ArgoCDRetryBehavior validates that:
// When ArgoCD returns "unable to find destination server" errors, jobs are retried
// instead of immediately being marked as InvalidJobAgent.
// This tests the fix for the race condition where ArgoCD hasn't synced the destination yet.
func TestEngine_PolicyDeploymentDependency_ArgoCDRetryBehavior(t *testing.T) {
destinationJobAgentID := "destination-agent"
appJobAgentID := "app-agent"

destinationDeploymentID := "argo-destination"
appDeploymentID := "argo-app"

environmentID := "prod"
resourceID := "k8s-cluster"
policyID := "app-depends-on-destination"

engine := integration.NewTestWorkspace(t,
integration.WithJobAgent(
integration.JobAgentID(destinationJobAgentID),
),
integration.WithJobAgent(
integration.JobAgentID(appJobAgentID),
),
integration.WithSystem(
integration.WithDeployment(
integration.DeploymentID(destinationDeploymentID),
integration.DeploymentName("argo-destination"),
integration.DeploymentJobAgent(destinationJobAgentID),
integration.DeploymentCelResourceSelector("true"),
),
integration.WithDeployment(
integration.DeploymentID(appDeploymentID),
integration.DeploymentName("argo-app"),
integration.DeploymentJobAgent(appJobAgentID),
integration.DeploymentCelResourceSelector("true"),
),
integration.WithEnvironment(
integration.EnvironmentID(environmentID),
integration.EnvironmentName("prod"),
integration.EnvironmentCelResourceSelector("true"),
),
),
integration.WithResource(
integration.ResourceID(resourceID),
),
integration.WithPolicy(
integration.PolicyID(policyID),
integration.PolicyName("App depends on ArgoCD destination"),
integration.WithPolicyTargetSelector(
integration.PolicyTargetCelEnvironmentSelector("true"),
integration.PolicyTargetCelDeploymentSelector("deployment.id == '"+appDeploymentID+"'"),
integration.PolicyTargetCelResourceSelector("true"),
),
integration.WithPolicyRule(
integration.WithRuleDeploymentDependency(
integration.DeploymentDependencyRuleDependsOnDeploymentSelector("deployment.id == '"+destinationDeploymentID+"'"),
),
),
),
)

ctx := context.Background()

// Step 1: Create app release (blocked - destination never succeeded)
appVersion := c.NewDeploymentVersion()
appVersion.DeploymentId = appDeploymentID
engine.PushEvent(ctx, handler.DeploymentVersionCreate, appVersion)

appJobs := getAgentJobsSortedByNewest(engine, appJobAgentID)
assert.Equal(t, 0, len(appJobs), "app should be blocked by dependency")

// Step 2: Create destination release
destVersion := c.NewDeploymentVersion()
destVersion.DeploymentId = destinationDeploymentID
engine.PushEvent(ctx, handler.DeploymentVersionCreate, destVersion)

destJobs := getAgentJobsSortedByNewest(engine, destinationJobAgentID)
assert.Equal(t, 1, len(destJobs), "destination should create job")

// Step 3: Mark destination as successful
destJob := destJobs[0]
destJobCopy := *destJob
destJobCopy.Status = oapi.JobStatusSuccessful
completedAt := time.Now()
destJobCopy.CompletedAt = &completedAt
engine.PushEvent(ctx, handler.JobUpdate, &oapi.JobUpdateEvent{
Id: &destJobCopy.Id,
Job: destJobCopy,
FieldsToUpdate: &[]oapi.JobUpdateEventFieldsToUpdate{
oapi.JobUpdateEventFieldsToUpdateStatus,
oapi.JobUpdateEventFieldsToUpdateCompletedAt,
},
})

// Step 4: App should now deploy
appJobs = getAgentJobsSortedByNewest(engine, appJobAgentID)
assert.Equal(t, 1, len(appJobs), "app should deploy after destination succeeds")

appJob := appJobs[0]

// CRITICAL ASSERTION: The fix ensures that ArgoCD destination errors
// (like "unable to find destination server") are retried instead of
// immediately marking the job as InvalidJobAgent.
//
// Without the fix: Job would be InvalidJobAgent immediately
// With the fix: Job will be Pending (and retries happen in background)
assert.NotEqual(t, oapi.JobStatusInvalidJobAgent, appJob.Status,
"Job should NOT be InvalidJobAgent - ArgoCD destination errors should be retried, not fail immediately")

// The job should be in a valid state (Pending, InProgress, or eventually Successful)
validStatuses := []oapi.JobStatus{
oapi.JobStatusPending,
oapi.JobStatusInProgress,
oapi.JobStatusSuccessful,
}
assert.Contains(t, validStatuses, appJob.Status,
"Job should be in a valid state: %v, got: %v", validStatuses, appJob.Status)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Test name doesn't match what is actually being tested.

The test is named ArgoCDRetryBehavior and the comments claim it validates retry behavior when ArgoCD returns "unable to find destination server" errors. However, the test does not:

  • Simulate an ArgoCD error occurring
  • Verify that retry logic is triggered
  • Check retry attempts or retry intervals

The test only validates that after a successful destination deployment, the dependent app job is created in a valid state (not InvalidJobAgent). This validates the outcome but not the retry mechanism itself.

Consider either:

  1. Renaming the test to better reflect what it validates (e.g., TestEngine_PolicyDeploymentDependency_ValidJobStateAfterDependencySuccess)
  2. Enhancing the test to actually simulate ArgoCD errors and verify retry behavior

Comment on lines +368 to +494
// TestEngine_PolicyDeploymentDependency_PolicyTimingIssue demonstrates the race condition:
// 1. Upstream (destination) job completes successfully
// 2. Policy evaluator IMMEDIATELY allows downstream deployment
// (only checks job.Status == Successful, doesn't verify ArgoCD resource is synced)
// 3. In production: Downstream dispatch to ArgoCD would fail because
// the destination hasn't finished syncing yet
//
// This test validates that the policy allows deployment too early,
// which is the root cause of the "unable to find destination server" errors.
func TestEngine_PolicyDeploymentDependency_PolicyTimingIssue(t *testing.T) {
destinationJobAgentID := "destination-agent"
appJobAgentID := "app-agent"

destinationDeploymentID := "argo-destination"
appDeploymentID := "argo-app"

environmentID := "prod"
resourceID := "k8s-cluster"
policyID := "app-depends-on-destination"

engine := integration.NewTestWorkspace(t,
integration.WithJobAgent(
integration.JobAgentID(destinationJobAgentID),
),
integration.WithJobAgent(
integration.JobAgentID(appJobAgentID),
),
integration.WithSystem(
integration.WithDeployment(
integration.DeploymentID(destinationDeploymentID),
integration.DeploymentName("argo-destination"),
integration.DeploymentJobAgent(destinationJobAgentID),
integration.DeploymentCelResourceSelector("true"),
),
integration.WithDeployment(
integration.DeploymentID(appDeploymentID),
integration.DeploymentName("argo-app"),
integration.DeploymentJobAgent(appJobAgentID),
integration.DeploymentCelResourceSelector("true"),
),
integration.WithEnvironment(
integration.EnvironmentID(environmentID),
integration.EnvironmentName("prod"),
integration.EnvironmentCelResourceSelector("true"),
),
),
integration.WithResource(
integration.ResourceID(resourceID),
),
integration.WithPolicy(
integration.PolicyID(policyID),
integration.PolicyName("App depends on ArgoCD destination"),
integration.WithPolicyTargetSelector(
integration.PolicyTargetCelEnvironmentSelector("true"),
integration.PolicyTargetCelDeploymentSelector("deployment.id == '"+appDeploymentID+"'"),
integration.PolicyTargetCelResourceSelector("true"),
),
integration.WithPolicyRule(
integration.WithRuleDeploymentDependency(
integration.DeploymentDependencyRuleDependsOnDeploymentSelector("deployment.id == '"+destinationDeploymentID+"'"),
),
),
),
)

ctx := context.Background()

// Step 1: Create app version (should be blocked - destination never succeeded)
appVersion := c.NewDeploymentVersion()
appVersion.DeploymentId = appDeploymentID
engine.PushEvent(ctx, handler.DeploymentVersionCreate, appVersion)

appJobs := getAgentJobsSortedByNewest(engine, appJobAgentID)
assert.Equal(t, 0, len(appJobs), "app blocked: destination never succeeded")

// Step 2: Create destination version
destVersion := c.NewDeploymentVersion()
destVersion.DeploymentId = destinationDeploymentID
engine.PushEvent(ctx, handler.DeploymentVersionCreate, destVersion)

destJobs := getAgentJobsSortedByNewest(engine, destinationJobAgentID)
assert.Equal(t, 1, len(destJobs), "destination job created")

// Step 3: Mark destination job as SUCCESSFUL
// ⚠️ CRITICAL POINT: At this moment, the job succeeded but in production:
// - ArgoCD application was created
// - But ArgoCD destination might not be synced yet (async process)
destJob := destJobs[0]
destJobCopy := *destJob
destJobCopy.Status = oapi.JobStatusSuccessful
completedAt := time.Now()
destJobCopy.CompletedAt = &completedAt
engine.PushEvent(ctx, handler.JobUpdate, &oapi.JobUpdateEvent{
Id: &destJobCopy.Id,
Job: destJobCopy,
FieldsToUpdate: &[]oapi.JobUpdateEventFieldsToUpdate{
oapi.JobUpdateEventFieldsToUpdateStatus,
oapi.JobUpdateEventFieldsToUpdateCompletedAt,
},
})

// Step 4: ⚠️ RACE CONDITION - This is where the bug manifests ⚠️
// Current buggy behavior: Policy evaluator sees job.Status == Successful
// and IMMEDIATELY allows downstream deployment without checking if
// ArgoCD destination is actually synced.

// Check immediately after destination success - app should still be blocked
appJobs = getAgentJobsSortedByNewest(engine, appJobAgentID)

// ❌ THIS ASSERTION WILL FAIL - exposing the bug
// The policy should NOT allow deployment immediately because ArgoCD
// destinations need time to sync. The app job should still be blocked.
assert.Equal(t, 0, len(appJobs),
"EXPECTED BEHAVIOR: App should remain blocked immediately after destination job success "+
"to allow time for ArgoCD destination to sync. "+
"ACTUAL BUG: Policy allows deployment immediately, causing 'unable to find destination server' errors")

// Expected fix: The policy evaluator should either:
// 1. Add a delay/grace period after ArgoCD destination jobs succeed, OR
// 2. Check ArgoCD sync status before allowing dependent deployments, OR
// 3. Mark downstream jobs with a flag to wait for ArgoCD sync
//
// When the fix is implemented, the above assertion will pass because:
// - The policy will add a delay/wait mechanism
// - The app job won't be created immediately
// - Only after ArgoCD sync is verified will the app job be created
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Intentionally failing test should be skipped or guarded.

The PR description acknowledges this test is expected to fail until a fix is implemented. While documenting the race condition is valuable, committing a failing test directly causes several issues:

  • CI/CD pipeline failures
  • Misleading test suite health metrics
  • Confusion for other developers about whether failures are expected
  • Violates the principle of keeping the main branch "green"

Recommend one of these approaches:

Option 1: Skip the test with clear documentation

 func TestEngine_PolicyDeploymentDependency_PolicyTimingIssue(t *testing.T) {
+	t.Skip("TODO: Enable after implementing fix for ArgoCD destination sync race condition (Issue #XXX)")
 	destinationJobAgentID := "destination-agent"

Option 2: Use build tags to exclude from default test runs

+//go:build failing
+// +build failing
+
 func TestEngine_PolicyDeploymentDependency_PolicyTimingIssue(t *testing.T) {

Then document in the PR or README that go test -tags failing runs known-failing tests.

Do you want me to generate a complete implementation for either option, or would you prefer to enable this test only after the fix is merged?

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// TestEngine_PolicyDeploymentDependency_PolicyTimingIssue demonstrates the race condition:
// 1. Upstream (destination) job completes successfully
// 2. Policy evaluator IMMEDIATELY allows downstream deployment
// (only checks job.Status == Successful, doesn't verify ArgoCD resource is synced)
// 3. In production: Downstream dispatch to ArgoCD would fail because
// the destination hasn't finished syncing yet
//
// This test validates that the policy allows deployment too early,
// which is the root cause of the "unable to find destination server" errors.
func TestEngine_PolicyDeploymentDependency_PolicyTimingIssue(t *testing.T) {
destinationJobAgentID := "destination-agent"
appJobAgentID := "app-agent"
destinationDeploymentID := "argo-destination"
appDeploymentID := "argo-app"
environmentID := "prod"
resourceID := "k8s-cluster"
policyID := "app-depends-on-destination"
engine := integration.NewTestWorkspace(t,
integration.WithJobAgent(
integration.JobAgentID(destinationJobAgentID),
),
integration.WithJobAgent(
integration.JobAgentID(appJobAgentID),
),
integration.WithSystem(
integration.WithDeployment(
integration.DeploymentID(destinationDeploymentID),
integration.DeploymentName("argo-destination"),
integration.DeploymentJobAgent(destinationJobAgentID),
integration.DeploymentCelResourceSelector("true"),
),
integration.WithDeployment(
integration.DeploymentID(appDeploymentID),
integration.DeploymentName("argo-app"),
integration.DeploymentJobAgent(appJobAgentID),
integration.DeploymentCelResourceSelector("true"),
),
integration.WithEnvironment(
integration.EnvironmentID(environmentID),
integration.EnvironmentName("prod"),
integration.EnvironmentCelResourceSelector("true"),
),
),
integration.WithResource(
integration.ResourceID(resourceID),
),
integration.WithPolicy(
integration.PolicyID(policyID),
integration.PolicyName("App depends on ArgoCD destination"),
integration.WithPolicyTargetSelector(
integration.PolicyTargetCelEnvironmentSelector("true"),
integration.PolicyTargetCelDeploymentSelector("deployment.id == '"+appDeploymentID+"'"),
integration.PolicyTargetCelResourceSelector("true"),
),
integration.WithPolicyRule(
integration.WithRuleDeploymentDependency(
integration.DeploymentDependencyRuleDependsOnDeploymentSelector("deployment.id == '"+destinationDeploymentID+"'"),
),
),
),
)
ctx := context.Background()
// Step 1: Create app version (should be blocked - destination never succeeded)
appVersion := c.NewDeploymentVersion()
appVersion.DeploymentId = appDeploymentID
engine.PushEvent(ctx, handler.DeploymentVersionCreate, appVersion)
appJobs := getAgentJobsSortedByNewest(engine, appJobAgentID)
assert.Equal(t, 0, len(appJobs), "app blocked: destination never succeeded")
// Step 2: Create destination version
destVersion := c.NewDeploymentVersion()
destVersion.DeploymentId = destinationDeploymentID
engine.PushEvent(ctx, handler.DeploymentVersionCreate, destVersion)
destJobs := getAgentJobsSortedByNewest(engine, destinationJobAgentID)
assert.Equal(t, 1, len(destJobs), "destination job created")
// Step 3: Mark destination job as SUCCESSFUL
// ⚠️ CRITICAL POINT: At this moment, the job succeeded but in production:
// - ArgoCD application was created
// - But ArgoCD destination might not be synced yet (async process)
destJob := destJobs[0]
destJobCopy := *destJob
destJobCopy.Status = oapi.JobStatusSuccessful
completedAt := time.Now()
destJobCopy.CompletedAt = &completedAt
engine.PushEvent(ctx, handler.JobUpdate, &oapi.JobUpdateEvent{
Id: &destJobCopy.Id,
Job: destJobCopy,
FieldsToUpdate: &[]oapi.JobUpdateEventFieldsToUpdate{
oapi.JobUpdateEventFieldsToUpdateStatus,
oapi.JobUpdateEventFieldsToUpdateCompletedAt,
},
})
// Step 4: ⚠️ RACE CONDITION - This is where the bug manifests ⚠️
// Current buggy behavior: Policy evaluator sees job.Status == Successful
// and IMMEDIATELY allows downstream deployment without checking if
// ArgoCD destination is actually synced.
// Check immediately after destination success - app should still be blocked
appJobs = getAgentJobsSortedByNewest(engine, appJobAgentID)
// ❌ THIS ASSERTION WILL FAIL - exposing the bug
// The policy should NOT allow deployment immediately because ArgoCD
// destinations need time to sync. The app job should still be blocked.
assert.Equal(t, 0, len(appJobs),
"EXPECTED BEHAVIOR: App should remain blocked immediately after destination job success "+
"to allow time for ArgoCD destination to sync. "+
"ACTUAL BUG: Policy allows deployment immediately, causing 'unable to find destination server' errors")
// Expected fix: The policy evaluator should either:
// 1. Add a delay/grace period after ArgoCD destination jobs succeed, OR
// 2. Check ArgoCD sync status before allowing dependent deployments, OR
// 3. Mark downstream jobs with a flag to wait for ArgoCD sync
//
// When the fix is implemented, the above assertion will pass because:
// - The policy will add a delay/wait mechanism
// - The app job won't be created immediately
// - Only after ArgoCD sync is verified will the app job be created
}
// TestEngine_PolicyDeploymentDependency_PolicyTimingIssue demonstrates the race condition:
// 1. Upstream (destination) job completes successfully
// 2. Policy evaluator IMMEDIATELY allows downstream deployment
// (only checks job.Status == Successful, doesn't verify ArgoCD resource is synced)
// 3. In production: Downstream dispatch to ArgoCD would fail because
// the destination hasn't finished syncing yet
//
// This test validates that the policy allows deployment too early,
// which is the root cause of the "unable to find destination server" errors.
func TestEngine_PolicyDeploymentDependency_PolicyTimingIssue(t *testing.T) {
t.Skip("TODO: Enable after implementing fix for ArgoCD destination sync race condition")
destinationJobAgentID := "destination-agent"
appJobAgentID := "app-agent"
destinationDeploymentID := "argo-destination"
appDeploymentID := "argo-app"
environmentID := "prod"
resourceID := "k8s-cluster"
policyID := "app-depends-on-destination"
engine := integration.NewTestWorkspace(t,
integration.WithJobAgent(
integration.JobAgentID(destinationJobAgentID),
),
integration.WithJobAgent(
integration.JobAgentID(appJobAgentID),
),
integration.WithSystem(
integration.WithDeployment(
integration.DeploymentID(destinationDeploymentID),
integration.DeploymentName("argo-destination"),
integration.DeploymentJobAgent(destinationJobAgentID),
integration.DeploymentCelResourceSelector("true"),
),
integration.WithDeployment(
integration.DeploymentID(appDeploymentID),
integration.DeploymentName("argo-app"),
integration.DeploymentJobAgent(appJobAgentID),
integration.DeploymentCelResourceSelector("true"),
),
integration.WithEnvironment(
integration.EnvironmentID(environmentID),
integration.EnvironmentName("prod"),
integration.EnvironmentCelResourceSelector("true"),
),
),
integration.WithResource(
integration.ResourceID(resourceID),
),
integration.WithPolicy(
integration.PolicyID(policyID),
integration.PolicyName("App depends on ArgoCD destination"),
integration.WithPolicyTargetSelector(
integration.PolicyTargetCelEnvironmentSelector("true"),
integration.PolicyTargetCelDeploymentSelector("deployment.id == '"+appDeploymentID+"'"),
integration.PolicyTargetCelResourceSelector("true"),
),
integration.WithPolicyRule(
integration.WithRuleDeploymentDependency(
integration.DeploymentDependencyRuleDependsOnDeploymentSelector("deployment.id == '"+destinationDeploymentID+"'"),
),
),
),
)
ctx := context.Background()
// Step 1: Create app version (should be blocked - destination never succeeded)
appVersion := c.NewDeploymentVersion()
appVersion.DeploymentId = appDeploymentID
engine.PushEvent(ctx, handler.DeploymentVersionCreate, appVersion)
appJobs := getAgentJobsSortedByNewest(engine, appJobAgentID)
assert.Equal(t, 0, len(appJobs), "app blocked: destination never succeeded")
// Step 2: Create destination version
destVersion := c.NewDeploymentVersion()
destVersion.DeploymentId = destinationDeploymentID
engine.PushEvent(ctx, handler.DeploymentVersionCreate, destVersion)
destJobs := getAgentJobsSortedByNewest(engine, destinationJobAgentID)
assert.Equal(t, 1, len(destJobs), "destination job created")
// Step 3: Mark destination job as SUCCESSFUL
// ⚠️ CRITICAL POINT: At this moment, the job succeeded but in production:
// - ArgoCD application was created
// - But ArgoCD destination might not be synced yet (async process)
destJob := destJobs[0]
destJobCopy := *destJob
destJobCopy.Status = oapi.JobStatusSuccessful
completedAt := time.Now()
destJobCopy.CompletedAt = &completedAt
engine.PushEvent(ctx, handler.JobUpdate, &oapi.JobUpdateEvent{
Id: &destJobCopy.Id,
Job: destJobCopy,
FieldsToUpdate: &[]oapi.JobUpdateEventFieldsToUpdate{
oapi.JobUpdateEventFieldsToUpdateStatus,
oapi.JobUpdateEventFieldsToUpdateCompletedAt,
},
})
// Step 4: ⚠️ RACE CONDITION - This is where the bug manifests ⚠️
// Current buggy behavior: Policy evaluator sees job.Status == Successful
// and IMMEDIATELY allows downstream deployment without checking if
// ArgoCD destination is actually synced.
// Check immediately after destination success - app should still be blocked
appJobs = getAgentJobsSortedByNewest(engine, appJobAgentID)
// ❌ THIS ASSERTION WILL FAIL - exposing the bug
// The policy should NOT allow deployment immediately because ArgoCD
// destinations need time to sync. The app job should still be blocked.
assert.Equal(t, 0, len(appJobs),
"EXPECTED BEHAVIOR: App should remain blocked immediately after destination job success "+
"to allow time for ArgoCD destination to sync. "+
"ACTUAL BUG: Policy allows deployment immediately, causing 'unable to find destination server' errors")
// Expected fix: The policy evaluator should either:
// 1. Add a delay/grace period after ArgoCD destination jobs succeed, OR
// 2. Check ArgoCD sync status before allowing dependent deployments, OR
// 3. Mark downstream jobs with a flag to wait for ArgoCD sync
//
// When the fix is implemented, the above assertion will pass because:
// - The policy will add a delay/wait mechanism
// - The app job won't be created immediately
// - Only after ArgoCD sync is verified will the app job be created
}
🤖 Prompt for AI Agents
In @apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go
around lines 368 - 494, The test
TestEngine_PolicyDeploymentDependency_PolicyTimingIssue is intentionally failing
and should be excluded from default CI; either mark it skipped by adding an
early t.Skip("known failing: race condition - see PR") at the top of
TestEngine_PolicyDeploymentDependency_PolicyTimingIssue, or move the test behind
a build tag by adding a file-level build constraint (e.g., //go:build failing)
and document that it runs with `go test -tags failing`; update the test comment
to explain which approach was used and reference the test name so future
maintainers can find and re-enable it when the policy fix is implemented.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants