-
Notifications
You must be signed in to change notification settings - Fork 13
fix(workspace-engine): Argo race condition #735
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
📝 WalkthroughWalkthroughAdds unit and E2E tests for ArgoCD error classification and deployment-dependency behavior, and updates async job dispatch error handling to set job status, message, and updated timestamp when dispatch fails. Changes
Sequence Diagram(s)(Skipped — changes are test additions and a small error-handling update that do not introduce a new multi-component control flow requiring visualization.) Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Poem
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
apps/workspace-engine/pkg/workspace/jobdispatch/argocd.go (2)
83-99: Optimize by moving retryablePatterns to package level.The
retryablePatternsslice is allocated on every call toisRetryableError. Since these patterns are static, consider declaring it as a package-level variable to avoid repeated allocations, especially given that this function is called in a retry loop under load.🔎 Proposed refactor
+// retryablePatterns defines error patterns indicating transient failures worth retrying +var retryablePatterns = []string{ + // HTTP status codes indicating transient failures + "502", "503", "504", + "connection refused", + "connection reset", + "timeout", + "temporarily unavailable", + // ArgoCD destination/cluster errors (race condition when destination is being synced) + "cluster not found", + "destination not found", + "cluster does not exist", + "destination does not exist", + "destination server not found", + "server not found", + "invalid destination", + "unknown cluster", +} + // isRetryableError checks if an error is a transient error that should be retried func isRetryableError(err error) bool { if err == nil { return false } errStr := strings.ToLower(err.Error()) - // Patterns that indicate transient failures worth retrying - retryablePatterns := []string{ - // HTTP status codes indicating transient failures - "502", "503", "504", - "connection refused", - "connection reset", - "timeout", - "temporarily unavailable", - // ArgoCD destination/cluster errors (race condition when destination is being synced) - "cluster not found", - "destination not found", - "cluster does not exist", - "destination does not exist", - "destination server not found", - "server not found", - "invalid destination", - "unknown cluster", - } - for _, pattern := range retryablePatterns { if strings.Contains(errStr, pattern) { return true } } return false }
96-97: Broad patterns may match unintended errors.The patterns
"server not found"and"invalid destination"are relatively generic and could theoretically match non-ArgoCD errors (e.g., "DNS server not found" or file system destination errors). However, given that this function is only called within the ArgoCD dispatcher context, the risk of false positives is low. Consider adding a comment clarifying that these patterns are scoped to ArgoCD errors.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
apps/workspace-engine/pkg/workspace/jobdispatch/argocd.goapps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.goapps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go
🧰 Additional context used
📓 Path-based instructions (2)
apps/workspace-engine/**/*.go
📄 CodeRabbit inference engine (apps/workspace-engine/CLAUDE.md)
apps/workspace-engine/**/*.go: Do not add extraneous inline comments that state the obvious
Do not add comments that simply restate what the code does
Do not add comments for standard Go patterns (e.g., noting WaitGroup or semaphore usage)
Write comments that explain why, document complex logic/algorithms, provide non-obvious context, include TODO/FIXME, and document exported functions/types/methods
Files:
apps/workspace-engine/pkg/workspace/jobdispatch/argocd.goapps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.goapps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.go
apps/workspace-engine/**/*_test.go
📄 CodeRabbit inference engine (apps/workspace-engine/CLAUDE.md)
Follow the existing test structure used in *_test.go files
Files:
apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.goapps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.go
🧠 Learnings (6)
📚 Learning: 2025-08-12T18:13:54.630Z
Learnt from: CR
Repo: ctrlplanedev/ctrlplane PR: 0
File: apps/workspace-engine/CLAUDE.md:0-0
Timestamp: 2025-08-12T18:13:54.630Z
Learning: Applies to apps/workspace-engine/**/*_test.go : Follow the existing test structure used in *_test.go files
Applied to files:
apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.goapps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.go
📚 Learning: 2025-08-12T18:13:54.630Z
Learnt from: CR
Repo: ctrlplanedev/ctrlplane PR: 0
File: apps/workspace-engine/CLAUDE.md:0-0
Timestamp: 2025-08-12T18:13:54.630Z
Learning: Applies to apps/workspace-engine/pkg/model/selector/**/*_test.go : Write comprehensive, data-driven tests for new condition types
Applied to files:
apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.goapps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.go
📚 Learning: 2025-08-12T18:13:54.630Z
Learnt from: CR
Repo: ctrlplanedev/ctrlplane PR: 0
File: apps/workspace-engine/CLAUDE.md:0-0
Timestamp: 2025-08-12T18:13:54.630Z
Learning: Applies to apps/workspace-engine/pkg/model/selector/**/*_test.go : Test validation and matching logic separately for condition types
Applied to files:
apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go
📚 Learning: 2025-10-24T00:02:29.723Z
Learnt from: adityachoudhari26
Repo: ctrlplanedev/ctrlplane PR: 696
File: apps/workspace-engine/pkg/workspace/releasemanager/deployment/executor.go:63-67
Timestamp: 2025-10-24T00:02:29.723Z
Learning: In workspace-engine replay mode (apps/workspace-engine/pkg/workspace/releasemanager/deployment/executor.go), jobs are created with Pending status even during replay because job updates are still accepted and processed. If a job remains Pending after replay completes, it genuinely is still pending because no updates occurred during replay.
Applied to files:
apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go
📚 Learning: 2025-08-12T18:13:54.630Z
Learnt from: CR
Repo: ctrlplanedev/ctrlplane PR: 0
File: apps/workspace-engine/CLAUDE.md:0-0
Timestamp: 2025-08-12T18:13:54.630Z
Learning: Applies to apps/workspace-engine/pkg/model/selector/**/*_test.go : Use table-driven tests for all condition types
Applied to files:
apps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.go
📚 Learning: 2025-08-12T18:13:54.630Z
Learnt from: CR
Repo: ctrlplanedev/ctrlplane PR: 0
File: apps/workspace-engine/CLAUDE.md:0-0
Timestamp: 2025-08-12T18:13:54.630Z
Learning: Applies to apps/workspace-engine/**/*.go : Write comments that explain why, document complex logic/algorithms, provide non-obvious context, include TODO/FIXME, and document exported functions/types/methods
Applied to files:
apps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.go
🧬 Code graph analysis (1)
apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go (4)
apps/workspace-engine/test/integration/workspace.go (1)
NewTestWorkspace(38-66)apps/workspace-engine/test/integration/creators/deployment_version.go (1)
NewDeploymentVersion(15-36)apps/workspace-engine/pkg/events/handler/handler.go (2)
DeploymentVersionCreate(44-44)JobUpdate(68-68)apps/workspace-engine/pkg/oapi/oapi.gen.go (3)
JobStatusSuccessful(135-135)Job(657-671)JobStatusInvalidJobAgent(132-132)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: workspace-engine-tests
- GitHub Check: Lint
- GitHub Check: Typecheck
- GitHub Check: build (linux/amd64)
- GitHub Check: tests
🔇 Additional comments (2)
apps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.go (1)
12-181: Excellent test coverage with proper table-driven structure.The test comprehensively validates error classification across all retryable and non-retryable patterns, including important edge cases like:
- Case-insensitive matching (lines 68-71, 120-123)
- Specificity validation (lines 162-170) ensuring "cluster" and "destination" in different contexts aren't false positives
- Realistic ArgoCD error messages (lines 108-123)
The table-driven approach aligns with project conventions and makes the test easy to extend.
Based on learnings, this follows the existing test structure used in *_test.go files.
apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go (1)
120-225: Verify that the E2E test exercises the retry logic.The test validates that downstream jobs don't immediately fail with
InvalidJobAgent, but it doesn't appear to:
- Simulate ArgoCD returning "cluster not found" or similar transient errors
- Verify that retries actually occur with exponential backoff
- Use a mock ArgoCD client that returns errors on initial attempts then succeeds
The test primarily validates the deployment dependency flow rather than the ArgoCD retry mechanism introduced in this PR. Consider verifying whether:
- A mock ArgoCD client is configured elsewhere in the test setup that simulates the race condition
- The retry behavior is actually exercised during test execution
- Additional assertions could confirm retry attempts occurred (e.g., checking logs, retry counts, or timing)
If the retry logic isn't being exercised, you might want to enhance the test to use a mock ArgoCD client that fails with retryable errors on the first N attempts, then succeeds, to validate the fix end-to-end.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
apps/workspace-engine/pkg/workspace/jobdispatch/argocd.go (1)
74-108: Solid implementation with comprehensive test coverage.The refactoring correctly handles ArgoCD sync race conditions with proper case-insensitive pattern matching. All patterns—including HTTP status codes and ArgoCD-specific errors—are validated by comprehensive test cases in
argocd_test.go, confirming they match realistic error messages from the ArgoCD library.The HTTP patterns (
"502","503","504") are tested within full error messages like"HTTP 502 Bad Gateway", and false positives are further mitigated by the limited retry policy (5 attempts with exponential backoff). However, if increased specificity is desired, consider refining to patterns like"status 502","http 502", or"502 bad gateway"for additional precision.Optional: More specific HTTP patterns
retryablePatterns := []string{ - // HTTP status codes indicating transient failures - "502", "503", "504", + // HTTP status codes indicating transient failures + "status 502", "status 503", "status 504", + "http 502", "http 503", "http 504", "connection refused",
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
apps/workspace-engine/pkg/workspace/jobdispatch/argocd.go
🧰 Additional context used
📓 Path-based instructions (1)
apps/workspace-engine/**/*.go
📄 CodeRabbit inference engine (apps/workspace-engine/CLAUDE.md)
apps/workspace-engine/**/*.go: Do not add extraneous inline comments that state the obvious
Do not add comments that simply restate what the code does
Do not add comments for standard Go patterns (e.g., noting WaitGroup or semaphore usage)
Write comments that explain why, document complex logic/algorithms, provide non-obvious context, include TODO/FIXME, and document exported functions/types/methods
Files:
apps/workspace-engine/pkg/workspace/jobdispatch/argocd.go
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Lint
- GitHub Check: tests
- GitHub Check: build (linux/amd64)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
apps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.go (1)
142-158: Benchmark uses error messages not covered by unit tests.The benchmark includes
"cluster not found"and"destination does not exist"which aren't present in the unit test cases. Consider either adding these to the unit test table or using the same error set for consistency.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
apps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.goapps/workspace-engine/pkg/workspace/releasemanager/deployment/executor.goapps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go
🧰 Additional context used
📓 Path-based instructions (2)
apps/workspace-engine/**/*.go
📄 CodeRabbit inference engine (apps/workspace-engine/CLAUDE.md)
apps/workspace-engine/**/*.go: Do not add extraneous inline comments that state the obvious
Do not add comments that simply restate what the code does
Do not add comments for standard Go patterns (e.g., noting WaitGroup or semaphore usage)
Write comments that explain why, document complex logic/algorithms, provide non-obvious context, include TODO/FIXME, and document exported functions/types/methods
Files:
apps/workspace-engine/pkg/workspace/releasemanager/deployment/executor.goapps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.goapps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.go
apps/workspace-engine/**/*_test.go
📄 CodeRabbit inference engine (apps/workspace-engine/CLAUDE.md)
Follow the existing test structure used in *_test.go files
Files:
apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.goapps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.go
🧠 Learnings (6)
📚 Learning: 2025-10-24T00:02:29.723Z
Learnt from: adityachoudhari26
Repo: ctrlplanedev/ctrlplane PR: 696
File: apps/workspace-engine/pkg/workspace/releasemanager/deployment/executor.go:63-67
Timestamp: 2025-10-24T00:02:29.723Z
Learning: In workspace-engine replay mode (apps/workspace-engine/pkg/workspace/releasemanager/deployment/executor.go), jobs are created with Pending status even during replay because job updates are still accepted and processed. If a job remains Pending after replay completes, it genuinely is still pending because no updates occurred during replay.
Applied to files:
apps/workspace-engine/pkg/workspace/releasemanager/deployment/executor.goapps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go
📚 Learning: 2025-08-12T18:13:54.630Z
Learnt from: CR
Repo: ctrlplanedev/ctrlplane PR: 0
File: apps/workspace-engine/CLAUDE.md:0-0
Timestamp: 2025-08-12T18:13:54.630Z
Learning: Applies to apps/workspace-engine/**/*.go : Write comments that explain why, document complex logic/algorithms, provide non-obvious context, include TODO/FIXME, and document exported functions/types/methods
Applied to files:
apps/workspace-engine/pkg/workspace/releasemanager/deployment/executor.goapps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.goapps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.go
📚 Learning: 2025-08-12T18:13:54.630Z
Learnt from: CR
Repo: ctrlplanedev/ctrlplane PR: 0
File: apps/workspace-engine/CLAUDE.md:0-0
Timestamp: 2025-08-12T18:13:54.630Z
Learning: Applies to apps/workspace-engine/**/*_test.go : Follow the existing test structure used in *_test.go files
Applied to files:
apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.goapps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.go
📚 Learning: 2025-08-12T18:13:54.630Z
Learnt from: CR
Repo: ctrlplanedev/ctrlplane PR: 0
File: apps/workspace-engine/CLAUDE.md:0-0
Timestamp: 2025-08-12T18:13:54.630Z
Learning: Applies to apps/workspace-engine/pkg/model/selector/**/*_test.go : Write comprehensive, data-driven tests for new condition types
Applied to files:
apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.goapps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.go
📚 Learning: 2025-08-12T20:49:05.086Z
Learnt from: adityachoudhari26
Repo: ctrlplanedev/ctrlplane PR: 655
File: apps/workspace-engine/pkg/engine/workspace/fluent.go:166-171
Timestamp: 2025-08-12T20:49:05.086Z
Learning: The UpdateDeploymentVersions() method with OperationCreate case in apps/workspace-engine/pkg/engine/workspace/fluent.go is specifically designed only for creating new deployment versions, not for handling potential duplicates or existing versions.
Applied to files:
apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go
📚 Learning: 2025-08-12T18:13:54.630Z
Learnt from: CR
Repo: ctrlplanedev/ctrlplane PR: 0
File: apps/workspace-engine/CLAUDE.md:0-0
Timestamp: 2025-08-12T18:13:54.630Z
Learning: Applies to apps/workspace-engine/pkg/model/selector/**/*_test.go : Use table-driven tests for all condition types
Applied to files:
apps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.go
🧬 Code graph analysis (1)
apps/workspace-engine/pkg/workspace/releasemanager/deployment/executor.go (2)
apps/workspace-engine/pkg/workspace/releasemanager/trace/types.go (1)
Status(68-68)apps/workspace-engine/pkg/oapi/oapi.gen.go (1)
JobStatusInvalidJobAgent(132-132)
🪛 GitHub Actions: Apps / Workspace Engine
apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go
[error] 481-481: Engine policy timing issue: expected deployment to remain blocked (timing 0), but was allowed to proceed (1). This causes 'unable to find destination server' errors
[error] 1-1: End-to-end test suite failed for workspace-engine/test/e2e (non-zero exit)
apps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.go
[error] 135-135: TestIsRetryableError/unable_to_find_destination_server: expected isRetryableError(...) = true, got false
[error] 135-135: TestIsRetryableError/application_destination_spec_is_invalid: expected isRetryableError(...) = true, got false
[error] 135-135: TestIsRetryableError/argocd_rpc_error_-_destination_spec_invalid: expected isRetryableError(...) = true, got false
[error] 135-135: TestIsRetryableError/mixed_case_-_unable_to_find_destination: expected isRetryableError(...) = true, got false
🔇 Additional comments (5)
apps/workspace-engine/pkg/workspace/releasemanager/deployment/executor.go (1)
128-135: LGTM - Error message capture improves debugging.Storing the dispatch error in
job.Messageprovides valuable debugging context for InvalidJobAgent failures, which aligns with the PR objective of better surfacing ArgoCD race conditions.apps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.go (1)
10-140: Well-structured test-first approach for documenting expected behavior.The table-driven tests correctly document the expected retry classification. The failing test cases (lines 63-82) for ArgoCD destination errors are intentionally failing per the PR description—they will pass once
isRetryableErroris updated to classify "unable to find destination server" errors as retryable.apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go (3)
120-248: LGTM - Comprehensive test for auto-deploy behavior.The test clearly documents the expected dependency flow with well-structured steps and assertions. The step-by-step comments make the test intent clear.
250-366: LGTM - Documents expected retry behavior for ArgoCD errors.The test correctly asserts that jobs should not immediately fail with
InvalidJobAgentstatus when ArgoCD returns transient destination errors. The critical assertion at line 355 validates the fix once implemented.
368-495: Intentionally failing test effectively documents the race condition.This test reproduces the production bug where the policy evaluator allows downstream deployment immediately after upstream job success, without waiting for ArgoCD destination sync. The extensive comments (lines 486-494) outlining potential fix approaches are helpful for implementation guidance.
The failing assertion at line 481 will pass once a timing/sync mechanism is added to the policy evaluator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🤖 Fix all issues with AI Agents
In @apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go:
- Around line 368-494: The test
TestEngine_PolicyDeploymentDependency_PolicyTimingIssue is intentionally failing
and should be excluded from default CI; either mark it skipped by adding an
early t.Skip("known failing: race condition - see PR") at the top of
TestEngine_PolicyDeploymentDependency_PolicyTimingIssue, or move the test behind
a build tag by adding a file-level build constraint (e.g., //go:build failing)
and document that it runs with `go test -tags failing`; update the test comment
to explain which approach was used and reference the test name so future
maintainers can find and re-enable it when the policy fix is implemented.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
apps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.goapps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- apps/workspace-engine/pkg/workspace/jobdispatch/argocd_test.go
🧰 Additional context used
📓 Path-based instructions (2)
apps/workspace-engine/**/*.go
📄 CodeRabbit inference engine (apps/workspace-engine/CLAUDE.md)
apps/workspace-engine/**/*.go: Do not add extraneous inline comments that state the obvious
Do not add comments that simply restate what the code does
Do not add comments for standard Go patterns (e.g., noting WaitGroup or semaphore usage)
Write comments that explain why, document complex logic/algorithms, provide non-obvious context, include TODO/FIXME, and document exported functions/types/methods
Files:
apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go
apps/workspace-engine/**/*_test.go
📄 CodeRabbit inference engine (apps/workspace-engine/CLAUDE.md)
Follow the existing test structure used in *_test.go files
Files:
apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go
🧠 Learnings (6)
📓 Common learnings
Learnt from: adityachoudhari26
Repo: ctrlplanedev/ctrlplane PR: 696
File: apps/workspace-engine/pkg/workspace/releasemanager/deployment/executor.go:63-67
Timestamp: 2025-10-24T00:02:29.723Z
Learning: In workspace-engine replay mode (apps/workspace-engine/pkg/workspace/releasemanager/deployment/executor.go), jobs are created with Pending status even during replay because job updates are still accepted and processed. If a job remains Pending after replay completes, it genuinely is still pending because no updates occurred during replay.
📚 Learning: 2025-08-12T18:13:54.630Z
Learnt from: CR
Repo: ctrlplanedev/ctrlplane PR: 0
File: apps/workspace-engine/CLAUDE.md:0-0
Timestamp: 2025-08-12T18:13:54.630Z
Learning: Applies to apps/workspace-engine/**/*_test.go : Follow the existing test structure used in *_test.go files
Applied to files:
apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go
📚 Learning: 2025-08-12T18:13:54.630Z
Learnt from: CR
Repo: ctrlplanedev/ctrlplane PR: 0
File: apps/workspace-engine/CLAUDE.md:0-0
Timestamp: 2025-08-12T18:13:54.630Z
Learning: Applies to apps/workspace-engine/pkg/model/selector/**/*_test.go : Write comprehensive, data-driven tests for new condition types
Applied to files:
apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go
📚 Learning: 2025-08-12T18:13:54.630Z
Learnt from: CR
Repo: ctrlplanedev/ctrlplane PR: 0
File: apps/workspace-engine/CLAUDE.md:0-0
Timestamp: 2025-08-12T18:13:54.630Z
Learning: Applies to apps/workspace-engine/pkg/model/selector/**/*_test.go : Test validation and matching logic separately for condition types
Applied to files:
apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go
📚 Learning: 2025-08-12T20:49:05.086Z
Learnt from: adityachoudhari26
Repo: ctrlplanedev/ctrlplane PR: 655
File: apps/workspace-engine/pkg/engine/workspace/fluent.go:166-171
Timestamp: 2025-08-12T20:49:05.086Z
Learning: The UpdateDeploymentVersions() method with OperationCreate case in apps/workspace-engine/pkg/engine/workspace/fluent.go is specifically designed only for creating new deployment versions, not for handling potential duplicates or existing versions.
Applied to files:
apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go
📚 Learning: 2025-10-24T00:02:29.723Z
Learnt from: adityachoudhari26
Repo: ctrlplanedev/ctrlplane PR: 696
File: apps/workspace-engine/pkg/workspace/releasemanager/deployment/executor.go:63-67
Timestamp: 2025-10-24T00:02:29.723Z
Learning: In workspace-engine replay mode (apps/workspace-engine/pkg/workspace/releasemanager/deployment/executor.go), jobs are created with Pending status even during replay because job updates are still accepted and processed. If a job remains Pending after replay completes, it genuinely is still pending because no updates occurred during replay.
Applied to files:
apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Typecheck
- GitHub Check: tests
- GitHub Check: Lint
- GitHub Check: build (linux/amd64)
🔇 Additional comments (1)
apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go (1)
120-248: LGTM! Well-structured test for auto-deploy behavior.The test comprehensively validates the deployment dependency workflow: blocking behavior, dependency satisfaction, and automatic unblocking. The step-by-step comments effectively document the expected behavior without being redundant.
| // TestEngine_PolicyDeploymentDependency_ArgoCDRetryBehavior validates that: | ||
| // When ArgoCD returns "unable to find destination server" errors, jobs are retried | ||
| // instead of immediately being marked as InvalidJobAgent. | ||
| // This tests the fix for the race condition where ArgoCD hasn't synced the destination yet. | ||
| func TestEngine_PolicyDeploymentDependency_ArgoCDRetryBehavior(t *testing.T) { | ||
| destinationJobAgentID := "destination-agent" | ||
| appJobAgentID := "app-agent" | ||
|
|
||
| destinationDeploymentID := "argo-destination" | ||
| appDeploymentID := "argo-app" | ||
|
|
||
| environmentID := "prod" | ||
| resourceID := "k8s-cluster" | ||
| policyID := "app-depends-on-destination" | ||
|
|
||
| engine := integration.NewTestWorkspace(t, | ||
| integration.WithJobAgent( | ||
| integration.JobAgentID(destinationJobAgentID), | ||
| ), | ||
| integration.WithJobAgent( | ||
| integration.JobAgentID(appJobAgentID), | ||
| ), | ||
| integration.WithSystem( | ||
| integration.WithDeployment( | ||
| integration.DeploymentID(destinationDeploymentID), | ||
| integration.DeploymentName("argo-destination"), | ||
| integration.DeploymentJobAgent(destinationJobAgentID), | ||
| integration.DeploymentCelResourceSelector("true"), | ||
| ), | ||
| integration.WithDeployment( | ||
| integration.DeploymentID(appDeploymentID), | ||
| integration.DeploymentName("argo-app"), | ||
| integration.DeploymentJobAgent(appJobAgentID), | ||
| integration.DeploymentCelResourceSelector("true"), | ||
| ), | ||
| integration.WithEnvironment( | ||
| integration.EnvironmentID(environmentID), | ||
| integration.EnvironmentName("prod"), | ||
| integration.EnvironmentCelResourceSelector("true"), | ||
| ), | ||
| ), | ||
| integration.WithResource( | ||
| integration.ResourceID(resourceID), | ||
| ), | ||
| integration.WithPolicy( | ||
| integration.PolicyID(policyID), | ||
| integration.PolicyName("App depends on ArgoCD destination"), | ||
| integration.WithPolicyTargetSelector( | ||
| integration.PolicyTargetCelEnvironmentSelector("true"), | ||
| integration.PolicyTargetCelDeploymentSelector("deployment.id == '"+appDeploymentID+"'"), | ||
| integration.PolicyTargetCelResourceSelector("true"), | ||
| ), | ||
| integration.WithPolicyRule( | ||
| integration.WithRuleDeploymentDependency( | ||
| integration.DeploymentDependencyRuleDependsOnDeploymentSelector("deployment.id == '"+destinationDeploymentID+"'"), | ||
| ), | ||
| ), | ||
| ), | ||
| ) | ||
|
|
||
| ctx := context.Background() | ||
|
|
||
| // Step 1: Create app release (blocked - destination never succeeded) | ||
| appVersion := c.NewDeploymentVersion() | ||
| appVersion.DeploymentId = appDeploymentID | ||
| engine.PushEvent(ctx, handler.DeploymentVersionCreate, appVersion) | ||
|
|
||
| appJobs := getAgentJobsSortedByNewest(engine, appJobAgentID) | ||
| assert.Equal(t, 0, len(appJobs), "app should be blocked by dependency") | ||
|
|
||
| // Step 2: Create destination release | ||
| destVersion := c.NewDeploymentVersion() | ||
| destVersion.DeploymentId = destinationDeploymentID | ||
| engine.PushEvent(ctx, handler.DeploymentVersionCreate, destVersion) | ||
|
|
||
| destJobs := getAgentJobsSortedByNewest(engine, destinationJobAgentID) | ||
| assert.Equal(t, 1, len(destJobs), "destination should create job") | ||
|
|
||
| // Step 3: Mark destination as successful | ||
| destJob := destJobs[0] | ||
| destJobCopy := *destJob | ||
| destJobCopy.Status = oapi.JobStatusSuccessful | ||
| completedAt := time.Now() | ||
| destJobCopy.CompletedAt = &completedAt | ||
| engine.PushEvent(ctx, handler.JobUpdate, &oapi.JobUpdateEvent{ | ||
| Id: &destJobCopy.Id, | ||
| Job: destJobCopy, | ||
| FieldsToUpdate: &[]oapi.JobUpdateEventFieldsToUpdate{ | ||
| oapi.JobUpdateEventFieldsToUpdateStatus, | ||
| oapi.JobUpdateEventFieldsToUpdateCompletedAt, | ||
| }, | ||
| }) | ||
|
|
||
| // Step 4: App should now deploy | ||
| appJobs = getAgentJobsSortedByNewest(engine, appJobAgentID) | ||
| assert.Equal(t, 1, len(appJobs), "app should deploy after destination succeeds") | ||
|
|
||
| appJob := appJobs[0] | ||
|
|
||
| // CRITICAL ASSERTION: The fix ensures that ArgoCD destination errors | ||
| // (like "unable to find destination server") are retried instead of | ||
| // immediately marking the job as InvalidJobAgent. | ||
| // | ||
| // Without the fix: Job would be InvalidJobAgent immediately | ||
| // With the fix: Job will be Pending (and retries happen in background) | ||
| assert.NotEqual(t, oapi.JobStatusInvalidJobAgent, appJob.Status, | ||
| "Job should NOT be InvalidJobAgent - ArgoCD destination errors should be retried, not fail immediately") | ||
|
|
||
| // The job should be in a valid state (Pending, InProgress, or eventually Successful) | ||
| validStatuses := []oapi.JobStatus{ | ||
| oapi.JobStatusPending, | ||
| oapi.JobStatusInProgress, | ||
| oapi.JobStatusSuccessful, | ||
| } | ||
| assert.Contains(t, validStatuses, appJob.Status, | ||
| "Job should be in a valid state: %v, got: %v", validStatuses, appJob.Status) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test name doesn't match what is actually being tested.
The test is named ArgoCDRetryBehavior and the comments claim it validates retry behavior when ArgoCD returns "unable to find destination server" errors. However, the test does not:
- Simulate an ArgoCD error occurring
- Verify that retry logic is triggered
- Check retry attempts or retry intervals
The test only validates that after a successful destination deployment, the dependent app job is created in a valid state (not InvalidJobAgent). This validates the outcome but not the retry mechanism itself.
Consider either:
- Renaming the test to better reflect what it validates (e.g.,
TestEngine_PolicyDeploymentDependency_ValidJobStateAfterDependencySuccess) - Enhancing the test to actually simulate ArgoCD errors and verify retry behavior
| // TestEngine_PolicyDeploymentDependency_PolicyTimingIssue demonstrates the race condition: | ||
| // 1. Upstream (destination) job completes successfully | ||
| // 2. Policy evaluator IMMEDIATELY allows downstream deployment | ||
| // (only checks job.Status == Successful, doesn't verify ArgoCD resource is synced) | ||
| // 3. In production: Downstream dispatch to ArgoCD would fail because | ||
| // the destination hasn't finished syncing yet | ||
| // | ||
| // This test validates that the policy allows deployment too early, | ||
| // which is the root cause of the "unable to find destination server" errors. | ||
| func TestEngine_PolicyDeploymentDependency_PolicyTimingIssue(t *testing.T) { | ||
| destinationJobAgentID := "destination-agent" | ||
| appJobAgentID := "app-agent" | ||
|
|
||
| destinationDeploymentID := "argo-destination" | ||
| appDeploymentID := "argo-app" | ||
|
|
||
| environmentID := "prod" | ||
| resourceID := "k8s-cluster" | ||
| policyID := "app-depends-on-destination" | ||
|
|
||
| engine := integration.NewTestWorkspace(t, | ||
| integration.WithJobAgent( | ||
| integration.JobAgentID(destinationJobAgentID), | ||
| ), | ||
| integration.WithJobAgent( | ||
| integration.JobAgentID(appJobAgentID), | ||
| ), | ||
| integration.WithSystem( | ||
| integration.WithDeployment( | ||
| integration.DeploymentID(destinationDeploymentID), | ||
| integration.DeploymentName("argo-destination"), | ||
| integration.DeploymentJobAgent(destinationJobAgentID), | ||
| integration.DeploymentCelResourceSelector("true"), | ||
| ), | ||
| integration.WithDeployment( | ||
| integration.DeploymentID(appDeploymentID), | ||
| integration.DeploymentName("argo-app"), | ||
| integration.DeploymentJobAgent(appJobAgentID), | ||
| integration.DeploymentCelResourceSelector("true"), | ||
| ), | ||
| integration.WithEnvironment( | ||
| integration.EnvironmentID(environmentID), | ||
| integration.EnvironmentName("prod"), | ||
| integration.EnvironmentCelResourceSelector("true"), | ||
| ), | ||
| ), | ||
| integration.WithResource( | ||
| integration.ResourceID(resourceID), | ||
| ), | ||
| integration.WithPolicy( | ||
| integration.PolicyID(policyID), | ||
| integration.PolicyName("App depends on ArgoCD destination"), | ||
| integration.WithPolicyTargetSelector( | ||
| integration.PolicyTargetCelEnvironmentSelector("true"), | ||
| integration.PolicyTargetCelDeploymentSelector("deployment.id == '"+appDeploymentID+"'"), | ||
| integration.PolicyTargetCelResourceSelector("true"), | ||
| ), | ||
| integration.WithPolicyRule( | ||
| integration.WithRuleDeploymentDependency( | ||
| integration.DeploymentDependencyRuleDependsOnDeploymentSelector("deployment.id == '"+destinationDeploymentID+"'"), | ||
| ), | ||
| ), | ||
| ), | ||
| ) | ||
|
|
||
| ctx := context.Background() | ||
|
|
||
| // Step 1: Create app version (should be blocked - destination never succeeded) | ||
| appVersion := c.NewDeploymentVersion() | ||
| appVersion.DeploymentId = appDeploymentID | ||
| engine.PushEvent(ctx, handler.DeploymentVersionCreate, appVersion) | ||
|
|
||
| appJobs := getAgentJobsSortedByNewest(engine, appJobAgentID) | ||
| assert.Equal(t, 0, len(appJobs), "app blocked: destination never succeeded") | ||
|
|
||
| // Step 2: Create destination version | ||
| destVersion := c.NewDeploymentVersion() | ||
| destVersion.DeploymentId = destinationDeploymentID | ||
| engine.PushEvent(ctx, handler.DeploymentVersionCreate, destVersion) | ||
|
|
||
| destJobs := getAgentJobsSortedByNewest(engine, destinationJobAgentID) | ||
| assert.Equal(t, 1, len(destJobs), "destination job created") | ||
|
|
||
| // Step 3: Mark destination job as SUCCESSFUL | ||
| // ⚠️ CRITICAL POINT: At this moment, the job succeeded but in production: | ||
| // - ArgoCD application was created | ||
| // - But ArgoCD destination might not be synced yet (async process) | ||
| destJob := destJobs[0] | ||
| destJobCopy := *destJob | ||
| destJobCopy.Status = oapi.JobStatusSuccessful | ||
| completedAt := time.Now() | ||
| destJobCopy.CompletedAt = &completedAt | ||
| engine.PushEvent(ctx, handler.JobUpdate, &oapi.JobUpdateEvent{ | ||
| Id: &destJobCopy.Id, | ||
| Job: destJobCopy, | ||
| FieldsToUpdate: &[]oapi.JobUpdateEventFieldsToUpdate{ | ||
| oapi.JobUpdateEventFieldsToUpdateStatus, | ||
| oapi.JobUpdateEventFieldsToUpdateCompletedAt, | ||
| }, | ||
| }) | ||
|
|
||
| // Step 4: ⚠️ RACE CONDITION - This is where the bug manifests ⚠️ | ||
| // Current buggy behavior: Policy evaluator sees job.Status == Successful | ||
| // and IMMEDIATELY allows downstream deployment without checking if | ||
| // ArgoCD destination is actually synced. | ||
|
|
||
| // Check immediately after destination success - app should still be blocked | ||
| appJobs = getAgentJobsSortedByNewest(engine, appJobAgentID) | ||
|
|
||
| // ❌ THIS ASSERTION WILL FAIL - exposing the bug | ||
| // The policy should NOT allow deployment immediately because ArgoCD | ||
| // destinations need time to sync. The app job should still be blocked. | ||
| assert.Equal(t, 0, len(appJobs), | ||
| "EXPECTED BEHAVIOR: App should remain blocked immediately after destination job success "+ | ||
| "to allow time for ArgoCD destination to sync. "+ | ||
| "ACTUAL BUG: Policy allows deployment immediately, causing 'unable to find destination server' errors") | ||
|
|
||
| // Expected fix: The policy evaluator should either: | ||
| // 1. Add a delay/grace period after ArgoCD destination jobs succeed, OR | ||
| // 2. Check ArgoCD sync status before allowing dependent deployments, OR | ||
| // 3. Mark downstream jobs with a flag to wait for ArgoCD sync | ||
| // | ||
| // When the fix is implemented, the above assertion will pass because: | ||
| // - The policy will add a delay/wait mechanism | ||
| // - The app job won't be created immediately | ||
| // - Only after ArgoCD sync is verified will the app job be created | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Intentionally failing test should be skipped or guarded.
The PR description acknowledges this test is expected to fail until a fix is implemented. While documenting the race condition is valuable, committing a failing test directly causes several issues:
- CI/CD pipeline failures
- Misleading test suite health metrics
- Confusion for other developers about whether failures are expected
- Violates the principle of keeping the main branch "green"
Recommend one of these approaches:
Option 1: Skip the test with clear documentation
func TestEngine_PolicyDeploymentDependency_PolicyTimingIssue(t *testing.T) {
+ t.Skip("TODO: Enable after implementing fix for ArgoCD destination sync race condition (Issue #XXX)")
destinationJobAgentID := "destination-agent"Option 2: Use build tags to exclude from default test runs
+//go:build failing
+// +build failing
+
func TestEngine_PolicyDeploymentDependency_PolicyTimingIssue(t *testing.T) {Then document in the PR or README that go test -tags failing runs known-failing tests.
Do you want me to generate a complete implementation for either option, or would you prefer to enable this test only after the fix is merged?
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // TestEngine_PolicyDeploymentDependency_PolicyTimingIssue demonstrates the race condition: | |
| // 1. Upstream (destination) job completes successfully | |
| // 2. Policy evaluator IMMEDIATELY allows downstream deployment | |
| // (only checks job.Status == Successful, doesn't verify ArgoCD resource is synced) | |
| // 3. In production: Downstream dispatch to ArgoCD would fail because | |
| // the destination hasn't finished syncing yet | |
| // | |
| // This test validates that the policy allows deployment too early, | |
| // which is the root cause of the "unable to find destination server" errors. | |
| func TestEngine_PolicyDeploymentDependency_PolicyTimingIssue(t *testing.T) { | |
| destinationJobAgentID := "destination-agent" | |
| appJobAgentID := "app-agent" | |
| destinationDeploymentID := "argo-destination" | |
| appDeploymentID := "argo-app" | |
| environmentID := "prod" | |
| resourceID := "k8s-cluster" | |
| policyID := "app-depends-on-destination" | |
| engine := integration.NewTestWorkspace(t, | |
| integration.WithJobAgent( | |
| integration.JobAgentID(destinationJobAgentID), | |
| ), | |
| integration.WithJobAgent( | |
| integration.JobAgentID(appJobAgentID), | |
| ), | |
| integration.WithSystem( | |
| integration.WithDeployment( | |
| integration.DeploymentID(destinationDeploymentID), | |
| integration.DeploymentName("argo-destination"), | |
| integration.DeploymentJobAgent(destinationJobAgentID), | |
| integration.DeploymentCelResourceSelector("true"), | |
| ), | |
| integration.WithDeployment( | |
| integration.DeploymentID(appDeploymentID), | |
| integration.DeploymentName("argo-app"), | |
| integration.DeploymentJobAgent(appJobAgentID), | |
| integration.DeploymentCelResourceSelector("true"), | |
| ), | |
| integration.WithEnvironment( | |
| integration.EnvironmentID(environmentID), | |
| integration.EnvironmentName("prod"), | |
| integration.EnvironmentCelResourceSelector("true"), | |
| ), | |
| ), | |
| integration.WithResource( | |
| integration.ResourceID(resourceID), | |
| ), | |
| integration.WithPolicy( | |
| integration.PolicyID(policyID), | |
| integration.PolicyName("App depends on ArgoCD destination"), | |
| integration.WithPolicyTargetSelector( | |
| integration.PolicyTargetCelEnvironmentSelector("true"), | |
| integration.PolicyTargetCelDeploymentSelector("deployment.id == '"+appDeploymentID+"'"), | |
| integration.PolicyTargetCelResourceSelector("true"), | |
| ), | |
| integration.WithPolicyRule( | |
| integration.WithRuleDeploymentDependency( | |
| integration.DeploymentDependencyRuleDependsOnDeploymentSelector("deployment.id == '"+destinationDeploymentID+"'"), | |
| ), | |
| ), | |
| ), | |
| ) | |
| ctx := context.Background() | |
| // Step 1: Create app version (should be blocked - destination never succeeded) | |
| appVersion := c.NewDeploymentVersion() | |
| appVersion.DeploymentId = appDeploymentID | |
| engine.PushEvent(ctx, handler.DeploymentVersionCreate, appVersion) | |
| appJobs := getAgentJobsSortedByNewest(engine, appJobAgentID) | |
| assert.Equal(t, 0, len(appJobs), "app blocked: destination never succeeded") | |
| // Step 2: Create destination version | |
| destVersion := c.NewDeploymentVersion() | |
| destVersion.DeploymentId = destinationDeploymentID | |
| engine.PushEvent(ctx, handler.DeploymentVersionCreate, destVersion) | |
| destJobs := getAgentJobsSortedByNewest(engine, destinationJobAgentID) | |
| assert.Equal(t, 1, len(destJobs), "destination job created") | |
| // Step 3: Mark destination job as SUCCESSFUL | |
| // ⚠️ CRITICAL POINT: At this moment, the job succeeded but in production: | |
| // - ArgoCD application was created | |
| // - But ArgoCD destination might not be synced yet (async process) | |
| destJob := destJobs[0] | |
| destJobCopy := *destJob | |
| destJobCopy.Status = oapi.JobStatusSuccessful | |
| completedAt := time.Now() | |
| destJobCopy.CompletedAt = &completedAt | |
| engine.PushEvent(ctx, handler.JobUpdate, &oapi.JobUpdateEvent{ | |
| Id: &destJobCopy.Id, | |
| Job: destJobCopy, | |
| FieldsToUpdate: &[]oapi.JobUpdateEventFieldsToUpdate{ | |
| oapi.JobUpdateEventFieldsToUpdateStatus, | |
| oapi.JobUpdateEventFieldsToUpdateCompletedAt, | |
| }, | |
| }) | |
| // Step 4: ⚠️ RACE CONDITION - This is where the bug manifests ⚠️ | |
| // Current buggy behavior: Policy evaluator sees job.Status == Successful | |
| // and IMMEDIATELY allows downstream deployment without checking if | |
| // ArgoCD destination is actually synced. | |
| // Check immediately after destination success - app should still be blocked | |
| appJobs = getAgentJobsSortedByNewest(engine, appJobAgentID) | |
| // ❌ THIS ASSERTION WILL FAIL - exposing the bug | |
| // The policy should NOT allow deployment immediately because ArgoCD | |
| // destinations need time to sync. The app job should still be blocked. | |
| assert.Equal(t, 0, len(appJobs), | |
| "EXPECTED BEHAVIOR: App should remain blocked immediately after destination job success "+ | |
| "to allow time for ArgoCD destination to sync. "+ | |
| "ACTUAL BUG: Policy allows deployment immediately, causing 'unable to find destination server' errors") | |
| // Expected fix: The policy evaluator should either: | |
| // 1. Add a delay/grace period after ArgoCD destination jobs succeed, OR | |
| // 2. Check ArgoCD sync status before allowing dependent deployments, OR | |
| // 3. Mark downstream jobs with a flag to wait for ArgoCD sync | |
| // | |
| // When the fix is implemented, the above assertion will pass because: | |
| // - The policy will add a delay/wait mechanism | |
| // - The app job won't be created immediately | |
| // - Only after ArgoCD sync is verified will the app job be created | |
| } | |
| // TestEngine_PolicyDeploymentDependency_PolicyTimingIssue demonstrates the race condition: | |
| // 1. Upstream (destination) job completes successfully | |
| // 2. Policy evaluator IMMEDIATELY allows downstream deployment | |
| // (only checks job.Status == Successful, doesn't verify ArgoCD resource is synced) | |
| // 3. In production: Downstream dispatch to ArgoCD would fail because | |
| // the destination hasn't finished syncing yet | |
| // | |
| // This test validates that the policy allows deployment too early, | |
| // which is the root cause of the "unable to find destination server" errors. | |
| func TestEngine_PolicyDeploymentDependency_PolicyTimingIssue(t *testing.T) { | |
| t.Skip("TODO: Enable after implementing fix for ArgoCD destination sync race condition") | |
| destinationJobAgentID := "destination-agent" | |
| appJobAgentID := "app-agent" | |
| destinationDeploymentID := "argo-destination" | |
| appDeploymentID := "argo-app" | |
| environmentID := "prod" | |
| resourceID := "k8s-cluster" | |
| policyID := "app-depends-on-destination" | |
| engine := integration.NewTestWorkspace(t, | |
| integration.WithJobAgent( | |
| integration.JobAgentID(destinationJobAgentID), | |
| ), | |
| integration.WithJobAgent( | |
| integration.JobAgentID(appJobAgentID), | |
| ), | |
| integration.WithSystem( | |
| integration.WithDeployment( | |
| integration.DeploymentID(destinationDeploymentID), | |
| integration.DeploymentName("argo-destination"), | |
| integration.DeploymentJobAgent(destinationJobAgentID), | |
| integration.DeploymentCelResourceSelector("true"), | |
| ), | |
| integration.WithDeployment( | |
| integration.DeploymentID(appDeploymentID), | |
| integration.DeploymentName("argo-app"), | |
| integration.DeploymentJobAgent(appJobAgentID), | |
| integration.DeploymentCelResourceSelector("true"), | |
| ), | |
| integration.WithEnvironment( | |
| integration.EnvironmentID(environmentID), | |
| integration.EnvironmentName("prod"), | |
| integration.EnvironmentCelResourceSelector("true"), | |
| ), | |
| ), | |
| integration.WithResource( | |
| integration.ResourceID(resourceID), | |
| ), | |
| integration.WithPolicy( | |
| integration.PolicyID(policyID), | |
| integration.PolicyName("App depends on ArgoCD destination"), | |
| integration.WithPolicyTargetSelector( | |
| integration.PolicyTargetCelEnvironmentSelector("true"), | |
| integration.PolicyTargetCelDeploymentSelector("deployment.id == '"+appDeploymentID+"'"), | |
| integration.PolicyTargetCelResourceSelector("true"), | |
| ), | |
| integration.WithPolicyRule( | |
| integration.WithRuleDeploymentDependency( | |
| integration.DeploymentDependencyRuleDependsOnDeploymentSelector("deployment.id == '"+destinationDeploymentID+"'"), | |
| ), | |
| ), | |
| ), | |
| ) | |
| ctx := context.Background() | |
| // Step 1: Create app version (should be blocked - destination never succeeded) | |
| appVersion := c.NewDeploymentVersion() | |
| appVersion.DeploymentId = appDeploymentID | |
| engine.PushEvent(ctx, handler.DeploymentVersionCreate, appVersion) | |
| appJobs := getAgentJobsSortedByNewest(engine, appJobAgentID) | |
| assert.Equal(t, 0, len(appJobs), "app blocked: destination never succeeded") | |
| // Step 2: Create destination version | |
| destVersion := c.NewDeploymentVersion() | |
| destVersion.DeploymentId = destinationDeploymentID | |
| engine.PushEvent(ctx, handler.DeploymentVersionCreate, destVersion) | |
| destJobs := getAgentJobsSortedByNewest(engine, destinationJobAgentID) | |
| assert.Equal(t, 1, len(destJobs), "destination job created") | |
| // Step 3: Mark destination job as SUCCESSFUL | |
| // ⚠️ CRITICAL POINT: At this moment, the job succeeded but in production: | |
| // - ArgoCD application was created | |
| // - But ArgoCD destination might not be synced yet (async process) | |
| destJob := destJobs[0] | |
| destJobCopy := *destJob | |
| destJobCopy.Status = oapi.JobStatusSuccessful | |
| completedAt := time.Now() | |
| destJobCopy.CompletedAt = &completedAt | |
| engine.PushEvent(ctx, handler.JobUpdate, &oapi.JobUpdateEvent{ | |
| Id: &destJobCopy.Id, | |
| Job: destJobCopy, | |
| FieldsToUpdate: &[]oapi.JobUpdateEventFieldsToUpdate{ | |
| oapi.JobUpdateEventFieldsToUpdateStatus, | |
| oapi.JobUpdateEventFieldsToUpdateCompletedAt, | |
| }, | |
| }) | |
| // Step 4: ⚠️ RACE CONDITION - This is where the bug manifests ⚠️ | |
| // Current buggy behavior: Policy evaluator sees job.Status == Successful | |
| // and IMMEDIATELY allows downstream deployment without checking if | |
| // ArgoCD destination is actually synced. | |
| // Check immediately after destination success - app should still be blocked | |
| appJobs = getAgentJobsSortedByNewest(engine, appJobAgentID) | |
| // ❌ THIS ASSERTION WILL FAIL - exposing the bug | |
| // The policy should NOT allow deployment immediately because ArgoCD | |
| // destinations need time to sync. The app job should still be blocked. | |
| assert.Equal(t, 0, len(appJobs), | |
| "EXPECTED BEHAVIOR: App should remain blocked immediately after destination job success "+ | |
| "to allow time for ArgoCD destination to sync. "+ | |
| "ACTUAL BUG: Policy allows deployment immediately, causing 'unable to find destination server' errors") | |
| // Expected fix: The policy evaluator should either: | |
| // 1. Add a delay/grace period after ArgoCD destination jobs succeed, OR | |
| // 2. Check ArgoCD sync status before allowing dependent deployments, OR | |
| // 3. Mark downstream jobs with a flag to wait for ArgoCD sync | |
| // | |
| // When the fix is implemented, the above assertion will pass because: | |
| // - The policy will add a delay/wait mechanism | |
| // - The app job won't be created immediately | |
| // - Only after ArgoCD sync is verified will the app job be created | |
| } |
🤖 Prompt for AI Agents
In @apps/workspace-engine/test/e2e/engine_policy_deployment_dependency_test.go
around lines 368 - 494, The test
TestEngine_PolicyDeploymentDependency_PolicyTimingIssue is intentionally failing
and should be excluded from default CI; either mark it skipped by adding an
early t.Skip("known failing: race condition - see PR") at the top of
TestEngine_PolicyDeploymentDependency_PolicyTimingIssue, or move the test behind
a build tag by adding a file-level build constraint (e.g., //go:build failing)
and document that it runs with `go test -tags failing`; update the test comment
to explain which approach was used and reference the test name so future
maintainers can find and re-enable it when the policy fix is implemented.
Add failing test to reproduce ArgoCD deployment dependency race condition
Summary
This PR adds a failing test that reproduces a production bug where ArgoCD applications fail with
"unable to find destination server"when deployed immediately after their dependent destination completes.The race condition: The deployment dependency policy allows downstream deployments as soon as the upstream job succeeds, but ArgoCD destinations require additional time to sync asynchronously.
sequenceDiagram participant Policy as Dependency Policy participant Engine as Workspace Engine participant ArgoCD as ArgoCD API Note over Engine,ArgoCD: 1. Deploy Destination Engine->>ArgoCD: Create Destination App ArgoCD-->>Engine: Job Status: Successful ✅ Note over Policy: Dependency satisfied!<br/>Allow downstream deploy Note over Engine,ArgoCD: 2. Deploy App (IMMEDIATE) Policy->>Engine: Unblock downstream deployment Engine->>ArgoCD: Create App using destination ArgoCD-->>Engine: ❌ Error: unable to find destination server Note over ArgoCD: Destination still syncing...<br/>(async background process) Note over Engine: Job marked InvalidJobAgent ❌ Note over ArgoCD: Destination now available<br/>(too late)Changes
TestEngine_PolicyDeploymentDependency_PolicyTimingIssue- Failing test that exposes the bug (expects 0 jobs, gets 1 job created immediately)executor.go- Store dispatch errors injob.Messagefor better debuggingAutoDeployandArgoCDRetryBehaviortests documenting expected behaviorTest Output
This failure is intentional - the test will pass once we implement a fix.
Summary by CodeRabbit
Bug Fixes
Tests
✏️ Tip: You can customize this high-level summary in your review settings.