CLI live test framework + initial API group testing #3267
Daniel Ayaz (danielayaz) wants to merge 1 commit into main from
Conversation
🎉 All Contributor License Agreements have been signed. Ready to merge.
```makefile
.PHONY: build-for-live-test
build-for-live-test:
	go build -ldflags="-s -w -X main.commit=00000000 -X main.date=1970-01-01T00:00:00Z -X main.disableUpdates=true" \
```
`main.commit=00000000 -X main.date=1970-01-01T00:00:00Z -X main.disableUpdates=true`
nit: do we need all of these flags?
Pull request overview
This PR introduces a comprehensive live integration test framework for the Confluent CLI that executes real CLI commands against live Confluent Cloud environments. The framework enables end-to-end testing of CLI operations including resource creation, updates, and deletion, with support for test grouping, concurrency, and automated cleanup.
Changes:
- New `test/live/` directory with a Go-based live test framework supporting parallel test execution with isolated CLI config directories
- Build tag-based test organization (`core`, `kafka`, `essential`, `all`) with Makefile targets for selective test execution (see the sketch after this list)
- Semaphore CI pipeline (`.semaphore/live-tests.yml`) with manual promotion for on-demand live test runs with configurable cloud providers and regions
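For illustration, the build-tag grouping could look like the sketch below. The package name, test name, and exact tag expression are assumptions; only the group names (`core`, `kafka`, `essential`, `all`) and the `live_test` tag appear in this PR.

```go
//go:build live_test && (kafka || all)

package live

import "testing"

// TestKafkaTopicCRUDLive is a hypothetical test in the kafka group: this file
// compiles only when the live_test tag plus either kafka or all is supplied,
// e.g. go test -tags="live_test,kafka" ./test/live/.
func TestKafkaTopicCRUDLive(t *testing.T) {
	t.Parallel() // groups run concurrently with isolated CLI config directories
}
```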
Reviewed changes
Copilot reviewed 11 out of 12 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| `test/live/live_test.go` | Core test framework with suite setup, command execution, output validation, and state management |
| `test/live/live_utils.go` | Utility functions for name generation, shell parsing, JSON extraction, and polling operations |
| `test/live/environment_live_test.go` | CRUD tests for environment resources |
| `test/live/service_account_live_test.go` | CRUD tests for service account resources |
| `test/live/api_key_live_test.go` | CRUD tests for API key resources with service account dependencies |
| `test/live/kafka_cluster_live_test.go` | CRUD tests for Kafka cluster resources including provisioning wait logic |
| `test/live/kafka_topic_live_test.go` | CRUD tests for Kafka topic resources using pre-existing clusters |
| `test/live/README.md` | Comprehensive documentation covering prerequisites, usage, test writing guide, and CI integration |
| `Makefile` | New targets for building the live test binary and running live tests with group filtering |
| `.semaphore/semaphore.yml` | Added promotion for triggering the live test pipeline with configurable parameters |
| `.semaphore/live-tests.yml` | New CI pipeline for live tests with vault secret integration and Slack notifications |
| `.gitignore` | Added exclusion for the `test/live/bin/` directory |
Comments suppressed due to low confidence (4)
Makefile:148
- The test timeout is set to 1440 minutes (24 hours), which seems excessive. While Kafka cluster provisioning can take ~5-10 minutes, a 24-hour timeout could cause CI jobs to hang for extremely long periods if a test gets stuck. The Semaphore pipeline also has a 24-hour execution time limit (`.semaphore/live-tests.yml` line 19). Consider using a more reasonable timeout like 60-120 minutes, which should be sufficient for the slowest operations (cluster provisioning + test execution) while preventing runaway tests.
-tags="live_test,all" -timeout 1440m -parallel 10; \
else \
TAGS="live_test"; \
for group in $$(echo "$(CLI_LIVE_TEST_GROUPS)" | tr ',' ' '); do \
TAGS="$$TAGS,$$group"; \
done; \
CLI_LIVE_TEST=1 go test ./test/live/ -v -run=".*Live$$" \
-tags="$$TAGS" -timeout 1440m -parallel 10; \
test/live/live_utils.go:93
- The `extractJSONField` function silently returns an empty string when JSON parsing fails or the field doesn't exist. This can make debugging difficult because the caller won't know if the output wasn't JSON or if the field was missing. While this may be intentional for the current use case in `waitForCondition`, where errors are tolerated during polling, it could lead to confusion when used elsewhere. Consider adding a logging statement when JSON unmarshaling fails, or provide a separate function that logs/returns errors for non-polling use cases.
```go
// extractJSONField extracts a specific field value from JSON output.
func extractJSONField(t *testing.T, output, field string) string {
	t.Helper()
	var parsed map[string]interface{}
	if err := json.Unmarshal([]byte(strings.TrimSpace(output)), &parsed); err != nil {
		return ""
	}
	if val, ok := parsed[field]; ok {
		return fmt.Sprintf("%v", val)
	}
	return ""
}
```
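One possible mitigation, sketched here (the name `extractJSONFieldErr` is not in the PR, and the snippet assumes the file's existing `encoding/json`, `fmt`, and `strings` imports), is a companion helper that reports why extraction failed so non-polling callers can surface the error:

```go
// extractJSONFieldErr behaves like extractJSONField but returns an error
// explaining whether the output was not valid JSON or the field was missing.
func extractJSONFieldErr(output, field string) (string, error) {
	var parsed map[string]interface{}
	if err := json.Unmarshal([]byte(strings.TrimSpace(output)), &parsed); err != nil {
		return "", fmt.Errorf("output is not valid JSON: %w", err)
	}
	val, ok := parsed[field]
	if !ok {
		return "", fmt.Errorf("field %q not found in JSON output", field)
	}
	return fmt.Sprintf("%v", val), nil
}
```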
.semaphore/live-tests.yml:29
- The vault secrets are sourced but there's no explicit validation that the required environment variables (`CONFLUENT_CLOUD_EMAIL`, `CONFLUENT_CLOUD_PASSWORD`) are set before running the tests. If the vault secret is missing or misconfigured, the test will only fail when it tries to use these variables in `setupTestContext`. Consider adding explicit validation after sourcing the secrets to fail fast with a clear error message if required credentials are missing.
```yaml
- . vault-sem-get-secret v1/ci/kv/apif/cli/live-testing-data
- . vault-sem-get-secret v1/ci/kv/apif/cli/slack-notifications-live-testing
```
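The check could live in the pipeline itself or on the Go side. As a sketch of the latter (the helper name `requireEnv` and its call site are assumptions, since `setupTestContext` is not shown in this excerpt; it relies on the test file's `os` and `testing` imports):

```go
// requireEnv fails fast with a clear message when a required credential is
// missing, instead of letting login fail later with an opaque error.
func requireEnv(t *testing.T, keys ...string) {
	t.Helper()
	for _, key := range keys {
		if os.Getenv(key) == "" {
			t.Fatalf("required environment variable %s is not set; check that the vault secret was sourced", key)
		}
	}
}

// Hypothetical call at the start of suite setup:
//   requireEnv(t, "CONFLUENT_CLOUD_EMAIL", "CONFLUENT_CLOUD_PASSWORD")
```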
test/live/live_test.go:262
- The JSON validation logic uses `fmt.Sprintf("%v", val)` to convert field values to strings for comparison and emptiness checks. This can produce unexpected results for certain types. For example, a boolean `false` value will be converted to the string "false", which is non-empty and will pass the emptiness check, even though it might semantically represent an "empty" or "zero" value. Similarly, numeric zero (0) becomes "0", which is non-empty. Consider type-aware validation, or document this behavior clearly so test authors understand that "empty" means the string representation is empty, not that the value is a zero/false/nil value.
```go
for field, expected := range step.JSONFields {
	val, ok := parsed[field]
	require.True(t, ok, "JSON output of '%s' missing field %q", step.Name, field)
	if expected != "" {
		require.Equal(t, expected, fmt.Sprintf("%v", val),
			"JSON field %q of '%s' has unexpected value", field, step.Name)
	} else {
		require.NotEmpty(t, fmt.Sprintf("%v", val),
			"JSON field %q of '%s' is empty", field, step.Name)
	}
```
| s.registerCleanup(t, "environment delete {{.env_id}} --force", state) | ||
| s.registerCleanup(t, "kafka cluster delete {{.cluster_id}} --force --environment {{.env_id}}", state) |
The cleanup functions are registered in the wrong order. The comment says "Cleanup in LIFO order: delete cluster first, then environment" but the registration order has environment deletion first (line 24), then cluster deletion (line 25). Since t.Cleanup() executes in LIFO order (last registered runs first), this will attempt to delete the environment before the cluster, which will likely fail because the cluster still exists. The registration order should be reversed: register environment cleanup first, then cluster cleanup, so cluster is deleted before environment.
| s.registerCleanup(t, "environment delete {{.env_id}} --force", state) | |
| s.registerCleanup(t, "kafka cluster delete {{.cluster_id}} --force --environment {{.env_id}}", state) | |
| s.registerCleanup(t, "kafka cluster delete {{.cluster_id}} --force --environment {{.env_id}}", state) | |
| s.registerCleanup(t, "environment delete {{.env_id}} --force", state) |
| s.registerCleanup(t, "iam service-account delete {{.sa_id}} --force", state) | ||
| s.registerCleanup(t, "api-key delete {{.api_key_id}} --force", state) |
The cleanup functions are registered in the wrong order. The comment says "Cleanup in LIFO order: delete API key first, then service account" but the registration order has service account deletion first (line 21), then API key deletion (line 22). Since t.Cleanup() executes in LIFO order (last registered runs first), this will attempt to delete the service account before the API key, which will likely fail because the API key still references the service account. The registration order should be reversed: register service account cleanup first, then API key cleanup, so API key is deleted before service account.
| s.registerCleanup(t, "iam service-account delete {{.sa_id}} --force", state) | |
| s.registerCleanup(t, "api-key delete {{.api_key_id}} --force", state) | |
| s.registerCleanup(t, "api-key delete {{.api_key_id}} --force", state) | |
| s.registerCleanup(t, "iam service-account delete {{.sa_id}} --force", state) |
```go
// shellSplit splits a command string into arguments, respecting double and single quotes.
// Unlike strings.Fields, it treats quoted substrings as single arguments.
func shellSplit(s string) []string {
	var args []string
	var current strings.Builder
	inQuote := false
	var quoteChar byte

	for i := 0; i < len(s); i++ {
		c := s[i]
		switch {
		case inQuote:
			if c == quoteChar {
				inQuote = false
			} else {
				current.WriteByte(c)
			}
		case c == '"' || c == '\'':
			inQuote = true
			quoteChar = c
		case c == ' ' || c == '\t':
			if current.Len() > 0 {
				args = append(args, current.String())
				current.Reset()
			}
		default:
			current.WriteByte(c)
		}
	}
	if current.Len() > 0 {
		args = append(args, current.String())
	}
	return args
}
```
The shellSplit function does not handle escaped characters (backslashes) within quotes or outside quotes. For example, \" or \' won't be properly parsed, and neither will escaped backslashes like \\. This can cause issues when command arguments contain literal quotes or backslashes. The existing test suite already uses github.com/google/shlex which handles these cases properly. Consider using shlex instead of this custom implementation.
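For reference, a shlex-based replacement along the lines the comment suggests might look like this sketch (the wrapper name `splitArgs` and the package name are illustrative, and it reuses the testify `require` helper the framework already imports):

```go
package live

import (
	"testing"

	"github.com/google/shlex"
	"github.com/stretchr/testify/require"
)

// splitArgs parses a command string into arguments, handling escaped quotes
// and backslashes that the hand-rolled shellSplit does not cover.
func splitArgs(t *testing.T, s string) []string {
	t.Helper()
	args, err := shlex.Split(s)
	require.NoError(t, err, "failed to parse command %q", s)
	return args
}
```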
### 3. Register cleanup

Always register cleanup **before** creating resources (LIFO execution order):

```go
s.registerCleanup(t, "resource delete {{.resource_id}} --force", state)
```
The documentation states "Always register cleanup before creating resources (LIFO execution order)" but this is misleading. While the cleanup should be registered early in the test, the key point is that when you have dependencies, you must register them in reverse dependency order because t.Cleanup() executes in LIFO order. The current wording could be misinterpreted to mean "register all cleanups before any resource creation" when what matters is the registration order relative to other cleanups. Consider clarifying: "Register cleanup handlers in reverse dependency order (parent resources before child resources) so that t.Cleanup() LIFO execution deletes children before parents."
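Concretely, using the cleanup commands from this PR, the corrected ordering the comment describes would be (an illustrative fragment from inside a test body, not a complete file):

```go
// Registered first, so it runs last: the environment must outlive its cluster.
s.registerCleanup(t, "environment delete {{.env_id}} --force", state)
// Registered last, so it runs first: the cluster is deleted before its environment.
s.registerCleanup(t, "kafka cluster delete {{.cluster_id}} --force --environment {{.env_id}}", state)
```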
```go
import (
	"encoding/json"
	"fmt"
	"math/rand"
```
The math/rand package is used without seeding, which means the random number generator will use the same seed each time in Go versions prior to 1.20. While Go 1.20+ auto-seeds the global random number generator, this can still lead to predictable resource names that may cause conflicts if tests run concurrently or if multiple test runs happen in quick succession. Consider using math/rand/v2 or explicitly seeding with time.Now().UnixNano() for better randomness, or switch to using crypto/rand for cryptographically secure random numbers.
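If the generator is changed along these lines, a crypto/rand-backed suffix helper could look like the following sketch (the function name `randomSuffix` and the package name are hypothetical):

```go
package live

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// randomSuffix returns a short hex suffix for resource names, drawn from
// crypto/rand so that concurrent runs cannot collide on a shared seed.
func randomSuffix(n int) (string, error) {
	buf := make([]byte, n)
	if _, err := rand.Read(buf); err != nil {
		return "", fmt.Errorf("failed to generate random suffix: %w", err)
	}
	return hex.EncodeToString(buf), nil
}
```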
```makefile
build-for-live-test:
	go build -ldflags="-s -w -X main.commit=00000000 -X main.date=1970-01-01T00:00:00Z -X main.disableUpdates=true" \
		-o test/live/bin/confluent ./cmd/confluent
```
The build-for-live-test target doesn't handle Windows platform differences. The existing integration test build has separate targets for Windows (build-for-integration-test-windows at line 106) that appends .exe to the binary name. The live test binary path in live_test.go (line 23) references just test/live/bin/confluent without the Windows .exe extension handling. While the SetupSuite method adds .exe for Windows (lines 113-115 in live_test.go), the Makefile should also have a Windows-specific build target or conditional logic to build test/live/bin/confluent.exe on Windows for consistency with the integration test pattern.
Suggested change:

```diff
 build-for-live-test:
-	go build -ldflags="-s -w -X main.commit=00000000 -X main.date=1970-01-01T00:00:00Z -X main.disableUpdates=true" \
-		-o test/live/bin/confluent ./cmd/confluent
+ifneq "" "$(findstring NT,$(shell uname))" # windows
+	go build -ldflags="-s -w -X main.commit=00000000 -X main.date=1970-01-01T00:00:00Z -X main.disableUpdates=true" \
+		-o test/live/bin/confluent.exe ./cmd/confluent
+else
+	go build -ldflags="-s -w -X main.commit=00000000 -X main.date=1970-01-01T00:00:00Z -X main.disableUpdates=true" \
+		-o test/live/bin/confluent ./cmd/confluent
+endif
```
```sh
RC=$?
if [ $RC -ne 0 ]; then
  echo "Live tests failed, sending Slack notification..."
  curl -X POST -H "Content-type: application/json" --data "{}" "$SLACK_WEBHOOK_URL"
```
The Slack notification sends an empty JSON object as the message payload. This means the webhook will receive {} which likely won't produce a meaningful notification message. The payload should include information about the test failure, such as the job name, build link, or error details. For example: --data '{"text":"CLI live tests failed in job $SEMAPHORE_JOB_NAME. See: $SEMAPHORE_WORKFLOW_URL"}'
Suggested change:

```diff
-curl -X POST -H "Content-type: application/json" --data "{}" "$SLACK_WEBHOOK_URL"
+curl -X POST -H "Content-type: application/json" --data "{\"text\":\"CLI live tests failed in job $SEMAPHORE_JOB_NAME. See: $SEMAPHORE_WORKFLOW_URL\"}" "$SLACK_WEBHOOK_URL"
```
```go
func (s *CLILiveTestSuite) waitForCondition(t *testing.T, argsTemplate string, state *LiveTestState,
	condition func(output string) bool, interval, timeout time.Duration) string {
	t.Helper()

	deadline := time.Now().Add(timeout)
	args := substituteStateVars(t, argsTemplate, state)
	env := buildCommandEnv([]string{homeEnvVar(state.homeDir)})

	var lastOutput string
	for {
		cmd := exec.Command(s.binPath, shellSplit(args)...)
		cmd.Env = env
		out, _ := cmd.CombinedOutput()
		lastOutput = string(out)

		if condition(lastOutput) {
			return lastOutput
		}
		if time.Now().After(deadline) {
			t.Fatalf("waitForCondition timed out after %s — last output:\n%s", timeout, lastOutput)
		}
		t.Logf("Condition not met, retrying in %s...", interval)
		time.Sleep(interval)
	}
}
```
The waitForCondition function lacks proper timeout context management. While it checks time.Now().After(deadline), it doesn't cancel the running command when the timeout is reached. If the CLI command hangs or takes an extremely long time, this could cause the test to wait unnecessarily even after the deadline has passed. Consider using context.WithTimeout and passing the context to cmd.Run() or cmd.Start() to ensure the command is properly terminated when the timeout expires.
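A context-based variant along those lines might look like this sketch; it reuses the helpers shown in the snippet above, assumes a `context` import, and is not the PR's implementation:

```go
func (s *CLILiveTestSuite) waitForConditionCtx(t *testing.T, argsTemplate string, state *LiveTestState,
	condition func(output string) bool, interval, timeout time.Duration) string {
	t.Helper()

	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	args := substituteStateVars(t, argsTemplate, state)
	env := buildCommandEnv([]string{homeEnvVar(state.homeDir)})

	var lastOutput string
	for {
		// CommandContext kills the CLI process if the deadline expires while it is still running.
		cmd := exec.CommandContext(ctx, s.binPath, shellSplit(args)...)
		cmd.Env = env
		out, _ := cmd.CombinedOutput()
		lastOutput = string(out)

		if condition(lastOutput) {
			return lastOutput
		}
		select {
		case <-ctx.Done():
			t.Fatalf("waitForCondition timed out after %s; last output:\n%s", timeout, lastOutput)
		case <-time.After(interval):
			t.Logf("Condition not met, retrying in %s...", interval)
		}
	}
}
```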
Release Notes
Breaking Changes
New Features
Bug Fixes
Checklist
- `What` section below: whether this PR applies to Confluent Cloud, Confluent Platform, or both
- `Test & Review` section below
- `Blast Radius` section below

What
Adds a live integration test framework for the CLI targeting Confluent Cloud. This includes:
- `test/live/` directory with Go test files that exercise real CLI commands against live Confluent Cloud (environments, clusters, topics, API keys, service accounts)
- Test groups (`core`, `kafka`, `essential`, `all`)
- Makefile targets `live-test`, `live-test-core`, `live-test-essential`
- Semaphore CI pipeline (`.semaphore/live-tests.yml`) for on-demand live test runs
- `.gitignore` update for `test/live/bin/`

No existing commands or functionality are modified. This is a test-only addition.
Blast Radius
None. This PR adds only test infrastructure and CI pipeline promotions. No production CLI commands or customer-facing behavior is changed. The live test pipeline is manually triggered only.
References
https://confluentinc.atlassian.net/browse/APIE-864
Test & Review
- Ran `make build-for-live-test` successfully
- Ran `make live-test-core` and `make live-test-essential` against live Confluent Cloud