Skip to content

feat(ci): parallelize deployments and readiness checks for showcase and showcase-rbac#4469

Open
gustavolira wants to merge 7 commits intoredhat-developer:mainfrom
gustavolira:RHIDP-12296
Open

feat(ci): parallelize deployments and readiness checks for showcase and showcase-rbac#4469
gustavolira wants to merge 7 commits intoredhat-developer:mainfrom
gustavolira:RHIDP-12296

Conversation

@gustavolira
Copy link
Copy Markdown
Member

@gustavolira gustavolira commented Mar 26, 2026

Summary

  • Run base and RBAC Helm deployments concurrently via background processes
  • Overlap Backstage readiness waits (~5 min each) by running health checks in parallel, saving ~3-5 min
  • Tests run sequentially (parallel Playwright caused OOMKilled — not enough memory for 6 concurrent browsers)
  • Per-project output paths for test artifacts to avoid file conflicts between runs

Design

Phase Execution Why
Helm deployments Parallel Independent namespaces, no resource conflict
Readiness checks (HTTP polling) Parallel Lightweight, main time savings (~5 min)
Playwright tests Sequential Parallel caused OOMKilled (exit 137) — 2 Playwright x 3 workers exceeds CI memory

Changes

Parallel deployments (utils.sh):

  • initiate_deployments() and initiate_deployments_osd_gcp() run base + RBAC concurrently with & + wait

Parallel readiness + sequential tests (testing.sh):

  • New testing::parallel_check_and_test() pre-warms readiness checks in parallel, then calls testing::check_and_test sequentially for each deployment
  • testing::run_tests() uses per-project output paths: test-results-<subdir>/, junit-results-<subdir>.xml, playwright-report-<subdir>/, per-project log files
  • testing::check_backstage_running() uses per-project log file path to avoid race conditions

Pod log collection (utils.sh):

  • save_all_pod_logs() passes pod logs directory as function parameter (not exported global)
  • Uses unique /tmp/pod_logs-<subdir>/ per artifacts subdir

Callers (ocp-pull.sh, ocp-nightly.sh):

  • Both use testing::parallel_check_and_test()

Test plan

  • Verify OCP PR job runs deployments and readiness checks in parallel
  • Verify Playwright tests run sequentially without OOMKilled
  • Verify nightly subsequent test phases (runtime, sanity, localization) work correctly
  • Verify OSD-GCP nightly deploys in parallel
  • Verify artifacts match original layout (index.html, data/, trace/ at showcase/ root)
  • Verify proper error reporting when one deployment fails

Resolves: RHIDP-12296

🤖 Generated with Claude Code

@github-actions
Copy link
Copy Markdown
Contributor

The container image build workflow finished with status: cancelled.

@gustavolira
Copy link
Copy Markdown
Member Author

/review

@rhdh-qodo-merge
Copy link
Copy Markdown

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🎫 Ticket compliance analysis 🔶

RHIDP-12296 - Partially compliant

Compliant requirements:

  • Run showcase and showcase-rbac deployments concurrently (background + wait)
  • Pod log collection covers both namespaces (parallel-safe temp dirs per subdir)

Non-compliant requirements:

  • Proper error handling: if one deployment/test fails, both failures are reported and the overall job reflects failure

Requires further human verification:

  • Each Playwright test suite starts when its own deployment is ready
  • Reduce total test-phase time by ~3–5 minutes
⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🔒 No security concerns identified
⚡ Recommended focus areas for review

Return Status

_run_parallel returns 1 on failure, but on success it does not explicitly return 0. Relying on implicit function exit status can be fragile and may cause unexpected non-zero returns depending on the last executed command; explicitly returning 0 at the end would make success/failure behavior unambiguous.

_run_parallel() {
  local label=$1
  local base_func=$2
  local base_arg=$3
  local rbac_func=$4
  local rbac_arg=$5

  log::section "Starting parallel ${label} (base + RBAC)"

  "${base_func}" "${base_arg}" &
  local base_pid=$!
  log::info "Base ${label} started in background (PID: ${base_pid})"

  "${rbac_func}" "${rbac_arg}" &
  local rbac_pid=$!
  log::info "RBAC ${label} started in background (PID: ${rbac_pid})"

  local base_result=0
  local rbac_result=0
  wait "${base_pid}" || base_result=$?
  wait "${rbac_pid}" || rbac_result=$?

  if [[ ${base_result} -ne 0 ]]; then
    log::error "Base ${label} failed (exit code: ${base_result})"
  else
    log::success "Base ${label} completed"
  fi
  if [[ ${rbac_result} -ne 0 ]]; then
    log::error "RBAC ${label} failed (exit code: ${rbac_result})"
  else
    log::success "RBAC ${label} completed"
  fi

  if [[ ${base_result} -ne 0 || ${rbac_result} -ne 0 ]]; then
    return 1
  fi
}
📚 Focus areas based on broader codebase context

Exit Codes

testing::parallel_check_and_test() currently does wait ... || true for both background test processes and always returns success, even if one (or both) testing::check_and_test runs fail. This can cause CI to report a green job while tests actually failed; capture each wait’s exit code and return 1/non-zero when either fails. (Ref 2)

testing::parallel_check_and_test() {
  local base_release=$1
  local base_namespace=$2
  local base_project=$3
  local base_url=$4
  local rbac_release=$5
  local rbac_namespace=$6
  local rbac_project=$7
  local rbac_url=$8

  Xvfb :99 &
  local xvfb_pid=$!
  export DISPLAY=:99

  log::section "Starting parallel test execution (base + RBAC)"

  testing::check_and_test "${base_release}" "${base_namespace}" "${base_project}" "${base_url}" &
  local base_pid=$!
  log::info "Base tests started in background (PID: ${base_pid})"

  testing::check_and_test "${rbac_release}" "${rbac_namespace}" "${rbac_project}" "${rbac_url}" &
  local rbac_pid=$!
  log::info "RBAC tests started in background (PID: ${rbac_pid})"

  wait "${base_pid}" || true
  wait "${rbac_pid}" || true

  kill "${xvfb_pid}" 2> /dev/null || true
  unset DISPLAY
}

Reference reasoning: The existing flow explicitly preserves and propagates failure via an exit code check and exit $DEPLOYMENT_EXIT_CODE when deployment/test steps fail. Align the new parallel helper with that pattern by aggregating child exit codes and returning a failing status when any parallel branch fails.

📄 References
  1. redhat-developer/rhdh/e2e-tests/container-init.sh [136-175]
  2. redhat-developer/rhdh/e2e-tests/container-init.sh [177-182]
  3. redhat-developer/rhdh/e2e-tests/container-init.sh [47-84]
  4. redhat-developer/rhdh/e2e-tests/container-init.sh [119-134]
  5. redhat-developer/rhdh/e2e-tests/container-init.sh [86-117]
  6. redhat-developer/rhdh/e2e-tests/local-run.sh [427-441]
  7. redhat-developer/rhdh/e2e-tests/local-run.sh [442-470]
  8. redhat-developer/rhdh/e2e-tests/local-run.sh [403-426]

@github-actions
Copy link
Copy Markdown
Contributor

The container image build workflow finished with status: cancelled.

@github-actions
Copy link
Copy Markdown
Contributor

The container image build workflow finished with status: cancelled.

gustavolira and others added 4 commits March 26, 2026 11:36
Run base and RBAC deployments concurrently using background processes
with proper error reporting for both. Parallelize Playwright test
execution with a shared Xvfb display and per-project output directories
to avoid race conditions on test-results, junit XML, and HTML reports.

Reduces PR feedback and nightly test-phase time by ~3-5 min by
overlapping both Backstage readiness waits and test suites.

RHIDP-12296

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The --previous log path was still hardcoded to pod_logs/ instead of
using the parallel-safe _POD_LOGS_DIR variable, causing previous logs
to silently fail when save_all_pod_logs uses a /tmp/ directory.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fixes SonarCloud S7682 — functions should end with explicit return.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

The container image build workflow finished with status: cancelled.

@github-actions
Copy link
Copy Markdown
Contributor

Image was built and published successfully. It is available at:

Parallel Playwright execution caused OOMKilled in CI (exit code 137).
Refactor testing::parallel_check_and_test to:
- Phase 1: run readiness checks in parallel (lightweight HTTP polling)
- Phase 2: run test suites sequentially (avoids concurrent Playwright OOM)

This still saves ~5 min from overlapping readiness waits while staying
within the CI container memory limits.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

Image was built and published successfully. It is available at:

@gustavolira gustavolira changed the title feat(ci): parallelize showcase and showcase-rbac deployment and testing feat(ci): parallelize deployments and readiness checks for showcase and showcase-rbac Mar 26, 2026
The playwright-report files (index.html, data/, trace/) were being
saved inside a playwright-report/ subdirectory instead of at the root
of the artifact dir. Remove the third arg from save_artifact so the
per-project report directory contents are copied directly into the
artifact root, matching the original layout.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

Image was built and published successfully. It is available at:

Copy link
Copy Markdown
Member

@subhashkhileri subhashkhileri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed the parallelization changes. The overall design is sound — parallel readiness checks are a good win, and per-project artifact paths fix a real overwrite bug. A few suggestions below.

# Phase 1: Wait for both deployments to be ready in parallel (lightweight HTTP polling)
log::section "Waiting for both deployments to be ready in parallel"

testing::check_backstage_running "${base_release}" "${base_namespace}" "${base_url}" "${base_artifacts}" &
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: $LOGFILE race condition when running in background

Both background readiness checks will use the shared global $LOGFILE (test-log). Inside check_backstage_running() (lines 279, 289), on crash-detection or timeout it writes:

common::save_artifact "${artifacts_subdir}" "/tmp/${LOGFILE}" || true

Since both background processes share the same $LOGFILE, they'll clobber each other's log if both hit the failure/timeout path.

run_tests() was correctly updated to use per-project logfiles — but check_backstage_running() was missed.

Suggest updating those lines to use a per-project path:

common::save_artifact "${artifacts_subdir}" "/tmp/${LOGFILE}-${artifacts_subdir}" || true

_run_parallel() {
local label=$1
local base_func=$2
local base_arg=$3
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: Inline instead of abstracting

This helper is called exactly twice (initiate_deployments and initiate_deployments_osd_gcp) and only supports single-argument functions. The generic naming/signature suggests reusability, but it's tightly constrained. Consider inlining the &/wait pattern directly in the two callers — it's ~6 lines and immediately readable:

initiate_deployments() {
  local base_artifacts_subdir=$1
  local rbac_artifacts_subdir=$2
  cd "${DIR}"
  base_deployment "${base_artifacts_subdir}" &
  local base_pid=$!
  rbac_deployment "${rbac_artifacts_subdir}" &
  local rbac_pid=$!
  local failed=0
  wait "${base_pid}" || { log::error "Base deployment failed"; failed=1; }
  wait "${rbac_pid}" || { log::error "RBAC deployment failed"; failed=1; }
  return "${failed}"
}

No indirection, no need to look up what _run_parallel does.

# Args:
# $1-$5: release_name, namespace, playwright_project, url, artifacts_subdir
# $6: readiness check exit code (0 = ready)
_run_check_and_test_phase() {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplication with testing::check_and_test()

This function duplicates the logic already in testing::check_and_test() (lines 307-341) — same pod display, same SKIP_TESTS check, same run_tests/save_all_pod_logs branching, same mark_deploy_failed on failure.

Consider reusing check_and_test instead. After the parallel readiness wait completes, check_backstage_running inside check_and_test will return immediately on the first curl (HTTP 200 in ~2 seconds since the deployment is already up). The re-check cost is negligible, and you avoid maintaining two copies of the same logic:

# In ocp-pull.sh — parallel readiness, then reuse existing check_and_test
testing::check_backstage_running "..." "..." "${url}" "..." &
local base_pid=$!
testing::check_backstage_running "..." "..." "${rbac_url}" "..." &
local rbac_pid=$!
wait "${base_pid}"; wait "${rbac_pid}"

testing::check_and_test "${RELEASE_NAME}" "${NAME_SPACE}" "${PW_PROJECT_SHOWCASE}" "${url}"
testing::check_and_test "${RELEASE_NAME_RBAC}" "${NAME_SPACE_RBAC}" "${PW_PROJECT_SHOWCASE_RBAC}" "${rbac_url}"

No new functions needed.

rm -rf pod_logs && mkdir -p pod_logs

# Use a unique temp directory per artifacts_subdir to allow parallel pod log collection
export _POD_LOGS_DIR="/tmp/pod_logs-${artifacts_subdir}"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code smell: exported global to pass a directory path

Using export _POD_LOGS_DIR + unset to communicate with retrieve_pod_logs is fragile. If save_all_pod_logs is ever called concurrently (which this PR is moving toward), the global gets clobbered between calls.

Consider passing it as a function parameter instead:

retrieve_pod_logs "$pod_name" "$container" "$namespace" "$pod_logs_dir"

This eliminates the global state, the unset cleanup, and the ${_POD_LOGS_DIR:-pod_logs} fallback pattern in retrieve_pod_logs.

# Args: pairs of (release_name namespace playwright_project url) for base and RBAC
# $1-$4: base args
# $5-$8: RBAC args
testing::parallel_check_and_test() {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Hardcoded for exactly 2 deployments

This function takes 8 positional args (4 per deployment) and is hardcoded to the base+RBAC pair. This is fine for now, but a brief comment like # Hardcoded for the base + RBAC deployment pair would clarify that the rigid 8-arg signature is intentional and not meant to be extended to N deployments.

…ation

- Inline _run_parallel() into both callers (review #2)
- Remove _run_check_and_test_phase, reuse check_and_test directly after
  parallel readiness pre-warm — re-check is instant if already up (review redhat-developer#3)
- Pass pod_logs_dir as function parameter instead of exported global (review redhat-developer#4)
- Fix $LOGFILE race in check_backstage_running for parallel execution (review #1)
- Add comment clarifying hardcoded base+RBAC pair (review redhat-developer#5)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@sonarqubecloud
Copy link
Copy Markdown

@openshift-ci
Copy link
Copy Markdown

openshift-ci bot commented Mar 27, 2026

@gustavolira: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-ocp-helm 91a69f1 link true /test e2e-ocp-helm

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@github-actions
Copy link
Copy Markdown
Contributor

Image was built and published successfully. It is available at:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants