
Conversation

@cjac (Contributor) commented on Jun 23, 2025

refactor(gcloud): Introduce a test framework for Critical User Journeys

This commit refactors the project's testing and automation scripts, moving them from a monolithic bin/ directory into a structured framework centered around Critical User Journeys (CUJs). The goal is to create a more maintainable and scalable way to exercise the use cases that are essential for our customers.

This new structure is designed for better organization, code reuse, and CI/CD integration.

Key changes include:

  • CUJ Directory Structure:

    • gcloud/cuj/: Contains individual journey scripts. The first CUJ, standard-cluster-management, is introduced to handle the creation, deletion, and rebuilding of a standard Dataproc cluster.

    • gcloud/onboarding/: Holds scripts for setting up shared, prerequisite infrastructure (e.g., GCS buckets) that CUJs depend on.

    • gcloud/lib/: A new directory for shared bash libraries. common.sh is introduced to centralize functions for configuration loading, network management, and logging.

    • gcloud/ci/: Includes scripts specifically for CI/CD pipelines.

  • CI Environment Management:

    • A new pristine_check.sh script is added to ensure the GCP project is in a clean state before a test run. It can operate in a strict validation mode (fail if resources exist) or a cleanup mode (actively delete resources).

  • Configuration and Code Improvements:

    • Configuration is centralized and simplified in env.json, which is now loaded by the load_config function in common.sh (see the sketch below).

    • Scripts are refactored to be more robust, with clear up, down, and validate commands, and checks for pre-existing resources.

    • The .gitignore is updated to exclude temporary files and local configuration.

This change lays the foundation for adding more complex user journeys, ensuring our core services are consistently tested from a customer perspective.

Resolves: GoogleCloudDataproc/initialization-actions#1281
Relates: go/sf/59134938
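
To make the configuration flow concrete, here is a minimal sketch of how env.json and the load_config function in common.sh could fit together. The jq-based parsing and the key normalization are illustrative assumptions; only the CONFIG associative array and the key names referenced elsewhere in this PR are taken from the source.

#!/usr/bin/env bash
# Illustrative sketch of gcloud/lib/common.sh::load_config -- the jq parsing
# and key normalization are assumptions, not the PR's actual implementation.

declare -A CONFIG

function load_config() {
  local config_file="${1:-env.json}"
  if [[ ! -f "${config_file}" ]]; then
    echo "ERROR: ${config_file} not found" >&2
    return 1
  fi
  # Flatten the top-level string values of env.json into the CONFIG array,
  # upper-casing the keys so callers can use CONFIG[REGION], CONFIG[SUBNET], etc.
  while IFS=$'\t' read -r key value; do
    CONFIG["${key^^}"]="${value}"
  done < <(jq -r 'to_entries[]
                  | select(.value | type == "string")
                  | [.key, .value] | @tsv' "${config_file}")
}

# Example usage:
#   load_config env.json
#   echo "Cluster ${CONFIG[CLUSTER_NAME]} in ${CONFIG[REGION]}"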

@cjac self-assigned this on Jun 23, 2025
@gemini-code-assist (bot) left a comment

Summary of Changes

Hello @cjac, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the project's testing and automation infrastructure by transitioning from a monolithic bin/ directory to a structured framework based on Critical User Journeys (CUJs). This change aims to enhance the maintainability, scalability, and reusability of scripts, ensuring robust and consistent testing of core customer use cases within CI/CD pipelines.

Highlights

  • New Directory Structure for CUJs: Introduced gcloud/cuj/, gcloud/onboarding/, gcloud/lib/, and gcloud/ci/ to organize testing and automation scripts around Critical User Journeys, promoting modularity and reuse.
  • CI Environment Management: Added gcloud/ci/pristine_check.sh to validate and clean up GCP project resources, ensuring a pristine environment before test runs, with both strict validation and cleanup modes.
  • Centralized Configuration and Shared Utilities: Refactored configuration to env.json and introduced gcloud/lib/common.sh to centralize functions for configuration loading, network management (creation/deletion), and logging, improving script robustness and maintainability.
  • First Critical User Journey (CUJ): Implemented the gcloud/cuj/standard-cluster-management/manage.sh script, providing comprehensive lifecycle management (create, delete, rebuild) for a standard Dataproc cluster and serving as a template for future CUJs (see the sketch below).
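
The following is a hedged sketch of how such a CUJ entry point might dispatch its lifecycle commands; the helper function names, the library path, and the use of up/down/validate subcommands (mirroring the commands described in the PR) are illustrative assumptions rather than the script's actual contents.

#!/usr/bin/env bash
# Hypothetical sketch of gcloud/cuj/standard-cluster-management/manage.sh.
# Helper names (create_cluster, delete_cluster, delete_network_and_subnet)
# are assumptions; create_network_and_subnet appears in the review below.
set -euo pipefail

source "$(dirname "$0")/../../lib/common.sh"   # assumed path to the shared library
load_config "env.json"

case "${1:-}" in
  up)
    create_network_and_subnet
    create_cluster
    ;;
  down)
    delete_cluster
    delete_network_and_subnet
    ;;
  validate)
    gcloud dataproc clusters describe "${CONFIG[CLUSTER_NAME]}" \
      --region="${CONFIG[REGION]}" >/dev/null
    echo "Cluster '${CONFIG[CLUSTER_NAME]}' exists."
    ;;
  *)
    echo "Usage: $0 {up|down|validate}" >&2
    exit 1
    ;;
esac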

@gemini-code-assist (bot) left a comment

Code Review

The code changes introduce a well-structured testing framework centered around Critical User Journeys (CUJs). The new structure and the use of common.sh and env.json are major improvements. The review focuses on improving the robustness and security of the new scripts, in particular idempotent resource creation and the scope of a firewall rule.

Comment on lines 64 to 84
function create_network_and_subnet() {
  # ... (logic to check if network exists)
  gcloud compute networks create "${CONFIG[NETWORK]}" \
    --subnet-mode=custom \
    --description="Network for CUJ testing" \
    --bgp-routing-mode="regional"

  # ... (logic to check if subnet exists)
  gcloud compute networks subnets create "${CONFIG[SUBNET]}" \
    --network="${CONFIG[NETWORK]}" \
    --range="${CONFIG[SUBNET_CIDR]}" \
    --region="${CONFIG[REGION]}"

  # Add firewall rule with the tag to allow SSH
  gcloud compute firewall-rules create "${CONFIG[CUJ_TAG]}-allow-ssh" \
    --network="${CONFIG[NETWORK]}" \
    --allow=tcp:22 \
    --source-ranges="0.0.0.0/0" \
    --description="Allow SSH for CUJ test" \
    --target-tags="${CONFIG[CUJ_TAG]}"
}


Severity: critical

The create_network_and_subnet function is not idempotent. If the network, subnet, or firewall rule already exists, the gcloud ... create commands will fail. Check for the existence of each resource before attempting to create it.

function create_network_and_subnet() {
  header "Creating Network and Subnet"

  # Create network if it doesn't exist
  if ! gcloud compute networks describe "${CONFIG[NETWORK]}" &>/dev/null; then
    echo "Creating network '${CONFIG[NETWORK]}'..."
    gcloud compute networks create "${CONFIG[NETWORK]}" \
        --subnet-mode=custom \
        --description="Network for CUJ testing" \
        --bgp-routing-mode="regional"
  else
    echo "Network '${CONFIG[NETWORK]}' already exists."
  fi

  # Create subnet if it doesn't exist
  if ! gcloud compute networks subnets describe "${CONFIG[SUBNET]}" --region="${CONFIG[REGION]}" &>/dev/null; then
    echo "Creating subnet '${CONFIG[SUBNET]}'..."
    gcloud compute networks subnets create "${CONFIG[SUBNET]}" \
         --network="${CONFIG[NETWORK]}" \
         --range="${CONFIG[SUBNET_CIDR]}" \
         --region="${CONFIG[REGION]}"
  else
    echo "Subnet '${CONFIG[SUBNET]}' already exists."
  fi
  
  # Create firewall rule if it doesn't exist
  local firewall_rule_name="${CONFIG[CUJ_TAG]}-allow-ssh"
  if ! gcloud compute firewall-rules describe "${firewall_rule_name}" &>/dev/null; then
    echo "Creating firewall rule '${firewall_rule_name}'..."
    gcloud compute firewall-rules create "${firewall_rule_name}" \
      --network="${CONFIG[NETWORK]}" \
      --allow=tcp:22 \
      --source-ranges="0.0.0.0/0" \
      --description="Allow SSH for CUJ test" \
      --target-tags="${CONFIG[CUJ_TAG]}"
  else
    echo "Firewall rule '${firewall_rule_name}' already exists."
  fi
}

  gcloud compute firewall-rules create "${CONFIG[CUJ_TAG]}-allow-ssh" \
    --network="${CONFIG[NETWORK]}" \
    --allow=tcp:22 \
    --source-ranges="0.0.0.0/0" \


Severity: high

Opening SSH (tcp:22) to the entire internet (0.0.0.0/0) is a significant security risk. Restrict the source-ranges to a safer alternative like the IP range for Google Cloud's Identity-Aware Proxy (IAP) for TCP forwarding (35.235.240.0/20).

Suggested change:
-    --source-ranges="0.0.0.0/0" \
+    --source-ranges="35.235.240.0/20" \

Comment on lines 34 to 41
echo "Found leftover Dataproc clusters: ${CLUSTERS}" | tee -a "${LEFTOVERS_FILE}"
if [[ "$STRICT_MODE" == false ]]; then
echo "Cleaning up..."
# Run deletions in the background for speed.
for cluster in ${CLUSTERS}; do
gcloud dataproc clusters delete --quiet "${cluster}" --region="${CONFIG[REGION]}" &
done
fi


Severity: medium

Consider using a while-read loop when iterating over lines of command output so that resource names containing spaces or special characters are handled correctly. Also, echo the quoted ${CLUSTERS} variable into tee on its own line to preserve the line breaks from the gcloud command's output.

Suggested change:

-  echo "Found leftover Dataproc clusters: ${CLUSTERS}" | tee -a "${LEFTOVERS_FILE}"
-  if [[ "$STRICT_MODE" == false ]]; then
-    echo "Cleaning up..."
-    # Run deletions in the background for speed.
-    for cluster in ${CLUSTERS}; do
-      gcloud dataproc clusters delete --quiet "${cluster}" --region="${CONFIG[REGION]}" &
-    done
-  fi
+  echo "Found leftover Dataproc clusters:" | tee -a "${LEFTOVERS_FILE}"
+  echo "${CLUSTERS}" | tee -a "${LEFTOVERS_FILE}"
+  if [[ "$STRICT_MODE" == false ]]; then
+    echo "Cleaning up..."
+    # Use a while-read loop for robustness.
+    echo "${CLUSTERS}" | while IFS= read -r cluster; do
+      if [[ -n "${cluster}" ]]; then
+        gcloud dataproc clusters delete --quiet "${cluster}" --region="${CONFIG[REGION]}" &
+      fi
+    done
+  fi
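
One caveat worth adding to this suggestion: piping into while runs the loop in a subshell, so a later wait in the parent script cannot see the backgrounded deletions. A process-substitution variant (sketched below, using the same variable names as the suggestion) keeps the loop in the current shell and lets the script wait for all deletions to finish:

# Keep the loop in the current shell so 'wait' can collect the background jobs.
while IFS= read -r cluster; do
  [[ -n "${cluster}" ]] || continue
  gcloud dataproc clusters delete --quiet "${cluster}" --region="${CONFIG[REGION]}" &
done < <(echo "${CLUSTERS}")
wait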

Comment on lines +23 to +29
  # ---FIX---
  # Add defensive check for required configuration.
  if [[ -z "${CONFIG[CLUSTER_NAME]}" || -z "${CONFIG[REGION]}" || -z "${CONFIG[SUBNET]}" || -z "${CONFIG[CUJ_TAG]}" ]]; then
    echo "ERROR: One or more required keys (CLUSTER_NAME, REGION, SUBNET, CUJ_TAG) are missing from env.json" >&2
    exit 1
  fi
  # ---END FIX---


Severity: medium

The configuration validation logic is repeated in the delete_cluster function (lines 42-45). Extract this logic into a shared helper function to improve maintainability and reduce code duplication.

Suggested change:

-# ---FIX---
-# Add defensive check for required configuration.
-if [[ -z "${CONFIG[CLUSTER_NAME]}" || -z "${CONFIG[REGION]}" || -z "${CONFIG[SUBNET]}" || -z "${CONFIG[CUJ_TAG]}" ]]; then
-  echo "ERROR: One or more required keys (CLUSTER_NAME, REGION, SUBNET, CUJ_TAG) are missing from env.json" >&2
-  exit 1
-fi
-# ---END FIX---
+function require_config_keys() {
+  local missing_keys=()
+  for key in "$@"; do
+    if [[ -z "${CONFIG[$key]}" ]]; then
+      missing_keys+=("$key")
+    fi
+  done
+  if [[ ${#missing_keys[@]} -gt 0 ]]; then
+    ( # Use a subshell to localize the IFS change.
+      IFS=,
+      echo "ERROR: One or more required keys (${missing_keys[*]}) are missing from env.json" >&2
+    )
+    exit 1
+  fi
+}
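
If this helper is adopted in common.sh, the duplicated inline checks in create_cluster and delete_cluster could each collapse to a single call; a minimal usage sketch:

# Hypothetical call site, replacing the inline checks shown above.
require_config_keys CLUSTER_NAME REGION SUBNET CUJ_TAG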

@cjac changed the title from "[gcloud] refactor(scripts): Organize tests into Critical User Journeys (CUJs)" to "refactor(gcloud): Introduce a test framework for Critical User Journeys" on Jun 24, 2025
@cjac marked this pull request as draft on June 24, 2025 00:57
@cjac (Contributor, Author) commented on Jun 24, 2025

This PR cannot resolve GoogleCloudDataproc/initialization-actions#1281 until it implements the proxy-only CUJ.
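
For context on what that proxy-only journey would eventually exercise, here is a hedged illustration of the kind of cluster creation it might drive: a cluster with no external IPs whose nodes reach the internet only through an HTTP proxy. The metadata key, proxy address, and cluster name are placeholders, not values from this PR:

# Placeholder names and proxy address -- illustrative only.
gcloud dataproc clusters create proxy-egress-test-cluster \
    --region="${CONFIG[REGION]}" \
    --subnet="${CONFIG[SUBNET]}" \
    --no-address \
    --metadata="http-proxy=http://proxy-host:3128"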

This commit refactors the project's testing and automation scripts,
moving them from a monolithic `bin/` directory into a structured
framework centered around Critical User Journeys (CUJs). The goal is
to create a more maintainable and scalable way to exercise the use
cases that are essential for our customers.

This new structure is designed for better organization, code reuse,
and CI/CD integration.

Key changes include:

* **CUJ Directory Structure:**

    * `gcloud/cuj/`: Contains individual journey scripts. The first
      CUJ, `gce/standard`, is introduced to handle the creation,
      deletion, and rebuilding of a standard Dataproc cluster.

    * `gcloud/onboarding/`: Holds scripts for setting up shared,
      prerequisite infrastructure (e.g., GCS buckets) that CUJs depend
      on.

    * `gcloud/lib/`: A new directory for shared bash
      libraries. `common.sh` is introduced to centralize functions for
      configuration loading, network management, and logging.

    * `gcloud/ci/`: Includes scripts specifically for CI/CD pipelines.

* **CI Environment Management:**

    * A new `pristine_check.sh` script is added to ensure the GCP
      project is in a clean state before a test run. It can operate in
      a strict validation mode (fail if resources exist) or a cleanup
      mode (actively delete resources).

* **Configuration and Code Improvements:**

    * Configuration is centralized and simplified in `env.json`, which
      is now loaded by the `load_config` function in `common.sh`.

    * Scripts are refactored to be more robust, with clear `up`,
      `down`, and `validate` commands, and checks for pre-existing
      resources.

    * The `.gitignore` is updated to exclude temporary files and local
      configuration.

This change lays the foundation for adding more complex user journeys,
ensuring our core services are consistently tested from a customer
perspective.
@cjac force-pushed the gcloud-cuj-framework branch from 8c4d7ca to ecb07ab on June 24, 2025 04:04
@cjac force-pushed the gcloud-cuj-framework branch from ecb07ab to d790d25 on August 8, 2025 09:43
This commit builds upon the foundational CUJ framework by ingesting
battle-tested logic from numerous sources and implementing the initial set
of comprehensive, production-like Critical User Journeys.

The framework is now enhanced with a powerful, modular library and the
first advanced CUJs, making it a robust tool for end-to-end testing.

Key Enhancements:

* **Modular Library (`lib/`)**:
    The monolithic `common.sh` is refactored into a modular library
    with components organized by function (`_core.sh`, `_network.sh`,
    `_dataproc.sh`, `_database.sh`, `_security.sh`). This incorporates
    advanced, parameterized, and idempotent functions for managing a
    wide range of GCP resources.

* **Advanced Onboarding (`onboarding/`)**:
    New scripts are added to provision persistent, shared infrastructure,
    including a High-Availability Cloud SQL instance with VPC Peering and
    a dual-NIC Squid Proxy VM, following GCP best practices.

* **New Critical User Journeys (`cuj/`)**:
    * `gce/standard`: This CUJ is enhanced to provision a full,
        NAT-based network environment.
    * `gce/proxy-egress`: A new CUJ is added to test Dataproc
        clusters that use a proxy for all outbound traffic.
    * `gke/standard`: A new CUJ is added for the standard Dataproc
        on GKE use case.

* **Enhanced CI/CD (`ci/`)**:
    `pristine_check.sh` is upgraded to use a robust, tag-based cleanup
    strategy, making it scalable to any number of CUJs without
    modification (see the sketch after this list).

* **Finalized Configuration (`env.json`)**:
    The `env.json.sample` file is finalized with a simplified structure
    that defines the shared test environment and a `cuj_set` for test
    orchestration, abstracting implementation details from the user.

* **Comprehensive Documentation (`README.md`)**:
    The README is updated to be a complete guide for the new framework,
    explaining its philosophy and providing a clear "Getting Started"
    workflow for new users.
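
As a rough illustration of the tag-based cleanup strategy mentioned above, such a sweep might enumerate resources by a shared label or network tag before deleting them. The label key `cuj` and the exact filter expressions below are assumptions, not the PR's code:

# Assumed convention: CUJ-created clusters carry a 'cuj' label and
# CUJ-created VMs carry the CUJ network tag.
gcloud dataproc clusters list \
    --region="${CONFIG[REGION]}" \
    --filter="labels.cuj = *" \
    --format="value(clusterName)"

gcloud compute instances list \
    --filter="tags.items=${CONFIG[CUJ_TAG]}" \
    --format="value(name,zone)"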
@cjac force-pushed the gcloud-cuj-framework branch from d790d25 to 9475017 on August 8, 2025 10:01


Development

Successfully merging this pull request may close these issues.

[initialization-actions] Please add support for http proxy along with tests which exercise the use case
