Skip to content

Conversation

@tomasdavidorg
Copy link
Contributor

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Dec 18, 2025
@openshift-ci-robot
Copy link

openshift-ci-robot commented Dec 18, 2025

@tomasdavidorg: This pull request references OCPERT-270 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

https://issues.redhat.com/browse/OCPERT-270

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci
Copy link

openshift-ci bot commented Dec 18, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign rioliu-rh for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci
Copy link

openshift-ci bot commented Dec 18, 2025

@tomasdavidorg: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@rioliu-rh
Copy link
Contributor

Code Review: Image Consistency Check Prow Job

Overview

This PR adds functionality to trigger image consistency check Prow jobs with custom payload URLs and merge request IDs. The implementation follows existing patterns in the codebase.


Key Improvements Recommended

1. Replace Fixed Sleep with Polling Logic ⭐

Current Issue: Using time.sleep(5) assumes the job initializes within 5 seconds, which may not always be true and could lead to failures or unnecessary waiting.

Better Approach: Enhance the existing get_job_results method to support polling with exponential backoff, making it reusable across both run_job and run_image_consistency_check.

Why reuse get_job_results?

  • Both methods do the same thing: fetch job info from Prow API
  • Avoids code duplication
  • get_job_results already has CSV logging, error handling, and proper field extraction
  • Just needs retry/polling capability added

Proposed Design:

class Jobs:
    # Add class constants for polling configuration
    JOB_POLL_INITIAL_DELAY = 1      # Start with 1 second
    JOB_POLL_MAX_DELAY = 10         # Cap at 10 seconds  
    JOB_POLL_BACKOFF_FACTOR = 2     # Double each retry
    JOB_POLL_TIMEOUT = 60           # Max 60 seconds total
    
    def get_job_results(self, job_id, job_name=None, payload=None, 
                       upgrade_from=None, upgrade_to=None, 
                       poll=False, max_wait_seconds=None):
        """
        Get job results with optional polling.
        
        Args:
            poll: If True, retry with exponential backoff until job is ready
            max_wait_seconds: Maximum time to wait when polling (default: 60s)
        """
        if not job_id:
            print("No job ID input, exit...")
            sys.exit(0)
        
        if max_wait_seconds is None:
            max_wait_seconds = self.JOB_POLL_TIMEOUT
        
        # If polling disabled, make single attempt (original behavior)
        if not poll:
            return self._fetch_job_info_once(job_id, job_name, payload, 
                                            upgrade_from, upgrade_to)
        
        # Polling enabled: retry with exponential backoff
        delay = self.JOB_POLL_INITIAL_DELAY
        elapsed = 0
        attempt = 0
        
        while elapsed < max_wait_seconds:
            attempt += 1
            
            result = self._fetch_job_info_once(
                job_id, job_name, payload, upgrade_from, upgrade_to,
                attempt=attempt, silent=True
            )
            
            if result is not None:
                print(f"Job {job_id} ready after {elapsed}s (attempt {attempt})")
                return result
            
            if attempt == 1:
                print(f"Job {job_id} not ready yet, polling...")
            
            time.sleep(delay)
            elapsed += delay
            delay = min(delay * self.JOB_POLL_BACKOFF_FACTOR, self.JOB_POLL_MAX_DELAY)
        
        raise TimeoutError(
            f"Job {job_id} not ready after {max_wait_seconds}s ({attempt} attempts)."
        )
    
    def _fetch_job_info_once(self, job_id, job_name=None, payload=None,
                            upgrade_from=None, upgrade_to=None, 
                            attempt=1, silent=False):
        """
        Fetch job info once (single HTTP request).
        Extracted from get_job_results for reusability.
        
        Args:
            silent: If True, suppress error messages (used during polling)
        """
        try:
            resp = self._get_session().get(url=self.prow_job_url.format(job_id.strip()))
            
            if resp.status_code == 200 and resp.text:
                job_result = yaml.safe_load(resp.text)
                
                # Check if job has required fields
                if job_result and "status" in job_result and "url" in job_result["status"]:
                    status = job_result["status"]
                    spec = job_result["spec"]
                    job_name = spec["job"]
                    
                    job_url = status.get("url")
                    job_state = status.get("state")
                    job_start_time = status.get("startTime")
                    
                    if not silent:
                        print(job_name, payload, job_id, job_start_time, job_url, job_state)
                    
                    job_dict = {
                        "jobName": job_name,
                        "payload": payload,
                        "upgrade_from": upgrade_from,
                        "upgrade_to": upgrade_to,
                        "jobStartTime": job_start_time,
                        "jobID": job_id,
                        "jobURL": job_url,
                        "jobState": job_state,
                    }
                    
                    if "completionTime" in status:
                        job_dict["jobCompletionTime"] = status["completionTime"]
                    
                    self.save_job_data(job_dict)
                    
                    if not silent:
                        print("Done.\n")
                    
                    return job_dict
                else:
                    if not silent:
                        print(f"Attempt {attempt}: Job info incomplete")
                    return None
            else:
                if not silent:
                    print(f"Attempt {attempt}: HTTP {resp.status_code}")
                return None
                
        except Exception as e:
            if not silent:
                print(f"Attempt {attempt}: Error: {e}")
            return None

Then update both methods to use polling:

def run_image_consistency_check(self, payload_url: str, mr_id: int) -> dict:
    # ... validation and job trigger code ...
    
    job_id = json.loads(job_run_res.text)["id"]
    print(f"Image consistency check job triggered with ID: {job_id}")

    # Use get_job_results with polling enabled!
    job_info = self.get_job_results(job_id, poll=True)
    
    print(f"Image consistency check job URL: {job_info['jobURL']}")
    return job_info

def run_job(self, job_name, payload, upgrade_from, upgrade_to):
    # ... existing code ...
    
    if res.status_code == 200:
        job_id = json.loads(res.text)["id"]
        print(f"Returned job id: {job_id}")
        
        # Replace time.sleep(5) with polling
        try:
            self.get_job_results(
                job_id, job_name, payload, upgrade_from, upgrade_to,
                poll=True  # Enable polling!
            )
        except TimeoutError as e:
            print(f"Warning: {e}")
        except Exception as e:
            print(f"get job result error: {e}")

Benefits:

  • ✅ Handles variable initialization times (1s to 60s)
  • ✅ Exponential backoff reduces API load (1s → 2s → 4s → 8s → 10s...)
  • ✅ Reusable across multiple methods
  • ✅ Backward compatible (poll=False by default)
  • ✅ Configurable timeout via parameters

2. Add Payload URL Validation 🔒

Security/Reliability Concern: The payload_url parameter is not validated, which could lead to:

  • Invalid URLs causing job failures
  • Typos wasting CI resources
  • Potential security issues with unexpected URL formats

Proposed Validation:

def _validate_payload_url(self, payload_url: str) -> bool:
    """
    Validate OpenShift release payload URL format.
    
    Expected format:
        quay.io/openshift-release-dev/ocp-release:<version>-<arch>
    
    Examples:
        Valid:
            - quay.io/openshift-release-dev/ocp-release:4.19.1-x86_64
            - quay.io/openshift-release-dev/ocp-release:4.18.0-0.nightly-2024-01-15-x86_64
        Invalid:
            - https://quay.io/... (has protocol)
            - random-registry.io/... (wrong registry)
    """
    pattern = r'^quay\.io/openshift-release-dev/ocp-release:\d+\.\d+\.\d+.*-(x86_64|aarch64|ppc64le|s390x|multi)$'
    return re.match(pattern, payload_url) is not None

def run_image_consistency_check(self, payload_url: str, mr_id: int) -> dict:
    # Validate payload URL first
    if not self._validate_payload_url(payload_url):
        raise ValueError(
            f"Invalid payload URL format: {payload_url}\n"
            f"Expected: quay.io/openshift-release-dev/ocp-release:<version>-<arch>"
        )
    
    # ... rest of implementation ...

Benefits:

  • ✅ Fail fast with clear error messages
  • ✅ Prevent accidental typos
  • ✅ Ensure URLs match expected format
  • ✅ Better debugging experience

Additional Suggestions

3. Extract Job Name Constant

# At class level
IMAGE_CONSISTENCY_CHECK_JOB = "periodic-ci-openshift-release-tests-master-image-consistency-check"

4. Return Job Metadata

Consider returning structured data for programmatic access:

return {
    "job_id": job_id,
    "job_url": job_info["jobURL"],
    "payload_url": payload_url,
    "mr_id": mr_id
}

Summary

Critical (Please Address):

  1. ✅ Implement exponential backoff polling by enhancing get_job_results
  2. ✅ Add payload URL validation

Nice to Have:
3. Extract magic strings to constants
4. Return structured job metadata
5. Consider custom exception classes (project convention)

The core implementation is solid! With these improvements, it will be more robust, reusable, and production-ready.

Let me know if you'd like me to provide a working patch for the polling logic.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants