OCPERT-270: Add job command for running image consistency check prow job #845

tomasdavidorg · 2025-12-18T09:31:51Z

https://issues.redhat.com/browse/OCPERT-270

openshift-ci-robot · 2025-12-18T09:31:56Z

@tomasdavidorg: This pull request references OCPERT-270 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

https://issues.redhat.com/browse/OCPERT-270

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci · 2025-12-18T09:33:45Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign rioliu-rh for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci · 2025-12-18T09:42:18Z

@tomasdavidorg: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

rioliu-rh · 2025-12-18T11:38:09Z

Code Review: Image Consistency Check Prow Job

Overview

This PR adds functionality to trigger image consistency check Prow jobs with custom payload URLs and merge request IDs. The implementation follows existing patterns in the codebase.

Key Improvements Recommended

1. Replace Fixed Sleep with Polling Logic ⭐

Current Issue: Using time.sleep(5) assumes the job initializes within 5 seconds, which may not always be true and could lead to failures or unnecessary waiting.

Better Approach: Enhance the existing get_job_results method to support polling with exponential backoff, making it reusable across both run_job and run_image_consistency_check.

Why reuse get_job_results?

Both methods do the same thing: fetch job info from Prow API
Avoids code duplication
get_job_results already has CSV logging, error handling, and proper field extraction
Just needs retry/polling capability added

Proposed Design:

class Jobs:
    # Add class constants for polling configuration
    JOB_POLL_INITIAL_DELAY = 1      # Start with 1 second
    JOB_POLL_MAX_DELAY = 10         # Cap at 10 seconds  
    JOB_POLL_BACKOFF_FACTOR = 2     # Double each retry
    JOB_POLL_TIMEOUT = 60           # Max 60 seconds total
    
    def get_job_results(self, job_id, job_name=None, payload=None, 
                       upgrade_from=None, upgrade_to=None, 
                       poll=False, max_wait_seconds=None):
        """
        Get job results with optional polling.
        
        Args:
            poll: If True, retry with exponential backoff until job is ready
            max_wait_seconds: Maximum time to wait when polling (default: 60s)
        """
        if not job_id:
            print("No job ID input, exit...")
            sys.exit(0)
        
        if max_wait_seconds is None:
            max_wait_seconds = self.JOB_POLL_TIMEOUT
        
        # If polling disabled, make single attempt (original behavior)
        if not poll:
            return self._fetch_job_info_once(job_id, job_name, payload, 
                                            upgrade_from, upgrade_to)
        
        # Polling enabled: retry with exponential backoff
        delay = self.JOB_POLL_INITIAL_DELAY
        elapsed = 0
        attempt = 0
        
        while elapsed < max_wait_seconds:
            attempt += 1
            
            result = self._fetch_job_info_once(
                job_id, job_name, payload, upgrade_from, upgrade_to,
                attempt=attempt, silent=True
            )
            
            if result is not None:
                print(f"Job {job_id} ready after {elapsed}s (attempt {attempt})")
                return result
            
            if attempt == 1:
                print(f"Job {job_id} not ready yet, polling...")
            
            time.sleep(delay)
            elapsed += delay
            delay = min(delay * self.JOB_POLL_BACKOFF_FACTOR, self.JOB_POLL_MAX_DELAY)
        
        raise TimeoutError(
            f"Job {job_id} not ready after {max_wait_seconds}s ({attempt} attempts)."
        )
    
    def _fetch_job_info_once(self, job_id, job_name=None, payload=None,
                            upgrade_from=None, upgrade_to=None, 
                            attempt=1, silent=False):
        """
        Fetch job info once (single HTTP request).
        Extracted from get_job_results for reusability.
        
        Args:
            silent: If True, suppress error messages (used during polling)
        """
        try:
            resp = self._get_session().get(url=self.prow_job_url.format(job_id.strip()))
            
            if resp.status_code == 200 and resp.text:
                job_result = yaml.safe_load(resp.text)
                
                # Check if job has required fields
                if job_result and "status" in job_result and "url" in job_result["status"]:
                    status = job_result["status"]
                    spec = job_result["spec"]
                    job_name = spec["job"]
                    
                    job_url = status.get("url")
                    job_state = status.get("state")
                    job_start_time = status.get("startTime")
                    
                    if not silent:
                        print(job_name, payload, job_id, job_start_time, job_url, job_state)
                    
                    job_dict = {
                        "jobName": job_name,
                        "payload": payload,
                        "upgrade_from": upgrade_from,
                        "upgrade_to": upgrade_to,
                        "jobStartTime": job_start_time,
                        "jobID": job_id,
                        "jobURL": job_url,
                        "jobState": job_state,
                    }
                    
                    if "completionTime" in status:
                        job_dict["jobCompletionTime"] = status["completionTime"]
                    
                    self.save_job_data(job_dict)
                    
                    if not silent:
                        print("Done.\n")
                    
                    return job_dict
                else:
                    if not silent:
                        print(f"Attempt {attempt}: Job info incomplete")
                    return None
            else:
                if not silent:
                    print(f"Attempt {attempt}: HTTP {resp.status_code}")
                return None
                
        except Exception as e:
            if not silent:
                print(f"Attempt {attempt}: Error: {e}")
            return None

Then update both methods to use polling:

def run_image_consistency_check(self, payload_url: str, mr_id: int) -> dict:
    # ... validation and job trigger code ...
    
    job_id = json.loads(job_run_res.text)["id"]
    print(f"Image consistency check job triggered with ID: {job_id}")

    # Use get_job_results with polling enabled!
    job_info = self.get_job_results(job_id, poll=True)
    
    print(f"Image consistency check job URL: {job_info['jobURL']}")
    return job_info

def run_job(self, job_name, payload, upgrade_from, upgrade_to):
    # ... existing code ...
    
    if res.status_code == 200:
        job_id = json.loads(res.text)["id"]
        print(f"Returned job id: {job_id}")
        
        # Replace time.sleep(5) with polling
        try:
            self.get_job_results(
                job_id, job_name, payload, upgrade_from, upgrade_to,
                poll=True  # Enable polling!
            )
        except TimeoutError as e:
            print(f"Warning: {e}")
        except Exception as e:
            print(f"get job result error: {e}")

Benefits:

✅ Handles variable initialization times (1s to 60s)
✅ Exponential backoff reduces API load (1s → 2s → 4s → 8s → 10s...)
✅ Reusable across multiple methods
✅ Backward compatible (poll=False by default)
✅ Configurable timeout via parameters

2. Add Payload URL Validation 🔒

Security/Reliability Concern: The payload_url parameter is not validated, which could lead to:

Invalid URLs causing job failures
Typos wasting CI resources
Potential security issues with unexpected URL formats

Proposed Validation:

def _validate_payload_url(self, payload_url: str) -> bool:
    """
    Validate OpenShift release payload URL format.
    
    Expected format:
        quay.io/openshift-release-dev/ocp-release:<version>-<arch>
    
    Examples:
        Valid:
            - quay.io/openshift-release-dev/ocp-release:4.19.1-x86_64
            - quay.io/openshift-release-dev/ocp-release:4.18.0-0.nightly-2024-01-15-x86_64
        Invalid:
            - https://quay.io/... (has protocol)
            - random-registry.io/... (wrong registry)
    """
    pattern = r'^quay\.io/openshift-release-dev/ocp-release:\d+\.\d+\.\d+.*-(x86_64|aarch64|ppc64le|s390x|multi)$'
    return re.match(pattern, payload_url) is not None

def run_image_consistency_check(self, payload_url: str, mr_id: int) -> dict:
    # Validate payload URL first
    if not self._validate_payload_url(payload_url):
        raise ValueError(
            f"Invalid payload URL format: {payload_url}\n"
            f"Expected: quay.io/openshift-release-dev/ocp-release:<version>-<arch>"
        )
    
    # ... rest of implementation ...

Benefits:

✅ Fail fast with clear error messages
✅ Prevent accidental typos
✅ Ensure URLs match expected format
✅ Better debugging experience

Additional Suggestions

3. Extract Job Name Constant

# At class level
IMAGE_CONSISTENCY_CHECK_JOB = "periodic-ci-openshift-release-tests-master-image-consistency-check"

4. Return Job Metadata

Consider returning structured data for programmatic access:

return {
    "job_id": job_id,
    "job_url": job_info["jobURL"],
    "payload_url": payload_url,
    "mr_id": mr_id
}

Summary

Critical (Please Address):

✅ Implement exponential backoff polling by enhancing get_job_results
✅ Add payload URL validation

Nice to Have:
3. Extract magic strings to constants
4. Return structured job metadata
5. Consider custom exception classes (project convention)

The core implementation is solid! With these improvements, it will be more robust, reusable, and production-ready.

Let me know if you'd like me to provide a working patch for the polling logic.

OCPERT-270: Add job command for running image consistency check prow job

05e5db3

tomasdavidorg requested a review from rioliu-rh December 18, 2025 09:31

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Dec 18, 2025

openshift-ci bot requested review from LuboTerifaj and barboras7 December 18, 2025 09:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

OCPERT-270: Add job command for running image consistency check prow job #845

OCPERT-270: Add job command for running image consistency check prow job #845

Uh oh!

tomasdavidorg commented Dec 18, 2025

Uh oh!

openshift-ci-robot commented Dec 18, 2025 •

edited by openshift-ci bot

Loading

Uh oh!

openshift-ci bot commented Dec 18, 2025

Uh oh!

openshift-ci bot commented Dec 18, 2025

Uh oh!

rioliu-rh commented Dec 18, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

OCPERT-270: Add job command for running image consistency check prow job #845

Are you sure you want to change the base?

OCPERT-270: Add job command for running image consistency check prow job #845

Uh oh!

Conversation

tomasdavidorg commented Dec 18, 2025

Uh oh!

openshift-ci-robot commented Dec 18, 2025 • edited by openshift-ci bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

openshift-ci bot commented Dec 18, 2025

Uh oh!

openshift-ci bot commented Dec 18, 2025

Uh oh!

rioliu-rh commented Dec 18, 2025

Code Review: Image Consistency Check Prow Job

Overview

Key Improvements Recommended

1. Replace Fixed Sleep with Polling Logic ⭐

2. Add Payload URL Validation 🔒

Additional Suggestions

3. Extract Job Name Constant

4. Return Job Metadata

Summary

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

openshift-ci-robot commented Dec 18, 2025 •

edited by openshift-ci bot

Loading