Merged
6 changes: 6 additions & 0 deletions build/ci-Dockerfile
@@ -21,6 +21,12 @@ RUN curl -fsSL https://rpm.nodesource.com/setup_20.x | bash - && \
npm install -g @anthropic-ai/claude-code && \
dnf clean all

# Clone openshift/velero source code for failure analysis
# Uses oadp-dev branch to match OADP operator development
RUN git clone --depth 1 --branch oadp-dev \
Comment thread
weshayutin marked this conversation as resolved.
Member

If the Velero clone fails (network issue, branch rename), the Docker build will continue but Claude's analysis will reference non-existent files. Consider adding error handling:

  RUN git clone --depth 1 --branch oadp-dev \
      https://github.com/openshift/velero.git \
      /go/src/github.com/openshift/velero || \
      echo "Warning: Velero source clone failed, source investigation will be unavailable"

Member Author

The build log for this container is not seen by Claude so I'm not sure echo here does anything.

Member

The hardcoded oadp-dev branch works for current development, but may need updating if Velero's branch naming changes or if release branches need different Velero versions. A future enhancement could make this configurable via ARG VELERO_BRANCH=oadp-dev.

Member Author

Make it so you can specify the branch from the Makefile, right? Then when this is cherry-picked, only the Makefile needs to change, right?

Member

sure works

https://github.com/openshift/velero.git \
/go/src/github.com/openshift/velero

RUN go mod download && \
mkdir -p $(go env GOCACHE) && \
chmod -R 777 ./ $(go env GOCACHE) $(go env GOPATH)
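
The review thread above raises two points that a short shell sketch can make concrete. First, a failing `git clone` exits non-zero, so a plain `RUN` step already aborts the build, whereas appending `|| echo ...` would let the build continue without the source tree. Second, the branch could come from a variable instead of being hardcoded. The sketch below assumes a hypothetical `VELERO_BRANCH` variable (the name floated in the review, for example passed from the Makefile through a Dockerfile `ARG`) and hypothetical helper names; it is an illustration, not code taken from this PR.

```shell
# Hypothetical sketch: VELERO_BRANCH, VELERO_DEST, and both function names are
# illustrative assumptions, not part of the PR.
VELERO_REPO="https://github.com/openshift/velero.git"
VELERO_BRANCH="${VELERO_BRANCH:-oadp-dev}"
VELERO_DEST="${VELERO_DEST:-/go/src/github.com/openshift/velero}"

# Print the clone command for a given branch and destination (a dry run).
velero_clone_command() {
  printf 'git clone --depth 1 --branch %s %s %s\n' "$1" "$VELERO_REPO" "$2"
}

# Clone, then check that the tree looks usable. Any failure returns non-zero,
# so the calling RUN step or script fails instead of continuing silently.
clone_velero() {
  git clone --depth 1 --branch "$1" "$VELERO_REPO" "$2" || return 1
  [ -d "$2/pkg" ] || { echo "velero clone at $2 has no pkg/ directory" >&2; return 1; }
}

# Example invocation (needs network access):
#   clone_velero "$VELERO_BRANCH" "$VELERO_DEST"
```

With this shape, a cherry-pick to a release branch would only need to change the value supplied for `VELERO_BRANCH`.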
4 changes: 3 additions & 1 deletion tests/e2e/lib/flakes.go
@@ -22,7 +22,9 @@ var errorIgnorePatterns = []string{
"level=error msg=\"error patch for managed fields ",
"VolumeSnapshot has a temporary error Failed to create snapshot: error updating status for volume snapshot content snapcontent-",
"Skipping hypershift plugin execution - not a hypershift backup: error checking for HostedControlPlane CRD",
-"claim Selector is not supported",
+
+// Data mover volume restore limitation per https://github.com/vmware-tanzu/velero/issues/7946#issuecomment-2196590014
+"failed to restore volume with StorageClass, claim Selector is not supported",
}

type FlakePattern struct {
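
The pattern change in this hunk replaces a short substring with a longer, more specific one. Assuming the patterns are matched as plain substrings of log lines (the matching code is not shown in this diff), the narrower pattern no longer ignores unrelated messages that merely contain `claim Selector is not supported`. A minimal shell sketch of that behavior, using a hypothetical `matches_ignore_pattern` helper:

```shell
# Hypothetical helper: returns 0 if the log line contains any of the given
# patterns as a literal substring, 1 otherwise. Substring matching is an
# assumption; the real matcher lives in flakes.go and is not shown here.
matches_ignore_pattern() {
  line="$1"
  shift
  for pattern in "$@"; do
    case "$line" in
      # Quoting "$pattern" makes the shell treat it literally, not as a glob.
      *"$pattern"*) return 0 ;;
    esac
  done
  return 1
}

old_pattern='claim Selector is not supported'
new_pattern='failed to restore volume with StorageClass, claim Selector is not supported'
```

Under this assumption, a made-up line such as `PVC rejected: claim Selector is not supported` would be ignored with `old_pattern` but reported with `new_pattern`.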
72 changes: 71 additions & 1 deletion tests/e2e/scripts/analyze_failures.sh
@@ -97,6 +97,8 @@ Read the log file and output a summary containing:

5. **Correlation**: Group related errors together - if multiple errors reference the same resource (backup name, PVC, pod), keep them together with their context.

6. **Source references**: When you find errors from Velero packages (pkg/backup/, pkg/restore/, pkg/controller/, pkg/nodeagent/), note the file:line references for later source code investigation.
Copilot AI Dec 18, 2025

Missing space after 'pkg/nodeagent/' before the closing parenthesis for consistency with other package references.

Suggested change
6. **Source references**: When you find errors from Velero packages (pkg/backup/, pkg/restore/, pkg/controller/, pkg/nodeagent/), note the file:line references for later source code investigation.
6. **Source references**: When you find errors from Velero packages (pkg/backup/, pkg/restore/, pkg/controller/, pkg/nodeagent/ ), note the file:line references for later source code investigation.


Format each error group as:
--- [package/component name] ---
[context lines from same package]
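
The "Source references" instruction added in this hunk asks for file:line references from Velero packages. As a rough illustration of what that extraction could look like outside the prompt, the sketch below pulls such references from log text on standard input. It assumes log lines carry references of the form `pkg/<package>/<file>.go:<line>`; the function name `extract_source_refs` is hypothetical.

```shell
# Hypothetical helper: reads log text on stdin and prints the unique
# file:line references that point into the Velero packages named in the prompt.
# Assumes references look like pkg/restore/restore.go:123.
extract_source_refs() {
  grep -oE 'pkg/(backup|restore|controller|nodeagent)/[A-Za-z0-9_./-]+\.go:[0-9]+' | sort -u
}

# Example invocation (the log file name is a placeholder):
#   extract_source_refs < velero.log
```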
@@ -215,6 +217,14 @@ You are analyzing a failed OADP (OpenShift API for Data Protection) E2E test run
4. **preprocessed-logs.txt**: Pre-extracted errors from large log files (>1MB)
- Contains error summaries from large logs that were too big to analyze directly
- Use this for quick access to relevant errors without reading full logs
5. **Velero Source Code**: `/go/src/github.com/openshift/velero/`
- OpenShift's fork of Velero with OADP-specific patches
- Use to investigate error messages originating from Velero packages
- Key directories: `pkg/backup/`, `pkg/restore/`, `pkg/controller/`, `pkg/nodeagent/`
6. **OADP Operator Source Code**: `/go/src/github.com/openshift/oadp-operator/`
- The OADP operator codebase being tested
- Key directories: `internal/controller/`, `pkg/`, `api/v1alpha1/`
- Use to investigate OADP-specific errors and reconciliation logic

**Note**: Prow's build-log.txt is written by CI infrastructure after tests complete and is NOT available during this analysis. Use the artifacts listed above.

@@ -229,6 +239,35 @@ This file contains:

Cross-reference failures against these patterns before diagnosing as real failures.

## Source Code Investigation

When analyzing failures, use the source code to understand error origins:

1. Locate the error message in the source code
2. Trace the code path that led to the error
3. Identify what conditions trigger the error
4. Check if the error is recoverable, transient, or indicates a real bug
5. Look for related error handling or retry logic

### Velero Source (`/go/src/github.com/openshift/velero/`)

Key Velero packages:
- `pkg/backup/` - Backup workflow and item processing
- `pkg/restore/` - Restore workflow and item processing
- `pkg/controller/` - Kubernetes controllers for backup/restore CRs
- `pkg/nodeagent/` - Node agent (restic/kopia) operations
- `pkg/persistence/` - Object storage operations
- `pkg/plugin/` - Plugin framework and built-in plugins

### OADP Operator Source (`/go/src/github.com/openshift/oadp-operator/`)

Key OADP packages:
- `internal/controller/` - DPA reconciler and other controllers
- `pkg/velero/` - Velero deployment and configuration
- `pkg/credentials/` - Cloud credential management
- `api/v1alpha1/` - CRD type definitions
- `tests/e2e/lib/` - E2E test utilities and flake patterns

## Analysis Tasks

1. Parse junit_report.xml to identify all failed tests and extract failure messages
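
Step 1 of the "Source Code Investigation" list above (locate the error message in the source code) amounts to a fixed-string search over the cloned trees. A minimal sketch, assuming a grep with `-r` and `--include` support (GNU or BSD grep) and a hypothetical helper name `find_error_origin`:

```shell
# Hypothetical helper: prints file:line:text for every Go source line under
# the given root that contains the message as a literal (non-regex) string.
find_error_origin() {
  message="$1"
  root="$2"
  grep -rnF --include='*.go' -- "$message" "$root"
}

# Example invocation against the path named in the prompt:
#   find_error_origin "claim Selector is not supported" /go/src/github.com/openshift/velero/pkg
```

The `-F` flag matters here because error messages often contain characters such as `.` or `(` that a regular expression would treat specially.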
@@ -337,6 +376,27 @@ From must-gather analysis:
2. Check if failures match existing GitHub issues
3. Re-run flakes to confirm transient nature
4. Investigate environmental issues in cluster/cloud provider

## Must-Gather Improvement Suggestions

If information was missing or incomplete during analysis, list what additional data would have helped:

### Missing Data That Would Have Helped
- <What was needed and why it would have helped diagnosis>
- <Specific resource/log/metric that was missing>

### Recommended Must-Gather Enhancements
1. **<Category>**: <Specific improvement suggestion>
- Current gap: <What's missing>
- Suggested addition: <What to collect>
- Example: <Concrete example of the data needed>

Examples of potential improvements:
- Additional pod logs (e.g., init containers, sidecar containers)
- Specific CRD status fields not currently captured
- Cluster-level resources affecting OADP (NetworkPolicies, ResourceQuotas)
- Timing/metrics data (pod startup times, API latencies)
- Cloud provider specific diagnostics (S3 bucket policies, IAM roles)
```

## Important Guidelines
@@ -349,6 +409,9 @@ From must-gather analysis:
- Cross-reference: Link similar failures across multiple tests
- Prioritize: Put critical issues before warnings before flakes
- Use preprocessed-logs.txt: Check this file first for errors from large log files
- Must-gather feedback: When you cannot determine root cause due to missing information,
explicitly note what additional must-gather data would have helped. This feedback loop
improves future debugging capabilities.
PROMPT_EOF

# Count failed tests from JUnit (count individual test failures, not just suites)
@@ -389,9 +452,16 @@ Analyze these artifacts:
2. Preprocessed log errors: ${ARTIFACT_DIR}/preprocessed-logs.txt (check this FIRST for large log summaries)
3. Must-gather: ${ARTIFACT_DIR}/must-gather/
4. Per-test failure directories: ${ARTIFACT_DIR}/*/
+5. Velero source code: /go/src/github.com/openshift/velero/
+6. OADP operator source code: /go/src/github.com/openshift/oadp-operator/
+
+When errors reference Velero or OADP packages, read the relevant source code to understand:
+- What conditions trigger the error
+- If there's retry logic that should have handled it
+- If this is a known limitation or edge case

Note: Prow's build-log.txt is NOT available during this analysis (it's written after tests complete).
-Focus on JUnit results, preprocessed log summaries, must-gather diagnostics, and per-test pod logs.
+Focus on JUnit results, preprocessed log summaries, must-gather diagnostics, per-test pod logs, and source code investigation.

Generate comprehensive failure analysis following the output format specified in the prompt.
Focus on actionable insights and clear root cause identification.