Skip to content

Conversation

@skurfuerst
Copy link
Contributor

this on_error handler is executed if any other step in
the pipeline failed; and can be e.g. used to trigger a
notification on error.

The code can already be reviewed; I'll only add a README example.

this on_error handler is executed if any other step in
the pipeline failed; and can be e.g. used to trigger a
notification on error.
@codecov
Copy link

codecov bot commented Dec 22, 2023

Codecov Report

Attention: 48 lines in your changes are missing coverage. Please review.

Comparison is base (d7fdf27) 71.07% compared to head (f64ad94) 70.42%.

Files Patch % Lines
prunner.go 65.46% 40 Missing and 8 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #47      +/-   ##
==========================================
- Coverage   71.07%   70.42%   -0.65%     
==========================================
  Files          20       20              
  Lines        2199     2330     +131     
==========================================
+ Hits         1563     1641      +78     
- Misses        517      562      +45     
- Partials      119      127       +8     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@Sebobo
Copy link
Member

Sebobo commented Jan 16, 2024

We need exactly this feature too, how can we help finishing this up and having a release?

@hlubek
Copy link
Member

hlubek commented Jan 16, 2024

I'm having a look at it right now 👍

Copy link
Member

@hlubek hlubek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, I think we really need this feature. So thanks for taking the initiative here.

I think we need to change the implementation a bit though to fit better into the async handling of prunner so we do not cause issues by the on error task execution (see my comment with the locked mutex).

prunner.go Outdated
// we use a detached taskRunner and scheduler to run the onError task, to
// run synchronously (as we are already in an async goroutine here), won't have any cycles,
// and to simplify the code.
taskRunner := r.createTaskRunner(j)
Copy link
Member

@hlubek hlubek Jan 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we might have an issue cancelation here, since the scheduler / task runner is detached. So this could block the whole thing if the on error task is not finishing.

Locking the pipeline runner mutex (which is for all of prunner) for the whole on error task execution is not good, because we basically block the whole prunner process. It is only okay to write lock the mutex for data structure updates.

I think we need to base this on the *PipelineJob, where we capture the state/context for each running pipeline job. Maybe it's also enough to run the on error scheduler in a go routine and use the WaitGroup of *PipelineRunner to put it into the "normal" waiting behavior.

What about putting the on error task execution in startJob:

	// Run graph asynchronously
	r.wg.Add(1)
	go func() {
		defer r.wg.Done()
		lastErr := job.sched.Schedule(graph)
                if lastErr != nil {
                  // TODO Schedule the on error task (sync)
                }
		r.JobCompleted(job.ID, lastErr)
	}()

(we need to implement some kind of first task error state here though, since the last error is not really helpful)

Ideally we would put this behaviour in the taskctl.Scheduler itself, but it's more generic without real knowledge of output store etc. . I'm thinking of some kind of ghost task that only appears and runs on first error in the pipeline but is already defined. The issue here would be the variables for the failed task stdout/stderr that needs to be put into the variables 🤔.

@Sebobo
Copy link
Member

Sebobo commented Jan 17, 2024

Thx @skurfuerst and @hlubek for your work on this.

To fully explain our use case: We have a long list of exit codes for various failures in the pipeline including a map to translate the error codes for devs and editors in the UI. So the onError task should be able to catch that code and in our case write it to the database for persistence, so we can collect the error reasons for the whole pipeline history and it should also do some cleanup task.
Right now I can only interpret the job logs, which disappear after a while.

@Sebobo
Copy link
Member

Sebobo commented Jan 31, 2024

Any update on this? (Sorry for nagging)

@hlubek
Copy link
Member

hlubek commented Jan 31, 2024

@skurfuerst Did you find some time to look at my notes about the change? I think we could basically go with the implementation and improve it later, but I fear that we pull in a more complexity and concepts that might be hard to change later.

@skurfuerst
Copy link
Contributor Author

@hlubek thanks for your comments :) I sadly did not find time to fix your comments yet; but I can try ro work on this in about two weeks :) In case you can prototype your thoughts, feel free to do so :)

All the best, Sebastian

@Sebobo
Copy link
Member

Sebobo commented Feb 23, 2024

Ping :D

@Sebobo
Copy link
Member

Sebobo commented Apr 10, 2024

Pong ;)

@skurfuerst skurfuerst requested a review from hlubek May 20, 2025 10:40
@skurfuerst
Copy link
Contributor Author

@Sebobo sorry for it taking so long!

@hlubek feel free to have another look, I implemented your suggestion.

@skurfuerst skurfuerst requested a review from Sebobo May 20, 2025 11:46
@skurfuerst
Copy link
Contributor Author

TODO:

  • rename onError to on_error in YAML to be consistent with other naming

@hlubek
Copy link
Member

hlubek commented May 23, 2025

@skurfuerst Awesome! I'll go through the change - maybe I need another week due to vacation.

Copy link
Member

@Sebobo Sebobo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't say much about the code change but the 1.3.0-beta1 release works as I hoped for ❤️

@skurfuerst
Copy link
Contributor Author

@Sebobo make sure to use the beta2 release - it has an important fix for a crash...

@Sebobo
Copy link
Member

Sebobo commented May 27, 2025

@skurfuerst yes already upgraded because I ran into it 😉

@Sebobo
Copy link
Member

Sebobo commented Jun 10, 2025

Is this change now mergeable or are there open changes?

Copy link
Member

@hlubek hlubek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add some code for the shutdown/save handling and marking the first failed task explicitly as a follow-up.

prunner.go Outdated
onErrorVariables["failedTaskStdout"] = string(failedTaskStdout)
onErrorVariables["failedTaskStderr"] = string(failedTaskStderr)
} else {
onErrorVariables["failedTaskName"] = "task_not_identified_should_not_happen"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this can really happen and we should just not set these variables.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok fine with me :)

@hlubek
Copy link
Member

hlubek commented Jun 10, 2025

Looks good to me now, just the thing with "this should never happen"™️ open for discussion.

// - failedTaskError: Error message of the failed task
// - failedTaskStdout: Stdout of the failed task
// - failedTaskStderr: Stderr of the failed task
OnError *OnErrorTaskDef `yaml:"onError"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you agree to rename this to on_error to be consistent with the other fields?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes definitely!!

@hlubek
Copy link
Member

hlubek commented Jun 10, 2025

Open: We should add a section to the readme.

@hlubek hlubek merged commit ca3421d into main Jun 10, 2025
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants