-
Notifications
You must be signed in to change notification settings - Fork 1.3k
Avoid control threads race condition #20002
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid control threads race condition #20002
Conversation
When starting the dt gui and calling `gtk_main()` we must be sure we can safely dispatch a control job. We can do so only if all control threads are running. We achieve this by adding a control->running atomic counter which is increased whenever any of the threads code has been started. As we know the number of started threads we can check & wait (for up to 1sec) until all threads are up&running before leaving `dt_control_jobs_init()`. The new function `gboolean dt_control_all_running()` tests for running vs requested threads and is used to test if we can finally use gtk_main().
|
Can't produce a hang right now, maybe the logo change? |
Sorry - i don't understand. |
|
The darktable icon changed to the santa hat |
Could a little explanation be added here for why this is so? This might help someone with more knowledge of concurrency (or in the absence of that, maybe even me) in future to clean up some of this code. There are several places that seem to check things that shouldn't have to be checked because, well, it can't hurt. But too many different mechanism trying to regulate concurrency tend to get in each others way and if there's no clear understanding how they all work and why they are needed, it well never be possible to clean this up and test. Naively, I would think that if we dispatch a job, it should be good enough if it eventually gets picked up. So as long as we know that a thread will be created ("we are in gui mode"), we should be good. Why do we need to know the receiver is running? How would this help? Even if it is running, it might be doing something else at the moment; how would that be different from "still starting up". Asking because reading #19992 I get the impression this didn't solve anything. So will it be reverted? Or will we just keep it around "just in case"? Until it causes problems and noone can figure out why it was there anyway and now we'll implement some even more complicated workaround to keep everything "working" even though we don't know what that means and therefore can't test if it actually still does what it's supposed to do? @TurboGit @jenshannoschwalm @ralfbrown in case you don't get notifications for merged PRs anymore. |
| #endif | ||
| /* start the event loop */ | ||
| if(dt_control_running()) | ||
| if(dt_control_all_running()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If not all jobs have really started yet by the time we get here, we just skip gtk_main and immediately quit? In what situation does this actually solve anything?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might reach this point without all threads running safely.
|
I would keep it and even add #20025 :-) It's not difficult at all, together both PR's: ensure that the control job dispatcher is fully functional when If we dispatch a job before that, we might not be sure what |
I'm just really not sure what you mean by this. The jobs have been created with The only thing that I see this PR doing is a more complicated version of checking that 1+1 = 2. It has to be, otherwise you might as well throw the whole thing out the window. The other thing it obviously does is add more delays. So yes, that might "help" with race conditions, in the sense that it may make them less frequent. But that is worse, not better because less frequent bugs are harder to reproduce and therefore harder to fix. So I really don't see the benefit of adding stuff like this that doesn't obviously solve anything. I know that I won't get anything out of arguing this (least of all a clear answer) but at least I wanted to document here that there's no problem that is being solved by this so that whoever comes across this PR in future doesn't have to deal with the argument that "it must have been good for something, otherwise it would not have been added, so now you can never ever remove it. Or touch it because you can't test that you didn't break what it was supposed to be doing." Maintainability suffers. On a separate note, a while back @ralfbrown and I were wondering why in which seem to leave the last thread hanging. Is this intentional? If so, could the reason be lightly documented there. |
When starting the dt gui and calling
gtk_main()we must be sure we can safely dispatch a control job. We can do so only if all control threads are running.We achieve this by adding a control->running atomic counter which is increased whenever any of the threads code has been started. As we know the number of started threads we can check & wait (for up to 1sec) until all threads are up&running before leaving
dt_control_jobs_init().The new function
gboolean dt_control_all_running()tests for running vs requested threads and is used to test if we can finally usegtk_main().@wpferguson this would be my "proper" solution for the hypothesis, that we have a race condition as the root cause for #19992. At least: if we dispatch a control job either in lighttable or darkroom it could be the case that the worker for that job is not running yet and thus ends in nirvana. I have to confess i couldn't trigger it yet a single time. (but i am mostly on my slow notebook still).
Would you and others discussing in #19992 be able to compile and test (@victoryforce @gi-man)
@TurboGit code is absolutely safe to me.