Skip to content

Conversation

@zzby0
Copy link
Contributor

@zzby0 zzby0 commented Jul 18, 2025

Summary

NuttX sim use coroutine base on one thread in host to do switch context. but if we allow switch context with in one API (host-API and errno get), maybe the switch context from coroutine cause re-enter host-API call. Make the errno get behavior not work as expected.

This commit disable interrupts before and after host-APIs, make sure that errno is correct.

Impact

  • Resolves previously observed cases where errno returned inconsistent values after context switch.

  • Some host-APIs are now uninterruptible during execution, which may slightly increase latency in scheduling.

Testing

CI-test, sim/nsh ostest

Sim use coroutine base on one thread in host to do switch context. but if we allow switch context with in one API (host-API and errno get), maybe the switch context from coroutine cause re-enter host-API call. Make the errno get behavior not work as expected.

Signed-off-by: guanyi3 <guanyi3@xiaomi.com>
@github-actions github-actions bot added Arch: simulator Issues related to the SIMulator Size: M The size of the change in this PR is medium labels Jul 18, 2025
@acassis acassis merged commit ac5b38c into apache:master Jul 21, 2025
14 checks passed
@yamt
Copy link
Contributor

yamt commented Jul 24, 2025

This commit disable interrupts before and after host-APIs, make sure that errno is correct.

i feel this approach is very hard to maintain.

it's simpler (and probably more efficient) to save/restore errno in the "interrupts" instead. what do you think?

@xiaoxiang781216
Copy link
Contributor

This commit disable interrupts before and after host-APIs, make sure that errno is correct.

the interrupt must be disabled, since many host API can't work well if the interrupt happen in side their function.

i feel this approach is very hard to maintain.

it's simpler (and probably more efficient) to save/restore errno in the "interrupts" instead. what do you think?

your mean restore errno before enabling interrupt? It doesn't work since host main thread manage all nuttx thread and errno will lose if another nuttx thread switch in and call some host api which may overwrite the errno value before old thread has the chance to retrieve it.

@yamt
Copy link
Contributor

yamt commented Jul 25, 2025

This commit disable interrupts before and after host-APIs, make sure that errno is correct.

the interrupt must be disabled, since many host API can't work well if the interrupt happen in side their function.

i feel this approach is very hard to maintain.
it's simpler (and probably more efficient) to save/restore errno in the "interrupts" instead. what do you think?

your mean restore errno before enabling interrupt? It doesn't work since host main thread manage all nuttx thread and errno will lose if another nuttx thread switch in and call some host api which may overwrite the errno value before old thread has the chance to retrieve it.

i meant to save/restore in interrupt handlers.
as you know, it's a common practice for unix to save/restore errno in signal handlers.
i don't see any reason it doesn't work for sim.

@xiaoxiang781216
Copy link
Contributor

This commit disable interrupts before and after host-APIs, make sure that errno is correct.

the interrupt must be disabled, since many host API can't work well if the interrupt happen in side their function.

i feel this approach is very hard to maintain.
it's simpler (and probably more efficient) to save/restore errno in the "interrupts" instead. what do you think?

your mean restore errno before enabling interrupt? It doesn't work since host main thread manage all nuttx thread and errno will lose if another nuttx thread switch in and call some host api which may overwrite the errno value before old thread has the chance to retrieve it.

i meant to save/restore in interrupt handlers. as you know, it's a common practice for unix to save/restore errno in signal handlers. i don't see any reason it doesn't work for sim.

Do you mean save/restore host errno in the interrupt handler? it can't fix the problem I describe here, since the problem isn't the signal handler change errno. The real problem is that sim use one host thread with setjmp/longjmp to simulate multiple nuttx thread and the race may happen like this:

  1. nuttx thread 1 call host function which fail and set errno, but before thread 1 read errno
  2. timer signal happen and scheduler switch to nuttx thread 2
  3. nuttx thread 2 may call host function which may fail too and overwrite errno
  4. when nuttx thread 1 run again, it read back the wrong errno

@zzby0
Copy link
Contributor Author

zzby0 commented Jul 25, 2025

This commit disable interrupts before and after host-APIs, make sure that errno is correct.

the interrupt must be disabled, since many host API can't work well if the interrupt happen in side their function.

i feel this approach is very hard to maintain.
it's simpler (and probably more efficient) to save/restore errno in the "interrupts" instead. what do you think?

your mean restore errno before enabling interrupt? It doesn't work since host main thread manage all nuttx thread and errno will lose if another nuttx thread switch in and call some host api which may overwrite the errno value before old thread has the chance to retrieve it.

i meant to save/restore in interrupt handlers. as you know, it's a common practice for unix to save/restore errno in signal handlers. i don't see any reason it doesn't work for sim.

Thanks for the feedback. You're right that saving/restoring errno in interrupt handlers is a common Unix practice
However, here we're dealing with two issues:

  1. Errno consistency (which your solution may address well)
  2. Many host APIs lack robust error recovery when interrupted by signals. For example, when read() a file is interruped by a signal, it will return error directly. (here disable interrupt maybe better)

@yamt
Copy link
Contributor

yamt commented Jul 25, 2025

This commit disable interrupts before and after host-APIs, make sure that errno is correct.

the interrupt must be disabled, since many host API can't work well if the interrupt happen in side their function.

i feel this approach is very hard to maintain.
it's simpler (and probably more efficient) to save/restore errno in the "interrupts" instead. what do you think?

your mean restore errno before enabling interrupt? It doesn't work since host main thread manage all nuttx thread and errno will lose if another nuttx thread switch in and call some host api which may overwrite the errno value before old thread has the chance to retrieve it.

i meant to save/restore in interrupt handlers. as you know, it's a common practice for unix to save/restore errno in signal handlers. i don't see any reason it doesn't work for sim.

Do you mean save/restore host errno in the interrupt handler?

yes.

it can't fix the problem I describe here, since the problem isn't the signal handler change errno. The real problem is that sim use one host thread with setjmp/longjmp to simulate multiple nuttx thread and the race may happen like this:

i don't understand why it can't fix the problem.

do you mean we can't save/restore host errno while we can save/restore the rest of context including hw registers?
why not?

1. nuttx thread 1 call host function which fail and set errno, but before thread 1 read errno

2. timer signal happen and scheduler switch to nuttx thread 2

save errno here.

3. nuttx thread 2 may call host function which may fail too and overwrite errno

4. when nuttx thread 1 run again,

and restore here.

it read back the wrong errno

@yamt
Copy link
Contributor

yamt commented Jul 25, 2025

2. Many host APIs lack robust error recovery when interrupted by signals. For example, when read() a file is interruped by a signal, it will return error directly. (here disable interrupt maybe better)

if we are using host calls in a way which can't tolerate signals, it's a bug.
how it should be fixed really depends. eg. handle EINTR, SA_RESTART, mask signals, make a loop, etc
it isn't a good idea to blindly mask signals.

@xiaoxiang781216
Copy link
Contributor

This commit disable interrupts before and after host-APIs, make sure that errno is correct.

the interrupt must be disabled, since many host API can't work well if the interrupt happen in side their function.

i feel this approach is very hard to maintain.
it's simpler (and probably more efficient) to save/restore errno in the "interrupts" instead. what do you think?

your mean restore errno before enabling interrupt? It doesn't work since host main thread manage all nuttx thread and errno will lose if another nuttx thread switch in and call some host api which may overwrite the errno value before old thread has the chance to retrieve it.

i meant to save/restore in interrupt handlers. as you know, it's a common practice for unix to save/restore errno in signal handlers. i don't see any reason it doesn't work for sim.

Do you mean save/restore host errno in the interrupt handler?

yes.

it can't fix the problem I describe here, since the problem isn't the signal handler change errno. The real problem is that sim use one host thread with setjmp/longjmp to simulate multiple nuttx thread and the race may happen like this:

i don't understand why it can't fix the problem.

do you mean we can't save/restore host errno while we can save/restore the rest of context including hw registers? why not?

1. nuttx thread 1 call host function which fail and set errno, but before thread 1 read errno

2. timer signal happen and scheduler switch to nuttx thread 2

save errno here.

3. nuttx thread 2 may call host function which may fail too and overwrite errno

4. when nuttx thread 1 run again,

and restore here.

it read back the wrong errno

Ok, this approach may possible, @GUIDINGLI @jasonbu could you look at this and provide a new fix?

@GUIDINGLI
Copy link
Contributor

OK,
Next patch we can try to move the host errno to context switch

@xiaoxiang781216
Copy link
Contributor

2. Many host APIs lack robust error recovery when interrupted by signals. For example, when read() a file is interruped by a signal, it will return error directly. (here disable interrupt maybe better)

if we are using host calls in a way which can't tolerate signals, it's a bug. how it should be fixed really depends. eg. handle EINTR, SA_RESTART, mask signals, make a loop, etc it isn't a good idea to blindly mask signals.

Retry loop for EINTR doesn't work very well for many 3rd party library since 3rd party code doesn't consider or test with the environment that signal happen randomly and frequently.

@jasonbu
Copy link
Contributor

jasonbu commented Jul 29, 2025

This commit disable interrupts before and after host-APIs, make sure that errno is correct.

the interrupt must be disabled, since many host API can't work well if the interrupt happen in side their function.

i feel this approach is very hard to maintain.
it's simpler (and probably more efficient) to save/restore errno in the "interrupts" instead. what do you think?

your mean restore errno before enabling interrupt? It doesn't work since host main thread manage all nuttx thread and errno will lose if another nuttx thread switch in and call some host api which may overwrite the errno value before old thread has the chance to retrieve it.

i meant to save/restore in interrupt handlers. as you know, it's a common practice for unix to save/restore errno in signal handlers. i don't see any reason it doesn't work for sim.

Do you mean save/restore host errno in the interrupt handler?

yes.

it can't fix the problem I describe here, since the problem isn't the signal handler change errno. The real problem is that sim use one host thread with setjmp/longjmp to simulate multiple nuttx thread and the race may happen like this:

i don't understand why it can't fix the problem.
do you mean we can't save/restore host errno while we can save/restore the rest of context including hw registers? why not?

1. nuttx thread 1 call host function which fail and set errno, but before thread 1 read errno

2. timer signal happen and scheduler switch to nuttx thread 2

save errno here.

3. nuttx thread 2 may call host function which may fail too and overwrite errno

4. when nuttx thread 1 run again,

and restore here.

it read back the wrong errno

Ok, this approach may possible, @GUIDINGLI @jasonbu could you look at this and provide a new fix?

hi, @yamt, thanks for your idea, try save errno when sim context switch can also handle this problem, please review #16666, append patch in this pr. tested in my local environment with complex sim project.

@jasonbu
Copy link
Contributor

jasonbu commented Jul 29, 2025

using #16795 to make it more clear

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Arch: simulator Issues related to the SIMulator Size: M The size of the change in this PR is medium

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants