This is very much a duplicate of #3, but it seems like you had trouble reproducing the issue consistently in tests. I put together a proof of concept which (at least on my machines) consistently shows the described problem here.
Unfortunately, I'm not sure how to turn this into a unit test, as it relies on enabling size optimization (opt-level = "s") and requires the functions to be defined in a separate crate in order to avoid inlining. If you can think of a way to turn this into a test or just want to use the code for debugging purposes, feel free to do so!
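
For reference, a minimal sketch of the kind of Cargo configuration this setup implies. The crate name `helper`, its path, and the choice of the release profile are assumptions for illustration, not the actual layout of the proof of concept.

```toml
# Cargo.toml of the top-level crate that triggers the problem (hypothetical layout)

[dependencies]
# Hypothetical helper crate: the functions live in a separate crate,
# which is what avoids inlining in this setup.
helper = { path = "../helper" }

[profile.release]
# Optimize for size; assumed here to be set on the release profile.
opt-level = "s"
```

With a layout like this, the behaviour would be checked with a release build (for example `cargo run --release`), since profile settings are taken from the top-level crate being built.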