[CORE-6788] Ensure backend failures don't terminate the publishing process #7
Conversation
src/Monitor/Tracing/Zipkin.hs
Outdated
    else flush
pure $ Zipkin mgr req tracer mbEpt
zpk = Zipkin mgr req tracer mbEpt
for_ (microSeconds <$> mfilter (> 0) mbPrd) $ \delay -> fork $ forever $ do
This part is the one I'm the least satisfied with; I took it "as is" from the original PR, but I think it is flawed.
If the mbPrd parameter is Nothing, the behaviour is messed up: the flush is only performed at the termination of the process (i.e., in most cases, when quitting the program). I suspect (without being 100% sure) that this is because of the intricacies of mfilter. Though we typically set this parameter, it would be better not to leave such an aberrant behaviour possible.
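For reference, a tiny standalone illustration of how mfilter and for_ combine here (this is not the library code, just a demonstration of the mechanics):

```haskell
import Control.Monad (mfilter)
import Data.Foldable (for_)

-- mfilter keeps the value only when the predicate holds, and for_ over
-- Nothing runs its action zero times, so with mbPrd = Nothing no flush
-- thread is ever forked.
main :: IO ()
main = do
  print (mfilter (> 0) (Just 5  :: Maybe Int))   -- Just 5
  print (mfilter (> 0) (Just 0  :: Maybe Int))   -- Nothing
  print (mfilter (> 0) (Nothing :: Maybe Int))   -- Nothing
  for_ (mfilter (> 0) (Nothing :: Maybe Int)) $ \_ ->
    putStrLn "this would be the forked flush loop"  -- never executed
```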
Yeah, that looks really bad. What is mbPrd for? You should not be allowed to not spawn a publishing thread, let's fix this.
The documentation for the field reads:
, settingsPublishPeriod :: !(Maybe NominalDiffTime)
-- ^ If set to a positive value, traces will be flushed in the background every such period.
Which, I guess, can be interpreted as "if set to Nothing, there is no regular flush", but that makes very little sense (except perhaps for some very specific cases like daemons that get restarted, or things like that). I suggest replacing the Nothing case with a sensible default (every second, for example).
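Something along these lines, as a sketch (the helper name and the 1-second default are just placeholders, not existing code):

```haskell
import Data.Maybe (fromMaybe)
import Data.Time.Clock (NominalDiffTime)

-- Hypothetical helper: always yield a publish period, defaulting to one
-- second, so a background flush thread is guaranteed to be spawned.
effectivePublishPeriod :: Maybe NominalDiffTime -> NominalDiffTime
effectivePublishPeriod = fromMaybe 1

main :: IO ()
main = print (effectivePublishPeriod Nothing)  -- 1s
```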
Yeah, let's just get rid of the Maybe there.
Shall we also remove the ignore flag, since it doesn't get used anymore?
src/Monitor/Tracing/Zipkin.hs
Outdated
let unspan (ZipkinSpan _ sample) = sample
    queue = spanSamples tracer
    fillOne item = void . atomically $ do
      cannotPush <- isFullTBQueue queue
This check is very important, lest an uncaught exception be raised. Of course, as a result, if the queue is full, some traces will be lost (which is why I set a fairly high limit on the queue).
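For context, this is essentially a drop-when-full push, sketched standalone (the queue size and names here are illustrative):

```haskell
import Control.Concurrent.STM
  (TBQueue, atomically, isFullTBQueue, newTBQueueIO, writeTBQueue)
import Control.Monad (unless)

-- Push an item, silently dropping it when the queue is full instead of
-- blocking or raising; this mirrors the cannotPush check above.
pushOrDrop :: TBQueue a -> a -> IO ()
pushOrDrop queue item = atomically $ do
  full <- isFullTBQueue queue
  unless full (writeTBQueue queue item)

main :: IO ()
main = do
  q <- newTBQueueIO 10000 :: IO (TBQueue Int)
  mapM_ (pushOrDrop q) [1 .. 20000]  -- the second half is dropped, no error
```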
src/Control/Monad/Trace.hs
Outdated
tags <- readTVar tagsTV
logs <- sortOn (\(t, k, _) -> (t, k)) <$> readTVar logsTV
writeTChan (tracerChannel tracer) (Sample spn tags logs start (end - start))
writeTBQueue (tracerQueue tracer) (Sample spn tags logs start (end - start))
writeTBQueue blocks when the queue is full; it should instead do nothing (see SBQueue, in particular writeSBQueue).
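Roughly the idea (just a sketch, not SBQueue's actual API; a capacity-bounded, TVar-backed queue whose write never blocks):

```haskell
import Control.Concurrent.STM (TVar, atomically, newTVarIO, readTVar, writeTVar)

-- Writes drop the element when the capacity is reached, and a flush drains
-- everything at once.
data SimpleBoundedQueue a = SimpleBoundedQueue
  { sbqContents :: TVar ([a], Int)  -- reversed contents and current size
  , sbqCapacity :: Int
  }

newSimpleBoundedQueue :: Int -> IO (SimpleBoundedQueue a)
newSimpleBoundedQueue cap = (`SimpleBoundedQueue` cap) <$> newTVarIO ([], 0)

-- Never blocks: when the queue is full the element is silently discarded.
writeSimpleBoundedQueue :: SimpleBoundedQueue a -> a -> IO ()
writeSimpleBoundedQueue q x = atomically $ do
  (xs, n) <- readTVar (sbqContents q)
  writeTVar (sbqContents q) $
    if n >= sbqCapacity q then (xs, n) else (x : xs, n + 1)

-- Atomically take everything currently queued, oldest first.
flushSimpleBoundedQueue :: SimpleBoundedQueue a -> IO [a]
flushSimpleBoundedQueue q = atomically $ do
  (xs, _) <- readTVar (sbqContents q)
  writeTVar (sbqContents q) ([], 0)
  pure (reverse xs)
```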
src/Monitor/Tracing/Zipkin.hs
Outdated
publish zpk `catch` handler
pure zpk

refillQueue :: (MonadIO m, MonadBaseControl IO m) => Tracer -> [ZipkinSpan] -> m ()
For comparison, in log-elasticsearch we simply flush the queue, take all the messages and try to send them to ES. If this fails with any exception, we just retry in place with the messages we got; there's no need to put these items back into the queue unless I'm missing something.
You also then won't need the PublishFailed exception IIUC.
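Something along these lines (just a sketch with made-up helper names, not log-elasticsearch's actual code):

```haskell
import Control.Concurrent (threadDelay)
import Control.Exception (SomeException, try)
import Control.Monad (unless)

-- In-place retry: drain the queue once, then keep retrying the same batch
-- until the backend accepts it, instead of re-enqueueing failed samples.
publishLoop
  :: IO [sample]          -- hypothetical "flush the queue" action
  -> ([sample] -> IO ())  -- hypothetical "send to the backend" action
  -> IO ()
publishLoop flushQueue publishSamples = do
  samples <- flushQueue
  unless (null samples) (retrying samples)
  where
    retrying samples = do
      result <- try (publishSamples samples) :: IO (Either SomeException ())
      case result of
        Right () -> pure ()
        Left _   -> threadDelay 1000000 >> retrying samples  -- wait 1s, retry
```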
I'll see how this behaves with an in-place retry.
src/Control/Monad/Trace.hs
Outdated
-- | Creates a new 'Tracer'.
newTracer :: MonadIO m => m Tracer
newTracer = liftIO $ Tracer <$> newTChanIO <*> newTVarIO 0
newTracer = liftIO $ Tracer <$> newTBQueueIO 10000 <*> newTVarIO 0
10000 sounds a bit small, log uses a default of 1000000.
This graph suggests that when the publisher fails, traces are dropped for some reason 🤔
That is my understanding too, but it is not the behaviour I observe in Zipkin; the messages are properly retrieved when the server is restarted. This is why I am so puzzled by the profile; it makes no sense to me. I'll retry it once I've switched to the SBQueue you recommended.
Force-pushed from af7dd7f to 58abc4e
src/Control/Monad/Trace.hs
Outdated
import Data.Maybe (fromMaybe)
import Data.Time.Clock (NominalDiffTime)
import Data.Time.Clock.POSIX (POSIXTime, getPOSIXTime)
import GHC.Conc (retry)
While we're at it, can you get rid of the stm-lifted dependency? It's abandoned and not really needed here; AFAIK the only thing it gives us is a lifted atomically.
Then you can use import Control.Concurrent.STM and it imports retry.
Nevermind, let's do it in a different PR.
You can use Control.Monad.STM though instead of GHC.Conc.
I can, but (a) it's already a re-export of GHC.Conc and (b) it requires adding the stm dependency (which is admittedly only hidden for now). I will gladly do a PR dropping stm-lifted as a follow-up to this one.
Ok, let's leave it then. I see you already did it ;)
Yes, I was explaining why I had used GHC.Conc, not refusing the requested change! If the state of the PR seems satisfactory to you, I think we can merge, and I'll do the PR dropping stm-lifted right after.
Sounds good 👍
Force-pushed from 58abc4e to f38e4e6
arybczak left a comment
Looks good, thanks 👍
src/Monitor/Tracing/Zipkin.hs
Outdated
@@ -1,4 +1,5 @@
{-# LANGUAGE CPP #-}
{-# LANGUAGE ExistentialQuantification #-}
This is no longer necessary, right?
Indeed, removed.
Force-pushed from f38e4e6 to a756751
This PR mostly builds upon the open draft PR on the original tracing repo (mtth#9) to ensure that the library gracefully survives a failure to connect to the backend.
The main change is to use a (reasonably sized) bounded queue to store the samples to be sent to the backend, plus logic to re-enqueue messages that could not be sent; a rough illustrative sketch follows.
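Roughly, the shape of the publishing loop (a simplified sketch, not the PR's actual code; in this sketch a failed batch is just logged and dropped, whereas the PR re-enqueues or retries it):

```haskell
import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.STM (TBQueue, atomically, flushTBQueue, newTBQueueIO)
import Control.Exception (SomeException, catch)
import Control.Monad (forever, unless, void)

-- Simplified shape of the background publisher: a bounded queue plus a
-- forked loop that drains it every period and sends the batch, catching any
-- backend exception so the loop (and the process) keeps running.
startPublisher :: ([sample] -> IO ()) -> IO (TBQueue sample)
startPublisher publishBatch = do
  queue <- newTBQueueIO 10000
  void . forkIO . forever $ do
    threadDelay 1000000  -- publish period: 1 second here
    batch <- atomically (flushTBQueue queue)
    unless (null batch) $
      publishBatch batch `catch` \e ->
        putStrLn ("publish failed: " ++ show (e :: SomeException))
  pure queue
```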
As a result:
I tested with the following dummy code:
I did some profiling on this to ensure this didn't introduce any leak.
With no connection loss, we seem to have quasi-constant space:

With a connection loss, oddly enough, there is a clear variation:

I understand the spike (the queue gets filled), but I don't get why it goes way lower afterwards; at least it doesn't leak, I guess, but I'm still a bit puzzled by the result.
Edit after changes: now that I've switched to the SBQueue, the results are much closer to what I expected (though PostScript did some odd things in the rendering; apologies for the lack of readability).
With no loss, we have - again - constant space:
When losing the backend and then plugging it back in, we observe a spike associated with the queue filling up (a very, very long name in the key shrunk the graph, but the spike followed by a return to normal is clearly observable):