[CORE-6788] Ensure backend failures don't terminate the publishing process #7
Conversation
src/Monitor/Tracing/Zipkin.hs
Outdated
    else flush
pure $ Zipkin mgr req tracer mbEpt
zpk = Zipkin mgr req tracer mbEpt
for_ (microSeconds <$> mfilter (> 0) mbPrd) $ \delay -> fork $ forever $ do
This part is the one I'm the least satisfied with; I took it "as is" from the original PR, but I think it is flawed.
If the mbPrd parameter is Nothing, the behaviour is messed up: the flush is only performed at the termination of the process (i.e., in most cases, when quitting the program). I suspect (without being 100% sure) that this is because of the intricacies of mfilter. Though we typically set this parameter, it would be better not to leave such an aberrant behaviour possible.
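For reference, a tiny standalone illustration of how mfilter and for_ combine here (this is not the library code, just a demonstration of the mechanics):

```haskell
import Control.Monad (mfilter)
import Data.Foldable (for_)

-- mfilter keeps the value only when the predicate holds, and for_ over
-- Nothing runs its action zero times, so with mbPrd = Nothing no flush
-- thread is ever forked.
main :: IO ()
main = do
  print (mfilter (> 0) (Just 5  :: Maybe Int))   -- Just 5
  print (mfilter (> 0) (Just 0  :: Maybe Int))   -- Nothing
  print (mfilter (> 0) (Nothing :: Maybe Int))   -- Nothing
  for_ (mfilter (> 0) (Nothing :: Maybe Int)) $ \_ ->
    putStrLn "this would be the forked flush loop"  -- never executed
```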
Yeah, that looks really bad. What is mbPrd for? You should not be allowed to not spawn a publishing thread, let's fix this.
The documentation for the field reads:
, settingsPublishPeriod :: !(Maybe NominalDiffTime)
-- ^ If set to a positive value, traces will be flushed in the background every such period.
Which, I guess, can be interpreted as "if set to Nothing, there is no regular flush", but that makes very little sense (except perhaps for some very specific cases like daemons that get restarted, or things like that). I suggest replacing the Nothing case with a sensible default (every second, for example).
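Something along these lines, as a sketch (the helper name and the 1-second default are just placeholders, not existing code):

```haskell
import Data.Maybe (fromMaybe)
import Data.Time.Clock (NominalDiffTime)

-- Hypothetical helper: always yield a publish period, defaulting to one
-- second, so a background flush thread is guaranteed to be spawned.
effectivePublishPeriod :: Maybe NominalDiffTime -> NominalDiffTime
effectivePublishPeriod = fromMaybe 1

main :: IO ()
main = print (effectivePublishPeriod Nothing)  -- 1s
```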
Yeah, let's just get rid of the Maybe there.
Shall we also remove the ignore flag, since it doesn't get used anymore?
src/Monitor/Tracing/Zipkin.hs
Outdated
let unspan (ZipkinSpan _ sample) = sample
    queue = spanSamples tracer
    fillOne item = void . atomically $ do
      cannotPush <- isFullTBQueue queue
This check is very important, lest an uncaught exception be raised. Of course, as a result, if the queue is full, some traces will be lost (which is why I set a fairly high limit on the queue).
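For context, this is essentially a drop-when-full push, sketched standalone (the queue size and names here are illustrative):

```haskell
import Control.Concurrent.STM
  (TBQueue, atomically, isFullTBQueue, newTBQueueIO, writeTBQueue)
import Control.Monad (unless)

-- Push an item, silently dropping it when the queue is full instead of
-- blocking or raising; this mirrors the cannotPush check above.
pushOrDrop :: TBQueue a -> a -> IO ()
pushOrDrop queue item = atomically $ do
  full <- isFullTBQueue queue
  unless full (writeTBQueue queue item)

main :: IO ()
main = do
  q <- newTBQueueIO 10000 :: IO (TBQueue Int)
  mapM_ (pushOrDrop q) [1 .. 20000]  -- the second half is dropped, no error
```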
src/Control/Monad/Trace.hs
Outdated
tags <- readTVar tagsTV
logs <- sortOn (\(t, k, _) -> (t, k)) <$> readTVar logsTV
writeTChan (tracerChannel tracer) (Sample spn tags logs start (end - start))
writeTBQueue (tracerQueue tracer) (Sample spn tags logs start (end - start))
writeTBQueue blocks when the queue is full; it should instead do nothing (see SBQueue, in particular writeSBQueue).
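Roughly the idea (just a sketch, not SBQueue's actual API; a capacity-bounded, TVar-backed queue whose write never blocks):

```haskell
import Control.Concurrent.STM (TVar, atomically, newTVarIO, readTVar, writeTVar)

-- Writes drop the element when the capacity is reached, and a flush drains
-- everything at once.
data SimpleBoundedQueue a = SimpleBoundedQueue
  { sbqContents :: TVar ([a], Int)  -- reversed contents and current size
  , sbqCapacity :: Int
  }

newSimpleBoundedQueue :: Int -> IO (SimpleBoundedQueue a)
newSimpleBoundedQueue cap = (`SimpleBoundedQueue` cap) <$> newTVarIO ([], 0)

-- Never blocks: when the queue is full the element is silently discarded.
writeSimpleBoundedQueue :: SimpleBoundedQueue a -> a -> IO ()
writeSimpleBoundedQueue q x = atomically $ do
  (xs, n) <- readTVar (sbqContents q)
  writeTVar (sbqContents q) $
    if n >= sbqCapacity q then (xs, n) else (x : xs, n + 1)

-- Atomically take everything currently queued, oldest first.
flushSimpleBoundedQueue :: SimpleBoundedQueue a -> IO [a]
flushSimpleBoundedQueue q = atomically $ do
  (xs, _) <- readTVar (sbqContents q)
  writeTVar (sbqContents q) ([], 0)
  pure (reverse xs)
```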
src/Monitor/Tracing/Zipkin.hs
Outdated
publish zpk `catch` handler
pure zpk

refillQueue :: (MonadIO m, MonadBaseControl IO m) => Tracer -> [ZipkinSpan] -> m ()
For comparison, in log-elasticsearch we simply flush the queue, take all the messages and try to send them to ES. If this fails with any exception, we just retry in place with the messages we got; there's no need to put these items back into the queue unless I'm missing something.
You also then won't need the PublishFailed exception IIUC.
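Something along these lines (just a sketch with made-up helper names, not log-elasticsearch's actual code):

```haskell
import Control.Concurrent (threadDelay)
import Control.Exception (SomeException, try)
import Control.Monad (unless)

-- In-place retry: drain the queue once, then keep retrying the same batch
-- until the backend accepts it, instead of re-enqueueing failed samples.
publishLoop
  :: IO [sample]          -- hypothetical "flush the queue" action
  -> ([sample] -> IO ())  -- hypothetical "send to the backend" action
  -> IO ()
publishLoop flushQueue publishSamples = do
  samples <- flushQueue
  unless (null samples) (retrying samples)
  where
    retrying samples = do
      result <- try (publishSamples samples) :: IO (Either SomeException ())
      case result of
        Right () -> pure ()
        Left _   -> threadDelay 1000000 >> retrying samples  -- wait 1s, retry
```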
I'll see how this behaves with an in-place retry.
src/Control/Monad/Trace.hs
Outdated
-- | Creates a new 'Tracer'.
newTracer :: MonadIO m => m Tracer
newTracer = liftIO $ Tracer <$> newTChanIO <*> newTVarIO 0
newTracer = liftIO $ Tracer <$> newTBQueueIO 10000 <*> newTVarIO 0
10000 sounds a bit small, log uses a default of 1000000.
This graph suggests that when the publisher fails, traces are dropped for some reason 🤔
That is my understanding too, but it is not the behaviour I observe in Zipkin; the messages are properly retrieved when the server is restarted. This is why I am so puzzled by the profile; it makes no sense to me. I'll retry it once I've switched to the SBQueue you recommended.
Force-pushed from af7dd7f to 58abc4e
src/Control/Monad/Trace.hs
Outdated
import Data.Maybe (fromMaybe)
import Data.Time.Clock (NominalDiffTime)
import Data.Time.Clock.POSIX (POSIXTime, getPOSIXTime)
import GHC.Conc (retry)
While we're at it, can you get rid of the stm-lifted dependency? It's abandoned and not really needed here; AFAIK the only thing it gives us is a lifted atomically.
Then you can use import Control.Concurrent.STM and it imports retry.
Nevermind, let's do it in a different PR.
You can use Control.Monad.STM though instead of GHC.Conc.
I can, but (a) it's already a re-export of GHC.Conc and (b) it requires adding the stm dependency (which is admittedly only hidden for now). I will gladly do a PR dropping stm-lifted as a follow-up to this one.
Ok, let's leave it then. I see you already did it ;)
Yes, I was explaining why I had used GHC.Conc, not refusing the requested change! If the state of the PR seems satisfactory to you, I think we can merge, and I'll do the PR dropping stm-lifted right after.
Sounds good 👍
Force-pushed from 58abc4e to f38e4e6
arybczak left a comment
Looks good, thanks 👍
src/Monitor/Tracing/Zipkin.hs
Outdated
@@ -1,4 +1,5 @@
{-# LANGUAGE CPP #-}
{-# LANGUAGE ExistentialQuantification #-}
This is no longer necessary, right?
Indeed, removed.
Force-pushed from f38e4e6 to a756751
This PR mostly builds upon the open draft PR on the original tracing repo (mtth#9) to ensure that the library gracefully survives a failure to connect to the backend.
The main change is to use a (reasonably sized) bounded queue to store the samples to be sent to the backend, plus logic to re-enqueue messages that could not be sent; a rough illustrative sketch follows.
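Roughly, the shape of the publishing loop (a simplified sketch, not the PR's actual code; in this sketch a failed batch is just logged and dropped, whereas the PR re-enqueues or retries it):

```haskell
import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.STM (TBQueue, atomically, flushTBQueue, newTBQueueIO)
import Control.Exception (SomeException, catch)
import Control.Monad (forever, unless, void)

-- Simplified shape of the background publisher: a bounded queue plus a
-- forked loop that drains it every period and sends the batch, catching any
-- backend exception so the loop (and the process) keeps running.
startPublisher :: ([sample] -> IO ()) -> IO (TBQueue sample)
startPublisher publishBatch = do
  queue <- newTBQueueIO 10000
  void . forkIO . forever $ do
    threadDelay 1000000  -- publish period: 1 second here
    batch <- atomically (flushTBQueue queue)
    unless (null batch) $
      publishBatch batch `catch` \e ->
        putStrLn ("publish failed: " ++ show (e :: SomeException))
  pure queue
```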
As a result:
I tested with the following dummy code:
I did some profiling on this to ensure this didn't introduce any leak.
With no connection loss, we seem to have quasi-constant space:

With a connection loss, oddly enough, there is a clear variation:

I understand the spike (the queue gets filled), but I don't get why it goes way lower afterwards; at least it doesn't leak, I guess, but I'm still a bit puzzled by the result.
Edit after changes: now that I've switched to the SBQueue, the results are much closer to what I expected (though PostScript did some odd things in the rendering; apologies for the lack of readability).
With no loss, we have - again - constant space:
When losing the backend and then plugging it back in, we observe a spike associated with the queue filling up (a very, very long name in the key shrunk the graph, but the spike followed by a return to normal is clearly observable):