Add Test Suite for Long Running Tests #314

samliok · 2025-12-12T20:12:16Z

This PR adds functionality for a different style of testing. It lets the tester write tests that control the behavior of the network rather than the behavior of individual nodes. The framework can test Simplex in ways that would otherwise have been too tedious and cumbersome to test. For example, the simple test below spins up a network of 10 nodes, waits for them to reach a specific round, disconnects a few of them, and reconnects them later. At the end of the test, we assert that all nodes are functioning properly.

```go
func TestLongRunningReplication(t *testing.T) {
	net := testutil.NewDefaultLongRunningNetwork(t, 10)
	net.StartInstances()

	net.WaitForAllNodesToEnterRound(40)
	net.NoMoreBlocks()
	net.DisconnectNodes(2)
	net.ContinueBlocks()
	net.WaitForCertainNodesToEnterRound(70, 1, 3, 4, 5, 6)
	net.DisconnectNodes(4)
	net.WaitForCertainNodesToEnterRound(90, 1, 3, 5, 6, 7, 8, 9)
	net.ConnectNodes(2, 4)
	net.WaitForAllNodesToEnterRound(150)
	net.StopAndAssert(false)
}
```

Previously, this would have required a tremendous amount of boilerplate code, as well as specific know-how to properly orchestrate replication, block building, etc.

```go
type LongRunningInMemoryNetwork struct {
	*InMemNetwork
	stopped atomic.Bool
}
```

This new struct wraps InMemNetwork, so all the existing functionality of InMemNetwork can still be used in these tests. In addition, it adds the following helper functions:

```go
func (n *LongRunningInMemoryNetwork) UpdateTime(frequency time.Duration, amount time.Duration)
func (n *LongRunningInMemoryNetwork) CrashNodes(nodeIndexes ...uint64)
func (n *LongRunningInMemoryNetwork) RestartNodes(nodeIndexes ...uint64)
func (n *LongRunningInMemoryNetwork) NoMoreBlocks()
func (n *LongRunningInMemoryNetwork) ContinueBlocks()
func (n *LongRunningInMemoryNetwork) WaitForNodesToEnterRound(round uint64, nodeIndexes ...uint64)
func (n *LongRunningInMemoryNetwork) StopAndAssert(tailingMessages bool)
func (n *LongRunningInMemoryNetwork) ConnectNodes(nodeIndexes ...uint64)
func (n *LongRunningInMemoryNetwork) DisconnectNodes(nodeIndexes ...uint64)
```

yacovm

Made a quick pass; I think it's pretty useful overall. We can build something on top of it that generates random test cases and just runs them.

Will make another pass later.

long_running_test.go

testutil/long_running_network.go

yacovm · 2025-12-28T21:15:40Z

testutil/long_running_network.go

+	}
+}
+
+func (n *LongRunningInMemoryNetwork) waitUntilAllRoundsEqual() {


Can't this function return even if no blocks were committed, e.g., at round 0?

Shouldn't we perhaps pass in some kind of predicate on the round / sequence?

Hmm, I'm using it more as a helper function for StopAndAssert, but I can see a predicate being helpful if we decide to expose this function.

yacovm · 2025-12-28T21:18:30Z

testutil/wal.go

+}
+
+// AssertHealthy checks that the WAL has at most one of each record type per round.
+func (tw *TestWAL) AssertHealthy(bd simplex.BlockDeserializer, qcd simplex.QCDeserializer) {


Can't we have both a notarization and an empty notarization for the same round in the WAL, e.g., by replicating one of them?

Yes, we can have one of each for the same round, but we should never have two of the same type for the same round.
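The rule agreed on in this thread (at most one record of each type per round, with different types allowed to coexist) boils down to a duplicate check keyed on `(round, type)`. Below is a minimal sketch under the assumption that WAL entries have already been deserialized; `recordType`, `walRecord` and `checkWALHealthy` are hypothetical, whereas the PR's `AssertHealthy` works from the raw WAL using a `simplex.BlockDeserializer` and `simplex.QCDeserializer`.

```go
package main

import "fmt"

// recordType is a hypothetical enumeration of WAL record kinds.
type recordType int

const (
	blockRecord recordType = iota
	notarizationRecord
	emptyNotarizationRecord
	finalizationRecord
)

// walRecord is a hypothetical, already-deserialized WAL entry.
type walRecord struct {
	Round uint64
	Type  recordType
}

// checkWALHealthy returns an error if any (round, type) pair appears more than
// once. A notarization and an empty notarization for the same round are
// accepted because they are different record types.
func checkWALHealthy(records []walRecord) error {
	type key struct {
		round uint64
		typ   recordType
	}
	seen := make(map[key]bool, len(records))
	for _, r := range records {
		k := key{round: r.Round, typ: r.Type}
		if seen[k] {
			return fmt.Errorf("duplicate record of type %d for round %d", r.Type, r.Round)
		}
		seen[k] = true
	}
	return nil
}
```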

long_running_test.go

yacovm · 2025-12-31T20:37:21Z

testutil/storage.go

+		}
+
+		// compare finalizations
+		if item.Finalization.Finalization.Digest != otherItem.Finalization.Finalization.Digest {


Can't we just compare item.Finalization.Bytes() with otherItem.Finalization.Bytes()?
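To make the trade-off in this suggestion concrete, here is a minimal sketch contrasting a digest comparison with a whole-serialization comparison via `bytes.Equal`. The `finalization` type, its `QC` field and its `Bytes` method are hypothetical stand-ins, not Simplex's types.

```go
package main

import "bytes"

// finalization is a hypothetical stand-in: a block digest plus a serialized
// quorum certificate (signer set and aggregated signature).
type finalization struct {
	Digest [32]byte
	QC     []byte
}

// Bytes serializes the digest followed by the QC.
func (f finalization) Bytes() []byte {
	out := make([]byte, 0, len(f.Digest)+len(f.QC))
	out = append(out, f.Digest[:]...)
	return append(out, f.QC...)
}

// sameDigest only checks that both finalizations certify the same block.
func sameDigest(a, b finalization) bool { return a.Digest == b.Digest }

// sameBytes checks the entire serialized finalization, QC included.
func sameBytes(a, b finalization) bool { return bytes.Equal(a.Bytes(), b.Bytes()) }
```

Comparing bytes is shorter and checks everything, but it is also stricter: if two nodes may hold finalizations for the same block that were certified by different signer subsets, `sameBytes` would report a mismatch that `sameDigest` tolerates. Which behavior the storage check wants is a design choice.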

yacovm · 2025-12-31T20:42:07Z

long_running_test.go

+func TestLongRunningReplication(t *testing.T) {
+	net := testutil.NewDefaultLongRunningNetwork(t, 10)
+	for _, instance := range net.Instances {
+		instance.SilenceExceptKeywords("Received replication response", "Resending replication requests for missing rounds/sequences")


I don't understand why we're doing this. Nothing I can see in the test requires intercepting the log, so why do we care whether it's printed?

Can you explain?

yacovm · 2025-12-31T20:43:09Z

long_running_test.go

+	net := testutil.NewDefaultLongRunningNetwork(t, 10)
+	for i, instance := range net.Instances {
+		if i == 3 {
+			instance.SilenceExceptKeywords("WAL")


I don't understand why we're silencing the logs. Is that to make the test less flaky or something?

And why aren't we silencing the WAL pattern here?
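Judging only by its name, `SilenceExceptKeywords` appears to drop every log message except those containing one of the given keywords. Under that assumption, `SilenceExceptKeywords("WAL")` keeps WAL-related lines visible instead of silencing them. The sketch below models that reading; `keywordFilter` and `shouldLog` are hypothetical and are not the PR's `TestLogger`.

```go
package main

import "strings"

// keywordFilter is a hypothetical model of a logger that, once filtering is
// enabled, prints only messages containing at least one keyword.
type keywordFilter struct {
	keywords []string
	enabled  bool
}

// SilenceExceptKeywords enables filtering with the given keywords.
func (f *keywordFilter) SilenceExceptKeywords(keywords ...string) {
	f.keywords = keywords
	f.enabled = true
}

// shouldLog reports whether msg would be printed under the current filter.
func (f *keywordFilter) shouldLog(msg string) bool {
	if !f.enabled {
		return true
	}
	for _, k := range f.keywords {
		if strings.Contains(msg, k) {
			return true
		}
	}
	return false
}
```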

yacovm · 2025-12-31T20:44:12Z

testutil/node.go

+	l                *TestLogger
+	t                *testing.T
+	BB               ControlledBlockBuilder
+	messageTypesSent map[string]uint64


We're only incrementing the message type counters but never reading them. Why do we need this?

yacovm · 2025-12-31T20:44:46Z

testutil/node.go

 			msg  *simplex.Message
 			from simplex.NodeID
-		}, 1000)}
+		}, 100000)}


Do the tests not work with the previous buffer size?
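Some background on why the buffer size can matter: a plain `ch <- m` on a full buffered channel blocks the sender until the receiver drains a slot. Raising the capacity from 1000 to 100000 delays that point but does not remove it if, for example, a node's inbound queue is not drained for a long stretch of a long-running test. A non-blocking send, sketched below, would surface the condition instead of stalling, which could help confirm whether the old size was actually being hit. `inboundMsg` and `trySend` are hypothetical; the PR's channel carries a `*simplex.Message` and a `simplex.NodeID`.

```go
package main

// inboundMsg is a hypothetical stand-in for the struct carried by the node's
// inbound channel (a message plus its sender).
type inboundMsg struct {
	payload string
	from    int
}

// trySend performs a non-blocking send. It returns false instead of blocking
// when the channel's buffer is full.
func trySend(ch chan<- inboundMsg, m inboundMsg) bool {
	select {
	case ch <- m:
		return true
	default:
		return false
	}
}
```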

yacovm · 2025-12-31T20:48:26Z

testutil/long_running_network.go

+	}
+
+	amount := simplex.DefaultEmptyVoteRebroadcastTimeout / 5
+	go n.UpdateTime(100*time.Millisecond, amount)


UpdateTime iterates over instances without taking a lock, but we may re-create an instance via RestartNodes. Isn't it unsafe from a concurrency aspect?

yacovm · 2025-12-31T20:49:26Z

testutil/long_running_network.go

+}
+
+func (n *LongRunningInMemoryNetwork) RestartNodes(nodeIndexes ...uint64) {
+	for _, idx := range nodeIndexes {


don't we need to stop the previous instance? I can't find where we're doing it.
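The ordering this comment asks about (stop the old instance, then create and start its replacement) can be sketched as follows, so that the old instance does not keep running alongside the new one. `node`, `restartNodes`, `newNode` and `fakeNode` are hypothetical names for illustration; only the `...uint64` index style mirrors the PR's `RestartNodes` signature.

```go
package main

import "fmt"

// node is a hypothetical lifecycle interface for an instance.
type node interface {
	Start() error
	Stop()
}

// restartNodes stops each selected instance, then builds and starts a fresh
// replacement via newNode before storing it in instances.
func restartNodes(instances []node, newNode func(idx uint64) node, indexes ...uint64) error {
	for _, idx := range indexes {
		if idx >= uint64(len(instances)) {
			return fmt.Errorf("node index %d out of range", idx)
		}
		// Stop the old instance first so it does not keep running
		// alongside its replacement.
		instances[idx].Stop()
		fresh := newNode(idx)
		if err := fresh.Start(); err != nil {
			return fmt.Errorf("restarting node %d: %w", idx, err)
		}
		instances[idx] = fresh
	}
	return nil
}

// fakeNode records lifecycle calls for illustration.
type fakeNode struct {
	started bool
	stopped bool
}

func (f *fakeNode) Start() error { f.started = true; return nil }
func (f *fakeNode) Stop()        { f.stopped = true }
```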

yacovm · 2025-12-31T20:55:21Z

testutil/long_running_network.go

+		instance.Stop()
+	}
+
+	// // print summary of messages sent


Do we need these commented-out lines?

samliok added 5 commits December 12, 2025 14:40

Add Test Suite for Long Running Tests

a64f705

change to one function

3986c99

nil check

fcde8df

add storage check and fix flake

db7c4d9

fix race condition

b4e9e04

yacovm reviewed Dec 28, 2025


samliok added 2 commits December 29, 2025 17:18

license header

ac0dbb1

pre-merge

b5ec286

samliok force-pushed the long-running-test branch from ca4efa4 to b5ec286 December 30, 2025 20:14

samliok added 3 commits December 30, 2025 15:40

merge main into long running test

444aa40

dont lock

c125e5c

change from cond to channel and listen for context cancel

f964eaf

yacovm reviewed Dec 31, 2025



3 participants
