Skip to content

Conversation

@samliok
Copy link
Collaborator

@samliok samliok commented Dec 12, 2025

This PR adds a functionality for a different style of testing. It allows the tester to spin up tests where they control the behavior of the network rather than the behavior of individual nodes. This framework can test Simplex in ways that would have otherwise been to tedious and cumbersome to test. For example this new simple test allows us to spin up a network of 10 nodes, wait for them to enter a specific round, disconnect a few of them, and then reconnect at a later time. At the end of the test, we assert that all nodes are properly functioning.

func TestLongRunningReplication(t *testing.T) {
	net := testutil.NewDefaultLongRunningNetwork(t, 10)
	net.StartInstances()

	net.WaitForAllNodesToEnterRound(40)
	net.NoMoreBlocks()
	net.DisconnectNodes(2)
	net.ContinueBlocks()
	net.WaitForCertainNodesToEnterRound(70, 1, 3, 4, 5, 6)
	net.DisconnectNodes(4)
	net.WaitForCertainNodesToEnterRound(90, 1, 3, 5, 6, 7, 8, 9)
	net.ConnectNodes(2, 4)
	net.WaitForAllNodesToEnterRound(150)
	net.StopAndAssert(false)
}

Before this would have required a tremendous amount of boilerplate code, and specific wherewithal to properly orchestrate replication, block building, etc..

type LongRunningInMemoryNetwork struct {
	*InMemNetwork
	stopped atomic.Bool
}

This new struct wraps InMemNetwork, so all the previous functionality of InMemNetwork can still be used in these tests. However, we add an additional set of helper functions

func (n *LongRunningInMemoryNetwork) UpdateTime(frequency time.Duration, amount time.Duration)
func (n *LongRunningInMemoryNetwork) CrashNodes(nodeIndexes ...uint64)
func (n *LongRunningInMemoryNetwork) RestartNodes(nodeIndexes ...uint64)
func (n *LongRunningInMemoryNetwork) NoMoreBlocks()
func (n *LongRunningInMemoryNetwork) ContinueBlocks()
func (n *LongRunningInMemoryNetwork) WaitForNodesToEnterRound(round uint64, nodeIndexes ...uint64)
func (n *LongRunningInMemoryNetwork) StopAndAssert(tailingMessages bool) 
func (n *LongRunningInMemoryNetwork) ConnectNodes(nodeIndexes ...uint64) 
func (n *LongRunningInMemoryNetwork) DisconnectNodes(nodeIndexes ...uint64)

Copy link
Collaborator

@yacovm yacovm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

made a quick pass, I think it's pretty useful overall. we can build something on top that will generate random test cases and just run this.

Will make another pass later.

}
}

func (n *LongRunningInMemoryNetwork) waitUntilAllRoundsEqual() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't this function return even if no blocks were committed? e.g at round 0?

Shouldn't we perhaps pass in some kind of predicate on the round / sequence?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, i'm using it as more of a helper function for StopAndAssert but I can see a predicate being helpful if we decide to expose this function.

}

// AssertHealthy checks that the WAL has at most one of each record type per round.
func (tw *TestWAL) AssertHealthy(bd simplex.BlockDeserializer, qcd simplex.QCDeserializer) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we have a notarization and an empty notarization in the WAL? via replicating one of them?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea we can have one of each for the same round, but we should never have two of the same one for the same round.

}

// compare finalizations
if item.Finalization.Finalization.Digest != otherItem.Finalization.Finalization.Digest {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't we just use: item.Finalization.Bytes() to compare the two?

func TestLongRunningReplication(t *testing.T) {
net := testutil.NewDefaultLongRunningNetwork(t, 10)
for _, instance := range net.Instances {
instance.SilenceExceptKeywords("Received replication response", "Resending replication requests for missing rounds/sequences")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why we're doing this. There is nothing in the test that I see that requires intercepting the log, so why do we care that it's printed?

Can you explain?

net := testutil.NewDefaultLongRunningNetwork(t, 10)
for i, instance := range net.Instances {
if i == 3 {
instance.SilenceExceptKeywords("WAL")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why we're silencing the logs. Is that to make the test less flaky or something?

And why aren't we silencing the WAL pattern here?

l *TestLogger
t *testing.T
BB ControlledBlockBuilder
messageTypesSent map[string]uint64
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we're only incrementing the message types but never reading them... why do we need this?

msg *simplex.Message
from simplex.NodeID
}, 1000)}
}, 100000)}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do the tests not work with the previous buffer size?

}

amount := simplex.DefaultEmptyVoteRebroadcastTimeout / 5
go n.UpdateTime(100*time.Millisecond, amount)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UpdateTime iterates over instances without taking a lock, but we may re-create an instance via RestartNodes. Isn't it unsafe from a concurrency aspect?

}

func (n *LongRunningInMemoryNetwork) RestartNodes(nodeIndexes ...uint64) {
for _, idx := range nodeIndexes {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't we need to stop the previous instance? I can't find where we're doing it.

instance.Stop()
}

// // print summary of messages sent
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need these commented out code lines?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants