feat: add --subnets flag to deploy multiple nodes per client #136
ch4r10t33r wants to merge 36 commits into main
Conversation
Add support for configuring nodes as aggregators through validator-config.yaml. This allows selective designation of nodes to perform aggregation duties by setting `isAggregator: true` in the validator configuration.

Changes:
- Add isAggregator field (default: false) to all validators in both local and ansible configs
- Update parse-vc.sh to extract and export the isAggregator flag
- Modify all client command scripts to pass the --is-aggregator flag when enabled
- Add isAggregator status to node information output
Resolved conflicts in client-cmds scripts by keeping both:
- Aggregator flag support
- Checkpoint sync URL support

Updated Docker images:
- zeam: 0xpartha/zeam:devnet3
- lantern: piertwo/lantern:v0.0.3-test
- ethlambda: ghcr.io/lambdaclass/ethlambda:devnet3

Added httpPort support for lantern nodes.
Adds --subnets N (1–5) to deploy N nodes of each client on their
associated servers, each on a distinct attestation subnet.
New files:
- generate-subnet-config.py: expands validator-config.yaml into
validator-config-subnets-N.yaml with unique node names, incremented
ports (quic/metrics/api), fresh P2P private keys, and explicit subnet
membership per entry. Also sets config.attestation_committee_count = N
so each client correctly partitions validators across N committees.
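The expansion that generate-subnet-config.py performs can be sketched as below. This is a minimal illustration, not the script's actual code: the field names (`name`, `quicPort`, `metricsPort`, `apiPort`, `privkey`, `subnet`) are assumed for the example, though `secrets.token_hex(32)` matches the key-generation approach noted in the review.

```python
import secrets

MAX_SUBNETS = 5  # keep in sync with the cap enforced in spin-node.sh

def expand_entry(entry, n):
    """Expand one validator entry into n per-subnet copies (illustrative)."""
    copies = []
    for i in range(n):
        c = dict(entry)
        c["name"] = f"{entry['name']}_{i}"            # unique node name per subnet
        c["quicPort"] = entry["quicPort"] + i         # incremented ports
        c["metricsPort"] = entry["metricsPort"] + i
        c["apiPort"] = entry["apiPort"] + i
        c["privkey"] = secrets.token_hex(32)          # fresh P2P private key
        c["subnet"] = i                               # explicit subnet membership
        copies.append(c)
    return copies

base = {"name": "zeam", "quicPort": 9000, "metricsPort": 8008, "apiPort": 5052}
nodes = expand_entry(base, 3)
```

With N=1 this produces a single `_0` entry on subnet 0, matching the no-op baseline described below.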
Changes:
- parse-env.sh: add --subnets N and --dry-run flags
- spin-node.sh:
- expand validator-config before genesis setup when --subnets N given
- select one aggregator per subnet randomly; print prominent summary
- --dry-run: simulate full deployment without applying any changes
(Ansible runs with --check --diff, local execs are echoed only)
- run-ansible.sh: pass validator_config_basename extra var so playbooks
use the active (possibly expanded) config; add --check --diff in dry-run
- ansible/playbooks/deploy-nodes.yml: use validator_config_basename to
sync the correct config file to remote hosts
- ansible/playbooks/prepare.yml: open port ranges for all subnet nodes
on a host by matching entries via IP, not just hostname
- convert-validator-config.py: fall back to httpPort for Lantern nodes
when generating Leanpoint upstreams
- README.md: document --subnets and --dry-run; update --prepare firewall
table to reflect port ranges when --subnets N is active
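The convert-validator-config.py fallback for Lantern nodes amounts to preferring the usual API port and only then consulting httpPort. A minimal sketch, with assumed field names (`apiPort`, `httpPort`, `name`) rather than the script's real schema:

```python
def upstream_port(node):
    """Port to use for this node's Leanpoint upstream (illustrative fields)."""
    # Most clients expose an apiPort; Lantern entries may only define httpPort.
    port = node.get("apiPort") or node.get("httpPort")
    if port is None:
        raise ValueError(f"{node.get('name', '?')}: neither apiPort nor httpPort set")
    return port
```

The `or` chain means a node that defines both fields still uses apiPort, so the fallback only changes behavior for Lantern-style entries.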
Rules enforced by generate-subnet-config.py:
- No two nodes on the same server may share a subnet (template validated)
- Each subnet has exactly one node per client
- N=1 is a no-op expansion (single-subnet baseline)
- N capped at 5
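The rules above can be expressed as a small validation pass. This is a sketch under assumed entry fields (`ip`, `subnet`, `client`), not the script's `_validate_template` itself:

```python
from collections import Counter

def validate_expanded(nodes, n):
    """Check the subnet-expansion invariants (illustrative schema)."""
    if not 1 <= n <= 5:  # N capped at 5; N=1 is the single-subnet baseline
        raise ValueError("N must be between 1 and 5")
    # No two nodes on the same server may share a subnet.
    placements = [(node["ip"], node["subnet"]) for node in nodes]
    if len(placements) != len(set(placements)):
        raise ValueError("two nodes on one server share a subnet")
    # Each subnet has exactly one node per client type.
    counts = Counter((node["client"], node["subnet"]) for node in nodes)
    if any(c != 1 for c in counts.values()):
        raise ValueError("a subnet must have exactly one node per client")
```

Both checks are cheap set/counter lookups, so they can run on every expansion without measurable cost.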
Previously both deploy-nodes.yml and copy-genesis.yml synced the entire
hash-sig-keys/ directory to every remote host, meaning every server
received every validator's sk/pk pair.
Now each playbook:
1. Reads annotated_validators.yaml on the controller to look up the
privkey_file entries for the node being deployed (inventory_hostname).
2. Derives the pk filename by replacing _sk.ssz → _pk.ssz.
3. Copies only those specific files to the target host.
A server running zeam_0 (validator_0_sk.ssz / validator_0_pk.ssz) no
longer receives validator_1_sk.ssz, validator_2_sk.ssz, etc.
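The per-host selection logic can be sketched as follows. The `validators`/`host`/`privkey_file` layout is an assumed reading of annotated_validators.yaml for illustration; only the `_sk.ssz` → `_pk.ssz` derivation is taken directly from the description above:

```python
def node_hash_sig_files(annotated, inventory_hostname):
    """Select only the sk/pk files assigned to one host (illustrative schema)."""
    files = []
    for v in annotated.get("validators", []):
        if v.get("host") != inventory_hostname:
            continue
        sk = v["privkey_file"]                           # e.g. validator_0_sk.ssz
        files += [sk, sk.replace("_sk.ssz", "_pk.ssz")]  # derive pk filename
    return files
```

A host with no assignments gets an empty list, which corresponds to the `when: node_hash_sig_files | length > 0` guard the reviewer notes below.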
…ffix

The old suffix-based detection (ethlambda_1 → subnet 1) broke when a config contained multiple nodes for the same client without --subnets (e.g. ethlambda_0..4 for redundancy): it incorrectly created 5 subnets and forced ethlambda nodes to be the sole aggregators on subnets 1–4.

Subnet membership is now read from the explicit `subnet:` field that generate-subnet-config.py writes for each entry. Nodes without this field (all standard configs) default to subnet 0, so a single-subnet deployment always selects exactly one aggregator from all active nodes, regardless of numeric suffixes in their names.
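The default-to-zero behavior is the whole fix: grouping ignores name suffixes entirely. A minimal sketch (field names assumed for illustration):

```python
from collections import defaultdict

def group_by_subnet(entries):
    """Group nodes by their explicit subnet field; missing field -> subnet 0."""
    groups = defaultdict(list)
    for e in entries:
        groups[e.get("subnet", 0)].append(e["name"])
    return dict(groups)

# Five redundant ethlambda nodes without --subnets all land on subnet 0:
nodes = [{"name": f"ethlambda_{i}"} for i in range(5)]
```

Under the old suffix parsing these five nodes would have produced five one-node subnets; here they form a single subnet-0 group, from which one aggregator is chosen.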
…r flag is passed

Previously the script always reset all flags and randomly re-selected an aggregator, ignoring any manual isAggregator: true already set in the YAML. This caused ethlambda_0 (the user's choice) to be silently replaced by ethlambda_1 (a random pick).

Aggregator selection now follows a three-level priority:
1. --aggregator <node> CLI flag
2. Pre-existing isAggregator: true in the config (manual YAML edit)
3. Random selection (fallback when neither is set)

The preset node is validated against the active node list. If it no longer exists, a warning is printed and random selection takes over.
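The three-level priority can be sketched as a single selection function. This is an illustration, not the script's code; the node-dict shape is assumed:

```python
import random

def pick_aggregator(active, cli_choice=None):
    """Illustrative three-level priority: CLI flag > preset YAML > random."""
    names = [n["name"] for n in active]
    if cli_choice is not None:                 # 1. --aggregator <node> CLI flag
        if cli_choice in names:
            return cli_choice
        print(f"warning: {cli_choice} is not an active node; falling back")
    for n in active:                           # 2. pre-existing isAggregator: true
        if n.get("isAggregator"):
            return n["name"]
    return random.choice(names)                # 3. random fallback
```

Note the fall-through: an invalid CLI choice degrades to the preset, and a missing preset degrades to random, so the function always returns some active node.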
The hardcoded group list (zeam_nodes, ream_nodes, ...) meant newly added client types (e.g. gean_nodes) never had their ansible_user updated. As a result, --useRoot was silently ignored for those nodes, and Ansible would SSH as the current local user (partha) instead of root and fail.
zclawz
left a comment
Overall well-structured PR — the subnet expansion model is clean and the per-node hash-sig key copying is a meaningful improvement. A few observations:
1. Double validation in spin-node.sh
The outer guard [ "$subnets" -ge 1 ] 2>/dev/null silently suppresses non-integer errors, and the inner guard then re-validates the same range. Combining into a single block would be cleaner:
```shell
if [ -n "$subnets" ]; then
  if ! [[ "$subnets" =~ ^[0-9]+$ ]] || [ "$subnets" -lt 1 ] || [ "$subnets" -gt 5 ]; then
    echo "Error: --subnets requires an integer between 1 and 5"
    exit 1
  fi
  # ... expansion logic
fi
```

2. MAX_SUBNETS = 5 in two places
generate-subnet-config.py and spin-node.sh both independently enforce the 1–5 range. They match today, but a future change in one won't automatically update the other. A cross-reference comment would help.
3. Private keys in ansible-devnet/genesis/validator-config.yaml
The privkey fields added for gean_0 and nlean_0 are P2P identity keys committed in plaintext. Consistent with how other devnet entries are handled, so presumably intentional — just confirming these are devnet-only keys.
4. run-ansible.sh positional arg expansion (${12})
Adding dryRun as ${12} is safe — callers that don't pass it get an empty string (falsy). All spin-node.sh call sites pass it correctly.
5. Dynamic group discovery in run-ansible.sh
Replacing the hardcoded client-group list with `yq eval '.all.children | keys'` is a good improvement — new clients no longer require updating the list. One edge case: if yq is absent on the Ansible controller (localhost) and the `|| echo ""` fallback fires, SSH key injection is silently skipped for all hosts. Worth an explicit yq check at the top of the script, or at least a warning.
6. Per-node hash-sig key copying
Good improvement — only the sk/pk files assigned to each node are transferred. The when: node_hash_sig_files | length > 0 condition is correct. One question: if annotated_validators.yaml exists but a node has no assignments in it, the hash-sig directory is not created and no keys are copied — is that intentional (node needs no hash-sig keys) or should it emit a warning?
7. generate-subnet-config.py
The validation logic, port-increment scheme, secrets.token_hex(32) for P2P keys, and attestation_committee_count = N injection all look correct. The duplicate-IP / duplicate-client-type checks in _validate_template are solid defensive guards.
Overall looks good. Happy to approve once the double-validation in spin-node.sh is tidied up (or if you prefer to leave it with a comment, that is fine too).
Adding a new client
The new guide at `docs/adding-a-new-client.md` covers the 6 files every new client must provide, with full code examples for each:
Everything else (genesis generation, key management, inventory generation, subnet expansion, leanpoint upstreams, aggregator selection, observability) is fully generic and requires no changes.
Test plan