Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 5 additions & 1 deletion .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,10 @@ __pycache__/
# Distribution / packaging
.Python
build/
# Per-lab build inputs (manifests, scripts, bootstrap configs) under
# snapshots/<lab>/source/build/ are NOT Python build artifacts.
!snapshots/*/source/build/
!snapshots/*/source/build/**
develop-eggs/
dist/
downloads/
Expand Down Expand Up @@ -144,7 +148,7 @@ networks/

# Working directory for temporary scripts and analysis
working/
CLAUDE.md
CLAUDE.local.md
.claude/

# VM images (large, user-provided, not distributed)
Expand Down
152 changes: 152 additions & 0 deletions CLAUDE.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,152 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Development Commands

### Environment Setup

```bash
python -m venv venv
source venv/bin/activate
pip install -e .
pip install -r requirements-dev.txt
```

### Core Testing Commands

```bash
# Run all tests
pytest

# Run with coverage reporting
pytest --cov=lab_validation --cov-report=xml --cov-report=term-missing

# Run specific validator tests
pytest tests/test_arista_validator.py -v
pytest tests/test_nxos_validator.py -v

# Run specific lab integration test
pytest lab_tests/test_labs.py --labname=eos_bgp_aggregate -v

# Run single test method
pytest tests/test_arista_validator.py::TestAristaValidator::test_get_runtime_data -v
```

### Code Quality Commands

```bash
# Run pre-commit hooks (includes formatting, isort, prettier)
pre-commit run --all-files

# Type checking with mypy
mypy src/
```

## Architecture Overview

### Core Components

**Validation Framework**: Abstract `VendorValidator` base class with vendor-specific implementations. Each validator implements device data parsing, Batfish comparison, and specialized validation for interfaces, routing tables, BGP, and EVPN.

**Multi-Vendor Parsers**: Parsing systems in `src/lab_validation/parsers/` for each vendor (Arista EOS JSON, Cisco IOS-XE, NX-OS with EVPN support, JunOS, etc.) using grammar-based parsing and vendor-specific logic.

**Cost-Based Matching Algorithm**: Located in `src/lab_validation/utils/validation_utils.py`, implements pairing of device operational data against Batfish analysis results, handling next-hop analysis and route attribute comparison.

**Lab Test Framework**: Network labs in `snapshots/` directory, each containing device configurations, show command outputs, and optional validation rules. Labs follow consistent structure with `configs/`, `show/`, and `validation/` directories.

**Sickbay System**: Expected failure management using YAML files (`sickbay.yaml`) that reference GitHub issues for tracking known validation problems. Integrates with pytest to mark expected failures as `xfail`.

### Data Flow

1. **Lab Discovery**: Framework automatically discovers available labs from `snapshots/` directory
2. **Batfish Analysis**: Builds network snapshot and runs Batfish validation questions
3. **Device Data Parsing**: Vendor-specific parsers extract operational data from show commands
4. **Validation Engine**: Cost-based matching compares device state against Batfish predictions
5. **Result Processing**: Sickbay system handles expected failures and generates test report

## Code Structure

### Vendor Validators (`src/lab_validation/validators/`)

Each validator implements:

- `get_runtime_data()`: Parse device show command outputs
- `validate_interface_properties()`: Interface state validation
- `validate_main_rib_routes()`: Main routing table validation
- `validate_bgp_rib_routes()`: BGP table validation
- `validate_evpn_rib_routes()`: EVPN route validation (vendor-specific)

### Parser Architecture (`src/lab_validation/parsers/`)

- JSON parsing for modern APIs (Arista EOS)
- Text parsing with regex for traditional CLI (IOS, NX-OS)
- Structured parsing for complex outputs (JunOS, EVPN routes)

### Data Models (`src/lab_validation/models/`)

- `routes.py`: Route models with AS-path conversion and next-hop analysis
- `interface_properties.py`: Interface state models
- `runtime_data.py`: Node operational data structures

## Lab Data Structure

Each lab in `snapshots/` follows this pattern:

```
<lab_name>/
├── source/ # Hand-authored build inputs (for labs built via lab_builder)
│ ├── topology.clab.yml # Containerlab topology
│ ├── configs/ # Startup configs pushed to devices
│ ├── checks.yaml # Lab-state preconditions
│ └── README.md # Lab documentation and findings
├── configs/ # Device configuration files (collected from running lab)
├── show/ # Device show command outputs
│ └── host_nos.txt # Device-to-vendor mapping
├── batfish/ # Optional Batfish-specific inputs (e.g., layer1_topology.json)
└── validation/ # Optional validation rules
└── sickbay.yaml # Expected failure definitions with GitHub issues
```

`source/` keeps each lab self-contained: the hand-authored inputs sit
next to the collected outputs. Labs built via `lab_builder` (see
`infra/README.md`) put their topology/configs/checks here. Older labs
without `source/` were authored before the lab builder existed.

## Testing

### Unit Tests (`tests/`)

Parser and validator testing. Each vendor validator has test coverage for parsing logic and validation algorithms.

### Lab Integration Tests (`lab_tests/`)

End-to-end validation using real network lab data. Each lab runs independently with Batfish analysis, device data parsing, validation comparison, and sickbay handling.

Requires a running Batfish instance on localhost:9996.

### CI/CD Pipeline

GitHub Actions runs matrix strategy testing all labs in parallel, with Batfish JAR caching and coverage reporting.

## Coding Conventions

- **Python 3.10+** required (supports 3.10, 3.11, 3.12)
- **Pre-commit hooks** required for consistent code quality
- **Parser error handling**: parsers should crash (assert, KeyError) on
unexpected data shapes rather than silently degrading with `.get()`
defaults. Only use `.get()` when a key is genuinely optional per the
schema.
- **Unit tests for parser changes**: any parser fix or extension must
include a unit test covering the new code path.

## Lab Builder Infrastructure

See `infra/README.md` for detailed documentation on creating new labs
using containerlab on EC2.

## Project Configuration

- **Claude-facing documentation and project status should go in the
working/ folder**. This is everything related to LLM-driven projects
and tasks that does not face end users and contributors.
47 changes: 47 additions & 0 deletions infra/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -238,6 +238,32 @@ For each test failure, determine the cause:
- **Expected difference**: management interfaces, pseudo-interfaces, etc. that
Batfish intentionally doesn't model. Update the validator's exclusion logic.

## Convergence Expectations

Once a vJunos node is booted and SSH-reachable, routing protocol
convergence and config operations are fast:

- **BGP convergence** after config push: < 30 seconds for small labs
(2-4 nodes, < 200 prefixes).
- **`commit check`**: < 5 seconds for any reasonable policy.
- **`commit`**: < 15 seconds for labs of this size.
- **Health-check after deploy**: SSH comes up ~3-5 min after
`containerlab deploy`; BGP converges within seconds of SSH.

If a command hasn't completed within 2× its expected time, something
is wrong — investigate rather than retry with a longer timeout.

### Operational principles

- **Verify the tool works before looping.** Run the check command once
interactively before wrapping it in a polling loop. If it fails on
the first try, fix the command — don't blindly retry.
- **Prefer bulk operations.** Push 100 config lines via `load set
terminal` (< 5 seconds) rather than per-line `send_command_timing`
(minutes). Use binary search to isolate failures in a batch.
- **No absurd timeouts.** A 60-minute timeout on a command expected to
finish in 10 seconds masks bugs. Set timeouts at 2× expected runtime.

## Iterating on a Lab

To modify configs without full redeploy (saves 5-10 min boot time):
Expand All @@ -254,6 +280,27 @@ ssh -i $KEY ubuntu@$IP \

Then re-download and re-validate locally.

### Pushing config to running Junos devices

When pushing config to a running vJunos device programmatically:

- **Use `load set terminal`** + paste all lines + `^D`. This is fast
even for hundreds of lines. Do NOT push lines one at a time via
netmiko's `send_command_timing` (~2s/line overhead).
- **Binary search on commit failure** — if bulk `commit check` fails,
bisect the config into halves and re-check each. Isolates rejections
in O(log N) commits instead of O(N).
- **Bracket syntax `[ A B ]`** cannot be reliably delivered through
netmiko/SSH interactive sessions (`[` triggers Junos CLI multi-value
entry mode). Use the desugared equivalent: two separate `set` lines
(e.g., `community set A` + `community set B`). Junos stores them
identically — `show configuration | display set` always emits the
desugared form.
- **Large configs as startup**: if a config is too large or complex for
interactive push, use it as the containerlab `startup-config`. If
commit fails at boot, SSH never comes up — use a minimal startup
config and push terms post-boot.

**Always re-collect after changing the lab.** The snapshot's
`configs/<node>/show_running-config.txt` and every file under
`show/<node>/` must come from the actual running device after your
Expand Down
131 changes: 131 additions & 0 deletions snapshots/junos_rmw_localpref/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,131 @@
# Junos Read-Modify-Write Local-Preference Lab

When a Junos `then` block contains both a `set` and an arithmetic
(`add`/`subtract`) action on the same attribute, what does the device
actually do? The original framing of this lab — "does a later action
read the running value or the original?" — turns out to be the wrong
question. Junos resolves the conflict at config time, not at policy
evaluation time.

## Topology

```
sender (AS 65001) --- dut (AS 65002)
ge-0/0/0 <-> ge-0/0/0
10.0.12.0 10.0.12.1
```

No collector is needed: local-preference and metric are both either
non-transitive across eBGP or not the question we're asking. The
observation is on `dut` only.

## Test Routes

`sender` originates these prefixes via static and exports all to `dut`
without modification (so they all arrive with localpref=100, metric=0,
AS-path `65001`).

| Prefix | dut import term | Compound action(s) |
| ------------ | ------------------- | ----------------------------------------------------- |
| 10.20.0.0/24 | `BASELINE` | (none — baseline) |
| 10.20.1.0/24 | `TWO-ADDS` | `local-preference add 50; local-preference add 50;` |
| 10.20.2.0/24 | `SET-THEN-ADD` | `local-preference 300; local-preference add 50;` |
| 10.20.3.0/24 | `SET-THEN-SUBTRACT` | `local-preference 300; local-preference subtract 50;` |
| 10.20.4.0/24 | `METRIC-SET-ADD` | `metric 200; metric add 50;` |
| 10.20.5.0/24 | `DOUBLE-PREPEND` | `as-path-prepend "65002 65002";` |

## Running

```
lab_builder validate topology.clab.yml --checks checks.yaml
```

## Results

Deployed 2026-05-19 on vJunos-router 25.4R1.12 (containerlab on EC2).
All 8 checks passed.

### What Junos kept after commit

The first thing to notice is the diff between the configured policy
(`source/configs/dut.cfg`) and what the device's
`show configuration | display set` actually retained
(`configs/dut/show_configuration_|_display_set.txt`).
For each term that combined a `set` with an arithmetic action, the
`set` is gone:

| Term | Configured | Retained after commit |
| ------------------- | ----------------------------------------------------- | ------------------------------- |
| `BASELINE` | (accept only) | (accept only) |
| `TWO-ADDS` | `local-preference add 50; local-preference add 50;` | `local-preference add 50` |
| `SET-THEN-ADD` | `local-preference 300; local-preference add 50;` | `local-preference add 50` |
| `SET-THEN-SUBTRACT` | `local-preference 300; local-preference subtract 50;` | `local-preference subtract 50` |
| `METRIC-SET-ADD` | `metric 200; metric add 50;` | `metric add 50` |
| `DOUBLE-PREPEND` | `as-path-prepend "65002 65002";` | `as-path-prepend "65002 65002"` |

`TWO-ADDS` collapses two identical statements into one (standard
Junos config dedup). The other three rows are the substantive
observation: when an `add`/`subtract` action and a plain `set` action
target the same attribute in the same `then` block, **Junos silently
drops the `set` at commit time and keeps only the arithmetic
action**. The configured `local-preference 300` and `metric 200`
never make it into the running config; the policy evaluator never
sees them.

### Resulting BGP attributes

With those configs in place, observations from `show route <prefix>
extensive | display json` on `dut`:

| Prefix | Effective action | localpref | metric | AS path |
| ------------ | ----------------- | --------- | ------ | -------------------------------------------------- |
| 10.20.0.0/24 | (none) | 100 | — | 65001 I |
| 10.20.1.0/24 | `add 50` | **150** | — | 65001 I |
| 10.20.2.0/24 | `add 50` | **150** | — | 65001 I |
| 10.20.3.0/24 | `subtract 50` | **50** | — | 65001 I |
| 10.20.4.0/24 | `metric add 50` | 100 | **50** | 65001 I |
| 10.20.5.0/24 | `as-path-prepend` | 100 | — | 65002 65002 65001 I (Looped: 65002 — still active) |

Each result is exactly what you'd expect from applying the _retained_
action to the original attribute value (localpref=100, metric=0).
There's no compound semantics to investigate at policy-evaluation
time, because the conflicting actions never coexist by the time the
policy runs.

### Implication for the original question

The "Suspicious Omission" framing in
`working/intermediateAttributes.md` asked whether Junos chains reads
across compound `set`/`add` actions or reads the original each time.
This lab's answer: **the question is moot for Junos**. The device's
config system enforces that the two action types can't coexist on the
same attribute in the same `then` block — only the arithmetic action
survives — so there is no runtime semantics to disagree about.

(Whether the IOS-side bug Todd identified for compound writes still
exists is a separate question; that's a different vendor with a
different config syntax that does not have this commit-time
collapsing behavior.)

### Multi-ASN as-path-prepend

The `DOUBLE-PREPEND` term is unaffected by the collapse — it has only
one action, `as-path-prepend "65002 65002"`. The prepend produced
`65002 65002 65001` as expected. (Junos flagged the loop but kept the
route active since the prepend was self-applied at import.)

### Batfish triage

`pytest lab_tests/test_labs.py --labname=junos_rmw_localpref` against
local Batfish: **13 passed, 0 skipped, no sickbay**. The BGP-rib-routes
test compares Batfish-predicted local_preference / MED / as-path
against the device's actual values per prefix; all match. Batfish
parses the `display set` output, which is post-commit, so it never
sees the dropped `set` statements either — both Batfish and the
device agree because they're operating on the same retained config.

### Companion data

- Source policy: `source/configs/dut.cfg`
- Retained-config snapshot:
`configs/dut/show_configuration_|_display_set.txt`
Loading
Loading